Spark: saveAsTextFile to S3 not working, just hangs (spark, job, saveAsTextFile)

2023-09-11 23:45:44, by 颓废

I am loading a csv text file from s3 into spark, filtering and mapping the records and writing the result to s3.
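
For concreteness, here is a minimal PySpark-style sketch of such a job. The column positions, filter predicate, and helper names are hypothetical stand-ins, not the poster's actual code; the core logic is kept in plain functions so only `run_job` depends on Spark.

```python
def parse_line(line):
    """Naively split a CSV line into fields (no quoted-field handling)."""
    return line.split(",")

def keep_record(fields):
    """Hypothetical filter: keep rows whose third column is non-empty."""
    return len(fields) > 2 and fields[2] != ""

def to_output(fields):
    """Hypothetical projection: join the first two columns with a tab."""
    return "\t".join(fields[:2])

def run_job(sc, input_path, output_path):
    """Wire the steps onto a SparkContext `sc`, with paths such as
    's3n://my-bucket/input.csv' (hypothetical); not invoked here."""
    (sc.textFile(input_path)
       .map(parse_line)
       .filter(keep_record)
       .map(to_output)
       .saveAsTextFile(output_path))
```

With a 3.5M-row input, it is the final `saveAsTextFile` call above that hangs.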

I have tried several input sizes: 100k rows, 1M rows, and 3.5M rows. The first two finish successfully, while the last (3.5M rows) hangs in a strange state: the job stages monitoring web app (the one on port 4040) stops responding, and the command-line console gets stuck and does not even respond to Ctrl-C. The master's web monitoring app still responds and shows the state as FINISHED.

In S3, I see an empty directory containing a single zero-sized entry, _temporary_$folder$. The S3 URL is given using the s3n:// protocol.

I did not see any errors in the logs in the web console. I also tried several cluster sizes (1 master + 1 worker, 1 master + 5 workers) and ended up in the same state.

Has anyone encountered such an issue? Any idea what's going on?

Recommended Answer

It's possible you are running up against the 5GB object limit of the s3n filesystem. You may be able to get around this by using the s3 filesystem (instead of s3n), or by partitioning your output.
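
To illustrate the partitioning workaround: if each output object must stay under the 5GB s3n limit, the job can be repartitioned so that every part file lands comfortably below it. A rough sketch, where the size estimate and the 10% safety margin are assumptions rather than anything stated in the answer:

```python
import math

S3N_OBJECT_LIMIT = 5 * 1024**3  # the 5GB per-object limit of the s3n filesystem

def min_partitions(estimated_output_bytes, limit=S3N_OBJECT_LIMIT, safety=0.9):
    """Smallest partition count that keeps each part-XXXXX file under the
    limit, with a safety margin since partitions are rarely perfectly even."""
    return max(1, math.ceil(estimated_output_bytes / (limit * safety)))
```

One would then call something like `rdd.repartition(min_partitions(est_bytes)).saveAsTextFile(path)` before writing, so each partition becomes its own sub-5GB object.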

Here is what the AmazonS3 page of the Hadoop wiki says:

S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. [...] The disadvantage is the 5GB limit on file size imposed by S3.

...

S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem [...] The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

...
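
The other workaround from the quote, switching from the s3n scheme to the s3 block filesystem, amounts to rewriting the URI scheme. Bear in mind that the s3 scheme requires a bucket dedicated to the block filesystem, so in practice the bucket name would normally change too; this trivial sketch only swaps the scheme:

```python
def to_block_fs_uri(uri):
    """Rewrite an s3n:// URI to the s3:// block-filesystem scheme.
    The target bucket must be dedicated to the block filesystem, so in
    practice the bucket name usually changes as well."""
    prefix = "s3n://"
    if uri.startswith(prefix):
        return "s3://" + uri[len(prefix):]
    return uri
```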

AmazonS3 (last edited 2014-07-01 13:27:49 by SteveLoughran)