Apache Spark task failures (tags: spark, apache, task)

2023-09-11 09:05:41 Author: 浅时光﹏

Why do Apache Spark tasks fail? I thought that, because of the DAG, tasks were recomputable even without caching. I am in fact caching, and I either get a FileNotFoundException or the following:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9238.0 (TID 17337, ip-XXX-XXX-XXX.compute.internal): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_299_piece0 of broadcast_299
    org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:930)
    org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
    sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

It's very bizarre, because when I run the same program on smaller instances I don't get the FileNotFoundException ("no space left on device"); instead I get the error above. When I, say, double the instance size, it tells me there is no space left on the device after about an hour of work. Same program, more memory, and it runs out of space! What gives?

Recommended Answer

As mentioned in the SPARK-751 issue:

Right now on each machine, we create M * R temporary files for shuffle, where M = number of map tasks, R = number of reduce tasks. This can be pretty high when there are lots of mappers and reducers (e.g. 1k map * 1k reduce = 1 million files for a single shuffle). The high number can cripple the file system and significantly slow the system down. We should cut this number down to O(R) instead of O(M*R).

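To make the quote concrete, here is a minimal Scala sketch of where M and R come from in an ordinary Spark job. The application name, HDFS paths, and partition counts are hypothetical, and the job is assumed to be launched with spark-submit so the master URL is supplied externally:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

    // Master URL supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-file-sketch"))

    // M = number of map-side partitions; ask for about 1000 here (hypothetical).
    val mapSide = sc.textFile("hdfs:///data/input", 1000)

    // R = number of reduce-side partitions, also 1000 here (hypothetical).
    val counts = mapSide.map(line => (line, 1)).reduceByKey(_ + _, 1000)

    // With the old hash-based shuffle and no consolidation, each map task
    // writes one file per reduce task, i.e. roughly M * R = 1,000,000
    // shuffle files for this single stage boundary.
    counts.count()
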

So if you indeed see that your disks are running out of inodes, you can try the following to fix the problem:

- Decrease the number of partitions (see coalesce with shuffle = false); a sketch follows this list.
- You can also try to drop the number of files to O(R) by "consolidating files", since file systems behave differently.
- Sometimes you may simply find that you need your system administrator to increase the number of inodes the FS supports.
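
Here is a minimal sketch of the first two suggestions, assuming a Spark 1.x hash-based shuffle (the spark.shuffle.consolidateFiles setting was removed in Spark 2.0); the application name, paths, and partition counts are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

    val conf = new SparkConf()
      .setAppName("shuffle-pressure-fix")
      // Hash shuffle only: group map outputs into fewer, larger files per core.
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)

    val reduced = sc.textFile("hdfs:///data/input")
      .map(line => (line, 1))
      .reduceByKey(_ + _, 2000)              // R = 2000 reduce partitions (hypothetical)

    // Shrink the partition count without triggering another shuffle, so the
    // stages downstream of this point create far fewer files.
    val narrowed = reduced.coalesce(200, shuffle = false)
    narrowed.saveAsTextFile("hdfs:///data/output")

With shuffle = false, coalesce narrows the existing partitioning instead of introducing another M * R round of shuffle files; whether file consolidation helps depends on your Spark version and file system, so check the documentation for the release you are running.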