ExecutorLostFailure when trying to run Apache Spark on Apache Mesos

August 24, 2015 - Spark

I was running into errors when trying to run Spark on Mesos.

Here’s the setup I had:

  • 4-nodes, Linux
  • Clustered Mesos 0.21.0
  • Clustered Hadoop 2.4
  • (to-be) clustered Spark 1.4.1

I’d followed the directions and deployed Spark onto Mesos by placing a Spark binary distribution on HDFS, and I was able to launch a spark-shell pointed at my Mesos master. However, I kept hitting errors like the following when running in client mode:

15/08/24 14:48:25 WARN TaskSetManager: Lost task 2.2 in stage 12.0 (TID 339, node4): UnknownReason
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 12.0 failed 4 times, most recent failure: Lost task 1.3 in stage 12.0 (TID 340, node4): ExecutorLostFailure (executor 20150824-144052-2212425248-5352-16532-S1 lost)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
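
For context, the launch itself looked roughly like the following. This is only a sketch: the master host is a placeholder, and the libmesos path is an assumption about a typical install location.

# Point Spark at the Mesos native library (typical location; adjust for your install)
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

# Launch spark-shell against the Mesos master, telling executors where to fetch Spark from
./bin/spark-shell \
  --master mesos://<mesos-master-host>:5050 \
  --conf spark.executor.uri=hdfs:///tmp/spark-1.4.1-bin-hadoop2.4.tgz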

Digging further into the Mesos logs on each slave node (i.e. /tmp/mesos/slaves/…/runs/stderr) revealed the following error:

I0824 14:47:34.348436 15307 fetcher.cpp:76] Fetching URI 'hdfs:///tmp/spark-1.4.1-bin-hadoop2.4.tgz'
I0824 14:47:34.348600 15307 fetcher.cpp:105] Downloading resource from 'hdfs:///tmp/spark-1.4.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150824-144052-2255525248-5050-14232-S2/frameworks/20150824-144052-2255525248-5050-14232-0001/executors/20150824-144052-2255525248-5050-14232-S2/runs/910b72dd-b35b-4d33-8601-13cdefcc37b3/spark-1.4.1-bin-hadoop2.4.tgz'
E0824 14:47:34.351416 15307 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs:///tmp/spark-1.4.1-bin-hadoop2.4.tgz' '/tmp/mesos/slaves/20150824-144052-2255525248-5050-14232-S2/frameworks/20150824-144052-2255525248-5050-14232-0001/executors/20150824-144052-2255525248-5050-14232-S2/runs/910b72dd-b35b-4d33-8601-13cdefcc37b3/spark-1.4.1-bin-hadoop2.4.tgz'
sh: hadoop: command not found
Failed to fetch: hdfs:///tmp/spark-1.4.1-bin-hadoop2.4.tgz
Failed to synchronize with slave (it's probably exited)
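
To find the relevant stderr quickly on a slave, something like this works (a sketch, assuming the default Mesos work directory of /tmp/mesos used here):

# List the newest executor sandbox stderr files on this slave
find /tmp/mesos/slaves -name stderr -printf '%T@ %p\n' | sort -rn | head -n 5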

That’s when it became clear that the hadoop command was apparently not available. I confirmed this by trying to run hadoop from the command line on each node, and found that it wasn’t on the nodes’ PATH.
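
A quick way to confirm this on each node (just a sketch; any equivalent check works):

# Is the hadoop launcher resolvable via PATH?
command -v hadoop || echo "hadoop not found on PATH"

# If it is, this should print the Hadoop version
hadoop version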

Solution

Two of my nodes did not have HADOOP_HOME defined, nor its bin directory added to the PATH environment variable. This needed to be fixed before Spark could successfully run on Mesos with the Spark binary deployed via HDFS.

Steps:

1. Open up ~/.bashrc on all of your executor nodes

2. Add the following (if not already present):

export HADOOP_HOME=/path/to/hadoop-2.4.0
export PATH=${PATH}:${HADOOP_HOME}/bin

3. Make sure you source ~/.bashrc (or restart your shell) on each executor node, and restart the Mesos slave process on every node in your Mesos cluster (see the sketch after these steps)

4. Retry submitting a job or opening up spark-shell
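
For steps 3 and 4, something like the following on each node is enough. The service command is an assumption that the Mesos slaves run as system services; adjust for your init system.

# Pick up the new HADOOP_HOME/PATH in the current shell
source ~/.bashrc

# Restart the Mesos slave so the executors it launches see the updated environment
sudo service mesos-slave restart

After that, relaunching spark-shell against the Mesos master (as shown earlier) should let the executors fetch the Spark tarball from HDFS.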
