Spark – How to fix “WARN TaskSchedulerImpl: Initial job has not accepted any resources”

Apache Spark and Firewalls

When setting up Apache Spark on your own cluster, in my case on OpenStack VMs, a common pitfall is the following error message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

This error can pop up in the log output of the interactive Python Spark shell or Jupyter (formerly IPython Notebook) after starting a PySpark session and trying to perform any kind of Spark action (like .count() or .take() on an RDD), rendering PySpark unusable.
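
To illustrate, a minimal PySpark session like the sketch below is enough to trigger the warning (the master URL is a placeholder for your own master's address): the .count() action forces Spark to schedule tasks on executors, and if no executor has registered, or none can reach the driver, the warning is logged repeatedly instead of a result being returned.

from pyspark import SparkContext

# Placeholder master URL - replace with your own standalone master's address.
sc = SparkContext(master="spark://spark-master:7077", appName="resource-test")

# Any action forces Spark to schedule tasks on executors. If no executor has
# registered (or none can connect back to the driver), the scheduler keeps
# logging the "Initial job has not accepted any resources" warning instead.
rdd = sc.parallelize(range(100))
print(rdd.count())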

As the error message suggests, I investigated resource shortages first. The Spark Master UI reported that my PySpark shell had allocated all the available CPU cores and a small portion of the available memory. I therefore lowered the default number of CPU cores given to each Spark application on the cluster by adding the following line to spark-env.sh on the master node and restarting the master:

SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"

After this change my PySpark shell was limited to 4 of the 16 CPU cores in my cluster at the time, instead of reserving all available cores (the default setting). However, even though the Spark UI now reported enough free CPU cores and memory to run some Spark actions, the error message still popped up and no Spark actions would execute.
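
As an alternative to changing the cluster-wide default, each application can also cap its own resource usage. Below is a minimal sketch using the standard spark.cores.max and spark.executor.memory settings when creating the PySpark context; the master URL and the values are just examples:

from pyspark import SparkConf, SparkContext

# Cap this one application at 4 cores across the cluster and 2 GB per executor
# (example values), leaving resources free for other applications.
conf = (SparkConf()
        .setMaster("spark://spark-master:7077")   # placeholder master URL
        .setAppName("limited-resources")
        .set("spark.cores.max", "4")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)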

While debugging this issue, I came across a post on the Spark user mailing list by Marcelo Vanzin of Cloudera, where he outlines two possible causes for this particular error:

"...
- You're requesting more resources than the master has available, so
your executors are not starting. Given your explanation this doesn't
seem to be the case.

- The executors are starting, but are having problems connecting 
back to the driver. In this case, you should be able to see 
errors in each executor's log file.
..."

The second of these was the cause in my case: the host firewall on the machine where I ran my PySpark shell rejected the connection attempts coming back from the worker nodes. After allowing all traffic between all nodes involved, the problem was resolved! Since the driver host was another VM in the same OpenStack project, allowing all traffic between the VMs in that project was acceptable from a security standpoint.

The error message is not particularly helpful when the real problem is that executors are unable to connect back to the driver. If you encounter the same error message, remember to check the logs of all firewalls involved (host and/or network firewalls).

On a side note, this requirement that executors connect back to the driver makes it harder to set up a Spark cluster in a secure way. Unless the driver is in the same security zone as the Spark cluster, it may not be possible to allow the Spark workers to establish connections to the driver host on arbitrary ports. Hopefully the Apache Spark project will address this limitation in a future release, by making sure all necessary connections are established by the driver (client host) only.
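
One partial workaround, at least for the standalone setup described here, is to pin the driver-side ports that executors connect back to, so the firewall only needs to allow a couple of known ports instead of the whole ephemeral range. The sketch below uses the standard spark.driver.port and spark.blockManager.port properties; the master URL and the port numbers are arbitrary examples:

from pyspark import SparkConf, SparkContext

# Fix the ports the executors use to reach the driver, so firewall rules can
# allow just these two ports (example numbers) instead of arbitrary high ports.
conf = (SparkConf()
        .setMaster("spark://spark-master:7077")   # placeholder master URL
        .setAppName("pinned-driver-ports")
        .set("spark.driver.port", "40000")
        .set("spark.blockManager.port", "40001"))
sc = SparkContext(conf=conf)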

~ Arne ~


7 thoughts on “Spark – How to fix “WARN TaskSchedulerImpl: Initial job has not accepted any resources””

    1. Spark logs on both the master node and the worker nodes (executors) are written to the “logs” subdirectory of the Spark installation on each node. Since I installed Spark into /opt/spark, the executor logs are in the /opt/spark/logs/ directory on each worker node.


  1. Getting this on Master
    15/10/27 11:16:59 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@172.29.7.24:49871] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
    15/10/27 11:16:59 INFO master.Master: akka.tcp://sparkDriver@172.29.7.24:49871 got disassociated, removing it.
    15/10/27 11:17:00 WARN master.Master: Got status update for unknown executor app-20151027111555-0000/0
    15/10/27 11:17:00 WARN master.Master: Got status update for unknown executor app-20151027111555-0000/1

    while on the worker getting this:
    15/10/27 12:14:11 INFO worker.Worker: Asked to kill executor app-20151027111555-0000/0
    15/10/27 12:14:11 INFO worker.ExecutorRunner: Runner thread for executor app-20151027111555-0000/0 interrupted
    15/10/27 12:14:11 INFO worker.ExecutorRunner: Killing process!
    15/10/27 12:14:11 ERROR logging.FileAppender: Error writing stream to file /home/hadoop/spark-1.4.1-bin-hadoop2.6/work/app-20151027111555-0000/0/stderr


    1. That might be a variant of the error I encountered. It looks like the master node is unable to connect back to the driver node, where Jupyter normally runs. Have you checked that the master node can connect to the driver node at 172.29.7.24 without being blocked by network or host firewalls? Sometimes using the wrong IP addresses can be the cause of such errors, like using floating IPs instead of private IPs for the Spark master URI when setting up Spark on OpenStack.

      Regarding the worker errors, it seems that the worker is simply told to terminate its executor processes since the master cannot connect to the driver node. In addition, you might want to look into the permissions issue that the last log line indicates (it is unable to write the executor log to disk).


  2. PD says:

    Thank you, Thank You, Thank you So MUCH! I kept scratching my head wondering why it says ‘Insufficient Resources’ when all it needs to do is read 2 rows from the database! 🙂


  3. Jing Kang says:

    This was caused by the firewall policies: stop the firewall and then restart all Spark services. If the master and slave were already started and you only stop the firewall afterwards, it still doesn't work, which is what troubled me.

    ./sbin/stop-all.sh
    iptables -F

