Abbiamo avuto un piccolo cluster di scintille in esecuzione per un mese ora che sta eseguendo con successo i lavori o mi consente di avviare una scintilla shell nel cluster.Spark non può più eseguire lavori. Gli esecutori non riescono a creare la directory
Non importa se invio un lavoro al cluster o ci si connette utilizzando la shell, l'errore è sempre lo stesso.
[email protected]~]$ $SPARK_HOME/bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/11/10 20:43:01 INFO spark.SecurityManager: Changing view acls to: root,
14/11/10 20:43:01 INFO spark.SecurityManager: Changing modify acls to: root,
14/11/10 20:43:01 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root,); users with modify permissions: Set(root,)
14/11/10 20:43:01 INFO spark.HttpServer: Starting HTTP Server
14/11/10 20:43:01 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/10 20:43:01 INFO server.AbstractConnector: Started [email protected]:60223
14/11/10 20:43:01 INFO util.Utils: Successfully started service 'HTTP class server' on port 60223.
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/11/10 20:43:05 INFO spark.SecurityManager: Changing view acls to: root,
14/11/10 20:43:05 INFO spark.SecurityManager: Changing modify acls to: root,
14/11/10 20:43:05 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root,); users with modify permissions: Set(root,)
14/11/10 20:43:05 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/11/10 20:43:05 INFO Remoting: Starting remoting
14/11/10 20:43:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:41369]
14/11/10 20:43:05 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:41369]
14/11/10 20:43:05 INFO util.Utils: Successfully started service 'sparkDriver' on port 41369.
14/11/10 20:43:05 INFO spark.SparkEnv: Registering MapOutputTracker
14/11/10 20:43:05 INFO spark.SparkEnv: Registering BlockManagerMaster
14/11/10 20:43:05 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-local-20141110204305-a4f0
14/11/10 20:43:05 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/spark-local-20141110204305-991c
14/11/10 20:43:05 INFO util.Utils: Successfully started service 'Connection manager for block manager' on port 56708.
14/11/10 20:43:05 INFO network.ConnectionManager: Bound socket to port 56708 with id = ConnectionManagerId(ip-10-237-182-163.ec2.internal,56708)
14/11/10 20:43:05 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
14/11/10 20:43:05 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/11/10 20:43:05 INFO storage.BlockManagerMasterActor: Registering block manager ip-10-237-182-163.ec2.internal:56708 with 265.4 MB RAM
14/11/10 20:43:05 INFO storage.BlockManagerMaster: Registered BlockManager
14/11/10 20:43:05 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fa8cd9e8-5a4a-40a4-bc76-c2215886873e
14/11/10 20:43:05 INFO spark.HttpServer: Starting HTTP Server
14/11/10 20:43:05 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/10 20:43:05 INFO server.AbstractConnector: Started [email protected]:36394
14/11/10 20:43:05 INFO util.Utils: Successfully started service 'HTTP file server' on port 36394.
14/11/10 20:43:06 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/10 20:43:06 INFO server.AbstractConnector: Started [email protected]:4040
14/11/10 20:43:06 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
14/11/10 20:43:06 INFO ui.SparkUI: Started SparkUI at http://ec2-54-91-220-90.compute-1.amazonaws.com:4040
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Connecting to master spark://ec2-54-91-220-90.compute-1.amazonaws.com:7077...
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
14/11/10 20:43:06 INFO repl.SparkILoop: Created spark context..
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141110204306-0389
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/0 on worker-20140929210658-ip-10-225-160-49.ec2.internal-60693 (ip-10-225-160-49.ec2.internal:60693) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/0 on hostPort ip-10-225-160-49.ec2.internal:60693 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/1 on worker-20140929210658-ip-10-147-28-32.ec2.internal-60731 (ip-10-147-28-32.ec2.internal:60731) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/1 on hostPort ip-10-147-28-32.ec2.internal:60731 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/2 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/2 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/2 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/1 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/2 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/2)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/2 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/2
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/3 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/3 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/0 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/3 is now RUNNING
Spark context available as sc.
scala> 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/3 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/3)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/3 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/3
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/4 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/4 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/4 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/4 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/4)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/4 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/4
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/5 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/5 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/5 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/5 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/5)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/5 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/5
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/6 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/6 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/6 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/6 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/6)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/6 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/6
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/7 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/7 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/7 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/7 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/7)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/7 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/7
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/8 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/8 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/8 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/8 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/8)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/8 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/8
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/9 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/9 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/9 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/9 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/9)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/9 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/9
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/10 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/10 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/10 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/10 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/10)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/10 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/10
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/11 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/11 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/11 is now RUNNING
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/11 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/11)
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/11 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/11
14/11/10 20:43:06 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/11/10 20:43:06 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED`
Se tuttavia lo eseguo solo in modalità locale, si connette e funziona correttamente.
Sono incline a questo errore di autorizzazione di qualche tipo ma non abbiamo toccato nulla per tutto il tempo in cui ha funzionato.
EDIT
Dopo un po 'più di scavo ho scoperto che i nodi dei lavoratori erano fuori di spazio su disco. Nella cartella di lavoro risulta che sta memorizzando il jar copiato oltre a un file stdout e stderr per quel lavoro. Esiste comunque la possibilità di cancellarlo quando viene eseguito con la configurazione di registrazione per i lavori in S3.