2014-11-10 7 views
7

Abbiamo avuto un piccolo cluster di scintille in esecuzione per un mese ora che sta eseguendo con successo i lavori o mi consente di avviare una scintilla shell nel cluster.Spark non può più eseguire lavori. Gli esecutori non riescono a creare la directory

Non importa se invio un lavoro al cluster o ci si connette utilizzando la shell, l'errore è sempre lo stesso.

[email protected]~]$ $SPARK_HOME/bin/spark-shell 
Spark assembly has been built with Hive, including Datanucleus jars on classpath 
14/11/10 20:43:01 INFO spark.SecurityManager: Changing view acls to: root, 
14/11/10 20:43:01 INFO spark.SecurityManager: Changing modify acls to: root, 
14/11/10 20:43:01 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root,); users with modify permissions: Set(root,) 
14/11/10 20:43:01 INFO spark.HttpServer: Starting HTTP Server 
14/11/10 20:43:01 INFO server.Server: jetty-8.y.z-SNAPSHOT 
14/11/10 20:43:01 INFO server.AbstractConnector: Started [email protected]:60223 
14/11/10 20:43:01 INFO util.Utils: Successfully started service 'HTTP class server' on port 60223. 
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65) 
Type in expressions to have them evaluated. 
Type :help for more information. 
14/11/10 20:43:05 INFO spark.SecurityManager: Changing view acls to: root, 
14/11/10 20:43:05 INFO spark.SecurityManager: Changing modify acls to: root, 
14/11/10 20:43:05 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root,); users with modify permissions: Set(root,) 
14/11/10 20:43:05 INFO slf4j.Slf4jLogger: Slf4jLogger started 
14/11/10 20:43:05 INFO Remoting: Starting remoting 
14/11/10 20:43:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:41369] 
14/11/10 20:43:05 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:41369] 
14/11/10 20:43:05 INFO util.Utils: Successfully started service 'sparkDriver' on port 41369. 
14/11/10 20:43:05 INFO spark.SparkEnv: Registering MapOutputTracker 
14/11/10 20:43:05 INFO spark.SparkEnv: Registering BlockManagerMaster 
14/11/10 20:43:05 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-local-20141110204305-a4f0 
14/11/10 20:43:05 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/spark-local-20141110204305-991c 
14/11/10 20:43:05 INFO util.Utils: Successfully started service 'Connection manager for block manager' on port 56708. 
14/11/10 20:43:05 INFO network.ConnectionManager: Bound socket to port 56708 with id = ConnectionManagerId(ip-10-237-182-163.ec2.internal,56708) 
14/11/10 20:43:05 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB 
14/11/10 20:43:05 INFO storage.BlockManagerMaster: Trying to register BlockManager 
14/11/10 20:43:05 INFO storage.BlockManagerMasterActor: Registering block manager ip-10-237-182-163.ec2.internal:56708 with 265.4 MB RAM 
14/11/10 20:43:05 INFO storage.BlockManagerMaster: Registered BlockManager 
14/11/10 20:43:05 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fa8cd9e8-5a4a-40a4-bc76-c2215886873e 
14/11/10 20:43:05 INFO spark.HttpServer: Starting HTTP Server 
14/11/10 20:43:05 INFO server.Server: jetty-8.y.z-SNAPSHOT 
14/11/10 20:43:05 INFO server.AbstractConnector: Started [email protected]:36394 
14/11/10 20:43:05 INFO util.Utils: Successfully started service 'HTTP file server' on port 36394. 
14/11/10 20:43:06 INFO server.Server: jetty-8.y.z-SNAPSHOT 
14/11/10 20:43:06 INFO server.AbstractConnector: Started [email protected]:4040 
14/11/10 20:43:06 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 
14/11/10 20:43:06 INFO ui.SparkUI: Started SparkUI at http://ec2-54-91-220-90.compute-1.amazonaws.com:4040 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Connecting to master spark://ec2-54-91-220-90.compute-1.amazonaws.com:7077... 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 
14/11/10 20:43:06 INFO repl.SparkILoop: Created spark context.. 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141110204306-0389 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/0 on worker-20140929210658-ip-10-225-160-49.ec2.internal-60693 (ip-10-225-160-49.ec2.internal:60693) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/0 on hostPort ip-10-225-160-49.ec2.internal:60693 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/1 on worker-20140929210658-ip-10-147-28-32.ec2.internal-60731 (ip-10-147-28-32.ec2.internal:60731) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/1 on hostPort ip-10-147-28-32.ec2.internal:60731 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/2 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/2 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/2 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/1 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/2 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/2) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/2 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/2 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/3 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/3 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/0 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/3 is now RUNNING 
Spark context available as sc. 
scala> 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/3 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/3) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/3 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/3 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/4 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/4 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/4 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/4 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/4) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/4 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/4 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/5 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/5 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/5 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/5 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/5) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/5 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/5 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/6 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/6 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/6 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/6 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/6) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/6 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/6 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/7 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/7 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/7 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/7 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/7) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/7 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/7 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/8 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/8 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/8 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/8 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/8) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/8 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/8 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/9 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/9 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/9 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/9 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/9) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/9 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/9 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/10 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/10 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/10 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/10 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/10) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/10 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/10 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/11 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/11 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/11 is now RUNNING 
14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/11 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/11) 
14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/11 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/11 
14/11/10 20:43:06 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED 
14/11/10 20:43:06 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED` 

Se tuttavia lo eseguo solo in modalità locale, si connette e funziona correttamente.

Sono incline a questo errore di autorizzazione di qualche tipo ma non abbiamo toccato nulla per tutto il tempo in cui ha funzionato.

EDIT

Dopo un po 'più di scavo ho scoperto che i nodi dei lavoratori erano fuori di spazio su disco. Nella cartella di lavoro risulta che sta memorizzando il jar copiato oltre a un file stdout e stderr per quel lavoro. Esiste comunque la possibilità di cancellarlo quando viene eseguito con la configurazione di registrazione per i lavori in S3.

risposta