How to run a TensorFlow distributed MNIST example
I am new to distributed TensorFlow. I found this distributed MNIST test here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py
But I don't know how to get it running. I used the following script:
python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=0 --worker_grpc_url="grpc://tf-worker0:2222"\
& python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=1 --worker_grpc_url="grpc://tf-worker1:2222"\
& python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=2 --worker_grpc_url="grpc://tf-worker2:2222"
I noticed that these parameters were missing, so I passed them to the program. Here is what happened:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Worker GRPC URL: grpc://tf-worker0:2222
Worker index = 0
Number of workers = 3
Worker GRPC URL: grpc://tf-worker2:2222
Worker index = 2
Number of workers = 3
Worker GRPC URL: grpc://tf-worker1:2222
Worker index = 1
Number of workers = 3
Worker 0: Initializing session...
Worker 2: Waiting for session to be initialized...
Worker 1: Waiting for session to be initialized...
E0608 20:37:13.514249023 7501 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.514287961 7501 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:13.548052986 7502 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.548091527 7502 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:13.555449386 7503 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.555473898 7503 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
^CE0608 20:37:28.517451603 7504 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.517491102 7504 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:28.551002331 7505 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.551029795 7505 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:28.556681378 7506 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.556709728 7506 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
Does anyone know how to run it correctly? Many thanks!
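The repeated "getaddrinfo: Name or service not known" lines in the log suggest that the hostnames in the --worker_grpc_url values (tf-worker0, tf-worker1, tf-worker2) do not resolve on this machine. Below is a minimal sketch for checking that, using only the Python standard library. The helper names split_grpc_url and resolves are hypothetical and not part of mnist_replica.py; the URLs are the ones from the command above.

```python
import socket
from urllib.parse import urlparse


def split_grpc_url(grpc_url):
    """Split a grpc://host:port URL into a (host, port) tuple."""
    parsed = urlparse(grpc_url)
    return parsed.hostname, parsed.port


def resolves(grpc_url):
    """Return True if the host in the URL can be resolved, False otherwise."""
    host, port = split_grpc_url(grpc_url)
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same class of failure as the "Name or service not known" log lines.
        return False


if __name__ == "__main__":
    # Worker URLs taken from the command line used in the question.
    for index in range(3):
        url = "grpc://tf-worker%d:2222" % index
        status = "resolves" if resolves(url) else "does NOT resolve"
        print(url, status)
```

If a URL is reported as not resolving, the corresponding --worker_grpc_url would need to point at a host that is reachable from this machine before the workers can connect.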
Thank you very much for the explanation! I'll give it a try. – xyd
Thanks a lot! Just contributing my 2 cents to your fantastic post: tf.merge_all_summaries() seems to be deprecated and raises an error in the latest version. Use either tf.merge_all or tf.contrib.deprecated.merge_all_summaries. –
@sunilmanikani Thanks for pointing that out... I've updated the code to use 'tf.summary.merge_all()'. – mrry