Multi-worker Training Using Distribution Strategies

This directory provides an example of running multi-worker training with Distribution Strategies.

Please first read the documentation of Distribution Strategy for multi-worker training. We also assume that readers of this page have experience with Google Cloud and its Kubernetes Engine.

This directory contains the following files:

- template.yaml.jinja: a jinja template to be rendered into a Kubernetes yaml file
- Dockerfile.keras_model_to_estimator: a Dockerfile to build the model image
- Dockerfile.tf_std_server: a Dockerfile to build the standard TensorFlow server image
- keras_model_to_estimator.py: model code to run multi-worker training
- tf_std_server.py: a standard TensorFlow binary
- keras_model_to_estimator_client.py: model code to run in standalone client mode

Prerequisite

You first need a Google Cloud project. Set up a service account for it and download the account's JSON key file. Make sure this service account has access to Google Cloud Storage.
Install the gcloud command-line tools on your workstation, then log in and set your project and zone.
Install kubectl:
```
gcloud components install kubectl
```
Start a Kubernetes cluster either with the gcloud command or with the GKE web UI. Optionally, you can add GPUs to each node.
Set context for kubectl so that kubectl knows which cluster to use:
```
kubectl config use-context <your_cluster>
```

Install CUDA drivers in your cluster:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```

Create a Kubernetes secret for the JSON file of your service account:

```
kubectl create secret generic credential --from-file=key.json=<path_to_json_file>
```
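To make the result of the command above more concrete, here is a minimal sketch of the equivalent Secret manifest built in Python. It assumes the secret name `credential` and the data key `key.json` taken from the command; the helper name `secret_manifest` and the placeholder key content are hypothetical and not part of this directory.

```python
import base64
import json


def secret_manifest(name, key, file_bytes):
    """Build a dict shaped like a Kubernetes Opaque Secret.

    Values under "data" must be base64-encoded, which is what
    `kubectl create secret generic --from-file` does for you.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: base64.b64encode(file_bytes).decode("ascii")},
    }


if __name__ == "__main__":
    # Placeholder content only; a real key file is the JSON downloaded
    # for your service account.
    fake_key = json.dumps({"type": "service_account"}).encode("utf-8")
    manifest = secret_manifest("credential", "key.json", fake_key)
    print(json.dumps(manifest, indent=2))
```

The secret name and the data key are what the template header later refers to, so they need to match whatever you pass to `kubectl create secret`.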

How to run the example

Let's first build the Docker image:

```
docker build --no-cache -t keras_model_to_estimator:v1 -f Dockerfile.keras_model_to_estimator .
```

and push the image to Google Cloud Container Registry:

```
docker tag keras_model_to_estimator:v1 gcr.io/<your project>/keras_model_to_estimator:v1
docker push gcr.io/<your project>/keras_model_to_estimator:v1
```

Modify the header of the jinja template. You will probably want to change name, image, worker_replicas, num_gpus_per_worker, has_eval, has_tensorboard, script and cmdline_args.
- name: name your cluster, e.g. "my-dist-strat-example".
- image: the name of your docker image.
- worker_replicas: number of workers.
- num_gpus_per_worker: number of GPUs per worker; the same number is used for the "evaluator" job if it exists.
- has_eval: whether to include an "evaluator" job. If this is False, no evaluation will be done even though tf.estimator.train_and_evaluate is used.
- has_tensorboard: whether to run tensorboard in the cluster.
- train_dir: the model directory.
- script: the script in the docker image to run.
- cmdline_args: the command line arguments passed to the script delimited by spaces.
- credential_secret_json: the filename of the json file for your service account.
- credential_secret_key: the name of the Kubernetes secret storing the credential of your service account.
- port: the port for all tasks including tensorboard.
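Since a typo in the header only shows up once the pods are created, it can help to sanity-check the values first. The following is a minimal sketch, not something this directory provides: `validate_header` is a hypothetical helper, the checks are assumptions about what sensible values look like, and the example values (port number, script path, bucket name) are placeholders.

```python
def validate_header(values):
    """Return a list of problems found in a dict of template header values."""
    problems = []
    for field in ("name", "image", "script"):
        if not values.get(field):
            problems.append("%s must be a non-empty string" % field)
    replicas = values.get("worker_replicas")
    if not isinstance(replicas, int) or replicas < 1:
        problems.append("worker_replicas must be an integer >= 1")
    gpus = values.get("num_gpus_per_worker", 0)
    if not isinstance(gpus, int) or gpus < 0:
        problems.append("num_gpus_per_worker must be an integer >= 0")
    port = values.get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("port must be an integer between 1 and 65535")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    header = {
        "name": "my-dist-strat-example",
        "image": "gcr.io/<your project>/keras_model_to_estimator:v1",
        "worker_replicas": 3,
        "num_gpus_per_worker": 2,
        "has_eval": False,
        "has_tensorboard": False,
        "train_dir": "gs://<your_gcs_bucket>/model",
        "script": "/keras_model_to_estimator.py",
        "cmdline_args": "",
        "port": 5000,
    }
    for problem in validate_header(header):
        print(problem)
```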
Start the training cluster:
```
python ../render_template.py template.yaml.jinja | kubectl create -f -
```
Your cluster should now start training. You can inspect the workers' logs or use tensorboard to watch the model train.
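In multi-worker training, each task usually learns the cluster layout from the `TF_CONFIG` environment variable, a JSON object with a `cluster` entry and a `task` entry. The sketch below shows how `worker_replicas` and `port` could translate into that value for one worker. How the rendered yaml actually sets `TF_CONFIG` is not shown in this README, so the helper `build_tf_config` and the host naming scheme `<name>-worker-<i>` are assumptions made for illustration.

```python
import json


def build_tf_config(name, worker_replicas, port, task_type, task_index):
    """Return a TF_CONFIG JSON string for one task.

    Host names of the form "<name>-worker-<i>" are an assumption of this
    sketch; the rendered Kubernetes yaml decides the real addresses.
    """
    workers = [
        "%s-worker-%d:%d" % (name, i, port) for i in range(worker_replicas)
    ]
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": task_type, "index": task_index},
    })


if __name__ == "__main__":
    # Example: the second of two workers, using a placeholder port.
    print(build_tf_config("my-dist-strat-example", 2, 5000, "worker", 1))
```

Every worker sees the same `cluster` entry and differs only in its `task` entry, which is why a single template with a replica count is enough to describe the whole job.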

How to run with standalone client mode

Please refer to the documentation of Distribution Strategy for the details of multi-worker training with standalone client mode. It basically consists of a cluster of standard TensorFlow servers and a model running on your workstation which connects to the cluster to request and coordinate training. All the training will be controlled by the model running on your workstation.
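To illustrate what each standard TensorFlow server needs to know before it can join the cluster, here is a minimal sketch that extracts the cluster layout, job name and task index from `TF_CONFIG`. It is not necessarily how `tf_std_server.py` is written; the helper name `server_args_from_tf_config` is hypothetical, and the actual server start is left out so that the sketch runs without TensorFlow installed.

```python
import json
import os


def server_args_from_tf_config(environ=None):
    """Extract (cluster, job_name, task_index) from TF_CONFIG.

    These are the values one would hand to a TensorFlow server
    constructor such as tf.distribute.Server; that call is omitted here.
    """
    environ = os.environ if environ is None else environ
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    job_name = task.get("type")
    task_index = int(task.get("index", 0))
    if job_name not in cluster:
        raise ValueError("task type %r is not a job in the cluster" % job_name)
    return cluster, job_name, task_index


if __name__ == "__main__":
    # Placeholder addresses for illustration only.
    example_env = {
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": ["host-0:5000", "host-1:5000"]},
            "task": {"type": "worker", "index": 0},
        })
    }
    print(server_args_from_tf_config(example_env))
```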

First install the Kubernetes Python client:
```
pip install kubernetes
```
Build a Docker image for the standard TensorFlow server:
```
docker build --no-cache -t tf_std_server:v1 -f Dockerfile.tf_std_server .
```
and push it to the container registry as well.
Modify the header of the jinja template: set image to the image you just pushed, script to /tf_std_server.py, and cmdline_args to empty, so that each Kubernetes pod runs this standard TensorFlow server.

Start the cluster of standard TensorFlow servers:

```
../render_template.py template.yaml.jinja | kubectl create -f -
```

Run the model binary on your workstation:
```
keras_model_to_estimator_client.py gs://<your_gcs_bucket>
```
Your model should start training, with its logs printed to your terminal.

If you see an authentication error, your workstation may not have access to the GCS bucket. In that case, point the credential at the JSON file of your service account before you run the model binary:
```
export GOOGLE_APPLICATION_CREDENTIALS="<path_to_json_file>"
```
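Before starting a long run, you may want to confirm that the variable is visible to child processes and points at a usable key file. The following is a minimal sketch under the assumption that the file is a service account key, which carries a `type` field set to `service_account` and a `client_email` field; the helper name `check_credentials` is hypothetical.

```python
import json
import os


def check_credentials(environ=None):
    """Return (ok, message) for the GOOGLE_APPLICATION_CREDENTIALS setting."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return False, "no file at %s" % path
    try:
        with open(path) as f:
            key = json.load(f)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return False, "%s is not valid JSON" % path
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return False, "%s does not look like a service account key" % path
    return True, "using service account %s" % key.get("client_email", "<unknown>")


if __name__ == "__main__":
    ok, message = check_credentials()
    print(("OK: " if ok else "Problem: ") + message)
```

This only checks that the file exists and has the expected shape. It does not verify that the service account can actually read or write your GCS bucket.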
