View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook |
This guide demonstrates how to migrate your multi-worker distributed training workflow from TensorFlow 1 to TensorFlow 2.
To perform multi-worker training with CPUs/GPUs:
- In TensorFlow 1, you traditionally use the
tf.estimator.train_and_evaluate
andtf.estimator.Estimator
APIs. - In TensorFlow 2, use the Keras APIs for writing the model, the loss function, the optimizer, and metrics. Then, distribute the training with Keras
Model.fit
API or a custom training loop (withtf.GradientTape
) across multiple workers withtf.distribute.experimental.ParameterServerStrategy
ortf.distribute.MultiWorkerMirroredStrategy
. For more details, refer to the following tutorials:
Setup
Start with some necessary imports and a simple dataset for demonstration purposes:
# The notebook uses a dataset instance for `Model.fit` with
# `ParameterServerStrategy`, which depends on symbols in TF 2.7.
# Install a utility needed for this demonstration
!pip install portpicker
import tensorflow as tf
import tensorflow.compat.v1 as tf1
features = [[1., 1.5], [2., 2.5], [3., 3.5]]
labels = [[0.3], [0.5], [0.7]]
eval_features = [[4., 4.5], [5., 5.5], [6., 6.5]]
eval_labels = [[0.8], [0.9], [1.]]
You will need the 'TF_CONFIG'
configuration environment variable for training on multiple machines in TensorFlow. Use 'TF_CONFIG'
to specify the 'cluster'
and the 'task'
s' addresses. (Learn more in the Distributed_training guide.)
import json
import os
tf_config = {
'cluster': {
'chief': ['localhost:11111'],
'worker': ['localhost:12345', 'localhost:23456', 'localhost:21212'],
'ps': ['localhost:12121', 'localhost:13131'],
},
'task': {'type': 'chief', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
Use the del
statement to remove the variable (but in real-world multi-worker training in TensorFlow 1, you won't have to do this):
del os.environ['TF_CONFIG']
TensorFlow 1: Multi-worker distributed training with tf.estimator APIs
The following code snippet demonstrates the canonical workflow of multi-worker training in TF1: you will use a tf.estimator.Estimator
, a tf.estimator.TrainSpec
, a tf.estimator.EvalSpec
, and the tf.estimator.train_and_evaluate
API to distribute the training:
def _input_fn():
return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)
def _eval_input_fn():
return tf1.data.Dataset.from_tensor_slices(
(eval_features, eval_labels)).batch(1)
def _model_fn(features, labels, mode):
logits = tf1.layers.Dense(1)(features)
loss = tf1.losses.mean_squared_error(labels=labels, predictions=logits)
optimizer = tf1.train.AdagradOptimizer(0.05)
train_op = optimizer.minimize(loss, global_step=tf1.train.get_global_step())
return tf1.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
estimator = tf1.estimator.Estimator(model_fn=_model_fn)
train_spec = tf1.estimator.TrainSpec(input_fn=_input_fn)
eval_spec = tf1.estimator.EvalSpec(input_fn=_eval_input_fn)
tf1.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
TensorFlow 2: Multi-worker training with distribution strategies
In TensorFlow 2, distributed training across multiple workers with CPUs, GPUs, and TPUs is done via tf.distribute.Strategy
s.
The following example demonstrates how to use two such strategies: tf.distribute.experimental.ParameterServerStrategy
and tf.distribute.MultiWorkerMirroredStrategy
, both of which are designed for CPU/GPU training with multiple workers.
ParameterServerStrategy
employs a coordinator ('chief'
), which makes it more friendly with the environment in this Colab notebook. You will be using some utilities here to set up the supporting elements essential for a runnable experience here: you will create an in-process cluster, where threads are used to simulate the parameter servers ('ps'
) and workers ('worker'
). For more information about parameter server training, refer to the Parameter server training with ParameterServerStrategy tutorial.
In this example, first define the 'TF_CONFIG'
environment variable with a tf.distribute.cluster_resolver.TFConfigClusterResolver
to provide the cluster information. If you are using a cluster management system for your distributed training, check if it provides 'TF_CONFIG'
for you already, in which case you don't need to explicitly set this environment variable. (Learn more in the Setting up the 'TF_CONFIG'
environment variable section in the Distributed training with TensorFlow guide.)
# Find ports that are available for the `'chief'` (the coordinator),
# `'worker'`s, and `'ps'` (parameter servers).
import portpicker
chief_port = portpicker.pick_unused_port()
worker_ports = [portpicker.pick_unused_port() for _ in range(3)]
ps_ports = [portpicker.pick_unused_port() for _ in range(2)]
# Dump the cluster information to `'TF_CONFIG'`.
tf_config = {
'cluster': {
'chief': ["localhost:%s" % chief_port],
'worker': ["localhost:%s" % port for port in worker_ports],
'ps': ["localhost:%s" % port for port in ps_ports],
},
'task': {'type': 'chief', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
# Use a cluster resolver to bridge the information to the strategy created below.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
Then, create tf.distribute.Server
s for the workers and parameter servers one-by-one:
# Workers need some inter_ops threads to work properly.
# This is only needed for this notebook to demo. Real servers
# should not need this.
worker_config = tf.compat.v1.ConfigProto()
worker_config.inter_op_parallelism_threads = 4
for i in range(3):
tf.distribute.Server(
cluster_resolver.cluster_spec(),
job_name="worker",
task_index=i,
config=worker_config)
for i in range(2):
tf.distribute.Server(
cluster_resolver.cluster_spec(),
job_name="ps",
task_index=i)
In real-world distributed training, instead of starting all the tf.distribute.Server
s on the coordinator, you will be using multiple machines, and the ones that are designated as "worker"
s and "ps"
(parameter servers) will each run a tf.distribute.Server
. Refer to Clusters in the real world section in the Parameter server training tutorial for more details.
With everything ready, create the ParameterServerStrategy
object:
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
Once you have created a strategy object, define the model, the optimizer, and other variables, and call the Keras Model.compile
within the Strategy.scope
API to distribute the training. (Refer to the Strategy.scope
API docs for more information.)
If you prefer to customize your training by, for instance, defining the forward and backward passes, refer to Training with a custom training loop section in Parameter server training tutorial for more details.
dataset = tf.data.Dataset.from_tensor_slices(
(features, labels)).shuffle(10).repeat().batch(64)
eval_dataset = tf.data.Dataset.from_tensor_slices(
(eval_features, eval_labels)).repeat().batch(1)
with strategy.scope():
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.legacy.Adagrad(learning_rate=0.05)
model.compile(optimizer, "mse")
model.fit(dataset, epochs=5, steps_per_epoch=10)
model.evaluate(eval_dataset, steps=10, return_dict=True)
Partitioners (
tf.distribute.experimental.partitioners
)
ParameterServerStrategy
in TensorFlow 2 supports variable partitioning and offers same partitioners as TensorFlow 1, with less confusing names: -tf.compat.v1.variable_axis_size_partitioner
->tf.distribute.experimental.partitioners.MaxSizePartitioner
: a partitioner that keeps shards under a maximum size). -tf.compat.v1.min_max_variable_partitioner
->tf.distribute.experimental.partitioners.MinSizePartitioner
: a partitioner that allocates a minimum size per shard. -tf.compat.v1.fixed_size_partitioner
->tf.distribute.experimental.partitioners.FixedShardsPartitioner
: a partitioner that allocates a fixed number of shards.
Alternatively, you can use a MultiWorkerMirroredStrategy
object:
# To clean up the `TF_CONFIG` used for `ParameterServerStrategy`.
del os.environ['TF_CONFIG']
strategy = tf.distribute.MultiWorkerMirroredStrategy()
You can replace the strategy used above with a MultiWorkerMirroredStrategy
object to perform training with this strategy.
As with the tf.estimator
APIs, since MultiWorkerMirroredStrategy
is a multi-client strategy, there is no easy way to run distributed training in this Colab notebook. Therefore, replacing the code above with this strategy ends up running things locally. The Multi-worker training with Keras Model.fit/a custom training loop tutorials demonstrate how to run multi-worker training with
the 'TF_CONFIG'
variable set up, with two workers on a localhost in Colab. In practice, you would create multiple workers on external IP addresses/ports, and use the 'TF_CONFIG'
variable to specify the cluster configuration for each worker.
Next steps
To learn more about multi-worker distributed training with tf.distribute.experimental.ParameterServerStrategy
and tf.distribute.MultiWorkerMirroredStrategy
in TensorFlow 2, consider the following resources:
- Tutorial: Parameter server training with ParameterServerStrategy and Keras Model.fit/a custom training loop
- Tutorial: Multi-worker training with MultiWorkerMirroredStrategy and Keras Model.fit
- Tutorial: Multi-worker training with MultiWorkerMirroredStrategy and a custom training loop
- Guide: Distributed training with TensorFlow
- Guide: Optimize TensorFlow GPU performance with the TensorFlow Profiler
- Guide: Use a GPU (the Using multiple GPUs section)