This guide demonstrates how to migrate embedding training on TPUs from TensorFlow 1's embedding_column API with TPUEstimator to TensorFlow 2's TPUEmbedding layer API with TPUStrategy.
Embeddings are (large) matrices. They are lookup tables that map from a sparse feature space to dense vectors. Embeddings provide efficient and dense representations, capturing complex similarities and relationships between features.
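For intuition, here is a minimal sketch (not part of the original notebook; the table shape and IDs are illustrative) of an embedding as a plain lookup table: a [vocabulary_size, dimension] matrix indexed by integer feature IDs.

import tensorflow as tf

# Illustrative values only: a vocabulary of 10 IDs mapped to 5-dimensional vectors.
table = tf.random.normal([10, 5])
ids = tf.constant([0, 5])  # two sparse integer feature values
vectors = tf.nn.embedding_lookup(table, ids)
print(vectors.shape)  # (2, 5)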
TensorFlow includes specialized support for training embeddings on TPUs. This TPU-specific embedding support allows you to train embeddings that are larger than the memory of a single TPU device, and to use sparse and ragged inputs on TPUs.
- In TensorFlow 1, tf.compat.v1.estimator.tpu.TPUEstimator is a high-level API that encapsulates training, evaluation, prediction, and exporting for serving with TPUs. It has special support for tf.compat.v1.tpu.experimental.embedding_column.
- To implement this in TensorFlow 2, use the TensorFlow Recommenders' tfrs.layers.embedding.TPUEmbedding layer. For training and evaluation, use a TPU distribution strategy, tf.distribute.TPUStrategy, which is compatible with the Keras APIs for, for example, model building (tf.keras.Model), optimizers (tf.keras.optimizers.Optimizer), and training with Model.fit or a custom training loop with tf.function and tf.GradientTape.
For additional information, refer to the tfrs.layers.embedding.TPUEmbedding layer's API documentation, as well as the tf.tpu.experimental.embedding.TableConfig and tf.tpu.experimental.embedding.FeatureConfig docs. For an overview of tf.distribute.TPUStrategy, check out the Distributed training guide and the Use TPUs guide. If you're migrating from TPUEstimator to TPUStrategy, check out the TPU migration guide.
Setup
Start by installing TensorFlow Recommenders and importing some necessary packages:
pip install tensorflow-recommenders
import tensorflow as tf
import tensorflow.compat.v1 as tf1
# TPUEmbedding layer is not part of TensorFlow.
import tensorflow_recommenders as tfrs
And prepare a simple dataset for demonstration purposes:
features = [[1., 1.5]]
embedding_features_indices = [[0, 0], [0, 1]]
embedding_features_values = [0, 5]
labels = [[0.3]]
eval_features = [[4., 4.5]]
eval_embedding_features_indices = [[0, 0], [0, 1]]
eval_embedding_features_values = [4, 3]
eval_labels = [[0.8]]
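To make the sparse inputs concrete, here is a small illustration (using the definitions above) of the tf.SparseTensor they describe: a batch of one example with dense shape [1, 2], carrying the two categorical IDs 0 and 5.

# Illustration only: the (indices, values, dense_shape) triple describes
# a 1x2 sparse batch of categorical IDs.
sparse = tf.SparseTensor(embedding_features_indices,
                         embedding_features_values, [1, 2])
print(tf.sparse.to_dense(sparse))  # [[0 5]]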
TensorFlow 1: Train embeddings on TPUs with TPUEstimator
In TensorFlow 1, you set up TPU embeddings using the tf.compat.v1.tpu.experimental.embedding_column API and train/evaluate the model on TPUs with tf.compat.v1.estimator.tpu.TPUEstimator.
The inputs are integers ranging from zero to the vocabulary size of the TPU embedding table. Begin by encoding the inputs as categorical IDs with tf.feature_column.categorical_column_with_identity. Use "sparse_feature" for the key parameter, since the input features are integer-valued, while num_buckets is the vocabulary size for the embedding table (10).
embedding_id_column = (
    tf1.feature_column.categorical_column_with_identity(
        key="sparse_feature", num_buckets=10))
Next, convert the sparse categorical inputs to a dense representation with tpu.experimental.embedding_column, where dimension is the width of the embedding table. It will store an embedding vector for each of the num_buckets.
embedding_column = tf1.tpu.experimental.embedding_column(
    embedding_id_column, dimension=5)
Now, define the TPU-specific embedding configuration via tf.estimator.tpu.experimental.EmbeddingConfigSpec. You will pass it later to tf.estimator.tpu.TPUEstimator as an embedding_config_spec parameter.
embedding_config_spec = tf1.estimator.tpu.experimental.EmbeddingConfigSpec(
    feature_columns=(embedding_column,),
    optimization_parameters=(
        tf1.tpu.experimental.AdagradParameters(0.05)))
Next, to use a TPUEstimator, define:
- An input function for the training data
- An evaluation input function for the evaluation data
- A model function instructing the TPUEstimator how the training op is defined with the features and labels
def _input_fn(params):
  dataset = tf1.data.Dataset.from_tensor_slices((
      {"dense_feature": features,
       "sparse_feature": tf1.SparseTensor(
           embedding_features_indices,
           embedding_features_values, [1, 2])},
      labels))
  dataset = dataset.repeat()
  return dataset.batch(params['batch_size'], drop_remainder=True)

def _eval_input_fn(params):
  dataset = tf1.data.Dataset.from_tensor_slices((
      {"dense_feature": eval_features,
       "sparse_feature": tf1.SparseTensor(
           eval_embedding_features_indices,
           eval_embedding_features_values, [1, 2])},
      eval_labels))
  dataset = dataset.repeat()
  return dataset.batch(params['batch_size'], drop_remainder=True)
def _model_fn(features, labels, mode, params):
  embedding_features = tf1.keras.layers.DenseFeatures(embedding_column)(features)
  concatenated_features = tf1.keras.layers.Concatenate(axis=1)(
      [embedding_features, features["dense_feature"]])
  logits = tf1.layers.Dense(1)(concatenated_features)
  loss = tf1.losses.mean_squared_error(labels=labels, predictions=logits)
  optimizer = tf1.train.AdagradOptimizer(0.05)
  optimizer = tf1.tpu.CrossShardOptimizer(optimizer)
  train_op = optimizer.minimize(loss, global_step=tf1.train.get_global_step())
  return tf1.estimator.tpu.TPUEstimatorSpec(mode, loss=loss, train_op=train_op)
With those functions defined, create a tf.distribute.cluster_resolver.TPUClusterResolver that provides the cluster information, and a tf.compat.v1.estimator.tpu.RunConfig object. Along with the model function you have defined, you can now create a TPUEstimator. Here, you will simplify the flow by skipping checkpoint saving. Then, you will specify the batch size for both training and evaluation for the TPUEstimator.
cluster_resolver = tf1.distribute.cluster_resolver.TPUClusterResolver(tpu='')
print("All devices: ", tf1.config.list_logical_devices('TPU'))
All devices: []
tpu_config = tf1.estimator.tpu.TPUConfig(
    iterations_per_loop=10,
    per_host_input_for_training=tf1.estimator.tpu.InputPipelineConfig
    .PER_HOST_V2)
config = tf1.estimator.tpu.RunConfig(
    cluster=cluster_resolver,
    save_checkpoints_steps=None,
    tpu_config=tpu_config)
estimator = tf1.estimator.tpu.TPUEstimator(
    model_fn=_model_fn, config=config, train_batch_size=8, eval_batch_size=8,
    embedding_config_spec=embedding_config_spec)
WARNING:tensorflow:Estimator's model_fn (<function _model_fn at 0x7f168033fc80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpxc9fm1_q
INFO:tensorflow:Using config: {...}
INFO:tensorflow:_TPUContext: eval_on_tpu True
Call TPUEstimator.train to begin training the model:
estimator.train(_input_fn, steps=1)
INFO:tensorflow:Querying Tensorflow master (grpc://10.240.1.10:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
...
INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 8 seconds
INFO:tensorflow:loss = 0.6467617, step = 1
INFO:tensorflow:Loss for final step: 0.6467617.
INFO:tensorflow:training_loop marked as finished
<tensorflow_estimator.python.estimator.tpu.tpu_estimator.TPUEstimator at 0x7f168035b128>
Then, call TPUEstimator.evaluate to evaluate the model using the evaluation data:
estimator.evaluate(_eval_input_fn, steps=1)
INFO:tensorflow:Could not find trained model in model_dir: /tmp/tmpxc9fm1_q, running initialization to evaluate.
INFO:tensorflow:Starting evaluation at 2022-03-02T13:22:48
...
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Inference Time : 12.06464s
INFO:tensorflow:Finished evaluation at 2022-03-02-13:23:00
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 0.16138805
INFO:tensorflow:evaluation_loop marked as finished
{'loss': 0.16138805, 'global_step': 1}
TensorFlow 2: Train embeddings on TPUs with TPUStrategy
In TensorFlow 2, to train on the TPU workers, use tf.distribute.TPUStrategy together with the Keras APIs for model definition and training/evaluation. (Refer to the Use TPUs guide for more examples of training with Keras Model.fit and a custom training loop with tf.function and tf.GradientTape.)
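If you prefer the custom-training-loop route, a rough sketch (assuming a strategy, model, optimizer, and distributed iterator like those built later in this guide) wraps the per-replica step in a tf.function and dispatches it with Strategy.run:

@tf.function
def train_step(iterator):
  # `strategy`, `model`, and `optimizer` are assumed to be created under
  # the strategy scope, as shown later in this guide.
  def step_fn(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
      predictions = model(features, training=True)
      loss = tf.keras.losses.mean_squared_error(labels, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
  strategy.run(step_fn, args=(next(iterator),))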
Since you need to perform some initialization work to connect to the remote cluster and initialize the TPU workers, start by creating a TPUClusterResolver to provide the cluster information and connect to the cluster. (Learn more in the TPU initialization section of the Use TPUs guide.)
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Initializing the TPU system: grpc://10.240.1.10:8470
INFO:tensorflow:Finished initializing TPU system.
All devices: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]
Next, prepare your data. This is similar to how you created a dataset in the TensorFlow 1 example, except the dataset function is now passed a tf.distribute.InputContext object rather than a params dict. You can use this object to determine the local batch size (and which host this pipeline is for, so you can properly partition your data).
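For illustration (with hypothetical pipeline values, not part of the original notebook), InputContext.get_per_replica_batch_size simply divides the global batch size evenly across the replicas in sync:

# Hypothetical values: a single input pipeline feeding 8 replicas in sync.
ctx = tf.distribute.InputContext(
    num_input_pipelines=1, input_pipeline_id=0, num_replicas_in_sync=8)
print(ctx.get_per_replica_batch_size(8))  # 1 example per replica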
- When using the tfrs.layers.embedding.TPUEmbedding API, it is important to include the drop_remainder=True option when batching the dataset with Dataset.batch, since TPUEmbedding requires a fixed batch size.
- Additionally, the same batch size must be used for evaluation and training if they are taking place on the same set of devices.
- Finally, you should use tf.keras.utils.experimental.DatasetCreator along with the special input option experimental_fetch_to_device=False in tf.distribute.InputOptions (which holds strategy-specific configurations). This is demonstrated below:
global_batch_size = 8

def _input_dataset(context: tf.distribute.InputContext):
  dataset = tf.data.Dataset.from_tensor_slices((
      {"dense_feature": features,
       "sparse_feature": tf.SparseTensor(
           embedding_features_indices,
           embedding_features_values, [1, 2])},
      labels))
  dataset = dataset.shuffle(10).repeat()
  dataset = dataset.batch(
      context.get_per_replica_batch_size(global_batch_size),
      drop_remainder=True)
  return dataset.prefetch(2)

def _eval_dataset(context: tf.distribute.InputContext):
  dataset = tf.data.Dataset.from_tensor_slices((
      {"dense_feature": eval_features,
       "sparse_feature": tf.SparseTensor(
           eval_embedding_features_indices,
           eval_embedding_features_values, [1, 2])},
      eval_labels))
  dataset = dataset.repeat()
  dataset = dataset.batch(
      context.get_per_replica_batch_size(global_batch_size),
      drop_remainder=True)
  return dataset.prefetch(2)

input_options = tf.distribute.InputOptions(
    experimental_fetch_to_device=False)

input_dataset = tf.keras.utils.experimental.DatasetCreator(
    _input_dataset, input_options=input_options)

eval_dataset = tf.keras.utils.experimental.DatasetCreator(
    _eval_dataset, input_options=input_options)
Next, once the data is prepared, you will create a TPUStrategy, and define a model, metrics, and an optimizer under the scope of this strategy (Strategy.scope).

You should pick a number for steps_per_execution in Model.compile, since it specifies the number of batches to run during each tf.function call and is critical for performance. This argument is similar to iterations_per_loop used in TPUEstimator.
The features and table configuration that were specified in TensorFlow 1 via tf.tpu.experimental.embedding_column (and tf.tpu.experimental.shared_embedding_column) can be specified directly in TensorFlow 2 via a pair of configuration objects: tf.tpu.experimental.embedding.FeatureConfig and tf.tpu.experimental.embedding.TableConfig. (Refer to the associated API documentation for more details.)
strategy = tf.distribute.TPUStrategy(cluster_resolver)

with strategy.scope():
  optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.05)
  dense_input = tf.keras.Input(shape=(2,), dtype=tf.float32, batch_size=global_batch_size)
  sparse_input = tf.keras.Input(shape=(), dtype=tf.int32, batch_size=global_batch_size)
  embedded_input = tfrs.layers.embedding.TPUEmbedding(
      feature_config=tf.tpu.experimental.embedding.FeatureConfig(
          table=tf.tpu.experimental.embedding.TableConfig(
              vocabulary_size=10,
              dim=5,
              initializer=tf.initializers.TruncatedNormal(mean=0.0, stddev=1)),
          name="sparse_input"),
      optimizer=optimizer)(sparse_input)
  input = tf.keras.layers.Concatenate(axis=1)([dense_input, embedded_input])
  result = tf.keras.layers.Dense(1)(input)
  model = tf.keras.Model(inputs={"dense_feature": dense_input, "sparse_feature": sparse_input}, outputs=result)
  model.compile(optimizer, "mse", steps_per_execution=10)
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
With that, you are ready to train the model with the training dataset:
model.fit(input_dataset, epochs=5, steps_per_epoch=10)
Epoch 1/5
10/10 [==============================] - 2s 175ms/step - loss: 0.0057
10/10 [==============================] - 0s 3ms/step - loss: 0.0000e+00
10/10 [==============================] - 0s 3ms/step - loss: 0.0000e+00
10/10 [==============================] - 0s 3ms/step - loss: 0.0000e+00
10/10 [==============================] - 0s 3ms/step - loss: 0.0000e+00
<keras.callbacks.History at 0x7f16803b4a90>
Finally, evaluate the model using the evaluation dataset:
model.evaluate(eval_dataset, steps=1, return_dict=True)
1/1 [==============================] - 1s 1s/step - loss: 12.2297
{'loss': 12.229663848876953}
Next steps
Learn more about setting up TPU-specific embeddings in the API docs:
- tfrs.layers.embedding.TPUEmbedding: particularly about feature and table configuration, setting the optimizer, creating a model (using the Keras functional API or via subclassing tf.keras.Model), training/evaluation, and model serving with tf.saved_model
- tf.tpu.experimental.embedding.TableConfig
- tf.tpu.experimental.embedding.FeatureConfig
For more information about TPUStrategy in TensorFlow 2, consider the following resources:
- Guide: Use TPUs (covering training with Keras Model.fit/a custom training loop with tf.distribute.TPUStrategy, as well as tips on improving the performance with tf.function)
- Guide: Distributed training with TensorFlow
- Guide: Migrate from TPUEstimator to TPUStrategy.
To learn more about customizing your training, refer to the Customize what happens in Model.fit and Writing a training loop from scratch guides.
TPUs—Google's specialized ASICs for machine learning—are available through Google Colab, the TPU Research Cloud, and Cloud TPU.