Developing a new backend for XLA

This guide is for system engineers who want XLA to output programs that target their hardware efficiently. The guide is not step-by-step and assumes knowledge of LLVM, Bazel, and XLA.

XLA provides an abstract interface that a new architecture or accelerator can implement to create a backend to run ML programs output by XLA. Retargeting XLA should be significantly simpler and more scalable than implementing every existing op from a frontend framework such as PyTorch or TensorFlow for new hardware.

Most implementations will fall into one of the following scenarios:

1. Existing CPU architecture not yet officially supported by XLA, with or without an existing LLVM backend.
2. Non-CPU-like hardware with an existing LLVM backend.
3. Non-CPU-like hardware without an existing LLVM backend.

Note: An LLVM backend can mean either one of the officially released LLVM backends or a custom LLVM backend developed in-house.

Scenario 1: Existing CPU architecture not yet officially supported by XLA

In this scenario, start by looking at the existing XLA CPU backend. Because XLA uses LLVM for code generation, targeting a different CPU is comparatively easy: the main difference between XLA backends for CPUs is the code that LLVM generates.

If the hardware vendor has an LLVM backend for their hardware, linking that backend with the LLVM that is built with XLA is straightforward. In JIT mode, the XLA CPU backend emits code for the host CPU. For ahead-of-time compilation, xla::AotCompilationOptions can provide an LLVM triple to configure the target architecture.
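
The difference between the two modes comes down to which target triple is handed to LLVM. The following is a minimal, self-contained sketch of that idea, not XLA or LLVM code: TargetTriple, ParseTriple, and SelectTriple are hypothetical names introduced here for illustration, and the parser only handles the plain arch-vendor-os[-environment] form. LLVM's own triple handling accepts and normalizes more variants than this.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed target triple; not an XLA or LLVM type.
struct TargetTriple {
  std::string arch;
  std::string vendor;
  std::string os;
  std::string environment;  // Optional fourth component; empty when absent.
};

// Splits "arch-vendor-os[-environment]" on '-'.
// Returns false unless there are three or four non-empty components.
bool ParseTriple(const std::string& text, TargetTriple* out) {
  if (text.empty() || text.back() == '-') return false;
  std::vector<std::string> parts;
  std::stringstream stream(text);
  std::string part;
  while (std::getline(stream, part, '-')) {
    if (part.empty()) return false;
    parts.push_back(part);
  }
  if (parts.size() < 3 || parts.size() > 4) return false;
  *out = TargetTriple{parts[0], parts[1], parts[2], ""};
  if (parts.size() == 4) out->environment = parts[3];
  return true;
}

// Hypothetical helper mirroring the text above: JIT targets the host,
// while AOT targets whatever triple the caller configured.
std::string SelectTriple(bool jit_mode, const std::string& host_triple,
                         const std::string& aot_triple) {
  return jit_mode ? host_triple : aot_triple;
}
```

For example, calling SelectTriple with jit_mode set to false and an aot_triple of "aarch64-unknown-linux-gnu" returns the AOT triple, which ParseTriple then splits into arch "aarch64", vendor "unknown", os "linux", and environment "gnu".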

If there is no existing LLVM backend, but another kind of code generator exists, it should be possible to reuse most of the existing CPU backend.

Scenario 2: Non-CPU-like hardware with an existing LLVM backend

A new xla::Compiler implementation can be modeled on the existing xla::CPUCompiler and xla::GPUCompiler classes, since these already emit LLVM IR. Depending on the nature of the hardware, many aspects of the LLVM IR generation may have to change, but a lot of code can be shared with the existing backends.

A good example to follow is the GPU backend of XLA. The GPU backend targets a non-CPU-like ISA, and therefore some aspects of its code generation are unique to the GPU domain. Other kinds of hardware, e.g. DSPs like Hexagon (which has an upstream LLVM backend), can reuse parts of the LLVM IR emission logic, but other parts will be unique.
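
One way to picture the split between reusable and hardware-specific emission logic is a base class that owns the shared steps and leaves a hook for each target. The sketch below is an illustration under stated assumptions, not how the XLA backends are organized: AddEmitterBase, CpuAddEmitter, and GpuAddEmitter are hypothetical names, and the emitted strings are fragments of textual LLVM IR for an element-wise add rather than a complete function or kernel.

```cpp
#include <string>
#include <vector>

// Hypothetical base class holding the emission steps that any
// LLVM-based backend could share. Not an XLA class.
class AddEmitterBase {
 public:
  virtual ~AddEmitterBase() = default;

  // Emits the body of an element-wise float add as textual LLVM IR lines.
  std::vector<std::string> EmitAdd() const {
    std::vector<std::string> ir = EmitElementIndex();  // Target-specific part.
    // Shared part: identical on every target once %i is defined.
    ir.push_back("%pa = getelementptr float, ptr %a, i64 %i");
    ir.push_back("%pb = getelementptr float, ptr %b, i64 %i");
    ir.push_back("%x = load float, ptr %pa");
    ir.push_back("%y = load float, ptr %pb");
    ir.push_back("%sum = fadd float %x, %y");
    return ir;
  }

 protected:
  // Each target decides how the element index %i is produced.
  virtual std::vector<std::string> EmitElementIndex() const = 0;
};

// CPU-like target: %i is the induction variable of a loop.
class CpuAddEmitter : public AddEmitterBase {
 protected:
  std::vector<std::string> EmitElementIndex() const override {
    return {"%i = phi i64 [ 0, %entry ], [ %next, %loop ]"};
  }
};

// GPU-like target: %i is derived from the hardware thread index.
class GpuAddEmitter : public AddEmitterBase {
 protected:
  std::vector<std::string> EmitElementIndex() const override {
    return {"%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()",
            "%i = zext i32 %tid to i64"};
  }
};
```

Both emitters produce the same five trailing lines; only the lines that define %i differ. A backend for a DSP would, under the same assumptions, add a third subclass with its own way of producing the index.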

Scenario 3: Non-CPU-like hardware without an existing LLVM backend

If LLVM cannot be used, the best option is to implement a new XLA backend for the desired hardware. This option requires the most effort. The classes that need to be implemented are as follows:

StreamExecutor: For many devices, not all methods of StreamExecutor are needed. See existing StreamExecutor implementations for details.
xla::Compiler: This class encapsulates the compilation of an HLO computation into an xla::Executable.
xla::Executable: This class is used to launch a compiled computation on the platform.
xla::TransferManager: This class enables backends to provide platform-specific mechanisms for constructing XLA literal data from given device memory handles. In other words, it helps encapsulate the transfer of data from the host to the device and back.
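
To show how these roles fit together, here is a heavily simplified sketch. Everything in the sketch namespace is a hypothetical stand-in: the real xla::Compiler, xla::Executable, and xla::TransferManager interfaces have different and much larger signatures, StreamExecutor is left out entirely, and the "device" here is ordinary host memory. The string passed to Compile stands in for an HLO computation and is not real HLO syntax.

```cpp
#include <memory>
#include <string>
#include <vector>

namespace sketch {  // Hypothetical; these are NOT the real XLA declarations.

using DeviceBuffer = std::vector<float>;  // Stand-in for a device memory handle.

// Role of xla::TransferManager: move data between host and device.
class TransferManager {
 public:
  virtual ~TransferManager() = default;
  virtual DeviceBuffer ToDevice(const std::vector<float>& host) = 0;
  virtual std::vector<float> ToHost(const DeviceBuffer& device) = 0;
};

// Role of xla::Executable: launch a compiled computation on the platform.
class Executable {
 public:
  virtual ~Executable() = default;
  virtual DeviceBuffer Run(const DeviceBuffer& input) = 0;
};

// Role of xla::Compiler: turn a computation into an Executable.
class Compiler {
 public:
  virtual ~Compiler() = default;
  virtual std::unique_ptr<Executable> Compile(const std::string& op_name) = 0;
};

// Toy backend: the "device" is host memory, so transfers are plain copies.
class ToyTransferManager : public TransferManager {
 public:
  DeviceBuffer ToDevice(const std::vector<float>& host) override { return host; }
  std::vector<float> ToHost(const DeviceBuffer& device) override { return device; }
};

// Toy executable that negates every element of its input.
class NegateExecutable : public Executable {
 public:
  DeviceBuffer Run(const DeviceBuffer& input) override {
    DeviceBuffer output;
    output.reserve(input.size());
    for (float value : input) output.push_back(-value);
    return output;
  }
};

// Toy compiler that supports a single operation and returns nullptr otherwise.
class ToyCompiler : public Compiler {
 public:
  std::unique_ptr<Executable> Compile(const std::string& op_name) override {
    if (op_name == "negate") return std::make_unique<NegateExecutable>();
    return nullptr;
  }
};

}  // namespace sketch
```

A caller would compile once, transfer the input with ToDevice, call Run on the resulting executable, and read the result back with ToHost. Keeping the three roles behind separate interfaces is what lets a backend replace, for instance, only the transfer mechanism without touching compilation.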