TF-Replicator: Distributed Machine Learning for Researchers
DeepMind’s Research Platform Team builds infrastructure to empower and accelerate the company’s AI research. The team has shared how it developed TF-Replicator, a software library that helps researchers deploy their TensorFlow models on GPUs and Cloud TPUs with minimal effort and no prior experience with distributed systems. TF-Replicator’s programming model has now been open sourced as part of TensorFlow’s tf.distribute.Strategy. This blog post by AICoreSpot gives an overview of the ideas and technical challenges underlying TF-Replicator.
A recurring theme in recent AI breakthroughs, from AlphaFold to BigGAN to AlphaStar, is the need for effortless and reliable scalability. Increasing amounts of computational capacity allow researchers to train ever-larger neural networks with new capabilities. To address this, the Research Platform Team developed TF-Replicator, which lets researchers target different hardware accelerators for machine learning, scale up workloads to many devices, and switch seamlessly between different types of accelerators. While it was initially developed as a library on top of TensorFlow, TF-Replicator’s API has since been integrated into TensorFlow 2.0’s new tf.distribute.Strategy.
While TensorFlow provides direct support for CPU, GPU, and TPU (Tensor Processing Unit) devices, switching between targets requires considerable effort from the user. This usually involves specialising code for a particular hardware target, which constrains research ideas to the capabilities of that platform. Some existing frameworks built on top of TensorFlow, e.g. Estimators, seek to address this problem. However, they are typically aimed at production use cases and lack the expressivity and flexibility needed for rapid iteration on research ideas.
The original motivation for developing TF-Replicator was to provide a simple API for DeepMind researchers to use TPUs. TPUs provide scalability for machine learning workloads, enabling research breakthroughs such as state-of-the-art image synthesis with the BigGAN model. TensorFlow’s native API for TPUs differs from the way GPUs are targeted, which creates a barrier to TPU adoption. TF-Replicator provides a simpler, more user-friendly API that hides the complexity of TensorFlow’s TPU API. Crucially, the Research Platform Team developed the TF-Replicator API in close collaboration with researchers across several machine learning disciplines to ensure the necessary flexibility and ease of use.
Code written using TF-Replicator looks similar to code written in TensorFlow for a single device, leaving users free to define their own model run loop. The user only needs to define (1) an input function that exposes a Dataset, and (2) a step function that defines the logic of their model (e.g. a single step of gradient descent).
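The two-function programming model can be illustrated with a toy, pure-Python sketch. This is not the TF-Replicator or tf.distribute API: the names `input_fn`, `step_fn` and `run_replicated`, the tiny in-memory dataset, and the one-parameter model are all assumptions made for illustration, and the "replicas" here are simply loop iterations rather than real devices.

```python
from typing import Callable, Iterator, List

Batch = List[float]


def input_fn(replica_id: int, num_replicas: int) -> Iterator[Batch]:
    """Yield this replica's shard of a toy dataset in batches of two."""
    data = [float(i) for i in range(16)]
    shard = data[replica_id::num_replicas]  # every num_replicas-th example
    for start in range(0, len(shard), 2):
        yield shard[start:start + 2]


def step_fn(weight: float, batch: Batch) -> float:
    """Return the gradient of the mean squared error for y = weight * x.

    The toy target is y = 3 * x, so the ideal weight is 3.
    """
    return sum(2.0 * (weight * x - 3.0 * x) * x for x in batch) / len(batch)


def run_replicated(
    input_fn: Callable[[int, int], Iterator[Batch]],
    step_fn: Callable[[float, Batch], float],
    num_replicas: int,
    learning_rate: float = 0.001,
) -> float:
    """Hypothetical run loop: one step_fn call per replica, then a shared update."""
    weight = 0.0
    iterators = [input_fn(r, num_replicas) for r in range(num_replicas)]
    for batches in zip(*iterators):  # one batch per replica per step
        grads = [step_fn(weight, batch) for batch in batches]
        weight -= learning_rate * sum(grads) / len(grads)  # averaged gradient
    return weight


final_weight = run_replicated(input_fn, step_fn, num_replicas=2)
print(final_weight)
```

The point of the sketch is that the user-supplied functions never mention devices: changing `num_replicas` changes how the data is sharded and how many gradients are combined, while `input_fn` and `step_fn` stay the same.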
Scaling computation to multiple devices requires the devices to communicate with each other. In the context of training machine learning models, the most common form of communication is accumulating gradients for use in optimisation algorithms such as Stochastic Gradient Descent. TF-Replicator therefore provides a convenient way to wrap TensorFlow Optimizers so that gradients are accumulated across devices before the model’s parameters are updated. For more general communication patterns, it provides MPI-like primitives such as “all_reduce” and “broadcast”. These make it straightforward to implement operations such as global batch normalisation, a technique that was crucial to scaling up training of the BigGAN models.
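A minimal pure-Python sketch of these two ideas follows. It is an illustration only: the function and class names (`all_reduce`, `broadcast`, `SGD`, `CrossReplicaOptimizer`) are hypothetical stand-ins rather than TF-Replicator's actual API, replicas are modelled as list entries, and averaging the summed gradients (rather than just summing them) is an assumption.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def all_reduce(per_replica: Sequence[Vector],
               reduce_fn: Callable[[Sequence[float]], float] = sum) -> List[Vector]:
    """Reduce element-wise across replicas; every replica gets the same result."""
    reduced = [reduce_fn(values) for values in zip(*per_replica)]
    return [list(reduced) for _ in per_replica]


def broadcast(value: Vector, num_replicas: int) -> List[Vector]:
    """Give every replica its own copy of a value held by one replica."""
    return [list(value) for _ in range(num_replicas)]


class SGD:
    """Plain single-device gradient descent."""

    def __init__(self, learning_rate: float) -> None:
        self.learning_rate = learning_rate

    def apply_gradients(self, params: Vector, grads: Vector) -> Vector:
        return [p - self.learning_rate * g for p, g in zip(params, grads)]


class CrossReplicaOptimizer:
    """Wraps an optimizer so gradients are combined across replicas first."""

    def __init__(self, optimizer: SGD) -> None:
        self.optimizer = optimizer

    def apply_gradients(self, per_replica_params: List[Vector],
                        per_replica_grads: List[Vector]) -> List[Vector]:
        num_replicas = len(per_replica_grads)
        summed = all_reduce(per_replica_grads)  # same sum on every replica
        return [
            self.optimizer.apply_gradients(params, [g / num_replicas for g in grads])
            for params, grads in zip(per_replica_params, summed)
        ]


params = broadcast([1.0, 2.0], num_replicas=2)   # identical starting weights
grads = [[1.0, 2.0], [3.0, 4.0]]                 # each replica saw different data
optimizer = CrossReplicaOptimizer(SGD(learning_rate=0.5))
new_params = optimizer.apply_gradients(params, grads)
print(new_params)
```

Because every replica applies the same combined gradient to the same starting parameters, the copies of the model stay in sync without the step function having to know how many devices are involved.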
For multi-GPU computation, TF-Replicator relies on an “in-graph replication” pattern, in which the computation for each device is replicated in the same TensorFlow graph. Communication between devices is achieved by connecting nodes from the devices’ corresponding sub-graphs. Implementing this in TF-Replicator was challenging, because communication can occur at any point in the data-flow graph. The order in which computations are constructed is therefore critical.
The first idea was to build each device’s sub-graph concurrently in a separate Python thread. On encountering a communication primitive, the threads would synchronise and the main thread would insert the required cross-device computation. After that, each thread would continue building its device’s computation. However, at the time this approach was being considered, TensorFlow’s graph-building API was not thread-safe, which made building sub-graphs concurrently in different threads very difficult. Instead, graph rewriting was used to insert the communication after all the devices’ sub-graphs had been built.
During construction of the sub-graphs, placeholders are inserted wherever communication is needed. All matching placeholders across devices are then collected and replaced with the appropriate cross-device computation.
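The placeholder-and-rewrite approach can be sketched on a toy graph in pure Python. This is a simplified illustration and not TensorFlow's graph machinery: the `Node` class, the op names, the `key` used to match placeholders, and the choice of a cross-device sum are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # identity-based hashing so nodes can live in a set
class Node:
    op: str                      # "const", "add", "placeholder" or "all_sum"
    device: str
    inputs: List["Node"] = field(default_factory=list)
    value: float = 0.0           # used only by "const" nodes
    key: str = ""                # used only by "placeholder" nodes


def build_subgraph(device: str, local_value: float) -> Node:
    """Build one device's computation, marking where communication is needed."""
    local = Node("const", device, value=local_value)
    comm = Node("placeholder", device, inputs=[local], key="grad_sum")
    return Node("add", device, inputs=[comm, Node("const", device, value=1.0)])


def rewrite_placeholders(outputs: List[Node]) -> None:
    """Replace matching placeholders with a cross-device sum, in place."""
    groups: Dict[str, List[Node]] = {}
    seen = set()

    def visit(node: Node) -> None:
        if node in seen:
            return
        seen.add(node)
        if node.op == "placeholder":
            groups.setdefault(node.key, []).append(node)
        for parent in node.inputs:
            visit(parent)

    for output in outputs:
        visit(output)
    for placeholders in groups.values():
        sources = [p.inputs[0] for p in placeholders]  # one value per device
        for p in placeholders:
            p.op = "all_sum"
            p.inputs = list(sources)  # connect to every device's sub-graph


def evaluate(node: Node) -> float:
    if node.op == "const":
        return node.value
    if node.op in ("add", "all_sum"):
        return sum(evaluate(parent) for parent in node.inputs)
    raise ValueError(f"unresolved {node.op!r} node on {node.device}")


outputs = [build_subgraph("gpu:0", 2.0), build_subgraph("gpu:1", 5.0)]
rewrite_placeholders(outputs)
print([evaluate(out) for out in outputs])
```

Each sub-graph is built on its own, one after another, with no knowledge of the other devices; the cross-device edges only appear during the rewrite pass, which is why no thread synchronisation is needed while the graph is being built.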
By collaborating closely with researchers throughout the design and implementation of TF-Replicator, the team was able to build a library that lets users easily scale computation across many hardware accelerators, while leaving them with the control and flexibility needed for cutting-edge AI research. For instance, MPI-style communication primitives such as all-reduce were added following discussions with researchers. TF-Replicator and other shared infrastructure allow the team to build increasingly complex environments on robust foundations and to spread best practices quickly throughout DeepMind.
At the time of writing, TF-Replicator is the most widely used interface for TPU programming at DeepMind. Although the library itself is not limited to training neural networks, it is most commonly used for training on large batches of data. The BigGAN model, for instance, was trained on batches of size 2048 across up to 512 cores of a TPUv3 pod. In reinforcement learning, for agents with a distributed actor-learner setup such as DeepMind’s importance weighted actor-learner architectures, scalability is achieved by having many actors generate new experiences by interacting with the environment. This data is then processed by the learner to improve the agent’s policy, which is represented by a neural network. To cope with a growing number of actors, TF-Replicator can be used to distribute the learner across several hardware accelerators with little effort.
TF-Replicator is just one of many examples of impactful technology built by DeepMind’s Research Platform Team. Much of DeepMind’s groundbreaking work in artificial intelligence, from AlphaGo to AlphaStar, was enabled by the team.