Scaling to trillion-parameter model training on AWS

January 22, 2024

[ad_1]

In an Amazon Science blog post earlier this summer, we presented MiCS, a method that significantly improves the training efficiency of machine learning models with up to 175 billion parameters. But there is a continuing push to scale natural-language-processing models to a trillion parameters, to enable reliable few-shot learning for new tasks.

In this post, we introduce two new augmentations to MiCS, to allow AWS customers to train and fine-tune models at the trillion-parameter scale: (1) contiguous parameter management and (2) prefetched activation offloading.

Illustration of forward and backward passes of ZeRO-3 data parallelism on a cluster with four devices; a two-layer deep learning neural network model is used here for illustration purposes.

The figure above illustrates the process of parameter gathering during forward and backward passes for a two-layer deep-learning neural-network model. Before we start the forward step, each worker (rank) holds only a part of the model parameters. In order to compute the activations for the first layer, we use the all-gather operation to gather its parameters.

Related content

A new distributed-training library achieves near-linear efficiency in scaling from tens to hundreds of GPUs.

Once we obtain the output of the first layer, we immediately partition its parameters to release memory and proceed to the next neural-network layer. These two steps are repeated in a reverse order when we compute the gradient.

Repeated all-gather and partitioning processes result in heavy use of collective communication, which causes severe memory fragmentation and cache flush in Pytorch. To address this issue, we pre-allocate a contiguous parameter buffer to hold the complete parameter tensors after the gathering and to self-manage the tensor liveness and defragmentation without affecting the behavior of the Pytorch memory allocator. We observed that this approach greatly improved the performance of the memory-bounded tasks.

In addition, we have developed prefetched activation offloading to further save GPU memory, which we enable in conjunction with activation checkpointing. Each checkpointed activation is offloaded to CPU memory and prefetched when needed during backpropagation, using a dedicated stream opened using the CUDA parallel-computing platform. Since the data transfer is asynchronous, we observed only about a 1-2% speed loss using prefetched activation offloading.

Scaling to trillion-parameter model training on AWS

Alexa’s text-to-speech research at Interspeech 2022

SLT-Code: A new hackathon to promote language diversity in speech technology

Using Amazon web traffic to track the eclipse

A quick guide to Amazon’s 20+ papers at ICASSP 2024

Play the latest Prime games – Fallout 3 and Fallout: New Vegas

Help support the National Park Foundation by watching select Prime Video content on Fire TV.

Leave a reply Cancel reply