Model parallelism builds a pipeline of GPUs: each GPU holds a layer (or group of layers) of the model. It is used when the model is too big to fit in a single GPU's memory. Efficiency is hurt by bubbles in the pipeline when data passes through the GPUs at different speeds.
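A minimal sketch of the idea, assuming a toy two-stage setup where each "GPU" is just a Python function. Real systems (e.g. GPipe) overlap micro-batches across stages to shrink the bubble; here the stages are shown running sequentially per micro-batch.

```python
# Toy pipeline: stage0 lives on "GPU 0", stage1 on "GPU 1".
def stage0(x):
    return x * 2          # first layer of the model

def stage1(x):
    return x + 1          # second layer of the model

def pipeline(batch, micro_batch_size=2):
    # split the batch into micro-batches so both stages can stay busy
    micro = [batch[i:i + micro_batch_size]
             for i in range(0, len(batch), micro_batch_size)]
    out = []
    for mb in micro:
        acts = [stage0(x) for x in mb]    # forward on GPU 0
        acts = [stage1(a) for a in acts]  # activations handed off to GPU 1
        out.extend(acts)
    return out

print(pipeline([1, 2, 3, 4]))  # → [3, 5, 7, 9]
```

The bubble comes from the hand-off: while GPU 1 works on a micro-batch, GPU 0 is idle unless the next micro-batch has already been fed in.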
Data parallelism is used when the model can fit on a single GPU. The data is divided into mini-batches, one per GPU. Gradients are computed on each GPU and combined to adjust the shared weights.
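A minimal sketch, assuming two replicas of a one-parameter linear model y = w·x trained with squared error. Each replica computes a gradient on its own shard of the batch; the gradients are then averaged (the all-reduce step) before the single weight update.

```python
def grad(w, xs, ys):
    # d/dw of mean squared error 0.5*(w*x - y)^2 over the mini-batch
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, shards, lr=0.1):
    grads = [grad(w, xs, ys) for xs, ys in shards]  # one gradient per "GPU"
    g = sum(grads) / len(grads)                     # all-reduce: average them
    return w - lr * g                               # identical update everywhere

# batch generated from y = 2x, split across two replicas
shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # → 2.0
```

Because every replica applies the same averaged gradient, all copies of the weights stay in sync without ever exchanging the data itself.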
Tensor parallelism maps different parts of the model to multiple GPUs. It differs from model parallelism in that a portion of a layer, not the entire layer, is mapped to each GPU. The input is split and fed to the different GPUs along the mapping boundary.
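A minimal sketch of a row-parallel linear layer, assuming the weight matrix of one layer is split row-wise across two "GPUs" (here just NumPy slices). The input is split along the same boundary, each device computes a partial product, and the partials are summed (an all-reduce) to recover the full output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))   # batch of 4, feature size 6
W = rng.standard_normal((6, 8))   # full weight matrix of one layer

# row-wise shards: "GPU 0" holds rows 0..2 of W, "GPU 1" holds rows 3..5
W0, W1 = W[:3, :], W[3:, :]
x0, x1 = x[:, :3], x[:, 3:]       # input split along the same boundary
y = x0 @ W0 + x1 @ W1             # partial products summed (all-reduce)

print(np.allclose(y, x @ W))      # → True
```

A column-wise split works symmetrically: each GPU sees the whole input but produces a slice of the output, which is then concatenated instead of summed.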