Friday, February 7, 2025

Training with multiple GPUs

Model parallelism builds a pipeline of GPUs: each GPU holds a chunk of the model's layers, and activations flow from one GPU to the next. It is used when the model is too big to fit in a single GPU's memory. Efficiency is hurt by "bubbles" in the pipeline when data passes through the GPUs at different speeds, leaving some GPUs idle; splitting each batch into micro-batches helps keep the pipeline full.
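A minimal NumPy sketch of the idea, simulating two GPUs with plain functions (all names here are illustrative, not a real multi-GPU setup): each "GPU" owns half the layers, and the batch is cut into micro-batches so the stages could overlap in a real pipeline.

```python
import numpy as np

# Two pipeline stages, each standing in for one GPU that owns part of
# the layers. With micro-batches, while "GPU 1" works on micro-batch i,
# "GPU 0" could already start micro-batch i+1, shrinking the bubble.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 4))   # layer held by simulated GPU 0
w1 = rng.normal(size=(4, 2))   # layer held by simulated GPU 1

stage0 = lambda x: np.maximum(x @ w0, 0.0)   # GPU 0: linear + ReLU
stage1 = lambda h: h @ w1                    # GPU 1: output layer

batch = rng.normal(size=(8, 4))
micro_batches = np.split(batch, 4)           # 4 micro-batches of 2 rows

# Shown sequentially here; a real pipeline overlaps the two stages.
outputs = [stage1(stage0(mb)) for mb in micro_batches]
result = np.concatenate(outputs)

# Micro-batching changes the schedule, not the math:
assert np.allclose(result, stage1(stage0(batch)))
```

Because each stage operates row-wise, running micro-batches through the pipeline gives the same result as one big forward pass; only the scheduling differs.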

Data parallelism is used when the model fits on a single GPU. The data is divided into mini-batches, and each GPU runs its mini-batch through its own replica of the model. Gradients are computed locally, then combined (averaged) across GPUs to adjust the shared weights.

Tensor parallelism maps different parts of the model to multiple GPUs. It differs from model parallelism in that a portion of each layer, not the entire layer, is mapped to a GPU. The input (or intermediate activations) is split to feed the different GPUs according to how the layer was partitioned.
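A NumPy sketch of one common partitioning, a column-parallel linear layer, with two simulated GPUs (names are illustrative): the weight matrix is split along its output columns, each GPU computes a partial output, and the shards are concatenated.

```python
import numpy as np

# Column-parallel linear layer: split W across two "GPUs" along its
# output dimension. Each GPU multiplies the full input by its column
# shard; concatenating the partial outputs reproduces the full layer.
# (A row-parallel split would instead split the input and sum the
# partial results.)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6))            # every GPU sees the full input
w = rng.normal(size=(6, 8))

w_gpu0, w_gpu1 = np.split(w, 2, axis=1)   # columns 0-3 and 4-7
y_gpu0 = x @ w_gpu0                        # partial output on "GPU 0"
y_gpu1 = x @ w_gpu1                        # partial output on "GPU 1"

y = np.concatenate([y_gpu0, y_gpu1], axis=1)  # gather the shards
assert np.allclose(y, x @ w)               # identical to the full layer
```

Each GPU only ever stores its shard of `w`, which is the point: the layer's memory cost is divided across devices while the result stays exact.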
