
Distributed Deep Learning

Led by Huihuo Zheng, Corey Adams, and Zhen Xie from ALCF

This section of the workshop will introduce you to the methods we use to run distributed deep learning training on ALCF resources like Theta and ThetaGPU.

We demonstrate distributed training using three frameworks:

  1. Horovod (for TensorFlow and PyTorch),
  2. DistributedDataParallel (DDP) (for PyTorch only), and
  3. DeepSpeed.
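
All three frameworks share the same core data-parallel idea: each worker computes gradients on its own shard of the data, the gradients are averaged across workers (an "allreduce"), and every worker then applies the identical update. The sketch below illustrates that pattern in plain Python with hypothetical helper names (`local_gradient`, `allreduce_mean`, `train_step`); it is a conceptual stand-in, not the API of Horovod, DDP, or DeepSpeed.

```python
def local_gradient(weights, shard):
    # Gradient of mean squared error for a 1-D linear model y = w * x,
    # computed only on this worker's shard of the data.
    w = weights[0]
    n = len(shard)
    return [sum(2 * (w * x - y) * x for x, y in shard) / n]

def allreduce_mean(grads_per_worker):
    # Average the gradients from every worker -- the role played by the
    # allreduce collective in Horovod, DDP, and DeepSpeed.
    n_workers = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n_workers
            for i in range(len(grads_per_worker[0]))]

def train_step(weights, shards, lr=0.1):
    # Each "worker" computes a local gradient, the gradients are averaged,
    # and all workers apply the same synchronized update.
    grads = [local_gradient(weights, s) for s in shards]
    avg = allreduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, data generated from y = 3x: training moves w toward 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(200):
    weights = train_step(weights, shards)
print(round(weights[0], 3))  # -> 3.0
```

Because every worker sees the same averaged gradient, the model replicas stay in lockstep; the real frameworks differ mainly in how the allreduce is implemented and overlapped with computation.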