Optimizing ML Workflows on an AI Cluster
Bala Desinghu and Ella Batty
Join us for an interactive workshop, part of the Workshops @ Kempner series, to learn how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using TorchVision models (e.g., AlexNet, ResNet) trained on CIFAR-10 and ImageNet-1k as running examples, we'll walk through challenges and solutions at each stage of a machine learning pipeline, from environment setup and data management to experiment tracking, model training, evaluation, and deployment. A special focus will be placed on using Weights & Biases (W&B) for managing experiments and hyperparameter sweeps, as well as implementing checkpointing during training.
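As a taste of what the sweep portion covers, a W&B hyperparameter sweep is typically driven by a small YAML configuration. The sketch below is illustrative only; the metric name, parameter names, and value ranges are placeholder assumptions, not the workshop's actual configuration:

```yaml
# sweep.yaml -- hypothetical W&B sweep config (values are illustrative)
method: bayes            # search strategy: grid, random, or bayes
metric:
  name: val_loss         # assumed metric logged by the training script
  goal: minimize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [32, 64, 128]
```

A sweep defined this way is registered with `wandb sweep sweep.yaml` and executed by one or more `wandb agent` processes.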
What will attendees learn from this workshop?
- How to use Weights & Biases to log experiments, run hyperparameter sweeps, and compare models
- How to implement effective model checkpointing practices, including resuming from a checkpoint if a job fails
- Best practices for setting up Conda environments, managing and versioning data, and packaging a workflow for reproducibility
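The checkpoint-and-resume pattern in the second bullet can be sketched in a framework-agnostic way. The snippet below uses only the standard library (JSON in place of real model weights) to show the core idea: save state atomically each epoch, and on startup resume from the last saved epoch if a checkpoint exists. The file name and state contents are illustrative assumptions:

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # hypothetical checkpoint location

def save_checkpoint(epoch, state):
    """Write the checkpoint to a temp file, then atomically rename it,
    so a job killed mid-write never leaves a corrupt checkpoint."""
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    """Resume from the epoch after the last completed one, or start fresh."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            ckpt = json.load(f)
        return ckpt["epoch"] + 1, ckpt["state"]
    return 0, {"loss": None}

start_epoch, state = load_checkpoint()
for epoch in range(start_epoch, 5):
    state["loss"] = 1.0 / (epoch + 1)  # stand-in for a real training step
    save_checkpoint(epoch, state)
```

In a PyTorch job the same structure applies, with `torch.save`/`torch.load` storing the model and optimizer `state_dict`s instead of a JSON dictionary; rerunning the script after a failure picks up from the last completed epoch.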
Prerequisites:
- Familiarity with PyTorch
- Familiarity with HPC, including Slurm batch job submission
- Access to the FASRC cluster (Kempner-specific access is not required)
Who can attend this workshop?
Any Harvard-affiliated students, postdocs, and faculty, with priority given to Kempner community members.
Contact Information:
For any questions about the workshop, please contact kempnereducation@harvard.edu.