Optimizing ML Workflows on an AI Cluster
Bala Desinghu and Ella Batty
Join us for an interactive workshop, part of the Workshops @ Kempner series, to learn how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using TorchVision models (e.g., AlexNet, ResNet) trained on CIFAR-10 and ImageNet-1k as running examples, we'll walk through challenges and solutions at each stage of a machine learning pipeline, from environment setup and data management to experiment tracking, model training, evaluation, and deployment. A special focus will be placed on using Weights & Biases (W&B) for managing experiments and hyperparameter sweeps, as well as implementing checkpointing during training.
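As a taste of what the sweep portion covers, a W&B hyperparameter sweep is typically driven by a small YAML configuration. The sketch below is illustrative only; the metric name, parameter names, and value ranges are placeholder assumptions, not the workshop's actual configuration:

```yaml
# sweep.yaml -- hypothetical W&B sweep config (values are illustrative)
method: bayes            # search strategy: grid, random, or bayes
metric:
  name: val_loss         # assumed metric logged by the training script
  goal: minimize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [32, 64, 128]
```

A sweep defined this way is registered with `wandb sweep sweep.yaml` and executed by one or more `wandb agent` processes.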
What will attendees learn from this workshop?
- How to use Weights & Biases to log experiments, run hyperparameter sweeps, and compare models
- How to implement effective model checkpointing practices, including resuming from a checkpoint if a job fails
- Best practices for setting up Conda environments, managing and versioning data, and packaging a workflow for reproducibility
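The checkpoint-and-resume pattern in the second bullet can be sketched in a framework-agnostic way. The snippet below uses only the standard library (JSON in place of real model weights) to show the core idea: save state atomically each epoch, and on startup resume from the last saved epoch if a checkpoint exists. The file name and state contents are illustrative assumptions:

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # hypothetical checkpoint location

def save_checkpoint(epoch, state):
    """Write the checkpoint to a temp file, then atomically rename it,
    so a job killed mid-write never leaves a corrupt checkpoint."""
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    """Resume from the epoch after the last completed one, or start fresh."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            ckpt = json.load(f)
        return ckpt["epoch"] + 1, ckpt["state"]
    return 0, {"loss": None}

start_epoch, state = load_checkpoint()
for epoch in range(start_epoch, 5):
    state["loss"] = 1.0 / (epoch + 1)  # stand-in for a real training step
    save_checkpoint(epoch, state)
```

In a PyTorch job the same structure applies, with `torch.save`/`torch.load` storing the model and optimizer `state_dict`s instead of a JSON dictionary; rerunning the script after a failure picks up from the last completed epoch.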
Prerequisites:
- Familiarity with PyTorch
- Familiarity with HPC, including Slurm batch job submission
- Access to the FASRC cluster (Kempner-specific access is not required)
Who can attend this workshop?
Any Harvard-affiliated students, postdocs, and faculty, with priority given to Kempner community members.
Contact Information:
For any questions about the workshop, please contact kempnereducation@harvard.edu.