
Optimizing ML Workflows on an AI Cluster

Bala Desinghu and Ella Batty

Date: Tuesday, May 6, 2025
Time: 12:00 - 2:30pm
Location: Kempner Large Conference Room (SEC 6.242)

Join us for an interactive workshop, part of the Workshops @ Kempner series, to learn how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using TorchVision models (AlexNet, ResNet, etc.) trained on CIFAR-10 and ImageNet-1k as running examples, we’ll walk through challenges and solutions at each stage of a machine learning pipeline – from environment setup and data management to experiment tracking, model training, evaluation, and deployment. The workshop will focus in particular on using Weights & Biases (W&B) to manage experiments and hyperparameter sweeps, and on implementing checkpointing during training.

What will attendees learn from this workshop?

  • How to use Weights & Biases to log experiments, run hyperparameter sweeps, and compare models
  • How to implement effective model checkpointing practices, including resuming from a checkpoint if a job fails
  • Best practices for setting up Conda environments, managing and versioning data, and packaging a workflow for reproducibility

Prerequisites:

  • Familiarity with PyTorch
  • Familiarity with HPC, including Slurm batch job submission
  • Access to the FASRC cluster (Kempner-specific access is not necessary)

Who can attend this workshop?
Any Harvard-affiliated student, postdoc, or faculty member may attend, with priority given to Kempner community members.

Contact Information:
For any questions about the workshop, please contact kempnereducation@harvard.edu