Where’s My Data?: Finding, Tokenizing and Loading LLM datasets with TATM Workshop

Ella Batty & Tim Ngotiaoco

Date: Tuesday, April 8, 2025
Time: 12:00 – 2:30 pm
Location: Kempner Large Conference Room (SEC 6.242)

Presenters: Ella Batty and Timothy Ngotiaoco

Join us for an interactive Workshops @ Kempner session to learn about TATM (transformer-assistive testbed module), a tool developed by the Kempner Research & Engineering team. TATM is a Python library for working with text data that provides tools to assist in developing transformers and other model architectures. It serves as an interface for accessing and manipulating data on HPC clusters, streamlining dataset loading and processing – particularly for LLM datasets – while integrating into existing training workflows. In this workshop, attendees will learn what TATM is, why we built it, and the advantages it offers, and will work through a hands-on example.

Who can attend this workshop?

  • Any Harvard-affiliated students, postdocs, and faculty, with priority given to Kempner community members. Some components of the workshop require Kempner cluster access.

What will attendees be able to do after this workshop?

  • Describe what the TATM library does
  • Brainstorm how to use it in their projects
  • Use TATM to load data and train a simple LLM

Prerequisites:

  • Access to the FASRC cluster
  • Familiarity with LLMs and tokenization
  • Comfort working with the command line and Python

Contact Information:
For any questions about the workshop, please contact kempnereducation@harvard.edu.