Where’s My Data?: Finding, Tokenizing, and Loading LLM Datasets with TATM Workshop
Ella Batty & Tim Ngotiaoco
Date: Tuesday, April 8
Time: 12 – 2:30 pm
Location: Kempner Large Conference Room (SEC 6.242)
Presenters: Ella Batty and Timothy Ngotiaoco
Join us for an interactive Workshops @ Kempner session on TATM (transformer-assistive testbed module), a tool developed by the Kempner Research & Engineering team. TATM is a Python library for working with text data that provides tools to assist in developing transformer and other model architectures. It serves as an interface for accessing and manipulating data on HPC clusters, streamlining dataset loading and processing (particularly for LLM datasets) while integrating into existing training workflows. In this workshop, attendees will learn what TATM is, why we built it, and the advantages it offers, and will work through a hands-on example.
Who can attend this workshop?
- Any Harvard-affiliated students, postdocs, and faculty, with priority given to Kempner community members. Some components of the workshop require Kempner cluster access.
What will attendees be able to do after this workshop?
- Describe what the TATM library does
- Brainstorm how to use it in their projects
- Use TATM to load data and train a simple LLM
Prerequisites:
- Access to the FASRC cluster
- Familiarity with LLMs and tokenization
- Comfort working with the command line and Python
Contact Information:
For any questions about the workshop, please contact kempnereducation@harvard.edu.