Model Architecture Design for Modern Hardware
Tri Dao


Join us for a talk by Tri Dao, Assistant Professor of Computer Science at Princeton University and Chief Scientist at Together AI. This talk is part of the Kempner Seminar Series, a research-level seminar series on recent advances in the field.
Thanks to test-time compute, inference efficiency now drives progress in AI, demanding a greater emphasis on inference-aware architectures. Transformers are slow and memory-hungry on long sequences, since the time and memory cost of self-attention grows quadratically with sequence length. We describe recent progress on architectures such as structured state space models (SSMs) and attention variants that speed up inference. We identify three main ingredients for strong sub-quadratic architectures: large state size, expressive state updates, and efficient hardware-aware algorithms. We then discuss how to design attention variants to make optimal use of the memory and compute subsystems of modern accelerators. Finally, we describe how to combine these two classes of models to improve the efficiency of LLM inference, distillation, and test-time compute.
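As background for the abstract's contrast between quadratic self-attention and sub-quadratic recurrent models, the sketch below (not from the talk; all function names, shapes, and parameter values are illustrative assumptions) shows why attention's cost scales with the square of sequence length while a fixed-state SSM-style recurrence scales linearly.

    # Minimal sketch: quadratic softmax attention vs. a linear-time
    # state-space recurrence. Names and shapes are illustrative only.
    import numpy as np

    def attention(q, k, v):
        # Standard softmax attention: the (L, L) score matrix makes time and
        # memory scale quadratically with sequence length L.
        scores = q @ k.T / np.sqrt(q.shape[-1])              # (L, L)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        return weights @ v                                    # (L, d)

    def ssm_scan(x, A, B, C):
        # Simple linear state-space recurrence: h_t = A h_{t-1} + B x_t,
        # y_t = C h_t. Cost is linear in L, with a fixed-size state h.
        L, d = x.shape
        n = A.shape[0]                                        # state size
        h = np.zeros(n)
        y = np.empty((L, d))
        for t in range(L):
            h = A @ h + B @ x[t]                              # state update
            y[t] = C @ h
        return y

    L, d, n = 256, 16, 32
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
    A = 0.9 * np.eye(n)
    B = rng.standard_normal((n, d)) / n
    C = rng.standard_normal((d, n)) / n
    print(attention(q, k, v).shape, ssm_scan(v, A, B, C).shape)  # (256, 16) (256, 16)

Doubling L quadruples the work in attention but only doubles it in the recurrent scan; the talk's three ingredients (large state size, expressive state updates, hardware-aware algorithms) concern how to make such recurrences competitive in quality and fast on accelerators.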
Tri Dao is an Assistant Professor at Princeton University and Chief Scientist of Together AI. He completed his PhD in Computer Science at Stanford. He works at the intersection of machine learning and systems, and his research interests include hardware-aware algorithms and sequence models with long-range memory. His work has received the COLM 2024 Outstanding Paper Award and the ICML 2022 Outstanding Paper Runner-Up Award.