About
This graduate seminar surveys the landscape of training data for AI, focusing on contemporary
learning problems such as training language models and multimodal models.
Students will learn about important datasets and common data processing methods including filtering and deduplication.
Further topics include tools for large-scale data, data attribution, and synthetic data.
The course also covers ethical and legal aspects of training data such as copyright and privacy.
Instructor: Ludwig Schmidt
Schedule: Monday, Wednesday 1:30 PM - 2:50 PM in Shriram 104
Office hours: Wednesday 3:00 - 3:30 PM in Gates 341 (Ludwig's office)
TA: Etash Guha
TA Office hours: Tuesday 4:00 - 5:00 PM in Gates 314
Prerequisites: Familiarity with graduate-level machine learning (e.g., CS 229), including supervised learning, deep learning, transformers, and language models.
Communication: Course communication happens on the Stanford Slack. Enrolled students can join the course workspace by clicking the Slack link in the Canvas navigation bar and selecting "Join Workspace".
Lecture Details
Lecture 1: Introduction, course logistics, ImageNet (Jan 5) Slides
We take a close look at the ImageNet dataset because of its central role in the recent development of AI.
Studying generalization in the context of ImageNet will also illustrate how data is key for training reliable models.
Specifically, we look at:
- How and why [Fei-Fei Li](https://profiles.stanford.edu/fei-fei-li) and her team built ImageNet.
- The AlexNet breakthrough on ImageNet and its impact on computer vision and AI.
- What we can learn from benchmarks like ImageNet, in particular whether they are reliable indicators of progress or suffer from overfitting.
## Papers
The original ImageNet papers:
- [ImageNet: A Large-Scale Hierarchical Image Database](https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf)
- [ImageNet Large Scale Visual Recognition Challenge](https://arxiv.org/abs/1409.0575)
The AlexNet paper, which dramatically improved the state of the art on ImageNet and led to a resurgence of deep learning:
- [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)
The ImageNetV2 reproduction of ImageNet to test for overfitting:
- [Do ImageNet Classifiers Generalize to ImageNet?](https://arxiv.org/abs/1902.10811)
Lecture 2: Reliable generalization from training data (Jan 7) Slides
We continue our discussion of the ImageNetV2 experiment mentioned in Lecture 1.
We find that the accuracy drop from ImageNet to ImageNetV2 is due to distribution shift, not overfitting from test set re-use.
Other distribution shift benchmarks such as ObjectNet, ImageNet-Sketch, and ImageNet-R also showed that ImageNet models struggled to generalize reliably under distribution shift.
The CLIP model from OpenAI was a big step forward in terms of reliable generalization on ImageNet and related benchmarks.
The cause of CLIP's robustness is its improved training data.
In contrast, earlier approaches based on modifying model architecture or training algorithm failed to achieve meaningful robustness gains on the same benchmarks where CLIP succeeded.
This highlights the key role data plays in enabling reliable generalization.
The above results focus on ImageNet as a "model organism".
Later work has shown that similar phenomena often also hold for other datasets and learning problems.
## Papers
The large-scale robustness investigation centered on ImageNet comes from the following paper:
- [Measuring Robustness to Natural Distribution Shifts in Image Classification](https://arxiv.org/abs/2007.00644)
CLIP made substantial progress in the aforementioned framework for evaluating reliable generalization:
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
Training data is key to CLIP's reliable generalization behavior:
- [Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)](https://arxiv.org/abs/2205.01397)
Lecture 3: From Highleyman to LAION-5B (Jan 12) Slides
We continue the series of experiments to pinpoint the cause of CLIP's robustness gains.
This finishes our tour through the ImageNet line of work which highlighted the importance of data both for guiding progress in machine learning and training models that generalize reliably.
Next we go through a brief history of training data for AI/ML from the 1950s (the Perceptron and Highleyman's data) to the present.
Finally we start discussing contemporary training sets for multimodal models with the LAION-5B paper.
## Papers / book chapters
As in Lecture 2, the experiments to determine the cause of CLIP's robustness are from the following paper:
- [Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)](https://arxiv.org/abs/2205.01397)
The history of datasets is based on [Chapter 8](https://mlstory.org/data.html) from the book "Patterns, Predictions, and Actions" by Moritz Hardt and Ben Recht.
The LAION-5B dataset is described in the corresponding paper:
- [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/abs/2210.08402)
Lecture 4: Software tools for large-scale data (Jan 14) Slides · Guest lecture by Tony Wang
Tony covered data processing fundamentals for large-scale machine learning workflows, including [Parquet](https://parquet.apache.org/) files, [S3](https://aws.amazon.com/s3/), [Spark](https://spark.apache.org/), [Iceberg](https://iceberg.apache.org/), and [Pola.rs](https://pola.rs/), and how these tools fit together in modern data pipelines.
## Resources
- [Notes on setting up an AWS EMR cluster](https://drive.google.com/file/d/1Da-rmbWS8G0mbNar8hpYxQmj_JehDdOc/view?usp=sharing)
Lecture 5: Contemporary image datasets (Jan 21) Ludwig + Students
Ludwig started the lecture by covering [DataComp](https://arxiv.org/abs/2304.14108), a benchmark for studying CLIP training data ([slides](https://drive.google.com/file/d/13waW-ib6wFniywsqhM6oOUYOT7Ch1YPB/view?usp=sharing)). After that we had our first round of student paper presentations.
## Student presentations
**CLIP training data** (Ayush Agrawal, [slides](https://drive.google.com/file/d/1wN0aESd-JGmbaKjoxzMPo_0KL3Ouk_bY/view?usp=drive_link)):
- [MetaCLIP: Demystifying CLIP Data](https://arxiv.org/abs/2309.16671)
**Interleaved training data** (James Cheng, [slides](https://drive.google.com/file/d/1RlCV91Vh7Imav7kv8kexh3Thy41pwbjC/view?usp=drive_link)):
- [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://arxiv.org/abs/2306.16527)
- [MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training](https://arxiv.org/abs/2403.09611)
**Some good methods on DataComp** (Shlok Natarajan, [slides](https://drive.google.com/file/d/1XzwZrjQjpHsen_w4ifWPoWqGP9ri48Zd/view?usp=drive_link)):
- [FLYT: Fast Filtering for High-Quality Web-Scale Image-Text Datasets](https://arxiv.org/abs/2503.08805)
- [T-MARS: Improving Visual Representations by Circumventing Text Feature Learning](https://arxiv.org/abs/2307.03132)
Lecture 6: Data for LMs: pre-training (Jan 26) Slides · Guest lecture by Vaishaal Shankar
Vaishaal presented a deep dive into DCLM (DataComp for Language Models), a benchmark and open dataset for studying language model pre-training data. The lecture covered:
- The data problem: why LMs need vast amounts of training data and scaling laws
- Dataset evaluation methodology: controlled experiments across scales (412M to 7B parameters)
- The DCLM-Baseline pipeline: text extraction, heuristic filtering, deduplication, model-based filtering, decontamination
- Defining data quality: human judgment experiments (71% annotator agreement) and the surprising finding that methods disagreeing with humans can produce better downstream performance
- Results: DCLM-Baseline achieves 64% MMLU with 7B parameters using 6.6x less compute than comparable models
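The filtering and deduplication stages above can be illustrated with a toy sketch. This is not the DCLM pipeline itself: the heuristics and thresholds below are invented stand-ins for the RefinedWeb/DCLM-style rules, and real pipelines add fuzzy deduplication (e.g. MinHash) on top of exact hashing.

```python
import hashlib

def heuristic_filter(doc: str) -> bool:
    """Toy heuristics: drop very short documents and documents
    dominated by non-alphabetic characters."""
    if len(doc.split()) < 5:
        return False
    alpha_fraction = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_fraction > 0.6

def exact_dedup(docs):
    """Exact deduplication via content hashes."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate
    "1234 5678 9012 3456 7890 1234 5678",            # mostly non-alphabetic
    "ok",                                            # too short
    "Large language models are trained on web-scale text corpora.",
]
cleaned = [d for d in exact_dedup(corpus) if heuristic_filter(d)]
print(len(cleaned))  # 2
```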
## Papers
- [DataComp-LM: In search of the next generation of training sets for language models](https://arxiv.org/abs/2406.11794)
- [RefinedWeb: The Falcon RefinedWeb Dataset for English Massive Web-Only Pretraining](https://arxiv.org/abs/2306.01116)
- [Training Compute-Optimal Large Language Models (Chinchilla)](https://arxiv.org/abs/2203.15556)
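The Chinchilla paper's compute-optimal result is often summarized by a rough rule of thumb of about 20 training tokens per parameter; the helper below is just that rule of thumb, not the paper's fitted scaling law.

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Rough Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * params

# A 7B-parameter model would want roughly 140B training tokens.
print(f"{chinchilla_optimal_tokens(7e9):.1e}")  # 1.4e+11
```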
Lecture 7: Data for LMs: pre-training - Student Presentations (Jan 28) Students + Ludwig
This lecture featured student presentations on LLM pre-training data, followed by an overview of training data used in open-weight language models.
## Student presentations
**FineWeb & Nemotron-CC** (Animesh Jha & Nick Jiang, [slides](https://drive.google.com/file/d/1WIxHhWKEkFPsO33f23G4RpP22AzCPoRR/view?usp=drive_link)):
- [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)
- [Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset](https://arxiv.org/abs/2412.02595)
**Scaling data-constrained language models** (Niklas Muennighoff, [slides](https://drive.google.com/file/d/1mj6M1qTXiwCiT0t1HIgBQHF0brSyO8iy/view?usp=drive_link)):
- [Scaling Data-Constrained Language Models](https://arxiv.org/abs/2305.16264)
**Scaling laws for data filtering** (Rohan Sinha, [slides](https://drive.google.com/file/d/1nIsE4tRstQ1Q4_6JWtU6NbVG8vn0eXmp/view?usp=drive_link)):
- [Scaling Laws for Data Filtering — Data Curation cannot be Compute Agnostic](https://arxiv.org/abs/2404.07177)
## Instructor overview
Ludwig gave an overview of what's known about training data for open-weight / open-data language models ([slides](https://drive.google.com/file/d/10fseBAijnhyNM68eXImE3Y8k0St0Xlg5/view?usp=drive_link)), covering:
- OLMo [1](https://arxiv.org/abs/2402.01032), [2](https://arxiv.org/abs/2411.15466), [3](https://arxiv.org/abs/2501.14734)
- [Marin](https://marin.readthedocs.io/en/latest/reports/marin-8b-retro/)
- Llama [1](https://arxiv.org/abs/2302.13971), [2](https://arxiv.org/abs/2307.09288), [3](https://arxiv.org/abs/2407.21783)
- Qwen [1](https://arxiv.org/abs/2309.16609), [2](https://arxiv.org/abs/2407.10671), [2.5](https://arxiv.org/abs/2412.15115), [3](https://arxiv.org/abs/2505.09388)
- DeepSeek [V1](https://arxiv.org/abs/2401.02954), [V2](https://arxiv.org/abs/2405.04434), [V3](https://arxiv.org/abs/2412.19437), [R1](https://arxiv.org/abs/2501.12948)
- Gemma [1](https://arxiv.org/abs/2403.08295), [2](https://arxiv.org/abs/2408.00118), [3](https://arxiv.org/abs/2503.19786)
- GLM [130B](https://arxiv.org/abs/2210.02414), [4](https://arxiv.org/abs/2406.12793)
- [MiniMax-01](https://arxiv.org/abs/2501.08313)
- [MiMo-V2-Flash](https://arxiv.org/abs/2501.02780)
Lecture 8: Data for LMs: post-training (Feb 2) Slides · Etash Guha
Etash surveyed the landscape of post-training data for language models. The lecture covered:
- Instruction tuning: FLAN and the emergence of instruction-following capabilities
- Reinforcement Learning from Human Feedback (RLHF): InstructGPT and Constitutional AI
- Synthetic data generation: Self-Instruct, Alpaca, and UltraFeedback
- Reasoning and test-time compute: DeepSeek-R1, s1, and OpenThoughts
- Code and agentic data: SWE-Gym for software engineering tasks
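For a sense of what this data looks like on disk, instruction-tuning datasets in the Alpaca lineage store each training example as an instruction/input/output record. The example below is hypothetical but follows that schema.

```python
import json

# Hypothetical instruction-tuning example in the Alpaca-style
# instruction/input/output format, serialized as JSON.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Instruction tuning fine-tunes a pretrained language model on "
             "instruction-response pairs so that it learns to follow "
             "natural-language task descriptions.",
    "output": "Instruction tuning teaches a pretrained model to follow "
              "instructions by fine-tuning on instruction-response pairs.",
}
print(json.dumps(example, indent=2))
```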
## Papers
- [Finetuned Language Models Are Zero-Shot Learners (FLAN)](https://arxiv.org/abs/2109.01652)
- [Training language models to follow instructions with human feedback (InstructGPT)](https://arxiv.org/abs/2203.02155)
- [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073)
- [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)
- [Alpaca: A Strong, Replicable Instruction-Following Model](https://crfm.stanford.edu/2023/03/13/alpaca.html)
- [UltraFeedback: Boosting Language Models with High-quality Feedback](https://arxiv.org/abs/2310.01377)
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)
- [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393)
- [OpenThoughts: Data Recipes for Reasoning Models](https://arxiv.org/abs/2506.04178)
- [SWE-Gym: Training Software Engineering Agents and Verifiers with Open Source](https://arxiv.org/abs/2412.21139)
Lecture 9: Data for LMs: post-training - Student Presentations (Feb 4) Students
This lecture featured student presentations on post-training data for language models.
## Student presentations
**Reasoning models**
DeepSeek-R1 (Lillian Weng, [slides](https://drive.google.com/file/d/1LlcsCfWnevooSHOUNzt1g7u-5HHxmPPN/view?usp=drive_link)):
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://www.nature.com/articles/s41586-025-09422-z)
Llama-Nemotron (Arpandeep Khatua, [slides](https://drive.google.com/file/d/1vNGKm-ZZKqKM0lUTLhSwelfQAIL_Xsho/view?usp=drive_link)):
- [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
**Fully open post-training data**
Tulu 3 (Simran Nayak, [slides](https://drive.google.com/file/d/1-8j3b1bl3e-P2hJhMJbxDVhdX4ntLpZs/view?usp=drive_link)):
- [Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://arxiv.org/abs/2411.15124)
OLMo 3 (Anya Hansen, [slides](https://drive.google.com/file/d/1lgoJ24OoniapSKl082coIFOcfA8Sv5aB/view?usp=drive_link)):
- [OLMo 3: A truly open language model](https://arxiv.org/abs/2512.13961)
**Post-training data for software engineering**
SWE-Smith (John Yang, [slides](https://drive.google.com/file/d/1wfZ3j5MORAjAmuBCcLErMwSE2IPoUKHn/view?usp=drive_link)):
- [SWE-Smith: Scaling Data for Software Engineering Agents](https://arxiv.org/abs/2504.21798)
Lecture 10: Data attribution & influence functions (Feb 9) Slides 1, 2 · Guest lecture by Logan Engstrom & Juhan Bae
Logan and Juhan covered data attribution and influence functions: methods for understanding how individual training examples affect model predictions.
Logan gave an overview of data attribution, covering various formulations of the problem, applications including data selection and debugging, and limitations of current methods.
Juhan then went deeper on influence functions, focusing on their use in the context of language models, including scalability challenges and recent advances.
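Concretely, the classical influence function (from the Koh & Liang paper below) approximates how upweighting a training example $z$ changes the loss on a test example $z_{\text{test}}$, without retraining:

$$\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla_\theta^2 L(z_i, \hat\theta),$$

where $\hat\theta$ are the trained parameters and $H_{\hat\theta}$ is the empirical Hessian of the training loss. The scalability challenges discussed in the lecture stem largely from forming and inverting (or approximating) $H_{\hat\theta}$ at language-model scale.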
## Papers
- [Datamodels: Predicting Predictions from Training Data](https://arxiv.org/abs/2202.00622)
- [TRAK: Attributing Model Behavior at Scale](https://arxiv.org/abs/2303.14186)
- [Understanding Black-box Predictions via Influence Functions](https://arxiv.org/abs/1703.04730)
- [Studying Large Language Model Generalization with Influence Functions](https://arxiv.org/abs/2308.03296)
See also the [ML & Data Tutorial](https://ml-data-tutorial.org/).
Lecture 11: Training data for robotics (Feb 11) Slides PDF, Keynote (with videos) · Guest lecture by Karl Pertsch
Karl covered the landscape of training data for robotics, including data collection, scaling challenges, and cross-embodiment learning. Topics included:
- The data problem in robotics: limited scale compared to language and vision, high cost of real-world collection
- Cross-embodiment datasets: pooling data across different robots and environments
- Simulation data and sim-to-real transfer
- Foundation models for robotics and the role of internet-scale pre-training
## Papers
- [Open X-Embodiment: Robotic Learning Datasets and RT-X Models](https://arxiv.org/abs/2310.08864)
- [DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset](https://arxiv.org/abs/2403.12945)
- [π0.5: a Vision-Language-Action Model with Open-World Generalization](https://arxiv.org/abs/2504.16054)
Lecture 12: 3D data + student presentations (Feb 18) Slides 1, 2, 3, 4 · Matt Deitke + Students
Matt Deitke gave a guest lecture on training data for 3D, followed by student presentations on data attribution.
## Guest lecture
**Training data for 3D** (Matt Deitke, [slides](https://drive.google.com/file/d/1D9hHJGslTg6UfBygwypjZl-32-OZJZB5/view?usp=drive_link)):
- [ShapeNet: An Information-Rich 3D Model Repository](https://arxiv.org/abs/1512.03012)
- [Objaverse: A Universe of Annotated 3D Objects](https://arxiv.org/abs/2212.08051)
- [Zero-1-to-3: Zero-shot One Image to 3D Object](https://arxiv.org/abs/2303.11328)
- [Objaverse-XL: A Universe of 10M+ 3D Objects](https://arxiv.org/abs/2307.05663)
- [TRELLIS.2 (Native and Compact Structured Latents for 3D Generation)](https://arxiv.org/abs/2512.14692)
- [From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos](https://arxiv.org/abs/2412.07770)
## Student presentations
**Datamodels** (Simon Guo, [slides](https://drive.google.com/file/d/1WP4yxD6tqFYoyXEj11RmNxiPth_bHhad/view?usp=drive_link)):
- [Datamodels: Predicting Predictions from Training Data](https://arxiv.org/abs/2202.00622)
**TRAK** (Hugo Buurmeijer, [slides](https://drive.google.com/file/d/1lB4YHZxBCmS4w7UM61KXVcyWkHgILZUa/view?usp=drive_link)):
- [TRAK: Attributing Model Behavior at Scale](https://arxiv.org/abs/2303.14186)
**Metagradients** (James Cheng, [slides](https://drive.google.com/file/d/1qO46YDpFtnRIAKJQsW-O4T-j8-mHK2GF/view?usp=drive_link)):
- [Metagradients for Efficient Data Attribution](https://arxiv.org/abs/2503.13751)