About
This graduate seminar surveys the landscape of training data for AI, focusing on contemporary
learning problems such as training language models and multimodal models.
Students will learn about important datasets and common data processing methods including filtering and deduplication.
Further topics include tools for large-scale data, data attribution, and synthetic data.
The course also covers ethical and legal aspects of training data such as copyright and privacy.
Instructor: Ludwig Schmidt
Schedule: Monday, Wednesday 1:30 PM - 2:50 PM in Shriram 104
Office hours: Wednesday 3:00 - 3:30 PM in Gates 341 (Ludwig's office)
TA: Etash Guha
TA Office hours: Tuesday 4:00 - 5:00 PM in Gates 314
Prerequisites: Familiarity with graduate-level machine learning (e.g., CS 229), including supervised learning, deep learning, transformers, and language models.
Communication: Course communication happens on the Stanford Slack. Enrolled students can join the course workspace by clicking the Slack link in the Canvas navigation bar and selecting "Join Workspace".
Lecture Details
Lecture 1: Introduction, course logistics, ImageNet (Jan 5) Slides
We take a close look at the ImageNet dataset because of its central role in the recent development of AI.
Studying generalization in the context of ImageNet will also illustrate how data is key for training reliable models.
Specifically, we look at:
- How and why [Fei-Fei Li](https://profiles.stanford.edu/fei-fei-li) and her team built ImageNet.
- The AlexNet breakthrough on ImageNet and its impact on computer vision and AI.
- What we can learn from benchmarks like ImageNet, in particular whether they are reliable indicators of progress or suffer from overfitting.
## Papers
The original ImageNet papers:
- [ImageNet: A Large-Scale Hierarchical Image Database](https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf)
- [ImageNet Large Scale Visual Recognition Challenge](https://arxiv.org/abs/1409.0575)
The AlexNet paper, which dramatically improved the state of the art on ImageNet and led to a resurgence of deep learning:
- [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)
The ImageNetV2 reproduction of ImageNet to test for overfitting:
- [Do ImageNet Classifiers Generalize to ImageNet?](https://arxiv.org/abs/1902.10811)
Lecture 2: Reliable generalization from training data (Jan 7) Slides
We continue our discussion of the ImageNetV2 experiment mentioned in Lecture 1.
We find that the accuracy drop from ImageNet to ImageNetV2 is due to distribution shift, not overfitting from test set re-use.
Other distribution shift benchmarks such as ObjectNet, ImageNet-Sketch, and ImageNet-R also showed that ImageNet models struggled to generalize reliably under distribution shift.
The CLIP model from OpenAI was a big step forward in terms of reliable generalization on ImageNet and related benchmarks.
The cause of CLIP's robustness is its improved training data.
In contrast, earlier approaches based on modifying model architecture or training algorithm failed to achieve meaningful robustness gains on the same benchmarks where CLIP succeeded.
This highlights the key role data plays in enabling reliable generalization.
The above results focus on ImageNet as a "model organism".
Later work has shown that similar phenomena often also hold for other datasets and learning problems.
## Papers
The large-scale robustness investigation centered on ImageNet comes from the following paper:
- [Measuring Robustness to Natural Distribution Shifts in Image Classification](https://arxiv.org/abs/2007.00644)
CLIP made substantial progress in the aforementioned framework for evaluating reliable generalization:
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
Training data is key to CLIP's reliable generalization behavior:
- [Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)](https://arxiv.org/abs/2205.01397)
Lecture 3: From Highleyman to LAION-5B (Jan 12) Slides
We continue the series of experiments to pinpoint the cause of CLIP's robustness gains.
This finishes our tour through the ImageNet line of work which highlighted the importance of data both for guiding progress in machine learning and training models that generalize reliably.
Next we go through a brief history of training data for AI/ML from the 1950s (the Perceptron and Highleyman's data) to the present.
Finally we start discussing contemporary training sets for multimodal models with the LAION-5B paper.
## Papers / book chapters
As in Lecture 2, the experiments to determine the cause of CLIP's robustness are from the following paper:
- [Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)](https://arxiv.org/abs/2205.01397)
The history of datasets is based on [Chapter 8](https://mlstory.org/data.html) from the book "Patterns, Predictions, and Actions" by Moritz Hardt and Ben Recht.
The LAION-5B dataset is described in the corresponding paper:
- [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/abs/2210.08402)
Lecture 4: Software tools for large-scale data (Jan 14) Slides · Guest lecture by Tony Wang
Tony covered data processing fundamentals for large-scale machine learning workflows, including [Parquet](https://parquet.apache.org/) files, [S3](https://aws.amazon.com/s3/), [Spark](https://spark.apache.org/), [Iceberg](https://iceberg.apache.org/), and [Pola.rs](https://pola.rs/), and how these tools fit together in modern data pipelines.
## Resources
- [Notes on setting up an AWS EMR cluster](https://drive.google.com/file/d/1Da-rmbWS8G0mbNar8hpYxQmj_JehDdOc/view?usp=sharing)
Lecture 5: Contemporary image datasets (Jan 21) Ludwig + Students
Ludwig started the lecture by covering [DataComp](https://arxiv.org/abs/2304.14108), a benchmark for studying CLIP training data ([slides](https://drive.google.com/file/d/13waW-ib6wFniywsqhM6oOUYOT7Ch1YPB/view?usp=sharing)). After that we had our first round of student paper presentations.
## Student presentations
**CLIP training data** (Ayush Agrawal, [slides](https://drive.google.com/file/d/1wN0aESd-JGmbaKjoxzMPo_0KL3Ouk_bY/view?usp=drive_link)):
- [MetaCLIP: Demystifying CLIP Data](https://arxiv.org/abs/2309.16671)
**Interleaved training data** (James Cheng, [slides](https://drive.google.com/file/d/1RlCV91Vh7Imav7kv8kexh3Thy41pwbjC/view?usp=drive_link)):
- [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://arxiv.org/abs/2306.16527)
- [MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training](https://arxiv.org/abs/2403.09611)
**Some good methods on DataComp** (Shlok Natarajan, [slides](https://drive.google.com/file/d/1XzwZrjQjpHsen_w4ifWPoWqGP9ri48Zd/view?usp=drive_link)):
- [FLYT: Fast Filtering for High-Quality Web-Scale Image-Text Datasets](https://arxiv.org/abs/2503.08805)
- [T-MARS: Improving Visual Representations by Circumventing Text Feature Learning](https://arxiv.org/abs/2307.03132)
Lecture 6: Data for LMs: pre-training (Jan 26) Slides · Guest lecture by Vaishaal Shankar
Vaishaal presented a deep dive into DCLM (DataComp for Language Models), a benchmark and open dataset for studying language model pre-training data. The lecture covered:
- The data problem: why LMs need vast amounts of training data and scaling laws
- Dataset evaluation methodology: controlled experiments across scales (412M to 7B parameters)
- The DCLM-Baseline pipeline: text extraction, heuristic filtering, deduplication, model-based filtering, decontamination
- Defining data quality: human judgment experiments (71% annotator agreement) and the surprising finding that methods disagreeing with humans can produce better downstream performance
- Results: DCLM-Baseline achieves 64% MMLU with 7B parameters using 6.6x less compute than comparable models
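The filtering and deduplication stages above can be illustrated with a toy sketch. This is not the DCLM pipeline itself: the heuristics and thresholds below are invented stand-ins for the RefinedWeb/DCLM-style rules, and real pipelines add fuzzy deduplication (e.g. MinHash) on top of exact hashing.

```python
import hashlib

def heuristic_filter(doc: str) -> bool:
    """Toy heuristics: drop very short documents and documents
    dominated by non-alphabetic characters."""
    if len(doc.split()) < 5:
        return False
    alpha_fraction = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_fraction > 0.6

def exact_dedup(docs):
    """Exact deduplication via content hashes."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate
    "1234 5678 9012 3456 7890 1234 5678",            # mostly non-alphabetic
    "ok",                                            # too short
    "Large language models are trained on web-scale text corpora.",
]
cleaned = [d for d in exact_dedup(corpus) if heuristic_filter(d)]
print(len(cleaned))  # 2
```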
## Papers
- [DataComp-LM: In search of the next generation of training sets for language models](https://arxiv.org/abs/2406.11794)
- [RefinedWeb: The Falcon RefinedWeb Dataset for English Massive Web-Only Pretraining](https://arxiv.org/abs/2306.01116)
- [Training Compute-Optimal Large Language Models (Chinchilla)](https://arxiv.org/abs/2203.15556)
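The Chinchilla paper's compute-optimal result is often summarized by a rough rule of thumb of about 20 training tokens per parameter; the helper below is just that rule of thumb, not the paper's fitted scaling law.

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Rough Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * params

# A 7B-parameter model would want roughly 140B training tokens.
print(f"{chinchilla_optimal_tokens(7e9):.1e}")  # 1.4e+11
```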
Lecture 7: Data for LMs: pre-training - Student Presentations (Jan 28) Students + Ludwig
This lecture featured student presentations on LLM pre-training data, followed by an overview of training data used in open-weight language models.
## Student presentations
**FineWeb & Nemotron-CC** (Animesh Jha & Nick Jiang, [slides](https://drive.google.com/file/d/1WIxHhWKEkFPsO33f23G4RpP22AzCPoRR/view?usp=drive_link)):
- [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)
- [Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset](https://arxiv.org/abs/2412.02595)
**Scaling data-constrained language models** (Niklas Muennighoff, [slides](https://drive.google.com/file/d/1mj6M1qTXiwCiT0t1HIgBQHF0brSyO8iy/view?usp=drive_link)):
- [Scaling Data-Constrained Language Models](https://arxiv.org/abs/2305.16264)
**Scaling laws for data filtering** (Rohan Sinha, [slides](https://drive.google.com/file/d/1nIsE4tRstQ1Q4_6JWtU6NbVG8vn0eXmp/view?usp=drive_link)):
- [Scaling Laws for Data Filtering — Data Curation cannot be Compute Agnostic](https://arxiv.org/abs/2404.07177)
## Instructor overview
Ludwig gave an overview of what's known about training data for open-weight / open-data language models ([slides](https://drive.google.com/file/d/10fseBAijnhyNM68eXImE3Y8k0St0Xlg5/view?usp=drive_link)), covering:
- OLMo [1](https://arxiv.org/abs/2402.01032), [2](https://arxiv.org/abs/2411.15466), [3](https://arxiv.org/abs/2501.14734)
- [Marin](https://marin.readthedocs.io/en/latest/reports/marin-8b-retro/)
- Llama [1](https://arxiv.org/abs/2302.13971), [2](https://arxiv.org/abs/2307.09288), [3](https://arxiv.org/abs/2407.21783)
- Qwen [1](https://arxiv.org/abs/2309.16609), [2](https://arxiv.org/abs/2407.10671), [2.5](https://arxiv.org/abs/2412.15115), [3](https://arxiv.org/abs/2505.09388)
- DeepSeek [V1](https://arxiv.org/abs/2401.02954), [V2](https://arxiv.org/abs/2405.04434), [V3](https://arxiv.org/abs/2412.19437), [R1](https://arxiv.org/abs/2501.12948)
- Gemma [1](https://arxiv.org/abs/2403.08295), [2](https://arxiv.org/abs/2408.00118), [3](https://arxiv.org/abs/2503.19786)
- GLM [130B](https://arxiv.org/abs/2210.02414), [4](https://arxiv.org/abs/2406.12793)
- [MiniMax-01](https://arxiv.org/abs/2501.08313)
- [MiMo-V2-Flash](https://arxiv.org/abs/2501.02780)
Lecture 8: Data for LMs: post-training (Feb 2) Slides · Etash Guha
Etash surveyed the landscape of post-training data for language models. The lecture covered:
- Instruction tuning: FLAN and the emergence of instruction-following capabilities
- Reinforcement Learning from Human Feedback (RLHF): InstructGPT and Constitutional AI
- Synthetic data generation: Self-Instruct, Alpaca, and UltraFeedback
- Reasoning and test-time compute: DeepSeek-R1, s1, and OpenThoughts
- Code and agentic data: SWE-Gym for software engineering tasks
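For a sense of what this data looks like on disk, instruction-tuning datasets in the Alpaca lineage store each training example as an instruction/input/output record. The example below is hypothetical but follows that schema.

```python
import json

# Hypothetical instruction-tuning example in the Alpaca-style
# instruction/input/output format, serialized as JSON.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Instruction tuning fine-tunes a pretrained language model on "
             "instruction-response pairs so that it learns to follow "
             "natural-language task descriptions.",
    "output": "Instruction tuning teaches a pretrained model to follow "
              "instructions by fine-tuning on instruction-response pairs.",
}
print(json.dumps(example, indent=2))
```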
## Papers
- [Finetuned Language Models Are Zero-Shot Learners (FLAN)](https://arxiv.org/abs/2109.01652)
- [Training language models to follow instructions with human feedback (InstructGPT)](https://arxiv.org/abs/2203.02155)
- [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073)
- [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)
- [Alpaca: A Strong, Replicable Instruction-Following Model](https://crfm.stanford.edu/2023/03/13/alpaca.html)
- [UltraFeedback: Boosting Language Models with High-quality Feedback](https://arxiv.org/abs/2310.01377)
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)
- [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393)
- [OpenThoughts: Data Recipes for Reasoning Models](https://arxiv.org/abs/2506.04178)
- [SWE-Gym: Training Software Engineering Agents and Verifiers with Open Source](https://arxiv.org/abs/2412.21139)
Lecture 9: Data for LMs: post-training - Student Presentations (Feb 4) Students
This lecture featured student presentations on post-training data for language models.
## Student presentations
**Reasoning models**
DeepSeek-R1 (Lillian Weng, [slides](https://drive.google.com/file/d/1LlcsCfWnevooSHOUNzt1g7u-5HHxmPPN/view?usp=drive_link)):
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://www.nature.com/articles/s41586-025-09422-z)
Llama-Nemotron (Arpandeep Khatua, [slides](https://drive.google.com/file/d/1vNGKm-ZZKqKM0lUTLhSwelfQAIL_Xsho/view?usp=drive_link)):
- [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
**Fully open post-training data**
Tulu 3 (Simran Nayak, [slides](https://drive.google.com/file/d/1-8j3b1bl3e-P2hJhMJbxDVhdX4ntLpZs/view?usp=drive_link)):
- [Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://arxiv.org/abs/2411.15124)
OLMo 3 (Anya Hansen, [slides](https://drive.google.com/file/d/1lgoJ24OoniapSKl082coIFOcfA8Sv5aB/view?usp=drive_link)):
- [OLMo 3: A truly open language model](https://arxiv.org/abs/2512.13961)
**Post-training data for software engineering**
SWE-Smith (John Yang, [slides](https://drive.google.com/file/d/1wfZ3j5MORAjAmuBCcLErMwSE2IPoUKHn/view?usp=drive_link)):
- [SWE-Smith: Scaling Data for Software Engineering Agents](https://arxiv.org/abs/2504.21798)
Lecture 10: Data attribution & influence functions (Feb 9) Slides 1, 2 · Guest lecture by Logan Engstrom & Juhan Bae
Logan and Juhan covered data attribution and influence functions: methods for understanding how individual training examples affect model predictions.
Logan gave an overview of data attribution, covering various formulations of the problem, applications including data selection and debugging, and limitations of current methods.
Juhan then went deeper on influence functions, focusing on their use in the context of language models, including scalability challenges and recent advances.
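Concretely, the classical influence function (from the Koh & Liang paper below) approximates how upweighting a training example $z$ changes the loss on a test example $z_{\text{test}}$, without retraining:

$$\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla_\theta^2 L(z_i, \hat\theta),$$

where $\hat\theta$ are the trained parameters and $H_{\hat\theta}$ is the empirical Hessian of the training loss. The scalability challenges discussed in the lecture stem largely from forming and inverting (or approximating) $H_{\hat\theta}$ at language-model scale.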
## Papers
- [Datamodels: Predicting Predictions from Training Data](https://arxiv.org/abs/2202.00622)
- [TRAK: Attributing Model Behavior at Scale](https://arxiv.org/abs/2303.14186)
- [Understanding Black-box Predictions via Influence Functions](https://arxiv.org/abs/1703.04730)
- [Studying Large Language Model Generalization with Influence Functions](https://arxiv.org/abs/2308.03296)
See also the [ML & Data Tutorial](https://ml-data-tutorial.org/).
Lecture 11: Training data for robotics (Feb 11) Slides PDF, Keynote (with videos) · Guest lecture by Karl Pertsch
Karl covered the landscape of training data for robotics, including data collection, scaling challenges, and cross-embodiment learning. Topics included:
- The data problem in robotics: limited scale compared to language and vision, high cost of real-world collection
- Cross-embodiment datasets: pooling data across different robots and environments
- Simulation data and sim-to-real transfer
- Foundation models for robotics and the role of internet-scale pre-training
## Papers
- [Open X-Embodiment: Robotic Learning Datasets and RT-X Models](https://arxiv.org/abs/2310.08864)
- [DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset](https://arxiv.org/abs/2403.12945)
- [π0.5: a Vision-Language-Action Model with Open-World Generalization](https://arxiv.org/abs/2504.16054)
Lecture 12: 3D data + student presentations (Feb 18) Slides 1, 2, 3, 4 · Matt Deitke + Students
Matt Deitke gave a guest lecture on training data for 3D, followed by student presentations on data attribution.
## Guest lecture
**Training data for 3D** (Matt Deitke, [slides](https://drive.google.com/file/d/1D9hHJGslTg6UfBygwypjZl-32-OZJZB5/view?usp=drive_link)):
- [ShapeNet: An Information-Rich 3D Model Repository](https://arxiv.org/abs/1512.03012)
- [Objaverse: A Universe of Annotated 3D Objects](https://arxiv.org/abs/2212.08051)
- [Zero-1-to-3: Zero-shot One Image to 3D Object](https://arxiv.org/abs/2303.11328)
- [Objaverse-XL: A Universe of 10M+ 3D Objects](https://arxiv.org/abs/2307.05663)
- [TRELLIS.2 (Native and Compact Structured Latents for 3D Generation)](https://arxiv.org/abs/2512.14692)
- [From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos](https://arxiv.org/abs/2412.07770)
## Student presentations
**Datamodels** (Simon Guo, [slides](https://drive.google.com/file/d/1WP4yxD6tqFYoyXEj11RmNxiPth_bHhad/view?usp=drive_link)):
- [Datamodels: Predicting Predictions from Training Data](https://arxiv.org/abs/2202.00622)
**TRAK** (Hugo Buurmeijer, [slides](https://drive.google.com/file/d/1lB4YHZxBCmS4w7UM61KXVcyWkHgILZUa/view?usp=drive_link)):
- [TRAK: Attributing Model Behavior at Scale](https://arxiv.org/abs/2303.14186)
**Metagradients** (James Cheng, [slides](https://drive.google.com/file/d/1qO46YDpFtnRIAKJQsW-O4T-j8-mHK2GF/view?usp=drive_link)):
- [Metagradients for Efficient Data Attribution](https://arxiv.org/abs/2503.13751)