Title: Value of Pretraining Data: Scaling Laws for Downstream Task Performance of Large Language Models
Abstract: This talk explores the challenges and open questions surrounding the value of pretraining data for large language models (LLMs) in transfer learning settings. While scaling laws have provided valuable insights for LLM design, existing work has predominantly focused on pretraining loss. In contrast, this work investigates scaling behavior in a transfer learning setting where LLMs are finetuned for downstream tasks. Specifically, we examine how the choice and size of the pretraining data impact downstream performance, as measured by downstream cross-entropy and translation quality metrics such as BLEU and COMET. Our experiments reveal that the size of the finetuning dataset and the alignment between the pretraining and downstream data significantly influence scaling behavior. With sufficient alignment, both cross-entropy and translation quality improve monotonically as pretraining data grows, and we show that translation quality can be predicted from pretraining dataset size with a new log-law. With moderate misalignment, however, translation quality can fluctuate or even deteriorate with more pretraining data, even though cross-entropy keeps improving. Based on an analysis of these findings, we offer practical guidance for selecting pretraining data suited to a given downstream task. The talk will conclude with a discussion of future research directions and remaining open questions in this area.
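The abstract does not spell out the functional form of the log-law, so the sketch below is only an illustration: it assumes a form like BLEU(D_p) ≈ (log(A · D_p^α))^β in the pretraining dataset size D_p and fits the parameters A, α, β with nonlinear least squares. The functional form, parameter names, and all numbers are assumptions for illustration, not results from the talk.

```python
# Minimal sketch: fitting a hypothesized log-law
#   BLEU(D_p) ≈ (log(A * D_p**alpha))**beta
# to (pretraining-tokens, BLEU) pairs. All values below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def log_law(d_p, A, alpha, beta):
    """Hypothesized log scaling law in the pretraining dataset size d_p."""
    return np.log(A * d_p**alpha) ** beta

# Synthetic example data: pretraining tokens vs. observed BLEU.
pretrain_tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
bleu_scores     = np.array([18.2, 21.5, 24.1, 26.0, 27.8, 29.1])

# Fit (A, alpha, beta); bounds keep the log argument > 1 so the fit stays well-defined.
params, _ = curve_fit(
    log_law, pretrain_tokens, bleu_scores,
    p0=[1.0, 0.3, 1.5],
    bounds=([1.0, 0.01, 0.1], [1e6, 1.0, 5.0]),
)
A, alpha, beta = params
print(f"A={A:.3g}, alpha={alpha:.3g}, beta={beta:.3g}")

# Extrapolate to a larger pretraining budget.
print("Predicted BLEU at 1e11 tokens:", log_law(1e11, *params))
```

In practice one would fit such a curve per choice of pretraining corpus and finetuning-set size, and use the extrapolation to judge whether additional pretraining data on that corpus is worth the cost for the downstream task.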
Bio: Berivan Isik is a research scientist at Google, working on efficient and trustworthy AI. Her current interests are efficient training/finetuning of large models, pretraining data valuation and scaling laws for LLMs, differential privacy, and unlearning. She earned her PhD from Stanford University in 2024, where she was affiliated with the SAIL and StatsML groups. Her research was supported by the Stanford Graduate Fellowship (2019-2023), the Google PhD Fellowship (2023-2026), and a Meta research grant.