The SaferAI Roundup #2
Observational Scaling Laws and the Predictability of Language Model Performance & Lessons from the Trenches on Reproducible Evaluation of Language Models
Welcome to "The SaferAI Roundup". Each fortnight, we will publish LLM-generated summaries of 2-3 papers that we consider consequential in the fields of AI governance, safety, and risk management. These summaries are curated and lightly edited by the SaferAI team, an AI risk management organization.
Our goal is to help you stay up to date with this fast-evolving literature: concise, accessible summaries delivered to your inbox, so you can follow critical advances in AI safety and governance without wading through numerous academic papers.
Now, let's dive into the 2 papers we selected for this second edition.
Observational Scaling Laws and the Predictability of Language Model Performance
Yangjun Ruan et al.
TLDR: The paper proposes "observational scaling laws" that allow predicting language model performance across different model families using publicly available models, without needing to train new models.
• Observational scaling laws generalize compute scaling: The authors hypothesize that language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency of converting compute to capabilities. This allows combining data from multiple model families to build more robust scaling laws.
• Low-dimensional capability space extracted from benchmarks: Applying a dimensionality reduction technique to standard language model benchmarks, the authors find that just 3 dimensions explain 97% of the variance in model performance across tasks. These components roughly correspond to general capability, reasoning, and programming skills (see the illustrative sketch after these bullets).
• Accurate predictions of complex capabilities: The method accurately predicts "emergent" capabilities, agentic performance, and effects of techniques like chain-of-thought reasoning, even when using only data from smaller models to predict larger ones. This suggests complex capabilities may scale more smoothly than previously thought.
• Practical implementation with few models: The authors demonstrate that their method can be implemented using just 10-20 carefully selected publicly available models, making it accessible for researchers without large compute resources. They provide guidelines for model selection to enable others to easily apply the technique.
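To make the approach concrete, here is a minimal Python sketch of the pipeline under stated assumptions: a hypothetical models × benchmarks score matrix, synthetic compute figures, PCA as the dimensionality reduction step, and a plain linear fit in place of the paper's sigmoidal link. The names and numbers are illustrative, not taken from the paper.

```python
# Minimal sketch of the observational-scaling-law idea (illustrative only).
# Assumptions: synthetic benchmark scores and compute figures; PCA as the
# dimensionality reduction step; linear fits instead of the paper's exact links.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: 20 public models scored on 8 standard benchmarks,
# plus each model's (known or estimated) training compute.
scores = rng.uniform(0.2, 0.9, size=(20, 8))      # benchmark accuracies
log_compute = rng.uniform(20, 26, size=(20, 1))   # log10 training FLOPs
downstream = rng.uniform(0.0, 1.0, size=20)       # e.g. an agentic benchmark

# 1) Extract a low-dimensional "capability space" from benchmark scores.
pca = PCA(n_components=3)
capabilities = pca.fit_transform(scores)          # (20, 3) capability measures
# On real benchmark data the paper reports ~97% variance explained;
# this random data will not reproduce that figure.
print("variance explained:", pca.explained_variance_ratio_.sum())

# 2) Relate capabilities to log-compute. The paper does this per model family
#    (families differ only in compute-to-capability efficiency); a single
#    pooled fit here keeps the sketch short.
compute_fit = LinearRegression().fit(log_compute, capabilities)

# 3) Predict a complex downstream metric from the capability measures
#    (the paper uses a sigmoidal link; a linear fit is used here for brevity).
downstream_fit = LinearRegression().fit(capabilities, downstream)
predicted = downstream_fit.predict(capabilities)
```

Fitting steps 2 and 3 on smaller, cheaper models and extrapolating to larger ones is what lets the method forecast capabilities without training anything new.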
Link to the paper: https://arxiv.org/pdf/2405.10938
Link to a previous public explainer: https://www.linkedin.com/posts/sim%C3%A9on-campos-8555251a8_observational-scaling-laws-is-one-of-the-activity-7217814657779068928-ppaI?utm_source=share&utm_medium=member_desktop
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika et al.
TLDR: This paper introduces the Language Model Evaluation Harness (lm-eval), an open-source library for reproducible and extensible evaluation of language models, addressing persistent challenges in LM evaluation such as reproducibility, inconsistent implementations, and the field's rapid pace of change.
• Challenges in LM evaluation: The paper outlines key issues in evaluating language models, including the difficulty of assessing natural language responses, benchmark design validity, implementation inconsistencies, and rapid changes in LM capabilities. These challenges make fair comparisons across models and methods problematic.
• Best practices recommended: The authors provide guidelines for rigorous LM evaluation, including sharing exact prompts and code, avoiding copying results from other implementations, providing model outputs, performing qualitative analyses, and measuring/reporting uncertainty. These practices aim to improve reproducibility and comparability of results.
• lm-eval design and features: The paper describes the design of lm-eval, which allows for modular implementation of evaluation tasks and integration with various LM implementations. It supports different types of evaluation requests (e.g., loglikelihoods, perplexities, generation) and incorporates best practices like statistical testing and qualitative analysis tools (a brief usage sketch follows these bullets).
• Case studies demonstrating utility: The authors present case studies showing how lm-eval has been used to improve evaluation rigor, including enabling multi-prompt evaluation in the BigScience project and facilitating benchmarking of novel LM architectures. These examples illustrate how the library addresses challenges in LM evaluation and promotes more standardized, reproducible practices.
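For readers who want to try the library, below is a minimal sketch of an evaluation run through lm-eval's Python entry point. The model (EleutherAI/pythia-160m) and task (hellaswag) are arbitrary examples, and the argument names follow recent lm-eval releases (v0.4+), so treat this as an illustration of the workflow rather than the canonical interface.

```python
# Minimal sketch of an lm-eval run, assuming `pip install lm-eval` (v0.4+).
# Model and task names are arbitrary examples; argument names may differ
# slightly across library versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],                           # any registered task name
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics, including standard errors for uncertainty reporting.
print(results["results"]["hellaswag"])
```

The same evaluation can also be launched from the command line with the library's lm_eval entry point, which is how most of the reported case studies were run.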
Link to the paper: https://arxiv.org/pdf/2405.14782
Thanks for reading! Your feedback on this new format is greatly appreciated.