Mechanical Dreams

An automatically generated podcast about machine learning and natural language processing. The two fictional hosts talk about papers that I want to learn more about on my way to work. It's not good, but it's useful.

RSS Feed

Lost in Backpropagation: The LM Head is a Gradient Bottleneck

0:21:09
In this episode:
• Chapter 1: Introduction to the Bottleneck: Linda introduces the paper and the general concept of the LM head. Professor Norris expresses initial skepticism about revisiting the softmax bottleneck.
• Chapter 2: Expressivity vs. Optimization: The hosts discuss how the paper shifts the focus from the classical expressivity limitation to a fundamental optimization problem.
• Chapter 3: The Math of Gradient Destruction: Linda breaks down the matrix math, explaining how backpropagating a V-dimensional gradient through a rank-D layer destroys up to 99 percent of the gradient norm.
• Chapter 4: SpamLang and Real-World Evidence: The discussion moves to the SpamLang synthetic task and 2B-parameter pretraining experiments, showing that the gradient bottleneck severely limits training speed and capacity.
• Chapter 5: Implications for Scaling Laws: Norris and Linda wrap up by discussing what this means for the future of LLM pretraining and potential architectural fixes.
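
The rank bottleneck from Chapter 3 is easy to see numerically: only the component of a V-dimensional logit gradient that lies inside the D-dimensional column space of the LM head can reach the hidden state, and for a random subspace that fraction concentrates around D/V. A toy sketch with scaled-down, illustrative sizes (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 8192  # hidden size vs. vocabulary size (scaled down for speed)

# A random LM head: its columns span a D-dimensional subspace of R^V.
W = rng.standard_normal((V, D))
Q, _ = np.linalg.qr(W)  # orthonormal basis for the column space

g = rng.standard_normal(V)  # gradient w.r.t. the V logits
retained = np.linalg.norm(Q.T @ g) ** 2 / np.linalg.norm(g) ** 2

# Everything orthogonal to the column space is destroyed in backprop.
print(f"fraction of squared gradient norm retained: {retained:.4f}")
```

With D/V ≈ 0.008 here, roughly 99% of the gradient energy never makes it past the head, which is the spirit of the episode's headline number.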

Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs

0:23:56
In this episode:
• The Context Window Illusion: Norris and Linda introduce the episode and the paper, discussing why million-token context windows don't automatically solve reasoning tasks.
• The Math of Score Dilution: Linda dives into the theoretical bottleneck of static self-attention, explaining why the target-distractor margin must scale logarithmically.
• Query-Only Test-Time Training: Linda reveals the paper's solution: updating only the query projection matrices at inference time to avoid invalidating the KV cache.
• Compute Equivalency: qTTT vs Thinking Tokens: Norris challenges the computational cost, leading to a discussion on how qTTT strictly matches the FLOPs of chain-of-thought decoding.
• Results and Takeaways: The hosts discuss the empirical results on LongBench-v2 and ZeroScrolls, concluding with the implications for inference-time compute scaling.

Learning State-Tracking from Code Using Linear RNNs

0:20:34
In this episode:
• Introduction to State-Tracking: Linda and Professor Norris introduce the paper and discuss the historical context of state-tracking in sequence models.
• The Next-Token Prediction Testbed: The hosts discuss how the authors used Python REPL traces with print statements to evaluate models using next-token prediction instead of sequence-to-sequence.
• DeltaNet Triumphs Over Transformers: Linda explains how DeltaNet with extended eigenvalues perfectly extrapolated the tracking task, while Transformers failed even with dense supervision.
• The Catch: Partial Observability: Professor Norris questions the limits, leading Linda to introduce Probabilistic Finite-State Automata with State Reveals (PFSA-SR) and unobservable branching.
• The Math of Norm Decay: A deep dive into why linear RNNs suffer exponential norm decay without non-linear renormalization, finalizing the episode's takeaways.
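
The norm-decay argument from the final chapter can be reproduced in a few lines: a purely linear diagonal recurrence with eigenvalues inside the unit circle shrinks its state geometrically, and nothing in the linear update can renormalize it. This is a generic illustration, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
a = rng.uniform(0.5, 0.9, size=d)  # diagonal transition, all |eigenvalues| < 1

h = np.ones(d)
norms = []
for t in range(100):
    h = a * h  # linear update only: no gating, no non-linear renormalization
    norms.append(np.linalg.norm(h))

# The state norm is bounded by max(a)**t and collapses exponentially.
print(norms[0], norms[-1])
```

A non-linear renormalization step between updates (the fix the episode discusses) would reset the norm each step instead of letting it vanish.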

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

0:19:50
In this episode:
• Dessert Before Vegetables?: Professor Norris and Linda introduce the concept of Curriculum Learning in LLMs and discuss why the intuitive idea of saving the best data for last has historically failed to produce significant results.
• The Invisible Antagonist: Learning Rate Decay: Linda reveals the paper's core insight: standard learning rate schedules decay to near-zero just as the high-quality data arrives, effectively wasting the most valuable training tokens.
• Signal, Noise, and the River Valley: The hosts discuss the theoretical mechanism, using a 'river valley' analogy to explain how high-quality data provides a strong signal direction that is dampened by aggressive optimization schedules.
• The Solution: Curriculum Model Averaging (CMA): Linda details the paper's proposed method: replacing learning rate decay with a constant learning rate combined with weight averaging (EMA) to stabilize the model while keeping it plastic enough to learn from good data.
• Results at Scale: A deep dive into the experimental results on 1.5B-parameter models, showing how this new regime outperforms random shuffling by over 1.6% on standard benchmarks.
• Rethinking the Pretraining Recipe: Professor Norris concedes the brilliance of the approach, and the two discuss the broader implications for mid-training and the necessity of co-designing data curricula with optimization hyperparameters.
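
The averaging half of CMA is a standard exponential moving average over the weights; here is a generic sketch of that component (the paper's exact decay schedule and naming are not reproduced):

```python
import numpy as np

def ema_update(avg, weights, decay=0.999):
    """Blend the running average toward the current weights.

    With a constant learning rate the raw weights stay noisy but plastic;
    the EMA provides the stable model actually used for evaluation.
    """
    return decay * avg + (1.0 - decay) * weights

w = np.zeros(4)
avg = w.copy()
for step in range(1, 2001):
    w = w + 0.01 * np.sin(step)  # stand-in for noisy constant-LR updates
    avg = ema_update(avg, w)

print(w, avg)  # avg tracks w while damping the step-to-step oscillation
```

The design point the episode makes: because the learning rate never decays, the model can still absorb the late high-quality data, while the average supplies the stability a decay schedule would otherwise provide.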

GLM-5

0:24:36
In this episode:
• Welcome & The End of Vibe Coding?: Linda introduces GLM-5 and the paradigm shift from passive vibe coding to autonomous agentic engineering.
• Architecture & DeepSeek Sparse Attention: Professor Norris and Linda examine the 744B-parameter model and how transitioning from dense to sparse attention drastically cuts compute costs.
• Asynchronous RL and the Slime Framework: A deep dive into decoupled training engines, addressing off-policy drift with TITO and token-level clipping.
• Evaluating Real-World Agentic Engineering: Reviewing GLM-5's performance on SWE-bench and the innovative Agent-as-a-Judge pipeline for interactive frontend testing.
• Hardware Adaptation & Pony Alpha: Discussing the model's extreme quantization for domestic GPUs and the dramatic anonymous release on OpenRouter.

Cautious Optimizers

0:21:19
In this episode:
• Introduction to Cautious Optimizers: Linda introduces the paper and its bold claim of improving optimizers with just one line of code, while Norris expresses his initial skepticism.
• The Inertia Problem in Momentum: The hosts discuss how standard momentum-based optimizers like AdamW can overshoot due to inertia, temporarily increasing the loss function.
• The One-Line Fix and Scaling: Linda breaks down the PyTorch implementation of the cautious mask, explaining how it zeros out conflicting directions and scales the remaining updates.
• Hamiltonian Dynamics and Convergence: Norris and Linda explore the theoretical guarantees of the paper, discussing how the method preserves Hamiltonian descent and ensures monotonic loss reduction.
• Empirical Triumphs and Overhead: The conversation shifts to the experimental results on LLaMA pretraining and Vision Transformers, noting the impressive performance and minimal 3 percent computational overhead.
• Conclusion: Norris admits he is fully convinced by the elegant simplicity of the paper, and Linda signs off for the episode.
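
The "one line" Linda walks through can be paraphrased in NumPy as follows. The published snippet is in PyTorch, and the rescaling convention below is one common variant, so treat this as a sketch rather than the paper's exact code:

```python
import numpy as np

def cautious(update, grad, eps=1e-8):
    """Zero out update coordinates whose sign disagrees with the current
    gradient, then rescale the survivors to preserve the overall step size."""
    mask = (update * grad > 0).astype(update.dtype)
    return update * mask * (mask.size / (mask.sum() + eps))

u = np.array([0.5, -0.3, 0.2, -0.1])  # momentum-based update (e.g. from AdamW)
g = np.array([0.4, 0.6, 0.1, -0.2])   # raw gradient at the current step
print(cautious(u, g))  # the second coordinate is masked to zero
```

Masking the sign-conflicting coordinate is what prevents momentum's inertia from pushing the loss uphill on that dimension, which is the overshoot problem discussed above.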

Backward Gradient Normalization in Deep Neural Networks

0:22:10
In this episode:
• Welcome and Introduction: Professor Norris and Linda introduce the episode and the paper of the week: 'Backward Gradient Normalization in Deep Neural Networks'.
• The Ghost of Gradients Past: A discussion on the classic vanishing and exploding gradient problems, and why existing solutions like Batch Normalization and ResNets still leave room for improvement.
• Unpacking Backward Gradient Normalization: Linda explains the core mechanics of the BGN layer, detailing how it leaves the forward pass untouched while scaling gradients during backpropagation.
• Visualizing the Flow: The hosts delve into the paper's experiments with 90-layer deep networks, comparing gradient decay across ReLU, Sigmoid, and Tanh activation functions.
• Results, Trade-offs, and Conclusions: A breakdown of the accuracy improvements and training time efficiency of BGN compared to Batch Normalization on the MNIST dataset, followed by final thoughts.
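
The core idea, forward pass untouched and gradients rescaled on the way back, can be illustrated with manual backprop through a deep linear chain. The unit-norm target used here is an assumption for illustration; the paper defines its own normalization constant:

```python
import numpy as np

def bgn(grad, target_norm=1.0, eps=1e-12):
    # Backward-only rescaling: in the forward pass BGN is the identity.
    return grad * (target_norm / (np.linalg.norm(grad) + eps))

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 16)) * 0.1 for _ in range(50)]

g_plain = np.ones(16)
g_bgn = np.ones(16)
for W in layers:
    g_plain = W.T @ g_plain   # gradient shrinks layer after layer
    g_bgn = bgn(W.T @ g_bgn)  # norm reset between layers

print(np.linalg.norm(g_plain))  # vanishingly small after 50 layers
print(np.linalg.norm(g_bgn))    # held at target_norm
```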

Attention Residuals

0:21:00
In this episode:
• The PreNorm Dilution Problem: Professor Norris and Linda introduce the episode and discuss the fundamental limitations of standard residual connections, focusing on the unbounded magnitude growth caused by PreNorm.
• Attention Residuals and the Time-Depth Duality: Linda introduces the core concept of Full Attention Residuals, treating network depth like sequence length. Professor Norris raises concerns about the memory and communication overhead.
• Block Attention Residuals: The hosts discuss how the Kimi Team solves the overhead problem by partitioning layers into blocks, reducing the cost while preserving the benefits of selective aggregation.
• Infrastructure and System Optimizations: A deep dive into the engineering feats that make Block AttnRes practical, including cross-stage caching for pipeline parallelism and a two-phase computation strategy for inference.
• Results, Scaling Laws, and Wrap-up: Linda shares the impressive scaling law results and downstream benchmark improvements. The hosts reflect on how AttnRes bounds hidden-state magnitudes.

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

0:20:28
In this episode:
• Introduction: Linda and Professor Norris introduce the podcast and the focus of the episode: the PoPE paper.
• The Problem with RoPE: A discussion on Rotary Position Embedding and how it entangles content and positional information.
• Introducing PoPE: Linda explains the mathematical shift to polar coordinates to decouple the what and the where.
• Empirical Triumphs: Reviewing the massive performance jump on the Indirect Indexing task, plus music, genomics, and language modeling.
• Length Extrapolation and Conclusion: Analyzing PoPE's zero-shot length extrapolation capabilities compared to YaRN, followed by episode wrap-up.

Scaling Laws for Precision

0:20:33
In this episode:
• Introduction to Precision in Scaling Laws: Linda introduces the new paper which adds precision as a third variable to the Chinchilla scaling laws. Professor Norris reflects on how precision is usually treated as an afterthought.
• The Post-Training Quantization Paradox: The hosts discuss the surprising finding that overtraining models on too much data actually makes them degrade worse when applying post-training quantization.
• Effective Parameters and Low-Precision Training: Linda explains the concept of effective parameter count, and how lowering precision in weights, activations, and KV cache shrinks the model's effective size multiplicatively.
• Finding the Compute-Optimal Precision: Professor Norris is surprised to learn that compute-optimal pretraining precision is around 7 to 8 bits, completely independent of the compute budget unless model size is constrained.
• A Unified Scaling Law and Takeaways: The episode wraps up by bringing pretraining and post-training precision into a single mathematical framework, discussing what this means for the future of model training.

On the "Induction Bias" in Sequence Models

0:17:09
In this episode:
• Introduction: The Transformer's Kryptonite: Professor Norris jokes about Transformers solving everything, but Linda introduces a new paper that challenges their ability to perform basic state tracking efficiently. They set the stage by distinguishing between the well-known Out-of-Distribution failures and the paper's focus on In-Distribution data efficiency.
• The Setup: Modulo Arithmetic and Supervision Regimes: Linda explains the experimental setup using modular addition and permutation composition, and defines the three supervision formats: Outcome Supervision, Chain-of-Thought (CoT), and Aligned CoT. Norris questions why simple math requires such complex architectures, leading to a discussion on sample efficiency.
• The Showdown: Transformers vs. RNNs: The hosts discuss the surprising results where recurrent models (LSTMs and Dense-SSMs) crush Transformers in outcome supervision. They analyze why Transformers rely heavily on Chain-of-Thought to function, whereas RNNs struggle with standard CoT due to recall bottlenecks but excel with Aligned CoT.
• The Core Theory: Induction Bias and The Sharing Factor: Linda dives into the concept of the "Sharing Factor" (kappa), explaining that RNNs use an inductive bias to share weights across sequence lengths, effectively learning the algorithm. Norris is fascinated by the finding that Transformers exhibit "length isolation," essentially relearning the task from scratch for every new sequence length.
• Conclusion: Brute Force vs. True Learning: The pair wraps up by discussing the implications for Large Language Models, specifically regarding "context rot" and the massive data requirements for agentic workflows. Norris concedes that perhaps we haven't solved state tracking just yet, and they sign off.

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

0:16:05
In this episode:
• A Noble Introduction: Professor Norris makes a pun about aristocracy while Linda introduces the paper 'NOBLE' from Canva Research, setting the stage for a discussion on accelerating Transformer pretraining.
• The Linear Collapse Problem: Linda explains why standard LoRA doesn't work for pretraining from scratch, and Norris helps clarify the difference between parameter-efficient fine-tuning and architectural augmentation.
• Anatomy of a Nonlinear Branch: A deep dive into the NOBLE architecture and the 'CosNet' activation function, discussing why a cosine sandwich is better than ReLU for low-rank bottlenecks.
• Crunching the Numbers: The hosts discuss the experimental results, highlighting the 1.47x step speedup and debating whether the parameter overhead is worth the wall-clock time savings.
• The Mixup Mystery: Linda reveals a fascinating caveat regarding Mixup/CutMix augmentation, leading to a theoretical realization about NOBLE's role in learning high-frequency signals versus smooth global trends.
• Inference and Impact: The duo wraps up by discussing the trade-offs, specifically the permanent inference cost, and gives their final verdict on whether NOBLE is the future of pretraining.

Flash Attention 4

0:15:52
In this episode:
• Welcome to the Hardware Lottery: Professor Norris and Linda introduce the episode's focus: FlashAttention-4. They set the stage by discussing the arrival of NVIDIA's Blackwell architecture and why existing optimization techniques suddenly hit a wall.
• The Asymmetry Problem: Linda explains the concept of 'Asymmetric Hardware Scaling' found in the B200 GPUs, where tensor cores doubled in speed but memory bandwidth and special function units didn't. Norris questions why simply running FlashAttention-3 isn't good enough.
• Bottlenecks in the Forward Pass: The duo dives into the algorithmic changes for the forward pass, specifically how the paper mitigates the 'exponential unit' bottleneck by emulating exponential functions on FMA units and using conditional softmax rescaling.
• Taming the Backward Pass with TMEM: A deep dive into the backward pass optimizations. Linda explains the use of Tensor Memory (TMEM) and the '2-CTA MMA' mode to reduce shared memory traffic, satisfying Norris's curiosity about how to hide latency.
• Escaping Template Hell: They discuss the implementation framework: CuTe-DSL embedded in Python. Norris rejoices at the reduction in compile times compared to C++ templates, while Linda highlights the flexibility for researchers.
• The Verdict: The hosts wrap up the findings, noting the impressive speedups over cuDNN and Triton, and offer final thoughts on the future of hardware-aware algorithm design.

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

0:16:55
In this episode:
• The Multi-Million Dollar NaN: Linda introduces the paper 'An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence' by Zhang et al., setting the stage with the high stakes of expensive pretraining runs failing. Professor Norris expresses skepticism that simple 'bad data' is the root cause of complex divergences.
• The Toxic Five Tokens: The hosts discuss the paper's methodology of injecting synthetic uniform random noise. Linda reveals the counter-intuitive finding that a restricted vocabulary of noise (like repeating hash codes) is significantly more destabilizing than random noise drawn from the full vocabulary.
• Bigger, Deeper, More Fragile: A look at the scaling laws of failure; contrary to the hope that scale fixes everything, Linda explains how the paper proves that deeper models are substantially more likely to diverge when exposed to noise than their wider or smaller counterparts.
• CSI: Gradient Descent: Professor Norris and Linda dive into the forensic diagnostics, distinguishing between failures caused by high learning rates versus noisy data. They discuss the specific 'smoking gun' of maximum attention logits capping at around 1800 for noise-induced failures versus 4000 for learning rate issues.
• MoE Stability and The QK-Fix: They address the concern that Mixture-of-Experts (MoE) models might be hypersensitive to noise, which the paper disproves, and discuss QK-LayerNorm as the architectural 'safety belt' when perfect data cleaning isn't possible.
• Closing Thoughts: Final takeaways on the necessity of data curation and a witty sign-off from Professor Norris regarding the cleanliness of his own reading glasses versus the training data.
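
The "safety belt" from the MoE chapter normalizes queries and keys before the dot product, which bounds the attention logits no matter how large the projections grow. A minimal sketch using an RMS-style norm (implementations vary; QK-LayerNorm proper uses LayerNorm with learned gains):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attn_logits_qknorm(q, k, scale):
    # After normalization each row has RMS 1 (norm sqrt(d)), so by
    # Cauchy-Schwarz every logit is bounded by scale * d regardless of
    # how large the raw q/k projections became during training.
    return (rmsnorm(q) @ rmsnorm(k).T) * scale

d = 64
q = np.random.default_rng(0).standard_normal((8, d)) * 1e3  # pathologically large
k = np.random.default_rng(1).standard_normal((8, d)) * 1e3
logits = attn_logits_qknorm(q, k, scale=1.0 / np.sqrt(d))
print(np.abs(logits).max())  # stays modest despite the 1e3-scale inputs
```

This is why the capped-logit signatures the hosts describe (around 1800 vs. 4000) never arise once the norm is in place: the logits simply cannot run away.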

Midtraining Bridges Pretraining and Posttraining Distributions

0:16:22
In this episode:
• Introduction: Do We Really Need Another Phase?: Professor Norris jokingly laments the ever-expanding terminology of LLM training, while Linda introduces the paper on 'Midtraining' as a distinct, intermediate phase between pretraining and post-training.
• The Mechanism: Building a Distributional Bridge: Linda explains the core theory: midtraining isn't just 'cooling down,' but shifting the model's initialization closer to the target distribution to smooth out the optimization path.
• Results: Where It Works (and Where It Doesn't): The hosts discuss the finding that midtraining shines in 'distant' domains like Code and Math but matters less for general instructions, and cover the surprising reduction in catastrophic forgetting.
• The Plasticity Window: Timing and Mixtures: A deep dive into the interaction between when you start midtraining and how much specialized data you use, highlighting the dangers of late, aggressive data injection.
• Conclusion: Better Than Continued Pretraining?: Norris concedes the method's utility after seeing the comparison against standard continued pretraining, and the pair summarize the practical takeaways for training schedules.

SiameseNorm

0:17:21
In this episode:
• Introduction: The Never-Ending Normalization Wars: Professor Norris and Linda kick off the episode. Norris cracks a joke about how normalization layers are like seasoning—too little and it's bland, too much and you ruin the dish. Linda introduces the paper 'SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm', setting the stage for a discussion on the fundamental trade-offs in Transformer architecture.
• The Dilemma: Dilution vs. Distortion: Linda explains the core problem: Pre-Norm is stable but suffers from 'signal dilution' which limits effective depth, while Post-Norm offers high expressivity but is plagued by 'gradient distortion' and instability. Norris plays the skeptic, asking why we can't just combine them, leading to a discussion on why previous hybrid attempts have failed.
• The Solution: SiameseNorm's Dual Streams: Linda describes the paper's novel architecture: SiameseNorm. She explains how it uses two parallel streams (one Pre-Norm-like, one Post-Norm-like) that share the same residual block parameters. This allows the model to decouple the optimization dynamics (via the identity path) from the representation learning (via the normalized path).
• Under the Hood: The Gradient Analysis: Professor Norris dives into the mathematical justification provided in the paper. He breaks down the Jacobian matrix analysis, seemingly impressed by how the architecture preserves an explicit identity term (for the gradient highway) while simultaneously enforcing bounded representations, effectively solving the vanishing/exploding gradient problem.
• Results: The Arithmetic Leap and High Learning Rates: Linda presents the empirical evidence, highlighting that SiameseNorm allows for much more aggressive learning rates (up to 2e-3) without diverging. She emphasizes the massive 40% relative gain in arithmetic reasoning tasks compared to Pre-Norm, which finally convinces Norris that the 'effective depth' has indeed been restored.
• Conclusion: A Unified Future?: The hosts wrap up the episode. Norris concedes that this might be the 'best of both worlds' solution the field has been waiting for. They discuss the implications for training even larger models and sign off with their catchphrase.

ÜberWeb

0:18:00
In this episode:
• Welcome to the ÜberWeb: Professor Norris and Linda introduce the episode's focus: the 'ÜberWeb' paper by DatologyAI, setting the stage for a discussion on the challenges of training high-quality multilingual models on a massive scale.
• The Curse That Wasn't: The hosts debate the 'curse of multilinguality,' with Linda explaining the paper's central thesis: that performance degradation is often due to poor data quality ('curse of data quality') rather than a lack of model parameters.
• A Rising Tide Lifts All Boats: Discussion on the paper's most surprising finding: that curating high-quality English data improves non-English performance, and conversely, cleaning non-English data boosts English capabilities.
• Bespoke Curation and the Translation Trap: Linda details why generic filters fail for diverse scripts and how the paper utilized bespoke pipelines, while Norris interrogates the nuance of using translated data effectively versus blindly translating noise.
• The New Pareto Frontier: A look at the hard numbers, where the hosts analyze how 3B and 8B models trained on just 1 trillion curated tokens managed to outperform significantly larger open-source baselines like Llama and Qwen.
• Conclusion and Sign-off: Norris and Linda wrap up the episode, reflecting on the future of data-centric AI and the move toward more efficient, language-inclusive foundation models.

Why Do Reasoning Models Loop?

0:17:16
In this episode:
• Introduction: The Infinite Loop: Professor Norris and Linda introduce the episode's topic: the phenomenon of reasoning models getting stuck in repetitive loops. Norris jokes about his own lectures looping, while Linda introduces the paper 'Wait, Wait, Wait... Why Do Reasoning Models Loop?' and the context of Chain-of-Thought reasoning.
• The Distillation Mystery: Linda presents the paper's empirical findings, highlighting that 'student' models (distilled) loop significantly more than their 'teacher' models. Norris is skeptical that a student could be worse than the teacher if trained properly, leading to a discussion on 'errors in learning.'
• Mechanism 1: Risk Aversion and Hard Steps: The hosts dive into the first theoretical mechanism: Risk Aversion due to Hardness of Learning. Linda uses the 'Star Graph' analogy to explain how models prefer easy, cyclic actions (like resetting) over hard, progress-making steps when they are uncertain.
• Mechanism 2: Deja Vu and Correlated Errors: They discuss the second mechanism: Inductive Bias for Temporally Correlated Errors. Norris learns why models don't just guess randomly when confused but instead make the *same* mistake repeatedly, leading to the 'Groundhog Day' effect in reasoning traces.
• Temperature: A Cure or a Band-Aid?: Linda explains why turning up the 'temperature' (randomness) helps break loops but is ultimately just a stopgap that masks the underlying learning errors. They conclude with a look at how loops become self-reinforcing catalysts.

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

0:17:42
In this episode:
• Introduction: Hitting the Data Wall: Professor Norris and Linda introduce the episode's paper, 'OPUS', and discuss the looming 'Data Wall' where high-quality public text is exhausted, necessitating a shift from more tokens to better tokens.
• The Flaw in Current Data Selection: The hosts debate existing methods, contrasting static filters like FineWeb-Edu with dynamic selection. Linda explains why scoring data based on raw gradients fails when modern optimizers like AdamW or Muon reshape the update geometry.
• Defining Utility in the Optimizer's World: Linda breaks down the core mechanism of OPUS: measuring data utility in the optimizer-induced update space rather than the raw gradient space. Norris grapples with the concept of aligning data selection with the actual trajectory of the optimization.
• Scaling Up: Ghosts and Sketches: A deep dive into how OPUS makes per-sample gradient estimation computationally feasible. The discussion covers the use of the 'Ghost' technique combined with CountSketch to project updates into low-dimensional space without full materialization.
• Diversity via Boltzmann and The Proxy: The hosts discuss how OPUS avoids 'diversity collapse' using Boltzmann sampling instead of greedy selection, and how it constructs a stable 'Bench-Proxy' from the pre-training corpus to guide the model.
• Results and Final Thoughts: Reviewing the empirical results where OPUS outperforms industrial baselines on GPT-2 and Qwen3-8B. Norris concedes the cleverness of the approach, and they wrap up with thoughts on data efficiency.

TEON

0:18:52
In this episode:
• Introduction: The Optimizer Zoo: Professor Norris and Linda introduce the topic of optimization in LLMs, joking about the explosion of new optimizers before introducing the paper of the week: TEON.
• The Muon Foundation: Linda recaps the Muon optimizer, explaining how it uses orthogonalization to prevent gradient rank collapse, while Norris questions its limitations regarding layer independence.
• Enter the Tensor: How TEON Works: Linda explains the core innovation of TEON: stacking gradients from multiple layers into a tensor and using matricization to orthogonalize them jointly.
• The Theory: Singular Vector Alignment: The hosts discuss the theoretical justification, focusing on Proposition 4.6 and why gradients in Transformers (specifically Q, K, and V) exhibit strong singular vector alignment.
• Results and The Polar Express: A look at the experimental results on GPT and LLaMA models, confirming TEON outperforms Muon even when using approximate SVD methods like PolarExpress.
• Conclusion: Professor Norris concedes that TEON offers a principled improvement over Muon, and the duo signs off.

Cautious Weight Decay

0:20:59
In this episode:
• Introduction: The Weight Decay Dilemma: Professor Norris and Linda introduce the episode's topic: Cautious Weight Decay. They discuss the historical context of weight decay as a regularization technique and why standard approaches might be accidentally sabotaging model learning.
• The Mechanism: To Decay or Not to Decay?: Linda explains the core algorithm of Cautious Weight Decay (CWD). The hosts break down the 'sign alignment' logic, explaining how CWD decides when to apply the 'brakes' of regularization and when to let the weights grow freely.
• Mathematical Foundations: Lyapunov and Sliding Modes: Professor Norris dives into the theoretical proofs provided in the paper. He discusses how CWD doesn't just optimize a proxy loss function but actually finds Pareto-optimal points on the stationary manifold of the original objective.
• Experimental Results: A Drop-in Upgrade: Linda presents the empirical data, covering performance on Large Language Models and Vision Transformers. They highlight the 'killer feature': CWD requires no hyperparameter retuning compared to AdamW.
• Conclusion and Final Verdict: The hosts summarize the findings. Norris gives his skeptical-but-approved stamp of approval, and they discuss the potential for this simple one-line change to become a new standard in deep learning optimization.
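
The sign-alignment logic from the second chapter can be sketched as a masked decoupled decay. The mask convention below (decay fires only where the optimizer update and the weight share a sign, i.e. where decay reinforces rather than fights the optimizer's own direction) is one reading of the episode's description; consult the paper for the exact rule:

```python
import numpy as np

def cwd_step(w, update, lr=1e-2, wd=0.1):
    """One AdamW-style step with Cautious Weight Decay (hypothetical sketch).

    Where update and w disagree in sign, the optimizer is growing |w|,
    so decay is switched off and the weight is left free to grow.
    """
    mask = (update * w > 0).astype(w.dtype)
    return w - lr * update - lr * wd * mask * w

w = np.array([1.0, 1.0, -1.0])
u = np.array([0.5, -0.5, 0.5])  # 2nd and 3rd coords: optimizer grows |w|
print(cwd_step(w, u))  # decay applies only to the first coordinate
```

Because the change is a per-coordinate mask on the existing decay term, it slots into an AdamW loop without any hyperparameter retuning, matching the "drop-in upgrade" framing above.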

Predictable Scale

0:17:42
In this episode:
• Introduction: The Alchemy of Training: Professor Norris laments the 'black magic' of hyperparameter tuning, and Linda introduces the paper 'Predictable Scale: Part I, Step Law' which promises to turn that alchemy into science.
• The Million-Hour Experiment: The hosts discuss the unprecedented scale of the study, involving 3,700 models and nearly one million H800 GPU hours, to map the loss landscape.
• Defining the Step Law: Linda explains the core mathematical findings: how Learning Rate scales with model size (N) and data size (D), and the surprising revelation that optimal Batch Size depends almost entirely on D, not N.
• Universality: MoEs and Data Recipes: A deep dive into how the Step Law holds up against sparse Mixture-of-Experts models and varying data distributions (like code or multilingual data), outperforming previous scaling laws like DeepSeek or OpenAI's.
• Conclusion: A Plug-and-Play Future: Norris concedes that the empirical evidence is overwhelming. They wrap up with the implications for efficient LLM training and what this means for the industry.

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

0:17:20
In this episode:
• Introduction: The Heavy Cost of Curvature: Professor Norris and Linda introduce the paper 'A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs' by researchers at Meta and UMD, setting the stage by discussing why measuring the Hessian matrix is a computational nightmare for large models.
• The Proposal: Critical Sharpness: Linda explains the core innovation of the paper: a method to estimate sharpness using a simple line search algorithm (Critical Sharpness) instead of expensive eigenvalue decompositions.
• Validation: The Edge of Stability: The hosts discuss how this new metric confirms the 'Edge of Stability' phenomenon in massive models like OLMo-2 7B, proving that models naturally train right on the precipice of instability.
• Application: Solving Catastrophic Forgetting: The discussion moves to the most practical takeaway: using 'Relative Critical Sharpness' to determine the perfect ratio of pre-training data to mix in during fine-tuning to prevent the model from becoming 'dumb' on general tasks.
• Conclusion and Takeaways: Norris and Linda wrap up with final thoughts on how this tool essentially gives engineers a flashlight to navigate the dark, high-dimensional valleys of loss landscapes without needing a supercomputer.

Challenges and Research Directions for Large Language Model Inference Hardware

0:19:24
In this episode:
• Introduction: The Disconnect: Professor Norris and Linda introduce the paper 'Challenges and Research Directions for Large Language Model Inference Hardware' by Ma and Patterson, discussing the widening gap between academic architecture research and industry reality.
• The Inference Crisis: Prefill vs. Decode: The hosts break down why LLM inference is fundamentally different from training, explaining the 'Memory Wall' and the specific bottleneck of the autoregressive Decode phase.
• Solution 1: High Bandwidth Flash: Linda proposes High Bandwidth Flash (HBF) as a solution for capacity, while Professor Norris questions the latency and endurance issues inherent to flash memory.
• Solution 2 & 3: PNM and 3D Stacking: A discussion on Processing-Near-Memory (PNM) versus Processing-In-Memory (PIM), and how 3D stacking can shorten the distance between compute and data.
• Solution 4: Interconnects and New Metrics: The duo discusses why latency matters more than bandwidth for inference interconnects, and concludes with a look at new evaluation metrics like TCO and Carbon Footprint.

Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

0:20:00
In this episode:
• To PE or Not to PE?: Professor Norris and Linda kick off the episode by introducing the paper 'Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings' (DroPE). Norris expresses immediate skepticism about removing such a fundamental component of the Transformer architecture, setting the stage for the debate.
• The Inductive Bias Paradox: Linda explains the paper's first major observation: Positional Embeddings (PEs) are necessary scaffolding for fast training convergence but become a straitjacket for zero-shot generalization. They discuss the theoretical findings regarding 'attention positional bias' and why NoPE models struggle to learn initially.
• Why Scaling Frequency Breaks Meaning: The hosts dive into the technical critique of current context-extension methods like YaRN and RoPE-scaling. Linda details how compressing low frequencies to fit longer contexts distorts 'semantic heads'—the parts of the model that match content rather than position—causing failures in retrieval tasks.
• The DroPE Solution: Remove the Training Wheels: They discuss the proposed method: training with RoPE, then stripping it away and doing a quick recalibration. Norris warms up to the analogy of PEs acting as 'scaffolding' that should be removed once the building (the model) is self-supporting.
• Needles in Haystacks and Future Architectures: Reviewing the empirical results, including the massive improvements on Needle-in-a-Haystack benchmarks compared to YaRN. The episode concludes with a discussion on whether future foundation models will all be trained with this 'drop-and-recalibrate' paradigm.

The Quantization Model of Neural Scaling

0:17:12
In this episode:
• Introduction: The Mystery of the Straight Line: Professor Norris and Linda introduce the paper 'The Quantization Model of Neural Scaling' by Michaud et al., setting the stage by discussing the ubiquity of power laws in deep learning and the puzzle of why scaling curves are so predictable.
• The Quantization Hypothesis: Linda explains the core theory that neural network knowledge is not continuous but composed of discrete, indivisible chunks called 'quanta,' analogous to Max Planck's quantization of energy.
• Zipf's Law and the Toy Model: The hosts discuss how learning discrete skills ordered by frequency (Zipfian distribution) results in smooth power law scaling, using the authors' 'multitask sparse parity' toy dataset as proof.
• Monogenic vs. Polygenic Traits in LLMs: Transitioning to real Language Models (Pythia), the discussion explores why some capabilities emerge suddenly (monogenic) while others improve gradually (polygenic), borrowing terminology from genetics.
• Mechanistic Evidence: Clustering Gradients: Linda details the 'Quanta Discovery from Gradients' (QDG) technique used to automatically identify specific skills within a model, such as incrementing numbers or closing quotes.
• Conclusion: A Society of Quanta: Professor Norris and Linda wrap up by reflecting on Minsky's 'Society of Mind' and the implications of this decomposability for the future of mechanistic interpretability.
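A tiny numerical illustration of the episode's central idea, with assumed numbers (the Zipf exponent, per-token loss, and quanta count below are my own toy choices, not the paper's): if discrete skills are used with Zipfian frequencies and a model masters the most frequent ones first, a smooth power law falls out of purely discrete learning.

```python
import numpy as np

# Toy version of the Quantization Hypothesis (not the paper's code):
# skills ('quanta') are used with Zipfian frequencies p_k ~ k^(-alpha),
# and a model that has learned the n most frequent quanta still pays a
# fixed loss b on every token governed by an unlearned quantum.
alpha = 1.5          # assumed Zipf exponent
b = 1.0              # assumed per-token loss for an unlearned quantum
K = 1_000_000        # total number of quanta in the toy universe

k = np.arange(1, K + 1)
p = k ** (-alpha)
p /= p.sum()         # normalize to a probability distribution

def expected_loss(n):
    """Mean loss when the n most frequent quanta are mastered."""
    return b * p[n:].sum()

# Doubling the number of learned quanta cuts the residual loss by a
# near-constant factor of about 2**(alpha - 1): a smooth power law
# emerges from purely discrete skills.
ratios = [expected_loss(n) / expected_loss(2 * n) for n in (100, 300, 1000)]
print(ratios)  # each ratio is close to 2**(alpha - 1), i.e. ~1.41
```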

EAGLE-3

0:17:52
In this episode:
• Introduction: The Wait for Tokens: Professor Norris and Linda introduce the episode's paper, EAGLE-3, and discuss the persistent bottleneck of autoregressive generation costs in modern LLMs.
• The Speculative Ceiling: Linda explains how previous speculative sampling methods like EAGLE hit a performance wall where adding more training data failed to improve the draft model, identifying the feature prediction constraint as the culprit.
• Innovation: Training-Time Test: A deep dive into EAGLE-3's core innovation: abandoning feature prediction in favor of direct token prediction that simulates the testing environment during the training phase.
• Going Deeper: Multi-Layer Fusion: The hosts discuss the second major architectural change, where the model stops relying solely on top-layer features and instead fuses low, mid, and high-level features for better context.
• Results: A New Scaling Law: Linda reveals the experimental results, including a 6.5x speedup, SGLang integration, and the discovery of a scaling law where draft models finally benefit from more data.

Engram Paper

0:17:35
In this episode:
• The Memory Bottleneck: Professor Norris and Linda introduce the paper 'Conditional Memory via Scalable Lookup' and debate the inefficiency of using expensive neural computation to simulate simple knowledge retrieval.
• Engram: N-grams Strike Back: Linda breaks down the 'Engram' module, explaining how it uses hashed N-grams and context-aware gating to inject static embeddings directly into the Transformer backbone.
• The U-Shaped Curve of Sparsity: The hosts discuss the 'Sparsity Allocation' problem, analyzing the trade-off between MoE experts and memory capacity, and the discovery that a hybrid approach yields superior results.
• Deepening the Network Without Layers: A discussion on mechanistic analysis, focusing on how Engram handles static patterns like named entities in early layers, freeing up the model's attention for complex reasoning.
• Prefetching the Future: Linda and Norris explore the system-level advantages of deterministic lookups, including offloading massive embedding tables to CPU memory, and conclude the episode.
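The deterministic-lookup point above can be made concrete with a minimal hashed N-gram memory. This is my own reconstruction of the general idea, not the Engram module: the table size, hashing scheme, and scalar sigmoid gate are all assumptions for illustration.

```python
import hashlib
import numpy as np

# Sketch of a hashed N-gram lookup with context-aware gating: the last N
# tokens are hashed into a fixed-size static embedding table, and a gate
# computed from the hidden state decides how much of the retrieved
# embedding to inject. Because the slot depends only on the tokens, the
# lookup is deterministic and can be prefetched from CPU memory.
TABLE_SIZE, DIM = 4096, 8
table = np.random.default_rng(0).standard_normal((TABLE_SIZE, DIM))

def ngram_slot(tokens, n=2):
    """Hash the last n tokens to a row index of the embedding table."""
    key = "\x1f".join(tokens[-n:]).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % TABLE_SIZE

def engram_like_lookup(tokens, hidden, gate_w):
    e = table[ngram_slot(tokens)]
    gate = 1 / (1 + np.exp(-hidden @ gate_w))   # scalar sigmoid gate
    return hidden + gate * e                    # gated injection into the stream
```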

From Entropy to Epiplexity- Rethinking Information for Computationally Bounded Intelligence

0:19:56
In this episode:
• Introduction: Is Shannon Information Theory Broken?: Professor Norris and Linda introduce the episode, with Norris expressing skepticism about challenging the foundations of information theory. Linda introduces the paper 'From Entropy to Epiplexity' and the premise that traditional theory fails to account for computational bounds.
• The Paradox of Deterministic Creation: The hosts discuss the first major paradox: how deterministic processes like AlphaZero or synthetic data generation seem to create new knowledge, despite the Data Processing Inequality suggesting otherwise. Linda explains how cryptographic pseudorandomness shows that 'computational difficulty' can look exactly like entropy to a bounded observer.
• Defining Epiplexity and Time-Bounded Entropy: Linda breaks down the core definitions of the paper, explaining Epiplexity as the structural information a specific model can actually learn, versus Time-Bounded Entropy, which is the residual unpredictability relative to that model's resources.
• Emergence, Induction, and the Chess Experiment: A deep dive into the paper's experiments with Cellular Automata and Chess. The hosts discuss how the order of data (Forward vs. Reverse) impacts what a model learns and how limited compute forces models to learn emergent rules rather than brute-force simulation.
• Practical Implications for LLMs and Conclusion: The discussion moves to real-world application, specifically how Epiplexity explains why pre-training on text transfers better than images. Norris admits the utility of the theory for data selection in Large Language Models.

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

0:19:43
In this episode:
• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week: 'Completed Hyperparameter Transfer' by researchers at Apple.
• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why scaling only across model width isn't enough for modern LLMs, prompting skepticism from Norris about adding more complexity.
• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution: the Complete(d)P parameterization, discussing how it fixes issues with Query-Key norms and unifies scaling across depth, batch size, and training duration using SDE principles.
• The Per-Module Revolution: Linda gets excited about the paper's boldest claim: optimizing hyperparameters specifically for different modules (like embeddings vs. attention heads), and explains the 'jagged' optimization landscape that requires Trust Region Random Search.
• Scaling Up: 50 Million to 7 Billion: Discussion of the empirical results, focusing on how settings found on a small 50M parameter proxy model successfully transferred to a 7B model, resulting in significant training speed-ups.
• Conclusion: A Skeptic Convinced: Professor Norris admits that the rigorous math behind the SDE scaling rules is convincing, and the duo wraps up with final thoughts on what this means for the future of efficient model training.

NorMuon- Making Muon more efficient and scalable

0:19:09
In this episode:
• Introduction: The Optimizer Menagerie: Professor Norris and Linda kick off the episode by discussing the explosion of new optimizers in the LLM space. Linda introduces 'NorMuon,' a paper from Georgia Tech and Microsoft that attempts to bridge the gap between the industry standard, AdamW, and the geometric newcomer, Muon.
• The Geometry Problem: Why Adam and Muon Fall Short: Linda explains the fundamental trade-off: Adam handles coordinate-wise scaling well but ignores matrix geometry, while Muon fixes the geometry via orthogonalization but suffers from imbalanced update norms across neurons. Norris challenges the necessity of fixing Muon, prompting a discussion on 'condition numbers' versus 'neuron norms.'
• The NorMuon Solution: Best of Both Worlds: The hosts dive into the algorithm itself, detailing how NorMuon applies neuron-wise adaptive learning rates (similar to Adam-mini) *after* Muon's orthogonalization step. They discuss the intuition behind using second-order momentum to normalize the disparate scales of neuron updates.
• Engineering at Scale: FSDP2 and Distributed Newton-Schulz: The discussion shifts to the systems engineering required to make this work on large clusters. Linda explains how the authors implemented NorMuon under the FSDP2 framework, specifically how they distribute the expensive Newton-Schulz orthogonalization across devices to avoid redundant computation.
• Results and Verdict: Efficiency Gains: Norris reviews the empirical results, noting the 21% efficiency gain over Adam on 1.1B parameter models and the impressive memory savings. The episode concludes with a consensus that orthogonalization and adaptive scaling are complementary, not competitive, technologies.
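The two ingredients discussed above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the Newton-Schulz coefficients follow the widely used Muon-style quintic iteration, and the neuron-wise rescaling is my own minimal rendering of the idea of applying per-row adaptive learning rates after orthogonalization.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to a (semi-)orthogonal matrix, Muon-style."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # pushes singular values toward 1
    return X

def normuon_like_update(G, v, beta2=0.95, eps=1e-8):
    """Orthogonalize, then rescale each row (neuron) by a 2nd-moment estimate."""
    O = newton_schulz_orthogonalize(G)
    v = beta2 * v + (1 - beta2) * (O ** 2).mean(axis=1)  # per-neuron 2nd moment
    U = O / (np.sqrt(v)[:, None] + eps)                  # balance neuron norms
    U *= np.linalg.norm(O) / (np.linalg.norm(U) + eps)   # keep the overall scale
    return U, v

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))          # stand-in gradient for one weight matrix
U, v = normuon_like_update(G, v=np.zeros(8))
```

The point of the second function is exactly the episode's thesis: orthogonalization fixes the matrix geometry, and the row-wise rescaling then evens out the disparate neuron update norms it leaves behind.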

Dion- Distributed Orthonormalized Updates

0:18:40
In this episode:
• The GPU Bill Blues: Professor Norris laments the exorbitant cost of training large models, setting the stage for Linda to introduce the episode's focus: 'Dion: Distributed Orthonormalized Updates' by researchers from Microsoft and Harvard.
• Muon's Heavy Lifting: Linda explains the predecessor, the Muon optimizer, and its orthonormalization benefits. Norris questions why a new method is needed, leading to a discussion on how Newton-Schulz iterations become a communication bottleneck in sharded distributed training.
• Rethinking Linear Algebra: Linda details Dion's core innovation: replacing full matrix reconstruction with amortized power iteration on a momentum buffer. Norris is skeptical about the math, but Linda explains how this integrates cleanly with weight sharding.
• The Magic of Error Feedback: The hosts discuss the 'rank-fraction' parameter and how low-rank updates save compute. Linda explains the crucial role of 'error feedback' in maintaining accuracy, finally winning over Norris's skepticism.
• Lazy Updates and CPU Offloading: A look at the algorithmic flexibility of Dion, including 'Lazy-Dion' and CPU offloading variants. They discuss the experimental results showing Dion matching Muon's performance with significantly lower wall-clock time.
• Future-Proofing Optimization: Professor Norris admits the elegance of the solution. The pair wraps up with thoughts on how Dion might become the standard for training next-generation foundation models.
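To make the "amortized power iteration plus error feedback" idea tangible, here is a toy reconstruction. Everything here is my own simplification for illustration, not Dion's actual algorithm: one power-iteration step against a persistent basis extracts an orthonormalized rank-r update from the momentum buffer, and the unexplained residual is fed back into the buffer so nothing is silently discarded.

```python
import numpy as np

def dion_like_step(M, G, Q, mu=0.95):
    """One toy low-rank orthonormalized update step (rank r = Q.shape[1])."""
    M = mu * M + G                       # momentum accumulation
    P = M @ Q                            # amortized power iteration:
    P, _ = np.linalg.qr(P)               # one multiply, then orthonormalize
    R = M.T @ P
    Q_new, _ = np.linalg.qr(R)           # refreshed right basis for next step
    update = P @ Q_new.T                 # orthonormalized rank-r update
    M = M - P @ (P.T @ M)                # error feedback: subtract only the
                                         # part of M the update explained
    return update, M, Q_new
```

Because P and Q_new each have orthonormal columns, the update's nonzero singular values are exactly 1, which is the orthonormalization property the hosts credit to Muon, obtained here without reconstructing the full matrix.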

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v5

0:20:02

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v4

0:19:38

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v3

0:20:02

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v2

0:22:09

Key and Value Weights Are Probably All You Need

0:14:36
In this episode:
• Is Query Redundant?: Linda introduces a provocative paper suggesting a core part of the Transformer attention mechanism, the Query matrix, might be unnecessary. Professor Norris expresses his trademark skepticism about simplifying such a fundamental component.
• The Usual Suspects: Q, K, and V: Linda provides a quick, intuitive refresher on the roles of Query, Key, and Value matrices in self-attention. Professor Norris helps frame it with an analogy, emphasizing why each component has traditionally been considered essential.
• Disappearing Queries and Basis Transformations: Linda explains the paper's core theoretical claim that the Query matrix can be mathematically absorbed into other components through a change of basis. Professor Norris probes the 'simplifying assumptions,' like the absence of Layer Normalization, required for the proof to hold.
• Putting It to the Test: The discussion moves to the empirical results, where models trained without Query matrices perform surprisingly well. Linda details the crucial hyperparameter adjustments, which Professor Norris identifies as the key to bridging the gap between theory and practice.
• So, Is Query Really All You Don't Need?: The hosts debate the broader implications for parameter efficiency and our understanding of transformer architecture. They conclude by questioning if this simplification is an artifact of smaller models or a fundamental insight that will reshape future designs.

Latent State Models of Training Dynamics

0:12:26
In this episode:
• Why Does Seed 42 Work Best?: Linda introduces a paper that tries to answer a classic machine learning question: why does the random seed have such a big impact on training? Professor Norris laments that this is a problem as old as neural networks themselves.
• A Roadmap for Training: Linda explains the paper's novel approach of using a Hidden Markov Model to turn messy training dynamics into a clean 'training map' of latent states. Professor Norris expresses his surprise and curiosity at seeing a classic model like an HMM used to analyze modern deep learning.
• Taking the Scenic Route to Convergence: The hosts discuss the paper's key findings on 'grokking' tasks, where different random seeds lead to different paths on the training map. Linda explains the concept of 'detour states,' which are optional, slower paths to convergence that some models get stuck in.
• You Are the Traffic Controller: Professor Norris highlights the paper's powerful conclusion that training variability isn't inherent to a task, but a result of the training setup. Linda explains how removing components like batch normalization can create detours in stable tasks, while adding them can remove detours from unstable ones.
• Maps, Not Just Metrics: Linda and Professor Norris conclude by discussing the practical implications, such as a new way to analyze and compare hyperparameter settings by looking at the structure of their training maps.

The Coverage Principle- How Pre-training Enables Post-Training

0:15:19
In this episode:
• Why a Good Pre-trainer Isn't Always a Good Finetuner: The hosts introduce the puzzle of pre-training: why doesn't a lower cross-entropy loss always guarantee better performance after fine-tuning? They set the stage for today's paper which proposes a new perspective.
• Are We Covering Our Bases? The Coverage Principle: Linda explains the paper's central concept of 'coverage,' a metric that measures if a model assigns at least some probability to a wide range of high-quality responses, contrasting it with the pitfalls of cross-entropy.
• The Implicit Genius of Next-Token Prediction: The hosts dive into the paper's main theoretical result, explaining how the standard next-token prediction objective implicitly optimizes for good coverage, and why this metric is a much better predictor of downstream success than raw loss.
• From Theory to Practice: Interventions for Better Coverage: The discussion turns to practical applications, exploring the paper's proposed methods for actively improving coverage, including gradient normalization schemes and novel checkpoint selection strategies.
• What's Next for Coverage?: Professor Norris and Linda recap the key insight that coverage is a crucial link between pre-training and post-training success, and discuss the future research directions this new perspective opens up.

The Art of Scaling Reinforcement Learning Compute for LLMs

0:13:51
In this episode:
• The Art and Science of Scaling RL: Professor Norris and Linda introduce today's topic, a new paper from Meta that aims to make training large models with reinforcement learning more predictable and scientific.
• More Art than Science: Linda explains why scaling Reinforcement Learning is so difficult compared to pre-training, highlighting the lack of predictive scaling laws and the immense compute costs that sideline smaller research groups.
• Not a Power Law, but a Sigmoid: The hosts discuss the paper's core proposal: using a sigmoidal curve to model performance. Linda breaks down the key parameters like asymptotic performance (A) and compute efficiency (B), while Professor Norris relates it to human learning curves.
• The ScaleRL Cookbook: Linda walks through the 'ScaleRL' recipe, a combination of techniques discovered through a massive 400,000 GPU-hour study. They discuss the difference between choices that raise the performance ceiling versus those that just improve efficiency.
• Predictable Progress and The Bitter Lesson: The hosts discuss the implications of this work, such as enabling cheaper, more accessible research by extrapolating from small-scale experiments, and how it reinforces the 'bitter lesson' of prioritizing scalable methods.
• Next Week on Mechanical Dreams: Professor Norris and Linda wrap up their discussion on scaling RL and give a brief teaser for the topic of next week's episode.

Continual Learning via Sparse Memory Finetuning

0:13:33
In this episode:
• The Frozen Brains of AI: Linda introduces the problem of static LLMs and the challenge of 'catastrophic forgetting.' Professor Norris provides historical context on this long-standing issue in AI and introduces the day's paper on continual learning.
• Why Can't Models Just Keep Learning?: The hosts discuss traditional approaches to continual learning, like data replay and regularization. Linda explains why modern methods like LoRA, while better than full finetuning, still fall short of solving the forgetting problem.
• Memory and Sparsity: The Secret Sauce: Linda details the paper's main contribution: Sparse Memory Finetuning. She explains the concepts of memory layers and how the authors use a TF-IDF-like mechanism to identify and update only a tiny fraction of model parameters.
• Learning vs. Forgetting: The Showdown: Linda and Professor Norris analyze the paper's striking results, highlighting how the proposed method learns new facts effectively while forgetting dramatically less than both full finetuning and LoRA. They discuss the Pareto frontier plot as a key piece of evidence.
• What's Next for Lifelong Learners?: The hosts discuss the implications and future directions for this research, such as applying the technique to more complex skills beyond fact acquisition. They conclude that sparse updates are a promising path toward creating truly dynamic AI models.

DeepSeek OCR Paper

0:13:59
In this episode:
• A Picture is Worth a Thousand Tokens: The hosts introduce the challenge of long context in LLMs and present the paper's radical idea: compressing text by taking a picture of it.
• Compressing Text into Pixels: A deep dive into the main concept of optical compression, exploring how a page of text can be represented with far fewer vision tokens than text tokens.
• The Secret Sauce: DeepEncoder: An explanation of the novel 'DeepEncoder' architecture, which efficiently processes high-resolution images into a small number of vision tokens for the language model to read.
• The Proof is in the Pixels: Discussion of the experimental results, focusing on the impressive ~97% accuracy at a 10x compression ratio and its superior efficiency on industry benchmarks.
• Forgetting, The Smart Way: Exploring the broader implications of optical compression, particularly the paper's proposal to use it as a 'forgetting mechanism' for ultra-long contexts that mimics human memory.

Characterization and Mitigation of Training Instabilities in Microscaling Formats

0:13:44
In this episode:
• The Need for Speed: Microscaling Formats: Linda introduces new low-precision MX formats for training LLMs, designed to save massive amounts of compute. Professor Norris is intrigued but skeptical about the practical trade-offs.
• When Good Training Goes Bad: The hosts discuss the core problem identified in the paper: severe training instabilities and sudden, unrecoverable loss spikes when using MX formats, especially at scale.
• It's the Layernorm, Stupid!: Linda explains how the researchers used a proxy model to diagnose the instabilities, tracing the root cause to a systematic gradient bias from quantizing layernorm parameters.
• The Hybrid Solution: Professor Norris and Linda discuss the paper's proposed mitigations, focusing on a clever hybrid-precision approach that uses low-precision for weights and high-precision for activations.
• Precision on a Budget: The episode concludes by showing how these mitigation strategies successfully stabilize training, allowing for performance competitive with full-precision while still saving compute.

Demystifying Synthetic Data in LLM Pre-training- A Systematic Study of Scaling Laws, Benefits, and Pitfalls

0:14:20
In this episode:
• The Synthetic Data Gold Rush: The hosts introduce the data scarcity problem for training large language models and present today's paper, which systematically investigates synthetic data as a potential solution.
• Real Fake Data: What Kinds Are We Talking About?: Linda breaks down the different types of synthetic data studied, including rephrased web text and entirely novel 'synthetic textbooks', while Professor Norris questions the quality of this model-generated content.
• The Secret Sauce: How Much Synthetic is Too Much?: Discussion of the paper's core finding: a 'good' mixture of ~30% rephrased synthetic data with natural web text can accelerate pre-training by up to 10x, whereas 100% synthetic data offers no advantage.
• Does a Bigger Generator Mean Better Data?: The hosts explore the paper's counter-intuitive discovery that using an 8B parameter model to generate data can outperform a much larger 70B model, challenging the 'bigger is always better' intuition.
• Takeaways: A Measured Dose of Artificial Text: Professor Norris and Linda summarize the practical takeaways: synthetic data is a powerful but nuanced tool, not a silver bullet. The right type, mixture, and generator model are key to accelerating training.

Drop-Muon- Update Less, Converge Faster

0:11:56
In this episode:
• Introduction: Less is More in Optimization?: Professor Norris and Linda introduce the 'Drop-Muon' paper, which challenges the fundamental assumption that all neural network layers must be updated at every training step. They set the stage by questioning whether selectively updating layers could lead to faster convergence.
• A Refresher on the Muon Family: Linda provides a high-level overview of modern non-Euclidean optimizers like Muon, Scion, and Gluon. They discuss how these methods use layer-specific geometry to improve training, which provides the foundation for the Drop-Muon approach.
• The Drop-Muon Algorithm: Randomized Progressive Training: Linda explains the core mechanism of Drop-Muon, focusing on how it samples a random subset of layers to update at each iteration. Professor Norris probes the practicalities of this approach, especially the concept of 'Randomized Progressive Training' and its computational cost.
• The Theoretical Justification: When is Full-Network Update Optimal?: The hosts delve into the paper's theoretical contributions, highlighting the key finding that full-network updates are only optimal under a very restrictive and unlikely condition on layer smoothness constants. They discuss the implications of the cost model, which accounts for backpropagation and parameter update costs.
• Empirical Results and Final Thoughts: Linda presents the experimental results, which show Drop-Muon achieving the same accuracy as standard Muon up to 1.4x faster in wall-clock time. They conclude by discussing the practical impact of this 'update less, converge faster' strategy for training large models.
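The sampling mechanism described above is easy to sketch. This is an illustration of randomized layer-subset updates in the spirit of Drop-Muon, not the authors' implementation: for simplicity each layer is updated independently with probability `p_update` via plain gradient descent, whereas the paper's 'Randomized Progressive Training' uses structured sampling and Muon-style updates.

```python
import random

def drop_muon_step(layers, grads, lr=0.1, p_update=0.5, rng=random):
    """Update each layer independently with probability p_update.

    Skipped layers keep their parameters, saving backprop and optimizer
    work for that step. Returns the names of the layers that were updated.
    """
    updated = []
    for name in layers:
        if rng.random() < p_update:
            layers[name] = [w - lr * g for w, g in zip(layers[name], grads[name])]
            updated.append(name)
    return updated

rng = random.Random(0)
layers = {f"layer{i}": [1.0, 2.0] for i in range(4)}
grads = {f"layer{i}": [0.5, 0.5] for i in range(4)}
print(drop_muon_step(layers, grads, rng=rng))  # names of the sampled layers
```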

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

0:13:09
In this episode:
• The Finicky Diet of Large Language Models: Linda introduces a paper about how LLMs learn from mixtures of web data and high-quality data. Professor Norris expresses his initial intuition that more data is always better, setting the stage for the paper's surprising findings.
• It's Not a Slope, It's a Cliff: Unveiling Phase Transitions: The hosts discuss the paper's core finding: knowledge acquisition isn't gradual but exhibits sudden 'phase transitions'. Linda explains how, below a critical model size or data mixing ratio, models learn almost nothing from specialized datasets, a result Professor Norris finds both fascinating and counter-intuitive.
• The Knapsack Theory of Knowledge: To explain the 'why', Linda and Professor Norris explore the paper's theoretical model of 'capacity allocation'. They use a knapsack analogy to describe how a model with finite capacity strategically decides which data is 'worth' learning to minimize overall loss.
• Learning More by Training on Less?: Linda and Professor Norris discuss the practical implications, including the paradoxical strategy of throwing away data to improve learning. They cover the paper's proposed solutions, like random subsampling and Compact Knowledge Mixing, and what this means for data curation.
• Final Thoughts and Critical Points: The hosts summarize the paper's key insight: data mixing recipes are not one-size-fits-all, and the relationship between model size, data, and knowledge is sharp and discontinuous. They wrap up by emphasizing the importance of understanding these dynamics for efficient model training.
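The knapsack analogy can be run as a toy computation. All numbers below are my own invented illustration, not the paper's model: a fixed capacity is allocated greedily to knowledge 'chunks' ranked by loss reduction per unit of capacity, and because a chunk's benefit scales with how often it appears in the mix, the specialized chunks go from never chosen to heavily chosen at a critical mixing ratio.

```python
def learned_fraction(mix_ratio, capacity=50):
    """Fraction of specialized chunks a capacity-limited model stores."""
    # 100 web chunks: benefit proportional to their (1 - mix_ratio) share,
    # each costing 1 unit of capacity. 100 specialized chunks: benefit
    # proportional to mix_ratio, each costing 2 units (denser to store).
    chunks = [((1 - mix_ratio) * 1.0, 1, "web") for _ in range(100)]
    chunks += [(mix_ratio * 1.5, 2, "special") for _ in range(100)]
    # greedy knapsack: highest benefit per unit of capacity first
    chunks.sort(key=lambda c: c[0] / c[1], reverse=True)
    used, learned = 0, 0
    for benefit, cost, kind in chunks:
        if used + cost <= capacity:
            used += cost
            learned += kind == "special"
    return learned / 100

# Below the critical ratio nothing specialized is learned; above it, a
# fixed block of specialized knowledge is: a cliff, not a slope.
print(learned_fraction(0.5), learned_fraction(0.7))
```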

Apertus Tech Report

0:13:11
In this episode:
• Another Week, Another 'Open' Model?: Linda introduces the Apertus paper, framing it as a response to the systemic shortcomings of current open models. Professor Norris questions what makes this one different from the countless other 'open' releases.
• Data Compliance and the Goldfish in the Machine: The hosts dive into Apertus's strict data compliance, including its novel retroactive application of robots.txt and the use of the 'Goldfish' training objective to prevent the model from memorizing its training data.
• More Than Just English: A Truly Global LLM: Linda gets excited about the model's vast multilingual capabilities, trained on over 1800 languages. They discuss the implications for low-resource languages and the significance of a 40% non-English training data mix.
• The Swiss AI Charter and Other Training Secrets: The discussion turns to the technical details of training Apertus, including its unique optimizer and its novel approach to safety alignment using a 'Swiss AI Charter' for controversial topics.
• Final Thoughts: A New Standard for Openness?: Professor Norris and Linda summarize Apertus's contributions, concluding that its commitment to compliance, multilingualism, and full transparency sets a powerful new benchmark for the entire field.

Learning Facts at Scale with Active Reading

0:14:46
In this episode:
• The Forgetful Student: Professor Norris and Linda introduce the central problem: Large Language Models often struggle to reliably learn and recall facts. They set the stage for this week's paper, which proposes a solution inspired by how humans study.
• Learning by Self-Teaching: Linda explains the core concept of 'Active Reading,' where a model generates its own diverse study materials like timelines, summaries, and associations to internalize knowledge from a given text.
• From 16% to 66% Accuracy: The hosts dive into the stunning results, where Active Reading drastically outperforms methods like simple repetition or standard data augmentation on expert QA benchmarks, showing massive gains in factual recall.
• A Trillion Tokens of Homework: The discussion turns to scaling this method to the whole of Wikipedia, creating an 8-billion parameter 'WikiExpert' model that punches far above its weight, and the surprising training tweaks needed to make it work.
• The Self-Taught Model: Professor Norris and Linda wrap up by reflecting on the key insight that models learn best when they teach themselves. They discuss the implications for building more reliable and factual AI systems.

Fantastic Pretraining Optimizers and Where to Find Them

0:13:51
In this episode:
• The Optimizer Royal Rumble: Professor Norris and Linda introduce the chaotic landscape of LLM optimizers, where everyone claims to beat the reigning champion, AdamW. They introduce today's paper, which aims to be the referee in this messy fight.
• The Art of the Unfair Comparison: Linda explains the paper's core thesis: many new optimizers seem fast only because they are compared against poorly tuned baselines. Professor Norris agrees, highlighting the critical importance of fair hyperparameter tuning.
• Diminishing Returns and Shifting Allegiances: The hosts dive into the paper's main findings, discussing how the speedup of new optimizers shrinks with model size and how the 'best' optimizer can change depending on the amount of training data.
• So... Do We Ditch AdamW?: Norris and Linda synthesize the practical takeaways for practitioners. They conclude that while AdamW's dominance is challenged, the victory of its rivals is not as clear-cut as claimed, praising the paper for its methodological rigor.

Benchmarking Optimizers for Large Language Model Pretraining

0:16:18
In this episode:
• Beyond Adam: The Great Optimizer Bake-Off: Linda introduces a paper questioning the decade-long reign of the AdamW optimizer for training large language models. Professor Norris expresses his healthy skepticism about the endless stream of 'new and improved' optimizers.
• Adam's Kingdom and Its Challengers: The hosts discuss why AdamW became the default and the paper's motivation: the lack of systematic, fair comparisons between the many new optimizers claiming to be better. Professor Norris recalls past optimizer fads.
• Creating a Level Playing Field: Linda details the paper's rigorous experimental setup, covering the 11 optimizers tested and the massive hyperparameter tuning effort required for a fair fight. Professor Norris is impressed by the scale of the benchmark.
• And the Winner Is... It's Complicated: Linda reveals the main results, highlighting that AdEMAMix and MARS are the new frontrunners, especially at scale. They break down the results from the paper's many graphs, discussing where different optimizers shine.
• Actionable Advice for the Practitioner: Professor Norris and Linda distill the paper's 'takeaways' into practical advice for listeners. They discuss the critical and often overlooked role of weight decay, learning rate schedules, and warmup.
• The Optimization Frontier: The hosts conclude that while AdamW's dominance is over, the best optimizer is context-dependent. They wrap up by discussing the paper's impact and the future of optimization research.

Learning Facts at Scale with Active Reading.old

0:15:42

Apertus Tech Report.old

0:12:53
In this episode:
• Opening Up a New Chapter: Apertus: Linda introduces the Apertus paper, highlighting its focus on data compliance and extreme multilingualism. Professor Norris is intrigued but skeptical about what 'fully open' and 'compliant' truly mean in practice.
• Clean Data, Clear Conscience?: The hosts discuss Apertus's novel approach to data compliance, including retroactively honoring robots.txt opt-outs. They debate the ethical implications and the performance trade-offs of training on a more restricted, 'cleaner' dataset.
• Speaking Over 1800 Languages: Linda explains the massive scale of Apertus's multilingual training, with 40% of its data being non-English. Professor Norris questions the depth versus breadth of language understanding, especially for the thousands of low-resource languages included.
• Forgetting for a Better Future: The Goldfish Loss: The conversation turns to the technical recipe, focusing on the 'Goldfish objective' designed to prevent memorization. Professor Norris finds the name amusing and probes whether this technique genuinely reduces copyright and privacy risks without harming the model's capabilities.
• The Verdict on Apertus: Linda and Professor Norris wrap up by evaluating Apertus's position in the LLM landscape. They conclude that its commitment to full transparency—releasing code, data scripts, and checkpoints—sets a new, important standard for the open-source community.

Benchmarking Optimizers for Large Language Model Pretraining.old

0:16:39

Fantastic Pretraining Optimizers and Where to Find Them.old

0:14:30

The Pitfalls of Next-Token Prediction

0:10:58
In this episode:
• Introduction: More Than an Improv Artist?
• The Two Faces of Prediction: Linda breaks down the key difference between teacher-forced training and autoregressive inference. Professor Norris likens this to learning to drive with an instructor who constantly corrects you, which doesn't prepare you for driving alone.
• The Clever Hans Cheat: Linda explains the paper's core concept: 'teacher-forcing' can lead to a 'Clever Hans cheat,' where the model learns shortcuts from the training data instead of the actual task. This results in the first, most crucial token of a plan becoming 'indecipherable.'
• Breaking the Cheat Code: The hosts discuss the paper's proposed solutions, like 'teacherless training,' which force the model to look ahead and plan instead of relying on shortcuts. Professor Norris notes this is like removing the training wheels, forcing the model to learn the hard way.
• Conclusion: Recalibrating the Paradigm: Linda and Professor Norris conclude that the paper makes a strong, empirically-backed case that the teacher-forcing training objective is a core limitation. They agree it's a pivotal step in moving the field towards models that can genuinely plan.
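The train/inference mismatch the hosts describe can be made concrete with a toy sketch (illustrative only; the helper names and the counting "model" are made up for this example, not from the paper):

```python
def teacher_forced_prefixes(tokens):
    """Training-time view: at every step the model conditions on the
    gold prefix, regardless of what it would have predicted itself."""
    return [tokens[:i] for i in range(1, len(tokens))]

def autoregressive_rollout(model, prompt, n_steps):
    """Inference-time view: the model conditions on its OWN outputs,
    so an early mistake contaminates every later step."""
    seq = list(prompt)
    for _ in range(n_steps):
        seq.append(model(seq))
    return seq

# A toy "model" that continues a counting sequence.
count_up = lambda seq: seq[-1] + 1
rollout = autoregressive_rollout(count_up, [0], 3)  # -> [0, 1, 2, 3]
```

During teacher forcing the model is only ever graded on one-step continuations of gold prefixes; at inference it must live with its own history, which is exactly the gap the paper's 'teacherless training' targets.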

Large Language Models and Games

0:17:02
In this episode:
• More Than Just NPCs: Linda introduces this week's paper, a comprehensive survey on the roles of Large Language Models in games. Professor Norris provides some witty historical context on game AI, setting the stage for the discussion.
• What Counts as 'Large'?: The hosts discuss the paper's specific definition of a Large Language Model. They cover why establishing a clear scope, centered on transformer models of a certain size, is crucial for a useful survey.
• The LLM as a Player: Linda breaks down how LLMs can act as players in different types of games, from board games like Chess to complex API-driven worlds like Minecraft's VOYAGER. Professor Norris questions the path to superhuman performance.
• The Supporting Cast: The discussion moves beyond players to other in-game roles for LLMs. This includes creating dynamic Non-Player Characters (NPCs), acting as helpful Player Assistants, and even generating commentary for game streams.
• The LLM as Creator: The hosts explore the creative roles for LLMs, such as acting as a Game Master in tabletop role-playing games or as a Design Assistant that collaborates with human developers to generate levels and game concepts.
• Hallucinations, Copyright, and the Future: Professor Norris and Linda conclude by discussing the paper's roadmap, highlighting the key limitations like hallucinations and the significant ethical questions around copyright and bias that the field must address.

UQ- Assessing Language Models on Unsolved Questions

0:13:35
In this episode:
• The Benchmark Treadmill: Linda introduces the problem with existing ML benchmarks, noting they are often either too easy or too artificial. Professor Norris adds witty commentary on how quickly new models seem to 'solve' and saturate these tests.
• Let's Ask the Unanswerable: Linda presents the core idea from the UQ paper: evaluating models on genuinely unsolved questions from platforms like Stack Exchange. Professor Norris and Linda discuss how this hits a sweet spot between being difficult and realistic.
• How to Find a Good Unsolved Question: The hosts dive into the meticulous creation of the UQ-Dataset. Linda explains the three-stage filtering pipeline, and Professor Norris expresses his appreciation for the rigor involved in finding high-quality, truly unsolved problems.
• Who Validates the Validator?: With no ground truth answers, how do you score the models? Linda explains the clever 'UQ-Validator' system and the 'generator-validator gap,' while Professor Norris highlights the crucial role of the community platform for human verification.
• Pushing the Frontier of Knowledge... Slowly: Linda and Professor Norris review the humbling results, where even top models pass the validator on only 15% of questions. They discuss the implications of this new, more challenging evaluation paradigm for the future of AI research.

Signal and Noise- A Framework for Reducing Uncertainty in Language Model Evaluation

0:14:20
In this episode:
• The Billion-Dollar Guessing Game: Professor Norris and Linda introduce the high-stakes problem of LLM evaluation. Linda presents today's paper, which offers a framework to make our small-scale experiments more predictive of large-scale success.
• Tuning In the Signal, Tuning Out the Noise: Linda breaks down the paper's core concepts: 'signal' as a benchmark's ability to distinguish models and 'noise' as its random variability. Professor Norris helps clarify with analogies, questioning if it's really that simple.
• From Lab Coat to Crystal Ball: The hosts discuss how the Signal-to-Noise Ratio (SNR) predicts real-world outcomes, like whether a good small model scales up well (decision accuracy) and how accurately we can predict future performance (scaling law error).
• Three Simple Tricks to a Better Benchmark: Linda enthusiastically details the paper's three practical interventions for improving benchmarks: filtering noisy subtasks, averaging final checkpoints, and switching to continuous metrics like bits-per-byte.
• The Sound of a Clear Signal: Professor Norris and Linda recap the main lesson: when choosing or creating a benchmark, aim for high signal and low noise. They conclude that this simple framework provides a powerful, practical tool for the entire ML community.
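The signal/noise framing above can be sketched in a few lines. This is a simplified illustration, not the paper's actual estimator: here "signal" is taken as the spread of final scores across different models, and "noise" as the score variability across a single model's final checkpoints; the function name and the numbers are invented for the example.

```python
import statistics

def benchmark_snr(model_scores, checkpoint_scores):
    """Rough signal-to-noise ratio of a benchmark.

    model_scores: final score of each model under comparison
        (their spread is the 'signal': how well the benchmark
        separates models).
    checkpoint_scores: scores of one model's last few checkpoints
        (their spread is the 'noise': random variability).
    """
    signal = max(model_scores) - min(model_scores)
    noise = statistics.stdev(checkpoint_scores)
    return signal / noise

# Benchmark A separates models widely and is stable across checkpoints.
snr_a = benchmark_snr([0.42, 0.55, 0.61, 0.70], [0.60, 0.61, 0.61, 0.60])
# Benchmark B barely separates models and is jittery checkpoint-to-checkpoint.
snr_b = benchmark_snr([0.50, 0.51, 0.52, 0.53], [0.49, 0.53, 0.47, 0.52])
```

Under this toy definition, benchmark A would be the safer basis for a scaling decision; the paper's interventions (filtering subtasks, checkpoint averaging, continuous metrics) all aim to push a benchmark toward the A regime.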

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

0:14:43
In this episode:
• Can a Jack-of-All-Trades Learn to Be a Doctor?: Linda introduces the challenge of specializing large language models for specific domains and presents a new paper that proposes a smarter way to pick the right training data.
• The Deceptive First Sip: The hosts discuss why the common practice of 'micro-annealing'—using small-scale tests to evaluate data sources—can be misleading, as the best data for a short run may not be the best for a long one.
• Plotting the Curve, Not Just the Point: Linda explains the paper's core proposal: instead of relying on a single test, estimate a scaling law for each data source by running multiple experiments to predict its utility at scale.
• The Tortoise and the Hare of Data: Professor Norris and Linda dive into the paper's key experiment, revealing how synthetic data (the hare) starts fast but is overtaken by more diverse, filtered data (the tortoise) as compute increases.
• Scaling Smartly: The Takeaway: The hosts conclude by emphasizing the practical importance of scaling-aware data selection to avoid wasting significant compute and money on suboptimal data strategies.
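The tortoise-and-hare dynamic can be illustrated by fitting a per-source power law and extrapolating, as the paper proposes. The numbers below are invented to show the shape of the argument; they are not results from the paper.

```python
import numpy as np

# Hypothetical validation losses at small compute budgets (illustrative only).
compute = np.array([1e18, 2e18, 4e18, 8e18])          # training FLOPs
loss_synthetic = np.array([2.80, 2.70, 2.63, 2.58])   # fast start, flat slope
loss_filtered  = np.array([2.95, 2.80, 2.66, 2.53])   # slow start, steep slope

def fit_power_law(c, l):
    """Fit log L = log a - b * log C, i.e. L(C) = a * C**(-b)."""
    slope, log_a = np.polyfit(np.log(c), np.log(l), 1)
    return np.exp(log_a), -slope

def predict_loss(c, l, target_compute):
    a, b = fit_power_law(c, l)
    return a * target_compute ** (-b)

# A single small-scale test ('micro-annealing') would pick synthetic data,
# but extrapolating each source's curve to 100x the budget flips the ranking.
pred_syn = predict_loss(compute, loss_synthetic, 1e20)
pred_fil = predict_loss(compute, loss_filtered, 1e20)
```

The point is that a single small-budget measurement compares intercepts, while the fitted exponent captures which source keeps paying off as compute grows.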

Thinking Like Transformers

0:13:19
In this episode:
• The Transformer's Black Box: Linda introduces the 'Thinking Like Transformers' paper, highlighting the challenge of understanding the computational model behind transformers, unlike RNNs and their connection to finite state machines. Professor Norris agrees, sharing a witty remark about the opacity of modern deep learning models.
• Introducing RASP: A Language for Transformers: Linda explains the core concept of RASP (Restricted Access Sequence Processing Language), a programming language designed to mirror the information flow of a transformer. She details the main operations: element-wise computations, and the crucial 'select' and 'aggregate' pair that mimics attention.
• From Code to Heads: RASP in Action: To make the concepts concrete, Linda walks through a simple RASP program from the paper, such as creating a histogram of tokens. They discuss the key insight that a RASP program can be 'compiled' to estimate the number of layers and attention heads a transformer would need for the task.
• Implications and Insights: The hosts explore the broader implications of the RASP model, such as analyzing the expressive power of restricted-attention models and explaining empirical results like the 'Sandwich Transformer'. Professor Norris is particularly intrigued by how this formal model can explain real-world phenomena.
• Thinking Like a Researcher: Professor Norris and Linda summarize the paper's contributions, agreeing that RASP provides a powerful conceptual tool for reasoning about transformer capabilities. Linda concludes by mentioning the publicly available RASP REPL for listeners who want to experiment themselves.
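The select/aggregate pair Linda describes can be emulated in plain Python. This is a toy imitation of RASP's semantics, not the actual RASP language or REPL; note that real RASP's aggregate only averages, so its histogram program uses a BOS-based averaging trick rather than the direct row-sum shortcut taken here.

```python
def select(keys, queries, predicate):
    """RASP-style select: a boolean 'attention' matrix whose entry
    [q][k] is True when predicate(keys[k], queries[q]) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """RASP-style aggregate: average the selected values per query,
    mirroring attention's weighted average over selected positions."""
    out = []
    for row in selector:
        picked = [v for v, sel in zip(values, row) if sel]
        out.append(sum(picked) / len(picked) if picked else 0)
    return out

def histogram(tokens):
    """hist[i] = number of occurrences of tokens[i] in the sequence,
    counted by summing each row of a same-token selector."""
    same = select(tokens, tokens, lambda k, q: k == q)
    return [sum(row) for row in same]

histogram(list("hello"))  # -> [1, 1, 2, 2, 1]
```

The insight the hosts discuss is that each select/aggregate pair corresponds to an attention head, so counting these pairs in a program estimates the heads and layers a transformer would need for the task.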

Kimi K2

0:13:29
In this episode:
• Honey, I Shrunk the Activated Parameters: Linda introduces the massive Kimi K2 paper, focusing on its 'agentic intelligence' and surprisingly small number of activated parameters. Professor Norris offers his initial witty skepticism about yet another trillion-parameter model.
• Taming the Exploding Logits: The hosts get technical, discussing the novel MuonClip optimizer designed to solve training instability. They also explore the clever pre-training data strategy of 'rephrasing' to maximize token utility from a limited data pool.
• Teaching a Model to Use Tools: This chapter focuses on post-training, where Linda explains the large-scale synthetic data pipeline for teaching tool use. They also delve into the reinforcement learning framework that combines verifiable rewards with a self-critique mechanism.
• Climbing the Leaderboard: Linda and Professor Norris unpack Kimi K2's impressive benchmark performance, highlighting its state-of-the-art results on agentic and coding tasks. They conclude with final thoughts on what this powerful open-weight model means for the field.

ERNIE Technical Report

0:11:17
In this episode:
• A New ERNIE on the Block: Linda introduces the new ERNIE 4.5 technical report from Baidu, setting the stage for a discussion on their new family of large-scale foundation models, including their massive 424 billion parameter Mixture-of-Experts model.
• Not Your Average MoE: The hosts discuss the core concept of ERNIE 4.5: its Mixture-of-Experts (MoE) architecture. Linda explains the novel 'heterogeneous' structure with modality-specific experts for vision and text, and Professor Norris comments on the implications for training stability.
• Building a Multimodal Beast: A deep dive into the specific architectural components that enable ERNIE's multimodality. This chapter covers the adaptive-resolution vision encoder, timestamp rendering for video, and the unified 3D positional embeddings for handling text, images, and video seamlessly.
• Training at Scale, Efficiently: Professor Norris and Linda unpack the impressive engineering behind training ERNIE 4.5. They cover the multi-stage training recipe, novel loss functions like Router Orthogonalization, and the remarkable 47% Model FLOPs Utilization.
• From Lab to Production: The discussion shifts to practical applications and deployment. The hosts talk about the aggressive W4A8 and 2-bit quantization schemes, impressive inference speeds, and the open-sourcing of models and toolkits like ERNIEKit and FastDeploy.
• Final Thoughts and Takeaways: Professor Norris and Linda share their final thoughts on the ERNIE 4.5 paper, highlighting its key contributions in efficient multimodal training and the importance of its open-source release for the research community.

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

0:14:30

Gemini 2.5

0:11:12

How new data permeates LLM knowledge and how to dilute it

0:11:18

Harnessing the Universal Geometry of Embeddings

0:12:24

Model Merging in Pre-training of Large Language Models

0:10:57

Learning Dynamics in Continual Pre-Training for Large Language Models

0:11:10

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs

0:11:50

Scalable-Softmax Is Superior for Attention

0:10:20

Breast Cancer Recurrence Prediction

0:10:20

Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research

0:10:27

Native Sparse Attention

0:11:40

Critical Batch Size Revisited

0:10:57

Base of RoPE Bounds Context Length

0:11:03

Rope to Nope and Back Again

0:12:11

Training Deep Learning Models with Norm-Constrained LMOs

0:11:10

SkyLadder

0:12:25

LLMs on the Line

0:09:30

The Leaderboard Illusion

0:10:16

Why Linearly Decaying the Learning Rate to Zero Works Best

0:09:01

Not All Data Are Unlearned Equally

0:12:38

A Multi-Power Law for Loss Curve Prediction

0:12:31

Efficient Training of Ultra-Long Context Large Language Models

0:10:48

Multi-Token Attention

0:15:04

From Style to Facts

0:10:50

Compute Optimal Scaling of Skills

0:09:10

Predictive Data Selection

0:08:43

Continual Pre-training of MoEs

0:10:42

s1 - Simple test-time scaling

0:10:38

Cognitive Behaviors that Enable Self-Improving Reasoners

0:07:21

Phi 4 Multimodal Instruct

0:11:27

Claude 3.7 Sonnet System Card

0:09:22

Project Sid: Many-agent simulations toward AI civilization

0:10:16

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

0:09:07

Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

0:08:18

NExtLong - Toward Effective Long-Context Training without Long Documents

0:11:38

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

0:12:37

Over-Tokenized Transformer

0:10:55

HashAttention: Semantic Sparsity for Faster Inference

0:11:05

From Tokens to Words

0:14:05

DeepSeek V3

0:11:11

Optimal Linear Decay Learning Rate Schedules and Further Refinements

0:18:28
In this episode:
• The Death of Cosine?: Introduction to the episode and the paper. Professor Norris expresses his skepticism about changing established habits like Cosine Annealing, while Linda teases a shake-up in the status quo.
• Theory vs. Reality: A discussion on the massive gap between theoretical learning rates (like 1/t) and what practitioners actually use. Linda explains why the theory has historically failed to match practice.
• Linear Decay Takes the Crown: Linda explains the paper's core theoretical finding: that a simple linear decay is optimal for the last iterate of SGD, challenging the dominance of Cosine Decay.
• Refining the Schedule: Deep dive into the 'Refinement' technique where past gradient norms dictate the future schedule. Discussion on how 'warm-up' naturally emerges from the mathematics rather than being a heuristic hack.
• The Verdict and The Future: Reviewing the experimental results across Vision and LLMs. Final thoughts on whether practitioners should actually switch to Linear Decay or Refined schedules.
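The two baseline schedules under discussion are easy to write down. This sketch shows only the plain linear decay the paper favors against the cosine incumbent; the gradient-norm-based 'refinement' schedules are not reproduced here.

```python
import math

def linear_decay(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup followed by linear decay to zero at the final step."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - frac)

def cosine_decay(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup followed by cosine annealing to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```

Cosine stays near the peak longer and plunges late, while linear sheds learning rate at a constant pace; the paper's claim is that the latter is the better match for the last iterate of SGD.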

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

0:09:34

Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale

0:11:49

Phi-4

0:09:05

Rephrasing natural text data with different languages and quality levels

0:11:02

Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs

0:11:59

EXAONE 3.5

0:08:59

Model soups - averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

0:06:15

Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

0:13:14

Nemotron-CC

0:12:39

Tülu 3

0:12:03

The Zamba2 Suite

0:13:04

Small-scale proxies for large-scale Transformer training instabilities

0:10:07

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

0:10:29

I slightly tweaked the personality of the hosts.

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

0:08:34

Understanding WSD Learning Rates

0:09:10

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

0:07:30

New generation algorithm! Should make the episodes longer, more detailed, and more coherent.

Amuro & Char - Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

0:13:36

Evaluation data contamination in LLMs: How do we measure it and (when) does it matter?

0:06:28

How Does Critical Batch Size Scale in Pre-training?

0:07:41

The Road Less Scheduled

0:08:54

Learning-Rate-Free Learning by D-Adaptation

0:04:37

Scaling FP8 Training to Trillion Token LLMs

0:09:55

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers

0:15:44

A Survey on Model MoErging

0:08:58

Liquid Time-constant Networks

0:08:11

Scaling Laws for Predicting Downstream Performance in LLMs

0:10:07

A Spectral Condition for Feature Learning

0:16:53

Don't decay the learning rate

0:07:07

A classic paper about learning rates.

OLMoE

0:06:59

professor norris: Welcome back to Mechanical Dreams, the podcast where we delve into the exciting world of machine learning and natural language processing. I'm Professor Norris, and as always, I'm joined by my brilliant student, Linda.

linda: It's great to be back, Professor. And I'm particularly excited about today's paper. It tackles a topic that's been buzzing in the NLP community: Mixture-of-Experts models, or MoEs for short.

professor norris: Ah yes, MoEs. I remember when they were a promising but somewhat fringe concept. It seems they're making a comeback, especially with industry giants like Google incorporating them into their frontier models.

linda: Exactly! And that's what makes today's paper so intriguing. It's not just about pushing the boundaries of MoE performance but also about making this technology accessible to the wider research community. The paper is titled "OLMOE: Open Mixture-of-Experts Language Models."

professor norris: Open, you say? That's certainly a welcome change in a field often dominated by closed-source, proprietary models. What makes OLMOE so open, Linda?

linda: Well, Professor, the authors have gone above and beyond the usual practice of just releasing model weights. They've open-sourced everything: the model weights, the training data, the code, and even the training logs.

professor norris: That's remarkable! Such transparency is crucial for advancing our understanding of MoEs, which, as you know, introduce a whole new layer of complexity to language modeling. Tell me, Linda, what are some of the key design decisions involved in building a successful MoE model?

linda: That's a great question, Professor. One of the primary decisions is determining the number of experts and how many of those experts are activated for each input. There's also the question of expert granularity: should we use a few large experts or many smaller ones? And then there's the routing algorithm, which decides how to assign inputs to the appropriate experts.

professor norris: These are indeed crucial decisions. And if I recall correctly, there's also the matter of whether to share experts across layers, right?

linda: Absolutely, Professor. That's another important design choice that can significantly impact performance.

professor norris: So, how does OLMOE approach these design challenges, Linda?

linda: OLMOE-1B-7B, the specific model they focus on, has a total of 7 billion parameters, but only 1.3 billion are active for each input. This makes it comparable to dense models with around 1 billion parameters in terms of inference cost.

professor norris: That's clever. They're essentially trying to achieve the efficiency of a smaller model while leveraging the capacity of a much larger one.

linda: Precisely! And they've opted for a fine-grained approach with 64 small experts per layer, out of which 8 are activated. They use a token-based routing algorithm called "dropless" to assign inputs to experts.

professor norris: And do they share experts across layers, Linda?

linda: No, Professor. They found that sharing experts didn't provide any significant benefits.
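The configuration Linda describes, 64 experts per layer with 8 activated per token, amounts to a top-k router. Here is a minimal NumPy illustration of that routing step (an invented sketch for intuition, not the OLMoE implementation, which additionally handles load balancing and the 'dropless' guarantee):

```python
import numpy as np

def topk_route(router_logits, k=8):
    """Pick the top-k experts per token and softmax-normalize their weights.

    router_logits: (num_tokens, num_experts) scores from a learned router.
    Returns the chosen expert indices and the mixing weights used to
    combine those experts' outputs."""
    top = np.argsort(router_logits, axis=-1)[:, -k:]          # k best experts per token
    picked = np.take_along_axis(router_logits, top, axis=-1)  # their logits
    picked = picked - picked.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(picked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 64))      # 4 tokens, 64 experts in the layer
experts, weights = topk_route(logits)  # each token is sent to 8 of the 64
```

Only the 8 selected experts run for each token, which is why the model's inference cost tracks the 1.3B active parameters rather than the 7B total.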

professor norris: Interesting. So, how well does OLMOE perform compared to other models, both dense and MoE?

linda: Well, it significantly outperforms all open 1-billion parameter models. It even achieves competitive performance on common benchmarks like MMLU compared to dense models with significantly higher inference costs, like the Llama2-13B.

professor norris: That's quite impressive! And what about after adaptation with instruction tuning and preference tuning?

linda: They create OLMOE-1B-7B-INSTRUCT, which further improves performance and even exceeds larger instruct models, including the Llama2-13B-Chat and DeepSeekMoE-16B, on various benchmarks.

professor norris: Remarkable! It

An Empirical Model of Large Batch Training

0:11:32

First attempt to automatically generate a podcast from a paper. This one is way too short, but it's a start.