The video explains why the classic U‑shaped “bias–variance tradeoff” story in many ML books is incomplete, and introduces double descent using both deep networks and polynomial curve fitting as examples.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
Below is a detailed outline, followed by the researchers mentioned.
## High-level structure
- **0:00–3:43 – Intro & classic ML books**
- Shows several canonical ML texts and the familiar plot: model size on the x‑axis, training/test error on the y‑axis.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Explains the classic story: as model complexity increases, test error decreases, reaches a minimum, then rises again due to overfitting, illustrated with polynomial curve fitting.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **3:43–6:45 – AlexNet & regularization**
- Describes AlexNet (about 60M parameters) and how Krizhevsky, Sutskever, and Hinton used data augmentation, dropout, and weight decay (an L2 penalty on the weights, i.e., ridge regression in the linear case; a minimal sketch follows below) to fight overfitting.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Emphasizes the takeaway many practitioners internalized: large neural nets sit in the overfitting regime and need strong regularization; in that regime, pushing training error lower is assumed to come at the cost of higher test error.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
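The weight decay mentioned above is simply an L2 penalty on the weights; for a linear model it reduces to ridge regression with a closed-form solution. A minimal sketch, with toy quadratic data and λ values of my own choosing (not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a quadratic (made up for illustration).
x = np.linspace(-1, 1, 12)
y = 1.5 * x**2 - 0.5 * x + 0.3 + rng.normal(scale=0.1, size=x.size)
X = np.vander(x, N=10, increasing=True)   # degree-9 features: easy to overfit 12 points

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (lam = 0 gives ordinary least squares)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_plain = ridge_fit(X, y, lam=0.0)
w_decay = ridge_fit(X, y, lam=1e-2)       # "weight decay" shrinks the coefficient norm
print("||w|| without decay:", np.linalg.norm(w_plain))
print("||w|| with decay:   ", np.linalg.norm(w_decay))
```

The same penalty shows up in neural-net training as the extra λw term subtracted in each gradient update.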
- **6:45–11:05 – Rethinking generalization (Google Brain 2016)**
- Summarizes “Understanding Deep Learning Requires Rethinking Generalization”: train on CIFAR/ImageNet with random labels; networks can still perfectly fit the training set even with regularization, but fail on the test set.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- When trained on the correct labels, the same architectures generalize well, and explicit regularization improves test accuracy only modestly while training accuracy stays near 100%; this challenges the simple bias–variance picture (a toy version of the random-label experiment is sketched below).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
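A scaled-down stand-in for that random-label experiment; the dataset sizes and the sklearn model here are my own choices, far below the paper's CIFAR/ImageNet scale:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Random inputs with purely random labels: there is nothing to generalize from.
X_train = rng.normal(size=(200, 20))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(200, 20))
y_test = rng.integers(0, 2, size=200)

# An overparameterized network can usually memorize the training labels anyway.
net = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0, max_iter=5000, random_state=0)
net.fit(X_train, y_train)

print("train accuracy:", net.score(X_train, y_train))  # typically ~1.0 (pure memorization)
print("test accuracy: ", net.score(X_test, y_test))    # typically ~0.5 (chance)
```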
- **11:05–12:28 – Sponsor segment (KiwiCo)**
- Brief detour into children’s learning and hands‑on KiwiCo crates, framed as a parallel to learning vs memorization.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **12:28–13:57 – The double descent hypothesis (Belkin et al.)**
- Introduces Mikhail Belkin’s 2018 “double descent” hypothesis: test error first decreases (classic regime), then increases near the interpolation threshold, and then decreases again as model size keeps growing.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Shows small‑scale demos on MNIST using random Fourier feature models, where test error improves again in the highly overparameterized regime (a 1‑D toy analogue is sketched below).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
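A 1‑D toy analogue of the random Fourier feature demo; the data, feature scale, and feature counts are assumptions for illustration, not the video's MNIST setup. `np.linalg.lstsq` returns the minimum-norm solution once the system is underdetermined, which is what drives the second descent:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(x, W, b):
    """Random Fourier feature map: phi(x)_j = cos(x * W_j + b_j)."""
    return np.cos(np.outer(x, W) + b)

n_train = 20
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.1, size=n_train)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for n_features in [5, 10, 20, 40, 200, 1000]:     # interpolation threshold near 20
    W = rng.normal(scale=3.0, size=n_features)
    b = rng.uniform(0, 2 * np.pi, n_features)
    Phi_train, Phi_test = rff(x_train, W, b), rff(x_test, W, b)
    # Minimum-norm least squares; exact interpolation once n_features >= n_train.
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"{n_features:5d} features  test MSE = {mse:.3f}")
```

Test error usually spikes near 20 features and falls again for much wider models, though the exact curve depends on the random draw.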
- **13:57–16:01 – Double descent is real (Harvard/OpenAI 2019)**
- Describes a Harvard–OpenAI team’s 2019 work (“Deep Double Descent”) showing double descent across architectures, including convolutional networks and transformers, on vision and language data.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Notes that they observe double descent as a function of both model size and training time; if you stop training when test error first turns up, you can miss a later, lower second minimum.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **16:01–20:36 – Polynomial double descent example**
- Returns to 1D polynomial curve fitting with noisy parabolic data: degree 1 underfits (high bias), degree 2 is near optimal, degrees 3–4 overfit with rising test error, and degree 4 is the interpolation threshold (a degree‑4 polynomial’s five coefficients are just enough to fit every training point exactly).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Then increases the degree beyond the threshold; because there are now infinitely many interpolating polynomials, the solver picks the minimum‑norm solution in a specific polynomial basis, which yields smoother fits and lower test error again, exhibiting double descent (a minimal degree sweep is sketched below).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
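A minimal version of that degree sweep. The five noisy points, the noise level, and the use of `np.linalg.lstsq` (which returns the minimum-norm coefficients once the system is underdetermined) are my assumptions, following the video's description of a Legendre basis:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)

# Five noisy samples of a parabola: degree 4 is then the interpolation threshold.
x = np.linspace(-1, 1, 5)
y = x**2 + rng.normal(scale=0.1, size=5)
x_test = np.linspace(-1, 1, 200)
y_test = x_test**2

for degree in [1, 2, 3, 4, 8, 20, 100]:
    Phi = legendre.legvander(x, degree)           # Legendre design matrix
    Phi_test = legendre.legvander(x_test, degree)
    # Past degree 4 the system is underdetermined; lstsq picks the min-norm coefficients.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"degree {degree:3d}  test MSE = {mse:.4f}")
```

Typically the test error rises toward degree 4 and then falls again at much higher degrees, mirroring the double descent curve, though the exact numbers depend on the noise draw.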
- **20:36–24:28 – Why double descent happens**
- Explains the interpolation threshold: at the first exactly‑fitting model there is effectively only one solution and it is extremely sensitive to noise; beyond that, many interpolating solutions exist.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Shows that the particular optimization/basis choice (e.g., using a Legendre basis and the minimum‑L2‑norm solution) pushes the solver toward smoother, better‑generalizing interpolants, giving the second descent; different bases may not show it (the sketch below compares two bases at the same high degree).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
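To see the basis dependence, compare the minimum-norm interpolant in a plain monomial basis with the one in a Legendre basis at the same high degree: both fit the training points exactly, but "minimum norm" means something different in each basis, so they can generalize differently. The data and degree reuse the toy setup above (my assumption, not the video's exact numbers):

```python
import numpy as np
from numpy.polynomial import legendre, polynomial

rng = np.random.default_rng(1)

x = np.linspace(-1, 1, 5)
y = x**2 + rng.normal(scale=0.1, size=5)
x_test = np.linspace(-1, 1, 200)
y_test = x_test**2

degree = 30   # well past the interpolation threshold (degree 4 for five points)

for name, vander in [("monomial", polynomial.polyvander),
                     ("Legendre", legendre.legvander)]:
    # Both minimum-norm solutions interpolate the five points exactly,
    # but the norm being minimized depends on the basis.
    w, *_ = np.linalg.lstsq(vander(x, degree), y, rcond=None)
    mse = np.mean((vander(x_test, degree) @ w - y_test) ** 2)
    print(f"{name:8s} basis  test MSE = {mse:.4f}")
```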
- **24:28–28:30 – Revisiting the bias–variance tradeoff**
- Carefully unpacks bias and variance using many random resamples and fits: low‑degree polynomials have high bias, high‑degree polynomials have high variance, producing the classic U‑shaped region before the interpolation threshold (a resampling sketch follows below).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- After the threshold, test error changes are driven more by how the optimizer selects among many interpolating solutions (and their variance) than by a simple bias–variance balance, so the tradeoff picture alone no longer dictates the curve shape.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
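A sketch of that resampling experiment: refit each polynomial degree on many fresh noisy datasets and measure the bias² and variance of the resulting predictions. The true function, noise level, and degrees are illustrative choices, not the video's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return x**2                            # assumed ground-truth parabola

x_test = np.linspace(-1, 1, 50)
n_points, n_trials = 8, 500

for degree in [1, 2, 5, 7]:                # degree 7 interpolates the 8 training points
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(-1, 1, n_points)
        y = true_f(x) + rng.normal(scale=0.2, size=n_points)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias_sq:.3g}  variance = {variance:.3g}")
```

Bias shrinks and variance grows with degree, and the variance blows up near the interpolation threshold, which is the classic U-shape; the second descent only appears past the threshold with a norm-minimizing solver, as in the earlier sketches.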
- **28:30–30:47 – “My take” on books and theory**
- Discusses Trevor Hastie’s later work (“Surprises in High‑Dimensional Ridgeless Least Squares Interpolation”) and the new edition of _An Introduction to Statistical Learning_, which adds a double descent section but keeps U‑shaped curves as conceptual tools.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Argues that double descent does not strictly contradict bias–variance theory, but shows that “model size” as usually drawn on the x‑axis is an incomplete measure of **flexibility**, especially beyond the interpolation threshold.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **30:47–end – Welch Labs AI book promo**
- Presents _The Welch Labs Illustrated Guide to AI_: highly visual explanations, code for each chapter, exercises (including building a perceptron machine), and chapters on perceptrons, gradient descent, backprop, deep models, AlexNet, scaling laws, mechanistic interpretability, and generative models like Sora.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Mentions shipping timeline, US‑only preorders for now, and an international waitlist.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
## Researchers and authors mentioned
The video explicitly names or clearly refers to the following researchers and authors:
- **AlexNet team:** Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton (referred to as “Alex Krizhevsky, Ilya Sutskever, and Jeff Hinton”).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Google Brain generalization paper:** A “Google Brain team” behind _Understanding Deep Learning Requires Rethinking Generalization_ (names not spoken here, but the team is cited).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Double descent hypothesis:** Mikhail Belkin (named) and co‑authors on the original double descent paper.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Harvard/OpenAI double descent work:** A joint “Harvard and OpenAI team” that empirically confirms double descent across multiple architectures.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Books/authors cited for classic U‑curves:**
- Kevin P. Murphy (_Probabilistic Machine Learning: An Introduction_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Ian Goodfellow et al. (_Deep Learning_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Stuart Russell and Peter Norvig (_Artificial Intelligence: A Modern Approach / Prentice Hall series in AI_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Christopher M. Bishop (_Pattern Recognition and Machine Learning_; the listed edition also credits Nasser Nasrabadi).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Tom Mitchell (_Machine Learning_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman (_The Elements of Statistical Learning_); Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (_An Introduction to Statistical Learning_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- Yaser Abu‑Mostafa, Malik Magdon‑Ismail, Hsuan‑Tien Lin (_Learning from Data_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- David J. C. MacKay (_Information Theory, Inference, and Learning Algorithms_).[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Later double descent theory work:** Trevor Hastie again, as lead author of “Surprises in High‑Dimensional Ridgeless Least Squares Interpolation” and co‑author of the updated _An Introduction to Statistical Learning_.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Other named experts:** Simon Prince, quoted for a line about neural networks’ generalization being “dumbfounding,” and described as the author of a deep learning book.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
- **Special thanks:** Mikhail Belkin, Preetum Nakkiran, Emily Zhang, Varun Reddy are thanked explicitly at the end for discussions or feedback.[youtube](https://www.youtube.com/watch?v=z64a7USuGX0)
If you want, a follow‑up can expand this into an Obsidian‑style note template (e.g., sections per experiment, theorem, and citation).
1. [https://www.youtube.com/watch?v=z64a7USuGX0](https://www.youtube.com/watch?v=z64a7USuGX0)