Legendary Papers


Welcome to my list of legendary papers. If a paper is in this list, it means it guides my intuition or writing style. You might see some highly popular papers missing from here (e.g. you won’t see Attention is All You Need). This doesn’t mean they are irrelevant to me or that they are bad papers; it’s just that they did not change the way I think. This list is not finished yet; I expect it to contain roughly 70 papers once it is done. My relationship with any of the authors here bears no impact on my assessment.

TODO add: convnext, knowledge distillation, lora, prototypical networks, ECE, the lottery ticket hypothesis, style vectors for llms, decision transformers, RL in large action spaces, distilling human priors, td3, HER, goal conditioned sl, MAML, incoherence of the philo, detection and estimation theory van trees, complex analysis stein, sirl, getting aligned on alignment, can you trust your model after covariate shift?, TENT, discovering faster matrix mult, understanding DL still requires rethinking generalization, airgnn

Self-taught Learning: Transfer Learning from Unlabeled Data

Authors: Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng
Subject: Transfer Learning, Dictionary Learning
Remark: Dictionary learning meets ML.

How Powerful Are Graph Neural Networks?

Authors: Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka
Subject: Graph Neural Networks
Remark: This paper introduces the Graph Isomorphism Network (GIN), pushing the envelope on what graph neural networks can achieve. GIN matches the discriminative power of the Weisfeiler-Lehman graph isomorphism test, which is an upper bound on what message-passing GNNs can distinguish. Really revitalized my interest in graph theory!
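
To keep the mechanism in my head, here is a minimal sketch of the GIN node update in plain NumPy (the function and weight names are mine, not from the paper's code): each node adds its own features scaled by $(1+\epsilon)$ to the sum of its neighbors' features, then passes the result through an MLP.

```python
import numpy as np

def gin_layer(A, H, W1, W2, eps=0.0):
    """One GIN update: h_v <- MLP((1 + eps) * h_v + sum over neighbors u of h_u).

    A  : (n, n) adjacency matrix without self-loops
    H  : (n, d) node features
    W1 : (d, hidden) and W2 : (hidden, out), weights of a toy two-layer MLP
    """
    agg = (1.0 + eps) * H + A @ H          # injective sum aggregation
    return np.maximum(agg @ W1, 0.0) @ W2  # ReLU MLP

# Toy usage on a 3-node path graph with 2-d features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.random.randn(3, 2)
W1, W2 = np.random.randn(2, 8), np.random.randn(8, 4)
out = gin_layer(A, H, W1, W2, eps=0.1)     # (3, 4) updated node embeddings
```

The sum aggregator (rather than mean or max) is what keeps the update injective on multisets of neighbor features, which is where the WL-style discriminative power comes from.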

The Platonic Representation Hypothesis

Authors: Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola
Subject: Representation, Philosophy
Remark: Proposes that deep learning models are on a trajectory towards a unified model of reality, dubbed the “Platonic representation.” By drawing on Plato’s Allegory of the Cave, the paper suggests our current AI models are mere shadows, glimpses of a more profound truth. It’s a bold claim that echoes convergent realism, suggesting that, like science, AI is converging on the truth of the universe.

Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information

Authors: Emmanuel J. Candès, Justin Romberg, Terence Tao
Subject: Sparse Reconstruction
Remark: TERENCE TAO IS SO COOL. I did not expect to see number theory make such a big impact on the field of signal processing. I was also surprised that people relied on the minimum-energy solution for so long.
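
To make the contrast concrete, here is the setup as I understand it (a restatement, not a quote from the paper): given measurements $y = \Phi x$ of a sparse signal $x$, the classical minimum-energy reconstruction and the one this paper champions differ only in which norm you minimize.

```latex
% Minimum-energy (least-norm) reconstruction: has a closed form, but spreads
% energy across all coordinates and generally fails to recover a sparse x.
\hat{x}_{\ell_2} = \arg\min_x \|x\|_2 \;\; \text{s.t.} \;\; \Phi x = y
                 = \Phi^{\top} (\Phi \Phi^{\top})^{-1} y

% Candes-Romberg-Tao: minimize the l1 norm instead. With highly incomplete
% but suitably random frequency measurements, this recovers x exactly.
\hat{x}_{\ell_1} = \arg\min_x \|x\|_1 \;\; \text{s.t.} \;\; \Phi x = y
```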

Optimal Brain Damage

Authors: Yann LeCun, John S. Denker, Sara A. Solla
Subject: Pruning
Remark: This was the paper that started the massive argument/shit show over whether the magnitude of a parameter corresponds to its relevance w.r.t. the downstream task. Since then, the community has gone back and forth on the importance of $\|w\|_p$. For more information, read A Gradient Flow Framework For Analyzing Network Pruning.
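
A rough sketch of what the argument is about, pitting the diagonal-Hessian saliency from OBD against plain magnitude pruning (NumPy, my own toy numbers and function names):

```python
import numpy as np

def magnitude_saliency(w):
    """Magnitude pruning: importance is simply |w| (the ||w||_p heuristic with p = 1)."""
    return np.abs(w)

def obd_saliency(w, h_diag):
    """OBD saliency: 0.5 * H_ii * w_i^2, the second-order Taylor estimate of the
    increase in loss when w_i is zeroed (diagonal Hessian, gradient ~ 0 at a minimum)."""
    return 0.5 * h_diag * w ** 2

w = np.array([0.90, -0.05, 0.40])
h_diag = np.array([0.01, 50.0, 1.0])   # loss curvature along each weight

print(magnitude_saliency(w))    # [0.9    0.05   0.4  ] -> magnitude prunes w_1 first
print(obd_saliency(w, h_diag))  # [~0.004 ~0.063 ~0.08] -> OBD prunes w_0 first
```

The small weight sits in a high-curvature direction, so the two criteria disagree about what to prune, which is exactly the back-and-forth the remark refers to.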

Optimal Brain Surgeon

Authors: Babak Hassibi, David G. Stork, Gregory J. Wolff
Subject: Pruning
Remark: Extending the ideas from “Optimal Brain Damage,” this 1993 paper refines network pruning by using the full inverse Hessian, rather than a diagonal approximation, to decide which weights to remove and how to adjust the rest.
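
For reference, the closed forms that make OBS “more precise” than OBD, restated from memory (the full inverse Hessian replaces the diagonal approximation):

```latex
% Saliency of removing weight w_q: the predicted increase in loss.
L_q = \frac{w_q^2}{2\,[H^{-1}]_{qq}}

% After deleting w_q, the remaining weights are corrected in closed form
% (e_q is the unit vector selecting coordinate q).
\delta w = -\frac{w_q}{[H^{-1}]_{qq}} \, H^{-1} e_q
```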

Winning the Lottery Ahead of Time: Efficient Early Network Pruning

Authors: Ekdeep Singh Lubana, Robert P. Dick
Subject: Pruning
Remark: Dives into early network pruning using a gradient flow framework, borrowing concepts from the neural tangent kernel (NTK) to justify early parameter removal. It challenges traditional post-training pruning by demonstrating that the NTK can guide the removal of redundant parameters even before substantial training, sparking fresh debates on when and how pruning should be performed.