Legendary Papers
Welcome to my list of legendary papers. If a paper is in this list, it means it guides my intuition or writing style. You might see some highly popular papers missing from here (e.g. you won't see Attention is All You Need). This doesn't mean that they are not relevant to me or that they are bad papers; it's just that they did not change the way I think. This list is still a work in progress; I expect it to contain roughly 70 papers once I am done. My relationship with any of the authors here bears no impact on my assessment.
TODO add: convnext, knowledge distillation, lora, prototypical networks, ECE, the lottery ticket hypothesis, style vectors for llms, decision transformers, RL in large action spaces, distilling human priors, td3, HER, goal conditioned sl, MAML, incoherence of the philo, detection and estimation theory van trees, complex analysis stein, sirl, getting aligned on alignment, can you trust your model after covariate shift?, TENT, discovering faster matrix mult, understanding DL still requires rethinking generalization, airgnn
Self-taught Learning: Transfer Learning from Unlabeled Data
Authors: Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng
Subject: Transfer Learning, Dictionary Learning
Remark: Dictionary learning meets ML.
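To make the idea concrete, here is a minimal sketch of the self-taught learning recipe as I read it: learn a sparse dictionary from plentiful unlabeled data, re-encode the scarce labeled data as sparse codes over that dictionary, and train an ordinary classifier on the codes. The synthetic data and the scikit-learn components below are my own stand-ins, not the authors' original setup.

```python
# Self-taught learning sketch: sparse codes learned from unlabeled data
# become features for a supervised classifier.
# (Synthetic data and scikit-learn components are illustrative choices.)
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(5000, 64))      # plentiful unlabeled data
X_labeled = rng.normal(size=(200, 64))         # scarce labeled data
y_labeled = (X_labeled[:, 0] > 0).astype(int)  # toy labels

# Step 1: learn a sparse dictionary from the unlabeled pool.
dictionary = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
dictionary.fit(X_unlabeled)

# Step 2: represent the labeled examples as sparse codes over that dictionary.
codes = dictionary.transform(X_labeled)

# Step 3: train a plain classifier on the transferred representation.
clf = LogisticRegression(max_iter=1000).fit(codes, y_labeled)
print("train accuracy on sparse codes:", clf.score(codes, y_labeled))
```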
How Powerful Are Graph Neural Networks?
Authors: Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka
Subject: Graph Neural Networks
Remark: This paper introduces the Graph Isomorphism Network (GIN), pushing the envelope on what graph neural networks can achieve. GIN effectively captures the nuances of graph structures, rivaling the Weisfeiler-Lehman graph isomorphism test. Really revitalized my interest in graph theory!
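For reference, the update that gives GIN its expressive power is just sum-aggregation of neighbor features followed by an MLP. Below is a minimal NumPy sketch of a single GIN layer on a toy graph (random adjacency, features, and MLP weights are my own stand-ins); it illustrates the update rule, not the authors' implementation.

```python
# One GIN layer in plain NumPy:
#   h_v <- MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)
# Adjacency, features, and MLP weights here are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d_in, d_hid = 6, 8, 16

A = rng.integers(0, 2, size=(n_nodes, n_nodes))
A = np.triu(A, 1)
A = A + A.T                      # symmetric 0/1 adjacency, no self-loops
H = rng.normal(size=(n_nodes, d_in))

W1 = rng.normal(size=(d_in, d_hid)) * 0.1
W2 = rng.normal(size=(d_hid, d_in)) * 0.1
eps = 0.0                        # GIN-0; the paper also allows a learned eps

def gin_layer(A, H, eps, W1, W2):
    # Sum-aggregation (not mean/max) is what lets GIN match the WL test.
    agg = (1.0 + eps) * H + A @ H
    return np.maximum(agg @ W1, 0.0) @ W2   # 2-layer MLP with ReLU

H_next = gin_layer(A, H, eps, W1, W2)
print(H_next.shape)  # (6, 8)
```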
The Platonic Representation Hypothesis
Authors: Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola
Subject: Representation, Philosophy
Remark: Proposes that deep learning models are on a trajectory towards a unified model of reality, dubbed the “Platonic representation.” By drawing on Plato’s Allegory of the Cave, the paper suggests our current AI models are mere shadows, glimpses of a more profound truth. It’s a bold claim that echoes convergent realism, suggesting that, like science, AI is homing in on the truth of the universe.
Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information
Authors: Emmanuel J. Candès, Justin Romberg, Terence Tao
Subject: Sparse Reconstruction
Remark: TERENCE TAO IS SO COOL. I did not expect to see number theory make such a big impact on the field of signal processing. I was also surprised that people settled for the minimum-energy solution for so long.
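The point about the minimum-energy solution is easy to demo: from the same underdetermined measurements, the minimum-$\ell_2$-norm reconstruction smears energy across all coordinates, while $\ell_1$ minimization (basis pursuit, solved here as a linear program) typically recovers the sparse signal exactly. The dimensions and the SciPy LP formulation below are my own illustrative choices.

```python
# Sparse recovery toy: minimum-energy (l2) vs basis pursuit (l1)
# from the same underdetermined Gaussian measurements.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 128, 50, 5                     # signal length, measurements, sparsity

x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x_true

# Minimum-energy solution: least l2-norm x satisfying A x = y.
x_l2 = np.linalg.pinv(A) @ y

# Basis pursuit: min ||x||_1 s.t. A x = y, as an LP over z = [x, u], |x| <= u.
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.block([[I, -I], [-I, -I]])     #  x - u <= 0  and  -x - u <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])
bounds = [(None, None)] * n + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=bounds)
x_l1 = res.x[:n]

print("l2 reconstruction error:", np.linalg.norm(x_l2 - x_true))
print("l1 reconstruction error:", np.linalg.norm(x_l1 - x_true))
```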
Optimal Brain Damage
Authors: Yann LeCun, John S. Denker, Sara A. Solla
Subject: Pruning
Remark: This was the paper that started the massive argument/shit show over whether the magnitude of a parameter corresponds with its relevance w.r.t. the downstream task. Since then, the community has gone back and forth on the importance of $\|w\|_p$. For more information, read A Gradient Flow Framework For Analyzing Network Pruning.
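As a worked example of what the argument is about: OBD ranks weights by a second-order saliency, roughly $s_i \approx \tfrac{1}{2} h_{ii} w_i^2$, rather than by $|w_i|$ alone, so the two criteria disagree whenever the diagonal curvature varies a lot across parameters. In the sketch below I use squared gradients as a crude stand-in for the diagonal Hessian; OBD itself computes the true second derivatives via backprop, and the code is mine, not the paper's.

```python
# Magnitude pruning vs. OBD-style saliency s_i = 0.5 * h_ii * w_i**2.
# The diagonal "Hessian" here is a squared-gradient proxy, used only to
# show that the two rankings can disagree.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)                     # toy parameter vector
g = rng.normal(size=10)                     # toy per-parameter gradients
h_diag = g ** 2                             # Fisher-style curvature proxy

saliency_obd = 0.5 * h_diag * w ** 2        # OBD: curvature-weighted importance
saliency_mag = np.abs(w)                    # magnitude criterion

k = int(0.3 * w.size)                       # prune the bottom 30%
prune_obd = np.argsort(saliency_obd)[:k]
prune_mag = np.argsort(saliency_mag)[:k]

print("OBD would prune:      ", sorted(prune_obd.tolist()))
print("magnitude would prune:", sorted(prune_mag.tolist()))
```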
Optimal Brain Surgeon
Authors: Babak Hassibi, David G. Stork, Gregory J. Wolff
Subject: Pruning
Remark: Extending the ideas from “Optimal Brain Damage,” this 1993 paper refines network pruning by employing a more precise, Hessian-based criterion for choosing which weight to remove and for updating the weights that remain.
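Concretely, OBS scores each weight by $L_q = w_q^2 / (2 [H^{-1}]_{qq})$ and, after deleting weight $q$, compensates the survivors with $\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q$; that compensation step is what buys the extra precision over OBD's diagonal approximation. Here is a small NumPy sketch of one such step, using a synthetic damped Hessian purely for illustration.

```python
# One Optimal Brain Surgeon step on a toy problem: pick the weight whose
# removal hurts least according to the inverse Hessian, delete it, and
# compensate the remaining weights. The Hessian here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)

M = rng.normal(size=(d, d))
H = M @ M.T + 1e-2 * np.eye(d)          # SPD "Hessian" with damping
H_inv = np.linalg.inv(H)

# OBS saliency: L_q = w_q^2 / (2 * [H^-1]_qq)
saliency = w ** 2 / (2.0 * np.diag(H_inv))
q = int(np.argmin(saliency))            # cheapest weight to remove

# Compensating update: delta_w = -(w_q / [H^-1]_qq) * H_inv[:, q]
delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]
w_pruned = w + delta_w
print("removed index:", q, "| new value at q:", w_pruned[q])  # ~0 by construction
```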
Winning the Lottery Ahead of Time: Efficient Early Network Pruning
Authors: Ekdeep Singh Lubana, Robert P. Dick
Subject: Pruning
Remark: Dives into early network pruning using a gradient flow framework, borrowing concepts from the neural tangent kernel (NTK) to justify removing parameters early in training. It challenges the tradition of pruning after training by demonstrating that the NTK can guide the removal of redundant parameters even before substantial training has occurred, sparking fresh debates on when and how pruning should be performed. A toy sketch of an early-pruning criterion follows below.
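To give a flavor of what an "early" criterion looks like, here is a loss-preservation-style score $|w \odot \nabla_w L|$ computed from a single small batch before any real training, then used to drop the lowest-scoring parameters. This particular score and the linear-regression toy are my illustrative choices in the spirit of the gradient-flow view, not the paper's exact NTK-based procedure.

```python
# Early pruning toy: score parameters by |w * dL/dw| on one small batch
# before training, then drop the lowest-scoring fraction.
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = rng.normal(size=d) * 0.1                 # weights at (near) initialization
grad = X.T @ (X @ w - y) / n                 # MSE gradient on one batch

score = np.abs(w * grad)                     # loss-preservation saliency
keep = score >= np.quantile(score, 0.5)      # prune the bottom 50%
w_sparse = np.where(keep, w, 0.0)
print("kept", int(keep.sum()), "of", d, "weights before any training")
```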