Machine Learning: The Great Stagnation

The bureaucrats are running the asylum

This blog post generated a lot of discussion on Hacker News - many people have reached out to me giving more examples of the stagnation and more examples of projects avoiding it. Maybe I’ll add to this article or maybe I’ll write a new one, let’s see what happens. In the meantime if you can’t wait for me to stop staring at the ceiling and write something new, I’m pretty sure you’ll enjoy my e-book

Machine Learning Researchers

Academics think of themselves as trailblazers, explorers - seekers of the truth.

Any fundamental discovery involves a significant degree of risk. If an idea is guaranteed to work then it moves from the realm of research to engineering. Unfortunately, this also means that most research careers will invariably be failures at least if failures are measured via “objective” metrics like citations.

The construction of Academia was predicated on providing a downside hedge or safety net for researchers. Where they can pursue ambitious ideas where the likelihood of success if secondary to the boldness of the vision.

Academics sacrifice material opportunity costs in exchange for intellectual freedom. Society admires risk takers, for it is only via their heroic self sacrifice that society moves forward.

Unfortunately most of the admiration and prestige we have towards academics are from a bygone time. Economists were the first to figure out how to maintain the prestige of academia while taking on no monetary or intellectual risk. They’d show up on CNBC finance and talk about “corrections” or “irrational fear/exuberance”. Regardless of how correct their predictions were, their media personalities grew with the feedback loops from the YouTube recommendation algorithm.

It’s hard to point the blame towards any individual researcher, after all while risk is good for the collective it’s almost necessarily bad for the individual. However, this risk free approach is growing in popularity and has specifically permeated my field “Machine Learning”. FAANG salary with an academic appointment is the best job available in the world today.

With State Of The Art (SOTA) Chasing we’ve rewarded and lauded incremental researchers as innovators, increased their budgets so they can do even more incremental research parallelized over as many employees or graduate students that report to them.

Machine Learning Researchers can now engage in risk-free, high-income, high-prestige work

They are today’s Medieval Catholic priests

Machine Learning Students

Machine Learning PhD students are the new Investment Banking analysts, both seek optionality in their career choices but differ in superficial ways like preferring Meditation over Parties and Marijuana & Adderall over Alcohol and Cocaine.

A Machine Learning PhD is now just an extended interview for FAANG.

The entire data science interview process at larger labs has become a mix of trivia and prestige. Checking out a portfolio takes way to too long but checking that you graduated from Stanford or coauthored a paper with Google Brain, now that’s a good filter!

We’ve gamified and standardized the process so much that it’s starting to resemble case studies at consulting interviews.

“Recruiter: Which activation functions do you know?”

“Me: Infinitely many x ^n * sigmoid(x) ∀n”

“Recruiter: Ok.. Tell me about your biggest career failure”

“Me: Having to answer your questions”

Matrix Multiplication is all you need

I often get asked by young students new to Machine Learning, what math do I need to know for Deep Learning and my answer is Matrix Multiplication and Derivatives of square functions. All these neuron analogies do more harm than good in explaining how Machine Learning actually works.

LSTMs a bunch of matrix multiplications, Transformers a whole bunch of matrix multiplications, CNNs use convolutions which are a generalization of matrix multiplication.

Deep Neural Networks are a composition of matrix multiplications with the occasional non-linearity in between

Matrix multiplication was invented way back in 1812 by Jacques Philippe Marie Benet but you’d be forgiven for thinking forward propagation was invented much later than that.

With Automatic Differentiation, the backward pass is essentially free and is as engaging to compute as 50 digit number long division. Deriving long complicated gradients is fake rigor that was useful before we had computers.

When I was a graduate student at UC San Diego I remember shying away from Deep Learning because it was not considered serious Machine Learning because there were no good proofs for why these models should work.

Empiricism and Feedback Loops

I’ve learnt “the hard way” that Deep Learning is an empirical field, so why or how something works is often anecdotal as opposed to theoretical.

The best people in empirical fields are typically those who have accumulated the biggest set of experiences and there’s essentially two ways to do this.

  1. Spend lots of time doing it

  2. Get really good at running many concurrent experiments

Age is a proxy for experience but an efficient experimentation methodology allows you to compress the amount of time it would take to gain more experiences.

If you have a data center at your disposal this further multiplies your ability to learn. If you and all your peers have access to data centers this is yet another multiplicative feedback loop since you can all learn from each other.

This helps explain why the most impactful research in Machine Learning gets published in only a few labs such as Google Brain, DeepMind and Open AI. There are feedback loops everywhere.

The Rise of Transformers

The past 3 years in particular have been an unrelenting deluge of incremental work with paper titles that read like tabloid headlines:

“Attention is all you need!”

“Transformers on Proteins!”

“Transformers on Molecules"!”

“Transformers on Images”

“Fast Transformer!”

“Transformers are Graph Neural Networks!”

“Learning to Transform with Transformers“

“Small Transformer!”

“Long Transformer"!

“Useful” Machine Learning research on all datasets has essentially reduced to making Transformers faster, smaller and scale to longer sequence lengths.

This is a problem reminiscent of the discovery of NP-completeness - journals were flooded with the proofs that yet another problem was NP-complete.

BERT engineer is now a full time job. Qualifications include:

  1. Some bash scripting

  2. Deep knowledge of pip

  3. Waiting for new HuggingFace models to be released

  4. Watching Yannic Kilcher’s new Transformer paper the day it comes out

  5. Repeating what Yannic said at your team reading group

It’s kind of like Dev-ops but you get paid more.

Graduate Student Descent and the Death of First Principles

Neural Network weights are learnt via Gradient Descent and Neural Network architectures are learnt via Graduate Student Descent.

The below flow chart describes how Graduate Student Descent works.

Graduate Student Descent is one of the most reliable ways of getting state of the art performance in Machine Learning today and it’s also a fully parallelizable over as many graduate students or employees your lab has. Armed with Graduate Student descent you are more likely to get published or promoted than if you took on uncertain projects.

The popularity of Graduate Student Descent stems from Cargo Culting configs where certain loss functions, depths, architecture are generally regarded as good.

It’s quite difficult to actually reason from first principles because Machine Learning algorithms are complex systems with a huge number of variability in parameters where the interactions are nonlinear and unpredictable. Ablations help but even then aren’t entirely conclusive over such a wide range of parameters.

Technical papers that try to give intuition behind how or why a specific technique works often looks closer to astrology than science with sentences like “encourage the network to be confident in its predictions”.

Scale is Trivial

I sometimes get the impression that academics think that transitioning to large models goes something like:

python --tiny_model --local
python --super_large --on_pristine_cluster

It’s trivial to think of running a large model but it’s certainly not trivial to actually do it. This misconception is best illustrated with this immensely popular meme.

If something simple like stacking more layers works better than statistical learning, then you have to wonder who the real clown is

To “STACK MORE LAYERS” you need to worry about model and data parallelism, pipelining, tuning your hyperparameters, hardware accelerators, network vs compute vs storage vs IO bottlenecks, early stopping, scalable architectures, distillation, pruning etc..

You can do all the convergence bounds you like with Chernoff Bounds, Markov Inequality over some parameter which is assumed to be Gaussian but if your proposed algorithm is worse than “STACK MORE LAYERS” then your proposed algorithm isn’t very good.

Every paper is SOTA, has strong theoretical guarantees, an intuitive explanation, is interpretable and fair but almost none are mutually consistent with each other

If Constantinople fell so can the institution of Theoretical Machine Learning.

I’m cautiously optimistic about Causal Reasoning, what I’d like to see is it graduating from a tool for meditation to libraries that people actually use daily.

Fake Rigor

This does not mean I’m opposed to Mathematical formalism, if anything I love math and I want to see more of it in Deep Learning - I only caution against fake rigor.

Assuming certain “nice properties” about data so that theorems work out, gradient derivations that take up multiple pages of an appendix instead of just using Automatic Differentiation.

But how do you prove that new theoretical results are useful?

Well the worst way to do this is to simply combine complicated mathematical ideas into a neural network because you can, you can draw on the readers aesthetic sensibilities and discuss why Fourier is the heart of computation. The optimization community is often guilty of this, where they propose activation functions like swish and then spend pages and pages talking about the nice properties of the loss landscape.

The most reliable to get new ideas widespread is to create a benchmark where existing SOTA methods fail and then show how your technique is better. This is hard by design, it should be hard to displace proven ideas with unproven ones. This technique has the advantage of not requiring any Twitter arguments.

It’s important to avoid becoming Gary Marcus and criticize existing techniques that work without proposing something else that works even better.

If you can’t do that then maybe you haven’t stumbled on something useful, maybe it’s just beautiful or elegant. Searching for warmth and fuzziness is still a worthwhile goal.

Where is the Innovation in Machine Learning?

While my introductory lament may have led you to believe that there is no innovation going on in Machine Learning, that we’ve settled on a cargo-culting monoculture - nothing could be further from the truth.

There is still substantial innovation happening in Machine Learning just not from Data Scientists or Machine Learning Researchers.

Here are the projects that I believe represent a glimmer of hope against the Stagnation of Machine Learning.

Keras and

Machine Learning is a Language, Compiler and Design problem

A programming language is tool for thinking that needs to be designed with the same sensibilities as any other consumer product.

When I first got interested in Keras, some of my peers mentioned to me that it wasn’t as serious doing “real ML work” in Tensorflow and it made me silently wonder why they weren’t programming in FORTRAN.

Keras introduced me to the idea of viewing Machine Learning as a language design problem. Instead of thinking of matrices or neurons I was now thinking in terms of layers just like I do when I read a paper.

# Create layers
layer1 = layers.Dense(2, activation="relu", name="layer1") 
layer2 = layers.Dense(3, activation="relu", name="layer2") 
layer3 = layers.Dense(4, name="layer3")

# Call layers on a test input
x = tf.ones((3, 3))
y = layer3(layer2(layer1(x)))

A matrix is a linear map but linear maps are far more intuitive to think about than matrices

You can then build networks with multiple inputs and outputs to build far more interesting networks.

model = Model(inputs=[x1, x2], outputs=y)

Keras is a user centric library whereas Tensorflow especially Tensorflow 1.0 is a machine centric library. ML researchers think in terms of terms of layers, automatic differentiation engines think in terms of computational graphs.

As far as I’m concerned my time is more valuable than the cycles of a machine so I’d rather use something like Keras.

This doesn’t mean I’m happy with slow code which is why it’s crucial to build good compilers and Intermediate representations like XLA that would run my user friendly code fast.

Performance vs Abstraction is a false dichotomy - the history of computing is proof of this

Putting the user first is the approach that took when building their Deep Learning library. I think of Jeremy Howard as the Don Norman of Machine Learning.

Instead of just focusing on the model building part, builds tools around all of the below.

By tools I don’t mean a black box service, I mean software design patterns specific to Machine Learning. Instead of Abstract Factories, abstractions like Pipelines to chain preprocessing steps, callbacks for early stopping all the way to generic yet simple implementations of Transformers. Design patterns are more useful than black boxes because you can understand how they work, modify them and improve for both yourself and others.

Honorable mention to nbdev which aims at removing some of the common annoyances of working with notebooks by eliminating all obstacles that would get in your way of shipping your code as a library including a human readable git representation, continuous integration and automated PyPi package submission.


The promise of Deep Learning is Differentiable Computing, where instead of writing programs to accomplish certain tasks you feed a model input/output pairs and ask it to generate a program for you.

Once you start thinking of neural networks as programs, you can also think of programs as neural networks and differentiate programs.

Unfortunately, Python is not differentiable which is Google and Facebook have each built their own languages in C++ with Python bindings that is automatically differentiable.

Julia on the other hand is a language made for scientific computing where everything is automatically differentiable by default. So if you build an ODE solver you get a neural ODE solver for free.

Ordinary and Partial Differential Equations are the way we formalize most relationships in Science from astronomy to pharmacology so being able to speed up these simulations by orders of magnitude means we’re in the middle of a golden age of Scientific Computing.

I also particularly find the Neural ODE approach to Robotic simulations to be extremely exciting. The main problem in Reinforcement Learning is that models are generally large and difficult to train but you can make them much smaller and easier to train if instead of treating a simulation as a black box you can treat it a like differentiable white box ODE solver.


What is the most important Machine Learning company today?

If you asked this question in 2018 the answer may have been Open AI. They dazzled the world with beautiful demos of agents besting expert video game players. They reminded me why I got into Machine Learning in the first place. Their reputation shadowed everything but over time it became their core product.

Open AI is a media and service company

Look at the pretty blog posts with beautiful typography, pay to use GPT-3. Open AI is not a platform company.

I can’t think of a single large company where the NLP team hasn’t experimented with HuggingFace. They add new Transformer models within days of the papers being published, they maintain tokenizers, datasets, data loaders, NLP apps. HuggingFace has created multiple layers of platforms that each could be a compelling company in its own right.

Billions of Dollars of value will be created from HuggingFace on problems which are a lot less speculative than AGI. HuggingFace avoids the typical ML startup trap of turning into either a consulting firm or citation farm.


Haskellers often resemble a cult, everything is a function or a Monad or a Lens. It’s often hard to follow what Bi-Category is or why you should care.

Real world Haskell tries to model IO as a Monad or a web server as a function with state but there’s an application that I believe Haskellers don’t hammer on quite enough.

Haskell is the best functional programming language in the world and Neural Networks are functions

This is the main motivation behind Hasktorch which lets you discover new kinds of Neural Network architectures by combining functional operators. Justin Le does this idea a lot of justice in his series Purely Functional Typed Models

Training a model

Composing Logistic and Linear Regression


Admittedly the RNN code sample is a tad complicated but consider it’s orders of magnitude less code than what you’d see in a typical neural network library and not something you can get by combing logistic regression with a couple of extra operators. I’m hoping I can convince Justin to one day create a lecture for us mere mortals to invent our own neural network architectures.

Unity ML agents

Open AI gym was a great attempt at creating a platform where various games were used in benchmarks for Reinforcement Learning research. The goal was to eventually advance Reinforcement Learning to deal with more complex problems and really push the boundary.

Unfortunately, SOTA chasing meant that the benchmarks became the goal and an entire community of researchers has overfit techniques like “world models” to this dataset. So generally there’s a trend towards ever more complex algorithms on simple benchmarks whereas the most complex benchmarks like Dota seem to be using the simplest algorithms scaled with an impressive infrastructure.

If Rich Sutton wrote his Bitter Lesson as a meme it’d probably look like the below.

Unity is the most user friendly Game Engine out today, I love it and use it for all my side projects. Unity ML agents is a way for you to turn a video game into a Reinforcement Learning environment.

Reinforcement Learning environments are essentially custom datasets and I’m confident it will be the de-facto simulator of complex robotic applications. Create a complex multi agent negotiation with bluffing game where agents need to have a grasp of physics and an understanding of optical illusions. Go crazy!

Any intelligent behavior is best benchmarked with Unity ML agents

If you’re interested in learning more about this space this tweetstorm is a great start



I wrote this article before AlphaFold2 was released - I explain how it works on YouTube. Biotech is a huge deal and is most definitely not stagnating, I’ll probably address this in a future article.

Parting words

I am sad to see that most exciting work in Machine Learning is coming from outside of Machine Learning - I’ve spent 10 years in this field and I learn more from the crackpot outsiders on Twitter today then I do from peer reviewed papers.

I want to hear more proposals that may not probably work, I want more ideas, I want Machine Learning to be fun again.

Keep an open mind and most importantly don’t be this guy. And if you enjoyed this please let me know by subscribing below!


Thank you sudomaze, thecedarprince, krishnanpc and 19_Rafael for helpful feedback while I was livestreaming myself writing this article on