Thoughts: Sutton’s The Bitter Lesson

Ponderings on Richard Sutton's The Bitter Lesson

Hassaan Naeem
3 min read · May 10, 2022
I recently re-read Richard Sutton’s The Bitter Lesson. It states that general learning methods that can scale with computation are ultimately the most effective.

The two methods that seemingly scale endlessly are search and learning, and they have borne their fruit. Sutton lists their successes in chess, Go, speech recognition, computer vision, and so on.

This is in contrast to the human-knowledge approach, where our knowledge of a specific domain is built into the algorithms trying to “solve” or “work out”, so to speak, that domain. In speech recognition, this was done via the hand-crafting of phonemes, words, and so on; in games like chess and Go, it was done through hand-crafted features of the game; and the list goes on.

The main takeaway is that, ultimately, general-purpose learning methods can continually leverage compute, and available compute continues to climb.

I think Ilya Sutskever put it succinctly in this tweet: most non-compute-based methods are, at scale, training data in disguise, and the benefit they provide ceases to exist once the data is scaled, because scaling the method itself becomes too arduous a task.

Opposition

Countering Sutton is Rodney Brooks, who makes his case even more concisely in A Better Lesson. He points to the example of translational invariance built into CNNs (via convolution and max pooling), which falls within the human-knowledge category. Of course, such things can be learned by throwing more compute at them, but it seems rather unnecessary. He also alludes to the fact that most ML problems today require a specific network architecture, which is … human built. Human knowledge is transferred into the algorithms in a different form. There is also the question of just how far compute can scale. Brooks argues that Moore’s Law is slowing down (although Jim Keller insists it is not), and that, once again, human knowledge is applied in a different form to architect new ways to scale this compute, and thereby scale the learning methods.
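To make Brooks’s point concrete, here is a minimal sketch (assuming PyTorch; the layer choices and shapes are illustrative, not taken from either essay) showing that an untrained convolution followed by max pooling already produces features that do not change when the input is translated. The invariance comes from the architecture itself, not from data or compute.

```python
# A minimal sketch (assuming PyTorch) of the prior Brooks points to:
# convolution is translation-equivariant by construction, and pooling
# makes the resulting features approximately translation-invariant,
# with nothing learned from data.
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # random, untrained weights
pool = nn.AdaptiveMaxPool2d(1)                    # global max pool over spatial dims

x = torch.zeros(1, 1, 28, 28)
x[0, 0, 5:10, 5:10] = 1.0                         # a small square "object"

x_shifted = torch.roll(x, shifts=(7, 7), dims=(2, 3))  # same object, translated

feat = pool(conv(x)).flatten()
feat_shifted = pool(conv(x_shifted)).flatten()

# The pooled features are (near-)identical for the shifted input:
print(torch.allclose(feat, feat_shifted, atol=1e-5))  # True
```

Nothing here was learned; the invariance is a prior supplied by a human designer, which is exactly the kind of built-in knowledge Sutton argues scaled compute will eventually render unnecessary.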

I think both sides hold valid arguments. However, I also think A Better Lesson shifts the reasoning away from the scope of learning algorithms, and that Sutton’s remark that “Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances…” does not fully appreciate the impact these notions make in practice.

The problem more or less becomes: what human knowledge must be incorporated into scalable methods to allow for the discovery process Sutton is after? What priors and inductive biases must a scalable method be cognizant of for its learning to scale? After all, is this not what has been bred into us through millions of years of evolution, to help encode our complex brain architectures?

Many more recent successes, such as AlphaFold or even lattice QCD (LQCD), rely on domain knowledge to deliver the performance they do. Does the bitter lesson only apply to domains where human knowledge is scarce? How does the “meta” approach fit into all of this? These all seem like valid questions, at least to me.
