Bootstrapped Meta-Learning — An Implementation

Hassaan Naeem
Jul 17, 2022



… continuing from where I left off. I finally got around to implementing the bootstrapped meta-learning (BMG) paper. This was quite a tricky implementation… As with the Meta-Gradients post, I used PyTorch.

Key Issues:

If you’ve read my previous post, you’ll know the two main problems that BMG addresses are myopia and curvature. I have described them below; both issues become much clearer when contrasting a standard “meta-learning” algorithm against this bootstrapped version.

Myopia

Myopia means that learning has a short horizon. If you recall the Meta-Gradients implementation, the roll-out is only conducted for a fixed number of steps, after which the optimizers are stepped for the first time; a second trajectory is then rolled out as a form of cross-validation and used for the meta-parameter update. The problem is that learning only takes place within that fixed number of steps, and nothing is gained from future steps within an episode in the environment.

Again, the choice of the number of inner steps K comes down to a tradeoff between learning horizon and computational cost. Ideally one wants to increase the learning horizon without having to backprop through all of the learning steps.

Curvature

Curvature means that the outer learning is constrained to the structure/geometry of the inner learning. Looking back at the Meta-Gradients implementation, when the second roll-out is conducted (for learning the meta-objective/parameters; we used a single gamma variable), it is conditioned on the same objective as the initial roll-out. To use the authors’ term, the “meta-optimization landscape” is the same. Sebastian Flennerhag (one of the authors) relates this to the credit assignment problem in this interview, which I recently came across (as an aside, this Minsky piece is a good read on the credit assignment problem). Put briefly, in the regular meta-learning approach it is hard to figure out which improvement in the algorithm deserves credit. This is not ideal when one considers the problem that meta-learning is aiming to solve.

Solutions to Issues:

Bootstrapping to a Future Target

This is what helps to mitigate the issue of myopia. A target is generated by continuing to update the learner’s parameters for some further number of steps.

A target bootstrap is introduced, and the choice of this target is what actually guarantees that performance will improve. The choices in Equations 1 and 2 (Section 3) of the paper fulfill the guarantee (check Section 4 for the math). The target unrolls the meta-learner’s update rule 𝜑 some further number of steps (L-1 steps in our case). Importantly, this target is treated as a fixed quantity, so there is no backprop through it. Hence, the learning horizon is extended without dramatically increasing computational cost. It is then up to the meta-learner to match the target.
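To make the “no backprop through the target” point concrete, here is a toy sketch; build_target, update_fn, and batches are illustrative names rather than the repo’s API:

```python
def build_target(params, update_fn, batches, num_steps):
    """Unroll the learner some further steps and treat the result as a constant.

    params: learner parameters after the first K meta-learned updates.
    update_fn: applies the update rule (phi) once -- a placeholder here.
    """
    target = params
    for batch in batches[:num_steps]:
        target = update_fn(target, batch)
    # The target is a fixed quantity: detach so no gradient flows back
    # through these extra unroll steps.
    return [p.detach() for p in target]
```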

Minimizing Distance/Divergence to Target in Matching Space

This is what helps control the landscape for meta-learning. The learner’s updated parameters and the target are projected into a matching space, and the meta-learner will be optimized within this space.

A matching function 𝜇 is introduced, measuring the similarity, or lack thereof, between the meta-learner’s output and the target. The paper uses KL divergence as one option for this measure.
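For a discrete policy, the matching function can be written as the KL divergence between the two action distributions. A small sketch, where the logits are placeholders for the outputs of the two actor-critics:

```python
from torch.distributions import Categorical, kl_divergence

def matching_loss(target_logits, learner_logits):
    """mu(target, learner): mean KL between the target policy and the
    K-step learner's policy over a batch of states."""
    target_dist = Categorical(logits=target_logits.detach())  # no grad through the target
    learner_dist = Categorical(logits=learner_logits)
    return kl_divergence(target_dist, learner_dist).mean()
```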

The bootstrapped meta-gradient update (Equation 2 of Flennerhag et al., 2021) is then:

w̃ = w − β ∇w 𝜇(x̃, x^(K)(w))

where w̃ are the updated meta-parameters, w are the meta-parameters, β is a chosen positive constant, ∇w is the gradient taken with respect to the second slot of 𝜇, 𝜇 is the matching function, x̃ is the (fixed) target, and x^(K)(w) is the learner after K applications of 𝜑.

Implementation

I continue with a similar pattern to that of the previous post, although the code has been refactored quite a bit; some new libraries and optimizers were introduced to make things cleaner. I came across a neat differentiable-optimizer library, TorchOpt, and opted to use it, since it eliminates a lot of boilerplate that the Agent class would otherwise need. I initially implemented this with an episodic (CartPole) environment, but for the sake of comparison with the paper, I made the Two Color Grid World (TCGW) environment that the paper uses. This helps one (it did for me) get a more intuitive feel for what is going on.

The model architecture will be as follows: an ActorCritic network, a Meta MLP, an Agent, and the TCGW environment.

ActorCritic Network

This will represent the policy and value functions. For the policy network: three linear layers, with input dimensionality equal to the environment’s observation space, any sensible choice of hidden dimensionality, and output dimensionality equal to the environment’s action space. For the value network: three linear layers, with input dimensionality equal to the environment’s observation space, any sensible choice of hidden dimensionality (I use the paper’s parameters), and output dimensionality 1. There are also saving and loading methods.
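A minimal sketch of that network, with the hidden size as a placeholder and the save/load methods omitted:

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    """Separate policy and value heads, three linear layers each."""

    def __init__(self, obs_dim, n_actions, hidden_dim=256):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),   # logits over actions
        )
        self.value = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),           # state value
        )

    def forward(self, obs):
        return self.policy(obs), self.value(obs)
```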

Meta MLP

This is just a standard feed-forward network, used to meta-learn the entropy regularization weight. Two linear layers, with input dimensionality of 10 (corresponding to the average rewards of the 10 most recent rollouts), any sensible choice of hidden dimensionality (I use the paper’s parameters), and output dimensionality 1. It uses a ReLU activation, with a sigmoid on the output.
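Again as a minimal sketch, with the hidden size as a placeholder:

```python
import torch.nn as nn

class MetaMLP(nn.Module):
    """Predicts the entropy-regularization weight from recent rewards."""

    def __init__(self, in_dim=10, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),  # epsilon_EN in (0, 1)
        )

    def forward(self, avg_rewards):
        # avg_rewards: tensor holding the average reward of the 10 most recent rollouts
        return self.net(avg_rewards)
```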

Two Color Grid World Environment

This is built exactly in accordance with Section B.1 of the paper. Three distinct random numbers are sampled from [0, 24] (each representing one of the 25 grid locations); these give the positions of the green (agent), blue, and red squares. At every step we are allowed to move in one of the four directions, and if we reach a blue or red square it is captured and relocated to a random non-occupied square. Every 100,000 steps the rewards for blue and red flip. For the observation space, as the paper says, “Observations are constructed by concatenating one-hot encodings of the each x- and y-coordinate of the two colours and the agent’s position, with a total dimension of 2 × 3 × 5 = 30 (two coordinates for each of three objects, with each one-hot vector being 5-dimensional).”
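Below is a rough, self-contained sketch of such an environment. Treating the green square as the agent’s own position, the ±1 rewards, and the action ordering are my assumptions where the description above leaves details open; the repo’s version may differ in its details.

```python
import numpy as np

class TwoColorGridWorld:
    """Sketch of the 5x5 two-color grid world (non-episodic)."""

    SIZE, FLIP_EVERY = 5, 100_000

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Sample three distinct cells in [0, 24] for the agent (green), blue and red.
        self.agent, self.blue, self.red = self.rng.choice(25, size=3, replace=False)
        self.steps, self.flipped = 0, False
        return self._obs()

    def _relocate(self, occupied):
        # Move a captured square to a random non-occupied cell.
        free = [c for c in range(25) if c not in occupied]
        return self.rng.choice(free)

    def _obs(self):
        # One-hot x and y for agent, blue, red: 2 * 3 * 5 = 30 dims.
        out = np.zeros(30, dtype=np.float32)
        for i, cell in enumerate((self.agent, self.blue, self.red)):
            x, y = cell % self.SIZE, cell // self.SIZE
            out[i * 10 + x] = 1.0
            out[i * 10 + 5 + y] = 1.0
        return out

    def step(self, action):
        # Actions: 0=up, 1=down, 2=left, 3=right (clipped at the walls).
        x, y = self.agent % self.SIZE, self.agent // self.SIZE
        dx, dy = [(0, -1), (0, 1), (-1, 0), (1, 0)][action]
        x, y = np.clip(x + dx, 0, self.SIZE - 1), np.clip(y + dy, 0, self.SIZE - 1)
        self.agent = int(y * self.SIZE + x)

        reward = 0.0
        blue_r, red_r = (-1.0, 1.0) if self.flipped else (1.0, -1.0)  # reward values assumed
        if self.agent == self.blue:
            reward = blue_r
            self.blue = self._relocate({self.agent, self.red})
        elif self.agent == self.red:
            reward = red_r
            self.red = self._relocate({self.agent, self.blue})

        self.steps += 1
        if self.steps % self.FLIP_EVERY == 0:   # rewards flip every 100,000 steps
            self.flipped = not self.flipped
        return self._obs(), reward
```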

Agent

This is the meat of the model. The algorithm outlined in the paper is shown below:

Flennerhag et al., 2021

Algorithm 1 corresponds to the roll_out method in the class, while Algorithms 2 and 3 correspond to the run method.

There is one helper method, plot_results, which plots the cumulative reward and entropy rate curves. The corresponding curves can be found in Figure 2 (Section 5) of the paper for entropy, and Figure 8 (Appendix Section B) for cumulative reward. An important caveat: TCGW is non-episodic, but as mentioned in the paper, the rewards flip every 100,000 steps, which is what produces the behaviour seen in the entropy rate curve.

Learning:

Some things to note beforehand. The on-policy actor-critic uses ε_PG = ε_TD = 1. We will be meta-learning ε_EN (the entropy regularization weight) through the Meta MLP. These coefficients appear in the objective function written just above Equation 4 (Section 5) of the paper, which I am too lazy to write out in full, but a rough sketch in code follows.
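The advantage/return computation and sign conventions here are my own simplification rather than a line-for-line copy of the repo:

```python
from torch.distributions import Categorical

def actor_critic_loss(logits, values, actions, returns, eps_pg, eps_td, eps_en):
    """Weighted on-policy actor-critic objective; eps_pg and eps_td are fixed
    to 1 while eps_en comes from the Meta MLP."""
    dist = Categorical(logits=logits)
    advantage = returns - values.squeeze(-1)

    pg_loss = -(dist.log_prob(actions) * advantage.detach()).mean()  # policy-gradient term
    td_loss = advantage.pow(2).mean()                                # value / TD term
    en_loss = -dist.entropy().mean()                                 # (negative) entropy bonus
    return eps_pg * pg_loss + eps_td * td_loss + eps_en * en_loss
```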

The learning mechanism is as follows:

  • An n-step trajectory is “rolled out” at each of the K+L steps. Since we are in a non-episodic environment, we can leave out the need to set done flags, mask final rewards, etc. After K steps we save the state dictionary, and after K+L-1 steps we save it again.
  • The usual actor-critic algorithm is played out, computing an actor and critic loss, in accord with Equation 4 (Section 5) of the paper.
  • The average reward of each of the previous 10 rollouts is continually updated and fed into the Meta MLP to predict ε_EN.
  • If we are at the last step, we compute a bootstrap_loss, which is just the actor loss, and step the inner optimizer; otherwise we compute the total loss (actor + critic + entropy) and step the inner optimizer. The inner optimizer is a differentiable optimizer from TorchOpt.
  • After the K+L steps, we calculate the matching loss, which in this case is the KL divergence, in accord with the update equation shown above. We use a shell of the ActorCritic class, load into it the state dictionary from the Kth-step actor-critic, and create a distribution from it; another distribution is made for the (K+L)-step actor-critic. (A condensed sketch of this whole cycle is shown after this list.)
  • The Meta MLP optimizer is stepped, and we restore the actor-critic optimizer to its (K+L-1)th state.
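Putting the pieces together, here is a condensed sketch of one K+L cycle and the meta step. It reuses the actor_critic_loss and matching_loss sketches from above; roll_out, agent_net, meta_net, env, K, L, and recent_avg_rewards are placeholders for what the Agent class actually tracks, the inner learning rate and Adam as the meta optimizer are my assumptions, and TorchOpt’s MetaSGD stands in as the differentiable inner optimizer:

```python
import torch
import torchopt

inner_opt = torchopt.MetaSGD(agent_net, lr=0.1)              # differentiable inner optimizer (lr is a placeholder)
meta_opt = torch.optim.Adam(meta_net.parameters(), lr=1e-4)  # optimizer choice assumed; lr from the results below

k_logits, k_states = None, None
for step in range(K + L):
    logits, values, actions, returns, states = roll_out(agent_net, env)  # hypothetical rollout helper
    eps_en = meta_net(recent_avg_rewards)                    # meta-learned entropy weight

    if step == K + L - 1:                                    # final bootstrap step: actor loss only
        loss = actor_critic_loss(logits, values, actions, returns, 1.0, 0.0, 0.0)
    else:
        loss = actor_critic_loss(logits, values, actions, returns, 1.0, 1.0, eps_en)
    inner_opt.step(loss)                                     # differentiable update of agent_net

    if step == K - 1:                                        # snapshot x^(K)(w), keeping the graph
        k_states = states
        k_logits = agent_net.policy(k_states)

# Matching loss: KL between the (fixed) target policy and the K-step policy,
# evaluated on the same states; matching_loss detaches the target internally.
target_logits = agent_net.policy(k_states)
matching = matching_loss(target_logits, k_logits)

meta_opt.zero_grad()
matching.backward()                                          # gradients reach meta_net through x^(K)(w)
meta_opt.step()
# In the actual code, the agent and its optimizer are then restored to their
# (K+L-1)th state before the next cycle.
```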

Below are the parameters/hyper-parameters this model uses; they correspond to Table 1 of the paper:

Table 1 hyper-parameters (Flennerhag et al., 2021)

Results:

Below are the results for the cumulative reward and entropy rate curves, using K=3, L=5, a meta learning rate of 0.0001, and 4,800,000 total steps. As one can see, they line up with the results in the paper. The cumulative reward turns positive a little after 1M steps. For the last 300,000 steps, the entropy starts off high while learning and is lowered once the agent starts to get the hang of the game; when the rewards flip (and, since the agent has no memory, it needs to learn again), we want the entropy back up (to explore) and then down again (to master the game), and this is indeed what we see.

Cumulative Reward (left) & Entropy Rate (right)

Further:

It would probably be useful to make a regular actor-critic (A2C) and a Meta-Gradients version using the TCGW and show comparative results, as this comparison was the reason for using the TCGW environment in the paper. If I get around to it, I will add both implementations.

The code repo is located here: https://github.com/l4fl4m3/bmg

If you spot any errors or issues, please point them out.
