Skillful Precipitation Nowcasting — An Implementation of DeepMind’s DGMR
A TensorFlow/Sonnet Implementation of DeepMind's DGMR
Background
DeepMind's paper on skillful precipitation nowcasting is available here, and the accompanying blog post is available here. For a breakdown of the problem and why a deep generative approach was used to tackle it, have a peek at those two sources. At its core, making short-term precipitation-based weather predictions is hard, especially for the next two-hour window. Precipitation nowcasting, a subfield within the vast field of weather forecasting, is employed to fill the gap in this short-term window, which larger NWP (Numerical Weather Prediction) models have a hard time forecasting with good accuracy and resolution.
DGMR (Deep Generative Model of Radar, or is it … Rainfall?) is tasked with solving this precipitation nowcasting problem, and produces forecasts of superior usefulness and accuracy compared to other models, as outlined in the paper.
The Model Architecture
The model is a conditional generative model, and learning takes place in the framework of a CGAN (conditional generative adversarial network). The generator is trained via two discriminators and a regularization term.
The Generator
Has two main modules: a conditioning stack and a sampler.
The conditioning stack makes a conditioning representation from the radar observations input into it. It is a feed-forward convolutional neural net. As the authors note, the input context is challenging for conditional generative models, and the stack structure allows information from the context data (radar) to be used at multiple resolutions. It consists of 4 D-Blocks (residual blocks) and 4 SNConv2D (Spectrally Normalized Convolutional 2D) layers.
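As a rough sketch of how those pieces might fit together (this assumes the `DBlock` module sketched in the D-Block section below; the channel sizes are illustrative, not taken from the repo):

```python
import tensorflow as tf
import sonnet as snt

class ConditioningStack(snt.Module):
    """Sketch: turns 4 context radar frames into one representation
    per resolution. Spectral normalization is omitted for brevity."""

    def __init__(self, name=None):
        super().__init__(name=name)
        # DBlock is sketched in the D-Block section below
        self.dblocks = [DBlock(c) for c in (48, 96, 192, 384)]
        self.convs = [snt.Conv2D(c // 2, 3) for c in (48, 96, 192, 384)]

    def __call__(self, frames):
        # frames: [batch, time=4, H, W, 1]
        steps = [tf.nn.space_to_depth(f, 2) for f in tf.unstack(frames, axis=1)]
        scales = []
        for dblock, conv in zip(self.dblocks, self.convs):
            steps = [dblock(s) for s in steps]  # halve H and W per scale
            scales.append(tf.nn.relu(conv(tf.concat(steps, axis=-1))))  # mix time
        return scales  # one context tensor per resolution, coarsest last
```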
The sampler generates n predictions of future radar from the conditioning representation. It consists of a latent conditioning stack, followed by 4 of the following: a ConvGRU (Convolutional Gated Recurrent Unit) layer, an SNConv layer, a G-Block (a residual block without upsampling), and an Upsample G-Block (a residual block that upsamples using nearest-neighbour interpolation). Finally, there is a BatchNorm layer, a final SNConv2D layer, and a depth-to-space operation. Along with the initial states, latent representations from the latent conditioning stack are given as input to the lowest-resolution ConvGRU block, and the outputs are then successively upsampled and passed into the next ConvGRU.
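The output head at the end of the sampler is small enough to sketch directly. Assuming the sampler's final feature map sits at half the target resolution, the depth-to-space step trades 4 channels for a 2x spatial upscale (a sketch; the module name is mine):

```python
import tensorflow as tf
import sonnet as snt

class OutputHead(snt.Module):
    """Sketch: BatchNorm -> ReLU -> 1x1 conv to 4 channels -> depth-to-space.
    The 1x1 conv is spectrally normalized in the full model."""

    def __init__(self, name=None):
        super().__init__(name=name)
        self.bn = snt.BatchNorm(create_scale=True, create_offset=True)
        self.conv = snt.Conv2D(output_channels=4, kernel_shape=1)

    def __call__(self, x, is_training=True):
        x = self.conv(tf.nn.relu(self.bn(x, is_training)))
        # [B, H, W, 4] -> [B, 2H, 2W, 1]: the final predicted radar frame
        return tf.nn.depth_to_space(x, 2)
```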
The latent stack consists of an SNConv2D layer, 3 L-Blocks (a modified residual block), an Attention Block, and a final L-Block.
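A sketch of that stack, assuming the `LBlock` and `AttentionBlock` modules sketched in the Blocks section below (the 8x8x8 standard-normal draw and the channel progression follow my reading of the paper for 256x256 inputs, so treat them as assumptions):

```python
import tensorflow as tf
import sonnet as snt

class LatentStack(snt.Module):
    """Sketch: maps an 8x8x8 standard-normal draw to the latent fed to
    the lowest-resolution ConvGRU. Spectral normalization omitted."""

    def __init__(self, name=None):
        super().__init__(name=name)
        self.conv = snt.Conv2D(8, 3)
        # LBlock and AttentionBlock are sketched further below
        self.lblocks = [LBlock(c) for c in (24, 48, 192)]
        self.attention = AttentionBlock(192)
        self.final_lblock = LBlock(768)

    def __call__(self, batch_size):
        z = tf.random.normal([batch_size, 8, 8, 8])  # the latent draw
        h = self.conv(z)
        for lblock in self.lblocks:
            h = lblock(h)
        return self.final_lblock(self.attention(h))
```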
The Discriminator
Consists of two parts: a spatial discriminator and a temporal discriminator. Together, they allow for adversarial learning across space and time. The two are similar; however, the temporal discriminator uses 3D convolutions to account for the time dimension.
The spatial discriminator consists of 6 D-Blocks (built on SNConv2D layers), followed by a BatchNorm layer and a Linear layer.
The temporal discriminator consists of 6 D-Blocks (built on SNConv3D layers), followed by a BatchNorm layer and a Linear layer.
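Structurally, the spatial discriminator might be sketched as follows (assuming the `DBlock` from the Blocks section below; the channel sizes are illustrative, and the random frame selection and input pooling are left out):

```python
import tensorflow as tf
import sonnet as snt

class SpatialDiscriminator(snt.Module):
    """Sketch of the per-frame discriminator body: six downsampling
    D-Blocks, global sum pooling, BatchNorm, and a linear score."""

    def __init__(self, name=None):
        super().__init__(name=name)
        self.dblocks = [DBlock(c) for c in (64, 128, 256, 512, 1024, 1024)]
        self.bn = snt.BatchNorm(create_scale=True, create_offset=True)
        self.linear = snt.Linear(1)

    def __call__(self, frame, is_training=True):
        x = tf.nn.space_to_depth(frame, 2)  # [B, H, W, 1] -> [B, H/2, W/2, 4]
        for dblock in self.dblocks:
            x = dblock(x)
        x = tf.reduce_sum(tf.nn.relu(x), axis=[1, 2])  # global sum pooling
        return self.linear(self.bn(x, is_training))    # one realness score
```

The temporal discriminator follows the same pattern, but with 3D convolutions inside its D-Blocks so the score also depends on motion across frames.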
Blocks
D-Block
This block serves to halve the resolution while doubling the number of channels.
Each D-Block consists of: an optional pre-activation, two convolutional layers, an optional downsampling average-pooling layer, and, if needed, an additional 1x1 convolution and downsampling average-pooling layer on the shortcut, followed finally by the residual connection.
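A minimal sketch of that structure (plain convolutions stand in for the spectrally normalized ones used in the full model):

```python
import tensorflow as tf
import sonnet as snt

class DBlock(snt.Module):
    """Sketch: residual block that (optionally) halves H and W while
    changing the channel count."""

    def __init__(self, output_channels, downsample=True, pre_activation=True, name=None):
        super().__init__(name=name)
        self.downsample = downsample
        self.pre_activation = pre_activation
        self.conv1 = snt.Conv2D(output_channels, 3)
        self.conv2 = snt.Conv2D(output_channels, 3)
        # 1x1 conv on the shortcut (applied unconditionally in this sketch)
        self.conv_skip = snt.Conv2D(output_channels, 1)

    def __call__(self, x):
        # Main branch: (optional ReLU) -> conv -> ReLU -> conv -> (optional pool)
        h = tf.nn.relu(x) if self.pre_activation else x
        h = self.conv2(tf.nn.relu(self.conv1(h)))
        if self.downsample:
            h = tf.nn.avg_pool2d(h, 2, 2, 'VALID')
        # Shortcut: 1x1 conv to match channels, plus matching pooling
        s = self.conv_skip(x)
        if self.downsample:
            s = tf.nn.avg_pool2d(s, 2, 2, 'VALID')
        return h + s
```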
G-Block
This block is a residual block similar to the one above.
Each G-Block consists of: a 1x1 SNConv layer on the shortcut if needed, a BatchNorm layer, a ReLU, an SNConv layer, a second BatchNorm layer, a second ReLU, a second SNConv layer, and finally the residual connection.
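A sketch under the same caveats as the D-Block (spectral normalization omitted):

```python
import tensorflow as tf
import sonnet as snt

class GBlock(snt.Module):
    """Sketch of the non-upsampling generator residual block."""

    def __init__(self, output_channels, name=None):
        super().__init__(name=name)
        self.bn1 = snt.BatchNorm(create_scale=True, create_offset=True)
        self.bn2 = snt.BatchNorm(create_scale=True, create_offset=True)
        self.conv1 = snt.Conv2D(output_channels, 3)
        self.conv2 = snt.Conv2D(output_channels, 3)
        self.conv_skip = snt.Conv2D(output_channels, 1)  # only if channels change

    def __call__(self, x, is_training=True):
        h = self.conv1(tf.nn.relu(self.bn1(x, is_training)))
        h = self.conv2(tf.nn.relu(self.bn2(h, is_training)))
        # 1x1 conv on the shortcut only when the channel count changes
        s = self.conv_skip(x) if x.shape[-1] != h.shape[-1] else x
        return h + s
```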
Upsample G-Block
This block serves to double the input's spatial resolution with nearest-neighbour interpolation and to halve its number of channels.
Each Upsample G-Block consists of: a nearest-neighbour upsampling layer and a 1x1 SNConv layer (on the shortcut), a BatchNorm layer, a ReLU, a second nearest-neighbour upsampling layer, an SNConv layer, a second BatchNorm layer, a second ReLU, a second SNConv layer, and finally the residual connection.
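Sketched the same way, with nearest-neighbour resizing standing in for the upsampling layers:

```python
import tensorflow as tf
import sonnet as snt

def _upsample_nearest(x):
    """Double H and W with nearest-neighbour interpolation."""
    return tf.image.resize(x, [2 * x.shape[1], 2 * x.shape[2]], method='nearest')

class UpsampleGBlock(snt.Module):
    """Sketch: doubles spatial resolution and changes (typically halves)
    the channel count. Spectral normalization omitted."""

    def __init__(self, output_channels, name=None):
        super().__init__(name=name)
        self.bn1 = snt.BatchNorm(create_scale=True, create_offset=True)
        self.bn2 = snt.BatchNorm(create_scale=True, create_offset=True)
        self.conv1 = snt.Conv2D(output_channels, 3)
        self.conv2 = snt.Conv2D(output_channels, 3)
        self.conv_skip = snt.Conv2D(output_channels, 1)

    def __call__(self, x, is_training=True):
        # Shortcut: upsample, then 1x1 conv to the new channel count
        s = self.conv_skip(_upsample_nearest(x))
        # Main branch: BN -> ReLU -> upsample -> conv -> BN -> ReLU -> conv
        h = _upsample_nearest(tf.nn.relu(self.bn1(x, is_training)))
        h = self.conv2(tf.nn.relu(self.bn2(self.conv1(h), is_training)))
        return h + s
```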
L-Block
This block is a modified residual block designed to increase the number of channels of its input.
Each L-Block consists of: an activation, a Conv2D layer, a second activation, a second convolutional layer, a residual convolutional layer if needed, and finally the residual connection.
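A sketch, with a 1x1 convolution standing in for the "if needed" residual convolution:

```python
import tensorflow as tf
import sonnet as snt

class LBlock(snt.Module):
    """Sketch of the latent-stack residual block that grows the channel count."""

    def __init__(self, output_channels, name=None):
        super().__init__(name=name)
        self.conv1 = snt.Conv2D(output_channels, 3)
        self.conv2 = snt.Conv2D(output_channels, 3)
        self.conv_skip = snt.Conv2D(output_channels, 1)

    def __call__(self, x):
        h = self.conv2(tf.nn.relu(self.conv1(tf.nn.relu(x))))
        # Adjust the shortcut's channel count only when it differs
        s = self.conv_skip(x) if x.shape[-1] != h.shape[-1] else x
        return h + s
```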
Attention Block
This is a spatial attention module. As the authors state, it allows the model to be more robust across different types of regions and events, and provides an implicit regularization to prevent overfitting.
The Attention Block consists of query, key, and value 1x1 Conv2D layers, a learnable gamma variable, and the residual connection.
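A sketch of single-head spatial self-attention with the gamma-gated residual (the 1/8 key/query channel reduction is an assumption, not taken from the repo):

```python
import tensorflow as tf
import sonnet as snt

class AttentionBlock(snt.Module):
    """Sketch: spatial self-attention with a learnable gamma on the residual."""

    def __init__(self, channels, name=None):
        super().__init__(name=name)
        self.query = snt.Conv2D(channels // 8, 1)  # 1/8 reduction is an assumption
        self.key = snt.Conv2D(channels // 8, 1)
        self.value = snt.Conv2D(channels, 1)
        self.gamma = tf.Variable(0.0, name='gamma')

    def __call__(self, x):
        b, h, w, c = x.shape
        q = tf.reshape(self.query(x), [b, h * w, -1])
        k = tf.reshape(self.key(x), [b, h * w, -1])
        v = tf.reshape(self.value(x), [b, h * w, -1])
        # Each position attends over all spatial positions
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)
        out = tf.reshape(tf.matmul(attn, v), [b, h, w, c])
        return x + self.gamma * out  # residual connection, gated by gamma
```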
ConvGRU
The ConvGRU uses the conditioning representations as initial states for each of its recurrent modules.
The ConvGRU consists of a sigmoid-activated SNConv2D layer for the read gate, a sigmoid-activated SNConv2D layer for the update gate, and a ReLU-activated SNConv2D layer for the candidate activation.
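In code form, a single cell step might look like this (plain convolutions in place of the spectrally normalized ones; gate conventions vary between implementations, so treat this as a sketch):

```python
import tensorflow as tf
import sonnet as snt

class ConvGRUCell(snt.Module):
    """Sketch of a convolutional GRU cell with a ReLU candidate activation."""

    def __init__(self, channels, kernel_shape=3, name=None):
        super().__init__(name=name)
        self.read_conv = snt.Conv2D(channels, kernel_shape)       # "read" (reset) gate
        self.update_conv = snt.Conv2D(channels, kernel_shape)     # update gate
        self.candidate_conv = snt.Conv2D(channels, kernel_shape)  # candidate activation

    def __call__(self, x, state):
        xh = tf.concat([x, state], axis=-1)
        r = tf.sigmoid(self.read_conv(xh))
        u = tf.sigmoid(self.update_conv(xh))
        # Candidate uses the read-gated state; ReLU instead of the usual tanh
        c = tf.nn.relu(self.candidate_conv(tf.concat([x, r * state], axis=-1)))
        new_state = (1.0 - u) * state + u * c  # blend old state and candidate
        return new_state, new_state  # output and next state coincide
```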
Training
Loss Functions
Discriminator
Minimize the spatial discriminator loss (D) and the temporal discriminator loss (T). Both loss functions utilize a hinge loss.
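Concretely, a hinge formulation like the following is used for both discriminators (a sketch; the function name is mine):

```python
import tensorflow as tf

def discriminator_hinge_loss(scores_real, scores_generated):
    """Sketch: push scores on real sequences above +1 and scores on
    generated sequences below -1. Used for both D and T."""
    loss_real = tf.nn.relu(1.0 - scores_real)
    loss_generated = tf.nn.relu(1.0 + scores_generated)
    return tf.reduce_mean(loss_real + loss_generated)
```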
Generator
Maximize the generator objective: the sum of the spatial (D) and temporal (T) discriminator terms, minus a weighted regularization term R, which averages the generated samples and weights the final loss towards heavier rainfall scenarios.
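A sketch of the objective, negated so an optimizer can minimize it; λ = 20 and the clipping of weights at 24 follow the paper, but the function shapes and names here are mine:

```python
import tensorflow as tf

def grid_cell_regularizer(generated_samples, targets):
    """Sketch of R: mean over generated samples vs. observations,
    weighted towards heavier rainfall (weights clipped at 24)."""
    generated_mean = tf.reduce_mean(generated_samples, axis=0)  # average the samples
    weights = tf.clip_by_value(targets, 0.0, 24.0)
    return tf.reduce_mean(tf.abs(generated_mean - targets) * weights)

def generator_loss(d_scores_generated, t_scores_generated,
                   generated_samples, targets, lam=20.0):
    """Sketch: negate (D + T - lambda * R) for minimization."""
    adversarial = (tf.reduce_mean(d_scores_generated)
                   + tf.reduce_mean(t_scores_generated))
    return -(adversarial - lam * grid_cell_regularizer(generated_samples, targets))
```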
Distributed Setup
To train this behemoth of a model, I used the UK training set provided by DeepMind.
I trained (and am still training) it distributed across 8 TPU v3 cores, kindly provided by Google's TPU Research Cloud program. Due to the model size, I used a global batch size of 8 per step, and 3 samples per input during the generation step prior to grid cell regularization, as opposed to the 6 used in the paper. All other hyperparameters align with the paper. I experimented with a number of GCP VM and GPU configurations, but after refactoring the training loop to run on a TPU, I found that it worked best, and Google kindly provided the service for free through their program. Training is periodically checkpointed to cloud storage, and since the dataset is so large, some refactoring is needed to incorporate an epoch structure during training.
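For reference, the basic TPU wiring in TensorFlow 2 looks like this (the resolver argument depends on whether you are on a TPU VM or a named TPU node; this is a setup sketch, not the repo's actual training loop):

```python
import tensorflow as tf

# Connect to the TPU; the empty string works on TPU VMs, otherwise
# pass your TPU node's name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

GLOBAL_BATCH_SIZE = 8  # one example per core across 8 TPU v3 cores

with strategy.scope():
    # Build the generator, both discriminators, and their optimizers here
    # so that their variables are replicated across all cores.
    pass
```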
Since I am still tweaking this here and there, I won't publish any loss curves or other statistics for now.
The model is available here. Check it out and offer up any suggestions for improvements, errors, or issues.