Velocity Learning

Recently, while reading the Geometric Deep Learning book, I stumbled upon this passage:

Crucially, learning a dynamical system by modeling its velocity turns out to be much easier than learning its position directly. In our learning setup, this translates into an optimisation landscape with more favorable geometry, leading to the ability to train much deeper architectures than was possible before.

  • Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, p. 73

It was in a short discussion about why ResNets work well, and it struck me as an interesting point, although I have little intuition for why it is true. Since reading it, I have been thinking about how this might apply to other architectures and training setups. Why do diffusion models work so well? In Denoising Diffusion Probabilistic Models, Ho et al. reparametrize the reverse process of the diffusion model[1] to train the network to learn the perturbation[2] rather than the signal. In some sense, this is modelling the velocity[3]. They find that it performs at least as well as, and in some cases better than, the more intuitive formulation before reparametrizing.
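
To make this concrete for myself, here are two sketches (my own illustrations, not code from the book or the paper). First, a residual block can be read as one Euler step of an ODE: the block computes x + f(x), so f only has to learn the change, the velocity, while the identity mapping comes for free.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes x + f(x): one Euler step of dx/dt = f(x) with step size 1."""

    def __init__(self, dim: int):
        super().__init__()
        # f only has to learn the update (the "velocity"); with small
        # weights the block is already close to the identity map,
        # which is an easy place to start optimizing from.
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # position + velocity
```

Second, the DDPM reparametrization: sample a timestep and a noise vector, form the noisy image in closed form, and regress the network's output against the noise rather than against the clean image. This is a rough sketch of the simplified loss from Ho et al.; the model signature and the alpha_bar schedule here are assumptions on my part.

```python
import torch

def ddpm_simple_loss(model, x0, alpha_bar):
    """Sketch of L_simple from Ho et al.: predict the added noise.

    alpha_bar: 1-D tensor of cumulative products of (1 - beta_t).
    model(x_t, t) is assumed to output a noise prediction of x0's shape.
    """
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))             # random timestep per sample
    eps = torch.randn_like(x0)                          # the "perturbation"
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast to x0's shape
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps        # noisy image, in closed form
    return ((model(x_t, t) - eps) ** 2).mean()          # regress the noise, not the image
```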

I get a similar feeling from transformers, although it is less clear-cut. The query, key, and value in an attention layer are all derived from the input data, and the dot products between them mean that the output of an attention layer is a higher-order representation of the input[4]. This is something that common operations such as convolutions or fully connected layers don’t allow, and none of the popular non-linearities used as activations provide it either. Granted, a higher-order description is not the same as learning velocity, but I feel it is worth exploring[5].
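
One way I convince myself of the higher-order claim: Q, K, and V are each linear in the input x, and attention multiplies them together, so x enters the output through three separate factors, whereas a convolution or fully connected layer is linear in x and a pointwise activation treats each feature independently. A minimal single-head sketch (the weight names are mine):

```python
import torch

def single_head_attention(x, Wq, Wk, Wv):
    """x: (seq, dim). Q, K, and V are all linear in x, so the output
    mixes three copies of x together, which no purely linear layer can do."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                           # each linear in x
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)  # quadratic in x
    return torch.softmax(scores, dim=-1) @ v                   # a third copy of x enters via V
```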

  1. Which is the part where they take a noisy image and turn it into a slightly less noisy image.

  2. Here, the noise added at a given step in the forward process.

  3. Or at least a difference; let’s say those are the same in this case.

  4. Is it third order? I still struggle to understand the respective roles of query, key, and value. They all seem like embeddings to me.

  5. Especially when transformers might be doing gradient descent on the context? See Transformers learn in-context by gradient descent by von Oswald et al.




Other obnoxious thoughts to peruse:

  • State of XAI - Brainstorm
  • QuAC!
  • The Ladder of Causation
  • A Day In The Life
  • Uncertainty Spice