In this post, the gradient of the attention op will be derived from a single rule used to implement reverse-mode automatic differentiation. The attention mechanism is the core building block of the transformer architecture, which underlies today's most successful language models. It was shown that attention can replace recurrent blocks in neural networks.
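For reference, here is a minimal sketch of the forward attention op whose gradient the post derives, assuming standard scaled dot-product attention, softmax(QKᵀ/√d)V, written in plain NumPy; the variable names (`q`, `k`, `v`) are illustrative, not from the original:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q @ k.T / sqrt(d)) @ v.

    q: (n, d) queries, k: (m, d) keys, v: (m, d_v) values.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (n, m) similarity logits
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v                  # (n, d_v) weighted mix of values
```

A hand-derived gradient for this op can be sanity-checked numerically, e.g. with finite differences against this forward pass.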