

What is qkv in LLM transformers? What does it do? How does it work?

December 10, 2023 11:30 pm

There are several great explanations of how the encoder-decoder transformer described in the paper Attention Is All You Need works, e.g. The Transformer Model.

A fundamental concept is attention, i.e. adding context and meaning to each individual word by considering it against each of the other words surrounding it.

For example, if a sentence contains the word ‘bank’ then the presence of ‘money’ in the same sentence suggests that it means a financial institution rather than a river bank.

Attention is implemented in code using query (q), key (k) and value (v) vectors, and there are some analogies, like this one, which compares the key/value/query concept to retrieval systems: “when you search for videos on Youtube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).”
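To see what the analogy is getting at, here is a toy Python sketch of that kind of ‘hard’ retrieval. Everything in it (the video records, the word-overlap scoring, the search function) is made up purely for illustration; real search engines and real attention heads score matches very differently.

```python
# Toy retrieval: a query is scored against each video's "keys"
# (title + description words) and the best match is returned.
def search(query_words, videos):
    def score(video):
        # the "keys" are simply the words in the title and description
        keys = (video["title"] + " " + video["description"]).lower().split()
        return sum(word in keys for word in query_words)
    # the best-matched video is the "value" that gets returned
    return max(videos, key=score)

videos = [
    {"title": "River bank erosion", "description": "geography field trip"},
    {"title": "How a bank works", "description": "money loans and interest"},
]

print(search(["bank", "money"], videos)["title"])  # -> How a bank works
```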

This analogy, however, raises as many questions as it answers.

To answer these questions, I found this video, which gives a great practical explanation of the model and demonstrates that it is correct by implementing it as working code: a small-scale test case that can be trained and run in minutes on the free Google Colab platform.

Karpathy (1:04:15, 1:08:00) describes the query (q) as “what am I looking for” (“I” being a single token from the input sentence) and the key (k) as “what do I contain”, so the dot product q.k (where k ranges over the keys of all the tokens in the input sentence) becomes the affinity between the tokens of the input. Where a token’s query vector aligns with another token’s key vector, the first token ‘learns’ more about the second (aggregates its feature information into its own position).
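To make that concrete, here is a minimal PyTorch sketch of the query/key step in the spirit of the video’s single-head example. The shapes and names (B, T, C, head_size) are toy choices made here, not values taken from the post or the video.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 8, 32          # batch, tokens in the sentence, embedding size
head_size = 16

x = torch.randn(B, T, C)    # embeddings of the input tokens

query = torch.nn.Linear(C, head_size, bias=False)
key = torch.nn.Linear(C, head_size, bias=False)

q = query(x)                # "what am I looking for"  (B, T, head_size)
k = key(x)                  # "what do I contain"      (B, T, head_size)

# q.k: how strongly each token's query aligns with every token's key
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T) scaled affinities
wei = F.softmax(wei, dim=-1)                      # each row sums to 1
print(wei[0])                                     # token-to-token affinities
```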

The value (v) is the “thing that gets aggregated for the purpose of the particular head of attention” that the q, k and v matrices form. Ultimately, the affinities q.k are used to weight the value vectors, so that the aggregated output (q.k).v can sufficiently distinguish token sequences and hence allow the most appropriate next word to be predicted (by the very last ‘softmax’ component of the decoder).
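Carrying the sketch above one step further, here is the whole single head once the value projection is added: the softmaxed affinities from q.k weight the value vectors, and each token’s output is that weighted sum. Again, the shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 8, 32
head_size = 16
x = torch.randn(B, T, C)

query = torch.nn.Linear(C, head_size, bias=False)
key = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)

# token-to-token affinities, as in the previous sketch
wei = F.softmax(q @ k.transpose(-2, -1) * head_size**-0.5, dim=-1)

# v is "the thing that gets aggregated" for this head:
# each token's output is a weighted sum of the value vectors it attends to
out = wei @ v               # (B, T, head_size)
print(out.shape)            # torch.Size([1, 8, 16])
```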

To put all this into more pithy and understandable terms:

- query (q): what am I looking for
- key (k): what do I contain
- value (v): what I contribute (what gets aggregated) when other tokens attend to me

Hope this helps – if you find a better explanation or ‘intuition’ of qkv please do leave a comment!

Posted by gichow

Categories: Miscellaneous
