What is qkv in LLM transformers? What does it do? How does it work?

There are several great explanations of how the encoder-decoder transformer described in the paper Attention Is All You Need works, e.g. The Transformer Model.

A fundamental concept is attention, i.e. adding context and meaning to individual words by considering each word against each of the other words surrounding it.

For example, if a sentence contains the word ‘bank’ then the presence of ‘money’ in the same sentence suggests that it means a financial institution rather than a river bank.

Attention is implemented in code using query (q), key (k) and value (v) vectors, and there are some analogies like this one, which regards the query/key/value concept as similar to retrieval systems. For example, “when you search for videos on Youtube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).”
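
As a minimal sketch of that retrieval analogy in code (all the keys, values and numbers below are invented purely for illustration), attention behaves like a ‘soft’ lookup: instead of returning the value of the single best-matching key, it returns a blend of every value, weighted by how well the query matches each key.

```python
# Illustrative only: the keys, values and query below are invented numbers.
import torch

keys   = torch.tensor([[1.0, 0.0],   # key for item A
                       [0.0, 1.0],   # key for item B
                       [0.7, 0.7]])  # key for item C
values = torch.tensor([[10.0],       # value stored under A
                       [20.0],       # value stored under B
                       [30.0]])      # value stored under C

query = torch.tensor([0.9, 0.1])     # "what am I looking for" - most aligned with A's key

scores  = keys @ query                   # similarity of the query to every key
weights = torch.softmax(scores, dim=0)   # turn similarities into proportions summing to 1
result  = weights @ values               # a blend of all the values, dominated by A's

print(weights)   # A gets the largest share, B the smallest
print(result)    # a weighted mix of 10, 20 and 30 rather than exactly 10
```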

This analogy raises as many questions as it answers, however:

  • What is the query? – the whole of the text we provide to the AI interface (ChatGPT, Bard, etc.) or just a single word?
  • How is the query stored in a matrix?
  • What kind of key is the query matched against?
  • How does the key relate to the value?
  • Is the value something we get as an output from the process or something the model already possesses from previous training?
  • What is the point of the value we get from this whole process? How does it relate to generating a response to the text we enter into things like ChatGPT or Bard?

To answer these questions, I found this video, which provides a great practical explanation of the model and demonstrates that it is correct by implementing it as working code, with a small-scale test case that can be trained and run in minutes on the free Google Colab platform.

Karpathy (1:04:15, 1:08:00) describes the query (q) as “what am I looking for” (“I” being a single token from the input sentence) and the key (k) as “what do I contain”, so the dot product q.k (where k covers the keys of all the tokens in the input sentence) becomes the affinity between the tokens of the input. Where a token’s query vector aligns with another token’s key vector, the token ‘learns’ more about that other token (aggregates its feature information into its own position).
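
A minimal sketch of this affinity step, written in the general spirit of the video’s small-scale implementation but not copied from it (the batch size, sequence length and head size below are made up):

```python
# Illustrative only: sizes and names below are made up, not taken from any real model.
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C   = 1, 4, 32   # batch size, tokens in the sequence, embedding channels per token
head_size = 16

x = torch.randn(B, T, C)                 # embeddings for a 4-token input

query = nn.Linear(C, head_size, bias=False)
key   = nn.Linear(C, head_size, bias=False)

q = query(x)                             # (B, T, head_size): "what am I looking for"
k = key(x)                               # (B, T, head_size): "what do I contain"

# Affinity of every token's query with every token's key (scaled dot product).
wei = q @ k.transpose(-2, -1) * head_size**-0.5      # (B, T, T)

# In a decoder, a token may only attend to itself and earlier tokens.
tril = torch.tril(torch.ones(T, T))
wei  = wei.masked_fill(tril == 0, float('-inf'))
wei  = torch.softmax(wei, dim=-1)        # row i: how strongly token i attends to tokens 0..i

print(wei[0])                            # the 4x4 affinity matrix for our 4 tokens
```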

The value (v) is the “thing that gets aggregated for the purpose of the particular head of attention” that the q, k and v matrices form. Ultimately, the purpose of the value is to carry what each token offers, weighted by the token affinities (q.k), so that the aggregated product (q.k).v can sufficiently distinguish token sequences and hence allow the most appropriate next word to be predicted (by the very last ‘softmax’ component of the decoder).
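
Putting q, k and v together, a whole single head of self-attention can then be sketched roughly as follows. Again, this is an illustrative toy in the style of the video’s code rather than the code itself; the class name Head and all sizes here are my own:

```python
# Illustrative only: the class name Head, the sizes and the variable names are my own.
import torch
import torch.nn as nn

class Head(nn.Module):
    """One head of causal self-attention (a sketch, not a production implementation)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so a token only attends to itself and earlier tokens.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]                                       # number of tokens
        q = self.query(x)                                    # what each token is looking for
        k = self.key(x)                                      # what each token contains
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5    # (B, T, T) token affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = torch.softmax(wei, dim=-1)
        v = self.value(x)                                    # what each token offers once attended to
        return wei @ v                                       # (B, T, head_size): aggregated features

# Toy usage: 4 tokens, 32-dim embeddings, 16-dim head.
x = torch.randn(1, 4, 32)
head = Head(n_embd=32, head_size=16, block_size=8)
print(head(x).shape)                                         # torch.Size([1, 4, 16])
```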

To put all this into more pithy and understandable terms:

  • There is a query, a key and a value matrix for each ‘head’ of attention, i.e. each way of characterising the relationships between words (e.g. relations between individual tokens, relations between pairs of tokens, relations between groups of 4 tokens, etc.)
  • Q contains the ‘word mix’ (more accurately, token mix) from our input text, (a) at a particular word position in the text and (b) constrained to a fixed number of sequential words (e.g. 4, as “hard-coded” for our particular LLM implementation). For example, 4 words from our input text “I like learning about artificial intelligence” starting at position 1 would be “I like learning about”.
  • K contains the features that this same set of words has – one feature might be e.g. “is a doing word”
  • Q.K gives us a representation of the meaning of the input word mix by matching the features each input word is looking for (Q) against the features each input word has (K). So “I” might look for the “doing word” feature and “like” and “learning” would offer that feature. In the matrix dot product for the “doing” feature, “I”, “like” and “learning” would shine. This product is also called the compatibility matrix, since it captures how compatible each word is with every other word and hence the extent to which the features of the compatible words should be baked into each query word.
  • We need a consistent way of storing meaning for the computer, since two different human-language sentences or ‘word mixes’ could yield the same deep meaning (e.g. “a likes b” and “b is liked by a”) and vice versa, i.e. an identical sentence could carry a different deep meaning depending on what words came before it. Q.K gives us that.
  • V contains the values or weights for each word’s features, e.g. we can imagine features like
    • is an action
    • is a thing
  • When V is then multiplied by Q.K, we get a numeric matrix that we can use to represent the meaning of the word mix (a toy numeric sketch of this follows after this list). Subsequent steps in the model can then predict (from the model’s historic training data of English sentences / knowledge) which word likely comes next after encountering this particular meaning.
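
To make these bullet points concrete, here is a hand-crafted toy sketch of the Q.K then V mechanics (with a softmax in between to turn affinities into proportions). Real models learn Q, K and V; the “doing word” and “thing” features and all the numbers below are invented purely for illustration:

```python
# Illustrative only: real models learn Q, K and V; the "doing word" / "thing"
# features and every number below are invented purely to show the mechanics.
import torch

tokens = ["I", "like", "learning", "about"]

# Column 0: an imaginary "doing word" feature; column 1: an imaginary "thing" feature.
Q = torch.tensor([[2.0, 0.0],    # "I" is looking for doing words
                  [0.0, 1.0],    # "like" is looking for things
                  [0.0, 1.0],    # "learning" is looking for things
                  [0.0, 0.5]])   # "about" mildly looks for things

K = torch.tensor([[0.0, 1.0],    # "I" offers the "thing" feature
                  [2.0, 0.0],    # "like" offers the "doing word" feature
                  [2.0, 0.0],    # "learning" offers the "doing word" feature
                  [0.0, 0.0]])   # "about" offers neither

V = torch.tensor([[0.0, 1.0],    # the feature values each token contributes when attended to
                  [1.0, 0.0],
                  [1.0, 0.0],
                  [0.2, 0.2]])

compat  = Q @ K.T                         # compatibility matrix: row i = what token i finds in each token
weights = torch.softmax(compat, dim=-1)   # each row now sums to 1
out     = weights @ V                     # each token's new, context-mixed representation

print(dict(zip(tokens, compat[0].tolist())))   # "I" scores "like" and "learning" highly
print(out[0])                                  # so "I" ends up carrying mostly doing-word value
```

Running this, the first row of the compatibility matrix shows “I” lighting up on “like” and “learning” (the doing words), and the output row for “I” is therefore dominated by the doing-word value – exactly the ‘baking in’ of compatible words’ features described above.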

Hope this helps – if you find a better explanation or ‘intuition’ of qkv please do leave a comment!
