Transformers for all

5 months ago

Interest in Large Language Models (LLMs), and in Artificial Intelligence in general, has skyrocketed since OpenAI released ChatGPT in 2022. The ripples of ChatGPT have spread far and wide, and everybody seems to be talking about the impending AGI (Artificial General Intelligence) these days. Although many people, within the tech community and outside it, have heard of or used ChatGPT, most of them may not know about the technology that powers LLMs such as ChatGPT: the Transformer. Many online resources give brilliant demonstrations of what transformers are and how they work, but most of them are technical and a bit difficult for a general audience to understand.

In this post, I’ll try to explain the transformer architecture that powers LLMs like ChatGPT, Llama, and others in plain English so that everyone can understand it, hence the name: Transformers for all.

The transformer is a sequence-to-sequence model, meaning it takes a sequence of tokens as input and outputs another sequence of tokens. In Natural Language Processing, a token may be a word, a subword, or even a character.

Fig 1. Each of the underlined subwords can be considered a token
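
To make the three kinds of token concrete, here is a small Python sketch. The subword split uses a tiny hand-written vocabulary (toy_vocab) and a greedy longest-match rule; both are assumptions made for illustration, since real tokenizers learn their vocabulary from data.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per character."""
    return list(text)


def subword_tokens(word, vocab):
    """Greedy longest-match split of a single word into known pieces.

    Falls back to a single character when no longer piece is in the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until one matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


toy_vocab = {"trans", "form", "er", "s"}  # hypothetical vocabulary

print(word_tokens("transformers for all"))        # ['transformers', 'for', 'all']
print(subword_tokens("transformers", toy_vocab))  # ['trans', 'form', 'er', 's']
print(char_tokens("all"))                         # ['a', 'l', 'l']
```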

The output of a transformer is also a sequence, and it depends on the input and on the task we want the model to perform. (Transformers have also been adapted to classification tasks, but here I’m talking about the original transformer architecture.) This sequence-to-sequence paradigm enables the model to perform various natural-language tasks such as translation (where the input sequence is a sentence in one language and the output sequence is its translation in the target language) and summarization (where the input is a span of text and the output is its summary).

Motivation behind Transformer

Before the transformer architecture came along, Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks in particular, were used ubiquitously for NLP and other sequence-processing tasks. However, RNNs are sequential in nature and process one token at a time. To process, say, the 100th token of a sequence, all 99 previous tokens must already have been processed. The RNN unit at the 100th step takes the context accumulated from the 99 processed tokens, together with the 100th token itself, to produce the context for the 100th position. This sequential nature makes training slow, especially for long sequences such as those in summarization tasks, where the input can have thousands of tokens.

Fig 2. RNN unit processing one token at a time
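
The step-by-step dependency can be sketched in a few lines of Python. This is a toy, not a real RNN: the state is a single number instead of a vector, and the weights w_state and w_input are fixed made-up values rather than learned ones.

```python
import math


def rnn_step(prev_state, token_value, w_state=0.5, w_input=0.5):
    """One toy RNN step: mix the previous state with the current token."""
    return math.tanh(w_state * prev_state + w_input * token_value)


def run_rnn(token_values):
    """Process tokens strictly in order; each step needs the state from the step before."""
    state = 0.0
    states = []
    for value in token_values:
        state = rnn_step(state, value)
        states.append(state)
    return states


# Only the first token carries a signal; the rest are zeros.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

The loop cannot compute step 5 before step 4 has finished, and with these toy weights the trace left by the first token gets smaller at every step.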

For the reasons mentioned above, parallelizing the operations of an RNN is difficult: we have to wait for the first 99 tokens to be processed before we can begin processing the 100th. Another problem is long-term dependency. In long sequences, the context from faraway tokens has to travel a long way to reach the state of the token the RNN unit is currently processing. In our 100-token example, the context from the 1st token has to pass through 99 processing steps to have any impact on the 100th token. It is very likely that this contextual information is diluted by the time it gets there, which means our RNN model may not learn how the 1st and 100th tokens are related.

The transformer architecture was designed to address these problems. The main way it does so is the attention mechanism, which I discuss in detail shortly. Attention enables the operations to be parallelized and also provides a direct path for every token in the input sequence to interact with every other token.

Main components of a transformer

  • The embedding layer
  • The encoder unit
  • The decoder unit
Fig 3: The transformer architecture with its main components.

So the general workflow of a transformer looks like this:

The input sequence goes to an embedding layer, where each token is converted to a vector (a row of numbers). Each token’s vector also gets information about its position in the sequence. Once the text sequence has its numerical form, it flows into the encoder unit, where each token interacts with all the other tokens in the sequence using the self-attention mechanism, and the tokens learn how related they are to one another. The output of the encoder unit is all the tokens in vector form, carrying information about how they relate to the other tokens in the sequence. This output is the new representation of the input sequence. It then flows to the decoder unit, which analyses it and produces the output sequence based on the task or objective the model is being trained on.

I’ll now discuss each of these components in more detail.

Embedding layer

The purpose of the embedding layer is to convert the textual sequence into a numerical form that the machine understands. There are two kinds of embeddings going on at the embedding layer: Word/token embedding and position embedding. Let’s talk about token embedding first.

In token embedding, each token in the input sentence is converted to a vector of fixed dimension. Initially the vectors are random, meaning each token vector is filled with random numbers. Over the course of training, the model learns an appropriate vector for each token. This is called a learned embedding.

Fig 4: Arbitrary vector representation of each input sequence

Fig 4 shows an arbitrary vector representation of the example input sequence “you shall not pass”. The original transformer implementation used a 512-dimensional vector for each token. The presumption is that, during training, the model learns to assign similar vectors to tokens that are related or similar. Here I have treated each word in our input as a single token. In practice, this depends on the tokenizer used with the model. There are subword tokenizers, such as byte pair encoding, which may break words into subwords based on how frequently they occur in the dataset.
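
The lookup itself is simple enough to sketch in Python. The function names and the 4-number vectors are assumptions chosen to keep the output readable; the sketch only shows the random starting point, not the training that later adjusts the numbers.

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    """Give every token a vector of `dim` random numbers (the state before training)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for token in vocab}


def embed(tokens, table):
    """Replace each token in the sequence with its vector from the table."""
    return [table[token] for token in tokens]


vocab = ["you", "shall", "not", "pass"]
table = build_embedding_table(vocab, dim=4)  # the original transformer used 512
vectors = embed("you shall not pass".split(), table)

print(len(vectors), len(vectors[0]))  # 4 4
```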

After the token embedding phase, each token is also assigned a second vector, called the positional embedding, which carries the token’s position information. An RNN doesn’t need explicit position information because the tokens are processed sequentially, so their positions are implied. In a transformer, however, all the input tokens go into the encoder unit at the same time, so the tokens need some signal that indicates their position in the sequence. In positional embedding, a vector representing each token’s position in the input text is generated, and this sequence of position vectors is added to the sequence of token embeddings. We have now converted a sequence of natural-language text into a sequence of vectors (arrays of numbers) that the machine can work with. This sequence of vectors goes into the encoder block.
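
This post does not go into how the position vectors are generated. As one possibility, the sketch below uses the sine and cosine scheme from the original transformer paper; other models learn their position vectors instead. The all-zero token vectors are placeholders so that the added position signal is easy to see.

```python
import math


def positional_encoding(num_positions, dim):
    """Sinusoidal position vectors: sine at even indices, cosine at odd indices."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of indices (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(token_vectors):
    """Add the position vector to each token vector, element by element."""
    dim = len(token_vectors[0])
    pe = positional_encoding(len(token_vectors), dim)
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(token_vectors, pe)]


token_vectors = [[0.0] * 4 for _ in range(3)]  # placeholder embeddings
encoder_input = add_position(token_vectors)
print(encoder_input[0])  # position 0 gives [0.0, 1.0, 0.0, 1.0]
```

Because every position gets a different vector, two identical words at different places in the sentence enter the encoder with different representations.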

Encoder

The encoder is a series of layers in which the model tries to understand the syntax and semantics of the input. By syntax and semantics, I mean the grammatical structure, the meaning of the input text, and the context of each word in it. The encoder block has various layers, but the most important operation the input goes through is attention.

This is an operation where each token interacts with all the other tokens and tries to figure out its importance or its meaning with respect to other tokens.

Attention

Fig 5: Attention among the tokens

The figure above shows representative learned attention scores between each token and the other tokens. (The attention of the tokens ‘a’, ‘red’, and ‘car’ with the other tokens is not shown, to keep the figure less cluttered.) The thickness of a line shows how strongly one token is related to another.

Self-attention is a mechanism in which a text looks at all of its own tokens, as shown in the figure, and computes a new representation of each token based on the attention scores.

It can be best understood as follows:

Take the token “There”, whose vector representation comes in from the embedding layer; I’ll call it There_vec.

The vector representations of the other tokens are is_vec, a_vec, red_vec, and car_vec.

Suppose “There” has an attention score of 50% with itself, 10% with “is”, 10% with “a”, 15% with “red”, and 15% with “car”.

Now, the new representation of “There” after applying attention would be:

There_H1 = 50% of There_vec + 10% of is_vec + 10% of a_vec + 15% of red_vec + 15% of car_vec

Hence, after applying attention, each token becomes a combination of all the tokens in the given text, each weighted by its attention score.
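
The same arithmetic in Python, using the percentages from the example. The 2-number vectors are made up so the result is easy to check by hand; real embeddings have hundreds of dimensions.

```python
def weighted_sum(vectors, weights):
    """Combine vectors into one, scaling each by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


# Made-up 2-dimensional vectors standing in for the embeddings.
token_vecs = {
    "There": [1.0, 0.0],
    "is": [0.0, 1.0],
    "a": [0.0, 0.0],
    "red": [2.0, 2.0],
    "car": [4.0, 0.0],
}
# Attention scores of "There" with every token, as fractions of 1.
scores = {"There": 0.50, "is": 0.10, "a": 0.10, "red": 0.15, "car": 0.15}

order = ["There", "is", "a", "red", "car"]
there_h1 = weighted_sum([token_vecs[t] for t in order], [scores[t] for t in order])
print([round(x, 2) for x in there_h1])  # [1.4, 0.4]
```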

Now the question remains: how are the attention scores between each token and all the other tokens determined?

They are learned through neural network training, in the form of the values of certain parameters that we call weights. This training technique is not unique to transformers, so it is out of scope for this post.
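
Leaving training aside, it may still help to see how a set of weights turns into attention scores. The sketch below follows the scaled dot-product recipe of the original transformer: each token vector is multiplied by a query matrix and a key matrix, every query is compared with every key, and a softmax turns the comparisons into percentages. The matrices w_query and w_key here are hypothetical hand-picked values; in a real model they are what training learns.

```python
import math


def matvec(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]


def softmax(values):
    """Turn arbitrary numbers into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(token_vectors, w_query, w_key):
    """Row i holds the attention weights of token i over every token."""
    queries = [matvec(v, w_query) for v in token_vectors]
    keys = [matvec(v, w_key) for v in token_vectors]
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        for query in queries
    ]


# Hypothetical weights; a trained model would have learned these.
w_query = [[1.0, 0.0], [0.0, 1.0]]
w_key = [[1.0, 0.0], [0.0, 1.0]]
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = attention_scores(token_vectors, w_query, w_key)
for row in scores:
    print([round(s, 2) for s in row])
```

Each printed row sums to 1, so it can be read as the percentages used in the example above.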

Another important concept at work inside the encoder block is multi-head attention.

Multi-head attention

The main idea behind multi-head attention is to analyze the text from different contexts or perspectives. When we humans read or listen to natural language, we understand it from several contexts at once. One context could be grammatical correctness; another could be the sentiment carried by the text. There may be many contexts through which our minds perceive natural language. The attention scores demonstrated above analyze the tokens from just one context. In multi-head attention, the process of representing each token as an attention-weighted combination of tokens is repeated many times in parallel. In each “head” the learned attention scores are different, which means the tokens relate to one another differently in different contexts. It’s usual to use 8, 16, 32, or even more heads in multi-head attention.

In each head, the tokens are represented as I have shown in the above section. At the end of the multi-head attention, the token representations are concatenated together to get a new representation.

If we have 8 heads, the token “There” would be represented as:

There_H1 + There_H2 + . . . . . . + There_H8

(Here, + denotes concatenation).
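
A minimal Python sketch of the concatenation, with two heads instead of eight and made-up attention weights. Real implementations also project each head into a smaller space first and mix the concatenated result with one more learned layer; both steps are skipped here.

```python
def weighted_sum(vectors, weights):
    """Combine vectors into one, scaling each by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def multi_head(vectors, weights_per_head):
    """Run one weighted sum per head and join the results end to end."""
    combined = []
    for weights in weights_per_head:
        combined.extend(weighted_sum(vectors, weights))
    return combined


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, 2 numbers each
weights_per_head = [
    [0.5, 0.25, 0.25],  # head 1: attends mostly to the first token
    [0.0, 0.0, 1.0],    # head 2: attends only to the last token
]

print(multi_head(vectors, weights_per_head))  # [0.75, 0.5, 1.0, 1.0]
```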

The output of the encoder block is now a new representation of the input text, obtained by analyzing each token’s importance to the other tokens and by analyzing the text in different contexts.

Decoder

The encoder block is responsible for processing the input sequence, whereas the decoder is where the input representations coming out of the encoder block interact with the expected output sequence, or with the output sequence the model has generated so far.

Let’s say we are training our model to translate text from English to French, and the English text is “There is a red car.”

Fig 6: Interaction of input sequence(encoder representation) with the output sequence

At the beginning we only have the input sequence, so the output sequence is just an added <start> token. The input sequence interacts with this <start> token just as described in the encoder section above; this is called encoder-decoder attention. The output sequence also interacts with itself (self-attention). After these interactions, the decoder produces a new representation of the expected output for the first step. The first output token is appended to the <start> token, and the self-attention and encoder-decoder attention operations happen again inside the decoder to generate the next token. The intermediate step after generating the 4th token is shown in the second part of Fig 6. This way of generating the next token by analyzing the previously generated tokens together with the input representation is known as auto-regressive generation. The process goes on until the model generates an <end> token.
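
The generation loop can be sketched as follows. The function toy_decoder is a stand-in for a trained decoder: it simply reads the next word from a fixed French sentence, which is an assumption for illustration and not how a real model picks tokens.

```python
def generate(encoder_output, next_token_fn, max_steps=20):
    """Grow the output one token at a time until <end> appears."""
    output = ["<start>"]
    for _ in range(max_steps):
        token = next_token_fn(encoder_output, output)
        output.append(token)
        if token == "<end>":
            break
    return output


# Hypothetical target translation, split into word-level tokens.
FRENCH = ["Il", "y", "a", "une", "voiture", "rouge", ".", "<end>"]


def toy_decoder(encoder_output, output_so_far):
    """Pretend decoder: return the token that follows what has been generated so far."""
    # output_so_far starts with <start>, so its length minus 1 is the next index.
    return FRENCH[len(output_so_far) - 1]


result = generate("There is a red car.", toy_decoder)
print(result)
```

Every call to the decoder sees the whole output generated so far, which is what makes the process auto-regressive.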

GPT-like models

An LLM like ChatGPT is a fine-tuned version of a GPT model. The GPT model is based on the transformer architecture but uses only the decoder part of it. Since it doesn’t have an encoder, it is not trained with the sequence-to-sequence objective. Instead, the GPT model is trained on next-word prediction. As I described in the decoder section above, the decoder-only model starts with some text as context and tries to generate the correct next word by applying self-attention. It doesn’t apply encoder-decoder attention, as there is no encoder, and generation happens in an auto-regressive fashion until the model generates an <end> token.
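
One way to picture the next-word objective is as a list of (context, next word) pairs cut from a piece of text. The helper below is an illustrative assumption about how such pairs could be laid out, not a description of any particular training pipeline.

```python
def next_word_pairs(tokens):
    """Every prefix of the text becomes a context; the word that follows is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_word_pairs(["There", "is", "a", "red", "car"])
for context, target in pairs:
    print(context, "->", target)
```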

What was stopping ChatGPT-like models before the transformer?

The direct path that the attention mechanism provides for each token to interact with the other tokens helped the model understand the language and the task better. An RNN relied on all the intermediate steps for the first token to interact with the 100th token. Multi-head attention enabled the transformer to analyze the text from various contexts, which was lacking in previous models like RNNs.

These are improvements in the representational capacity of the model. However, the biggest improvement came from parallelizing the training of transformers. The direct path of interaction among tokens meant that training could be massively parallelized. Thanks to this, we were able to create and train huge models with millions and billions of parameters using vast networks of powerful GPUs. This ushered in the era of Large Language Models.