
Before the advent of Transformers, Recurrent Neural Networks (RNNs) were the go-to models for handling sequential data, where the order of elements plays a crucial role, as is the case in natural language processing, time series, and other domains. Inspired by traditional feedforward neural networks, RNNs introduced the ability to process data step by step, maintaining a form of memory across the sequence.
However, this sequential approach comes with significant limitations. First, RNNs process inputs one element at a time, preventing full utilization of modern GPUs, which are designed for parallel computation; this makes training relatively slow. Second, RNNs struggle to capture long-range dependencies within sequences. As information is passed from step to step, it tends to degrade or vanish, especially over long distances, leading to what is commonly known as the vanishing gradient problem.
It is in this context that Transformers revolutionized the field. While inspired by the encoder-decoder frameworks of RNNs, Transformers remove the notion of recurrence entirely, replacing it with a fully Attention-based mechanism. This allows the model to focus directly on the most relevant parts of a sequence, regardless of their position.
With this innovation, Transformers not only surpassed RNNs in performance on key NLP tasks such as machine translation, text summarization, and speech recognition, but also unlocked new applications across various domains, including computer vision and bioinformatics.
So why did Transformers replace RNNs? Because they directly address the two critical limitations of recurrent models: they process all tokens in parallel instead of one at a time, and their attention mechanism captures long-range dependencies without letting information degrade as it travels across the sequence.
In short, the rise of Transformers represents a natural and necessary evolution beyond the structural limitations of RNNs. Next, let's dive deeper into how this groundbreaking architecture works.

Originally designed for tasks like machine translation, Transformers are highly effective at converting input sequences into output sequences. The Transformer was the first model to rely entirely on the self-attention mechanism, without using RNNs or convolutional networks. However, it still maintains the classic encoder-decoder structure.
If we think of the Transformer as a black box for translation, it takes a sentence in one language (say, English) and translates it into another language, like Spanish.
Now, if we take a closer look at this black box, it consists of two main parts: an encoder and a decoder.
In reality, both the encoder and decoder are made up of multiple stacked layers, all with the same structure. Each encoder layer processes the input and passes it to the next one. On the decoder side, each layer takes input from both the last encoder layer and the previous decoder layer.
In the original Transformer model, there were 6 layers for the encoder and 6 layers for the decoder, but this number (N) can be adjusted as needed.
Now that we have a general idea of the Transformer architecture, let's dive deeper into how the encoders and decoders work.


Transformers are currently the basis for many advances in natural language processing (NLP). They have enabled models such as BERT, GPT, and T5 to achieve impressive results. At the heart of these models is the Transformer encoder, which transforms input data (such as sentences) into machine-readable information.
In this section, we will simply explain the key elements that make up this encoder: embeddings (which transform words into numbers), positional encoding (which indicates word order), self-attention (which allows the model to know which words are important), normalization and residual connections (which help stabilize learning), and finally the feed-forward layer (which processes the information before moving on to the next step).
Each of these elements is essential for the Transformer to perform so well on tasks such as translation, summarization, and text comprehension.
In a Transformer, the very first step is to convert words into numbers. Computers don't directly understand words or sentences like we do; they need numbers to work. This is where input embedding (or "vectorizing words") comes in. This step is only done in the very first encoder (often called the "bottom encoder").
How it works: each word (or token) is first mapped to an integer ID, and this ID indexes a row of an embedding matrix, a large lookup table that assigns every word in the vocabulary its own vector of numbers. Words with similar meanings end up with similar vectors.
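To make this concrete, here is a minimal sketch of the lookup using PyTorch; the tiny vocabulary, the word IDs, and the 512-dimension size are made-up illustrative choices, not something fixed by the article.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary: each word gets an integer ID.
vocab = {"the": 0, "cat": 1, "sleeps": 2, "on": 3, "mat": 4}
d_model = 512  # size of each word vector (512 in the original Transformer)

# The embedding layer is simply a lookup table of shape (vocab_size, d_model).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

# "the cat sleeps" -> IDs -> one 512-dimensional vector per word
token_ids = torch.tensor([[vocab["the"], vocab["cat"], vocab["sleeps"]]])
word_vectors = embedding(token_ids)
print(word_vectors.shape)  # torch.Size([1, 3, 512])
```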


Transformers, the artificial intelligence models revolutionizing language processing, have a unique characteristic: unlike predecessors such as recurrent neural networks (RNNs), they don't process the words of a sentence one by one. Instead, they analyze all the words simultaneously, which allows them to be much faster.
However, this approach poses a problem: how can the model understand the order of the words? After all, the meaning of a sentence crucially depends on the order in which the words are placed. For example, "The cat sleeps on the mat" has a very different meaning from "The mat sleeps on the cat!" Without information about the order, the Transformer could confuse these two sentences.
An Ingenious Trick
To solve this problem, researchers invented a technique called "positional encoding." The idea is simple: we add information to each word that indicates its position in the sentence. It's a bit like giving each word an address.
How it Works: Mathematical Waves to Encode Position
Instead of using simple numbers to encode the position, we use mathematical waves, specifically sine and cosine functions. Each position in the sentence receives a unique "mathematical signature" created from these waves.
Why use sines and cosines? These functions have interesting properties: their values always stay within a fixed range (between -1 and 1), each position receives a distinct combination of wave values, and their regular patterns make it easier for the model to reason about relative positions, even in sentences longer than those seen during training.
In Practice: The Magic Addition
Word Representation (Embeddings): Each word in the sentence is first transformed into a vector of numbers called an "embedding." This vector represents the meaning of the word.
Encoding the Position: For each position in the sentence, we calculate a vector of numbers using sine and cosine functions. This vector represents the position of the word.
Combining the Two: We add the word's embedding vector and the position vector.
Thanks to this addition, each word now has a representation that combines its meaning (from the embedding) and its position in the sentence (from the positional encoding).
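Here is a small sketch of how these sine and cosine "signatures" can be computed and added to the embeddings, following the formulas of the original Transformer paper; the use of PyTorch and the toy sizes are my own choices for illustration.

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encoding as described in the original Transformer paper."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )                                                                        # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # sine waves on even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # cosine waves on odd dimensions
    return pe

# The "magic addition": position vectors are simply added to the word embeddings.
seq_len, d_model = 6, 512
word_embeddings = torch.randn(seq_len, d_model)   # stand-in for real word embeddings
x = word_embeddings + positional_encoding(seq_len, d_model)
```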

Each encoder layer contains:
a) Multi-Head Self-Attention: When a machine wants to understand or translate a sentence, the encoder is the main tool that reads and analyzes the sentence. It's a bit like the model's brain, trying to grasp the overall meaning before taking action.
The encoder uses a key method: self-attention. Imagine that each word in the sentence looks around and asks itself: "What other words help me better understand the meaning of the sentence?"
For example, in the French sentence "Tu es gentil" ("You're nice"), the word "es" will quickly understand that it is linked to "Tu," because together they form the meaning of the action. This ability to make connections is crucial for properly grasping the context.
To make these connections, each word is transformed into three elements: a Query, a Key, and a Value.
👉 When a word poses its question (Query) and another word has the correct key (Key), then the answer is positive: we keep the Value of the found word to enrich the overall understanding.
And that's not all! The encoder doesn't do this just once: it repeats the process several times in parallel, slightly changing the way it asks questions or reads the keys.
It's as if several people were looking at the same sentence but from different perspectives: some will focus on the grammar, others on the tone or the cause-and-effect relationships.
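Below is a minimal sketch of a single self-attention head, i.e. the Query/Key/Value mechanics described above, written in PyTorch with toy dimensions; a real multi-head layer runs several such heads in parallel and concatenates their results.

```python
import math
import torch
import torch.nn as nn

def self_attention(x, w_q, w_k, w_v):
    """One attention head: every word compares its Query with every other word's Key,
    then gathers the corresponding Values. x has shape (seq_len, d_model)."""
    Q, K, V = w_q(x), w_k(x), w_v(x)               # questions, keys, values
    scores = Q @ K.T / math.sqrt(K.shape[-1])      # how well each question matches each key
    weights = torch.softmax(scores, dim=-1)        # attention weights (each row sums to 1)
    return weights @ V                             # blend the Values accordingly

d_model, d_k, seq_len = 512, 64, 3                 # toy sizes, e.g. "Tu es gentil"
x = torch.randn(seq_len, d_model)                  # embeddings + positional encoding
w_q, w_k, w_v = (nn.Linear(d_model, d_k) for _ in range(3))
context = self_attention(x, w_q, w_k, w_v)         # (3, 64) context-enriched vectors
# Multi-head attention repeats this with different projections ("different perspectives")
# and concatenates the heads' outputs.
```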
b) Normalization and Residual Connections: 
Once the encoder has completed the self-attention phase (where the words have looked at the other words to capture the context), we don't move directly to the next step. Before that, two important operations are performed:
✔ The residual connection (or "shortcut"): Imagine that the result of the self-attention is like an enriched version of what the model has understood about the relationships between words. This result is added to the original input, meaning the representations of the words before self-attention.
👉 Why? Because it helps preserve the initial information intact and adds on top of what self-attention has discovered. It's a bit like having a draft with ideas and highlighting the important parts without ever erasing your original text.
This is called a residual connection: we add the "residue" (the original input) to the layer's result.
✔ Normalization: Next, this sum passes through a normalization step (called Layer Normalization). This step serves to stabilize the values to prevent them from becoming too large or too small.
👉 Why? It helps the model to learn better during training and avoids certain problems like forgetting or losing information (the famous "vanishing gradient" problem).
⚠️ This duo (residual + normalization) is a ritual in the encoder of Transformers: after every sub-layer (whether self-attention or feed-forward neural network), it is repeated.
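As a rough sketch (PyTorch, toy shapes of my choosing), the "Add & Norm" duo looks like this:

```python
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)

def add_and_norm(original_input, sublayer_output):
    """Residual connection ("shortcut") followed by layer normalization."""
    return layer_norm(original_input + sublayer_output)

x = torch.randn(3, d_model)          # word representations entering the sub-layer
attended = torch.randn(3, d_model)   # stand-in for the self-attention output
x = add_and_norm(x, attended)        # original information preserved, values stabilized
```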
c) Feed-Forward Neural Network: 
After the information passes through self-attention and the residual connection, it enters a new phase called the feed-forward neural network. Here's how it works:
Information Transformation: The information first goes through a series of two simple transformations, called linear layers. These layers modify the information, kind of like changing its shape.
ReLU Activation: Between these two linear layers sits a small "filter" called ReLU. This filter helps make the information clearer and more useful for the next steps. It's like highlighting important details to make them stand out.
Return to the Input: The transformed information is added back to the original information (this is the residual connection). This helps keep the original idea intact while adding new insights, without losing any important elements.
Normalization: Finally, everything goes through a last step called normalization. This step adjusts the information to ensure everything is balanced before moving on to the next phase.
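A minimal sketch of this feed-forward block, with the residual connection and normalization around it; the 512/2048 sizes follow the original paper, while the framework choice and tensors are illustrative.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048   # inner layer is wider, as in the original paper

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first linear transformation
    nn.ReLU(),                  # the ReLU "filter" between the two transformations
    nn.Linear(d_ff, d_model),   # second linear transformation, back to model size
)
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(3, d_model)              # output of the previous Add & Norm step
x = layer_norm(x + feed_forward(x))      # residual connection, then normalization
```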
The output of the final layer of the encoder is a set of vectors. Each of these vectors represents the input sequence, but with a rich and deep contextual understanding. These vectors contain all the information processed by the encoder, capturing the relationships and meaning of the words in the sequence.
This encoded output is then passed to the decoder in the Transformer model. The decoder uses these vectors to generate predictions or output sequences, such as translating text or producing a response. The encoderâs job is to prepare this information so that the decoder can focus on the correct parts of the input when decoding.
Think of the encoder's function as building a tower made of layers. You can stack multiple encoder layers, and each layer adds more understanding by looking at the input from a slightly different perspective. Each layer uses its own attention mechanism, allowing it to learn different aspects of the input. As you stack layers, you increase the depth and richness of understanding. This process helps improve the Transformer's ability to predict and generate outputs more accurately, refining understanding layer by layer.
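If you want to play with this "tower of layers" without wiring every piece by hand, PyTorch provides ready-made encoder blocks that follow the same recipe (self-attention, Add & Norm, feed-forward, Add & Norm); the sketch below uses toy inputs of my own choosing.

```python
import torch
import torch.nn as nn

# One encoder layer = self-attention + Add & Norm + feed-forward + Add & Norm.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)   # N = 6 stacked layers

src = torch.randn(10, 1, 512)   # (sequence length, batch, d_model): embeddings + positions
memory = encoder(src)           # contextual vectors that will be handed to the decoder
print(memory.shape)             # torch.Size([10, 1, 512])
```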
Decoder Workflow

At the heart of the Transformer model lies the decoder, whose mission is to generate coherent text sequences. Like the encoder, it is made up of several sub-layers, including multi-head attention mechanisms, neural networks, and normalization techniques. These components work together to decode the information prepared by the encoder and produce intelligible text.
The decoder operates in an autoregressive manner, meaning it generates text step by step, based on the words it has already produced. It starts with a seed symbol, then uses the information from the encoder, enriched by attention, to select the next word. This process continues sequentially, with each new word influenced by the previous ones and by the context provided by the encoder. The decoder stops when it generates an end symbol, signaling the completion of the text sequence. In short, the decoder is an orchestrator that transforms encoded information into fluid and relevant text.
Unlike the encoder, which receives the input sentence directly, the decoder requires a starting point to initiate text generation. This starting point is the famous "Outputs," which feeds the "Output Embedding" at the very beginning of the decoding process. This first "Outputs" comes from the special "Start of Sequence" (SOS) token, a special token added to the model's vocabulary. This token is not part of the common words of the language, such as nouns, verbs, or adjectives. Its main role is to signal to the decoder that a new text sequence will be generated, thus acting as a start signal. This SOS token is therefore the very first "Outputs" sent to the "Output Embedding" layer to start the generation process.
The "Output Embedding" is a matrix, or a lookup table, that associates each token in the vocabulary, including the SOS token, with a numerical vector. This vector makes it possible to represent each token in a multidimensional space, where tokens with similar meanings are positioned close to each other. Concretely, when the SOS token is sent to the Output Embedding, the matrix extracts the vector corresponding to this token. This vector then becomes the initial input to the decoder. It is this vector that will be combined with the positional encoding and processed by the different layers of the decoder, such as masked attention, multi-head attention, etc., to progressively generate the text sequence.
Positional encoding is a crucial step in the sequence processing process in a Transformer model, whether in the encoder or the decoder. This step comes after token embedding, and its role is to add information about the position of words in the sequence.
Transformer models do not process sequences sequentially, like RNNs or LSTMs, which process each word one by one in a specific order. Instead, the words are processed in parallel. However, this poses a problem: without an indication of their position, the model would not know whether the word "cat" comes before or after "eat." This is where positional embeddings come in.
Positional embeddings add a unique vector to each word in the sequence to indicate its specific position. For example, the first word in the sequence could have a different positional encoding than the second word, and so on. These vectors are added to the word embeddings, allowing the model to maintain an understanding of word order.
It works like this: each token in the input sequence is associated with a vector representing its position, and then this positional encoding is added to the vector of the corresponding word. This allows the model to "know" the order in which the words appear in the sentence, and therefore better understand the context.
This process is identical in both the encoder and the decoder.
👉 Important context before we begin:
We want to train a model to translate English sentences into French. During training, the complete French sentence is already available (because it's part of the data the model needs to learn from). But to prevent the decoder from "cheating" by looking at future words during training, we use a mask in the self-attention process. This mask is like a filter that prevents the model from seeing subsequent words in the sentence while it learns to make predictions.
a) Masked Self-Attention: 
👉 Role during training:
When the decoder learns to generate the French sentence, it must only use the words already generated (on the left) and not see future words (on the right).
➡️ Concrete example: Let's imagine we're learning to generate the French sentence: "Le chat était assis sur le tapis" ("The cat was sitting on the mat").
When the model has to learn to predict the word "assis", it can only use:
"Le", "chat", "était", and it cannot see subsequent words like "sur" or "le tapis", even if they are available in the example.
👉 Why? Because in real life, when generating text, we never have future words in advance.
💡 How does it work technically?
We apply a triangular mask (called a causal mask) to the attention matrix. This mask places "-∞" on the attention scores of future words. Result: the decoder learns to look only at words already known to predict the next one.
🎨 Simple analogy: It's like writing a sentence without cheating: when you write the word "assis," you only look at what you just wrote before, not what comes after.
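Here is a small sketch of that causal mask in PyTorch (the scores and sentence length are toy stand-ins), showing how the "-∞" trick removes all attention to future words:

```python
import torch

seq_len = 6                                      # e.g. the six words "Le chat était assis sur le"
scores = torch.randn(seq_len, seq_len)           # raw attention scores (one row per query word)

# Causal (triangular) mask: position i may only look at positions 0..i.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))   # future words get -infinity

weights = torch.softmax(scores, dim=-1)          # softmax turns -inf into exactly 0 attention
print(weights[3])                                # the 4th word puts zero weight on words 5 and 6
```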
b) Encoder-Decoder Multi-Head Attention (or Cross-Attention)
👉 What's changing here:
Once the masked self-attention is complete, the decoder will look for clues in the encoded English sentence.
➡️ Concrete example: Still for the word "assis" to be generated, the decoder will consult the encoded English sentence "The cat sat on the mat".
It will ask itself:
"Where is the key information in the English sentence?"
Here, it will notice that the word "sat" is important because it is the verb to be translated as "assis".
🧠 Where does this information come from?
The encoder's outputs, which are available for ALL English words from the start (because the English sentence is encoded in its entirety before starting the decoder).
⚠️ But be careful:
The decoder still looks ONLY at the words already generated on the French side, thanks to the masked self-attention seen just above. But for those words, it can use ALL the English words via the encoder to better understand them.
🎨 Simple analogy: It's as if you were translating sentence by sentence:
You write "assis" on the French side, and each time you look at the complete English sentence to check how to translate it correctly.
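A rough sketch of this cross-attention step: the same Query/Key/Value machinery as before, but with the Queries coming from the French side and the Keys/Values from the encoder's outputs. The sizes and tensors are illustrative placeholders.

```python
import math
import torch
import torch.nn as nn

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    """Queries come from the decoder (French side); Keys/Values from the encoded English sentence."""
    Q = w_q(decoder_states)                               # what the decoder is looking for
    K, V = w_k(encoder_outputs), w_v(encoder_outputs)     # clues offered by the source sentence
    scores = Q @ K.T / math.sqrt(K.shape[-1])
    return torch.softmax(scores, dim=-1) @ V              # source-informed representations

d_model = 512
decoder_states = torch.randn(4, d_model)    # French words generated so far, e.g. "Le chat était assis"
encoder_outputs = torch.randn(6, d_model)   # ALL the words of "The cat sat on the mat"
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
context = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)   # shape (4, 512)
```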
c) Feed-Forward Neural Network
⚙️ Last step of a decoder layer:
Here, each word processed by attention is then refined. The feed-forward network will transform the word representation by applying two linear layers and a non-linear activation.
➡️ Example: The internal representation of the word "assis" is improved to be ready to move on to the next step (e.g., predicting the next word "sur").
🎨 Simple analogy: It's as if, after choosing the word "assis," you further refine your idea by adding details and nuances before writing the next word.
d) Linear Classifier and Softmax for Generating Output Probabilities
After passing through all the decoder layers (masked self-attention, encoder-decoder attention, feed-forward network), we obtain a vector representation for the word the model is about to generate. But at this stage, it's still just an abstract vector, with no word assigned yet.
🎯 What happens next?
Linear Classifier: This vector is fed into a linear classifier (a linear layer). The classifier transforms the vector into a list of scores, one score for each word in the vocabulary. Example: If the model knows 10,000 words, it will produce 10,000 scores (one for "sat," another for "jumped," etc.).
Softmax: The softmax function takes these raw scores and turns them into probabilities. Each word gets a probability representing how likely it is to be the correct word in this context. Example: "sat": 65%, "jumped": 20%, "ran": 10%; the remaining words will have smaller probabilities.
Selecting the final word: The model picks the word with the highest probability (e.g., "sat" at 65%). This word becomes the newly generated word, and will be used as input for the next step to predict the following word. Simple analogy: It's like playing a guessing game with multiple choices: you look at all the possible options (all 10,000 words), you think: "Hmm... 65% chance it's 'sat,' 20% for 'jumped'...", then you choose the most likely answer.
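As a final sketch, here is what the linear classifier, the softmax, and the greedy word choice could look like in PyTorch; the 10,000-word vocabulary and the tensors are toy stand-ins rather than a real trained model.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
to_vocab = nn.Linear(d_model, vocab_size)    # the linear classifier

decoder_output = torch.randn(d_model)        # abstract vector produced by the last decoder layer
scores = to_vocab(decoder_output)            # one raw score per word in the vocabulary
probs = torch.softmax(scores, dim=-1)        # raw scores -> probabilities that sum to 1

next_token_id = torch.argmax(probs).item()   # greedy choice: pick the most likely word
print(next_token_id, probs[next_token_id].item())
# The chosen word is then fed back into the decoder as input for predicting the next word,
# until an end-of-sequence token is produced.
```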

The Transformer architecture has fundamentally reshaped the landscape of natural language processing (NLP) and sequence modeling.
Unlike Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), which process sequences token by token in order, Transformers introduced parallel processing thanks to their attention mechanisms. This shift allowed models to capture long-range dependencies much more efficiently, without being limited by sequential bottlenecks.
Full attention span: Transformers can directly "attend" to every word in a sentence at once, while RNNs struggle with remembering distant words due to vanishing gradients.
Faster training: By processing all tokens in parallel during training (in the encoder), Transformers drastically reduce computation time compared to the step-by-step nature of RNNs.
Scalability: Transformers easily scale to large datasets and massive model sizes (e.g., GPT, BERT), unlocking unprecedented performance on tasks like translation, summarization, and text generation.
By combining self-attention, masked attention, cross-attention, and feed-forward layers, the Transformer builds highly contextualized word representations, making it a universal tool not just for NLP, but also for vision, audio, and multi-modal tasks.
Today, almost every state-of-the-art model, from ChatGPT to BERT, relies on the Transformer backbone, making it one of the most influential breakthroughs in AI history.