Attention Model Overview

Toshiba Kamruzzaman
4 min read · Sep 10, 2020


As the name suggests, an attention model gives importance, or attention, to some parts of the data instead of treating the whole input equally.

Suppose you have been given a complex task to complete. What will you do? You will divide the task into several small parts, label them as most important, less important, and not important, and solve them in order of importance.

An attention model works on the same basis. At each moment it tries to find the most important features of the input and ignores the other features for that moment.
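
To make the idea concrete, here is a minimal sketch in NumPy, with made-up numbers rather than anything from a real model: score each feature for its importance, turn the scores into weights with a softmax, and build a summary that emphasises the important features while ignoring the rest.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Three made-up feature vectors and a "query" describing what matters right now.
features = np.array([[0.9, 0.1],   # e.g. the toddler
                     [0.2, 0.8],   # e.g. the background
                     [0.7, 0.3]])  # e.g. the fruit
query = np.array([1.0, 0.0])

scores = features @ query      # how relevant each feature is to the query
weights = softmax(scores)      # attention weights, non-negative and summing to 1
summary = weights @ features   # weighted combination: important features dominate

print(weights)  # the most relevant feature gets the largest weight
```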

Let me give you an example of how an attention model works for image captioning.

Look at the following figure (fig. 1).

Fig. 1: A toddler in a yellow top and hat, holding fruit

In this figure, there are six important points to notice:

  1. Toddler
  2. Yellow (It is the color of the top & hat)
  3. Top
  4. Hat
  5. Fruit
  6. Hand (it is holding the fruit)

So, if we use an attention model, the input image is divided into a number of small regions (groups of pixels). The model first activates only those portions of the image where it finds an object; in this case it finds four objects at first (toddler, fruit, basket, hat). As the toddler has the highest probability of being detected as the main object, the model converts that region's pixel information into the word "toddler" and does not care about the information from the other objects at that time.

Then it picks up the feature information for the color and converts that into the next piece of context. This time, it does not focus on any other information such as the toddler, the hand, or the fruit.

What the attention component of the network does, for each word in the output sentence, is map the important and relevant pixels from the input image and assign higher weights to those pixels, improving the accuracy of the output prediction.
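
As a hedged illustration (not the exact network from any particular paper), the sketch below shows what this looks like for one output word: every image region gets a relevance score against the current decoder state, the scores are softmaxed into weights, and a weighted sum of the region features becomes the context used to predict that word. All shapes and matrices are illustrative placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_regions, feat_dim, hidden_dim = 49, 512, 256     # e.g. a 7x7 grid of CNN features

regions = rng.normal(size=(num_regions, feat_dim))    # one feature vector per image region
decoder_state = rng.normal(size=hidden_dim)           # decoder state while generating a word
W = rng.normal(size=(feat_dim, hidden_dim)) * 0.01    # stand-in for a learned projection

scores = regions @ W @ decoder_state   # one relevance score per region
weights = softmax(scores)              # higher weight = region the model "attends" to
context = weights @ regions            # context vector fed to the word predictor

print(weights.argmax())                # index of the most attended region for this word
```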

There are basically two types of attention:

  • Bahdanau Attention
  • Luong Attention

Bahdanau Attention

It is often referred to as additive attention and was proposed by Dzmitry Bahdanau. In this mechanism, an alignment score is computed between the previous decoder hidden state and every encoder hidden state using a small feed-forward (additive) scoring function. The entire step-by-step process of applying attention from Bahdanau’s paper is as follows (a minimal code sketch of one decoder step appears after the list):

  1. Producing the Encoder Hidden States: the encoder produces a hidden state for each element in the input sequence
  2. Calculating Alignment Scores: alignment scores are calculated between the previous decoder hidden state and each of the encoder hidden states (note: the last encoder hidden state can be used as the first decoder hidden state)
  3. Softmaxing the Alignment Scores: the alignment scores for the encoder hidden states are combined into a single vector and then softmaxed
  4. Calculating the Context Vector: the encoder hidden states and their respective attention weights are multiplied and summed to form the context vector
  5. Decoding the Output: the context vector is concatenated with the previous decoder output and fed into the decoder RNN for that time step, along with the previous decoder hidden state, to produce a new output
  6. Repetition of the Process: steps 2–5 repeat for each decoder time step until an end-of-sequence (<EOS>) token is produced or the output exceeds the specified maximum length
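
Here is a minimal NumPy sketch of steps 2–5 for a single decoder step, assuming a small additive scoring function. All weight matrices are random placeholders for what would normally be learned parameters, and the decoder RNN cell itself is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, attn_dim = 6, 8, 8, 10

enc_states = rng.normal(size=(src_len, enc_dim))   # step 1: one hidden state per input element
prev_dec_state = enc_states[-1].copy()             # last encoder state seeds the decoder

W_enc = rng.normal(size=(attn_dim, enc_dim)) * 0.1
W_dec = rng.normal(size=(attn_dim, dec_dim)) * 0.1
v = rng.normal(size=attn_dim) * 0.1

# Step 2: additive alignment scores between the previous decoder state and
# every encoder hidden state: v . tanh(W_dec s_prev + W_enc h_i)
scores = np.tanh(enc_states @ W_enc.T + prev_dec_state @ W_dec.T) @ v

# Step 3: softmax the scores into attention weights
weights = softmax(scores)

# Step 4: context vector = weighted sum of the encoder hidden states
context = weights @ enc_states

# Step 5: the context vector would now be concatenated with the previous decoder
# output and fed into the decoder RNN (omitted here) to produce the next output.
print(weights.round(3), context.shape)
```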

Luong Attention

It was proposed by Thang Luong and is often referred to as multiplicative attention. In its most commonly used (global) form, attention is given to all of the source positions; Luong’s paper also describes a local variant that attends to only a window of positions at a time.

The two main differences between Luong Attention and Bahdanau Attention are:

  1. The way that the alignment score is calculated
  2. The position at which the Attention mechanism is being introduced in the decoder

There are three alignment scoring functions proposed in Luong’s paper (dot, general, and concat), compared to Bahdanau’s single one. The general structure of the attention decoder is also different for Luong Attention, as the context vector is only utilised after the RNN has produced its output for that time step. We will explore these differences in greater detail as we go through the Luong Attention process (a code sketch of one decoder step, including the three scoring functions, follows the list):

  1. Producing the Encoder Hidden States: the encoder produces a hidden state for each element in the input sequence
  2. Decoder RNN: the previous decoder hidden state and the previous decoder output are passed through the decoder RNN to generate a new hidden state for that time step
  3. Calculating Alignment Scores: alignment scores are calculated between the new decoder hidden state and each of the encoder hidden states
  4. Softmaxing the Alignment Scores: the alignment scores for the encoder hidden states are combined into a single vector and then softmaxed
  5. Calculating the Context Vector: the encoder hidden states and their respective attention weights are multiplied and summed to form the context vector
  6. Producing the Final Output: the context vector is concatenated with the decoder hidden state generated in step 2 and passed through a fully connected layer to produce a new output
  7. Repetition of the Process: steps 2–6 repeat for each decoder time step until an end-of-sequence (<EOS>) token is produced or the output exceeds the specified maximum length
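
Below is a hedged NumPy sketch of one Luong-style decoder step (steps 2–6 above), using the dot scoring function and also showing the general and concat alternatives. The RNN cell and output layer are simplified stand-ins, and every matrix is a random placeholder for a learned parameter.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_len, dim, vocab = 5, 8, 20

enc_states = rng.normal(size=(src_len, dim))       # step 1: encoder hidden states h_s
prev_dec_state = rng.normal(size=dim)
prev_output_emb = rng.normal(size=dim)             # embedding of the previous output token

W_rnn = rng.normal(size=(dim, 2 * dim)) * 0.1      # stand-in for the decoder RNN cell
W_out = rng.normal(size=(vocab, 2 * dim)) * 0.1    # final fully connected layer
W_general = rng.normal(size=(dim, dim)) * 0.1      # for the "general" score
W_concat = rng.normal(size=(dim, 2 * dim)) * 0.1   # for the "concat" score
v_concat = rng.normal(size=dim) * 0.1

# Step 2: run the decoder RNN first to get the new hidden state h_t
dec_state = np.tanh(W_rnn @ np.concatenate([prev_output_emb, prev_dec_state]))

# Step 3: alignment scores between the NEW decoder state and the encoder states
scores_dot = enc_states @ dec_state                           # dot:     h_t . h_s
scores_general = enc_states @ W_general @ dec_state           # general: h_t . (W_a h_s)
pairs = np.concatenate(
    [np.tile(dec_state, (src_len, 1)), enc_states], axis=1)   # [h_t; h_s] for every s
scores_concat = np.tanh(pairs @ W_concat.T) @ v_concat        # concat:  v_a . tanh(W_a [h_t; h_s])

# Step 4: softmax one set of scores into attention weights
weights = softmax(scores_dot)

# Step 5: context vector = weighted sum of the encoder hidden states
context = weights @ enc_states

# Step 6: concatenate the context with the new decoder state and pass the result
# through a fully connected layer to produce this time step's output distribution
probs = softmax(W_out @ np.concatenate([context, dec_state]))
print(probs.argmax())
```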

As the steps above show, the order of operations in Luong Attention is different from Bahdanau Attention: the decoder RNN produces its new hidden state before attention is applied, rather than after.
