The Ultimate Transformer And The Attention You Need.

I will try to explain this topic as simply as possible, including the mathematical part.

The Transformer is one of the most popular state-of-the-art deep learning architectures, used mostly for NLP tasks. Since its introduction, it has replaced RNNs and LSTMs for many tasks. Several newer NLP models such as BERT, GPT and T5 are based on the Transformer architecture.

RNNs and LSTMs were widely used for sequential tasks such as word prediction, machine translation and text generation. However, one of the major challenges they face is capturing long-term dependencies.

To overcome this, a new architecture called the Transformer was introduced in the paper “Attention Is All You Need”. The Transformer is currently the state-of-the-art model for several NLP tasks. It is based entirely on an attention mechanism and completely gets rid of recurrence. The Transformer uses a special type of attention called self-attention. We will discuss this shortly, but let's first understand how language translation actually works with the Transformer.

The Transformer consists of an encoder-decoder architecture. We feed the input sentence to the encoder. The encoder learns the representation of the input sentence and sends that representation to the decoder. The decoder receives the representation learned by the encoder and produces the output.

Suppose we need to translate an English sentence to French. Our input is an English sentence, which we feed to the encoder; the encoder learns the representation of the English sentence and forwards that representation to the decoder, and the decoder produces the French output accordingly.

So what actually happens?

The Transformer consists of a stack of N encoders. The output of one encoder is sent as input to the encoder above it, and the final encoder returns the representation of the given source sentence as output.

In the paper “Attention Is All You Need”, the authors used 6 encoders. However, we can try any number of encoders. For simplicity, we are going to use only 2 encoders.

So the question arises again: how does it actually work?

From the figure above you can easily see that all encoder blocks are identical. You can also observe that each encoder consists of two sublayers: a multi-head self-attention layer and a feedforward network.

Before learning about the sublayers, let's first learn what the self-attention mechanism is.

Let's consider an example for better understanding:

“A dog ate the food because it was hungry”

In the above sentence, the pronoun it could refer to dog or to food. By reading the sentence we can easily tell that it refers to dog, not food. This is where the self-attention mechanism helps us. For the sentence “A dog ate the food because it was hungry”, our model first computes the representation of the word A, then the representation of the word dog, then the next word, and so on until the sentence ends. While computing the representation of each word, the model relates that word to all the other words in the sentence to understand it better. For instance, while computing the representation of the word it, the model relates it to all the other words, and in doing so it learns that it is more related to dog than to food; in the attention visualisation, the line connecting it to dog is thicker than the line connecting it to food.

Now let's understand how it actually does this. Suppose our input is “I am good”. First we get the embedding of each word in the sentence (an embedding is just a vector representation of a word, and its values are learned during training). Let x1 be the embedding of the word I, x2 the embedding of am and x3 the embedding of good.

The dimension of our input X is [sentence length × embedding dimension]. The number of words in our sentence is 3, and let's assume the embedding dimension is 512, so our input matrix has dimension [3 × 512]. Now, from the input matrix X we create three matrices: a query matrix Q, a key matrix K and a value matrix V.
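To make the shapes concrete, here is a minimal NumPy sketch (the vocabulary, the random embedding table and the variable names are illustrative assumptions, not values from the paper):

```python
import numpy as np

np.random.seed(0)
vocab = {"I": 0, "am": 1, "good": 2}
d_model = 512                                            # embedding dimension
embedding_table = np.random.randn(len(vocab), d_model)   # learned during training in a real model

sentence = ["I", "am", "good"]
X = np.stack([embedding_table[vocab[w]] for w in sentence])
print(X.shape)   # (3, 512): [sentence length, embedding dimension]
```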

Okay, but how do we create these three matrices?

To create these three matrices, we introduce three new weight matrices Wq, Wk and Wv. We create the query Q, key K and value V by multiplying the weight matrices (Wq, Wk, Wv) with our input X. The weight matrices are randomly initialised, and their optimal values are learned during training.

The image makes it clear that q1, k1, v1 belong to the word I, q2, k2, v2 to am and q3, k3, v3 to good. Note that Wq, Wk and Wv project to a dimension of 64. Thus our Q, K and V matrices have dimension [sentence length × dimension], which is [3 × 64].
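Continuing the sketch above, the projection could look like this (Wq, Wk and Wv are random stand-ins for the learned weights):

```python
d_k = 64                             # projection dimension used in the paper
Wq = np.random.randn(d_model, d_k)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)

Q = X @ Wq   # rows q1, q2, q3 for "I", "am", "good" -> shape (3, 64)
K = X @ Wk   # rows k1, k2, k3 -> shape (3, 64)
V = X @ Wv   # rows v1, v2, v3 -> shape (3, 64)
```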

Now the question is: why are we even computing this?

We have seen how the Q, K and V matrices are computed and how they are extracted from the input matrix X. Now let's see how they are used in the self-attention mechanism. We know that in order to compute the representation of a word, the self-attention mechanism relates that word to all the words in the given sentence, and that understanding how a word relates to all the other words gives us a better representation. The self-attention mechanism includes four steps.

The first step of the self-attention mechanism is to compute the dot product between the Q and K matrices (Q · Kᵀ).

Let's observe the first row of the Q · Kᵀ matrix: it shows how similar the query vector q1 (for I) is to the key vectors k1, k2 and k3.

Similarly, observe the second row followed by the third row. The third row states that the word good is most related to itself, followed by I and am. Thus we can say that computing the dot product gives us a similarity matrix of each word with all the words in the sentence.

To obtain stable gradients we divide the Q · Kᵀ matrix by the square root of the dimension of the key vectors. In our case the dimension is 64, so we divide by its square root, 8.

By looking at the preceding similarity scores, we can see that they are in unnormalized form, so we normalise them using the softmax function. Applying softmax brings each score into the range 0 to 1, and the scores in each row sum to 1.

Now we compute the attention matrix Z. The attention matrix contains the attention output for each word in the sentence. We compute it by multiplying the score matrix with the V matrix.

The attention matrix Z is computed as the sum of the value vectors weighted by the scores. Let's understand it row by row. First, let's see how Z1, the representation of the word I, is computed.

Thus Z1 contains 90% of the value vector V1 (I), 7% of V2 (am) and 3% of V3 (good).
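Putting the four steps together, the whole self-attention computation, continuing the sketch above, looks roughly like this:

```python
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T                 # step 1: similarity of every word with every word, shape (3, 3)
scores = scores / np.sqrt(d_k)   # step 2: scale by sqrt(64) = 8 for stable gradients
weights = softmax(scores)        # step 3: normalise each row so it sums to 1
Z = weights @ V                  # step 4: weighted sum of the value vectors, shape (3, 64)
```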

But how is this useful?

To answer that question, let's take a detour back to our previous example:

“A dog ate the food because it was hungry”

If we compute Z for the above sentence, we will see that the word it shows more similarity towards dog than towards food.

The self-attention mechanism is shown graphically like this:

The self-attention mechanism is also known as scaled dot-product attention.

Why?

By now you probably have your answer: we compute the dot product between the queries and keys, and we scale it by the square root of the key dimension.

Instead of using a single attention head, we can use multiple attention heads, that is, we can compute multiple attention matrices, which gives more accurate results. In multi-head attention we first compute the attention matrix for each head, then concatenate them and multiply by a new weight matrix.
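A minimal sketch of multi-head attention, reusing the names from the earlier sketch (8 heads as in the paper; Wo is the new weight matrix that mixes the concatenated heads):

```python
num_heads = 8
heads = []
for _ in range(num_heads):
    # each head has its own (randomly initialised, learned in practice) projections
    Wq_h = np.random.randn(d_model, d_k)
    Wk_h = np.random.randn(d_model, d_k)
    Wv_h = np.random.randn(d_model, d_k)
    Q_h, K_h, V_h = X @ Wq_h, X @ Wk_h, X @ Wv_h
    heads.append(softmax(Q_h @ K_h.T / np.sqrt(d_k)) @ V_h)   # attention matrix per head

Wo = np.random.randn(num_heads * d_k, d_model)
multi_head_output = np.concatenate(heads, axis=-1) @ Wo        # shape (3, 512)
```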

Positional Encoding

If we feed X to the Transformer directly, it cannot understand the word order and hence cannot understand the sentence. So, instead of feeding the input matrix directly to the Transformer, we add some information indicating the word order. This technique is called positional encoding.
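The original paper uses fixed sinusoidal positional encodings; a minimal sketch of that variant, reusing X and d_model from earlier, could look like this:

```python
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                           # (seq_len, 1)
    i = np.arange(d_model)[None, :]                             # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                       # odd dimensions: cosine
    return pe

X_with_position = X + positional_encoding(X.shape[0], d_model)   # added to the embeddings
```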

Feedforward Network

The feedforward network consists of two dense layers with a ReLU activation. The parameters of the feedforward network are the same across the different positions of the sentence, but different across the encoder blocks.
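A minimal sketch of the feedforward sublayer (d_ff = 2048 is the inner dimension used in the paper; the weights here are random stand-ins):

```python
d_ff = 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def feed_forward(x):
    # first dense layer with ReLU, then a second dense layer back to d_model
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```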

Add and norm component

One more important component in the encoder is the add and norm component. It connects the input and output of a sublayer. It is basically a residual connection followed by layer normalization. Layer normalization promotes faster training by preventing the values in each layer from changing heavily.
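A minimal sketch of the add and norm component:

```python
def layer_norm(x, eps=1e-6):
    # normalise each position's features to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # residual connection followed by layer normalization
    return layer_norm(sublayer_input + sublayer_output)
```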

A complete representation of the encoder:

Now we take the encoder representation and feed it to the decoder. The decoder takes the encoder output as input and generates the target sentence. Like the encoder, we can also stack up N decoders. Note that the encoder's representation of the input sentence (the encoder output) is sent to all the decoders. Thus a decoder receives two inputs: one from the previous decoder, and the other is the encoder's representation.

Okay, but how exactly does the decoder generate the target sentence?

Let's explore it in more detail. At time step t=1, the input of the decoder is <sos>, which indicates the start of the sentence. The decoder takes <sos> as input and generates the first word. At time step t=2, along with the current input, the decoder takes the newly generated word from the previous time step, t−1, and tries to generate the next word in the sentence. Similarly, at every time step the decoder appends the newly generated word to its input and predicts the next word. Once the <eos> token, which indicates the end of the sentence, is generated, the decoder has finished generating the target sentence.
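As a rough sketch of this generation loop (decoder_step is a hypothetical helper standing in for a full forward pass through the decoder stack and the final linear/softmax layer):

```python
def generate(encoder_output, decoder_step, sos_token="<sos>", eos_token="<eos>", max_len=50):
    output = [sos_token]
    for _ in range(max_len):
        next_word = decoder_step(encoder_output, output)   # predict the next word
        if next_word == eos_token:                          # stop once <eos> is generated
            break
        output.append(next_word)                            # append and feed back in
    return output[1:]                                       # drop the <sos> token
```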

In the encoder section we learned that we convert the input into an embedding matrix, add the positional encoding to it and then feed it to the encoder. Similarly here, instead of feeding the input directly to the decoder, we convert it into an embedding matrix and add the positional encoding.

Okay, but the ultimate question is: how exactly does the decoder work?

Let's explore what's going on inside a decoder block.

Similar to the encoder block, the decoder also has sublayers inside it. Now that we have a basic idea of the decoder, let's check each component one by one.

In our English-to-French dataset, let's say the dataset looks like this:

During training, since we have the correct target sentence, we can just feed the whole target sentence as input to the decoder, but with a small modification. We learned that the decoder takes <sos> as the first token and appends the next predicted word to its input at every time step while predicting the target sentence, until the <eos> token is reached. We also learned that instead of feeding the input to the decoder directly, we convert it into an embedding matrix and then add the positional encoding.

Let's suppose the following matrix X is obtained by adding the embedding matrix and the positional encoding matrix.

Now we feed our X to the decoder's first sublayer, i.e. masked multi-head attention. This works exactly the same as the multi-head attention layer but with a small difference.

To compute self-attention we create the Q, K and V matrices. Since we are computing multi-head attention, we create N sets of Q, K, V matrices. The input sentence to our decoder is <sos> je vais bien, and we know that the self-attention mechanism relates each word to every other word to get a better understanding. But there is a catch here. At test time the decoder will only have the words generated up to the previous step as input. For example, at time step t=2 the decoder will only have the input words [<sos>, je] and nothing else. So we have to train our model in the same fashion. Thus our attention mechanism should relate je only to the words up to je, and not to the words on its right. To do this, we mask all the words on the right that have not yet been predicted by the model.

Masking words like this helps the attention mechanism attend only to those words that will actually be available to the model during testing.
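A minimal sketch of the masking, reusing the softmax helper from earlier (the scores here are random stand-ins for Q · Kᵀ / √d_k):

```python
seq_len = 4                                     # <sos> je vais bien
scores = np.random.randn(seq_len, seq_len)      # stand-in for Q @ K.T / sqrt(d_k)

# positions to the right of the current word get a score of -infinity,
# so after the softmax their attention weight becomes exactly 0
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf
masked_weights = softmax(scores)                # upper-triangular entries are now 0
```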

Let's represent the encoder representation by R and the attention matrix obtained from the masked multi-head attention sublayer by M. Since this layer is where the encoder and the decoder interact, it is also called the encoder-decoder attention layer.

We create the query matrix Q using the attention matrix M obtained from the previous sublayer, and we create the key and value matrices using the encoder representation R.
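A minimal sketch of this encoder-decoder attention, with M and R as random stand-ins:

```python
M = np.random.randn(4, d_model)    # decoder side: 4 target tokens (<sos> je vais bien)
R = np.random.randn(3, d_model)    # encoder side: 3 source tokens (e.g. "I am good")

Q_dec = M @ np.random.randn(d_model, d_k)   # queries come from the decoder
K_enc = R @ np.random.randn(d_model, d_k)   # keys come from the encoder representation
V_enc = R @ np.random.randn(d_model, d_k)   # values come from the encoder representation

cross_attention = softmax(Q_dec @ K_enc.T / np.sqrt(d_k)) @ V_enc   # shape (4, 64)
```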

The feedforward layer and the add and norm component work exactly the same as in the encoder.

Once the decoder learns the representation of the target sentence, we feed the output of the topmost decoder to a linear layer followed by a softmax layer. The linear layer generates logits whose size is equal to our vocabulary size, and the softmax turns them into probabilities over the vocabulary.
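A minimal sketch of this final projection (the vocabulary size and the decoder output here are illustrative stand-ins):

```python
vocab_size = 10000                              # illustrative vocabulary size
W_out = np.random.randn(d_model, vocab_size)
decoder_output = np.random.randn(4, d_model)    # stand-in for the topmost decoder's output

logits = decoder_output @ W_out                 # shape (4, vocab_size)
probs = softmax(logits)                         # probability of each word in the vocabulary
predicted_ids = probs.argmax(axis=-1)           # most probable word at each position
```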

Thank you!
