Implementation of Byte-Pair encoding: https://huggingface.co/transformers/tokeni
* A (Tobias Tesch): http://jalammar.github.io/illustrated-transformer/ explains it somewhat better.
"The multiheaded beast" refering to the multi-head attention and its advantages, see section "The Beast With Many Heads"
* Comment (JJ): Multi-head attention can be seen in analogy to the multiple feature maps / learned kernel filters in standard convolutional networks (CNNs). Each single map / kernel per layer in a CNN is a particular local feature type learned from data, while each single attention head per layer is a particular global feature type that spans the whole sequence / previous activation layer (in contrast to local kernels in CNNs, which only look at a small portion of the previous activation layer).
* Q (Yueling): In the Transformer network (Fig. 1, right), is another (shifted) input needed?
The encoder is used to encode the input sequence "Je suis étudiant"; the decoder is used to predict the next token based on the history of target tokens (it can't see future tokens, only past ones) and the encoded input sequence. A sketch of what the "shifted" input means is given below.
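A minimal sketch of the "shifted right" decoder input during training (teacher forcing); the token ids and the BOS convention here are made up purely for illustration:

```python
# Teacher forcing: the decoder input is the target sequence shifted right by one,
# so when predicting the token at position t the decoder only sees targets < t
# (a causal mask enforces the same constraint inside the attention layers).
BOS = 1                                  # hypothetical begin-of-sequence id
target        = [57, 892, 4, 2]          # e.g. "I am a student </s>" as token ids
decoder_input = [BOS] + target[:-1]      # "<s> I am a student"
# The model is trained so that the output at position t predicts target[t].
```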
* Q (Yueling): How to ensure that the output is physically reliable?
* A (Mehdi): Yueling was thinking about how to apply transformers to time series and how the order (time) is taken into account. Transformers do not know about order; in the general case they just process sets. Order can be added as a "feature" (positional encoding), which is what is usually done. In the original transformer paper ("Attention Is All You Need"), they use positional encodings, which are fixed features representing order that are added to the word embeddings (added via the element-wise + operation on vectors of identical size, not concatenated). Other papers, such as the GPT papers, learn the positional encodings directly from data in the same way you would learn word embeddings: each "position" in the sequence has an embedding (a vector) associated with it, and those vectors are learned.
* Comment (JJ): The original paper (Attention Is All You Need, https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf) experiments with both fixed and learnt positional encodings to impose order on the word sequence. As the fixed encoding turned out to work about as well as the learned one, people tend to use fixed positional encodings, since they are computationally less intensive. A sketch of the fixed encoding is given below.
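A minimal NumPy sketch of the fixed sinusoidal encoding from the paper, added element-wise (not concatenated) to the embeddings; the sizes (8 tokens, 512 dims) just follow the running example:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)   # even dimensions
    pe[:, 1::2] = np.cos(positions / div)   # odd dimensions
    return pe

embeddings = np.random.randn(8, 512)    # stand-in for learned word embeddings
x = embeddings + sinusoidal_positional_encoding(8, 512)  # same shape, so "+" works
```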
* Another paper recommendation (Stephan Bialonski): people already try to do this, for instance "Deep Transformer Models for Time Series Forecasting".
* A (Tobias): an intuitive explanation:
* assume an input sequence of 8 words, each embedded as a 512-dim vector.
* in the attention layer, each vector is multiplied by a query, a key, and a value matrix (projecting to a lower dimension, e.g. 64), resulting in a query, key, and value vector for each input token; the weights of these matrices are learned
* to get the attention-layer output for a word, the scalar product of that word's query with the keys of all words (including itself) is computed, yielding a scalar weight for each word. Those weights are passed through a softmax so that they sum to one. Then the sum of the value vectors of all words, weighted by those weights, is computed, yielding an e.g. 64-dimensional output for the considered word. This is concatenated with the analogous outputs of the 7 other attention heads, forming a 512-dimensional output. The same is (potentially in parallel) done for all other words. (See the sketch after this list.)
* Comment (JJ): in most state-of-the-art language models (e.g. GPT), self-attention is used: the (encoder or decoder) layer activity itself is used for computing the queries, keys, and values
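A minimal NumPy sketch of these steps for one layer of multi-head self-attention (random matrices stand in for the learned projections; the 1/sqrt(d_k) scaling from the paper is included):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 8, 512, 8
d_head = d_model // n_heads                      # 64 dims per head

X = rng.normal(size=(seq_len, d_model))          # 8 words, 512-dim embeddings

head_outputs = []
for _ in range(n_heads):
    # learned projection matrices (random stand-ins here)
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # (8, 64) each
    scores = Q @ K.T / np.sqrt(d_head)           # query-key scalar products
    weights = softmax(scores, axis=-1)           # each row sums to one
    head_outputs.append(weights @ V)             # weighted sum of value vectors

output = np.concatenate(head_outputs, axis=-1)   # 8 heads x 64 dims = 512 dims
assert output.shape == (seq_len, d_model)
```

(Real implementations additionally apply a final output projection to the concatenated heads; that step is omitted here for brevity.)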
* Q: Residuals, added or concatenated?
* A: added, followed by layer normalization (see the sketch below)
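In NumPy, the "Add & Norm" step looks roughly like this (a sketch without the learned gain and bias parameters of layer normalization):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's feature vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer_output):
    # the sublayer output is ADDED to its input (same shape, not concatenated),
    # then layer-normalized
    return layer_norm(x + sublayer_output)
```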