|
|
# JULAIN journal club, 21 Sept 2020
|
|
|
|
|
|
|
|
|
Attention Is All You Need
|
|
|
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NIPS 2017
|
|
|
|
|
|
- Paper: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
|
|
|
|
|
|
Self-Attention Generative Adversarial Networks
|
|
|
Zhang, Goodfellow, Metaxas, Odena, ICML 2019
|
|
|
|
|
|
- Paper: https://arxiv.org/abs/1805.08318
|
|
|
|
|
|
|
|
|
# Discussion and questions
|
|
|
|
|
|
* Q (Olav): Is tokenization learned?
|
|
|
|
|
|
* MC: they use byte-pair encoding: "Sentences were encoded using byte-pair encoding [3]".
|
|
|
So yes, it is learned in the sense that it is data dependent, but the encoding is built before training; after that, the tokenization is fixed. Transformers only process already tokenized sequences.<br>
|
|
|
Byte pair encoding[1][2] or digram coding[3] is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal.[4]<br>
|
|
|
A variant of the technique has been shown to be useful in several natural language processing (NLP) applications, such as OpenAI's GPT, GPT-2, and GPT-3.[5] <br>
|
|
|
Source: https://en.wikipedia.org/wiki/Byte_pair_encoding<br>
|
|
|
Implementation of Byte-Pair encoding: https://huggingface.co/transformers/tokenizer_summary.html
|
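To make the byte-pair merging idea above concrete, here is a toy Python sketch of the core loop (repeatedly merge the most frequent pair of adjacent symbols). It is only an illustration, not the implementation used in the paper or in the HuggingFace tokenizers library; the word list and number of merges are made up.

```python
# Toy sketch of the BPE merge loop: repeatedly replace the most frequent pair
# of adjacent symbols with a new merged symbol. Illustrative only.
from collections import Counter

def learn_bpe_merges(words, num_merges):
    corpus = [list(w) for w in words]                 # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))   # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for symbols in corpus:                        # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=5)
print(merges)    # learned merge rules, applied in order at tokenization time
print(corpus)    # the words segmented into the learned subword symbols
```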
|
|
|
|
|
|
|
|
* Q (Olav): Multi-head attention, how does it work?
|
|
|
* A (Tobias Tesch): http://jalammar.github.io/illustrated-transformer/ explains it somewhat better; see the section "The Beast With Many Heads" on multi-head attention ("the multiheaded beast") and its advantages.
|
|
|
|
|
|
* Q (Yueling): Transformer network, Fig. 1 right: why is another (shifted) input needed?
|
|
|
|
|
|
* A (Mehdi): Input is the history of tokens; output is the next token. In the training phase, for each next token to predict, we have the history of ground-truth previous tokens as input; this is quite usual in NLP and is called "teacher forcing". In the testing phase, the next token is predicted based on the history (either via sampling from a multinomial or via beam search; beam search is common in machine translation) and is then included in the history to predict the following token. This procedure (predict next token + feed it to the history) is repeated until we get an EOS (End of Sentence), a special token that marks the end of the sentence. Following the example of the blog post https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/ , suppose we have an input sequence "Je suis étudiant" that we need to translate into "I am a student". We first feed "Je suis étudiant" to the encoder. Then, using the decoder, we predict each token of the target sequence one at a time. Teacher forcing works (in the training phase) as if, for each input-target sentence pair, you had the following supervised examples (notation is X -> Y, where X are inputs and Y their corresponding outputs):
|
|
|
"Je suis étudiant" + I -> am
|
|
|
"Je suis étudiant" + I + am -> a
|
|
|
"Je suis étudiant" + I + am + a -> student
|
|
|
"Je suis étudiant" + I + am + a + student -> EOS(End Of Sentence)<br>
|
|
|
The encoder is used to encode the input sequence "Je suis étudiant", the decoder is used to predict the next token based on the history of target tokens (it can't see the future, only the past tokens) and the encoded input sequence.
|
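A minimal Python sketch of the two phases described above: building teacher-forcing training pairs from a ground-truth target sentence, and the autoregressive decoding loop used at test time. The `toy_model` function, the `<bos>`/`<eos>` token names, and the canned outputs are made-up stand-ins, not part of the paper.

```python
# Sketch of teacher forcing (training) and greedy autoregressive decoding (testing).
BOS, EOS = "<bos>", "<eos>"   # assumed special start / end-of-sentence tokens

def teacher_forcing_pairs(source_tokens, target_tokens):
    """Turn one (source, target) sentence pair into (source, history) -> next-token examples."""
    target = [BOS] + target_tokens + [EOS]
    return [(source_tokens, target[:i], target[i]) for i in range(1, len(target))]

def toy_model(source_tokens, target_history):
    """Stand-in for a trained transformer: returns the 'most likely' next token."""
    canned = ["I", "am", "a", "student", EOS]
    return canned[len(target_history) - 1]

def greedy_decode(source_tokens, max_len=20):
    history = [BOS]                          # decoder input starts from the start token
    while len(history) < max_len:
        next_token = toy_model(source_tokens, history)
        if next_token == EOS:                # stop at the end-of-sentence token
            break
        history.append(next_token)           # feed the prediction back into the history
    return history[1:]

print(teacher_forcing_pairs(["Je", "suis", "étudiant"], ["I", "am", "a", "student"]))
print(greedy_decode(["Je", "suis", "étudiant"]))   # -> ['I', 'am', 'a', 'student']
```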
|
|
|
|
|
* Q (Yueling): How to ensure that the output is physically consistent (e.g., respects the time order)?
|
|
|
* A (Mehdi): Yueling was thinking about how to apply transformers to time series and how the order (time) is taken into account. Transformers do not know about order; in the general case they just process sets. Order can be added as a "feature" (positional encoding), which is what is usually done. In the original transformer paper (Attention Is All You Need), they use positional encodings: fixed features representing order that are added to the word embeddings (added as the element-wise + of vectors of identical size, not concatenated). Other papers, such as the GPT papers, learn the positional encodings from data directly, in the same way you would learn word embeddings: each position in the sequence has an embedding (a vector) associated with it, and those vectors are learned. See the sketch below.
|
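For reference, a short NumPy sketch of the fixed sinusoidal positional encoding from the paper (Section 3.5), added element-wise to the embeddings; the sequence length and model size below are just example values.

```python
# Sinusoidal positional encoding (fixed, not learned), added to word embeddings.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # even dimensions 2i
    angles = positions / np.power(10000.0, dims / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                              # sin on even dimensions
    pe[:, 1::2] = np.cos(angles)                              # cos on odd dimensions
    return pe

embeddings = np.random.randn(8, 512)                # e.g. 8 tokens, d_model = 512
inputs = embeddings + positional_encoding(8, 512)   # added element-wise, not concatenated
```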
|
|
|
|
|
|
|
|
* another paper recommendation (Stephan Bialonski):
|
|
|
People are already trying this, for instance: "Deep Transformer Models for Time Series Forecasting", https://arxiv.org/pdf/2001.08317.pdf
|
|
|
|
|
|
* Q: What is a good explanation of the attention mechanism?
|
|
|
* A: recommended blog post: <br> https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
|
|
|
|
|
|
* A (Tobias) intuitive explanation:
|
|
|
* assume an input sequence of 8 words, each embedded as a 512-dimensional vector.
|
|
|
* in the attention layer, each vector is multiplied by a query, a key, and a value matrix (projecting to, e.g., 64 dimensions), resulting in a query, key, and value vector for each input token; the weights of these matrices are learned
|
|
|
* to get the attention-layer output for a word, the scalar product of that word's query with the keys of all words (including itself) is computed, yielding a scalar weight per word; in the paper these scores are additionally scaled by $1/\sqrt{d_k}$. The weights are passed through a softmax so they sum to one. Then the values of all words, weighted by these weights, are summed, yielding an e.g. 64-dimensional output for the considered word. This is concatenated with the analogous outputs of the 7 other attention heads, forming a 512-dimensional output. The same is done (potentially in parallel) for all other words. See the numerical sketch below.
|
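A numerical NumPy sketch of one attention head following the explanation above; the dimensions (8 words, 512-dimensional embeddings, 64-dimensional heads) match the example, and the random matrices stand in for learned weights.

```python
# One head of scaled dot-product self-attention (illustrative shapes only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 8, 512, 64
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))        # embedded input sequence
W_q = rng.normal(size=(d_model, d_k))          # learned query projection (random stand-in)
W_k = rng.normal(size=(d_model, d_k))          # learned key projection
W_v = rng.normal(size=(d_model, d_k))          # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # query/key/value vector per token
scores = Q @ K.T / np.sqrt(d_k)                # scalar products, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)             # each row sums to one
head_output = weights @ V                      # weighted sum of values, shape (8, 64)

# With 8 heads, the eight (8, 64) outputs are concatenated into shape (8, 512)
# and passed through a final learned output projection.
```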
|
|
|
|
|
* Q: Residuals, added or concatenated?
|
|
|
* A: added, followed by layer normalization
|
|
|
"We employ a residual connection [10] around each ofthe two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer isLayerNorm(x+ Sublayer(x)), whereSublayer(x)is the function implemented by the sub-layeritself. To facilitate these residual connections, all sub-layers in the model, as well as the embeddinglayers, produce outputs of dimensiondmodel= 512"
|
|
|
(page 3)
|
|
|
* Q: it is unclear what this adding really does
|
|
|
* MC: it's a residual connection, the same mechanism used in ResNets https://arxiv.org/abs/1512.03385; the motivation is the same: to let gradients flow easily when you have a large number of layers (see the sketch below)
|
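A tiny NumPy sketch of the LayerNorm(x + Sublayer(x)) wrapper quoted above; the learned gain and bias of real layer normalization are omitted for brevity, and the sub-layer here is a random toy projection.

```python
# Residual connection followed by (simplified) layer normalization.
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)            # learned gain/bias omitted for brevity

def sublayer_with_residual(x, sublayer):
    return layer_norm(x + sublayer(x))         # LayerNorm(x + Sublayer(x))

x = np.random.randn(8, 512)                                               # 8 tokens, d_model = 512
out = sublayer_with_residual(x, lambda h: h @ np.random.randn(512, 512))  # toy sub-layer
print(out.shape)   # (8, 512): same dimension everywhere, so residuals can simply be added
```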
|
|
|
|
|
|
|
|
* Q: Is self-attention in GANs applied to all pixels?
|
|
|
|
|
|
* Tobias: equation (1) in section 3 also confirms that you have to calculate the scalar product between each pixel and every other pixel, as you said.
|
|
|
* MC: in the inner layers you have a grid of size k x k, and at each position you have a vector (representing the channels). Then you apply self-attention to the set of vectors at all positions, just like you would for a sequence of tokens in NLP. For a 32x32 grid, if you apply self-attention, you will have 32*32 = 1024 "tokens", where each token refers to a position in the grid, which itself represents a region of the image. Note that self-attention is applied both in the generator and in the discriminator. See the sketch below.
|
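A small NumPy sketch of the reshaping described above: the k x k positions of an inner feature map are treated as a set of "tokens" that go through the same attention computation as an NLP sequence. The grid size and channel count are illustrative; this is not the SAGAN implementation itself.

```python
# Turning a k x k feature map into 'tokens' (one per spatial position) for self-attention.
import numpy as np

k, channels = 32, 64                             # illustrative grid size and channel count
feature_map = np.random.randn(k, k, channels)    # inner-layer activations
tokens = feature_map.reshape(k * k, channels)    # 32*32 = 1024 tokens of dimension 64
print(tokens.shape)                              # (1024, 64)

# These 1024 tokens go through the same query/key/value attention computation as an
# NLP sequence (every position attends to every other position); the (1024, channels)
# result is then reshaped back to (k, k, channels).
```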
|
|
|
|
|
* Q: Is code available to see how it works in practice, or to make the math more readable?
|
|
|
* MC: depending on the goal, whether it is for educational purposes or for training transformers on specific tasks, I found the following links helpful:
|
|
|
* Educational: The Annotated Transformer, http://nlp.seas.harvard.edu/2018/04/03/attention.html
|
|
|
* MinGPT: minimal PyTorch implementation, https://github.com/karpathy/minGPT
|
|
|
* HuggingFace: https://github.com/huggingface/transformers, this is more for fine-tuning pre-trained transformers for NLP or training them from scratch.
|
|
|
They also have a tokenization library https://github.com/huggingface/tokenizers, which implements Byte-Pair encoding (see above) among other methods.
|
|