* Q (Yueling): In the Transformer network (Fig. 1, right), why is another (shifted) input needed?
* A (Mehdi): The input is the history of tokens; the output is the next token. In the training phase, for each next token to predict, we use the history of ground-truth previous tokens as the input. This is quite usual in NLP and is called "teacher forcing". In the testing phase, the next token is predicted from the history (either by sampling from a multinomial or via beam search; beam search is common in machine translation) and is then appended to the history to predict the following token. This procedure (predict the next token + feed it into the history) is repeated until we reach EOS (End of Sentence), a special token that marks the end of the sentence. Following the example of the blog post https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/: suppose we have an input sequence "Je suis étudiant" that we need to translate into "I am a student". We first feed "Je suis étudiant" to the encoder. Then, using the decoder, we predict each token of the target sequence one at a time. Teacher forcing (training phase) works as if, for each input-target sentence pair, you had the following supervised examples (notation is X -> Y, where X is the input and Y its corresponding output). Because the decoder history starts from a special start-of-sentence token, the decoder input is the ground-truth target shifted right by one position; that is the "(shifted right)" input in Fig. 1. A code sketch follows the examples.

"Je suis étudiant" -> I

"Je suis étudiant" + I -> am

"Je suis étudiant" + I + am -> a

"Je suis étudiant" + I + am + a -> student