Introduction to BERT

浅川伸一 (Tokyo Woman's Christian University) asakawa@ieee.org

14/Jun/2020

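About the author

With my mentor Jeff Elman at UCSD

浅川伸一: Ph.D. (Literature), staff member of the Information Processing Center, Tokyo Woman's Christian University. As a student at Waseda University he was captivated by Piaget's genetic epistemology. After graduating he studied under Jeff Elman, the inventor of the Elman network. Ever since, he has believed himself to be thinking about what it means to be intelligent by simulating higher human cognitive functions. His books include 「AI白書 2019, 2018」 (2019, ASCII, co-author), 「深層学習教科書ディープラーニング G検定（ジェネラリスト）公式テキスト」 (2018, Shoeisha, co-author), 「Python で体験する深層学習」 (Corona Publishing, 2016), 「ディープラーニング，ビッグデータ，機械学習あるいはその心理学」 (Shin-yo-sha, 2015), and the chapters 「ニューラルネットワークの数理的基礎」 and 「脳損傷とニューラルネットワークモデル，神経心理学への適用例」, both in 守一雄 et al. (eds.), 「コネクショニストモデルと心理学」 (2001, Kitaohji Shobo), among others.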

Outline

Attention everywhere
Overview of BERT
Trending topics

Part 1

Attention everywhere

Multi-Head Self-Attention: MHSA

Natural language processing (NLP): Transformer (Vaswani et al. 2017); BERT (Devlin et al. 2018); RoBERTa (Y. Liu et al. 2019); DistilBERT (Sanh et al. 2020); and more…
Image processing: (Ramachandran et al. 2019); A2-Net (Chen et al. 2018); U-GAT-IT (Kim et al. 2019)
Reinforcement learning, meta-learning: SNAIL (Mishra et al. 2018)
Generative adversarial networks: SAGAN (Zhang et al. 2019)

Multi-Head Self-Attention

Left: (Vaswani et al. 2017), Right: (Ramachandran et al. 2019)

\[ \text{Self-Attention}\left(\mathbf{X}\right)_{t,:}=\text{softmax}\left(\mathbf{A}_{t,:}\right)\mathbf{XW}_{\text{value}}, \] \[ \mathbf{A}=\mathbf{XW}_{\text{query}}\mathbf{W}_{\text{key}}^\top\mathbf{X}^\top \]

\[ \mathbf{A}:=\left(\mathbf{X}+\mathbf{P}\right)\mathbf{W}_{\text{query}}\mathbf{W}_{\text{key}}^\top\left(\mathbf{X}+\mathbf{P}\right)^\top, \hspace{3em} \text{$\mathbf{P}$ is the positional encoding (PE)} \]
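The two formulas above can be tried out numerically. The following is a minimal single-head sketch in NumPy, not the implementation of any of the cited papers; the function names, the random weights, and the sizes `T`, `d_in`, `d_k`, `d_out` are all assumptions made for illustration.

```python
import numpy as np


def softmax(a, axis=-1):
    # Subtract the row maximum for numerical stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_query, W_key, W_value, P=None):
    """Single-head self-attention: softmax(A) X W_value,
    with A = (X + P) W_query W_key^T (X + P)^T (P is optional)."""
    Z = X if P is None else X + P
    A = Z @ W_query @ W_key.T @ Z.T      # (T, T) attention scores
    probs = softmax(A, axis=-1)          # each row sums to 1
    return probs @ X @ W_value, probs


# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, d_in, d_k, d_out = 5, 8, 4, 6
X = rng.normal(size=(T, d_in))           # token features
P = rng.normal(size=(T, d_in))           # stand-in for a positional encoding
W_query = rng.normal(size=(d_in, d_k))
W_key = rng.normal(size=(d_in, d_k))
W_value = rng.normal(size=(d_in, d_out))

out, probs = self_attention(X, W_query, W_key, W_value, P)
print(out.shape, probs.shape)
```

A multi-head version would run several such heads with separate weights and concatenate their outputs.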

Multi-head self-attention: MHSA (2)

Multi-head self-attention: MHSA (3) SAGAN (Self-Attention GAN)

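From (Zhang et al. 2019) Fig. 1 and 3. In image generation, using not only information from neighboring pixels but also related long-range features makes it possible to generate consistent objects and scenarios. The colored dots on the original image at the left of each row mark five representative query locations. The five images on the right are the attention maps for each query location; the most attended regions are indicated by color-coded arrows.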

Multi-head self-attention: MHSA (4) Non-Local Net

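Schematic of a spacetime non-local block. Feature maps are shown as tensors, for example $T\times H\times W\times1024$ for 1024 channels. $\otimes$ denotes matrix multiplication and $\oplus$ denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote $1\times1\times1$ convolutions. The embedded Gaussian version with a bottleneck of $512$ channels is shown. The vanilla Gaussian version is obtained by removing $\theta$ and $\phi$, and the dot-product version by replacing the softmax with scaling by $1/N$. From (Wang et al. 2018)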

Multi-head self-attention: MHSA (5) SNAIL

From (Mishra et al. 2018) Fig. 2

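The Transformer has no recurrent or convolutional structure; it handles sequence information by adding a positional encoding to the embedding vectors. It has been criticized, however, for carrying only weak sequential-order information, which is a problem especially in tasks that are sensitive to positional dependency, such as reinforcement learning. To address this, the Simple Neural Attentive Meta-Learner (SNAIL) (Mishra et al. 2018) combines self-attention with temporal convolution. SNAIL was shown to perform well on both meta-learning and reinforcement-learning tasks.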

Taxonomy of attention

Content-based attention: $\text{score}(\mathbf{s}_t,\mathbf{h}_i)=\cos(\mathbf{s}_t,\mathbf{h}_i)$ (Graves, Wayne, and Danihelka 2014)
Additive (concat) attention: $\text{score}(\mathbf{s}_t,\mathbf{h}_i)=\mathbf{v}_a^\top\tanh\left(\mathbf{W}_a\left[\mathbf{s}_t;\mathbf{h}_i\right]\right)$ (Bahdanau, Cho, and Bengio 2015)
- Called "concat" in (Luong, Pham, and Manning 2015) and "additive" in (Vaswani et al. 2017)
Location-based attention: $a_{t,i}=\text{softmax}(\mathbf{W}_a \mathbf{s}_t)$ (Luong, Pham, and Manning 2015)
Note: This simplifies the softmax alignment to only depend on the target position.
General attention: $\text{score}(\mathbf{s}_t,\mathbf{h}_i)=\mathbf{s}_t^\top\mathbf{W}_a\mathbf{h}_i$ (Luong, Pham, and Manning 2015)
$\mathbf{W}_a$ is a trainable weight matrix
Dot-product attention: $\text{score}(\mathbf{s}_t,\mathbf{h}_i)=\mathbf{s}_t^\top \mathbf{h}_i$ (Luong, Pham, and Manning 2015)
Scaled dot-product attention: $\displaystyle\text{score}(\mathbf{s}_t,\mathbf{h}_i)=\frac{\mathbf{s}_t^\top \mathbf{h}_i}{\sqrt{n}}$ (Vaswani et al. 2017)
- Uses the scaling factor $1/\sqrt{n}$, where $n$ is the dimension of the hidden state
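The score functions in this taxonomy differ only in how a target state and a source state are compared. The sketch below writes each of them as a small NumPy function; the function names and the toy vectors are assumptions for illustration, and the trainable parameters `W_a` and `v_a` are simply passed in as arrays.

```python
import numpy as np


def content_score(s, h):
    # Cosine similarity between target state s and source state h.
    return float(s @ h / (np.linalg.norm(s) * np.linalg.norm(h)))


def additive_score(s, h, W_a, v_a):
    # v_a^T tanh(W_a [s; h]); W_a has shape (m, len(s) + len(h)).
    return float(v_a @ np.tanh(W_a @ np.concatenate([s, h])))


def location_weights(s, W_a):
    # softmax(W_a s): alignment weights that depend on the target state only.
    z = W_a @ s
    e = np.exp(z - z.max())
    return e / e.sum()


def general_score(s, h, W_a):
    return float(s @ W_a @ h)


def dot_score(s, h):
    return float(s @ h)


def scaled_dot_score(s, h):
    # Dot product divided by sqrt(n), n being the state dimension.
    return float(s @ h / np.sqrt(h.shape[0]))


# Toy vectors, chosen so that the values are easy to check by hand.
s = np.array([1.0, 0.0, 0.0, 0.0])
h = np.array([2.0, 0.0, 0.0, 0.0])
print(dot_score(s, h), scaled_dot_score(s, h), content_score(s, h))
```

With an identity matrix as `W_a`, the general score reduces to the plain dot product, which makes the relationship between the two entries explicit.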

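Part 1 summary: Multi-head self-attention (MHSA)

Almost the same form of MHSA has been adopted in four fields: natural language processing, image processing, reinforcement learning, and meta-learning.
In each case the query, key, and value tensors are learned.
There is a trend toward replacing conventional convolutions and LSTMs with MHSA.
However, SAGAN and SNAIL (non-local net) differ from the others in that they concatenate the input information before passing it to the upper layer.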

Supplement: the history leading up to attention

BOW, TF-IDF (Jones 1972), SMT (Manning and Schütze 1999), N-gram models: dimensionality increases as $V^N$
RNN (Elman 1990), (Mikolov et al. 2010), (Mikolov et al. 2011)
LSTM (Hochreiter and Schmidhuber 1997), (Gers, Schmidhuber, and Cummins 1999), (Greff et al. 2015), Seq2seq (Sutskever, Vinyals, and Le 2014), attention model (Bahdanau, Cho, and Bengio 2015)
Transformer (Vaswani et al. 2017)
BERT (Devlin et al. 2018)

Each of these is well known, so they are not explained here.

Part 2

Overview of BERT

Mnih and Graves (2014)

From (Mnih et al. 2014)

Show and Tell (2014)

Attention for neural image captioning (Xu et al. 2015)

Seq2seq model

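From (Sutskever, Vinyals, and Le 2014) Fig. 1, schematic of the “seq2seq” translation model

“eos” marks the end of a sentence. The part before the central “eos” is the source language; its final state is used as input to the hidden layer of the SRN (simple recurrent network) that serves as the target-language language model after the central “eos”.

Note that only the hidden state at the end of the source sentence is fed into the first hidden state of the target language; at every other time step the source and target languages are unconnected. Put differently, the hidden state at the final time step is assumed to contain all the information in the source sentence. Much work since 2014 has aimed at improving this point. Prominent examples, described later, are the adoption of bidirectional RNNs and LSTMs and the introduction of attention mechanisms.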

Seq2seq (2)

From (Sutskever, Vinyals, and Le 2014) Fig. 2

Seq2seq (3)

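From (Sutskever, Vinyals, and Le 2014) Fig. 2

Attention in natural language processing

Left: (Bahdanau, Cho, and Bengio 2015), middle: (Luong, Pham, and Manning 2015) Fig. 2, right: (Luong, Pham, and Manning 2015) Fig. 3

Features of BERT

The features of BERT can be summarized in three points:

A multi-layer neural network model using MHSA, based on the Transformer
Two pre-training tasks: masked language modeling and next-sentence prediction
Performance gains on multiple tasks through fine-tuning; see the GLUE leaderboard and SuperGLUE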

BERT input representation

The sum of three components: token embeddings, position embeddings, and segment embeddings. From (Devlin et al. 2018) Fig. 2

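BERT pre-training: masked language model

15% of the tokens in the input sequence are selected at random for replacement

Input: the original sequence with the selected tokens replaced
Label: predict the correct original token at each selected position
80%: replace the selected token with [MASK]
10%: replace the selected token with a random, unrelated word
10%: keep the original token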
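The 15% selection and the 80:10:10 replacement rule can be sketched in a few lines of Python. This is an illustrative sketch only, not the BERT preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the use of `None` for "no label" are assumptions, and special tokens such as [SEP] are not treated separately here.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions, else None.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected
        labels[i] = tok                   # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"          # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: random word from the vocabulary
        # remaining 10%: keep the original token unchanged
    return inputs, labels


tokens = "the man went to the store".split()
vocab = ["milk", "penguin", "bird"]       # toy vocabulary (assumption)
inputs, labels = mask_tokens(tokens, vocab, seed=0)
print(inputs)
print(labels)
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on [MASK] alone to know which positions it will be asked about.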

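BERT pre-training: next-sentence prediction

Intended to compensate for a weakness of the language model: predict whether the second sentence follows the first

The input is two sentences separated by the [SEP] token

Input: the man went to the store [SEP] he bought a gallon of milk.
Label: IsNext
Input: the man went to the store [SEP] penguins are flightless birds.
Label: NotNext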
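Training pairs like the two above could be generated along the following lines. This is a hedged sketch: the helper `make_nsp_example` and the toy documents are hypothetical, and the 50/50 split between IsNext and NotNext follows the description in (Devlin et al. 2018) rather than anything stated on this slide.

```python
import random


def make_nsp_example(docs, rng):
    """Build one next-sentence-prediction example.

    docs: list of documents, each a list of at least two sentences;
    at least two documents are needed to draw a NotNext sentence.
    """
    d = rng.randrange(len(docs))
    i = rng.randrange(len(docs[d]) - 1)
    first = docs[d][i]
    if rng.random() < 0.5:
        # Positive pair: the sentence that actually follows.
        second, label = docs[d][i + 1], "IsNext"
    else:
        # Negative pair: a random sentence from a different document.
        other = rng.choice([k for k in range(len(docs)) if k != d])
        second, label = rng.choice(docs[other]), "NotNext"
    return f"{first} [SEP] {second}", label


docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the south"],
]
rng = random.Random(0)
for _ in range(3):
    print(make_nsp_example(docs, rng))
```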

BERT: fine-tuning

(a) and (b) are sentence-level tasks; (c) and (d) are token-level tasks. E: input embedding, $T_i$: contextual representation of token $i$.

From (Devlin et al. 2018) Fig.3

GLUE: General Language Understanding Evaluation

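CoLA: judge whether the input sentence is acceptable English
SST-2: Stanford sentiment polarity judgment of movie reviews
MRPC: Microsoft paraphrase corpus; judge whether two sentences are equivalent
STS-B: rate the similarity of news headlines on a 5-level scale
QQP: judge whether two questions are semantically equivalent
MNLI: judge whether two input sentences stand in a relation of entailment, contradiction, or neutral
QNLI: question answering (Q and A), posed as judging whether a sentence contains the answer to a question
RTE: judge entailment between two input sentences, similar to MNLI
WNLI: Winograd Schema Challenge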

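Other benchmarks

SQuAD: Stanford question answering (Q and A) on passages extracted from Wikipedia
RACE: multiple-choice questions from tests equivalent to middle-school and high-school entrance exams

BERT model details

Data: Wikipedia (2.5B words) + BookCorpus (800M words)
Batch size: 131,072 words (1024 sequences * 128 length or 256 sequences * 512 length)
Training steps: 1M steps (~40 epochs)
Optimizer: AdamW, 1e-4 learning rate, linear decay
BERT-Base: 12 layers, 768 hidden units per layer, 12 attention heads
BERT-Large: 24 layers, 1024 hidden units per layer, 16 attention heads
Training time: 4 days on 4x4 / 8x8 TPU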

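BERT: performance comparison across fine-tuning procedures

Performance comparison across masking ratios in the masked language model

Besides the ratio mask token : random replacement : original = 80:10:10, models were trained with other ratios. The figure below shows how performance on two downstream tasks, MNLI and NER, changes with the ratio. 80:10:10 performs best, but the differences do not appear to be large.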

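BERT: model size comparison

Performance comparison by number of model parameters

Making the model larger by increasing the number of parameters can be expected to improve accuracy. In the figure below the horizontal axis is the number of parameters, with MNLI plotted in blue and MRPC in red. Accuracy improves as the number of parameters increases, and within the plotted range it has not reached a ceiling.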

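BERT: unidirectional vs. bidirectional model comparison

Performance comparison by type of language model

The figure below compares performance depending on whether pre-training uses the masked language model or a conventional next-word-prediction language model. The horizontal axis is the number of training steps. As training proceeds, a difference from the masked language model, though only about 2 percent, appears to emerge.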

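BERT: pre-training comparison

Comparison of the effects of pre-training

The figure compares pre-training settings: all pre-training tasks (blue), without next-sentence prediction (red), a conventional language model with next-sentence prediction (yellow), and a conventional language model without next-sentence prediction (green). The four downstream tasks are MNLI, QNLI, MRPC, and SQuAD. The effect on accuracy appears to differ from one downstream fine-tuning task to another.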

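Characteristics of each model

RoBERTa: enlarges BERT's training corpus (173GB) and increases the mini-batch size
XLNet: permutation language model; two-stream attention
MT-DNN: a BERT-based model that emphasizes transfer learning
GPT-2: a Transformer-based language model (not derived from BERT). Its generated text was touted as surpassing humans and stirred controversy in February 2019
BERT: Transformer-based language model. Pre-trained with masked language modeling and next-sentence prediction, then fine-tuned on each downstream task. The pre-trained models are publicly available.
DistilBERT: a distilled version of BERT
ELMo: embedding representations from a bidirectional RNN
Transformer: a model based on self-attention; multi-head attention, positional encoding.

Pre-training and multi-task learning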

From (X. Liu et al. 2019) Fig. 1

Transformer: Attention is all you need

\[\mathop{attention}\left(Q,K,V\right)=\mathop{dropout}\left(\mathop{softmax}\left(\frac{QK^\top}{\sqrt{d} }\right)\right)V\]

From (Vaswani et al. 2017) Fig. 2

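Positional encoding (position encoders)

In addition to the word representations described above, a signal from the positional encoder is superimposed on the Transformer's input. The signal for position $\text{pos}$ is mapped into the frequency domain by the following equations, where $i$ indexes pairs of embedding dimensions: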

\[ \begin{align} \text{PE}_{(\text{pos},2i)} &= \sin\left(\frac{\text{pos}}{10000^{\frac{2i}{d_{\text{model}}}}}\right)\\ \text{PE}_{(\text{pos},2i+1)} &= \cos\left(\frac{\text{pos}}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \end{align} \]

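The positional encoding can be regarded as an attempt to represent the position $\text{pos}$ not as a one-hot vector but as periodic information, by mapping it into the frequency domain.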
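The sinusoidal encoding above is easy to compute directly. Here is a minimal pure-Python sketch; the function name `positional_encoding` and the small sizes in the usage line are assumptions for illustration. The loop variable `k` plays the role of the even dimension index $2i$ in the formula.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):        # k = 2i in the formula
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)       # even dimension: sine
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)  # odd dimension: cosine
    return pe


pe = positional_encoding(max_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Low dimensions oscillate quickly with position while high dimensions change slowly, so each position receives a distinct pattern of values in $[-1, 1]$.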

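The codes used for positional encoding

The Transformer is obtained by connecting the values computed in this way on the input side and the output side, as in the figure below.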

From (Vaswani et al. 2017) Fig. 1

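As we have seen, the Transformer transforms information on the basis of the input signal. In this sense, multi-head self-attention (MHSA) in the Transformer can be regarded as a variant of bottom-up attention. Conversely, because it does not retain the entire past history as an RNN does, it relies on position encoders for sequence information.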

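BERT, GPT, ELMo: differences in pre-training

BERT: Transformer, masked language model, next-sentence prediction
GPT: forward (left-to-right) Transformer
ELMo: concatenation of the hidden layers of bidirectional RNNs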
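One concrete way to see the difference between a forward Transformer such as GPT and a bidirectional one such as BERT is the attention mask: which positions each token is allowed to attend to. The sketch below is an illustration under that reading; the helper name `attention_mask` is hypothetical.

```python
import numpy as np


def attention_mask(seq_len, causal):
    """Boolean mask: entry [t, u] is True if position t may attend to position u."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    # Causal (forward) models see only the current and earlier positions.
    return np.tril(full) if causal else full


gpt_like = attention_mask(4, causal=True)    # left-to-right
bert_like = attention_mask(4, causal=False)  # every position sees every other
print(gpt_like.astype(int))
print(bert_like.astype(int))
```

Because a bidirectional model can see the token to be predicted, plain next-word prediction would be trivial for it, which is one motivation for the masked language model objective.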

Multilingual support

From (Lample and Conneau 2019) Fig. 1

Developments of BERT

From https://towardsdatascience.com/a-review-of-bert-based-models-4ffdc0f15d58

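BERT: syntactic parsing with the embedding model

The figure below illustrates BERT's syntactic parsing capability. It has been reported that, by projecting each word into a common space and computing the distances between words, one can obtain a representation equivalent to a parse tree (Hewitt and Manning 2019).

Projection space that reproduces parse trees from BERT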

From https://github.com/john-hewitt/structural-probes

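In word2vec, the relation between words was defined by the inner product. It is therefore not unnatural to suppose that distances in the linear inner-product space spanned by the words of a sentence yield the parse tree. If a projection that reproduces the parse tree can be found, BERT can be used for syntactic parsing. In the figure above, for example, this amounts to finding a space in which the distances among chef, store, and was reflect the parse tree. Let $w_i$ and $w_j$ be two words and $d\left(w_i,w_j\right)$ the distance between them in the parse tree. With $h_i$ and $h_j$ denoting their vector representations, the sought transformation $B$ is the solution of: \[ \min_{B}\sum_\ell\frac{1}{\left|s_\ell\right|^2}\sum_{i,j}\left|d\left(w_i,w_j\right)-\left\|B\left(h_i-h_j \right)\right\|^2\right| \] Here $\ell$ indexes the training sentences $s_\ell$, and each term is normalized by the squared length of the sentence.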
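The probe objective can be written out numerically as follows. This is a sketch of the loss only, using the absolute difference between tree distance and squared projected distance as in (Hewitt and Manning 2019); the function name `probe_loss` and the three-word toy sentence are assumptions, and no optimization over $B$ is performed here.

```python
import numpy as np


def probe_loss(sentences, B):
    """Structural-probe loss.

    sentences: list of (H, D) pairs, where H is an (n, d) array of word
    vectors and D is the (n, n) matrix of parse-tree distances.
    B: (k, d) projection matrix.
    """
    total = 0.0
    for H, D in sentences:
        n = H.shape[0]
        diff = H[:, None, :] - H[None, :, :]   # (n, n, d): h_i - h_j
        proj = diff @ B.T                       # (n, n, k): B (h_i - h_j)
        pred = (proj ** 2).sum(axis=-1)         # squared L2 norm
        total += np.abs(D - pred).sum() / n ** 2
    return total


# Toy sentence: a chain w0 - w1 - w2, so the tree distances are 1, 1 and 2.
H = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
D = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
print(probe_loss([(H, D)], np.eye(2)))        # these vectors fit the tree exactly
print(probe_loss([(H, D)], np.zeros((2, 2))))
```

In the toy example the squared Euclidean distances between the three vectors coincide with the tree distances, so the identity projection already gives zero loss; with real BERT vectors, $B$ would be fitted by gradient descent.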

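BERT implementation

The parameters of the BERT implementation are listed below. The currently distributed BERT-Base and the better-performing BERT-Large differ in the number of units per layer and in the total number of layers.

Data: Wikipedia (2.5B words) + BookCorpus (800M words)
Batch size: 131,072 words (1024 sequences $\times$ 128 length or 256 sequences $\times$ 512 length)
Training steps: 1M steps (40 epochs)
Optimizer: AdamW, 1e-4 learning rate, linear decay
BERT-Base: 12 layers, 768 hidden units per layer, 12 attention heads
BERT-Large: 24 layers, 1024 hidden units per layer, 16 attention heads
Training time: 4 days on 4x4 / 8x8 TPU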

LSTM

Left: LSTM, from (浅川, 2015); right: Transformer (Vaswani et al. 2017)
Can the input gate and the input be identified with Q and K, and the output gate with V?

BERT embeddings

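import torch
from torch import nn

# Assumption: in the implementation this excerpt appears to be taken from,
# BertLayerNorm is an alias of torch.nn.LayerNorm.
BertLayerNorm = nn.LayerNorm


class BertEmbeddings(nn.Module):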
    """Construct the embeddings from word, position and token_type embeddings.
    """

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]
        device = input_ids.device if input_ids is not None else inputs_embeds.device
        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
            position_ids = position_ids.unsqueeze(0).expand(input_shape)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

BERT inside

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

Part 3

Trending topics

Trending topics (continued)

Residual attention

(Wang et al. 2017) Fig. 1, 2, 3

A2 net

From (Chen et al. 2018) Fig. 1

DistilBERT

Three loss functions (Sanh et al. 2020):

Knowledge-distillation loss
Masked language modeling loss
Cosine loss
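The three losses can be sketched for a single masked position as below. This is a simplified illustration, not the DistilBERT training code: the function names, the temperature `T=2.0`, and the equal weights in `total_loss` are assumptions, and the distillation term is written as a cross-entropy between temperature-softened teacher and student distributions.

```python
import numpy as np


def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()


def distil_losses(student_logits, teacher_logits, label, h_student, h_teacher, T=2.0):
    """Return (distillation, masked-LM, cosine) losses for one position."""
    t = softmax(teacher_logits, T)              # softened teacher distribution
    s = softmax(student_logits, T)              # softened student distribution
    l_kd = float(-(t * np.log(s)).sum())        # knowledge-distillation loss
    l_mlm = float(-np.log(softmax(student_logits)[label]))  # masked-LM loss
    h_s = np.asarray(h_student, dtype=float)
    h_t = np.asarray(h_teacher, dtype=float)
    cos = h_s @ h_t / (np.linalg.norm(h_s) * np.linalg.norm(h_t))
    l_cos = float(1.0 - cos)                    # aligns hidden-state directions
    return l_kd, l_mlm, l_cos


def total_loss(losses, weights=(1.0, 1.0, 1.0)):
    # Weighted sum; the weights here are placeholders, not published values.
    return sum(w * l for w, l in zip(weights, losses))


losses = distil_losses([0.0, 0.0], [0.0, 0.0], label=0,
                       h_student=[1.0, 2.0], h_teacher=[1.0, 2.0])
print(losses, total_loss(losses))
```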

Relationship between self-attention and convolution

From (Cordonnier, Loukas, and Jaggi 2020)

Summary so far

MHSA appears to have capabilities equivalent to convolution.
As seen in Reformer, there still seems to be room for improving position encodings.

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.” In Proceedings of the International Conference on Learning Representations (ICLR), edited by Yoshua Bengio and Yann LeCun. San Diego, CA, USA. http://arxiv.org/abs/1409.0473.

Chen, Yunpeng, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. 2018. “$A^2$-Nets: Double Attention Networks.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 352–61. Curran Associates, Inc. http://papers.nips.cc/paper/7318-a2-nets-double-attention-networks.pdf.

Cordonnier, Jean-Baptiste, Andreas Loukas, and Martin Jaggi. 2020. “On the Relationship Between Self-Attention and Convolutional Layers.” ArXiv Preprint [cs.LG] (1911.03584). https://arxiv.org/abs/1911.03584.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint.

Elman, Jeffrey L. 1990. “Finding Structure in Time.” Cognitive Science 14: 179–211.

Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. 1999. “Learning to Forget: Continual Prediction with LSTM.” In Artificial Neural Networks ICANN 99. Ninth International Conference on, 2:850–55. Edinburgh, Scotland.

Graves, Alex, Greg Wayne, and Ivo Danihelka. 2014. “Neural Turing Machines.” ArXiv:1410.5401. http://arxiv.org/abs/1410.5401v1.

Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2015. “LSTM: A Search Space Odyssey.” ArXiv:1503.04069.

Hewitt, John, and Christopher D. Manning. 2019. “A Structural Probe for Finding Syntax in Word Representations.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4129–38. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1419.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9: 1735–80.

Jones, Karen Spärck. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation 28 (1): 11–21.

Kim, Junho, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. 2019. “U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation.” ArXiv Preprint [cs.CV] (1907.10830).

Lample, Guillaume, and Alexis Conneau. 2019. “Cross-Lingual Language Model Pretraining.” ArXiv Preprint 1901.07291v1 [cs.CL].

Liu, Xiaodong, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. “Multi-Task Deep Neural Networks for Natural Language Understanding.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4487–96. Florence, Italy: Association for Computational Linguistics.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized Bert Pretraining Approach.” ArXiv Preprint.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. 2015. “Effective Approaches to Attention-Based Neural Machine Translation.” ArXiv Preprint cs.CL: 1508.04025.

Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press.

Mikolov, Tomáš, Stefan Kombrink, Lukáš Burget, Jan “Honza” Černocký, and Sanjeev Khudanpur. 2011. “Extensions of Recurrent Neural Network Language Model.” In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prague, Czech Republic.

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan “Honza” Černocký, and Sanjeev Khudanpur. 2010. “Recurrent Neural Network Based Language Model.” In Proceedings of INTERSPEECH2010, edited by Takao Kobayashi, Keiichi Hirose, and Satoshi Nakamura, 1045–8. Makuhari, JAPAN.

Mishra, Nikhil, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2018. “A Simple Neural Attentive Meta-Learner.” ArXiv Preprint [cs.AI] (1707.03141).

Mnih, Volodymyr, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. “Recurrent Models of Visual Attention.” In Advances in Neural Information Processing Systems 27, edited by Zoubin Ghahramani, Max Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2204–12. Curran Associates, Inc. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf.

Ramachandran, Prajit, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. “Stand-Alone Self-Attention in Vision Models.” ArXiv Preprint [cs.CV] (1906.05909). https://arxiv.org/abs/1906.05909.

Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter.” ArXiv Preprint. https://arxiv.org/abs/1910.01108.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems (NIPS), edited by Zoubin Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 27:3104–12. Montreal, QC, Canada. http://arxiv.org/abs/1409.3215v3.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv Preprint [cs.CL] (1706.03762).

Wang, Fei, Mengqing Jiang, Chen Qian, Shuo Yang, and Cheng Li. 2017. “Residual Attention Network for Image Classification.” In Proceedings of International Conference of Computer Vision (ICCV), IEEE International Conference. Venice, Italy.

Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. “Non-Local Neural Networks.” ArXiv Preprint [cs.CV]. https://arxiv.org/abs/1711.07971.

Xu, Kelvin, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ArXiv:1502.03044.

Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019. “Self-Attention Generative Adversarial Networks.” ArXiv Preprint [stat.ML] (1805.08318).