Transformer: Attention is All You Need

Created: Jul 30, 2022
Tags: NLP

Paper: Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Why I Chose This Paper

This paper introduces the Transformer, which achieves strong performance on sequence transduction tasks using only attention mechanisms, without the convolution or recurrent layers that had commonly been used. I chose it not only because it is one of the most famous papers in NLP, usually among the first mentioned when studying language models, but also because the mechanism it proposes is groundbreaking enough to be applied well beyond text, to images, video, and other domains.

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks (hereafter RNN/CNN) that include an encoder and a decoder. The best-performing models also connect the encoder and decoder through an attention mechanism. This paper proposes the Transformer, a new and simple network architecture based solely on attention mechanisms, dispensing with RNNs and CNNs entirely. On two machine translation tasks, Transformer models are superior in quality while being significantly faster to train thanks to parallelization. On the WMT 2014 English-German translation task, the model achieves 28.4 BLEU, improving over the existing best results, including ensembles, by more than 2 BLEU. On the WMT 2014 English-French translation task, it establishes a new single-model SOTA (State-Of-The-Art) of 41.8 BLEU after training for 3.5 days on 8 GPUs.
  • BLEU: an algorithm for evaluating the quality of text machine-translated from one natural language to another
Beyond machine translation, the Transformer also generalizes well to English constituency parsing, both with large and with very limited training data.

Introduction

RNNs (Recurrent Neural Networks), in particular LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units), have been firmly established as SOTA approaches for sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have continued to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models typically factor computation along the token positions of the input and output sequences. Aligning the positions with steps in computation time, they generate a sequence of hidden states h_t, where each h_t is a function of the previous hidden state h_(t-1) and the input at position t. This inherently sequential nature precludes parallelization within training, and because memory constraints limit batching across training examples, it becomes especially problematic as sequences grow longer. Recent work has improved computational efficiency through factorization tricks and conditional computation, also improving model performance, but the fundamental problem of sequential computation remains.
Attention mechanisms have become an integral part of sequence modeling and transduction models for various tasks, allowing dependencies to be modeled regardless of their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used together with a recurrent network.
This paper proposes the Transformer, an architecture that removes recurrence and instead relies entirely on attention mechanisms to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new SOTA in translation quality after training for only about 12 hours on 8 P100 GPUs.

Background

์—ฐ์†์ ์ธ ๊ณ„์‚ฐ์„ ์ค„์ด๊ธฐ ์œ„ํ•œ ๋ชฉํ‘œ๋Š” ๋˜ํ•œ Extended Neural GPU, ByteNet, ConvS2S์˜ ๊ธฐ๋ฐ˜์„ ํ˜•์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋“ค์€ ๋ชจ๋‘ ๋ชจ๋“  ์ž…๏น’์ถœ๋ ฅ์˜ ์œ„์น˜์—์„œ hidden representation์„ ๋ณ‘๋ ฌ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” CNN์„ ๊ธฐ์ดˆ์ ์ธ building block์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. ํ•ด๋‹น ๋ชจ๋ธ๋“ค์—์„œ๋Š”, ์ž„์˜์˜ ๋‘ ์ž…๋ ฅ ๋˜๋Š” ์ถœ๋ ฅ ์œ„์น˜๋กœ ๋ถ€ํ„ฐ ์‹ ํ˜ธ๋ฅผ ๊ด€๊ณ„์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์š”๊ตฌ๋˜๋Š” ๊ณ„์‚ฐ์˜ ์ˆ˜๊ฐ€ ์œ„์น˜๋“ค๊ฐ„์˜ ๊ฑฐ๋ฆฌ์— ๋น„๋ก€ํ•˜๋Š”๋ฐ, ConvS2S๋Š” ์„ ํ˜•์ ์œผ๋กœ, ByteNet์€ ๋Œ€์ˆ˜์ ์œผ๋กœ(log์™€ ๋น„๋ก€ํ•˜์—ฌ) ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์œ„์น˜๊ฐ€ ๋ฉ€ ์ˆ˜๋ก ์˜์กด์„ฑ์„ ํ•™์Šตํ•˜๊ธฐ๊ฐ€ ๋” ์–ด๋ ต๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. Transformer์—์„œ๋Š” ๊ณ„์‚ฐ์˜ ์ˆ˜๊ฐ€ ์ƒ์ˆ˜ ๋ฒˆ์œผ๋กœ ์ค„์–ด๋“ค์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜๋„ Attention-๊ฐ€์ค‘ ์œ„์น˜๋ฅผ ํ‰๊ท  ๋‚ด๊ธฐ ์œ„ํ•ด ํšจ๊ณผ์ ์ธ ํ•ด์ƒ๋„๊ฐ€ ๊ฐ์†Œํ•˜๋Š” cost๊ฐ€ ์กด์žฌํ•˜์ง€๋งŒ, ์ด๋Š” section 3.2์— ์„ค๋ช…๋  Multi-Head Attention์„ ํ†ตํ•ด ์ƒ์‡„์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๋•Œ๋กœ๋Š” intra-attention์ด๋ผ๊ณ ๋„ ๋ถˆ๋ฆฌ๋Š” Self-attention์€ ํ•œ ์‹œํ€€์Šค์˜ representation์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๋‹จ์ผ ์‹œํ€€์Šค์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์œ„์น˜๋“ค์„ ๊ด€๋ จ์‹œํ‚ค๋Š” ๋งค์ปค๋‹ˆ์ฆ˜์ž…๋‹ˆ๋‹ค. self-attention์˜ ๊ฒฝ์šฐ, ๋…ํ•ด, ์ƒ์„ฑ์š”์•ฝ, ๋ฌธ๋งฅ ์ถ”๋ก , ๊ทธ๋ฆฌ๊ณ  ํƒœ์Šคํฌ ๋น„์˜์กด์ ์ธ ๋ฌธ์žฅ ํ‘œํ˜„ ๋“ฑ ๋‹ค์–‘ํ•œ task์—์„œ ์„ฑ๊ณต์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
End-to-End memory๋„คํŠธ์›Œํฌ๋Š” ๋ฐฐ์—ด๋œ ์‹œํ€€์Šค์˜ recurrent ๋„คํŠธ์›Œํฌ ๋Œ€์‹ , ์žฌ๊ท€์  attention ๋งค์ปค๋‹ˆ์ฆ˜์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๊ณ , ๋‹จ์ผ ์–ธ์–ด QA(Question-Answering)์ด๋‚˜ ์–ธ์–ด๋ชจ๋ธ๋ง task์— ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฌ๋‚˜, Transformer๋Š” ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ representation์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์‹œํ€€์Šค ํ• ๋‹น RNN์ด๋‚˜ ํ•ฉ์„ฑ๊ณฑ ์—†์ด self-attention์—๋งŒ ์ „์ ์œผ๋กœ ์˜์กดํ•˜๋Š” ์ฒซ๋ฒˆ์งธ ๋ณ€ํ™˜ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
ย 

Model Architecture

์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ์—์„œย ์ธ์ฝ”๋”๋Š” ์ž…๋ ฅ sequence (x1, ... ,xn)๋ฅผ ๋ฐ›์•„ย continuous representationย z =ย (z1, ... ,zn)๋ฅผ ๋ฐ˜ํ™˜ํ•˜๊ณ , ๋””์ฝ”๋”๋Š” ์ด z๋ฅผ ์‚ฌ์šฉํ•ด ์ถœ๋ ฅ sequence (y1, ... ,yn)๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œย ๊ฐ time step์—์„œ ๋‹ค์Œ ๋‹จ์–ด์ž…๋ ฅ์„ ์ƒ์„ฑํ•  ๋•Œ, ๋ชจ๋ธ์€ ์ด์ „์— ์ƒ์„ฑ๋œ ๋‹จ์–ด๋ฅผ ์ถ”๊ฐ€์ ์ธ input์œผ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์—ย Auto-Regressiveํ•ฉ๋‹ˆ๋‹ค.
Transformer๋„ ์ด๋Ÿฌํ•œ ์ „๋ฐ˜์ ์ธ ์•„ํ‚คํ…์ณ๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”์— stacked self-attention, point-wise, fully connected layer๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
[Figure: the Transformer model architecture — encoder stack (left) and decoder stack (right)]

Encoder and Decoder Stacks

(1) Encoder

This corresponds to the left half of the figure above. The encoder is composed of a stack of N = 6 identical layers, and each layer has two sub-layers.
[Figure: one encoder layer, showing its two sub-layers and the Add & Norm blocks]
First sub-layer: a multi-head self-attention mechanism
Second sub-layer: a simple, position-wise fully connected feed-forward network
๊ทธ๋ฆผ์—์„œย Add & Norm ์— ํ•ด๋‹นํ•˜๋Š” ๋ถ€๋ถ„์œผ๋กœย ๋‘๊ฐœ์˜ sub-layer ๊ฐ๊ฐ์—ย ์ž”์ฐจ์—ฐ๊ฒฐ(residual connection)๊ณผ ์ธต ์ •๊ทœํ™”(layer normalization)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ž”์ฐจ์—ฐ๊ฒฐ์€ x + Sublayer(x), ์ฆ‰ sub-layer ์ธต์˜ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์„ ๋”ํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ sub-layer์™€ ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ๊ณผ ์ž…๋ ฅ์˜ ์ฐจ์›์€ ๋™์ผํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค(d_model = 512). ๊ทธ๋ฆฌ๊ณ  ์ธต ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ•ด, ์ตœ์ข… ํ•จ์ˆ˜๋Š” LayerNorm(x + Sublayer(x)) ์ž…๋‹ˆ๋‹ค.
ย 
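To make the Add & Norm step concrete, here is a minimal NumPy sketch of LayerNorm(x + Sublayer(x)). It is an illustration only, not the authors' code; the function names, shapes, and the omission of the learnable gain/bias are my own simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) to zero mean and unit variance.
    # The learnable gain and bias of full layer normalization are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Example: wrap a dummy sub-layer around a random input of shape (sequence length, d_model).
x = np.random.randn(10, 512)
out = add_and_norm(x, lambda h: 0.1 * h)
print(out.shape)  # (10, 512)
```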

(2) Decoder

This corresponds to the right half of the figure above. The decoder is likewise composed of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. The decoder's self-attention sub-layer is also modified with masking so that it cannot look at subsequent words; that is, the word at position i may depend only on the words before position i. As in the encoder, residual connections and layer normalization are applied around each sub-layer.
[Figure: one decoder layer, with masked multi-head self-attention, encoder-decoder attention, and feed-forward sub-layers]

Attention

Attention maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The similarity of the query with every key is computed, and the output is the similarity-weighted sum of the values corresponding to those keys. Here the queries, keys, and values all come from the word vectors of the sequence.
1. Query vector
> A vector representing the token currently being processed
2. Key vector
> A kind of label; an identity for every token in the sequence
3. Value vector
> A vector representing the actual token content associated with the key

(1) Scaled Dot-Product Attention

[Figure: Scaled Dot-Product Attention]
์ž…๋ ฅ์€ย queries, ์ฐจ์›์ด d_k์ธ keys,ย ์ฐจ์›์ด d_v์ธ values๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.ย ๊ฐ€์ค‘์น˜๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด softmax function์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. key์˜ ์ฐจ์›, d_k์— ๋ฃจํŠธ๋ฅผ ์”Œ์šด ๊ฐ’์œผ๋กœ ๋‚˜๋ˆ„์–ด์ค˜ ์Šค์ผ€์ผ๋งํ•ด์ค๋‹ˆ๋‹ค. ์ด๋•Œ ๊ฐ query ๋ฒกํ„ฐ๋ฅผ ๋”ฐ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ query ์ง‘ํ•ฉ์œผ๋กœ ๋™์‹œ์— ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ queries, keys, values๋Š” ๊ฐ๊ฐ ํ–‰๋ ฌ Q, K, V๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ตœ์ข… ์ถœ๋ ฅ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The two most commonly used attention functions are additive attention and the dot-product attention used in this model. The two are similar in theoretical complexity, but dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
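A minimal NumPy sketch of scaled dot-product attention as described above (illustrative only; the shapes, the optional mask argument, and the -1e9 stand-in for -inf are my own choices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n_q, n_k) compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked positions get ~ -inf
    weights = softmax(scores, axis=-1)          # attention weights
    return weights @ V                          # weighted sum of the values

# Toy example: 5 query positions attending over 5 key/value positions.
Q = np.random.randn(5, 64); K = np.random.randn(5, 64); V = np.random.randn(5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```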

(2) Multi-Head Attention

[Figure: Multi-Head Attention — h parallel scaled dot-product attention heads, concatenated and projected]
Rather than performing a single attention function with d_model-dimensional queries, keys, and values, it turns out to be more effective to linearly project the queries, keys, and values h times, with different learned projections, to d_k, d_k, and d_v dimensions respectively. Attention is applied in parallel to each of these reduced-dimension versions of Q, K, and V, yielding d_v-dimensional outputs. These are concatenated and projected once more to obtain the final values.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
This allows the model to jointly attend to information from different representation subspaces at different positions. In other words, by running the reduced-dimension attentions in parallel, the model can gather information from multiple perspectives, so that when relating words it can reflect both syntactic and semantic structure.
The paper uses h = 8 parallel attention layers (heads), with d_k = d_v = 64 (= 512/8) for each head.
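As a rough illustration of the projection-attend-concatenate-project pattern with h = 8 and d_model = 512, here is a self-contained NumPy sketch. The random initialization, scaling factor, and function names are my own; a real model would learn these projection matrices.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h                           # 64, as in the paper
rng = np.random.default_rng(0)

def attention(Q, K, V):
    # Scaled dot-product attention (see the previous section).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

# Randomly initialized projections W_i^Q, W_i^K, W_i^V and the output projection W^O.
W_Q = rng.standard_normal((h, d_model, d_k)) * 0.02
W_K = rng.standard_normal((h, d_model, d_k)) * 0.02
W_V = rng.standard_normal((h, d_model, d_v)) * 0.02
W_O = rng.standard_normal((h * d_v, d_model)) * 0.02

def multi_head_attention(Q, K, V):
    # Project to each head, attend in parallel, concatenate, project back to d_model.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.standard_normal((10, d_model))             # self-attention: Q = K = V = x
print(multi_head_attention(x, x, x).shape)         # (10, 512)
```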

(3) Applications of Attention in our Model

The Transformer uses multi-head attention in three ways:
  • ๋””์ฝ”๋”์˜ ๋‘๋ฒˆ์งธ sub-layer, Multi-Head Attention์— ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. "encoder-decoder attention" ๋ ˆ์ด์–ด์—์„œ queries๋Š” ์ด์ „์˜ ๋””์ฝ”๋” ๋ ˆ์ด์–ด์—์„œ ์˜ค๊ณ  memory keys์™€ values๋Š” ์ธ์ฝ”๋”์˜ ์ถœ๋ ฅ์œผ๋กœ๋ถ€ํ„ฐ ์˜ต๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋””์ฝ”๋”๊ฐ€ ์ž…๋ ฅ sequence์˜ ๋ชจ๋“  position์„ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” sequence-to-sequence ๋ชจ๋ธ์˜ ์ผ๋ฐ˜์ ์ธ ์ธ์ฝ”๋”-๋””์ฝ”๋” attention ๋งค์ปค๋‹ˆ์ฆ˜์„ ๋ชจ๋ฐฉํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ธ์ฝ”๋”์˜ ์ฒซ๋ฒˆ์งธ sub-layer, Multi-Head Attntion์— ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ธ์ฝ”๋”๋Š” self-attention layers๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. self-attention layer์—์„œ ๋ชจ๋“  key, values, queries๋Š” ๊ฐ™์€ ๊ณณ(์ธ์ฝ”๋”์˜ ์ด์ „ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ)์—์„œ ์˜ต๋‹ˆ๋‹ค. ์ธ์ฝ”๋”์˜ ๊ฐ position์€ ์ด์ „ ๋ ˆ์ด์–ด์˜ ๋ชจ๋“  position์„ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋””์ฝ”๋”์˜ ์ฒซ๋ฒˆ์งธ sub-layer, Masked Multi-Head Attention์— ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋””์ฝ”๋”์˜ self-attention layer๋กœ ๋””์ฝ”๋”๊ฐ€ ํ•ด๋‹น position๊ณผ ์ด์ „๊นŒ์ง€์˜ positon๋งŒ์„ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ํ˜„์žฌ position ๋‹ค์Œ์˜ position์„ masking outํ•ฉ๋‹ˆ๋‹ค. ์•„์ฃผ ์ž‘์€ ๊ฐ’์— ์ˆ˜๋ ดํ•˜๋„๋ก ๊ฐ’์„ ์ฃผ๊ณ  softmax๋ฅผ ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฏธ๋ž˜์˜ position์— ๋Œ€ํ•œ ์˜ํ–ฅ์„ maskong outํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ย 
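A tiny NumPy sketch of this causal masking (my own illustration; the toy size n = 5 and the -1e9 stand-in for -inf are arbitrary):

```python
import numpy as np

n = 5                                            # toy sequence length
scores = np.random.randn(n, n)                   # raw attention scores QK^T / sqrt(d_k)

# Causal mask: position i may only attend to positions j <= i.
causal = np.tril(np.ones((n, n), dtype=bool))
masked_scores = np.where(causal, scores, -1e9)   # future positions get ~ -inf

weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The upper triangle is (numerically) zero: no attention to future tokens.
print(np.round(weights, 3))
```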

Position-wise Feed-Forward Networks

์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”์˜ ๊ฐ ๋ ˆ์ด์–ด๋Š” Fully Connected feed-forward network์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ๊ฐ position์— ๋…๋ฆฝ์ ์œผ๋กœ ๋™์ผํ•˜๊ฒŒ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋‘๊ฐœ์˜ ์„ ํ˜•๋ณ€ํ™˜๊ณผ ReLU๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.
FFN(x) = max(0, xW_1 + b_1) W_2 + b_2
ํ•˜๋‚˜์˜ ์ธต ๋‚ด์—์„œ ๋‹ค๋ฅธ position์— ๋Œ€ํ•ด ์„ ํ˜•๋ณ€ํ™˜์€ ๊ฐ™์ง€๋งŒ, ๋ ˆ์ด์–ด๋งˆ๋‹ค๋Š” ๋‹ค๋ฅธ ๋ชจ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ์ฐจ์›์€ d_model = 512, ๊ทธ๋ฆฌ๊ณ  inner-layer์˜ ์ฐจ์›์€ d_ff = 2048์ž…๋‹ˆ๋‹ค.
ย 
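A short NumPy sketch of the position-wise FFN with those dimensions (randomly initialized weights for illustration only):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))   # (sequence length, d_model)
print(ffn(x).shape)                      # (10, 512)
```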

Embeddings and Softmax

As in other sequence transduction models, the Transformer uses learned embeddings to convert the input and output tokens into vectors of dimension d_model. The decoder output is passed through a learned linear transformation and a softmax function to predict next-token probabilities. The two embedding layers and the pre-softmax linear transformation share the same weight matrix, and in the embedding layers those weights are multiplied by √d_model.
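A small NumPy sketch of this weight sharing and √d_model scaling (the vocabulary size, random matrix, and function names are my own illustrative choices):

```python
import numpy as np

vocab_size, d_model = 1000, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) * 0.02   # the shared weight matrix

def embed(token_ids):
    # Embedding lookup, scaled by sqrt(d_model).
    return E[token_ids] * np.sqrt(d_model)

def logits(decoder_output):
    # The pre-softmax linear transformation reuses the same matrix E (weight tying).
    return decoder_output @ E.T                          # (n, vocab_size)

h = rng.standard_normal((7, d_model))      # pretend decoder output for 7 positions
print(embed(np.array([1, 2, 3])).shape)    # (3, 512)
print(logits(h).shape)                     # (7, 1000)
```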

Positional Encoding

Because the Transformer uses no RNNs or CNNs, information about the relative or absolute position of the tokens must be injected so that the model can make use of the order of the sequence. Therefore, a "positional encoding" is added to the input embeddings of the encoder and decoder. The positional encoding has the same dimension d_model as the embeddings, so the two can simply be summed. The following two functions are used to produce the position-dependent values.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
pos: the position
i: the dimension
๊ฐ ์œ„์น˜ ์ธ์ฝ”๋”ฉ์˜ ์ฐจ์›์€ ์‚ฌ์ธ๊ณก์„ (sinusoid)์— ๋Œ€์‘๋ฉ๋‹ˆ๋‹ค. ์ธ๋ฑ์Šค๊ฐ€ย ์ง์ˆ˜(2i)์ผ ๋•Œ๋Š” ์‚ฌ์ธํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๊ณ ย ํ™€์ˆ˜(2i+1)์ผ ๋•Œ๋Š” ์ฝ”์‚ฌ์ธ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์–ด๋А k ๊ฐ’์ด๋ผ๋„ PE_(pos+k)๋Š” ์„ ํ˜• ํ•จ์ˆ˜ PE_(pos)๋กœ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์‚ฌ์ธยท์ฝ”์‚ฌ์ธ ํ•จ์ˆ˜๋ฅผ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค.
Learning positional embedding ๋ฐฉ๋ฒ•๋„ ๊ฑฐ์˜ ๋น„์Šทํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, ์‚ฌ์ธ๊ณก์„ ์˜ ๋ฐฉ๋ฒ•์ด ํ•™์Šต๊ณผ์ •์—์„œ ๋งŒ๋‚ฌ๋˜ sequence ๋ณด๋‹ค ๊ธด sequence์— ๋Œ€ํ•ด์„œ๋„ ์ถ”๋ก ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
ย 
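A NumPy sketch of these sinusoidal encodings (the max_len value and function name are my own; the formula follows the two equations above):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices
    pe[:, 1::2] = np.cos(angles)                       # odd indices
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); added element-wise to the token embeddings
```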

Why Self-Attention

This section compares self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of representations to another sequence of equal length, as in typical sequence transduction encoders and decoders. The paper motivates self-attention with the following three criteria:
  1. The total computational complexity per layer
  2. The amount of computation that can be parallelized, measured by the minimum number of sequential operations required
  3. The path length between long-range dependencies in the network
Learning long-range dependencies is a key challenge in many sequence transduction tasks. One factor affecting the ability to learn such dependencies is the length of the paths that forward and backward signals have to traverse in the network. The shorter the paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. The comparison therefore looks at the maximum path length between any input and output position in networks composed of the different layer types.
๋ ˆ์ด์–ด ํƒ€์ž…๋งˆ๋‹ค ๊ฐ€์žฅ ํฐ ๊ฒฝ๋กœ์˜ ๊ธธ์ด, ๋ ˆ์ด์–ด ๋‹น ๊ณ„์‚ฐ ๋ณต์žก๋„์™€ ์ตœ์†Œํ•œ์˜ ์—ฐ์†์  ๊ณ„์‚ฐ๋Ÿ‰์„ ๋น„๊ตํ•˜์˜€์Šต๋‹ˆ๋‹ค.
n : ์‹œํ€€์Šค ๊ธธ์ด
d : representation ์ฐจ์›
k : convolution์˜ kernel size
r : ์ œํ•œ๋œ self-attention์—์„œ์˜ ์ด์›ƒ ์ˆ˜
Layer Type                    Complexity per Layer   Sequential Operations   Maximum Path Length
Self-Attention                O(n^2 · d)             O(1)                    O(1)
Recurrent                     O(n · d^2)             O(n)                    O(n)
Convolutional                 O(k · n · d^2)         O(1)                    O(log_k(n))
Self-Attention (restricted)   O(r · n · d)           O(1)                    O(n/r)
์—ฐ์†์  ๊ณ„์‚ฐ๋Ÿ‰์ด O(n)์„ ๋”ฐ๋ฅด๋Š” recurrent layer์™€ ๋‹ฌ๋ฆฌ, self-attention layer๋Š” ์ƒ์ˆ˜๊ฐœ์˜ ์—ฐ์†์ ์œผ๋กœ ์‹คํ–‰๋˜๋Š” ๊ณ„์‚ฐ๋“ค๋กœ ๋ชจ๋“  ์œ„์น˜๋ฅผ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ๊ณ„์‚ฐ๋ณต์žก๋„์— ๋Œ€ํ•ด์„œ๋Š”, ์‹œํ€€์Šค์˜ ๊ธธ์ด n์ด representation์˜ ์ฐจ์›์ธ d๋ณด๋‹ค ์ž‘์„ ๋•Œ recurrent layer๋ณด๋‹ค self-attention layer๊ฐ€ ๋” ๋น ๋ฅธ๋ฐ, word-piece๋‚˜ byte-pair representation ๊ฐ™์€ ๊ธฐ๊ณ„๋ฒˆ์—ญ์—์„œ SOTA ๋ชจ๋ธ๋“ค์ด ์‚ฌ์šฉํ•˜๋Š” ๋ฌธ์žฅ representation์˜ ๊ฒฝ์šฐ ๋Œ€๋ถ€๋ถ„ n์ด d๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค. ๋งค์šฐ ๊ธด ์‹œํ€€์Šค์™€ ๊ด€๋ จ๋œ task์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด, self-attention์—์„œ ๊ฐ๊ฐ์˜ ์ถœ๋ ฅ ์œ„์น˜๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํ•˜๋Š” ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ์ด์›ƒ ์ˆ˜๋ฅผ r๋กœ ์ œํ•œ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ฒฝ๋กœ ๊ธธ์ด์˜ ์ตœ๋Œ“๊ฐ’์€ O(n/r)๋กœ ์ฆ๊ฐ€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
kernel ๋„ˆ๋น„ k๊ฐ€ n๋ณด๋‹ค ์ž‘์€ ํ•˜๋‚˜์˜ convolutional layer๋Š” ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์— ๋Œ€ํ•˜์—ฌ ๋ชจ๋“  ์Œ์„ ์—ฐ๊ฒฐํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋„คํŠธ์›Œํฌ ์•ˆ์—์„œ ๋‘ ์œ„์น˜ ์‚ฌ์ด์˜ ๊ฐ€์žฅ ๊ธด ๊ฒฝ๋กœ์˜ ๊ธธ์ด๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๋ฉด์„œ, ์ธ์ ‘ํ•œ kernel์˜ ๊ฒฝ์šฐ convolutional layer๊ฐ€ ๋งŒํผ ํ•„์š”ํ•˜๊ณ , ํ™•์žฅ๋œ convolution์˜ ๊ฒฝ์šฐ๋Š” ๋งŒํผ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. convolutional layer์˜ ๊ฒฝ์šฐ k ๋•Œ๋ฌธ์— ๋ณดํ†ต recurrent layer๋ณด๋‹ค ๋” ๋†’์€ ๊ฐ’์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ถ„๋ฆฌ๋œ convolutional layer์˜ ๊ฒฝ์šฐ, ๊ณ„์‚ฐ ๋ณต์žก๋„๋ฅผ ๋กœ ์ค„์—ฌ์ค๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, k=n์ผ ๋•Œ๋Š”, ๋ถ„๋ฆฌ๋œ convolutional layer์˜ ๋ณต์žก๋„๊ฐ€ self-attention๊ณผ point-wise fedd-forward layer์˜ ๊ฒฐํ•ฉ์˜ ๋ณต์žก๋„์™€ ๋™์ผํ•ด์ ธ, ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜์˜€์Šต๋‹ˆ๋‹ค.
๋˜ํ•œ ์ถ”๊ฐ€๋กœ, self-attention์€ ๋”์šฑ ํ•ด์„์ ์ธ ๋ชจ๋ธ๋“ค์„ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ๊ฐ์˜ attention ํ—ค๋“œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•™์Šต๋  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ๋ฌธ์žฅ์˜ syntactic(๋ฌธ๋ฒ•์ ), semantic(๊ตฌ๋ฌธ์ ) ๊ตฌ์กฐ์™€ ๊ด€๋ จ๋œ ํŠน์„ฑ์„ ํฌ์ฐฉํ•˜๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
ย 

Training

Dataset: WMT 2014 English-German dataset (4.5 million sentence pairs, byte-pair encoding), WMT 2014 English-French dataset (36M sentences, 32,000 word-piece vocabulary)
Hardware: 8 NVIDIA P100 GPUs; the base model was trained for 100,000 steps (1 step ≈ 0.4 s) and the big model for 300,000 steps (1 step ≈ 1.0 s).
Optimizer: the Adam optimizer, with the learning rate varied as lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)) and warmup_steps = 4000 (see the schedule sketch after this list).
Residual Dropout: dropout is applied to the output of each sub-layer before it is added to the sub-layer input (the residual connection) and normalized, with P_drop = 0.1 for the base model.
Label Smoothing: label smoothing with value 0.1 is used. This hurts perplexity, but improves accuracy and BLEU score.
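A small Python sketch of this warm-up learning-rate schedule (the function name and the sample step values are my own):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The learning rate increases linearly during warm-up, then decays as 1/sqrt(step).
for s in (100, 4000, 40000, 100000):
    print(s, round(transformer_lr(s), 6))
```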

Results

[Table: BLEU scores and training cost on WMT 2014 English-German and English-French, compared with prior models]
On the WMT 2014 English-German translation task (EN-DE) and the WMT 2014 English-French translation task (EN-FR), the Transformer surpasses the previous SOTA results.
[Table: variations on the Transformer architecture, evaluated on English-German translation]
์œ„์—์„œ ์ œ์•ˆํ•œ Transformer์˜ ๊ตฌ์กฐ์—์„œ ๋ช‡ ๊ฐ€์ง€ ์š”์†Œ๋“ค์„ ๋ณ€๊ฒฝํ•œ ํ›„ English-German translation task์— ์ ์šฉํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.
  • (A) head์˜ ๊ฐœ์ˆ˜,ย ๋ฅผ ๋ณ€๊ฒฝ : head๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์œผ๋ฉด ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.
  • (B) ๋งŒ ๋ณ€๊ฒฝ
  • (C) ๋ชจ๋ธ์˜ ํฌ๊ธฐ๋ฅผ ํ‚ค์šด ๊ฒฝ์šฐ : ๋ชจ๋ธ์ด ์ปค์งˆ์ˆ˜๋ก ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋ฉ๋‹ˆ๋‹ค.
  • (D) dropout์˜ ์˜ํ–ฅ : dropout ๋„ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค.
  • (E) positional embedding์˜ ์ค‘์š”์„ฑ : learned positional embedding์„ ์‚ฌ์šฉํ•ด๋„ ์„ฑ๋Šฅ์— ํฐ ๋ณ€ํ™”๋Š” ์—†์Šต๋‹ˆ๋‹ค.
ย 
[Table: the Transformer generalizes to English constituency parsing]
To test whether it can be applied to other tasks, the Transformer was also evaluated on English constituency parsing. While it does not deliver the absolute best performance, it performs surprisingly well.

Conclusion

In this paper, the authors propose the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-head self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. It achieves a new SOTA on both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, and the best model even outperforms all previously reported ensembles.
Looking ahead to future attention-based models, the authors plan to extend the Transformer to problems with input and output modalities other than text, and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video.

Code Practice

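The original post leaves this section empty, so below is a minimal, self-contained NumPy sketch of a single encoder layer that ties the pieces above together: multi-head self-attention, the position-wise feed-forward network, residual connections, and layer normalization. It is an untrained, randomly initialized illustration of the equations in this post, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h, d_ff = 512, 8, 2048
d_k = d_model // h

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(Q, K, V):
    # Scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Randomly initialized parameters of one encoder layer.
W_Q = rng.standard_normal((h, d_model, d_k)) * 0.02
W_K = rng.standard_normal((h, d_model, d_k)) * 0.02
W_V = rng.standard_normal((h, d_model, d_k)) * 0.02
W_O = rng.standard_normal((d_model, d_model)) * 0.02
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def multi_head_self_attention(x):
    # Project to each head, attend in parallel, concatenate, project back.
    heads = [attention(x @ W_Q[i], x @ W_K[i], x @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

def ffn(x):
    # Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x):
    # Sub-layer 1: multi-head self-attention with Add & Norm.
    x = layer_norm(x + multi_head_self_attention(x))
    # Sub-layer 2: position-wise feed-forward network with Add & Norm.
    return layer_norm(x + ffn(x))

x = rng.standard_normal((10, d_model))   # 10 token embeddings (plus positional encoding)
print(encoder_layer(x).shape)            # (10, 512)
```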

Reference

Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf