FastText: Enriching Word Vectors with Subword Information


Created: Feb 6, 2022
Tags: NLP
Paper: Enriching Word Vectors with Subword Information. Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

Why I Chose This Paper

๋ณธ ๋…ผ๋ฌธ์€ Word2Vec์—์„œ ์ œ์‹œํ–ˆ๋˜ skip-gram์„ ํ™•์žฅ์‹œ์ผœ, ํ•˜์œ„๋‹จ์–ด๋ฅผ ๋ฌธ์ž n-gram์œผ๋กœ ํ‘œํ˜„ํ•œ ํ›„ ์ด๋ฅผ sumํ•˜์—ฌ ํ˜•ํƒœ์†Œ๋ฅผ ๋ณด์กดํ•˜๋Š” ๋ฐฉ์‹์ธ Fasttext๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ์ด์ „์— ๊ณต๋ถ€ํ–ˆ๋˜ Word2Vec์˜ ๊ฒฝ์šฐ, ๊ฐ ๋‹จ์–ด๋ฅผ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜์—ฌ ๋‹จ์–ด์˜ morphology(ํ˜•ํƒœ์†Œ)๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ์ด์—, Fasttext์—์„œ๋Š” word2vec์˜ ์–ด๋–ป๊ฒŒ ํ™•์žฅ์‹œ์ผœ ๋‚ด๋ถ€ ๊ตฌ์กฐ ์ •๋ณด๋ฅผ ๋‹ด์•„ ๋‚ด๋ ค ํ–ˆ๋Š”์ง€ ์•Œ์•„๋ณด๊ณ ์ž ๋ณธ ๋…ผ๋ฌธ์„ ์„ ํƒํ•˜์˜€์Šต๋‹ˆ๋‹ค.
๐Ÿ‘‰๐Ÿป
Word2Vec์— ๋Œ€ํ•œ ์ƒ์„ธํ•œ ์„ค๋ช…์€ ์ด์ „ ํฌ์ŠคํŠธ
๐Ÿ“œ
Efficient Estimation Of Word Representations In Vector Space (Word2Vec) (1)
๋ฅผ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.

Abstract

Popular existing models assign a distinct vector to each word, ignoring word morphology. To address this, the paper builds on the skip-gram model and represents each word as a combination of character n-gram vectors.
This approach trains quickly even on large corpora and can represent words that never appeared in the training data. Evaluated on word similarity and analogy tasks in nine languages, it achieved state-of-the-art performance.

Introduction

๋งˆ๋””์—†์ด ๋‹จ์–ด๋ฅผ ํ‘œํ˜„ํ•˜๋Š” representation๋Š” ์ „ํ˜•์ ์œผ๋กœ ๋™์‹œ๋ฐœ์ƒํ™•๋ฅ ์„ ์ด์šฉํ•˜์—ฌ, ๋ผ๋ฒจ์ด ์—†๋Š” ํฐ corpora์—์„œ ํŒŒ์ƒ๋ฉ๋‹ˆ๋‹ค. ๋ถ„ํฌ์˜๋ฏธํ•™์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์˜ ํŠน์ง•์„ ๊ณต๋ถ€ํ•ด์˜ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. neural network ์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ๋Š”, ์ˆœ๋ฐฉํ–ฅ ์‹ ๊ฒฝ๋ง์„ ์ด์šฉํ•œ word embedding ๋ฐฉ์‹(์ขŒ์šฐ ๊ฐ๊ฐ 2๊ฐœ์˜ ๋‹จ์–ด๋“ค์— ๊ทผ๊ฑฐํ•จ)์„ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ๋” ์ตœ๊ทผ์—๋Š”, ๋งค์šฐ ํฐ corpora์— ๋Œ€ํ•˜์—ฌ ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ„๋‹จํ•œ log-bilinear ๋ชจ๋ธ์„ ์ œ์‹œํ•˜์˜€์Šต๋‹ˆ๋‹ค.
์œ„์™€ ๊ฐ™์€ ๊ธฐ์ˆ ๋“ค์€ ๋Œ€๋ถ€๋ถ„ vocabulary ๋‚ด ๊ฐ ๋‹จ์–ด๋ฅผ parameter๋ฅผ ๊ณต์œ ํ•˜์ง€ ์•Š๊ณ  ๋ถ„๋ฆฌ๋œ vector๋กœ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, ๋‹จ์–ด๋“ค์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋ฅผ ๋ฌด์‹œํ•˜๋Š”๋ฐ, ์ด๋Š” Turkish๋‚˜ Finnish์™€ ๊ฐ™์ด ํ˜•ํƒœํ•™์ ์œผ๋กœ ํ’๋ถ€ํ•œ ๋‹จ์–ด๋“ค์—๊ฒŒ ๊ต‰์žฅํžˆ ํฐ ํ•œ๊ณ„์ ์ด ๋ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, French๋‚˜ Spanish์—๋Š” ๋™์‚ฌ์— 40๊ฐœ ์ด์ƒ์˜ ๋‹ค๋ฅธ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ณ , Finnish๋Š” 15๊ฐœ์˜ ๋ช…์‚ฌ ํ˜•ํƒœ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์–ธ์–ด๋“ค์€ ํ•™์Šต์— ์“ฐ์ด๋Š” corpus์—๋Š” ๊ฑฐ์˜ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š” ๋‹จ์–ด ํ˜•ํƒœ๋“ค์„ ๊ฐ€์ง€๊ณ  ์žˆ์–ด, ์ข‹์€ representation์„ ํ•™์Šตํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๋งŽ์€ ๋‹จ์–ด์˜ ํ˜•ํƒœ๊ฐ€ ๊ทœ์น™์„ ๋”ฐ๋ฅด๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋ฌธ์ž ์ˆ˜์ค€์˜ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜๋ฉด ๋ฒกํ„ฐ ํ‘œํ˜„์„ ๊ฐœ์„ ์‹œ์ผœ ํ˜•ํƒœํ•™์ ์œผ๋กœ ํ’๋ถ€ํ•œ ๋‹จ์–ด๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•ด์ง‘๋‹ˆ๋‹ค.
ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š”, character n-gram์„ ํ†ตํ•ด representation์„ ํ•™์Šตํ•˜๊ณ , n-gram vector์˜ ํ•ฉ์œผ๋กœ ๋‹จ์–ด๋ฅผ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” skip-gram ๋ชจ๋ธ์˜ ํ™•์žฅ์„ ์†Œ๊ฐœํ•˜๊ณ , ํ•˜์œ„ ๋‹จ์–ด์˜ ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ํ˜•ํƒœ๋ฅผ ๋„๋Š” 9๊ฐœ์˜ ์–ธ์–ด์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๊ณ , ์žฅ์ ์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.
ย 

Model

The paper proposes a model that learns word representations while taking morphology into account. Morphology is modeled through subword units, and each word is represented as the sum of its character n-gram vectors. We first describe the general framework used to train word vectors, then the subword model, and finally how the dictionary of character n-grams is handled.

1. General model (= word2vec)

First, a quick review of the skip-gram model.
Given a vocabulary of size W, where each word is identified by its index w ∈ {1, …, W}, the goal is to learn a vector representation for each word w. Word representations are trained to predict well the words that appear in their context.
Formally, given a large training corpus represented as a sequence of words w_1, …, w_T, the skip-gram objective is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

C_t: the set of indices of the words surrounding w_t (the context words, not the center word)
T: the number of words in the corpus
ย 
One possible way to parameterize the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{e^{s(w_t,\, w_c)}}{\sum_{j=1}^{W} e^{s(w_t,\, j)}}$$

s: a scoring function mapping a (word, context) pair to a score
Numerator: the exponentiated score of the true context word w_c given w_t
Denominator: the sum of exponentiated scores over every word in the vocabulary, given w_t
ย 
However, the softmax formulation implies that, given w_t, we predict only one context word w_c, and evaluating it requires a sum over the entire vocabulary, which is computationally inefficient and a poor fit for this setting. Negative sampling is therefore used instead of the softmax.
The problem of predicting context words can be recast from a multi-label classification problem into a set of independent binary classification tasks: for each candidate word, independently predict whether it is a context word or not. For the word at position t, all of its context words are treated as positive examples, and negative examples are sampled at random from the dictionary. For a context word at position c, the binary logistic loss gives the following negative log-likelihood:

$$\log\!\left(1 + e^{-s(w_t,\, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\!\left(1 + e^{s(w_t,\, n)}\right)$$

N_{t,c}: the set of negative examples sampled from the vocabulary
Left term: given w_t, the loss is minimized by raising the similarity (score) between w_t and the true context word w_c.
Right term: given w_t, the loss is minimized by lowering the similarity between w_t and each sampled negative n; the score enters with a positive sign inside the exponent, so high scores for negatives are penalized.
With the logistic loss function ℓ: x ↦ log(1 + e^{−x}), the objective over the whole corpus can be rewritten as:

$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \right]$$

In the skip-gram baseline, the score is simply the inner product between the input vector of w_t and the output vector of w_c.
ย 

2. Subword model

๊ฐ ๋‹จ์–ด๊ฐ€ ๋ถ„๋ฆฌ๋œ ๋ฒกํ„ฐ ํ‘œํ˜„์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ, skip-gram ๋ชจ๋ธ์€ ๋‹จ์–ด์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋ฅผ ๋ฌด์‹œํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋‚ด๋ถ€ ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค๋ฅธ scoring function s๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
๊ฐ ๋‹จ์–ด w๋Š” ๋ฌธ์ž n-gram์˜ ์ง‘ํ•ฉ์œผ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค. ๋‹จ์–ด ์ฒ˜์Œ๊ณผ ๋์— <,>๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ์ ‘๋‘์‚ฌ์™€ ์ ‘๋ฏธ์‚ฌ๋ฅผ ๋‹ค๋ฅธ ๋ฌธ์ž sequence์™€ ๊ตฌ๋ถ„ํ•˜๊ธฐ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” n-grams ์ง‘ํ•ฉ์— ๋‹จ์–ด w ์ž์‹ ๋„ ์ถ”๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค.
ย 
e.g. where๋ผ๋Š” ๋‹จ์–ด์—์„œ n = 3 ์ผ ๋•Œ, ๋‹จ์–ด n-gram :
<wh, whe, her, ere, re>
special sequence :
<where>
  • ๋‹จ์–ด her์—์„œ ๋‚˜์˜จ <her>๊ณผ where์—์„œ ๋‚˜์˜จ <her>์€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค
ย 
A word is thus represented by the sum of the vector representations of its n-grams, giving the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c$$

z_g: the vector representation of n-gram g
G_w: the set of n-grams appearing in word w
v_c: the vector of the context word c
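Given such a table of n-gram vectors, the scoring function amounts to summing dot products (a toy sketch; `ngram_vecs` is an assumed lookup table for the z_g vectors):

```python
import numpy as np

def subword_score(word_ngrams, ngram_vecs, v_context):
    """s(w, c): sum over the n-grams g of w of the dot product z_g . v_c."""
    return sum(float(ngram_vecs[g] @ v_context) for g in word_ngrams)

vecs = {"<wh": np.array([1.0, 0.0]), "ere>": np.array([0.0, 2.0])}
v_c = np.array([1.0, 1.0])
print(subword_score({"<wh", "ere>"}, vecs, v_c))  # 1.0 + 2.0 = 3.0
```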
ย 
๋‹จ์–ด๋“ค๋ผ๋ฆฌ representation์˜ ๊ณต์œ ๋„ ๊ฐ€๋Šฅํ•ด์ง€๊ณ , ์ด๋กœ์จ ์ƒ์†Œํ•œ ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋„ ๋ฏฟ์„๋งŒํ•œ representation์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, eats, eating๊ณผ ๊ฐ™์ด eat์ด๋ผ๋Š” ์›๋ž˜ ๋‹จ์–ด์—์„œ ํŒŒ์ƒ๋œ ๋‹จ์–ด๋“ค์˜ ํ‘œํ˜„์„ ๊ณต์œ ํ•˜๊ณ  ํ•™์Šต์‹œ์ผฐ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์€ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๋ฅผ ํฌ๊ฒŒ ๋งŒ๋“œ๋Š” ์ผ์ด๊ธด ํ•ฉ๋‹ˆ๋‹ค. ์ €์ž๋Š” ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๊ณ  ๊ณ„์‚ฐ์„ ํšจ์œจ์ ์œผ๋กœ ํ•˜๊ธฐ ์œ„ํ•ด, n-gram๋“ค์„ 1๋ถ€ํ„ฐ K๊นŒ์ง€์˜ ์ •์ˆ˜๋กœ ๋งคํ•‘ํ•˜๋Š” ํ•ด์‹ฑํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. Fowler-Noll-Vo ํ•ด์‹ฑํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์˜€๊ณ , K๋ฅผ ์ดํ•˜๋กœ ์„ค์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ถ๊ทน์ ์œผ๋กœ, ๋‹จ์–ด๋Š” word dictionary์—์„œ ์ž์‹ ์˜ Index์™€ ๊ทธ ๋‹จ์–ด๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” hashed n-gram์˜ ์ง‘ํ•ฉ์œผ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค.

Experiments

Baseline

๋ณธ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์„ word2vec ํŒจํ‚ค์ง€์˜ skip-gram, CBOW(Continuous Bag-Of-Words)๊ณผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.

Optimization

์•ž์—์„œ๋Š” negative log likelihood์— SGD(stochastic gradient descent)๋ฅผ ์ ์šฉ์‹œ์ผœ ์ตœ์ ํ™”๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ๋ฒ ์ด์Šค๋ผ์ธ์ธ skip-gram์—์„œ๋Š”, ์„ ํ˜• ๊ฐ์†Œํ•˜๋Š” step size๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.T๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ training set๊ณผ, data ์ „๋ฐ˜์— ๋Œ€ํ•ด ํ†ต๊ณผํ•˜๋Š” ์ˆ˜๊ฐ€ P์™€ ๋™์ผํ•˜๋‹ค๊ณ  ์ฃผ์–ด์กŒ์„ ๋•Œ, ์‹œ๊ฐ„ t์—์„œ์˜ step size๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
: fixed parameter
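The decay schedule is a one-liner (a sketch of the formula above; variable names are illustrative):

```python
def step_size(t: int, T: int, P: int, gamma0: float) -> float:
    """Linearly decaying SGD step size: gamma0 * (1 - t / (T * P))."""
    return gamma0 * (1.0 - t / (T * P))

print(step_size(0, 1000, 5, 0.025))     # starts at gamma0
print(step_size(5000, 1000, 5, 0.025))  # reaches 0 after the last update
```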
ย 
์ตœ์ ํ™”๋ฅผ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด, Hogwild๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋“  ์“ฐ๋ ˆ๋“œ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ณต์œ ํ•˜๊ณ , ๋น„๋™๊ธฐ์ ์ธ ๋ฐฉ์‹์œผ๋กœ ๋ฒกํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธ ํ•ฉ๋‹ˆ๋‹ค.
ย 

Implementation details

word vector์˜ ์ฐจ์›์€ 300์ž…๋‹ˆ๋‹ค. positive example์— ๋Œ€ํ•ด, uni-gram(n=1)์˜ ๋นˆ๋„์— ๋Œ€ํ•ด ์ œ๊ณฑ๊ทผํ•œ ๊ฐ’๊ณผ ๋น„๋ก€ํ•˜๋Š” ํ™•๋ฅ ๋กœ ๋žœ๋คํ•˜๊ฒŒ 5๊ฐœ์˜ negatives๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜์˜€์Šต๋‹ˆ๋‹ค. context window size์˜ ๊ฒฝ์šฐ c๋กœ ์„ค์ •ํ•˜์˜€๋Š”๋ฐ, c์˜ ๊ฐ’์€ 1๊ณผ 5 ์‚ฌ์ด์—์„œ ๊ท ์ผํ•˜๊ฒŒ ์ƒ˜ํ”Œ๋งํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•œ ๋‹จ์–ด๋“ค์„ ์ผ๋ถ€๋งŒ ์ทจํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” rejection threshold๋ฅผ ๋กœ ์„ค์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค. word dictionary๋ฅผ ๋งŒ๋“ค ๋•Œ, ๋‹จ์–ด๊ฐ€ training set์— ์ ์–ด๋„ 5๋ฒˆ ์ด์ƒ์€ ๋‚˜ํƒ€๋‚˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. step size์˜ ์€ skip-gram์€ 0.025, ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ชจ๋ธ๊ณผ CBOW๋Š” 0.05๋กœ ์„ค์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋Š” word2vec ํŒจํ‚ค์ง€์˜ ๋””ํดํŠธ ๊ฐ’์ด๊ณ  ํ•ด๋‹น ๋ชจ๋ธ์—๋„ ์ž˜ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.
English ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์„ธํŒ…ํ•  ๋•Œ, ๋ณธ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์€ ์•ฝ 1.5๋ฐฐ ๋А๋ฆฌ๊ฒŒ ํ•™์Šต๋ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์€ C++์—์„œ ์‹คํ–‰๋˜๊ณ , ๊ณต๊ณต์œผ๋กœ ์‚ฌ์šฉ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
ย 

Datasets

Wikipedia ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ•™์Šตํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด 9๊ฐœ์˜ ์–ธ์–ด(Arabic, Czech, German, English, Spanish, French, Italian, Romanian, Russian)๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
Matt Mahoney์˜ ์ „์ฒ˜๋ฆฌ ์Šคํฌ๋ฆฝํŠธ๋ฅผ ์ด์šฉํ•˜์—ฌ ์œ„ํ‚คํ”ผ๋””์•„ ๋ฐ์ดํ„ฐ๋ฅผ ์ •๊ทœํ™”ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋Š” ๋ฌด์ž‘์œ„๋กœ ์„ž์—ฌ์žˆ๊ณ , 5๊ฐœ์”ฉ ํŒจ์Šคํ•˜๋ฉด์„œ ํ•™์Šต์„ ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.
ย 

Results

5๊ฐœ์˜ experiments๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์€ sisg(Subword Information Skip Gram)๋กœ ๋‚˜ํƒ€๋ƒˆ์Šต๋‹ˆ๋‹ค.
ย 

1. ์‚ฌ๋žŒ์˜ ์œ ์‚ฌ๋„ ํ‰๊ฐ€์™€ ๋‹จ์–ด ๋ฒกํ„ฐ ์œ ์‚ฌ๋„์˜ correlation ๋น„๊ต

Since cbow and skip-gram (sg) cannot derive word vectors for words that never appear in the training data, a variant denoted sisg- represents such out-of-vocabulary words with a null vector. The full model, sisg, uses subword information and can therefore produce plausible word vectors even for OOV words.
(table omitted: similarity correlations by language and dataset)
English WS353์„ ์ œ์™ธํ•˜๊ณ  ๋ชจ๋“  ๋ฐ์ดํ„ฐ์—์„œ baseline๋ณด๋‹ค sisg๊ฐ€ ์„ฑ๋Šฅ์ด ์ข‹์€ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ์Šต๋‹ˆ๋‹ค. ๋˜ ๋ชจ๋ฅด๋Š” ๋‹จ์–ด๋ฅผ ๋‹จ์–ด ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋‚ธ sisg๊ฐ€ null๋กœ ๋‚˜ํƒ€๋‚ธ sisg-๋ณด๋‹ค ๊ฐ™๊ฑฐ๋‚˜ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค˜ย subword ์ •๋ณด์˜ ์žฅ์ ์„ ์ฆ๋ช…ํ•ด์ค๋‹ˆ๋‹ค.
Arabic, German ๊ทธ๋ฆฌ๊ณ  Russian์ด ๋‹ค๋ฅธ ์–ธ์–ด๋ณด๋‹ค ๋” ํšจ๊ณผ์ ์ธ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ์Šต๋‹ˆ๋‹ค. German์€ 4๊ฐ€์ง€ Russian์€ 6๊ฐ€์ง€ ๋ฌธ๋ฒ•์  ์–ดํ˜•๋ณ€ํ™”๋ฅผ ๋ณด์ด๊ณ  Russian์€ ํ•ฉ์„ฑ์–ด๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์—ย ํ˜•ํƒœ๋ก ์  ์ •๋ณด๊ฐ€ ์ค‘์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์ธ ๊ฒƒ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค.
English์—์„œ Rare Words dataset (RW)๋Š” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ WS353์—์„œ๋Š” ๋‚ฎ๊ฒŒ ๋‚˜ํƒ€๋‚˜๋ƒˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ ์…‹์€ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ์–ด subword ์ •๋ณด๊ฐ€ ์ค‘์š”ํ•˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
ย 

2. Word Analogy

Given the relation A : B = C : D, the goal is to predict D from the other three words. Questions containing words that do not appear in the training data were excluded.
(table omitted: word analogy accuracy by language)
There is a clear improvement on syntactic analogies. By contrast, semantic analogies do not improve. However, as experiment 5 below shows, adjusting the character n-gram lengths can improve semantic performance as well. The model performs especially well for morphologically rich languages such as Czech (CS) and German (DE).
ย 

3. Comparison with Morphological Representations

sisg is compared on word similarity tasks against models that exploit morphological information: RNNs, cbow, the morphological transformations of Soricut and Och, and a log-bilinear language model, all of which are strong morphology-aware baselines.
(table omitted: comparison with morphology-aware baselines)
ํ˜•ํƒœ๋ก ์  ๋ณ€ํ™˜์„ ์‚ฌ์šฉํ•œ Soricut and Och(2015)๋ณด๋‹ค๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. Soricut and Och(2015)์—์„œ๋Š” noun compounding์„ ํ•˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์— ํŠนํžˆ German์—์„œ ํฐ ๊ฐœ์„ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
ย 

4. Effect of the Size of the Training Data

์šฐ๋ฆฌ๋Š” ๋‹จ์–ด๊ฐ„์˜ character-level ์œ ์‚ฌ์„ฑ์„ ์ด์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€ ์•Š๋Š” ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋„ ์ž˜ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ•™์Šต๋ฐ์ดํ„ฐ์˜ ์‚ฌ์ด์ฆˆ์— robust ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.ย OOV์˜ ๋น„์œจ์€ ๋ฐ์ดํ„ฐ์…‹์ด ์ค„์–ด๋“ค์ˆ˜๋ก ์ฆ๊ฐ€ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ sisg-์™€ cbow๋Š” ์„ฑ๋Šฅ์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์„ ๊ฒƒ ์ž…๋‹ˆ๋‹ค.ย ๋‹จ์–ด ์‚ฌ์ด์ฆˆ์— ์˜์กดํ•˜๋Š”์ง€ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด cbow ๋ชจ๋ธ๊ณผ ๋น„๊ตํ–ˆ์Šต๋‹ˆ๋‹ค.
(figure omitted: performance as a function of training data size)
As a result, sisg outperforms the baseline on all datasets and at all training-set sizes. The cbow model keeps improving as more data becomes available, whereas for sisg more data does not always bring better performance.
sisg performs well even on very small training sets. On German GUR350, sisg trained on only 5% of the data reaches a score of 66, higher than cbow trained on the full dataset (62). Likewise, on English RW, sisg trained on 1% of the data reaches 45, versus 43 for cbow on the full dataset. This means that word vectors can be trained on datasets of limited size and still generalize well to words not seen before. Since the amount of relevant task-specific data is usually small in practice, being able to learn from little training data is a major advantage.
ย 

5. Effect of the Size of N-grams

์•ž์„œ ๋ชจ๋ธ์—์„œ ์„ค๋ช…ํ–ˆ๋“ฏ์ด n-gram์˜ ๊ธฐ๋ณธ size์„ 3-6์œผ๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. n size๊ฐ€ ์–ด๋–ค ์˜ํ–ฅ์„ ์ฃผ๋Š”์ง€ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด ์‹คํ—˜์„ ์ง„ํ–‰ํ•œ ๊ฒฐ๊ณผ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
(table omitted: performance for different n-gram length ranges)
For English and German, 3-6 turns out to be a reasonable choice, although the optimal range should be tuned for each task and language. The best results are obtained when n-grams of length 5 and 6 are included, showing that long n-grams are important; on the analogy task in particular, longer n-grams help with semantic analogies.
Because the boundary symbols < and > are part of each word, setting n = 2 yields n-grams in which one character is a proper character and the other is a positional (boundary) one; such short n-grams carry little information, so n should be larger than 2.
ย 

6. Language Modeling

Three setups are compared on a language modeling task: an LSTM without pre-trained word vectors, the same model initialized with pre-trained vectors without subword information (sg), and one initialized with the vectors proposed in the paper (sisg).
(table omitted: language modeling test perplexity)
Initializing with pre-trained word vectors improves test perplexity, and using subword information (sisg) yields lower test perplexity than the plain skip-gram vectors.
ย 
ย 

Qualitative analysis

1. ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€ ์•Š๋Š” ๋‹จ์–ด์— ๋Œ€ํ•œ Nearest Neighbors

์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ํ†ตํ•ด ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€ ์•Š์€ ๋‹จ์–ด์— ๋Œ€ํ•œ Nearest neighbors๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋ฒ ์ด์Šค๋ผ์ธ์ธ skipgram๋ณด๋‹ค ๋ณธ ๋ชจ๋ธ(sisg)์ด ๋” ํ•ฉ๋ฆฌ์ ์œผ๋กœ ์ฃผ๋ณ€ ๋‹จ์–ด๊ฐ€ ๋‚˜ํƒ€๋‚ฌ์Šต๋‹ˆ๋‹ค.
(table omitted: nearest neighbors of rare words)
ย 

2. Character N-grams and Morphemes

๊ฐ ๋‹จ์–ด์˜ ์ค‘์š”ํ•œ n-gram์„ ์ฐพ๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. w์€ ๋‹จ์–ด์˜ n-grams์˜ ํ•ฉ์ด๊ณ  ๊ฐ n-gram g์— ๋Œ€ํ•ด์„œ restricted representation์„ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
notion image
u_w์™€ u_w/g๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ๊ฐ’์˜ ์˜ค๋ฆ„์ฐจ์ˆœ์œผ๋กœ n-gram์„ ์ˆœ์œ„๋ฅผ ์ •ํ•ฉ๋‹ˆ๋‹ค. ranked n-grams๋Š” ๋‹ค์Œ ํ‘œ๋กœ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
(table omitted: top-ranked n-grams per word)
For example, the most important n-grams of Autofahrer (car driver) are Auto (car) and Fahrer (driver), a reasonable result. Likewise, star and fish emerge for starfish, and life and time for lifetime.
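The ranking procedure can be sketched as follows (a toy illustration; `ngram_vecs` is an assumed table mapping each n-gram of one word to its learned vector):

```python
import numpy as np

def rank_ngrams(ngram_vecs: dict) -> list:
    """Rank a word's n-grams by importance: for each n-gram g, build the
    restricted representation (the sum of all n-gram vectors except z_g)
    and sort by ascending cosine with the full representation u_w.
    Removing an important n-gram moves the vector most, giving a low cosine."""
    u_w = sum(ngram_vecs.values())

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    scored = [(cosine(u_w, u_w - z_g), g) for g, z_g in ngram_vecs.items()]
    return [g for _, g in sorted(scored)]
```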
ย 

3. Word Similarity for OOV Words

๋ณธ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์€ OOV์— ๋Œ€ํ•œ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. OOV ๋‹จ์–ด์˜ n-grams ํ‰๊ท ์œผ๋กœ vector representation์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‹จ์–ด ๋ฒกํ„ฐ๊ฐ€ ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ์ž˜ ๋‚˜ํƒ€๋‚ด๋Š”์ง€ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ํ•˜๋‚˜์˜ OOV ๋‹จ์–ด์™€ ํ•™์Šต๋ฐ์ดํ„ฐ ๋‚ด์˜ ๋‹จ์–ด๋ฅผ pair๋กœ ๋‘ ๋‹จ์–ด๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ–ˆ์Šต๋‹ˆ๋‹ค.
(figure omitted: subword-level cosine similarities for OOV word pairs)
ย 
๋‹ค์Œ ๊ทธ๋ฆผ์—์„œ x์ถ•์ด OOV ๋‹จ์–ด์ž…๋‹ˆ๋‹ค. ๋นจ๊ฐ„์ƒ‰์€ ์–‘์˜ ์ฝ”์‚ฌ์ธ, ํŒŒ๋ž€์ƒ‰์€ ์Œ์˜ ์ฝ”์‚ฌ์ธ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
๋‹จ์–ด rarity์™€ scarceness์—์„œ -ness์™€ -ity๊ฐ€ ๋†’์€ ์œ ์‚ฌ๋„๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ๋‹จ์–ด preadolescent๋Š” -adolesc-๋ผ๋Š” subword ๋•๋ถ„์— ๋‹จ์–ด young๊ณผ ์ž˜ ๋งค์น˜๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ OOV ๋‹จ์–ด๋„ ์˜๋ฏธ๋ฅผ ์ž˜ ๋‚˜ํƒ€๋‚ด๋Š” ๋‹จ์–ด๋ฒกํ„ฐ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
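Under the same assumption as above (a trained `ngram_vecs` table of n-gram vectors), building an OOV word's vector by averaging its n-gram vectors can be sketched as:

```python
import numpy as np

def oov_vector(word: str, ngram_vecs: dict, nmin: int = 3, nmax: int = 6):
    """Vector for an out-of-vocabulary word: the average of the vectors of
    its character n-grams that are present in the trained table."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n]
             for n in range(nmin, nmax + 1)
             for i in range(len(wrapped) - n + 1)]
    known = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    if not known:  # no known subword: fall back to a zero vector
        dim = len(next(iter(ngram_vecs.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)
```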
ย 

Conclusion

FastText combines character n-grams with the skip-gram model, representing word vectors through subword information. The model is fast to train and requires no preprocessing or supervision, and it produces good vectors even for words not seen during training. In summary, it outperforms the baselines on a variety of tasks while incorporating morphological information.
ย 

References

Enriching Word Vectors with Subword Information (Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov) https://arxiv.org/abs/1607.04606
Efficient Estimation of Word Representations in Vector Space (Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean) https://arxiv.org/abs/1301.3781
Distributed Representations of Words and Phrases and their Compositionality (Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean) https://arxiv.org/abs/1310.4546
[NLP][Paper Review] FastText: Enriching Word Vectors with Subword Information https://supkoon.tistory.com/15

์ด์ „ ๊ธ€ ์ฝ๊ธฐ