Efficient Estimation Of Word Representations In Vector Space (Word2Vec) (1)

Created: Feb 2, 2022
Tags: NLP
cleanUrl: "/paper/word2vec"
Paper: Efficient Estimation Of Word Representations In Vector Space (Word2Vec) · Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Why We Chose This Paper

๋ณธ ๋…ผ๋ฌธ์€ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ Word2Vec ์ด๋ผ๋Š” ๊ธฐ๋ฒ•์œผ๋กœ ์•Œ๋ ค์ง„ ๋ชจ๋ธ์„ ์ œ์‹œํ•œ ์ €๋ช…ํ•œ ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค. ๊ธฐ๋ฒ•์˜ ์ด๋ฆ„์—์„œ ์œ ์ถ”ํ•  ์ˆ˜ ์žˆ๋“ฏ์ด, ๊ฐ ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ๋ฒกํ„ฐ์— ๋Œ€์‘ํ•˜์—ฌ ์ €์žฅํ•  ์ˆ˜ ์žˆ๊ณ  ๊ฐ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๋ง์…ˆ๊ณผ ๋บ„์…ˆ์œผ๋กœ ์˜๋ฏธ์— ๋Œ€ํ•œ ์—ฐ์‚ฐ์ฒ˜๋ฆฌ๋ฅผ ํ†ตํ•ด ์›ํ•˜๋Š” ์˜๋ฏธ์˜ ๋‹จ์–ด๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด ํฅ๋ฏธ๋กœ์› ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๊ธฐ์กด์˜ neural network ๋ชจ๋ธ๋“ค๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์–ด๋ ค์› ๋˜ โ€œ๋‹จ์–ด์˜ ์˜๋ฏธโ€๋ฅผ ๊ธฐ๊ณ„์—๊ฒŒ ํ•™์Šต์‹œํ‚ค๊ณ , ๊ธฐ๊ณ„๊ฐ€ ์˜๋ฏธ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ ๋‹ค๋Š” ์ ์—์„œ ํฐ ์˜์˜๊ฐ€ ์žˆ๋‹ค๊ณ  ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด์—, Word2Vec ๊ธฐ๋ฒ•๊ณผ ๊ธฐ๊ณ„์˜ ํ›ˆ๋ จ์„ ์œ„ํ•ด ์ œ์‹œ๋œ ๋‘ ๊ฐ€์ง€ ๋ชจ๋ธ์ธ CBOW, Skip-gram์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ณ ์ž ๋ณธ ๋…ผ๋ฌธ์„ ์„ ํƒํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Introduction

2013๋…„ ์ด์ „์˜ NLP system๊ณผ ๊ธฐ์ˆ ๋“ค์—์„œ๋Š” ํ•™์Šต๋œ ๋‹จ์–ด๋“ค ๊ฐ„์˜ ์—ฐ๊ด€์„ฑ ์—†์ด ๊ฐ๊ฐ์˜ ๋ฐ์ดํ„ฐ๋กœ ์กด์žฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋‹จ์ˆœํ•œ ๋ชจ๋ธ๋“ค์€ ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•˜์—ฌ ์œ ์˜๋ฏธํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์œผ๋‚˜, ์‹ค์ œ ํ•™์Šต์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ์ œํ•œ๋˜์–ด์žˆ์–ด ๊ธฐ์ˆ ์ ์ธ ๋ฐœ์ „์„ ํ•˜๊ธฐ๋Š” ์–ด๋ ค์› ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์˜ ๋ฐœ์ „์„ ํ†ตํ•ด ๋” ๋ณต์žกํ•œ ๋ชจ๋ธ๋“ค์˜ ์ •ํ™•๋„๋ฅผ ์˜ฌ๋ฆด ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
ย 
๋ณธ ๋…ผ๋ฌธ์˜ ๋ชฉํ‘œ๋Š” ์–‘์งˆ์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํฐ data set ๋˜๋Š” vocabulary๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๋ฐฉ์‹์„ ํ†ตํ•ด ์œ ์‚ฌํ•œ ์˜๋ฏธ์˜ ๋‹จ์–ด๊ฐ€ ๊ทผ์ฒ˜์— ์œ„์น˜ํ•  ๋ฟ ์•„๋‹ˆ๋ผ, multiple degrees of similarity(syntactic, semantic, phonetic ๋“ฑ์˜ ๋ถ„์•ผ์˜ feature๋ฅผ ๊ณต์œ )๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“ค์–ด์ค๋‹ˆ๋‹ค. vector(โ€œ์„œ์šธโ€) - vector(โ€œ์ˆ˜๋„โ€) + vector(โ€œ์ผ๋ณธโ€) ์˜ ๊ฒฐ๊ณผ๋กœ ์–ป์€ ๋ฒกํ„ฐ์™€ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋‹จ์–ด ๋ฒกํ„ฐ๊ฐ€ โ€œ๋„์ฟ„โ€๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋Š” ๋ฐฉ์‹์œผ๋กœ, ๋‹จ์–ด ๋ฒกํ„ฐ์˜ ์—ฐ์‚ฐ์ด ๋‹จ์–ด ์˜๋ฏธ์˜ ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ€๋Šฅํ•˜๋„๋ก ์ƒˆ๋กœ์šด ๋ชจ๋ธ ์•„ํ‚คํ…์ณ๋ฅผ ๊ตฌ์„ฑํ•˜์—ฌ syntactic, semantic ์˜์—ญ์—์„œ ๋†’์€ ์ •ํ™•๋„๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
ย 

๊ธฐ์กด์˜ ๋ชจ๋ธ๋“ค

1. N-gram Language Model

์ผ๋ จ์˜ ๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ํ•ด๋‹น ๋‹จ์–ด๋“ค ๋’ค์— ๋‚˜์˜ฌ ๋‹จ์–ด๋ฅผ ํ†ต๊ณ„์ ์œผ๋กœ ์ถ”์ธกํ•˜์—ฌ ์ถœ๋ ฅํ•˜๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์•ž์˜ ๋‹จ์–ด ์ค‘ ์ตœ๊ทผ N๊ฐœ์˜ ๋‹จ์–ด๋งŒ์„ ์‚ฌ์šฉํ•˜๋ฉฐ, ์‚ฌ์šฉํ•˜๋Š” ๋‹จ์–ด์˜ ๊ฐœ์ˆ˜์— ๋”ฐ๋ผ unigram, bigram, trigram, 4-gram ๋“ฑ์œผ๋กœ ์ด๋ฆ„์ด ๋ถ™์Šต๋‹ˆ๋‹ค. ํ•™์Šต ์ฝ”ํผ์Šค๋ฅผ ํ†ตํ•ด ๋‹จ์–ด๋“ค ๋’ค์— ๊ฐ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜์—ฌ ํ•™์Šตํ•˜๊ณ , ์ฃผ์–ด์ง„ N๊ฐœ์˜ ๋‹จ์–ด์— ๋Œ€ํ•˜์—ฌ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ๋กœ ๋’ค์— ๋“ฑ์žฅํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๊ฐ€์žฅ ๋†’์€ ๋‹จ์–ด๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๊ฒฐ๊ณผ๋กœ ์ถœ๋ ฅํ•˜๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
Pros
  • Training and inference are simple, and the model is easy to apply in large systems.
Cons
  • Because it operates purely from past examples, it cannot handle unseen word combinations (the sparsity problem).
  • Because it knows nothing about similarity between words and only consults a few nearby words, it struggles to capture the overall structure and context of a sentence.

2. NNLM

The Neural Network Language Model (also called the feedforward neural language model) is an improved model that teaches the machine similarity between words through word embeddings, enabling more accurate predictions for word sequences that never occurred during training. Like the N-gram model, it predicts a word from N preceding words. The embedding vectors mapped to the given words are concatenated and passed to the hidden layer; the hidden layer multiplies them by a weight matrix and sends the result to the output layer, where it is multiplied by another weight matrix, and the word corresponding to the highest score is produced as output. The difference between the corpus label and the obtained result is backpropagated, adjusting the weight matrices and embedding vectors it passed through.
Pros
  • NNLM can cope with word combinations that never appeared in the past (it resolves the sparsity problem).
Cons
  • It considers only the N nearby words, discarding the earlier part of the sentence.
  • It requires far more computation than an N-gram model and is therefore slow.
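The forward pass just described can be sketched in a few lines of numpy (our illustration with made-up sizes; bias terms are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, N = 7, 5, 16, 3      # vocab size, embedding dim, hidden units, context length

C  = rng.normal(size=(V, D))      # embedding table: one D-dim vector per word
Wh = rng.normal(size=(N * D, H))  # projection -> hidden weights
Wo = rng.normal(size=(H, V))      # hidden -> output weights

def nnlm_forward(context_ids):
    x = C[context_ids].reshape(-1)   # concatenate the N embedding vectors
    h = np.tanh(x @ Wh)              # hidden layer
    logits = h @ Wo                  # one score per vocabulary word
    p = np.exp(logits - logits.max())
    return p / p.sum()               # softmax: probability of each next word

probs = nnlm_forward([0, 1, 2])      # e.g. ids of "the", "fat", "cat"
print(int(probs.argmax()))           # id of the predicted next word
```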

3. RNNLM

The Recurrent Neural Network Language Model is built from an NNLM with the projection layer removed, with the hidden layer's output fed back in as an input. The recurrent part works as a kind of short-term memory that keeps track of the previous words, so unlike the NNLM no window size (N words) has to be fixed in advance: the model can draw on all preceding words when producing its output.
Pros
  • RNNLM needs less computation than NNLM, so it is comparatively fast.
Cons
  • It is still too slow for training on very large amounts of data.
Vector-based model architectures had appeared several times before, but improving their quality and their slow training required better techniques and architectures. The solution this paper proposes is the word2vec technique, and it presents two architecture models to realize it: the CBOW model and the Skip-gram model.

Model Architectures

Distributed representation

Sparse Representation
e.g. Among 10,000 words, suppose "puppy" is the 4th word (counting from 0):
puppy = [0 0 0 0 1 0 0 ... (omitted) ... 0]
  • Only the entry at the word's index is 1 and all other entries are 0, so the vector (or matrix) consists mostly of zeros.
  • Such a vector is called a one-hot vector or 1-of-V vector (V : the total number of words in the vocabulary).
  • This representation cannot express any similarity between word vectors.
Distributed Representation
e.g. puppy = [0.2 0.3 0.5 0.7 0.2 ... (omitted) ... 0.2]
  • The meaning of a word is distributed across the dimensions of a continuous vector space.
  • Encoding semantic similarity between words as vectors via distributed representations is called word embedding, and this paper uses distributed representations to train its neural nets.
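A small sketch (with made-up values) contrasts the two representations: one-hot vectors of different words are always orthogonal, while distributed vectors can express graded similarity:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Sparse: any two distinct one-hot vectors have similarity exactly 0.
puppy_oh = np.zeros(10000); puppy_oh[4] = 1.0
dog_oh   = np.zeros(10000); dog_oh[42] = 1.0
print(cosine(puppy_oh, dog_oh))   # 0.0 -- no notion of similarity

# Distributed: meaning spread over the dimensions (values are invented).
puppy = np.array([0.2, 0.3, 0.5, 0.7, 0.2])
dog   = np.array([0.3, 0.3, 0.4, 0.6, 0.1])
cat   = np.array([-0.6, 0.2, 0.1, -0.3, 0.8])
print(cosine(puppy, dog))         # close to 1 -- similar meaning
print(cosine(puppy, cat))         # much lower
```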

Computational Complexity

Before comparing the architectures, we need to define computational complexity, which the paper measures as the number of parameters that must be accessed to fully train the model. The paper defines it in terms of the following quantities:
E : number of training epochs (typically 3 to 50)
T : number of words in the training set (up to about 1,000,000,000)
Q : a factor defined separately for each model architecture below
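Combining these, the paper defines the overall training complexity as proportional to:

$$O = E \times T \times Q$$

Each architecture below is then compared by its own value of Q.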

Continuous Bag Of Words

์ฃผ๋ณ€(๋งฅ๋ฝ)์˜ ๋‹จ์–ด๋“ค๋กœ ์ค‘๊ฐ„(์ค‘์‹ฌ)์˜ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ์ค‘์‹ฌ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ์–ด๋А์ •๋„๊นŒ์ง€ ์ด์šฉํ•  ๊ฒƒ์ธ์ง€(window size) ๊ฒฐ์ •ํ•˜์—ฌ ๋ชจ๋ธ์˜ input์œผ๋กœ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค. window size๋ฅผ n์ด๋ผ ํ•˜๋ฉด, ์‹ค์ œ๋กœ ์˜ˆ์ธก์— ์“ฐ์ด๋Š” ์ฃผ๋ณ€ ๋‹จ์–ด์˜ ๊ฐœ์ˆ˜๋Š” 2n๊ฐœ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
This architecture is similar to the feedforward NNLM, but the hidden layer is removed and the projection layer is shared by all words. Because it uses the average of all the context word vectors (a bag of words), word order has no effect on the projection. Finally, since the model uses a continuous distributed representation of the context, it is called CBOW (Continuous Bag-of-Words).
  • input : the 1-of-V vectors of the 2n surrounding (context) words used for prediction
  • output label : the 1-of-V vector of the center word to be predicted
  • training complexity : Q = N × D + D × log₂(V)
    • N : number of context words fed to the model
      D : dimensionality of the word vectors
      V : total number of words in the vocabulary

e.g. The fat cat sat on the mat. → predict 'sat' from ['The', 'fat', 'cat', 'on', 'the', 'mat']
'sat' : the center word
['The', 'fat', 'cat', 'on', 'the', 'mat'] : the surrounding (context) words

First, the window is slid across the sentence, changing the center word and its context words at each step, to build the dataset used for training.
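A short sketch of this dataset construction (our illustration; `build_cbow_pairs` is a hypothetical helper):

```python
# Slide a window over the sentence, emitting (context words, center word)
# pairs; n is the window size on each side.
def build_cbow_pairs(tokens, n=2):
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        pairs.append((context, center))
    return pairs

tokens = ["The", "fat", "cat", "sat", "on", "the", "mat"]
for context, center in build_cbow_pairs(tokens):
    print(context, "->", center)
# e.g. ['fat', 'cat', 'on', 'the'] -> sat
```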
Model์— ์ฃผ๋ณ€ vector๊ฐ€ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ€๋ฉด์„œ ๊ณผ ๊ณฑํ•ด์ง„ 4๊ฐœ์˜ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์— ๋Œ€ํ•˜์—ฌ ํ‰๊ท  ๊ฐ’์„ ๊ณ„์‚ฐํ•˜์—ฌ projection layer๋กœ ์ „๋‹ฌ๋˜๊ณ ,ํ•ด๋‹น ๋ฒกํ„ฐ๊ฐ€ ๋‹ค์‹œ ์™€ ๊ณฑํ•ด์ง€๋ฉด์„œ output layer๋กœ ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฒกํ„ฐ์— softmax๋ฅผ ์ ์šฉํ•œ ๊ฒฐ๊ณผ ๋ฒกํ„ฐ์™€ target label ์‚ฌ์ด์˜ cross-entropy ๊ฐ’์„ loss function์œผ๋กœ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.
์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ํƒ€๊นƒ ๋‹จ์–ด์˜ ์›-ํ•ซ ๋ฒกํ„ฐ๋Š” (0, 0, 0, 1, 0, 0, 0)์ž…๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์— softmax๋ฅผ ์ทจํ•ด์ค€ ๋ฒกํ„ฐ๊ฐ€ (0, 0, 0, 1, 0, 0, 0)์— ๊ฐ€๊นŒ์›Œ์ ธ cross-entropy๊ฐ’์ด 0์— ๊ฐ€๊นŒ์›Œ์งˆ ์ˆ˜ ์žˆ๋„๋ก ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ W, Wโ€™๊ฐ€ ๊ฐฑ์‹ ๋ฉ๋‹ˆ๋‹ค.
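The forward pass and loss just described can be sketched with numpy (illustrative dimensions; `W2` plays the role of W′):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 7, 5                      # vocabulary size, embedding dimension
W  = rng.normal(size=(V, D))     # input -> projection weights (W)
W2 = rng.normal(size=(D, V))     # projection -> output weights (W')

def cbow_loss(context_ids, center_id):
    v = W[context_ids].mean(axis=0)   # average the 2n context embeddings
    logits = v @ W2                   # one score per vocabulary word
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax
    return -np.log(p[center_id]), p   # cross-entropy vs. the one-hot target

loss, p = cbow_loss([0, 1, 4, 5], 3)  # context ids -> predict id 3 ("sat")
print(loss)   # gradients of this loss are what update W and W2
```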

Continuous Skip-gram Model

The Continuous Skip-gram model is CBOW with input and output swapped: where CBOW predicts the center (current) word from the surrounding words (before and after the current word), Skip-gram predicts the surrounding words from the center word.
  • input : the 1-of-V vector of the center word used for prediction
  • output label : the 1-of-V vectors of the 2n surrounding words to be predicted
  • training complexity : Q = C × (D + D × log₂(V))
    • C : maximum distance between words
      D : dimensionality of the word vectors
      V : total number of words in the vocabulary
For example, in the sentence "The fat cat sat on the mat", feeding the center word "sat" into the log-linear classifier produces "fat", "cat", "on", and "the" as outputs. Since the surrounding words are predicted from the center word, there is no step in the projection layer that averages vectors.
ํ•™์Šต๊ณผ์ •์—์„œ ๋ฒ”์œ„๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๋ฉด word vector์˜ ์„ฑ๋Šฅ์ด ์ข‹์•„์ง€์ง€๋งŒ, ๊ณ„์‚ฐ ๋น„์šฉ์ด ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ค‘์‹ฌ ๋‹จ์–ด์™€ ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ๋‹จ์–ด๋Š” ํ•ด๋‹น ๋‹จ์–ด์™€ ๊ด€๋ จ์ด ์žˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋‚ฎ๊ธฐ ๋•Œ๋ฌธ์—, ํ•™์Šต ๋‹จ์–ด์—์„œ ์ ์€ ์ƒ˜ํ”Œ๋ง ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ๋‚ฎ์€ weight๋ฅผ ์ค๋‹ˆ๋‹ค.
A window size R is drawn at random from the range <1, C>, and the model predicts the R words before and the R words after the center word, predicting R + R = 2R words in total.
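A sketch of this sampling scheme (our illustration; the helper name and the choice of C are hypothetical):

```python
import random

# For each center word, draw R uniformly from <1, C> and pair the center
# with the R words before and R words after it (2R outputs in total).
# Because R is re-sampled each time, distant words are used less often,
# which is how they receive less weight.
def skipgram_pairs(tokens, C=5, seed=0):
    random.seed(seed)
    pairs = []
    for i, center in enumerate(tokens):
        R = random.randint(1, C)
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["The", "fat", "cat", "sat", "on", "the", "mat"]
print(skipgram_pairs(tokens)[:6])
```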
Comparing Skip-gram with the existing methods NNLM and RNNLM, it can learn word vectors of outstanding quality despite its simple structure.

In this post we covered the Introduction and the model architectures. Word2Vec overcomes the limits of earlier approaches that trained on words as unrelated, individual data points; with the goal of learning word vectors from large data sets, it proposes two training schemes, CBOW (Continuous Bag-of-Words) and Skip-gram.
The next post will cover the results and significance of using CBOW and Skip-gram, along with a code implementation.

๋‹ค์Œ ๊ธ€ ์ฝ๊ธฐ