5-3 Learning CBOW Embeddings

This time we'll build a dataset for learning general-purpose word embeddings and train a Word2Vec CBOW (Continuous Bag-of-Words) model on it. CBOW is a lot like a fill-in-the-blank exercise. The overall process goes like this: make a 'context window' inside a sentence, remove the center word of that window, and then train the model to predict the removed word, which makes it a 'multiclass classification task'. How well the model figures out the missing word is what determines a CBOW model's performance.

Preprocessing with the Frankenstein dataset

Preprocessing has three steps: build the text dataset we'll use, create a PyTorch Dataset class to hold it, and finally split the dataset into training, validation, and test sets.

Building the dataset

The dataset for this example is the digital version of Mary Shelley's novel Frankenstein, distributed by Project Gutenberg. We use NLTK's Punkt tokenizer to split the text into individual sentences, convert uppercase letters to lowercase, and remove punctuation completely. We lowercase because token matching is case sensitive: otherwise the same word would be treated as two different tokens depending on whether it starts with a capital or a lowercase letter. After that, we split each string on whitespace to extract the list of tokens.
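The cleanup applied to each sentence can be sketched roughly as follows. This is a minimal sketch that assumes the text has already been split into sentences (for example with Punkt); the function name `preprocess_sentence` and the regular expression are illustrative assumptions, not the post's actual code.

```python
import re

def preprocess_sentence(sentence):
    """Lowercase a sentence, strip punctuation, and split it on whitespace.

    Hypothetical helper: the exact cleanup rules are an assumption.
    """
    lowered = sentence.lower()
    # Replace every character that is not a lowercase ASCII letter or
    # whitespace with a space. Note that this also drops digits and
    # splits contractions such as "don't" into "don" and "t".
    cleaned = re.sub(r"[^a-z\s]", " ", lowered)
    # split() with no argument collapses runs of whitespace.
    return cleaned.split()

tokens = preprocess_sentence("The Monster spoke; the monster listened!")
print(tokens)
```

After lowercasing, "The" and "the" (and "Monster" and "monster") end up as the same token, which is the point of the case conversion.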
[Image: context windows (black boxes) and their center words (red boxes) over a sentence]

The CBOW model represents the dataset as a series of consecutive windows. In the image above, the black boxes are the consecutive windows and the red boxes are the center word of each window! The preprocessing step sweeps across a sentence's token list and groups the words into windows of a specified size. The context window in the example above has a length of 2 on each side (put simply, two words). As this window slides over the text it generates training samples, so that the CBOW model can predict the target word, i.e. the center word, from the left and right context.
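To make the sliding window concrete, here is a minimal sketch that turns a token list into (context, target) pairs. The function name `make_cbow_samples` and the choice to simply shorten the context at the sentence edges are assumptions made for illustration.

```python
def make_cbow_samples(tokens, window_size=2):
    """Return (context, target) pairs for every position in `tokens`.

    Hypothetical helper: the context is up to `window_size` tokens on
    each side of the target, joined into one space-separated string.
    """
    samples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            samples.append((" ".join(context), target))
    return samples

samples = make_cbow_samples(["the", "quick", "brown", "fox", "jumps"])
for context, target in samples:
    print(f"{context!r} -> {target!r}")
```

Near the start and end of a sentence the context is shorter than four words, which is why the vectorizer later has to pad.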

Creating the PyTorch Dataset class

The windows and their target words built this way are loaded into a pandas DataFrame and indexed through the CBOWDataset class.

import pandas as pd
from torch.utils.data import Dataset

class CBOWDataset(Dataset):
    # The constructor that sets _target_df, _vectorizer and
    # _max_seq_length is not shown in this excerpt.

    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        '''Load the dataset and make a Vectorizer

        Args:
            cbow_csv (str): location of the dataset
        Returns:
            an instance of CBOWDataset
        '''
        cbow_df = pd.read_csv(cbow_csv)
        # Build the vocabulary from the training split only
        train_cbow_df = cbow_df[cbow_df.split == 'train']
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    def __getitem__(self, index):
        '''The primary entry point method for PyTorch datasets

        Args:
            index (int): index of the data point
        Returns:
            a dictionary holding the features (x_data) and the label (y_target)
        '''
        row = self._target_df.iloc[index]
        context_vector = self._vectorizer.vectorize(row.context,
                                                    self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)
        return {'x_data': context_vector, 'y_target': target_index}

The __getitem__() method here uses the Vectorizer to turn the window's left and right context into a vector. The target word, i.e. the center word, is converted to an integer using the Vocabulary.

Splitting the dataset into train, validation, and test sets

We don't use the whole dataset for training. Instead we divide it into a part for training, a part for validation, and a part for testing.
The train set is used to update the parameters so that performance improves, and the validation set is used to measure and gauge how well the model has trained so far. Both of these sets are used during training, while the test set is the real exam that comes afterwards, so don't mix up the validation set and the test set~
In this post's example, 70% of the dataset goes to the train set, 15% to the validation set, and the remaining 15% to the test set.
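One simple way to get a 70/15/15 split is to shuffle the samples and label each one by position. This is a sketch under assumptions: the function name `assign_splits`, the seed, and the labels 'val' and 'test' are hypothetical (only the 'train' label appears in the dataset code).

```python
import random

def assign_splits(samples, train_frac=0.70, val_frac=0.15, seed=1337):
    """Shuffle `samples` and tag each with 'train', 'val', or 'test'.

    Hypothetical helper: whatever is left after the train and
    validation portions becomes the test set.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)

    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))

    labeled = []
    for i, sample in enumerate(shuffled):
        if i < n_train:
            split = "train"
        elif i < n_train + n_val:
            split = "val"
        else:
            split = "test"
        labeled.append((sample, split))
    return labeled

labeled = assign_splits(list(range(100)))
counts = {name: sum(1 for _, s in labeled if s == name)
          for name in ("train", "val", "test")}
print(counts)
```

Storing the label next to each sample is what later allows a filter such as `cbow_df.split == 'train'` to pick out the training rows.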

Vocabulary, Vectorizer, DataLoader

Now we need to turn the text into minibatches of vectors, and this is where the Vocabulary, Vectorizer, and DataLoader classes come in. What's a little special about the CBOW model is that the Vectorizer does not build one-hot vectors. Instead it builds and returns a vector of integers that represent the indices of the context words. The code below implements the Vectorizer's vectorize method.

import numpy as np

class CBOWVectorizer(object):
    '''Creates and manages the vocabulary'''
    # The constructor that sets cbow_vocab is not shown in this excerpt.

    def vectorize(self, context, vector_length=-1):
        '''
        Args:
            context (str): a string of words separated by spaces
            vector_length (int): length of the index vector
        '''
        indices = [self.cbow_vocab.lookup_token(token)
                   for token in context.split(' ')]
        if vector_length < 0:
            vector_length = len(indices)

        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        # Fill the remaining positions with the mask (padding) index
        out_vector[len(indices):] = self.cbow_vocab.mask_index

        return out_vector
If the context has fewer tokens than the maximum length, the remaining entries are filled with the vocabulary's mask index, which is 0 here. We say the vector has been 'padded' with 0.
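The padding behaviour is easy to see on a toy example. The sketch below uses plain lists instead of NumPy, and `TinyVocabulary` is a hypothetical stand-in for the real Vocabulary class; registering the mask token first so that it gets index 0 is an assumption that matches the description above.

```python
class TinyVocabulary:
    """Hypothetical minimal vocabulary: maps tokens to integer indices."""

    def __init__(self, mask_token="<MASK>", unk_token="<UNK>"):
        self._token_to_idx = {}
        self.mask_index = self.add_token(mask_token)  # registered first: 0
        self.unk_index = self.add_token(unk_token)    # registered second: 1

    def add_token(self, token):
        if token not in self._token_to_idx:
            self._token_to_idx[token] = len(self._token_to_idx)
        return self._token_to_idx[token]

    def lookup_token(self, token):
        # Unknown tokens fall back to the UNK index.
        return self._token_to_idx.get(token, self.unk_index)


def vectorize(vocab, context, vector_length=-1):
    """List-based version of the vectorize logic shown above."""
    indices = [vocab.lookup_token(token) for token in context.split(" ")]
    if vector_length < 0:
        vector_length = len(indices)
    padding = [vocab.mask_index] * (vector_length - len(indices))
    return indices + padding


vocab = TinyVocabulary()
for word in ["the", "quick", "brown", "fox", "jumps"]:
    vocab.add_token(word)

full = vectorize(vocab, "the quick fox jumps", vector_length=4)
short = vectorize(vocab, "quick brown", vector_length=4)
print(full, short)
```

The two-word context from the edge of a sentence comes out the same length as a full four-word context, with the mask index filling the gap.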

The CBOWClassifier model

import torch.nn as nn
import torch.nn.functional as F

class CBOWClassifier(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
        '''
        Args:
            vocabulary_size (int): size of the vocabulary; determines the
                number of embeddings and the size of the prediction vector
            embedding_size (int): size of the embeddings
            padding_idx (int): default 0; the index the Embedding does not use
        '''
        super(CBOWClassifier, self).__init__()

        self.embedding = nn.Embedding(num_embeddings=vocabulary_size,
                                      embedding_dim=embedding_size,
                                      padding_idx=padding_idx)
        self.fc1 = nn.Linear(in_features=embedding_size,
                             out_features=vocabulary_size)

    def forward(self, x_in, apply_softmax=False):
        '''The forward pass of the classifier

        Args:
            x_in (torch.Tensor): input data tensor;
                x_in.shape is (batch, input_dim)
            apply_softmax (bool): flag for the softmax activation;
                set to False when using the cross-entropy loss
        Returns:
            the resulting tensor; tensor.shape is (batch, output_dim)
        '''
        # Sum the embeddings of the context words into one vector per sample
        x_embedded_sum = self.embedding(x_in).sum(dim=1)
        y_out = self.fc1(x_embedded_sum)

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        return y_out
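To see what the forward pass computes for a single sample, here is a plain-Python sketch of the same arithmetic: look up the context embeddings, sum them, apply a linear layer, and take a softmax. The function `cbow_forward` and all the numbers are made up for illustration. It is not a replacement for the nn.Module above, just a way to check why a zero padding row has no effect on the sum.

```python
import math

def cbow_forward(context_indices, embeddings, weights, bias):
    """Sum the context embeddings, apply a linear layer, then softmax."""
    dim = len(embeddings[0])
    # Sum the embedding rows over all context positions.
    summed = [sum(embeddings[i][d] for i in context_indices)
              for d in range(dim)]
    # Linear layer: one score per vocabulary entry.
    scores = [sum(w * x for w, x in zip(row, summed)) + b
              for row, b in zip(weights, bias)]
    # Softmax, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters: vocabulary of 4 entries, embedding size 2.
# Row 0 is the padding row and is all zeros.
embeddings = [[0.0, 0.0], [0.5, -1.0], [1.0, 2.0], [-0.5, 0.25]]
weights = [[0.1, 0.2], [0.0, -0.3], [0.4, 0.1], [-0.2, 0.5]]
bias = [0.0, 0.1, -0.1, 0.0]

probs_padded = cbow_forward([1, 2, 0, 0], embeddings, weights, bias)
probs_unpadded = cbow_forward([1, 2], embeddings, weights, bias)
print(probs_padded)
```

Because the padding row is all zeros, the padded and unpadded contexts give the same prediction, which is why padding with index 0 is safe here.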