5-2 Loading and Using Pretrained Embeddings

์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ์ด์šฉํ•˜๊ธฐ

5์žฅ์—์„œ ๋’ค์— ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ํ•™์Šตํ•˜๋Š” CBOW ๋ชจ๋ธ์— ๋Œ€ํ•ด ๊ฐ„๋žตํ•˜๊ฒŒ ๊ณต๋ถ€ํ•  ์˜ˆ์ •์ด์ง€๋งŒ, ์‹ค์งˆ์ ์œผ๋กœ ์šฐ๋ฆฌ๊ฐ€ ์ƒˆ๋กœ์šด ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋งŒ๋“ค ์ผ์€ ๊ฑฐ์˜ ์—†๊ธฐ ๋•Œ๋ฌธ์—, ๊ธฐ์กด์— ์ž˜ ๋งŒ๋“ค์–ด์ง„ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ๊ฐ€์ ธ์™€ ๋ชจ๋ธ์— ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์•Œ์•„๋ณผ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
ย 
Among the many word embeddings freely available on the internet, we will use Stanford's GloVe embeddings.
Downloading the GloVe embeddings
์Šคํƒ ํฌ๋“œ ๋Œ€ํ•™๊ต์˜ ์›นํŽ˜์ด์ง€์—์„œ Download pre-trained word vectors ์ค‘ ๋งˆ์Œ์— ๋“œ๋Š” ๊ฒƒ์„ ์„ ํƒํ•ด ๋‹ค์šด๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ์ €๋Š” ๊ฐ€์žฅ ์šฉ๋Ÿ‰์ด ์ ์€ glove.6B.zip์„ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค.
ย 
Google Colab์—์„œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ํŒŒ์ผ ๋กœ๋“œํ•˜๊ธฐ
๊ตฌ๊ธ€ Colab์—์„œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ๋ถˆ๋Ÿฌ์™€ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ๋‹จ๊ณ„์ž…๋‹ˆ๋‹ค. ๋กœ์ปฌํ™˜๊ฒฝ์—์„œ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ์—๋Š” ํ•„์š”์—†๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค.
ย 
First, upload the downloaded embedding file to Google Drive. I put the whole GloVe folder under Colab Notebooks.
ย 
๋‹ค์Œ, ๊ตฌ๊ธ€ ๋“œ๋ผ์ด๋ธŒ์™€ colab์„ ์—ฐ๊ฒฐํ•ด์ค๋‹ˆ๋‹ค.
from google.colab import drive
drive.mount('/content/drive')
ย 
๊ทธ ๋‹ค์Œ, file_path์— ๋‚ด๊ฐ€ ๋ฐฉ๊ธˆ ๋„ฃ์–ด์ค€ ํŒŒ์ผ์˜ ์ฃผ์†Œ๋ฅผ ์ ์–ด์ค๋‹ˆ๋‹ค.
file_path = '/content/drive/MyDrive/Colab Notebooks/glove/glove.6B.100d.txt'
ํŒŒ์ผ ์ฃผ์†Œ๋Š” ์™ผ์ชฝ ํด๋” ํƒญ์œผ๋กœ ๋“ค์–ด๊ฐ€์„œ ํŒŒ์ผ์„ ์„ ํƒํ•˜๋ฉด ์‰ฝ๊ฒŒ ๊ฒฝ๋กœ ๋ณต์‚ฌ๋ฅผ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค ๐Ÿ™‚
ย 
์ด์ œ ํŒŒ์ผ๋ช… ๋Œ€์‹ , file_path๋ฅผ ์ „๋‹ฌํ•ด์ฃผ๋ฉด ํ•ด๋‹น ํŒŒ์ผ์„ colab์—์„œ๋„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

ย 
Colab์„ ๊ธฐ์ค€์œผ๋กœ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์šฐ์„ , annoy ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•ด์ค๋‹ˆ๋‹ค. annoy ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” Approximate Nearest Neighbors Oh Yeah ์˜ ์ค„์ž„๋ง๋กœ, ๊ณต๊ฐ„ ๋‚ด์—์„œ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฒกํ„ฐ๋ฅผ ์ฐพ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ spotify ์—์„œ ์Œ์•…์„ ์ถ”์ฒœํ•  ๋•Œ ํ™œ์šฉ๋˜์–ด์ง€๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž…๋‹ˆ๋‹ค.
!pip install annoy
ย 
Define the PreTrainedEmbeddings class as below.
import numpy as np
from annoy import AnnoyIndex

class PreTrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}
        self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)

    # Given a word, return its embedding vector
    def get_embedding(self, word):
        return self.word_vectors[self.word_to_index[word]]

    # Given a vector, find the n closest words in the index
    def get_closest_to_vector(self, vector, n=1):
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]

    # Given three words, find and print word4 such that word3 : word4
    # has the same relationship as word1 : word2
    def compute_and_print_analogy(self, word1, word2, word3):
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words
                         if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find a nearest neighbor for the computed vector!")
            return

        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        word_to_index = {}
        word_vectors = []
        with open(embedding_file) as fp:
            for line in fp.readlines():
                line = line.split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])
                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index, word_vectors)
ย 
Now, loading the downloaded GloVe file gives us well-trained word embeddings of our own.
embeddings = PreTrainedEmbeddings.from_embeddings_file(file_path)
ย 
ย 
Let's look at a few fun examples.
Given three words, the compute_and_print_analogy method prints word4, the word that forms a pair word3 : word4 with the same relationship as word1 : word2.
ย 
embeddings.compute_and_print_analogy('man', 'he', 'woman')
>>> output
man : he :: woman : she

embeddings.compute_and_print_analogy('fly', 'plane', 'sail')
>>> output
fly : plane :: sail : ship

embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')
>>> output
cat : kitten :: dog : puppy

embeddings.compute_and_print_analogy('blue', 'color', 'dog')
>>> output
blue : color :: dog : pets

embeddings.compute_and_print_analogy('food', 'delicious', 'cat')
>>> output
food : delicious :: cat : adorable
ย 
์œ„ ์˜ˆ์‹œ์—์„œ man์„ he๋ผ๊ณ  ํ•œ๋‹ค๋ฉด woman์€ she๋ผ๊ณ  ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์ž˜ ํŒŒ์•…ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. blue๊ฐ€ color์— ํฌํ•จ๋œ๋‹ค๋Š” ๊ด€๊ณ„๋ฅผ ์ž˜ ์ดํ•ดํ•˜๊ณ  dog๊ฐ€ pets์— ํฌํ•จ๋œ๋‹ค๊ณ  ์ถœ๋ ฅํ•˜๋Š” ๊ฒƒ๋„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. food๊ฐ€ deliciousํ•˜๋‹ค๊ณ  ํ•  ๋•Œ, cat์€ adorableํ•˜๋‹ค๊ณ  ํ•˜๋ฉฐ, ์ˆ˜์‹ ๊ด€๊ณ„๋„ ์ž˜ ํŒŒ์•…ํ•˜๊ณ  ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ย 
Of course, the program does not actually know the meanings of these words or the relationships between them. During training, words that frequently co-occur are judged more similar, and the embedding places them so that similar meanings sit close together in the vector space. As the model learns from data, it keeps adjusting the distances between word vectors toward positions that best represent the data, and the relationships printed above fall out of those positions.
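The vector arithmetic behind compute_and_print_analogy can be sketched with toy vectors. The 2-D values below are hypothetical, chosen only so that the man→woman offset matches the king→queen offset:

```python
import numpy as np

# Toy 2-D "embeddings" (hypothetical values, not real GloVe vectors)
vecs = {
    'man':   np.array([1.0, 1.0]),
    'woman': np.array([1.0, 3.0]),
    'king':  np.array([5.0, 1.0]),
    'queen': np.array([5.0, 3.0]),
}

# The analogy step: vec4 = vec3 + (vec2 - vec1)
target = vecs['king'] + (vecs['woman'] - vecs['man'])

# Brute-force nearest neighbor by Euclidean distance
# (annoy does the same thing approximately, at scale)
closest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(closest)  # queen
```

With real GloVe vectors the offsets only match approximately, which is why the class searches several neighbors and filters out the three input words.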
ย 
There is one caveat when using these embeddings, however. Because they are trained on internet text and other written documents, the stereotypes and biases of the people who wrote that text can be reflected in the embedding. As the example below shows, watch out for biases such as man-doctor / woman-nurse that arise from bias in the training data.
ย 
embeddings.compute_and_print_analogy('man', 'doctor', 'woman')
>>> output
man : doctor :: woman : nurse
man : doctor :: woman : pregnant
ย 
How to remove this kind of bias is itself one of the interesting research areas that has drawn attention alongside the rise of NLP.