2-2 N-grams

An n-gram is a contiguous sequence of tokens of fixed length n. A unigram consists of one token and a bigram of two; in general, a sequence of n tokens is an n-gram. Packages such as spaCy and NLTK, which we used earlier, make it easy to generate n-grams.
We use n-grams because a single token is sometimes not enough to pin down the exact meaning. The same word can mean different things depending on the surrounding context, so cutting the corpus into chunks of n words and treating each chunk as one token lets us analyze the meaning of a sentence more effectively. However, if n is too large, problems arise: the data becomes sparse, because most long sequences appear rarely or never in the corpus, and the model grows in size. For this reason, it is generally recommended not to use values of n greater than 5.
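To make the sparsity problem concrete, the following minimal sketch counts n-grams over a tiny made-up corpus for n from 1 to 5. The helper name `ngram_counts` and the sample sentence are assumptions chosen for illustration only; a real corpus would be far larger, but the trend is the same.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Tiny made-up corpus, for illustration only
tokens = "the cat sat on the mat and the cat ate the fish".split()

for n in range(1, 6):
    counts = ngram_counts(tokens, n)
    # n-grams seen exactly once give the model almost no evidence to learn from
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")
```

In this toy corpus the bigram "the cat" still repeats, but from n = 3 onward every n-gram occurs exactly once, so counts stop carrying useful information.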
The following code is a simple implementation of unigrams, bigrams, and general n-grams. It uses a short text to aid understanding; when tokenizing a complex corpus, the tools provided by existing packages are more efficient.
import re

def n_gram(text, n):
    # Tokenize: lowercase, replace runs of non-word characters with a space,
    # then split on whitespace (split() with no argument drops empty strings)
    text_low = text.lower()
    text_low = re.sub(r'\W+', ' ', text_low)
    words = text_low.split()

    # Slide a window of n tokens over the list.
    # n == 1 yields unigrams, n == 2 bigrams, n == 3 trigrams, and so on.
    n_words = []
    for i in range(len(words) - n + 1):
        n_words.append(' '.join(words[i:i+n]))
    return n_words

# Example
>>> text = "Yes, I can do it!"
>>> n_gram(text, 1)  # unigrams
['yes', 'i', 'can', 'do', 'it']
>>> n_gram(text, 2)  # bigrams
['yes i', 'i can', 'can do', 'do it']
>>> n_gram(text, 3)  # trigrams
['yes i can', 'i can do', 'can do it']
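For comparison with the hand-written function above, here is a minimal sketch that uses NLTK's `nltk.util.ngrams` helper. It assumes NLTK is installed; the fallback definition is an assumption added only so the sketch still runs without it. Tokens are split with plain `str.split` rather than an NLTK tokenizer, which is also a simplification for illustration.

```python
try:
    from nltk.util import ngrams  # NLTK's n-gram helper
except ImportError:
    # Fallback for this sketch only (not part of NLTK): same tuples, pure Python
    def ngrams(sequence, n):
        sequence = list(sequence)
        return zip(*(sequence[i:] for i in range(n)))

# Already-normalized text, split on whitespace for simplicity
tokens = "yes i can do it".split()

# ngrams() yields tuples of tokens, so wrap it in list() to see all of them
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

# Join each tuple to get the same string form as the n_gram function
bigram_strings = [' '.join(gram) for gram in bigrams]

print(bigrams)
print(trigrams)
print(bigram_strings)
```

Unlike the function above, this helper returns tuples of tokens rather than space-joined strings, which is convenient when the n-grams are used as dictionary keys for counting.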