2-1 Corpora, Tokens, and Types

1. Corpus

Every NLP task starts with text data called a corpus (plural: corpora). A corpus generally contains raw text (in ASCII or UTF-8) and metadata associated with that text.
Metadata is structured data about data, that is, data that describes other data. It is also called attribute information, and it is typically used both to represent data and to find data quickly. Metadata can be any auxiliary information related to the text, such as identifiers, labels, and timestamps. In machine learning, a piece of text together with its metadata is called a sample or a data point, and a corpus, as a collection of samples, is called a dataset.
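To make the terminology concrete, the sketch below models a tiny dataset as a list of samples, each pairing raw text with metadata. The `Sample` class, its field names, and all example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One data point: raw text plus metadata (hypothetical field names)."""
    text: str        # raw text
    identifier: int  # metadata: identifier
    label: str       # metadata: label
    timestamp: str   # metadata: timestamp


# A dataset (corpus) is a collection of samples. All values here are made up.
dataset = [
    Sample("The journey is the reward.", 0, "quote", "2021-01-01T00:00:00"),
    Sample("Don't give up, study with daiv.", 1, "slogan", "2021-01-02T00:00:00"),
]

# Metadata makes it easy to find samples, for example by label.
quotes = [s.text for s in dataset if s.label == "quote"]
print(len(dataset), quotes)
```

The metadata fields are what allow the lookup by label in the last step; the raw text alone would not support it.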

2. Tokens

Splitting the corpus described above into units called tokens is known as tokenization. Tokens are usually defined as meaningful units. When the word is chosen as the unit, the process is called word tokenization. A few short texts illustrate this.

The following is an example using spaCy, a widely used package.
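import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
# (spaCy 3.x no longer accepts the 'en' shortcut)
nlp = spacy.load("en_core_web_sm")
text = "Don't give up, study with daiv."
print([str(token) for token in nlp(text.lower())])
# Output: ['do', "n't", 'give', 'up', ',', 'study', 'with', 'daiv', '.']

Here is an example using another package, NLTK.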
from nltk.tokenize import TweetTokenizer

tweet = "Snow White and the Seven Degrees#makeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))
# Output: ['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']

For simple text that contains no complicated special characters, you can also perform word tokenization by hand, like this.
# Lowercase the text, remove the period, then split the words on spaces.
text = "The journey is the reward."
text_low = text.lower()
text_low = text_low.replace('.', '')
words = text_low.split(' ')
print(words)
# Output: ['the', 'journey', 'is', 'the', 'reward']

3. Types

A type is a unique token that appears in a corpus. The set of all types in a corpus is called its vocabulary or lexicon. You can think of types as the keys of a dictionary: each distinct token counts as one type, however many times it occurs.
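As a rough illustration of the difference between tokens and types, the sketch below counts both for a short text. Splitting on spaces stands in for a real tokenizer here, which is an assumption made for brevity.

```python
from collections import Counter

text = "the journey is the reward"
tokens = text.split(" ")   # every occurrence counts as a token

# The vocabulary is the set of types (unique tokens).
vocabulary = set(tokens)

# A Counter behaves like a dictionary whose keys are the types.
counts = Counter(tokens)

print(len(tokens))         # 5 tokens
print(sorted(vocabulary))  # ['is', 'journey', 'reward', 'the']
print(counts["the"])       # 2
```

The text has five tokens but only four types, because 'the' occurs twice.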
Words are divided into content words and stopwords. Stopwords such as particles, articles, and prepositions appear frequently but carry little meaning on their own. Because they tend to get in the way of analysis, they are usually removed during preprocessing.
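Stopword removal can be sketched with a plain set lookup. The `STOPWORDS` set and the `remove_stopwords` helper below are hypothetical and hand-written for this example. In practice a curated list from a library is the more common choice.

```python
# A tiny, hand-picked stopword list (an assumption for this sketch only).
STOPWORDS = {"the", "is", "a", "an", "of", "with"}


def remove_stopwords(tokens):
    """Return only the tokens that are not in STOPWORDS, preserving order."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = ["the", "journey", "is", "the", "reward"]
content_words = remove_stopwords(tokens)
print(content_words)  # ['journey', 'reward']
```

Using a set keeps each membership check fast even when the stopword list grows.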