2-3 Lemmas and Stems

In natural language processing, lemmatization and stemming are used to reduce the number of distinct words in a corpus.

Lemmatization

A lemma is the base form of a word. Take the verb 'dive': depending on its role in a sentence, it appears as diving, dove, dived, dives, and so on. Likewise, 'be' shows up as am, was, are, is, and other forms. Here dive and be are the base forms, that is, the lemmas, of all their respective variants. Replacing each token with its lemma, and thereby reducing the dimensionality of the vector representation, is called lemmatization.

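import spacy

# The 'en' shortcut was removed in spaCy v3; load the model by its package name
# (install it first with: python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running late")
for token in doc:
    # format() fills each {} placeholder with its arguments, in order.
    print('{} --> {}'.format(token, token.lemma_))
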
Each token obtained by tokenizing the sentence "he was running late" is available as token, and the result of lemmatizing it as token.lemma_. Running the code prints the following.

he --> he
was --> be
running --> run
late --> late

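To see why mapping tokens to lemmas shrinks the vector representation, the following minimal sketch counts distinct tokens before and after lemmatization. The lookup table LEMMA_TABLE and the helper lemmatize are hypothetical names written for this illustration; they are not part of spaCy, which derives its mappings from its own language data.

```python
# Hypothetical lemma table for illustration only; a real lemmatizer
# (such as spaCy's) derives these mappings from its own language data.
LEMMA_TABLE = {
    "diving": "dive", "dove": "dive", "dived": "dive", "dives": "dive",
    "am": "be", "was": "be", "are": "be", "is": "be",
}


def lemmatize(token):
    """Return the table's lemma for a token, or the token itself if unknown."""
    return LEMMA_TABLE.get(token, token)


tokens = ["dive", "diving", "dove", "dived", "dives",
          "be", "am", "was", "are", "is"]

# Each distinct token would need its own dimension in a one-hot or count vector.
vocab_before = set(tokens)
vocab_after = {lemmatize(t) for t in tokens}

print(len(vocab_before), "distinct tokens before lemmatization")
print(len(vocab_after), "distinct tokens after:", sorted(vocab_after))
```

In this toy corpus the ten surface forms collapse into the two lemmas dive and be, so a count-based vector would need two dimensions instead of ten.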
One caveat: lemmatization gives accurate results only when the word's part of speech or morphological information is known. spaCy, used in the code above, extracts lemmas with predefined rules and lookup data that, for English, draw on WordNet, a lexical database in which the base forms of many words are already defined. WordNetLemmatizer, the lemmatization tool provided by NLTK (Natural Language ToolKit), accepts the word's part of speech together with the word itself, so that it can return a lemma that preserves that part of speech.

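The role of part of speech can be sketched with a small lookup keyed on (word, part of speech). The table POS_LEMMAS and the function lemmatize_with_pos below are hypothetical and written only for this illustration. NLTK's WordNetLemmatizer exposes the same idea through the pos argument of its lemmatize method, which treats the word as a noun when no part of speech is given.

```python
# Hypothetical (word, part-of-speech) -> lemma table, for illustration only.
# "v" marks a verb and "n" a noun, mirroring the tags WordNet uses.
POS_LEMMAS = {
    ("dove", "v"): "dive",  # "she dove into the pool"
    ("dove", "n"): "dove",  # "a white dove"
    ("saw", "v"): "see",    # "I saw the map"
    ("saw", "n"): "saw",    # "a rusty saw"
}


def lemmatize_with_pos(word, pos):
    """Look up the lemma for a word given its part of speech.

    Falls back to the word itself when the pair is not in the table.
    """
    return POS_LEMMAS.get((word, pos), word)


for word, pos in [("dove", "v"), ("dove", "n"), ("saw", "v"), ("saw", "n")]:
    print('{} ({}) --> {}'.format(word, pos, lemmatize_with_pos(word, pos)))
```

The same surface form maps to different lemmas depending on the tag, which is why a lemmatizer that lacks part-of-speech information can return the wrong base form.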

Stemming

A stem is the part of a word that does not change across its inflected forms. Stemming is the technique of extracting these stems from the words in a sentence. A stemmer simply chops off word endings using hand-crafted rules, and because no set of rules fits every word, its accuracy is somewhat limited.

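The rule-based idea can be sketched with a few suffix rules. This is not the Porter algorithm, only a deliberately naive illustration; the rule list SUFFIX_RULES, the function naive_stem, and the three-character minimum are assumptions made for the example.

```python
# Hypothetical hand-written suffix rules, checked in order. This is NOT the
# Porter algorithm, only a sketch of the rule-based idea behind stemmers.
SUFFIX_RULES = ["ing", "ed", "es", "s"]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word


for w in ["soundings", "crosses", "notes", "running", "this", "was"]:
    print('{} --> {}'.format(w, naive_stem(w)))
```

The rules handle crosses correctly (cross) but turn notes into not, running into runn, and this into thi, which shows how a fixed rule set misfires on words it was not designed for.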
Porter and Snowball are the best-known stemmers. The code below applies the Porter stemmer to the sentence "This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data, downloaded once beforehand,
# for example with nltk.download('punkt') ('punkt_tab' in newer NLTK releases).

s = PorterStemmer()
text = "This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."

# Split the sentence into word tokens.
words = word_tokenize(text)
print(words)
# ['This', 'was', 'not', 'the', 'map', 'we', 'found', 'in', 'Billy', 'Bones', "'s", 'chest', ',', 'but', 'an', 'accurate', 'copy', ',', 'complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']

# Result of stemming each token.
print([s.stem(w) for w in words])
# ['thi', 'wa', 'not', 'the', 'map', 'we', 'found', 'in', 'billi', 'bone', "'s", 'chest', ',', 'but', 'an', 'accur', 'copi', ',', 'complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
(Source: https://omicro03.medium.com/%EC%9E%90%EC%97%B0%EC%96%B4%EC%B2%98%EB%A6%AC-nlp-5%EC%9D%BC%EC%B0%A8-%EC%96%B4%EA%B0%84-%EC%B6%94%EC%B6%9C-%ED%91%9C%EC%A0%9C%EC%96%B4-%EC%B6%94%EC%B6%9C-4a967d830cc2)

As the stemmed output in the code shows, some tokens come out as valid words, but a great many do not: 'thi', 'wa', 'billi', and 'copi', for example, are not words at all. The output also shows that, unlike lemmatization, stemming does not preserve part of speech (the noun 'exception' becomes 'except'). In exchange for being less accurate than lemmatization, stemming performs the reduction comparatively quickly.
