2-4 Sentence and Document Classification: TF-IDF

Classifying longer chunks of text, such as sentences or documents, is one of the most fundamental tasks in natural language processing. There are many ways to approach document classification, and the TF-IDF representation is one of them.

TF-IDF (Term Frequency - Inverse Document Frequency)

TF (Term Frequency) measures how often a word appears within a document. DF (Document Frequency) measures how many documents contain that word, and IDF (Inverse Document Frequency) is the inverse of that value (in practice usually on a log scale). TF-IDF (Term Frequency - Inverse Document Frequency) is the product of TF and IDF, so it indicates how frequently a word appears in a given document while appearing rarely in the other documents.

A high TF value might suggest that a word is important, but words such as my, and, and this appear very often in other documents too, so a high TF in one document does not make them important. Multiplying TF by IDF therefore lowers the weight of common words and raises the weight of uncommon words that give a document its distinctive character.

The TF-IDF formula is defined as follows.
  $w_{x,y} = tf_{x,y} \times \log\left(\frac{N}{df_x}\right)$
  • $w_{x,y}$: TF-IDF of word x in document y
  • $tf_{x,y}$: TF of word x in document y
  • $df_x$: number of documents containing word x
  • $N$: total number of documents

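To make the formula concrete, here is a minimal pure-Python sketch that applies tf × log(N / df) to a tiny corpus. The function name `tf_idf`, the toy sentences, whitespace tokenization, and the choice of the natural logarithm are all assumptions made for illustration; they are not part of the original post.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[word] = number of documents that contain the word at least once
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each word in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the bird flew"]  # hypothetical toy corpus
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" occurs in every document, so log(3 / 3) = 0 and its weight vanishes, while "cat", which occurs in only one document, receives the largest weight in the first document.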
Let's build a TF-IDF representation with scikit-learn.
# First, learn the vocabulary
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['I go to my home my home is very large',       # Doc[0]
        'I went out my home I go to the market',       # Doc[1]
        'I bought a yellow lemon I go back to home']   # Doc[2]

tfidf_vectorizer = TfidfVectorizer()   # create the TF-IDF vectorizer
tfidf_vectorizer.fit(text)             # learn the vocabulary and IDF weights
tfidf_vectorizer.vocabulary_           # word -> column index mapping
sorted(tfidf_vectorizer.vocabulary_.items())   # vocabulary sorted by word
(Source: https://chan-lab.tistory.com/24)
# Output
[('back', 0), ('bought', 1), ('go', 2), ('home', 3), ('is', 4),
 ('large', 5), ('lemon', 6), ('market', 7), ('my', 8), ('out', 9),
 ('the', 10), ('to', 11), ('very', 12), ('went', 13), ('yellow', 14)]
(Source: https://chan-lab.tistory.com/24)

The code uses 3 documents and a vocabulary of 15 words. I and a are excluded because the default tokenizer of TfidfVectorizer ignores single-character tokens.
The TF vector of the first document is [0, 0, 1, 2, 1, 1, 0, 0, 2, 0, 0, 1, 1, 0, 0],
the TF vector of the second document is [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
and the TF vector of the third document is [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1].
The DF vector, which is shared by all three documents, is [1, 1, 3, 3, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1].

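The TF and DF vectors listed here can be checked with a short sketch. It assumes scikit-learn's `CountVectorizer`, which tokenizes the same way as `TfidfVectorizer` by default; the variable names are illustrative and do not come from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['I go to my home my home is very large',       # Doc[0]
        'I went out my home I go to the market',       # Doc[1]
        'I bought a yellow lemon I go back to home']   # Doc[2]

count_vectorizer = CountVectorizer()
# Rows are documents, columns are vocabulary words in alphabetical order
tf = count_vectorizer.fit_transform(text).toarray()
# DF: in how many documents does each word occur at least once?
df = (tf > 0).sum(axis=0)

print(tf)
print(df)
```

Each row of `tf` is the TF vector of one document, and `df` counts the rows in which a column is non-zero.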
Combining these TF and DF values gives the TF-IDF result below. Note that scikit-learn's TfidfVectorizer does not apply the plain TF-IDF formula: by default it uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector, so the numbers differ from what the plain formula would give.
# TF-IDF code
tfidf_vectorizer.transform(text).toarray()
# Output
array([[0.        , 0.        , 0.2170186 , 0.4340372 , 0.36744443,
        0.36744443, 0.        , 0.        , 0.55890191, 0.        ,
        0.        , 0.2170186 , 0.36744443, 0.        , 0.        ],
       [0.        , 0.        , 0.24902824, 0.24902824, 0.        ,
        0.        , 0.        , 0.42164146, 0.3206692 , 0.42164146,
        0.42164146, 0.24902824, 0.        , 0.42164146, 0.        ],
       [0.44514923, 0.44514923, 0.26291231, 0.26291231, 0.        ,
        0.        , 0.44514923, 0.        , 0.        , 0.        ,
        0.        , 0.26291231, 0.        , 0.        , 0.44514923]])
(Source: https://chan-lab.tistory.com/24)

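The printed matrix can be reproduced by hand from the TF and DF vectors. The sketch below assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# TF vectors of the three documents and the shared DF vector
tf = np.array([
    [0, 0, 1, 2, 1, 1, 0, 0, 2, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
], dtype=float)
df = np.array([1, 1, 3, 3, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1], dtype=float)
n_docs = tf.shape[0]

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1
raw = tf * idf
# L2-normalize each document (row) so that it has unit length
tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(np.round(tfidf, 8))
```

Words that occur in all three documents (go, home, to) get an IDF of exactly 1 under this smoothing, which is why they keep a small non-zero weight instead of dropping to 0.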
Because the TF-IDF representation picks out the words that distinguish one document from the rest, it makes documents easier to classify: for example, a text with many medical terms can be labeled as medicine-related, and one with many art terms as art-related.

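As a rough illustration of that idea, the sketch below feeds TF-IDF features into a classifier. The training sentences, the labels, and the choice of `MultinomialNB` are assumptions made for this example; the original post does not prescribe a particular classifier or dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: two medical and two art sentences
train_texts = [
    'the doctor prescribed medicine for the patient',
    'the patient recovered after surgery at the hospital',
    'the painter showed a new painting at the gallery',
    'the museum opened an exhibition of sculpture and painting',
]
train_labels = ['medicine', 'medicine', 'art', 'art']

# TF-IDF turns each text into a weighted word vector; the classifier
# then learns which words are typical for each label
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict([
    'the doctor sent the patient to the hospital',
    'a painting and a sculpture at the gallery',
])
print(predictions)
```

A real classifier would need far more training data than four sentences; the point here is only the shape of the pipeline, vectorizer first and classifier second.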
Beyond the TF-IDF representation, documents can also be classified with supervised and unsupervised methods such as support vector machines, fuzzy k-means, and hierarchical Bayesian clustering, and with semi-supervised approaches, which are useful when the labeled dataset is small and can build on classifiers such as naive Bayes or nearest-neighbor. Classifying sentences and documents helps with tasks such as predicting the sentiment of product reviews, filtering spam email, and identifying languages, and it is useful in everyday life as well.