5-4 Transfer Learning with Pretrained Embeddings

The AG News Dataset

AG ๋‰ด์Šค ๋ฐ์ดํ„ฐ์…‹์€ ๋ฐ์ดํ„ฐ ๋งˆ์ด๋‹๊ณผ ์ •๋ณด ์ถ”์ถœ ๋ฐฉ๋ฒ• ์—ฐ๊ตฌ๋ฅผ ๋ชฉ์ ์œผ๋กœ 2005๋…„์— ์ˆ˜์ง‘ํ•œ ๋‰ด์Šค ๊ธฐ์‚ฌ ๋ชจ์Œ์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ์žฅ์˜ ๋ชฉํ‘œ๋Š” ํ…์ŠคํŠธ ๋ถ„๋ฅ˜์—์„œ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์˜ ํšจ๊ณผ๋ฅผ ์•Œ์•„๋ณด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ธฐ์‚ฌ ์ œ๋ชฉ์— ์ดˆ์ ์„ ๋งž์ถฐ ์ฃผ์–ด์ง„ ์ œ๋ชฉ์œผ๋กœ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋‹ค์ค‘ ๋ถ„๋ฅ˜ ์ž‘์—…์„ ๋งŒ๋“ค ๊ฒƒ์ž…๋‹ˆ๋‹ค.
ย 
Text preprocessing proceeds by lowercasing the text, adding whitespace around commas, periods, exclamation marks, and the like, and removing all other punctuation symbols. The dataset is split into training, validation, and test sets. The code below shows how each row of the dataset yields the string used as model input and how the Vectorizer converts it into a vector; the title vector is then paired with an integer representing the news category.
ย 
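The preprocessing function itself is not shown in this excerpt. A minimal sketch of a preprocess_text() helper matching the description above might look like the following; the exact regular expressions are an assumption. The predict_category() function at the end of this section relies on a helper of this form.

import re

def preprocess_text(text):
    # a sketch of the preprocessing described above; the regexes are assumptions
    text = text.lower()
    # surround . , ! ? with whitespace so they become separate tokens
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # collapse every other non-letter character into a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", " ", text)
    return text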
๋‹ค์Œ์€ NewsDataset.__getitem__() ๋ฉ”์„œ๋“œ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค.
import pandas as pd
from torch.utils.data import Dataset

class NewsDataset(Dataset):
    @classmethod
    def load_dataset_and_make_vectorizer(cls, news_csv):
        # Load the dataset and create a new Vectorizer from scratch
        # news_csv (str): location of the dataset
        news_df = pd.read_csv(news_csv)
        train_news_df = news_df[news_df.split == 'train']
        # returns an instance of NewsDataset
        return cls(news_df, NewsVectorizer.from_dataframe(train_news_df))

    def __getitem__(self, index):
        # The primary entry point method for PyTorch datasets
        # index (int): the index of the data point
        row = self._target_df.iloc[index]

        title_vector = \
            self._vectorizer.vectorize(row.title, self._max_seq_length)

        category_index = \
            self._vectorizer.category_vocab.lookup_token(row.category)

        return {'x_data': title_vector,      # the data point's features
                'y_target': category_index}  # the label
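In use, the dataset and its Vectorizer are created in a single call. A short usage sketch, assuming NewsDataset exposes the Vectorizer through a get_vectorizer() accessor (an assumption; the accessor is not shown in this excerpt):

# hypothetical usage; get_vectorizer() is an assumed accessor
dataset = NewsDataset.load_dataset_and_make_vectorizer(args.news_csv)
vectorizer = dataset.get_vectorizer()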

Vocabulary, Vectorizer, DataLoader

์ด๋ฒˆ ์žฅ์—์„œ๋Š” Vocabulary ํด๋ž˜์Šค๋ฅผ ์ƒ์†ํ•œ SequenceVocabulary๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ด ํด๋ž˜์Šค์—์„œ๋Š” ์‹œํ€€์Šค ๋ฐ์ดํ„ฐ์— ์‚ฌ์šฉํ•˜๋Š” ํŠน์ˆ˜ ํ† ํฐ 4๊ฐœ(UNK ํ† ํฐ, MASK ํ† ํฐ, BEGIN-OF-SEQUENCE ํ† ํฐ, END-OF-SEQUENCE ํ† ํฐ)๊ฐ€ ์žˆ๋Š”๋ฐ, ์ถ”ํ›„์— ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๊ฒ ์ง€๋งŒ ํฌ๊ฒŒ 3๊ฐ€์ง€ ์šฉ๋„๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. UNK ํ† ํฐ์€ ๋ชจ๋ธ์ด ๋“œ๋ฌผ๊ฒŒ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์— ๋Œ€ํ•œ ํ‘œํ˜„์„ ํ•™์Šตํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ํ…Œ์ŠคํŠธ ์‹œ์— ๋ณธ ์  ์—†๋Š” ๋‹จ์–ด๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. MASK ํ† ํฐ์€ Embedding ์ธต์˜ ๋งˆ์Šคํ‚น ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฐ€๋ณ€ ๊ธธ์ด์˜ ์‹œํ€€์Šค๊ฐ€ ์žˆ์„ ์‹œ์— ์†์‹ค ๊ณ„์‚ฐ์„ ๋•์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ 2๊ฐ€์ง€ ํ† ํฐ์€ ์‹œํ€€์Šค ๊ฒฝ๊ณ„์— ๋Œ€ํ•œ ํžŒํŠธ๋ฅผ ์‹ ๊ฒฝ๋ง์— ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
ย 
ํ…์ŠคํŠธ๋ฅผ ๋ฒกํ„ฐ์˜ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ํŒŒ์ดํ”„๋ผ์ธ์˜ ๋‘ ๋ฒˆ์งธ ๋ถ€๋ถ„์€ Vectorizer์ž…๋‹ˆ๋‹ค. ์ด ํด๋ž˜์Šค๋Š” SequenceVocabulary ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ์บก์Аํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด ์˜ˆ์ œ์˜ Vectorizer๋Š” ๋‹จ์–ด ๋นˆ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ํŠน์ • ์ž„๊ณ—๊ฐ’์„ ์ง€์ •ํ•˜์—ฌ Vocabulary์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ „์ฒด ๋‹จ์–ด ์ง‘ํ•ฉ์„ ์ œํ•œํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๋ชฉ์ ์€ ๋นˆ๋„๊ฐ€ ๋‚ฎ์€ ์žก์Œ ๋‹จ์–ด๋ฅผ ์ œ๊ฑฐํ•˜์—ฌ ์‹ ํ˜ธ ํ’ˆ์งˆ์„ ๊ฐœ์„ ํ•˜๊ณ  ๋ชจ๋ธ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ œ์•ฝ์„ ์™„ํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
ย 
์ธ์Šคํ„ด์Šค ์ƒ์„ฑ ํ›„ Vectorizer์˜ vectorizer() ๋ฉ”์„œ๋“œ๋Š” ๋‰ด์Šค ์ œ๋ชฉ ํ•˜๋‚˜๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ฐ€์žฅ ๊ธด ์ œ๋ชฉ๊ณผ ๊ธธ์ด๊ฐ€ ๊ฐ™์€ ๋ฒกํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” 2๊ฐ€์ง€ ์ฃผ์š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฒซ์งธ, ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋ณดํ†ต ๋ฐ์ดํ„ฐ์…‹์ด ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ๊ด€๋ฆฌํ•˜๊ณ  ์ถ”๋ก  ์‹œ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ๋ฒกํ„ฐ ๊ธธ์ด๋กœ ์‚ฌ์šฉํ•˜์ง€๋งŒ, CNN ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์ถ”๋ก  ์‹œ์—๋„ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ๊ฐ™์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋‘˜์งธ, ๋‹จ์–ด ์‹œํ€€์Šค๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” 0์œผ๋กœ ํŒจ๋”ฉ๋œ ์ •์ˆ˜ ๋ฒกํ„ฐ๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด ์ •์ˆ˜ ๋ฒกํ„ฐ๋Š” ์‹œ์ž‘ ๋ถ€๋ถ„์— BEGIN-OF-SEQUENCE ํ† ํฐ์„ ์ถ”๊ฐ€ํ•˜๊ณ  ๋์—๋Š” END-OF-SEQUENCE ํ† ํฐ์„ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ ๋ถ„๋ฅ˜๊ธฐ๋Š” ์‹œํ€€์Šค ๊ฒฝ๊ณ„๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ณ  ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜์˜ ๋‹จ์–ด์— ์ค‘์•™์— ๊ฐ€๊นŒ์šด ๋‹จ์–ด์™€๋Š” ๋‹ค๋ฅด๊ฒŒ ๋ฐ˜์‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ย 
๋‹ค์Œ์€ AG ๋‰ด์Šค ๋ฐ์ดํ„ฐ ์…‹์„ ์œ„ํ•œ Vectorizer ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค.
from collections import Counter
import string

import numpy as np

class NewsVectorizer(object):
    def vectorize(self, title, vector_length=-1):
        # title (str): a string of words separated by spaces
        # vector_length (int): the desired length of the index vector
        indices = [self.title_vocab.begin_seq_index]
        indices.extend(self.title_vocab.lookup_token(token)
                       for token in title.split(" "))
        indices.append(self.title_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices)

        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.title_vocab.mask_index

        # the vectorized title (a NumPy array)
        return out_vector

    @classmethod
    def from_dataframe(cls, news_df, cutoff=25):
        # Instantiate the Vectorizer from the dataset DataFrame
        # news_df (pandas.DataFrame): the target dataset
        # cutoff (int): frequency threshold for including a word in the Vocabulary
        category_vocab = Vocabulary()
        for category in sorted(set(news_df.category)):
            category_vocab.add_token(category)

        word_counts = Counter()
        for title in news_df.title:
            for token in title.split(" "):
                if token not in string.punctuation:
                    word_counts[token] += 1

        title_vocab = SequenceVocabulary()
        for word, word_count in word_counts.items():
            if word_count >= cutoff:
                title_vocab.add_token(word)

        # returns an instance of NewsVectorizer
        return cls(title_vocab, category_vocab)
ย 
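The last piece named in this section's heading is the DataLoader, which groups the vectorized data points into minibatches; no DataLoader code appears in this excerpt. A minimal sketch of a generate_batches() helper that wraps NewsDataset in a PyTorch DataLoader follows (the helper's name and signature are assumptions); the training and evaluation sketches later in this section reuse it.

from torch.utils.data import DataLoader

def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    # a generator that wraps the PyTorch DataLoader and moves each
    # tensor in the batch dictionary onto the target device
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        yield {name: tensor.to(device) for name, tensor in data_dict.items()}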

NewsClassifier ๋ชจ๋ธ

๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ์ดˆ๊ธฐ ์ž„๋ฒ ๋”ฉ ํ–‰๋ ฌ๋กœ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋จผ์ € ๋””์Šคํฌ์—์„œ ์ž„๋ฒ ๋”ฉ์„ ๋กœ๋“œํ•œ ๋‹ค์Œ ์‹ค์ œ ๋ฐ์ดํ„ฐ์— ์žˆ๋Š” ๋‹จ์–ด์— ํ•ด๋‹นํ•˜๋Š” ์ž„๋ฒ ๋”ฉ์˜ ์ผ๋ถ€๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ Embedding ์ธต์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์„ ์„ ํƒํ•œ ์ž„๋ฒ ๋”ฉ์œผ๋กœ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ์™€ ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„๋ฅผ ๋‹ค์Œ ์ฝ”๋“œ์—์„œ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์–ดํœ˜ ์‚ฌ์ „์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์˜ ๋ถ€๋ถ„ ์ง‘ํ•ฉ์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.
import numpy as np
import torch

def load_glove_from_file(glove_filepath):
    # glove_filepath (str): path to the GloVe embedding file
    word_to_index = {}
    embeddings = []
    with open(glove_filepath, "r") as fp:
        for index, line in enumerate(fp):
            line = line.split(" ")  # each line: word num1 num2 ...
            word_to_index[line[0]] = index  # word = line[0]
            embedding_i = np.array([float(val) for val in line[1:]])
            embeddings.append(embedding_i)
    # returns word_to_index (dict) and the embeddings (numpy.ndarray)
    return word_to_index, np.stack(embeddings)

def make_embedding_matrix(glove_filepath, words):
    # Create an embedding matrix for a specific set of words
    # glove_filepath (str): path to the embedding file
    # words (list): a list of words
    word_to_idx, glove_embeddings = load_glove_from_file(glove_filepath)
    embedding_size = glove_embeddings.shape[1]

    final_embeddings = np.zeros((len(words), embedding_size))

    for i, word in enumerate(words):
        if word in word_to_idx:
            final_embeddings[i, :] = glove_embeddings[word_to_idx[word]]
        else:
            # initialize words missing from GloVe with Xavier-uniform vectors
            embedding_i = torch.ones(1, embedding_size)
            torch.nn.init.xavier_uniform_(embedding_i)
            final_embeddings[i, :] = embedding_i

    # final_embeddings (numpy.ndarray): the embedding matrix
    return final_embeddings
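Putting the two functions together, the embedding matrix for the title vocabulary might be built as follows. This assumes the vocabulary exposes its token-to-index mapping as _token_to_idx, consistent with the lookup sketch earlier.

# hypothetical usage; _token_to_idx is an assumed internal mapping
words = list(vectorizer.title_vocab._token_to_idx.keys())
embeddings = make_embedding_matrix(glove_filepath=args.glove_filepath,
                                   words=words)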
ย 
์ž…๋ ฅ ํ† ํฐ ์ธ๋ฑ์Šค๋ฅผ ๋ฒกํ„ฐ ํ‘œํ˜„์œผ๋กœ ๋งคํ•‘ํ•˜๋Š” Embedding์ธต์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ ์ฝ”๋“œ์—์„œ๋Š” Embedding ์ธต์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ์ž„๋ฒ ๋”ฉ์œผ๋กœ ๋ฐ”๊พธ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. forward() ๋ฉ”์„œ๋“œ์—์„œ ์ด ์ž„๋ฒ ๋”ฉ์„ ์‚ฌ์šฉํ•ด ์ธ๋ฑ์Šค๋ฅผ ๋ฒกํ„ฐ๋กœ ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NewsClassifier(nn.Module):
    def __init__(self, embedding_size, num_embeddings, num_channels,
                 hidden_dim, num_classes, dropout_p,
                 pretrained_embeddings=None, padding_idx=0):
        """
        Args:
            embedding_size (int): size of the embedding vectors
            num_embeddings (int): number of embedding vectors
            num_channels (int): number of convolution kernels
            hidden_dim (int): size of the hidden dimension
            num_classes (int): number of classes
            dropout_p (float): the dropout probability
            pretrained_embeddings (numpy.ndarray): pretrained word embeddings,
                default is None
            padding_idx (int): the padding index
        """
        super(NewsClassifier, self).__init__()

        if pretrained_embeddings is None:
            self.emb = nn.Embedding(embedding_dim=embedding_size,
                                    num_embeddings=num_embeddings,
                                    padding_idx=padding_idx)
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.emb = nn.Embedding(embedding_dim=embedding_size,
                                    num_embeddings=num_embeddings,
                                    padding_idx=padding_idx,
                                    _weight=pretrained_embeddings)

        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=embedding_size,
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3),
            nn.ELU()
        )

        self._dropout_p = dropout_p
        self.fc1 = nn.Linear(num_channels, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor;
                x_in.shape is (batch, dataset._max_seq_length)
            apply_softmax (bool): flag for the softmax activation;
                set to False when using the cross-entropy loss
        Returns:
            the resulting tensor; tensor.shape is (batch, num_classes)
        """
        # apply the embeddings and swap the feature and channel dimensions
        x_embedded = self.emb(x_in).permute(0, 2, 1)

        features = self.convnet(x_embedded)

        # average pool to remove the remaining sequence dimension
        remaining_size = features.size(dim=2)
        features = F.avg_pool1d(features, remaining_size).squeeze(dim=2)
        features = F.dropout(features, p=self._dropout_p)

        # the MLP classifier
        intermediate_vector = F.relu(F.dropout(self.fc1(features),
                                               p=self._dropout_p))
        prediction_vector = self.fc2(intermediate_vector)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
ย 

๋ชจ๋ธ ํ›ˆ๋ จ

ํ›ˆ๋ จ ๊ณผ์ •์€ ๋ฐ์ดํ„ฐ์…‹ ์ดˆ๊ธฐํ™”, ๋ชจ๋ธ ์ดˆ๊ธฐํ™”, ์†์‹ค ํ•จ์ˆ˜ ์ดˆ๊ธฐํ™”, ์˜ตํ‹ฐ๋งˆ์ด์ € ์ดˆ๊ธฐํ™”, ํ›ˆ๋ จ ์„ธํŠธ์— ๋Œ€ํ•œ ๋ฐ˜๋ณต, ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ, ๊ฒ€์ฆ ์„ธํŠธ์— ๋Œ€ํ•œ ๋ฐ˜๋ณต๊ณผ ์„ฑ๋Šฅ ์ธก์ •์„ ํ•œ ๋’ค์— ํŠน์ • ํšŸ์ˆ˜ ๋™์•ˆ ์ด ๋ฐ์ดํ„ฐ์…‹์„ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ ์ฝ”๋“œ๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํฌํ•จํ•œ ์˜ˆ์ œ์˜ ํ›ˆ๋ จ ๋งค๊ฐœ๋ณ€์ˆ˜์ž…๋‹ˆ๋‹ค.
from argparse import Namespace

args = Namespace(
    # Data and path information
    news_csv="data/ag_news/news_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch5/document_classification",
    # Model hyperparameters
    glove_filepath='data/glove/glove.6B.100d.txt',
    use_glove=False,
    embedding_size=100,
    hidden_dim=100,
    num_channels=100,
    # Training hyperparameters
    seed=1337,
    learning_rate=0.001,
    dropout_p=0.1,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    # Runtime options
    cuda=True,
    catch_keyboard_interrupt=True,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True
)
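Under these arguments, a condensed sketch of the setup and the inner training loop might look as follows. Here dataset.set_split() is an assumed method for selecting the active split, generate_batches() is the helper sketched earlier, embeddings is the matrix built in the GloVe sketch, and len() on a vocabulary is assumed to return its size; none of these appear verbatim in the excerpt.

import torch.nn as nn
import torch.optim as optim

classifier = NewsClassifier(embedding_size=args.embedding_size,
                            num_embeddings=len(vectorizer.title_vocab),
                            num_channels=args.num_channels,
                            hidden_dim=args.hidden_dim,
                            num_classes=len(vectorizer.category_vocab),
                            dropout_p=args.dropout_p,
                            pretrained_embeddings=embeddings if args.use_glove else None)

loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)

dataset.set_split('train')  # assumed split-selection method
for batch_dict in generate_batches(dataset, batch_size=args.batch_size):
    optimizer.zero_grad()                             # 1. zero the gradients
    y_pred = classifier(x_in=batch_dict['x_data'])    # 2. compute the output
    loss = loss_func(y_pred, batch_dict['y_target'])  # 3. compute the loss
    loss.backward()                                   # 4. backpropagate
    optimizer.step()                                  # 5. update the weights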
ย 

๋ชจ๋ธ ํ‰๊ฐ€์™€ ์˜ˆ์ธก

๋ชจ๋ธ์ด ์ž‘์—…์„ ์ž˜ ์ˆ˜ํ–‰ํ•˜๋Š”์ง€ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‘ ๊ฐ€์ง€๋กœ, ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •๋Ÿ‰์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ฑฐ๋‚˜ ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ๋ฅผ ๊ฐœ๋ณ„์ ์œผ๋กœ ์กฐ์‚ฌํ•˜์—ฌ ์งˆ์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค.

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ

We put the model in evaluation mode with classifier.eval(), which disables dropout; we also skip backpropagation, so no parameters are updated. We then iterate over the test set in the same way as the training and validation sets. The test set should be used exactly once over the entire training process.
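A minimal sketch of that evaluation loop, reusing the generate_batches() helper from earlier; compute_accuracy() is an assumed helper that returns a batch's accuracy, and set_split() is the assumed split-selection method.

classifier.eval()  # evaluation mode: disables dropout

dataset.set_split('test')  # assumed split-selection method
running_acc = 0.0
with torch.no_grad():  # additionally disable gradient tracking
    for batch_index, batch_dict in enumerate(
            generate_batches(dataset, batch_size=args.batch_size)):
        y_pred = classifier(x_in=batch_dict['x_data'])
        acc_t = compute_accuracy(y_pred, batch_dict['y_target'])  # assumed helper
        running_acc += (acc_t - running_acc) / (batch_index + 1)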

Predicting the Category of a New News Title

ํ›ˆ๋ จ์˜ ๋ชฉ์ ์€ ์‹ค์ „์— ๋ฐฐ์น˜ํ•˜์—ฌ ์ฒ˜์Œ ์ ‘ํ•˜๋Š” ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์ถ”๋ก  ํ˜น์€ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•จ์ž…๋‹ˆ๋‹ค. ์ƒˆ๋กœ์šด ๋‰ด์Šค ์ œ๋ชฉ์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋จผ์ € ํ›ˆ๋ จํ•  ๋•Œ ๋ฐ์ดํ„ฐ๋ฅผ ์ „์ฒ˜๋ฆฌํ•œ ๋ฐฉ์‹์œผ๋กœ ํ…์ŠคํŠธ๋ฅผ ์ „์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ „์ฒ˜๋ฆฌ๋œ ๋ฌธ์ž์—ด์€ ํ›ˆ๋ จ์— ์‚ฌ์šฉํ•œ Vectorizer๋ฅผ ์‚ฌ์šฉํ•ด ๋ฒกํ„ฐ๋กœ ๋ฐ”๊พธ๊ณ  ํŒŒ์ดํ† ์น˜ ํ…์„œ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋‹ค์Œ์œผ๋กœ ์ด ํ…์„œ์— ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ธก ๋ฒกํ„ฐ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์•„ ์นดํ…Œ๊ณ ๋ฆฌ ์ด๋ฆ„์„ ์กฐํšŒํ•˜๋Š”๋ฐ, ์ด ๊ณผ์ •์„ ์ฝ”๋“œ๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
import torch

def predict_category(title, classifier, vectorizer, max_length):
    """Predict the category of a news title

    Args:
        title (str): the raw title string
        classifier (NewsClassifier): the trained classifier object
        vectorizer (NewsVectorizer): the corresponding Vectorizer
        max_length (int): the maximum sequence length
    """
    title = preprocess_text(title)
    vectorized_title = \
        torch.tensor(vectorizer.vectorize(title, vector_length=max_length))
    result = classifier(vectorized_title.unsqueeze(0), apply_softmax=True)
    probability_values, indices = result.max(dim=1)
    predicted_category = vectorizer.category_vocab.lookup_index(indices.item())

    return {'category': predicted_category,
            'probability': probability_values.item()}
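A hypothetical call might look like this; the sample title is illustrative, and the max_length choice assumes the dataset tracks _max_seq_length as in __getitem__() above.

# hypothetical usage; the title and max_length choice are illustrative
prediction = predict_category("PGA golf season opens this weekend",
                              classifier, vectorizer,
                              max_length=dataset._max_seq_length + 1)
print(prediction['category'], prediction['probability'])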