6-2 RNN Practice: Surname Nationality Classification (1)

์„ฑ์”จ ๊ตญ์  ๋ถ„๋ฅ˜ ๋ฐ์ดํ„ฐ์…‹ ๋‹ค์šด๋กœ๋“œ

ย 
ย 
์ด๋ฒˆ ์žฅ์—์„œ๋Š” RNN์„ ์ด์šฉํ•œ ์„ฑ์”จ ๊ตญ์  ๋ถ„๋ฅ˜ ์˜ˆ์ œ๋ฅผ ์ง„ํ–‰ํ•˜๋ฉฐ RNN์˜ ๊ธฐ๋ณธ ์„ฑ์งˆ๊ณผ ์•ž์„œ ๊ณต๋ถ€ํ•œ ์—˜๋งŒ RNN ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›Œ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
ย 

ํŒŒ์ดํ† ์น˜์˜ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํŒŒ์ดํ”„๋ผ์ธ

ํŒŒ์ดํ† ์น˜์—์„œ๋Š” NLP ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด ์ฃผ์–ด์ง„ ํ…์ŠคํŠธ๋ฅผ ์ผ๋ จ์˜ ๊ณผ์ •์„ ๊ฑฐ์ณ ๋ฏธ๋‹ˆ๋ฐฐ์น˜(mini-batch)ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋ชจ๋ธ์— ์ž…๋ ฅ์œผ๋กœ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์—๋Š” ํ…์ŠคํŠธ์˜ ํ† ํฐํ™”, ๋ฒกํ„ฐํ™”, ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋ชจ์œผ๋Š” ๊ณผ์ •์„ ๋ชจ๋‘ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํŒŒ์ดํ”„๋ผ์ธ์ด๋ž€ ๋ชจ๋ธ์—๊ฒŒ ์‚ฌ์šฉ์ž ์ •์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์ž…๋ ฅํ•˜๊ธฐ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž ์ •์˜ ๋ฐ์ดํ„ฐ์…‹์— ํŒŒ์ดํ”„๋ผ์ธ์„ ์ ์šฉํ•˜๋Š” ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  1. Dataset: define and create the new dataset
  2. Vocabulary: map the dataset's tokens to integers
  3. Vectorizer: consult the Vocabulary to convert the dataset's tokens to integers and vectorize them
  4. DataLoader: gather the vectors converted by the Vectorizer into mini-batches
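Before going module by module, here is an end-to-end sketch of how the four stages chain together. It is a sketch only, using the classes defined later in this section; the CSV file name is an assumed placeholder for the downloaded dataset.

# End-to-end sketch of the mini-batch pipeline (classes defined below;
# "surnames_with_splits.csv" is a placeholder for the downloaded file).
dataset = SurnameDataset.load_dataset_and_make_vectorizer("surnames_with_splits.csv")  # stages 1-3
dataset.set_split('train')

for batch_dict in generate_batches(dataset, batch_size=64):  # stage 4: DataLoader
    x_batch = batch_dict['x_data']      # integer matrix of shape (64, max_seq_length)
    y_batch = batch_dict['y_target']    # nationality indices of shape (64,)
    break  # one mini-batch is enough for the sketch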
ย 
๊ฐ๊ฐ์˜ ๋ชจ๋“ˆ์— ๋Œ€ํ•œ ์„ค๋ช…๊ณผ ์ฝ”๋“œ๋ฅผ ์ˆœ์„œ๋Œ€๋กœ ์•„๋ž˜์— ์ž‘์„ฑํ•ด๋‘์—ˆ์Šต๋‹ˆ๋‹ค. ์†Œ์Šค์ฝ”๋“œ์˜ ๊ฒฝ์šฐ, ํ† ๊ธ€๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ย 

1. Dataset

ํŒŒ์ดํ† ์น˜์—์„œ๋Š” ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด Dataset ํด๋ž˜์Šค๋ฅผ ์ƒ์†ํ•˜๊ณ  __init()__, __getitem__(), __len__() 3๊ฐœ์˜ ๋ฉ”์„œ๋“œ๋ฅผ ๊ตฌํ˜„ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์ธ SurnameDataset ํด๋ž˜์Šค์˜ ํ•„์š”ํ•œ ๊ฐ’๋“ค์„ __init__() ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ์„ ์–ธํ•˜๊ณ , ํ•„์š”ํ•œ ๋งค์„œ๋“œ๋ฅผ ๊ตฌํ˜„ํ•ด๋ด…์‹œ๋‹ค.
ย 

๋ฐ์ดํ„ฐ์…‹ ์†Œ์Šค์ฝ”๋“œ

from argparse import Namespace
import os
import json

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import tqdm


class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): the vectorizer built from the dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        self._max_seq_length = max(map(len, self.surname_df.surname)) + 2

        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

        # class weights
        class_counts = self.train_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load the dataset and build a new Vectorizer object from it

        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load the dataset and a saved Vectorizer object.
        Used when re-using a cached Vectorizer.

        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved Vectorizer object
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """Static method for loading the Vectorizer object from file

        Args:
            vectorizer_filepath (str): location of the serialized Vectorizer object
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """Save the Vectorizer object to disk as JSON

        Args:
            vectorizer_filepath (str): location to save the Vectorizer object
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """Return the vectorizer object"""
        return self._vectorizer

    def set_split(self, split="train"):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """The primary entry point method for PyTorch datasets

        Args:
            index (int): the index of the data point
        Returns:
            a dictionary holding the features (x_data), the label (y_target),
            and the feature length (x_length)
        """
        row = self._target_df.iloc[index]

        surname_vector, vec_length = \
            self._vectorizer.vectorize(row.surname, self._max_seq_length)

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_data': surname_vector,
                'y_target': nationality_index,
                'x_length': vec_length}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches that can be made
        from the dataset

        Args:
            batch_size (int)
        Returns:
            the number of batches
        """
        return len(self) // batch_size
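A quick usage sketch of the class above; the CSV path is a placeholder for the file from the download link at the top.

# Assumed usage: build the dataset and inspect one sample.
dataset = SurnameDataset.load_dataset_and_make_vectorizer("surnames_with_splits.csv")
dataset.set_split('train')

sample = dataset[0]          # calls __getitem__
print(sample['x_data'])      # padded integer vector for one surname
print(sample['y_target'])    # integer index of the nationality label
print(sample['x_length'])    # true length including BEGIN/END tokens
print(len(dataset))          # size of the 'train' split via __len__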
ย 

2. Vocabulary

Vocabulary๋Š” ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฐ ๋ฌธ์ž ํ† ํฐ๋“ค์ด ๊ณ ์œ ํ•œ ์ •์ˆ˜๊ฐ’์— ๋งคํ•‘๋˜๋„๋ก ํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ๋”•์…”๋„ˆ๋ฆฌ ์ž๋ฃŒํ˜•์„ ์ด์šฉํ•ด Vocabulary๋ฅผ ๊ด€๋ฆฌํ•˜๋Š”๋ฐ, ํ•˜๋‚˜์˜ ๋”•์…”๋„ˆ๋ฆฌ๋Š” ๋ฌธ์ž๋ฅผ ์ •์ˆ˜ ์ธ๋ฑ์Šค์— ๋งคํ•‘ํ•˜๊ณ  ๋‚˜๋จธ์ง€ ๋”•์…”๋„ˆ๋ฆฌ๋Š” ์ •์ˆ˜ ์ธ๋ฑ์Šค๋ฅผ ๋ฌธ์ž์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค.
โ—
ย 
๋ณธ ์˜ˆ์ œ์—์„œ๋Š” Vocabulary๋ฅผ ์ƒ์†ํ•˜๋Š” SequenceVocabulary๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ํ•™์Šต ๊ณผ์ •์—์„œ ํ•œ ๋ฒˆ๋„ ๋ณธ ์  ์—†๋Š” ๋‹จ์–ด๊ฐ€ ํ…Œ์ŠคํŠธ ๊ณผ์ •์—์„œ ์ž…๋ ฅ๋œ ๊ฒฝ์šฐ์—๋Š” Vocabulary์—์„œ ๋Œ€์‘๋˜๋Š” ์ •์ˆ˜๊ฐ’์„ ์ฐพ์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ์— ๋Œ€ํ•ด์„œ๋„ ์ปค๋ฒ„ํ•˜๊ธฐ ์œ„ํ•ด SequenceVocabulary์—์„œ๋Š” UNK(unknown)๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ํŠน์ˆ˜ ํ† ํฐ์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • unk_token: an input word that is missing from the vocabulary because it was never seen during training is handled as the UNK token
  • mask_token: the MASK token is used when handling variable-length inputs
  • begin_seq_token: the BEGIN token is attached at the start of a sequence so the model can recognize sentence boundaries
  • end_seq_token: the END token is attached at the end of a sequence so the model can recognize sentence boundaries
ย 

Vocabulary source code

class Vocabulary(object):
    """Class to process text and build a vocabulary for mapping"""

    def __init__(self, token_to_idx=None):
        """
        Args:
            token_to_idx (dict): a pre-existing token-to-index mapping dictionary
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}

    def to_serializable(self):
        """Return a dictionary that can be serialized"""
        return {'token_to_idx': self._token_to_idx}

    @classmethod
    def from_serializable(cls, contents):
        """Instantiate a Vocabulary from a serialized dictionary"""
        return cls(**contents)

    def add_token(self, token):
        """Update the mapping dictionaries based on the token

        Args:
            token (str): the token to add to the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """Add a list of tokens to the Vocabulary

        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index corresponding to the token

        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        """
        return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token corresponding to the index

        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)


class SequenceVocabulary(Vocabulary):
    def __init__(self, token_to_idx=None, unk_token="<UNK>",
                 mask_token="<MASK>", begin_seq_token="<BEGIN>",
                 end_seq_token="<END>"):
        super(SequenceVocabulary, self).__init__(token_to_idx)

        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)

    def to_serializable(self):
        contents = super(SequenceVocabulary, self).to_serializable()
        contents.update({'unk_token': self._unk_token,
                         'mask_token': self._mask_token,
                         'begin_seq_token': self._begin_seq_token,
                         'end_seq_token': self._end_seq_token})
        return contents

    def lookup_token(self, token):
        """Retrieve the index corresponding to the token,
        or the UNK index if the token is not present.

        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Note:
            `unk_index` needs to be >= 0 (i.e., the UNK token has been added
            to the Vocabulary) for the UNK functionality to work.
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
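A short usage sketch of SequenceVocabulary and its UNK fallback, using toy tokens rather than the real dataset.

# The four special tokens occupy indices 0-3, so new characters start at 4.
char_vocab = SequenceVocabulary()
char_vocab.add_many(list("kim"))       # 'k' -> 4, 'i' -> 5, 'm' -> 6
print(char_vocab.lookup_token('k'))    # 4
print(char_vocab.lookup_token('z'))    # 1, the unk_index: 'z' was never added
print(char_vocab.lookup_index(0))      # '<MASK>'
print(len(char_vocab))                 # 7: four special tokens plus k, i, m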
ย 

3. Vectorizer

The Vectorizer makes use of the SequenceVocabulary defined above: it fetches the integer index for each token from the Vocabulary and carries out the vectorization. Vectorization is done per sentence (or per unit of sequence length processed at a time). Since the resulting vectors must always have the same length, methods such as padding, which fills the empty slots with zeros, are used. Let's look at an example of the vectorization process below.
[์˜ˆ์‹œ] ์›๋ฌธ : I Love Deep Daiv -> ์ •์ˆ˜ ๋งคํ•‘ : 1 5 7 6 -> ํ† ํฐ ๋ถ€์—ฌ : 8 1 5 7 6 9 (BEGIN, END ํ† ํฐ์„ 8, 9 ๋ผ๊ณ  ํ•˜์ž) -> ๋ฒกํ„ฐ ํŒจ๋”ฉ : 8 1 5 7 6 9 0 0 (ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋Š” ๊ธธ์ด๊ฐ€ 8์ผ ๋•Œ, ๋‚จ์€ ์ž๋ฆฌ 0์œผ๋กœ ์ฑ„์šฐ๊ธฐ)
ย 
Vectorizer์—์„œ Vocabulary ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Vocabulary๊ฐ€ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ๋‹จ์–ด์˜ ๊ฐœ์ˆ˜๋ฅผ ์ œํ•œํ•˜๊ฑฐ๋‚˜ ํŠน์ •ํ•œ ์ž„๊ณ„๊ฐ’์„ ์ง€์ •ํ•ด ํ•œ ๋‘๋ฒˆ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋ฅผ Vocabulary์— ๋“ฑ๋กํ•˜์ง€ ์•Š๋Š” ๋“ฑ์˜ ๋ฐฉ๋ฒ•์œผ๋กœ ๋‹จ์–ด ๋…ธ์ด์ฆˆ๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ ˆ์•ฝํ•˜๋Š” ์—ญํ• ๋„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
ย 

Vectorizer source code

class SurnameVectorizer(object):
    """The Vectorizer, which creates and manages the vocabularies"""

    def __init__(self, char_vocab, nationality_vocab):
        """
        Args:
            char_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
        """
        self.char_vocab = char_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname, vector_length=-1):
        """
        Args:
            surname (str): the surname string
            vector_length (int): forces the length of the index vector
        """
        indices = [self.char_vocab.begin_seq_index]
        indices.extend(self.char_vocab.lookup_token(token) for token in surname)
        indices.append(self.char_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices)

        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.char_vocab.mask_index

        return out_vector, len(indices)

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate a SurnameVectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surname dataset
        Returns:
            an instance of SurnameVectorizer
        """
        char_vocab = SequenceVocabulary()
        nationality_vocab = Vocabulary()

        for index, row in surname_df.iterrows():
            for char in row.surname:
                char_vocab.add_token(char)
            nationality_vocab.add_token(row.nationality)

        return cls(char_vocab, nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        char_vocab = SequenceVocabulary.from_serializable(contents['char_vocab'])
        nat_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(char_vocab=char_vocab, nationality_vocab=nat_vocab)

    def to_serializable(self):
        return {'char_vocab': self.char_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}
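A usage sketch reproducing the padding behavior on a toy dataframe; the exact indices depend on insertion order, so they are shown as an example only.

# Toy usage: build a vectorizer from a two-row dataframe and vectorize
# one surname to a fixed length of 8.
toy_df = pd.DataFrame({'surname': ['kim', 'lee'],
                       'nationality': ['Korean', 'Korean']})
vectorizer = SurnameVectorizer.from_dataframe(toy_df)

out_vector, length = vectorizer.vectorize('kim', vector_length=8)
print(out_vector)  # e.g. [2 4 5 6 3 0 0 0]: BEGIN k i m END plus MASK padding
print(length)      # 5: BEGIN + three characters + END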
ย 

4. DataLoader

The DataLoader gathers the data points that the Vectorizer converted into vectors into mini-batches, which makes the remaining work convenient. The classification and analysis models then handle training and testing through the mini-batches the DataLoader produces.
ย 
A mini-batch is the data split into smaller units so that training and testing use only part of the data rather than all of it, optimizing the model's weights faster. Read more about mini-batches at the link below.
[Curiosity] Why are mini-batches used?
In deep learning, the input data for a single iteration is usually a batch: a group of tens to hundreds of data points. So what advantages does a mini-batch offer over the two extremes of feeding a single data point per iteration and feeding the entire dataset, that it gets used as the obvious default? As you might expect, the mini-batch is a compromise that takes the advantages of both methods, each compensating for the other's drawbacks. Below, we examine the pros and cons of the two approaches and summarize why mini-batches are used.
ย 

DataLoader source code

def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader.
    It ensures that each tensor is moved to the specified device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
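A usage sketch that draws a single mini-batch; it assumes the `dataset` object built in the Dataset section, and the batch size is arbitrary.

# Assumed usage: draw one mini-batch and inspect it.
device = "cuda" if torch.cuda.is_available() else "cpu"

batch_gen = generate_batches(dataset, batch_size=64, device=device)
batch_dict = next(batch_gen)

print(batch_dict['x_data'].shape)    # torch.Size([64, max_seq_length])
print(batch_dict['y_target'].shape)  # torch.Size([64])
print(batch_dict['x_length'][:5])    # true lengths before MASK padding
print(dataset.get_num_batches(64))   # batches per epoch at this batch size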
ย 
์ด์ „ ๊ธ€ ์ฝ๊ธฐ
โ—
Dictionary ์ž๋ฃŒํ˜• key์™€ value๊ฐ€ 1๋Œ€1๋กœ ์ˆœ์„œ ์—†์ด ๋งคํ•‘๋˜๋Š” ํ˜•ํƒœ์˜ ์ž๋ฃŒํ˜• โ€bananaโ€๋ผ๋Š” ๋ฌธ์ž์—ด์„ value๋กœ ํ•˜๊ณ , ์ˆซ์ž โ€˜5โ€™๋ฅผ key๋กœ ํ•˜์—ฌ ๋”•์…”๋„ˆ๋ฆฌ์— ์ €์žฅํ•  ๊ฒฝ์šฐ, key ๊ฐ’์ธ 5๋ฅผ ํ†ตํ•ด โ€œbananaโ€ value๋ฅผ ๋ฐ›์•„์˜ฌ ์ˆ˜ ์žˆ๋‹ค. [์˜ˆ์‹œ] dict = {5 : โ€œbananaโ€} dict[5] >>> banana