Optimization

The goal of Optimization is to find the weights that minimize the value of the Loss Function.

Random Search

โš ๏ธย ์ฃผ์˜! ์ •ํ™•๋„์˜ ํŽธ์ฐจ๊ฐ€ ํฌ๊ธฐ ๋•Œ๋ฌธ์— ์‹ค์ œ๋กœ๋Š” ์“ฐ์ด์ง€ ์•Š๋Š” ๊ฐœ๋…์ด๋‹ค.
์ž„์˜๋กœ ์ƒ˜ํ”Œ๋งํ•œ W๋“ค์„ ๋งŽ์ด ๋ชจ์•„๋†“๊ณ  Loss๋ฅผ ๊ณ„์‚ฐํ•ด์„œ ์–ด๋–ค W๊ฐ€ ์ข‹์€์ง€๋ฅผ ์‚ดํŽด๋ณด๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

Gradient Descent

Gradient Descent finds a function's minimum using its first derivative: it repeatedly moves the parameters in the direction that lowers the function value, eventually arriving at the values that minimize the function.
์ž์„ธํ•œ ๋‚ด์šฉ์€ Gradient Descent ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์‹œ์˜ค.

Stochastic Gradient Descent (SGD)

The method covered in 2-2 is called Full Gradient Descent, and because it computes the gradient over every single training example, it is computationally expensive and slow. In practice, to improve speed and efficiency, only a portion of the Train Data is used to compute the Gradient; the representative method is Stochastic Gradient Descent (SGD).
The training data is split into small sets of training samples called Minibatches and trained on iteratively, which yields an estimate of the total Loss and an estimate of the true Gradient.
The Minibatch size is typically a power of two (the example below uses 256).
# Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # e.g., a batch of 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # parameter update
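For a self-contained, runnable illustration, the sketch below applies minibatch SGD to a toy linear regression problem; all of the data and names here are made up for the example:

import numpy as np

# Toy problem: recover true_w from noisy linear measurements y = X @ true_w + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(10)                                      # weights to learn
step_size = 0.1
for step in range(500):
    idx = rng.choice(len(X), size=64, replace=False)  # sample a minibatch of 64
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)        # gradient of the mean squared error
    w -= step_size * grad                             # parameter update

After enough steps, w approaches true_w even though each update only ever sees 64 of the 1000 examples.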