
Lecture 1 _ Introduction


Reinforcement Learning

: how can an intelligent agent learn to make a good sequence of decisions under uncertainty?

key issues of reinforcement learning

  • Sequence of Decisions : not a single one-off decision, but a continuing sequence of decisions
  • Good Decisions : decisions that are accurate and collect as much reward as possible, i.e. optimal decisions
  • the learning : training the agent so that it makes good decisions

key aspects of reinforcement learning

What distinguishes reinforcement learning from other AI approaches
  • Optimization
    • Optimization is required in every approach (finding an optimal way to make decisions)
  • Delayed consequences
    • A decision made now can also affect the future (e.g., a choice made now in a game decides a win or loss much later)
    • Challenge : immediate feedback on the current decision is not guaranteed → it is hard to generalize the relationship between decisions made in the past and rewards received in the future (a point that especially distinguishes RL from standard machine learning)
  • Exploration
    • The agent learns only from the data it gathers by exploring, not from a pre-collected set of input-output pairs
    • What the agent learns therefore depends on the decisions it makes
  • Generalization
    • The ability to handle situations never seen during training, based on what has been learned so far
    • Hard-coding every action would require far too much programming → with generalization, the agent can handle situations it encounters for the first time

differences from RL

RL : the model learns by experiencing the world directly
  • AI planning : a model in which the rules (dynamics) are already given
    • Involves optimization, delayed consequences, and generalization, but not exploration
    • It still chooses a sequence of decisions, but because the rules are already given, it knows how the current decision will affect the future
  • Supervised Machine Learning : learns from given experience, using data that comes with results (labels)
    • Involves optimization and generalization, but not delayed consequences or exploration
    • Learns from experience data whose input-output pairs are already available
    • The agent does not learn by experiencing things itself; it learns from data that has already been collected
  • Unsupervised Machine Learning : learns from given experience, using data without results (labels)
    • Involves optimization and generalization, but not delayed consequences or exploration
    • The agent does not experience things itself; it learns from previously collected data, but that data carries no outcome (label) for the actions
  • Imitation Learning
    • Involves optimization, delayed consequences, and generalization, but not exploration
    • The agent does not learn by experiencing things itself; it learns from experience data collected by another agent
    • Because it learns by imitating another agent's behavior, it cannot cope when it faces a new situation it has never seen imitated

Sequential Decision Making (under uncertainty)

[Figure: the agent-world interaction loop]
  • The world and the agent influence each other and keep making decisions in a loop (see the sketch after this list)
  • The goal of this closed interaction loop is to make decisions that maximize the total future reward
  • key challenges
    • Immediate rewards and future rewards must be balanced appropriately
      • Sometimes immediate reward has to be sacrificed for future reward
      • (e.g., if you only solve easy problems while studying, you get many problems right now (immediate reward), but you will get fewer problems right on the exam (future reward); to maximize the exam score (future reward) you have to give up some of the problems you would get right while studying (immediate reward))
  • The function that determines the rewards (and hence how immediate and future rewards are weighted) is called the reward function; how this reward function is specified changes what the agent learns
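A minimal sketch of this interaction loop, assuming hypothetical Agent and World classes (act, step, and update are illustrative names, not from the lecture):

```python
# Minimal sketch of the agent-world interaction loop (all names are illustrative).
class World:
    def step(self, action):
        # Return an observation and an immediate reward for the given action
        # (placeholder dynamics; a real environment would go here).
        observation, reward = 0, 1.0
        return observation, reward

class Agent:
    def act(self, observation):
        # Choose an action given the latest observation (placeholder policy).
        return 0

    def update(self, observation, action, reward):
        # Learn from the experienced (observation, action, reward) tuple.
        pass

agent, world = Agent(), World()
observation = None
for t in range(10):                              # each iteration is one time step t
    action = agent.act(observation)              # agent takes action a_t
    observation, reward = world.step(action)     # world returns o_t and r_t
    agent.update(observation, action, reward)    # agent learns from the feedback
```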

Terminology

1) agent & world
At every time step t, the agent takes an action a, and the world returns an observation o and a reward r in response to that action.
(The reward received right after taking an action is the immediate reward; rewards at time steps further in the future are the future rewards.)
2) history
The record of the agent's past actions together with the observations and rewards the world returned for them
3) state
The information about the situation that the agent uses when making a decision (see the notation below)
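In symbols (standard notation): the history collects everything experienced up to time t, and the state is some function of it.

$$h_t = (a_1, o_1, r_1, \dots, a_t, o_t, r_t)$$
$$s_t = f(h_t)$$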

The Markov assumption

  • If the future is independent of the past given the present (i.e., the current state sufficiently summarizes all of the past history), then that state alone is enough to predict the future and make decisions.
  • The Markov assumption can always be made to hold by choosing the state appropriately (e.g., taking the state to be the entire history), so defining the state well is what matters (formal statement below).
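Formally, a state s_t is Markov if conditioning on it is as good as conditioning on the whole history:

$$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$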

Full vs. Partial Observability

  • MDP(Markov Decision Process)
    • If the agent's state (the state the agent can see and use) matches the state of the real world (the full world state), the agent observes everything there is: the world is fully observable.
  • POMDP(Partially Observable Markov Decision Process)
    • Used when the agent's state does not match the real world's state
    • Because there are parts of the world the agent cannot observe, using only what the agent observes as the state leaves the state short of information.
    • The agent state therefore usually has to include more than the latest observation (e.g., the history, or a belief over possible world states)

Types of Sequential Decision Processes

  • Bandits
    • What the agent can observe right now is by itself enough to form the state
    • Past decisions have no effect on the current state
  • MDPs and POMDPs
    • Account for the fact that actions influence future states
    • Specifying which parts are guaranteed and which are not becomes important
  • How the World Changes (notation below)
    • Deterministic
      • The outcome returned for a given action is fixed
    • Stochastic
      • The outcome returned for a given action is probabilistic
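In notation, the two cases for the dynamics are:

$$\text{deterministic: } s_{t+1} = f(s_t, a_t) \qquad\qquad \text{stochastic: } s_{t+1} \sim p(\cdot \mid s_t, a_t)$$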

RL Algorithm Components

An RL algorithm often includes one or more of the following:
Model : a representation of how the world changes in response to the agent's actions
Policy : a function mapping the agent's states to actions
Value Function : the expected future rewards from being in a state (and/or taking an action) while following a particular policy

Model

  • A function that predicts how the world will change depending on which action the agent takes (symbols below)
  • Transition / dynamics model : predicts the agent's next state
  • Reward model : predicts the immediate reward
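In symbols, the two parts of the model are usually written as:

$$p(s_{t+1} = s' \mid s_t = s, a_t = a) \quad \text{(transition / dynamics model)}$$
$$r(s, a) = \mathbb{E}[\, r_t \mid s_t = s, a_t = a \,] \quad \text{(reward model)}$$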

Policy

  • A function that determines how the agent chooses an action when given a state (symbols below)
  • Deterministic Policy : outputs a single action for the given state
  • Stochastic Policy : outputs a probability for each possible action given the state
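In symbols:

$$\pi(s) = a \quad \text{(deterministic)} \qquad\qquad \pi(a \mid s) = \Pr(a_t = a \mid s_t = s) \quad \text{(stochastic)}$$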

Value Function

  • A function that estimates the expected sum of future rewards obtained by following a particular policy (formula below)
  • It also captures how much weight is placed on immediate rewards versus future rewards
  • The higher the value, the larger the reward that can be obtained; a policy that achieves high value is a good policy
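As a formula, with a discount factor γ in [0, 1] setting the weighting between immediate and future rewards:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \mid s_t = s \,\right]$$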

Types of RL Agents

  • Model Based Agent
    • Maintains an explicit model of the world
    • It may or may not also have a policy function or value function
    • That is, an explicit policy function or value function is not required
  • Model-free Agent
    • Has no model
    • Has an explicit policy function and/or value function

Key Challenges in Learning to Make Sequences of Good Decisions

Planning (Agentโ€™s internal computation)

  • world์˜ ๋™์ž‘์— ๋Œ€ํ•œ model ์กด์žฌ
    • Dynamic/Reward model
  • ํ•™์Šต ๊ณผ์ •์—์„œ world์— ๋Œ€ํ•œ ํƒ์ƒ‰์€ ๋ถˆํ•„์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, world์™€์˜ ์ƒํ˜ธ์ž‘์šฉ ๋ถˆํ•„์š”
  • ์—ฌ๋Ÿฌ ์„ ํƒ์ง€ ์ค‘ high reward๋ฅผ ๋ณด์žฅํ•˜๋Š” action์œผ๋กœ ๊ฒฐ์ •
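A minimal sketch of what planning with a known model could look like, assuming hypothetical transition, reward, and value functions (none of these names come from the lecture):

```python
# One-step lookahead planning with a known model (illustrative sketch only).
def plan(state, actions, transition, reward, value):
    # transition(s, a): next state predicted by the dynamics model
    # reward(s, a):     immediate reward predicted by the reward model
    # value(s):         estimated value of being in state s afterwards
    return max(actions, key=lambda a: reward(state, a) + value(transition(state, a)))

# Toy usage: two actions, with the model given as simple lambdas.
best = plan(
    state=0,
    actions=[0, 1],
    transition=lambda s, a: s + a,
    reward=lambda s, a: 1.0 if a == 1 else 0.0,
    value=lambda s: 0.5 * s,
)
print(best)  # -> 1
```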

Reinforcement Learning

  • world์˜ ๋™์ž‘์— ๋Œ€ํ•œ model์ด ์กด์žฌํ•˜์ง€ ์•Š์Œ
  • ํ•™์Šต ๊ณผ์ •์—์„œ world์— ๋Œ€ํ•œ ํƒ์ƒ‰์ด ํ•„์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์ดˆ๋ฐ˜์— ๋งŽ์€ ์‹œํ–‰์ฐฉ์˜ค ํ•„์š”
  • ํ•™์Šต ์‹œ high reward๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•๊ณผ world์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ• ๋ชจ๋‘๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•จ
ย 
ย 

Exploration vs Exploitation

  • Only the actions the agent actually tries are carried out, so it only learns about what it tries
  • Two ways an RL agent can choose its actions
    • Exploration : trying new things that might enable the agent to make better decisions in the future (try something new!)
    • Exploitation : choosing actions that are expected to yield good reward given past experience (reuse what has worked so far)
  • There is a tradeoff between exploration and exploitation (see the sketch after this list)
    • The agent may sacrifice reward in order to explore and learn about a potentially better policy
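One common way to balance the two is ε-greedy action selection; a minimal sketch, assuming the estimated action values q_values come from whatever the agent has learned so far:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        # Exploration: try a uniformly random action.
        return random.randrange(len(q_values))
    # Exploitation: pick the action with the highest estimated value so far.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Toy usage: three actions with estimated values; mostly returns action 2.
print(epsilon_greedy([0.1, 0.5, 0.9]))
```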

Evaluation & Control

Evaluation

  • ํ‰๊ฐ€์™€ ์˜ˆ์ธก์„ ํ†ตํ•œ ๋ณด์ƒ ์˜ˆ์ธก

Control

  • Optimization : find the best policy (symbols below)
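In symbols, using the value function V^π defined earlier:

$$\text{Evaluation: estimate } V^{\pi} \text{ for a given } \pi \qquad\qquad \text{Control: find } \pi^{*} = \arg\max_{\pi} V^{\pi}$$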