Lecture 4: Model Free Control


0. Review

Model-Free Control Examples

  • Many application domains are modeled as MDPs:
  • games, robots, helicopter flight, Go, etc.
  • But simulation is computationally very expensive (parts of it must be computed in real time).

On-policy learning

  • ์ง์ ‘ ๊ฒฝํ—˜
  • ์ถ”์ •์น˜๋ฅผ ํ•™์Šตํ•˜๊ณ  ๊ทธ policy๋กœ ๋ถ€ํ„ฐ ์–ป์–ด์ง„ ๊ฒฝํ—˜์œผ๋กœ policy ํ‰๊ฐ€

Off-policy learning

  • Learn value estimates and evaluate a policy using experience collected from a different policy.

1. Generalized Policy Iteration

Policy Update

Notation
V(s): State Value Function
Q(s,a): State-Action Value Function
  • ์ฒ˜์Œ์˜ ์ •์ฑ…(Policy)๋ฅผ ๋ผ ํ•˜๊ณ , ์ •์ฑ… ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ๊ฐ’์„ ๋ผ๊ณ  ํ•˜๋ฉด, ์—…๋ฐ์ดํŠธ ๋˜๋Š” ์ •์ฑ…์„ ๋ผ ํ• ๋•Œ, ์—…๋ฐ์ดํŠธ ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ฑ๋ฆฝํ•œ๋‹ค.
  • ๊ฒฐ๊ตญ, ์ •์ฑ…์„ ์—…๋ฐ์ดํŠธ ํ•˜๋Š” ๊ฒƒ์€ State-Action value function(Q)์„ ์ตœ๋Œ€ํ™” ํ•˜๋Š” action์„ ์ฐพ๋Š”๊ฒƒ.
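A minimal sketch of this improvement step over a tabular Q, assuming Q is stored as a dict keyed by (state, action) pairs (the table layout is an assumption, not something fixed by the notes):

```python
# Hypothetical tabular layout: Q[(s, a)] holds the current estimate Q^{pi_i}(s, a).
def greedy_policy_improvement(Q, states, actions):
    """Return pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a) for every state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```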

Iteration

  • Initialization: N(s,a) = 0, G(s,a) = 0, Q(s,a) = 0
  • Sample the k-th episode under the current policy π_k, then for every (s_t, a_t) in the episode update:
    N(s_t, a_t) ← N(s_t, a_t) + 1
    G(s_t, a_t) ← G(s_t, a_t) + G_t
    Q(s_t, a_t) ← G(s_t, a_t) / N(s_t, a_t)
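A minimal sketch of this incremental update, assuming episodes are lists of (state, action, reward) tuples and every-visit updates (both details are assumptions, not fixed by the notes):

```python
def mc_update(episode, N, G_sum, Q, gamma=1.0):
    """Every-visit incremental Monte Carlo update of Q(s, a)."""
    G = 0.0
    for s, a, r in reversed(episode):          # walk backwards to accumulate returns
        G = r + gamma * G                      # G_t = r_t + gamma * G_{t+1}
        N[(s, a)] = N.get((s, a), 0) + 1       # N(s,a) <- N(s,a) + 1
        G_sum[(s, a)] = G_sum.get((s, a), 0.0) + G
        Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]  # Q(s,a) = G(s,a) / N(s,a)
    return Q
```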

2. Importance of Exploration

Epsilon greedy

  • This class of policies is used so that the estimates for every (s,a) pair (approximately) converge to the true values → i.e., to keep exploring enough to check whether the current policy is actually good.
  • If |A| is the number of actions, the ε-greedy policy is defined as:
    π(a|s) = 1 − ε + ε/|A|  if a = argmax_{a'} Q(s, a')
    π(a|s) = ε/|A|          otherwise
  • That is, with probability 1 − ε choose the action that maximizes Q, and with probability ε choose an action uniformly at random (so every action gets at least ε/|A|). A sketch follows below.
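A minimal sketch, assuming the same hypothetical tabular Q layout as above:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With prob. eps pick uniformly at random; otherwise act greedily.
    Net effect: every action has prob. eps/|A|, and the greedy action
    additionally has prob. 1 - eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```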

GLIE

  • Greedy in the Limit with Infinite Exploration
  1. Every State-Action pair is visited infinitely often: lim_{i→∞} N_i(s, a) = ∞
  2. The behavior policy converges to the greedy policy with probability 1: lim_{i→∞} π_i(a|s) = 1 for a = argmax_{a'} Q_i(s, a')
  • In practice this means decaying ε toward 0, e.g., ε_k = 1/k (used in the Monte Carlo control sketch in the next section).

3. Monte Carlo Control

Pseudo code

(figure: GLIE Monte Carlo control pseudocode; a Python sketch follows below)
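Since the original pseudocode images did not survive the export, here is a hedged sketch of GLIE Monte Carlo control, reusing the mc_update and epsilon_greedy helpers above; the env interface (reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption:

```python
from collections import defaultdict

def glie_mc_control(env, actions, n_episodes, gamma=1.0):
    Q = defaultdict(float)                           # Q(s,a) = 0
    N, G_sum = defaultdict(int), defaultdict(float)  # N(s,a) = 0, G(s,a) = 0
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                # GLIE schedule: eps_k = 1/k -> 0
        s, episode, done = env.reset(), [], False
        while not done:                              # roll out one episode under pi_k
            a = epsilon_greedy(Q, s, actions, eps)
            s_next, r, done = env.step(a)            # assumed env interface
            episode.append((s, a, r))
            s = s_next
        mc_update(episode, N, G_sum, Q, gamma)       # policy evaluation
        # Policy improvement is implicit: epsilon_greedy reads the updated Q.
    return Q
```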

4. Temporal Difference Methods for Control

SARSA Algorithm

  • Update rule (bootstrap with the next action a_{t+1} actually taken under the current policy):
    Q(s_t, a_t) ← Q(s_t, a_t) + α (r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
  • Convergence
  • For an MDP with finitely many states and actions, SARSA's Q(s,a) converges to the optimal action-value function Q*(s,a), provided the following conditions hold:
  1. The policy sequence π_t(a|s) satisfies the GLIE conditions.
  2. The step sizes α_t satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞.
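A minimal tabular SARSA sketch under the same assumed env interface, reusing the epsilon_greedy helper above (a fixed ε is used for brevity; a GLIE schedule would decay it):

```python
from collections import defaultdict

def sarsa(env, actions, n_episodes, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: bootstrap with the next action actually taken."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q, s, actions, eps)
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # TD update
            s, a = s2, a2
    return Q
```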

Q-Learning

  • Maintain an estimate of the State-Action value function Q and use it to bootstrap → use the value of the best future action.
  • Comparison with SARSA:
    SARSA:      Q(s, a) ← Q(s, a) + α (r + γ Q(s', a') − Q(s, a))
    Q-learning: Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
  • ์ฆ‰, ๋‹ค์Œ ํ–‰๋™๋งŒ์„ ์‚ฌ์šฉํ•ด state-Action function์„ ๊ณ„์‚ฐํ•˜๋Š” SARSA์™€ ๋‹ฌ๋ฆฌ, Q-learning์€ ๋ถ€ํŠธ์ŠคํŠธ๋žฉ(์žฌํ‘œ๋ณธ์ถ”์ถœ๊ธฐ๋ฒ•)์„ ์‚ฌ์šฉํ•ด ๋‹ค์–‘ํ•œ action์„ ํ•˜๊ณ , ๊ทธ ์ค‘ Q๋ฅผ ์ตœ๋Œ€ํ™” ํ•˜๋Š” Action์„ ์‚ฌ์šฉํ•œ๋‹ค.
  • Q-learning with epsilon greedy Exploration
(figure: Q-learning with ε-greedy exploration pseudocode; a Python sketch follows below)
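A minimal tabular Q-learning sketch under the same assumptions as the SARSA code above:

```python
from collections import defaultdict

def q_learning(env, actions, n_episodes, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy TD control: bootstrap with max_a' Q(s', a')."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)     # behavior policy explores
            s2, r, done = env.step(a)
            best = max(Q[(s2, a2)] for a2 in actions)  # greedy target value
            Q[(s, a)] += alpha * (r + (0.0 if done else gamma * best) - Q[(s, a)])
            s = s2
    return Q
```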

5. Maximization Bias

  • The value of an estimate Q̂ can be biased: the max of noisy estimates overestimates the true maximum, since E[max_a Q̂(s, a)] ≥ max_a E[Q̂(s, a)] = max_a Q(s, a) (Jensen's inequality, max being convex). The small simulation below illustrates this.
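A tiny simulation of this effect, assuming every action has true value 0 and each estimate averages a few noisy samples (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions, n_samples = 10_000, 10, 5
# True Q(s, a) = 0 for every a, so max_a Q(s, a) = 0.
samples = rng.normal(0.0, 1.0, (n_trials, n_actions, n_samples))
q_hat = samples.mean(axis=2)      # noisy estimates Q_hat(s, a)
print(q_hat.max(axis=1).mean())   # clearly > 0: E[max_a Q_hat] > max_a Q
```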

Double Q-Learning

  • Split the estimator into two Q functions: one (Q1) selects the action, and the other (Q2) evaluates it. Because the selecting and the evaluating estimates come from different samples, the upward maximization bias is removed → lower bias.
  With probability 0.5:
    a* = argmax_{a'} Q1(s', a')
    Q1(s, a) ← Q1(s, a) + α (r + γ Q2(s', a*) − Q1(s, a))
  otherwise:
    b* = argmax_{a'} Q2(s', a')
    Q2(s, a) ← Q2(s, a) + α (r + γ Q1(s', b*) − Q2(s, a))
  (figure: Q-learning vs. Double Q-learning on a simple MDP)
  • Because of maximization bias, Q-learning spends far more time taking suboptimal actions than Double Q-learning does. A sketch of the update follows below.
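A minimal Double Q-learning sketch under the same assumed env interface; which table gets updated is chosen by a fair coin flip, matching the update rule above:

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, n_episodes, alpha=0.1, gamma=0.99, eps=0.1):
    """One table selects the argmax action, the other evaluates it."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s2, r, done = env.step(a)
            if random.random() < 0.5:  # update Q1: select with Q1, evaluate with Q2
                a_star = max(actions, key=lambda x: Q1[(s2, x)])
                target = r + (0.0 if done else gamma * Q2[(s2, a_star)])
                Q1[(s, a)] += alpha * (target - Q1[(s, a)])
            else:                      # update Q2: select with Q2, evaluate with Q1
                b_star = max(actions, key=lambda x: Q2[(s2, x)])
                target = r + (0.0 if done else gamma * Q1[(s2, b_star)])
                Q2[(s, a)] += alpha * (target - Q2[(s, a)])
            s = s2
    return Q1, Q2
```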