Lecture 3: Model-Free Policy Evaluation

lec 3

  • Recall
    • Definition of Return G_t: the discounted sum of rewards from time step t.
      • $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
    • Definition of State Value Function: the expected return from starting in state s and following policy π.
      • $V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s]$
    • Definition of State-Action Value Function: the expected return from starting in state s, taking action a, and then following policy π.
      • $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]$
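As a quick refresher, here is a minimal sketch (with a made-up reward sequence and discount factor) of computing the return G_t backward over one finite episode:

```python
# Hypothetical episode: rewards received at t = 0, 1, 2, 3 and a discount factor.
rewards = [1.0, 0.0, 0.0, 2.0]
gamma = 0.9

# G_t = r_t + gamma * G_{t+1}, accumulated backward from the end of the episode.
returns = [0.0] * len(rewards)
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G

print(returns)  # returns[0] == r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3*r_3
```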

This lecture: estimating the expected return of a particular policy when we don't have access to the true MDP model.
  • Dynamic Programming
  • Monte Carlo policy evaluation
    • policy evaluation when we don't have a model of how the world works
    • given on-policy samples
  • Temporal difference (TD) learning
  • Metrics to evaluate and compare algorithms
  1. Dynamic Programming
Planning most commonly uses dynamic programming: with an MDP that is already given (or already known), rewards come from the model and the agent is evaluated within that fixed environment.
Dynamic – dealing with problems that arise sequentially.
Dynamic programming – solving such sequentially arising problems through mathematical optimization; it is like breaking a large problem into smaller subproblems and solving those (divide & conquer).
In the model-free setting the MDP is not given (unknown), so the agent interacts directly with the environment, accumulates experience, and keeps learning from it. In the process, the value function is optimized in order to find the optimal policy.
Two model-free methods are covered here: Monte Carlo learning and Temporal Difference learning.
  • Monte Carlo methods learn directly from episodes of experience.
  • Monte Carlo is model-free: it needs no knowledge of MDP transitions / rewards.
  • Monte Carlo learns from complete episodes, which means there is no bootstrapping.
  • It uses the simplest possible idea: the value is the mean return.
The agent learns about the environment directly from experience, episode by episode; observations and rewards arrive without any prior knowledge of the transition / reward model.
After episodes terminate, the mean of the returns received is used as the value (a minimal sketch follows below).
$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s] \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$
In other words: obtain the return once the episode ends → use it to estimate the value function → the expectation is not computed from a model but approximated by the empirical mean of the sampled returns.
Bootstrapping → when updating an estimate, using one or more other estimated values of the same kind (e.g., updating a value estimate from other value estimates). Monte Carlo does not do this; it waits for the actual return.
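Putting the Monte Carlo idea together, here is a minimal first-visit MC policy evaluation sketch; sample_episode is an assumed helper that rolls out the policy being evaluated and returns a finished episode:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, num_episodes=1000, gamma=0.9):
    """First-visit Monte Carlo policy evaluation.

    sample_episode() is an assumed helper: it runs the policy in the environment
    and returns one complete episode as a list of (state, reward) pairs.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = sample_episode()          # complete episode: no bootstrapping
        # Compute the return G_t for every time step by walking backward.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_at[t] = G
        # First-visit update: only the first occurrence of each state counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            returns_sum[s] += returns_at[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]   # value = empirical mean return
    return V
```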
  • Temporal difference methods also learn directly from episodes of experience.
  • TD is model-free: it requires no knowledge of MDP transitions / rewards.
  • Unlike Monte Carlo, TD uses bootstrapping: it updates toward a target built from the current value estimate (see the sketch below).
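A minimal TD(0) sketch of the same evaluation problem; env_reset and env_step are assumed stand-ins for interacting with the environment under the policy being evaluated:

```python
from collections import defaultdict

def td0_policy_evaluation(env_reset, env_step, num_episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) policy evaluation.

    Assumed toy interface (not from the lecture):
      env_reset()     -> initial state
      env_step(state) -> (reward, next_state, done) after acting with the policy
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            r, s_next, done = env_step(s)
            # Bootstrapped target r + gamma * V(s_next) uses the current estimate of V,
            # so the update happens online, before the episode finishes.
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```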
Important Properties to Evaluate Policy Evaluation Algorithms
  • Robustness to Markov assumption
  • Bias/variance characteristics
  • Data efficiency
  • Computational efficiency