lec 3
- Recall
- Definition of Return G_t
- Definition of State Value Function: the expected return from starting in state s and following policy π.
- Definition of State-Action Value Function: the expected return from starting in state s, taking action a, and then following policy π.
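Written out as a quick sketch in standard notation (γ is the discount factor; the exact notation is assumed rather than copied from the lecture):

```latex
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
V^{\pi}(s)   = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right]
```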
Estimating the expected return of a particular policy when we don't have access to the true MDP model:
- Dynamic programming
- Monte Carlo policy evaluation: policy evaluation when we don't have a model of how the world works, given on-policy samples
- Temporal difference (TD) learning
- Metrics to evaluate and compare algorithms
- dynamic programming
Dynamic programming is the method most commonly used for planning: the MDP is already given (already known), so it is the process of training the agent in an environment whose rewards are defined by that known MDP.
Dynamic - solving problems that arise sequentially.
Dynamic programming - solving those sequentially arising problems by optimizing them mathematically; like breaking a large problem into smaller problems and solving those (divide & conquer).
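As a rough illustration (not from the lecture itself), a minimal sketch of iterative policy evaluation when the MDP is known; the transition matrix `P`, reward vector `R`, and discount `gamma` are hypothetical inputs standing in for the given model:

```python
import numpy as np

def dp_policy_evaluation(P, R, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a known MDP (dynamic programming).

    P : (S, S) array, P[s, s'] = probability of moving s -> s' under the evaluated policy
    R : (S,)   array, R[s]    = expected immediate reward in state s under that policy
    Both arrays are hypothetical stand-ins for the already-known MDP model.
    """
    V = np.zeros(len(R))
    while True:
        # Bellman expectation backup: V(s) <- R(s) + gamma * sum_s' P(s'|s) * V(s')
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```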
The model-free approach: when the MDP is not given (i.e., unknown), the agent interacts directly with the environment, accumulates experience, and keeps learning from that experience. In this process the value function is optimized to find the optimal policy.
There are two model-free methods here: Monte Carlo learning and Temporal Difference learning.
- Monte Carlo methods learn directly from episodes of experience.
- Monte Carlo is model-free: no knowledge of MDP transitions / rewards is needed
- Monte Carlo learns from complete episodes: no bootstrapping
- Uses the simplest possible idea: value = mean return
The environment is learned episode by episode through direct experience: observations and rewards are received without any prior knowledge of the transitions / rewards.
After an episode terminates, the mean of the returns received is used as the value.
obtaining the final return → defining the value function → inferring the expectation E, but in a different way, by using the mean of the rewards??
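A minimal sketch of how the "value = mean of returns" idea turns into first-visit Monte Carlo policy evaluation; the `episodes` input format is an assumption, not something specified in these notes:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation.

    episodes : list of trajectories, each a list of (state, reward) pairs
               collected by following the policy (hypothetical input format).
    Returns V(s) estimated as the mean of the returns observed from first visits to s.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Walk the episode backwards so G accumulates the discounted return G_t.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_return[state] = G  # overwritten until only the earliest (first) visit remains
        for state, G_first in first_visit_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```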
Bootstrapping?? → when updating an estimate of some quantity, using one or more other estimated values of the same kind in the update.
- Temporal difference methods learn directly from episodes of experience
- TD is model-free: it requires no knowledge of MDP transitions / rewards
- Unlike Monte Carlo, TD uses bootstrapping, so it can learn from incomplete episodes
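A minimal sketch of the TD(0) update, which shows the bootstrapping: V(s) is moved toward the one-step target r + γ·V(s'), a target built from another estimate. The transition tuple format and the step size `alpha` are assumptions for illustration:

```python
from collections import defaultdict

def td0_policy_evaluation(transitions, alpha=0.1, gamma=0.9):
    """TD(0) policy evaluation from individual transitions (complete episodes not required).

    transitions : iterable of (state, reward, next_state, done) tuples
                  observed while following the policy (hypothetical input format).
    """
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        # Bootstrapped target: immediate reward plus the discounted *current estimate* of V(next_state).
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])
    return dict(V)
```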
Important Properties to Evaluate Policy Evaluation Algorithms
- Robustness to Markov assumption
- Bias/variance characteristics
- Data efficiency
- Computational efficiency