7๊ฐ•_Imitation Learning in Large State Spaces


Why IL (Imitation Learning)?

  • ์•ž์„œ ๋ฐฐ์šด ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘, DQN์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฌธ์ œ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ํ•œ ํ”ฝ์…€์ด ์—ฌ๋Ÿฌ๋ฒˆ ๋ฐ”๋€Œ๋Š” (๊ฒŒ์ž„ ์ƒ์—์„œ ๊ณต๊ฐ„์ด ๊ณ„์† ๋ณ€ํ™”ํ•˜๋Š”) Mobtezuma ๊ฒŒ์ž„๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๊ธฐ์กด DQN์—์„œ์˜ ํƒ์ƒ‰์ด ์–ด๋ ต์Šต๋‹ˆ๋‹ค.
  • Result
    • [figure] The DQN on the left explored only 2 rooms, and even the improved DQN on the right could not reach every scenario.
  • Montezuma's Revenge gameplay
    • https://www.youtube.com/watch?v=JR6wmLaYuu4
    • 8์ดˆ ์ •๋„๋ฅผ ๋ณด๋ฉด, ๋ชจ๋“  ํ”ฝ์…€์ด ๋ฐ”๋€Œ๋ฉฐ ์บ๋ฆญํ„ฐ๊ฐ€ ์žˆ๋Š” ๊ณต๊ฐ„์ด ๋ณ€ํ™”ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
    • ์ฆ‰, ๊ฐ•ํ™”ํ•™์Šต์˜ ์‹œ์„ ์œผ๋กœ ๋ณผ ๋•Œ, ์–ด๋– ํ•œ Action์„ ํ•˜๋ฉด ๊ทธ ์ดํ›„ ๋ชจ๋“  State๊ฐ€ ๋ณ€ํ™”ํ•˜๋Š” ์ƒํ™ฉ
  • ํ•ด๊ฒฐ๋ฐฉ์•ˆ
    • ์ „๋ฌธ๊ฐ€๊ฐ€ ๊ฒช์€ ๊ฒฝํ—˜์„ ํ†ตํ•ด ํ•™์Šตํ•˜์ž! (Imitation Learning)
    • ๋ณด์ƒ์„ ๋ฐ›๋Š” ์‹œ๊ฐ„์ด ๊ธธ๊ฑฐ๋‚˜, ๋ณด์ƒ์ด ๋ชจํ˜ธํ•  ๊ฒฝ์šฐ, ์›ํ•˜๋Š” ์ •์ฑ…์„ ์ง์ ‘ ์ฝ”๋”ฉํ•˜๊ธฐ ์–ด๋ ค์šธ ๊ฒฝ์šฐ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ Imitation Learning์€ ์–ด๋–ป๊ฒŒ ํ•˜๋Š”๋ฐ?
    • reward๋ฅผ demonstrationํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์‹ค์ œ๋กœ ์–ด๋–ป๊ฒŒ ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋ฉด์„œ reward๋ฅผ implicitํ•˜๊ฒŒ ์ฃผ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
    • ์˜ˆ๋ฅผ ๋“ค์–ด, ์ž์œจ์ฃผํ–‰ ์ž๋™์ฐจ๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ˆ™๋ จ๋œ ์šด์ „์ž๊ฐ€ ์ง์ ‘ ์šด์ „์„ ํ•˜๋ฉด์„œ State์™€ Action์˜ ์‹œํ€€์Šค๋“ค์„ ์ „๋‹ฌํ•˜๊ณ , Agent๋Š” ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
    • ํ•˜์ง€๋งŒ, ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ reward๋ฅผ ํ•˜๋‚˜ ํ•˜๋‚˜ ๋ถ€์—ฌํ•˜๊ฑฐ๋‚˜, ํŠน์ •ํ•œ policy๋ฅผ ๋”ฐ๋ฅด๋„๋ก ํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒฝ์šฐ์— ๋น„ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค.
  • DQN๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ State์™€ Action, Transition model์ด ์ฃผ์–ด์ง€์ง€๋งŒ, reward function R์€ ์ฃผ์–ด์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋Œ€์‹  (s0,a0,s1,a1,โ€ฆ)๊ณผ ๊ฐ™์€ demonstration์ด ์ฃผ์–ด์ง‘๋‹ˆ๋‹ค.

Behavioral Cloning

  • Learn the expert's policy directly through supervised learning! (like ordinary machine learning)
  1. Policy์˜ ํด๋ž˜์Šค๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. (ex. neural network, decision tree, โ€ฆ)
  1. expert์˜ state๋ฅผ supervised learning model์˜ input, expert์˜ action์„ supervised learning model์˜ output์œผ๋กœ ๋‘๊ณ  Agent๋ฅผ ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค.
  • ๋ฌธ์ œ์ 
    • Compounding Error
      • ๋Œ€๋ถ€๋ถ„์˜ Machine Learning์€ ๋ฐ์ดํ„ฐ์˜ iid(๋™์ผํ•˜๊ณ  ๋…๋ฆฝ์ ์ธ ๋ถ„ํฌ์—์„œ ์ƒ์„ฑ๋จ)์„ ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค.
      • ํ•˜์ง€๋งŒ ๊ฐ•ํ™”ํ•™์Šต์—์„œ๋Š” ๋Œ€๋ถ€๋ถ„ ๋ฐ์ดํ„ฐ๋Š” ๋…๋ฆฝ์„ฑ์„ ๋ณด์žฅํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. (์‹œ๊ฐ„ ํ๋ฆ„์— ๋”ฐ๋ฅธ ๋ฐ์ดํ„ฐ๊ฐ€ ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ!)
      • ๋”ฐ๋ผ์„œ Machine Learning ๊ธฐ๋ฐ˜์˜ ๊ฐ•ํ™”ํ•™์Šต์€ ํ˜„์žฌ ์–ด๋–ค state์ธ์ง€๊ฐ€ ์ค‘์š”ํ•˜์ง€ ์•Š๊ณ , ํŠน์ • state์—์„œ๋Š” ํŠน์ • action์„ ์ทจํ•˜๊ธธ ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค.
      • ์˜ˆ์‹œ) ์›ํ˜• ํŠธ๋ž™ ๋‚ด ์ž์œจ์ฃผํ–‰ ์ž๋™์ฐจ
      • notion image
      • ํŒŒ๋ž€์ƒ‰์˜ ์ „๋ฌธ๊ฐ€๊ฐ€ ์šด์ „ํ•œ ๊ฒฝ๋กœ๋ฅผ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š”๋ฐ, ์ดˆ๋ฐ˜์— ์กฐ๊ธˆ ๋” ๋ฐ–์œผ๋กœ ์šดํ–‰ํ•˜๋Š” ์•ฝ๊ฐ„์˜ error๊ฐ€ ๋ฐœ์ƒํ•˜์˜€์Šต๋‹ˆ๋‹ค.
      • ํ•˜์ง€๋งŒ, Agent๊ฐ€ ํ˜„์žฌ์˜ ์ž๋™์ฐจ์˜ ์œ„์น˜๊ฐ€ ๋ฐ–์œผ๋กœ ๋‚˜์™€์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์€ ์ฑ„๋กœ expert ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ํŠน์ • ๊ตฌ๊ฐ„์—์„œ ์ฝ”๋„ˆ๋ง์„ ์ง„ํ–‰ํ•˜๋ฉด ์‚ฌ๊ณ ๊ฐ€ ๋‚˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
      • ์ฆ‰, time step t์—์„œ ์•„์ฃผ ์ž‘์€ ์‹ค์ˆ˜๋กœ ์ธํ•ด ๊ทธ ์ดํ›„์˜ time step t+1, t+2, โ€ฆ ์—์„œ๋„ ๊ณ„์† ์˜ค์ฐจ๊ฐ€ ์ƒ๊ฒจ ๊ฒฐ๊ตญ์€ ํ•™์Šต์— ์‹คํŒจํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ํ•ด๊ฒฐ์ฑ… (DAGGER : Dataset Aggregation)
    • notion image
    • ์ž˜๋ชป๋œ ๊ธธ์„ ๊ฐ€๋ฉด expert์—๊ฒŒ ์–ด๋–ค action์„ ์ทจํ•ด์•ผํ•˜๋Š”์ง€ ์•Œ๋ ค์ค˜!!๋ผ๊ณ  ๋ฌผ์–ด๋ณด๋Š” ๋ฐฉ์‹
    • ํ•˜์ง€๋งŒ, ์ด ๋ฐฉ๋ฒ•์€ ๋งค์šฐ ์ œํ•œ์ ์ธ ์ƒํ™ฉ์—์„œ๋งŒ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

Inverse Reinforcement Learning

  • Expert์˜ policy๋ฅผ ๋ณด๊ณ  reward function์„ ์ฐพ์•„๋‚˜๊ฐ€๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
  • Imitation Learning์€ reward function R์„ input์œผ๋กœ ๋ฐ›์ง€ ์•Š๊ณ , demonstration (s0,a0,s1,a1,โ€ฆ)์‹œํ€€์Šค๋ฅผ ๋ฐ›๊ฒŒ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ฅผ ํ†ตํ•ด reward๋ฅผ ์•Œ์•„๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ๋‹จ, expert์˜ policy๊ฐ€ optimalํ•˜๋‹ค๋Š” ์ „์ œ๋ฅผ ํ•˜๊ณ  ์ด ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฌธ์ œ์ 
    • ์ถ”์ •๋˜๋Š” reward function์€ ์—ฌ๋Ÿฌ๊ฐœ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ.
  • ํ•ด๊ฒฐ์ฑ… : Linear value function approximation
    • notion image
    • R๊ฐ’์„ W^t X(s)๋ผ๊ณ  ์ •์˜ํ•˜๋Š”๋ฐ, w๋Š” weight vector์ด๊ณ , x(s)๋Š” state์˜ feature๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ weight vector w๋ฅผ ์ฃผ์–ด์ง„ demonstration์„ ํ†ตํ•ด ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค.
    • ์ฆ‰, ์šฐ๋ฆฌ๊ฐ€ ํ•™์Šต์‹œํ‚จ weight vector w๊ฐ’์—๋‹ค๊ฐ€ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” state feature์˜ ๊ฐ’์„ ๊ณฑํ•ด์ค€ ๊ฒƒ์œผ๋กœ ํ•ด์„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. โ†’ ๋‹จ, ์šฐ๋ฆฌ๋Š” expert์˜ policy๋ฅผ optimal๋กœ ์ „์ œํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ž์ฃผ ๋ณด์ด๋Š” state feature๋ฅผ ๊ฐ–๋Š” state์˜ reward๋Š” ๋†’๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Apprenticeship Learning

  • ์œ„์˜ Inverse RL๊ณผ ๋น„์Šทํ•œ ๋ฐฉํ–ฅ
  • ์ถ”๊ฐ€์ ์ธ ๊ฒƒ์€, ๋งˆ์ง€๋ง‰ 6๋ฒˆ ์ˆ˜์‹์ž…๋‹ˆ๋‹ค.
    • [figure]
    • V^π*: the value of the optimal policy the expert provides → a quantity we already know (it can be estimated from the demonstrations)
    • V^π: the value of any policy other than the expert's
    • We need to find a policy π for which the difference between μ(π*) and μ(π) is small, and a weight vector w for which the difference between the values V^π* = w^T μ(π*) and V^π = w^T μ(π) is small.
    • In this way, without knowing the true reward function, we can obtain a policy sufficiently close to the optimal policy; the loop below sketches the procedure.