Foundation of Reinforcement learning(V)
Introduction
In the previous post, we introduced two methods for estimating the value function: Monte Carlo (MC) and temporal-difference (TD) learning. In this post, we will introduce two important algorithms for estimating the action value function: SARSA and Q-learning.
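Looking back at the previous post, we now know how to answer ‘What is the best state?’ by estimating the state value function $V(s)$, but we still cannot answer ‘What is the best action?’. Picking the best action from $V(s)$ alone would require a one-step lookahead over the possible next states, and we don’t know the transition probability $P(s' \mid s, a)$, so we can’t compute the optimal policy directly. Instead, we need to estimate the action value function $Q(s, a)$, from which the best action in a state is simply $\arg\max_a Q(s, a)$.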
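To make this concrete, here is a minimal sketch of reading the greedy action straight from a tabular estimate of the action value function, with no transition probabilities involved. The table layout, its values, and the name `greedy_action` are assumptions made for illustration only.

```python
def greedy_action(Q, state):
    """Return the index of the action with the highest estimated value in `state`.

    Q is assumed to be a list of rows, one row per state, one entry per action.
    Ties are broken in favor of the lowest action index.
    """
    row = Q[state]
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical Q-table with 3 states and 2 actions (values are made up).
Q = [
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.0],
]

# Greedy policy: one action index per state, read directly from Q.
policy = [greedy_action(Q, s) for s in range(len(Q))]
```

Doing the same with only a state value table would need a model of the environment to look one step ahead, which is exactly what is missing here.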
SARSA
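For any transition $(s, a, r, s', a')$ (state, action, reward, next state, next action) executed by the policy $\pi$, we can update the action value function as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.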
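A single SARSA update step might be sketched in Python as follows. This is only an illustration under stated assumptions: the table is a dictionary keyed by (state, action) pairs, and the function name `sarsa_update`, the default values of `alpha` and `gamma`, and the example states and actions are all hypothetical.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update to Q in place and return the TD error.

    Rule: Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a)).
    The target uses a_next, the action the policy actually took in s_next.
    """
    target = r + gamma * Q[(s_next, a_next)]
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical table: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0

# One observed transition (s, a, r, s', a'); all values are made up.
td_error = sarsa_update(Q, "s0", "right", r=0.5, s_next="s1", a_next="right")
```

In this example the target is 0.5 + 0.9 × 1.0 = 1.4, so the estimate for ("s0", "right") moves from 0 to roughly 0.14, a fraction `alpha` of the way toward the target.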