Foundation of Reinforcement learning(V)

05-18-2026 blog

Introduction

In the previous post, we introduced two ways to estimate the value function: Monte Carlo (MC) and temporal-difference (TD) learning. In this post, we introduce two important algorithms for estimating the action-value function: SARSA and Q-learning.

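Looking back at the previous post, we can now answer ‘What is the best state?’ by estimating the state-value function $V^{\pi}(S_{t})$, but we still can’t answer ‘What is the best action?’. Picking the greedy action from $V^{\pi}$ requires a one-step lookahead through the model: $\pi(s) = \arg\max_{a \in A} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$, where $R(s, a, s')$ is the immediate reward and $\gamma$ is the discount factor. We don’t know the transition probability $P(s' \mid s, a)$, so we can’t compute this greedy policy directly. Instead, we estimate the action-value function $Q^{\pi}(s, a)$, from which the greedy action is simply $\arg\max_{a \in A} Q^{\pi}(s, a)$, with no model required.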
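To make the point concrete, here is a minimal sketch of reading a greedy policy straight off a tabular action-value estimate. The table `Q`, its numbers, and the helper `greedy_action` are made up for illustration; nothing here is learned yet.

```python
def greedy_action(Q, s):
    """Return the index of the highest-valued action in state s (ties go to the lowest index)."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

# Hypothetical table: Q[s][a] for 2 states and 3 actions.
Q = [
    [0.2, 0.7, 0.1],  # state 0
    [0.0, 0.0, 0.5],  # state 1
]

# No transition probabilities are needed: the policy is read off Q alone.
policy = [greedy_action(Q, s) for s in range(len(Q))]
```

With these numbers, `policy` selects action 1 in state 0 and action 2 in state 1.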

SARSA

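For any (state, action, reward, next state, next action) tuple $(S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1})$ generated by following the policy $\pi$, we can update the action-value function as follows: $Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_{t}, A_{t}) \right]$, where $\alpha$ is the learning rate and $\gamma$ is the discount factor. The five elements of this tuple give the algorithm its name.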
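The update rule can be sketched as a single function on a tabular `Q`. This is a minimal illustration, not a full agent: the function name `sarsa_update`, the list-of-lists table, the `done` flag for terminal transitions, and the numbers in the usage lines are all assumptions made for the example.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one SARSA update to Q[s][a] in place and return the TD error."""
    # Assumption of this sketch: a terminal next state contributes no bootstrap value.
    target = r if done else r + gamma * Q[s_next][a_next]
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Hypothetical 2-state, 2-action table and a single observed transition.
Q = [[0.0, 0.0], [0.0, 1.0]]
delta = sarsa_update(Q, s=0, a=1, r=0.5, s_next=1, a_next=1, alpha=0.1, gamma=0.9)
```

Here the TD error is $0.5 + 0.9 \cdot 1.0 - 0 = 1.4$, so `Q[0][1]` moves from 0 to about 0.14. Note that `a_next` is the action the policy actually takes in `s_next`, which is what makes the update on-policy.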

https://hjcheng0602.github.io/blog/Foundation-of-Reinforcement-learning-V/

Author: Han Jincheng

Posted on 05-18-2026

Updated on 05-18-2026