Foundation of Reinforcement learning(III)

Posted 05-17-2026 · Updated 05-17-2026 · blog · 12 minutes read (about 1791 words)

Introduction

In the previous post, we introduced the Bellman equation and the linear programming formulation for MDPs. In this post, we discuss model-based reinforcement learning: we first solve a known MDP with dynamic programming, and then handle the case where the model of the environment is not given by learning it from data.

The Settings of RL

Typically, RL is framed as an MDP: the agent explores the environment and learns the optimal policy.
In general, we can only observe episodes, and we usually do not have the model of the environment.

So we need to bring a model into our RL project. Model-based RL solves the MDP by working with a model of the environment, either given in advance or learned from data.

Dynamic Programming Based RL

Dynamic Programming for finite MDP

Our objective is simply to maximize the expected discounted return, where $R_{t} = R(s_{t}, a_{t})$ is the reward at step $t$: $\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid s_{0} = s\right]$

In the first post of this series, we introduced backward induction, which starts from the last step and solves the problem recursively, backward in time. However, this method is inefficient for large state spaces, and it requires the state transition probabilities.

Alternatively, we can use the Bellman equation of the value function to tackle the problem: $V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \left[ \underbrace{R(s, a)}_{\text{immediate reward}} + \underbrace{\gamma}_{\text{discount}} \sum_{s' \in S} \underbrace{P(s' \mid s, a)}_{\text{transition}} \underbrace{V^{\pi}(s')}_{\text{future value}} \right]$
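To make the equation concrete, here is a minimal Python sketch of the Bellman expectation backup. The two-state, two-action MDP (`P`, `R`), the uniform policy `pi`, and `gamma = 0.5` are hypothetical numbers chosen for illustration; they do not come from this post.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a][s2] = transition probability, R[s][a] = immediate reward.
P = [
    [[1.0, 0.0], [0.2, 0.8]],  # state 0: action 0 stays, action 1 mostly moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 returns to state 0
]
R = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.5
pi = [[0.5, 0.5], [0.5, 0.5]]  # pi[s][a]: uniform random policy (an assumption)


def expectation_backup(V, s):
    """Right-hand side of the Bellman expectation equation at state s."""
    total = 0.0
    for a in range(len(R[s])):
        future = sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))
        total += pi[s][a] * (R[s][a] + gamma * future)
    return total


# Applying the backup repeatedly drives V toward V^pi, the fixed point of the equation.
V = [0.0, 0.0]
for _ in range(200):
    V = [expectation_backup(V, s) for s in range(len(V))]

# At the fixed point, one more backup leaves V (almost) unchanged.
residual = max(abs(expectation_backup(V, s) - V[s]) for s in range(len(V)))
print(V, residual)
```

Solving the two linear equations by hand for this toy MDP gives $V^{\pi}(0) = 23/19$ and $V^{\pi}(1) = 33/19$, which the loop should approach.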

Optimal Value Function

For a state $s$, we define the optimal value function as the maximum of the value function over all policies: $V^{*}(s) = \max_{\pi} V^{\pi}(s)$
It satisfies the Bellman optimality equation: $V^{*}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{*}(s') \right]$

The best policy can then be derived from the optimal value function by acting greedily, that is, by putting all probability on a maximizing action: $\pi^{*}(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{*}(s') \right]$

For any state $s$ and any policy $\pi$, we have: $V^{*}(s) = V^{\pi^{*}}(s) \geq V^{\pi}(s)$
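The greedy extraction of a policy from the optimal value function can be sketched as follows. The toy MDP and `gamma` are hypothetical, and `V_star` holds optimal values worked out by hand for exactly this toy MDP; the helper names `q_value` and `greedy_policy` are illustrative.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.5
V_star = [26 / 9, 4.0]  # optimal values of this toy MDP, derived by hand


def q_value(V, s, a):
    """One-step lookahead: R(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))


def greedy_policy(V):
    """For every state, pick the action with the largest one-step lookahead value."""
    return [
        max(range(len(R[s])), key=lambda a: q_value(V, s, a))
        for s in range(len(V))
    ]


pi_star = greedy_policy(V_star)
print(pi_star)
```

In this example the greedy policy moves from state 0 toward state 1 and then stays there, because staying in state 1 earns the largest recurring reward.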

The value function and the policy depend on each other, so we can iterate on them until convergence: either we iterate on the value function directly, or we alternate between evaluating and improving a policy. These two schemes are called Value Iteration and Policy Iteration respectively.

Value Iteration

For an MDP with finite state and action spaces, we can use value iteration to solve the problem. Value iteration proceeds as follows:

1. Initialize $V(s) = 0$ for all $s \in S$.
2. For each state $s \in S$, update the value function as:
$V(s) \leftarrow \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right]$
3. Repeat step 2 until convergence, that is, until the values stop changing up to a small tolerance.
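The three steps above can be sketched in a few lines of Python. The MDP arrays, `gamma`, and the stopping tolerance `tol` are hypothetical choices for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Value iteration for a finite MDP with P[s][a][s2] and R[s][a]."""
    n_states = len(R)
    V = [0.0] * n_states                      # step 1: initialize V(s) = 0
    while True:
        delta = 0.0
        for s in range(n_states):             # step 2: back up every state
            best = max(
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(len(R[s]))
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                       # step 3: stop once values barely change
            return V


# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]

V = value_iteration(P, R, gamma=0.5)
print(V)
```

For this toy MDP the optimal values can be checked by hand: $V^{*}(1) = 2 / (1 - 0.5) = 4$ and $V^{*}(0) = 26/9$.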

NOTE: The states do not have to be updated in any specific order. Any order works, although the convergence rate may differ.

Sync & Async Value Iteration

Synchronous value iteration needs to store two copies of the value function:

1. For every state $s$, compute the new value from the old copy:
$V_{\text{new}}(s) \leftarrow \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_{\text{old}}(s') \right]$
2. After updating all states, copy the new value function to the old one:
$V_{\text{old}} \leftarrow V_{\text{new}}$

Asynchronous value iteration only needs to store one copy of the value function, which is updated in place:
$V(s) \leftarrow \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right]$
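The difference between the two variants shows up after a single sweep. The sketch below uses a hypothetical toy MDP, and the update order `[1, 0]` for the asynchronous sweep is an arbitrary choice for illustration.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.5


def backup(V, s):
    """Bellman optimality backup at state s, reading values from V."""
    return max(
        R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))
        for a in range(len(R[s]))
    )


def sync_sweep(V_old):
    """Synchronous sweep: every state reads from the old copy only."""
    return [backup(V_old, s) for s in range(len(V_old))]


def async_sweep(V, order):
    """Asynchronous sweep: one array, so later states see earlier updates."""
    V = list(V)  # copy so the caller's list is untouched; updates below are in place
    for s in order:
        V[s] = backup(V, s)
    return V


V0 = [0.0, 0.0]
after_sync = sync_sweep(V0)
after_async = async_sweep(V0, [1, 0])  # state 1 first, so state 0 already sees its new value
print(after_sync, after_async)

# Both variants still converge to the same fixed point.
V_sync, V_async = V0, V0
for _ in range(60):
    V_sync = sync_sweep(V_sync)
    V_async = async_sweep(V_async, [1, 0])
print(V_sync, V_async)
```

After one sweep, the asynchronous version already propagates the reward of state 1 into state 0, which is why the update order can affect the convergence rate but not the limit.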

Policy Iteration

The assumption on the MDP is the same as for value iteration: both the state space and the action space are finite. Policy iteration proceeds as follows:

1. Randomly initialize a policy $\pi$ and set $V(s) = 0$ for all $s \in S$.
2. Repeat the following steps until convergence, that is, until the policy no longer changes:
(a) Policy Evaluation: for each state $s \in S$, update the value function as:
$V(s) \leftarrow \sum_{a \in A} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right]$
and repeat this sweep until $V$ converges to $V^{\pi}$.
(b) Policy Improvement: for each state $s \in S$, make the policy greedy with respect to $V$:
$\pi(s) \leftarrow \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right]$
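A minimal sketch of the evaluation and improvement loop is given below. As simplifying assumptions, the policy is stored as one deterministic action per state, the initial policy is fixed instead of random so the run is reproducible, and the toy MDP and tolerance are hypothetical.

```python
def evaluate(policy, P, R, gamma, tol=1e-10):
    """Policy evaluation: sweep the expectation backup until V stops changing."""
    n = len(R)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            a = policy[s]  # deterministic policy: one action per state
            v = R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def improve(V, P, R, gamma):
    """Policy improvement: act greedily with respect to V."""
    n = len(R)

    def lookahead(s, a):
        return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))

    return [max(range(len(R[s])), key=lambda a: lookahead(s, a)) for s in range(n)]


def policy_iteration(P, R, gamma):
    policy = [0] * len(R)  # fixed initial policy (the text initializes it randomly)
    while True:
        V = evaluate(policy, P, R, gamma)
        new_policy = improve(V, P, R, gamma)
        if new_policy == policy:  # stable policy: stop
            return policy, V
        policy = new_policy


# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]

policy, V = policy_iteration(P, R, gamma=0.5)
print(policy, V)
```

On this toy MDP the policy changes once and is then stable, so the outer loop stops after two evaluations.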

Each iteration of Policy Iteration is more expensive than a sweep of Value Iteration, since the policy has to be evaluated in every iteration. However, Policy Iteration usually needs fewer iterations to converge, since every iteration performs a full policy improvement step.

Let’s contrast the two methods:

Method	Value Iteration	Policy Iteration
Updates	Value function	Policy and value function
Uses	Bellman optimality equation	Bellman expectation equation

NOTE:

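Value iteration is greedy: every update takes the best action under the current value estimates.

In Policy Iteration, updating the value function with the Bellman expectation equation (policy evaluation) is the expensive part.

As a rule of thumb, Policy Iteration tends to be faster on MDPs with small state spaces, while Value Iteration tends to be faster on MDPs with large state spaces.

If the state transitions contain no cycles, Value Iteration is the better choice: updating the states in reverse order of the transitions yields the exact values in a single sweep, just like backward induction.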

Bellman operators

We already met the Bellman operators in the previous post, but we have not discussed them in detail.

Why do Policy Iteration and Value Iteration converge to the optimal value function? The key is that the Bellman operators are contraction mappings.

The term "Bellman operators" refers to the following two operators:

Bellman expectation operator, usually denoted as $T^{\pi}$, which is defined as:

$(T^{\pi} V)(s) = \sum_{a \in A} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right]$

Bellman optimality operator, usually denoted as $T^{*}$ or $T$, which is defined as:

$(T V)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right]$

The two operators play different roles:

The expectation operator is used in policy iteration to compute the value function of a given policy. It forms the inner loop of policy iteration.
The optimality operator is used in value iteration to compute the optimal value function. It forms the main loop of value iteration.
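Viewed as code, each operator is a function that maps a whole value vector to a new one. The sketch below writes both operators for a hypothetical toy MDP and numerically checks the contraction property in the max norm on one arbitrary pair of vectors `V` and `U`; the policy, the MDP, and the two vectors are made-up inputs.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.5
pi = [[0.5, 0.5], [0.5, 0.5]]  # pi[s][a]: uniform random policy (an assumption)


def lookahead(V, s, a):
    """R(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))


def T_pi(V):
    """Bellman expectation operator: average the lookahead over pi."""
    return [
        sum(pi[s][a] * lookahead(V, s, a) for a in range(len(R[s])))
        for s in range(len(V))
    ]


def T_opt(V):
    """Bellman optimality operator: maximize the lookahead over actions."""
    return [max(lookahead(V, s, a) for a in range(len(R[s]))) for s in range(len(V))]


def max_norm_dist(V, U):
    return max(abs(v - u) for v, u in zip(V, U))


# Two arbitrary value vectors: applying either operator shrinks their distance
# by at least a factor of gamma.
V, U = [10.0, -3.0], [0.0, 5.0]
before = max_norm_dist(V, U)
after_pi = max_norm_dist(T_pi(V), T_pi(U))
after_opt = max_norm_dist(T_opt(V), T_opt(U))
print(before, after_pi, after_opt)
```

A single pair of vectors is of course only a sanity check, not a proof of the contraction property.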

Both the Bellman expectation operator and the Bellman optimality operator can be defined on the state value function and on the action value function:

Bellman expectation operator on V-function:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid s_{0} = s\right] = \mathbb{E}_{\pi}\left[R(s_{0}, a_{0}) + \gamma V^{\pi}(s_{1}) \mid s_{0} = s\right] = \sum_{a \in A} \pi(a \mid s) \left[R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s')\right] = (T^{\pi} V^{\pi})(s)$

Bellman optimality operator on V-function:

$V^{*}(s) = \max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid s_{0} = s\right] = \max_{a \in A} \left[R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{*}(s')\right] = (T V^{*})(s)$

Bellman expectation operator on Q-function:

$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid s_{0} = s, a_{0} = a\right] = \mathbb{E}_{\pi}\left[R(s_{0}, a_{0}) + \gamma Q^{\pi}(s_{1}, a_{1}) \mid s_{0} = s, a_{0} = a\right] = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s') Q^{\pi}(s', a') = (T^{\pi} Q^{\pi})(s, a)$

Bellman optimality operator on Q-function:

$Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid s_{0} = s, a_{0} = a\right] = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^{*}(s', a') = (T Q^{*})(s, a)$

Due to the contraction property of the Bellman operators, we can guarantee the convergence of the value iteration and policy iteration to the optimal value function.
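To spell out the contraction property used here: for any two value functions $V$ and $U$ and any discount factor $\gamma < 1$, both operators satisfy the following bound in the max norm $\|V\|_{\infty} = \max_{s \in S} |V(s)|$. This is the standard statement, written in the notation of this post:

$\|T^{\pi} V - T^{\pi} U\|_{\infty} \leq \gamma \|V - U\|_{\infty}, \qquad \|T V - T U\|_{\infty} \leq \gamma \|V - U\|_{\infty}$

For $T^{\pi}$, the rewards cancel in the difference and what remains is $\gamma$ times a weighted average of $V(s') - U(s')$, whose absolute value is at most $\gamma \|V - U\|_{\infty}$. For $T$, the same argument applies after using $|\max_{a} f(a) - \max_{a} g(a)| \leq \max_{a} |f(a) - g(a)|$.

By the Banach fixed-point theorem, each operator then has a unique fixed point ($V^{\pi}$ for $T^{\pi}$ and $V^{*}$ for $T$), and repeated application from any starting point $V_{0}$ converges geometrically:

$\|T^{k} V_{0} - V^{*}\|_{\infty} \leq \gamma^{k} \|V_{0} - V^{*}\|_{\infty}$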

Model-Based RL

In the sections above, the environment is a known MDP: all our methods assume that we have the model of the environment, namely the transition probabilities and the reward function. However, in many real-world scenarios we do not have this model, so we need to learn it from data.

To learn the model of the environment, we estimate two quantities:

1. The state transition probability $P(s' \mid s, a)$, estimated from counts:

$\hat{P}(s' \mid s, a) = \frac{N(s, a, s')}{N(s, a)}$

where $N(s, a, s')$ is the number of times we have observed the transition from state $s$ to state $s'$ when taking action $a$, and $N(s, a)$ is the number of times we have observed taking action $a$ in state $s$.
2. The reward function $R(s, a)$, estimated by averaging:

$\hat{R}(s, a) = \frac{1}{N(s, a)} \sum_{t : s_{t} = s, a_{t} = a} R_{t}$

that is, $\hat{R}(s, a)$ is the average reward we have observed when taking action $a$ in state $s$.
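Both estimates reduce to counting. The sketch below assumes the data is a list of `(s, a, r, s_next)` tuples; the function name `estimate_model` and the five sample transitions are made up for illustration.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Count-based estimates of P(s' | s, a) and R(s, a) from (s, a, r, s_next) tuples."""
    n_sa = defaultdict(int)         # N(s, a)
    n_sas = defaultdict(int)        # N(s, a, s')
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
    # P_hat is keyed by (s, a, s'), R_hat by (s, a); unseen pairs are simply absent.
    P_hat = {key: count / n_sa[key[:2]] for key, count in n_sas.items()}
    R_hat = {sa: reward_sum[sa] / n for sa, n in n_sa.items()}
    return P_hat, R_hat


# Made-up observations: action 1 in state 0 reached state 1 in three of four tries.
data = [
    (0, 1, 1.0, 1),
    (0, 1, 1.0, 1),
    (0, 1, 1.0, 0),
    (0, 1, 1.0, 1),
    (1, 0, 2.0, 1),
]
P_hat, R_hat = estimate_model(data)
print(P_hat)
print(R_hat)
```

State-action pairs that never appear in the data get no estimate at all here, so in practice one needs either enough exploration or some default for unseen pairs.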

A simple algorithm built on these estimates is as follows:

1. Randomly initialize a policy $\pi$.
2. Repeat the following steps until convergence:
(a) Collect data by executing the policy $\pi$ in the environment, and store the transition data in a replay buffer.
(b) Learn the model of the environment from the replay buffer, that is, estimate the state transition probability and the reward function.
(c) Solve the MDP with the learned model (for example by value iteration or policy iteration) to get its optimal policy $\pi^{*}$, and use it as the new $\pi$.
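The loop above can be sketched end to end on a simulated environment. Everything specific here is an assumption made for illustration: the hidden toy MDP (`TRUE_P`, `TRUE_R`), epsilon-greedy exploration with `epsilon=0.2` so that every action gets tried, a zero-reward self-loop as the default for unseen state-action pairs, and a fixed number of rounds in place of a convergence check.

```python
import random

# True environment, hidden from the learner (hypothetical toy MDP).
TRUE_P = [[[1.0, 0.0], [0.2, 0.8]], [[0.0, 1.0], [1.0, 0.0]]]
TRUE_R = [[0.0, 1.0], [2.0, 0.0]]
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.5


def env_step(s, a, rng):
    """Sample one transition from the true environment."""
    s_next = 0 if rng.random() < TRUE_P[s][a][0] else 1
    return TRUE_R[s][a], s_next


def collect(policy, n_steps, epsilon, rng):
    """Run an epsilon-greedy version of `policy`; return (s, a, r, s_next) tuples."""
    buffer, s = [], 0
    for _ in range(n_steps):
        explore = rng.random() < epsilon
        a = int(rng.random() * N_ACTIONS) if explore else policy[s]
        r, s_next = env_step(s, a, rng)
        buffer.append((s, a, r, s_next))
        s = s_next
    return buffer


def fit_model(buffer):
    """Count-based model; unseen (s, a) pairs default to a zero-reward self-loop."""
    counts = [[[0] * N_STATES for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
    r_sum = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s, a, r, s_next in buffer:
        counts[s][a][s_next] += 1
        r_sum[s][a] += r
    P_hat = [[None] * N_ACTIONS for _ in range(N_STATES)]
    R_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            n = sum(counts[s][a])
            if n == 0:
                P_hat[s][a] = [1.0 if s2 == s else 0.0 for s2 in range(N_STATES)]
            else:
                P_hat[s][a] = [c / n for c in counts[s][a]]
                R_hat[s][a] = r_sum[s][a] / n
    return P_hat, R_hat


def solve(P, R, sweeps=100):
    """Value iteration on the learned model, then greedy policy extraction."""
    def q(V, s, a):
        return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_STATES))

    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [max(q(V, s, a) for a in range(N_ACTIONS)) for s in range(N_STATES)]
    policy = [max(range(N_ACTIONS), key=lambda a: q(V, s, a)) for s in range(N_STATES)]
    return policy, V


rng = random.Random(0)
policy, buffer = [0, 0], []
for _ in range(3):  # a fixed number of rounds stands in for a convergence check
    buffer += collect(policy, 2000, epsilon=0.2, rng=rng)  # (a) collect data
    P_hat, R_hat = fit_model(buffer)                        # (b) learn the model
    policy, V = solve(P_hat, R_hat)                         # (c) solve the learned MDP
print(policy, V)
```

With a few thousand transitions, the learned model is accurate enough that the resulting policy matches the optimal policy of the true toy MDP, even though the learner never reads `TRUE_P` or `TRUE_R` directly.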

Another approach is to skip learning the MDP and instead learn the value function directly from the data. This is called model-free RL, and we will discuss it in the next post.

Conclusion

In this post, we introduced model-based reinforcement learning, which solves the MDP with the help of a model of the environment. We discussed value iteration and policy iteration, two basic methods for solving a known MDP. We also introduced the Bellman operators, whose contraction property guarantees the convergence of value iteration and policy iteration. Finally, for the case where the model is not given, we introduced a simple algorithm that learns the model of the environment from data and then solves the MDP with the learned model. In the next post, we will discuss model-free reinforcement learning, which learns the value function directly from the data without learning the model of the environment.

https://hjcheng0602.github.io/blog/Foundation-of-Reinforcement-learning-III/

Author: Jincheng Han

Posted on 05-17-2026

Updated on 05-17-2026

#Reinforcement learning review notes
