Two and One

Posted May 18, 2026 | Updated May 18, 2026 | blog | 16 minutes read (About 2439 words)

Foundation of Reinforcement learning(IV)

Introduction

In the previous posts, we introduced the MDP and how to solve it. In practice, however, we often do not have a model of the environment (its transition probabilities and reward function), so we cannot apply that knowledge directly. One option is to learn a model from data, which is called model-based Reinforcement learning. In this post, we focus instead on model-free Reinforcement learning, which requires neither a model of the environment nor the construction of an MDP.

Estimating Value Functions

In model-based RL, value functions can be computed by DP methods as follows: $V^{π}(s) = E_{π}[R(s_{0}, a_{0}) + γ R(s_{1}, a_{1}) + γ^{2} R(s_{2}, a_{2}) + \dots ∣ s_{0} = s] = \sum_{a} π(a ∣ s) [R(s, a) + γ \sum_{s'} P(s' ∣ s, a) V^{π}(s')]$

However, in model-free RL, we cannot directly access the $P (s ’∣ s, a)$ and $R (s, a)$ , but we have some ways to estimate the value function from episodes of experience.

Why do we estimate the value function?
Because we can use the value function to derive the optimal policy, which is our ultimate goal. In addition, the value function lets us reuse historical experience to make better decisions in the future, which is the essence of Reinforcement learning.

Here is a figure that gives an overview of some methods to estimate the value function:

[Figure: value estimation. Slide credit: David Silver]

Monte Carlo methods

Target: Learn $V^{π}$ from episodes of experience.

Review: the cumulative reward function (return):

$G_{t} = R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + \dots = \sum_{k = 0}^{\infty} γ^{k} R_{t + k}$

Review: the value function is the expected return: $V^{π}(s) = E_{π}[G_{t} ∣ s_{t} = s] \approx \frac{1}{N(s)} \sum_{i = 1}^{N(s)} G_{t}^{i}$
The Monte Carlo method uses the empirical mean of the cumulative reward, instead of the expected return, to estimate the value function.

First-visit Monte Carlo method

The first-visit Monte Carlo method estimates the value function by averaging the returns following the first time a state is visited in an episode. The algorithm is as follows:

Initialization:
- For any $s \in S$ , $V (s) \in R$ (arbitrary).
- For any $s \in S$ , $returns (s) = \emptyset$ .
Loop for each episode:
- Generate an episode following policy $π$ : $S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, \dots, S_{T}$ .
- $G \leftarrow 0$
- For t = T-1, T-2, …, 0:
  - $G \leftarrow γ G + R_{t + 1}$
  - If $S_{t}$ does not appear in $S_{0}, \dots, S_{t - 1}$ (this is the first visit to $S_{t}$ in the episode):
    - Append $G$ to $returns (S_{t})$ .
    - $V (S_{t}) \leftarrow average (returns (S_{t}))$ .
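The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the toy episodes are assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow first visits.

    episodes: list of episodes; each episode is a list of (state, reward)
    pairs (an assumed format), with reward playing the role of R_{t+1}.
    """
    returns = defaultdict(list)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so that G accumulates the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # Record G only if the state does not occur earlier in the episode.
            if state not in states[:t]:
                returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Hypothetical toy data: two short episodes over states "A" and "B".
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 1.0), ("A", 3.0)],
]
V = first_visit_mc(episodes, gamma=0.5)
```

In the first episode the second visit to "A" is ignored, so only the return from time step 0 counts for "A" in that episode.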

We call it "first-visit" because, within an episode, only the return following the first visit to a state is used. These returns are independent across episodes, so their average is an unbiased estimate of $V^{π}(s)$ ; counting every visit would reuse correlated returns from the same episode and introduce bias. The price is high variance, because each episode contributes at most one return per state.

Incremental Monte Carlo method

The first-visit Monte Carlo method takes a lot of memory to store the returns for each state, which is not efficient. The incremental Monte Carlo method uses an incremental update rule to estimate the value function without storing all the returns. The algorithm is as follows:

Initialization:
- For any $s \in S$ , $V (s) \in R$ (arbitrary), $N (s) = 0$ .
Loop for each episode:
- Generate an episode following policy $π$ : $S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, \dots, S_{T}$ .
- $G \leftarrow 0$
- For t = T-1, T-2, …, 0:
  - $G \leftarrow γ G + R_{t + 1}$
  - If $S_{t}$ does not appear in $S_{0}, \dots, S_{t - 1}$ (this is the first visit to $S_{t}$ in the episode):
    - $N (S_{t}) \leftarrow N (S_{t}) + 1$
    - $V (S_{t}) \leftarrow V (S_{t}) + \frac{1}{N ( S_{t} )} (G - V (S_{t}))$ .

Interestingly, online softmax relies on the same kind of running update as the incremental Monte Carlo method: it maintains its statistics one input at a time instead of storing all inputs.

Besides, incremental MC provides more design space for us to tackle some problems in practice. For example, we can use a constant step size $α$ instead of $\frac{1}{N ( s )}$ to update the value function, which is called constant step size MC method. It is useful when the environment is non-stationary, which means the reward function and transition probability may change over time. In this case, we want to give more weight to recent returns than old returns, which can be achieved by using a constant step size $α$ : $V (s) \leftarrow V (s) + α (G - V (s))$
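To make the two update rules concrete, here is a small sketch. The function names and the list of returns are made up for illustration. It shows that the $\frac{1}{N(s)}$ rule reproduces the batch average exactly, while a constant $α$ weights recent returns more heavily.

```python
def incremental_update(value, count, G):
    """One incremental-mean step: V <- V + (G - V) / N."""
    count += 1
    value += (G - value) / count
    return value, count

def constant_alpha_update(value, G, alpha):
    """One constant-step-size step: V <- V + alpha * (G - V)."""
    return value + alpha * (G - value)

# Hypothetical returns observed for a single state.
returns = [4.0, 2.0, 6.0, 0.0]

v_mean, n = 0.0, 0
v_const = 0.0
for G in returns:
    v_mean, n = incremental_update(v_mean, n, G)
    v_const = constant_alpha_update(v_const, G, alpha=0.5)

batch_mean = sum(returns) / len(returns)
```

With `alpha=0.5`, the last return carries weight 0.5 on its own, so `v_const` ends up closer to the most recent return than `v_mean` does.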

Some properties of Monte Carlo methods

Monte Carlo methods are model-free, which means they do not require knowledge of the environment’s dynamics (transition probabilities and reward function).
Monte Carlo methods take the simplest approach to estimating the value function, which is to average the returns obtained by following the policy. However, this estimate may have high variance, because a return sums many random rewards and each episode contributes at most one return per state.
One key point is that Monte Carlo methods can only be applied to episodic MDPs, which means every episode must terminate so that the return can be computed.

Importance sampling

Suppose we want to estimate the expectation of $f (x)$ under a distribution $p (x)$ , but we can only sample from another distribution $q (x)$ , with $q (x) > 0$ wherever $p (x) f (x) \neq 0$ . Then: $E_{x \sim p} [f (x)] = \int f (x) p (x) d x = \int f (x) \frac{p ( x )}{q ( x )} q (x) d x = E_{x \sim q} [f (x) \frac{p ( x )}{q ( x )}]$
Defining the importance sampling weight $w (x) = \frac{p ( x )}{q ( x )}$ , we can rewrite the above equation as: $E_{x \sim p} [f (x)] = E_{x \sim q} [f (x) w (x)]$
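A quick numerical check of this identity, as a sketch: the distributions `p` and `q`, the function `f`, and the sample size are all arbitrary choices for illustration. We sample only from `q` and reweight by $w(x)$ to estimate the expectation under `p`.

```python
import random

# Hypothetical discrete distributions over {0, 1, 2}.
p = {0: 0.1, 1: 0.2, 2: 0.7}   # target distribution
q = {0: 0.4, 1: 0.4, 2: 0.2}   # proposal distribution we sample from

def f(x):
    return x * x

# Exact expectation under p, for comparison.
exact = sum(p[x] * f(x) for x in p)

def importance_sampling_estimate(n_samples, seed=0):
    """Estimate E_{x~p}[f(x)] using samples drawn from q only."""
    rng = random.Random(seed)
    xs = rng.choices(list(q), weights=list(q.values()), k=n_samples)
    # Each sample is weighted by w(x) = p(x) / q(x).
    return sum(f(x) * p[x] / q[x] for x in xs) / n_samples

estimate = importance_sampling_estimate(100_000)
```

The estimate fluctuates around the exact value; the spread grows when `q` puts little mass where `p(x) f(x)` is large.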

Off-policy Monte Carlo methods via Importance Sampling

We can use the cumulative reward collected under a behavior policy $μ$ to evaluate a target policy $π$ : we weight the cumulative reward by the importance ratio between $π$ and $μ$ and use the result to estimate the value function of policy $π$ . The algorithm is as follows:

The return of every episode is multiplied by the importance sampling ratio: $G_{t}^{π / μ} = \frac{π ( A_{t} ∣ S_{t} )}{μ ( A_{t} ∣ S_{t} )} \frac{π ( A_{t + 1} ∣ S_{t + 1} )}{μ ( A_{t + 1} ∣ S_{t + 1} )} \dots \frac{π ( A_{T - 1} ∣ S_{T - 1} )}{μ ( A_{T - 1} ∣ S_{T - 1} )} G_{t}$

So we then update the value function by: $V (s) \leftarrow V (s) + \frac{1}{N ( s )} (G_{t}^{π / μ} - V (s))$

Importance sampling can significantly increase the variance of the return, because the product of ratios can become very large when $π$ and $μ$ are very different.
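The importance-weighted return can be sketched as below. The trajectory format (a list of `(state, action, reward)` triples starting at time $t$) and the dictionary representation of the two policies are assumptions for the example.

```python
def importance_weighted_return(trajectory, pi, mu, gamma=1.0):
    """Return rho * G_t for a trajectory collected under policy mu.

    trajectory: list of (state, action, reward) from time t to T-1.
    pi, mu: dicts mapping (state, action) to a probability (assumed format).
    """
    rho = 1.0        # product of per-step ratios pi / mu
    G = 0.0          # discounted return
    discount = 1.0
    for s, a, r in trajectory:
        rho *= pi[(s, a)] / mu[(s, a)]
        G += discount * r
        discount *= gamma
    return rho * G

# Hypothetical two-step trajectory and policies.
trajectory = [("s0", "L", 1.0), ("s1", "R", 2.0)]
pi = {("s0", "L"): 0.9, ("s1", "R"): 0.5}
mu = {("s0", "L"): 0.5, ("s1", "R"): 0.5}
weighted = importance_weighted_return(trajectory, pi, mu)
```

Because `rho` is a product over all remaining steps, a few large per-step ratios are enough to blow up the weighted return on long episodes.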

Temporal-Difference Learning

Temporal-Difference (TD) learning combines the MC method and the DP method. Its name comes from the fact that it uses the difference between the estimated values at two consecutive time steps to update the value function. There are two key ideas in TD learning: the TD error and the TD target.

For the state value function $V$ , after a transition from state $s$ to state $s'$ with reward $r$ , the TD error is defined as: $δ = r + γV (s') - V (s)$
The TD target is defined as: $\hat{V} = r + γV (s')$

In terms of the Bellman expectation equation, the TD target is a single-sample estimate of the expectation on its right-hand side.

Some details of TD learning

The simplest TD learning algorithm is called TD(0), which updates the value function with the TD error at each time step. The key equation of TD(0) is as follows: $V (s) \leftarrow V (s) + α δ = V (s) + α (r + γV (s') - V (s))$

Why do we update like this?
The Bellman expectation equation can be rewritten as:

$E_{π} [R_{t + 1} + γ V^{π} (S_{t + 1}) - V^{π} (S_{t}) ∣ S_{t} = s] = 0$
In other words, under the true value function $V^{π}$ the expected TD error is zero in every state. A nonzero average TD error therefore signals that the estimate $V$ is off, and moving $V (s)$ a small step $α$ in the direction of $δ$ pushes the estimate toward the true value function.
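A single TD(0) update can be sketched as follows. The table `V`, the state names, and the `terminal` flag are illustrative assumptions; the flag simply treats the value of a terminal next state as zero.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # TD target: r + gamma * V(s'), with V(s') = 0 if s' is terminal.
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]          # TD error
    V[s] += alpha * delta
    return delta

V = defaultdict(float)             # all values start at 0
V["B"] = 1.0                       # hypothetical current estimate for "B"
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

Note that the update only needs one transition `(s, r, s_next)`, so it can be applied online while the episode is still running.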

The TD method introduces the idea of bootstrapping, which means we use the current estimate of the value function to update the value function. This is different from the MC method, which uses the actual return. Bootstrapping can significantly reduce the variance of the update target, but it may introduce bias, because the target itself depends on an estimate.

Contrast between TD and MC methods

They have the same goal: Learn the value function from episodes of experience. However, they have different approaches to achieve this goal. The MC method uses the actual return to update the value function, which can have high variance but no bias. The TD method uses the estimated value function to update the value function, which can have low variance but may introduce bias.

TD method	MC method
update value function $V (s)$ like $V (s) \leftarrow V (s) + α (r + γV (s ’) - V (s))$	update value function $V (s)$ like $V (s) \leftarrow V (s) + \frac{1}{N ( s )} (G_{t} - V (s))$

The target of TD is $R_{t} + γV (s_{t + 1})$ , which is called the TD target, while the target of MC is $G_{t}$ , which is the actual return. The TD method’s error is called the TD error, which is defined as $δ = r + γV (s') - V (s)$ , while the MC method’s error is defined as $G_{t} - V (s)$ .

The strengths and limitations of TD learning and MC learning

The TD method can learn before the end of an episode:

After each step in an episode, the TD method can update the value function using the current estimate, which means it learns before the episode ends. The MC method, in contrast, can only update the value function after the end of an episode, because it needs the full return.
The TD method can also learn from incomplete episodes, that is, episodes that have not terminated or never terminate. The MC method can only learn from complete episodes.

Tradeoff between bias and variance

	Estimator	Bias	Variance
MC	$G_{t}$	Unbiased: $E [G_{t}] = V^{π} (s)$	Higher
TD (real)	$R_{t + 1} + γ V^{π} (S_{t + 1})$	Unbiased: $E [R_{t + 1} + γ V^{π} (S_{t + 1})] = V^{π} (s)$	Lower
TD (actual)	$R_{t + 1} + γV (S_{t + 1})$	Biased: $E [R_{t + 1} + γV (S_{t + 1})]$ ≠ $V^{π} (s)$	Lower

Note: The real TD target uses the true $V^{π}$ , which is unknown in practice. The actual TD target uses the current estimate $V$ , introducing bias. Despite the bias, TD typically has lower variance than MC because it bootstraps from a single step rather than a full trajectory.

Multi-step TD learning

The TD(0) method only uses the immediate reward and the estimated value of the next state to update the value function, which may not be sufficient to capture the long-term dependencies in the environment. The multi-step TD learning method uses the rewards and estimated values of multiple future states, which can capture those dependencies better. We introduce it through the n-step cumulative reward function and the n-step TD target.

N-step cumulative reward function

Consider the following n-step cumulative reward function: $G_{t}^{(n)} = R_{t} + γ R_{t + 1} + \dots + γ^{n - 1} R_{t + n - 1} + γ^{n} V (S_{t + n})$
It makes sense to use the n-step cumulative reward as the update target, which is called n-step TD learning. The key equation of n-step TD learning is as follows: $V (s) \leftarrow V (s) + α (G_{t}^{(n)} - V (s))$
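The n-step target can be computed as in the sketch below. It follows the indexing of the formula above (`rewards[k]` is $R_{k}$, `values[k]` is $V(S_{k})$); the list-based episode format and the convention that the terminal state has value 0 are assumptions for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute G_t^(n) for a finished episode.

    rewards: [R_0, ..., R_{T-1}]
    values:  [V(S_0), ..., V(S_T)], with V(S_T) = 0 for the terminal state.
    If t + n runs past the end of the episode, the sum is truncated at T,
    which yields the plain Monte Carlo return.
    """
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    G += gamma ** (end - t) * values[end]   # bootstrap from V(S_{t+n})
    return G

# Hypothetical three-step episode.
rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 2.0, 0.0]
g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)   # TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g3 = n_step_return(rewards, values, t=0, n=3, gamma=0.5)   # MC return
```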

N-step mean cumulative reward function

Can we combine the information from different n-step cumulative reward functions to update the value function? The answer is yes.
We can use a weighted average of the different n-step cumulative rewards, which is called n-step mean TD learning. The weights decay geometrically: the n-step term receives weight $(1 - λ) λ^{n - 1}$ , and these weights sum to 1.

So the n-step mean cumulative reward function is defined as: $G_{t}^{λ} = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} G_{t}^{(n)}$
Then we can update the value function by: $V (s) \leftarrow V (s) + α (G_{t}^{λ} - V (s))$

This is called TD( $λ$ ) method, which is a generalization of TD(0) and MC methods. When $λ = 0$ , TD( $λ$ ) reduces to TD(0) method, and when $λ = 1$ , TD( $λ$ ) reduces to MC method. Thus, by adjusting the value of $λ$ , we can control the bias-variance tradeoff in the estimation of the value function.
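The two limiting cases can be checked numerically. The sketch below assumes a finished episode, for which all n-step targets beyond the end of the episode equal the Monte Carlo return, so the infinite sum collapses to a finite sum plus a tail term with weight $λ^{T - t - 1}$. The helper `n_step_return` and the toy episode are illustrative assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n), truncated at the end of the episode (terminal value is 0)."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return G + gamma ** (end - t) * values[end]

def lambda_return(rewards, values, t, lam, gamma):
    """Lambda-return for a finished episode of length T."""
    horizon = len(rewards) - t           # steps left until termination
    total = 0.0
    for n in range(1, horizon):
        weight = (1 - lam) * lam ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full Monte Carlo return.
    total += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return total

rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 2.0, 0.0]
td0_target = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5)
mc_target = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5)
mixed = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5)
```

With `lam=0.0` only the one-step target survives, and with `lam=1.0` only the full return survives, which matches the two reductions stated above.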

Conclusion about TD(λ) method

Unless $λ = 1$ , the TD( $λ$ ) target is generally biased, because it is a weighted average of n-step TD targets that bootstrap from the current estimate $V$ . Each n-step target would be unbiased only if it used the true $V^{π}$ ; the bias shrinks as $λ$ approaches 1, where TD( $λ$ ) reduces to the MC method.
The variance of the TD( $λ$ ) method is lower than that of the MC method, because bootstrapping replaces part of the random return with an estimate. However, the variance of the TD( $λ$ ) method is higher than that of the TD(0) method, because its target involves more random rewards and estimated values.

For uncorrelated $X$ and $Y$ : $Var (a X + bY) = a^{2} Var (X) + b^{2} Var (Y)$
Empirically, TD( $λ$ ) is not very common, because fast credit assignment for a given action is preferred, so MC or TD(0) is more commonly used in practice. However, the TD( $λ$ ) method can be useful when we want to tune the bias-variance tradeoff in the estimation of the value function by adjusting $λ$ .

So TD( $λ$ ) is parameterized by $λ$ , while n-step TD is parameterized by $n$ . TD( $λ$ ) is a generalization of n-step TD: its target is a weighted average of all n-step TD targets, from $n = 1$ to infinity.

Conclusion

In this post, we introduced model-free Reinforcement learning, which requires neither a model of the environment nor the construction of an MDP. We covered two methods to estimate the value function from episodes of experience: Monte Carlo methods and Temporal-Difference learning. We also introduced n-step TD learning, which uses the rewards and estimated values of multiple future states to update the value function and can therefore better capture long-term dependencies in the environment. Finally, we discussed the bias-variance tradeoff in the estimation of the value function, which can be controlled by adjusting the value of $λ$ in the TD( $λ$ ) method.

Posted May 17, 2026 | Updated May 17, 2026 | blog | 12 minutes read (About 1791 words)

Foundation of Reinforcement learning(III)

Introduction

In the previous post, we have introduced the Bellman equation and the linear programming formulation for MDP. In this post, we will discuss the model-based Reinforcement learning, which is a method to solve the MDP when we do not have the model of the environment.

The Settings of RL

Typically, RL is framed as an MDP: the agent explores the environment and learns the optimal policy.
Generally, we can only observe episodes, and usually we do not have the model of the environment.

So we need to bring a model into our RL project. Model-based RL first obtains a model of the environment and then solves the resulting MDP.

Dynamic Programming Based RL

Dynamic Programming for finite MDP

Our objective function is simple, just the expected return, which is defined as: $\max_{π} E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t + 1} ∣ s_{0} = s]$

In the first post of the series, we introduced backward induction, in which we start from the last step and then solve the problem recursively. However, this method is not efficient for a large state space, and it requires the state transitions to be known.

Alternatively, we can use the Bellman equation of the value function to tackle the problem: $V^{π} (s) = \sum_{a \in A} π (a ∣ s) \left[ \underbrace{R (s, a)}_{\text{immediate reward}} + \underbrace{γ}_{\text{discount}} \sum_{s' \in S} \underbrace{P (s' ∣ s, a)}_{\text{transition}} \underbrace{V^{π} (s')}_{\text{future value}} \right]$

Optimal Value Function

For a state $s$ , we can define the optimal value function as the maximum value function over all policies: $V^{*} (s) = \max_{π} V^{π} (s)$
So the optimal value function satisfies: $V^{*} (s) = \max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V^{*} (s')]$

So the best policy can be derived from the optimal value function as: $π^{*} (s) = \arg\max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V^{*} (s')]$

For any state $s$ and policy $π$ , we have: $V^{*} (s) = V^{π^{*}} (s) \geq V^{π} (s)$

The value function and the policy depend on each other, so we can update them iteratively until convergence. Iterating on the value function alone gives Value Iteration, and alternating between the policy and its value function gives Policy Iteration.

Value Iteration

For an MDP which is finite in both state and action space, we can use the value iteration to solve the problem. The value iteration is as follows:

1. Initialize $V (s) = 0$ for all $s \in S$
2. For each state $s \in S$ , update the value function as:
   $V (s) \leftarrow \max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V (s')]$
3. Repeat step 2 until convergence.
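As a concrete illustration, here is a minimal synchronous value iteration on a made-up two-state MDP. The nested-dictionary layout of `P` and `R`, the state and action names, and the stopping tolerance are all assumptions for this sketch.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Synchronous value iteration.

    P[s][a] maps next states to probabilities; R[s][a] is the reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        # One sweep of the Bellman optimality update over all states.
        new_V = {
            s: max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
        diff = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if diff < tol:
            return V

# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it.
states = [0, 1]
actions = ["stay", "go"]
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
V_star = value_iteration(states, actions, P, R, gamma=0.5)
```

For this toy MDP the fixed point can be verified by hand: staying in state 1 forever is worth $2 / (1 - 0.5) = 4$ , and going from state 0 to state 1 is worth $1 + 0.5 \cdot 4 = 3$ .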

NOTE: The states can be updated in any order, although the convergence rate may differ.

Sync & Async Value Iteration

Sync value iteration needs to store two copies of the value function:

For any state $s$ , we update the value function as:
$V_{new} (s) \leftarrow \max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V_{old} (s')]$
After updating all states, we copy the new value function to the old value function:
$V_{old} \leftarrow V_{new}$

Async value iteration only needs to store one copy of the value function, which is updated in place:
$V (s) \leftarrow \max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V (s')]$

Policy Iteration

Policy iteration makes the same assumption as value iteration: the MDP is finite in both state and action space. The policy iteration is as follows:

1. Randomly initialize a policy $π$ and a value function $V (s) = 0$ for all $s \in S$
2. Repeat the following steps until the policy no longer changes:
   - Policy Evaluation: For each state $s \in S$ , update the value function (sweeping repeatedly until $V$ converges) as:
     $V (s) \leftarrow \sum_{a \in A} π (a ∣ s) [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V (s')]$
   - Policy Improvement: For each state $s \in S$ , update the policy as:
     $π (s) \leftarrow \arg\max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V (s')]$
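The loop can be sketched as follows for a deterministic policy, where the sum over $π (a ∣ s)$ reduces to the single chosen action. The toy MDP, its nested-dictionary layout, and the tolerance are assumptions for illustration.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Policy iteration with a deterministic policy (dict: state -> action)."""
    def q(s, a, V):
        # One-step lookahead value of taking action a in state s.
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: sweep until V stops changing for this policy.
        while True:
            diff = 0.0
            for s in states:
                new_v = q(s, policy[s], V)
                diff = max(diff, abs(new_v - V[s]))
                V[s] = new_v
            if diff < tol:
                break
        # Policy improvement: act greedily with respect to V.
        new_policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
        if new_policy == policy:
            return policy, V
        policy = new_policy

# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it.
states = [0, 1]
actions = ["stay", "go"]
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
best_policy, V_pi = policy_iteration(states, actions, P, R, gamma=0.5)
```

On this toy problem the policy stabilizes after two improvement steps, while each evaluation phase needs many sweeps, which illustrates where the cost of policy iteration goes.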

Each iteration of Policy Iteration is more expensive than an iteration of Value Iteration, since it has to evaluate the current policy. However, Policy Iteration usually needs fewer iterations to converge, since every iteration improves the policy based on an evaluated value function.

Let’s contrast the two methods:

Method	Value Iteration	Policy Iteration
Update	Value function	Policy and value function
uses	Bellman optimality equation	Bellman expectation equation

NOTE:

Value iteration is a greedy method: every update takes the maximum over actions.

Updating the value function with the Bellman expectation equation in Policy Iteration is expensive.

For MDPs with a smaller state space, Policy Iteration is often faster than Value Iteration, but for MDPs with a larger state space, Value Iteration is often faster than Policy Iteration.

If the state transitions contain no cycle, value iteration is the better choice.

Bellman operators

In fact, we have introduced the Bellman operators in the previous post, but we haven’t discussed them in detail.

Why do Policy Iteration and Value Iteration converge to the optimal value function? The key is that the Bellman operators are contraction mappings.

There are two Bellman operators:

Bellman expectation operator, usually denoted as $T^{π}$ , which is defined as:

$(T^{π} V) (s) = \sum_{a \in A} π (a ∣ s) [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V (s')]$
Bellman optimality operator, usually denoted as $T^{*}$ or $T$ , which is defined as:

$(T V) (s) = \max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V (s')]$

They play different roles:

The expectation operator is used in policy iteration to compute the value function of a given policy; it forms the inner loop of policy iteration.
The optimality operator is used in value iteration to compute the optimal value function; it forms the main loop of value iteration.

Both the Bellman expectation operator and the Bellman optimality operator can be defined on the action value function and the state value function:

Bellman expectation operator on V-function:

$V^{π} (s) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t} ∣ s_{0} = s] = E_{π} [R (s_{0}, a_{0}) + γ V^{π} (s_{1}) ∣ s_{0} = s] = \sum_{a \in A} π (a ∣ s) [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V^{π} (s')] = (T^{π} V^{π}) (s)$

Bellman optimality operator on V-function:

$V^{*} (s) = \max_{π} E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t} ∣ s_{0} = s] = \max_{a \in A} [R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) V^{*} (s')] = (T V^{*}) (s)$

Bellman expectation operator on Q-function:

$Q^{π} (s, a) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t} ∣ s_{0} = s, a_{0} = a] = E_{π} [R (s_{0}, a_{0}) + γ Q^{π} (s_{1}, a_{1}) ∣ s_{0} = s, a_{0} = a] = R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) \sum_{a' \in A} π (a' ∣ s') Q^{π} (s', a') = (T^{π} Q^{π}) (s, a)$

Bellman optimality operator on Q-function:

$Q^{*} (s, a) = \max_{π} E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t} ∣ s_{0} = s, a_{0} = a] = R (s, a) + γ \sum_{s' \in S} P (s' ∣ s, a) \max_{a' \in A} Q^{*} (s', a') = (T Q^{*}) (s, a)$

Due to the contraction property of the Bellman operators, we can guarantee the convergence of the value iteration and policy iteration to the optimal value function.
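The contraction property says that for $γ < 1$ and any two value functions $U$ and $V$ , $\| T V - T U \|_{\infty} \leq γ \| V - U \|_{\infty}$ . The sketch below checks this inequality numerically for one pair of value functions on a made-up two-state MDP; the MDP and the two value functions are arbitrary choices for illustration, not a proof.

```python
def bellman_optimality(V, P, R, gamma):
    """Apply the Bellman optimality operator T to a value function V."""
    return {
        s: max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in P
    }

def sup_norm(U, V):
    """Largest absolute difference between two value functions."""
    return max(abs(U[s] - V[s]) for s in U)

# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it.
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
gamma = 0.5

V = {0: 10.0, 1: -3.0}
U = {0: 0.0, 1: 5.0}
before = sup_norm(V, U)
after = sup_norm(bellman_optimality(V, P, R, gamma),
                 bellman_optimality(U, P, R, gamma))
```

Applying `bellman_optimality` repeatedly shrinks the distance to the fixed point by at least a factor of `gamma` each time, which is why value iteration converges from any starting point.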

Model RL

In the sections above, the environment is a known MDP: all our methods assume that we have the model of the environment, namely the transition probability and the reward function. However, in many real-world scenarios we do not have this model, so we need to learn it from data.

There are two quantities to learn:

- Learn the state transition probability $P (s' ∣ s, a)$ :

  $P (s' ∣ s, a) = \frac{N ( s , a , s' )}{N ( s , a )}$

  where $N (s, a, s')$ is the number of times we have observed the transition from state $s$ to state $s'$ when taking action $a$ , and $N (s, a)$ is the number of times we have observed taking action $a$ in state $s$ .
- Learn the reward function $R (s, a)$ :

  $R (s, a) = average (R_{t} ∣ s_{t} = s, a_{t} = a)$

  that is, $R (s, a)$ is the average reward observed over the $N (s, a)$ times we took action $a$ in state $s$ .
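Both estimates are simple counting, as the sketch below shows. The transition format `(s, a, r, s_next)` and the sample data are assumptions for the example.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate P(s'|s,a) and R(s,a) from observed (s, a, r, s_next) tuples."""
    n_sa = defaultdict(int)          # N(s, a)
    n_sas = defaultdict(int)         # N(s, a, s')
    reward_sum = defaultdict(float)  # total reward observed for (s, a)
    for s, a, r, s_next in transitions:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
    P_hat = {(s, a, s2): c / n_sa[(s, a)] for (s, a, s2), c in n_sas.items()}
    R_hat = {sa: total / n_sa[sa] for sa, total in reward_sum.items()}
    return P_hat, R_hat

# Hypothetical replay buffer contents.
transitions = [
    (0, "go", 1.0, 1),
    (0, "go", 0.0, 0),
    (0, "go", 1.0, 1),
    (1, "stay", 2.0, 1),
]
P_hat, R_hat = estimate_model(transitions)
```

State-action pairs that never appear in the data get no estimate at all, which is one reason the data-collecting policy needs to explore.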

A simple model-based algorithm is as follows:

1. Randomly initialize a policy $π$
2. Repeat the following steps until convergence:
   - Collect data by executing the policy $π$ in the environment, and store the transition data in a replay buffer.
   - Learn the model of the environment from the replay buffer, which includes learning the state transition probability and the reward function.
   - Solve the MDP with the learned model to get its optimal policy $π^{*}$ , and set $π \leftarrow π^{*}$ .

Another approach is not to learn the MDP at all, but to learn the value function directly from the data, which is called model-free RL. We will discuss it in the next post.

Conclusion

In this post, we introduced model-based Reinforcement learning, which is a method to solve the MDP when we do not have the model of the environment. We discussed value iteration and policy iteration, which are two basic methods to solve the MDP. We also introduced the Bellman operators, which are the key to guaranteeing the convergence of value iteration and policy iteration. Finally, we introduced a simple model-based algorithm, which learns the model of the environment and solves the MDP with the learned model. In the next post, we will discuss model-free Reinforcement learning, which learns the value function directly from the data without learning the model of the environment.

Posted May 17, 2026 | Updated May 17, 2026 | blog | 7 minutes read (About 1104 words)

Foundation of Reinforcement learning(II)

Introduction

In the previous post, we introduced the MDP and some of its basic properties, so we are now ready to discuss MDP-based Reinforcement learning. But first, we need to introduce how to solve an MDP.

Bellman Equation

From the previous post, we know that there are two types of value function: the state value function and the action value function. The state value function is defined as: $V^{π} (s) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t + 1} ∣ s_{0} = s]$

Intuitively, the state value function is the expected return when we start from state $s$ and follow policy $π$ . Similarly, the action value function is defined as:

$Q^{π} (s, a) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} R_{t + 1} ∣ s_{0} = s, a_{0} = a]$

Here, it represents the expected return when we start from state $s$ , take action $a$ , and then follow policy $π$ thereafter.

On the other hand, we have the cumulative reward function, which is defined as: $G_{t} = R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + \dots = \sum_{k = 0}^{\infty} γ^{k} R_{t + k}$
It can be recursively defined as: $G_{t} = R_{t} + γ G_{t + 1}$
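The recursion gives a one-pass way to compute every $G_{t}$ of a finished episode by walking backwards. A minimal sketch, with a made-up reward sequence:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every t using G_t = R_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0                      # the return after the last step is 0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Hypothetical rewards R_0, R_1, R_2.
rewards = [1.0, 2.0, 3.0]
returns = discounted_returns(rewards, gamma=0.5)

# The direct definition gives the same value for G_0.
direct_G0 = sum(0.5 ** k * r for k, r in enumerate(rewards))
```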

So, we can rewrite the state value function as: $V^{π} (s) = \sum_{a} π (a ∣ s) [R (s, a) + γ \sum_{s'} P (s' ∣ s, a) V^{π} (s')]$
Above is the Bellman expectation equation for the state value function. Now, our question is to choose the best policy $π$ to maximize the value function. We can define the optimal state value function as: $V^{*} (s) = \max_{π} V^{π} (s)$

Due to the Principle of Optimality: Each stage of an optimal policy must be optimal for the remaining stages, we can derive the Bellman optimality equation for the state value function as: $V^{*} (s) = \max_{a} [R (s, a) + γ \sum_{s'} P (s' ∣ s, a) V^{*} (s')]$

Above is the Bellman optimality equation for the state value function.

Linear Programming for MDP

We have a key observation that the Bellman optimality equation can be rewritten as a linear programming problem. The linear programming formulation for MDP is as follows:

$\begin{aligned} \text{minimize} \quad & \sum_{s} V (s) \\ \text{subject to} \quad & V (s) \geq R (s, a) + γ \sum_{s'} P (s' ∣ s, a) V (s'), \quad \forall s \in S, a \in A \end{aligned}$

Proof:
The first part is to show that the optimal value function $V^{*}$ is a feasible solution to the above linear programming problem. We can see that for any state $s$ and action $a$ , we have: $V^{*} (s) = \max_{a'} [R (s, a') + γ \sum_{s'} P (s' ∣ s, a') V^{*} (s')] \geq R (s, a) + γ \sum_{s'} P (s' ∣ s, a) V^{*} (s')$
Thus, $V^{*}$ satisfies the constraints of the linear programming problem.
The second part is to show that any feasible solution satisfying the constraints must be greater than or equal to $V^{*}$ .
Given that the optimal policy $π^{*}$ chooses one action $π^{*} (s)$ for each state $s$ , the LP constraints imply that for any state $s$ : $V (s) \geq R (s, π^{*} (s)) + γ \sum_{s'} P (s' ∣ s, π^{*} (s)) V (s')$
Let’s write them in matrix form: $V \geq R^{π^{*}} + γ P^{π^{*}} V$
where $R^{π^{*}}$ is the reward vector under policy $π^{*}$ , and $P^{π^{*}}$ is the transition matrix under policy $π^{*}$ . We can rearrange the above inequality as: $(I - γ P^{π^{*}}) V \geq R^{π^{*}}$
Since $γ < 1$ , we can conclude that $I - γ P^{π^{*}}$ is invertible, and its inverse is: $(I - γ P^{π^{*}})^{- 1} = \sum_{k = 0}^{\infty} (γ P^{π^{*}})^{k}$
Every term of this series is a non-negative matrix, so the inverse is non-negative and multiplying by it preserves the inequality. Thus, we can multiply both sides by $(I - γ P^{π^{*}})^{- 1}$ to get: $V \geq (I - γ P^{π^{*}})^{- 1} R^{π^{*}} = V^{π^{*}}$
Since $V^{π^{*}} = V^{*}$ , we can conclude that $V \geq V^{*}$ .
So, we have shown that any feasible solution is greater than or equal to $V^{*}$ , and the optimal value function $V^{*}$ is a feasible solution. Thus, the optimal solution to the linear programming problem is $V^{*}$ .

Given $V^{*}$ , we can easily derive the optimal policy $π^{*}$ as: $π^{*} (s) = \arg\max_{a} [R (s, a) + γ \sum_{s'} P (s' ∣ s, a) V^{*} (s')]$

Here I asked Claude to give me a simple explanation of the equivalence between this greedy-looking policy decision and the optimal policy. The formal proof may need the fixed point theorem, but it is beyond the scope of this post. We just need to remember that the optimal policy can be derived from the optimal value function by choosing the action that maximizes the expected return.

The Dual Linear Programming for MDP

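The dual linear programming formulation for MDP is as follows: $\begin{aligned} \text{maximize} \quad & \sum_{s, a} ρ (s, a) R (s, a) \\ \text{subject to} \quad & \sum_{a} ρ (s', a) = w (s') + γ \sum_{s, a} ρ (s, a) P (s' ∣ s, a), \quad \forall s' \in S \\ & ρ (s, a) \geq 0, \quad \forall s \in S, a \in A \end{aligned}$

Here $w (s) > 0$ is the weight of state $s$ in the primal objective $\sum_{s} w (s) V (s)$ ; the primal above uses $w (s) = 1$ . If the initial state distribution $μ (s) > 0$ for all $s \in S$ , then we can let
$w (s) = μ (s)$ . The objective function then represents the expected cumulative reward expressed through the occupancy measure, and the constraints are flow conservation constraints.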

Assume that the optimal solution is $ρ^{*} (s, a)$ ; we can derive the optimal policy from Theorem 2 as: $π^{*} (a ∣ s) = \frac{ρ^{*} ( s , a )}{\sum_{a'} ρ^{*} ( s , a' )}$

Comparison between the Primal and Dual Linear Programming for MDP

Dimension	Primal LP	Dual LP
variable	state value function $V (s)$	occupancy measure $ρ (s, a)$
objective	minimize $\sum_{s} V (s)$	maximize $\sum_{s, a} ρ (s, a) R (s, a)$
constraints	Bellman optimality constraints	flow conservation constraints
explanation	state value	action frequency

Summary

At the beginning of this post, I tried to introduce the MDP-based Reinforcement learning, but I found that the solution of MDP takes a lot of space, so I just introduce the Bellman equation and the linear programming formulation for MDP. In the next post, I will introduce the value iteration and policy iteration algorithms for solving MDP, which are based on the Bellman equation.

Posted May 16, 2026 | Updated May 16, 2026 | blog | 15 minutes read (About 2240 words)

Foundation of Reinforcement learning(I)

The category of decision making problem

dimension	single step	multi step
one person	optimization problem	RL, to the best situation
multi person	static game	dynamic game, MARL, etc.

Dynamic programming

Dynamic programming is used to solve sequential decision making problems. The feature of such a problem is that its decision making process is sequential: the decision at one step affects the next step, and the reward is received at the end of the decision making process, not at each step.

For example, given the maze-like problem below, the agent needs to find a way from Position A to Position B, and each route takes a different amount of time. The agent needs to find the route with the least time. A simple way to solve this is to list all the possible paths, but if the map contains a cycle or is large, this becomes infeasible.

A better way to solve this is backward induction: we start from the end point and, for every point, calculate the time needed to reach the end point; then we regard the selected point as the new end point, and repeat this process until we reach the start point. This is a dynamic programming method, and it can solve the problem in polynomial time. But because we need to work backward along the paths, this method is only suitable for a DAG; if there is a cycle, this method fails.
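Backward induction on a small DAG can be sketched as follows. The graph, its travel times, and the requirement that the caller supplies a topological order are assumptions made for this illustration; they are not taken from the maze figure.

```python
INF = float("inf")

def backward_induction(edges, order, goal):
    """Compute the least time from every node to the goal on a DAG.

    edges[u] maps each successor v to the travel time from u to v.
    order lists the nodes in topological order (start first, goal last).
    """
    cost_to_go = {node: INF for node in order}
    cost_to_go[goal] = 0
    best_next = {}
    # Process nodes from the goal backwards, so that every successor
    # already has its final cost when a node is handled.
    for node in reversed(order):
        for nxt, cost in edges[node].items():
            if cost + cost_to_go[nxt] < cost_to_go[node]:
                cost_to_go[node] = cost + cost_to_go[nxt]
                best_next[node] = nxt
    return cost_to_go, best_next

# Hypothetical map with start "A" and goal "B".
edges = {
    "A": {"C": 2, "D": 5},
    "C": {"D": 1, "B": 7},
    "D": {"B": 3},
    "B": {},
}
cost_to_go, best_next = backward_induction(edges, ["A", "C", "D", "B"], goal="B")
```

Each edge is examined exactly once, in contrast to enumerating all paths. Following `best_next` from "A" yields the fastest route.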

[Figure: maze]

The example is just an introduction. We can summarize the features of dynamic programming as follows:

It starts from the end and calculates the best action for each state.
It traverses all the states and, for each state, calculates the best action and the value of that state.
It needs a definition of the state, the path (state transition), and the time (online reward).

This leads to the Principle of Optimality:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Markov Decision Process

Stochastic Process

A stochastic process is a collection of random variables, which can be used to describe the evolution of a system over time. It’s mathematical definition is as follows: $P (X_{t + 1} ∣ X_{t}, X_{t - 1}, \dots, X_{0})$
This means that the probability of the next state $X_{t + 1}$ depends on the current state $X_{t}$ and all the previous states $X_{t - 1}, \dots, X_{0}$ .

Markov Process

Compared to a general stochastic process, a Markov process makes a stronger assumption: "the future is independent of the past given the present". Mathematically, its definition is as follows: $P (X_{t + 1} ∣ X_{t}, X_{t - 1}, \dots, X_{0}) = P (X_{t + 1} ∣ X_{t})$
This means that the probability of the next state $X_{t + 1}$ only depends on the current state $X_{t}$ , and is independent of all the previous states $X_{t - 1}, \dots, X_{0}$ .

One way to understand this property is that the current state contains all the relevant information about the past, so we can make decisions based on the current state alone.
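To make the Markov property concrete, here is a small sketch of a Markov process in Python. The two-state "weather" chain and its probabilities are hypothetical, chosen only for illustration. Note that both functions read nothing but the current state (or the current state distribution).

```python
import random

# Hypothetical 2-state chain: P[x][y] = P(X_{t+1} = y | X_t = x).
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def sample_chain(P, start, steps, rng):
    """Sample X_0, ..., X_steps; each draw looks only at the current state."""
    chain = [start]
    for _ in range(steps):
        current = chain[-1]
        states = list(P[current])
        weights = [P[current][s] for s in states]
        chain.append(rng.choices(states, weights=weights)[0])
    return chain


def propagate(P, dist):
    """One step of the state distribution: d_{t+1}(y) = sum_x d_t(x) P(y | x)."""
    return {y: sum(dist[x] * P[x][y] for x in P) for y in P}


chain = sample_chain(P, "sunny", 10, random.Random(0))
dist = propagate(P, {"sunny": 1.0, "rainy": 0.0})
print(chain)
print(dist)
```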

Markov Decision Process

Markov Decision Process (MDP) provides a mathematical framework for modeling decision making in situations where the outcome is partly random, partly under the control of a decision maker. An MDP is defined by the following components:

State space (S): A set of all possible states in the environment.
Action space (A): A set of all possible actions that the agent can take
Transition function (P): A function that defines the probability of transitioning from one state to another given a specific action. It is denoted as $P (s ’∣ s, a)$ , which represents the probability of transitioning to state $s ’$ from state $s$ after taking action $a$ .
Reward function (R): A function that defines the reward received after taking a specific action in a specific state. It is denoted as $R (s, a)$ , which represents the reward received after taking action $a$ in state $s$ . Sometimes the reward depends only on the state, in which case it is written $R (s)$ .
Discount factor (γ): A factor that determines the importance of future rewards. It is a value between 0 and 1, where a value closer to 0 makes the agent prioritize immediate rewards, while a value closer to 1 makes the agent consider future rewards more heavily.
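The five components above map naturally onto a small data structure. The following is a minimal sketch for finite MDPs with integer-indexed states and actions; the class name `MDP`, the method `problems`, and the numbers in `toy` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MDP:
    """Container for the five-tuple (S, A, P, R, gamma), states and actions indexed from 0."""
    n_states: int
    n_actions: int
    P: List[List[List[float]]]  # P[s][a][s2] = P(s2 | s, a)
    R: List[List[float]]        # R[s][a] = reward for taking action a in state s
    gamma: float

    def problems(self) -> List[str]:
        """Return the violated requirements; an empty list means the MDP is well formed."""
        found = []
        if not 0.0 <= self.gamma <= 1.0:
            found.append("gamma must lie in [0, 1]")
        for s in range(self.n_states):
            for a in range(self.n_actions):
                row = self.P[s][a]
                if min(row) < 0.0 or abs(sum(row) - 1.0) > 1e-9:
                    found.append(f"P[{s}][{a}] is not a probability distribution")
        return found


# Hypothetical toy MDP with 2 states and 2 actions.
toy = MDP(
    n_states=2,
    n_actions=2,
    P=[[[0.9, 0.1], [0.2, 0.8]],
       [[0.5, 0.5], [0.0, 1.0]]],
    R=[[1.0, 0.0], [0.0, 2.0]],
    gamma=0.9,
)
print(toy.problems())
```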

The dynamic feature of MDP

An MDP evolves step by step as follows:

The agent observes the current state $s_{t}$ .
The agent selects an action $a_{t}$ based on its policy $π (a ∣ s)$ , which maps each state to a distribution over actions.
The agent gets a reward $R (s_{t}, a_{t})$ .
The MDP transitions to a new state $s_{t + 1}$ according to the transition function $P (s_{t + 1} ∣ s_{t}, a_{t})$ .

The total reward that the agent receives over time is often defined as the discounted sum of rewards: $G_{t} = R (s_{t}, a_{t}) + γ R (s_{t + 1}, a_{t + 1}) + γ^{2} R (s_{t + 2}, a_{t + 2}) + \dots = \sum_{k = 0}^{\infty} γ^{k} R (s_{t + k}, a_{t + k})$
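The interaction loop and the discounted return can be sketched as follows. The toy MDP, the policy `pi`, and the function names are hypothetical, and the infinite sum is truncated at a finite `horizon`, so the result only approximates $G_{0}$ .

```python
import random

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]   # P[s][a][s2]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]
pi = [[0.5, 0.5], [0.3, 0.7]]    # pi[s][a]


def rollout(P, R, pi, s0, horizon, rng):
    """Run the observe / act / reward / transition loop and collect the rewards."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = rng.choices(range(len(pi[s])), weights=pi[s])[0]       # a_t ~ pi(. | s_t)
        rewards.append(R[s][a])                                    # receive R(s_t, a_t)
        s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]   # s_{t+1} ~ P(. | s_t, a_t)
    return rewards


def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k] by accumulating from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = rollout(P, R, pi, s0=0, horizon=20, rng=random.Random(0))
print(discounted_return(rewards, gamma=0.9))
```

Accumulating from the end uses the recursion $G_{t} = R (s_{t}, a_{t}) + γ G_{t + 1}$ , which avoids computing powers of $γ$ explicitly.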

Markov Policy

In the context of MDP, a policy is in general a function that depends on the history $h_{t} = (s_{0}, a_{0}, s_{1}, a_{1}, \dots, s_{t - 1}, a_{t - 1}, s_{t})$ : $π (a_{t} ∣ h_{t}) = P (a_{t} ∣ h_{t})$

But a Markov policy is a special type of policy that only depends on the current state: $π (a_{t} ∣ s_{t}) = P (a_{t} ∣ s_{t})$

In the RL setting, we usually assume that the policy is a Markov policy. Why?

The MDP has the Markov property, which means the future is independent of the past given the present. The history therefore contains no extra information that could help us make a better decision, and we can use the current state alone. More precisely, for any policy that relies on the history, we can find a Markov policy that performs at least as well, so we can focus on Markov policies without loss of generality. (Proof: the 26th and 27th slides of the lecture.)

The category of MDP Policy

Along the time dimension, we can categorize policies into two types:

Stationary policy: A policy that does not change over time. It is defined as $π (a ∣ s)$ , which means the action distribution in state $s$ is the same at every time step.
Non-stationary policy: A policy that can change over time. It is defined as $π_{t} (a ∣ s)$ , which means the action distribution in state $s$ can differ between time steps.

Along the probability distribution dimension, we can categorize policies into two types:

Deterministic policy: A policy that always selects the same action for a given state. It is defined as $π (s) = a$ , which means the action taken in state $s$ is always $a$ .
Stochastic policy: A policy that selects actions according to a probability distribution. It is defined as $π (a ∣ s) = P (a ∣ s)$ , which means the action taken in state $s$ is sampled from the probability distribution $π (\cdot ∣ s)$ .
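These categories can be represented in code in a simple way. In the sketch below, which uses hypothetical helper names, a deterministic policy is a list of actions, a stochastic policy is a table `pi[s][a]`, and a non-stationary policy is one table per time step.

```python
import random


def as_stochastic(det_policy, n_actions):
    """One-hot encode a deterministic policy pi(s) = a as a table pi[s][a]."""
    return [[1.0 if a == chosen else 0.0 for a in range(n_actions)]
            for chosen in det_policy]


def sample_action(pi, s, rng):
    """Stationary stochastic policy: draw a ~ pi(. | s), independent of the time step."""
    return rng.choices(range(len(pi[s])), weights=pi[s])[0]


def sample_action_nonstationary(pi_by_time, t, s, rng):
    """Non-stationary policy: a separate table pi_t(. | s) for each time step t."""
    return sample_action(pi_by_time[t], s, rng)


rng = random.Random(0)
det = [1, 0]                      # in state 0 take action 1, in state 1 take action 0
pi = as_stochastic(det, n_actions=2)
print(pi, sample_action(pi, 0, rng))
```

A deterministic policy is therefore a special case of a stochastic one, with all probability mass on a single action.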

In the RL setting, we usually assume that the policy is a stationary policy. Why?
Typically, we consider the infinite-horizon setting. There is also a proof that for any non-stationary policy, we can find a stationary policy that performs at least as well, so we can focus on stationary policies without loss of generality. (Proof: the 29th and 32nd slides of the lecture.)

The best policy for MDP

There is a theorem:

When the discount factor satisfies $γ < 1$ , the state and action spaces are finite, and the horizon is infinite, there exists a deterministic and stationary policy $π^{*}$ that is optimal, which means that for any policy $π$ and any state $s$ , we have $V^{π^{*}} (s) \geq V^{π} (s)$ .

Proof: Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

The goal of MDP

Our goal is to choose actions that maximize the expected discounted reward, which is defined as follows: $E [R (s_{0}, a_{0}) + γ R (s_{1}, a_{1}) + γ^{2} R (s_{2}, a_{2}) + \dots] = E [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t})]$

So we can define the value function for a policy $π$ as follows: $V^{π} (s) = E [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t}) ∣ s_{0} = s]$
This is the expected discounted reward that the agent obtains when starting from state $s$ and following policy $π$ .
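For a small finite MDP, this definition can be evaluated almost literally: propagate the state distribution forward in time and add up the discounted expected rewards. The sketch below does this on a hypothetical toy MDP; the truncation at `horizon` steps is an assumption that is harmless here because $γ^{500}$ is negligible.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]   # P[s][a][s2]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]
pi = [[0.5, 0.5], [0.3, 0.7]]    # pi[s][a]
gamma = 0.9


def value_by_definition(P, R, pi, gamma, s0, horizon=500):
    """Approximate V^pi(s0) = E[sum_t gamma^t R(s_t, a_t) | s_0 = s0] by a truncated sum."""
    n_s, n_a = len(R), len(R[0])
    # Expected one-step reward and state-to-state transition under pi.
    r_pi = [sum(pi[s][a] * R[s][a] for a in range(n_a)) for s in range(n_s)]
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(n_a)) for s2 in range(n_s)]
            for s in range(n_s)]
    dist = [1.0 if s == s0 else 0.0 for s in range(n_s)]   # distribution of s_t, starting at s0
    value, discount = 0.0, 1.0
    for _ in range(horizon):
        value += discount * sum(dist[s] * r_pi[s] for s in range(n_s))
        dist = [sum(dist[s] * P_pi[s][s2] for s in range(n_s)) for s2 in range(n_s)]
        discount *= gamma
    return value


V = [value_by_definition(P, R, pi, gamma, s) for s in range(2)]
print(V)
```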

Occupancy Measure

In the MDP context, the occupancy measure represents the expected discounted number of visits to each state-action pair under a policy $π$ ; it is also known as the state-action visitation distribution. It is defined as follows: $ρ^{π} (s, a) = E_{a \sim π (s), s ’ \sim p (s, a)} [\sum_{t = 0}^{\infty} γ^{t} I (s_{t} = s, a_{t} = a)]$

where $s ’ \sim p (s, a)$ denotes the state transition, that is: $s_{t + 1} \sim p (s_{t}, a_{t})$

On the other hand, the state occupancy measure is defined as follows: $ρ^{π} (s) = E_{a \sim π (s), s ’ \sim p (s, a)} [\sum_{t = 0}^{\infty} γ^{t} I (s_{t} = s)]$

How to compute the occupancy measure?

State occupancy measure

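We assume that the initial state distribution is $μ (s)$ . Then the state occupancy measure satisfies the following recursion: $ρ^{π} (s ’) = μ (s ’) + γ \sum_{s} p^{π} (s ’∣ s) ρ^{π} (s)$ , where $p^{π} (s ’∣ s) = \sum_{a} π (a ∣ s) p (s ’∣ s, a)$ is the state-to-state transition probability under $π$ .

Writing this recursion in matrix form and solving it gives: $ρ^{π} = (I - γ (P_{SS ’}^{π})^{T})^{- 1} μ$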
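As a quick numerical check, the recursion can also be solved by fixed-point iteration instead of a matrix inverse; since $γ < 1$ the iteration converges to the same solution. The sketch below uses a hypothetical toy MDP and a hypothetical function name `state_occupancy`. A linear solver would be the more direct choice for larger problems.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]   # P[s][a][s2]
pi = [[0.5, 0.5], [0.3, 0.7]]    # pi[s][a]
mu = [1.0, 0.0]                  # initial state distribution
gamma = 0.9


def state_occupancy(P, pi, mu, gamma, iters=1000):
    """Iterate rho(s2) <- mu(s2) + gamma * sum_s p_pi(s2 | s) rho(s) until it settles."""
    n_s, n_a = len(pi), len(pi[0])
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(n_a)) for s2 in range(n_s)]
            for s in range(n_s)]
    rho = list(mu)
    for _ in range(iters):
        rho = [mu[s2] + gamma * sum(P_pi[s][s2] * rho[s] for s in range(n_s))
               for s2 in range(n_s)]
    return rho


rho = state_occupancy(P, pi, mu, gamma)
print(rho, sum(rho))
```

The entries of the result sum to $1 / (1 - γ)$ , because each time step $t$ contributes a total mass of $γ^{t}$ .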

State-action occupancy measure

We can compute the state-action occupancy measure as follows: $ρ^{π} (s ’, a ’) = μ (s ’) π (a ’∣ s ’) + γ \sum_{s, a} p (s ’∣ s, a) π (a ’∣ s ’) ρ^{π} (s, a)$

Note that this recursion expresses flow conservation. Because the state-action occupancy measure is the expected discounted number of times that the agent takes action $a$ in state $s$ , the total flow out of a state must equal the initial mass placed on that state plus the discounted flow into it from all state-action pairs. This is why the flow conservation constraint appears in the computation of the occupancy measure.

Some Properties of Occupancy Measure

Obviously, from the definition of the measures:

$ρ^{π} (s) = \sum_{a} ρ^{π} (s, a)$
$ρ^{π} (s, a) = π (a ∣ s) ρ^{π} (s)$

We have two important theorems about the occupancy measure:

Theorem 1: For two policies $π$ and $π ’$ interacting with the same dynamic environment, if $ρ^{π} = ρ^{π ’}$ , then $π = π ’$ .
Theorem 2: Given an occupancy measure $ρ$ , the only policy that can generate this occupancy measure is $π_{ρ} (a ∣ s) = \frac{ρ (s, a)}{\sum_{a ’} ρ (s, a ’)}$ .
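Theorem 2 translates directly into a normalization of each row of the occupancy table. The sketch below uses made-up numbers; the uniform fallback for a state with zero occupancy is our own assumption, since the formula is undefined there.

```python
def policy_from_occupancy(rho_sa):
    """Recover pi(a | s) = rho(s, a) / sum_a' rho(s, a') from a table rho_sa[s][a]."""
    policy = []
    for row in rho_sa:
        total = sum(row)
        if total > 0.0:
            policy.append([x / total for x in row])
        else:
            # Assumption: a state that is never visited gets a uniform distribution.
            policy.append([1.0 / len(row)] * len(row))
    return policy


# Build a hypothetical occupancy table from a known policy via rho(s, a) = pi(a | s) * rho(s).
pi = [[0.5, 0.5], [0.3, 0.7]]
rho_s = [6.0, 4.0]
rho_sa = [[pi[s][a] * rho_s[s] for a in range(2)] for s in range(2)]

recovered = policy_from_occupancy(rho_sa)
print(recovered)
```

Recovering the original `pi` here illustrates why the occupancy measure determines the policy on every state that is actually visited.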

Accumulated reward for a policy

As we have defined the occupancy measure, we can compute the accumulated reward for a policy $π$ as follows: $V (π) = E_{a \sim π (\cdot ∣ s), s ’ \sim p (\cdot ∣ s, a)} [R (s_{0}, a_{0}) + γ R (s_{1}, a_{1}) + γ^{2} R (s_{2}, a_{2}) + \dots] = \sum_{s, a} E_{a \sim π (\cdot ∣ s), s ’ \sim p (\cdot ∣ s, a)} [R (s, a)] ρ^{π} (s, a) = \sum_{s, a} R (s, a) ρ^{π} (s, a) = E_{ρ^{π}} [R (s, a)]$
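The identity can be checked numerically on a small example. The sketch below, on a hypothetical toy MDP with hypothetical function names, computes the accumulated reward once from the occupancy measure and once as a truncated discounted sum of expected rewards. Both computations cut the infinite horizon at 1000 steps, which is an assumption.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]   # P[s][a][s2]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]
pi = [[0.5, 0.5], [0.3, 0.7]]    # pi[s][a]
mu = [0.5, 0.5]                  # initial state distribution
gamma = 0.9
S, A = range(2), range(2)


def next_dist(dist):
    """Push a state distribution (or measure) one step through pi and P."""
    return [sum(dist[s] * pi[s][a] * P[s][a][s2] for s in S for a in A) for s2 in S]


def occupancy(iters=1000):
    """State-action occupancy rho(s, a) = pi(a | s) * rho(s), with rho(s) found by iteration."""
    rho_s = list(mu)
    for _ in range(iters):
        flow = next_dist(rho_s)
        rho_s = [mu[s] + gamma * flow[s] for s in S]
    return [[pi[s][a] * rho_s[s] for a in A] for s in S]


def return_from_occupancy(rho_sa):
    """V(pi) = sum_{s, a} R(s, a) * rho(s, a)."""
    return sum(R[s][a] * rho_sa[s][a] for s in S for a in A)


def return_by_time_sum(horizon=1000):
    """V(pi) = sum_t gamma^t * E[R(s_t, a_t)], truncated at a finite horizon."""
    dist, total, discount = list(mu), 0.0, 1.0
    for _ in range(horizon):
        total += discount * sum(dist[s] * pi[s][a] * R[s][a] for s in S for a in A)
        dist = next_dist(dist)
        discount *= gamma
    return total


print(return_from_occupancy(occupancy()), return_by_time_sum())
```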

Value function and Q function

The value function is used to evaluate a state or a state-action pair, given a policy $π$ .

The state value function is usually known as the value function, which is defined as follows: $V^{π} (s) = E_{a \sim π (\cdot ∣ s), s ’ \sim p (\cdot ∣ s, a)} [R (s, a) + γ V^{π} (s ’)]$

The state-action value function is usually known as the Q function, which is defined as follows: $Q^{π} (s, a) = E_{s ’ \sim p (\cdot ∣ s, a), a ’ \sim π (\cdot ∣ s ’)} [R (s, a) + γ Q^{π} (s ’, a ’)]$

Obviously, we have the relationship between the value function and the Q function: $V^{π} (s) = \sum_{a} π (a ∣ s) Q^{π} (s, a)$
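We can also express the value function through the occupancy measure, and the Q function through the value function: $V^{π} (s) = \sum_{\tilde{s}, a} R (\tilde{s}, a) ρ_{s}^{π} (\tilde{s}, a)$ and $Q^{π} (s, a) = R (s, a) + γ \sum_{s ’} p (s ’∣ s, a) V^{π} (s ’)$ , where $ρ_{s}^{π}$ denotes the occupancy measure obtained when the initial distribution $μ$ puts all its mass on state $s$ .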
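The relationship between the two functions is easy to verify numerically. In the following sketch, on a hypothetical toy MDP, we obtain $V^{π}$ by repeatedly applying its recursive definition (one possible method among several), derive $Q^{π}$ from it, and check that $V^{π} (s) = \sum_{a} π (a ∣ s) Q^{π} (s, a)$ . The function names are our own.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]   # P[s][a][s2]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]
pi = [[0.5, 0.5], [0.3, 0.7]]    # pi[s][a]
gamma = 0.9


def q_from_v(P, R, V, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_s2 P(s2 | s, a) V(s2)."""
    return [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))
             for a in range(len(R[s]))]
            for s in range(len(R))]


def evaluate_policy(P, R, pi, gamma, iters=1000):
    """Apply V(s) <- sum_a pi(a | s) Q(s, a) repeatedly, starting from V = 0."""
    V = [0.0] * len(R)
    for _ in range(iters):
        Q = q_from_v(P, R, V, gamma)
        V = [sum(pi[s][a] * Q[s][a] for a in range(len(Q[s]))) for s in range(len(Q))]
    return V


V = evaluate_policy(P, R, pi, gamma)
Q = q_from_v(P, R, V, gamma)
print(V)
print(Q)
```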

Summary

MDP provides us a simple but powerful mathematical framework to model the sequential decision making problem.
The five-tuple of MDP is defined as $(S, A, P, R, γ)$ , which represents the state space, action space, transition function, reward function and discount factor respectively.
The Markov property is the key assumption of MDP, which means the future is independent of the past given the present.
A policy is the function that chooses the action; it is usually a conditional probability distribution over actions given states, and we usually assume that it is stationary.
Occupancy measure is a way to represent the discounted state-action expectation under a policy, which can be used to compute the accumulated reward for a policy.
State value function and state-action value function are used to evaluate a state or a state-action pair, given a policy, and they can be computed using the occupancy measure.
