Foundation of Reinforcement learning(II)

05-17-2026 | blog | 7 minutes read (about 1104 words)

Introduction

The previous post introduced the MDP and some of its basic properties, so we are now ready to discuss MDP-based reinforcement learning. First, though, we need to look at how an MDP is solved.

Bellman Equation

As the previous post showed, there are two types of value function: the state value function and the action value function. The state value function is defined as: $V^{π}(s) = \mathbb{E}_{π}\left[\sum_{t=0}^{\infty} γ^{t} R_{t+1} \mid s_{0} = s\right]$

Intuitively, the state value function is the expected return when we start from state $s$ and follow policy $π$. Similarly, the action value function is defined as:

$Q^{π}(s, a) = \mathbb{E}_{π}\left[\sum_{t=0}^{\infty} γ^{t} R_{t+1} \mid s_{0} = s, a_{0} = a\right]$

This is the expected return when we start from state $s$, take action $a$, and follow policy $π$ thereafter.

On the other hand, we have the return (the accumulated discounted reward), defined as: $G_{t} = R_{t+1} + γ R_{t+2} + γ^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} γ^{k} R_{t+k+1}$
The rewards are indexed from $R_{t+1}$ to match the value function definitions above. The return satisfies the recursion: $G_{t} = R_{t+1} + γ G_{t+1}$
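The recursion can be checked numerically. The following is a minimal sketch, assuming a made-up finite episode in which `rewards[t]` is the reward collected for the step taken at time `t` and the return after the last step is zero; the function names are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Direct sum: sum_k gamma^k * rewards[k]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def returns_by_recursion(rewards, gamma):
    """Backward pass using G_t = r_t + gamma * G_{t+1}, with G = 0 after the last step."""
    returns = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns[:-1]


rewards = [1.0, 0.0, 2.0, 3.0]  # hypothetical episode
gamma = 0.9
G = returns_by_recursion(rewards, gamma)

# The recursive value at every time step matches the direct discounted sum
# of the remaining rewards.
for t in range(len(rewards)):
    assert abs(G[t] - discounted_return(rewards[t:], gamma)) < 1e-12
```

The backward pass computes every $G_{t}$ in one sweep, which is why the recursive form is the one used in practice.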

Taking the expectation of this recursion under policy $π$, we can rewrite the state value function as: $V^{π}(s) = \sum_{a} π(a \mid s) \left[R(s, a) + γ \sum_{s'} P(s' \mid s, a) V^{π}(s')\right]$
This is the Bellman expectation equation for the state value function. The next question is how to choose the policy $π$ that maximizes the value function. We define the optimal state value function as: $V^{*}(s) = \max_{π} V^{π}(s)$
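For a fixed policy, the Bellman expectation equation is linear in $V^{π}$, so it can be solved directly as a linear system. The sketch below assumes a made-up two-state, two-action MDP; the arrays `P`, `R`, and `pi` and their numbers are illustrative and do not come from this post.

```python
import numpy as np

gamma = 0.9
# P[s, a, s2] = P(s2 | s, a), R[s, a] = expected immediate reward (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.5],
              [2.0, 0.5]])
# pi[s, a] = pi(a | s), an arbitrary stochastic policy.
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

R_pi = (pi * R).sum(axis=1)             # R^pi(s) = sum_a pi(a|s) R(s, a)
P_pi = np.einsum("sa,sat->st", pi, P)   # P^pi(s, s2) = sum_a pi(a|s) P(s2|s, a)

# Matrix form of the Bellman expectation equation: (I - gamma * P^pi) V = R^pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Right-hand side of the Bellman expectation equation, evaluated at V.
rhs = (pi * (R + gamma * P @ V)).sum(axis=1)
assert np.allclose(V, rhs)
```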

By the Principle of Optimality (each stage of an optimal policy must be optimal for the remaining stages), we can derive the Bellman optimality equation for the state value function: $V^{*}(s) = \max_{a} \left[R(s, a) + γ \sum_{s'} P(s' \mid s, a) V^{*}(s')\right]$
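On a tiny MDP, the definition $V^{*}(s) = \max_{π} V^{π}(s)$ can be evaluated by brute force and compared with the Bellman optimality equation. The sketch below assumes a made-up two-state, two-action MDP and relies on the standard fact that the maximum is attained by a deterministic policy, so enumerating deterministic policies is enough; `evaluate` is a helper introduced here for illustration.

```python
import itertools

import numpy as np

gamma = 0.9
# P[s, a, s2] = P(s2 | s, a), R[s, a] = expected immediate reward (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.5],
              [2.0, 0.5]])
n_states, n_actions = R.shape
states = np.arange(n_states)


def evaluate(actions):
    """Solve V = R_pi + gamma * P_pi V for a deterministic policy."""
    actions = np.asarray(actions)
    P_pi = P[states, actions]  # row s is P(. | s, actions[s])
    R_pi = R[states, actions]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)


# V*(s) = max over policies of V^pi(s), taken state by state.
all_policies = itertools.product(range(n_actions), repeat=n_states)
V_star = np.max([evaluate(p) for p in all_policies], axis=0)

# Right-hand side of the Bellman optimality equation, evaluated at V*.
backup = (R + gamma * P @ V_star).max(axis=1)
assert np.allclose(V_star, backup)
```

Enumeration grows as $|A|^{|S|}$, so it only serves as a sanity check on toy problems.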

Linear Programming for MDP

A key observation is that the Bellman optimality equation can be rewritten as a linear program. The linear programming formulation for the MDP is:

$\begin{aligned} \text{minimize} \quad & \sum_{s} V(s) \\ \text{subject to} \quad & V(s) \geq R(s, a) + γ \sum_{s'} P(s' \mid s, a) V(s'), \quad \forall s \in S, a \in A \end{aligned}$

Proof:
The first part is to show that the optimal value function $V^{*}$ is a feasible solution to the above linear programming problem. For any state $s$ and action $a$, we have: $V^{*}(s) = \max_{a'} \left[R(s, a') + γ \sum_{s'} P(s' \mid s, a') V^{*}(s')\right] \geq R(s, a) + γ \sum_{s'} P(s' \mid s, a) V^{*}(s')$
Thus, $V^{*}$ satisfies the constraints of the linear programming problem.
The second part is to show that any feasible solution satisfying the constraints must be greater than or equal to $V^{*}$.
Since the optimal policy $π^{*}$ chooses one action $π^{*}(s)$ for each state $s$, the LP constraints imply that for any state $s$: $V(s) \geq R(s, π^{*}(s)) + γ \sum_{s'} P(s' \mid s, π^{*}(s)) V(s')$
Let's write them in matrix form: $V \geq R^{π^{*}} + γ P^{π^{*}} V$
where $R^{π^{*}}$ is the reward vector under policy $π^{*}$ and $P^{π^{*}}$ is the transition matrix under policy $π^{*}$. We can rearrange the above inequality as: $(I - γ P^{π^{*}}) V \geq R^{π^{*}}$
Since $γ < 1$ and $P^{π^{*}}$ is a stochastic matrix, the spectral radius of $γ P^{π^{*}}$ is less than 1, so $I - γ P^{π^{*}}$ is invertible and its inverse is given by the Neumann series: $(I - γ P^{π^{*}})^{-1} = \sum_{k=0}^{\infty} (γ P^{π^{*}})^{k}$
Every term of this series is a non-negative matrix, so the inverse is non-negative as well, and multiplying both sides of the inequality by $(I - γ P^{π^{*}})^{-1}$ preserves its direction: $V \geq (I - γ P^{π^{*}})^{-1} R^{π^{*}} = V^{π^{*}}$
Since $π^{*}$ is an optimal policy, $V^{π^{*}} = V^{*}$, so we can conclude that $V \geq V^{*}$ componentwise.
So, we have shown that any feasible solution is greater than or equal to $V^{*}$ in every component, and that the optimal value function $V^{*}$ is itself feasible. Because the objective $\sum_{s} V(s)$ increases with every component of $V$, the optimal solution to the linear programming problem is $V^{*}$.
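The result can be tried out with an off-the-shelf LP solver. The following is a minimal sketch, assuming a made-up two-state, two-action MDP and using `scipy.optimize.linprog`; the arrays `P` and `R` are illustrative and not taken from this post.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# P[s, a, s2] = P(s2 | s, a), R[s, a] = expected immediate reward (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.5],
              [2.0, 0.5]])
n_states, n_actions = R.shape

# One constraint per (s, a):  V(s) - gamma * sum_s2 P(s2|s,a) V(s2) >= R(s, a).
# linprog expects A_ub @ x <= b_ub, so both sides are negated.
A_ub = np.zeros((n_states * n_actions, n_states))
b_ub = np.zeros(n_states * n_actions)
for s in range(n_states):
    for a in range(n_actions):
        row = s * n_actions + a
        A_ub[row] = gamma * P[s, a]
        A_ub[row, s] -= 1.0
        b_ub[row] = -R[s, a]

# Objective: minimize sum_s V(s). V is unconstrained in sign, so the default
# non-negativity bounds of linprog are switched off.
result = linprog(c=np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
                 bounds=[(None, None)] * n_states)
V_lp = result.x

# The LP solution should satisfy the Bellman optimality equation.
backup = (R + gamma * P @ V_lp).max(axis=1)
assert np.allclose(V_lp, backup, atol=1e-6)
```

The LP has $|S|$ variables and $|S| \cdot |A|$ constraints, one for every state-action pair.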

Once we have $V^{*}$, we can easily derive the optimal policy $π^{*}$ as: $π^{*}(s) = \arg\max_{a} \left[R(s, a) + γ \sum_{s'} P(s' \mid s, a) V^{*}(s')\right]$

Here I asked Claude for a simple explanation of why this greedy-looking choice is equivalent to the optimal policy. A formal proof relies on the fixed-point theorem, which is beyond the scope of this post. We just need to remember that the optimal policy can be derived from the optimal value function by choosing, in each state, the action that maximizes the expected return.
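The claim can at least be checked numerically on a toy problem. The sketch below assumes a made-up two-state, two-action MDP, obtains $V^{*}$ by brute-force enumeration of deterministic policies, extracts the greedy policy, and confirms that evaluating this policy gives back $V^{*}$; `evaluate` is a helper introduced here for illustration.

```python
import itertools

import numpy as np

gamma = 0.9
# P[s, a, s2] = P(s2 | s, a), R[s, a] = expected immediate reward (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.5],
              [2.0, 0.5]])
n_states, n_actions = R.shape
states = np.arange(n_states)


def evaluate(actions):
    """Value function of a deterministic policy, from (I - gamma P_pi) V = R_pi."""
    actions = np.asarray(actions)
    return np.linalg.solve(np.eye(n_states) - gamma * P[states, actions],
                           R[states, actions])


# Brute-force V*: best value per state over all deterministic policies.
V_star = np.max([evaluate(p) for p in
                 itertools.product(range(n_actions), repeat=n_states)], axis=0)

# Greedy extraction: pick the action maximizing R(s,a) + gamma * sum_s2 P(s2|s,a) V*(s2).
Q_star = R + gamma * P @ V_star
greedy = Q_star.argmax(axis=1)

# Following the greedy policy achieves V* in every state.
V_greedy = evaluate(greedy)
assert np.allclose(V_greedy, V_star)
```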

The Dual Linear Programming for MDP

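The dual linear programming formulation for MDP is as follows: $\begin{aligned} \text{maximize} \quad & \sum_{s, a} ρ(s, a) R(s, a) \\ \text{subject to} \quad & \sum_{a} ρ(s', a) = w(s') + γ \sum_{s, a} ρ(s, a) P(s' \mid s, a), \quad \forall s' \in S \\ & ρ(s, a) \geq 0, \quad \forall s \in S, a \in A \end{aligned}$

Here $w(s) > 0$ is the weight of $V(s)$ in the primal objective; the primal above uses $w(s) = 1$, and any positive weights give the same optimal solution $V^{*}$. If the initial state distribution satisfies $μ(s) > 0$ for all $s \in S$, we can let $w(s) = μ(s)$. Each feasible $ρ$ is then a discounted occupancy measure, the objective is the expected accumulated reward expressed through the occupancy measure, and the constraints are flow conservation constraints.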
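Both readings can be verified for a fixed policy. The sketch below assumes the discounted form of flow conservation, $\sum_{a} ρ(s', a) = μ(s') + γ \sum_{s, a} ρ(s, a) P(s' \mid s, a)$, together with a made-up two-state, two-action MDP, initial distribution `mu`, and policy `pi`; none of these numbers come from this post.

```python
import numpy as np

gamma = 0.9
# P[s, a, s2] = P(s2 | s, a), R[s, a] = expected immediate reward (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.5],
              [2.0, 0.5]])
mu = np.array([0.6, 0.4])        # initial state distribution, mu(s) > 0
pi = np.array([[0.3, 0.7],
               [0.9, 0.1]])      # pi[s, a] = pi(a | s), arbitrary

P_pi = np.einsum("sa,sat->st", pi, P)
R_pi = (pi * R).sum(axis=1)

# Discounted state visitation d(s) = sum_t gamma^t Pr(s_t = s),
# which solves d = mu + gamma * P_pi^T d.
d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)
rho = d[:, None] * pi            # rho(s, a) = d(s) * pi(a | s)

# Flow conservation: outflow of each state equals initial mass plus discounted inflow.
inflow = mu + gamma * np.einsum("sa,sat->t", rho, P)
assert np.allclose(rho.sum(axis=1), inflow)

# Dual objective equals the expected discounted return from mu.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
assert np.isclose((rho * R).sum(), mu @ V_pi)
```

With this unnormalized definition the entries of $ρ$ sum to $1 / (1 - γ)$ rather than to 1.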

Assume that the optimal solution is $ρ^{*}(s, a)$. By Theorem 2, we can derive the optimal policy as: $π^{*}(a \mid s) = \frac{ρ^{*}(s, a)}{\sum_{a'} ρ^{*}(s, a')}$
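The policy recovery can be sketched with an LP solver as well. The example below assumes the discounted dual with $w = μ$, that is, the constraint $\sum_{a} ρ(s', a) = μ(s') + γ \sum_{s, a} ρ(s, a) P(s' \mid s, a)$, and a made-up two-state, two-action MDP solved with `scipy.optimize.linprog`; the numbers are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# P[s, a, s2] = P(s2 | s, a), R[s, a] = expected immediate reward (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.5],
              [2.0, 0.5]])
mu = np.array([0.6, 0.4])        # initial state distribution, mu(s) > 0
n_states, n_actions = R.shape

# Variables: rho flattened so that index s * n_actions + a holds rho(s, a).
# One equality per state s2:
#   sum_a rho(s2, a) - gamma * sum_{s,a} rho(s, a) P(s2 | s, a) = mu(s2).
A_eq = np.zeros((n_states, n_states * n_actions))
for s in range(n_states):
    for a in range(n_actions):
        col = s * n_actions + a
        A_eq[:, col] = -gamma * P[s, a]
        A_eq[s, col] += 1.0

# linprog minimizes, so the reward objective is negated; rho >= 0 via bounds.
result = linprog(c=-R.ravel(), A_eq=A_eq, b_eq=mu, bounds=(0, None))
rho = result.x.reshape(n_states, n_actions)

# Normalize over actions to recover pi*(a | s). Each row sum is at least
# mu(s) > 0, so the division is safe.
policy = rho / rho.sum(axis=1, keepdims=True)
print(np.round(policy, 3))
```

Because $μ(s) > 0$ in every state, every state carries positive occupancy, which is what makes the normalization well defined.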

Comparison between the Primal and Dual Linear Programming for MDP

Aspect	Primal LP	Dual LP
Variable	state value function $V (s)$	occupancy measure $ρ (s, a)$
Objective	minimize $\sum_{s} V (s)$	maximize $\sum_{s, a} ρ (s, a) R (s, a)$
Constraints	Bellman optimality constraints	flow conservation constraints
Interpretation	state value	discounted state-action visitation frequency

Summary

At the beginning of this post, I planned to introduce MDP-based reinforcement learning, but solving the MDP already takes a lot of space, so this post only covers the Bellman equation and the linear programming formulation of the MDP. In the next post, I will introduce the value iteration and policy iteration algorithms for solving an MDP, both of which are based on the Bellman equation.


https://hjcheng0602.github.io/blog/Foundation-of-Reinforcement-learning-II/

Author: Jincheng Han

Posted on 05-17-2026

Updated on 05-17-2026

#Reinforcement learning review notes
