Foundation of Reinforcement learning(II)

Introduction

Given the former post where we have introduced the MDP and some basic properties, we are now ready to discuss the MDP-based Reinforcement learning. But first, we need to introduce the solution of MDP.

Bellman Equation

If we have learned the previous post, we will know that there are two types of value function, state value function and action value function. Their mathematical definitions are as follows:

Instinctively, the state value function is the expected return when we start from state and follow policy . Similarly, the action value function is defined as:

Here, it represents the expected return when we start from state , take action , and then follow policy thereafter.

On the other hand, we have a accumulate reward function, which is defined as:
It can be recursively defined as:

So, we can rewrite the state value function as:
Above is the Bellman expectation equation for the state value function. Now, our question is to choose the best policy to maximize the value function. We can define the optimal state value function as:

Due to the Principle of Optimality: Each stage of an optimal policy must be optimal for the remaining stages, we can derive the Bellman optimality equation for the state value function as:

Above is the Bellman optimality equation for the state value function.

Linear Programming for MDP

We have a key observation that the Bellman optimality equation can be rewritten as a linear programming problem. The linear programming formulation for MDP is as follows:

Proof:
The first part is to show that the optimal value function is a feasible solution to the above linear programming problem. We can see that for any state and action , we have:
Thus, satisfies the constraints of the linear programming problem.
The second part is to show that any feasible solution satisfying the constraints must be greater than or equal to .
Given that the optimal policy choose one action for each state , LP constraints imply that for any state :
Let’s write them in matrix form:
while the is the reward vector under policy , and is the transition matrix under policy . We can rearrange the above inequality as:
Since , we can conclude that is invertible, and we can get its inverse as:
Obviously, the above inverse is a non-negative matrix. Thus, we can multiply both sides of the inequality by to get:
Since , we can conclude that .
So, we have shown that any feasible solution is greater than or equal to , and the optimal value function is a feasible solution. Thus, the optimal solution to the linear programming problem is .

Given we have the , we can easily derive the optimal policy as:

Here I asked claude to give me a simple explanation of the equivalence between this greedy like policy decision and the optimal policy. The formal proof may be need to use the fixed point theorem, but it is beyond the scope of this post. We just need to remember that the optimal policy can be derived from the optimal value function by choosing the action that maximizes the expected return.

The Dual Linear Programming for MDP

The dual linear programming formulation for MDP is as follows:

If the initial state distribution for all , then let
, the target function means the expected accumulated reward represented by the occupancy measure, and the constraints means the flow conservation constraints.

Assume that the optimal solution is , we can derive the optimal policy from Theorem 2 as:

Comparison between the Primal and Dual Linear Programming for MDP

Dimension Primal LP Dual LP
variable state value function occupancy measure
objective minimize maximize
constraints Bellman optimality constraints flow conservation constraints
explanation state value action frequency

Summary

At the beginning of this post, I tried to introduce the MDP-based Reinforcement learning, but I found that the solution of MDP takes a lot of space, so I just introduce the Bellman equation and the linear programming formulation for MDP. In the next post, I will introduce the value iteration and policy iteration algorithms for solving MDP, which are based on the Bellman equation.

Foundation of Reinforcement learning(II)

https://hjcheng0602.github.io/blog/Foundation-of-Reinforcement-learning-II/

AuthorJincheng Han
Posted on05-17-2026
Updated on05-17-2026

Comments