Foundation of Reinforcement learning(IV)
Introduction
In the previous posts, we have introduced the MDP and its solution. But in practice, we often do not have the simulation of the environment, which means we cannot directly apply our knowledge of MDP to solve the problem. There is indeed a method to simulate the environment, which is called model-based Reinforcement learning. However, in this post, we will focus on the model-free Reinforcement learning, which does not require the simulation of the environment and the construction of MDPs.
Estimating Value Functions
In mode-based RL, value functions can be computed by DP methods as follows:
However, in model-free RL, we cannot directly access the and , but we have some ways to estimate the value function from episodes of experience.
Why we estimate the value function?
Because we can use the value function to derive the optimal policy, which is our ultimate goal. Besides value function can help us to reuse historical experience to make better decisions in the future, which is the essence of Reinforcement learning.
Here is a graph to introduce some methods to estimate the value function:
Monte Carlo methods
Target: Learn from episodes of experience.
Review: accumulate reward function:
Review: value function is the expected return:
The Monte Carlo method use empirical mean cumulative reward instead of expected return to estimate the value function.
First-visit Monte Carlo method
The first-visit Monte Carlo method estimates the value function by averaging the returns following the first time a state is visited in an episode. The algorithm is as follows:
- Initialization:
- For any ,, .
- For any ,.
- Loop for each episode:
- Generate an episode following policy : .
- For t = T-1, T-2, …, 0:
- If is the first time in the episode:
- Append to .
- .
The reason why we call it ‘first-visit’ is that we only update the value function for the first time we visit a state in an episode. Thus we can avoid the bias caused by multiple visits to the same state in an episode. However, this method may have high variance because it only uses one return for each state in an episode.
Incremental Monte Carlo method
The first-visit Monte Carlo method takes a lot of memory to store the returns for each state, which is not efficient. The incremental Monte Carlo method uses an incremental update rule to estimate the value function without storing all the returns. The algorithm is as follows:
- Initialization:
- For any ,, , .
- Loop for each episode:
- Generate an episode following policy : .
- For t = T-1, T-2, …, 0:
- If is the first time in the episode:
- .
Interesting, online softmax also takes the same update rule as the incremental Monte Carlo method. Great job.
Besides, incremental MC provides more design space for us to tackle some problems in practice. For example, we can use a constant step size instead of to update the value function, which is called constant step size MC method. It is useful when the environment is non-stationary, which means the reward function and transition probability may change over time. In this case, we want to give more weight to recent returns than old returns, which can be achieved by using a constant step size :
Some properties of Monte Carlo methods
- Monte Carlo methods are model-free, which means they do not require the knowledge of the environment’s dynamics (transition probabilities and reward function).
- Monte Carlo methods take the simpliest approach to estimate the value function, which is to average the returns following the policy. However, this method may have high variance because it only uses one return for each state in an episode.
- One key to note is that Monte Carlo methods can only be applied to finite MDPs, which means the state space and action space must be finite.
Importance sampling
Let’s try to estimate a custom distribution ‘s expectation.
Then we reassign the importance sampling weight , we can rewrite the above equation as:
Off-policy Monte Carlo methods via Importance Sampling
We can use the cumulative reward function of policy to justify policy , and then weight the cumulative reward function by the importance ratio between and to estimate the value function of policy . The algorithm is as follows:
Every episode would be mutified by the importance sampling ratio:
So we then update the value function by:
Sample by importance sampling will significantly increase the variance of the return, which is because the importance sampling ratio can be very large when and are very different.
Temporal-Difference Learning
Temporal-Difference (TD) is a method combining the MC method and DP method, which name comes from the fact that it uses the diff of estimated value function at two consecutive time steps to update the value function. There are two key ideas in TD learning: TD error and TD target.
For state value funtion , after a transition from state to state with reward , the TD error is defined as:
The TD target is defined as:
As for the TD in Bellman expectation equation, the TD error is used in estimating the expect part.
Some details of TD learning
The simpliest TD learning algorithm is called TD(0), which updates the value function by the TD error at each time step. The key equation of TD(0) is as follows:
Why we update like this?
The Bellman expectation equation is rewritten as:
That’s all. We want to make the TD error as small as possible, which means we want to make the estimated value function as close as possible to the true value function. Thus, we can use the TD error to update the value function.
The TD method introduce the bootstrapping idea, which means we use the estimated value function to update the value function. This is different from the MC method, which uses the actual return to update the value function. The bootstrapping idea can significantly reduce the variance of the return, but it may introduce bias because we are using an estimated value function to update the value function.
Contrast between TD and MC methods
They have the same goal: Learn the value function from episodes of experience. However, they have different approaches to achieve this goal. The MC method uses the actual return to update the value function, which can have high variance but no bias. The TD method uses the estimated value function to update the value function, which can have low variance but may introduce bias.
| TD method | MC method |
|---|---|
| update value function like | update value function like |
The object of TD is , which is called TD target, while the object of MC is , which is the actual return. The TD method’s error is called TD error, which is defined as , while the MC method’s error is defined as .
The strengths and limitations of TD learning and MC learning
TD method can learn until the end of an episode:
- After each step in an episode, TD method can update the value function use the former value function, which means it can learn until the end of an episode. However, MC method can only update the value function after the end of an episode, which means it cannot learn until the end of an episode.
- TD method can learn from incomplete episodes, which means it can learn from episodes that are not terminated. However, MC method can only learn from complete episodes, which means it cannot learn from episodes that are not terminated.
Tradeoff between bias and variance
| Estimator | Bias | Variance | |
|---|---|---|---|
| MC | Unbiased: | Higher | |
| TD (real) | Unbiased: | Lower | |
| TD (actual) | Biased: ≠ | Lower |
Note: The real TD target uses the true , which is unknown in practice. The actual TD target uses the current estimate , introducing bias. Despite the bias, TD typically has lower variance than MC because it bootstraps from a single step rather than a full trajectory.
Multi-step TD learning
The TD(0) method only uses the immediate reward and the estimated value of the next state to update the value function, which may not be sufficient to capture the long-term dependencies in the environment. The multi-step TD learning method uses the rewards and estimated values of multiple future states to update the value function, which can better capture the long-term dependencies in the environment. We will introduce it by leading into n-step cumulate reward function and n-step TD target.
N-step cumulate reward function
Consider the following n-step cumulate reward function:
It seems make sense to use the n-step cumulate reward function to update the value function, which is called n-step TD learning. The key equation of n-step TD learning is as follows:
N-step mean cumulate reward function
Can we take up the information of different n-step cumulate reward function to update the value function? The answer is yes.
We can use a weighted average of different n-step cumulate reward functions to update the value function, which is called n-step mean TD learning. The key weight figure of weighted average is as follows:
So the n-step mean cumulate reward function is defined as:
Then we can update the value function by:
This is called TD() method, which is a generalization of TD(0) and MC methods. When , TD() reduces to TD(0) method, and when , TD() reduces to MC method. Thus, by adjusting the value of , we can control the bias-variance tradeoff in the estimation of the value function.
Conclusion about TD(λ) method
Unless the is 0, TD() mothod is unbiased, because it’s a weighted average of unbiased n-step TD targets.
The variance of TD() method is lower than that of MC method, because it uses bootstrapping to update the value function, which can reduce the variance of the return. However, the variance of TD() method is higher than that of TD(0) method, because it uses more rewards and estimated values to update the value function, which can increase the variance of the return.
Empirically is not quite commom because fast credit assignment for a given action is preferred. So MC or TD(0) is more commonly used in practice. However, TD() method can be useful when we want to balance the bias-variance tradeoff in the estimation of the value function, which can be achieved by adjusting the value of .
So TD() use as variable while n-step TD use as variable. TD() is a generalization of n-step TD, which can be seen as a weighted average of infin-step TD targets. By adjusting the value of , we can control the bias-variance tradeoff in the estimation of the value function, which can be useful in practice when we want to balance the bias and variance in the estimation of the value function.
Conclusion
In this post, we have introduced the model-free Reinforcement learning, which does not require the simulation of the environment and the construction of MDPs. We have introduced two methods to estimate the value function from episodes of experience: Monte Carlo methods and Temporal-Difference learning. We have also introduced the n-step TD learning method, which uses the rewards and estimated values of multiple future states to update the value function, which can better capture the long-term dependencies in the environment. Finally, we have discussed the bias-variance tradeoff in the estimation of the value function, which can be controlled by adjusting the value of in TD() method.
