Markov Decision Processes
Welcome to Lecture 5 – Markov Decision Processes, where we will learn how to deal with non-deterministic search problems. We will start from Expectimax, introduced in the last lecture, and use it to model the uncertain outcomes of our own actions instead of a randomly acting opponent.
Additionally, we will exploit the “Markovian” nature of our problem. “Markov” means that the outcome depends only on the state the agent is currently in and the chosen action, not on the history. It does not matter what the agent did or where it was three or ten steps ago. And if the agent reaches the same state again 20 steps later, the available actions and the probabilities of the different outcomes are unchanged.
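To make the Markov property concrete, here is a minimal sketch in Python, using a made-up racing example (states `cool`, `warm`, `overheated`; actions `slow`, `fast`) that is an illustration, not part of the lecture material. The point is only that the transition table is keyed by (state, action): the distribution over successor states never depends on how the agent got there.

```python
# Minimal sketch of an MDP transition model (hypothetical racing example).
# The Markov property is visible in the lookup key: the distribution over
# successor states depends only on the current state s and the chosen action a,
# never on the history of earlier states or actions.

STATES = ["cool", "warm", "overheated"]
ACTIONS = ["slow", "fast"]

# T[(s, a)] is a list of (next_state, probability) pairs.
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
    # "overheated" is terminal: no actions available there.
}

def successors(s, a):
    """Return the possible next states and their probabilities for taking a in s."""
    return T[(s, a)]

print(successors("warm", "slow"))  # half the time we cool down, half the time we stay warm
```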
To deal with Markov Decision Processes efficiently, we will look at two methods: value iteration and policy iteration, the latter consisting of policy evaluation and policy improvement. If you look closely, all of these methods look similar: they are all based on the Bellman Equations. So once you understand one of them, the others become much simpler. A small preview sketch of value iteration follows below.
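As a preview of MDPs 4-5, here is a minimal sketch of value iteration on the same made-up racing example; the rewards and the discount gamma = 0.9 are likewise assumptions for illustration only. For brevity, transition probabilities and rewards are bundled into one table. The line computing the q values is the Bellman optimality backup; policy evaluation and policy improvement (MDPs 6-7) reuse the same expression, with the maximization either fixed to a given policy or applied once at the end.

```python
# Sketch of value iteration (hypothetical racing MDP; rewards and gamma are made up).
# The q values below implement the Bellman optimality backup:
#   V_{k+1}(s) = max_a  sum_{s'} T(s, a, s') * ( R(s, a, s') + gamma * V_k(s') )

# MDP[(s, a)] -> list of (next_state, probability, reward) triples.
MDP = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("overheated", 1.0, -10.0)],
}
STATES = ["cool", "warm", "overheated"]   # "overheated" is terminal (no actions)
ACTIONS = ["slow", "fast"]

def value_iteration(mdp, states, actions, gamma=0.9, eps=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            qs = [sum(p * (r + gamma * V[s2]) for s2, p, r in mdp[(s, a)])
                  for a in actions if (s, a) in mdp]
            V_new[s] = max(qs) if qs else 0.0   # terminal states keep value 0
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

print(value_iteration(MDP, STATES, ACTIONS))
```

Nothing in the function is specific to the racing example: any MDP given as such a table of transitions and rewards would work.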
Why do we need this? In the next lecture you will use what you learn here for Reinforcement Learning.
Topics in this lecture:
- Introduction and Outlook (MDPs 1)
- MDPs (MDPs 2)
- Finite Horizons and Discounting (MDPs 3)
- Value Iteration (MDPs 4-5)
- Policy Iteration (MDPs 6-7)
- Summary and Outlook (MDPs 8)
The lecture videos can be found here.
Step-by-step examples and additional explanations:
- Formalization
- Value Iteration & Discounting
- Policy Iteration