Exercise 9: Monte-Carlo methods#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Tabular methods (Q-learning, Sarsa, etc.)#
As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will hopefully simplify the process.
Note
The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of
dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] variable, a numpy ndarray such that a given action a is available if info['mask'][a] == 1.
You can read more about this choice at The gymnasium discrete space documentation.
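To make the masking convention concrete, here is a minimal, self-contained sketch (the helper name `available_actions` is made up for illustration and is not part of irlc or Gymnasium):

```python
import numpy as np

def available_actions(mask):
    """Return the indices of actions that are available according to a
    Gymnasium-style mask, where mask[a] == 1 means action a is allowed."""
    return [a for a in range(len(mask)) if mask[a] == 1]

# Example: four actions, of which only actions 0 and 2 are available.
mask = np.array([1, 0, 1, 0], dtype=np.int8)
print(available_actions(mask))  # -> [0, 2]
```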
The \(Q\)-values behave like a 2d numpy ndarray:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1] # Get a Q-value
2
>>> agent.Q[state, 0] # Q-values are by default zero
0
To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not run into trouble.
To get the optimal action use agent.Q.get_optimal_action(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
To get all Q-values corresponding to a state use agent.Q.get_Qs(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5  # Update a Q-value; action a=1 is now optimal.
>>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
>>> actions  # All actions that are available in this state (after masking)
(0, 1, 2, 3)
>>> Qs  # All Q-values available in this state (after masking)
(0, 2.5, 0, 0)
You can combine this functionality to get e.g. the maximal Q-value using agent.Q[s, agent.Q.get_optimal_action(s, info)].
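The semantics of these two helpers can be illustrated with a small self-contained stand-in (the class `MiniQTable` below is made up for illustration; it mimics, but is not, the actual irlc Q-table):

```python
from collections import defaultdict

class MiniQTable:
    """Illustrative stand-in for the irlc Q-table: Q-values default to zero,
    and the two lookup helpers respect a Gymnasium-style action mask."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.q = defaultdict(float)  # (state, action) -> Q-value, default 0.

    def __setitem__(self, sa, value):
        self.q[sa] = value

    def __getitem__(self, sa):
        return self.q[sa]

    def get_Qs(self, s, info):
        # Keep only the actions the mask allows; no mask means all allowed.
        mask = info.get('mask', [1] * self.n_actions)
        actions = tuple(a for a in range(self.n_actions) if mask[a] == 1)
        return actions, tuple(self.q[(s, a)] for a in actions)

    def get_optimal_action(self, s, info):
        actions, Qs = self.get_Qs(s, info)
        return actions[Qs.index(max(Qs))]

Q = MiniQTable(n_actions=4)
s, info = (0, 0), {'mask': [1, 1, 1, 1]}
Q[s, 1] = 2.5
actions, Qs = Q.get_Qs(s, info)
a_star = Q.get_optimal_action(s, info)
print(actions, Qs, a_star)  # -> (0, 1, 2, 3) (0.0, 2.5, 0.0, 0.0) 1
```

Note how the maximal Q-value is then just `Q[s, Q.get_optimal_action(s, info)]`, mirroring the combination mentioned above.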
Note
The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.
We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function
uses the Q-values, it also requires an info-dictionary:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1) # epsilon-greedy exploration
>>> state, info = env.reset() # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info) # Epsilon-greedy action selection
>>> a
1
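The idea behind \(\varepsilon\)-greedy selection can be sketched in a few self-contained lines (the function `pi_eps` below is an illustrative stand-in, not the agent's actual implementation; it assumes the actions and Q-values have already been masked):

```python
import random

def pi_eps(actions, Qs, epsilon, rng=random):
    """Epsilon-greedy selection over the (already masked) available actions:
    with probability epsilon pick uniformly at random, otherwise greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return actions[Qs.index(max(Qs))]

actions, Qs = (0, 1, 2, 3), (0.0, 2.5, 0.0, 0.0)
a = pi_eps(actions, Qs, epsilon=0.1)  # Usually 1, occasionally random.
```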
Warning
In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary corresponding to each state:
use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp); never use self.Q.get_Qs(s, info_sp).
Classes and functions#
None so far.
Solutions to selected exercises#
Solution to Problem 1 - The single-state example
Part a: Comparing to the \(T=3\) example, clearly \(N^{\text{first} }(s_1) = 1\), and the return is:
Therefore the accumulated reward is:
Part b: From the previous example, it is clear that the return only depends on the number of future times \(T\) we visit a state. If we again consider the \(T=3\) example we see that \(N^{\text{every} }(s_1) = T\) since we visit \(s_1\) a total of \(T\) times. Each time we compute the (future) return \(G_t\); in the first visit this is equal to what we got before, in the next visit it is based on \(T-1\) future rewards, and so on. For \(T=3\):
So in general:
Part c: Given the above we can compute the two estimators for \(m=1\) and see that they clearly give different results. Although they agree when \(T \rightarrow \infty\), this limit is not interesting since the value of \(T\) is random and not under our control.
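The difference between the two estimators is easy to see numerically. The sketch below assumes, as in the single-state example, a reward of 1 on every step of an episode of length \(T=3\) (the helper `returns` is made up for illustration):

```python
def returns(rewards, gamma):
    """Future discounted return G_t for every time step of one episode,
    computed by a backward pass: G_t = r_t + gamma * G_{t+1}."""
    G, Gs = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        Gs.append(G)
    return Gs[::-1]

gamma, T = 0.9, 3
rewards = [1.0] * T               # assumed: reward 1 on every step
Gs = returns(rewards, gamma)      # [2.71, 1.9, 1.0]

first_visit = Gs[0]               # one return from the single first visit
every_visit = sum(Gs) / len(Gs)   # average over all T visits
```

Here `first_visit` is \(1 + \gamma + \gamma^2 = 2.71\), while `every_visit` is \((2.71 + 1.9 + 1.0)/3 = 1.87\): the two estimators disagree for finite \(T\).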
Part d: Given \(m\) estimates clearly \(N^{\text{first} }(s_1) = m\) and
Part e: Again clearly \(N^{\text{every} }(s_1) = \sum_{k=1}^m T_k\) and
Part f: By calculation we get that:
According to the law of large numbers, the error in this approximation scales as \(\frac{1}{\sqrt{m}}\).
Part g: Starting with first-visit we have that:
For every visit we get that
So we see that first and every visit agree when \(m\) is large and the true return is \(v_\pi(s_1) = \frac{1}{1-\gamma p}\).
Bonus proofs
The chance of staying in \(s_1\) is \(p\), the chance of staying \(T-1\) times is \(p^{T-1}\), and the chance of jumping to \(s_2\) is \(1-p\). Therefore \(p(T) = p^{T-1}(1-p)\).
- \[\begin{split}\mathbb{E}[T] & = (1-p)\sum_{T=0}^\infty T p^{T-1} = (1-p) \sum_{T=0}^\infty \frac{d}{dp} p^T \\ & = (1-p) \frac{d}{dp} \sum_{T=0}^\infty p^T = (1-p)\frac{d}{dp} \frac{1}{1-p} = (1-p)\frac{1}{(1-p)^2} = \frac{1}{1-p}.\end{split}\]
- \[\mathbb{E}[\gamma^T] = (1-p)\sum_{T=1}^\infty \gamma^T p^{T-1} = (1-p)\gamma \sum_{T=1}^\infty (\gamma p)^{T-1} = \frac{(1-p)\gamma}{1-\gamma p}.\]
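Both closed forms are easy to sanity-check by summing the geometric distribution \(p(T) = p^{T-1}(1-p)\) numerically (truncating the infinite sums at a large cutoff; the parameter values below are arbitrary):

```python
gamma, p, N = 0.8, 0.5, 10_000  # arbitrary example values; N truncates the sums

# p(T) = p^(T-1) * (1 - p) for T = 1, 2, ...
E_T      = sum(T * p**(T - 1) * (1 - p) for T in range(1, N))
E_gammaT = sum(gamma**T * p**(T - 1) * (1 - p) for T in range(1, N))

# Compare against the closed forms derived above.
assert abs(E_T - 1 / (1 - p)) < 1e-9
assert abs(E_gammaT - (1 - p) * gamma / (1 - gamma * p)) < 1e-9
```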
Problem 10.1-10.2: MC Value estimation
Problem 10.3-10.4: MC control