Exercise 9: Monte-Carlo methods#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Tabular methods (Q-learning, Sarsa, etc.)#
As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will hopefully simplify the process.
Note
The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of
dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] variable, a numpy ndarray such that a given action a is available if info['mask'][a] == 1.
You can read more about this choice at The gymnasium discrete space documentation.
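To make the masking convention concrete, here is a minimal, self-contained sketch (the helper name `available_actions` is made up for illustration and is not part of irlc or Gymnasium):

```python
import numpy as np

def available_actions(mask):
    """Return the indices of actions that are available according to a
    Gymnasium-style mask, where mask[a] == 1 means action a is allowed."""
    return [a for a in range(len(mask)) if mask[a] == 1]

# Example: four actions, of which only actions 0 and 2 are available.
mask = np.array([1, 0, 1, 0], dtype=np.int8)
print(available_actions(mask))  # -> [0, 2]
```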
The \(Q\)-values behave like a 2d numpy ndarray:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1] # Get a Q-value
2
>>> agent.Q[state, 0] # Q-values are by default zero
0
To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not run into trouble.
To get the optimal action use agent.Q.get_optimal_action(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5  # Update a Q-value; action a=1 is now optimal.
>>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state
1
To get all Q-values corresponding to a state use agent.Q.get_Qs(s, info_s):
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()  # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5  # Update a Q-value; action a=1 is now optimal.
>>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
>>> actions  # All actions that are available in this state (after masking)
(0, 1, 2, 3)
>>> Qs  # All Q-values available in this state (after masking)
(0, 2.5, 0, 0)
You can combine this functionality to get e.g. the maximal Q-value using agent.Q[s, agent.Q.get_optimal_action(s, info)].
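The semantics of these two helpers can be illustrated with a small self-contained stand-in (the class `MiniQTable` below is made up for illustration; it mimics, but is not, the actual irlc Q-table):

```python
from collections import defaultdict

class MiniQTable:
    """Illustrative stand-in for the irlc Q-table: Q-values default to zero,
    and the two lookup helpers respect a Gymnasium-style action mask."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.q = defaultdict(float)  # (state, action) -> Q-value, default 0.

    def __setitem__(self, sa, value):
        self.q[sa] = value

    def __getitem__(self, sa):
        return self.q[sa]

    def get_Qs(self, s, info):
        # Keep only the actions the mask allows; no mask means all allowed.
        mask = info.get('mask', [1] * self.n_actions)
        actions = tuple(a for a in range(self.n_actions) if mask[a] == 1)
        return actions, tuple(self.q[(s, a)] for a in actions)

    def get_optimal_action(self, s, info):
        actions, Qs = self.get_Qs(s, info)
        return actions[Qs.index(max(Qs))]

Q = MiniQTable(n_actions=4)
s, info = (0, 0), {'mask': [1, 1, 1, 1]}
Q[s, 1] = 2.5
actions, Qs = Q.get_Qs(s, info)
a_star = Q.get_optimal_action(s, info)
print(actions, Qs, a_star)  # -> (0, 1, 2, 3) (0.0, 2.5, 0.0, 0.0) 1
```

Note how the maximal Q-value is then just `Q[s, Q.get_optimal_action(s, info)]`, mirroring the combination mentioned above.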
Note
The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.
We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function
uses the Q-values, it also requires an info-dictionary:
>>> from irlc.ex08.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1) # epsilon-greedy exploration
>>> state, info = env.reset() # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info) # Epsilon-greedy action selection
>>> a
1
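The idea behind \(\varepsilon\)-greedy selection can be sketched in a few self-contained lines (the function `pi_eps` below is an illustrative stand-in, not the agent's actual implementation; it assumes the actions and Q-values have already been masked):

```python
import random

def pi_eps(actions, Qs, epsilon, rng=random):
    """Epsilon-greedy selection over the (already masked) available actions:
    with probability epsilon pick uniformly at random, otherwise greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return actions[Qs.index(max(Qs))]

actions, Qs = (0, 1, 2, 3), (0.0, 2.5, 0.0, 0.0)
a = pi_eps(actions, Qs, epsilon=0.1)  # Usually 1, occasionally random.
```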
Warning
In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary corresponding to each state:
use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp); never use self.Q.get_Qs(s, info_sp).
Classes and functions#
None so far.
Solutions to selected exercises#
Solution to Problem 1 - The single-state example
Part a: Comparing to the \(T=3\) example, clearly \(N^{\text{first} }(s_1) = 1\), and the return is:
Therefore the accumulated reward is:
Part b: From the previous example, it is clear that the return only depends on the number of future times \(T\) we visit a state. If we again consider the \(T=3\) example we see that \(N^{\text{every} }(s_1) = T\) since we visit \(s_1\) a total of \(T\) times. Each time we compute the (future) return \(G_t\); in the first visit this is equal to what we got before, in the next visit it is based on \(T-1\) future rewards, and so on. For \(T=3\):
So in general:
Part c: Given the above we can compute the two estimators for \(m=1\) and see that they clearly give different results. Although they agree when \(T \rightarrow \infty\), this limit is not interesting since the value of \(T\) is random and not under our control.
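The difference between the two estimators is easy to see numerically. The sketch below assumes, as in the single-state example, a reward of 1 on every step of an episode of length \(T=3\) (the helper `returns` is made up for illustration):

```python
def returns(rewards, gamma):
    """Future discounted return G_t for every time step of one episode,
    computed by a backward pass: G_t = r_t + gamma * G_{t+1}."""
    G, Gs = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        Gs.append(G)
    return Gs[::-1]

gamma, T = 0.9, 3
rewards = [1.0] * T               # assumed: reward 1 on every step
Gs = returns(rewards, gamma)      # [2.71, 1.9, 1.0]

first_visit = Gs[0]               # one return from the single first visit
every_visit = sum(Gs) / len(Gs)   # average over all T visits
```

Here `first_visit` is \(1 + \gamma + \gamma^2 = 2.71\), while `every_visit` is \((2.71 + 1.9 + 1.0)/3 = 1.87\): the two estimators disagree for finite \(T\).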
Part d: Given \(m\) estimates clearly \(N^{\text{first} }(s_1) = m\) and
Part e: Again clearly \(N^{\text{every} }(s_1) = \sum_{k=1}^m T_k\) and
Part f: By calculation we get that:
According to the law of large numbers, the error in this approximation scales as \(\frac{1}{\sqrt{m}}\).
Part g: Starting with first-visit we have that:
For every visit we get that
So we see that first and every visit agree when \(m\) is large and the true return is \(v_\pi(s_1) = \frac{1}{1-\gamma p}\).
Bonus proofs
The chance of staying in \(s_1\) is \(p\), the chance of staying \(T-1\) times is \(p^{T-1}\), and the chance of jumping to \(s_2\) is \(1-p\). Therefore \(p(T) = p^{T-1}(1-p)\).
- \[\begin{split}\mathbb{E}[T] & = (1-p)\sum_{T=0}^\infty T p^{T-1} = (1-p) \sum_{T=0}^\infty \frac{d}{dp} p^T \\ & = (1-p) \frac{d}{dp} \sum_{T=0}^\infty p^T = (1-p)\frac{d}{dp} \frac{1}{1-p} = (1-p)\frac{1}{(1-p)^2} = \frac{1}{1-p}.\end{split}\]
- \[\mathbb{E}[\gamma^T] = (1-p)\sum_{T=1}^\infty \gamma^T p^{T-1} = (1-p)\gamma \sum_{T=1}^\infty (\gamma p)^{T-1} = \frac{(1-p)\gamma}{1-\gamma p}.\]
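Both closed forms are easy to sanity-check by summing the geometric distribution \(p(T) = p^{T-1}(1-p)\) numerically (truncating the infinite sums at a large cutoff; the parameter values below are arbitrary):

```python
gamma, p, N = 0.8, 0.5, 10_000  # arbitrary example values; N truncates the sums

# p(T) = p^(T-1) * (1 - p) for T = 1, 2, ...
E_T      = sum(T * p**(T - 1) * (1 - p) for T in range(1, N))
E_gammaT = sum(gamma**T * p**(T - 1) * (1 - p) for T in range(1, N))

# Compare against the closed forms derived above.
assert abs(E_T - 1 / (1 - p)) < 1e-9
assert abs(E_gammaT - (1 - p) * gamma / (1 - gamma * p)) < 1e-9
```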
Problem 10.1-10.2: MC Value estimation
Problem 10.3-10.4: MC control