CS 234 Winter 2021
Assignment 1
Due: January 22 at 6:00 pm (PST)
For submission instructions, please refer to the course website. For all problems, if you use an existing result
from either the literature or a textbook to solve the exercise, you need to cite the source.
1 Flappy Karel MDP [25 pts]
There is a hot new mobile game on the market called Flappy Karel, where Karel the robot must
dodge the red pillars of doom and flap its way to the green pasture. Consider the following two grid
environments (Flappy World 1 and Flappy World 2). Starting from any unshaded square, Karel can
either move right & up or right & down (e.g., from state 4 you can move to state 10 or 12; think
checkers). Actions are deterministic and always succeed unless they would cause Karel to run into a
wall. The thicker edges indicate walls, and attempting to move in the direction of a wall results in
falling down one square (e.g., going in any direction from state 30 leads to falling into state 31). A
successful run by Karel in Flappy World 1 is shown in Figure 1b. Taking any action from the green
target square (no. 32) earns a reward of rg and ends the episode. Taking any action from the red
squares of doom (no. 1, 7, 8, 12, 13, ...) earns a reward of rr and ends the episode. Otherwise, from
every other square, taking any action is associated with a reward rs. Assume the discount factor
γ = 0.9, rg = +5, and rr = −5 unless otherwise specified. Note that the horizon is technically infinite
in both worlds.
[Figure 1: (a) Flappy World 1, a 7-row × 5-column grid with squares numbered 1–35 down the columns; (b) a successful run by Karel in Flappy World 1.]
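Parts (b) and (c) ask for optimal value functions and optimal actions, which can be checked with tabular value iteration. The sketch below is our own illustration: the three-state chain (A, B, G) and its rewards are made-up placeholders, not the actual 35-square Flappy World grid, which you would have to encode from Figure 1.

# Sketch: tabular value iteration for a deterministic MDP in the style of Flappy World.
# The states, transitions, and rewards below are placeholders, NOT the real grid.
gamma = 0.9
step_reward = 0.0                 # r_s for ordinary squares
exit_reward = {"G": 5.0}          # r_g: taking any action from a green square ends the episode
next_state = {                    # deterministic "right & up" / "right & down" moves
    "A": {"up": "B", "down": "B"},
    "B": {"up": "G", "down": "G"},
}

V = {"A": 0.0, "B": 0.0, "G": exit_reward["G"]}   # V*(green) = r_g, since the episode then ends
for _ in range(1000):
    delta = 0.0
    for s in next_state:          # back up only the non-terminal squares
        best = max(step_reward + gamma * V[next_state[s][a]] for a in next_state[s])
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-10:             # stop once the Bellman backups have converged
        break
print(V)                          # {'A': 4.05, 'B': 4.5, 'G': 5.0} with these placeholder numbers

For the actual assignment, next_state, step_reward, and exit_reward would be replaced by the 35 squares, the wall/fall dynamics, and rg = +5, rr = −5 from the problem statement.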
(a) Let rs ∈ {−4, −1, 0, 1}. Starting in square 2, for each of the possible values of rs, briefly
explain what the optimal policy would be in Flappy World 1. In each case, is the optimal policy
unique, and does the optimal policy depend on the value of the discount factor γ? Explain
your answer. [5 pts]
(b) What value of rs would cause the optimal policy to return the shortest path to the green
target square? Using this value of rs, find the optimal value function for each square in Flappy
World 1. What is the optimal action from square 27? [5 pts]
Now consider Flappy World 2. It is the same as Flappy World 1, except there are no walls on the
right and left sides. Going past the right end of Flappy World 2 simply loops you around to the
left-hand side. See Figure 2b for a successful run by Karel in Flappy World 2.
[Figure 2: (a) Flappy World 2; (b) a successful run by Karel in Flappy World 2.]
(c) Let rs ∈ {−4, −1, 0, 1}. Starting in square 3, for each of the possible values of rs, briefly
explain what the optimal policy would be in Flappy World 2. Using the value of rs that
would cause the optimal policy to return the shortest path to the green target square, find the
optimal value function for each square in Flappy World 2. What is the optimal action from
square 27? [5 pts]
(d) Consider a general MDP with rewards and transitions, and a discount factor γ. For
this case, assume that the horizon is infinite (so there is no termination). A policy π in this
MDP induces a value function V^π (let's refer to this as V^π_old). Now suppose we have the same
MDP where all rewards have a constant c added to them and have then been scaled by a
constant a (i.e., r_new = a(c + r_old)). Can you come up with an expression for the new value
function V^π_new induced by π in this second MDP, in terms of V^π_old, c, a, and γ? (One possible
derivation is sketched after part (e).) [5 pts]
(e) Can scaling all the rewards by a fixed amount change the optimal policy of an MDP? If so,
describe how different ranges of the constant a (where r_new = a · r_old) would change the
optimal policy of the MDP from part (c). [5 pts]
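A possible derivation sketch for part (d) (our own working, not the official solution): expand the definition of the value function under the new rewards and pull the constants out, using the geometric series Σ_{t≥0} γ^t = 1/(1 − γ) for the infinite horizon:

V^π_new(s) = E_π[ Σ_{t≥0} γ^t · a(c + r_t) | s_0 = s ]
           = a · E_π[ Σ_{t≥0} γ^t r_t | s_0 = s ] + a·c · Σ_{t≥0} γ^t
           = a · V^π_old(s) + a·c / (1 − γ).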
(a) rs = 1: take the longest possible path (2, 10, 18, 24, 30, 31, 32) to the target square.
rs = 0: take the shortest possible path to the target square.
rs = −1: take the shortest possible path to the target square.
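As a quick check of the rs = 1 claim (our own arithmetic, not part of the original notes): if Karel survives k steps at reward rs = 1 each and then earns rg = 5 from the green square, the discounted return is

Σ_{t=0}^{k−1} (0.9)^t · 1 + (0.9)^k · 5 = 10(1 − 0.9^k) + 5 · 0.9^k = 10 − 5 · (0.9)^k,

which is increasing in k, so postponing the terminal reward as long as possible is indeed better, consistent with taking the longest path.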