Summary

Samenvatting Reinforcement Learning, ISBN: 9780262193986 Reinforcement Learning (6013B0359Y)

Name: Samenvatting Reinforcement Learning, ISBN: 9780262193986 Reinforcement Learning (6013B0359Y)
SKU: doc_2509493
Rating: 5.00 (1 reviews)
Author: feanne1

1 review

46 views 4 purchases

Course
Reinforcement Learning (6013B0359Y)

Institution
Universiteit Van Amsterdam (UvA)

Book
Reinforcement Learning

This is a comprehensive summary of Reinforcement's lectures, including a number of tips & notes. The summary, like the course, is in English and many formulas have been added.

[Show more]

Preview 3 out of 28 pages

View example

Summarized whole book? Yes
Uploaded on March 24, 2023
Number of pages 28
Written in 2022/2023
Type Summary

error

Book Title:Reinforcement Learning

Author(s):Richard S. Sutton, Andrew G. Barto

Edition:Unknown
ISBN:9780262193986
Edition:Unknown

Institution
Universiteit van Amsterdam (UvA)
Education
Econometrics
Course
Reinforcement Learning (6013B0359Y)

1 review

By: zulkarneync • 1 year ago

feanne1

Member since 1 year 7 documents sold

$12.37

Also available in package deal from $22.25

Added

Add to cart

Add to wishlist

100% satisfaction guarantee
Immediately available after payment
Both online and in PDF
No strings attached

Also available in package deal (1)

Econometrie 2e en 3e jaars vakken UvA voordeelbundel

$ 31.36 $ 22.25 3 items

1. Summary - Samenvatting reinforcement learning, isbn: 9780262193986 reinforcement learning (601...
2. Summary - Samenvatting corporate finance, global edition, isbn: 9781292304151 finance (6011p02...
3. Summary - Life insurance mathematics - samenvatting soln man actu maths life contingen risks, i...
Show more

Reinforcement Learning Summary
Lecture 1

RL = what to do to maximize a numerical reward
1 action for each situation -> highest reward

Problem of RL:
• Sense of the state of the environment
• Take actions -> affect the state
• Goal relating to the state

Learning from interactions, directly from its environment

Exploration-Exploitation dilemma:
Exploitation: profit from your experience
Exploration: look for better options in the future

Elements of RL:
1. Policy
Behavior of learning agent in time
Action to be taken
Policies may be stochastic
2. Reward Rt
Goal of the problem
Reward-signal
3. Value function V (s)
Total amount of reward expected to accumulate over the future, starting from state s
Long-run desirability of the state, considering future states & rewards
Actions based on value judgements
Value estimation
4. Model environment
Model-free: trial & error learner
Model-based given state & action, model predicts next state & R

→ What action to take as a function of the state signal
→ Learn while interacting

Multi-armed bandits

k-armed bandit → k options & 1 situation: non-associative feedback problem
k: number of actions
t: time step
At: action at t
q* (a): true value of action a (expected reward) = E[Rt | At = a]

q* (a) unknown → We use Qt (a) as an estimation at time t

𝞹

, Action-value methods
Types of actions: Greedy approach: exploiting, so choose a with highest Qt(a)
Exploring actions

= number of times a has been selected until time t

If Nt (a) = 0 , then Qt (a) = c, some default time
If Nt (a) → ♾ , then Qt (a) → q* (a)

Selection:
1. Random selection: P[ At = a ] = 1/k
2. Greedy action selection method At = arg maxa {Qt (a)}
3. -greedy action selection: with prob. select randomly from all actions with equal probability,
otherwise greedy

Offline computing: all data already available : computationally inefficient
Qn = (R1 + … + Rn-1) / (n - 1)
Qn+1 = (R1 + … + Rn-1) / n

Online computing:

new estimate = old estimate + stepwise ( target - old estimate )

Non-stationary: rewards probabilities change over time
give more weight to recent reward than to long-past reward

Varying step-size n (a): convergence for n (a) = 1/2
& no convergence for n (a) = and varying Qn+1
Sample average: bias disappears when actions are selected at least once

Optimistic initial value for Q1(a) → forces to select all options at least once
→ Qt (a) to proper level

𝜺 𝜶 𝜺 𝜶 𝜶 𝜶

, Lecture 2

Observations multi-armed bandits

Optimistic initial value Q1(a):
• Qt+1 = Qn(a) + [Rn(a) - Qn(a)], Q1(a) = c, influence high for small
• Qn+1 = Qn(a) + 1/n [Rn(a) - Qn(a)], whatever Q1(a) = c → Q2(a) = R1(a)

Greedy (with average reward value):
• t > t0 → At = a0, action a0 as long as Qt(a0) > 0 and if q*(a0) > 0
• t→♾ → Qt(a0) → q*(a0), for the ‘absorbing’ action

-Greedy (with average reward value):
• t→♾ → Qt(a) → q*(a), for all actions
• t→♾ → P[At = a*] = ( 1 - ) + / k, with a* = arg maxa {q* (a)}: optimal

Greedy selection (optimal initial value, exp. regency weighted average):
• Higher average reward (high Rt: slow decrease Qt), ‘absorbing’ action often optimal action

Do better by: giving exploiting priority and when we explore:
• Avoid low reward actions (no random selections)
• Good choices for pi: select greedy with high prob. for p1
• Low reward actions will still be selected

Upper-confidence-bound action selection
Explore non-greedy actions with high potential & keep exploring, also in the long-run

Behaviour:
• Sqrt-term measure of uncertainty, like confidence interval
• Value of actions increases in time, even when not selected
• When selected the uncertainty term decreases
• Subtle favoring of less-frequent selected actions

Multi-bandit problem non-associative task: find or track the best action for a single situation (or
state), either stationary or non-stationary

Contextual Bandits
Associative task: find best action for multiple situations (or states), i.e. learn a policy
Associative search: trial-and-error learning and association of actions to situations

Full RL associative task: Actions affect next situation (or state)

𝜺

𝜶 𝜺 𝜺 𝜶

The benefits of buying summaries with Stuvia:

Guaranteed quality through customer reviews

Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.

Quick and easy check-out

You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.

Focus on what matters

Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!

Frequently asked questions

What do I get when I buy this document?

You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.

Satisfaction guarantee: how does it work?

Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller feanne1. Stuvia facilitates payment to the seller.

Will I be stuck with a subscription?

No, you only buy these notes for $12.37. You're not tied to anything after your purchase.

Can Stuvia be trusted?

4.6 stars on Google & Trustpilot (+1000 reviews)

77254 documents were sold in the last 30 days

Founded in 2010, the go-to place to buy study notes for 14 years now

Start selling

Popular Universities in the United States

Popular books

Find notes and summaries for these qualifications

Summary

Samenvatting Reinforcement Learning, ISBN: 9780262193986 Reinforcement Learning (6013B0359Y)

Document information

Subjects

Connected book

Written for

1 review

Seller

Reviews received

Content preview

The benefits of buying summaries with Stuvia:

Guaranteed quality through customer reviews

Quick and easy check-out

Focus on what matters

Frequently asked questions

What do I get when I buy this document?

Satisfaction guarantee: how does it work?

Who am I buying these notes from?

Will I be stuck with a subscription?

Can Stuvia be trusted?