AGI Fundamentals: Reinforcement Learning

In reinforcement learning, the components of the agent-environment interaction can all be described as functions, with some mapped onto others. Imagine an agent making a decision in a virtual environment. Such a decision is made by an algorithm of this form:

$$v_*(s) = \max_a \mathbb{E}\left[\, R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s,\ A_t = a \,\right]$$
Below are the steps to get to this form.



At each step t the agent:

  1. Receives observation $O_t$

  2. Executes action $A_t$

The environment:

  1. Receives action $A_t$

  2. Emits reward $R_{t+1}$ and observation $O_{t+1}$
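
To make the loop concrete, here is a minimal sketch of that interaction in Python. The `Environment` and `Agent` classes and their method names are placeholders of my own, not from any particular library.

```python
# A minimal sketch of the agent-environment loop (hypothetical classes).

class Environment:
    def step(self, action):
        """Receives action A_t, emits reward R_{t+1} and observation O_{t+1}."""
        reward, next_observation = 0.0, None   # stand-in values
        return reward, next_observation

class Agent:
    def act(self, observation):
        """Receives observation O_t and returns (executes) action A_t."""
        return 0                               # stand-in action

env, agent = Environment(), Agent()
observation = None                             # O_0
for t in range(100):
    action = agent.act(observation)            # agent side of the loop
    reward, observation = env.step(action)     # environment side of the loop
```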


$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_n$$


where $G_t$ stands for the total return, summing over all rewards $R_t$. Each reward at time $t$ is a scalar feedback signal. The agent's goal is to maximize cumulative reward.
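
As a quick sanity check, here is that sum in Python (the reward values are made up):

```python
# Undiscounted return: G_t = R_{t+1} + R_{t+2} + ... + R_n
rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards R_{t+1}, ..., R_n
G_t = sum(rewards)               # 4.0
```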



$$v(s) = \mathbb{E}\left[\, G_t \mid S_t = s \,\right]$$

$$v(s) = \mathbb{E}\left[\, R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_n \mid S_t = s \,\right]$$

$$G_t = R_{t+1} + G_{t+1}$$

$$v(s) = \mathbb{E}\left[\, R_{t+1} + v(S_{t+1}) \mid S_t = s \,\right]$$


This part was a little difficult for me to understand. The expected cumulative reward, conditioned on state $s$, is given by the function $v(s)$; that's the first two equations. We then write $G_t$ recursively, so that the base case is $R_{t+1}$ and the recursive case is $v(S_{t+1})$.
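
To see the recursion at work, here is a small sketch (again with made-up rewards) that computes every return of an episode in a single backward pass:

```python
# G_t = R_{t+1} + G_{t+1}: compute all returns of one episode backwards.
rewards = [1.0, 0.0, 2.0, 1.0]     # hypothetical R_1, ..., R_4 of one episode
returns = [0.0] * len(rewards)
G = 0.0                            # the return after the final step is zero
for t in reversed(range(len(rewards))):
    G = rewards[t] + G             # base case: last reward; recursive case: add G_{t+1}
    returns[t] = G
# returns == [4.0, 3.0, 3.0, 1.0]
```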


A mapping from states to actions is called a policy. This is how a certain action $A_t$ gets selected, for example to maximize value with long-term consequences in mind. Now that we know about the policy, we also know that it, or the action it selects, can appear as a condition.


$$q(s, a) = \mathbb{E}\left[\, G_t \mid S_t = s,\ A_t = a \,\right]$$

$$q(s, a) = \mathbb{E}\left[\, R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_n \mid S_t = s,\ A_t = a \,\right]$$
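
With a small, finite set of states and actions, $q(s, a)$ can be stored as a simple table. Here is a sketch (the states, actions, and values are invented) together with the greedy rule of picking the action with the highest estimated value:

```python
# Hypothetical tabular action-values q(s, a).
q = {
    ("s0", "left"): 1.0,
    ("s0", "right"): 2.5,
}
ACTIONS = ["left", "right"]

def greedy_action(state):
    """Pick the action with the highest estimated q(s, a) in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

print(greedy_action("s0"))   # "right"
```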


The agent's components are: 

  1. Agent state ($S_t$) 

  2. Policy 

  3. Value functions 

  4. Model 


1. Agent state 


The history is the full sequence of observations ($O_t$), actions ($A_t$), and rewards ($R_t$):

$$H_t = O_0, A_0, R_1, O_1, \dots, O_{t-1}, A_{t-1}, R_t, O_t$$


In a way, the history tells us the fully observable environment state. 
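
As a rough sketch, the history can be kept as a flat list of (observation, action, reward) tuples; in a fully observable problem the latest observation can serve directly as the agent state. The helper names here are my own:

```python
# Hypothetical storage of the history H_t as a list of (O, A, R) tuples.
history = []

def record(observation, action, reward):
    history.append((observation, action, reward))

def current_state():
    """In a fully observable setting, the latest observation can serve as the state."""
    return history[-1][0] if history else None
```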


Markov decision processes (MDPs) are built on the Markov property: a process is Markovian if the probability of a reward and the subsequent state doesn't change when we add more of the history $H_t$:

$$p(r, s \mid S_t, A_t) = p(r, s \mid H_t, A_t)$$


This means that the current state contains everything the agent needs; adding more history to the agent's view won't help its decision-making. 


I gathered that it is often too costly to look at the complete history and construct a fully Markovian agent state from it. That's why we work with a partially observable view that, in a sense, "bounds" the observable environment. This partial state should still be a good representation of the environment. 


2. Policy 

A policy defines the agent's behavior: it is a map from agent state to action. Policies are usually denoted by $\pi$. 


$$A = \pi(S)$$

$$\pi(A \mid S) = p(A \mid S)$$


where the first line is a deterministic policy and the second is a stochastic (random) policy. 
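
A tiny sketch of both kinds, on a made-up two-action problem: the deterministic policy always returns the same action for a given state, while the stochastic one samples from a state-dependent distribution.

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """A = pi(S): a fixed action for each state."""
    return "right" if state == "s0" else "left"

def stochastic_policy(state):
    """pi(A | S) = p(A | S): sample an action from a distribution over actions."""
    probs = {"left": 0.3, "right": 0.7} if state == "s0" else {"left": 0.5, "right": 0.5}
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
```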


3. Value function 

The actual value function is defined as the expected return. The value function explicitly depends on the policy $\pi$. We also introduce a discount factor $\gamma$. 

$$v_\pi(s) = \mathbb{E}\left[\, G_t \mid S_t = s,\ \pi \,\right]$$

$$v_\pi(s) = \mathbb{E}\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s,\ \pi \,\right]$$


where $\gamma \in [0, 1]$.


From what I understood, this means we now condition on both the policy and the state at time $t$. With $\gamma$ imposed, the agent trades off immediate reward against long-term reward accordingly. A small $\gamma$ makes the value function myopic, i.e. it favors immediate rewards. 
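
A short sketch with made-up rewards shows the effect of $\gamma$: when the only reward arrives a few steps in the future, a small $\gamma$ barely values it, while a $\gamma$ near 1 counts it almost fully.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 10.0]                 # hypothetical: the reward arrives three steps ahead
print(discounted_return(rewards, 0.1))     # ~0.1  -> myopic
print(discounted_return(rewards, 0.9))     # ~8.1  -> far-sighted
```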


The return has a recursive form: 


$$G_t = R_{t+1} + \gamma G_{t+1}$$

$$v_\pi(s) = \mathbb{E}\left[\, G_t \mid S_t = s,\ A_t \sim \pi(s) \,\right]$$

$$v_\pi(s) = \mathbb{E}\left[\, R_{t+1} + \gamma G_{t+1} \mid S_t = s,\ A_t \sim \pi(s) \,\right]$$

and applying the recursion step:

$$v_\pi(s) = \mathbb{E}\left[\, R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s,\ A_t \sim \pi(s) \,\right]$$

(Note to Lisa: $a \sim \pi(s)$ means $a$ is chosen by policy $\pi$ in state $s$.)
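
This recursive form is exactly what iterative policy evaluation exploits: start from an arbitrary guess for $v_\pi$ and keep replacing each $v_\pi(s)$ with the reward plus the discounted value of the next state. A sketch on a made-up two-state chain (the transitions and rewards are invented for illustration, with the fixed policy's behavior baked into them):

```python
# Hypothetical 2-state chain: under the fixed policy, each state has one
# (next_state, reward) outcome; s1 is absorbing with zero reward.
transitions = {"s0": ("s1", 1.0), "s1": ("s1", 0.0)}
gamma = 0.9

v = {s: 0.0 for s in transitions}          # arbitrary initial guess
for _ in range(100):                       # sweep until approximately converged
    for s, (s_next, r) in transitions.items():
        v[s] = r + gamma * v[s_next]       # v_pi(s) = E[R_{t+1} + gamma * v_pi(S_{t+1})]

print(v)   # roughly {'s0': 1.0, 's1': 0.0}
```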


The Bellman equation can be written as the optimal value equation: the optimal value of state $s$ equals the maximum over actions of the expected reward plus the discounted next value, conditioned on that state and action at time $t$. Better formulated here: 


$$v_*(s) = \max_a \mathbb{E}\left[\, R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s,\ A_t = a \,\right]$$
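
The optimal value equation is what value iteration turns into an algorithm: repeatedly apply the max over actions as an update rule. A sketch on a made-up deterministic two-state problem (the dynamics and rewards are invented for illustration):

```python
# Hypothetical deterministic MDP: dynamics[state][action] = (next_state, reward).
dynamics = {
    "s0": {"stay": ("s0", 0.0), "go": ("s1", 1.0)},
    "s1": {"stay": ("s1", 2.0), "go": ("s0", 0.0)},
}
gamma = 0.9

v = {s: 0.0 for s in dynamics}
for _ in range(200):
    for s, actions in dynamics.items():
        # v_*(s) = max_a E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a]
        v[s] = max(r + gamma * v[s_next] for (s_next, r) in actions.values())

print(v)   # roughly {'s0': 19.0, 's1': 20.0}
```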


Lastly, notes on the model.


4. Model 

The model predicts what the environment will do next. Two terms that recur in reinforcement learning papers are: 

  • Prediction: to evaluate the future for a given policy $\pi$ 

  • Control: to optimise the future, i.e. to find the best policy $\pi$
