AGI Fundamentals: Reinforcement Learning
In reinforcement learning, the components of the agent-environment interaction can all be described as functions, some mapping onto others. Imagine an agent making a decision in a virtual environment. Such a decision is made by an algorithm built around the following interaction loop (a minimal code sketch of it follows the list):
At each step t the agent:
Receives observation Ot (and reward Rt)
Executes action At
The environment:
Receives action At
Emits observation Ot+1 (and reward Rt+1)
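As a rough sketch of this loop (the Env and Agent classes below are made-up placeholders, not any particular library's API):

```python
# A minimal sketch of the agent-environment loop described above.
# Env and Agent are invented placeholders, not a specific library's API.
import random

class Env:
    def step(self, action):
        # Receives the action, emits the next observation and a reward.
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    def act(self, observation, reward):
        # Receives the observation (and reward), executes an action.
        return random.choice([0, 1])

env, agent = Env(), Agent()
observation, reward = 0.0, 0.0
for t in range(10):
    action = agent.act(observation, reward)   # agent: observe, then act
    observation, reward = env.step(action)    # environment: receive action, emit O, R
```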
The total return Gt sums the rewards from time t onward: Gt = Rt+1 + Rt+2 + Rt+3 + ... Each reward Rt is a scalar feedback signal, and the agent's goal is to maximize cumulative reward.
This part was a little difficult for me to understand. The expected cumulative reward conditioned on state s is given by the value function v(s) = E[Gt | St = s]. We then define Gt recursively, such that the base case is the immediate reward Rt+1 and the recursive case is the value of the next state, v(St+1).
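Written out (undiscounted for now, since the discount factor only comes in later), this is:

```latex
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots, \qquad
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]
     = \mathbb{E}\left[ R_{t+1} + v(S_{t+1}) \mid S_t = s \right]
```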
A mapping from states to actions is called a policy. It is how a particular action At gets selected, for example to maximize value, which has long-term consequences. Once we have a policy, we can also condition on it: the value of a state can be defined with respect to the policy being followed.
The agent's components are:
Agent state (St)
Policy
Value functions
Model
1. Agent state
The history Ht is the full sequence of observations (Ot), actions (At), and rewards (Rt).
In a way, the history tells us the environment state, at least when the environment is fully observable.
The Markov property, which Markov decision processes (MDPs) are built on, states that a process is Markovian if the probability of the next reward and state does not change when we add more history Ht beyond the current state.
This means that the current state contains everything that matters: adding more history to what the agent already sees won't help its decision-making.
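In symbols (conditioning on the action as well, as is standard), the Markov property reads:

```latex
p(R_{t+1}, S_{t+1} \mid S_t, A_t) \;=\; p(R_{t+1}, S_{t+1} \mid H_t, A_t)
```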
I gathered that it is often too costly to look at the complete history and construct a fully Markovian agent state from it. In practice the environment is typically only partially observable anyway, so the agent builds a more compact state from what it observes, which "bounds" how much it has to remember. This partial state should still be a good representation of the environment (a small sketch of such a state update follows below).
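One common way to avoid storing the whole history is to update a compact agent state incrementally from the previous state, the last action, and the new observation. The concatenate-and-truncate update below is a made-up placeholder just to show the shape of the idea; in practice it could be, e.g., a recurrent network or handcrafted features:

```python
# Sketch of an incremental agent-state update: S_t = u(S_{t-1}, A_{t-1}, O_t).
# The simple windowed update here is invented purely for illustration.
def update_state(prev_state, prev_action, observation, max_len=4):
    new_state = prev_state + [(prev_action, observation)]
    return new_state[-max_len:]     # keep only a bounded window, not the full history

state = []                          # initial (empty) agent state
for action, obs in [(0, 0.2), (1, 0.5), (1, 0.9), (0, 0.1), (1, 0.7)]:
    state = update_state(state, action, obs)
print(state)                        # the last 4 (action, observation) pairs
```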
2. Policy
A policy defines the agent’s behavior, and it is a map from agent state to action. Policies are usually denoted with pi.
A deterministic policy maps a state directly to an action, A = pi(S), while a stochastic (random) policy gives a probability for each action in a state, pi(A|S) = p(A|S).
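A tiny sketch of both kinds of policy; the states, actions, and probabilities here are invented for illustration:

```python
import random

# Deterministic policy: a fixed map from state to action, A = pi(S).
deterministic_pi = {"low_battery": "recharge", "ok_battery": "explore"}

# Stochastic policy: a probability for each action in a state, pi(A | S).
stochastic_pi = {
    "ok_battery": {"explore": 0.8, "recharge": 0.2},
}

def sample_action(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_pi["low_battery"])             # always 'recharge'
print(sample_action(stochastic_pi, "ok_battery"))  # 'explore' about 80% of the time
```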
3. Value function
The actual value function is defined as the expected return, and it explicitly depends on the policy pi. We also introduce a discount factor gamma between 0 and 1:

v_pi(s) = E[Gt | St = s, pi], where now Gt = Rt+1 + gamma Rt+2 + gamma^2 Rt+3 + ...

From what I understood, this means the expectation is now conditioned on both the state at time t and the policy. The discount gamma trades off immediate reward against long-term reward: a small gamma makes the value function myopic, i.e. it mostly seeks immediate rewards.
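To make the effect of gamma concrete, here is a tiny sketch with an invented reward sequence, computing the discounted return for a small and a large discount factor:

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# The reward sequence below is invented purely for illustration.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):     # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [0, 0, 0, 10]             # a single large reward far in the future

print(discounted_return(rewards, gamma=0.1))   # ~0.01: myopic, barely values the future
print(discounted_return(rewards, gamma=0.9))   # ~7.29: far-sighted, values the future
```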
The return has a recursive form, Gt = Rt+1 + gamma Gt+1, which gives the value function a recursive form too: v_pi(s) = E[Rt+1 + gamma v_pi(St+1) | St = s, At ~ pi(St)].
The Bellman equation can also be written as an optimality equation: the optimal value of state s equals the maximization over actions of the expected reward plus the discounted next optimal value, conditioned on that state and action at time t. Better formulated here:

v*(s) = max_a E[Rt+1 + gamma v*(St+1) | St = s, At = a]

I can't seem to find the sub-star button, so I actually wrote the star in there. That's not how it should look, but I can't complain.
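This optimality equation is exactly what value iteration turns into an update rule. Below is a minimal sketch on a made-up three-state toy problem (the transition table P and its rewards are invented purely for illustration):

```python
# Value iteration sketch of the Bellman optimality equation on an invented toy MDP.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
gamma = 0.9

v = {s: 0.0 for s in P}                      # initialise v*(s) estimates to zero
for _ in range(100):                         # sweep until (approximately) converged
    v = {
        s: max(                              # max over actions ...
            sum(p * (r + gamma * v[s2])      # ... of expected reward + discounted next value
                for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }

print(v)   # approximately {0: 2.8, 1: 2.0, 2: 0.0}
```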
Lastly, some notes on the model.
4. Model
A model predicts what the environment will do next, for example the next state and reward given the current state and action (a tiny tabular model sketch follows below). Two terms that recur in reinforcement learning papers are:
Prediction: to evaluate the future for a given policy pi
Control: to optimise the future, i.e. to find the best policy pi
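As a concrete illustration, a one-step tabular model can simply record observed transitions and predict a reward and next state for each state-action pair it has seen (the class and its names below are made up for this sketch):

```python
from collections import defaultdict

# A tiny one-step tabular model sketch: store observed transitions and predict
# the average reward and a next state for each (state, action) pair.
class TabularModel:
    def __init__(self):
        self.transitions = defaultdict(list)   # (s, a) -> list of (r, s') seen so far

    def update(self, s, a, r, s_next):
        self.transitions[(s, a)].append((r, s_next))

    def predict(self, s, a):
        seen = self.transitions[(s, a)]
        if not seen:
            return None                        # nothing known about this pair yet
        avg_r = sum(r for r, _ in seen) / len(seen)
        _, s_next = seen[-1]                   # naive: predict the most recent next state
        return avg_r, s_next

model = TabularModel()
model.update(s=0, a=1, r=1.0, s_next=2)
print(model.predict(0, 1))                     # (1.0, 2)
```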