Searching for Policies in Python: An intro to Optimization

Anni Sapountzi
10 min read · Feb 10, 2018


Dynamic Programming Principles: 1. break the problem down into smaller parts, 2. store (remember/memoize) the sub-problems already solved. In this way, you reduce the time complexity of the algorithm. DP is efficient when there are recurring sub-problems to compute, but inefficient when the state space is large. Photo by quotefancy.
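As a minimal illustration of those two principles (my own toy example, not from the original post), here is the classic Fibonacci computation with and without memoization:

import functools

# Naive recursion re-solves the same sub-problems over and over: O(2^n) time.
def fib_naive(n):
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

# Memoization stores each sub-problem the first time it is solved: O(n) time.
@functools.lru_cache(maxsize=None)
def fib_memo(n):
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(80))  # instant; fib_naive(80) would take far too long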

The goal of an optimization problem is to find the best among all feasible solutions, i.e., the optimum of the objective function.

Optimization Techniques:

DP, LP, and NLP are called exact methods. The RL framework falls under the non-conventional techniques. Meta-heuristic search, sometimes also called Computational Intelligence, includes algorithms such as: 1. Genetic Algorithms (GA), 2. Particle Swarm Optimization, 3. Artificial Bee Colony, 4. Ant Colony Optimization, 5. Harmony Search, 6. Simulated Annealing (SA), and 7. Tabu Search (TS). They are problem-independent (used as black boxes) and not so greedy, so they do not get stuck at a local optimum, but they also do not guarantee finding an optimal solution. As such, they can be seen as the generalized version of heuristic search. Heuristics are problem-dependent, offer an approximate solution, are often too greedy, and usually get trapped in a local optimum. Warning: do not confuse an approximate solution with an approximation algorithm, as the latter comes with a bound on how close its solution is to the optimal one. This beautiful categorisation picture is taken from Mukherjee & Ray, 2006.

The most common objective functions for the different types of learning are:

i. the log loss for classification,

ii. mean squared error for regression, and

iii. the reward/value function for Reinforcement Learning (RL)*.

*RL is a goal-directed method of learning, and the goal is formalised by the reward. The difference with other ML methods is that in RL we learn from interacting with the environment and we don’t have a supervisor (the oracle that reveals what is true). What we have instead is a reward, which is always a single number.
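To make the first two objectives concrete, a quick numpy sketch (toy numbers of my own, not from the original post):

import numpy as np

y_true = np.array([1, 0, 1])            # toy labels / regression targets
p_pred = np.array([0.9, 0.2, 0.6])      # predicted class probabilities
y_pred = np.array([1.1, -0.2, 0.7])     # predicted regression values

# i. log loss (binary cross-entropy): penalises confident wrong probabilities
log_loss = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# ii. mean squared error: average squared distance from the target
mse = np.mean((y_true - y_pred) ** 2)

print(log_loss, mse)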

The core of the problem now becomes finding the model’s parameter values that result in the lowest loss (i.e., regret, cost) or the highest reward.

The loss in classification and error in regression are relatively good measures to estimate the parameter values of the function.

On the other hand, the reward in RL* is provided by the environment and is typically not a very good proxy: you cannot draw the surface of the function you are trying to approximate given only a sparse reward or a random initial policy.

“The problem with RL methods is that the reward signals tend to be ‘wimpy’; in some environments, agents become stuck looking for patterns in random data.” (Geoffrey Hinton)

If you need a brief intro of learning, please have a look at Learning = Representation + Evaluation + Optimization.

(Exact) Dynamic Programming

DP: a collection of algorithms to compute optimal policies given perfect knowledge of the environment. The environment is typically modelled as a finite Markov Decision Process (MDP).

📆 Dynamic = occurs in successive stages (i.e., sequential), changes over time (temporal)

⏳Programming = mathematical programming, optimization problem

🤔Uncertainty = outcomes partially random & partially under the control (i.e., choice) of a decision maker (i.e., agent).

Principle of optimality by Bellman:

The Bellman equation expresses the relationship between the value of a state and the value of its successor states.

Whatever the initial state and the decision taken in it, all the subsequent decisions must be optimal (optimally rewarded); we achieve this by recursively choosing the maximally rewarded action at each step.

That is to say, to solve a problem from some starting period t to some ending period T, one implicitly has to solve the subproblems starting from later dates s, where t<s<T. This is an example of optimal substructure and shows how the value of the problem starting from t is related to the value of the problem starting from s. The Bellman optimality principle can be considered the property of an MDP that allows it to be solved by DP.

The problem structure must satisfy the following two properties for DP and RL to be good solutions to the problem at hand: 1. the Bellman equation holds for each state in the value function, and 2. there is a recursive relationship between the values (V(s_t) = R_{t+1} + γ·V(s_{t+1}), in expectation).
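Written out in full, in the notation used later in this post (transition probabilities P(s',r|s,a), policy π(a|s), and discount factor γ, all defined in the MDP section below), the Bellman expectation equation for a policy π reads:

v_π(s) = Σ_a π(a|s) Σ_{s',r} P(s',r|s,a) [ r + γ·v_π(s') ], for every state s.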

The environment MDP

MDP: the standard framework for modelling sequential decision making or planning under uncertainty.

🧩 MDP components:

  • state (S, sometimes denoted X): the basis for an action
  • transition probabilities between states (P): how the state changes after an action
  • action (A): the choice made by the agent
  • reward (R): evaluates the state or the action you took
  • discount factor (γ): the influence of time passing on rewards

Finite MDP = the A and S spaces are finite and its dynamics are given by a set of probabilities P(s’,r|s,a). (Approximate DP is used for continuous spaces.)
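As a sketch of what a finite MDP looks like in code (a made-up two-state example of my own, not the gridworld used later), the dynamics P(s’,r|s,a) can be stored as a nested dictionary:

# P[s][a] = list of (probability, next_state, reward) triples, i.e. P(s', r | s, a)
P = {
    0: {'stay': [(1.0, 0, 0.0)],
        'go':   [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {'stay': [(1.0, 1, 0.0)],
        'go':   [(1.0, 0, 5.0)]},
}
states  = list(P.keys())        # S
actions = ['stay', 'go']        # A
gamma   = 0.9                   # discount factor

# sanity check: the transition probabilities for every (s, a) pair sum to 1
assert all(abs(sum(p for p, _, _ in P[s][a]) - 1.0) < 1e-9
           for s in states for a in actions)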

The key idea of DP & RL is the use of value functions to organize the search for good policies (behaviors).

Solutions to MDP (Bellman equation): Dynamic Programming, Reinforcement Learning.

Value function = ‘how good*’ it is to be in a state, v(s), or to take an action from that state, q(s,a); a prediction of the future reward obtainable from this environment [*‘good’ is defined by the future rewards].

Policy π(a|s) = how the agent selects actions as a function of states; a mapping from states to actions. The optimal policy maximizes the value function, i.e., the expected reward.

Reward r: the immediate scalar feedback from the environment; its long-term expected accumulation under a policy, from every possible state, is the return that the value function predicts.
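For a fixed policy on a small finite MDP, the Bellman equation is just a linear system, so the value function can be computed directly: v = (I − γ·P_π)⁻¹ · r_π. A quick sketch with made-up numbers (not code from the post):

import numpy as np

# Made-up 3-state chain under some fixed policy π:
# P_pi[s, s'] = probability of moving from s to s' when following π
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi  = np.array([1.0, 2.0, 0.0])   # expected immediate reward per state under π
gamma = 0.9

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)   # 'how good' it is to be in each state under π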

The fundamental difference between DP and RL is that the first assumes full knowledge of the environment (all MDP components are available), which rarely corresponds to real-life situations. RL, on the other hand, learns a model of the environment (i.e., an MDP where typically P or R isn’t known). Based on this distinction, we sometimes refer to them as planning and learning problems, respectively. Planning in an MDP means finding the optimal policy for the given, known environment.

Issues related to RL & DP are:

1. time complexity O(N): the number of iterations required to find optimal solutions,

2. space complexity (memory): the dimensionality of the features space (e.g., state),

3. exploitation (of the known best solution) vs exploration (of unknown solutions) trade off in a certain time horizon.

Time complexity for DP & exhaustive search (Brute Force) solution in solving the Traveling Salesman Problem, by xkcd.

The difference between DP and the “divide and conquer” strategy is that the latter applies to problems that can be solved by combining optimal solutions to non-overlapping sub-problems.

DP is used for problems where a linear order, which cannot be rearranged, exists between the variables (strings in a sentence, matrices in a chain, the left-to-right order of leaves in a search tree), such as:

  • Planning in MDP (solution: DP, exhaustive search)
  • String algorithms (e.g. sequence alignment)
  • Graph algorithms (e.g. shortest path algorithms)
  • Graphical models (e.g. Viterbi algorithm)
  • Scheduling algorithms
Exhaustive search: if the 1-step dynamics P(s’,r|s,a) are known, iteratively examine each possible policy until the last time step T, and then opt for the optimal path of actions. This isn’t time-efficient, though: it needs O(N^T). DP is hence exponentially faster than any direct search algorithm (as shown below). Image from https://www.slideshare.net/donghyunkwak90/reinforcement-learning-73854927

Planning methods

Steps: The planning problem is composed of two steps: policy evaluation & policy improvement.

Policy evaluation computes the value function of a given policy for all states in the state space; policy improvement then picks the value-maximising (argmax) action in each state.

  1. ⛅️ Planning is firstly a prediction problem :

→ Input: Markov Reward Process (MDP and the policy)

← Output: Value function v.

⚙️ Process: Policy evaluation

We computed the value function for a deterministic (fixed) policy because we want to find out whether better policies (better actions for each state) exist. That process is known as policy improvement: would taking another action in a given state result in a new and better policy?

2. ✨ Then it is an improvement problem:

→ Input: MDP, Value function

← Output: Better Policy

⚙️ Process: Policy improvement

🎮 1.+2. = Control.

Finally, the evaluation and improvement problems are combined in the control problem. Now, we have to decide which planning method to use in order to calculate the optimal policy. The two classical DP methods for exact planning in an MDP are policy iteration and value iteration.

Policy Evaluation/Improvement loop: the convergent policy is guaranteed to be optimal if the Q or V value functions are exact. Both methods are proven to converge. The basic idea behind PI & VI stays the same: alternate between improving the policy (which requires knowing its VF) and computing the VF of the new, improved policy.

1. Policy Iteration (PI)

Evaluate policies by constructing their Value Functions (NOT the optimal VF) and use these VFs to find new, improved policies.

It is composed of three steps:

  1. Initialization of a policy and the VF (arbitrarily)
  2. Policy Evaluation (prediction): Compute the state VF for some or all states given a fixed policy.
  3. Policy Improvement: Act greedily (i.e. pick the best action at each state) with respect to the value function of the current policy. Then we are guaranteed to improve the policy (or leave it unchanged if it is already optimal).

Repeat steps 2 and 3 until one of the following conditions is met:

(i) the policy remains unchanged (it is already optimal),

(ii) a time limit has been reached, or

(iii) the change to the value function is below a certain threshold Δ (denoted as η in the next fig.)

2. Value Iteration (VI)

Search for the optimal value function, which is then used to compute (only once) an optimal policy.

It is composed of two steps:

  1. Initialization of a VF (arbitrarily)
  2. Find the optimal VF: each sweep performs a single step of policy evaluation and immediately does one policy improvement.
Difference between PI (b) & VI (d): contrary to VI, the full VF is computed in PI. VI integrates evaluation and improvement in one update rule. If you have a large state space (e.g., 10⁶ states), it is better to use VI, so that enumerating all the states costs less computation time (compared to PI).
Policy Iteration & Value Iteration pseudocode, highlighted with the differences between them. In Value Iteration we don’t make multiple evaluation steps, because once the VF is optimal, the policy extracted from it converges (meaning it is also optimal). In Policy Iteration, by contrast, we do multiple steps in each iteration, where every step gives us a different V(s) and the corresponding policy.

Once we acquire the optimal policy, we have solved our control problem since we know which actions the agent should pick.

Now that we have a grasp of planning methods, we can code them in python.

The Coding Part (Python)

First of all, we need to have access to a perfect MDP environment. We will use the standardised environment used in Reinforcement Learning: An Introduction Chapter 4. You can find the code here.

The nonterminal states are S = {1, 2, . . . , 14}. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The reward is −1 on all transitions until the terminal state (grey box) is reached.
The input (args) and output (return statement) are depicted as comments. Set the 4 ingredients of the MDP environment and the 7 local variables.
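The original gists are not embedded here, so what follows is a minimal reconstruction of my own, not the author’s exact code. First, the 4×4 gridworld environment described above, with the dynamics stored as P[s][a] = [(probability, next_state, reward, done)]:

import numpy as np

# 4x4 gridworld from Reinforcement Learning: An Introduction, Chapter 4
n_rows, n_cols = 4, 4
n_states  = n_rows * n_cols            # states 0..15; 0 and 15 are terminal
n_actions = 4                          # 0=up, 1=right, 2=down, 3=left
terminal_states = {0, n_states - 1}

def step(s, a):
    """Deterministic transition; moving off the grid leaves the state unchanged."""
    if s in terminal_states:
        return s, 0.0, True
    row, col = divmod(s, n_cols)
    if   a == 0: row = max(row - 1, 0)            # up
    elif a == 1: col = min(col + 1, n_cols - 1)   # right
    elif a == 2: row = min(row + 1, n_rows - 1)   # down
    elif a == 3: col = max(col - 1, 0)            # left
    s_next = row * n_cols + col
    return s_next, -1.0, s_next in terminal_states

# P[s][a] = list of (probability, next_state, reward, done) tuples, all deterministic here
P = {s: {a: [(1.0,) + step(s, a)] for a in range(n_actions)}
     for s in range(n_states)}
gamma = 1.0   # undiscounted episodic task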

Policy Evaluation

Compute the Value Function for the current policy to measure how good that policy is
Inputs: MDP and threshold theta
Output: the value function for each state in the grid environment
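A sketch of iterative policy evaluation, building on the gridworld P and gamma defined above (an assumed interface, not the author’s gist):

def policy_evaluation(policy, P, gamma=1.0, theta=1e-8):
    """Sweep over all states until the largest value change is below theta.
    policy[s][a] = probability of taking action a in state s."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(policy[s][a] * prob * (reward + gamma * V[s_next])
                        for a in P[s]
                        for prob, s_next, reward, done in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Evaluate the equiprobable random policy on the gridworld defined above
random_policy = np.ones((n_states, n_actions)) / n_actions
print(policy_evaluation(random_policy, P).reshape(n_rows, n_cols))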

Policy Iteration

Improve the policy; when it is optimal, return it together with its value function
Inputs: MDP and threshold theta
Outputs: the optimal policy and its value function for the grid environment
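A sketch of policy iteration, reusing policy_evaluation and the gridworld above (again my own reconstruction, not the original gist):

def q_from_v(V, s, P, gamma=1.0):
    """One-step look-ahead: action values q(s, a) from the current state values."""
    return np.array([sum(prob * (reward + gamma * V[s_next])
                         for prob, s_next, reward, done in P[s][a])
                     for a in range(n_actions)])

def policy_iteration(P, gamma=1.0, theta=1e-8):
    """Alternate full policy evaluation and greedy improvement until the policy is stable."""
    policy = np.ones((n_states, n_actions)) / n_actions   # start from the random policy
    while True:
        V = policy_evaluation(policy, P, gamma, theta)    # step 2: evaluation
        new_policy = np.zeros_like(policy)
        for s in P:
            new_policy[s, np.argmax(q_from_v(V, s, P, gamma))] = 1.0   # step 3: act greedily
        if np.array_equal(new_policy, policy):            # unchanged => already optimal
            return policy, V
        policy = new_policy

optimal_policy, V_star = policy_iteration(P)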

Value Iteration

Find the optimal value function, then do one policy extraction
Inputs: MDP and theta threshold
Outputs: Optimal Policy & Value Function
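A sketch of value iteration using the same q_from_v helper and gridworld as above (my reconstruction):

def value_iteration(P, gamma=1.0, theta=1e-8):
    """Repeatedly apply the Bellman optimality backup, then extract the greedy policy once."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            v_new = max(q_from_v(V, s, P, gamma))   # evaluation + improvement in one update
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    policy = np.zeros((n_states, n_actions))
    for s in P:                                     # one-off greedy policy extraction
        policy[s, np.argmax(q_from_v(V, s, P, gamma))] = 1.0
    return policy, V

optimal_policy, V_star = value_iteration(P)
print(V_star.reshape(n_rows, n_cols))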

Backups: Width and Depth of Policy Updates

Backup = policy update

Policy Improvement in Value Iteration and Policy Iteration is done using a “full backup” which is nothing more than the Bellman equations turned into policy updates. At each state, we can look ahead one step at each possible action and next state. We can do that only because we have a perfect model of the environment. The alternative is to use sample backups.

Space and time complexity grow in parallel with the depth (bootstrapping) and width (all or sample states) of a policy update.

Shallow backups/1-step ahead: the updated estimates are based on the previous estimates.

Deep backups: running all the steps to terminal states.

DP and exhaustive search are two planning methods. They use full backups to update the policy — width dimension. These methods take the expectation values of updates over all states (based on a distribution of possible trajectories). Expected updates require a distribution model.

1-step TD learning and pure Monte Carlo approaches use sample backups (sample trajectories of s-a-r) from actual or simulated interaction with the environment. These are learning methods as they don’t have access to the full MDP in order to compute the value functions.

How do we sample trajectories? We execute Bellman backups only on states that are visited under good policies.

In this way, we can solve the scaling problem of looping over all possible states. Sample updates require only a sample model (not all possible transitions required by DP), or can be done from actual experience with no model at all (this could be considered as another dimension of variation).
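For contrast with the full backups of DP, here is a minimal sketch of a sample backup: tabular 1-step TD(0), reusing the gridworld step() and random_policy from the coding section (my own illustration, not from the post):

def td0_evaluation(policy, n_episodes=5000, alpha=0.1, gamma=1.0):
    """Estimate V(s) for a policy from sampled transitions instead of full sweeps."""
    V = np.zeros(n_states)
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s = int(rng.integers(1, n_states - 1))        # start in a random nonterminal state
        done = False
        while not done:
            a = rng.choice(n_actions, p=policy[s])    # sample an action from the policy
            s_next, r, done = step(s, a)              # sample ONE next state, not all of them
            V[s] += alpha * (r + gamma * V[s_next] - V[s])   # bootstrapped TD(0) update
            s = s_next
    return V

print(td0_evaluation(random_policy).round(1).reshape(n_rows, n_cols))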

Two dimensions of RL algorithms, based on the backups used to learn or construct a policy. The horizontal axis denotes the depth and the vertical axis the width (length) of the update. Figure taken from the survey on deep reinforcement learning and based on Reinforcement Learning: An Introduction. There is always a trade-off between full and sample backups w.r.t. computational efficiency. In full backups, in one realisation (i.e., episode), all the possible next outcome values are calculated. Sample backups don’t calculate all the possible outcomes; they learn only from the next value actually experienced in the environment. However, they need many realisations. Backward recursion is done in the former case and forward recursion in the latter. Sample backups are preferred when there is a large number of states.

What is Bootstrapping? Short answer: learn a prediction from a later learnt prediction (using estimates to update estimates). The degree of bootstrapping is another dimension of variation in backups.

Approximate (Neuro) Dynamic Programming & Deep Reinforcement Learning

In practice, the applicability of DP and RL is limited by the enormous size of the underlying state spaces. This is known as Bellman’s “curse of dimensionality” or “expanding grid”.

Curse of dimensionality: exponential growth of states with the expanding of the time horizon (number of steps the agent makes).

Manipulating the value of every state in a tabular representation is not feasible in large or continuous domains. Consider adding one state to the transition table of the state space: you add one row and one column, i.e., one cell to every existing column and row.

A solution for large MDPs is to use function approximators such as neural networks, decision trees, etc. This idea is termed neuro-dynamic programming, approximate dynamic programming, or deep reinforcement learning. However, in the case of approximate dynamic programming any function can be used (e.g., polynomial regression) rather than only machine learning models (for example, if you assume that the value function is convex, you can fit a polynomial regression model over the states and derive the actions from it).

Key Idea: learn (approximate) the state values, action values, or the transition dynamics as a linear (or nonlinear) combination of the input features.

In value function approximation, you use simulation to tune the parameters. That is, you compute an approximate solution to Bellman’s equation and then construct near-optimal policies from it.
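A hedged sketch of that idea: linear features plus a semi-gradient TD(0) parameter update (the one-hot feature map is my own toy choice, not something prescribed by the post):

import numpy as np

def features(s, n_states=16):
    """Toy feature map phi(s): a one-hot vector here, but any features could be used."""
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

w = np.zeros(16)          # the value function is approximated as v_hat(s) = w . phi(s)
alpha, gamma = 0.1, 1.0

def td_update(w, s, r, s_next, done):
    """Semi-gradient TD(0): tune the parameters w from one sampled transition."""
    v_hat      = w @ features(s)
    v_hat_next = 0.0 if done else w @ features(s_next)
    td_error   = r + gamma * v_hat_next - v_hat
    return w + alpha * td_error * features(s)   # gradient of v_hat w.r.t. w is phi(s)

# example: one sampled transition (s=3, r=-1, s'=2, episode not finished)
w = td_update(w, 3, -1.0, 2, False)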

The alternative is to use policy optimization, also called optimization in policy space. That is, tuning the policy parameters in a direction of improvement. The policy optimization approach includes i. meta-heuristics, e.g., evolutionary algorithms, and ii. policy gradients.

Policy Optimization: learning policies that make the good actions more likely (left), versus Dynamic Programming: learning value functions and indirectly learning policies (right).
The differences between Dynamic Programming & Policy Optimization

Policy gradient methods rely on optimising parametrised policies with respect to the reward by gradient ascent (equivalently, gradient descent on the negative reward). The policies are indexed by a parameter vector of numbers, analogous to classification or regression with the states as input and the actions as output. You can use exactly the same approximation functions as in supervised learning: i. a classification network (discrete action space) outputs a vector of action probabilities, and ii. a regression network (continuous action space) outputs the mean and diagonal covariance of a Gaussian.

Let J(θ) be the policy objective function and α the learning rate. We search for a local maximum of J(θ) by ascending its gradient with respect to θ: θ ← θ + α·∇θ J(θ). The gradient finds the direction in which θ increases J(θ), and the learning rate α determines how big each step is. In other words, we compute the gradient of the objective function w.r.t. the parameterized policy and move in the steepest direction of the feature space.
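As a minimal sketch of that update rule on a toy, analytically differentiable J(θ) (standing in for the expected return; not an actual RL objective):

import numpy as np

def J(theta):
    """Toy concave objective standing in for the expected return J(theta)."""
    return -np.sum((theta - 3.0) ** 2)

def grad_J(theta):
    """Analytic gradient of the toy objective."""
    return -2.0 * (theta - 3.0)

theta = np.zeros(2)   # 'policy' parameters
alpha = 0.1           # learning rate: how big each ascent step is

for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # gradient ASCENT: move uphill on J

print(J(theta), theta)   # theta converges towards the maximiser [3., 3.]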

All errors are my own. ☞ Or a bot. Help fix it or use better practices🙄

Hap-py-coding ⇧⏎

And as always thanks for reading😎
