### The Paths Perspective with regards to Value Learning

This article attempts to gain some deeper insight into how Temporal Difference learning brings together paths of experience for improved statistical efficiency.

In the previous few years, reinforcement learning (RL) has experienced considerable evolution, which includes defeating world-champion Go players, managing robotic hands, and even painting pictures.

One of the critical sub-problems of Reinforcement Learning is value estimation – learning the longer-term impacts of being in a state. This can be a bit complex as future returns are typically noisy, impacted by several things other than the current state. The more far ahead we peer into the future, the more this rings true. But while difficult, estimation of value is also fundamental to several RL approaches. For various approaches (policy-value iteration), estimation of value is fundamentally the entire problem, whereas in other strategies (actor-critic models), value estimation is fundamental for noise reduction.

The organic method to estimate the value of a state is as the average returns you witness from that state. We refer to this as Monte Carlo value estimation. If a state is visited by just a single episode, Monte Carlo states its value is the return of that episode. If several episodes visit a state, Monte Carlo pegs its value as the average over them.

Let’s author Monte Carlo in a bit more formal fashion. Within reinforcement learning, we often detail algorithms with update rules, which inform us how estimates alter with an additional episode. We’ll leverage an “updates toward” (<–) operator to retain the simplicity of equations. In tabular settings like the cliff word instance, this “update towards” operator computes a running average. More particularly, the nth Monte Carlo update is V (*S _{t}*) = V (

*S*) + 1/n [R

_{t-1}_{n }– V(

*s*)] and we could just as simply leverage the “+=” notation. However, when leveraging parametric function approximators like neural networks, our “update towards” operator might signify a gradient step, which cannot be written in “+=” notation. In order to maintain the cleanliness of our notation and to keep it as general as feasible, we opted to leverage the ↩ operator throughout.

_{t}*V(S _{t}) *↩

*R*

_{t}The term on the right is referred to as the return and we leverage it to measure the amount of long-term reward an agent earns. The return is merely a weighted total of future rewards *r _{t }*+

*γrt*+1+

*γ*2

*rt*+2+… where

*γ*is a discount factor which manages how much short term rewards are worth comparative to longer-term rewards. Estimation of value by updations towards return makes a ton of sense. After all, value is defined as expected/predicted return. It might be shocking the we can do better.

**Beating Monte Carlo**

However, we can do better. The strategy is to leverage a technique referred to as “Temporal Difference (TD) learning, which bootstraps off of closeby states to render value updates.

*V(S _{t}*) ↩ r

_{t }+

*γ*V(s

_{t}+1)

Intersections amongst dual trajectories are managed differently under this update. Differing from Monte Carlo, TD updates merge intersections in order that the return flows backwards to all preceding states.

What is meant by merging trajectories in a more formal sense? Why might it be an appropriate idea? One thing to observe is that *V(s _{t}+1*) can be written as the expectation over all of its TD updates:

*V(s _{t+1}) *≃ E[

*r*

*t*+1′ +

*γV*(

*s*

*t*+2′)]

≃ E[*r**t*+1′] +* **γ*E[*V*(*s**t*+2′)]

Currently we can leverage this equation to expand the TD update rule recursively:

V (*s _{t}+1) *↩

*r*

*t*+

*γV*(

*s*

*t*+1) _ +

↩ *r**t *+ *γ*E[*r**t*+1] + *γ*2E[*V*(*s**t*+2′′)]

↩ *r**t *+ *γ*E [*r**t*+1′] + *γ*2EE [*r**t*+2′′] + …

This provides us an odd-looking sum of nested expectation values. Glancing at it, it’s not obvious how to contrast them with the more simplistic-looking Monte Carlo update. More critically, it’s not obvious that we should contrast the duo; the updates are so differing that it seems a bit redundant, like comparing apples to oranges. Indeed, it’s simple to think of Monte Carlo and TD learning as two completely differing strategies.

However, they are not so far apart, as a matter of fact. Let’s rewrite the Monte Carlo update in terms of reward and locate it beside the expanded TD update.

**MC update**

*V*(*s**t*) ↩ *rt *(Reward from current path) + *γ* *rt*+1 (Reward from current path) + *γ*2 *rt*+2 (reward from current path) + …

(

**TD update**

* V(s_{t}) ↩ r_{t} *(Reward from current path) +

*γ*E [

*rt*+1′] (Expectation over paths increasing current path.)

A pleasing correspondence has come up. The variation between Monte Carlo and TD learning boils down to the nested expectation operators. It happens that there is a nice visual depiction for what they are performing. We refer to it as the paths perspective on value learning.

**The Paths Perspective**

We often believe of an agent’s experience as a series of trajectories. The grouping makes sense logically, and is simple to visualize.

But this manner of organization of experience de-stresses relationships amongst trajectories. Wherever two trajectories meet at an intersection, both results are valid futures for the agent. So even if the agent has adhered to Trajectory 1 to the intersection, it could theoretically follow Trajectory 2 from the juncture onward. We can drastically expand the agent’s experience leveraging these simulated trajectories or “paths”.

**Value Estimations – **It happens that Monte Carlo is averaging over real trajectories while learning is averaging over all potential paths. The nested expectation values we observed earlier correspond to the agent averaging across all potential future paths.

**Contrasting the two – **From a general standpoint, the ideal value estimate is the one with the least variance. As tabular TD and Monte Carlo are empirical averages, the strategy that provides the better estimation is the one that averages over more items. This raises an obvious question: which estimator averages over more items?

V ar[V(s)] (Variance of estimate) ∝ 1/N (Inverse of the number of items in the average)

To start with, TD learning never averages over lesser trajectories than Monte Carlo as there are never lesser simulated trajectories than actual trajectories. On the other side, when there are more simulated trajectories, TD learning has the opportunity to average over more of the agent’s experience. This line of reasoning indicates that TD learning is the superior estimator and assists in explaining why TD has a tendency to outpace Monte Carlo within tabular environments.

**Introducing Q-functions**

An alternative to the value function is the Q-function. Rather than going about estimating the value of a state, it provides estimations about the value of a state and an action. The most overt reasons to leverage Q-functions is that they enable us to contrast differing actions.

There are some other nice attributes of Q-functions. In order to observe them, let’s write out the Monte Carlo and TD update rules.

**Updating Q-functions: **The Monte Carlo update rule appears almost identical to the one we noted down for V (s):

We still go about updating towards the return. Rather than updating them towards the return of being in some state though, we update towards the return of being in some kind of state and choosing some action.

Now let’s attempt doing the same thing with the TD update:

*Q(s _{t }, a_{t}) *↩

*r*

_{t }+*γ*

*Q*(

*st*+1,

*at*+1)

This variant of the TD update needs a tuple of the form (s_{t, }a_{t, }r_{t}, s_{t}+1, a_{t}+1) so we refer to it as the Sarsa algorithm. Sarsa might be the easier method to write this TD update, however, it’s not the most efficient. The issue with Sarsa is that it leverages *Q(s _{t}+*1, a

_{t}+1) for the subsequent state value when it actually should be leveraging V (s

_{t}+1). What we require is an improved estimate of V(S

_{t+}1)

There are several ways to recover *V(s _{t+}1_{) }* from Q-functions. In the upcoming section, we’ll take a deeper look at four of them.

Learning Q-functions with reweighted paths

Expected Sarsa. A better method of estimation of the next state’s value is with a weighted sum. Also written as an expectation value, therefore, “Expected Sarsa” over its Q-values. We refer to this approach as extended Sarsa:

Here’s a shocking fact with regards to expected Sarsa: the value estimate it provides is usually superior than a value estimate computer directly from the experience. This is due to the fact that the expectation value weights the Q-values by the true policy distribution instead of the empirical policy distribution. In performing this, Expected Sarsa rectifies for the difference amongst the empirical policy distribution and the true policy distribution.

Off-policy value learning. We can take this idea even further. Rather than weighting Q-values by the actual policy distribution, we could weight them through an arbitrary policy, π off

This minute alteration allows us to estimate value under any policy we desire. It’s interesting to think about Expected Sarsa as a special scenario of off-policy learning that’s leveraged for on-policy estimation.

**Re-weighting path intersections. **What does the paths perspective state with regards to off-policy learning? To identify a solution to this question, let’s consider some state where several paths of experience reach an intersection. Wherever intersecting paths are re-weighted, the paths that are most indicative of the off-policy distribution wind up making bigger contributions to the value estimate. Meanwhile, paths that have reduced probability make reduced contributions.

**Q-learning. **There are several scenarios where an agent requires to gather experience under a sub-optimal policy (e.g. to enhance exploration) while estimating value under an optimum one. In these scenarios, we leverage a variant of off-policy learning referred to as Q-learning.

Q-learning prunes away all but the higest-valued paths. That paths that stay are the paths that agents will adhere to at evaluation time, they are the only ones it requires to pay attention to. This kind of value learning usually leads to quicker convergence, than on-policy strategies.

**Double Q-Learning**. The issue with Q-learning is that it provides biased value estimates. More particularly, it is over-optimistic in the presence of noisy rewards. Here’s an instance where Q-learning fails:

You get to a casino and play a hundred slot machines. It’s your lucky day: you hit the jackpot on machine 43. Now, if you leverage Q-learning to go about estimating the value of being in the casino, you will select the ideal outcome over the actions of playing slot machines. You will wind up believing that the value of the casino is the value of the jackpot, and determine that the casino is a great place to be!

At times the biggest Q-value of a state is big just by chance, selecting it over others makes the value estimation as a biased one. One method to minimize this bias is to have a friend go to the casino and play the same set of slot machines. Then, question them what their winnings were at machine 43 and leverage their reply as your value estimation. It’s not probable that you both got the jackpot on the same machine, so this time you won’t wind up with an over-optimistic estimate. We refer to this strategy as Double Q-learning.

**Bringing it together. **It’s simple to think of Sarsa, Expected Sarsa, Q-learning, and Double Q-learning as differing algorithms – however, as we’ve observed, they are merely differing ways of estimating V(s_{t} – 1) in a TD update.

The strategy underlying all of these strategies is that they re-weight path intersections.

**Re-weighting paths with Monte Carlo. **At this juncture, a natural question is: Could we achieve the same re-weighting effect with Monte Carlo? We could, but it would be more complicated and involve re-weighting all of the agent’s experience. By working at intersections, TD learning re-weights individual transitions rather than episodes in their entirety. This makes TD strategies a lot more convenient for off-policy learning.

**Merging of Paths with Function Approximatiors**

Up till now, we’ve learned that a single parameter, the value estimate for each state or each state-action pair. This functions well for the Cliff World instance as it has a minimal number of states. But most fascinating RL problems have a big or infinite array of such states. This makes it difficult to record value estimates for every state.

Rather, we must force our value estimator to have lesser parameters than there are states. We can perform this with machine learning strategies like linear regression, decision trees, or neural networks. All of these strategies fall under the umbrella of function approximation.

**Merging closeby paths. **From the paths perspective, we can interpret function approximation as a method of merging nearby paths. However, what do we imply by “nearby”? In the figure above, we made an implicit decision to measure “nearby” with Euclidean distance. This was a solid idea as the Euclidean distance amongst two states has a high degree of correlation with the likelihood that the agent will transition between them.

However, it’s simple to imagine scenarios where this implicit assumption breaks down. By including a singular long barrier, we can build a case where the Euclidean distance metric leads to bad generalization. The issue is that we have merged the incorrect paths.

**Merging the incorrect paths. **The diagram below demonstrates the impacts of merging of the incorrect paths a tad more explicitly. As the Euclidean averager is to blame for weak generalization, both Monte Carlo and TD make bad value updates. But, TD learning amplifies these errors drastically while Monte Carlo does not.

We’ve observed that TD learning makes more effective value updates. The price we pay is that these updates wind up being a lot more sensitive to bad generalization.

**Implications for deep reinforcement learning**

**Neural Networks: **Deep neural networks are probably the most widespread function approximators within reinforcement learning. These models are thrilling for several reasons, however one specifically nice attribute is that they don’t make implicit assumptions about which states are “nearby”

Early on in training, neural networks, a lot like averagers, have a tendency to merge the incorrect paths of experience. In the Cliff Walking instance, an untrained neural network might make the same bad value updates as the Euclidean averager.

But as training moves forward, neural networks can actually learn to overcome these errors. They learn which states are nearby from experience and know-how. In the Cliff World instance, we might expect a completely-trained neural network to have learned the value updates to states above the barrier should never impact the values of states below the barrier. This isn’t something that most other function approximators can do. It’s one of the reasons RL is so fascinating!

**TD or not TD? **Thus far, we’ve observed how TD learning can outpace Monte Carlo by merging pathways of experience where they’ve intersected. We’ve also observed that merging of paths is a double-edged sword, when function approximation causes bad value updates, TD can wind up doing even worse.

Over the previous few decades, a dominant share of research in RL has preferred TD learning to Monte Carlo. Indeed, several approaches to RL leverage TD-style value updates. What that being stated, there are several other ways to leverage Monte Carlo for reinforcement learning. Our discussion is centred around Monte Carlo for value estimation in this article, however it can also be leveraged for policy selection in Silver et al.

As Monte Carlo and TD both have desirable attributes, why not attempt constructing a value estimator that is a combo of the two? That’s the reasoning behind TD learning. It’s a strategy that merely interpolates (leveraging the coefficient *λ*) between Monte Carlo and TD updates in the limit *λ = *0, we recover the TD update rule. Meanwhile, when *λ = *1, we retrieve Monte Carlo..usually, TD(*λ*) functions better than either Monte Carlo or TD learning alone. Scientists often keep the *λ *coefficient constant as they go about training a deep RL model. However, if we think Monte Carlo learning is ideal early on in training (prior to the agent having learned a good state of representation) and TD learning is ideal later on (when it’s simpler to advantage from convergent paths), perhaps the ideal approach is to anneal *λ *over the course of training.

**Conclusion**

In this blog post by AICorespot we put forth to you a novel way of thinking about TD learning. It assists us to see why TD learning can be advantageous, why it can be efficient for off-policy learning, and why there can be challenges in bringing together TD learning with function approximators