Author: Unknown · Notes: 10
At time t, only the q-value of (s_t, a_t) is updated, whereas the q-values of the others remain the same.
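A minimal sketch of this property: one Sarsa update on a tabular q-table touches only the entry for the visited state-action pair. All names (`q`, `alpha`, `gamma`, the sampled transition) are illustrative assumptions, not from the notes.

```python
import numpy as np

# Tabular q-values: 4 states x 2 actions, all initialized to zero
# (sizes and learning parameters are assumptions for illustration).
q = np.zeros((4, 2))
alpha, gamma = 0.1, 0.9

s_t, a_t = 1, 0                       # current state-action pair
r_next, s_next, a_next = 1.0, 2, 1    # sampled reward, next state, next action

# Sarsa update: TD target uses the sampled next action a_{t+1}
td_target = r_next + gamma * q[s_next, a_next]
q[s_t, a_t] += alpha * (td_target - q[s_t, a_t])

# Only q[s_t, a_t] changed; every other entry is still zero.
```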
Expected Sarsa
Given a policy π, its action values can be evaluated by Expected Sarsa, which is a variant of Sarsa.
They differ only in their TD targets. In particular, the TD target in Expected Sarsa is r_{t+1} + γ Σ_a π(a|s_{t+1}) q_t(s_{t+1}, a), the expectation of q_t(s_{t+1}, ·) under π, while that of Sarsa is r_{t+1} + γ q_t(s_{t+1}, a_{t+1}).
Although calculating the expected value slightly increases the computational complexity, it is beneficial because it reduces the estimation variance: the random variables involved shrink from {s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}} in Sarsa to {s_t, a_t, r_{t+1}, s_{t+1}}.
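The two TD targets can be contrasted in a few lines. This is an illustrative sketch, assuming a small action set with made-up values for q_t(s_{t+1}, ·), π(·|s_{t+1}), and r_{t+1}.

```python
import numpy as np

gamma = 0.9
q_next = np.array([1.0, 3.0])      # q_t(s_{t+1}, a) for two actions (assumed)
pi_next = np.array([0.25, 0.75])   # pi(a | s_{t+1}) (assumed)
r_next = 0.5

# Sarsa: the target depends on a sampled next action a_{t+1},
# an extra source of randomness.
a_next = 1
sarsa_target = r_next + gamma * q_next[a_next]

# Expected Sarsa: the target averages over actions under pi,
# so no sample of a_{t+1} is needed.
expected_target = r_next + gamma * np.dot(pi_next, q_next)
```

Note that the expectation is computed exactly from π and the current q-values, which is why a_{t+1} drops out of the list of random variables.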
On-policy learning and off-policy learning.
What makes Q-learning slightly special compared to the other TD algorithms is that Q-learning is off-policy while the others are on-policy.
When the behavior policy is the same as the target policy, the learning is called on-policy; when they are different, it is called off-policy.
The advantage of off-policy learning is that it can learn optimal policies from experience samples generated by other policies.
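The off-policy nature of Q-learning can be sketched as follows: the TD target takes a max over actions (the greedy target policy), regardless of which behavior policy generated the sample. All concrete values here are illustrative assumptions.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
q = np.array([[0.0, 0.0],          # 2 states x 2 actions (assumed values)
              [1.0, 2.0]])

# Transition sampled by some exploratory behavior policy
# (e.g., epsilon-greedy); Q-learning does not care which one.
s_t, a_t = 0, 1
r_next, s_next = 1.0, 1

# Q-learning target: greedy over next-state q-values,
# not the behavior policy's sampled next action.
td_target = r_next + gamma * q[s_next].max()
q[s_t, a_t] += alpha * (td_target - q[s_t, a_t])
```

Because the target is always the greedy one, the samples can come from any sufficiently exploratory behavior policy, which is exactly what makes the learning off-policy.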
Note that the on-policy/off-policy distinction is different from online/offline learning.