Author: Unknown · Notes: 10
At time t, only the q-value of (s_t, a_t) is updated, whereas the q-values of the others remain the same.
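A minimal sketch of this property: one Sarsa update on a tabular q-table touches only the entry for the visited state-action pair. All names (`q`, `alpha`, `gamma`, the sampled transition) are illustrative assumptions, not from the notes.

```python
import numpy as np

# Tabular q-values: 4 states x 2 actions, all initialized to zero
# (sizes and learning parameters are assumptions for illustration).
q = np.zeros((4, 2))
alpha, gamma = 0.1, 0.9

s_t, a_t = 1, 0                       # current state-action pair
r_next, s_next, a_next = 1.0, 2, 1    # sampled reward, next state, next action

# Sarsa update: TD target uses the sampled next action a_{t+1}
td_target = r_next + gamma * q[s_next, a_next]
q[s_t, a_t] += alpha * (td_target - q[s_t, a_t])

# Only q[s_t, a_t] changed; every other entry is still zero.
```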
Expected Sarsa
Given a policy π, its action values can be evaluated by Expected Sarsa, which is a variant of Sarsa.
They differ only in their TD targets. In particular, the TD target in Expected Sarsa is r_{t+1} + γ Σ_a π(a|s_{t+1}) q_t(s_{t+1}, a), the expectation of q_t(s_{t+1}, ·) under π, while that of Sarsa is r_{t+1} + γ q_t(s_{t+1}, a_{t+1}).
Although calculating the expected value slightly increases the computational complexity, it is beneficial because it reduces the estimation variance: the random variables involved shrink from {s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}} in Sarsa to {s_t, a_t, r_{t+1}, s_{t+1}}.
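The two TD targets can be contrasted in a few lines. This is an illustrative sketch, assuming a small action set with made-up values for q_t(s_{t+1}, ·), π(·|s_{t+1}), and r_{t+1}.

```python
import numpy as np

gamma = 0.9
q_next = np.array([1.0, 3.0])      # q_t(s_{t+1}, a) for two actions (assumed)
pi_next = np.array([0.25, 0.75])   # pi(a | s_{t+1}) (assumed)
r_next = 0.5

# Sarsa: the target depends on a sampled next action a_{t+1},
# an extra source of randomness.
a_next = 1
sarsa_target = r_next + gamma * q_next[a_next]

# Expected Sarsa: the target averages over actions under pi,
# so no sample of a_{t+1} is needed.
expected_target = r_next + gamma * np.dot(pi_next, q_next)
```

Note that the expectation is computed exactly from π and the current q-values, which is why a_{t+1} drops out of the list of random variables.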
On-policy learning and off-policy learning.
What makes Q-learning slightly special compared to the other TD algorithms is that Q-learning is off-policy while the others are on-policy.
When the behavior policy is the same as the target policy, the learning is called on-policy; when they are different, it is called off-policy.
The advantage of off-policy learning is that it can learn optimal policies from experience samples generated by other policies.
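The off-policy nature of Q-learning can be sketched as follows: the TD target takes a max over actions (the greedy target policy), regardless of which behavior policy generated the sample. All concrete values here are illustrative assumptions.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
q = np.array([[0.0, 0.0],          # 2 states x 2 actions (assumed values)
              [1.0, 2.0]])

# Transition sampled by some exploratory behavior policy
# (e.g., epsilon-greedy); Q-learning does not care which one.
s_t, a_t = 0, 1
r_next, s_next = 1.0, 1

# Q-learning target: greedy over next-state q-values,
# not the behavior policy's sampled next action.
td_target = r_next + gamma * q[s_next].max()
q[s_t, a_t] += alpha * (td_target - q[s_t, a_t])
```

Because the target is always the greedy one, the samples can come from any sufficiently exploratory behavior policy, which is exactly what makes the learning off-policy.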
Note that the on-policy/off-policy distinction is different from online/offline learning.