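Generalized Advantage Estimation (GAE): High-dimensional Continuous Control Using Generalized Advantage Estimation

To address the high variance and high bias of advantage estimates in policy gradient methods, the paper proposes GAE, a unified framework that smoothly trades off variance against bias.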

Dec 26, 2025 AI

Multi-Agent Reinforcement Learning

Reference: Transformer-based Multi-Agent Reinforcement Learning for Generalization of Heterogeneous Multi-Robot Cooperation

Dec 25, 2025 AI

Decision Transformer: Reinforcement Learning via Sequence Modeling

A framework that abstracts RL as a sequence modeling problem.

Dec 13, 2025 AI, Robotics

Isaac Lab Installation and Usage

Isaac Lab is a unified and modular framework for robot learning that aims to simplify common workflows in robotics research.

Nov 23, 2025 AI, Robotics

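DeepMind Reinforcement Learning Overview

1. Introduction Discount factor (the interaction in reinforcement learning carries a risk of termination): The discount factor $\gamma$ is best understood as quantifying a belief about whether the interaction will continue: the interaction continues to the next step with probability $\gamma$. It is not merely "discounting future rewards" (that is only its surface form). In general, the discount factor reflects the assumption that there i...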

Nov 22, 2025 AI

Robotics Transformer-2

1. Summary

Oct 12, 2025 AI, Robotics

Robotics Transformer-1

1. Summary

Oct 12, 2025 AI, Robotics

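Maximum Entropy Reinforcement Learning: SAC and Soft Q-learning

An introduction to the basic theory of maximum entropy reinforcement learning and its two main algorithms, soft actor-critic and soft Q-learning.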

Sep 27, 2025 AI

Policy Gradient Algorithms in Reinforcement Learning

Policy Gradient Method

Sep 7, 2025 AI

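The PPO Algorithm: Proximal Policy Optimization

I. Introduction Features of PPO: PPO uses a surrogate objective function, so each sample can go through multiple epochs of minibatch updates, whereas standard policy gradient algorithms perform only one gradient update per data sample. It has the benefits of TRPO while being simpler and having better sample complexity. Shortcomings of other algorithms: Q-learning: performs poorly on many simple problems, and its...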

Sep 1, 2025 AI