Skill-based reinforcement learning (RL) has become a mainstream approach to solving sparse-reward decision-making tasks. Skills extracted from demonstration datasets provide temporal abstraction. However, previous skill-based RL methods keep the skills fixed during online learning, which leads to sub-optimal asymptotic performance when the dataset contains only sub-optimal behavior modes.
To address this problem, a research team led by Ying Wen published their new research on 15 June 2026 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team proposed a skill-based RL method that fine-tunes the entire hierarchical policy under a unified optimization objective via a dynamic skill refinement mechanism. The method was verified and tested on multiple sparse-reward robotic manipulation tasks. Compared with state-of-the-art (SOTA) methods, the proposed method achieves higher asymptotic performance and more stable performance improvement.
In this work, the team proposes to optimize the hierarchical policy’s performance in the TA-MDP. They prove that the unified optimization objective guarantees performance improvement in the TA-MDP and, in essence, optimizes a lower bound on performance in the original MDP, which supports the method's effectiveness. They formulate skill refinement as a residual policy that predicts dynamically weighted action increments, which avoids collapse of the skill space. At the end of each epoch, the high-level and low-level policies are updated simultaneously in an on-policy manner, which circumvents the temporal abstraction shift.
Specifically, the weight of the action increment is determined dynamically according to the level of skill refinement in the current state. The paper measures this refinement level through random network distillation (RND).
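The idea of an RND-weighted action increment can be illustrated with a minimal sketch. It is not the paper's implementation. The class `TinyRND`, the functions `increment_weight` and `refined_action`, the linear predictor, and the exponential mapping from RND error to weight are all hypothetical choices made for illustration. In particular, the source does not say how RND error is converted into a weight; here it is assumed that a low error (a frequently visited state) yields a larger weight on the increment.

```python
import numpy as np


class TinyRND:
    """Toy random network distillation: a fixed random target network and a
    trainable linear predictor. The prediction error shrinks on states the
    predictor has been trained on. (Hypothetical helper, not from the paper.)"""

    def __init__(self, state_dim, feat_dim, rng):
        self.feat_dim = feat_dim
        self.w_target = rng.normal(size=(state_dim, feat_dim))  # never updated
        self.w_pred = np.zeros((state_dim, feat_dim))           # trained

    def error(self, state):
        target = np.tanh(state @ self.w_target)
        pred = state @ self.w_pred
        return float(np.mean((pred - target) ** 2))

    def update(self, state, lr=0.1):
        # One gradient step on the mean squared prediction error.
        target = np.tanh(state @ self.w_target)
        pred = state @ self.w_pred
        grad = np.outer(state, pred - target) * (2.0 / self.feat_dim)
        self.w_pred -= lr * grad


def increment_weight(rnd_error, scale=1.0):
    """ASSUMPTION: map RND error to a weight in (0, 1]; low error gives a
    weight close to 1. The paper's actual mapping may differ."""
    return float(np.exp(-rnd_error / scale))


def refined_action(skill_action, increment, weight):
    """Residual refinement: skill action plus a weighted action increment."""
    return skill_action + weight * increment


rng = np.random.default_rng(0)
rnd = TinyRND(state_dim=4, feat_dim=8, rng=rng)
state = np.ones(4)
skill_action = np.array([0.5, -0.2])   # stand-in for the low-level skill output
increment = np.array([0.1, 0.1])       # stand-in for the residual policy output

w_before = increment_weight(rnd.error(state))
for _ in range(200):                   # simulate repeated visits to this state
    rnd.update(state)
w_after = increment_weight(rnd.error(state))

print("weight before/after:", w_before, w_after)
print("refined action:", refined_action(skill_action, increment, w_after))
```

With a weight of zero the refined action reduces to the original skill action, so under this assumed mapping the residual policy only departs from the pretrained skill in states where the weight has grown.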
Future work can focus on finding additional measures of the skill refinement level. Finding a tighter performance lower bound is also an important open issue.
DOI: 10.1007/s11704-025-50561-3