Model-based methods have recently shown promise for offline reinforcement learning (RL), which aims to learn effective policies from historical data without further interaction with the environment.
Previous model-based offline RL methods employ a straightforward prediction model that maps states and actions directly to next-step states.
However, such a prediction model tends to capture spurious correlations induced by the preferences of the sampling policy that generated the offline data.
Ideally, the environment model should instead focus on causal influences, which facilitates learning an effective policy that generalizes well to unseen states.
To address these problems, a research team led by Yang Yu from LAMDA, Nanjing University published their new research on 15 Apr 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team first provides theoretical results showing that causal environment models can outperform plain environment models in offline RL, by incorporating the causal structure into the generalization error bound. They also propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structured World Models (FOCUS), to demonstrate the feasibility of learning and leveraging causal structure in offline RL.
Learning the causal structure from offline data, also known as causal discovery from observations, is a crucial phase of FOCUS. However, causal discovery from observations requires a large number of hypothesis tests, which is computationally expensive. To tackle this problem, the team exploits the time-series property of RL data to reduce the number of hypothesis tests. Specifically, they incorporate the constraint that the future cannot cause the past into the PC algorithm, which uncovers causal relationships from inferred conditional-independence relations. This constraint both reduces the number of conditional-independence tests and determines the direction of each causal edge. In addition, the team employs kernel-based conditional-independence tests, which can be applied to continuous variables without assuming a specific functional form between the variables or a particular data distribution.
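The temporal constraint can be sketched in a few lines. Because every candidate edge must point from a time-t variable (state dimensions and action) to a time-(t+1) state dimension, no orientation phase is needed and a single conditional-independence test per candidate edge suffices. The sketch below is illustrative only, not the authors' implementation: it substitutes a simple partial-correlation test for the kernel-based tests used by FOCUS, and conditions each test on all remaining time-t variables rather than running the full PC search over conditioning sets. All function names and the synthetic two-dimensional system are assumptions for illustration.

```python
import numpy as np

def partial_corr_independent(x, y, z, threshold=0.05):
    """Declare x independent of y given z when their partial
    correlation is near zero (a linear stand-in for a kernel CI test)."""
    if z.shape[1] > 0:
        # Regress the conditioning set z (plus an intercept) out of x and y.
        zz = np.column_stack([z, np.ones(len(z))])
        x = x - zz @ np.linalg.lstsq(zz, x, rcond=None)[0]
        y = y - zz @ np.linalg.lstsq(zz, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    return abs(r) < threshold

def discover_structure(parents, children, threshold=0.05):
    """parents: (n, p) time-t variables; children: (n, c) time-(t+1) states.
    Returns a boolean (p, c) adjacency matrix: parent j -> child k.
    Directions are fixed by time, so only one CI test per edge is run."""
    p, c = parents.shape[1], children.shape[1]
    adj = np.zeros((p, c), dtype=bool)
    for k in range(c):
        for j in range(p):
            others = parents[:, [i for i in range(p) if i != j]]
            if not partial_corr_independent(
                parents[:, j], children[:, k], others, threshold
            ):
                adj[j, k] = True  # dependence remains => keep causal edge
    return adj

# Synthetic transitions: s1' depends on (s1, a); s2' depends only on s2.
rng = np.random.default_rng(0)
n = 5000
s1, s2, a = rng.normal(size=(3, n))
s1_next = 0.8 * s1 + 0.5 * a + 0.1 * rng.normal(size=n)
s2_next = 0.9 * s2 + 0.1 * rng.normal(size=n)
parents = np.column_stack([s1, s2, a])       # time-t variables
children = np.column_stack([s1_next, s2_next])  # time-(t+1) states
adj = discover_structure(parents, children)
print(adj.astype(int))
```

A plain prediction model trained on such data would happily use all time-t variables for every output; the recovered adjacency instead masks out the non-causal inputs, which is the structural information a causal world model exploits.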
The experimental results validate the theoretical claims, showing that FOCUS outperforms baseline models and existing causal MBRL algorithms in the offline setting.
DOI: 10.1007/s11704-024-3946-y