Model-based methods have recently shown promise for offline reinforcement learning (RL), which aims to learn effective policies from historical data without further interaction with the environment.
Previous model-based offline RL methods employ a straightforward prediction model that maps states and actions directly to next-step states.
However, such a prediction model tends to capture spurious correlations induced by the preferences of the sampling policy that generated the offline data.
Ideally, the environment model should instead focus on causal influences, which facilitates learning an effective policy that generalizes well to unseen states.
To address these problems, a research team led by Yang Yu from LAMDA, Nanjing University published their new research on 15 Apr 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team first provides theoretical results showing that causal environment models can outperform plain environment models in offline RL, by incorporating the causal structure into the generalization error bound. They also propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structured World Models (FOCUS), to demonstrate the feasibility of learning and leveraging causal structure in offline RL.
Learning the causal structure from offline data, also known as causal discovery from observations, is a crucial phase of FOCUS. However, causal discovery from observations requires a large number of hypothesis tests, which is computationally expensive. To tackle this problem, the team exploits the time-series property of RL data to reduce the number of hypothesis tests. Specifically, they incorporate the constraint that the future cannot cause the past into the PC algorithm, which uncovers causal relationships from inferred conditional-independence relations. This constraint both reduces the number of conditional-independence tests and determines the direction of each causal edge. In addition, the team employs kernel-based conditional-independence tests, which can be applied to continuous variables without assuming a specific functional form between the variables or a particular data distribution.
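The temporal constraint can be sketched in a few lines. Because every candidate edge must point from a time-t variable (state dimensions and action) to a time-(t+1) state dimension, no orientation phase is needed and a single conditional-independence test per candidate edge suffices. The sketch below is illustrative only, not the authors' implementation: it substitutes a simple partial-correlation test for the kernel-based tests used by FOCUS, and conditions each test on all remaining time-t variables rather than running the full PC search over conditioning sets. All function names and the synthetic two-dimensional system are assumptions for illustration.

```python
import numpy as np

def partial_corr_independent(x, y, z, threshold=0.05):
    """Declare x independent of y given z when their partial
    correlation is near zero (a linear stand-in for a kernel CI test)."""
    if z.shape[1] > 0:
        # Regress the conditioning set z (plus an intercept) out of x and y.
        zz = np.column_stack([z, np.ones(len(z))])
        x = x - zz @ np.linalg.lstsq(zz, x, rcond=None)[0]
        y = y - zz @ np.linalg.lstsq(zz, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    return abs(r) < threshold

def discover_structure(parents, children, threshold=0.05):
    """parents: (n, p) time-t variables; children: (n, c) time-(t+1) states.
    Returns a boolean (p, c) adjacency matrix: parent j -> child k.
    Directions are fixed by time, so only one CI test per edge is run."""
    p, c = parents.shape[1], children.shape[1]
    adj = np.zeros((p, c), dtype=bool)
    for k in range(c):
        for j in range(p):
            others = parents[:, [i for i in range(p) if i != j]]
            if not partial_corr_independent(
                parents[:, j], children[:, k], others, threshold
            ):
                adj[j, k] = True  # dependence remains => keep causal edge
    return adj

# Synthetic transitions: s1' depends on (s1, a); s2' depends only on s2.
rng = np.random.default_rng(0)
n = 5000
s1, s2, a = rng.normal(size=(3, n))
s1_next = 0.8 * s1 + 0.5 * a + 0.1 * rng.normal(size=n)
s2_next = 0.9 * s2 + 0.1 * rng.normal(size=n)
parents = np.column_stack([s1, s2, a])       # time-t variables
children = np.column_stack([s1_next, s2_next])  # time-(t+1) states
adj = discover_structure(parents, children)
print(adj.astype(int))
```

A plain prediction model trained on such data would happily use all time-t variables for every output; the recovered adjacency instead masks out the non-causal inputs, which is the structural information a causal world model exploits.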
The experimental results validate the theoretical claims, showing that FOCUS outperforms baseline models and existing causal MBRL algorithms in the offline setting.
DOI: 10.1007/s11704-024-3946-y