Soft measurement based on data-driven models is widely used to predict key variables in the process industry because of its low cost and real-time capability. However, these models struggle with noisy datasets that contain few samples. A study published in Frontiers of Chemical Science and Engineering presents data-mechanism hybrid driven methods to address this challenge.
The research team proposed four hybrid approaches integrating mechanism models with three common data-driven models: random forest, extreme gradient boosting, and artificial neural network. In these methods, mechanism calculations provide constraints that guide data-driven model training and prediction.
The first method uses mechanism model outputs as inputs for data-driven models. The second concatenates original data with mechanism calculations for enhanced input features. The third incorporates mechanism constraints into the loss function during neural network training. The fourth combines all three approaches.
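The first three approaches can be illustrated with a minimal Python sketch. It is not the study's implementation: the function mechanism_model is a hypothetical stand-in for a first-principles calculation, and the penalty form and the weight parameter in hybrid_loss are assumptions chosen only to show where mechanism information could enter.

```python
import numpy as np


def mechanism_model(x):
    """Hypothetical stand-in for a mechanism (first-principles) calculation.

    Assumption: a simple closed-form relation on two inputs. A real case
    would call a process model here instead.
    """
    return np.column_stack([x[:, 0] * x[:, 1], x[:, 0] + x[:, 1]])


def mechanism_only_features(x):
    """First approach: the mechanism outputs alone are the model inputs."""
    return mechanism_model(x)


def augmented_features(x):
    """Second approach: original data concatenated with mechanism outputs."""
    return np.hstack([x, mechanism_model(x)])


def hybrid_loss(y_pred, y_true, y_mech, weight=0.1):
    """Third approach: data-fit error plus a mechanism-consistency penalty.

    Assumption: the constraint is a squared deviation from the mechanism
    estimate y_mech, scaled by an illustrative weight.
    """
    data_term = np.mean((y_pred - y_true) ** 2)
    mech_term = np.mean((y_pred - y_mech) ** 2)
    return data_term + weight * mech_term
```

Under these assumptions, the fourth approach would train a neural network on augmented_features(x) while minimizing hybrid_loss.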
Validation was conducted through two industrial cases. The benzene-toluene-xylene distillation case used 14 input features to predict three product compositions. The steam methane reforming case, described by three-step reaction kinetics, used seven input variables to predict seven outputs. Datasets were generated by simulation, with Gaussian noise added at controlled levels.
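One way to add Gaussian noise at a controlled level to simulated data is sketched below. The summary does not say how the noise level is defined, so this sketch assumes it is the ratio of the noise standard deviation to the standard deviation of each clean variable. The function name add_gaussian_noise is illustrative.

```python
import numpy as np


def add_gaussian_noise(values, noise_level, rng):
    """Return a noisy copy of simulated data.

    Assumption: noise_level (for example 0.1 for 10 percent) scales the
    per-column standard deviation of the clean values.
    """
    scale = noise_level * np.std(values, axis=0, keepdims=True)
    return values + rng.normal(size=values.shape) * scale


rng = np.random.default_rng(0)
clean = rng.uniform(size=(400, 3))           # stand-in for simulated outputs
noisy = add_gaussian_noise(clean, 0.2, rng)  # 20 percent noise level
```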
Results showed that hybrid methods consistently improved prediction accuracy, with the size of the improvement depending on noise intensity, sample size, and model choice. Under noise levels of 10 to 20 percent and sample sizes of 100 to 400, improvements in the coefficient of determination reached 5.2 percent for random forest, 17.7 percent for extreme gradient boosting, and 36.2 percent for artificial neural network.
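For reference, the coefficient of determination can be computed as in the short sketch below. The helper relative_improvement_percent is an assumption about how a percentage improvement could be derived from two scores. The summary does not state whether the reported percentages are relative or absolute changes.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def relative_improvement_percent(r2_baseline, r2_hybrid):
    """Assumed definition: change relative to the baseline score, in percent."""
    return 100.0 * (r2_hybrid - r2_baseline) / r2_baseline
```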
In the distillation case, hybrid methods with extreme gradient boosting and artificial neural network achieved maximum improvements of 6.7 percent under lower noise and 7.7 percent under 20 percent noise. In the reforming case, hybrid methods with extreme gradient boosting improved by 0.3 to 2.5 percent, while the fourth method, labeled (d), enhanced artificial neural network performance by 0.003 to 0.159.
A double hybrid method incorporating mass conservation law further improved artificial neural network predictions by 0.005 to 0.177, showing better stability in high-value regions.
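A mass conservation constraint of this kind could enter training as an extra penalty term. The sketch below is an assumption for illustration only: it penalizes the mismatch between the summed predicted outlet flows and a known total inlet flow. The summary does not give the study's actual formulation, and the names used here are hypothetical.

```python
import numpy as np


def mass_balance_penalty(outlet_flows_pred, inlet_flow_total):
    """Mean squared mass-balance residual over all samples.

    Assumption: each row of outlet_flows_pred holds the predicted outlet
    mass flows for one sample, and they should sum to the inlet total.
    """
    residual = np.sum(outlet_flows_pred, axis=1) - inlet_flow_total
    return np.mean(residual ** 2)
```

Such a term would be added, with a chosen weight, to the data-fit loss so that predictions violating the balance are penalized.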
This research demonstrates that data-mechanism hybrid driven methods offer superior predictive performance for key variables in the process industry, particularly when working with the small, noisy datasets common in real-world applications.
DOI: 10.1007/s11705-026-2632-z