Background
Chemical toxicity evaluation is vital in the medical, industrial, and agricultural sectors to ensure rigorous safety testing and to prevent harmful effects on the environment, living organisms, and humans. During drug discovery and development, identifying compounds with the highest potential for safety and efficacy can reduce failure rates during early design stages. Multi-Species Acute Toxicity Prediction (MSATP) is a critical yet challenging aspect of toxicity evaluation. Traditional MSATP methods rely on in vivo animal studies combined with in vitro techniques, which are labor-intensive, costly, and time-consuming. Moreover, the extensive use of laboratory animals has raised significant ethical concerns worldwide. With the growing availability of MSATP data, machine learning has emerged as a cost-effective, rapid, and precise alternative that offers an efficient solution to reduce reliance on animal testing.
Research Progress
Current research on MSATP commonly employs multi-task deep neural networks for modeling. However, MSATP tabular data are small, high-dimensional, and sparse, which makes them poorly suited to neural network approaches. To address this, we proposed a multi-task cascade forest framework for MSATP. This framework (Fig. 1) integrated feature enhancement through knowledge transfer with sample enhancement using a greedy search strategy and the covariance distance measure. The framework accommodated tasks of varying sizes in multi-task learning and was specifically designed for tabular data, achieving a 12% improvement in performance compared to current state-of-the-art methods. Figure 2 compares the proposed framework with seven comparison methods on 59 toxicity endpoints; the proposed framework is shown as curves and the comparison methods as bars.
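The feature enhancement idea, in which predictions from source-domain models are concatenated with target-domain features, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy nearest-neighbour regressor stands in for the forests used in the study, and the function names and data are hypothetical.

```python
from statistics import mean


def fit_neighbor_model(X, y, k=2):
    """Return a toy k-nearest-neighbour regressor (a stand-in for a forest)."""
    def predict(x):
        # Rank training samples by squared Euclidean distance to x.
        ranked = sorted(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        # Average the labels of the k closest training samples.
        return mean(y[i] for i in ranked[:k])
    return predict


def augment_with_source_predictions(X_target, source_models):
    """Append each source model's prediction to every target feature vector."""
    return [list(x) + [model(x) for model in source_models] for x in X_target]


# Hypothetical data: two source endpoints and a small target endpoint.
X_src_a = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y_src_a = [2.0, 3.0, 4.0, 1.0]
X_src_b = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]
y_src_b = [2.5, 3.5, 1.5]
X_target = [[0.9, 0.9], [0.1, 0.1]]

source_models = [
    fit_neighbor_model(X_src_a, y_src_a),
    fit_neighbor_model(X_src_b, y_src_b),
]
X_aug = augment_with_source_predictions(X_target, source_models)
```

Each target sample keeps its original features and gains one extra column per source model, so a target-domain learner trained on `X_aug` can draw on what the source endpoints already encode.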
Additionally, in a single-view context, we conducted ablation experiments to validate the effectiveness of the data enhancement strategy and introduced external dataset experiments to assess the generalization capability of the proposed method for cross-species prediction. In a multi-view context, the feature fusion method and consensus ensemble were demonstrated to further enhance the model performance. Finally, we analyzed feature importance vectors to provide interpretable insights into species toxicity correlations.
Future Prospects
Overall, this framework effectively addressed MSATP tasks and exhibited significant potential for application in various toxicity prediction domains. Nevertheless, there is room for further improvement in model performance. Feature enhancement based on layer transfer currently imitates the way the original cascade forest structure concatenates enhancement vectors: it aims to improve the representation ability of the target domain samples by concatenating the prediction results of the source domain model with the target domain data. However, some studies have suggested that the representation ability of enhancement vectors can be further improved by employing more complex feature representation methods, such as Shapley-based feature augmentation vectors or tree-based embedding vectors, rather than relying on simple model outputs. In addition, the sample enhancement strategy, which uses covariance distance measurement and a greedy neighbor search for sample transfer at specific toxicity endpoints, overlooks relevant samples with large covariance distances. To address this, multi-task sample clustering may be considered to implement more accurate sample transfer for specific toxicity endpoints. These ideas will guide subsequent improvements to the model.
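The limitation of greedy neighbor search noted above can be made concrete with a small sketch. This is one plausible reading of a greedy transfer loop, not the procedure used in the study: the Euclidean distance is a placeholder for the covariance distance measure, and the function names, `budget`, and `max_distance` parameters are hypothetical.

```python
import math


def euclidean(a, b):
    """Placeholder distance; the study uses a covariance distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_sample_transfer(target, candidates, budget, max_distance,
                           distance=euclidean):
    """Greedily move the closest candidate samples into the target pool.

    At each step, pick the candidate whose distance to its nearest sample in
    the current pool is smallest. Stop when the budget is spent or when the
    closest remaining candidate is farther than max_distance.
    Assumes target is non-empty.
    """
    pool = [list(s) for s in target]
    remaining = [list(c) for c in candidates]
    transferred = []
    while remaining and len(transferred) < budget:
        scored = [
            (min(distance(c, p) for p in pool), i)
            for i, c in enumerate(remaining)
        ]
        best_dist, best_idx = min(scored)
        if best_dist > max_distance:
            break  # every remaining candidate is too far away
        chosen = remaining.pop(best_idx)
        pool.append(chosen)
        transferred.append(chosen)
    return transferred


# Hypothetical samples for one target endpoint and candidates from other endpoints.
target = [[0.0, 0.0], [1.0, 0.0]]
candidates = [[0.1, 0.0], [1.2, 0.0], [5.0, 5.0]]
picked = greedy_sample_transfer(target, candidates, budget=3, max_distance=0.5)
```

In this toy run the two nearby candidates are transferred, while the sample at `[5.0, 5.0]` is never considered, however relevant it might be. A clustering-based selection would be one way to reach such distant samples.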
The complete study is accessible via DOI:10.34133/research.1046