A new Open Access paper in Digital Discovery presents a scalable hybrid approach that combines physics-informed solubility estimates with interpretable machine learning to improve prediction of drug solubility in organic solvents.
Our partners have published a new paper in the Royal Society of Chemistry journal Digital Discovery exploring how hybrid machine learning (ML) and quantum-informed modelling can improve solubility prediction for drug compounds in organic solvents. Solubility is a critical property for pharmaceutical formulation and processing, but physics-based approaches such as COSMO-RS can be computationally expensive at scale.
Abstract
Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into Machine Learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several novel ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. Four major descriptor sets (MOE, Mordred, RDKit descriptors, and Morgan Fingerprints) were compared, offering the first descriptor-level assessment of how COSMO-RS-calculated solubility augmentation interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
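For readers unfamiliar with Y-scrambling, the idea can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the "model" is a one-feature least-squares line scored by training-set R², and all function names are hypothetical. The paper's actual check was run on its own ML models and descriptor sets. The principle is the same: if a model scores about as well on shuffled target values as on the real ones, its apparent performance is an artefact.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b with a single feature."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def r_squared(x, y):
    """Training-set R^2 of the fitted line (illustrative metric only)."""
    a, b = fit_line(x, y)
    my = statistics.fmean(y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_scramble(x, y, n_rounds=100, seed=0):
    """Return the real score and the scores obtained on shuffled targets."""
    rng = random.Random(seed)
    scrambled_scores = []
    for _ in range(n_rounds):
        shuffled = y[:]          # copy so the real targets stay intact
        rng.shuffle(shuffled)    # break the feature-target relationship
        scrambled_scores.append(r_squared(x, shuffled))
    return r_squared(x, y), scrambled_scores


# Synthetic placeholder data: one feature with a genuine linear signal.
data_rng = random.Random(42)
x = [data_rng.uniform(-3.0, 1.0) for _ in range(200)]
y = [0.8 * xi + data_rng.gauss(0.0, 0.3) for xi in x]

real_r2, scrambled_r2 = y_scramble(x, y)
print(f"real R^2: {real_r2:.3f}")
print(f"mean scrambled R^2: {statistics.fmean(scrambled_r2):.3f}")
```

A real score well above the whole distribution of scrambled scores suggests the model has learned a genuine relationship and is not simply fitting noise.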
Fig. 1 Workflow for solubility modelling and interpretation. (1) A dataset of 714 binary solute–solvent systems is encoded using SMILES. (2) These SMILES serve as inputs for: openCOSMO solubility prediction utilising (a) surface charge distributions obtained from BP86/def2TZVPD and COSMO calculations, (b) in cases where the solute is a solid, solute enthalpies of fusion and melting points, and (c) a representative COSMO surface charge density visualisation shown for illustration as part of the COSMO solubility output; and for the generation of MOE, RDKit, and Mordred descriptors as well as Morgan Fingerprints. (3) The resulting COSMO-RS solubility estimates and preprocessed descriptor sets are combined as input under a hybrid mode. Machine learning models (RF, XGBoost, SVM) are trained to predict solubility. (4) SHAP-based heatmaps then decompose model outputs into descriptor and fingerprint contributions, translating predictions into QSAR insights.
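Two ideas from the workflow can be made concrete with a short sketch: the "hybrid mode", where a COSMO-RS solubility estimate is added as one more input feature alongside the descriptors, and leave-one-solute-out cross-validation, where every system containing a given solute is held out together. The code below is an assumption-laden illustration and not the authors' implementation. The function names, solute and solvent labels, and all numbers are hypothetical placeholders, and it uses only the Python standard library.

```python
from collections import defaultdict


def add_cosmo_feature(descriptors, cosmo_estimates):
    """Append the physics-based estimate as one extra column per row."""
    if len(descriptors) != len(cosmo_estimates):
        raise ValueError("need one COSMO-RS estimate per descriptor row")
    return [list(row) + [est] for row, est in zip(descriptors, cosmo_estimates)]


def leave_one_solute_out(solutes):
    """Yield (held_out_solute, train_idx, test_idx) for each distinct solute."""
    groups = defaultdict(list)
    for i, solute in enumerate(solutes):
        groups[solute].append(i)
    for solute, test_idx in groups.items():
        # Every system with the held-out solute leaves the training set,
        # so the model is always tested on a solute it has never seen.
        train_idx = [i for i, s in enumerate(solutes) if s != solute]
        yield solute, train_idx, test_idx


# Hypothetical solute-solvent systems with arbitrary placeholder values.
solutes = ["solute_A", "solute_A", "solute_B", "solute_B", "solute_C"]
solvents = ["solvent_1", "solvent_2", "solvent_1", "solvent_3", "solvent_1"]
descriptors = [[1.0, 2.0], [1.0, 3.0], [4.0, 2.0], [4.0, 5.0], [6.0, 2.0]]
cosmo_estimates = [-0.5, -1.0, -0.2, -0.8, -1.5]  # made-up numbers

hybrid = add_cosmo_feature(descriptors, cosmo_estimates)
folds = list(leave_one_solute_out(solutes))

for solute, train_idx, test_idx in folds:
    print(f"held out {solute}: train={train_idx} test={test_idx}")
```

Grouping by solute is a stricter test than a random 10-fold split, because a random split can place the same solute (in different solvents) in both the training and test sets.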
Read the Paper
A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents (Wang et al., Digital Discovery, 2026). DOI: 10.1039/D5DD00456J
Link: https://pubs.rsc.org/en/Content/ArticleLanding/2026/DD/D5DD00456J



