ACD/Solubility DB
Prediction Accuracy
Binning Solubility Data to Determine Prediction Accuracy (v11)
Solubility is a function of many factors, including hydrophobicity (logP and/or logD), ionization (pKa), and the crystalline form of the solid. This complexity makes aqueous solubility particularly difficult to predict. For that reason, predicted values are often used to sort compounds into three bins: soluble, partially soluble, and insoluble. Third-party published experimental data [1] gave us the opportunity to carry out this type of binning evaluation of ACD/Solubility DB (version 11.0). The literature provided solubility data for 1125 unique compounds. The following solubility (logS) ranges were used to bin the calculated data.
Table 1: LogS values used to bin compound solubility
Although solubility ranges vary between institutions (and, within the pharmaceutical industry, between therapeutic areas), these ranges are a reasonable starting point for the evaluation [2].
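To make the binning step concrete, here is a minimal sketch in Python. The cut-off values are assumptions chosen only for illustration (they echo the boundaries of roughly -3 and -5 log units mentioned in the Results); they are not a reproduction of Table 1, and the function name and the choice of which side of each boundary is inclusive are hypothetical.

```python
# Assumed cut-offs for illustration only; the ranges in Table 1 may differ.
SOLUBLE_CUTOFF = -3.0    # logS at or above this value -> soluble ("S")
INSOLUBLE_CUTOFF = -5.0  # logS below this value -> insoluble ("I")


def bin_log_s(log_s: float) -> str:
    """Assign a logS value to one of three bins: "S", "P", or "I"."""
    if log_s >= SOLUBLE_CUTOFF:
        return "S"
    if log_s < INSOLUBLE_CUTOFF:
        return "I"
    return "P"  # everything between the two cut-offs is partially soluble


# Hypothetical predicted logS values for three made-up compounds.
predicted = {"cmpd_a": -1.2, "cmpd_b": -4.1, "cmpd_c": -6.7}
bins = {name: bin_log_s(value) for name, value in predicted.items()}
print(bins)  # {'cmpd_a': 'S', 'cmpd_b': 'P', 'cmpd_c': 'I'}
```

The same function would be applied to both the predicted and the experimental values, so that each compound ends up with a predicted bin and a measured bin to compare.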
Results
Table 2: Confusion matrix showing the results of binning predicted solubility using ACD/Solubility DB (version 11.0) compared with experimentally measured solubility
Un-weighted Kappa Coefficient = 0.78
Agreement between prediction and experimental measurement appears on the diagonal of the confusion matrix (as highlighted). Of the 1125 compounds in the dataset, 993 (88%) were binned correctly according to values predicted using ACD/Solubility Batch (version 11).
For the 131 compounds that fell into the wrong categories, the vast majority of disagreements were between adjacent bins: soluble versus partially soluble (S/P) and partially soluble versus insoluble (P/I). Only four compounds showed a larger discrepancy: three compounds measured as insoluble were binned as soluble by prediction, and one compound measured as soluble was predicted to be insoluble. These discrepancies also occurred near the bin boundaries (around -3 and -5 log units) and account for just 0.4% of the dataset.
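The two summary statistics quoted above, the fraction of compounds on the diagonal and the unweighted kappa coefficient, can be computed from any confusion matrix. The sketch below uses the standard definition of Cohen's unweighted kappa; the matrix in it is made up purely for illustration and is not the data from Table 2, and the function name is hypothetical.

```python
def accuracy_and_kappa(matrix):
    """Return (observed agreement, unweighted Cohen's kappa).

    matrix[i][j] is the number of compounds with experimental bin i
    and predicted bin j; the bin order must be the same on both axes.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of compounds on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the marginal totals for each bin.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Illustrative counts only (rows: experimental S, P, I; columns: predicted S, P, I).
illustrative_matrix = [
    [50, 5, 0],
    [4, 30, 6],
    [1, 3, 21],
]
accuracy, kappa = accuracy_and_kappa(illustrative_matrix)
print(f"accuracy = {accuracy:.1%}, kappa = {kappa:.2f}")
# accuracy = 84.2%, kappa = 0.75
```

Kappa is lower than the raw agreement because it discounts the matches that would be expected by chance given how many compounds fall in each bin.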
Conclusions
Although predictions may not be 100% accurate for all compounds, they offer good insight into relative solubility. As one potential laboratory application, these results suggest that predicted solubility could be used when planning experiments, to reduce the workload on scientists charged with physical measurements. Because compounds in the Soluble and Insoluble bins are predicted well, physical measurement could be limited to the 368 compounds binned as 'Partially Soluble' on the basis of predicted solubility. Time and resources set aside for measuring the remaining 757 compounds could be allocated elsewhere. Spot checks of a few compounds binned as Soluble and Insoluble would help to confirm prediction accuracy.
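The triage described above could be sketched as follows. This is a hypothetical illustration, not part of any ACD/Labs product: the function name, the compound identifiers, and the number of spot checks are all assumptions, and the predicted bins are taken as already assigned.

```python
import random


def plan_measurements(predicted_bins, spot_checks=2, seed=0):
    """Split compounds into those to measure and those to spot-check.

    predicted_bins maps a compound name to its predicted bin
    ("S", "P", or "I"). Compounds predicted "P" are all measured;
    a small random sample of the rest is kept as a spot check.
    """
    to_measure = sorted(n for n, b in predicted_bins.items() if b == "P")
    confident = sorted(n for n, b in predicted_bins.items() if b != "P")
    rng = random.Random(seed)  # fixed seed so the plan is repeatable
    count = min(spot_checks, len(confident))
    spot_check = sorted(rng.sample(confident, count))
    return to_measure, spot_check


# Made-up predicted bins for six hypothetical compounds.
predicted_bins = {
    "c1": "S", "c2": "P", "c3": "I",
    "c4": "P", "c5": "S", "c6": "I",
}
to_measure, spot_check = plan_measurements(predicted_bins)
print("measure:", to_measure)
print("spot-check:", spot_check)
```

Applied to the dataset in this study, the first list would hold the 368 compounds predicted to be partially soluble, and the spot checks would be drawn from the other 757.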
References
1. J. S. Delaney, J. Chem. Inf. Comput. Sci., 2004, 44, 1000-1005.
2. A. Cheng and K. M. Merz, Jr., J. Med. Chem., 2003, 46, 3572-3580.
This comparison of data is by no means a comprehensive study of solubility and prediction accuracy; rather, it is an internal benchmark that ACD/Labs uses to evaluate the evolution of the product and the improvement of the algorithm. Our clients' accuracy needs vary with their applications, and they are always encouraged to test their own data when contemplating deployment options.