Solubility Temperature Dependence Predicted from 2D Structure

The objective of the study was to find a computational procedure to normalize solubility data determined at various temperatures (e.g., 10 – 50 °C) to values at a “reference” temperature (e.g., 25 °C). A simple procedure was devised to predict enthalpies of solution, ΔHsol, from which the temperature dependence of intrinsic (uncharged form) solubility, log S0, could be calculated. As dependent variables, values of ΔHsol at 25 °C were subjected to multiple linear regression (MLR) analysis, using melting points (mp) and Abraham solvation descriptors. Also, the enthalpy data were subjected to random forest regression (RFR) and recursive partition tree (RPT) analyses. A total of 626 molecules were examined, drawing on 2040 published solubility values measured at various temperatures, along with 77 direct calorimetric measurements. The three different prediction methods (RFR, RPT, MLR) all indicated that the estimated standard deviations in the enthalpy data are 11 – 15 kJ mol -1 , which is concordant with the 10 kJ mol -1 propagation error estimated from solubility measurements (assuming 0.05 log S errors), and consistent with the 7 kJ mol -1 average reproducibility in enthalpy values from interlaboratory replicates. According to the MLR model, higher values of mp, H-bond acidity, polarizability/dipolarity, and dispersion forces relate to more positive (endothermic) enthalpy values. However, molecules that are large and have high H-bond basicity are likely to possess negative (exothermic) enthalpies of solution. With log S0 values normalized to 25 °C, it was shown that the interlaboratory average standard deviations in solubility measurement are reduced to 0.06 – 0.17 log unit, with higher errors for the least-soluble druglike molecules. Such improvements in data mining are expected to contribute to more reliable in silico prediction models of solubility for use in drug discovery.


Introduction
This study concerns the prediction of the temperature dependence of intrinsic solubility (S 0 ) of molecules [1,2]. A recent survey of 4557 equilibrium solubility measurements ("4557" set) of druglike (mostly ionizable) molecules in aqueous media suggested a two-step procedure to reduce the interlaboratory variance in solubility data mining [3]. First, data from ionizable molecules can be adjusted for pH effects by calculating the intrinsic solubility, S 0 , based on the ionization constant, pK a , and the reported water solubility, S w . Second, data can be normalized for temperature effects, by transforming solubility measurements performed at various temperatures (e.g., 10 – 50 °C) to those at a benchmark value of 25 °C. With these two steps, it was thought to be possible to reduce the often cited interlaboratory reproducibility of 0.5 – 0.7 log unit (or even higher) [4-7] down to near 0.15 [3]. Such improvements in data mining are expected to yield more reliable in silico prediction models of solubility for use in drug discovery. However, the procedure of "normalizing" temperature effects was not articulated [3]. Qualitatively, it is generally expected that equilibrium solubility of most druglike molecules increases with increasing temperature (with exceptions). As far as we know, there has not been a report of a quantitative procedure where the temperature dependence of solubility is predicted from the two-dimensional (2D) structure of molecules.
Reported aqueous solubility values are mainly clustered around room and physiological temperatures: 23 ± 3 (78 %) and 37 ± 5 °C (22 %), respectively [3]. For mostly simple molecules, there are some secondary compilations of temperature-dependent solubility (e.g., Handbook of Aqueous Solubility Data by Yalkowsky et al. [8]). However, for druglike molecules, large published databases of temperature-dependent solubility are scarce. Although our focus is to improve the data mining quality of solubility measurements, knowledge of the temperature dependence of solubility has a number of other practical applications.
Being able to estimate the solubility temperature dependence could allow for improved planning in early formulation studies. For example, Ismailos et al. [9] described the unusual solubility behavior of low-soluble cyclosporin A in aqueous media, where the enthalpy of solution, ΔH sol , was determined to be about -53 kJ mol -1 , indicating an exothermic process: as temperature is raised, the solubility decreases. In contrast, most drug molecules are characterized by endothermic ΔH sol , with typical values ranging from +20 to +50 kJ mol -1 , often with uncertainty of nearly 10 kJ mol -1 . The oral dosage form of cyclosporin A consisted of making a dilution with milk immediately before oral administration. However, the observed bioavailability was incomplete and erratic. To increase absorption, it was suggested [9] that a refrigerated sample of milk might be used, to keep more of the drug in solution (~90 μg mL -1 at 4 °C), compared to the amount dissolved in a room-temperature mixture (~20 μg mL -1 at 25 °C).
In the present study we have devised a simple procedure (similar to that used to predict the temperature dependence of pK a values [10]) to predict the temperature dependence of log S 0 from the van't Hoff relationship. As dependent variables, enthalpies of solution at 25 °C, ΔH sol , were subjected to multiple linear regression (MLR) analysis, using melting point (mp) and the Abraham [11,12] six ("5+1": five traditional + product of H-bond acidity and basicity) solvation descriptors. Also, the data were analyzed by the random forest regression (RFR) and recursive partition tree (RPT) methods. A total of 626 molecules were examined, drawing on 2040 published solubility values measured at various temperatures.

Theory
The integration of the van't Hoff differential equation over a small temperature range, between T 1 and T 2 , assuming the change in standard enthalpy of solution, ΔH sol 0 (kJ mol -1 ), to be approximately independent of temperature, produces:

$$\log S_{T_2} - \log S_{T_1} = -\frac{\Delta H_{sol}^{0}}{2.303\,R}\left(\frac{1}{T_2} - \frac{1}{T_1}\right) \qquad (1)$$

where S is the solubility, T is the absolute temperature (K), and the gas constant is R = 8.314x10 -3 kJ mol -1 K -1 . Equation (1) allows one to convert log S determined at T 1 in a narrow temperature range (e.g., ≈ 25-37 °C: T 1 ≈ 298.15 to 310.15 K) to a value at a reference temperature (e.g., 25 °C: T 2 = 298.15 K), provided the standard enthalpy of solution at 25 °C, ΔH sol 0 , is known.
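As an illustration of how Equation (1) is applied, the following minimal Python sketch shifts a log S value from one temperature to another, assuming a temperature-independent enthalpy of solution. The function name `log_s_at_t2` and the example compound (log S = -3.00 at 37 °C, ΔH sol = +30 kJ mol -1 ) are hypothetical and are not taken from the study.

```python
# Gas constant in kJ mol^-1 K^-1, as given in the text.
R_KJ = 8.314e-3


def log_s_at_t2(log_s_t1, t1_celsius, t2_celsius, dh_sol):
    """Shift log S from T1 to T2 with the integrated van't Hoff relation.

    Assumes dh_sol (kJ/mol) is constant between the two temperatures:
    log S(T2) = log S(T1) - dh_sol / (2.303 R) * (1/T2 - 1/T1)
    """
    t1 = t1_celsius + 273.15  # convert to kelvin
    t2 = t2_celsius + 273.15
    return log_s_t1 - dh_sol / (2.303 * R_KJ) * (1.0 / t2 - 1.0 / t1)


# Hypothetical compound: log S = -3.00 at 37 C, endothermic dH_sol = +30 kJ/mol.
log_s_25 = log_s_at_t2(-3.00, 37.0, 25.0, 30.0)
print(f"log S at 25 C: {log_s_25:.2f}")
```

For an endothermic compound, the value at 25 °C comes out lower than the value measured at 37 °C, as expected.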
The enthalpy value can be determined calorimetrically or from the temperature dependence of log S. That is, if one plots log S vs. 1/T for a series of measurements at different T, the slope of the plotted curve evaluated at 25 °C is equal to -ΔH sol 0 /(2.303 R). Often such plots show some curvature, since ΔH sol 0 is slightly temperature dependent, but as a first approximation one may assume linearity if the temperature interval is small (e.g., ≈ 25-37 °C).
Enthalpy values determined by the slope method are subject to high uncertainty, depending on the precision of the solubility measurements and the temperature spanned. As an example, assume two solubility measurements were performed at 25 and 37 °C, and that the reproducibility in each solubility measurement is 0.05 log unit. On rearranging Equation (1) and casting it in the propagation-of-error form (SD = standard deviation),

$$SD(\Delta H_{sol}^{0}) = 2.303\,R\,\frac{T_1 T_2}{T_2 - T_1}\sqrt{SD^2(\log S_{T_1}) + SD^2(\log S_{T_2})} \approx 10\ \mathrm{kJ\ mol^{-1}} \qquad (2)$$

One of the objectives of this study is to transform intrinsic solubility, log S 0 T , measured at T, to log S 0 ref , the solubility at "reference" temperature 298.15 K (25 °C). Rearrangement of Equation (1) produces:

$$\log S_0^{ref} = \log S_0^{T} + \frac{\Delta H_{sol}^{0}}{2.303\,R}\left(\frac{1}{T} - \frac{1}{298.15}\right) \qquad (3)$$

The thermodynamic standard state definition is traditionally based on concentrations being extrapolated to the zero (unit activity) value. However, most of the measured solubility values used in Equation (1) were determined in saturated water solutions, with many molecules undergoing some degree of ionization. It is assumed that by working with intrinsic solubility values (referring to solubility of molecules in the uncharged form), some of the consequences of non-unit activity considerations in the definition of the standard state ΔH sol 0 may be mitigated, especially for sparingly soluble molecules. To highlight that intrinsic solubility is not determined at the standard (zero-concentration, unit activity) state, ΔH sol (without the superscript) will be associated with solubility measurement in saturated solutions.
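The two-point uncertainty example above (measurements at 25 and 37 °C, each with 0.05 log unit reproducibility) can be checked numerically. The sketch below assumes independent errors in the two log S values and standard first-order error propagation; the function name `sd_enthalpy_two_point` is hypothetical.

```python
import math

# Gas constant in kJ mol^-1 K^-1, as given in the text.
R_KJ = 8.314e-3


def sd_enthalpy_two_point(t1_celsius, t2_celsius, sd_log_s1, sd_log_s2):
    """Propagated SD (kJ/mol) of an enthalpy derived from two log S values.

    From Equation (1): dH = 2.303 R T1 T2 (log S2 - log S1) / (T2 - T1),
    so independent errors in log S1 and log S2 add in quadrature.
    """
    t1 = t1_celsius + 273.15
    t2 = t2_celsius + 273.15
    factor = 2.303 * R_KJ * t1 * t2 / abs(t2 - t1)
    return factor * math.sqrt(sd_log_s1 ** 2 + sd_log_s2 ** 2)


sd_dh = sd_enthalpy_two_point(25.0, 37.0, 0.05, 0.05)
print(f"SD(dH_sol) = {sd_dh:.1f} kJ/mol")
```

The result is close to the 10 kJ mol -1 propagation error quoted in the text. Narrowing the temperature interval or increasing the log S error inflates the uncertainty proportionally.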

Methods
A total of 626 enthalpies of solution were gathered, either directly from calorimetric data (12 %) or indirectly from temperature-dependent solubility data (88 %).

Van't Hoff Analysis of Temperature-Dependent Solubility Data
When available, the enthalpies of solution were taken directly from primary publications. Otherwise, the temperature-dependent intrinsic solubility values for each molecule were fitted in this study to the linear equation:

$$\log S_0 = a + b\,\frac{1}{T} \qquad (4)$$

The value of ΔH sol was set equal to the slope, b, multiplied by -2.303 R (cf., Equation 1). The solubility values in the units of the source publications (molality or molarity) were used in Equation (4). (Mole fraction units were converted to molality prior to fitting.)
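The slope-based fit described above can be sketched as an ordinary least-squares regression of log S 0 on 1/T. The code below is only an illustration with synthetic data generated from an assumed enthalpy of +25 kJ mol -1 ; the function name `vant_hoff_fit` and the data are hypothetical, and the study's own fitting software is not specified here.

```python
# Gas constant in kJ mol^-1 K^-1, as given in the text.
R_KJ = 8.314e-3


def vant_hoff_fit(temps_celsius, log_s):
    """Least-squares fit of log S = a + b * (1/T).

    Returns (a, b, dh_sol), with dh_sol = -2.303 * R * b in kJ/mol.
    """
    if len(temps_celsius) != len(log_s) or len(log_s) < 2:
        raise ValueError("need at least two (temperature, log S) pairs")
    x = [1.0 / (t + 273.15) for t in temps_celsius]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(log_s) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, log_s))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b, -2.303 * R_KJ * b


# Synthetic, noise-free data generated from an assumed dH_sol of +25 kJ/mol
# and log S0 = -2.00 at 25 C (hypothetical values for illustration only).
temps = [20.0, 25.0, 30.0, 37.0]
true_dh = 25.0
log_s_values = [
    -2.00 - true_dh / (2.303 * R_KJ) * (1.0 / (t + 273.15) - 1.0 / 298.15)
    for t in temps
]
a, b, dh_fit = vant_hoff_fit(temps, log_s_values)
print(f"fitted dH_sol = {dh_fit:.2f} kJ/mol")
```

With real data the fitted enthalpy carries the uncertainty discussed in the Theory section, so two-point fits in particular should be treated with caution.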

Calorimetric Data Sources
Calorimetrically measured enthalpies of solution are considered to be more accurate than those derived from solubility as a function of temperature. A search located 77 molecules with reported calorimetric enthalpies of solution, all from primary literature. Table A1 in the Appendix indicates which enthalpies were calorimetrically determined.

Solubility Data Sources
The Handbook of Aqueous Solubility Data by Yalkowsky et al. [8] is a reliable and convenient secondary source of temperature-dependent S w values of relatively simple molecules. The data were measured by the saturation shake-flask method [3], and are expected to have good precision. A search led to 653 suitable S w values for 195 molecules, with 2-6 solubility-temperature points per molecule, in the interval 20 ± 5 to 41 ± 7 °C. The handbook values mainly ranged from about -0.1 to -3.5 log molarity units (mean -1.8).
An additional 1387 measurements as a function of temperature were gathered from primary publications for 354 molecules, with measurements based on the shake-flask (traditional and various miniaturized variants) and two potentiometric (DTT and CheqSol) methods, as described elsewhere [3]. This set included more druglike molecules than that from the handbook, and ranged in log S 0 from +2.7 to -8.4 (mean -3.0). Table A1 indicates which enthalpies were calculated from solubility data.

Conversion of Water Solubility (S w ) to Intrinsic Solubility (S 0 )
The pDISOL-X program (www.in-adme.com/pdisol_x.html, in-ADME Research) was used to convert nominally unbuffered water solubility (S w ) values with unspecified pH to intrinsic values (S 0 ) and to calculate the saturation pH (pH sat ), as described by Völgyi et al. [13] and others [14-16]. Since no instances of ambient CO 2 concentration were reported in solubility measurement studies, for practically-insoluble bases with pK a > 8 (e.g., terfenadine, clofazimine), [CO 2 ] = 10 μM was assumed in the calculation of S 0 . Sparingly-soluble acids were less affected by ambient levels of CO 2 . Also, in the calculations, it was necessary to assume that the Henderson-Hasselbalch relationship was valid [3], except when multiple-pH buffer solubility data, S pH (log S vs. pH), were available to derive the corresponding S 0 values.
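To make the Henderson-Hasselbalch assumption concrete, the sketch below converts a solubility measured at a known pH to the intrinsic value for a monoprotic acid or base. This is a deliberately simplified illustration, not the pDISOL-X calculation: it assumes the pH is known, ignores ambient CO 2 , activity corrections, and the calculation of pH sat , and the function name `log_s0_from_log_s` and the example values are hypothetical.

```python
import math


def log_s0_from_log_s(log_s, ph, pka, kind):
    """Intrinsic log S0 from total log S at a known pH (monoprotic case).

    Henderson-Hasselbalch form:
      acid: S = S0 * (1 + 10**(pH - pKa))
      base: S = S0 * (1 + 10**(pKa - pH))
    """
    if kind == "acid":
        return log_s - math.log10(1.0 + 10.0 ** (ph - pka))
    if kind == "base":
        return log_s - math.log10(1.0 + 10.0 ** (pka - ph))
    raise ValueError("kind must be 'acid' or 'base'")


# Hypothetical acid: pKa 4.0, total log S = -2.00 measured at pH 5.0.
log_s0_acid = log_s0_from_log_s(-2.00, ph=5.0, pka=4.0, kind="acid")
# Hypothetical base: pKa 9.0, total log S = -2.00 measured at pH 7.0.
log_s0_base = log_s0_from_log_s(-2.00, ph=7.0, pka=9.0, kind="base")
print(f"acid log S0 = {log_s0_acid:.2f}, base log S0 = {log_s0_base:.2f}")
```

In both examples the intrinsic solubility is lower than the measured total solubility, because part of the dissolved compound is in the charged form at the stated pH.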

Data Types and Exclusions
It was suspected that enthalpy of solution might depend on the acid-base properties of the molecule, as was found in the study of the temperature dependence of pK a values [10]. The enthalpy of solution might be influenced by differences between the solvation effects in the proton releasing and gaining processes, or by the electron density of the different functional groups. Consequently, the selected compounds were assigned indicator indices: I A , I B , I AB , and I N , with a unit value indicating that a molecule is an acid, base, ampholyte, or neutral, respectively, and zero otherwise.
Most of the enthalpy of solution values here were derived from solubility data measured by a multitude of methods and conditions [3] and, as mentioned above, such enthalpy values are expected to be sensitive to errors in the log S vs. 1/T data. Intralaboratory enthalpy variances are therefore expected to be smaller than interlaboratory variances. About 6 % of the log S measurements were done at two temperatures, each result reported from a different laboratory ("n=2, different labs" set), potentially representing the least-reliable slope-calculated enthalpy values. By contrast, in 73 % of the studies, all of the temperature-dependent log S measurements for a given molecule come from the same laboratory (e.g., the curated "653-set" from the Yalkowsky et al. handbook [8]). The latter "one source" temperature-solubility data are expected to lead to the most-reliable calculated enthalpy values.
For each molecule, the log S values were separated into two groups: those determined at t < 30 °C ("room temperature set", RT) and those at t ≥ 30 °C ("physiological temperature set", PT). For each grouping, the standard deviation (SD RT , SD PT ) values were calculated for each molecule based on replicate measurements. The average of all SD values was 0.18 log unit for the dataset. A given molecule with SD greatly exceeding the average value would be excluded from the training and test sets.
Since we are only using 2D molecular descriptors, different solvates of a given compound were lumped together with nonsolvates for the compound (when this information was available).
Of the 626 enthalpies of solution located/calculated, some values were suspected to be unreliable. We applied three filters to systematically exclude those points from training and test sets from the start, but their predicted ΔH sol values were calculated nevertheless (Table A1).
Three systematic exclusion criteria were:
• The "n=2, different labs" data were excluded from model training and testing.
• If, for a given molecule, SD RT or SD PT > 0.3, the data were excluded from the model training and testing.
• Solubility-derived enthalpies were excluded if slope-calculated values of ΔH sol < -80 or > +80 kJ mol -1 , based on the observation that in the best-quality (calorimetric) data, enthalpy data ranged from -37 to +50 kJ mol -1 . We put an arbitrary cut-off of ±80 kJ mol -1 in the expectation that erratic values calculated from solubility-temperature data would be less likely to enter the training set.
On applying the above criteria, 55 compounds were excluded. Thus 571 molecules were selected for the training and test sets. In the 571-set, 43 % were acids, 13 % were bases, 27 % were ampholytes, and 17 % were nonionizable molecules. The above ionization type distribution is not ideally suggestive of druglike molecules, which include more bases than acids. For the 77 calorimetrically measured ΔH sol , the average value is +13 kJ mol -1 , with values ranging from -37 to +50 kJ mol -1 . By comparison, the log S vs. T derived ΔH sol have the average value +21 kJ mol -1 , and the much wider range from -97 to +120 kJ mol -1 (before exclusions).
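The three exclusion criteria can be expressed as a simple filter. The following sketch is one possible encoding, under the assumption that each enthalpy record carries the number of temperature points, the number of contributing laboratories, and the replicate SD values; the `EnthalpyRecord` fields, the function name `exclusion_reason`, and the example records are hypothetical and do not reproduce the layout of Table A1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnthalpyRecord:
    name: str
    dh_sol: float            # kJ/mol
    calorimetric: bool       # True if measured by calorimetry
    n_points: int            # number of solubility-temperature points
    n_labs: int              # number of contributing laboratories
    sd_rt: Optional[float]   # replicate SD of log S below 30 C, if available
    sd_pt: Optional[float]   # replicate SD of log S at or above 30 C, if available


def exclusion_reason(rec, sd_limit=0.3, dh_limit=80.0):
    """Return the first exclusion criterion met, or None if the record is kept."""
    # Criterion 1: two temperatures, each reported by a different laboratory.
    if not rec.calorimetric and rec.n_points == 2 and rec.n_labs == 2:
        return "n=2, different labs"
    # Criterion 2: replicate scatter above the SD limit in either group.
    for sd in (rec.sd_rt, rec.sd_pt):
        if sd is not None and sd > sd_limit:
            return "replicate SD above limit"
    # Criterion 3: slope-derived enthalpy outside the +/- 80 kJ/mol window.
    if not rec.calorimetric and abs(rec.dh_sol) > dh_limit:
        return "slope-derived enthalpy outside cut-off"
    return None


# Hypothetical records for illustration only.
records = [
    EnthalpyRecord("kept", 25.0, False, 4, 1, 0.10, 0.12),
    EnthalpyRecord("two_labs", 30.0, False, 2, 2, None, None),
    EnthalpyRecord("noisy", 20.0, False, 5, 3, 0.35, 0.10),
    EnthalpyRecord("extreme", -97.0, False, 3, 1, 0.05, 0.05),
]
training = [r for r in records if exclusion_reason(r) is None]
print([r.name for r in training])
```

Excluded records can still be passed to the fitted model afterwards, which mirrors the text's statement that predicted values were calculated for the excluded molecules as well.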

Variances of Replicate ΔH sol
In the 626 enthalpy set, there were 99 instances of replicate ΔH sol values, measured in different laboratories. The average standard deviation in the replicates was 7 kJ mol -1 , in line with expectations that the quality of the data was limited by the expected uncertainties in log S measurement (error ≥ 0.05 log unit), as considered in the Theory section.

Prediction of Enthalpy of Solution
The random forest regression (RFR) method is a powerful and easy-to-use new statistical (albeit somewhat of a "black box") tool which can cope with very large numbers of correlated descriptors, something that is not possible with traditional multiple linear regression (MLR) methods. It was thus a good starting point for our enthalpy data analysis. Using RFR could reveal sensitive descriptors that might not have been otherwise obvious. The recursive partition tree analysis provides a simple alternative view of the characteristics of the data in relation to the descriptors. However, the important advantage of the MLR analysis is that the results are thought to be easier to understand, in terms of the role of specific descriptors in the prediction of physicochemical properties, such as the temperature dependence of solubility.
The RFR modeling (Walters [17], a very useful tutorial) was first explored, starting with the 193 descriptors ("RDK" set, including lipophilicity as indicated by log P and log D, connectivity and molecular shape indices, topological and electrotopological state indices, surface area contributions, partial atomic charges, H-bond donor/acceptor counts, molecular refractivity, and other more specialized descriptors) calculated by the open-source chemoinformatics and machine-learning RDKit library of programs (Landrum et al. [18]; http://rdkit.readthedocs.org/en/latest/), combined with the six Abraham solvation descriptors [11,12], along with the acid-base indicators (I A , I B , I AB , and I N ), and the published melting points (mp). Where the latter values were not found, the Lang and Bradley [19] predicted melting points in the QsarDB open repository of data and prediction tools (http://qsardb.org/repository/handle/10967/104) [20] were used.
The six Abraham (ABSOLV) descriptors (Σα 2 H , Σβ 2 H , and Σα 2 H •Σβ 2 H : H-bond acidity and basicity, and the product of the two; π 2 : dipolarity/polarizability; R 2 : dispersion force; and V x : molar volume; see Glossary of Terms for further elaboration) were estimated from the 2D structure of molecules using the ADME Boxes v4.9 program from ACD/Labs (Advanced Chemistry Development, Inc., www.ACDLabs.com). Perhaps better fits might be expected from descriptors derived directly from measurements rather than from in silico predictions. However, experimentally-based ABSOLV values may not be available for all the compounds considered here.
After the RFR modeling, recursive partition tree analysis and multiple linear regression (MLR) calculations were explored, using the Algorithm Builder v.1.8 program from ACD/Labs. SMILES representations of the 2D structures of molecules were available at the Royal Society of Chemistry ChemSpider website: http://www.chemspider.com/. ACD/ChemSketch from ACD/Labs was used to construct "SDF/mol" format 2D representations of the molecules.

Model Validations
The RFR method randomly selected 30 % of the data to be test sets. The RPT method was not validated with a test set, since it was used in a qualitative way. In the MLR method, the "leave-many-out" (LMO) cross-validation procedure was applied (20 % of the measurements randomly excluded in 100 different repeated combinations), with a cross-validated q 2 used to assess model predictivity.
The small calorimetric set (Case I) was best predicted. The root-mean-square (RMS) error was 11.7 kJ mol -1 ; 61 % of the variance was predicted. The 10 most sensitive descriptors are shown in Table 1 for this case. At the top of the list is mp. Four of the six ABSOLV descriptors were among those in the top list.
The combined calorimetric + solubility-derived set (II) included all the "571-set" data and the full set of descriptors. The RMS error was slightly higher at 13.9 kJ mol -1 ; 35 % of the variance was predicted, which suggests that the solubility-based data dragged down the overall precision of the combined data set. The mp descriptor was in the second position from the top, but there were no ABSOLV descriptors showing up in the top-10 list. The most sensitive RDK descriptors included fractions of sp 3 carbon, molecular refractivity - surface area [22], log P - surface area [22], various Hall-Kier electrotopological, connectivity and molecular shape indices, and log P [23].
Cases III and IV excluded either RDK or ABSOLV descriptors. The statistics were similar to those of the full complement of descriptors (II), with RMS errors rising slightly to 14.4 and 15.1 kJ mol -1 for the RDK (III) and the ABSOLV (IV) cases, respectively, and with accounted variances decreasing slightly. In Case IV (ABSOLV), the order of sensitivities comprised:

Recursive Partition Tree (RPT) Analysis
The "571-set" exclusion-filtered data (calorimetry + solubility-derived), with mp + ABSOLV + acid-base indicator indices (I A , I B , I AB , I N ) as descriptors, was next subjected to recursive partition tree analysis (Algorithm Builder v.1.8). Figure 1 shows the decision tree.
The first node automatically separated the acids from the non-acids. Within the acid set, mp ≥ 303 °C further split the group into two. There are 9 molecules in the higher mp set, with average ΔH sol = 47 ± 13 kJ mol -1 . The compounds with the lower mp included 235 molecules, with average ΔH sol = 27 ± 12 kJ mol -1 .
For the non-acids, Σβ 2 H ≥ 3.8 split the set into two groups. The group with strong H-bond acceptors included 8 molecules, with average ΔH sol = -24 ± 11 kJ mol -1 , associated with an exothermic process. The 319 molecules in the weaker H-bond basicity group were split into two subsets according to molar refractivity, R 2 ≥ 0.47. The 279 molecules with stronger dispersion force interaction (arising from pi- and n-electrons of the solute) had the average ΔH sol = 19 ± 17 kJ mol -1 , whereas the 40 "hard" molecules had the lower average 2 ± 15 kJ mol -1 .

Multiple Linear Regression (MLR) Analysis
Table 2 summarizes the multiple linear regression analysis of the enthalpy data, using the mp + ABSOLV + acid-base indicator indices (I A , I B , I AB , I N ) descriptors.
The calorimetric case (A) again was best-fit, with r 2 = 0.67 and s = 10.8 kJ mol -1 . The statistics were comparable to those found in the RFR analysis for the same set. Compared with the RFR method, the contributions of each of the ABSOLV descriptors may be easier to interpret. The four acid-base indicator indices in effect acted as four different intercepts: acids and ampholytes had positive additive contributions, whereas bases and neutrals had negative contributions, in addition to those trends predicted by the mp + ABSOLV descriptors. The H-bond basicity (Σβ 2 H ) led to negative (exothermic) enthalpy contributions, whereas π 2 (solute polarity/polarizability due to solute-solvent interactions between bond dipoles and induced dipoles) contributed to positive (endothermic) enthalpy values. The other ABSOLV descriptors made smaller contributions in the Case A model. Enthalpy of solution was predicted to increase with increasing melting points. Figure 2a shows the correlation plot for the calorimetric set. As can be seen, acids tend to have higher positive enthalpies compared to the other molecules.
The non-calorimetric case (B) was predicted less well, compared to the above calorimetric case, just as in the RFR analysis. There was less acid-base differentiation, as indicated by the similarity of the four I-indices. The other main difference between Cases A and B is that the dispersion-force R 2 played a more prominent role in Case B than the dipolarity/polarizability π 2 , with high values predicting more positive (endothermic) enthalpies. Also, compounds with larger McGowan molar volumes (V x ) were associated with more negative (exothermic) enthalpies.
The recursive partitioning tree analysis suggested that acids could be grouped separately from the non-acids. Cases C and D tested that idea. The acids had a muted contribution from ABSOLV descriptors (except for the H-bond acidity), compared to the non-acids, as is indicated in Table 2. For example, a model using just the I A , mp, Σα
The calculated enthalpy of solution values in Table A1 are based on Case D applied to acids and Case C applied to non-acids. Very similar results were obtained from the application of the combined Case E model (not shown). This Cases C+D method of calculation has been incorporated into the current release of pDISOL-X, where intrinsic solubility values, log S 0 , are now predicted both at 25 and 37 °C.

Interlaboratory Errors Temperature Effect
Figure 3 displays the ratios of standard deviations (SD) calculated from replicates from temperature uncompensated log S 0 values, SD(T), to those calculated from temperature normalized data, SD(25 °C), plotted against the normalized SD. If the interlaboratory errors in log S 0 measurement were attributable entirely to temperature effects, then the SD(T)/SD(25 °C) ratio would greatly exceed unity. On the other hand, if there were large systematic errors in log S 0 measurements not related to temperature, then the ratio would be near one.
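The SD(T)/SD(25 °C) ratio for a single molecule can be sketched as follows, assuming replicate log S 0 values with known measurement temperatures and a predicted enthalpy of solution. The replicate values and the +30 kJ mol -1 enthalpy are hypothetical, as are the function names `normalize_to_ref` and `sd_ratio`.

```python
import statistics

R_KJ = 8.314e-3   # gas constant, kJ mol^-1 K^-1 (value given in the text)
T_REF = 298.15    # reference temperature, K (25 C)


def normalize_to_ref(log_s0, t_celsius, dh_sol):
    """Transform log S0 measured at t_celsius to 25 C (van't Hoff, constant dH)."""
    t = t_celsius + 273.15
    return log_s0 + dh_sol / (2.303 * R_KJ) * (1.0 / t - 1.0 / T_REF)


def sd_ratio(replicates, dh_sol):
    """Return (SD(T), SD(25 C), ratio) for a list of (log_s0, t_celsius) pairs."""
    raw = [log_s0 for log_s0, _ in replicates]
    normalized = [normalize_to_ref(log_s0, t, dh_sol) for log_s0, t in replicates]
    sd_t = statistics.stdev(raw)
    sd_ref = statistics.stdev(normalized)
    return sd_t, sd_ref, sd_t / sd_ref


# Hypothetical interlaboratory replicates for one molecule: (log S0, temperature in C).
replicates = [(-3.07, 20.0), (-3.02, 25.0), (-2.80, 37.0)]
sd_t, sd_ref, ratio = sd_ratio(replicates, dh_sol=30.0)
print(f"SD(T) = {sd_t:.2f}, SD(25 C) = {sd_ref:.2f}, ratio = {ratio:.1f}")
```

In this constructed example most of the scatter is due to temperature, so the ratio is well above one. If the replicates differed for reasons unrelated to temperature, normalization would leave the SD largely unchanged and the ratio would be close to one.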
When the Case C and D models were applied to the "4557-set" data [3], 94 % of the calculated log S 0 ref - log S 0 T differences were between +0.2 and -0.2. If the non-temperature related errors are substantially greater than 0.2, then temperature normalization barely lowers interlaboratory errors. Figure 3 illustrates the impact of normalizing intrinsic solubility data to the 25 °C benchmark temperature. Figure 3a represents high-quality solubility measurements from the Yalkowsky et al. [8] handbook, albeit of relatively simple molecules. The interlaboratory errors are reduced 7.8-fold on the average, to an average value SD(25 °C) = 0.06 log unit.
Figure 3b depicts a larger set of molecules, with a higher proportion of druglike molecules, which are less soluble than those in the case of Figure 3a. The interlaboratory errors are reduced 2.2-fold on the average, to an average value of SD(25 °C) = 0.17 log unit. It is evident that interlaboratory errors not related to ionization and temperature effects are higher in the druglike set of molecules.

Remaining Challenges in the Prediction of Enthalpy of Solution
Table A1 also lists the calculated enthalpy values for the 55 molecules excluded from direct model construction. It is not a "validation" set, since many of the measurements were suspect, following the exclusion criteria. As it turns out, the calculated enthalpies of many of the excluded molecules are at comparable levels of agreement to the experimental enthalpy values. However, some compounds did show high variance. Mostly, these had reported negative enthalpies of solution, but the model calculated positive values (e.g., celecoxib, clotrimazole, droperidol, indinavir, ezetimibe, famotidine, fenofibrate, glibenclamide, loratadine, meloxicam, quetiapine, and saquinavir). Some excluded molecules had very high positive reported enthalpies, which the model only partly matched (triflupromazine, chlorpromazine, triflupromazine, etoxadrol, and rosiglitazone). It is suspected that data precision may be an underlying cause of the poor predictions. It is possible that some molecules may have been measured over a temperature range encompassing the Krafft or clouding points, which would have contributed to a highly nonlinear van't Hoff plot. Complications in solution chemistry, such as the formation of low-order aggregates, micelles, and complexes (with solution constituents) may contribute to increased errors [13-16]. For example, Pobudkowska et al. [25] found that aggregate-forming chlorpromazine, triflupromazine, and trifluoperazine (S 0 0.5, 1.1, and 2.7 μg mL -1 , resp.) showed exceptionally high solubility temperature dependence, compared to more soluble (> 10 μg mL -1 ) phenothiazine derivatives. It would be helpful to measure the enthalpies of such molecules by direct calorimetry, to better understand the nature of the interlaboratory errors.

Calculated Negative Enthalpies of Solution
As listed in Table A1, there were 11 calculated negative values based on the Case C+D models, compared to 51 (non-excluded) negative measured values, with about half of these being about a standard deviation from zero. The average Abraham descriptors for the calculated negative enthalpy set were 0.9 (Σα 2 H ), 3.8 (Σβ 2 H ), 3.6 (π 2 ), 2.5 (R 2 ), and 5.3 (V x ). Given the MLR coefficients, the negative contributions were equally from H-bond basicity and the McGowan volume (-65 kJ mol -1 contribution), but these were somewhat offset by positive contributions from dipolarity and molar refractivity (+40 kJ mol -1 contribution), resulting in a net negative calculated average enthalpy (-17 kJ mol -1 ). The RPT decision tree indicates that eight non-acids with H-bond basicity exceeding 3.8 had an average enthalpy of -24 ± 11 kJ mol -1 (node N01, Figure 1). There were 40 additional compounds with near zero enthalpy (node N000 in Figure 1).

MLR Model Validation
The developed MLR model was validated by the leave-many-out (LMO) method, using the Algorithm Builder V1.8 program. In this approach, 20 % of the dependent variables were randomly removed and the MLR was repeated 100 times, which produced q 2 = 0.27, with a q 2 standard deviation of 0.07. These values are only slightly less than the value of r 2 (0.31) determined by normal MLR analysis.
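The leave-many-out procedure can be illustrated with a small, self-contained sketch. It is not the Algorithm Builder implementation: the exact q 2 definition used by that program is not stated in the text, so the sketch assumes q 2 = 1 - PRESS/SS, with SS taken about the training-set mean, and it fits the MLR by solving the normal equations. The function names and the synthetic two-descriptor data are hypothetical.

```python
import random


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))


def lmo_q2(X, y, frac_out=0.2, repeats=100, seed=0):
    """Mean and SD of q2 over repeated random leave-many-out splits."""
    rng = random.Random(seed)
    n = len(y)
    k = max(1, round(frac_out * n))
    q2_values = []
    for _ in range(repeats):
        held_out = set(rng.sample(range(n), k))
        train = [i for i in range(n) if i not in held_out]
        coef = fit_ols([X[i] for i in train], [y[i] for i in train])
        y_mean = sum(y[i] for i in train) / len(train)
        press = sum((y[i] - predict(coef, X[i])) ** 2 for i in held_out)
        ss = sum((y[i] - y_mean) ** 2 for i in held_out)
        q2_values.append(1.0 - press / ss)
    mean = sum(q2_values) / len(q2_values)
    sd = (sum((q - mean) ** 2 for q in q2_values) / (len(q2_values) - 1)) ** 0.5
    return mean, sd


# Synthetic data: two hypothetical descriptors with Gaussian noise.
noise = random.Random(1)
X = [[float(i), float((i * i) % 7)] for i in range(40)]
y = [2.0 + 3.0 * a - b + noise.gauss(0.0, 1.0) for a, b in X]
q2_mean, q2_sd = lmo_q2(X, y)
print(f"q2 = {q2_mean:.3f} +/- {q2_sd:.3f}")
```

A q 2 close to r 2 , as reported in the text, suggests that the model is not strongly overfitted, even when the absolute level of explained variance is modest.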

Conclusion
The three different prediction methods tested (RFR, RPT, MLR) all indicated that the errors in the enthalpy of solution data are 11-15 kJ mol -1 , which is similar to estimates calculated from propagation-of-errors in solubility measurements (~10 kJ mol -1 ) and with estimates based on interlaboratory replicates (~7 kJ mol -1 ). According to the combined MLR model (Cases C+D or E, Table 2), higher values of mp, H-bond acidity (Σα 2 H ), polarizability/dipolarity (π 2 ), and dispersion forces (R 2 ) lead to more positive enthalpy values. Furthermore, big molecules (large V x ) that have high H-bond basicity (Σβ 2 H ) are associated with more negative enthalpies of solution.
The main objective of this study was to demonstrate that the interlaboratory errors in solubility data mining can be lowered if solubility data are normalized for temperature effects. Before adjusting for ionization and temperature effects, the expected errors are commonly thought to be 0.5 – 0.7 (or greater) log unit [4-7]. After transforming S w to S 0 values (adjusting for ionization), it was estimated that temperature normalizations could lower the interlaboratory errors to near 0.15 log unit [3]. In the current study, we have shown this to be the case. Thus, for simple and moderately soluble molecules, the average interlaboratory errors are estimated to be about 0.06 log unit, lower than the 0.17 value estimated for sparingly-soluble druglike molecules. Consequently, the commonly cited interlaboratory solubility error of 0.5 – 0.7 log unit in data mining warrants revision. It is thus expected that improved data quality will spur improved in silico prediction of solubility of druglike molecules.

Glossary of Terms
S: solubility, ideally expressed in units of mol/L (M), μg/mL, or mg/mL
S 0 : "intrinsic" solubility (i.e., the solubility of the uncharged form of the compound)
S w : "water" solubility, defined by dissolving enough pure free acid/base (not drug salt) in distilled water (or water containing an inert salt as ionic strength adjustor) to form a saturated solution. The final pH of the suspension, pH sat , and S 0 can be calculated by the Henderson-Hasselbalch equation (when valid), provided the true pK a is known.
S pH : "pH buffer" solubility (i.e., the total solubility of the compound at a well-defined measured pH)
ΔH sol 0 : change in enthalpy of solution at 25 °C for uncharged molecules (ΔH sol implies the saturated-solution reference state, rather than the one based on unit activity)

Appendix
Table A1 lists the molecules studied, along with the mp and ABSOLV descriptors.

Figure 1. Recursive partition tree analysis (Algorithm Builder, v.1.8), separating acids from non-acids. For acids, mp is a key discriminator. For non-acids, H-bond basicity and molar refractivity are key discriminators.
and F = 321; a model using just I A and mp yielded r 2 = 0.03, s = 12.1, and F = 641. Figure 2(b) shows the correlation plot, combining the Case C and D sets. The acids are on the high positive enthalpy side of the plot. Most of the molecules that have very negative enthalpy values are large complex species (e.g., cyclosporine A, clarithromycin, erythromycin, ivermectin, digoxin, and digitoxin).

Figure 2. (a) The correlation plot for the calorimetric set (Case A in Table 2), where acids (unfilled circles) tend to have higher positive enthalpies compared to the other molecules. (b) The correlation plot for both calorimetric and solubility-based data (Case C+D in Table 2). The statistics were calculated with the SigmaPlot plotting program. The acids (unfilled circles) dominate the high positive enthalpy side of the plot. Molecules with very negative enthalpy values are large complex species with many H-bond acceptor groups (e.g., cyclosporine A, clarithromycin, erythromycin and ivermectin).

Figure 3. Standard deviations in log S 0 determined in replicates at multiple temperatures, SD(T), divided by those from log S 0 transformed to the reference temperature, SD(25 °C), as a function of SD(25 °C). (a) High-quality solubility measurements from the Yalkowsky et al. [8] handbook, for relatively simple molecules. (b) The SD ratios for the entire "4557" set containing a large number of druglike molecules.

RDK: RDKit generated descriptors
RFR: random forest regression method
RPT: recursive partition tree analysis
I A , I B , I AB , I N : acid-base indicator indices: unit value, indicating that a molecule is an acid, base, ampholyte, or neutral, respectively, and zero otherwise
Σα 2 H : Abraham descriptor, solute H-bond total acidity (also called A)
Σβ 2 H : Abraham descriptor, solute H-bond total basicity (also called B)
π 2 : Abraham descriptor, solute polarity/polarizability due to solute-solvent interactions between bond dipoles and induced dipoles (also called S)
R 2 : Abraham descriptor, excess molar refraction (dm 3 mol -1 / 10), which models dispersion force interaction arising from pi- and n-electrons of the solute (also called E)
V x : Abraham descriptor, McGowan molar volume (dm 3 mol -1 / 100) of the solute
Σα 2 H •Σβ 2 H : Abraham acid-base H-bonding product descriptor [12]

Table footnotes:
b A = acid, B = base, AB = ampholyte/zwitterion, N = nonionizable molecule.
c Underlined mp are predicted using Lang-Bradley program [19, 20].
d Underlined references were cited in Yalkowsky et al. [8].

©2015 by the authors; licensee IAPC, Zagreb, Croatia. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/)