Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database

The accurate prediction of solubility of drugs is still problematic. It was thought for a long time that shortfalls had been due the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of the suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility at best have hovered around the root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what’s missing are solubility data from a few sparsely-covered chemical space of drugs (particularly of research compounds). Also, new descriptors which can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.


Introduction
In pharmaceutical research, the aqueous solubility of exploratory compounds is a very important physical property to assess [1,2]. Peroral drugs with very low solubility may not release sufficient compound from the solid form during the intestinal transit to generate therapeutic benefit. Conversely, highly water-soluble drugs may not be able to permeate lipoidal barriers in the intestinal wall and in the barriers beyond, to reach the therapeutic site of action in sufficient concentration. Thus, not too little and not too much solubility is an important balancing act in compound advancement during drug development.
Given the large number of compounds tested in drug discovery, measurement of solubility is done by high-throughput methods, which generate "kinetic" values in buffers containing 0.5-5 %v/v DMSO [2,3]. Usually, amorphous solids precipitate from supersaturated solutions in the microtitre wells. Although kinetic solubility can be 10-100 times higher than equilibrium solubility, it is nevertheless suitable for anticipating whether a particular test compound will precipitate in an in-vitro bioassay, triggering a false positive test [3][4][5][6]. Compounds advanced into later stages of research are fewer in number. Justifiably, more rigorous methods are used to measure their equilibrium solubility, often in media more reflective of the biological fluids to which drugs are exposed [7].
It is beneficial to predict equilibrium solubility of research compounds at the start of discovery projects, as part of virtual screening of compound libraries, before any actual measurements are done, for assisting in the prioritizing molecules for the project. Numerous methods for predicting solubility of organic molecules have been described in the literature, based on quantitative structure-property relationships (QSPR), where the molecular structure is used to predict physicochemical properties [8].
This study concerns prediction of the equilibrium solubility of drugs. Perhaps, more importantly, the focus is on the impact of molecules selected to train the prediction method. The details of the evolving Wiki pS0 database (in ADME Research) [9] of druglike molecules will be described. Since 2011, the focused searching of the primary literature for equilibrium measurements of aqueous solubility (especially as a function of pH) of druglike molecules has contributed to 6355 intrinsic solubility, log S 0 , entries. The preprocessing of the available solubility data to extract the underlying S 0 values (normalized to 25 °C [10]) utilized the purpose-designed computer program, pDISOL-X (in ADME Research) [11] (whose prototype FORTRAN version, STBLTY, was first coded in the late 1970s [12]). As part of the curation, data quality was assessed by interlaboratory comparisons of those molecules which were studied multiple times by different researchers. The log S 0 values, along with their estimated standard deviations (SD), were then used to train two solubility prediction models: (i) weighted multiple linear regression (MLR) using Abraham solvation descriptors [13], and (ii) Random Forest regression (RFR) [14] using the diverse descriptor collection from the RDKit open-source chemoinformatics and machine-learning library [15]. The results were compared to those calculated by the general solubility equation (GSE), which requires no training [16,17]. Four external test sets [18][19][20] were employed in the validation of the models, taking care to remove any of the test set molecules from the large training set. Three of the test sets (containing only druglike molecules) have appeared in landmark 'Solubility challenges' [19,20].

General solubility equation (GSE)
In 1965 Irmann [21] described solubility prediction based on a group contribution approach. For solids, he included a term related to the entropy of fusion, coupled with the melting point (T m ). In 1968, Hansch et al. [22] recognized that the octanol-water partition coefficients, log P, are strongly correlated linearly with aqueous solubility values, log S w , for nonionizable liquid samples. Expanding on the work of Irmann and Hansch, Yalkowsky and coworkers developed and popularized the general solubility equation (GSE), to enable the prediction of solubility of liquids and solids in water [16][17][18][23][24][25][26][27]. Just two variables, T m (°C) and log P, both experimental determined, are used in the equation to predict solubility of organic compounds in water (in log molar units): log S = 0.5 -log P -0.01(T m -25) (1) The equation requires no "training." Although the GSE is rooted in sound thermodynamic principles, some assumptions had to be made in developing the equation: test compounds are taken to be nonionized and fully-miscible in octanol (leading to the 0.5 intercept term), and that the water and octanol phases are assumed not appreciably mutually soluble (but, according to [28]: water-saturated octanol contains ~25 mol% water; solubility of octanol in water is ~2 mM). The implicit assumption behind the 0.01 factor arises from the near constancy of the entropy of fusion. This is in reasonable agreement with the relatively nonflexible aromatic solutes initially considered. A semi-empirical version of the GSE was proposed: the calculated log P could be used in place of the experimental value. More recently, a version was proposed entirely based on calculated descriptors [27]. Empirically-adjusted coefficients in Eq. (1), based on various training sets [16,24,29], did not result in substantially improved predictions of the solubility of druglike substances. The GSE is popular for its ease of use [17].
Yalkowsky and Banerjee [18] proposed an external test set of 21 molecules: 6 solid and 3 liquid poorlysoluble pesticides (log S w -3.4 to -7.9), 11 simple drugs (log S w 0.5 to -4.1), and a laxative/dye molecule (with somewhat uncertain solubility). As will be shown below (cf., Fig. 11a), the solubility of the above test set molecules is well predicted by Eq. (1). This test set has been widely used by other investigators.
Empirical prediction models Dearden [30], Taskinen and Norinder [31] thoroughly reviewed solubility prediction studies reported from 1992 to 2005 [25,29,[32][33][34][35][36][37][38][39][40][41][42][43][44][45][46][47] which used the popular Yalkowsky-Banerjee external test set to assess the efficacy of the empirical methods. The average of the reported prediction root-mean square errors (RMSE) is about 0.9 log unit, with individual values found to range from 0.6 to 1.4. The predictions of Raevsky et al. [29] (nearest-neighbor method, using HYBOT hydrogen bond descriptors) and Tetko et al. [40] (artificial neural network method, with electrotopological E-state indices) fared slightly better than those of others. Many of the training sets used in the prediction studies consisted of several hundred simple organic molecules, including aromatic hydrocarbons, polyhalogenated organic compounds, practically-insoluble agrochemicals and environmental pollutants, many in liquid form at room temperature, but only relatively few druglike molecules (resulting in spotty coverage of the chemical space resembling today's pharmaceutical discovery compounds). As summarized in the reviews [30,31], prediction methods included multiple-linear regression (MLR), principal components regression (PCR), partial least-squares (PLS), knearest neighbors (kNN), artificial neural networks (ANN), support vector regression (SVR), and Random Forest regression (RFR). Some of the QSPR methods were based on hundreds of calculated atomic and molecular 2D and 3D descriptors. In many of the studies, the most influential descriptors are two calculated physical properties: log P and molar refractivity, MR, (which accounts for molecular size and polarizability). Other calculated 2D descriptors included partial-charge surface properties, atom and functional group counts, connectivity and topological and electrotopological indices, H-bond donor and acceptor counts; 3D descriptors included energy terms (total potential energy, electrostatic, molecular mechanics force-field energy), molecular shape, volumes, and water-accessible surface areas [48][49][50][51][52][53][54][55].
Wang and Hou [56] summarized solubility prediction efforts up to 2010, comparing the results of 16 studies. They discussed the improvements resulting from consensus modeling. Also, there was a discussion of using "local data" models to improve predictability, with the domain of applicability (DOA) identified by molecular descriptor similarity, rather than structural (e.g., Tanimoto indices) similarity.

Abraham solvation equation (ABSOLV)
Abraham and Le [13] amended the Abraham solvation equation [57] to predict solubility: In the MLR equation, the log S 0 is the dependent variable (measured log intrinsic molar solubility) and the independent variables are the five solute descriptors accounting for the transfer of solute from one phase to another: A is the sum of H-bond acidity, B is the sum of H-bond basicity, S π is the dipolarity/polarizability (subscripted here, so as not to be confused with solubility), E is an excess molar refraction in units of (cm 3 •mol -1 )/10, and V is the McGowan characteristic volume in units of (cm 3 •mol -1 )/100. The c 0 -c 6 coefficients in Eq. (2) are determined by MLR, trained on a set of intrinsic solubility values of a diverse collection of molecules. The five Abraham solvation descriptors may be calculated from 2D structure (introduced as a SMILES text or as coordinates in a 'mol' or 'sdf' type file) using the program ABSOLV [58] (cf., www.acdlabs.com). The A•B cross-term in Eq. (2) is intended to deal with intermolecular H-bond interactions between acid and base functional groups in the solid or liquid environment. Its inclusion, as an alternative to using the T m term in Eq. (1), was intended to improve the prediction accuracy. Eq. (2) applied to the Yalkowsky-Banerjee external test set, using the MLR coefficients reported by Abraham and Le (their Eq. 11), with ABSOLV-calculated descriptors, resulted in RMSE = 1.71 log unit (prostaglandin-E2 was an extreme outlier; data not shown). In the present study, we re-determined the seven MLR coefficients using our own training data, with the data weighted according to estimated measurement errors, to find a much better fit, as will be shown below (cf., Fig. 12a).

Random Forest regression
Of the new machine-learning statistical approaches, the Random Forest regression (RFR) method is thought to be among the top performers, in terms of prediction accuracy. The method was introduced in 2001 by Brieman [14], and is implemented in the open-source "randomForest" library for the R statistical software [59][60][61]. RFR may be appealing to new users because it can be employed "off the shelf," requiring only minimal learning. In many applications, the default "tuning" parameters are nearly optimal. RFR works by constructing an ensemble of hundreds of decision trees [62].
To illustrate, in part, how RFR works, Figure 1 shows an example of a single recursive partition decision tree constructed (Algorithm Builder v.1.8, ACD/Labs, Toronto, Canada; www.acdlabs.com), using the 600 zwitterionic molecules in the Wiki-pS 0 database, drawing on the five Abraham descriptors [57]. The process begins with the unsupervised selection of one of the descriptors (E in the example) and finding the optimal 'splitting' value (1.27 in the example) which divides the solubility data into two branches: the left branch grouping 369 molecules which have descriptors less than the splitting value and the right branch grouping 231 molecules with descriptors equal to or exceeding the splitting value. A criterion for the splitting can be based on minimizing the residual sum of squares at each node, RSS = Σ i (y i -<y left >) 2 + Σ j (y j -<y right >) 2 (3) where i indexes the solubility values in the left branch and j indexes those in the right branch; y represents log S 0 values; <y> is the average value in the left/right branch. Each of the two branches generates a new node. The process then repeats until the "terminal" nodes are reached, associated with a specified minimum of molecules (e.g., 5). In the above decision tree training, r 2 = 0.70 and RMSE = 0.81 (average of the seven terminal "leafs"). Generally, the node splitting procedure yields ever more homogeneous groupings of molecules, and produces trees which bring together similar solubility values at the same node.
The above example involved just one tree, where at each node, all of the descriptors were considered in the selection of the one best suited to split the node. RFR is different in a number of ways. Typically, 500 decision trees -a "forest" -are constructed.

Figure 1.
Example of a calculated recursive partition decision tree (Algorithm Builder v.1.8), based on 600 zwitterionic molecules (Wiki-pS 0 database), using Abraham descriptors. At each node, all five descriptors are queried to select the one best suited for further splitting of the data. In part, node splitting stops at 5 molecules. By comparison, the Random Forest method uses hundreds of trees (each containing a different subset of randomly-selected solubility values of molecules) and re-selects a subset of descriptors randomly for each node splitting Liaw [61] graphically illustrated the structure of a typical random forest. The entire data matrix comprises n rows of solubility values and p columns of chemical descriptors. Each tree in the forest is allocated a different bootstrap (with replacement) sample of the n rows -i.e., it contains a randomlyselected subset (e.g., two thirds) of the entire solubility data. For each tree, the "left out" molecules (e.g., one third) are called the 'out-of-bag' (OOB) sample. Each tree is grown to its maximum size by node splitting, as partly illustrated in Figure 1. In RFR, only a randomly-selected subset of the available descriptors (typically, p/3) is used at each node in each tree. Each tree is grown until the terminal nodes are reached, with each final "leaf" containing a specified minimum number of solubility values, the average of which being the predicted value for the particular tree. The final prediction for the regression model is made by averaging predictions from all trees. All the compounds that did not take part of the tree growing process (OOB compounds) can be used as an internal validation set to estimate the error of the model.
To assess the predictability of the models in the current study, we randomly split the solubility data into a training set (70 %) and an internal test set (30 %), as described by Walters [63]. Also, external test sets proposed by others were predicted based on the RFR model trained with all of the molecules (excluding any from the external test sets).
RFR is not sensitive to the presence of irrelevant descriptors, even those which are highly correlated. Hence, "over-fitting" the data is not expected. (However, it is noteworthy that if test set molecules are also included in the training set, then their RFR "prediction" will be very close to the user-provided measured values.) RFR includes built-in estimation of (i) prediction accuracy (as standard deviation of the predicted mean), (ii) descriptor importance (as a result of sensitivity testing of each descriptor), and (iii) similarity between molecules (as a result of the node filtering process). The application of the method to QSAR predictions has been described in detail by Svetnik et al. [64]. An inconvenience of the currently-developed RFR method is that it cannot extrapolate (in the sense that MLR methods can): it cannot predict any solubility value outside of the range encompassed by the training set. For example, the extremely-low (log S 0 < -8) solubility of drugs like amiodarone, clofazimine, itraconazole, halofantrine, and probucol is not expected to be well estimated by RFR. The latter molecules are near the edge of the chemical space (defined by the descriptors used) that's sparsely populated by molecules with similar solubility. The closest molecules are likely to be more soluble than the above test compounds.
The first applications of RFR to predict solubility appeared in 2007 [65,66]. Schroeter et al. [65] used S w and S pH data (mixed values not corrected for ionization) to train a RFR method, using ~4000 measurements mostly taken from secondary sources [35,67,68] and some from in-house (Bayer Schering Pharma) sources. For the Huuskonen data [35] as test set, RMSE = 0.66 (n=1290) was reported. For the solubility data in the domain of applicability (DOA) matching that of research compounds (10 -3 to 10 -7 M solubility), the RFR method indicated RMSE ~ 0.85 log unit. In the Palmer et al. [66] RFR analysis, aqueous solubility values of 998 structurally diverse druglike solid organic compounds were gathered from similar secondary sources: Handbook of Aqueous Solubility [69], Huuskonen [35], and Delaney [47]. (It was not reported how molecules were corrected for ionization.) The authors used the molecular operating rnvironment (MOE) [70] to generate 126 two-dimensional (log P, MR, charged-surface properties, atom, group, and H-bond counts, connectivity and topological indices) and 36 three-dimensional (total potential energy, electrostatic contributions, molecular shape, and solvent-accessible surface area) descriptors. Various values of the RFR tuning parameters, ntree, mtry, and nodesize, were explored in the model trained with all of the 2D descriptors, with the best parameter values found to be ntree = 500, mtry = 42, and nodesize = 5, which are the usual default values. The training set of compounds produced the statistics: r 2 = 0.98, RMSE = 0.28, n = 988, bias = 0.007. As often pointed out, this is not an accurate measure of the predictability of solubility of molecules not used in the training process. Randomly splitting the entire data into a training set (70 %) and an internal test set (30 %) produces a good measure of the ability of the model to predict solubility of compounds not included in the training set, indicated r 2 = 0.89, RMSE = 0.69, n = 330, bias = 0.017. An external test set produced similar statistics. Including the 3D descriptors did not make substantial improvements to the model.
The most influential descriptors in the Palmer et al. study were calculated to be those related to the fractional van der Waals surface area, VSA. The ten most important descriptors ranked by RFR were log P > negative VSA (PEOE_VSA_FNEG) > number of hydrophobic atoms (a_hyd) > MR > hydrophobic atoms VSA (vsa_hyd) > chi1v (topological) > polar VSA (PEOE_VSA_FPOL) > hydrophobic VSA (PEOE_VSA_FHYD) > MW > negative polar VSA (PEOE_VSA_FPNEG).
More recently, Walters [63] thoroughly compared the Huuskonen thermodynamic S w values (n = 1274) [34,35], the Llinas et al. thermodynamic S 0 values (n = 94) [19] and PubChem (n=1000) kinetic highthroughput solubility [71] databases using the RFR framework. The publication serves as a very useful tutorial to the machine-learning method, and is highly recommended for those interested to try RFR.

Gap between prediction and experiment
For 411 compounds characterized by multi-source solubility measurements, Katritzky et al. [72] found standard deviation, SD, to be 0.58 log in replicate values. According to Taskinen and Norinder [31], an AstraZeneca in-house database of solubility measurements of different batches of the same compound typically showed reproducibility of 0.49 log. Higher uncertainties had been discussed (Jorgensen and Duffy [73]; Palmer and Mitchell [74]). It has been a widely-shared view that interlaboratory measurement reproducibility is typically 0.6 log.
As mentioned previously, the solubility prediction errors are often in the 0.6 to 1.3 log unit range [30,31,56,73,74]. So, one might surmise that prediction methods are approaching measurement error limit. But, this may not be so.
First, many of the early prediction studies considered molecules from a chemical space occupied by relatively simple organic molecules and some complex agrochemicals, which were adequately represented by the then available training set data. In some of these studies, low RMSE were achieved. Earlier training sets were under-represented in practically insoluble and highly lipophilic druglike molecules, whose physicochemical properties are not easy to measure accurately. In some cases, important descriptors, such as calculated log P can be off by 1-2 log units (e.g., amiodarone). Since values of log P > 5 or < -2 are difficult to measure accurately by the shake-flask method [28], log P prediction methods can be uncertain for outof-bounds molecules. At such extreme values, experimental log P values may not strongly correlate with the experimental log S values [75]. Since many of today's research compounds have very low solubility, the earlier prediction methods that have shown low RMSE are not expected to do as well when subjected to predicting solubility of practically insoluble drug molecules, such as amiodarone and itraconazole, or novel research compounds synthesized in drug discovery programs, for which there may be a shortage of prediction training set data publically available.
Second, the perceived 0.6 log error in measured solubility may be upwardly biased, given how disparate legacy data have been handled in assembling large training sets. The relatively poor reproducibility may be the result of systematic errors arising from mixing different types of solubility values, measured at different temperatures, or simply gathered from poor-quality measurements. A 'white paper' drawing on expert consensus thoughts of researchers from six countries addressed the critical needs related to experimental assay design, and how legacy data can be better processed to reveal improved precision [76]. A related study [9] discussed at length the correction of data for ionization when solution complexity distorts the expected shape of the log S-pH profile predicted by the Henderson-Hasselbalch equation. When solubility values measured in the temperature range 10-50 °C are transformed to values at 25 °C, the estimates of the interlaboratory precision improve [10].
The above two points suggest that the gap between prediction and experimental errors may still be substantial. Similarly, Palmer and Mitchell [74] made the case that it's not the data that are limiting, but rather it's the prediction methods (and/or descriptors) that need further improvements. In an earlier review, Faller and Ertl [77] suggested that "no really satisfactory approach to [drug] solubility prediction is available yet," in spite of the large number of prediction studies.

Quality and chemical space of experimental data
It has been consistently shown that the best prediction models are devised from training set molecules that occupy very similar chemical space (defined by the descriptors used) as those in the test set [63]. For drug solubility prediction, the ideal training sets would consist of molecules of interest to discovery projects. Only a tiny fraction of such measurements are publically available, and in-house pharma prediction studies are unlikely to be openly publicized.
Measuring equilibrium solubility of ionizable molecules is expensive and analytical-resource consuming. Even given high analytical investment, quality is not assured when results are based on poorly-designed assays.
Factors affecting reproducibility in published solubility data -'white paper' summary Many of the factors affecting the quality of equilibrium solubility measurement have been discussed in the consensus report ('white paper' [76]) are summarized in the list:  dissolution of added solid has not reached equilibrium during the selected equilibration time,  solid state characterization not performed after equilibration -polymorphs, hydrates, solvates, nanoparticles, amorphous forms not identified,  formations of drug aggregates/oligomers (dimers, trimers, …), micelles, and drug-buffer complexes in solution at equilibrium [78],  poor wettability,  adsorption to filter/vial surface,  inappropriate phase separation methods used, e.g., (i) first centrifuging a saturated solution, then filtering the supernatant (without first saturating the filter); (ii) multiple re-centrifuging a centrifuged solution (without pre-saturating the vial surfaces); (iii) nano-sized particles passing through filter,  using unnecessarily high buffer concentrations, possibly effecting drug-buffer complexation [78],  not using buffers with low-soluble ionizable drugs (especially weak bases),  effect of impurities unaccounted, especially those which are ionizable when unbuffered solutions are used,  not measuring the final pH of the equilibrated saturated solution of ionizable drugs (buffered pH may be altered by the drug),  not taking into account the effect of ambient CO 2 on the water solubility of low-soluble bases in unbuffered solutions,  inadequate pH electrode calibration at low/high pH (junction/asymmetry effects), and in drug-salt studies (high ionic strength),  compound instability at the extremes of pH or over long saturation times (e.g., indomethacin, acetylsalicylic acid, ascorbic acid),  stereoisomers (DL-, D-, L-), (R-/S-), or cis-/trans-isomers not stated,  limit of detection (LOD) -not sufficiently sensitive analytical methods used to determine drug concentration below LOD,  for ionizable compounds, inaccurate value of pK a used to calculate log S 0 from log S-pH profile introduces systematic error.
The impact of the above factors can be minimized by employing good experimental practices and appropriate data analysis methods. However, in today's solubility prediction methods, factors such as the formation of differing polymorphs, hydrates, solvates, amorphous solids, and the impact of stereoisomers, are not adequately addressed.

Wiki-pS 0 database
The intrinsic solubility database, Wiki-pS 0 (in-ADME Research), contains 6355 log S 0 (log molar) entries, based on measured aqueous solubility values of 3014 different compounds collected from 1325 cited references (as of April 2019). In the majority of the cases, the literature data were further processed, using pDISOL-X (in-ADME Research), to extract intrinsic solubility (S 0 ) values from reported aqueous freeacid/base or salt solubilities (S w ), solubilities at specified pH (S pH ), or log S-pH profiles [9,11,76,[78][79][80][81]. All of the molecules are solids at room temperature (except for propofol, whose T m is 14 °C). There are 1078 log S 0 entries derived from 9907 individual log S measurements at a particular pH (cf., Fig. 1a in [9]). About half of the data sources originate from secondary listings and the rest are from primary sources. In the case of secondary sources, the citations to the original work are generally available, and in many cases were consulted for clarifications. Differently named molecules were identified and reconciled by searching the database for matching Tanimoto structural fingerprint indices [15].
For 3671 entries, comments were added to the database records (based on available information in the original sources), briefly noting experimental method used (mostly saturation shake-flask), temperature (23 °C assumed when 'room temperature' was stated or no value was provided), equilibration time, apparent quality of data, standard deviation in measured values (if reported), buffers/pH, polymorphic or hydrate form (if identified), method of solid separation, agitation method, etc.
The most reliable data had been determined by the saturation shake-flask (SSF) method (still the "gold standard" methodology in the minds of most experimentalists), especially when taken as a function of pH. Also, two potentiometric instruments have demonstrated their importance: pSOL [82] and CheqSol [83] (both now available from Pion Inc., Billerica, MA, USA). The characterization of solid forms (crystalline, amorphous, nanoparticle, etc.) and their impact on the measured solubility are important considerations (i.e., solvate, polymorph, racemate effects), but these are not always reported/detailed in the solubility studies.
Two websites: ChemSpider (Royal Society of Chemistry, UK) www.chemspider.com, and PubChem (https://pubchem.ncbi.nlm.nih.gov/ were valuable for checking names of molecules, obtaining CAS numbers, getting structure representations (SMILES), melting points (T m ), and the like. ACD/Labs ChemSketch was useful for drawing molecules and constructing SMILES representation for molecules. When measured T m were not found (as in 19 % of the entries in Wiki-pS 0 ), then Lang and Bradley [84] predicted T m were used: QsarDB open repository of data and prediction tools (http://qsardb.org/repository/predictor/10967/104?model=rf).
Data added to Wiki-pS 0 from multi-source compilations ('low hanging fruit')  PHYSPROP database [67] (Sept 1999 version: over 6000 measured water solubility, S w ): 1327 values were selected for molecules not appreciably ionized in water. Excluded compounds were: (a) T m < 40 °C, (b) log S w < -8 or > 0, (c) surfactants/long aliphatic chain molecules, (d) polycyclic aromatic hydrocarbons, (e) peroxides, (f) carboxylic acids, (g) salts/complexes, (h) dyes or names containing color, and (i) herbicides, pesticides, insecticides, rodenticides, and acaricides (as indicated by "tags" at the ChemSpider website). Of the selected 1327 compounds, the S w values of 1210 nonionizable/nonionized molecules were taken to be S 0 . The other 117 compounds were processed by pDISOL-X to calculate S 0 and pH sat (pH of saturated solution) from the given S w and pK a , assuming pure water was the solvent, and the Henderson-Hasselbalch equation was valid [11]. Literature references (many from Merck Index and Beilstein -cf., below) were recorded in Wiki-pS 0 .
 Handbook of aqueous solubility data [85]: 1130 S w of druglike molecules were selected, with 776 values subjected to pDISOL-X analysis to determine S 0 values. Some values were listed as intrinsic in the handbook, only requiring adjustments when the temperature was not 25 °C. Original references were recorded in Wiki-pS 0 . Many references were checked; however, references for 65 compounds could not be accessed online. Occasionally, reported S w values for neutral compounds were actually those of drug-salt measurements, as clarified on checking the original literature.
 Benet-Broccatelli-Oprea 927 BDDCS solubility list [86]. This compilation contains interesting drugs, but no references to original sources were cited and no experimental details were given. Of the drug solubilities listed, 333 were selected. In many cases, the original sources were recognized on cross checking with existing entries in Wiki-pS 0 . The S w values were mostly of free bases/acids, but some were clearly of salts, which required careful effort to discern. All selected values were converted to the S 0 scale using pDISOL-X.
 Analytical profiles of drug substances (APDS) [87]. The first 39 volumes of the series of monographs were searched for quantitative solubility data. Monographs on 155 molecules were selected for preprocessing. Most of the reported solubility values of ionizable molecules were measured in pure water with unspecified saturation pH. For those ionizable molecules which were not drug salts, the intrinsic values were calculated by pDISOL-X. Unfortunately, the solubility reported in APDS is often devoid of experimental detail (e.g., temperature not always reported), some citing 'personal communication' as references. Nevertheless, there are several high-quality log S -pH original data sets in the monographs.
 Merck index [88]. S w values of 173 molecules were used, after conversion to the S 0 scale. The Merck Index is often cited in older databases (e.g., [67]), but it may not be a sufficiently reliable general source for critical studies (literature references not usually given, details often lacking, etc. Single-source measurement of many compounds ('quick catches') The small single-source databases below consist largely of intrinsic solubility values. Useful collections of original measurements included those of McFarland et al. [93], Bergström and coworkers [94][95][96][97][98], and Faller and Ertl [77].

Sources of pK a data
The pK a values of the ionizable molecules were taken from Avdeef [80]; (cf., www.in-ADME.com/wiki_pka.php/), and various other established sources. When no experimental values were found, then the values calculated by MarvinSketch 5.3.7 (ChemAxon Ltd., www.chemaxon.com) were used. The pK a values were automatically adjusted for changes in the ionic strength [11,80] and temperature [118] by pDISOL-X.

Units conversion
Solubility data have been reported in many concentration units: mol/L (molarity, M), mM, μM, mol/kg (molality, m), mole fraction (x), mg/mL, μg/mL, mg/100mL, mg/dL, %w/v, g%mL, mg/mL%, mg%, "1 in 15 of water," "soluble in 3 parts of water," "2 % soluble in water," units of IU/mL, etc. Mole fraction and molality units are almost always used when solubility is determined over a wide range of temperatures, since the units do not depend on the density of the solutions. In the clearly presented accounts, the equivalent molecular weight to use to convert the practical units (e.g., μg/mL) to molarity is stated (e.g., "concentration is expressed as free base equivalent"). In practice, it is all too easy to make a mistake in converting the reported units to the preferred molarity scale, so extra care is recommended.
It could be argued that solubility should be tabulated in logarithmic units (preferably based on molarity). (i) Direct values span over 12 orders of magnitude and cannot be accurately depicted in S-pH plots at the low end of the scale (sic -log of "zero" solubility is undefined). Unfortunately, raw S-pH data are often presented only in a plot, with points plotted at ~zero. (ii) Errors in log S values do not depend on the magnitude of the log S (whereas they do when direct units are considered). This is problematic when refinement of constants is based on S measurements and unit weights are assumed.
In the Wiki-pS 0 database, values reported in molality units are noted, but are seldom converted to those in molarity (by applying solution density), since the differences are small around the temperature range of interest, and since solution density is usually not reported.

Interlaboratory reproducibility
There are 870 different molecules in Wiki-pS 0 for which solubility was reported from at least two different sources. This formed the basis for estimating interlaboratory reproducibility. Some molecules had been studied in many different laboratories. For example, there were 33 different reports of the solubility of diclofenac found to date, with 17 of these measured at several different pH values, whose complicated profiles were reconciled and discussed by Bergström and Avdeef [79]. The next most-frequently studied molecules are phenytoin, barbital, and ketoprofen, with 30, 26, and 24 interlaboratory determinations, respectively. The average interlaboratory reproducibility, SD avg , based on the curated 870 replicated studies, has been determined to be 0.17 log unit, significantly lower than the experimental reproducibility suggested in past studies (~0.6 log unit) [72][73][74]. As noted above, many factors can lead to the perception of poor reproducibility of measurements. It takes some effort to factor in the possible sources of systematic error, to attain the low SD avg . Still, for some difficult-to-measure drug molecules, the intrinsic solubility is quite uncertain, with SD values exceeding 0.5 log unit [20,79].

Physicochemical properties of database molecules
The 6355 intrinsic solubility set ranges in log S 0 from -11.0 to +1.8 (log molarity), essentially with a Gaussian distribution: mean = -3.04, median = -3.00, SD = 1.88. Figure 2 shows the solubility distribution for the molecules. About 47 % of the entries have log S 0 between -7 and -3, the typical range (DOA -domain of applicability) of values for drugs and research compounds [65]. About 2 % of the molecules have log S 0 < -7. Some of the least-soluble molecules (log S 0 < -8) in the database are amiodarone < clofazimine < itraconazole < halofantrine < ubiquinone < epristeride < vinorelbine < silafluofen < cosalane < etretinate < probucol < arotinoic acid < clomifene < motretinide < lasalocid < carbenoxolone. The most soluble (log S 0 > 0) substances are amino acids, simple carboxylic acids, and carbohydrates.  Figure 3 shows the trend between measured log S 0 and calculated log P (RDKit [15]), the most important descriptor in the prediction of solubility. The scatter is substantial, and perhaps trends nonlinearly at the extremes of the scales. The measured extreme values of log S 0 are possibly more accurate (since these are mostly determined from multi-point log S-H profiles) than the corresponding calculated log P (cf., ubiquinone and amikacin log P values). The traditional shake-flask method for direct-measure log P is thought to be limited to the range (-2 to +5), so methods for prediction of log P would be hard pressed to extrapolate accurately beyond that range, in the absence of reliably measured log P training-set values. Plot of log S 0 (largely SSF type) versus octanolwater partition coefficient, log P, calculated using the RDKit software [15]. Figure 4 shows the distribution of errors determined by averaging the log S 0 of those replicate molecules measured in different laboratories. The average value of interlaboratory standard deviation is SD avg = 0.17 log unit. The individual SD values trend to higher values as solubility decreases (Fig. 4b). The molecule showing the poorest reproducibility, with SD = 0.93 log unit (avg. from five sources), is clofazimine. It is also among the least soluble molecules in the database, with average log S 0 = -9.05. The weakly dibasic (pK a 3.83, 7.54 at 37 °C, I = 0.15 M [105]) phenazine antibiotic (M W 473.4 g/mol) is used to treat leprosy. The orally-bioavailable molecule has the very unusual characteristic of precipitating and accumulating as easily-visible red microcrystals in macrophages [119]. Figure 5 shows the distribution of properties used by Lipinski et al. [120] to define the Rule of 5 as an indicator of "drug-likeness." Frame (a) shows the log P distribution, with the average value of 1.89. About 80 % of the 6355 entries fall within the range of 0 to 5 (expected range for druglike molecules). Frame (b) shows the distribution of molecular weights, with the mean value 280 g/mol. About 95 % of the molecules have M W < 500 g/mol ('good' range). Frame (c) considers H-bonding characteristics. The red bars (tallest) refer to H-bond donor counts (NHD), where 98 % NHD ≤ 5 ('good'). The black bars (extending to higher counts) refer to H-bond acceptors (NHA), where 97 % NHA ≤ 10 ('good'). For the most part, the database molecules are in the expected boundaries of drug-likeness, with log P showing some violations at the high end, and more so at the low end for about 20 % of the entries. Table 1 summarizes the results of the weighted multiple linear regression (MLR) analysis of the Abraham solvation equation (ABSOLV), and the 'trained' version of Yalkowsky's general solubility equation (GSE). Also listed are the Random Forest regression (RFR) metrics. The 22 quaternary ammonium compounds were treated as a separate subset, using just some of the Abraham descriptors. The remaining 6333 solubility values were subjected to the full MLR analyses. Furthermore, the molecules were considered separately in each of four acid-base classes -with reference to predominant charge state at pH 7.4: acids(-), bases(+), neutrals(0), and zwitterions(±), as well as in combined classes.

Yalkowsky's general solubility equation (GSE)
It was of interest to see how well the GSE (untrained) predicted solubility values in the database. Figure  6 shows the results of applying Eq. 1 to the acid-base subset data. The first three classes (Figs. 6a-c) have similar statistical metrics: r 2 = 0.54 to 0.61, RMSE = 1.15 to 1.24, bias = -0.14 to -0.30, and MPP = 37-40 % (measure of prediction performance: percentage of the absolute residuals ≤ ±0.5 log unit). The GSE did not perform as well for the zwitterions (Fig. 6d): r 2 = 0.07, RMSE = 1.54, bias = +0.34, and MPP =25 %. The average calculated log P [15] for the zwitterion set is 0.07 (Table 1), suggesting that the GSE prediction of zwitterions is based largely on T m contributions. When all the classes were combined (n = 6333, excluding 22 quaternary ammonium drugs), the untrained GSE prediction yielded r 2 = 0.57 and RMSE = 1.23 (Table 1). When the fixed coefficients in Eq. 1 (0.5, -1.0, -0.01) were subjected to regression using weighted MLR, the fit improved only slightly for the combined acid-base classes: r 2 = 0.60, RMSE = 1.17, n = 6333, but the refined coefficients (-0.33, -0.83, -0.006) were quite different from the traditional values, especially for the intercept coefficient (Table 1). This may be due to the negative correlation between the intercept and the T m terms (-82 to -97%). When the molecules were examined by the acid-base classes, the acids most resembled the results of the untrained GSE, with coefficients (0.62, -0.94, -0.0115) and metrics: r 2 = 0.70, RMSE = 1.02, n = 1424. The bases and neutrals indicated a negative intercept, -0.28, with only slightly improved metrics ( Table 1). The zwitterion class had reversal of signs for both the intercept and the temperature dependence coefficients, with the slightly improved metrics: r 2 = 0.22, RMSE = 1.28, n = 600. The performance was slightly better than that of the GSE (trained or untrained), and a lot better in the case of zwitterions. The refined ABSOLV coefficients (Table 1) indicate acid-base class differences. These coefficients are not similar to the ones reported by Abraham and Le [13]. In MLR, such differences in coefficients can arise when different training sets are used, as a result of correlations between descriptors. It was found that const:A correlations ranged -50 to -83 % and const:AB correlations ranged +57 to +79 %.  Abbreviations and definitions section below identifies and defines the most important descriptors used in the RFR algorithm.

Weighted multiple linear regression using Abraham descriptors (ABSOLV)
Training set and internal validation Figure 8a shows the entire training set RFR analysis, with the metrics: r 2 = 0.95, RMSE = 0.40, bias = -0.007. This is not a good measure of the predictive power of the method. Rather, it indicates how well the model can incorporate the information represented by the descriptors and relate it to solubility in the training set [66]. The randomly selected internal test set of 1906 solubility values (30 %) are better indicators of the ability of the model to predict external tests compounds which are unknown to the training process. Figure 8b shows the internal test set prediction results: r 2 = 0.89, RMSE = 0.60, bias = 0.0002. This performance is to be expected for external test molecules which are well-represented by the chemical space of the database, as illustrated below. The bottom section of Table 1 summarizes the analysis metrics, both for the entire data set and for the acid-base subsets. The best internal test set performance was found for the zwitterions: r 2 = 0.91, RMSE = 0.45. The right-most column identifies the ten most-important descriptors in the analysis. For the overall data, and for the acid, base, and neutral subsets, the most important descriptor is log P. It's particularly noteworthy that log P is not in the top-10 list for the zwitterions. In several of the cases, the second-most doi: http://dx.doi.org/10.5599/admet. 766 45 important descriptor is molecular refractivity (cf., Abbreviations and definitions for the RDKit terminology). Topological indices play particularly important roles in the zwitterion subset.

Principal component analysis of thirty of the most important RDKit descriptors
The principal component analysis (PCA) function, prcomp(), in the factoextra R library was used to process the 30-most important descriptors identified in RFR. Figure 9 shows the loading plot based on the first two principal components, which account for 63 % of the total variance in the descriptors. Only the HallKierAlpha descriptor has a negative PC1 value, with all of the rest of the descriptors being in the positive PC1 domain. The close proximity of many of the descriptors to each other suggests high correlation between them. Such correlations would be problematic in MLR analysis, but not in RFR. Figure 10 shows the scores plots for the solubility data. Frame (a), which considers only the molecules with M W < 500 g/mol, shows a very dense but apparently symmetrical distribution about the origin. As M W s increase, the points shift in the direction of increasing PC1. Frame (b) shows the molecules with MW > 500 g/mol. The distribution is sparse and further shifted to increasing PC1 values, as M W values increase. Frame (c) shows all the data with the acid-base subset notation. Very large molecules are thinly represented in the bottom-right quadrant. Zwitterions tent to be in the negative PC2 half, evenly distributed in PC1.

Validation against four external test sets
Four external test sets were selected to explore how well the GSE, ABSOLV, and RFR models perform. For each of the test sets, all the test molecules found in the training set were removed, so that the prediction was of truly "unknown" molecules. This was not necessary for the traditional GSE model, since it requires no training. The observed and calculated values are listed in Appendix Tables A1-A4.    However, there were two main problem areas in the RFR modeling, as indicated by poor fit: (i) Fig. 13a shows the outlier pesticides 4,4'-DDT, 2,2',4,5,5'-PCB and chlordane; (ii) Fig. 13d shows the outlier drugs amiodarone, clofazimine, and itraconazole.
Case (i) can be remedied. The Wiki-pS 0 database has very few agrochemicals and no DDT or PCB derivatives. We decided to temporarily augment our database with agrochemicals, to see if RFR prediction could be improved for Test Set 1 (Fig. 13a). The Huuskonen [35] database of 1297 organic molecules was screened with three filters: (a) only compounds with log S w < -5 would be used; (b) only solids would be considered; and (c) Test Set 1 compounds would be excluded. This process resulted in 115 new entries to the augmented database. Figure 14 shows the improved results. By adding a few agrochemicals to the RFR training set, r 2 increased from 0.83 to 0.90, RMSE decreased from 0.83 to 0.66, bias lowered from -0.23 to +0.02, and 'correct' predictions increased from 57 to 71 %. The well-known adage that "like predicts like" is amply illustrated in this example. Case (ii) remains problematic -a case of training-set "missing neighbors" problem. As is evident in Fig. 13d, amiodarone, clofazimine, and itraconazole are poorly predicted, in part because there are few other molecules possessing the properties of these three compounds (cf., upper right edge in scores plot Fig.10c) in the database, and also, because RFR cannot extrapolate solubility beyond the range of its training data. From the PCA analysis, the five nearest neighbors to amiodarone (log S 0 = -10.4), based on three principal components, are halofantrine, irbesartan, butaperazine, mifepristone, and probucol. The log S 0 values for these neighbors show high variance: -8.0, -3.7, -4.3, -5.2, and -8.4, respectively. The RFRpredicted value for amiodarone is log S 0 = -6.8, barely greater than the average value of the five nearest neighbors. To do better, the database needs new neighbors in the chemical space close to amiodarone, doi: http://dx.doi.org/10.5599/admet. 766 49 clofazimine, and itraconazole. Or, better descriptors are needed to define the chemical space, so that truly 'similar' molecules will have nearly the same solubility values. With the three outliers removed, the metric improve: r 2 = 0.82, RMSE = 0.76, bias = -0.31, and MPP = 41%.

Prediction of solubility of quaternary ammonium drugs
Quaternary ammonium compounds are salts, and so do not fall into the category of neutral species associated with the log S 0 constants studied here. GSE and RFR methods did not produce satisfactory results (r 2 ~ 0 in both cases) for this subclass of compounds. However, it was possible to come up with a modified ABSOLV model for this small group of molecules (n=22), based on the equation: with r 2 = 0.97 and RMSE = 0.27, where S π in Eq. (4) is the dipolarity/polarizability Abraham descriptor. Figure 15 compares the tested calculations. Strong H-bond donors (acids) decrease solubility, whereas strong H-bond acceptors (bases) have the opposite effect. High dipolarity/polarizability compounds are associate with low solubility.

Summary
The properties of the chemical space of druglike molecules in the Wiki-pS 0 database of intrinsic aqueous solubility were described in considerable detail. The database was used to train two solubility prediction models: multiple linear regression (MLR) and Random Forest regression (RFR). The predictivity of the models was tested with four external sets of compounds. The MLR model incorporated calculated Abraham solvation descriptors (ABSOLV). The RFR model used the aggregate set of T m (mostly measured values), ABSOLV, and RDKit 2D (204 descriptors in all). As a comparative benchmark, the General Solubility Equation (GSE), which requires no training, was used to predict the intrinsic solubility of the Wiki-pS 0 druglike molecules.
For the intrinsic solubility set, excluding the permanently-charged quaternary amines, RMSE calculated as 1.23 (GSE), 1.00 (ABSOLV), and 0.28 (RFR) for the training sets. The intrinsic set was further divided into four subsets, based on dominant charge at pH 7.4: acids(-), bases(+), neutrals(0), and zwitterions(±). The performances of GSE and ABSOLV were comparable for acids, bases, and neutrals, but for the zwitterionic subset, ABSOLV was better.
For the permanently-charged quaternary amines (n=22), both GSE and RFR did not do well (r 2 = 0). It was possible to develop a simplified ABSOLV training-set model using just three of the solvation descriptors.
The above comparisons are not entirely satisfactory tests of the predictivity of the three methods. For the RFR method, the data are randomly separated into a training set (70%) and an internal test set (30%). RMSE = 0.60 and MPP = 76 % 'correct' predictions for the internal test set calculation. For the zwitterionic subset, RMSE = 0.45 and MPP = 91 %.
The four external test sets allowed the comparisons of the three models in a uniform way. Test Set 1 (te1) was compiled by Yalkowsky and Banerjee [18] for testing the GSE. The other three test sets consisted of druglike molecules, all solids at room temperature, containing no agrochemicals. Test Set 2 (te2) molecules were originally used in the first Solubility Challenge [19,103], and Test Sets 3 (te3) and 4 (te4) molecules were used in the second Solubility Challenge [20].
RFR outperformed the other two methods on the whole. When Wiki-pS 0 was augmented with 115 agrochemicals, te1 prediction improved (RMSE = 0.66, MPP = 71 %), and was better than that of GSE. For te2 and te3 drug solubility RFR predictions, RMSE = 0.85 and 0.75, resp., whereas MPP = 50 and 57 %, resp. There were three molecules in te4 that RFR did not predict well: amiodarone, clofazimine, and itraconazole. Apparently, the current database has limited chemical space coverage in the vicinity of these outliers. With the three outliers removed, RMSE = 0.76 and MPP = 41 % for te4.

Conclusion
The GSE is popular for its simplicity and easy of calculation. It is a convenient benchmark against which to assess new prediction methods. Druglike molecules are expected to be predicted by GSE to within 1.1-1.2 log unit, or to within 0.5 log unit 22-42 % of the time. However, its performance with zwitterionic molecules is limited. The ABSOLV method holds the middle position in the comparisons. The RFR method in this study is attractive, both for its predictive performance and ease of use. It is expected to predict druglike molecules similar to those in Wiki-pS 0 to within 0.6 log unit of the measured values, or within 0.5 log unit 76% of the time. The RFR software is freely downloadable from open sources.
Evidently, the evaluated prediction methods cannot match the precision of measured equilibrium solubility data. The methods need to be further enhanced. More discriminating descriptors would be welcome additions to the openly-available collections. As the amiodarone, clofazimine, and itraconazole examples illustrate, there are still under-populated neighborhoods in the chemical space of the currently tested database. How effective Wiki-pS 0 will be in predicting the solubility of newly-synthesized molecules in pharmaceutical research remains to be explored.

Abbreviations and definitions
DOA domain of applicability associated with druglike substances, determined by descriptor or structural (e.g., Tanimoto indices) similarity. DTT Dissolution Titration Template potentiometric method used to determine intrinsic solubility, S 0 HH Henderson-Hasselbalch equation [80]; e.g., for monoprotic base, log S = log S 0 + log ( 10 +pKa -pH + 1 ) OOB "Out-of-Bag" built-in validation set of compounds randomly selected by the RFR method, which have not been used to train the model. pH sat the equilibrium pH of a saturated water solution of compound whose solubility is S w S solubility, ideally expressed in units of mol/L (M), μg/mL, or mg/mL S 0 "intrinsic" solubility (i.e., the solubility of the uncharged form of the compound) S w "water" solubility, defined by dissolving enough pure free acid/base in distilled water (or water containing an inert salt -as ionic strength adjustor) to form a saturated solution. The final pH of the suspension, pH sat , and S 0 can be calculated by the HH equation (when valid), provided the true pK a is known. Compound added in salt form may disproportionate into free acid/base, depending on how much solid had been added. Calculation of the pH and S 0 of such salt suspensions can be uncertain. S pH "pH buffer" solubility (i.e., the total solubility of the compound at a measured equilibrated pH) SSF saturation shake-flask method, the "gold standard" solubility measurement method RMSE root-mean-square error: RMSE = [ 1/n Σ i (y i obs -y i calc ) 2 ] 1/2 , where y obs / y calc = observed/calculated value of log S 0 according to model, n = number of measurements of log S 0 r 2 squared linear correlation coefficient, r 2 = 1 -Σ i (y i obs -y i calc ) 2 / Σ i (y i obs -<y>) 2 , where y = log S 0 , and <y> is the mean value of log S 0 SD standard deviation: SD = [ 1/n Σ i (y i obs -<y>) 2 ] 1/2 , where n = number of measurements, <y> = mean value of log S 0 F F-statistic: F = (n-p-1)/p • Σ i (y i obs -<y>) 2 / Σ i (y i obs -y i calc ) 2 , where p = number of regression parameters MPP Measure of prediction performance [103]. It refers to the percent of 'correct' predictions, as defined by the count of absolute residuals |log S 0 obs -log S 0 calc | ≤ 0.5 divided by the number of measurements. MPP is represented as a pie chart in the correlation plots. ntree number of trees specified in the Random Forest regression (RFR) -typically 500 mtry number of descriptors to use in the node splitting process in RFR -typically a third of the descriptors nodesize minimum number of data points in the terminal node, beyond which no splitting takes place -typically 5 measurements

Abraham solvation descriptors A H-bond total acidity B
H-bond total basicity S π dipolarity/polarizability due to solute-solvent interactions between bond dipoles and induced dipoles E excess molar refraction (dm 3 mol -1 / 10); which models dispersion force interaction arising from π-and n-electrons of the solute V McGowan molar volume (dm 3 mol -1 / 100) A•B acid-base H-bonding product descriptor used in ABSOLV solubility prediction

Complexity descriptors
BertzCT complexity index, based on size, symmetry, branching, rings, multiple bonds, and heteroatoms characteristic of solute [50]. Ipc content information of topological graph [48] -entropy of atomic distribution in solute  Table A5, with citations to the original literature .
Appendix -Calculated results for the three models and a sampling of the database                  Table A5. Continued a a RN -Registry number (CAS). ΔH sol -calculated [9] heat of solubility, used to adjust data to a standard temperature (25 °C). pKacalculated for strongest acid and weakest base groups. Num.pH -number of S pH measurements in the log S -pH profile. *,**,*** indicate small, moderate, extensive concentration of aggregate/complex.   [9] heat of solubility, used to adjust data to a standard temperature (25 o C). pK a -calculated for strongest acid and weakest base groups.

Terfenadine
Num. pH -number of S pH measurements in the log S -pH profile. *,**,*** indicate small, moderate, extensive concentration of aggregate/complex.