Can small drugs predict the intrinsic aqueous solubility of ‘beyond Rule of 5’ big drugs?

The aim of the study was to explore to what extent small molecules (mostly from the Rule of 5 chemical space) can be used to predict the intrinsic aqueous solubility, S0, of big molecules from beyond the Rule of 5 (bRo5) space. It was demonstrated that the General Solubility Equation (GSE) and the Abraham Solvation Equation (ABSOLV) underpredict solubility in systematic but slightly ways. The Random Forest regression (RFR) method predicts solubility more accurately, albeit in the manner of a ‘black box.’ It was discovered that the GSE improves considerably in the case of big molecules when the coefficient of the log P term (octanol-water partition coefficient) in the equation is set to -0.4 instead of the traditional -1 value. The traditional GSE underpredicts solubility for molecules with experimental S0 < 50 μM. In contrast, the ABSOLV equation (trained with small molecules) underpredicts the solubility of big molecules in all cases tested. It was found that the errors in the ABSOLV-predicted solubilities of big molecules correlate linearly with the number of rotatable bonds, which suggests that flexibility may be an important factor in differentiating solubility of small from big molecules. Notably, most of the 31 big molecules considered have negative enthalpy of solution: these big molecules become less soluble with increasing temperature, which is compatible with ‘molecular chameleon’ behavior associated with intramolecular hydrogen bonding. The X-ray structures of many of these molecules reveal void spaces in their crystal lattices large enough to accommodate many water molecules when such solids are in contact with aqueous media. The water sorbed into crystals suspended in aqueous solution may enhance solubility by way of intra-lattice solute-water interactions involving the numerous H-bond acceptors in the big molecules studied. A ‘Solubility Enhancement–Big Molecules’ index was defined, which embodies many of the above findings.

In the early 1990s, attrition due to poor oral bioavailability and pharmacokinetics (PK) was responsible for nearly 40 % of compounds being rejected in clinical studies [21]. Lipinski's Rule of 5 (Ro5) emerged as part of the effort to address critical issues underlying the high attrition [1]. The Ro5 guidelines suggest that compounds are more likely to be orally bioavailable if three or more of these rules are adhered to: molecular weight, M W ≤ 500 Da, calculated octanol-water partition coefficient, clogP ≤ 5, number of H-bond donors, NHD ≤ 5, and number of H-bond acceptors, NHA ≤ 10. High-throughput screening strategies of physicochemical properties of research compounds led to improvements. By the new millennium, attrition due to PK was reduced to below 10 % [22].
However, many recently approved drugs are larger, more lipophilic, and possess more H-bond acceptors, compared to drugs in the Ro5 chemical space [22]. Many of the newly-approved therapeutics are used in immunosuppression, treatment of infectious/viral diseases, and in oncology. Since 2014, an increasing number of 'beyond the Rule of 5' (bRo5) commentaries have stressed that the strict adherence to the Ro5 may result in lost opportunities [21][22][23][24][25][26][27][28][29][30]. Cell-permeable and orally-bioavailable drugs can be discovered far into bRo5 chemical space. Some of these drugs are derived from natural products, which appear to be better suited for the newer targets which possess large and flat binding sites. Nevertheless, concerns have been raised over the expected higher pharmacokinetic risks from bRo5 compounds: low solubility, poor cell permeability, increased cellular efflux, and extensive metabolism. Medicinal chemists have applied tactics to lessen some of the risks: (a) reducing or shielding polarity by N-methylation, or by bulky side chains, (b) selecting compounds with flexible rings structures allowing for conformational lability, and (c) selecting compounds which can reversibly form multiple intramolecular H-bonds (IMHB) to shield polar groups during passage across lipoidal cell barriers, in the manner of 'molecular chameleons' [26][27][28][29][30][31].
Although most of the bRo5 commentaries have emphasized permeability, absorption, and potency topics, Bergström et al. [25] focused on solubility aspects and the promising computational biopharmaceutical modeling strategies to help identify 'formulate-ability' during lead optimization and early development stages of bRo5 compounds. Caron et al. [29] considered case studies of kinetic solubility (measured in pH 7.4 phosphate buffer containing 1-5 % DMSO) of bRo5 molecules, in terms of the tendency to form IMHBs and their effect on solubility.
In our preceding study [20], three methods of solubility prediction were compared: (a) Yalkowsky's General Solubility Equation (GSE) [8], (b) Abraham Solvation Equation (ABSOLV) [11], and (c) Random Forest regression (RFR) [19] statistical machine learning. The linear ABSOLV and the RFR multiple decision-tree methods were trained with molecules in the Wiki-pS 0 database. Thirty of the most important descriptors identified in the RFR analysis [20] were subjected to Principle Components Analysis (PCA). The scores plot had the appearance of a 'comet' -with a dense symmetrical core of Ro5 compounds about the origin of the first two principle components and a long sparsely-populated tail of big molecules queuing far into the lower-right quadrant. The molecules in the tail have high H-bond acceptor strength (NHA), topological polar surface area (TPSA), fraction of sp 3 carbons (FractionCSP3), and possess M W > 800 Da -many of the recognized hallmarks of bRo5 chemical space.
The aim of the present study was to explore to what extent small molecules (mostly Ro5) can be used to predict the intrinsic aqueous equilibrium solubility of big molecules (all bRo5 drugs), i.e., can the 'head predict the tail'?

Computational models
Three computational approaches described below span from the theoretically sound and easy-to-use GSE, the sound and flexible ABSOLV, and the more accurate (but somewhat of a 'black box') RFR.

General Solubility Equation (GSE)
Expanding on the work of Irmann [32] and Hansch et al. [33], Yalkowsky and coworkers in 1980 developed and thereafter popularized the General Solubility Equation (GSE), to enable the prediction of the solubility of organic molecules in water [8,9,[34][35][36][37][38]. Just two variables, melting point (mp in °C) and octanolwater partition coefficient, log P, are used in the equation to predict solubility (in log molar units): Below, the derivation of Eq. (1) is briefly reviewed in terms of its underpinning assumptions to determine if any are incompatible with bRo5 molecules. Also, the thermodynamics of solubility are well cast by Eq. (1), which can apply to all models tested here. It is useful to start by dissecting aqueous solubility in terms of the Gibbs free energy based on the thermodynamic fusion cycle. The dissolution of a non-ionized crystalline substance suspended in water can be viewed in terms of two major contributions: (a) crystal lattice -energy has to be provided to break down the lattice to form a hypothetical 'supercooled' liquid (sliq), and (b) solvation -energy is released when the liquefied solute dissolves in water. The total solubility of the solute in water is the product of the above two contributions (lattice and solvation), which in logarithmic terms can be stated as the sum [36,37]:

Crystal lattice effect
The lattice contribution, first term on the right side of Eq. (2), arises from the application of the van't Hoff equation, where ∆S m (kJ/mol·K) is the entropy of melting (fusion) and T m is the melting point (in K units). By the 'Walden's rule' approximation [36,37], ∆S m = 0.0565 kJ/mol·K for many organic compounds (particularly for rigid planar molecules, but less so for spherical molecules). At

Solvation effect
The solvation contribution, right-most term in Eq. (3), was investigated by Hansch and coworkers [33]. For 156 simple organic liquid solutes, they demonstrated that solubility correlated with octanol-water partition data as described by a Collander-like linear equation: log S = c 0 + c 1 log P. Octanol, possessing nearly identical calculated H-bond donor and acceptor strength, was selected as a model organic solvent. For a series of aromatics, alkyl halides, and alkanes liquid solutes, c 0 intercepts were calculated to be +0.34, +0.83, and -0.25, respectively. The derived c 1 slope factors were -1.0 for aromatics, -1.22 for alkyl halides, and -1.24 for alkanes.
For a liquid solute, log P relates to the Gibbs free energy for the sum of solute-solute and water-water cohesive interactions minus twice the solute-water adhesive interactions [8]. For a liquid solute with the aqueous solubility of S w liq and the solubility in octanol as S oct liq , it can be approximated that P = S oct liq /S w liq (assuming activity equals concentration and that solute aggregates/micelles don't form [39] if the slope c 1 = -1, then the intercept c 0 = log S oct liq . Yalkowsky and coworkers surmised that c 0 = 0.5, by the following reasoning. Entropy of mixing favors complete miscibility of the two liquids; i.e., the mole fraction = 0.5 [8]. (This is likely to be valid for apolar [37], but may not be accurate for large polar molecules like those found in the bRo5 chemical space.) Since the molar concentration of pure octanol is 6.32 M, the log S oct sliq = log (6.32x0.5) = 0.5 (assuming the solute liquid density is near that of octanol). On rearranging the log form of P defined as the solubility ratio, log S w sliq = 0.50 -log P. On substitution of the latter term into Eq. (3), the GSE, Eq. (1), is so derived.
The above considerations suggest that Eq. (1) may have possible limitations in bRo5 chemical space: (i) Lattice energy of rigid-planar molecules may be different from those of spherical molecules; (ii) Octanol as the model for the supercooled liquid solute may not be accurate for large polar or conformation-flexible molecules; (iii) Non-ideal activity may arise due to solute self-aggregation (e.g., dimer formation of vancomycin), possible micellization (e.g., ubiquinone, iodoxamic acid), and 'molecular chameleon' IMHB effects [29,31].

Using Calculated log P (clogP)
The original two variables (mp , log P) were taken to be experimental values. In a pharmaceutical research setting, such experimental values may not be available in early studies. It has become a common practice to use calculated values, clogP, in place of measured log P in Eq. (1). Apparently, the accuracy of GSE lessens, but only slightly. The use of calculated mp is less frequent, since the accuracy of such predicted values is thought to be relatively low. In the present investigation, experimental values were applied when available, and were calculated in a small number of instances [40].

Abraham Solvation Equation (ABSOLV)
Abraham and Le [11] amended the Abraham Solvation Equation [41] to predict intrinsic solubility (log molar): The independent variables are the five solute descriptors accounting for the transfer of solute from one phase to another: A is the sum of H-bond acidity (similar to NHD), B is the sum of H-bond basicity (similar to NHA), S π is the dipolarity/polarizability (subscripted here, so as not to be confused with solubility), E is an excess molar refraction in units of (cm 3 ·mol -1 )/10, and V is the McGowan characteristic volume in units of (cm 3 ·mol -1 )/100.
In principle, the five solute variables could account for any shortcomings of just using clogP, as in Eq. (1). The A•B cross-term in Eq. (4) was intended to address intermolecular H-bond interactions between acid and base functional groups in the solid or liquid environment. Its inclusion, as an alternative to using the mp term in Eq. (1), was intended to improve the prediction accuracy of Eq. (4).
The c 0 -c 6 coefficients in Eq. (4) are usually determined by multiple linear regression (MLR), trained on a set of intrinsic solubility values of a diverse collection of molecules. The five Abraham solvation descriptors may be calculated from 2D structure (introduced as a SMILES text or as coordinates in a 'mol' format) using the program ABSOLV [42] (cf., www.acdlabs.com). In the present study, the seven MLR coefficients were redetermined using our own training data (Wiki-pS 0 database), with log S 0 data weighted in the regression analysis according to estimated measurement errors [20].
Furthermore, we attempted to improve the accuracy of Eq. (4) when applied to big compounds, by introducing a nonlinear term, Due to potentially high linear correlations among the descriptors, partial least squares (PLS) regression (open source package in R: https://cran.r-project.org/web/packages/pls/) was used (instead of MLR) to determine c 0 -c 7 for given values of z. Several z values in the range 0.9 to 2.0 were tested; the best-fit exponent was selected as the minimum point in the fit of PLS root-mean-square error (RMSE) vs. z.

Random Forest regression
Of the new machine-learning statistical approaches, the Random Forest regression (RFR) method is thought to be one of the most accurate in predicting solubility [17][18][19][20]. RFR can be employed 'off the shelf,' requiring only minimal learning [19]. The provided default 'tuning' parameters are nearly optimal. However, its 'black box' nature makes the outcome of the analysis challenging to interpret in terms of the descriptors used, even when the most important descriptors are quantitatively ranked in RFR.
The method was introduced in 2001 by Brieman [43], and is implemented in the open-source 'randomForest' library for the R statistical software [44][45][46]. RFR works by constructing an ensemble of hundreds of decision trees [47]. The tutorial chapter by Walters [19] is highly recommended as a means to get started with RFR.
The first applications of RFR to predict solubility appeared in 2007 [17,48]. Schroeter et al. [48] used S w and S pH data to train a RFR method, using ~4000 measurements mostly taken from secondary sources [12,49,50] and some from in-house (Bayer Schering Pharma) sources. For the Huuskonen data [12] as test set, RMSE = 0.66 log unit (n=1290) was reported. For the solubility data in the domain of applicability (DOA) matching that of research compounds (10 -3 to 10 -7 M solubility), the RFR method indicated RMSE ~ 0.85 log.
In the Palmer et al. [17] RFR study, aqueous solubility values of 998 structurally diverse druglike solid organic compounds were gathered from similar secondary sources [12,51,52]. The authors used the Molecular Operating Environment (MOE) [53] to generate 126 2D (clogP, MR, charged-surface properties, atom, group, and H-bond counts, connectivity and topological indices) and 36 3D (total potential energy, electrostatic contributions, molecular shape, and solvent-accessible surface area) descriptors. Randomly splitting the entire data into a training set (70 %) and an internal validation set (30 %) produces a good measure of the model predictivity of compounds not included in the training set: r 2 = 0.89, RMSE = 0.69 log, n = 330.

Wiki-pS 0 database
The intrinsic aqueous solubility database Wiki-pS 0 (in-ADME Research) [20,56] was used. It now contains 6473 log S 0 (log molar) entries, based on measured aqueous solubility values of 3065 different compounds (excluding agrochemicals) collected from 1415 cited references. The most reliable published data had been determined by the saturation shake-flask (SSF) method, particularly when measured as a function of pH. In the majority of the cases, the literature data were further processed, using pDISOL-X (in-ADME Research) [56][57][58][59][60][61], to extract intrinsic solubility (S 0 ) values from reported aqueous free-acid/base or salt solubilities (S w ), or solubilities at specified pH (S pH ), or log S-pH profiles. All of the molecules are solids at room temperature (except propofol). There are 1127 log S 0 entries derived from 10167 individual-pH log S measurements. About half of the data sources originate from secondary listings and the rest are from primary sources. In the case of secondary sources, the citations to the original work were generally available, and in many cases were consulted for clarifications. Melting points are included in the database. When measured mp were not found (19 % of entries), mp were calculated by the Lang and Bradley method [40] in the QsarDB open repository of data and prediction programs (http://qsardb.org/repository/predictor/10967/104?model=rf).

Physicochemical properties of the big molecules
In this study, the compounds in Wiki-pS 0 were divided into two groups: 'big' (M W ≥ 800 Da, structures in Appendix) and 'small' molecules. The demarcation was motivated by the shape of the principle components scores plot ( Fig. 10 in [20]). There are 31 molecules (58 log S 0 entries) in the 'big' set. Table 1 lists their characteristic properties. Figure 1 shows the distribution of big-molecule log S 0 values. On the average, the big molecules are less soluble (-4.52) than the small molecules (-3.12).  Figure 2 shows the trend between measured log S 0 and clogP (calculated in RDKit [62]: Wildman-Crippen sum of atomic contributions -cf., Abbreviations and definitions) for the two groups of molecules. The scatter is substantial. Nevertheless, the small molecules (green circles) show the expected -1 slope, whereas the big molecules (red squares) show an apparent slope of -0.239. This is an important characteristic differentiating the two groups.  [20]. Frame (b) shows the distribution of molecular weights about the mean value 1034 Da (compared to 280 Da in the entire set [20]). Frame (c) considers H-bonding characteristics. The red bars (tallest near 5) refer to H-bond donor counts (NHD). The black bars (tallest near 15) refer to H-bond acceptors (NHA). In the small-molecule set, the NHA and NHD groups overlap considerably, as illustrated elsewhere [20]. But, in the big-molecule set (Fig. 3c, Table 1) the NHA and NHD distributions are wider apart: the acceptor count increases, but not so much the donor count. This is an important characteristic differentiating the big-small molecule groups. The separation between the groups is greater than that found in small molecules [20].

Hydrophilicity (clogP) effect in big lipophilic molecules
The linear dependence of log S 0 on clogP (Fig. 2)  The extent of 'correct' predictions is defined here by MPP (measure of prediction performance: percentage of the absolute residuals ≤ 0.5 log unit).
Apparently, crystal lattice contributions are not appreciably different in the two groups of molecules; the refined temperature coefficients in the two sets are close to the GSE value (-0.01) in Eq. (1). Hence, solution-phase interactions appear to dominate solubility [98].
The intercept constants suggest that big-molecule 'supercooled' liquid solutes are less miscible in octanol by 1-2 orders of magnitude than suggested by the original Yalkowsky analysis [8,37]. The intercepts in Eqs. (6a) and (6b) are nearer to those of alkane solutes found in the Hansch et al. [33] study, compared to the constant in Eq. (1). For the big molecules, the highly negative intercept (i.e., decreased solubility of the supercooled liquid in the octanol phase) depresses the solute water solubility by a constant amount.
Countering that, the -0.4 slope factor lessens the contribution of lipophilicity to the calculated solubility of big molecules. The net result is that the traditional GSE overpredicts S 0 for big molecules with experimental solubility above ~50 µM (e.g., oxytocin, nafarelin), and underpredicts S 0 below the crossover point (e.g., everolimus, telithromycin). Figure 4a shows the relationship between the measured solubility of small molecules and that calculated by the classic ('untrained') GSE. (Permanently-charged quaternary amines and big molecules are excluded in the training.) The r 2 , RMSE, MPP statistics are nearly identical to those associated with Eq. (6a), suggesting that 'training' does not improve the GSE predictivity for small molecules.

General Solubility Equation (GSE)
However, the performance of the 'untrained' GSE degrades when the equation is applied to the big molecules, as shown in Figure 4b, with r 2 = 0.0, RMSE = 3.0 log (2.3 without ubiquinone), and MPP = 16 %. Figure 4c plots the big-molecule 'trained' GSE result (cf., Eq. 6b). Note that this is not the equivalent of Ro5 molecules predicting the solubility of bRo5 molecules. Rather, it highlights the hydrophilicity solvation effect of big molecules discussed in the preceding section. The traditional GSE requires adjustments when it comes to predicting the solubility of big molecules (cf., Fig. 4b). (c) When using just the big-molecule data, the three constants in the GSE (Eq. 1) subjected to MLR analysis (cf., Eq. 6b) produce the modified GSE, which is valid only for molecules with MW > 800 Da. There are not enough big molecules in the Wiki-pS 0 database to test the predictiveness of Eq. (6b). The plot of measured log S 0 as a function of the calculated values according to Eq. (7) is shown in Figure 5a. This trained ABSOLV model only slightly outperforms the small-molecule untrained/trained GSE model (Fig. 5a/Eq. 7 compared to Fig. 4a/Eq. 6a). The A•B cross term contribution appears to be negligible.

Abraham solvation model (ABSOLV) -weighted MLR to predict solubility of big compounds
The application of Eq. (7) to the big-molecule set produced unidirectionally-skewed plot, as shown in Figure 5b. According to the ABSOLV model trained on small molecules, the solubility of all big molecules is underpredicted. For example, the gramicidin A measured log S 0 = -4.16 ±0.41 is underestimated by 10 orders of magnitude. Vancomycin is underestimated by nearly 5 orders of magnitude.
An effort was made to improve the fit. A distinguishing characteristic of big compounds is that they contain a high level of H-bond basicity (B) character (Table 1). We tested several nonlinear contributions of the B descriptor, with the aim of amplifying its uniquely high positive impact on solubility (Eq. 7). In order to avoid difficulties due to descriptor correlations, PLS regression was used in place of MLR. The modified model, depicted in Figure 5c, is the best improvement that was found. The modified solvation model consisted of an additional nonlinear term, B +z , with z > 1. The best-fit value of z was determined to be 1.11. This new descriptor was expected to amplify the positive H-bond acceptor contribution in Eq. (7) in the case of big molecules. Other modifications were explored, but only the latter descriptor appeared to improve ABSOLV to a level slightly better than that of the classic GSE (Fig. 4b).
On inspection, the systematic errors in Figure 5b  nROT to Eq. (7) reduced the RMSE from 3.4 to 1.6 and the bias from 2.6 to 0.13 (r 2 remained unchanged). However, this did not result in a significantly improved training-set model when nROT was added to the list of ABSOLV descriptors in a repeated PLS analysis. Flexibility appears to be important, but nROT is not significantly predictive in the training process. Caron et al. [30] demonstrated that nROT may have limitations because it neglects the contribution to flexibility from cyclic fragments in big molecules.

Random forest regression using RDKit combined with Abraham descriptors and melting points
Descriptors For the RFR model building, the 190 RDKit [62] descriptors (excluding those which were zero for all compounds) were combined with the mp and the ABSOLV descriptors. The Abbreviations and definitions section below identifies and defines the most important descriptors used in the RFR algorithm.  Training set and internal validation Figure 6a shows the small-molecule training-set RFR analysis, resulting in the metrics: r 2 = 0.98, RMSE = 0.27 log, bias = -0.001. This quality of fit indicates how well the model can incorporate the information presented by the descriptors and relate it to solubility in the training set [18]. The internal validation set of 1925 small-molecule solubility values (30 %), randomly selected by the method, better indicates the ability of RFR to predict external test compounds which are unknown to the training process. Figure 6b shows the internal validation test set prediction results: r 2 = 0.89, RMSE = 0.64 log, bias = 0.017. This performance is to be expected for external test molecules, provided they are adequately represented in the chemical space of the training set.
Big-molecule external test set prediction Figure 6c illustrates the degree to which the RFR method, trained by small molecules, can predict the solubility of big molecules. The relative accuracy of the prediction (r 2 = 0.42, RMSE = 1.06 log, MPP = 42 %) evidently exceeds that of the GSE and ABSOLV methods (predictive r 2 = 0 and RMSE > 2 log). To wit, small molecules in the training set provide enough subtle 'clues' for the method to extract a sensibly accurate prediction of big-molecule solubility.
In RFR, relationships between descriptors and the model are difficult to extract, and the influence of each compound property on calculated solubility cannot be readily deduced [98]. A major disadvantage to a medicinal chemist is that the RFR result does not directly suggest how compounds could be altered to increase/decrease their solubility. Unlike the intuitive and appealing descriptors in GSE and ABSOLV, many of the RDKit descriptors used are more abstract and not easy to interpret regardless of the modeling method [99]. Table 2 lists the calculated log S 0 values of the big molecules. The last column lists the 'Solubility Enhancement-Big Molecules' -the ratio of the observed S 0 to that calculated by the ABSOLV approach (cf., Eq. 7). The scale quantifies the big-molecule solubility enhancement not predicted by small molecules. A similar ratio using the classic GSE indicates two zones: (a) 'enhancement' for compounds to the left of the identity diagonal line in Figure 4b, and (b) 'attenuation' for compounds to the right of the line. The GSE zoning is directly linked to the partition coefficient (cf., Fig. 4c). The ABSOLV-based SEBM assigns a unified enhancement to all molecules, and separately addresses the role of H-bonding and molecular size (as well as the other Abraham solvation descriptors), whereas the GSE confines the relationship mainly to one descriptor -clogP, whose value may not be accurately calculated or measured for large molecules (e.g., ubiquinone). Figure 7 is a plot of log SEBM as a function of nROT. Although noisy, a trend is evident. The unfilled circles in the figure refer to two external test compounds, big molecules recently approved as drugs: givosiran [100] and tenapanor [101], with M W 1711 and 1145 Da, respectively.

Factors that may shed light on the unusual intrinsic aqueous solubility of big molecules
Lipophilicity behavior of big vs. small molecules differs From the GSE analysis, the notable characteristic distinguishing small from big molecules is the dependence on lipophilicity (Fig. 2, Eq. 6b). Big lipophilic molecules (ubiquinone, iodipamide, everolimus, telithromycin) are more soluble and big hydrophilic molecules (oxytocin, stevioside, nafarelin) are less soluble than predicted by the traditional GSE (cf., Fig. 4b). The empirical Eq. (6b) compensates for this tilted relationship with the less negative (-0.4) clogP factor and the more negative intercept factor (-1.77) than those in the GSE (-1 and 0.5, resp.), as illustrated in Figure 4c. The solubility-partition correlation using octanol works well for small molecules, but octanol does not appear to match the big-molecule solubilitypartitioning behavior in the same way, either because the big molecules are uncharacteristically more soluble in water (extra strong solute-water adhesive interactions) and/or less soluble in the octanol phase (extra strong solute-solute cohesive interactions). Ermondi et al. [27,28] estimated lipophilicity of nine bRo5 drugs using the well-tested small-molecule ElogP and the new 'block relevance' BRlogP chromatographic methods, to investigate the role played by molecular flexibility. They also subjected the molecules to conformational analysis, in order to calculate lipophilicity of various conformers. ElogP chromatographic method appeared to provide an environment in which flexible compounds are driven to assume a more 'folded' apolar conformation (as expected in octanol), whereas the BRlogP method favored an 'extended' polar conformation for such molecules (as expected in water). Lipophilicity of bRo5 compounds strongly depends on their chameleonic properties: closed form preferred in apolar environments and open form in aqueous media. It is suggested that a non-traditional lipophilicity scale is needed for many bRo5 compounds, which takes into account the solute conformational flexibility and the polarity of the dissolution media [27,28]. The Appendix shows the 2D structures of the big molecules selected for the study. Many of these are derived from natural products, possessing flexible cyclic and polycyclic components in there structures. The crystal structures of only about half of these molecules have been deposited in the Cambridge Crystallographic Data Centre (CCDC). In many of the compounds the crystal lattices contain internal void space that may be filled non-stoichiometrically with water, either fixed at certain positions by H-bonds, or mobile in channels. There are numerous sites with which water could interact by donating H-bonds, possibly competing with acceptor groups in IMHB networks, to form stoichiometric hydrates.
Since most of the crystals chosen for structure determination were grown in semi-or non-aqueous media, the reported X-ray structure of the molecules may not precisely reflect the conformational state found in crystals under conditions where they are equilibrated in aqueous solution, or of dissolved molecules in their unhindered states of hydration. In an exceptional study, the aqueous environment was well mimicked in the synchrotron X-ray determination of the structure of the glycopeptide antibiotic vancomycin [102]. Vancomycin crystals grown by the 'hanging drop' method were transferred into a pH 4.6 acetate buffer solution containing 2.2 M NaCl and a cryoprotectant solvent. The suspension was then flash frozen for the low-temperature data collection. The crystal lattice was found to contain an H-bonded dimer of vancomycin, 2 chlorides, 1 acetate, and 105 solvent water molecules in the asymmetric unit. The organization of the lattice water was not described in the publication.
Zografi and coworkers have conducted pioneering research [103][104][105][106] on the influence of adsorbed and absorbed water on the solid state properties of crystalline/amorphous solids, including multicomponent forms such as drug salts and cocrystals. The presence of stoichiometric/nonstoichiometric water in crystalline solids is expected to impact the thermodynamic activity of the solid and thus could affect equilibrium solubility.

Enthalpy of solution of big molecules is negative
A computational procedure to normalize solubility data determined at various temperatures to values at a 'reference' temperature (e.g., 25 °C) was recently described [107]. The enthalpies of solution, ΔH sol , were predicted from 2D structure, from which the temperature dependence of log S 0 is calculated as: For big basic molecules this means that as temperature rises, the solubility decreases. If water is sorbed into the void/channel spaces of crystals containing big molecules, then the negative enthalpy could be rationalized in terms of H-bonding effects. Since water H-bonds weaken with rising temperature, the proportion of the 'extended' (water soluble) conformer may shift in favor of the 'folded' conformer, which is expected to be less soluble in water. With weakened water binding, the intramolecular H-bond interactions may stabilize the structure in a folded form. In this way, negative enthalpy is consistent with the conformational flexibility of 'molecular chameleons' [26][27][28][29][30][31], and highlights the possible role of sorbed water influences on solubility.

Conclusion
We have shown that traditional approaches to the prediction of solubility of big molecules (bRO5) do not work very well, unless modified. On the other hand, the RFR method works reasonably well, but it is not easy to understand what specific contributions the various molecular descriptors provide to the overall prediction.
We attempted to link the Solubility Enhancement-Big Molecules (SEBM) to other physicochemical properties. A trend was evident in the log SEBM vs. nROT plot, suggesting that flexibility appears to enhance the solubility of big molecules. In the SGE model, a different lipophilicity scale might improve the performance of the approach, as empirically suggested in Figure 4c and as suggested by the chromatographic studies of Ermondi et al. [27,28]. The introduction of a nonlinear H-bond basicity term in the case of the ABSOLV approach is empirical, and it is not clear how to relate it to first-principle thermodynamic treatment.
Most of the big molecules have negative enthalpy of solution. That is, their solubility decreases with increasing temperature. This hints of an important H-bonding role for water sorbed into the solid state of the large molecules. Such molecules appear to have void spaces in their crystal lattices, sufficient to accommodate many water molecules under equilibrium conditions with crystals wet by aqueous media.
The observation that the RFR method appears to work encourages us to further search for 3D-based descriptors arising from 'conformational lipophilicity' analysis akin to that developed by Caron and coworkers [27][28][29][30]. The accurate prediction of the solubility of newly approved molecules originating from the bRo5 chemical space would help in selecting/prioritizing candidates in early drug discovery, particularly if the bRo5 molecular basis of solubility were better understood.

Abbreviations and definitions
S 0 "intrinsic" solubility (i.e., the solubility of the uncharged form of the compound) RMSE root-mean-square error: RMSE = [ 1/n Σ i (y i obs -y i calc ) 2 ] 1/2 , where y obs / y calc = observed/calculated value of log S 0 according to model, n = number of measurements of log S 0 r 2 squared linear correlation coefficient, r 2 = 1 -Σ i (y i obs -y i calc ) 2 / Σ i (y i obs -<y>) 2 , where y = log S 0 , and <y> is the mean value of log S 0 SD standard deviation: SD = [ 1/n Σ i (y i obs -<y>) 2 ] 1/2 , where n = number of measurements, <y> = mean value of log S 0 F F-statistic: F = (n-p-1)/p · Σ i (y i obs -<y>) 2 / Σ i (y i obs -y i calc ) 2 , where p = number of regression parameters MPP Measure of prediction performance [108]. It refers to the percent of 'correct' predictions, as defined by the count of absolute residuals |log S 0 obs -log S 0 calc | ≤ 0.5 divided by the number of measurements. MPP is represented as a pie chart in the correlation plots.

Abraham solvation descriptors
A H-bond total acidity B H-bond total basicity S π dipolarity/polarizability due to solute-solvent interactions between bond dipoles and induced dipoles E excess molar refraction (dm 3 mol -1 / 10); which models dispersion force interaction arising from π-and n-electrons of the solute V McGowan molar volume (dm 3 mol -1 / 100) A•B acid-base H-bonding product descriptor used in ABSOLV solubility prediction

Most important RDKit descriptors in RFR analysis
Subdivided Surface Area Molecular Descriptors [109] LabuteVSA sum of atomic contributions [110] to the accessible van der Waals surface area MolLogP sum of atomic contributions to octanol/water partition coefficient, log P MolMR sum of atomic contributions to molar refractivity, MR SMR_VSAk sum of accessible van der Waals surface area for those atoms with atomic contribution to molar refractivity; k refers to a small domain of atomic-contribution to MR; intended to capture molecular size & polarizability PEOE_VSAk intended to capture direct electrostatic interactions in a particular range; based on iterative equalization of atomic orbital electronegativities [111].

Availability of the Wiki-pS 0 Database
The entire Wiki-pS 0 database is planned to be released in book form: A. Avdeef. Intrinsic Aqueous Solubility Data for Pharmaceutical Research. Wiley-Interscience, Hoboken, NJ (under discussion with publisher). A sampling is presented in Table A5 in [20].