Advanced data mining technologies
Konstantin Balakin and Nikolay Savchuk look at advanced data mining for the assessment of the ADMET properties of potential drugs
Advanced cheminformatics technologies, aimed at selecting suitable screening candidates with optimised absorption, distribution, metabolism, excretion and toxicity (ADMET) profiles, are in great industrial demand. Poor pharmacokinetics and toxicity are the most important causes of costly late-stage failures in drug development, and it is widely recognised that these issues should be addressed as early as possible in the drug discovery process.
Despite the development of a variety of medium- and high-throughput in vitro ADMET screens in recent years, modern high-throughput synthesis and screening technologies have enormously increased the number of compounds for which early ADMET data are needed. This paper describes several in silico approaches that improve the ability to predict some important pharmacokinetic, metabolic and toxicity endpoints.
The approaches are based on an advanced algorithm of dimensionality reduction and data visualisation: Sammon non-linear maps. The developed models allow virtual ADMET profiling of combinatorial libraries and selecting compounds for in vitro/in vivo testing. This approach will improve the decision-making process of compound selection in drug discovery.
non-linear maps
CDL's method is based on a statistical data mining approach, which extracts information from knowledge databases of compounds with experimentally determined properties of interest. Molecular features encoding the relevant physico-chemical and topological properties of compounds were calculated from 2D or 3D molecular representations.
After the relevant molecular descriptors were calculated and selected, a non-linear Sammon map was generated and analysed. Non-linear Sammon mapping is a multivariate statistical technique that approximates local geometric relationships of a multidimensional property space on a two-dimensional plot.
Sammon maps represent all relative distances between all pairs of compounds, and the distance between two points on the map directly reflects the similarity of the compounds - i.e. this methodology creates a 2D image of a multidimensional space.
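The mapping step can be sketched in a few lines of Python. This is not the authors' implementation, whose details are not given here. It minimises the standard Sammon stress, that is, the sum over all pairs of (d*_ij - d_ij)^2 / d*_ij divided by the sum of d*_ij, where d*_ij is the distance between two compounds in descriptor space and d_ij their distance on the map. For simplicity it uses plain gradient descent with a basic step-size control, whereas Sammon's original procedure uses a Newton-like update. All function names and descriptor values are hypothetical.

```python
import math
import random

EPS = 1e-12  # guards against division by zero for coincident points


def pairwise_distances(points):
    """Full matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def sammon_stress(d_high, d_low):
    """Sammon stress: normalised, distance-weighted squared error over all pairs."""
    n = len(d_high)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    scale = sum(d_high[i][j] for i, j in pairs)
    error = sum((d_high[i][j] - d_low[i][j]) ** 2 / max(d_high[i][j], EPS)
                for i, j in pairs)
    return error / max(scale, EPS)


def sammon_gradient(d_high, coords):
    """Gradient of the stress with respect to every 2D map coordinate."""
    n = len(coords)
    scale = sum(d_high[i][j] for i in range(n) for j in range(i + 1, n))
    grad = []
    for i in range(n):
        gx = gy = 0.0
        for j in range(n):
            if i == j:
                continue
            d_star = max(d_high[i][j], EPS)
            d_map = max(math.dist(coords[i], coords[j]), EPS)
            weight = (d_star - d_map) / (d_star * d_map)
            gx += weight * (coords[i][0] - coords[j][0])
            gy += weight * (coords[i][1] - coords[j][1])
        grad.append((-2.0 * gx / max(scale, EPS), -2.0 * gy / max(scale, EPS)))
    return grad


def sammon_map(descriptors, n_iter=300, seed=0):
    """Return 2D coordinates and the history of accepted stress values."""
    rng = random.Random(seed)
    d_high = pairwise_distances(descriptors)
    coords = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in descriptors]
    stress = sammon_stress(d_high, pairwise_distances(coords))
    history = [stress]
    step = 1.0
    for _ in range(n_iter):
        grad = sammon_gradient(d_high, coords)
        trial = [(x - step * gx, y - step * gy)
                 for (x, y), (gx, gy) in zip(coords, grad)]
        trial_stress = sammon_stress(d_high, pairwise_distances(trial))
        if trial_stress < stress:      # accept only improving moves
            coords, stress = trial, trial_stress
            history.append(stress)
            step *= 1.2
        else:                          # otherwise shrink the step and retry
            step *= 0.5
    return coords, history


# Made-up, pre-scaled descriptor vectors for six hypothetical compounds.
compounds = [
    (0.1, 0.2, 0.1, 0.3), (0.2, 0.1, 0.2, 0.2), (0.1, 0.3, 0.2, 0.1),
    (0.9, 0.8, 0.9, 0.7), (0.8, 0.9, 0.7, 0.9), (0.9, 0.7, 0.8, 0.8),
]
coords, history = sammon_map(compounds)
print(f"stress: {history[0]:.4f} -> {history[-1]:.4f}")
```

Because every pair of compounds contributes to the stress, the cost of each iteration grows with the square of the library size. In practice the descriptors would also need to be scaled to comparable ranges before distances are computed.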
unbiased manner
Among the dimensionality reduction techniques that have appeared in the statistical literature, Sammon non-linear mapping is notable for its conceptual simplicity and its ability to reproduce the topology and structure of the data space in a faithful and unbiased manner. The method has clear practical value and can be recommended for the analysis of medium-sized combinatorial libraries (up to several thousand compounds) when the aim is to select subsets with enhanced knowledge-based informational content.
The following examples illustrate the development of several in silico models for prediction of some important ADMET properties of small molecule compounds, such as human intestinal absorption, blood-brain barrier permeability, binding affinity to the active sites of the metabolising enzymes, plasma protein binding and specific toxicity. All the calculations were performed using the ChemoSoft software suite.
Human intestinal absorption (HIA) is a major issue in the design, optimisation and selection of candidates for development as orally active pharmaceuticals. Combinations of several molecular descriptors believed to be important for HIA have been used in various multivariate analysis studies. Physico-chemical properties of drugs, such as lipophilicity, molecular weight, ionisation profile and H-bonding capacity, determine the extent to which drugs can cross cellular barriers by passive diffusion. In general, the molecular properties affecting HIA via passive diffusion are well understood, and the models reported to date describe this phenomenon adequately.
Nevertheless, while much effort continues to be expended in this field with some success on existing datasets, perhaps the most pressing need at this time is for larger, high-quality sets of experimental data and for an effective data mining algorithm to provide a sound basis for model building.
CDL has developed an in silico model for the early recognition of compounds with poor intestinal absorption. The model is based on a relatively large training set consisting of ca. 300 drugs with known values of HIA and several calculated molecular descriptors encoding the above-mentioned properties crucial for effective penetration through biological membranes.
After the non-linear map was generated, CDL investigators observed statistically significant differences in the molecular properties of well-absorbed (HIA>80%) and poorly absorbed (HIA<20%) drugs (Figure 1).
These categories of compounds occupied distinctly different areas on the map, and these differences can be used to assess the HIA profile of novel compounds. The model is applicable as an in silico screening tool at early discovery stages, for example for candidate selection in hit-to-lead libraries or at the design and selection stage of diverse and focused library construction.
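One hypothetical way to turn such map regions into a prediction is a nearest-neighbour vote on the 2D coordinates. The sketch below labels training drugs with the HIA thresholds quoted above and assigns a novel compound the majority class of its nearest labelled neighbours. It assumes the novel compound's map coordinates are already available (for example, from regenerating the map together with the training set); the text does not say how new compounds are placed on the map, and the voting rule, function names and data values are illustrative assumptions.

```python
import math
from collections import Counter


def hia_class(hia_percent):
    """Label a training drug using the thresholds quoted in the text."""
    if hia_percent > 80:
        return "well absorbed"
    if hia_percent < 20:
        return "poorly absorbed"
    return None  # intermediate values are left unlabelled


def predict_from_map(query_xy, labelled_map, k=3):
    """Majority vote among the k nearest labelled points on the 2D map."""
    nearest = sorted(labelled_map,
                     key=lambda item: math.dist(query_xy, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up map coordinates and HIA values (%) for illustration only.
training = [((0.1, 0.2), 95), ((0.2, 0.1), 88), ((0.3, 0.3), 91),
            ((2.0, 2.1), 5), ((2.2, 1.9), 12), ((1.9, 2.3), 8),
            ((1.0, 1.0), 50)]
labelled_map = [(xy, hia_class(hia)) for xy, hia in training
                if hia_class(hia) is not None]
print(predict_from_map((0.25, 0.15), labelled_map))
```

An odd value of k avoids tied votes when there are only two classes.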
Optimisation of the distribution of therapeutic compounds between brain and blood is a very important issue in the design of CNS-active drugs. For drugs targeted for CNS, blood-brain barrier (BBB) penetration is a necessary attribute unless invasive or carrier-based strategies are being considered.
On the other hand, for drugs aimed at other sites of action, BBB permeation is undesirable because it can produce unwanted side-effects. Considering the great importance of the problem, a reliable method for the pre-synthetic assessment of BBB permeability is a requirement in the discovery of CNS-active agents.
CDL created a robust qualitative model for the early assessment of the BBB permeability of small, therapeutically relevant molecules. The methodology is based on the extraction of knowledge from a large number of literature sources on BBB permeation of organic compounds.
Blood-brain barrier
A comprehensive set of experimental BBB-permeability data on more than 500 compounds was collected. It was assumed that only passive diffusion mechanisms are involved in the BBB transport of these compounds. Statistical analysis enabled the selection of an optimal set of molecular descriptors for the effective prediction of BBB penetration. A projection of the combined data set of well and poorly permeable compounds onto a Sammon map is shown in Figure 2.
The BBB-permeable compounds occupy a distinct area on the map, substantially different from the regions occupied by poorly permeable agents. The map can therefore be used to assess BBB permeability, and hence to select candidates. The constructed learning model is useful in constraining the size of libraries of potential CNS-active agents.
The company has also used non-linear maps to assess the ability of drugs and drug-like compounds to bind to cytochrome P450 enzymes (CYPs). This approach was applied to the CYP-specific classification of nearly 500 drug compounds. Chemists observed statistically significant differences in the molecular properties of strong (Km<10) and weak (Km>100) binders for various CYPs. The weak and strong binders occupied distinctly different areas of the map for all the groups. For illustration, the Sammon map in Figure 3 shows the areas occupied by strong and weak binders for the CYP3A4 isozyme, which represents the largest isozyme-specific group in the studied dataset.
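The text does not say which statistical test was used to compare the two classes. As one possible illustration, a permutation test on a single descriptor checks whether the observed difference in means between strong and weak binders could plausibly arise from a random split of the pooled values. The function name and the descriptor values below are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of class labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed split itself, so p is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical lipophilicity values for two classes of binders.
strong_binders = [3.9, 4.2, 4.5, 4.1, 3.8, 4.4]
weak_binders = [1.1, 0.8, 1.5, 1.2, 0.9, 1.3]
p = permutation_p_value(strong_binders, weak_binders)
print(f"p = {p:.4f}")
```

When many descriptors are tested in this way, a correction for multiple comparisons would normally be applied before any of them is called significant.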
drug interactions
There is a strong correlation between the P450 binding affinity of drugs and their reversible competitive inhibition of P450 enzymes. CYP inhibition is thought to be the most common cause of drug-drug interactions, and several prominent drugs have been withdrawn from the market because of such undesirable effects. Although inhibition of CYP enzymes in vitro is not necessarily associated with drug-drug interactions in clinical studies, lead compounds with weak CYP450 inhibition are favoured in drug development on the basis of these considerations.
Reliable in silico methods for assessing CYP450 inhibition can provide a valuable complement to early stage selection of lead compounds. It should be noted that modelling criteria for CYP inhibition should be flexible, and among other ADMET factors, such as absorption, metabolism and toxicity, CYP inhibition is relatively more important only in disease states treated with co-administered drugs.
Plasma proteins are the major vehicles for the transport and buffering of drug compounds. An understanding of drug-target and drug-plasma protein binding characteristics throughout the drug development process is essential in the ADMET evaluation of novel drug candidates. The clinical potential of drug compounds is greatly affected by the nature of their interactions with circulating plasma proteins, such as human serum albumin (HSA) and α1-acid glycoprotein (AGP). Plasma protein binding varies greatly and affects the free concentration of a drug in circulation, its transport and distribution in the body's tissues, and the duration of drug action.
CDL has developed an approach to in silico classification of drugs and drug-like compounds according to their binding affinity to plasma proteins (in relation to multi-protein binding). Using non-linear Sammon maps, it completed a knowledge-based classification analysis of more than 400 drugs with experimental %-bound values. Figure 4 demonstrates that strong and poor binders occupy clearly different sites on the map.
This method can also be recommended as an efficient in silico tool for early evaluation of some important pharmacokinetic parameters of drug compounds, such as volume of distribution and plasma clearance.
available data
With increasing experimental toxicity data available, more reliable predictions based on quantitative structure-toxicity relationships (QSTR) of toxic endpoints can be made. A number of predictive methods have been compared from a regulatory perspective.
Various QSTR approaches include the use of multiple linear regression or neural networks. Expert systems such as DEREK are yet another approach. New technologies are based on the advances in understanding and analysing the effects of chemical compounds on gene expression and processing. However, all these approaches can serve to give a first alert, rather than being truly predictive.
Using the available data on the toxicological profiles of organic compounds, CDL has generated a computational model for the assessment of specific toxic effects. As an illustration, the non-linear Sammon map in Figure 5 demonstrates that compounds with cardiac toxicity can be differentiated from those with gastrointestinal toxicity.
Interestingly, the map was generated with the use of only physico-chemical descriptors, although the toxic effects are often associated with the presence of some specific structural fragments.
Probably, the discovered differences in properties of these groups of toxicants reflect their different tissue distribution profile, which is a function of physico-chemical rather than structural parameters.
Pharmaceutical lead discovery and optimisation have now become an integrated process in which in vitro, in vivo and in silico methods should be considered simultaneously.
The described models are based on data collected from literature sources and can be used as efficient 'pre-screen' tools to limit the need for future high-capacity metabolism, pharmacokinetics and toxicity studies. For example, these models can be used as ADMET filters for compounds prior to further pharmacological testing, where the capacity for such testing is unable to cope with the number of hits from primary screening.
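As a rough sketch of such a pre-screen, the per-endpoint class predictions for each hit can be combined into a single pass/fail filter. The record fields, the compound identifiers and the acceptance rule below are hypothetical; in practice the criteria would depend on the therapeutic target and the route of administration.

```python
from dataclasses import dataclass


@dataclass
class AdmetPrediction:
    """Hypothetical record of in silico class predictions for one hit."""
    compound_id: str
    absorption: str       # "well absorbed" or "poorly absorbed"
    cyp3a4_binding: str   # "strong" or "weak"
    toxicity_alert: bool


def passes_prescreen(pred):
    """Illustrative rule: keep absorbed, weakly CYP-binding, alert-free hits."""
    return (pred.absorption == "well absorbed"
            and pred.cyp3a4_binding == "weak"
            and not pred.toxicity_alert)


# Made-up predictions for three primary-screening hits.
hits = [
    AdmetPrediction("CPD-001", "well absorbed", "weak", False),
    AdmetPrediction("CPD-002", "poorly absorbed", "weak", False),
    AdmetPrediction("CPD-003", "well absorbed", "strong", True),
]
selected = [p.compound_id for p in hits if passes_prescreen(p)]
print(selected)
```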
The obvious next step is the use of these in silico models for assisting library design by virtual screening of proposed structures to ensure that ADMET properties are favourable before carrying out the synthesis.
The further evolution of such technologies will result in the development of integrated cheminformatics platforms in which all the issues related to ADMET will be solved with maximal quality and time- and cost-effectiveness.