Solving the data conundrum in drug discovery

Published: 3-Mar-2016

Dealing with more data does not have to mean reduced productivity. Dr Thibault Geoui, Director, Chemistry and Biomedical Products, Elsevier R&D Solutions, looks at data developments


Life sciences companies are spending more on R&D and earning less for their efforts, according to a recent Deloitte survey. While the cost of developing an asset grew by one-third from 2010 to 2015, from around US$1.2bn to $1.6bn, average peak sales per asset declined by half during the same period, from $816m to $416m. These discrepancies reflect, at least in part, pharma’s over-investment in data-producing technologies such as next-generation sequencing, and a concomitant paucity of investment in data management and data analytics technologies that can make sense of all that input.

The authors of a study in Nature rightly state that the pharmaceutical industry is facing ‘unprecedented challenges to its business model’ and that those challenges can be met by shifting investments to the earlier stages of drug discovery. Companies can accomplish this by allocating some of the money they spend in Phase I–III studies – which currently account for 63% of the total R&D budget – to the preclinical stage; increasing spending in the clinical stages does not correlate with improved return on investment – quite the opposite. Moreover, it has been demonstrated that if companies invest in in silico tools that enable the production of more optimised leads, they have a 50% higher probability of technical success in Phase II at a cost reduction of 30% per new molecular entity.

More data is good only if companies can access the right data – that is, all available data relevant to the projects at hand

More data is good only if companies can access the right data – that is, all available data relevant to the projects at hand. To do that effectively, they need a system that ensures the data are properly prepared and ‘harmonised’ – i.e. adjusted for differences and inconsistencies among measurements, methods and terminology, and produced in a clean, accurate and structured way. Researchers then are in a position to conduct focused searches. Text-mining systems such as Elsevier’s extract input not only from the burgeoning literature (PubMed now contains more than 25 million biomedical-related citations), but also from clinical trials and patient-centric sources such as pharmacy and health insurance profiles, electronic health records and mobile diagnostic and monitoring devices.
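A minimal sketch of the harmonisation step described above: mapping the many surface forms a concept takes across sources to one canonical term, so that a focused search matches records regardless of how they were originally labelled. The synonym table, field names and records here are illustrative assumptions, not a real vocabulary or product API.

```python
# Illustrative terminology harmonisation: normalise inconsistent labels
# from different sources to one canonical, searchable term.
CANONICAL_TERMS = {
    "tnf-alpha": "TNF-alpha",
    "tnf-a": "TNF-alpha",
    "tumor necrosis factor alpha": "TNF-alpha",
    "tumour necrosis factor alpha": "TNF-alpha",
}

def harmonise(record: dict) -> dict:
    """Return a copy of the record with its 'target' field normalised."""
    raw = record["target"].strip().lower()
    return {**record, "target": CANONICAL_TERMS.get(raw, record["target"])}

records = [
    {"source": "journal A", "target": "TNF-a"},
    {"source": "trial registry", "target": "Tumour necrosis factor alpha"},
]
harmonised = [harmonise(r) for r in records]
# Both records now carry the same label, so one query finds both.
assert all(r["target"] == "TNF-alpha" for r in harmonised)
```

In practice this mapping would come from a curated taxonomy rather than a hand-written dictionary, but the principle – clean, structured, consistent labels before search – is the same.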

For the most robust results relevant to the early stages of discovery, it is useful also to have the capability of integrating regulatory data found in the medical, chemistry, statistical and clinical pharmacology/biopharmaceutics reviews sections of the US Food and Drug Administration’s (FDA’s) new drug approval packages, and similar input from the European Medicines Agency. Important human data with key covariants can be missed entirely if, as is often the case, the studies referenced in these documents are not published in the literature.

Increasingly, raw datasets – generally defined as any data that are the direct result of observations or experimentation without processing, analysis or other intellectual input – are also becoming available. For example, Elsevier recently launched an Open Data pilot that makes the raw research data submitted with an article – e.g. output from a measurement device, data from social surveys and digital scans – accessible online to all users, alongside the published article. In addition, the International Committee of Medical Journal Editors has just proposed new rules that will require authors to share clinical trial data as a prerequisite for manuscripts to be considered for publication. While this input increases the volume of data that need to be converted into a searchable format, incorporating these datasets can help researchers identify, validate and build upon findings that are applicable to their own work.

Facilitating reproducibility

Specifically during lead identification and optimisation, focused searches across a multitude of data sources can give rise to significant savings. For example, the results can help validate a company’s internal findings about a candidate compound by ensuring that internal results are similar to what has already been published; this, in turn, can reduce the amount – and cost – of additional experimentation in house.

To take advantage of these capabilities, companies can invest in a data-mining system that searches the full text of articles, not just abstracts, and gives researchers the ability to choose which metrics to apply to get the most relevant, high-quality results. To validate internal findings, for example, researchers would refine their searches to ensure that a particular fact is mentioned in the ‘results’ section of a paper – i.e. that it is definitely a scientific finding, not simply a citation of someone else’s work in the introduction, discussion or conclusion sections. In addition, to eliminate citation bias, they could ascertain that for each instance, a particular fact is reported by a completely different research group.
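The two filters just described can be expressed as a short post-processing step over a text-mining system's output. This is a hypothetical sketch: it assumes each mined ‘fact’ record carries the article section it came from and the reporting research group, and the field names and sample data are illustrative.

```python
# Illustrative post-filtering of mined facts: keep only genuine findings
# (from 'results' sections) and guard against citation bias by requiring
# that each fact is reported by multiple distinct research groups.
def independent_findings(facts, min_groups=2):
    """Return facts mined from 'results' sections and independently
    reported by at least `min_groups` different research groups."""
    results_only = [f for f in facts if f["section"] == "results"]
    groups_by_fact = {}
    for f in results_only:
        groups_by_fact.setdefault(f["fact"], set()).add(f["group"])
    return {fact for fact, groups in groups_by_fact.items()
            if len(groups) >= min_groups}

facts = [
    {"fact": "compound X inhibits kinase Y", "section": "results", "group": "lab-1"},
    {"fact": "compound X inhibits kinase Y", "section": "results", "group": "lab-2"},
    # Mentioned only in an introduction: a citation, not a finding.
    {"fact": "compound X inhibits kinase Y", "section": "introduction", "group": "lab-3"},
    # Reported by a single group only: fails the independence check.
    {"fact": "compound Z is cardiotoxic", "section": "results", "group": "lab-1"},
]
assert independent_findings(facts) == {"compound X inhibits kinase Y"}
```

Surviving facts are those a company could reasonably treat as corroborated when validating its own internal results.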

Many risk-management systems use different platforms and different datasets, and don’t necessarily talk to each other

Most companies have a system in place that enables them to stay compliant with current regulatory requirements. But those systems can fall short if a company is trying to gain the best possible understanding of a drug’s safety impact early in the discovery process, instead of dealing with an unanticipated adverse event or drug-drug interaction in later stages or postmarket. Many risk-management systems use different platforms and different datasets, and don’t necessarily talk to each other. Results can’t easily be compared, often leading to duplication of efforts. This lack of standardisation is particularly onerous for companies that implement different approaches across business units, geographies and previous acquisitions. In such an environment, even with ongoing literature monitoring, it is easy to miss articles that report adverse drug reactions for a particular candidate.

One solution is to invest in a common platform that hosts all relevant data, with appropriate permissions for sharing. A single, comprehensive platform facilitates the development of a proactive (i.e. premarket) risk-management action plan that incorporates multiple datasets into the monitoring process, helping to ensure that vital information, published or not, will be flagged for subsequent investigation. The idea of proactively monitoring safety events to better manage risk across the life cycle of a drug, and inform products still in the pipeline, is relatively new. But regulatory bodies such as the FDA and EMA are sharing information and amending their requirements to promote continuous gathering and analysis of data from as many sources as possible, as early in the process as possible.

Data mining tools are becoming increasingly important in R&D

Data sharing is increasingly being encouraged, if not mandated, across and between companies and organisations – yet another strong reason for investing in a powerful text- and data-mining system. For example, the NIH and the European Commission (EC) are launching initiatives that encourage the sharing of research data specifically to enable reuse. The NIH recently announced its intention to ‘make public access to digital scientific data the standard for all NIH-funded research’, whereas the EC has prioritised Open Science, with a long-term goal of increasing ‘the impact and quality of science, making science more efficient, reliable and responsive to the grand challenges of our times as well as foster co-creation and open innovation’.

Expanding resources

In addition, as part of its NIH Big Data to Knowledge initiative, the NIH recently published a funding opportunity to develop the NIH Data Discovery Index to enable the discovery, access and citation of biomedical data. Other funding opportunities encourage the development of dedicated data search engines. In a project co-funded by a National Science Foundation EAGER grant, for example, Elsevier is working on a data search pilot with the Carnegie Mellon School of Computer Science to facilitate the querying of tabular content extracted from articles and imported from research databases. These initiatives provide more incentives for companies to invest in technologies that can manage and cull relevant information from diverse and rapidly expanding databases. When organisations acquire similar datasets through open sources and collaborations, combining and successfully mining those datasets can make it easier to draw conclusions about specific entities of interest, and also to conduct meaningful meta-analyses.

An often overlooked benefit of mining shared data is the facilitation of drug repurposing

An often overlooked benefit of mining shared data is the facilitation of drug repurposing, as it is now commonly accepted that most approved drugs are not selective for a single target or signalling pathway. Companies have applied text- and data-mining to successfully identify new indications for, among others, the TNF-inhibitor adalimumab (Humira) and the anticancer drug imatinib (Gleevec). Streamlining drug repurposing with the best possible tools for discovering and delineating disease and treatment mechanisms provides significant economic benefits to companies, as well as significant health benefits to patients.

Mining social media input

Social media input is the next big swath of largely unstructured data that companies will be required to incorporate into their decision-making processes. In its recently revised guidelines on adverse event reporting, the EMA states that marketing authorisation holders should regularly screen internet or digital media for potential reports of suspected adverse events and assess whether such information qualifies for reporting to the agency. Such reports also can be used to inform decision-making on candidates with similar mechanisms of action to a drug known to trigger adverse effects.

Technologies to aggregate and mine social media input effectively are still in development

In June 2014, the FDA posted a draft guidance for industry on the use of social media platforms for advertising/promoting risk and benefit information to the public. The guidance is the result of an ongoing collaboration between the FDA, industry and patient groups to identify best practices for engaging patients, and to conduct pilot programmes that solicit patient input via social media. The collaboration is expected to influence drug development as well as adverse event reporting. Technologies to aggregate and mine social media input effectively are still in development; however, one possible solution is to treat ‘tweets’ and other social media comments as text and enable the use of advanced text-mining software to help companies at least become aware of what people are saying about specific products.

In addition, Elsevier is beginning to enable the application of ‘sentiment analysis’ – an emerging data-mining strategy that identifies words and phrases that indicate opinions, attitudes and lack of certainty (e.g. ‘suggests,’ ‘seems to indicate’) – to track potentially relevant input that appears in social media, and in biomedical literature searches as well. Although challenges remain – such as the need for linguistic resources and the ambiguity of words and of intent (e.g. irony, sarcasm) – sentiment analysis is likely to play an increasingly important role in informing drug discovery and development R&D going forward.
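At its simplest, the cue-spotting side of sentiment analysis can be sketched as matching a lexicon of hedging phrases against text. This is a minimal illustration only – real systems use far richer linguistic resources than the handful of cues assumed here.

```python
# Illustrative hedge-cue detection: count phrases signalling uncertainty
# rather than an established finding. The cue list is a toy example.
import re

HEDGE_CUES = ["suggests", "seems to indicate", "may", "might", "appears to"]

def hedge_score(text: str) -> int:
    """Count word-boundary matches of hedging cues in a sentence."""
    t = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(cue) + r"\b", t))
               for cue in HEDGE_CUES)

confident = "Drug A reduces tumour volume by 40%."
hedged = "Drug A seems to indicate a benefit and may reduce tumour volume."

assert hedge_score(confident) == 0
assert hedge_score(hedged) == 2  # 'seems to indicate' and 'may'
```

A downstream pipeline could use such a score to separate firm claims from tentative ones when triaging social media posts or literature hits, though handling irony and sarcasm requires far more than lexicon matching.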
