Computational drug repurposing based on electronic health records: a scoping review

We abstracted the 33 articles along four themes, in the following order: (1) journal and articles, (2) data used, (3) methods, and (4) results of repurposing. For each theme, the first author identified the important data elements, and all seven reviewers validated them. Each reviewer then synthesized the articles for each data element. The results were finally validated and organized by the first author. Disagreements in synthesis were resolved among all the reviewers in a consensus meeting. A general summary of the articles is shown in Supplementary Table 1. The flow details for each article can be found in Supplementary Method 1.

Publication venue

The 33 papers reviewed comprised 29 journal articles and four conference articles (see Fig. 1). We manually categorized the articles into three types: (1) Computer Science, (2) Informatics/Biomedical Informatics, and (3) Medicine/Biology/Pharmacology. Most of the articles fell under Informatics/Biomedical Informatics (n = 22) or Medicine/Biology/Pharmacology (n = 10). We also noticed that the conference articles were Informatics/Biomedical Informatics, suggesting that this topic or methodology is more popular in the Biomedical Informatics community. Most studies were conducted in the United States (n = 22), with the remainder scattered among Asian and European countries. In addition, EHR-based drug repurposing gained popularity between 2012 (n = 1) and 2021 (n = 10).

Fig. 1: Distribution of publication type, stratified by year of publication and country of origin.


The majority of studies relied on EHR data from an institution affiliated with either the authors themselves (e.g., Vanderbilt University Medical Center17,18,19) or one of their collaborators (e.g., Mayo Clinic18,19). Only three studies utilized publicly available datasets: the IBM Watson Health Explorys database20, MIMIC-II21, and adverse event reporting systems (AERSs)22. Most of the studies used only EHR data, while others drew on additional sources to facilitate drug repurposing, such as knowledge bases (N = 11) and Omics databases (N = 7); note that we distinguished Omics data from EHR data (see Supplementary Fig. 2a). Among the association knowledge bases, drug-gene information was the most popular (see Supplementary Fig. 2b). DrugBank23 was the main source of drug-gene (protein) information in the studies (N = 11). We also noticed widespread use of other biomedical repositories, such as SIDER24 for side effect information (N = 3) and OMIM25 for gene-disease relations (N = 2).

Among the 22 EHR data types covered in our survey, medication (N = 22), diagnosis (N = 19), lab test (N = 17), and demographic (N = 16) data were the most frequently used (see Fig. 2a). Most studies used patient cohorts of fewer than 10,000 (N = 8), 10,000 to 100,000 (N = 6), or 100,000 to 1,000,000 (N = 6) patients (see Fig. 2b). A few studies did not specify the size of the patient cohort used (N = 3). Supplementary Table 2 gives detailed information on the data used in the reviewed studies.

Fig. 2: Distribution of EHR types and number of patients.

a shows the distribution of the EHR types, and b shows the distribution of the number of patients.

Drug repurposing methods

Figure 3 shows the number of papers that applied each type of EHR data processing: natural language processing (NLP), standardization, or temporal data processing. Of the surveyed studies, five used NLP to process their data, seven used standardization, and four dealt with temporal data. Three studies implemented more than one data processing method (e.g., MedEx and RxNorm CUI were used to extract and standardize medication information19, and the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) was used for both drug prescription and laboratory tests26).

Fig. 3: Distribution of different data processing methods and the predictive model.

NLP pipelines were used to extract a diverse set of information that differed depending on individual study needs. For example, drugs and diseases were extracted from triads of sentences by using MetaMap21. Standardization efforts mainly focused on using standard terminologies for medical concepts. For example, Proteomics Standards Initiative-Molecular Interactions (PSI-MI) codes were used for proteomics, Gene Ontology (GO) for genomics, Anatomical Therapeutic Chemical (ATC) codes for drugs, and ICD-10 and Online Mendelian Inheritance in Man (OMIM) for disease data27. Similarly, Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) was used for diseases, RxNorm for drugs, Logical Observation Identifiers Names and Codes (LOINC) for laboratory tests20, standard billing codes for clinical phenotypes28, and the Unified Medical Language System (UMLS) for multiple kinds of biomedical concepts29. Temporal information was primarily used to track disease progression. For example, one study used temporal data to analyze the association between the virological status of patients and all-cause mortality as well as other individual-level factors30. Supplementary Table 3 summarizes the data processing used in the studies.

As shown in Fig. 3, statistical analysis and machine learning are the two predominant computational approaches used for drug repurposing through mining large sets of health data. In statistical analysis methods, statistical models and tests are used to determine the effect of drugs on disease targets or other related clinical variables such as genes and laboratory tests. For example, Wang et al.31 searched public pharmacological and genomic databases as well as private EHRs for drug and gene information related to glaucoma. They used p values based on chi-square tests and false discovery rates (FDR) of drugs targeting glaucoma genes/diseases to detect potential treatment candidates. In that study, the prevalence of glaucoma was 0.11% in theophylline-treated patients and 0.058% in celecoxib-treated patients, suggesting these drugs may have antiglaucoma effects, as the prevalence of glaucoma was significantly lower in these drug-use cohorts than in healthy individuals. Wu et al.19 divided a patient cohort into two comparison groups, an exposure group with a drug prescription and a non-exposure group without one, and applied Cox regression to measure the association of drugs with cancer survival in order to suggest repurposing candidates. Goldstein et al.17 used logistic regression (or multivariate regression) and derived p values to examine the association between drug candidates and genetic mutation (or glucose tolerance test) data to identify drug repurposing candidates for gestational diabetes.
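The cohort-comparison style of analysis described above can be illustrated with a minimal sketch. It is not the procedure of any reviewed study: the drug names and patient counts are hypothetical, the test is a plain Pearson chi-square on a 2x2 table without continuity correction, and the Benjamini-Hochberg procedure is assumed as the FDR method.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts per drug:
# (cases exposed, non-cases exposed, cases unexposed, non-cases unexposed)
tables = {
    "drug_A": (5, 4995, 60, 19940),
    "drug_B": (14, 4986, 60, 19940),
    "drug_C": (30, 4970, 60, 19940),
}
names = list(tables)
p_values = [chi_square_2x2(*tables[name])[1] for name in names]
q_values = benjamini_hochberg(p_values)
for name, p, q in zip(names, p_values, q_values):
    print(f"{name}: p = {p:.4g}, FDR-adjusted q = {q:.4g}")
```

A drug whose exposed cohort shows a lower disease prevalence and a small adjusted q value would be flagged as a candidate for follow-up. A test like this does not adjust for confounding, which is one reason the reviewed studies also relied on regression models.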

Machine learning is the other common computational approach for predicting new disease targets of existing drugs. Three popular families of machine learning methods are based on similarity/interaction networks, least-squares optimization, and deep learning. For instance, Zhou et al.20 developed a network-based prediction system of disease-target interactions by modeling phenotypic and genetic relationships among drugs, side effects, diseases, and genes to identify repositioned drug candidates. Ghalwash et al.32 formulated the problem of finding drugs that affect the levels of laboratory test results as a regularized least-squares unconstrained convex optimization problem. Liu et al.33 created a high-throughput screening framework using existing large-scale real-world data. The framework extracts potential repurposing drug ingredients, identifies the corresponding user and non-user sub-cohorts, computes features and disease progression outcomes for all patients in both sub-cohorts, and estimates the treatment effects using deep learning methods. Supplementary Table 4 summarizes the computational methods in detail.
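To give a sense of what a regularized least-squares formulation looks like in general, the sketch below fits a ridge-penalized linear model, minimizing ||Xw - y||^2 + lam * ||w||^2 by gradient descent. This is a generic textbook form and is assumed for illustration only; it is not the specific objective or regularizer used by Ghalwash et al. The exposure matrix, the lab-value changes, and the function name `ridge_fit` are all hypothetical.

```python
def ridge_fit(X, y, lam, lr=0.01, steps=5000):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 by plain gradient descent.

    X is a list of rows (patients), each a list of drug-exposure features;
    y holds the observed change in a laboratory value per patient.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(steps):
        # Residuals of the current linear prediction.
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the objective: 2 * X^T r + 2 * lam * w.
        grad = [
            2 * sum(r * row[j] for r, row in zip(residuals, X)) + 2 * lam * w[j]
            for j in range(n_features)
        ]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical data: 6 patients, 2 drugs (1 = exposed, 0 = not exposed).
exposure = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]
lab_change = [-1.0, -1.2, 0.1, -0.1, -0.9, 0.0]

weights = ridge_fit(exposure, lab_change, lam=0.5)
for drug, w in zip(["drug_A", "drug_B"], weights):
    print(f"{drug}: estimated effect on lab value = {w:.3f}")
```

Because the objective is convex and unconstrained, gradient descent with a small enough step size converges to the unique minimizer; in practice a closed-form or library solver would normally be used instead.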

Evaluation of EHR-based computational drug repositioning research is critical to ensure valid and reliable computational methods and new signals. Unlike predictive modeling or adverse drug reaction detection, where gold standard outcomes can be well defined, drug repositioning research may lack well-established evidence or ground truth against which to validate newly discovered target signals. Therefore, the evaluation may rely on multiple internal and external sources of evidence. Figure 4 summarizes the sources for training and validation. Of the 33 papers reviewed, six did not present any methods for assessing drug performance. Of the 27 that did, the most common performance metrics were machine learning related (e.g., precision-recall, AUC-ROC). Risk ratios (e.g., hazard, odds, and relative risk) were commonly reported to evaluate the effectiveness of candidate drugs. With respect to validation, 12 papers validated their hypothesized candidate drugs against EHR data, ten against biomedical literature, and nine against public knowledge bases.
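As a worked illustration of two of the risk measures mentioned above, the following sketch computes a relative risk and an odds ratio, with a Wald 95% confidence interval for the odds ratio, from a 2x2 exposure-by-outcome table. The counts and function names are hypothetical, and hazard ratios are left out because they require a survival model.

```python
import math

def relative_risk(a, b, c, d):
    """Risk of the outcome in the exposed group divided by the risk in the
    unexposed group. a, b = exposed with/without outcome; c, d = unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds of the outcome in the exposed group divided by the odds in the
    unexposed group."""
    return (a * d) / (b * c)

def odds_ratio_ci95(a, b, c, d):
    """Wald 95% confidence interval for the odds ratio (log scale)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)

# Hypothetical cohort: 100 patients on the candidate drug, 100 not on it.
a, b, c, d = 10, 90, 20, 80
low, high = odds_ratio_ci95(a, b, c, d)
print(f"relative risk = {relative_risk(a, b, c, d):.3f}")
print(f"odds ratio    = {odds_ratio(a, b, c, d):.3f} (95% CI {low:.3f} to {high:.3f})")
```

A ratio below 1 points toward a protective association, but an interval that includes 1 means the data are also compatible with no effect.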

Fig. 4: Distribution of resources for training and validation.

The EHR is the most frequently reported source for training and validation since it contains rich, dense, and longitudinal information. Drug effects on laboratory tests were mainly used in building predictive models (N = 12). For validation, drug-disease information observed in the EHR was mainly used (N = 6). Note that a dataset can be used for both training and validation. Validation can be conducted by retrospectively analyzing EHR data to estimate the usage and effects of the candidate drug. For example, Wang et al. searched EHR data to obtain information on the usage of the candidate drugs and glaucoma31. Due to potential issues of data quality or information representation (e.g., unstructured text), manual chart reviews are often required when leveraging the EHR for evaluation. Cai et al. conducted a chart review of the EHRs of 20 randomly selected participants to determine the accuracy of newly identified phenotypes34. In addition to EHRs, external databases such as DrugBank and clinical trials databases can be valuable resources for evaluation. One common way of leveraging these databases is study replication, in which target associations are reproduced using the same computational methods on a different dataset and the difference in study outcomes is compared statistically. Cai et al.34 used two additional external data sources, BioVU and UK Biobank, to cross-examine the association between a genetic variant and coronary heart disease phenotypes. Xu et al.35 performed a comprehensive performance comparison with existing state-of-the-art drug repositioning methods to reveal the advantages of the proposed methods. Of the 33 articles, two studies27,35 conducted an additional laboratory study to validate the potential therapeutic effect in animal models, lending additional validity to the proposed methods.

Due to the lack of ground truth and potential EHR-related data quality issues, we recommend performing multiple evaluations on different data sources. We found that 13 of the 33 studies reported more than one evaluation method. For example, Wu et al.19 incorporated two different validation methods: (1) supporting evidence from biomedical literature, and (2) supporting evidence from human interventional cancer trials. Paik et al.27 used computational evaluation (tenfold cross-validation) on known associations and an in vivo zebrafish model of ALS. Hsieh et al.36 validated the candidate drugs through both in vitro drug screening and real-world population-based studies leveraging EHRs. Supplementary Table 5 summarizes the validation methods in detail.

Drug repurposing results

Figure 5 shows the diseases targeted. The most common repurposing targets were diabetes-related, accounting for 10 of the 33 publications17,26,29,32,37,38,39,40,41,42, including type 2 diabetes37,41,42, gestational diabetes17, diabetes (unspecified)26,29, and diabetes-related tests such as glycated hemoglobin32 and fasting blood glucose38,39,40. Six publications did not focus on any specific disease21,22,27,28,43,44. For example, Dang et al.21 aimed to establish a generic process and method to integrate phenomic data in the EHR with omic and drug data.

Fig. 5: Distribution of diseases targeted.

Cardiovascular-related diseases were also the focus of seven publications. Specifically, Jang et al.41 targeted congestive heart failure, myocardial ischemia, and stroke; Ghalwash et al.32 targeted low-density lipoprotein, which is a risk factor for cardiovascular and vascular diseases; Kim et al.26 targeted dyslipidemia; Cai et al.34 targeted cardiovascular disease; Liu et al.33 targeted coronary artery disease; Nordon et al.29 targeted hypertension; and 366 targeted coronary heart disease, congestive heart failure, heart attack, and stroke. In addition, four publications targeted cancer18,19,35,45, three targeted COVID36,46,47, and two targeted asthma41,42.

It is worth noting that some of the reviewed articles did not report on specific drugs, but rather presented a selection of the top n repurposed drug candidates as determined by their respective methodologies. Those that did report a subset of drugs typically selected them by drug type, such as statins, triptans, PPIs, and nasal steroids in one study, α1-adrenoceptor antagonists in another, and antihypertensive calcium channel blockers in a third. Of the 33 studies reviewed, only five reported results focused on a single drug: metformin in Xu et al.18, febuxostat in Muraki et al.48, terbutaline sulfate in Paik et al.27, fluoxetine in Bi et al.45, and dextromethorphan in Cummings et al.49. This finding is intuitive, as most methods aim to present candidates for further screening rather than to study a pre-selected drug. Such methods yield a set of candidates that should be cross-validated against known clinical indications to establish the validity of the method, rather than a clinical validation of an individual drug selected from that set. Supplementary Tables 1 and 6 show the drugs explored and the corresponding diseases targeted in detail.

Data and tools published

Despite the widespread and vital use of EHR data for drug repurposing research, datasets and tools were not readily available to the public in many of the surveyed studies. Table 1 shows the publicly shared data and tools among the reviewed studies. Only one study33 of the 33 reviewed is fully reproducible, using a publicly available EHR dataset and a shared tool, so that follow-up studies can verify the original findings. In terms of datasets, seven studies used EHR datasets that are accessible to other researchers: the IBM Watson Health Explorys database20,50,51, the IBM Health MarketScan database33,45,51, MIMIC-II21, and the Vanderbilt Synthetic Derivatives database17, the last being the only case in which the original dataset was shared, through a data use agreement. In terms of tools, five studies shared the tools they developed28,29,33,36,42. Others named the open software they used but did not share their own implementations20,27,45. Lastly, some of the studies shared their analyses and results in the form of supplements or separate links19,26,27,31,33,35,36,41,42,44,45,46,47,49,50,51,52.

Table 1 Summary of publicly shared data and tools.

