Background

With the adoption of computerized medication ordering and administration systems, the veil on the incidence of adverse drug events (ADEs) is slowly being removed. Unfortunately, ADEs are still considered to be heavily under-reported [1]. Among the ADEs that are reported, around half are preventable [2], causing unnecessary suffering for patients and increased healthcare costs. According to one meta-analysis, ADEs are, in fact, responsible for around 4.9% of hospital admissions worldwide, and, in some cases, this number can be as high as 41.3% [3]. There is thus no doubt that drug safety is an important public health problem. Unfortunately, the high rate of ADEs may continue unabated unless systems that provide decision support for drug selection and dosing are developed and more widely implemented at the point of care [4].

Pharmacovigilance using electronic health records

Efforts have been made in pharmacovigilance to improve drug safety. The World Health Organization (WHO) defines pharmacovigilance as "the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem" [5]. The primary resources involved in pharmacovigilance are clinical trials, spontaneous reports and longitudinal healthcare databases [6]. The use of these can be divided into pre-marketing and post-marketing pharmacovigilance activities. In the pre-marketing stage, prior to the launch of a drug, clinical trials are used to gather information on both the efficacy and safety of a drug. However, such a source of information comes with two inherent limitations, namely small samples of participants and short study duration. These limitations make it challenging to identify ADEs that are rare or occur with a long latency. In the post-marketing stage, after the drug has been launched, spontaneous reporting systems are used continuously to collect information on the safety of the drug. Examples of such systems are the US Food and Drug Administration's Adverse Event Reporting System [7] and WHO's Global Individual Case Safety Reports Database, Vigibase [8]. Spontaneous reports are voluntarily made by patients and physicians of suspected ADEs, which allows for monitoring of all drugs on the market at a fairly low cost. Unfortunately, such systems suffer heavily from under-reporting: it has been estimated that more than 94% of ADEs are not reported through spontaneous reports [9]. Other limitations of spontaneous reports include selective reporting, incomplete patient information and indeterminate population information; for more details see [10]. Indeterminate population information is particularly problematic since it prevents the calculation of the incidence of reported ADEs. As a result of these limitations, the need for alternative, complementary data sources is duly being acknowledged.

Among the possible alternative data sources, which also includes social media and medical literature, are electronic health records (EHRs) [11] since they capture and integrate patient data from all aspects of clinical observations over time. Although the main function of EHRs is to archive and manage patient data efficiently - in comparison to paper-based health record systems - secondary use of EHR data is currently being widely explored for various medical research, such as disease discovery and patient stratification [12, 13], among which also pharmacovigilance has received a lot of attention. There are various ways of utilizing EHRs for pharmacovigilance in a data-driven fashion, such as calculating correlations between drugs and diseases, clustering patients into different disease groups, and employing machine learning based prediction [14], among which the latter is particularly nascent.

Predictive modeling of data from electronic health records

Machine learning based methods are data-driven approaches that can support discovery and exploitation of statistical patterns from large quantities of data. Given a large amount of observations that are described by multiple variables, such methods have proven to be robust to random errors [15]. In areas where there is a need to analyze large amounts of data, such as bioinformatics, machine learning is a key technique, particularly when analyzing "big data" [16]. This is also the case in post-marketing drug safety surveillance, where the discovery process typically relies on large samples; computational signal detection algorithms have in this context been developed to analyze data with the purpose of detecting signals of potential ADEs [17]. Some of these algorithms detect signals according to a score function based on contingency tables, such as disproportionality analysis of spontaneous reports. However, a limitation of using contingency tables is that, by reducing the analysis to only two dimensions, the potential concomitant loss of clinically crucial information may result in arbitrary associations [17, 18]. This can be eschewed by instead employing multivariate algorithms for signal detection, where machine learning methods can provide efficient and effective means of modeling high-dimensional data.

Applying machine learning to EHR data is, however, challenging for various reasons. A natural way of fitting EHR data into machine learning models is to utilize the various clinical events that are recorded in EHRs as variables to describe, for instance, patients. For each patient, these clinical events can be represented either as a sequence according to reporting chronology, or as a bag, in effect discarding order information. Treating clinical events as sequences is, however, problematic for two reasons: (1) many events have identical timestamps, which raises the question of how to deal with simultaneously occurring events; (2) there is a lack of understanding to what extent the order of reported events reflects reality, i.e., we cannot know whether the sequence of reported events is the same as the actual sequence of events. When representing clinical events as a bag, there are other problems that need to be handled, as illustrated in Figure 1. On the one hand, the data is often high-dimensional and sparse, i.e., a large number of features describe each patient, but many features have non-zero values only for a small fraction of the patients. On the other hand, the types of data available in EHRs are heterogeneous and complex. Typically, EHR data includes both structured data according to predefined templates, such as demographic patient information, drug prescriptions, diagnoses, clinical measurements and lab tests, as well unstructured data in the form of clinical notes written in natural language. Moreover, for some types of data, such as prescribed drugs, assigned diagnoses and obtained clinical measurements, a patient may have experienced the same type of clinical event multiple times, for instance a patient being prescribed a certain drug multiple times. In summary, the challenges of analyzing EHR data with machine learning methods stem not only from high dimensionality and sparsity, but also from the existence of different data types that are tangled together with missing and duplicated values.

Figure 1
figure 1

Extracting data for machine learning methods from electronic health records.

Related work

Due to limited access to EHR data, research on exploiting it for pharmacovigilance is still relatively scarce compared to using other data sources, despite its acknowledged potential. Among the published research on using EHRs for ADE detection, some have focused on using clinical notes [1921], while how best to exploit the structured data remains under-explored. In some studies, however, clinical measurements or lab tests from EHRs have been utilized for (adverse) event detection by representing them as time series [22], aggregating them into categorical variables [23], or representing them from multiple perspectives [24]. Other studies have used diagnoses and drugs instead [25, 26], while these data types have also been used in conjunction for signaling ADEs, albeit only in a case study and on a very limited scale [27].

Diagnoses and drugs are normally encoded by standard coding systems such as International Statistical Classification of Diseases and Related Health Problems (ICD) and Anatomical Therapeutic Chemical Classification System (ATC), respectively. These coding systems have their own concept hierarchies representing terms from general levels to more specific ones according to organ system or etiology. In a previous study, we have studied the possibility of exploiting these concept hierarchies to obtain improved predictive performance on the task of distinguishing between patients who have experienced a specific ADE and randomly selected patients who have not experienced that same ADE [26]. It was shown that for such tasks, using only the more general levels of the codes is sufficient to maintain the predictive performance on a high level. We have also evaluated various ways of representing clinical measurements from EHRs and discovered that using such measurements alone still leads to the effective detection of ADEs; moreover, using only the number of times each clinical measurement has been taken, without considering their actual values, is a representation that results in the highest predictive performance for the most common learning algorithms [24].

However, previous studies have either used a single data type from EHRs or a small number of pre-selected variables from different data types to signal a specific ADE. In this study, we explore if it is beneficial to combine various data types, on a large scale, by using all of the available variables for ADE detection, and also how best to represent them. In addition to detecting specific ADEs, this study aims to explore ways of using structured EHR data that can be exploited to detect a wide range of ADEs, which could be adopted in a general decision support system that alerts for potential ADEs.

Methods

In this study, we investigated the use of various data types in EHRs for drug safety surveillance. Here, we focused on using the structured data to build predictive models using machine learning based methods. Clinical measurements, diagnoses and drugs were extracted from a real EHR database. Besides the known problems of EHR data such as high dimensionality and sparsity, these data types have their own characteristics and hence lead to different challenges when fitting them into predictive models. For example, some clinical events here might be observed multiple times for one patient, while some might not be observed at all. Therefore, a series of experiments were conducted to explore the use of these heterogeneous data types separately and together when predicting ADEs with machine learning based methods: first, different representations of each data type were compared and the best representation of the corresponding data type was selected for merging with the other data types, i.e., to form a fused feature set, which was compared to using each data type separately; second, to reduce the high dimensionality and sparsity, feature selection was conducted on both the separate data types and the fused feature set, which also allowed for an in-depth feature analysis; and finally, various commonly used learning algorithms were applied and compared for the classification task.

Data source

Data was extracted from a Swedish EHR database, the Stockholm EPR Corpus (this research has been approved by the Regional Ethical Review Board in Stockholm with permission number 2012/834-31/5). This database contains health records of around 700,000 patients from 2009 to 2010, which were obtained from Karolinska University Hospital in Stockholm, Sweden [28]. Here, large amounts of diagnosis information, drug administrations, clinical measurements, lab tests and clinical notes in free-text from anonymized health records are available for research. In this study, we only extracted the structured data, i.e., diagnoses, drugs and clinical measurements.

In the Stockholm EPR Corpus, diagnoses are encoded by the International Statistical Classification of Diseases and Related Health Problems, 10th Edition (ICD-10), some of which indicate ADEs, e.g., G44.4 (drug-induced headache). To create training data for building machine learning models, we used these ADE-related diagnosis codes as class labels. The population is hence divided into patients that have been assigned an ADE-related diagnosis code and those who have not. In a study on the use of ICD-10 codes for ADE reporting [29], the ADE-related diagnosis codes were divided into categories according to the strength of their indication for ADEs, where category A.1 (a drug-related causation was noted in the diagnosis code) and category A.2 (a drug-or other substance-related causation was noted in the diagnosis code) were used in this study, as they indicate the most certain causal drug-diagnosis relationship of ADEs compared to the other categories.

To avoid spurious findings, we have selected 27 ADE-related codes that are most frequently used in the Stockholm EPR Corpus, resulting in 27 datasets, where the existence of each ADE-related diagnosis code indicating a particular ADE served as the class label in each dataset; see Table 1 for the selected ADE-related diagnosis codes and their description. The classification task is hence binary: positive or negative with respect to a specific ADE. In each dataset, examples correspond to patients: patients whom have been assigned an ADE-specific diagnosis code constitute positive examples and patients whom have been assigned a similar diagnosis code to the ADE-specific diagnosis code form negative examples, where two codes are considered similar if they share the first three levels in the concept hierarchy. For instance, if the positive examples are patients with diagnosis code G44.4 (drug-induced headache), the negative examples are patients with any diagnosis code starting with G44 (other headache syndromes), but not G44.4. Features are clinical events, i.e., diagnoses, drugs and clinical measurements, that are reported in the health records of these patients prior to the event of interest, i.e., the class label. The number of instances, the proportion of the positive class and the number of features from each data type for each dataset are described in Table 2.

Table 1 The 27 selected ADE related diagnosis codes.
Table 2 Statistical description of 27 datasets.

Experimental setup

The main underlying learning algorithm in this study is random forest [30], which is an ensemble learning method that generates a set of decision trees. Each tree in the forest is built with a bootstrapped sample from the original training examples and each node in the tree only considers a randomly selected subset of the original feature set. The trees carry out the learning task independently from each other and the forest eventually outputs the final result through voting, i.e., averaging the output of all constituent trees. The random forest learning algorithm has become one of the most popular machine learning methods, especially in bioinformatics where data is often high dimensional, as a result of its relatively low computational cost and robust predictive performance [31].

Evaluation was done through 10-fold cross validation with 10 iterations. The performance metrics used in this study are accuracy and area under ROC curve (AUC). Accuracy, the most common and perhaps also the most intuitive metric to evaluate the performance of a predictive model, measures the percentage of examples that are predicted correctly. Area under ROC curve can be used whenever the learning algorithm is able to rank the examples based on the decreasing probability of predicting them as positive. It measures the probability of ranking a true positive example ahead of a false positive example [32], i.e., the rate of detecting true signals versus the false alarm rate. Compared to accuracy, AUC is sometimes favored because it is not sensitive to changes in the class distribution between training and test data.

When more than two models were compared, a Friedman test [33] was employed to test the statistical significance, where the rank of each model is used. To look further at the pairwise significance between the inspected models, a post-hoc test using the Bergman-Hommel procedure was applied [34].

Using various data types

In the first experiment, different representations of clinical measurements, on the one hand, and diagnoses and drugs on the other (here we consider diagnoses and drugs as one data type, namely clinical codes, as they share the same characteristics), as well as their combination, were compared.

Clinical measurements

In a previous study [24], we proposed five representations (listed below) of clinical measurements to handle the problem that each measurement can be observed multiple times for a patient. Here, we re-evaluated the use of these representations, as well as their combination, on a slightly different task.

  • Mean - the average of the observed values

  • SD - the standard deviation of the observed values

  • Slope - the difference between the first and last observation over the time span

  • Existence - whether or not a measurement has been taken

  • Count - the number of times a measurement was taken

Clinical codes

Diagnoses are encoded by the ICD-10 system and drugs by the ATC system in the Stockholm EPR Corpus, both of which have inherent concept hierarchies that can be used to aggregate the clinical codes into different hierarchical levels, as shown in Figure 2. Here, we compared using the different levels of clinical codes to a combination of all levels.

Figure 2
figure 2

Concept hierarchies of ATC and ICD-10 codes. C10AA01 is the ATC code for Simvastatin and F25.1 is the ICD-10 code for Schizoaffective disorder.

After investigating representations of clinical measurements and clinical codes separately, we combined them using their respective best observed representation. As it has previously been shown that, when an ensemble model is employed, building the model from a fused set of data types is favored compared to fusing ensemble models built from the individual data type [35], we combined the two data types by fusing them into one feature set before applying the random forest algorithm. The predictive performance of random forests using clinical measurements, clinical codes and a combination of the two were compared.

Feature selection

In a follow-up experiment, feature selection was added to the pipeline prior to building the predictive models in order to remove those features that are not informative, while simultaneously reducing the dimensionality and, in some cases, sparsity. There are two common types of feature selection approaches: wrapper-based and filter-based. The former utilizes the targeted learning algorithm as a black box to evaluate the usefulness of features according to their predictive performance [36], while the latter selects features according to a score function independent of the chosen learning algorithm [37]. Wrapper-based approaches are generally considered to produce better feature subsets but with much higher computational costs compared to the filter-based approaches. In this study, we used a filter-based approach to univariate feature selection, information gain, to select relatively important features, where the features are first ranked according to the information gain between them and the class label before selecting the top-ranked ones. The information gain for a certain feature is calculated as the difference between the entropy before splitting the training examples with this feature and the entropy after splitting. The entropy of the random variable x is

H x =- x p x l o g 2 p x ,

where p(x) is the probability distribution of x. In this case, the entropy before splitting is the entropy by splitting the examples only according to the class label Y , H(Y); and the entropy after splitting the examples on feature f is

H Y | f = f , Y p Y H Y f ,

where, Y f is the probability distribution of the class label given feature f. Therefore, the information gain of feature f is

I ( f , Y ) = H ( Y ) - H ( Y | f ) .

In this study, we explored the impact of feature selection on the predictive performance of the random forest algorithm with a set of thresholds starting from the top 10% of available features ranked according to their information gain scores and subsequently adding an extra 10% until the full feature set is included.

Using various learning algorithms

The random forest algorithm is known for being robust with high dimensional data; therefore, in the last experiment, eight additional commonly used learning algorithms were applied in order to find out if the observation from using random forest holds for the others and also to study the impact of feature selection on this task. The selected learning algorithms and their parameters are listed in Table 3. Each learning algorithm used clinical measurements and clinical codes, in isolation and combined, to build predictive models with features selected on all thresholds.

Table 3 Learning algorithms and their default settings.

Results

In this section, we report on the predictive performance, in terms of accuracy and AUC, of models generated with the random forest algorithm that was provided with various representations of 27 clinical datasets, each one containing a different data type (clinical codes and measurements) and representation, as well as combinations of these - with and without feature selection. We present both results from individual datasets, as well as summary results, averaged over datasets. An in-depth feature analysis is moreover conducted and, finally, results from using various learning algorithms are summarized.

Using various data types

The clinical measurements were represented in five distinct ways - Mean, SD, Slope, Existence and Count - as well as a combination of these. Accuracy and AUC, averaged over 27 datasets, as obtained by random forest models with access to the each of the representations are presented in Table 4. For both accuracy and AUC, using the combined representation yielded the best performance. The clinical codes, on the other hand, were aggregated - save for the most specific level - into more general levels according to their concept hierarchies. The averaged accuracy and AUC of random forests with access either to a single level or a combination of all four levels are shown in Table 5 from which we can see that the predictive performance was improved when including all levels of the concept hierarchies.

Table 4 Comparing multiple representations of clinical measurements.
Table 5 Comparing different levels of clinical codes.

A random forest provided with a fused feature set, comprising the best representations of clinical measurements and clinical codes, was then built and compared with random forests with access only to one of the data types. The number of features of the fused feature set is presented in Table 2 under Combination. The accuracy and AUC for the 27 datasets are listed in Table 6. According to a Friedman test, the observed differences among the three random forests is significant, in terms of both accuracy and AUC, and the post-hoc analysis indicates that only using clinical measurements leads to significantly worse predictive performance compared to using clinical codes and their combination; however, there is no significant difference between the latter two.

Table 6 Comparing random forests using clinical measurements (M), clinical codes (C) and their combination (M+C).

Using the most informative features

The performance of random forests using clinical measurements, clinical codes and their combination after selecting different proportions of the most informative features according to their information gain scores are shown in Figure 3 for accuracy and Figure 4 for AUC. From these results we can see that applying feature selection improved the predictive performance, albeit on a small scale. However, even when employing feature selection, the addition of clinical measurements fails to improve the predictive performance compared to using only clinical codes. An explanation for this can be sought by investigating the outcome from different perspectives.

Figure 3
figure 3

Averaged accuracy from random forests using clinical measurements (M), clinical codes (C) and their combination (M+C) at each feature selection threshold.

Figure 4
figure 4

Averaged AUC from random forests using clinical measurements (M), clinical codes (C) and their combination (M+C) at each feature selection threshold.

From a quantitative point of view, the bar plot in Figure 5, depicting the proportion of clinical measurements and clinical codes among the selected features indicates that, irrespective of threshold, the majority are invariably clinical codes. From a qualitative point of view, as shown in Figure 6, the relative informativeness of specific representations of each data type according to their information gain scores tells us that clinical codes are generally more informative than clinical measurements.

Figure 5
figure 5

Proportion of clinical measurements (M) and clinical codes (C) among selected features.

Figure 6
figure 6

Relative informativeness of the 5 representations of clinical measurements (MN: mean; SD: standard deviation; SL: slope; YN: existence; CN: count) and 4 levels (L1 - L4) of clinical codes based on their information gain scores. Larger area indicates lower informativeness.

Moreover, due to the distinct nature of different ADEs, we also present results for each individual dataset in Figure 7 for accuracy and Figure 8 for AUC, respectively. For datasets such as D642, which has the largest number of features, feature selection clearly improves the accuracy; while for some datasets, such as T783 and T887, using a combination of the two data types yields the best predictive performance.

Figure 7
figure 7

Accuracy of random forest using clinical measurements (M), clinical codes (C) and their combination (M+C) at each feature selection threshold in each dataset.

Figure 8
figure 8

AUC of random forest using clinical measurements (M), clinical codes (C) and their combination (M+C) at each feature selection threshold in each dataset.

Using various learning algorithms

Figure 9 and Figure 10 demonstrate the averaged accuracy and AUC, respectively, of eight additional commonly used learning algorithms using clinical measurements, clinical codes and their combination over the 27 datasets. It is clear that the random forest algorithm outperforms the others; for most learning algorithms, using clinical codes yields the best predictive performance; however, for learning algorithms that are very sensitive to high dimensionality, such as k nearest neighbors (kNN), using measurements alone and/or applying feature selection improve(s) the predictive performance, as the number of clinical measurements is far smaller than the number of clinical codes.

Figure 9
figure 9

Accuracy of multiple classifiers using clinical measurements (M), clinical codes (C) and their combination (M+C) at each feature selection threshold.

Figure 10
figure 10

AUC of multiple classifiers using clinical measurements (M), clinical codes (C) and their combination (M+C) at each feature selection threshold.

Discussion

This study investigated the use of various types of structured EHR data - clinical measurements and clinical codes - both in isolation and in combination, to build machine learning models for ADE detection. The results show that using clinical codes alone, or together with clinical measurements, leads to significantly improved predictive performance compared to using only clinical measurements. In addition, feature selection based on information gain was conducted to remove relatively less informative variables, which also enables a deeper inspection of the informativeness of each data type and representation.

Results analysis

We evaluated different representations of clinical measurements and clinical codes using methods proposed in [24] and [26], and slightly different results are observed here. In the previous study that explored the possibility of exploiting the concept hierarchies of clinical codes [26], it was demonstrated that using only the more general levels of the codes was sufficient to maintain the predictive performance on a high level; in this study, however, we observed that using all levels of the codes, including both the general and the more specific levels, yields the best predictive performance. A possible explanation for this is that the tasks in the two studies are different: in [26], the task was to distinguish patients with a specific ADE from randomly selected patients without the ADE; in this study, the task was to distinguish patients with a specific ADE from patients with a similar disease to the ADE. The latter is a much more difficult task than the former, as the positive and negative examples are more similar in the latter. It is thus not surprising that, in this task, more specific levels of codes are needed to improve the predictive performance. In the study that investigates various representations of clinical measurements [24], the model with a combination of multiple representations outperformed the ones with any single representation, which is consistent with the observation in this study; however, the predictive performance of models using the single representations are inconsistent with the previous study: Mean is the best in the former, while Count is the best in the latter. This discrepancy might be due to slightly different settings of the tasks in the two studies. In [24], the task was also to distinguish patients with a specific ADE and patients with similar diseases to the ADE, but it is achieved by retrospectively analyzing the entire available patient history in the EHRs, i.e., clinical events that occurred after the target ADE were included in the predictive models; in this study, the task was instead designed for detecting ADEs at the point of care, which means that only the clinical events that occurred prior to the target ADE were allowed to be exploited in the predictive models.

By combining clinical measurements and clinical codes, the predictive performance does not outperform using only clinical codes. In order to understand the reasons for this observation, we looked at the number of features selected from each data type and their corresponding relative informativeness by ranking features based on their information gain. In general, most of the selected features are clinical codes, which is partly biased as there are in fact more codes than measurements in the feature set, but even when only the top 10% of features are selected, the majority of the top-ranked features are clinical codes. Since only looking at the quantity is not fair in this case, we instead inspected the relative informativeness, adjusted by the number of features, between codes and measurements. It turned out that clinical codes were consistently more informative than clinical measurements. Although by using only clinical measurements, the predictive performance is not worse than random guessing (average accuracy of 81.41 and AUC of 0.655), adding them to clinical codes does not seem to be helpful in improving the predictive performance compared to using codes alone. This can partly be explained by how each tree is built in the random forest: the algorithm selects the most informative feature from a random subset of features as the node to split on when building each tree. In this case, clinical measurements are less likely to be selected as they are inferior to clinical codes in terms of both quantity and quality. As a result, they can almost be considered useless when used in conjunction with clinical codes.

Besides the random forest algorithm, we also employed several other common learning algorithms. Similar results are observed with AdaBoost, Bagging and decision tree as were observed for the random forest algorithm, while for the other learning algorithms that are neither tree-based nor ensemble models, the results deviate from the previous pattern. For example, logistic regression favors the combination of clinical codes and measurements when no feature selection is conducted; a support vector machine with the RBF kernel using clinical measurements yields better predictive performance when only part of the features are selected; and the k nearest neighbor algorithm always achieves better performance by using clinical measurements alone. Moreover, feature selection has a different impact on these learning algorithms, which is basically consistent with what we know about their sensitiveness towards high dimensionality, e.g., adding feature selection clearly improves the predictive performance of the k nearest neighbor algorithm. Here, it is worth noting that among all of the investigated learning algorithms, the random forest classifier consistently outperforms the others for this task, which, again, proves its robustness on handling high dimensional data.

In addition to the averaged results over the 27 datasets, we also presented results for each individual dataset. For most datasets, using only clinical measurements results in the worst performance; however, if we look at the results for accuracy, for some datasets, such as G251, F132 and L270, opposite results are observed; for the AUC results, we can see that for datasets D642, E273, F199, L270, T783, T784, T886 and T887, using a combination of clinical measurements and codes outperforms the others. These diverse results can perhaps be explained by the different nature of each ADE. For example, to detect D642 (drug induced anemia), using clinical codes only is probably not sufficient since such a diagnosis is often made after observing results from blood tests; to detect ADEs starting with F (mental and behavioural disorders), it is less likely that using clinical measurements is helpful, whereas clinical notes, in this case, might contain much more valuable information than the structured data.

Challenges of using electronic health records for adverse drug event detection

Although EHRs are increasingly considered as a valuable resource for pharmocavgilance and machine learning based methods are often favored over other methods when analyzing large amounts of data from EHRs, it is, by using such purely data-driven methods, difficult to distinguish clinically relevant signals from systematic biases in the data. Therefore, the machine learning methods should serve primarily as tools for exploring the massive amounts of data and testing hypotheses; eventually, human knowledge and experience is still necessary to evaluate the validity of the findings.

In addition to the challenges that have already been discussed in the background section, EHR data is also very noisy. On the one hand, the quality of the diagnosis encoding varies according to the experience and expertise of coders [38], making it difficult for data analysts to adjust the validity and reliability of the reported events. According to a review by the Swedish National Board of Health and Welfare, around 20% of the assigned primary diagnosis codes were found to be erroneous [39]. On the other hand, clinical codes can be influenced by various factors, such as the knowledge and experience of the clinicians, the amount of information available at admission and strategic billing, rendering the choice of codes to report biased. In such situations, when the codes are used to label the training data, we should proceed with caution as they cannot entirely be considered as a gold standard. One expensive alternative here is to involve experts for reviewing training data and correcting incorrect labels.

Limitations and future work

One limitation of this study is that the labels in the training data are directly extracted from the EHR database without being scrutinized by clinical experts. This could lead to findings that do not entirely reflect reality. Moreover, both clinical codes and measurements are represented in certain ways in this study, and hence the results and findings are limited only to these representations. It is, for instance, conceivable that, with better representations, clinical measurements would be as informative as clinical codes for detecting ADEs. Therefore, in future work, representations that can further improve the informativeness of clinical measurements should be explored. This study only included two types of data, codes and measurements, from EHRs. A natural extension would thus be to include more data types, such as lab tests and notes.

Conclusions

We have here demonstrated how machine learning can be employed to analyze structured data in electronic health records for the purpose of supporting pharmacovigilance activities such as detecting adverse drug events. Predictive models learned from electronic health records could be incorporated into adverse drug event alerting systems at the point of care, primarily facilitating the correct encoding of adverse drug events, which, in turn, would address the problem of under-reporting of adverse drug events and lead to more reliable statistics. To create high-performing predictive models, it is essential to pay careful attention to which data to use and how to best represent it, especially so when faced with high-dimensional and extremely sparse data. We have here presented a detailed study and proposed solutions to the said challenges, focusing on two groups of data: measurements and clinical codes that encode drugs and diagnoses.

Within each data type, it is advantageous to combine multiple representations, effectively providing a more holistic view of the data. Across data types, providing all representations of each data type leads to improved predictive performance for some learning algorithms, while for the best-performing learning algorithm - random forest - this is beneficial in certain cases only, i.e., for specific adverse drug events. Generally speaking, clinical codes are more informative than measurements for the purpose of detecting adverse drug events, and it is not necessary in general to add measurements to clinical codes. Selecting a subset of the most informative features can, to some extent, lead to improved predictive performance, even with learning algorithms that are considered to effectively handle high-dimensional data.