Big Data Analytics in Healthcare - Promise and Potential
http://www.hissjournal.com/content/2/1/3
REVIEW
Open Access
Abstract
Objective: To describe the promise and potential of big data analytics in healthcare.
Methods: The paper describes the nascent field of big data analytics in healthcare, discusses the benefits, outlines
an architectural framework and methodology, describes examples reported in the literature, briefly discusses the
challenges, and offers conclusions.
Results: The paper provides a broad overview of big data analytics for healthcare researchers and practitioners.
Conclusions: Big data analytics in healthcare is evolving into a promising field for providing insight from very large
data sets and improving outcomes while reducing costs. Its potential is great; however there remain challenges to
overcome.
Keywords: Big data, Analytics, Hadoop, Healthcare, Framework, Methodology
Introduction
The healthcare industry historically has generated large
amounts of data, driven by record keeping, compliance
& regulatory requirements, and patient care [1]. While
most data is stored in hard copy form, the current trend is
toward rapid digitization of these large amounts of data.
Driven by mandatory requirements and the potential to
improve the quality of healthcare delivery while reducing costs, these massive quantities of data (known
as big data) hold the promise of supporting a wide range
of medical and healthcare functions, including among
others clinical decision support, disease surveillance,
and population health management [2-5]. Reports say
data from the U.S. healthcare system alone reached
150 exabytes in 2011. At this rate of growth, big data for U.S.
healthcare will soon reach the zettabyte (10^21 bytes)
scale and, not long after, the yottabyte (10^24 bytes) scale [6].
Kaiser Permanente, the California-based health network,
which has more than 9 million members, is believed to
have between 26.5 and 44 petabytes of potentially rich
data from EHRs, including images and annotations [6].
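To put the volumes quoted above on a common scale, the following is a minimal sketch of the unit arithmetic. It assumes decimal (SI) prefixes, under which one zettabyte is 10^21 bytes; the helper name `convert` and the variable names are illustrative and not drawn from the cited reports.

```python
# Decimal (SI) storage units expressed in bytes (assumed convention).
UNITS = {
    "GB": 10**9,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
    "YB": 10**24,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two SI storage units."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# Figures quoted in the text [6].
us_healthcare_2011_zb = convert(150, "EB", "ZB")
kaiser_low_eb = convert(26.5, "PB", "EB")
kaiser_high_eb = convert(44, "PB", "EB")

print(f"U.S. healthcare data, 2011: {us_healthcare_2011_zb} ZB")
print(f"Kaiser Permanente estimate: {kaiser_low_eb} to {kaiser_high_eb} EB")
```

On this convention the 2011 national figure is a fraction of a zettabyte, and the Kaiser Permanente estimate is a small fraction of an exabyte.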
By definition, big data in healthcare refers to electronic
health data sets so large and complex that they are difficult
* Correspondence: [email protected]
1 Graduate School of Business, Fordham University, 113 W. 60th Street, 10023 New York, NY, USA
Full list of author information is available at the end of the article
© 2014 Raghupathi and Raghupathi; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly credited.
Raghupathi and Raghupathi Health Information Science and Systems 2014, 2:3
http://www.hissjournal.com/content/2/1/3
Architectural framework
The conceptual framework for a big data analytics project in healthcare is similar to that of a traditional health
informatics or analytics project. The key difference lies
in how processing is executed. In a regular health analytics project, the analysis can be performed with a business intelligence tool installed on a stand-alone system,
such as a desktop or laptop. Because big data is by definition large, processing is broken down and executed
across multiple nodes. The concept of distributed processing has existed for decades. What is relatively new is
its use in analyzing very large data sets as healthcare
providers start to tap into their large data repositories to
gain insight for making better-informed health-related
decisions. Furthermore, open source platforms such as
Hadoop/MapReduce, available on the cloud, have encouraged the application of big data analytics in healthcare.
While the algorithms and models are similar, the user
interfaces of traditional analytics tools and those used
for big data are entirely different; traditional health analytics tools have become very user friendly and transparent. Big data analytics tools, on the other hand, are
extremely complex and programming intensive, and they require
the application of a variety of skills. They have emerged
in an ad hoc fashion mostly as open-source development
tools and platforms, and therefore they lack the support
and user-friendliness that vendor-driven proprietary
tools possess. As Figure 1 indicates, the complexity begins with the data itself.
Big data in healthcare can come from internal (e.g., electronic health records, clinical decision support systems,
CPOE, etc.) and external sources (government sources, laboratories, pharmacies, insurance companies & HMOs,
etc.), often in multiple formats (flat files, .csv, relational
data from diverse sources is cleansed and readied. Depending on whether the data is structured or unstructured, several data formats can be input to the big data
analytics platform.
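The cleansing step described here can be illustrated with a short sketch. It assumes two hypothetical extracts, one comma-separated with a header row and one pipe-delimited flat file without one; every field name, value, and function name below is invented for illustration and does not come from any real system.

```python
import csv
import io

# Hypothetical extracts from two sources; all fields and values are invented.
EHR_CSV = """patient_id,dob,glucose_mg_dl
p001,1970-03-14,110
p002,1985-11-02,
p003,1962-07-30,145
"""

LAB_FLAT_FILE = """P003|1962-07-30|145
P004|1990-01-21|98
"""


def read_ehr_csv(text):
    """Parse the comma-separated extract, which has a header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_lab_flat_file(text):
    """Parse the pipe-delimited extract, which has no header row."""
    fields = ["patient_id", "dob", "glucose_mg_dl"]
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [dict(zip(fields, row)) for row in reader if row]


def cleanse(records):
    """Normalize identifiers, drop incomplete rows, and remove duplicates."""
    seen = set()
    clean = []
    for rec in records:
        if not rec["glucose_mg_dl"]:
            continue  # skip rows with a missing measurement
        row = (
            rec["patient_id"].strip().lower(),
            rec["dob"],
            float(rec["glucose_mg_dl"]),
        )
        if row not in seen:
            seen.add(row)
            clean.append(
                {"patient_id": row[0], "dob": row[1], "glucose_mg_dl": row[2]}
            )
    return clean


staged = cleanse(read_ehr_csv(EHR_CSV) + read_lab_flat_file(LAB_FLAT_FILE))
for rec in staged:
    print(rec)
```

The sketch reduces both inputs to one record layout before any analysis runs, so the downstream platform sees a single schema regardless of the original format.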
In the next component of the conceptual framework,
several decisions are made regarding the data input approach, distributed design, tool selection and analytics
models. Finally, on the far right, the four typical applications of big data analytics in healthcare are shown.
These include queries, reports, OLAP, and data mining.
Visualization is an overarching theme across the four applications. Drawing from such fields as statistics, computer science, applied mathematics and economics, a
wide variety of techniques and technologies has been developed and adapted to aggregate, manipulate, analyze,
and visualize big data in healthcare.
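Of the four applications, OLAP is perhaps the least self-explanatory. The sketch below shows one common OLAP operation, a roll-up, which sums a measure over the dimensions that are not being grouped on. The encounter facts, dimension names, and the `roll_up` helper are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical encounter facts: (region, year, diagnosis, cost in dollars).
FACTS = [
    ("Northeast", 2012, "diabetes", 1200.0),
    ("Northeast", 2012, "asthma", 300.0),
    ("Northeast", 2013, "diabetes", 1500.0),
    ("West", 2012, "diabetes", 900.0),
    ("West", 2013, "asthma", 450.0),
]
DIMENSIONS = ("region", "year", "diagnosis")


def roll_up(facts, group_by):
    """Sum the cost measure over every dimension not named in group_by."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[i] for i in idx)
        totals[key] += fact[-1]  # the last field is the cost measure
    return dict(totals)


by_region_year = roll_up(FACTS, ("region", "year"))
by_diagnosis = roll_up(FACTS, ("diagnosis",))

print(by_region_year)
print(by_diagnosis)
```

Grouping on fewer dimensions gives a coarser view of the same facts, which is the kind of interactive summarization that reports and dashboards build on.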
The most significant platform for big data analytics is
the open-source distributed data processing platform
Hadoop (Apache platform), initially developed for such
routine functions as aggregating web search indexes. It
belongs to the class of NoSQL technologies (others include CouchDB and MongoDB) that evolved to aggregate data in unique ways. Hadoop has the potential to
process extremely large amounts of data mainly by allocating partitioned data sets to numerous servers (nodes),
each of which solves a different part of the larger problem; the partial results are then integrated into the final result [28-31].
Hadoop can serve the twin roles of data organizer and
analytics tool. It offers a great deal of potential in enabling enterprises to harness the data that has been, until
now, difficult to manage and analyze. Specifically, Hadoop
makes it possible to process extremely large volumes of
data with various structures or no structure at all. But
Hadoop can be challenging to install, configure and administer, and individuals with Hadoop skills are not easily
found. For these reasons, it appears organizations are not quite ready to embrace Hadoop completely.
The surrounding ecosystem of additional platforms and
tools supports the Hadoop distributed platform [30,31].
These are summarized in Table 1.
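The partition-and-integrate idea behind Hadoop/MapReduce can be sketched in a few lines. The example below is a single-process simulation of the map, shuffle, and reduce phases and does not use any Hadoop API; the partitions, the diagnosis codes, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical partitions of diagnosis codes, one list per node.
PARTITIONS = [
    ["E11", "I10", "E11"],
    ["J45", "E11"],
    ["I10", "I10", "J45"],
]


def map_phase(partition):
    """Emit a (key, 1) pair for each record held by one node."""
    return [(code, 1) for code in partition]


def shuffle(mapped_outputs):
    """Group intermediate values by key across all nodes."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into the final result."""
    return {key: sum(values) for key, values in grouped.items()}


# On a cluster each map_phase call would run on a separate node in parallel;
# here the calls run one after another in a single process.
mapped = [map_phase(p) for p in PARTITIONS]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each map call touches only its own partition, the work can be spread across as many nodes as there are partitions, and only the small intermediate pairs need to be moved for the final integration.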
Numerous vendors, including AWS, Cloudera,
Hortonworks, and MapR Technologies, distribute open-source Hadoop platforms [29]. Many proprietary options
are also available, such as IBM's BigInsights. Further,
many of these platforms are cloud versions, making them
widely available. Cassandra, HBase, and MongoDB, described above, are used widely for the database component. While the available frameworks and tools are mostly
open source and wrapped around Hadoop and related
platforms, there are numerous trade-offs that developers and users of big data analytics in healthcare must
consider. While the development costs may be lower
since these tools are open source and free of charge, the
downsides are the lack of technical support and minimal
Table 1 Platforms and tools in the Hadoop ecosystem: MapReduce, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Lucene, Avro, Mahout
Step 1: Concept statement
  -Establish need for big data analytics project in healthcare based on the 4Vs.
Step 2: Proposal
  -What is the problem being addressed?
  -Why is it important and interesting?
  -Why big data analytics approach?
  -Background material
Step 3: Methodology
  -Propositions
  -Variable selection
  -Data collection
  -ETL and data transformation
  -Platform/tool selection
  -Conceptual model
  -Analytic techniques (association, clustering, classification, etc.)
  -Results & insight
Step 4: Deployment
  -Evaluation & validation
  -Testing
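As one concrete instance of the analytic techniques named in the methodology, the sketch below computes the two basic measures used in association analysis, support and confidence. The patient records and condition labels are hypothetical, and the helper functions are illustrative rather than part of any particular platform.

```python
# Hypothetical patient records, each a set of recorded conditions.
PATIENTS = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "high_cholesterol"},
    {"hypertension"},
    {"diabetes"},
    {"high_cholesterol", "hypertension"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


rule_support = support({"diabetes", "hypertension"}, PATIENTS)
rule_confidence = confidence({"diabetes"}, {"hypertension"}, PATIENTS)

print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

A rule such as "diabetes implies hypertension" would typically be kept only if both measures exceed thresholds chosen for the study, which is a decision that belongs to the variable selection and proposition steps above.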
Examples
(NICE) of the U.K.'s National Health Service. NICE is reportedly a leader in the analytics of large clinical datasets
for exploring the effectiveness of clinical and cost factors
in the use of new drugs and/or clinical treatments. The
Italian Medicines Agency is also reported to collect and
analyze clinical data on the use of expensive new drugs as
one goal in a country-level cost-effectiveness program [6].
Another leading example of big data analytics in healthcare is the Department of Veterans Affairs' (VA) use of applications on its very large data set in an effort to comply
with a performance-based accountability framework and
disease management practice [6]. In one very famous example, California-based Kaiser Permanente associated
clinical data with cost data to generate a key data set, the
analytics of which led to the discovery of adverse drug effects and subsequent withdrawal of Vioxx from the market [6]. Researchers at the Johns Hopkins School of
Medicine discovered they could use data from Google Flu
Trends to predict sudden increases in flu-related emergency room visits at least a week before warnings from
the CDC. Likewise, the analysis of Twitter updates was as
accurate as (and two weeks ahead of) official reports at
tracking the spread of cholera in Haiti after the January
2010 earthquake [6]. Also reported is an application developed by IBM that predicts the likely outcomes of diabetes
patients using patients' panel data linked to physicians,
management protocols, and the overall relationship to
population health management averages [6]. In another diabetes application, physicians at Harvard Medical School
and Harvard Pilgrim Health Care recently demonstrated
the potential of applying analytics to EHR data to identify and group patients with diabetes for public health surveillance. Four years' worth of data based on numerous
indicators from multiple sources was utilized. The analytics application also differentiated between Type 1 and
Type 2 diabetes [6,26]. Finally, at Blue Cross Blue Shield
of Massachusetts (BCBSMA) there was a need to embed
analytics into business processes to help decision-makers
across the business gain insight into financial and medical
data and become more proactive. Several benefits were
reported. First, the analytics enabled medical directors to
identify high-risk disease groups and act to minimize risk
and improve patient outcomes. For example, new preventive treatment protocols could be introduced among
patient groups with high cholesterol, thereby fending off
heart problems. Also, complex health informatics reports were generated 300% faster than previously, helping BCBSMA service clients more effectively [6].
The next section briefly identifies some of the key
challenges in big data analytics in healthcare.
Challenges
Conclusions
Big data analytics has the potential to transform the way
healthcare providers use sophisticated technologies to
gain insight from their clinical and other data repositories and make informed decisions. In the future we'll see
the rapid, widespread implementation and use of big
data analytics across the healthcare organization and the
healthcare industry. To that end, the several challenges
highlighted above must be addressed. As big data analytics becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards
and governance, and continually improving the tools and
technologies will garner attention. Big data analytics and
applications in healthcare are at a nascent stage of development, but rapid advances in platforms and tools can accelerate their maturing process.
Competing interests
We, the authors, declare we have no competing interests.
Authors' contributions
Both WR and VR contributed equally. Both authors read and approved the
final manuscript.
Author details
1 Graduate School of Business, Fordham University, 113 W. 60th Street, 10023 New York, NY, USA. 2 Brooklyn College, City University of New York, Brooklyn, NY, USA.
References
1. Raghupathi W: Data Mining in Health Care. In Healthcare Informatics: Improving Efficiency and Productivity. Edited by Kudyba S. Taylor & Francis; 2010:211-223.
2. Burghard C: Big Data and Analytics Key to Accountable Care Success. IDC Health Insights; 2012.
3. Dembosky A: Data Prescription for Better Healthcare. Financial Times, December 12, 2012, p. 19; 2012. Available from: http://www.ft.com/intl/cms/s/2/55cbca5a-4333-11e2-aa8f-00144feabdc0.html#axzz2W9cuwajK.
4. Feldman B, Martin EM, Skotnes T: Big Data in Healthcare Hype and Hope. October 2012. Dr. Bonnie 360; 2012. http://www.west-info.eu/files/big-data-inhealthcare.pdf.
5. Fernandes L, O'Connor M, Weaver V: Big data, bigger outcomes. J AHIMA 2012:38-42.
6. IHTT: Transforming Health Care through Big Data Strategies for leveraging big data in the health care industry; 2013. http://ihealthtran.com/wordpress/2013/03/iht%C2%B2-releases-big-data-research-reportdownload-today/.
7. Frost & Sullivan: Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations. http://www.emc.com/collateral/analyst-reports/frost-sullivan-reducing-information-technologycomplexities-ar.pdf.
8. Bian J, Topaloglu U, Yu F, Yu F: Towards Large-scale Twitter Mining for Drug-related Adverse Events. Maui, Hawaii: SHB; 2012.
9. Raghupathi W, Raghupathi V: An Overview of Health Analytics. Working paper; 2013.
10. Ikanow: Data Analytics for Healthcare: Creating Understanding from Big Data. http://info.ikanow.com/Portals/163225/docs/data-analytics-for-healthcare.pdf.
11. jStart: How Big Data Analytics Reduced Medicaid Re-admissions. A jStart Case Study; 2012. http://www-01.ibm.com/software/ebusiness/jstart/portfolio/uncMedicaidCaseStudy.pdf.
12. Knowledgent: Big Data and Healthcare Payers; 2013. http://knowledgent.com/mediapage/insights/whitepaper/482.
13. Explorys: Unlocking the Power of Big Data to Improve Healthcare for Everyone. https://www.explorys.com/docs/data-sheets/explorys-overview.pdf.
14. IBM: IBM big data platform for healthcare. Solutions Brief; 2012. http://public.dhe.ibm.com/common/ssi/ecm/en/ims14398usen/IMS14398USEN.PDF.
15. Intel: Leveraging Big Data and Analytics in Healthcare and Life Sciences: Enabling Personalized Medicine for High-Quality Care, Better Outcomes; 2012. http://www.intel.com/content/dam/www/public/us/en/documents/whitepapers/healthcare-leveraging-big-data-paper.pdf.
16. IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. http://www03.ibm.com/industries/ca/en/healthcare/documents/Data_driven_healthcare_organizations_use_big_data_analytics_for_big_gains.pdf.
17. Savage N: Digging for drug facts. Commun ACM 2012, 55(10):11-13.
18. Zenger B: Can Big Data Solve Healthcare's Big Problems? HealthByte, February 2012; 2012. http://www.equityhealthcare.com/docstor/EH%20Blog%20on%20Analytics.pdf.
19. LaValle S, Lesser E, Shockley R, Hopkins MS, Kruschwitz N: Big data, analytics and the path from insights to value. MIT Sloan Manag Rev 2011, 52:20-32.
20. Capgemini: The Deciding Factor: Big Data & Decision Making; 2013. http://www.capgemini.com/thought-leadership/the-deciding-factor-big-datadecision-making.
21. Connolly S, Wooledge S: Harnessing the Value of Big Data Analytics. Teradata; 2013.
22. Courtney M: Puzzling out big data. Engineering & Technology 2013:56-60.
23. Intel: Big Data Analytics; 2012. http://www.intel.com/content/dam/www/public/us/en/documents/reports/data-insights-peer-research-report.pdf.
24. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH: Big Data: The Next Frontier for Innovation, Competition, and Productivity. USA: McKinsey Global Institute; 2011.
25. IBM: Large Gene interaction Analytics at University at Buffalo, SUNY; 2012. http://public.dhe.ibm.com/common/ssi/ecm/en/imc14675usen/IMC14675USEN.PDF.
26. IBM: Harvard Medical School; 2011. http://public.dhe.ibm.com/common/ssi/ecm/en/imc14685usen/IMC14685USEN.PDF.
27. Raghupathi W, Kesh S: Interoperable electronic health records design: towards a service-oriented architecture. e-Service Journal 2007, 5:39-57.
28. Borkar VR, Carey MJ, Chen L: Big data platforms: what's next? ACM Crossroads 2012, 19(1):44-49.
29. Ohlhorst F: Big Data Analytics: Turning Big Data into Big Money. USA: John Wiley & Sons; 2012.