VERONICA: Visual Analytics for Identifying Feature Groups in Disease Classification
Abstract
1. Introduction
2. Background
2.1. Visual Analytics
2.1.1. Analytics Module
2.1.2. Interactive Visualization Module
2.2. Machine Learning Techniques
2.2.1. Decision Tree
2.2.2. Support Vector Machines
2.2.3. Naive Bayes
2.3. Class Imbalance Problem
Sampling Techniques
3. Related Work
4. Materials and Methods
4.1. Design Process and Participants
4.2. Data Sources
4.3. Cohort Entry Criteria
4.4. Response Variable
4.5. Input Features
5. Implementation Details
6. Workflow
7. The Design of VERONICA
7.1. Analytics Module
7.2. Interactive Visualization Module
8. Limitations
9. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Data Source | Description | Study Purpose |
---|---|---|
Canadian Institute for Health Information Discharge Abstract Database and National Ambulatory Care Reporting System | The Canadian Institute for Health Information Discharge Abstract Database and the National Ambulatory Care Reporting System collect diagnostic and procedural variables for inpatient stays and ED visits, respectively. Diagnostic and inpatient procedural coding uses the Canadian modification of the International Classification of Diseases, 10th Revision (after 2002). | Cohort creation, description, exposure, and outcome estimation |
Ontario Drug Benefits | The Ontario Drug Benefits database includes a wide range of outpatient prescription medications available to all Ontario citizens over the age of 65. The error rate in the Ontario Drug Benefits database is less than 1%. | Medication prescriptions, description, and exposure |
Registered Persons Database | The Registered Persons Database captures demographic (sex, date of birth, postal code) and vital status information on all Ontario residents. Relative to the Canadian Institute for Health Information Discharge Abstract Database in-hospital death flag, the Registered Persons Database has a sensitivity of 94% and a positive predictive value of 100%. | Cohort creation, description, and exposure |
Ontario Health Insurance Plan | The Ontario Health Insurance Plan database contains information on Ontario physician billing claims for medical services using fee and diagnosis codes outlined in the Ontario Health Insurance Plan Schedule of Benefits. These codes capture information on outpatient, inpatient, and laboratory services rendered to a patient. | Cohort creation, stratification, description, exposure, and outcome |
Variable | Database | Code Set | Code |
---|---|---|---|
Major cancer | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 150, 154, 155, 157, 162, 174, 175, 185, 203, 204, 205, 206, 207, 208, 2303, 2304, 2307, 2330, 2312, 2334 |
Major cancer | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | 971, 980, 982, 984, 985, 986, 987, 988, 989, 990, 991, 993, C15, C18, C19, C20, C22, C25, C34, C50, C56, C61, C82, C83, C85, C91, C92, C93, C94, C95, D00, D010, D011, D012, D022, D075, D05 |
Major cancer | Ontario Health Insurance Plan | Diagnosis | 203, 204, 205, 206, 207, 208, 150, 154, 155, 157, 162, 174, 175, 183, 185 |
Chronic liver disease | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 4561, 4562, 070, 5722, 5723, 5724, 5728, 573, 7824, V026, 571, 2750, 2751, 7891, 7895 |
Chronic liver disease | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | B16, B17, B18, B19, I85, R17, R18, R160, R162, B942, Z225, E831, E830, K70, K713, K714, K715, K717, K721, K729, K73, K74, K753, K754, K758, K759, K76, K77 |
Chronic liver disease | Ontario Health Insurance Plan | Diagnosis | 571, 573, 070 |
Chronic liver disease | Ontario Health Insurance Plan | Fee code | Z551, Z554 |
Coronary artery disease (excluding angina) | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Diagnostic, Therapeutic and Surgical Procedures | 4801, 4802, 4803, 4804, 4805, 481, 482, 483 |
Coronary artery disease (excluding angina) | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Health Interventions | 1IJ50, 1IJ76 |
Coronary artery disease (excluding angina) | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 412, 410, 411 |
Coronary artery disease (excluding angina) | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | I21, I22, Z955, T822 |
Coronary artery disease (excluding angina) | Ontario Health Insurance Plan | Diagnosis | 410, 412 |
Coronary artery disease (excluding angina) | Ontario Health Insurance Plan | Fee code | R741, R742, R743, G298, E646, E651, E652, E654, E655, Z434, Z448 |
Diabetes | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 250 |
Diabetes | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | E10, E11, E13, E14 |
Diabetes | Ontario Health Insurance Plan | Diagnosis | 250 |
Diabetes | Ontario Health Insurance Plan | Fee code | Q040, K029, K030, K045, K046 |
Heart failure | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Diagnostic, Therapeutic and Surgical Procedures | 4961, 4962, 4963, 4964 |
Heart failure | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Health Interventions | 1HP53, 1HP55, 1HZ53GRFR, 1HZ53LAFR, 1HZ53SYFR |
Heart failure | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | I500, I501, I509, I255, J81 |
Heart failure | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | I21, I22, Z955, T822 |
Heart failure | Ontario Health Insurance Plan | Diagnosis | 428 |
Heart failure | Ontario Health Insurance Plan | Fee code | R701, R702, Z429 |
Hypertension | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 401, 402, 403, 404, 405 |
Hypertension | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | I10, I11, I12, I13, I15 |
Hypertension | Ontario Health Insurance Plan | Diagnosis | 401, 402, 403 |
Kidney stones | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 5920, 5921, 5929, 5940, 5941, 5942, 5948, 5949, 27411 |
Kidney stones | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | N200, N201, N202, N209, N210, N211, N218, N219, N220, N228 |
Peripheral vascular disease | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Diagnostic, Therapeutic and Surgical Procedures | 5125, 5129, 5014, 5016, 5018, 5028, 5038, 5126, 5159 |
Peripheral vascular disease | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Health Interventions | 1KA76, 1KA50, 1KE76, 1KG50, 1KG57, 1KG76MI, 1KG87, 1IA87LA, 1IB87LA, 1IC87LA, 1ID87LA, 1KA87LA, 1KE57 |
Peripheral vascular disease | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 4402, 4408, 4409, 5571, 4439, 444 |
Peripheral vascular disease | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | I700, I702, I708, I709, I731, I738, I739, K551 |
Peripheral vascular disease | Ontario Health Insurance Plan | Fee code | R787, R780, R797, R804, R809, R875, R815, R936, R783, R784, R785, E626, R814, R786, R937, R860, R861, R855, R856, R933, R934, R791, E672, R794, R813, R867, E649 |
Cerebrovascular disease (stroke or transient ischemic attack) | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 430, 431, 432, 4340, 4341, 4349, 435, 436, 3623 |
Cerebrovascular disease (stroke or transient ischemic attack) | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | I62, I630, I631, I632, I633, I634, I635, I638, I639, I64, H341, I600, I601, I602, I603, I604, I605, I606, I607, I609, I61, G450, G451, G452, G453, G458, G459, H340 |
Chronic kidney disease | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | 4030, 4031, 4039, 4040, 4041, 4049, 585, 586, 5888, 5889, 2504 |
Chronic kidney disease | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | E102, E112, E132, E142, I12, I13, N08, N18, N19 |
Chronic kidney disease | Ontario Health Insurance Plan | Diagnosis | 403, 585 |
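
As an illustration of how code sets like those in the table above could be turned into binary comorbidity features, the following is a minimal sketch and not the authors' implementation. It copies the International Classification of Diseases 10th Revision codes for diabetes and hypertension from the table. The function name `flag_comorbidities`, the dictionary `ICD10_CODE_SETS`, and the prefix-matching rule are assumptions made for this example; the source does not state how recorded codes are matched against the listed codes.

```python
from typing import Dict, Iterable, Tuple

# Code sets copied from the diabetes and hypertension rows of the table above
# (International Classification of Diseases 10th Revision only).
ICD10_CODE_SETS: Dict[str, Tuple[str, ...]] = {
    "Diabetes": ("E10", "E11", "E13", "E14"),
    "Hypertension": ("I10", "I11", "I12", "I13", "I15"),
}


def flag_comorbidities(diagnosis_codes: Iterable[str]) -> Dict[str, bool]:
    """Return one boolean flag per comorbidity for a patient's codes.

    Assumption: a recorded code matches when it starts with a listed code
    (prefix matching). The source does not specify the matching rule.
    """
    # Normalize case and drop dots so that "e11.9" is compared as "E119".
    codes = [c.strip().upper().replace(".", "") for c in diagnosis_codes]
    return {
        name: any(code.startswith(prefixes) for code in codes)
        for name, prefixes in ICD10_CODE_SETS.items()
    }
```

For example, `flag_comorbidities(["E11.9", "K35"])` marks diabetes as present and hypertension as absent under the prefix-matching assumption. Exact matching would be a stricter alternative if the listed codes are meant to be complete.
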
Variable | Database | Code Set | Code |
---|---|---|---|
Family physician visit | Ontario Health Insurance Plan | Fee code | A001, A003, A004, A005, A006, A007, A008, A900, A901, A905, A911, A912, A967, K131, K132, K140, K141, K142, K143, K144, W003, W008, W121 |
Variable | Database | Code Set | Code |
---|---|---|---|
Dialysis | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Diagnostic, Therapeutic and Surgical Procedures | 5127, 5142, 5143, 5195, 6698 |
Dialysis | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Health Interventions | 1PZ21, 1OT53DATS, 1OT53HATS, 1OT53LATS, 1SY55LAFT, 7SC59QD, 1KY76, 1KG76MZXXA, 1KG76MZXXN, 1JM76NC, 1JM76NCXXN |
Dialysis | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 9th Revision | V451, V560, V568, 99673 |
Dialysis | Canadian Institute for Health Information Discharge Abstract Database | International Classification of Diseases 10th Revision | T824, Y602, Y612, Y622, Y841, Z49, Z992 |
Dialysis | Ontario Health Insurance Plan | Fee code | R850, G324, G336, G327, G862, G865, G099, R825, R826, R827, R833, R840, R841, R843, R848, R851, R946, R943, R944, R945, R941, R942, Z450, Z451, Z452, G864, R852, R853, R854, R885, G333, H540, H740, R849, G323, G325, G326, G860, G863, G866, G330, G331, G332, G861, G082, G083, G085, G090, G091, G092, G093, G094, G095, G096, G294, G295 |
Kidney transplant | Canadian Institute for Health Information Discharge Abstract Database | Canadian Classification of Health Interventions | 1PC85 |
Kidney transplant | Ontario Health Insurance Plan | Fee code | S435, S434 |
Features |
---|
Minor assessment |
General assessment |
General re-assessment |
Consultation |
Repeat consultation |
Intermediate assessment or well-baby care |
Mini assessment |
Complex house call assessment |
House call assessment |
Limited consultation |
Special family and general practice consultation |
Comprehensive family and general practice consultation |
Care of the elderly FPA |
Periodic health visit—adult 65 years of age and older |
Chronic disease shared appointment-2 patients (per unit) |
Chronic disease shared appointment—3 patients (per unit) |
Chronic disease shared appointment—4 patients (per unit) |
Chronic disease shared appointment—5 patients (per unit) |
Chronic disease shared appointment—6 to 12 patients (per unit) |
Nursing home or home for the aged—first 2 subsequent visits per patient per month (per visit) |
Nursing home or home for the aged—additional subsequent visits (maximum 2 per patient per month) per visit |
Additional visits due to intercurrent illness per visit |
Hospital Encounter Codes | Medications |
---|---|
Acute myeloid leukemia | Sunitinib Malate |
Diffuse non-Hodgkin’s lymphoma | Lenalidomide |
Chronic kidney disease | Abiraterone Acetate |
Congestive heart failure | Metolazone |
Cholecystitis | Cyclosporine |
Lymphoid leukemia | Megestrol Acetate |
Malignant neoplasm of bladder | Lithium Carbonate |
Decubitus ulcer | Atropine Sulfate and Diphenoxylate Hcl |
Abnormal serum enzyme levels | Furosemide |
Secondary and unspecified malignant neoplasm of lymph nodes | Prochlorperazine Maleate |
Groups | Codes |
---|---|
Comorbidities | “C” |
Demographics | “D” |
GP visits | “G” |
Hospital encounter codes | “H” |
Medications | “M” |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Rostamzadeh, N.; Abdullah, S.S.; Sedig, K.; Garg, A.X.; McArthur, E. VERONICA: Visual Analytics for Identifying Feature Groups in Disease Classification. Information 2021, 12, 344. https://doi.org/10.3390/info12090344