Keywords
machine learning, science mapping, bibliometrics, topic analysis, SciMAT
This article is included in the Artificial Intelligence and Machine Learning gateway.
The machine learning field studies human learning processes, the theoretical analysis of possible learning algorithms, and methods for several application domains1. Studies based on machine learning have allowed scientists and companies to predict mass mortality events2 and water quality3, segment clients in private banking4, classify text automatically5 and support the production of crops, such as cocoa6. Given the growing interest of the scientific community in machine learning research and its challenges, an analysis of the field is both interesting and necessary. Science mapping analysis is a good approach for this purpose because it visualizes information in a way that helps a new researcher become familiar with a field. An example of science mapping analysis that provides an overview of the conceptual evolution of a field in medicine is proposed in 7. In this study, we perform a science mapping analysis to explore machine learning research. The objective is to give new data analysts an overview of the current knowledge base in machine learning and a starting point for exploring applications in this field.
This article has the following structure: In the Methods section, we describe the research methodology, the dataset used, the tool configuration, and how the analysis was performed. The Results section presents the results of the science mapping analysis. The conclusions are at the end of the article.
We used Scopus as the bibliographic source. In the third quarter of 2017, we searched for articles and conference papers about machine learning, using this concept as the search keyword (‘machine AND learning’) and restricting the results to 2007–2017(Q2). The concept was searched for in the title, abstract and keywords. The results were sorted by date (newest first). All the articles between 2007 and 2017 were taken into account in the analysis, with the aim of obtaining a general view of the field.
We obtained 67,475 records, which were saved in RIS format in different files organized by year. Figure 1 shows a summary of the records and indicates that research in the field of machine learning has been growing steadily. Note that the results for 2017 are not included in the figure because the records only extend to the second quarter (Q2) of the year and the database is continuously updated. However, these results were used in the analysis because they show trends, which is the primary objective of the present study.
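A per-year summary like the one described above could be produced from RIS exports along the following lines. This is only a sketch: the function name `records_per_year` and the three sample records are invented for illustration, and it assumes each record carries its publication year in a `PY` tag, as is usual in RIS files.

```python
from collections import Counter

def records_per_year(ris_text):
    """Count RIS records per publication year using the PY tag."""
    counts = Counter()
    for line in ris_text.splitlines():
        if line.startswith("PY  -"):
            # Keep only the first four characters of the value (the year).
            year = line[5:].strip()[:4]
            if year.isdigit():
                counts[int(year)] += 1
    return counts

# Invented sample with three records (not taken from the study's dataset).
sample = "\n".join([
    "TY  - JOUR", "TI  - Example A", "PY  - 2016", "ER  - ",
    "TY  - CONF", "TI  - Example B", "PY  - 2016", "ER  - ",
    "TY  - JOUR", "TI  - Example C", "PY  - 2017", "ER  - ",
])
counts = records_per_year(sample)
for year in sorted(counts):
    print(year, counts[year])
```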
We used the records from Scopus to run the analysis with SciMAT version 1.1.04. In this tool, the unit of analysis was Words. First, we carried out a deduplication process: we grouped similar words (plurals) and looked for synonyms or duplicates among the words with the highest number of documents and repetitions, always trying to avoid bias so as to include the largest number of terms. After that, we divided the time interval (2007–2017) into six smaller periods: 2007–2009, 2010–2012, 2013–2014, 2015, 2016 and 2017(Q2, papers published up to the second quarter of the year). We chose the periods this way in order to have a comparable number of articles in each of them. Finally, we carried out the analysis with the following configuration: all the periods, author’s words as the unit of analysis, a minimum frequency for data reduction of 100 (except for 2017(Q2), with 50) and co-occurrence as the kind of network. Other settings were: network reduction equal to one, association strength as the normalization measure, the simple centers algorithm with a maximum network size of seven and a minimum of five, core mapper as the document mapper, h-index and sum of citations as the quality measures and, lastly, association strength for the evolution and overlapping maps.
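To make the co-occurrence network and its normalization more concrete, the following sketch counts keyword co-occurrences and normalizes them. It assumes the commonly used form of association strength, c_ij / (c_i · c_j), where c_ij is the number of documents containing both keywords and c_i, c_j are the individual document counts; the exact formula applied inside SciMAT is not stated in this article. The function names and the three toy documents are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count keyword occurrences and pairwise co-occurrences across documents."""
    occ = Counter()
    co = Counter()
    for keywords in docs:
        unique = sorted(set(keywords))  # each keyword counted once per document
        occ.update(unique)
        co.update(combinations(unique, 2))  # pairs in alphabetical order
    return occ, co

def association_strength(occ, co):
    """Normalize co-occurrence counts as c_ij / (c_i * c_j) (assumed form)."""
    return {(a, b): c / (occ[a] * occ[b]) for (a, b), c in co.items()}

# Toy documents, invented for illustration only.
docs = [
    ["SVM", "IMAGE-PROCESSING", "NEURAL-NETWORKS"],
    ["SVM", "NEURAL-NETWORKS"],
    ["SVM", "RANDOM-FOREST"],
]
occ, co = cooccurrence(docs)
strength = association_strength(occ, co)
for pair, value in sorted(strength.items()):
    print(pair, round(value, 3))
```

In this toy data, NEURAL-NETWORKS and SVM co-occur in two documents and appear in two and three documents respectively, so their normalized link weight is 2 / (2 × 3).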
SciMAT shows a spatial representation of the way disciplines, fields, specialties and documents or authors are related to one another8–10. For this purpose, the tool implements a longitudinal framework based on co-word analysis and the h-index. Co-word analysis first provides information on the themes of a research field and, second, makes it possible to analyze and track the evolution of a field of study across consecutive periods of time11. The h-index is used to measure the impact of the identified thematic areas8.
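As a reminder of how the h-index works, a set of documents has index h when h of them have at least h citations each. The sketch below applies that definition to a list of citation counts; the function name and the sample counts are illustrative and do not come from the study's data.

```python
def h_index(citations):
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented citation counts for the documents of one hypothetical theme.
print(h_index([10, 8, 5, 4, 3]))
```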
In this tool, we follow the four steps described in 10: detecting research clusters, drawing strategic diagrams, plotting thematic areas and carrying out a performance analysis. In the first step, the tool builds a keyword co-occurrence network based on 12 and 13 and clusters keywords into topics using the simple centers algorithm. In the second step, according to 13, the centrality and density rank values of each cluster are relevant. Centrality measures the intensity of a cluster's interaction with the other clusters; the more strongly a cluster is related to the rest of the research field, the stronger these links are. Density measures the intensity of the internal links within the cluster. According to 8 and 10, the themes of the clusters can be classified into four groups using these two measures: (1) motor themes, which are both well developed and central to the research field; (2) basic and transversal themes, which are not sufficiently developed but are significant for the area of investigation; (3) emerging or declining themes, which are weakly developed and marginal; (4) highly developed and isolated themes, which have well-developed internal links but few external links and are of only minimal importance to the field. The third step is the plotting of thematic areas, and the last one is the performance analysis. In this study, we analyzed quantitative measures, such as the number of documents and authors.
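The two measures can be sketched as follows, under the assumption that they follow the usual co-word definitions: centrality as a scaled sum of the weights of links between keywords inside the cluster and keywords outside it, and density as a scaled sum of internal link weights divided by the number of keywords in the cluster. The scale factors (10 and 100), the function names and the toy link weights are assumptions made for illustration; the article does not give the exact formulas.

```python
def centrality(theme, links, scale=10.0):
    """Scaled sum of weights of links joining a theme keyword to an outside keyword."""
    members = set(theme)
    external = sum(w for (a, b), w in links.items()
                   if (a in members) != (b in members))
    return scale * external

def density(theme, links, scale=100.0):
    """Scaled sum of weights of links inside the theme, per theme keyword."""
    members = set(theme)
    internal = sum(w for (a, b), w in links.items()
                   if a in members and b in members)
    return scale * internal / len(members)

# Toy normalized link weights between hypothetical keywords A..E.
links = {
    ("A", "B"): 0.5, ("A", "C"): 0.25, ("B", "C"): 0.25,  # internal to the theme
    ("C", "D"): 0.1, ("A", "E"): 0.05,                    # links leaving the theme
}
theme = ["A", "B", "C"]
print(round(centrality(theme, links), 2), round(density(theme, links), 2))
```

Here every keyword outside the cluster is treated as belonging to some other theme, which is a simplification.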
To analyze the most relevant topics of investigation in different years, SciMAT uses strategic diagrams. We generated a diagram for each period of the study. The charts are divided into four quadrants: the upper-right quadrant alludes to the motor themes, the upper-left quadrant to the highly developed and isolated themes, the lower-right one to the basic and transversal themes, and the lower-left one to the emerging or declining topics9.
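The quadrant assignment can be illustrated with a small sketch. It assumes the diagram is split at the median centrality and median density of the themes in a period, which is one common convention and may differ from how SciMAT positions themes by rank. The theme names T1 to T4 and their values are hypothetical.

```python
from statistics import median

def classify_themes(themes):
    """Map each theme name to a quadrant, given (centrality, density) pairs."""
    c_med = median(c for c, _ in themes.values())
    d_med = median(d for _, d in themes.values())
    labels = {}
    for name, (c, d) in themes.items():
        if c >= c_med and d >= d_med:
            labels[name] = "motor"                          # upper right
        elif c >= c_med:
            labels[name] = "basic and transversal"          # lower right
        elif d >= d_med:
            labels[name] = "highly developed and isolated"  # upper left
        else:
            labels[name] = "emerging or declining"          # lower left
    return labels

# Hypothetical themes with (centrality, density) values, for illustration only.
themes = {"T1": (3.0, 0.8), "T2": (2.5, 0.1), "T3": (0.5, 0.7), "T4": (0.4, 0.05)}
labels = classify_themes(themes)
for name, quadrant in labels.items():
    print(name, quadrant)
```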
Figure 2 shows the strategic diagram of the period 2007–2009, which has 27 themes. During this time span, OPTIMIZATION, CONTROL-THEORY, and IMAGE-CLASSIFICATION are some of the emerging topics. SEQUENCE-ANALYSIS-PROTEIN appears as the motor theme with the highest density (0.83) and centrality (3.21) values, followed by DATABASES-PROTEIN, with centrality equal to 2.97 and density equal to 0.43. This suggests that machine learning was widely used to study the protein molecule in that period, covering, for example, studies to predict the structure or function of that molecule.
Due to the great number of themes that appear in the strategic diagram, we decided to choose three themes, giving priority to those that may have a high impact application. Figure 3A shows the net for the theme DATABASES-PROTEIN, which has a density of 0.43, a centrality of 2.97 and a document count of 322. In this net, we can observe that an important topic is PROTEIN-STRUCTURE. Figure 3B presents the network for the theme DATASETS, which has a density of 0.06, a centrality of 1.37, a document count of 273 and strongly related topics such as CLASSIFIERS, SEMI-SUPERVISED-LEARNING and RANDOM-FOREST. Figure 3C shows the network for the theme IMAGE-CLASSIFICATION, which has a density of 0.11, a centrality of 1.3, a document count of 273 and relevant concepts, such as NEURAL-NETWORKS, SUPPORT-VECTOR-MACHINE (SVM) and IMAGE-PROCESSING.
Figure 3. Thematic networks for the period 2007–2009: (A) DATABASES-PROTEIN; (B) DATASETS; (C) IMAGE-CLASSIFICATION.
Figure 4 shows the strategic diagram of the period 2010–2012. The diagram has 35 themes. During this time span, VIRTUAL-REALITY, ROBOTS, and METADATA are some of the emerging themes, while CLASSIFICATION-ALGORITHM is one of the basic and transversal topics. AGED and PROTEIN-ANALYSIS appear as the motor themes with the highest density (0.72 and 0.58, respectively) and centrality (1.91 and 2.18, respectively) values. Other important themes are CHEMISTRY and GENETICS. This is a sign that, during this period, topics in biology and health began to become relevant in applied machine learning research.
From the second period, we selected three thematic networks. Figure 5A shows the net for the theme PROTEIN-ANALYSIS, which has a density of 0.58, a centrality of 2.18 and a document count of 229. This shows us that topics about proteins continue to be important in this period. Figure 5B presents the network for the theme CHEMISTRY, which is an emergent theme and has a density of 0.32, a centrality of 2.24 and a document count of 346. Figure 5C shows the network for the subject VIRTUAL-REALITY, which has a density of 0.07, a centrality of 0.98, a document count of 65 and is another emerging theme for the period 2010–2012.
Figure 5. Thematic networks for the period 2010–2012: (A) PROTEIN-ANALYSIS; (B) CHEMISTRY; (C) VIRTUAL-REALITY.
The strategic diagram of the period 2013–2014 is shown in Figure 6, which has 32 themes. During this time frame, SENSORS, FACE-RECOGNITION and COMMERCE are some of the emerging topics. The IMAGE-INTERPRETATION-COMPUTER-ASSISTED topic appears as the motor theme with the highest density (0.58) and centrality (2.61) values. This suggests that in this period there were studies on machine learning applied to different topics, such as sensor data, costs, and gesture recognition.
We selected three thematic networks from the third period (2013–2014). Figure 7A shows the network for the theme AMINO-ACID-SEQUENCE, which has a density of 0.42, a centrality of 1.91 and a document count of 276. Figure 7B presents the network for the theme MOBILE-DEVICES, which has a density of 0.21, a centrality of 0.76, a document count of 159 and strongly related topics such as HUMAN-COMPUTER-INTERACTION, E-LEARNING, and UBIQUITOUS-COMPUTING. Figure 7C shows the network for the theme COMMERCE, which has a density of 0.07, a centrality of 0.88, a document count of 205 and relevant concepts, such as SOCIAL-NETWORKING and COSTS.
Figure 7. Thematic networks for the period 2013–2014: (A) AMINO-ACID-SEQUENCE; (B) MOBILE-DEVICES; (C) COMMERCE.
Figure 8 shows the strategic diagram of the period 2015. The diagram has 25 themes. During this time span, SMARTPHONES, FACE-RECOGNITION and FORESTRY (label generated for algorithms such as Random-Forest or Decision-Trees) are some of the emerging themes. NUCLEAR-MAGNETIC-RESONANCE-IMAGING appears as the motor subject with the highest density (0.57) and centrality (2.81) values. Other important themes are COMPUTATIONAL-BIOLOGY and MEDICAL-IMAGING. We found that during this period biology and health were once again relevant topics in machine learning research.
From the fourth time frame, we selected three thematic networks. Figure 9A shows the network for the theme NUCLEAR-MAGNETIC-RESONANCE-IMAGING, which has a density of 0.57, a centrality of 2.81 and a document count of 251. Figure 9B presents the network for the theme COMPUTATIONAL-BIOLOGY, which is an emergent theme and has a density of 0.44, a centrality of 1.8 and a document count of 320. Figure 9C shows the network for the topic SMARTPHONES, which has a density of 0.07, a centrality of 0.13, a document count of 144 and is another emerging theme for the 2015 period.
Figure 9. Thematic networks for the period 2015: (A) NUCLEAR-MAGNETIC-RESONANCE-IMAGING; (B) COMPUTATIONAL-BIOLOGY; (C) SMARTPHONES.
The strategic diagram of the period 2016 is shown in Figure 10, which has 24 themes. During this time span, UBIQUITOUS-COMPUTING and COMMERCE are some of the emerging themes. The MIDDLE-AGED theme (which refers to applications developed for middle-aged people) appears as the motor topic with the highest density (0.76) and centrality (2.04) values.
We selected three thematic networks from the fifth period (2016). Figure 11A shows the network for the theme MEDICAL-IMAGING, which has a density of 0.26, a centrality of 1.45 and a document count of 312. Figure 11B presents the network for the theme INTRUSION-DETECTION, which has a density of 0.3, a centrality of 0.89 and a document count of 289. Figure 11C shows the network for the theme UBIQUITOUS-COMPUTING, which has a density of 0.07, a centrality of 1.07, a document count of 132 and relevant concepts, such as AUTOMATION, SMARTPHONES and the INTERNET.
Figure 11. Thematic networks for the period 2016: (A) MEDICAL-IMAGING; (B) INTRUSION-DETECTION; (C) UBIQUITOUS-COMPUTING.
Figure 12 shows the strategic diagram of the period 2017(Q2). The diagram has 16 themes. During this time span, HUMAN is the only emerging theme, while FORESTRY is one of the basic and transversal topics. This shows us the importance of algorithms such as Random-Forest or Decision-Trees during the last decade in the research on machine learning. MEDICAL-IMAGING appears as the motor theme with the highest density (0.51) and centrality (2.05) values. Once again, topics on health are relevant in machine learning research.
From 2017(Q2), we selected two thematic networks. Figure 13A shows the network for the theme MEDICAL-IMAGING, which has a density of 0.51, a centrality of 2.05 and a document count of 137. Figure 13B presents the network for the topic FORESTRY (label generated for algorithms such as Random-Forest or Decision-Trees), which has a density of 0.17, a centrality of 1.84 and a document count of 220.
Figure 13. Thematic networks for the period 2017(Q2): (A) MEDICAL-IMAGING; (B) FORESTRY.
Exposing emerging trends in the field of machine learning allows researchers to better understand how this research field has changed and evolved over time. One of the primary objectives of a science mapping analysis is to highlight trends and possible relationships between the relevant topics of a research field. SciMAT is a useful tool for carrying out such a study, as it identifies fundamental themes through cluster generation. The results of the present study show that machine learning is an important and widely studied scientific area. The trends indicate that machine learning applications will continue to be of interest to the scientific community. The use of machine learning to predict diseases such as cancer14 or Alzheimer’s disease15, and in fields such as biology16, rehabilitation systems17, commerce18, smartphones19 and ubiquitous computing20, will be a trend in the near future.
Dataset 1: Data obtained from Scopus and the SciMAT project file, to be opened in SciMAT. DOI: 10.5256/f1000research.15620.d212425 21
The authors are grateful to the Telematics Engineering Group (GIT) of the University of Cauca for scientific support and to the Innovacción Cauca project for the master's scholarship granted to J. Rincon-Patino.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Machine learning classification, prediction models using Artificial Intelligence, Statistical analysis, Computational Biology
Version 1: 07 Aug 18. Invited reviewers: 1, 2.