
REVIEW ARTICLE

A Practical Guide to Artificial Intelligence–Based Image Analysis in Radiology

Thomas Weikert, MD, Joshy Cyriac, MSc, Shan Yang, PhD, Ivan Nesic, MSc, Victor Parmar, MSc, and Bram Stieltjes, MD, PhD

Abstract: The use of artificial intelligence (AI) is a powerful tool for image analysis that is increasingly being evaluated by radiology professionals. However, because these methods were developed for the analysis of nonmedical image data, and because the data structure in radiology departments is not "AI ready," implementing AI in radiology is not straightforward. The purpose of this review is to guide the reader through the pipeline of an AI project for automated image analysis in radiology and thereby encourage its implementation in radiology departments. At the same time, this review aims to enable readers to critically appraise articles on AI-based software in radiology.

Key Words: artificial intelligence, machine learning, radiology, computer-assisted image processing, medical informatics, natural language processing

(Invest Radiol 2019;00: 00–00)

Received for publication May 21, 2019; and accepted for publication, after revision, June 22, 2019.
From the Department of Radiology, University Hospital Basel, Basel, Switzerland.
Correspondence to: Thomas Weikert, MD, Department of Radiology, University Hospital Basel, Petersgraben 4, 4031 Basel, Switzerland. E-mail: [email protected].
Conflicts of interest and sources of funding: none declared.
ISSN: 0020-9996/19/0000–0000
DOI: 10.1097/RLI.0000000000000600

Both radiology and artificial intelligence (AI) have a long history: X-rays were discovered by W.C. Röntgen in 1895,1 and the term "AI" was first introduced at a conference in Dartmouth in 1956.2 The vision of "artificial brains" is even older and was known to the 2 pioneers of computing, Alan Turing and Konrad Zuse.3

There is quite some imprecision in the use of the terms AI, machine learning (ML), and deep learning (DL). Artificial intelligence is an umbrella term encompassing any technique that enables computers to mimic human intelligence. Machine learning is a subclass of AI techniques and describes algorithms that self-improve upon exposure to new data.4 Deep learning is a subset of ML algorithms that make use of multilayered neural networks.5 Here, we will use AI as the umbrella term for consistency. Both radiology and computing have evolved to the point where an application of AI to radiology problems has become feasible: radiology has become digital, with all data stored in radiology information system (RIS)/picture archiving and communication system (PACS) archives, and AI has matured for automated image analysis. This development has been driven by increased computational power, cheaper data storage, and higher data transfer rates. It has also led to a pronounced increase in articles on AI in radiology in recent years, with a wide range of possible applications such as finding detection,6–11 segmentation,12–17 classification,18–21 and outcome prediction.22–24 Because the data stored in radiology departments are patient-centered and clinical data are at best present in an unstructured fashion, this data conglomerate as a whole is not "AI ready," and a multistep pipeline from data search and download to quantification of diagnostic performance is needed to implement AI projects.

Although it is not the aim of this article to describe ML techniques (for details, see Chartrand et al5 and Kohli et al25), we aim to guide readers through the pipeline required for an AI project in automated image analysis in radiology with 2 goals: first, to give an overview of what is needed to implement such projects and provide a reference for project planning, and second, to enable radiology professionals to ask relevant questions while reading articles on AI-based software in radiology.

DATA SEARCH AND RETRIEVAL

The first 2 steps of any AI project in radiology are data search and retrieval. They are hampered by the fact that data in today's radiology departments are organized in a patient-centered way. This is a logical consequence of a core task of radiology departments, that is, the creation of reports on examinations. Thus, finding a specific patient report and imaging history has the highest priority, and search options in an RIS are usually optimized to do so. Artificial intelligence projects in radiology, however, require a different view on the data: a study-centered view. Two typical search queries are: (1) "identify all radiographs of the wrist with corresponding reports in our RIS/PACS that either confirm or exclude a fracture," and (2) "identify all computed tomography [CT] pulmonary angiograms with corresponding reports performed at our institution between January 1, 2018, and January 6, 2019, on scanner type X with the question whether there is pulmonary embolism or not, and separate examinations containing pulmonary embolism from those that do not." While almost all RIS/PACS applications might handle the first query, processing complex queries such as the second one is not feasible. Finally, many clinical RIS/PACS systems do not allow for a batch-wise export of image and text data. Consequently, a reorganization of the data according to a study-centered view is needed.

The information required for such search queries is contained in the DICOM tags and the radiology reports.26 Four steps are needed to realize this data reorganization: (1) retrieval of DICOM tags (from the PACS) and radiology reports (from the RIS); (2) merging and storing these data in a database; (3) design of a tool for searching this database with an intuitively understandable user interface and swift full-text search capability on the report texts by indexing (Figure 1 displays an example RIS/PACS search engine [SE] developed and used at our institution); and (4) connectivity of the RIS/PACS-SE to the clinical RIS/PACS databases that allows for an easy export of data to tables (technical image information and reports) and to secondary image processing applications (image data). Note that a basic version of a study-centered RIS/PACS-SE requires only the readout of a limited number of standard DICOM tags (eg, study date [0008,0020], study description [0008,1030], modality [0008,0060]). The integration of further tags allows for more specific queries (eg, contrast/bolus agent [0018,0010]). However, missing data, inconsistent labeling over time, and the diverse usage of private data element tags are limiting factors. To overcome this problem, establishing a unified diction for all DICOM tags of interest and logging any changes over time are important measures to assure high data quality.

FIGURE 1. Exemplary, in-house developed RIS/PACS search platform allowing for flexible data queries, including full-text search on radiology reports.
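As a minimal illustration of step 1, the readout of standard DICOM tags can be scripted, for example, with the pydicom and pandas packages; the export path and tag selection below are hypothetical placeholders, not our institutional setup.

```python
# Minimal sketch: reading standard DICOM tags into a study-centered table
# (assumes pydicom and pandas; the export path is a hypothetical placeholder).
from pathlib import Path
import pandas as pd
import pydicom

rows = []
for path in Path("/data/pacs_export").rglob("*.dcm"):
    ds = pydicom.dcmread(path, stop_before_pixels=True)  # read tags only, skip pixel data
    rows.append({
        "study_uid": ds.get("StudyInstanceUID"),
        "study_date": ds.get("StudyDate"),                # tag (0008,0020)
        "study_description": ds.get("StudyDescription"),  # tag (0008,1030)
        "modality": ds.get("Modality"),                   # tag (0008,0060)
    })

# One row per study; this table can then be loaded into the search database.
df = pd.DataFrame(rows).drop_duplicates(subset="study_uid")
df.to_csv("study_index.csv", index=False)
```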
REPORT TEXT CURATION

Although unsupervised ML approaches are rarely used in radiology,27–29 most AI projects require a classification of examinations for compiling the training and testing datasets.


However, manually labeling data on an examination level (eg, "fracture" vs "no fracture") causes enormous costs for 2 reasons: first, AI projects benefit from large amounts of data, and second, labeling requires a certain level of medical expertise, necessitating the involvement of physicians. Automated retrospective extraction of labels from radiology reports is an answer to this challenge. However, despite efforts toward standardization in recent years,30–32 radiology reports still mostly consist of unstructured or semistructured texts. A simple text query is not sufficient due to the plethora of words describing the same findings and further complicating factors such as negations. Natural language processing (NLP) is a potential solution to this problem. It allows for a transfer of continuous texts to labels. The performance of NLP methods, which were historically based on manually drafted lexical rule systems, has improved dramatically in recent years by incorporating ML approaches such as support vector machines, random forests, and deep convolutional neural networks (DCNNs).33–35 This significantly reduces human input, as only a small subset of the text data has to be labeled manually. Although these methods achieve excellent sensitivity and specificity, a residual level of inaccuracy is inherent to NLP: Pons and colleagues reviewed NLP in radiology research projects and found sensitivities ranging from 71% to 98%.36 The acceptable amount of inaccuracy depends on the amount of data and the specific research question. That AI algorithms can handle a certain amount of wrong labels has been demonstrated, among others, by Annarumma and colleagues, who used NLP-derived labels of over 470,000 adult chest radiographs to successfully train 2 DCNNs for triaging based on imaging information.37 For many projects, the labeling of reports on the level of examinations is sufficient. However, labeling of additional information can be useful to identify subsets of examinations with specific features (eg, acute or chronic pulmonary embolism). As both manual labeling and NLP approaches require manual input, an easy-to-use labeling tool is important. Natural language processing can be implemented with the software packages mentioned below in the section Software and Hardware Requirements. Whenever a 100% correctly labeled dataset is warranted, manual labeling pre hoc or post hoc remains an option. For specific questions, the design, adaption, and use of NLP libraries such as peFinder38 or NILE39 is an alternative. Challenging factors for NLP in the context of radiology encompass the ambiguity of abbreviations, dealing with uncertainty and inconclusiveness of reports, and the presence of spelling errors.40
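As a hedged sketch of such an ML-based labeling step, a simple text classifier can be trained on a small, manually labeled subset of reports and then applied to the remaining reports; the example below uses scikit-learn with placeholder reports and label names and is not one of the cited NLP systems.

```python
# Minimal sketch: NLP-derived examination-level labels from report texts
# (TF-IDF features + linear SVM; reports and labels are placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "No evidence of fracture or dislocation.",
    "Transverse fracture of the distal radius.",
]                                    # small, manually labeled subset
labels = ["no_fracture", "fracture"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(reports, labels)

# Apply the trained classifier to unlabeled reports to label examinations.
print(clf.predict(["Nondisplaced fracture of the scaphoid."]))
```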
IMAGE DATA CURATION

There are 3 levels of detail for image labeling: whole image classification, object detection, and object segmentation (see Fig. 2).

The goal of whole image classification is to assign 1 class label per examination (eg, the class label "fracture" assigned to a radiograph). If this label can be extracted from the corresponding radiology reports, no further labeling of image data is required. This allows for the instant labeling of a huge number of examinations, as demonstrated by Annarumma et al.37

Object detection requires the assignment of 1 label per object, for example, lung tumors, and includes information on where an object is located. A method frequently used for object detection tasks is that of bounding boxes, that is, rectangular boxes with an assigned object class that contain an object of interest (Fig. 2A). They are also called "weak annotations" because there is no delineation of finding borders.41 Bounding boxes are provided in many open datasets for object recognition, for example, Open Images V4, with over 15 million boxes belonging to 600 everyday object classes such as "coffee cup,"42 and were also used in the field of radiology to detect pulmonary nodules on radiographs43 and colitis on abdominal CT scans.44 Usually, there is more than 1 object class. For example, if the aim is to predict TNM classes of lesions in patients with lung cancer, at least 4 (T) + 3 (N) + 1 (M) labels are needed. Before starting the annotation, the definition of meaningful, preferably mutually exclusive, labels is important. It should be kept in mind that AI algorithms need a sufficient number of examples for each category to be able to correctly map inputs to output classes.45


FIGURE 2. Three levels of image labels, exemplified on the CT image of an adenocarcinoma of the lung: (A) whole image labeling with the label “tumor”
assigned to the whole image, (B) object detection with a light blue bounding box containing the tumor, and (C) object segmentation with tumor
borders delineated in light blue.

How many examples are needed depends on the distinctness of the categories and the quality of the training data: for example, creating an algorithm separating completely black from completely white images, corresponding to high distinctness and perfect data quality, would need only a few examples. One can refer to the sample sizes used in previous studies with similar questions. The information on how many examples were eventually available for each category should always be reported.

For many medical imaging projects, be it for reasons of quantification (eg, volumes) or secondary analyses such as radiomics, the exact demarcation of objects is of interest. Therefore, full segmentation is needed. In semantic segmentation, every pixel (2-dimensional [2D]) or voxel (3-dimensional [3D]) of an image dataset is assigned to a class (eg, "lung tumor" and "background"). This results in a distinct definition of object boundaries. If the subclasses of each class are further distinguished (eg, tumor 1, tumor 2, etc), one speaks of instance segmentation.46 There is a range of segmentation methods, from fully manual over semiautomated to fully automated segmentation. For manual approaches, substantial intrarater and interrater variabilities have been reported.47,48 It is therefore crucial to quantify that variability, preferably by multiple annotators and the indication of a measure of reliability such as the Dice score.49 The newest generation of fully automated segmentation algorithms is based on U-shaped CNNs50; however, traditional image processing techniques such as region growing51 and other threshold-based methods52 are also frequently applied. Another method is that of patch-wise segmentation, in which each pixel/voxel of a target image is labeled by comparing the patch with the pixel/voxel at its center with a database of manually labeled patches.53 Semiautomated approaches require minimal user interaction, for example, a single click to mark a lung tumor, and segmentation is executed by an algorithm.

For the creation of segmentation masks, many open-source software programs are available, for example, 3D Slicer,54 ITK-SNAP,55 or MITK.56 There are 2 important general requirements for these tools: first, ease of use with multiple segmentation techniques at hand, as labeling/segmentation tasks are time-consuming, and second, easy export of labels/segmentation masks in formats that allow for interchangeability with secondary programs while maintaining orientation in space. The Neuroimaging Informatics Technology Initiative format is widely used for segmentation masks.
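For instance, a binary mask can be written to this NIfTI format with the nibabel package, storing the affine matrix so that orientation in space is maintained; the mask, its dimensions, and the affine below are placeholders.

```python
# Minimal sketch: saving a segmentation mask as NIfTI with nibabel so that
# spatial orientation is preserved (mask and affine are placeholders).
import numpy as np
import nibabel as nib

mask = np.zeros((512, 512, 120), dtype=np.uint8)  # binary mask, eg, 1 = "lung tumor"
mask[200:230, 240:270, 50:60] = 1

affine = np.eye(4)  # in practice, taken from the source image to keep orientation
nib.save(nib.Nifti1Image(mask, affine), "tumor_mask.nii.gz")

# Reloading preserves both the voxel data and the spatial metadata.
img = nib.load("tumor_mask.nii.gz")
print(img.shape, img.affine[0, 0])
```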
Although, in recent years, image processing tasks have become the domain of DCNNs, classification problems are also solved with other ML approaches such as random forest or support vector machines.
vector machines. tion of these measures depends on the actual use case: although one
might accept a sensitivity of 0.8 in research, this sensitivity is un-
acceptable for a clinically deployed worklist prioritization tool,
TRAINING, VALIDATION, AND TESTING flagging examinations that contain acute abdominal bleeding. A
As in other areas of ML, it is important to differentiate between related measure is the FP findings per case (FPF/c). It is valuable
the training dataset, the validation dataset, and the testing dataset. On for assessing the clinical relevance of an algorithm: a high sensitivity
the training dataset, the model fit is performed and the model “learns” is clinically useless when FPF/c is too high, as many FP cases ob-
to map input to output data. The validation dataset is used to repeatedly struct clinical workflows and lower acceptance among radiologists.
assess the model's performance during training phase and tune its There are many examples of algorithms with high FPF/c's of up to
hyperparameters. These are parameters set before training such as the 40 that would clearly fail in a clinical context.59–61 Summary
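One common way to implement this split, sketched below with scikit-learn on placeholder data, is to first hold out a test set and then run k-fold cross-validation on the remaining development data.

```python
# Minimal sketch: hold-out test set plus k-fold cross-validation with
# scikit-learn (features X and labels y are placeholders).
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = np.random.rand(300, 20), np.random.randint(0, 2, 300)

# Hold out a test set that is never touched during training or validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X_dev, y_dev, cv=5)  # 5-fold cross-validation
print("validation accuracy per fold:", scores)

# Only after model selection: a single evaluation on the untouched test set.
model.fit(X_dev, y_dev)
print("test accuracy:", model.score(X_test, y_test))
```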


STATISTICAL EVALUATION

As mentioned previously, the outputs of AI algorithms for image analysis in radiology are (1) classes on an examination/image level (eg, radiograph contains fracture: yes/no), (2) detections of objects (eg, lung tumor detected: yes/no), and (3) segmentation masks (eg, of a lung tumor). They require different evaluation methods.

1. The output of AI algorithms predicting a class of an examination is a score. It can be thought of as a measure of certainty that an input belongs to a target class, for example, that a radiograph shows a fracture. A threshold is then defined, and a class is attributed to each examination (eg, if the prediction score of a radiograph is greater than 0.5, the label "fracture: yes" is given). In the common case of a binary classification task, the results can be displayed in a confusion matrix (see Fig. 3A) describing true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) findings. Frequently derived performance metrics are sensitivity and specificity. Sensitivity, commonly called recall in data science, is calculated as TP/(TP + FN). Specificity is calculated as TN/(TN + FP). The interpretation of these measures depends on the actual use case: although one might accept a sensitivity of 0.8 in research, this sensitivity is unacceptable for a clinically deployed worklist prioritization tool flagging examinations that contain acute abdominal bleeding. A related measure is the number of FP findings per case (FPF/c). It is valuable for assessing the clinical relevance of an algorithm: a high sensitivity is clinically useless when the FPF/c is too high, as many FP cases obstruct clinical workflows and lower acceptance among radiologists. There are many examples of algorithms with high FPF/c values of up to 40 that would clearly fail in a clinical context.59–61 Summary performance measures are accuracy and the F1 score. Accuracy is calculated as (TP + TN)/(TP + TN + FP + FN). It describes the ratio of correct class predictions (TP + TN) to all class predictions. However, when the prevalence of classes is imbalanced, high accuracy should not be confused with high classification performance. Case in point: if 15% of CTs contain pulmonary embolism, an algorithm predicting "no embolism" in all cases reaches an accuracy of 85% but is worthless due to a sensitivity of 0%. A combined measure known as the F1 score, based on sensitivity and the positive predictive value (PPV), is more informative. It is calculated as 2 × (Sensitivity × PPV)/(Sensitivity + PPV), therefore taking into account both FPs and FNs. All results should be interpreted in light of the number of classes: a correct class prediction of 70% in a binary classification task, where random guessing would result in a correct prediction in 50% of cases, is far less impressive than the same prediction performance when the choice is between 50 classes.

FIGURE 3. A, Confusion matrix used to display the results of binary classification tasks, for example, whether a radiograph of the wrist contains a fracture (+) or not (−). Y-axis: prediction of the algorithm; X-axis: ground truth. B, Receiver operating characteristic curve (continuous black line) with area under the curve (area hatched in orange). The dotted line indicates an AUC of 0.5, which is equivalent to an algorithm that does not discriminate classes at all.

Another important method to assess classification performance is the area under the receiver operating characteristic curve (AUC under the ROC; see Fig. 3B). The ROC is a probability curve plotting sensitivity against 1 − specificity = FP/(TN + FP) and displays the class separation performance of an algorithm over various thresholds of a classifier. It nicely visualizes the tradeoff between sensitivity and specificity. The AUC ranges from 0 to 1, with 0.5 meaning that a model has no separation capability at all. An AUC of 0.5 to 0.7 is considered poor, whereas an AUC >0.9 (equivalent to <0.1) is indicative of outstanding discrimination performance.62 One should be aware of some weaknesses of the AUC: it summarizes the discrimination performance over all thresholds, of which most are not clinically relevant.63 Furthermore, the measure treats sensitivity and specificity as equally important.63 However, depending on the concrete question, FP (eg, a pulmonary vessel classified as a nodule) and FN findings (eg, a missed lung tumor) usually come with different costs. Nonetheless, due to its good comparability, the AUC is a standard performance measure in ML.64 Although the AUC under the ROC and the F1 score give a good overview of model performance, sensitivity and specificity allow for a more detailed analysis considering the specific costs of FP and FN findings for the given problem. Therefore, at least sensitivity, specificity, the F1 score, and the AUC under the ROC should be reported. Many of the discussed measures can also be used to assess multiclass classification tasks (accuracy, F1 score, sensitivity).65 However, there is no established generalization of the AUC under the ROC for multiclass problems, despite some proposals.66–68
2. The detection performance for individual findings, for example, lung tumors, can be described by sensitivity, FPF/c, PPV (also called precision in data science), and the F1 score. When the bounding box predicted by the algorithm shows sufficient overlap with the ground truth bounding box, a finding is considered to be detected. The established evaluation metric is the "Intersection over Union" (IoU), calculated as the area of overlap divided by the area of union of the 2 bounding boxes, with an IoU greater than 0.5 normally considered to indicate a good prediction. Besides the IoU threshold, at least sensitivity and FPF/c should be reported, as high sensitivity can be "bought" with a high FPF/c: a hypothetical algorithm assigning every voxel of a CT scan to the class "tumor" reaches a sensitivity of 100% for tumor detection, but at the cost of a practically infinite FPF/c, rendering the algorithm useless. Furthermore, the frequency of and reasons for FP as well as FN findings should be analyzed to identify systematic patterns that can help in the further development of the algorithm. Specificity and accuracy usually cannot be calculated for this method, as the content of the field "no finding, not detected" in the confusion matrix is unknown.
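Assuming axis-aligned boxes given as (x1, y1, x2, y2) coordinates, the IoU can be computed with a small helper function; this is an illustrative sketch, not code from the cited works.

```python
# Minimal sketch: Intersection over Union for two axis-aligned bounding boxes,
# each given as (x1, y1, x2, y2); a finding counts as detected if IoU > 0.5.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)      # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)       # overlap / union

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))     # partial overlap -> ~0.14
```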
3. Segmentation tasks result in masks (eg, of a tumor). To assess segmentation quality, they have to be compared with ground truth masks. Excellent quality of the ground truth masks is a prerequisite for a meaningful comparison. Therefore, it is crucial to agree on segmentation rules before starting with annotations to minimize interobserver variability (eg, should only the solid part of a lung tumor be segmented, or also its spiculations?). Furthermore, the masks should be saved in a format that allows for full interchangeability and automated analysis (eg, the Neuroimaging Informatics Technology Initiative format). Only then can ground truth masks be exploited for other projects, and automated measurement becomes feasible. For the statistical evaluation, there are many measures with varying sensitivity for different errors.49 Most prevalent in the literature are the Dice score, the Jaccard index (JI), and the volumetric similarity coefficient (VS). Dice score and JI are directly correlated overlap-based measures. VS is insensitive to overlap and should be used whenever the volume, not the exact position of an entity, matters (eg, when comparing the volume of an empyema in a patient over time). For evaluating the performance of a segmentation algorithm, overlap-based measures should be preferred. EvaluateSegmentation is a royalty-free tool for the evaluation of segmentation performance supporting many common 2D and 3D image formats and offering over 20 metrics, among them Dice score, JI, and VS (available at: https://github.com/Visceral-Project/EvaluateSegmentation).49
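For binary masks stored as arrays, the Dice score, Jaccard index, and volumetric similarity can be computed as sketched below, following the definitions compiled by Taha and Hanbury49; the masks are placeholders.

```python
# Minimal sketch: overlap- and volume-based measures for binary masks
# (definitions as compiled by Taha and Hanbury; masks are placeholders).
import numpy as np

def dice(a, b):
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def jaccard(a, b):
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def volumetric_similarity(a, b):
    # Insensitive to overlap: compares only the total volumes of the masks.
    return 1 - abs(a.sum() - b.sum()) / (a.sum() + b.sum())

pred = np.zeros((64, 64, 64), dtype=bool); pred[20:40, 20:40, 20:40] = True
gt   = np.zeros((64, 64, 64), dtype=bool); gt[25:45, 25:45, 25:45] = True
print(dice(pred, gt), jaccard(pred, gt), volumetric_similarity(pred, gt))
```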
SOFTWARE AND HARDWARE REQUIREMENTS

To run AI data analysis, a lean platform with easy access for multiple users is valuable. A free and widespread example is Jupyter Notebook, an open-source Web application that allows users to set up, modify, share, and run ML algorithms (http://www.jupyter.org).69 It supports more than 40 programming languages, including Python, currently the most prevalent programming language in the field of AI (Python Software Foundation, Wilmington, DE; http://www.python.org). Open-source frameworks for AI algorithms are TensorFlow (https://www.tensorflow.org), Keras (available at: https://github.com/fchollet/keras, and now also integrated in TensorFlow), PyTorch (https://pytorch.org), and scikit-learn (https://scikit-learn.org).70 A commercial alternative is MATLAB (The MathWorks, Inc, Natick, MA). Due to the great amount of computational power needed for the training of AI algorithms involving DCNNs, graphics processing units are advantageous. Tim Dettmers gives a good, hands-on introduction to the hardware needed for DL.71 Other, computationally less expensive ML techniques such as random forest can also run on standard CPUs.
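As a small illustration of working with these frameworks, the following is a minimal binary image classifier in Keras/TensorFlow; the architecture, input size, and metric are illustrative assumptions, not a recommendation for a specific clinical task.

```python
# Minimal sketch: a small DCNN for binary image classification in
# Keras/TensorFlow (architecture and input size are illustrative).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(256, 256, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # prediction score in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.summary()
```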
PECULIARITIES OF MEDICAL IMAGING DATA

There are important peculiarities of medical imaging data with implications for image-focused AI projects in radiology.


FIGURE 4. Simplified pipeline for the creation of an AI-based image analysis algorithm in radiology. After identification and download of report texts and image data from the local RIS/PACS archives, report texts are transferred to a curated database containing the information of interest. This can be achieved with manual labeling or with automated techniques such as those from NLP. Labeled image datasets are created by manual, semiautomatic, or automatic labeling. Then, a study dataset can be compiled and divided into training, validation, and testing datasets. Training and validation are sometimes run on the same dataset ("cross-validation"). The test dataset must not be used for training or validation in order to obtain realistic performance measures. Finally, the AI-based image analysis algorithm is ready for use.

First, medical image data are less reproducible compared with most nonmedical imaging data. Heterogeneity is introduced by multiple vendors of hardware and unstandardized scanner parameters that differ significantly from one radiology department to another and even between scanners in the same institution.72 Imaging artifacts further reduce reproducibility.

Second, the output categories are not as distinct as in nonmedical domains: a human reader has no difficulty differentiating a boat from a cat; in radiology, however, there is relevant interobserver variability.73 In many cases, it is even impossible to make a definite call (eg, differentiating a lymph node metastasis from an inflammatory lymph node in a positron emission tomography/CT without a histology report). This translates to a situation where the compilation of high-quality ground truth datasets in radiology is very challenging. Furthermore, it is demanding to create a sufficient amount of high-quality ground truth data in radiology due to the wide range of entities encountered in radiological practice and the enormous time and expertise needed for label creation.

Third, as a result of the above, datasets in radiology AI projects are smaller by orders of magnitude than datasets in nonmedical domains, for example, the ImageNet challenge with 1.4 million labels. There are potential remedies: data augmentation, the compilation of multicenter public datasets, and crowd-based approaches. Data augmentation is a common practice in radiology AI projects. It means that the original dataset is augmented with transformed variants of the original images (eg, by scale transformations or rotations).74 It is essential to specify if, or to what extent, augmented data were used and which method was used for their generation.
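A minimal sketch of such transformations, using the Keras preprocessing layers (TensorFlow 2.6 or later assumed) on a placeholder image batch, is shown below; the augmentation parameters would be tuned to the modality at hand.

```python
# Minimal sketch: data augmentation by random flips, rotations, and zooms
# using Keras preprocessing layers (input tensor is a placeholder).
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),   # rotations of up to ~18 degrees
    layers.RandomZoom(0.1),        # scale transformation of up to 10%
])

images = tf.random.uniform((8, 256, 256, 1))   # batch of placeholder "radiographs"
augmented = augment(images, training=True)     # new variants on every call
print(augmented.shape)
```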
The compilation of multicenter public datasets is promising, as the huge cost of compiling datasets in the medical domain can be shared and generalizability is increased. Important examples are the LIDC-IDRI database for lung nodules, containing 7371 marked pulmonary lesions on CTs,75 and the ChestX-ray8 dataset provided by the National Institutes of Health, containing more than 100,000 frontal-view radiographs with 8 disease labels.76 Crowd-labeling approaches have so far only rarely been implemented in the medical imaging domain, for example, for the annotation of lung cancer77 and the annotation of mitotic activity on histologic images of breast cancer.78
A fourth peculiarity of image data in radiology is that they are often in 3 dimensions (eg, CT, positron emission tomography/CT, magnetic resonance imaging) or even 4 dimensions (eg, dynamic cardiac imaging). This is especially relevant for DCNNs, which were initially designed for the processing of nonmedical 2D image data. Compatibility is often reached by greatly lowering the resolution (standard matrix: 256 × 256) and by treating 3D datasets as a series of subsequent 2D images, thereby losing information.

Fifth, there are regional differences in the incidence of findings. Although tuberculosis is a frequent finding in India, its incidence is negligible in most high-income countries.79 An algorithm trained on data from one region might therefore have serious trouble classifying medical data from another region. A possible remedy might be fine-tuning or retraining of algorithms on regional data.

THE FUTURE OF THE DATA BASIS OF AI PROJECTS IN RADIOLOGY

As of today, data acquisition in radiology departments, and in hospitals in general, is not AI ready. Data are stored in fragmented, mutually noncompatible IT systems. Radiology reports consist of unstructured or semistructured texts. Therefore, the pipeline of an AI project in radiology requires auxiliary steps such as the retrieval and reorganization of data and text mining. However, auxiliary steps add some imprecision. As data are the fuel of AI projects, we should strive for more sophisticated ways of acquiring and storing medical data to foster accessibility and data quality. Efforts toward structured reporting in recent years go in that direction.80,81 At the end of this process, we should have tools at hand that convert the information provided by radiologists into structured reports (via a report engine) and at the same time fill a databank. This would render many auxiliary tools such as NLP unnecessary and allow for an instant analysis of clean data. From the hospital's perspective, this would enable big data cross-departmental projects that are not feasible today. Regarding image analysis, we should push for prospective labeling of image data instead of retrospective labeling. During the normal reading process, the radiologist identifies and even measures many pathologic findings. At the moment, this costly label information is subsequently lost. To change this, we need to modify current PACS software to (a) ensure data accessibility by saving measurements and labels in an interchangeable format and (b) add tools that allow for quick annotation, for example, based on semiautomated object segmentation. When the right infrastructure is set up, it will quickly accumulate structured data, given the high number of examinations performed in radiology departments around the world.

These data will then serve as the foundation for a new generation of algorithms that enhance radiology reports with quantitative measurements, help radiologists not to miss lesions, and prioritize examinations with critical findings in worklists, thereby improving the quality provided by radiology departments.

CONCLUSIONS

In this article, we reviewed the current technology stack for AI projects in radiology. Figure 4 summarizes the steps that are required for a successful AI project within a radiology department. We also created awareness for the challenges and their potential solutions, and provided a short outlook on future perspectives. Thereby, we hope to encourage radiologists to launch their own AI image analysis projects and to help enable them to make an objective appraisal of articles on AI-based software in radiology.

REFERENCES

1. Röntgen WC. Über eine neue Art von Strahlen. Sitzungsberichte der Physikalisch-Medizinischen Gesellschaft zu Würzburg. 1895:2–16.


2. Buchanan BG. A (very) brief history of artificial intelligence. AI Mag. 2005;26:53–60.
3. McCorduck P. Machines Who Think. Wellesley, MA: A K Peters; 2004.
4. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science. 2015;349:255–260.
5. Chartrand G, Cheng PM, Vorontsov E, et al. Deep learning: a primer for radiologists. Radiographics. 2017;37:2113–2131.
6. Chilamkurthy S, Ghosh R, Tanamala S, et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet. 2018;392:2388–2396.
7. Santin M, Brama C, Théro H, et al. Detecting abnormal thyroid cartilages on CT using deep learning. Diagn Interv Imaging. 2019;100:251–257.
8. Winkel DJ, Heye T, Weikert TJ, et al. Evaluation of an AI-based detection software for acute findings in abdominal computed tomography scans: toward an automated work list prioritization of routine CT examinations. Invest Radiol. 2019;54:55–59.
9. Mannil M, von Spiczak J, Manka R, et al. Texture analysis and machine learning for detecting myocardial infarction in noncontrast low-dose computed tomography. Invest Radiol. 2018;53:338–343.
10. Kim Y, Lee KJ, Sunwoo L, et al. Deep learning in diagnosis of maxillary sinusitis using conventional radiography. Invest Radiol. 2019;54:7–15.
11. Zhang N, Yang G, Gao Z, et al. Deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine MRI. Radiology. 2019;291:606–617.
12. Zheng Y, Ai D, Mu J, et al. Automatic liver segmentation based on appearance and context information. Biomed Eng Online. 2017;16:16.
13. Zhu W, Huang Y, Zeng L, et al. AnatomyNet: deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Med Phys. 2019;46:576–589.
14. Couteaux V, Si-Mohamed S, Renard-Penna R, et al. Kidney cortex segmentation in 2D CT with U-Nets ensemble aggregation. Diagn Interv Imaging. 2019;100:211–217.
15. Perkuhn M, Stavrinou P, Thiele F, et al. Clinical evaluation of a multiparametric deep learning model for glioblastoma segmentation using heterogeneous magnetic resonance imaging data from clinical routine. Invest Radiol. 2018;53:647–654.
16. Lin L, Dou Q, Jin YM, et al. Deep learning for automated contouring of primary tumor volumes by MRI for nasopharyngeal carcinoma. Radiology. 2019;291:677–686.
17. Dreizin D, Zhou Y, Zhang Y, et al. Performance of a deep learning algorithm for automated segmentation and quantification of traumatic pelvic hematomas on CT. J Digit Imaging. 2019.
18. Nishio M, Sugiyama O, Yakami M, et al. Computer-aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and metastatic lung cancer at different image size using deep convolutional neural network with transfer learning. PLoS One. 2018;13:e0200721.
19. Dalmış MU, Gubern-Mérida A, Vreemann S, et al. Artificial intelligence-based classification of breast lesions imaged with a multiparametric breast MRI protocol with ultrafast DCE-MRI, T2, and DWI. Invest Radiol. 2019;54:325–332.
20. Nakagawa M, Nakaura T, Namimoto T, et al. Machine learning based on multiparametric magnetic resonance imaging to differentiate glioblastoma multiforme from primary cerebral nervous system lymphoma. Eur J Radiol. 2018;108:147–154.
21. Dunnmon JA, Yi D, Langlotz CP, et al. Assessment of convolutional neural networks for automated classification of chest radiographs. Radiology. 2019;290:537–544.
22. Sun R, Limkin EJ, Vakalopoulou M, et al. A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study. Lancet Oncol. 2018;19:1180–1191.
23. Meng Y, Zhang Y, Dong D, et al. Novel radiomic signature as a prognostic biomarker for locally advanced rectal cancer. J Magn Reson Imaging. 2018;48:605–614.
24. Buizza G, Toma-Dasu I, Lazzeroni M, et al. Early tumor response prediction for lung cancer patients using novel longitudinal pattern features from sequential PET/CT image scans. Phys Med. 2018;54:21–29.
25. Kohli M, Prevedello LM, Filice RW, et al. Implementing machine learning in radiology practice and research. Am J Roentgenol. 2017;208:754–760.
26. NEMA. DICOM PS3.1 2019a–Introduction and Overview. 2019. Available at: http://dicom.nema.org/medical/dicom/current/output/pdf/part01.pdf. Accessed April 16, 2019.
27. Hussein S, Kandel P, Bolan CW, et al. Lung and pancreatic tumor characterization in the deep learning era: novel supervised and unsupervised learning approaches. IEEE Trans Med Imaging. 2019:1–11.
28. Lee CC, Yang HC, Lin CJ, et al. Intervening nidal brain parenchyma and risk of radiation-induced changes after radiosurgery for brain arteriovenous malformation: a study using an unsupervised machine learning algorithm. World Neurosurg. 2019.
29. Li H, Galperin-Aizenberg M, Pryma D, et al. Unsupervised machine learning of radiomic features for predicting treatment response and overall survival of early stage non-small cell lung cancer patients treated with stereotactic body radiation therapy. Radiother Oncol. 2018;129:218–226.
30. Larson DB. Strategies for implementing a standardized structured radiology reporting program. Radiographics. 2018;38:1705–1716.
31. Shea LAG, Towbin AJ. The state of structured reporting: the nuance of standardized language. Pediatr Radiol. 2019;49:500–508.
32. Herts BR, Gandhi NS, Schneider E, et al. How we do it: creating consistent structure and content in abdominal radiology report templates. Am J Roentgenol. 2019;212:490–496.
33. Brown AD, Kachura JR. Natural language processing of radiology reports in patients with hepatocellular carcinoma to predict radiology resource utilization. J Am Coll Radiol. 2019;16:840–844.
34. Chen MC, Ball RL, Yang L, et al. Deep learning to classify radiology free-text reports. Radiology. 2018;286:845–852.
35. Li AY, Elliot N. Natural language processing to identify ureteric stones in radiology reports. J Med Imaging Radiat Oncol. 2019. doi:10.1111/1754-9485.12861.
36. Pons E, Braun LM, Hunink MG, et al. Natural language processing in radiology: a systematic review. Radiology. 2016;279:329–343.
37. Annarumma M, Withey SJ, Bakewell RJ, et al. Automated triaging of adult chest radiographs with deep artificial neural networks. Radiology. 2019;291:272.
38. Chapman BE, Lee S, Kang HP, et al. Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. J Biomed Inform. 2011;44:728–737.
39. Yu S, Cai T. A short introduction to NILE. Available at: https://arxiv.org/pdf/1311.6063.pdf. Accessed May 3, 2019.
40. Iroju OG, Olaleke JO. A systematic review of natural language processing in healthcare. Int J Inf Technol Comput Sci. 2015;8:44–50.
41. Rajchl M, Koch LM, Ledig C, et al. Employing weak annotations for medical image analysis problems. Available at: http://labelme.csail.mit.edu/. Accessed April 29, 2019.
42. Kuznetsova A, Rom H, Alldrin N, et al. The Open Images Dataset V4: unified image classification, object detection, and visual relationship detection at scale. 2018. Available at: http://arxiv.org/abs/1811.00982. Accessed April 29, 2019.
43. Pesce E, Withey S, Ypsilantis PP, et al. Learning to detect chest radiographs containing pulmonary lesions using visual attention networks. 2019. Available at: https://arxiv.org/pdf/1712.00996.pdf. Accessed April 29, 2019.
44. Wang S, Zhou M, Liu ZZ, et al. Central focused convolutional neural networks: developing a data-driven model for lung nodule segmentation. Med Image Anal. 2017;40:172–183.
45. Figueroa RL, Zeng-Treitler Q, Kandula S, et al. Predicting sample size required for classification performance. BMC Med Inform Decis Mak. 2012;12:8.
46. Romera-Paredes B, Torr PHS. Recurrent instance segmentation. Available at: https://arxiv.org/pdf/1511.08250.pdf. Accessed May 9, 2019.
47. Yu HJ, Chang A, Fukuda Y, et al. Comparative study of intra-operator variability in manual and semi-automatic segmentation of knee cartilage. Osteoarthr Cartil. 2016;24:S296–S297.
48. Saha A, Grimm LJ, Harowicz M, et al. Interobserver variability in identification of breast tumors in MRI and its implications for prognostic biomarkers and radiogenomics. Med Phys. 2016;43(8 Part 1):4558–4564.
49. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015;15:29.
50. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. 2015. Available at: https://arxiv.org/pdf/1505.04597.pdf. Accessed June 20, 2019.
51. Adams R, Bischof L. Seeded region growing. 1994. Available at: https://pdfs.semanticscholar.org/db44/31b2a552d0f3d250df38b2c60959f404536f.pdf. Accessed May 9, 2019.
52. Al-amri SS, Kalyankar NV, Khamitkar SD. Image segmentation by using threshold techniques. 2010. Available at: http://arxiv.org/abs/1005.4020. Accessed May 9, 2019.
53. Mechrez R, Goldberger J, Greenspan H. Patch-based segmentation with spatial consistency: application to MS lesions in brain MRI. Int J Biomed Imaging. 2016;2016:1–13.
54. Fedorov A, Beichel R, Kalpathy-Cramer J, et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn Reson Imaging. 2012;30:1323–1341.
55. Yushkevich PA, Piven J, Hazlett HC, et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage. 2006;31:1116–1128.
56. Goch C, Metzger J, Nolden M. Tutorial: medical image processing with MITK introduction and new developments. In: Maier-Hein KH, Deserno TM, Handels H, et al, eds. Bildverarbeitung für die Medizin 2017. Informatik aktuell. Berlin/Heidelberg, Germany: Springer Vieweg; 2017.
57. Probst P, Boulesteix AL, Bischl B. Tunability: importance of hyperparameters of machine learning algorithms. 2018. Available at: https://arxiv.org/pdf/1802.09596.pdf. Accessed May 14, 2019.
58. Dobbin KK, Simon RM. Optimally splitting cases for training and testing high dimensional classifiers. BMC Med Genomics. 2011;4:31.
59. Özkan H, Osman O, Şahin S, et al. A novel method for pulmonary embolism detection in CTA images. Comput Methods Programs Biomed. 2014;113:757–766.
60. Zhou C, Chan HP, Sahiner B, et al. Computer-aided detection of pulmonary embolism in computed tomographic pulmonary angiography (CTPA): performance evaluation with independent data sets. Med Phys. 2009;36:3385–3396.
61. Liang J, Bi J. Computer aided detection of pulmonary embolism with tobogganing and multiple instance classification in CT pulmonary angiography. Inf Process Med Imaging. 2007;20:630–641.
62. Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. Hoboken, NJ: Wiley; 2013.
63. Halligan S, Altman DG, Mallett S. Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: a discussion and proposal for an alternative approach. Eur Radiol. 2015;25:932.
64. Vanderlooy S, Hüllermeier E. A critical analysis of variants of the AUC. Mach Learn. 2008;72:247–262.
65. Hossin M, Sulaiman MN. A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process. 2015;5:1–11.
66. Everson RM, Fieldsend JE. Multi-class ROC analysis from a multi-objective optimisation perspective. 2013. Available at: http://hdl.handle.net/10871/15243. Accessed May 6, 2019.
67. Landgrebe TC, Duin RP. Approximating the multiclass ROC by pairwise analysis. 2007. Available at: www.elsevier.com/locate/patrec. Accessed May 6, 2019.
68. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45:171–186.
69. Kluyver T, Ragan-Kelley B, Pérez F, et al. Jupyter Notebooks – a publishing format for reproducible computational workflows. 2016. Available at: https://nbviewer.jupyter.org/. Accessed April 17, 2019.
70. Buitinck L, Louppe G, Blondel M, et al. API design for machine learning software: experiences from the scikit-learn project. 2013. Available at: http://arxiv.org/abs/1309.0238. Accessed April 17, 2019.
71. Dettmers T. A full hardware guide to deep learning. Available at: https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/. Accessed April 29, 2019.
72. Lubner MG, Smith AD, Sandrasegaran K, et al. CT texture analysis: definitions, applications, biologic correlates, and challenges. Radiographics. 2017;37:1483–1503.
73. Ambinder EB, Mullen LA, Falomo E, et al. Variability in individual radiologist BI-RADS 3 usage at a large academic center: what's the cause and what should we do about it? Acad Radiol. 2019;26:915–922.
74. Roth HR, Lu L, Liu J, et al. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Trans Med Imaging. 2016;35:1170–1181.
75. Armato SG, McLennan G, Bidaut L, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys. 2011;38:915–931.
76. Wang X, Peng Y, Lu L, et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. 2017. Available at: http://arxiv.org/abs/1705.02315. Accessed November 30, 2018.
77. Kalpathy-Cramer J, Beers A, Mamonov A, Ziegler E, et al. Crowds Cure Cancer: data collected at the RSNA 2017 annual meeting. The Cancer Imaging Archive. Available at: https://wiki.cancerimagingarchive.net/display/DOI/Crowds+Cure+Cancer%3A+Data+collected+at+the+RSNA+2017+annual+meeting. Accessed July 20, 2019.
78. Albarqouni S, Baur C, Achilles F, et al. AggNet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging. 2016;35:1313–1321.
79. World Health Organization. Global tuberculosis report 2018. Available at: https://www.who.int/tb/publications/global_report/en/. Accessed April 17, 2019.
80. Kahn CE, Langlotz CP, Burnside ES, et al. Toward best practices in radiology reporting. Radiology. 2009;252:852–856.
81. European Society of Radiology (ESR). ESR paper on structured reporting in radiology. Insights Imaging. 2018;9:1–7.
