Abstract
Detection of low contrast liver metastases varies between radiologists. Training may improve performance for lower-performing readers and reduce inter-radiologist variability. We recruited 31 radiologists (15 trainees, 8 non-abdominal staff, and 8 abdominal staff) to participate in four separate reading sessions: pre-test, search training, classification training, and post-test. In the pre-test, each radiologist interpreted 40 liver CT exams containing 91 metastases, circumscribing suspected hepatic metastases under eye tracker observation and rating confidence. In search training, radiologists interpreted a separate set of 30 liver CT exams while receiving eye tracker feedback and coaching to increase use of coronal reformations, interpretation time, and use of liver windows. In classification training, radiologists interpreted up to 100 liver CT image patches, most containing benign or malignant lesions, and compared their annotations to ground truth. The post-test was identical to the pre-test. Between pre- and post-test, sensitivity increased by 2.8% (p = 0.01) but AUC did not change significantly. Missed metastases were classified with the eye tracker as search errors (<2 seconds gaze time) or classification errors (>2 seconds gaze time). Out of 2775 possible detections, search errors decreased (10.8% to 8.1%; p < 0.01) but classification errors were unchanged (5.7% vs 5.7%). When stratified by difficulty, easier metastases showed larger reductions in search errors: for metastases with average sensitivity of 0–50%, 50–90%, and 90–100%, reductions in search errors were 16%, 35%, and 58%, respectively. The training program studied here may improve radiologist performance by reducing search errors, but not classification errors.
Keywords: Eye tracking, reader performance, low contrast detectability, reader training
1. INTRODUCTION
We have previously studied and reported on inter-reader (or inter-radiologist) variability in liver metastasis detection (1–4). At the 2021 SPIE Medical Imaging meeting, we reported differences in area under the curve (AUC) and sensitivity stemming from reader subspecialization (with abdominal subspecialists exhibiting higher AUC than trainees) and workstation variables (with longer interpretation time, greater use of liver windows, and greater use of coronal reformations associated with higher sensitivity) (4). We also categorized missed lesions as search or classification errors, finding that about a third of missed metastases were gazed at for <0.5 seconds, a third for between 0.5 and 2 seconds, and a third for >2 seconds. That study characterized associations between reader behavior and performance but did not test whether intervening on the associated variables could improve performance. The purpose of this work was to test such a training program, which coached readers to increase behaviors associated with higher sensitivity and provided targeted practice on classification, and to determine whether it improved sensitivity and AUC.
2. METHODS
2.1. Overview
Thirty-one radiologist readers were recruited from our institution after IRB approval: 15 trainees (senior residents or fellows), 8 abdominal subspecialists employed as staff radiologists at our institution, and 8 staff radiologists subspecialized outside abdominal imaging. Readers participated in four sessions, each lasting approximately 4 hours: pre-test, search training, classification training, and post-test. The order of the search training and classification training was not controlled, and the total time between pre-test and post-test was between two and four weeks.
2.2. Pre-test and post-test
These two sessions were identical. After receiving instructions, readers were calibrated on the eye tracker and asked to interpret 40 contrast-enhanced abdominal CT exams, marking all suspected liver metastases with a circumscription tool and rating confidence of malignancy between 0 and 100. Readers were told that scores of 0 were not used in scoring but were provided as a convenience for marking obviously benign lesions (e.g., cyst, hemangioma, scar). A score of 100 implied absolute confidence of malignancy, while low scores implied either a lesion that was probably benign or a possible malignancy that could not be distinguished from noise. Ground truth was determined by a separate radiologist not participating in this study, using either histopathology or progression compared with previous and/or follow-up imaging exams, and was never communicated to readers.
2.3. Search training
In search training, readers interpreted a set of 30 liver CT exams and were immediately informed of the correct results. After each interpretation, readers also viewed a dashboard that displayed three variables of interest: (1) interpretation time, (2) use of the coronal stack, and (3) use of liver windows. We tracked these variables because prior studies showed an association with increased sensitivity. When readers were informed of the correct results, they also viewed a color wash display showing where in the image they had looked, as shown in Figure 1. This color wash display was provided in the hope that readers might self-adjust: if a reader missed a lesion because they never looked at that portion of the reconstruction, they might change their reading habits in the future.
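For illustration only, the sketch below shows one way such a color wash can be produced: fixation durations are accumulated into a 2D map, smoothed, and overlaid semi-transparently on the CT slice. The function and variable names (gaze_color_wash, fixations, sigma_px) are assumptions for this sketch and do not reproduce the feedback software used in the study.

```python
# Minimal sketch of a gaze "color wash" overlay, assuming fixations are
# available as (x, y, duration_s) tuples in image coordinates. Illustrative
# only; not the study's feedback software.
import numpy as np
from scipy.ndimage import gaussian_filter
import matplotlib.pyplot as plt

def gaze_color_wash(ct_slice, fixations, sigma_px=15, alpha=0.4):
    """Overlay cumulative gaze time on a 2D CT slice.

    ct_slice  : 2D array (grayscale CT image)
    fixations : iterable of (x, y, duration_s) fixation records
    sigma_px  : Gaussian smoothing radius, roughly matching foveal extent
    """
    heat = np.zeros(ct_slice.shape, dtype=float)
    for x, y, dur in fixations:
        r, c = int(round(y)), int(round(x))
        if 0 <= r < heat.shape[0] and 0 <= c < heat.shape[1]:
            heat[r, c] += dur                      # accumulate gaze time per pixel
    heat = gaussian_filter(heat, sigma=sigma_px)   # spread into a smooth "wash"
    plt.imshow(ct_slice, cmap="gray")
    plt.imshow(heat, cmap="hot", alpha=alpha)      # semi-transparent overlay
    plt.axis("off")
    plt.show()
```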
2.4. Classification training
In classification training, readers interpreted a batch of up to 100 liver CT “patches.” A patch was a small, constrained volume of interest, so that rather than interpreting an entire CT exam, readers only needed to direct their attention to a small portion of it. The motivation for patch training was that, in a prior study, abdominal subspecialists and trainees differed in their area under the curve (AUC) scores but not in sensitivity or false positive rates, because abdominal subspecialists assigned better ratings to their circumscriptions (higher confidence to true positives, lower confidence to false positives); we felt that the continued, routine practice of subspecialists in differentiating benign and malignant lesions led to this difference. By constraining attention to only a small region, we hoped to create an intensive, targeted training environment in which readers could rapidly practice the critical step of differentiating lesion malignancy. After interpreting each patch, readers were provided with ground truth and viewed a teaching point.
2.5. Eye tracking
Except for the classification training, readers were monitored using the EyeLink Portable Duo (SR Research). Prior to the start of each session, readers were calibrated, and the calibration was repeated until the average error was better than 1 degree. Missed metastases were classified as search errors if the total gaze time within 40 pixels of the metastasis in either the axial or coronal stacks was less than 2 seconds; otherwise, the missed metastasis was a classification error. Note that our definition of a search error includes relatively long durations that some authors would call a “recognition error” (the reader foveates near the lesion but does not recognize it).
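The error-labeling rule can be summarized as a small decision function. The sketch below is a minimal illustration assuming gaze samples pooled across the axial and coronal stacks; the names and sampling details are assumptions, not the study's analysis code.

```python
# Minimal sketch of the search- vs classification-error rule described above.
# Assumes gaze samples pooled over axial and coronal stacks; names are illustrative.
import math

GAZE_RADIUS_PX = 40       # "near the lesion" distance threshold, in pixels
SEARCH_THRESHOLD_S = 2.0  # cumulative gaze time separating the two error types

def classify_missed_metastasis(gaze_samples, lesion_xy, sample_dt):
    """Label a missed metastasis as a 'search' or 'classification' error.

    gaze_samples : iterable of (x, y) gaze positions in display pixels
    lesion_xy    : (x, y) center of the missed metastasis, same coordinates
    sample_dt    : duration of one gaze sample in seconds (e.g., 1/500 for 500 Hz)
    """
    lx, ly = lesion_xy
    gaze_time = sum(
        sample_dt
        for x, y in gaze_samples
        if math.hypot(x - lx, y - ly) <= GAZE_RADIUS_PX
    )
    return "search" if gaze_time < SEARCH_THRESHOLD_S else "classification"
```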
2.6. Statistical analysis
Readers were scored along two dimensions: per-lesion sensitivity and the area under the jackknife alternative free-response receiver operating characteristic curve (AUC of JAFROC, or simply AUC). We selected JAFROC from several possible figures of merit because there is no single best extension of the ROC curve to free-response tasks in which multiple lesions are possible per exam. Briefly, the JAFROC curve plots per-lesion sensitivity against per-exam specificity as the confidence score threshold of accepted markings (1–100) is varied. An exam was counted as a false positive if its most confident false positive marking was above the confidence score threshold, and false positives were calculated from both exams with metastases (32 of 40) and exams without metastases (8 of 40).
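As a rough illustration of this figure of merit, the sketch below sweeps the 1–100 confidence threshold, plotting the fraction of lesions marked above threshold against the fraction of exams whose most confident false positive exceeds the threshold, and integrates the resulting curve. It is a simplified stand-in under these assumptions, not the JAFROC software used in the study; input names are illustrative.

```python
# Simplified sketch of an AFROC-style curve: per-lesion sensitivity versus the
# fraction of exams containing an accepted false positive, swept over the
# 1-100 confidence threshold. Not the study's JAFROC implementation.
import numpy as np

def afroc_curve(lesion_scores, exam_max_fp_scores, thresholds=range(1, 101)):
    """lesion_scores      : highest confidence assigned to each true metastasis
                            (0 if never circumscribed)
       exam_max_fp_scores : highest false-positive confidence in each exam
                            (0 if the exam had no false positives)"""
    lesion_scores = np.asarray(lesion_scores, dtype=float)
    exam_max_fp_scores = np.asarray(exam_max_fp_scores, dtype=float)
    llf = np.array([np.mean(lesion_scores >= t) for t in thresholds])       # per-lesion sensitivity
    fpf = np.array([np.mean(exam_max_fp_scores >= t) for t in thresholds])  # false-positive exam fraction
    # Thresholds ascending run from lax to strict operating points; reverse so
    # FPF increases, anchor at (0, 0) and (1, 1), and apply the trapezoid rule.
    fpf = np.concatenate(([0.0], fpf[::-1], [1.0]))
    llf = np.concatenate(([0.0], llf[::-1], [1.0]))
    auc = float(np.sum((fpf[1:] - fpf[:-1]) * (llf[1:] + llf[:-1]) / 2))
    return fpf, llf, auc
```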
There were 91 metastases in total. Because of an implementation error, however, data from some cases were not saved. To avoid confounding effects of missing data (e.g., if easier lesions went preferentially missing in the post-test, an apparent reduction in performance would be seen), we dropped cases symmetrically on a per-reader basis: if a case was absent for a reader in one session, the corresponding data in the other session were also skipped. In total, we analyzed 2775 metastases out of a possible 2821 (31 readers × 91 metastases), with less than 2% of the data missing.
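A minimal sketch of this symmetric exclusion, assuming results are keyed by (reader, case) pairs (a hypothetical structure, not the study's actual records):

```python
# Keep a (reader, case) pair only if it was saved in BOTH sessions, so that
# pre- and post-test comparisons are made over identical data.
def paired_results(pre_results, post_results):
    """pre_results, post_results: dicts mapping (reader_id, case_id) -> result."""
    shared = pre_results.keys() & post_results.keys()
    return ({k: pre_results[k] for k in shared},
            {k: post_results[k] for k in shared})
```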
3. RESULTS
Figure 2 compares the sensitivity and AUC of readers in the pre-test and post-test, stratified by reader experience subgroup; for the calculation of sensitivity, any circumscription with a positive confidence rating was counted as a detection. Abdominal subspecialists had both higher sensitivity and higher AUC than trainees. Sensitivity improved after training (Wilcoxon signed-rank test, p = 0.01), but AUC showed no evidence of change (p = 0.36) and indeed trended downward. The number of false positives increased after training (p = 0.005).
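A minimal sketch of the paired comparison, assuming one sensitivity value per reader in each session; scipy.stats.wilcoxon implements the signed-rank test named above, and the function name here is illustrative.

```python
# Paired pre/post comparison of per-reader sensitivity with the Wilcoxon
# signed-rank test.
import numpy as np
from scipy.stats import wilcoxon

def compare_pre_post(pre_sens, post_sens):
    """pre_sens, post_sens: per-reader sensitivities, same reader order."""
    pre = np.asarray(pre_sens, dtype=float)
    post = np.asarray(post_sens, dtype=float)
    stat, p_value = wilcoxon(post, pre)          # paired, nonparametric
    mean_change = float(np.mean(post - pre))     # average change in sensitivity
    return mean_change, p_value
```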
At face value, the improvement in sensitivity was modest. However, although our cases were selected to preferentially contain difficult metastases, the metastases themselves were heterogeneous in difficulty. Metastases had a median detection rate of 95% (interquartile range, 84% to 98%) despite a mean detection rate of 85%; the relatively low mean sensitivity (compared to the median) is driven by a small number of very difficult metastases.
Table 1 contrasts the rates of search errors (<2 seconds gaze time) and classification errors (>2 seconds gaze time) for missed metastases, with metastases categorized according to detection rate quartiles. There were a total of 2775 possible detections, of which 17% (458/2775) were missed in the pre-test and 14% (381/2775) were missed in the post-test. Using the 2-second gaze threshold, search errors decreased, especially in the second quartile of metastases, but there was no difference in classification errors. One interpretation of our results is that training made the readers more thorough, improving sensitivity but also increasing the number of false positives, with a net effect that is neutral on AUC.
Table 1. Rates of missed metastases by error type and by metastasis difficulty quartile, in the pre-test and post-test.

| | Any metastasis | First (hardest) quartile | Second quartile | Third and fourth quartiles |
|---|---|---|---|---|
| Search, pre-test | 10.8% | 35% | 5.4% | 1.2% |
| Search, post-test | 8.1% | 29% | 1.6% | 0.8% |
| Classification, pre-test | 5.7% | 16% | 5.4% | 0.7% |
| Classification, post-test | 5.7% | 15% | 5.4% | 0.9% |
Figure 3 shows examples of three metastases in our study, chosen to represent a range of difficulties. Each metastasis lies at the center of its respective quartile (for example, the leftmost metastasis is at the 12.5th percentile of difficulty). The Q1 metastasis, subtle and difficult to detect, had 20 search and 0 classification errors in the pre-test, but 9 search and 2 classification errors in the post-test. The Q2 metastasis had 3 search and 1 classification error in the pre-test, but 0 and 2, respectively, in the post-test. The Q3 metastasis was not difficult to find and had one classification error each in the pre-test and post-test. The Q4 metastasis was always found.
4. CONCLUSIONS
A training program that incorporated both search training and classification training improved sensitivity but did not improve AUC. Improvements in sensitivity were driven by reductions in search error rates, not classification error rates; the overall relative reduction in the search error rate was 25%.
REFERENCES
1. Fletcher JG, Fidler JL, Venkatesh SK, Hough DM, Takahashi N, Yu L, et al. Observer performance with varying radiation dose and reconstruction methods for detection of hepatic metastases. Radiology. 2018;289(2):455–64.
2. Fletcher JG, Yu L, Fidler JL, Levin DL, DeLone DR, Hough DM, et al. Estimation of observer performance for reduced radiation dose levels in CT: eliminating reduced dose levels that are too low is the first step. Academic Radiology. 2017;24(7):876–90.
3. Fletcher JG, Fidler JL, Sheedy S, Hough DM, Froemming A, Venkatesh S, Takahashi N, McMenomy B, Wells M, Barlow J, Kim B, Goenka A, Michalak G, Yu L, Drees T, Leng S, Holmes D, Toledano A, Carter R, McCollough CH. Multireader Multicase Observer Performance for Detection of Hepatic Metastases at Contrast-enhanced CT: Lowest Radiation Dose Levels that Insure Performance. RSNA; Chicago, IL: 2017.
4. Hsieh SS, Inoue A, Pillai PS, Gong H, Holmes DR III, Cook DA, et al., editors. A 25-reader performance study for hepatic metastasis detection: lessons from unsupervised learning. Medical Imaging 2022: Physics of Medical Imaging; 2022: SPIE.