Abstract
Neonatal endotracheal intubation (ETI) is an important, complex resuscitation skill that requires a significant amount of practice to master. Current ETI practice is conducted on a physical manikin and relies on assessment by expert instructors. Since training opportunities are limited by the availability of expert instructors, an automatic assessment model is highly desirable. However, automating ETI assessment is challenging because it requires identifying crucial features, providing accurate evaluations, and offering valuable feedback to trainees. In this paper, we propose a dilated Convolutional Neural Network (CNN) based ETI assessment model that automatically provides an overall score and performance feedback to pediatric trainees. The proposed model takes the kinematic multivariate time-series (MTS) data captured by the manikin-based augmented reality (AR) ETI system that we developed, automatically extracts the crucial features of the captured data, and outputs an overall score. Furthermore, a visualization based on class activation mapping (CAM) automatically identifies the motions that have a significant impact on the overall score, providing useful feedback to trainees. Our model achieves 92.2% average classification accuracy under Leave-One-Subject-Out cross-validation (LOOCV).
I. INTRODUCTION
Infants in the neonatal intensive care unit (NICU) or delivery room have a high risk of adverse events [6], [7]. Neonatal endotracheal intubation (ETI) is an essential resuscitation skill [1] in which pediatric trainees must therefore become proficient [4]. Manikin-based ETI training is an essential regimen for pediatric trainees to gain a level of proficiency before clinical exposure, and the efficiency and quality of ETI training primarily determine the success rate of ETI [14]. However, limited practice opportunities, resulting from a shortage of expert instructors, hamper trainees' ability to achieve and maintain proficiency. Therefore, there is a pressing need for an automated and accurate assessment to improve the efficiency and quality of ETI training.
Sensor-based, computer-assisted simulations, which solve various medical training problems [16], have been used to parameterize the ETI procedure from sensor data for motion analysis. For example, electromagnetic (EM) sensors [13] have been used to capture the motion of the laryngoscope and convert it into global movement features designed by ETI experts, such as movement time, path length, and curvature. These features are then used in statistical analyses to distinguish the motion patterns of novices and experts. While global movement features can provide some indication of ETI skill, such manually designed features cannot fully and precisely reflect the complex, time-varying motions and therefore fail to provide an adequate performance assessment. Carlson et al. [2] explored several machine learning methods to predict the ratio of glottic opening from captured videolaryngoscopic images, and Matava et al. [11] applied a convolutional neural network (CNN) to detect the location of the glottis. Although these methods show promising predictability for detecting the glottic opening, the pooling they use (e.g., max pooling) is prone to discarding meaningful samples when extracting coarse-grained features, because it downsamples the input data. Furthermore, only a few of these methods have been applied to automated ETI performance assessment [18]. Designing an automated assessment system is further complicated by the need to provide useful feedback to trainees, as the system must identify the critical factors (e.g., time and motion) contributing to the overall evaluation.
In this paper, we propose a novel neonatal ETI assessment model based on a dilated CNN that automatically assesses trainees' performance by providing an overall score and useful feedback. The contributions of the proposed ETI assessment model are four-fold:
The neonatal ETI assessment model automatically assesses the performance of trainees from captured motion data without the intervention of expert instructors, which significantly increases the training opportunities.
The CNN directly uses kinematic MTS data as input instead of manually designed features, which allows it to comprehensively account for the complex, time-varying motions without losing useful information.
The dilated convolution extends the receptive fields of the CNN without downsampling the input data. This operation extracts the crucial features of the motion patterns that primarily determine training performance, preserving the resolution of the input data and improving the accuracy of the performance assessment.
The visualization provides detailed and constructive feedback to trainees. The CAM identifies the critical factors, such as speed and movement patterns, that contribute most to the overall performance.
II. METHODS
The proposed ETI assessment model includes 3 modules: dataset collection and preprocessing, dilated CNN framework, and CAM.
A. Dataset Collection and Preprocessing
1). Dataset:
We collected an ETI motion dataset using our in-house AR intubation simulation system. The motions of both the manikin and the laryngoscope were tracked and captured by EM sensors. The dataset includes 44 subjects, with expertise ranging from novice to expert, and 190 intubation attempts in total. All intubation attempts were scored on a 3-point scale by an experienced senior neonatologist with 9 years of experience as a practicing neonatologist and 8 years of simulation and instruction experience. To minimize the risk of bias, the rater was blinded to the identity of the participants, and the order of playback was randomized. To maintain scoring consistency, we selected several representative attempts as references for the assessment of intubation performance. The data collection and study presented in this paper were approved by the Institutional Review Board of the Children's National Health Systems.
2). Input parameter compression:
Considering the motions of both the manikin and the laryngoscope inevitably increases the input size of the CNN, which could hamper the convergence of CNN training due to the increased number of trainable parameters. To resolve this problem, we compress the input by computing the relative transformation between the manikin and the laryngoscope. Specifically, let $T_{Lary}$ and $T_{Head}$ denote the global transformations of the laryngoscope and the manikin, respectively. The relative transformation is then $T_{LH} = T_{Head}^{-1}\,T_{Lary}$. This transformation reduces the input size of the CNN by half compared to the original dataset.
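As a minimal sketch of this compression step (assuming the tracked poses are available as 4x4 homogeneous matrices; the function and variable names are illustrative):

```python
import numpy as np

def relative_transform(T_head: np.ndarray, T_lary: np.ndarray) -> np.ndarray:
    """Express the laryngoscope pose in the manikin frame.

    T_head, T_lary: 4x4 homogeneous transforms of the manikin and the
    laryngoscope in the EM tracker's global frame (one frame of data).
    """
    return np.linalg.inv(T_head) @ T_lary

# Applied frame by frame, only the relative stream T_LH is fed to the CNN,
# halving the input size compared to keeping both global pose streams.
```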
3). Kinematic input:
The input kinematic feature vector is the concatenation of four kinematic features with 13 dimensions in total: rotation, position, linear velocity, and angular velocity. The rotation is represented as a quaternion. The linear and angular velocities are extracted from the relative transformations between consecutive frames using the method from [3]. Each kinematic feature is processed by an individual convolutional module in the first layer of the neural network. To reduce the size of the kinematic feature vector, we use the relative transformation in place of the global transformations captured by the sensors, and we represent the rotation as a quaternion instead of a rotation matrix. Therefore, the input size of the first convolutional layer is smaller than that in [5].
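For illustration, the sketch below assembles one 13-dimensional frame from two consecutive 4x4 relative transforms sampled at 60 fps; it uses a simple finite-difference approximation in place of the toolbox routine of [3], and the function and variable names are ours:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def kinematic_features(T_prev: np.ndarray, T_curr: np.ndarray, dt: float = 1 / 60):
    """Build one 13-D kinematic frame from two consecutive relative
    transforms: quaternion (4) + position (3) + linear velocity (3) +
    angular velocity (3)."""
    q = R.from_matrix(T_curr[:3, :3]).as_quat()            # rotation as a quaternion
    p = T_curr[:3, 3]                                      # position
    v = (T_curr[:3, 3] - T_prev[:3, 3]) / dt               # finite-difference linear velocity
    dR = R.from_matrix(T_prev[:3, :3].T @ T_curr[:3, :3])  # incremental rotation
    w = dR.as_rotvec() / dt                                # angular velocity (axis-angle / dt)
    return np.concatenate([q, p, v, w])                    # shape (13,)
```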
4). Data padding:
Because the kinematic MTS data have varying lengths, batch-based training is problematic. To address this, we pad each kinematic feature sequence to a fixed length; the absent frames are filled with zeros using the same structure as the kinematic vector.
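A minimal padding sketch under these assumptions (a fixed maximum length chosen in advance; the function name is illustrative):

```python
import numpy as np

def pad_sequence(seq: np.ndarray, max_len: int):
    """Zero-pad a (T, 13) kinematic sequence to (max_len, 13).

    Returns the padded array and the number of valid frames, which is
    used later to correct the global average pooling."""
    valid = min(seq.shape[0], max_len)
    padded = np.zeros((max_len, seq.shape[1]), dtype=seq.dtype)
    padded[:valid] = seq[:valid]
    return padded, valid
```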
B. Dilated CNN Framework
The proposed dilated CNN framework has 6 layers in total, as shown in Fig. 1. The input of the neural network is kinematic MTS data padded to a fixed sequence length, which contains 13 feature dimensions. The output neurons are the predictions for the performance score y ∈ {1,2,3}.
1). Dilated Convolution:
The receptive field is a critical factor affecting the performance of a CNN: its size determines how much of the input each convolutional layer covers [10]. A larger receptive field gives the convolutional layers access to more of the input, but usually requires more layers or larger stride or pooling sizes. However, these approaches have limited applicability to problems in medical domains for two reasons. First, increasing the pooling or stride size enlarges the receptive field at the expense of coarse-grained sampling, which can lose critical motion patterns. Second, medical datasets are often too small to guarantee the convergence of a deeper neural network.
In this work, we are inspired by dilated convolution, which has been applied to image segmentation [17] and audio generation [12]. The 1D dilated convolution [17] increases the receptive field without losing any information, ensuring that the CNN can access all useful features. Moreover, extending the receptive field with dilated convolution does not introduce more trainable parameters, because the number of parameters in each kernel remains the same. Thus, the intact feature information can improve training accuracy without requiring a large amount of training data. In our experiments, we set a unified dilation size for all kernels in each convolutional layer to demonstrate that dilated convolution improves the prediction accuracy of ETI assessment.
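For illustration, the sketch below shows how a 1D dilated convolution (here in PyTorch, which we use for implementation; the channel sizes and sequence length are arbitrary) widens the receptive field while preserving the temporal resolution:

```python
import torch
import torch.nn as nn

# A 1-D convolution with dilation d inserts d-1 gaps between kernel taps, so a
# kernel of size k covers d*(k-1)+1 input frames with the same number of
# weights.  "Same" padding keeps the temporal resolution of the input intact.
d, k = 5, 7
conv = nn.Conv1d(in_channels=13, out_channels=32, kernel_size=k,
                 dilation=d, padding=d * (k - 1) // 2)

x = torch.randn(8, 13, 1200)   # (batch, feature dimensions, padded frames)
y = conv(x)                    # -> (8, 32, 1200): no downsampling of the sequence
```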
2). Convolutional layer:
The first layer has 4 sub-clusters, each of which extracts convolutional features with a kernel size of 9. In each subsequent layer, the number of kernels is twice that of the previous layer, and each kernel has a size of 7. The input of the second convolutional layer is the concatenated tensor of the outputs of the sub-clusters. Each convolutional layer also contains a Rectified Linear Unit (ReLU) activation function.
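A sketch of this layer structure is shown below; only the four sub-clusters, the kernel sizes (9 and 7), the channel-doubling rule, the ReLU activations, and the shared dilation size come from the description above, while the channel counts and sequence length are illustrative assumptions:

```python
import torch
import torch.nn as nn

dilation, base = 5, 8                                    # base channel count: assumption
pad = lambda k: dilation * (k - 1) // 2                  # "same" padding, stride 1

# One sub-cluster per kinematic feature (quaternion, position, linear velocity,
# angular velocity), each a kernel-9 dilated convolution followed by ReLU.
sub_clusters = nn.ModuleList([
    nn.Sequential(nn.Conv1d(d_in, base, 9, dilation=dilation, padding=pad(9)), nn.ReLU())
    for d_in in (4, 3, 3, 3)])

# Subsequent layers: kernel size 7, number of kernels doubled at every layer.
conv2 = nn.Sequential(nn.Conv1d(4 * base, 8 * base, 7, dilation=dilation, padding=pad(7)), nn.ReLU())
conv3 = nn.Sequential(nn.Conv1d(8 * base, 16 * base, 7, dilation=dilation, padding=pad(7)), nn.ReLU())
conv4 = nn.Sequential(nn.Conv1d(16 * base, 32 * base, 7, dilation=dilation, padding=pad(7)), nn.ReLU())

x = torch.randn(2, 13, 1200)                             # (batch, 13-D features, padded frames)
parts = torch.split(x, [4, 3, 3, 3], dim=1)              # split into the four kinematic features
h = torch.cat([m(p) for m, p in zip(sub_clusters, parts)], dim=1)  # concatenated sub-cluster output
h = conv4(conv3(conv2(h)))                               # (2, 32 * base, 1200)
```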
3). Regularization:
To prevent overfitting, we add dropout regularization to the last two convolutional layers, with a dropout ratio of 0.2. A global average pooling (GAP) layer [9] in the fifth layer performs global regularization by averaging the convolutional features of each channel; it also reduces the number of parameters to be trained. However, after padding, a standard GAP would average over the fixed padded length rather than the varying valid length of each time series. To correct this, we divide by the number of valid frames instead of the fixed padded length when computing the global average feature. Finally, a fully connected layer with a softmax activation generates a prediction vector over the three score classes.
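A minimal sketch of this corrected GAP (the function name and tensor shapes are ours):

```python
import torch

def masked_gap(features: torch.Tensor, valid_len: torch.Tensor) -> torch.Tensor:
    """Global average pooling that ignores zero-padded frames.

    features:  (batch, channels, T) output of the last convolutional layer
    valid_len: (batch,) number of real (un-padded) frames per sequence
    """
    T = features.shape[-1]
    mask = (torch.arange(T, device=features.device)[None, :]
            < valid_len[:, None]).to(features.dtype)       # 1 on valid frames, 0 on padding
    summed = (features * mask[:, None, :]).sum(dim=-1)
    return summed / valid_len[:, None].to(features.dtype)  # divide by valid frames, not by T
```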
4). Training:
We train the neural network using the Adam optimizer [8] with the multinomial cross-entropy as the objective function, and we add L2 regularization to prevent overfitting. The learning rate is 0.001, the number of epochs is 600, and the mini-batch size is 32. Because of the limited dataset size and the repeated intubation attempts made by each subject, we evaluate classification accuracy with a Leave-One-Subject-Out scheme. During training, we record the best validation accuracy across epochs.
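The sketch below illustrates this training configuration; the L2 decay coefficient, the synthetic tensors, and the stand-in network are assumptions made only so that the loop runs in isolation (in the actual system, the dilated CNN with masked GAP described above would be used):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in network so the loop runs in isolation; not the dilated CNN itself.
model = nn.Sequential(nn.Conv1d(13, 16, 7, padding=3), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()                          # multinomial cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4)            # L2 coefficient: assumption

X = torch.randn(190, 13, 1200)                             # synthetic padded sequences
y = torch.randint(0, 3, (190,))                            # scores {1,2,3} mapped to {0,1,2}
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(600):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```

Under the Leave-One-Subject-Out scheme, each fold additionally holds out all attempts of one subject for validation.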
C. Class Activation Mapping (CAM)
To provide constructive feedback to trainees, we apply CAM to create a heatmap over the trajectory that identifies the regions of the motion sequence contributing most to the prediction of a specific class [19], as shown in Fig. 1. Warm (red) and cold (blue) colors represent high- and low-contribution movements, respectively. Following [5], the contribution at each time step is the weighted sum of the feature maps of the last convolutional layer, weighted by the output-layer weights of the predicted class. These contribution values are then converted to heatmap colors and assigned to the corresponding temporal locations in the 3D trajectory. The visualized trajectory thus highlights the movement patterns that mainly drive the prediction of the dilated CNN, helping trainees improve their motor behavior based on this constructive feedback.
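A minimal sketch of this CAM computation (variable names are ours; normalizing to [0, 1] for coloring is an assumption):

```python
import numpy as np

def class_activation_map(feature_maps: np.ndarray, fc_weights: np.ndarray, cls: int) -> np.ndarray:
    """1-D CAM in the spirit of [19], [5].

    feature_maps: (M, T) activations of the last convolutional layer
    fc_weights:   (n_classes, M) weights of the final fully connected layer
    cls:          index of the predicted score class
    Returns a (T,) per-frame contribution profile, normalized to [0, 1].
    """
    cam = fc_weights[cls] @ feature_maps       # weighted sum over the M channels
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)

# Each value cam[t] is mapped to a warm/cold color and painted onto the
# laryngoscope position at frame t to produce the 3D trajectory heatmap.
```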
III. RESULTS
In this section, we evaluate the performance of the proposed method. The neural network was implemented with PyTorch 1.2 on Python 3.7, and all experiments were run on an NVIDIA GTX 1080Ti GPU. To demonstrate the effectiveness of the proposed method, we tested our CNN model with different dilation sizes; the baseline CNN is the same model with the dilation size set to 1, and all configurations share the same layer sizes. The results reported in Table I, obtained by averaging the LOOCV results over 10 repeats, show that the CNN with dilated convolution improves classification accuracy over the baseline CNN. Specifically, the accuracy of the CNN with 5-dilated convolution is 8.1% higher than that of the baseline CNN.
TABLE I: Average LOOCV classification accuracy for different dilation sizes (10 repeats).
| Method | Dilation Size | Receptive Field (Layer 4) | Average Classification Accuracy (SD) |
|---|---|---|---|
| CNN | 1 | 27 | 84.1% (1.6%) |
| Dilated CNN | 3 | 79 | 90.1% (1.0%) |
| Dilated CNN | 5 | 131 | 92.2% (1.0%) |
| Dilated CNN | 7 | 183 | 89.7% (0.8%) |
To explore the effect of different dilation sizes, we conducted experiments with three dilation sizes (3, 5, and 7) and calculated their receptive fields at the last convolutional layer. The receptive fields cover from 79 to 183 frames. Since the EM sensor data are captured at 60 fps, the receptive fields span roughly 1.3 s to 3.1 s. The best result is achieved with a dilation size of 5, which corresponds to a span of about 2.2 s.
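For reference, these receptive-field values follow the standard relation for stacked stride-1 dilated convolutions; the sketch below assumes the four convolutional layers with kernel sizes 9, 7, 7, and 7 and a shared dilation size $d$ described in Section II:

$$RF_{\mathrm{layer\,4}} = 1 + d\,(9-1) + 3\,d\,(7-1) = 1 + 26\,d,$$

which gives 27, 79, 131, and 183 frames for $d = 1, 3, 5, 7$, matching Table I.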
To comprehensively demonstrate the predictability of the proposed CNN model, we evaluate the confusion matrices and the receiver operating characteristic (ROC) curves of the prediction results; for each experiment, we use the run with the median classification accuracy. As shown in Fig. 3, the confusion matrices show that our approach with different dilation sizes is substantially better at predicting score classes 1 and 3, and slightly worse at predicting score class 2, than the baseline CNN (dilation size of 1). This indicates that dilated convolution generally performs better for ETI performance prediction. Specifically, the 5-dilated convolution outperforms the baseline CNN by 17% on score class 1 and by 35% on score class 3. Fig. 2 provides four ROC plots, where a larger area under the curve (AUC) indicates better predictability. The macro-averaged and micro-averaged AUC values of all approaches are at least 0.8, indicating that the neural networks with all tested dilation sizes have strong predictability. Moreover, all per-class AUC values of the 5-dilated CNN are above 0.7, demonstrating good predictability of our model for each score class. From these results, we conclude that the dilated CNN can accurately assess ETI performance with a reliable score.
As shown in Fig. 4, the visualization of the CAM results shows that our dilated CNN can provide useful feedback to help trainees improve their performance. The movements circled in red are fast, high-frequency movements that contribute significantly to predicting score class 1, whereas the movements circled in blue are slow, steady movements that are important for predicting score class 3. This also shows that the model makes its predictions based on motion patterns.
IV. CONCLUSION
In this paper, we proposed a novel dilated CNN for automated neonatal ETI assessment based on high-level features of ETI motion. Instead of manually designed features, we use kinematic MTS data as input without losing any motion patterns. The dilated convolution extracts representative features that predict the overall ETI performance. Experiments conducted on our ETI motion dataset show that our approach achieves the highest accuracy among the compared configurations, demonstrating that our assessment system can provide more accurate and reliable ETI assessments. In future work, we will explore how to integrate the assessment model into both our computer-assisted ETI training system and our virtual reality training system [15]. We will also collect more data to build a larger dataset for developing machine learning methods for ETI training.
Acknowledgments
This work was supported by NIH grant R01HD091179.
References
- [1]. Bercic J et al. The influence of tracheal vascularization on the optimum location, shape and size of the tracheostomy in prolonged intubation. Resuscitation, 6(2):131–143, 1978.
- [2]. Carlson JN et al. A novel artificial intelligence system for endotracheal intubation. Prehospital Emergency Care, 20(5):667–671, 2016.
- [3]. Corke P. Robotics, Vision and Control: Fundamental Algorithms in MATLAB, 2nd ed., vol. 118. Springer, 2017.
- [4]. Falck AJ et al. Proficiency of pediatric residents in performing neonatal endotracheal intubation. Pediatrics, 112(6):1242–1247, 2003.
- [5]. Fawaz HI et al. Evaluating surgical skills from kinematic data using convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 214–221. Springer, 2018.
- [6]. Foglia EE et al. Factors associated with adverse events during tracheal intubation in the NICU. Neonatology, 108(1):23–29, 2015.
- [7]. Hatch LD et al. Endotracheal intubation in neonates: a prospective study of adverse safety events in 162 infants. The Journal of Pediatrics, 168:62–66, 2016.
- [8]. Kingma DP and Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [9]. Lin M, Chen Q, and Yan S. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- [10]. Luo W et al. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4898–4906, 2016.
- [11]. Matava C et al. A convolutional neural network for real time classification, identification, and labelling of vocal cord and tracheal using laryngoscopy and bronchoscopy video. 44(2):1–10, 2020.
- [12]. van den Oord A et al. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- [13]. Rahman T et al. Tracking manikin tracheal intubation using motion analysis. Pediatric Emergency Care, 27(8):701–705, 2011.
- [14]. Sanders RC et al. Level of trainee and tracheal intubation outcomes. Pediatrics, 131(3):e821–e828, 2013.
- [15]. Xiao X et al. A physics-based virtual reality simulation framework for neonatal endotracheal intubation. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 557–565, 2020.
- [16]. Xiao X et al. Evaluation of performance, acceptance, and compliance of an auto-injector in healthy and rheumatoid arthritic subjects measured by a motion capture system. Patient Preference and Adherence, 12:515, 2018.
- [17]. Yu F and Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
- [18]. Zhao S et al. Automated assessment system with cross reality for neonatal endotracheal intubation training. In IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 739–740, 2020.
- [19]. Zhou B et al. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.