2.2. Data Preparation
Two gastroenterologists specializing in capsule endoscopy (Oh DJ and Lim YJ from Dongguk University Ilsan Hospital) independently performed image labeling for lesion detection. After manually reviewing the entire set of WCE case images and categorizing them as normal, bleeding, inflammation, vascular, or polyp tissue, they cross-checked their findings to ensure accuracy. The 40 labeled cases were then divided into training (36 cases, 90%) and test (4 cases, 10%) sets. The datasets were composed of clip units rather than still-shot images, and each clip consisted of four sequential images. The reason for setting the unit of one clip to four sequential images is discussed in Section 3.2. The training set consists of 1,291,004 (322,751), 140,788 (35,197), 10,912 (2,728), 2,328 (582), and 14,832 (3,708) images (clips) for normal, bleeding, inflammation, vascular, and polyp tissues, respectively. The test set consists of 172,820 (43,205), 4,200 (1,050), 304 (76), 892 (223), and 24 (6) images (clips) for normal, bleeding, inflammation, vascular, and polyp tissues, respectively (Table 1).
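For illustration, the following is a minimal sketch (a hypothetical helper, not the authors' code) of how an ordered sequence of WCE frames can be grouped into non-overlapping clips of four sequential images, consistent with the image-to-clip ratios in Table 1:

```python
from typing import List

def make_clips(frame_paths: List[str], clip_len: int = 4) -> List[List[str]]:
    """Group an ordered list of frame paths into non-overlapping clips of clip_len images."""
    n_full = len(frame_paths) // clip_len  # frames left over at the end are dropped
    return [frame_paths[i * clip_len:(i + 1) * clip_len] for i in range(n_full)]

# Example: 10 sequential frames -> 2 clips of 4 images each.
frames = [f"case01_frame_{i:06d}.jpg" for i in range(10)]
print(make_clips(frames))
```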
In previous studies, labeling was traditionally conducted on individual images to identify the presence of lesions. In this study, however, we aim to reduce the manual labor cost by labeling sequential images simultaneously, as depicted in Figure 2. Nonetheless, using sequential images as a dataset presents certain limitations. First, a video sequence contains numerous nearly identical images, and learning from such a dataset increases the risk of overfitting, which can hinder the model's generalization capability. Second, there is a significant imbalance between normal and abnormal images: the dataset primarily comprises normal images, while the number of abnormal images is relatively low. This data imbalance can negatively affect the performance of the model, leading to suboptimal results.
2.3. Study Design
Our method is implemented using the PyTorch codebase MMAction2 [25]. Figure 2 shows a flow chart of the proposed VWCE-Net. First, the sequential input images are converted into clips comprising several images. Next, the clips are fed into the model, and lesions are detected in each clip.
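As a rough illustration of this flow (not the MMAction2 pipeline itself), the sketch below assumes a generic PyTorch classifier that maps each 4-frame clip to one of five classes; the dummy model and all names are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model: flattens a clip of shape (4, 3, 224, 224) and predicts 5 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 3 * 224 * 224, 5))

clips = torch.randn(8, 4, 3, 224, 224)    # a batch of 8 clips, 4 frames per clip
model.eval()
with torch.no_grad():
    preds = model(clips).argmax(dim=1)    # per-clip prediction: 0 = normal, 1-4 = lesion
print(preds.shape)                        # torch.Size([8])
```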
In this study, we set the clip size to 4 (that is, four consecutive frames constitute a clip). Each frame is 320 × 320 pixels, but we resize it to 224 × 224 pixels. Unlike ViT [12], which takes a single image as input, our method's input is a clip containing multiple images. Hence, the dimension of the input also differs from that of conventional methods, namely 4 × 224 × 224 × 3 [22], where 4 represents the number of images in the clip. Specifically, the input sequence of VWCE-Net is $x \in \mathbb{R}^{4 \times 224 \times 224 \times 3}$, and as shown in Figure 4, one 224 × 224 image becomes $N$ ($N = 224^2/P^2 = 196$) patches, denoted as $x_p$, with a patch size of $P = 16$. These patches are flattened into $x_p \in \mathbb{R}^{N \times D}$, where $D$ denotes the embedding dimension of one flattened patch.
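A minimal PyTorch sketch of cutting each 224 × 224 frame into 196 patches of size 16 × 16 and flattening them is given below; how the four frames of a clip are subsequently combined (and whether a further linear projection is applied) is not detailed in this section, so the sketch only illustrates the per-frame patching step, and the tensor names are illustrative.

```python
import torch

# A single clip: 4 consecutive frames of size 224 x 224 with 3 channels.
clip = torch.randn(4, 3, 224, 224)                      # (T, C, H, W)

P = 16                                                  # patch size
T, C, H, W = clip.shape
N = (H // P) * (W // P)                                 # 14 * 14 = 196 patches per frame

# Cut every frame into non-overlapping P x P patches and flatten each patch.
patches = clip.unfold(2, P, P).unfold(3, P, P)          # (T, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)             # (T, 14, 14, C, P, P)
patches = patches.reshape(T, N, C * P * P)              # (4, 196, 768)
print(patches.shape)                                    # torch.Size([4, 196, 768])
```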
In BERT, a clstoken is prepended to all of the features; it is used in classification tasks and ignored in other tasks [26]. After passing through all layers of the Transformer, the clstoken acquires the combined meaning of the token sequence, so in a classification task this clstoken can be passed to the classifier to classify the entire input sentence. In contrast to BERT, where the input data are in word form, in this work the input embedding has dimension $N \times D$, so the clstoken is represented as $x_{\mathrm{class}} \in \mathbb{R}^{1 \times D}$. The clstoken $x_{\mathrm{class}}$ is concatenated to the flattened patches $x_p$. Finally, the size of the embedding becomes $(N + 1) \times D$. VWCE-Net learns this clstoken to represent the sequence composed of 196 patched images so that it can operate as a classification token that determines whether a lesion is present.
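The following short sketch shows how a learnable clstoken could be prepended to the patch sequence so that the sequence length becomes N + 1 = 197; the embedding dimension D = 768 is an assumed value for illustration.

```python
import torch
import torch.nn as nn

N, D = 196, 768                                   # number of patches, assumed embedding dim
patch_tokens = torch.randn(1, N, D)               # (batch, N, D) flattened patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))    # learnable classification token

# Prepend the clstoken: the embedding size becomes (N + 1) x D.
tokens = torch.cat([cls_token.expand(patch_tokens.size(0), -1, -1), patch_tokens], dim=1)
print(tokens.shape)                               # torch.Size([1, 197, 768])
```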
The Transformer-based self-attention model does not compute convolutions and does not contain recurrence like an LSTM. To use the order of the sequence information, it is therefore necessary to mathematically model the relative or absolute positions of the flattened patches:

$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/D}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/D}\right),$

where $pos$ is the position in the sequence and $i$ is the index of the dimension representing the position. The positional encoding $PE$ is sinusoidal: it takes a paired value of sine and cosine depending on the value of $i$. That is, even-numbered dimensions use sine and odd-numbered dimensions use cosine.
With this positional encoding, in NLP, even the same word can have different embedding values depending on its position in the sentence [11]. In tasks such as text translation and text generation, 1-dimensional positional encoding is calculated because the input is a 1-dimensional word sequence. In this experiment, where the task is to find a lesion in a given image, the input is a 2-dimensional image, so 2-dimensional positional encoding can be considered. To apply 2-dimensional positional encoding, the embedding is first divided in half, with one half set to the $x$-embedding and the other to the $y$-embedding, each of size $D/2$. By concatenating the $x$-embedding and the $y$-embedding, the final positional encoding value of the patch at the corresponding position is obtained. This work uses 1-dimensional positional encoding instead of 2-dimensional positional encoding because there is little difference in performance [12]. This means that the 2-dimensional positional relationship between the $x$ and $y$ coordinates is sufficiently captured by the 1-dimensional positional relationship between the flattened patches.
Then, the embedded feature $z_0$ is created by adding the position embedding vector $E_{pos}$, which includes the spatial information of the patches. Formally, the embedded feature $z_0$ is represented as

$z_0 = [\,x_{\mathrm{class}};\, x_p\,] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N+1) \times D},$

where $[\,\cdot\,;\,\cdot\,]$ represents the concatenation.
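A sketch of the 1-dimensional sinusoidal positional encoding from the equations above, added to the concatenated token sequence to form the embedded feature $z_0$, is shown below; the helper name and D = 768 are illustrative assumptions.

```python
import torch

def sinusoidal_pe(seq_len: int, dim: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/dim)); PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, dim, 2, dtype=torch.float32)            # even dimension indices
    angle = pos / torch.pow(10000.0, two_i / dim)                   # (seq_len, dim/2)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions use cosine
    return pe

N, D = 196, 768
tokens = torch.randn(1, N + 1, D)        # [clstoken; flattened patches]
z0 = tokens + sinusoidal_pe(N + 1, D)    # embedded feature z_0
print(z0.shape)                          # torch.Size([1, 197, 768])
```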
The embedded feature $z$ is projected into query $Q$, key $K$, and value $V$ representations following the Transformer architecture [11]. This transformation is achieved through linear operations using parameter matrices $W^{Q}$, $W^{K}$, and $W^{V}$, which are described as

$Q = \mathrm{LN}(z)\,W^{Q}, \qquad K = \mathrm{LN}(z)\,W^{K}, \qquad V = \mathrm{LN}(z)\,W^{V},$

where the dimensions of the projected $Q$, $K$, and $V$ are equally $(N+1) \times D$. Each operation contains layer normalization $\mathrm{LN}(\cdot)$ of the embedded feature $z$.
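A minimal sketch of the Q/K/V projections with layer normalization of the embedded feature, matching the equations above, is given below; D = 768 and the use of bias-free linear layers are assumptions.

```python
import torch
import torch.nn as nn

D = 768
ln = nn.LayerNorm(D)
w_q, w_k, w_v = nn.Linear(D, D, bias=False), nn.Linear(D, D, bias=False), nn.Linear(D, D, bias=False)

z = torch.randn(1, 197, D)        # embedded feature, shape (batch, N + 1, D)
zn = ln(z)                        # layer normalization of the embedded feature
q, k, v = w_q(zn), w_k(zn), w_v(zn)
print(q.shape, k.shape, v.shape)  # each torch.Size([1, 197, 768])
```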
As shown in Figure 5, the Multi-Head Attention module performs several self-attention operations in parallel. Each of these operations is referred to as a "Head," following the terminology in [11]. In this paper, the number of "Heads" is denoted as $h$. Because the feature dimension is $D$, the dimension within each head is $D_h = D/h$. Consequently, the $Q$, $K$, and $V$ vectors are converted to dimensions of $(N+1) \times D_h$ within each head.
We perform matrix multiplication between $Q$ and $K^{\top}$, followed by scaling with $\sqrt{d_k}$, where $d_k$ is the key dimension, and then apply the softmax function as [26]

$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),$

which is the self-attention coefficient. In this paper, we set $d_k$ equal to the per-head dimension $D_h = D/h$. Then, the self-attention value of each head is obtained as

$\mathrm{SA}(z) = A\,V,$

i.e., the product of the attention coefficients and the value representations. The outputs of the $h$ attention heads are concatenated to form the multi-head self-attention (MSA) output, which then passes through a multi-layer perceptron (MLP); a residual connection is used for each operation:

$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell},$
where $\mathrm{LN}$ denotes layer normalization and MLP is a multi-layer perceptron consisting of two hidden layers. In summary, the clstoken $x_{\mathrm{class}}$ is concatenated to the flattened image patches $x_p$, and the positional encoding $E_{pos}$ is added to obtain the embedded feature $z_0$. Then, $z_{\ell-1}$ is used to calculate multi-head attention to obtain the self-attention output $z'_{\ell}$, and $z'_{\ell}$ passes through the MLP to obtain $z_{\ell}$ in the $\ell$-th Transformer layer. In this paper, we set the number of Transformer layers to 12, so we calculate up to $z_{12}$ by repeating the Transformer operation. The first position of $z_{12}$ (the clstoken position), obtained through the entire stack of Transformer layers, is used to determine whether there is a lesion:

$y = \mathrm{FC}\!\left(z_{12}^{0}\right) \in \mathbb{R}^{5},$

where $\mathrm{FC}$ denotes a fully connected layer. If the classification result is 0, there is no lesion; if it is 1 or greater, there is a lesion. The final output dimension is 5 because four lesion types (in addition to the normal class) were defined when preparing the data.
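To tie the pieces together, below is a minimal sketch of one pre-norm Transformer layer (multi-head self-attention and MLP, each with a residual connection), stacked 12 times, with the fully connected head applied at the clstoken position. ViT-Base-style sizes (D = 768, 12 heads, MLP hidden size 4D) are assumptions for illustration; this is a reconstruction of the described computation, not the authors' released code.

```python
import torch
import torch.nn as nn

D, HEADS, LAYERS, NUM_CLASSES = 768, 12, 12, 5   # assumed ViT-Base-style settings

class TransformerLayer(nn.Module):
    """Pre-norm Transformer layer: MSA + MLP, each with a residual connection."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        zn = self.ln1(z)
        z = z + self.attn(zn, zn, zn, need_weights=False)[0]   # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = z + self.mlp(self.ln2(z))                          # z_l  = MLP(LN(z'_l)) + z'_l
        return z

class ClipClassifier(nn.Module):
    """12 Transformer layers followed by a fully connected head on the clstoken."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([TransformerLayer(D, HEADS) for _ in range(LAYERS)])
        self.fc = nn.Linear(D, NUM_CLASSES)    # 0 = normal, 1-4 = lesion types

    def forward(self, z0: torch.Tensor) -> torch.Tensor:
        z = z0                                 # (batch, N + 1, D), clstoken at index 0
        for layer in self.layers:
            z = layer(z)
        return self.fc(z[:, 0])                # classify from the clstoken position

z0 = torch.randn(2, 197, D)                    # embedded features for two clips
print(ClipClassifier()(z0).shape)              # torch.Size([2, 5])
```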