VISION TRANSFORMER (ViT) MODEL FOR BIRDS CLASSIFICATION
A PROJECT REPORT
Submitted by
JAYARAMAN P (913120104040)
MATHI ALAGAN T (913120104051)
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
Certified that this project report “VISION TRANSFORMER (VIT) MODEL FOR
BIRDS CLASSIFICATION” is the bonafide work of “JAYARAMAN P (913120104040),
MATHI ALAGAN T (913120104051)” of VIII Semester B.E Computer Science and
Engineering who carried out the project work under my supervision.
SIGNATURE SIGNATURE
ACKNOWLEDGEMENT
We thank the almighty for giving us moral strength to work on the project for the past few months.
We express our sincere thanks to our respected Principal, Dr. P. ALLI, for providing the
facilities needed to carry out this project work.
We convey our thanks to our guide, Dr. G. Vinoth Chakkaravarthy, Professor, Department
of Computer Science and Engineering, for his/her innovative suggestions and valuable guidance.
We also wish to extend our sincere gratitude to all faculty members of the Department of
Computer Science and Engineering for their valuable guidance throughout the course of our project.
We also thank our parents and friends who provided moral and physical support.
ABSTRACT
Bird classification stands as a formidable challenge amidst the intricate tapestry
of avian diversity, where manual identification often becomes a laborious
endeavor hindered by the sheer breadth of species and the subtle nuances that
distinguish them. In this study, we compare two deep learning models,
EfficientNet-B2 and the Vision Transformer (ViT) B-16, on an extensive corpus
of bird imagery. ViT-B16 excels in capturing intricate features in bird images
and outperforms EfficientNet-B2 on unseen data. Integrated into the
Gadio web app, ViT-B16 provides users with a seamless platform for accurate
bird identification, supporting engagement in birdwatching and
biodiversity conservation.
TABLE OF CONTENTS
S.NO. CONTENT PAGE NUMBER
1. Introduction 1-2
1.1. Overview 1
1.2. Objective 2
2. Literature Survey 3-4
7.2. Output Design 18
8. System Architecture 20
8.2. Algorithm 20
9.4.1. EfficientNetB2
9.5. Reshaping
9.6. Training 27
9.7. Testing 28
9.8. Visualization 29
10. Result 30
10.1. EfficientNetB2 Evaluation
11. Conclusion 31
Appendix 1 33
Appendix 2 34
Appendix 3 35
Appendix 4 36
Appendix 5 36
LIST OF TABLES
TABLE.NO TABLE PAGE NUMBER
LIST OF FIGURES
FIGURE NO. FIGURE PAGE NUMBER
LIST OF ABBREVIATIONS
CNN - Convolutional Neural Network
ViT - Vision Transformer
1. INTRODUCTION
1.1. OVERVIEW
Bird classification is a formidable challenge amidst the diversity of avian
species: manual identification is often a laborious endeavor, hindered by the
sheer breadth of species and the subtle nuances that distinguish them. Modern
machine learning offers a path towards automation and efficiency. In this study,
we present "Gadio," a web application crafted to navigate the complexities of
bird classification using deep learning models. Our investigation is a
comparative analysis of two prominent models, EfficientNet-B2 and the Vision
Transformer (ViT) B-16, on the task of bird classification. Through
experimentation and evaluation on an extensive corpus of avian imagery, we
scrutinize the performance of both models and draw insights into their efficacy
in discerning avian species from unseen data. Our findings show that the
ViT-B16 model outperforms its EfficientNet-B2 counterpart, a result we
attribute to its transformer architecture: with its ability to capture
spatial dependencies and discern fine-grained features within bird imagery,
ViT-B16 emerges as a strong contender for bird classification.
Beyond this comparative study, we integrate the ViT-B16 model into the Gadio
web application, giving users an intuitive platform for bird identification.
Through Gadio's interface, enthusiasts, researchers, and conservationists alike
can upload images and receive accurate predictions of bird species, catalyzing
engagement in birdwatching and advancing wildlife conservation. In doing so,
we illustrate the potential of machine learning to democratize access to bird
identification tools, fostering a synergy between human ingenuity and the
preservation of avian biodiversity.
1.2. OBJECTIVE
The primary objective of this project is to compare different deep learning
models and identify the most accurate one for bird classification.
2. LITERATURE SURVEY
Recent research efforts have led to significant advancements in bird
image classification, with various innovative methodologies aimed at enhancing
accuracy and efficiency.
to enhance accuracy even on large datasets. When using a transfer learning
model, one of the sub-models may incur more loss than the others, so it is
wise to pick the most accurate model.
3. SYSTEM STUDY
3.1. FEASIBILITY STUDY
The feasibility study for implementing deep learning models in bird
classification involves assessing:
• Economic Feasibility
• Technical Feasibility
• Social Feasibility
Potential Cost Reduction Strategies: Exploring potential avenues for
cost reduction while maintaining optimal service quality is paramount.
The proposed deep learning system is meticulously designed with cost
considerations in mind, aiming to streamline operations, reduce manual
labor, and ultimately lower overall project costs while enhancing
classification accuracy and conservation efforts.
• Considering the technical feasibility of implementing advanced
techniques in bird image classification, what are the critical factors
that need to be addressed to ensure the successful deployment of such
systems in research settings?
4. SYSTEM ANALYSIS
Existing Method | Accuracy (%) | Strength of the Method | Limitation
Convolutional Neural Network (CNN) | 86.5 | Accuracy-to-loss ratio is lower | Low accuracy on larger datasets
ResNet152 | 85 | Extracts features by the detection of parts | Difficult to measure accuracy
Bilinear model | 84.1 | Saves computing time greatly | Requires manual elimination of noisy images on large datasets
indicate imbalances in learning or inconsistencies in the training process.
Addressing these discrepancies requires careful monitoring and adjustment
of training parameters to ensure that all parts of the model converge
effectively towards the desired objective.
Accuracy Measurement Complexity: Measuring accuracy reliably on
larger datasets presents several challenges. With a vast number of images to
evaluate, traditional evaluation methodologies may be insufficient to capture
the nuanced performance of the classification system. Robust evaluation
techniques and extensive computational resources are required to analyze
accuracy comprehensively and account for variations across different subsets
of the dataset.
Implement the proposed ViT model using deep learning frameworks such
as PyTorch or TensorFlow. Train the model on the preprocessed bird dataset,
optimizing hyperparameters and regularization techniques to enhance
performance and generalization.
images. Experiment with self-attention mechanisms to improve feature
representation and classification accuracy.
Evaluate the performance of the ViT model using standard metrics such
as accuracy, precision, recall, and F1-score. Conduct comparative analyses
against traditional CNN-based approaches to assess the ViT model's
effectiveness in bird classification tasks.
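As an illustration, these metrics could be computed with scikit-learn; the sketch below is an assumption about how the evaluation might be wired up, with y_true and y_pred holding the ground-truth and predicted class indices for the test set:
# Hypothetical evaluation sketch using scikit-learn (names are illustrative)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_predictions(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1-score."""
    accuracy = accuracy_score(y_true, y_pred)
    # Macro averaging weights all bird classes equally, which suits a
    # fine-grained task with many species.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}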
Validation:
classification across diverse bird species and environmental conditions
without the need for extensive manual feature engineering.
Attention Mechanisms for Contextual Understanding: By leveraging
attention mechanisms, ViT models can dynamically focus on relevant
regions of interest within bird images, enhancing contextual
understanding and improving classification performance, particularly in
scenarios with complex backgrounds or occlusions.
Efficient Resource Utilization: Vision Transformer models offer
efficient resource utilization by eliminating the need for computationally
expensive convolutional operations, leading to faster inference times and
reduced computational overhead, making them suitable for deployment in
resource-constrained environments.
5. SYSTEM SPECIFICATION
The system specifications serve as a blueprint for configuring the optimal
hardware and software environment to facilitate the development and
deployment of the bird image classification system.
NVIDIA GPUs with CUDA support, ideally with at least 8GB of VRAM
such as GeForce GTX 1080 Ti or higher, are preferred for accelerating deep
learning computations.
A minimum of 16GB of RAM is necessary, although larger datasets and
more complex models may require 32GB or more to prevent memory
constraints.
Storage:
Fast and ample storage, preferably Solid-State Drives (SSDs) with at least
500GB capacity, is essential for storing datasets, model checkpoints, and
intermediate results.
Additional Considerations:
Python is the preferred programming language for deep learning and data
science tasks due to its simplicity, versatility, and rich ecosystem of libraries. It
serves as the primary language for implementing the project's codebase and
integrating various software components.
Data Preprocessing Libraries:
Libraries such as NumPy and Pandas are essential for data preprocessing
tasks, including data manipulation, transformation, and normalization. These
libraries enable efficient handling and processing of the bird image
dataset before model training.
Image Processing Libraries:
Image processing libraries like OpenCV are useful for performing image
augmentation, preprocessing, and manipulation tasks. They provide
functionalities for resizing, cropping, and applying filters to images, enhancing
data quality and model robustness.
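For instance, a minimal OpenCV preprocessing pass could look like the following sketch, where the file path and filter parameters are placeholders rather than the project's actual values:
# Illustrative OpenCV preprocessing (path and parameters are placeholders)
import cv2

image = cv2.imread("bird.jpg")                # load an image in BGR format
image = cv2.resize(image, (224, 224))         # resize to the model input size
image = cv2.GaussianBlur(image, (3, 3), 0)    # light denoising filter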
Model Evaluation and Visualization Tools:
Deployment Environment:
Depending on deployment requirements, containerization technologies
like Docker may be used to package the application along with its dependencies
into a portable container. Additionally, cloud platforms such as AWS, Google
Cloud Platform, or Microsoft Azure can be utilized for deploying and hosting
the classification system in production environments.
"b16" typically representing the size of the model, specifying the number
of layers and other architectural parameters.
The model was then trained for a total of 5 epochs, allowing it to learn from
the augmented data and optimize its parameters over multiple iterations.
This iterative training process helped improve the model's accuracy and
convergence, ultimately leading to better performance on unseen data.
Categorical crossentropy loss, on the other hand, measures the dissimilarity
between the true distribution of the data and the predicted distribution,
serving as a key optimization objective during training.
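Concretely, for a one-hot true label distribution $y$ and a predicted probability distribution $\hat{y}$ over $C$ classes, the categorical crossentropy loss is

\[ L(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c \]

which reduces to the negative log-probability the model assigns to the correct class; this is the objective implemented by nn.CrossEntropyLoss in the appendix code.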
7. SYSTEM DESIGN
Data Acquisition:
The input system initiates with the collection of bird images sourced from
various repositories, wildlife databases, and research platforms dedicated to
ornithology and wildlife conservation.
Images may be obtained in common formats such as JPEG or PNG from
wildlife cameras, birdwatching enthusiasts, or research studies.
Data Preprocessing:
Preprocessing techniques are applied to standardize and enhance the quality
of the input data before feeding it into the classification model.
Augmentation:
Image augmentation methods are employed to increase dataset diversity and
improve the model's ability to generalize to unseen bird image variations.
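As a sketch of what such augmentation could look like with torchvision, the pipeline below is illustrative; the specific transforms and parameter values are assumptions, not the project's exact pipeline (the appendices use TrivialAugmentWide):
# Illustrative torchvision augmentation pipeline (parameters are assumptions)
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # birds may face either way
    transforms.RandomRotation(degrees=15),                 # small pose variations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # lighting changes
    transforms.Resize((224, 224)),                         # standard input size
    transforms.ToTensor(),
])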
Data Splitting:
The dataset is partitioned into distinct subsets for training, validation, and
testing purposes.
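A minimal sketch of such a three-way split with PyTorch follows; the 70/15/15 ratios and the helper name are assumptions, since the report does not state the exact proportions:
# Illustrative train/validation/test split (ratios are assumptions)
import torch
from torch.utils.data import random_split

def split_dataset(dataset, train_frac=0.7, valid_frac=0.15, seed=42):
    """Partition a dataset into train, validation, and test subsets."""
    n = len(dataset)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    n_test = n - n_train - n_valid   # remainder forms the test set
    return random_split(dataset, [n_train, n_valid, n_test],
                        generator=torch.Generator().manual_seed(seed))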
Bird images are fed into the model layers, comprising attention
mechanisms and classification heads, to generate output logits
representing the model's confidence scores for each bird species.
Epoch:
Epochs are incorporated into the training pipeline to dynamically adjust
parameters such as the learning rate and the augmentation
intensity during model training.
This iterative process optimizes the data pipeline and enhances model
accuracy over successive training iterations.
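For example, a learning-rate schedule that decays across epochs could be attached to the optimizer; in the sketch below the stand-in model, step size, and decay factor are all assumptions:
# Illustrative epoch-wise learning-rate schedule (values are assumptions)
import torch
from torch import nn

model = nn.Linear(10, 2)  # stand-in for the classification model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Halve the learning rate every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

for epoch in range(5):
    # ... one epoch of training and validation would run here ...
    scheduler.step()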
Overall, the input system design ensures efficient processing and preparation of
bird images as input for the classification model, ultimately contributing to
accurate and reliable bird species identification outcomes.
Classification Prediction:
Following the processing of bird images through the classification
model, the system produces predictions indicating the likelihood of
each bird species present in the input image.
Class Labels:
The output includes class labels corresponding to the predicted bird
species, representing the 510 classes available in the classification model.
Each image is assigned a class label based on the highest probability
score among the output logits generated by the model.
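A minimal sketch of mapping output logits to a class label and confidence score is given below; it assumes logits is the model output for one image and class_names lists the species, both names being illustrative:
# Illustrative logits-to-label mapping (names are assumptions)
import torch

def predict_label(logits, class_names):
    """Return the most probable species name and its confidence score."""
    probs = torch.softmax(logits, dim=-1)        # logits -> probabilities
    confidence, idx = torch.max(probs, dim=-1)   # highest-probability class
    return class_names[int(idx)], float(confidence)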
Confidence Scores:
Alongside class labels, the system provides confidence scores or
probabilities indicating the model's certainty in its predictions for each
bird species.
These confidence scores quantify the model's confidence level for each
predicted class, enabling users to assess the reliability of the classification
results.
Visualization:
The system offers visualizations of the classification results to aid
interpretation and analysis by end-users, enhancing the user experience.
Visualization techniques may include displaying probability distribution
plots or generating visual summaries showcasing the predicted bird
species and their corresponding confidence scores.
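One plausible form of such a visualization, sketched below, is a bar chart of the top-5 predicted species and their probabilities; the function and variable names are illustrative, not taken from the project code:
# Illustrative top-k prediction plot (names and k are assumptions)
import matplotlib.pyplot as plt
import torch

def plot_top_predictions(probs, class_names, k=5):
    """Bar chart of the k most probable bird species for one image."""
    top_probs, top_idx = torch.topk(probs, k)
    labels = [class_names[int(i)] for i in top_idx]
    plt.barh(labels[::-1], top_probs.tolist()[::-1])
    plt.xlabel("Predicted probability")
    plt.title("Top predicted bird species")
    plt.show()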
8. SYSTEM ARCHITECTURE
8.2. ALGORITHM
1. Start by loading the bird image dataset.
2. Preprocess the images by resizing them to a standard size and normalizing the pixel values.
3. Initialize data augmentation techniques to increase the diversity of the dataset.
4. Generate an augmented dataset using the initialized augmentation parameters.
5. Configure the model architecture by combining ResNet50 with LSTM and dense layers (see the sketch after this list).
6. Define augmentation parameters such as rotation, shift, shear, zoom, and horizontal flip.
7. Initialize the pre-trained ResNet50 architecture for feature extraction.
8. Predict the dataset with the pre-trained model to extract features.
9. Initialize the LSTM layer for sequential analysis of features.
10. Run the feature extraction dataset through the LSTM model.
11. Monitor the training progress to track accuracy and loss.
12. Optimize hyperparameters such as learning rate, batch size, and epochs if necessary.
13. Evaluate the model's performance on the testing set using accuracy, precision, recall, and F1-score metrics.
14. Save the trained model for future use.
15. End.
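The following Keras sketch illustrates steps 5-10 under assumed shapes; the LSTM width, the reshaping of ResNet50 feature maps into a 49-step sequence, and the class count are illustrative choices rather than the report's exact configuration:
# Illustrative ResNet50 + LSTM hybrid (layer sizes are assumptions)
import tensorflow as tf

image_size, num_classes = 224, 510
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(image_size, image_size, 3))
base.trainable = False  # use ResNet50 purely as a feature extractor

model = tf.keras.Sequential([
    base,
    # Treat the 7x7 grid of 2048-dim feature vectors as a 49-step sequence.
    tf.keras.layers.Reshape((49, 2048)),
    tf.keras.layers.LSTM(256),                # sequential analysis of features
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])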
9. SYSTEM IMPLEMENTATION
Sample data:
Fig. 9.1.1. Sample Data
The pixel values of the bird images are rescaled to enhance model
convergence and performance. The rescale parameter is set to 1./255, indicating
that each pixel value in the image will be divided by 255. This operation scales
down the pixel intensities to a range between 0 and 1, ensuring that pixel values
are within a consistent and standardized range across all images in the dataset.
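In Keras, this corresponds to the rescale argument of ImageDataGenerator; the sketch below assumes a directory-based dataset layout, and the path is a placeholder:
# Illustrative pixel rescaling with Keras (path is a placeholder)
import tensorflow as tf

# Divide every pixel value by 255 so intensities fall in [0, 1].
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)
train_data = datagen.flow_from_directory("data/train",
                                         target_size=(224, 224),
                                         class_mode="categorical")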
the optimization algorithm can more effectively navigate the parameter space,
leading to faster convergence and better generalization performance.
Size of Training Data
In the bird classification project, the size of the training data is an
essential aspect that directly influences the performance and
generalization capabilities of the deep learning model. By ensuring an
adequate amount of training data, the model can learn diverse patterns
and features present in bird images, leading to more accurate and robust
classification results. A sufficient size of the training data is crucial to
capture the variability and complexity inherent in bird images across
different species, poses, environments, and lighting conditions. With a
larger training dataset, the model can better generalize to unseen bird
images and exhibit improved performance when deployed in real-world
scenarios. Moreover, a substantial training dataset helps mitigate the risk
of overfitting, where the model memorizes the training data instead of
learning underlying patterns. By exposing the model to a diverse range of
bird images during training, overfitting can be minimized, resulting in a
more reliable and effective classification model. Therefore, it is
recommended to ensure a sizable training dataset comprising a sufficient
number of bird images representing various species and scenarios. This
approach enhances the model's ability to learn discriminative features and
patterns, ultimately leading to better classification performance in bird
species identification tasks.
Training Set:
The training set comprises a majority portion of the dataset and is used to
train the machine learning model. The model learns from the patterns present in
the training data, adjusting its parameters during training to minimize the
prediction error.
Validation Set:
The validation set is used during training to tune hyperparameters and to
monitor the model's performance on data it has not been trained on, helping
to detect overfitting before the final evaluation.
Test Set:
The test set is a completely independent portion of the dataset that is not
used during model training or hyperparameter tuning. It is reserved for the final
evaluation of the trained model's performance and provides an unbiased
estimate of its generalization ability. Evaluating the model on the test set helps
assess its performance in real-world scenarios and provides insights into its
effectiveness for the intended task.
9.4. FEATURE EXTRACTION AND CLASSIFICATION
9.4.1. EfficientNetB2
EfficientNetB2 is a variant of the EfficientNet family of convolutional
neural network (CNN) architectures, known for its balance between model size
and performance. It is specifically designed to be efficient in terms of
computational resources while achieving competitive accuracy on image
classification tasks.
Architecture:
EfficientNetB2 follows a compound scaling method that
jointly scales the depth, width, and resolution of the network. It consists of
multiple layers of convolutional and pooling operations, with a focus on
minimizing computational overhead while maximizing feature extraction
capabilities. The architecture of EfficientNetB2 is designed to strike a
balance between model complexity and computational cost, making it
suitable for resource-constrained environments.
Efficiency:
EfficientNetB2 achieves high efficiency by carefully balancing
model complexity with computational cost. It introduces novel techniques
such as compound scaling and efficient building blocks to achieve state-
of-the-art performance on various computer vision tasks while using
fewer parameters and computational resources compared to larger
models.
Feature Extraction:
EfficientNetB2 can effectively extract hierarchical features from bird images,
enabling accurate classification of bird species.
Architecture:
Feature Extraction:
Efficiency
9.5. RESHAPING
EfficientNetB2:
9.6. TRAINING
In the training and validation process with an epoch of 5, a hybrid model is
iteratively trained on a labeled training dataset for 5 epochs while
simultaneously evaluated on a separate validation dataset. Each epoch involves
processing batches of training examples and adjusting the model's parameters
using optimization algorithms. Performance metrics, such as loss and accuracy,
are computed on both the training and validation data to monitor learning
progress and detect overfitting. Early stopping may be employed to halt training
if the model's performance on the validation dataset stagnates or degrades. This
iterative approach aims to optimize the model's parameters and ensure its
robustness and generalization to unseen data.
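The train function called in the appendices is not reproduced in this report; the sketch below shows one plausible shape for it, with a simple early-stopping rule whose patience value is an assumption (the report's actual function also records per-epoch metrics for plotting):
# Plausible train/validate loop with early stopping (patience is an assumption)
import torch

def train(model, train_dataloader, valid_dataloader, loss_fn, optimizer,
          epochs, device, patience=2):
    """Train for up to `epochs` epochs, stopping early if validation stalls."""
    best_valid_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(epochs):
        model.train()
        for X, y in train_dataloader:
            X, y = X.to(device), y.to(device)
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for X, y in valid_dataloader:
                X, y = X.to(device), y.to(device)
                valid_loss += loss_fn(model(X), y).item()
        valid_loss /= len(valid_dataloader)
        if valid_loss < best_valid_loss:
            best_valid_loss, epochs_without_improvement = valid_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving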
9.7. TESTING
9.7.2. TESTING OF VISION TRANSFORMER
Accuracy: 0.9201
9.8. VISUALIZATION
Fig. 9.9.2.1. Training Curve of Vision Transformer
Training accuracy: 0.8826
Valid loss: 0.4478
Valid accuracy: 0.8980
Test accuracy: 0.9201
The comparison between the Vision Transformer (ViT) model with b16
architecture and the EfficientNetB2 model reveals that ViT_b16 demonstrates
superior performance across all metrics. ViT_b16 achieves lower training and
validation losses, higher training and validation accuracies, and ultimately,
higher test accuracy compared to EffNetB2. This indicates ViT_b16's better
convergence during training, superior ability to capture training data patterns,
and stronger generalization to unseen data. Moreover, ViT_b16 achieves these
results with slightly lower computational resources, showcasing its efficiency.
Overall, the ViT model with b16 architecture emerges as the preferred choice
for bird species classification tasks due to its outstanding performance and
efficiency.
11. CONCLUSION
In conclusion, the development of the "Gadio" web application
incorporating the Vision Transformer (ViT) B-16 model marks a significant
stride towards automating bird classification processes. Through rigorous
comparison with the EfficientNet-B2 model, our study highlights the superior
performance of ViT-B16, particularly in discerning intricate features within bird
images. Leveraging its innovative transformer architecture, ViT-B16 not only
enhances accuracy but also streamlines the bird identification experience for
users. This advancement holds immense potential in democratizing access to
birdwatching tools and bolstering conservation efforts.
By offering an accessible platform for bird identification, "Gadio" facilitates greater engagement in birdwatching
activities and encourages participation in conservation initiatives. This
democratization of access to bird classification tools empowers individuals and
organizations to contribute meaningfully to our understanding and protection of
avian diversity.
Transferability Across Domains:
Investigating the transferability of pre-trained ViT models across
different domains, such as wildlife photography or ornithological
research, can extend their utility beyond traditional bird classification tasks.
Efficient Training Strategies:
Developing efficient training strategies, such as curriculum
learning or self-supervised learning, can expedite the training process of
ViT models while maintaining high classification accuracy, making them
more practical for real-world applications.
Interactive Learning Interfaces:
Designing interactive learning interfaces that leverage ViT models
can empower citizen scientists and birdwatchers to contribute to bird
classification efforts by providing real-time feedback and annotations.
Robustness to Environmental Variability:
Enhancing the robustness of ViT models to environmental
variability, such as changes in lighting conditions or background clutter,
can improve their performance in challenging field conditions where bird
images may exhibit significant variations.
Ethical Considerations and Bias Mitigation:
Addressing ethical concerns, such as fairness and bias mitigation,
is essential to ensure the responsible deployment of ViT models in bird
classification systems, particularly in contexts where biases may
inadvertently influence model predictions.
APPENDICES
APPENDIX 1
#Data Loader Function:
import torch
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision import datasets

# The opening of the original definition was lost in extraction; the function
# name and the `path` and `split` parameters below are reconstructed assumptions.
def create_dataloader(path: str,
                      split: bool = False,
                      split_size: float = 0.2,
                      transform: transforms.Compose = None,
                      batch_size: int = 32,
                      shuffle: bool = False,
                      num_workers: int = 1,
                      return_classes: bool = False):
    """Creates a dataset and converts it into a DataLoader."""
    if transform is None:
        transform = transforms.Compose([
            transforms.Resize(size=(224, 224)),
            transforms.ToTensor()
        ])
    dataset = datasets.ImageFolder(path,
                                   transform=transform,
                                   target_transform=None)
    classes = dataset.classes
    # Making a split
    if split:
        length = int(len(dataset) * split_size)
        rem_length = len(dataset) - length
        dataset, _ = torch.utils.data.random_split(
            dataset=dataset,
            lengths=[length, rem_length],
            generator=torch.manual_seed(42))
    dataloader = DataLoader(dataset=dataset,
                            batch_size=batch_size,
                            shuffle=shuffle,
                            num_workers=num_workers,
                            pin_memory=True)
    if return_classes:
        return dataloader, classes
    return dataloader
APPENDIX 2
#EfficientNet B2
import torch
import torchvision
from torch import nn

# Assumes `device`, `class_names`, the `train` function, and the dataloaders
# are defined elsewhere (see Appendix 1 and Section 9.6).
effnet_weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT
effnetb2_model = torchvision.models.efficientnet_b2(
    weights=effnet_weights).to(device)

eff_net_transforms_with_data_augmentation = torchvision.transforms.Compose([
    torchvision.transforms.TrivialAugmentWide(),
    effnet_weights.transforms()
])

# Replace the classifier head so outputs match the number of bird classes.
effnetb2_model.classifier = nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1408, out_features=len(class_names))
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=effnetb2_model.parameters(), lr=0.001)

torch.manual_seed(42)
torch.cuda.manual_seed(42)
effnetb2_results = train(
    model=effnetb2_model,
    train_dataloader=effnetb2_train_dataloader,
    valid_dataloader=effnetb2_valid_dataloader,
    loss_fn=loss_fn,
    optimizer=optimizer,
    epochs=5,
    device=device
)
APPENDIX 3
#Vision Transformer
import torch
import torchvision
from torch import nn

# Assumes `device`, `class_names`, `train`, and the dataloaders are defined
# elsewhere (see Appendices 1 and 2).
vit_weights = torchvision.models.ViT_B_16_Weights.DEFAULT
vit_b16_model = torchvision.models.vit_b_16(weights=vit_weights).to(device)

vit_transforms_with_data_augmentation = torchvision.transforms.Compose([
    torchvision.transforms.TrivialAugmentWide(),
    vit_weights.transforms()
])

# Replace the classification head so outputs match the number of bird classes;
# moving it to `device` keeps the whole model on one device for training.
vit_b16_model.heads = nn.Sequential(
    nn.Linear(in_features=768, out_features=len(class_names))
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=vit_b16_model.parameters(), lr=0.001)

torch.manual_seed(42)
torch.cuda.manual_seed(42)
vit_b16_results = train(
    model=vit_b16_model,
    train_dataloader=vit_b16_train_dataloader,
    valid_dataloader=vit_b16_valid_dataloader,
    loss_fn=loss_fn,
    optimizer=optimizer,
    epochs=5,
    device=device
)
APPENDIX 4
#ResNet50
import tensorflow as tf

image_size = 224  # assumed input size; not specified in the original snippet
r50 = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                     input_shape=(image_size, image_size, 3))
APPENDIX 5
#Visualization
import matplotlib.pyplot as plt

def plot_loss_curves(model_results):
    """
    Plots the model's loss and accuracy curves
    """
    train_loss = model_results["train_loss"]
    train_acc = model_results["train_acc"]
    valid_loss = model_results["valid_loss"]
    valid_acc = model_results["valid_acc"]
    epochs = range(len(train_loss))

    plt.figure(figsize=(15, 7))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, train_loss, label="Train Loss")
    plt.plot(epochs, valid_loss, label="Validation Loss")
    plt.title("Loss")
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(epochs, train_acc, label="Train Accuracy")
    plt.plot(epochs, valid_acc, label="Validation Accuracy")
    plt.title("Accuracy")
    plt.legend()
    plt.suptitle("Loss and Accuracy Curves", fontsize=26)