VISION TRANSFORMER (ViT) MODEL FOR BIRDS CLASSIFICATION
A PROJECT REPORT
Submitted by
JAYARAMAN P (913120104040)
MATHI ALAGAN T (913120104051)
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
Certified that this project report “VISION TRANSFORMER (VIT) MODEL FOR
BIRDS CLASSIFICATION” is the bonafide work of “JAYARAMAN P (913120104040),
MATHI ALAGAN T (913120104051)” of VIII Semester B.E Computer Science and
Engineering who carried out the project work under my supervision.
SIGNATURE SIGNATURE
ACKNOWLEDGEMENT
We thank the almighty for giving us moral strength to work on the project for the past few months.
We express our sincere thanks to our respected Principal, Dr. P. ALLI, for providing the
facilities needed to carry out this project work.
We convey our thanks to our guide, Dr. G. Vinoth Chakkaravarthy, Professor, Department
of Computer Science and Engineering, for his/her innovative suggestions and valuable guidance.
We also wish to extend our sincere gratitude to all faculty members of the Department of
Computer Science and Engineering for their valuable guidance throughout the course of our project.
We also thank our parents and friends who provided moral and physical support.
ABSTRACT
Bird classification stands as a formidable challenge amidst the intricate tapestry
of avian diversity, where manual identification often becomes a laborious
endeavor hindered by the sheer breadth of species and the subtle nuances that
distinguish them. In this study, we compare two deep learning models,
EfficientNet-B2 and the Vision Transformer (ViT) B-16, on an extensive corpus
of bird imagery. ViT-B16 excels in capturing intricate features in bird images
and outperforms EfficientNet-B2 on unseen data. Integrated into the
Gadio web app, ViT-B16 provides users with a seamless platform for accurate
bird identification, supporting engagement in birdwatching and
biodiversity conservation.
TABLE OF CONTENTS
S.NO. CONTENT PAGE NUMBER
1. Introduction 1-2
1.1. Overview 1
1.2. Objective 2
2. Literature Survey 3-4
7.2. Output Design 18
8. System Architecture 20
8.2. Algorithm 20
9.4.1. EfficientNetB2
9.5. Reshaping
9.6. Training 27
9.7. Testing 28
9.8. Visualization 29
10. Result 30
10.1. EfficientNetB2 Evaluation
11. Conclusion 31
Appendix 1 33
Appendix 2 34
Appendix 3 35
Appendix 4 36
Appendix 5 36
LIST OF TABLES
TABLE.NO TABLE PAGE NUMBER
LIST OF FIGURES
FIGURE NO. FIGURE PAGE NUMBER
LIST OF ABBREVIATIONS
CNN - Convolutional Neural Network
ViT - Vision Transformer
1. INTRODUCTION
1.1. OVERVIEW
Bird classification is a formidable challenge amidst the diversity of avian
species: manual identification is often a laborious endeavor, hindered by the
sheer breadth of species and the subtle nuances that distinguish them. Modern
machine learning offers a path towards automation and efficiency. In this study,
we present "Gadio," a web application crafted to navigate the complexities of
bird classification using deep learning models. Our investigation is a
comparative analysis of two prominent models, EfficientNet-B2 and the Vision
Transformer (ViT) B-16, on the task of bird classification. Through
experimentation and evaluation on an extensive corpus of avian imagery, we
scrutinize the performance of both models and draw insights into their efficacy
in discerning avian species from unseen data. Our findings show that the
ViT-B16 model outperforms its EfficientNet-B2 counterpart, a result we
attribute to its transformer architecture: with its ability to capture
spatial dependencies and discern fine-grained features within bird imagery,
ViT-B16 emerges as a strong contender for bird classification.
Beyond this comparative study, we integrate the ViT-B16 model into the Gadio
web application, giving users an intuitive platform for bird identification.
Through Gadio's interface, enthusiasts, researchers, and conservationists alike
can upload images and receive accurate predictions of bird species, catalyzing
engagement in birdwatching and advancing wildlife conservation. In doing so,
we illustrate the potential of machine learning to democratize access to bird
identification tools, fostering a synergy between human ingenuity and the
preservation of avian biodiversity.
1.2. OBJECTIVE
The primary objective of this project is to compare different deep learning
models and identify the most accurate one for bird classification.
2. LITERATURE SURVEY
Recent research efforts have led to significant advancements in bird
image classification, with various innovative methodologies aimed at enhancing
accuracy and efficiency.
to enhance accuracy even on large datasets. When using a transfer learning
model, one of the sub-models may incur more loss than the others, so it is
wise to pick the most accurate model.
3. SYSTEM STUDY
3.1. FEASIBILITY STUDY
The feasibility study for implementing deep learning models in bird
classification involves assessing:
• Economic Feasibility
• Technical Feasibility
• Social Feasibility
Potential Cost Reduction Strategies: Exploring potential avenues for
cost reduction while maintaining optimal service quality is paramount.
The proposed deep learning system is meticulously designed with cost
considerations in mind, aiming to streamline operations, reduce manual
labor, and ultimately lower overall project costs while enhancing
classification accuracy and conservation efforts.
• Considering the technical feasibility of implementing advanced
techniques in bird image classification, what are the critical factors
that need to be addressed to ensure the successful deployment of such
systems in research settings?
4. SYSTEM ANALYSIS
Existing Method | Accuracy (%) | Strength of the Method | Limitation
Convolutional Neural Network (CNN) | 86.5 | Accuracy-to-loss ratio is lower | Low accuracy on larger datasets
ResNet152 | 85 | Extracts features by the detection of parts | Difficult to measure accuracy
Bilinear model | 84.1 | Saves computing time greatly | Requires manual elimination of noisy images on large datasets
indicate imbalances in learning or inconsistencies in the training process.
Addressing these discrepancies requires careful monitoring and adjustment
of training parameters to ensure that all parts of the model converge
effectively towards the desired objective.
Accuracy Measurement Complexity: Measuring accuracy reliably on
larger datasets presents several challenges. With a vast number of images to
evaluate, traditional evaluation methodologies may be insufficient to capture
the nuanced performance of the classification system. Robust evaluation
techniques and extensive computational resources are required to analyze
accuracy comprehensively and account for variations across different subsets
of the dataset.
Implement the proposed ViT model using deep learning frameworks such
as PyTorch or TensorFlow. Train the model on the preprocessed bird dataset,
optimizing hyperparameters and regularization techniques to enhance
performance and generalization.
images. Experiment with self-attention mechanisms to improve feature
representation and classification accuracy.
Evaluate the performance of the ViT model using standard metrics such
as accuracy, precision, recall, and F1-score. Conduct comparative analyses
against traditional CNN-based approaches to assess the ViT model's
effectiveness in bird classification tasks.
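As an illustration, these metrics could be computed with scikit-learn; the sketch below is an assumption about how the evaluation might be wired up, with y_true and y_pred holding the ground-truth and predicted class indices for the test set:
# Hypothetical evaluation sketch using scikit-learn (names are illustrative)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_predictions(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1-score."""
    accuracy = accuracy_score(y_true, y_pred)
    # Macro averaging weights all bird classes equally, which suits a
    # fine-grained task with many species.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}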
Validation:
classification across diverse bird species and environmental conditions
without the need for extensive manual feature engineering.
Attention Mechanisms for Contextual Understanding: By leveraging
attention mechanisms, ViT models can dynamically focus on relevant
regions of interest within bird images, enhancing contextual
understanding and improving classification performance, particularly in
scenarios with complex backgrounds or occlusions.
Efficient Resource Utilization: Vision Transformer models offer
efficient resource utilization by eliminating the need for computationally
expensive convolutional operations, leading to faster inference times and
reduced computational overhead, making them suitable for deployment in
resource-constrained environments.
5. SYSTEM SPECIFICATION
The system specifications serve as a blueprint for configuring the optimal
hardware and software environment to facilitate the development and
deployment of the bird image classification system.
NVIDIA GPUs with CUDA support, ideally with at least 8GB of VRAM
such as GeForce GTX 1080 Ti or higher, are preferred for accelerating deep
learning computations.
A minimum of 16GB of RAM is necessary, although larger datasets and
more complex models may require 32GB or more to prevent memory
constraints.
Storage:
Fast and ample storage, preferably Solid-State Drives (SSDs) with at least
500GB capacity, is essential for storing datasets, model checkpoints, and
intermediate results.
Additional Considerations:
Python is the preferred programming language for deep learning and data
science tasks due to its simplicity, versatility, and rich ecosystem of libraries. It
serves as the primary language for implementing the project's codebase and
integrating various software components.
Data Preprocessing Libraries:
Libraries such as NumPy and Pandas are essential for data preprocessing
tasks, including data manipulation, transformation, and normalization. These
libraries enable efficient handling and processing of the bird image
dataset before model training.
Image Processing Libraries:
Image processing libraries like OpenCV are useful for performing image
augmentation, preprocessing, and manipulation tasks. They provide
functionalities for resizing, cropping, and applying filters to images, enhancing
data quality and model robustness.
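For instance, a minimal OpenCV preprocessing pass could look like the following sketch, where the file path and filter parameters are placeholders rather than the project's actual values:
# Illustrative OpenCV preprocessing (path and parameters are placeholders)
import cv2

image = cv2.imread("bird.jpg")                # load an image in BGR format
image = cv2.resize(image, (224, 224))         # resize to the model input size
image = cv2.GaussianBlur(image, (3, 3), 0)    # light denoising filter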
Model Evaluation and Visualization Tools:
Deployment Environment:
Depending on deployment requirements, containerization technologies
like Docker may be used to package the application along with its dependencies
into a portable container. Additionally, cloud platforms such as AWS, Google
Cloud Platform, or Microsoft Azure can be utilized for deploying and hosting
the classification system in production environments.
"b16" typically representing the size of the model, specifying the number
of layers and other architectural parameters.
The model was then trained for a total of 5 epochs, allowing it to learn from
the augmented data and optimize its parameters over multiple iterations.
This iterative training process helped improve the model's accuracy and
convergence, ultimately leading to better performance on unseen data.
Categorical crossentropy loss, on the other hand, measures the dissimilarity
between the true distribution of the data and the predicted distribution,
serving as a key optimization objective during training.
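Concretely, for a one-hot true label distribution $y$ and a predicted probability distribution $\hat{y}$ over $C$ classes, the categorical crossentropy loss is

\[ L(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c \]

which reduces to the negative log-probability the model assigns to the correct class; this is the objective implemented by nn.CrossEntropyLoss in the appendix code.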
7. SYSTEM DESIGN
Data Acquisition:
The input system initiates with the collection of bird images sourced from
various repositories, wildlife databases, and research platforms dedicated to
ornithology and wildlife conservation.
Images may be obtained in common formats such as JPEG or PNG from
wildlife cameras, birdwatching enthusiasts, or research studies.
Data Preprocessing:
Preprocessing techniques are applied to standardize and enhance the quality
of the input data before feeding it into the classification model.
Augmentation:
Image augmentation methods are employed to increase dataset diversity and
improve the model's ability to generalize to unseen bird image variations.
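As a sketch of what such augmentation could look like with torchvision, the pipeline below is illustrative; the specific transforms and parameter values are assumptions, not the project's exact pipeline (the appendices use TrivialAugmentWide):
# Illustrative torchvision augmentation pipeline (parameters are assumptions)
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # birds may face either way
    transforms.RandomRotation(degrees=15),                 # small pose variations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # lighting changes
    transforms.Resize((224, 224)),                         # standard input size
    transforms.ToTensor(),
])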
Data Splitting:
The dataset is partitioned into distinct subsets for training, validation, and
testing purposes.
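A minimal sketch of such a three-way split with PyTorch follows; the 70/15/15 ratios and the helper name are assumptions, since the report does not state the exact proportions:
# Illustrative train/validation/test split (ratios are assumptions)
import torch
from torch.utils.data import random_split

def split_dataset(dataset, train_frac=0.7, valid_frac=0.15, seed=42):
    """Partition a dataset into train, validation, and test subsets."""
    n = len(dataset)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    n_test = n - n_train - n_valid   # remainder forms the test set
    return random_split(dataset, [n_train, n_valid, n_test],
                        generator=torch.Generator().manual_seed(seed))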
Bird images are fed into the model layers, comprising attention
mechanisms and classification heads, to generate output logits
representing the model's confidence scores for each bird species.
Epoch:
Epochs are incorporated into the training pipeline to dynamically adjust
parameters such as the learning rate and the augmentation
intensity during model training.
This iterative process optimizes the data pipeline and enhances model
accuracy over successive training iterations.
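For example, a learning-rate schedule that decays across epochs could be attached to the optimizer; in the sketch below the stand-in model, step size, and decay factor are all assumptions:
# Illustrative epoch-wise learning-rate schedule (values are assumptions)
import torch
from torch import nn

model = nn.Linear(10, 2)  # stand-in for the classification model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Halve the learning rate every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

for epoch in range(5):
    # ... one epoch of training and validation would run here ...
    scheduler.step()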
Overall, the input system design ensures efficient processing and preparation of
bird images as input for the classification model, ultimately contributing to
accurate and reliable bird species identification outcomes.
Classification Prediction:
Following the processing of bird images through the classification
model, the system produces predictions indicating the likelihood of
each bird species present in the input image.
Class Labels:
The output includes class labels corresponding to the predicted bird
species, representing the 510 classes available in the classification model.
Each image is assigned a class label based on the highest probability
score among the output logits generated by the model.
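A minimal sketch of mapping output logits to a class label and confidence score is given below; it assumes logits is the model output for one image and class_names lists the species, both names being illustrative:
# Illustrative logits-to-label mapping (names are assumptions)
import torch

def predict_label(logits, class_names):
    """Return the most probable species name and its confidence score."""
    probs = torch.softmax(logits, dim=-1)        # logits -> probabilities
    confidence, idx = torch.max(probs, dim=-1)   # highest-probability class
    return class_names[int(idx)], float(confidence)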
Confidence Scores:
Alongside class labels, the system provides confidence scores or
probabilities indicating the model's certainty in its predictions for each
bird species.
These confidence scores quantify the model's confidence level for each
predicted class, enabling users to assess the reliability of the classification
results.
Visualization:
The system offers visualizations of the classification results to aid
interpretation and analysis by end-users, enhancing the user experience.
Visualization techniques may include displaying probability distribution
plots or generating visual summaries showcasing the predicted bird
species and their corresponding confidence scores.
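One plausible form of such a visualization, sketched below, is a bar chart of the top-5 predicted species and their probabilities; the function and variable names are illustrative, not taken from the project code:
# Illustrative top-k prediction plot (names and k are assumptions)
import matplotlib.pyplot as plt
import torch

def plot_top_predictions(probs, class_names, k=5):
    """Bar chart of the k most probable bird species for one image."""
    top_probs, top_idx = torch.topk(probs, k)
    labels = [class_names[int(i)] for i in top_idx]
    plt.barh(labels[::-1], top_probs.tolist()[::-1])
    plt.xlabel("Predicted probability")
    plt.title("Top predicted bird species")
    plt.show()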
8. SYSTEM ARCHITECTURE
8.2. ALGORITHM
1. Start by loading the bird image dataset.
2. Preprocess the images by resizing them to a standard size and normalizing the pixel values.
3. Initialize data augmentation techniques to increase the diversity of the dataset.
4. Generate an augmented dataset using the initialized augmentation parameters.
5. Configure the model architecture by combining ResNet50 with LSTM and dense layers (see the sketch after this list).
6. Define augmentation parameters such as rotation, shift, shear, zoom, and horizontal flip.
7. Initialize the pre-trained ResNet50 architecture for feature extraction.
8. Predict the dataset with the pre-trained model to extract features.
9. Initialize the LSTM layer for sequential analysis of features.
10. Run the feature extraction dataset through the LSTM model.
11. Monitor the training progress to track accuracy and loss.
12. Optimize hyperparameters such as learning rate, batch size, and epochs if necessary.
13. Evaluate the model's performance on the testing set using accuracy, precision, recall, and F1-score metrics.
14. Save the trained model for future use.
15. End.
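The following Keras sketch illustrates steps 5-10 under assumed shapes; the LSTM width, the reshaping of ResNet50 feature maps into a 49-step sequence, and the class count are illustrative choices rather than the report's exact configuration:
# Illustrative ResNet50 + LSTM hybrid (layer sizes are assumptions)
import tensorflow as tf

image_size, num_classes = 224, 510
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(image_size, image_size, 3))
base.trainable = False  # use ResNet50 purely as a feature extractor

model = tf.keras.Sequential([
    base,
    # Treat the 7x7 grid of 2048-dim feature vectors as a 49-step sequence.
    tf.keras.layers.Reshape((49, 2048)),
    tf.keras.layers.LSTM(256),                # sequential analysis of features
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])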
9. SYSTEM IMPLEMENTATION
Sample data:
Fig. 9.1.1. Sample Data
The pixel values of the bird images are rescaled to enhance model
convergence and performance. The rescale parameter is set to 1./255, indicating
that each pixel value in the image will be divided by 255. This operation scales
down the pixel intensities to a range between 0 and 1, ensuring that pixel values
are within a consistent and standardized range across all images in the dataset.
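In Keras, this corresponds to the rescale argument of ImageDataGenerator; the sketch below assumes a directory-based dataset layout, and the path is a placeholder:
# Illustrative pixel rescaling with Keras (path is a placeholder)
import tensorflow as tf

# Divide every pixel value by 255 so intensities fall in [0, 1].
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)
train_data = datagen.flow_from_directory("data/train",
                                         target_size=(224, 224),
                                         class_mode="categorical")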
the optimization algorithm can more effectively navigate the parameter space,
leading to faster convergence and better generalization performance.
Size of Training Data
In the bird classification project, the size of the training data is an
essential aspect that directly influences the performance and
generalization capabilities of the deep learning model. By ensuring an
adequate amount of training data, the model can learn diverse patterns
and features present in bird images, leading to more accurate and robust
classification results. A sufficient size of the training data is crucial to
capture the variability and complexity inherent in bird images across
different species, poses, environments, and lighting conditions. With a
larger training dataset, the model can better generalize to unseen bird
images and exhibit improved performance when deployed in real-world
scenarios. Moreover, a substantial training dataset helps mitigate the risk
of overfitting, where the model memorizes the training data instead of
learning underlying patterns. By exposing the model to a diverse range of
bird images during training, overfitting can be minimized, resulting in a
more reliable and effective classification model. Therefore, it is
recommended to ensure a sizable training dataset comprising a sufficient
number of bird images representing various species and scenarios. This
approach enhances the model's ability to learn discriminative features and
patterns, ultimately leading to better classification performance in bird
species identification tasks.
Training Set:
The training set comprises a majority portion of the dataset and is used to
train the machine learning model. The model learns from the patterns present in
the training data, adjusting its parameters during training to minimize the
prediction error.
Validation Set:
The validation set is used during training to tune hyperparameters and to
monitor the model's performance on data it has not been trained on, helping
to detect overfitting before the final evaluation.
Test Set:
The test set is a completely independent portion of the dataset that is not
used during model training or hyperparameter tuning. It is reserved for the final
evaluation of the trained model's performance and provides an unbiased
estimate of its generalization ability. Evaluating the model on the test set helps
assess its performance in real-world scenarios and provides insights into its
effectiveness for the intended task.
9.4. FEATURE EXTRACTION AND CLASSIFICATION
9.4.1. EfficientNetB2
EfficientNetB2 is a variant of the EfficientNet family of convolutional
neural network (CNN) architectures, known for its balance between model size
and performance. It is specifically designed to be efficient in terms of
computational resources while achieving competitive accuracy on image
classification tasks.
Architecture:
EfficientNetB2 follows a compound scaling method that
jointly scales the depth, width, and resolution of the network. It consists of
multiple layers of convolutional and pooling operations, with a focus on
minimizing computational overhead while maximizing feature extraction
capabilities. The architecture of EfficientNetB2 is designed to strike a
balance between model complexity and computational cost, making it
suitable for resource-constrained environments.
Efficiency:
EfficientNetB2 achieves high efficiency by carefully balancing
model complexity with computational cost. It introduces novel techniques
such as compound scaling and efficient building blocks to achieve state-
of-the-art performance on various computer vision tasks while using
fewer parameters and computational resources compared to larger
models.
Feature Extraction:
EfficientNetB2 can effectively extract hierarchical features from bird images,
enabling accurate classification of bird species.
Architecture:
Feature Extraction:
Efficiency
9.5. RESHAPING
EfficientNetB2:
9.6. TRAINING
In the training and validation process with an epoch of 5, a hybrid model is
iteratively trained on a labeled training dataset for 5 epochs while
simultaneously evaluated on a separate validation dataset. Each epoch involves
processing batches of training examples and adjusting the model's parameters
using optimization algorithms. Performance metrics, such as loss and accuracy,
are computed on both the training and validation data to monitor learning
progress and detect overfitting. Early stopping may be employed to halt training
if the model's performance on the validation dataset stagnates or degrades. This
iterative approach aims to optimize the model's parameters and ensure its
robustness and generalization to unseen data.
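The train function called in the appendices is not reproduced in this report; the sketch below shows one plausible shape for it, with a simple early-stopping rule whose patience value is an assumption (the report's actual function also records per-epoch metrics for plotting):
# Plausible train/validate loop with early stopping (patience is an assumption)
import torch

def train(model, train_dataloader, valid_dataloader, loss_fn, optimizer,
          epochs, device, patience=2):
    """Train for up to `epochs` epochs, stopping early if validation stalls."""
    best_valid_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(epochs):
        model.train()
        for X, y in train_dataloader:
            X, y = X.to(device), y.to(device)
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for X, y in valid_dataloader:
                X, y = X.to(device), y.to(device)
                valid_loss += loss_fn(model(X), y).item()
        valid_loss /= len(valid_dataloader)
        if valid_loss < best_valid_loss:
            best_valid_loss, epochs_without_improvement = valid_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving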
9.7. TESTING
9.7.2. TESTING OF VISION TRANSFORMER
Accuracy: 0.9201
9.8. VISUALIZATION
Fig. 9.9.2.1. Training Curve of Vision Transformer
Training accuracy: 0.8826
Valid loss: 0.4478
Valid accuracy: 0.8980
Test accuracy: 0.9201
The comparison between the Vision Transformer (ViT) model with b16
architecture and the EfficientNetB2 model reveals that ViT_b16 demonstrates
superior performance across all metrics. ViT_b16 achieves lower training and
validation losses, higher training and validation accuracies, and ultimately,
higher test accuracy compared to EffNetB2. This indicates ViT_b16's better
convergence during training, superior ability to capture training data patterns,
and stronger generalization to unseen data. Moreover, ViT_b16 achieves these
results with slightly lower computational resources, showcasing its efficiency.
Overall, the ViT model with b16 architecture emerges as the preferred choice
for bird species classification tasks due to its outstanding performance and
efficiency.
11. CONCLUSION
In conclusion, the development of the "Gadio" web application
incorporating the Vision Transformer (ViT) B-16 model marks a significant
stride towards automating bird classification processes. Through rigorous
comparison with the EfficientNet-B2 model, our study highlights the superior
performance of ViT-B16, particularly in discerning intricate features within bird
images. Leveraging its innovative transformer architecture, ViT-B16 not only
enhances accuracy but also streamlines the bird identification experience for
users. This advancement holds immense potential in democratizing access to
birdwatching tools and bolstering conservation efforts.
By offering an accessible platform for bird identification, "Gadio" facilitates greater engagement in birdwatching
activities and encourages participation in conservation initiatives. This
democratization of access to bird classification tools empowers individuals and
organizations to contribute meaningfully to our understanding and protection of
avian diversity.
Transferability Across Domains:
Investigating the transferability of pre-trained ViT models across
different domains, such as wildlife photography or ornithological
research, can extend their utility beyond traditional bird classification tasks.
Efficient Training Strategies:
Developing efficient training strategies, such as curriculum
learning or self-supervised learning, can expedite the training process of
ViT models while maintaining high classification accuracy, making them
more practical for real-world applications.
Interactive Learning Interfaces:
Designing interactive learning interfaces that leverage ViT models
can empower citizen scientists and birdwatchers to contribute to bird
classification efforts by providing real-time feedback and annotations.
Robustness to Environmental Variability:
Enhancing the robustness of ViT models to environmental
variability, such as changes in lighting conditions or background clutter,
can improve their performance in challenging field conditions where bird
images may exhibit significant variations.
Ethical Considerations and Bias Mitigation:
Addressing ethical concerns, such as fairness and bias mitigation,
is essential to ensure the responsible deployment of ViT models in bird
classification systems, particularly in contexts where biases may
inadvertently influence model predictions.
APPENDICES
APPENDIX 1
#Data Loader Function:
import torch
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision import datasets

# The opening of the original definition was lost in extraction; the function
# name and the `path` and `split` parameters below are reconstructed assumptions.
def create_dataloader(path: str,
                      split: bool = False,
                      split_size: float = 0.2,
                      transform: transforms.Compose = None,
                      batch_size: int = 32,
                      shuffle: bool = False,
                      num_workers: int = 1,
                      return_classes: bool = False):
    """Creates a dataset and converts it into a DataLoader."""
    if transform is None:
        transform = transforms.Compose([
            transforms.Resize(size=(224, 224)),
            transforms.ToTensor()
        ])
    dataset = datasets.ImageFolder(path,
                                   transform=transform,
                                   target_transform=None)
    classes = dataset.classes
    # Making a split
    if split:
        length = int(len(dataset) * split_size)
        rem_length = len(dataset) - length
        dataset, _ = torch.utils.data.random_split(
            dataset=dataset,
            lengths=[length, rem_length],
            generator=torch.manual_seed(42))
    dataloader = DataLoader(dataset=dataset,
                            batch_size=batch_size,
                            shuffle=shuffle,
                            num_workers=num_workers,
                            pin_memory=True)
    if return_classes:
        return dataloader, classes
    return dataloader
APPENDIX 2
#EfficientNet B2
import torch
import torchvision
from torch import nn

# Assumes `device`, `class_names`, the `train` function, and the dataloaders
# are defined elsewhere (see Appendix 1 and Section 9.6).
effnet_weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT
effnetb2_model = torchvision.models.efficientnet_b2(
    weights=effnet_weights).to(device)

eff_net_transforms_with_data_augmentation = torchvision.transforms.Compose([
    torchvision.transforms.TrivialAugmentWide(),
    effnet_weights.transforms()
])

# Replace the classifier head so outputs match the number of bird classes.
effnetb2_model.classifier = nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1408, out_features=len(class_names))
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=effnetb2_model.parameters(), lr=0.001)

torch.manual_seed(42)
torch.cuda.manual_seed(42)
effnetb2_results = train(
    model=effnetb2_model,
    train_dataloader=effnetb2_train_dataloader,
    valid_dataloader=effnetb2_valid_dataloader,
    loss_fn=loss_fn,
    optimizer=optimizer,
    epochs=5,
    device=device
)
APPENDIX 3
#Vision Transformer
import torch
import torchvision
from torch import nn

# Assumes `device`, `class_names`, `train`, and the dataloaders are defined
# elsewhere (see Appendices 1 and 2).
vit_weights = torchvision.models.ViT_B_16_Weights.DEFAULT
vit_b16_model = torchvision.models.vit_b_16(weights=vit_weights).to(device)

vit_transforms_with_data_augmentation = torchvision.transforms.Compose([
    torchvision.transforms.TrivialAugmentWide(),
    vit_weights.transforms()
])

# Replace the classification head so outputs match the number of bird classes;
# moving it to `device` keeps the whole model on one device for training.
vit_b16_model.heads = nn.Sequential(
    nn.Linear(in_features=768, out_features=len(class_names))
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=vit_b16_model.parameters(), lr=0.001)

torch.manual_seed(42)
torch.cuda.manual_seed(42)
vit_b16_results = train(
    model=vit_b16_model,
    train_dataloader=vit_b16_train_dataloader,
    valid_dataloader=vit_b16_valid_dataloader,
    loss_fn=loss_fn,
    optimizer=optimizer,
    epochs=5,
    device=device
)
APPENDIX 4
#ResNet50
import tensorflow as tf

image_size = 224  # assumed input size; not specified in the original snippet
r50 = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                     input_shape=(image_size, image_size, 3))
APPENDIX 5
#Visualization
import matplotlib.pyplot as plt

def plot_loss_curves(model_results):
    """
    Plots the model's loss and accuracy curves
    """
    train_loss = model_results["train_loss"]
    train_acc = model_results["train_acc"]
    valid_loss = model_results["valid_loss"]
    valid_acc = model_results["valid_acc"]
    epochs = range(len(train_loss))

    plt.figure(figsize=(15, 7))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, train_loss, label="Train Loss")
    plt.plot(epochs, valid_loss, label="Validation Loss")
    plt.title("Loss")
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(epochs, train_acc, label="Train Accuracy")
    plt.plot(epochs, valid_acc, label="Validation Accuracy")
    plt.title("Accuracy")
    plt.legend()
    plt.suptitle("Loss and Accuracy Curves", fontsize=26)