Airline Passenger Data Analysis


Airline Passengers Satisfaction Data Analysis

Domain Problem:
Customer satisfaction is one of the most important factors for modern
businesses because it can contribute significantly to service quality improvement. To meet
customer expectations and deliver higher-quality services, airlines use different kinds of surveys
to collect feedback about passengers' experience during their travel. This project analyzes
US airline survey data from Kaggle to find insights that can improve airline service quality.
In the survey, passengers were asked to rate their experience of different aspects of the airline,
such as wifi service, food and drink service, and cleanliness. Based on this feedback and the
insights found using Python and statistical techniques, the airline can understand which
services it should improve for a better passenger experience.

Motivation:
The motivation behind this project is to explore the US airline passenger
feedback data and extract useful information from it, such as which factors contribute most
to passenger satisfaction and which factors should be improved for better service quality and
customer satisfaction. These insights and visualizations will help airlines make decisions that
enhance their business. Another motive behind this project is to learn, explore, and apply
different statistical, data analysis, and data visualization techniques to the data in hand.

The data exploration part poses several challenges. The first is data quality: the data must
be of good quality for useful information to be found in it. The second is dirty data: whether
the data in hand has missing or invalid values and, if it does, how to clean it and which
method should be used to fill the missing values.
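
The missing-value check and one possible fill strategy can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual code; median imputation is an assumption, chosen only as one common option for a skewed numeric column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
df = pd.DataFrame({
    "Arrival Delay in Minutes": [0.0, 12.0, np.nan, 5.0, np.nan, 40.0],
    "Age": [25, 41, 37, 58, 33, 46],
})

# Count missing values per column to see where cleaning is needed.
missing_counts = df.isna().sum()
print(missing_counts)

# One possible fix (an assumption, not the project's confirmed method):
# fill the missing numeric values with the column median.
median_delay = df["Arrival Delay in Minutes"].median()
df_filled = df.fillna({"Arrival Delay in Minutes": median_delay})
print(df_filled)
```

The median is less sensitive to a few very long delays than the mean, which is why it is often preferred for this kind of column. Dropping the affected rows is another option when only a small fraction of rows is incomplete.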

US Airline Dataset:
The US airline dataset was obtained from Kaggle. The link to the dataset is given below:

https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv

The dataset has a total of 23 features. It records information about each passenger, such as
age, gender, type of travel, distance traveled, and class type. The name and details
of each feature are given below:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level of the convenience of departure/arrival time
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)

The dataset has a total of 103904 examples and 24 attributes. There are 310 missing values in the
Arrival Delay in Minutes column.

Data Exploration:
There are multiple techniques for exploring the data and finding interesting
insights in it. Some of them are unique value counts, frequency counts, histograms, bar plots,
and scatter plots. I used different Python and statistical techniques for exploratory data
analysis. The insights I found are given below.

Satisfaction Distribution by Type of Travel:
The first analysis is about the distribution of
satisfaction by type of travel. The bar chart representing this distribution is shown
below:

From this chart it is evident that most of the passengers with the Personal Travel type are neutral
or dissatisfied with the airline, while most of the passengers with the Business Travel type are
satisfied with the services of the airline.
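
The counts behind a grouped bar chart like this one can be computed with a cross-tabulation. The sketch below uses made-up rows; the lowercase column name `satisfaction` and the label strings are assumptions about how the CSV spells them.

```python
import pandas as pd

# Toy rows standing in for the survey data; the values are made up.
df = pd.DataFrame({
    "Type of Travel": ["Personal Travel"] * 4 + ["Business Travel"] * 4,
    "satisfaction": [
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "satisfied",
        "satisfied", "satisfied", "satisfied", "neutral or dissatisfied",
    ],
})

# Raw counts per (travel type, satisfaction) pair.
counts = pd.crosstab(df["Type of Travel"], df["satisfaction"])

# Row-wise shares make groups of different sizes comparable.
shares = pd.crosstab(df["Type of Travel"], df["satisfaction"], normalize="index")

print(counts)
print(shares)
# counts.plot(kind="bar") would draw the grouped bar chart if matplotlib is installed.
```

The same pattern applies to any categorical feature, for example Class or Customer Type, by swapping the first argument.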

Distribution of Satisfaction by Class Type:
The second analysis is about the distribution of
satisfaction by passengers from different class types. The bar plot for this analysis is shown
below:

From this bar chart I observed that most of the passengers from Business class are satisfied,
while most of the passengers from Economy and Economy Plus class are neutral or dissatisfied
with the airline.

Satisfaction Distribution by Age:
This analysis is about the distribution of satisfaction by age. I
used a box plot to represent this analysis. The box plot for this analysis is shown below:

From this plot it is evident that the age of satisfied passengers is slightly higher than
the age of neutral or dissatisfied passengers. The average age of satisfied passengers
is 41.75, compared with 37.57 for neutral or dissatisfied passengers.

Distribution of Satisfaction by Flight Distance:
In this analysis I tried to find the
relation between passenger satisfaction and flight distance. The plot representing
this information is shown below:

From this plot I observed that passengers are more satisfied with longer flights
than with shorter flights. The average distance of flights with satisfied
passengers is 1530.14, compared with 928.92 for flights with neutral or dissatisfied
passengers.
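
Group averages such as the ones quoted for age and flight distance can be obtained with a group-by. This is a minimal sketch on made-up rows, so its numbers do not match the report; the column names `Flight Distance` and `satisfaction` are assumptions about the CSV headers.

```python
import pandas as pd

# Toy rows; the values are made up and do not reproduce the report's averages.
df = pd.DataFrame({
    "Age": [40, 44, 30, 36],
    "Flight Distance": [1500, 1700, 800, 1000],
    "satisfaction": [
        "satisfied", "satisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
    ],
})

# Mean of each numeric column within each satisfaction group.
group_means = df.groupby("satisfaction")[["Age", "Flight Distance"]].mean()
print(group_means)
```

Replacing `.mean()` with `.median()` or `.describe()` gives the other summary statistics that a box plot displays.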

Distribution of Satisfaction by Inflight Wifi Service:
This analysis is about the distribution of
overall satisfaction by the satisfaction level of the inflight wifi service. The bar chart
representing this information is shown below:

From the above bar chart it is evident that if the wifi service is not applicable, then all the
passengers are satisfied, as shown by the bar for the value 0. Most of the passengers are neutral
or dissatisfied if the satisfaction level of the inflight wifi service ranges from 1 to 3. For
satisfaction level 4, most passengers are satisfied, but a considerable number of
passengers are neutral or dissatisfied. For satisfaction level 5, almost all the passengers
are satisfied.

Distribution of Satisfaction by Food and Drink Service:
This analysis is about the
distribution of satisfaction for different levels of the food and drink service (0-5). The bar
chart representing this information is shown below:

From this chart I observed that there are a few flights with no food and drink service, represented
by 0 in the bar chart. All the passengers on those flights are dissatisfied, since there is no food
and drink service. Most of the passengers are neutral or dissatisfied when the satisfaction level
of food and drink is from 1 to 4. A significant number of passengers are neutral or dissatisfied
even if the satisfaction level for food and drink is 4 or 5, although not as many as the passengers
who are satisfied.

Flight Distance Distribution:
This analysis is about the distribution of flight distance,
with different colors for satisfied and neutral or dissatisfied passengers. The
histogram representing this information is shown below:

From the above histogram I observed that most of the passengers are neutral or dissatisfied if the
flight distance ranges from 0 to 1500, and most of the passengers are satisfied if the flight
distance ranges from 1500 to 4000.
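
The comparison between the two distance ranges can also be checked numerically by binning flight distance and computing the share of satisfied passengers per bin. The sketch below uses made-up rows; the bin edges follow the ranges mentioned above, and the column names are assumptions about the CSV headers.

```python
import pandas as pd

# Toy rows; the values are made up for illustration.
df = pd.DataFrame({
    "Flight Distance": [300, 900, 1200, 1400, 2000, 2500, 3000, 3600],
    "satisfaction": ["neutral or dissatisfied"] * 3
                    + ["satisfied"] * 4
                    + ["neutral or dissatisfied"],
})

# Assign each flight to a distance band: (0, 1500] or (1500, 4000].
df["distance_band"] = pd.cut(df["Flight Distance"], bins=[0, 1500, 4000])

# Share of satisfied passengers within each band.
df["is_satisfied"] = df["satisfaction"].eq("satisfied")
satisfied_share = df.groupby("distance_band", observed=True)["is_satisfied"].mean()
print(satisfied_share)
```

Using more, narrower bins would show where along the distance axis the majority changes, which is what the colored histogram displays visually.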

Passengers Age Distribution:
This analysis is about the distribution of age of all passengers,
with different colors for satisfied and neutral or dissatisfied passengers. The
histogram representing this information is shown below:

From the above histogram it is evident that most passengers aged between 40 and 60 are
satisfied with the airline services, in contrast to passengers who are younger than 40 or
older than 60.

Use of One hot Encoding and PCA Dimensionality Reduction:
One hot encoding
is used for converting categorical data into numerical data, and Principal Component
Analysis (PCA) is used for reducing the dimension of larger datasets with little loss of
information from the original data. The details about the use of these techniques are given below.

One hot Encoding:
Datasets often contain categorical variables, whose values are labels rather than
numbers. For example, in our dataset there is a variable Class that represents
the travel class of the passengers on the plane. This variable has three different values: Business,
Eco, and Eco Plus. Most machine learning algorithms cannot handle categorical data directly, so
we first have to convert the categorical data into numerical data before applying them.
One hot encoding is one technique for this conversion.
If a categorical variable has n unique values, one
hot encoding splits it into n columns, one per value. For example, in our case
the Class column will be split into three columns named Class_Business, Class_Eco,
and Class_Eco_plus. Each observation that belongs to a particular class has the
value 1 in the column for that class and the value 0 in the other two Class columns.
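
The Class example above can be sketched with `pandas.get_dummies`. This is a minimal illustration on made-up rows. Note that pandas builds column names from the raw category labels, so the third column comes out as `Class_Eco Plus` (with a space) unless it is renamed.

```python
import pandas as pd

# Toy rows; only the Class column matters here.
df = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco", "Business"]})

# One column per unique value; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(df["Class"], prefix="Class", dtype=int)
print(encoded)

# Every row has exactly one 1 across the three new columns.
row_sums = encoded.sum(axis=1)
```

For linear models it is common to drop one of the n columns (`drop_first=True`) to avoid perfectly collinear features; whether the project did so is not stated.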

Principal Component Analysis:
Principal Component Analysis (PCA) is a dimensionality
reduction technique that is used for reducing the dimensionality of large datasets by
transforming a large set of variables into a smaller one that still contains most of the information
from the large dataset. Normally, if we remove features from the data, there is
a chance of losing key information, but PCA
is designed so that we can convert higher-dimensional data into
lower-dimensional data with little information loss. So the
basic idea behind PCA is to reduce the number of variables in the dataset while preserving as
much information as possible. One prerequisite for applying PCA
is that the data must be scaled first. If we do not normalize
the data, the output of PCA will be biased toward the variables with the larger scale or
variance. For example, if there is a variable A that ranges between 0 and 100 and the other variables
range between 0 and 1, then the PCA model will be biased toward variable A.
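
The scaling bias described above can be demonstrated on synthetic data. The sketch below implements PCA directly with NumPy (eigen-decomposition of the covariance matrix) so that each step is visible; libraries such as scikit-learn provide the same functionality. The data is random, and the helper name `pca_explained_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Three independent variables: A spans 0-100, the other two span 0-1.
X = np.column_stack([
    rng.uniform(0, 100, n),  # variable A
    rng.uniform(0, 1, n),
    rng.uniform(0, 1, n),
])

def pca_explained_variance(data):
    """Return (explained-variance ratios, components as rows), largest first."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs.T

# Without scaling, the first component is almost entirely variable A.
raw_ratio, raw_components = pca_explained_variance(X)

# After z-score standardization, no single variable dominates.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ratio, scaled_components = pca_explained_variance(X_scaled)

print("raw:   ", raw_ratio.round(4))
print("scaled:", scaled_ratio.round(4))
```

To reduce the dimension, the centered data is projected onto the first k components, for example `(X_scaled - X_scaled.mean(axis=0)) @ scaled_components[:k].T`, with k chosen from the cumulative explained-variance ratio.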

Code Explanation:
In the coding section, I first import all the necessary libraries, such as pandas,
matplotlib, seaborn, and numpy. Then I load the data from the csv file into a pandas data frame.
I print the first few rows of the data, check the statistical summary of the
data, and check the info of the data to see the data types of all the features. Then I plot all
the numerical variables as histograms and all the categorical variables as bar charts. Then I
normalize the data using Min-Max normalization and Z-score standardization. Then I plot all
the numerical data as scatter plots, pair by pair. Then I plot all the categorical variables as
bar charts separated by the output variable (satisfaction). Then I use different statistical,
aggregation, and visualization techniques to perform exploratory data analysis. Finally, I
perform one hot encoding to convert categorical data to numerical data, and then I perform
PCA with different numbers of components to reduce the dimensions of the dataset.
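
The two normalization steps named in this section can be sketched as follows. This is a minimal illustration on made-up values, written with plain pandas arithmetic; the project's own code may use a library scaler instead.

```python
import pandas as pd

# Toy numeric columns; the values are made up.
df = pd.DataFrame({
    "Age": [20, 40, 60],
    "Flight Distance": [500, 1500, 4000],
})

# Min-Max normalization: rescale each column to the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: zero mean and unit (population) standard deviation.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

Min-Max keeps the shape of the distribution but is sensitive to outliers, because a single extreme value sets the range. Z-score standardization is the usual choice before PCA.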

Conclusion
To conclude, this project shows which factors are most important for
customer satisfaction. There are some services, such as Food and Drink, where a significant
number of passengers are neutral or dissatisfied even if the level of service is 4 or 5. So we
may need to include more questions in the survey to gather data about other factors, such as
the food delivery time, to better understand why passengers were dissatisfied despite the
higher service levels for food and drink. Also, there are more loyal customers who are neutral
or dissatisfied than who are satisfied, so that is another area of
concern. In future we could gather more data on other aspects of the service to extract more
useful insights, and we may also apply machine learning classification models to predict the
satisfaction feature from all the other features.

References
1. Dataset: <https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv>
2. Dataset: <https://www.kaggle.com/johndddddd/customer-satisfaction>
3. Dwi Suhartanto & Any Ariani Noor, Customer Satisfaction in the Airline Industry: The Role of Service Quality and Price
4. ML | One Hot Encoding of datasets in Python <https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/>
5. Step by Step Explanation to PCA <https://builtin.com/data-science/step-step-explanation-principal-component-analysis>
