Airline Passenger Data Analysis
Domain Problem:
Customer satisfaction is one of the most important factors for modern
businesses, as it can contribute significantly to improving service quality. To meet customer
expectations and deliver higher-quality service, airlines use different kinds of surveys to
collect feedback about passengers' travel experience. This project analyzes US airline survey
data from Kaggle to find insights that can improve airline service quality. In the survey,
passengers rated their experience of different aspects of the airline, such as wifi service,
food and drink, and cleanliness. Based on this feedback and on the insights found with Python
and statistical techniques, the airline can understand which services it should improve for a
better passenger experience.
Motivation:
The motivation behind this project is to explore US airline passenger
feedback data and find useful insights, such as which factors contribute most to passenger
satisfaction and which should be improved for better service quality and customer
satisfaction. These insights and visualizations will help airlines make decisions that further
enhance their business. Another motive is to learn, explore and apply different statistical,
data analysis and data visualization techniques to the data at hand.
Data exploration poses several challenges. The first is data quality: the data must be of good
quality to yield useful information. The second is dirty data: whether the data has missing or
invalid values and, if so, how to clean it and which method to use to fill the missing values.
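One common way to fill missing numeric values is median imputation. The sketch below only illustrates the idea on a tiny made-up sample; it is not the cleaning code used in the project. The column names follow the dataset, while the values and the choice of the median are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real data would come from train.csv.
sample = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 45, 3, 0],
    "Arrival Delay in Minutes": [0.0, 10.0, np.nan, 5.0, np.nan],
})

# Count missing values per column before cleaning.
missing_before = sample.isna().sum()

# Fill missing numeric values with the column median (an assumed strategy).
cleaned = sample.fillna(sample.median(numeric_only=True))
missing_after = cleaned.isna().sum()

print(missing_before)
print(missing_after)
```

The median is less sensitive to extreme delays than the mean, which is why it is used in this sketch; other strategies such as dropping the rows are equally possible.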
US Airline Dataset:
The US airline dataset was obtained from Kaggle. The link to the dataset is given below:
https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv
The dataset has 23 features in total. It records details for each passenger, such as age,
gender, type of travel, distance traveled and travel class. The name and description of each
feature are given below:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
Departure/Arrival time convenient: Satisfaction level of the convenience of departure/arrival time
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)
The dataset has 103,904 examples and 24 attributes in total. There are 310 missing values in
the Arrival Delay in Minutes column.
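The size of the dataset and the number of missing values per column can be checked with a few pandas calls. The sketch below reads a made-up three-row excerpt from an in-memory string so that it runs on its own; with the real file the call would be pd.read_csv("train.csv"). The rows and the subset of columns shown are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up three-row excerpt in a layout similar to train.csv (assumed values).
csv_text = """id,Gender,Age,Class,Arrival Delay in Minutes,satisfaction
1,Male,13,Eco Plus,18,neutral or dissatisfied
2,Female,25,Business,,satisfied
3,Female,61,Eco,0,satisfied
"""

# With the real file this would be pd.read_csv("train.csv").
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape              # number of examples and attributes
missing_per_column = df.isna().sum()   # missing values in each column

print(n_rows, n_cols)
print(missing_per_column[missing_per_column > 0])
```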
Data Exploration:
There are multiple techniques for exploring the data and finding
insights in it, including unique value counts, frequency counts, histograms, bar plots and
scatter plots. I used different Python and statistical techniques for exploratory data
analysis and found some interesting insights, which are given below.
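Unique value counts and frequency counts can be obtained directly from a pandas data frame. The following is a minimal sketch on a made-up sample; the column names follow the dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Made-up sample with two categorical columns from the dataset.
sample = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco", "Eco Plus", "Business", "Eco"],
    "Customer Type": ["Loyal Customer", "Loyal Customer", "disloyal Customer",
                      "Loyal Customer", "Loyal Customer", "disloyal Customer"],
})

# Unique value count: number of distinct values in each column.
unique_counts = sample.nunique()

# Frequency count: number of rows per class, most frequent first.
class_frequency = sample["Class"].value_counts()

print(unique_counts)
print(class_frequency)
```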
Satisfaction Distribution by Type of Travel:
The first analysis looks at the distribution of
satisfaction across the different types of travel. The bar chart of this distribution shows
that most passengers on Personal Travel are neutral or dissatisfied with the airline, while
most passengers on Business Travel are satisfied with its services.
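A distribution like this can be tabulated with a cross-tabulation before plotting it. The sketch below uses a made-up sample, so the shares it prints are not the project's results; the column names are assumed to match the dataset.

```python
import pandas as pd

# Made-up sample; the shares below are for illustration only.
sample = pd.DataFrame({
    "Type of Travel": ["Personal Travel", "Personal Travel", "Personal Travel",
                       "Business travel", "Business travel", "Business travel"],
    "satisfaction": ["neutral or dissatisfied", "neutral or dissatisfied", "satisfied",
                     "satisfied", "satisfied", "neutral or dissatisfied"],
})

# Row-normalized table: share of each satisfaction level within a travel type.
travel_vs_satisfaction = pd.crosstab(
    sample["Type of Travel"], sample["satisfaction"], normalize="index"
)

print(travel_vs_satisfaction.round(2))
```

If matplotlib is installed, calling travel_vs_satisfaction.plot(kind="bar") on this table draws the corresponding grouped bar chart.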
The plot of passenger age by satisfaction shows that satisfied passengers are slightly older
than dissatisfied or neutral passengers. The average age of satisfied passengers is 41.75,
compared with 37.57 for neutral or dissatisfied passengers.
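Average ages per satisfaction group can be computed with a groupby aggregation. This is a minimal sketch on made-up ages, so it does not reproduce the 41.75 and 37.57 figures above; the column names are assumed to match the dataset.

```python
import pandas as pd

# Made-up ages; the real averages come from the full dataset.
sample = pd.DataFrame({
    "Age": [25, 35, 45, 40, 50, 60],
    "satisfaction": ["neutral or dissatisfied"] * 3 + ["satisfied"] * 3,
})

# Mean age within each satisfaction group.
mean_age = sample.groupby("satisfaction")["Age"].mean()

print(mean_age)
```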
The bar chart of food and drink ratings shows a few flights with no food and drink service,
represented by 0. All the passengers on those flights are dissatisfied, presumably because no
food and drink service was offered. For food and drink ratings from 1 to 4, most passengers
are neutral or dissatisfied. Even at ratings of 4 or 5, a significant number of passengers
remain neutral or dissatisfied, although fewer than those who are satisfied.
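One way to quantify this pattern is the share of satisfied passengers at each food and drink rating. The sketch below uses made-up ratings, so the printed shares are illustrative only; the column names are assumed to match the dataset.

```python
import pandas as pd

# Made-up ratings; 0 stands for "no food and drink service".
sample = pd.DataFrame({
    "Food and drink": [0, 1, 1, 3, 4, 4, 5, 5],
    "satisfaction": [
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "satisfied", "neutral or dissatisfied",
        "satisfied", "satisfied",
    ],
})

# True where the passenger is satisfied, False otherwise.
is_satisfied = sample["satisfaction"].eq("satisfied")

# Mean of a boolean column per rating gives the share of satisfied passengers.
satisfied_share = is_satisfied.groupby(sample["Food and drink"]).mean()

print(satisfied_share)
```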
The age histogram shows that most passengers aged between 40 and 60 are satisfied with the
airline's services, in contrast to passengers who are younger than 40 or older than 60.
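The same comparison can be made numerically by binning ages and computing the share of satisfied passengers per bin. The sketch below uses made-up data; the bin edges, the labels and the boolean "satisfied" column are assumptions for illustration.

```python
import pandas as pd

# Made-up sample; "satisfied" is a hypothetical boolean helper column.
sample = pd.DataFrame({
    "Age": [18, 25, 33, 42, 48, 55, 59, 67, 72],
    "satisfied": [False, False, True, True, True, True, False, False, False],
})

# Left-closed bins: [0, 40), [40, 60), [60, 120).
bins = [0, 40, 60, 120]
labels = ["under 40", "40 to 59", "60 and over"]
sample["age_group"] = pd.cut(sample["Age"], bins=bins, labels=labels, right=False)

# Share of satisfied passengers within each age group.
share_by_group = sample.groupby("age_group", observed=True)["satisfied"].mean()

print(share_by_group)
```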
Use of One hot Encoding and PCA Dimensionality Reduction:
One-hot encoding
converts categorical data into numerical data, and Principal Component Analysis (PCA) reduces
the dimensionality of larger datasets with little or no loss of information from the original
data. The details of how these techniques are used are given below.
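As a rough illustration of the two techniques, the sketch below one-hot encodes a categorical column with pandas and then computes PCA by hand through a singular value decomposition in NumPy. This hand-rolled PCA is a stand-in for a library implementation and is not necessarily how the project computes it; the sample rows and the choice of two components are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample with one categorical and two numerical features.
sample = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Business", "Eco"],
    "Age": [23, 45, 31, 52, 38],
    "Flight Distance": [460, 2300, 1100, 3100, 560],
})

# One-hot encoding: each category of "Class" becomes its own 0/1 column.
encoded = pd.get_dummies(sample, columns=["Class"], dtype=float)

# Standardize every column so that no feature dominates by scale.
X = encoded.to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)   # variance share of each component

n_components = 2                         # assumed number of components
reduced = X @ Vt[:n_components].T        # data projected onto the top components

print(encoded.columns.tolist())
print(explained_ratio.round(3))
print(reduced.shape)
```

Trying several values of n_components and looking at the cumulative sum of explained_ratio shows how much variance each choice retains.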
Code Explanation:
In the code, the steps are as follows:
1. Import the necessary libraries, such as pandas, matplotlib, seaborn and numpy.
2. Load the data from the csv file into a pandas data frame.
3. Print the first few rows, then check the statistical summary and the info of the data to
see the data types of all the features.
4. Plot all the numerical variables as histograms and all the categorical variables as bar
charts.
5. Normalize the data using Min-Max normalization and Z-score standardization.
6. Plot the numerical data as scatter plots, pair by pair.
7. Plot all the categorical variables as bar charts, separated by the output variable
(satisfaction).
8. Use different statistical, aggregation and visualization techniques to perform exploratory
data analysis.
9. Perform one-hot encoding to convert the categorical data to numerical data.
10. Perform PCA with different numbers of components to reduce the dimensions of the dataset.
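The Min-Max normalization and Z-score standardization mentioned above can be sketched in plain pandas as follows. The sample values are made up, and the use of the population standard deviation (ddof=0) is an assumption rather than the project's documented choice.

```python
import pandas as pd

# Made-up numerical sample.
sample = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Flight Distance": [200, 400, 600, 800, 1000],
})

# Min-Max normalization rescales each column to the range [0, 1].
min_max = (sample - sample.min()) / (sample.max() - sample.min())

# Z-score standardization gives each column mean 0 and standard deviation 1.
# ddof=0 uses the population standard deviation (an assumed choice).
z_score = (sample - sample.mean()) / sample.std(ddof=0)

print(min_max)
print(z_score.round(3))
```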
Conclusion
To conclude, this project shows which factors are most important for
customer satisfaction. For some services, such as food and drink, many passengers are
dissatisfied even when the service is rated 4 or 5. The survey may therefore need more
questions about other factors, such as food delivery time, to explain why passengers were
dissatisfied despite the higher ratings for food and drink. Many loyal customers are also
neutral or dissatisfied, which is another area of concern. In future, we could gather more
data to extract further insights, and we may also apply machine learning classification
models to predict the satisfaction feature from all the other features.
References
1. Dataset: <https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv>
2. Dataset: <https://www.kaggle.com/johndddddd/customer-satisfaction>
3. Dwi Suhartanto & Any Ariani Noor, Customer Satisfaction in the Airline Industry: The Role
of Service Quality and Price
4. ML | One Hot Encoding of datasets in Python
<https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/?>
5. Step by Step Explanation to PCA
<https://builtin.com/data-science/step-step-explanation-principal-component-analysis>