WIA1006 Report (OrionX)


WIA1006 MACHINE LEARNING

ASSIGNMENT REPORT

TITLE: MEASURING THE SEVERITY OF DEPRESSION AMONG STUDENTS

LECTURER NAME: DR. ERMA RAHAYU BINTI MOHD FAIZAL ABDULLAH

GROUP NAME: OrionX

NAMES: MATRIC NUMBER:

SANJIVAN BALAJAWAHAR S2109465

TAN JIA XUAN U2102793

REGINA TANG HEU YAN U2102860

HAM ZHI YING U2102741

ONG SHU YING U2102835

DATE: 17 June 2022



TABLE OF CONTENTS

REPORT EVALUATION 1
1.0 Introduction to Problem
2.0 Hypothesis made for the problem
3.0 Project Objectives
4.0 Methodology
5.0 Elaboration on Data & Features Used
6.0 Results and discussions
7.0 Suggestions for future works

REPORT EVALUATION 2
8.0 Processes Involved
9.0 Background Theory
10.0 Experimental Protocol
11.0 Commented Source Code
12.0 References


REPORT EVALUATION 1

1.0 Introduction to Problem


Depression is a common health problem among students. However, many students are
unaware that they are suffering from depression. If left untreated, it can have a detrimental
impact on students' psychosocial and emotional well-being and on their academic performance.
According to a study conducted in various Asian countries, the prevalence of depression among
students ranges between 4% and 79.2%. A study conducted in African universities reported that
depression affects 16.2% to 67% of the population, and in Ethiopia, between 29.1% and 32.2% of
students suffer from depression. These findings suggest that the prevalence of depression is
increasing at an alarming rate globally.

Depression among students is significantly associated, positively or negatively, with
gender, age, interest in doing things, insomnia, fatigue, appetite, current job, study hours per
day, number of gadgets, time spent on social media, academic achievement, and other factors.
As a result of depression, students miss more classes, assignments, and examinations, and some
are even forced to drop out of school or university. Therefore, our main motive for this
project is to develop a machine learning model to measure the severity of depression
among students. We will feed our machine learning model a dataset and train it to classify
each input as either mild depression or severe depression.


2.0 Hypothesis made for the problem


Our hypothesis for the project is that students who score higher on feeling down,
trouble falling or staying asleep, feeling tired, poor appetite or overeating, feeling bad about
themselves, trouble concentrating on things, moving or speaking so slowly that other people
could have noticed (or being so restless that they move around a lot more than usual), and
thoughts that they would be better off dead or of hurting themselves in some way, who do not
have a current job, who report more study hours per day, more gadgets, and more hours spent
on social media, and who have a lower CGPA will get '1', which represents severely depressed
in our project, while students with the opposite profile will get '0', which represents mildly
depressed. To justify our hypothesis, we researched online for related evidence to support our
view. By doing so, we can establish the relationship between our features and the depression
level of students. After concluding that all the features are valid, we implement a model using
logistic regression to predict the depression level of students.

To implement the logistic regression model, we obtained a dataset from Kaggle. Because
datasets with these specific features are limited, we predict our model will have an accuracy of
85% for training, 80% for validation, and 80% for testing. In brief, our depression indicator
model's task (T) is to predict the depression level of students, its experience (E) is what it learns
from the dataset with these features, and its performance (P) is the accuracy of its predictions
of students' depression levels.


3.0 Project Objectives


In this Machine Learning project, we have designed a specific Depression Indicator by
implementing a machine learning algorithm, namely logistic regression, to help us achieve our
project's objectives, which include:

1. To determine whether students are suffering from mild depression or severe
depression. Our machine learning training model classifies the students and outputs
the prediction "0" for a student suffering from mild depression and "1" for severe
depression.
2. To determine the relationship between the features (variables) and how they contribute
to the severity of depression.
3. To examine whether having a low GPA is the factor that contributes most to stress
among students, based on our training model.
4. To determine how accurate our machine learning training model is, based on our
outcomes and results.


4.0 Methodology
In our project, we employed a variety of techniques to complete the given tasks. All
of the strategies listed below have been acknowledged and debated among the members of
our group to ensure the best possible outcome.

Firstly, we obtained our dataset from Kaggle, a well-known platform that allows data
scientists and machine learning practitioners to explore and publish datasets. We did not go
for manual data collection since we wanted over 800 records for a more accurate training
model, and it would be impossible to get such a large amount of information via surveys of our
friends and families alone in a short amount of time.

Next, we cleaned the data and filtered out some features that we felt were
unnecessary. All of the features selected were also backed up by evidence. Details of how and
what we used to select the features will be discussed later on in the report. The data obtained
from Kaggle was categorical (qualitative). Therefore, we transformed the categorical data into
numbers, since machine learning models can only be trained with numeric data. Our aim was to
categorise the output into mild or severe depression, so we applied logistic regression
to train the data.

Before analysing, we split the data into training, validation, and test sets. Data
splitting was performed to avoid overfitting, a condition where the learned hypothesis fits
the training set very well but fails to generalise to new examples. Moving on, we used the
confusion matrix to evaluate the performance of our classification model. For our model to be
considered successful, we aim for an accuracy higher than 80%; a rough sketch of this plan is
given below.
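
The sketch below uses synthetic stand-in data (not our actual dataset) and the scikit-learn calls
we rely on later in Section 11; it only illustrates the split-and-evaluate workflow, not our final model:

# Hedged sketch of the evaluation plan: split the data, fit a logistic
# regression model, and check the test accuracy against our 80% target.
# X and y here are synthetic stand-ins for the prepared features and labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
print('Test accuracy above the 80% target:', test_acc > 0.80)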


5.0 Elaboration on Data & Features Used


In order to complete this Machine Learning group assignment, we were required to
acquire datasets related to depression; if none were available, we were even given the option to
develop a pseudo-dataset to demonstrate our proposed problem and solution. Our group obtained
a dataset from Kaggle, which can be accessed here. We chose this dataset because it contained
many of the features we were looking for that could eventually be used in our Machine Learning
training model. In this part of the report, we explain and elaborate on the dataset that we chose.
There is also an in-depth explanation of why each feature from the dataset is relevant to
our Depression Indicator and how it can affect someone's well-being to the point where they
suffer from depression.

First and foremost, we cleaned the data that we acquired by removing a few features
already present in the dataset because we felt they were unnecessary for our project. Besides,
we mapped records whose depression-level value is less than or equal to two to the mildly
depressed class and records whose value is three or more to the severely depressed class.

Next, we encoded the features into values of 0, 1, 2, or 3 so that a simpler output
can be obtained rather than working with many distinct values. Moving on, we eventually split
the dataset into three parts: training, validation, and test sets. This helps us estimate how
accurate our model is and how far off it is from the actual values by testing against the
validation set.

Moving on, after doing more data cleaning and modifying the data so that it is
arranged in an orderly manner, we can finally proceed to train our model using the Machine
Learning algorithm that we decided upon, which is logistic regression. Logistic regression
estimates the probability of an event occurring; in our case, it classifies students as mildly or
severely depressed. As a discriminative algorithm, logistic regression aims to distinguish between classes.
Unlike a generative algorithm, such as Naive Bayes, it cannot generate information such as
an image of the class that it is trying to predict. Logistic regression also helps us to assess
which input variable is responsible for the greatest change in predicted value. What logistic
regression does that other algorithms don't is that it outputs well-calibrated probabilities
along with classification results. This is an advantage over models that only give the final
classification as results. If one training example has a 95% probability for a class and another
has a 55% probability for the same class, we can infer which predictions the model is more
confident about. Due to its simple probabilistic interpretation, the training time of logistic
regression is far shorter than that of more complex algorithms, such as an artificial neural network.
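
As a small, self-contained illustration of this point (toy data, not our project's dataset),
scikit-learn's logistic regression returns both a hard class label and the associated probability:

# Minimal sketch with toy data: logistic regression gives a class label
# via predict() and a calibrated probability via predict_proba().
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.4], [0.6], [0.9]])  # toy feature values
y = np.array([0, 0, 1, 1])                  # toy labels (0 = mild, 1 = severe)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.7]]))        # hard classification, e.g. array([1])
print(clf.predict_proba([[0.7]]))  # probabilities for class 0 and class 1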

We now move on to explain the features that we used. Studies have found that
depression causes people to think more slowly than others. A person who is
depressed may exhibit slow speech or difficulty understanding and registering
information. Based on a recent study, participants were asked to count backwards from one
hundred by seven. People with depression are proven to be slower in doing so and make more
mistakes. (Tracy, 2022) People who are suffering from depression are also most likely to have
suicidal thoughts. According to a research paper, there is a direct connection between
depression and self-harming behaviours among adolescents. It is commonly used by
victims as a coping technique, a source of comfort, a way to regulate moods, self-punishment,
and sensation seeking.

Furthermore, the feature of whether the participants had part-time or full-time work
was used since juggling between being a full-time college student and full-time employment
is one of the most difficult scenarios a student may face. Time management for students
might be tough since they have to concentrate on achieving good grades and at the same time
performing well at work. This can take a huge toll on their mental health. (Santoro, 2021)
The amount of time one spends studying can also be an indicator of depression. Studies
show that there is a correlation between study habits and depression. The more time one spends
studying, the less likely one may be to suffer from depression. This is because
gaining excellent marks is one of the most concerning issues for students. When a student's
study habits are well-planned and disciplined, they will have sufficient time to prepare for
upcoming exams.

Next, the number of electronic gadgets and time spent on them are considered
contributing factors to depression among students. Excessive usage of digital gadgets was
found to promote depression among users in a 2017 study. Teens and adults who spent more
than six hours a day staring at screens were substantially more likely to suffer from moderate
to severe depression than those who spent less time on them (Madhav et al., 2017).
According to another study, adolescents who used electronic devices for more than two hours
were 1.71 times more likely to have depression (Al Salman et al., 2020). Spending so much
time alone in front of a screen can exacerbate feelings of loneliness and disrupt true human
interactions. Lack of real human interactions adds to people's feelings of depression, and their
attempts to alleviate depression through screen time can create a vicious cycle that only adds
fuel to the fire in the context of suffering from depression.

There is a significant link between adolescent depression and academic
performance. Lower grades, in fact, might be the first noticeable sign of depression. This is
because students who are depressed may have difficulty concentrating in class. They will
often refuse to perform certain tasks that they perceive to be excessively tough or
overwhelming, especially if doing so makes them doubt their capability. Moreover,
depression in teens can cause slurred speech and a weakened ability to convey thoughts and
ideas. Class presentations, for example, may be a terrifying experience that is to be avoided at
all costs. All the factors stated above can negatively impact their grades and class
performance.

In addition to that, the feature of having little interest or pleasure in doing things,
more commonly known as anhedonia, is yet another clear symptom of a person suffering
from depression. Social withdrawal, lack of interest in previous hobbies that one used to
enjoy, and also diminished pleasure derived from daily activities are just some of the clear
symptoms that show one could be suffering from severe depression. It goes to show how
a serious mental health condition like anhedonia can affect a person to the point that nothing
brings them joy in life anymore. What makes this worse is that it is extremely tough to
bring them back to their normal state of mind and make them enjoy the little things in life. It
is undoubtedly a core symptom of major depressive disorder.

Moreover, trouble falling or staying asleep or sleeping too much is yet another
indicator of someone suffering from depression. Depression and sleep problems are closely
linked: people with insomnia may have a tenfold higher risk of developing depression
than people who get a good night’s sleep. Among people with depression, 75% have trouble
falling asleep or staying asleep. Sleep issues associated with depression include insomnia,
hypersomnia, and obstructive sleep apnea. It is believed that about 20% of people with
depression have obstructive sleep apnea and about 15% have hypersomnia. Many people with
depression may go back and forth between insomnia and hypersomnia during a single period
of depression.

Furthermore, feeling tired or having little energy, that is, fatigue, is considered
another clear indicator of someone suffering from depression. According to data gathered in
2018, over 90% of people living with depression have symptoms of fatigue. Depression and
fatigue are related because depression impacts the neurotransmitters associated with alertness
and our reward systems. Therefore, the condition has a physiological impact on our energy
levels. Depression also has a negative effect on our sleep, leading to difficulty falling asleep
or staying asleep, not sleeping as deeply, waking up too early, or just sleeping too much.

Besides, if one is experiencing a poor appetite or overeating, it is likely that they
are suffering from depression. Based on published research, patients with major
depressive disorder exhibit marked heterogeneity in appetite, with approximately 48% of
adult depressed patients exhibiting depression-related decreases in appetite, while
approximately 35% exhibit depression-related increases in appetite (Maxwell, 2009). The
reason for this is that depressed people have an extreme response to food stimuli.
They may lose interest in food or consume a huge amount of food in a short time to gain
temporary pleasure. If this situation persists, their depression may become even more serious
because both undereating and overeating lead to hormonal imbalance and malnutrition.
Therefore, loss of appetite and overeating are clear depression indicators, and appetite
disturbance is also related to weight change, a physical characteristic that we can observe.

Other than that, we can also associate depression with one's inner thoughts, such as when
one feels bad about oneself or thinks one is a failure who has let oneself and one's family down.
This is because people who feel bad about themselves are usually in an excessive self-blaming
condition and always think negatively about themselves (Zahn et al., 2015). In the long term,
people who constantly experience this condition suffer a decrease in self-worth and feel
hopeless, especially when they do not ask family and friends for help. A long-term negative
mindset affects their mental health and can bring them to severe depression. This disturbs
their daily tasks and life because the emotions of depressed people are unstable. Not just that,
their career and relationships will also be largely affected due to their negative mindset and
reduced effectiveness. This forms a vicious cycle that makes the person even more depressed
and can finally result in serious psychological illness or suicidal thoughts.

Lastly, another crucial depression indicator is trouble concentrating on things, such
as reading the newspaper or watching television. In fact, the inability to focus is related to the
processing speed of the brain. Researchers have found that the ability to take in information
quickly and efficiently, that is, processing speed, is impaired in individuals who are depressed.
Slower processing speed is in fact one of the symptoms of depression, as it indicates that a
person may be in an unstable mental condition and preoccupied with negative thoughts, which
makes it hard to focus. Concentration is essential for people to perform their daily tasks and
achieve their goals. For example, students need to concentrate in class to understand the
syllabus taught by the lecturer so that they can pass their exams with flying colours. In
contrast, if they cannot focus in class, they will have doubts about the syllabus and their
academic performance will plummet. This can kick off a chain reaction, as they will feel
dejected when they receive low grades; they may blame themselves and think that they have
let their parents down by getting a bad or mediocre result.


6.0 Results and discussions


To measure the performance of our model, we use a confusion matrix to visualise the
results. After a series of training and debugging, our logistic regression model finally achieved
an accuracy of 86.83% for training, 86.50% for validation, and 82.50% for testing, calculated
using the formula in Figure 1. The accuracy of our model meets our expectations, coming in
slightly above our predicted values.

Figure 1

Figure 2
From the training confusion matrix above (Figure 2), we can see that when the actual
target is mildly depressed, our model correctly predicts 91% of the cases as mildly depressed
but wrongly predicts 8.9% of them as severely depressed. When the actual target is severely
depressed, our model correctly predicts 79% of the cases as severely depressed but misclassifies
21% of them as mildly depressed.


Figure 3
From the validation confusion matrix (Figure 3), we can see that when the actual target
is mildly depressed, our model correctly predicts 92% of the cases as mildly depressed but
misclassifies 8.2% of them as severely depressed. When the actual target is severely depressed,
our model correctly predicts 76% of the cases as severely depressed and misclassifies 24% of
them as mildly depressed.

Figure 4
From the test confusion matrix (Figure 4), we can see that our model correctly predicts
85% of the actually mildly depressed cases as mildly depressed (15% are wrongly predicted as
severely depressed) and correctly predicts 78% of the actually severely depressed cases as
severely depressed (22% are wrongly predicted as mildly depressed).


Figure 5
For the F1 score, our model achieved 80.30% for training, 78.74% for validation, and
75.18% for testing.

Figure 6


Figure 7
To make predictions on new data using our logistic regression model, we created two
sets of new data. For the data in Figure 6, we created a record with gender = 0 (Female),
age = 1 (19 to 24 years old), higher average marks on 'Little interest or pleasure in doing
things', 'Trouble falling or staying asleep, or sleeping too much', 'Feeling tired or having
little energy', 'Poor appetite or overeating', 'Feeling bad about yourself or that you are a
failure or have let yourself or your family down', 'Trouble concentrating on things, such as
reading the newspaper or watching television', 'Moving or speaking so slowly that other people
could have noticed or being so restless that you have been moving around a lot more than
usual', 'Thoughts that you would be better off dead or of hurting yourself in some way',
'CurrentJob', 'StudyHourPerDay', 'Number of gadget', and 'hour spend on social media',
together with a lower 'GPA'. After we passed this new record into the function we defined,
our logistic regression model classified it as severely depressed (1) with a probability of
0.9131, and for the record with lower average marks and a higher GPA, our model classified it
as mildly depressed (0) with a probability of 0.9814. Both predictions match our hypothesis
and the evidence we found.


7.0 Suggestions for future works

This project has undoubtedly improved all of our groupmates’ understanding of the
various Machine Learning algorithms and of how to implement an algorithm in our own
project. This project has truly been an eye-opener for us, and we all agree that it has
helped us explore the Python language thoroughly. We would like to thank all the lecturers
involved including Dr Aznul Qalid and Dr Erma Rahayu for assigning us this project and for
always clearing our doubts along the way. Thank you also for teaching us this brand new
subject with the utmost patience from the very start. Although the project has been completed
successfully, there is a lot of room for improvement. Here are some of the suggestions for
improvement for any future Machine Learning projects that we handle.

Implementing Feature Engineering


Feature engineering is the art of creating new features from already existing ones.
It helps improve the accuracy of machine learning models by giving them more informative
inputs to learn from. Combining numerous existing features into one or more new features is
one of the most common approaches to developing new features, as sketched below. In future
machine learning projects, more feature engineering approaches can be applied so that we get
more accurate results. We also have to bear in mind to make use of data pre-processing
techniques like feature extraction and selection to help us find the most important features
in our dataset.
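
As a hypothetical sketch (the column names follow the encoding used later in this report, but
the engineered ratio itself is only an illustration), combining two existing columns into a new
one could look like this:

# Illustrative only: derive a new feature from two existing encoded columns.
import pandas as pd

toy = pd.DataFrame({'StudyHourPerDay': [0, 1, 2], 'SocialMediaSpend': [3, 1, 0]})
toy['ScreenToStudyRatio'] = toy['SocialMediaSpend'] / (toy['StudyHourPerDay'] + 1)
print(toy)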

Implementing Multiple Algorithms


Another way to get a more accurate result in our machine learning project is simply to
try multiple algorithms and keep whichever produces the best accuracy. However, due to the
time constraint and our lack of experience, our group was only able to dig deep into the
algorithm that we chose from the start, which is logistic regression. In real-life problems,
there will be a vast and complex amount of data on which we need to train our model. It is
likely that some features in our dataset do not contribute much to the accuracy of our model,
and deciding by hand which ones to remove would most probably make things even messier.
This is why using multiple algorithms is very helpful: by trying different algorithms, we can
identify which ones work best for our data and use that information to improve the accuracy
of our models. We can also cross-validate multiple algorithms on the same dataset and compare
their accuracy against each other, since there are many machine learning algorithms to choose
from. A sketch of such a comparison is given below.
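
A hedged sketch of such a comparison, using synthetic stand-in data and three common
scikit-learn classifiers (logistic regression plus two alternatives we did not actually try in
this project):

# Compare several classifiers on the same data with 5-fold cross-validation.
# X and y are synthetic stand-ins, not our depression dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)

models = {
    'logistic_regression': LogisticRegression(solver='liblinear'),
    'decision_tree': DecisionTreeClassifier(random_state=42),
    'random_forest': RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())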

Hyperparameter Tuning
Hyperparameters are the parameters of machine learning models that determine how
they work, such as the number of layers in a deep neural network or the number of trees in an
ensemble model. These values are set before training rather than learned from the data. This is
where cross-validation is equally helpful: by splitting our data into training and validation
sets, we can try different combinations of hyperparameters on the training set and see how well
they perform on the held-out set, which helps us find the best combination for our model.
Grid search, a method of determining the best combination of hyperparameters for the data, is
another option. Grid search works by trying out all of the potential parameter combinations
until it discovers the one that gives the best result for our chosen metric (e.g., accuracy).
After that, we can train our model using that set of hyperparameters; a sketch is given below.
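
A hedged sketch of grid search for a logistic regression model (again on synthetic stand-in
data; the parameter grid shown is only an example):

# GridSearchCV tries every combination in param_grid with cross-validation
# and reports the combination that scores best on the chosen metric.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)

param_grid = {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, search.best_score_)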


REPORT EVALUATION 2

8.0 Processes Involved


Data Collection
First and foremost, in order to create a machine learning model that can be
used to predict and measure the severity of depression among students, we did a lot of
research on suitable methods to achieve this purpose. In the end, we decided to use
questionnaire answers as the inputs for our prediction model. Initially, we collected
relevant open-source data, namely datasets on depression levels among students, from
Kaggle, Google Dataset Search, and similar sources. Then, we performed a quality comparison
of the data collected. Finally, we decided to use the 'Depression and Academic Performance
of Students' dataset, which can be accessed here.

Data Preparation and Cleaning


We downloaded our dataset file in comma-separated values (.csv) format from Kaggle
and then carried out the whole model-training process in DeepNote. DeepNote allows all
five of us to write and execute Python code in the browser at the same time, and it is
well suited to machine learning data analysis. We then read and stored the data from our
CSV dataset file in DeepNote and imported the necessary libraries: matplotlib, pandas,
numpy, seaborn, plotly.express, and scikit-learn.

After that, we dropped unnecessary columns that are not related to depression,
namely 'Educational Level' and 'Which of the following best describes your term-time
accommodation?'. Apart from that, we also converted the categorical data (qualitative)
to numeric data (quantitative), since machine learning models can only be trained with
numbers. This ensures that all the data can be analysed straightforwardly and is less
prone to error. Additionally, we identified and set 'DepressionLevel' as our target
column and created a list of input columns for the other features.


Exploratory Data Analysis and Visualisation


The dataframe had 1000 rows and 16 columns. Summary statistics were computed to
show the mean, standard deviation, minimum, and maximum values for each column in the
dataset.

Figure 8
It is always a good idea to look at the distributions of the various columns and see
how they relate to the target column before training a machine learning model. Thus, we
explored and visualised the data using histograms and checked whether there are
correlations between the different characteristics that we obtained.

Figure 9
Histogram above shows that students between 19 to 24 years old suffered from severe
depression.


Figure 10
Histogram above shows that, the population of students between a GPA of 3.2 to 3.39
are the ones that are severely depressed.

Selecting Logistic Regression Algorithm


Logistic regression is a commonly used technique for solving binary
classification problems. Since the main purpose of our training model is to predict the
probability of mild depression (0) or severe depression (1), we chose the most
appropriate algorithm, which is the logistic regression algorithm (Classification).

Data Splitting
After that, we split our data into three sets: a training set, a cross-validation
set, and a test set. This is done to avoid the scenario where the model ends up fitting
the test set well simply because new features were developed based on the evaluation of
the test set error. First, we split the data into a temporary train-and-validation set
(80%) and a test set (20%) using the train_test_split function from the scikit-learn
library. Then, the temporary train-and-validation set was split again into a
cross-validation set (25%) and a training set (75%). In the end, the original dataset was
split into 60% for training, 20% for cross-validation, and 20% for testing, corresponding
to 600, 200, and 200 rows respectively. Then, we created inputs and targets for the
training, validation, and test sets for further processing and model training.


Feature Scaling
Next, we used feature scaling (min-max normalization) to bring all values to the same
magnitude, so that all the independent features lie in a fixed range between 0 and 1.
This improves the accuracy and integrity of our data while ensuring that our dataset is
easier to work with.

Formula for Normalization:
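
For a feature value x with column minimum x_min and column maximum x_max (this is the
operation performed by scikit-learn's MinMaxScaler):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$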


Algorithm Implementation (Model Training)


With the scikit-learn library, a logistic regression model can be created easily.
Therefore, we developed and trained our logistic regression model using the selected
algorithm from scikit-learn and fitted it to the normalised training inputs with the
training targets. By doing so, training ran smoothly and we could see an incremental
improvement in the prediction rate.

Model Evaluation
Evaluating a model is a core part of building an effective machine learning
model. Therefore, we checked the trained model against our evaluation dataset, which
contains inputs that the model has not seen, and verified how accurate our training was.

The scikit-learn library has a few built-in methods that can be used to assess
model performance. The first is the model accuracy score. Model accuracy is defined as
the ratio of true positives and true negatives to all positive and negative observations.
The accuracy scores for the training, cross-validation, and test datasets are 86.83%,
86.50%, and 82.50% respectively; the maximum possible accuracy score is 100%. We finally
obtained an 82.50% accuracy score on the test dataset, meaning that we have good
confidence in the results that our model gives us.


The formula is as below:
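
With TP, TN, FP, and FN denoting true positives, true negatives, false positives, and
false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$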

Second, our model was also evaluated with the F1 score. The F1 score is a machine
learning model performance metric that gives equal weight to both precision and recall
when measuring performance, making it an alternative to the accuracy metric. The F1
scores for the training, cross-validation, and test datasets are 80.30%, 78.74%, and
75.18% respectively; the maximum possible F1 score is 100%.

The formulas are as below:
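
Using the same TP, FP, and FN notation as above:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$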


Confusion Matrix
In addition, a confusion matrix was also plotted. It is useful because it gives
direct comparisons of values such as true positives, false positives, true negatives,
and false negatives. We can also visualise the breakdown of correctly and incorrectly
classified inputs using a confusion matrix. Finally, it allows us to characterise the
performance of our algorithm.

Figure 11

Prediction on New Inputs


We tried changing the values in new individual inputs and observed how the
predictions and probabilities change. Finally, our machine learning model was able
to predict and determine whether students are suffering from mild depression or
severe depression. Our model classifies the students and outputs "0" for a prediction
of mild depression and "1" for severe depression.


Save Processed Data to Disk


It can be useful to save our processed data to disk, especially for large
datasets, to avoid repeating the preprocessing steps every time we start the Jupyter
Notebook or DeepNote. We used the parquet format as it is a fast and efficient format
for saving and loading Pandas data frames.

Model Deployment
After making sure all the data was ready for deployment, we prepared a
container deployment to move our code to production. We took the validated features
from a staging environment and deployed them into the production environment so that
they are ready for release.

Model and Data Monitoring/Updating


We plan for continuous monitoring and maintenance after machine learning
deployment. It is a vital part of the ongoing success of machine learning deployment
as our models can be kept optimised to avoid data drift or outliers.


9.0 Background Theory


The algorithm that we chose to employ to tackle the problem presented, which is to
create a depression indicator, is logistic regression. Logistic regression is commonly used to
solve classification problems. There are 3 main types of logistic regression: Binary Logistic
Regression, Multinomial Logistic Regression and Ordinal Logistic Regression. Binary
logistic regression is a statistical analysis method to predict a binary outcome, such as yes or
no, based on prior observations of a data set whereas multinomial logistic regression is a
classification approach that extends logistic regression to issues with more than two distinct
discrete outcomes. Ordinal logistic regression on the other hand is a statistical analysis
method that can be used to model the relationship between an ordinal response variable and
one or more explanatory variables.

In our project, we used binary logistic regression, a statistical method used to model
the relationship between a dependent variable and one or more independent variables. The
dependent variable is a binary variable that takes two values, for example 0 or 1, true or
false, or yes or no. In the logistic regression model, we first take the linear combination,
or weighted sum, of the input features. We then apply the sigmoid function to the result to
obtain a number between 0 and 1. This number represents the probability of the input being
classified as '1' rather than '0'.
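
In equation form, with input features x, learned weights w, and bias b:

$$P(y = 1 \mid x) = \sigma(w^{\top}x + b) = \frac{1}{1 + e^{-(w^{\top}x + b)}}$$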

Figure 12


Figure 13
Since our aim is to categorise the output into mild or severe depression, we
picked logistic regression above all the other machine learning algorithms to solve the
problem. The dependent variable in our indicator takes either 0 or 1, where 0 denotes a
mild case of depression and 1 denotes a severe case of depression. The factors that
contribute to depression were employed as the input features. This algorithm is
straightforward to comprehend, easy to construct, and easy to train. Besides, it is
appropriate for small datasets and can achieve a high level of accuracy. Finding
correlations between features is also convenient when using this algorithm.

One of the most important phases in the pre-processing of data prior to developing a
machine learning model is feature scaling. Scaling may make the difference between a bad
and a good machine learning model. Why do we need scaling? Machine learning algorithms
just look at numbers, and if there is a significant difference in range, such as some features
ranging in the thousands against others ranging in the tens, the model treats the larger-ranged
numbers as if they had some form of superiority. As a result, these larger numbers start playing
a more decisive role while training the model. Feature scaling is needed to put every feature on
the same footing, without giving any of them upfront importance. Therefore, in our project we implemented the
feature scaling technique. Due to the differences in ranges of features, each feature will have
a distinct step size. We scaled the data before feeding it to the model to guarantee that the
gradient descent moves smoothly towards the minima and that the gradient descent steps are
updated at the same rate for all features. It is advisable to get every feature into
approximately a (0,1) or (-1,1) range. Optimization algorithms also work better in practice
with smaller numbers.


Hyperparameters are a way to tailor the behaviour of the algorithm to the specific
dataset. Hyperparameters are not the same as parameters, which are the internal coefficients
or weights found by the learning procedure for a model. Our logistic regression model does not
have critical hyperparameters to tune. Hence, we considered the algorithm in the context of its
scikit-learn implementation (Python) and chose the solver that gives the best performance for
our indicator. Since we have only 1000 records, the 'liblinear' solver is used.


10.0 Experimental Protocol


We evaluate our logistic regression model using cross-validation and use the
confusion matrix to determine the performance of the model.

Figure 14
After we gather, clean, pre-process, and wrangle our data, the next step is to
feed it to our logistic regression model and obtain an output in the form of probabilities.
It is important for us to use cross-validation, evaluating the machine learning model on a
held-out portion (20%) of our available input data. This technique is used to detect
overfitting: if our model has been trained too well on the training data, it will fail to
generalise the pattern and will make inaccurate predictions when given new data. Thus, we
split our complete data into three sets (train, validation, and test). Cross-validation
allows us to tune hyperparameters using only the training and validation sets, which keeps
the test set as a truly unseen dataset for selecting our final model. Finally, we observe
how well our model performs on the new test dataset.

Figure 15


A confusion matrix is a performance measurement for machine learning classification.
In logistic regression (classification) problems, a confusion matrix is a better way to
measure the effectiveness of our model; the better the effectiveness, the better the
performance, which is exactly what we want. It is a table with four different combinations
of predicted and actual values. It is extremely useful for measuring recall, precision,
specificity, accuracy, and, most importantly, AUC-ROC curves. Treating severe depression (1)
as the positive class, it gives direct comparisons of values such as true positives (we
predicted severe depression and the student is actually severely depressed), false positives
(we predicted severe depression but the student is actually mildly depressed), true negatives
(we predicted mild depression (0) and the student is actually mildly depressed), and false
negatives (we predicted mild depression but the student is actually severely depressed).
Finally, we are able to visualise the breakdown of correctly and incorrectly classified
inputs using a confusion matrix, compute the accuracy score, which is the overall accuracy
of the model, and plot the confusion matrix for the given inputs.

Figure 16


11.0 Commented Source Code


https://deepnote.com/workspace/zhi-ying-ham-d38d-6f3bdb60-060a-4ce3-8001-9b6e2e0ba32f/project/groupassignment-e4c4deab-bedd-4cd2-b3fb-55cd4412909e/%2Fnotebook.ipynb

Import Libraries
# import library
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

Data Loading and Cleaning


# load the data from dataset_depression.csv using Pandas
df = pd.read_csv("dataset_depression.csv")

# Cleaning data part

# Drop unused columns
df = df.drop('Educational Level', axis=1)
df = df.drop('Which of the following best describes your term-time accommodation?', axis=1)

# Replace the long feature names with short variable names
# (regex=False so that '?' is treated as a literal character rather than a regular expression)
df.columns = df.columns.str.replace("?", "", regex=False)
df.columns = df.columns.str.replace('Feeling down, depressed, or hopeless', 'DepressionLevel')
df.columns = df.columns.str.replace('Do you have part-time or full-time job', 'CurrentJob')
df.columns = df.columns.str.replace('How many hours do you spend studying each day', 'StudyHourPerDay')
df.columns = df.columns.str.replace('How many of the electronic gadgets', 'NumOfGadget')
df.columns = df.columns.str.replace('How many hours do you spend on social media per day', 'SocialMediaSpend')
df.columns = df.columns.str.replace('Your Last Semester GPA:', 'GPA')


# Binarise the target: scores <= 2 -> mild (0), scores >= 3 -> severe (1)
df.loc[df['DepressionLevel'] <= 2, 'DepressionLevel'] = 0
df.loc[df['DepressionLevel'] >= 3, 'DepressionLevel'] = 1

# Move the DepressionLevel column to the beginning
temp_cols = df.columns.tolist()
index = df.columns.get_loc("DepressionLevel")
new_cols = temp_cols[index:index + 1] + temp_cols[0:index] + temp_cols[index + 1:]
df = df[new_cols]

# Then move the DepressionLevel column from the beginning to the end
temp_cols = df.columns.tolist()
new_cols = temp_cols[1:] + temp_cols[0:1]
df = df[new_cols]

df

Data Visualization
# Data visualisation: feature distributions coloured by depression level
px.histogram(df, x='Gender', title='Gender vs Depression level', color='DepressionLevel')

px.histogram(df, x='Age', title='Age vs Depression level', color='DepressionLevel')

px.histogram(df, x='GPA', title='GPA vs Depression level', color='DepressionLevel')

# Encode the categorical features as numbers
df.Gender = df.Gender.map({'Female': 0, 'Male': 1})
df.Age = df.Age.map({'18 years or less': 0, '19 to 24 years': 1, '25 years and above': 2})
df.CurrentJob = df.CurrentJob.map({'No': 0, 'Part Time': 1, 'Full Time': 2})
df.StudyHourPerDay = df.StudyHourPerDay.map({'1 - 2 hours': 0, '2 - 4 hours': 1, 'More than 4 hours': 2})
df.NumOfGadget = df.NumOfGadget.map({'None': 0, '1 - 3': 1, '4 - 6': 2, 'More than 6': 3})


df.SocialMediaSpend = df.SocialMediaSpend.map({'Not Applicable': 0, '1 - 2 Hours': 1, '2 - 4 Hours': 2, 'More than 4 Hours': 3})

Data Splitting (Training, Validation and Test Sets)


# Split the dataset into three parts: training, validation, test

!pip install scikit-learn --upgrade --quiet

from sklearn.model_selection import train_test_split
# Pick random subsets of rows for creating the test and validation sets.
# As a general rule of thumb, use around 60% of the data for the training set,
# 20% for the validation set and 20% for the test set.

# 20% of the data is put into test_df, the rest into train_val_df
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# train_val_df is split again into train and validation sets
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

Identifying Input and Target Columns


# Identifying Input and Target Columns

# input columns are from first column to the second last column
input_cols = list(train_df.columns)[0:-1]

target_cols = 'DepressionLevel'

print(input_cols)
print(target_cols)


# Create inputs and targets for the training, validation and test sets
# for further processing and model training.

train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_cols].copy()

val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_cols].copy()

test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_cols].copy()

# check if it is correct
train_targets

Identifying Numerical Data


!pip install numpy --quiet
import numpy as np

# Identify which of the columns are numerical and which are categorical
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()

# View some statistics for the numeric columns
train_inputs[numeric_cols].describe()

Feature Scaling
# Scaling numeric features to a (0, 1) range
df[numeric_cols].describe()

# Use MinMaxScaler from sklearn.preprocessing to scale values to the (0, 1) range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(df[numeric_cols])


print('Minimum')
list(scaler.data_min_)

print('Maximum')
list(scaler.data_max_)

# Separately scale the training, validation and test sets using the scaler's transform method
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

# Verify that values in each column lie in the range (0, 1)
train_inputs[numeric_cols].describe()

Saving Processed Data to Disk


print('train_inputs:', train_inputs.shape)
print('train_targets:', train_targets.shape)
print('val_inputs:', val_inputs.shape)
print('val_targets:', val_targets.shape)
print('test_inputs:', test_inputs.shape)
print('test_targets:', test_targets.shape)

!pip install pyarrow --quiet

# Saving Processed Data to Disk


# avoid repeating the preprocessing steps every time

train_inputs.to_parquet('train_inputs.parquet')
val_inputs.to_parquet('val_inputs.parquet')
test_inputs.to_parquet('test_inputs.parquet')

%%time
pd.DataFrame(train_targets).to_parquet('train_targets.parquet')


pd.DataFrame(val_targets).to_parquet('val_targets.parquet')
pd.DataFrame(test_targets).to_parquet('test_targets.parquet')

%%time

# read the data back using pd.read_parquet


train_inputs = pd.read_parquet('train_inputs.parquet')
val_inputs = pd.read_parquet('val_inputs.parquet')
test_inputs = pd.read_parquet('test_inputs.parquet')

train_targets = pd.read_parquet('train_targets.parquet')[target_cols]
val_targets = pd.read_parquet('val_targets.parquet')[target_cols]
test_targets = pd.read_parquet('test_targets.parquet')[target_cols]

# verify that the data was loaded properly.


print('train_inputs:', train_inputs.shape)
print('train_targets:', train_targets.shape)
print('val_inputs:', val_inputs.shape)
print('val_targets:', val_targets.shape)
print('test_inputs:', test_inputs.shape)
print('test_targets:', test_targets.shape)

Algorithm Implementation (Model Training)


# Training a Logistic Regression Model using scikit-learn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

# train the model using model.fit.


model.fit(train_inputs[numeric_cols], train_targets)

print(model.coef_.tolist())

# Each weight is applied to the value in a specific column of the input.
# The higher the weight, the greater the impact of that column on the prediction.
print(model.intercept_)

Model Evaluation
# Use the trained model to make predictions on the training, validation and test sets
X_train = train_inputs[numeric_cols]
X_val = val_inputs[numeric_cols]
X_test = test_inputs[numeric_cols]

train_preds = model.predict(X_train)
train_preds

train_targets

train_probs = model.predict_proba(X_train)
train_probs

model.classes_

Confusion Matrix (Accuracy and F1 Score)


'''
We can test the accuracy of the model's predictions by computing the
percentage of matching values in train_preds and train_targets.
This can be done using the accuracy_score function from sklearn.metrics.
'''
from sklearn.metrics import accuracy_score
accuracy_score(train_targets, train_preds)
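
The F1 scores and confusion matrices quoted in this report can be obtained in the same way;
a sketch for the training set (assuming the variables defined above, and applying equally to
the validation and test sets) is:

# Sketch: F1 score and row-normalised confusion matrix for the training set.
from sklearn.metrics import f1_score, confusion_matrix

print('Train F1:', f1_score(train_targets, train_preds))
print(confusion_matrix(train_targets, train_preds, normalize='true'))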

Prediction on New Inputs


# Helper function to make predictions for individual inputs
def predict_input(single_input):
    # Pass a list containing the given dictionary to the pd.DataFrame constructor
    input_df = pd.DataFrame([single_input])

    # Scale numerical features using the scaler created earlier
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])

    X_input = input_df[numeric_cols]

    # Make a prediction using model.predict
    # output 0 -> mild
    #        1 -> severe
    pred = model.predict(X_input)[0]

    # Check the probability of the prediction
    prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]

    return pred, prob
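
A hypothetical usage example of the helper (here simply reusing the first raw, unscaled row of
the training dataframe as a stand-in for a new student's answers; any dictionary keyed by the
input columns would work the same way):

# Hypothetical usage of predict_input with an existing row as stand-in input.
sample_input = train_df[input_cols].iloc[0].to_dict()
pred, prob = predict_input(sample_input)
print('Prediction (0 = mild, 1 = severe):', pred, 'with probability', prob)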


12.0 References
Simmons, W. K., Burrows, K., Avery, J. A., Kerr, K. L., Bodurka, J., Savage, C. R., & Drevets, W. C. (2016, January 22). Depression-related increases and decreases in appetite: Dissociable patterns of aberrant activity in reward and interoceptive neurocircuitry. The American Journal of Psychiatry, 173(4), 418–428. https://doi.org/10.1176/appi.ajp.2015.15020162

Zahn, R., Lythe, K. E., Gethin, J. A., Green, S., Deakin, J. F., Young, A. H., & Moll, J. (2015, November 1). The role of self-blame and worthlessness in the psychopathology of major depressive disorder. Journal of Affective Disorders, 186, 337–341. https://doi.org/10.1016/j.jad.2015.08.001

Maxwell, M. A., & Cole, D. A. (2009, January 7). Weight change and appetite disturbance as symptoms of adolescent depression: Toward an integrative biopsychosocial model. Clinical Psychology Review. https://www.sciencedirect.com/science/article/abs/pii/S0272735809000075

Theobald, M. (2013, January 23). Depression, memory loss, and concentration: The far-reaching effects. Everyday Health. https://www.everydayhealth.com/hs/major-depression/depression-memory-loss-and-concentration/

Griffey, H. (2019, April 11). How do depression and anxiety affect concentration? Welldoing. https://welldoing.org/article/how-do-depression-anxiety-affect-concentration

Al Salman, Z. H., Al Debel, F. A., Al Zakaria, F. M., Shafey, M. M., & Darwish, M. A. (2020). Anxiety and depression and their relation to the use of electronic devices among secondary school students in Al-Khobar, Saudi Arabia, 2018–2019. Journal of Family & Community Medicine, 27(1). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6984035/

Madhav, K. C., Prasad, S. P., & Sherchan, S. (2017). Association between screen time and depression among US adults. Preventive Medicine Reports, 8, 61–71. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5574844/

Santoro, I. (2021, December 6). How being a full-time student and employee affects mental health. The Anchor Newspaper. https://www.anchorweb.org/post/how-being-a-full-time-student-and-employee-affects-mental-health

Tracy, N. (2022). Depression and slow thinking (reduced processing speed). HealthyPlace. https://www.healthyplace.com/depression/symptoms/depression-and-slow-thinking-reduced-processing-speed

B, H. N. (2020, June 1). Confusion matrix, accuracy, precision, recall, F1 score. Medium. https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd

sklearn.linear_model.LogisticRegression. (n.d.). scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Stefanovic, S. (2021, April 2). #005 PyTorch - Logistic regression in PyTorch. Master Data Science. https://datahacker.rs/005-pytorch-logistic-regression-in-pytorch/

Tune hyperparameters for classification machine learning algorithms. (2020, August 27). Machine Learning Mastery. https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/

Zeta, C. Z. (2021, March 7). 5 effective ways to improve the accuracy of your machine learning models. Towards Data Science. https://towardsdatascience.com/5-effective-ways-to-improve-the-accuracy-of-your-machine-learning-models-f1ea1f2b5d65

WebMD Editorial Contributors. (2021, March 25). What to know about depression in college students. WebMD. https://www.webmd.com/depression/what-to-know-about-depression-in-college-students

Verma, Y. V. (2022, May 6). How to improve the accuracy of a classification model? Analytics India Magazine. https://analyticsindiamag.com/how-to-improve-the-accuracy-of-a-classification-model/

Shin, T. S. (2020, September 23). How I consistently improve my machine learning models from 80% to over 90% accuracy. KDnuggets. https://www.kdnuggets.com/2020/09/improve-machine-learning-models-accuracy.htm

Karimjee, M. K. (2021, October 29). Anhedonia. Healthline. https://www.healthline.com/health/depression/anhedonia

Ahmed, G., Negash, A., Kerebih, H., Alemu, D., & Tesfaye, Y. (2020, July 28). Prevalence and associated factors of depression among Jimma University students: A cross-sectional study. International Journal of Mental Health Systems. BioMed Central. https://ijmhs.biomedcentral.com/articles/10.1186/s13033-020-00384-5

Burry, M. B. (2020, April 13). Why depression makes you tired and how to deal with fatigue. Insider. https://www.insider.com/guides/health/mental-health/why-does-depression-make-you-tired

Johns Hopkins Medicine. (n.d.). Depression and sleep: Understanding the connection. https://www.hopkinsmedicine.org/health/wellness-and-prevention/depression-and-sleep-understanding-the-connection

Grover, K. G. (n.d.). Advantages and disadvantages of logistic regression. OpenGenus IQ. https://iq.opengenus.org/advantages-and-disadvantages-of-logistic-regression
