Python Tutorial Text 2024-1
This tutorial has been made in collaboration with Dieter Brughmans, Sofie Goethals, James Hinns,
Bjorge Meulemeester, Stiene Praet, Yanou Ramon, Jellis Vanhoeyveld, Tom Vermeire and Ine Weyts.
September, 2024
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1 Key concepts 7
1.1.1 Programming languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Benefits of Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.3 Modules and packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.4 Integrated Development Environments (IDE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Installation and configuration 10
1.2.1 Visual Studio Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Python interpreter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.4 Virtual environment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Getting started 14
1.3.1 Writing python files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.2 Executing code in an interactive window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.3 Writing notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Code snippets 16
1.4.1 Python comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 BASICS IN PYTHON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Python Fundamentals 19
2.1.1 Variable types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.2 Flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1.4 Naming conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.1.5 Different types of errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Basic Python libraries for Data Science 42
2.2.1 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2.2 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.2.3 Visualization using Pandas and Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.4 Scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.3 Chapter challenge: Putting it all together 56
This Python tutorial is specifically designed for people interested in performing data mining tasks.
Apart from this introduction chapter, it contains three chapters. Chapter 2 covers the very basics
you need to master before continuing with the rest of this tutorial. It highlights a number of basic
Python concepts (variables, for/while loops, functions, etc.) and explains a number of data mining
toolboxes/packages in more detail. We strongly recommend that you read the extra reading material,
run the code snippets provided in the text and do the exercises. This will be
very helpful in the follow-up chapters. If you struggle with the contents of this chapter, we advise
you to practice a bit more until you get the hang of the Python fundamentals. The next chapters will
require a good understanding of these basics. Chapter 3 provides a step-by-step guide that covers
the entire chain of operations (data exploration, data preprocessing, modelling and evaluation) that
are required in a data mining environment. This chapter will also be your main guide in any data
science challenge you may encounter. Chapter 4 provides several useful tools and includes tips and
tricks to facilitate data science projects and programming.
In the tutorial notes, there will often be remarks as shown below. These remarks can refer to
important points, to chapters of the book by Provost & Fawcett (2013), Data Science for Business:
What You Need to Know About Data Mining and Data-Analytic Thinking, or they can highlight
some additional reading material or fun facts.
R This is a remark.
Python is one of the many programming languages available. Other notable contenders include C++,
C#, JavaScript and Java. All programming languages have their own benefits and drawbacks which
make them suitable for different tasks. When it comes to data mining, Python is widely favoured due
to its readability, relatively low learning threshold, and its comprehensive modules.
R Kaggle conducted a worldwide survey in 2018 of 23,859 data professionals, asking two
questions: (1) What programming languages do you use on a regular basis? and (2) What
programming language would you recommend an aspiring data scientist to learn first? Both
answers reveal Python as the clear winner.
Comparing programming languages to normal languages can be helpful: just as the grammar and
syntax vary between languages like French and German, they also differ among programming
languages such as Python and C++. However, despite these differences, both can convey the same
ideas and give the same instructions through different expressions. It’s important not to underestimate
these differences: some human languages have words that simply don’t exist in other languages,
just like coding languages have statements that don’t exist in all languages. For instance, the
Dutch word gezellig is not easily translated to English. The best possible translation cozy does not
entirely capture the warmth and fun gezellig embodies. Another example is tonal languages such
as Chinese, where the same words pronounced with different tones produce different meanings.
Among programming languages, C++ is compiled: the whole program is translated from beginning
to end before it even begins to execute an instruction, while Python is interpreted more or less line
by line, with your computer immediately running each instruction. This difference means that if you
make an error in C++, you are typically told immediately at compile time, while in Python it might
take minutes or hours before an erroneous line is reached and spotted by your computer.
Thanks to Python's large ecosystem of modules, you can often simply import the code you need
and continue. This lets you focus on other important tasks and speeds up the development phase.
Python can make use of modules to enrich its capabilities and avoid the situation where the
programmer has to start from scratch for each project. These modules may be written in Python, but
also in other programming languages. It is up to the module to integrate with Python; the actual
calculations can, for example, be outsourced to the much faster C++ language, which is commonly
done to speed up Python processes. Modules written in Python can, in turn, make use of
other modules. The Scikit-learn module, for example, uses the SciPy and NumPy modules.
While these are considered core modules for scientific Python (they come pre-installed with some
Python distributions, such as Anaconda), you can imagine that some modules have other module
dependencies that get very complicated, or are incompatible with other modules. One Python module
in particular makes the programming language especially suitable for data mining: Scikit-learn
(see https://scikit-learn.org/stable/index.html). This is a huge toolbox that has an
enormous amount of pre-written code, and can do anything your heart may possibly desire during a
data mining project. More on the Scikit-learn module follows in Chapter 3.
R IMPORTANT: If you have any questions about what a module or function does, or about the
meaning of certain parameters, please check the online documentation first. This is a great
resource to help you understand these details. You can find the documentation by looking up
documentation + function name or module name.
Python itself is just a language, like English, while the IDE is the program in which you write the
language, like you would write English in Microsoft Word, for example. The job of an IDE is larger than just
showing the code, however. It not only underlines syntax errors but also reads and executes code,
keeps track of the Python modules etcetera. The Integrated Development Environment we will use
in the Machine Learning course is Visual Studio Code. Visual Studio Code works consistently and
seamlessly on Windows, Linux and macOS, and has a large marketplace of extensions including
GitHub Copilot and cloud-related extensions.
If you prefer to use a different IDE, you are welcome to do so. However, please note that the
assistants will only be able to support issues related to Visual Studio Code. If you encounter any
problems with another IDE, you will need to resolve them on your own.
1.2.3 Extensions
Visual Studio Code is an IDE that supports multiple programming languages, allowing developers to
write code in various languages without restrictions. To enable Python programming in VS Code,
you need to install a couple of extensions from the VS Code Marketplace.
An extension in VS Code is a plugin you can install through the Marketplace that adds specific
features or capabilities to the VS Code editor, enhancing its functionality. Extensions can include
support for additional programming languages, debugging tools, code beautification, colour themes
and many more. Essentially, they allow users to customise and extend their development environment
to better suit their specific needs and workflow. These extensions are created by both Microsoft and
the online community. You can find the marketplace by clicking on the extensions icon in the sidebar
on the left-hand side of the screen or by pressing Ctrl + Shift + X. The two extensions used in this
course are the Python extension and the Jupyter extension. The Python extension is needed to be
able to write and run Python code. The Jupyter extension will be used to interactively write and
execute code, and to work with notebooks. You simply need to install the extensions as shown in
Figure 1.4. In Chapter 4, we will discuss some additional extensions that can help you when writing
code, however, for the basics of this tutorial, only the Python extension and the Jupyter extension are
needed.
To create your virtual environment you can follow the steps below:
1. Download the zip file Main_Tutorial_STUDENTS from Blackboard and unpack it.
2. Open Visual Studio Code and open the Main_Tutorial_STUDENTS folder. This can be done
using the File > Open Folder buttons in the top left corner of your screen.
3. Go to the terminal. You can usually find it at the bottom of your screen (see Figure 1.5). If it
is not visible, click on the three dots at the top of your screen and select Terminal > New
Terminal (see Figure 1.6); the terminal will open at the bottom of your screen. In the terminal,
type python -m venv venv and press enter. This creates a new virtual environment.
REMARK: When VS Code asks you whether you want to select the new environment for the
workspace folder, click yes.
4. Next, you activate the virtual environment by typing a command in the VS Code terminal.
• On Windows: .\venv\Scripts\activate
• On macOS and Linux: source venv/bin/activate
5. The final step is to install all required packages described in the requirements.txt file. You do
this by typing in the terminal: python -m pip install -r requirements.txt
If VS Code gives you the error that running scripts is disabled on your system, follow these steps
first:
1. Search for Windows PowerShell in your Start Menu and select ’Run as Administrator’
2. Check your current execution policy by running this command: Get-ExecutionPolicy
This will likely show ’Restricted’, which means scripts are not allowed to run.
3. Change your execution policy to unrestricted by running Set-ExecutionPolicy -ExecutionPolicy
Unrestricted -Scope CurrentUser
4. Close PowerShell and restart Visual Studio Code to apply the new settings.
5. Now you can try again to activate the virtual environment using the previous step-by-step
guide.
Now you should be all set and have the required packages installed for the remainder of this tutorial.
Figure 1.5: Screenshot of Visual Studio Code. The main components are shown: the Python file
editor, the terminal, the extensions section (and two useful layout features).
Figure 1.7: Screenshot of Visual Studio Code. The main components of working with the interactive
window are shown: the interactive window, the Jupyter variable explorer and the Python file editor.
In a notebook, you can combine code cells with text cells to add more structure to the notebook
file. Jupyter notebook files have a .ipynb file extension. You can save the interactive window as a
Jupyter Notebook.
In the remaining chapters of this tutorial, you will use (incomplete) Python scripts as a guideline for
the exercises. When you see the #TODO comment, you need to complete the code before it can be
executed. All these scripts can be found in the main Python tutorial folder that we share with you on
Blackboard (as a zip file called Main_Tutorial_STUDENTS).
Exercise 1.1
1. Type the code snippet shown above in the Python terminal. Do this again in a .py file and
run it in an interactive window.
2. Save this file as code_snippet.py.
■
Another use of the hashtag "#" symbol is to temporarily comment out a particular piece of code.
Besides this symbol, you can comment out a block of text by enclosing it in triple quotation marks
(technically this creates a multi-line string that Python evaluates and discards):
1 """
2 This is a text block that is being commented .
3 This can be convenient in case your explanation for a Python file
4 or a piece of code spans multiple lines .
5 """
2. BASICS IN PYTHON
In this chapter, we provide a brief background in Python to get you up and running. This includes the
very fundamentals of the Python language (e.g., variable types, if statements, for/while loops etc.),
but also basic data analysis skills such as how to read and write data, explore data and visualize data.
You will get familiar with the most important libraries to get started using Python for data science.
Here, you tell Python to remember some value, so that you don’t have to do it yourself. Beware
that the equal sign "=" is not the same as a mathematical equal sign. This statement does not tell
Python that the two sides ("income" and the decimal value 2100.00) are the exact same. Rather,
it saves the value 2100.00 in the random access memory (RAM), so it can be retrieved
later on using the name of the variable: "income". When you re-use the same name for some new
value, Python will save the new value under the same name, losing the previous one.
1 income = 2100.00
2 income = 1800.50 # re-using the name: the previous value is lost
3 print(income) # prints 1800.5
The last statement prints the new value of the variable "income".
String
A string is a data type that is used to store a sequence of characters. In the example below, the
variables basic_string, sentence, name and age are used to store different strings:
1 basic_string = "Hello, World!"
2 sentence = "Belgium has a total of 10 provinces"
3 name = 'Charlie'
4 age = '47'
As you can see, strings can be created by enclosing the characters with single quotes (’) or double
quotes ("). Hence, age is considered as a string object (and not an integer because it has a numeric
value). To check if an object is of type string str, you can use the isinstance() function. For
example:
1 var_1 = 5
2 var_2 = "5"
3 print(isinstance(var_1, str)) # should print False
4 print(isinstance(var_2, str)) # should print True
Strings can be added together by using the + operator. This means that a new string is created which
puts together the two strings on both sides of the + operator. For example:
1 name = " Charlie "
2 age = " 47 "
3 print ( " Your name is " + name + " and your age is " + age )
4 # returns " Your name is Charlie and your age is 47"
When using the + operator to concatenate strings, it is important that all variables used in the
expression are also strings. Here, the code would not work if age = 47 (an integer) was used instead
of "47", a string containing the number 47.
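If a variable holds a number, you can first convert it to a string with the str() constructor; a small illustration:
age = 47
print("Your name is Charlie and your age is " + str(age))
# str(age) turns the integer 47 into the string "47", so the concatenation works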
You can also use the formatted string literals, commonly known as f-strings. F-strings provide a
way to embed expressions inside string literals using curly braces {}. You can easily concatenate
multiple strings and include variable values in an f-string. When using f-strings, it is possible to add
numeric values to the expression:
1 name = " Alice "
2 age = 30
3 city = " New York "
4 # Concatenating multiple strings with variables using an f - string f " "
5 print ( f " My name is { name } , I am { age } years old , and I live in { city }. " )
6 # returns My name is Alice , I am 30 years old , and I live in New York .
String multiplication is also possible, where a new string is created that consists of multiple copies
of the original string. The multiplication * operator is used to achieve this. For example:
1 letters = " abc "
2 print ( letters *3) # returns abc abc abc
A Boolean check (True or False) to see if a string (or substring) appears in another string is also
possible. The keyword in is used for this check. Note that the comparison is case sensitive. To
illustrate:
1 quote = " Belgium has great cities like Bruges and Antwerp "
2 print ( " Antwerp " in quote ) # returns True
3 print ( " Antw " in quote ) # returns True
4 print ( " antw " in quote ) # returns False
It may be useful to obtain all the substrings occurring in a given string, separated by a certain
delimiter. The split() function returns a list (see the discussion on lists in Section 2.1.1) containing
the individual substrings. To illustrate:
1 quote = " the die is cast , Suetonius said to Caesar "
2 # return a list containing substrings separated by whitespace
3 print ( quote . split () )
4 # [" the " , " die " , " is " , " cast ," , " Suetonius " , " said " , " to " , " Caesar "]
5
6 # return a list containing substrings separated by a comma
7 print ( quote . split ( " ," ) )
8 # returns [" the die is cast " , " Suetonius said to Caesar "]
Exercise 2.1 Obtain the individual substrings of the string ’I|am|learning|Python’ (with |
as a delimiter). ■
Apart from the aforementioned split() function, Python has a large number of built-in methods
you can use on string objects. A couple of examples: find(), rfind(), replace(), count(),
strip(), lower(), upper(). We refer to https://docs.python.org/3/library/stdtypes.html#string-methods
for a complete overview.
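As a brief illustration of a few of these methods (outputs shown in comments):
quote = "Belgium has great cities like Bruges and Antwerp"
print(quote.lower())                      # belgium has great cities like bruges and antwerp
print(quote.upper())                      # BELGIUM HAS GREAT CITIES LIKE BRUGES AND ANTWERP
print(quote.find("great"))                # 12 (index where the substring starts)
print(quote.count("a"))                   # 3 (number of occurrences, case sensitive)
print(quote.replace("Antwerp", "Ghent"))  # Belgium has great cities like Bruges and Ghent
print("  Antwerp  ".strip())              # Antwerp (leading/trailing whitespace removed)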
Numeric data
The two most common data types to represent numeric values are integers and floats. Integers
represent whole numbers (positive and negative), while floats approximate real numbers.
R Integers have an unlimited precision: we can store and perform calculations of integers of any
size without errors. Floats, on the other hand, have a limited precision which depends on the
specific platform on which the program is running. A limited precision means that small errors
can occur when performing calculations due to the fact that the entire range of real numbers
is not fully represented: the set of numbers that a floating point number can adopt is a subset
of all possible real numbers. The following command shows you platform specific details
regarding floating point numbers (e.g. machine precision, largest possible number, smallest
possible number):
1 import sys # ignore this for now, we cover import statements later
2 print(sys.float_info)
Integers are created by writing a sequence of digits to form a natural number (e.g. 24586 is an
integer). Floats are created by writing a sequence of digits with a dot in between to separate the
whole number part from the fractional part of a number (e.g. 24586.52 is a float). We illustrate their
creation below and also point to the most common mathematical operations one can apply to such
numeric data types. Note you can combine the use of integers and floats in these operations and that
the result is a float, e.g. writing 4 + 5.0 (integer + float) results in the float 9.0.
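A brief sketch of such operations (the values are arbitrary; results in comments):
a = 24586       # an integer
b = 24586.52    # a float
print(a + b)    # addition: 49172.52 (int + float gives a float)
print(a - 86)   # subtraction: 24500
print(a * 2)    # multiplication: 49172
print(a / 4)    # division: 6146.5 (the / operator always returns a float)
print(a // 4)   # floor division: 6146
print(a % 4)    # modulo (remainder): 2
print(2 ** 10)  # exponentiation: 1024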
Exercise 2.2 Multiply the numbers 7, 8 and 24. Calculate the remainder of the division of that
number divided by 3 (mod 3). ■
It is also possible to convert a string containing nothing but a number into a float or integer. Likewise,
the transformation of a float to an integer (or vice versa) can be done. We hereby rely on the int()
and float() constructors, for example:
1 # conversion of a string
2 string = "340" # a string
3 a = int(string) # an integer with value 340
4 b = float(string) # a float with value 340.0
5
6 # transform float to integer
7 c = int(340.954) # an integer with value 340 (int() truncates towards zero)
8
9 # transform integer to float
10 d = float(59) # a float with a value of 59.0
R As a sidenote, you may also encounter binary numbers (with prefix 0b) and hexadecimal
numbers (with prefix 0x) in Python code. Binary numbers consist of digits taking on the values
of 0 and 1. Hexadecimal numbers have characters that can take on the standard digits 0-9 and
letters a(=10), b(=11), c(=12), d(=13), e(=14), f(=15). The hexadecimal number a7c can be
converted to our standard decimal system as follows: a7c = 10 × 16^2 + 7 × 16^1 + 12 × 16^0 =
2684. The following example illustrates how one creates binary or hexadecimal numbers, how
to perform simple addition and how to transform the result into a standard integer.
1 # binary numbers
2 a = 0b1101
3 b = 0b0001
4 print(bin(a + b)) # returns 0b1110
5 print(int(a + b)) # returns 14
6
7 # hexadecimal numbers
8 c = 0x0010
9 d = 0x000f
10 print(hex(c + d)) # returns 0x1f
11 print(int(c + d)) # returns 31
List
A list can be seen as a container object used to store several items. One can, for instance, construct a
list consisting entirely of numbers or one made up entirely of strings (or a mixture of both). In fact,
most Python object types can be stored in a list. A list can even contain another list as one of its
elements. The size of the list or the elements inside the list can change, which means a list is a
mutable object. It belongs to the class of iterable objects, because one can iterate (loop) over the
list's elements. A list can be created by enclosing its elements with square brackets [] and separating
them by commas. A list’s elements are accessed by means of indexing, e.g. a[1:3] is a sub-list
containing the elements at position 1 and position 2 of the list a. Note that indexing starts counting
from 0 and the last index is not included, e.g. b[0:3] will contain the elements from list b with index
0 up to but not including index 3 (so index 0, 1 and 2). The concepts of list creation, accessing list
elements and changing elements of a list are illustrated in the code snippet.
1 # List creation
2 names = ["Ada", "Steve", "Mohammed", "Anna", "Jane", "Samantha"] # strings
3 ages = [21, 23, 19, 23, 20, 17] # integers
4 names_ages = [names, ages] # combined lists
5
6 # Access list elements
7 print(names[0]) # shows the string "Ada"
8 print(names_ages[0]) # shows the names list
9 print(names_ages[0][0]) # shows the string "Ada"
10 print(names[0:2]) # shows the list ["Ada", "Steve"]
11 print(ages[1:6:2]) # shows the list [23, 23, 17] - starting at index 1 up to (not including) index 6 with a step size of 2
12
13 # Change elements in lists
14 names[5] = "Sam"
15 names[3:5] = ["Annabel", "Janeth"]
16 print(names) # ["Ada", "Steve", "Mohammed", "Annabel", "Janeth", "Sam"]
17 print(names_ages[0]) # same result as print(names)
18 ages[0] = 30
19 names_ages[1][1:3] = [50, 50]
20 print(ages) # shows [30, 50, 50, 23, 20, 17]
21 print(names_ages[1]) # shows [30, 50, 50, 23, 20, 17]
One can also access elements of a string in a similar way. As far as Python is concerned, a string
is a sequence of individual characters. The code A[a:b:step] selects from the string variable A
the substring starting at index position a (inclusive) up until position b (exclusive) in steps of size
step. If a, b or step are not given, their default values are: a = 0, b = the length of the string (so
the slice runs to the end of the string) and step = 1. Python indexing starts counting at zero, which is true for other
object types as well (e.g., A[0] is the first character of string A). To illustrate:
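Below is a small illustrative example (the example string is arbitrary; outputs in comments):
a_string = "Hello, World!"
print(a_string[0])    # H (the first character)
print(a_string[0:5])  # Hello
print(a_string[7:])   # World! (from index 7 up to the end)
print(a_string[::2])  # Hlo ol! (every second character)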
Exercise 2.3 Open Visual Studio Code and complete the code snippet below: ■
Exercise 2.4 Complete the code snippet below in Visual Studio Code. ■
1 a = [3, 2, 4, 1, 5, 6]
2 a.append(7)
3 print(a) # returns [3, 2, 4, 1, 5, 6, 7]
4 a.remove(4)
5 print(a) # returns [3, 2, 1, 5, 6, 7]
6 a.sort()
7 print(a) # returns [1, 2, 3, 5, 6, 7]
8
9 b = ["Eva", "Aisha", "Sophia", "Mei"]
10 b.sort()
11 print(b) # returns ["Aisha", "Eva", "Mei", "Sophia"]
12
13 # complete the following code
14 c = [1, 2, 3, 5, 100]
15 # TODO # Add the number 4 to c, so that c = [1, 2, 3, 4, 5, 100]
16 final_number = # TODO # Look online for a command to remove and also return the number 100 from c.
17 print(final_number)
18 print(c)
Sometimes, it may be convenient to perform operations on a copy of a list, with the intent to keep
the original list as is. For example, say that list a contains all the substrings appearing in a document
(and that you want to retain). Next, you construct a new list b that initially coincides with a (b=a).
Finally, you remove from b the following substrings: the, and, or. If you write b = a, with a a list,
then b points to the same piece of memory as a (they are the same object). Hence, modifying b
will translate to the same modifications in a. You can circumvent this behaviour by using the copy
function: b = a.copy() (now b and a are independent of one another) or by writing b = list(a)
(the constructor of list returns a new object). A simple example:
1 a = [1, 2, 3, 4, 5]
2 b = a
3 b.append(6)
4 b.append(7)
5 print(a) # returns [1, 2, 3, 4, 5, 6, 7]
6
7 a = [1, 2, 3, 4, 5]
8 b = a.copy()
9 b.append(6)
10 b.append(7)
11 print(a) # returns [1, 2, 3, 4, 5] - the original list is unchanged
Sometimes, you may want to verify if a certain string appears in a list of strings. You use the in
keyword to do so:
1 print ( " Gregg " in [ " Jane " , " Sean " , " Bob " , " Gregg " ]) # returns True
One can also compare two lists: the check a == b verifies if the elements inside both lists are
identical (though a and b can still be different objects residing in different parts of memory). To
check if both lists are the same object (point to the same piece of memory): a is b. For example:
1 a = [1, 2, 3]
2 b1 = [1, 2, 3]
3 print(a == b1) # returns True
4 print(a is b1) # returns False
5 b2 = list(a)
6 print(a == b2) # returns True
7 print(a is b2) # returns False
8 b3 = a
9 print(a == b3) # returns True
10 print(a is b3) # returns True
List comprehensions
List comprehension offers a concise way to create lists. This powerful feature of Python allows
you to generate a new list by applying an expression to each element in an existing iterable (like a
list or a range). The syntax for list comprehension is more compact and often more readable than
traditional for loops used for the same purpose.
The basic syntax for list comprehension is: [expression for item in iterable if condition]
Where:
• expression: The operation or calculation to apply to each element.
• item: A variable representing each element in the iterable.
• iterable: Any Python iterable (e.g., list, range, string) that you want to process.
• condition (optional): A filter that allows only elements that meet a certain criterion to be
included in the new list.
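For example, to square the numbers in a list (a brief sketch; results in comments):
numbers = [1, 2, 3, 4, 5, 6]
squares = [n ** 2 for n in numbers]
print(squares)       # [1, 4, 9, 16, 25, 36]

# with the optional condition: only keep the squares of the even numbers
even_squares = [n ** 2 for n in numbers if n % 2 == 0]
print(even_squares)  # [4, 16, 36]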
Tuple
Tuples are very similar to lists and can be regarded as a container object storing several items.
The main difference with a list is that tuples are immutable (they cannot be modified after creation).
A tuple is created by enclosing its elements with round brackets (), for example: t =
("Yes","No","Maybe"). Note that using the brackets is a good convention, though it is not strictly
necessary: writing t = "Yes","No","Maybe" works as well and creates the same tuple as in the
previous example. A tuple's elements are accessed with square brackets []. To illustrate, t[1:3] is the
sub-tuple ("No","Maybe") and t[0] corresponds to the string "Yes". Below, we outline some Python
code to further illustrate tuple creation and accessing of its elements.
1 # create tuples
2 t = ("Belgium", "Sweden", "Germany", "Finland", "Greece")
3 u = ("Brussels", "Stockholm", "Berlin", "Helsinki", "Athens")
4 v = (1, 1, 2, 3, 5, 8, 13)
5 # access tuple elements
6 print(t[0:3]) # returns ("Belgium", "Sweden", "Germany")
7 print(u[0:5:2]) # returns ("Brussels", "Berlin", "Athens")
8 print(v[6]) # returns 13
9 # attempt to modify a tuple's elements
10 v[0] = 0 # Error (tuple object does not support item assignment)
Exercise 2.5 Create a tuple student storing your personal details: name, age, length, major study
direction and ZIP (postal) code. Next, write code to
1. access your name
2. access your major study direction
■
Note that a tuple can also contain a tuple as one of its elements. Furthermore, it is also possible
to create tuples that contain mutable objects, such as lists. Even though a tuple cannot change, the
mutable objects inside a tuple can change. For example:
1 t = ( " Train " , 425 , [1 ,2 ,3 ,4 ,5])
2 t [2]. append (6)
3 t [2]. remove (3)
4 print ( t ) # returns (" Train " , 425 , [1 , 2 , 4 , 5 , 6])
We can also store the elements of a tuple into separate variables. This is called ’unpacking’ of a tuple
and requires that there are as many variables on the left side of the equals sign as there are elements
in the tuple (the same is true for a number of other data types such as lists as well). The code block
below illustrates the unpacking of a tuple:
1 t = ( " yes " , " no " , 123 , [1 , 2 , 3 , 4 , 5])
2 ( Str_1 , Str_2 , Num_1 , List_1 ) = t # tuple unpacking
3 print ( Str_1 ) # prints " yes "
4 print ( Str_2 ) # prints " no "
5 print ( Num_1 ) # prints 123
6 print ( List_1 ) # prints [1 , 2 , 3 , 4 , 5]
Dictionaries
A dictionary is a collection of key-value pairs. Its name is derived from real-life dictionaries (e.g.
Van Dale), where the keys correspond to all the words occurring in a language and the associated
values contain descriptions of the words. Python dictionaries are more general and can contain
many kinds of key-value pairs. For example, in data mining, the keys may represent identifiers of
individual persons and the related values are a list of socio-demographic and financial variables to
describe a person. In fact, a key can take on any immutable object (string, integer, tuple) and values
may hold any data type (e.g. list, string, integer, dictionary etc.).
Dictionaries do not support positional indexing: you access their content via keys rather than via an
index (since Python 3.7, dictionaries preserve insertion order, but you still cannot index them by
position). Also, dictionaries come with a guarantee that there are no duplicates in their list of keys.
The keys are unique and attempting to give two values to the same key results in only retaining the
last value. Dictionaries are created by enclosing its content with curly braces {}. Key-value pairs are
entered as Key:Value. Below, we show how to create a dictionary.
1 # create a dictionary
2 user_age = {
3 " Kenneth " : 45 ,
4 " Hassan " : 23 ,
5 " Emma " : 53 ,
6 " Eve " : 63
7 }
Accessing the value associated with a key is done in the following fashion: dict_name[Key]. The
alternative is to use the get(Key) function, which accepts an optional second argument, get(Key,
default), specifying what to return if the key is not present. If the key exists, both direct accessing
and accessing via the get() function return the same value. The difference lies in the handling of
keys that are not defined in the dictionary. The direct-access bracket method will throw an error that
terminates program execution, while the get() function returns either None or the pre-specified
default. Continuing our example:
1 print ( user_age [ " Kenneth " ]) # returns 45
2 print ( user_age . get ( " Kenneth " ) ) # returns 45
3 print ( user_age [ " Gregg " ]) # throws an error
4 print ( user_age . get ( " Gregg " ) ) # returns None
5 print ( user_age . get ( " Gregg " ," Not present " ) ) # returns Not present
It is also possible to change the content of an existing dictionary. To add a new key-value pair, you
can use dict_name[New Key] = New Value. Similarly, deleting a key-value pair is done with
the command del dict_name[Key]. Note that an error is raised in case the key does not exist. A
dictionary can also be cleared by writing dict_name.clear(). This removes all content from the
dictionary and the variable will then refer to an empty dictionary. Continuing our running example:
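(The specific additions and deletions below are illustrative; they are chosen so that they produce the key and value listings shown next.)
user_age["Maya"] = 23      # add a new key-value pair
user_age["Brandon"] = 32   # add another key-value pair
del user_age["Kenneth"]    # delete an existing key-value pair
del user_age["Hassan"]
# user_age.clear() would remove all remaining content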
Furthermore, one can obtain a list containing all the keys or values of a dictionary:
1 print(list(user_age.keys())) # ["Emma", "Eve", "Maya", "Brandon"]
2 print(list(user_age.values())) # [53, 63, 23, 32]
Exercise 2.6 Create a dictionary "Phone" to store the phone numbers (values) of four of your
friends (key is the name of the friend). After creation of the dictionary, add an emergency contact
with phone number "112". Next, access the phone number of one of your friends. Finish the
exercise by printing a list of all the names (keys) you have currently stored in your dictionary. ■
1 # construct a dictionary user with the keys being the name of a person and the values a list consisting of that person's age, height, mass, city
2 user = {
3     "Wei": [27, 1.85, 87, "Antwerp"],
4     "Layla": [23, 1.70, 76, "Brussels"],
5     "Glenn": [28, 1.82, 95, "Gent"]
6 }
7 # Access the attribute city of Wei
8 # TODO # should return Antwerp (hint: Dict[key][index in list])
9
10 # Add a new person James to dictionary user, with age = 31, height = 1.72, mass = 69 and city = "Bruges"
11 # TODO
12 print(user)
Besides the material we have already discussed, dictionaries also have other built-in methods such as
pop() and update(). You can view their documentation by typing help(dict). As a final remark,
remember that when you write B = A (with A a dictionary), then the new variable B points to the
same piece of memory as A (and hence they refer to the same dictionary object). Modifying B results
in the same changes to A. To avoid this behaviour, use the copy function B = A.copy() or call the
dictionary constructor B = dict(A).
Open the Python script Variable_Types_student.py. Modify the script by replacing the missing code
(#TODO statements) with your own code and uncommenting some lines of code. The average grade
of student Dennis P. is 15.33; did you find the same result? Regarding sub-exercise 5, if you choose
Kevin G. for the name variable, did you manage to print 'Student Kevin G. has grades [15, 17, 16]'?
Have you successfully added the grade of 14 to the list of grades of student Dave L.?
R If some things were not clear, please search online for more resources. For example,
https://data-flair.training/blogs/python-variables-and-data-types.
Below, we show a simple example of a body mass index (BMI) calculation and its division into
several categories ranging from underweight to extremely obese.
1 mass = 82 # the mass of a person in kg
2 height = 1.86 # the height of a person in meters
3 bmi = mass / (height ** 2) # ** is the exponentiation operator
4
5 if bmi < 18.5:
6     category = "Underweight"
7 elif 18.5 <= bmi <= 24.9:
8     category = "Normal Weight"
9 elif 24.9 < bmi < 30:
10     category = "Overweight"
11 elif 30 <= bmi < 40:
12     category = "Obese"
13 else:
14     category = "Extremely Obese"
15 print(category) # with mass 82 and height 1.86, prints Normal Weight
Let’s present a larger fictitious example of an insurance company that wants to assign a risk category
(low risk, medium risk, high risk) to a potential client applying for car insurance. The decision is
based on three characteristics of the applicant: the type of car to be insured (slow, average, fast), the
driver’s age and whether or not the individual has already been involved in a previous accident. A
possible risk assignment strategy is outlined next.
1 car_type = " Fast " # can take on values Fast , Slow , Average
2 driver_age = 25
3 previ ous_accident = False
4
5 # the indentation level is important !
6 if car_type == " Fast " :
7 if previous_accident :
8 print ( " This is a high risk driver " )
9 else :
10 if driver_age < 27:
11 print ( " This is a high risk driver " )
12 else :
13 print ( " This is a medium risk driver " )
14 elif car_type == " Average " :
15 if previous_accident :
16 print ( " This is a high risk driver " )
17 else :
18 if driver_age > 27 and driver_age < 65:
19 print ( " This is a low risk driver " )
20 else :
21 print ( " This is a medium risk driver " )
22 else :
23 if previous_accident :
24 print ( " This is a medium risk driver " )
25 else :
26 if driver_age <= 27 or } driver_age >= 65:
27 print ( " This is a medium risk driver " )
28 else :
29 print ( " This is a low risk driver " )
30
31 print ( " Code at this indentation level is always executed because it is
outside of the if statement structure . " )
In this example, one can easily infer that a driver of age 25 who owns a fast car and has not yet
been involved in previous accidents is assigned a high risk. As shown in the example, an if or elif
expression may directly contain boolean variables, yet it is also possible to compare numbers (e.g.
5 > 3) with comparison operators ( == (equal), >= (larger or equal), <= (smaller or equal), != (not
equal), etc.). Furthermore, one may also compare strings (i.e. car_type == "Fast"). The and and
or keywords facilitate the evaluation and combination of multiple expressions. Though we have
not indicated this explicitly in the previous code, the not keyword can be used in expressions as
well. For example, when previous_accident = True, then not previous_accident evaluates
to False. Also note that it is possible to nest if statements as illustrated in the previous example.
1 sunny = False
2 temp = 30
3
4 if temp > 35:
5     print("Stay inside, drink sufficient water")
6 elif 28 < temp <= 35 and sunny:
7     print("Put on sunscreen and wear sunglasses")
8 # Add a condition so that if the temperature is larger than 28 and smaller or equal to 35 and it is not sunny, the message "It is warm and cloudy outside" gets displayed
9 # TODO
10 else:
11     print("This is a regular day")
For loop
A for loop iterates over (loops over) each element of a collection. A collection is any iterable
object type, such as a range of numbers, a standard list (e.g. a list of strings) or a plain string.
Python’s for statement iterates over the elements of any collection in the order that they appear in
the sequence. We present a number of examples below. Note the occurrence of the in keyword
explained in Section 2.1.2 and the introduction of the range data type (useful to store integer numbers
in a pre-defined interval). range(a, b) includes a but excludes b.
1 # Print all values in the interval [3, 10]
2 for number in range(3, 11):
3     print(number)
4
5 # Print all elements appearing in a list
6 names = ["Nancy", "Tifanny", "Betty", "Laura"]
7 for name in names:
8     print(name)
Exercise 2.9 Construct a dictionary called 'jump' that stores the names of people jumping (keys)
and the distances they have jumped in successive trials (values, stored as a list). Tia has jumped
2.65 and 2.81 meters, Layla has jumped 2.90 and 2.87 meters. Maria jumped 2.98 and 3.05
meters. Finally, Jia jumped 2.76, 2.89 and 3.13 meters. Next, use a for loop to print, for each
player, the sentence: "Player X jumped a maximal distance of Y meters", with the correct values.
■
There are three basic Python keywords to further control the program flow in for loops: continue,
break and else. continue skips the remainder of the code (after the continue statement) in the
surrounding for-block and moves to the next iteration of the loop. break stops the execution of
the surrounding for loop (the current and possibly remaining iterations are skipped). The program
continues with the code appearing after the for loop. The else clause can be added to a for loop.
Statements written in the else part only get executed in case the for loop did not break. A simple
illustration:
1 """
2 Print out numbers in the range [1 , 100] that are divisible by the user
3 specified integers a or b ( so if the number is divisible by a or by b , it
4 should be printed ) . When a number in range [1 , 100] can be divided both by
5 a and b , the loop stops and this is the last value printed .
6 """
7 a = 35
8 b = 40
9 for num in range (1 ,101) :
10 if num % a == 0 and num % b == 0:
11 print ( num )
12 break
13 elif num % a == 0 or num % b == 0:
14 print ( num )
15 continue
16 else :
17 # this only gets executed in case the for loop does not break
18 print ( " There is no number in [1 , 100] divisible by both a and b " )
19 """
20 Running this program ( with a = 35 and b = 40) gives the following }
21 output :
22 35 40 70 80 There is no number in [1 , 100] divisible by both a and b
23 """
While loop
A while loop is similar to a for loop in the sense that it executes a block of code multiple times.
The major difference is that the number of iterations in a for loop is known in advance, whereas
in a while loop this is typically unknown. A while loop starts with checking a certain condition
(boolean expression) and executes the associated block of code when this condition is True. Next,
the process repeats itself and the condition is verified again. The statements in the block of code
keep on executing until the condition evaluates to False. Note that the continue, break and else
clauses can also be used in while loops.
As an example, the code block below presents the logic behind a simple guessing game. The program
generates a random number in the interval [1,10] and then simulates a user making multiple guesses
until they get the number correct. The number of guesses the simulated player can make is limited to
5 in the example.
1 import random # ignore this for now, we cover import statements later on
2
3 target = random.randint(1, 10) # random integer in range [1, 10]
4 guess = 0 # initialisation before the while loop
5 number_tries = 0 # initialisation before the while loop
6
7 print("Game started: Guess the number between 1 and 10")
8
9 while guess != target:
10     number_tries += 1
11     guess = random.randint(1, 10) # simulate a guess
12     print(f"Attempt {number_tries}: Guess = {guess}")
13
14     if guess == target:
15         print(f"You win, the value is indeed {target}")
16         break
17     if number_tries >= 5: # after the fifth wrong guess, the game is lost
18         print("You lose (exceeded number of guesses)")
19         break
2.1.3 Functions
Python functions come in three categories: the standard Python library that contains built-in functions
you can access straight away; the external Python modules which need to be imported by the user
and, finally, the manually defined functions implemented by yourself (user-defined functions). We
present more details with regard to each of these categories in the upcoming subsections.
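The built-in functions of the standard library are available without any import statement. As a quick way to see them all, you can, for example, list the built-in names with the dir() function:
print(dir(__builtins__))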
Running this code will print out a list containing: different kinds of exceptions, True, False, min, max,
None, abs, print, range, etc.
Python modules
The external Python modules are packages (or libraries) that other programmers have written and
which are not loaded in automatically. They are simply pieces of code that have been written to
perform specific tasks. They don’t even have to be written in Python! You have to import these
yourself at the beginning of your code file if you want to use them. These packages fall into two
types: packages that are included upon installing Python on your computer (e.g., the pip package,
used to install other packages) and packages that you need to install yourself, such as sklearn or
numpy. The terms packages, libraries and modules are often used interchangeably. But technically,
a module is a single file, a package is a collection of modules and a library is a collection of packages
or modules designed to help with a specific set of tasks.
In this section, we will deal with the second type of packages, namely the ones you have to import
yourself. We should note that, apart from the additional installation phase, the explanations in
the upcoming paragraphs are equally valid for the first type of packages. Please be aware that the
Python community is large and many modules have already been developed that may help you in
accomplishing your goals. There exist entire companies devoted to developing and maintaining a
package. This is one of the main reasons why Python is the preferred language for machine learning:
the packages written for machine learning are extremely efficient and very easy to use.
Modules can be included in your own code with the import command. For example, writing import
math (see https://docs.python.org/3/library/math.html) at the top of the file enables all
functionality of the math.py module. You can access methods of a package by writing
packagename.method() (e.g., math.factorial()). A Python method is like a Python function,
but it must be called on an object, in this case the package. Alternatively, it is also possible to import
only specific functions of a package in the following fashion: from packagename import methods
(e.g., from math import factorial, sqrt, pi). When imported like this, you call these methods
by their standard name (e.g., factorial(number)). An example of both import approaches is shown below.
Exercise: run the code below.
1 import math # option 1
2
3 print(math.sqrt(16))
4 print(math.factorial(5))
5 print(math.pi)
The second import approach is detailed below and results in the same produced output as in the
previous example.
1 from math import sqrt, factorial, pi # option 2
2
3 print(sqrt(16))
4 print(factorial(5))
5 print(pi)
There exists a third import option, which is from modulename import * (e.g. from math import
*). In this case, all methods included in the package become available to the user without a prefix
(e.g. sqrt(16) would work on its own). However, this approach is not recommended: the names of
all methods in the package are typically unknown to programmers, and when they write their own
code with method names identical to ones existing in the module, this will cause problems and may
raise errors.
Below, we indicate how we can retrieve some information about a module. We can access the
content of a module, its implementation and its documentation, although most of the time it is
easiest to simply look up the module on the internet. The implementation (source code) can be
accessed by pressing ctrl+click on the module name in the Python file after the import statement. The
documentation can be displayed with the help(modulename) or help(modulename.methodname)
command. For example:
1 import math # press ctrl+click on math in the Python script to open the implementation of math
2
3 help(math) # documentation of math (also available online!!)
4 help(math.factorial) # documentation of the factorial method
5
6 # print what is in the math module:
7 for m in dir(math):
8     print(m)
R It’s important to remember that no one can know the syntax and meaning of every parameter
for every function in every module. Fortunately, you can always find the documentation for
any module or function online. This documentation is a valuable resource that will help you
understand the details more thoroughly: just google modulename + documentation.
A user-defined function typically ends with a return clause (where the computed result is
returned). It is also possible that the function code does not need to specifically output anything,
in which case there is no return clause. Python functions automatically return None if you don't
specify a return value. The following pseudo-code indicates the general Python function layout:
1 def function_name(parameter1, parameter2, ...):
2     statement_1
3     statement_2
4     ...
5     statement_n
6     result = ...
7     return result # optional
Below is a simple example of a manually defined function (obviously, you would not use this function
in practice, and more advanced things - such as for loops, if statements and advanced arithmetic - are
possible). It also highlights how to call such a function with user-defined values. Python functions
can only return a single object (such as a list, string or dictionary). If you need to return multiple
values, you can either pack them into a list or another container object, or you can use a tuple. In the
example below, we store the summation and product outcomes in a tuple object. During the function
call, this tuple is unpacked. Only for tuples is it also an option to remove the round brackets when
returning and unpacking the values.
1 def sum_product_numbers(a, b):
2     summation = a + b
3     product = a * b
4     return summation, product # returns a single tuple, the same as writing (summation, product)
5
6 val_1 = 5
7 val_2 = 4
8 (res_1, res_2) = sum_product_numbers(val_1, val_2) # function call; writing res_1, res_2 = sum_product_numbers(val_1, val_2) does exactly the same
9 print(res_1, res_2) # returns 9 20
Parameters refer to the variables defined in the function definition (i.e. a and b enclosed in brackets
after the function’s name). Arguments, on the other hand, refer to the actual values that are passed
to the function when called (i.e. val_1 = 5 and val_2 = 4). However, the terms are often used
interchangeably.
Exercise 2.10 Write a function that accepts as input a string and returns a list containing all the
unique characters appearing in the string. As an example, if the input is "dear mister president",
then the output should be ["d", "e", "a", "r", " ", "m", "i", "s", "t", "p", "n"] ■
It is possible to supply multiple arguments to a single parameter. This is useful because there can
be situations where the same actions/operations need to be applied to different arguments. One
can do this in the following manner: func_name(*args, ...) (the * notation implies multiple
arguments are associated with args). During the function call, the actual arguments are separated with
commas (,). We outline a simple example below where the goal is to capitalize each string that is fed
to the function:
1 def capitalize_strings(*args):
2     result = []
3     for s in args:
4         result.append(s.capitalize())
5     return result
6
7 print(capitalize_strings("alice", "bob")) # returns ['Alice', 'Bob']
Every variable in Python has a scope, meaning where it is accessible. In the pseudo-code below, the
scope of the variables a and b is limited to the block of code associated with the function definition;
a and b are called local variables. Even though variables c and d are defined outside the function's
definition, they are still accessible within the function. Variables c and d exist both inside and
outside of the function. We call them global variables.
1 def func(a): # a and b are local variables
2     b = object_1
3     return b
4
5 c = object_2 # c, d and res are global variables
6 res = func(c)
7 d = object_3
If one uses the name of a global variable inside of a function, Python assumes that you want to use
the global variable. However, as soon as a new value is assigned (with the = operator) to a global
variable within a function, Python will create a new local variable with the same name and the code
no longer refers to the global variable (the global variable is not modified). We say that the local
variable ‘shadows’ the global variable. To illustrate:
1 def func_name(list):
2     list.append(10)
3     print(list)
4     b = 10 # a NEW local variable b is created
5     # (that is independent of the global variable b)
6     print(b) # returns 10
7     print(c) # returns [4, 5]
8     c.append(10) # c is still the global variable!
9     # (so also modified in the caller scope / outside the function)
10     print(c) # returns [4, 5, 10]
11     return
12
13 a = [0, 1]
14 b = [2, 3]
15 c = [4, 5]
16 func_name(a)
17 print(a) # returns [0, 1, 10]
18 print(b) # returns [2, 3]
19 print(c) # returns [4, 5, 10]
Even though you should avoid assigning a new value to a global variable within a function, Python
does provide a way to do so: add the statement global variable_name at the beginning of the
function body. Within the function, Python then no longer tries to create a local variable but works
directly with the global variable instead. Continuing our example:
1 def func_name(list):
2     global b # The function no longer creates a local variable b
3     # but works with the global variable b
4     list.append(10)
5     print(list) # returns [0, 1, 10]
6     b = 10 # b is still the global variable!
7     # (so also modified in the caller scope)
8     print(b) # returns 10
9     print(c) # returns [4, 5]
10     c.append(10) # c is still the global variable!
11     # (so also modified in the main scope)
12     print(c) # returns [4, 5, 10]
13
14 a = [0, 1]
15 b = [2, 3]
16 c = [4, 5]
17 func_name(a)
18 print(a) # returns [0, 1, 10]
19 print(b) # returns 10
20 print(c) # returns [4, 5, 10]
Variable names and function names are recommended to follow the "snake_case" naming convention
where each word is in lower case, and separated by underscores. For example, average_age or
student_name, or an example function name could be calculate_average or execute_experiment.
Class names should follow the "CamelCase" convention, where the first letter of each word is
capitalised and no underscores (or spaces) are used. This makes class names stand out against the
lowercase used for functions and variables. Examples: ModelTrainer or DataProcessor. However,
it’s important to note that in this tutorial, we will not be using classes as our code will be structured
in a more procedural or functional style, specific to our machine learning tasks. This approach is
chosen to simplify our examples and focus more directly on applying machine learning techniques
without the overhead of object-oriented programming (where classes are used).
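A brief illustration of both conventions (the names are made up for the example):
# snake_case for variables and functions
average_age = 23.5

def calculate_average(grades):
    return sum(grades) / len(grades)

# CamelCase for classes (classes are not used further in this tutorial)
class ModelTrainer:
    pass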
For those interested in diving deeper into writing clean, readable, and stylish Python code, the PEP 8
guidelines are a great resource. Python Enhancement Proposal 8 (PEP 8) provides a comprehensive
set of recommendations that many Python developers adopt to ensure their code is consistent and
aesthetically pleasing. You can learn more about these style guidelines by visiting the official PEP
8 documentation at https://peps.python.org/pep-0008/. This guide covers everything
from naming conventions to best practices for line length, white space, and more. Exploring these
guidelines is entirely optional but highly recommended if you’re planning to collaborate on larger
Python projects or contribute to open-source software. They provide a solid foundation for writing
code that is easily readable and maintainable by others.
SyntaxError
This error occurs when the Python interpreter encounters incorrect syntax. This means there is a
typo or a syntax mistake in the code. Common causes are missing colons (:) at the end of control
structures (if, for, while, etc.) or misbalanced parentheses, brackets, or braces.
1 if True
2     print("This will cause a SyntaxError")
3 # SyntaxError: fix this error by adding the missing colon after 'if True'
IndentationError
Python relies on indentation to define blocks of code. An IndentationError occurs when the
spaces or tabs used for indentation are inconsistent. Common causes are mixing of tabs and spaces
and incorrect indentation levels.
1 def example_function():
2     print("This line is correctly indented")
3      print("This line has an extra space and will cause an IndentationError")
NameError
This error is raised when you try to use a variable or function name that has not been defined.
Common causes are the misspelling of a variable name, and using a variable before it is defined.
1 print(undeclared_variable)
2 # Ensure the variable is defined before using it
TypeError
A TypeError occurs when an operation is applied to an object of an inappropriate type. Common
causes are adding a string to an integer or calling a function with the wrong number or type of
arguments.

result = "string" + 10
ValueError
This error is raised when a function receives an argument of the right type but an inappropriate value.
Common causes are passing a string that cannot be converted to an integer or using out-of-range
values.

number = int("not_a_number")
IndexError
An IndexError occurs when you try to access an index that is outside the range of a list or other
indexed collection. Common causes are trying to access an element beyond the end of the list or
using a negative index that is out of range.
my_list = [1, 2, 3]
print(my_list[3])
# The list only has 3 elements, so the valid indices are 0, 1 and 2
KeyError
This error is raised when a dictionary key is not found in the set of existing keys. Checking if the key
exists in the dictionary before accessing it can prevent this error.
my_dict = {"a": 1, "b": 2}
print(my_dict["c"])
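As the text above suggests, a simple membership test (or the dict.get() method) avoids the error; a minimal sketch:

my_dict = {"a": 1, "b": 2}
if "c" in my_dict:
    print(my_dict["c"])
else:
    print("key 'c' not found")
print(my_dict.get("c", 0))  # get() returns a default value instead of raising an error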
AttributeError
An AttributeError occurs when you try to access an attribute or method that does not exist for a
particular object. Common causes are the misspelling of an attribute name or using an attribute that
is not defined for an object.
my_list = [1, 2, 3]
my_list.append(4)
my_list.appendd(5)  # misspelled method name
2.2 Basic Python libraries for Data Science
In the following sections, we will focus on how to use Python for data science. We will discuss some
very useful Python libraries for data science such as NumPy, Pandas, Matplotlib and Scikit-learn.
NumPy and Pandas are great for exploring and analysing the data. Scikit-learn is the most-often
used library for modelling and Matplotlib is a data visualisation library that allows you to make plots
and graphs of the data.
In this part of the chapter, you can use the Python_for_data_science.py script. Complete the script
yourself in VS Code and answer the corresponding questions. We will be using the GermanCredit.csv
data for completing the questions.
The Python Standard Library is the collection of modules that ships with every core Python
distribution, as mentioned in the previous section. It consists of more than 200 core modules, many
of which are implemented in C for performance. Additional libraries refer to those optional
components that are commonly included in Python distributions. The Python installer for Windows
automatically adds the standard library and some additional libraries. In addition to the standard
library, there is a growing collection of several thousand components (from individual programs and
modules to packages and entire application development frameworks), available from the Python
Package Index (https://pypi.org/).
Important Python Libraries for Data Science that can be imported and that are discussed in this
tutorial are the following:
• NumPy (Numerical Python): provides advanced mathematical functions and scientific com-
puting. It supports large multidimensional arrays and matrices, and functions to operate on
them.
• Pandas (Python Data Analysis Library): provides fast, expressive and flexible data structures
to easily work with structured data. This will be your main tool to read in and write out .csv
files. It also works well together with Scikit-Learn and Seaborn.
• Matplotlib: helps with data analysis and is a numerical plotting library. It allows you to create
static, animated and interactive visualisations in Python.
• Scikit-learn: provides a range of supervised and unsupervised learning algorithms via a
consistent interface. The library is built upon the SciPy (Scientific Python) library and NumPy
library that must be installed before you can use it.
2.2.1 NumPy
NumPy1 (short for Numerical Python) provides an efficient interface to perform mathematical and
logical operations on arrays. NumPy also has built-in functions for linear algebra and random
number generation. We installed NumPy with the requirements file (see Section 1.2.4). The next step
is to import it into your Python scripts. Open or create a new Python file in VS Code. In your Python
file, import NumPy using the following line of code:

import numpy as np

You will see that np is commonly used to shorten the reference to NumPy, making your code easier
to write and read.
R We will illustrate the concepts discussed throughout this section with some code. We encourage
you to try out the code yourself while following this tutorial. There will also be some new
exercises you need to do yourself.
Arrays
The most important object defined in NumPy is an N-dimensional array type called ndarray. It
can be seen as a table of elements (usually numbers), all of the same type. Items in the table can be
accessed using a zero-based index. The basic ndarray is created using the np.array() function:
all you need to do is pass it a list and, optionally, the data type of the elements.

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [8, 9, 10, 11]])
You can access the number of axes (dimensions) of the array using ndarray.ndim. The
ndarray.shape attribute gives you the size of the array in each dimension. For a matrix with n rows
and m columns, the shape will be (n,m). The length of this shape tuple is therefore the number of
dimensions.

print(a.ndim)
print(a.shape)
Question: What are the number of dimensions and the shape of a and b?
Exercise 2.12 Create a new array Z with the grades of the Machine Learning course of the first
examination period of 15 students: 12, 13, 16, 11, 10, 19, 18, 19, 15, 13, 12, 12, 15, 15, 12. ■
1 https://www.numpy.org
NumPy provides many functions to create arrays from scratch.
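The exact snippet lives in the accompanying script; as an illustration, the arrays could look something like the sketch below (the names c to f are assumptions chosen to match the question that follows):

c = np.zeros((3, 4))                # 3x4 array filled with zeros
d = np.ones((2, 2))                 # 2x2 array filled with ones
e = np.linspace(0, 2, 9)            # 9 evenly spaced values from 0 to 2
f = np.arange(24).reshape(2, 3, 4)  # 3-dimensional array with the values 0..23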
Exercise 2.13 Print the following arrays to see what they look like. ■
Question: What are the number of dimensions and the shape of c and f?
The functions np.linspace() and np.arange() in NumPy both generate arrays of evenly spaced
values, but they operate differently. np.linspace(start, stop, num) generates an array of
num evenly spaced values starting from start and ending at stop (inclusive). For example,
np.linspace(0, 2, 9) creates an array of 9 values ranging from 0 to 2. np.arange(start, stop,
step) generates an array starting at start and incrementing by step up to but not including stop. For
example, np.arange(10, 30, 5) creates an array starting at 10 and increasing by steps of 5: [10,
15, 20, 25]. To reshape an array, you can use the ndarray.reshape() method. This allows you to
change the shape of the array by specifying the size of the array in each dimension. For instance,
if you have an array with 12 elements, you can reshape it into a 2-dimensional 3x4 array using
array.reshape(3, 4). Note that reshape() returns a new array object and leaves the original
array unchanged.

a.reshape(2, 2)  # returns a reshaped array; a itself keeps its original shape
Exercise 2.14 Create a 1-dimensional array ranging from 0 to 120 (not including) with stepsize
2. Then reshape this array to a 3-dimensional array. You can choose the size of the dimensions
yourself. ■
Array operations
In this section, you'll discover some of the functions that you can use to do mathematics with arrays.
You can use +, -, *, / or % to add, subtract, multiply, divide or calculate the remainder of two (or
more) arrays. These operators all operate elementwise. The matrix product can be performed using
the @ operator or the np.dot() method.
Exercise 2.15 Do the following operations yourself to see what they look like. ■
# add a and b
i = a + b

# multiply a and d elementwise, but first change the dimensions of a
# so that the elementwise product of a and d is possible
a = a.reshape(2, 2)
j = a * d

# calculate the matrix product of a and d
k = a.dot(d)
Exercise 2.16 For the array Z containing the grades of the Machine Learning course of the first
examination period, add 1 extra point to the final grade of each student. ■
Many unary operations, such as computing the sum of all the elements in the array, are implemented
as methods of the ndarray class. By default, these operations apply to the array as though it were a
list of numbers, regardless of its shape. However, by specifying the axis parameter you can apply an
operation along the specified axis of an array: axis=0 refers to the columns (you operate over the
rows) and axis=1 to the rows (you operate over the columns).
l = b.min()  # minimum of all elements

m = b.sum(axis=0)  # sum of each column
n = b.sum(axis=1)  # sum of each row

o = b.cumsum(axis=1)  # cumulative sum along each row
Exercise 2.17 Compute the mean and the standard deviation of the columns in b. ■
You can find a list of all the routines provided by NumPy in the documentation2.
2 https://docs.scipy.org/doc/numpy/reference/routines
You might also want to find the indices of the elements with a certain value.

# get the indices of the elements of array h with a value other than 0
index = np.argwhere(h)

# get the indices of the elements of array h that are missing
index = np.argwhere(np.isnan(h))

# get the indices of the elements of array h with value 1.25
index = np.where(h == 1.25)
Exercise 2.18 In our grades vector Z of the Machine Learning course, change the grade of the
first student to 8. ■
Array concatenation
Sometimes we want to combine different arrays. So, instead of typing each of their elements
manually, you can use array concatenation to handle this task easily.
# you can concatenate two or more arrays at once
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = np.array([21, 21, 21])

print(np.concatenate([x, y, z]))  # creates a single 1-dimensional array

# you can also use this function on 2-dimensional arrays
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(np.concatenate([grid, grid]))
Until now, we used the concatenation function on arrays of equal dimension. But what if you are
required to combine a 2D array with a 1D array? In such situations, np.concatenate() might not
be the best option to use. Instead, you can use np.vstack() to append vertically or np.hstack() to
append horizontally.

# vertical stack
x = np.array([3, 4, 5])
grid = np.array([[1, 2, 3], [17, 18, 19]])
print(np.vstack([x, grid]))
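For completeness, a small sketch of the horizontal counterpart (the column array y is invented for illustration):

# horizontal stack: append a column to a 2D array
y = np.array([[37], [38]])
print(np.hstack([grid, y]))  # grid gets an extra column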
2.2.2 Pandas
Pandas (https://pandas.pydata.org/docs/) is one of the most popular Python libraries for
data science and analytics. It helps you to analyse two-dimensional data tables in Python and has
many useful features. Just like with NumPy, we installed Pandas with the requirements file (see
Section 1.2.4). The next step is to import it into your Python scripts. It is conventional to use the alias
pd for Pandas:
import pandas as pd
Data structures
There are two types of data structures in Pandas: Series and DataFrames. A Pandas Series is a
one-dimensional data structure ('a one-dimensional ndarray') that can store values, and for every
value it holds a unique index too. A Pandas DataFrame is a two-dimensional data structure, basically
a table with rows and columns. The columns have names and the rows have indexes. Let's
import a dataset to see what this means.
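The snippet itself is part of the accompanying script; a minimal sketch of what it does (the file path is an assumption, check the script for the exact call):

# load the GermanCredit csv file into a Pandas DataFrame
german_credit = pd.read_csv("Datasets/GermanCredit.csv")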
If you run this code snippet in the interactive window, you can open the german_credit data in
the Jupyter variable explorer. This nice 2D table is a Pandas DataFrame. The numbers in the
first column are the indexes and the column names on the top are picked up from the first row of
our GermanCredit.csv file automatically. One column of this table is called a Pandas Series (see
Figure 2.1).
Selecting data
Sometimes you want to select only a part (certain rows or columns) of a DataFrame. Here are some
basic operations to do this.
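The operations are listed in the accompanying script; a sketch of typical examples (column names taken from the GermanCredit data):

german_credit["Camount"]                # select a single column (a Series)
german_credit[["Purpose", "Camount"]]   # select several columns (a DataFrame)
german_credit.head(5)                   # first five rows
german_credit.iloc[0:10]                # rows by position
german_credit.loc[0:10, "Purpose"]      # rows and columns by label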
Exercise 2.19 Print the operations out to see what they look like ■
You can also filter for specific values in your DataFrame. Let's say you want to find those rows
where "Astatus" is equal to "A11". Two steps need to be performed here. First, for every row,
evaluate whether Astatus is equal to A11 or not.

german_credit["Astatus"] == "A11"

The results are boolean values (True or False). Next, select from german_credit the rows where this
boolean value is True.

german_credit_A11 = german_credit[german_credit["Astatus"] == "A11"]
Exercise 2.20 Select the Client_ID, Purpose and Amount columns for the clients who have a
duration of more than 6. Print the first five rows only. ■
Data aggregation
Very commonly used methods in analytics and data science projects are aggregation (such as min,
max, sum, count, etc.) and grouping.
# count the number of clients in our DataFrame
print(german_credit["ClientID"].count())
# or
print(german_credit.ClientID.count())

# count the number of different values for Astatus
print(german_credit.Astatus.value_counts())

# calculate the sum of all credit amounts
print(german_credit.Camount.sum())

# find the smallest credit amount
print(german_credit.Camount.min())
Now suppose that you want to know these statistics for every type of "Purpose" ("A40", "A41", "A42",
...) separately. For this you can use groupby() in combination with any of the aggregation
methods mentioned above.

# calculate the summary statistics of all remaining (numeric) columns for every type of Purpose
german_credit.groupby(["Purpose"]).describe()

# calculate the mean credit amount for every type of Purpose
german_credit.groupby(["Purpose"]).Camount.mean()
Data formatting
In real-life data projects, we usually don’t store all the data in one big data table. We store it in a few
smaller ones instead because it is easier to manage your data, you can query tables faster, etc. It’s
therefore quite common that during your analysis you will need to combine data from two or more
different tables. The solution for that is called a merge. There are different ways to combine two
DataFrames: inner, outer, left or right (see Figure 2.2). When you do an inner merge (the default
mode in Pandas), you merge only those values that are found in both tables. On the other hand, when
you do the outer merge, it merges all values, even if you can find some of them in only one of the
tables. A left merge keeps all the values from the left table and merges only those values from the
right table that we have in the left one. Similarly, a right merge keeps all the values from the right
table and merges only those values from the left table that we have in the right one.
Imagine that you would want to merge the German_credit data with another data set that contains
information on the salaries of some clients. For doing the merge, Pandas needs the key-columns
you want to base the merge on (in our case "ClientID"). You can specify this using the on parameter.
In case the column names are not the same for both DataFrames, use the left_on and right_on
parameters.
# load csv file into a Pandas DataFrame
Salaries = pd.read_csv('Datasets\GermanCredit_salaries.csv', sep=';')

# left merge two DataFrames on 'ClientID'
Merged = german_credit.merge(Salaries, on="ClientID", how="left")
Notice that for some clients there was no salary data available; these clients will have a NaN value
for salary in the merged DataFrame. With Pandas, you can easily fill missing values using the
fillna() function. This function basically finds and replaces all NaN values in your DataFrame
with a predefined value.
Figure 2.2: Left, right, inner and outer merge of two DataFrames.
# fill missing values in the column "Salary" with the string "unknown"
Merged["Salary"] = Merged["Salary"].fillna("unknown")

# or with the value 0
Merged["Salary"] = Merged["Salary"].fillna(0)
Another important operation you might want to perform on a Pandas DataFrame is sorting the data.
The function for this is sort_values().
# create a new DataFrame sorted on age in ascending order
german_credit_sorted = german_credit.sort_values("Age")

# sort the existing DataFrame on age in ascending order
german_credit.sort_values("Age", inplace=True)

# sort the existing DataFrame on age in descending order
german_credit.sort_values("Age", inplace=True, ascending=False)

# sort the existing DataFrame on age in descending order and on credit amount in ascending order
german_credit.sort_values(by=["Age", "Camount"], inplace=True,
                          ascending=[False, True])
Now have a look at your indexes. They are completely mixed up. Sometimes, after performing
transformations on your DataFrame, you have to re-index the rows using reset_index(). This
method moves the existing indexes into a new column named index and creates new ascending
indexes for the DataFrame. By doing this, you can access the first row of the rearranged DataFrame
using the index 0, instead of manually checking the DataFrame to find which index is now at the top.
More information can be found on
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html.
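A small sketch of this behaviour:

# after sorting, the index order is shuffled; reset it
german_credit = german_credit.reset_index()  # the old index moves into a column named "index"
print(german_credit.iloc[0])  # the first row is now accessible at position 0 again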
Saving a DataFrame
You can save a Pandas DataFrame as a CSV or Excel file with the following commands:
# save to csv file
german_credit.to_csv("filename.csv")
# save to Excel file
german_credit.to_excel("filename.xlsx")
2.2.3 Visualization using Pandas and Matplotlib
Pandas can create plots directly from a DataFrame. For example, you can create a boxplot of the
credit amounts, and you can even group the boxplots by a second variable, for example "Astatus",
the status of the existing checking account (Figure 2.4).
# create boxplot of Camount
german_credit.boxplot(column="Camount")

# create boxplot of Camount grouped by Astatus
german_credit.boxplot(column="Camount", by="Astatus")
For categorical values, you can create a barchart with Pandas to visualise the counts per category
(see Figure 2.5).
german_credit.Astatus.value_counts().plot(kind="bar")
Let's start with a very simple plot using Matplotlib's plot() function. Note that we first import
matplotlib.pyplot under its conventional alias plt.

import matplotlib.pyplot as plt

# prepare data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# plot data
plt.plot(x, y)
plt.show()
Note that the first array appears on the x-axis and the second array appears on the y-axis of the plot.
The basic linestyle is "b", a solid blue line. Now that our first plot is ready, let us add a title and
label the x-axis and y-axis using the methods title(), xlabel() and ylabel() respectively. We
can specify the size of the figure using the figure() method and passing the width and height of
the figure to the figsize argument. The axis() command takes a list of [xmin, xmax, ymin, ymax]
and specifies the viewport of the axes. We will also add a legend to our figure and change the
linestyle to green dots ("go"). See the plot() documentation3 for a complete list of line styles and
format strings.
# prepare data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# adjust figure size
plt.figure(figsize=(10, 5))
# plot data
plt.plot(x, y, "go", label="sin(x)")
# add title and label the axes
plt.title("sine wave")
plt.xlabel("x")
plt.ylabel("sin(x)")
# set the minimum and maximum for the axes
plt.axis([0, 10, -1, 1])
# add legend
plt.legend()
plt.show()
Exercise 2.23 Plot the sine and cosine function of x on the same axis as a dotted red line and a
solid blue line respectively. Give your plot a title, label the axes and add a legend. ■
2.2.4 Scikit-learn
Scikit-learn4 contains simple and efficient tools for data mining and data analysis. It features various
algorithms for supervised and unsupervised data mining methods. In this tutorial, we will focus on
3 https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
4 https://scikit-learn.org/stable/index.html
supervised classification methods such as nearest neighbour, decision tree, logistic regression and
support vector machine algorithms. Below, we provide a short introduction to Scikit-learn. We will
work with the breast cancer dataset which can be loaded through Scikit-learn. This dataset is already
preprocessed and cleaned and can therefore be used for modelling directly.
R IMPORTANT! This section is a brief introduction to the Scikit-learn library. In the next chapter,
there are more exercises and code snippets to thoroughly explain the most used functions.
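The snippet in question is in the accompanying script; a minimal sketch of it, using Scikit-learn's built-in loader:

from sklearn.datasets import load_breast_cancer

# load the breast cancer dataset (features in .data, labels in .target)
breast_cancer = load_breast_cancer()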
After executing this code snippet in the interactive window, you can check out the breast cancer
dataset in the Jupyter variable explorer. The goal is to predict whether a cancer is malignant (0) or
benign (1) from a set of characteristics of the cancer. These characteristics are stored in the data
array and the target classes are stored in the target array.
Create training and testing set
A common practice in data mining is to evaluate an algorithm by splitting a data set into two:
the training set, on which we learn some properties; and the testing set, on which we test the
learned properties. With Scikit-learn, you can easily split the data into a training and a testing set
using the predefined method train_test_split() from sklearn.model_selection. With the parameter
test_size, you indicate the proportion of the dataset to include in the test split. You can also split
the data in a stratified fashion by passing the class labels to the stratify parameter (e.g.
stratify=breast_cancer.target).
from sklearn.model_selection import train_test_split

# split the data set into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.30)
Model building
Next, we will show how to train and test a logistic regression model on the breast_cancer dataset
using the models already included in the Scikit-learn package. These methods are similar for other
algorithms like decision trees, support vector machines, etc. The model must first be trained on the
data. This is done by passing our training set to the fit() method.
Exercise 2.24 Type the following code snippets in Python yourself and see what happens. ■
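A sketch of what those snippets likely cover, assuming the logistic regression model with C=0.01 that is mentioned below (check the script for the exact code):

from sklearn.linear_model import LogisticRegression

# create and train a logistic regression model on the training set
model = LogisticRegression(C=0.01, max_iter=10000)
model.fit(X_train, y_train)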
When the model has been trained, you can predict the labels on the test set with the predict()
method. Or you can output the prediction scores/probabilities using the predict_proba() method.
The first index refers to the probability that the data belong to class 0, and the second refers to the
probability that the data belong to class 1.

# predict the labels for the testing data
labels = model.predict(X_test)

# predict the prediction scores of belonging to class 1 (hence the [:,1]) for the testing data
scores = model.predict_proba(X_test)[:, 1]
For now, we just set the regularisation parameter (C) manually to 0.01. A better way to do this is to
optimise this value on a validation set with cross-validation. To achieve this, we define a so-called
grid of parameters that we would want to test out in the model and select the best model using
GridSearchCV(). You can choose the number of folds with the cv parameter.
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression(max_iter=10000)
# construct the parameter grid
param_grid = {
    "C": [0.001, 0.01, 0.1, 1.0]
}
# perform a cross-validated grid search with 5 folds
gridsearch = GridSearchCV(lr, param_grid, cv=5)
gridsearch.fit(X_train, y_train)

# evaluate the results of the grid search
print("Gridsearch results: " + str(gridsearch.cv_results_))

# select the best model
print("Best model is: " + str(gridsearch.best_estimator_))
best_model = gridsearch.best_estimator_
# best_estimator_ selects the estimator which gave the highest score
# on the left-out data (on average)

# fit the best algorithm to the data
best_model.fit(X_train, y_train)
It is possible to save a model in Scikit-learn by using Python’s built-in persistence model, pickle:
import pickle
# save the trained model to a file (the file name is just an example)
pickle.dump(best_model, open("best_model.pkl", "wb"))
Model evaluation
Finally, the test set is used to determine the generalisation capability of the selected model. We will
discuss some of the most common evaluation metrics (accuracy, ROC and AUC) below, but keep in
mind that a wide range of evaluation techniques exists.
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# predict scores and labels on the test set
y_test_scores = best_model.predict_proba(X_test)[:, 1]
y_test_labels = best_model.predict(X_test)

# accuracy on test data
accuracy_test = accuracy_score(y_test, y_test_labels)

# AUC value on test data
AUC_test = roc_auc_score(y_test, y_test_scores)

# ROC curve
# calculate false positive and true positive rate
fpr, tpr, threshold = roc_curve(y_test, y_test_scores)

# plot fpr and tpr
plt.title("Receiver Operating Characteristic on test data")
plt.plot(fpr, tpr, "b")

plt.plot([0, 1], [0, 1], "r--")  # plot the diagonal indicating a random model
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
Exercise 2.25 Perform a cross-validated grid search (5 folds) where you optimise both the
regularisation penalty function and the value for the regularisation parameter [0.1, 1, 10]. What
are the best performing parameters? Now use the best performing model to make predictions for
the test set (y_test_scores) and print out the AUC. ■
2.3 Chapter challenge: Putting it all together
The final exercise covers the use of (built-in) packages as well as the development of manually
defined functions. In the folder Final_Exercise, there is a Python file called Cards_Test_student.py
and a folder Functions_Cards. That folder contains the file Cards_Func_student.py. In the first
part of the exercise, the file Cards_Func_student.py needs to be completed. It contains various
functions that are useful for a variety of card games. The second part of the exercise is the completion
of Cards_Test_student.py. This file tests each of the functions defined in Cards_Func_student.py.
Questions: if you run the file Cards_Test_student.py twice, do you observe different results? Can
you explain why this happens and also why this is useful for card games in general?
In the 2.3 Additional Exercises folder, you can find some extra exercises on loops and functions in
the additional_exercises_student.py file.
3. MACHINE LEARNING IN PYTHON
3.1 CRISP-DM
When starting a machine learning project, it is always helpful to keep in mind the CRISP-DM
process for data mining as shown in Figure 3.1. This chapter will follow this approach and it is also
recommended that you follow this design for the Data Science Challenge.
Figure 3.1: Cross-Industry Standard Process for Data Mining (p.27 in Data Science for Business)
R For more information about the CRISP-DM and the different steps in the process, review
Chapter 2: Business Problems and Data Science Solutions, page 26–33 from Data Science for
Business (Provost and Fawcett, 2013)
The dataset you will be working with in these sessions can be found in GermanCredit.csv in the
Datasets folder 1 . The target variable classifies loans described by a set of features as either a good
(0) or a bad (1) loan, i.e. has the borrower defaulted on the loan (1) or not (0)? The meaning of the
attributes is given in the Word file (Description of the German credit dataset.docx).
Open the file GermanCredit.csv and the Word file with descriptions and make sure you understand the
meaning of each attribute in order to get a high-level understanding of the problem. Question: Can
you guess from their definition what their role might be in a credit risk prediction?
You can also find the cost matrix in the Word file. The rows of this cost matrix represent the actual
classification, the columns represent the predicted class. Question: Can you infer from this matrix
which prediction results in the highest cost and why?
Exercise 3.1 Open the Python code Data_understanding_student.py, import the dataset Ger-
manCredit.csv in Visual Studio Code as seen in Chapter 2 and follow along during this section.
■
R It is important to realise that this is not a statistics course. Data exploration is important to
understand the dataset you are working with, but it is not the main goal of machine learning projects.
Do not overly focus your results or code on statistics.
Univariate statistics
We will first look at some simple univariate statistics for the target variable and the attribute values
separately. Some useful questions and exercises are:
• What is the frequency distribution of the target variable? The target variable (Status) models
whether a loan is a good (0) or a bad (1) loan. What is the percentage of good and bad loans in
the data set?
• An important quality assessment concerns the target variable. Imagine the case of fraud
detection, where the target variable is extremely underrepresented (e.g. 1% positive cases vs.
99% non-fraud cases). This extreme skewness of the target variable may lead to undesirable
effects in many rule and tree-based techniques (e.g. decision trees, Ripper) and methods
aiming to minimise overall training set error (e.g. neural networks, SVM). These learners are
designed with a lack of consideration of the underlying target distribution and will usually
emphasise the majority concepts at the expense of neglecting minority class concepts, while
the latter is usually the phenomenon of interest (e.g. identifying frauds is more important than
finding legal instances). In datasets showing a high imbalance level, cost-sensitive learning or
sampling techniques (oversampling the minority class or undersampling the majority class)
are most commonly applied. Question: Do we have to worry about extreme skewness in the
German credit dataset?
• What is the frequency distribution of the categorical features in the dataset? Also, keep in
mind that there might be categorical attributes with missing values like Etime. Tip: you can
construct an overview of the data exploration phase as shown in Table 3.1 for each categorical
attribute (e.g., in a separate word document).
• For the continuous attributes, the frequency distribution can be visualised by means of
histograms and boxplots. For the continuous variable representing the amount of credit asked
by the applicant (Camount), the histogram and boxplot are given in Figure 3.2 and Figure 3.3.
Construct these plots for the other continuous variables yourself.
• Outliers may be present in a dataset due to noise or if a value has been recorded wrongly.
There are two types of outliers: valid observations or invalid observations. The first type of
outlier takes on a valid value in the range of the values of the attribute. The latter type of
outlier takes on a value which is not possible (such as -1 for age). Question: For the German
credit data, which attributes have extreme values? Are there any variables containing outliers
(take a closer look at the boxplots)? Do these outliers make sense? Do you consider them
valid or invalid observations? (Hint: look again at the boxplot command). What are possible
ways to treat outliers?
• Another way of gaining insight into the distribution of an attribute is shown in Table 3.2.
The following commands can be used to complete this table for the remaining continuous
attributes: .min(), .max(), .std() etc. Note that these commands ignore the presence of
missing values (nan values). Question: Based on these basic statistics, can you infer anything
about the population from which the dataset was drawn?
• Question: Does it make sense to include the ClientID variable in the previous analyses?
Why or why not?
• The previous analyses should have given you an intuition regarding which features have
missing values and how these missing values are represented in the dataset. Concretely, which
features contain missing values? (Hint: look at the isna() command).
• Calculate the percentage of missing values per feature. Extra question: you could also create a
function that returns the percentage of missing values for a given variable (provided as input).
How would you do this? Look at Chapter 2 (Section 2.1); a possible sketch is shown below.
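One possible sketch of such a function (the helper name is invented for illustration):

def missing_percentage(df, column):
    # percentage of NaN values in the given column
    return df[column].isna().mean() * 100

# example: missing_percentage(german_credit, "Etime")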
Multivariate statistics
Let us now look at some multivariate statistics of the dataset. Complete the following exercises:
• It would be helpful to see if there are any interactions between the target variable and the
predictor attributes. For example, Table 3.3 shows the contingency table for the target variable
Status and the purpose of the loan (variable Purpose). From this multivariate frequency
distribution, it can be observed that the probability of default is higher when focusing on a
sample of loans made for educational purposes (A46). Construct the contingency table for the
target variable and the following variables: Saccount and Pstatus. What can be observed?
Table 3.3: Contingency table for the Status and Purpose variable.
A40 A41 A42 A43 A44 A45 A46 A48 A49 A410 ∑
Good (0) 145 86 123 218 8 14 28 8 63 7 700
62% 84% 68% 78% 67% 64% 56% 89% 65% 58% 70%
Bad (1) 89 17 58 62 4 8 22 1 34 5 300
38% 16% 32% 22% 33% 36% 44% 11% 35% 42% 30%
∑ 234 103 181 280 12 22 50 9 97 12 1000
Look at the Python code. Question: How would you obtain the Data (X) and labels (Y) for the
training, validation and test data using the obtained indices?
Furthermore, we extract the target variable as a separate Pandas Series, as well as all the features.
For the educational purpose of this tutorial, it is nice to have all features as separate Series; however,
it is definitely not necessary to do this every time.
When looking at the Phone attribute, we can see that it has two possible values (A191 if the client
does not have a phone or A192 if the client has a phone). One way of replacing the missing values
would be to replace them by A191, i.e. treating them as if the client did not have a phone. Another
possibility is to replace them by the most frequently occurring category for categorical attributes, or
by the mean value for continuous attributes (check the histogram and the mode() and mean()
commands). If the percentage of observations with missing values is low, we could also delete these
observations. There is no single "perfect" way to handle missing values; it depends on your data. It
is best to try a couple of different techniques and see how the model performance is affected by
handling the missing values in different ways.
R Important! You have to calculate the mean or mode on the training set and then impute this
value in the training, validation and test sets. That way we avoid inserting information from our
test/validation set into our training set, which is known as data leakage.
Exercise 3.2 For the missing values in the Phone feature, replace the missing values with the mode.
Hint: look at the code Data_preparation_student.py on Data quality examination. ■
The first attribute is the unique ClientID of a loan in the system. Question: Do you need this variable
to make a prediction? Why or why not?
One way to normalise variables is to apply the formula for statistical normalisation to each attribute
that should be normalised. In that case, the mean and standard deviation of the training set are used
to calculate the z-score for the training, validation and test sets:

z_i = (x_i − µ_i) / σ_i    (3.1)

The z-score reflects how many standard deviations an observation lies from the mean. Observations
with abs(z) > 3 are treated as outliers. Hence, for outliers, replace the z-score with 3 (or -3).
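A sketch of this normalisation step, assuming train/validation/test feature DataFrames with the names used below (the names are illustrative):

# compute mean and std on the training set only (to avoid data leakage)
mu = X_train["Camount"].mean()
sigma = X_train["Camount"].std()

for df in (X_train, X_val, X_test):
    z = (df["Camount"] - mu) / sigma  # statistical normalisation (z-score)
    df["Camount"] = z.clip(-3, 3)     # cap outliers at |z| = 3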
3.4.5 Exercises
Complete the Python file Data_preparation_student.py and preprocess all discrete and continuous
variables using the functions Preprocessing_continuous.py and Preprocessing_discrete.py. Make
sure to save the preprocessed data, as well as the train/test/val indices, as they will be used in the
next section on Modelling and Evaluation.
Carefully inspect the preprocessing functions yourself to verify the previous preparation steps
(missing values, variable encoding, normalisation and outliers). Can you confirm that no information
of the test set is ’leaked’ into the preprocessed data?
In the Model Evaluation phase, the test set is used to determine the generalisation capability of the
selected machine learning techniques. A wide range of evaluation metrics can be used for this, for
example accuracy or AUC. It is important to keep in mind the various audiences to which the results
should be presented and to tailor the evaluation to those backgrounds and needs. Finally, one must
make sure that all business objectives are met.
In these exercises we will be building kNN models with the KNeighborsClassifier from sklearn2 .
2 https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Table 3.5: Confusion matrix on the validation set (rows: actual class, columns: predicted class).
                      Predicted Non-default    Predicted Default
Actual Non-default    TN = .............       FP = .............
Actual Default        FN = .............       TP = .............
Have a look at the documentation to find out what the default distance metric and weighting scheme
are.
1-NN model
We will start by building a simple 1-NN model on the training set. You can open the Python script
Data_Modelling_KNN_student.py to help you with the code. Now use this 1-NN model to make
predictions for your validation set and evaluate this model:
1. Construct a confusion matrix and complete Table 3.5.
2. Calculate the accuracy of the predictions. How many loans are correctly classified?
3. Because the classes are not equally divided (there are more non-defaults than defaults),
accuracy might give a distorted picture of the classification results. Therefore, we prefer to
complement accuracy with AUC. Calculate the AUC value for the prediction probabilities and
plot the ROC-curve.
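A minimal sketch of fitting and evaluating the 1-NN model (assuming the preprocessed X_train, Y_train, X_val and Y_val from the previous section):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, Y_train)

Y_val_pred = knn.predict(X_val)                # predicted labels
Y_val_scores = knn.predict_proba(X_val)[:, 1]  # prediction probabilities

print(confusion_matrix(Y_val, Y_val_pred))
print(accuracy_score(Y_val, Y_val_pred))
print(roc_auc_score(Y_val, Y_val_scores))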
600-NN model
Obviously, when using one neighbour, you are overfitting on the training sample. In order to have
a more general and better performing model, you should increase the number of neighbours. The
optimal number of neighbours lies in the range [1,N] with N the number of observations in the
training set.
1. Estimate a new nearest neighbour model, but this time set the number of neighbours to 600
(=N). Make new predictions for the validation set using the 600-NN model. Do you notice
something odd about the scores? Can you explain this?
2. Now change the weights to inverse distance. Redo the predictions on the validation set. Did
the scores change? Why and how?
3. Calculate accuracy and AUC for this model.
ACCURACY = ................. AUC = .................
Gridsearch
You will notice that the performance is better for the 600-NN model than the 1-NN model. However,
the optimal number of neighbours probably lies somewhere in between these extreme values. In order
to find the optimal number of neighbours, we will perform a grid search. A grid search is a search
through a manually specified set of the hyperparameter space (here: the number of neighbours). Each
value of this set will be validated through either cross-validation on the training set or validation on a
hold-out data set. The general form of a grid search is as follows:
performances = []  # list initialisation
hyperparameters = np.arange(min_value, max_value + 1, stepsize)

for hyperparameter in hyperparameters:
    clf = model_function(hyperparameter)
    clf.fit(X_train, Y_train)
    prediction = clf.predict_function(X_val)
    performance = performance_function(Y_val, prediction)
    performances.append(performance)

index_maximal_performance = np.argmax(performances)
hyperparameter_optimal = hyperparameters[index_maximal_performance]
Perform a grid search in the set [1,600]. Use steps of size 1 and use AUC as performance measure.
Make sure you use the inverse distance as the weight function.
R When the data set is larger, it will not be time-efficient to look at the entire set of hyperpa-
rameters [1,N] in the grid search. Either set the maximum lower (maxneighbor < N) or do
a coarse grid search first (with a large stepsize). If you have performed a coarse grid search,
you can do a finer search with smaller steps around the optimal value found in the coarse grid
search. You can also try RandomizedSearchCV(), which randomly tries out a couple of parameter
sets and picks the best one. RandomizedSearchCV() does not consider all parameter combinations,
so it might not find the best one, but it will surely find a good one. This is a good technique to
figure out which parameters have a big influence, and it combines well with a finer grid search
afterwards.
R More information about the meaning and possible values of all the parameters of the DecisionTreeClassifier
can be found in the documentation on https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Decision tree
Build a decision tree on the training set using entropy as splitting criterion (look at the criterion
argument).
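A minimal sketch covering the build and visualisation steps (again assuming the preprocessed training data):

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X_train, Y_train)

# visualise the fitted tree
plt.figure(figsize=(20, 10))
plot_tree(tree, filled=True)
plt.show()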
1. Visualize the tree. Can you comprehend the tree?
2. Use the tree to make predictions on the validation set and estimate accuracy and AUC.
ACCURACY = ................. AUC = .................
Gridsearch
A tree with too many branches can overfit on the training data. To avoid overfitting, we need to find
the optimal tree length by doing a grid search over a hyperparameter. We can tune the max_depth
or how deep the tree can be. Another option is to tune the min_samples_split. This parameter
represents the minimum number of samples required to split an internal node. When we increase
this parameter, the tree becomes more constrained as it has to consider more samples at each node.
Finally, min_samples_leaf is the minimum number of samples required to be at a leaf node. A split
point at any depth will only be considered if it leaves at least min_samples_leaf training samples
in each of the left and right branches. We will perform a grid search to optimize the minimum
number of samples at a leaf node:
1. Perform a grid search to find the optimal value for min_samples_leaf (Minimum Leaf Size).
Decide on the boundaries and the steps at which you want to perform your search. Look for
the value that maximises AUC. What is the optimal leaf size?
2. Given the optimal leaf size, rebuild your tree. Make predictions on the test set using this
new tree. Store the predictions, the results and the model. Calculate the total cost of the
misclassifications for the final decision tree.
Train an LR model on your training data. Open the Python code Data_Modelling_LogisticRegression_student.py.
In the code, a for loop is used for hyperparameter tuning. Try to repeat the hyperparameter
tuning by using the GridSearchCV() function. Questions: What is the value of the constant
term (bias) in your model? Which features have the highest contribution to your model? Specifically,
focus on the most important feature. Does that feature make sense? (If it is a category of a categorical
variable, you can look at the proportion of defaulters in this group and compare it to the overall
default rate of the population.) Y_pred_test = model.predict_proba(X_test) can be used to
obtain probability estimates. How would you transform these estimates into concrete binary decisions
(default or non-default)? Which command would you use?
Evaluate your model on the test data, save the trained model (using the pickle library), store your
predicted scores (probability estimates) and write down the following results:
There is one important parameter to be set: the number of trees. You can run a grid search to
find the optimal number of trees (validate on a validation set or using ten fold cross validation).
Simultaneously, you can also run a grid search to estimate the optimal value for min_samples_leaf,
which is the minimum number of samples required to be at a leaf node.
Exercise 3.3 Try to build a random forest model from scratch, just like we did in the previous
examples of Decision Trees and Logistic Regression. Keep n_estimators at the default value of
100 and make a grid to search for the optimal value for the hyperparameter min_samples_leaf.
■
Evaluate your model on the test data, save the trained model (using the pickle library), store your
predicted scores (probability estimates) and write down the following results:
Linear SVM
In this exercise, we will use the SVC function of the sklearn library. We will start with a linear
SVM. The general function to train an SVM is as follows: model = SVC(C=value, kernel="linear"),
followed by model.fit(X_train, Y_train), where C is the regularisation parameter and
the kernel function is linear. The general function for (explicitly) predicting the class label
with your obtained model is model.predict(X_test). For obtaining the predicted scores, use
model.decision_function(X_test).
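Put together as a runnable sketch (the C value and the variable names are assumptions carried over from the earlier sections):

from sklearn.svm import SVC

model = SVC(C=1.0, kernel="linear")  # C value is just an example
model.fit(X_train, Y_train)

Y_test_pred = model.predict(X_test)              # class labels
Y_test_scores = model.decision_function(X_test)  # decision scores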
Open the Python code Data_Modelling_SVM_student.py. Perform a grid search over the C-parameter
space and pick the one showing the highest accuracy on your validation data. Subsequently, evaluate
the obtained model on your test data, save the trained model (using the pickle library) and provide
the following performance measures:
Question: Does the non-linear model have a higher model performance compared to the linear SVM?
Which one would you prefer?
3.6 Deployment
Now that you have built six different classification models, you will need to compare their perfor-
mances. Complete Table 3.7 with the results calculated on the test set.
R Open the code snippet of Data_deployment_student.py and use it as a guideline to complete
the exercises on lift and cost curves. The functions in the modules lift.py and cost.py can be
used to obtain the lift and cost curves. Make sure you understand what happens in these
functions and how they work.
Exercise 3.4 Make a plot containing the lift curves of all six models. Python does not have a
built-in function to build the lift curve, so you will have to create a function that calculates the lift
at different percentages. In this function you first have to rank all scores from high (= default) to
low (= non-default). Then you calculate the lift for different target areas (i.e. the x highest
scores, with x going from 1% to 50%). See the function liftatp() from the module lift.py and
the sketch below. ■
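As an illustration of the idea (not the exact liftatp() implementation), the lift at p% can be computed as the default rate among the p% highest-scoring loans divided by the overall default rate:

import numpy as np

def lift_at(scores, labels, p):
    labels = np.asarray(labels)
    # rank observations from highest to lowest score
    order = np.argsort(scores)[::-1]
    top = labels[order][: max(1, int(len(labels) * p))]
    # default rate in the top p% relative to the overall default rate
    return top.mean() / labels.mean()

# example: lift_at(y_test_scores, y_test, 0.01) for the 1% highest scores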
Compare the different lift curves. If you select the 1% highest scores, which model will have found
the largest number of defaults? The 1% highest scores are the loans that the models predict as
the riskiest loans. A well-performing model should not have too many false positives amongst the
high-risk predictions. Does the model with the highest lift coincide with the model with the highest
AUC/accuracy?
Now that you have evaluated each model in terms of several performance measures, can you answer
the following questions:
• What are the advantages and disadvantages of each model?
• When you look at model understandability for a non-technical audience, which model do you
think will be best?
• Focusing on classification performance, which model performed best?
• Which model(s) would you definitely try first for the Data Science Challenge? Why?
Weight of Evidence (WoE) is a measure used to convert high-cardinality categorical variables into
numerical values. It reflects the predictive power of each category of a variable in relation to the
target variable. WoE is particularly useful in binary classification problems and used primarily in the
fields of credit scoring and risk management.
You can make your own function to implement WoE encoding for the high-cardinality categorical
variables, or you can use the predefined module WoEEncoder in the feature_engine library (https:
//feature-engine.trainindata.com/en/latest/api_doc/encoding/WoEEncoder.html).
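A possible sketch of a manual WoE encoding (the function name is invented, and the smoothing needed for categories with zero counts is omitted; the WoEEncoder linked above handles such edge cases for you):

import numpy as np

def woe_encode(df, feature, target):
    # per category: share of non-events (target 0) and events (target 1)
    counts = df.groupby(feature)[target].agg(["count", "sum"])
    good_share = (counts["count"] - counts["sum"]) / (df[target] == 0).sum()
    bad_share = counts["sum"] / (df[target] == 1).sum()
    woe = np.log(good_share / bad_share)  # WoE per category
    return df[feature].map(woe)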
Supervised ratio encoding provides a straightforward way to transform categorical variables into
numerical values, making them suitable for predictive modelling. It directly reflects the proportion
of good to bad cases, thus highlighting the categories with higher or lower risk.
Both WoE and supervised ratio encoding are useful techniques to manage high-cardinality categorical
variables, enhancing the performance of predictive models by capturing the predictive power of each
category.
Hint: the winning teams of most of the previous years of the AXA Data Science Challenge got high
performance scores because they included high-cardinality variables in a different way than making
a lot of dummies.
4.1.3 GridSearch
Apart from processing data, GridSearch will probably be the most time-consuming part of your
project. How much time is spent on GridSearch depends on your models, the amount of hyperpa-
rameters considered, and the amount of considered values for these hyperparameters. Even here,
you can make some smart choices to substantially speed up the hyperparameter tuning process.
Scikit-learn's GridSearchCV() method has a couple of useful parameters:
1. n_jobs: the number of 'jobs' (i.e. calculations) that can be performed at the same time.
This parameter tells GridSearchCV() how many calculations it is allowed to run in parallel.
Setting the argument n_jobs to -1 will simply use as many processors as are available. If you
have e.g. 8 cores in your computer, the code will run roughly 8 times as quickly.
1 For more information, you can visit for example https://medium.com/@mandava807/cross-validation-and
-hyperparameter-tuning-in-python-65cfb80ee485.
2. verbose: setting this argument to anything bigger than 0 will have the effect that your code
prints out some results during the tuning process. If you are considering tuning a lot of
models, a lot of hyperparameters, a lot of cross-validation folds or a lot of hyperparameter
values (or all of the above), then your GridSearchCV() will run for a (very) long time. It is of
great value to know whether your grid search is going to take a lot of time, so you can stop it
early and maybe change your parameter grid.
gs = GridSearchCV(
    MODEL,       # undefined variable
    PARAM_GRID,  # undefined variable
    cv=KFold(5, random_state=RANDOM_SEED, shuffle=True),  # 5-fold cross-validation
    n_jobs=-1,   # use all available processors for speed
    verbose=3,   # how much output you want during fitting
    scoring="roc_auc"  # or another valid scoring string such as "f1"
)
Python script 4.1: Example of performing a grid search with various optional parameters. Variables
in capital letters are undefined in this code snippet and depend on your own implementation.
Limit your hyperparameter grid! If you have e.g. 5 cross-validation folds and are considering 2
hyperparameters, each with 10 possible values, then the GridSearchCV() method will have to train
5 × 10 × 10 = 500 different models, just to pick out the best one. Imagine you're doing this for
an ensemble learner, e.g. a random forest with 1000 trees; then this means training 1000 × 500 =
500,000 different decision trees! This is going to take forever! A good way to avoid this is to:
• Not tune parameters that aren’t very important to the model, or tune these ones last
• Start out with a coarse parameter grid, and iteratively make it finer. For example, start out with
a random forest with [10, 100, 1000] trees. If the model with 1000 trees yields the best
score, retry your gridsearch, this time with [700, 1400, 2100] trees and so on. This way,
you avoid losing a lot of time on trying out parameters that don’t yield a good model anyways.
Maybe, don't use GridSearchCV() at all, but opt for RandomizedSearchCV(). While GridSearchCV()
covers all possible hyperparameter combinations, RandomizedSearchCV() randomly considers a couple.
You may not find the absolute best set of parameters, but you'll quickly find out which ones have an
impact and which don't.
4.1.5 Pipelines
In data science and machine learning, a pipeline refers to a sequence of data processing steps arranged
to automate the flow from raw data to a machine learning model. Pipelines help in organising and
streamlining workflows, ensuring that each step is executed in the correct order and that the process
is repeatable and consistent. This section will cover the basics of creating and using pipelines in
Python, particularly using the Scikit-learn library.
# Import libraries
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
import pandas as pd

# Load dataset: replace this with any dataset you want. You can try to use
# the German credit dataset like in the rest of the tutorial.
data = pd.read_csv('data.csv')
X = data.drop('target', axis=1)
y = data['target']

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a pipeline with three steps: imputation, scaling, and classification
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Predict probabilities using the pipeline
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

# Evaluate the model using AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f'AUC: {auc_score}')

# Define the parameter grid
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
}

# Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best AUC Score: {grid_search.best_score_}')
In a random forest, feature importance is derived from the amount of impurity reduction each feature contributes across all trees in the forest.
The feature importance can be obtained directly from the model using the feature_importances_
attribute in Scikit-learn. In a similar way, it is possible to extract feature importances from other
machine learning models. By analysing the feature importances, you can identify which features
contribute the most to the model’s predictions. This can help in feature selection, understanding the
model, and making data-driven decisions.
# Assume a Random Forest model has been trained on the training set
import pandas as pd
import matplotlib.pyplot as plt

# Get feature importances
importances = model.feature_importances_
feature_names = X.columns

# Create a DataFrame for visualisation
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='red')
plt.xlabel('Feature Importance')
plt.title('Feature Importances from Random Forest Model')
plt.gca().invert_yaxis()
plt.show()
In the presentations during the tutorial sessions, there are also notebooks about data understanding,
data preprocessing with packages, high cardinality variables encoding, model understanding, etc.
Multiple except clauses can be written for a single try block, each focusing on a specific type of
error. The error that is raised first is the one that will be addressed by its corresponding catch
(except) code block. A number of examples to illustrate:
try:
    a = 5 / 0    # should throw a ZeroDivisionError
    t = (1, 2, 3)
    t.append(4)  # should throw an AttributeError
except ZeroDivisionError:
    print("a ZeroDivisionError occurred in try part")  # gets printed
except AttributeError:
    print("an AttributeError occurred in try part")    # not printed, as the
                                                       # ZeroDivisionError is raised first

try:
    d = {"lemon": "a sour citrus fruit",
         "grape": "a small sweet fruit"}
    del d["banana"]  # should throw a KeyError
except KeyError:
    print("a KeyError was raised in try part")
A large number of exception types exist. As explained in Section 2.1.3, the line of code print(dir(__builtins__)) prints out a list containing these built-in exception types.
R IMPORTANT! Never upload the data you received for the Data Science Challenge to any online platform, such as ChatGPT: the data is confidential and you signed a non-disclosure agreement.
4.2.1 ChatGPT
ChatGPT, developed by OpenAI, is a conversational AI model that can understand and generate
human-like text. It is particularly useful for Python programmers when dealing with complex coding
problems or seeking quick solutions to common issues. One of the primary benefits of ChatGPT is
its ability to provide instant code snippets for specific tasks, such as data manipulation, visualization,
or model training. This can save time and help you focus on more complex aspects of your projects.
You could, for example, ask: "Provide Python code to make a graph with a yellow dashed line for variable X and a purple dotted line for variable Y". Beware, however, that ChatGPT makes mistakes and its code is not always correct; you should remain critical and double-check it yourself.
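To illustrate, here is a minimal sketch of what such a prompt might return (the data below is made up purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical data for illustration
x = range(10)
y1 = [i ** 2 for i in x]    # "variable X" in the prompt
y2 = [i ** 1.5 for i in x]  # "variable Y" in the prompt

plt.plot(x, y1, color='yellow', linestyle='--', label='X')  # yellow dashed line
plt.plot(x, y2, color='purple', linestyle=':', label='Y')   # purple dotted line
plt.legend()
plt.show()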
Another very useful way to leverage ChatGPT is to debug your code or to solve errors. When you
encounter a bug or an error, ChatGPT can help pinpoint the source of the problem by analysing the
code and explaining error messages. This is particularly useful in complex projects, where identifying the exact cause of an issue can be challenging, or if you are a beginner and have difficulty reading error messages and understanding where the issue might come from. You can paste the error message into ChatGPT and it will give you several solutions or other things you can check in order to resolve the error. For example, if your code throws a ValueError, you can describe the situation to ChatGPT, which might suggest checking for type mismatches or validating input data. This can be a good starting point for solving a bug; however, ChatGPT is not perfect and can give you wrong solutions or misunderstand the problem. Always think before you blindly copy-paste code from ChatGPT.
To get started with GitHub Copilot, you need to apply for the GitHub Student Developer Pack through their website (https://education.github.com/pack). You need to apply using your student email address. Once your GitHub Student application has been approved, you can install the GitHub Copilot extension from the Visual Studio Marketplace. After installing the extension, you need to sign in to your GitHub account through VS Code. Now you are all set to start typing code and using GitHub Copilot's suggestions.
4.2.3 GitHub
GitHub is a widely used platform for version control and collaboration in software development.
It allows multiple developers to work on the same project efficiently, providing tools to manage
changes, track issues, and integrate new features seamlessly. This section will guide you through the
basics of using GitHub for code collaboration, helping you and your team work together effectively.
GitHub Desktop is a user-friendly application that simplifies the process of using Git and GitHub. It provides a visual interface for managing repositories, making it accessible for beginners and efficient for experienced users. You can download GitHub Desktop for free from https://desktop.github.com/. You can find a detailed manual on how to use it on https://docs.github.com/en/desktop/overview/getting-started-with-github-desktop and https://docs.github.com/en/desktop/overview/creating-your-first-repository-using-github-desktop. It is definitely recommended to use some sort of version control and collaboration platform for the Data Science Challenge, as it makes it a lot easier to work together on the same code.
R If you’re excited about data science and coding, the GitHub Student Developer Pack offers a
fantastic range of resources for free. You can access online courses, tools, and much more to
help you learn and grow in these fields!
same time.
You start by dividing the DSC202X_Training.xlsx data into a training, validation and test set. (Note: you cannot use the "real" test set that we provided in DSC202X_Scoring.xlsx for model evaluation, because its labels are missing; we shall call this the "scoring data set" in the remainder of this section.) The indices are now called indices_train, indices_val, and indices_test (see Figure 4.1). Let's say you want to train a Logistic Regression model (with a hyperparameter C to finetune). You first preprocess all variables using your own preprocessing functions or the provided Preprocessing_continuous and Preprocessing_discrete, choosing indices_train as an argument. You then use the preprocessed data of the validation set and the target variable to finetune the regularisation parameter C. More specifically, you train the model on the training instances (indices_train), evaluate it on the validation set (instances of indices_val), and select the C-parameter for which the AUC on the validation set is maximal. To evaluate the model with the selected C-parameter, you apply it to the instances of indices_test. You always use the same preprocessing steps that you initialised on the training subset of the data, to avoid data leakage and overfitting. You can repeat these steps for multiple models (e.g., decision trees, random forests, neural networks, etc.), each with their own hyperparameters to finetune. Make sure you keep track of the optimal hyperparameters you have obtained, and keep all AUCs on the test set.
Figure 4.1: Data science challenge: extra information on preprocessing, testing, and scoring.
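As an illustration of this workflow, here is a minimal sketch using Scikit-learn. It assumes X and y have already been fully preprocessed (in the challenge you would instead apply the provided preprocessing functions, fitted on the training indices) and tunes C for a Logistic Regression using the validation AUC:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Assumption: X and y are already preprocessed (numeric, no missing values)
# First split off a test set, then split the remainder into training and
# validation sets (here a 60/20/20 split)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Finetune the regularisation parameter C on the validation set
best_C, best_auc = None, float('-inf')
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if val_auc > best_auc:
        best_C, best_auc = C, val_auc

# Evaluate the model with the selected C on the held-out test set
model = LogisticRegression(C=best_C, max_iter=1000)
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f'Best C: {best_C}, validation AUC: {best_auc}, test AUC: {test_auc}')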
Two well-known platforms on which you can compete are Kaggle and Analytics Vidhya. These platforms offer many datasets to experiment with. There are often open competitions or 'hackathons' (sometimes with cash prizes), where you can build models and submit them to find out how you rank compared to others. This is particularly useful for learning to evaluate the performance of data mining models. By doing so, you can learn whether your interpretation of the model score is correct and think about ways to improve your model.