Introduction to NumPy, Pandas and Matplotlib
Data Analysis
Data Analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
NumPy
NumPy is a Python package for scientific computing. Its core data structure is the N-dimensional array (ndarray).
[1 2 3]
[[1 2]
[3 4]]
[1 2 3 4 5]
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[1 2 3]
[0. 0. 0. 0. 0. 0. 0. 0.]
[[[0. 0.]
[0. 0.]]
[[0. 0.]
[0. 0.]]]
[0. 0. 0. 0. 0. 0. 0. 0.]
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
8
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[1 3 5 7 9]
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[[1 2 3]
[3 4 5]
[4 5 6]]
[[1 2]
[3 4]]
(3, 3)
2
4
[10. 20. 30. 44. 57. 70. 35. 32. 55. 22.]
[10 20 30 44 57 70 35 32 55 22]
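The outputs above appear to come from array-creation, slicing and attribute calls whose code cells are not shown. The following is a minimal sketch of calls that give similar results; the variable names are illustrative assumptions and not the original cells.

```python
import numpy as np

# Creating arrays from Python lists
a = np.array([1, 2, 3])            # 1-D array
b = np.array([[1, 2], [3, 4]])     # 2-D array

# Creating arrays from ranges and fill values
r = np.arange(10, 100)             # integers from 10 up to, but not including, 100
z = np.zeros((5, 5))               # 5x5 array filled with 0.0

# Slicing and basic attributes
s = np.arange(20)[2:]              # elements from index 2 onward
m = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

print(a)
print(b)
print(z)
print(s)
print(m.shape)                     # (3, 3)
print(m.ndim)                      # 2
print(m[:2, :2])                   # top-left 2x2 block of m
```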
In [18]: # NumPy arrays can be dumped into CSV files using the savetxt function and the comma delimiter
# The genfromtxt function can be used to read data from a CSV file into a NumPy array
arr_csv = np.genfromtxt('C:/Users/amaly/Jupyter Work/DSML/Data_file2.csv', delimiter = ',')
print(arr_csv)
int_array2 = arr_csv.astype(int)
print(int_array2)
np.savetxt('C:/Users/amaly/Jupyter Work/DSML/Data_file3.csv', int_array2, delimiter = ',')
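The round trip can also be tried without an existing data file. The sketch below writes to a temporary file (a hypothetical path chosen for illustration) and reads it back; `fmt='%d'` is an optional argument that writes integers as plain digits rather than in scientific notation.

```python
import os
import tempfile

import numpy as np

values = np.array([10, 20, 30, 44, 57])

# Hypothetical temporary file, used here instead of a fixed path
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# Write one value per line; fmt='%d' keeps the integers readable
np.savetxt(path, values, delimiter=",", fmt="%d")

# genfromtxt returns floating-point values by default
loaded = np.genfromtxt(path, delimiter=",")
as_int = loaded.astype(int)

print(loaded)
print(as_int)
```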
Pandas
Pandas is an open-source Python library providing efficient, easy-to-use data structures and data analysis tools. The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data. Pandas is well suited to many different kinds of data.
Pandas has historically provided three data structures: Series, DataFrame and Panel (Panel has since been removed from the library). All of them are built on top of the NumPy array, and all are value-mutable.
Series
1. A Series is a one-dimensional array structure that stores homogeneous data, i.e., data of a single type.
2. The elements of a Series are value-mutable, while the Series itself is size-immutable.
3. data: the input passed to the Series constructor; it can take multiple forms such as ndarray, list, constant, Series, dict etc.
4. index: the labels must be hashable and have the same length as the data; non-unique labels are allowed. Defaults to np.arange(n) if no index is passed.
5. dtype: the data type of the Series; if none is given, it is inferred automatically.
6. copy: whether to copy the input data; False by default.
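The points above can be illustrated with a short sketch. The values and labels below are chosen for illustration only.

```python
import pandas as pd

# data and index: a list of values with explicit labels
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# dtype: force a specific element type instead of letting pandas infer it
s2 = pd.Series([20, 30, 40], index=[1, 2, 3], dtype="int32")

# data as a dict: the keys become the index labels
s3 = pd.Series({"x": 1.5, "y": 2.5})

# Values can be changed in place (value-mutable)
s1["b"] = 25

print(s1)
print(s2)
print(s3)
```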
Creating a Series
In [5]: # Creating an empty Series
import pandas as pd
series = pd.Series() # The Series() constructor creates a new Series
print(series)
a 10
b 20
c 30
dtype: int64
1 20
2 30
3 40
dtype: int32
DataFrames
1. A DataFrame is a 2D data structure in which data is aligned in a tabular fashion, consisting of rows and columns.
2. A DataFrame can be created using the following constructor - pandas.DataFrame(data, index, columns, dtype, copy)
3. data: can take multiple forms such as ndarray, list, constant, Series, dict etc.
4. index, columns: row and column labels of the DataFrame; each defaults to np.arange(n) if not passed.
5. dtype: data type of each column; inferred if not given.
6. copy: whether to copy the input data; False by default.
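A minimal sketch of the constructor parameters listed above, using made-up values:

```python
import pandas as pd

# data: a dict of columns; index: row labels; dtype: element type for all columns
df = pd.DataFrame(
    {"a": [1, 2], "b": [2, 4]},
    index=["first", "second"],
    dtype="float64",
)

print(df)
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
```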
Creating a DataFrame
In [10]: # Converting a list into a DataFrame
list1 = [10, 20, 30, 40]
t = pd.Series(list1)
print(t)
table = pd.DataFrame(list1)
print(table)
0 10
1 20
2 30
3 40
dtype: int64
0
0 10
1 20
2 30
3 40
a b c
0 1 2 NaN
1 2 4 8.0
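In [4]: # Creating a DataFrame from a list of dictionaries and accompanying row indices
data = [{'a': 1, 'b': 2}, {'a': 2, 'b': 4, 'c': 8}] # Assumed input, inferred from the output shown
table2 = pd.DataFrame(data, index = ['first', 'second'])
# Dict keys become column labels
print(table2)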
a b c
first 1 2 NaN
second 2 4 8.0
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
three
a 10.0
b 20.0
c 30.0
d NaN
two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN
three
a 10.0
b 20.0
c 30.0
d NaN
three 30.0
Name: c, dtype: float64
In [24]: # Rows can also be selected by integer position using iloc
print(table3.iloc[1])
three 20.0
Name: b, dtype: float64
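The difference between label-based and position-based selection can be sketched as follows. The frame below is a hypothetical stand-in with made-up values, not the original `table3`.

```python
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame(
    {"two": [1, 2, 3, 4], "three": [10.0, 20.0, 30.0, None]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["c"]        # row whose label is 'c'
by_position = df.iloc[1]      # second row (positions start at 0)
column = df["three"]          # a single column, returned as a Series
rows = df.iloc[1:3]           # rows at positions 1 and 2

print(by_label)
print(by_position)
print(rows)
```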
In [ ]: # Rows can be added with pd.concat(); DataFrame.append() was removed in pandas 2.0
data2 = {'one': pd.Series([1, 2, 3], index = ['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])}
table5 = pd.DataFrame(data2)
table5['three'] = pd.Series([10, 20, 30], index = ['a', 'b', 'c'])
print(table5)
row = pd.DataFrame([{'one': 23}]) # A one-row DataFrame built from a list containing a dict
table6 = pd.concat([table5, row], ignore_index = True)
print(table6)
In [5]: # The drop() function can be used to drop rows whose labels are provided
print(table2)
table7 = table2.drop('first')
print(table7)
a b c
first 1 2 NaN
second 2 4 8.0
a b c
second 2 4 8.0
In [40]: # Data present in DataFrames can be written to a CSV file using the to_csv() function
# If the specified file doesn't exist, it is created automatically; the containing folder must already exist
print(table2)
table2.to_csv('C:/Users/amaly/Jupyter Work/DSML/data_file4.csv')
a b c d
first 1 2 NaN NaN
second 2 4 8.0 NaN
third 2 4 NaN 10.0
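Reading the file back is the natural counterpart of `to_csv()`. The sketch below uses a temporary file (a hypothetical path) and `read_csv()` with `index_col=0`, so that the first column written by `to_csv()` is restored as the row index.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 4]}, index=["first", "second"])

# Hypothetical temporary file, used here instead of a fixed path
path = os.path.join(tempfile.mkdtemp(), "example.csv")
df.to_csv(path)                              # the index is written as the first column

# index_col=0 restores the first column as the row index
restored = pd.read_csv(path, index_col=0)
print(restored)
```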
In [ ]: # Data stored in an Excel spreadsheet can be loaded into a DataFrame using read_excel()
sheet = pd.read_excel('cars2015.xlsx')
Matplotlib
1. Matplotlib is a Python library designed for creating graphs, charts and other plots for data visualisation, including interactive figures
2. Matplotlib is inspired by the MATLAB software and reproduces many of its features
Plotting in Matplotlib
In [7]: import matplotlib.pyplot as plt
plt.plot([1,2,3,4]) # List of vertical co-ordinates of the points plotted
plt.show() # Displays plot
# Implicit X-axis values from 0 to (N-1) where N is the length of the list
In [10]: # We can use NumPy to specify the values for both axes with greater precision
import numpy as np
x = np.arange(0, 5, 0.01)
plt.plot(x, [x1**2 for x1 in x]) # vertical co-ordinates of the points plotted: y = x^2
plt.show()
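Because NumPy arrays support element-wise arithmetic, the list comprehension can be replaced by a single expression. The sketch below assumes the non-interactive "Agg" backend so that it runs without a display; in a notebook that line can be dropped and `plt.show()` called as usual.

```python
import matplotlib
matplotlib.use("Agg")             # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 5, 0.01)
y = x ** 2                        # element-wise square of every value in x

(line,) = plt.plot(x, y)          # plot() returns a list of Line2D objects
plt.close()                       # release the figure
```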
Multiline Plots
In [11]: # Multiple functions can be drawn on the same plot
x = range(5)
plt.plot(x, [x1 for x1 in x])
plt.plot(x, [x1*x1 for x1 in x])
plt.plot(x, [x1*x1*x1 for x1 in x])
plt.show()
Grids
In [13]: # The grid() function adds a grid to the plot
# Passing True or False as the first argument switches the grid on or off
# The grid appears in the background of the plot
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.show()
In [15]: # The visible range of each axis can be set using xlim() and ylim()
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlim(-1, 5)
plt.ylim(-1, 10)
plt.show()
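Called without arguments, `xlim()` and `ylim()` return the current limits, which is a quick way to check what was set. Points outside the limits are still part of the data but are not drawn. A small sketch, again assuming the "Agg" backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")             # assumption: non-interactive backend
import matplotlib.pyplot as plt

x = range(5)
plt.plot(x, [x1 * 4 for x1 in x])  # largest value is 16, above the y-limit set below
plt.xlim(-1, 5)
plt.ylim(-1, 10)

x_limits = plt.xlim()             # (left, right)
y_limits = plt.ylim()             # (bottom, top)
print(x_limits, y_limits)
plt.close()
```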
Adding Labels
In [16]: # Labels can be added to the axes of the plot
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Adding a Legend
In [18]: # Legends explain the meaning of each line in the graph
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.show()
Adding Markers
In [19]: # 'bo' draws blue circle markers; each point is annotated with its co-ordinates
x = [1, 2, 3, 4, 5, 6]
y = [11, 22, 33, 44, 55, 66]
plt.plot(x, y, 'bo')
for x_cord, y_cord in zip(x, y):
    plt.text(x_cord, y_cord, str((x_cord, y_cord)), fontsize = 10)
plt.show()
Saving Plots
In [20]: # Plots can be saved using savefig()
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.savefig('plot.png') # Saves an image named 'plot.png' in the current directory
plt.show()
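`savefig()` also accepts options that control the output. The sketch below uses two of them, `dpi` (resolution) and `bbox_inches='tight'` (trims surrounding whitespace), and writes to a temporary folder; the path and the "Agg" backend are assumptions made so that the example runs anywhere.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")             # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
plt.plot(x, x * x, label="square")
plt.legend()

# Hypothetical output location
fig_path = os.path.join(tempfile.mkdtemp(), "plot.png")
plt.savefig(fig_path, dpi=150, bbox_inches="tight")
plt.close()

print(os.path.getsize(fig_path), "bytes written")
```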
Plot Types
Matplotlib provides many types of plot for visualising information, including:
1. Scatter Plot
2. Histogram
3. Bar Graph
4. Pie Chart
Histogram
In [21]: # Histograms display the distribution of a variable by counting how many values fall into each range (bin)
y = np.random.randn(100, 100) # 100x100 array of samples from a standard normal (Gaussian) distribution
plt.hist(y) # Takes the dataset as the parameter; each column of a 2D array is plotted as a separate dataset
plt.show()
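For a single variable, a one-dimensional array and an explicit number of bins give a simpler picture. `hist()` returns the bin counts and bin edges, which can be inspected directly. The sample size, seed and bin count below are illustrative choices, and the "Agg" backend is assumed so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")             # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)    # seeded generator for repeatable samples
data = rng.standard_normal(1000)  # 1000 samples from a standard normal distribution

# counts: number of samples per bin; edges: the 21 boundaries of the 20 bins
counts, edges, patches = plt.hist(data, bins=20)
print(counts)
plt.close()
```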
Bar Chart
In [23]: # Bar charts are used to visually compare two or more values using rectangular bars
# Default width of each bar is 0.8 units
# [1,2,3] Mid-point of the lower face of every bar
# [1,4,9] Heights of the successive bars in the plot
plt.bar([1,2,3], [1,4,9])
plt.show()
Pie Chart
In [26]: plt.figure(figsize = (3,3)) # Size of the plot in inches
x = [40, 20, 5] # Proportions of the sectors
labels = ['Bikes', 'Cars', 'Buses']
plt.pie(x, labels = labels)
plt.show()
Scatter Plot
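In [27]: # Scatter plots display values for two sets of data, visualised as a collection of points
# Two sets of uniformly distributed samples are plotted; np.random.rand() draws from [0, 1), whereas np.random.randn() would give Gaussian samples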
x = np.random.rand(1000)
y = np.random.rand(1000)
plt.scatter(x, y)
plt.show()
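To see how the choice of distribution changes the picture, uniform and Gaussian samples can be placed side by side. This is a sketch with assumed sample sizes, seed and marker size; the "Agg" backend is assumed so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")             # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
uniform_x, uniform_y = rng.random(1000), rng.random(1000)                    # uniform on [0, 1)
gauss_x, gauss_y = rng.standard_normal(1000), rng.standard_normal(1000)      # standard normal

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 4))
left.scatter(uniform_x, uniform_y, s=5)
left.set_title("Uniform")
right.scatter(gauss_x, gauss_y, s=5)
right.set_title("Gaussian")
plt.close(fig)
```

The uniform samples fill the unit square evenly, while the Gaussian samples cluster around the origin.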
Styling
In [28]: # Matplotlib allows custom colours to be chosen for plots
y = np.arange(1, 3)
plt.plot(y, 'y') # Specifying line colours
plt.plot(y+5, 'm')
plt.plot(y+10, 'c')
plt.show()
Color code:
1. b = Blue
2. c = Cyan
3. g = Green
4. k = Black
5. m = Magenta
6. r = Red
7. w = White
8. y = Yellow
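The colour codes can be combined with a marker or line-style character in the same format string, as with 'bo' in the markers example. A short sketch, assuming the "Agg" backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")             # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

y = np.arange(1, 6)

(dashed,) = plt.plot(y, "r--")       # red dashed line
(circles,) = plt.plot(y + 5, "go")   # green circle markers, no connecting line
(dotted,) = plt.plot(y + 10, "k:")   # black dotted line
plt.close()
```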