
Introduction To Numpy Pandas and Matplotlib

This document introduces NumPy, Pandas and Matplotlib for data analysis. NumPy provides multi-dimensional arrays and tools for working with arrays. Pandas provides data structures like Series and DataFrame for working with structured and labeled data. Matplotlib is a library for creating plots and visualizing data.

Uploaded by

rk73462002

Introduction to NumPy, Pandas and Matplotlib

Data Analysis
Data Analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.

Steps for Data Analysis, Data Manipulation and Data Visualization:

1. Transform Raw Data into a Desired Format
2. Clean the Transformed Data (Steps 1 and 2 together are also called Pre-processing of Data)
3. Prepare a Model
4. Analyse Trends and Make Decisions
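
The four steps above can be sketched on a tiny data set. This is only an illustration: the raw readings and the helper name to_float are hypothetical, and a straight-line fit stands in for "a model".

```python
import numpy as np

# Hypothetical raw readings: strings, one of them unusable
raw = ["1.0", "2.1", "n/a", "3.9", "5.2"]

# Step 1: transform the raw strings into floats (bad entries become NaN)
def to_float(text):
    try:
        return float(text)
    except ValueError:
        return np.nan

values = np.array([to_float(t) for t in raw])

# Step 2: clean the data by dropping the NaN entries
positions = np.arange(len(values))
mask = ~np.isnan(values)
x_clean, y_clean = positions[mask], values[mask]

# Step 3: prepare a simple model (a straight-line fit)
slope, intercept = np.polyfit(x_clean, y_clean, 1)

# Step 4: analyse the trend
trend = "rising" if slope > 0 else "flat or falling"
print(f"slope={slope:.2f}, trend={trend}")
```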

NumPy
NumPy is a package for scientific computing. It provides:

1. Multi-dimensional arrays


2. Methods for processing arrays
3. Element by element operations
4. Mathematical operations like logical, Fourier transform, shape manipulation, linear algebra and random number generation
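
A short sketch of several of the features listed above. The arrays and variable names are made up for illustration; only standard NumPy calls are used.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Element-by-element arithmetic and logical operations
total = a + b                  # [11., 22., 33.]
product = a * b                # [10., 40., 90.]
both = (a > 1) & (b < 30)      # [False, True, False]

# Linear algebra: solve the system M @ x = v
M = np.array([[2.0, 0.0], [0.0, 4.0]])
v = np.array([2.0, 8.0])
x = np.linalg.solve(M, v)      # [1., 2.]

# Fourier transform of a constant signal: everything lands in the zero-frequency bin
spectrum = np.fft.fft(np.ones(4))

# Random number generation with a seeded generator (reproducible)
rng = np.random.default_rng(0)
samples = rng.random(3)        # three floats in [0, 1)
```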

In [1]: import numpy as np

Ndarray - NumPy Array


The ndarray is a multi-dimensional array object consisting of two parts: the actual data, and some metadata that describes the stored data. Arrays are indexed just like sequences in Python, starting from 0.

1. Every element of an ndarray has the same data type, which is described by a data-type object called dtype
2. An item extracted from an ndarray is represented by a Python object of an array scalar type
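
The two points above can be checked directly. A minimal sketch, assuming a small float array chosen for illustration:

```python
import numpy as np

arr = np.array([1.5, 2.5, 3.5])

# Every element shares the array's dtype
print(arr.dtype)               # float64

# Extracting one item yields a NumPy array scalar, not a plain Python float
item = arr[0]
print(type(item).__name__)     # float64
print(isinstance(item, np.generic))

# .item() converts the array scalar to the closest built-in Python type
plain = item.item()
print(type(plain).__name__)    # float
```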

Single Dimensional Array

Creating a Numpy Array


In [2]: # Creating a single-dimensional array
a = np.array([1,2,3]) # Calling the array function
print(a)

[1 2 3]

In [3]: # Creating a multi-dimensional array


# Each set of elements within a square bracket indicates a row
# Array of two rows and two columns
b = np.array([[1,2], [3,4]])
print(b)

[[1 2]
[3 4]]

In [4]: # Creating an ndarray by wrapping a list


list1 = [1,2,3,4,5] # Creating a list
arr = np.array(list1) # Wrapping the list
print(arr)

[1 2 3 4 5]

In [5]: # Creating an array of numbers of a specified range


arr1 = np.arange(10, 100) # Array of numbers from 10 up to and excluding 100
print(arr1)

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]

In [6]: # Creating a 5x5 array of zeroes


arr2 = np.zeros((5,5))
print(arr2)

[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]

In [7]: # Creating a linearly spaced vector with a given number of points
vector = np.linspace(0, 20, 5) # Start, stop (inclusive), number of points
print(vector)

[ 0. 5. 10. 15. 20.]
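
Because linspace and arange are easy to confuse, the following sketch puts them side by side. The variable names are illustrative only.

```python
import numpy as np

# linspace: the third argument is the NUMBER of samples; the stop value is included
by_count = np.linspace(0, 20, 5)

# arange: the third argument is the STEP; the stop value is excluded
by_step = np.arange(0, 20, 5)

# retstep=True also returns the spacing that linspace computed
_, spacing = np.linspace(0, 20, 5, retstep=True)
print(by_count, by_step, spacing)
```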

In [8]: # Creating Arrays from Existing Data


x = [1,2,3]
# Used for converting Python sequences into ndarrays
c = np.asarray(x) #np.asarray(a, dtype = None, order = None)
print(c)

[1 2 3]

In [10]: # Converting a linear array of 8 elements into a 2x2x2 3D array


arr3 = np.zeros(8) # Flat array of eight zeroes
print(arr3)
arr3d = arr3.reshape((2,2,2)) # Restructured array
print(arr3d)

[0. 0. 0. 0. 0. 0. 0. 0.]
[[[0. 0.]
[0. 0.]]

[[0. 0.]
[0. 0.]]]

In [11]: # Flatten the 3D array to get back the linear array


arr4 = arr3d.ravel()
print(arr4)

[0. 0. 0. 0. 0. 0. 0. 0.]
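
One detail worth knowing when reshaping and flattening: ravel() returns a view where possible, while flatten() always copies. A small sketch with made-up data:

```python
import numpy as np

flat = np.arange(8)

# -1 asks NumPy to infer that dimension from the array's total size
cube = flat.reshape((2, 2, -1))
print(cube.shape)              # (2, 2, 2)

# ravel() returns a view when it can, so writes reach the original data
view = cube.ravel()
view[0] = 99
print(flat[0])                 # 99

# flatten() always returns an independent copy
copy = cube.flatten()
copy[1] = -1
print(flat[1])                 # 1
```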

Indexing of NumPy Arrays


In [12]: # NumPy array indexing is identical to Python's indexing scheme
arr5 = np.arange(2, 20)
print(arr5)
element = arr5[6]
print(element)

[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
8

In [13]: # Python's concept of list slicing extends to NumPy arrays.
# A slice object is constructed by passing start, stop, and step parameters to slice()
arr6 = np.arange(20)
print(arr6)
arr_slice = slice(1, 10, 2) # Start, stop & step
element2 = arr6[6]
print(arr6[arr_slice])

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[1 3 5 7 9]

In [14]: # Slicing items beginning with a specified index


arr7 = np.arange(20)
print(arr7[2:])

[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]

In [15]: # Extracting specific rows and columns using Slicing


d = np.array([[1,2,3], [3,4,5], [4,5,6]])
print(d)
print(d[0:2, 0:2]) # Slice the first two rows and the first two columns

[[1 2 3]
[3 4 5]
[4 5 6]]
[[1 2]
[3 4]]
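
Unlike list slices, basic NumPy slices are views onto the original array. The sketch below reuses the array d from the cell above to show the effect, plus two further slicing patterns.

```python
import numpy as np

d = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

# Basic slices are views: modifying the slice modifies d as well
block = d[0:2, 0:2]
block[0, 0] = 100
print(d[0, 0])                 # 100

# Use .copy() when an independent sub-array is needed
safe = d[0:2, 0:2].copy()
safe[0, 0] = -1
print(d[0, 0])                 # still 100

# A single column, and every second element of a row
last_column = d[:, -1]         # [3 5 6]
alternate = d[2, ::2]          # [4 6]
```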

NumPy Array Attributes


In [16]: print(d.shape) # Returns a tuple consisting of array dimensions
print(d.ndim) # Attribute returns the number of array dimensions
print(a.itemsize) # Returns the size of each array element in bytes (4 here, as a holds 32-bit integers)

(3, 3)
2
4
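
The default integer size depends on the platform, so itemsize can differ between machines. A sketch that fixes the dtype explicitly and shows a few related attributes:

```python
import numpy as np

# The dtype is fixed explicitly so itemsize does not depend on the platform
d = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]], dtype=np.int32)

print(d.dtype)     # int32
print(d.itemsize)  # 4 bytes per element
print(d.size)      # 9 elements in total
print(d.nbytes)    # 36 = size * itemsize
```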

Reading & Writing from Files


In [17]: # NumPy provides the option of importing data from files directly into ndarray using the loadtxt function
# The savetxt function can be used to write data from an array into a text file
#import os
#print(os.listdir('../input'))
arr_txt = np.loadtxt('C:/Users/amaly/Jupyter Work/DSML/Data_file1.txt',delimiter=',')
print(arr_txt)
int_array = arr_txt.astype(int)
print(int_array)
np.savetxt('C:/Users/amaly/Jupyter Work/DSML/newfile1.txt', int_array)

[10. 20. 30. 44. 57. 70. 35. 32. 55. 22.]
[10 20 30 44 57 70 35 32 55 22]

In [18]: # NumPy arrays can be dumped into CSV files using the savetxt function and the comma delimiter
# The genfromtxt function can be used to read data from a CSV file into a NumPy array
arr_csv = np.genfromtxt('C:/Users/amaly/Jupyter Work/DSML/Data_file2.csv', delimiter = ',')
print(arr_csv)
int_array2 = arr_csv.astype(int)
print(int_array2)
np.savetxt('C:/Users/amaly/Jupyter Work/DSML/Data_file3.csv', int_array2, delimiter = ',')

[10. 20. 44. 56. 78. 34. 67. 90.]


[10 20 44 56 78 34 67 90]
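
The cells above depend on files in a local folder. The following self-contained sketch performs the same write and read round trip in a temporary directory; the file name example.csv is hypothetical.

```python
import os
import tempfile

import numpy as np

data = np.array([10, 20, 44, 56, 78, 34, 67, 90])

# A temporary directory stands in for the notebook's local folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")   # hypothetical file name

    # fmt='%d' writes integers instead of the default scientific notation
    np.savetxt(path, data, delimiter=",", fmt="%d")

    # loadtxt returns floats unless a dtype is given
    restored = np.loadtxt(path, delimiter=",", dtype=int)

print(restored)
```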

Pandas
Pandas is an open-source Python library providing efficient, easy-to-use data structures and data analysis tools. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets. Pandas is well suited to many different kinds of data:

1. Tabular data with heterogeneously typed columns.
2. Ordered and unordered time series data.
3. Arbitrary matrix data with row and column labels.
4. Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a pandas data structure.

Pandas provides three data structures, all of which are built on top of the NumPy array and all of which are value-mutable:

1. Series (1D) - labeled, homogeneous array of immutable size
2. DataFrames (2D) - labeled, heterogeneously typed, size-mutable tabular data structures
3. Panels (3D) - labeled, size-mutable array (removed from pandas in later releases)

In [2]: import pandas as pd

Series

1. A Series is a one-dimensional array structure that stores homogeneous data, i.e., data of a single type.
2. All the elements of a Series are value-mutable and size-immutable.
3. data: can take multiple forms such as an ndarray, list, constant, Series, or dict.
4. index: labels must be hashable and have the same length as data; duplicate labels are allowed, although unique labels are easier to work with. Defaults to np.arange(n) if no index is passed.
5. dtype: data type of the Series; if none is given, it is inferred automatically.
6. copy: whether to copy the input data; set to False by default.
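
A minimal sketch of the Series constructor with its arguments passed by keyword. The labels and values are made up for illustration.

```python
import pandas as pd

# data, index and dtype passed explicitly
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"], dtype="float64")

# Values are mutable in place
prices["b"] = 2.5

# A scalar is broadcast to every label in the index
filled = pd.Series(0, index=["x", "y", "z"])

print(prices)
print(filled)
```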

Creating a Series
In [5]: # Creating an empty Series
series = pd.Series() # The Series() function creates a new Series
print(series)

Series([], dtype: object)

In [6]: # Creating a series from an ndarray


# Note that indexes are assigned automatically if not specified
import numpy as np
arr = np.array([10,20,30,40,50])
series1 = pd.Series(arr)
print('Numpy 1D array:', arr)
print('Pandas Series:\n',series1)

Numpy 1D array: [10 20 30 40 50]


Pandas Series:
0 10
1 20
2 30
3 40
4 50
dtype: int32

In [7]: # Creating a series from a Python dict


# Note that the keys of the dictionary are used to assign indexes during conversion
data = {'a':10, 'b':20, 'c':30}
series2 = pd.Series(data)
print(series2)

a 10
b 20
c 30
dtype: int64

In [8]: # Retrieving a part of the series using slicing


print(series1[1:4])

1 20
2 30
3 40
dtype: int32
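
When a Series carries its own labels, slicing by position and slicing by label behave differently at the end point. A sketch with illustrative labels:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# iloc slices by integer position and excludes the stop position
by_position = s.iloc[1:4]      # labels b, c, d

# loc slices by label and INCLUDES the stop label
by_label = s.loc["b":"d"]      # labels b, c, d

print(by_position)
print(by_label)
```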

DataFrames

1. A DataFrame is a 2D data structure in which data is aligned in a tabular fashion, in rows and columns.
2. A DataFrame can be created using the constructor pandas.DataFrame(data, index, columns, dtype, copy).
3. data: can take multiple forms such as an ndarray, list, constant, Series, or dict.
4. index and columns: row and column labels of the DataFrame; each defaults to np.arange(n) if no labels are passed.
5. dtype: data type to force on the columns; if none is given, it is inferred.
6. copy: whether to make a deep copy of the input data; False by default in older pandas releases.
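
A minimal sketch of the DataFrame constructor with its arguments passed by keyword. The row and column labels are hypothetical.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# data, row labels (index), column labels (columns) and dtype given explicitly
frame = pd.DataFrame(values,
                     index=["r1", "r2", "r3"],
                     columns=["x", "y"],
                     dtype="float64")
print(frame)
print(frame.dtypes)
```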

Creating a DataFrame
In [10]: # Converting a list into a DataFrame
list1 = [10, 20, 30, 40]
t = pd.Series(list1)
print(t)
table = pd.DataFrame(list1)
print(table)

0 10
1 20
2 30
3 40
dtype: int64
0
0 10
1 20
2 30
3 40

In [3]: # Creating a DataFrame from a list of dictionaries


data = [{'a':1, 'b':2}, {'a':2, 'b':4, 'c':8}]
table1 = pd.DataFrame(data)
print(table1)
# NaN (not a number) is stored in areas where no data is provided

a b c
0 1 2 NaN
1 2 4 8.0

In [4]: # Creating a DataFrame from a list of dictionaries and accompanying row indices
table2 = pd.DataFrame(data, index = ['first', 'second'])
# Dict keys become column labels
print(table2)

a b c
first 1 2 NaN
second 2 4 8.0

In [16]: # Converting a dictionary of series into a DataFrame


data1 = {'one':pd.Series([1,2,3], index = ['a', 'b', 'c']),
'two':pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])}
table3 = pd.DataFrame(data1)
print(table3)
# the resultant index is the union of all the series indexes passed

one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4

DataFrame - Addition & Deletion of Columns


In [30]: # A new column can be added to a DataFrame when the data is passed as a Series
table3['three'] = pd.Series([10,20,30], index = ['a', 'b', 'c'])
tableX = table3 # A second name for the same DataFrame, not a copy
print(table3)

one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN

In [20]: # DataFrame columns can be deleted using the del statement


del table3['one']
print(table3)

two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN

In [21]: # DataFrame columns can be deleted using the pop() function


table3.pop('two')
print(table3)

three
a 10.0
b 20.0
c 30.0
d NaN
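
Both del and pop() change the DataFrame in place. The sketch below contrasts them with drop(), which returns a new DataFrame; the table and its column names are made up for illustration.

```python
import pandas as pd

table = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

# A column computed from existing columns
table["sum"] = table["one"] + table["two"]

# pop() removes the column in place and returns it as a Series
removed = table.pop("two")

# drop() returns a new DataFrame and leaves the original untouched
trimmed = table.drop(columns=["one"])

print(list(table.columns))     # ['one', 'sum']
print(list(trimmed.columns))   # ['sum']
```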

DataFrame - Addition & Deletion of Rows


In [22]: # DataFrame rows can be selected by passing the row label to the loc indexer
print(table3.loc['c'])

three 30.0
Name: c, dtype: float64

In [24]: # Row selection can also be done by integer position using the iloc indexer
print(table3.iloc[1])

three 20.0
Name: b, dtype: float64

In [ ]: # New rows can be added with pd.concat() (DataFrame.append() was removed in pandas 2.0)
data2 = {'one':pd.Series([1,2,3], index = ['a', 'b', 'c']),
'two':pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])}
table5 = pd.DataFrame(data2)
table5['three'] = pd.Series([10,20,30], index = ['a', 'b', 'c'])
print(table5)
row = pd.DataFrame([{'one':23}]) # A list of dicts, one dict per row
table6 = pd.concat([table5, row], ignore_index = True) # Columns missing from the new row are filled with NaN
print(table6)

In [5]: # The drop() function can be used to drop rows whose labels are provided
print(table2)
table7 = table2.drop('first')
print(table7)

a b c
first 1 2 NaN
second 2 4 8.0
a b c
second 2 4 8.0
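
Adding and dropping rows can be combined in one short sketch. It assumes a small table like table2 above; the row label 'third' is hypothetical.

```python
import pandas as pd

table = pd.DataFrame({"a": [1, 2], "b": [2, 4]}, index=["first", "second"])

# Build the new row as a one-row DataFrame with its own label
new_row = pd.DataFrame({"a": [3], "b": [6]}, index=["third"])

# concat() stacks the frames vertically and returns a new DataFrame
longer = pd.concat([table, new_row])

# drop() removes rows by label, again returning a new DataFrame
shorter = longer.drop(["first", "second"])

print(longer)
print(shorter)
```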

Importing & Exporting Data


In [ ]: # Data can be loaded into DataFrames from input data stored in the CSV format using the read_csv() function
table_csv = pd.read_csv('../input/Cars2015.csv')

In [40]: # Data present in DataFrames can be written to a CSV file using the to_csv() function
# If no file exists at the specified path, one is created automatically
print(table2)
table2.to_csv('C:/Users/amaly/Jupyter Work/DSML/data_file4.csv')

a b c d
first 1 2 NaN NaN
second 2 4 8.0 NaN
third 2 4 NaN 10.0

In [ ]: # Data can be loaded into DataFrames from Excel spreadsheets using read_excel()
sheet = pd.read_excel('cars2015.xlsx')

In [ ]: # Data present in DataFrames can be written to a spreadsheet file using to_excel()
# If no file exists at the specified path, one is created automatically
sheet.to_excel('newcars2015.xlsx')
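
The cells above need files on disk. As a self-contained sketch, the same CSV round trip can be run against an in-memory text buffer; the table contents are illustrative.

```python
import io

import pandas as pd

table = pd.DataFrame({"a": [1, 2], "b": [2, 4]}, index=["first", "second"])

# An in-memory text buffer stands in for a file on disk
buffer = io.StringIO()
table.to_csv(buffer)           # the row labels are written as the first column

buffer.seek(0)
# index_col=0 turns that first column back into the row labels
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```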

Matplotlib

1. Matplotlib is a Python library designed for creating graphs, charts, and other plots for interactive data visualisation
2. Matplotlib is inspired by the MATLAB software and reproduces many of its features

In [6]: # Import Matplotlib submodule for plotting


import matplotlib.pyplot as plt

Plotting in Matplotlib
In [7]: plt.plot([1,2,3,4]) # List of vertical co-ordinates of the points plotted
plt.show() # Displays plot
# Implicit X-axis values from 0 to (N-1) where N is the length of the list

In [8]: # We can specify the values for both axes


x = range(5) # Sequence of values for the x-axis
# X-axis values specified - [0,1,2,3,4]
plt.plot(x, [x1**2 for x1 in x]) # vertical co-ordinates of the points plotted: y = x^2
plt.show()

In [10]: # We can use NumPy to specify the x-axis values at a finer resolution
import numpy as np
x = np.arange(0, 5, 0.01)
plt.plot(x, x**2) # NumPy arrays support element-wise arithmetic: y = x^2
plt.show()

Multiline Plots
In [11]: # Multiple functions can be drawn on the same plot
x = range(5)
plt.plot(x, [x1 for x1 in x])
plt.plot(x, [x1*x1 for x1 in x])
plt.plot(x, [x1*x1*x1 for x1 in x])
plt.show()

In [12]: # Different colours are used for different lines


x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*x1 for x1 in x],
x, [x1*x1*x1 for x1 in x])
plt.show()

Grids
In [13]: # The grid() function adds a grid to the plot
# grid() accepts a Boolean that turns the grid on or off
# grid appears in the background of the plot
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.show()

Limiting the Axes


In [14]: # The axis limits of the plot can be set using axis()
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.axis([-1, 5, -1, 10]) # Sets new axis limits: [xmin, xmax, ymin, ymax]
plt.show()

In [15]: # The axis limits can also be set separately using xlim() and ylim()
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlim(-1, 5)
plt.ylim(-1, 10)
plt.show()

Adding Labels
In [16]: # Labels can be added to the axes of the plot
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Adding the Title


In [17]: # The title describes the data plotted on the graph
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph") # Pass the title as a parameter to title()
plt.show()

Adding a Legend
In [18]: # Legends explain the meaning of each line in the graph
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.show()

Adding Markers
In [19]: x = [1, 2, 3, 4, 5, 6]
y = [11, 22, 33, 44, 55, 66]
plt.plot(x, y, 'bo') # 'bo' draws blue circle markers with no connecting line
for i in range(len(x)):
    x_cord = x[i]
    y_cord = y[i]
    # Annotate each point with its coordinates
    plt.text(x_cord, y_cord, f'({x_cord}, {y_cord})', fontsize = 10)
plt.show()

Saving Plots
In [20]: # Plots can be saved using savefig()
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.savefig('plot.png') # Saves an image named 'plot.png' in the current directory
plt.show()
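
The same labelled plot can also be built with Matplotlib's object-oriented interface, which keeps explicit handles on the figure and axes. This is a sketch only: the Agg backend is chosen so that it runs without a display, and the output path is a hypothetical file in a temporary folder.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")          # non-interactive backend: draws to files only
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)

# subplots() returns the Figure and Axes objects explicitly
fig, ax = plt.subplots()
ax.plot(x, x, label="linear")
ax.plot(x, x**2, label="square")
ax.plot(x, x**3, label="cube")
ax.grid(True)
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Polynomial Graph")
ax.legend()

# The file name is hypothetical; a temporary folder keeps the example tidy
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "plot.png")
    fig.savefig(path, dpi=150)
    saved = os.path.exists(path)

plt.close(fig)
print(saved)
```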

Plot Types
Matplotlib provides many types of plot formats for visualising information

1. Scatter Plot
2. Histogram
3. Bar Graph
4. Pie Chart

Histogram
In [21]: # Histograms display the distribution of a variable: how many values fall into each range
y = np.random.randn(100, 100) # 100x100 array of samples from a standard normal (Gaussian) distribution
plt.hist(y) # A 2D input is treated as one dataset per column
plt.show()

In [22]: # A histogram groups values into non-overlapping intervals called bins
# The default number of bins is 10
y = np.random.randn(1000)
plt.hist(y, 100) # Use 100 bins instead
plt.show()
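
To see what binning produces without drawing anything, NumPy's histogram function returns the counts and bin edges directly. A sketch with a seeded generator so that the run is reproducible:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.standard_normal(1000)

# np.histogram returns the count per bin and the bin edges
counts, edges = np.histogram(y, bins=10)

print(len(counts))   # 10 bins
print(len(edges))    # 11 edges, one more than the number of bins
print(counts.sum())  # 1000, every sample falls into exactly one bin
```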

Bar Chart
In [23]: # Bar charts are used to visually compare two or more values using rectangular bars
# Default width of each bar is 0.8 units
# [1,2,3] Mid-point of the lower face of every bar
# [1,4,9] Heights of the successive bars in the plot
plt.bar([1,2,3], [1,4,9])
plt.show()

In [24]: dictionary = {'A':25, 'B':70, 'C':55, 'D':90}
for i, key in enumerate(dictionary):
    plt.bar(i, dictionary[key]) # Each key-value pair is plotted as its own bar at x-position i
plt.show()

In [25]: dictionary = {'A':25, 'B':70, 'C':55, 'D':90}
for i, key in enumerate(dictionary):
    plt.bar(i, dictionary[key])
plt.xticks(np.arange(len(dictionary)), list(dictionary.keys())) # Adds the keys as labels on the x-axis
plt.show()
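
As an alternative to the loop, the keys and values can be passed to bar() in a single call, since Matplotlib treats string x-values as categories. A sketch, using the Agg backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt

dictionary = {"A": 25, "B": 70, "C": 55, "D": 90}

fig, ax = plt.subplots()
# String x-values are treated as categories and become the tick labels
bars = ax.bar(list(dictionary.keys()), list(dictionary.values()))

heights = [bar.get_height() for bar in bars]
plt.close(fig)
print(heights)
```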

Pie Chart
In [26]: plt.figure(figsize = (3,3)) # Size of the plot in inches
x = [40, 20, 5] # Proportions of the sectors
labels = ['Bikes', 'Cars', 'Buses']
plt.pie(x, labels = labels)
plt.show()

Scatter Plot
In [27]: # Scatter plots display values for two sets of data, visualised as a collection of points
# Two sets of uniformly distributed random values in [0, 1) are plotted
x = np.random.rand(1000)
y = np.random.rand(1000)
plt.scatter(x, y)
plt.show()

Styling
In [28]: # Matplotlib allows custom colours to be chosen for plots
y = np.arange(1, 3)
plt.plot(y, 'y') # Specifying line colours
plt.plot(y+5, 'm')
plt.plot(y+10, 'c')
plt.show()

Color code:

1. b = Blue
2. c = Cyan
3. g = Green
4. k = Black
5. m = Magenta
6. r = Red
7. w = White
8. y = Yellow

In [29]: # Matplotlib allows different line styles for plots


y = np.arange(1, 100)
plt.plot(y, '--', y*5, '-.', y*10, ':')
plt.show()
# - Solid line
# -- Dashed line
# -. Dash-Dot line
# : Dotted Line

In [30]: # Matplotlib provides customization options for markers


y = np.arange(1, 3, 0.2)
plt.plot(y, '*',
y+0.5, 'o',
y+1, 'D',
y+2, '^',
y+3, 's') # Specifying marker styles
plt.show()
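
Colour, marker, and line style can be combined in one format string or spelled out as keyword arguments. A sketch with made-up data, using the Agg backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

y = np.arange(1, 6)

fig, ax = plt.subplots()
# One format string: colour 'r', marker 'o', line style '--'
(line1,) = ax.plot(y, "ro--")
# The same kind of settings spelled out as keyword arguments
(line2,) = ax.plot(y + 2, color="g", marker="s", linestyle=":")

print(line1.get_color(), line1.get_marker(), line1.get_linestyle())
plt.close(fig)
```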