Data Science and Big Data by IBM CE Allsoft Summer Training Final Report
A training report
Submitted in partial fulfilment of the requirements for the award of degree of
B. TECH Computer Science and Engineering (Data Science (ML and AI))
Submitted to
PHAGWARA, PUNJAB
SUBMITTED BY
Karri John Pradeep Reddy (Registration No. 12109211)
Student Declaration
I, Karri John Pradeep Reddy, 12109211, hereby declare that the work done by me on “Data Science” from
June, 2023 to July, 2023, is a record of original work for the partial fulfillment of the requirements for the
award of the degree, B. TECH Computer Science And Engineering (Data Science (ML and AI)).
Registration No.: 12109211
Acknowledgement:
With heartfelt appreciation, I would like to extend my acknowledgment to the collective efforts of numerous
well-wishers who, in their own unique ways, have contributed to the successful completion of the Summer
Training. Accomplishing any technological endeavour is a collaborative effort, reliant on the support of many
individuals. In preparing this report, I have also sought assistance from various sources. It is now my
endeavour to express my profound gratitude to those who offered their valuable assistance.
First and foremost, I wish to convey my deep gratitude and indebtedness to our training mentors, Mr. Mayank Raghuwanshi and Mr. Abdul. Their unwavering support and guidance throughout the training have been instrumental in my journey. Without their valuable insights and direction, this work would not have achieved the level of success it has. At every step of the project, their supervision and counsel have played a pivotal role in shaping this training experience into a resounding accomplishment.
Project Completion Certificate:
Declaration Letter:
IBM SkillsBuild Certificate:
Timeline of Summer Training:
TABLE OF CONTENTS:
Introduction to Data Science
Data Science
Data science is the field of deriving insights from data using scientific techniques.
Computer Vision - The advancement of image recognition by computers involves processing large sets of image data from multiple objects of the same category, for example, face recognition.
The types of analysis, by the question they answer:
• What happened? → Reporting
• Why did it happen? → Detective Analysis
• What's happening now? → Dashboards
• What is likely to happen? → Predictive Analysis
Reporting / Management Information System
Detective Analysis
Asking questions based on the data we are seeing, e.g., why did something happen?
Predictive Modelling
Big Data
The stage where the complexity of handling data goes beyond traditional systems. It can be caused by the volume, variety, or velocity of the data, and requires specialised tools to analyse data at such scale.
• Recommendation System
Example: on Amazon, recommendations are different for different users according to their past searches.
• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products in e-commerce:
1. Recommendation systems
2. Analysis of past search data
3. Discount price optimization
• How do Google and other search engines know which results are the most relevant for our search query?
1. Apply ML and data science
2. Fraud detection
3. Ad placement
4. Personalized search results
Python Introduction
Why Python?
4. Extensive Packages.
• UNDERSTANDING OPERATORS:
• Variables are names bound to objects. Data types in Python include int (integer), float, Boolean, and string.
• CONDITIONAL STATEMENTS:
• FUNCTIONS:
o Functions are reusable pieces of code, created for solving a specific problem.
o Two types: built-in functions and user-defined functions.
o A function must be defined before it can be called.
• LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square brackets.
• DICTIONARY: A dictionary is an unordered data structure with elements separated by commas and stored as key: value pairs, enclosed within curly braces {}.
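A minimal sketch tying these constructs together (all names and values are illustrative):

# variables and operators
price = 100                      # int
discount = 0.1                   # float
final = price * (1 - discount)   # arithmetic operators

# conditional statement
if final > 50:
    print("expensive")
else:
    print("cheap")

# user-defined function: a reusable piece of code
def area(length, width):
    return length * width

marks = [78, 92, 85]                          # list: ordered, square brackets
student = {"name": "Asha", "marks": marks}    # dictionary: key: value pairs
print(area(3, 4), student["name"])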
Statistics:
Descriptive Statistics
Mode:
The mode is the most frequently occurring value in the data. It is robust and is generally not affected much by the addition of a couple of new values.
Code:
import pandas as pd
data = pd.read_csv("Mode.csv")   # read data from the CSV file
data.head()                      # print the first five rows
data.mode()                      # the most frequent value in each column
Outliers
Any value that falls well outside the range of the rest of the data is termed an outlier, e.g., 9700 recorded instead of 97.
Reasons of Outliers
• Intentional error: errors which are induced intentionally, e.g., claiming a smaller amount of alcohol consumed than actual.
• Legitimate outlier: values which are not actually errors but are in the data due to legitimate reasons.
Histograms
Inferential Statistics
Inferential statistics allows us to make inferences about the population from the sample data.
Hypothesis Testing:
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tell us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.
T-Tests:
A t-test uses the sample standard deviation to estimate the population standard deviation.
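A sketch of a one-sample t-test with scipy.stats (the sample values and hypothesised mean are made up for illustration):

from scipy import stats
import numpy as np

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6])
# test H0: population mean = 12 against Ha: population mean != 12
t_stat, p_value = stats.ttest_1samp(sample, popmean=12)
print(t_stat, p_value)   # a large p-value means we fail to reject H0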
Z-Score:
The distance of an observed value from the mean, measured in number of standard deviations, is the standard score or z-score: z = (x - mean) / standard deviation.
A distribution converted to z-scores always has the same shape as the original distribution.
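A minimal sketch of computing z-scores with NumPy (the observations are hypothetical):

import numpy as np

x = np.array([12, 15, 9, 20, 14])   # hypothetical observations
z = (x - x.mean()) / x.std()        # standard deviations away from the mean
print(z)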
Correlation:
Predictive Modelling:
Making use of past data and its attributes, we predict the future.
Types:
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values.
• Regression: the response has continuous possible values, e.g., marks.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
• Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people who buy X also tend to buy Y".
1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration
5. Predictive Modelling
6. Model Development/Implementation
Problem Definition:
Identify the right problem statement, ideally formulate the problem mathematically.
Hypothesis Generation:
List down all possible variables, which might influence problem objective. These variables should be free
from personal bias and preferences.
The quality of the model is directly proportional to the quality of the hypotheses.
Data Extraction/Collection:
Collect data from different sources and combine those for exploration and model building.
Data extraction is a process that involves retrieval of data from various sources for further data processing or
data storage.
Steps of Data Exploration
• Univariate Analysis
• Bivariate Analysis
• Outlier treatment
• Variable Transformation
Variable Treatment
Univariate Analysis:
Bivariate Analysis:
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
Missing Value Treatment:
Types of missing values:
1. MCAR (Missing completely at random): The missing values have no relation to the variable in which they exist or to other variables in the dataset.
2. MAR (Missing at random): The missing values have no relation to the variable in which they exist, but do have a relation to other variables in the dataset.
3. MNAR (Missing not at random): The missing values have a relation to the variable in which they exist.
Identifying missing values:
1. describe(): gives a statistical summary.
2. isnull(): output will be True or False for each cell.
Treating missing values:
1. Imputation: fill in missing values with a substituted value such as the mean, median, or mode.
2. Deletion: drop the rows or columns containing missing values.
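A minimal pandas sketch of identifying and treating missing values (the data frame and column names are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, 30, np.nan, 40], "city": ["A", "B", "B", None]})
print(df.describe())                               # statistical summary
print(df.isnull().sum())                           # missing values per column
df["age"] = df["age"].fillna(df["age"].median())   # imputation with the median
df = df.dropna()                                   # deletion of remaining incomplete rows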
Outlier Treatment:
Reasons of Outliers:
2. Measurement Errors
3. Processing Errors
Types of Outliers
Univariate: an extreme value in a single variable.
Bivariate: an unusual combination of values in two variables, e.g., in a scatter plot of height and weight, both variables will be analysed together.
Identifying Outliers
Graphical method:
• Box plot
• Scatter plot
Formula method:
Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers, where IQR = Q3 - Q1.
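A short sketch of the formula method in pandas (the series values are made up; 97 plays the role of the outlier):

import pandas as pd

s = pd.Series([12, 14, 15, 16, 18, 97])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])   # flags 97 as an outlier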
Variable Transformation:
We replace a variable with some function of that variable, e.g., replacing a variable x with its log.
Common methods of variable transformation: logarithm, square root, cube root, binning, etc.
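For instance, a log transformation compresses large values (hypothetical right-skewed data):

import numpy as np
import pandas as pd

s = pd.Series([1, 10, 100, 1000])
print(np.log(s))   # 0.0, 2.30, 4.61, 6.91 - the skew is greatly reduced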
Model Building:
It is a process to create a mathematical model for estimating/predicting the future based on past data.
E.g.:
A retailer wants to know the default behaviour of its credit card customers. They want to predict the probability of default for each customer in the next three months.
• The probability of default would lie between 0 and 1.
The model moves the probability towards one of the extremes based on attributes of past information. A customer with a healthy credit history over the last few years has a low chance of default (closer to 0).
Algorithm Selection:
Example algorithms:
• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is the process of learning the relationship/correlation between the independent and dependent variables. We use the dependent variable of the training dataset to fit the model.
Dataset
• Train: past data (known dependent variable), used to train the model.
• Test: future data (unknown dependent variable), used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test dataset by applying the model rules.
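A minimal sketch of producing such train and test sets with scikit-learn (the arrays are hypothetical):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)   # hypothetical features
y = np.arange(10)                  # hypothetical dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)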
Linear Regression:
Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables. It is assumed that the two variables are linearly related; hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).
[Figure: data points with the fitted regression line]
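A minimal scikit-learn sketch of fitting such a line (the data points are made up; y roughly follows 2x + 1):

from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([[0], [1], [2], [3], [4]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

model = LinearRegression()
model.fit(x, y)
print(model.coef_, model.intercept_)   # estimated slope m and intercept c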
Logistic Regression:
Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist.
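A minimal scikit-learn sketch (the toy data, hours studied versus pass/fail, is invented for illustration):

from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])   # hours studied
y = np.array([0, 0, 0, 1, 1, 1])               # fail/pass

model = LogisticRegression()
model.fit(X, y)
print(model.predict_proba([[3.5]]))   # probability of each class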
K-Means Clustering (Unsupervised learning):
K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data (i.e.,
data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K. The algorithm works iteratively to assign each data point to
one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
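A minimal scikit-learn sketch with K = 2 (the 2-D points are hypothetical):

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # centroid of each of the K groups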
Organizations use data to:
➢ Improve operations.
➢ Better understand end users or customers.
➢ Drive efficiency.
➢ Reduce costs.
➢ Increase profits.
➢ Find new innovations.
➢ Data is a problem solver.
Data analysts spend a lot of time working in a database. A database is an organized collection of structured
data in a computer system. Transforming data into standard format (or tidy data) makes storage and analysis
easier.
Then what is Big Data?
There is no official definition for big data, but according to tech giants Big Data is high-volume,
high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight, decision making and process automation.
[Figure: two line charts of the trending price of "Stock J" over time, drawn with different y-axis scales]
While each line chart presents the trending price over time for "Stock J", the vertical scale (y-axis) for price is different: the scales show the data in two different increments. The second chart is misleading because it does not depict $0 to $25 for price like the first chart does, and it shows price in $5 increments, making it look as if "Stock J" increased in price faster. The first chart is a more accurate depiction because it does not skip the range from $0 to $25 and shows price consistently in $10 increments.
The key point here is to be precise in how you choose to depict data.
There are four types of data analytics that answer key questions, build on each other, and increase in
complexity:
➢ Descriptive
➢ Diagnostic
➢ Predictive
➢ Prescriptive
There are three classic and widely
adopted data science methodologies:
➢ CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It consists of six phases, with arrows indicating the most important and frequent dependencies between phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment
➢ KDD (Knowledge Discovery in Databases) consists of five steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
➢ SEMMA consists of five phases:
1. Sample
2. Explore
3. Modify
4. Model
5. Assess
➢ Open-source industry tools: Git and GitHub are two related, but separate, platforms that are extremely popular and widely used by open-source contributors.
1. Host your own open source project. To do this, you create an online repository and add files.
2. Contribute to an existing open source project that's public. To do this, you access a copy of the project's repository, make updates, and request a review of the changes you want to contribute.
➢ Python:
➢ IBM Watson Studio: It's a collaborative data science and machine learning environment.
2. IBM Watson Studio offers a graphical interface with built-in operations.
3. You don’t need to know how to code to use the tool.
4. And, IBM Watson Studio has a built-in data refinery tool.
➢ Tableau: a popular data visualization and business intelligence software for deriving meaningful insights from data. Many businesses use Tableau for pictorial and graphical representations of data. With it you can:
1. Analyze large volumes of data.
2. Create different dashboards, charts, graphics, maps, stories and more to help make business decisions.
3. Perform tasks without programming experience; it offers an intuitive interface.
4. Design interactive visualizations.
➢ Matplotlib:
1. A Python Matplotlib script is structured so that, in most instances, a few lines of code can generate a
visual data plot.
2. You can create different types of plots, such as scatterplots, histograms, bar charts, and more.
3. The visualizations can be static, animated, and interactive.
4. You can export to many different types of file formats.
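A minimal sketch showing how few lines a basic plot needs (the values are arbitrary):

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])   # a simple line plot
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("A minimal Matplotlib plot")
plt.savefig("plot.png")                  # export to a file format such as PNG
plt.show()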
➢ Google Sheets is a free tool you can use to perform tasks like entering, analyzing, and visualizing
data to make data-driven decisions.
Project:
Problem Description:
Provided with the following file: cardata.csv.
Divide the data into training and test data, with selling price as the target variable.
Importing the file "car data" using the pandas read function and loading it into a pandas data frame:
In [3]: cars=pd.read_csv(r"C:\Users\johnp\OneDrive\Desktop\car data.csv")
cars
Out[3]: [the cars data frame: 301 rows × 9 columns]
[Output of cars.describe(): count is 301.000000 for each of the five numeric columns]
In [6]: # getting some information about the data points in the data set
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #  Column         Non-Null Count  Dtype
--- ------         --------------  -----
 0  Car_Name       301 non-null    object
 1  Year           301 non-null    int64
 2  Selling_Price  301 non-null    float64
 3  Present_Price  301 non-null    float64
 4  Kms_Driven     301 non-null    int64
 5  Fuel_Type      301 non-null    object
 6  Seller_Type    301 non-null    object
 7  Transmission   301 non-null    object
 8  Owner          301 non-null    int64
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB
In [7]: #checking if there are any null values in the data set
cars.isnull().sum()
Car_Name 0 Out[7]:
Year 0 Selling_Price
0
Present_Price 0
Kms_Driven 0
Fuel_Type 0 Seller_Type
0
Transmission 0 Owner
0 dtype: int64
In [9]: cars.Car_Name.unique()
We have bikes mixed into the dataset; we need to delete the bikes, as they decrease the efficiency of the model.
Every bike's present price in the dataset is less than 2 lakh, so we use this condition to eliminate all the bikes.
In [10]: bikes = cars[cars["Present_Price"] <= 2.0]
bikes

Out[10]: (abridged)
     Car_Name                   Year  Selling_Price  Present_Price  Kms_Driven  Fuel_Type  Seller_Type  Transmission
101  UM Renegade Mojave         2017  1.70           1.82           1400        Petrol     Individual   Manual
102  KTM RC200                  2017  1.65           1.78           4000        Petrol     Individual   Manual
103  Bajaj Dominar 400          2017  1.45           1.60           1200        Petrol     Individual   Manual
104  Royal Enfield Classic 350  2017  1.35           1.47           4100        Petrol     Individual   Manual
...  ...                        ...   ...            ...            ...         ...        ...          ...
197  Honda CB Twister           2010  0.16           0.51           33000       Petrol     Individual   Manual
198  Bajaj Discover 125         2011  0.15           0.57           35000       Petrol     Individual   Manual
199  Honda CB Shine             2007  0.12           0.58           53000       Petrol     Individual   Manual
200  Bajaj Pulsar 150           2006  0.10           0.75           92233       Petrol     Individual   Manual

98 rows × 9 columns
In [11]: # delete all rows where column "Present_Price" has a value less than 2.0
dropBikes = cars[cars["Present_Price"] <= 2.0].index
cars.drop(dropBikes, inplace=True)
cars.shape

Out[11]: (203, 9)
In [12]: plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.title('Selling Price Distribution Plot')
sns.distplot(cars.Selling_Price)
plt.subplot(1,2,2)
plt.title('Selling Price Spread')
sns.boxplot(y=cars.Selling_Price)
plt.show()
Inference:-
There is no significant difference between the mean and the median.
In [15]: def scatter(x, fig):
    plt.subplot(5,2,fig)
    plt.scatter(cars[x], cars["Selling_Price"])
    plt.title(x + " vs Selling_Price")
    plt.ylabel("Selling_Price")
    plt.xlabel(x)

plt.figure(figsize=(10,20))
scatter("Kms_Driven", 1)
scatter("Transmission", 2)
plt.tight_layout()
Inference:-
Kms driven is inversely proportional to the selling price for any car.
The lowest-priced cars have manual gear transmission, and the mean price of automatic-transmission cars is higher than that of manual-transmission cars.
Model Training
Linear Regression
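The notebook cells with the imports and the train/test split are not shown in the transcript; a plausible reconstruction (a sketch only; the encoding choices are assumptions) is:

from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split
from sklearn import metrics

# encode the categorical columns, drop the name, and split on the target Selling_Price
x = pd.get_dummies(cars.drop(columns=["Car_Name", "Selling_Price"]), drop_first=True)
y = cars["Selling_Price"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)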
In [21]: # linear regression model loading
linear_reg = LinearRegression()

In [22]: linear_reg.fit(x_train, y_train)
# the fitted equation has the form y = m*x + c

Out[22]: LinearRegression()
Model Evaluation
In [23]: #training data prediction
training_data_prediction=linear_reg.predict(x_train)
Another way to assess the accuracy of the model is to plot the values predicted by the model against the actual values.
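The plotting cell itself is not shown in the transcript; a sketch of such a plot would be:

plt.scatter(y_train, training_data_prediction)
plt.xlabel("Actual Selling_Price")
plt.ylabel("Predicted Selling_Price")
plt.title("Actual vs Predicted Selling Price")
plt.show()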
In [27]: # R-Squared error
error_score = metrics.r2_score(y_test, testing_data_prediction)
print("R-Squared error", error_score)

R-Squared error 0.8294993134054333
Inference:-
Looking at the plot, we can say that there isn't much scatter and the relationship is close to linear. If we had more data in the dataset we could get better predictions; as it stands, the prediction is not bad, but not quite up to the mark.
We now try lasso regression: linear regression generally works well when the variables are positively correlated, but when there are many variables influencing the target, lasso regression often works better.
2. Lasso Regression
In [29]: # lasso regression model loading
In [30]: lasso_reg = Lasso()
lasso_reg.fit(x_train, y_train)
Out[30]: Lasso()
In [31]: # training data prediction
training_data_prediction = lasso_reg.predict(x_train)

In [35]: # R-Squared error
error_score = metrics.r2_score(y_test, testing_data_prediction)
print("R-Squared error", error_score)
Comparing the two models, linear regression performs better than lasso regression, but the difference is small, so both are suitable for this project.
If the dataset were larger and had more columns, there would be a more significant difference between the two models.
Reason for choosing data science:
Data science has become a revolutionary technology that everyone seems to talk about. Hailed as the 'sexiest job of the 21st century', data science is a buzzword, with very few people knowing the technology in its true sense.
While many people wish to become data scientists, it is essential to weigh the pros and cons of data science to get a real picture. In this section, we discuss these points and provide the necessary insights about data science.
Advantages: -
1. It’s in Demand.
2. Abundance of Positions.
Disadvantages: -
Learning Outcome:
Bibliography:
• Google
• IBM SkillsBuild
• Wikipedia
• Python documentation
• Kaggle
• GeeksforGeeks