Machine Learning
Lab Guide
Student Version
1.1 Introduction
1.1.1 About This Lab
Feature engineering is the process of extracting features from raw data. Data and features
determine the upper limit of machine learning, while models and algorithms only help to
approach this upper limit. Feature engineering and construction aim to make the extracted
features represent the essential characteristics of the data as fully as possible, so that a
model built on these features predicts well on unknown datasets.
1.1.2 Objectives
Upon completion of this task, you will be able to:
Master the Python-based feature selection method.
Master the Python-based feature extraction method.
Master the Python-based feature construction method.
1.2.2 Procedure
1.2.2.1 Importing Data
Code:
import pandas as pd
Output:
The missing values in the data may be caused by machine faults, manual input errors, or
service attributes. The method for processing the missing values varies with the cause.
missingno is a tool for visualizing missing values. You can run the following command to
view the missing-value distribution in the data:
Code:
Output:
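The command left blank above might look like the following. This is a minimal sketch; it assumes the DataFrame is named data and that the missingno and Matplotlib packages are installed.
import missingno as msno
import matplotlib.pyplot as plt
# Draw the missing-value matrix: blank stripes indicate missing entries.
msno.matrix(data)
plt.show()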
Pandas provides fillna() to fill missing values; mode() computes the mode, which is used
here as the fill value. Construct a for loop to process the multiple fields that contain
missing values and fill each one with its mode.
# Define the list of fields with missing values.
# Use the for loop to process the missing values in the multiple fields.
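A minimal sketch of this loop, assuming the DataFrame is named data (an illustrative name):
# Define the list of fields with missing values.
missname = data.columns[data.isnull().sum() > 0].tolist()
# Use a for loop to fill the missing values in each field with its mode.
for col in missname:
    data[col] = data[col].fillna(data[col].mode()[0])
# Check the missing rate of each field after processing.
print(data.isnull().sum() / len(data))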
After the processing is complete, check the missing rate of each field.
----End
1.3.3 Filter
Step 1 Analyze the crosstab.
Apply the crosstab() method to draw a crosstab by using the variable House_State and
the target variable Target as an example.
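One possible form of this step, assuming the data is stored in a DataFrame named data:
import pandas as pd
# Crosstab of House_State against Target; normalize='index' yields the
# default rate within each House_State value.
pd.crosstab(data['House_State'], data['Target'], normalize='index')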
In the output, the default rate is 0.019 when House_State is 1 and 0.045 when House_State
is 2. If these default rates were considered the same, the variable House_State would not
affect the default prediction.
The crosstab analysis supports only a preliminary judgment. A chi-square test is further
needed to determine whether the numerical difference is statistically significant.
Separate independent variables and dependent variables from the raw data, and select
categorical variables from the independent variables.
The Target field is a target variable and is assigned to y. The column with the target
variable removed is assigned to X as an independent variable. X_category indicates a
categorical variable.
Import the chi-square test package chi2 of sklearn.feature_selection and use chi2() to
calculate the chi-square values of each categorical variable and target variable.
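A sketch of this step; the column name Target comes from the text, while the list category_cols holding the categorical column names is illustrative:
from sklearn.feature_selection import chi2
# Separate independent and dependent variables.
y = data['Target']
X = data.drop('Target', axis=1)
X_category = X[category_cols]  # categorical independent variables
# chi2() returns the chi-square statistic and the p-value of each variable.
chi2_values, p_values = chi2(X_category, y)
print(chi2_values, p_values)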
If two continuous independent variables are highly correlated, delete one of the two
independent variables or extract common information from the two independent variables.
The method parameter indicates the method for calculating the correlation coefficient.
The options are 'pearson' (the default), 'kendall', and 'spearman'.
Calculate the correlation coefficient between continuous independent variables and select
the combination of independent variables whose correlation coefficient is greater than 0.8.
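A sketch of this calculation, assuming the continuous independent variables are collected in a DataFrame named X_continuous:
# Pearson correlation matrix of the continuous variables.
corr_matrix = X_continuous.corr(method='pearson')
# Select pairs of variables whose correlation coefficient exceeds 0.8.
pairs = [(a, b, corr_matrix.loc[a, b])
         for i, a in enumerate(corr_matrix.columns)
         for b in corr_matrix.columns[i + 1:]
         if abs(corr_matrix.loc[a, b]) > 0.8]
print(pairs)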
----End
1.3.4 Wrapper
In the wrapper selection method, different feature subsets are used for modeling, and
model precision serves as the evaluation indicator for each subset. A base model is
selected and trained over multiple rounds; after each round, the features with the lowest
weight coefficients are removed, and the next round of training is performed on the
reduced feature set. The RFE() method of the feature_selection submodule in sklearn is
invoked, with the logistic regression model LogisticRegression() as the base model, and
the following parameters are passed to it.
RFE() parameters:
estimator: basic training model, which is a logistic regression model in this example.
n_features_to_select: indicates the number of retained features.
fit(X,y): invokes and trains a model.
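A sketch of the invocation that produces the output below, assuming X and y are prepared as in the previous section:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Keep 20 features, using logistic regression as the base model.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
rfe.fit(X, y)
print(rfe.n_features_)
print(rfe.support_)
print(rfe.ranking_)
print(rfe.estimator_)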
Output:
20
[ True True False True True True False True True True True False
False True False True True False True True True True True True
True False False False True False]
[ 1 1 9 1 1 1 10 1 1 1 1 6 3 1 11 1 1 8 1 1 1 1 1 1
1 5 4 7 1 2]
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)
The fitted RFE() object provides the following attributes:
n_features_: the number of selected features, that is, the value of the
n_features_to_select parameter passed to RFE().
support_: indicates that the selected features are displayed at their corresponding
positions. True indicates that the feature is retained, and False indicates that the
feature is removed.
ranking_: indicates the feature ranking. ranking_[i] corresponds to the ranking of
the ith feature, and the value 1 indicates an optimal feature. The selected features are
the 20 features ranked 1, namely, the optimal features.
estimator_: returns the parameters of the base model.
1.3.5 Embedded
The embedded method uses a machine learning model for training to obtain weight
coefficients of features, and selects features in descending order of the weight coefficients.
After the model training is complete, the weight evaluation value of each feature is printed.
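The guide does not fix the model here; the sketch below uses a random forest (an assumption; any model exposing feature_importances_ works) and prints the sorted weights:
from sklearn.ensemble import RandomForestClassifier
# Train the model and pair each feature name with its rounded importance.
model = RandomForestClassifier()
model.fit(X, y)
print(sorted(zip(map(lambda v: round(v, 4), model.feature_importances_),
                 X.columns), reverse=True))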
Output:
[(0.1315, 'Ast_Curr_Bal'),
(0.1286, 'Age'),
(0.0862, 'Year_Income'),
(0.0649, 'Std_Cred_Limit'),
(0.043, 'ZX_Max_Account_Number'),
(0.0427, 'Highest Education'),
(0.0416, 'ZX_Link_Max_Overdue_Amount'),
(0.0374, 'ZX_Max_Link_Banks'),
(0.0355, 'Industry'),
(0.0354, 'ZX_Max_Overdue_Duration'),
(0.0311, 'ZX_Total_Overdu_Months'),
(0.0305, 'Marriage_State'),
(0.0305, 'Duty'),
(0.0292, 'Couple_Year_Income'),
(0.0279, 'ZX_Credit_Max_Overdu_Amount'),
(0.0246, 'ZX_Max_Overdue_Account'),
(0.0241, 'ZX_Max_Credit_Banks'),
(0.0221, 'ZX_Max_Credits'),
(0.0205, 'Birth_Place'),
(0.0195, 'Loan_Curr_Bal'),
(0.0173, 'L12_Month_Pay_Amount'),
(0.015, 'ZX_Credit_Max_Overdue_Duration'),
(0.013, 'Title'),
(0.0097, 'ZX_Credit_Total_Overdue_Months'),
(0.0096, 'Nation'),
(0.0084, 'Gender'),
(0.0079, 'Work_Years'),
(0.0064, 'ZX_Max_Overdue_Credits'),
(0.0059, 'House_State'),
(0.0, 'Couple_L12_Month_Pay_Amount')]
# Convert data.
To check the correlation between the newly generated variable and the target variable,
construct a dataset containing the target variable and the newly generated variable first.
The corr() function is used to calculate the correlation coefficient between the newly
generated variable and the target variable.
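A sketch of this check; new_feature stands for the newly constructed variable and is an illustrative name:
# Build a dataset containing the target and the newly generated variable.
new_data = pd.concat([y, new_feature], axis=1)
# corr() returns the pairwise correlation coefficients.
print(new_data.corr())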
Output:
2.1 Introduction
2.1.1 About This Lab
Mr. Zhao works in the AI algorithm department of e-commerce platform company A and
is responsible for product recommendation for online businesses. In the modern world of
the Internet and e-commerce, people are overwhelmed by data. Although this data
contains useful information, users cannot extract what interests them from it by
themselves. To help users find product information, a recommendation system models the
similarities between users and products and makes suggestions to customers based on
these similarities. A recommendation system is beneficial in:
Helping users find the right products.
Increasing user engagement. For example, Google News saw a 40% increase in hits
due to recommendations.
Helping item providers deliver items to the right users. At Amazon, 35% of
products are sold through recommendations.
Helping personalize the recommended content. At Netflix, most rented movies come
from recommendations.
2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
Step 1 Import the required packages.
Functions in the NumPy library are used to perform basic operations on arrays. Pandas
provides many data processing methods and time sequence operation methods.
View the format of the read data. You can use the head() function to check the first five
rows of the data and get a rough understanding of its content.
You can further view the data size (the number of samples and the number of features)
through the shape attribute.
After learning the data size, view the data type of each column through the dtypes
attribute to facilitate subsequent calculation.
According to the result, only Rating and timestamp are of the numeric type and can be
used for mathematical calculation. If userId and productId need to be used for
mathematical calculation, convert their types. In addition, you can use the info()
function to view the general information about the data.
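A consolidated sketch of these inspection steps (the DataFrame name data and the commented-out conversion are illustrative):
print(data.head())    # first five rows
print(data.shape)     # (number of samples, number of features)
print(data.dtypes)    # data type of each column
# Convert userId and productId if they are needed in calculations.
# data['userId'] = data['userId'].astype('category').cat.codes
data.info()           # general information about the data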
The result contains the number of data samples, the feature types, the data types, and the
data storage size. The info() function displays all of this information by default, but you
can set an item to False to hide it. For example, you can run the following command to
hide the data storage size:
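For example, the following call (memory_usage is a standard pandas parameter) hides the storage size:
data.info(memory_usage=False)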
Product ratings are important data that reflect users' preferences, and they are critical to
an efficient recommendation system. You can use the describe() function to view an
overview of the numeric data. To view only the preliminary analysis of Rating, add the
corresponding column name in square brackets to the end of the command.
The result contains the mean, maximum, minimum, standard deviation, and quartiles of
the data; the product rating is generally about 4. You can use the min() and max()
functions to print the minimum and maximum values of the rating.
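A sketch of these commands:
# Overview of the numeric data; the column name narrows it to Rating.
print(data.describe()['Rating'])
print(data['Rating'].min())
print(data['Rating'].max())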
You can also use the print() function to print the result or the value of a parameter.
According to the result, the highest rating is 5, indicating that users' ratings on the product
are generally high.
The factors that most affect data quality are missing values and abnormal values.
As the ratings all fall within the normal range, you only need to use the isnull() function
to check for null values, and then use the sum() function to count the missing values in
each column.
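A minimal sketch:
# Count the missing values in each column; 0 means the column is complete.
print(data.isnull().sum())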
A user can rate multiple products. Similarly, a product can be rated by different users. To
determine the product types and the number of users, you need to check whether the users
and products are unique.
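A minimal sketch using nunique():
# Count the distinct users and products to check uniqueness.
print(data['userId'].nunique())
print(data['productId'].nunique())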
You can use the drop() function to delete the product rating time.
axis: deletes a column when it is set to 1, and deletes a row when it is set to 0.
inplace: modifies the data in place when it is set to True.
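One possible call, assuming the rating-time column is named timestamp:
data.drop('timestamp', axis=1, inplace=True)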
Sort the users and products by rating and view the sorting result.
groupby(): groups the data by the specified column.
sort_values(): sorts a group of data by value.
ascending: sorts in ascending order when set to True.
After obtaining the product data corresponding to the sorted user ratings, return the
quantiles by using the quantile() function and display them in a chart.
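A sketch of this step, assuming NumPy and Matplotlib are imported as np and plt:
# Number of ratings given by each user, in descending order.
ratings_per_user = data.groupby('userId')['Rating'].count().sort_values(ascending=False)
# Return the quantiles of the distribution and plot them.
quantiles = ratings_per_user.quantile(np.arange(0, 1.01, 0.01))
quantiles.plot()
plt.show()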
----End
Similar to user sorting, products can be sorted based on the rating data to obtain products
that have been rated more than 50 times.
Calculate the average rating of each product, and then sort the products based on the
average rating.
# Obtain the rankings of the products sorted by the number of rating times.
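A sketch of these calculations:
# Average rating of each product, sorted in descending order.
mean_rating = data.groupby('productId')['Rating'].mean().sort_values(ascending=False)
# Number of times each product has been rated; keep those rated more than 50 times.
rating_counts = data.groupby('productId')['Rating'].count()
popular = rating_counts[rating_counts > 50]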
The result shows that the product with the highest average rating is rated by 1051 users.
Analyze the product rankings and display the result in a chart. Specifically, first use a
histogram to display the distribution of the number of users who rate each product.
hist(): draws a histogram.
bins: the number of buckets in the histogram.
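A sketch of the histogram call (plt as imported above):
# Distribution of the number of users who rate each product.
data.groupby('productId')['Rating'].count().hist(bins=50)
plt.show()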
Sort the products by the number of users who rate the products, to obtain the product
popularity.
----End
Select 10,000 samples and use pivot_table() to create a table of relationships between
products and users.
You can use the shape attribute to view the table size and then transpose the table. The
data in the table is the product ratings from users.
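A sketch of this step (column names as used earlier in this section):
# Product-user rating matrix built from the first 10,000 samples.
sample = data.head(10000)
ratings_matrix = sample.pivot_table(values='Rating', index='userId',
                                    columns='productId', fill_value=0)
print(ratings_matrix.shape)
ratings_matrix_T = ratings_matrix.T   # rows now correspond to products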
You can use the SVD algorithm to reduce the dimensions of the table to obtain 10
important product-based features.
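A sketch using TruncatedSVD from sklearn (an assumption; any SVD implementation works):
from sklearn.decomposition import TruncatedSVD
# Reduce the product-user matrix to 10 latent product features.
svd = TruncatedSVD(n_components=10)
decomposed = svd.fit_transform(ratings_matrix_T)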
Randomly select a product, select products whose coefficient of correlation with the
selected one is greater than 0.65, and recommend these products to users who like the
selected one.
#Select products whose coefficient of correlation with the 20th product is greater than 0.65.
# Recommend products ranked ahead to the users who like the 20th product.
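A sketch of the selection, continuing from the decomposed matrix above; index 19 addresses the 20th product:
import numpy as np
# Correlation matrix between products in the reduced feature space.
corr_mat = np.corrcoef(decomposed)
# Select products whose correlation with the 20th product is greater than 0.65.
product_ids = ratings_matrix_T.index
target_corr = corr_mat[19]
print(product_ids[(target_corr > 0.65) & (target_corr < 1.0)])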
As shown in the result, there are eight products whose coefficient of correlation with the
20th product (9984984354) is greater than 0.65. You can also select other products to view
their similar products.
----End
3.1 Introduction
Under the impact of the Internet, financial institutions are suffering from internal and
external troubles. On the one hand, they face fierce competition and performance pressure
from large financial and technology enterprises; on the other hand, more and more
criminal groups use artificial intelligence (AI) technologies to increase the efficiency of
their crimes. These risks are hidden in every transaction phase, and if they are not
prevented, the losses will be irreparable. Therefore, financial institutions have increasingly
high requirements on risk management accuracy and approval efficiency.
This experiment discusses the problem and performs practice step by step from the
perspectives of problem statement, breakdown, priority ranking, solution design, key point
analysis, and summary and suggestions, cultivating project implementation thinking and
building the analysis of private credit default prediction from scratch.
3.1.1 Objectives
Upon completion of this task, you will be able to:
Understand the significance of credit default prediction.
Master the development process of big data mining projects.
Master the common algorithms for private credit default prediction.
Understand the importance of data processing and feature engineering.
Master the common methods for data preprocessing and feature engineering.
Master the algorithm principles of logistic regression and XGBoost, and understand
the key parameters.
3.1.2 Background
The case in this document is for reference only. The actual procedure may vary. For details,
see the corresponding product documents.
The company has just set up a project team for private credit default prediction. Engineer
A was appointed as the offline development PM of the project. This project aims to:
Identify high-risk customers efficiently and accurately using new technologies.
Make risk modes data-based by using scientific methods.
Provide objective risk measurement.
Reduce subjective judgments.
Improve risk management efficiency.
3.2 Procedure
3.2.1 Reading Data
First, import the dataset. This document uses the third-party module Pandas to import
the dataset.
import pandas as pd
# Use pd.read_csv to read the dataset. (The dataset is stored in the current directory so that it can be
read directly.)
# ./credit.csv indicates the current directory. The path separator here is the forward slash (/), as in
a directory of the Linux operating system (OS).
# The Windows OS uses the backslash (\) in paths; in the file path here, use the forward slash as in
the Linux OS.
# The forward slash is on the same key on the keyboard as the question mark (?).
# This module can help filter many redundant and annoying warnings.
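A sketch that completes these comments; the warnings module is the standard-library module meant here:
import warnings
# Filter redundant warnings.
warnings.filterwarnings('ignore')
# Read the dataset from the current directory.
data = pd.read_csv('./credit.csv')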
# After data reading, some simple operations can be performed, for example:
# Run the following command to view all data.
data
# Run the following command to view the first 10 rows of data.
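One possible command:
data.head(10)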
# Run the following command to view the length and width of data in the matrix format.
data.shape
# isnull() determines whether a value is null: True is returned if yes, False if not.
# In Python, True equals 1 and False equals 0.
# Therefore, sum() can be used for the judgment: a result greater than 0 means missing values
exist.
# The features with missing values are placed in the missname list.
#fillna() is used to fill empty values with the mode.
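A sketch that completes these comments:
# Collect the features that contain missing values in the missname list.
missname = data.columns[data.isnull().sum() > 0].tolist()
# Fill the empty values of each such feature with its mode.
for name in missname:
    data[name] = data[name].fillna(data[name].mode()[0])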
X_train is the training set, and y_train contains its labels; X_test is the test set, and
y_test contains its labels. test_size=0.1 indicates that the ratio of the training set to the
test set is 9:1. shuffle indicates that the samples are shuffled before splitting.
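A sketch of the split, assuming X and y have been separated from data:
from sklearn.model_selection import train_test_split
# 9:1 split with shuffling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True)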
The standardization function StandardScaler() is declared first. The fit() function then
obtains the standard deviation and mean of the dataset, and transform() is used to
transform the data.
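A sketch of the standardization step:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)                    # learn mean and standard deviation
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)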
# Declare the logistic regression algorithm and set max_iter (the maximum number of iterations)
to 500.
# Split the dataset based on the idea of cross-validation.
# cv=5 indicates that the dataset is split into five equal parts.
# Combine the regularization coefficients with the optimization methods using a dictionary.
# Declare the grid search algorithm and specify the cross-validation method.
# Perform training.
# After the model is loaded, use the model for prediction directly.
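A sketch that ties these comments together; the grid values are illustrative:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Logistic regression with max_iter set to 500.
lr = LogisticRegression(max_iter=500)
# Regularization coefficients and optimization methods in a dictionary.
param_grid = {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
# Grid search with five-fold cross-validation.
grid = GridSearchCV(lr, param_grid, cv=5)
grid.fit(X_train, y_train)
# Use the trained model for prediction directly.
y_pred = grid.predict(X_test)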
4.1 Introduction
4.1.1 About This Lab
This experiment is to predict whether passengers on the Titanic can survive based on the
Titanic datasets.
4.1.2 Objectives
Upon completion of this task, you will be able to:
Use the Titanic datasets open to the Internet as the model input data.
Build, train, and evaluate machine learning models.
Understand the overall process of building a machine learning model.
4.2 Procedure
4.2.1 Importing Related Libraries
import pandas as pd
import numpy as np
import random as rnd
The data overview helps check whether some data is missing and what the data type is.
The related numeric-type information of the data helps check the average value and other
statistics.
The character-type information helps check the number of types, the type with the
maximum value, and the frequency.
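A sketch of these checks, assuming the training DataFrame is named train:
print(train.describe())               # numeric-type statistics
print(train.describe(include=['O']))  # character-type statistics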
Step 3 Check the survival probability corresponding to each feature based on statistics.
The intuitive data shows that passengers in class 1 cabins are more likely to survive.
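A sketch of such a statistic, using the public Titanic column names Pclass and Survived:
# Survival probability per cabin class.
print(train[['Pclass', 'Survived']].groupby('Pclass').mean())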
The following figure shows the survival probability determined based on the cabin and age.
----End
Process the datasets by using different methods as required. For example, fill the Fare and
Embarked fields, which have few missing values, with the mode.
Delete less significant data. Before this, assign a value to Target first.
Convert some character-type data into numeric-type data for model input. To do so, check
the number of types first.
Find each character-type value and replace it with a numeric value.
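A sketch of this processing; the mappings shown are illustrative:
# Fill Fare and Embarked, which have few missing values, with the mode.
for col in ['Fare', 'Embarked']:
    train[col] = train[col].fillna(train[col].mode()[0])
# Replace each character-type value with a numeric value.
train['Sex'] = train['Sex'].replace({'male': 0, 'female': 1})
train['Embarked'] = train['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2})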
test.csv cannot be used as a training or test set because it does not contain Target.
train.csv contains 891 samples (with Target), which need to be extracted.
----End
The logistic regression algorithm, random forest algorithm, and AdaBoost are used for
training.
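A sketch of the training loop, assuming the split X_train/X_test/y_train/y_test from the previous step:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
for model in (LogisticRegression(max_iter=500),
              RandomForestClassifier(),
              AdaBoostClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))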
----End
5 Linear Regression
5.1 Introduction
5.1.1 About This Lab
This experiment uses the basic Python code and the simplest data to reproduce how a
linear regression algorithm iterates and fits the existing data distribution.
The NumPy and Matplotlib modules are used in the experiment. NumPy is used for
calculation, and Matplotlib is used for drawing.
5.1.2 Objectives
Upon completion of this task, you will be able to:
Be familiar with basic Python statements.
Master the procedure for implementing linear regression.
5.2 Procedure
5.2.1 Preparing Data
Randomly set ten pieces of data, with the data in a linear relationship.
Convert the data into arrays so that multiplication and addition can be applied to it
directly.
Code:
# Import the required modules NumPy for calculation and Matplotlib for drawing.
import numpy as np
import matplotlib.pyplot as plt
#This code is used only for Jupyter Notebook.
%matplotlib inline
Output:
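The data-preparation code is left blank above; a minimal sketch with ten linearly related points:
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3, 5, 7, 9, 11, 13, 15, 17, 19, 21])   # y = 2x + 1
plt.scatter(x, y)
plt.show()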
# The basic linear regression model is wx+b. In this example, the model is ax+b as a two-dimensional
space is used.
# The mean square error loss function is the most commonly used loss function in the linear
regression model.
# The optimization function mainly uses the partial derivatives to update a and b.
Code:
Output:
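The model, loss function, optimizer, and first iteration are left for you to fill in; a minimal sketch follows (lr = 0.01 is an illustrative learning rate):
# Model: y = ax + b in two-dimensional space.
def model(a, b, x):
    return a * x + b

# Mean square error loss function.
def loss_function(a, b, x, y):
    pred = model(a, b, x)
    return 0.5 * ((pred - y) ** 2).mean()

# Optimization: update a and b with the partial derivatives of the loss.
def optimize(a, b, x, y, lr=0.01):
    pred = model(a, b, x)
    da = ((pred - y) * x).mean()
    db = (pred - y).mean()
    return a - lr * da, b - lr * db

a, b = 0.0, 0.0
a, b = optimize(a, b, x, y)                 # one iteration
print(a, b, loss_function(a, b, x, y))
plt.scatter(x, y)
plt.plot(x, model(a, b, x), color='red')
plt.show()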
Step 2 Perform the second iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Step 3 Perform the third iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Step 4 Perform the fourth iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Step 5 Perform the fifth iteration and display the parameter values, loss values, and visualization
effect.
Code:
Output:
Step 6 Perform the 10000th iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
----End
5.3.2 Question 2
Try modifying the learning rate lr. What is its function?
6.1 Introduction
6.1.1 About This Lab
This experiment uses a dataset with a small sample quantity. The dataset includes the
open-source Iris data provided by scikit-learn. The Iris prediction project is a simple
classification model. By using this model, you can understand the basic usage and data
processing methods of the machine learning library sklearn.
According to the preceding code, x is specified as a feature, and y as a label. The dataset
includes a total of 150 samples and four features: sepal length, sepal width, petal length,
and petal width.
Logistic regression is used for modeling first. By default, sklearn's logistic regression
handles multiclass tasks with a one-vs-rest (OvR) or multinomial scheme, depending on
the solver.
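A sketch of the modeling step, assuming the dataset has been split into x_train, x_test, y_train, and y_test:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
print(lr.score(x_test, y_test))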
6.2.4.2 SVM
Use the support vector machine (SVM) for classification. sklearn's SVC handles multiclass
tasks with the one-vs-one (OvO) method internally.
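A sketch of the SVM step under the same split:
from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train, y_train)
print(svc.score(x_test, y_test))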
Three neighbors are set for the k-nearest neighbors algorithm; other numbers of
neighbors can be tried for better accuracy. Therefore, a loop is used to find the optimal
number of neighbors, as shown in the sketch below.
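A sketch of the search loop (the range of 1 to 30 neighbors is illustrative):
from sklearn.neighbors import KNeighborsClassifier
scores = []
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores.append(knn.score(x_test, y_test))
print(scores.index(max(scores)) + 1)   # optimal number of neighbors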
As shown in the figure above, the k-nearest neighbors algorithm has the optimal effect
when there is one nearest neighbor.
After standardization, the standard deviation is 1 and the mean is infinitely close to 0.
Then, use the SVM to perform modeling after the standardization, assigning new names
to the standardized training set and test set.
As described above, the SVM precision is also improved after the standardization.
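A sketch of this comparison, with SVC as imported above:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_s = scaler.fit_transform(x_train)   # new names for the standardized sets
x_test_s = scaler.transform(x_test)
svc = SVC()
svc.fit(x_train_s, y_train)
print(svc.score(x_test_s, y_test))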
7.1 Introduction
Emotion analysis is a classification technology based on natural language processing (NLP)
and is usually used to classify and extract the emotional content of texts. Compared with
correlated recommendation and precision marketing, users prefer to view or listen to the
personal experience and feedback of users of the same type. For example, evaluations from
users who have purchased similar products and comparison results from users who have
used similar products can bring value to both users and enterprises.
This experiment discusses the problem and performs practice step by step from the
perspectives of problem statement, breakdown, priority ranking, solution design, key point
analysis, and summary and suggestions, cultivating project implementation thinking and
building the analysis of the evaluation emotion analysis project from scratch.
7.1.1 Objectives
Upon completion of this task, you will be able to:
Clarify the function and business value of emotion analysis.
Understand the differences between conventional machine learning and deep
learning in emotion analysis methods.
Clarify label extraction methods for emotion analysis.
Master deep learning-based emotion analysis methods.
Understand future applications of emotion analysis.
7.2 Procedure
7.2.1 Data Management
The following information is involved:
Id: ID
reviews.rating: score
reviews.text: text evaluation
reviews.title: evaluation keywords
reviews.username: name of the evaluator
This dataset contains 21 attribute fields and 34,657 data samples. The experiment aims to
analyze customer evaluation data. Therefore, this document describes only the data
attributes required in this experiment.
Step 1 Import common library files such as sklearn, pandas, and numpy.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import nltk.classify.util
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
%matplotlib inline
Visualize the first five rows of data and view the data attribute columns.
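One possible command, assuming the DataFrame is named data:
data.head()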
Output:
Output:
With respect to score processing, this experiment defines data samples with a
reviews.rating value greater than or equal to 4 as positive (pos) and those with a
reviews.rating value less than 4 as negative (neg), and renames the reviews.rating
attribute column senti.
replace(x,y): replaces x with y.
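A sketch of this mapping, assuming the DataFrame is named data:
data['senti'] = data['reviews.rating'] >= 4
# replace(x, y): replace x with y.
data['senti'] = data['senti'].replace([True, False], ['pos', 'neg'])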
Output:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
Text data includes spaces, punctuation marks, and digits. This experiment focuses on
analyzing the (English) text, so information other than letters must be deleted. You can
define a cleanup() function that uses a regular expression to delete non-letter characters,
uses the lower() function to convert uppercase letters to lowercase, and removes
whitespace characters, including '\n', '\r', '\t', and ' '. After apply() is used, the cleaned
reviews.text attribute is saved as the Summary_Clean column.
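A sketch of the cleanup() function described above (re as imported earlier):
def cleanup(text):
    # Keep letters only, convert to lowercase, and collapse whitespace
    # such as '\n', '\r', '\t', and ' '.
    text = re.sub('[^a-zA-Z]', ' ', str(text))
    return ' '.join(text.lower().split())

data['Summary_Clean'] = data['reviews.text'].apply(cleanup)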
Obtain ["Summary_Clean","senti"] from the senti dataset and save it as the split dataset.
Output:
Use 80% of the data in split as the training set through split.sample(), remove the rows
already used in the training set train from split through drop(), and use the remaining
data as the test set test.
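A sketch of the split (the random_state value is illustrative):
train = split.sample(frac=0.8, random_state=1)
test = split.drop(train.index)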
Output:
Convert the data in the training set, test set, and verification set into a list and create
indexes.
Set all words in train["words"] to True and add neg or pos to the end of a sentence based
on the scoring criteria.
Use the trained classifier to attach emotion labels to the test set and verification set,
predicting whether each evaluation in them is positive or negative.
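A sketch of the training and labeling steps in the NLTK format; word_features is an illustrative helper name:
# Mark every word of a sentence as True, the feature format NLTK expects.
def word_features(sentence):
    return dict((word, True) for word in sentence.split())

train_features = [(word_features(s), label)
                  for s, label in zip(train['Summary_Clean'], train['senti'])]
classifier = NaiveBayesClassifier.train(train_features)
# Attach an emotion label to each sentence of the test set.
test['pred'] = [classifier.classify(word_features(s))
                for s in test['Summary_Clean']]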
Output:
Output:
The original verification dataset check does not contain reviews.rating data. As shown in
the preceding figure, whether each evaluation is negative or positive is predicted by the
classifier built on the training set.
Use the CountVectorizer class to perform vectorization, invoke the TfidfTransformer class
to perform preprocessing, construct the term frequency (TF) vector, and calculate the
importance of words. The training set, test set, and verification set are obtained, which are
X_train_tfidf, X_test_tfidf, and checktfidf, respectively.
The main idea of TF-IDF is as follows: if a word or phrase appears frequently in one article
(a high TF) but rarely in other articles, the word or phrase is considered to have a good
class-distinguishing capability. TF-IDF tends to filter out commonly used words and retain
important words.
The CountVectorizer class converts the words in the text into a term-frequency matrix,
and its fit_transform() function counts the occurrences of each word. In general, you can
use CountVectorizer to extract features and then use TfidfTransformer to calculate the
weight of each feature.
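A sketch of the vectorization pipeline; check stands for the verification set mentioned above:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vect = CountVectorizer()
# Count the occurrences of each word in the training set.
X_train_counts = count_vect.fit_transform(train['Summary_Clean'])
# Weight the counts with TF-IDF.
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_counts)
# The test and verification sets reuse the fitted vocabulary and weights.
X_test_tfidf = tfidf.transform(count_vect.transform(test['Summary_Clean']))
checktfidf = tfidf.transform(count_vect.transform(check['Summary_Clean']))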
Output:
Output:
Output:
In comparison, the LR model has higher accuracy than the other two models.
Output:
The classifier accurately provides the positive probability and negative probability of each
sentence.
Output:
----End
8.1 Introduction
8.1.1 About This Lab
This experiment uses a dataset with a small sample quantity. The dataset includes the
open-source Boston housing price data provided by scikit-learn. The Boston housing price
forecast project is a simple regression model. By using this model, you can understand the
basic usage and data processing methods of the machine learning library sklearn.
8.1.2 Objectives
Upon completion of this task, you will be able to:
Use the Boston housing price dataset open to the Internet as the model input data.
Build, train, and evaluate machine learning models.
Understand the overall process of building a machine learning model.
Master the application of machine learning model training, grid search, and
evaluation indicators.
Master the application of related APIs.
8.2 Procedure
8.2.1 Introducing the Dependency
Code:
#Introduce algorithms.
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression, ElasticNet
#Compared with SVC, it is the regression form of SVM.
from sklearn.svm import SVR
#Integrate algorithms.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
Code:
Output:
Feature column names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B'
'LSTAT'], sample quantity: 506, feature quantity: 13, target sample quantity: 506
Code:
Output:
Code:
Output:
----End
Output:
Output:
Code:
'''
'kernel': kernel function
'C': SVR regularization factor
'gamma': kernel coefficient for 'rbf', 'poly', and 'sigmoid', which affects the model performance
'''
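A sketch of the grid search over these parameters (the candidate values are illustrative):
from sklearn.model_selection import GridSearchCV
params = {'kernel': ['rbf', 'poly', 'sigmoid'],
          'C': [0.1, 1, 10],
          'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVR(), params, cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)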
Output:
Code:
Output:
Code:
##Perform visualization.
#Display in a diagram.
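A sketch of such a diagram, assuming y_test and a model's predictions y_pred:
plt.plot(range(len(y_test)), y_test, label='true')
plt.plot(range(len(y_pred)), y_pred, label='predict')
plt.legend()
plt.show()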
Output:
----End
9.1 Introduction
9.1.1 About This Lab
This experiment performs modeling based on the k-means algorithm by using the virtual
dataset automatically generated by sklearn to obtain user categories. It is a clustering
experiment, which can find out the method for selecting the optimal k value and observe
the effect in a visualized manner.
import numpy as np
import matplotlib.pyplot as plt
The built-in tool of sklearn is used to create the virtual data; the generated data follows
a normal distribution. Parameter settings are as follows:
n_samples: set to 2000, indicating that 2000 sample points are set.
centers: set to 2, indicating that the data actually has two centers.
n_features: set to 2, indicating the number of features.
For ease of illustration in the coordinate system, only two features are used.
n_clusters=5: indicates that five data clusters are expected. However, there are only two
data categories.
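A sketch of the data generation and clustering described above:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# 2000 samples, 2 true centers, 2 features; random_state fixes the data.
x, y = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)
# Ask for five clusters although the data has only two categories.
y_pred = KMeans(n_clusters=5).fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y_pred)
plt.show()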
Output:
Different data is generated each time, so the output diagram may differ from that in the
lab. To generate the same data every time, add the random_state parameter during data
generation.
In this example, random_state is set to 3, so the same input always generates the same
data.
In this example, ten features are used to generate data, random_state is set to 30, and
there are three categories in theory.
----End
import random
First, generate two random numbers ranging from 1 to 30 (indicating that the number of
true centers in the data is unknown), and use a random number of features.
Then, perform k-means clustering in a loop, as sketched below. The .inertia_ attribute
returns the sum of squared distances from the sample points to their nearest cluster centers.
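A sketch of the elbow search, continuing from the previous sketch (the ranges are illustrative):
import random
# The true number of centers (1 to 30) and the number of features are random.
x, y = make_blobs(n_samples=2000, centers=random.randint(1, 30),
                  n_features=random.randint(2, 10))
# Record the inertia (sum of squared distances to the centers) for each k.
inertias = []
for k in range(1, 31):
    inertias.append(KMeans(n_clusters=k).fit(x).inertia_)
plt.plot(range(1, 31), inertias)
plt.show()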
The result varies each time due to impact of the random numbers. As shown in the
preceding figure, the turning point appears at the position corresponding to the value 21.
Therefore, 21 is the optimal k value.