DSF - Data Preprocessing

Data Science Fundamental
Data Preprocessing
Pusdiklat - Kementerian Komunikasi dan Informatika #JadiJagoanDigital

1 Introduction to the Tools (RapidMiner)
2 Data Preprocessing
1 Introduction to the Tools (RapidMiner)

Installing the Tools
If you have not installed RapidMiner yet, download and install it from the following link:
https://my.rapidminer.com/nexus/account/index.html#downloads
Choose the installer that matches the operating system of your computer (Windows, Linux, macOS) and check its architecture (32-bit or 64-bit).
2 Data Preprocessing
“Data Preparation is more than half of every data mining process”
Stages of Data Science
Why do we need preprocessing?
Real-world data is dirty
• Incomplete (many missing values)
• Noisy / contains many outliers
• Low quality (inconsistent, inaccurate, etc.)
Some information is hidden within the data
• Information can be extracted from existing data (for example, age can be computed from date of birth)
• Sometimes information in the data has to be made explicit to improve model performance
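The age-from-date-of-birth case can be sketched in a few lines of Python. This is only an illustration: the record fields, the sample dates, and the reference date are made up, and the function name age_in_years is hypothetical.

```python
from datetime import date

def age_in_years(birth_date: date, on_date: date) -> int:
    """Return the number of completed years between birth_date and on_date."""
    years = on_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Hypothetical records; field names and values are illustrative only.
records = [
    {"name": "A", "birth_date": date(1990, 5, 17)},
    {"name": "B", "birth_date": date(2000, 12, 31)},
]
reference = date(2024, 6, 1)

# Make the hidden information explicit as a new feature.
for record in records:
    record["age"] = age_in_years(record["birth_date"], reference)
```

The derived "age" column is what a model would consume; the raw birth date usually cannot be used directly.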
A machine learning model's performance depends on the data
• Some models can only process numerical data
• Some models are sensitive to outliers
• Some models have requirements (the four classical assumptions of linear models; neural networks typically expect inputs scaled to a range such as 0-1)
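One common way to bring a numeric feature into the 0-1 range is min-max scaling. The sketch below is a minimal pure-Python version; the function name and the sample values are assumptions made for illustration, not part of the course material.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20, 35, 50, 80]  # made-up values
scaled = min_max_scale(incomes)
```

Because the minimum and maximum drive the result, a single extreme outlier squeezes all other values together, which is one reason outliers are usually handled before scaling.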
Why can data be dirty?
Missing data
• Respondents do not answer a survey
• Values are not available at data entry
• Data is lost in transit
• Data entry errors
Data noise
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistent naming conventions
Other data problems
• Duplicate records
• Incomplete data
• Inconsistent data
Some Data Pre-Processing Tasks
Data Cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
Feature Extraction
• Derived features
• Encoding
• Binning
• Vectorizer
• Textual and datetime data can generate a lot of features
Feature Transformation
• Normalize / standardize
• Scaling
• Box-Cox transformation (power / log / square root)
Data Reduction
• Dimensionality reduction (select features)
• Numerosity reduction (select rows)
• Data compression
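Two of the tasks listed above, encoding and a log transformation, can be sketched as follows. This is a minimal illustration under assumptions: the helper names, the browser labels, and the visit counts are invented, and log(1 + x) is used here as a simple member of the power/log family rather than a full Box-Cox fit.

```python
import math

def one_hot(values):
    """Encode category labels as 0/1 indicator columns (one column per category)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def log_transform(values):
    """Apply log(1 + x) to compress a right-skewed, non-negative feature."""
    return [math.log1p(v) for v in values]

browsers = ["Chrome", "Firefox", "Chrome", "Safari"]  # made-up categorical feature
categories, encoded = one_hot(browsers)

visits = [0, 9, 99, 999]  # made-up skewed counts
logged = log_transform(visits)
```

Note that one-hot encoding adds one column per distinct category, which is how textual and categorical data can end up generating a lot of features.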
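Data Cleaning
Remove rows / columns
• Remove a column if the number of rows with a missing value in it is large relative to the total number of rows
• Remove the rows with missing values if they are few relative to the total number of rows
Value imputation
• Mode / most frequent value (categorical)
• Mean / median (numerical)
• Random / defined value
Model-based imputation
• Use other features to predict the missing values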
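The removal rules can be sketched as below. The 50% threshold, the column names, and the sample rows are assumptions chosen for illustration; the slides do not specify a cut-off.

```python
MISSING = None  # marker used for a missing value in this sketch

def drop_sparse_columns(rows, max_missing_share=0.5):
    """Drop columns whose share of missing values exceeds max_missing_share."""
    n_rows = len(rows)
    keep = [
        col for col in rows[0]
        if sum(r[col] is MISSING for r in rows) / n_rows <= max_missing_share
    ]
    return [{col: r[col] for col in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Drop every row that still contains at least one missing value."""
    return [r for r in rows if all(v is not MISSING for v in r.values())]

# Made-up data: "fax" is missing in 3 of 4 rows, "age" in 1 of 4.
rows = [
    {"age": 25, "gender": "F", "fax": None},
    {"age": None, "gender": "M", "fax": None},
    {"age": 31, "gender": "M", "fax": None},
    {"age": 47, "gender": "F", "fax": "yes"},
]

without_sparse = drop_sparse_columns(rows)      # removes the "fax" column
cleaned = drop_incomplete_rows(without_sparse)  # removes the row with no age
```

Dropping the sparse column first matters: if rows were dropped first, three of the four rows would be lost because of "fax" alone.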
Missing Data Example
Dataset: MissingDataSet.csv
How do we handle missing data?
Ignore the tuple
Fill in the missing value manually
• Tedious and often infeasible
Fill in the missing value automatically with
• A global constant
• The attribute mean
• The most probable value
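The automatic options can be sketched with the Python standard library. The sample columns are made up, the constant -1 is an arbitrary choice, and the mode is used here only as a simple stand-in for "the most probable value"; a predictive model could be used instead.

```python
from statistics import mean, mode

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # made-up numerical attribute
observed_ages = [v for v in ages if v is not None]

ages_constant = impute(ages, -1)               # a global constant
ages_mean = impute(ages, mean(observed_ages))  # the attribute mean

genders = ["F", "M", None, "F"]  # made-up categorical attribute
observed_genders = [v for v in genders if v is not None]
genders_mode = impute(genders, mode(observed_genders))  # most frequent value
```

The mean and mode are computed from the observed values only, so the missing entries do not influence the fill value.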
MissingDataSet.csv
• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about Internet
users
• The company will use this data to determine what kinds of people are using the
Internet and how the firm may be able to market their services to this group of
users
• To accomplish his assignment, Jerry creates an online survey and places links to
the survey on several popular Web sites
MissingDataSet.csv
• Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his
data needs to be denormalized
• He also notes that some observations in the set are missing values or they appear to
contain invalid values
• Jerry realizes that some additional work on the data needs to take place before analysis
begins.
Preprocessing Exercise with RapidMiner
Handling missing values in the dataset
• using Filter Examples
• using Replace Missing Values
Noisy Data
• Noise: random error or variance in a measured variable

• Incorrect attribute values may be due to:
▪ Faulty data collection instruments
▪ Data entry problems
▪ Data transmission problems
▪ Technology limitations
▪ Inconsistent naming conventions
• Other data problems which require data cleaning:
▪ Duplicate records
▪ Incomplete data
▪ Inconsistent data
How to handle Noisy Data?
Binning
▪ First sort data and partition into (equal-frequency) bins
▪ Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
▪ Smooth by fitting the data into regression functions
Clustering
▪ Detect and remove outliers
Combined computer and human inspection
▪ Detect suspicious values and check by human (e.g., deal with possible outliers)
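The binning approach listed above can be sketched as follows: sort the values, split them into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The price values and the bin size of 3 are illustrative assumptions, and the function names are hypothetical.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up values
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by means pulls every value toward the local average, while smoothing by boundaries keeps the extremes of each bin and only moves the values in between.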
Preprocessing Exercise with RapidMiner
• Clean the noisy data (using Replace, using regex, using Map)
• Import MissingData-Noisy.csv
• Use a regular expression (Replace operator) to change every noisy value in the nominal attributes to “N”
• Import MissingData-Noisy-Multiple.csv
• Use the Replace Missing Values operator to fill in empty values
• Use a regular expression (Replace operator) to change every noisy value in the nominal attributes to “N”
• Use the Map operator to change every entry of Face, FB, and Fesbuk to Facebook
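For readers who want to see the same two ideas outside RapidMiner, here is a rough Python sketch of a regex-based replacement and a value mapping. It rests on assumptions that the exercise does not state: that the valid values of the nominal attribute are "Y" and "N", and that anything else counts as noise. The sample values and function names are made up.

```python
import re

# Assumption: the only valid answers are "Y" and "N"; anything else is noise.
VALID_FLAG = re.compile(r"[YN]")

# Variant spellings from the exercise, mapped to one canonical label.
SITE_ALIASES = {"Face": "Facebook", "FB": "Facebook", "Fesbuk": "Facebook"}

def clean_flag(value):
    """Keep a valid flag as-is; turn any other (noisy) value into "N"."""
    return value if VALID_FLAG.fullmatch(value) else "N"

def normalize_site(value):
    """Map known variant spellings to "Facebook"; leave other values untouched."""
    return SITE_ALIASES.get(value, value)

flags = ["Y", "99", "N", "?"]  # made-up column with noisy entries
sites = ["FB", "Twitter", "Fesbuk", "Facebook", "Face"]  # made-up column

clean_flags = [clean_flag(v) for v in flags]
clean_sites = [normalize_site(v) for v in sites]
```

The regex handles an open-ended set of bad values with one rule, while the mapping is better suited to a short, known list of variants.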
THANK YOU
#JadiJagoanDigital Digital Talent Scholarship digitalent.kominfo DTS_kominfo
