DSF - Data Preprocessing


Data Science Fundamental

Data Preprocessing

Pusdiklat - Kementerian Komunikasi dan Informatika #JadiJagoanDigital


1 Introduction to the Tools (RapidMiner)

2 Data Preprocessing

1 Introduction to the Tools (RapidMiner)


Installing the Tools
If RapidMiner is not yet installed, download and install it from the following link:

https://my.rapidminer.com/nexus/account/index.html#downloads

Choose the installer that matches the operating system of your computer (Windows, Linux, macOS) and check the architecture it uses (32-bit or 64-bit).
2 Data Preprocessing

“Data Preparation is more than half of every data mining process”

Stages of Data Science
Why do we need preprocessing?

Real-world data are dirty

• Incomplete (many missing values)
• Noisy / many outliers
• Low quality (inconsistent, inaccurate, etc.)

Some information is hidden within the data

• Information can be extracted from existing data (age can be computed from date of birth)
• Sometimes information in the data must be made explicit to improve model performance
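The date-of-birth example can be sketched in Python as follows. This is only an illustration: the records, the field names, and the fixed reference day are assumptions and are not taken from the course dataset.

```python
from datetime import date

def age_in_years(birth_date: date, today: date) -> int:
    """Return the number of completed years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Hypothetical records; the field names are illustrative only.
records = [
    {"name": "A", "birth_date": date(1990, 5, 17)},
    {"name": "B", "birth_date": date(2001, 12, 3)},
]
reference_day = date(2024, 6, 1)  # fixed so the result is reproducible

# Make the hidden information explicit by adding a derived "age" field.
for row in records:
    row["age"] = age_in_years(row["birth_date"], reference_day)
```

After this step each record carries an explicit age value that a model can use directly, instead of a raw date.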

Machine learning model performance depends on the data

• Some models can only process numerical data
• Some models are sensitive to outliers
• Some models have requirements (the four classical assumptions of linear models; neural networks generally expect inputs scaled to a small range such as 0-1)
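One common way to bring a numerical attribute into the range 0-1 is min-max scaling, x' = (x - min) / (max - min). The sketch below assumes a small made-up list of values; the function name is hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up example values.
incomes = [20.0, 35.0, 50.0, 80.0]
scaled = min_max_scale(incomes)
```

Min-max scaling is itself sensitive to outliers, because a single extreme value stretches the range and compresses everything else toward 0.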
Why can data be dirty?

Missing data
• Respondents do not answer the survey
• Values are not available at data entry
• Data is lost in transit
• Data entry errors

Data noise
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistent naming conventions

Other data problems
• Duplicate records
• Incomplete data
• Inconsistent data
Some Data Pre-Processing Tasks

Data Cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies

Features Extraction
• Derived features
• Encoding
• Binning
• Vectorizer
• Textual and datetime data can generate a lot of features

Features Transformation
• Normalize / standardize
• Scaling
• Box-Cox transformation (power / log / square root)

Data Reduction
• Dimensionality reduction (select features)
• Numerosity reduction (select rows)
• Data compression
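As an illustration of the encoding task, the sketch below one-hot encodes a nominal attribute in plain Python. The attribute values are invented for the example, and the helper name is hypothetical.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, vectors

# Invented nominal attribute.
browsers = ["Chrome", "Firefox", "Chrome", "Safari"]
categories, encoded = one_hot(browsers)
```

Each distinct category becomes its own column, which is one reason extracted features can multiply quickly.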
Data Cleaning

Remove Rows / Columns
• Remove a column if it has missing values in a large share of the rows (n missing rows >> n rows)
• Remove the rows with missing values if they are a small share of all rows (n missing rows << n rows)

Value Imputation
• Mode / most frequent value (categorical)
• Mean / median (numerical)
• Random / defined value

Model-based Imputation
• Use other features to predict the missing values
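The remove-rows-or-columns rule can be sketched as follows. The 0.5 threshold, the sample rows, and the function names are assumptions chosen for illustration; a suitable threshold depends on the dataset.

```python
def missing_share_by_column(rows):
    """Return, per column, the fraction of rows whose value is None."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

def drop_sparse_columns(rows, max_missing_share=0.5):
    """Drop every column whose missing share exceeds the threshold."""
    shares = missing_share_by_column(rows)
    keep = [c for c, share in shares.items() if share <= max_missing_share]
    return [{c: r[c] for c in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Drop every row that still contains a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Invented sample: "fax" is missing in 3 of 4 rows, "age" in 1 of 4.
rows = [
    {"age": 25, "gender": "F", "fax": None},
    {"age": None, "gender": "M", "fax": None},
    {"age": 31, "gender": "M", "fax": None},
    {"age": 40, "gender": "F", "fax": "555-0100"},
]
cleaned = drop_incomplete_rows(drop_sparse_columns(rows))
```

Dropping the sparse column first matters here: removing incomplete rows first would discard three of the four rows because of the mostly empty "fax" column.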
Example: Missing Data

Dataset: MissingDataSet.csv
How do we handle missing data?

Ignore the tuple

Fill in the missing value manually
• Tedious and often infeasible

Fill it in automatically with
• A global constant
• The attribute mean
• The most probable value
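The automatic options can be sketched in plain Python as below. The sample values are invented, the constant -1 is an arbitrary placeholder, and the most frequent value is used here as a simple stand-in for "the most probable value", which could also come from a predictive model.

```python
from statistics import mean, mode

def fill_missing(values, replacement):
    """Return a copy of values with every None replaced by replacement."""
    return [replacement if v is None else v for v in values]

# Invented sample attributes with missing entries marked as None.
ages = [25, None, 31, 40]
genders = ["F", "M", None, "M"]

observed_ages = [v for v in ages if v is not None]
observed_genders = [v for v in genders if v is not None]

ages_constant = fill_missing(ages, -1)                      # a global constant
ages_mean = fill_missing(ages, mean(observed_ages))         # the attribute mean
genders_mode = fill_missing(genders, mode(observed_genders))  # most frequent value
```

A global constant is easy to apply, but a model may treat it as a real measurement, so the mean or the most frequent value is often a safer default.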
MissingDataSet.csv

• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about Internet
users
• The company will use this data to determine what kinds of people are using the
Internet and how the firm may be able to market their services to this group of
users
• To accomplish his assignment, Jerry creates an online survey and places links to
the survey on several popular Web sites

• Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his
data needs to be denormalized
• He also notes that some observations in the set are missing values or they appear to
contain invalid values
• Jerry realizes that some additional work on the data needs to take place before analysis
begins.
Preprocessing Exercise with RapidMiner

Removing or replacing missing data

• using the Filter Examples operator
• using the Replace Missing Values operator
Noisy Data

• Noise: random error or variance in a measured variable


• Incorrect attribute values may be due to:
▪ Faulty data collection instruments
▪ Data entry problems
▪ Data transmission problems
▪ Technology limitation
▪ Inconsistency in naming convention
• Other data problems which require data cleaning:
▪ Duplicate records
▪ Incomplete data
▪ Inconsistent data
How to handle noisy data?
Binning
▪ First sort data and partition into (equal-frequency) bins
▪ Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
▪ Smooth by fitting the data into regression functions
Clustering
▪ Detect and remove outliers
Combined computer and human inspection
▪ Detect suspicious values and check by human (e.g., deal with possible outliers)
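The binning method can be sketched as follows: sort the values, split them into equal-frequency bins, then smooth each bin by its mean or by its nearest boundary. The price values and the bin size of 3 are assumptions chosen so the arithmetic is easy to follow.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative values.
prices = [8, 21, 4, 15, 28, 24, 21, 34, 25]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these values the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]; smoothing by means gives 9, 22, and 29 for the three bins.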
Preprocessing Exercise with RapidMiner

• Cleaning noisy data (using Replace, using regex, using Map)
• Import MissingData-Noisy.csv
• Use a regular expression (Replace operator) to change all noisy values in the nominal attributes to “N”

• Import MissingData-Noisy-Multiple.csv
• Use the Replace Missing Values operator to fill in the empty values
• Use a regular expression (Replace operator) to change all noisy values in the nominal attributes to “N”
• Use the Map operator to change every Face, FB, and Fesbuk entry to Facebook
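For readers who want to see the same idea outside RapidMiner, the sketch below is a rough Python analogue of the Replace and Map steps. It is not RapidMiner code. The sample values and the rule that "a value containing no letters is noise" are assumptions, since the actual noisy values in the exercise files are not shown here.

```python
import re

# Assumption: a nominal value made up entirely of non-letters is noise.
NOISY = re.compile(r"^[^A-Za-z]+$")

def replace_noise(value, placeholder="N"):
    """Return the placeholder for noisy values and the value itself otherwise."""
    return placeholder if NOISY.match(value) else value

# Mapping of spelling variants to one canonical label, as in the exercise.
SITE_MAP = {"Face": "Facebook", "FB": "Facebook", "Fesbuk": "Facebook"}

def map_value(value):
    """Return the canonical label if the value is a known variant."""
    return SITE_MAP.get(value, value)

# Invented sample columns.
answers = ["Y", "99", "N", "?", "Y"]
sites = ["FB", "Twitter", "Fesbuk", "Face", "Facebook"]

clean_answers = [replace_noise(a) for a in answers]
clean_sites = [map_value(s) for s in sites]
```

A regular expression suits open-ended noise that follows a pattern, while an explicit mapping suits a short, known list of variants.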
THANK YOU

#JadiJagoanDigital Digital Talent Scholarship digitalent.kominfo DTS_kominfo
