DSF - Data Preprocessing
https://my.rapidminer.com/nexus/account/index.html#downloads
Choose the installer file that matches the operating system on each computer (Windows, Linux, macOS),
and check the architecture in use (32-bit or 64-bit).
Data Preprocessing
“Data Preparation is more than half of every data
mining process”
Stages of Data Science
Why do we need preprocessing?
• Information can be extracted from existing data (for example, age can be computed from date of birth)
• Sometimes information in the data must be made explicit to improve model performance
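The age example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`birth_date`, `age`) and the reference date are hypothetical and do not come from the course dataset.

```python
from datetime import date


def age_on(birth_date: date, today: date) -> int:
    """Return the age in whole years on the given reference date."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not occurred yet this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical records; only the date of birth is stored explicitly.
records = [
    {"name": "A", "birth_date": date(1990, 5, 17)},
    {"name": "B", "birth_date": date(2001, 12, 3)},
]

# Fixed reference date so the result is reproducible.
reference = date(2024, 6, 1)

# Make the derived information explicit as a new attribute.
for record in records:
    record["age"] = age_on(record["birth_date"], reference)
```

Storing `age` as its own attribute gives a model a numeric value it can use directly, instead of a date it would have to interpret.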
Other Data Problems
• Duplicate records
• Incomplete data
• Inconsistent data
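A minimal Python sketch of how these three problems might be detected or repaired. The rows, the column names and the `GENDER_MAP` lookup are all made up for illustration; a real dataset would need its own rules.

```python
# Hypothetical rows showing the three problems.
rows = [
    {"id": 1, "gender": "M", "city": "Jakarta"},
    {"id": 1, "gender": "M", "city": "Jakarta"},  # duplicate record
    {"id": 2, "gender": "male", "city": ""},      # inconsistent and incomplete
    {"id": 3, "gender": "F", "city": "Bandung"},
]

# 1. Duplicate records: keep the first occurrence of each identical row.
seen = set()
unique_rows = []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# 2. Incomplete data: list the ids of rows with an empty or missing value.
incomplete_ids = [
    row["id"] for row in unique_rows
    if any(value in ("", None) for value in row.values())
]

# 3. Inconsistent data: map different spellings to one canonical code.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}
for row in unique_rows:
    normalized = row["gender"].strip().lower()
    row["gender"] = GENDER_MAP.get(normalized, row["gender"])
```

After these steps `unique_rows` holds three rows, `incomplete_ids` points at the row that still needs a value for `city`, and every `gender` value uses the same coding.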
Some Data Preprocessing Tasks
Dataset: Missingdataset.csv
How do we handle missing data?
• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about Internet
users
• The company will use this data to determine what kinds of people are using the
Internet and how the firm may be able to market their services to this group of
users
• To accomplish his assignment, Jerry creates an online survey and places links to
the survey on several popular Web sites
MissingDataSet.csv
• Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his
data needs to be denormalized
• He also notes that some observations in the set are missing values or they appear to
contain invalid values
• Jerry realizes that some additional work on the data needs to take place before analysis
begins.
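As a rough sketch of the kind of work Jerry faces, the Python below counts missing values per column and then shows two common options: filling a numeric gap with the column mean, or dropping incomplete rows. The sample data and the column names (`age`, `online_shopping`) are assumptions made for this example and are not the actual contents of MissingDataSet.csv.

```python
import csv
import io
from statistics import mean

# Hypothetical sample standing in for a survey export; empty cells are missing.
SAMPLE = """age,online_shopping
25,Y
,N
41,
31,Y
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Count missing values per column to see how large the problem is.
missing_counts = {
    column: sum(1 for row in rows if row[column] == "")
    for column in rows[0]
}

# Option 1: replace a missing numeric value with the mean of the known values.
known_ages = [int(row["age"]) for row in rows if row["age"] != ""]
mean_age = round(mean(known_ages))
for row in rows:
    if row["age"] == "":
        row["age"] = str(mean_age)

# Option 2: keep only the rows that have no missing value left.
complete_rows = [row for row in rows if all(value != "" for value in row.values())]
```

Which option is appropriate depends on the attribute and on how many observations are affected; dropping rows is simple but discards data, while imputing keeps the row at the cost of an estimated value.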
Preprocessing Exercise with RapidMiner
• Remove noisy data from the dataset (using Replace, a regular expression, or Map)
• Import the data file MissingData-Noisy.csv
• Use a regular expression (the Replace operator) to replace all noisy data in the nominal
attributes with “N”
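The same idea can be tried outside RapidMiner. The sketch below uses Python's `re` module rather than the Replace operator, and it assumes that the only valid nominal values are "Y" and "N"; that assumption, the sample values and the helper name `clean` are illustrative and not taken from MissingData-Noisy.csv.

```python
import re

# Matches any whole value that is NOT exactly "Y" or "N".
# (?![YN]$) is a negative lookahead that rejects the two valid values.
NOISY = re.compile(r"^(?![YN]$).*$")


def clean(value: str) -> str:
    """Replace a noisy nominal value with "N"; leave valid values unchanged."""
    return NOISY.sub("N", value)


# Hypothetical column containing two noisy entries.
column = ["Y", "N", "99", "?", "Y"]
cleaned = [clean(value) for value in column]
```

Note that this pattern also matches an empty string, so missing values would be turned into "N" as well; handle them separately first if they should be treated differently.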