This checklist can guide you through your Machine Learning projects. There are 8 main steps:

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you (or others) have made so far.
12. Verify assumptions if possible.
1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the
data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside and never look at it (no data snooping!).
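For example, a minimal Scikit-Learn sketch of setting a test set aside, assuming the data is loaded into a pandas DataFrame (the file path and variable names are placeholders):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    housing = pd.read_csv("housing.csv")  # hypothetical dataset path

    # Hold out 20% of the rows as a test set; fixing random_state keeps the
    # split reproducible so the same rows stay hidden across runs.
    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)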
1. Create a copy of the data for exploration (sampling it down to a manageable size
if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
• name
• type (categorical, int/float, bounded/unbounded, text, structured, etc.)
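A minimal pandas sketch of creating an exploration copy and studying each attribute's characteristics (the file path is a placeholder, and matplotlib is assumed for the histograms):

    import matplotlib.pyplot as plt
    import pandas as pd

    # Work on a down-sampled copy so exploration stays fast.
    explore_df = pd.read_csv("housing.csv").sample(frac=0.1, random_state=42)

    explore_df.info()                           # attribute names, dtypes, non-null counts
    print(explore_df.describe())                # summary statistics for numerical attributes
    explore_df.hist(bins=50, figsize=(12, 8))   # distribution of each numerical attribute
    plt.show()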
1. Data cleaning:
• Fix or remove outliers (optional).
• Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns); see the sketch after this list.
• If the data is huge, you may want to sample smaller training sets so you can train
many different models in a reasonable time (be aware that this penalizes complex
models such as large neural nets or random forests).
• Once again, try to automate these steps as much as possible.
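A minimal sketch of the missing-value step using Scikit-Learn's SimpleImputer on a made-up numerical DataFrame (the column names and values are placeholders):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Tiny hypothetical numerical dataset with missing entries.
    housing_num = pd.DataFrame({
        "rooms": [3.0, np.nan, 5.0, 4.0],
        "age":   [20.0, 15.0, np.nan, 40.0],
    })

    imputer = SimpleImputer(strategy="median")  # "mean" or a constant also work
    housing_num_clean = pd.DataFrame(
        imputer.fit_transform(housing_num),
        columns=housing_num.columns,
    )
    print(housing_num_clean)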
1. Train many quick & dirty models from different categories (e.g., linear, naive Bayes, SVM, random forest, neural net, etc.) using standard parameters.
2. Measure and compare their performance.
• For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds (see the sketch after this list).
• You will want to use as much data as possible for this step, especially as you move
towards the end of fine-tuning.
• As always, automate what you can.
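A minimal sketch of training a few quick & dirty models and comparing them with 10-fold cross-validation; a toy Scikit-Learn dataset stands in for the real data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)  # toy data standing in for the real set

    models = {
        "logistic regression": make_pipeline(StandardScaler(),
                                             LogisticRegression(max_iter=1000)),
        "naive Bayes": GaussianNB(),
        "random forest": RandomForestClassifier(random_state=42),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
        print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")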
2. Try ensemble methods. Combining your best models will often perform better
than running them individually.
3. Once you are confident about your final model, measure its performance on the
test set to estimate the generalization error.
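A minimal sketch of these last two items (a voting ensemble, then a single final measurement on a held-out test set), again with a toy dataset standing in for the real data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import train_test_split  # placeholder: see correct import below
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    voting_clf = VotingClassifier(
        estimators=[
            ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
            ("rf", RandomForestClassifier(random_state=42)),
        ],
        voting="soft",  # average the predicted class probabilities
    )
    voting_clf.fit(X_train, y_train)

    # Touch the test set only once, at the very end, to estimate generalization error.
    print("test accuracy:", voting_clf.score(X_test, y_test))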
3. Retrain your models on a regular basis on fresh data (automate as much as possible).
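A minimal sketch of an automated retraining step, assuming the production model is stored with joblib; the file paths and the label column name are placeholders:

    import joblib
    import pandas as pd

    # Hypothetical paths and column name; in practice this script would be scheduled
    # (e.g., with cron or a workflow tool) to run on each fresh batch of labeled data.
    fresh = pd.read_csv("fresh_data.csv")
    X_new, y_new = fresh.drop(columns=["label"]), fresh["label"]

    model = joblib.load("model.joblib")   # current production model or pipeline
    model.fit(X_new, y_new)               # refit on the fresh data
    joblib.dump(model, "model.joblib")    # replace the stored model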