Learning Data Mining with Python
Table of Contents
Learning Data Mining with Python
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started with Data Mining
Introducing data mining
Using Python and the IPython Notebook
Installing Python
Installing IPython
Installing scikit-learn
A simple affinity analysis example
What is affinity analysis?
Product recommendations
Loading the dataset with NumPy
Implementing a simple ranking of rules
Ranking to find the best rules
A simple classification example
What is classification?
Loading and preparing the dataset
Implementing the OneR algorithm
Testing the algorithm
Summary
2. Classifying with scikit-learn Estimators
scikit-learn estimators
Nearest neighbors
Distance metrics
Loading the dataset
Moving towards a standard workflow
Running the algorithm
Setting parameters
Preprocessing using pipelines
An example
Standard preprocessing
Putting it all together
Pipelines
Summary
3. Predicting Sports Winners with Decision Trees
Loading the dataset
Collecting the data
Using pandas to load the dataset
Cleaning up the dataset
Extracting new features
Decision trees
Parameters in decision trees
Using decision trees
Sports outcome prediction
Putting it all together
Random forests
How do ensembles work?
Parameters in Random forests
Applying Random forests
Engineering new features
Summary
4. Recommending Movies Using Affinity Analysis
Affinity analysis
Algorithms for affinity analysis
Choosing parameters
The movie recommendation problem
Obtaining the dataset
Loading with pandas
Sparse data formats
The Apriori implementation
The Apriori algorithm
Implementation
Extracting association rules
Evaluation
Summary
5. Extracting Features with Transformers
Feature extraction
Representing reality in models
Common feature patterns
Creating good features
Feature selection
Selecting the best individual features
Feature creation
Principal Component Analysis
Creating your own transformer
The transformer API
Implementation details
Unit testing
Putting it all together
Summary
6. Social Media Insight Using Naive Bayes
Disambiguation
Downloading data from a social network
Loading and classifying the dataset
Creating a replicable dataset from Twitter
Text transformers
Bag-of-words
N-grams
Other features
Naive Bayes
Bayes' theorem
Naive Bayes algorithm
How it works
Application
Extracting word counts
Converting dictionaries to a matrix
Training the Naive Bayes classifier
Putting it all together
Evaluation using the F1-score
Getting useful features from models
Summary
7. Discovering Accounts to Follow Using Graph Mining
Loading the dataset
Classifying with an existing model
Getting follower information from Twitter
Building the network
Creating a graph
Creating a similarity graph
Finding subgraphs
Connected components
Optimizing criteria
Summary
8. Beating CAPTCHAs with Neural Networks
Artificial neural networks
An introduction to neural networks
Creating the dataset
Drawing basic CAPTCHAs
Splitting the image into individual letters
Creating a training dataset
Adjusting our training dataset to our methodology
Training and classifying
Back propagation
Predicting words
Improving accuracy using a dictionary
Ranking mechanisms for words
Putting it all together
Summary
9. Authorship Attribution
Attributing documents to authors
Applications and use cases
Attributing authorship
Getting the data
Function words
Counting function words
Classifying with function words
Support vector machines
Classifying with SVMs
Kernels
Character n-grams
Extracting character n-grams
Using the Enron dataset
Accessing the Enron dataset
Creating a dataset loader
Putting it all together
Evaluation
Summary
10. Clustering News Articles
Obtaining news articles
Using a Web API to get data
Reddit as a data source
Getting the data
Extracting text from arbitrary websites
Finding the stories in arbitrary websites
Putting it all together
Grouping news articles
The k-means algorithm
Evaluating the results
Extracting topic information from clusters
Using clustering algorithms as transformers
Clustering ensembles
Evidence accumulation
How it works
Implementation
Online learning
An introduction to online learning
Implementation
Summary
11. Classifying Objects in Images Using Deep Learning
Object classification
Application scenario and goals
Use cases
Deep neural networks
Intuition
Implementation
An introduction to Theano
An introduction to Lasagne
Implementing neural networks with nolearn
GPU optimization
When to use GPUs for computation
Running our code on a GPU
Setting up the environment
Application
Getting the data
Creating the neural network
Putting it all together
Summary
12. Working with Big Data
Big data
Application scenario and goals
MapReduce
Intuition
A word count example
Hadoop MapReduce
Application
Getting the data
Naive Bayes prediction
The mrjob package
Extracting the blog posts
Training Naive Bayes
Putting it all together
Training on Amazon's EMR infrastructure
Summary
A. Next Steps…
Chapter 1 – Getting Started with Data Mining
Scikit-learn tutorials
Extending the IPython Notebook
Chapter 2 – Classifying with scikit-learn Estimators
Scalability with the nearest neighbor
More complex pipelines
Comparing classifiers
Chapter 3 – Predicting Sports Winners with Decision Trees
More on pandas
More complex features
Chapter 4 – Recommending Movies Using Affinity Analysis
New datasets
The Eclat algorithm
Chapter 5 – Extracting Features with Transformers
Adding noise
Vowpal Wabbit
Chapter 6 – Social Media Insight Using Naive Bayes
Spam detection
Natural language processing and part-of-speech tagging
Chapter 7 – Discovering Accounts to Follow Using Graph Mining
More complex algorithms
NetworkX
Chapter 8 – Beating CAPTCHAs with Neural Networks
Better (worse?) CAPTCHAs
Deeper networks
Reinforcement learning
Chapter 9 – Authorship Attribution
Increasing the sample size
Blogs dataset
Local n-grams
Chapter 10 – Clustering News Articles
Evaluation
Temporal analysis
Real-time clusterings
Chapter 11 – Classifying Objects in Images Using Deep Learning
Keras and Pylearn2
Mahotas
Chapter 12 – Working with Big Data
Courses on Hadoop
Pydoop
Recommendation engine
More resources
Index
Learning Data Mining with Python
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2015
Production reference: 1230715
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-605-3
www.packtpub.com
Credits
Author
Robert Layton
Reviewers
Asad Ahamad
P Ashwin
Christophe Van Gysel
Edward C. Delaporte V
Commissioning Editor
Taron Pereira
Acquisition Editor
James Jones
Content Development Editor
Siddhesh Salvi
Technical Editor
Naveenkumar Jain
Copy Editors
Roshni Banerjee
Trishya Hajare
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Priya Sane
Graphics
Sheetal Aute
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Author
Robert Layton has a PhD in computer science and has been an avid Python programmer for many years. He has worked closely with some of the largest companies in the world on data mining applications for real-world data and has also been published extensively in international journals and conferences. He has extensive experience in cybercrime and text-based data analytics, with a focus on behavioral modeling, authorship analysis, and automated open source intelligence. He has contributed code to a number of open source libraries, including the scikit-learn library used in this book, and was a Google Summer of Code mentor in 2014. Robert runs a data mining consultancy company called dataPipeline, providing data mining and analytics solutions to businesses in a variety of industries.
About the Reviewers
Asad Ahamad is a data enthusiast and loves to work on data to solve challenging problems.
He earned his master's degree in industrial mathematics with computer application from Jamia Millia Islamia, New Delhi. He admires mathematics and always tries to apply it to deliver the maximum benefit to businesses.
He has good experience working in data mining, machine learning, and data science and has worked for various multinationals in India. He mainly uses R and Python to perform data wrangling and modeling. He is fond of using open source tools for data analysis.
He is an active social media user. Feel free to connect with him on Twitter at @asadtaj88.
P Ashwin is a Bangalore-based engineer who wears many different hats depending on the occasion. He graduated from IIIT, Hyderabad in 2012 with an M Tech in computer science and engineering. He has a total of 5 years of experience in the software industry, where he has worked in different domains such as testing, data warehousing, replication, and automation. He is very well versed in DB concepts, SQL, and scripting with Bash and Python. He has earned professional certifications in products from Oracle, IBM, Informatica, and Teradata. He's also an ISTQB-certified tester.
In his free time, he volunteers in different technical hackathons or social service activities. He was introduced to Raspberry Pi in one of the hackathons and he's been hooked on it ever since. He writes a lot of code in Python, C, C++, and Shell on his Raspberry Pi B+ cluster. He's currently working on creating his own Beowulf cluster of 64 Raspberry Pi 2s.
Christophe Van Gysel is pursuing a doctorate degree in computer science at the University of Amsterdam under the supervision of Maarten de Rijke and Marcel Worring. He has interned at Google, where he worked on large-scale machine learning and automated speech recognition. During his internship in Facebook's security infrastructure team, he worked on information security and implemented measures against compression side-channel attacks. In the past, he was active as a security researcher. He discovered and reported security vulnerabilities in the web services of Google, Facebook, Dropbox, and PayPal, among others.
Edward C. Delaporte V leads a software development group at the University of Illinois, and he has contributed to the documentation of the Kivy framework. He is thankful to all those whose contributions to the open source community made his career possible, and he hopes this book helps continue to attract enthusiasts to software development.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
If you have ever wanted to get into data mining, but didn't know where to start, I've written this book with you in mind.
Many data mining books are highly mathematical, which is great when you are coming from such a background, but I feel they often miss the forest for the trees: they focus so much on how the algorithms work that they forget why we are using these algorithms.
In this book, my aim has been to serve those who can program and want to learn data mining. By the end of it, you should have a good understanding of the basics, some best practices for jumping into solving problems with data mining, and some pointers on the next steps you can take.
Each chapter in this book introduces a new topic, algorithm, and dataset. For this reason, it can be a bit of a whirlwind tour, moving quickly from topic to topic. However, for each of the chapters, think about how you can improve upon the results presented in the chapter. Then, take a shot at implementing it!
One of my favorite quotes is from Shakespeare's Henry IV:
But will they come when you do call for them?
Before this quote, a character is claiming to be able to call spirits. In response, Hotspur points out that anyone can call spirits, but what matters is whether they actually come when they are called.
In much the same way, learning data mining is about performing experiments and getting the result. Anyone can come up with an idea to create a new data mining algorithm or improve upon an experiment's results. However, what matters is: can you build it and does it work?
What this book covers
Chapter 1, Getting Started with Data Mining, introduces the technologies we will be using, along with implementing two basic algorithms to get started.
Chapter 2, Classifying with scikit-learn Estimators, covers classification, which is a key form of data mining. You'll also learn about some structures that make your data mining experimentation easier to perform.
Chapter 3, Predicting Sports Winners with Decision Trees, introduces two new algorithms, Decision Trees and Random Forests, and uses them to predict sports winners by creating useful features.
Chapter 4, Recommending Movies Using Affinity Analysis, looks at the problem of recommending products based on past experience and introduces the Apriori algorithm.
Chapter 5, Extracting Features with Transformers, introduces different types of features you can create and how to work with different datasets.
Chapter 6, Social Media Insight Using Naive Bayes, uses the Naive Bayes algorithm to automatically parse text-based information from the social media website, Twitter.
Chapter 7, Discovering Accounts to Follow Using Graph Mining, applies cluster and network analysis to find good people to follow on social media.
Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting information from images and then training neural networks to find words and letters in those images.
Chapter 9, Authorship Attribution, looks at determining who wrote a given document, by extracting text-based features and using support vector machines.
Chapter 10, Clustering News Articles, uses the k-means clustering algorithm to group together news articles based on their content.
Chapter 11, Classifying Objects in Images Using Deep Learning, determines what type of object is being shown in an image, by applying deep neural networks.
Chapter 12, Working with Big Data, looks at workflows for applying algorithms to big data and how to get insight from it.
Appendix, Next Steps…, goes through each chapter, giving hints on where to go next for a deeper understanding of the concepts introduced.
What you need for this book
It should come as no surprise that you'll need a computer, or access to one, to complete this book. The computer should be reasonably modern, but it doesn't need to be overpowered. Any modern processor (from about 2010 onwards) and 4 GB of RAM will suffice, and you can probably run almost all of the code on a slower system too.
The exception here is the final two chapters. In these chapters, I step through using Amazon Web Services (AWS) to run the code. This will probably cost you some money, but the advantage is less system setup than running the code locally. If you don't want to pay for those services, the tools used can all be set up on a local computer, but you will definitely need a modern system to run them: a processor built in 2012 or later and more than 4 GB of RAM are necessary.
I recommend the Ubuntu operating system, but the code should work well on Windows, OS X, or another Linux variant. You may need to consult the documentation for your system to get some things installed, though.
In this book, I use pip to install code, which is a command-line tool for installing Python libraries. Another option is to use Anaconda, which can be found online here: http://continuum.io/downloads.
I have also tested all code using Python 3. Most of the code examples work on Python 2, with no changes. If you run into any problems and can't get around them, send an email and we can offer a solution.
Who this book is for
This book is for programmers who want to get started in data mining in an application-focused manner.
If you haven't programmed before, I strongly recommend that you learn at least the basics before you get started. This book doesn't introduce programming, nor does it give too much time to explain the actual implementation (in code) of how to type out the instructions. That said, once you go through the basics, you should be able to come back to this book fairly quickly—there is no need to be an expert programmer first!
I highly recommend that you have some Python programming experience. If you don't, feel free to jump in, but you might want to take a look at some Python code first, possibly focusing on tutorials using the IPython Notebook. Writing programs in the IPython Notebook works a little differently than other methods such as writing a Java program in a fully fledged IDE.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
The most important is code. Code that you need to enter is displayed separate from the text, in a box like this one:
if True:
    print("Welcome to the book")
Keep a careful eye on indentation. Python cares about how much lines are indented. In this book, I've used four spaces for indentation. You can use a different number (or tabs), but you need to be consistent. If you get a bit lost counting indentation levels, reference the code bundle that comes with the book.
Where I refer to code in text, I'll use this format. You don't need to type this in your IPython Notebooks, unless the text specifically states otherwise.
Any command-line input or output is written as follows:
# cp file1.txt file2.txt
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Click on the Export link.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/6053OS_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Chapter 1. Getting Started with Data Mining
We are collecting information at a scale that has never been seen before in the history of mankind and placing more day-to-day importance on the use of this information in everyday life. We expect our computers to translate Web pages into other languages, predict the weather, suggest books we would like, and diagnose our health issues. These expectations will grow, both in the number of applications and also in the efficacy we expect. Data mining is a methodology that we can employ to train computers to make decisions with data and forms the backbone of many high-tech systems of today.
The Python language is growing quickly in popularity, and for good reason. It gives the programmer a lot of flexibility; it has a large number of modules to perform different tasks; and Python code is usually more readable and concise than that of most other languages. There is a large and active community of researchers, practitioners, and beginners using Python for data mining.
In this chapter, we will introduce data mining with Python. We will cover the following topics:
What is data mining and where can it be used?
Setting up a Python-based environment to perform data mining
An example of affinity analysis, recommending products based on purchasing habits
An example of a classic classification problem, predicting the species of a plant based on its measurements
Introducing data mining
Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.
Data mining sits at the intersection of algorithms, statistics, engineering, optimization, and computer science. We also use concepts and knowledge from other fields such as linguistics, neuroscience, and town planning. Applying data mining effectively usually requires integrating this domain-specific knowledge with the algorithms.
Most data mining applications work with the same high-level view, although the details often change quite considerably. We start our data mining process by creating a dataset, which describes an aspect of the real world. Datasets comprise two aspects (a short code sketch follows this list):
Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
Features that are descriptions of the samples in our dataset. Features could be the length, frequency of a given word, number of legs, date it was created, and so on.
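In code, a dataset like this is typically stored as a matrix, where each row is a sample and each column is a feature. The following is a minimal sketch using NumPy; the values are invented for illustration and are not taken from any of the book's datasets:

import numpy as np

# Each row is a sample (a person); each column is a feature:
# height in cm, weight in kg, and age in years. All values are invented.
X = np.array([[170, 65, 24],
              [155, 50, 31],
              [183, 80, 19],
              [162, 58, 45]])

print(X.shape)  # (4, 3): four samples, three features each
print(X[0])     # the features of the first sample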
The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.
As a simple example, we may wish the computer to be able to categorize people as "short" or "tall". We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall.
The next step involves tuning our algorithm. As a simple algorithm: if the height is more than x, the person is tall; otherwise, they are short. Our training algorithm will then look at the data and decide on a good value for x. For the preceding dataset, a reasonable value would be 170 cm. Anyone taller than 170 cm is considered tall by the algorithm; anyone else is considered short.
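To make this concrete, here is a minimal sketch of that training idea in NumPy. The heights and labels are invented for illustration, and the threshold found depends entirely on the data you supply:

import numpy as np

# Invented toy data: heights in cm, and whether each person is considered tall.
heights = np.array([155, 162, 168, 174, 180, 191])
is_tall = np.array([False, False, False, True, True, True])

# Try each observed height as a candidate threshold x, keeping the value
# that classifies the most samples correctly.
best_x, best_correct = None, -1
for x in heights:
    predictions = heights > x  # predict tall if strictly taller than x
    n_correct = np.sum(predictions == is_tall)
    if n_correct > best_correct:
        best_x, best_correct = x, n_correct

print(best_x)  # 168 for this toy data: anyone taller is predicted to be tall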
In the preceding dataset, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is an important problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge, or at least some trial and error.
Note
In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to do this. This sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.
Using Python and the IPython Notebook
In this section, we will cover installing Python and the environment that we will use for most of the book, the IPython Notebook. Furthermore, we will install the numpy module, which we will use for the first set of examples.
Installing Python
The Python language is a fantastic, versatile, and easy-to-use language.
For this book, we will be using Python 3.4, which is available for your system from the Python Organization's website: https://www.python.org/downloads/.
There will be two major versions to choose from, Python 3.4 and Python 2.7. Remember to download and install Python 3.4, which is the version tested throughout this book.
In this book, we will be assuming that you have some knowledge of programming and Python itself. You do not need to be an expert with Python to complete this book, although a good level of knowledge will help.
If you do not have any experience with programming, I recommend that you pick up the Learning Python book.
The Python organization also maintains links to two online tutorials for those new to Python:
For nonprogrammers who want to learn programming through the Python language: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
For programmers who already know how to program, but need to learn Python specifically: https://wiki.python.org/moin/BeginnersGuide/Programmers
Note
Windows users will need to set an environment variable in order to use Python from the command line. First, find where Python 3 is installed; the default location is C:\Python34. Next, enter this command into the command line (the cmd program): set PATH=%PATH%;C:\Python34. Remember to change C:\Python34 if Python is installed in a different directory.
Once you have Python running on your system, you should be able to open a command prompt and run the following code:
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, world!")
Hello, world!
>>> exit()
Note that we will be using the dollar sign ($) to denote that a command is to be typed into the terminal (also called a shell or cmd on Windows). You do not need to type this character (or the space that follows it). Just type in the rest of the line and press Enter.
After you have the above "Hello, world!" example running, exit the program and move on to installing a more advanced environment to run Python code, the IPython Notebook.
Note
Python 3.4 includes a program called pip, which is a package manager that helps install new libraries on your system. You can verify that pip is working by running the $ pip3 freeze command, which lists the packages you have installed on your system.
Installing IPython
IPython is a platform for Python development that contains a number of tools and environments for running Python and has more features than the