DVP 2

The document discusses data gathering and cleaning. It notes that data is vital for decision making and strategic planning. There are five main steps for data science processing: 1) data acquisition, 2) data cleaning, 3) exploratory analysis, 4) creating an analysis model, and 5) data visualization. Python provides libraries like Pandas, NumPy, SciPy, and Matplotlib to support data gathering, cleaning, processing, and visualization. Cleaning data is important as data collected from various sources may contain errors, missing values, and noisy data that needs to be addressed.

Uploaded by

padma
Copyright
© All Rights Reserved
Data Gathering and Cleaning

In the 21st century,

• data is vital for decision-making and developing long-term strategic plans.
• Python provides numerous libraries and built-in features that make it easy to support data analysis and processing.

Making business decisions, forecasting weather, studying protein structures in biology, and designing a marketing campaign are all examples that require collecting data and then cleaning, processing, and visualizing it.
There are five main steps for data science processing.

1. Data acquisition is where you read data from various sources of unstructured, semi-structured, or fully structured data, which might be stored in a spreadsheet, comma-separated file, web page, database, etc.

2. Data cleaning is where you remove noisy data and perform the operations needed to keep only the relevant data.

3. Exploratory analysis is where you examine the cleaned data and apply statistical processing suited to the specific purpose of the analysis.

4. Model creation is where an analysis model is built. Advanced tools such as machine learning algorithms can be used in this step.

5. Data visualization is where the results are plotted using the various plotting tools available in Python to help in the decision-making process.
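
As a rough illustration, the first four steps can be sketched with Pandas and NumPy. The inline CSV text, the column names (`day`, `temp_c`), the 100-degree cutoff, and the straight-line model are all made up for this example and are not part of the original material; plotting (step 5) is left out to keep the sketch short.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical raw data: one missing reading and one obviously wrong value.
RAW_CSV = """day,temp_c
1,20.0
2,
3,22.0
4,999.0
5,24.0
"""

# 1. Data acquisition: read the comma-separated text into a DataFrame.
df = pd.read_csv(io.StringIO(RAW_CSV))

# 2. Data cleaning: drop rows with missing values, then drop readings
#    above an assumed plausibility cutoff of 100 degrees Celsius.
clean = df.dropna()
clean = clean[clean["temp_c"] < 100.0]

# 3. Exploratory analysis: a summary statistic of the cleaned column.
mean_temp = clean["temp_c"].mean()

# 4. Analysis model: fit a straight line temp_c = slope * day + intercept.
slope, intercept = np.polyfit(clean["day"], clean["temp_c"], deg=1)

print(clean)
print("mean:", mean_temp)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

In a real project each step would usually be far larger, and the model in step 4 could be replaced by a machine learning algorithm.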
Python provides several libraries for data gathering, cleaning, integration, processing, and visualization.

• Pandas is an open-source Python library that offers powerful data structures for loading, organizing, manipulating, modeling, and analyzing data.

• NumPy is a Python package whose name stands for “Numerical Python”. It is a library consisting of multidimensional array objects and a collection of routines for manipulating arrays. It can be used to perform mathematical, logical, and linear algebra operations on arrays.

• SciPy is another Python library, used for numerical integration and optimization. Like Pandas and NumPy, it is installed separately rather than being built into Python.

• Matplotlib is a Python library used to create 2D graphs and plots. It supports a wide variety of graphs and plots, such as histograms, bar charts, power spectra, and error charts, with additional formatting options such as control of line styles, font properties, axis formatting, and more.
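
The three kinds of NumPy array operations mentioned above (mathematical, logical, and linear algebra) can be illustrated with a small sketch. The array values are arbitrary and chosen only for demonstration.

```python
import numpy as np

# A small two-dimensional array (2 rows, 2 columns).
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Mathematical operation: element-wise square root.
roots = np.sqrt(a)

# Logical operation: boolean mask of the elements greater than 2.
mask = a > 2

# Linear algebra: matrix product with the identity matrix, and the determinant.
product = a @ np.eye(2)
det = np.linalg.det(a)

print(a.shape)
print(roots)
print(mask)
print(det)
```
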
Cleaning Data

Data is collected and entered manually or automatically from various sources, such as weather sensors, financial stock market data servers, and users’ online commercial preferences.

Collected data is rarely error-free and usually contains missing data points and erroneously entered values. For instance, online users might not want to enter their information because of privacy concerns. Therefore, treating missing values (NA or NaN) and noisy data is important for any data analysis.
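
A minimal sketch of how such gaps surface in Pandas, assuming a hypothetical sign-up table in which some users left fields blank. The table, its column names, and the two treatments shown (dropping rows, or filling with the column mean and a placeholder) are illustrative choices, not a prescribed method.

```python
import io

import pandas as pd

# Hypothetical sign-up data: some users left the age or city field empty.
RAW_CSV = """user,age,city
ann,34,Paris
bob,,Oslo
cy,29,
"""

users = pd.read_csv(io.StringIO(RAW_CSV))

# Empty fields are loaded as NaN; isna() marks them with True.
missing_per_column = users.isna().sum()

# Two simple treatments: drop incomplete rows, or fill the gaps
# (here with the mean age and the placeholder string "unknown").
complete_rows = users.dropna()
filled = users.fillna({"age": users["age"].mean(), "city": "unknown"})

print(missing_per_column)
print(complete_rows)
print(filled)
```

Which treatment is appropriate depends on the analysis: dropping rows loses data, while filling values can bias the results.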
Checking for Missing Values
