DVP 2
1. Data acquisition is where you read data from various sources, whether unstructured,
semi-structured, or fully structured, such as a spreadsheet, comma-separated file,
web page, or database.
2. Data cleaning is where you remove noisy data and perform the operations needed to keep only the relevant data.
3. Exploratory analysis is where you examine the cleaned data and apply statistical processing suited to the
specific purpose of the analysis.
4. Modeling is where you build an analysis model. Advanced tools such as machine learning algorithms can be used in this step.
5. Data visualization is where the results are plotted using the plotting tools available in Python to support the
decision-making process.
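The five steps above can be sketched end to end in a few lines. This is only a minimal illustration under stated assumptions: the hours/score data is made up, an in-memory string stands in for a real file, and a straight-line fit with np.polyfit stands in for whatever model an actual analysis would use.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 1. Acquisition: read comma-separated data (hypothetical values, one score missing).
raw_csv = io.StringIO(
    "hours,score\n"
    "1,52\n"
    "2,58\n"
    "3,\n"
    "4,71\n"
    "5,75\n"
)
df = pd.read_csv(raw_csv)

# 2. Cleaning: drop the row whose score is missing.
clean = df.dropna(subset=["score"])

# 3. Exploratory analysis: summary statistics of the cleaned scores.
summary = clean["score"].describe()

# 4. Modeling: fit a straight line, score = slope * hours + intercept.
slope, intercept = np.polyfit(clean["hours"], clean["score"], deg=1)

# 5. Visualization: plot the observations and the fitted line.
fig, ax = plt.subplots()
ax.scatter(clean["hours"], clean["score"], label="observed")
ax.plot(clean["hours"], slope * clean["hours"] + intercept, label="linear fit")
ax.set_xlabel("hours")
ax.set_ylabel("score")
ax.legend()
fig.savefig(io.BytesIO(), format="png")  # written to a buffer; pass a file name to keep it
```

In an interactive session, plt.show() would typically replace the savefig call, and the Agg backend line could be dropped.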
Python provides several libraries for data gathering, cleaning, integration, processing, and
visualization.
• Pandas is an open-source Python library used to load, organize, manipulate, model, and analyze data by
offering powerful data structures.
• NumPy is a Python package whose name stands for “Numerical Python.” It is a library consisting of multidimensional array objects and
a collection of routines for manipulating arrays. It can be used to perform mathematical, logical,
and linear algebra operations on arrays.
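• SciPy is another open-source Python library, built on NumPy, for scientific computing tasks such as numerical integration and optimization. It is not part of the standard library and must be installed separately.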
• Matplotlib is a Python library used to create 2D graphs and plots. It supports a wide variety of graphs and plots
such as histograms, bar charts, power spectra, error charts, and so on, with additional formatting options such as control
of line styles, font properties, axis formatting, and more.
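As a rough illustration of what each library in the list contributes, the sketch below uses one small 2x2 array throughout. The array values and the functions being integrated and minimized are arbitrary choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import integrate, optimize

# NumPy: a multidimensional array with mathematical, logical, and linear algebra operations.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
doubled = a * 2              # element-wise arithmetic
mask = a > 2                 # logical operation, gives a boolean array
product = a @ a              # matrix product
det = np.linalg.det(a)       # determinant

# SciPy: integrate x**2 over [0, 1] and minimize (x - 3)**2.
area, _abs_err = integrate.quad(lambda x: x**2, 0.0, 1.0)
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# pandas: a labeled table built from the NumPy array.
df = pd.DataFrame(a, columns=["x", "y"])
column_means = df.mean()

# Matplotlib: a histogram of the array values with a formatted title.
fig, ax = plt.subplots()
counts, bin_edges, _patches = ax.hist(a.ravel(), bins=2)
ax.set_title("Values in a", fontsize=12)
```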
Cleaning Data
Collected data is rarely error-free and usually contains missing data points and
erroneously entered values. For instance, online users might not want to enter their
information because of privacy concerns. Therefore,
treating missing data (represented as NA or NaN) and noisy data is important in any data
analysis process.
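To make the NA/NaN representation concrete, here is a small sketch of how blank fields surface in pandas. The survey rows are hypothetical, and an in-memory string stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical survey data: Ben left his age blank, Cleo left her city blank.
raw_csv = io.StringIO(
    "name,age,city\n"
    "Ana,34,Lisbon\n"
    "Ben,,Oslo\n"
    "Cleo,29,\n"
)
df = pd.read_csv(raw_csv)

# Blank fields are loaded as missing values (shown as NaN when printed).
print(df)

# One simple treatment: keep only the rows that have no missing fields.
complete = df.dropna()
```

Note that the age column is read as floating point rather than integer, because NaN is a floating-point value.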
Checking for Missing Values