Module 3 - Data Science


MODULE 3:

Data Science

Learning Competencies
3.1. Understand the data, problems, and tools that data analysts use.
3.2. Become familiar with the programming languages used in Data Science.
3.3. Identify what each tool is used for, what programming languages they can
execute, their features, and their limitations.
3.4. Introduce relational database concepts and learn and apply foundational
knowledge of the Python, R, and SQL languages.


A Data Scientist is responsible for extracting, manipulating, pre-processing, and generating
predictions out of data. To do so, they require various statistical tools and programming
languages.
Data Science has emerged as one of the most popular fields of the 21st century. Companies
employ Data Scientists to help them gain insights about the market and to improve their
products.
Data Scientists work as decision-makers and are primarily responsible for analyzing and
handling large amounts of unstructured and structured data. To do this work, they rely on a
range of tools and programming languages built for Data Science. This module goes through
some of the data science tools used to analyze data and generate predictions.

Programming forms the backbone of Software Development. Data Science is an
agglomeration of several fields including Computer Science. It involves the usage of
scientific processes and methods to analyze and draw conclusions from the data.
Specific programming languages designed for this role carry out these methods. While most
languages cater to software development, programming for Data Science differs because it
helps the user pre-process, analyze, and generate predictions from the data.
These data-centric programming languages can carry out algorithms suited for the specifics
of Data Science. Therefore, to become a proficient Data Scientist, you must master one of the
following data science programming languages.

TOOLS FOR HANDLING BIG DATA


To truly grasp the meaning behind Big Data, we must understand the basic principles that
define the data as big data. These are known as the 3 V’s of big data:
• Volume
• Variety
• Velocity

I. Volume

As the name suggests, volume refers to the scale and the amount of data. Over the past
decade, as data has grown, technology has also improved. The decrease in computational
and storage costs has made collecting and storing vast amounts of data far easier. The volume
of the data defines whether it qualifies as big data or not.
When data ranges from about 1 GB to around 10 GB, traditional data science tools tend to
work well. The following tools handle data of this size:
a. Microsoft Excel. Excel prevails as the most accessible and
most popular tool for handling small amounts of data. The
maximum number of rows it supports is just over 1 million
(1,048,576), and one sheet can hold up to 16,384 columns at
a time. These limits are quickly reached when the amount
of data is considerable.
b. Microsoft Access. It is a popular tool by Microsoft that is
used for data storage. Users of Microsoft Access can either
design their own database or create one from a readily
available template as per requirements. Smaller databases of
up to 2 GB can be handled smoothly with this tool, but
beyond that it starts to struggle.
c. SQL. SQL is used to access data from relational database
management systems. It is used to define the data in the
database and manipulate it when needed. It is also used to
create views, stored procedures, and functions in a database,
and it allows users to set permissions on tables, procedures,
and views.
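To make the SQL ideas in item c concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and view names are invented for illustration; it defines data, manipulates it, creates a view, and queries it.

    import sqlite3

    # Connect to an in-memory database (no file is created)
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Define the data in the database
    cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

    # Manipulate the data when needed
    cur.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Ana", 52000.0))
    cur.execute("UPDATE employee SET salary = salary * 1.10 WHERE name = ?", ("Ana",))

    # Create a view on top of the table
    cur.execute("CREATE VIEW high_earners AS SELECT name, salary FROM employee WHERE salary > 50000")

    # Retrieve data from the view
    for row in cur.execute("SELECT * FROM high_earners"):
        print(row)

    conn.close()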
If the data is greater than 10 GB, all the way up to storage of more than 1 TB, then the following
tools need to be implemented:
a. Hadoop. It is an open-source distributed framework that
manages data processing and storage for big data. Hadoop
is a framework written in Java, with some code in C and
Shell Script, that works over a collection of simple
commodity hardware to deal with large datasets using a
very basic programming model.
b. Hive. It is a data warehouse built on top of Hadoop. Hive
provides a SQL-like interface to query the data stored in
various databases and file systems that integrate with
Hadoop.

II. Variety

Variety refers to the different types of data that are out there. Data may be of one of two
types – structured and unstructured.
Structured data is data that has been predefined and formatted to a set structure before being
placed in data storage, which is often referred to as schema-on-write (e.g., tabular data such
as an employee table, a payout table, or a loan application table). It is the basis for inventory
control systems and ATMs. It can be human- or machine-generated. Common examples of
machine-generated structured data are weblog statistics and point-of-sale data, such as
barcodes and quantities.
On the other hand, unstructured data does not follow any trend or form. It is data stored in its
native format and not processed until used, known as schema-on-read. It comes in a variety
of file formats, including email, social media posts, presentations, chats, IoT sensor data, and
satellite imagery. Unstructured data is qualitative rather than quantitative; it is more
descriptive and categorical.
The two most common database categories are SQL and NoSQL.
a. Structured Query Language (SQL) is the standard language used for writing
queries in databases. It was approved by ISO (International Organization for
Standardization) and ANSI (American National Standards Institute). SQL is used to
communicate with a database. According to ANSI, it is the standard language for
relational database management systems. SQL statements are used to perform tasks
such as updating a database or retrieving data from a database.
b. Not Only SQL (NoSQL) is an approach to database management that can
accommodate various data models, including key-value, document, columnar, and
graph formats. A NoSQL database generally means that it is non-relational,
distributed, flexible, and scalable. Additional standard NoSQL database features
include the lack of a database schema, data clustering, replication support, and
eventual consistency instead of the typical ACID (atomicity, consistency, isolation,
and durability) transaction consistency of relational and SQL databases. Many
NoSQL database systems are also open source.
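To illustrate the document model that many NoSQL systems use, the small sketch below uses plain Python and the standard json module; the collection and field names are invented. Note that the two records in the same "collection" carry different fields, something a fixed relational schema would not allow.

    import json

    # Two documents in the same collection; they do not share an identical schema
    posts = [
        {"id": 1, "author": "maria", "text": "Hello!", "tags": ["intro"]},
        {"id": 2, "author": "juan", "text": "Photo day", "image_url": "http://example.com/p.jpg"},
    ]

    # Documents are typically stored and exchanged as JSON
    print(json.dumps(posts, indent=2))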

III. Velocity

Velocity is the speed at which the data is captured. This includes both real-time and non-real-
time data. Examples of real-time data being collected are CCTV, stock trading, fraud
detection for credit card transactions, and network data.
The following are the most commonly used data science tools for real-time data:
a. Apache Kafka. Kafka is an open-source tool by Apache. It is used for building real-
time data pipelines. Some of the advantages of Kafka are that it is fault-tolerant, very
fast, and used in production by a large number of organizations. The original use case
for Kafka was to rebuild a user activity tracking pipeline as a set of real-time
publish-subscribe feeds. This means site activity (page views, searches, or other
actions users may take) is published to central topics, with one topic per activity type.
These feeds are available for subscription for a range of use cases, including real-time
processing, real-time monitoring, and loading into Hadoop or offline data warehousing
systems for offline processing and reporting. (A short publish-subscribe sketch in
Python is given after this list.)
b. Apache Storm. This tool by Apache can be used with
almost all programming languages. It can process up to
1 million tuples per second, and it is highly scalable.
Apache Storm is a distributed, fault-tolerant, open-source
computation system. You can use Storm to process streams
of data in real time with Apache Hadoop. Storm solutions
can also provide guaranteed processing of data, with the
ability to replay data that wasn't successfully processed the
first time.
c. Amazon Kinesis Data Streams. Amazon Kinesis Data
Streams (KDS) is a massively scalable and durable real-
time data streaming service. KDS can continuously capture
gigabytes of data per second from hundreds of thousands of
sources such as website clickstreams, database event
streams, financial transactions, social media feeds, IT logs,
and location-tracking events. The data collected is available
in milliseconds to enable real-time analytics use cases such
as real-time dashboards, real-time anomaly detection,
dynamic pricing, and more.
d. Apache Flink. Apache Flink is a framework and
distributed processing engine for stateful computations over
unbounded and bounded data streams. Flink has been
designed to run in all common cluster environments,
perform computations at in-memory speed and at any scale.
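As a rough sketch of the publish-subscribe pattern described for Kafka above, the snippet below assumes the third-party kafka-python package is installed and a broker is running at localhost:9092; the topic name and message are invented, so treat this as an outline rather than a tested recipe.

    from kafka import KafkaProducer, KafkaConsumer  # third-party; pip install kafka-python

    # Publish a page-view event to a topic (one topic per activity type)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", b'{"user": 42, "page": "/home"}')
    producer.flush()

    # Subscribe to the same topic and read events as they arrive
    consumer = KafkaConsumer(
        "page-views",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,  # stop waiting after 10 seconds for this illustration
    )
    for message in consumer:
        print(message.value)  # real-time processing would happen here
        break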

DATA SCIENCE TOOLS ACCORDING TO DOMAIN


Data Science is a broad term in itself. It consists of various domains, and each domain has its
own business importance and complexity. The data science spectrum consists of multiple
domains, which are represented by their relative complexity and the business value they
provide.


Figure 1. Spectrum of Data Science

I. Reporting and Business Intelligence

Tools for this spectrum enable an organization to identify trends and patterns to make crucial
strategic decisions. The types of analysis range from MIS and data analytics to dashboarding.

The commonly used tools in these domains are:

a. Excel. It gives various options, including Pivot tables and charts, that let you analyze
data in double-quick time. This is, in short, the Swiss Army Knife of data
science/analytics tools.

b. QlikView. It enables you to consolidate, search, visualize, and analyze your data
sources with just a few clicks. It is an easy and intuitive tool to learn, which makes it
so popular.

c. Tableau. It is amongst the most popular data visualization tools in the market today.
It is capable of handling large amounts of data and even offers Excel-like calculation
functions and parameters. Tableau is well-liked because of its neat dashboard and
story interface.

d. MicroStrategy. MicroStrategy is an enterprise BI application software. It supports
interactive dashboards, highly detailed reports, ad-hoc queries, automated report
distribution, and Microsoft Office integration. MicroStrategy also supports mobile BI.

e. Power BI. Microsoft Power BI is a suite of business intelligence (BI), reporting, and
data visualization products and services for individuals and teams. Power BI stands
out with streamlined publication and distribution capabilities and integration with
other Microsoft products and services.

f. Google Analytics. Google Analytics is used to track website activity such as session
duration, pages per session, and bounce rate of individuals using the site, along with
information on the traffic source. It can be integrated with Google Ads, with which
users can create and review online campaigns by tracking landing page quality and
conversions (goals). Google Analytics can identify poorly performing pages with
techniques such as funnel visualization, show where visitors came from (referrers),
how long they stayed on the website, and their geographical position. It also provides
more advanced features, including custom visitor segmentation.

II. Predictive Analytics and Machine Learning Tools

This is the domain where the bread and butter of most data scientists comes from. Some of the
problems a data scientist will solve are statistical modeling, forecasting, neural networks, and
deep learning.

The following are the commonly used tools for this domain:

a. Python. Python is an interpreted, object-oriented, high-level programming language
with dynamic semantics. Its high-level built-in data structures, combined with
dynamic typing and dynamic binding, make it very attractive for Rapid Application
Development and for use as a scripting or glue language to connect existing
components. Python's simple, easy-to-learn syntax emphasizes readability and
therefore reduces the cost of program maintenance. Python supports modules and
packages, which encourages program modularity and code reuse. (A small statistical
sketch in Python appears after this list.)

b. R. It is another very commonly used and respected language in data science. R has a
thriving and incredibly supportive community, and it comes with a plethora of
packages and libraries that support most machine learning tasks. R provides a wide
variety of statistical techniques (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, etc.) and graphical techniques,
and is highly extensible. The S language is often the vehicle of choice for research in
statistical methodology, and R provides an open-source route to participation in that
activity.

c. Apache Spark. Spark was open-sourced by UC Berkeley in 2010 and has since
become one of the largest communities in big data. It is known as the Swiss army
knife of big data analytics as it offers multiple advantages such as flexibility, speed,
and computational power. Apache Spark is a unified analytics engine for large-scale
data processing. It provides high-level APIs in Java, Scala, Python, and R, and an
optimized engine that supports general execution graphs. It also supports a rich set of
higher-level tools, including Spark SQL for SQL and structured data processing,
MLlib for machine learning, GraphX for graph processing, and Structured Streaming
for incremental computation and stream processing.

d. Julia. Julia is a high-level, high-performance, dynamic programming language.
While it is a general-purpose language and can be used to write any application, many
of its features are well suited for numerical analysis and computational science.
Distinctive aspects of Julia's design include a type system with parametric
polymorphism in a dynamic programming language, with multiple dispatch as its core
programming paradigm.

e. Jupyter Notebook. The Jupyter Notebook is an open-source web application that
allows you to create and share documents that contain live code, equations,
visualizations, and narrative text. Uses include data cleaning and transformation,
numerical simulation, statistical modeling, data visualization, machine learning, and
much more.

f. SAS. SAS is an integrated software suite for advanced analytics, business
intelligence, data management, and predictive analytics. With SAS, one can access
data in almost any format, including SAS tables, Microsoft Excel tables, and database
files. The user can also manage and manipulate existing data to get the needed data.

g. Statistical Package for Social Sciences (SPSS). It offers advanced statistical
analysis, a vast library of machine learning algorithms, text analysis, and much more.
The SPSS software package was created for the management and statistical analysis
of social science data. It is widely coveted due to its straightforward and English-like
command language and impressively thorough user manual.

h. Matlab. It is a high-performance language for technical computing. It integrates
computation, visualization, and programming in an easy-to-use environment where
problems and solutions are expressed in familiar mathematical notation. Its typical
uses include math and computation; algorithm development; modeling, simulation,
and prototyping; data analysis, exploration, and visualization; scientific and
engineering graphics; and application development, including Graphical User
Interface building.
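As a small taste of statistical modeling in Python (see the note in item a), the sketch below uses only the standard library statistics module; the linear_regression function requires Python 3.10 or newer, and the monthly sales figures are invented for illustration.

    import statistics

    # Monthly sales figures (made-up data)
    months = [1, 2, 3, 4, 5, 6]
    sales = [102, 110, 118, 123, 131, 142]

    print("mean sales:", statistics.mean(sales))
    print("standard deviation:", statistics.stdev(sales))

    # Fit a simple linear trend and forecast the next month (Python 3.10+)
    slope, intercept = statistics.linear_regression(months, sales)
    print("forecast for month 7:", intercept + slope * 7)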

III. Artificial Intelligence Tools

Automated machine learning (AutoML) is the process of applying machine learning (ML)
models to real-world problems using automation. AutoML was proposed as an artificial
intelligence-based solution to the ever-growing challenge of applying machine learning. The
high degree of automation in AutoML allows non-experts to make use of machine learning
models and techniques without requiring them to become experts in machine learning.
Automating the process of applying machine learning end-to-end additionally offers the
advantages of producing simpler solutions, faster creation of those solutions, and models that
often outperform hand-designed models. AutoML has been used to compare the relative
importance of each factor in a prediction model.

The following are the tools for AutoML:

a. AutoKeras. Auto-Keras is an open-source software library for automated machine
learning. Auto-Keras provides functions to automatically search for architectures and
hyperparameters of deep learning models. (A hedged sketch using its structured-data
classifier appears after this list.)

b. Google Cloud AutoML. This tool enables developers with limited machine learning
expertise to train high-quality models specific to their business needs.


c. IBM Watson. IBM Watson is a computer system that answers your questions. It is
based on cognitive computing, which is a mixture of various techniques such as
natural language processing, machine learning, AI, and reasoning. With the help of
IBM Watson, one can integrate artificial intelligence into important business
processes.

d. DataRobot. DataRobot’s Prediction Explanations help you understand the reasons
behind your machine learning model results. The explanations are highly
interpretable, making the model’s predictions easy to explain to anyone, regardless of
data science experience.

e. H2O Driverless AI. H2O Driverless AI is an artificial intelligence (AI) platform for
automatic machine learning. Driverless AI automates some of the most difficult data
science and machine learning workflows, such as feature engineering, model
validation, model tuning, model selection, and model deployment.

f. Amazon Lex. Amazon Lex is a service for building conversational interfaces into
any application using voice and text. Amazon Lex provides the advanced deep
learning functionalities of automatic speech recognition (ASR) for converting speech
to text, and natural language understanding (NLU) to recognize the intent of the text,
to enable you to build applications with highly engaging user experiences and lifelike
conversational interactions.
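To give a feel for how little code AutoML tools expect from the user (see the note in item a), here is a rough sketch based on the AutoKeras structured-data classifier. It assumes the third-party autokeras package (and TensorFlow) is installed, and the dataset is randomly generated, so treat it as an outline under those assumptions rather than a tested recipe.

    import numpy as np
    import autokeras as ak  # third-party; pip install autokeras

    # Tiny made-up dataset: two numeric features and a binary label
    x_train = np.random.rand(100, 2)
    y_train = (x_train[:, 0] + x_train[:, 1] > 1.0).astype(int)

    # AutoKeras searches over architectures and hyperparameters automatically
    clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)
    clf.fit(x_train, y_train, epochs=5)

    # Predict on new data with the best model found
    print(clf.predict(np.array([[0.9, 0.8]])))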

LIBRARIES FOR DATA SCIENCE

A library is a collection of non-volatile resources used by computer programs, often for
software development in computer science. These may include configuration data,
documentation, help data, message templates, pre-written code and subroutines, classes,
values, or type specifications.

Python is one of the most popular languages used by data scientists and software developers
alike for data science tasks. It can predict outcomes, automate tasks, streamline processes,
and offer business intelligence insights.

Below is a line-up of the most important Python libraries for data science tasks, covering
areas such as data processing, modelling, and visualization.


I. Data Mining

Data mining is a process used by companies to turn raw data into useful information. By
using software to look for patterns in large batches of data, businesses can learn more about
their customers to develop more effective marketing strategies, increase sales and decrease
costs. Data mining depends on effective data collection, warehousing, and computer
processing.

Libraries for data mining are as follows:

a. Scrapy. One of the most popular Python data science libraries, Scrapy helps to build
crawling programs (spider bots) that can retrieve structured data from the web – for
example, URLs or contact info. It's a great tool for scraping data used in, for example,
Python machine learning models.
Developers also use it for gathering data from APIs. This full-fledged framework
follows the Don't Repeat Yourself principle in the design of its interface. As a result,
the tool inspires users to write universal code that can be reused for building and
scaling large crawlers.

b. Beautiful Soup. BeautifulSoup is another really popular library for web crawling
and data scraping. If the user wants to collect data that is available on some website
but not via a proper CSV or API, BeautifulSoup can scrape it and arrange it into the
format that is needed.
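A minimal scraping sketch with BeautifulSoup is shown below; it assumes the third-party beautifulsoup4 package is installed, and it parses an HTML string defined in the script itself (a stand-in for a downloaded page), so no network access is needed.

    from bs4 import BeautifulSoup  # third-party; pip install beautifulsoup4

    # A small HTML snippet standing in for a downloaded web page
    html = """
    <html><body>
      <h1>Contact us</h1>
      <a href="https://example.com/about">About</a>
      <a href="mailto:info@example.com">Email us</a>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Pull out structured data: every link's text and URL
    for link in soup.find_all("a"):
        print(link.get_text(), "->", link.get("href"))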

II. Data Processing and Modeling

Data processing is the conversion of data into a usable and desired form. This conversion or
“processing” is carried out using a predefined sequence of operations, either manually or
automatically. Most processing is done using computers and other data processing devices,
and is thus done automatically.

Data modeling is the process of creating a visual representation of either a whole information
system or parts of it to communicate connections between data points and structures. The
goal is to illustrate the types of data used and stored within the system, the relationships
among them, how the data can be grouped and organized, and its formats and attributes.

Libraries for data processing and modeling are as follows:

a. NumPy. NumPy (Numerical Python) is a perfect tool for scientific computing and
performing basic and advanced array operations.
The library offers many handy features for performing operations on n-arrays and
matrices in Python. It helps to process arrays that store values of the same data type
and makes performing math operations on arrays (and their vectorization) easier. In
fact, the vectorization of mathematical operations on the NumPy array type increases
performance and accelerates execution time. (A combined NumPy/pandas/scikit-learn
sketch appears after this list.)

b. SciPy. This useful library includes modules for linear algebra, integration,
optimization, and statistics. Its main functionality was built upon NumPy, so its arrays
make use of this library. SciPy works great for all kinds of scientific programming
projects (science, mathematics, and engineering). It offers efficient numerical routines
such as numerical optimization, integration, and others in submodules. The extensive
documentation makes working with this library really easy.

c. Pandas. Pandas is a library created to help developers work with "labeled" and
"relational" data intuitively. It's based on two main data structures: "Series"
(one-dimensional, like a list of items) and "DataFrames" (two-dimensional, like a
table with multiple columns). Pandas allows converting data structures to DataFrame
objects, handling missing data, adding/deleting columns from a DataFrame, imputing
missing values, and plotting data with histograms or box plots. It's a must-have for
data wrangling, manipulation, and visualization.

d. Keras. Keras is a great library for building neural networks and modeling. It's very
straightforward to use and provides developers with a good degree of extensibility.
The library takes advantage of other packages (Theano or TensorFlow) as its
backends. Moreover, Microsoft integrated CNTK (Microsoft Cognitive Toolkit) to
serve as another backend.

e. SciKit-Learn. This is an industry-standard for data science projects based in Python.
Scikits is a group of packages in the SciPy Stack that were created for specific
functionalities – for example, image processing. Scikit-learn uses the math operations
of SciPy to expose a concise interface to the most common machine learning
algorithms.
Data scientists use it for handling standard machine learning and data mining tasks
such as clustering, regression, model selection, dimensionality reduction, and
classification. Another advantage? It comes with quality documentation and offers
high performance.

f. PyTorch. PyTorch is a framework that is perfect for data scientists who want to
perform deep learning tasks easily. The tool allows performing tensor computations
with GPU acceleration. It's also used for other tasks – for example, for creating
dynamic computational graphs and calculating gradients automatically. PyTorch is
based on Torch, which is an open-source deep learning library implemented in C,
with a wrapper in Lua.

g. TensorFlow. TensorFlow is a popular Python framework for machine learning and
deep learning, which was developed at Google Brain. It's the best tool for tasks like
object identification, speech recognition, and many others. It helps in working with
artificial neural networks that need to handle multiple data sets. The library includes
various layer-helpers (tflearn, tf-slim, skflow), which make it even more functional.
TensorFlow is constantly expanded with new releases, including fixes for potential
security vulnerabilities and improvements in the integration of TensorFlow and GPUs.

h. XGBoost. Use this library to implement machine learning algorithms under the
Gradient Boosting framework. XGBoost is portable, flexible, and efficient. It offers
parallel tree boosting that helps teams resolve many data science problems. Another
advantage is that developers can run the same code on major distributed environments
such as Hadoop, SGE, and MPI.
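The short sketch below (see the note in item a) ties a few of these libraries together: a vectorized NumPy operation, a pandas DataFrame with a missing value imputed, and a scikit-learn linear model. It assumes numpy, pandas, and scikit-learn are installed, and the numbers are made up.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # NumPy: vectorized math on an array, no explicit loop needed
    hours = np.array([1.0, 2.0, 3.0, 4.0])
    print(hours * 60)  # convert hours to minutes in one operation

    # pandas: labeled data with a missing value, imputed with the column mean
    df = pd.DataFrame({"hours": hours, "score": [55.0, 62.0, np.nan, 78.0]})
    df["score"] = df["score"].fillna(df["score"].mean())
    print(df)

    # scikit-learn: fit a simple regression of score on hours studied
    model = LinearRegression().fit(df[["hours"]], df["score"])
    print("predicted score for 5 hours:", model.predict(pd.DataFrame({"hours": [5.0]}))[0])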

III. Data Visualization

Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.

Libraries for data visualization are as follows:

a. Matplotlib. This is a standard data science library that helps to generate data
visualizations such as two-dimensional diagrams and graphs (histograms,
scatterplots, and non-Cartesian coordinate graphs). Matplotlib is one of those plotting
libraries that are really useful in data science projects — it provides an
object-oriented API for embedding plots into applications. (A short plotting sketch
follows this list.)

b. Seaborn. Seaborn is based on Matplotlib and serves as a useful Python machine
learning tool for visualizing statistical models – heat maps and other types of
visualizations that summarize data and depict the overall distributions. When using
this library, you get to benefit from an extensive gallery of visualizations (including
complex ones like time series, joint plots, and violin diagrams).

c. Bokeh. This library is a great tool for creating interactive and scalable visualizations
inside browsers using JavaScript widgets. Bokeh is fully independent of Matplotlib. It
focuses on interactivity and presents visualizations through modern browsers –
similarly to Data-Driven Documents (d3.js). It offers a set of graphs, interaction
abilities (like linking plots or adding JavaScript widgets), and styling.

d. Plotly. This is a web-based tool for data visualization that offers many useful
out-of-the-box graphics – you can find them on the Plot.ly website. The library works
very well in interactive web applications. Its creators are busy expanding the library
with new graphics and features for supporting multiple linked views, animation, and
crosstalk integration.

e. Pydot. This library helps to generate oriented and non-oriented graphs. Written in
pure Python, it serves as an interface to Graphviz. You can easily show the structure
of graphs with the help of this library, which comes in handy when you're developing
algorithms based on neural networks and decision trees.
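Here is a minimal plotting sketch with Matplotlib (see the note in item a). It assumes matplotlib is installed, the data points are invented, and it saves the figure to a file instead of opening a window so it can run anywhere.

    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    # Made-up data for a two-dimensional scatterplot
    hours = [1, 2, 3, 4, 5]
    scores = [52, 60, 65, 71, 80]

    fig, ax = plt.subplots()        # the object-oriented API mentioned above
    ax.scatter(hours, scores)
    ax.set_xlabel("Hours studied")
    ax.set_ylabel("Exam score")
    ax.set_title("A simple Matplotlib scatterplot")
    fig.savefig("scatter.png")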

APPLICATION PROGRAMMING INTERFACE

API is an acronym for Application Programming Interface, which software uses to access
data, server software, or other applications, and it has been around for quite some time. In
layman’s terms, it is a software intermediary that allows two applications to talk to each other.

APIs are versatile and can be used on web-based systems, operating systems, database
systems, and computer hardware. Developers use APIs to make their jobs more efficient by
reusing code from before and only changing the part relevant to the process they want to
improve. A good API makes it easier to create a program because the building blocks are in
place. APIs use defined protocols to enable developers to build, connect and integrate
applications quickly and at scale.

APIs communicate through a set of rules that define how computers, applications, or
machines can talk to each other. The API acts as a middleman between any two devices that
want to connect for a specified task.

A simplified example would be when you sign in to Facebook from your phone, you tell the
Facebook application that you would like to access your account. The mobile application
makes a call to an API to retrieve your Facebook account and credentials. Facebook would
then access this information from one of its servers and return the data to the mobile
application.


An API has three primary elements:

a. Access: who is allowed to ask for data or services.
b. Request: the actual data or service being asked for (e.g., if I give you my current
location from my game, tell me the map around that place). A request has two main
parts:
• Methods: the questions you can ask, assuming you have access (this also
defines the type of responses available).
• Parameters: additional details you can include in the question or response.
c. Response: the data or service returned as a result of your request. (The sketch after
this list walks through one request/response round trip.)
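To connect the three elements to actual code, the sketch below uses the third-party requests library against httpbin.org, a public echo endpoint used here only as a stand-in for a real API; the "city" parameter is invented, and a real API's documentation defines its own methods and parameters.

    import requests  # third-party; pip install requests

    # Request: a method (GET) plus parameters carried with the call
    response = requests.get(
        "https://httpbin.org/get",      # public echo endpoint, used as a stand-in
        params={"city": "Manila"},      # parameters: extra details in the question
        timeout=10,
    )

    # Response: the data returned as a result of the request
    if response.status_code == 200:
        data = response.json()
        print(data["args"])             # httpbin echoes the query parameters back
    else:
        print("Request failed with status", response.status_code)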

Categories of API

A. Web-Based System. A web API is an interface to either a web server or a web
browser. These APIs are used extensively for the development of web applications.
They work at either the server end or the client end. Companies like Google, Amazon,
and eBay all provide web-based APIs.

Some popular examples of web-based APIs are the Twitter REST API, Facebook
Graph API, Amazon S3 REST API, etc.

B. Operating System. There are multiple OS-based APIs that offer the functionality of
various OS features, which can be incorporated when creating Windows or Mac
applications. Some examples of OS-based APIs are Cocoa, Carbon, WinAPI, etc.

C. Database System. Interaction with most databases is done using API calls to the
database. These APIs are defined in a manner that passes the requested data to the
requesting client in a predefined, understandable format.

This makes the process of interacting with databases generalised, thereby enhancing
the compatibility of applications with various databases. They are very robust and
provide a structured interface to the database.

Some popular examples are the Drupal 7 Database API, Drupal 8 Database API, and
Django API.

D. Hardware System. These APIs allow access to the various hardware components of
a system. They are extremely crucial for establishing communication with the
hardware, making possible a range of functions, from the collection of sensor data to
display on your screens.

For example, the Google PowerMeter API allows device manufacturers to build home
energy monitoring devices that work with Google PowerMeter.


Some other examples of hardware APIs are QUANT Electronic, WareNet CheckWare,
OpenVX Hardware Acceleration, CubeSensors, etc.

Types of APIs

A. REST API. This stands for representational state transfer and delivers data using the
lightweight JSON format. Most public APIs use this because of its fast performance,
dependability, and ability to scale by reusing modular components without affecting
the system as a whole. This API gives access to data by using a uniform and
predefined set of operations. REST APIs are based on URLs and the HTTP protocol
and rest on these 6 architectural constraints:

1. Client-Server Based – the client handles the front-end process while the server
handles the back end, and both can be replaced independently of each other.

2. Uniform Interface – defines the interface between client and server and simplifies
the architecture to enable each part to develop separately.

3. Stateless – each request from client to server must be independent and contain all
of the necessary information so that the server can understand and process it
accordingly.

4. Cacheable – maintains cached responses between client and server avoiding any
additional processing.

5. Code-On-Demand – allows client functionality to be extended by downloading and
executing code in the form of applets and scripts. This simplifies clients by reducing
the number of features required to be pre-implemented.

6. Layered System – a client cannot ordinarily tell whether it is connected directly to
the end server or to an intermediary along the way, which allows intermediaries such
as caches and load balancers to be added without changing the client.

B. SOAP. Simple Object Access Protocol is a little more complex than REST because it
requires more upfront information about how it sends its messages. This API has been
around since the late 1990s and uses XML to transfer data. It requires strict rules and
advanced security, which demand more bandwidth.

This protocol does not have the ability to cache, has strict communication rules, and
needs every piece of information about an interaction before any call can be
processed.

C. XML-RPC. Known as Extensible Markup Language – Remote Procedure Call, this
protocol uses a specific XML format to transfer data and is older and simpler than
SOAP. A client performs an RPC by sending an HTTP request to a server that
implements XML-RPC and receives the HTTP response. (A small standard-library
sketch appears after this list.)

D. JSON-RPC is very similar to XML-RPC in that they work the same way except that
this protocol uses JSON instead of XML format. The client is typically software that
calls on a single method of a remote system.
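As an illustration of the remote-procedure-call style described in item C, the sketch below uses only Python's standard library: it starts a tiny XML-RPC server in a background thread and then calls one of its methods over HTTP, the same request/response round trip described above. The port number and the add method are invented for the example.

    import threading
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import ServerProxy

    # A tiny XML-RPC server exposing one remote procedure
    def add(a, b):
        return a + b

    server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    server.register_function(add, "add")
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # The client performs an RPC by sending an HTTP request to the server
    proxy = ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))  # the XML request/response is handled behind the scenes

    server.shutdown()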


IDENTIFICATION.
Directions: Identify what is described in each statement. Write your answer on the space
before the number.

__________1. This refers to the scale and amount of data.
__________2. This type of data does not follow any trend or form.
__________3. It is the process of applying machine learning models to real-world problems
using automation.
__________4. It is a collection of non-volatile resources used by computer programs for
software development.
__________5. It is the graphical representation of information and data.
__________6. This constraint of Rest API allows client functionality to be extended by
downloading and executing code in the form of applets and scripts.
__________7. In this domain of Data Science, analysts identify trends and patterns so as to
make crucial strategic decisions.
__________8. This tool is usually used for predictive analytics. It integrates computation,
visualization, and programming in an easy-to-use environment where
problems and solutions are expressed in familiar mathematical notation.
__________9. It is the process of converting data into usable and desired form.
__________10. It is the process of creating a visual representation of either a whole or part of
information system to communicate connections between data points and
structure.

MULTIPLE CHOICE.
A. Directions: Analyze the questions carefully. Choose the letter of the correct answer. Write
your answer on the space before the number.

____1. Which of the following tools tends to work well with data that has a volume of less
than 10 GB?
A. Hive
B. Hadoop
C. Microsoft Excel
D. Scrapy

____2. Which of the following tools is used for automated machine learning?
A. DataRobot
B. Jupyter Notebook
C. QlikView
D. Python


____3. The following are characteristics of an API, except:
A. An API acts as a middleman between two machines that want to connect with each
other.
B. It is a collection of non-versatile resources used by computer programs.
C. It is versatile and can be used on web-based systems, operating systems, database
systems, and computer hardware.
D. APIs communicate through a set of rules that define how computers talk to each
other.
____4. Which tool is great for creating interactive and scalable visualizations?
A. TensorFlow
B. BeautifulSoup
C. IBM Watson
D. Bokeh

____5. Pandas is a library created to help developers work with ‘labeled’ and ‘relational’ data
intuitively. Which of the following describes the task Pandas is used for?
A. Data Mining
B. Data Processing
C. Data Visualization
D. Data Programming

B. Directions: Choose the word that best completes the analogy. Write the letter of the answer
on the space before the number.

____6. Apache Flink : Velocity as NoSQL : _____
A. Unstructured Data          C. Big Data
B. Variety                    D. Volume

____7. Predictive Analytics : Python as _____ : Microsoft Excel
A. AI Tools                   C. Volume
B. Data Mining                D. Reporting

____8. Structured Data : Payout Table as Unstructured Data : _____
A. Facebook Post              C. Loan Application
B. NoSQL                      D. Barcode

____9. AutoKeras : AI Tools as _____ : _____
A. Predictive Analytics : R   C. Data Processing : NumPy
B. Reporting : Excel          D. Scrapy : Data Mining

____10. Beautiful Soup : Data Mining as _____ : Data Visualization
A. Library                    C. NumPy
B. Seaborn                    D. Pandas


CONCEPTUALIZATION.
Directions: In your Virtual Expo, create 4 concept maps with Data Science languages
(Python, SQL, R, and Jupyter Notebook) in the middle of the diagram. You may copy the
concept maps below or create a different style of concept map. Then, complete the concept
maps by supplying the function or features of the words written in the center of the diagrams.


You will be graded based on the following rubrics:

CONCEPT MAP RUBRICS

Criterion: Content
Exemplary (10): The content is complete, rich, concise, and straightforward. The content is
relevant to the discussed topics and thoroughly answers the questions.
Proficient (8): Content has one or two discrepancies but includes relevant details.
Partially Proficient (5): There are 4-5 missing details. Some extraneous information and
minor gaps are included.
Incomplete (2): There is insufficient detail, or detail is irrelevant and extraneous.

Criterion: Creativity/Visual
Exemplary (10): The concept maps are visually effective. The graphics/images/photographs
used seamlessly relate to the content.
Proficient (8): The concept maps are visually sensible. The graphics/images/photographs
included are appropriate.
Partially Proficient (5): The main theme is still discernible, but the
graphics/images/photographs included are used randomly.
Incomplete (2): Lacks visual clarity. The graphics/images/photographs are distracting from
the content of the expo.

Criterion: Team Collaboration
Exemplary (10): The group establishes and documents clear and formal roles for each
member and distributes the workload equally.
Proficient (8): The group establishes clear and formal roles for each member and distributes
the workload equally.
Partially Proficient (5): The group establishes informal roles for each member. The workload
could be distributed more equally.
Incomplete (2): The group does not establish roles for each member, and the workload is
unequally distributed.




Module Author/Curator: Mrs. Timmy Anne L. Garcia
Template & Layout Designer: Mrs. Jenny P. Macalalad

Identification                               Multiple Choice

1. Volume                                    1. C
2. Unstructured Data                         2. A
3. Automated Machine Learning                3. B
4. Library                                   4. D
5. Data Visualization                        5. B
6. Code-on-Demand                            6. B
7. Reporting and Business Intelligence       7. D
8. Matlab                                    8. A
9. Data Processing                           9. D
10. Data Modeling                            10. B
