A multimodal data set contains multiple types of data.

Example: A
dataset containing text (reviews), images (product photos), and
audio (customer feedback recordings) about products
Predictor variables are independent variables used to predict the
outcome of a dependent variable. Example: Age and education
level predicting income.
Define a Quantile: A quantile is a value that divides a data set into
equal parts, e.g., quartiles divide data into four equal parts.
Difference between descriptive and inferential statistics:
1) Descriptive statistics summarises data, while inferential statistics
uses data to make predictions or inferences about a population.
2) Descriptive statistics deals with actual data, while inferential
statistics applies probability theory.
Point Estimate: A point estimate is a single value calculated from
a sample that is used as an approximation of an unknown
population parameter. For example, the sample mean (x̄) is a point
estimate of the population mean (μ).
Parameter: A parameter is a numerical value that describes a
characteristic of a population. Parameters are usually unknown
because it is often impractical to measure an entire population.
Common parameters include the population mean (μ), population
standard deviation (σ), and population proportion (p).
Statistic: A statistic is a numerical value calculated from a sample
that is used to estimate a corresponding population parameter.
Examples include the sample mean (x̄), sample standard deviation
(s), and sample proportion (p̂).
Sampling Error: Sampling error is the difference between a
sample statistic and the corresponding population parameter due to
the fact that the sample is only a subset of the population.
Interval Estimate: An interval estimate is a range of values within
which a population parameter is likely to fall, with a certain level of
confidence.
Test Statistic: A test statistic is a standardised value computed
from sample data during hypothesis testing. It measures how far the
sample statistic is from the null hypothesis value of the parameter in
standard error units.
What are the factors based on which data quality is measured?
1) Accuracy 2) Completeness 3) Consistency 4) Timeliness 5) Relevance
Define data wrangling: Data wrangling is the process of cleaning,
transforming, and organising raw data into a suitable format for
analysis.
State any two reasons why data can be missing in a data set:
1) Non-response in surveys 2) Data corruption or loss
Data interpolation and its significance in data preprocessing: Data
interpolation is the method of estimating missing data points within
a range of known values. It helps maintain the continuity of data for
analysis.
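As a rough illustration (a minimal sketch assuming pandas and made-up sensor readings, neither of which these notes prescribe), linear interpolation fills gaps using the neighbouring known values:

    # Hypothetical sketch: filling missing readings by linear interpolation
    import numpy as np
    import pandas as pd

    readings = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])  # two missing points
    filled = readings.interpolate(method="linear")            # estimates 12.0 and 14.0
    print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]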
Reasons for the presence of noisy data in a data set:
1)Measurement errors 2)Data entry mistakes 3)Irregularities in data
sources
State the significance of data integration in data preprocessing:
Data integration combines data from different sources into a unified
format, improving data quality and enabling more comprehensive
analysis.
Why is data reduction necessary in data science?
Data reduction helps in reducing the complexity and size of data,
improving computational efficiency, and focusing on relevant data.
State a major drawback of label encoding:
Label encoding can introduce ordinal relationships where none
exist, leading to incorrect model assumptions.
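A small sketch of the drawback in plain Python (the colour values are invented): label encoding assigns arbitrary integers that imply an ordering, whereas one-hot encoding does not:

    # Hypothetical example: colours have no natural order
    colours = ["red", "green", "blue", "green"]

    labels = {c: i for i, c in enumerate(sorted(set(colours)))}
    encoded = [labels[c] for c in colours]
    print(encoded)  # [2, 1, 0, 1] -- implies blue < green < red, which is meaningless

    # One-hot encoding avoids the spurious ordering
    categories = sorted(set(colours))
    one_hot = [[int(c == cat) for cat in categories] for c in colours]
    print(one_hot)  # e.g. "red" becomes [0, 0, 1]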
State a point of difference between feature selection and
feature extraction methods, in dimension reduction:
Feature selection selects a subset of the original features, while
feature extraction creates new features by combining or
transforming existing ones.
Ordinal variables help to differentiate members within a group.
True. Ordinal variables allow ranking but do not quantify the exact
difference between rankings.
A data set representing the IQ scores within a population is an
example of symmetric distribution: True. IQ scores are designed
to follow a normal distribution with symmetry around the mean.
A mean value is easily distorted by the presence of outliers in
the data set: True. Outliers can significantly shift the mean away
from the central value of the data.
Robust scaler is not sensitive to outliers. True, the robust scaler
uses the median and interquartile range to scale data, making it
less sensitive to outliers.
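A minimal sketch of the idea, done by hand with NumPy on invented values rather than with any particular library: centre on the median and divide by the interquartile range:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # 100 is an outlier
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    scaled = (x - median) / (q3 - q1)                 # robust scaling
    print(scaled)  # [-1. -0.5 0. 0.5 48.5] -- the outlier does not distort the scale of the rest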
Data cube aggregation is also known as multidimensional
aggregation. True. Data cube aggregation involves summarising
data across multiple dimensions, making it a form of
multidimensional aggregation.
Social media data, like tweets/reviews, are examples of
network data. False: Social media data, such as tweets or reviews,
are typically considered textual or unstructured data, not network
data. Network data refers to data that represents relationships or
interactions between entities, like social networks, where nodes
represent users and edges represent relationships between them.
Machine-generated data requires highly scalable tools for
analytics. True: Machine-generated data, like sensor data or logs,
often comes in large volumes and requires tools that can scale to
handle the massive amounts of data. Scalable tools (e.g., Hadoop,
Spark) are necessary to process, store, and analyse such
high-volume data efficiently.
A smaller p-value indicates the more likelihood we are to reject
the null hypothesis. True: A smaller p-value indicates stronger
evidence against the null hypothesis. If the p-value is below a
threshold (commonly 0.05), it suggests that the observed data is
unlikely under the null hypothesis, leading to its rejection.
4 V's of Big Data:

Big Data is characterised by the following 4 V's, which describe its key features:

Volume: Refers to the enormous amount of data generated continuously from various sources such as social media, IoT
devices, and business transactions. The volume of data can range
from terabytes to petabytes, and the ability to process such large
datasets is a critical challenge.
Example: Social media platforms like Facebook generate terabytes
of data each day, including user posts, comments, and interactions.

Velocity: Describes the speed at which data is generated, processed, and analysed. In some cases, data must be processed
in real-time to derive insights or make decisions.
Example: In financial markets, stock prices and trades happen at
extremely high speeds, requiring real-time processing to make
informed decisions.

Variety: Represents the different types of data – structured, semi-structured, and unstructured. Big data encompasses diverse
data formats, from numerical data in databases to images, videos,
and social media posts.
Example: Customer data in an e-commerce website might include
structured transaction records, semi-structured product reviews,
and unstructured images of products.

Veracity: Refers to the quality or reliability of the data. With large amounts of data coming from multiple sources, ensuring that the
data is accurate and trustworthy can be challenging.
Example: Sensor data from autonomous vehicles may sometimes
be unreliable due to environmental factors like fog, affecting data
quality.
Inferential Statistics:
Inferential statistics involves using data from a sample to make
generalisations about a larger population. This branch of statistics
allows researchers to make predictions, estimate population
parameters, and test hypotheses. The process typically involves
calculating probabilities and confidence intervals to infer
conclusions from sample data. For example, using a sample of
students' test scores, inferential statistics can help estimate the
average test score for all students in a school district.
Significance of a Confidence Interval:
A confidence interval provides a range of values that is likely to
contain the true population parameter, with a specified level of
confidence. It allows researchers to quantify the uncertainty in their
estimate. For instance, if a sample mean height is 5'8" with a 95%
confidence interval of 5'7" to 5'9", we can be 95% confident that the
true mean height for the population lies within that range.
Confidence intervals are significant because they
provide more information than a single point estimate, giving a
range of possible values and indicating the precision of the
estimate. They are widely used in research, quality control, and
decision-making.
Hypothesis Testing:
Hypothesis testing is a statistical method used to make inferences
or draw conclusions about a population based on sample data. It
involves two hypotheses: Null Hypothesis (H₀): A statement
assuming no effect or relationship exists in the population. For
example, "There is no significant difference between the means of
two groups."Alternate Hypothesis (H₁): Contradicts the null
hypothesis, suggesting there is an effect or relationship. For
example, "There is a significant difference between the means of
two groups."The process involves selecting a significance level
(usually 5%), calculating a test statistic, and comparing it to critical
values to decide whether to reject the null hypothesis. If the p-value
is less than the significance level, the null hypothesis is rejected.
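For illustration only (a sketch assuming SciPy and made-up sample values, not part of the original notes), a two-sample t-test compares the p-value against the chosen significance level:

    from scipy import stats

    group_a = [23, 25, 27, 22, 26, 24]   # hypothetical measurements
    group_b = [30, 31, 29, 32, 28, 33]

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    alpha = 0.05
    if p_value < alpha:
        print(f"p = {p_value:.4f} < {alpha}: reject H0 (the means differ)")
    else:
        print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")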
Chi-Square Hypothesis Testing Method:
The Chi-Square test is a statistical test used to determine if there is
a significant association between two categorical variables. It
compares the observed frequency of occurrences in a contingency
table with the expected frequencies under the null hypothesis (no
association between variables).
The formula for the chi-square statistic is χ² = Σ (O − E)² / E,
where O is the observed frequency and E is the expected
frequency. If the calculated chi-square statistic exceeds the critical
value, the null hypothesis is rejected.
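A hedged sketch (assuming SciPy and an invented 2x2 contingency table) of how the test of independence is usually run in practice:

    from scipy.stats import chi2_contingency

    # Observed counts: rows = group, columns = preference (made-up numbers)
    observed = [[30, 10],
                [20, 40]]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
    # Reject the null hypothesis of independence when p falls below the chosen alpha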
Calculating Dissimilarities between Data Objects:
Calculating dissimilarities or distances between data objects is
crucial in various machine learning techniques, particularly
clustering and classification. The dissimilarity measure quantifies
how different two objects are based on their features. Common
distance metrics include: Euclidean Distance: Measures the
straight-line distance between two points in space, used for
continuous data. Manhattan Distance: Sums the absolute
differences of the coordinates (d(x, y) = Σ |xᵢ − yᵢ|), used for
grid-like data. Hamming Distance: Measures the number of
positions at which two strings of equal length differ, often used for
categorical data.
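The three measures can be computed directly; the sketch below uses NumPy and made-up values:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 6.0, 3.0])

    euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: 5.0
    manhattan = np.sum(np.abs(a - b))           # sum of absolute differences: 7.0
    print(euclidean, manhattan)

    s1, s2 = "karolin", "kathrin"
    hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))  # positions that differ: 3
    print(hamming)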
Data Discretization:
Data discretization is the process of converting continuous data into
discrete categories or intervals. This technique simplifies the
analysis of data, especially when using machine learning algorithms
that work better with discrete data. Methods of discretization include:
Equal Width Binning: Divides the range of data into equal-width
intervals. Equal Frequency Binning: Divides the data such that
each interval contains approximately the same number of data
points. Clustering-Based Discretization: Uses clustering
algorithms (like k-means) to group similar values together.
Example: Age, which is continuous, can be discretized into age
groups like 20-30, 30-40, etc., for use in classification models.
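A minimal sketch (assuming pandas and an invented list of ages) contrasting equal-width and equal-frequency binning:

    import pandas as pd

    ages = pd.Series([21, 25, 32, 37, 45, 52, 58, 63, 70])

    equal_width = pd.cut(ages, bins=3)   # three intervals of equal width
    equal_freq = pd.qcut(ages, q=3)      # three intervals with roughly equal counts
    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())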
Visual Encoding: Visual encoding is the process of representing
data visually through various graphical elements such as charts,
graphs, colours, and shapes. The goal is to make complex data
easier to understand and interpret. Common types of visual
encoding include: Position: Using position on the x- and y-axes to
represent numerical values (e.g., in scatter plots). Length/Size:
Using the length or size of elements (e.g., bar charts, bubble
charts). Colour: Using colour intensity or hue to encode information
(e.g., heatmaps). Shape: Using different shapes to represent
categories (e.g., circle vs. square in scatter plots).
Visual encoding is essential for exploratory data analysis, as it
helps in identifying patterns, trends, and anomalies quickly.
Dendrograms: A dendrogram is a tree-like diagram that is used to
illustrate the hierarchical relationships between clusters in
hierarchical clustering. Each branch of the dendrogram represents
a merge or split of clusters, with the height of the branch indicating
the level of similarity between clusters. Dendrograms are
particularly useful for visualising the results of clustering algorithms,
where the data points are grouped into hierarchical levels based on
similarity. The cuts in the dendrogram at different heights allow for
determining the optimal number of clusters.
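An illustrative sketch (assuming SciPy and matplotlib, with a small made-up 2-D dataset) of producing a dendrogram from hierarchical clustering:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Two loose groups of points
    X = np.array([[1, 2], [2, 2], [1, 1],
                  [8, 8], [9, 8], [8, 9]])

    Z = linkage(X, method="ward")   # hierarchical merges using Ward's criterion
    dendrogram(Z)
    plt.title("Dendrogram of hierarchical clustering")
    plt.ylabel("Merge distance")
    plt.show()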
Visualization Techniques for Geospatial Data:
Geospatial data visualisation techniques are critical for representing
data related to geographic locations, enabling easier interpretation
of spatial relationships. Some common visualisation techniques
include: Choropleth Maps: Use colour to represent data values in
different regions, like population density or income levels.
Heatmaps: Visualise the intensity of data points, showing patterns
like traffic or crime hotspots. Scatter Plots on Maps: Plot data
points on maps to display spatial distributions, such as store
locations or accident sites. 3D Surface Maps: Display
topographical features like terrain or ocean depth in three
dimensions. Flow Maps: Show the movement of data, such as
migration or goods flow across regions.
Metadata of Open Data Should Be Available: Metadata provides
essential information about datasets, such as data source,
structure, units, and collection methods. For open data to be useful,
accessible, and interpretable, it must include comprehensive
metadata. Without it, users may misinterpret the data, leading to
errors or incorrect conclusions. Metadata ensures that the data is
reusable and can be validated by different users for various
applications.
Data Preparation Is an Important Step in the Data Science Process:
Data preparation is critical because raw data is often incomplete,
inconsistent, or noisy. This phase involves cleaning, transforming,
and organising data into a format suitable for analysis. For example,
missing values might be filled in, duplicates removed, or categorical
variables encoded. Without proper data preparation, any analysis or
modelling could yield inaccurate results. In practical terms, cleaning
financial transaction data or preprocessing text data for sentiment
analysis are key examples of data preparation.
Significance of the Data Exploration Phase in Data Science: Data
exploration allows data scientists to understand the characteristics
and patterns in a dataset before applying complex models. It
involves visualising data, checking distributions, and identifying
relationships between variables. For instance, exploring sales data
to detect seasonal trends or visualising customer demographics to
segment them for marketing purposes can provide valuable
insights. It helps identify issues like missing values or outliers early,
guiding further analysis or model selection.
A Central Tendency Is a Single Number That Represents the
Most Common Value for a List of Numbers:
Central tendency is a statistical concept that refers to a single value
that represents the center of a data distribution. The most common
measures of central tendency are the mean, median, and mode.
These values give an overview of the data, such as the average
income of a population or the most frequent age group in a survey.
Understanding central tendency helps summarise large datasets
with a single, interpretable number.
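A quick sketch with Python's standard statistics module and invented income figures, showing how the three measures can disagree when an outlier is present:

    import statistics

    incomes = [30, 32, 35, 35, 38, 40, 120]   # one unusually high value

    print(statistics.mean(incomes))    # about 47.1 -- pulled up by the outlier
    print(statistics.median(incomes))  # 35 -- closer to the typical value
    print(statistics.mode(incomes))    # 35 -- the most frequent value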
Statistics Help in Extracting the Right Samples from a Big Data
Set: Statistics provide methods to sample a representative subset of
data from a large dataset, ensuring that the sample reflects the
characteristics of the entire population. Techniques like random
sampling, stratified sampling, and bootstrapping help reduce bias
and ensure that insights drawn from the sample are reliable and
valid. For example, in marketing research, instead of surveying an
entire population, a well-chosen sample can yield accurate insights
into consumer preferences.
Income and Wealth Are Classic Examples of Right-Skewed
Distributions: Income and wealth distributions often follow a
right-skewed pattern, meaning most individuals earn or own
amounts that are relatively low, while a small percentage earn or
own large amounts. This skewness causes the mean to be higher
than the median. For instance, in a population, most people might
earn an average income, but a few wealthy individuals can
significantly increase the average income, leading to a right-skewed
distribution.
Natural Language Data Requires Knowledge of Specific Data
Science Techniques and Linguistics to Process: Processing
natural language data, such as text or speech, requires a
combination of statistical techniques and linguistic understanding.
Techniques like tokenization, part-of-speech tagging, and sentiment
analysis are used to extract meaning from raw text. Knowledge of
linguistics is essential to understand syntax, semantics, and
context, which are crucial for tasks like machine translation or
named entity recognition. This makes natural language processing
(NLP) more challenging than other forms of data analysis.
Difference between histogram and bar chart
Histogram: 1) Continuous data in intervals. 2) Represents ranges
(no gaps). 3) Bars are connected. 4) Shows data distribution.
5) Shows data spread. Bar Chart: 1) Discrete/categorical data.
2) Represents individual categories. 3) Bars are separated.
4) Compares category values. 5) Shows category comparison.
When Travelling at Different Speeds over Equal Distances, the
Harmonic Mean Is a More Accurate Representation of the
Average Speed Than the Arithmetic Mean:
The harmonic mean is more appropriate than the arithmetic mean
when averaging rates, such as speed, over segments of equal
distance. For example, if you drive 50 miles at 60 mph and 50 miles
at 40 mph, the total time taken is about 2.08 hours, so the average
speed is 100 miles / 2.08 hours ≈ 48 mph, which is the harmonic
mean of 60 and 40 (the arithmetic mean would give 50 mph). This
method accounts for the fact that more time is spent travelling at the
slower speed, providing a more accurate representation of the
overall average speed.
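The figures above can be checked with Python's standard statistics module (a sketch added for verification, not part of the original notes):

    import statistics

    speeds = [60, 40]                        # mph over two equal 50-mile legs
    print(statistics.harmonic_mean(speeds))  # 48.0 mph
    print(statistics.mean(speeds))           # 50.0 mph -- overstates the average speed
    print(100 / (50 / 60 + 50 / 40))         # 48.0 -- total distance / total time agrees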
A Confidence Interval Tells Us the Uncertainty of the Point
Estimate: A confidence interval provides a range of values within
which the true population parameter is likely to lie, given the sample
data. For example, a 95% confidence interval for the average height
of a group of people might range from 5'7" to 5'9". This means we
are 95% confident that the true average height is within this range,
indicating the uncertainty of the point estimate and providing a level
of confidence about the data.
Heatmaps Can Be Used to Visualize Correlation Between
Various Data Attributes in a Data Set:
Heatmaps are an effective way to visualise the relationships
between different variables in a dataset. By colouring the cells
based on the strength of correlation, heatmaps allow you to quickly
identify positive or negative correlations between attributes. For
instance, a heatmap of a customer dataset might show a strong
positive correlation between income and spending, or a negative
correlation between age and product preference, helping to inform
business decisions or further analysis.
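An illustrative sketch (assuming pandas, seaborn, and matplotlib, with made-up customer figures) of such a correlation heatmap:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "income":   [40, 55, 60, 75, 90],
        "spending": [20, 28, 30, 40, 47],
        "age":      [55, 42, 38, 30, 25],
    })

    corr = df.corr()                     # pairwise Pearson correlations
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation between attributes")
    plt.show()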
Big Data Is an Amalgamation of Daily Transactions,
Interactions, and Observations. Big data comes from a wide
variety of sources, including daily transactions, interactions, and
observations. For example, consider an e-commerce platform like
Amazon. Every time a customer makes a purchase, browses
products, or even adds an item to their cart, data is generated. This
includes transaction data (purchase details), interaction data (click
patterns, search queries), and observational data (user behaviour,
preferences). Together, this data forms a massive volume of
information that can be analysed for insights such as customer
preferences, sales trends, and inventory management.
Illustrate the Significance of Statistics in Data Science:
Statistics plays a critical role in data science by enabling us to make
sense of large datasets, identify patterns, and make informed
decisions. For example, in a healthcare dataset containing patient
information (age, weight, medical history), statistical analysis can
help identify risk factors for diseases, such as the correlation
between age and the likelihood of heart disease. Without statistics,
we would lack the ability to summarise data, perform hypothesis
testing, or estimate population parameters from sample data,
leading to less accurate insights.
Importance of Dimension Reduction in Data Preprocessing:
Dimension reduction is essential in data preprocessing to simplify
complex datasets, reduce computational costs, and improve model
performance. For example, in image recognition tasks, raw images
have thousands or millions of pixels, each representing a
dimension. By applying techniques like Principal Component
Analysis (PCA), we can reduce the number of dimensions while
retaining most of the important features. This helps in reducing
overfitting, improving the speed of algorithms, and making it easier
to visualise data. For instance, PCA can reduce an image dataset’s
features to a lower-dimensional space while still preserving
important information for classification.
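A minimal sketch (assuming scikit-learn and random placeholder data) of reducing 50 features to two principal components:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 50)        # 100 samples, 50 features (placeholder data)

    pca = PCA(n_components=2)          # keep the two strongest components
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                      # (100, 2)
    print(pca.explained_variance_ratio_.sum())  # share of variance retained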
What Is a Histogram? How Is It Different from a Bar Chart?:
A histogram is a graphical representation of the distribution of
numerical data, showing the frequency of data points within
specified ranges or bins. It’s particularly useful for understanding
the distribution of continuous data, like test scores or ages.
What Is a Treemap? Can Treemaps Be Used to Represent the
Sale Value Across Different Cities?: A treemap is a data
visualisation technique that displays hierarchical data as nested
rectangles. Each rectangle represents a category, and its size is
proportional to a numerical value. Treemaps are useful for
displaying proportions among categories, especially when dealing
with hierarchical structures. Representation of Sales in Treemaps:
Yes, treemaps can be used to represent the sale value across
different cities, aggregated to different states, and then to different
countries. In this case: The largest rectangle could represent the
total sales across all countries. Within it, smaller rectangles could
represent the sales values of different countries. Inside each
country, smaller rectangles would represent sales in individual
states, and so on. Treemaps are particularly effective for visualising
proportions within hierarchical data, making it easy to compare
values across different levels of aggregation. For instance, you can
easily see which countries, states, or cities are driving the most sales.
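One possible way to build such a treemap (a sketch assuming Plotly Express and an invented sales table; other visualisation libraries work equally well):

    import pandas as pd
    import plotly.express as px

    sales = pd.DataFrame({
        "country": ["India", "India", "India", "USA", "USA"],
        "state":   ["MH", "MH", "KA", "CA", "CA"],
        "city":    ["Mumbai", "Pune", "Bengaluru", "San Jose", "Los Angeles"],
        "value":   [120, 80, 150, 200, 170],
    })

    # Nested rectangles: country -> state -> city, sized by sales value
    fig = px.treemap(sales, path=["country", "state", "city"], values="value")
    fig.show()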

Non-parametric methods reduce dataset size without assuming data distribution. Key techniques include:
Clustering: Groups data into clusters (e.g., k-means) to reduce
data points. Sampling: Selects a representative subset (e.g.,
random or stratified sampling). Dimensionality Reduction:
Reduces features using methods like PCA. Histograms: Groups
data into bins to summarise continuous data. Decision Trees:
Reduces data complexity by representing it in a tree structure.
Features of Stacked Bar Charts:
Stacked bar charts show the composition of categories:
Categorical Representation: Categories are on the x-axis, with
total values on the y-axis. Sub-categories: Each bar is divided into
segments representing sub-categories. Comparison: Compares
total values and sub-category contributions. Colour Coding:
Different colours for each sub-category. Proportional: Shows
sub-category proportions within each main category.
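A minimal sketch (assuming matplotlib and invented quarterly figures) of stacking one sub-category on top of another with the bottom argument:

    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]
    online   = [30, 35, 40, 45]
    in_store = [20, 25, 22, 30]

    plt.bar(quarters, online, label="Online")
    plt.bar(quarters, in_store, bottom=online, label="In-store")  # stacked segment
    plt.ylabel("Sales")
    plt.legend()
    plt.show()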
Formatting Issues Making Data Dirty:
Dirty data results from various formatting issues:
Missing Values: Missing data entries.
Inconsistent Formatting: Different formats for dates or numbers.
Typographical Errors: Spelling mistakes or incorrect values.
Duplicate Entries: Multiple records for the same data.
Outliers: Extreme values distorting analysis.
Incorrect Data Types: Data stored in wrong formats.
Inconsistent Categorization: Inconsistent category labels.
Data Entry Errors: Manual input mistakes leading to inaccurate
data.

A project charter is a formal document that outlines the key aspects of a project, serving as a reference for the project's
objectives, scope, stakeholders, and resources. It helps establish a
clear understanding of the project's purpose and direction among all
involved parties. Here's what a project charter typically includes:
Project Title: Name of the project. Objective: The purpose or goal
of the project. Scope: Defines what is included and excluded.
Deliverables: Expected outcomes of the project. Stakeholders:
Key people involved, including sponsors and team members.
Resources: Budget, tools, and personnel needed.
Timeline: Estimated schedule or milestones.
Risks and Assumptions: Identified risks and project constraints.
Significance of a High P-value:
A high p-value (typically above 0.05) indicates that there is
insufficient evidence to reject the null hypothesis. In other words, it
suggests that the observed data is consistent with the null
hypothesis, meaning there is no statistically significant difference or
effect. Interpretation: A high p-value implies that any observed
differences or relationships in the data could have occurred by
chance, and the null hypothesis (no effect or difference) cannot be
rejected. Example: If testing whether a new drug is more effective
than a placebo, a high p-value means there’s no significant
evidence to suggest the drug is better, so the null hypothesis stands.

Difference between Data Scientist and Statistician:
Data Scientist: 1)Analyses large datasets using machine learning
and programming. 2)Coding (Python, R), machine learning, big data
tools. 3) Uses big data frameworks and data visualisation tools.
4)Develops predictive models and solves business problems.
5)Exploratory and iterative with unstructured data.
Statistician: 1)Focuses on statistical methods to analyse data and
make inferences. 2)Expertise in statistical theory and hypothesis
testing. 3)Primarily uses statistical software (e.g., SPSS, SAS).
4)Tests hypotheses and interprets data trends.
5)Structured approach focused on statistical rigour.

Data science process (Life Cycle): 1.Setting the Research Goal:
This is where you define what you want to achieve. It's like setting a
target before you start. 2.Retrieving Data: This step involves
gathering all the data you need. 3.Data Preparation: Here, you
clean and organise the data. This means fixing errors and making
whatever changes are required. 4.Data Exploration: This is where you
start looking at the data to understand what’s inside.
5.Data Modeling: In this step, you build a model to predict
outcomes or find relationships in the data. 6.Presentation and
Automation: This is where you present your findings in a way that
others can understand, like with charts or reports.
Difference between structured and unstructured data: 1) Definition: Structured data is
organised in predefined formats, while unstructured data lacks a
fixed format. 2)Storage: Structured data is stored in relational
databases, whereas unstructured data is stored in data lakes or
NoSQL systems. 3)Examples: Structured includes tables and logs;
unstructured includes images and videos. 4)Analysis: Structured is
easy to analyse with SQL; unstructured needs advanced tools like
NLP. 5)Flexibility: Structured data is rigid; unstructured is flexible
but harder to manage.
Exploratory Data Analysis (EDA) is the process of examining and
visualising data to uncover its key characteristics, patterns, and
insights. It involves using statistical summaries and graphical tools
to better understand the structure of the data, detect anomalies,
and identify relationships between variables. Key Steps in EDA:
Data Inspection: Examine data dimensions, types, and check for
missing values. Descriptive Statistics: Summarise data using
measures like mean, median, and standard deviation.
Visualisation: Use histograms, scatter plots, box plots, and
heatmaps to explore data distributions and relationships.
Data Cleaning: Address missing values, outliers, and
inconsistencies for better data quality.
Benefits of EDA: 1)Reveals insights and trends for better
decision-making. 2)Identifies data quality issues, including outliers
and missing values. 3)Guides feature engineering and model
selection. 4)EDA is a critical step that lays the foundation for
effective data analysis and predictive modelling.
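A first-pass EDA might look like the sketch below (assuming pandas; data.csv is a hypothetical file name):

    import pandas as pd

    df = pd.read_csv("data.csv")              # hypothetical dataset

    print(df.shape)                           # dimensions
    print(df.dtypes)                          # column types
    print(df.isnull().sum())                  # missing values per column
    print(df.describe())                      # mean, std, quartiles for numeric columns
    print(df.select_dtypes("number").corr())  # relationships between numeric columns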
Feature Extraction is the process of transforming raw data into a
set of meaningful and relevant attributes (features) that can be used
for machine learning or statistical modelling. It involves selecting or
creating features that capture the important information in the data
while reducing dimensionality. Importance of Feature Extraction:
Simplifies the dataset by retaining essential information. Enhances
model performance by focusing on the most relevant attributes.
Reduces computational complexity by lowering the number of input
variables.
