FDS Sem5
Example of unstructured data: A dataset containing text (reviews), images (product photos), and audio (customer feedback recordings) about products.
Predictor variables are independent variables used to predict the
outcome of a dependent variable. Example: Age and education
level predicting income.
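A minimal scikit-learn sketch of this example (all numbers are made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression

# predictors: age (years) and education level (years of schooling) - hypothetical values
X = np.array([[25, 12], [35, 16], [45, 18], [30, 14]])
y = np.array([30000, 55000, 75000, 42000])  # hypothetical incomes (dependent variable)

model = LinearRegression().fit(X, y)
print(model.predict(np.array([[40, 16]])))  # predicted income for a new person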
Define a Quantile: A quantile is a cut point that divides a data set into equal-sized groups; e.g., quartiles divide data into four equal parts.
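For illustration, a quick NumPy sketch (made-up numbers) computing the three quartile cut points:
import numpy as np

data = np.array([2, 4, 6, 8, 10, 12, 14, 16])
q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles split the data into 4 equal parts
print(q1, q2, q3)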
Difference between descriptive statistics and inferential statistics: 1) Descriptive statistics summarises data, while inferential statistics uses data to make predictions or inferences about a population. 2) Descriptive statistics deals with actual data, while inferential statistics applies probability theory.
Point Estimate: A point estimate is a single value calculated from a sample that is used as an approximation of an unknown population parameter. For example, the sample mean (x̄) is a point estimate of the population mean (μ).
Parameter: A parameter is a numerical value that describes a characteristic of a population. Parameters are usually unknown because it is often impractical to measure an entire population. Common parameters include the population mean (μ), population standard deviation (σ), and population proportion (p).
Statistic: A statistic is a numerical value calculated from a sample that is used to estimate a corresponding population parameter. Examples include the sample mean (x̄), sample standard deviation (s), and sample proportion (p̂).
Sampling Error: Sampling error is the difference between a sample statistic and the corresponding population parameter, arising because the sample is only a subset of the population.
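A small NumPy sketch (simulated population, so every number is illustrative) tying these terms together: the sample mean is a statistic and point estimate, and its distance from the population mean is the sampling error:
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # simulated population, mean near 50

sample = rng.choice(population, size=100, replace=False)
x_bar = sample.mean()                       # statistic: point estimate of the population mean
sampling_error = x_bar - population.mean()  # difference due to sampling only a subset
print(x_bar, sampling_error)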
Interval Estimate: An interval estimate is a range of values within
which a population parameter is likely to fall, with a certain level of
confidence.
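A sketch of a 95% confidence interval for a population mean (assumes SciPy; the sample values are made up):
import numpy as np
from scipy import stats

sample = np.array([48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 50.1, 49.4])
x_bar = sample.mean()
se = stats.sem(sample)  # standard error = s / sqrt(n)

# 95% interval estimate for the population mean, using the t distribution
lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=x_bar, scale=se)
print(lo, hi)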
Test Statistic: A test statistic is a standardised value computed from sample data during hypothesis testing. It measures how far the sample statistic is from the null hypothesis value of the parameter, in standard error units.
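For example, the z test statistic for a sample mean (all numbers below are made up; population sigma assumed known):
import math

x_bar, mu_0, sigma, n = 52.0, 50.0, 10.0, 100
z = (x_bar - mu_0) / (sigma / math.sqrt(n))  # distance from mu_0 in standard error units
print(z)  # 2.0 -> the sample mean sits 2 standard errors above the null value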
What are the factors based on which data quality is measured?
1) Accuracy 2) Completeness 3) Consistency 4) Timeliness 5) Relevance
Define data wrangling: Data wrangling is the process of cleaning, transforming, and organising raw data into a suitable format for analysis.
State any two reasons why data can be missing in a data set: 1) Non-response in surveys 2) Data corruption or loss
Define data interpolation and state its significance in data preprocessing: Data interpolation is the method of estimating missing data points within a range of known values. It helps maintain the continuity of data for analysis.
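A minimal pandas sketch (toy series) of linear interpolation filling missing values:
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])
print(s.interpolate(method="linear"))  # NaNs become 12.0 and 14.0, keeping the series continuous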
State the reasons for the presence of noisy data in a data set: 1) Measurement errors 2) Data entry mistakes 3) Irregularities in data sources
State the significance of data integration in data preprocessing:
Data integration combines data from different sources into a unified
format, improving data quality and enabling more comprehensive
analysis.
Why is data reduction necessary in data science?
Data reduction helps in reducing the complexity and size of data,
improving computational efficiency, and focusing on relevant data.
State a major drawback of label encoding:
Label encoding can introduce ordinal relationships where none
exist, leading to incorrect model assumptions.
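A pandas sketch (toy colour column) showing the problem: label encoding implies blue < green < red, while one-hot encoding does not:
import pandas as pd

colours = pd.Series(["red", "green", "blue", "green"])
codes = colours.astype("category").cat.codes  # label encoding: blue=0, green=1, red=2
one_hot = pd.get_dummies(colours)             # one-hot encoding: no spurious ordering
print(codes.tolist())  # [2, 1, 0, 1]
print(one_hot)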
State a point of difference between feature selection and
feature extraction methods, in dimension reduction:
Feature selection selects a subset of the original features, while
feature extraction creates new features by combining or
transforming existing ones.
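A scikit-learn sketch of the contrast on the built-in iris data: SelectKBest keeps two of the original features, while PCA builds two new combined features:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)  # subset of original columns
X_extracted = PCA(n_components=2).fit_transform(X)            # new derived components
print(X_selected.shape, X_extracted.shape)  # both (150, 2), built in different ways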
Ordinal variables help to differentiate members within a group.
True. Ordinal variables allow ranking but do not quantify the exact
difference between rankings.
A data set representing the IQ scores within a population is an
example of symmetric distribution: True. IQ scores are designed
to follow a normal distribution with symmetry around the mean.
A mean value is easily distorted by the presence of outliers in
the data set: True. Outliers can significantly shift the mean away
from the central value of the data.
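A quick numeric illustration (made-up scores):
import numpy as np

scores = np.array([10, 12, 11, 13, 12, 95])  # 95 is an outlier
print(np.mean(scores))    # 25.5 - pulled far from the bulk of the data
print(np.median(scores))  # 12.0 - unaffected by the outlier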
Robust scaler is not sensitive to outliers: True. The robust scaler uses the median and interquartile range to scale data, making it less sensitive to outliers.
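A scikit-learn sketch (toy column with one outlier):
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
X_scaled = RobustScaler().fit_transform(X)  # centres on the median, scales by the IQR
print(X_scaled.ravel())  # inliers stay on a sensible scale despite the outlier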
Data cube aggregation is also known as multidimensional
aggregation. True. Data cube aggregation involves summarising
data across multiple dimensions, making it a form of
multidimensional aggregation.
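A pandas sketch (toy sales table) of aggregating a measure across two dimensions:
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["East", "West", "East", "West"],
    "revenue": [100, 150, 120, 180],
})
# summarise the revenue measure across the year x region dimensions
cube = sales.pivot_table(values="revenue", index="year", columns="region", aggfunc="sum")
print(cube)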
Social media data, like tweets/reviews, are examples of
network data. False: Social media data, such as tweets or reviews,
are typically considered textual or unstructured data, not network
data. Network data refers to data that represents relationships or
interactions between entities, like social networks, where nodes
represent users and edges represent relationships between them.
Machine-generated data requires highly scalable tools for
analytics. True: Machine-generated data, like sensor data or logs,
often comes in large volumes and requires tools that can scale to
handle the massive amounts of data. Scalable tools (e.g., Hadoop,
Spark) are necessary to process, store, and analyse such
high-volume data efficiently.
A smaller p-value makes us more likely to reject the null
hypothesis. True: A smaller p-value indicates stronger
evidence against the null hypothesis. If the p-value is below a
threshold (commonly 0.05), it suggests that the observed data is
unlikely under the null hypothesis, leading to its rejection.
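A SciPy sketch (simulated sample) of the decision rule:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=51, scale=10, size=100)  # simulated data; true mean is not 50

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)  # H0: population mean = 50
if p_value < 0.05:
    print("reject H0")  # small p-value -> data unlikely under the null
else:
    print("fail to reject H0")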
4 V's of Big Data: