Analytics Data Cleaning Interview Questions
Intermediate
Q1. Different types of plots you can create using Matplotlib?
1. Line Plot - plt.plot()
2. Scatter Plot - plt.scatter()
3. Bar Plot - plt.bar()
4. Histogram - plt.hist()
5. Pie Chart - plt.pie()
6. Box Plot - plt.boxplot()
7. Area Plot - plt.fill_between()
8. Hexbin Plot - plt.hexbin()
9. Stem Plot - plt.stem()
10. Polar Plot - plt.polar()
11. Stacked Bar Plot - plt.bar() with the bottom parameter
12. Error Bar Plot - plt.errorbar()
These are some common types of plots you can create using Matplotlib.
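For illustration, a minimal sketch (the x and y values are made-up sample data) showing four of these plot types in one figure:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(x, y)         # line plot
axes[0, 1].scatter(x, y)      # scatter plot
axes[1, 0].bar(x, y)          # bar plot
axes[1, 1].hist(y, bins=4)    # histogram
plt.tight_layout()
plt.show()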
Q2. What are some common methods for handling missing values in a DataFrame?
1. Removing missing values:
o dropna(): Removes rows or columns that contain missing values (NaN).
▪ Example: df.dropna() removes any rows with NaN values.
▪ Example: df.dropna(axis=1) removes columns with NaN values.
2. Filling missing values:
o fillna(): Replaces missing values (NaN) with a specified value, such as the mean, median, or a
custom value.
▪ Example: df.fillna(0) replaces NaN with 0.
▪ Example: df.fillna(df.mean()) fills NaN with the mean of each column.
3. Forward filling:
o ffill() or fillna(method='ffill'): Fills missing values using the previous row's value (forward fill).
▪ Example: df.fillna(method='ffill') fills NaN with the previous non-null value.
4. Backward filling:
o bfill() or fillna(method='bfill'): Fills missing values using the next row's value (backward fill).
▪ Example: df.fillna(method='bfill') fills NaN with the next non-null value.
5. Interpolating missing values:
o interpolate(): Uses interpolation to estimate missing values based on existing data.
▪ Example: df.interpolate() fills NaN by interpolating between existing values (linear
interpolation by default).
6. Replacing missing values with a constant:
o replace(): Replaces missing values (or other specific values) with a constant.
▪ Example: df.replace(np.nan, 0) replaces NaN values with 0.
These methods are common strategies for handling missing data in a pandas DataFrame, depending on your
analysis and data requirements.
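For illustration, a minimal sketch (the DataFrame and its values are made up) combining several of these methods:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 5, 6]})

print(df.dropna())            # drop rows containing NaN
print(df.fillna(df.mean()))   # fill NaN with each column's mean
print(df.ffill())             # forward fill from the previous row
print(df.interpolate())       # linear interpolation between existing values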
Q3. How do you customize the appearance of a plot in Matplotlib?
1. Title: plt.title('Title')
2. Axis labels: plt.xlabel('X-axis label'), plt.ylabel('Y-axis label')
3. Line style and color: plt.plot(x, y, linestyle='--', color='r')
4. Grid: plt.grid(True)
5. Legend: plt.legend(['Label'])
6. Marker style: plt.plot(x, y, marker='o')
7. Figure size: plt.figure(figsize=(10, 6))
8. Ticks: plt.xticks([0, 1, 2], ['A', 'B', 'C']), plt.yticks([0, 5, 10], ['Low', 'Medium', 'High'])
9. Subplots: plt.subplot(1, 2, 1) for multiple plots
10. Font size: plt.title('Title', fontsize=14)
These methods allow you to modify titles, labels, colors, markers, gridlines, and more.
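A short sketch (the data values are made up) combining several of these customizations in one plot:
import matplotlib.pyplot as plt

x = [0, 1, 2]
y = [0, 5, 10]

plt.figure(figsize=(10, 6))
plt.plot(x, y, linestyle='--', color='r', marker='o', label='Sample')
plt.title('Title', fontsize=14)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.xticks([0, 1, 2], ['A', 'B', 'C'])
plt.grid(True)
plt.legend()
plt.show()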
Q4. What is the purpose of data normalization in data analysis?
The purpose of data normalization is to rescale the values of numerical features to a common scale
without distorting differences in the ranges of values. It is particularly useful in machine learning algorithms
that require input features to be on a similar scale to prevent certain features from dominating others.
Q5. What are some common methods for data normalization?
Common methods for data normalization include:
1. Min-Max Scaling: Rescales data to a fixed range (usually 0 to 1).
o Formula: (x - min) / (max - min)
2. Z-score Standardization (Standard Scaling): Centers data around 0 with a standard deviation of 1.
o Formula: (x - mean) / std
3. Max Absolute Scaling: Scales data by dividing by the maximum absolute value.
o Formula: x / max(|x|)
4. Robust Scaling: Scales data using the median and interquartile range (IQR), less sensitive to outliers.
o Formula: (x - median) / IQR
5. Log Transformation: Applies a logarithmic transformation to compress the range of values.
These methods are used to rescale features to improve the performance of machine learning algorithms.
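As an illustration, a small sketch applying some of these formulas to a pandas Series (the sample values are made up):
import pandas as pd

s = pd.Series([10, 20, 30, 40, 100])

min_max = (s - s.min()) / (s.max() - s.min())                        # Min-Max scaling to [0, 1]
z_score = (s - s.mean()) / s.std()                                   # Z-score standardization
robust = (s - s.median()) / (s.quantile(0.75) - s.quantile(0.25))    # Robust scaling using the IQR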
Q6. What is the difference between a series and a dataframe in Pandas?
A Series holds only a single column of data with an index, whereas a DataFrame is made up of one or more Series. In other words:
• Series is a one-dimensional array that supports any datatype (including integers, strings, floats, etc.).
In a series, the axis labels are the index.
• A dataframe is a two-dimensional data structure with columns that can support different data types.
It is similar to a SQL table or a dictionary of series objects.
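Example (the labels and values are made up for illustration):
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])                # one-dimensional, labeled by its index
df = pd.DataFrame({'price': [10, 20], 'name': ['pen', 'book']})   # two-dimensional, columns of different types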
Q7. Define Counter() Function in Python?
The Counter() function in Python, from the collections module, counts the occurrences of elements in an
iterable and returns a dictionary-like object with elements as keys and their counts as values.
Example:
from collections import Counter
data = ['a', 'b', 'a', 'c', 'b', 'a']
counter = Counter(data)
print(counter) # Output: Counter({'a': 3, 'b': 2, 'c': 1})
Q8. Define f-string formatting in Python?
f-string formatting in Python is a way to embed expressions inside string literals using curly braces {} and
prefixing the string with f.
Example:
name = "Alice"
age = 30
greeting = f"Hello, my name is {name} and I am {age} years old."
This allows for inline variable interpolation and expression evaluation directly within strings.
Q9. What is the purpose of data aggregation in data analysis?
The purpose of data aggregation is to summarize and condense large datasets into more manageable and
meaningful information by grouping data based on specified criteria and computing summary statistics
for each group. It helps in gaining insights into the overall characteristics and patterns of the data.
Q10. How do you perform data aggregation using Pandas?
You can perform data aggregation using the groupby() method in Pandas to group data based on one or
more columns and then apply an aggregation function to compute summary statistics for each group.
For example:
grouped = df.groupby('Name').mean()
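A self-contained sketch (the column names and values are made up); agg() lets you compute several statistics per group at once:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'A', 'B'], 'Score': [10, 20, 30, 40]})
grouped = df.groupby('Name')['Score'].agg(['mean', 'sum', 'count'])
print(grouped)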
Q11. How do you calculate descriptive statistics for a DataFrame in Pandas?
You can use the describe() method in Pandas to calculate descriptive statistics for a DataFrame, including
count, mean, standard deviation, minimum, maximum, and percentiles.
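Example (the DataFrame is made up for illustration):
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 58], 'salary': [40000, 55000, 72000]})
print(df.describe())   # count, mean, std, min, 25%, 50%, 75%, max for each numeric column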
Q12. What is a histogram, and how is it used in data analysis?
A histogram is a graphical representation of the distribution of numerical data. It consists of a series of bars,
where each bar represents a range of values and the height of the bar represents the frequency of values
within that range. Histograms are commonly used to visualize the frequency distribution of a dataset.
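A minimal sketch (using randomly generated sample data) of plotting a histogram with Matplotlib:
import matplotlib.pyplot as plt
import numpy as np

data = np.random.normal(loc=0, scale=1, size=1000)   # made-up sample data
plt.hist(data, bins=30)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()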
Q13. Different ways to check for missing values in a DataFrame using Pandas:
1. isnull() or df.isnull()
o Checks for missing values (NaN) in the DataFrame or Series.
2. notnull() or df.notnull()
o Opposite of isnull(), it checks for non-missing values (not NaN).
3. isna() or df.isna()
o Equivalent to isnull(), it checks for missing values (NaN).
4. notna() or df.notna()
o Equivalent to notnull(), it checks for non-missing values (not NaN).
5. isna().sum()
o Returns the number of missing values in each column.
6. isnull().sum()
o Also returns the number of missing values in each column.
7. any()
o Used with isnull() or isna() to check if there are any missing values in each column or row.
o Example: df.isnull().any()
8. all()
o Used with isnull() or isna() to check if all values in a column or row are missing.
o Example: df.isnull().all()
9. info()
o Provides a summary of the DataFrame, including the count of non-null values in each column.
These are the main ways to check for missing data in a pandas DataFrame.
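For illustration, a small sketch (the DataFrame is made up) using a few of these checks:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

print(df.isnull().sum())   # number of missing values per column
print(df.isnull().any())   # whether each column contains at least one NaN
df.info()                  # non-null counts per column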
Q14. Explain the difference between a DataFrame and a Series in Pandas.
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It
can be thought of as a table with rows and columns. A Series, on the other hand, is a 1-dimensional labeled
array capable of holding any data type.
Q15. How do you select specific rows and columns from a DataFrame in Pandas?
You can use indexing and slicing to select specific rows and columns from a DataFrame in Pandas.
For example:
df.iloc[2:5, 1:3]
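A self-contained sketch (the column names and values are made up) contrasting position-based and label-based selection:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'age': [25, 30, 35], 'city': ['X', 'Y', 'Z']})

print(df.iloc[0:2, 0:2])               # rows 0-1, columns 0-1 by position
print(df.loc[0:1, ['name', 'city']])   # rows by label (inclusive), columns by name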
Q16. What types of files can be read by a DataFrame in Pandas?
1. CSV - pd.read_csv()
2. Excel - pd.read_excel()
3. JSON - pd.read_json()
4. SQL - pd.read_sql()
5. Parquet - pd.read_parquet()
6. HDF5 - pd.read_hdf()
7. Feather - pd.read_feather()
8. Stata - pd.read_stata()
9. SAS - pd.read_sas()
10. Google BigQuery - pd.read_gbq()
11. Clipboard - pd.read_clipboard()
12. HTML - pd.read_html()
13. Pickle - pd.read_pickle()
These are the most common file formats pandas can read.
Q17. How would you find duplicate values in a dataset for a variable in Python?
You can check for duplicates using the Pandas duplicated() method. This returns a boolean Series that is
True for rows flagged as duplicates.
DataFrame.duplicated(subset=None, keep='first')
Here, keep determines which occurrences are flagged as duplicates. You can use:
• 'first' - Marks all occurrences except the first as duplicates (the default).
• 'last' - Marks all occurrences except the last as duplicates.
• False - Marks every occurrence of a repeated value as a duplicate.
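Example (the id column and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3, 3]})

print(df.duplicated(subset=['id'], keep='first'))    # True marks rows flagged as duplicates
print(df[df.duplicated(subset=['id'], keep=False)])  # show every occurrence of duplicated values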
Q18. What is a lambda function in Python?
Sometimes called an “anonymous function,” a lambda function works like a normal function but is not
defined with the def keyword; it is defined with the lambda keyword instead. Lambda functions are restricted to a
single expression, and can take in multiple parameters, just like normal functions.
Here is an example of both a normal and a lambda function for the argument (x) and the expression (x+x):
Normal function:
def function_name(x):
return x+x
Lambda function:
lambda x: x+x
Q19. What is list comprehension in Python? Provide an example.
List comprehension is used to define and create a list based on an existing list. For example, if we
wanted to separate all the letters in the word “retain,” and make each letter a list item, we could use list
comprehension:
r_letters = [letter for letter in 'retain']
print(r_letters)
Output:
['r', 'e', 't', 'a', 'i', 'n']
Q20. What are some of the limitations of Python?
Python is limited in a few key ways, including:
• Speed - Python is generally slower than compiled languages like Java and C++. However, there
are options to make Python faster, such as alternative runtimes (e.g., PyPy) or C extensions.
• V2 vs V3 - Python 2 and Python 3 are incompatible.
• Mobile development - Python is great for desktop and server applications, but weaker for mobile
development.
• Memory consumption - Python is not great for memory intensive applications.
Q21. What is the difference between loc and iloc in Pandas?
loc is used for label-based indexing, where you specify the row and column labels, while iloc is used
for integer-based indexing, where you specify the row and column indices.
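Example (the index labels and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'score': [90, 80, 70]}, index=['x', 'y', 'z'])

print(df.loc['y', 'score'])   # label-based: row labeled 'y'
print(df.iloc[1, 0])          # integer-based: second row, first column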
Q22. How do you handle categorical data in Pandas?
Categorical data in Pandas can be handled using the astype('category') method to convert columns
to categorical data type or by using the Categorical() constructor. It helps in efficient memory usage and
enables faster operations.
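Example (the column and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'size': ['S', 'M', 'L', 'M']})
df['size'] = df['size'].astype('category')

print(df['size'].cat.categories)   # the distinct categories
print(df['size'].cat.codes)        # integer codes stored for each value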
Q23. What is the purpose of the pd.concat() function in Pandas?
The pd.concat() function in Pandas is used to concatenate (combine) two or more DataFrames along
rows or columns. It allows you to stack DataFrames vertically or horizontally.
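Example (df1 and df2 are made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

rows = pd.concat([df1, df2], ignore_index=True)   # stack vertically (along rows)
cols = pd.concat([df1, df2], axis=1)              # stack horizontally (along columns)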
Q24. How do you handle datetime data in Pandas?
Datetime data in Pandas can be handled using the to_datetime() function to convert strings or
integers to datetime objects, and the dt accessor can be used to extract specific components like year, month,
day, etc.
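Example (the date strings are made up for illustration):
import pandas as pd

df = pd.DataFrame({'date': ['2024-01-15', '2024-02-20']})
df['date'] = pd.to_datetime(df['date'])

print(df['date'].dt.year)    # 2024, 2024
print(df['date'].dt.month)   # 1, 2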
Q25. What is the purpose of the resample() method in Pandas?
The resample() method in Pandas is used to change the frequency of time series data. It allows you
to aggregate data over different time periods, such as converting daily data to monthly or yearly data.
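A minimal sketch (using a made-up daily Series) of converting daily data to monthly totals:
import pandas as pd

idx = pd.date_range('2024-01-01', periods=90, freq='D')
s = pd.Series(range(90), index=idx)   # made-up daily data

monthly = s.resample('M').sum()       # aggregate daily values into month-end totals
print(monthly)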