Pandas 1705297450
What it does:
Loads and cleans datasets of various formats (CSV, Excel, SQL databases,
etc.).
Creates and manipulates data structures like DataFrames (similar to
spreadsheets) and Series (single arrays).
Performs data analysis tasks like filtering, sorting, grouping, aggregating, and
statistical calculations.
Enables data visualization through built-in plotting functions and integration
with other libraries like Matplotlib.
Data Types:
Can hold various data types, such as numbers, strings, booleans, dates, and
arbitrary Python objects (even other DataFrames, nested!).
Each column has a single dtype; a column of dtype object can hold values of
mixed types, providing flexibility for diverse data sets.
Key Features:
Indexing and selection: Access specific rows, columns, or cells using labels,
positions, or logical conditions.
Operations: Perform calculations, aggregations, filtering, and sorting on data
within columns or rows.
Merging and joining: Combine data from multiple DataFrames based on
shared information.
Visualization: Easily visualize data patterns and relationships through built-in
plotting functions.
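As a rough illustration of the features listed above, the sketch below builds two small DataFrames, selects by label and by condition, merges on a shared column, and aggregates. The names students and classes and all values are invented for this example.

```python
import pandas as pd

# Hypothetical data, invented only for this illustration.
students = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["Ann", "Ben", "Cara"], "class_id": [10, 20, 10]}
)
classes = pd.DataFrame({"class_id": [10, 20], "class_name": ["Math", "Art"]})

# Indexing and selection: one column by label, rows by a logical condition.
names = students["name"]
in_class_10 = students[students["class_id"] == 10]

# Merging and joining: combine on the shared "class_id" column.
merged = students.merge(classes, on="class_id", how="left")

# Operations: count students per class name.
per_class = merged.groupby("class_name")["id"].count()
print(merged)
print(per_class)
```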
Benefits:
Out[151]:
Column1 Column2 Column3 Column4
Row1 0 1 2 3
Row2 4 5 6 7
Row3 8 9 10 11
Row4 12 13 14 15
Row5 16 17 18 19
In [152]: df.to_csv("test.csv")
Position-based (iloc):
Row: df.iloc[row_index]
Specific element: df.iloc[row_index, col_index]
Subset: df.iloc[start_row:end_row, start_col:end_col]
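The three iloc patterns above can be tried on a frame like the one displayed earlier. How that frame was built is not shown, so the construction with np.arange here is an assumption.

```python
import numpy as np
import pandas as pd

# Rebuild a 5x4 frame resembling the displayed one (assumed construction).
df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

first_row = df.iloc[0]       # Row: a Series holding Row1
element = df.iloc[1, 2]      # Specific element: Row2, Column3
subset = df.iloc[0:2, 0:2]   # Subset: end positions are excluded
print(subset)
```

Positions are zero-based, so df.iloc[0] is the first row regardless of its label.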
Boolean Indexing:
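A minimal sketch of boolean indexing: a comparison on a column yields a boolean Series, and passing it to df[...] keeps the rows where it is True. The data below is invented for the example.

```python
import pandas as pd

# Hypothetical marks, invented for this example.
df = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "mark": [88, 55, 78, 60], "gender": ["F", "M", "F", "M"]}
)

mask = df["mark"] >= 75   # boolean Series, one value per row
high = df[mask]           # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); parenthesise each condition.
high_f = df[(df["mark"] >= 75) & (df["gender"] == "F")]
print(high)
```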
Tips:
Bonus:
In [154]: df.head()
Out[154]:
Column1 Column2 Column3 Column4
Row1 0 1 2 3
Row2 4 5 6 7
Row3 8 9 10 11
Row4 12 13 14 15
Row5 16 17 18 19
In [155]: df["Column1"]["Row1"]
Out[155]: 0
In [156]: df["Column1"]
Out[156]: Row1 0
Row2 4
Row3 8
Row4 12
Row5 16
Row5 20
Name: Column1, dtype: int32
Out[157]:
Column1 Column2
Row1 0 1
Row2 4 5
Row3 8 9
Row4 12 13
Row5 16 17
Row5 20 21
In [158]: df.loc["Row1"]
Out[158]: Column1 0
Column2 1
Column3 2
Column4 3
Name: Row1, dtype: int32
In [159]: type(df.loc["Row1"])
Out[159]: pandas.core.series.Series
Remember:
Out[160]:
Column1 Column2
Row1 0 1
Row2 4 5
Out[161]: pandas.core.frame.DataFrame
Out[162]: pandas.core.series.Series
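The type outputs above suggest a pattern worth spelling out: selecting with a single label tends to give a Series, while selecting with a list of labels gives a DataFrame. A small sketch, using an assumed 2x4 frame:

```python
import numpy as np
import pandas as pd

# Small frame in the style of the earlier examples (assumed construction).
df = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

one_row = df.loc["Row1"]                  # single label -> Series
some_rows = df.loc[["Row1", "Row2"]]      # list of labels -> DataFrame
one_col = df["Column1"]                   # single column -> Series
some_cols = df[["Column1", "Column2"]]    # list of columns -> DataFrame
print(type(one_row).__name__, type(some_rows).__name__)
```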
Tips:
Out[164]: (6, 3)
In [165]: df.isnull()  # Find the null values in the DataFrame
Out[165]:
Column1 Column2 Column3 Column4
Out[166]: Column1 0
Column2 0
Column3 0
Column4 0
dtype: int64
Out[167]: Column1
0 1
4 1
8 1
12 1
16 1
20 1
Name: count, dtype: int64
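The cells that produced the outputs above are missing, so here is a small sketch of the likely calls: isnull() for a mask, isnull().sum() for per-column counts, and value_counts() for frequencies. The frame with deliberately missing values is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
df = pd.DataFrame({"Column1": [0, 4, 4, np.nan], "Column2": [1.0, np.nan, 9.0, np.nan]})

null_mask = df.isnull()                  # True where a value is missing
null_counts = df.isnull().sum()          # number of missing values per column
counts = df["Column1"].value_counts()    # frequency of each non-null value
print(null_counts)
print(counts)
```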
Tips:
Out[170]:
id name class mark gender
In [171]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 35 non-null int64
1 name 35 non-null object
2 class 35 non-null object
3 mark 35 non-null int64
4 gender 35 non-null object
dtypes: int64(2), object(3)
memory usage: 1.5+ KB
df.info()
Provides quick overview of DataFrame structure and content.
Shows:
Rows & columns count
Column names & data types
Memory usage
Non-null value counts
Useful for:
Exploring data structure
Checking for missing values
Verifying data types
Optimizing memory usage
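A short sketch of df.info() on an invented frame. Writing the summary to a buffer (the buf argument) is optional; it is used here only so the text can be inspected as a string.

```python
import io
import pandas as pd

# Hypothetical frame with one missing name.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", None], "mark": [88, 55, 78]})

buffer = io.StringIO()
df.info(buf=buffer)          # write the summary to the buffer instead of stdout
summary = buffer.getvalue()
print(summary)

# memory_usage="deep" inspects object (string) columns for a more accurate size.
df.info(memory_usage="deep")
```

A non-null count lower than the number of entries, as for the name column here, is a quick hint that a column has missing values.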
Out[172]:
id mark
df.describe() :
Purpose:
Output:
Benefits:
You can control the displayed percentiles with the percentiles argument of
df.describe() .
Use df.describe(include='all') to include descriptive statistics for
object columns (e.g., unique value counts).
Remember, by default df.describe() summarizes only the numeric columns of a
mixed DataFrame. For non-numeric columns, pass include or consider
alternative analysis methods.
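The options mentioned above can be sketched as follows, on a small invented frame with one numeric and one text column.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"mark": [88, 55, 78, 60], "gender": ["F", "M", "F", "M"]})

numeric = df.describe()                        # numeric columns only by default
custom = df.describe(percentiles=[0.1, 0.9])   # request the 10th and 90th percentiles
everything = df.describe(include="all")        # adds unique/top/freq for text columns
print(everything)
```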
Out[173]: mark
88 7
55 5
78 3
79 3
75 2
69 2
60 2
85 2
90 1
86 1
81 1
54 1
65 1
18 1
94 1
89 1
96 1
Name: count, dtype: int64
In [174]: df[df["mark"] >= 75]
Out[174]:
id name class mark gender
Read CSV
In [175]: from io import StringIO, BytesIO
Out[177]: str
In [178]: pd.read_csv(StringIO(data))
Out[178]:
col1 col2 col3
0 x y 1
1 a b 2
2 c d 3
In [180]: df
Out[180]:
col1
0 x
1 a
2 c
In [182]: print(data)
a, b, c, d
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12
In [183]: # Read CSV data from a string, treating all columns as string object
df = pd.read_csv(StringIO(data), dtype = object)
In [184]: df
Out[184]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
In [185]: df["a"][0]
Out[185]: '1'
In [186]: type(df["a"][0])
Out[186]: str
In [187]: df["a"]
Out[187]: 0 1
1 5
2 9
Name: a, dtype: object
In [189]: type(df["a"][0])
Out[189]: numpy.int32
In [191]: type(df["a"][0])
Out[191]: numpy.float64
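The cells that changed the element type above are missing. The sketch below shows one way the dtype argument of read_csv can produce strings, floats, or per-column types; the data string (without spaces after the commas) is an assumption. The integer width you get from dtype=int may depend on platform and NumPy version, so the sketch names widths explicitly.

```python
from io import StringIO
import pandas as pd

# Assumed CSV text, similar to the example above but without padding spaces.
data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12"

as_text = pd.read_csv(StringIO(data), dtype=object)    # every value stays a str
as_float = pd.read_csv(StringIO(data), dtype=float)    # every column becomes float64
mixed = pd.read_csv(StringIO(data), dtype={"a": "int32", "b": "float64"})
print(mixed.dtypes)
```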
In [193]: df
Out[193]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
In [194]: type(df["a"][1])
Out[194]: numpy.int64
In [195]: df.dtypes
Out[195]: a int64
b int64
c int64
d int64
dtype: object
In [197]: pd.read_csv(StringIO(data))
Out[197]:
index a b c
Out[198]:
a b c
index
In [200]: pd.read_csv(StringIO(data))
Out[200]:
a b c
0 4 apple bat
1 8 orange cow
Out[201]:
a
0 4
1 8
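The outputs above look like the effect of the index_col and usecols arguments, although the cells themselves are missing. A sketch under that assumption, with an invented data string:

```python
from io import StringIO
import pandas as pd

# Hypothetical CSV text with a column literally named "index".
data = "index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"

default = pd.read_csv(StringIO(data))                      # "index" is an ordinary column
indexed = pd.read_csv(StringIO(data), index_col="index")   # use it as the row index
subset = pd.read_csv(StringIO(data), usecols=["a", "c"])   # read only selected columns
print(indexed)
```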
In [202]: # Quoting and escape characters, very useful in NLP
data = 'a, b\n"hello, \\"Vithu\\", nice to see you", 5'
Out[203]:
a b
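The call that parsed this string is not shown. One way to handle backslash-escaped quotes inside a quoted field is the escapechar argument of read_csv; the sketch assumes the same text without the spaces after the commas.

```python
from io import StringIO
import pandas as pd

# Same idea as the string above; spaces after the commas removed (an assumption).
data = 'a,b\n"hello, \\"Vithu\\", nice to see you",5'

# escapechar tells the parser that \" inside a quoted field is a literal quote.
df = pd.read_csv(StringIO(data), escapechar="\\")
print(df["a"][0])
```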
In [205]: df.head()
Out[205]:
carat cut color clarity depth table price x y z
In [206]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
JSON to CSV
In [207]: data = '{"id": 1, "name": "Thas Vithu", "position": "Data Scientist"
# Read JSON string into a Pandas DataFrame
df1 = pd.read_json(data, orient='index')
# Display the DataFrame
print(df1)
0
id 1
name Thas Vithu
position Data Scientist
department Data Science
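To go from JSON to CSV as the heading suggests, one option is to read the JSON into a DataFrame and call to_csv. In this sketch the record is wrapped in a list so that each JSON object becomes one row (an assumption, different from orient='index' above), and the department value is taken from the printed output. Wrapping the text in StringIO avoids passing a literal JSON string to read_json, which newer pandas versions warn about.

```python
from io import StringIO
import pandas as pd

# One record, wrapped in a list so each JSON object becomes a row (assumption).
data = (
    '[{"id": 1, "name": "Thas Vithu", '
    '"position": "Data Scientist", "department": "Data Science"}]'
)

df = pd.read_json(StringIO(data))
csv_text = df.to_csv(index=False)   # returns the CSV as a string when no path is given
print(csv_text)
```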
In [208]: df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-da
In [209]: df.head()
Out[209]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
In [211]: df1.to_json()
In [212]: #df.to_json()
In [213]: df.to_json(orient = "records")
0,"7":3.04,"8":0.2,"9":2.08,"10":5.1,"11":0.89,"12":3.53,"13":76
0},{"0":1,"1":13.56,"2":1.71,"3":2.31,"4":16.2,"5":117,"6":3.1
5,"7":3.29,"8":0.34,"9":2.34,"10":6.13,"11":0.95,"12":3.38,"13":
795},{"0":1,"1":13.41,"2":3.84,"3":2.12,"4":18.8,"5":90,"6":2.4
5,"7":2.68,"8":0.27,"9":1.48,"10":4.28,"11":0.91,"12":3.0,"13":1
035},{"0":1,"1":13.88,"2":1.89,"3":2.59,"4":15.0,"5":101,"6":3.2
5,"7":3.56,"8":0.17,"9":1.7,"10":5.43,"11":0.88,"12":3.56,"13":1
095},{"0":1,"1":13.24,"2":3.98,"3":2.29,"4":17.5,"5":103,"6":2.6
4,"7":2.63,"8":0.32,"9":1.66,"10":4.36,"11":0.82,"12":3.0,"13":6
80},{"0":1,"1":13.05,"2":1.77,"3":2.1,"4":17.0,"5":107,"6":3.
0,"7":3.0,"8":0.28,"9":2.03,"10":5.04,"11":0.88,"12":3.35,"13":8
85},{"0":1,"1":14.21,"2":4.04,"3":2.44,"4":18.9,"5":111,"6":2.8
5,"7":2.65,"8":0.3,"9":1.25,"10":5.24,"11":0.87,"12":3.33,"13":1
080},{"0":1,"1":14.38,"2":3.59,"3":2.28,"4":16.0,"5":102,"6":3.2
5,"7":3.17,"8":0.27,"9":2.19,"10":4.9,"11":1.04,"12":3.44,"13":1
065},{"0":1,"1":13.9,"2":1.68,"3":2.12,"4":16.0,"5":101,"6":3.
1,"7":3.39,"8":0.21,"9":2.14,"10":6.1,"11":0.91,"12":3.33,"13":9
85},{"0":1,"1":14.1,"2":2.02,"3":2.4,"4":18.8,"5":103,"6":2.7
5,"7":2.92,"8":0.32,"9":2.38,"10":6.2,"11":1.07,"12":2.75,"13":1
060},{"0":1,"1":13.94,"2":1.73,"3":2.27,"4":17.4,"5":108,"6":2.8
8,"7":3.54,"8":0.32,"9":2.08,"10":8.9,"11":1.12,"12":3.1,"13":12
Out[215]:
     Bank NameBank                      CityCity       StateSt  CertCert  Acquiring InstitutionAI              Closing DateClosing  FundFund
1    Heartland Tri-State Bank           Elkhart        KS       25851     Dream First Bank, N.A.               July 28, 2023        10544
2    First Republic Bank                San Francisco  CA       59017     JPMorgan Chase Bank, N.A.            May 1, 2023          10543
4    Silicon Valley Bank                Santa Clara    CA       24735     First–Citizens Bank & Trust Company  March 10, 2023       10539
563  Superior Bank, FSB                 Hinsdale       IL       32646     Superior Federal, FSB                July 27, 2001        6004
564  Malta National Bank                Malta          OH       6629      North Valley Bank                    May 3, 2001          4648
565  First Alliance Bank & Trust Co.    Manchester     NH       34264     Southern New Hampshire Bank & Trust  February 2, 2001     4647
566  National State Bank of Metropolis  Metropolis     IL       3815      Banterra Bank of Marion              December 14, 2000    4646
In [216]: type(dfs)
Out[216]: list
Out[218]:
     Mobile country code  Country              ISO 3166  Mobile network codes                               National MNC authority  Remarks
1    412                  Afghanistan          AF        List of mobile network codes in Afghanistan        NaN                     NaN
2    276                  Albania              AL        List of mobile network codes in Albania            NaN                     NaN
3    603                  Algeria              DZ        List of mobile network codes in Algeria            NaN                     NaN
247  452                  Vietnam              VN        List of mobile network codes in the Vietnam        NaN                     NaN
248  543                  W Wallis and Futuna  WF        List of mobile network codes in Wallis and Futuna  NaN                     NaN
249  421                  Y Yemen              YE        List of mobile network codes in the Yemen          NaN                     NaN
250  645                  Z Zambia             ZM        List of mobile network codes in Zambia             NaN                     NaN
251  648                  Zimbabwe             ZW        List of mobile network codes in Zimbabwe           NaN                     NaN
In [220]: type(df_excel)
Out[220]: pandas.core.frame.DataFrame
In [221]: df_excel.head()
Out[221]:
   work_year  experience_level  employment_type  job_title                 salary  salary_currency
0  2023       SE                FT               Principal Data Scientist  80000   EUR
1  2023       MI                CT               ML Engineer               30000   USD
2  2023       MI                CT               ML Engineer               25500   USD
3  2023       SE                FT               Data Scientist            175000  USD
4  2023       SE                FT               Data Scientist            120000  USD
Pickling
All pandas objects have a to_pickle method, which uses Python's pickle
module to save data structures to disk in the pickle format.
In [222]: df_excel.to_pickle("pickleFile.pkl")  # .pkl, since the file is a pickle, not an Excel workbook
In [223]: df = pd.read_pickle("pickleFile.pkl")
In [224]: df.head()
Out[224]:
   work_year  experience_level  employment_type  job_title                 salary  salary_currency
0  2023       SE                FT               Principal Data Scientist  80000   EUR
1  2023       MI                CT               ML Engineer               30000   USD
2  2023       MI                CT               ML Engineer               25500   USD
3  2023       SE                FT               Data Scientist            175000  USD
4  2023       SE                FT               Data Scientist            120000  USD
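The pickle round trip can also be checked programmatically. This is a minimal sketch on an invented frame, writing to a temporary directory; the file name frame.pkl is hypothetical. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame(
    {"job_title": ["Data Scientist", "ML Engineer"], "salary": [120000, 30000]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")   # hypothetical file name
    df.to_pickle(path)                      # serialize to disk
    restored = pd.read_pickle(path)         # load it back

# equals() compares values, dtypes, and labels.
print(restored.equals(df))
```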
2024.01.14