Python Pandas - Working with HTML Data



The Pandas library provides extensive functionality for handling data in various formats. One such format is HTML (HyperText Markup Language), the standard markup language for structuring web content. HTML documents often contain tabular data, which Pandas can extract and analyze.

An HTML table represents tabular data in rows and columns within a webpage. The pandas.read_html() function extracts this data from an HTML document, and the DataFrame.to_html() method writes a Pandas DataFrame back out as an HTML table.

In this tutorial, we will learn how to work with HTML data using Pandas, including reading HTML tables and writing Pandas DataFrames to HTML tables.

Reading HTML Tables from a URL

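The pandas.read_html() function reads tables from HTML files, file-like objects (such as a string wrapped in io.StringIO), or URLs. It parses the <table> elements in the HTML and returns a list of pandas.DataFrame objects, one per table found. The function needs an HTML parsing library to be installed, such as lxml, or BeautifulSoup4 together with html5lib.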

Example

Here is a basic example of reading tables from a URL using the pandas.read_html() function.

import pandas as pd

# Read HTML tables from a URL
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url)

# Access the first table from the URL
df = tables[0]

# Display the resultant DataFrame
print('Output First DataFrame:')
print(df.head())

Following is the output of the above code −

Output First DataFrame:
   ID      NAME  AGE    ADDRESS  SALARY
0   1    Ramesh   32  Ahmedabad  2000.0
1   2    Khilan   25      Delhi  1500.0
2   3   Kaushik   23       Kota  2000.0
3   4  Chaitali   25     Mumbai  6500.0
4   5    Hardik   27     Bhopal  8500.0
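
Because read_html() always returns a list, even when the page holds a single table, it can help to check how many tables were found before indexing into the result. The following is a minimal offline sketch of that check. The HTML fragment, its headings, and the Orders table are made up for illustration and are not taken from the URL above.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment containing two separate <table> elements
html_doc = """
<h2>Customers</h2>
<table>
  <tr><th>ID</th><th>NAME</th></tr>
  <tr><td>1</td><td>Ramesh</td></tr>
  <tr><td>2</td><td>Khilan</td></tr>
</table>
<h2>Orders</h2>
<table>
  <tr><th>OID</th><th>AMOUNT</th></tr>
  <tr><td>102</td><td>3000</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html_doc))

# Inspect each table's position, shape, and column names before choosing one
print("Tables found:", len(tables))
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))
```

Printing the shape and column names of each entry is usually enough to decide which list position holds the table of interest.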

Reading HTML Data from a String

HTML data held in a string can be read directly by wrapping it in Python's io.StringIO class, which presents the string as a file-like object.

Example

The following example demonstrates how to read an HTML string using StringIO without saving it to a file.

import pandas as pd
from io import StringIO

# Create an HTML string
html_str = """
<table>
<tr><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])

Following is the output of the above code −


  C1 C2 C3
0  a  b  c
1  x  y  z

Example

This is an alternative way of reading the HTML string without using io.StringIO. Here we save the HTML string to a file named temp.html and read it using the pandas.read_html() function.

import pandas as pd

# Create an HTML string
html_str = """
<table>
<tr><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Save to a temporary file and read
with open("temp.html", "w") as f:
    f.write(html_str)

df = pd.read_html("temp.html")[0]
print(df)

Following is the output of the above code −


  C1 C2 C3
0  a  b  c
1  x  y  z
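
The file written in the example above stays on disk after the script finishes. If that is not wanted, one possible variant is to create the file inside a directory managed by Python's tempfile module, which is removed automatically. This is only a sketch; the file name table.html is chosen here for illustration.

```python
import os
import tempfile

import pandas as pd

html_str = """
<table>
  <tr><th>C1</th><th>C2</th><th>C3</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# The directory and everything in it are deleted when the with block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration
    html_path = os.path.join(tmp_dir, "table.html")
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(html_str)

    # read_html() accepts a file path as well as a URL or file-like object
    df = pd.read_html(html_path)[0]

print(df)
print("File still exists:", os.path.exists(html_path))
```

The DataFrame remains usable after the block exits because the table has already been loaded into memory.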

Handling Multiple Tables from an HTML File

When an HTML file contains multiple tables, the match parameter of the pandas.read_html() function can be used to read only the tables that contain specific text. The parameter accepts a string or a regular expression.

Example

The following example uses the match parameter to read a table containing specific text from an HTML page that has multiple tables.

import pandas as pd

# Read tables from a SQL tutorial
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')

# Access the table
df = tables[0]
print(df.head())

Following is the output of the above code −


     Field           Type  Null  Key  Default  Extra
1       ID        int(11)    NO  PRI      NaN    NaN
2     NAME    varchar(20)    NO  NaN      NaN    NaN
3      AGE        int(11)    NO  NaN      NaN    NaN
4  ADDRESS       char(25)   YES  NaN      NaN    NaN
5   SALARY  decimal(18,2)   YES  NaN      NaN    NaN
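
The effect of match is easier to see without a network connection. The sketch below assumes a made-up HTML string with two tables, of which only the second contains the text "Field". The table contents are illustrative and are not copied from the page used above.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; only the second mentions "Field"
html_doc = """
<table>
  <tr><th>ID</th><th>NAME</th></tr>
  <tr><td>1</td><td>Ramesh</td></tr>
</table>
<table>
  <tr><th>Field</th><th>Type</th></tr>
  <tr><td>ID</td><td>int(11)</td></tr>
  <tr><td>NAME</td><td>varchar(20)</td></tr>
</table>
"""

# Without match, every table in the document is returned
all_tables = pd.read_html(StringIO(html_doc))

# With match, only tables containing text that matches the pattern are kept
schema_tables = pd.read_html(StringIO(html_doc), match="Field")

print("All tables:", len(all_tables))
print("Matching tables:", len(schema_tables))
print(schema_tables[0])
```

Note that the result is still a list, so the matching table is accessed with an index even when only one table matches.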

Writing DataFrames to HTML

Pandas DataFrame objects can be converted to HTML tables using the DataFrame.to_html() method. If the buf parameter is None (the default), the method returns the HTML as a string. Otherwise it writes the HTML to the given buffer or file path and returns None.

Example

The following example demonstrates how to write a Pandas DataFrame to an HTML Table using the DataFrame.to_html() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to HTML table
html = df.to_html()

# Display the HTML string
print(html)

Following is the output of the above code −

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1</th>
      <td>3</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
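
The behaviour of the buf parameter, and the fact that the generated markup can be read back with pandas.read_html(), can be sketched as follows. The index=False argument is an optional extra used here to leave the row labels out of the table. The variable names buffer, result, and restored are chosen for this illustration.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# With buf left as None, to_html() returns the markup as a string;
# index=False omits the row labels from the generated table
html = df.to_html(index=False)

# With a buffer supplied, the markup is written to it and None is returned
buffer = StringIO()
result = df.to_html(buf=buffer, index=False)
print("Return value when buf is given:", result)

# Parse the markup back into a DataFrame to check the round trip
restored = pd.read_html(StringIO(html))[0]
print(restored)
```

Passing a file path as buf instead of a StringIO object writes the table to that file in the same way.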