Exporting PDF Data using Python
Last Updated: 10 May, 2020
Sometimes we need to extract data from a PDF, and copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract data from PDFs.
Extracting Text With PDFMiner
PDFMiner is a text extraction tool for PDF documents. You can install it with pip (the actively maintained fork is published on PyPI as pdfminer.six and is imported under the same pdfminer module name):
pip install pdfminer
Let’s start by extracting all the text of a PDF, page by page. This requires the following steps:
- create a resource manager instance.
- create a file-like object via Python’s io module.
- create a converter.
- create a PDF interpreter object that will take our resource manager and converter objects and extract the text.
- open the PDF and loop through each page.
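The second step may be the least familiar one: the converter does not return text, it writes into whatever file-like object it is given, and the text is read back from that object afterwards. Here is a minimal sketch of that pattern using only the standard library. Note that write_page is a hypothetical stand-in for the converter and is not part of PDFMiner.

```python
import io


def write_page(sink, text):
    # Hypothetical stand-in for a converter: it writes into the sink
    # instead of returning a value.
    sink.write(text)


buffer = io.StringIO()          # in-memory, file-like text buffer
write_page(buffer, "Hello, ")
write_page(buffer, "PDF!")
extracted = buffer.getvalue()   # read back everything written so far
buffer.close()

print(extracted)
```

Using an in-memory buffer avoids creating a temporary file on disk just to capture the converter's output.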
Below is the implementation.
PDF file used: GFG.pdf
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # a fresh buffer and converter per page keeps each page's text separate
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()
            yield text

            # close open handles
            converter.close()
            fake_file_handle.close()


def extract_text(pdf_path):
    # print the text of each page, followed by a blank line
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    extract_text('GFG.pdf')
In this example, extract_text_by_page is a generator function that yields the text of one page at a time. The extract_text function loops over it and prints the text of each page, followed by a blank line.
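Because the per-page function is a generator, its output can be fed to any ordinary loop or comprehension, not only to print. As a minimal sketch, the snippet below pairs each page's text with a 1-based page number. Both number_pages and fake_pages are hypothetical names introduced here for illustration. fake_pages stands in for extract_text_by_page('GFG.pdf') so that the snippet runs without a PDF.

```python
def number_pages(pages):
    """Map 1-based page numbers to page text from any iterable of strings."""
    return {number: text for number, text in enumerate(pages, start=1)}


def fake_pages():
    # Hypothetical stand-in for extract_text_by_page(pdf_path)
    yield "First page text"
    yield "Second page text"


numbered = number_pages(fake_pages())
for number, text in numbered.items():
    print(f"Page {number}: {text}")
```

Keeping the page numbers alongside the text makes it easier to export the data in a structured format later.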