
01 Python Text Basics

This document provides an overview of working with text files and regular expressions in Python. It discusses opening text and PDF files, formatting strings, reading and writing to text files, extracting text from PDFs using PyPDF2, and using regular expressions to search for patterns in text. The goals are to understand basic text file handling, regular expressions, and gain practice through an assessment exercise.

Uploaded by

Adriano Vianna
Copyright: © All Rights Reserved

Python Text Basics

Natural Language Processing Bootcamp

● Section Goals
○ Understand how to open normal .txt and .pdf files with basic Python libraries
○ Learn some basic regular expressions
○ Test skills with an assessment exercise.
● Let’s get started!
Working with Text Files
PART ONE

● Let’s go over some basic print formatting (f-string literals).
● We’ll also discuss alignment options with f-string literals.
● Let’s get started!
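
As a rough illustration of the f-string formatting and alignment options mentioned above, here is a minimal sketch. The helper `format_row`, the sample names, and the column widths are assumptions made up for this example; they are not taken from the course notebooks.

```python
def format_row(author, topic, pages):
    """Lay out one table row using f-string alignment codes.

    <  left-aligns, ^ centers, > right-aligns; the number is the column width.
    The widths (10, 12, 6) are arbitrary choices for this sketch.
    """
    return f"{author:<10}{topic:^12}{pages:>6}"


# Basic substitution plus a format spec: .1f rounds to one decimal place.
name = "Ada"        # hypothetical sample value
score = 91.456      # hypothetical sample value
summary = f"{name} scored {score:.1f} points"
print(summary)

# Hypothetical sample data, printed as an aligned table.
library = [
    ("Author", "Topic", "Pages"),
    ("Twain", "Rafting", 601),
    ("Feynman", "Physics", 95),
]
for row in library:
    print(format_row(*row))
```

Each row comes out 28 characters wide, so the columns line up regardless of how long the individual values are (as long as they fit in their column).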
Working with Text Files
PART TWO

● Let’s go over how to read and write to text files with Python!
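
A minimal sketch of writing, appending to, and reading a text file with the built-in `open` function. The file name `example.txt`, the helper `write_then_read`, and the use of a temporary directory are assumptions for this example, chosen so it runs anywhere without leaving files behind.

```python
import tempfile
from pathlib import Path


def write_then_read(path):
    """Write one line, append a second, then read both lines back."""
    # Mode "w" creates the file, or overwrites it if it already exists.
    with open(path, "w", encoding="utf-8") as f:
        f.write("First line\n")

    # Mode "a" appends to the end instead of overwriting.
    with open(path, "a", encoding="utf-8") as f:
        f.write("Second line\n")

    # The default mode is "r" (read); readlines() returns a list of lines.
    with open(path, encoding="utf-8") as f:
        return f.readlines()


# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    lines = write_then_read(Path(tmp) / "example.txt")

print(lines)
```

Using `with` closes the file automatically, even if an error occurs partway through.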
Working with PDF Files

● Often you may need to read in text data from a PDF file.
● We can use the PyPDF2 library to read in text data from a PDF file.
● Keep in mind: NOT ALL PDFS HAVE TEXT THAT CAN BE EXTRACTED!

● Some PDFs are created through scanning, instead of being exported from a text editor like Word.
● These scanned PDFs are more like image files, making it much harder to extract the text.
● Often this requires specialized software!

● The PyPDF2 library is made to extract text from PDF files created directly from a word processor, but keep in mind that not all word processors create PDFs with extractable text!
● To begin, make sure you are using our environment file (or you’ve installed PyPDF2).

● To install PyPDF2, simply open up your command line and directly type:
○ pip install PyPDF2

● Let’s get started!
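
A minimal sketch of reading text from a PDF with PyPDF2. It assumes a recent PyPDF2 release that provides `PdfReader`, `reader.pages`, and `page.extract_text()`; older releases used different names such as `PdfFileReader`, so adjust to whatever your environment file installs. The helper names and the file name `example.pdf` are hypothetical.

```python
def extract_pdf_text(path):
    """Return a list with the extracted text of each page in the PDF at `path`.

    Assumes a PyPDF2 version that exposes PdfReader / extract_text.
    """
    # Imported inside the function so this sketch still loads
    # when PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for scanned (image-only) pages.
    return [page.extract_text() or "" for page in reader.pages]


def has_extractable_text(page_texts):
    """Return True if at least one page produced non-whitespace text."""
    return any(text.strip() for text in page_texts)


# Hypothetical usage, left commented out because it needs a real PDF file:
# pages = extract_pdf_text("example.pdf")
# if not has_extractable_text(pages):
#     print("No text found; this may be a scanned PDF.")
```

The second helper is a simple way to notice the scanned-PDF case described above: if every page comes back empty, the file probably contains images rather than text.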


Regular Expressions

● Imagine you needed to search a string for a term, such as “phone”. You can use the in keyword to do this:
>>> "phone" in "Is the phone here?"
True

● Now imagine you need to find a telephone number, such as “408-555-1234”. You could do the same:
>>> "408-555-1234" in "Her phone is 408-555-1234"
True

● But what if you didn’t know the exact number?
● If all you knew was the format of the number, ###-###-####, you would need regular expressions to search through the document for this pattern.

● Regular expressions allow for pattern searching in a text document.
● The syntax for regular expressions can be very intimidating at first:
○ r'\d{3}-\d{3}-\d{4}'

● The key thing to keep in mind is that every character type has a corresponding pattern code.
● For example, digits have the placeholder pattern code \d
● The backslash tells Python that \d is a special code and not the letter “d”.
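
To make the pattern above concrete, here is a small sketch that searches for a phone number in the ###-###-#### format using Python's built-in `re` module. The sample sentence is the one from the earlier slide; the variable names are just illustrative.

```python
import re

text = "Her phone is 408-555-1234"

# \d matches one digit; {3} and {4} repeat it that many times.
# The r prefix makes this a raw string so the backslashes reach the regex engine intact.
pattern = r"\d{3}-\d{3}-\d{4}"

# re.search returns a match object for the first match, or None if there is none.
match = re.search(pattern, text)

if match:
    print(match.group())  # the matched text: 408-555-1234
    print(match.span())   # (start, end) indexes of the match: (13, 25)
```

Unlike the `in` keyword, this finds the number without knowing its digits in advance; only the format has to be known.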
Regular Expressions
Continued
Python Text Basics
Assessment
Overview
Python Text Basics
Assessment
SOLUTIONS
