Beginners Tutorial For Regular Expressions in Python - Python Learning
https://www.analyticsvidhya.com/blog/2015/06/regularexpressionpython/
11/23/2016
In the last few years, there has been a dramatic shift towards general-purpose programming
languages for data science and machine learning. This was not always the case: a decade back,
this thought would have met a lot of skeptical eyes!
This means that more people and organizations are using tools like Python and JavaScript to solve
their data needs. This is where regular expressions become very useful. Regular expressions are
normally the default way of cleaning and wrangling data in most of these tools. Be it extracting
specific parts of text from web pages, making sense of Twitter data, or preparing your data for text
mining, regular expressions are your best bet for all these tasks.
Given their applicability, it makes sense to know them and use them appropriately.
This article explains this concept using the Python programming language.
Simply put, a regular expression is a sequence of characters mainly used to find and replace
patterns in a string or file. As I mentioned before, they are supported by most programming
languages, such as Python (https://www.analyticsvidhya.com/blog/2014/07/baby-steps-learningpython-data-analysis/), Perl, R (https://www.analyticsvidhya.com/learning-paths-data-sciencebusiness-analytics-business-intelligence-big-data/learning-path-r-data-science/),
Java and many others. So, learning them helps in multiple ways (more on this later).
Regular expressions use two types of characters:
a) Meta characters: As the name suggests, these characters have a special meaning, similar to * in
a wildcard.
b) Literals (like a, b, 1, 2)
In Python, the module re provides support for regular expressions. So you need to import the library
re before you can use regular expressions in Python.
Use this code --> import re
re.match(pattern,string):
This method finds a match only if it occurs at the start of the string. For example, calling match() on the string
'AV Analytics AV' and looking for the pattern 'AV' will match. However, if we look for only 'Analytics',
the pattern will not match. Let's perform it in Python now.
Code
import re
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result)
Output:
<_sre.SRE_Match object at 0x0000000009BE4370>
Above, it shows that a pattern match has been found. To print the matching string, we'll use the method
group() (it returns the matching string). Use r at the start of the pattern string; it designates
a Python raw string.
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.group(0))
Output:
AV
Let's now find 'Analytics' in the given string. Here the string does not start with 'Analytics', so match()
should return no match. Let's see what we get:
Code
result = re.match(r'Analytics', 'AV Analytics Vidhya AV')
print(result)
Output:
None
There are methods like start() and end() to know the start and end positions of the matching pattern in
the string.
Code
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.start())
print(result.end())
Output:
0
2
Above you can see the start and end positions of the matching pattern 'AV' in the string. These positions
are often useful when manipulating the string.
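Because re.match() returns None when the string does not start with the pattern, calling group() directly on the result raises an AttributeError in that case. A minimal sketch of a guard follows; the helper name match_at_start is made up for illustration and is not part of the re module.

```python
import re


def match_at_start(pattern, text):
    """Return (matched_text, start, end), or None if text does not start with pattern.

    Hypothetical helper for illustration; not part of the re module.
    """
    result = re.match(pattern, text)
    if result is None:
        return None
    return result.group(0), result.start(), result.end()


found = match_at_start(r'AV', 'AV Analytics Vidhya AV')
missing = match_at_start(r'Analytics', 'AV Analytics Vidhya AV')
print(found)    # ('AV', 0, 2)
print(missing)  # None
```

Checking for None before touching the match object keeps the same code path working for both matching and non-matching inputs.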
re.search(pattern,string):
It is similar to match(), but it doesn't restrict us to finding matches at the beginning of the string only.
Unlike the previous method, here searching for the pattern 'Analytics' will return a match.
Code
result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print(result.group(0))
Output:
Analytics
Here you can see that the search() method is able to find a pattern at any position in the string, but
it only returns the first occurrence of the search pattern.
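To see the difference side by side, the sketch below runs match() and search() on the same string. The variable names are chosen here for illustration only.

```python
import re

text = 'AV Analytics Vidhya AV'

# match() only looks at the start of the string, so this is None
at_start = re.match(r'Analytics', text)

# search() scans the whole string and returns the first occurrence
anywhere = re.search(r'Analytics', text)

# 'AV' occurs twice, but search() reports only the first one
first_av = re.search(r'AV', text)

print(at_start)          # None
print(anywhere.span())   # (3, 12)
print(first_av.start())  # 0
```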
re.findall(pattern, string):
It helps to get a list of all matching patterns. It has no constraints of searching from the start or end. If
we use the method findall to search for 'AV' in the given string, it will return both occurrences of AV.
While searching a string, I would recommend using re.findall(), since it can work like both
re.search() and re.match().
Code
result = re.findall(r'AV', 'AV Analytics Vidhya AV')
print(result)
Output:
['AV', 'AV']
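One way findall() can stand in for match() is to anchor the pattern with ^. The following is a small sketch of that idea, not the only way to do it. Note that findall() returns plain strings, so positions such as start() and end() are not available from its result.

```python
import re

text = 'AV Analytics Vidhya AV'

all_hits = re.findall(r'AV', text)        # every occurrence
start_only = re.findall(r'^AV', text)     # anchored at the start, like match()
no_hit = re.findall(r'^Analytics', text)  # empty list rather than None

print(all_hits)    # ['AV', 'AV']
print(start_only)  # ['AV']
print(no_hit)      # []
```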
re.split(pattern,string, [maxsplit=0]):
This method splits a string by the occurrences of the given pattern.
Code
result = re.split(r'y', 'Analytics')
print(result)
Output:
['Anal', 'tics']
Above, we have split the string 'Analytics' by 'y'. The method split() has another argument, maxsplit. It
has a default value of zero, in which case it performs as many splits as possible; if we give
maxsplit a non-zero value, at most that many splits are performed. Let's look at the example below:
Code
result = re.split(r'i', 'Analytics Vidhya')
print(result)
Output:
['Analyt', 'cs V', 'dhya']  # It has performed all the splits that can be done by pattern "i".
Code
result = re.split(r'i', 'Analytics Vidhya', maxsplit=1)
print(result)
Output:
['Analyt', 'cs Vidhya']
Here, you can notice that we have fixed maxsplit to 1. As a result, the list has only two values,
whereas the first example has three values.
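As a quick check of how maxsplit behaves, the sketch below tries a few limits on the same string. The dictionary name pieces is only illustrative.

```python
import re

text = 'Analytics Vidhya'

# Map each maxsplit value to the list that re.split() returns for it
pieces = {limit: re.split(r'i', text, maxsplit=limit) for limit in (0, 1, 2)}

for limit in (0, 1, 2):
    print(limit, pieces[limit])
# 0 ['Analyt', 'cs V', 'dhya']
# 1 ['Analyt', 'cs Vidhya']
# 2 ['Analyt', 'cs V', 'dhya']
```

The string contains only two occurrences of 'i', so maxsplit=2 already gives the same result as the default of zero.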
re.sub(pattern,repl,string):
It helps to search for a pattern and replace it with a new substring. If the pattern is not found, the string is
returned unchanged.
Code
result = re.sub(r'India', 'the World', 'AV is largest Analytics community of India')
print(result)
Output:
AV is largest Analytics community of the World
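The sketch below checks the "returned unchanged" behaviour and also shows the optional count argument of re.sub(), which limits how many replacements are made. The count argument is not covered in the text above and is included here as an extra.

```python
import re

sentence = 'AV is largest Analytics community of India'

replaced = re.sub(r'India', 'the World', sentence)
unchanged = re.sub(r'Brazil', 'the World', sentence)  # pattern absent: same string back
first_only = re.sub(r'a', 'A', sentence, count=1)     # replace only the first 'a'

print(replaced)    # AV is largest Analytics community of the World
print(unchanged)   # AV is largest Analytics community of India
print(first_only)  # AV is lArgest Analytics community of India
```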
re.compile(pattern):
We can compile a regular expression pattern into a pattern object, which can be used for pattern
matching. It also helps to search for a pattern again without rewriting it.
Code
import re
pattern = re.compile('AV')
result = pattern.findall('AV Analytics Vidhya AV')
print(result)
result2 = pattern.findall('AV is largest analytics community of India')
print(result2)
Output:
['AV', 'AV']
['AV']
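A compiled pattern object offers the same methods as the module-level functions shown earlier. A short sketch, with variable names chosen for illustration:

```python
import re

av = re.compile(r'AV')
text = 'AV Analytics Vidhya AV'

# The compiled object exposes match, search, findall, split and sub
matched = av.match(text).group(0)
first_span = av.search(text).span()
count = len(av.findall(text))
masked = av.sub('**', text)

print(matched)     # AV
print(first_span)  # (0, 2)
print(count)       # 2
print(masked)      # ** Analytics Vidhya **
```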
Operators     Description
\s            Matches a single whitespace character (space, newline, return, tab, form feed); \S (upper case S) matches any non-whitespace character.
\b            Matches the boundary between a word and a non-word character.
[..]          Matches any single character in the square brackets; [^..] matches any single character not in the brackets.
\             Used to escape characters with a special meaning, like \. to match a period or \+ for a plus sign.
^ and $       Match the start and the end of the string, respectively.
{n,m}         Matches at least n and at most m occurrences of the preceding expression.
a|b           Matches either a or b.
( )           Groups regular expressions and returns the matched text.
\t, \n, \r    Match tab, newline and carriage return.
For more details on the meta characters (, ), | and others, you can refer to this link
(https://docs.python.org/2/library/re.html).
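Several of these operators can be tried out in isolation. The sample strings below are made up for illustration and do not come from the article.

```python
import re

whitespace = re.findall(r'\s', 'a b\tc\n')            # \s: space, tab, newline
in_set = re.findall(r'[abc]', 'cabbage')              # [..]: any of a, b, c
not_in_set = re.findall(r'[^abc]', 'cabbage')         # [^..]: anything else
periods = re.findall(r'\.', 'a.b.c')                  # \. : a literal period
runs = re.findall(r'\d{2,3}', '1 22 333 4444')        # {n,m}: 2 to 3 digits
either = re.findall(r'cat|dog', 'cat, dog, cow')      # a|b: alternation
whole = re.findall(r'\bcat\b', 'cat concat cat.')     # \b: whole word only

print(whitespace)  # [' ', '\t', '\n']
print(in_set)      # ['c', 'a', 'b', 'b', 'a']
print(not_in_set)  # ['g', 'e']
print(periods)     # ['.', '.']
print(runs)        # ['22', '333', '444']
print(either)      # ['cat', 'dog']
print(whole)       # ['cat', 'cat']
```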
Now, let's understand the pattern operators by looking at the examples below.
Problem 1: Return the first word of a given string
import re
result = re.findall(r'.', 'AV is largest Analytics community of India')
print(result)
Output:
['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y',
't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I',
'n', 'd', 'i', 'a']
Above, the spaces are also extracted. To avoid this, use \w instead of . in the pattern.
Code
result = re.findall(r'\w', 'AV is largest Analytics community of India')
print(result)
Output:
['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c',
's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']
result = re.findall(r'\w*', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']
Again, it is returning empty strings where the spaces are, because * matches zero or more occurrences of the pattern to its left.
Now, to remove them, we will go with +.
Code
result = re.findall(r'\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']
result = re.findall(r'^\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['AV']
If we use $ instead of ^, it will return the word from the end of the string. Let's look at it.
Code
result = re.findall(r'\w+$', 'AV is largest Analytics community of India')
print(result)
Output:
['India']
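findall() returns a one-element list here. If a plain string is more convenient, re.search() with the same anchored patterns is one option. A minimal sketch, where first_word and last_word are hypothetical helper names:

```python
import re


def first_word(text):
    """Return the first word of text, or None if text does not start with a word."""
    found = re.search(r'^\w+', text)
    return found.group(0) if found else None


def last_word(text):
    """Return the last word of text, or None if text does not end with a word."""
    found = re.search(r'\w+$', text)
    return found.group(0) if found else None


sentence = 'AV is largest Analytics community of India'
print(first_word(sentence))  # AV
print(last_word(sentence))   # India
print(first_word(''))        # None
```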
Code
result = re.findall(r'\w\w', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']
Solution-2 Extract two consecutive characters available at the start of a word boundary (using \b)
Code
result = re.findall(r'\b\w.', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'la', 'An', 'co', 'of', 'In']
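One detail worth knowing about the pattern \b\w. is that the trailing . accepts any character, including a space. The sketch below uses a made-up sentence containing a one-letter word to show the difference from \b\w\w.

```python
import re

text = 'I am at AV'

# '.' accepts the space after the one-letter word 'I'
with_dot = re.findall(r'\b\w.', text)

# Requiring two word characters skips one-letter words entirely
two_word_chars = re.findall(r'\b\w\w', text)

print(with_dot)        # ['I ', 'am', 'at', 'AV']
print(two_word_chars)  # ['am', 'at', 'AV']
```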
result = re.findall(r'@\w+', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)
Output:
['@gmail', '@test', '@analyticsvidhya', '@rest']
Above, you can see that the ".com", ".in" part is not extracted. To add it, we will go with the code below.
# \. matches a literal period; an unescaped . would match any character
result = re.findall(r'@\w+\.\w+', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)
Output:
['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
result = re.findall(r'@\w+\.(\w+)', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)
Output:
['com', 'in', 'com', 'biz']
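When a pattern contains more than one group, findall() returns a list of tuples, one item per group. The sketch below splits each address into three parts; the pattern is a simplification for these sample addresses and is not a general email validator.

```python
import re

emails = ('abc.test@gmail.com, xyz@test.in, '
          'test.first@analyticsvidhya.com, first.test@rest.biz')

# Three groups: user part, domain name, domain type
parts = re.findall(r'([\w.]+)@(\w+)\.(\w+)', emails)

for user, domain, domain_type in parts:
    print(user, domain, domain_type)
# abc.test gmail com
# xyz test in
# test.first analyticsvidhya com
# first.test rest biz
```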
result = re.findall(r'\d{2}-\d{2}-\d{4}', 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)
Output:
['12-05-2007', '11-11-2011', '12-01-2009']
If you want to extract only the year, parentheses ( ) will again help you.
Code
result = re.findall(r'\d{2}-\d{2}-(\d{4})', 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)
Output:
['2007', '2011', '2009']
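The same grouping idea can capture each part of a date separately. The sketch below assumes the dates are written as dd-mm-yyyy with hyphens; the variable names are illustrative.

```python
import re

records = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# One group each for day, month and year (assumed dd-mm-yyyy layout)
dates = re.findall(r'(\d{2})-(\d{2})-(\d{4})', records)
years = [year for _day, _month, year in dates]

print(dates)  # [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]
print(years)  # ['2007', '2011', '2009']
```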
result = re.findall(r'\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']
result = re.findall(r'[aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']
Above you can see that it has returned 'argest' and 'ommunity' from the middle of words. To drop these
two, we need to use \b for a word boundary.
Solution-3
Code
result = re.findall(r'\b[aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'Analytics', 'of', 'India']
In a similar way, we can extract words that start with a consonant using ^ within the square brackets.
Code
result = re.findall(r'\b[^aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)
Output:
[' is', ' largest', ' Analytics', ' community', ' of', ' India']
Above you can see that it has returned words starting with a space. To drop them from the output, include a space in
the square brackets [].
Code
result = re.findall(r'\b[^aeiouAEIOU ]\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['largest', 'community']
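Instead of listing both upper and lower case vowels, the re.IGNORECASE flag can be passed to findall(). This is an alternative sketch, not the approach used above; it also excludes all whitespace with \s rather than only the space character.

```python
import re

sentence = 'AV is largest Analytics community of India'

# With IGNORECASE, [aeiou] also covers A, E, I, O, U
vowel_words = re.findall(r'\b[aeiou]\w+', sentence, flags=re.IGNORECASE)

# Negated set: neither a vowel (any case) nor whitespace
consonant_words = re.findall(r'\b[^aeiou\s]\w+', sentence, flags=re.IGNORECASE)

print(vowel_words)      # ['AV', 'is', 'Analytics', 'of', 'India']
print(consonant_words)  # ['largest', 'community']
```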
import re
li = ['9999999999', '999999999', '99999x9999']
for val in li:
    # First digit must be 8 or 9, followed by nine more digits
    if re.match(r'[8-9]{1}[0-9]{9}', val) and len(val) == 10:
        print('yes')
    else:
        print('no')
Output:
yes
no
no
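In Python 3.4 and later, re.fullmatch() requires the whole string to match, which removes the need for the separate length check. A minimal sketch, assuming the rule is ten digits with the first digit 8 or 9; the function name is_valid_phone is hypothetical.

```python
import re


def is_valid_phone(value):
    """Return True if value is exactly ten digits and starts with 8 or 9.

    Hypothetical helper; the rule is an assumption taken from the pattern above.
    """
    return re.fullmatch(r'[89][0-9]{9}', value) is not None


candidates = ['9999999999', '999999999', '99999x9999', '99999999990', '7999999999']
verdicts = [is_valid_phone(value) for value in candidates]
print(verdicts)  # [True, False, False, False, False]
```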
import re
line = 'asdf fjdk;afed,fjek,asdf,foo'  # String has multiple delimiters (";", ",", " ").
result = re.split(r'[;,\s]', line)
print(result)
Output:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
We can also use the method re.sub() to replace these multiple delimiters with a single one, a space ' '.
Code
import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result = re.sub(r'[;,\s]', ' ', line)
print(result)
Output:
asdf fjdk afed fjek asdf foo
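If delimiters can appear back to back (for example a comma followed by a space), the single-character class produces empty strings. Adding + treats each run of delimiters as one. The sample line below is a made-up variation of the one above.

```python
import re

messy = 'asdf fjdk; afed, fjek,asdf, foo'

one_at_a_time = re.split(r'[;,\s]', messy)   # empty strings between adjacent delimiters
runs_as_one = re.split(r'[;,\s]+', messy)    # each run of delimiters splits once
normalised = re.sub(r'[;,\s]+', ' ', messy)  # each run replaced by a single space

print(one_at_a_time)  # ['asdf', 'fjdk', '', 'afed', '', 'fjek', 'asdf', '', 'foo']
print(runs_as_one)    # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(normalised)     # asdf fjdk afed fjek asdf foo
```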
<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td>Michael</td> <td>Emily</td></tr>
Solution:
Code
# str is assumed to hold the HTML rows shown above
result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', str)
print(result)
Output:
[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia'), ('Michael', 'Emily')]
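For a version that runs on its own, the sketch below stores two of the rows in a variable named html_rows (a name chosen here to avoid shadowing the built-in str) and uses \s* so the pattern tolerates zero or more whitespace characters between cells.

```python
import re

# Two sample rows; html_rows is an illustrative name, not from the article
html_rows = (
    '<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>\n'
    '<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>'
)

# Skip the numeric index cell, capture the two name cells
pairs = re.findall(r'<td>\w+</td>\s*<td>(\w+)</td>\s*<td>(\w+)</td>', html_rows)
first_column = [first for first, _second in pairs]

print(pairs)         # [('Noah', 'Emma'), ('Liam', 'Olivia')]
print(first_column)  # ['Noah', 'Liam']
```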
You can read an HTML file using the library urllib2 (see the code below). In Python 3, the equivalent function is urllib.request.urlopen.
Code
import urllib2
response = urllib2.urlopen('')
html = response.read()
End Notes
In this article, we discussed regular expressions, the methods of the re module and the meta characters used to form a
regular expression. We also looked at various examples to see their practical uses. Here I
have tried to introduce you to regular expressions and cover the most common methods, which solve
most regular expression problems.
Did you find the article useful? Do let us know your thoughts about this guide in the comments
section below.