Regular expressions
Regular expressions#
Regular expressions are an immensely powerful tool built into most modern programming languages. They are a type of formal grammar that lets you test whether strings match (or fail to match) a particular rule. Common uses range from checking whether user input conforms to a desired pattern (e.g., 3 digits followed by 2 digits, followed by 3 digits) to all sorts of complicated search-and-replace operations, both inside text files and on things like file names.
There are entire books written on regular expressions as well as comprehensive online references. We’ll only concern ourselves with a few basics here.
Start by looking over the first 10 lessons of this tutorial (they’re very quick), paying special attention to the sidebar on the right, which I reproduce below.
Now go to this snazzy interactive regexp matcher and play around with it to get a feel for the syntax.
Then go through the rest of this notebook and make sure you understand why each regexp works in the way it does.
| Syntax | Meaning |
|---|---|
| `abc…` | Literal letters |
| `\d` | Any Digit |
| `\D` | Any Non-digit character |
| `.` | Any single character |
| `\.` | Period (backslash is the escape character) |
| `[abc]` | Only a, b, or c |
| `[^abc]` | Not a, b, nor c |
| `[a-z]` | Characters a to z |
| `[A-Z]` | Characters A to Z |
| `[0-9]` | Numbers 0 to 9 |
| `\w` | Any Alphanumeric character |
| `\W` | Any Non-alphanumeric character |
| `{m}` | m Repetitions |
| `{m,n}` | m to n Repetitions |
| `*` | Zero or more repetitions |
| `?` | Optional character (0 or 1) |
| `+` | One or more repetitions |
| `\s` | Any Whitespace |
| `\S` | Any Non-whitespace character |
| `^` | Start of string (or line for multiline matching) |
| `$` | End of string (or line for multiline matching) |
| `(…)` | Capture Group (for capturing matches and backreference) |
| `(a(bc))` | Capture Sub-group |
| `(.*)` | Capture all |
| `(abc\|def)` | Matches abc or def |
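As a small illustration of how the table entries combine, here is a sketch that checks input against the "3 digits, 2 digits, 3 digits" pattern mentioned in the introduction. The hyphen separators and the name `is_valid_code` are assumptions made for the example.

```python
import re

# Assumed format for illustration: 3 digits, 2 digits, 3 digits, separated by hyphens
code_pattern = re.compile(r'\d{3}-\d{2}-\d{3}')

def is_valid_code(text):
    """Return True only if the whole string fits the pattern."""
    return code_pattern.fullmatch(text) is not None

print(is_valid_code('123-45-678'))   # True
print(is_valid_code('123-45-6789'))  # False: one digit too many
print(is_valid_code('12a-45-678'))   # False: a letter is not a digit
```

Using `fullmatch` rather than `search` means stray characters before or after the pattern cause a rejection.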
Use as a filter#
Let’s begin by reading in a file containing a bunch of words from the American National Corpus that have a frequency of at least 9. Here’s a sample of what this file looks like.
word lemma pos freq
the the DT 1081168
of of IN 539793
and and CC 466737
to to TO 448519
a a DT 406057
in in IN 360853
is be VBZ 192975
For those unfamiliar with language lingo, English lemmas are basically the word-stems, e.g., the lemma of cars is car; the lemma of walking is walk. pos stands for part of speech.
import re  # import Python's regular-expression module
import pandas as pd
data = pd.read_csv('https://psych750.github.io/data/ANC-written-count_over9.txt', encoding="ISO-8859-1", sep="\t")
words = list(set(data['word']))[1:]
print(f"We have {len(words)} unique words")
We have 48316 unique words
Now let’s use some regular expressions, starting with simple ones and moving on to ever so slightly more complicated ones.
Grab words beginning with q
[curWord for curWord in words if re.findall('^q',curWord)]
['qualitatively',
'quarterly',
'quadrupled',
'quotes',
'qaida',
'questionnaires',
'qtls',
'quibble',
'quintana',
'qin',
'quilts',
'quetzalcoatl',
'qçí',
'quadratic',
'quartop',
'quit',
'querying',
'qing',
'qualities',
'quartet',
'qd',
'quadruple',
'quoting',
'quso',
'quixotic',
'quiz',
'quests',
'queue',
'qingdao',
'qtc',
'queried',
'quai',
'quantiles',
'qassam',
'quiescent',
'quiche',
'quietly',
'quirky',
'quarrel',
'quincy',
'quell',
'quarter-century',
'quadrants',
'quick',
'qspline',
'quintessential',
'quashed',
'qaddafi',
'quintanilla',
'quiver',
'queen',
'quilting',
'quantum',
'quipped',
'q-pna',
'quill',
'qualifying',
'quick-edit',
'quart',
'qios',
'quay',
'quitting',
'quagmire',
'quickly',
'qe',
'quarterfinals',
'quilt',
'quenched',
'quantities',
'quartiles',
'quantitate',
'qtl',
'quantitatively',
'quran',
'quarterfinal',
'quickening',
'quantitative',
'quickness',
'quine',
'quaint',
'qianlong',
'quebec',
'qualifications',
'quickest',
'quantified',
'q',
'quandary',
'qutb',
'quot',
'qadir',
'quote',
'quarterback',
'queasy',
'quarterbacks',
'qualified',
'qt',
'queries',
'quintile',
'quark',
'qualifier',
'quayle',
'qeis',
'qsp',
'quail',
'qida',
'quad',
'qpak',
'quo',
'qualms',
'questionable',
'quickie',
'quartz',
'quantify',
'quotations',
'quinone',
'quarry',
'quantity',
'quindlen',
'quantifying',
'qur',
'questioning',
'qb',
'quds',
'quirks',
'quarters',
'quality',
'qéåüåçäçöó',
'québécois',
'qureshi',
'quadrangle',
'quotable',
'quadrant',
'queues',
'qu',
'questions',
'quickedit',
'quenk',
'qualcomm',
'quasispecies',
'quotation',
'quixote',
'quips',
'quake',
'quentin',
'quid',
'qa',
'quaternary',
'que',
'quilted',
'quantitated',
'quackery',
'quoted',
'quantile',
'qualifiers',
'question',
'quip',
'qui',
'qiaquick',
'quite',
'quasi',
'quinta',
'qus',
'qrs',
'qrna',
'quality-of-life',
'quizzes',
'quakes',
'queer',
'qualitative',
'québec',
'quaker',
'quixtar',
'quirk',
'quartops',
'qua',
'quack',
'quintas',
'quigley',
'questioned',
'quiet',
'qualify',
'quarreling',
'quintiles',
'quotas',
'queensland',
'question-and-answer',
'quest',
'quantitation',
'qiagen',
'quench',
'qualifies',
'quotient',
'qio',
'questionnaire',
'quash',
'quayside',
'quibbles',
'quieter',
'questioner',
'qaeda',
'query',
'quantifiable',
'qol',
'quits',
'quinn',
'quick-line',
'quintet',
'quarter',
'quickened',
'qatar',
'quantification',
'quarrels',
'queens',
'quota',
'quidditch',
'quicker',
'quenching',
'quecreek',
'quarries',
'qualification',
'quinones',
'quartets',
'qwest',
'quivering',
'qp',
'quaid',
'qrt-pcr',
'quays',
'quinean',
'quartile']
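The list comprehension above works because `re.findall` returns an empty list, which Python treats as false, when nothing matches. A minimal sketch of the same filter using `re.search` and a precompiled pattern; `sample_words` is a made-up list, not the corpus.

```python
import re

sample_words = ['quick', 'aqua', 'queue', 'iraq', 'q']  # made-up stand-in for the corpus

starts_with_q = re.compile(r'^q')

# re.search returns a match object (truthy) or None (falsy)
q_words = [w for w in sample_words if starts_with_q.search(w)]
print(q_words)  # ['quick', 'queue', 'q']
```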
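Grab words containing l, m, n, and o in that order, with at least one character between them (a {1} after each letter would be redundant, since a letter on its own already means exactly one)
[curWord for curWord in words if re.findall('l.+m.+n.+o',curWord)]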
['self-determination',
'multimillion-dollar',
'self-monitoring',
'flamingo',
'limoncello',
'illumination',
'filamentous',
'complementation',
'filamentation',
'ultramicroextensions',
'telecommunications',
'flamingos',
'supplementation',
'multidimensional',
'chloramphenicol',
'elimination',
'self-incrimination',
'bloomington',
'implementations',
'flamenco',
'implementation']
Grab all words that begin with an a and end with an i
[curWord for curWord in words if re.findall(r'^a\w+i$',curWord)]
['armani',
'avi',
'alexei',
'api',
'accompli',
'abdelghani',
'assisi',
'asci',
'acini',
'ami',
'adlai',
'arundhati',
'aulaqi',
'asahi',
'ambani',
'ascii',
'alibi',
'arabi',
'ani',
'ashkenazi',
'afghani',
'agassi',
'abyssi',
'amalfi',
'anti',
'alveoli',
'antoni',
'ajami',
'agouti',
'abi',
'audi',
'adi',
'aci',
'afi',
'ali',
'agnelli',
'andrei',
'alumni',
'ari']
Grab all words that begin with an a, followed by 4-6 letters, and end on an i
[curWord for curWord in words if re.findall(r'^a\w{4,6}i$',curWord)]
['armani',
'alexei',
'accompli',
'assisi',
'aulaqi',
'ambani',
'afghani',
'agassi',
'abyssi',
'amalfi',
'alveoli',
'antoni',
'agouti',
'agnelli',
'andrei',
'alumni']
Grab words that start with a b, end on a t, and contain a t somewhere in the middle
[curWord for curWord in words if re.findall(r'^b\w+t\w+t$',curWord)]
['batshit',
'baptist',
'blatant',
'bittersweet',
'backstreet',
'butaprost',
'bartlett',
'brightest',
'blastocyst',
'betterment',
'bestest',
'bittorrent',
'butait',
'bioterrorist']
Let’s say we want to exclude words that end on two ts. We can require that the character just before the final t is anything but a t.
[curWord for curWord in words if re.findall(r'^b\w+t\w+[^t]t$',curWord)]
['batshit',
'baptist',
'blatant',
'bittersweet',
'backstreet',
'butaprost',
'brightest',
'blastocyst',
'betterment',
'bestest',
'bittorrent',
'butait',
'bioterrorist']
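A character class always matches exactly one character, so repeating a letter inside it changes nothing: `[^tt]` behaves like `[^t]`. The sketch below checks this on a few words, and shows a negative lookbehind as one alternative way to say "does not end in tt". The lookbehind version and the `candidates` list are illustrations only, not part of the walkthrough above.

```python
import re

candidates = ['bartlett', 'baptist', 'blatant', 'bat']

with_tt_class = [w for w in candidates if re.search(r'^b\w+t\w+[^tt]t$', w)]
with_t_class = [w for w in candidates if re.search(r'^b\w+t\w+[^t]t$', w)]
print(with_tt_class == with_t_class)  # True

# Alternative: (?<!tt) asserts that the two characters before the end are not "tt"
no_double_t = [w for w in candidates if re.search(r'^b\w+t\w+t$(?<!tt)', w)]
print(no_double_t)  # ['baptist', 'blatant']
```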
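Let’s get all the words containing the vowels a, e, i, o, in that order (with at least one character before the a and between each pair of these vowels; other vowels may still appear along the way)
[curWord for curWord in words if re.findall(r'\w+a+\w+e+\w+i+\w+o+',curWord)]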
['characterization',
'catheterization',
'carvedilol',
'cardiorespiratory',
'compartmentalization',
'categorization',
'characterizations',
'parameterization',
'intraperitoneal',
'chloramphenicol',
'campesino',
'intraperitoneally']
You know the saying “i before e except after c” (after c it’s e before i, as in receive). Let’s see how well this mnemonic holds up.
Let’s find out how many words there are that have ie vs. ei in them.
print ("ie words:", len([curWord for curWord in words if re.findall('ie',curWord)]))
print ("ei words:", len([curWord for curWord in words if re.findall('ei',curWord)]))
ie words: 1439
ei words: 483
Now let’s check what happens when we check for a ‘c’ preceding ie/ei
print ("cie words:", len([curWord for curWord in words if re.findall('cie',curWord)]))
print ("cei words:", len([curWord for curWord in words if re.findall('cei',curWord)]))
cie words: 107
cei words: 33
There are actually more words that violate the mnemonic than those that obey it! What are these words?
[curWord for curWord in words if re.findall('cie',curWord)]
['btk-deficient',
'prescience',
'malignancies',
'cross-species',
'species-specific',
'contingencies',
'inconsistencies',
'omniscient',
'efficiencies',
'consciences',
'redundancies',
'efficiency',
'vacancies',
'conspiracies',
'efficiently',
'inadequacies',
'legacies',
'pregnancies',
'sucient',
'delicacies',
'glacier',
'efficacies',
'newscientist',
'deficient',
'bureaucracies',
'prescient',
'fancier',
'coefficient',
'bankruptcies',
'scientific',
'fancies',
'conscientious',
'agencies',
'scientist',
'pricier',
'marcie',
'ancients',
'immunodeficiency',
'conscience',
'lucie',
'insufficiency',
'deficiency',
'self-sufficiency',
'policies',
'suycient',
'science-fiction',
'concierge',
'societal',
'constituencies',
'inefficiency',
'inaccuracies',
'currencies',
'insufficient',
'ancient',
'society',
'proficiency',
'science',
'scientists',
'science-based',
'saucier',
'sufficiently',
'quasispecies',
'scientia',
'neuroscientist',
'dependencies',
'biosciences',
'fancied',
'tendencies',
'intricacies',
'subspecies',
'scientifically',
'glaciers',
'proficient',
'energy-efficient',
'species',
'gracie',
'scientology',
'francie',
'frequencies',
'emergencies',
'aberrancies',
'candidacies',
'inefficiencies',
'sciences',
'tumefaciens',
'interspecies',
'conscientiously',
'insufficiently',
'hacienda',
'democracies',
'financier',
'coefficients',
'financiers',
'deficiencies',
'prophecies',
'inefficient',
'unscientific',
'competencies',
'self-sufficient',
'sufficiency',
'neuroscience',
'sufficient',
'efficient',
'potencies',
'societies',
'pharmacies',
'discrepancies']
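To test the rule more directly, one can count each spelling separately after c and not after c. The sketch below does this with lookbehind assertions on a handful of hand-picked words; `sample_words` and `count_matching` are hypothetical names, and the counts say nothing about the full corpus.

```python
import re

sample_words = ['receive', 'ceiling', 'science', 'species', 'believe', 'friend', 'weird']

def count_matching(pattern, word_list):
    """Count the words in which the pattern occurs at least once."""
    return sum(1 for w in word_list if re.search(pattern, w))

print("ie after c:    ", count_matching(r'cie', sample_words))       # 2
print("ei after c:    ", count_matching(r'cei', sample_words))       # 2
print("ie not after c:", count_matching(r'(?<!c)ie', sample_words))  # 2
print("ei not after c:", count_matching(r'(?<!c)ei', sample_words))  # 1
```

Running the same four counts over `words` would show how each half of the mnemonic fares on its own.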
Here’s a tricky one. Let’s find words containing 4 rs (interspersed among other letters). One way to do this is to spell it out explicitly: any characters, r, any characters, r, and so on. Like so:
[curWord for curWord in words if re.findall(r'\w*r+\w*r+\w*r+\w*r+\w*',curWord)]
['counterterrorist',
'cardiorespiratory',
'extraterrestrials',
'refrigerator',
'extracurricular',
'refrigerators',
'grrrr',
'extraterrestrial',
'counterterrorism']
There are two shortcomings to this approach. The first is that if we want 3 or 5 matches, we need to explicitly remove or add code rather than changing a single number-of-matches parameter. Another shortcoming is that hyphenated words are excluded. We can add hyphens by replacing `\w` with `[a-z\-]`, but that makes the expression even longer. Here’s a better solution:
[curWord for curWord in words if re.findall('([^r]*r[^r]*){4}$',curWord)]
['counter-terrorism',
'counterterrorist',
'cardiorespiratory',
'extraterrestrials',
'refrigerator',
'reverse-transcribed',
'reverse-transcription',
'writer-director',
'extracurricular',
'antiretroviral-experienced',
'refrigerators',
'grrrr',
'corporate-reform',
'extraterrestrial',
'counterterrorism']
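Let’s unpack that. We are matching a group, which is demarcated by parentheses. The group pattern is: not-an-r (0 or more times), an r, and then not-an-r (0 or more times). The {4} requires this group to repeat exactly 4 times, and the $ requires the last repetition to run to the end of the word. That gives us the words containing four rs and anything in between them (including nothing, hence grrrr). Strictly speaking, because the pattern is not anchored at the start, a word with more than four rs would also match, since its last four rs satisfy the pattern; prefix the pattern with ^ to require exactly four.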
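Because the repeat count is now a single number, the pattern is easy to build from parameters. A minimal sketch, with `contains_exactly` as a hypothetical helper name; it anchors the pattern with `^` as well as `$`, so words with more than n occurrences are rejected.

```python
import re

def contains_exactly(letter, n, word):
    """True if `word` contains `letter` exactly n times, with anything else in between."""
    c = re.escape(letter)
    # e.g. for letter='r', n=4 this builds ^[^r]*(r[^r]*){4}$
    pattern = rf'^[^{c}]*({c}[^{c}]*){{{n}}}$'
    return re.search(pattern, word) is not None

print(contains_exactly('r', 4, 'refrigerator'))  # True
print(contains_exactly('r', 4, 'grrrr'))         # True
print(contains_exactly('r', 4, 'rrrrr'))         # False: five rs

# The unanchored pattern accepts five rs, because the last four satisfy it
print(bool(re.search(r'([^r]*r[^r]*){4}$', 'rrrrr')))  # True
```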
Use in place of conditionals#
Let’s say we want to check whether an entered word is color or the British colour. We can do this with a conditional (`if word == "color" or word == "colour"`), but we can also use regular expressions, which scale much better than conditionals. For example:
re.findall('colou?r','The British like to colour their colors')
['colour', 'color']
Unlike conditionals, this approach easily scales to, e.g., all cases where a non-initial ‘o’ is followed by either an ‘r’ or a ‘ur’. What regexp would match color/colour, favor/favour, humor/humour, neighbor/neighbour, (but not match or/our)?
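One possible answer to the question above, offered as a sketch rather than the only solution: require at least one word character before the o, and use word boundaries so that whole words are matched.

```python
import re

# \w+ forces at least one character before the "o", which rules out "or" and "our"
or_our = re.compile(r'\b\w+ou?r\b')

text = 'color colour favor favour humor humour neighbor neighbour or our'
matches = or_our.findall(text)
print(matches)
# ['color', 'colour', 'favor', 'favour', 'humor', 'humour', 'neighbor', 'neighbour']
```

This pattern also matches words such as door or four, which do fit the description of a non-initial o followed by r or ur.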
Here’s another example: a regexp built from a series of character classes, each of which acts as a disjunction (an OR) over single characters. It matches “dog”, “cat”, “cag”, and “cog”, but not “got”. Note that no | is needed inside square brackets; there it would be treated as a literal character. To get the matching string from the match object, use `.group()`, i.e., `variousWords.match('cat').group()` will return “cat”.
variousWords = re.compile('[dc][ao][gt]')
print(variousWords.match('cat'))
print(variousWords.match('dog'))
print(variousWords.match('cog'))
print(variousWords.match('cag'))
print(variousWords.match('got'))
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(0, 3), match='dog'>
<re.Match object; span=(0, 3), match='cog'>
<re.Match object; span=(0, 3), match='cag'>
None
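Inside square brackets every character is already an alternative, so a `|` placed there is a literal pipe rather than an OR. The sketch below shows the difference, and uses `fullmatch` to require that the entire string fit the pattern; both pattern names are made up for the illustration.

```python
import re

with_pipes = re.compile(r'[d|c][a|o][g|t]')   # the pipes become members of each class
without_pipes = re.compile(r'[dc][ao][gt]')   # same letters, no pipes

print(bool(with_pipes.match('|a|')))          # True: "|" is accepted as a character
print(bool(without_pipes.match('|a|')))       # False
print(bool(without_pipes.match('cats')))      # True: match only anchors at the start
print(bool(without_pipes.fullmatch('cats')))  # False: the whole string must fit
print(without_pipes.match('cat').group())     # cat
```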
Here are some more examples.
import re
#will match any numbers
anyNums = re.compile('[0-9]+')
anyNums.findall('There are 99 bottles of beer on the wall. 999....') #will return all matches
anyNums.search('There are 99 bottles of beer on the wall. 999....').group() #will return just the first occurrence
#two digit numbers from 00 to 59 or 80 to 89
someNums = re.compile('[0-5][0-9]|[8][0-9]')
matches = [someNums.search(x).group() for x in 'It will match 54, 52, and 88, but not 7 or 92 or any of the letters'.split(' ') if someNums.search(x)]
matches
#We don’t need to compile regular expressions using re.compile, but it can speed things up when using the same rule over a large corpus.
emailRegExGrouped = re.compile(r'([\w.-]+)@([\w.-]+)')
#the parentheses allow us to access groups -- the first group corresponds to the first matched part (before the @), the second group to the domain (e.g., wisc.edu)
emailRegExGrouped.search('g.lupyan@gmail.com').groups()
('g.lupyan', 'gmail.com')
emailRegExGrouped.findall('g.lupyan@gmail.com lupyan@wisc.edu')
#returns [('g.lupyan', 'gmail.com'), ('lupyan', 'wisc.edu')]
#to get all the domains:
[email[1] for email in emailRegExGrouped.findall('g.lupyan@gmail.com lupyan@wisc.edu')]
#returns ['gmail.com', 'wisc.edu']
['gmail.com', 'wisc.edu']
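Groups can also be given names, which makes the extraction code easier to read than numeric indexes. A sketch using the same simple email pattern; the group names `user` and `domain` are chosen for the example, and the pattern is far too permissive to validate real addresses.

```python
import re

email_named = re.compile(r'(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)')

text = 'g.lupyan@gmail.com lupyan@wisc.edu'

# finditer yields one match object per email found
domains = [m.group('domain') for m in email_named.finditer(text)]
print(domains)  # ['gmail.com', 'wisc.edu']

first = email_named.search(text)
print(first.groupdict())  # {'user': 'g.lupyan', 'domain': 'gmail.com'}
```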
Search and replace#
All good text editors allow you to use regular expressions in search and replace.
A simple use case is searching for lines that begin or end with a certain character sequence. To find lines that begin with “ab”, search for `^ab`. To find lines that end in “ies”, search for `ies$`. If you’re trying to replace a string with some variant of the matched string, you’ll want to use capture groups.
Note
Make sure to enable regular-expression search by clicking the .* button in the lower-left corner in Sublime Text, or by checking the appropriate box (sometimes labeled “Grep”) in other text editors.
When using regular expressions in search/replace, it becomes useful to use matching groups.
For example, suppose you want to replace the occurrences of the following strings, which occur at the start of each line:
bdSubjCode_130
badSbjCode_131
baSubjCode_132
badSubjCode_133
badubjCode_134
BadSubjCode_135
with
MYSUBJCODE_130
MYSUBJCODE_131
MYSUBJCODE_132
MYSUBJCODE_133
MYSUBJCODE_134
MYSUBJCODE_135
You could manually do search replaces for each one. But if you have a hundred of these, that gets tedious fast and is a recipe for errors.
Here’s a much better solution. Simply search for:
(^\w+_)([0-9]+)
and replace with
MYSUBJCODE_\2
The \2 refers to the second group, i.e., the number. The first group (the misspelled prefix) is matched but simply left out of the replacement.
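The same replacement can be done in Python with `re.sub`. A minimal sketch, assuming the lines live in a single string called `raw` (a made-up name); `re.MULTILINE` makes `^` match at the start of every line, and `\g<2>` is Python's unambiguous way of writing `\2` in a replacement.

```python
import re

raw = """bdSubjCode_130
badSbjCode_131
BadSubjCode_135"""

# Group 1 is the misspelled prefix, group 2 is the number we want to keep
fixed = re.sub(r'(^\w+_)([0-9]+)', r'MYSUBJCODE_\g<2>', raw, flags=re.MULTILINE)
print(fixed)
# MYSUBJCODE_130
# MYSUBJCODE_131
# MYSUBJCODE_135
```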
Here’s another example. Delete all the lines that start with a run of letters containing ‘ing’ (e.g., lines beginning with walking), along with the rest of the line:
Search:
(^\w+ing).*
Replace with: nothing
Now you can do another search and replace, searching for
\n+
and replacing with
\n
to get rid of the multiple newlines that the first search/replace may have created.
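The two-step cleanup can be sketched in Python as follows. The sample `text` is made up, and the variable names are only for illustration.

```python
import re

text = "the start\nwalking home\nthe end\nsinging loudly\nlast line"

# Step 1: blank out lines that begin with word characters followed by "ing"
step1 = re.sub(r'(^\w+ing).*', '', text, flags=re.MULTILINE)

# Step 2: collapse the runs of newlines left behind into single newlines
step2 = re.sub(r'\n+', '\n', step1)
print(step2)
# the start
# the end
# last line
```

If the very first line is one of the deleted ones, a single leading newline remains; calling `.lstrip('\n')` on the result would remove it.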
Batch file renaming#
You can use what you’ve learned about regular expressions to manipulate not just actual text, but also the text used in file names. You can do this in straight-up Python using the `os` library, or by using GUI programs like NameChanger (Mac) or Bulk Rename (PC). These programs allow you to do batch renaming of files using simple search/replace (e.g., replace `_` with `-`) as well as by using regular expressions for more complex changes!
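A minimal sketch of batch renaming in Python. It uses `pathlib` rather than `os` directly, and works inside a temporary directory with made-up file names so that nothing real is touched; adapt the folder and pattern with care before running it on actual files.

```python
import re
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create a few hypothetical files to rename
    for name in ['subj_01_run_1.txt', 'subj_02_run_1.txt', 'notes.md']:
        (folder / name).touch()

    for path in sorted(folder.iterdir()):
        # Replace every underscore in .txt file names with a hyphen
        if path.suffix == '.txt':
            new_name = re.sub(r'_', '-', path.name)
            path.rename(path.with_name(new_name))

    renamed = sorted(p.name for p in folder.iterdir())
    print(renamed)  # ['notes.md', 'subj-01-run-1.txt', 'subj-02-run-1.txt']
```

Swapping the `re.sub` pattern for one with capture groups allows more complex changes, such as reordering parts of a name.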