Regular expressions#

Regular expressions are an immensely powerful tool built into most modern programming languages. They are a compact formal notation for describing a set of strings, which lets you test whether a given string matches a particular rule. Common uses range from checking whether user input conforms to a desired pattern (e.g., 3 digits, followed by 2 digits, followed by 3 digits) to all sorts of complicated search-and-replace operations, both within text files and on file names (e.g., batch renaming).

There are entire books written on regular expressions as well as comprehensive online references. We’ll only concern ourselves with a few basics here.

Start by looking over the first 10 lessons of this tutorial (they’re very quick), paying special attention to the sidebar on the right, which I reproduce below.

Now go to this snazzy interactive regexp matcher and play around with it to get a feel for the syntax.

Then go through the rest of this notebook and make sure you understand why each regexp works in the way it does.

Syntax       Meaning
abc…         Literal letters
\d           Any digit
\D           Any non-digit character
.            Any single character
\.           Literal period (the backslash is an escape character)
[abc]        Only a, b, or c
[^abc]       Not a, b, nor c
[a-z]        Characters a to z
[A-Z]        Characters A to Z
[0-9]        Numbers 0 to 9
\w           Any alphanumeric character (in Python, also the underscore)
\W           Any non-alphanumeric character
{m}          m repetitions
{m,n}        m to n repetitions
*            Zero or more repetitions
?            Optional character (0 or 1)
+            One or more repetitions
\s           Any whitespace
\S           Any non-whitespace character
^            Start of string (or line for multiline matching)
$            End of string (or line for multiline matching)
(…)          Capture group (for capturing matches and backreference)
(a(bc))      Capture sub-group
(.*)         Capture all
(abc|def)    Matches abc or def
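To see a few of these pieces working together, here is a minimal sketch in Python. The test strings and the 3-2-3 digit format with hyphen separators are made up for illustration; they are not taken from the tutorial.

```python
import re

# \d{3}-\d{2}-\d{3}: three digits, two digits, three digits, separated by
# hyphens (the hyphens are an assumption made for this illustration)
code_pattern = re.compile(r'\d{3}-\d{2}-\d{3}')

# fullmatch succeeds only if the whole string fits the pattern
valid = bool(code_pattern.fullmatch('123-45-678'))
invalid = bool(code_pattern.fullmatch('123-456-78'))

# [^abc] matches any single character other than a, b, or c
not_abc = re.findall(r'[^abc]', 'aabbccxyz')

# \. matches a literal period, while a bare . would match any character
literal_dots = re.findall(r'\.', 'a.b.c')

# ? makes the preceding character optional
optional_s = re.findall(r'cats?', 'cat cats')

print(valid, invalid, not_abc, literal_dots, optional_s)
```

The r prefix marks a raw string, which keeps Python from interpreting backslashes before the regexp engine sees them.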

Use as a filter#

Let’s begin by reading in a file containing a bunch of words from the American National Corpus that have a frequency of at least 9. Here’s a sample of what this file looks like.

word	lemma	pos	freq
the	the	DT	1081168
of	of	IN	539793
and	and	CC	466737
to	to	TO	448519
a	a	DT	406057
in	in	IN	360853
is	be	VBZ	192975

For those unfamiliar with language lingo, English lemmas are basically the word-stems, e.g., the lemma of cars is car; the lemma of walking is walk. pos stands for part of speech.

import re #import the python regexp module
import pandas as pd

data = pd.read_csv('https://psych750.github.io/data/ANC-written-count_over9.txt',encoding = "ISO-8859-1",sep="\t")
words = list(set(data['word']))[1:]
print (f"We have {len(words)} unique words")
We have 48316 unique words

Now let’s use some regular expressions, starting with simple ones and moving on to ever so slightly more complicated ones.

Grab words beginning with q

[curWord for curWord in words if re.findall('^q',curWord)]
['qualitatively',
 'quarterly',
 'quadrupled',
 'quotes',
 'qaida',
 'questionnaires',
 'qtls',
 'quibble',
 'quintana',
 'qin',
 'quilts',
 'quetzalcoatl',
 'qçí',
 'quadratic',
 'quartop',
 'quit',
 'querying',
 'qing',
 'qualities',
 'quartet',
 'qd',
 'quadruple',
 'quoting',
 'quso',
 'quixotic',
 'quiz',
 'quests',
 'queue',
 'qingdao',
 'qtc',
 'queried',
 'quai',
 'quantiles',
 'qassam',
 'quiescent',
 'quiche',
 'quietly',
 'quirky',
 'quarrel',
 'quincy',
 'quell',
 'quarter-century',
 'quadrants',
 'quick',
 'qspline',
 'quintessential',
 'quashed',
 'qaddafi',
 'quintanilla',
 'quiver',
 'queen',
 'quilting',
 'quantum',
 'quipped',
 'q-pna',
 'quill',
 'qualifying',
 'quick-edit',
 'quart',
 'qios',
 'quay',
 'quitting',
 'quagmire',
 'quickly',
 'qe',
 'quarterfinals',
 'quilt',
 'quenched',
 'quantities',
 'quartiles',
 'quantitate',
 'qtl',
 'quantitatively',
 'quran',
 'quarterfinal',
 'quickening',
 'quantitative',
 'quickness',
 'quine',
 'quaint',
 'qianlong',
 'quebec',
 'qualifications',
 'quickest',
 'quantified',
 'q',
 'quandary',
 'qutb',
 'quot',
 'qadir',
 'quote',
 'quarterback',
 'queasy',
 'quarterbacks',
 'qualified',
 'qt',
 'queries',
 'quintile',
 'quark',
 'qualifier',
 'quayle',
 'qeis',
 'qsp',
 'quail',
 'qida',
 'quad',
 'qpak',
 'quo',
 'qualms',
 'questionable',
 'quickie',
 'quartz',
 'quantify',
 'quotations',
 'quinone',
 'quarry',
 'quantity',
 'quindlen',
 'quantifying',
 'qur',
 'questioning',
 'qb',
 'quds',
 'quirks',
 'quarters',
 'quality',
 'qéåüåçäçöó',
 'québécois',
 'qureshi',
 'quadrangle',
 'quotable',
 'quadrant',
 'queues',
 'qu',
 'questions',
 'quickedit',
 'quenk',
 'qualcomm',
 'quasispecies',
 'quotation',
 'quixote',
 'quips',
 'quake',
 'quentin',
 'quid',
 'qa',
 'quaternary',
 'que',
 'quilted',
 'quantitated',
 'quackery',
 'quoted',
 'quantile',
 'qualifiers',
 'question',
 'quip',
 'qui',
 'qiaquick',
 'quite',
 'quasi',
 'quinta',
 'qus',
 'qrs',
 'qrna',
 'quality-of-life',
 'quizzes',
 'quakes',
 'queer',
 'qualitative',
 'québec',
 'quaker',
 'quixtar',
 'quirk',
 'quartops',
 'qua',
 'quack',
 'quintas',
 'quigley',
 'questioned',
 'quiet',
 'qualify',
 'quarreling',
 'quintiles',
 'quotas',
 'queensland',
 'question-and-answer',
 'quest',
 'quantitation',
 'qiagen',
 'quench',
 'qualifies',
 'quotient',
 'qio',
 'questionnaire',
 'quash',
 'quayside',
 'quibbles',
 'quieter',
 'questioner',
 'qaeda',
 'query',
 'quantifiable',
 'qol',
 'quits',
 'quinn',
 'quick-line',
 'quintet',
 'quarter',
 'quickened',
 'qatar',
 'quantification',
 'quarrels',
 'queens',
 'quota',
 'quidditch',
 'quicker',
 'quenching',
 'quecreek',
 'quarries',
 'qualification',
 'quinones',
 'quartets',
 'qwest',
 'quivering',
 'qp',
 'quaid',
 'qrt-pcr',
 'quays',
 'quinean',
 'quartile']
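The comprehension above uses re.findall only for its truthiness: an empty list counts as False and a non-empty list as True. A possibly clearer variant, sketched here on a small made-up word list rather than the full corpus, uses re.search, which returns as soon as it finds a first match.

```python
import re

# A small hypothetical sample standing in for the full word list
sample_words = ['quick', 'aqua', 'queue', 'iraq', 'q']

# re.search returns a match object (truthy) or None (falsy)
q_words = [w for w in sample_words if re.search(r'^q', w)]

# re.findall gives the same filter, but builds a list of matches each time
q_words_findall = [w for w in sample_words if re.findall(r'^q', w)]

print(q_words)
```

Both comprehensions select the same words; the choice between them is mostly a matter of readability.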
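
Grab words containing l, m, n, and o in that order, with at least one character between each of them. (The {1} quantifiers are redundant, since a bare letter already matches exactly once.)

[curWord for curWord in words if re.findall('l{1}.+m{1}.+n{1}.+o{1}',curWord)]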
['self-determination',
 'multimillion-dollar',
 'self-monitoring',
 'flamingo',
 'limoncello',
 'illumination',
 'filamentous',
 'complementation',
 'filamentation',
 'ultramicroextensions',
 'telecommunications',
 'flamingos',
 'supplementation',
 'multidimensional',
 'chloramphenicol',
 'elimination',
 'self-incrimination',
 'bloomington',
 'implementations',
 'flamenco',
 'implementation']

Grab all words that begin with an a and end with an i

[curWord for curWord in words if re.findall(r'^a\w+i$',curWord)]
['armani',
 'avi',
 'alexei',
 'api',
 'accompli',
 'abdelghani',
 'assisi',
 'asci',
 'acini',
 'ami',
 'adlai',
 'arundhati',
 'aulaqi',
 'asahi',
 'ambani',
 'ascii',
 'alibi',
 'arabi',
 'ani',
 'ashkenazi',
 'afghani',
 'agassi',
 'abyssi',
 'amalfi',
 'anti',
 'alveoli',
 'antoni',
 'ajami',
 'agouti',
 'abi',
 'audi',
 'adi',
 'aci',
 'afi',
 'ali',
 'agnelli',
 'andrei',
 'alumni',
 'ari']

Grab all words that begin with an a, followed by 4-6 letters, and end on an i

[curWord for curWord in words if re.findall(r'^a\w{4,6}i$',curWord)]
['armani',
 'alexei',
 'accompli',
 'assisi',
 'aulaqi',
 'ambani',
 'afghani',
 'agassi',
 'abyssi',
 'amalfi',
 'alveoli',
 'antoni',
 'agouti',
 'agnelli',
 'andrei',
 'alumni']

Grab words that start with a b, end on a t, and contain a t somewhere in the middle

[curWord for curWord in words if re.findall(r'^b\w+t\w+t$',curWord)]
['batshit',
 'baptist',
 'blatant',
 'bittersweet',
 'backstreet',
 'butaprost',
 'bartlett',
 'brightest',
 'blastocyst',
 'betterment',
 'bestest',
 'bittorrent',
 'butait',
 'bioterrorist']

Let’s say we want to exclude words that end on two ts. We can require that the character just before the final t is anything other than a t, using [^t].

[curWord for curWord in words if re.findall(r'^b\w+t\w+[^t]t$',curWord)]
['batshit',
 'baptist',
 'blatant',
 'bittersweet',
 'backstreet',
 'butaprost',
 'brightest',
 'blastocyst',
 'betterment',
 'bestest',
 'bittorrent',
 'butait',
 'bioterrorist']
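Inside square brackets, repeating a character has no effect, so [^tt] means the same as [^t]: one character that is not a t. The sketch below checks this on a few words from the output above. It also shows a negative lookbehind as a similar (not strictly identical) way to say "a final t that does not follow another t"; the lookbehind is an addition for illustration and is not used elsewhere in this notebook.

```python
import re

sample_words = ['bartlett', 'baptist', 'blatant', 'butait']

# [^tt] and [^t] are the same character class
with_tt = [w for w in sample_words if re.search(r'^b\w+t\w+[^tt]t$', w)]
with_t = [w for w in sample_words if re.search(r'^b\w+t\w+[^t]t$', w)]

# (?<!t) is a negative lookbehind: the final t must not follow another t
lookbehind = [w for w in sample_words if re.search(r'^b\w+t\w+(?<!t)t$', w)]

print(with_t)
```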

Let’s get all the words containing the vowels a, e, i, o, in that order

[curWord for curWord in words if re.findall(r'\w+a+\w+e+\w+i+\w+o+',curWord)]
['characterization',
 'catheterization',
 'carvedilol',
 'cardiorespiratory',
 'compartmentalization',
 'categorization',
 'characterizations',
 'parameterization',
 'intraperitoneal',
 'chloramphenicol',
 'campesino',
 'intraperitoneally']

You know the saying “i before e, except after c” (in which case it’s e before i, as in receive)? Let’s see how well this mnemonic holds up.

Let’s find out how many words there are that have ie vs. ei in them.

print ("ie words:", len([curWord for curWord in words if re.findall('ie',curWord)]))
print ("ei words:", len([curWord for curWord in words if re.findall('ei',curWord)]))
ie words: 1439
ei words: 483

Now let’s check what happens when we check for a ‘c’ preceding ie/ei

print ("cie words:", len([curWord for curWord in words if re.findall('cie',curWord)]))
print ("cei words:", len([curWord for curWord in words if re.findall('cei',curWord)]))
cie words: 107
cei words: 33

After a c, there are actually more words that violate the mnemonic (cie) than words that obey it (cei)! What are these words?

[curWord for curWord in words if re.findall('cie',curWord)]
['btk-deficient',
 'prescience',
 'malignancies',
 'cross-species',
 'species-specific',
 'contingencies',
 'inconsistencies',
 'omniscient',
 'efficiencies',
 'consciences',
 'redundancies',
 'efficiency',
 'vacancies',
 'conspiracies',
 'efficiently',
 'inadequacies',
 'legacies',
 'pregnancies',
 'sucient',
 'delicacies',
 'glacier',
 'efficacies',
 'newscientist',
 'deficient',
 'bureaucracies',
 'prescient',
 'fancier',
 'coefficient',
 'bankruptcies',
 'scientific',
 'fancies',
 'conscientious',
 'agencies',
 'scientist',
 'pricier',
 'marcie',
 'ancients',
 'immunodeficiency',
 'conscience',
 'lucie',
 'insufficiency',
 'deficiency',
 'self-sufficiency',
 'policies',
 'suycient',
 'science-fiction',
 'concierge',
 'societal',
 'constituencies',
 'inefficiency',
 'inaccuracies',
 'currencies',
 'insufficient',
 'ancient',
 'society',
 'proficiency',
 'science',
 'scientists',
 'science-based',
 'saucier',
 'sufficiently',
 'quasispecies',
 'scientia',
 'neuroscientist',
 'dependencies',
 'biosciences',
 'fancied',
 'tendencies',
 'intricacies',
 'subspecies',
 'scientifically',
 'glaciers',
 'proficient',
 'energy-efficient',
 'species',
 'gracie',
 'scientology',
 'francie',
 'frequencies',
 'emergencies',
 'aberrancies',
 'candidacies',
 'inefficiencies',
 'sciences',
 'tumefaciens',
 'interspecies',
 'conscientiously',
 'insufficiently',
 'hacienda',
 'democracies',
 'financier',
 'coefficients',
 'financiers',
 'deficiencies',
 'prophecies',
 'inefficient',
 'unscientific',
 'competencies',
 'self-sufficient',
 'sufficiency',
 'neuroscience',
 'sufficient',
 'efficient',
 'potencies',
 'societies',
 'pharmacies',
 'discrepancies']
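One way to probe the mnemonic more directly is to tally all four cases with a small helper. This is only a sketch: count_matching is a hypothetical helper name, and the word list is a tiny made-up sample, so the counts say nothing about the corpus itself.

```python
import re

def count_matching(pattern, word_list):
    """Count the words in word_list that contain a match for pattern."""
    compiled = re.compile(pattern)
    return sum(1 for w in word_list if compiled.search(w))

# Tiny illustrative sample, not the corpus used in this notebook
sample_words = ['receive', 'ceiling', 'science', 'species',
                'believe', 'weird', 'friend']

counts = {
    'cie': count_matching(r'cie', sample_words),            # violates "except after c"
    'cei': count_matching(r'cei', sample_words),            # obeys "except after c"
    'ie, no c': count_matching(r'(?<!c)ie', sample_words),  # obeys "i before e"
    'ei, no c': count_matching(r'(?<!c)ei', sample_words),  # violates "i before e"
}
print(counts)
```

The (?<!c) construct is a negative lookbehind: it requires that the ie or ei is not immediately preceded by a c.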

Here’s a tricky one. Let’s find words containing 4 rs (interspersed among other letters). One way to do this is to spell the pattern out explicitly: any characters, r, any characters, r, and so on. Like so:

[curWord for curWord in words if re.findall(r'\w*r+\w*r+\w*r+\w*r+\w*',curWord)]
['counterterrorist',
 'cardiorespiratory',
 'extraterrestrials',
 'refrigerator',
 'extracurricular',
 'refrigerators',
 'grrrr',
 'extraterrestrial',
 'counterterrorism']

There are two shortcomings to this approach. The first is that if we want 3 or 5 matches, we need to explicitly remove or add code rather than changing a single number-of-matches parameter. Another shortcoming is that hyphenated words are excluded. We can add hyphens by replacing \w with [a-z\-], but that makes the expression even longer. Here’s a better solution:

[curWord for curWord in words if re.findall('([^r]*r[^r]*){4}$',curWord)]
['counter-terrorism',
 'counterterrorist',
 'cardiorespiratory',
 'extraterrestrials',
 'refrigerator',
 'reverse-transcribed',
 'reverse-transcription',
 'writer-director',
 'extracurricular',
 'antiretroviral-experienced',
 'refrigerators',
 'grrrr',
 'corporate-reform',
 'extraterrestrial',
 'counterterrorism']

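Let’s unpack that. We are matching a group, which is demarcated by parentheses. The group pattern is: not-an-r (0 or more times), an r, and then not-an-r (0 or more times). The {4} asks for this group exactly 4 times in a row, and the $ ties the last repetition to the end of the word. That gives us the words containing four rs and anything in between them (including nothing, hence grrrr). Strictly speaking, because the pattern is not anchored at the start with ^, a word with more than four rs would also match (from its fourth-to-last r onward); the list above happens to contain no such word.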
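To make the number of repetitions a single parameter, the same idea can be wrapped in a function. This is a sketch under a few assumptions: contains_exactly is a hypothetical helper name, it expects a single character as letter, and it anchors the pattern with ^ so that only words with exactly n occurrences pass.

```python
import re

def contains_exactly(letter, n, word):
    """Return True if word contains letter exactly n times."""
    ch = re.escape(letter)
    # [^x]* skips characters that are not x; (?:...){n} repeats "one x plus
    # trailing non-x characters" n times; ^ and $ anchor the whole word
    pattern = rf'^[^{ch}]*(?:{ch}[^{ch}]*){{{n}}}$'
    return re.search(pattern, word) is not None

sample_words = ['refrigerator', 'grrrr', 'counter-terrorism', 'terror', 'grrrrr']
four_rs = [w for w in sample_words if contains_exactly('r', 4, w)]
print(four_rs)
```

Changing the 4 to a 3 or a 5 is now a one-argument change, and hyphenated words are included because [^r] accepts any character other than r.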

Use in place of conditionals#

Let’s say we want to check whether an entered word is color or the British colour. We can do this with a conditional (if word == "color" or word == "colour"), but we can also use regular expressions (which scale much better than conditionals). For example:

re.findall('colou?r','The British like to colour their colors')
['colour', 'color']

Unlike conditionals, this approach easily scales to, e.g., all cases where a non-initial ‘o’ is followed by either an ‘r’ or a ‘ur’. What regexp would match color/colour, favor/favour, humor/humour, neighbor/neighbour, (but not match or/our)?
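One possible answer, offered as a sketch rather than the only solution: require at least one word character before the o, so that a bare or/our cannot match.

```python
import re

# \b marks a word boundary; \w+ demands at least one character before the o
or_our = re.compile(r'\b\w+ou?r\b')

text = 'color colour favor favour humor humour neighbor neighbour or our'
matches = or_our.findall(text)
print(matches)
```

Note that this pattern is deliberately broad: it will also accept other words that end in or/our, such as door or four.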

Here’s another example: a regexp built from a series of character classes (each one acts as an OR over single characters) that matches “dog” and “cat” and “cag” and “cog” but not “got”. To get the matching string from the match object, use .group(), i.e., variousWords.match('cat').group() will return “cat”.

#inside [...] the alternatives are simply listed; a | there would be matched as a literal pipe character
variousWords = re.compile('[dc][ao][gt]')

print(variousWords.match('cat'))
print(variousWords.match('dog'))
print(variousWords.match('cog'))
print(variousWords.match('cag'))
print(variousWords.match('got'))
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(0, 3), match='dog'>
<re.Match object; span=(0, 3), match='cog'>
<re.Match object; span=(0, 3), match='cag'>
None
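A detail worth checking for yourself: inside square brackets, each character is already an alternative, and a | placed there is treated as a literal pipe character. The sketch below compares a class written with pipes against one without, and against the equivalent spelled-out alternation.

```python
import re

with_pipes = re.compile(r'[d|c][a|o][g|t]')
without_pipes = re.compile(r'[dc][ao][gt]')
alternation = re.compile(r'(?:d|c)(?:a|o)(?:g|t)')

# All three accept the intended words
intended = ['cat', 'dog', 'cog', 'cag']
all_agree = all(
    with_pipes.match(w) and without_pipes.match(w) and alternation.match(w)
    for w in intended
)

# Only the version with pipes inside the brackets accepts a stray '|'
pipe_hit = with_pipes.match('|o|')
pipe_miss = without_pipes.match('|o|')
print(all_agree, pipe_hit, pipe_miss)
```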

Here are some more examples.

import re

#will match any numbers
anyNums = re.compile('[0-9]+')
anyNums.findall('There are 99 bottles of beer on the wall. 999....') #will return all matches
anyNums.search('There are 99 bottles of beer on the wall. 999....').group() #will return just the first occurrence

#two digit numbers from 00 to 59 or 80 to 89 
someNums = re.compile('[0-5][0-9]|[8][0-9]')
matches = [someNums.search(x).group() for x in 'It will match 54, 52, and 88, but not 7 or 92 or any of the letters'.split(' ') if someNums.search(x)]
matches
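Note that a pattern like someNums will also match two digits inside a longer number. If that is not wanted, one option is to wrap the pattern in \b word boundaries. The bounded pattern and the test sentence below are additions for illustration, not part of the original example.

```python
import re

some_nums = re.compile(r'[0-5][0-9]|[8][0-9]')
# (?:...) groups the alternatives without capturing, so \b applies to both
some_nums_bounded = re.compile(r'\b(?:[0-5][0-9]|8[0-9])\b')

sentence = 'It will match 54, 52, and 88, but not 7 or 92 or 154'

unbounded = some_nums.findall(sentence)
bounded = some_nums_bounded.findall(sentence)
print(unbounded)
print(bounded)
```

The unbounded pattern picks up the 15 inside 154, while the bounded one only accepts stand-alone two-digit numbers.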

#We don’t need to compile regular expressions using re.compile, but it speeds things up when using the same rule over a large corpus.

emailRegExGrouped = re.compile(r'([\w.-]+)@([\w.-]+)')
#the parentheses allow us to access groups -- the first group corresponds to the first matched part (before the @), the second group to the domain (e.g., wisc.edu)

emailRegExGrouped.search('g.lupyan@gmail.com').groups()
('g.lupyan', 'gmail.com')

emailRegExGrouped.findall('g.lupyan@gmail.com lupyan@wisc.edu')
#returns [('g.lupyan', 'gmail.com'), ('lupyan', 'wisc.edu')]

#to get all the domains:
[email[1] for email in emailRegExGrouped.findall('g.lupyan@gmail.com lupyan@wisc.edu')]
#returns ['gmail.com', 'wisc.edu']
['gmail.com', 'wisc.edu']

Search and replace#

All good text editors allow you to use regular expressions in search and replace. A simple use case is searching for lines that begin or end with a certain character sequence. To find lines that begin with “ab”, search for ^ab. To find lines that end on ies, search for ies$. If you’re trying to replace a string with some variant of the matched string, you’ll want to use capture groups.

Note

Make sure to enable regular-expression search by clicking on the .* button in the lower-left corner in Sublime Text, or by checking the appropriate box (sometimes labeled “Grep”) in other text editors.

When using regular expressions in search/replace, it becomes useful to use matching groups.

For example, suppose you want to replace the occurrences of the following strings, which occur at the start of each line:

bdSubjCode_130
badSbjCode_131
baSubjCode_132
badSubjCode_133
badubjCode_134
BadSubjCode_135

with

MYSUBJCODE_130
MYSUBJCODE_131
MYSUBJCODE_132
MYSUBJCODE_133
MYSUBJCODE_134
MYSUBJCODE_135

You could manually do search replaces for each one. But if you have a hundred of these, that gets tedious fast and is a recipe for errors.

Here’s a much better solution. Simply search for:

(^\w+_)([0-9]+)

and replace with

MYSUBJCODE_\2

The \2 refers to the second group, i.e., the number
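The same search and replace can be done in Python with re.sub. This is a minimal sketch using three of the lines above; the re.MULTILINE flag is needed so that ^ matches at the start of every line rather than only at the start of the whole string.

```python
import re

lines = """bdSubjCode_130
badSbjCode_131
BadSubjCode_135"""

# Group 1 is the (misspelled) prefix up to the underscore, group 2 the number;
# \2 in the replacement re-inserts the number
fixed = re.sub(r'(^\w+_)([0-9]+)', r'MYSUBJCODE_\2', lines, flags=re.MULTILINE)
print(fixed)
```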

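Here’s another example. Delete all the lines that begin with a run of letters containing ‘ing’ (typically a first word ending in ‘ing’, such as running):

Search: (^\w+ing).*
Replace with: (leave empty)

If you only want lines that consist of a single word ending in ‘ing’, search for ^\w+ing$ instead.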

Now you can do another search and replace, searching for \n+ and replacing with \n, to get rid of the multiple newlines that the first search/replace may have created.
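The two-step cleanup can be sketched in Python as follows. The sample text is made up, and re.MULTILINE is used so that ^ applies to each line.

```python
import re

text = "hello world\nrunning late\nsinging\n\ngoodbye"

# Step 1: empty out lines that begin with a run of word characters containing "ing"
step1 = re.sub(r'(^\w+ing).*', '', text, flags=re.MULTILINE)

# Step 2: collapse runs of newlines into a single newline
step2 = re.sub(r'\n+', '\n', step1)
print(step2)
```

Step 1 leaves empty lines behind where the deleted lines used to be, which is why the second substitution is needed.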

Batch file renaming#

You can use what you’ve learned about regular expressions for manipulating not just actual text, but text used in file names. You can do this in straight-up Python using the os library, or by using GUI programs like NameChanger (Mac) or Bulk Rename (PC). These programs allow you to do batch renaming of files using simple search/replace (e.g., replace _ with -), as well as by using regular expressions for more complex changes!
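As a rough sketch of the pure-Python route, the function below applies a regexp substitution to every name in a folder. batch_rename is a hypothetical helper, the file names are invented, and the demo runs in a temporary directory so that no real files are touched. It does not check whether a new name already exists, so treat it as a starting point rather than a finished tool.

```python
import os
import re
import tempfile

def batch_rename(folder, pattern, replacement):
    """Apply a regexp substitution to each name in folder; return the new names."""
    regex = re.compile(pattern)
    renamed = []
    # sorted(...) takes a snapshot of the listing before anything is renamed
    for old_name in sorted(os.listdir(folder)):
        new_name = regex.sub(replacement, old_name)
        if new_name != old_name:
            os.rename(os.path.join(folder, old_name),
                      os.path.join(folder, new_name))
            renamed.append(new_name)
    return renamed

# Demonstrate on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ['badSubjCode_130.txt', 'bdSubjCode_131.txt', 'notes.md']:
        open(os.path.join(tmp, name), 'w').close()
    new_names = batch_rename(tmp, r'^\w+_([0-9]+)', r'MYSUBJCODE_\1')
    remaining = sorted(os.listdir(tmp))

print(new_names)
print(remaining)
```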