Exercise 9 - Practice with regular expressions#

Accept the exercise

We’re using python’s re package for regular expressions (regexps), but regexp syntax is very similar across languges, so what you learn here will transfer to, for example, regular expressions in R

Part 1: Word matching#

Write a regexp to match (and return) all words in words that begin and end with a vowel. To read in the word list, use the code below.

import pandas as pd
data = pd.read_csv('https://psych750.github.io/data/ANC-written-count_over9.txt',encoding = "ISO-8859-1",sep="\t")
words = list(set(data['word']))

Bonus (.5 pt):

The regexp notebook includes an example of finding words that contain the vowels aeiou, in that order. A more difficult version is to find words that contain all the letters in any order. Write a regexp to find words that contain all the vowels, but in any order.

Note

The solution should be a single regexp. There should be no loops or if statements in your code

Part 2: Match lmno words#

Write a regexp to match (and return) all words in words that contain the letters l, m, n, and o – in that order, but not necessarily consecutively. flamingo should match, but so should self-monitoring because even though the letters are disrupted by a dash, they are still all there and it’s a single (compound) word.

Note

The solution should be a single regexp. There should be no loops or if statements in your code

Part 3 - Fix weird spacing#

Write a function fix_spaces() that takes a string (of arbitrary length) as an argument and returns the string with weird spacing issues all fixed up. Some examples:

fix_spaces("The  quick    brown fox") --> "The quick brown fox" <– multiple spaces to single spaces

fix_spaces("Once\t\tupon a time\n\n") --> "Once upon a time\n" <– tabs should be changed to (single) spaces. Multiple newlines to a single newline

fix_spaces("This sentence has a misplaced space before the period . Beginning of next sentence.") --> "This sentence has a misplaced space before the period. Beginning of next sentence"

fix_spaces("Version    3.2") --> "Version 3.2" <– don’t get tripped up by non alphanumeric characters like the dot in 3.2

Note

Solutions involving several regexps are acceptable, but there should be no loops or if statements in your code

Tip

Use re.sub()

Part 4: Match phone numbers#

Write regexp that matches phone numbers. It should be able to match phone numbers in the following formats:

(123)121-1234

(123) 121-1234

123-121-1234

123-456-7890 ext. [some 3-5 digit number]

123-456-7890 ext [some 3-5 digit number]

Part 5: Email validation#

Write a function that takes a string as an argument and returns a list of all valid emails addresses contained in the string. If the string contains no valid emails, the function should return False. Also explain what each part of your regular expression does.

A valid email address has the form: some_alphanumeric_string@something.something (there can be multiple dots at the end e.g., uk_education@edinburgh.ac.uk is a valid address.

Note

Email addresses can include dashes and plus signs (+). They cannot include two @ signs or two dots in a row (name..something@domain.com should be rejected). Don’t worry about all the possible edge-cases; there are quite a few and accounting for all of them is beyond the scope of this exercise.

Note

Again, there should be no loops or if statements in your code

Tip

Check to make sure no punctuation is included as part of the email(s) that your function returns!

Part 6: Redacting text#

Back in 2001 (before many of your were born?!) the company Enron filed for bankruptcy in what became the largest financial scandal in American Economics in decades (up until the 2008 crash and the Bernie Madoff affair stole the spotlight. The investigation of Enron’s finances resulted in the release of many thousands of internal emails and this dataset became a widely studied example of corporate communication. When these types of materials are released to the public, it is common to redact sensitive information such as names, emails, and phone numbers. In this final part of the exercise, you’ll be asked to write a script to redact names, emails and phone numbers from a handful of emails. Your solution should scale to a much larger corpus.

The starter code contains a file enronemail.txt. Your code should read in this file and process it using various regular expressions so that it looks like enronemail-redacted.txt. Specifically, you’ll notice that the names have been converted to (name redacted), emails with (email redacted), and phone numbers with (phone number redacted). You can hard-code the specific first and last names to redact, but your code for emails and phone numbers should be flexible to accommodate various emails and phone numbers (as in parts 4 and 5). You can (and should!) use the code you already wrote for parts 4 and 5. Your code should write a file enronemail_redacted_teamname.txt and push it along with your code when you submit. If your solution is correct, the file should be nearly identical to enronemail-redacted.txt

Important

One of the first emails in the file is a..shankman@enron.com which violates email-address specifications. Since these emails are from >20 years ago, the addresses may not be conforming to the revised specs or the email address got mangled somehow. In any case, make sure it too is redacted for this part of the assignment (it involves simplifying the regular expression you used for part 5).