# Webscraping with Beautiful Soup

This notebook shows off some basic capabilities of the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) webscraping package. Here's a handy [cheat sheet](https://whatacold.io/blog/2021-12-05-beautifulsoup4-cheatsheet/).


Why is it called "Beautiful Soup"? If you [google it](https://groups.google.com/g/beautifulsoup/c/nCOB_U4HqRc?pli=1), you'll be told that it's because of "tag soup" or some nonsense like that. But here's a [hint](https://www.crummy.com/software/BeautifulSoup/). See that picture? That's your [answer](https://www.youtube.com/watch?v=FWxFsJUlBbw).


## Get heading names in a Wikipedia article

Let's begin by automatically grabbing The Simpsons character names from the *Simpson Family* [Wikipedia page](https://en.wikipedia.org/wiki/Simpson_family). Looking at the wiki article, we see that the character names are part of a bulleted list (produced by the HTML `<ul>` tag). Furthermore, they're bolded (the `<b>` tag). Using this, we can do the following:

In [28]:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Simpson_family"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a warning

# soup(...) is shorthand for soup.find_all(...)
for headline in soup('span', attrs={'class': 'mw-headline'}):
    topics = headline.find_next('ul').find_all('b')  # ul is a bulleted list; b is bold
    for topic in topics:
        print('*', topic.text)



* Herbert "Herb" Powell (voiced by Danny DeVito) – As his paternal half-brother, Herb resembles Homer, though he is much thinner, boasts a full head of hair and is more astute. He first appeared in the season two episode "Oh Brother, Where Art Thou?" when Homer is informed by his father Abe, after the latter suffered a mild heart attack, that he had a half-brother, the product of a short-lived affair between Abe and a carnival dunk-tank worker who was also a prostitute (identified in The Simpsons Uncensored Family Album as 'Gaby'). A year after putting the baby up for adoption, Abe married Mona, who insisted he promise never to tell Homer about Herb or how he was conceived. Herb was raised by his adoptive parents Edward and Mililani Powell (first names given in The Simpsons Uncensored Family Album), put himself through college by working odd jobs, then founded Powell Motors, a car company based in Detroit. Herb is an exception to 'the Simpson gene', which causes all male members of the

There are some false alarms, but it works pretty well! You might wonder whether this is too much trouble to go through just to get the Simpsons characters. But of course the identical code will work on pretty much *any* wiki article. And this general approach can be used on any webpage. And that's really the point.
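To see the heading, then next list, then bold text pattern in isolation, here is a minimal offline sketch. The HTML string is a made-up miniature page (not real Wikipedia markup), and the guard against a missing list is an addition that the cell above does not have.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; real Wikipedia markup is much messier.
html = """
<h2><span class="mw-headline">Family</span></h2>
<ul>
  <li><b>Homer Simpson</b> is the father.</li>
  <li><b>Marge Simpson</b> is the mother.</li>
</ul>
<h2><span class="mw-headline">Pets</span></h2>
<ul>
  <li><b>Santa's Little Helper</b> is the dog.</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for headline in soup.find_all("span", class_="mw-headline"):
    bullet_list = headline.find_next("ul")  # first <ul> after this heading
    if bullet_list is None:                 # no list anywhere after the heading
        continue
    for bold in bullet_list.find_all("b"):
        names.append(bold.get_text())

print(names)
```

One likely source of the false alarms: `find_next` searches the whole rest of the document, so a heading with no list of its own picks up the list that belongs to a later section.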

## Get current temperature in Madison

Let's look at another example showing how we can grab a particular piece of information from a webpage: the current temperature. I'll walk you through what's going on here in class.


In [29]:
soup = BeautifulSoup(requests.get("https://weather.com/weather/today/l/USWI0411:1:US").text, 'html.parser')
results = soup.find_all('span', attrs={'class': 'CurrentConditions--tempValue--MHmYY'})
print(results[0].getText())

33°


This is about as simple as it gets, and the approach scales (and that's the point!). If we want the temperature of 100 different cities, we just load the appropriate URLs and iterate through them.
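Here's a rough sketch of what that iteration could look like. The function names, the `example.com` URLs, and the fake pages are all made up for illustration. It also assumes the trailing part of the class name (`MHmYY`) is auto-generated and might change, so it matches on the prefix instead. The page-loading step is passed in as a function so the demo runs without touching the network.

```python
from bs4 import BeautifulSoup


def extract_temperature(html):
    """Return the text of the first temperature <span>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: only the prefix of the class name is stable, so match any
    # <span> whose class attribute contains it.
    span = soup.select_one('span[class*="CurrentConditions--tempValue"]')
    return span.get_text() if span is not None else None


def temperatures(urls_by_city, fetch):
    """Map each city name to its temperature string.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    return {city: extract_temperature(fetch(url))
            for city, url in urls_by_city.items()}


# Offline demonstration with a fake fetcher and made-up pages.
fake_pages = {
    "https://example.com/madison":
        '<span class="CurrentConditions--tempValue--MHmYY">33°</span>',
    "https://example.com/nowhere": "<p>no temperature here</p>",
}
urls_by_city = {
    "Madison": "https://example.com/madison",
    "Nowhere": "https://example.com/nowhere",
}
result = temperatures(urls_by_city, fetch=fake_pages.get)
print(result)
```

With real pages you would pass something like `fetch=lambda url: requests.get(url).text` and a dictionary of actual weather.com URLs. Returning `None` for a page without a temperature keeps one bad page from crashing the whole loop.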

## Example of using the newspaper package

There's an enormous infrastructure for web scraping, with lots of codebases for common tasks, e.g., [newspaper3k](https://newspaper.readthedocs.io/en/latest/) for scraping online news. Let's do a quick demo of this one.


First install the package using `pip install newspaper3k`.

In [2]:
from newspaper import Article

url = "https://www.cnn.com/2022/11/30/world/black-hole-devours-star-scn/index.html"
article = Article(url)
article.download()
print(article)
article.parse()
print(article.authors)
print(article.publish_date)
print(article.text[:1000]) #first 1000 chars




<newspaper.article.Article object at 0x7fb1a6fdb8e0>
['Ashley Strickland']
2022-11-30 00:00:00
Sign up for CNN’s Wonder Theory science newsletter. Explore the universe with news on fascinating discoveries, scientific advancements and more.

CNN —

An incredibly bright flash that appeared in the night sky in February was the result of a star straying too close to a supermassive black hole, meeting its untimely end there as it was ripped to shreds.

But the rare cosmic event actually occurred 8.5 billion light years away from Earth, when the universe was just a third of its current age — and it has created more questions than answers.

The signal from the luminous explosion, known as AT 2022cmc, was first picked up by the Zwicky Transient Facility at the California Institute of Technology’s Palomar Observatory on February 11.

This graphic shows how a tidal disruption event might look in space. Carl Knox/OzGrav/Swinburne University of Technology

When a star is torn apart by a black hole

We can also grab images associated with it. Let's grab the URL of the head image and then download it.

In [6]:
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
from IPython import display

print(article.top_image)
raw_img = urllib.request.urlopen(article.top_image).read()
#display.Image(raw_img) # this will work when you render the notebook locally, but not when it's uploaded to github as we do here





https://media.cnn.com/api/v1/images/stellar/prod/221130121224-01-black-hole-tidal-disruption-event.jpg?c=16x9&q=w_800,c_fill



![top_image](https://media.cnn.com/api/v1/images/stellar/prod/221130121224-01-black-hole-tidal-disruption-event.jpg)
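If you'd rather keep the image on disk than just hold the bytes in memory, something like the following sketch would do it. The helper names `filename_from_url` and `save_bytes` are made up for this example, and the fallback name `image.jpg` is an arbitrary choice.

```python
import os
import posixpath
from urllib.parse import urlparse


def filename_from_url(url, default="image.jpg"):
    """Take the last path component of a URL, ignoring the query string."""
    path = urlparse(url).path          # drops everything from "?" onward
    name = posixpath.basename(path)    # URL paths always use "/"
    return name or default


def save_bytes(data, directory, url):
    """Write raw bytes into directory under a name derived from url."""
    target = os.path.join(directory, filename_from_url(url))
    with open(target, "wb") as f:
        f.write(data)
    return target


url = ("https://media.cnn.com/api/v1/images/stellar/prod/"
       "221130121224-01-black-hole-tidal-disruption-event.jpg"
       "?c=16x9&q=w_800,c_fill")
print(filename_from_url(url))
```

In the notebook, `save_bytes(raw_img, ".", article.top_image)` would write the downloaded image into the current directory.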

## Lots of examples and pre-written code out there!

Here's a nice step-by-step tutorial with an example script that uses BeautifulSoup to [scrape data from Google Scholar](https://proxiesapi-com.medium.com/scraping-google-scholar-with-python-and-beautifulsoup-850cbdfedbcf).

For an especially creative example of scraping, check out [this blog post by Erik Bernhardsson](https://erikbern.com/2017/02/01/language-pitch.html), which bulk downloads pronunciations of words in various languages to examine whether there are consistent differences in pitch (fundamental frequency) between languages (sounds like Finnish is Lowwww).



## Now it's your turn

### More weather

* Get the low and high temperatures for today in Madison from Wunderground:
`https://www.wunderground.com/weather/us/wi/madison`

### Billboard 100

Let's do another one. Here's an example of scraping the top songs of 2022 from the *Billboard 100* site:

In [34]:
page = requests.get('https://www.billboard.com/charts/year-end/2022/hot-100-songs/')
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find_all('h3', attrs={'class': 'c-title'})
for placement, result in enumerate(results, start=1):
    print(placement, result.getText().strip())
    if placement >= 100:  # avoids some junk at the end
        break


Can you figure out how to get the **artists** instead?

### Isthmus Music Calendar

Another, somewhat more complex task. The Isthmus publishes a [local music calendar](https://isthmus.com/search/event/music-calendar/#page=1). 

Have a look at the calendar and figure out how to scrape the event name, date/time, and location of the listed events. Print them out in a format like this:

```
Event: Jazz Jam
Date and Time: Nov 30, 2023 5:30 PM - 7:30 PM
Location: Zuzu Cafe
--------------------------------------------------
Event: Tony Castañeda Latin Jazz Band
Date and Time: Nov 30, 2023 5:30 PM - 7:30 PM
Location: Cardinal Bar
--------------------------------------------------

... etc.
```
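The scraping itself is the exercise, but the printing step can be sketched separately. This assumes you've already collected each event into a dictionary. The keys `name`, `when`, and `where` are made-up names, and the separator width of 50 dashes is a guess at the layout shown above.

```python
def format_event(event):
    """Render one event dict in the layout shown above."""
    return "\n".join([
        f"Event: {event['name']}",
        f"Date and Time: {event['when']}",
        f"Location: {event['where']}",
        "-" * 50,  # separator line; the width is an assumption
    ])


# Hand-typed sample record standing in for scraped data.
events = [
    {"name": "Jazz Jam",
     "when": "Nov 30, 2023 5:30 PM - 7:30 PM",
     "where": "Zuzu Cafe"},
]
for event in events:
    print(format_event(event))
```

Keeping the formatting in its own function means you can get the output right with fake data first, then plug in the real scraped values.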