STAT 29000: Project 2 — Spring 2021

Motivation: Web scraping is the process of extracting content from the internet. Typically this goes hand-in-hand with parsing or processing the data. Depending on the task at hand, web scraping can be incredibly simple. With that being said, it can quickly become difficult. Typically, students find web scraping fun and empowering.

Context: In the previous project we gently introduced XML and xpath expressions. In this project, we will learn about web scraping, scrape data from The New York Times, and parse through our newly scraped data using xpath expressions.

Scope: python, web scraping, xml

Learning objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

Make sure to read about and use the template found here, and to review the important information about project submissions here.

Dataset

You will be extracting your own data from online in this project. There is no base dataset.

Questions

Question 1

The New York Times is one of the most popular newspapers in the United States. Open a modern browser (preferably Firefox or Chrome), and navigate to nytimes.com.

By the end of this project you will be able to scrape some data from this website! The first step is to explore the structure of the website. You can right click on the page and click "view page source", which will pull up a page full of the HTML used to render the page. Alternatively, if you want to focus on a single element (an article title, for example), right click on that element and click "inspect element". This will pull up an inspector that allows you to see just the portion of the HTML you care about.

Click around the website and explore the HTML however you see fit. Open a few front page articles and notice how most articles start with a bunch of really important information, namely: an article title, summary, picture, picture caption, picture source, author portraits, authors, and article datetime.

For example:

![](./images/nytimes_image.jpg)

Copy and paste the h1 element (in its entirety) containing the article title (for the article shown above) into an HTML code chunk. Do the same for that article's summary.

Items to submit
  • 2 code chunks containing the HTML requested.

Question 2

In question (1) we copied two elements of an article. When scraping data from a website, it is important to continually consider the patterns in the structure. Specifically, it is important to consider whether or not the defining characteristics you use to parse the scraped data will continue to be in the same format for new data. What do I mean by defining characteristic? I mean some combination of tag, attribute, and content from which you can isolate the data of interest.

For example, given a link to a new nytimes article, do you think you could isolate the article title by using the id="link-4686dc8b" attribute of the h1 tag? Maybe, or maybe not, but it sure seems like "link-4686dc8b" might be unique to that one article and therefore not usable for a new article.
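
For illustration only, here is roughly what those two approaches could look like as xpath expressions stored in Python strings. The second expression's structure is an assumption, not the answer; inspect the page source to find which tags and attributes are actually stable:

# Brittle: tied to an id that is likely generated per article.
brittle_title_xpath = '//h1[@id="link-4686dc8b"]'

# More general: any h1 inside the article element. This structure is an
# assumption -- verify it against the real HTML before relying on it.
general_title_xpath = '//article//h1'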

Write an xpath expression to isolate the article title, and another xpath expression to isolate the article summary.

You do not need to test your xpath expressions yet; we will be doing that shortly.

Items to submit
  • Two xpath expressions in an HTML code chunk.

Question 3

Use the requests package to scrape the webpage containing our article from questions (1) and (2). Use the lxml.html package and the xpath method to test out your xpath expressions from question (2). Did they work? Print the content of the elements to confirm.
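
As a minimal sketch (the xpath strings below are placeholders; substitute your own expressions from question (2)), the workflow might look like this:

import requests
import lxml.html

# Placeholder xpath expressions -- replace with your answers to question (2).
title_xpath = '//article//h1'
summary_xpath = '//article//p'

# Note: some sites require a browser-like User-Agent header; pass headers=
# to requests.get if the request seems blocked.
url = 'https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html'
response = requests.get(url)
tree = lxml.html.fromstring(response.text)

# xpath returns a list of matching elements; text_content() extracts the text.
title = tree.xpath(title_xpath)[0].text_content()
summary = tree.xpath(summary_xpath)[0].text_content()
print(title)
print(summary)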

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

Here is a list of article links from nytimes.com:

  • https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html

  • https://www.nytimes.com/2021/01/06/technology/personaltech/tech-2021-augmented-reality-chatbots-wifi.html

  • https://www.nytimes.com/2021/01/13/movies/letterboxd-growth.html

Write a function called get_article_and_summary that accepts a string called link as an argument, and returns both the article title and summary. Test get_article_and_summary out on each of the provided links:

title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html')
print(f'Title: {title}, Summary: {summary}')
title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/06/technology/personaltech/tech-2021-augmented-reality-chatbots-wifi.html')
print(f'Title: {title}, Summary: {summary}')
title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/13/movies/letterboxd-growth.html')
print(f'Title: {title}, Summary: {summary}')

The first line of your function should look like this:

def get_article_and_summary(link: str) -> tuple[str, str]:
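
A skeleton of the function, assuming it reuses the requests and lxml.html logic from question (3), could look something like the following (the xpath strings are placeholders for your expressions from question (2)):

import requests
import lxml.html

def get_article_and_summary(link: str) -> tuple[str, str]:
    # Fetch the article and parse the HTML into an element tree.
    response = requests.get(link)
    tree = lxml.html.fromstring(response.text)

    # Placeholder xpath expressions -- substitute your own from question (2).
    title = tree.xpath('//article//h1')[0].text_content()
    summary = tree.xpath('//article//p')[0].text_content()

    return title, summary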

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 5

In question (1) we mentioned a myriad of other important information given at the top of most New York Times articles. Choose one of the other listed pieces of information, then copy, paste, and update your solution to question (4) to scrape and return that chosen piece of information as well.

If you choose to scrape non-textual data, be sure to return data of an appropriate type. For example, if you choose to scrape one of the images, either print the image or return a PIL object.
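
For example, if you scrape an image, a small sketch of turning the downloaded bytes into a PIL object might look like this (img_url is a hypothetical variable holding the image's src attribute, extracted with an xpath expression):

import io
import requests
from PIL import Image

# img_url is a placeholder -- extract the real src attribute with xpath.
img_url = 'https://example.com/some_image.jpg'
response = requests.get(img_url)
image = Image.open(io.BytesIO(response.content))
image.show()  # in a notebook, simply displaying `image` also works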

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.