STAT 39000: Project 4 — Fall 2020
Motivation: The need to search files and datasets based on the text they contain is common throughout the data wrangling process. grep is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. They can be very complicated, and even professionals can make critical mistakes. That said, learning the basics is an investment that will come in handy regardless of the language you are working in.
Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.
Scope: grep, regular expression basics, utilizing regular expression tools in R and Python
You can find useful examples that walk you through relevant material in The Examples Book:
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
I would highly recommend using single quotes |
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/the_office_dialogue.csv
A public sample of the data can be found here: the_office_dialogue.csv
All questions should be answered using the full dataset located on Scholar. You may use the public sample of the data to experiment with your solutions before running them on the full dataset.
grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data. You can read about and see examples of grep here.
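As a minimal sketch of what "print matching lines" looks like in practice, the commands below run grep on a small throwaway file. The file name and its three lines are made up for illustration and are not part of the course dataset. The patterns are wrapped in single quotes so the shell hands them to grep unchanged.

```shell
# Create a small throwaway file to search (hypothetical sample lines).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'Michael: Where are the turtles?' \
  'Dwight: Identity theft is not a joke, Jim!' \
  'Jim: Bears. Beets. Battlestar Galactica.' > "$tmpdir/lines.txt"

# Print every line matching the pattern. The dot is escaped so that it
# matches a literal period rather than "any single character".
exact=$(grep 'Beets\.' "$tmpdir/lines.txt")

# -i ignores case, so a lowercase pattern still matches.
nocase=$(grep -i 'bears\. beets\.' "$tmpdir/lines.txt")

# -c counts the matching lines instead of printing them.
count=$(grep -c 'Jim' "$tmpdir/lines.txt")

echo "$exact"
echo "$nocase"
echo "$count"

rm -r "$tmpdir"
```

Here the first two searches print the same line, and the count is 2 because "Jim" appears on two different lines.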
Question
Question 1
Log in to Scholar and use grep to find the dataset we will use in this project. The dataset we will use is the only dataset that contains the text "Bears. Beets. Battlestar Galactica.". What is the name of the dataset and where is it located?
- The grep command used to find the dataset.
- The name and location in Scholar of the dataset.
- Use grep and grepl within R to solve a data-driven problem.
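One way to look for a file by its contents, sketched on a made-up directory tree rather than on Scholar: the -r and -l options used below are available in GNU and BSD grep (an assumption about the grep installed on your system), and the directory and file names are hypothetical.

```shell
# Build a tiny directory tree with two files (hypothetical names).
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/data/a" "$tmpdir/data/b"
printf '%s\n' 'id,text' '1,hello world' > "$tmpdir/data/a/first.csv"
printf '%s\n' 'id,text' '1,needle in a haystack' > "$tmpdir/data/b/second.csv"

# -r searches every file under the given directory.
# -l prints only the names of files that contain a match,
# instead of printing the matching lines themselves.
found=$(grep -r -l 'needle in a haystack' "$tmpdir/data")

echo "$found"

rm -r "$tmpdir"
```

Only the path of second.csv is printed, because it is the only file containing the search text.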
Question 2
grep prints each line in which the text you are searching for appears. In Project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the TV show is a row in the dataset. You can count to see which column the various bits of data live in.
Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.
The result if you were to search for "bears. beets. battlestar galactica." should be: "Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."
One method to solve this problem would be to pipe the output from |
- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
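A sketch of one possible shape for such a pipeline, assuming grep is combined with cut (other tools would work as well). The toy CSV, its column names, and its contents are invented for illustration. Note that cut splits on every comma, including commas inside quoted fields, so check its output carefully on real data.

```shell
# A toy CSV with a header row (hypothetical columns and values).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'id,season,speaker,text' \
  '1,1,Ann,Good morning' \
  '2,1,Bob,Where is the stapler' \
  '3,2,Ann,The stapler is in the drawer' > "$tmpdir/toy.csv"

# head -n 1 shows the header row, which tells us each column's position.
header=$(head -n 1 "$tmpdir/toy.csv")

# grep -i selects the matching rows regardless of case; the pipe passes
# them to cut, which splits on commas and keeps only fields 3 and 4.
result=$(grep -i 'STAPLER' "$tmpdir/toy.csv" | cut -d ',' -f 3,4)

echo "$header"
echo "$result"

rm -r "$tmpdir"
```

The output keeps only the speaker and text of the two rows that mention the stapler.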
Question 3
Find all of the lines where Pam is called "Beesley" instead of "Pam" or "Pam Beesley".
A negative lookbehind would be one way to solve this, in order to use a negative lookbehind with |
Regular expressions are a useful, semi-language-agnostic tool. This means that regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load the dataset you discovered in (1) using read.csv. Name the resulting data.frame dat.
- The UNIX command used to solve this problem.
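To illustrate what a negative lookbehind does, here is a small sketch on unrelated, made-up sentences. It assumes a GNU grep built with Perl-compatible regular expression support, enabled with the -P option; not every grep provides it.

```shell
# Hypothetical sample sentences, unrelated to the course dataset.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'We flew to New York last week.' \
  'York is a city in England.' \
  'Old York and New Jersey are different places.' > "$tmpdir/cities.txt"

# (?<!New ) is a negative lookbehind: "York" matches only where it is
# NOT immediately preceded by "New ". -P turns on Perl-compatible syntax.
matches=$(grep -P '(?<!New )York' "$tmpdir/cities.txt")

echo "$matches"

rm -r "$tmpdir"
```

The first sentence is skipped because its only "York" follows "New "; the other two lines are printed.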
Question 4
The text_w_direction column in dat contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expressions to detect the directions.
(a) Create a new column called has_direction that is set to TRUE if the text_w_direction column has direction, and FALSE otherwise. Use the grepl function in R to accomplish this.
Make sure all opening brackets "[" have a corresponding closing bracket "]".
Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.
(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have 2 or more directions? Modify your code again and find how many have 5 or more.
We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
Your solution to part (a) should match both lines. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.
In our actual dataset, for example, dat$text_w_direction[2789] is a line with 2 directions.
- The R code and regular expression used to solve the first part of this problem.
- The R code and regular expression used to solve the second part of this problem.
- How many lines have >= 2 directions?
- How many lines have >= 5 directions?
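The pattern idea described in the hints can be tried out on the command line before moving to R. The sketch below uses grep -E (extended regular expressions) on the two example sentences plus one made-up line with no direction. It is a command-line illustration only, not the grepl solution the question asks for, and the interval form {2,} is just one way to express "2 or more".

```shell
# The two example sentences from above, plus one line without direction.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'This is a line with no direction.' \
  'This is a line with [emphasize this] only 1 direction!' \
  'This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].' > "$tmpdir/sample.txt"

# \[ is a literal opening bracket and .* is any amount of any text.
# A ] outside a bracket expression is already a literal closing bracket.
# -c counts the matching lines.
one_or_more=$(grep -c -E '\[.*]' "$tmpdir/sample.txt")

# Wrapping the bracketed piece in ( ) and adding {2,} requires it to
# repeat at least twice; changing the 2 changes the threshold.
two_or_more=$(grep -c -E '(\[.*].*){2,}' "$tmpdir/sample.txt")

echo "$one_or_more"
echo "$two_or_more"

rm -r "$tmpdir"
```

Two lines contain at least one set of direction, and only the last line contains two or more.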
Question 5
Use the str_extract_all function from the stringr package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
In this question, your solution may have extracted:
[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]
It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.
- The R code used to solve this problem.
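To see why the text between neighboring directions gets included, here is a command-line sketch of the same behavior. It uses grep -o, which prints only the matched part of each line (available in GNU and BSD grep), and a small helper function named extract_greedy that is made up for this example. It is not the stringr solution the question asks for.

```shell
# Hypothetical helper: print only the part of the input that matches.
# .* is greedy, so the match runs from the first [ to the LAST ] on the line.
extract_greedy() {
  printf '%s\n' "$1" | grep -o '\[.*]'
}

one=$(extract_greedy 'This is a line with [emphasize this] only 1 direction!')
two=$(extract_greedy 'This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].')

echo "$one"
echo "$two"
```

The first line yields just its single direction, while the second yields everything from the first opening bracket through the final closing bracket.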
OPTIONAL QUESTION
Repeat (5) but this time make sure you only capture the brackets and text within the brackets. Save the results in a new column called direction_correct. You can test to see if it is working by running the following code:
dat$direction_correct[747]
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
In (5), your solution may have extracted:
[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]
This is ok for (5). In this question, however, we want to fix this to only extract:
[emphasize this]
[emphasize this] [shrug]
This regular expression will be hard to read.
The pattern we want is: literal opening bracket, followed by 0+ of any character other than the literal [ or literal ], followed by a literal closing bracket.
- The R code used to solve this problem.
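The pattern described in the hint can be sketched on the command line with grep -o (available in GNU and BSD grep), applied to the second example sentence. This only illustrates how the pattern behaves; it is not the R solution, and the exact escaping needed inside an R string may differ.

```shell
line='This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].'

# Reading the pattern piece by piece:
#   \[      a literal opening bracket
#   [^][]*  zero or more characters that are neither ] nor [
#           (inside a bracket expression, a ] placed first is literal)
#   ]       a literal closing bracket
# -o prints each match on its own line.
pairs=$(printf '%s\n' "$line" | grep -o '\[[^][]*]')

echo "$pairs"
```

Because the middle part cannot cross a bracket, each match stops at the first closing bracket, so the two directions come out separately.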