STAT 39000: Project 6 — Fall 2020
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Questions
Question 1
In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the Origin
, from 1987.csv
. Write another line, this time using awk
to do the same thing. Which one do you prefer, and why?
Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:
-
One line of UNIX commands to solve the problem without using
awk
. -
One line of UNIX commands to solve the problem using
awk
. -
1-2 sentences describing which method you prefer and why.
Question 2
Write a bash script that accepts a year (1987, 1988, etc.) and a column n and returns the nth column of the associated year of data.
Here are two examples to illustrate how to write a bash script:
In this example, you only need to turn in the content of your bash script (starting with #!/bin/bash
) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:
#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"
If you simply drop that text into a file called my_script.sh
, located here: /home/$USER/my_script.sh
, and if you run the following:
# Setup bash to run; this only needs to be run one time per session.
# It makes bash behave a little more naturally in RStudio.
exec bash
# Navigate to the location of my_script.sh
cd /home/$USER
# Make sure that the script is runable.
# This only needs to be done one time for each new script that you write.
chmod 755 my_script.sh
# Execute my_script.sh
./my_script.sh okay cool
then it will print:
First argument: okay Second argument: cool
In this example, if we were to turn in the "content of your bash script (starting with #!/bin/bash
) in a code chunk, our solution would look like this:
#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"
And although we aren’t running the code chunk above, we know that it works because we tested it in the terminal.
Using |
-
The content of your bash script (starting with
#!/bin/bash
) in a code chunk.
Question 3
How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using awk
, then solve this problem using only awk
.
Here is a similar example, using the election data set:
-
One line of UNIX commands to solve the problem without using
awk
. -
One line of UNIX commands to solve the problem using
awk
. -
The number of flights that arrived at Indianapolis (IND) in 2008.
Question 4
Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you’d like. Are they indeed the same? How many unique values do we have per category (Origin
, Dest
)?
Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:
-
1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
-
The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
-
The number of unique values per category (
Origin
,Dest
).
Question 5
In (4) we found that there are not the same number of unique Origin
as Dest
. Find the IATA airport code for all Origin
that don’t appear in a Dest
and all Dest
that don’t appear in an Origin
in the 2008 data.
The examples on this page should help. Note that these examples are based on Process Substitution, which basically allows you to specify commands whose output would be used as the input of |
-
The line(s) of UNIX command(s) used to answer the question.
-
The list of all
Origin
that don’t appear inDest
. -
The list of all
Dest
that don’t appear inOrigin
.
Question 6
What was the percentage of flights in 2008 per unique Origin
with the Dest
of "IND"? What percentage of flights had "PHX" as Origin
(among all flights with Dest
of "IND")?
Here is an example using the percentages of donations contributed from CEOs from various States:
You can do the mean calculation in awk by dividing the result from (3) by the number of unique |
-
The percentage of flights in 2008 per unique
Origin
with theDest
of "IND". -
1-2 sentences explaining how "PHX" compares (as a unique
ORIGIN
) to the otherOrigin`s (all with the `Dest
of "IND")?
Question 7
Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:
1987, 12345 1988, 44
Run the script with inputs: 1991
and ORD
. Include the output in your submission.
-
The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-
The output of the script given
1991
andORD
as inputs.
OPTIONAL QUESTION 1
Pick your favorite airport and get its IATA airport code. Write a bash script that, given the first year, last year, and airport code, runs the bash script from (7) for all years in the provided range for your given airport, or loops through all of the files for the given airport, appending all of the data to a new file called my_airport.csv
.
-
The content of your bash script (starting with "#!/bin/bash") in a code chunk.
OPTIONAL QUESTION 2
In R, load my_airport.csv
and create a line plot showing the year-by-year change. Label your x-axis "Year", your y-axis "Num Flights", and your title the name of the IATA airport code. Write 1-2 sentences with your observations.
-
Line chart showing year-by-year change in flights into and out of the chosen airport.
-
R code used to create the chart.
-
1-2 sentences with your observations.