STAT 19000: Project 10 — Fall 2021
Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using sapply.
Context: We’ve just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. sapply is one of the best ways to do this in R.
Scope: r, sapply, functions
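As a minimal sketch of the idea described in the Context above, the example below applies a function to every element of a vector with sapply. The helper label_sign and the values vector are hypothetical names made up for illustration; they are not part of the project.

```r
# Hypothetical helper (illustration only): works on ONE number at a time,
# because if/else is not vectorized.
label_sign <- function(x) {
  if (x < 0) {
    "negative"
  } else {
    "non-negative"
  }
}

# Made-up values.
values <- c(-5, 0, 12)

# sapply calls label_sign once per element and simplifies the results
# into a character vector of the same length as values.
labels <- sapply(values, label_sign)
print(labels)
```

Calling label_sign(values) directly would not work as intended, since if only accepts a single condition; sapply handles the element-by-element calls for us.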
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/election/*.txt
Questions
Question 1
Read the elections dataset from 2014 (itcont2014.txt) into a data.frame called elections using the fread function from the data.table package.
Make sure to use the correct argument |
Create a vector called transactions_starting_digit that gets the starting digit for each transaction value (use the TRANSACTION_AMT column). Be sure to use the get_starting_digit function from the previous project.
Take a look at the starting digits of the unique transaction amounts. Can we directly compare the results to Benford’s Law to look for anomalies? Explain why or why not, and if not, what do we need to do to be able to make the comparison?
Pay close attention to the results: to compare directly, every number you test would need to be a valid input for the Benford’s Law function.
What are the possible digits a number can start with?
Relevant topics: fread, unique, table
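The sketch below shows how unique, sapply, and table fit together on a small made-up vector. It assumes a stand-in version of get_starting_digit that returns the first character of a value's text form; the function from the previous project may behave differently, so treat this definition as hypothetical.

```r
# Assumption: a stand-in for get_starting_digit from the previous project.
# It returns the first character of the value's text representation.
get_starting_digit <- function(value) {
  substr(as.character(value), 1, 1)
}

# Made-up amounts, not taken from the elections data.
amounts <- c(250, 1000, 35, 250)

# unique drops the repeated 250 before we extract starting digits.
starting <- sapply(unique(amounts), get_starting_digit)

# table counts how often each starting digit occurs.
digit_counts <- table(starting)
print(digit_counts)
```

Looking at the names that table reports is a quick way to see every distinct starting character present in the data.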
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences explaining whether any changes are needed in our dataset to analyze it using Benford’s Law, why or why not, and if so, which changes are necessary.
Question 2
Be sure to watch the video from Question 1. It covers Question 2 too.
If in question (1) you answered that there are modifications needed in the data, make the necessary modifications.
You should need to make a modification.
Make a barplot showing the percentage of times each digit was the starting digit.
Include in your barplot a line indicating the expected percentage based on Benford’s Law.
If we compared our results to Benford’s Law, would we consider the findings anomalous? Explain why or why not.
Relevant topics: barplot, lines, points, table, prop.table, indexing
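As a rough sketch of how these topics combine, the example below computes the percentages expected under Benford's Law, where the probability of leading digit d is log10(1 + 1/d), and overlays them on a barplot of observed percentages. The observed vector is made up for illustration and all variable names are hypothetical.

```r
digits <- 1:9

# Benford's Law: P(d) = log10(1 + 1/d) for a leading digit d in 1..9.
benford_pct <- 100 * log10(1 + 1 / digits)

# Made-up starting digits, not taken from the elections data.
observed <- c(1, 1, 1, 2, 2, 3, 4, 5, 7, 9)

# factor() with fixed levels keeps digits that never occur (count 0),
# so the bars line up with all nine expected values.
observed_pct <- 100 * prop.table(table(factor(observed, levels = digits)))

# barplot returns the x positions of the bar midpoints; reuse them so the
# line and points sit on top of the matching bars.
mids <- barplot(observed_pct, ylim = c(0, 50),
                xlab = "Starting digit", ylab = "Percentage",
                main = "Observed vs. Benford's Law")
lines(mids, benford_pct, col = "red")
points(mids, benford_pct, col = "red", pch = 19)
```

Using the midpoints returned by barplot, rather than 1:9, matters because bar positions are not located at integer x values.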
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences explaining why or why not you think the results for this dataset are anomalous based on Benford’s Law.
Question 3
Let’s explore things a bit more. How does a different grouping look? To facilitate our analysis, let’s create a function to replicate the steps from questions (1) and (2).
Create a function called compare_to_benfords that accepts two arguments, values and title. values represents a vector of values to analyze using Benford’s Law, and title will provide the title of our resulting plot.
Make sure the title argument has a default value, so that if we don’t pass an argument to it, the function will still run.
The function should get the starting digits in values, perform any necessary cleanup, and compare the results with Benford’s Law graphically, by producing a plot like the one from question (2).
Note that we are simplifying things by wrapping what we did in questions (1) and (2) into a function so we can do the analysis more efficiently.
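To illustrate what a default argument value looks like in R, here is a small sketch. The function describe_values is hypothetical and unrelated to the Benford's Law analysis; it only shows that an argument with a default can be omitted or overridden.

```r
# Hypothetical function (illustration only): title has a default value,
# so callers may omit it.
describe_values <- function(values, title = "Summary of values") {
  paste0(title, ": ", length(values), " values, mean ", mean(values))
}

# title is omitted, so the default is used.
default_result <- describe_values(c(2, 4, 6))
print(default_result)

# title is passed explicitly and overrides the default.
custom_result <- describe_values(c(2, 4, 6), title = "Even numbers")
print(custom_result)
```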
Test your function on the TRANSACTION_AMT column from the elections dataset. Note that the results should be the same as in question (2), including the title of your plot.
For fair comparison, set the y-axis limits to be between 0 and 50%.
If you called either of the What if you shared this function with your friend, who didn’t have access to your Instead, it is perfectly acceptable to declare your functions inside your |
- R code used to solve this problem.
- The results of running the R code.
- The results of running compare_to_benfords(elections$TRANSACTION_AMT).
Question 4
Let’s dig into the data a bit more. Using the compare_to_benfords function, analyze the transactions from the following entities (ENTITY_TP):
- Candidate ('CAN'),
- Individual - a person - ('IND'),
- and Organization - not a committee and not a person - ('ORG').
Use a loop, or one of the functions in the apply suite to solve this problem.
Write 1-2 sentences comparing the transactions for each type of ENTITY_TP.
Before running your code, run the following code to create a 1x3 grid (one row, three columns) for our plots.
par(mfrow=c(1,3))
There are many ways to solve this problem.
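One possible pattern, sketched on made-up data, is to loop over the group labels, subset the values for each group, and draw one plot per panel. The data frame donations, its columns, and the group labels below are hypothetical; hist stands in for whatever plotting function you call inside the loop.

```r
# Made-up data; the column names and group labels are hypothetical.
donations <- data.frame(
  group  = c("A", "A", "B", "B", "C", "C"),
  amount = c(10, 25, 300, 120, 45, 80)
)

# One row, three columns: each plot drawn next fills the next panel.
par(mfrow = c(1, 3))

for (g in c("A", "B", "C")) {
  # Keep only the amounts that belong to the current group.
  group_amounts <- donations$amount[donations$group == g]
  hist(group_amounts, main = paste("Group", g), xlab = "Amount")
}

# Restore the default single-panel layout.
par(mfrow = c(1, 1))

# The apply suite offers an alternative: split the amounts by group,
# then apply a function to each piece.
group_means <- sapply(split(donations$amount, donations$group), mean)
print(group_means)
```

The loop is convenient when the goal is a side effect such as a plot, while sapply is a natural fit when each group should produce a value to collect.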
- R code used to solve this problem.
- The results of running the R code.
- The results of running compare_to_benfords(elections$TRANSACTION_AMT).
- Optional: Include the name or abbreviation of the entity in its title.
Question 5
Use the elections datasets and what you learned about Benford’s Law to explore the data further.
You can compare specific states, donations to other entities, or even use datasets from other years.
Explain what you are doing and why, and what your conclusions are. Be creative!
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences explaining what you are doing and why.
- 1-2 sentences explaining your conclusions.
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.