Apply Functions
piping
Much like the |
operator in bash, the %>%
operator in R pipes the output from the first expression to the second. For example instead of:
sum(c(1,2,3))
[1] 6
It is extremely common practice in the tidyverse
to pipe output from one function to another. For example:
subset <- iris %>%
subset(Sepal.Length > 5) %>%
mutate(Sepal.Length.Sq = Sepal.Length^2)
head(subset)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.Sq
1 5.1 3.5 1.4 0.2 setosa 26.01
2 5.4 3.9 1.7 0.4 setosa 29.16
3 5.4 3.7 1.5 0.2 setosa 29.16
4 5.8 4.0 1.2 0.2 setosa 33.64
5 5.7 4.4 1.5 0.4 setosa 32.49
6 5.4 3.9 1.3 0.4 setosa 29.16
select
select
is a handy function used to select columns from a data.frame or tibble. For example:
iris %>% select(Sepal.Length, Species) %>% head()
Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
6 5.4 setosa
That alone is not that impressive, as we could easily do something like:
iris[, c("Sepal.Length", "Species")] %>% head()
Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
6 5.4 setosa
However, in the same way you can write 1:4 to represent a vector of numbers from 1-4, you can select columns from Sepal.Length
to Petal.Length
(and everything in between) by using Sepal.Length:Petal.Length
.
iris %>% select(Sepal.Length:Petal.Length) %>% head()
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
select
is particularly useful when paired with selection helpers, as you can select certain columns based on their names:
iris %>% select(contains(
"length"
)) %>%
head()
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
# or case sensitive
iris %>% select(contains(
"Length",
ignore.case=F
)) %>%
head()
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
selection helpers
Selection helpers are functions that make selecting variables easier. They are particularly easy to use with select
.
everything
matches all variables. For example:
iris %>% select(everything()) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
It is primarily useful when used in combination with functions like pivot_longer
and pivot_wider
.
last_col
selects the last variable, possibly with an offset.
iris %>% select(last_col()) %>% head()
Species
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
6 setosa
Or, with an offset:
iris %>% select(1:last_col(2)) %>% head()
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
contains
selects columns where the columns name contains another string. For example:
iris %>% select(contains("sepal")) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
|
In the same way that contains
looks for a string within the column names of a data.frame, starts_with
and ends_with
select columns where column names either start with one or more values or end with one or more values (respectively). For example, to get the columns starting with "Sepal":
iris %>% select(starts_with("sepal")) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
Or to get columns that end in "width":
iris %>% select(ends_with("width")) %>% head()
Sepal.Width Petal.Width
1 3.5 0.2
2 3.0 0.2
3 3.2 0.2
4 3.1 0.2
5 3.6 0.2
6 3.9 0.4
For more fine grain control, matches
behaves the same way, but instead of literal string matching, we can feed a regular expression to matches
. For example, we could get all columns containing one or more ".":
iris %>% select(matches("+\\.")) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
Sometimes, you’ll have datasets with columns labeled sequentially, for example:
head(billboard)
# A tibble: 6 x 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doo… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 B… Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
# … with 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>,
# wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 <dbl>, wk54 <dbl>,
# wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
# wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>,
# wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>,
# wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
Here, we have columns labeled wk1
all the way until wk76
. Using num_range
and select
we can get any number of those specific columns:
billboard %>% select(num_range("wk", 70:75)) %>% head()
# A tibble: 6 x 6
wk70 wk71 wk72 wk73 wk74 wk75
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 NA NA NA NA NA NA
5 NA NA NA NA NA NA
6 NA NA NA NA NA NA
all_of
is a selection helper designed to select strictly the columns whose names are inside the provided vector.
my_values <- c("Sepal.Length", "Sepal.Width")
iris %>% select(all_of(my_values)) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
But, whenever a single value in your vector isn’t present, an error is thrown.
my_values <- c("Sepal.Length", "Sepal.Width", "Sepal.Weight")
iris %>% select(all_of(my_values)) %>% head()
Error: Cannot subset columns that do not exist.
✖ Column `Sepal.Weight` does not exist.
For times you would like to select the values if they exist, any_of
is more useful. It is similar to all_of
, but doesn’t check if a value is missing.
my_values <- c("Sepal.Length", "Sepal.Width", "Sepal.Weight")
iris %>% select(any_of(my_values)) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
transmute
transmute
is a useful function that adds new variables and drops all existing ones. If a variable already exists, it overwrites the variable. For example, let’s say we wanted to capitalize the values of Species
in the iris
dataset:
iris %>%
transmute(Species = toupper(Species)) %>%
head()
Species
1 SETOSA
2 SETOSA
3 SETOSA
4 SETOSA
5 SETOSA
6 SETOSA
Here, the values in the Species
column are overwritten with the fully capitalized version. All of the other columns are dropped. One way to maintain other columns, would be to include them in the transmute
call:
iris %>%
transmute(Species = toupper(Species), Sepal.Length, Sepal.Width) %>%
head()
Species Sepal.Length Sepal.Width
1 SETOSA 5.1 3.5
2 SETOSA 4.9 3.0
3 SETOSA 4.7 3.2
4 SETOSA 4.6 3.1
5 SETOSA 5.0 3.6
6 SETOSA 5.4 3.9
Alternatively, you could use mutate
, which has the same behavior, but preserves existing variables.
mutate
mutate
is just like transmute
, but the original data is preserved. For example:
iris %>%
mutate(Species = toupper(Species)) %>%
head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 SETOSA
2 4.9 3.0 1.4 0.2 SETOSA
3 4.7 3.2 1.3 0.2 SETOSA
4 4.6 3.1 1.5 0.2 SETOSA
5 5.0 3.6 1.4 0.2 SETOSA
6 5.4 3.9 1.7 0.4 SETOSA
Here, since Species
already exists as a column, the column is overwritten by our new capitalized values. If the name of the new column does not already exist, the original Species
column will remain untouched. For example:
iris %>%
mutate(Species_Cap = toupper(Species)) %>%
head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_Cap
1 5.1 3.5 1.4 0.2 setosa SETOSA
2 4.9 3.0 1.4 0.2 setosa SETOSA
3 4.7 3.2 1.3 0.2 setosa SETOSA
4 4.6 3.1 1.5 0.2 setosa SETOSA
5 5.0 3.6 1.4 0.2 setosa SETOSA
6 5.4 3.9 1.7 0.4 setosa SETOSA
mutate
is extremely useful, and is difficult (and less intuitive) to replicate in pandas
in Python.
case_when
case_when
is a function that allows you to vectorize multiple if_else
statements. For example, let’s say we want to create a new column in our iris
dataset called size
, where the value is Large
if Sepal.Length
is greater than 5, and Not Large
otherwise?
new_iris <- iris %>%
mutate(size = case_when(
Sepal.Length > 5 ~ "Large",
Sepal.Length <= 5 ~ "Not Large"
))
head(new_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Large
2 4.9 3.0 1.4 0.2 setosa Not Large
3 4.7 3.2 1.3 0.2 setosa Not Large
4 4.6 3.1 1.5 0.2 setosa Not Large
5 5.0 3.6 1.4 0.2 setosa Not Large
6 5.4 3.9 1.7 0.4 setosa Large
Here, mutate
is responsible for creating a new column called size
, and case_when
assigns the value Large
when Sepal.Length
is greater than 5 and Not Large
when Sepal.Length
is less than or equal to Not Large
. In this case we have exhaustively gone through all of the possible values of our new column, size
, because for each and every possible value of Sepal.Length
we have an associated value (Large
and Not Large
). In reality, this is not always possible. For example, let’s remove the second case:
new_iris <- iris %>%
mutate(size = case_when(
Sepal.Length > 5 ~ "Large"
))
head(new_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Large
2 4.9 3.0 1.4 0.2 setosa <NA>
3 4.7 3.2 1.3 0.2 setosa <NA>
4 4.6 3.1 1.5 0.2 setosa <NA>
5 5.0 3.6 1.4 0.2 setosa <NA>
6 5.4 3.9 1.7 0.4 setosa Large
As you can see, by default, if no cases match, NA is the resulting value. One common technique to handle "all other cases" is the following:
new_iris <- iris %>%
mutate(size = case_when(
Sepal.Length > 5 ~ "Large",
TRUE ~ "Not Large"
))
head(new_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Large
2 4.9 3.0 1.4 0.2 setosa Not Large
3 4.7 3.2 1.3 0.2 setosa Not Large
4 4.6 3.1 1.5 0.2 setosa Not Large
5 5.0 3.6 1.4 0.2 setosa Not Large
6 5.4 3.9 1.7 0.4 setosa Large
Here, each case is evaluated. If at the end, there was no match, TRUE
is always a match, and therefore the result will be Not Large
.
between
between
is a dead simple function from dplyr
that is an efficiently implemented shortcut for the following:
x <- 5
print(x >= 4 && x <= 10)
[1] TRUE
# instead you can use between
between(x, 4, 10)
[1] TRUE
group_by
group_by
is a function commonly used in conjunction with mutate
, transmute
, and summarize
. It is useful when you want to perform a tapply
-like operation on a data.frame. For example, let’s say you wanted to get the average Petal.Length
by Species
. Using tapply
, you would do something like:
tapply(iris$Petal.Length, iris$Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
While useful, tapply
's end result isn’t in a format that is conducive to further analysis or wrangling. For example, what if we wanted to calculate and then plot (in ggplot
) the difference between the mean Petal.Length
and the mean Sepal.Length
by Species
? Using tapply
, you would have to do something like:
diff <- tapply(iris$Petal.Length, iris$Species, mean) - tapply(iris$Sepal.Length, iris$Species, mean)
myDF <- data.frame(Species = names(diff), diff = unname(diff))
ggplot(myDF, aes(x=diff, y=Species)) + geom_bar(stat="identity")
Again, a little bit more difficult to read than the following, and if you had more operations to complete, the previous example would make it difficult to do even more. In the following example, however, we can continue to utilize and build on myDF:
myDF <- iris %>%
group_by(Species) %>%
mutate(diff=mean(Petal.Length) - mean(Sepal.Length))
myDF %>% ggplot(aes(x=diff, y=Species)) + geom_bar(stat="identity")
summarize
summarize
is a useful function to get a new, tidy, data frame that is a summary of some other data. It’s particularly useful in conjunction with group_by
, when you want to compare groups.
For example, let’s say you wanted to the following:
-
Create a new column called
Sepal.Length.Cat
with valuessmall
whenSepal.Length
< 5.1,large
whenSepal.Length
>= 5.8, andmedium
otherwise. -
Get a summary containing the average
Sepal.Width
bySepal.Length.Cat
andSpecies
. -
Get a summary containing the variation in averages for each
Species
.
iris %>%
mutate(Sepal.Length.Cat = case_when(
Sepal.Length < 5.1 ~ "small",
Sepal.Length >= 5.8 ~ "large",
TRUE ~ "medium"
)) %>%
group_by(Sepal.Length.Cat, Species) %>%
summarize(avg_sepal_width_grouped = mean(Sepal.Width)) %>%
group_by(Species) %>%
summarize(std_of_avgs = sd(avg_sepal_width_grouped))
`summarise()` regrouping output by 'Sepal.Length.Cat' (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species std_of_avgs
<fct> <dbl>
1 setosa 0.402
2 versicolor 0.329
3 virginica 0.255
As you can see, it has some pretty powerful functionality that would be more difficult to replicate (and harder to read) using base R.
str_extract and str_extract_all
str_extract
and str_extract_all
are useful functions from the stringr
package. You can install the package by running:
install.packages("stringr")
str_extract
extracts the text which matches the provided regular expression or pattern. Note that this differs from grep in a major way. grep
simply returns the index in which a pattern match was found. str_extract
returns the actual matching text. Note that grep
typically returns the entire line where a match was found. str_extract
returns only the part of the line or text that matches the pattern. For example:
text <- c("cat", "mat", "spat", "spatula", "gnat")
# All 5 "lines" of text were a match.
grep(".*at", text)
[1] 1 2 3 4 5
text <- c("cat", "mat", "spat", "spatula", "gnat")
stringr::str_extract(text, ".*at")
[1] "cat" "mat" "spat" "spat" "gnat"
As you can see, although all 5 words match our pattern and would be returned by grep
, str_extract
only returns the actual text that matches the pattern. In this case "spatula" is not a "full" match — the pattern .at
only captures the "spat" part of "spatula". In order to capture the rest of the word you would need to add something like .*
to the end of the pattern:
text <- c("cat", "mat", "spat", "spatula", "gnat")
stringr::str_extract(text, ".*at.*")
[1] "cat" "mat" "spat" "spatula" "gnat"
Examples
How can I extract the text between parenthesis in a vector of texts?
text <- c("this is easy for (you)", "there (are) challenging ones", "text is (really awesome) (ok?)")
# Search for a literal "(", followed by any amount of any text other than more parenthesis ([^()]*), followed by a literal ")".
stringr::str_extract(text, "\\([^()]*\\)")
[1] "(you)" "(are)" "(really awesome)"
To get all matches, not just the first match:
text <- c("this is easy for (you)", "there (are) challenging ones", "text is (really awesome) more text (ok?)")
# Search for a literal "(", followed by any amount of any text (.*), followed by a literal ")".
stringr::str_extract_all(text, "\\([^()]*\\)")
[[1]]
[1] "(you)"
[[2]]
[1] "(are)"
[[3]]
[1] "(really awesome)" "(ok?)"
lubridate
lubridate
is a fantastic package that makes the typical tasks one would perform on dates, that much easier.
Examples
How do I convert a string "07/05/1990" to a Date?
library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:data.table':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:base':
date, intersect, setdiff, union
dat <- "07/05/1990"
dat <- mdy(dat)
class(dat)
[1] "Date"
How do I convert a string "31-12-1990" to a Date?
my_string <- "31-12-1990"
dat <- dmy(my_string)
dat
[1] "1990-12-31"
class(dat)
[1] "Date"
nchar
nchar
is a function which counts the number of characters and symbols in a word or a string. Punctuation and blank spaces are counted as well.
Examples
How can I find the number of characters and or symbols in the word "Protozoa"?
nchar("Protozoa")
[1] 8
How can I find the number of characters and or symbols for the following strings all at once: "pneumonoultramicroscopicsilicovolcanoconiosis", "password: DatamineRocks#stat1900@"?
string_vector <- c("pneumonoultramicroscopicsilicovolcanoconiosis", "password: DatamineRocks#stat1900@")
nchar(string_vector)
[1] 45 33
Fun Fact: pneumonoultramicroscopicsilicovolcanoconiosis is the longest word in the English dictionary.