Cleaning data can take hundreds or thousands of lines of code.
Sometimes we make mistakes that can have big consequences.
Example: a paper published in the JPE, in which an error was found during a replication led by I4R.
What is {validate}?
{validate} is an R package whose goal is to ensure that our code has produced the expected output.
Use it on the final dataset and on the intermediate ones (basically any time we make big modifications).
How to use {validate}?
Define a series of expectations, or rules, with validator()
Pass our dataset through these rules with confront()
Check that all rules are respected.
Example
Let’s take an example with some data:
head(my_data)
# A tibble: 6 × 7
country continent year lifeExp pop gdpPercap iso
<fct> <fct> <int> <dbl> <int> <dbl> <chr>
1 Afghanistan Asia 1952 28.8 8425333 779. AFG
2 Afghanistan Asia 1957 30.3 9240934 821. AFG
3 Afghanistan Asia 1962 32.0 10267083 853. AFG
4 Afghanistan Asia 1967 34.0 11537966 836. AFG
5 Afghanistan Asia 1972 36.1 13079460 740. AFG
6 Afghanistan Asia 1977 38.4 14880372 786. AFG
Define a series of expectations, or rules, with validator():
library(validate)
rules <- validator(
  # Ensure that all ISO-3 codes have 3 letters
  field_length(iso, n = 3),
  # Ensure that there are no duplicated combination of iso-year
  is_unique(iso, year),
  # Ensure that year doesn't have any missing values
  !is.na(year)
)
Pass our dataset through these rules with confront():
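A minimal sketch of this step and of the final check, assuming the rules defined above. The small `my_data` below is a stand-in built from the first rows shown earlier so that the snippet runs on its own, and the names `out` and `check` are illustrative choices.

```r
library(validate)

# Stand-in for my_data: the key columns of the first rows shown above
my_data <- data.frame(
  iso  = c("AFG", "AFG", "AFG"),
  year = c(1952L, 1957L, 1962L),
  pop  = c(8425333L, 9240934L, 10267083L)
)

# Same rules as above
rules <- validator(
  field_length(iso, n = 3),
  is_unique(iso, year),
  !is.na(year)
)

# Confront the data with the rules
out <- confront(my_data, rules)

# summary() returns one row per rule, with counts of passes and fails
check <- summary(out)
print(check)

# Stop the script if any rule fails for any row
stopifnot(all(check$fails == 0))
```

Putting the `stopifnot()` call right after each big cleaning step makes the script fail early, instead of silently carrying a problem into the final dataset.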
Initialize {renv} whenever we want with init():
renv::init()
* Initializing project ...
* Discovering package dependencies ... Done!
* Copying packages into the cache ... Done!
The following package(s) will be updated in the lockfile:

# CRAN ===============================
- R6 [* -> 2.5.1]
- base64enc [* -> 0.1-3]
- bslib [* -> 0.4.2]
- cachem [* -> 1.0.6]
- cli [* -> 3.5.0]
- countrycode [* -> 1.4.0]
- digest [* -> 0.6.31]
- ellipsis [* -> 0.3.2]
- evaluate [* -> 0.19]
- fansi [* -> 1.0.3]
- fastmap [* -> 1.1.0]
- fs [* -> 1.5.2]
- gapminder [* -> 0.3.0]
- glue [* -> 1.6.2]
- highr [* -> 0.10]
- htmltools [* -> 0.5.4]
- jquerylib [* -> 0.1.4]
- jsonlite [* -> 1.8.4]
- lifecycle [* -> 1.0.3]
- magrittr [* -> 2.0.3]
- memoise [* -> 2.0.1]
- mime [* -> 0.12]
- pillar [* -> 1.8.1]
- pkgconfig [* -> 2.0.3]
- rappdirs [* -> 0.3.3]
- renv [* -> 0.16.0]
- rmarkdown [* -> 2.19]
- sass [* -> 0.4.4]
- settings [* -> 0.2.7]
- stringi [* -> 1.7.8]
- stringr [* -> 1.5.0]
- tibble [* -> 3.1.8]
- tinytex [* -> 0.43]
- utf8 [* -> 1.2.2]
- validate [* -> 1.1.1]
- vctrs [* -> 0.5.1]
- xfun [* -> 0.36]
- yaml [* -> 2.3.6]

# GitHub =============================
- rlang [* -> tidyverse/rlang@HEAD]

# https://yihui.r-universe.dev =======
- knitr [* -> 1.41.8]

The version of R recorded in the lockfile will be updated:
- R [*] -> [4.2.2]

* Lockfile written to 'C:/Users/etienne/Desktop/Divers/good-practices/renv.lock'.

Restarting R session...

* Project 'C:/Users/etienne/Desktop/Divers/good-practices' loaded. [renv 0.16.0]
This will create:
a file called renv.lock
a folder called renv
Don’t edit either of these by hand!
Work as usual. Let’s import another package:
library(dplyr)
Error in library(dplyr) : there is no package called ‘dplyr’
Hmm… weird, dplyr is installed on my laptop.
{renv} creates a sort of “local library” in our project, so we need to reinstall dplyr first:
install.packages("dplyr")
library(dplyr)
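One way to see this “local library” at work is to print the library paths that R searches for packages. This is a small sketch using base R only. In a project where {renv} is active, the first entry is expected to point inside the project's renv folder, but the exact path depends on the machine and the {renv} configuration.

```r
# Library paths R searches when loading packages, in order of priority
lib_paths <- .libPaths()
print(lib_paths)

# Packages are installed into (and loaded from) the first path by default
project_lib <- lib_paths[1]
print(project_lib)
```

Running the same lines outside the project shows the usual user or system library instead. That is why a package installed there is not visible from inside the project.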
Now that we imported a new package, let’s see the status of {renv}:
renv::status()
The following package(s) are installed but not recorded in the lockfile:
             _
  withr        [2.5.0]
  dplyr        [1.0.10]
  generics     [0.1.3]
  tidyselect   [1.2.0]

Use `renv::snapshot()` to add these packages to your lockfile.
Run snapshot() from time to time to update the lockfile:
renv::snapshot()
The following package(s) will be updated in the lockfile:

# CRAN ===============================
- dplyr [* -> 1.0.10]
- generics [* -> 0.1.3]
- tidyselect [* -> 1.2.0]
- withr [* -> 2.5.0]

Do you want to proceed? [y/N]: Y
* Lockfile written to 'C:/Users/etienne/Desktop/Divers/good-practices/renv.lock'.
Good to know
{renv} is not a panacea for reproducibility.
If we use packages that depend on external software (e.g., RSelenium uses Java), {renv} cannot install that software for us.
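Since {renv} does not cover external software, a script can at least check for it at startup and say so clearly. The sketch below uses base R only and takes Java as the example. The variable names are illustrative, and finding an executable on the PATH does not guarantee that its version is the right one.

```r
# Sys.which() returns an empty string when the executable is not on the PATH
java_path <- unname(Sys.which("java"))
java_available <- nzchar(java_path)

if (java_available) {
  message("Java found at: ", java_path)
} else {
  message("Java was not found on the PATH; packages that need it may fail.")
}
```

Listing such requirements in the project's README, next to renv.lock, covers the part that the lockfile cannot record.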