Find errors in data

Methods

errorlocate provides two main functions:

locate_errors for detecting errors
replace_errors for replacing faulty values with NA

library(validate)
library(errorlocate)

Let’s start with a simple example:

We have a rule that age must be positive:

rules <- validator(age > 0)

And we have the following data set:

"age, income
 -10,    0  
  15, 2000
  25, 3000
  NA, 1000
" -> csv
d <- read.csv(textConnection(csv), strip.white = TRUE)

age	income
-10	0
15	2000
25	3000
NA	1000

le <- locate_errors(d, rules)
summary(le)
#> Variable:
#>     name errors missing
#> 1    age      1       1
#> 2 income      0       0
#> Errors per record:
#>   errors records
#> 1      0       3
#> 2      1       1

summary(le) gives an overview of the errors found in this data set. The complete error listing can be found with:

le$errors
#>        age income
#> [1,]  TRUE  FALSE
#> [2,] FALSE  FALSE
#> [3,] FALSE  FALSE
#> [4,]    NA  FALSE

This shows that record 1 has a faulty value for age. The NA in record 4 marks a value that was already missing in the data.
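For larger data sets it can be easier to list only the flagged cells than to read the full matrix. The sketch below is one way to do that in base R. To stay self-contained it works on a hand-typed copy of the le$errors matrix printed above (the name errors is an assumption for illustration); with a real result, le$errors could be used in its place.

```r
# Hand-typed copy of the le$errors matrix printed above
# (NA means the value was missing in the data)
errors <- matrix(
  c(TRUE, FALSE, FALSE, NA,        # age
    FALSE, FALSE, FALSE, FALSE),   # income
  ncol = 2,
  dimnames = list(NULL, c("age", "income"))
)

# which() skips NA, so only the cells flagged TRUE are returned
flagged <- which(errors, arr.ind = TRUE)

# One row per flagged cell: record number and variable name
flagged_cells <- data.frame(
  record   = unname(flagged[, "row"]),
  variable = colnames(errors)[flagged[, "col"]]
)
print(flagged_cells)
```

Because which() ignores NA, missing values are not reported as errors here.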

Suppose we expand our rules:

rules <- validator( r1 = age > 0
                  , r2 = if (income > 0) age > 16
                  )

With validate::confront we can see that rule r2 is violated (record 2).

summary(confront(d, rules))

name	items	passes	fails	nNA	error	warning	expression
r1	4	2	1	1	FALSE	FALSE	age > 0
r2	4	2	1	1	FALSE	FALSE	income <= 0 | (age > 16)

What errors will be found by locate_errors?

set.seed(1)
le <- locate_errors(d, rules)
le$errors
#>        age income
#> [1,]  TRUE  FALSE
#> [2,]  TRUE  FALSE
#> [3,] FALSE  FALSE
#> [4,]    NA  FALSE

locate_errors now also marks age in record 2 as faulty, because that record violates the second rule. Note the call to set.seed: in this example either age or income can be considered faulty in record 2, so the choice between them is not determined by the rules alone. Setting the seed ensures that the procedure is reproducible.

With replace_errors we can remove the faulty values by setting them to NA. The resulting gaps still need to be imputed.

d_fixed <- replace_errors(d, le)
summary(confront(d_fixed, rules))

name	items	passes	fails	nNA	error	warning	expression
r1	4	1	0	3	FALSE	FALSE	age > 0
r2	4	2	0	2	FALSE	FALSE	income <= 0 | (age > 16)

Here replace_errors has set all faulty values to NA:

d_fixed

age	income
NA	0
NA	2000
25	3000
NA	1000
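To check how many values were blanked per variable, the original and the fixed data can be compared cell by cell. The following is a minimal base R sketch. It uses hand-typed copies of d and d_fixed as shown above, and the name blanked is an assumption introduced for illustration.

```r
# Hand-typed copies of the data before and after replace_errors, as shown above
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
d_fixed <- data.frame(age = c(NA, NA, 25, NA), income = c(0, 2000, 3000, 1000))

# A cell was blanked if it is NA after fixing but was observed before
blanked <- is.na(d_fixed) & !is.na(d)

# Number of blanked values per variable
n_blanked <- colSums(blanked)
print(n_blanked)
```

The value of age in record 4 is not counted, because it was already missing in the original data.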

Weights

locate_errors accepts weights for the variables, because the quality of observed variables often differs. When we trust age more than income, we can give age a higher weight, so that locate_errors picks income when it has to choose between the two (record 2):

set.seed(1) # good practice, although not needed in this example
weight <- c(age = 2, income = 1) 
le <- locate_errors(d, rules, weight)
le$errors
#>        age income
#> [1,]  TRUE  FALSE
#> [2,] FALSE   TRUE
#> [3,] FALSE  FALSE
#> [4,]    NA  FALSE

Weights can be specified in different ways (see also errorlocate::expand_weights):

not specifying: all variables will have weight 1.
named vector: all records will have the same set of weights. Unspecified columns will have weight 1.
named matrix or data.frame, same dimension as the data: specify weights per record.
Use Inf weights to fix a variable, so it won’t be changed.
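The per-record option can be sketched as follows. The weight matrix is built in base R with the same shape and column names as the data. The value 3 for income in record 2 is an arbitrary choice for illustration, and the call to locate_errors is guarded so the sketch also runs where the packages are not installed.

```r
# Example data from this vignette
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))

# Start with weight 1 everywhere, same shape and column names as the data
W <- matrix(1, nrow = nrow(d), ncol = ncol(d),
            dimnames = list(NULL, names(d)))

# Hypothetical choice: in record 2 we trust income more than age
W[2, "income"] <- 3

# Run the error localization only when both packages are available
if (requireNamespace("validate", quietly = TRUE) &&
    requireNamespace("errorlocate", quietly = TRUE)) {
  library(validate)
  library(errorlocate)
  rules <- validator(r1 = age > 0, r2 = if (income > 0) age > 16)
  le <- locate_errors(d, rules, weight = W)
  print(le$errors)
}
```

A matrix like this changes the preference for a single record while leaving the other records at the default weight of 1.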

Performance / Parallelisation

locate_errors solves a mixed-integer programming problem. When many validation rules interact, finding an optimal solution can become computationally intensive. Both locate_errors and replace_errors have a parallelisation option, Ncpus, which makes use of multiple processors. The $duration property of the result gives the time in seconds spent finding a solution for each record. This time can be limited with the argument timeout (in seconds).

# duration is in seconds. 
le$duration
#> [1] 0.0010929108 0.0008401871 0.0000000000 0.0008130074
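To see where the time goes, the per-record durations can be summarised with base R. The sketch below uses a hand-typed copy of the values printed above. The names duration, total and slow, and the 0.001 second threshold, are assumptions for illustration only.

```r
# Hand-typed copy of le$duration as printed above (seconds per record)
duration <- c(0.0010929108, 0.0008401871, 0.0000000000, 0.0008130074)

# Total time spent over all records
total <- sum(duration)

# Record that took longest to solve
slowest <- which.max(duration)

# Records above an arbitrary threshold of 0.001 seconds
slow <- which(duration > 0.001)

print(total)
print(slowest)
print(slow)

# With the packages installed, the options named above could be set like
# this (the values 2 and 5 are arbitrary):
# le <- locate_errors(d, rules, Ncpus = 2, timeout = 5)
```

With a real result, le$duration can be used in place of the copied vector.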
