2.7 Sentinel values in non-text columns
R packages like readr recognise NA
as a
sentinel value that means “Not Applicable”, or “Not Available”, or anything you
want. It doesn’t affect the data type of a column when NA
is one of the
values. Some datasets use other symbols as a sentinel value, e.g. N/A
or .
,
or a combination, in which case you can instruct readr
to interpret those
values as sentinels, and it will import them all as NA
.
But what if the data uses more than one kind of sentinel value. For example,
Statistics New Zealand uses …
to mean “Not applicable”, and ..C
to mean
“Confidentialised”. Most tools will either regard both values as NA
, or
coerce the whole column to characters.
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 NA NA
## # A tibble: 2 x 3
## a b c
## <dbl> <chr> <chr>
## 1 1 2 3
## 2 4 … ..C
A better procedure is to import the sentinel values into their own column, or
even into separate TRUE
/FALSE
columns for each kind of sentinel.
Note that sentinel values relate the the value in the cell, rather than to the whole row, so the first step is to make the dataset extra-tidy as in the section “Already a tidy table but with meaningful formatting of single cells”.
# Tidy
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
x <- read_excel(path, sheet = "sentinels")
x
## # A tibble: 4 x 3
## Name Subject Score
## <chr> <chr> <chr>
## 1 Matilda Music 7
## 2 Nicholas Classics NA
## 3 Olivia … 3
## 4 Paul NA ..C
## # A tibble: 8 x 3
## Name variable value
## <chr> <chr> <chr>
## 1 Matilda Score 7
## 2 Matilda Subject Music
## 3 Nicholas Score NA
## 4 Nicholas Subject Classics
## 5 Olivia Score 3
## 6 Olivia Subject …
## 7 Paul Score ..C
## 8 Paul Subject NA
With an extra-tidy dataset, the sentinels can now be appended to the values of individual variables, rather than to whole observations.
# Extra-tidy, with row and column numbers of the original variables, and the
# sentinels omitted
extra_tidy <-
read_excel(path, sheet = "sentinels", na = c("NA", "…", "..C")) %>%
mutate(row = row_number() + 1L) %>%
gather(variable, value, -row, -Name) %>%
group_by(row) %>%
mutate(col = row_number() + 1L) %>%
ungroup() %>%
select(row, col, Name, variable, value) %>%
arrange(row, col)
extra_tidy
## # A tibble: 8 x 5
## row col Name variable value
## <int> <int> <chr> <chr> <chr>
## 1 2 2 Matilda Subject Music
## 2 2 3 Matilda Score 7
## 3 3 2 Nicholas Subject Classics
## 4 3 3 Nicholas Score <NA>
## 5 4 2 Olivia Subject <NA>
## 6 4 3 Olivia Score 3
## 7 5 2 Paul Subject <NA>
## 8 5 3 Paul Score <NA>
# Import all the cells, and filter for sentinel values
sentinels <-
xlsx_cells(path, sheet = "sentinels") %>%
dplyr::filter(character %in% c("NA", "…", "..C")) %>%
mutate(sentinel = character) %>%
select(row, col, sentinel)
sentinels
## # A tibble: 4 x 3
## row col sentinel
## <int> <int> <chr>
## 1 3 3 NA
## 2 4 2 …
## 3 5 2 NA
## 4 5 3 ..C
# Join the `sentinel` column to the rest of the data
left_join(extra_tidy, sentinels, by = c("row", "col"))
## # A tibble: 8 x 6
## row col Name variable value sentinel
## <int> <int> <chr> <chr> <chr> <chr>
## 1 2 2 Matilda Subject Music <NA>
## 2 2 3 Matilda Score 7 <NA>
## 3 3 2 Nicholas Subject Classics <NA>
## 4 3 3 Nicholas Score <NA> NA
## 5 4 2 Olivia Subject <NA> …
## 6 4 3 Olivia Score 3 <NA>
## 7 5 2 Paul Subject <NA> NA
## 8 5 3 Paul Score <NA> ..C
Here’s another version using only tidyxl and unpivotr, which provides
isolate_sentinels()
to make this much more straightforward.
xlsx_cells(path, sheet = "sentinels") %>%
select(row, col, data_type, character, numeric) %>%
isolate_sentinels(character, c("NA", "…", "..C")) %>%
behead("left", Name) %>%
behead("up", variable) %>%
select(Name, variable, character, numeric, sentinel)
## # A tibble: 8 x 5
## Name variable character numeric sentinel
## <chr> <chr> <chr> <dbl> <chr>
## 1 Matilda Subject Music NA <NA>
## 2 Matilda Score <NA> 7 <NA>
## 3 Nicholas Subject Classics NA <NA>
## 4 Nicholas Score <NA> NA NA
## 5 Olivia Subject <NA> NA …
## 6 Olivia Score <NA> 3 <NA>
## 7 Paul Subject <NA> NA NA
## 8 Paul Score <NA> NA ..C