2.7 Sentinel values in non-text columns

R packages like readr recognise NA as a sentinel value that means “Not Applicable”, or “Not Available”, or anything you want. It doesn’t affect the data type of a column when NA is one of the values. Some datasets use other symbols as a sentinel value, e.g. N/A or ., or a combination, in which case you can instruct readr to interpret those values as sentinels, and it will import them all as NA.

But what if the data uses more than one kind of sentinel value. For example, Statistics New Zealand uses … to mean “Not applicable”, and ..C to mean “Confidentialised”. Most tools will either regard both values as NA, or coerce the whole column to characters.

read_csv("a, b,   c
          1, 2,   3
          4, …, ..C",
         na = c("…", "..C")) # Regard both values as NA

## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4    NA    NA

read_csv("a, b,   c
          1, 2,   3
          4, …, ..C",
         na = "")              # Coerce the whole column to characters

## # A tibble: 2 x 3
##       a b     c    
##   <dbl> <chr> <chr>
## 1     1 2     3    
## 2     4 …     ..C

A better procedure is to import the sentinel values into their own column, or even into separate TRUE/FALSE columns for each kind of sentinel.

Note that sentinel values relate the the value in the cell, rather than to the whole row, so the first step is to make the dataset extra-tidy as in the section “Already a tidy table but with meaningful formatting of single cells”.

# Tidy
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
x <- read_excel(path, sheet = "sentinels")
x

## # A tibble: 4 x 3
##   Name     Subject  Score
##   <chr>    <chr>    <chr>
## 1 Matilda  Music    7    
## 2 Nicholas Classics NA   
## 3 Olivia   …        3    
## 4 Paul     NA       ..C

# Extra-tidy
extra_tidy <-
  gather(x, variable, value, -Name) %>%
  arrange(Name, variable)
extra_tidy

## # A tibble: 8 x 3
##   Name     variable value   
##   <chr>    <chr>    <chr>   
## 1 Matilda  Score    7       
## 2 Matilda  Subject  Music   
## 3 Nicholas Score    NA      
## 4 Nicholas Subject  Classics
## 5 Olivia   Score    3       
## 6 Olivia   Subject  …       
## 7 Paul     Score    ..C     
## 8 Paul     Subject  NA

With an extra-tidy dataset, the sentinels can now be appended to the values of individual variables, rather than to whole observations.

# Extra-tidy, with row and column numbers of the original variables, and the
# sentinels omitted
extra_tidy <-
  read_excel(path, sheet = "sentinels", na = c("NA", "…", "..C")) %>%
  mutate(row = row_number() + 1L) %>%
  gather(variable, value, -row, -Name) %>%
  group_by(row) %>%
  mutate(col = row_number() + 1L) %>%
  ungroup() %>%
  select(row, col, Name, variable, value) %>%
  arrange(row, col)
extra_tidy

## # A tibble: 8 x 5
##     row   col Name     variable value   
##   <int> <int> <chr>    <chr>    <chr>   
## 1     2     2 Matilda  Subject  Music   
## 2     2     3 Matilda  Score    7       
## 3     3     2 Nicholas Subject  Classics
## 4     3     3 Nicholas Score    <NA>    
## 5     4     2 Olivia   Subject  <NA>    
## 6     4     3 Olivia   Score    3       
## 7     5     2 Paul     Subject  <NA>    
## 8     5     3 Paul     Score    <NA>

# Import all the cells, and filter for sentinel values
sentinels <-
  xlsx_cells(path, sheet = "sentinels") %>%
  dplyr::filter(character %in% c("NA", "…", "..C")) %>%
  mutate(sentinel = character) %>%
  select(row, col, sentinel)
sentinels

## # A tibble: 4 x 3
##     row   col sentinel
##   <int> <int> <chr>   
## 1     3     3 NA      
## 2     4     2 …       
## 3     5     2 NA      
## 4     5     3 ..C

# Join the `sentinel` column to the rest of the data
left_join(extra_tidy, sentinels, by = c("row", "col"))

## # A tibble: 8 x 6
##     row   col Name     variable value    sentinel
##   <int> <int> <chr>    <chr>    <chr>    <chr>   
## 1     2     2 Matilda  Subject  Music    <NA>    
## 2     2     3 Matilda  Score    7        <NA>    
## 3     3     2 Nicholas Subject  Classics <NA>    
## 4     3     3 Nicholas Score    <NA>     NA      
## 5     4     2 Olivia   Subject  <NA>     …       
## 6     4     3 Olivia   Score    3        <NA>    
## 7     5     2 Paul     Subject  <NA>     NA      
## 8     5     3 Paul     Score    <NA>     ..C

Here’s another version using only tidyxl and unpivotr, which provides isolate_sentinels() to make this much more straightforward.

xlsx_cells(path, sheet = "sentinels") %>%
  select(row, col, data_type, character, numeric) %>%
  isolate_sentinels(character, c("NA", "…", "..C")) %>%
  behead("left", Name) %>%
  behead("up", variable) %>%
  select(Name, variable, character, numeric, sentinel)

## # A tibble: 8 x 5
##   Name     variable character numeric sentinel
##   <chr>    <chr>    <chr>       <dbl> <chr>   
## 1 Matilda  Subject  Music          NA <NA>    
## 2 Matilda  Score    <NA>            7 <NA>    
## 3 Nicholas Subject  Classics       NA <NA>    
## 4 Nicholas Score    <NA>           NA NA      
## 5 Olivia   Subject  <NA>           NA …       
## 6 Olivia   Score    <NA>            3 <NA>    
## 7 Paul     Subject  <NA>           NA NA      
## 8 Paul     Score    <NA>           NA ..C