2.1 Clean & tidy tables

If the tables in the spreadsheet are clean and tidy, then you should use a package like readxl. But it’s worth knowing how to emulate readxl with tidyxl and unpivotr, because some almost clean tables can be handled using these techniques.

Clean and tidy means

  • One table per sheet
  • A single row of column headers, or no headers
  • A single data type in each column
  • Only one kind of sentinel value (to be interpreted as NA)
  • No meaningful formatting
  • No data buried in formulas
  • No need to refer to named ranges

Here’s the full process.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
xlsx_cells(path, sheet = "clean") %>%
  behead("up", header) %>%
  select(row, data_type, header, character, numeric) %>%
  spatter(header) %>%
  select(-row)
## # A tibble: 3 x 2
##     Age Name    
##   <dbl> <chr>   
## 1     1 Matilda 
## 2     3 Nicholas
## 3     5 Olivia

tidyxl::xlsx_cells() imports the spreadsheet into a data frame, where each row of the data frame describes one cell of the spreadsheet. The columns row and col (and address) describe the position of the cell, and the value of the cell is in one of the columns error, logical, numeric, date, character, depending on the type of data in the cell. The column data_type says which column the value is in. Other columns describe formatting and formulas.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
xlsx_cells(path, sheet = "clean") %>%
  select(row, col, data_type, character, numeric)
## # A tibble: 8 x 5
##     row   col data_type character numeric
##   <int> <int> <chr>     <chr>       <dbl>
## 1     1     1 character Name           NA
## 2     1     2 character Age            NA
## 3     2     1 character Matilda        NA
## 4     2     2 numeric   <NA>            1
## 5     3     1 character Nicholas       NA
## 6     3     2 numeric   <NA>            3
## 7     4     1 character Olivia         NA
## 8     4     2 numeric   <NA>            5

unpivotr::behead() takes one level of headers from a pivot table and makes it part of the data. Think of it like tidyr::gather(), except that it works when there is more than one row of headers (or more than one column of row-headers), and it only works on tables that have first come through unpivotr::as_cells() or tidyxl::xlsx_cells().

xlsx_cells(path, sheet = "clean") %>%
  select(row, col, data_type, character, numeric) %>%
  behead("up", header)
## # A tibble: 6 x 6
##     row   col data_type character numeric header
##   <int> <int> <chr>     <chr>       <dbl> <chr> 
## 1     2     1 character Matilda        NA Name  
## 2     2     2 numeric   <NA>            1 Age   
## 3     3     1 character Nicholas       NA Name  
## 4     3     2 numeric   <NA>            3 Age   
## 5     4     1 character Olivia         NA Name  
## 6     4     2 numeric   <NA>            5 Age

unpivotr::spatter() spreads key-value pairs across multiple columns, like tidyxl::spread(), except that it handles mixed data types. It knows which column contains the cell value (i.e. the character column or the numeric column), by checking the data_type column. Just like tidyr::spread(), it can be confused by extraneous data, so it’s usually a good idea to drop the col column first, and to keep the row column.

xlsx_cells(path, sheet = "clean") %>%
  select(row, col, data_type, character, numeric) %>%
  behead("up", header) %>%
  select(-col) %>%
  spatter(header) %>%
  select(-row)
## # A tibble: 3 x 2
##     Age Name    
##   <dbl> <chr>   
## 1     1 Matilda 
## 2     3 Nicholas
## 3     5 Olivia

In case the table has no column headers, you can spatter the col column instead of a nonexistent header column.

xlsx_cells(path, sheet = "clean") %>%
  dplyr::filter(row >= 2) %>%
  select(row, col, data_type, character, numeric) %>%
  spatter(col) %>%
  select(-row)
## # A tibble: 3 x 2
##   `1`        `2`
##   <chr>    <dbl>
## 1 Matilda      1
## 2 Nicholas     3
## 3 Olivia       5

Tidyxl and unpivotr are much more complicated than readxl, and that’s the point: tidyxl and unpivotr give you more power and complexity when you need it.

read_excel(path, sheet = "clean")
## # A tibble: 3 x 2
##   Name       Age
##   <chr>    <dbl>
## 1 Matilda      1
## 2 Nicholas     3
## 3 Olivia       5
read_excel(path, sheet = "clean", col_names = FALSE, skip = 1)
## New names:
## * `` -> ...1
## * `` -> ...2
## # A tibble: 3 x 2
##   ...1      ...2
##   <chr>    <dbl>
## 1 Matilda      1
## 2 Nicholas     3
## 3 Olivia       5