2.1 Clean & tidy tables

If the tables in the spreadsheet are clean and tidy, then you should use a package like readxl. But it’s worth knowing how to emulate readxl with tidyxl and unpivotr, because some almost clean tables can be handled using these techniques.
Clean and tidy means
- One table per sheet
- A single row of column headers, or no headers
- A single data type in each column
- Only one kind of sentinel value (to be interpreted as
NA) - No meaningful formatting
- No data buried in formulas
- No need to refer to named ranges
Here’s the full process.
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
xlsx_cells(path, sheet = "clean") %>%
behead("up", header) %>%
select(row, data_type, header, character, numeric) %>%
spatter(header) %>%
select(-row)## # A tibble: 3 x 2
## Age Name
## <dbl> <chr>
## 1 1 Matilda
## 2 3 Nicholas
## 3 5 Olivia
tidyxl::xlsx_cells() imports the spreadsheet into a data frame, where each row
of the data frame describes one cell of the spreadsheet. The columns row and
col (and address) describe the position of the cell, and the value of the
cell is in one of the columns error, logical, numeric, date,
character, depending on the type of data in the cell. The column data_type
says which column the value is in. Other columns describe formatting and
formulas.
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
xlsx_cells(path, sheet = "clean") %>%
select(row, col, data_type, character, numeric)## # A tibble: 8 x 5
## row col data_type character numeric
## <int> <int> <chr> <chr> <dbl>
## 1 1 1 character Name NA
## 2 1 2 character Age NA
## 3 2 1 character Matilda NA
## 4 2 2 numeric <NA> 1
## 5 3 1 character Nicholas NA
## 6 3 2 numeric <NA> 3
## 7 4 1 character Olivia NA
## 8 4 2 numeric <NA> 5
unpivotr::behead() takes one level of headers from a pivot table and makes it
part of the data. Think of it like tidyr::gather(), except that it works when
there is more than one row of headers (or more than one column of row-headers),
and it only works on tables that have first come through
unpivotr::as_cells() or tidyxl::xlsx_cells().
xlsx_cells(path, sheet = "clean") %>%
select(row, col, data_type, character, numeric) %>%
behead("up", header)## # A tibble: 6 x 6
## row col data_type character numeric header
## <int> <int> <chr> <chr> <dbl> <chr>
## 1 2 1 character Matilda NA Name
## 2 2 2 numeric <NA> 1 Age
## 3 3 1 character Nicholas NA Name
## 4 3 2 numeric <NA> 3 Age
## 5 4 1 character Olivia NA Name
## 6 4 2 numeric <NA> 5 Age
unpivotr::spatter() spreads key-value pairs across multiple columns, like
tidyxl::spread(), except that it handles mixed data types. It knows which
column contains the cell value (i.e. the character column or the numeric
column), by checking the data_type column. Just like tidyr::spread(), it
can be confused by extraneous data, so it’s usually a good idea to drop the
col column first, and to keep the row column.
xlsx_cells(path, sheet = "clean") %>%
select(row, col, data_type, character, numeric) %>%
behead("up", header) %>%
select(-col) %>%
spatter(header) %>%
select(-row)## # A tibble: 3 x 2
## Age Name
## <dbl> <chr>
## 1 1 Matilda
## 2 3 Nicholas
## 3 5 Olivia
In case the table has no column headers, you can spatter the col column
instead of a nonexistent header column.
xlsx_cells(path, sheet = "clean") %>%
dplyr::filter(row >= 2) %>%
select(row, col, data_type, character, numeric) %>%
spatter(col) %>%
select(-row)## # A tibble: 3 x 2
## `1` `2`
## <chr> <dbl>
## 1 Matilda 1
## 2 Nicholas 3
## 3 Olivia 5
Tidyxl and unpivotr are much more complicated than readxl, and that’s the point: tidyxl and unpivotr give you more power and complexity when you need it.
## # A tibble: 3 x 2
## Name Age
## <chr> <dbl>
## 1 Matilda 1
## 2 Nicholas 3
## 3 Olivia 5
## New names:
## * `` -> ...1
## * `` -> ...2
## # A tibble: 3 x 2
## ...1 ...2
## <chr> <dbl>
## 1 Matilda 1
## 2 Nicholas 3
## 3 Olivia 5