3.2 Complex unpivoting

When behead() isn’t powerful enough (it makes certain assumptions, and it doesn’t understand formatting), then you can get much more control by using enhead(), which joins together two separate data frames of data cells and header cells.

This kind of unpivoting is always done in two stages.

  1. Identify which cells are headers, and which are data
  2. State how the data cells relate to the header cells.

3.2.1 Two clear rows of text column headers, left-aligned

The first stage, identifying header vs data cells, is simply filtering.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <-
  xlsx_cells(path, sheets = "pivot-annotations") %>%
  dplyr::filter(col >= 4, !is_blank) %>% # Ignore the row headers in this example
  select(row, col, data_type, character, numeric) %>%
  print()
## # A tibble: 22 x 5
##      row   col data_type character numeric
##    <int> <int> <chr>     <chr>       <dbl>
##  1     2     4 character Female         NA
##  2     2     6 character Male           NA
##  3     3     4 character Matilda        NA
##  4     3     5 character Olivia         NA
##  5     3     6 character Nicholas       NA
##  6     3     7 character Paul           NA
##  7     4     4 numeric   <NA>            1
##  8     4     5 numeric   <NA>            2
##  9     4     6 numeric   <NA>            3
## 10     4     7 numeric   <NA>            0
## # … with 12 more rows
# View the cells in their original positions on the spreadsheet
rectify(all_cells)
## # A tibble: 6 x 5
##   `row/col` `4(D)`  `5(E)` `6(F)`   `7(G)`
##       <int> <chr>   <chr>  <chr>    <chr> 
## 1         2 Female  <NA>   Male     <NA>  
## 2         3 Matilda Olivia Nicholas Paul  
## 3         4 1       2      3        0     
## 4         5 3       4      5        1     
## 5         6 5       6      9        2     
## 6         7 7       8      12       3
first_header_row <-
  dplyr::filter(all_cells, row == 2) %>%
  select(row, col, sex = character)
  # the title of this header is 'sex'
  # the cells are text cells (`"Female"` and `"Male"`) so take the value in the
  # '`character` column.
first_header_row
## # A tibble: 2 x 3
##     row   col sex   
##   <int> <int> <chr> 
## 1     2     4 Female
## 2     2     6 Male
second_header_row <-
  dplyr::filter(all_cells, row == 3) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
second_header_row
## # A tibble: 4 x 3
##     row   col name    
##   <int> <int> <chr>   
## 1     3     4 Matilda 
## 2     3     5 Olivia  
## 3     3     6 Nicholas
## 4     3     7 Paul
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is exam scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`

The second stage is to declare how the data cells relate to each row of column headers.

Starting from the point of view of a data cell, the relevant column header from the second row of headers is the one directly "up".

enhead(data_cells, second_header_row, "up")
## # A tibble: 16 x 4
##      row   col score name    
##    <int> <int> <dbl> <chr>   
##  1     4     4     1 Matilda 
##  2     4     5     2 Olivia  
##  3     4     6     3 Nicholas
##  4     4     7     0 Paul    
##  5     5     4     3 Matilda 
##  6     5     5     4 Olivia  
##  7     5     6     5 Nicholas
##  8     5     7     1 Paul    
##  9     6     4     5 Matilda 
## 10     6     5     6 Olivia  
## 11     6     6     9 Nicholas
## 12     6     7     2 Paul    
## 13     7     4     7 Matilda 
## 14     7     5     8 Olivia  
## 15     7     6    12 Nicholas
## 16     7     7     3 Paul

The first row of headers, from the point of view of a data cell, is either directly up, or up-then-left.

enhead(data_cells, first_header_row, "up-left")
## # A tibble: 16 x 4
##      row   col score sex   
##    <int> <int> <dbl> <chr> 
##  1     4     4     1 Female
##  2     4     5     2 Female
##  3     5     4     3 Female
##  4     5     5     4 Female
##  5     6     4     5 Female
##  6     6     5     6 Female
##  7     7     4     7 Female
##  8     7     5     8 Female
##  9     4     6     3 Male  
## 10     4     7     0 Male  
## 11     5     6     5 Male  
## 12     5     7     1 Male  
## 13     6     6     9 Male  
## 14     6     7     2 Male  
## 15     7     6    12 Male  
## 16     7     7     3 Male

Piping everything together, we get a complete, tidy dataset, and can finally drop the row and col columns.

data_cells %>%
  enhead(first_header_row, "up-left") %>%
  enhead(second_header_row, "up") %>%
  select(-row, -col)
## # A tibble: 16 x 3
##    score sex    name    
##    <dbl> <chr>  <chr>   
##  1     1 Female Matilda 
##  2     2 Female Olivia  
##  3     3 Female Matilda 
##  4     4 Female Olivia  
##  5     5 Female Matilda 
##  6     6 Female Olivia  
##  7     7 Female Matilda 
##  8     8 Female Olivia  
##  9     3 Male   Nicholas
## 10     0 Male   Paul    
## 11     5 Male   Nicholas
## 12     1 Male   Paul    
## 13     9 Male   Nicholas
## 14     2 Male   Paul    
## 15    12 Male   Nicholas
## 16     3 Male   Paul

3.2.2 Two clear columns of text row headers, top-aligned

This is almost the same as Two clear rows of text column headers, left-aligned, but with different directions: "left" for directly left, and "left-up" for left-then-up.

("up-left" and "left-up" look like synonyms. They happen to be synonyms in enhead(), but they aren’t in behead().

In this example, the table has no column headers, only row headers. This is artificial here, but sometimes table are deliberately laid out in transpose form: the first column contains the headers, and the data extends in columns from left to right instead of from top to bottom.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <-
  xlsx_cells(path, sheets = "pivot-annotations") %>%
  dplyr::filter(row >= 3, !is_blank) %>% # Ignore the column headers in this example
  select(row, col, data_type, character, numeric) %>%
  print()
## # A tibble: 26 x 5
##      row   col data_type character  numeric
##    <int> <int> <chr>     <chr>        <dbl>
##  1     3     4 character Matilda         NA
##  2     3     5 character Olivia          NA
##  3     3     6 character Nicholas        NA
##  4     3     7 character Paul            NA
##  5     4     2 character Humanities      NA
##  6     4     3 character Classics        NA
##  7     4     4 numeric   <NA>             1
##  8     4     5 numeric   <NA>             2
##  9     4     6 numeric   <NA>             3
## 10     4     7 numeric   <NA>             0
## # … with 16 more rows
# View the cells in their original positions on the spreadsheet
rectify(all_cells)
## # A tibble: 5 x 7
##   `row/col` `2(B)`      `3(C)`   `4(D)`  `5(E)` `6(F)`   `7(G)`
##       <int> <chr>       <chr>    <chr>   <chr>  <chr>    <chr> 
## 1         3 <NA>        <NA>     Matilda Olivia Nicholas Paul  
## 2         4 Humanities  Classics 1       2      3        0     
## 3         5 <NA>        History  3       4      5        1     
## 4         6 Performance Music    5       6      9        2     
## 5         7 <NA>        Drama    7       8      12       3
first_header_col <-
  dplyr::filter(all_cells, col == 2) %>%
  select(row, col, field = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
first_header_col
## # A tibble: 2 x 3
##     row   col field      
##   <int> <int> <chr>      
## 1     4     2 Humanities 
## 2     6     2 Performance
second_header_col <-
  dplyr::filter(all_cells, col == 3) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
second_header_col
## # A tibble: 4 x 3
##     row   col subject 
##   <int> <int> <chr>   
## 1     4     3 Classics
## 2     5     3 History 
## 3     6     3 Music   
## 4     7     3 Drama
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is examp scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`

data_cells %>%
  enhead(first_header_col, "left-up") %>%
  enhead(second_header_col, "left") %>%
  select(-row, -col)
## # A tibble: 16 x 3
##    score field       subject 
##    <dbl> <chr>       <chr>   
##  1     1 Humanities  Classics
##  2     2 Humanities  Classics
##  3     3 Humanities  Classics
##  4     0 Humanities  Classics
##  5     3 Humanities  History 
##  6     4 Humanities  History 
##  7     5 Humanities  History 
##  8     1 Humanities  History 
##  9     5 Performance Music   
## 10     6 Performance Music   
## 11     9 Performance Music   
## 12     2 Performance Music   
## 13     7 Performance Drama   
## 14     8 Performance Drama   
## 15    12 Performance Drama   
## 16     3 Performance Drama

3.2.3 Two clear rows and columns of text headers, top-aligned and left-aligned

This is a combination of the previous two sections. No new techniques are used.

  1. Identify which cells are headers, and which are data
  2. State how the data cells relate to the header cells.
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <-
  xlsx_cells(path, sheets = "pivot-annotations") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric) %>%
  print()
## # A tibble: 28 x 5
##      row   col data_type character  numeric
##    <int> <int> <chr>     <chr>        <dbl>
##  1     2     4 character Female          NA
##  2     2     6 character Male            NA
##  3     3     4 character Matilda         NA
##  4     3     5 character Olivia          NA
##  5     3     6 character Nicholas        NA
##  6     3     7 character Paul            NA
##  7     4     2 character Humanities      NA
##  8     4     3 character Classics        NA
##  9     4     4 numeric   <NA>             1
## 10     4     5 numeric   <NA>             2
## # … with 18 more rows
# View the cells in their original positions on the spreadsheet
rectify(all_cells)
## # A tibble: 6 x 7
##   `row/col` `2(B)`      `3(C)`   `4(D)`  `5(E)` `6(F)`   `7(G)`
##       <int> <chr>       <chr>    <chr>   <chr>  <chr>    <chr> 
## 1         2 <NA>        <NA>     Female  <NA>   Male     <NA>  
## 2         3 <NA>        <NA>     Matilda Olivia Nicholas Paul  
## 3         4 Humanities  Classics 1       2      3        0     
## 4         5 <NA>        History  3       4      5        1     
## 5         6 Performance Music    5       6      9        2     
## 6         7 <NA>        Drama    7       8      12       3
first_header_row <-
  dplyr::filter(all_cells, row == 2) %>%
  select(row, col, sex = character)
  # the title of this header is 'sex'
  # the cells are text cells (`"Female"` and `"Male"`) so take the value in the
  # '`character` column.
first_header_row
## # A tibble: 2 x 3
##     row   col sex   
##   <int> <int> <chr> 
## 1     2     4 Female
## 2     2     6 Male
second_header_row <-
  dplyr::filter(all_cells, row == 3) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
second_header_row
## # A tibble: 4 x 3
##     row   col name    
##   <int> <int> <chr>   
## 1     3     4 Matilda 
## 2     3     5 Olivia  
## 3     3     6 Nicholas
## 4     3     7 Paul
first_header_col <-
  dplyr::filter(all_cells, col == 2) %>%
  select(row, col, field = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
first_header_col
## # A tibble: 2 x 3
##     row   col field      
##   <int> <int> <chr>      
## 1     4     2 Humanities 
## 2     6     2 Performance
second_header_col <-
  dplyr::filter(all_cells, col == 3) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
second_header_col
## # A tibble: 4 x 3
##     row   col subject 
##   <int> <int> <chr>   
## 1     4     3 Classics
## 2     5     3 History 
## 3     6     3 Music   
## 4     7     3 Drama
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is examp scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`

data_cells %>%
  enhead(first_header_row, "up-left") %>%
  enhead(second_header_row, "up") %>%
  enhead(first_header_col, "left-up") %>%
  enhead(second_header_col, "left") %>%
  select(-row, -col)
## # A tibble: 16 x 5
##    score sex    name     field       subject 
##    <dbl> <chr>  <chr>    <chr>       <chr>   
##  1     1 Female Matilda  Humanities  Classics
##  2     2 Female Olivia   Humanities  Classics
##  3     3 Female Matilda  Humanities  History 
##  4     4 Female Olivia   Humanities  History 
##  5     3 Male   Nicholas Humanities  Classics
##  6     0 Male   Paul     Humanities  Classics
##  7     5 Male   Nicholas Humanities  History 
##  8     1 Male   Paul     Humanities  History 
##  9     5 Female Matilda  Performance Music   
## 10     6 Female Olivia   Performance Music   
## 11     7 Female Matilda  Performance Drama   
## 12     8 Female Olivia   Performance Drama   
## 13     9 Male   Nicholas Performance Music   
## 14     2 Male   Paul     Performance Music   
## 15    12 Male   Nicholas Performance Drama   
## 16     3 Male   Paul     Performance Drama

3.2.4 Centre-aligned headers

Headers aren’t always aligned to one side of the data cells that they describe.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <- xlsx_cells(path, sheets = "pivot-centre-aligned")
rectify(all_cells)
## # A tibble: 10 x 10
##    `row/col` `2(B)`      `3(C)`     `4(D)` `5(E)`  `6(F)` `7(G)` `8(H)` `9(I)`   `10(J)`
##        <int> <chr>       <chr>      <chr>  <chr>   <chr>  <chr>  <chr>  <chr>    <chr>  
##  1         2 <NA>        <NA>       <NA>   Female  <NA>   <NA>   <NA>   Male     <NA>   
##  2         3 <NA>        <NA>       Leah   Matilda Olivia Lenny  Max    Nicholas Paul   
##  3         4 <NA>        Classics   3      1       2      4      3      3        0      
##  4         5 Humanities  History    8      3       4      7      5      5        1      
##  5         6 <NA>        Literature 1      1       9      3      12     7        5      
##  6         7 <NA>        Philosophy 5      10      10     8      2      5        12     
##  7         8 <NA>        Languages  5      4       5      9      8      3        8      
##  8         9 <NA>        Music      4      10      10     2      4      5        6      
##  9        10 Performance Dance      4      5       6      4      12     9        2      
## 10        11 <NA>        Drama      2      7       8      6      1      12       3

Looking at that table, it’s not immediately obvious where the boundary between Female and Male falls, or between Humanities and Performance. A naive approach would be to match the inner headers to the outer ones by proximity, and there are four directions to do so: "up-ish", "left-ish", "down-ish", and "right-ish".

But in this case, those directions are too naive.

  • Languages is closest to the Performance header, but is a humanity.
  • Lenny is the same distance from Female as from Male.

You can fix this by justifying the header cells towards one side of the data cells that they describe, and then use a direction like "up-left" as usual. Do this with justify(), providing the header cells with a second set of cells at the positions you want the header cells to move to.

  • header_cells is the cells whose value will be used as the header
  • corner_cells is the cells whose position is in one corner of the domain of the header (e.g. the top-left-hand corner).

In the original spreadsheet, the borders mark the boundaries. So the corner cells of the headers can be found by filtering for cells with a particular border.

all_cells <-
  xlsx_cells(path, sheets = "pivot-centre-aligned") %>%
  select(row, col, is_blank, data_type, character, numeric, local_format_id)

formats <- xlsx_formats(path)
top_borders <- which(!is.na(formats$local$border$top$style))
left_borders <- which(!is.na(formats$local$border$left$style))

first_header_row_corners <-
  dplyr::filter(all_cells, row == 2, local_format_id %in% left_borders) %>%
  select(row, col)
first_header_row_corners
## # A tibble: 2 x 2
##     row   col
##   <int> <int>
## 1     2     4
## 2     2     7
first_header_col_corners <-
  dplyr::filter(all_cells, col == 2, local_format_id %in% top_borders) %>%
  select(row, col)
first_header_col_corners
## # A tibble: 2 x 2
##     row   col
##   <int> <int>
## 1     4     2
## 2     9     2

Next, get the first row and first column of header cells as usual.

first_header_row <-
  dplyr::filter(all_cells, !is_blank, row == 2) %>%
  select(row, col, sex = character)
  # the title of this header is 'sex'
  # the cells are text cells (`"Female"` and `"Male"`) so take the value in the
  # '`character` column.
first_header_row
## # A tibble: 2 x 3
##     row   col sex   
##   <int> <int> <chr> 
## 1     2     5 Female
## 2     2     9 Male
first_header_col <-
  dplyr::filter(all_cells, !is_blank, col == 2) %>%
  select(row, col, field = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
first_header_col
## # A tibble: 2 x 3
##     row   col field      
##   <int> <int> <chr>      
## 1     5     2 Humanities 
## 2    10     2 Performance

And now justify the header cells to the same positions as the corner cells.

first_header_row <- justify(first_header_row, first_header_row_corners)
first_header_col <- justify(first_header_col, first_header_col_corners)

first_header_row
## # A tibble: 2 x 3
##     row   col sex   
##   <int> <int> <chr> 
## 1     2     4 Female
## 2     2     7 Male
first_header_col
## # A tibble: 2 x 3
##     row   col field      
##   <int> <int> <chr>      
## 1     4     2 Humanities 
## 2     9     2 Performance

The rest of this example is the same as “Two clear rows and columns of text headers, top-aligned and left-aligned”.

second_header_row <-
  dplyr::filter(all_cells, row == 3) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
second_header_row
## # A tibble: 7 x 3
##     row   col name    
##   <int> <int> <chr>   
## 1     3     4 Leah    
## 2     3     5 Matilda 
## 3     3     6 Olivia  
## 4     3     7 Lenny   
## 5     3     8 Max     
## 6     3     9 Nicholas
## 7     3    10 Paul
second_header_col <-
  dplyr::filter(all_cells, col == 3) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
second_header_col
## # A tibble: 8 x 3
##     row   col subject   
##   <int> <int> <chr>     
## 1     4     3 Classics  
## 2     5     3 History   
## 3     6     3 Literature
## 4     7     3 Philosophy
## 5     8     3 Languages 
## 6     9     3 Music     
## 7    10     3 Dance     
## 8    11     3 Drama
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is examp scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`

data_cells %>%
  enhead(first_header_row, "up-left") %>%
  enhead(second_header_row, "up") %>%
  enhead(first_header_col, "left-up") %>%
  enhead(second_header_col, "left") %>%
  select(-row, -col)
## # A tibble: 56 x 5
##    score sex    name    field      subject   
##    <dbl> <chr>  <chr>   <chr>      <chr>     
##  1     3 Female Leah    Humanities Classics  
##  2     1 Female Matilda Humanities Classics  
##  3     2 Female Olivia  Humanities Classics  
##  4     8 Female Leah    Humanities History   
##  5     3 Female Matilda Humanities History   
##  6     4 Female Olivia  Humanities History   
##  7     1 Female Leah    Humanities Literature
##  8     1 Female Matilda Humanities Literature
##  9     9 Female Olivia  Humanities Literature
## 10     5 Female Leah    Humanities Philosophy
## # … with 46 more rows

3.2.5 Multiple rows or columns of headers, with meaningful formatting

This is a combination of the previous section with Meaningfully formatted cells. The section Meaningfully formatted rows doesn’t work here, because the unpivoting of multiple rows/columns of headers complicates the relationship between the data and the formatting.

  1. Unpivot the multiple rows/columns of headers, as above, but keep the row and col of each data cell.
  2. Collect the row, col and formatting of each data cell.
  3. Join the data to the formatting by the row and col.
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <-
  xlsx_cells(path, sheets = "pivot-annotations") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric) %>%
  print()
## # A tibble: 28 x 5
##      row   col data_type character  numeric
##    <int> <int> <chr>     <chr>        <dbl>
##  1     2     4 character Female          NA
##  2     2     6 character Male            NA
##  3     3     4 character Matilda         NA
##  4     3     5 character Olivia          NA
##  5     3     6 character Nicholas        NA
##  6     3     7 character Paul            NA
##  7     4     2 character Humanities      NA
##  8     4     3 character Classics        NA
##  9     4     4 numeric   <NA>             1
## 10     4     5 numeric   <NA>             2
## # … with 18 more rows
# View the cells in their original positions on the spreadsheet
rectify(all_cells)
## # A tibble: 6 x 7
##   `row/col` `2(B)`      `3(C)`   `4(D)`  `5(E)` `6(F)`   `7(G)`
##       <int> <chr>       <chr>    <chr>   <chr>  <chr>    <chr> 
## 1         2 <NA>        <NA>     Female  <NA>   Male     <NA>  
## 2         3 <NA>        <NA>     Matilda Olivia Nicholas Paul  
## 3         4 Humanities  Classics 1       2      3        0     
## 4         5 <NA>        History  3       4      5        1     
## 5         6 Performance Music    5       6      9        2     
## 6         7 <NA>        Drama    7       8      12       3
first_header_row <-
  dplyr::filter(all_cells, row == 2) %>%
  select(row, col, sex = character)
  # the title of this header is 'sex'
  # the cells are text cells (`"Female"` and `"Male"`) so take the value in the
  # '`character` column.
first_header_row
## # A tibble: 2 x 3
##     row   col sex   
##   <int> <int> <chr> 
## 1     2     4 Female
## 2     2     6 Male
second_header_row <-
  dplyr::filter(all_cells, row == 3) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
second_header_row
## # A tibble: 4 x 3
##     row   col name    
##   <int> <int> <chr>   
## 1     3     4 Matilda 
## 2     3     5 Olivia  
## 3     3     6 Nicholas
## 4     3     7 Paul
first_header_col <-
  dplyr::filter(all_cells, col == 2) %>%
  select(row, col, field = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
first_header_col
## # A tibble: 2 x 3
##     row   col field      
##   <int> <int> <chr>      
## 1     4     2 Humanities 
## 2     6     2 Performance
second_header_col <-
  dplyr::filter(all_cells, col == 3) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
second_header_col
## # A tibble: 4 x 3
##     row   col subject 
##   <int> <int> <chr>   
## 1     4     3 Classics
## 2     5     3 History 
## 3     6     3 Music   
## 4     7     3 Drama
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is exam scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`

unpivoted <-
  data_cells %>%
  enhead(first_header_row, "up-left") %>%
  enhead(second_header_row, "up") %>%
  enhead(first_header_col, "left-up") %>%
  enhead(second_header_col, "left")
  # Don't delet the `row` and `col` columns yet, because we need them to join on
  # the formatting

# `formats` is a pallette of fill colours that can be indexed by the
# `local_format_id` of a given cell to get the fill colour of that cell
fill_colours <- xlsx_formats(path)$local$fill$patternFill$fgColor$rgb

# Import all the cells, filter out the header row, filter for the first column,
# and create a new column `approximate` based on the fill colours, by looking up
# the local_format_id of each cell in the `formats` pallette.
annotations <-
  xlsx_cells(path, sheets = "pivot-annotations") %>%
  dplyr::filter(row >= 4, col >= 4) %>% # Omit the headers
  mutate(fill_colour = fill_colours[local_format_id]) %>%
  select(row, col, fill_colour)
annotations
## # A tibble: 16 x 3
##      row   col fill_colour
##    <int> <int> <chr>      
##  1     4     4 <NA>       
##  2     4     5 FFFFFF00   
##  3     4     6 <NA>       
##  4     4     7 <NA>       
##  5     5     4 FFFFFF00   
##  6     5     5 <NA>       
##  7     5     6 <NA>       
##  8     5     7 <NA>       
##  9     6     4 <NA>       
## 10     6     5 <NA>       
## 11     6     6 <NA>       
## 12     6     7 <NA>       
## 13     7     4 <NA>       
## 14     7     5 <NA>       
## 15     7     6 FFFFFF00   
## 16     7     7 <NA>
left_join(unpivoted, annotations, by = c("row", "col")) %>%
  select(-row, -col)
## # A tibble: 16 x 6
##    score sex    name     field       subject  fill_colour
##    <dbl> <chr>  <chr>    <chr>       <chr>    <chr>      
##  1     1 Female Matilda  Humanities  Classics <NA>       
##  2     2 Female Olivia   Humanities  Classics FFFFFF00   
##  3     3 Female Matilda  Humanities  History  FFFFFF00   
##  4     4 Female Olivia   Humanities  History  <NA>       
##  5     3 Male   Nicholas Humanities  Classics <NA>       
##  6     0 Male   Paul     Humanities  Classics <NA>       
##  7     5 Male   Nicholas Humanities  History  <NA>       
##  8     1 Male   Paul     Humanities  History  <NA>       
##  9     5 Female Matilda  Performance Music    <NA>       
## 10     6 Female Olivia   Performance Music    <NA>       
## 11     7 Female Matilda  Performance Drama    <NA>       
## 12     8 Female Olivia   Performance Drama    <NA>       
## 13     9 Male   Nicholas Performance Music    <NA>       
## 14     2 Male   Paul     Performance Music    <NA>       
## 15    12 Male   Nicholas Performance Drama    FFFFFF00   
## 16     3 Male   Paul     Performance Drama    <NA>

3.2.6 Mixed headers and notes in the same row/column, distinguished by formatting

This doesn’t use any new techniques. The trick is, when selecting a row or column of header cells, to filter out ones that have the ‘wrong’ formatting (formatting that shows they aren’t really headers). In this example, cells with italic or red text aren’t headers, even if they are in amongst header cells.

First, identify the IDs of formats that have italic or red text.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
formats <- xlsx_formats(path)

italic <- which(formats$local$font$italic)

# For 'red' we can either look for the RGB code for red "FFFF0000"
red <- which(formats$local$font$color$rgb == "FFFF0000")
red
## [1] 12 13 14 40 41
# Or we can find out what that code is by starting from a cell that we know is
# red.
red_cell_format_id <-
  xlsx_cells(path, sheets = "pivot-notes") %>%
  dplyr::filter(row == 5, col == 2) %>%
  pull(local_format_id)
red_cell_format_id
## [1] 40
red_rgb <- formats$local$font$color$rgb[red_cell_format_id]
red <- which(formats$local$font$color$rgb == red_rgb)
red
## [1] 12 13 14 40 41

Now we select the headers, filtering out cells with the format IDs of red or italic cells.

all_cells <-
  xlsx_cells(path, sheets = "pivot-notes") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, character, numeric, local_format_id) %>%
  print()
## # A tibble: 31 x 5
##      row   col character  numeric local_format_id
##    <int> <int> <chr>        <dbl>           <int>
##  1     2     4 Female          NA              18
##  2     2     6 Male            NA              18
##  3     2     7 0 = absent      NA              39
##  4     3     4 Matilda         NA              20
##  5     3     5 Olivia          NA              21
##  6     3     6 Nicholas        NA              20
##  7     3     7 Paul            NA              21
##  8     4     2 Humanities      NA              18
##  9     4     3 Classics        NA              19
## 10     4     4 <NA>             1              33
## # … with 21 more rows
first_header_row <-
  dplyr::filter(all_cells, row == 2, !(local_format_id %in% c(red, italic))) %>%
  select(row, col, sex = character)
  # the title of this header is 'sex'
  # the cells are text cells (`"Female"` and `"Male"`) so take the value in the
  # '`character` column.
first_header_row
## # A tibble: 2 x 3
##     row   col sex   
##   <int> <int> <chr> 
## 1     2     4 Female
## 2     2     6 Male
first_header_col <-
  dplyr::filter(all_cells, col == 2, !(local_format_id %in% c(red, italic))) %>%
  select(row, col, qualification = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
first_header_col
## # A tibble: 2 x 3
##     row   col qualification
##   <int> <int> <chr>        
## 1     4     2 Humanities   
## 2     6     2 Performance
second_header_col <-
  dplyr::filter(all_cells, col == 3) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.

data_cells %>%
  enhead(first_header_row, "up-left") %>%
  enhead(first_header_col, "left-up") %>%
  select(-row, -col)
## # A tibble: 16 x 3
##    score sex    qualification
##    <dbl> <chr>  <chr>        
##  1     1 Female Humanities   
##  2     2 Female Humanities   
##  3     3 Female Humanities   
##  4     4 Female Humanities   
##  5     3 Male   Humanities   
##  6     0 Male   Humanities   
##  7     5 Male   Humanities   
##  8     1 Male   Humanities   
##  9     5 Female Performance  
## 10     6 Female Performance  
## 11     7 Female Performance  
## 12     8 Female Performance  
## 13     9 Male   Performance  
## 14     2 Male   Performance  
## 15    12 Male   Performance  
## 16     3 Male   Performance

3.2.7 Mixed levels of headers in the same row/column, distinguished by formatting

Normally different levels of headers are in different rows, or different columns, like Two clear rows of text column headers, left-aligned. But sometimes they coexist in the same row or column, and are distinguishable by formatting, e.g. bold for the top level, italic for the mid level, and plain for the lowest level.

In this example, there is a single column of row headers, where the levels are shown by different amounts of indentation. The indentation is done by formatting, rather than by leading spaces or tabs.

The first step is to find the format IDs of all the different levels of indentation.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
formats <- xlsx_formats(path)

indent0 <- which(formats$local$alignment$indent == 0)
indent1 <- which(formats$local$alignment$indent == 1)

indent0
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 45 47 48 49 50 51 52 53 54 55 56 57 58 59
indent1
## [1] 44 46

Now we use these format IDs to indentify the different levels of headers in the first column.

all_cells <-
  xlsx_cells(path, sheets = "pivot-hierarchy") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric, local_format_id) %>%
  print()
## # A tibble: 16 x 6
##      row   col data_type character   numeric local_format_id
##    <int> <int> <chr>     <chr>         <dbl>           <int>
##  1     2     3 character Matilda          NA              18
##  2     2     4 character Nicholas         NA              42
##  3     3     2 character Humanities       NA              18
##  4     4     2 character Classics         NA              44
##  5     4     3 numeric   <NA>              1              20
##  6     4     4 numeric   <NA>              3              45
##  7     5     2 character History          NA              44
##  8     5     3 numeric   <NA>              3              20
##  9     5     4 numeric   <NA>              5              45
## 10     6     2 character Performance      NA              20
## 11     7     2 character Music            NA              44
## 12     7     3 numeric   <NA>              5              20
## 13     7     4 numeric   <NA>              9              45
## 14     8     2 character Drama            NA              46
## 15     8     3 numeric   <NA>              7              24
## 16     8     4 numeric   <NA>             12              47
field <-
  dplyr::filter(all_cells, col == 2, local_format_id %in% indent0) %>%
  select(row, col, field = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
field
## # A tibble: 2 x 3
##     row   col field      
##   <int> <int> <chr>      
## 1     3     2 Humanities 
## 2     6     2 Performance
subject <-
  dplyr::filter(all_cells, col == 2, local_format_id %in% indent1) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
subject
## # A tibble: 4 x 3
##     row   col subject 
##   <int> <int> <chr>   
## 1     4     2 Classics
## 2     5     2 History 
## 3     7     2 Music   
## 4     8     2 Drama
name <-
  dplyr::filter(all_cells, row == 2) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
name
## # A tibble: 2 x 3
##     row   col name    
##   <int> <int> <chr>   
## 1     2     3 Matilda 
## 2     2     4 Nicholas
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is exam scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`

data_cells %>%
  enhead(field, "left-up") %>%
  enhead(subject, "left") %>%
  enhead(name, "up") %>%
  select(-row, -col)
## # A tibble: 8 x 4
##   score field       subject  name    
##   <dbl> <chr>       <chr>    <chr>   
## 1     1 Humanities  Classics Matilda 
## 2     3 Humanities  Classics Nicholas
## 3     3 Humanities  History  Matilda 
## 4     5 Humanities  History  Nicholas
## 5     5 Performance Music    Matilda 
## 6     9 Performance Music    Nicholas
## 7     7 Performance Drama    Matilda 
## 8    12 Performance Drama    Nicholas

3.2.8 Repeated rows/columns of headers within the table

Repetitions can simply be ignored. Select one of the sets of headers, and use it for all the data. In this example, the data cells are easy to distinguish from the headers mixed in among them, because only the data cells have the numeric data type.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <-
  xlsx_cells(path, sheets = "pivot-repeated-headers") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric) %>%
  print()
## # A tibble: 80 x 5
##      row   col data_type character numeric
##    <int> <int> <chr>     <chr>       <dbl>
##  1     2     4 character Term 1         NA
##  2     2     5 character Term 2         NA
##  3     2     6 character Term 3         NA
##  4     3     2 character Classics       NA
##  5     3     3 character Matilda        NA
##  6     3     4 numeric   <NA>            1
##  7     3     5 numeric   <NA>            8
##  8     3     6 numeric   <NA>            7
##  9     4     3 character Nicholas       NA
## 10     4     4 numeric   <NA>            3
## # … with 70 more rows
# View the cells in their original positions on the spreadsheet
rectify(all_cells)
## # A tibble: 20 x 6
##    `row/col` `2(B)`   `3(C)`   `4(D)` `5(E)` `6(F)`
##        <int> <chr>    <chr>    <chr>  <chr>  <chr> 
##  1         2 <NA>     <NA>     Term 1 Term 2 Term 3
##  2         3 Classics Matilda  1      8      7     
##  3         4 <NA>     Nicholas 3      1      2     
##  4         5 <NA>     Olivia   4      0      1     
##  5         6 <NA>     Paul     2      4      8     
##  6         7 <NA>     <NA>     Term 1 Term 2 Term 3
##  7         8 History  Matilda  4      7      3     
##  8         9 <NA>     Nicholas 3      5      5     
##  9        10 <NA>     Olivia   9      8      5     
## 10        11 <NA>     Paul     6      2      0     
## 11        12 <NA>     <NA>     Term 1 Term 2 Term 3
## 12        13 Music    Matilda  2      9      9     
## 13        14 <NA>     Nicholas 1      7      7     
## 14        15 <NA>     Olivia   0      3      5     
## 15        16 <NA>     Paul     2      2      3     
## 16        17 <NA>     <NA>     Term 1 Term 2 Term 3
## 17        18 Drama    Matilda  9      8      9     
## 18        19 <NA>     Nicholas 1      3      4     
## 19        20 <NA>     Olivia   6      1      4     
## 20        21 <NA>     Paul     6      0      2
# The 'term' headers appear four times, but only the first one is needed.
term <-
  dplyr::filter(all_cells, row == 2) %>%
  select(row, col, term = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
term
## # A tibble: 3 x 3
##     row   col term  
##   <int> <int> <chr> 
## 1     2     4 Term 1
## 2     2     5 Term 2
## 3     2     6 Term 3
subject <-
  dplyr::filter(all_cells, col == 2) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
subject
## # A tibble: 4 x 3
##     row   col subject 
##   <int> <int> <chr>   
## 1     3     2 Classics
## 2     8     2 History 
## 3    13     2 Music   
## 4    18     2 Drama
name <-
  dplyr::filter(all_cells, col == 3) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
name
## # A tibble: 16 x 3
##      row   col name    
##    <int> <int> <chr>   
##  1     3     3 Matilda 
##  2     4     3 Nicholas
##  3     5     3 Olivia  
##  4     6     3 Paul    
##  5     8     3 Matilda 
##  6     9     3 Nicholas
##  7    10     3 Olivia  
##  8    11     3 Paul    
##  9    13     3 Matilda 
## 10    14     3 Nicholas
## 11    15     3 Olivia  
## 12    16     3 Paul    
## 13    18     3 Matilda 
## 14    19     3 Nicholas
## 15    20     3 Olivia  
## 16    21     3 Paul
# The data cells are distinguished from the 'term' headers by their data type --
# the data cells are numeric, whereas the term headers are character.
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is exam scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`
data_cells
## # A tibble: 48 x 3
##      row   col score
##    <int> <int> <dbl>
##  1     3     4     1
##  2     3     5     8
##  3     3     6     7
##  4     4     4     3
##  5     4     5     1
##  6     4     6     2
##  7     5     4     4
##  8     5     5     0
##  9     5     6     1
## 10     6     4     2
## # … with 38 more rows
data_cells %>%
  enhead(term, "up") %>%
  enhead(subject, "up-left") %>%
  enhead(name, "left") %>%
  select(-row, -col)
## # A tibble: 48 x 4
##    score term   subject  name    
##    <dbl> <chr>  <chr>    <chr>   
##  1     1 Term 1 Classics Matilda 
##  2     8 Term 2 Classics Matilda 
##  3     7 Term 3 Classics Matilda 
##  4     3 Term 1 Classics Nicholas
##  5     1 Term 2 Classics Nicholas
##  6     2 Term 3 Classics Nicholas
##  7     4 Term 1 Classics Olivia  
##  8     0 Term 2 Classics Olivia  
##  9     1 Term 3 Classics Olivia  
## 10     2 Term 1 Classics Paul    
## # … with 38 more rows

3.2.9 Headers amongst the data

This happens when what is actually a row-header, instead of being presented to the left of the data, is presented above the data. (Alternatively, what is actually a column header, instead of being presented above the data, is presented to the side.)

The way to handle it is to pretend that it is a row header, and use the "left-up" direction as normal.

path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
all_cells <-
  xlsx_cells(path, sheets = "pivot-header-within-data") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric, local_format_id) %>%
  print()
## # A tibble: 80 x 6
##      row   col data_type character numeric local_format_id
##    <int> <int> <chr>     <chr>       <dbl>           <int>
##  1     2     3 character Classics       NA               2
##  2     3     3 character Term 1         NA              20
##  3     3     4 character Term 2         NA              37
##  4     3     5 character Term 3         NA              21
##  5     4     2 character Matilda        NA              18
##  6     4     3 numeric   <NA>            4              18
##  7     4     4 numeric   <NA>            0              27
##  8     4     5 numeric   <NA>            7              19
##  9     5     2 character Nicholas       NA              20
## 10     5     3 numeric   <NA>            4              20
## # … with 70 more rows
# View the cells in their original positions on the spreadsheet
rectify(all_cells)
## # A tibble: 24 x 5
##    `row/col` `2(B)`   `3(C)`   `4(D)` `5(E)`
##        <int> <chr>    <chr>    <chr>  <chr> 
##  1         2 <NA>     Classics <NA>   <NA>  
##  2         3 <NA>     Term 1   Term 2 Term 3
##  3         4 Matilda  4        0      7     
##  4         5 Nicholas 4        6      2     
##  5         6 Olivia   9        9      9     
##  6         7 Paul     5        0      0     
##  7         8 <NA>     History  <NA>   <NA>  
##  8         9 <NA>     Term 1   Term 2 Term 3
##  9        10 Matilda  0        4      2     
## 10        11 Nicholas 2        5      2     
## # … with 14 more rows
bold <- which(xlsx_formats(path)$local$font$bold)

# The subject headers, though mixed with the data and the 'term' headers, are
# distinguishable by the data type "character" and by being bold.
subject <-
  dplyr::filter(all_cells,
         col == 3,
         data_type == "character",
         local_format_id %in% bold) %>%
  select(row, col, subject = character)
  # The title of this header is 'subject'
  # The cells are text cells (`"history"`, etc.) so take the value in the
  # '`character` column.
subject
## # A tibble: 4 x 3
##     row   col subject 
##   <int> <int> <chr>   
## 1     2     3 Classics
## 2     8     3 History 
## 3    14     3 Music   
## 4    20     3 Drama
# We only need one set of the 'term' headers
term <-
  dplyr::filter(all_cells, row == 3, data_type == "character") %>%
  select(row, col, term = character)
  # the title of this header is 'field', meaning 'group of subjects'.
  # The cells are text cells (`"Humanities"`, `"Performance"`) so take the value
  # in the '`character` column.
term
## # A tibble: 3 x 3
##     row   col term  
##   <int> <int> <chr> 
## 1     3     3 Term 1
## 2     3     4 Term 2
## 3     3     5 Term 3
name <-
  dplyr::filter(all_cells, col == 2) %>%
  select(row, col, name = character)
  # The title of this header is 'name'.
  # The cells are text cells, so take the value in the '`character` column.
name
## # A tibble: 16 x 3
##      row   col name    
##    <int> <int> <chr>   
##  1     4     2 Matilda 
##  2     5     2 Nicholas
##  3     6     2 Olivia  
##  4     7     2 Paul    
##  5    10     2 Matilda 
##  6    11     2 Nicholas
##  7    12     2 Olivia  
##  8    13     2 Paul    
##  9    16     2 Matilda 
## 10    17     2 Nicholas
## 11    18     2 Olivia  
## 12    19     2 Paul    
## 13    22     2 Matilda 
## 14    23     2 Nicholas
## 15    24     2 Olivia  
## 16    25     2 Paul
# The data cells are distinguished from the 'subject' headers by their data
# type -- the data cells are numeric, whereas the term headers are character.
data_cells <-
  dplyr::filter(all_cells, data_type == "numeric") %>%
  select(row, col, score = numeric)
  # The data is exam scores in certain subjects, so give the data that title.
  # The data is numeric, so select only that 'value'.  If some of the data was
  # also text or true/false, then you would select the `character` and `logical`
  # columns as well as `numeric`
data_cells
## # A tibble: 48 x 3
##      row   col score
##    <int> <int> <dbl>
##  1     4     3     4
##  2     4     4     0
##  3     4     5     7
##  4     5     3     4
##  5     5     4     6
##  6     5     5     2
##  7     6     3     9
##  8     6     4     9
##  9     6     5     9
## 10     7     3     5
## # … with 38 more rows
data_cells %>%
  enhead(subject, "left-up") %>%
  enhead(term, "up") %>%
  enhead(name, "left") %>%
  select(-row, -col)
## # A tibble: 48 x 4
##    score subject  term   name    
##    <dbl> <chr>    <chr>  <chr>   
##  1     4 Classics Term 1 Matilda 
##  2     0 Classics Term 2 Matilda 
##  3     7 Classics Term 3 Matilda 
##  4     4 Classics Term 1 Nicholas
##  5     6 Classics Term 2 Nicholas
##  6     2 Classics Term 3 Nicholas
##  7     9 Classics Term 1 Olivia  
##  8     9 Classics Term 2 Olivia  
##  9     9 Classics Term 3 Olivia  
## 10     5 Classics Term 1 Paul    
## # … with 38 more rows