5.5 Superscript symbols
This is pernicious. What was Paula’s score, in the table below?
path <- system.file("extdata", "worked-examples.xlsx", package = "unpivotr")
read_excel(path, sheet = "superscript symbols")
## # A tibble: 2 x 2
## Name Score
## <chr> <chr>
## 1 Paula 91
## 2 Matilda 10
The answer is, it’s not Paula, it’s Paul (superscript ‘a’), who scored 9 (superscript ‘1’).
This sort of thing is difficult to spot. There’s a clue in the ‘Score’ column,
which has been coerced to character so that the author could enter the
superscript ‘1’ (Excel doesn’t allow superscripts in numeric cells), But it
would be easy to interpret that as an accident of translation, and simply coerce
back to numeric with as.integer()
.
With tidyxl, you can count the rows of each element of the character_formatted
column to identify cells that have in-cell formatting.
xlsx_cells(path, sheet = "superscript symbols") %>%
dplyr::filter(data_type == "character") %>%
dplyr::filter(map_int(character_formatted, nrow) != 1) %>%
select(row, col, character)
## # A tibble: 2 x 3
## row col character
## <int> <int> <chr>
## 1 2 1 Paula
## 2 2 2 91
The values and symbols can then be separated by assuming the value is the first string, and the symbol is the second.
xlsx_cells(path, sheet = "superscript symbols") %>%
mutate(character = map_chr(character_formatted,
~ ifelse(is.null(.x), character, .x$character[1])),
symbol = map_chr(character_formatted,
~ ifelse(is.null(.x), NA, .x$character[2])),
numeric = if_else(row > 1 & col == 2 & data_type == "character",
as.numeric(character),
numeric),
character = if_else(is.na(numeric), character, NA_character_)) %>%
select(row, col, numeric, character, symbol)
## Warning in if_else(row > 1 & col == 2 & data_type == "character", as.numeric(character), : NAs
## introduced by coercion
## # A tibble: 6 x 5
## row col numeric character symbol
## <int> <int> <dbl> <chr> <chr>
## 1 1 1 NA Name <NA>
## 2 1 2 NA Score <NA>
## 3 2 1 NA Paul a
## 4 2 2 9 <NA> 1
## 5 3 1 NA Matilda <NA>
## 6 3 2 10 <NA> <NA>