9.7 School performance

A certain United States state education department provides its schools with spreadsheets of statistics. I bet the children in that state get a great education, because there’s at least one R enthusiast on the staff whose curiosity has never left them.

9.7.1 Sheet 1

The first sheet is an example of mixed headers in column 1 being distinguished by bold formatting. Filter for the bold cells in column 1 and assign them to a variable. Then behead() the other headers, and finally enhead() the bold headers back on.

Download the file, modified from an original source provided to the author.

cells <-
  xlsx_cells(smungs::school, "Sheet1") %>%
  dplyr::filter(!is_blank)

formats <- xlsx_formats(smungs::school)

bold_headers <-
  cells %>%
  dplyr::filter(col == 1L, formats$local$font$bold[local_format_id]) %>%
  select(row, col, bold_header = character)

cells %>%
  behead("up-left", "metric") %>%
  behead("left", "plain-header") %>%
  enhead(bold_headers, "left-up") %>%
  select(row, data_type, numeric, metric, `plain-header`) %>%
  spatter(metric) %>%
  select(-row)

## # A tibble: 21 x 10
##    `plain-header` `% Advanced` `% Needs Improv… `% Proficient` `% Proficient o…
##    <chr>                 <dbl>            <dbl>          <dbl>            <dbl>
##  1 All Students          0.515           0.0297          0.446            0.960
##  2 Economically …        0.333           0.0667          0.567            0.9  
##  3 Non-Economica…        0.592           0.0141          0.394            0.986
##  4 Students w/ D…       NA              NA              NA               NA    
##  5 Non-Disabled          0.565           0.0217          0.413            0.978
##  6 ELL                  NA              NA              NA               NA    
##  7 Non-ELL               0.525           0.0202          0.444            0.970
##  8 African Amer.…       NA              NA              NA               NA    
##  9 Asian                NA              NA              NA               NA    
## 10 Hispanic/Lati…       NA              NA              NA               NA    
## # … with 11 more rows, and 5 more variables: `% Warning/ Failing` <dbl>, CPI <dbl>, `Median
## #   SGP` <dbl>, `N Included` <dbl>, `N Included in SGP` <dbl>

9.7.2 Sheet 2

The second sheet is variation on two clear rows of text column headers, left aligned. Here, there are three rows of colum headers. The first row is left-aligned, and the second and third rows are directly above the data cells. But the second row is blank above columns D and E. That doesn’t actually matter; in the output, header_2 will be NA for data from those columns.

Download the file, modified from an original source provided to the author.

xlsx_cells(smungs::school, "Sheet2") %>%
  select(row, col, address, data_type, character, numeric, is_blank) %>%
  mutate(character = str_trim(character)) %>%
  behead("up-left", "header_1") %>%
  behead("up", "header_2") %>%
  behead("up", "header_3") %>%
  behead("left", "classroom") %>%
  dplyr::filter(!is_blank, !is.na(header_3)) %>%
  arrange(col, row)

## # A tibble: 32 x 11
##      row   col address data_type character numeric is_blank header_1 header_2 header_3
##    <int> <int> <chr>   <chr>     <chr>       <dbl> <lgl>    <chr>    <chr>    <chr>   
##  1     5     4 D5      character 10         NA     FALSE    MCAS Su… <NA>     MCAS Gr…
##  2     6     4 D6      character 10         NA     FALSE    MCAS Su… <NA>     MCAS Gr…
##  3     7     4 D7      character 10         NA     FALSE    MCAS Su… <NA>     MCAS Gr…
##  4     8     4 D8      character 10         NA     FALSE    MCAS Su… <NA>     MCAS Gr…
##  5     5     5 E5      character 4          NA     FALSE    MCAS Su… <NA>     # Tested
##  6     6     5 E6      character 8          NA     FALSE    MCAS Su… <NA>     # Tested
##  7     7     5 E7      character 5          NA     FALSE    MCAS Su… <NA>     # Tested
##  8     8     5 E8      character 10         NA     FALSE    MCAS Su… <NA>     # Tested
##  9     5     6 F5      numeric   <NA>        0.342 FALSE    Possibl… Total P… %       
## 10     6     6 F6      numeric   <NA>        0.319 FALSE    Possibl… Total P… %       
## # … with 22 more rows, and 1 more variable: classroom <chr>

9.7.3 Sheet 3

The third sheet is variation on two clear rows of text column headers, left aligned, with a nasty catch. The creator of the spreadsheet didn’t merge cells to make space for more words. They didn’t even ‘centre across selection’ (which is sometimes safer than merging cells). Instead, they wrote each word on a separate line, meaning it is ambiguous whether a cell part of another header, or a header in its own right.

Download the file, modified from an original source provided to the author.

Compare columns C and D. Column C has a single header, “Avg Years w/ Class Data”, written across four cells. Column D has two levels of headers, “Years in MA” first, then “% 3+” nested within it (and written across two cells). There’s no way for a machine to tell which cells are whole headers, and which are parts of headers.

We can deal with this by first treating every cell as a header in its own right, and then concatenating the headers of rows 2 to 5. Using the "up-left" direction, headers like “Years in MA” in cell D4 will be carried to the right, which is good. Unfortunately so will headers like “# Students” in cell B2, which we’ll just have to put up with.

cells <-
  xlsx_cells(smungs::school, "Sheet3") %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric)

x <-
  cells %>%
  behead("left", "place") %>%
  behead("up-left", "category") %>%
  behead("up-left", "metric-cell-1") %>% # Treat every cell in every row as a header
  behead("up-left", "metric-cell-2") %>%
  behead("up-left", "metric-cell-3") %>%
  behead("up-left", "metric-cell-4") %>%
  behead("up-left", "metric-cell-5")
glimpse(x)

## Rows: 36
## Columns: 12
## $ row             <int> 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8…
## $ col             <int> 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, …
## $ data_type       <chr> "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "…
## $ character       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ numeric         <dbl> 1.000000000, 1.000000000, 1.000000000, 0.842777337, 0.896170213, 0.…
## $ place           <chr> "State (All Students)", "Region", "School", "State (All Students)",…
## $ category        <chr> "STUDENTS", "STUDENTS", "STUDENTS", "EDUCATOR EXPERIENCE", "EDUCATO…
## $ `metric-cell-1` <chr> "# Students", "# Students", "# Students", "# Students", "# Students…
## $ `metric-cell-2` <chr> "Avg", "Avg", "Avg", "Avg", "Avg", "Avg", "Avg", "Avg", "Avg", "Avg…
## $ `metric-cell-3` <chr> "Years w/", "Years w/", "Years w/", "Years in MA", "Years in MA", "…
## $ `metric-cell-4` <chr> "Class", "Class", "Class", "%", "%", "%", "%", "%", "%", "%", "%", …
## $ `metric-cell-5` <chr> "Data", "Data", "Data", "3+", "3+", "3+", "1-2", "1-2", "1-2", "0-1…

Above you can see that every cell in every header row has been treated as a header in its own right, e.g. "Avg" is a level-2 header, and "Years w/" is a level-3 header. The next step is to paste them together into a single header.

x <-
  x %>%
  # Replace NA with "" otherwise unite() will spell it as "NA".
  # This is a common irritation.
  # https://stackoverflow.com/questions/13673894/suppress-nas-in-paste
  mutate_at(vars(starts_with("metric-cell-")), replace_na, "") %>%
  unite("metric", starts_with("metric-cell-"), sep = " ") %>%
  mutate(metric = str_trim(metric))
glimpse(x)

## Rows: 36
## Columns: 8
## $ row       <int> 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7…
## $ col       <int> 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10…
## $ data_type <chr> "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeri…
## $ character <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ numeric   <dbl> 1.000000000, 1.000000000, 1.000000000, 0.842777337, 0.896170213, 0.846153…
## $ place     <chr> "State (All Students)", "Region", "School", "State (All Students)", "Regi…
## $ category  <chr> "STUDENTS", "STUDENTS", "STUDENTS", "EDUCATOR EXPERIENCE", "EDUCATOR EXPE…
## $ metric    <chr> "# Students Avg Years w/ Class Data", "# Students Avg Years w/ Class Data…

Now the headers are manageable. They aren’t perfect – the "# Students" header has leaked into "# Students Avg Years w/ Class Data", but that can be cleaned up manually later. At least "# Students Avg Years w/ Class Data" is within the "STUDENTS" category, which is the hard part.

Spreading this data is the final step to make it easy to work with.

x %>%
  select(place, category, metric, numeric) %>%
  spread(place, numeric) %>%
  print(n = Inf)

## # A tibble: 12 x 5
##    category              metric                           Region   School `State (All Student…
##    <chr>                 <chr>                             <dbl>    <dbl>                <dbl>
##  1 EDUCATOR EXPERIENCE   # Students Avg Years in MA %…   0.0439    0                    0.0535
##  2 EDUCATOR EXPERIENCE   # Students Avg Years in MA %…   0.0599    0.154                0.104 
##  3 EDUCATOR EXPERIENCE   # Students Avg Years in MA %…   0.896     0.846                0.843 
##  4 EDUCATOR EXPERIENCE   # Students PTS % Non- PTS       0.248     0.0684               0.247 
##  5 EDUCATOR EXPERIENCE   # Students PTS Years in MA %…   0.752     0.932                0.753 
##  6 EDUCATOR QUALIFICATI… # Students % % In Field         0.944     1                    0.903 
##  7 EDUCATOR QUALIFICATI… # Students % % Non-SEI Endor…  NA        NA                   NA     
##  8 EDUCATOR QUALIFICATI… # Students % % SEI Endorsed    NA        NA                   NA     
##  9 EDUCATOR QUALIFICATI… # Students % Long Term Subs     0.00182   0                    0.0112
## 10 EDUCATOR QUALIFICATI… # Students % Out of Field Fi…   0.0556    0                    0.0965
## 11 STUDENTS              # Students                    625       116               738499     
## 12 STUDENTS              # Students Avg Years w/ Clas…   1         1                    1