9.4 Toronto Transit Commission

Tweet by Sharla Gelfand about tidying Toronto Transit Commission data with R

This table shows the number of trips recorded on the Toronto Transit Commission per year, by type of ticket, person, vehicle, and weeday/weekend/holiday.

Sharla Gelfand’s annotated screenshot explains the structure, and see her excellent blog post for how she wrangled it with standard tidyverse tools. I show here an alternative method with tidyxl and unpivotr.

Download the file. Original source.

9.4.1 The full code listing

cells <-
  xlsx_cells(smungs::toronto_transit) %>%
  dplyr::filter(!is_blank, row >= 6)

fare <-
  cells %>%
  dplyr::filter(col == 2,
                !str_detect(character, "^ "),
                !str_detect(character, "TOTAL")) %>%
  select(row, col, fare = character)

cells %>%
  behead("up", "year", formatters = list(character = str_trim)) %>%
  behead("left-up", "context") %>%
  behead("left", "media", formatters = list(character = str_trim)) %>%
  enhead(fare, "left-up") %>%
  dplyr::filter(!str_detect(media, "TOTAL")) %>%
  separate(year, c("year", "note"), sep = " ", fill = "right") %>%
  select(year, context, fare, media, count = numeric)
## # A tibble: 1,188 x 5
##    year  context fare  media   count
##    <chr> <chr>   <chr> <chr>   <dbl>
##  1 2017  WHO     ADULT TOKENS  76106
##  2 2016  WHO     ADULT TOKENS 102073
##  3 2015  WHO     ADULT TOKENS 110945
##  4 2014  WHO     ADULT TOKENS 111157
##  5 2013  WHO     ADULT TOKENS 112360
##  6 2012  WHO     ADULT TOKENS 117962
##  7 2011  WHO     ADULT TOKENS 124748
##  8 2010  WHO     ADULT TOKENS 120366
##  9 2009  WHO     ADULT TOKENS 114686
## 10 2008  WHO     ADULT TOKENS  94210
## # … with 1,178 more rows

9.4.2 Step by step

Although the annotations point out that there are really three separate tables (WHO, WHERE and WHEN), they can be imported as one.

Column 2 has two levels of headers in it: the fare in bold (“ADULT”, “BUS”, etc.), and the media used to pay for it indented by a few spaces (“TOKENS”, “WEEKLY PASS”, etc.).

Because behead() can’t distinguish between different levels of headers in the same column, we need to put the bold fare headers into a separate variable on their own, and enhead() them back onto the rest of the table later.

Unfortunately the fare headers in the “WHEN” context aren’t bold, so rather than filter for bold headers, instead we filter for headers that aren’t indented by spaces. We also filter out any “TOTAL” headers.

cells <-
  xlsx_cells(smungs::toronto_transit) %>%
  dplyr::filter(!is_blank, row >= 6)

fare <-
  cells %>%
  dplyr::filter(col == 2,
                !str_detect(character, "^ "), # Filter out indented headers
                !str_detect(character, "TOTAL")) %>% # Filteroout totals
  select(row, col, fare = character)
fare
## # A tibble: 7 x 3
##     row   col fare           
##   <int> <int> <chr>          
## 1     7     2 ADULT          
## 2    21     2 SENIOR/STUDENT 
## 3    31     2 CHILDREN       
## 4    43     2 BUS            
## 5    46     2 RAIL           
## 6    53     2 WEEKDAY        
## 7    54     2 WEEKEND/HOLIDAY
ttc <-
  cells %>%
  behead("up", "year") %>%
  behead("left-up", "context") %>%
  behead("left", "media") %>%
  enhead(fare, "left-up") %>%
  dplyr::filter(!str_detect(media, "TOTAL")) %>%
  select(year, context, fare, media, count = numeric)
ttc
## # A tibble: 1,188 x 5
##    year      context fare  media          count
##    <chr>     <chr>   <chr> <chr>          <dbl>
##  1 "2017"    WHO     ADULT "     TOKENS"  76106
##  2 "2016"    WHO     ADULT "     TOKENS" 102073
##  3 " 2015 *" WHO     ADULT "     TOKENS" 110945
##  4 "2014"    WHO     ADULT "     TOKENS" 111157
##  5 "2013"    WHO     ADULT "     TOKENS" 112360
##  6 "2012"    WHO     ADULT "     TOKENS" 117962
##  7 "2011"    WHO     ADULT "     TOKENS" 124748
##  8 "2010"    WHO     ADULT "     TOKENS" 120366
##  9 "2009"    WHO     ADULT "     TOKENS" 114686
## 10 "2008"    WHO     ADULT "     TOKENS"  94210
## # … with 1,178 more rows

There’s a bit more cosmetic cleaning to do. The indentation can be trimmed from the media and the year headers, and the asterisk removed from the year 2015 *.

ttc %>%
  mutate(year = str_trim(year), media = str_trim(media)) %>%
  separate(year, c("year", "note"), sep = " ", fill = "right") %>%
  select(-note)
## # A tibble: 1,188 x 5
##    year  context fare  media   count
##    <chr> <chr>   <chr> <chr>   <dbl>
##  1 2017  WHO     ADULT TOKENS  76106
##  2 2016  WHO     ADULT TOKENS 102073
##  3 2015  WHO     ADULT TOKENS 110945
##  4 2014  WHO     ADULT TOKENS 111157
##  5 2013  WHO     ADULT TOKENS 112360
##  6 2012  WHO     ADULT TOKENS 117962
##  7 2011  WHO     ADULT TOKENS 124748
##  8 2010  WHO     ADULT TOKENS 120366
##  9 2009  WHO     ADULT TOKENS 114686
## 10 2008  WHO     ADULT TOKENS  94210
## # … with 1,178 more rows