Import ragged data with readr::melt_csv()

Melt it down and start again

Duncan Garmonsway
January 8, 2019

Standard tools like readr::read_csv() can cope to some extent with unusual inputs, like files with empty rows or newlines embedded in strings. But some files are so whacky that standard tools don’t work at all, and instead you have to take the file to pieces and reassemble it in a standard design.

The readr package has recently acquired a set of tools for taking a file to bits. They are the melt_*() family.

The melt_*() family separates delimited text files into individual cells. So “melt” isn’t quite the right name – it should be “disassemble” because it’s about separating the pieces, but “melt” is much shorter to type.

The problem with non-rectangular data

Here’s a contrived example that breaks two assumptions made by common tools like readr::read_csv().

  1. There are more cells in some rows than others.
  2. There are mixed data types within each column.

Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA

To force that data into a data frame, readr::read_csv() must make drastic adjustments.

  1. It omits the cell “1.21”.
  2. It invents an NA cell to fill a gap (row 3 column 4).
  3. It promotes the data type of every cell to character so that it can combine them in columns.

readr::read_csv(data, col_names = FALSE)

# A tibble: 3 x 4
  X1              X2    X3         X4   
  <chr>           <chr> <chr>      <chr>
1 Help            <NA>  007        I'm  
2 1960-09-30      FALSE trapped in 7    
3 non-rectangular data  <NA>       <NA> 

Omission of cells is a big problem because it could occur silently. You can force readr() to read the 1.21 cell by specifiying column names for five columns, but then it has to invent NA cells to plug the gaps in the fifth column.


readr::read_csv(data, col_names = letters[1:5])

# A tibble: 3 x 5
  a               b     c          d         e
  <chr>           <chr> <chr>      <chr> <dbl>
1 Help            <NA>  007        I'm   NA   
2 1960-09-30      FALSE trapped in 7      1.21
3 non-rectangular data  <NA>       <NA>  NA   

The invention of NA cells is a problem when you need to know the difference between a cell being missing altogether, and a cell that contains a ‘missing’ value. This is a bit niche, but a short row (fewer cells) could mark a different table within the same file (it happens), whereas NA means that a value might have existed yet wasn’t recorded.

Type promotion is the least of the problems because it’s lossless – the file is a text file in the first place. But it would be nice to traverse the cells and pluck out, say, any dates.

Introducing readr::melt_csv()

Now see what melt_csv() does. Remember the csv file looks like this.


Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA

The melt_csv() function reads the file one cell at a time, importing each cell of the file into a whole row of the final data frame.


readr::melt_csv(data)

# A tibble: 12 x 4
     row   col data_type value          
   <dbl> <dbl> <chr>     <chr>          
 1     1     1 character Help           
 2     1     2 missing   <NA>           
 3     1     3 character 007            
 4     1     4 character I'm            
 5     2     1 date      1960-09-30     
 6     2     2 logical   FALSE          
 7     2     3 character trapped in     
 8     2     4 integer   7              
 9     2     5 double    1.21           
10     3     1 character non-rectangular
11     3     2 character data           
12     3     3 missing   <NA>           

Think of this output like painting-by-numbers. In the same way that a painting-by-numbers stencil describes a painting (the colour, shape and position of each brush-stroke), this ‘melted’ form describes a CSV file (the value, datatype and position of each cell). While a painting isn’t structured data, a painting-by-numbers stencil is structured data about a painting, and while a CSV file isn’t necessarily structured data, a melted CSV is structured data about a CSV file.

Paint-by-numbers image of a house in the countryside, before and after painting. Image from from Paint by Number: The How-To Craze that Swept the Nation by William L. Bird, via https://simanaitissays.com/2014/09/05/by-the-numbers/.
Paint-by-numbers image of a house in the countryside, before and after painting. Image from from Paint by Number: The How-To Craze that Swept the Nation by William L. Bird, via https://simanaitissays.com/2014/09/05/by-the-numbers/.

What the melt_csv() output means:

How can you use this? To begin with, you can do some simple manipulations with ordinary functions. For example you could extract the words.


data <- "Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA"

readr::melt_csv(data) %>%
  dplyr::filter(data_type == "character")

# A tibble: 6 x 4
    row   col data_type value          
  <dbl> <dbl> <chr>     <chr>          
1     1     1 character Help           
2     1     3 character 007            
3     1     4 character I'm            
4     2     3 character trapped in     
5     3     1 character non-rectangular
6     3     2 character data           

Was any of the message lost in transmission? Yes, two segments are missing.


readr::melt_csv(data) %>%
  dplyr::filter(data_type == "missing")

# A tibble: 2 x 4
    row   col data_type value
  <dbl> <dbl> <chr>     <chr>
1     1     2 missing   <NA> 
2     3     3 missing   <NA> 

Using melted data

Use-cases are hard to come by because this is a (relatively) new way to work with messy data. Try out New Zealand’s election statistics or some UK civil servants’ expenses for yourself, or watch me wrangle a #TidyTuesday dataset of US honey production with the help of my unpivotr package.

My tidyxl package melts spreadsheets in a similar way, and Spreadsheet Munging Strategies has many worked examples and case studies, most of which could apply to plain text files.

The Wicked Witch of the West, melting, from the film The Wizard of Oz.
The Wicked Witch of the West, melting, from the film The Wizard of Oz.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/nacnudus/duncangarmonsway, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Garmonsway (2019, Jan. 8). Duncan Garmonsway: Import ragged data with readr::melt_csv(). Retrieved from https://nacnudus.github.io/duncangarmonsway/posts/2018-12-29-meltcsv/

BibTeX citation

@misc{garmonsway2019import,
  author = {Garmonsway, Duncan},
  title = {Duncan Garmonsway: Import ragged data with readr::melt_csv()},
  url = {https://nacnudus.github.io/duncangarmonsway/posts/2018-12-29-meltcsv/},
  year = {2019}
}