Duncan Garmonsway: Import ragged data with readr::melt_csv()

Duncan Garmonsway

Standard tools like readr::read_csv() can cope to some extent with unusual inputs, like files with empty rows or newlines embedded in strings. But some files are so whacky that standard tools don’t work at all, and instead you have to take the file to pieces and reassemble it in a standard design.

The readr package has recently acquired a set of tools for taking a file to bits. They are the melt_*() family.

The melt_*() family separates delimited text files into individual cells. So “melt” isn’t quite the right name – it should be “disassemble” because it’s about separating the pieces, but “melt” is much shorter to type.

The problem with non-rectangular data

Here’s a contrived example that breaks two assumptions made by common tools like readr::read_csv().

There are more cells in some rows than others.
There are mixed data types within each column.


Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA

To force that data into a data frame, readr::read_csv() must make drastic adjustments.

It omits the cell “1.21”.
It invents an NA cell to fill a gap (row 3 column 4).
It promotes the data type of every cell to character so that it can combine them in columns.


readr::read_csv(data, col_names = FALSE)


# A tibble: 3 x 4
  X1              X2    X3         X4   
  <chr>           <chr> <chr>      <chr>
1 Help            <NA>  007        I'm  
2 1960-09-30      FALSE trapped in 7    
3 non-rectangular data  <NA>       <NA>

Omission of cells is a big problem because it could occur silently. You can force readr() to read the 1.21 cell by specifiying column names for five columns, but then it has to invent NA cells to plug the gaps in the fifth column.


readr::read_csv(data, col_names = letters[1:5])


# A tibble: 3 x 5
  a               b     c          d         e
  <chr>           <chr> <chr>      <chr> <dbl>
1 Help            <NA>  007        I'm   NA   
2 1960-09-30      FALSE trapped in 7      1.21
3 non-rectangular data  <NA>       <NA>  NA

The invention of NA cells is a problem when you need to know the difference between a cell being missing altogether, and a cell that contains a ‘missing’ value. This is a bit niche, but a short row (fewer cells) could mark a different table within the same file (it happens), whereas NA means that a value might have existed yet wasn’t recorded.

Type promotion is the least of the problems because it’s lossless – the file is a text file in the first place. But it would be nice to traverse the cells and pluck out, say, any dates.

Introducing `readr::melt_csv()`

Now see what melt_csv() does. Remember the csv file looks like this.


Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA

The melt_csv() function reads the file one cell at a time, importing each cell of the file into a whole row of the final data frame.


readr::melt_csv(data)


# A tibble: 12 x 4
     row   col data_type value          
   <dbl> <dbl> <chr>     <chr>          
 1     1     1 character Help           
 2     1     2 missing   <NA>           
 3     1     3 character 007            
 4     1     4 character I'm            
 5     2     1 date      1960-09-30     
 6     2     2 logical   FALSE          
 7     2     3 character trapped in     
 8     2     4 integer   7              
 9     2     5 double    1.21           
10     3     1 character non-rectangular
11     3     2 character data           
12     3     3 missing   <NA>

Think of this output like painting-by-numbers. In the same way that a painting-by-numbers stencil describes a painting (the colour, shape and position of each brush-stroke), this ‘melted’ form describes a CSV file (the value, datatype and position of each cell). While a painting isn’t structured data, a painting-by-numbers stencil is structured data about a painting, and while a CSV file isn’t necessarily structured data, a melted CSV is structured data about a CSV file.

Paint-by-numbers image of a house in the countryside, before and after painting. Image from from Paint by Number: The How-To Craze that Swept the Nation by William L. Bird, via https://simanaitissays.com/2014/09/05/by-the-numbers/. — Paint-by-numbers image of a house in the countryside, before and after painting. Image from from *Paint by Number: The How-To Craze that Swept the Nation* by William L. Bird, via https://simanaitissays.com/2014/09/05/by-the-numbers/.

What the melt_csv() output means:

The output of melt_csv() is a data frame – structured data about un-structured data!
Each row of the data frame represents one cell of the input data.
The empty cell of row 1 is imported, but the missing cells at the ends of rows 1 and 3 aren’t.
No data type conversion is attempted – every value is imported as a string, and the data_type column merely gives readr’s best guess of what the data types ought to be.

How can you use this? To begin with, you can do some simple manipulations with ordinary functions. For example you could extract the words.


data <- "Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA"

readr::melt_csv(data) %>%
  dplyr::filter(data_type == "character")


# A tibble: 6 x 4
    row   col data_type value          
  <dbl> <dbl> <chr>     <chr>          
1     1     1 character Help           
2     1     3 character 007            
3     1     4 character I'm            
4     2     3 character trapped in     
5     3     1 character non-rectangular
6     3     2 character data

Was any of the message lost in transmission? Yes, two segments are missing.


readr::melt_csv(data) %>%
  dplyr::filter(data_type == "missing")


# A tibble: 2 x 4
    row   col data_type value
  <dbl> <dbl> <chr>     <chr>
1     1     2 missing   <NA> 
2     3     3 missing   <NA>

Using melted data

Use-cases are hard to come by because this is a (relatively) new way to work with messy data. Try out New Zealand’s election statistics or some UK civil servants’ expenses for yourself, or watch me wrangle a #TidyTuesday dataset of US honey production with the help of my unpivotr package.

My tidyxl package melts spreadsheets in a similar way, and Spreadsheet Munging Strategies has many worked examples and case studies, most of which could apply to plain text files.

The Wicked Witch of the West, melting, from the film The Wizard of Oz.

Comment on this article Share:

Import ragged data with readr::melt_csv()

The problem with non-rectangular data

Introducing `readr::melt_csv()`

Using melted data

Corrections

Reuse

Citation

Import ragged data with readr::melt_csv()

The problem with non-rectangular data

Introducing readr::melt_csv()

Using melted data

Corrections

Reuse

Citation

Introducing `readr::melt_csv()`