Melt it down and start again
Standard tools like readr::read_csv()
can cope to some extent with unusual inputs, like files with empty rows or newlines embedded in strings. But some files are so whacky that standard tools don’t work at all, and instead you have to take the file to pieces and reassemble it in a standard design.
The readr package has recently acquired a set of tools for taking a file to bits. They are the melt_*()
family.
melt_csv()
, melt_csv_chunked()
melt_csv2()
, melt_csv2_chunked()
melt_tsv()
, melt_tsv_chunked()
melt_delim()
, melt_delim_chunked()
melt_table()
, melt_table2()
melt_fwf()
The melt_*()
family separates delimited text files into individual cells. So “melt” isn’t quite the right name – it should be “disassemble” because it’s about separating the pieces, but “melt” is much shorter to type.
Here’s a contrived example that breaks two assumptions made by common tools like readr::read_csv()
.
Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA
To force that data into a data frame, readr::read_csv()
must make drastic adjustments.
NA
cell to fill a gap (row 3 column 4).
readr::read_csv(data, col_names = FALSE)
# A tibble: 3 x 4
X1 X2 X3 X4
<chr> <chr> <chr> <chr>
1 Help <NA> 007 I'm
2 1960-09-30 FALSE trapped in 7
3 non-rectangular data <NA> <NA>
Omission of cells is a big problem because it could occur silently. You can force readr()
to read the 1.21
cell by specifiying column names for five columns, but then it has to invent NA
cells to plug the gaps in the fifth column.
readr::read_csv(data, col_names = letters[1:5])
# A tibble: 3 x 5
a b c d e
<chr> <chr> <chr> <chr> <dbl>
1 Help <NA> 007 I'm NA
2 1960-09-30 FALSE trapped in 7 1.21
3 non-rectangular data <NA> <NA> NA
The invention of NA
cells is a problem when you need to know the difference between a cell being missing altogether, and a cell that contains a ‘missing’ value. This is a bit niche, but a short row (fewer cells) could mark a different table within the same file (it happens), whereas NA
means that a value might have existed yet wasn’t recorded.
Type promotion is the least of the problems because it’s lossless – the file is a text file in the first place. But it would be nice to traverse the cells and pluck out, say, any dates.
readr::melt_csv()
Now see what melt_csv()
does. Remember the csv file looks like this.
Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA
The melt_csv()
function reads the file one cell at a time, importing each cell of the file into a whole row of the final data frame.
readr::melt_csv(data)
# A tibble: 12 x 4
row col data_type value
<dbl> <dbl> <chr> <chr>
1 1 1 character Help
2 1 2 missing <NA>
3 1 3 character 007
4 1 4 character I'm
5 2 1 date 1960-09-30
6 2 2 logical FALSE
7 2 3 character trapped in
8 2 4 integer 7
9 2 5 double 1.21
10 3 1 character non-rectangular
11 3 2 character data
12 3 3 missing <NA>
Think of this output like painting-by-numbers. In the same way that a painting-by-numbers stencil describes a painting (the colour, shape and position of each brush-stroke), this ‘melted’ form describes a CSV file (the value, datatype and position of each cell). While a painting isn’t structured data, a painting-by-numbers stencil is structured data about a painting, and while a CSV file isn’t necessarily structured data, a melted CSV is structured data about a CSV file.
What the melt_csv()
output means:
melt_csv()
is a data frame – structured data about un-structured data!data_type
column merely gives readr’s best guess of what the data types ought to be.How can you use this? To begin with, you can do some simple manipulations with ordinary functions. For example you could extract the words.
data <- "Help,,007,I'm
1960-09-30,FALSE,trapped in,7,1.21
non-rectangular,data,NA"
readr::melt_csv(data) %>%
dplyr::filter(data_type == "character")
# A tibble: 6 x 4
row col data_type value
<dbl> <dbl> <chr> <chr>
1 1 1 character Help
2 1 3 character 007
3 1 4 character I'm
4 2 3 character trapped in
5 3 1 character non-rectangular
6 3 2 character data
Was any of the message lost in transmission? Yes, two segments are missing.
readr::melt_csv(data) %>%
dplyr::filter(data_type == "missing")
# A tibble: 2 x 4
row col data_type value
<dbl> <dbl> <chr> <chr>
1 1 2 missing <NA>
2 3 3 missing <NA>
Use-cases are hard to come by because this is a (relatively) new way to work with messy data. Try out New Zealand’s election statistics or some UK civil servants’ expenses for yourself, or watch me wrangle a #TidyTuesday dataset of US honey production with the help of my unpivotr package.
My tidyxl package melts spreadsheets in a similar way, and Spreadsheet Munging Strategies has many worked examples and case studies, most of which could apply to plain text files.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/nacnudus/duncangarmonsway, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Garmonsway (2019, Jan. 8). Duncan Garmonsway: Import ragged data with readr::melt_csv(). Retrieved from https://nacnudus.github.io/duncangarmonsway/posts/2018-12-29-meltcsv/
BibTeX citation
@misc{garmonsway2019import, author = {Garmonsway, Duncan}, title = {Duncan Garmonsway: Import ragged data with readr::melt_csv()}, url = {https://nacnudus.github.io/duncangarmonsway/posts/2018-12-29-meltcsv/}, year = {2019} }