This post does three things:

- Questions the headline number. Many people who promote R quote the number of R blogs as given on the R-Bloggers website by Tal Galili, which syndicates literally hundreds of R-related blogs (573 at the time of writing). But the number tends only to increase. How many actual posts are there in a given week/month, from how many different blogs?
- Updates the data. I have a longer history of daily digest emails than I thought, so the data, and some of the text, have been updated to go back to October 2013.
- Explains the method. I subscribed to the R-Bloggers daily digest emails in early 2014, giving me a good time series of posts to parse.

Parsing the daily digest emails
The initial dump is easy from Gmail (define a filter > use it to apply a new label > request a dump of the labelled emails). Since the dump is in a single plain-text file, and because the amazing R community has bothered to generalise so many solutions to fiddly problems by making packages, all the remaining steps are also easy:

- `convert_mbox_eml` in the `tm.plugin.mail` package, to split the mbox dump into one file per email.
- `read_html` in the `xml2` package (which has its own magic to trim off the non-HTML email headers), to parse each message.
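A rough sketch of those two steps (the file names are placeholders of my own, not the real ones):

library(tm.plugin.mail)
library(xml2)

# Split the single mbox dump into one .eml file per message
convert_mbox_eml("rbloggers.mbox", "mail_dir")

# Parse one message; read_html trims off the non-HTML headers itself
first_email <- list.files("mail_dir", full.names = TRUE)[1]
read_html(first_email)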
It turns out that there are about 75 blogs active in a given month, posting about 160 posts (Revolutions is the only one that regularly posts more than once per week). Nothing much has changed in the last year. For some arbitrary definition of “dead blog”, a survival analysis could be done.
This was an easy project, but a few quirks soaked up a lot of time:

- Creating an empty list of a given length: `empty_list <- vector(mode = "list", length = n)`. I usually don’t think of lists as a kind of vector, and usually think of them as a class rather than a mode, but perhaps that’s just me (`is.vector(list())` is `TRUE`, after all).
- The datetimes weren’t in a `lubridate`-friendly order, so I had to rediscover `strptime` (see the fragments below).
- Non-ASCII characters arrived as quoted-printable sequences like `=C2=BB`, which I replaced one by one with `str_replace` (`stringr`).
- While `xml_find_all` in the `xml2` package understands `3D\"itemcontentlist\"` as part of an XPath string, I initially fell foul of `html_nodes` in the `rvest` package, which doesn’t seem to understand it as part of a CSS string.
- Flattening a named list into a data frame without the `purrr` or `rlist` packages: the answer is `stack`, in (hooray!) base R. There’s a tiny demo just below.
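In case `stack` is as new to you as it was to me, it turns a named list into a two-column data frame:

stack(list(a = c(1, 2), b = 3))
#   values ind
# 1      1   a
# 2      2   a
# 3      3   b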
But the biggest time-sucker by far was the bizarre way that the plain text of the emails had been wrapped to 76 characters, by sticking an equals sign and a Windows-style line ending (carriage return and line feed, i.e. `\r\n`) after the 75th character (the “soft line break” of the quoted-printable encoding, it turns out). This is snag-tastic, because it’s hard to find a tool that will both search-and-replace across line endings, and also search-and-replace multiple characters. `sed` is one of those command-line tools that lurks for years before pouncing, and this was its moment, when I finally had to learn a bit more than merely `s/pattern/replacement/g`. This StackOverflow answer explains how the following command works:

sed ':a;N;$!ba;s/=\r\n//g' dirty.mbox > clean.mbox
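For the record, the same clean-up can be done without leaving R. A minimal sketch, reading the file byte-for-byte so the `\r\n` pairs survive intact:

# Slurp the whole file into one string
txt <- readChar("dirty.mbox", file.info("dirty.mbox")$size, useBytes = TRUE)
# Delete the quoted-printable soft line breaks
txt <- gsub("=\r\n", "", txt, fixed = TRUE, useBytes = TRUE)
writeChar(txt, "clean.mbox", eos = NULL, useBytes = TRUE)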
Thank you for reading. Here are some more graphs, and some code fragments.
# NOTE: These are fragments of code. They do not stand alone.
library(xml2)      # read_html, xml_find_all, xml_text
library(stringr)   # str_sub, str_replace, str_trim, fixed
library(lubridate) # ymd_hms
library(dplyr)     # rename, mutate, and the pipe
# Collect the file names
emails <- list.files(mail_dir, full.names = TRUE)
# Remove the "confirm email address" one
# and the one that has no links to the original blogs
emails <- emails[c(-2, -342)]
# Remove any that are replies
emails <- emails[vapply(emails,
                        function(filename) {
                          !any(grepl("^In-Reply-To: <",
                                     readLines(filename, n = 10)))
                        },
                        logical(1))]
# Collect the datetimes in the first line of each file
# Also collect the journals from the subject lines
n <- length(emails)
datetimes <- vector("character", length = n)
blogcounts <- vector("character", length = n)
blogs <- vector("list", length = n)
i <- 0
for (filename in emails) {
  i <- i + 1
  datetimes[i] <- readLines(filename, n = 1)
  # Extract the links to the original blogs
  blogs[[i]] <-
    read_html(filename) %>%
    xml_find_all("//*[(@id = '3D\"itemcontentlist\"')]//div//div//strong[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a") %>%
    xml_text()
}
# Extract the datetime string
datetimes <-
  datetimes %>%
  str_sub(start = 34) %>%
  strptime(format = "%b %d %H:%M:%S %z %Y")
# Link the datetime with individual blogs
names(blogs) <- datetimes
blogs <- stack(blogs)
# Recover the dates and clean the blog names
blogs <-
  blogs %>%
  rename(blog = values, datetime = ind) %>%
  mutate(datetime = ymd_hms(datetime),
         # blog = str_replace(blog, fixed("=\\n"), ""),
         blog = str_replace(blog, fixed("=C2=BB"), "»"),
         blog = str_replace(blog, fixed("=E2=80=93"), "–"),
         blog = str_replace(blog, fixed("=E2=80=A6"), "…"),
         blog = str_replace(blog, fixed("=C3=A6"), "æ"),
         blog = str_replace(blog, fixed("=EA=B0=84=EB=93=9C=EB=A3=A8=EB=93=9C =ED=81=AC=EB=A6=AC=EC=8A=A4=ED=86=A0=ED=8C=8C"), "(간드루드 크리스토파)"),
         blog = str_replace(blog, fixed("=D0=AF=D1=82=D0=BE=D0=BC=D0=B8=D0=B7=D0=BE"), "Ятомизо"),
         blog = str_trim(blog))
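Finally, a sketch of the kind of monthly roll-up behind the figures quoted earlier. This is not the exact code behind the plots, just the idea; `floor_date` is from lubridate:

# Posts and distinct active blogs per calendar month
monthly <-
  blogs %>%
  mutate(month = floor_date(datetime, unit = "month")) %>%
  group_by(month) %>%
  summarise(posts = n(),
            active_blogs = n_distinct(blog))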