This post does three things:

- Questions the headline number. Many people who promote R quote the number of R blogs as given on the R-Bloggers website by Tal Galili, which syndicates literally hundreds of R-related blogs (573 at the time of writing). But the number tends only to increase. How many actual posts are there in a given week/month, from how many different blogs?
- Updates the data. I have a longer history of daily digest emails than I thought, so the data, and some of the text, have been updated to go back to October 2013.
- Explains the method. I subscribed to the R-Bloggers daily digest emails in early 2014, giving me a good time series of posts to parse.

Parsing the daily digest emails
The initial dump is easy from Gmail (define a filter > use it to apply a new label > request a dump of the labelled emails). Since the dump is in a single plain-text file, and because the amazing R community has bothered to generalise so many solutions to fiddly problems by making packages, all the remaining steps are also easy:

- `convert_mbox_eml` in the `tm.plugin.mail` package, to split the mbox dump into one file per email.
- `read_html` in the `xml2` package (which has its own magic to trim off the non-HTML email headers), to parse each message.
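A rough sketch of those two steps (the file names are placeholders of my own, not the real ones):

library(tm.plugin.mail)
library(xml2)

# Split the single mbox dump into one .eml file per message
convert_mbox_eml("rbloggers.mbox", "mail_dir")

# Parse one message; read_html trims off the non-HTML headers itself
first_email <- list.files("mail_dir", full.names = TRUE)[1]
read_html(first_email)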
It turns out that there are about 75 blogs active in a given month, posting about 160 posts (Revolutions is the only one that regularly posts more than once per week). Nothing much has changed in the last year. For some arbitrary definition of “dead blog”, a survival analysis could be done.
This was an easy project, but a few quirks soaked up a lot of time:

- Creating an empty list of a given length: `empty_list <- vector(mode = "list", length = n)`. I usually don’t think of lists as a kind of vector, and usually think of them as a class rather than a mode, but perhaps that’s just me (`is.vector(list())` is `TRUE`, after all).
- The datetimes weren’t in a `lubridate`-friendly order, so I had to rediscover `strptime` (see the fragments below).
- Non-ASCII characters arrived as quoted-printable sequences like `=C2=BB`, which I replaced one by one with `str_replace` (`stringr`).
- While `xml_find_all` in the `xml2` package understands `3D\"itemcontentlist\"` as part of an XPath string, I initially fell foul of `html_nodes` in the `rvest` package, which doesn’t seem to understand it as part of a CSS string.
- Flattening a named list into a data frame without the `purrr` or `rlist` packages: the answer is `stack`, in (hooray!) base R. There’s a tiny demo just below.
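In case `stack` is as new to you as it was to me, it turns a named list into a two-column data frame:

stack(list(a = c(1, 2), b = 3))
#   values ind
# 1      1   a
# 2      2   a
# 3      3   b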
But the biggest time-sucker by far was the bizarre way that the plain text of the emails had been wrapped to 76 characters, by sticking an equals sign and a Windows-style line ending (carriage return and line feed, i.e. `\r\n`) after the 75th character (the “soft line break” of the quoted-printable encoding, it turns out). This is snag-tastic, because it’s hard to find a tool that will both search-and-replace across line endings, and also search-and-replace multiple characters. `sed` is one of those command-line tools that lurks for years before pouncing, and this was its moment, when I finally had to learn a bit more than merely `s/pattern/replacement/g`. This StackOverflow answer explains how the following command works:

sed ':a;N;$!ba;s/=\r\n//g' dirty.mbox > clean.mbox
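For the record, the same clean-up can be done without leaving R. A minimal sketch, reading the file byte-for-byte so the `\r\n` pairs survive intact:

# Slurp the whole file into one string
txt <- readChar("dirty.mbox", file.info("dirty.mbox")$size, useBytes = TRUE)
# Delete the quoted-printable soft line breaks
txt <- gsub("=\r\n", "", txt, fixed = TRUE, useBytes = TRUE)
writeChar(txt, "clean.mbox", eos = NULL, useBytes = TRUE)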
Thank you for reading. Here are some more graphs, and some code fragments.
# NOTE: These are fragments of code. They do not stand alone.
library(xml2)      # read_html, xml_find_all, xml_text
library(stringr)   # str_sub, str_replace, str_trim, fixed
library(lubridate) # ymd_hms
library(dplyr)     # rename, mutate, and the pipe
# Collect the file names
emails <- list.files(mail_dir, full.names = TRUE)
# Remove the "confirm email address" one
# and the one that has no links to the original blogs
emails <- emails[c(-2, -342)]
# Remove any that are replies
emails <- emails[vapply(emails,
                        function(filename) {
                          !any(grepl("^In-Reply-To: <",
                                     readLines(filename, n = 10)))
                        },
                        logical(1))]
# Collect the datetimes in the first line of each file
# Also collect the journals from the subject lines
n <- length(emails)
datetimes <- vector("character", length = n)
blogcounts <- vector("character", length = n)
blogs <- vector("list", length = n)
i <- 0
for (filename in emails) {
  i <- i + 1
  datetimes[i] <- readLines(filename, n = 1)
  # Extract the links to the original blogs
  blogs[[i]] <-
    read_html(filename) %>%
    xml_find_all("//*[(@id = '3D\"itemcontentlist\"')]//div//div//strong[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a") %>%
    xml_text()
}
# Extract the datetime string
datetimes <-
  datetimes %>%
  str_sub(start = 34) %>%
  strptime(format = "%b %d %H:%M:%S %z %Y")
# Link the datetime with individual blogs
names(blogs) <- datetimes
blogs <- stack(blogs)
# Recover the dates and clean the blog names
blogs <-
  blogs %>%
  rename(blog = values, datetime = ind) %>%
  mutate(datetime = ymd_hms(datetime),
         # blog = str_replace(blog, fixed("=\\n"), ""),
         blog = str_replace(blog, fixed("=C2=BB"), "»"),
         blog = str_replace(blog, fixed("=E2=80=93"), "–"),
         blog = str_replace(blog, fixed("=E2=80=A6"), "…"),
         blog = str_replace(blog, fixed("=C3=A6"), "æ"),
         blog = str_replace(blog, fixed("=EA=B0=84=EB=93=9C=EB=A3=A8=EB=93=9C =ED=81=AC=EB=A6=AC=EC=8A=A4=ED=86=A0=ED=8C=8C"), "(간드루드 크리스토파)"),
         blog = str_replace(blog, fixed("=D0=AF=D1=82=D0=BE=D0=BC=D0=B8=D0=B7=D0=BE"), "Ятомизо"),
         blog = str_trim(blog))
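Finally, a sketch of the kind of monthly roll-up behind the figures quoted earlier. This is not the exact code behind the plots, just the idea; `floor_date` is from lubridate:

# Posts and distinct active blogs per calendar month
monthly <-
  blogs %>%
  mutate(month = floor_date(datetime, unit = "month")) %>%
  group_by(month) %>%
  summarise(posts = n(),
            active_blogs = n_distinct(blog))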