How many R-Bloggers are there?

Parsing the daily digest emails

Duncan Garmonsway
April 18, 2016

This post does three things:

Many people who promote R quote the number of R blogs as given on the R-Bloggers website by Tal Galili, which syndicates literally hundreds of R-related blogs (573 at the time of writing). But the number tends only to increase. How many actual posts are there in a given week/month, from how many different blogs?

Update 30 April 2016

I have a longer history of daily digest emails than I thought. The data, and some of the text, has been updated to go back to October 2013.

The gist of it

I subscribed to the R-Bloggers daily digest emails in early 2014, giving me a good time-series of posts.

The initial dump is easy from Gmail (define a filter > use it to apply a new label > request a dump of the labelled emails). Since the dump is in a single plain-text file, and because the amazing R-community has bothered to generalise so many solutions to fiddly problems by making packages, all the remaining steps are also easy.

  1. Separate the emails into individual files, using convert_mbox_eml in the tm.plugin.mail package.
  2. Parse the date-time in the first line of each file, using base R (hooray for base!)
  3. Parse the HTML email content using read_html in the xml2 package (which has its own magic to trim off the non-HTML email headers).
  4. Extract the names of the blogs in each email using an XPath string created by the SelectorGadget browser extension/bookmarklet.
  5. Munge and analyse the data.
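As an illustration of step 2, here is a minimal base-R sketch of pulling the date-time out of the first line of one email file. The sample line is made up; its layout (an identifier, then a weekday, then the timestamp starting at character 34) is an assumption inferred from the offset and format string used in the code fragments at the end of this post.

```r
# Hypothetical first line of one email file; the identifier is invented.
first_line <- "From 1234567890123456789@xxx Fri Apr 03 12:34:56 +0000 2015"

# Drop everything up to and including the weekday (the first 33 characters).
stamp <- substring(first_line, 34)

# Parse month, day, time, UTC offset and year.
# Note that %b matches month names in the current locale (English assumed here).
parsed <- as.POSIXct(
  strptime(stamp, format = "%b %d %H:%M:%S %z %Y", tz = "UTC")
)

format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

In a non-English locale the month abbreviation may fail to parse, in which case the result is NA rather than an error.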

The answer

It turns out that there are about 75 blogs active in a given month, posting about 160 posts (Revolutions is the only one that regularly posts more than once per week). Nothing much has changed in the last year. For some arbitrary definition of “dead blog”, a survival analysis could be done.
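For readers who want to reproduce this kind of count, the sketch below shows one way to go from a table of posts to "posts per month" and "active blogs per month" in base R. The data frame `posts` and its contents are invented for illustration; only the column names `blog` and `datetime` follow the code fragments at the end of this post.

```r
# Toy data: one row per post, with the blog name and the digest date-time.
posts <- data.frame(
  blog = c("Blog A", "Blog B", "Blog A", "Blog C", "Blog A"),
  datetime = as.POSIXct(
    c("2016-01-05 06:00:00", "2016-01-05 06:00:00", "2016-01-20 06:00:00",
      "2016-02-02 06:00:00", "2016-02-10 06:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Label each post with its calendar month, then group blog names by month.
posts$month <- format(posts$datetime, "%Y-%m")
by_month <- split(posts$blog, posts$month)

# Posts per month, and distinct blogs that posted at least once in that month.
monthly <- data.frame(
  month   = names(by_month),
  n_posts = vapply(by_month, length, integer(1)),
  n_blogs = vapply(by_month, function(b) length(unique(b)), integer(1)),
  row.names = NULL
)

print(monthly)
```

Here "active" simply means "posted at least once in the month", which is one possible definition among several.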

What took me so long

This was an easy project, but a few quirks soaked up a lot of time:

But the biggest time-sucker by far was the bizarre way that the plain text of the emails had been wrapped at 76 characters, by sticking an equals sign and a Windows-style line-ending (carriage return and line feed, i.e. \r\n) after the 75th character. (This is the "soft line break" of the quoted-printable encoding used for email, which is also why an equals sign in the HTML appears as =3D.) This is snag-tastic, because it’s hard to find a tool that will both search-and-replace across line-endings, and also search-and-replace multiple characters. sed is one of those command-line tools that lurks for years before pouncing, and this was its moment, when I finally had to learn a bit more than merely s/pattern/replacement/g. This StackOverflow answer explains how the following command works: sed ':a;N;$!ba;s/=\r\n//g' dirty.mbox > clean.mbox. In short, the :a;N;$!ba part loops until the whole file has been read into sed's pattern space, so that the substitution can then match across line-endings.
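The same clean-up can also be sketched in base R, which may help to show what the sed one-liner is doing. This is not what I ran; it is a minimal illustration on a made-up string, and the function name `unfold_soft_breaks` is hypothetical.

```r
# Remove soft line breaks: an equals sign followed by CRLF (or a bare LF).
# Other "=" sequences, such as =3D, are deliberately left alone.
unfold_soft_breaks <- function(text) {
  gsub("=\r?\n", "", text, perl = TRUE)
}

# A made-up fragment wrapped mid-word, in the style of the email dump.
dirty <- "<div id=3D\"itemcontent=\r\nlist\">R-blog=\r\ngers</div>\r\n"
clean <- unfold_soft_breaks(dirty)

cat(clean)
```

For a real file, the text would have to be read as a single string rather than line by line (for example with readChar), because the pattern spans a line-ending.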

Reward

Thank you for reading. Here are some more graphs, and some code fragments.


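# NOTE: These are fragments of code.  They do not stand alone.

# Packages used by the fragments below
library(xml2)      # read_html, xml_find_all, xml_text
library(stringr)   # str_sub, str_replace, str_trim
library(dplyr)     # rename, mutate, %>%
library(lubridate) # ymd_hms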

# Collect the file names
emails <- list.files(mail_dir, full.names = TRUE)

# Remove the "confirm email address" one
# and the one that has no links to the original blogs
emails <- emails[c(-2, -342)]

# Remove any that are replies
emails <- emails[vapply(emails,
                        function(filename) {
                          !any(grepl("^In-Reply-To: <",
                                     readLines(filename, n = 10)))},
                        TRUE)]

# Collect the datetimes in the first line of each file
# Also collect the blog names from the links in each email body
n <- length(emails)
datetimes <- vector("character", length = n)
blogcounts <- vector("character", length = n)
blogs  <- vector("list", length = n)
i <- 0
for (filename in emails) {
  i <- i + 1
  datetimes[i] <- readLines(filename, n = 1)
  # Extract the links to the original blogs
  blogs[[i]] <-
    read_html(filename) %>%
    xml_find_all("//*[(@id = '3D\"itemcontentlist\"')]//div//div//strong[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a") %>%
    xml_text
}

# Extract the datetime string
datetimes <-
  datetimes %>%
  str_sub(start = 34) %>%
  strptime(format = "%b %d %H:%M:%S %z %Y")

# Link the datetime with individual blogs
names(blogs) <- datetimes
blogs <- stack(blogs)

# Recover the dates and clean the blog names
blogs <-
  blogs %>%
  rename(blog = values, datetime = ind) %>%
  mutate(datetime = ymd_hms(datetime),
         # blog = str_replace(blog, fixed("=\\n"), ""),
         # str_replace_all, so that repeated sequences in one name are all decoded
         blog = str_replace_all(blog, fixed("=C2=BB"), "»"),
         blog = str_replace_all(blog, fixed("=E2=80=93"), "–"),
         blog = str_replace_all(blog, fixed("=E2=80=A6"), "…"),
         blog = str_replace_all(blog, fixed("=C3=A6"), "æ"),
         blog = str_replace_all(blog, fixed("=EA=B0=84=EB=93=9C=EB=A3=A8=EB=93=9C =ED=81=AC=EB=A6=AC=EC=8A=A4=ED=86=A0=ED=8C=8C"), "(간드루드 크리스토파)"),
         blog = str_replace_all(blog, fixed("=D0=AF=D1=82=D0=BE=D0=BC=D0=B8=D0=B7=D0=BE"), "Ятомизо"),
         blog = str_trim(blog))
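The hand-written replacements above work because each =XX group is one byte of a UTF-8 character written in hexadecimal. A more general alternative, which I did not use here, would be to decode every such group in one pass. The sketch below is a base-R illustration under that assumption; `decode_qp` is a hypothetical helper, and it assumes the decoded bytes are valid UTF-8.

```r
# Decode "=XX" hex escapes into bytes and mark the result as UTF-8.
decode_qp <- function(x) {
  hex_digits <- charToRaw("0123456789ABCDEFabcdef")
  equals <- charToRaw("=")
  vapply(x, function(s) {
    bytes <- charToRaw(s)
    n <- length(bytes)
    out <- raw(0)
    i <- 1
    while (i <= n) {
      is_escape <- bytes[i] == equals && i + 2 <= n &&
        all(bytes[(i + 1):(i + 2)] %in% hex_digits)
      if (is_escape) {
        # Turn the two hex digits into a single byte.
        hex <- rawToChar(bytes[(i + 1):(i + 2)])
        out <- c(out, as.raw(strtoi(hex, base = 16L)))
        i <- i + 3
      } else {
        # Copy any other byte through unchanged.
        out <- c(out, bytes[i])
        i <- i + 1
      }
    }
    res <- rawToChar(out)
    Encoding(res) <- "UTF-8"
    res
  }, character(1), USE.NAMES = FALSE)
}

decode_qp(c("Revolutions =C2=BB R", "A =E2=80=93 B", "plain name"))
```

An equals sign that is not followed by two hexadecimal digits is left as it is, so names that are already plain text pass through unchanged.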

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/nacnudus/duncangarmonsway, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Garmonsway (2016, April 18). Duncan Garmonsway: How many R-Bloggers are there?. Retrieved from https://nacnudus.github.io/duncangarmonsway/posts/2016-04-18-rbloggers/

BibTeX citation

@misc{garmonsway2016how,
  author = {Garmonsway, Duncan},
  title = {Duncan Garmonsway: How many R-Bloggers are there?},
  url = {https://nacnudus.github.io/duncangarmonsway/posts/2016-04-18-rbloggers/},
  year = {2016}
}