Reproducing Harry Potter sentiment arcs by some real wizards
In my previous post, I applied Julia Silge’s story-arc sentiment analysis method to the Harry Potter books. Twitter’s response was a tweet about some very similar and fascinating research, on arXiv. Here’s their beautiful graph.
This post reproduces their method, using R.
Reagan et al.’s method differs in two ways. They use the sentiment dictionary LabMT, a product of their own research group, Hedonometer, which is available in R in the qdapDictionaries package. Then they smooth the scores by averaging inside a moving window. The moving-window method produces curves that make a lot more sense.
You can see their curves for all the Harry Potter books on their interactive website, which I highly recommend.
But I wanted to reproduce the method in R, for fun. The main snag was handling the moving window, which is one of the few under-developed parts of R munging at the moment. There’s RcppRoll, which implements a few obvious functions but no-longer allows custom functions, and zoo, which does allow custom functions, but seems to want the input to be a timeseries – fair enough, it is a timeseries library.
So I did something hacky with base R and purrr. Starting with two vectors, one for the first and the other for the last row number of each window. Putting those into a list, I then transposed them, which you really have to see to understand what it’s doing – basically it paired up each entry of each vector into a vectors of length 2, each one an element of one overall list. Then I could use lapply to iterate over each pair, generating a sequence of rownumbers from the first to the last, and finally bind_rows combined them into a large dataframe of all rows in all windows. I recommend running the code below to see how this works. Obviously the method doesn’t scale too well, since it loads all the windows into memory at once.
# Set up the moving-window-average parameters
N <- nrow(book) # Number of words in the book
k <- 10000 # Number of words in each sample
l <- 200 # Number of points in the time series
overlap <- floor((N - k)/(l - 1)) # Number of words by which the window slides
overlap <- floor((N - k - 1)/l) # Function in the paper -- I think it's wrong, but it makes little difference.
# first and last words of each chunk
uppers <- c(seq(k, by = overlap, length.out = l), N)
lowers <- c(seq(1, by = overlap, length.out = l), N - k)
arc <-
list(lowers, uppers) %>%
transpose %>%
at_depth(1, unlist) %>%
lapply(function(x) {slice(book, x[1]:x[2]) %>% mutate(length = n())}) %>%
bind_rows(.id = "window") %>%
etc.
One thing on my graphs that Hedonometer doesn’t provide is the chapter titles, so you can be reminded what is going on at each turn in the plot.
Enjoy the plots below, otherwise that’s it for this post. The code is, as always, on GitHub, but you need to supply your own copies of the books.
# A tibble: 0 x 1
# Groups: title [0]
# … with 1 variable: title <chr>
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/nacnudus/duncangarmonsway, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Garmonsway (2016, July 20). Duncan Garmonsway: More Harry Potter story arcs (they get darker). Retrieved from https://nacnudus.github.io/duncangarmonsway/posts/2016-07-20-harry-potter-arcs/
BibTeX citation
@misc{garmonsway2016more, author = {Garmonsway, Duncan}, title = {Duncan Garmonsway: More Harry Potter story arcs (they get darker)}, url = {https://nacnudus.github.io/duncangarmonsway/posts/2016-07-20-harry-potter-arcs/}, year = {2016} }