Harry Potter cliches and story arcs by n-grams and sentiment analysis
This post applies Julia Silge’s amazing story-arc sentiment analyses to the Harry Potter books.
It also busts the myth that “turned on his heel” is the series’ most common phrase.
Here is a related shiny app to explore the ideas further.
Casual text-munging is no longer a pain, thanks to a couple of new packages, tidytext and tokenizers, and a not-so-new one, stringi.
When I last analysed Harry Potter a few years ago, the tm package, though powerful, was frustrating, partly due to its unusual data format, which was tricky to traverse. But these new packages operate on ordinary data frames, using nesting to great effect.
The outcome is that n-grams can be created incredibly quickly, easily avoiding sentence boundaries. The code is as simple as this:
library(dplyr)      # group_by, summarise, %>%
library(tidyr)      # unnest
library(purrr)      # map
library(tokenizers) # tokenize_sentences, tokenize_ngrams
library(here)       # here

readRDS(here("data", "books-raw.Rds")) %>%
  # One row per paragraph. Two columns: title and text
  # First, break into sentences so that n-grams don't cross sentence boundaries
  group_by(title) %>%
  summarise(sentence = list(unlist(map(text, tokenize_sentences)))) %>%
  unnest(sentence) %>%
  # Then create 4-grams
  group_by(title) %>%
  summarise(fourgram = list(unlist(tokenize_ngrams(sentence, n = 4)))) %>%
  unnest(fourgram)
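To see why splitting into sentences first matters, here is a minimal base-R sketch. It is not the tokenizers implementation: the `ngrams()` helper and the two example sentences are made up for illustration, and the word-splitting rule is deliberately crude.

```r
# Hypothetical helper: word n-grams of one string (lower-cased, punctuation dropped)
ngrams <- function(x, n = 4) {
  words <- unlist(strsplit(tolower(x), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Invented example text, already split into sentences
sentences <- c("Harry nodded.", "Ron and Hermione looked at him.")
paragraph <- paste(sentences, collapse = " ")

unbounded <- ngrams(paragraph)                  # ignores sentence boundaries
bounded <- unlist(lapply(sentences, ngrams))    # respects them

# 4-grams that exist only because they straddle the full stop
setdiff(unbounded, bounded)
# "harry nodded ron and" "nodded ron and hermione"
```

The two spurious phrases vanish once each sentence is tokenized on its own, which is the effect the pipeline above gets by calling `tokenize_sentences` before `tokenize_ngrams`.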
Did you hear that the most common phrase in Harry Potter is “turned on his heel”? I can finally bust that myth. It does appear quite often – 12 times at most, if you include ‘turning’ and ‘her’, but the most-common four-word phrase, by miles, is “Harry, Ron and Hermione”. Big surprise.
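Counting a phrase together with its variants ("turning", "her") can be done with one regular expression. The sketch below uses base R on three invented sentences rather than the books, so the count it produces says nothing about the real series.

```r
# Invented sentences standing in for the book text
text <- c(
  "Snape turned on his heel and swept away.",
  "Turning on her heel, Professor McGonagall left.",
  "He turned on the spot."
)

# turned/turning, his/her, case-insensitive
pattern <- "\\bturn(ed|ing) on (his|her) heel\\b"
matches <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
hits <- unlist(regmatches(text, matches))

length(hits)          # 2
table(tolower(hits))  # one row per variant
```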
A few of these are predictable nouns (Defence Against the Dark Arts, the Ministry of Magic, the Room of Requirement). He Who Must Not Be Named makes it into the top 40. There are a bunch of phrases that describe where things are (at the end of, etc.). But the most intriguing phrase is “said Hermione in a” – why is Hermione singled out by that construction?
There’s a shiny app to explore lots more n-grams, from 2-grams to 10-grams.
Slate did a similar analysis, though they looked at the most-common sentences, comparing Harry Potter with The Hunger Games and the Twilight series.
They seem to have edited their list somewhat, since “He waited” appears only three times, and “Something he didn’t have last time” only twice. Meanwhile, “Harry nodded” tops my list (of complete sentences) with 14 occurrences, one more than Slate’s top sentence, “Nothing happened.”
Here are my top 30, many of which are not complete sentences.
Part of the difficulty is that written English speech isn’t unambiguously punctuated. This has bugged me since primary school. See what happens here.
tokenize_sentences(c("'Are you going?' Harry asked.",
                     "Ron asked, 'Are you going?' Harry shrugged.",
                     "'You should go,' Harry said",
                     "'Go now.' Harry went."))
## [[1]]
## [1] "'Are you going?'" "Harry asked."
##
## [[2]]
## [1] "Ron asked, 'Are you going?'" "Harry shrugged."
##
## [[3]]
## [1] "'You should go,' Harry said"
##
## [[4]]
## [1] "'Go now.'" "Harry went."
Were I king, I’d decree the following unambiguous style.
tokenize_sentences(c("Ron asked, 'Are you going?'. Harry shrugged.",
                     "'You should go.', Harry said"))
## [[1]]
## [1] "Ron asked, 'Are you going?'." "Harry shrugged."
##
## [[2]]
## [1] "'You should go.', Harry said"
If importance is proportional to mentions of first names, then Hermione and Ron are not as equal as you might expect.
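Counting first-name mentions needs nothing more than a word-boundary regular expression. This is a base-R sketch on invented paragraphs; `count_mentions()` is a hypothetical helper, not the code behind the figures in this post.

```r
# Invented paragraphs standing in for the book text
paragraphs <- c(
  "Harry looked at Ron.",
  "Hermione sighed. 'Honestly, Harry,' said Hermione.",
  "Ron shrugged, and Harry grinned."
)

# Hypothetical helper: total whole-word matches of `name` across `text`
count_mentions <- function(name, text) {
  hits <- gregexpr(paste0("\\b", name, "\\b"), text, perl = TRUE)
  sum(lengths(regmatches(text, hits)))
}

first_names <- c("Harry", "Ron", "Hermione")
mentions <- vapply(first_names, count_mentions, integer(1), text = paragraphs)
mentions
# Harry 3, Ron 2, Hermione 2
```

The word boundaries stop "Ron" from matching inside "Ronald", though possessives such as "Ron's" still count as mentions.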
Ever since I read Julia Silge’s amazing story-arc sentiment analyses, I have wanted to apply the method to the Harry Potter books.
There’s a shiny app to explore this interactively, but here is a still for the blog.
If there’s anything to interpret here, then it’s that the first three books play the game “fortunately, unfortunately”, while the later books are a little different, especially Order of the Phoenix, which is the grumpy one.
Perhaps the Fourier transform is too sensitive to a magic number that I call the ‘wiggliness’ parameter. To see how sensitive, I calculated the arcs for ‘wiggliness’ values from 3 to 10, and described the range of the arcs with a ribbon – a little like the standard-error ribbon on a geom_smooth. I think the ribbons show a more reliable story arc, and reveal that narrow wobbles, of the order of a chapter or so, are probably misleading.
And to see where the ‘wiggliness = 3’ arc lies in the range, I superimpose it as a line.
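The ribbon idea can be sketched in base R. This is an illustration under stated assumptions, not the code behind the plots: it assumes that ‘wiggliness’ means the number of lowest non-zero frequencies kept by a Fourier low-pass filter, the `low_pass_arc()` helper is hypothetical, and the sentiment scores are simulated.

```r
# Hypothetical helper: keep the mean plus the `wiggliness` lowest frequencies
low_pass_arc <- function(sentiment, wiggliness) {
  n <- length(sentiment)
  k <- 0:(n - 1)
  freq_index <- pmin(k, n - k)           # frequency of each FFT coefficient
  spectrum <- fft(sentiment)
  spectrum[freq_index > wiggliness] <- 0 # zero out the high frequencies
  Re(fft(spectrum, inverse = TRUE)) / n  # back to the original scale
}

set.seed(1)
# Simulated sentence-level sentiment: a slow swing buried in noise
sentiment <- sin(seq(0, 2 * pi, length.out = 200)) + rnorm(200)

# One arc per wiggliness value, as columns of a matrix
arcs <- sapply(3:10, function(w) low_pass_arc(sentiment, w))

ribbon <- data.frame(
  position = seq_along(sentiment),
  lower = apply(arcs, 1, min),  # bottom of the ribbon
  upper = apply(arcs, 1, max),  # top of the ribbon
  arc3 = arcs[, 1]              # the 'wiggliness = 3' line
)
head(ribbon)
```

With ggplot2, `lower` and `upper` would map to a `geom_ribbon` and `arc3` to a `geom_line`. Where the ribbon is wide, the shape of the arc depends heavily on the choice of parameter.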
The code is, as always, on GitHub, but you need to supply your own copies of the books.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/nacnudus/duncangarmonsway, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Garmonsway (2016, July 13). Duncan Garmonsway: Harry Potter and the N-Grams of Sentiment. Retrieved from https://nacnudus.github.io/duncangarmonsway/posts/2016-07-13-harry-potter-sentiment/
BibTeX citation
@misc{garmonsway2016harry,
  author = {Garmonsway, Duncan},
  title = {Duncan Garmonsway: Harry Potter and the N-Grams of Sentiment},
  url = {https://nacnudus.github.io/duncangarmonsway/posts/2016-07-13-harry-potter-sentiment/},
  year = {2016}
}