With added bacran

Nerd-sniped by kevinbacran, a graph of CRAN co-authorship.

Duncan Garmonsway
March 03, 2019

Matt Dray came up with the hilarious pun ‘kevinbacran’ and then had to follow through and make the package. It calculates the degrees of separation between CRAN package authors and Hadley Wickham. Return here when you’ve played with the Shiny app.

If you’re still reading this then you have been nerd-sniped, as I was. To help you end your procrastination quickly, here are answers to a few questions:

  1. Who has the highest Hadley number?
  2. What is the longest ‘shortest path’ between any two CRAN authors?
  3. What is the largest network disconnected from Hadley?
  4. Is Hadley the most central author?

0. Preparation

The kevinbacran package provides a graph of package co-authorship, so I piggyback on that, just as it is my pleasure to do with Matt’s professional work. Nodes are authors; each edge is a package that two authors share.


cran_pkg_graph

# A tbl_graph: 16911 nodes and 99542 edges
#
# An undirected multigraph with 2102 components
#
# Node Data: 16,911 x 2 (active)
  name               number
  <chr>               <int>
1 Csillery Katalin        1
2 Lemaire Louisiane       2
3 Francois Olivier        3
4 Abdulmonem Alsaleh      4
5 Robert Weeks            5
6 Ian Morison             6
# … with 1.69e+04 more rows
#
# Edge Data: 99,542 x 3
   from    to package
  <int> <int> <chr>  
1     1     2 abc    
2     1     3 abc    
3     1 12921 abc    
# … with 9.954e+04 more rows

Incredibly, the igraph package is willing and able to calculate the length of the shortest route between every pair of nodes, all at once, in a few seconds. It returns a square matrix, a bit like old people used to have on the back of a road atlas, with the driving times between cities, except here it’s the number of co-authorship links in the chain connecting one author with another. Infinite values represent impossible connections, but I substituted NA for these so that they don’t upset subsequent max() and min().


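# Needs igraph here; later chunks also use tidygraph, kevinbacran and the
# tidyverse (dplyr, purrr, stringr, tibble).
cran_pkg_igraph <- as.igraph(cran_pkg_graph)
dists <- distances(cran_pkg_igraph) # one row and one column per author
dists[is.infinite(dists)] <- NA # is there a more efficient way?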
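To see what distances() hands back, here is a minimal sketch on a made-up five-author graph (the names A to E are purely illustrative, not from CRAN). Two authors in different components get Inf, which is swapped for NA as above.

```r
library(igraph)

# Hypothetical toy graph: A-B-C form one component, D-E another.
toy <- graph_from_literal(A - B, B - C, D - E)

# Square matrix of shortest-path lengths, with author names as dimnames.
toy_dists <- distances(toy)

# Unreachable pairs come back as Inf; replace them with NA.
toy_dists[is.infinite(toy_dists)] <- NA

toy_dists["A", "C"] # two links: A to B, then B to C
toy_dists["A", "D"] # NA, because A and D are in different components
```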

1. Who has the highest Hadley number?

First I need to know which column (or row) represents Hadley. Then I can search that row (or column) for the highest value.


# The row and column number of Hadley Wickham in the matrix of distances.
hadley <-
  cran_pkg_graph %>%
  as_tibble() %>%
  dplyr::filter(name == "Hadley Wickham") %>%
  pull(number)
hadley

[1] 354

# The row of distances between Hadley and everyone else.
hadley_row <- dists[hadley, ]

# The maximum distance between Hadley and anyone else.
(max_hadley_row <- max(hadley_row, na.rm = TRUE))

[1] 11

# The people at the greatest distance from Hadley.
which(hadley_row == max_hadley_row)

    Chelsey Bryant Nicholas R Wheeler        Franz Rubel 
              6050               6051               6052 

Congratulations Chelsey, Nicholas and Franz, you are Hadley’s most distant relations.
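The same lookup can be done by name rather than by row number, assuming the distance matrix carries author names as its dimnames (it does when the igraph vertices have a name attribute). A minimal sketch on a made-up graph, with a hypothetical author H standing in for Hadley:

```r
library(igraph)

# Hypothetical chain H - A - B - C, plus an unrelated pair X - Y.
toy <- graph_from_literal(H - A, A - B, B - C, X - Y)
toy_dists <- distances(toy)

# Distances from H to everyone; keep only the reachable authors.
h_row <- toy_dists["H", ]
h_row <- h_row[is.finite(h_row)]

# Highest "H number" and who holds it.
max_h <- max(h_row)
farthest <- names(h_row)[h_row == max_h]
```

Dropping the infinite values before calling max() avoids the need for the NA substitution in this small case.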

2. What is the longest ‘shortest path’ between any two CRAN authors?

I now wipe out the entire upper triangle (and the diagonal), because I don’t want to notice each path twice. The graph is undirected, so the matrix is symmetric and the bottom triangle holds every pair exactly once. I only need the maximum value in it.
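As a minimal sketch of the idea, here is a made-up symmetric 3 x 3 distance matrix (authors a, b and c are hypothetical). Blanking the upper triangle and the diagonal leaves each pair once, and which(..., arr.ind = TRUE) recovers the row and column of the maximum.

```r
# Hypothetical symmetric distance matrix for three authors.
authors <- c("a", "b", "c")
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(authors, authors))

# Keep only the lower triangle, so each pair appears once.
m[upper.tri(m, diag = TRUE)] <- NA

longest <- max(m, na.rm = TRUE)

# Row and column positions of the longest shortest path.
pairs <- which(m == longest, arr.ind = TRUE)
from <- rownames(m)[pairs[, "row"]]
to <- colnames(m)[pairs[, "col"]]
```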


dists[upper.tri(dists, diag = TRUE)] <- NA # Can it be done faster?

# Maximum distance between any two authors
(maximum_distance <- max(dists, na.rm = TRUE))

[1] 20

max_distances <-
  which(dists == maximum_distance, arr.ind = TRUE) %>%
  as_tibble() %>%
  mutate(from = rownames(dists)[row],
         to = colnames(dists)[col],
         distance = map2_int(row, col, ~ as.integer(dists[.x, .y]))) %>%
  select(from, to, distance)
max_distances

# A tibble: 9 x 3
  from               to            distance
  <chr>              <chr>            <int>
1 Chelsey Bryant     Qihuang Zhang       20
2 Nicholas R Wheeler Qihuang Zhang       20
3 Franz Rubel        Qihuang Zhang       20
4 Chelsey Bryant     Natalia Vilor       20
5 Nicholas R Wheeler Natalia Vilor       20
6 Franz Rubel        Natalia Vilor       20
7 Chelsey Bryant     Di Shu              20
8 Nicholas R Wheeler Di Shu              20
9 Franz Rubel        Di Shu              20

Oh hi Chelsey, Nicholas and Franz! Not only are you Hadley’s most distant relations, you are among the most distantly related package authors in the whole graph. What did you write?


nodes <- as_tibble(activate(cran_pkg_graph, nodes))
edges <- as_tibble(activate(cran_pkg_graph, edges))

nodes %>%
  dplyr::filter(name %in% c("Chelsey Bryant",
                            "Nicholas R Wheeler",
                            "Franz Rubel")) %>%
  inner_join(edges, by = c("number" = "from")) %>%
  distinct(name, package)

# A tibble: 3 x 2
  name               package
  <chr>              <chr>  
1 Chelsey Bryant     kgc    
2 Nicholas R Wheeler kgc    
3 Franz Rubel        kgc    

All three collaborated on the same package: kgc. The three authors on the other side of the relation are not collaborators – they each authored a different package.


nodes %>%
  dplyr::filter(name %in% c("Qihuang Zhang",
                            "Natalia Vilor",
                            "Di Shu")) %>%
  inner_join(edges, by = c("number" = "from")) %>%
  distinct(name, package)

# A tibble: 3 x 2
  name          package  
  <chr>         <chr>    
1 Qihuang Zhang augSIMEX 
2 Natalia Vilor globalGSA
3 Di Shu        ipwErrorY
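One caveat: the joins above match authors only against the from column of the edge table, so an author who only ever appears in the to column would be missed. A hedged sketch of a lookup that checks both ends, using made-up nodes and edges in the same shape as cran_pkg_graph (the names and packages here are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical node and edge tables, shaped like those in the post.
toy_nodes <- tibble(name = c("Ann", "Bob", "Cat"), number = 1:3)
toy_edges <- tibble(from = c(1L, 1L), to = c(2L, 3L),
                    package = c("pkgA", "pkgB"))

# Stack both ends of every edge into one author-package table.
both_ends <- bind_rows(
  transmute(toy_edges, number = from, package),
  transmute(toy_edges, number = to, package)
)

authored <- toy_nodes %>%
  inner_join(both_ends, by = "number") %>%
  distinct(name, package) %>%
  arrange(name, package)
```

In this toy data Bob and Cat only ever sit in the to column, so a join on from alone would return packages for Ann only.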

A visualisation would be easier to interpret. A few incantations are required:

  1. Combine the six author names.
  2. Calculate the graph between each pair with kevinbacran::kb_pair().
  3. Merge and distinctify the graphs with igraph::simplify().
  4. Plot with kevinbacran::kb_plot().

(As you can tell, I do very little of the work around here.)


set.seed(2019-03-03) # evaluates as arithmetic: 2019 - 3 - 3 = 2013

max_distances %>%
  mutate(kb_pair = map2(from, to, kb_pair, tidy_graph = cran_pkg_graph)) %>%
  pull(kb_pair) %>%
  reduce(graph_join, by = c("name", "number", ".tidygraph_node_index")) %>%
  as.igraph() %>%
  igraph::simplify(remove.multiple = TRUE,
                   remove.loops = FALSE,
                   edge.attr.comb = "random") %>%
  as_tbl_graph() %>%
  kb_plot()

It’s lovely to see that Ernst Wit and Ernst C Wit are such near relations in work as in life.
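The simplify() step is what collapses the repeated edges that appear when several paths are joined together. A minimal sketch on a made-up multigraph (authors and packages are hypothetical), using edge.attr.comb = "first" rather than "random" so that the result is repeatable:

```r
library(igraph)

# Hypothetical edges: A and B share two packages, so A-B appears twice.
toy_edges <- data.frame(from = c("A", "A", "B"),
                        to = c("B", "B", "C"),
                        package = c("p1", "p2", "p3"))
multi <- graph_from_data_frame(toy_edges, directed = FALSE)

# Merge parallel edges, keeping the first package name of each bundle.
simple <- simplify(multi,
                   remove.multiple = TRUE,
                   remove.loops = FALSE,
                   edge.attr.comb = "first")

ecount(multi)  # 3
ecount(simple) # 2
```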

Are many authors so distantly related? For each author, take the maximum distance between them and anyone else they are connected to. What is the distribution of that value across all authors?


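# dists lost its upper triangle above, so take row maxima from a fresh matrix.
full_dists <- distances(cran_pkg_igraph)
rowmaxes <- apply(full_dists, 1, function(d) max(d[is.finite(d)]))
hist(rowmaxes)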

I … did not expect a bimodal distribution. That’s how naive I am about graphs.
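Graph theory calls each author's maximum distance their eccentricity, and igraph can compute it directly. The sketch below, on a made-up graph, hints at one possible reason for two humps: members of small components can only ever be a step or two from anyone they can reach, while members of a big component can be much further apart. This is an illustration, not a diagnosis of the CRAN graph.

```r
library(igraph)

# Hypothetical graph: a chain of five authors and a separate pair.
toy <- graph_from_literal(A - B, B - C, C - D, D - E, X - Y)

# eccentricity() only considers vertices in the same component.
ecc <- eccentricity(toy)

# Tabulate how many authors have each eccentricity.
table(ecc)
```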

3. What is the largest network disconnected from Hadley?

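The graph can be split into cliques – my name for groups of authors that aren’t connected with one another (graph theory calls them connected components, and reserves ‘clique’ for a group in which everyone is directly linked to everyone else). Most cliques are tiny: 919 of the 2102 have only two authors in them, and none has just one, presumably because solo authors have no co-authorship edge and so never enter the graph. The second largest clique only has 29 people in it, whereas the largest clique has 9251 in it.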


cliques <- components(cran_pkg_igraph)
table(cliques$csize)

   2    3    4    5    6    7    8    9   10   11   12   13   14   15 
 919  481  282  136   73   63   36   35   15   16    3    5    7    5 
  16   17   18   19   20   21   22   23   24   29 9251 
   3    6    3    1    3    3    2    2    1    1    1 

Like just over half of all authors (9251 of 16911), Hadley Wickham is in the largest clique. Who is in the next largest clique?


hadley_clique <- cliques$membership["Hadley Wickham"]
max_non_hadley_size <- max(cliques$csize[-hadley_clique])

# The largest graph not connected with Hadley
largest_non_hadley_cliques <- which(cliques$csize == max_non_hadley_size)

# Members of those cliques
non_hadley_members <-
  cliques$membership[cliques$membership %in% largest_non_hadley_cliques]

# Graph of the cliques
cliques_graph <-
  cran_pkg_graph %>%
  activate(nodes) %>%
  filter(name %in% names(non_hadley_members))

# Packages authored in the largest non-Hadley cliques
pkgs <- tools::CRAN_package_db()

cliques_graph %>%
  activate(edges) %>%
  as_tibble() %>%
  distinct(package) %>%
  inner_join(pkgs[, c("Package", "Description")],
             by = c("package" = "Package")) %>%
  mutate(Description = str_replace_all(Description, "[\n ]+", " ")) %>%
  print(n = Inf)

# A tibble: 26 x 2
   package      Description                                           
   <chr>        <chr>                                                 
 1 ald          "It provides the density, distribution function, quan…
 2 ALDqr        "EM algorithm for estimation of parameters and other …
 3 ARCensReg    It fits an univariate left or right censored linear r…
 4 ARpLMEC      It fits left, right or interval censored mixed-effect…
 5 BayesCR      Propose a parametric fit for censored linear regressi…
 6 bssn         It provides the density, distribution function, quant…
 7 CensMixReg   Fit censored linear regression models where the rando…
 8 CensRegMod   Fits univariate censored linear regression model unde…
 9 CensSpatial  Fits linear regression models for censored spatial da…
10 endtoend     Computes the expectation of the number of transmissio…
11 FMsmsnReg    Fit linear regression models where the random errors …
12 hopbyhop     Computes the expectation of the number of transmissio…
13 lqr          It fits a robust linear quantile regression model usi…
14 mixsmsn      Functions to fit finite mixture of scale mixture of s…
15 MomTrunc     It computes the raw moments for the folded and trunca…
16 nlsmsn       Fit univariate non-linear scale mixture of skew-norma…
17 Opportunist… Computes the routing distribution, the expectation of…
18 PartCensReg  It estimates the parameters of a partially linear reg…
19 qrLMM        Quantile regression (QR) for Linear Mixed-Effects Mod…
20 qrNLMM       Quantile regression (QR) for Nonlinear Mixed-Effects …
21 SMNCensReg   Fit univariate right, left or interval censored regre…
22 ssmn         Performs the EM algorithm for regression models using…
23 ssmsn        It provides the density and random number generator f…
24 StempCens    It estimates the parameters of a censored or missing …
25 tlmec        Fit a linear mixed effects model for censored data wi…
26 TTmoment     Computing the first two moments of the truncated mult…

set.seed(2019-03-03)
kb_plot(cliques_graph)

That is a clique of mostly regression packages, many of them for censored data.
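The component bookkeeping can be tried out on a small made-up graph. This sketch assumes vertex names are set, so that components()$membership is a named vector; the author names are hypothetical, with H standing in for Hadley.

```r
library(igraph)

# Hypothetical graph with components of size 4, 3 and 2.
toy <- graph_from_literal(H - A, A - B, B - C, E - F, F - G, X - Y)
comps <- components(toy)

# Which component is H in, and how big is the largest of the others?
h_comp <- comps$membership[["H"]]
max_other_size <- max(comps$csize[-h_comp])

# Ids of the largest components that do not contain H.
other_ids <- setdiff(which(comps$csize == max_other_size), h_comp)

# Names of the authors in those components.
members <- names(comps$membership)[comps$membership %in% other_ids]
```

The setdiff() is a small safeguard for the case where another component happens to be exactly as large as H's own.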

4. Is Hadley the most central author?

Uh, what even is centrality? Picking ?centrality_information at random:

‘centrality_information’: centrality based on inverse sum of resistance distance between nodes (‘netrankr’)

Crikey, that takes a while. Let’s try centrality_betweenness.


centrality <-
  cran_pkg_graph %>%
  activate(nodes) %>%
  mutate(importance = centrality_betweenness(directed = FALSE))

centrality %>%
  activate(nodes) %>%
  as_tibble() %>%
  arrange(desc(importance))

# A tibble: 16,911 x 3
   name              number importance
   <chr>              <int>      <dbl>
 1 Hadley Wickham       354   6068597.
 2 R Core              1465   5785565.
 3 Martin Maechler      273   3599064.
 4 Brian Ripley        1704   2933798.
 5 Dirk Eddelbuettel   1074   2757885.
 6 Kurt Hornik          559   2296512.
 7 Ben Bolker           204   2146111.
 8 Achim Zeileis        756   1859785.
 9 RStudio             1652   1824386.
10 Yihui Xie            378   1738956.
# … with 16,901 more rows

Yes, by betweenness at least.
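Betweenness counts how often a node sits on the shortest paths between other pairs of nodes. A minimal sketch on a made-up graph, calling igraph::betweenness() directly (tidygraph's centrality_betweenness() is a wrapper around it); the author names are hypothetical:

```r
library(igraph)

# Two hypothetical triangles of co-authors, joined only by the C - D link.
toy <- graph_from_literal(A - B, A - C, B - C, C - D, D - E, D - F, E - F)

btw <- betweenness(toy, directed = FALSE)

# C and D carry every shortest path between the two triangles.
sort(btw, decreasing = TRUE)
```

Authors who bridge otherwise separate groups score highly, which is one reading of why prolific collaborators top the real table.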

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/nacnudus/duncangarmonsway, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Garmonsway (2019, March 3). Duncan Garmonsway: With added bacran. Retrieved from https://nacnudus.github.io/duncangarmonsway/posts/2019-02-27-with-added-bacran/

BibTeX citation

@misc{garmonsway2019with,
  author = {Garmonsway, Duncan},
  title = {Duncan Garmonsway: With added bacran},
  url = {https://nacnudus.github.io/duncangarmonsway/posts/2019-02-27-with-added-bacran/},
  year = {2019}
}