Nerd-sniped by kevinbacran, a graph of CRAN co-authorship.
Matt Dray came up with the hilarious pun ‘kevinbacran’ and then had to follow through and make the package. It calculates the degrees of separation between CRAN package authors and Hadley Wickham. Return here when you’ve played with the shiny app.
If you’re still reading this then you have been nerd-sniped, as I was. To help you end your procrastination quickly, here are answers to a few questions:
The kevinbacran package provides a graph of package co-authorship, so I piggyback on that, just as it is my pleasure to do with Matt’s professional work. Nodes are authors; edges are packages.
# A tbl_graph: 16911 nodes and 99542 edges
#
# An undirected multigraph with 2102 components
#
# Node Data: 16,911 x 2 (active)
  name               number
  <chr>               <int>
1 Csillery Katalin        1
2 Lemaire Louisiane       2
3 Francois Olivier        3
4 Abdulmonem Alsaleh      4
5 Robert Weeks            5
6 Ian Morison             6
# … with 1.69e+04 more rows
#
# Edge Data: 99,542 x 3
   from    to package
  <int> <int> <chr>
1     1     2 abc
2     1     3 abc
3     1 12921 abc
# … with 9.954e+04 more rows
Incredibly, the igraph package is willing and able to calculate the length of the shortest route between all pairs of nodes, all at once, in a few seconds. It returns a square matrix, a bit like old people used to have on the back of a road atlas, with the driving times between cities, except here it’s the number of links in the chain joining one author to another. Infinite values represent impossible connections, but I substituted NA for these so that they don’t upset subsequent calculations.
cran_pkg_igraph <- as.igraph(cran_pkg_graph)
dists <- distances(cran_pkg_igraph)
dists[is.infinite(dists)] <- NA # is there a more efficient way?
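To see the shape of what distances() hands back, here is a minimal sketch on a made-up five-author graph. The authors a to e and the objects toy_edges, toy and toy_dists are hypothetical, not taken from the CRAN data. Two of the authors are cut off from the other three, so their distances to the rest start out as Inf.

```r
library(igraph)

# Hypothetical co-authorship edges: a-b-c form a chain, d-e are a separate pair.
toy_edges <- data.frame(from = c("a", "b", "d"), to = c("b", "c", "e"))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Square matrix of shortest path lengths, with authors as row and column names.
toy_dists <- distances(toy)
toy_dists[is.infinite(toy_dists)] <- NA

toy_dists["a", "c"] # 2: a reaches c through b
toy_dists["a", "d"] # NA: no chain of co-authors links a and d
```

Indexing by name rather than by position sidesteps any question about the order in which igraph stores the vertices.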
First I need to know which column (or row) represents Hadley. Then I can search that row (or column) for the highest value.
# The row and column number of Hadley Wickham in the matrix of distances.
hadley <-
  cran_pkg_graph %>%
  as_tibble() %>%
  dplyr::filter(name == "Hadley Wickham") %>%
  pull(number)
hadley

# The row of distances between Hadley and everyone else.
hadley_row <- dists[hadley, ]

# The maximum distance between Hadley and anyone else.
(max_hadley_row <- max(hadley_row, na.rm = TRUE))

# The people at the greatest distance from Hadley.
which(hadley_row == max_hadley_row)

    Chelsey Bryant Nicholas R Wheeler        Franz Rubel
              6050               6051               6052
Congratulations Chelsey, Nicholas and Franz, you are Hadley’s most distant relations.
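In case the named-vector output looks odd: which() returns positions, and because the row of the distance matrix carries author names, the positions come back labelled. A small sketch with invented distances (toy_row and its names are hypothetical) also shows why the NA entries don’t get in the way.

```r
# Hypothetical distances from one author to five others; NA marks "unreachable".
toy_row <- c(ann = 1, bob = 3, cat = NA, dan = 3, eve = 2)

toy_max <- max(toy_row, na.rm = TRUE)

# NA == 3 is NA, and which() only keeps TRUE, so unreachable authors drop out.
farthest <- which(toy_row == toy_max)
names(farthest) # "bob" "dan"
```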
I now wipe out the entire upper triangle (and the diagonal), because I don’t want to notice each path twice. The graph is undirected, so the matrix is symmetric and the bottom triangle already holds every pair once. I only need the maximum value in the bottom triangle.
dists[upper.tri(dists, diag = TRUE)] <- NA # Can it be done faster?

# Maximum distance between any two authors
(maximum_distance <- max(dists, na.rm = TRUE))
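The effect of blanking the upper triangle is easiest to see on a tiny matrix. This is a sketch with a made-up symmetric matrix for three authors (m, twice and once are hypothetical names), not the real distances.

```r
# Hypothetical symmetric distance matrix for three authors.
m <- matrix(c(0, 1, 2,
              1, 0, 1,
              2, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Without blanking, the longest path is found twice: as (a, c) and as (c, a).
twice <- which(m == max(m), arr.ind = TRUE)
nrow(twice) # 2

# Keep only the lower triangle, and each pair is reported once.
m[upper.tri(m, diag = TRUE)] <- NA
once <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)
nrow(once) # 1
```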
max_distances <-
  which(dists == maximum_distance, arr.ind = TRUE) %>%
  as_tibble() %>%
  mutate(from = rownames(dists)[row],
         to = colnames(dists)[col],
         distance = map2_int(row, col, ~ as.integer(dists[.x, .y]))) %>%
  select(from, to, distance)
max_distances

# A tibble: 9 x 3
  from               to            distance
  <chr>              <chr>            <int>
1 Chelsey Bryant     Qihuang Zhang       20
2 Nicholas R Wheeler Qihuang Zhang       20
3 Franz Rubel        Qihuang Zhang       20
4 Chelsey Bryant     Natalia Vilor       20
5 Nicholas R Wheeler Natalia Vilor       20
6 Franz Rubel        Natalia Vilor       20
7 Chelsey Bryant     Di Shu              20
8 Nicholas R Wheeler Di Shu              20
9 Franz Rubel        Di Shu              20
Oh hi Chelsey, Nicholas and Franz! Not only are you Hadley’s most distant relations, you are among the most distantly related package authors in the whole graph. What did you write?
nodes <- as_tibble(activate(cran_pkg_graph, nodes))
edges <- as_tibble(activate(cran_pkg_graph, edges))

nodes %>%
  dplyr::filter(name %in% c("Chelsey Bryant",
                            "Nicholas R Wheeler",
                            "Franz Rubel")) %>%
  inner_join(edges, by = c("number" = "from")) %>%
  distinct(name, package)

# A tibble: 3 x 2
  name               package
  <chr>              <chr>
1 Chelsey Bryant     kgc
2 Nicholas R Wheeler kgc
3 Franz Rubel        kgc
All three collaborated on the same package: kgc. The three authors on the other side of the relation are not collaborators – they each authored a different package.
nodes %>%
  dplyr::filter(name %in% c("Qihuang Zhang",
                            "Natalia Vilor",
                            "Di Shu")) %>%
  inner_join(edges, by = c("number" = "from")) %>%
  distinct(name, package)

# A tibble: 3 x 2
  name          package
  <chr>         <chr>
1 Qihuang Zhang augSIMEX
2 Natalia Vilor globalGSA
3 Di Shu        ipwErrorY
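One caveat about these lookups, as far as I can tell from the printed edge data: each edge stores one author in from and the other in to, so a join on from alone would miss a package whose author only ever appears in the to column. It happened to work for these six authors. A minimal base R sketch on made-up data that checks both ends (toy_nodes, toy_edges and packages_of are all hypothetical):

```r
# Hypothetical nodes and edges in the same layout as the real graph.
toy_nodes <- data.frame(name = c("ann", "bob", "cat"), number = 1:3,
                        stringsAsFactors = FALSE)
toy_edges <- data.frame(from = c(1, 1), to = c(2, 3),
                        package = c("pkgA", "pkgB"),
                        stringsAsFactors = FALSE)

# Packages on any edge that touches the author, whichever end they are on.
packages_of <- function(author, nodes, edges) {
  id <- nodes$number[nodes$name == author]
  unique(edges$package[edges$from == id | edges$to == id])
}

packages_of("cat", toy_nodes, toy_edges) # "pkgB", though cat is only ever a `to`
```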
A visualisation would be easier to interpret. A few incantations are required:
(As you can tell, I do very little of the work around here.)
set.seed(2019-03-03) # R reads this as arithmetic, so the seed is 2013

max_distances %>%
  mutate(kb_pair = map2(from, to, kb_pair, tidy_graph = cran_pkg_graph)) %>%
  pull(kb_pair) %>%
  reduce(graph_join, by = c("name", "number", ".tidygraph_node_index")) %>%
  as.igraph() %>%
  # Collapse parallel edges, keeping one package label per pair at random.
  igraph::simplify(remove.multiple = TRUE,
                   remove.loops = FALSE,
                   edge.attr.comb = "random") %>%
  as_tbl_graph() %>%
  kb_plot()
It’s lovely to see that Ernst Wit and Ernst C Wit are such near relations in work as in life.
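The simplify() step in that pipeline is what stops the plot drawing several lines between two people who share several packages. Here is a small sketch of what it does to a made-up multigraph (ann, bob, cat and the pkg names are hypothetical). I use edge.attr.comb = "first" here so the result is repeatable, whereas the code above uses "random".

```r
library(igraph)

# Hypothetical multigraph: ann and bob share two packages, so two parallel edges.
toy_edges <- data.frame(from = c("ann", "ann", "bob"),
                        to = c("bob", "bob", "cat"),
                        package = c("pkgA", "pkgB", "pkgC"),
                        stringsAsFactors = FALSE)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)
ecount(toy) # 3

# Merge parallel edges; "first" keeps the first package label of each bundle.
toy_simple <- igraph::simplify(toy,
                               remove.multiple = TRUE,
                               remove.loops = FALSE,
                               edge.attr.comb = "first")
ecount(toy_simple) # 2
```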
Are many authors so distantly related? What is the distribution of maximum distance between authors? That is, take each author’s maximum distance to anyone else they can reach; what is the distribution of that value across all authors?
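# dists lost its upper triangle above, so start again from the full matrix.
rowmaxes <- apply(distances(cran_pkg_igraph), 1, function(d) max(d[is.finite(d)]))
hist(rowmaxes)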
I … did not expect a bimodal distribution. That’s how naive I am about graphs.
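A toy version makes the two humps less mysterious. This is only a sketch on an invented graph (one chain of six authors plus three isolated pairs; all names are hypothetical), computing the same per-author maximum over reachable authors.

```r
library(igraph)

# Hypothetical graph: a chain a-b-c-d-e-f, plus the pairs p-q, r-s and t-u.
toy_edges <- data.frame(
  from = c("a", "b", "c", "d", "e", "p", "r", "t"),
  to   = c("b", "c", "d", "e", "f", "q", "s", "u")
)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Per-author maximum distance, ignoring unreachable authors (Inf).
toy_rowmaxes <- apply(distances(toy), 1, function(d) max(d[is.finite(d)]))
table(toy_rowmaxes)
```

The six authors in pairs all score 1, while the six in the chain score between 3 and 5, so even this tiny graph gives one lump of small values and another of larger ones.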
The graph can be split into cliques – my name for subgraphs that aren’t connected with one another (graph theorists call them connected components, and keep the word clique for something else). The most common clique size is two: pairs of authors who have collaborated only with each other. But the second largest clique only has 29 people in it, whereas the largest clique has 9251 in it.
cliques <- components(cran_pkg_igraph)
table(cliques$csize)

   2    3    4    5    6    7    8    9   10   11   12   13   14   15
 919  481  282  136   73   63   36   35   15   16    3    5    7    5
  16   17   18   19   20   21   22   23   24   29 9251
   3    6    3    1    3    3    2    2    1    1    1
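For anyone else new to this, here is a sketch of what components() returns, on a made-up graph with one group of three authors and one pair (toy_edges, toy and toy_components are hypothetical names).

```r
library(igraph)

# Hypothetical graph: a-b-c are connected, d-e are a separate pair.
toy_edges <- data.frame(from = c("a", "b", "d"), to = c("b", "c", "e"))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

toy_components <- components(toy)
toy_components$no           # 2: the number of separate groups
toy_components$membership   # named vector: which group each author is in
table(toy_components$csize) # one group of size 2 and one of size 3
```

The top row of the table is the group size and the bottom row is how many groups have that size, which is how to read the output for the real graph too.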
Like nearly everyone else, Hadley Wickham is in the largest clique. Who is in the next largest clique?
hadley_clique <- cliques$membership["Hadley Wickham"]
max_non_hadley_size <- max(cliques$csize[-hadley_clique])

# The largest graph not connected with Hadley
largest_non_hadley_cliques <- which(cliques$csize == max_non_hadley_size)

# Members of those cliques
non_hadley_members <-
  cliques$membership[cliques$membership %in% largest_non_hadley_cliques]

# Graph of the cliques
cliques_graph <-
  cran_pkg_graph %>%
  activate(nodes) %>%
  filter(name %in% names(non_hadley_members))

# Packages authored in the largest non-Hadley cliques
pkgs <- tools::CRAN_package_db()
cliques_graph %>%
  activate(edges) %>%
  as_tibble() %>%
  distinct(package) %>%
  inner_join(pkgs[, c("Package", "Description")],
             by = c("package" = "Package")) %>%
  mutate(Description = str_replace_all(Description, "[\n ]+", " ")) %>%
  print(n = Inf)

# A tibble: 26 x 2
   package      Description
   <chr>        <chr>
 1 ald          "It provides the density, distribution function, quan…
 2 ALDqr        "EM algorithm for estimation of parameters and other …
 3 ARCensReg    It fits an univariate left or right censored linear r…
 4 ARpLMEC      It fits left, right or interval censored mixed-effect…
 5 BayesCR      Propose a parametric fit for censored linear regressi…
 6 bssn         It provides the density, distribution function, quant…
 7 CensMixReg   Fit censored linear regression models where the rando…
 8 CensRegMod   Fits univariate censored linear regression model unde…
 9 CensSpatial  Fits linear regression models for censored spatial da…
10 endtoend     Computes the expectation of the number of transmissio…
11 FMsmsnReg    Fit linear regression models where the random errors …
12 hopbyhop     Computes the expectation of the number of transmissio…
13 lqr          It fits a robust linear quantile regression model usi…
14 mixsmsn      Functions to fit finite mixture of scale mixture of s…
15 MomTrunc     It computes the raw moments for the folded and trunca…
16 nlsmsn       Fit univariate non-linear scale mixture of skew-norma…
17 Opportunist… Computes the routing distribution, the expectation of…
18 PartCensReg  It estimates the parameters of a partially linear reg…
19 qrLMM        Quantile regression (QR) for Linear Mixed-Effects Mod…
20 qrNLMM       Quantile regression (QR) for Nonlinear Mixed-Effects …
21 SMNCensReg   Fit univariate right, left or interval censored regre…
22 ssmn         Performs the EM algorithm for regression models using…
23 ssmsn        It provides the density and random number generator f…
24 StempCens    It estimates the parameters of a censored or missing …
25 tlmec        Fit a linear mixed effects model for censored data wi…
26 TTmoment     Computing the first two moments of the truncated mult…
That is a clique of regression packages.
Uh, what even is centrality?
Choosing ?centrality_information at random:
‘centrality_information’: centrality based on inverse sum of resistance distance between nodes (‘netrankr’)
Crikey, that takes a while; let’s try centrality_betweenness instead.
centrality <-
  cran_pkg_graph %>%
  activate(nodes) %>%
  mutate(importance = centrality_betweenness(directed = FALSE))

centrality %>%
  activate(nodes) %>%
  as_tibble() %>%
  arrange(desc(importance))

# A tibble: 16,911 x 3
   name              number importance
   <chr>              <int>      <dbl>
 1 Hadley Wickham       354   6068597.
 2 R Core              1465   5785565.
 3 Martin Maechler      273   3599064.
 4 Brian Ripley        1704   2933798.
 5 Dirk Eddelbuettel   1074   2757885.
 6 Kurt Hornik          559   2296512.
 7 Ben Bolker           204   2146111.
 8 Achim Zeileis        756   1859785.
 9 RStudio             1652   1824386.
10 Yihui Xie            378   1738956.
# … with 16,901 more rows
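So what do those numbers mean? Betweenness counts the shortest paths between other pairs of nodes that pass through a given node, so a high score marks someone who sits on a lot of routes between other people. Here is a sketch on a made-up graph, calling igraph’s betweenness() directly (tidygraph’s centrality_betweenness() is documented as a wrapper around it). The names hub, a, b, c and d are hypothetical.

```r
library(igraph)

# Hypothetical graph: hub links to a, b and c; c also links on to d.
toy_edges <- data.frame(from = c("hub", "hub", "hub", "c"),
                        to = c("a", "b", "c", "d"))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

toy_betweenness <- betweenness(toy, directed = FALSE)
sort(toy_betweenness, decreasing = TRUE)
# hub scores 5: it sits on the paths a-b, a-c, a-d, b-c and b-d.
# c scores 3: it sits on the paths a-d, b-d and hub-d.
# a, b and d score 0: nobody has to pass through them.
```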