And here’s what we actually get:
Remember that we were hoping for mostly 2, 3, and 4. Instead, there is still a long tail of packages with tree depths above 20. 20 is… much larger than I was expecting, and I was expecting to be disappointed.
But let’s re-use the “oddball” theory: perhaps all those packages with extremely deep dependency trees are rarely used, and not worth worrying about. Let’s check.
Here’s a scatterplot, where each package is placed based on its tree depth, plotted against how many other packages reference it. On the right of the plot are the most-referenced (~popular) packages; at the top are the deepest dependency trees. Again, my hopes first:
Followed by reality:
There are heaps of 10+ tree depths even among the most popular packages, and a few even reach 20+. Extremely deep trees are not just a problem of “oddball” packages.
The average number of direct dependencies (among packages with any dependencies at all) is 5. That, by itself, doesn’t seem too bad. It feels a little alarming when combined with high tree depths, though. Does that mean some of these packages have 5^10 total dependencies? (Spoiler: no.)
Here’s a graph of how many packages have 1 dependency, 2 dependencies… up to 30 — a nice, neat exponential decay.
This curve is clean enough that I would not be surprised to see something pretty similar in any package registry — maybe not the exact same parameters, but a similar shape. Not shown here is an incredibly long tail; there are 4 packages tied for the most direct dependencies with exactly 1000, and there are runners-up spread pretty smoothly up to that maximum.
Knowing the average depth and branching factor, you have to imagine that counting the total dependencies of each package, including dependencies-of-dependencies, is not going to yield good news. But many of the branches of a large dependency tree are shared — multiple packages in the tree all depend on the same library. And the tree depth I have measured is the maximum depth for each package — not the average. So the picture is not necessarily as dire as an initial back-of-the-envelope calculation might make it seem.
So, now that we’re convinced that we won’t be seeing packages with 9 million total dependencies, let’s see what the actual numbers are.
This doesn’t look too bad! Mostly under 200, which fits my imagined limits. It’s a little fatter in the 50-100 range than I’d like, and this chart doesn’t show the looooong tail leading out to the ultimate winners with over 2500 total dependencies, but this chart doesn’t make things look as bad as they feel.
For verification, let’s split things out by frequency of use, again; in this chart, packages that are more often depended upon (more popular) are toward the right, and packages that will install more total dependencies are toward the top.
The bad news here is the same as it has been. Many packages in that long tail are, in fact, reasonably frequently used. Even among the most-used packages are a few spots representing packages with over 1000 total dependencies.
Certainly, npm doesn’t match my hoped-for “healthy” qualities. You can take that to mean my desires are unrealistic, or that something is genuinely wrong.
As homework for the reader: an easy way to undercut my analysis would be to show that other package repositories have the same issues: cycles, high tree depths, high indirect-dependency counts, a large proportion of unmaintained packages, packages from the statistical tails among the most-used, etc.
Do what I’ve done for npm to PyPI.org, or rubygems.org, and see if the outcome is similar, or very different.
[EDIT: I have run a brief-but-similar analysis on Crates.io and the results are fascinating, but not so unambiguous as to include directly here.]
[ANOTHER EDIT: Ivan Krylov has kindly contributed an analysis of CRAN, whose results look similar in some ways to those for Crates.]
(request has been officially deprecated for months, but almost 50,000 packages still depend on it.)
That is exactly the sort of fragility that left-pad so dramatically exposed.
But the collection of these observations, combined with human nature, results in npmjs.org having 5 times more packages than PyPI, a huge number of which are undocumented, unmaintained, unused, or nonsensical. Even among the popular and frequently maintained packages, you’ll find packages with vast numbers of dependencies, including dependencies with security issues, deprecations, circular dependencies, and so on.
And because that is considered normal, it will continue to happen; and as it continues to happen more libraries will be based on those that already exist, increasing the bloat even further.
If you are an engineer tasked with selecting safe, useful, and long-lasting dependencies, none of this makes your job any easier.
This is the part of the essay where ideally I’d introduce my new automated package-repository-better-maker, that will solve all our problems. Instead, I can only offer some interesting things for you to imagine.
Imagine a set of rules that define a “healthy” package. They might look something like: the package is documented, it is actively maintained, and bug reports get a response in a timely manner.
But most of all, imagine this recursive rule: a healthy package depends only on healthy packages.
Now, starting with the most popular packages that have zero dependencies, begin collecting a list of healthy packages. Then you can start looking at the packages that depend upon those, walking the dependency graph backwards and flagging what is not only safe, but recursively so. Eventually (with time, and enough social pressure to be “healthy”), you could build something akin to a standard library. When a package like request decides to deprecate, that will mean something to any approved packages that depend on it: they must find a replacement, or lose approved status themselves.
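As a toy sketch of that walk (an in-memory fixpoint over a map of direct dependencies; passesOwnChecks stands in for whatever per-package rules you imagined above, and none of this is an existing tool):

```js
// deps: Map of package name -> array of direct dependency names.
// A package joins the healthy set only when it passes its own checks AND every
// one of its dependencies is already in the set; the first pass therefore
// admits (only) qualifying zero-dependency packages, and each later pass walks
// one step further back up the graph.
function findHealthyPackages(deps, passesOwnChecks) {
  const healthy = new Set();
  let changed = true;
  while (changed) {
    changed = false;
    for (const [name, dependencies] of deps) {
      if (healthy.has(name)) continue;
      if (dependencies.every((d) => healthy.has(d)) && passesOwnChecks(name)) {
        healthy.add(name);
        changed = true;
      }
    }
  }
  return healthy;
}
```

A pleasant side effect of the fixpoint: packages caught in dependency cycles can never enter the set, because no member of the cycle can be admitted before the others.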
There’s no need to change npm for any of this to work! People can still publish silly packages and personal packages and packages they have no intention of maintaining — but it will be easy for developers in professional contexts to recognize the difference.
Any set of rules risks being gamed; for instance, if you say a healthy package responds to bug reports in a timely manner, you’re giving frustrated mis-users permission to yell and scream and threaten to report a project when they’re ignored.
How do we solve those problems?
I can imagine the partial (but never complete!) automation of grading project health.
I can imagine a halfway-compromise, where rather than a binary good-package/not-good-package, packages get a pagerank-like score that depends both on their own qualities, and the scores of all their dependencies. Then a team could say “we only use packages with a health-score above 65” and still feel like they’ve taken some responsibility. Perhaps a score would even encourage more participation — people do like to make numbers go up.
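A toy version of that score might blend a package’s own quality with the average score of its direct dependencies and iterate until the numbers settle; the 50/50 weighting and the 0–100 scale here are arbitrary choices of mine, not a proposal for the real formula.

```js
// deps: Map of package name -> array of direct dependency names.
// ownQuality(name) returns a 0-100 score from whatever per-package checks you
// like; unknown dependencies count as 0, which penalizes depending on them.
function healthScores(deps, ownQuality, rounds = 20, weight = 0.5) {
  const score = new Map([...deps.keys()].map((name) => [name, ownQuality(name)]));
  for (let round = 0; round < rounds; round++) {
    for (const [name, dependencies] of deps) {
      if (dependencies.length === 0) continue; // leaves keep their own quality
      const depAverage =
        dependencies.reduce((sum, d) => sum + (score.get(d) || 0), 0) /
        dependencies.length;
      // Half your own quality, half your dependencies' current scores.
      score.set(name, (1 - weight) * ownQuality(name) + weight * depAverage);
    }
  }
  return score;
}
```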
All of which should come as a stunning disappointment to you, reader. Thousands of words and charts and figures, and I don’t even have a complete solution! What gives?
I cannot offer a batteries-included answer, but I want you to walk away with these conclusions:
If you’re interested in discussing the problem, potential solutions, or berating me for getting it all wrong, feel free to reach out: email@example.com.
There is not a documented endpoint on replicate.npmjs.com for downloading the entire registry in bulk; for some time I assumed I would need to make 1.3 million separate requests to get the data I needed. However, the replicate API is actually a raw CouchDB instance. So (don’t follow this link!) https://replicate.npmjs.com/registry/_all_docs?include_docs=true is an incantation for a 50GB JSON file that contains more than enough information for this task.
I pushed the relevant data from that JSON file into a Postgres database. Note that if you attempt to use node-postgres, or any other asynchronous interface, all the work of parsing the JSON as a stream with oboe will be for naught: you’ll create tens of thousands of pending asynchronous calls, and still run out of memory. I used pg-native’s synchronous API.
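For illustration, a minimal sketch of that shape of program, streaming the dump with oboe and writing rows through pg-native’s synchronous querySync, might look like the following. The file name and the table and column names (matching the schema sketched just below) are my own, not the author’s actual code.

```js
const fs = require('fs');
const oboe = require('oboe');
const Client = require('pg-native');

const client = new Client();
client.connectSync(); // connection details come from the usual PG* env vars

// Stream the _all_docs dump one document at a time; never hold the whole
// 50GB file (or a backlog of pending inserts) in memory.
oboe(fs.createReadStream('all_docs.json'))
  .node('rows.*.doc', (doc) => {
    const latestTag = doc['dist-tags'] && doc['dist-tags'].latest;
    const latest = latestTag && doc.versions && doc.versions[latestTag];
    if (!latest) return oboe.drop; // skip unpublished or malformed documents

    client.querySync(
      'INSERT INTO packages (name) VALUES ($1) ON CONFLICT DO NOTHING',
      [doc._id]
    );
    for (const dep of Object.keys(latest.dependencies || {})) {
      client.querySync(
        `INSERT INTO dependencies (package_name, depends_on_name)
         VALUES ($1, $2) ON CONFLICT DO NOTHING`,
        [doc._id, dep]
      );
    }
    // Returning oboe.drop tells oboe to discard the parsed document instead of
    // keeping it in the growing in-memory tree.
    return oboe.drop;
  });
```

Synchronous inserts are slow, but they never allow more than one row to be in flight, which is exactly the point.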
My data schema was simple: a table for packages, and a join table (from packages to packages) for dependencies. (I also created and filled tables for peerDependencies, etc., but didn’t end up using them for this analysis.)
Inserting 1.3 million packages and 4.1 million dependencies will take several hours. Make sure you handle errors gracefully, so you’re not forced to restart the process. A few uniqueness-guaranteeing indices will help if you are forced to.
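For concreteness, here is roughly what that might look like, created through the same pg-native client. The column names, and the extra depth and cycle_length columns used by the later steps, are my own choices rather than the author’s schema; the compound UNIQUE constraint doubles as the restart-friendly index mentioned above.

```js
// `client` is the connected pg-native client from the ingestion sketch above.
client.querySync(`
  CREATE TABLE IF NOT EXISTS packages (
    name         text PRIMARY KEY,
    depth        integer,      -- filled in later by the depth calculation
    cycle_length integer       -- filled in later by the cycle detection
  )
`);

client.querySync(`
  CREATE TABLE IF NOT EXISTS dependencies (
    package_name    text NOT NULL,
    depends_on_name text NOT NULL,
    -- Makes ON CONFLICT DO NOTHING possible, so a crashed load can be re-run
    -- without creating duplicate rows.
    UNIQUE (package_name, depends_on_name)
  )
`);
```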
Most of the calculations were iterative, built like inductive proofs, and pretty gnarly.
To find circular dependencies, first find and mark all packages that depend on themselves. Then, excluding those, find 2-step cycles. Then, excluding those, find 3-step cycles. Etc.
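Sketched against the tables above, one way to run that search is to build the k-hop join dynamically and record the shortest cycle length found for each package; the 20-hop cutoff is my own guess, since the author doesn’t say where the search stopped.

```js
// `client` is the connected pg-native client from the ingestion sketch above.
// Mark every package that sits on a k-step cycle and hasn't already been
// marked with a shorter one. For k = 1 this is just self-dependencies.
function markCyclesOfLength(k) {
  const hops = [];
  for (let i = 2; i <= k; i++) {
    hops.push(`JOIN dependencies d${i} ON d${i}.package_name = d${i - 1}.depends_on_name`);
  }
  client.querySync(`
    UPDATE packages p SET cycle_length = ${k}
     WHERE p.cycle_length IS NULL
       AND EXISTS (
         SELECT 1
           FROM dependencies d1
           ${hops.join('\n           ')}
          WHERE d1.package_name = p.name
            AND d${k}.depends_on_name = p.name
       )
  `);
}

for (let k = 1; k <= 20; k++) markCyclesOfLength(k);
```

Because each pass skips anything already marked, cycle_length ends up recording the length of the shortest cycle a package participates in.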
Similarly, to calculate the depth of dependency trees, find all packages with no dependencies, and mark them as depth 0. Then find all packages that depend on packages marked 0, and mark them 1. Then find all the packages that depend on those marked 1, and mark them 2, re-marking any package that previously received a shallower mark, since what we want is the maximum depth. Continue until no package receives a new mark. (Make sure you’re not getting caught by the cycles you detected earlier!)
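A sketch of that loop, again with table and column names of my own choosing; the RETURNING clause is just a convenient way to know when a pass stops marking anything.

```js
// `client` is the connected pg-native client from the ingestion sketch above.
// Packages with no dependencies sit at depth 0.
client.querySync(`
  UPDATE packages SET depth = 0
   WHERE name NOT IN (SELECT package_name FROM dependencies)
`);

let depth = 0;
let marked;
do {
  depth += 1;
  // Anything that depends on a package at the previous depth is at least this
  // deep; re-marking packages that already had a shallower depth is what makes
  // the final value the *maximum* depth. Cyclic packages are skipped entirely,
  // otherwise this loop would never terminate.
  marked = client.querySync(`
    UPDATE packages p SET depth = ${depth}
      FROM dependencies d
      JOIN packages dep ON dep.name = d.depends_on_name
     WHERE d.package_name = p.name
       AND dep.depth = ${depth - 1}
       AND p.cycle_length IS NULL
    RETURNING p.name
  `);
} while (marked.length > 0);
```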
The biggest challenge was counting total indirect dependencies. A single package may depend on another single package in many ways — as a direct dependency, and as an indirect dependency multiple places elsewhere in its dependency tree. If you ignore that fact, especially for packages with deep trees, you can end up with results like “this package has 900,000 indirect dependencies”.
So there is no inductive-proof-style all-SQL solution to be had here. You must add another join table from packages onto packages, this time tracking indirect dependencies. It needs a compound uniqueness key. And you’ll need to iterate through every package, starting at depth = 1 and working up, filling out the indirect dependencies from both the direct dependencies table and all the indirect dependencies already recorded at lower depths.
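Here is roughly what that iteration could look like. The table name and queries are mine, not the author’s; the compound unique key plus ON CONFLICT DO NOTHING is what keeps each (package, dependency) pair counted only once, no matter how many paths lead to it. The table ends up holding direct dependencies as well, which is convenient for the total counts; packages involved in cycles have no depth and are simply skipped.

```js
// `client` is the connected pg-native client from the ingestion sketch above.
client.querySync(`
  CREATE TABLE IF NOT EXISTS indirect_dependencies (
    package_name    text NOT NULL,
    depends_on_name text NOT NULL,
    UNIQUE (package_name, depends_on_name)
  )
`);

const maxDepth = Number(
  client.querySync('SELECT max(depth) AS d FROM packages')[0].d
);

for (let depth = 1; depth <= maxDepth; depth++) {
  // A package at this depth reaches each of its direct dependencies, plus
  // everything those dependencies reach; their rows were filled in at lower
  // depths, so they are already complete.
  client.querySync(`
    INSERT INTO indirect_dependencies (package_name, depends_on_name)
    SELECT pairs.package_name, pairs.depends_on_name FROM (
      SELECT d.package_name, d.depends_on_name
        FROM dependencies d
        JOIN packages p ON p.name = d.package_name
       WHERE p.depth = ${depth}
      UNION
      SELECT d.package_name, i.depends_on_name
        FROM dependencies d
        JOIN packages p ON p.name = d.package_name
        JOIN indirect_dependencies i ON i.package_name = d.depends_on_name
       WHERE p.depth = ${depth}
    ) pairs
    ON CONFLICT DO NOTHING
  `);
}
```

Total dependencies per package are then just a count over this table: SELECT package_name, count(*) FROM indirect_dependencies GROUP BY package_name.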
Inserting 40 million entries into that join table will, as before, take many hours of computation, so try to get it right the first time.
The final graphs were made with D3.