A family like no other: studying the graph behind the Mathematics Genealogy Project

This post covers two topics. The first, as the title suggests, is the graph behind the Mathematics Genealogy Project and its many interesting (and, at times, surprising) properties: the "what". The second, and maybe the more important one, is "how" this post was written. We'll cover them in reverse order.

The "how"

Every few years, I have a technological "oh wow, this is so cool" moment. From somewhat recent memories:

  • In the early 2010s, getting good Wi-Fi on an Emirates flight over the North Pole,
  • In 2015, test-driving the Tesla Model S with its brand-new Autopilot feature in Atlanta (as a poor student with no intent or means of financing the car!),
  • In 2022, taking my first fully driverless Waymo ride in San Francisco.

Last week, I had another one of those moments. Over the weekend, as I was puttering around GitHub repositories, I found j2kun/math-genealogy-scraper, a scraper for the Mathematics Genealogy Project (MGP). The MGP is a database of the academic genealogy of mathematicians. As a rite of passage, newly minted mathematicians enter their graduation details upon graduation, thereby adding themselves to the genealogy. (I did so myself, of course, when I graduated in 2017!)

So I checked the repo's 2020 snapshot file, data.json, and there, in lines 2,946,918–2,946,932, I was:

   {
     "id": 224170,
     "name": "Fabian Rigterink",
     "thesis": "Pooling Problems: Advances in Theory and Applications",
     "school": "University of Newcastle",
     "country": "Australia",
     "year": 2017,
     "subject": "90—Operations research, mathematical programming",
     "advisors": [
       51372,
       75952,
       88664
     ],
     "students": []
   },

I forked the repository, scraped the missing 2020+ data, wrote a simple build_graph.py script that builds a clean NetworkX directed acyclic graph (DAG) from the scraped data, and — I ran out of time. I had to go to work the next day!
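To make the graph-building step concrete, here is a minimal sketch that turns records shaped like the one above into advisor -> student edges and checks that the result has no cycles. It is not the actual build_graph.py: the function names and the toy records are made up for illustration, and it uses only the standard library so that it runs anywhere. With NetworkX, the same idea maps onto `nx.DiGraph` and `nx.is_directed_acyclic_graph`.

```python
from collections import deque

def build_graph(records):
    """Return {advisor_id: set(student_ids)} from MGP-style records."""
    ids = {r["id"] for r in records}
    children = {i: set() for i in ids}
    for r in records:
        for advisor in r.get("advisors", []):
            # Skip advisor ids missing from the data set, and self-loops.
            if advisor in ids and advisor != r["id"]:
                children[advisor].add(r["id"])
    return children

def is_dag(children):
    """Kahn's algorithm: True if every node can be topologically ordered."""
    indegree = {n: 0 for n in children}
    for kids in children.values():
        for k in kids:
            indegree[k] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    ordered = 0
    while queue:
        n = queue.popleft()
        ordered += 1
        for k in children[n]:
            indegree[k] -= 1
            if indegree[k] == 0:
                queue.append(k)
    return ordered == len(children)

# Hypothetical toy records (ids and links are invented, not real MGP data).
records = [
    {"id": 1, "advisors": []},
    {"id": 2, "advisors": [1]},
    {"id": 3, "advisors": [1, 2]},
    {"id": 4, "advisors": [3, 99]},  # 99 does not exist in the data set
]
graph = build_graph(records)
print(graph, is_dag(graph))
```

Dropping dangling advisor ids is one possible cleaning choice; a real pipeline might prefer to log them instead.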

Not wanting to wait until the next weekend, I had an idea: I'd start a Claude Code session, set up Remote Control, leave my personal MacBook running at home, and continue from my phone via voice while driving to and from work in San Francisco, finally making good use of the ~1.5-hour commute.

This moment — continuing my local Claude Code session by giving Claude voice instructions on what to do with the MGP graph — all while going up and down the US-101, and then having Claude read its findings back to me: magic!

The "what"

Now that we have the "how" out of the way, here comes the "what" — my Claude's findings! Note: the following is just a highlight reel of findings. The full list of 30+ questions is in eda.ipynb.

My own ancestry

If you'd like to see just how far the apple can fall from the tree, below is my own ancestry. I can trace it to Sharaf al-Dīn al-Ṭūsī (about 1135–1213), an Islamic mathematician who was born in Tus (hence "al-Ṭūsī") in today's Iran and later studied and taught in Damascus and Aleppo (both in Syria) and Mosul (Iraq). With 244,970 descendants, he sits at the extreme end of the database: only a few other mathematicians have more descendants.

My academic ancestry, from Sharaf al-Dīn al-Ṭūsī (1135–1213) to me (2017)

The ancestry includes some very famous names:

10 interesting — and, at times, surprising — findings from the complete MGP graph

The deep past

  1. There is no Mathematical Eve. Even the broadest single ancestor reaches only about 74% of all modern PhDs: a tight cluster of 14th- and 15th-century Byzantine and Renaissance scholars (Bessarion, Plethon, Manuel Chrysoloras, Thomas à Kempis) is tied at 73.7%. The latest single mathematician with a known year whose descendants make up more than half of modern PhDs is Siméon Denis Poisson, in 1800 (51.1%); after Poisson, math branched too widely to ever re-converge.

Coverage of modern PhDs by single ancestor: Poisson is the latest 50%+ root
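The coverage figures above boil down to a reachability count: collect everyone reachable from an ancestor along advisor -> student edges, then divide by the number of modern PhDs. The sketch below shows that calculation on a toy graph. The `modern_since` cutoff is a hypothetical parameter (the post does not state which years count as "modern"), and the names and data are invented.

```python
from collections import deque

def descendants(children, root):
    """All nodes reachable from root via advisor -> student edges (root excluded)."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for kid in children.get(node, ()):
            if kid not in seen:
                seen.add(kid)
                queue.append(kid)
    return seen

def coverage(children, year, root, modern_since):
    """Fraction of PhDs with year >= modern_since that descend from root."""
    moderns = {n for n, y in year.items() if y is not None and y >= modern_since}
    if not moderns:
        return 0.0
    return len(descendants(children, root) & moderns) / len(moderns)

# Toy data: two separate families, one modern PhD in each.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "X": ["Y"]}
year = {"A": 1500, "B": 1900, "C": 1910, "D": 2001, "X": 1950, "Y": 2005}
share = coverage(children, year, "A", modern_since=2000)
print(share)
```

Running `coverage` once per candidate ancestor and taking the maximum gives the "broadest single ancestor" number.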

  2. The longest unbroken advisor-student chain is 54 generations long. It spans roughly 800 years of pedagogy, running from medieval Persia through Byzantine scholars, Renaissance Italy, and Reformation Germany, via Gauß and Bessel, to a 2010 Wayne State PhD.

The 54-generation chain from al-Ṭūsī to a 2010 Wayne State PhD
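In a DAG, a longest chain like this can be found with one pass in topological order, keeping the best chain length ending at each node. Below is a small standard-library sketch of that idea on an invented toy graph. It counts generations as nodes on the path, which is an assumption about how the 54 was counted. NetworkX offers `nx.dag_longest_path` for the same job.

```python
from collections import deque

def longest_chain(children):
    """Longest advisor -> student path in a DAG, returned as a list of nodes."""
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)
    indegree = {n: 0 for n in nodes}
    for kids in children.values():
        for k in kids:
            indegree[k] += 1

    best_len = {n: 1 for n in nodes}  # nodes on the longest chain ending at n
    prev = {n: None for n in nodes}   # predecessor on that chain
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:                      # visit nodes in topological order
        n = queue.popleft()
        for k in children.get(n, ()):
            if best_len[n] + 1 > best_len[k]:
                best_len[k] = best_len[n] + 1
                prev[k] = n
            indegree[k] -= 1
            if indegree[k] == 0:
                queue.append(k)

    end = max(nodes, key=best_len.get)
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    return chain[::-1]

# Toy graph: the direct a -> c edge is shorter than the a -> b -> c route.
toy = {"a": ["b", "c", "e"], "b": ["c"], "c": ["d"], "d": [], "e": []}
chain = longest_chain(toy)
print(chain, len(chain))
```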

20th-century ruptures

  3. WWII and its effects on mathematics. Take German-trained mathematicians who were active in 1933 and split them by whether they emigrated. Those who emigrated kept producing students at 94% of their pre-1933 rate; those who stayed dropped to 13%. More than 80 years later, the geographic signature is still visible: 51% of emigrant-branch descendants did their PhDs in the US, while 43% of stayer-branch descendants did theirs in Germany.

Where the descendants of each 1933-era branch ended up

  4. Princeton overtook Göttingen in the 1940s. Göttingen had been producing about 45 PhDs per decade from 1900 to 1930; it collapsed to about 12 during 1933–1945. Princeton went exponential in the same years and never looked back. Mathematics' center of gravity crossed the Atlantic in a single decade.

Princeton overtakes Göttingen, 1940s

  5. The other 20th-century ruptures are visible, too. The WWI dip in Germany and France, the Nazi-era trough (1933–1945), and the abrupt collapse of Russian PhD production after 1991 (followed by a slow partial recovery) all jump out of the per-country time series, with the war bands shaded in magenta below.

Per-country PhD output 1900–2014, with WWI/WWII/Nazi-purge bands

What gets inherited

  6. Subject is staggeringly heritable. A student inherits their advisor's Mathematics Subject Classification (MSC) code 70% of the time, which is 11.7× the 6% random baseline. Computer science passes on at 93%, game theory / economics at 93%, number theory at 87%, statistics at 83%.

P(student subject | advisor subject): diagonal = inheritance
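The inheritance rate and its lift over chance can be computed directly from (advisor subject, student subject) pairs. The sketch below assumes one particular definition of the random baseline: the probability of a match if student subjects were drawn independently of the advisor's. The post does not spell out how its 6% was derived, so treat that definition, the function name, and the toy pairs as illustrative. As a sanity check on the quoted figures, 70 / 6 ≈ 11.7.

```python
from collections import Counter

def inheritance_stats(pairs):
    """pairs: iterable of (advisor_subject, student_subject) tuples."""
    pairs = list(pairs)
    n = len(pairs)
    match_rate = sum(a == s for a, s in pairs) / n
    # Assumed baseline: chance of a match when the student's subject is
    # drawn independently of the advisor's subject.
    adv = Counter(a for a, _ in pairs)
    stu = Counter(s for _, s in pairs)
    baseline = sum((adv[c] / n) * (stu[c] / n) for c in adv)
    return match_rate, baseline, match_rate / baseline

# Invented toy pairs, not MGP data.
pairs = [("NT", "NT"), ("NT", "NT"), ("NT", "CS"), ("CS", "CS")]
match_rate, baseline, lift = inheritance_stats(pairs)
print(match_rate, baseline, lift)
```

Grouping the same pairs by advisor subject gives the per-subject rates (the diagonal of the matrix in the figure).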

  7. The Bourbaki effect is traceable four generations later. The seven Bourbaki founders' combined academic descendants (≈2,240 mathematicians) are 13× over-represented in algebraic geometry, 9× in category theory, and 5.5× in algebraic topology, and under-represented at 0.2× in computer science. An intellectual program is heritable, just like a subject.
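An over-representation factor of this kind is a ratio of shares: the subject's share among the group's descendants divided by its share in a reference population. The sketch below assumes the reference is the whole database, which the post does not state explicitly. The subject labels and counts are invented.

```python
from collections import Counter

def representation_ratio(group_subjects, all_subjects):
    """Share of each subject in the group divided by its share overall."""
    group, overall = Counter(group_subjects), Counter(all_subjects)
    g_n, o_n = len(group_subjects), len(all_subjects)
    return {s: (group[s] / g_n) / (overall[s] / o_n) for s in overall}

# Invented toy population: 20% algebraic geometry (AG), 80% computer science (CS).
all_subjects = ["AG"] * 2 + ["CS"] * 8
group_subjects = ["AG", "CS"]  # the group is half AG, half CS
ratios = representation_ratio(group_subjects, all_subjects)
print(ratios)
```

A ratio above 1 means over-representation, and a ratio below 1 (like the 0.2× for computer science) means under-representation.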

The shape of the network

  8. The math genealogy is not scale-free. The "classic scale-free network" framing is widely repeated for academic genealogies, but for advisor fertility a pure power law loses the Clauset–Shalizi–Newman likelihood-ratio test against a lognormal by a wide margin (R = −11.8, p ≈ 10⁻³²). This is the Broido–Clauset story: many networks billed as scale-free fit lognormal much better.

Empirical advisor-fertility CCDF vs. fitted power-law and lognormal
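For intuition about what R and p mean here, the following is a simplified sketch of a likelihood-ratio comparison between a power law and a lognormal. It makes several assumptions that a real analysis would not: the data are treated as continuous, x_min is fixed at the sample minimum, the lognormal is not truncated at x_min, and the sample is synthetic rather than MGP data. A negative normalized ratio favours the lognormal. If the reported R = −11.8 is a normalized ratio (an assumption), then erfc(11.8 / √2) is indeed on the order of 10⁻³².

```python
import math
import random

def loglik_power_law(xs, xmin):
    """Per-point log-likelihoods under a continuous power law fitted by MLE."""
    alpha = 1 + len(xs) / sum(math.log(x / xmin) for x in xs)
    return [math.log((alpha - 1) / xmin) - alpha * math.log(x / xmin) for x in xs]

def loglik_lognormal(xs):
    """Per-point log-likelihoods under a lognormal fitted by MLE."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return [-v - norm - (v - mu) ** 2 / (2 * sigma ** 2) for v in logs]

def likelihood_ratio_test(ll_a, ll_b):
    """Normalized log-likelihood ratio (negative favours b) and two-sided p."""
    diffs = [a - b for a, b in zip(ll_a, ll_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    r_norm = mean * math.sqrt(n) / sd
    return r_norm, math.erfc(abs(r_norm) / math.sqrt(2))

# Synthetic lognormal sample standing in for advisor fertility.
random.seed(0)
xs = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
r_norm, p = likelihood_ratio_test(loglik_power_law(xs, min(xs)), loglik_lognormal(xs))
print(r_norm, p)
```

For real count data, the `powerlaw` Python package handles discrete fits and x_min selection, which this sketch skips.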

  9. The genealogy isn't a random branching process, either. A Galton–Watson simulation calibrated to MGP's offspring distribution matches the empirical 60% extinction-at-depth-0 rate, but MGP has ~28× more lineages reaching depth ≥ 5 than the random process predicts (15.9% empirical vs. 0.6% simulated). Once a lineage starts, it survives far longer than chance, presumably because school, subject, and cohort effects all reinforce it.

Lineage-depth distribution: MGP empirical vs. Galton–Watson simulated
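A Galton–Watson process gives every individual an independent number of "students" drawn from the same offspring distribution. The sketch below simulates lineage depth under a made-up offspring distribution whose only calibrated feature is P(0 students) = 0.6. The real simulation would use MGP's empirical distribution, so the depth ≥ 5 share printed here will not match the 0.6% quoted above.

```python
import random

def lineage_depth(offspring_probs, rng, depth_cap=50, pop_cap=10_000):
    """Number of non-empty generations below a single founder."""
    counts = range(len(offspring_probs))
    population, depth = 1, 0
    while depth < depth_cap:
        # Each member of the current generation draws its own student count.
        kids = sum(rng.choices(counts, weights=offspring_probs, k=population))
        if kids == 0:
            break
        depth += 1
        population = min(kids, pop_cap)  # cap to keep runaway lineages cheap
    return depth

# Hypothetical offspring distribution: P(0)=0.6, P(1)=0.2, P(2)=0.12, P(3)=0.08.
offspring_probs = [0.6, 0.2, 0.12, 0.08]
rng = random.Random(42)
depths = [lineage_depth(offspring_probs, rng) for _ in range(20_000)]

share_depth_0 = sum(d == 0 for d in depths) / len(depths)
share_depth_5 = sum(d >= 5 for d in depths) / len(depths)
print(share_depth_0, share_depth_5)
```

Comparing `share_depth_5` with the empirical share from the real graph is what exposes the gap between chance and actual lineage survival.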

What MGP doesn't measure

  10. Tiny direct lineages can compound enormously. Newton has 3 direct students but 31,505 descendants. Laplace has 2 but 158,184. Leibniz has 2 but 183,934. Meanwhile, Cantor has 1 descendant total, and Galois, Abel, and Ramanujan have 0 each. MGP measures formal PhD supervision, not influence: a brilliant student in the right century compounds into 100,000 descendants; a brilliant mathematician without a faculty post compounds into zero.

All code is available in the GitHub repo fabianrigterink/math-genealogy-scraper. For an extended analysis, see eda.ipynb.