The research roots of finding “optimal routing” networks trace back to the late 1970s. Mathematicians defined special kinds of networks called “expanders“. These are graphs with strong connectivity properties guaranteeing no subset of vertices can be isolated from the rest. In 1976, Leslie Valiant gave one of the earliest discussions of such graphs. Following work on Alon-Boppana on trying to understand the best “possible” expanders, mathematicians (notably, Lubotzky, Phillips, and Sarnak) gave constructions of such optimal expanders. These were intricate designs, used advanced number theory, and only work for specific network sizes and degrees.
RNG (Resilient Network Graphs) history
In 2023, Giacomo Bernardi (AWS principal scientist) started to investigate whether datacenter routers could be arranged in a flat network following Penrose tiling, a geometrical construction where shapes tessellate without ever quite repeating. Ratul Mahajan, an Amazon Scholar and Professor at the University of Washington, was intrigued. The two spent months exploring the idea, building simulations, and pushing the concept as far as it would go.
Relative advantages of RNG over Fat Trees
- Resilience: in an RNG fabric, no single router is more critical than any other. The loss of 1% of routers results in a roughly 1% capacity loss — degradation is proportional and predictable rather than catastrophic. In a fat tree network, losing the wrong spine switch can take down a disproportionate share of capacity.
- Efficiency: because all paths through the network are statistically equivalent, capacity is fungible. There is no “stranded bandwidth” locked behind a particular layer — any available capacity can serve any traffic demand.
- Incremental scalability: unlike fat-trees, which come in fixed sizes dictated by switch radix and layer count, an RNG fabric can be scaled up continuously. You add routers and connections without redesigning the topology or hitting a capacity cliff — the graph simply grows.
Relative limitations of RNG (and mitigations)
- Operational complexity: paths through a random graph are less predictable than in a tree, making troubleshooting harder with conventional tools. We mitigate this with purpose-built diagnostic software that gives operators visibility into traffic distribution and fault localization despite the lack of hierarchical structure.
- Performance guarantees are stochastic rather than deterministic. The worst case performance (for metrics such as number of hops and oversubscription) is known, but for RNG our models are stochastic (i.e., the worst case performance is known with high probability). This is a weaker limitation than it might appear. Fat-tree guarantees are also effectively stochastic once you account for real-world failures, which are frequent at scale. RNG simply makes the stochastic nature explicit and designs for it from the start.
References
- RNG research paper: https://arxiv.org/abs/2604.15261
- About Amazon story: https://www.aboutamazon.com/stories/aws-random-graph-theory-data-center-network-design?&utm_term=36
- Amazon Science story: https://www.amazon.science/blog/how-flat-is-replacing-fat-in-aws-data-center-networks
- Video explainer on Youtube: https://www.youtube.com/watch?v=yDoRYRRPOA0
- Relevant images: https://amazongca.getbynder.com/share/B6E5A14E-AFFB-43AD-83599AFEABCBAB6A/?viewType=grid
- “VL2: A Scalable and Flexible Data Center Network” by Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave A. Maltz, Parveen Patel, and Sudipta Sengupta. SIGCOMM 2009