On paper, internet exchanges (IXs) are very simple in their implementation: put together a bunch of routers on a shared layer 2 Ethernet switch (and coordinate IP addressing for the LAN), and those routers will be able to set up BGP sessions with each other and exchange traffic directly over the switch(es).
Internet exchange points are also a good place to add extra services to this LAN (that is, servers on the LAN for the benefit of the member routers). Common services include AS112, Network Time Protocol, route collectors, and, perhaps most importantly, route servers.
Route servers (RS) are a special form of BGP route reflector designed to simplify interconnection on internet exchanges. Members peer with the route server (which is just a BGP speaker), and the RS distributes the routes provided to it to the other members. A single BGP session therefore provides routes for every other exchange member that has also set up a session with the route server, meaning a new network does not have to contact every member on the exchange to set up sessions in its router config.
Route servers themselves do not actually exchange any of the traffic. Because all of the members are on the same layer 2 subnet, when routes are provided with a next hop IP address, that same next hop is passed along unchanged (since all peers are on the same LAN), and traffic flows directly between member routers. This means that a route server can coordinate, often, terabits of traffic with the slowest port speed that the IX can provide.
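This behaviour can be sketched in a few lines. The model below is a toy (it is not real BGP, and the member ASNs, prefixes, and data structures are invented for illustration), but it shows the key property: the RS redistributes routes while leaving the next hop untouched, so the RS never sits in the forwarding path:

```python
# Toy model of an IX route server: it redistributes routes between members
# but leaves the next hop attribute untouched, so traffic flows directly
# between member routers over the shared LAN. All values are illustrative.

def route_server(announcements):
    """announcements: {member_asn: [(prefix, next_hop_ip), ...]}
    Returns the routes each member learns from the route server."""
    learned = {}
    for member in announcements:
        learned[member] = [
            (prefix, next_hop)              # next hop passed along unchanged
            for other, routes in announcements.items()
            if other != member              # don't reflect a member's own routes back
            for prefix, next_hop in routes
        ]
    return learned

announced = {
    64501: [("192.0.2.0/24", "203.0.113.10")],
    64502: [("198.51.100.0/24", "203.0.113.20")],
}
routes = route_server(announced)
# AS64501 learns AS64502's prefix with AS64502's own router as the next hop:
print(routes[64501])  # [('198.51.100.0/24', '203.0.113.20')]
```

Because the next hop points at the other member's router rather than the RS, the RS only needs enough capacity for BGP control traffic, not the data plane.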
Another modern day advantage of using internet exchange route servers is that they are often a lot more secure than most IX member configurations. There has been a lot of pressure over time to improve the route security configuration of route servers, to the point where you will often find that route servers implement far better BGP route security practices (IRR list generation, RPKI ROV, Peerlock, next hop checking, AS_PATH validation, etc) than are actually applied on direct sessions. It is distressingly common for direct sessions (commonly known as bilateral/bi-lat sessions) to just accept everything offered to them, because there is an implicit trust that comes from somebody having actually gone and configured the session, often by hand.
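To make one of those practices concrete, here is a heavily simplified sketch of RPKI route origin validation of the kind a route server might apply. The ROA data is made up for illustration; real deployments fetch validated ROA payloads from an RPKI validator rather than hard-coding them:

```python
import ipaddress

# Simplified RPKI route origin validation (ROV) sketch. The ROA entries
# below are invented for illustration; a real route server would source
# these from an RPKI validator.
ROAS = [
    # (covering prefix, max prefix length, authorised origin ASN)
    (ipaddress.ip_network("192.0.2.0/24"), 24, 64501),
]

def rov_state(prefix_str, origin_asn):
    prefix = ipaddress.ip_network(prefix_str)
    covered = False
    for roa_prefix, max_len, roa_asn in ROAS:
        if prefix.version == roa_prefix.version and prefix.subnet_of(roa_prefix):
            covered = True
            if origin_asn == roa_asn and prefix.prefixlen <= max_len:
                return "valid"
    # A covering ROA exists but none matched: the announcement is invalid
    # (and a filtering route server would drop it).
    return "invalid" if covered else "not-found"

print(rov_state("192.0.2.0/24", 64501))     # valid
print(rov_state("192.0.2.0/24", 64666))     # invalid (wrong origin ASN)
print(rov_state("198.51.100.0/24", 64501))  # not-found (no covering ROA)
```

A route server doing ROV drops the "invalid" announcements; a bilateral session with no filters would simply accept them.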
Route servers on internet exchanges were a target for improvement because they were often implicated in large routing incidents, and because from an IX member's perspective it is very difficult to filter these BGP sessions correctly, as they are an amalgamation of all of the members in one session. To give a more direct example: if you generate IRR filters for the DE-CIX Frankfurt route servers, the resulting configuration will be bigger than the internet routing table itself. AS-DECIX expands to 2.2 million IPv4 prefixes (double the size of the current IPv4 routing table) and 885,000 IPv6 prefixes (roughly quadruple the size of the current IPv6 routing table).
Not everything is perfect in the land of IX route servers. Because peering with them is almost never mandatory, not every network uses them, and even those that do may not apply the same import/export policies that you would get on a direct/bilateral session.
Sometimes there is good reason for having a different policy for route servers vs direct peers, especially in the export (to the RS) direction, in cases where the network is very sensitive to bad routing (for example, anycast services).
Using bgp.tools (the company that I run), if we look at an internet exchange we can see that members have green ticks on a green "RS" icon. This displays whether bgp.tools can see routes from that member's router via the RS on the exchange:
What if we were to build a network that only peered with route servers: no transit, and no bilateral peering either?
For this we can use the wide internet exchange presence that bgp.tools has to combine all of the route server tables together. At the time of writing, the full internet routing table comes out to roughly 1,040,000 IPv4 prefixes and 240,000 IPv6 prefixes. Assuming you only had route servers (on the 100+ exchanges bgp.tools is present on) to work with, you would get 567,000 unique IPv4 prefixes (56.6% of a full table) and 145,000 unique IPv6 prefixes (61% of a full table).
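The calculation itself is just a set union over the per-exchange route server tables. A minimal sketch, using tiny placeholder tables in place of the real bgp.tools RS feeds:

```python
# Coverage calculation sketch: union every exchange's route server table
# and compare it to a full table. The prefixes and the full-table size here
# are placeholders; the real measurement used ~1,040,000 IPv4 prefixes.
rs_tables = {
    "IX-A": {"192.0.2.0/24", "198.51.100.0/24"},
    "IX-B": {"198.51.100.0/24", "203.0.113.0/24"},  # overlaps with IX-A
}
full_table_size = 5  # placeholder for the full routing table size

unique = set().union(*rs_tables.values())
coverage = len(unique) / full_table_size * 100
print(f"{len(unique)} unique prefixes, {coverage:.1f}% of a full table")
```

The overlap between exchanges is exactly why the union (567,000 IPv4 prefixes) is so much smaller than the sum of the individual tables.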
What is more interesting to me is how little uniqueness there is between exchanges. If we sort all of the exchanges' route servers and then look at the "diminishing returns" of adding each one after, we can see that even though there are large (100k+ prefix) routing tables from the route servers, there are not many new prefixes to be found once you add the top 5 exchanges:
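The "diminishing returns" ordering above can be computed greedily: repeatedly pick the exchange whose route server table contributes the most prefixes not yet covered. A sketch with tiny placeholder tables standing in for the real RS feeds:

```python
# Greedy diminishing-returns ordering: at each step, pick the exchange
# whose route server table adds the most not-yet-covered prefixes.
# Exchange names and tables are placeholders for the real data.
rs_tables = {
    "Big-IX": {"a", "b", "c", "d"},
    "Mid-IX": {"c", "d", "e"},     # mostly overlaps with Big-IX
    "Small-IX": {"d", "f"},
}

covered, order = set(), []
while len(order) < len(rs_tables):
    picked = [name for name, _ in order]
    ix, gain = max(
        ((name, len(tbl - covered)) for name, tbl in rs_tables.items()
         if name not in picked),
        key=lambda item: item[1],
    )
    order.append((ix, gain))
    covered |= rs_tables[ix]

for ix, gain in order:
    print(ix, "adds", gain, "new prefixes")
```

With the real data, the "gain" column collapses quickly after the largest few exchanges, which is the diminishing-returns effect described above.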
There are some obvious caveats to this calculation, mostly that it is not a hugely realistic scenario: you are unlikely to build a network across this particular collection of exchanges (especially over this geographic scope) to send traffic out of. It's also worth pointing out that this list of exchanges is missing a couple of quite large ones, notably the Equinix IXs.
Another notable caveat is that while you may be able to cover 60% (depending on the IP version) of the prefix count, that doesn't necessarily translate into the amount of traffic you're going to get. From what I've observed first hand with residential network traffic flows, at least 50% of traffic (as in gigabits per second) is concentrated into just five networks: Meta, Akamai, Google, Netflix, and Amazon (I don't know of a decent name for these, but I've been calling them the "Magna" networks).
The final caveat is that we are only looking at one direction of traffic flow. While we have been looking at what we can send out of the network, we do not know what other networks will accept from route servers and send to us.
The problem here is that BGP has no way of communicating back to the sending router that a peer has accepted or rejected a route, meaning that we have no direct feedback mechanism for whether traffic could flow in the other direction.
This makes the bgp.tools web interface example above a little bit more interesting, because while someone may not be exporting routes to the route server, they may be importing them!
To test inbound route acceptance we will have to use a method that involves the data plane. In short, we are going to have to ping the internet and see what replies come back to us (as opposed to the replies from networks that do not have our RS route, which will be lost into the void).
To do this we are going to locate and ICMP ping an IP address in each prefix by "spoofing" a ping request out of a transit port (where we can assume close to 100% internet reachability), and then listen on all of the internet exchange route collectors to see what, who, and where the replies come from:
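Actually emitting spoofed packets needs raw sockets and a cooperative transit port, so no snippet can reproduce that part. But the matching step at the other end is simple to sketch: each prefix was probed at one target IP, and if an echo reply from that target arrives at our IX listeners, the target's network must have accepted our RS-only route to send it. All addresses below are illustrative:

```python
# Sketch of the reply-matching step of the measurement. We probed one
# target IP per prefix; a reply arriving at the IX listeners means the
# remote network accepted our route-server-only announcement and used it
# for the return path. Prefixes and addresses are illustrative.
probed = {
    "192.0.2.0/24": "192.0.2.1",        # target that was pinged per prefix
    "198.51.100.0/24": "198.51.100.1",
}
replies_seen = {"192.0.2.1"}  # reply source IPs captured at the IX ports

works_with_rs_only = {
    prefix for prefix, target in probed.items() if target in replies_seen
}
print(sorted(works_with_rs_only))  # ['192.0.2.0/24']
```

Prefixes whose targets never reply are either not importing our RS routes, or were never pingable in the first place, which is exactly the limitation discussed next.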
This method does have some limitations, the most notable being that it requires us to find an IP address in every single prefix that we can ping and get a response from. This is not always possible, because of either aggressive firewalls or Carrier Grade Network Address Translation (CG-NAT), where prefixes are often "dark" (nothing replies to outside requests).
After running the scan using all exchanges, we get the following numbers:
| IP Version | Total Prefixes | Testable Prefixes | Works with RS Only |
|---|---|---|---|
| IPv4 | 1,040,000 | 988,942 | 148,605 |
| IPv6 | 240,000 | 68,890 | 34,451 |
For both IPv4 and IPv6, inbound reachability is quite low, at around 14% of the internet routing table. This is quite surprising given that outbound reachability is reasonably high, somewhere between 50% and 60%. The IPv6 numbers here are low confidence, because only 29% of all IPv6 prefixes have a known address that can be pinged (compared to 95% for IPv4). Given how hard it is to find any address that responds to ICMPv6, it is entirely possible that a large amount of IPv6 space on the internet is configured and announced but not routed past the edge router.
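The headline percentages fall straight out of the table above:

```python
# Reproducing the percentages quoted above from the table's raw counts.
totals = {
    "IPv4": (1_040_000, 988_942, 148_605),  # total, testable, RS-only reachable
    "IPv6": (240_000, 68_890, 34_451),
}

for version, (total, testable, rs_only) in totals.items():
    print(f"{version}: {testable / total:.0%} of prefixes testable, "
          f"{rs_only / total:.0%} of the table reachable via RS only")
```

Both IP versions land at roughly 14% of their respective tables, but the IPv6 figure rests on a much smaller testable base.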
Finally we can break down our response counts by exchange:
| Internet Exchange | Prefixes | Percent of responses |
|---|---|---|
| BBIX Singapore | 18,729 | 10.05% |
| IX.br (PTT.br) São Paulo | 18,062 | 9.69% |
| BBIX Tokyo | 16,931 | 9.09% |
| DE-CIX Frankfurt | 15,085 | 8.10% |
| LINX LON1 | 11,547 | 6.20% |
| MSK-IX Moscow | 11,118 | 5.97% |
| AMS-IX | 8,004 | 4.29% |
| NL-ix | 7,922 | 4.25% |
| DE-CIX Dallas | 4,949 | 2.65% |
| DE-CIX New York | 4,383 | 2.35% |
| BIX.BG | 4,071 | 2.18% |
| DE-CIX Madrid | 3,108 | 1.66% |
| THINX Warsaw | 3,033 | 1.62% |
Here we can see a surprising number of prefixes coming back via the BBIX exchanges (surprising because these exchanges are small as far as membership count goes). This seems to be because a select number of quite large residential and/or mobile networks, like China Mobile and Bharti Airtel, accept prefixes on the BBIX exchanges but not on others. Otherwise, the top exchanges by responses are the largest exchanges by membership count, as you would expect.
If you are running an outbound-traffic-heavy network then route servers are a great way to learn peering routes from a large set of exchanges with minimal effort. However, using route servers to attract traffic into your network does not seem to be a very effective strategy, given how few networks import routes from them (compared to those that export prefixes to them). This difference in acceptance of prefixes is interesting, and not what I would have expected.
For inbound-heavy networks (typically networks that service residential customers), most of the traffic is likely to be better served over a Private Network Interconnect (PNI) rather than an IX session, as the economics of a 100G IX port compared to the cost of data centre cabling for direct connections to the 5 (or so) major content networks is a pretty poor deal in most cases.
With the wider trend of major content networks either scaling back or removing their involvement in internet exchanges (replacing interconnections with PNIs where there is enough traffic, and "dumping" the rest to transit instead of using IXs), IXs could increasingly find themselves effectively servicing only the middle bracket of networks by size (to some degree this has already happened in a few markets). IX ports are often priced in a way that makes them too expensive for small networks to fill enough to be worth it, while for large networks (who can get better "bulk" rates on internet transit) IX ports are the most expensive way to move traffic in a lot of developed internet markets like Western Europe and the USA.
If you want to stay up to date with the blog you can use the RSS feed or you can follow me on Fediverse @benjojo@benjojo.co.uk!