James Routley

At Niantic, we are pioneering the concept of a Large Geospatial Model that will use large-scale machine learning to understand a scene and connect it to millions of other scenes globally.

When you look at a familiar type of structure – whether it’s a church, a statue, or a town square – it’s fairly easy to imagine what it might look like from other angles, even if you haven’t seen it from all sides. As humans, we have “spatial understanding” that means we can fill in these details based on countless similar scenes we’ve encountered before. But for machines, this task is extraordinarily difficult. Even the most advanced AI models today struggle to visualize and infer missing parts of a scene, or to imagine a place from a new angle. This is about to change: Spatial intelligence is the next frontier of AI models.

As part of Niantic’s Visual Positioning System (VPS), we have trained more than 50 million neural networks, with more than 150 trillion parameters, enabling operation in over a million locations. In our vision for a Large Geospatial Model (LGM), each of these local networks would contribute to a global large model, implementing a shared understanding of geographic locations, and comprehending places yet to be fully scanned.

The LGM will enable computers not only to perceive and understand physical spaces, but also to interact with them in new ways, forming a critical component of AR glasses and fields beyond, including robotics, content creation and autonomous systems. As we move from phones to wearable technology linked to the real world, spatial intelligence will become the world’s future operating system.

Geospatial models are a step beyond even 3D vision models in that they capture 3D entities that are rooted in specific geographic locations and have a metric quality to them. Unlike typical 3D generative models, which produce unscaled assets, a Large Geospatial Model is bound to metric space, ensuring precise estimates in scale-metric units. These entities therefore represent next-generation maps, rather than arbitrary 3D assets. While a 3D vision model may be able to create and understand a 3D scene, a geospatial model understands how that scene relates to millions of other scenes, geographically, around the world. A geospatial model implements a form of geospatial intelligence, where the model learns from its previous observations and is able to transfer knowledge to new locations, even if those are observed only partially.

While AR glasses with 3D graphics are still several years away from the mass market, there are opportunities for geospatial models to be integrated with audio-only or 2D display glasses. These models could guide users through the world, answer questions, provide personalized recommendations, help with navigation, and enhance real-world interactions. Large language models could be integrated so understanding and space come together, giving people the opportunity to be more informed and engaged with their surroundings and neighborhoods. Geospatial intelligence, as emerging from a large geospatial model, could also enable generation, completion or manipulation of 3D representations of the world to help build the next generation of AR experiences. Beyond gaming, Large Geospatial Models will have widespread applications, ranging from spatial planning and design, logistics, audience engagement, and remote collaboration.

Today we have 10 million scanned locations around the world, and over 1 million of those are activated and available for use with our VPS service. We receive about 1 million fresh scans each week, each containing hundreds of discrete images.

As part of the VPS, we build classical 3D vision maps using structure from motion techniques - but also a new type of neural map for each place. These neural models, based on our research papers ACE (2023) and ACE Zero (2024) do not represent locations using classical 3D data structures anymore, but encode them implicitly in the learnable parameters of a neural network. These networks can swiftly compress thousands of mapping images into a lean, neural representation. Given a new query image, they offer precise positioning for that location with centimeter-level accuracy.

Niantic has trained more than 50 million neural nets to date, where multiple networks can contribute to a single location. All these networks combined comprise over 150 trillion parameters optimized using machine learning.

MicKey can handle even opposing shots that would take a human some effort to figure out. MicKey was trained on a tiny fraction of our data – data that we released to the academic community to encourage this type of research. MicKey is limited to two-view inputs and was trained on comparatively little data, but it still represents a proof of concept regarding the potential of an LGM. Evidently, to accomplish geospatial intelligence as outlined in this text, an immense influx of geospatial data is needed – a kind of data not many organizations have access to. Therefore, Niantic is in a unique position to lead the way in making a Large Geospatial Model a reality, supported by more than a million user-contributed scans of real-world places we receive per week.

Niantic announces “Large Geospatial Model” trained on Pokémon Go player data