
Why Geospatial Intelligence Resists General-Purpose AI

The most effective geospatial AI systems are not the largest or most general — they encode domain knowledge into their structure

EO-001

General-purpose machine learning treats location as just another feature in a table. But spatial data has properties that systematically violate the assumptions underlying most AI architectures — non-stationarity, autocorrelation, heterogeneous observation networks, and sensor-specific physics. The most effective geospatial AI systems are not the largest or most general. They are the ones that encode domain knowledge into their structure. This has implications for how intelligence layers over Earth observation data should be designed.

Why It Matters

The geospatial industry is being flooded with promises of AI that can do everything — classify any image, detect any change, predict any outcome. The reality is more nuanced. Models that work in one geography fail in another. Architectures trained on optical imagery cannot interpret SAR without fundamental redesign. The environmental variables that predict soil moisture in Indiana are not the same ones that predict PM2.5 in California. Understanding why general-purpose AI fails at spatial problems is the first step toward building systems that actually work.

There is a sentence in Michael Goodchild's foreword to the 2024 Handbook of Geospatial Artificial Intelligence that deserves more attention than it has received. Reflecting on the trajectory of geographic science from Newtonian mechanics to neural networks, he writes that the geographic world was "too complex for a set of mechanistic explanations." This was not a surrender. It was a recognition that the search for universal spatial laws — the dream of treating geography like physics — had produced useful approximations but could never produce a complete account.

Goodchild's foreword explicitly names Stan Openshaw's "geographical analysis machines" from the 1990s as the precursor to modern GeoAI — automated systems that let data drive model selection. Openshaw and Openshaw's 1997 book Artificial Intelligence in Geography (Wiley) was arguably two decades ahead of its time. The GeoAI Handbook is available from CRC Press.

The same recognition is now arriving in AI.

The Universality Problem

Modern machine learning is built on an assumption of transferability. Train a large enough model on enough data, and it will generalise. This works remarkably well for language, where a sentence in English follows roughly the same grammar whether it was written in London or Lagos. It works well for certain classes of image recognition, where a cat is a cat regardless of the camera that photographed it.

It does not work for the Earth.

The Earth's surface is non-stationary. What is true at one location is not true at another, not because of noise or insufficient data, but because the underlying processes are genuinely different. The relationship between precipitation and vegetation depends on soil type, elevation, latitude, land use history, and a dozen other factors that vary continuously across space. A model that learns this relationship in Iowa will make confident and wrong predictions in Senegal.

This is not hypothetical. The Cropland Data Layer (CDL), trained primarily on US agricultural landscapes, has been shown to misclassify crop types when applied to morphologically similar but spectrally distinct African farming systems. A 2022 study in Remote Sensing of Environment found that transfer learning from Sentinel-2 models trained in France to Burkina Faso degraded overall accuracy by 15–25% depending on crop type, despite identical sensor inputs.

This is not a data problem. It is a structural one. The technical term in geography is spatial heterogeneity — the recognition that statistical relationships vary across space. Geographically Weighted Regression was developed in the 1990s precisely because ordinary regression, which assumes a single global relationship, systematically failed when applied to spatial data. The same insight now applies to neural networks, but the machine learning community has been slower to absorb it.

GWR was developed by Fotheringham, Brunsdon, and Charlton in the 1990s and consolidated in their 2002 book Geographically Weighted Regression (Wiley). The technique fits separate regression models at each location, weighted by spatial proximity. It revealed that what appeared to be a single global relationship — e.g., between house prices and school quality — was actually dozens of local relationships, each with different coefficients and sometimes different signs.
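To make the mechanism concrete, here is a minimal sketch of the GWR idea in Python: an ordinary weighted least-squares fit at each location, with a Gaussian distance kernel deciding how much each neighbouring observation counts. The fixed bandwidth and variable names are illustrative; a real GWR implementation calibrates the bandwidth rather than hard-coding it.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit a separate weighted least-squares model at every location.

    coords    : (n, 2) array of observation coordinates
    X         : (n, p) design matrix (include a column of ones for the intercept)
    y         : (n,) response vector
    bandwidth : Gaussian kernel bandwidth, in the same units as coords
    Returns an (n, p) array of local coefficients, one row per location.
    """
    n, p = X.shape
    betas = np.empty((n, p))
    for i in range(n):
        # Distance from location i to every observation
        d = np.linalg.norm(coords - coords[i], axis=1)
        # Gaussian kernel: nearby observations get more weight
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        # Weighted least squares: beta_i = (X'WX)^-1 X'Wy
        XtW = X.T * w
        betas[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return betas
```

Mapping any single column of the returned coefficients across locations is what exposes the sign flips and coefficient drift that a single global regression hides.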

Chapter 8 of the GeoAI Handbook addresses this directly. Xie et al. examine how ignoring spatial heterogeneity not only degrades prediction accuracy but introduces systematic unfairness — models that perform well in data-rich regions and fail in data-poor ones. This is not a theoretical concern. It is the lived reality of anyone who has tried to apply a model trained on Sentinel-2 imagery of European farmland to smallholder agriculture in sub-Saharan Africa.

The US has approximately 1 air quality monitor per 250 km². Sub-Saharan Africa averages fewer than 1 per 50,000 km². Models trained on dense US monitoring networks produce high reported accuracy, but that accuracy is not transferable to regions where the training data is sparse or absent. See the WHO Global Air Quality Database for current station coverage: who.int/data/gho/data/themes/air-pollution.

What the Sensors Actually Produce

Before the AI even begins, there is a prior question that most machine learning pipelines ignore: what, exactly, is the data?

A Sentinel-2 image is not a photograph. It is a set of measurements of electromagnetic reflectance in thirteen spectral bands, captured by a pushbroom scanner moving at 7.5 kilometres per second, corrected for atmospheric interference using radiative transfer models, geometrically corrected to a map projection, and delivered as calibrated top-of-atmosphere or bottom-of-atmosphere reflectance values. Each of those processing steps encodes assumptions about the physics of light, the composition of the atmosphere, and the geometry of the Earth's surface.

The Sentinel-2 MSI pushbroom scanner images a 290 km swath with 12-bit radiometric resolution. The 13 spectral bands are not captured simultaneously — the focal plane assembly staggers the detector arrays, so bands are acquired at slightly different times as the satellite moves. This means a single "image" is actually 13 offset acquisitions stitched together. ESA's Level-1C processing corrects for this offset, but the correction depends on the accuracy of the satellite's attitude determination. See ESA's Sentinel-2 Technical Guide.

A Sentinel-1 SAR image is something else entirely. It is a measurement of microwave backscatter — the return signal from radar pulses transmitted by the satellite. The physics is different. The information content is different. The noise characteristics are different. A SAR image of a forest tells you about structure and moisture content. An optical image of the same forest tells you about chlorophyll concentration and canopy reflectance. They are not two views of the same thing. They are two measurements of fundamentally different physical properties.

The Copernicus Emergency Management Service (CEMS) routinely uses SAR for flood extent mapping precisely because optical imagery is useless under the cloud cover that accompanies flood events. During the 2021 European floods, Sentinel-1 SAR provided flood extent maps within 12 hours while Sentinel-2 optical imagery remained cloud-obscured for over a week. See CEMS activation EMSR517.

This matters because general-purpose AI architectures — convolutional neural networks, vision transformers, foundation models — treat their inputs as arrays of numbers. They do not know that band 4 of a Sentinel-2 image measures red reflectance while band 8 measures near-infrared. They do not know that the relationship between those two bands is governed by the physics of chlorophyll absorption. They do not know that a SAR backscatter value of −12 dB over water means something completely different from −12 dB over a parking lot.

The Normalized Difference Vegetation Index (NDVI) — calculated as (NIR − Red) / (NIR + Red) — exploits the fact that chlorophyll absorbs red light strongly but reflects near-infrared. This spectral signature is not a statistical correlation; it is a direct consequence of molecular physics. A general-purpose model can learn the correlation, but without encoding the mechanism, it will treat NDVI as an arbitrary feature rather than a physically grounded measurement.
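As a concrete illustration, here is a minimal NDVI computation from Sentinel-2 band 4 (red) and band 8 (near-infrared), assuming the two bands have already been loaded as same-shaped reflectance arrays (for example via rasterio); the epsilon guard is an illustrative detail, not part of the index definition.

```python
import numpy as np

def ndvi(red, nir, eps=1e-6):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    red, nir : same-shaped arrays of reflectance (Sentinel-2 B04 and B08)
    eps      : guards against division by zero over dark water or shadow pixels
    """
    red = red.astype("float32")
    nir = nir.astype("float32")
    return (nir - red) / (nir + red + eps)
```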

Domain-specific systems encode this knowledge. They build the physics into the architecture — not as a constraint that limits what the model can learn, but as a prior that helps it learn the right things faster and with less data.

Heinz von Foerster, writing about cybernetics and perception in the 1970s, made a point that applies here with uncomfortable precision. He argued that perception is not passive reception — the sensor does not simply record what is there. Perception is an active process of construction, shaped by the structure of the observer. What a system can see is determined by how it is built to look.

Von Foerster's argument in "On Constructing a Reality" (Chapter 8) goes further than most readers expect. He demonstrates that the nervous system is "closed" — it computes only with its own internal states, not with the external environment directly. What we call perception is a mapping from internal neural computations to a constructed model of the world. The sensor analogy is precise: a satellite sensor computes reflectance values from photon counts, not from "the landscape."

A general-purpose neural network looking at satellite imagery is like von Foerster's hypothetical observer with no framework for interpreting what it sees. It can find statistical patterns. It cannot understand what those patterns mean in physical terms. And when the patterns shift — because the geography changed, or the season changed, or the sensor changed — it has no basis for adaptation.

Federated Learning and the Locality of Knowledge

There is a deeper version of the heterogeneity problem that emerges when you try to train models across distributed sensor networks.

Consider a network of 144 weather stations spread across Indiana, each recording temperature, humidity, wind speed, soil moisture, and twenty other variables at fifteen-minute intervals. A standard approach would be to pool all the data, train a single model, and deploy it back to each station. This is what centralised machine learning does. It assumes that more data produces better models.

The Wabash Heartland Innovation Network (WHIN) is a real network funded by a $40M Lilly Endowment grant, operating across 10 counties in northwest Indiana. The stations record data at 15-minute intervals across 24 environmental variables including soil temperature at four depths, solar radiation, leaf wetness, and wind gusts. More at whin.org.

Recent work on federated learning for soil moisture prediction tells a different story. When lightweight convolutional neural networks (approximately 800 parameters) are trained locally at each station and only their model weights are shared — never the raw data — the federated system achieves a mean absolute error within one centibar of the centralised model. The local model with ten times fewer parameters nearly matches the global model that saw all the data.

For context: GPT-3 has 175 billion parameters. A typical ResNet-50 has 25 million. The lightweight CNN in the federated soil moisture study achieves competitive accuracy with 800 — roughly 220 million times fewer parameters than GPT-3. The model has 3 convolutional layers with 16–64 channels and uses a 10-day input window. The study tested both lightweight (~0.8k) and heavy (~9.4k) architectures; the lightweight version matched the heavy one with no statistically significant difference in accuracy.

This is not a failure of centralisation. It is evidence that environmental data has a fundamentally local character. The relationships between predictors and outcomes are not identical across a sensor network — they are shaped by local soil composition, microclimate, topography, and land use. A model that tries to learn one universal function across all stations is fighting the data's own structure.

The federated approach respects that structure. Each node learns its local relationships. The aggregation step combines those local models into a global understanding without erasing what makes each locality distinct. This is a principled response to the non-IID (non-independently and identically distributed) nature of spatial data — a property that the machine learning community often treats as a nuisance to be corrected, but which is in fact the most important signal in the dataset.

Non-IID data is the norm, not the exception, in environmental monitoring. The Zakzouk & Said study uses Dirichlet sampling with α values of 0.1, 0.5, and 1.0 to model different levels of data heterogeneity. At α = 0.1 (highly non-IID), individual stations see dramatically different data distributions — yet federated models still converge. The key insight: FedAvg aggregation works because it averages model weights, not data, preserving local learned relationships.
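The aggregation step itself is small. Below is a minimal sketch of FedAvg-style weight averaging, assuming each station exports its trained parameters as a dictionary of NumPy arrays; weighting by local sample count is the standard FedAvg recipe rather than a detail taken from the soil moisture study.

```python
import numpy as np

def fed_avg(local_weights, sample_counts):
    """Aggregate locally trained model parameters into a global model (FedAvg).

    local_weights : list of dicts, one per station, mapping layer name -> ndarray
    sample_counts : list of ints, number of local training samples per station
    Only parameters cross the network; raw sensor data never leaves the node.
    """
    total = float(sum(sample_counts))
    global_weights = {}
    for name in local_weights[0]:
        # Weighted average of each parameter tensor across stations
        global_weights[name] = sum(
            (n / total) * w[name] for w, n in zip(local_weights, sample_counts)
        )
    return global_weights
```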

The security implications are worth noting as well. When raw data never leaves the sensor node, the attack surface for data poisoning is fundamentally different. Work on Byzantine-resilient federated learning has demonstrated that even with 50% of nodes compromised — submitting deliberately poisoned model updates — systems that use cosine similarity filtering, multiplicity checking, and committee-based verification can maintain model accuracy above 85%, compared to near-complete failure for naive averaging approaches. The resilience comes not from a central authority but from the distributed structure itself.

The Lee & Kim (2022) system uses a hierarchical architecture: DAG-based local blockchains for shard-level model aggregation, plus a public main blockchain for global aggregation. At 50% malicious nodes, their system achieves 85.24% accuracy vs. ~10% for standard FedAvg. The key innovation is "multiplicity" — a mechanism that prevents the aggregation step from repeatedly selecting models that are similar to poisoned ones, even when those poisoned models pass cosine similarity checks. The system uses an Ethereum blockchain with Solidity smart contracts.
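The cosine-similarity check that such defences build on is easy to sketch. The version below scores each submitted update against the coordinate-wise median update and keeps only the well-aligned ones; it illustrates the basic filter, not the multiplicity or committee mechanisms of the Lee & Kim system, and the threshold value is arbitrary.

```python
import numpy as np

def filter_updates(updates, threshold=0.5):
    """Drop suspicious model updates before aggregation.

    updates   : (k, d) array, one flattened parameter-update vector per node
    threshold : minimum cosine similarity to the median update to be kept
    Returns the indices of updates considered trustworthy.
    """
    reference = np.median(updates, axis=0)   # robust reference direction
    ref_norm = np.linalg.norm(reference) + 1e-12
    keep = []
    for i, u in enumerate(updates):
        cos = u @ reference / (np.linalg.norm(u) * ref_norm + 1e-12)
        if cos >= threshold:                 # poisoned updates tend to point elsewhere
            keep.append(i)
    return keep
```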

Graph Architectures and Spatial Structure

A second line of evidence comes from air quality forecasting, where the spatial relationships between monitoring stations are not just background context but the primary mechanism of the phenomenon being modelled.

PM2.5 — particulate matter smaller than 2.5 microns — does not stay where it is generated. It moves. It is carried by wind, blocked by mountains, concentrated in valleys, and dispersed by turbulence. Predicting PM2.5 concentration at a given monitoring station requires knowing not just the local conditions at that station but the conditions at upwind stations, the fires burning within hundreds of kilometres, and the terrain between them.

Wildfire PM2.5 has been found to be more toxic than PM2.5 from other sources. A 2021 study by Aguilera et al. in Nature Communications found that wildfire-specific PM2.5 was associated with up to 10× greater respiratory health impacts than equivalent concentrations from ambient sources. Within California, additional PM2.5 emissions from extreme wildfires in the past 8 years have reversed nearly two decades of declining ambient pollution (Burke et al., 2023, Nature).

Graph neural networks are designed for exactly this kind of problem. They represent monitoring stations as nodes in a graph and learn how PM2.5 propagates between them — accounting for wind direction, distance, and elevation differences. A recent GNN-based forecasting system for California achieves a one-hour-ahead MAE of 5.23 μg/m³ across 112 stations, outperforming random forests, LSTMs, and multilayer perceptrons. The advantage is modest in aggregate metrics but dramatic for the cases that matter most: elevated PM2.5 events, where the GNN's spatial inductive bias allows it to capture propagation dynamics that purely temporal models miss.

The US Air Quality Index categorises PM2.5 above 150.5 μg/m³ as "very unhealthy" and above 250.5 μg/m³ as "hazardous." The GNN's MAE of 5.23 μg/m³ is significant because the 99th percentile of the observed test data is only 50.70 μg/m³ — meaning most of the time PM2.5 is low, and aggregate error metrics are dominated by easy predictions. The GNN's real advantage shows in the tail: elevated events where competing models systematically underpredict. The model uses 240 hours (10 days) of historical data to initialise each 48-hour forecast window.

What makes this architecture interesting is not just its accuracy. It is the way domain knowledge is embedded in the graph construction itself. The GNN only models PM2.5 transport between stations that are within 300 kilometres of each other and differ by less than 1,200 metres in elevation — because the physics of particulate transport does not support transmission across those boundaries. Fire radiative power from satellite observations is aggregated at each station using inverse distance and wind-direction weighting, encoding the physical reality that a fire directly upwind matters more than a fire of equal intensity at the same distance but crosswind.

The 1,200-metre elevation threshold encodes a specific physical assumption: that terrain barriers of this scale impede particulate transport. This is domain knowledge, not a learned parameter. In California, this corresponds to the major mountain range boundaries — the Sierra Nevada, the Coast Ranges — that empirically divide air basins. CARB (California Air Resources Board) defines 35 separate air basins along similar topographic boundaries: ww2.arb.ca.gov/resources/documents/maps-state-and-federal-area-designations.
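A sketch of how such a station graph could be assembled from coordinates and elevations, with the 300 km and 1,200 m limits written in as fixed physical constraints rather than tunable hyperparameters; the haversine helper and the quadratic loop are illustrative simplifications.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def build_edges(lats, lons, elev_m, max_dist_km=300.0, max_elev_diff_m=1200.0):
    """Connect stations i and j only if particulate transport between them is plausible."""
    n = len(lats)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = haversine_km(lats[i], lons[i], lats[j], lons[j])
            if dist <= max_dist_km and abs(elev_m[i] - elev_m[j]) <= max_elev_diff_m:
                edges.append((i, j))
    return edges
```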

These are not hyperparameters to be tuned. They are physical constraints that define the problem. A general-purpose model would have to discover them from data alone — and might never discover them if the training set does not contain enough variation in wind direction, elevation, and fire location to reveal the relationships.

The GNN framework also enables something that general models cannot do easily: counterfactual simulation. By injecting synthetic fire radiative power values into the graph — representing prescribed burns that have not yet happened — the system can forecast the air quality impact of hypothetical controlled burns at different times of year. This determined that March is the optimal month for prescribed fires in California, and that the short-term PM2.5 increase from controlled burns is dramatically outweighed by the avoided pollution from the wildfires those burns prevent.

The experiment transposed a window of 10 actual prescribed fires (>100 acres each, from January 2021) across all months of 2021. August — peak wildfire season — produced mean PM2.5 29.61% higher than other months' average, with maximum concentrations 44.27% higher. The Caldor Fire experiment (Experiment 2) was even more striking: simulated spring prescribed burns increased mean PM2.5 by only 0.31 μg/m³, while removing the Caldor Fire's influence reduced maximum PM2.5 by 54.32% and cut unhealthy air quality days from 3.54 to 0.80 per station.

This kind of simulation requires a model that understands spatial relationships, not just temporal patterns. It requires architecture that knows what wind direction means, what distance means, what a fire is. General-purpose models can approximate these relationships given enough data. Domain-specific models can reason about them.

The key technical mechanism is the WIDW (Wind and Inverse-Distance Weighted) FRP aggregation from Equation 1 in Liao et al.: each fire's contribution to a station's PM2.5 is weighted by its radiative power, the wind speed at the fire, the cosine of the angle between wind direction and the fire-to-station direction, and the inverse square of distance. This is essentially an atmospheric dispersion model embedded as a feature engineering step. A general-purpose model would need to learn all four of these physical relationships from scratch.
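Read literally, that weighting might be sketched as follows. This follows the prose description rather than reproducing Equation 1 from Liao et al., and the decision to clamp the directional term at zero (so that fires blowing away from the station contribute nothing) is an assumption of the sketch.

```python
import numpy as np

def widw_frp(station_xy, fire_xy, frp, wind_speed, wind_dir_xy, eps=1e-6):
    """Aggregate fire radiative power at one station with wind and inverse-distance weighting.

    station_xy  : (2,) station position
    fire_xy     : (k, 2) fire positions
    frp         : (k,) fire radiative power values
    wind_speed  : (k,) wind speed at each fire
    wind_dir_xy : (k, 2) unit vectors of wind direction at each fire
    """
    to_station = station_xy - fire_xy                    # fire -> station vectors
    dist = np.linalg.norm(to_station, axis=1) + eps
    unit = to_station / dist[:, None]
    # Cosine of the angle between wind direction and the fire-to-station direction
    alignment = np.clip(np.sum(wind_dir_xy * unit, axis=1), 0.0, None)
    weights = wind_speed * alignment / dist ** 2         # wind- and inverse-distance weighting
    return float(np.sum(frp * weights))
```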

The Observer Constructs the Observation

There is a philosophical thread running beneath all of this that connects to how we think about intelligence in Earth observation systems.

Von Foerster's central insight in second-order cybernetics was that the observer is not separate from the observation. The act of measurement is an act of construction. The sensor does not passively record reality — it creates a specific representation of reality, shaped by its spectral range, its spatial resolution, its revisit frequency, its noise characteristics, and the processing chain that transforms raw signals into data products.

Von Foerster's "trivial machine" vs. "non-trivial machine" distinction (Chapter 7) maps directly to the difference between lookup-table classifiers and adaptive models. A trivial machine always gives the same output for the same input. A non-trivial machine's output depends on its internal state — its history. The Earth is non-trivial: the same reflectance measurement over the same field means something different in March than in August. Models that ignore temporal context are treating a non-trivial system as trivial.

This means that the choice of sensor predetermines what phenomena are visible. A multispectral imager can see chlorophyll but not soil structure. A SAR instrument can see soil moisture but not plant health. A thermal sensor can see surface temperature but not subsurface conditions. No single sensor captures "the truth." Each captures a projection of reality through a specific observational lens.

The same principle applies to AI models. A convolutional neural network sees spatial patterns at the scale of its receptive field. A recurrent network sees temporal sequences up to the length of its memory. A graph neural network sees relationships between connected nodes. The architecture determines what the model can perceive, just as the sensor determines what the instrument can measure.

This is why domain-specific AI matters. Not because general-purpose models are bad, but because the choice of architecture is itself an observational decision. When you build a model to analyse satellite imagery, you are choosing what aspects of the data the model can attend to and what it will be structurally blind to. Making that choice with domain knowledge — encoding the physics of the sensor, the geography of the phenomenon, the structure of the observation network — is not constraining the model. It is giving it eyes that can actually see.

The Earth is the ultimate non-trivial machine in von Foerster's sense. Its response to a stimulus depends on everything that has happened before, at that location and at every connected location. General-purpose AI tries to trivialise this complexity by treating each observation as independent. Domain-specific AI respects the non-triviality.

What This Means for Intelligence Layers

The convergence of these lines of evidence — federated learning that respects data locality, graph architectures that encode spatial structure, domain-specific feature engineering that reflects sensor physics — points toward a particular design philosophy for geospatial intelligence systems.

That philosophy has three pillars.

First, models must be spatially aware. Not as an afterthought — not by adding coordinates as input features to a generic architecture — but structurally. The relationships between locations, the heterogeneity of processes across space, the physics of how signals propagate through the environment, should be encoded in the model's bones.

Second, training must respect data sovereignty. Environmental data is collected by diverse institutions, across jurisdictions, under different regulatory frameworks. A system that requires all data to be centralised before intelligence can be extracted is not just technically suboptimal — it is practically impossible in many of the contexts where geospatial intelligence matters most. Federated and distributed approaches are not compromises. They are the architecture that matches the data's natural structure.

Third, intelligence must be personalised to context. A model trained on temperate forests should not be blindly applied to boreal forests. A change detection algorithm calibrated for urban expansion should not be repurposed for glacial retreat without adaptation. The most useful intelligence layer is one that can be tuned to a specific geography, a specific sensor, a specific application — while still drawing on the broader patterns learned across all contexts.

These are not novel observations. The geospatial research community has been articulating them for years, and the GeoAI Handbook captures the state of the art as of 2023. What remains underbuilt is the infrastructure that would make them operational at scale — systems that federate model training across sensor networks, that encode domain knowledge into production architectures, that personalise predictions to local context without starting from scratch.

That infrastructure is the next frontier. Not bigger models. Better-situated ones.

The GeoAI Handbook's Chapter 22 ("Forward Thinking on GeoAI" by Shawn Newsam) and the research questions posed in Chapter 1 both converge on this point. Key open question from Gao et al.: "What kinds of datasets and procedures are required to train a large geospatial foundation model, and how is it different from general foundation models?" The answer implied by the evidence is: fundamentally different, because spatial heterogeneity, sensor physics, and observational grammar are not nuances a general model can absorb — they are structural properties that must be encoded.
