
Data Provenance

Hard to argue with this one: knowing where your data came from is simply good hygiene.


Data provenance is the complete, verifiable record of where a piece of data came from, every transformation it underwent, and who or what performed those transformations. In satellite imagery and remote sensing, provenance is not a nice-to-have audit trail — it is the difference between evidence and hearsay.

Why It Matters

Every decision made from satellite data, whether it concerns wildfire response, insurance claims, agricultural policy, or military intelligence, rests on the assumption that the data is what it claims to be.

Provenance is how you verify that assumption. Without it, you are trusting a file on a server, the way a piñata trusts the birthday party with a bat.

There is a moment in the life of every dataset where it stops being a measurement and starts being a story someone tells about a measurement.

A satellite passes over a stretch of coastline. Its sensor captures electromagnetic reflectance: photons bouncing off water, sand, vegetation, concrete. That light is converted into numbers, and those numbers are a direct physical record. At the instant of capture, the data is as close to ground truth as we get from 700 km up in low Earth orbit.

Then it begins its journey.

The raw data is downlinked to a ground station. It gets ingested into a processing pipeline. Radiometric calibration converts sensor readings to physical units, and atmospheric corrections account for the interference of the air between sensor and surface. The image is geometrically corrected so that the pixels line up with actual coordinates on Earth. It might be resampled, reprojected, clipped to a region of interest, fused with data from another sensor, run through a classification algorithm, and then delivered as a product to someone who will make a decision based on what it shows.

Each of those steps is a transformation. Each transformation changes what the data is. And at the end of that pipeline, the person staring at the final product — the analyst, the insurer, the emergency responder — has no way of knowing what happened between the satellite and their screen unless someone kept a record.

That record is provenance.
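The pipeline described above can be written down as an ordered list of transformation steps, each pinned to exact inputs and outputs by hash. A minimal sketch in Python; the operation names, parameters, software versions, and placeholder hashes are all invented for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_file(path: str) -> str:
    """Hash a file so each step's input and output can be pinned down."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_step(steps: list, operation: str, params: dict,
                software: str, in_hash: str, out_hash: str) -> None:
    """Append one transformation to the lineage, in order."""
    steps.append({
        "operation": operation,        # e.g. "radiometric_correction"
        "parameters": params,          # exact values, not prose
        "software": software,          # name + version
        "input_sha256": in_hash,
        "output_sha256": out_hash,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Hypothetical pipeline mirroring the steps in the text
# ("corrlib" and the hash strings are placeholders):
steps: list = []
record_step(steps, "radiometric_correction", {"model": "6SV", "aot": 0.12},
            "corrlib 2.4.1", "ab12...", "cd34...")
record_step(steps, "geometric_correction", {"crs": "EPSG:32633"},
            "gdalwarp 3.8.0", "cd34...", "ef56...")
print(json.dumps(steps, indent=2))
```

Note how each step's input hash matches the previous step's output hash: that is what turns a pile of log lines into a chain.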

The Problem With Trust

Provenance is an old concept. Archivists and art historians have been tracing the chain of custody of documents and paintings for centuries. The word itself comes from the French provenir — to come from. When a museum acquires a painting, it wants to know every hand that held it, every wall it hung on, every auction it passed through. Not because those facts change the painting, but because they change how much you can trust the painting is what it claims to be.

Data works the same way. The difference is scale.

A single Sentinel-2 satellite generates around 1.6 terabytes of data per day. Landsat, MODIS, SAR constellations, commercial providers like Planet and Maxar — the combined output of the global Earth observation infrastructure is staggering. Every one of those datasets passes through processing chains, gets derived into products, gets shared across institutions, and ends up informing real decisions about real places.

ESA's Copernicus Open Access Hub has distributed over 30 petabytes of Sentinel data since 2014. The archive grows by roughly 12 TB/day across the full Sentinel constellation. See Copernicus Data Space Ecosystem for current access and archive statistics.

Now ask yourself: at the end of that chain, how do you know what you're looking at?

If someone hands you a GeoTIFF and says it represents the normalised difference vegetation index for a particular region on a particular date, you are taking their word for it. You can check if the file looks reasonable. You can compare it against your expectations. But you cannot verify from the file itself that the atmospheric correction was applied correctly, that the cloud mask didn't clip valid data, that the resampling method didn't introduce artifacts, or that the coordinate reference system is what the metadata claims.

You are trusting the pipeline. And in most of the geospatial world today, that trust is implicit.

What Provenance Actually Tracks

A proper provenance record for a geospatial dataset answers a series of basic questions. They seem obvious, but the fact that most processing systems do not answer them reliably is the whole problem.

Where did the input data come from? Not just "Sentinel-2" — which specific granule, from which orbit, captured at what time, downloaded from which archive, with what processing level? If multiple sources were fused, what were all of them?

What transformations were applied? Every reprojection, resampling, correction, classification, masking, clipping, and fusion operation. In what order. With what parameters. Using what software version.

When did processing occur? Timestamps matter because the same algorithm run on different dates might use different calibration coefficients, different ancillary data, or different model weights.

Who or what performed the processing? A human analyst? An automated pipeline? Which version of which code? On what infrastructure?

What was the output? A hash of the final product, so that any subsequent alteration — even a single flipped bit — can be detected.

Taken together, these answers form a chain. That chain is only as strong as its weakest link, which is why partial provenance — recording some steps but not others — provides a false sense of security that may be worse than no provenance at all.
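The five questions above map naturally onto a record structure. A minimal sketch, with illustrative field names; the granule ID, pipeline name, and hash are invented placeholders, not real products:

```python
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    """One link in the chain: each field answers one of the
    five questions above. All values below are hypothetical."""
    inputs: list           # where did the input data come from?
    transformations: list  # what was applied, in order, with parameters
    processed_at: str      # when did processing occur (ISO 8601, UTC)?
    agent: str             # who or what: pipeline name + code version
    output_sha256: str     # what was the output (hash of the product)?

rec = ProvenanceRecord(
    inputs=["S2B_MSIL1C_20260115_T33UXP"],  # invented granule ID
    transformations=[{"op": "cloud_mask", "algorithm": "s2cloudless 1.7.2"}],
    processed_at="2026-01-15T08:42:00+00:00",
    agent="ndvi-pipeline v0.3.1",
    output_sha256="9f8a...",
)
print(asdict(rec))
```

A record like this is only one link; a full chain is a sequence of them, each one's `inputs` referencing the previous one's output hash.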

Why This Is Hard

If the problem were simply record-keeping, it would have been solved decades ago. The reason provenance remains a persistent gap in geospatial data infrastructure has less to do with technology and more to do with how the ecosystem evolved.

Remote sensing grew up in institutions. Space agencies built their own processing pipelines, developed their own formats, maintained their own archives. Each pipeline was internally consistent, but interoperability between pipelines was never a design priority. When ESA processes a Sentinel-2 granule and NASA processes a Landsat scene, the provenance metadata they generate is structured differently, stored differently, and in some cases captures different information entirely.

The commercial sector added another layer. Private satellite operators process data through proprietary pipelines where the transformation steps are trade secrets. The customer receives a clean product with limited metadata about what happened inside the black box. This is not malicious — it is standard practice in an industry where processing algorithms are competitive advantages. But it means that downstream consumers are, by design, unable to verify the chain.

Planet Labs' SkySat and SuperDove products, for example, ship with minimal processing lineage. Maxar's ARD pipeline documents output specifications but not intermediate correction steps. See Planet Products and Maxar ARD for current documentation.

And then there is the integration problem. Most real-world applications do not use a single dataset. Flood mapping might combine SAR imagery, optical imagery, terrain models, and hydrological data. Each of those inputs has its own provenance chain. The moment you fuse them, you need a provenance system that can represent not just linear chains but branching, merging graphs of transformation. Very few systems do this well.

OpenLineage, an open-source project now under the Linux Foundation, is one of the few operational systems that natively represents provenance as a DAG across heterogeneous processing jobs. It was designed for data engineering pipelines, not geospatial specifically, but the graph model maps directly. See openlineage.io.
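The branching-and-merging structure is easy to see in code. A sketch of a provenance DAG for the flood-mapping example, with invented dataset names; walking the graph backwards recovers every original input behind a fused product:

```python
from collections import defaultdict

# product -> the inputs it was derived from, with the operation used
parents = defaultdict(list)

def derive(output: str, inputs: list, operation: str) -> None:
    for i in inputs:
        parents[output].append((i, operation))

# Two independent chains merge at the fusion step:
derive("sar_geocoded", ["sar_raw"], "terrain_correction")
derive("optical_boa",  ["optical_toa"], "atmospheric_correction")
derive("flood_map",    ["sar_geocoded", "optical_boa", "dem"], "fusion")

def lineage(product: str) -> set:
    """Walk the DAG back to every original (raw) input."""
    roots, stack = set(), [product]
    while stack:
        node = stack.pop()
        if node not in parents:   # no recorded parents: a raw input
            roots.add(node)
        else:
            stack.extend(p for p, _ in parents[node])
    return roots

print(lineage("flood_map"))  # traces back through both branches to the DEM and raw scenes
```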

The result is an industry where the most consequential data products — the ones that inform disaster response, agricultural subsidies, carbon credit verification, military planning — often have the weakest provenance records.

The Difference Between a Claim and a Proof

There is a useful distinction to draw here. Most provenance systems that exist today produce claims. They generate metadata that asserts what happened to a dataset. Those claims might be accurate. They might even be detailed and well-structured. But they are still claims — statements made by the system about itself, which could in principle be altered, fabricated, or simply wrong.

A claim says: "This dataset was atmospherically corrected using LaSRC v3.2 on January 15th, 2026."

A proof says: "Here is a cryptographic signature, generated by hardware that cannot be tampered with, confirming that this specific transformation code was applied to this specific input data at this specific time, and the hardware environment was verified to be unmodified."

That distinction — between provenance-as-metadata and provenance-as-proof — is where the field is heading. Cryptographic techniques, hardware security modules, and reproducible processing environments are beginning to make it possible to generate provenance records that are not just detailed but verifiable. Not "we wrote down what happened" but "here is mathematical evidence of what happened, and you can check it yourself."
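The mechanics of the distinction can be sketched with standard-library cryptography. HMAC is used here purely as a stand-in for the hardware-backed asymmetric signature a real system would use (an HSM or TEE attestation key); the record contents are hypothetical:

```python
import hashlib
import hmac
import json

# Stand-in for a key held inside tamper-resistant hardware:
SIGNING_KEY = b"demo-key-held-by-trusted-hardware"

claim = {
    "operation": "atmospheric_correction",
    "algorithm": "LaSRC v3.2",
    "input_sha256": "ab12...",
    "output_sha256": "cd34...",
    "timestamp": "2026-01-15T08:42:00+00:00",
}

def sign(record: dict) -> str:
    """Canonicalise the record and sign it."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

def verify(record: dict, signature: str) -> bool:
    return hmac.compare_digest(sign(record), signature)

sig = sign(claim)
assert verify(claim, sig)           # the untouched record verifies
claim["algorithm"] = "LaSRC v3.1"   # a quiet after-the-fact edit...
assert not verify(claim, sig)       # ...breaks the signature
```

The unsigned metadata dictionary is the claim; the signature, bound to the canonicalised record, is what moves it toward proof.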

This matters most in contexts where the stakes are high and the trust is low. Defence and intelligence applications, where data might be deliberately manipulated. Insurance and finance, where the incentive to misrepresent conditions is real. Carbon credit markets, where the entire value proposition depends on whether the satellite-derived measurement is accurate. Climate monitoring, where policy decisions rest on long-term data integrity.

In these contexts, a claim is not enough. You need a proof.

Standards and the State of Play

The geospatial community has not been asleep on this problem. Several standards and frameworks have emerged to address aspects of data provenance.

The W3C PROV data model provides a general-purpose framework for representing provenance information — entities, activities, and agents — that has been adapted for scientific data workflows. The Open Geospatial Consortium (OGC) has worked on provenance specifications through its various working groups. STAC (SpatioTemporal Asset Catalog) has become a de facto standard for cataloguing geospatial assets, and its extension mechanism allows provenance metadata to be attached, though the depth and consistency of that metadata varies wildly across implementations.
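As a sketch of how the W3C PROV model's three core concepts fit together, here is a minimal PROV-JSON-style document built as a Python dictionary. The identifiers and namespace are invented, and the structure follows my understanding of the PROV-JSON serialisation (entities, activities, agents, plus relation records linking them):

```python
import json

prov = {
    "prefix": {"ex": "https://example.org/"},
    "entity": {
        "ex:granule_l1c": {"prov:label": "Sentinel-2 L1C granule"},
        "ex:ndvi_product": {"prov:label": "derived NDVI GeoTIFF"},
    },
    "activity": {
        "ex:compute_ndvi": {"prov:startTime": "2026-01-15T08:42:00Z"},
    },
    "agent": {
        "ex:ndvi_pipeline": {"prov:type": "prov:SoftwareAgent"},
    },
    # Relations: the activity used the input entity...
    "used": {
        "_:u1": {"prov:activity": "ex:compute_ndvi",
                 "prov:entity": "ex:granule_l1c"},
    },
    # ...generated the output entity...
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:ndvi_product",
                 "prov:activity": "ex:compute_ndvi"},
    },
    # ...and was carried out by the software agent.
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:compute_ndvi",
                 "prov:agent": "ex:ndvi_pipeline"},
    },
}
print(json.dumps(prov, indent=2))
```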

ISO 19115, the international standard for geographic metadata, includes provisions for lineage information — a record of the processing steps applied to a dataset. In practice, lineage fields are often either empty or filled with boilerplate text that describes the general type of processing without the specificity needed to reproduce or verify it.

A 2021 survey of Landsat Collection 2 products found that while all products included ISO 19115-compliant metadata, the lineage fields typically contained templated descriptions like "radiometric and geometric corrections applied" rather than the specific algorithm versions, parameter values, or processing timestamps that would be needed to reproduce the result. The standard permits depth; the practice rarely delivers it.

The gap is not in the standards themselves. It is in adoption, enforcement, and the tooling that would make comprehensive provenance tracking a default rather than an afterthought.

What Good Provenance Looks Like

A well-provenanced dataset is one where an independent party — with no prior relationship to the data producer — can verify the complete chain from raw observation to final product.

This means the provenance record is machine-readable, not buried in a PDF report that accompanies the dataset. It means every transformation step is recorded with enough specificity to be reproduced. It means the record itself is immutable — once written, it cannot be quietly edited. And ideally, it means the record is cryptographically bound to the data it describes, so that you cannot separate the provenance from the product.

In practical terms, this looks like a processing receipt that travels with the data. Open it, and you can trace every input, every operation, every intermediate product, all the way back to the original sensor reading. If someone altered the data after processing, the receipt's signature would break and you would know.
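The break-on-tamper property reduces to a simple check: the receipt binds a hash of the product, and any alteration of the bytes fails verification. A minimal sketch, with a placeholder byte string standing in for the delivered file:

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# A stripped-down "receipt": in practice this would also carry the
# full lineage and a signature, as described above.
product = b"\x00\x01\x02 stand-in for the delivered GeoTIFF bytes"
receipt = {"output_sha256": sha256(product)}

def check(data: bytes, receipt: dict) -> bool:
    """Does the data in hand match what the receipt describes?"""
    return sha256(data) == receipt["output_sha256"]

assert check(product, receipt)       # intact product passes
tampered = product[:-1] + b"\xff"    # a single altered byte
assert not check(tampered, receipt)  # detected immediately
```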

This is not science fiction. The cryptographic primitives exist. The challenge is building processing systems that generate these records natively, rather than bolting provenance on as an afterthought.

Provenance and Reproducibility

There is a natural connection between provenance and scientific reproducibility. If your provenance record is detailed enough, someone else should be able to take the same inputs, apply the same transformations, and arrive at the same outputs. If they cannot, something in the chain was not recorded — or something in the chain was not what it claimed to be.

This is where provenance meets open science. The push for reproducible research in remote sensing — driven by the same concerns about methodological transparency that have swept through other scientific disciplines — is fundamentally a push for better provenance. Every time a paper describes a processing methodology in prose rather than publishing the actual pipeline, an opportunity for verification is lost.

The ideal is a world where every geospatial data product carries enough provenance information that anyone, anywhere, can independently verify or reproduce the result. We are not there yet. But the trajectory is clear, and the tools are arriving faster than the culture is shifting to use them.
