Anthropogenic methane emissions are the main contributor to the rise of atmospheric methane [zhang2022anthropogenic], and mitigating methane emissions is widely recognized as crucial for slowing global warming and achieving the goals of the Paris Agreement [agreement2015paris]. Multiple satellites are in orbit or launching soon which will measure methane emissions from the surface using top-down approaches, but in order to attribute these emissions to specific sources on the ground, a comprehensive database of methane emitting infrastructure is necessary [jacob2022quantifying]. Although several public databases of this infrastructure exist, the data available globally is incomplete, erroneous, and unaggregated.
AI approaches on Earth observation data have the potential to fill this gap. Several recent works have developed deep learning models to automatically interpret remotely sensed imagery and deployed them at scale to map infrastructure [yu2018deepsolar, lee2021scalable, kruitwagen2021global, sirko2021continental]. Methods for mapping methane source infrastructure have been emerging as well, including well pads in the Denver basin [dileepautomated], oil refineries and concentrated animal feeding operations in the U.S. [sheng2020ognet, handan2019deep], and wastewater treatment plants in Germany [li2022leveraging]. Each of these works depended on the curation of large, labeled datasets to develop the machine learning models, but there is a lack of publicly available, labeled Earth observation data, specifically on methane emitting infrastructure, which prevents researchers and practitioners from building automated mapping approaches.
In this work, we construct a multi-sensor Earth observation dataset for methane source infrastructure identification called METER-ML. In support of a new initiative to build a global database of methane emitting infrastructure called the MEthane Tracking Emissions Reference (METER) [meter], we develop METER-ML to allow the machine learning community to experiment with multi-view/multi-modal modeling approaches to automatically identify this infrastructure in remotely sensed imagery. METER-ML includes georeferenced imagery from three remotely sensed image products, specifically 19 spectral bands in total from NAIP, Sentinel-1, and Sentinel-2, capturing 51,739 sources of methane from six different classes as well as 34,886 negative examples (Figure 1). The dataset includes expert-reviewed validation and test sets for robustly evaluating the performance of derived models. Using the dataset, we experiment with a variety of convolutional neural network models which leverage different spatial resolutions, spatial footprints, image products, and spectral bands. The dataset is freely available (https://stanfordmlgroup.github.io/projects/meter-ml) in order to encourage further work on developing and validating methane source mapping approaches.
2.1 Methane source locations
We collect locations of methane emitting infrastructure in the U.S. from a variety of public datasets. We focus on the U.S. in this study due to the high availability of publicly accessible infrastructure data and remotely sensed imagery. The infrastructure categories we include are concentrated animal feeding operations (CAFOs), coal mines (Mines), landfills (Landfills), natural gas processing plants (Proc Plants), oil refineries and petroleum terminals (including crude oil and liquified natural gas terminals), and wastewater treatment plants (WWTPs). We group oil refineries and petroleum terminals together due to their high similarity in appearance, and refer to that category as “Refineries & Terminals” (R&Ts). These infrastructure categories were chosen based on their potential for emitting methane along with their consistent, visible differentiating features which make them feasible to identify in high resolution remotely sensed imagery.
The locations are obtained from 18 different publicly available datasets, all of which have licenses that allow for redistribution (see Table 6 in the Appendix). As various datasets may contain the same locations of infrastructure, we deduplicate by considering locations within 500m of each other identical. In total we include 51,739 unique locations of methane source infrastructure in the dataset, which we refer to as positive examples.
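The deduplication step can be illustrated with a minimal sketch. The text only states that locations within 500m of each other are treated as identical; the greedy keep-first strategy, the haversine distance, and the function names below are assumptions made for illustration, not the procedure actually used to build the dataset.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def deduplicate(locations, radius_m=500.0):
    """Greedily keep a location only if it is farther than radius_m from every kept one."""
    kept = []
    for lat, lon in locations:
        if all(haversine_m(lat, lon, klat, klon) > radius_m for klat, klon in kept):
            kept.append((lat, lon))
    return kept


# Made-up coordinates, used only to exercise the function.
points = [
    (37.4275, -122.1697),
    (37.4280, -122.1700),  # roughly 60 m from the first point
    (37.5000, -122.1697),  # roughly 8 km north of the first point
]
unique_points = deduplicate(points)
```

This pairwise comparison is quadratic in the number of locations; for tens of thousands of points a spatial index would be the more practical choice.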
|Category||Train (%)||Valid (%)||Test (%)||Total|
|CAFOs||24957 (29.3%)||47 (9.1%)||92 (9.0%)||25096|
|Landfills||4088 (4.8%)||46 (8.9%)||111 (10.9%)||4245|
|Coal Mines||1777 (2.1%)||40 (7.8%)||72 (7.1%)||1889|
|Proc Plants||1901 (2.2%)||38 (7.4%)||107 (10.5%)||2046|
|R&Ts||4016 (4.7%)||59 (11.5%)||108 (10.6%)||4183|
|WWTPs||14522 (17.1%)||46 (8.9%)||130 (12.8%)||14698|
|Negatives||34211 (40.2%)||249 (48.3%)||426 (41.8%)||34886|
2.2 Negative locations
We additionally include a variety of images in the dataset which capture none of the six methane emitting facilities. To do this, we define around 50 classes (see Appendix) of different facilities and landscapes and select characteristic examples of each class. Then we collect locations containing similar facilities and landscapes using the Descartes Labs GeoVisual Search [keisler2019visual], providing up to 1000 similar locations per example. A sample of the similar locations was manually vetted in each case to ensure that none of the locations obtained actually corresponded to the six methane source categories. In total we include 34,886 locations of facilities and landscapes which are not any of the six infrastructure categories, and refer to these as negative examples. The counts and proportions of the positive and negative classes in the dataset are shown in Table 1.
2.3 Remotely sensed imagery
We pair all of the locations in the dataset with three publicly available remotely sensed image sources. Specifically we include aerial imagery from the USDA National Agriculture Imagery Program (NAIP) as well as satellite imagery captured by Sentinel-1 (S1) and Sentinel-2 (S2). NAIP imagery covers the contiguous U.S. and S1 and S2 imagery both have global coverage. For NAIP we use 1m resolution imagery, for Sentinel-2 we use the L1C product at 10m resolution, and for Sentinel-1 we use the Sigma Nought Backscatter product at 10m resolution. We use all spectral bands from each product. Specifically, we use the three visible (RGB) and single near-infrared (NIR) bands from NAIP and S2, the single coastal aerosol (CA) band, four red-edge (RE1-4) bands, single water vapor (WV) band, single cirrus (C) band, and the two shortwave infrared (SWIR1-2) bands from S2, and the V-transmit (VH and VV) bands from S1. We include S1 and S2 in the dataset in order to enable experimenting with coarser resolution satellite imagery which is globally available, unlike NAIP. The details of each imagery product and band are shown in Table 2.
In order to construct images containing each location in the dataset, we consider a 720m x 720m footprint centered around the location. This footprint was chosen to balance the size of the images with the contextual information, but we investigate this choice in the experiments. Because the coordinates in the publicly available datasets are noisy, centering the imagery on the reported locations increases the likelihood that the facilities are captured while still leaving natural variation in where the facilities appear within the images. For each image product, we construct a mosaic of the most recently captured pixels in a time range: 2017 to 2021 for NAIP, and May to September 2021 for Sentinel-1 and Sentinel-2, with Sentinel-2 images selected based on lowest cloud cover. We use the Descartes Labs platform to download all of the imagery [dl].
The total dataset contains 86,625 images capturing 19 spectral bands across the three imagery products. Information about the remotely sensed image products and bands included in the dataset is provided in Table 2, and characteristic examples for each methane source category are shown in Figure 2 in the Appendix.
|Product||Bands||Image Size||Resolution|
|NAIP||RGB & NIR||720x720||1m|
|Sentinel-2||RGB & NIR||72x72||10m|
|Sentinel-2||RE1-4 & SWIR1-2||36x36||20m|
|Sentinel-2||CA & WV & C||12x12||60m|
|Sentinel-1||VH & VV||72x72||10m|
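The image sizes in Table 2 follow from dividing the 720m footprint by the native resolution of each band group. The short sketch below reproduces that arithmetic; the dictionary layout and the function name are hypothetical and chosen only for illustration.

```python
FOOTPRINT_M = 720  # footprint side length stated in the text

# Native resolution in meters for each band group, as listed in Table 2.
BAND_GROUPS = {
    ("NAIP", "RGB & NIR"): 1,
    ("Sentinel-2", "RGB & NIR"): 10,
    ("Sentinel-2", "RE1-4 & SWIR1-2"): 20,
    ("Sentinel-2", "CA & WV & C"): 60,
    ("Sentinel-1", "VH & VV"): 10,
}


def image_side_px(footprint_m, resolution_m):
    """Number of pixels along one side of a square image covering the footprint."""
    if footprint_m % resolution_m != 0:
        raise ValueError("footprint is not a whole number of pixels at this resolution")
    return footprint_m // resolution_m


sizes = {group: image_side_px(FOOTPRINT_M, res) for group, res in BAND_GROUPS.items()}
for (product, bands), side in sizes.items():
    print(f"{product} {bands}: {side}x{side}")
```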
2.4 Validation and test sets
Two Stanford University postdoctoral researchers with expertise in methane emissions and related infrastructure individually reviewed 1,534 examples to compose the held-out validation and test sets. To determine which examples to include in these held-out sets, we randomly sampled 150 images from each of the six positive classes as well as a random sample of 34 images which have multiple labels, constituting 934 positive examples according to the original public dataset labels. We additionally sampled 12 images from each of the 50 negative categories resulting in 600 negative examples. The experts both manually reviewed these examples and identified the presence or absence of the six methane source categories by using a combination of NAIP imagery as well as Google Maps imagery, which often had finer spatial resolution as well as place names. The facility had to be captured by the NAIP image for the corresponding label to be assigned. If the expert identified no clearly visible methane source categories in the image, the example was labeled “negative”, and if the expert was uncertain about any label, the example was labeled “uncertain”. The two labels per example were then resolved as follows:
If the experts agreed and neither was uncertain, the agreed upon label was taken as the final label.
If the experts disagreed, and one was uncertain but the other was not, the label from the expert who was certain was taken as the final label.
If the experts disagreed, but one agreed with the original label, the original label was taken as the final label.
In all other scenarios, the example was reviewed jointly by the experts and a final label was assigned.
Only 76 examples out of the 1,534 went to another round of review. The resulting datasets have 859 positive examples and 675 negative examples. We split the 1,534 examples into 515 for the validation set and 1,019 for the test set. The label counts on the validation and test sets are shown in Table 1.
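The four resolution rules can be sketched as a small function. The representation of a label as a frozenset of category names (empty for "negative") and the sentinel strings are assumptions of this sketch, as is the choice to send the case where both experts are uncertain to joint review.

```python
UNCERTAIN = "uncertain"        # sentinel for an expert who was unsure
JOINT_REVIEW = "joint review"  # sentinel meaning the experts must review together


def resolve(label_a, label_b, original):
    """Resolve two expert labels against the original public dataset label.

    Each label is either UNCERTAIN or a frozenset of category names,
    with the empty frozenset standing for a negative example.
    """
    a_unc = label_a == UNCERTAIN
    b_unc = label_b == UNCERTAIN
    # Rule 1: both experts are certain and agree.
    if not a_unc and not b_unc and label_a == label_b:
        return label_a
    # Rule 2: exactly one expert is uncertain, so take the certain expert's label.
    if a_unc != b_unc:
        return label_b if a_unc else label_a
    # Rule 3: both are certain but disagree, and one matches the original label.
    if not a_unc and not b_unc and original in (label_a, label_b):
        return original
    # Rule 4: everything else goes to joint review.
    return JOINT_REVIEW


cafo = frozenset({"CAFOs"})
mine = frozenset({"Mines"})
wwtp = frozenset({"WWTPs"})
landfill = frozenset({"Landfills"})
negative = frozenset()

agreed = resolve(cafo, cafo, cafo)
one_uncertain = resolve(UNCERTAIN, wwtp, landfill)
matches_original = resolve(negative, mine, mine)
needs_review = resolve(landfill, mine, wwtp)
```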
We run a variety of multi-label classification experiments on the curated dataset. In all of our experiments, we use a DenseNet-121 convolutional neural network architecture [huang2017densely]. Preliminary experiments on the dataset explored various ResNet and DenseNet architectures and found that DenseNets outperformed all ResNet variants [he2016deep]. We use a linear layer which outputs six values indicating the likelihood that each of the six methane source categories is present in the input image, which outperformed individual models across all classes in our preliminary experiments. Although the model does not explicitly produce a value indicating the likelihood that the image is negative, a low value assigned to all classes indicates a negative prediction. The loss function is the mean of six unweighted binary cross entropy losses, where the label is 1 if the class is present in the image and 0 otherwise. All six labels in the negative examples are 0. The network weights are initialized with weights from a network pre-trained on ImageNet [deng2009imagenet]. Before inputting the images into the networks, we upscale the Sentinel-1 and Sentinel-2 images to match the size of NAIP images using bilinear resampling and normalize the values by the display range of the bands (see Table 7 in the Appendix). When using inputs with fewer than or more than 3 channels, we replace the first convolutional layer with one which accepts the corresponding number of channels. Each model is trained for 5 epochs with a batch size of 4. For each model we use the checkpoint from the epoch with the lowest validation loss. We use an Adam optimizer with standard parameters [kingma2014adam] and a learning rate of 0.02. All models are trained using a GeForce GTX 1070 GPU.
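To make the loss concrete, the following is a minimal pure-Python sketch of the mean of six unweighted binary cross entropy terms computed on raw model outputs (logits). It is not the training code used in the experiments; the function names are hypothetical, and the numerically stable form of the per-class term is a standard identity rather than something stated in the text.

```python
import math

CLASSES = ["CAFOs", "Landfills", "Mines", "Proc Plants", "R&Ts", "WWTPs"]


def bce_with_logit(z, y):
    """Binary cross entropy for one logit z and one label y in {0, 1}.

    Uses the stable identity max(z, 0) - z*y + log(1 + exp(-|z|)),
    which equals -[y*log(sigmoid(z)) + (1-y)*log(1 - sigmoid(z))].
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def multilabel_loss(logits, labels):
    """Mean of the unweighted per-class binary cross entropy terms."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    return sum(bce_with_logit(z, y) for z, y in zip(logits, labels)) / len(logits)


# A negative example has all six labels set to 0.
negative_labels = [0] * len(CLASSES)
confident_negative = multilabel_loss([-6.0] * 6, negative_labels)
uninformed = multilabel_loss([0.0] * 6, negative_labels)
```

With all logits at zero, every class probability is 0.5 and the loss equals log 2 regardless of the labels, which gives a convenient reference point when reading training curves.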
The baseline setting for all experiments uses images capturing a footprint of 720m x 720m with 1m spatial resolution (720 x 720 image dimensions). After the models are trained, each of the six values output by the model is fed through an element-wise sigmoid function to produce a probability for each of the six categories. To evaluate the performance of the models, we compute the per-class area under the precision recall curve (AUPRC) and summarize the performance over all classes by taking the macro-average of the per-class AUPRCs.
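As an illustration of the evaluation metric, the sketch below computes a per-class AUPRC and its macro-average. It assumes that AUPRC is estimated by average precision (the mean of the precision values at the rank of each positive example), which is one common estimator; the text does not say which estimator was used, and ties between scores are simply broken by sort order here. The scores and labels are made up.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean precision at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


def macro_average(per_class_scores, per_class_labels):
    """Unweighted mean of the per-class average precision values."""
    aps = [average_precision(s, y) for s, y in zip(per_class_scores, per_class_labels)]
    return sum(aps) / len(aps), aps


# Two toy classes with made-up predictions.
scores = [[0.9, 0.8, 0.3, 0.1], [0.9, 0.2]]
labels = [[1, 0, 1, 0], [1, 0]]
overall, per_class = macro_average(scores, labels)
```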
|Image Product||Bands||CAFOs||Landfills||Mines||Proc Plants||R&Ts||WWTPs||Overall|
|S2 & S1||All||0.923||0.152||0.379||0.391||0.612||0.231||0.448|
|NAIP & S2 & S1||All||0.889||0.214||0.473||0.457||0.796||0.272||0.517|
3.1 Impact of using different imaging products and bands
We investigate the impact of using different combinations of image products and bands in the dataset (Table 3). Specifically, we experiment with NAIP, S2, and S1 alone, only visible bands and all spectral bands for S2 and NAIP, all spectral bands from S1 and S2 together (representing the model closest to public global transferability due to the global coverage of S1 and S2), and all spectral bands from the three products together.
The best model according to macro-average AUPRC is the one which uses NAIP with all bands (the three visible bands and NIR band), achieving an overall AUPRC of 0.548 and the highest performance on CAFOs, Landfills, Proc Plants, R&Ts, and WWTPs compared to all other tested product and band combinations. Notably, it achieves very high performance on CAFOs (AUPRC=0.945) and high performance on R&Ts (AUPRC=0.857). The second best model is the joint NAIP+S2+S1 model, achieving an overall AUPRC of 0.517 and the highest performance on Mines (AUPRC=0.473) compared to all other tested product and band combinations.
S1 alone underperforms all other combinations of products and bands, followed by S2 and S1 jointly, which performed similarly overall to S2 with only the visible bands and all spectral bands. Importantly, the S2 and S1 joint model still achieves high performance on CAFOs (AUPRC=0.923), although the performance is lower than performance on CAFOs using NAIP imagery (AUPRC=0.947). There is a significant drop in performance on all classes when moving from NAIP to S2, highlighting the benefit of using high spatial resolution imagery.
The inclusion of the non-visible information substantially improves overall AUPRC for NAIP (from 0.480 to 0.548) but not for Sentinel-2 (from 0.450 to 0.448). For NAIP, the improvement is observed for all classes, with substantial gains on CAFOs, Mines, Proc Plants, and WWTPs. For Sentinel-2, the inclusion of non-visible bands substantially improves performance on Mines but substantially degrades performance on Landfills. For both products, minimal change on R&Ts performance is observed when including the non-visible bands.
|Image Footprint||Resolution||CAFOs||Landfills||Mines||Proc Plants||R&Ts||WWTPs||Overall|
3.2 Impact of image footprint and spatial resolution
As image footprint (i.e. the amount of area on the ground captured by the image) and spatial resolution likely impact model performance due to the variation in the sizes of the methane-emitting facilities and equipment, we conduct experiments to test these effects (Table 4). To investigate the impact of footprint, we center crop the 720 x 720 1m images to obtain 480 x 480 and 240 x 240 1m images corresponding to 480m x 480m and 240m x 240m footprints respectively. Note that this reduces the area on the ground with spatial resolution held constant. To investigate the impact of spatial resolution, we use cubic spline interpolation [parsania2016comparative] to downsample the 720 x 720 images to 480 x 480 (1.5m resolution, corresponding to Airbus SPOT imagery) and 240 x 240 (3m resolution, corresponding to PlanetScope imagery). Note that this reduces the spatial resolution without modifying the image footprint. In all experiments, we up-sample the images back to 720 x 720 to avoid any differences in performance due to varying image size. We use NAIP with RGB + NIR bands for these experiments as this setting produced the best overall performance compared to the other combinations of products and bands.
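The two manipulations can be sketched as follows. The center crop is shown on a plain nested list standing in for a single-band image, and the resolution helper only reproduces the arithmetic that links the downsampled size to the effective resolution. The cubic spline resampling itself is not implemented here, and the function names are hypothetical.

```python
def center_crop(image, size):
    """Return the central size x size window of a square 2D image (list of rows)."""
    side = len(image)
    if size > side or (side - size) % 2 != 0:
        raise ValueError("crop size must fit and leave an equal margin on both sides")
    start = (side - size) // 2
    return [row[start:start + size] for row in image[start:start + size]]


def effective_resolution_m(native_side_px, downsampled_side_px, native_resolution_m=1.0):
    """Ground distance per pixel after downsampling with the footprint held fixed."""
    return native_resolution_m * native_side_px / downsampled_side_px


# A synthetic 720 x 720 single-band image whose pixel value encodes its position.
image = [[row * 720 + col for col in range(720)] for row in range(720)]

crop_480 = center_crop(image, 480)  # 480m x 480m footprint at 1m resolution
crop_240 = center_crop(image, 240)  # 240m x 240m footprint at 1m resolution

res_480 = effective_resolution_m(720, 480)  # 1.5m, footprint unchanged
res_240 = effective_resolution_m(720, 240)  # 3.0m, footprint unchanged
```

Cropping changes how much ground is visible while keeping 1m pixels, whereas downsampling keeps the 720m x 720m footprint and coarsens the pixels, which is why the two experiments isolate different effects.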
We find that the largest tested image footprint achieves the highest overall performance (0.548) and substantially outperforms both smaller spatial footprints across all classes except for WWTPs. This may be explained by the fact that a significant number of smaller wastewater treatment plants are surrounded by industrial buildings and other infrastructure, so cropping out this infrastructure improves the model’s ability to identify the salient features of the wastewater treatment facilities.
We further find that the highest spatial resolution achieves the best overall performance (AUPRC=0.548), outperforming the coarser resolution models on CAFOs, Landfills, and R&Ts. The 1.5m resolution model closely follows with an overall AUPRC of 0.541 and outperforms the 1m resolution model on Mines. The 3m resolution model also closely follows the 1.5m resolution model achieving an overall performance of 0.531, and substantially outperforms both the higher resolution models on Proc Plants. This result suggests that models developed at 1.5m and even 3m resolution have the potential to perform almost as well as 1m resolution models, which has implications on global applicability as Airbus SPOT and PlanetScope are globally (privately) available at 1.5m and 3m resolution respectively.
3.3 Per-class expert model test set results
For each methane source category, we select the experimental configuration (product/band/footprint/resolution) that achieved the highest validation AUPRC for that class to serve as the “class expert”. We refer to the combination of the different class experts as the per-class expert model.
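Selecting the class experts amounts to a per-class argmax over configurations. The sketch below uses only the two configurations whose per-class AUPRC values appear in Table 3 of this section as illustrative inputs; the actual selection would range over every tested product, band, footprint, and resolution combination, and the function name is hypothetical.

```python
CLASSES = ["CAFOs", "Landfills", "Mines", "Proc Plants", "R&Ts", "WWTPs"]

# Per-class AUPRC for two configurations, copied from the Table 3 rows shown above.
validation_auprc = {
    "S2 & S1, all bands": [0.923, 0.152, 0.379, 0.391, 0.612, 0.231],
    "NAIP & S2 & S1, all bands": [0.889, 0.214, 0.473, 0.457, 0.796, 0.272],
}


def select_class_experts(results, classes):
    """For each class, pick the configuration with the highest AUPRC."""
    experts = {}
    for index, name in enumerate(classes):
        experts[name] = max(results, key=lambda config: results[config][index])
    return experts


experts = select_class_experts(validation_auprc, CLASSES)
for name, config in experts.items():
    print(f"{name}: {config}")
```

Restricted to these two rows, the joint Sentinel model would serve as the expert for CAFOs and the three-product model for the remaining classes.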
We evaluate the per-class expert model on the held-out test set using a variety of metrics including AUPRC and area under the receiver operating characteristic curve (AUROCC) as well as precision, recall, and F1 at the threshold which achieves the highest F1 on the validation set. The results are shown in Table 5. The per-class expert model obtains a macro-average AUPRC of 0.558. The model does especially well on CAFOs (AUPRC=0.915) and R&Ts (AUPRC=0.821), possibly because these sources have very distinctive features (e.g., long barns in CAFOs and storage tank farms in R&Ts). It performs more poorly on the other sources, especially landfills, which do not have many clear distinctive features visible at 1m resolution. Notably it achieves the lowest performance on the categories with the fewest examples in the dataset, excluding R&Ts, which may be simpler to identify due to their homogeneity and discernible features.
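The thresholding procedure for one class can be sketched as below: the threshold is chosen to maximize F1 on validation scores and then applied unchanged to test scores. The candidate thresholds (the distinct validation scores), the "greater than or equal" decision rule, and all numbers are assumptions made for illustration.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall, and F1 when predicting positive for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(val_scores, val_labels):
    """Pick the candidate threshold with the highest F1 on the validation set."""
    candidates = sorted(set(val_scores))
    return max(candidates, key=lambda t: precision_recall_f1(val_scores, val_labels, t)[2])


# Made-up validation and test predictions for a single class.
val_scores, val_labels = [0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]
test_scores, test_labels = [0.8, 0.6, 0.75, 0.1], [1, 1, 0, 0]

threshold = best_f1_threshold(val_scores, val_labels)
test_metrics = precision_recall_f1(test_scores, test_labels, threshold)
```

Fixing the threshold on the validation set keeps the test set untouched by any tuning, so the reported precision, recall, and F1 remain an honest estimate.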
The experiments suggest that the choice of imaging product, spectral band, image footprint, and spatial resolution can lead to substantial differences in model performance, with the effect often depending on the methane source category. In particular, this suggests that there is significant room to explore approaches which leverage the multi-sensor and multi-spectral aspects of METER-ML. For example, the NAIP & S2 & S1 model underperformed the model which used NAIP alone, and using all 13 spectral bands in the S2 model did not lead to substantial performance differences compared to the S2 model which only used the three visible bands. We also do not leverage the geographic information explicitly in the models, but this has been shown to improve performance on other Earth observation tasks [mac2019presence, irvin2020forestnet]. Furthermore, there is potential to augment the dataset with other sources of imagery and information available at the provided geographic locations. We hope to help create new versions of METER-ML which may include other sources of input data and methane emitting infrastructure categories.
The best model from our experiments achieves high performance on identifying CAFOs and R&Ts, suggesting the potential to map these facilities with NAIP imagery in the U.S., which aligns with findings from prior studies [sheng2020ognet, handan2019deep]. The performance for identifying CAFOs remains high when using S1 and S2, which are globally and publicly available. This suggests the potential to use these lower spatial resolution imagery sources to map CAFOs in other countries besides the U.S., but future work should investigate whether these findings generalize to other regions. There is still a large gap to achieving high performance on each of the other methane source categories, as well as room to further improve performance on the high performing categories, so METER-ML is a challenging benchmark for testing new infrastructure identification approaches.
There are many other publicly available remote sensing datasets for classification, with some of the most common being UC Merced [yang2012geographic], SAT-4 and SAT-6 [basu2015deepsat], AID [xia2017aid], NWPU-RESISC45 [cheng2017remote], EuroSAT [helber2019eurosat], and BigEarthNet [sumbul2019bigearthnet]. Few of these datasets have georeferenced multi-sensor images, which limits their utility for new modeling approaches and downstream use. The OGNet dataset [sheng2020ognet] is the most similar publicly available dataset to METER-ML and is essentially a subset of it, containing NAIP imagery of refineries in the contiguous U.S.
We identify four limitations of this work. First, we limit the geographic scope of METER-ML to the U.S. due to the availability of disseminatable infrastructure data and publicly available, high resolution imagery. Future work should include data in other regions worldwide. Second, we do not include longitudinal imagery in the dataset to reduce the size and complexity of the dataset as most infrastructure is static over time. However, longitudinal information has the potential to provide additional signal to help differentiate certain facilities, e.g. waste pile evolution at landfills. Third, we use a DenseNet121 model that is pre-trained on ImageNet, but the shape and number of channels of remote sensing imagery can be significantly different from ImageNet. It would be worthwhile to train a network from scratch on METER-ML, and compare its performance against a network that is pre-trained on ImageNet and fine-tuned on METER-ML. Fourth, our approach to combine the multi-sensor data may not be optimal as the products and spectral bands have different spatial resolutions and sensor types (e.g. active vs. passive sensors). One alternate approach may be to dedicate different network branches for the inputs and combine the representations from each branch.
In this work, we curate a large georeferenced multi-sensor dataset called METER-ML to test automated methane source identification approaches. We conduct a variety of experiments investigating the impact of remotely sensed image product, spectral bands, image footprint, and spatial resolution on model performance measured against a consensus of expert labels. We find that a model which leverages NAIP with all four bands achieves the highest overall performance across the tested image product and spectral band combinations, followed closely by a joint NAIP, Sentinel-2, and Sentinel-1 model. We also find that the highest spatial resolution and the largest footprint lead to the best overall performance, although performance can depend on the methane source category. Finally we show that the best model achieves high performance in identifying concentrated animal feeding operations and oil refineries and petroleum terminals, suggesting the potential to map them at scale, but substantially lower performance on the other four categories, with notably lower performance identifying processing plants and landfills. We make METER-ML freely available in order to encourage and support future work on developing Earth observation models for mitigating climate change.
Acknowledgements. This work was supported by the High Tide Foundation to construct the METER database. We acknowledge Rose Rustowicz and Kyle Story for their support of this work, as well as the Descartes Labs Platform API and tools for downloading and processing the remotely sensed imagery. We also thank Ritesh Gautam and Mark Omara for their help working with the oil and gas infrastructure data, Evan Sherwin for his advice on the dataset and methane source categories, and Victor Maus for providing the coal mines data.
Appendix A
A.1 Methane-Emitting Infrastructure Datasets
|Dataset Source||Scope||Methane Source Categories|
|CA Energy Commission [caloes]||California||R&Ts|
|Data For Cause Challenge [dataforcause]||US||CAFOs|
|EIA [eia]||US||Proc Plants, R&Ts, WWTPs|
|GHGRP [ghgrp]||US||Landfills, Mines, R&Ts, WWTPs|
|GOGI [gogi]||Global||Proc Plants, R&Ts|
|HIFLD [hifld]||US||R&Ts, WWTPs|
|Marchese et al. [marchese2015methane]||US||Proc Plants|
|Maus et al. [maus2022update]||Global||Mines|
|Minnesota Metropolitan Council [metro_wwtp]||Minnesota||WWTPs|
|Minnesota Pollution Control Agency [mpca]||Minnesota||CAFOs|
|ORNL DAAC [hopkins2019sources]||California||CAFOs|
|Sierra Club [sierra]||Michigan||CAFOs|
|Stanford RegLab [reglab]||North Carolina||CAFOs|
A.1.1 Coal Mines Data
The mines data from [maus2022update] were subsetted to coal mines in order to capture the mines responsible for the vast majority of methane emissions. To do this, the polygons and coal mine coordinates obtained from S&P Global Commodity Insights were matched to determine which polygons were spatially related to coal mine coordinates. Then a visual check and hand cleaning were performed on the polygons assigned a coal mining label to ensure correctness.
A.2 Negative Classes
We identify a variety of facilities and landscapes which are not any of the six infrastructure categories to use as negatives in the dataset. Specifically we include football fields, marinas, solar panels, large bodies of water, parking lots, windmills, baseball fields, airport runways, clouds, neighborhoods, golf courses, roundabouts, mountainous terrain, trees, boats, islands, rocks, rivers, roads, bridges, ripples in water, snow, canyon formations, sparse forests, suburban neighborhoods, beaches, clear water, swimming pools, sand, corn farms, soy farms, trees on mountainside, farm houses, grass, airplanes, turning roads, intersections, multifamily residential facilities, rapids, docks, highway loops, mowed grass, container yards, soccer fields, greenhouses, crops, personal watercraft, pivot irrigation systems, and concrete plants. Characteristic examples of each type were selected and a variety of similar examples per type were obtained using the Descartes Labs GeoVisual Search [keisler2019visual].
A.3 Remotely Sensed Image Statistics and Examples
|Product/Bands||Image Size||Resolution||Data Range||Display Range|
|NAIP RGB & NIR||720x720||1m||[0,255]||[0,255]|
[Figure 2: Characteristic examples for each methane source category. Panel columns: NAIP RGB, NAIP NIR, S1 VV&VH, S2 RGB, S2 NIR, S2 RE1&SWIR1-2, S2 RE2-4, S2 CA&WV&C.]