Iterative spatial leave-one-out cross-validation and gap-filling based data augmentation for supervised learning applications in marine remote sensing

Stock, Andy; Subramaniam, Ajit

In marine remote sensing, supervised learning can link variables measured in-situ near the ocean surface to variables that can be measured from space. However, the in-situ data used for training and validating such empirical satellite algorithms are often spatially auto-correlated and clustered, giving rise to various statistical challenges such as overfitting to spatial structures. Furthermore, co-located in-situ and satellite measurements are rare in the oceans because of the cost of data collection from research vessels and frequent cloud cover. We propose two methods to mitigate these challenges. The first method builds on spatial leave-one-out cross-validation (SLOOCV), an approach designed to provide sound error estimates when data are spatially auto-correlated by enforcing a minimum separation distance between training and test observations. However, estimating this distance may be impossible with sparse and spatially clustered data. We hence propose to iterate and integrate error estimates over a range of separation distances (iSLOOCV). To address the often-small size of labeled data sets based on marine in-situ data, we tested if increasing the number of observations for algorithm training by means of cloud-filling algorithms for marine satellite data improved predictions. The potential of these two methods is demonstrated by developing empirical algorithms for mapping the proportions of seven diagnostic pigments (DPs) that serve as proxies for phytoplankton community composition in the northern Gulf of Mexico. We estimated the prediction accuracy of 13 algorithms with iSLOOCV, using various sets of satellite data products as input, and found adequate algorithms for 4 of the 7 DPs. Random forests combining ocean color and environmental variables as input had the lowest prediction errors overall. Correlations between predictions and observations estimated by iSLOOCV ranged from 0.69 to 0.85 and mean absolute errors from 0.02 to 0.13. 
Daily maps and longer-term composites of these DPs were broadly consistent with previously published results. Overall, errors increased when extrapolating over larger distances, highlighting how iSLOOCV can illuminate changes in algorithm performance based on sub-regional data coverage. Generating larger training sets by prior gap-filling substantially improved all error measures for 3 of the 7 DPs, with mixed results for the others. Therefore, data augmentation by gap-filling of satellite data should not be used as a default approach but can be a useful tool when supervised learning applications are suspected to be limited by the size of the training set.
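
The iSLOOCV idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes planar coordinates with Euclidean distances (a real application would more likely use great-circle distances), an ordinary least-squares model as a stand-in for the random forests mentioned above, mean absolute error as the error measure, and a plain average over the tested distances as the "integration" step. All function and parameter names (fit_predict_linear, sloocv_mae, isloocv_mae, min_train) are hypothetical.

```python
import numpy as np


def fit_predict_linear(X_train, y_train, X_test):
    """Stand-in model (assumption): ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def sloocv_mae(X, y, coords, radius, fit_predict=fit_predict_linear, min_train=3):
    """Spatial leave-one-out CV at one separation distance.

    For each test observation, only observations farther away than `radius`
    are used for training. Returns the mean absolute error, or NaN if no
    test observation had at least `min_train` training observations.
    """
    # Pairwise Euclidean distances between all observation locations.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    errors = []
    for i in range(len(y)):
        # The test point itself has distance 0 and is therefore excluded
        # for any radius >= 0.
        train = dist[i] > radius
        if train.sum() < min_train:
            continue  # too few distant observations to train on
        pred = fit_predict(X[train], y[train], X[i:i + 1])[0]
        errors.append(abs(pred - y[i]))
    return float(np.mean(errors)) if errors else float("nan")


def isloocv_mae(X, y, coords, radii, **kwargs):
    """Iterate SLOOCV over several separation distances.

    Returns the per-distance errors and their average (ignoring NaNs).
    Averaging is an assumed, simple way to integrate over distances.
    """
    per_radius = np.array([sloocv_mae(X, y, coords, r, **kwargs) for r in radii])
    return per_radius, float(np.nanmean(per_radius))
```

Calling isloocv_mae(X, y, coords, radii=[0.0, 1.0, 2.0]) returns one error per separation distance plus their average. Inspecting the per-distance errors shows how prediction error changes as the model has to extrapolate over larger distances, which is the behavior the abstract reports.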

Also Published In

GIScience & Remote Sensing

More About This Work

Academic Units
Lamont-Doherty Earth Observatory
Biology and Paleo Environment
Published Here
November 11, 2022