Total Soil Organic Carbon Content 

Key points:

Wadoux, Alexandre; Roman Dobarco, Mercedes; Malone, Brendan; Minasny, Budiman; McBratney, Alex; Searle, Ross

Full paper can be found here

Data collation:

Data on total organic carbon (TOC) concentration (%) was extracted with the SoilDataFederator managed by TERN. The Soil Data Federator is a web API that compiles soil data from different institutions and government agencies throughout Australia. The laboratory methods for total organic carbon included in the study are 6A1, 6A1_UC, 6B2, 6B2b, 6B3, 6B3a. We selected TOC data from the period 1970-2020 to get a compromise between representativity of current TOC concentration and spatial coverage. The data was cleaned and processed to harmonize units, exclude duplicates and potentially wrong data entries (e.g. missing upper or lower horizon depths, extreme TOC values, unknown sampling date). Additional TOC measurements from the Biome of Australian Soil Environments (BASE) contextual data (Bisset et al., 2016) were also included in the analyses. TOC concentration for BASE samples was determined by the Walkley-Black method (method 6A1). Upper limits for TOC concentration by biome and land cover classes were set according to published literature, consistent datasets (Australian national Soil Carbon Research Program (SCaRP) and BASE, and data exploration to exclude unrealistic TOC values (e.g. maximum TOC = 30% in temperate forests, maximum TOC = 14% in temperate rainfed pasture). Since TOC concentration in Australian ecosystems has been underestimated by previous SOC maps, we did not set conservative TOC upper limits, knowing that machine learning model would likely underestimate high SOC values.

The equal-area quadratic spline function were fitted to the whole collection of pre-processed TOC data, and then values extracted for the 0-5 cm, 5-15 cm, 15-30 cm, 30-60 cm, 60-100 cm, and 100-200 cm depth intervals, following GlobalSoilMap specifications (Arrouays et al., 2014}. Boxplots with TOC values by biome and land cover after data cleaning and depth standardization are shown in Figure 1.

Figure 1: Values of total organic carbon concentration by biome and land cover after data cleaning and depth standardization.

Mapping of TOC

Covariates: We collected a set of 57 spatially exhaustive environmental covariates covering Australia and representing proxies for factors influencing SOC formation and spatial distribution: soil properties, climate, organisms/vegetation, relief and parent material/age. The covariates were reprojected to WGS84 (EPSG:4326) projection and cropped to the same spatial extent. All covariates were resampled using billinear interpolation or aggregated to conform with a spatial resolution with grid cell of 30 m x 30 m or 90 m x 90 m.

Mapping: The spatial distribution of soil TOC concentration is driven by the combined influence of climate, vegetation, relief and parent materials. We thus modelled TOC concentration as a function of environmental covariates representing biotic and abiotic control of TOC. The measurement of SOC and their corresponding value of environmental covariate at same measurement locations were used to fit the mapping model. For the mapping we used a machine learning model called quantile regression forest.

Figure 2: Maps of TOC (0-5 cm depth interval) along with the lower (left) and upper (right) prediction interval.

Mapping is made with Quantile regression forest, which is similar to the popular random forest algorithm for mapping. Instead of obtaining a single statistic, that is the mean prediction from the decision trees in the random forest, we report all the target values of the leaf node of the decision trees. With QRF, the prediction is thus not a single value but a cumulative distribution of the TOC prediction at each location, which can be used to compute empirical quantile estimates.

Figure 3: Maps of predicted TOC values for the six depth intervals and spatial resolution of 90 m x 90 m 


Validation of predictions: Each depth-specific model of TOC was validated based on the results of a K-fold cross validation. The whole dataset was randomly split into K=10 approximately same size folds. Each fold was kept apart for the validation and the remaining K-1 folds were used as calibration dataset. Models were compared using the mean error (ME), the root mean square error (RMSE), the squared Pearson's r correlation coefficient (r2), and the modelling efficiency coefficient (MEC).

These validation statistics are summarized into a so-called solar diagram (Wadoux et al., 2022), which provides an integrated visualization of various statistical indices in a single figure, see Figure 2.

Figure 4: Summary diagram (solar diagram) of the validation statistics. The statistics are obtained by 10-fold cross-validation. The x-axis represents the mean error (ME), the y-axis is the standard deviation of the error (SDE) which are both standardized by the standard deviation of the observations (denoted by *). Any point in the diagram has a distance to the origin equal to the RMSE. Colour scale indicate the Pearson’s r correlation coefficient and the modelling efficiency (MEC). Note that 30 m and 90 m refer to the map made at 30 m × 30 m and 90 m × 90 m resolutions, respectively.

Validation of the quantified uncertainty: We report the depth-specific lower (q0.05) and upper (q0.95) limits of the 90% prediction interval with two maps. Validation of the uncertainty estimates are obtained through a so-called accuracy plot. In an accuracy plot, the proportion of cross-validation observations contained in in each q prediction interval is calculated. Ideally, the proportion of observations covered by a q interval is approximately equal to the value of q. If the proportion of observations covered in q is greater than q, it suggests that the uncertainty is over-estimated, whereas a substantially smaller proportion of observations compared to the nominal value of q suggest an under-estimation of the uncertainty. The process is repeated for all q, and the values of q are plotted against the actual proportion of values covered by q in a scattergram. Ideally, all values should be close to the 1:1 line, which would mean that uncertainty is adequately estimated. Note that we evaluate uncertainty for all q against cross-validation observations, but report the maps of the 90% prediction intervals only.


Arrouays, D., McKenzie, N., Hempel, J., de Forges, A. R., & McBratney, A. B. (Eds.). (2014). GlobalSoilMap: Basis of the Global Spatial Soil Information System. CRC press.

Bissett, A. et al. "Introducing BASE: the Biomes of Australian Soil Environments soil microbial diversity database." Gigascience 5.1 (2016): s13742-016.

Wadoux, AMJ-C., DJJ Walvoort, and DJ. Brus. An integrated approach for the evaluation of quantitative soil maps through Taylor and solar diagrams. Geoderma 405 (2022): 115332.