Soil pH (CaCl2)
Highlights
Despite a considerable increase in available data brought into the spatial model framework through the inclusion of field measured data, Version 2 represents a modest improvement compared to Version 1.
Based on test data, quantifications of uncertainty appear better defined for Version 2 than for Version 1.
Digital mapping of soil pH is inherently difficult because pH can be modified through land management practices, which are hard to distinguish using available covariate data. Similarly for subsoils, while the models show some increase in accuracy, there appears to be little covariate data (presumably geological or lithological) that can be exploited to derive more precise estimates of soil pH.
Background
The first effort to derive national digital soil mapping of soil pH (measured with 1:5 soil and CaCl2 mixture; method 4B1 in Rayment and Lyons 2011) is published and available on the CSIRO Data Access Portal among other places. The present work sought to update this mapping as part of ongoing efforts to expand and improve Australia’s national mapping and characterisation of its soil resources. Collectively these national soil mapping efforts constitute the Soil and Landscape Grid of Australia. The original work is designated Version 1 (completed 2015), while the new work is Version 2 (completed 2023). This work has been made possible through support and funding from Australia’s National Collaborative Research Infrastructure Strategy (NCRIS) via the Terrestrial and Ecosystem Research Network.
As with the first effort, digital soil mapping is the underpinning framework for the ultimate creation of soil maps in this instance.
As with the other more recent national digital soil mapping efforts, the SoilDataFederator (Searle 2020) has been instrumental in the dynamic collation of disparate soil observational datasets from across the country. These data have been sourced mainly from the State and Territory Government departments tasked with soil survey and collection. There are also data contributions from universities and, to a lesser extent, individual research groups. The SoilDataFederator also taps into the larger CSIRO-developed Natsoil database (CSIRO 2020), which holds the data related to research projects and field stations that CSIRO has managed.
The improvement in digital soil mapping has come about via several mechanisms.
A huge expansion of the available library of data corresponding to each of the main soil state factors has been made possible (Searle et al. 2022). This is through acquisition of new data sets and improvement of others compared with those used for Version 1.
The incorporation of soil pH data measured using a field method (Raupach's Indicator test method) into the modelling system. An empirical transfer function was developed from cases with both lab and field observations (52629) so that lab-equivalent pH could be estimated where only field data were available. Combining lab and field measures required a special model fitting procedure to account for the differing magnitudes of error in the pH data. Lab data were assumed to be error free, whereas for field data both the pH estimate and its uncertainty were derived from the empirical transfer function and then incorporated into the spatial modelling system.
Adoption of machine learning (ML) to derive empirical relationships between the target variable (soil pH) and various data related to the state factors that help determine and control soil variability across landscapes, here the Australian continent and very nearshore islands. While the adoption of ML is not an entirely new advancement, coupling it with additional data and integrating it within a pseudo-3D predictive framework permits a better spatial and vertical characterisation of soils than was possible for Version 1.
Together with a more powerful and streamlined predictive modelling approach, the quantification of uncertainties uses the UNEEC (Uncertainty Estimation based on Empirical Errors and Clustering; Shrestha and Solomatine 2006) approach instead of bootstrapping, so that prediction interval bounds are better tailored to the variations in state factor information. Bootstrapping tends to create uniform prediction interval ranges, whereas UNEEC can distinguish areas of relatively lower and higher uncertainty based on differences in soil and landscape characteristics. Therefore, for Version 2, the uncertainties are more specific to, and more tightly defined for, the environment in which they are quantified.
An approach to understand and characterise issues of model extrapolation has been developed. This seeks to highlight areas where there is high confidence that models are going to be unreliable because these areas are outside the range of the underpinning data used in modelling. This issue is addressed via a combination of data-geometric and distance-based techniques.
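The idea of combining a boundary check with a distance-based check can be sketched as follows. This is a minimal illustration in Python, not the code used to build the SLGA products: the function names, the toy two-covariate data, the radius, and the minimum neighbour count are all hypothetical, and the covariates are assumed to be already scaled to comparable units.

```python
import math

def outside_range(train, point):
    """Boundary check: True if any covariate of `point` lies outside
    the min-max range seen in the training data."""
    for j, value in enumerate(point):
        column = [row[j] for row in train]
        if value < min(column) or value > max(column):
            return True
    return False

def neighbour_count(train, point, radius):
    """Distance check: number of training sites within `radius`
    of `point` in (scaled) covariate space."""
    return sum(1 for row in train if math.dist(row, point) <= radius)

def extrapolation_flag(train, point, radius=0.5, min_count=3):
    """Flag a prediction location if it fails either check.
    `radius` and `min_count` are illustrative values only."""
    too_few = neighbour_count(train, point, radius) < min_count
    return outside_range(train, point) or too_few

# Toy training data: two scaled covariates per site (hypothetical values).
train = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.9, 0.8)]

inside = extrapolation_flag(train, (0.2, 0.2))   # well supported by data
outside = extrapolation_flag(train, (1.5, 0.2))  # beyond the covariate range
sparse = extrapolation_flag(train, (0.6, 0.5))   # in range, few neighbours
print(inside, outside, sparse)
```

The two checks are complementary: the boundary check catches locations beyond anything observed, while the neighbour count catches locations that are nominally in range but poorly supported by nearby observations.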
Available data and work steps
Across all data retrieved through the Australian Soil Data Federator, there are ~145000 sites with an observed pH value.
Figure. Maps showing the distribution of site observations of pH measured in the field (left map) and laboratory (right map).
Although there are numerous cases where both lab and field data have been collected, approximately 87% of the site data contain field-measured pH only. The proportions of data across each of the depth intervals (shown in the table below) show the usual pattern of many more observations at the surface than in sub-surface layers. A benefit of having field data is that there are many more observations overall compared with lab measurements. Moreover, a greater proportion of these values are at depth, so incorporating these underutilised field-measured data potentially gives greater insight not only into spatial characterisation but into vertical characterisation too.
Table. Proportion of sites per measurement method for each depth interval. Many more sites have field measurements at depth than have laboratory measurements.
Integrating field measurement with lab measurements.
We sought to understand the empirical relationship between field and lab measurements by assessing the ~56K cases where both field and 4B1 pH measurements were recorded. The xy-density plot below shows the paired relationship found for Australian soils between the two pH measurement types. Field pH measurements more closely resemble pH measured in a 1:5 soil and water mix (4A1). While there is a general linear relationship between the two variables, the systematic offset between the two measures of pH is clearly apparent. There is also notable dispersion of lab measurements within each pH grouping (0.5-1 pH unit increments) of the field pH measurements, but in general there is about a 0.5 pH unit difference in values (between field and lab) where the data density is greatest.
Figure. Density plot of data cases (~56K) that have corresponding lab and field measures of soil pH.
Fitting a linear relationship between the two variables and using this outcome to integrate both measurement types into a spatial soil modelling framework would overlook the observed variability of the data. It is acknowledged there may be transcription errors from the data entry process and other similar human errors, but we need to accept the data as they are. The variability in lab-measured pH in each field pH grouping provides a rich understanding of measurement error, which in turn can be incorporated in various ways into digital soil mapping.
To do this, for each field pH grouping we collected the empirical distribution of lab-measured pH values, together with its 2.5% and 97.5% quantiles, and saved these for use in the spatial soil modelling exercise.
The sequence of steps below was carried out to develop the Version 2 products:
Extraction of both lab and field data from SoilDataFederator, followed by data screening processes to clean up spurious cases.
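As an illustration of this per-grouping approach, the sketch below builds a lookup of lab pH quantiles for each field pH class. It is a hedged example only: the paired values are invented, the function names are hypothetical, and the use of the median as the central estimate is an assumption rather than something stated for the Version 2 workflow.

```python
from collections import defaultdict

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def build_lookup(pairs):
    """Group lab pH by field pH class and summarise each group by
    its 2.5% quantile, median (assumed centre), and 97.5% quantile."""
    groups = defaultdict(list)
    for field_ph, lab_ph in pairs:
        groups[field_ph].append(lab_ph)
    lookup = {}
    for field_ph, labs in groups.items():
        labs.sort()
        lookup[field_ph] = {
            "lower": quantile(labs, 0.025),
            "median": quantile(labs, 0.5),
            "upper": quantile(labs, 0.975),
        }
    return lookup

# Invented paired observations: (field pH class, lab pH in CaCl2).
pairs = [(6.0, 5.2), (6.0, 5.5), (6.0, 5.6), (6.0, 5.9), (6.0, 5.3),
         (6.5, 5.8), (6.5, 6.0), (6.5, 6.1), (6.5, 6.4), (6.5, 5.9)]

lookup = build_lookup(pairs)
estimate = lookup[6.0]  # summary used for a site with field pH 6.0 only
print(estimate)
```

A site with only a field reading can then be given the summary for its field pH class, so that the spread of lab values within that class travels with the estimate into later modelling steps.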
Development of transfer function using data cases with corresponding field and lab information.
Integration of lab and field data, whereby estimates of pH 4B1 from field data are drawn from the empirical distributions so that the uncertainty of the data is adequately handled in later spatial modelling steps.
Preparation of point and covariate data, including filtering, cleansing, and harmonisation.
Point data intersection with covariates.
Creation of model and test data sets. Up to 10000 test cases were extracted at random from the dataset for each depth interval. Because of missing data at increasing depth, the number of test cases for each depth was: 10000 (0-5cm), 9934 (5-15cm), 9299 (15-30cm), 8521 (30-60cm), 6809 (60-100cm), 4015 (100-200cm).
Ranger model hyperparameter value optimisation
Ranger model fitting with best hyperparameters.
Variogram model fitting of ranger model residuals.
Spatialisation of ranger models and residual kriging models
Uncertainty analysis with UNEEC method including rudimentary optimisation of class number size.
Spatialisation of model uncertainties.
Model extrapolation work using count-of-observations and boundary methods (point data).
Ranger model fitting of extrapolation outcomes.
Spatialisation of model extrapolation outcomes.
Model evaluations with both test data and against SLGA Version 1 products.
Delivery of digital soil mapping outputs and computer code to repository.
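The uncertainty analysis step in the sequence above can be illustrated with a simplified sketch of the clustering idea behind UNEEC. Note the assumptions: UNEEC as published uses fuzzy clustering, whereas this example swaps in plain k-means with hard cluster assignment to stay short; the single covariate, the residual values, the 5% and 95% error quantiles, and all function names are hypothetical; residuals are taken as observed minus predicted.

```python
import math

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def nearest(centres, point):
    """Index of the cluster centre closest to `point`."""
    return min(range(len(centres)), key=lambda i: math.dist(centres[i], point))

def kmeans(points, k, iterations=20):
    """Plain k-means, started from the first k points for repeatability."""
    centres = [points[i] for i in range(k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest(centres, p)].append(p)
        centres = [tuple(sum(c) / len(m) for c in zip(*m)) if m else centres[i]
                   for i, m in enumerate(members)]
    return centres

def fit_error_bounds(covariates, residuals, k):
    """Cluster covariate space, then store empirical error quantiles
    per cluster. Assumes every cluster ends up with at least one site."""
    centres = kmeans(covariates, k)
    errors = [[] for _ in range(k)]
    for x, r in zip(covariates, residuals):
        errors[nearest(centres, x)].append(r)
    bounds = [(quantile(sorted(e), 0.05), quantile(sorted(e), 0.95)) for e in errors]
    return centres, bounds

def prediction_interval(centres, bounds, x, prediction):
    """Shift the prediction by the error quantiles of x's cluster."""
    lo, hi = bounds[nearest(centres, x)]
    return prediction + lo, prediction + hi

# Toy data: one covariate; residuals are small in one environment, large in the other.
covariates = [(0.0,), (10.0,), (0.5,), (9.5,), (1.0,), (10.5,)]
residuals = [-0.1, -0.8, 0.0, 0.9, 0.1, 0.2]

centres, bounds = fit_error_bounds(covariates, residuals, k=2)
narrow = prediction_interval(centres, bounds, (0.3,), 6.0)
wide = prediction_interval(centres, bounds, (9.8,), 6.0)
print(narrow, wide)
```

Because the error quantiles are stored per cluster, two locations with the same predicted pH can receive intervals of different width, which is the behaviour that distinguishes this family of methods from a uniform interval.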
Evaluation of Version 2 and comparison with Version 1
Map comparisons
Figure. Digital soil mapping predictions of soil pH (CaCl2) for the 0-5cm depth from both Version 1 and 2 SLGA. Version 2 also has estimates of model extrapolation risk informing reliability of both mean estimates and associated uncertainty.
Metrics for model evaluation include R2, Lin’s concordance correlation coefficient, mean error, and root mean square error. For SLGA Version 2, evaluations are done for the whole test set and for each depth interval. For Version 1, only depth interval specific evaluations are done. A prediction interval test was also done for Version 1, where the observed cases were projected into the associated prediction envelopes of the mapped data.
While it is guaranteed that all test cases used in this analysis were excluded from all model fitting work for the development of Version 2 products, this cannot be guaranteed for Version 1 products. Irrespective of this, these model evaluations point to substantial improvements of Version 2 over Version 1.
Figure. Model evaluations based on test data set for both Version 1 and 2 SLGA soil pH (CaCl2) maps.
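For readers unfamiliar with these metrics, the following Python sketch shows one common way to compute them, along with the prediction interval coverage probability (PICP). It is illustrative only: the values are invented, the function names are hypothetical, R2 is taken here as one minus the ratio of residual to total sum of squares, and mean error is taken as predicted minus observed. The definitions used in the actual evaluation may differ.

```python
import math

def evaluate(observed, predicted):
    """Return R2, Lin's concordance, mean error, and RMSE."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return {
        "r2": 1 - sse / (n * var_o),
        "ccc": 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2),
        "me": mean_p - mean_o,
        "rmse": math.sqrt(sse / n),
    }

def picp(observed, lower, upper):
    """Percentage of observations falling inside their prediction interval."""
    inside = sum(1 for o, l, u in zip(observed, lower, upper) if l <= o <= u)
    return 100 * inside / len(observed)

# Invented test cases: observed and predicted pH with interval bounds.
observed = [4.5, 5.0, 6.0, 7.5]
predicted = [4.7, 5.1, 5.8, 7.2]
lower = [p - 0.25 for p in predicted]
upper = [p + 0.25 for p in predicted]

metrics = evaluate(observed, predicted)
coverage = picp(observed, lower, upper)
print(metrics, coverage)
```

Concordance penalises both scatter and systematic bias, so it can be lower than a correlation-based measure when predictions are offset from observations. PICP is read against the nominal level of the interval: coverage well below that level indicates intervals that are too narrow.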
Figure. Accuracy and PICP plots for selected depth intervals for both SLGA Version 1 and 2 products. PICP plots only generated for Version 2.