Soil pH (1:5 Water)

Soil pH (4A1) for 0-5cm depth interval mapped across Australia at ~90m grid cell resolution.
Soil pH (4A1) for 0-5cm depth interval mapped across Australia at ~30m grid cell resolution.

Key Points

  • Australian collation of soil pH (1:5 soil water) data interrogated and mapped (compliant with SLGA specifications) across Australia for the first time.

  • Spatial modelling of soil pH was done via Random Forest machine learning coupled with an integrative approach to combine both laboratory and field measurements.

  • Mapping outputs generated for both 90m and 30m grid cell resolution.

  • Differences in model goodness of fit between the 90m and 30m modelling are nearly indistinguishable.

  • The spatial pattern of soil pH variability is similar for both mapping resolutions, but mapping at 30m provides a more granular characterisation.

Data collation

This first Australian national digital mapping of soil pH (1:5 soil water; method 4A1 in Rayment and Lyons 2011) to the specifications of the Soil and Landscape Grid of Australia (which are guided by those defined for the Global Soil Map) captures the value of field-based observations to improve the spatial coverage and density of sites included in spatial modelling.

As with the national mapping of soil texture, field-based observations are not considered to be as accurate as laboratory measured values, but they are still observations. Particularly in areas where laboratory measured data are sparse, they can offer invaluable insight into soil variability.

Field measured pH here refers to Raupach's indicator test method, in which a small soil sample is collected and an indicator solution is added to form a paste. The paste is then coated with barium sulphate powder, and the colour of the powder is compared with a colour chart.

Provided an appropriate modelling framework is developed to acknowledge the varying levels of uncertainty in the observational data, combining lab and field data in a digital soil mapping effort is a powerful way to leverage large volumes of legacy soil data that would otherwise be overlooked if only lab measured data were used.

Using the Australian Soil Data Federator, we collated just over 150,000 sites with an observed pH value. Although there are numerous cases where both lab and field data have been collected, approximately 80% of the site data contain field measured pH only.

We acknowledge there may be other collations of soils data across Australia that could potentially have been used in this work, but we report on only what is findable and accessible via the Australian Soil Data Federator.

Distribution of sites across Australia with lab measured pH (4A1).

Distribution of sites across Australia with field measured pH.

The other disparity worth noting is the characterisation of soil pH with depth. The table below shows that there are considerably more pH measurements at depths below 15cm for field measured data than for lab measurements. This further highlights the potential value of using field data within a comprehensive digital soil mapping effort that seeks to characterise soils across large spatial extents in both the lateral and vertical dimensions. Effectively, the field measured data fill some of the gaps that lab measured data alone cannot.

% of site measurements (field or lab) with data available for specified depth interval

Integrating field measurement with lab measurements

We sought to understand the empirical relationships between field and lab measurements by assessing the 82,000 cases where both field and 4A1 pH measurements were recorded.

The xy-density plot below shows the paired relationship found for Australian soils between the two pH measurement types. There is a general linear relationship between the two variables, but clearly there is some dispersion of lab measurements within each pH grouping (0.5 pH unit increments) of the field pH measurements.

xy-density plot of the ~85,000 cases of paired field and lab measured data for soil pH across Australia.

Fitting a linear relationship between the two variables and using it to integrate both measurement types into a spatial soil modelling framework would overlook the observed variability of the data. We acknowledge there may be transcription errors from the data entry process and other similar human errors, but we need to accept the data as they are. The variability in lab measured pH within each field pH grouping provides a rich understanding of measurement error, which in turn can be incorporated in various ways into digital soil mapping.

To do this, for each field pH grouping we collected the empirical distribution of lab measured pH values with the 2.5% and 97.5% quantiles and saved these for use in the spatial soil modelling exercise.
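The grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the snapping of field pH to the nearest 0.5 unit, and the linear-interpolation quantile are all assumptions made for the example, and the source does not say whether the quantiles were stored alongside the distribution or used to trim it.

```python
from collections import defaultdict


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    if not sorted_vals:
        raise ValueError("cannot take a quantile of an empty group")
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def build_empirical_distributions(pairs):
    """Group paired (field_ph, lab_ph) records by field pH class.

    Returns a dict keyed by field pH class; each entry holds the sorted lab
    values and their 2.5% and 97.5% quantiles.
    """
    groups = defaultdict(list)
    for field_ph, lab_ph in pairs:
        key = round(field_ph * 2) / 2  # assumed: snap to 0.5 pH unit classes
        groups[key].append(lab_ph)

    distributions = {}
    for key, lab_values in groups.items():
        lab_values.sort()
        distributions[key] = {
            "lab_values": lab_values,
            "q025": quantile(lab_values, 0.025),
            "q975": quantile(lab_values, 0.975),
        }
    return distributions
```

Keeping the full list of lab values for each class, rather than only a fitted mean and variance, is what allows a later step to draw directly from the observed spread.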

Spatial modelling of pH using both field and lab values

Covariate data intersection proceeded as in a standard digital soil mapping workflow. Information about the source and nature of the covariates can be found here and here.

We used a Random Forest model to fit the relationship between measurements and covariates. The Random Forest model uses the bootstrap resampling approach to iteratively develop the relationships between target variable and predictor variables.

Our modelling also included a repeated (n = 50) bootstrap resampling approach, but differed in that on each iteration any selected data that were field data had to be converted to a ‘lab’ measurement. This ‘lab’ measurement was derived by drawing a value at random from the empirical distribution corresponding to the field measurement. In this way, the observed variability associated with field measurements is incorporated into the modelling, which also provides a seamless way to use both data types.

A key assumption here is the universality of the empirical functions; deeper investigation might reveal an assortment of functions that are applicable in some areas but not in others. Investigating this is probably warranted, but to be useful one would need enough data to fit these more localised empirical functions. Similarly, spatial relationships between field and lab data are not considered in this method, but logic suggests that where lab measurements are located close to field data, we might expect similar values. Whether there are enough data to establish these spatial relationships is something a new investigation would need to determine.
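A minimal sketch of this resampling idea, under stated assumptions, is shown below. The site record layout (`id`, `lab_ph`, `field_ph`), the function name and the rule that a lab value is used whenever one exists are hypothetical choices for illustration; fitting the Random Forest to each realisation is left out.

```python
import random


def bootstrap_with_field_conversion(sites, lab_dists, n_iter=50, seed=0):
    """Yield n_iter bootstrap realisations as lists of (site_id, ph) pairs.

    sites     : list of dicts with keys "id", "lab_ph" (float or None), "field_ph"
    lab_dists : dict mapping a field pH class to a list of lab pH values
                observed for that class (the empirical distribution)
    """
    rng = random.Random(seed)
    for _ in range(n_iter):
        # Resample sites with replacement (the bootstrap step).
        sample = rng.choices(sites, k=len(sites))
        realisation = []
        for site in sample:
            if site["lab_ph"] is not None:
                # Assumed rule: a lab measurement is used as-is when available.
                ph = site["lab_ph"]
            else:
                # Field-only site: draw a 'lab' value at random from the
                # empirical distribution for its field pH class.
                ph = rng.choice(lab_dists[site["field_ph"]])
            realisation.append((site["id"], ph))
        yield realisation
```

Because the draw is repeated on every iteration, a field-only site contributes a different plausible 'lab' value each time, so the spread of the ensemble reflects the field measurement uncertainty.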

The process of spatial modelling was relatively standard once the data integration step was done. Models were developed for each specified depth interval: 0-5cm, 5-15cm, 15-30cm, 30-60cm, 60-100cm, 100-200cm. Our investigations also revealed some benefit in modelling the Random Forest model residuals using variograms. The models were evaluated using a data set of 10,000 sites, meaning that the number of cases available to evaluate models differed with each depth interval, as more cases are found at the surface and near surface and numbers drop off with increasing soil depth. We used the prediction interval coverage probability (PICP) to assess the veracity of the uncertainty quantifications.

Soil pH mapping was output at the ~90m grid resolution in accordance with SLGA specifications. We also investigated modelling and mapping soil pH at 30m grid resolution, given the availability of the national 30m covariate stack.

Besides the difference in spatial resolution between the 30m and 90m covariate stacks, some covariates are available at one resolution and not the other. Therefore, models developed for the 90m resolution cannot be applied to the 30m covariates. As such, a 30m-specific collection of models was developed in the same way as for the 90m models, including the variogram modelling of the Random Forest model residuals. Even for the 30m modelling there is good reason to investigate variogram modelling of the residuals; however, for the mapping and visualisation shown below, the 30m resolution pH maps are from the Random Forest modelling only. For the 90m resolution maps, the Random Forest with kriged residuals mapping is displayed.
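For readers unfamiliar with the prediction interval coverage probability, the sketch below shows one common way to compute it: the percentage of withheld observations that fall inside their prediction interval. The function name and argument layout are assumptions for illustration only.

```python
def picp(observed, lower, upper):
    """Percentage of observations that fall within [lower, upper]."""
    if not observed or not (len(observed) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    inside = sum(
        1 for obs, lo, hi in zip(observed, lower, upper) if lo <= obs <= hi
    )
    return 100.0 * inside / len(observed)
```

For a well-calibrated model, intervals built at a nominal 90% confidence level should give a PICP close to 90, and similarly for other confidence levels.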
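Variogram modelling of model residuals usually starts from an empirical semivariogram. The sketch below computes one with plain Python, assuming projected coordinates in consistent distance units. The source does not state which variogram model, binning or software was used, so the function name, `lag_width` and `max_lag` parameters are illustrative only.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, residuals, lag_width, max_lag):
    """Return a list of (lag_centre, semivariance, pair_count) per distance bin.

    coords    : list of (x, y) tuples in projected units
    residuals : model residuals (observed minus predicted) at those points
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            if h >= max_lag:
                continue  # ignore pairs beyond the maximum lag distance
            lag_bin = int(h // lag_width)
            # Semivariance contribution: half the squared difference.
            sums[lag_bin] += 0.5 * (residuals[i] - residuals[j]) ** 2
            counts[lag_bin] += 1
    return [
        ((b + 0.5) * lag_width, sums[b] / counts[b], counts[b])
        for b in sorted(counts)
    ]
```

The pairwise loop is quadratic in the number of points, so for a national data set a spatial index or a dedicated geostatistics package would be the practical choice. A variogram model fitted to these binned values would then drive the kriging of the residuals.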


Model evaluations

Evaluated on the completely withheld data from 10,000 sites, the table below summarises model goodness of fit in terms of coefficient of determination, concordance, root mean square error and bias. This is shown for modelling at both 90m and 30m resolution, with and without spatial modelling of the Random Forest model residuals. The figures below and to the right show xy-plots of the 90m models with and without the additional variogram modelling of Random Forest model residuals. PICP plots (RF + residual model) show an acceptable quantification of uncertainty, on the basis of the correspondence between the nominal confidence level and the associated prediction interval coverage. Overall, the fitted models do not show substantial bias, but there is some indication that the variogram modelling reduces it a little further. It is difficult to distinguish the model evaluations for the 90m and 30m modelling. The ultimate benefit of the finer resolution modelling lies in the mapping being more granular and therefore more suitable for use cases with relatively small mapping extents, such as farm and even field scales.
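The four summary measures named above can be sketched as follows. The exact definitions used for the table are not given in the text, so this example assumes the coefficient of determination is computed as 1 - SSE/SST, concordance is Lin's concordance correlation coefficient, and bias is the mean of predicted minus observed values.

```python
import math


def goodness_of_fit(observed, predicted):
    """Return R2, Lin's concordance, RMSE and bias for paired values.

    Assumes the observed values are not all identical (non-zero variance).
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    sse = sum((p - o) ** 2 for o, p in zip(observed, predicted))

    return {
        "r2": 1.0 - sse / (var_o * n),  # 1 - SSE/SST
        "concordance": 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2),
        "rmse": math.sqrt(sse / n),
        "bias": mean_p - mean_o,  # positive means over-prediction
    }
```

Concordance penalises both scatter and systematic offset from the 1:1 line, which is why it is often reported alongside the coefficient of determination.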

Model goodness of fit evaluations for both model resolutions at each depth interval, with and without variogram modelling.

xy- and PICP plots evaluating both the models and their uncertainties against a test dataset of measured soil pH data.


At the national extent there is very little apparent difference between the 90m and 30m mapping for the 0-5cm layer shown below. Zooming right down to the spatial extent of the CSIRO Boorowa farm, we can compare the mapping against digital mapping specific to the farm, which was produced at 5m resolution following a detailed reconnaissance soil survey. As shown below, there is clear correspondence between the maps, noting that the 30m mapping is produced without the kriged residuals of the Random Forest model, which might explain some systematic differences when compared against the on-farm mapping. In any case, the variability of surface soil pH is relatively small regardless of the mapping product.

National extent soil pH mapping (4A1 method) for 0-5cm depth interval for 90m (TOP) and 30m (BOTTOM) modelling.


Soil pH (0-5cm) mapping focused on the CSIRO Boorowa farm (~220ha) in southern NSW. TOP: 30m modelling output. MIDDLE: 90m modelling output. BOTTOM: 5m modelling output as described in Malone et al. 2022.