Based upon the reviews, the article needs major revision. The authors are kindly invited to follow the reviewers' indications when preparing the revised manuscript or, where they disagree with a comment, to explain in detail the reasons why.
Suggestions for revision or reasons for rejection - Process-based modelling to evaluate simulated groundwater levels and frequencies in a Chalk catchment in Southwest England.
Brenner et al.
This is a revised version of a manuscript detailing the simulation of groundwater levels in a karst environment in the southwest of England. I described in my review of the original manuscript how this topic is highly relevant to the journal and timely, given the increasing need to simulate and forecast groundwater levels from limited datasets. The manuscript is suitably concise and the description is clear. I fully agree with the authors’ statement in their Abstract that “specialised modelling approaches are required that balance model complexity and data availability”. The authors assess whether they have achieved this balance both by exploring the identifiability of the parameters within their model (using the Shuffled Complex Evolution Metropolis algorithm; SCEM) and by comparing model performance metrics for calibration and validation datasets (i.e. a split-sample test). They conclude that their modelling exercise was a success because their analyses suggest that all of the parameters are identifiable and the differences between the calibration and validation metrics take values they consider to be small.
Whilst I fully endorse the authors’ general approach to assessing the performance of their model, I have severe concerns about the exact way in which it has been implemented. I do not believe that the posterior distributions of the parameters yielded by the SCEM accurately reflect the uncertainty of these parameters. Furthermore, I do not believe that the comparison between performance metrics for calibration and validation is particularly meaningful. For these reasons I do not recommend that the manuscript be accepted for publication.
I first detail my concerns about the analyses of parameter identifiability. Looking at Figure 5, it is apparent that, according to the SCEM, when the model is calibrated using only discharge data the Kc parameter (for example) is almost perfectly identifiable. This posterior distribution indicates that the parameter definitely has a value less than 1. However, when all of the calibration data are used, the parameter definitely has a value greater than 9. This is a clear contradiction, and at least one of these two posterior distributions must be incorrect. Similar contradictions are evident for all of the other parameters except those related to the groundwater level in a specific borehole.
Furthermore, the theoretical justification for the authors’ choice of the Kling-Gupta efficiency (KGE) as the objective function within the SCEM is rather weak. The formal theory of Markov chain Monte Carlo methods such as the SCEM requires that the objective function be a likelihood (i.e. the probability that the data are realised from the proposed model). The authors indicate that the KGE can be treated as an ‘informal’ likelihood function and refer to a paper by Smith, Beven and Tawn. That paper does discuss informal likelihood functions and describes sufficient conditions for them to satisfy the most fundamental axioms of a probability; as far as I can see, however, it does not explicitly mention the KGE. The starting point for satisfying the axioms of probability is that the informal likelihood function can be written as an Lp-norm, and it is not immediately clear to me that the KGE can be. I am therefore unclear about the relevance of the Smith et al. paper to the authors’ study, and I am not convinced that the KGE satisfies the fundamental axioms of a probability. I would have thought these axioms were a necessary requirement for a function to be treated as a likelihood.
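To make the structure of the score concrete: the KGE combines three components – the linear correlation r, the ratio of standard deviations α, and the ratio of means β – into a Euclidean distance from the ideal point (1, 1, 1). A minimal sketch (the function name `kge` is mine, not the authors'):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency:
    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2),
    where r is the linear correlation between simulated and observed
    series, alpha = std(sim)/std(obs) and beta = mean(sim)/mean(obs).
    A perfect simulation gives KGE = 1."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```

As the sketch shows, the score aggregates correlation, variance and mean ratios rather than pointwise residuals, so rewriting it as an Lp-norm of model errors is, at the very least, non-trivial.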
The authors do not provide any calibration diagnostics which might indicate that the SCEM has converged to a stable posterior distribution.
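By way of illustration, one standard diagnostic that could be reported is the Gelman-Rubin potential scale reduction factor, computed across the SCEM's parallel chains for each parameter. A minimal sketch, assuming `chains` is an (m, n) array of m parallel chains of n samples for a single scalar parameter (the function name is mine):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin potential scale reduction factor (R-hat) for one
    scalar parameter sampled by m parallel chains of length n.
    Values close to 1 are consistent with convergence; values well
    above 1 indicate the chains have not mixed."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Reporting such a statistic (or trace plots) for every parameter would at least demonstrate that the sampler had reached a stable posterior before the distributions in Figure 5 were drawn.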
The authors do not conduct any validation tests which might indicate that the posterior distributions reflect the uncertainty of the model parameters.
I similarly have a number of concerns about the authors’ split-sample tests. First, the authors conclude that decreases in model performance upon validation of 11% and 21% are sufficiently small to indicate robust model performance, and they refer to other studies where similar decreases were observed. However, the judgement that 21% is ‘sufficiently small’ is entirely subjective. The expected decrease will be a complex function of the number of observations and of the seasonality, variability and autocorrelation of the data; the comparison with other studies is therefore irrelevant. That said, if I were to compare these values with the results of modelling exercises I have previously undertaken, I would consider 21% a relatively large decrease.
In their ‘Responses to comments’ document the authors describe how the KGE objective function “was chosen by trial and error comparing the simulation performances during calibration and validation obtained different objective functions (RMSE and other)”. This ‘trial and error’ approach concerns me greatly. The validation data should not be involved in the model calibration in any way – this includes decisions about how the model is calibrated and which objective function is used. In my opinion, the use of the validation data in this manner invalidates the authors’ split-sample tests. Given infinite patience, the authors will almost inevitably find a calibration set-up that yields results they find pleasing; however, such a set-up is likely to be honed to the particular characteristics of the data they have used and is likely to perform less well as other data become available.

Suggestions for revision or reasons for rejection - The manuscript from Brenner et al. deals with the modelling of fractured and karstified aquifers in England. I found the topic well developed and of interest to an international audience. The contribution is well organised and contains interesting data and discussion. Some minor comments are detailed below:
- Introduction: the literature on this topic is wider than described in the text. Regarding potential future changes in groundwater dynamics (lines 14-15, p. 2), several additional examples can be found in the recent literature; please add references to recent papers.
- Results: what are the "hardly identifiable parameters" (line 22, p. 7)? Please list the parameters that are not well identifiable and discuss this limitation in the Discussion section.
- Discussion: on the same topic, line 38, p. 8: "all model parameters are identifiable" – this does not appear to be supported by the figure.
- Lines 26-27, p. 9: the sentence starting with "This is obvious" is not clear to me; please revise and explain it better.
- Conclusions: this chapter needs revision; at present it reads more like an abstract than a conclusion. Please identify the main findings and the main limitations of your study and, if possible, list them clearly and concisely as "take-home messages".

Minor revisions are required to your article. When preparing the revised version, please take into due account all comments and suggestions from the reviewers. In case you disagree with any of them, please indicate the reasons why. I look forward to receiving your revised paper.
Minor revisions are required before acceptance.
Most of the corrections concern the references. Please strictly follow the citation guidelines and, in particular, list the references in the text in chronological order.
Some additional references are also suggested in the attached file.
Comments to the Author: PDF 