In this study the authors use GRACE JPL mascon data to evaluate simulated total water storage (TWS) for 10 land surface models (LSMs) and global hydrological models (GHMs) over the Magdalena-Cauca basin (Colombia) and its sub-basins. They find that the models differ in their ability to represent trends, seasonality and monthly time series, with model accuracy decreasing from trends/seasonality to time series, from higher- to lower-resolution models, and from larger to smaller basins. One of the models is declared the overall winner of the comparison.
I have the following comments:
Although this is an interesting and worthwhile exercise in itself, I am a bit hesitant about the novelty of this study. What exactly are the general conclusions we can draw from applying specific models to a specific basin? Global comparisons have been made before, as also attested by the references in the paper (Scanlon et al., 2016, 2018; Schellekens et al., 2017). What does a regional study add to that? Does a study like this fit a general-purpose hydrological journal like HESS, or does it better fit a more applied journal that publishes well-executed case studies? I leave it up to the editor, but if it is accepted, the authors should make clear what is novel about this work.
Using GRACE for sub-basins below 40,000 km2 in size is very tricky, even if mascons are used. The inherent resolution of GRACE is too coarse for this. This means that the results for the smaller basins are questionable at best, and that the differences between the models and GRACE stem partly from the models and partly from the GRACE estimates. The question then is which part of the deviation comes from the models and which part from GRACE. The authors should either leave out the smaller basins or be very upfront about this limitation already in the Introduction/Methods section, and not wait until the Discussion.
Line 33-35: The argument that knowing TWS leads to better forecasts is often used. Please provide examples from the literature showing that significantly better streamflow forecasts are obtained when GRACE TWS is ingested into the model.
Line 35-42: I have to say that this argumentation is a bit silly. Before GRACE, nobody cared about validating the TWS of hydrological models at all! The reason is that it could not be observed. Before GRACE, only partial state variables, such as groundwater, river and lake levels, soil moisture and SWE, were independently evaluated using in-situ and remotely sensed data. Only after GRACE could TWS anomalies be validated, and only then were they computed from models.
Figure 11 shows that WaterGap and Lisflood both show relatively poor performance in reproducing TWS anomalies. What is striking is that both of these models have been subjected to some sort of calibration to streamflow data (see the paper by Beck et al., 2017, where they perform very well in streamflow reproduction). Could it be that calibrating GHMs to streamflow only (without constraining internal states and fluxes by other information) has led to correcting errors in streamflow by accruing errors elsewhere in the model?
References
Beck, H. E., van Dijk, A. I. J. M., de Roo, A., Dutra, E., Fink, G., Orth, R., and Schellekens, J.: Global evaluation of runoff from 10 state-of-the-art hydrological models, Hydrol. Earth Syst. Sci., 21, 2881–2903, 2017.
Scanlon, B., Zhang, Z., Save, H., Sun, A., Schmied, H., van Beek, L., Wiese, D., Wada, Y., Long, D., Reedy, R. C., et al.: Global models underestimate large decadal declining and rising water storage trends relative to GRACE satellite data, Proceedings of the National Academy of Sciences, 115, E1080–E1089, 2018.
Scanlon, B., Zhang, Z., Rateb, A., Sun, A., Wiese, D., Save, H., Beaudoing, H., Lo, M., Müller-Schmied, H., Döll, P., et al.: Tracking seasonal fluctuations in land water storage using global models and GRACE satellites, Geophysical Research Letters, 46, 5254–5264, 2019.
Schellekens, J., Dutra, E., la Torre, A. M.-d., Balsamo, G., van Dijk, A., Weiland, F. S., Minvielle, M., Calvet, J.-C., Decharme, B., Eisner, S., et al.: A global water resources ensemble of hydrological models: the eartH2Observe Tier-1 dataset, Earth System Science Data, 9, 389–413, 2017.
We would like to thank the anonymous reviewer #1 for the detailed analysis of the manuscript and constructive comments. In the attached document we reflect on the comments made by the reviewer and how we propose to change/improve the manuscript in response to the issues raised. The reviewer's comments are included for convenience (in blue font), together with a detailed response to each comment and the changes proposed to the manuscript.
Review of "Benchmarking global hydrological and land surface models against GRACE in a medium-size tropical basin", Bolaños et al.
This manuscript highlights the use of an external source of data (JPL GRACE TWS monthly anomalies) to benchmark 10 different global hydrological and land surface models using results of the earth2observe project, in a well-instrumented tropical basin in Colombia, the Magdalena-Cauca (MC) macrobasin as the area of study. Findings identify characteristics and limitations of the models and are a key input for contributing to identifying new developments and improvements of these types of models.
The article is well written and organized, and discusses the main findings nicely. The objectives the paper sets out to achieve are of interest, and there is scientific merit in publishing it. Below are some specific comments to the authors:
In the abstract (line 11) and the methodology, the analysis of long-term tendencies in terrestrial water storage (TWS) is based on JPL GRACE data from 2002-2014. What are the limitations of these estimations, considering that the period is short (only 13 years), that the MC has large inter-annual climate variability associated with ENSO and other phenomena, and that the base period used to calculate the anomalies is also short (2004-2009)?
Although it is not completely clear in the manuscript, because it is not explicitly mentioned in the Data and Methods section, it seems (see line 103, line 221) that TWS is calculated from the models' results and the JPL GRACE data at the macrobasin and subbasin scales using the average of the values for all the cells in the corresponding domain and time step. If this is true, this approach could have some limitations that the authors should address in the discussion and conclusions. If not, an explanation of the methodology used and its limitations should be included in the manuscript.
In lines 170 and 418 it is important to consider that from WRR1 to WRR2 some models also underwent some type of calibration, though not necessarily in the MC basin.
The legend used for the different models and modelling phases (WRR1 and WRR2) is consistent throughout the document. However, the first time the legend is introduced is in Table 2. Perhaps an explanation of the legend in this Table would facilitate the analysis right from the beginning of the paper.
Equation 3 proposes a way to decompose the time series of TWS into seasonality, long-term trend, and residuals. For the first two components, a detailed analysis is conducted; however, for the residuals this is not the case. An analysis of the residuals would nicely complement the findings of the study.
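For reference, a decomposition of the kind referred to here typically takes the standard additive form; the notation below is illustrative only and is not taken from the manuscript's Equation 3:

```latex
\mathrm{TWS}(t) = S(t) + T(t) + R(t)
```

where $S(t)$ is the seasonal cycle, $T(t)$ the long-term trend, and $R(t)$ the residual. Diagnostics on $R(t)$, such as its variance, autocorrelation, or correlation with climate indices, would make the suggested residual analysis concrete.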
In Figure 2, 7 different GHMs appear, including SWBM_Exp1 (in addition to SWBM). The experiment with this model is not described in Table 2 or in the text. For consistency in the document, where 10 models are analyzed, this experiment should be dropped from the analysis.
Previous studies that have used discharge to investigate the performance of the earth2observe models in the MC basin have shown that LISFLOOD obtains the poorest results, as is also confirmed in this study (line 239). Reasons for the low performance of this model in the MC are not discussed in the document and would be helpful to include.
In several parts of the article a threshold of 60,000 km2 has been proposed as the basin size limit for the use of GRACE data to validate the models. In this sense, would the Cauca (C) basin be an exception? How do the different climatological regimes in the C and Upper Magdalena (UM) basins influence the results? It is evident that for the small basins, including the UM, Upper Magdalena Paez (UMP), and Saldaña (S), results are poor, and this is the reason for the size limit proposed. However, results in the UM are poor right from the start, so for other subbasins in this area it would be expected that results are also poor. What would happen if, instead of considering subbasins in the UM, you chose subbasins in the C (in addition to the Upper Cauca (UC), whose size is small and surely below the limit), where results are much better?
In the study, only five subbasins have drainage areas close to or below 60,000 km2. Considering the climatological and physical complexity of the MC macrobasin, in my opinion there is not enough information to establish the proposed threshold as a basin size limit for evaluating model performance against GRACE data.
Following the previous comments: for the UM and C basins, which are of approximately the same size, there is quite a contrast in the results, with those for the first subbasin much lower than for the second. Results similar to those presented in the study have been obtained in the UM with several different models, not only global but also regional and local; in this sense, any model structure seems to perform poorly in the UM. Could problems in the precipitation forcing used for this basin be part of the reason? Recent studies (unpublished) have shown that in some basins of the UM, including the S, the average monthly precipitation and discharge patterns do not match. Rainfall is mainly bimodal, as captured by the models' forcings in this study, but streamflow is mainly unimodal. This could be associated with anthropogenic interventions, clearly discussed in the manuscript, but also with climatological forcing limitations that need to be addressed in the paper.
In line 274 it should be Figure 4c instead of Figure 5c.
Figures 5 and 6 (line 282, line 309) could, in my opinion, be moved to the supplementary material, as they are not key to supporting the main findings described in the article. Instead, the analysis of the residuals could perhaps better support the analyses and discussion.
In line 283 it should be "In these figures…" instead of "In this Figure…".
For Figure 4 there is enough space in the graph to include the accompanying legend, which would facilitate the interpretation of the results.
The sentence in line 321 is not clear.
In line 335, perhaps the analysis of the residuals would quite nicely complement the results.
Results for the WATERGAP3 model in the Limpopo River Basin have shown good performance of this model (line 356). Results in the MC and some of its subbasins have also been good for this model when discharge observations are used. How should one interpret the fact that, when GRACE data are used as a complementary source of validation, results for this model deteriorate so much?
The SWBM_Exp1 model also appears in Figure 11; it should either be described in the manuscript (considering 11 instead of 10 models) or be dropped from the analysis.
In line 448, besides the reasons given for the poor performance of the models in the UM, influence from the Orinoco and Amazon macrobasins may also play a role in the results. Some consideration of this is recommended for inclusion in the discussion.
Instrumentation in the UM, especially at higher altitudes, could in my opinion help to separate the influence of anthropogenic interventions from the limitations in precipitation forcing, and to show how they impact the streamflow patterns observed for this part of the MC catchment.
We would like to thank the anonymous reviewer #2 for the detailed analysis of the manuscript and constructive comments. In the attached document we reflect on the comments made by the reviewer and how we propose to change/improve the manuscript in response to the issues raised. The reviewer's comments are included for convenience (in blue font), together with a detailed response to each comment and the changes proposed to the manuscript.