The revised version of the manuscript by Zhang et al. has definitely improved from a methodological point of view. The predictor selection is now based on a procedure that automatically identifies pixels of large-scale climate variables that correlate significantly with summer precipitation anomalies. In a next step, the pixel anomalies in specific regions are averaged, and the different variables are combined by means of a PCA. The PCA time series (explaining 95% of the overall variance) are finally used as predictors for a linear-regression-based prediction. The procedure is now fully cross-validated, which results in poor to moderate prediction skill.
While the application of the method is solid (apart from some smaller issues, see the specific remarks), the results of the statistical forecast model are worse than those of some dynamical models (and sometimes even worse than climatology). Except for two clusters, all correlations between observations and cross-validated forecasts are below 0.2, which is not statistically significant for a period of 30 years. Thus, in my opinion, the study is not yet ready for publication in HESS. However, I see considerable potential in the study from a hydro-climatological point of view. The major question for me is why there is such a big difference in the skill of the forecast models, although the clusters are very close to each other. This could either be a statistical artefact or it could be explained by large-scale or regional climate mechanisms. I believe that the results are robust and physically interpretable (all clusters with moderate forecast skill are windward of the East-African monsoon), but this should be investigated in some way. In the revised manuscript the authors give some more information on the general climate characteristics and also mention that different cluster regions are correlated with different climate modes and SST patterns. I still feel that this is insufficient for the interpretation of the results. I suggest giving a short literature overview of the climatic mechanisms in the introduction. Which anomalies (pressure patterns, moisture fluxes, etc.) trigger precipitation surpluses and deficits? How are these patterns related to ENSO and other potential predictors? There is quite a lot of literature on the monsoon in general.
What kind of predictors are important for which cluster? Maybe this could eventually explain the different model skills and support the robustness of your models. Further, it would definitely justify the clustering procedure. My impression from the gridded analysis is that the clustering doesn’t make a difference. Is that correct?
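The significance statement above (correlations below 0.2 over a period of 30 years) can be checked with a short calculation. The following is only a minimal sketch: it uses the Fisher z approximation rather than any test from the manuscript, and it assumes that the 30 yearly values are independent (no serial correlation). The function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant at level alpha (two-sided)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return tanh(z_crit / sqrt(n - 3))


n_years = 30  # length of the verification period (assumed independent years)
p_02 = fisher_p_value(0.2, n_years)
r_crit = critical_r(n_years)

print(f"p-value for r = 0.2, n = {n_years}: {p_02:.2f}")
print(f"critical |r| at the 5% level: {r_crit:.2f}")
```

Under these assumptions a correlation of 0.2 is far from the 5% level, and a correlation of roughly 0.36 would be needed for 30 years. Serial correlation in the series would raise that threshold further.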
Minor remarks:
- Fig. 1: Please show borders and rivers to make orientation easier. The seasonal precipitation amounts might be better shown as a bar plot.
- P3L24 (and others): For me, the term “scenario” stands for future climate change assessments. Do you mean similar large-scale conditions? The term is likewise misleading in Sec. 3.
- There’s a bit more information on the clustering procedure now. However, the selection of k is still not clear. WSS is smallest for the largest possible k, right? I see that the authors refer to their previous publication, but at least the basic information should be provided here.
- I was a bit lost in the section on the statistical modelling approach. Would it be possible to structure it into predictor selection, model calibration and evaluation?
- Predictor regions: Why were those regions chosen? (Maybe a broader literature review could justify the selection.) Please also name the regions in the map.
- Predictor selection approach: In step 3, the “best” grid cells are spatially averaged. This is not clear to me. If there are both positive and negative correlations in a region, the averaging might cancel out the predictive skill. For example, in the southern Indian Ocean both positive and negative correlations are detected in Fig. 5. Likewise, the North Atlantic SLP domain, which contains not only the Azores (high) but also Iceland (low), might be problematic. Wouldn’t it be more straightforward to use all grid cells directly for the PCA?
- PCA: No information is given on the standardization of the predictor variables. If the different variables are not standardized, the final PCA might be affected much more by SLP (values around 1000 hPa) than by SST (around 25 °C).
- Predictor selection: How many PCA predictors are used in the linear models at the moment? Are all of them correlated with the predictand? (Theoretically, the stepwise procedure, which the authors tested for the original predictors (author response), could also be used for the PCA-based predictors. The predictors used in the linear regression (most likely very few) might then be easier to interpret. Further, it might reduce overfitting, which is assessed by the cross-validation.)
- P10L18: Please only use the term “significant” if you conducted a test (otherwise use remarkable, great, etc.).
- Dynamical models: I am not sure if it’s really necessary to do that comparison, and I still feel that it is poorly integrated. The heart of the study is the statistical approach, and I think one can easily argue that statistical predictions are good for operational use (without comparing with complex statistical models?).
- Cluster results (Fig. 7): How is the standardization conducted? The y-axis should be limited to the range from -2 to 2 to better present the results.
- Precipitation trends (Fig. 7): The two clusters with skillful models (5 and 7) seem to have positive precipitation trends. This might lead to an overestimation of skill, since every variable with a similar trend could be used as a predictor. Thus I suggest testing the model with detrended time series.
- P23: The argument that the statistical models are of higher resolution than the dynamical models is misleading. One could downscale their results in the same way as is done with the cluster results. Further (as I mentioned during the first review round), the high-resolution (indirect) forecast has the same temporal variability as the cluster forecast (due to the univariate linear relationship) and thus does not contain much additional information. The analysis of different statistical relationships at the cluster level seems to be more relevant (in my view).
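To illustrate the remark on the predictor selection approach (averaging grid cells with correlations of opposite sign), here is a minimal sketch with purely synthetic series. The series, the variable names and the helper functions are hypothetical and are not taken from the manuscript. One cell follows the precipitation signal, the other follows its negative, and each has its own unrelated noise.

```python
from math import pi, sin, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


N = 30  # number of years (synthetic)


def wave(k):
    """Synthetic zero-mean series with k full cycles over the N years."""
    return [sin(2 * pi * k * t / N) for t in range(N)]


precip = wave(3)  # synthetic predictand
# Grid cell positively related to the predictand, plus unrelated variability.
cell_pos = [p + 0.5 * e for p, e in zip(precip, wave(7))]
# Grid cell negatively related to the predictand, plus unrelated variability.
cell_neg = [-p + 0.5 * e for p, e in zip(precip, wave(11))]
# Plain spatial average of the two "best" grid cells.
region_mean = [(a + b) / 2 for a, b in zip(cell_pos, cell_neg)]

r_pos = pearson(cell_pos, precip)
r_neg = pearson(cell_neg, precip)
r_mean = pearson(region_mean, precip)

print(f"cell with positive relation: r = {r_pos:+.2f}")
print(f"cell with negative relation: r = {r_neg:+.2f}")
print(f"spatial average of both:     r = {r_mean:+.2f}")
```

In this constructed case each cell is strongly related to the predictand, while their spatial average carries essentially no signal. Real fields will rarely cancel this completely, but the example shows why a region with mixed signs should not simply be averaged.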
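The remark on precipitation trends can be illustrated in the same way. The sketch below uses synthetic data only and assumes a linear least-squares trend; the authors may prefer a different detrending method. Two series share nothing but a common upward trend, and the correlation is compared before and after detrending. All names are hypothetical.

```python
from math import pi, sin, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def detrend(series):
    """Remove the least-squares linear trend (and the mean) from a series."""
    n = len(series)
    t_mean = (n - 1) / 2
    s_mean = sum(series) / n
    slope = sum((t - t_mean) * (s - s_mean) for t, s in enumerate(series)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    return [s - s_mean - slope * (t - t_mean) for t, s in enumerate(series)]


N = 30  # number of years (synthetic)
trend = [0.1 * t for t in range(N)]  # common upward trend

# Year-to-year variability of the two series is unrelated by construction.
precip = [tr + sin(2 * pi * 3 * t / N) for t, tr in enumerate(trend)]
predictor = [tr + sin(2 * pi * 7 * t / N) for t, tr in enumerate(trend)]

r_raw = pearson(precip, predictor)
r_detrended = pearson(detrend(precip), detrend(predictor))

print(f"correlation with trend:  r = {r_raw:+.2f}")
print(f"correlation, detrended:  r = {r_detrended:+.2f}")
```

Here the raw correlation is moderate (about 0.5), whereas the detrended correlation is close to zero. If the skill of clusters 5 and 7 survives such a test, this would strengthen the case that it reflects interannual predictability rather than a shared trend.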