This manuscript describes the creation and application of a new Python-based software package, "SITool", for evaluating Arctic and Antarctic sea ice in global climate models. The authors use SITool to analyze CMIP6 OMIP models in terms of their sea ice concentration, thickness, snow depth, and ice drift. They find that model biases exceed observational uncertainties, and they note improved model performance with the JRA-55 atmospheric forcing relative to the CORE-II atmospheric forcing. No single model performs best in all metrics, as no link is found between performance in one variable and performance in another.
This manuscript is thorough and well-organized. The figures and tables support the discussion well, and the analysis clearly demonstrates the utility of SITool. My main comment is that, while discussion of model ranking is distributed throughout the manuscript, there is no section devoted to synthesizing the cross-metric analysis, nor a figure documenting the rankings (mentioned in the Conclusions on Page 20, Line 468). I recommend adding a short paragraph to the Results section summarizing the findings and implications of the cross-metric analysis. While model ranking may not be the primary goal of the tool, it is mentioned often enough in the manuscript to warrant further discussion and context prior to the Conclusions. The authors may consider moving the text on Page 20, Lines 470-473 to the added paragraph and/or including a table of the best- and worst-performing models for each metric in the main text or Appendix.
Overall I recommend minor revisions to address the cross-metric analysis and the specific comments below.
Specific Comments
Page 1, Line 11: I recommend replacing the phrase “bi-polar” with “Arctic and Antarctic” throughout the manuscript for clarity.
Page 2, Line 49: I recommend expanding briefly on what is meant by “rather limited” to describe which sea ice diagnostics are provided in ESMValTool and which are unique to SITool.
Page 2, Line 52: Can you please clarify what is meant by SITool providing “qualitative” information? The tool seems to be used primarily for calculating model biases and related metrics, which I would consider primarily quantitative.
Page 3, Line 92: Can you please clarify here if the interpolation is a component of the SITool workflow or is a preprocessing step that needs to be completed before using SITool?
Page 4, Line 117: It would be helpful to have a brief sentence explaining why February and September were chosen (for example, why February instead of March).
Page 6, Line 184: Please list here the respective resolutions of CMCC-CM2-HR4 and CMCC-CM2-SR5 or provide a reference to Table 1.
Page 9, Line 259 (and page 14, line 359): Throughout the manuscript, I recommend using “finer” or “higher” spatial resolution versus “increased”.
Page 10, Figure 2 (as well as Figures 5, 7, 10): It would be helpful to remind the reader in each of these figure captions that lower values indicate better skill.
Page 14, Line 355: “…the ice edge location simulations in the Arctic are much better than that in the Antarctic.” This is an interesting and logical point that you’ve quantified. Perhaps this has also been shown elsewhere? If so, reference(s) would be helpful.
Page 16, Line 420: Can you please clarify what is meant by “different observational references” in this sentence? Different from what?
Page 17, Figure 8 (and page 18, Figure 9): I recommend a new color map for these figures as the chosen color map may present challenges for readers with red-green color blindness.
Page 19, Line 446: On page 6, line 144 the authors write that two observational references are used for each variable, but here the phrase “at least two” is used. Can you please clarify if you mean that SITool is equipped to handle more than two sets of observational references?
Page 21, Line 488: “While it is running, SITool (v1.0) produces ancillary maps and time series that can be consulted by the expert to understand the origin of one particular metric value.” I believe this means that SITool automatically creates the kinds of maps provided in Appendix A, and if that’s true, please reference Appendix A here. It would also be useful to note in Section 2 that SITool automatically outputs the differences (which may be just as useful to some users) in addition to the scaled metrics.
Technical corrections
Page 2, Line 60: I recommend rephrasing the grammar of the final sentence to something such as:
“The SITool is written in the open-source language Python and distributed under the Nucleus for European Modelling of the Ocean (NEMO) standard tools. SITool is provided with the reference code and documentation to make sure the final results are traceable and reproducible.”
The manuscript "SITool (v1.0) – a new evaluation tool for large-scale sea ice simulations: application to CMIP6 OMIP" describes a new Python diagnostic tool to evaluate sea ice models in the Arctic and Antarctic over the historical period. Although it is designed primarily for atmospheric reanalysis-forced simulations, as presented in the manuscript, it could be useful in other model frameworks as well. This tool is complementary to other climate model evaluation tools, such as ESMValTool. Comparison with multiple observational datasets allows for evaluation of sea ice concentration, extent, edge location, ice thickness, snow depth, and ice drift. The evaluation of Ocean Model Intercomparison Project runs is used here as an example, but it also provides results on sea ice model performance in its own right.
This manuscript is well within the scope of the journal, as it introduces a novel tool for evaluating climate model performance on a critical component: sea ice. Consistent, repeatable methods of evaluation like this are greatly needed by the community. It also provides novel results on the impact of the atmospheric forcings on the modeled sea ice (especially that model biases are significantly reduced by using JRA-55). The title and abstract capture the key points well. Methods are generally clearly described, and the code is well-documented and easily accessible. The paper is generally well structured and clearly written, but some figures could be improved for easier interpretation. It would be helpful to be clearer about how the presented results connect with the code and outputs in the published package. I believe that with the minor suggested edits and a demonstration of code implementation (either by an additional reviewer or by the authors, within the repository), this manuscript warrants publication. Note: this reviewer did not complete a test of the scripts, and it may be useful for this code to be checked and tested by someone who is experienced in working with these output types.
Comments:
L23-30: [suggestion] Separate out into 4 sentences for improved readability
L70: “either” feels imprecise here. Is it accurate to say “both atmospheric forcings, when possible, …”?
L76: It would be helpful to more clearly state how the diagnostics of concentration and thickness in previous studies (2020) are the same or different from the diagnostics proposed here.
L88-89: Including some sort of table summarizing the diagnostics available, in terms of variable and type (spatial map, mean, IIEE) would be helpful
L90: It would be preferable for this to be a complete list of additional sea ice variables, in which case "e.g." is not needed
L92-95: Please discuss somewhere (results or conclusions) the possible implications of interpolation.
L105/Fig. 1: Can this be re-oriented such that the order progresses downwards (i.e., sea ice input data above SITool)?
L105/Fig. 1: In the “Observations” portion, it would be helpful to make it clearer that extent and edge are also coming from the concentration products. In other words, it would be helpful to be explicit what observations each of the defined “metrics” are compared to
L106: [style suggestion] I think more subsections to separate out the methods would be helpful. This will be helpful in providing a quick reference for users of SITool.
L117: This sentence is hard to understand. Perhaps it could be clarified by separating into 2 sentences.
L135: Why only February and September? Is it a user option to select for other (or all months)? If so, please be clear what the difference is between options for the tool and what is being shown here as demonstration.
L187: Please describe the options for the user to select years for comparison
L204-205: Would the authors recommend freeboard be included in future CMIP model outputs for observational comparison? If so, include this in the conclusions.
L215/Table 2: separate dataset name and reference into separate columns
L220: Perhaps I have misunderstood something about the calculations, but how can you determine the typical error needed for the metric shown in Fig. 7b for Envisat without the second observational product (SnowModel-LG)?
L272: Is the primary difference between the OMIP1 and OMIP2 protocols the atmospheric forcing, JRA-55 vs. CORE-II? If so, I suggest it may be clearer to use "J" and "C" rather than 1 and 2.
L272/Fig. 2: Are there any significant differences in patterns between the two observational products (NSIDC and OSI)? If not, why not just use the normal error relative to the mean between the two products? I do not believe that the difference in comparison between the two products is a key point here, so I’m not sure it is useful.
L272/Fig 2: [suggestion] It may be helpful to show this after Figures 3 and 4 (showing specific metrics), so that these values have some context and introduction already
L272/Fig. 2: Does "Ano" in the figure refer to anomaly (i.e., interannual variability)? This needs to be made clearer, such as in the figure caption.
L272/Fig. 2: [suggestion] add some vertical space between NH and SH. “NH” and “SH” may be sufficient (rather than “North” and “South”), and would save some room
L290/Fig. 3: The multi-model mean line is hard to distinguish. Consider using a thicker or brighter/more distinctly colored line.
L310/Fig. 4: If possible, it would be helpful to explicitly label “std” and “trend” on these plots to demonstrate where values in Fig. 2 are coming from.
L316: Interesting. What are the implications of this? Should typical error not be used in this case, or used with caution? Are there similarities in how some products are derived that result in this metric having less utility?
L341-2/Fig. 5: Separate by product in panel (c) to be consistent with the other figures. (Unless products are combined with averaging, as suggested in the comment above)
L336/Fig. 5: Perhaps include in subplot title a summary of what is being evaluated, such as “Extent: models vs. NSIDC”
L366: I'm not sure I understand how you can have an IIEE for the observational product OSI-450. Is this rather the IIEE between the two observational products? In that case, should it be called the "typical error" here?
8/Fig. 9: Are these annual means/using all months? Please specify the period evaluated in the figure captions.
L489-499: Are the plots the ones that are used in the manuscript and/or the appendix? Please clarify
It would be helpful to provide an example of a completed code run and the resulting diagnostic plots in the repository on GitHub
Review of “SITool (v1.0) - a new evaluation tool for large-scale sea ice simulations: application to CMIP6 OMIP” by Xia Lin, François Massonnet, Thierry Fichefet, Martin Vancoppenolle (gmd-2021-99).
[General comments]
This paper introduces an evaluation tool for sea ice simulations and presents its application to the CMIP6-OMIP simulations available through ESGF. I think that such a tool will become a valuable asset for the climate/sea ice modeling community, and such activities should be strongly encouraged. The calculation methods of the metrics are well described, and the evaluation using this tool is well presented. The comparison between OMIP-1 and OMIP-2 simulations, which use different surface atmospheric forcing datasets, is timely and should be highly appreciated. However, I think that some discussion is needed of the proposed method for the evaluation of interannual variability and trend, as commented below.
[Specific comments]
Metrics are proposed for the monthly mean state, interannual variability, and trend, with each metric basically using a common calculation method: the difference between the simulation and an observational reference is scaled by the observational uncertainty, based on the difference between two observational datasets. For me, applying this method to the monthly mean state was understandable, but it was somewhat difficult to interpret the specific values of the metrics for interannual variability (standard deviation of monthly anomalies) and trend. If I were to evaluate the interannual variability of a simulation, I would like to know the size of the standard deviation of monthly anomalies relative to that of an observational reference. Specifically, I think that the metrics would be easier to interpret if the standard deviation were scaled by that of an observational reference and the range of values obtained by applying different observational references were presented. The same argument applies to trends, and in this case the signs of the trends could also be evaluated. I would like to ask the authors to explain the background behind the choice of the current method.
I would like to add that it would be useful and clear if the calculation methods are presented using mathematical formulas.
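To make the above concrete, the two approaches could be written as follows (the notation here is mine, inferred from the text, not the authors' own):

```latex
% Current method (as I understand it): model-reference difference
% scaled by the spread between two observational products
\epsilon_{\sigma} =
  \frac{\left| \sigma^{\mathrm{mod}} - \sigma^{\mathrm{obs}_1} \right|}
       {\left| \sigma^{\mathrm{obs}_1} - \sigma^{\mathrm{obs}_2} \right|}

% Suggested alternative: ratio of the simulated standard deviation
% of monthly anomalies to that of an observational reference
\tilde{\epsilon}_{\sigma} = \frac{\sigma^{\mathrm{mod}}}{\sigma^{\mathrm{obs}}}
```

The ratio form would directly show whether the simulated variability is too large or too small relative to the reference, and the spread of values obtained with different observational references would convey the observational uncertainty.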
[Technical corrections]
L135, 150, 164: Why is equal weight used for these metrics?
L184: “the influence model resolution” should read “the influence of model resolution”.
L286: “exits” should read “exists”.
L288: “without reduction”… I could not understand the meaning of this phrase in the sentence.
Figure 3: It was difficult for me to distinguish the lines. I would suggest separating the figures for OMIP-1, OMIP-2, and their means, that is, into a total of six figures.