This manuscript provides a seven-step methodology for the calibration and quality assurance of low-cost air quality sensors. Thanks to the generalised nature of this method, it can be applied to a wide range of sensors and could potentially serve as a standard calibration procedure. The data processing script has been made publicly available, which maximises the applicability of this method and the impact of this research.
The authors have pointed out current challenges in the use of low-cost sensors, including the lack (or incomparability) of calibration procedures in many low-cost sensor application studies. They stress the need for a reliable and reproducible data calibration and post-processing method. This manuscript is an important step towards this aim and, therefore, a valuable contribution to the literature in this field, as it has the potential to improve the data quality in future applications of low-cost sensors. The manuscript is well structured and clearly written.
My main suggestions to further improve the scientific quality of the manuscript are:
Discuss the limitations of this method in more detail (Point 1)
Add physical explanations for the reported observations (Point 5)
Specific comments
1. Please discuss the limitations of your calibration method in more detail (Point 2.1)
Application range of calibrated sensors (indoor vs outdoor vs mobile) You stressed the importance of calibrating the sensors under conditions that are similar to those under which they will be (or have been) operated during the experimental application. This needs to be considered when defining the application range of the sensors. Thanks to their silent operating conditions and small size, low-cost sensors are suited for indoor as well as mobile applications (e.g. wearable sensors for personal exposure assessment). However, if the calibration is conducted outdoors, the sensors might not be suited for such applications as the environmental conditions may differ significantly in these environments. Furthermore, mobile deployments would require further data cleaning and validation steps as rapidly changing environments may have an impact on the sensor performance (e.g. Alphasense Ltd., 2013).
Sensor systems As you have pointed out, low-cost sensors are often temperature and RH dependent as well as cross-sensitive to other pollutants. Therefore, it should be recommended to apply the presented calibration method to sensor systems (with additional sensors for T, RH and cross-sensitive gases) rather than individual sensors.
Data cleaning (Point 2.2.2) In this step, point outliers are removed based on the assumption of a slowly changing air field in which peak exposures over a few seconds do not occur. However, such short-term (< 10 sec) emissions may occur in certain settings (e.g. traffic emissions of nearby passing vehicles, cigarette emissions of passengers etc.). One advantage of the high spatial and temporal resolution of low-cost sensors is that such peak exposures may be captured. The proposed method, however, excludes such events. Please include this argument when defining the application range of the sensors (Point 2.1).
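To illustrate the trade-off, the following minimal sketch (hypothetical data, window length and threshold; not the Authors' script) shows how a rolling-median point-outlier filter removes a genuine one-sample peak together with sensor glitches:

```python
import numpy as np
import pandas as pd

def remove_point_outliers(series, window=9, n_mad=5):
    """Set samples that deviate strongly from a rolling median to NaN.

    Assumes a slowly changing air field: any sample more than n_mad
    local median absolute deviations from the local median is treated
    as a point outlier (window and n_mad are illustrative choices).
    """
    med = series.rolling(window, center=True, min_periods=1).median()
    abs_dev = (series - med).abs()
    mad = abs_dev.rolling(window, center=True, min_periods=1).median()
    return series.mask(abs_dev > n_mad * (mad + 1e-9))

# A one-sample spike is removed whether it is a sensor glitch or a
# genuine short-term peak (e.g. a passing vehicle).
x = pd.Series([10.0] * 20)
x.iloc[10] = 80.0
cleaned = remove_point_outliers(x)
```

The filter itself cannot decide whether the removed sample was a glitch or a real short-term emission, which is why this limitation should be reflected in the stated application range.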
2. Line 115: You state that, while demonstrated here with MOS, the proposed calibration method can equally be applied to electrochemical sensors. To strengthen this argument, please add a brief physical explanation, a reference, or experimental proof.
3. Line 221, line 240: Please explain how you determined the splitting ratio between the training and validation periods. How much do your results differ when using other ratios?
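As an illustration of how such a sensitivity check could look (synthetic data; the split fractions are assumptions, not the manuscript's values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic colocation data: reference = 0.8 * raw sensor signal + noise.
x = rng.uniform(0, 100, 500)
y = 0.8 * x + rng.normal(0, 5, 500)

# Vary the chronological train/validation splitting ratio and record
# the validation RMSE of a simple linear calibration model.
rmses = {}
for frac in (0.5, 0.6, 0.7, 0.8):
    n = int(frac * len(y))
    slope, intercept = np.polyfit(x[:n], y[:n], 1)
    pred = slope * x[n:] + intercept
    rmses[frac] = float(np.sqrt(np.mean((y[n:] - pred) ** 2)))
# If the RMSEs are similar across fractions, the chosen ratio is uncritical.
```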
4. Table 6: Please explain why you are using the medians rather than the means of your statistical parameters (whereas in Line 221 you speak of the average RMSE).
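A likely motivation, sketched here with hypothetical per-fold RMSE values, is that the median is robust to a single badly performing fold while the mean is not; it would help if this were stated explicitly:

```python
import statistics

# Hypothetical RMSE values from five cross-validation folds,
# one of which performs badly.
fold_rmse = [4.1, 4.3, 3.9, 4.2, 12.5]

mean_rmse = statistics.mean(fold_rmse)      # pulled up by the bad fold
median_rmse = statistics.median(fold_rmse)  # robust to the bad fold
```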
5. While the manuscript nicely discusses the implications of a finding, it sometimes does not offer physical explanations for them:
Line 245: “If the graphs showed instability across the various folds, Step 4 was repeated and a new model was selected for validation” What causes this instability and how can you ensure that the model stays stable under field conditions?
Section 3.4.4 (model selection): Different relationships between the input variables were found for different models, e.g. an inverse temperature dependence for NO2 was found for the best fitting MLR but no temperature dependence was found in the case of the best fitting RF. How can you explain this and what type of physical relationship (e.g. temperature dependence) would you expect?
The model performance was found to be higher when using the ambient environmental conditions (T and RH) as parameters (e.g. Tables 6 and 7). However, you pointed out in the discussion (Line 619) that the internal conditions are more representative for the operating conditions of the sensor. What are possible explanations for this observation?
6. Line 292: Please specify “decent” and “good” agreement (e.g. with mean R2 & RMSE)
7. Line 327: You deployed (at least) two low-cost sensors. Have you quantified the agreement between the two sensors? If so, add a small sentence here as it may be a strong argument why it is sufficient to only look at the data of one representative sensor. Perhaps summarise the performance of the second sensor briefly in the main text. How can you explain the non-linear response of sensor s72 (Figure S8)?
8. Figure 8 (optional): Adding histograms showing the overlap between colocation and experiment would make the Figure easier to comprehend and help to understand the flagging procedure.
9. Line 596: Replace “for those who enjoy” with “to achieve”
Technical comments
10. Please use subscripts for NO2 and O3 and superscripts for R2 throughout the document.
11. Lines 93 and 96: What does SVM stand for? Do you mean SVR (support vector regression)?
12. Line 149: Delete “for use in statistical calibration” (the general quality of the final data is likely to be higher)
13. Line 154 (Style, optional): Replace “What follows in this section is a” with “This section provides a”
14. Line 196: How do you define the range of the colocation data? As the range between the minimum and maximum observations? (Or percentiles?)
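The choice matters for the flagging step; the following sketch (hypothetical values, not the Authors' implementation) contrasts the two definitions:

```python
import numpy as np

def out_of_range_flags(coloc, experiment, pct=None):
    """Flag experiment samples outside the colocation range.

    pct=None uses the min/max of the colocation data; pct=(5, 95)
    would use percentiles, which is less sensitive to single extreme
    colocation samples. Both variants are illustrative assumptions.
    """
    coloc = np.asarray(coloc, dtype=float)
    if pct is None:
        lo, hi = coloc.min(), coloc.max()
    else:
        lo, hi = np.percentile(coloc, pct)
    experiment = np.asarray(experiment, dtype=float)
    return (experiment < lo) | (experiment > hi)

coloc = [20.0, 25.0, 30.0, 35.0, 90.0]   # one extreme colocation sample
expt = [10.0, 28.0, 85.0, 95.0]
flags_minmax = out_of_range_flags(coloc, expt)
flags_pct = out_of_range_flags(coloc, expt, pct=(5, 95))
```

With min/max, the single extreme colocation sample (90) stretches the accepted range; with the 5th-95th percentiles, the experiment value 85 is additionally flagged.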
15. Line 219: Please provide references for AIC and VI
16. Line 263 (optional): Perhaps add a sentence or reference explaining the term “smearing” as the audience might not be familiar with this practice.
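If “smearing” refers to Duan's (1983) smearing estimator, which is our assumption here, a short sketch of the retransformation bias it corrects could serve as such an explanation:

```python
import numpy as np

# Duan's smearing estimator: when a model is fitted to log(y), naively
# back-transforming predictions with exp() is biased low on average;
# multiplying by the mean of the exponentiated residuals corrects this
# retransformation bias. Data below are synthetic.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, 1000)
y = 5.0 * x * rng.lognormal(0.0, 0.5, 1000)  # multiplicative noise

slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
log_pred = slope * np.log(x) + intercept
residuals = np.log(y) - log_pred

naive = np.exp(log_pred)            # biased low on average
smear = np.mean(np.exp(residuals))  # smearing factor, > 1 here
corrected = naive * smear
```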
17. Line 295: “more information in section 3.2” – but this text is itself in Section 3.2.
18. Table 1: Is it correct that the sensor models for the reducing and the oxidising gases are identical (SGX Sensortech MICS-4514)?
19. Figure 2 (optional): Adding a timeline with (rough) dates would help to comprehend the paragraph above quicker.
20. Figures 4 c, d; 6; 7 a, b; 10 etc: Make sure that all axes have units (even if only arbitrary units).
21. Figures 14 and 15 (optional): Although you have already mentioned them in Tables 8 and 9, add the R2 and RMSE values to the graphs to provide a comprehensive overview.
22. Line 503: “the reference instruments did not impact the predictive accuracy of the models and can therefore [in this case] be ignored as a potential interference” – can this be generalised for all sensors? If not, add “in this case”
23. Line 508: “The uncertainty between RF models and MLR models was fairly similar” - replace “between” with “of”
General comments. The manuscript describes an open-source, systematic methodology to calibrate low-cost sensors (LCSs). The Authors propose a 7-step statistical method based on: 1) preliminary analysis of raw data; 2) data cleaning; 3) data flagging; 4) model selection using both multiple linear regression and random forest together with several statistical parameters; 5) model validation; 6) export of the experimental data as concentrations; 7) error prediction. Finally, the Authors tested the proposed method with an example from a field campaign in an urban environment.
The manuscript presents a very interesting and systematic methodology to calibrate LCSs, advocating a unified, standardised method that makes LCS measurements comparable, which is important given the increasingly frequent use of this technology. Nevertheless, the manuscript requires revisions before it can be accepted for final publication. Suggestions and specific comments follow.
Specific comments.
For this calibration procedure, reference instruments are needed. Could trends due to specific events (e.g., burning) be improperly captured by the sensors if they are not calibrated under the same conditions?
Moreover, did the Authors try a calibration procedure using chemical standards to produce a calibration curve at different concentrations and conditions in a laboratory experiment? If so, could the Authors discuss the differences between the two approaches?
In the case of LCS time drift, does the proposed methodology take this into account (or allow for it)? Is the proposed frequency (two weeks every 2-3 months) sufficient to account for seasonal variations and, possibly, sensor drift?
Lines 430 and following. LCSs can have a T and RH dependency. Is it appropriate to apply the T and RH corrections suggested by the manufacturer to the raw data before applying the models in the proposed methodology? Could the Authors discuss this aspect, and whether the model relationships with RH and T are in line with those, if any, suggested by the manufacturer?
Other species could interfere with the measurements: if the concentrations of those compounds change (season, night vs. day, etc.), this could affect the sensor response. How do the Authors suggest dealing with this eventuality?
The Authors show experimental results only for NO2 and O3 MOS sensors. Is this methodology applicable to other compounds (e.g., VOCs) or technologies (e.g., PID, electrochemical) with the same characteristics proposed in the manuscript? This information should be included in the manuscript.
The Authors refer to the ammonia and reducing-gas sensors in the manuscript (see Table 1), but results for these sensors are not presented. Is this due to the lack of a reference instrument?
Could the Authors describe in more detail how Step 4 is performed? How were the experiment and co-location data used in this step? The Authors state that the co-location data were used in Steps 5 and 6, but information about the data used in Step 4 seems to be missing.
Did the Authors intercompare similar sensors, i.e., two Zephyrs, before and after the calibration to check the response of identical sensors under the same conditions?
Regarding the data cleaning, how do the Authors correct the data for possible bias effects? Line 190: does the duration of the moving window chosen to remove outliers avoid excluding specific, real short-duration events from the dataset?
Lines 218-220. To identify which model better describes the measurements in terms of over- or underestimation, could the Authors consider also including a statistical parameter such as the Fractional Bias?
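For reference, the Fractional Bias is simple to compute alongside the metrics already used (sketch with hypothetical values):

```python
import numpy as np

def fractional_bias(model, obs):
    """FB = 2 * (mean(model) - mean(obs)) / (mean(model) + mean(obs)).

    FB > 0 indicates overestimation, FB < 0 underestimation;
    FB is bounded in [-2, 2], and FB = 0 means no average bias.
    """
    m, o = np.mean(model), np.mean(obs)
    return 2.0 * (m - o) / (m + o)

fb = fractional_bias([12.0, 18.0], [10.0, 20.0])  # equal means -> FB = 0
```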
Lines 365-369. Co-location 3 was at the end of the summer campaign (i.e., October); however, data for Experiment 2 are not available. Figure 4 suggests a seasonal impact. Did the Authors use this co-location in their calibration for Experiment 1? Did the different season affect the calibration procedure? Are two weeks every 2-3 months enough to account for it?
Lines 390-396. Is GSM the only way to transfer data to a database? Was the warm-up time provided by the manufacturer?
Lines 421-423. Since the 3rd co-location was in October, could this indicate that closer and more frequent co-locations are needed? See also the following Section 3.5 (line 539) and Figures 14-15.
Figures 14-15. Could the Authors add the 1:1 lines and indicate the R2 values in the plots? How do the Authors explain the constant thresholds in panels 15e and 15f? The models using internal T and RH seem to give lower O3 and NO2 than those using ambient T and RH; in Figure 11 this is less evident. Could the Authors explain this, as well as the reasons for, and meaning of, the slopes (typically lower than unity) and intercepts?
Supplementary. Why, for the winter campaign, do the Authors use co-locations 1 and 2 instead of 4 and 5, which are closer to Experiment 3? Comparing Table 9 and Table S4, the models identified for O3 are different (and similarly for NO2): how do the Authors explain this?
Technical comments.
Line 88. See “host”.
Line 200. Do the Authors refer to Section 3.5?
Line 291. “Decent” and “good” agreement should be quantitative rather than qualitative statements.
Lines 306-307 and 317-318. The information about the dates of the campaigns is confusing and should be made coherent. This information could be given clearly once, and I would suggest adding the dates in Figure 2 as well.
Lines 328-329. This sentence should be clarified. Were Zephyrs s71 and s72 located as in Figure 3, or with the reference instrument in an office on the 6th floor? In the former case, this information is redundant and could be included in the previous paragraphs describing the setting (line 311 and following). In the latter case: how were the air masses sampled?
Line 359. When the Authors refer to the “combined” co-location, do they mean an average of co-locations 1 and 2?
Lines 419-423. This section is not well described. Could the Authors explain in more detail the criteria used to flag the data?
Line 430. Could the Authors specify, in this or a previous paragraph, the units of the input data?
Lines 436-438. The Authors report that the relationship between Oxa and O3 was determined to be inverse but, since the predictive accuracy with no transformation was similar, they selected the latter. However, Table 3 contains no inverse relationship, and a log dependency between O3 and Oxa was selected (Table 5 also contains no inverse transformation). Could the Authors explain this discrepancy or describe this paragraph more clearly?
Lines 490-498. A comparison with the reference O3 and NO2 data should be included here (and in Figure 11).