In this paper, the authors developed a deep-learning-based method for fusing observational data into simulated air pollution concentration fields. The method requires a considerable amount of preparation effort, but model execution is fast and the results are favorable, making it suitable for operational air quality forecasting platforms. Overall, this paper is well structured, fluently written, and the topic fits the scope of this journal. I only have a few comments:
Is there any reason why only RH and WS were included as meteorological variables, but other important variables such as precipitation and boundary layer height were not included? Is it because of limitations in computational resources?
In model training, actual observational data were not used; instead, a random sample of 1500-2500 data points was used. In evaluation, actual observational data were used. It would benefit readers if the authors could provide more justification on: 1) why 1500-2500 data points were chosen, and why the number of data points varies among years; 2) why random samples of data points were used in training instead of actual observational data; 3) whether the biases and errors in the CTM simulation affect the training results.
In lines 173-174, the authors mention that the kernels are generally isotropic but that some anisotropic characteristics are evident. What are the expected impacts of such anisotropy? Would including additional training variables such as wind direction help address this anisotropy?
In the manuscript, a new data fusion paradigm is developed to estimate PM2.5 reanalysis fields from station observations, using a deep learning framework that learns multi-variable spatial correlations from Chemical Transport Model (CTM) simulations. The model includes an explainable PointConv operation to pre-process isolated observations and a grid-to-grid regression network to reflect correlations among multiple variables. Compared with previous data fusion methods for PM2.5 reanalysis, the proposed framework can fuse multi-variable observations from different monitoring networks (even when they are not spatially collocated), and the model training does not rely on observations. The deep learning data fusion framework is novel and can reasonably generate spatio-temporally complete fused fields of PM2.5 from observations at sparse locations. I would recommend publication in Geoscientific Model Development after consideration of the following comments.
Specific comments
1. For the proposed fusion framework, why are only the predictions of PM2.5 concentrations, relative humidity (RH) and wind speed (WS), together with surface height from the Digital Elevation Model (DEM) and land use and land cover (LULC), used to train the deep learning network?
2. Line 247: “This model was fitted with model simulation data by learning daily spatial patterns from long-term CTM simulations.” When applying the fusion model, what is the minimum length of CTM simulation data required for network training to capture the simulated spatial correlations?
3. Although, as stated in lines 258-260, the CTM simulations theoretically do not need to be very accurate as model inputs, accurate or at least reasonable spatial correlations (spatial patterns) simulated by the CTM are necessary for training the deep learning model. There is very limited information on the CTM simulation data used in the study. Have the simulated PM2.5 spatial patterns been evaluated? How well do they perform? Please provide a brief description or relevant references.
Technical comments
1. Lines 83-84: “Each of these data items at each were assigned…”: the word “site” or “station” is missing after “at each”.
2. Line 185: “(Figure S2 in the SI)”: Figure S2 could not be found in the SI. Please check.