Using Mathematical Modeling and Neuron Network to Predict the Dynamics of COVID-19 in China

The COVID-19 pandemic, which spread from Wuhan in 2019, has had a severe and continuing global impact. Even now, when several vaccines have been approved by WHO and widely accepted, there are still millions of new confirmed cases daily. To help governments respond promptly and effectively at the smallest social and economic cost, numerous studies have sought to predict the development trend of COVID-19, widely using and modifying mathematical models, such as SEIR, and neural networks, such as LSTM. Among the reviewed papers, population migration and quarantine policies were commonly considered as influential factors, while natural factors were rarely mentioned. Construction of SEIR models focused on parameter selection and parameter dynamics, whereas construction of LSTM models focused on avoiding overfitting caused by data shortage and over-complicated structure. Broadening the applicable settings and increasing prediction accuracy still seem necessary. Although this review is limited to studies based on Chinese datasets, research in other countries may benefit from the analysis strategy.


Introduction
At the end of 2019, the first case of COVID-19, the official name given by the World Health Organization (WHO) [1], was discovered in Wuhan, Hubei province, China. Because of its high infectivity and transmissibility, the virus spread rapidly around the world and had caused 1,965,515 cumulative deaths by the end of 2020 [2]. Even now, when several vaccines have been approved by WHO and widely accepted, there are still millions of new confirmed cases daily. To prevent further transmission, governments adopted complete or partial lockdowns, quarantine, and other social restrictions as health policies [3,4]. Although these policies suppressed the pandemic effectively, the resulting mental health issues [5], economic challenges [6], and food allocation and affordability problems [7] show that such restrictions cannot be sustained for a long time on a large scale. Thus, to allow effective restrictions over a short time and on a small scale while minimizing these problems, prediction of COVID-19 dynamics has become an important research topic globally.
After WHO declared a global public health emergency, researchers built models to forecast the number of COVID-19 cases, the development trend, or the time and size of the outbreak peak, in order to understand how the virus propagates and to advise authorities on preventing and controlling transmission. Mathematical modeling has long been used in epidemiology and public health policy for short- and long-term prediction and planning to restrict the spread of a disease [8,9]. In the COVID-19 pandemic, Susceptible-Exposed-Infected-Removed (SEIR) and related models were frequently employed to study the development trend and are therefore discussed in detail. Neural networks, on the other hand, have become a popular prediction tool in recent decades; they use computer algorithms to analyze large, complex data structures, learn patterns in the data, and create prediction models. The long short-term memory (LSTM) model has been popular for predicting epidemic dynamics and is therefore examined closely in this paper. Since China was the country where the pandemic first erupted and which took strict measures to contain it, this paper focuses on studies analyzing the pandemic in China.

Data sourcing
The statistics used by the papers are listed in Table 1. Most of the data concerned pandemic dynamics, such as the number of confirmed cases, and were used for parameter estimation in mathematical models, for training neural network models, and for testing both categories. The statistics came mainly from China, the focus of this review, but statistics from other countries were also included for testing. Some authors used country-level data, while others used data at the province level or at the even finer city level. Besides data on pandemic dynamics, other statistics, such as population migration data, were also incorporated into the datasets; these were considered important factors affecting the pandemic development trend and were integrated into the models.

Preprocessing
In preparation for later matrix computation, Zhan et al. [10] arranged the sizes of the populations migrating into and out of each city into a migration matrix in which the entry at position (i, j) indicated the number of citizens moving from city i to city j. Wu et al. [11] excluded the data of Hong Kong when estimating the transmissibility of Wuhan from 2019 data, considering that social unrest in Hong Kong made those data atypical for 2020. Yan et al. [20] processed online data to obtain diagnosis statistics as model input and standardized the parameters with MinMaxScaler. For the input of the NLP module in their model, Zheng et al. [19] sorted COVID-19 related news by date and city, filtered out case reports and foreign news, and extracted the titles and main content of the news. The data were processed by a pretrained BERT language model before being fed into the NLP module. The titles and main content were input into the module separately to avoid overfitting and improve training efficiency.
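Two of the preprocessing steps above, building a migration matrix and min-max scaling, can be illustrated with a minimal sketch. It is not the code of Zhan et al. [10] or Yan et al. [20]; the function names, the city list, and the flow counts are hypothetical and chosen only for illustration, and the scaling is written out by hand rather than taken from any library.

```python
def build_migration_matrix(flows, cities):
    """Return M where M[i][j] is the number of people moving from cities[i] to cities[j]."""
    index = {name: k for k, name in enumerate(cities)}
    n = len(cities)
    matrix = [[0.0] * n for _ in range(n)]
    for origin, destination, count in flows:
        matrix[index[origin]][index[destination]] += count
    return matrix


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant series is mapped to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up daily flows (origin, destination, travellers), for illustration only.
cities = ["Wuhan", "Beijing", "Shanghai"]
flows = [("Wuhan", "Beijing", 1200), ("Wuhan", "Shanghai", 900), ("Beijing", "Wuhan", 300)]
migration = build_migration_matrix(flows, cities)

# Net inflow into Wuhan: column sum (arrivals) minus row sum (departures).
net_inflow_wuhan = sum(row[0] for row in migration) - sum(migration[0])

# Made-up case counts rescaled to the unit interval.
scaled_cases = min_max_scale([10, 20, 40, 50])
```

With row index as origin and column index as destination, the net flow into a city is its column sum minus its row sum, which is the quantity a migration-aware compartment model would need each day.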

Mathematical modeling
Mathematical modeling has long been employed to depict the dynamics of infectious diseases and understand the epidemic growth patterns. The ordinary mathematical models were refined to capture the characteristics of the specific pathogen and the social context.

Susceptible-exposed-infected-removed (SEIR) model
As an epidemiological model, SEIR has been widely used to study the spread of epidemics. In the model, each individual in a population is assigned to one of four stages of epidemic spreading (susceptible, exposed, infected, or removed), and parameters represent the proportion of individuals moving from one stage to the next over time. Zhan et al. [10], considering the unusually high volume and frequency of population flow between cities at the onset of the pandemic in China, modified the classic SEIR model by integrating daily intercity migration data to forecast the dynamics of COVID-19 in China. In the modified SEIR model, both the difference between categories E and R and the net flow of infections into the city were counted as parts of the daily increase in infected cases. The parameters were estimated by constrained nonlinear programming, with data from Jan 24 to Feb 13, 2020, used for fitting. According to the results, the predicted outbreak would occur between February and March 2020, and its size would reach about 0.8% of the population in Wuhan, less than 0.1% in Hubei province, and less than 0.01% in the rest of China.
Another study, by Wu et al. [11], also treated population flow as a main factor in estimating the extent of epidemic spread but focused on the export of infected individuals from Wuhan, both domestically and internationally. The transmissibility of Wuhan was estimated from transportation data for the corresponding period in 2019, and the outbreak size was estimated from the number of confirmed cases exported from Wuhan and reported outside China. The results showed that in the baseline scenario, where R0 was set to 2.68, Wuhan exported 461, 113, 98, 111, and 80 infected cases to the cities of Chongqing, Beijing, Shanghai, Guangzhou, and Shenzhen, respectively.
Prem et al. [12], concerned with the effectiveness of physical distancing policies, applied an age-structured SEIR model to forecast the effect of population mixing on the progression of the virus. They assumed that social mixing patterns differed between individuals in different age groups and locations, who therefore had different probabilities of being exposed to the coronavirus. For the effect of location on social mixing patterns, three scenarios were considered: usual social mixing, the Lunar New Year holiday, and relaxed intervention. The simulation showed that policies limiting social mixing were effective in reducing the size and delaying the peak of the pandemic, though the effect varied across age groups. The simulation also suggested that a staggered return to work starting in April would maximize the effect of these measures.
Sharing the concern of Prem et al. [12], Yang et al. [13] wondered whether the severe policies taken in China, such as the quarantine of whole cities, which caused considerable social and economic disruption, limited the epidemic effectively. As in the study of Zhan et al. [10], Yang et al. [13] integrated population migration into the classic SEIR model by introducing two additional parameters, move-in and move-out. Considering the unique characteristics of the coronavirus, E was associated with asymptomatic but infectious individuals, while I was associated with symptomatic and infectious individuals. The incubation time was set to 7 days, the midpoint of the reported incubation period. Epidemic data from Hubei province were used to fit the modified SEIR model and determine the other constants in the model. The predictions showed a good fit with the reported data, and the results indicated that the severe policies were important in limiting the epidemic size.
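To make the four-compartment structure concrete, the following is a minimal sketch of the classic SEIR model stepped forward one day at a time. It does not include the migration terms of Zhan et al. [10]; the parameter names (beta for transmission, sigma for the rate of leaving the exposed stage, gamma for removal) follow common textbook usage, and all numerical values are assumptions picked for illustration, not estimates from any reviewed paper.

```python
def simulate_seir(population, exposed0, infected0, beta, sigma, gamma, days):
    """Advance a classic SEIR model with a one-day explicit Euler step.

    Returns a list of (S, E, I, R) tuples, one per day including day 0.
    """
    s = float(population - exposed0 - infected0)
    e, i, r = float(exposed0), float(infected0), 0.0
    history = [(s, e, i, r)]
    for _ in range(days):
        new_exposed = beta * s * i / population   # S -> E
        new_infectious = sigma * e                # E -> I
        new_removed = gamma * i                   # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        history.append((s, e, i, r))
    return history


# Illustrative values only: 5-day latent period, 10-day infectious period.
history = simulate_seir(
    population=1_000_000, exposed0=0, infected0=10,
    beta=0.5, sigma=1 / 5, gamma=1 / 10, days=200,
)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
peak_infected = history[peak_day][2]
```

Peak time and peak size, the two quantities most of the reviewed studies report, are read directly from the simulated I curve. Migration-aware variants add inflow and outflow terms to each compartment's daily update.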

Other models
Hu et al. [15] employed a Susceptible-Exposed-Infectious-Removed-Quarantined (SEIRQ) model with seven compartments for the whole population. Four of the compartments were the same as those in the SEIR model, and the remaining three were quarantined susceptible, quarantined exposed, and quarantined infectious. Guangdong province was chosen as the example for studying the effects of population control strategies on the dynamics of COVID-19, considering its large gross domestic product compared with other provinces and the expected large inflow of workers. The values of the coefficient of determination, AE, DISO, and RE showed the high accuracy of the model. Chen et al. [16] proposed a time-dependent susceptible-infected-recovered (SIR) model in which the transmission rate and recovery rate were functions of time, to forecast the numbers of infected and recovered individuals over a given period. The two time-dependent parameters were estimated by ridge regression. In the evaluation, the prediction curves closely followed the observed curves, with prediction errors within 3% for confirmed cases.
Wang et al. [17] integrated time-sensitive quarantine policies into the SIR model by two main methods. As in the study of Chen et al. [16], the first method focused on a time-dependent transmission rate, achieved by adding a transmission rate modifier that varied with the quarantine protocol or with time. The second approach resembled that of Hu et al. [15]: it extended the three compartments of the SIR model to four by adding a quarantine compartment. The parameters were estimated by a Markov chain Monte Carlo algorithm. The results showed that integrating the quarantine factor improved both estimation and prediction.

Neural network
Neural networks (NN) [14] are simplified models imitating human intelligence. They are structured as layers consisting of basic units, the neurons. The input data pass through an input layer, hidden layers, and an output layer. The network then generates predictions for all observations and adjusts its weights to improve the predictions. The process is repeated until a stopping criterion is met.
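The idea of a time-dependent SIR model with rates tracked by ridge regression can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Chen et al. [16]: it assumes nearly the whole population is still susceptible, recovers daily rates from a discrete SIR update, and replaces the regression with a one-lag, no-intercept ridge fit. The function names and the case counts are hypothetical.

```python
def estimate_rates(active, removed):
    """Recover daily transmission and recovery rates from case counts.

    Assumes S(t) is approximately N, so the discrete SIR update reduces to
      removed[t+1] - removed[t] = gamma(t) * active[t]
      active[t+1]  - active[t]  = (beta(t) - gamma(t)) * active[t]
    """
    betas, gammas = [], []
    for t in range(len(active) - 1):
        d_removed = removed[t + 1] - removed[t]
        d_active = active[t + 1] - active[t]
        gammas.append(d_removed / active[t])
        betas.append((d_active + d_removed) / active[t])
    return betas, gammas


def ridge_ar1(series, penalty):
    """Closed-form ridge fit of series[t+1] ~ w * series[t] (one lag, no intercept)."""
    xs, ys = series[:-1], series[1:]
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


# Made-up counts of active and removed cases on four consecutive days.
active = [100, 120, 140, 155]
removed = [0, 10, 22, 36]
betas, gammas = estimate_rates(active, removed)

# One-step-ahead forecast of the transmission rate from the fitted lag weight.
weight = ridge_ar1(betas, penalty=0.01)
next_beta = weight * betas[-1]
```

The penalty shrinks the lag weight toward zero, which damps the effect of noisy day-to-day rate estimates on the forecast. A fuller version would use several lags and tune the penalty on held-out days.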
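The training cycle described above (generate predictions, adjust weights, repeat until a stopping criterion is met) can be shown with the smallest possible network, a single linear neuron trained by gradient descent. This is a sketch for illustration only; the function name, learning rate, tolerance, and toy data are assumptions, and real networks add hidden layers and nonlinear activations.

```python
def train_neuron(xs, ys, learning_rate=0.05, tolerance=1e-8, max_epochs=5000):
    """Fit y ~ weight * x + bias by gradient descent on the mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        # Forward pass: predictions for all observations.
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Stopping criterion: loss small enough (or epoch budget exhausted).
        if loss < tolerance:
            break
        # Backward pass: gradients of the loss with respect to each parameter.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Weight adjustment.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias, epoch


# Toy data generated from y = 2x + 1.
weight, bias, epochs_used = train_neuron([0, 1, 2, 3], [1, 3, 5, 7])
```

On this toy data the loop stops through the tolerance test well before the epoch limit, with the weight and bias close to 2 and 1.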

Long short-term memory (LSTM)
LSTM is an improved recurrent neural network (RNN) method. Although RNNs are frequently used for prediction, the vanishing or exploding gradient problem and a memory limited to the short term remain their main shortcomings. The gate functions in the structure of LSTM enable the model to handle long-term dependencies. The gate functions represent the states of the four gates and build the interactions between them.
Yang et al. [13] constructed an LSTM trained on SARS data, incorporated the parameters of the coronavirus, and optimized it with the Adam optimizer for 500 iterations. The structure was kept simple to prevent overfitting to the small training dataset. The confirmed case numbers produced by the model fit the reported statistics remarkably well. The model predicted an outbreak in February with 4000 daily infections, which agreed with the prediction of the SEIR model.
Liu et al. [18] compared the predictions of a modified Susceptible-Exposed-Infected-Recovered-Dead (SEIRD) dynamic model, a Geographically Weighted Regression (GWR) model, and an LSTM. The LSTM, consisting of an input layer, an LSTM layer, a fully connected layer, and an output layer, was trained on data from four provinces in China and optimized by the Adam optimizer with MSE as the loss function. To incorporate the effect of migrants from Wuhan to the provinces in the training dataset, the cumulative number of migrants from Wuhan and the incidence were also included in the model. The LSTM results fit the real situation and achieved a smaller MAPE than the other two models.
Zheng et al. [19], recognizing the inability of the susceptible-infected (SI) model to capture policy changes and emergency conditions, used an LSTM and a natural language processing (NLP) module to reduce the prediction deviation. The LSTM updated the parameters corresponding to different pandemic policies, and the NLP module captured people's awareness of epidemic prevention as affected by the news. The proposed model improved the prediction of confirmed case numbers, with smaller mean absolute error and mean absolute percentage error than the SI model, the improved SI model, and the improved SI model with an LSTM network.
Yan et al. [20], aiming to outperform predictions from mathematical equations and population prediction models, constructed a modified LSTM to predict positive cases. Concerned about biased fitting for cities with large case numbers, they improved the traditional LSTM so that it adjusted its parameters according to the epidemic stage, judged by the standard deviation over the preceding n days. The predictions of the proposed LSTM were compared with those of an ordinary LSTM and of the logistic and Hill equation algorithms in terms of goodness of fit and deviation rate. The proposed LSTM had a better goodness of fit than the ordinary LSTM and a smaller deviation rate, within 2%, than the logistic and Hill equation algorithms, demonstrating the improvement achieved by Yan et al. [20].
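The role of the gates can be illustrated with one time step of a scalar LSTM cell in its standard textbook form: forget, input, and output gates plus a candidate update, which is sometimes counted as the fourth gate. This sketch is not taken from any reviewed paper; the parameter dictionary layout and all weight values are hypothetical, and practical models use vectors and learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each of "forget", "input", "candidate", "output" to a
    tuple (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget_gate = sigmoid(pre_activation("forget"))    # how much old memory to keep
    input_gate = sigmoid(pre_activation("input"))      # how much new content to admit
    candidate = math.tanh(pre_activation("candidate")) # proposed new content
    output_gate = sigmoid(pre_activation("output"))    # how much memory to expose

    c = forget_gate * c_prev + input_gate * candidate  # long-term cell state
    h = output_gate * math.tanh(c)                     # short-term hidden state
    return h, c


# Hypothetical weights and a short made-up input sequence.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.7, 0.1, 0.0),
}
h, c = 0.0, 0.0
hidden_states = []
for x in [0.1, 0.4, 0.8, 0.3]:
    h, c = lstm_step(x, h, c, params)
    hidden_states.append(h)
```

Because the cell state is updated additively and scaled by the forget gate rather than repeatedly squashed, information and gradients can persist over many steps, which is how the architecture mitigates the vanishing gradient problem of plain RNNs.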

Other models
Using a Convolutional Neural Network (CNN), Huang et al. [21] forecast the number of positive cases in parts of China. A CNN is a feedforward neural network with a relatively small number of weights, which makes it easy to train and effective at feature extraction. The proposed deep CNN model combined a CNN with a dropout layer to prevent overfitting due to the small sample size. Huang et al. [21] compared the results of the deep CNN model with those of other neural networks on the criteria of MAE and RMSE and found that the CNN had the smallest errors and the best performance among the networks.
Fong et al. [23], aiming to address the data deficiency at the start of a pandemic, proposed the Group of Optimized and Multisource Selection (GROOSE) method, in which five models were constructed to compete for the best prediction. A Polynomial Neural Network with Corrective Feedback (PNN+cf) was used in group one. The PNN was an evolutionary neural network that increased the powers of its polynomial terms until the best fit to the data was reached. Comparison of the RMSE of each group showed that group one, with PNN+cf, had the smallest error and the best performance. For the PNN+cf model, an input combining suspected, confirmed, and critical cases gave the most accurate prediction.

Evaluation
Hu et al. [15] used a range of measures, such as the coefficient of determination, AE, DISO, and RE, to show the high accuracy of the SEIRQ model, and its forecasting performance was supported by the absolute values of RE. Chen et al. [16] and Wang et al. [17] plotted their predictions and the observed data on the same graph for comparison and found that the predicted epidemic peak times and sizes were close to the real situation. Accuracy was also compared across models. Liu et al. [18] compared the performance of the SEIRD model, LSTM, and GWR by their MAPE against observed statistics. Zheng et al. [19] demonstrated the prediction accuracy of the hybrid of the improved SI model, LSTM, and NLP module by its MAE and MAPE, which were the smallest among the hybrid models compared. Huang et al. [21] concluded that the proposed deep CNN model was highly feasible by comparing its MAE and RMSE with those of other neural network models. Fong et al. [23] selected the PNN+cf model over the other model groups for prediction because of its smallest RMSE. Yan et al. [20] employed goodness of fit and deviation rate to show the improvement of the modified LSTM.
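Several of the error measures named above have standard definitions, sketched below for reference: mean absolute error (MAE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the coefficient of determination. The function names and the sample numbers are illustrative assumptions; the reviewed papers may differ in details such as how zero observations are handled.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observed values must be nonzero."""
    return 100 * sum(abs((p - o) / o) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up daily case counts and model predictions.
observed = [100, 200, 400]
predicted = [110, 190, 380]
scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAPE": mape(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

MAE and RMSE are in units of cases and so depend on the size of the outbreak, whereas MAPE and the coefficient of determination are scale-free, which makes them easier to compare across cities or provinces with very different case numbers.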

Discussion
Although research has been done to fit mathematical models to the coronavirus pandemic, such models can depict only some basic aspects of reality. Because of this limitation, researchers made assumptions to simplify real situations before using a model. The modified SEIR model of Zhan et al. [10] was built on four assumptions, and its parameters were assumed to be unchanged over short periods for computational simplicity. Chen et al. [16] and Wang et al. [17] listed the assumptions and limitations of their SIR models. Static parameters, which cannot adjust themselves to a changing pandemic situation, were another concern with the mathematical models. To overcome this disadvantage, Wang et al. [17] added a transmission rate modifier or a quarantine compartment to the original SIR model. Zheng et al. [19], on the other hand, addressed the problem by integrating an LSTM and an NLP module into the SI model.
For LSTM, prediction performance exceeded that of some other models. Liu et al. [18] showed that the MAPE of LSTM was smaller than those of the SEIRD and GWR models. Yan et al. [20] compared the deviation rates of the proposed LSTM and of the logistic and Hill equation algorithms and found that the proposed LSTM had the highest accuracy. However, LSTM may still not be the most suitable model for prediction. Huang et al. [21] chose a CNN model because its MAE and RMSE were lower than those of LSTM. Hao et al. [22] selected an Elman neural network model because it predicted deaths and cumulative cured cases better than LSTM. In addition, neural networks were unstable when predicting aperiodic epidemic data, as described by Hao et al. [22], and prone to overfitting because of data deficiency, as mentioned by Huang et al. [21], and excessive structural complexity, as stated by Fong et al. [23].
In this situation, strategies that make the assumptions hold in real applications, or that relax these limits, seem necessary. Other algorithms and models should be explored or integrated with existing models to improve prediction accuracy. The number of days a model can predict ahead needs to be extended to give more time to prepare for an epidemic outbreak [24].

Conclusion
In this paper, studies using mathematical modeling and neural network methods to predict COVID-19 dynamics were analyzed with respect to several aspects, including the pandemic development trend, the new insights proposed, and prediction accuracy. Population migration, quarantine policies, and people's awareness were commonly considered as influential factors, while natural factors, such as temperature, were rarely mentioned. In the two method categories, SEIR and LSTM are the widely used models, and each showed high prediction accuracy. Parameter selection and the adjustment of static parameters are the two main concerns in SEIR model construction. LSTM was modified to avoid overfitting caused by data shortage and over-complicated structure. Hybrid models are a popular way to overcome these problems. These studies may provide a better understanding of pandemic development and a reference for governments proposing pandemic policies.