# Novel cost-effective method for forecasting COVID-19 and hospital occupancy using deep learning

### Dataset

This study uses a regression model to predict COVID-19 trends from data recorded at the Hospital Insular de Gran Canaria (Spain). The dataset spans from the beginning of 2020 to March 29, 2022, and in the simplest case consists of only two inputs: date and daily new COVID-19 cases. Despite the simplicity of this dataset, our analysis shows that the model can accurately predict future COVID-19 trends, identifying temporal patterns, seasonality, and the impact of interventions. This underscores the value of accessible data and shows that even minimal data inputs can yield useful insights. As mentioned above, the database is owned by the Government of the Canary Islands (Spain) and the data is public^{11}. It can be consulted or downloaded from https://opendata.sitcan.es/dataset/capacidad-asistencial-covid-19.

### Performance indices

A set of statistical parameters was used to evaluate the accuracy of the model. These parameters were selected because they are widely used in the literature, which allows our results to be compared with the current state of the art. The most prominent are *RMSE, MAE, MAPE,* and *R*^{2}, which measure the accuracy of the predictions, as well as their dispersion and correlation with the observations. Their mathematical expressions are shown in the following equations, where \(y_i\) are the observed values, \(\widehat{y}_i\) the predicted values, \(\overline{y}\) the mean of the observed values, and \(n\) the number of observations^{12,13,14,15}.

Mean square error (*MSE*):

$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^2$$

(1)

Root mean square error (*RMSE*):

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^2}$$

(2)

Mean absolute error (*MAE*):

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\widehat{y}_i\right|$$

(3)

Mean absolute percentage error (*MAPE*):

$$MAPE=\frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i-\widehat{y}_i}{y_i}\right|$$

(4)

Coefficient of determination (*R*^{2}):

$$R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\overline{y}\right)^2}$$

(5)
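As an illustration, the five indices can be computed with a few lines of plain Python. This is a minimal sketch rather than the code used in the study: the function name `regression_metrics` is hypothetical, *R*^{2} is computed in its standard residual-based form (one minus the ratio of the residual sum of squares to the total sum of squares), and the observed values are assumed to be non-zero so that *MAPE* is defined.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in %) and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an observed value is zero; this sketch assumes none are.
    mape = 100.0 / n * sum(abs(e / y) for e, y in zip(errors, y_true))

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

Because days with zero new cases make *MAPE* undefined, a real pipeline would need to decide how to treat them (for example, by excluding them from that index); that choice is not specified here.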

### Data preprocessing

To perform the data preprocessing and labeling, the “new daily cases” variable was separated into one vector and the date variable into another. Then, a labeling window of varying size was slid over the “new daily cases” values to assign a label to each group of values. These labels form the “Ytrain” values, which therefore depend on the size of the window. Thus, for a window of size n = 2, the “new daily cases” of days 1 and 2 are grouped in the first row of the “Xtrain” vector and their “Ytrain” value is that of the following day, i.e., day n + 1 = 3. Next, with a step of 1, days 2 and 3 are grouped in the second row, and their “Ytrain” value is that of day 4.

The study was carried out with values from n = 1 to n = 20 to check which window was better suited to the data and which could better handle the high slope presented by the COVID-19 waves. Figure 4 shows a scheme of the starting vectors “date” and “new daily cases” and the labeling process for a window n = 2 and n = 5. The dataset is available from the link provided in the previous section.
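The labeling procedure can be sketched as a sliding window over the series. The function name `make_windows` and the sample counts below are hypothetical and serve only to illustrate the grouping; they are not taken from the dataset.

```python
def make_windows(series, window, step=1):
    """Split a 1-D series into (Xtrain, Ytrain) pairs.

    Each Xtrain row holds `window` consecutive values; the matching Ytrain
    value is the observation that immediately follows the window.
    """
    x_rows, y_vals = [], []
    for start in range(0, len(series) - window, step):
        x_rows.append(list(series[start:start + window]))
        y_vals.append(series[start + window])
    return x_rows, y_vals


# Hypothetical daily counts, for illustration only.
cases = [3, 5, 8, 13, 21, 34]
x_train, y_train = make_windows(cases, window=2)
# x_train == [[3, 5], [5, 8], [8, 13], [13, 21]]
# y_train == [8, 13, 21, 34]
```

With a window of 2, the first row groups days 1 and 2 and is labeled with day 3; the second row groups days 2 and 3 and is labeled with day 4, matching the description above.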

### Network architecture

To correctly predict the COVID-19 data, an architecture has been designed that is capable of analyzing the time series and capturing the existing gradient differences in the slopes generated by the different waves through the use of deep learning. The different layers used in the whole architecture are described in detail below.

#### LSTM-BiLSTM

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) that is particularly useful for modeling sequential data. It has been applied to a wide range of tasks, including speech recognition, natural language processing, and time series forecasting^{16}. By using memory cells, LSTMs can retain useful information from the current or previous steps and use it later. To do so, they rely on gates that control which information is kept, updated, or passed on.

LSTM layers can also be combined to improve the overall network architecture. There are related variants with different functions, such as Bidirectional LSTMs (BiLSTMs) and Gated Recurrent Units (GRUs), as well as the more recent attention-based algorithms called “transformers”, described by Vaswani et al. in 2017 in their work entitled “Attention is all you need”^{17}. In the case of BiLSTMs, the only difference is the relationship between the states: since they are bidirectional, they can take into account the data of the previous state as well as the following one.

The LSTM consists of a memory cell and three gates, which can be expressed mathematically as follows^{18}:

**Input Gate:** the layer responsible for updating the state of the network through the sigmoidal function.

$$i_t = \sigma\left(W_i \cdot \left[h_{t-1}, x_t\right] + b_i\right)$$

(6)

\(W_i\) is the weight matrix of the input gate, \(b_i\) is the corresponding bias, \(x_t\) is the input at the current time step, and \(h_{t-1}\) is the output of the previous time step. σ takes values in [0, 1], where 0 represents fully discarding the data and 1 fully saving it^{16,18}.

**Forget Gate:** the layer responsible for deciding whether to save or discard the information. It is the first step of LSTM.

$$f_t = \sigma\left(W_f \cdot \left[h_{t-1}, x_t\right] + b_f\right)$$

(7)

\(W_f\) is the weight matrix of the forget gate, \(b_f\) is the corresponding bias, \(x_t\) is the input at the current time step, and \(h_{t-1}\) is the output of the previous time step.

**Output Gate:** This is where the information output is determined. This output is based on a filtered version of the cell state: the output gate value is computed by the sigmoid layer and then multiplied by the tanh of the cell state^{18}.

$$o_t = \sigma\left(W_o \cdot \left[h_{t-1}, x_t\right] + b_o\right)$$

(8)

$$h_t = o_t \cdot \tanh\left(C_t\right)$$

(9)

\(W_o\) is the weight matrix of the output gate, \(b_o\) is the corresponding bias, \(x_t\) is the input at the current time step, and \(h_{t-1}\) is the output of the previous time step. \(h_t\) is the output of the LSTM layer at the current time step. Finally, the previous cell state \(C_{t-1}\) must be updated to the new cell state \(C_t\). This is computed by combining the forget gate and the input gate, as shown in Fig. 5.

$$C_t = f_t \cdot C_{t-1} + i_t \cdot g_t$$

(10)

where \(g_t\) is the output of the tanh layer.
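To make the gate equations concrete, the following is a minimal pure-Python sketch of a single LSTM time step following Eqs. (6)–(10). It assumes that \(g_t\) is computed like the gates but with a tanh activation, using its own weights and bias; the names `lstm_step`, `affine`, and `params` are hypothetical, and real frameworks such as TensorFlow use vectorized implementations instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W . v + b for a weight matrix W (list of rows) and a vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    `params` maps each of "i", "f", "o", "g" to a (W, b) pair acting on the
    concatenation [h_{t-1}, x_t]. All vectors are plain Python lists.
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate, Eq. (6)
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate, Eq. (7)
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate, Eq. (8)
    g_t = [math.tanh(a) for a in affine(*params["g"], z)]  # tanh layer (assumed form)
    # Cell state update, Eq. (10): keep part of the old state, add new content.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Layer output, Eq. (9): output gate times tanh of the new cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Processing a window amounts to calling `lstm_step` once per time step, feeding the returned `h_t` and `c_t` back in as `h_prev` and `c_prev`.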

In addition, the BiLSTM model is composed of two LSTM networks and is capable of reading input evaluations in both directions, forward and backward. The forward LSTM processes information from left to right, while the backward LSTM processes information from right to left^{19}.
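The bidirectional reading can be illustrated with a toy recurrent cell. In this sketch a running sum stands in for an LSTM so that the data flow is easy to trace; the helper names are hypothetical, and pairing the forward and backward states at each time step is an assumption about how the two directions are merged.

```python
def run_recurrent(cell, sequence, state):
    """Apply `cell(x, state) -> state` along a sequence, returning every state."""
    outputs = []
    for x in sequence:
        state = cell(x, state)
        outputs.append(state)
    return outputs


def bidirectional(cell_fwd, cell_bwd, sequence, init_state=0):
    """Pair the forward and backward states for each time step."""
    fwd = run_recurrent(cell_fwd, sequence, init_state)
    # The backward pass reads the sequence right to left; its outputs are
    # reversed again so that position t lines up with the forward pass.
    bwd = run_recurrent(cell_bwd, sequence[::-1], init_state)[::-1]
    return list(zip(fwd, bwd))


# Toy stand-in for an LSTM cell: a running sum.
def running_sum(x, state):
    return state + x


pairs = bidirectional(running_sum, running_sum, [1, 2, 3])
# pairs == [(1, 6), (3, 5), (6, 3)]
```

At each position the first element summarizes everything up to that step and the second everything from that step onward, which is the extra context a BiLSTM provides over a unidirectional LSTM.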

#### Dense layer

A dense or fully connected layer, also known as a fully connected feedforward neural network, is a type of artificial neural network in which each neuron in one layer is connected to each neuron in the next layer. The basic formula for a fully connected neural network with a hidden layer and an output layer (\(y_{fc}\)) can be represented as follows^{20}:

$$y_{fc} = f\left(\sum_{i=1}^{n}\left(W_i * x_i\right) + b\right)$$

(11)

where \(x_i\) are the inputs to the layer, \(W_i\) are the weights of the corresponding connections, \(b\) is the bias, and \(f\) is the activation function applied to the output of each layer (e.g., sigmoid, ReLU, tanh).

It is important to note that this formula is for a single hidden layer neural network, but in practice, fully connected neural networks usually have multiple hidden layers, in which case the formula would be more complex and would include additional weight matrices and bias vectors for each additional layer.
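A minimal sketch of Eq. (11) and of its extension to several layers is given below. The function names (`dense`, `dense_layer`, `mlp`) are hypothetical, and the weights are passed in explicitly instead of being learned.

```python
def dense(x, weights, bias, activation):
    """Output of one neuron: f(sum_i W_i * x_i + b), as in Eq. (11)."""
    return activation(sum(w * xi for w, xi in zip(weights, x)) + bias)


def relu(z):
    return max(0.0, z)


def dense_layer(x, weight_rows, biases, activation):
    """A layer of several neurons: one weight row and one bias per neuron."""
    return [dense(x, w, b, activation) for w, b in zip(weight_rows, biases)]


def mlp(x, layers):
    """Stack layers; each entry is (weight_rows, biases, activation)."""
    for weight_rows, biases, activation in layers:
        x = dense_layer(x, weight_rows, biases, activation)
    return x
```

Each additional layer contributes its own weight matrix and bias vector, which is why the closed-form expression grows with the depth of the network.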

#### Dropout

Dropout is a regularization technique used in deep learning to avoid overfitting. It works by randomly “dropping” (i.e., setting to zero) a certain fraction of neurons during each training iteration. The mechanism of a dropout layer is quite simple: it is applied to the output of the previous layer and consists of multiplying the input vector by a binary mask. This mask is randomly generated for each training iteration, has the same shape as the input, and each of its elements is either 0 or 1. The probability that an element of the mask is 0 is called the dropout rate. The dropout rate is a hyperparameter that is usually set between 0.2 and 0.5, depending on the specific application and the complexity of the model. Typically, a low dropout rate is used for the input layer and a higher dropout rate is used for the hidden layers. Dropout is applied only during the training phase; during the testing phase the dropout rate is effectively 0, which means that all neurons are active^{21}.
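The masking mechanism can be sketched as follows, taking the dropout rate as the probability that an element is set to zero (the convention used by Keras). The rescaling of the surviving activations by 1/(1 − rate), often called inverted dropout, is an assumption added here and is not described in the text; the function name `dropout` and its arguments are hypothetical.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Zero each element with probability `rate` during training.

    Outside training, the input is returned unchanged (all neurons active).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep_prob = 1.0 - rate
    # Binary mask with the same shape as the input: 1 keeps, 0 drops.
    mask = [1 if rng.random() < keep_prob else 0 for _ in values]
    # Rescale the survivors so the expected activation is unchanged (assumption).
    return [v * m / keep_prob for v, m in zip(values, mask)]
```

A new mask is drawn on every call, which corresponds to generating a fresh mask for each training iteration.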

### Hyperparameters

The network was trained and tested using Python’s TensorFlow. The adaptive moment estimation (Adam) method, a widely used optimization algorithm for neural network training, was used as the optimizer. Adam combines techniques from the RMSprop and Momentum optimizers to efficiently and effectively adjust the weights of a neural network during training. See Eqs. (12)–(14) below^{22,23}:

$$m_t = \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t$$

(12)

$$v_t = \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^2$$

(13)

$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{v_t} + \epsilon} m_t$$

(14)

where \(m_t\) is the updated first moment estimate (mean of the gradients), \(v_t\) is the updated second moment estimate (uncentered variance of the gradients), \(\beta_1\) and \(\beta_2\) are the moment decay parameters, \(g_t\) is the gradient at the current step, α is the learning rate, “epsilon” (ϵ) is a small numerical constant to avoid division by zero, and \(\theta_t\) is the current value of the parameter being updated, i.e., the parameter that the algorithm optimizes.

The values of these hyperparameters were 1·10^{–6} for “Epsilon” in the training options and 1·10^{–4} for the learning rate. Batch sizes of 5 and 15 were used, with 1000 epochs and a “shuffle” in each epoch. The value of \(\beta _1\) was set to 0.99, and \(\beta _2\) to 0.999. Training and testing were performed using the holdout method for regression, with a training/test percentage split of 40–60.
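A single parameter update following Eqs. (12)–(14), with the hyperparameter values stated above as defaults, can be sketched as below. The sketch follows the equations as written, so it omits the bias-correction terms found in many Adam implementations; the function name `adam_step` is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, alpha=1e-4, beta1=0.99, beta2=0.999, eps=1e-6):
    """One Adam update for a scalar parameter (no bias correction)."""
    m = beta1 * m + (1.0 - beta1) * grad              # first moment, Eq. (12)
    v = beta2 * v + (1.0 - beta2) * grad ** 2         # second moment, Eq. (13)
    theta = theta - alpha / (math.sqrt(v) + eps) * m  # parameter update, Eq. (14)
    return theta, m, v


# Example: a few steps on f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for _ in range(3):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v)
```

In practice the same update is applied element-wise to every weight of the network, with one pair of moment estimates per weight.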

### Complete network architecture

The architecture developed in this work takes the previously defined temporal windows as its sequential input. This sequence passes through 3 levels. The first is an LSTM layer with 128 units in its hidden layer and with sequence return enabled. Levels 2 and 3 are 2 BiLSTM layers with sequence return enabled and 128 units each. At the output of this last level, a dense fully connected layer with 128 connections is implemented. Then, to reduce overfitting, a Dropout layer with a rate of 0.4 is added, followed by a Flatten layer that flattens the output sequence into a vector^{24}. Finally, a dense layer with one neuron and linear activation is added to obtain the output. A scheme of the entire architecture is shown in Fig. 6.
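To give a sense of the size of this architecture, the sketch below counts trainable parameters layer by layer. It rests on several assumptions that the text does not confirm: one input feature per time step, the standard LSTM parametrization (four weight blocks, each with input weights, recurrent weights, and a bias), BiLSTM outputs formed by concatenating both directions, and “128 connections” read as 128 units. All function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 blocks (i, f, o, g), each with
    input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units):
    """A BiLSTM holds two independent LSTMs (forward and backward)."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    return input_dim * units + units


def count_parameters(window, features=1, units=128):
    """Return per-layer and total parameter counts for the described stack."""
    layers = [
        ("LSTM", lstm_params(features, units)),          # output width: units
        ("BiLSTM 1", bilstm_params(units, units)),       # output width: 2 * units
        ("BiLSTM 2", bilstm_params(2 * units, units)),   # output width: 2 * units
        ("Dense", dense_params(2 * units, units)),       # applied at each time step
        ("Dropout", 0),                                  # no trainable parameters
        ("Flatten", 0),                                  # output width: window * units
        ("Dense output", dense_params(window * units, 1)),
    ]
    return layers, sum(n for _, n in layers)


layers, total = count_parameters(window=5)
```

Under these assumptions, only the final dense layer depends on the window size, so changing the window between 1 and 20 has a small effect on the total number of parameters.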
