Temporal single spike coding for effective transfer learning in spiking neural networks

In this section, the basic model of the non-leaky Integrate-and-Fire (IF) neuron is first presented as a spiking model with temporal behavior. Then, the use of one-spike-per-neuron coding for data representation in the input, hidden, and readout layers is described. Finally, the mathematical formulation of the proposed fully single-spike supervised learning rule is presented for training the classifier block in a TL system.


IF neuron model interacting with single spike temporal behaviour

SNNs rely on spiking neurons to transmit information via discrete spikes. The IF neuron model is a simple model that simulates membrane potential changes and can be directly trained in deep SNNs. Figure 1, inspired by⁶, depicts the step-by-step flowchart of the SNNs used in this study, highlighting the IF neuron’s dynamic interaction with temporal single spike coding of data, as well as the signal processing and classification stages. The process begins with input data (such as pixel intensities from images or extracted features), which are encoded into temporal spike signals using the “Time-to-First-Spike” (TTFS) scheme. In this encoding, each pixel’s intensity is transformed into a delay time, forming a spatiotemporal 3D “spike image”. The encoded data propagate spatially and temporally and are processed sequentially over multiple time steps, from t = 0 to t = C, where C denotes the maximum number of time steps (e.g., 255 for 8-bit grayscale). At each time step, spikes are fed into the hidden layers composed of IF neurons. Each neuron’s state depends on its presynaptic inputs and on its own state at the previous time step, so it is updated recurrently. Each IF neuron accumulates incoming signals and fires a spike only when its membrane potential exceeds a fixed threshold. This dynamic process, which is detailed in the green box in the figure, consists of two key operations at each time step: potential update and threshold checking. If the accumulated weighted sum of inputs exceeds the threshold, the neuron spikes (outputs “1”); otherwise, it does not spike (“0”). Overall, Fig. 1 serves as a compact visual guide that connects the concepts of temporal encoding, membrane potential dynamics, and spike-based decision making within the proposed SNN classifier used for transfer learning.
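The two per-step operations described above (potential update, then threshold check) can be sketched in a few lines of plain Python. This is only an illustration: the function name, the input values, and the use of `>=` for the threshold test are assumptions, not details taken from the implementation used in this study.

```python
def if_neuron_first_spike(weighted_inputs, threshold):
    """Simulate one non-leaky IF neuron over discrete time steps.

    weighted_inputs[t] is the weighted sum of presynaptic spikes arriving
    at step t. Returns the first step at which the membrane potential
    reaches the threshold, or None if the neuron stays silent.
    """
    potential = 0.0
    for t, current in enumerate(weighted_inputs):
        potential += current          # potential update (no leak term)
        if potential >= threshold:    # threshold check (assumed inclusive)
            return t                  # single-spike coding: stop after firing
    return None


# Hypothetical input currents for six time steps.
inputs = [0.0, 0.4, 0.0, 0.3, 0.5, 0.0]
print(if_neuron_first_spike(inputs, threshold=1.0))  # 4
print(if_neuron_first_spike(inputs, threshold=5.0))  # None
```

Because the neuron has no leak, the potential only changes when a spike arrives, so the firing time depends solely on the order and weights of the incoming spikes.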

Time to first spike coding

Unlike traditional ANNs, which generate continuous outputs, SNNs use binary spikes that are interpreted through coding and decoding methods within the SNN structure. In this study, we employ the “time-to-first-spike” (TTFS) coding strategy, where the intensity of an external input is represented by the time interval between the stimulus onset and the first spike of the corresponding neuron. In this work, an inverse approach is used, where stronger inputs (i.e., higher pixel intensities) cause earlier firing, while weaker inputs lead to delayed firing. Therefore, neurons associated with larger pixel values fire earlier than those linked to smaller pixel values. Additionally, in the readout layer, the first neuron to fire determines the winner class. This approach allows rapid information processing and effective data encoding, as decisions can be made upon the arrival of the first spike, typically within milliseconds³⁴. To implement this scheme, we map pixel intensity values to specific time steps. If the pixel intensity of the input image, denoted as $C_i$, falls within the range $[0, C_{\text {max}}]$, it is linearly transformed to the single-spike firing time of a neuron, $t_s$, within $[0, T_{\text {max}}]$ using normalization. The occurrence time of the single spike is determined by Eqs. (1) and (2): Eq. (1) gives the direct linear mapping $t_s'$, and Eq. (2) inverts it so that larger intensities fire earlier. Here, $C_{\text {max}}$ represents the number of time steps (or time channels) over which the pixel intensity is distributed and is given by $C_{\text {max}} = 2^n - 1$, where $n$ is the number of bits used to encode pixel intensity. In this work, $n = 8$.

$$\begin{aligned} & t_s' = \frac{C_i}{C_{\text {max}}} T_{\text {max}} \end{aligned}$$

(1)

$$\begin{aligned} & \quad t_s = T_{\text {max}} - \frac{C_i}{C_{\text {max}}} \cdot T_{\text {max}} \end{aligned}$$

(2)
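As a quick numerical check of Eqs. (1) and (2), the sketch below maps a few 8-bit intensities to firing times. The function name, the choice $T_{\text{max}} = C_{\text{max}} = 255$, and the rounding to integer time steps are assumptions made for illustration.

```python
def ttfs_times(intensity, c_max=255, t_max=255):
    """Return (t_s_prime, t_s) from Eqs. (1) and (2) for one intensity."""
    if not 0 <= intensity <= c_max:
        raise ValueError("intensity must lie in [0, c_max]")
    t_direct = intensity * t_max / c_max   # Eq. (1): direct linear mapping
    t_inverse = t_max - t_direct           # Eq. (2): brighter -> earlier spike
    # Rounding to an integer time step is an assumption of this sketch.
    return round(t_direct), round(t_inverse)


for c in (0, 85, 255):
    print(c, ttfs_times(c))
# 0 (0, 255)
# 85 (85, 170)
# 255 (255, 0)
```

With these settings the brightest pixel fires at step 0 under Eq. (2), and a black pixel fires at the last step.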

The resulting encoding is visualized in Fig. 2, where a third dimension, time, is inserted into the 2D input image, transforming it into a 3D spatiotemporal “spike image” with dimensions height (H), width (W), and time (C). In Fig. 2.a, the encoding process for a single pixel is depicted. For example, a pixel with an intensity value of 85 generates a single spike at time step 85 along the temporal axis, while all other time steps remain silent (zero). Fig. 2.b presents the final 3D spike image as a binary tensor, where each spatial location (i.e., each pixel) contains exactly one spike (value = 1) at its corresponding firing time, and zeros elsewhere. This tensor has dimensions $H \times W \times C$, where C is the number of time steps (or channels). Figure 2.c illustrates this concept by highlighting two example pixels with intensity 85 in the input image. These are projected into the spike image, where their spikes are positioned at time step 85 along the time axis, preserving their spatial locations. This figure visually connects the original 2D image and the corresponding spikes in the 3D volume. The 3D spike model consists of single spikes, one for each pixel, following Eq. (3). In this equation, i and j represent the location of the (i, j)-th pixel in the input image.

$$\begin{aligned} x_s(i,j,t) = \left\{ \begin{array}{ll} 1 & t = t_s \\ 0 & \text { otherwise} \end{array} \right\} \end{aligned}$$

(3)
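Eq. (3) can be illustrated by expanding a small map of firing times into a binary tensor with exactly one spike per pixel. The sketch uses nested lists and a hypothetical 2×2 input with four time steps; the names are illustrative only.

```python
def spike_image(times, n_steps):
    """Build a binary H x W x n_steps tensor with one spike per pixel (Eq. 3)."""
    return [
        [[1 if t == t_s else 0 for t in range(n_steps)] for t_s in row]
        for row in times
    ]


times = [[0, 3], [2, 1]]          # hypothetical firing time of each pixel
x_s = spike_image(times, n_steps=4)
print(x_s[0][1])                  # [0, 0, 0, 1]
print(x_s[1][0])                  # [0, 0, 1, 0]
```

Summing the tensor along the time axis gives 1 at every pixel, which is a convenient sanity check for a single-spike encoding.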

By applying this procedure to each image in the static dataset, the original images are effectively transformed into their neuromorphic counterparts, which we refer to as “spike images.” These representations mimic biological neural processing by depicting information as a series of spikes rather than continuous values. This transformation converts the original static dataset into a neuromorphic dataset, which is then used for network training and testing. It allows the network to process information in a biologically inspired, event-driven manner using temporal coding.

Transfer learning strategy in SNNs

In this study, to demonstrate the effectiveness and robustness of our proposed learning rule for training the classifier block within a transfer learning system, we implemented two main scenarios, as depicted in Fig. 3. In both cases, the overall pipeline follows a similar structure: first, a dataset is processed through a CNN-based feature extractor to learn meaningful high-level representations. Then, these features are initially classified using a traditional Multi-Layer Perceptron (MLP) to achieve baseline accuracy. Subsequently, the convolutional layers are frozen, meaning their weights are no longer updated, and reused as pre-trained layers in a new classification task, where the classifier is replaced with an SNN trained using our proposed supervised learning rule based on single-spike temporal coding. Before being passed to the SNN classifier, the extracted features are converted into temporally coded spike representations using the Time-to-First-Spike encoding scheme. This hybrid pipeline leverages the feature abstraction power of CNNs while benefiting from the energy efficiency and biological plausibility of SNNs. The pre-trained feature extractor also facilitates faster convergence during SNN training by guiding the optimization towards better local minima. In Fig. 3.a, the feature extraction and transfer learning stages are performed using the same dataset. MNIST, Fashion-MNIST, and Caltech101-Face/Bike are each passed through a 2-layer or 4-layer CNN depending on their complexity. The extracted features are then used twice: first for training an MLP, and later (with frozen convolutional weights) for training our single-spike SNN classifier. Using the same dataset simplifies the experimental setup, reduces overfitting risks, and eliminates domain mismatch, making it ideal for verifying the internal behaviour and learning capacity of the proposed model under controlled conditions. Additionally, Fig. 
3.b illustrates a more challenging and realistic scenario, where a deep CNN (ResNet50), pre-trained on the large-scale ImageNet dataset, is employed as the feature extractor. Here, we decouple the source dataset (ImageNet) from the target dataset (ETH80), introducing a domain shift between feature learning and classification. By freezing the ResNet50 layers and feeding the spike-encoded features into our trainable SNN classifier, we evaluate the generalization capability of the proposed temporal learning rule when applied to diverse and previously unseen classification tasks. This setting validates the adaptability of our method to real-world scenarios, where pre-trained models on large datasets are repurposed for specialized applications with limited data. To ensure broad applicability, we evaluated our model on four datasets with varying complexity: a 10-class problem using MNIST, a 2-class problem using Caltech101-Face/Bike, a 10-class problem using Fashion-MNIST, and an 8-class problem using ETH80. The results across these datasets confirm the versatility and transferability of our learning strategy across different domains, dataset sizes, and task complexities.

Feature extractor

In this work, various CNN-based architectures are employed to extract meaningful features from input data, tailored to dataset complexity. These networks consist of convolutional layers with digital filters and pooling layers for dimension reduction. Fig. 4 illustrates the structure of the two-layer CNN for extracting low-level features from the MNIST dataset. The input is a grayscale image of size 28$\times$28 pixels. The feature extraction begins with a convolutional layer (C1) consisting of 32 filters with a 3$\times$3 kernel size and ReLU activation, resulting in 32 feature maps of size 26$\times$26 due to valid padding. Next, a second convolutional layer (C2), with 16 filters, a 4$\times$4 kernel, and ReLU activation, is applied to the output of C1, generating 16 feature maps of size 23$\times$23. These layers capture increasingly complex patterns in the image. Then, a max-pooling (subsampling) layer with a 3$\times$3 window is applied to downsample the feature maps to 7$\times$7 dimensions, while retaining essential spatial information. This layer is denoted as S1. Here, a dropout layer (0.25) prevents overfitting. The output of the subsampling layer is then flattened and passed through a fully connected layer with 128 hidden units and ReLU activation, learning a compact, high-level representation of the input. Another dropout layer with a rate of 0.5 is applied for regularization before the final dense layer. Finally, a second fully connected layer with 10 output neurons and a softmax activation function performs classification across the 10 digit classes. This CNN is specifically designed for MNIST and is kept deliberately simple to reduce computation and demonstrate the effectiveness of our spiking model on top of extracted features. By clearly separating each step, convolution, pooling, flattening, and fully connected layers, the figure visually supports the described process and matches the implementation step-by-step. 
For the Fashion-MNIST and Caltech101-Face/Bike datasets, we use a similar structure with modifications to accommodate increased complexity and richer feature sets. In Fashion-MNIST, the second convolutional layer has 64 filters (instead of 16), the pooling size is 2$\times$2 (instead of 3$\times$3), and the dense layer contains 784 neurons. In Caltech101-Face/Bike, the model includes an extra convolutional layer (8 filters, 4$\times$4 kernel, ReLU), another max pooling layer (3$\times$3), and a final convolutional layer (4 filters, 3$\times$3 kernel, ReLU) before a max pooling layer (2$\times$2). Dropout rates remain unchanged, and the final dense layer aligns with the number of dataset classes. For feature extraction on the ETH-80 dataset, we use ResNet-50, a powerful deep CNN with residual blocks and skip connections that mitigate vanishing gradients and enhance feature learning. By leveraging its pre-trained weights from the large-scale ImageNet dataset, we extract discriminative features to enhance classification with our proposed spiking model. Using ResNet-50 alongside other architectures further demonstrates our model’s adaptability across different feature extractors. Table 1 summarizes the feature extraction architectures across datasets, where CNN layer counts are based on 2D convolutional layers.
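The feature-map sizes quoted for the MNIST extractor (26, 23, and 7) follow from standard shape arithmetic. The sketch below reproduces them, assuming stride 1 and no padding for the convolutions, and a pooling stride equal to the window size with the remainder dropped; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, window):
    """Output width of non-overlapping pooling that drops the remainder."""
    return size // window


c1 = conv_out(28, 3)     # C1: 3x3 kernel on a 28x28 input
c2 = conv_out(c1, 4)     # C2: 4x4 kernel on the C1 output
s1 = pool_out(c2, 3)     # S1: 3x3 max pooling
flat = s1 * s1 * 16      # flattened length with 16 feature maps
print(c1, c2, s1, flat)  # 26 23 7 784
```

Under these assumptions the flattened output has 784 values, which is the number of features entering the first fully connected layer for MNIST.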

Classifier

In this section, we present an illustrative formulation of forward propagation and error backpropagation based on our temporal coding and the proposed target mechanism³⁵. The first layer of the classifier block, the coding layer, encodes incoming feature values into sparse spike times. Each extracted feature is assigned to an input neuron which emits a spike with a delay inversely proportional to the corresponding pixel value within $[0, \textrm{t}_{\textrm{max}}]$. Thus, the input layer contains as many neurons as extracted features. These spikes propagate to the hidden layer, where each neuron integrates the weighted spikes and updates its membrane potential at each time step. When the membrane potential reaches the threshold, the neuron fires a single spike to the output neurons. After processing the input image, output neurons may fire at different times. The earliest firing neuron determines the winner class. We define a temporal error function as the difference between actual and target firing times, which is then backpropagated through the fully connected layers of the classifier to update the weights. Table 1 provides an overview of classifier structures across datasets. In fully connected classifiers, the number of layers is determined by counting the hidden layers plus the output layer.

Table 1 Feature extractor and classifier architectures across various datasets.

Forward path. Fig. 5.a shows a one-layer fully connected spiking neural network and demonstrates the propagation of single spikes in the forward path. Assuming $l$ represents the $l^{\text {th}}$ layer in our spiking MLP classifier, we denote $x^{l-1} \in \mathbb {R}^{H \times W}$ as the input data to the $l^{\text {th}}$ layer. For the input layer (I) of the classifier, this corresponds to a 2D feature map extracted from the preceding CNN-based feature extractor, which is subsequently flattened. This feature map is temporally encoded using the TTFS method, as explained in Section “Time to first spike coding”. Each green arrow in the figure represents the emission of a spike at a specific time step. We denote $i$ and $j$ as the iterators over the spatial dimensions in all equations; $x^{l-1}(i,j)$ represents the value of each input feature, as defined in Eq. (4). After encoding the 2D input feature data into a 3D temporal representation using the TTFS method, the resulting spike tensor is $x_s^{l-1} \in \mathbb {R}^{H \times W \times C}$, as defined in Eq. (5). Finally, Eq. (6) shows the vectorized format of these spike features entering the classifier. In Fig. 5.a, the input spikes propagate to the hidden layer $H$ through the weight matrix $[w^H]$, and subsequently to the output layer $O$ via weights $[w^O]$. Each neuron in the hidden and output layers accumulates weighted spikes and emits a single spike when its membrane potential reaches a fixed threshold. The resulting firing times of the output neurons, denoted $t_{\text {out}_n}$, determine the final class prediction. The neuron that spikes first is selected as the predicted class, implementing a temporal winner-take-all mechanism that forms a core part of our learning rule. This figure visually summarizes the forward spike propagation and encoding mechanism within our proposed temporal SNN classifier.

$$\begin{aligned} & x^{l-1}(i,j): \left\{ x^{l-1} \in \mathbb {R};\; i,j \in \mathbb {N};\; 0 \le x^{l-1}(i,j) \le C;\; \left\{ \begin{array}{ll} {0 \le i \le H-1} \\ {0 \le j \le W-1} \end{array} \right\} \right\} \end{aligned}$$

(4)

$$\begin{aligned} & \quad x_s^{l-1}(i,j,t):\left\{ \begin{array}{ll} i,j,t \in \mathbb {N};\; t_s= x_s^{l-1}(i,j);\; x_s^{l-1} = \left\{ \begin{array}{ll} 1 & t=t_s \\ 0 & \text {otherwise} \end{array} \right\} ;\; \left\{ \begin{array}{ll} {0 \le i \le H-1} \\ {0 \le j \le W-1} \\ {0 \le t \le C-1} \end{array} \right\} \end{array} \right\} \end{aligned}$$

(5)

$$\begin{aligned} & \quad x_s^I(i,t):\left\{ \begin{array}{ll} i,t \in \mathbb {N};\; [t_s^l]_{FE}(i) = x_s^I(i);\; x_s^I(i,t-t_s) = \left\{ \begin{array}{ll} 1 & t=t_s \\ 0 & \text {otherwise} \end{array} \right\} ;\; \left\{ \begin{array}{ll} {0 \le i \le H-1} \\ {0 \le t \le C-1} \end{array} \right\} \end{array} \right\} \end{aligned}$$

(6)

We define the partial membrane potential ($V _{-}p$), which represents the contribution of all presynaptic neurons connected to a postsynaptic neuron in generating the membrane potential at each time step t. Equation (7) shows the partial membrane potential ($V _{-}p^H_j(t)$) of the j’th neuron of the hidden layer (H) at time step t. The cumulative sum of these partial membrane potentials over all time steps yields the total membrane potential (net value) of each neuron in the hidden or output layers, as expressed in Eq. (8). The IF neuron model (Eq. (9)) is then applied, and the resulting value is compared to a threshold at each time step to determine the firing time of each hidden neuron. Here, $x_s^I$ denotes the spike features entering the spiking hidden layers, and I is the number of input neurons in the classifier block, equal to the number of extracted features. The vector of spike times for hidden neurons, denoted by $t_s^H$, is transformed into spike patterns as described in Eq. (6). Using the same method, the firing times of the network’s output neurons are computed.

$$\begin{aligned} & V _{-}p^H_j(t)=\sum _{i=0}^{I} \textrm{w}_i^H x_s^I(i,t) + b^H \end{aligned}$$

(7)

$$\begin{aligned} & \quad V _{-}t^H_j(t) = \left[ \sum _{t=0}^{t_s-1} V _{-}p^H_j(t) \right] + b^H \end{aligned}$$

(8)

$$\begin{aligned} & \quad t_s^H = \text {IF}( V _{-}t^H_j(t)) = {\left\{ \begin{array}{ll} t_s & V _{-}t^H_j(t) = \text {Threshold} \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$

(9)
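The forward computation of Eqs. (7) to (9) can be sketched for one fully connected layer as follows. Several choices here are assumptions made for illustration: the bias is added once as the initial potential, the threshold test is inclusive, and a neuron that never fires is assigned the number of time steps as a stand-in for $t_{\text{max}}$. The names and values are hypothetical.

```python
def layer_firing_times(in_times, weights, bias, threshold, n_steps):
    """Firing time of each postsynaptic IF neuron, one spike time per input.

    weights[j][i] connects input i to neuron j. A neuron that never reaches
    the threshold is assigned n_steps, standing in for t_max (silent).
    """
    out_times = []
    for w_j in weights:
        potential = bias                 # assumption: bias applied once
        fired_at = n_steps
        for t in range(n_steps):
            # Partial potential at step t: weights of inputs spiking now (Eq. 7),
            # accumulated over time into the total potential (Eq. 8).
            potential += sum(w for w, t_i in zip(w_j, in_times) if t_i == t)
            if potential >= threshold:   # IF rule (Eq. 9)
                fired_at = t
                break
        out_times.append(fired_at)
    return out_times


in_times = [0, 2, 5]                            # hypothetical input spike times
weights = [[0.6, 0.6, 0.1], [0.2, 0.2, 0.2]]    # two postsynaptic neurons
print(layer_firing_times(in_times, weights, 0.0, 1.0, 8))  # [2, 8]
```

The first neuron crosses the threshold when its second input arrives at step 2, while the second neuron never accumulates enough potential and is treated as silent.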

Our goal with backpropagation is to update each weight in the network to bring the actual output closer to the target output, minimizing error at both the neuron and network levels. To perform backpropagation, we need to update both the weights and the deltas. Figure 5.b shows the backward learning procedure in our proposed spiking neural network. The right side of the figure shows the key concept of our proposed “Absolute Target Firing Time”: we assign a fixed target spike time only for the correct output neuron, denoted by $t_{\text {target}_n}$ while all other output neurons remain silent, implicitly represented by a late firing time $t_{\text {max}}$. We denote the n’th output firing time as $t_{\text {out}_n}$ and the corresponding target value as $t_{\text {target}_n}$, where N represents the total number of outputs and $n \in \mathbb {N}$. The temporal error function, $e_n$, quantifies this difference and is formally expressed in Eq. (10). For convergence stability, this value is normalized by dividing it by $t_{\text {max}}$. As visually shown in the figure, for the correct neuron (i.e., the winner), $\Delta t_n = 0$, while the other neurons incur positive errors due to firing later (or not firing at all). This temporal error is then used to adjust the synaptic weights $[w^O]$ and $[w^H]$ during backpropagation. As shown in the figure, red arrows indicate the direction of gradient flow. We first compute the local error (delta) at the output neurons and propagate it backward through the network layers. Neurons that contribute to earlier spikes play a more significant role in the error signal, which aligns with our IF-ReLU functional approximation described in the training formulation. This process is fully vectorized and is iteratively applied across all layers. The figure effectively conveys how the absolute target strategy simplifies learning: only the correct output neuron is required to match a fixed spike time, while others are suppressed. 
This design reduces spike activity and accelerates convergence, making training more efficient compared to approaches that rely on relative spike times, such as in²⁶. During training, we apply stochastic gradient descent (SGD) and backpropagation to minimize the total network error, defined as the squared loss in Eq. (11).

$$\begin{aligned} & e_n = t_{\text {target}_n} - t_{\text {out}_n} = \Delta t_n \end{aligned}$$

(10)

$$\begin{aligned} & \quad E_{\text {total}} = \frac{1}{2} \Vert e_n \Vert ^2 = \frac{1}{2} \sum _{n=1}^{N} (t_{\text {target}_n} - t_{\text {out}_n})^2 \end{aligned}$$

(11)
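A small numerical example may help to read Eqs. (10) and (11). The sketch assigns the fixed target time to the correct output neuron and $t_{\text{max}}$ to all others, then computes the normalized errors and the squared loss. The firing times, the target value, and the function names are hypothetical.

```python
def temporal_errors(t_out, label, t_target, t_max):
    """Normalized errors e_n = (t_target_n - t_out_n) / t_max (Eq. 10)."""
    targets = [t_target if n == label else t_max for n in range(len(t_out))]
    return [(tg - to) / t_max for tg, to in zip(targets, t_out)]


def total_error(errors):
    """Squared loss of Eq. (11)."""
    return 0.5 * sum(e * e for e in errors)


t_out = [256, 40, 128]   # hypothetical output firing times, t_max = 256
e = temporal_errors(t_out, label=1, t_target=40, t_max=256)
print(e)                 # [0.0, 0.0, 0.5]
print(total_error(e))    # 0.125
```

Here the correct neuron (index 1) fires exactly at its target and the silent neuron (index 0) matches $t_{\text{max}}$, so only the wrong neuron that fired at step 128 contributes to the loss.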

Considering $w^{\prime }$ as the weight connected to the n’th neuron of the output layer (forward connections shown in Fig. 5.a), we aim to determine how much a change in $w^{\prime }$ affects the total error, denoted as $\frac{\partial E_{\text {total}}}{\partial w'}$. Applying the chain rule yields Eq. (12).

$$\begin{aligned} \frac{\partial E_{\text {total}}}{\partial w'} =\frac{\partial E_{\text {total}}}{\partial t_{\text {out}_n}} \times \frac{\partial t_{\text {out}_n}}{\partial \text {net}_{O_n}} \times \frac{\partial \text {net}_{O_n}}{\partial w'} \end{aligned}$$

(12)

The first term in Eq. (12) is given by $t_{\text {out}_n} - t_{\text {target}_n}$ using Eq. (11). To compute the third term, we determine the change in the net value of the n’th output (Eq. (13)) with respect to $w'$, where $\frac{\partial \text {net}_n}{\partial w'}$ is equal to $\text {out}_{h_{w'}}$ accordingly.

$$\begin{aligned} \text {net}_n = \sum _{j=1}^{J} w_j \cdot \text {out}_{h_j} + b_O = w_1 \cdot \text {out}_{h_1} + w_2 \cdot \text {out}_{h_2} + \cdots + w^{\prime } \cdot \text {out}_{h_{w^{\prime }}} + \cdots + w_J \cdot \text {out}_{h_J} + b_O \cdot 1 \end{aligned}$$

(13)

The second term of Eq. (12), representing the derivative of the postsynaptic spike time with respect to its membrane potential, poses a challenge. It shows how much the n’th output ($t_{\text {out}_n}$) changes with respect to its total net input. However, we lack a direct function relating the net input to the postsynaptic spike latency. To address this issue, we employ the theorem from²⁶, which states that an IF neuron can approximate the ReLU function. In ANNs with ReLU activation function, the computation of the output of a neuron with index j in layer l is given by $y_j^l = \max (0, \text {net}_j^l = \sum _i w_{ij}^l x_i^{l-1})$, where $x_i^{l-1}$ is the i’th input, and $w_{ij}^l$ represents the weight connecting the i’th presynaptic neuron to the j’th postsynaptic neuron. Thus, the ReLU neuron with a larger net input value ($\text {net}_j^l$) has a larger output value $y_j^l$. In time-to-first-spike coding, earlier spikes correspond to larger net input values, and IF neurons receiving stronger synaptic weights fire earlier. Thus, earlier spikes carry more information. This coding scheme is preserved across hidden and output layers, allowing us to establish an equivalence between the ReLU neuron output $y_j^l$ and the firing time of the corresponding IF neuron, $t_j^l$, expressed as $y_j^l \sim t_{\text {max}} - t_j^l$. This approximation assumes an inverse relationship between ReLU neuron output and the IF neuron firing time. As the firing time decreases (indicating an earlier spike), the output of the ReLU neuron increases, enabling the efficient encoding and decoding of temporal information. Assuming IF neurons approximate ReLU neurons, we derive $\frac{\partial t_{\text {out}_n}}{\partial \text {net}_n} = -1$. 
This leads us to Eq. (14), where $\delta _n$, defined as $\frac{\partial E_{\text {total}}}{\partial t_{\text {out}_n}} \times \frac{\partial t_{\text {out}_n}}{\partial \text {net}_{O_n}}$, is equal to $t_{\text {target}_n} - t_{\text {out}_n}$. Finally, $\frac{\partial E_{\text {total}}}{\partial w^{\prime }}$ is computed as in Eq. (15).

$$\begin{aligned} & \frac{\partial E_{\text {total}}}{\partial w'} =\frac{\partial E_{\text {total}}}{\partial t_{\text {out}_n}} \times \frac{\partial t_{\text {out}_n}}{\partial \text {net}_{O_n}} \times \frac{\partial \text {net}_{O_n}}{\partial w'} = -(t_{\text {target}_n} - t_{\text {out}_n}) \cdot \left( -1\right) \cdot \text {out}_{h_{w^{\prime }}} \end{aligned}$$

(14)

$$\begin{aligned} & \quad \frac{\partial E_{\text {total}}}{\partial w^{\prime }} = \delta _n \cdot \text {out}_{h_{w^{\prime }}} \end{aligned}$$

(15)

In our model, $\delta _n$ represents the error backpropagated from the output layer, denoted as $\Delta t_n$, and $\text {out}_{h_{w'}}$ is the input spike train to the n’th output neuron, which, based on Eq. (5), is equal to 1 at the firing time. Thus, we obtain $\frac{\partial E_{\text {total}}}{\partial w^{\prime }} = \Delta t_n$. To decrease the error, we subtract this value from the current weight, as shown in Eq. (16). The update is scaled by a learning rate (lr), which is set to 0.2 in our implementations.

$$\begin{aligned} w^{\prime }_{\text {new}} = w^{\prime }_{\text {old}} - \text {lr} \frac{\partial E_{\text {total}}}{\partial w^{\prime }} \end{aligned}$$

(16)

Using the chain rule, we propagate $\Delta t_n$ back to the previous layer to update the weights of the output layer. We define $\text {hasfired}_j$ (formulated in Eq. (17)) as a signal indicating that the weights associated with presynaptic spikes occurring before the postsynaptic spike are updated. This leads us to Eq. (18). In this equation, the remaining weights in the output layer are updated following the same procedure as the $w’$ connection.

$$\begin{aligned} & \text {hasfired}_j = \left\{ \begin{array}{ll} 1 & t_j < t_n \\ 0 & \text {otherwise} \end{array} \right\} \end{aligned}$$

(17)

$$\begin{aligned} & \quad w^{\prime }_{\text {new}} = w^{\prime }_{\text {old}} - \text {lr} \cdot \Delta t_n \cdot \text {hasfired}_j \end{aligned}$$

(18)
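The output-layer update of Eqs. (17) and (18) can be sketched as a double loop over output and hidden neurons. The weight layout (`weights[n][j]` connects hidden neuron j to output neuron n) and the example values are assumptions for illustration; only the learning rate of 0.2 comes from the text.

```python
def update_output_weights(weights, hidden_times, out_times, delta_t, lr=0.2):
    """Apply Eq. (18) in place; weights[n][j] connects hidden j to output n."""
    for n, row in enumerate(weights):
        for j in range(len(row)):
            # Eq. (17): only presynaptic spikes that precede the output spike count.
            has_fired = 1 if hidden_times[j] < out_times[n] else 0
            row[j] -= lr * delta_t[n] * has_fired
    return weights


w_out = [[0.5, 0.5], [0.5, 0.5]]
hidden_times = [3, 12]      # hypothetical hidden firing times
out_times = [5, 10]         # hypothetical output firing times
delta_t = [0.0, 0.5]        # hypothetical normalized errors from Eq. (10)
update_output_weights(w_out, hidden_times, out_times, delta_t)
print(w_out)
```

In this example only the connection from the first hidden neuron to the second output neuron changes: the first output neuron has zero error, and the second hidden neuron fires after the second output neuron, so its connection is masked out.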

To backpropagate the error to the hidden layer, we follow a process similar to that of the output layer, with slight modifications to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore the error) of multiple output neurons. We consider $w^{\prime \prime }$ (shown in Fig. 5.b) as one of the weights in the hidden layer. As with the output layer, we determine how a change in $w^{\prime \prime }$ affects the total error, denoted as $\frac{\partial E_{\text {total}}}{\partial w^{\prime \prime }}$. This value is computed using Eq. (19) and expanded in Eq. (20) accordingly.

$$\begin{aligned} & \frac{\partial E_{\text {total}}}{\partial w^{\prime \prime }} = \frac{\partial E_{\text {total}}}{\partial \text {out}_{h_{w^{\prime \prime }}}} \times \frac{\partial \text {out}_{h_{w^{\prime \prime }}}}{\partial \text {net}_{h_{w^{\prime \prime }}}} \times \frac{\partial \text {net}_{h_{w^{\prime \prime }}}}{\partial w^{\prime \prime }} \end{aligned}$$

(19)

$$\begin{aligned} & \quad \frac{\partial E_{\text {total}}}{\partial \text {out}_{h_{w^{\prime \prime }}}} = \frac{\partial }{\partial \text {out}_{h_{w^{\prime \prime }}}} \left( \sum _{n=0}^{N-1} E_{O_n} \right) = \frac{\partial E_{O_1}}{\partial \text {out}_{h_{w^{\prime \prime }}}} + \frac{\partial E_{O_2}}{\partial \text {out}_{h_{w^{\prime \prime }}}} + \cdots + \frac{\partial E_{O_n}}{\partial \text {out}_{h_{w^{\prime \prime }}}} \end{aligned}$$

(20)

Since $\text {out}_{\textrm{h}_{w''}}$ affects all the outputs, represented as $\text {Out}_1, \text {Out}_2, \ldots , \text {Out}_n$, $\frac{\partial E_{\text {total}}}{\partial \text {out}_{\textrm{h}_{w''}}}$ must account for its impact on all output neurons. $\frac{\partial E_{O_n}}{\partial \text {out}_{\textrm{h}_{w''}}}$ is calculated using Eq. (21). By considering that $\text {net}_n$ is computed similarly to Eq. (13), we obtain $\frac{\partial \text {net}_n}{\partial \text {out}_{h_{w''}}} = w^{\prime \prime }$. The same process can be applied to calculate the other terms in Eq. (20), leading to Eq. (22).

$$\begin{aligned} & \frac{\partial E_{O_n}}{\partial \text {out}_{\textrm{h}_{w''}}} = \frac{\partial E_{O_n}}{\partial \text {out}_n} \times \frac{\partial \text {out}_n}{\partial \text {net}_n} \times \frac{\partial \text {net}_n}{\partial \text {out}_{\textrm{h}_{w''}}} \end{aligned}$$

(21)

$$\begin{aligned} & \quad \frac{\partial E_{\text {total}}}{\partial w^{\prime \prime }} = \sum _O \left( \frac{\partial E_{\text {total}}}{\partial \text {out}_O} \cdot \frac{\partial \text {out}_O}{\partial \text {net}_O} \cdot \frac{\partial \text {net}_O}{\partial \text {out}_{h_{w''}}} \right) \cdot \frac{\partial \text {out}_{h_{w''}}}{\partial \text {net}_{h_{w''}}} \cdot \frac{\partial \text {net}_{h_{w''}}}{\partial w^{\prime \prime }} \end{aligned}$$

(22)

Similar to the output layer, we consider the IF neurons in the hidden layer as approximating ReLU. Hence ${\frac{\partial \text {out}_{h_{w''}}}{\partial \text {net}_{h_{w''}}}}$, which is the second term in Eq. (19), is approximated as $-1$. We consider $\text {net}_{h_j}$ as the membrane potential of the j’th neuron in the hidden layer. This value is calculated in a manner similar to Eq. (13). Therefore, $\frac{\partial \text {net}_{h_{w''}}}{\partial w^{\prime \prime }}$ is given as $x_{w''}$, which represents the input of the network associated with the $w^{\prime \prime }$ connection. By substituting the calculated values, we arrive at Eq. (23). In this equation, $\delta _{h_{w''}}$ represents the backpropagated error, and $x_{w''}$ is the spike value of the connection related to $w^{\prime \prime }$, calculated according to Eq. (6).

$$\begin{aligned} \frac{\partial E_{\text {total}}}{\partial w^{\prime \prime }} = \left( \sum _O \delta _O \cdot w_{h_{w''}}\right) \cdot (-1) \cdot x_{w''} = \delta _{h_{w''}} \cdot x_{w''} \end{aligned}$$

(23)

As we did for the output layer, we use $\text {hasfired}_i$ to indicate that we update the i’th connection when its spike (the presynaptic spike) occurs before the output spike (the postsynaptic spike). Denoting $\frac{\partial E_{\text {total}}}{\partial w^{\prime \prime }}$ as $\text {delta}_j$ for the j’th hidden neuron, it is calculated according to Eq. (24). Finally, $w^{\prime \prime }$ is updated according to Eq. (25). For deeper networks with more hidden layers, this process extends to those layers to update their respective weights.

$$\begin{aligned} & \text {delta}_j = \sum _{n=0}^{N-1} w_n \cdot \Delta t_n \cdot \text {hasfired}_i \end{aligned}$$

(24)

$$\begin{aligned} & \quad w^{\prime \prime }_{\text {new}} = w^{\prime \prime }_{\text {old}} - \text {lr} \cdot \text {delta}_j \cdot \text {hasfired}_i \end{aligned}$$

(25)
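One possible reading of Eqs. (24) and (25) is sketched below. The indexing is an interpretation rather than the confirmed implementation: the mask inside the sum of Eq. (24) is taken to compare the hidden spike time with each output spike time, and the mask in Eq. (25) compares each input spike time with the hidden spike time. All names and values are hypothetical.

```python
def update_hidden_weights(w_hidden, w_out, in_times, hidden_times,
                          out_times, delta_t, lr=0.2):
    """Apply Eqs. (24) and (25) in place.

    w_hidden[j][i] connects input i to hidden j; w_out[n][j] connects
    hidden j to output n.
    """
    for j, row in enumerate(w_hidden):
        # Eq. (24): gather output errors through connections whose
        # presynaptic (hidden) spike preceded the output spike.
        delta_j = sum(
            w_out[n][j] * delta_t[n]
            for n in range(len(w_out))
            if hidden_times[j] < out_times[n]
        )
        for i in range(len(row)):
            # Eq. (25): update only inputs that spiked before hidden neuron j.
            if in_times[i] < hidden_times[j]:
                row[i] -= lr * delta_j
    return w_hidden


w_hidden = [[1.0, 1.0]]          # one hidden neuron, two inputs
w_out = [[0.5], [2.0]]           # two output neurons
update_hidden_weights(w_hidden, w_out, in_times=[1, 7], hidden_times=[4],
                      out_times=[6, 3], delta_t=[0.5, 0.25])
print(w_hidden)
```

In this example only the first output neuron contributes to the delta, because the second output neuron fired before the hidden neuron, and only the first input weight is updated, because the second input spiked after the hidden neuron.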

Absolute target firing times

To perform error backpropagation within our network’s temporal behaviour, we propose a simple and fixed target, as shown in Fig. 5.b. We assume that the firing time of the correct output neuron, corresponding to the input image label, matches the occurrence time of the target spike. This ideal target, which we term the “Absolute Target,” ensures that only the correct output neuron fires simultaneously with the target spike, resulting in zero error for the expected (correct) output while keeping all other neurons silent. In this target selection method, we use only one single spike as the target, enabling a fully single-spike learning rule in which the inputs, hidden layer data, output data, and even the target itself consist of just one spike. The network is trained to minimize the error between its predicted output and the fixed target signal. To model the behaviour of the nonfiring neurons, we suppose a fake spike occurs at the last time step, $t_{\textrm{max}}$, beyond the grayscale level range, representing the absence of firing. In²⁶, the authors employed a relative approach that considers the actual firing times. For each input image, they compute a new set of target spikes corresponding to the network outputs. This means that during each iteration, they observe the output spikes and then determine the target spike times for all of the output neurons of the network. In traditional supervised learning, by contrast, the target signal remains fixed for each training iteration, so their approach requires extensive computation, leading to increased training time and power consumption. We compute the minimum output firing time as $\tau = \min \{t_j^o \mid 1< j < C\}$ and then set the target firing time for the j’th output neuron according to Eq. (26), where $i$ denotes the index of the correct class. The $\gamma$ term acts as a constraint to penalize a target neuron which never fires.

$$\begin{aligned} \tau _j^o = {\left\{ \begin{array}{ll} t_{\text {max}} & i \ne j \\ \tau – \gamma & i = j \quad \& \quad \tau = t_{\text {max}} \\ \tau & i = j \quad \& \quad \tau \ne t_{\text {max}} \\ \end{array}\right. } \end{aligned}$$

(26)
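The three cases of Eq. (26) translate directly into a short function. In this sketch `label` plays the role of $i$ (the correct class), and the value of `gamma` is a hypothetical choice, since no value is stated here.

```python
def target_times(out_times, label, t_max, gamma):
    """Target firing time of every output neuron, following Eq. (26)."""
    tau = min(out_times)                  # earliest output spike
    targets = []
    for j in range(len(out_times)):
        if j != label:
            targets.append(t_max)         # wrong classes should stay silent
        elif tau == t_max:
            targets.append(tau - gamma)   # nothing fired: pull the target earlier
        else:
            targets.append(tau)           # correct class should fire first
    return targets


print(target_times([256, 90, 40], label=1, t_max=256, gamma=3))    # [256, 40, 256]
print(target_times([256, 256, 256], label=1, t_max=256, gamma=3))  # [256, 253, 256]
```

The second call shows the role of $\gamma$: when no output neuron fires, the correct neuron still receives a nonzero error that pushes it to fire earlier.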

Handling error cases in counting the correct outputs

In the first spike coding method of the readout layer, the winning class is determined by the first neuron to fire. We expect that the winning class will both fire and be the first to do so. However, two error scenarios may arise:

1. Cases where more than one neuron has the minimum firing time, requiring the network to select one class among them.
2. Cases where none of the output neurons fire (in this case, we model all output firing times as $t_{\text {max}}$).

In²⁶, only the first case is handled: in the train and test phases, the authors introduce a condition (an if statement in the publicly available Python code of the model). In this case, when no neuron fires, the firing times are determined based on membrane potentials at the arbitrary time step of $t_{\textrm{max}}-3$. While this adjustment is acceptable during training, where model modifications are more flexible, it can bias test accuracy. The goal is to train the network so that its weights minimize error cases in the test loop, rather than manually adjusting outputs through conditional rules. This ensures reliable performance without introducing biases or assumptions. Error cases are more frequent in early training epochs but decrease as learning progresses, though they do not disappear entirely. Error cases can reduce the accuracy of the classification. To mitigate this issue, we employ a dropout mechanism. Applying dropout to specific layers, chosen according to the input dataset, reduces both types of errors, as demonstrated in the next section. By randomly deactivating neurons, dropout encourages the network to learn more generalizable features³⁶. In our single-spike temporal coding mechanism, dropout treats a subset of neurons as non-firing, assigning them firing times $t_{\text {max}}$. This approach helps prevent overfitting and reduces the likelihood of errors.
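The first-spike decision and the two error cases can be made explicit with a small helper. This sketch assumes that both error cases are simply reported (and would be counted as incorrect) instead of being resolved by an extra rule; the function name and status labels are hypothetical.

```python
def classify(out_times, t_max):
    """Return (winner, status), where status is 'ok', 'tie', or 'silent'."""
    earliest = min(out_times)
    if earliest >= t_max:
        return None, "silent"      # error case 2: no output neuron fired
    winners = [n for n, t in enumerate(out_times) if t == earliest]
    if len(winners) > 1:
        return None, "tie"         # error case 1: several neurons fired first
    return winners[0], "ok"


print(classify([120, 40, 256], t_max=256))   # (1, 'ok')
print(classify([40, 40, 256], t_max=256))    # (None, 'tie')
print(classify([256, 256, 256], t_max=256))  # (None, 'silent')
```

Tracking the two statuses separately during evaluation makes it possible to see how often each error case occurs as training progresses.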

Handling “dead” neurons

A key challenge in training single-spike SNNs arises when hidden neurons rarely fire during learning. We refer to these neurons as “dead” neurons. In our work, we consider a neuron as dead if it fires in less than 0.001 of the total training samples for the Caltech101-Face/Bike, MNIST, and ETH80 datasets, and less than 0.01 for Fashion-MNIST. Dead neurons can hinder effective training and degrade performance. Following²⁶, we address this issue by resetting the initial weights of dead neurons and enforcing their activation during training.
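The dead-neuron criterion can be expressed as a threshold on the firing rate over the training set. The sketch below only detects dead neurons; how their weights are reset is not specified here, so that step is left out. The firing counts are hypothetical, and the default rate of 0.001 is the value quoted above for MNIST, Caltech101-Face/Bike, and ETH80.

```python
def dead_neurons(fire_counts, n_samples, min_rate=0.001):
    """Indices of hidden neurons that fire in less than min_rate of the samples."""
    return [j for j, count in enumerate(fire_counts)
            if count / n_samples < min_rate]


# Hypothetical per-neuron firing counts over 60000 training samples.
print(dead_neurons([0, 59, 600, 3000], n_samples=60000))                  # [0, 1]
print(dead_neurons([0, 59, 600, 3000], n_samples=60000, min_rate=0.01))   # [0, 1]
print(dead_neurons([0, 59, 500, 3000], n_samples=60000, min_rate=0.01))   # [0, 1, 2]
```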
