LLS: Local Learning Rule for Deep Neural Networks Inspired by Neural Activity Synchronization

Marco P. E. Apolinario, Arani Roy, Kaushik Roy
Elmore Family School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN, USA
{mapolina, roy173, kaushik}@purdue.edu
Abstract

Training deep neural networks (DNNs) using traditional backpropagation (BP) presents challenges in terms of computational complexity and energy consumption, particularly for on-device learning where computational resources are limited. Various alternatives to BP, including random feedback alignment, forward-forward, and local classifiers, have been explored to address these challenges. These methods have their advantages, but they can encounter difficulties when dealing with intricate visual tasks or demand considerable computational resources. In this paper, we propose a novel Local Learning rule inspired by neural activity Synchronization phenomena (LLS) observed in the brain. LLS utilizes fixed periodic basis vectors to synchronize neuron activity within each layer, enabling efficient training without the need for additional trainable parameters. We demonstrate the effectiveness of LLS and its variations, LLS-M and LLS-MxM, on multiple image classification datasets, achieving accuracy comparable to BP with reduced computational complexity and minimal additional parameters. Specifically, LLS achieves comparable performance with up to $300\times$ fewer multiply-accumulate (MAC) operations and half the memory requirements of BP. Furthermore, the performance of LLS on the Visual Wake Words (VWW) dataset highlights its suitability for on-device learning tasks, making it a promising candidate for edge hardware implementations. Our code is available in a GitHub repository.

1 Introduction

Currently, stochastic gradient-based optimization schemes serve as the default method for training deep neural network (DNN) models. These schemes leverage the backpropagation (BP) algorithm, enabling the computation of gradients of the loss function with respect to the trainable parameters (weights) in the hidden layers. However, BP is associated with high time and memory complexities, leading to significant energy consumption. For instance, in a model with $L$ layers and $n$ neurons per layer, BP exhibits time and memory complexities of $O(Ln^{2})$ and $O(Ln)$, respectively. While suitable for offline training in environments with ample computational resources (such as the cloud), these computational demands render BP inefficient for on-device learning on low-power edge devices, where computation resources are severely constrained [31, 1, 25]. Studies such as [1] and [25] highlight the large energy consumption associated with extensive external memory accesses and gradient computations in BP. Consequently, there is a need for hardware-friendly algorithms to facilitate efficient on-device learning on low-power edge devices.

With this consideration in mind, numerous works have explored alternatives to backpropagation (BP), trying to eliminate the need for its computationally expensive gradient calculations. Methods like feedback alignment (FA) and its variant, direct feedback alignment (DFA), utilize random matrices to propagate error signals or directly project errors to each layer, offering some reduction in dependency across layers but with similar memory demands [19, 28, 5]. An alternative to this approach is proposed by [9], which uses random matrices to project targets instead of errors, thereby enabling each layer to be updated independently. Although promising, these methods do not scale well to deep neural networks (DNNs). In contrast, [24] proposes a local learning rule that matches BP performance in large models at the cost of significantly increasing the number of trainable parameters and computational complexity. Recent works have attempted to replace BP's backward pass with an additional forward pass, aiming to enhance biological plausibility, though they suffer from slow convergence and have not yet proven effective for deep networks [7, 11]. Additionally, [14] proposes a biologically inspired method using a soft winner-take-all mechanism to facilitate unsupervised learning in simpler DNN models. A different line of work, [23, 2] and [29], uses auxiliary networks as local classifiers. These methods avoid end-to-end BP by breaking the problem into smaller pieces and generating error signals with the aid of such local classifiers per layer or group of layers. Since these methods necessitate additional layers to generate the learning signal, we categorize them as hybrids between local learning and BP.

The aforementioned learning methods often struggle to scale to complex vision tasks without high computational costs [19, 28, 9, 7, 14, 24]. Hybrid approaches using local classifiers [23, 2] offer a better balance for on-device learning but at the cost of increasing trainable parameters, thus increasing memory and energy demands. To address this, we propose a Local Learning rule inspired by brain-like neural activity Synchronization (LLS). This rule bypasses intensive gradient calculations of BP and scales to complex vision tasks and deep networks.

Neuronal activity synchronization in the brain reflects the correlation of brain signals. Studies in [15, 10, 22, 12, 3] have demonstrated that neuronal ensembles in the brain synchronize their activity during cognitive learning processes or in response to visual stimuli. Inspired by this biological process, LLS utilizes fixed periodic basis vectors to synchronize neuron activity within the same layer of the model. Our experiments show that simple periodic functions like cosine and square waves enable effective learning in complex image classification tasks. These functions are computationally lightweight, allowing on-the-fly generation on low-power devices without additional trainable parameters. Furthermore, we explore variations of LLS, such as LLS-M and LLS-MxM, to enhance performance on more complex tasks. LLS-M learns to modulate the amplitude of the fixed basis, while LLS-MxM learns to construct an improved basis through a linear combination of the fixed basis. Both variants require minimal trainable parameters, on the order of $O(C)$ and $O(C^{2})$, respectively, where $C$ represents the number of classes. Evaluation on public image classification datasets, including CIFAR10, CIFAR100, IMAGENETTE, TinyIMAGENET, and Visual Wake Words (VWW), demonstrates that our method achieves high accuracy comparable to BP, with significant reductions in MAC operations and memory usage, and with minimal additional parameters. Notably, our method's performance on the VWW dataset underscores its suitability for on-device learning hardware implementations.

The main contributions of the paper are as follows:

  • A novel local learning rule that utilizes fixed periodic basis vectors to synchronize neural activity per layer, achieving high accuracy with reduced MAC operations, memory usage, and minimal additional trainable parameters.

  • Evaluation of the effectiveness of our method on various image classification datasets, demonstrating accuracy comparable to BP.

  • Demonstration of the suitability of our method for on-device learning tasks by evaluating its performance on the Visual Wake Words (VWW) dataset, achieving high performance with low computational complexity.

2 Background

2.1 Backpropagation (BP)

As noted earlier, the backpropagation (BP) algorithm is central to deep learning. We explore its mechanics here and introduce key notations used in this work. A neural network model can be represented as a parameterized function $F(\mathbf{x};\mathbf{\theta})$, where $\mathbf{x}$ is the input data and $\mathbf{\theta}$ are the parameters. For an $L$-layer model, the parameters are $\mathbf{\theta}=[\mathbf{w}^{(1)},\cdots,\mathbf{w}^{(L)}]$, with $\mathbf{w}^{(l)}$ representing the weights of the $l$-th layer. Each layer produces an output, $\mathbf{h}^{(l)}$, obtained by applying a linear transformation to the input $\mathbf{h}^{(l-1)}$ based on the parameters $\mathbf{w}^{(l)}$, resulting in an intermediate representation $\mathbf{z}^{(l)}$, followed by a non-linear element-wise activation function, $\mathbf{h}^{(l)}=f(\mathbf{z}^{(l)})$. Given a loss function $\mathcal{L}$ and a labeled dataset $[\mathbf{X},\mathbf{Y}^{*}]$, where $\mathbf{X}$ are the inputs and $\mathbf{Y}^{*}$ the labels, the objective is to find the parameters $\mathbf{\theta}$ that minimize the loss, i.e., $\mathbf{\theta}:=\arg\min_{\mathbf{\theta}}\mathcal{L}(\mathbf{Y}^{*},F(\mathbf{X};\mathbf{\theta}))$. For this purpose, the conventional approach is to use mini-batch stochastic gradient descent (SGD), which randomly samples a mini-batch of data $[\mathbf{x},\mathbf{y}^{*}]$ from the dataset to estimate the gradient of the loss function. Such a learning algorithm, with a learning rate $\eta$, has the following update rule for the parameters:

$$\mathbf{w}^{(l)}:=\mathbf{w}^{(l)}-\eta\nabla_{\mathbf{w}^{(l)}}\mathcal{L} \tag{1}$$

The gradient $\nabla_{\mathbf{w}^{(l)}}\mathcal{L}$ is computed with the BP algorithm. BP operates in two phases: the forward pass and the backward pass. During the forward pass, an input $\mathbf{x}$ is propagated layer by layer through the model to obtain a model prediction $\mathbf{h}^{(L)}=F(\mathbf{x};\mathbf{\theta})$, and the loss $\mathcal{L}(\mathbf{y}^{*},\mathbf{h}^{(L)})$ is computed. In this process, all intermediate representations $\mathbf{z}^{(l)}$ are saved. Then, in the backward pass, the chain rule is used to compute the gradients as follows:

$$\begin{aligned}\nabla_{\mathbf{w}^{(l)}}\mathcal{L}&=\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(l)}}\frac{\partial\mathbf{h}^{(l)}}{\partial\mathbf{z}^{(l)}}\frac{\partial\mathbf{z}^{(l)}}{\partial\mathbf{w}^{(l)}}\\ &=\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(L)}}\prod^{L}_{i=l+1}\frac{\partial\mathbf{h}^{(i)}}{\partial\mathbf{h}^{(i-1)}}\frac{\partial\mathbf{h}^{(l)}}{\partial\mathbf{z}^{(l)}}\frac{\partial\mathbf{z}^{(l)}}{\partial\mathbf{w}^{(l)}}\end{aligned} \tag{2}$$

Here, $\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(l)}}$ is the learning signal obtained by propagating errors from the last layer ($L$) to layer $l$. Additionally, $\frac{\partial\mathbf{h}^{(l)}}{\partial\mathbf{z}^{(l)}}$ corresponds to the derivative of the activation function, $f^{\prime}(\mathbf{z}^{(l)})$, and $\frac{\partial\mathbf{z}^{(l)}}{\partial\mathbf{w}^{(l)}}$ is equivalent to the input of the $l$-th layer, i.e., $\mathbf{h}^{(l-1)}$. From (2), it can be observed that while the latter two factors on the right-hand side depend only on the inputs and outputs of layer $l$, the learning signal depends on all successive layers. Therefore, the weight updates must be sequential (i.e., the update-locking problem). Moreover, the computational and memory complexities of BP are $O(Ln^{2})$ and $O(Ln)$, respectively, with $n$ representing the average number of neurons per layer.
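To make the two phases and the update-locking dependency concrete, the following is a minimal sketch of (1) and (2) on a toy model. The scalar chain network (one neuron per layer), the tanh activation, and the squared-error loss are assumptions chosen purely for illustration; they are not the models or losses used in this paper.

```python
import math

def forward(weights, x):
    """Forward pass of a scalar chain network; every z^(l) and h^(l) is stored."""
    hs, zs = [x], []
    for w in weights:
        z = w * hs[-1]          # linear transformation of the layer input
        zs.append(z)
        hs.append(math.tanh(z)) # element-wise activation f
    return hs, zs

def loss_fn(weights, x, y):
    """Squared-error loss on the last layer's output (illustrative choice)."""
    hs, _ = forward(weights, x)
    return 0.5 * (hs[-1] - y) ** 2

def backprop(weights, x, y):
    """Backward pass: the learning signal dL/dh is propagated from layer L down."""
    hs, zs = forward(weights, x)
    grads = [0.0] * len(weights)
    dL_dh = hs[-1] - y                        # dL/dh^(L)
    for l in reversed(range(len(weights))):
        dh_dz = 1.0 - math.tanh(zs[l]) ** 2   # f'(z^(l))
        delta = dL_dh * dh_dz
        grads[l] = delta * hs[l]              # dz^(l)/dw^(l) is the layer input
        dL_dh = delta * weights[l]            # signal for the preceding layer
    return grads

def sgd_step(weights, grads, lr):
    """Update rule (1): w := w - eta * grad."""
    return [w - lr * g for w, g in zip(weights, grads)]
```

Note that `grads[l]` cannot be computed until the loop has visited every later layer, and that all stored `zs` must be kept until then; these are the sequential dependency and the $O(Ln)$ memory cost described above.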

2.2 Local learning for DNN

The non-locality and update-locking features of BP, among others, have been argued as reasons that make BP unlikely to be the learning rule used by the brain [20]. Different local learning mechanisms that do not rely on the propagation of errors using symmetric weights have been explored in many works [28, 9, 7, 11, 14]. Here, we refer to local learning as learning rules that compute weight updates ($\Delta\mathbf{w}^{(l)}$) based only on inputs ($\mathbf{h}^{(l-1)}$), outputs ($\mathbf{z}^{(l)}$), and some other global factors. An example is the DFA method [28], which uses random feedback weights ($\mathbf{B}^{(l)}$) to produce the learning signal. In this method, $\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(l)}}$ in (2) is replaced by $\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(L)}}\mathbf{B}^{(l)}$. A similar method, denoted DRTP, is proposed by [9]; it uses fixed random learning signals produced by propagating the labels instead of the error. In other words, the learning signals are $\mathbf{y}^{*}\mathbf{B}^{(l)}$. Other approaches use two forward passes to produce the learning signal [7, 11], or produce a learning signal based on a soft competition mechanism [14].
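The difference between the DFA and DRTP learning signals can be sketched as follows. This follows only the simplified description given here; the original methods include details (such as scaling and sign conventions) that are not modeled. The toy sizes, the uniform initialization of $\mathbf{B}^{(l)}$, and the placeholder prediction are illustrative assumptions.

```python
import random

def project(row, B):
    """Multiply a length-C row vector by a C x n matrix, giving a length-n row."""
    return [sum(row[c] * B[c][j] for c in range(len(B))) for j in range(len(B[0]))]

def dfa_signal(output_error, B):
    """DFA: the output-layer error dL/dh^(L) is projected through a fixed random B^(l)."""
    return project(output_error, B)

def drtp_signal(one_hot_label, B):
    """DRTP: the label itself is projected, so no output error is needed."""
    return project(one_hot_label, B)

rng = random.Random(0)
C, n = 3, 5  # number of classes and neurons in hidden layer l (toy sizes)
B = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(C)]

label = [0.0, 1.0, 0.0]
prediction = [0.2, 0.5, 0.3]                         # placeholder model output
error = [p - y for p, y in zip(prediction, label)]   # placeholder for dL/dh^(L)

sig_dfa = dfa_signal(error, B)    # requires a full forward pass to obtain the error
sig_drtp = drtp_signal(label, B)  # available as soon as the label is known
```

With a one-hot label, the DRTP signal reduces to the row of $\mathbf{B}^{(l)}$ that belongs to the target class, which is why each layer can be updated without waiting for the output error.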

2.3 Neural activity synchronization in the brain

Neural activity synchronization refers to correlated neuronal signals across different regions of the brain. Groups of neurons that co-activate in response to sensory stimuli or during spontaneous activity are often referred to as ensembles. These ensembles play a crucial role in various cognitive functions, including the processing of visual stimuli in the cortex [22], memory formation [12], and behavior regulation [3]. In addition to these roles, modulations in oscillatory neuronal activity are commonly observed when humans engage in cognitive tasks. For instance, as highlighted by [10], the complex, high-dimensional dynamics of neuronal activity can collapse into low-dimensional oscillatory modes, which in turn facilitates memory enhancement and learning. This synchronization not only simplifies the representation of neuronal dynamics but also captures both linear and non-linear aspects of neuronal interactions. Drawing inspiration from these biological processes, we propose a local learning rule (LLS) that employs fixed periodic vectors for each class to synchronize neural activity within the same layer of a neural network. This approach is intended to enhance the efficiency of learning in artificial systems. By using periodic vectors, LLS encourages groups of neurons, distributed periodically within the same layer, to exhibit high activity in response to specific visual stimuli (such as images of a particular class). This design is inspired by the concept of neuronal ensembles, carried over to artificial neural networks.

3 LLS: Local Learning Rule inspired by Neural Activity Synchronization

Figure 1: Overview of LLS. Weight updates for the $l$-th hidden layer within an $L$-layer neural network are derived via per-layer minimization of the cross-entropy loss ($\mathcal{L}^{(l)}_{\mathrm{CE}}$) on the projection of the output activations ($\mathbf{h}^{(l)}$) over a fixed basis of periodic vectors $\mathbf{b}^{(l)}$, i.e., $\mathbf{h}^{(l)}\mathbf{b}^{\top}$. This produces a local error signal as the difference between the softmax of the projection ($\mathbf{p}^{(l)}$) and the one-hot encoded labels $\mathbf{y}^{*}$. Subsequently, this error signal is multiplied with the fixed basis to generate the learning signal. Weight updates are then determined by multiplying the locally generated learning signal with the layer's inputs and outputs. Consequently, LLS enables independent layer updates based on local information, resulting in low time and memory complexities of $O(LCn)$ and $O(n_{max})$, respectively. It is noteworthy that the fixed basis $\mathbf{b}^{(l)}$ comprises $C$ vectors, where $C$ represents the number of classes for the classification task. Furthermore, the fixed basis vectors are constructed using periodic functions $g(f_{c},t)=g(f_{c},t+1/f_{c})$, where $f_{c}$ denotes the spatial frequency associated with class $c$.

LLS aims to synchronize neural activity within the same layer while minimizing computational complexity and additional trainable parameters. We emphasize three core aspects of LLS: (1) locality, (2) update-unlocking, and (3) minimal parameter requirements.

First, LLS operates locally within each layer, updating synaptic connections ($\mathbf{w}^{(l)}$) based on local inputs ($\mathbf{h}^{(l-1)}$), outputs ($\mathbf{h}^{(l)}$), and generated learning signals. The locally generated learning signals are obtained by projecting $\mathbf{h}^{(l)}$ onto a set of fixed periodic basis vectors $\mathbf{b}^{(l)}$, which align with specific classes to optimize layer performance. Local operation reduces the computational overhead of computing the weight gradients.

Second, LLS's update-unlocking feature is a by-product of locality and enables independent weight updates per layer, eliminating the need to save the output activations of all the layers in the model during training. This results in a memory complexity of $O(n_{max})$, where $n_{max}$ represents the maximum number of neurons in a layer. Third, unlike methods employing auxiliary local classifiers, LLS requires no additional trainable parameters, utilizing fixed periodic vectors for alignment. However, for tasks with numerous classes, relying solely on fixed vectors may present challenges, as discussed in Section 4. To address these limitations, we also propose LLS-M and LLS-MxM as variations of LLS. LLS-M enables learning of an optimal modulation for the fixed basis vectors, while LLS-MxM learns to form a superior basis via a linear combination of the fixed vectors. Both variations entail minimal additional trainable parameters, on the order of $O(C)$ and $O(C^{2})$, respectively, where $C$ denotes the number of classes in a task.
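As a rough illustration of why the two variants need on the order of $C$ and $C^{2}$ extra parameters, the sketch below shows one plausible reading of the description above: a per-class scalar that rescales each fixed vector, and a $C\times C$ matrix that mixes the fixed vectors. The function names and this exact parameterization are assumptions made for illustration; the precise definitions are those given in Section 3.2 and may differ.

```python
def modulate_basis(basis, m):
    """LLS-M-style reading: scale class vector b_c by a learned scalar m[c] (C parameters)."""
    return [[m[c] * v for v in basis[c]] for c in range(len(basis))]

def mix_basis(basis, M):
    """LLS-MxM-style reading: new vector k is sum_c M[k][c] * b_c (C*C parameters)."""
    C, T = len(basis), len(basis[0])
    return [[sum(M[k][c] * basis[c][t] for c in range(C)) for t in range(T)]
            for k in range(C)]

# Toy fixed basis with C = 2 square-wave vectors of length T = 4.
fixed = [[1.0, -1.0, 1.0, -1.0],
         [1.0, 1.0, -1.0, -1.0]]

scaled = modulate_basis(fixed, m=[0.5, 2.0])
mixed = mix_basis(fixed, M=[[1.0, 0.0],
                            [0.5, 0.5]])
```

In both cases the fixed vectors themselves are untouched; only the small set of coefficients would be trained.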

3.1 Technical details

The hidden layers are trained based on the alignment of their output activations ($\mathbf{h}^{(l)}$) with a predefined set of fixed basis vectors ($\mathbf{b}^{(l)}$), as shown in Fig. 1. Alignment is measured as the inner product of a layer's output activations and the basis. To encourage synchronicity in neural responses among neurons, the fixed basis vectors are constructed using periodic functions $g(f,t)=g(f,t+1/f)$, where $f$ represents spatial frequency.
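A minimal sketch of such a basis is given below, with one vector per class as in Fig. 1. The cosine form $g(f,t)=\cos(2\pi ft)$, its square-wave counterpart, and the equally spaced frequency assignment $f_{c}=c/T$ are assumptions made for illustration; the text only requires that $g$ be periodic and that class frequencies do not overlap.

```python
import math

def cosine_basis(num_classes, length):
    """Return C vectors b_c[t] = cos(2*pi*f_c*t) for t = 1..T, with f_c = c / T (assumed)."""
    basis = []
    for c in range(1, num_classes + 1):
        f_c = c / length  # hypothetical equally spaced frequency for class c
        basis.append([math.cos(2.0 * math.pi * f_c * t) for t in range(1, length + 1)])
    return basis

def square_basis(num_classes, length):
    """Square-wave counterpart: +1 where the cosine is non-negative, -1 elsewhere."""
    return [[1.0 if v >= 0.0 else -1.0 for v in vec]
            for vec in cosine_basis(num_classes, length)]

def dot(u, v):
    """Inner product, the alignment measure used between activations and a basis vector."""
    return sum(a * b for a, b in zip(u, v))

basis = cosine_basis(num_classes=4, length=16)
```

With this particular frequency choice the cosine vectors of different classes are mutually orthogonal, so activations aligned with one class vector contribute nothing to the projections on the others. Because the vectors depend only on $c$, $T$, and $t$, they can be regenerated on the fly instead of being stored.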

For a classification problem with $C$ classes, each class $c$ has its own vector $\mathbf{b}^{(l)}_{c}=g(f_{c},\mathbf{t}^{(l)})$, where $\mathbf{t}^{(l)}=[1,2,3,\cdots,T^{(l)}]$, $T^{(l)}$ is the length of the $l$-th layer's output ($\mathbf{h}^{(l)}$), and $f_{c}$ is a fixed frequency for class $c$. Note that these basis vectors have the same frequencies for all layers but different lengths. The weight updates can be derived as a per-layer minimization of the cross-entropy loss ($\mathcal{L}^{(l)}$) on the projection of the activations over the fixed basis ($\mathbf{h}^{(l)}\mathbf{b}^{\top}$), as illustrated in Fig. 1. Specifically, the per-layer cross-entropy loss is described as follows:

$$\begin{aligned}\mathcal{L}^{(l)}(\mathbf{h}^{(l)},\mathbf{y}^{*})&=-\frac{1}{N}\sum^{N}_{n=1}\mathbf{y}^{*}\log(\mathbf{p}^{(l)}_{n})\\ &=-\frac{1}{N}\sum^{N}_{n=1}\log\frac{\exp(\mathbf{h}^{(l)}_{n}\mathbf{b}^{\top}_{c_{n}^{*}})}{\sum_{c=1}^{C}\exp(\mathbf{h}^{(l)}_{n}\mathbf{b}^{\top}_{c})}\end{aligned} \tag{3}$$

Here, $N$ is the number of samples in the mini-batch, $c^{*}_{n}$ is the class index for the $n$-th sample in the mini-batch, and $\mathbf{p}^{(l)}_{n}$ is the probability vector obtained by applying the softmax function to the projection vector $\mathbf{h}^{(l)}_{n}\mathbf{b}^{\top}$. Solving the per-layer minimization problem, $\min_{\mathbf{w}^{(l)}}\mathcal{L}^{(l)}(\mathbf{h}^{(l)},\mathbf{y}^{*})$, results in the following expression for weight updates on the $l$-th layer:

$$\begin{aligned}\Delta\mathbf{w}^{(l)}&=\frac{1}{N}\left((\mathbf{p}^{(l)}-\mathbf{y}^{*})\mathbf{b}^{(l)}\odot f^{\prime}(\mathbf{z}^{(l)})\right)^{\top}\mathbf{h}^{(l-1)}\\ &=\frac{1}{N}\left(\mathbf{e}^{(l)}\mathbf{b}^{(l)}\odot f^{\prime}(\mathbf{z}^{(l)})\right)^{\top}\mathbf{h}^{(l-1)}\end{aligned} \tag{4}$$
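The per-layer loss of Equation (3) and the update of Equation (4) can be sketched for a single sample ($N=1$) as follows. The fully connected layer, the tanh activation, and the helper names are assumptions made for illustration; the sketch only demonstrates that the update uses nothing beyond the layer's input, its pre-activation, the fixed basis, and the label.

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def local_loss(W, h_prev, basis, label):
    """Per-layer cross-entropy of Eq. (3) for one sample; also returns z and p."""
    z = [sum(w * x for w, x in zip(row, h_prev)) for row in W]        # z = W h^(l-1)
    h = [math.tanh(v) for v in z]                                     # h = f(z)
    proj = [sum(hj * bj for hj, bj in zip(h, b_c)) for b_c in basis]  # h b^T
    p = softmax(proj)
    return -math.log(p[label]), z, p

def lls_update(W, h_prev, basis, label):
    """Delta w of Eq. (4) for one sample, i.e. the gradient of local_loss w.r.t. W."""
    _, z, p = local_loss(W, h_prev, basis, label)
    C, n = len(basis), len(W)
    e = [p[c] - (1.0 if c == label else 0.0) for c in range(C)]         # e = p - y*
    signal = [sum(e[c] * basis[c][j] for c in range(C)) for j in range(n)]  # e b
    delta = [signal[j] * (1.0 - math.tanh(z[j]) ** 2) for j in range(n)]    # times f'(z)
    return [[delta[j] * x for x in h_prev] for j in range(n)]           # outer product

def apply_update(W, dW, lr):
    """Gradient-descent step in the form of Eq. (1), applied to this layer only."""
    return [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, dW)]
```

A layer can call `lls_update` as soon as its own forward computation is done, without waiting for any later layer.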

From Equation (4), it is evident that the weight updates for each layer $l$ depend solely on the local variables of that layer, including its inputs, outputs, and the set of fixed basis vectors. Consequently, all layers can be updated independently of the rest of the model. These independent updates are the reason why the memory complexity of LLS depends only on the largest layer (the layer with the highest number of neurons), in contrast with end-to-end training methods that require memory proportional to the number of neurons in the entire model. Moreover, since LLS’s learning signals are generated locally, the time complexity to generate them for all the layers is proportional to the number of neurons per layer and the number of classes, that is, $O(LCn)$.
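
As an illustration of Equation (4), the following NumPy sketch computes the update for a single fully connected layer. It is a minimal example under stated assumptions, not necessarily the implementation used in the experiments: the ReLU activation, the array shapes (basis of shape $C \times n$), the random sign matrix standing in for the fixed basis, and the names `lls_loss` and `lls_weight_update` are all choices made here for illustration.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=1, keepdims=True)


def lls_loss(h_prev, w, b, labels):
    """Local cross-entropy loss of one layer (ASSUMED dense layer with ReLU)."""
    h = np.maximum(h_prev @ w.T, 0.0)            # h^(l) = f(z^(l)), shape (N, n)
    p = softmax(h @ b.T)                         # softmax of the projection h b^T, shape (N, C)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))


def lls_weight_update(h_prev, w, b, labels):
    """Delta w^(l) as in Eq. (4); only layer-local quantities are used."""
    n_samples, n_classes = h_prev.shape[0], b.shape[0]
    z = h_prev @ w.T                             # pre-activations z^(l), shape (N, n)
    h = np.maximum(z, 0.0)
    e = softmax(h @ b.T) - np.eye(n_classes)[labels]   # e = p - y*, shape (N, C)
    local_signal = (e @ b) * (z > 0)             # (e b) ⊙ f'(z), shape (N, n)
    return local_signal.T @ h_prev / n_samples   # same shape as w: (n, n_prev)


rng = np.random.default_rng(0)
h_prev = rng.standard_normal((8, 5))             # N = 8 samples, 5 inputs
w = rng.standard_normal((6, 5))                  # n = 6 neurons
b = np.sign(rng.standard_normal((3, 6)))         # stand-in fixed basis, C = 3 classes
labels = rng.integers(0, 3, size=8)
dw = lls_weight_update(h_prev, w, b, labels)
```

In this sketch the returned array is the gradient of the local loss with respect to the layer weights, so a plain descent step would be `w -= lr * dw` for some learning rate `lr` (the optimizer actually used is described in Appendix A.2). Note that no quantity from any other layer enters the computation.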

The frequency $f_{c}$ of each class is selected to maintain sufficient distance between the frequencies of different classes and thus avoid interference. The range of available frequencies is determined by the length of $\mathbf{h}^{(l)}$. Hence, frequencies can be distributed evenly over that range or assigned randomly, as long as they do not overlap. In practice, we reduce the dimensions of $\mathbf{h}^{(l)}$ for convolutional layers by applying average pooling before projecting it onto the basis $\mathbf{b}^{(l)}$. This both speeds up convergence and reduces the number of MAC operations.
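
One possible way to build such a basis is sketched below. The exact sampling grid and frequency assignment are not specified in this section, so the sketch assumes that $t$ samples one full period at $n$ points and that each class receives a distinct integer frequency spread evenly between 1 and $n/2$. The function name `make_basis` and its arguments are hypothetical.

```python
import numpy as np


def make_basis(num_classes, n, kind="square"):
    """Build a fixed (C, n) basis with one periodic row per class.

    ASSUMPTIONS (not taken from the text): t covers [0, 2*pi) at n points and
    class frequencies are distinct integers spread evenly over 1 .. n // 2.
    """
    t = 2.0 * np.pi * np.arange(n) / n
    freqs = np.round(np.linspace(1, n // 2, num_classes)).astype(int)
    if len(set(freqs.tolist())) != num_classes:
        raise ValueError("layer too narrow for this many non-overlapping frequencies")
    b = np.cos(freqs[:, None] * t[None, :])      # g = cos(f_c t), shape (C, n)
    if kind == "square":
        b = np.where(b >= 0, 1.0, -1.0)          # sign(cos(f_c t)), zeros mapped to +1
    elif kind != "cosine":
        raise ValueError("kind must be 'cosine' or 'square'")
    return b, freqs


basis, freqs = make_basis(num_classes=10, n=64, kind="square")
```

Under these assumptions the cosine rows are mutually orthogonal, because distinct integer frequencies up to $n/2$ sampled over a full period have zero inner product. The `ValueError` branch reflects the constraint stated above: a short $\mathbf{h}^{(l)}$ limits how many non-overlapping frequencies are available.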

3.2 Variations of LLS

So far, we have discussed LLS based on a basis of periodic vectors $\mathbf{b}^{(l)}$ generated from a fixed periodic function $g(\cdot)$. However, such a basis may not always be optimal for a given task. For instance, the amplitude of the vectors could be too large, making it difficult for the algorithm to converge. Additionally, in problems with a large number of classes, the restriction to fixed periodic vectors may impede the model’s ability to learn semantics in the data, such as grouping similar classes.

To address these concerns, we propose two variations of LLS: LLS-M for learning the appropriate modulation of the fixed basis ($\mathbf{b}^{(l)}$), and LLS-MxM for learning to construct a new basis as a linear combination of the original fixed basis.

LLS-M:

In this variation, the new basis is simply a modulation of the original fixed basis, defined as $\mathbf{d}^{(l)}=\mathbf{M}^{(l)}\odot\mathbf{b}^{(l)}$, where $\mathbf{M}^{(l)}$ is a vector of trainable parameters with dimensions equal to the number of classes, i.e., $\mathbf{M}^{(l)}\in\mathbb{R}^{C}$. Weight updates for LLS-M follow (4), with $\mathbf{b}^{(l)}$ replaced by $\mathbf{d}^{(l)}$. The updates for $\mathbf{M}^{(l)}$ are computed as follows:

$$
\Delta\mathbf{M}^{(l)}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{e}^{(l)}_{n}\odot(\mathbf{h}^{(l)}_{n}\mathbf{b}^{(l)\top})
\tag{5}
$$
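
Equation (5) can be sketched in NumPy as follows. The sketch assumes that the basis has shape $C \times n$ and that the modulation scales row $c$ of the basis by $M_c$, so that the projection becomes $(\mathbf{h}\mathbf{b}^{\top})\odot\mathbf{M}$. The names `lls_m_loss` and `lls_m_update` and the toy data are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=1, keepdims=True)


def lls_m_loss(h, b, m, labels):
    """Local loss with the modulated basis d = M ⊙ b (ASSUMED row-wise scaling)."""
    logits = (h @ b.T) * m                       # h d^T = (h b^T) ⊙ M, shape (N, C)
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))


def lls_m_update(h, b, m, labels):
    """Delta M^(l) as in Eq. (5): batch mean of e_n ⊙ (h_n b^T)."""
    proj = h @ b.T                               # projections onto the fixed basis, (N, C)
    e = softmax(proj * m) - np.eye(b.shape[0])[labels]
    return np.mean(e * proj, axis=0)             # one entry per class, shape (C,)


rng = np.random.default_rng(1)
h = np.abs(rng.standard_normal((8, 6)))          # toy non-negative layer outputs
b = np.sign(rng.standard_normal((3, 6)))         # stand-in fixed basis, C = 3
m = np.ones(3)                                   # modulation initialised to 1 (an assumption)
labels = rng.integers(0, 3, size=8)
dm = lls_m_update(h, b, m, labels)
```

Only $C$ extra parameters per layer are introduced, which matches the small "Additional params" counts reported for LLS-M.
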
LLS-MxM:

Here, the new basis vectors ($\mathbf{d}^{(l)}$) are obtained as a linear combination of the original fixed periodic vectors: $\mathbf{d}^{(l)}=\mathbf{M}^{(l)}\mathbf{b}^{(l)}$, where $\mathbf{M}^{(l)}\in\mathbb{R}^{C\times C}$. Weight updates are obtained following (4), with the basis replaced by $\mathbf{d}^{(l)}$. Similar to LLS-M, updates for the matrix $\mathbf{M}^{(l)}$ are computed as follows:

$$
\Delta\mathbf{M}^{(l)}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{e}^{(l)\top}_{n}(\mathbf{h}^{(l)}_{n}\mathbf{b}^{(l)\top})
\tag{6}
$$
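
A corresponding sketch for Equation (6) is given below, again as an illustration under assumptions rather than a reference implementation. With a basis of shape $C \times n$ and $\mathbf{d}^{(l)}=\mathbf{M}^{(l)}\mathbf{b}^{(l)}$, the projection is $\mathbf{h}\mathbf{b}^{\top}\mathbf{M}^{\top}$, and stacking the samples as rows turns the per-sample outer products into one matrix product. The identity initialisation and the function names are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=1, keepdims=True)


def lls_mxm_loss(h, b, m, labels):
    """Local loss with the learned basis d = M b, where M has shape (C, C)."""
    logits = (h @ b.T) @ m.T                     # h d^T = h b^T M^T, shape (N, C)
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))


def lls_mxm_update(h, b, m, labels):
    """Delta M^(l) as in Eq. (6): batch average of the outer products e_n^T (h_n b^T)."""
    proj = h @ b.T                               # (N, C)
    e = softmax(proj @ m.T) - np.eye(b.shape[0])[labels]
    return e.T @ proj / h.shape[0]               # shape (C, C)


rng = np.random.default_rng(2)
h = np.abs(rng.standard_normal((8, 6)))          # toy non-negative layer outputs
b = np.sign(rng.standard_normal((3, 6)))         # stand-in fixed basis, C = 3
m = np.eye(3)                                    # start from the fixed basis (an assumption)
labels = rng.integers(0, 3, size=8)
dM = lls_mxm_update(h, b, m, labels)
```

Because each row of $\mathbf{M}^{(l)}$ mixes all $C$ fixed vectors, the learned basis vectors of two classes no longer have to be orthogonal, at the cost of $C^{2}$ parameters per layer.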

4 Experimental evaluation

In this section, we assess the efficacy of LLS and its variations across several image classification datasets, which include MNIST [18], FashionMNIST [30], CIFAR10 [16], CIFAR100 [16], IMAGENETTE [8], TinyIMAGENET [17], and Visual Wake Words (VWW) [4].

We primarily evaluate the proposed learning rules using three models: a 5-layer CNN (SmallConv), a VGG8 [23], and MobileNets-V1 (MBNet) [13]. Detailed descriptions of each model are provided in Appendix A.1. Additionally, information regarding hyperparameters, data pre-processing, and optimizer settings is provided in Appendix A.2.

4.1 Effect of different basis in learning

Figure 2: Neural activity synchronization induced by learning rule LLS$_{\text{square}}$ on the VGG8 model’s 4th layer output ($\mathbf{h}^{(4)}$) for classes 0 and 1 from the IMAGENETTE dataset. The layer’s response exhibits spatial periodicity coinciding with the periodic function selected as a basis ($\mathbf{b}^{(4)}$).
Table 1: LLS’s performance comparison with different functions $g(\cdot)$ to generate the basis $\mathbf{b}^{(l)}$. Test accuracy mean and std are reported over five trials.

| Function $g(\cdot)$ | Model | MNIST | FashionMNIST | CIFAR10 | IMAGENETTE |
|---|---|---|---|---|---|
| Cosine | SmallConv | $99.50 \pm 0.02$ | $89.57 \pm 0.22$ | $75.82 \pm 0.39$ | $78.03 \pm 0.35$ |
| Square | SmallConv | $99.50 \pm 0.02$ | $90.54 \pm 0.23$ | $77.79 \pm 0.31$ | $79.02 \pm 0.76$ |
| Random | SmallConv | $99.38 \pm 0.03$ | $87.30 \pm 0.18$ | $74.19 \pm 0.57$ | $71.70 \pm 1.45$ |
| Cosine | VGG8 | $99.52 \pm 0.02$ | $93.04 \pm 0.17$ | $86.92 \pm 0.27$ | $84.85 \pm 0.11$ |
| Square | VGG8 | $99.54 \pm 0.01$ | $93.54 \pm 0.06$ | $88.64 \pm 0.12$ | $85.62 \pm 0.24$ |
| Random | VGG8 | $99.70 \pm 0.02$ | $93.77 \pm 0.08$ | $90.45 \pm 0.09$ | $87.09 \pm 0.28$ |

First, we compare the effect of different functions $g(\cdot)$ for generating the basis $\mathbf{b}^{(l)}$. We consider two simple periodic functions: cosine ($g=\cos(f_{c}t)$) and square ($g=\mathrm{sign}(\cos(f_{c}t))$). Thanks to their periodicity, both functions can be generated on the fly or stored with minimal memory overhead. Additionally, we investigate the scenario where $g(\cdot)$ is a pseudo-random number generator, resulting in a random fixed vector $\mathbf{b}^{(l)}$.

The results are evaluated on two models, SmallConv and VGG8, across four image classification datasets of increasing complexity. Each model is trained five times with different random seeds, and the results are reported in Table 1.

We observe that employing any of the three fixed vector bases with LLS yields high accuracy across all four vision tasks. Notably, for the SmallConv model, LLS with a square basis function yields the best accuracy on every dataset (tying with cosine on MNIST), followed by the cosine basis. In contrast, for the VGG8 model, the random basis performs better than the periodic bases, with square still performing better than cosine. This discrepancy may be attributed to the increased complexity of per-layer feature representations in deeper models, where a random vector offers more degrees of freedom for such representations. However, it is important to note that a random vector is less hardware-friendly, as it requires specialized pseudo-random number generators, leading to energy and memory overhead, as discussed in [5]. Therefore, in the subsequent sections, we primarily focus on LLS using a square $g(\cdot)$ function (LLS$_{\text{square}}$).

Moreover, employing a periodic function, such as a square function, induces layer neurons to synchronize with the frequency of the basis function. This synchronization is demonstrated in Fig. 2, where the activations for different classes align with the spatial frequencies of the basis function. Here, the spectral decomposition is obtained by applying a Fourier transform along the spatial dimension to both the basis vectors ($\mathbf{b}^{(l)}$) and the layer output activations ($\mathbf{h}^{(l)}$). As shown in Table 1, synchronization has a beneficial effect on accuracy for small models, such as SmallConv. A reason for this is that such models need to discriminate between classes by transforming inputs through only a few layers. Thus, aligning the layers’ outputs to periodic vectors might be easier than aligning them to random vectors.
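
The kind of spectral comparison described above can be sketched with a one-dimensional FFT. The example is purely synthetic and does not reproduce Fig. 2: the "activation" is a toy signal (a rectified noisy copy of one square-wave basis row) standing in for a synchronized layer output, and the sampling grid, the frequency `f_c = 5`, and the helper `dominant_frequency` are assumptions made for illustration.

```python
import numpy as np


def dominant_frequency(x):
    """Index of the largest non-DC bin of the real FFT (cycles per window)."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    return int(np.argmax(spectrum))


n, f_c = 64, 5                                   # ASSUMED layer length and class frequency
t = 2.0 * np.pi * np.arange(n) / n
basis_row = np.where(np.cos(f_c * t) >= 0, 1.0, -1.0)    # square basis vector

# Toy stand-in for a synchronized ReLU output: noisy copy of the basis, rectified.
rng = np.random.default_rng(0)
activation = np.maximum(basis_row + 0.3 * rng.standard_normal(n), 0.0)

peak_basis = dominant_frequency(basis_row)
peak_activation = dominant_frequency(activation)
```

In this toy setting both spectra peak at the same bin, which is the signature of synchronization that the figure illustrates for real activations.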

4.2 Comparison with local learning algorithms

In this section, we compare LLS$_{\text{square}}$ with other local learning methods that exhibit similar time and memory complexities. These methods include DFA [28], DRTP [9], and PEPITA [7]. For this comparison, we use the MNIST, CIFAR10 and CIFAR100 datasets, with results shown in Table 2.

We observe that training the SmallConv model with DFA, DRTP or PEPITA resulted in low performance or did not converge at all. For DFA, performance improved when the number of channels was increased threefold (SmallConvL). Consequently, we used SmallConvL for reporting results with BP and LLS. However, for DRTP and PEPITA, increasing the number of channels did not yield satisfactory results; hence, we report the accuracy for each task as given in the original papers.

As shown in Table 2, LLS outperforms the three other local learning methods under consideration. In terms of accuracy, LLS achieves results close to BP, while maintaining significantly lower time and memory complexities compared to BP. In fact, among all the methods in Table 2, only DRTP exhibits time and memory complexities comparable to LLS. Furthermore, it is worth noting that while DFA, DRTP, and PEPITA do not scale well to deeper models and in many cases require wide DNNs to converge [27], LLS performs well on deeper models, as demonstrated in Section 4.1.

Table 2: Comparison with local learning algorithms (Test accuracy mean and std are reported)

| Method | Time | Memory | Model | MNIST | CIFAR10 | CIFAR100 |
|---|---|---|---|---|---|---|
| BP (baseline) | $O(Ln^{2})$ | $O(Ln)$ | SmallConvL | $99.62 \pm 0.020$ | $87.57 \pm 0.13$ | $62.25 \pm 0.29$ |
| DFA | $O(LCn)$ | $O(Ln)$ | SmallConvL | $97.90 \pm 0.17$ | $71.53 \pm 0.38$ | $44.93 \pm 0.52$ |
| DFA | $O(LCn)$ | $O(Ln)$ | [28] | $98.98 \pm 0.02$ | $73.10 \pm 0.50$ | $41.00 \pm 0.30$ |
| DRTP | $O(LCn)$ | $O(n_{max})$ | [9] | $98.52 \pm 0.15$ | $68.96 \pm 0.45$ | -- |
| PEPITA | $O(Ln^{2})$ | $O(n_{max})$ | [7] | $98.29 \pm 0.13$ | $56.33 \pm 1.35$ | $27.56 \pm 0.60$ |
| LLS$_{\text{square}}$ (Ours) | $O(LCn)$ | $O(n_{max})$ | SmallConvL | $99.57 \pm 0.03$ | $84.10 \pm 0.27$ | $55.32 \pm 0.38$ |

Table 3: Performance comparison on image classification datasets. Accuracy mean and std are reported over five trials, the additional params refers to additional trainable parameters, and # MAC is estimated for the number of ops required to generate the learning signal ($\frac{\partial\mathcal{L}}{\partial\mathbf{h}^{(l)}}$).

| Dataset | Method | Model | Accuracy (mean $\pm$ std) | # MAC$^{1}$ ($\times 10^{6}$) | Memory$^{1}$ (MB) | Additional params |
|---|---|---|---|---|---|---|
| CIFAR10 | BP | VGG8 | $94.12 \pm 0.12$ | $719.33$ | $1082$ | - |
| CIFAR10 | Local Losses | VGG8 | $91.93 \pm 0.07$ | $2.56$ | $576$ | $1.02\times 10^{5}$ |
| CIFAR10 | LLS$_{\text{square}}$ (Ours) | VGG8 | $88.64 \pm 0.12$ | $2.46$ | $574$ | $0$ |
| CIFAR10 | LLS-M$_{\text{square}}$ (Ours) | VGG8 | $90.43 \pm 0.24$ | $2.46$ | $574$ | $70$ |
| CIFAR10 | LLS-MxM$_{\text{square}}$ (Ours) | VGG8 | $90.89 \pm 0.09$ | $2.46$ | $574$ | $700$ |
| IMAGENETTE | BP | VGG8 | $90.92 \pm 0.27$ | $11477.81$ | $15858$ | - |
| IMAGENETTE | Local Losses | VGG8 | $88.06 \pm 0.12$ | $36.48$ | $7319$ | $1.02\times 10^{5}$ |
| IMAGENETTE | LLS$_{\text{square}}$ (Ours) | VGG8 | $85.62 \pm 0.24$ | $36.38$ | $7318$ | $0$ |
| IMAGENETTE | LLS-M$_{\text{square}}$ (Ours) | VGG8 | $86.60 \pm 0.37$ | $36.38$ | $7318$ | $70$ |
| IMAGENETTE | LLS-MxM$_{\text{square}}$ (Ours) | VGG8 | $87.29 \pm 0.29$ | $36.38$ | $7319$ | $700$ |
| CIFAR100 | BP | VGG8 | $73.69 \pm 0.39$ | $719.40$ | $1083$ | - |
| CIFAR100 | Local Losses | VGG8 | $69.26 \pm 0.36$ | $5.33$ | $598$ | $1.02\times 10^{6}$ |
| CIFAR100 | LLS$_{\text{square}}$ (Ours) | VGG8 | $58.84 \pm 0.33$ | $4.30$ | $577$ | $0$ |
| CIFAR100 | LLS-M$_{\text{square}}$ (Ours) | VGG8 | $62.55 \pm 0.24$ | $4.31$ | $577$ | $700$ |
| CIFAR100 | LLS-MxM$_{\text{square}}$ (Ours) | VGG8 | $68.81 \pm 0.19$ | $4.51$ | $578$ | $0.70\times 10^{5}$ |
| TinyIMAGENET | BP | VGG8 | $61.10 \pm 0.25$ | $2871.20$ | $4048$ | - |
| TinyIMAGENET | Local Losses | VGG8 | $54.00 \pm 0.11$ | $15.18$ | $1971$ | $2.04\times 10^{6}$ |
| TinyIMAGENET | LLS$_{\text{square}}$ (Ours) | VGG8 | $35.99 \pm 0.38$ | $13.13$ | $1928$ | $0$ |
| TinyIMAGENET | LLS-M$_{\text{square}}$ (Ours) | VGG8 | $41.89 \pm 0.20$ | $13.14$ | $1928$ | $1400$ |
| TinyIMAGENET | LLS-MxM$_{\text{square}}$ (Ours) | VGG8 | $51.41 \pm 0.48$ | $13.97$ | $1932$ | $0.28\times 10^{6}$ |
| Visual Wake Words (VWW) | BP | MBNet | $88.49 \pm 0.28$ | $181.83$ | $3036$ | - |
| Visual Wake Words (VWW) | Local Losses | MBNet | $82.49 \pm 0.17$ | $178.28$ | $730$ | $0.28\times 10^{5}$ |
| Visual Wake Words (VWW) | LLS$_{\text{square}}$ (Ours) | MBNet | $81.91 \pm 0.16$ | $178.23$ | $729$ | $0$ |
| Visual Wake Words (VWW) | LLS-M$_{\text{square}}$ (Ours) | MBNet | $82.71 \pm 0.42$ | $178.23$ | $729$ | $28$ |
| Visual Wake Words (VWW) | LLS-MxM$_{\text{square}}$ (Ours) | MBNet | $83.66 \pm 0.21$ | $178.25$ | $729$ | $560$ |

$^{1}$ # MAC is estimated for a batch size of 1 and GPU memory is measured for a batch size of 128.

4.3 Performance comparison on deeper models

In this section, we conduct a performance comparison of LLS and its variations on five image classification datasets: CIFAR10, CIFAR100, IMAGENETTE, TinyIMAGENET, and VWW. These datasets cover a wide range of classification tasks, from low- to high-resolution images and from few to many classes. Notably, we emphasize the experiments conducted on the VWW dataset, as it holds significance for edge vision applications and serves as a relevant use case for on-device learning [4]. The comparison considers four metrics: accuracy, the number of MAC operations required to compute the learning signal, the peak memory usage, and the number of additional trainable parameters needed by each method. We compare our method against BP and the Local Losses method [23]. Note that the Local Losses method employs a linear classifier per layer.

CIFAR10 and IMAGENETTE

First, we examine tasks with a small number of classes and different image resolutions, such as CIFAR10 and IMAGENETTE. As shown in Table 3, LLS achieves high accuracy, closely following BP and Local Losses. Note that LLS achieves this accuracy with approximately $300\times$ fewer MAC operations and half the memory usage compared to BP, and without requiring additional trainable parameters. To further narrow the accuracy gap, we explore variations of LLS, namely LLS-M and LLS-MxM. Both variations bring the accuracy closer to BP with almost no increase in MACs or memory usage. This accuracy improvement, however, comes at the cost of a small number of additional trainable parameters. Even so, LLS-MxM still requires approximately $100\times$ fewer additional trainable parameters than Local Losses.
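
The approximate factors quoted above can be checked directly against the VGG8 rows of Table 3. The short calculation below only copies numbers from the table; the variable names are illustrative.

```python
# (# MAC in millions, memory in MB) copied from Table 3, VGG8 rows.
table3 = {
    "CIFAR10": {"BP": (719.33, 1082), "LLS": (2.46, 574)},
    "IMAGENETTE": {"BP": (11477.81, 15858), "LLS": (36.38, 7318)},
}

ratios = {}
for dataset, rows in table3.items():
    (bp_mac, bp_mem), (lls_mac, lls_mem) = rows["BP"], rows["LLS"]
    ratios[dataset] = (bp_mac / lls_mac, bp_mem / lls_mem)
    print(f"{dataset}: {ratios[dataset][0]:.0f}x fewer MACs, "
          f"{ratios[dataset][1]:.2f}x less memory than BP")

# Additional trainable parameters: Local Losses vs. LLS-MxM (same for both datasets).
param_ratio = 1.02e5 / 700
print(f"LLS-MxM uses about {param_ratio:.0f}x fewer additional parameters")
```

The MAC ratios come out slightly below and slightly above 300 for the two datasets, the memory ratios are close to 2, and the parameter ratio is somewhat above 100, consistent with the rounded figures in the text.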

Figure 3: Projection of the linear combination matrix $\mathbf{M}^{(l)}$ of the fixed basis $\mathbf{b}^{(l)}$ using t-SNE. $\mathbf{M}^{(l)}$ is obtained after training a VGG8 model with LLS-MxM on CIFAR100. The results provide evidence that our learning rule can learn a better basis (as a linear combination of a fixed basis) and can encode semantics within it. Points are colored using the twenty super-class labels provided in CIFAR100.
Figure 4: Visual explanations, obtained with the Grad-CAM method, for predictions of the MBNet model trained with LLS-MxM on the VWW dataset. It can be observed that our method allows the model to learn high-level image features for discerning whether a person is present in an image.
CIFAR100 and TinyIMAGENET

For tasks with hundreds of classes, such as CIFAR100 and TinyIMAGENET, LLS exhibits a significant accuracy drop compared to BP. This is attributed to the orthogonal nature of the periodic vectors, which compels the model to represent each class orthogonally, even when some classes are semantically similar. Essentially, the basic form of LLS may not effectively capture semantics. Additionally, increasing the number of classes also increases the number of frequencies used to generate the fixed basis, leading to overlapping frequencies. We applied LLS-M to address these problems. LLS-M improves the accuracy, but only marginally, as the problems associated with the orthogonality of the bases cannot be completely solved by simply modulating the bases. In contrast, LLS-MxM learns to create a better basis as a linear combination of the original basis, offering a larger improvement and bringing the accuracy closer to BP, as shown in Table 3. To further verify that LLS-MxM can actually learn semantics, we analyze the learned linear combination matrix ($\mathbf{M}^{(l)}$) used to create the new basis. For instance, for a VGG8 model trained on CIFAR100, we project the $\mathbf{M}^{(l)}$ matrix into a 2D space using t-SNE [21], with the twenty super-classes provided in the dataset serving as ground truth. The results of this projection are illustrated in Fig. 3, wherein vectors representing similar classes are grouped together. The accuracy improvements shown in Table 3 and the clustering of similar classes illustrated in Fig. 3 demonstrate the ability of LLS-MxM to encode semantic knowledge in the formation of the new basis. Furthermore, it is worth noting that LLS-MxM requires approximately $200\times$ fewer MACs and half the memory compared to BP, and approximately $10\times$ fewer trainable parameters than Local Losses.

Visual Wake Words (VWW)

Since our learning rule targets on-device learning scenarios, we tested the method on the VWW dataset using a MobileNets-V1 model. Note that both the task and the model are suitable for on-device learning. The results are shown in Table 3. For this task, LLS-M and LLS-MxM outperform the Local Losses method in all metrics (accuracy, MACs, memory, and trainable parameters). Compared to BP, LLS, LLS-M and LLS-MxM show competitive accuracy with fewer MACs and $4\times$ lower memory usage. Moreover, to understand the model’s learning ability, we used the Grad-CAM method [26] to obtain visual explanations of the parts of the image most relevant for a particular prediction. As shown in Fig. 4, the MBNet model trained with LLS-MxM successfully learns high-level image features indicative of the presence of people in a given frame. This provides evidence that our method allows the model to learn complex representations.

5 Conclusions

In this work, we introduced a novel local learning rule, LLS, inspired by the synchronization of neural activity observed in biological systems, which is associated with memory formation and cognitive learning. LLS utilizes fixed periodic basis vectors to synchronize the activity of neurons within the same layer. Moreover, the deliberate choice of simple periodic functions, such as cosine and square functions, allows such bases to be generated easily and on the fly on low-power devices without imposing significant hardware overhead. Experimental validation demonstrates that LLS and its variations (LLS-M and LLS-MxM) achieve high accuracy comparable to BP across various image classification datasets, including CIFAR10, CIFAR100, IMAGENETTE, TinyIMAGENET, and VWW. Remarkably, this high accuracy is attained with significantly fewer MAC operations, reduced memory usage, and a minimal number of additional trainable parameters. Furthermore, employing the Grad-CAM method for visual explanations reveals that LLS and its variants can capture high-level information relevant to predictions. In summary, the demonstrated high accuracy and efficiency of LLS make it well-suited for on-device learning applications, particularly in scenarios where computational resources are severely constrained.

Acknowledgments

This work was supported in part by the Center for Co-design of Cognitive Systems (CoCoSys), one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program, and in part by the Department of Energy (DoE).

References

  • [1] Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Sapan Agarwal, Matthew Marinella, Martin Foltin, John Paul Strachan, Dejan Milojicic, Wen Mei Hwu, and Kaushik Roy. PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-Efficient ReRAM. IEEE Transactions on Computers, 69(8):1128–1142, 8 2020.
  • [2] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy Layerwise Learning Can Scale to ImageNet. In International Conference on Machine Learning, 2018.
  • [3] Luis Carrillo-Reid, Shuting Han, Weijian Yang, Alejandro Akrouh, and Rafael Yuste. Controlling Visually Guided Behavior by Holographic Recalling of Cortical Ensembles. Cell, 178(2):447–457, 7 2019.
  • [4] Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, and Rocky Rhodes. Visual Wake Words Dataset. arXiv: 1906.05721, 6 2019.
  • [5] Brian Crafton, Abhinav Parihar, Evan Gebhardt, and Arijit Raychowdhury. Direct feedback alignment with sparse connections for local learning. Frontiers in Neuroscience, 13(MAY), 2019.
  • [6] Aaron Defazio, Xingyu Yang, Konstantin Mishchenko, Ashok Cutkosky, Harsh Mehta, and Ahmed Khaled. Schedule-Free Learning - A New Way to Train. https://github.com/facebookresearch/schedule_free, 2024.
  • [7] Giorgia Dellaferrera and Gabriel Kreiman. Error-driven Input Modulation: Solving the Credit Assignment Problem without a Backward Pass. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, pages 4937–4955. PMLR, 7 2022.
  • [8] fast.ai. fastai/imagenette: A smaller subset of 10 easily classified classes from Imagenet, and a little more French, 2021.
  • [9] Charlotte Frenkel, Martin Lefebvre, and David Bol. Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience, 15:629892, 2 2021.
  • [10] Ramon Guevara Erra, Jose L Perez Velazquez, and Michael Rosenblum. Neural synchronization from the perspective of non-linear dynamics. Frontiers in computational neuroscience, 11:98, 2017.
  • [11] Geoffrey Hinton. The Forward-Forward Algorithm: Some Preliminary Investigations. Technical report, 2022.
  • [12] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554, 1982.
  • [13] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 4 2017.
  • [14] Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian Deep Learning Without Feedback. In 2023 International Conference on Learning Representations, 2023.
  • [15] Michael J Jutras and Elizabeth A Buffalo. Synchronous neural activity and memory formation. Current opinion in neurobiology, 20(2):150–155, 2010.
  • [16] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.
  • [17] Ya Le and Xuan S Yang. Tiny ImageNet Visual Recognition Challenge. 2015.
  • [18] Yann LeCun, Corinna Cortes, and C J Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • [19] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 11 2016.
  • [20] Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, and Geoffrey Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 6 2020.
  • [21] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
  • [22] Jae Eun Kang Miller, Inbal Ayzenshtat, Luis Carrillo-Reid, and Rafael Yuste. Visual stimuli recruit intrinsically generated cortical ensembles. Proceedings of the National Academy of Sciences of the United States of America, 111(38):E4053–E4061, 9 2014.
  • [23] Arild Nøkland and Lars H Eidnes. Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • [24] Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Backpropagation-Free Deep Learning with Recursive Local Representation Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8):9327–9335, 6 2023.
  • [25] Xiaochen Peng, Shanshi Huang, Hongwu Jiang, Anni Lu, and Shimeng Yu. DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 40(11):2306–2319, 11 2021.
  • [26] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Proceedings of the IEEE International Conference on Computer Vision, 2017-October:618–626, 12 2017.
  • [27] Ganlin Song, Ruitu Xu, and John Lafferty. Convergence and Alignment of Gradient Descent with Random Backpropagation Weights. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021), 2021.
  • [28] Arild Nøkland. Direct Feedback Alignment Provides Learning in Deep Neural Networks. Advances in Neural Information Processing Systems, 29, 2016.
  • [29] Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, and Gao Huang. Revisiting Locally Supervised Learning: an Alternative to End-to-end Training. In International Conference on Learning Representations, 2021.
  • [30] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [31] Qingtian Zhang, Huaqiang Wu, Peng Yao, Wenqiang Zhang, Bin Gao, Ning Deng, and He Qian. Sign backpropagation: An on-chip learning algorithm for analog RRAM neuromorphic computing systems. Neural Networks, 108:217–223, 12 2018.
  • [32] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, pages 13001–13008, 8 2017.

Appendix A Experimental Setup

In this section, we describe the architecture of all models used in this work, the datasets and preprocessing operations, the training details including hyperparameters for each experiment, and the compute resources employed.

A.1 Model architecture

In this work, we use four models: SmallConv, SmallConvL, VGG8 [23], and MobileNetV1 [13]. These models are built using the following three basic blocks: ConvBlock, ConvDWBlock, and LinearBlock.

  • ConvBlock is composed of three layers in the following order: a convolutional layer (Conv), a batch normalization layer (BN), and a Leaky ReLU (LeakyReLU).

  • ConvDWBlock is composed of five layers in the following order: a depthwise convolutional layer (ConvDW), a BN layer, a Conv layer with kernel size of 1 (Conv1x1), another BN layer, and a LeakyReLU layer.

  • LinearBlock is composed of three layers: a fully-connected layer (Linear), a BN layer, and a LeakyReLU.
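The three blocks above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the class signatures, the padding of `kernel_size // 2`, and the default LeakyReLU slope are assumptions, since the text only specifies the layer order.

```python
import torch
from torch import nn


class ConvBlock(nn.Sequential):
    """Conv -> BN -> LeakyReLU, in the order given in the text."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__(
            # Assumption: "same"-style padding; the text does not state it.
            nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(),
        )


class ConvDWBlock(nn.Sequential):
    """ConvDW -> BN -> Conv1x1 -> BN -> LeakyReLU."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__(
            # Depthwise convolution: one filter group per input channel.
            nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                      padding=kernel_size // 2, groups=in_ch),
            nn.BatchNorm2d(in_ch),
            # Pointwise (1x1) convolution mixes channels.
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(),
        )


class LinearBlock(nn.Sequential):
    """Linear -> BN -> LeakyReLU."""

    def __init__(self, in_features, out_features):
        super().__init__(
            nn.Linear(in_features, out_features),
            nn.BatchNorm1d(out_features),
            nn.LeakyReLU(),
        )
```

Under these assumptions, a stride of 1 preserves the spatial size of the feature map and a stride of 2 halves it.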

The architecture of each of the models is described in Table 4. Note that LLS was applied at the outputs of each ConvBlock, ConvDWBlock, and LinearBlock, after the output dimensions were reduced to a size of 2048 (or lower depending on the output dimensions) using an Adaptive Average Pooling (AdaptiveAvgPool) layer.
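One possible way to pick the pooled size for the LLS inputs is sketched below. The text only states that outputs are reduced to 2048 features or fewer with an AdaptiveAvgPool layer; the square pooling window, the rule for choosing its side, and the helper name `lls_pool_side` are assumptions made for illustration.

```python
import math


def lls_pool_side(channels, height, width, max_features=2048):
    """Largest square output side s with channels * s * s <= max_features.

    Hypothetical helper: the selection rule is an assumption, not taken
    from the paper. The side is capped by the feature map's own size, and
    is never smaller than 1.
    """
    side = max(1, math.isqrt(max_features // channels))
    return min(side, height, width)


# Example: a 32-channel 32x32 map would be pooled to 8x8 (32 * 64 = 2048),
# while a 1024-channel 4x4 map would be pooled to 1x1 (1024 features).
side_small = lls_pool_side(32, 32, 32)
side_large = lls_pool_side(1024, 4, 4)
```

The resulting side could then be passed to `torch.nn.AdaptiveAvgPool2d` before flattening the block output.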

Table 4: Model architectures. For ConvBlock and ConvDWBlock, the entries A, B, C denote the kernel size, the number of output channels, and the stride, respectively. For LinearBlock, A denotes the number of output neurons.
| ID | SmallConv | SmallConvL | VGG8 | MobileNetV1 |
|----|-----------|------------|------|-------------|
| 1 | ConvBlock 3, 32, 1 | ConvBlock 3, 96, 1 | ConvBlock 3, 128, 1 | ConvBlock 3, 32, 2 |
| 2 | MaxPool 2, 2 | MaxPool 2, 2 | ConvBlock 3, 256, 1 | ConvDWBlock 3, 64, 1 |
| 3 | ConvBlock 3, 64, 1 | ConvBlock 3, 192, 1 | MaxPool 2, 2 | ConvDWBlock 3, 128, 2 |
| 4 | MaxPool 2, 2 | MaxPool 2, 2 | ConvBlock 3, 256, 1 | ConvDWBlock 3, 128, 1 |
| 5 | ConvBlock 3, 128, 1 | ConvBlock 3, 512, 1 | ConvBlock 3, 256, 1 | ConvDWBlock 3, 256, 2 |
| 6 | AdaptiveAvgPool (2, 2) | AdaptiveAvgPool (2, 2) | MaxPool 2, 2 | ConvDWBlock 3, 256, 1 |
| 7 | LinearBlock 512 | LinearBlock 1024 | ConvBlock 3, 512, 1 | ConvDWBlock 3, 512, 2 |
| 8 | - | - | ConvBlock 3, 512, 1 | ConvDWBlock 3, 512, 1 |
| 9 | - | - | AdaptiveAvgPool (2, 2) | ConvDWBlock 3, 512, 1 |
| 10 | - | - | LinearBlock 1024 | ConvDWBlock 3, 512, 1 |
| 11 | - | - | - | ConvDWBlock 3, 512, 1 |
| 12 | - | - | - | ConvDWBlock 3, 512, 1 |
| 13 | - | - | - | ConvDWBlock 3, 1024, 2 |
| 14 | - | - | - | ConvDWBlock 3, 1024, 1 |
| 15 | - | - | - | AdaptiveAvgPool (2, 2) |
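
As a quick check on Table 4, the number of features left after the final AdaptiveAvgPool (2, 2) layer can be computed directly from the last channel count of each model, because adaptive pooling fixes the spatial size regardless of input resolution. The sketch below assumes the pooled map is flattened before the next layer; the names `LAST_CHANNELS` and `flattened_features` are illustrative only.

```python
# Output channels of the last block before the final AdaptiveAvgPool,
# read from Table 4.
LAST_CHANNELS = {
    "SmallConv": 128,
    "SmallConvL": 512,
    "VGG8": 512,
    "MobileNetV1": 1024,
}

POOL_OUT = (2, 2)  # AdaptiveAvgPool output size listed in Table 4


def flattened_features(channels, pool_out=POOL_OUT):
    """Feature count after pooling to pool_out and flattening (assumed)."""
    return channels * pool_out[0] * pool_out[1]


FLAT = {name: flattened_features(c) for name, c in LAST_CHANNELS.items()}
```

Here `FLAT` holds 512 features for SmallConv, 2048 for SmallConvL and VGG8, and 4096 for MobileNetV1.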

A.2 Datasets

In this section, we provide a brief description of the datasets used in this work: MNIST [18], FashionMNIST [30], CIFAR10 [16], CIFAR100 [16], IMAGENETTE [8], TinyIMAGENET [17], and Visual Wake Words (VWW) [4].

MNIST:

This dataset consists of 70,000 grayscale images of handwritten digits (0-9), each of size 28x28 pixels. It is divided into 60,000 training images and 10,000 test images.

FashionMNIST:

This dataset consists of 70,000 grayscale images of fashion items, such as clothing and accessories, each of size 28x28 pixels. Similar to MNIST, it is divided into 60,000 training images and 10,000 test images.

CIFAR10:

This dataset consists of 60,000 color images in 10 different classes, with each class containing 6,000 images. The images are 32x32 pixels in size, and the dataset is split into 50,000 training images and 10,000 test images.

CIFAR100:

This dataset is similar to CIFAR10 but contains 100 classes with 600 images per class (500 for training and 100 for testing). The images are each of size 32x32 pixels. The dataset is divided into 50,000 training images and 10,000 test images. Additionally, CIFAR100 includes labels for twenty super-classes, each grouping together five similar classes, providing a hierarchical structure for more detailed analysis.

IMAGENETTE:

This dataset is a subset of the larger ImageNet dataset, containing 10 easily classified classes: tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, and parachute. It consists of 13,000 images, each with a resolution of 160x160 pixels.

TinyIMAGENET:

This dataset is a scaled-down version of the ImageNet dataset, containing 200 classes with 500 training images, 50 validation images, and 50 test images per class. The images are resized to 64x64 pixels.

Visual Wake Words (VWW):

This dataset is designed for tiny, low-power computer vision models. It contains images labeled with the presence or absence of a person. The images are resized to 128x128 pixels. The dataset is divided into 115,000 training images and 8,000 test images.

These datasets provide a diverse range of image classification challenges, facilitating the evaluation of models across various levels of complexity and application scenarios.

A.3 Training Details

All models reported in this work were trained with a batch size of 128 using the Schedule-Free AdamW optimizer [6] with a learning rate of $5 \times 10^{-3}$, betas of 0.9 and 0.999, and a weight decay of 0. For experiments with the MNIST dataset, the data augmentation consisted of a random crop with padding 4, followed by normalization. For FashionMNIST, the same data augmentation was used, with the addition of a random horizontal flip. Below, we report the specific settings used for particular models.

A.3.1 Experiments with SmallConv and SmallConvL

For experiments with the SmallConv and SmallConvL models, we used light data augmentation for CIFAR10, CIFAR100, and IMAGENETTE. For CIFAR10 and CIFAR100, only a random horizontal flip was applied. For IMAGENETTE, the images were resized to 132x132 pixels and then randomly cropped to 128x128 pixels, followed by a random horizontal flip. The models were trained for 100 epochs for the experiments reported in Table 1 and Table 2.

A.3.2 Experiments with VGG8

We used more extensive data augmentation for experiments with CIFAR10, CIFAR100, IMAGENETTE, and TinyIMAGENET. The data augmentation consisted of a random crop, followed by a random horizontal flip, then a normalization layer, and a random erasing [32] with a probability of 0.2. When VGG8 was trained on MNIST and FashionMNIST, the model was trained for 100 epochs. For the other datasets, the model was trained for 300 epochs and dropout layers with a probability of 0.2 were used after each ConvBlock.

A.3.3 Experiments with MobileNetV1

For the experiments with the Visual Wake Words (VWW) dataset, the training images were resized and randomly cropped to a size of 128x128 pixels, followed by normalization. The model was trained for 500 epochs for the experiments reported in Table 3.

A.4 Experimental Compute Resources

All experiments were conducted on a shared internal Linux server equipped with an AMD EPYC 7502 32-Core Processor, 504 GB of RAM, and four NVIDIA A40 GPUs, each with 48 GB of GDDR6 memory. Additionally, code was implemented using Python 3.9 and PyTorch 2.2.1 with CUDA 11.8.