In-Memory Binary Neural Networks

Supervisors:
Prof. Maurizio ZAMBONI
Prof. Mariagrazia GRAZIANO
Prof. Marco VACCA

Candidate:
Andrea COLUCCIO

April 10, 2019
Acknowledgments

I would like to thank all the people who made this course of study possible.

A special and most important recognition goes to my parents, who gave me the opportunity to face these academic years with peace of mind, giving me all the support I needed.

I also thank my girlfriend Martina for always being close, even in the most difficult moments and complicated choices. Thank you for supporting me every time.

Special thanks to my aunt Maria for her support.

I would like to express my gratitude to my high school Professor Anna Civarelli, who has always motivated me to do my best from the beginning of my scholastic career until now. She played an important role in my education, allowing me to develop a strong interest in Electronics.

I am grateful to the Politecnico di Torino, which provided me with the means to fulfill myself in my field of interest; in particular, I thank my mentors Prof. Maurizio Zamboni, Prof. Mariagrazia Graziano and Prof. Marco Vacca, who encouraged me to give my best to complete this important thesis.

Lastly, I thank all those who have supported me over the years.

Sincerely,
Andrea Coluccio
Turin, April 10 2019.
Glossary

**1T1R** One transistor, one resistor: a memory cell implementation used in RRAM to isolate the current of the selected cell from the others.

**ACCA** Accumulation array.

**AlexNet** AlexNet is a convolutional neural network that competed in the ImageNet Large Scale Visual Recognition Challenge in 2012, achieving a top-5 error of 15.3% [1].

**BCNN** Binary convolutional neural network.

**BNN** Binary neural network.

**CIFAR-10** The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of 60000 32x32x3 images in 10 different classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks [2].

**CIM** Computation in memory.

**DPU** Digital processing unit: a separate unit (external to the memory) used to perform computations that cannot be executed in-memory.

**DW** Domain wall (magnetic).

**FMEM** Filter memory.

**ImageNet** The ImageNet project is a large visual database designed for use in visual object recognition software research. It contains about 14 million images [3].

**IMEM** Image memory.

**IPNE** Input parallel neural engine: inputs are provided in parallel, while outputs are delivered serially. The output of this configuration is compatible with the OPNE’s input.

**ISU** Input feature map summation unit.

**LeNet** LeNet is a type of convolutional neural network.

**MLC** Multi-level cell: more than one bit can be held in a single cell.

**MLCS** Memory logic conjugated system.

**MLP** Multilayer perceptron: a class of artificial neural network. Each node, except for the inputs, is a neuron that uses a nonlinear activation function. An MLP is trained using backpropagation [4].

**MLSA** Multi-level sense amplifier.

**MNIST** The MNIST database (Modified National Institute of Standards and Technology database) is a dataset of grayscale images of handwritten digits, with 60000 training images [5].

**MRAM** Magnetoresistive random-access memory (MRAM) is a non-volatile random-access memory technology. Data in MRAM is not stored as electric charge or current flows, but by magnetic storage elements.

**MSC** Modified sensing circuit, designed for logic and full-add operations.

**MTJ** Magnetic tunnel junction: a component composed of two ferromagnets separated by an insulator. Electrons can tunnel from one ferromagnet into the other [6].

**NDP** Near data processing.

**NPU** Neuron processing unit.

**NVM** Non-volatile memory.

**OFMAP** Output feature map.

**OPNE** Output parallel neural engine: inputs are provided serially, while outputs are delivered in parallel. The output of this configuration is compatible with the IPNE’s input.

**PIM** Processing in memory module: it is formed by the combination of an OPNE and an IPNE.

**PU** Processing unit.

**ReLU** Rectified linear unit: a neuron activation function defined as $\mathrm{ReLU}(x) = \max(0, x)$. In terms of training time, it is the best choice.

**RRAM** Resistive switching random access memory.

**SC** Stochastic computing.

**SCT** Synapse configuration table.

**SGD** Stochastic gradient descent method.

**SOT** Spin-orbit torque: a type of magnetic RAM.

**stride** In the context of CNNs, the distance between the receptive field centers of neighboring neurons in a kernel map.

**STT** Spin-transfer torque: an effect in which the orientation of a magnetic layer in an MTJ can be modified using a spin-polarized current [7].

**SVHN** SVHN (Street View House Numbers) is a dataset. It consists of a training set of 604K and a test set of 26K 32x32 color images representing digits from 0 to 9.

**top-1** The top-1 error is measured by checking whether the top class (the one with the highest probability) is the same as the target label.

**top-5** The top-5 error is measured by checking whether the target label is among the top 5 predictions (the 5 with the highest probabilities).
Summary

In this thesis, an In-Memory architecture of a binary neural network is presented. The "In-Memory" concept consists in placing very simple computational units, such as logic gates or full adders, close to the memory, implementing a distributed circuit instead of the classical Von Neumann one. This choice brings relevant benefits, such as lower energy consumption and delay: since the computation is performed very close to the memory, the energy wasted and the latency caused by data fetching are heavily reduced, allowing a higher degree of parallelism.

Convolutional Neural Networks (Figure 1) have been chosen as computational models. They are a class of neural networks able to recognize and classify raw data, such as images, sounds and natural language. The key parameters of a neural network are the number of layers and their dimensions, which influence the achievable accuracy and the complexity of the datasets that can be handled. A "binary" approximation called XNOR-Net is considered, in which weights (W) and inputs (I) are binarized to \(\{-1, +1\}\) by taking their sign, reducing the multiply-accumulate operations used in convolution to sequences of XNOR and pop-counting operations. The term "pop-counting" refers to the following operation: (number of 1s) - (number of 0s). The computed value is then multiplied by two scaling factors (K and \(\alpha\)), obtaining the approximated convolution. This choice reduces the memory required and the computational cost, but degrades the achievable accuracy (from \(\sim 97\%\) to \(\sim 84\%\) for the model in Figure 1).
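
As a rough illustration of the XNOR/pop-counting operation described above, the following Python sketch (not the thesis code) binarizes a weight vector and an input vector by sign, encodes $-1$ as bit 0 and $+1$ as bit 1, and computes (number of 1s) - (number of 0s) on the XNOR result. The function names and the choice of mapping a zero value to $+1$ are assumptions made for this example; the scaling by K and \(\alpha\) is omitted.

```python
def binarize(values):
    """Map real values to {-1, +1} by sign (assumption: 0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def xnor_popcount(w_bin, i_bin):
    """Dot product of two {-1, +1} vectors via XNOR and pop-counting."""
    w_bits = [w > 0 for w in w_bin]   # +1 -> True, -1 -> False
    i_bits = [i > 0 for i in i_bin]
    # XNOR is True wherever the two sign bits agree.
    xnor = [not (a ^ b) for a, b in zip(w_bits, i_bits)]
    ones = sum(xnor)
    zeros = len(xnor) - ones
    return ones - zeros               # pop-counting: number of 1s - number of 0s


def dot(a, b):
    """Reference multiply-accumulate on the {-1, +1} values."""
    return sum(x * y for x, y in zip(a, b))


w = binarize([0.7, -1.2, 0.1, -0.4])
i = binarize([0.5, 0.3, -0.9, -2.0])
# The XNOR/pop-counting result matches the multiply-accumulate result.
assert xnor_popcount(w, i) == dot(w, i)
```

The final assertion checks that, on binarized operands, the XNOR/pop-counting sequence gives the same value as the multiply-accumulate it replaces.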
Architectures

Figure 2: Classical implementation. In **Binary input RF**, the binary signs are precharged and then fetched one row per clock cycle to compute the XNORs. The selected incoming bit goes to the pop-counting unit.

Figure 3: In-Memory implementation. Inputs are precharged into the memory cells and the XNOR gates perform the XNOR operation between the binary weights ($W_0, W_1, \ldots$) and the inputs. The XNOR results are then fetched by the pop-counting units.

Two architectures based on 45 nm CMOS technology have been developed: an In-Memory implementation and a classical one. The classical implementation is used as the reference architecture against which the performance of the In-Memory case is compared. The computational model is well suited to an In-Memory implementation, since XNOR gates and pop-counting circuits are very simple units that can be integrated into a memory array. In the classical implementation in **Figure 2**, a traditional memory is used, in which data are simply stored and the computation is done out-of-memory (OOM). In the In-Memory alternative (**Figure 3**), the traditional structure is replaced with a CAM-like array and the computation is performed inside the mesh by computing the XNORs between binarized weights and inputs. One of the main advantages of the In-Memory alternative is the parallelization of the XNOR/pop-counting computations, which reduces the time required by the algorithm and the energy consumed.
Validation flow

Since neural networks are often realized in software (for example in Python with TensorFlow and Keras), a MATLAB model that computes in both floating-point and fixed-point representation has been developed to validate the correctness of the VHDL implementation: once the floating-point results are validated, the fixed-point model is verified, obtaining the validation flow depicted in Figure 4.
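
To make the floating-point versus fixed-point comparison concrete, here is a minimal Python sketch of a signed fixed-point quantizer. It is only an illustration, not the MATLAB model used in the thesis: the default widths (18 total bits, 10 fractional bits) follow the example given in the caption of Figure 4.12, while the rounding mode, the saturation behavior and all function names are assumptions.

```python
def to_fixed(x, n_bit=18, n_bit_fractional=10):
    """Quantize x to a signed fixed-point integer code.

    Assumptions: rounding with Python's round() (ties go to the even code)
    and saturation at the limits of an n_bit two's-complement word.
    """
    scale = 1 << n_bit_fractional
    lowest = -(1 << (n_bit - 1))
    highest = (1 << (n_bit - 1)) - 1
    code = int(round(x * scale))
    return max(lowest, min(highest, code))


def from_fixed(code, n_bit_fractional=10):
    """Convert an integer code back to the real value it represents."""
    return code / (1 << n_bit_fractional)


def max_abs_error(values, n_bit=18, n_bit_fractional=10):
    """Largest gap between floating-point values and their fixed-point versions."""
    return max(
        abs(v - from_fixed(to_fixed(v, n_bit, n_bit_fractional), n_bit_fractional))
        for v in values
    )


samples = [0.1, -0.3, 2.75]
# Without saturation, the error is bounded by half of the least significant bit.
assert max_abs_error(samples) <= 2 ** -11
```

A check of this kind, applied to layer outputs instead of hand-picked samples, is one way a fixed-point model could be compared against a floating-point reference.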

Figure 4: Validation flow of the neural network model.

Performance

For the model in Figure 1, the results show that the classical implementation needs $\sim 2.5 \times$ more computational time than the In-Memory architecture and consumes $\sim 1.7 \times$ more energy. The implemented architectures can realize any kind of neural network, including more complex models or datasets. For the model depicted in Figure 1, the In-Memory case achieves a frame rate of 16337 fps with 0.79 $\mu$J consumed, while the OOM case achieves 6652 fps with 1.33 $\mu$J; the clock period is 4.22 ns in both cases. Evaluating the performance of the architectures on different neural network models, the In-Memory alternative consumes $\sim 3.7 \times$ less energy and reduces the computational delay by up to $\sim 5.7 \times$ with respect to the classical counterpart.

Rough comparisons have been performed with state-of-the-art solutions based on innovative technologies (such as RRAMs, MTJs, memristors, etc.), showing very good computational delay with relatively low energy consumption for the In-Memory architecture. The performance estimations for this case are pessimistic, since the memory array has been synthesized by Synopsys Design Compiler as a register file with each cell as a flip-flop, which is more complex than a custom memory cell. Nevertheless, the resulting normalized energy and delay for the In-Memory case are $\sim 650$ pJ/neuron and $\sim 17$ ns/neuron respectively, which are comparable to an analog MTJ-based single-level-cell solution with a normalized energy of $\sim 450$ pJ/neuron and a normalized delay of $\sim 16$ ns/neuron. Choosing beyond-CMOS technologies enables the realization of very efficient solutions.
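
The ratios quoted for the model in Figure 1 can be reproduced from the reported frame rates and energies. The short Python sketch below does this arithmetic; the variable names are chosen for illustration, and the cycle count is derived under the assumption that the reported 4.22 ns is the clock period and that each run processes one frame.

```python
# Figures reported in the summary for the model in Figure 1.
fps_in_memory = 16337        # frames per second, In-Memory architecture
fps_oom = 6652               # frames per second, OOM (classical) architecture
energy_in_memory_uj = 0.79   # microjoules, In-Memory architecture
energy_oom_uj = 1.33         # microjoules, OOM architecture
clock_period_ns = 4.22       # assumption: 4.22 ns is the clock period

# Ratio of computational time (OOM over In-Memory) and of energy consumed.
time_ratio = fps_in_memory / fps_oom
energy_ratio = energy_oom_uj / energy_in_memory_uj

# Time per frame and the corresponding number of clock cycles.
frame_time_in_memory_ns = 1e9 / fps_in_memory
frame_time_oom_ns = 1e9 / fps_oom
cycles_in_memory = frame_time_in_memory_ns / clock_period_ns
cycles_oom = frame_time_oom_ns / clock_period_ns

print(f"time ratio:   {time_ratio:.2f}x")
print(f"energy ratio: {energy_ratio:.2f}x")
print(f"cycles per frame: In-Memory {cycles_in_memory:.0f}, OOM {cycles_oom:.0f}")
```

Rounded to one decimal place, the two ratios agree with the $\sim 2.5 \times$ and $\sim 1.7 \times$ values stated in the text.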

**Thesis structure**

This work is composed of the following chapters:

1. **State-of-the-art**, in which current neural network implementations and technologies are reported (both In-Memory and OOM solutions);

2. **Comparisons**: the implementations discussed in the state-of-the-art are compared in terms of performance;

3. **Software implementation**: an explanation of the starting neural network model is provided, in which Python code is analyzed and discussed;

4. **Hardware implementations**: a detailed explanation of how the neural network has been realized in VHDL is given for both architectures (OOM and In-Memory). In this part, the neural network model depicted in Figure 1 is used, because it is easier to understand. Next, it is shown how the circuit can be used to implement different neural network models, with any kind of structure and dimension;

5. **Verification**: the results are compared and the correspondence between the Python, MATLAB and VHDL models is tested, following the flow in Figure 4. Here, three different neural network models are tested to demonstrate the capability of the circuits to implement any kind of neural network model and dataset: the original one (Figure 1), an MLP network and a Fashion-MNIST based CNN;

6. **Synthesis - Place & Route**: performance results are provided for the models analyzed in the verification part. Moreover, by performing several syntheses, a parametric sweep is carried out on the key parameters of the neural network (such as IFMAP sizes, OFMAP sizes, number of simultaneous input channels and so on), to evaluate the trend of power, area, timing and energy for both the In-Memory and OOM architectures. Rough comparisons with the state-of-the-art are performed in this part;

7. **Conclusions and future work**: conclusions are drawn and improvements are proposed.
# Table of contents

Acknowledgments

1 State of the art
1.1 Introduction
1.1.1 Artificial neural network
1.1.2 Convolutional neural networks
1.1.3 Binary neural network
1.1.4 Backpropagation algorithm
1.2 Software based neural networks
1.2.1 ImageNet Classification with Deep Convolutional Neural Networks
1.2.2 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
1.2.3 BinaryConnect: Training Deep Neural Networks with binary weights during propagations
1.2.4 A Ternary Weight Binary Input Convolutional Neural Network: Realization on the Embedded Processor
1.3 MTJ-Based BNN
1.3.1 A Multilevel Cell STT-MRAM-Based Computing In-Memory Accelerator for Binary Convolutional Neural Network
1.3.2 Energy Efficient In-Memory Binary Deep Neural Network Accelerator with Dual-Mode SOT-MRAM
1.3.3 A Logic-in-Memory Design with 3-Terminal Magnetic Tunnel Junction Function Evaluators for Convolutional Neural Networks
1.4 RRAM Based
1.4.1 The application of Non-volatile Look-up-table Operations based on Multilevel-cell of Resistance Switching Random Access Memory
1.4.2 XNOR-RRAM: A Scalable and Parallel Resistive Synaptic Architecture for Binary Neural Networks
1.4.3 MAGIC-Memristor-Aided Logic
1.4.4 Mixed-precision architecture based on computational memory for training deep neural networks
1.4.5 A hardware neural network for handwritten digits recognition using binary RRAM as synaptic weight element
1.4.6 Challenges of emerging memory and memristor based circuits: Nonvolatile logics, IoT security, deep learning and neuromorphic computing
1.5 SRAM based
1.5.1 In-Memory Area-Efficient Signal Streaming Processor Design for Binary Neural Networks
1.5.2 Deep learning consideration with novel approach - look-up-table based processing conjugated memory
1.5.3 A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm
1.6 DRAM Based
1.6.1 XNOR-POP: A processing-in-memory architecture for binary Convolutional Neural Networks in Wide-IO2 DRAMs
1.7 OOM implementations
1.7.1 Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing
1.7.2 Towards Near Data Processing of Convolutional Neural Networks
1.7.3 Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks
1.7.4 An Energy-Efficient Architecture for Binary Weight Convolutional Neural Networks

2 Comparisons
2.1 Algorithm accuracies
2.1.1 Performance comparisons
2.1.2 Conclusions

3 Software implementation
3.1 Network model
3.2 Network’s computational model
3.2.1 Python code

4 Hardware implementations
4.1 OOM implementation
4.1.1 Max pooling layer
4.1.2 Convolutional and fully-connected layers
4.1.3 Flatten layer
4.1.4 Neural network entity
4.1.5 VHDL implementation
4.2 In-memory implementation
4.2.1 Convolutional/fully connected layer
4.3 Memories’ sizes
4.3.1 Parameters precharging
4.3.2 Memory required
4.4 Timing comparison
4.4.1 OOM implementation
4.4.2 In-memory implementation
4.4.3 General cases
4.5 Choosing the number of bits (n_bit)

5 Verification
5.1 VHDL’s output
5.2 MATLAB’s output
5.3 Other neural network models
5.3.1 MLP Implementation
5.3.2 Fashion-MNIST neural network model

6 Synthesis - Place & Route
6.1 Original architecture
6.1.1 Synthesis
6.1.2 Place & Route
6.2 MLP architecture
6.2.1 Synthesis & Place-Route chips
6.3 Fashion-MNIST CNN
6.4 General cases
6.5 State-of-the-art comparisons
6.5.1 Number of neurons
6.5.2 Results

7 Conclusions and future work
7.1 Future work

Bibliography
# List of figures

1. Convolutional neural network used as starting model. The MNIST database is used, which is composed of handwritten digits from 0 to 9.
2. Classical implementation. In **Binary input RF**, the binary signs are precharged and then fetched one row per clock cycle to compute the XNORs. The selected incoming bit goes to the pop-counting unit.
3. In-Memory implementation. Inputs are precharged into the memory cells and the XNOR gates perform the XNOR operation between the binary weights \((W_0, W_1, \ldots)\) and the inputs. The XNOR results are then fetched by the pop-counting units.
4. Validation flow of the neural network model.
5. Neuron’s structure
6. Sigmoid activation function
7. Hyperbolic tangent activation function
8. ReLU activation function
9. CNN example from [32]
10. Example of a kernel in a CNN with 3x3 size.
11. Convolution example with FC network. The weights used in the fully connected part are the same as those of the kernel.
12. Example of max-pooling 2x2 and stride of 1 [35]
13. Binarization process based on the sign function of the input weights/activations.
14. Binary XNOR-Popcount based computation.
15. BNN implementation. The green boxes are fully connected layers.
16. Neural network example. An **MLP** is analyzed in order to simplify the explanations. The same approach can be used in **CNNs**.
17. First output neuron

1.15 Approximation of the derivative of the sign function from [36]

1.16 Example of $w_1(W^{(2)}(1,1))$ computation from [10]. Biases are not reported because they are not used in the computation.

1.17 **AlexNet** architecture from [11]. The architecture is divided into two parts, each handled by one of the two GPUs, with some layers in which they communicate with a DMA (direct memory access) approach. The two parts are identical, so the dimensions reported are the same.

1.18 Example of a non-overlapped pooling and pooling procedural steps

1.19 An example of a 3x3 window with an overlapping pooling with stride $s = 2$ and $z = 3$

1.20 Example of Dropout technique from [37]

1.21 Weights and inputs represented as tensors.

1.22 Structure of the XNOR-Net from [12].

1.23 Example of the 2D convolutional operation for the ternary weight and binary input

1.24 **VGG16** architecture from [38]. It is composed of 16 layers and is able to reach up to 70% top-1 and 90% top-5 recognition accuracy on ImageNet.

1.25 Magnetic tunnel junction (schematic) from [6].

1.26 Cell structure and example of a 2x2 array from [15]. **MSC** stands for "modified sensing circuit" and it is able to do some computations based on the current of the source/bit lines. The mode controller chooses which operation to perform, while the row decoder handles the word lines. In order to write into the MTJ, a current has to flow through it, and the direction is shown here. If "1" has to be stored, the current has to magnetize the layers in a parallel way, resulting in a lower MTJ resistance (LRS), so a positive voltage is applied between BL and SL; otherwise, with "0", the magnetizations must have antiparallel direction.

1.27 CNN architecture used in [15].

1.28 **BCNN** Accelerator from [15]. The logical computations are performed inside the memory array, while other intensive operations, such as batch normalization or scaling factor computations, are performed outside the memory in a separate unit.

1.29 **SOT-MRAM** device structure and an example of 2x2 crossbar array from [16].

1.30 Inputs and weights are stored in ImageBanks; the binary convolution is then computed by performing an In-Memory AND logic operation followed by bit counting. Source: [16].

1.31 Crosspoint array architecture from [17]; two different types of MTJs are used in [17]: the synaptic MTJs are the classical ones, with two possible values of resistances ($R_P$ and $R_{AP}$); while the thresholding MTJs are the ones discussed so far. The last MTJ (indicated by an arrow) acts as function evaluator and it implements the activation function of the neuron. This crossbar can be seen as an array of variable resistances.

1.32 1T1R configuration from [39].

1.33 Crossbar array cell organization from [18]. Each memory cell is an RRAM.

1.34 Architecture of the multiplier based on MLC RRAM from [18].

1.35 Bit cell structure from [19].

1.36 Memristor behavior from [20] depending on the current flow direction.

1.37 NOR Gate with memristors from [20].

1.38 Memristor-based crossbar array: configuration for NOR logic gate from [20].

1.39 Principle scheme of the mixed precision architecture. Source: [21].

1.40 Network structure from [22].

1.41 Write voltages of different technologies. Source: [23].

1.42 Simplified 3-2 neural network implemented with RRAMs 1T1R configuration. Source: [23].

1.43 An example of a 3-2 BNN from [24] and the transformation into a fully connected configuration. The Synapse configuration table is reported, indicating the meaning of the connections. The fully connected network has been implemented considering also bias and mask signals. At the end, the three pop-counting results are added together and the sign of the sum is taken, which defines the output.

1.44 DL calculation structure at 700MHz from [25].

1.45 Structure of the neurosynaptic core from [26].

1.46 Architecture proposed in [27].

1.47 Building blocks of an XNOR-Net from [27].

1.48 Bank structure. Source: [27].

1.49 (a) Multiplier; (b) Binary - Stochastic converter; (c) Stochastic - Binary converter; (d) Multiplexer adder with random input r; (e) Improved version of the adder, without the random input. Source: [28].

1.50 Example of computations of the new stochastic adder. Source: [28].

1.51 Architecture structure. Source: [29].

1.52 Complete and partial result computation. Source: [29].
1.53 Chain-NN architecture with \( k = 3 \), where \( k \) is the kernel size. Nine processing elements are needed because a different weight is used for each PE. Each PE contains a MAC and a register, and the corresponding outputs can optionally be pipelined in order to improve performance (red dashed lines). Example of computation. Source: [30]

1.54 Streaming order in dual channel architecture. Source: [30]

1.55 Architecture. Source: [31]

1.56 4:2 compressors used in [31]

1.57 Approximate multiplier. Source: [31]

2.2 top-5 errors for the same dataset ImageNet. AlexNet: [11], XNOR-Net: [12], BWN: [12], BinaryConnect: [13], BW-BI: [13]

2.3 Accuracy comparison for CIFAR-10 dataset. XNOR-Net: [12], BWN: [12], BinaryConnect: [13], Ternary: [14]

2.4 AlexNet architecture from [11]

2.5 Energy comparison: higher is better. MLC-STT: [15], SOT: [16], OPNE-IPNE: [40], Neurosynaptic core: [26], Stochastic: [28], CPU-CLU: [29]

2.6 Macro-pipeline structure [40]. Once OPNE terminates, IPNE starts producing a serial output, which is processed by the following OPNE.

2.7 Delay comparison: higher is better. MLC-STT [15], SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19], HMC [29], Chain-NN [30], Energy-efficient [31]

2.8 Area comparison: higher is better. SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19] (MLP), Stochastic [28], HMC [29], Energy-efficient [31]

3.1 Neural network model

3.2 XNOR-Net computation example

3.3 Fully connected layer - toy example

3.4 Accuracies’ trend over 5 epochs and batch size of 10

3.5 Accuracies’ trend over 5 epochs and batch size of 100

3.6 Accuracies’ trend over 20 epochs and batch size of 100

3.7 Example of \( K \) and \( \alpha_i \) computation in the fully connected layer

3.8 MLP network used to test the approximation drawback. This structure is able to achieve an accuracy of around 97% after 20 training epochs

4.1 Neural network model used as starting point

4.2 Max pooling: indexing example with \( w_{in} = 4 \), \( w_{filter} = 2 \) and stride = 1

4.3 Input selection circuit

4.4 Max pooling layer FSM

4.5 Timing diagram of the max pooling layer. Starting from idle, the FSM moves to precharge decoder (PD), in which the external decoder is precharged with its initial values. During do pooling, the inputs are provided to the max pooling layer and the computation starts with comparator computing, in which Count comparator is incremented until it reaches the value $w_{\text{filter}(\text{pool})}^2 - 1$, which is equal to 3 ($4 - 1$) in the neural network model depicted in Figure 4.1. When the terminal count CMP is asserted, the FSM moves to clear comparator (CC), in which the value stored inside the comparator is reset to the minimum. The result is stored inside the RF Pool, which is placed outside the chip (see Figure 4.27) and is addressed by count out pool. The entire procedure is repeated until count out pool reaches the terminal count pool, which is asserted when count out pool is equal to $w_{\text{out}(\text{pool})}^2$, that is 196 in the neural network model in Figure 4.1. At this point, done and wait for start are reached, where the FSM waits for a new start signal.

4.6 Example of input selection and saving circuits with $w_{\text{in}} = 4, w_{\text{filter}} = 2, \text{stride} = 1$. The inputs are selected by the input selector and their signs are stored into the register file ($s(0), s(1), s(4), s(5)$). Then, once the saving procedure is completed, the inputs are fetched from the Binary Input RF and XNORed with the weights’ signs. The XNOR results are selected by a multiplexer (Incoming bit).

4.7 Pop-counting circuit: 4 bits example

4.8 Multiple input channels architecture. The XNOR and pop-counting units are replicated once per input channel, obtaining a parallel computation. The contribution of each channel is added in the output computer unit.

4.9 Alpha computational unit: example with $w_{\text{filter}} = 2$. The input multiplexer has been instantiated in an external unit, in order to reduce the total number of inputs of the chip.

4.10 Alpha computation unit in case of multiple input channels. An adder tree adds all the multiplexed weights from the $c_{\text{in}}$ inputs. The last division is performed also by the number of input channels. The re-timing technique has been used for the loop register, in order to reduce the critical path caused by an adder tree and a divider.

4.11 Alpha computation unit in case of multiple output/input channels

4.12 Fixed point representation: example with $n_{\text{bit}} = 18$ and $n_{\text{bit\_fractional}} = 10$
4.13 K scheduling. Example with $w_{in} = 4$ and $w_{filter} = 2$. Everytime a new data is precharged, K computation starts and lasts for $w_{filter}^2$ clock cycles.

4.14 Example of K unit with $w_{filter} = 2$ with multiple input channels. The input multiplexer has been integrated into an external unit in order to reduce the number of simultaneous inputs to the architecture. Since conv_z is fixed, the multiplexer selects only one input at a time. The register indicated by the red arrow has been moved from its original location by applying re-timing: this technique avoids having multiple adders connected to the final multiplier, reducing the critical path delay. The last term ($\frac{1}{w_{filter} \times w_{in}}$) is taken directly from the alpha unit.

4.15 Example of fully connected layer integration. The data precharging pattern is inverted to compute the outputs values of the neurons o0 and o1. number_of_fc_parameters indicates the number of input neurons that in this example is equal to 3. In the real case depicted in Figure 4.1, number_of_fc_parameters = 1014.

4.16 Fully connected layer scheduling. Inputs and weights are divided into subgroups of L elements and precharged inside the Binary Input RF. At each cycle, once the pop-counting has finished, a new set of inputs/weights is precharged in the Binary Input RF and the pop-counting part starts again. The register file RF TMP Pop holds the temporary values of pop-counting and is addressed by the counter: the total number of registers used in RF TMP Pop is equal to the number of output neurons, which is 10 in the model of Figure 4.1.

4.17 Convolution computation unit. Example of a 4 input channels output computer unit, with batch normalization and ReLU. $\alpha$ is delayed by a register in order to reduce the critical path. A, B are the batch normalization terms. The path indicated by the red arrow has been retimed to reduce the critical path delay.

4.18 Multiplication scheme: example with n_bit=18 and n_bit_fractional = 10.

4.19 Entire convolutional layer datapath: example with $C_{out} = 2$, $C_{in} = 4$. The area highlighted by the red dashed line is implemented in an external unit.

4.20 FSM of the convolutional/fully connected layers. The term ”TC” indicates terminal count.
4.21 Timing diagram for the K computation considering only one input channel. When Enable K is asserted during initial stage, the K computation starts to address one out of $w^2_{filter}$ inputs with Count K, and the corresponding sum is obtained in OutRegSum. This phase lasts for $w^2_{filter}$ clock cycles, which in this example is equal to 4. After that, during Input precharge (IP), a new input set is provided and the just computed value of K is stored inside K array.

4.22 Example of timing diagram for the Evaluation state, considering only one pop-counting unit. When Counter SRAM has terminated, terminal count SRAM is '1', allowing the FSM to move from Initial stage to Evaluation. During this state, pop-counting is enabled and Count pop starts. OutPop(0) changes its value according to the XNOR values: this procedure terminates when all the filter elements have been considered, that is after $w^2_{filter}$ clock cycles. In the meantime, alpha can start its computation.

4.23 Timing diagram for the convolution computation. The FSM moves from Evaluation (IE) to output computation when terminal count pop counting is '1'. During output computation, the values of K (selected by the counter SRAM) and alpha are fed to the output computer, which performs the product between the OutPop result and these two values, obtaining Output computation (see Figure 4.17). The FSM waits until terminal count OC, which is asserted when the output computer has scanned all the parallel input channels (Figure 4.17), that is after $c_{in}$ clock cycles. Since in the reference architecture depicted in Figure 4.1 there is only one parallel input channel, the FSM passes immediately to the batch normalization state, which computes Batch Normalization/ReLU within a clock cycle. Moving to the increase batch state, the Counter SRAM is enabled and incremented, in order to consider another binary input set from the Binary Input RF and a new value of K, which is addressed by the counter itself. At the same time the convolution result is saved inside a temporary register file (Temporary CNV RF in Figure 4.27). The procedure restarts with evaluation.
4.24 Timing diagram for multiple output channels handling. From increase batch (IB), the FSM moves toward wait for last result, since the Counter SRAM has reached the end of counting. The last valid data is saved inside the Temporary CNV RF (Figure 4.27) and, consequently, the entire content of the register file is stored in the output register files (Figure 4.27) during store results. At this point the channel is changed by increasing channel selected, which selects another weights set. Alpha is computed again and the entire process described in the previous parts is repeated.

4.25 Timing diagram of the fully connected part. After weight precharge (WP), the FSM starts to save the binary values inside the Binary Input RF during Input precharge, as already discussed. After that, evaluation can start; in particular, the first line addressed by Counter SRAM is pop-counted. The pop-counting procedure lasts $L \times t_{ck}$, which is equal to $6 \times t_{ck}$ in the neural network model depicted in Figure 4.1. Once Count Pop has reached 5, the FSM moves to save tmp results fc (ST), in which the temporary result of the pop-counting procedure is saved inside the RF TMP POP (depicted in Figure 4.19) and the last register of the pop-counting unit is cleared (Figure 4.7). A new evaluation procedure starts, but now the second row of the Binary Input RF is considered, since Counter SRAM has been incremented. The entire procedure for the first part of the fc scheduling (discussed in section 4.1.2) ends when the value of Counter SRAM is equal to the number of output neurons, which is 10 in the neural network model depicted in Figure 4.1. After that, the state Increase fc increments Count fc, which selects another inputs/weights set, as reported in section 4.1.2. These computational steps are repeated $n_{iter} = \frac{\text{number of fc parameters}}{L}$ times.

4.26 Example of flattening procedure. Each matrix represents a convolutional output channel.

4.27 Example of a neural network top entity with $c_{out} = 2$, $c_{in} = 4$. The hardware inside the dashed border is included in the Neural network top entity. This scheme is valid for both the OOM and In-Memory architectures.

4.28 Neural network’s FSM

4.29 Parameter generation entity. The parameters are chosen based on the value of iteration_cycle, which changes every time a done signal from the convolutional layer is asserted.
4.30 Example of XNOR in memory with \( w_{in} = 4, \ w_{filter} = 2 \) and \( W = 4 \). For each memory cell there is a XNOR gate that computes the xnor between the binary weights (first row) and the corresponding binary inputs. At the end of each row (excluding the first one reserved to the binary weights), there is a multiplexer which selects the incoming bit as discussed in the OOM implementation. For each incoming bit there is a pop-counting unit and each pop-output is selected by a final multiplexer.

4.31 Example of an in-memory convolutional layer architecture with \( c_{in} = 4 \) and \( c_{out} = 2 \).

4.32 FSM of the convolutional/fully connected layer of the In-Memory implementation.

4.33 Timing diagram of convolution computation in the In-Memory architecture. Starting from Weights precharge (WP), the binary weights are precharged inside the first row of the XNOR UNIT. During Initial stage, the K computation starts, requiring \( w_{filter}^2 \) clock cycles. Binary inputs are precharged inside the memory array during Input precharge (IP), in which the Counter SRAM is also incremented. During evaluation, the \( \alpha \) computation starts and the pop-counting results are computed in parallel, requiring \( w_{filter}^2 \) clock cycles: this is the most important difference with respect to the OOM architecture, in which the evaluation process has to be repeated for each output (Figure 4.23). After pop-counting has finished, output computation (OC), batch normalization (BN) and ReLU computations are performed and repeated for each output. In Change CNV Res (CNV), count mux out is incremented and the final multiplexer in Figure 4.30 addresses another output. The procedure finishes when count mux out is 168 and, at this point, the second weight set is selected, \( \alpha \) is computed again and the FSM restarts with evaluation (EV).
4.34 The algorithm starts with the **Weights precharge** (WP) state, in which the binary fc inputs are precharged in the first row of the **XNOR Memory**, because of the inverted precharging order between weights and inputs (section 4.1.2). During **input precharge**, the fully connected weights are also stored inside the memory. **Evaluation fc** starts and ends within 6 clock cycles, since $L = 6$: in this phase, all the parallel pop-counting units operate at the same time, producing the partial results of the $w_{out(fc)}$ neurons, which are 10 in the neural network model depicted in Figure 4.1. After **evaluation fc**, the FSM increments **count fc** during **increase fc** (IFC), following the fc scheduling already explained in section 4.1.2. At this point the algorithm starts again from **weights precharge**. Comparing with the timing diagram of the fully connected layer for the OOM case (Figure 4.25), the main difference is evident: OOM needs to perform the pop-counting calculations serially, storing the partial results inside the **RF TMP POP**, while the In-Memory alternative can do the computation in parallel, without the need to store the partial results, since they are held by the last register of the pop-counting units (Figure 4.7).

4.35 Data precharging scheduling. One data of $n_{bit}$ per clock cycle is stored in the register files.

4.36 Memory required as a function of $n_{bit}$ for the neural network model depicted in Figure 4.1.

4.37 Computational delay of the OOM architecture with $t_{ck} = 5.5ns$.

4.38 Computational delay of the In-Memory architecture with $t_{ck} = 5.5ns$.

4.39 Speedup vs $C_{out}$: a higher $C_{out}$ increases the time ratio, but at the cost of a more complex architecture (more parameters are required).

4.40 Speedup vs stride conv: increasing the stride worsens both accuracy and speedup, but the complexity of the network decreases.

4.41 Speedup vs $w_{filter(pool)}$: the speedup ratio decreases and the accuracy also worsens, since the higher $w_{filter(pool)}$ is, the lower the quality of the input image.

4.42 Speedup vs stride pool: the speedup ratio decreases and the accuracy also worsens, since a higher stride implies a lower quality input image.

4.43 Speedup vs $w_{out(fc)}$: higher is better, but in the case reported in Figure 4.1, no more than 10 outputs are used. If the neural network is structured with more than one fully connected layer, this brings some advantages.
4.44 Speedup vs $w_{\text{filter}(\text{conv})}$: increasing $w_{\text{filter}(\text{conv})}$ also increases the speedup, but the accuracy is degraded.

4.45 $c_{\text{in}} - w_{\text{filter}}$ plot for a convolutional computation. Increasing $c_{\text{in}}$ decreases the delay ratio because, according to Equation 4.45 and Equation 4.46, the ratio tends towards 1 for high values of $c_{\text{in}}$. The delay ratio increases with higher values of $w_{\text{filter}}$.

4.46 $c_{\text{in}} - c_{\text{out}}$ plot for a convolutional computation. For $c_{\text{in}}$, the same considerations made for Figure 4.45 are valid. Regarding $c_{\text{out}}$, increasing it makes the delay ratio rise slowly, like a logarithmic function, until it saturates, since the limit of the delay ratio function for $c_{\text{out}} \to \infty$ is a constant.

4.47 $w_{\text{filter}} - c_{\text{out}}$ plot for a convolutional computation. The big advantage of the In-Memory architecture in terms of delay with respect to the OOM one is obtained with high values of $w_{\text{filter}}$ and $c_{\text{out}}$. Considering for example the first layer of AlexNet, there are 96 OFMAPs with $w_{\text{filter}} = 11$ and the delay ratio will be $\sim 27 \times$.

4.48 $w_{\text{in}} - w_{\text{filter}}$ plot for a convolutional computation. By increasing $w_{\text{in}}$, the delay ratio remains approximately the same, while the $w_{\text{filter}}$ dependency is the same as described for Figure 4.47.

4.49 $w_{\text{out}(\text{fc})} - n_{\text{iter}}$ plot, considering a fully connected layer. Increasing both quantities brings relevant benefits in terms of delay ratio. In particular, with high values of $n_{\text{iter}}$ the In-Memory architecture takes advantage of a more scheduled FC computation (Figure 4.16): this is a very important result, since a high $n_{\text{iter}}$ implies a smaller array, as $W \geq \frac{\text{number of FC parameters}}{n_{\text{iter}}} = L$, which further reduces the power consumption, area and energy consumption of the In-Memory architecture.

4.50 Accuracy vs number of bits. The total number of images tested is 10000. The reference accuracy is set to 0.8338 from section 3.2.1.

5.1 Verification flow

5.2 MLP model. The network has 15 layers and is able to achieve $\sim 90\%$ accuracy on the MNIST dataset.

5.3 Fashion MNIST dataset

5.4 CNN model used for the fashion-MNIST dataset. All convolutional layers have a kernel size of 5x5x6 with stride 1. Max pooling layers have a kernel size of 2x2 with stride 1. After each fully connected layer there is a batch normalization computation, in order to reduce the inaccuracies caused by the approximated computation introduced in section 4.1.2. This model is able to achieve up to 70% accuracy.
6.1 Part of timing reports for both architectures. The main differences are highlighted by the red dashed circles. The same logic gate has been implemented into two different ways in the architectures.

6.2 Physical chip of OOM architecture

6.3 Physical chip of In-Memory architecture

6.4 OOM chip implementing the neural network model depicted in Figure 5.2

6.5 In-Memory chip implementing the neural network model depicted in Figure 5.2

6.6 Computational delay of the In-Memory architecture, implementing the neural network model depicted in Figure 5.2

6.7 Computational delay of the OOM architecture, implementing the neural network model depicted in Figure 5.2

6.8 Computational delay of the OOM architecture, implementing the neural network model depicted in Figure 5.4

6.9 Computational delay of the In-Memory architecture, implementing the neural network model depicted in Figure 5.4

6.10 Area, CP delay, Power vs $c_{in}$ - $w_{filter}$ for the OOM architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$). Power vs $c_{in}$ - $w_{filter}$: power increases almost linearly with $c_{in}$, because more parallel architectures are working at the same time. With higher $w_{filter}$, the power rises almost exponentially, because a larger memory array is required and more XNOR gates are used. Area vs $c_{in}$ - $w_{filter}$ behaves in the same way. CP delay vs $c_{in}$ - $w_{filter}$: it remains almost constant, since it is caused by a multiplier-adder sequence. For a higher $c_{in}$, more adders are used in the adder trees in the K-α computations (Figure 4.11 and Figure 4.13), but the critical path remains the same.

6.11 Area, Critical path delay, Power vs $c_{in}$ - $w_{filter}$ for the In-Memory architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$). The same considerations made for Figure 6.10 are valid here. The maximum power reached in this case is $\sim 260mW$, compared with $\sim 230mW$ in the previous case. Considering the higher number of logic gates required in the In-Memory architecture, this is a very good result, which also reduces the computational time normally required by the OOM architecture.
6.12 Area ratio, Critical path delay ratio, Power ratio vs $c_{in} - w_{filter}$ obtained as OOM/In-Memory ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$). Increasing $c_{in}$ reduces the power/area ratios, since the In-Memory architecture requires more building blocks than the OOM case. Increasing $w_{filter}$ brings power benefits to the In-Memory architecture, since the registers start to have a predominant contribution with respect to the sequential/combinational powers: since the architectures have approximately the same number of registers, the power ratio tends towards 1 for $w_{filter} \rightarrow \infty$. From a power consumption point of view, it is convenient to implement an In-Memory architecture with high $c_{in}$ and $w_{filter}$.

6.13 Energy ratio vs $c_{in} - w_{filter}$ ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$). The delay ratio with respect to $c_{in} - w_{filter}$ depicted in Figure 4.45 has been multiplied by the obtained power ratio. The result shows that the In-Memory architecture becomes more efficient in terms of energy for higher values of $w_{filter}$. Consequently, the effect of the rise of $c_{in}$ is reduced. This is a very good result, since for very deep networks such as AlexNet, the In-Memory architecture reaches better energy results.

6.14 Area, Critical path delay, Power vs $n_{bit} - w_{filter}$ for the OOM architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$, $c_{in} = 1$). Increasing $n_{bit}$ also increases power and area, since a higher number of bits implies more complicated operators (adders, multipliers etc). In the critical path delay case, a peak is visible at 19 bits: from the timing report, the critical path is located in the divider of the $\alpha$ unit. As already seen in Figure 6.10, with high values of $w_{filter}$, both area and power rise exponentially.

6.15 Area, Critical path delay, Power vs $n_{bit} - w_{filter}$ for the In-Memory architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$, $c_{in} = 1$). Same considerations of Figure 6.14 are valid here.

6.16 Area ratio, Critical path delay ratio, Power ratio vs $n_{bit} - w_{filter}$ obtained as OOM/In-Memory ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$, $c_{in} = 1$). For a high value of $n_{bit}$, the area and power ratios increase. This implies that the In-Memory architecture gains a performance advantage if a more precise representation is used.

6.17 Area, Critical path delay, Power vs $\sqrt{H} - c_{in}$ for OOM architecture ($c_{out} = 1$, $W = w_{filter}^2 = 4$). The larger $\sqrt{H}$ is, the higher the power consumption and area are, since the registers become very large (exponential trend). Regarding $c_{in}$, as already said, power/area increase almost linearly. The critical path delay remains almost the same for each value of $\sqrt{H} - c_{in}$.
6.18 Area, Critical path delay, Power vs \( \sqrt{H} - c_{in} \) for In-Memory architecture \((c_{out} = 1, W = w^2_{filter} = 4)\). Same considerations made for Figure 6.17 are valid in this case. The power/area values reached are higher than the previous case, because of the higher number of registers/logic gates. 

6.19 Area ratio, Critical path delay ratio, Power ratio vs \( \sqrt{H} - c_{in} \), obtained as OOM/In-Memory \((c_{out} = 1, W = w^2_{filter} = 4)\). By increasing both \( c_{in} \) and \( \sqrt{H} \), power/area ratios decrease, because of the higher amount of logic gates inside the In-Memory architecture. 

6.20 Energy ratio vs \( \sqrt{H} - c_{in} \), obtained as OOM/In-Memory \((c_{out} = 1, W = w^2_{filter} = 4)\). This is the worst case, because by increasing both \( c_{in} \) and \( \sqrt{H} \) the energy ratio decreases, because of the higher amount of logic gates inside the In-Memory architecture. With higher values of both \( w_{filter} \) and \( c_{out} \), the energy ratio will decrease for the motivations explained before. 

6.21 Energy ratio vs \( \sqrt{H} - c_{in} \) for the fully connected algorithm, obtained as OOM/In-Memory \((c_{out} = 1, W = w^2_{filter} = 4, number\_of\_fc\_parameters = 1000, n_{iter} = 250)\). In this case, the energy ratio increases considerably, since the fully connected algorithm is far more efficient in the In-Memory case than in the OOM one. Depending on the algorithm type, the performance can be better or worse: a higher number of fully connected layers with a high value of \( n_{iter} \) implies a more efficient In-Memory architecture than its OOM counterpart. 

6.22 Mean Delay, Power, Area, Timing and Energy ratios, obtained as OOM/In-Memory. If the ratio value is higher than 1, it means that the In-Memory architecture obtained a better result. As expected, In-Memory alternative is more efficient in terms of Energy/Delay than OOM counterpart. 

6.23 Energy comparison: higher is better. MLC-STT: [15], SOT: [16], OPNE-IPNE: [40], Neurosynaptic core: [26], Stochastic: [28], CPU-CLU: [29]. 

6.24 Delay comparison: higher is better. MLC-STT [15], SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19], HMC [29], Chain-NN [30], Energy-efficient [31]. 

6.25 Area comparison: higher is better. SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19] (MLP), Stochastic [28], HMC [29], Energy-efficient [31]. 

7.1 Modified pop-counting circuit for the In-Memory architecture.
Chapter 1

State of the art

1.1 Introduction

1.1.1 Artificial neural network [8]

Neuron

An artificial neural network is used to process very complex information and, in particular, to perform classification. Its structure is modelled on the way the biological brain computes [8]. It is composed of "neurons", which are its basic building blocks:

\[
net = \sum_{i=1}^{N} X_i \times W_i + Bias
\]

Figure 1.1: Neuron’s structure
Neurons are organised in an interconnected network that is able to take decisions and to learn when these decisions are wrong [41]. Considering its equivalent structure from an ”electronic” point of view (Figure 1.1), the following terms are used:

- **Bias**: additive term;
- **$X_1, X_2$**: inputs of the neuron;
- **$W_1, W_2$**: weights. Each synapse has its own weight. Depending on the values they are allowed to assume, the weights can be:
  1. Floating point weights: the values are represented in floating point, so the network can work at full precision;
  2. Binarized weights: the weights can only assume ±1 values;
  3. Ternary weights: the weights can assume $\{1, 0, -1\}$. A weight equal to 0 means that a particular neuron is not connected to another one.
- **$f$**: activation function of the neuron. There are different kinds of activation functions; the most used ones are:
  1. **Sigmoid function**: represented by the following equation
     \[
     f(x) = \frac{1}{1 + e^{-x}} \tag{1.1}
     \]
2. **Hyperbolic tangent**: given by

\[ f(x) = \tanh(x) \quad (1.2) \]

Figure 1.2: Sigmoid activation function

Figure 1.3: Hyperbolic tangent activation function
3. **ReLU function**: the term means "rectified linear unit" and it is given by

\[
ReLU(x) = \max(0, x) \tag{1.3}
\]

This kind of activation is often used because it represents a good trade-off between accuracy and simplicity: as shown in the plot in Figure 1.4, it is zero for negative inputs and linear for positive ones, so it is much simpler to compute than the sigmoid or hyperbolic tangent functions.

![Rectified linear unit function](image)

Figure 1.4: ReLU activation function

4. **Sign function**: this is used in binary/ternary neural networks and, considering the type of network, two different kinds of sign functions can be used:

\[
sign^{(0)}(x) = \begin{cases} 
1, & \text{if } x \geq 0 \\
-1, & \text{if } x < 0 
\end{cases} \tag{1.4}
\]

\[
sign^{(t)}(x) = \begin{cases} 
1, & \text{if } x \geq \rho \\
0, & \text{if } -\rho \leq x < \rho \\
-1, & \text{if } x < -\rho 
\end{cases} \tag{1.5}
\]
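
The neuron model and the activation functions listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the input values and the weights are assumptions chosen for the example and do not come from the architecture described in this work.

```python
import math

def sigmoid(x):
    """Sigmoid activation, Equation 1.1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit, Equation 1.3."""
    return max(0.0, x)

def sign_binary(x):
    """Binary sign function, Equation 1.4: returns +1 or -1."""
    return 1 if x >= 0 else -1

def sign_ternary(x, rho):
    """Ternary sign function, Equation 1.5, with threshold rho."""
    if x >= rho:
        return 1
    if x < -rho:
        return -1
    return 0

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(net)

# Hypothetical values, for illustration only.
inputs = [0.5, -1.0]
weights = [1, -1]      # binarized weights
bias = 0.25
net_value = neuron(inputs, weights, bias, lambda v: v)   # 0.5 + 1.0 + 0.25
binary_output = neuron(inputs, weights, bias, sign_binary)
```

The hyperbolic tangent of Equation 1.2 is available directly as `math.tanh` and can be passed as the `activation` argument in the same way.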
Neural network

In order to realize a neural network, multiple neurons can be arranged in an ordered structure composed of many layers. An example is reported in the following figure:

As it is possible to see in Figure 1.5, the network has 3 layers:

1. **Input layer**: it simply forwards the inputs to the following layer, by applying the neurons’ activation function;

2. **Hidden layer**: the most important layer in the network, because it performs the computations explained before. Each neuron propagates the following quantity to its output:

   \[
   \text{output}(i) = f \left( \sum_{j=0}^{\#\text{inputs}} (\text{input}(j) \cdot w(j)) + \text{bias} \right) \quad (1.6)
   \]

3. **Output layer**: executes the same computations as the hidden layer, but the outputs coming from these neurons represent the classification, and so the result of the computations.
Two situations are possible: when every neuron of the \((N-1)\)-th layer is connected to every neuron of the following \(N\)-th layer, the network is called **fully connected**; otherwise, the network is not fully connected. It is important to notice that a non-fully-connected network is easily implemented by a ternary network, because in the ternary approach the weights can assume \(\{1,0,-1\}\) and a weight equal to 0 removes the corresponding connection.
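
As a minimal sketch of Equation 1.6 applied layer by layer, the following Python code propagates an input vector through one hidden layer and one output layer. The network size, the ternary weight values and the names `layer_forward`, `hidden_weights` and `output_weights` are hypothetical and serve only to show how a zero weight removes a connection.

```python
def layer_forward(inputs, weights, biases, activation):
    """Apply Equation 1.6 to every neuron of a layer.

    weights[i][j] is the weight between input j and neuron i.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(activation(net))
    return outputs

def relu(x):
    return max(0, x)

# Hypothetical network: 3 inputs, 2 hidden neurons, 2 output neurons.
# The weights are ternary: a 0 means that the two neurons are not connected,
# so the hidden layer is not fully connected.
hidden_weights = [[1, 0, -1],
                  [0, 1, 1]]
hidden_biases = [0, 0]
output_weights = [[1, -1],
                  [-1, 1]]
output_biases = [0, 0]

x = [2, 1, 1]
hidden = layer_forward(x, hidden_weights, hidden_biases, relu)
scores = layer_forward(hidden, output_weights, output_biases, lambda v: v)
predicted_class = scores.index(max(scores))   # index of the largest output
```

Taking the index of the largest output as the predicted class is a common convention and is assumed here for illustration.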

**Size of the NN** [42] The size of a neural network has to be chosen considering the application in which it will be used. For example, using multiple hidden layers makes it possible to perform very difficult computations. The number of neurons in the hidden layers also has to be chosen properly: choosing too many of them can lead to **overfitting**, in which the number of parameters is very high and the dataset is not sufficient to train them all, resulting in very low accuracy. On the contrary, if the number of neurons is too small, the network is no longer capable of performing complex computations and processing complex datasets (**underfitting**). A simple rule from [42] to choose the number of neurons is the following:

\[
\begin{cases}
\#\text{output neurons} < \#\text{hidden neurons} < \#\text{input neurons} \\
\#\text{hidden neurons} = \frac{2}{3} \cdot \#\text{input neurons} + \#\text{output neurons} \\
\#\text{hidden neurons} < 2 \cdot \#\text{input neurons}
\end{cases} \tag{1.7}
\]
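
The second and third rules of Equation 1.7 can be turned into a small helper, shown here as a sketch. The function names and the example sizes are hypothetical, and rounding to the nearest integer is an assumption, since the rule itself does not say how a fractional result should be handled.

```python
def hidden_neurons_rule_of_thumb(n_inputs, n_outputs):
    """Suggested hidden-layer size: 2/3 of the input neurons plus the output neurons."""
    return round(2 * n_inputs / 3 + n_outputs)

def respects_upper_bound(n_hidden, n_inputs):
    """Check that the hidden neurons are fewer than twice the input neurons."""
    return n_hidden < 2 * n_inputs

# Hypothetical layer sizes, for illustration only.
n_hidden = hidden_neurons_rule_of_thumb(n_inputs=30, n_outputs=4)   # 20 + 4
within_bound = respects_upper_bound(n_hidden, n_inputs=30)
```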

**Forward pass** The first step in a NN is to "forward pass" the inputs towards the outputs. Inputs are propagated inside the neural network, which computes the partial results as explained before until the output is reached. Once the outputs are computed, the NN provides two different classifications, which represent the actual computation result with its original configuration (initial weights). The outputs are compared with the expected result and, if they are not the same, the network needs to be **trained** with a backward pass.

**Backward pass** Starting from the output layer, the weights of the network need to be updated in order to reach the target output. To do this, the backpropagation algorithm can be used, as explained later in **subsection 1.1.4.**
1.1.2 Convolutional neural networks

A convolutional neural network is a particular type of neural network which is able to process very large inputs (such as an image) and to give a proper classification at the output [34]. A CNN is composed of input, output and multiple hidden layers such as convolution, pooling, normalization and fully connected layers, as depicted in Figure 1.6:

Figure 1.6: CNN example from [33]. There are several layers such as convolution, pooling, fully connected (already described) and normalization that can be trained in order to classify the input. In this case, an image is used but the CNNs can be used for different applications, such as natural language and speech recognition [34].

- Convolutional layers: a convolutional layer takes an image as input (considering the first layer, the IFMAP is represented by 3 matrices of pixels representing the RGB values) and outputs a group of matrices convolved with a particular set of weights (called kernels). An example of kernel is depicted in the following figure:

Figure 1.7: Example of a kernel in a CNN with 3x3 size.

The input image is also called input feature map (IFMAP) and the corresponding processed output is called output feature map (OFMAP). The
general equation that defines the OFMAP can be formulated considering [31] and [16]:

\[ y_\ell^{(l)}(j, i) = b_0^{(l)} + \sum_{c=0}^{\#\text{channels}-1} \sum_{m=0}^{\#\text{rows(kernel)}-1} \sum_{t=0}^{\#\text{cols(kernel)}-1} k_{a,c}^{(l)}(m,t) \, x_c^{(l)}(j + m + j(\text{stride} - 1), i + t + i(\text{stride} - 1)) \]

Where:

- \((l)\) is the layer. In this example, the first layer is considered;
- \(b_0\) is the bias term;
- \(c\) is the input channel. As said previously, there can be more than one input channel in a convolutional neural network (RGB case);
- \(k_{a,c}^{(l)}\) is the kernel weight of the channel and layer considered;
- \(x_c^{(l)}\) is the corresponding input;
- \(\text{stride}\) is the corresponding step size used in the convolution.

This equation considers the case of a batch size equal to 1, where the batch size is the number of input images [43]. Considering a simpler case, with a 3x3 kernel, a 5x5 IFMAP and only one channel, evaluated in the first layer, the equation that defines the OFMAP becomes:

\[ y_0(j, i) = b_0 + k(0,0)x(j, i) + k(0,1)x(j, i + 1) + k(0,2)x(j, i + 2) + \]
\[ + k(1,0)x(j + 1, i) + k(1,1)x(j + 1, i + 1) + k(1,2)x(j + 1, i + 2) + \]
\[ + k(2,0)x(j + 2, i) + k(2,1)x(j + 2, i + 1) + k(2,2)x(j + 2, i + 2) \]

By looking at this equation, it is possible to observe that it is quite similar to the neuron’s equation, in fact:

\[ \text{Conv} = b + \sum_{t=0}^{\#\text{cols(kernel)}-1} \sum_{m=0}^{\#\text{rows(kernel)}-1} k(m,t) \cdot x(j + m, i + t) \quad (1.8) \]
\[ \text{Neuron} = \sum_{t=0}^{\#\text{inputs}} k(t)x(t) + b \quad (1.9) \]
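
The following Python sketch evaluates the single-channel convolution of Equation 1.8 with a generic stride and checks that one output element coincides with the neuron sum of Equation 1.9 applied to the flattened kernel window. The IFMAP values, the kernel and the function names are assumptions made for illustration, and no padding is applied, so the output size is taken as (input size - kernel size) / stride + 1.

```python
def conv2d_single_channel(x, k, bias=0, stride=1):
    """y(j, i) = bias + sum over (m, t) of k(m, t) * x(j*stride + m, i*stride + t)."""
    rows_k, cols_k = len(k), len(k[0])
    rows_out = (len(x) - rows_k) // stride + 1
    cols_out = (len(x[0]) - cols_k) // stride + 1
    y = [[bias] * cols_out for _ in range(rows_out)]
    for j in range(rows_out):
        for i in range(cols_out):
            for m in range(rows_k):
                for t in range(cols_k):
                    # j*stride + m is the same index as j + m + j*(stride - 1)
                    y[j][i] += k[m][t] * x[j * stride + m][i * stride + t]
    return y

def neuron_output(inputs, weights, bias):
    """Plain weighted sum plus bias, as in Equation 1.9."""
    return sum(w * v for w, v in zip(weights, inputs)) + bias

# Arbitrary 5x5 IFMAP and 3x3 kernel, for illustration only.
ifmap = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
ofmap = conv2d_single_channel(ifmap, kernel, bias=1, stride=1)

# The element y(0, 0) computed as a neuron on the flattened 3x3 window.
patch = [ifmap[m][t] for m in range(3) for t in range(3)]
flat_kernel = [kernel[m][t] for m in range(3) for t in range(3)]
as_neuron = neuron_output(patch, flat_kernel, bias=1)
```

With a 5x5 IFMAP and a 3x3 kernel, the OFMAP is 3x3 for stride 1 and 2x2 for stride 2.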
It is possible to realize a convolutional layer by employing a fully connected neural network; in particular, a single convolved output can be realized as:

Figure 1.8: Convolution example with FC network. The weights used in the fully connected part are the same of the kernel.
Each element in the **OFMAP** is defined as the sum of products between the weights and the **IFMAP**.

- **Pooling layers**: Pooling is an important feature of **CNNs**, because it reduces the dimensions of the feature map while maintaining the most important information [35]. This reduces the size of the network and the number of parameters used, preventing overfitting. In max pooling, a window size is defined (2x2 for example), the window is slid over the **OFMAP** produced by the convolutional layer, and the largest element inside that window is taken [35]. An intuitive example of max pooling is reported in the following figure:

![Max Pooling](image.png)

**Figure 1.9**: Example of max-pooling 2x2 and **stride** of 1 [35]

- **Batch normalization** [44]: this technique, which is not reported in Figure 1.6, reduces the problems arising during training (such as slow convergence), in particular in very deep networks such as **CNNs**. It is based on normalizing the inputs of each layer, so that the activations have a mean of 0 and a standard deviation of 1 [44][45]. The main benefits are faster training, easier weight initialization and the possibility to design deeper networks without losing precision [44].

The last part that composes a **CNN** is the fully connected layer, which has been explained previously.
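The max-pooling operation described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `max_pool2d`, the sample feature map and the default window and stride values are assumptions made for the example, not taken from [35].

```python
def max_pool2d(fmap, size=2, stride=2):
    """Slide a size x size window over a 2D feature map and keep the maximum."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect all the elements covered by the current window.
            window = [fmap[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out


# Hypothetical 4x4 OFMAP used only for illustration.
ofmap = [[1, 3, 2, 4],
         [5, 6, 7, 8],
         [3, 2, 1, 0],
         [1, 2, 3, 4]]

pooled_s2 = max_pool2d(ofmap, size=2, stride=2)  # non-overlapping windows
pooled_s1 = max_pool2d(ofmap, size=2, stride=1)  # windows overlap by one element
```

With a stride of 2 the 4x4 map is reduced to 2x2, while a stride of 1 produces a 3x3 output.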
1.1.3 Binary neural network [9]

As already mentioned, a BNN has binary weights and activations. However, as analyzed in subsection 1.1.4, the full-precision weights are still needed to compute the gradients. The main steps that characterize a BNN are the following:

1. **Weights/inputs binarization [9]:** the incoming activations and the weights are binarized as illustrated below:

   
   ![Figure 1.10: Binarization process based on the sign function of the input weights/activations.](image)

   Figure 1.10: Binarization process based on the sign function of the input weights/activations.

2. **Pass inputs in the neural network [9]:** the binary matrices pass in the neural network, producing a result. Taking for example the following computation:
Figure 1.11: Binary XNOR-Popcount based computation.

The result can be obtained by considering a series of XNOR operations and a final pop-count (number of ones - number of zeros). Considering the XNOR truth table:

<table>
<thead>
<tr>
<th>A</th>
<th>B</th>
<th>OUT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0(-1)</td>
<td>0(-1)</td>
<td>1(+1)</td>
</tr>
<tr>
<td>0(-1)</td>
<td>1(+1)</td>
<td>0(-1)</td>
</tr>
<tr>
<td>1(+1)</td>
<td>0(-1)</td>
<td>0(-1)</td>
</tr>
<tr>
<td>1(+1)</td>
<td>1(+1)</td>
<td>1(+1)</td>
</tr>
</tbody>
</table>

As it is possible to see, "0" is considered as -1 and "1" as +1, so the bitwise multiplication corresponds to XNOR’s output. The final result of the computation in Figure 1.11 is given by:

\[ y(0) = \text{popcount}(\text{xnor}(001, 010)) = \text{popcount}(100) = -1 \quad (1.10) \]
Similarly, this procedure is applied to all the remaining rows.

3. Realization: BNN is then implemented as shown in the following figure:

![BNN implementation diagram]

Figure 1.12: BNN implementation. The green boxes are fully connected layers.

The first layer of the BNN reported in Figure 1.12 is not binarized, because the correlation between unbinarized-binarized weights is weaker than the other layers [9].

The accuracy of a BNN trained on the MNIST dataset converges, although more slowly, to that of the full-precision (FP) network [9]. This is an important result: BNNs reduce the resources and the computational complexity required, and they are well suited for in-memory implementation [9].
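The XNOR-popcount computation of equation 1.10 can be checked with a short Python sketch. The function names are hypothetical; the bit encoding ("0" for -1, "1" for +1) and the pop-count definition (number of ones minus number of zeros) follow the text above.

```python
def xnor_popcount(a_bits, b_bits):
    """Dot product of two {-1,+1} vectors encoded as bits (0 -> -1, 1 -> +1)."""
    assert len(a_bits) == len(b_bits)
    # XNOR returns 1 where the two bits agree, i.e. where the product is +1.
    ones = sum(1 for a, b in zip(a_bits, b_bits) if a == b)
    zeros = len(a_bits) - ones
    return ones - zeros


def signed_dot(a_bits, b_bits):
    """Reference result: map the bits to -1/+1 and multiply-accumulate."""
    def to_pm1(bit):
        return 1 if bit == 1 else -1
    return sum(to_pm1(a) * to_pm1(b) for a, b in zip(a_bits, b_bits))


# Same operands as in equation 1.10: xnor(001, 010) = 100.
y0 = xnor_popcount([0, 0, 1], [0, 1, 0])
```

The second function is included only to show that the bitwise result coincides with the ordinary multiply-accumulate on the ±1 values.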

1.1.4 Backpropagation algorithm [10]

In this part, the backpropagation algorithm is explained from [10]. If Ternary/Binary neural networks are considered, since the activation function is the \( \text{sign}(t) \), the derivative has to be approximated in some way: the approximation from [36] can be used.

**Example of a 2x2x2 network [10]**

The back-propagation algorithm is used to train a neural network, by computing the gradient that is needed in the calculation of the weights. [46] Backpropagation requires the derivative of the **loss function** (also known as error function) w.r.t. the network’s output to be known. Consider the following neural network:

Figure 1.13: Neural network example. An MLP is analyzed in order to simplify the explanation; the same approach can be used in CNNs.

It has three layers (input layer, hidden layer and output layer) that are indicated as (1) for input, (2) for hidden and (3) for output.

**Forward pass**  The first step of backpropagation is to forward-pass the inputs through the neural network and observe the result. The result is compared with the expected one (target) by computing the total error [10]:

\[
E_{total} = \sum_{o=1}^{\#outputs} \frac{1}{2}(target(o) - out_o^{(3)})^2 =
\]

\[
= \frac{1}{2}(target(1) - out_1^{(3)})^2 + \frac{1}{2}(target(2) - out_2^{(3)})^2 + ... =
\]

\[
= E_{o1} + E_{o2} + ...
\]

**Backwards pass**  The errors obtained in output are backward-passed toward the input. The first layer encountered is the output layer:

- **Output layer**: considering for example the first output neuron depicted in Figure 1.14:
The neuron itself is divided into the input part (called net) and the output part (called out) [10]. In order to realize an algorithm which can be implemented in high-level language, the weights of each layer are stored into matrices in the following way:

\[
W^{(2)} = \begin{bmatrix} w_1 & w_3 \\ w_2 & w_4 \end{bmatrix} \quad W^{(3)} = \begin{bmatrix} w_5 & w_7 \\ w_6 & w_8 \end{bmatrix}
\] (1.14)

The input net can be defined as the weighted sum of all neuron’s inputs (which correspond to the outputs of the previous layer), with their corresponding weights:

\[
net_o^{(3)} = \sum_{i=1}^{\#rows W^{(3)}} x^{(2)}(i) \cdot W^{(3)}(i,o)
\] (1.15)

And out as:

\[
out_o^{(3)} = f_{act}(net_o^{(3)})
\] (1.16)

The activation function in a binary/ternary neural network is the sign function. In that particular case, out_o is defined as:

\[
out_o^{(3)} = sign(net_o^{(3)})
\] (1.17)

To determine the new value of \( w_5 \), the backpropagation algorithm computes the quantity $\frac{\partial E_{\text{total}}}{\partial w_5}$ and applies the chain rule from [10] illustrated in Figure 1.14:

$$\frac{\partial E_{\text{total}}}{\partial w_5} = \frac{\partial E_{\text{total}}}{\partial \text{out}_1^{(3)}} \cdot \frac{\partial \text{out}_1^{(3)}}{\partial \text{net}_1^{(3)}} \cdot \frac{\partial \text{net}_1^{(3)}}{\partial w_5}$$

(1.18)

By expanding all the elements:

$$\frac{\partial E_{\text{total}}}{\partial \text{out}_1^{(3)}} = \frac{\partial}{\partial \text{out}_1^{(3)}} \left( \sum_{o=1}^{\#\text{outputs}} \frac{1}{2} (\text{target}(o) - \text{out}_o^{(3)})^2 \right)$$

$$= -(\text{target}(1) - \text{out}_1^{(3)})$$

$$= \text{out}_1^{(3)} - \text{target}(1)$$

(1.19)

$$\frac{\partial \text{net}_1^{(3)}}{\partial w_5} = \frac{\partial}{\partial w_5} \left( \sum_{i=1}^{\#\text{rows}\ W^{(3)}} x^{(2)}(i) \cdot W^{(3)}(i,1) \right)$$

$$= \frac{\partial}{\partial w_5} \left( x^{(2)}(1) \cdot w_5 + x^{(2)}(2) \cdot w_6 + \text{bias} \right)$$

$$= x^{(2)}(1)$$

(1.20)

The last term $\frac{\partial \text{out}_1^{(3)}}{\partial \text{net}_1^{(3)}}$ considers the derivative of the activation function $f_{\text{act}}(x)$. In the Binary/Ternary case, it can be approximated as indicated in Figure 1.15 from [36]. In formulas:

$$\frac{\partial \text{Sign}(x)}{\partial x} = \begin{cases} 
1/2a, & \text{if } r - a \leq |x| \leq r + a \\
0, & \text{others}
\end{cases}$$

(1.21)

$$\frac{\partial \text{Sign}(x)}{\partial x} = \begin{cases} 
\frac{-1}{a^2}(|x| - (r + a)), & \text{if } r \leq |x| \leq r + a \\
\frac{1}{a^2}(|x| - (r - a)), & \text{if } r - a \leq |x| < r \\
0, & \text{others}
\end{cases}$$

(1.22)

To simplify the equations, $\frac{\partial f_{\text{act}}(x)}{\partial x}$ is expressed as:

$$f'_{\text{act}}(x) = \frac{\partial f_{\text{act}}(x)}{\partial x}$$

(1.23)
Finally, the original equation of \( \frac{\partial E_{\text{total}}}{\partial w_5} \) can be rewritten as:

\[
\frac{\partial E_{\text{total}}}{\partial w_5} = (out_1^{(3)} - target(1)) \cdot x^{(2)}(1) \cdot f_{act}' \left( \sum_{i=1}^{\#rows W^{(3)}} x^{(2)}(i) \cdot W^{(3)}(i,1) \right) \tag{1.24}
\]

In order to simplify the expression, the following equality is imposed:

\[
\delta_o = (out_o^{(3)} - target(o)) \cdot f_{act}'(net_o^{(3)}) \tag{1.25}
\]

The final expression for \( w_5 \) is given by:

\[
\frac{\partial E_{\text{tot}}}{\partial w_5} = \delta_1 \cdot x^{(2)}(1) \tag{1.26}
\]
The update rule for the weight $w_5$ is the following:

$$w_5^+ = w_5 - \eta \cdot \frac{\partial E_{\text{total}}}{\partial w_5}$$

(1.27)

Where $\eta$ is the learning rate, an important parameter that determines how much the weights are adjusted at each step along the gradient of the loss function. With a very small learning rate the algorithm moves slowly and takes a long time to converge, although the achievable accuracy is typically higher. A general expression for a weight $w = W^{(3)}(k,1)$ connected to a specific neuron (here the first output neuron) is the following:

$$\frac{\partial E_{\text{tot}}}{\partial w} = \delta_1 \cdot x^{(2)}(k)$$

(1.28)

To simplify the equations, $\frac{\partial E_{\text{tot}}}{\partial w} = \psi$. The computation for all the weights becomes:

for $k=1:\#\text{rows}(W^{(3)})$
  $\psi = \delta_1 \cdot x^{(2)}(k)$
  $W^{(3)+}(k,1) = W^{(3)}(k,1) - \eta \cdot \psi$
end

Extending this concept to all the output neurons in the last layer, the final steps for the output layer become:

for $o=1:\#\text{cols}(W^{(3)})$
  for $k=1:\#\text{rows}(W^{(3)})$
    $\psi = \delta_o \cdot x^{(2)}(k)$
    $W^{(3)+}(k,o) = W^{(3)}(k,o) - \eta \cdot \psi$
  end
end
- **Hidden layer**: for the hidden layer the computation is more involved, since it reuses the quantities already computed for the output layer. Consider for example the computation of $w_1$:

![Diagram of a neural network](image)

Figure 1.16: Example of $w_1$ ($W^{(2)}(1,1)$) computation from [10]. Biases are not reported because they are not used in the computation.

The term to be computed is [10]:

$$\frac{\partial E_{\text{tot}}}{\partial w_1} = \frac{\partial E_{\text{tot}}}{\partial out_1^{(2)}} \frac{\partial out_1^{(2)}}{\partial net_1^{(2)}} \frac{\partial net_1^{(2)}}{\partial w_1}$$ (1.29)

The contribution of $E_{\text{tot}}$ is given by the sum of the errors in output, so:

$$E_{\text{tot}} = \sum_{o=1}^{\#\text{cols}(W^{(3)})} E_0(o)$$ (1.30)

$$\frac{\partial E_{\text{tot}}}{\partial out_1^{(2)}} = \frac{\partial}{\partial out_1^{(2)}} \left( \sum_{o=1}^{\#\text{cols}(W^{(3)})} E_0(o) \right)$$ (1.31)

By exploiting the $j$-th element of the sum, the derivative can be rewritten as:

$$\frac{\partial E_0(j)}{\partial out_1^{(2)}} = \frac{\partial E_0(j)}{\partial net_j^{(3)}} \cdot \frac{\partial net_j^{(3)}}{\partial out_1^{(2)}}$$ (1.32)

$$\frac{\partial net_j^{(3)}}{\partial out_1^{(2)}} = W^{(3)}(1,j)$$ (1.33)
Now $\frac{\partial E_0(j)}{\partial \text{net}_j^{(3)}}$ can be computed as:

$$\frac{\partial E_0(j)}{\partial \text{net}_j^{(3)}} = \frac{\partial E_0(j)}{\partial \text{out}_j^{(3)}} \cdot \frac{\partial \text{out}_j^{(3)}}{\partial \text{net}_j^{(3)}}$$  \hfill (1.34)

$$\frac{\partial E_0(j)}{\partial \text{out}_j^{(3)}} = \text{out}_j^{(3)} - \text{target}(j)$$  \hfill (1.35)

$$\frac{\partial \text{out}_j^{(3)}}{\partial \text{net}_j^{(3)}} = f'_\text{act}(\text{net}_j^{(3)})$$  \hfill (1.36)

Putting all together:

$$\frac{\partial E_{\text{tot}}}{\partial \text{out}_{1}^{(2)}} = \left( \sum_{o=1}^{\#\text{cols}(W^{(3)})} (\text{out}_o^{(3)} - \text{target}(o)) \cdot f'_\text{act}(\text{net}_o^{(3)}) \cdot W^{(3)}(1,o) \right)$$  \hfill (1.37)

By remembering:

$$\delta_o = (\text{out}_o^{(3)} - \text{target}(o)) \cdot f'_\text{act}(\text{net}_o^{(3)})$$  \hfill (1.38)

The term $\frac{\partial E_{\text{tot}}}{\partial \text{out}_{1}^{(2)}}$ becomes:

$$\frac{\partial E_{\text{tot}}}{\partial \text{out}_{1}^{(2)}} = \sum_{o=1}^{\#\text{cols}(W^{(3)})} \delta_o \cdot W^{(3)}(1,o)$$  \hfill (1.39)

The final expression is:

$$\frac{\partial E_{\text{tot}}}{\partial w_1} = \sum_{o=1}^{\#\text{cols}(W^{(3)})} \delta_o \cdot W^{(3)}(1,o) \cdot f'_\text{act} \left( \sum_{i=1}^{\#\text{rows}(W^{(2)})} x^{(1)}(i) \cdot W^{(2)}(i,1) \right) \cdot x^{(1)}(1)$$  \hfill (1.40)
The final expression can be further reduced for a single neuron’s weights considering:

\[
\delta^{(L)}(1,o) = W^{(L)}(1,o) \cdot f'_{\text{act}} \left( \sum_{i=1}^{\#\text{rows}(W^{(L-1)})} x^{(L-2)}(i) \cdot W^{(L-1)}(i,1) \right) \quad (1.41)
\]

For a generic weight \( w \) of the same neuron (the first one of the hidden layer):

\[
\frac{\partial E_{\text{tot}}}{\partial w} = \psi = \sum_{o=1}^{\#\text{cols}(W^{(3)})} \delta_o \cdot \delta^{(3)}(1,o) \cdot x^{(1)}(k) \quad (1.42)
\]

From an algorithm point of view, each weight of the hidden layer’s neurons can be obtained as follows:

```matlab
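% delta(o): output-layer term of eq. (1.38); delta3(p,o): term of eq. (1.41)
for p = 1:size(W2, 2)        % hidden neurons (columns of W2)
    for k = 1:size(W2, 1)    % inputs of the p-th hidden neuron (rows of W2)
        psi = sum(delta(:).' .* delta3(p, :)) * x1(k);
        W2_new(k, p) = W2(k, p) - eta * psi;
    end
end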
```
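The output-layer and hidden-layer updates derived above can be combined into a single training step. The following Python sketch is only illustrative: the function names are hypothetical, biases are omitted, and a sigmoid activation is assumed by default so that the derivative exists everywhere (for a binary/ternary network one of the approximations of equations 1.21 and 1.22 would be passed as `f_prime` instead). The weight matrices follow the layout of equation 1.14, with rows indexing the inputs of a layer and columns indexing its neurons.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_step(W2, W3, x1, target, eta, f_act=sigmoid, f_prime=sigmoid_prime):
    """One gradient-descent step for a three-layer MLP without biases."""
    n_in, n_hid, n_out = len(W2), len(W3), len(W3[0])

    # Forward pass: net = weighted sum of the inputs, out = f_act(net).
    net2 = [sum(x1[i] * W2[i][p] for i in range(n_in)) for p in range(n_hid)]
    x2 = [f_act(n) for n in net2]
    net3 = [sum(x2[k] * W3[k][o] for k in range(n_hid)) for o in range(n_out)]
    out3 = [f_act(n) for n in net3]

    # Output-layer term delta_o = (out_o - target_o) * f'(net_o), eq. (1.25).
    delta = [(out3[o] - target[o]) * f_prime(net3[o]) for o in range(n_out)]

    # Output layer: W3+(k,o) = W3(k,o) - eta * delta_o * x2(k).
    W3_new = [[W3[k][o] - eta * delta[o] * x2[k] for o in range(n_out)]
              for k in range(n_hid)]

    # Hidden layer, eq. (1.40): the old W3 is used to propagate the deltas.
    W2_new = [[0.0] * n_hid for _ in range(n_in)]
    for p in range(n_hid):
        back = sum(delta[o] * W3[p][o] for o in range(n_out))
        for k in range(n_in):
            psi = back * f_prime(net2[p]) * x1[k]
            W2_new[k][p] = W2[k][p] - eta * psi
    return W2_new, W3_new
```

Note that the hidden-layer gradient is computed with the weights of the output layer before their update, so that both layers are updated consistently with the same forward pass.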

**General case: N-layers network**

The procedure is the same as described above. To define the equation in the general case, a generic \( U \)-th neuron is considered. The term \( \psi = \frac{\partial E_{\text{tot}}}{\partial w} \) for a single weight \( w \) is defined as follows in the different cases:

- **N-th layer**: \( \psi = \delta_U \cdot x^{(N-1)}(k) \)
- **(N-1)th layer**: \( \psi = \sum_{o=1}^{\#\text{cols}(W^{(N)})} \delta_o \cdot \delta^{(N)}(U,o) \cdot x^{(N-2)}(k) \)
- **(N-2)th layer**: $\psi = \sum_{o=1}^{\#cols(W^{(N)})} \sum_{p=1}^{\#cols(W^{(N-1)})} \delta_o \cdot \delta^{(N)}(p,o) \cdot x^{(N-3)}(k)$

As can be seen, every time the equation goes down by one layer, the number of sums increases and the equation includes the $\delta$ terms of the layers already processed. It is therefore not possible to define a priori a single equation that determines all the weights, because it depends on the number of layers and on the weight values updated in the previous cycle. It is important to note that this algorithm works well only if the weights are in floating-point representation: the neural network has to be trained before binarizing.
1.2 Software based neural networks

1.2.1 ImageNet Classification with Deep Convolutional Neural Networks [11]

Introduction

This work describes AlexNet, a convolutional neural network able to process a very large dataset such as ImageNet [11]. With a standard feedforward approach, recognizing thousands of input images requires a very large neural network with a very large number of parameters (weights). In fact:

$$\#\text{parameters}_{FF} = \sum_{i=1}^{\#\text{layers}-1} \#\text{neurons}(i) \cdot \#\text{neurons}(i+1) \quad (1.43)$$

Where i indicates the current layer. CNNs are instead well suited for very large inputs, because they need fewer parameters than the MLP solution, at the cost of a slightly degraded precision. AlexNet is trained on two Nvidia GTX 580 3GB GPUs with an implementation optimized to train the network faster, to reduce overfitting and to achieve very good results on these datasets [11]. The input images from ImageNet are prescaled to 224x224.
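Equation 1.43 can be evaluated with a one-line Python function. The layer sizes used below are hypothetical and serve only to show the order of magnitude: connecting a flattened 224x224x3 image directly to a layer of 4096 neurons already requires more than 600 million weights.

```python
def ff_parameter_count(neurons_per_layer):
    """Number of weights of a fully connected feedforward network, eq. (1.43)."""
    return sum(a * b for a, b in zip(neurons_per_layer, neurons_per_layer[1:]))


# Hypothetical small MLP: 784 inputs, 100 hidden neurons, 10 outputs.
small_mlp = ff_parameter_count([784, 100, 10])

# Hypothetical first fully connected layer applied to a flattened 224x224x3 image.
first_layer = ff_parameter_count([224 * 224 * 3, 4096])
```

Biases are not counted, consistently with equation 1.43.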

Architecture

The architecture consists of 8 layers (5 convolutional and 3 fully connected) [11], as reported in Figure 1.17:
In **Figure 1.17**, the output of the last fully-connected layer is connected to a final layer with **softmax** activation (a differentiable function that maps its inputs to a probability distribution, assigning the highest probability to the largest input), which gives a distribution over the 1000 class labels. **ReLU** is applied where specified and also after the fully connected layers: this activation function is preferred to others such as $$\tanh(x)$$ or $$\text{sigmoid}(x)$$ because it shortens the training time. The layers are organized as follows:

- The first convolutional layer has 224x224x3 input image, processed by 96 kernels of size 11x11x3 with a **stride** of 4 pixels. The output dimensions can be determined considering:

\[
t_{out} = \frac{t_{in} - t_{filter}}{\text{stride}} + 1 = \frac{224 - 11}{4} + 1 \simeq 55 \tag{1.44}
\]

\[
h_{out} = \frac{h_{in} - h_{filter}}{\text{stride}} + 1 = \frac{224 - 11}{4} + 1 \simeq 55 \tag{1.45}
\]

- The second layer takes as input the output of the first layer, which has been pooled and normalized, and convolves it with 256 filters of size 5x5x48;
- The third, fourth and fifth layers have 384, 384 and 256 kernels of size 3x3x256, 3x3x192 and 3x3x192 respectively;

- The fully-connected layers have 4096 neurons each.
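The output-size formula of equations 1.44 and 1.45 can be checked numerically. In the sketch below the function name and the `padding` parameter are assumptions added for generality. Note that $(224 - 11)/4 + 1 = 54.25$ is not an integer: with integer (floor) division the result is 54, while 55 is obtained exactly when the input side is 227, which is how this discrepancy is usually resolved in re-implementations of the network.

```python
def conv_output_size(t_in, t_filter, stride, padding=0):
    """Output side of a convolution or pooling window, using floor division."""
    return (t_in + 2 * padding - t_filter) // stride + 1


size_224 = conv_output_size(224, 11, 4)   # floor((224 - 11) / 4) + 1
size_227 = conv_output_size(227, 11, 4)   # (227 - 11) / 4 + 1, exact division

# The same formula applies to pooling, e.g. a 3x3 window with stride 2
# applied to a 55x55 feature map.
pooled = conv_output_size(55, 3, 2)
```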

Using two GPUs in this configuration reduces the top-1 and top-5 error rates by 1.7% and 1.2% with respect to a solution with only one GPU [11]. Local response normalization, applied after ReLU in some layers, reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively [11]. It has also been tested on CIFAR-10, giving an error of 11% against 13% without it.

**Overlapping pooling** The pooling procedure has already been explained in the introduction, but here overlapping pooling is used. Consider a pooling region of size $z \times z$ moved with stride $s$: if $s \geq z$, the pooling windows do not overlap. A simple example is presented here:

Figure 1.18: Example of a non-overlapped pooling and pooling procedural steps
If $s < z$, overlapping pooling is obtained. In particular, [11] uses $\text{stride} = 2$ and $z = 3$, reducing the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, with respect to the non-overlapping scheme.

![Overlapping pooling diagram](image)

Figure 1.19: An example of a 3x3 window with an overlapping pooling with \text{stride} $s = 2$ and $z = 3$

**Reducing overfitting**

Since this network has 60 million parameters [11] and has to classify among 1000 different classes, overfitting could significantly degrade its performance.

**Data augmentation** One of the possible ways to reduce the overfitting is to expand the dataset using some transformations [11]:

1. Generation of image translations and horizontal reflections. Random 224x224 patches (and their horizontal reflections) are extracted from the 256x256 dataset images and used for training, which increases the size of the training set by a factor of 2048;

2. Alteration of the intensities of the RGB channels in the training images, exploiting an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination.
**Dropout** Dropout is a useful technique which consists in selectively turning off some neurons in the hidden layers with probability p (0.5 in [11]) during training, in order to reduce overfitting [11]. At each iteration step the inputs are therefore processed by a different sampled architecture, but all these architectures share the same weights. The price to pay is a slower convergence: in [11] dropout roughly doubles the number of iterations required to converge.

![Figure 1.20: Example of Dropout technique from [37]](image)
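The dropout mechanism can be sketched as a random mask applied to the activations of a layer during training. The function name and the use of an explicit random generator are assumptions made for this example; the rescaling applied at inference time is not shown.

```python
import random


def dropout(activations, p, rng):
    """Set each activation to zero independently with probability p (training only)."""
    # rng.random() returns a float in [0, 1), so p = 0 keeps every neuron
    # and p = 1 turns every neuron off.
    return [0.0 if rng.random() < p else a for a in activations]


rng = random.Random(0)            # seeded generator, for reproducibility
hidden = [0.3, 1.2, 0.7, 2.1]     # hypothetical hidden-layer outputs
masked = dropout(hidden, 0.5, rng)
```

Each call samples a different mask, which corresponds to a different sub-network sharing the same weights.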

**Results**

Several results are reported in [11]. They are summarized in the following table:

<table>
<thead>
<tr>
<th>Competition</th>
<th>top-1 [%]</th>
<th>top-5 [%]</th>
<th>Dataset/year</th>
<th>Network structure</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>ILSVRC-2010</td>
<td>37.5</td>
<td>17</td>
<td>ImageNet</td>
<td>5 layers CNN</td>
<td>-</td>
</tr>
<tr>
<td>ILSVRC-2012</td>
<td>40.7</td>
<td>18.2</td>
<td>ImageNet</td>
<td>5 layers CNN</td>
<td>-</td>
</tr>
<tr>
<td>-</td>
<td>67.4</td>
<td>40.9</td>
<td>ImageNet 2009 (10184 categories, 8.9 million images)</td>
<td>5 layers CNN</td>
<td>Half images for training, half for classification</td>
</tr>
</tbody>
</table>
1.2.2 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks [12]

Introduction

CNNs perform very well in tasks such as speech recognition and image classification. Their main drawback is the large amount of computational power and memory they require, since the computations involve millions of parameters, as explained for AlexNet. The goals are to enable mobile devices and low-power embedded systems to run a neural network (such as a convolutional one), to achieve a high recognition accuracy and to save power, given the limited capacity of batteries. One possible way to reach these objectives is to binarize the neural network, in such a way that the accuracy remains comparable to that of the original implementation. Two alternatives can be analyzed:

1. Binary-Weight-Networks: only the weights are approximated to a binary value (±1). The MAC operations are simply reduced to additions/subtractions. This kind of CNN can be easily integrated into an embedded system;

2. XNOR-Networks: both the weights and the inputs are approximated with binary values. As already mentioned in the introduction, if weights and inputs are binarized, the MAC operation simply becomes an XNOR followed by a population count.

Differences between different types of networks are reported in the following table:

<table>
<thead>
<tr>
<th>Network type</th>
<th>Operations used</th>
<th>Memory saving</th>
<th>Computation saving</th>
<th>Accuracy on ImageNet % (AlexNet)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Standard (FP)</td>
<td>+,-,x</td>
<td>1x</td>
<td>1x</td>
<td>56.7</td>
</tr>
<tr>
<td>Binary weight</td>
<td>+,-</td>
<td>32x</td>
<td>2x</td>
<td>56.8</td>
</tr>
<tr>
<td>XNOR-Net</td>
<td>XNOR,bitcount</td>
<td>32x</td>
<td>58x</td>
<td>44.2</td>
</tr>
</tbody>
</table>

Binary convolutional neural network [12]

- **I**: represents the input tensor for each layer. \( I \in \mathbb{R}^{c \times w_{in} \times h_{in}} \), where \((c, w_{in}, h_{in})\) represents channel, width and height respectively;

- **W**: represents the weight tensor of each layer. \( W \in \mathbb{R}^{c \times w \times h} \), where \( w \leq w_{in}, h \leq h_{in} \).

\[ W \approx \alpha B \]  \hspace{1cm} (1.46)

where \( B \) is the binary filter and \( \alpha \) is a scaling factor. A convolution operation can therefore be approximated as [12]:

\[ I \ast W \approx (I \odot B)\alpha \]  \hspace{1cm} (1.47)

Figure 1.21: Weights and inputs represented as tensors.
**Estimating binary weights [12]** The $\alpha$ value can be estimated by minimizing the loss function, which is defined as:

$$\text{Loss function} = \| W - \alpha B \|^2$$

$$\alpha, B = \text{argmin} (\text{Loss function})$$

It is possible to demonstrate that this minimization leads to the following solution [12]:

$$\begin{cases} 
B_i = +1, \text{ if } W_i \geq 0 \\
B_i = -1, \text{ if } W_i < 0 
\end{cases}$$

So $B = \text{sign}(W)$. For the scaling factor $\alpha$, the derivative of the loss function w.r.t. $\alpha$ is set to 0 [12]. The result obtained is:

$$\alpha = \frac{W^T B}{n} = \frac{W^T \text{sign}(W)}{n} = \frac{\sum |W_i|}{n} = \frac{1}{n} \| W \|_1$$
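The closed-form solution above can be verified numerically on a small example. In this Python sketch the function names and the weight values are hypothetical; the filter is treated as a flat list of $n$ weights.

```python
def binarize_weights(W):
    """Return (alpha, B) with B = sign(W) and alpha = ||W||_1 / n."""
    n = len(W)
    B = [1 if w >= 0 else -1 for w in W]
    alpha = sum(abs(w) for w in W) / n
    return alpha, B


def squared_error(W, alpha, B):
    """Loss function ||W - alpha * B||^2."""
    return sum((w - alpha * b) ** 2 for w, b in zip(W, B))


# Hypothetical filter with n = 4 weights.
W = [0.5, -1.5, 0.25, -0.75]
alpha, B = binarize_weights(W)
best = squared_error(W, alpha, B)
```

Perturbing `alpha` in either direction increases the loss, which is consistent with the fact that the mean absolute value of the weights is the minimizer for $B = \text{sign}(W)$.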

**Training** The algorithm proposed by [12] to train the binary networks is the following:

1. Binarization of the weight filters at each layer by computing $B$ and $\alpha$;
2. Forward propagation with binary weights and their corresponding scaling factors;
3. Backward propagation, where the gradients are computed w.r.t. the estimated weight filters $\tilde{W} = \alpha \times \text{Sign}(W)$;
4. Update of the parameters and of the learning rate.

XNOR-Networks

A convolution operation, which consists of dot products and shifts, can be performed by binarizing both inputs and weights. By doing this, convolution becomes a simple XNOR-bitcount sequence, which can be implemented at low cost. To approximate the dot product $\langle X, W \rangle$ in binary form, with $X \approx \beta H$ and $W \approx \alpha B$, the following equation can be considered from [12]:

$$\alpha^*, B^*, \beta^*, H^* = \text{argmin} \|X \cdot W - \beta \alpha H \cdot B\|$$  \hspace{1cm} (1.52)

It is possible to demonstrate that the best solution is achieved when:

$$H = \text{sign}(X)$$ \hspace{1cm} (1.53)

$$B = \text{sign}(W)$$ \hspace{1cm} (1.54)

$$\beta = \left(\frac{1}{n}\|X\|_1\right)$$ \hspace{1cm} (1.55)

$$\alpha = \left(\frac{1}{n}\|W\|_1\right)$$ \hspace{1cm} (1.56)

For input binarization, a procedure more efficient than computing $\beta$ for every sub-tensor of the input can be used; it is based on the $K$ and $\alpha$ values. Once binarization is completed, the convolution can be approximated as [12]:

$$I * W \approx (\text{sign}(I) \odot \text{sign}(W)) \cdot K\alpha$$  \hspace{1cm} (1.57)

where $\odot$ represents XNOR-Bitcount operations and $K$ and $\alpha$ are defined as:

$$\begin{align*}
K &= \frac{\sum_{\text{channels}} |\text{inputs}|}{\#\text{channels}} \cdot 2D \text{ Matrix} \left(\frac{1}{w_{\text{filter}}^2}\right) \\
\alpha &= \frac{\sum\|\text{weights}\|}{\#\text{weights}}
\end{align*}$$ \hspace{1cm} (1.58)

As shown in Figure 1.22, max-pooling is placed after the convolution, because pooling applied to binary values reduces the accuracy (it almost always returns +1).
Normalization of the inputs before binarization improves the accuracy. The binary activation layer (BinActiv) computes $K$ and $\text{sign}(I)$; in BinConv, given $K$ and $\text{sign}(I)$, the binary convolution is performed.

### Results

The results obtained by measuring efficiency, speedup, required memory and accuracy are reported below.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Double precision [MB]</th>
<th>Binary precision [MB]</th>
</tr>
</thead>
<tbody>
<tr>
<td>VGG-19</td>
<td>1000</td>
<td>16</td>
</tr>
<tr>
<td>ResNet-18</td>
<td>100</td>
<td>1.5</td>
</tr>
<tr>
<td>AlexNet</td>
<td>475</td>
<td>7.4</td>
</tr>
</tbody>
</table>

Table 1.4: Required memory for different architectures from [12]

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Dataset</th>
<th>Implementation</th>
<th>Error rate [%]</th>
<th>TOP-1 [%]</th>
<th>TOP-5 [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>-</td>
<td>CIFAR-10</td>
<td>BWN</td>
<td>9.88</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>AlexNet</td>
<td>CIFAR-10</td>
<td>XNOR-NET</td>
<td>10.17</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>AlexNet</td>
<td>ImageNet</td>
<td>BWN</td>
<td>-</td>
<td>56.8</td>
<td>79.4</td>
</tr>
<tr>
<td>AlexNet</td>
<td>ImageNet</td>
<td>BinaryConnect</td>
<td>-</td>
<td>35.4</td>
<td>61.0</td>
</tr>
<tr>
<td>AlexNet</td>
<td>ImageNet</td>
<td>XNOR-NET</td>
<td>-</td>
<td>44.2</td>
<td>69.2</td>
</tr>
<tr>
<td>AlexNet</td>
<td>ImageNet</td>
<td>Binary Weight</td>
<td>-</td>
<td>27.9</td>
<td>50.42</td>
</tr>
<tr>
<td>AlexNet</td>
<td>ImageNet</td>
<td>Binary Input</td>
<td>-</td>
<td>56.6</td>
<td>80.2</td>
</tr>
</tbody>
</table>

Table 1.5: Classification accuracy from [12]

1.2.3 **BinaryConnect: Training Deep Neural Networks with binary weights during propagations** [13]

### Introduction

BinaryConnect is an approach that enables low-power computations in a neural network adapted to be "binary". The following parts explain how this network works.
BinaryConnect [13]

The key idea is to constrain the weights to ±1, as already seen previously: as a result, all MAC operations are reduced to additions and subtractions, which lowers the power consumption.

**Deterministic/stochastic binarization** One possibility to binarize the weights is a very simple approach based on taking the sign of the real-valued weight [13]:

\[ w_b = \begin{cases} 
+1, & \text{if } w \geq 0 \\
-1, & \text{otherwise} 
\end{cases} \]  

(1.59)

Another possibility is to use a statistical approach [13]:

\[ w_b = \begin{cases} 
+1, & \text{with probability } p = \sigma(w) \\
-1, & \text{with probability } p = 1 - \sigma(w) 
\end{cases} \]  

(1.60)

Where \( \sigma \) is the "hard sigmoid" function from [13]:

\[ \sigma(x) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right) \]  

(1.61)
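Equations 1.59 to 1.61 translate directly into code. The Python sketch below uses hypothetical function names and an explicit random generator, which is an assumption made to keep the example reproducible.

```python
import random


def hard_sigmoid(x):
    """Hard sigmoid of eq. (1.61): clip (x + 1) / 2 to the interval [0, 1]."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize_deterministic(w):
    """Eq. (1.59): sign of the real-valued weight."""
    return 1 if w >= 0 else -1


def binarize_stochastic(w, rng):
    """Eq. (1.60): +1 with probability hard_sigmoid(w), -1 otherwise."""
    return 1 if rng.random() < hard_sigmoid(w) else -1


rng = random.Random(0)
samples = [binarize_stochastic(0.5, rng) for _ in range(1000)]
fraction_plus = samples.count(1) / len(samples)  # expected to be close to 0.75
```

For weights with $|w| \geq 1$ the hard sigmoid saturates, so the stochastic rule gives the same result as the deterministic one.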

Training works as in the previously analyzed cases, but no \( \alpha, K \) values are used as in XNOR-Net. The weights used during backpropagation have to be kept at full precision, otherwise the algorithm does not work. Batch normalization is used to accelerate training, and ADAM [13] is employed as learning rule instead of plain stochastic gradient descent: ADAM produces a smaller error rate (10.47%) than SGD (11.45%). The ADAM learning rule is reported in section 1.2.4.

**Results** The accuracy of BinaryConnect is reported from [13], which considers three different approaches and several datasets:

1. Use the resulting binary weights \( w_b \);

2. Use the real-valued weights \( w \) (binarized weights only help to reduce the training time);

3. Stochastic case: different networks can be obtained in this case and their accuracy can be computed by averaging the output of all of them.

Table 1.6: Resulting error rates and network structures used in [13]

<table>
<thead>
<tr>
<th>Method used</th>
<th>MNIST [%]</th>
<th>CIFAR-10 [%]</th>
<th>SVHN [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>No regularizer</td>
<td>1.30 ± 0.04</td>
<td>10.64</td>
<td>2.44</td>
</tr>
<tr>
<td>BinaryConnect(Det.)</td>
<td>1.29 ± 0.08</td>
<td>9.90</td>
<td>2.30</td>
</tr>
<tr>
<td>BinaryConnect(stoch)</td>
<td>1.18 ± 0.04</td>
<td>8.27</td>
<td>2.15</td>
</tr>
<tr>
<td>50% dropout</td>
<td>1.01 ± 0.04</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

Details

- Network structure:
  - 3 hidden layers
  - 1024 neurons
  - Exponential Decay rate

The meaning of the terms indicated in Table 1.6 is the following, from [13]:

- **C3**: ReLU convolutional layer with 3x3 size;
- **MP2**: Max pooling (2x2 size);
- **FC**: fully connected;
- **SVM**: Support vector machine, a supervised learning model.

1.2.4 A Ternary Weight Binary Input Convolutional Neural Network: Realization on the Embedded Processor [14]

**Introduction**

A ternary network has three possible weight values ({0, 1, −1}): the value "0" means that the computation can be skipped, improving the network speed. As already mentioned, this solution enables very low-power architectures suitable for an embedded system. Considering the CNN structure, the equation that determines the output of the convolutional layer is reported again, this time including the stride, for the case of a single input channel [16]:

\[
y_0(j,i) = \sum_{m=0}^{\#\text{rows(kernel)}-1} \sum_{t=0}^{\#\text{cols(kernel)}-1} k(m,t) \cdot x(j + m + j \cdot (\text{stride} - 1), i + t + i \cdot (\text{stride} - 1))
\]

(1.62)

Training methods for the CNN [14]

**Back-Propagation method** As discussed in subsection 1.1.4, backpropagation is the most widely used method and it computes the weights from the gradient of the loss function. By applying the chain rule already explained, the new values of the weights can be computed using stochastic gradient descent (SGD). [14] uses modified algorithms to support the weight update in ternary neural networks and compares them:

- **Adam**: learning rule based on the following equation taken from [14], in which the weights are updated as:

  \[
  w_{t+1} = w_t - \alpha \frac{E[\frac{\partial}{\partial w} \text{Loss}]}{\sqrt{E[\frac{\partial}{\partial w} \text{Loss}]^2} + \epsilon}
  \]

  (1.63)

  Where E is the mean and \(\alpha\) the learning rate.

- **AdaDelta**: the update rule of AdaDelta is given by [14]:

  \[
  h_t = \beta h_{t-1} + (1 - \beta) E \left[ \frac{\partial}{\partial w} \text{Loss} \right]^2
  \]

  (1.64)

  \[
  v_t = \frac{\sqrt{s_t + \epsilon}}{\sqrt{h_t + \epsilon}} E \left[ \frac{\partial}{\partial w} \text{Loss} \right]
  \]

  (1.65)

  \[
  s_{t+1} = \beta s_t + (1 - \beta) v_t^2
  \]

  (1.66)

  \[
  w_{t+1} = w_t - v_t
  \]

  (1.67)

**Batch normalization** Batch normalization speeds up the training and has to be considered in the neural network’s structure. In particular, taking into account the binary output of a neuron from [14]:

\[ Y = y^{(b)} = f^{(b)}_{\text{act}} \left( \sum_{i=1}^{N} w^{(b)}_{i} x^{(b)}_{i} \right) \]  

(1.68)

The activation function is sign(x). Applying batch normalization means adding an additional term inside the activation function of the neuron, as reported in [14]:

\[ Y' = y'^{(b)} = f^{(b)}_{\text{act}} \left( \gamma \frac{Y - \mu_B}{\sqrt{\sigma^2_B + \epsilon}} + \beta \right) \]  

(1.69)

Where \( \gamma, \mu_B, \sigma^2_B, \epsilon \) and \( \beta \) are the BN parameters: scaling factor, mean and variance of the batch considered, correction term and offset term, respectively. By performing the mathematical transformations used in [14], the batch normalization terms can be transformed into a bias as follows:

\[ Y' = f_{\text{act}} \left( \frac{\gamma}{\sqrt{\sigma^2_B + \epsilon}} \left( Y - \left( \mu_B - \frac{\sqrt{\sigma^2_B + \epsilon}}{\gamma} \beta \right) \right) \right) \]  

(1.70)

As performed in [14], since the output only takes the sign of its argument, the factor \( \frac{\gamma}{\sqrt{\sigma^2_B + \epsilon}} \) can be ignored (assuming \( \gamma > 0 \)), which gives:

\[ f'_{\text{sign}}(Y) = \begin{cases} 1 & \text{if } Y \geq \mu_B - \frac{\sqrt{\sigma^2_B + \epsilon}}{\gamma} \beta \\ -1 & \text{(otherwise)} \end{cases} \]  

(1.71)
The final equation that defines a neuron’s output (considering also that \( x_0 = 1 \)) is the following from [14]:

\[
Y_{\text{final}} = Y - \mu_B + \frac{\sqrt{\sigma^2_B + \epsilon}}{\gamma} \beta
\]  
(1.72)

\[
= \sum_{i=0}^{n} w_i x_i - \mu_B + \frac{\sqrt{\sigma^2_B + \epsilon}}{\gamma} \beta
\]  
(1.73)

\[
= \sum_{i=1}^{n} w_i x_i + \left( w_0 - \mu_B + \frac{\sqrt{\sigma^2_B + \epsilon}}{\gamma} \beta \right)
\]  
(1.74)

\[
= \sum_{i=1}^{n} w_i x_i + W'
\]  
(1.75)
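The equivalence between batch normalization followed by the sign function and a simple bias on the pre-activation can be checked numerically. The following is a minimal sketch under stated assumptions: \( \gamma > 0 \), sign(0) is taken as +1, and the parameter values are arbitrary illustrative numbers, not taken from [14].

```python
import numpy as np

def bn_then_sign(y, gamma, beta, mu, var, eps=1e-5):
    """Sign of the batch-normalized value (Equation 1.69)."""
    z = gamma * (y - mu) / np.sqrt(var + eps) + beta
    return np.where(z >= 0, 1, -1)

def folded_sign(y, gamma, beta, mu, var, eps=1e-5):
    """Sign of Y plus the folded bias -mu + sqrt(var + eps) / gamma * beta (Equation 1.72)."""
    bias = -mu + np.sqrt(var + eps) / gamma * beta
    return np.where(y + bias >= 0, 1, -1)

# Integer pre-activations, as produced by sums of +1/-1 products.
rng = np.random.default_rng(0)
y = rng.integers(-64, 65, size=1000).astype(float)
params = dict(gamma=0.8, beta=0.3, mu=1.7, var=4.0)  # illustrative values
same = np.array_equal(bn_then_sign(y, **params), folded_sign(y, **params))
```

For a negative \( \gamma \) the comparison would be reversed, so the folded form as written only holds for positive scaling factors.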

**Training the binary/ternary weights** Since SGD cannot be applied directly to binary weights, hidden weights \( w_{\text{hid}} \) are required: they are the real-valued (floating point) counterparts of the binarized weights. During the training phase only the hidden weights are updated, while the binary weights are used at inference. The same happens in ternary networks, where the only difference is the definition of the sign function, as already explained. To train a ternary network the AdaDelta algorithm is more suitable, because it generates a higher concentration of "0" weights, reducing the computational overhead and the power required by the CNN. However, [14] compares the AdaDelta and Adam optimizers in order to see the differences in terms of accuracy.
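The role of the hidden weights can be illustrated with a minimal sketch of one training step for a single linear neuron. Several choices here are assumptions made for illustration and are not taken from [14]: the gradient computed with the quantized weights is applied directly to the hidden weights (a straight-through style update), the hidden weights are clipped to [-1, 1], the loss is a squared error, and the ternary threshold is a free parameter.

```python
import numpy as np

def binarize(w_hid):
    """Binary weights: sign of the hidden weights (sign(0) taken as +1)."""
    return np.where(w_hid >= 0, 1.0, -1.0)

def ternarize(w_hid, threshold):
    """Ternary weights: 0 inside [-threshold, threshold], otherwise the sign."""
    return np.where(w_hid > threshold, 1.0, np.where(w_hid < -threshold, -1.0, 0.0))

def train_step(w_hid, x, target, lr=0.1, quantize=binarize):
    """Forward pass with quantized weights, update applied to the hidden weights."""
    w_q = quantize(w_hid)
    y = x @ w_q                          # output of a linear neuron
    grad = (y - target) * x              # gradient of 0.5 * (y - target)**2 w.r.t. w_q
    return np.clip(w_hid - lr * grad, -1.0, 1.0)

w_hid = np.array([0.3, -0.05, -0.6])
w_bin = binarize(w_hid)                  # weights used at inference (binary case)
w_ter = ternarize(w_hid, threshold=0.2)  # weights used at inference (ternary case)
w_new = train_step(w_hid, x=np.ones(3), target=1.0)
```

Only `w_hid` is stored in floating point during training; after training it can be discarded and only the quantized weights are kept.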

**Realization Ternary Weight Binary Input CNN on Embedded Processor**

**Binary 2D convolutional operation** Considering an architecture with multiple channels, the output of the convolutional layer is given by [16]:

\[
y_o^{(l)}(j,i) = b_0^{(l)} + \sum_{c=0}^{\#\text{channels} - 1} \sum_{m=0}^{\#\text{rows(kernel)} - 1} \sum_{t=0}^{\#\text{cols(kernel)} - 1} k_{o,c}^{(l)}(m,t)x_c^{(l)}(j + m + j(\text{stride} - 1),i + t + i(\text{stride} - 1))
\]  
(1.76)

Figure 1.23 shows the computation of \( y_o^{(l)}(j,i) \) at coordinates \((j,i)\) for the \( o \)-th OFMAP.
Figure 1.23: Example of the 2D convolutional operation for the ternary weight and binary input

Nx3x3 MAC operations are needed to compute each output value of a 2D convolution with a 3x3 kernel over N channels, so the larger the feature map, the more computation time is required [14].
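Equation 1.76 can be transcribed almost literally into code. The sketch below computes one OFMAP with explicit loops; the function name, the array layout (channels, rows, columns) and the absence of padding are assumptions made for illustration. The index \( j + m + j(\text{stride} - 1) \) of the equation is simply \( j \cdot \text{stride} + m \).

```python
import numpy as np

def conv2d_single_ofmap(x, k, bias, stride=1):
    """x: (C, H, W) binary input, k: (C, R, S) ternary kernel, returns one OFMAP."""
    n_ch, height, width = x.shape
    _, rows, cols = k.shape
    out_h = (height - rows) // stride + 1
    out_w = (width - cols) // stride + 1
    y = np.full((out_h, out_w), float(bias))
    for j in range(out_h):
        for i in range(out_w):
            for c in range(n_ch):
                for m in range(rows):
                    for t in range(cols):
                        # Same indexing as Equation 1.76
                        y[j, i] += k[c, m, t] * x[c, j + m + j * (stride - 1),
                                                  i + t + i * (stride - 1)]
    return y

x = np.ones((2, 5, 5))               # two input channels of +1 values
k = np.ones((2, 3, 3))
k[1] = 0                             # ternary kernel: second channel pruned to 0
y1 = conv2d_single_ofmap(x, k, bias=0, stride=1)
y2 = conv2d_single_ofmap(x, k, bias=1, stride=2)
```

With ternary weights, every kernel entry equal to 0 contributes nothing to the sum, which is why a low non-zero weight density reduces the number of useful MAC operations.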

Results

The results of the network are obtained by imposing an initial distribution of the weights (-1,0,1) of 2.5%:95%:2.5% respectively, in order to speed up the training, since most of the connections are set to 0 [14]. $\rho$ is set to 0.2 in Equation 1.5 and the dataset used is CIFAR-10 (50,000 images as training set and 10,000 as test set). Two optimization algorithms are used, as described before: Adam and AdaDelta. The architecture used to realize these networks is VGG16, reported in the following figure:
Figure 1.24: VGG16 architecture from [38]. It is composed of 16 layers and it is able to reach up to 70% top-1 and 90% top-5 recognition accuracy on ImageNet.

**Comparison Ternary Weight CNN with Binary One**  In order to compare binary and ternary networks, both of them have been trained with the Adam optimizer in [14]. The parameters evaluated are the error rate and the non-zero weight density, which gives an important indication of how many connections are present in the network. The ternary network has also been trained with AdaDelta, in order to see the main differences w.r.t. Adam. The results are reported here:

Table 1.7: Comparisons between networks from [14] on CIFAR-10 and VGG16

<table>
<thead>
<tr>
<th>Algorithm used</th>
<th>Network</th>
<th>Error rate [%]</th>
<th>Non-zero weight density [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Adam</td>
<td>Binary</td>
<td>19.6</td>
<td>100</td>
</tr>
<tr>
<td>Adam</td>
<td>Ternary</td>
<td>17.1</td>
<td>73.9</td>
</tr>
<tr>
<td>AdaDelta</td>
<td>Ternary</td>
<td>19</td>
<td>5.3</td>
</tr>
</tbody>
</table>
Table 1.7 shows the differences between the optimizers. With the Adam optimizer, the ternary network reaches a lower error rate than the binary one (17.1% against 19.6%) while keeping 73.9% of non-zero weights. If AdaDelta is considered, the error rate (19%) is similar to that of the binary network with only 5.3% of non-zero weights: a ternary network trained with AdaDelta achieves a result similar to a binary one with only 5.3% of active connections [14]. This is a very important result, which indicates the possibility of implementing very small, low-power embedded networks without losing accuracy with respect to the binary network.

Table 1.8: Comparison of required time on ARM Cortex-A53 1.2GHz and 1 GB DDR2 SDRAM from [14]

<table>
<thead>
<tr>
<th>Convolutional neural network</th>
<th>Time required [s]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Binary weight</td>
<td>6.103</td>
</tr>
<tr>
<td>Ternary weight</td>
<td>0.750</td>
</tr>
</tbody>
</table>
1.3 MTJ-Based BNN

An MTJ (magnetic tunnel junction) is a device composed of two ferromagnets separated by a thin insulator [6], through which electrons can flow by tunneling.

The magnetizations of the two ferromagnets determine the intensity of the current: if they are parallel, the current is higher ($R_p$: low resistance state), while if they are anti-parallel the current is lower ($R_{AP}$: high resistance state).

1.3.1 A Multilevel Cell STT-MRAM-Based Computing In-Memory Accelerator for Binary Convolutional Neural Network [15]

The classical Von Neumann implementation, in which memory (SRAM, for example) and computation are separated, has many problems in terms of delay and power consumption. The key idea is to replace SRAM with non-volatile (NV) memories such as MRAM, and in particular MLC-STT-MRAM: these are able to perform both memory and computing operations and could mitigate the Von Neumann bottleneck. MLC means that multiple bits can be stored in a single cell, and some logical computations can be performed inside the memory [15].
MLC-STT-MRAM

The goal of this implementation is to integrate two bits into a single memory cell. The following example from [15] shows a 2x2 array with the corresponding cell structure:

Figure 1.26: Cell structure and example of a 2x2 array from [15]. MSC stands for "modified sensing circuit": it performs some computations based on the current of the source/bit lines. The mode controller chooses which operation to perform, while the row decoder handles the word lines. In order to write into the MTJ, a current has to flow through it, in the direction shown here. If "1" has to be stored, the current has to magnetize the layers in parallel, resulting in a lower MTJ resistance (LRS), so a positive voltage is applied between BL and SL; otherwise, to store "0", the magnetizations must be antiparallel.
The cells have 4 different configurations \((R_{P-P}, R_{AP-P}, R_{P-AP}, R_{AP-AP})\) representing all the combinations given by two bits. The current \(I_{sl}\) has four possible values, since the two MTJs are different from each other \((R_{AP-P} > R_{P-AP})\). The modified sensing circuit is simply composed of a set of comparators that compare the incoming current \(I_{sl}\) with three different reference currents \(I_{ref1}, I_{ref2}, I_{ref3}\), with the following relation from [15]:

\[
I_{sl,11} > I_{ref1} > I_{sl,10} > I_{ref2} > I_{sl,01} > I_{ref3} > I_{sl,00}
\] (1.77)

With them, it is possible to realize some logic functions such as OR, NOR, XOR, NAND and AND.
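How three comparators are enough to read the two bits and to evaluate logic functions can be seen in a small behavioural sketch. The current values are hypothetical normalized numbers that only respect the ordering of Equation 1.77, and the mapping of OR, NOR and XOR onto the comparator outputs is an assumption derived from that ordering, not a description of the circuit in [15].

```python
# Hypothetical normalized source-line currents for the four 2-bit states
I_SL = {"11": 4.0, "10": 3.0, "01": 2.0, "00": 1.0}
# Reference currents placed between adjacent states (Equation 1.77)
I_REF1, I_REF2, I_REF3 = 3.5, 2.5, 1.5

def comparators(state):
    """Outputs of the three comparators for the given stored state."""
    i = I_SL[state]
    return i > I_REF1, i > I_REF2, i > I_REF3

def read(state):
    """Recover the two stored bits from the comparator outputs."""
    c1, c2, c3 = comparators(state)
    msb = int(c2)
    lsb = int(c1 or (c3 and not c2))
    return f"{msb}{lsb}"

def logic(state):
    """Logic functions of the two stored bits, obtained from the comparators."""
    c1, _, c3 = comparators(state)
    return {"AND": int(c1), "NAND": int(not c1),
            "OR": int(c3), "NOR": int(not c3),
            "XOR": int(c3 and not c1)}
```

For example, `logic("10")` gives AND = 0, OR = 1 and XOR = 1, since the current lies between \(I_{ref1}\) and \(I_{ref3}\).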

**Working mechanisms**

1. Write Mode: writing is performed in two steps. In the first step, a large current changes the state of the larger MTJ; in the second step, if needed, a small current modifies the state of the smaller one;

2. Read Mode: the source lines are connected to the comparators of the MSC, and the SL current is compared with the reference currents described above;

3. Logic Mode: the sense amplifiers can realize logical operations, as already mentioned. For example, \(I_{ref1}\) is the largest reference current: if \(I_{sl}\) is larger than \(I_{ref1}\), the cell is in the "11" configuration (parallel-parallel), which corresponds to the logical AND of the two stored bits;

4. Full-Adder mode: it is possible to implement a full adder by considering that:

\[
S_n = A_n \oplus B_n \oplus C_n
\] (1.78)

\[
C_{out} = (A_n \& B_n)\vert((A_n \oplus B_n)\& C_n)
\] (1.79)

Since the MSC is composed of comparators that realize basic logic functions, its structure can be extended in order to implement a full adder. This is done by adding three additional gates.
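Equations 1.78 and 1.79 can be verified exhaustively with a few lines of code. This is a purely logical sketch of the full-adder function, not a model of the MSC circuit; the function name is chosen for illustration.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Sum and carry-out as in Equations 1.78 and 1.79."""
    s = a ^ b ^ c_in
    c_out = (a & b) | ((a ^ b) & c_in)
    return s, c_out

# Truth table over all eight input combinations
truth_table = {bits: full_adder(*bits) for bits in product((0, 1), repeat=3)}
```

Each entry satisfies \( S_n + 2 C_{out} = A_n + B_n + C_n \), which is the property that makes the full adder usable as the building block of a bit counter.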
BCNN accelerator

In a binary neural network, the convolution is computed as the pop count of a bitwise operation between the binarized inputs and the binarized weights. In [15] this bitwise operation is an AND:

\[ I \ast W = \text{PopCount}(I(B) \& W(B)) \]  

(1.80)
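A minimal sketch of Equation 1.80 follows, with the bit vectors packed into Python integers. For comparison it also shows the XNOR-based pop count typical of XNOR-Net style networks, where bit 1 encodes +1 and bit 0 encodes -1; this encoding and the function names are assumptions made for illustration.

```python
def popcount_and(i_bits, w_bits):
    """Equation 1.80: number of positions where both bits are 1."""
    return bin(i_bits & w_bits).count("1")

def dot_pm1(i_bits, w_bits, n):
    """Dot product of two n-element {-1,+1} vectors encoded as bits (1 -> +1, 0 -> -1)."""
    mask = (1 << n) - 1
    xnor = ~(i_bits ^ w_bits) & mask     # 1 where the two bits are equal
    return 2 * bin(xnor).count("1") - n

i_bits, w_bits = 0b1011, 0b1101
and_count = popcount_and(i_bits, w_bits)
dot = dot_pm1(i_bits, w_bits, n=4)
```

The AND form counts the positions where both operands are 1, while the XNOR form directly gives the signed dot product of the +1/-1 vectors.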

The convolution of Equation 1.80 can be fully implemented by the architecture depicted in the following figure:

Figure 1.28: BCNN Accelerator from [15]. The logical computations are performed inside the memory array, while other intensive operations, such as batch normalization or scaling factors computations are performed outside the memory in a separate unit.

As can be seen, Batch normalization, Binary operation, Scaling factors, Multipliers and Pooling are implemented in a separate processing unit, which is not included in the CIM array [15]. The detailed calculation process is the following:

1. Batch normalization is performed on the inputs to reduce information loss;

2. Inputs and weights are binarized (sign);

3. Binarized inputs and weights are stored in the CIM array to perform in-memory computations. Since the weights of a CNN are shared, they are stored in the larger MTJ, while the inputs are stored in the smaller one. The colored lines in Figure 1.28 [15] have the following meanings:

   - Green line: represents the AND data flow, whose result, coming from the MSC, is written directly into the CIM array;
   - Orange line: represents the popcounting data flow, which uses the MSC full-add operation;

4. Tensors I and W are sent to the scaling factors unit to calculate $α$ and $K$ from subsection 1.2.2;

5. Convolutional results and scaling factors are delivered to the Multiplier to complete the convolutional layer. At the end of the chain, the results go into a Pooling layer, in which they are reduced.

**Experimental results**

This architecture is tested on the MNIST dataset and it is designed with 45nm CMOS parameters. The resulting energy consumption of this design is only 0.38 µJ, while the cycle time (entire convolution performed) is 27.24 ns. Since the structure realized is an XNOR-NET, the following table reports some useful results:
Table 1.9: Results of the XNOR-NET implementation on the architecture from [15]. The energy reported refers to a convolutional layer with number of kernels that indicates number of OFMAPs coming from the convolution.

<table>
<thead>
<tr>
<th>Layer</th>
<th>Number of kernels</th>
<th># convolution operations</th>
<th>Energy consumption per layer [µJ]</th>
</tr>
</thead>
<tbody>
<tr>
<td>C1</td>
<td>6</td>
<td>4704</td>
<td>0.278</td>
</tr>
<tr>
<td>C3</td>
<td>16</td>
<td>1600</td>
<td>0.094</td>
</tr>
<tr>
<td>C5</td>
<td>120</td>
<td>120</td>
<td>0.007</td>
</tr>
<tr>
<td>F6</td>
<td>84</td>
<td>84</td>
<td>0.0003</td>
</tr>
<tr>
<td>Total</td>
<td>-</td>
<td>-</td>
<td>0.38</td>
</tr>
</tbody>
</table>

1.3.2 Energy Efficient In-Memory Binary Deep Neural Network Accelerator with Dual-Mode SOT - MRAM [16]

Introduction

[16] employs an architecture based on an NVM system, in particular a SOT-MRAM solution that enables computation in memory with zero standby leakage and very high integration density.

In-memory processing platform

**SOT-MRAM** This kind of MRAM is based on the scheme represented in Figure 1.29.
As shown in Figure 1.29, the MTJ changes the magnetization of its free layer depending on the direction of the current. As a consequence, two states are possible: antiparallel and parallel. These two states, as already said, correspond to two resistances (HRS and LRS respectively). Figure 1.29 also shows the structure of the array cell, whose lines RBL, WWL, WBL and SL enable the in-memory operations, such as write, read and computation. Also here the data are fetched from the cells in the form of a current: the current on the RBL is fed to a current sense amplifier that gives the corresponding logical output.

**Memory write** [16] In order to write data into the cell, a write current has to be injected into the heavy metal substrate. The current has to flow from terminal 2 to terminal 3 of the MTJ (or vice versa), in order to obtain the different magnetization states (Figure 1.29). The operational steps are:

1. WWL is activated by the row decoder;
2. SL is grounded;
3. The voltage driver on the WBL is set to a positive (negative) voltage in order to obtain HRS (LRS).

**Memory read [16]** To perform a read operation, a read current has to flow from terminal 1 to terminal 3 of the MTJ. The operational steps are the following:

1. A sense voltage is generated in the SA, which is used to read the values from the cells;
2. The SA compares $V_{\text{sense}}$ (which is determined by the current on the RBL multiplied by the resistance of the MTJ) with $V_{\text{ref}}$, by selecting one of the possible enables;
3. The output of the SA is high when the path resistance is higher than the reference resistance ($R_{\text{ref}}$).

**Computing mode [16]** The computing mode is performed by selecting two or more rows. If multiple rows are selected, the equivalent resistance on the RBL is given by the parallel of the individual resistances of the selected cells. This equivalent resistance is then compared with a specific reference, which is selected by the enable signals (for instance $EN_{\text{OR}}, EN_{\text{AND}}$). The reference is chosen so that the SA output corresponds to the selected logic function. In particular:

- AND logic function: $R_{\text{ref}}$ is the midpoint of $R_P//R_{AP}$ and $R_{AP}//R_{AP}$, so the output is high only if both cell resistances are HRS (corresponding to "11");
- OR logic function: $R_{\text{ref}}$ is the midpoint of $R_P//R_P$ and $R_P//R_{AP}$, so the output is high if at least one cell resistance is HRS.
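The choice of the two references can be checked with a small numerical sketch. The resistance values are hypothetical and only serve to illustrate the ordering \( R_P//R_P < R_P//R_{AP} < R_{AP}//R_{AP} \); following the text, a stored "1" is modelled as HRS and the SA output is high when the sensed resistance is above the reference.

```python
R_P, R_AP = 3e3, 6e3   # hypothetical LRS and HRS values in ohm

def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

# References placed at the midpoints described in the text
R_REF_AND = (parallel(R_P, R_AP) + parallel(R_AP, R_AP)) / 2
R_REF_OR = (parallel(R_P, R_P) + parallel(R_P, R_AP)) / 2

def cell_resistance(bit):
    """A stored '1' is HRS (R_AP), a stored '0' is LRS (R_P)."""
    return R_AP if bit else R_P

def sense(a, b, r_ref):
    """SA output: 1 when the equivalent resistance of the two cells exceeds r_ref."""
    return int(parallel(cell_resistance(a), cell_resistance(b)) > r_ref)

and_table = {(a, b): sense(a, b, R_REF_AND) for a in (0, 1) for b in (0, 1)}
or_table = {(a, b): sense(a, b, R_REF_OR) for a in (0, 1) for b in (0, 1)}
```

With these values the three possible equivalent resistances are 1.5 kΩ, 2 kΩ and 3 kΩ, and the two references fall at 1.75 kΩ (OR) and 2.5 kΩ (AND).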

**BCNN accelerator [16]**

To demonstrate how well this architecture suits the computations inside a BCNN, [16] uses the AlexNet architecture, which has 5 convolutional layers and 3 fully connected layers. In particular, the variant adopted here is AlexNet BCNN, which is composed of 8 convolutional layers (no fully-connected layers are used), with the first and the last ones not binarized. Each convolutional layer consists of Batch Normalization, Scaling factor computation, Multiplier and Pooling (handled by an external DPU), and of Sign function, Binary-AND and Bitcount (handled by the CIM).

![Diagram of MTJ-Based BNN](image)

Figure 1.30: Inputs and weights are stored in the ImageBanks; the binary convolution is then computed by performing an in-memory AND logic operation followed by a bitcount. Source: [16]

Results [16]

**Energy, area, delay and Memory usage estimation** The computation energy, area, execution time and memory usage of different implementations of AlexNet BCNN/CNN on ImageNet dataset are tabulated:

Table 1.10: Memory usage of AlexNet DP (Double precision), SP and BCNN from [16]

<table>
<thead>
<tr>
<th>Implementation</th>
<th>Memory usage</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>AlexNet DP</td>
<td>476.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>AlexNet SP</td>
<td>238.2</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>AlexNet BCNN (this)</td>
<td>39.7</td>
<td>310.42</td>
<td>5.28</td>
<td>10.7</td>
</tr>
</tbody>
</table>
Regarding memory usage, there is a significant improvement w.r.t. AlexNet DP and AlexNet SP (12x and 6x respectively), because the binary architecture occupies less space in memory.

1.3.3 A Logic-in-Memory Design with 3-Terminal Magnetic Tunnel Junction Function Evaluators for Convolutional Neural Networks [17]

Introduction

[17] uses magnetic domain wall (DW) devices, which are able to perform logical and memory operations. The concept is based on a variable resistor that is read and written electrically, so that complex functions can be obtained if the device is designed appropriately. The variable resistor is an MTJ: when a DW moves in the free layer, the MTJ resistance $R_{MTJ}$ can assume one of many values between $R_P$ and $R_{AP}$, depending on the position of the DW. The reference technology is SOT, which has better motion properties than STT (the DW requires less current to move). The positions of the DW are discrete and limited, and they determine how many resistive values can be obtained from the MTJ function evaluator. If a sufficiently high resolution of resistive values is available, particular functions (activation functions such as the sigmoid or the hyperbolic tangent) can be obtained by using the MTJ as a function evaluator, avoiding the use of very complex digital circuits.

The MTJ function evaluator [17]

In [17] there is a mathematical explanation of how an MTJ can be designed to obtain a particular function in output, based on the DW motion. The resistance of an MTJ is defined as follows from [17]:

$$R_{MTJ} = R_P \left( \frac{x_0(I_{IN})}{L} \right) + R_{AP} \left( 1 - \frac{x_0(I_{IN})}{L} \right)$$ \hspace{1cm} (1.81)

Here $x_0(I_{IN})$ is the final position of the DW as a function of the input current $I_{IN}$, and $L$ is the length of the MTJ. [17] considers the domain wall velocity and derives the equation of the MTJ width, assuming also that the current is applied in a finite pulse of duration $t_0$. The resulting $w(x)$ equation is reported from [17]:

$$w(x_0) = \eta t_0 \left( \frac{dI}{dx_0} \right)$$

(1.82)

$I(x_0)$ is the inverse function of $x_0(I_{IN})$ and $d$ is the thickness of the MTJ. [17] uses a shifted sigmoid function, so the goal is to obtain a DW position that is proportional to the sigmoid. The way to do this is to find the width equation, which defines the shape of the MTJ. Consider the shifted sigmoid function from [17]:

$$x_0(I) = x_A \tanh \left( \frac{I - I_1}{I_2} \right) + x_B$$

(1.83)

By inverting the sigmoid of Equation 1.83 to obtain $I(x_0)$ and substituting its derivative into the width equation (Equation 1.82), an equation for $w(x)$ can be obtained. The MTJ designed in this way has a resistive behavior proportional to the shifted sigmoid.
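The derivation can be sketched numerically. Inverting Equation 1.83 gives \( I(x_0) = I_1 + I_2 \,\mathrm{artanh}\left( (x_0 - x_B)/x_A \right) \), whose derivative is \( dI/dx_0 = I_2 / \left( x_A (1 - u^2) \right) \) with \( u = (x_0 - x_B)/x_A \); Equation 1.82 then gives the width. All parameter values below are hypothetical normalized numbers, and the function names are chosen for illustration.

```python
import numpy as np

# Hypothetical normalized parameters (not taken from [17])
X_A, X_B, I_1, I_2 = 0.4, 0.5, 1.0, 0.5
ETA, T_0, LENGTH = 1.0, 1.0, 1.0
R_P, R_AP = 1.0, 2.0

def dw_position(i):
    """Equation 1.83: final DW position for an input current i."""
    return X_A * np.tanh((i - I_1) / I_2) + X_B

def current_for_position(x0):
    """Inverse of Equation 1.83."""
    return I_1 + I_2 * np.arctanh((x0 - X_B) / X_A)

def width(x0):
    """Equation 1.82 with the analytical derivative dI/dx0."""
    u = (x0 - X_B) / X_A
    return ETA * T_0 * I_2 / (X_A * (1.0 - u ** 2))

def r_mtj(x0):
    """Equation 1.81: resistance as a function of the DW position."""
    return R_P * x0 / LENGTH + R_AP * (1.0 - x0 / LENGTH)

x0 = dw_position(1.2)
w_at_x0 = width(x0)
```

The width grows towards the two ends of the allowed range of positions, where \( |u| \) approaches 1: there a larger change of current is needed to move the DW by the same distance.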

**Logic in memory system design** [17]

**Crosspoint array** The memory array is organized as a crosspoint composed of 1T1R cells. The output current coming from one column of the crosspoint array is the following:

$$I_j = \sum_i V_i G_{(i,j)}$$

(1.84)

The corresponding output voltage coming from the MTJ function evaluator is:

$$V_{OUT,j} = f(I_j)$$

(1.85)
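Equations 1.84 and 1.85 amount to a vector-matrix product followed by an element-wise function. The sketch below uses hypothetical conductance values for the two synaptic states and a tanh-shaped evaluator; the shape and the parameters of the evaluator are assumptions made for illustration.

```python
import numpy as np

G_P, G_AP = 1 / 3e3, 1 / 6e3    # hypothetical conductances of the two MTJ states [S]

def column_currents(v, g):
    """Equation 1.84: I_j = sum_i V_i * G[i, j]."""
    return v @ g

def evaluator(i, i_1, i_2):
    """Equation 1.85 with an assumed tanh-shaped function evaluator."""
    return np.tanh((i - i_1) / i_2)

g = np.array([[G_P, G_AP],
              [G_AP, G_P],
              [G_P, G_P]])       # 3 rows (inputs) x 2 columns (neurons)
v = np.array([0.1, 0.0, 0.1])    # input voltages on the rows
i_col = column_currents(v, g)
v_out = evaluator(i_col, i_1=5e-5, i_2=1e-5)
```

Each column therefore behaves as one neuron: the synaptic MTJs provide the weighted sum and the evaluator at the end of the column applies the activation function.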

The crosspoint array structure and architecture is defined in Figure 1.31:
Figure 1.31: Crosspoint array architecture from [17]; two different types of MTJs are used in [17]: the synaptic MTJs are the classical ones, with two possible values of resistances ($R_P$ and $R_{AP}$); while the thresholding MTJs are the ones discussed so far. The last MTJ (indicated by an arrow) acts as function evaluator and it implements the activation function of the neuron. This crossbar can be seen as an array of variable resistances.

The network can be larger, and this configuration allows the connection between multiple arrays simply by taking the output of the function evaluator MTJ of the previous array, without the need of using ADC/DACs, speeding up the system. Connections between CNN subarrays are programmed with multiplexers.

**Perceptron mode** [17] The steps to use the architecture in perceptron mode are:

1. The function evaluator MTJs are reset by imposing $RST_j = 1$ and a domain wall is injected with RSTP;

2. Both $RST_j = 0$ and $WL_i = 0$. On $RL_i$ there is an input voltage that allows the MTJ to set its resistance;

3. $BL_j = 0$: the output is passed to the next layer or, in the final cycle, the thresholding MTJ is sensed.

**Memory mode [17]** For the read operation:

1. One row is selected with $RL_i = 0$, while the others are set to 'Z';

2. $WL_i = 0$ is set;

3. The MTJ resistance is sensed on $BL_j$.

While for the write operation:

1. One row is written with $WL_i = 1$, $RL_i = Z$, $SL_j = 0$;

2. A DW current is injected with $BL_j$.

**Results**

The architecture has been implemented in [17] with a 45nm CMOS and magnetic tunnel junction process. This implementation saves up to 50x energy w.r.t. a CPU-based CNN. The results obtained with the architecture are presented below:

<table>
<thead>
<tr>
<th>Operations/Parameters</th>
<th>Feed forward operations</th>
<th>Memory write</th>
<th>Memory read</th>
</tr>
</thead>
<tbody>
<tr>
<td>Power</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Static [µW]</td>
<td>68.6</td>
<td>68.6</td>
<td>68.6</td>
</tr>
<tr>
<td>Dynamic [nW]</td>
<td>15.4</td>
<td>10.7</td>
<td>129000</td>
</tr>
<tr>
<td>Latency per layer[ns]</td>
<td>4</td>
<td>4</td>
<td>2</td>
</tr>
</tbody>
</table>

Table 1.11: Results for 2 convolutional layers from [17]
1.4 **RRAM Based**

RRAM is a non-volatile random-access memory that changes its resistance based on the voltage applied across it [47]. A dielectric can conduct if the applied voltage is sufficiently high to form a conduction path through it (dielectric breakdown), which in this case is temporary and reversible because of the materials used. Once the conduction path is formed, it may be reset (broken, resulting in high resistance) or set (re-formed, resulting in lower resistance) by another voltage.

These cells can be implemented in a 1T1R configuration as follows:

![1T1R configuration](image)

Figure 1.32: 1T1R configuration from [39]

Generally a V/2 scheme is adopted to avoid write disturbance: $V_{set}$ or $V_{rst}$ is applied across the selected cells, and $V_{set}/2$ or $V_{rst}/2$ is applied across all the unselected cells. For read operations, a voltage $V_{read}$ smaller than the threshold is applied to the selected cell, and the resulting current is compared with a reference to determine the output. An undesirable current, known as the sneak current, can flow because the cells are not isolated from each other: the 1T1R configuration reduces this drawback.
1.4.1 The application of Non-volatile Look-up-table Operations based on Multilevel-cell of Resistance Switching Random Access Memory [18]

Introduction

In the approach presented by [18], MLC cells are used, enabling multiple bits per cell. The resistance of the RRAM can be easily switched by varying the duration or the amplitude of the write voltage pulse. [18] discusses a ROM LUT-based implementation, in which the outputs are pre-stored and the input bits are used, through a decoder, as the address to access them. On this basis, [18] presents a novel approach to implement a multiplier, based on an in-memory LUT model.

Circuits design and implementation [18]

As shown in Figure 1.33, the cells are organized in a crossbar configuration.

Figure 1.33: Crossbar array cell’s organization from [18]. Each memory cell is a RRAM.
The circuit in Figure 1.34 is composed of:

- Row decoder: selects the specific row for read/write;
- LUT RAM (on the right): stores the pre-stored multiplication results;
- MLC RRAM: consists of a set of resistances which assume the following values:
  - "11" corresponds to 1kΩ;
  - "10" corresponds to 10kΩ;
  - "01" corresponds to 100kΩ;
  - "00" corresponds to 10MΩ.
- Programming block: programs the two crossbars (for example, it writes the values of the multiplier into the arrays);
- Write control circuit: handles the multiplier in two steps:
  1. It initializes the two crossbars to "00" (high impedance) and verifies the written values through the write-verify circuit;
  2. It writes data into the crossbars to accomplish a specific digital function, and verifies the input data through the write-verify circuit.

The output bits of the 2x2 multiplier are stored in the LUT on the right-hand side of Figure 1.34, and the 4-to-16 line decoder addresses the LUT. The input interface (at the bottom of Figure 1.34) contains inverters (which perform the logic inversion if needed), buffers and level shifters, because the voltage levels of an MLC RRAM are different from the CMOS ones. The row decoder is used to address the LUT: the columns of the crossbar array are used as inputs, while the rows are used as outputs that produce a read voltage. The interface circuit controls the read voltage on the selected address of the LUT, and the corresponding results pass through sense amplifiers and 4-2 level converters.
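The LUT principle can be sketched for the 2x2 multiplier. The two 2-bit operands form a 4-bit address, and each 4-bit product is stored in two 2-bit MLC cells using the resistance levels listed above. The packing of the product into a high and a low cell, and all the names used, are assumptions made for illustration and do not reproduce the circuit of [18].

```python
# Resistance levels of the MLC RRAM (from the list above), in ohm
RESISTANCE_OHM = {0b11: 1e3, 0b10: 10e3, 0b01: 100e3, 0b00: 10e6}
LEVEL_FROM_R = {r: level for level, r in RESISTANCE_OHM.items()}

def build_lut():
    """Pre-store all 16 products as pairs of MLC resistances (high cell, low cell)."""
    lut = []
    for address in range(16):
        a, b = address >> 2, address & 0b11
        product = a * b                          # at most 9, fits in 4 bits
        lut.append((RESISTANCE_OHM[product >> 2],
                    RESISTANCE_OHM[product & 0b11]))
    return lut

def multiply(a, b, lut):
    """4-to-16 decode of the address, then 2 bits are recovered from each cell."""
    r_high, r_low = lut[(a << 2) | b]
    return (LEVEL_FROM_R[r_high] << 2) | LEVEL_FROM_R[r_low]

LUT = build_lut()
result = multiply(3, 3, LUT)
```

No arithmetic is performed at read time: the multiplication reduces to one memory access, at the cost of a table that grows quickly with the operand width.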

Simulation

[18] implements three types of multiplier (4x4, 8x8 and 16x16 respectively) based on a 65nm CMOS process. A 1bit/cell multiplier has also been implemented, in order to compare the two approaches. The results are reported in the following table:

Table 1.12: Results of the LUT-based multiplier from [18], with different configurations.

<table>
<thead>
<tr>
<th>Multiplier</th>
<th>RRAM Type</th>
<th>Delay[ns]</th>
<th>Area[µm²]</th>
</tr>
</thead>
<tbody>
<tr>
<td>4x4</td>
<td>1bit/cell</td>
<td>1.01</td>
<td>166.87</td>
</tr>
<tr>
<td></td>
<td>2bit/cell</td>
<td>1.21</td>
<td>137.86</td>
</tr>
<tr>
<td>8x8</td>
<td>1bit/cell</td>
<td>1.03</td>
<td>738.23</td>
</tr>
<tr>
<td></td>
<td>2bit/cell</td>
<td>1.21</td>
<td>650.73</td>
</tr>
<tr>
<td>16x16</td>
<td>1bit/cell</td>
<td>1.06</td>
<td>2811.25</td>
</tr>
<tr>
<td></td>
<td>2bit/cell</td>
<td>1.24</td>
<td>2460.75</td>
</tr>
</tbody>
</table>
1.4.2 XNOR-RRAM: A Scalable and Parallel Resistive Synaptic Architecture for Binary Neural Networks [19]

Introduction

[19] proposes an RRAM-based architecture that implements the XNOR-Bitcounting operations, enabling the realization of very deep binary neural networks. Both an MLP and a CNN are implemented in [19], using the MNIST and CIFAR-10 datasets respectively. A novel architecture based on parallel reading is also employed in [19], which is more efficient than the sequential one. The structures of the networks used in [19] are the following:

- **MLP**: 784-512-512-512-10 (MNIST dataset). The accuracy drops from 99.0% for the floating point implementation to 98.77%;

- **CNN**: 6 convolutional layers and 3 fully connected layers (CIFAR-10 dataset). The accuracy is 88.47%, against 89.98% for the floating point implementation.

This architecture is implemented in a 65nm node.

RRAM Based Synaptic Array [19]

Figure 1.35 shows the cell structure, based on a 1T1R implementation.
In Figure 1.35, the weights are encoded as follows:

- -1: top/bottom cells are in HRS/LRS respectively;
- +1: reverse pattern.

For the WLs, instead, the following representations are used:

- -1: input pattern (0,1);
- +1: input pattern (1,0).

The output current coming from the cell depends on the input pattern and on the cell configuration. For example:

- The input vector is -1 (0,1);
- The selected cells have weight -1;
- The activated row is in LRS, causing a large current, which is seen as "1" (XNOR).

If multiple WLs are selected in parallel, the LRS cells dominate the bitline current, so $I_{BL}$ is proportional to the number of LRS cells in the column, which realizes the pop-counting. For example:

- The column length is 64;
- $I_{ref}$ of the sense amplifier corresponds to 32 activated LRS cells;
- If $I_{BL} < I_{ref}$ the output is -1, which implements the neuron's activation function.
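The behaviour of the complementary cell pair and of the thresholded bitline can be reproduced with an idealized sketch. The currents are normalized (an activated LRS cell contributes 1, an HRS cell contributes 0), which is an assumption that neglects the HRS leakage and any sense amplifier offset.

```python
import numpy as np

I_LRS, I_HRS = 1.0, 0.0          # idealized, normalized cell currents

def cell_current(x, w):
    """Current of one complementary 1T1R pair for input x and weight w in {-1, +1}."""
    # Weight +1: (top, bottom) = (LRS, HRS); weight -1: (HRS, LRS)
    top, bottom = (I_LRS, I_HRS) if w == 1 else (I_HRS, I_LRS)
    # Input +1 activates the top word line, input -1 the bottom one
    wl_top, wl_bottom = (1, 0) if x == 1 else (0, 1)
    return wl_top * top + wl_bottom * bottom

def neuron(xs, ws, n_ref):
    """Bitline current (pop count of XNOR) and thresholded output."""
    i_bl = sum(cell_current(x, w) for x, w in zip(xs, ws))
    return (1 if i_bl >= n_ref * I_LRS else -1), i_bl

rng = np.random.default_rng(1)
xs = rng.choice([-1, 1], size=64)
ws = rng.choice([-1, 1], size=64)
out, i_bl = neuron(xs, ws, n_ref=32)
```

With a column of 64 cells and a reference of 32 LRS cells, the output equals the sign of the +1/-1 dot product between inputs and weights, since that dot product is twice the number of matches minus 64.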

Two approaches can be used to realize the previous functions. In the sequential approach, only one WL is activated at a time: during the read, $V_{BL} = GND$ and the current sense amplifier injects a current on the bitline that is compared with $I_{ref}$. This procedure is repeated for all the rows of the crossbar array, so MAC units and registers are needed, and the final sum is sent to a comparator that generates the digital output. In the parallel architecture, a WL switch matrix activates multiple WLs simultaneously, according to the input vector.

The most important component of the parallel architecture is the current sense amplifier, because it can be affected by an unwanted offset that degrades the sensing pass rate [19]. The problem gets worse as the bitline current increases. The design of the current sense amplifier has to consider that the offset can completely change the output of a neuron and, consequently, produce a wrong result. With an array size of 512x512, the accuracy is only 15.04%. One way to reduce the offset problem is to divide the initial array into subarrays, in order to reduce the current $I_{BL}$, and to perform a non-linear quantization on the partial sums. All of these considerations are discussed in detail in [19].

**Array Partitioning** After array partitioning, each subarray generates a partial sum that has to be added to the others. Each partial sum has to be very precise, because it affects the accuracy of the final sum: an ADC-like MLSA produces the partial sums in fixed point. Another important parameter is the number of bits of the MLSA, which heavily influences the accuracy: a 2-bit MLSA yields more than 98% accuracy for the MLP on MNIST when the sub-array dimensions are 32x32 or 64x64.
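The effect of partitioning a long column into subarrays with a coarse MLSA can be sketched as follows. The quantizer below is uniform, which is a simplifying assumption: [19] uses a non-linear quantization whose levels are not reported here. The function names are chosen for illustration.

```python
import numpy as np

def quantize(count, size, bits):
    """Uniform MLSA model: map a pop count in [0, size] to the centre of one of 2**bits bins."""
    levels = 2 ** bits
    step = size / levels
    index = min(int(count // step), levels - 1)
    return (index + 0.5) * step

def partitioned_popcount(matches, sub_size, bits):
    """Sum of the quantized partial pop counts of consecutive subarrays."""
    partial = matches.reshape(-1, sub_size).sum(axis=1)
    return sum(quantize(c, sub_size, bits) for c in partial)

rng = np.random.default_rng(2)
matches = rng.integers(0, 2, size=512)   # 1 where input and weight agree (XNOR = 1)
exact = int(matches.sum())
approx = partitioned_popcount(matches, sub_size=64, bits=3)
```

With 8 subarrays of 64 rows and a 3-bit MLSA, each partial sum carries an error of at most half a quantization step (4 counts), so the error on the final sum is bounded by 32 counts; more bits or better placed levels reduce it.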

Benchmark results on MNIST and CIFAR-10 [19]

[19] also reports area, latency and energy efficiency for each subarray size:
Table 1.13: Parameters of the different architectures from [19]

<table>
<thead>
<tr>
<th>Subarray size</th>
<th>MLSA bit-width</th>
<th>Area [$mm^2$]</th>
<th>Latency [ns]</th>
<th>TOPS/W</th>
</tr>
</thead>
<tbody>
<tr>
<td>64x64</td>
<td>3</td>
<td>0.0832</td>
<td>12.7</td>
<td>81.79</td>
</tr>
<tr>
<td>128x128</td>
<td>3</td>
<td>0.047</td>
<td>13.69</td>
<td>141.18</td>
</tr>
</tbody>
</table>

**MNIST** In this part the benchmark results obtained in [19] are analyzed. In particular, considering the variations of the MLSA offset and of the RRAM cell resistance (Gaussian distribution with a mean of 200kΩ and a standard deviation of 3kΩ, from [19]), it is possible to demonstrate that the sensing pass rate is small when the bit-counting value is close to a sensing reference. When the bit-counting value is far enough from a sensing reference, the pass rate can reach 100%. The accuracy results on the MNIST dataset from [19] are reported below:

Table 1.14: **MNIST**-based implementations results from [19]

<table>
<thead>
<tr>
<th>Implementation</th>
<th>Sub-array size</th>
<th>MLSA Bit level</th>
<th>Network structure</th>
<th>Dataset</th>
<th>Accuracy[%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>XNOR-RRAM</td>
<td>64x64</td>
<td>2</td>
<td>MLP</td>
<td>MNIST</td>
<td>95.81</td>
</tr>
<tr>
<td>XNOR-RRAM</td>
<td>64x64</td>
<td>3</td>
<td>MLP</td>
<td>MNIST</td>
<td>98.56</td>
</tr>
<tr>
<td>XNOR-RRAM</td>
<td>128x128</td>
<td>3</td>
<td>MLP</td>
<td>MNIST</td>
<td>98.43</td>
</tr>
<tr>
<td>BNN Algorithm</td>
<td>-</td>
<td>-</td>
<td>MLP</td>
<td>MNIST</td>
<td>98.77</td>
</tr>
<tr>
<td>NN Algorithm (FP)</td>
<td>-</td>
<td>-</td>
<td>MLP</td>
<td>MNIST</td>
<td>99</td>
</tr>
</tbody>
</table>

For the 64x64 and 128x128 cases, the total number of subarrays used is 136 and 36 respectively [19]. With these data, it is possible to determine the total area:

Table 1.15: Total area of the **MLP** based on **MNIST** from [19]

<table>
<thead>
<tr>
<th># employed</th>
<th>Subarray size</th>
<th>MLSA bit-width</th>
<th>Area [$mm^2$]</th>
</tr>
</thead>
<tbody>
<tr>
<td>136</td>
<td>64x64</td>
<td>3</td>
<td>11.315</td>
</tr>
<tr>
<td>36</td>
<td>128x128</td>
<td>3</td>
<td>1.686</td>
</tr>
</tbody>
</table>
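
As a cross-check of the table above, the total area can be estimated as the number of subarrays multiplied by the per-subarray area of Table 1.13. The sketch below assumes that no additional peripheral area is added; the function and constant names are illustrative only. For the 128x128 case the product of the rounded per-subarray area gives about 1.69 mm², slightly above the tabulated 1.686 mm², presumably because Table 1.13 reports a rounded value.

```python
# Per-subarray area in mm^2, copied from Table 1.13.
AREA_PER_SUBARRAY_MM2 = {64: 0.0832, 128: 0.047}


def total_area_mm2(num_subarrays: int, subarray_size: int) -> float:
    """Total area as (number of subarrays) x (area of one subarray).

    Assumption: no extra routing or peripheral overhead is included.
    """
    return num_subarrays * AREA_PER_SUBARRAY_MM2[subarray_size]


if __name__ == "__main__":
    for count, size in [(136, 64), (36, 128)]:
        area = total_area_mm2(count, size)
        print(f"{count} subarrays of {size}x{size}: {area:.3f} mm^2")
```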
**CIFAR-10** The results on the CIFAR-10 dataset are reported in the following table:

Table 1.16: Results on CIFAR-10-Based implementations (CNN) from [19]

<table>
<thead>
<tr>
<th>Implementation</th>
<th>Sub-array size</th>
<th>MLSA Bit level</th>
<th>Network structure</th>
<th>Dataset</th>
<th>Accuracy [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>XNOR-RRAM</td>
<td>64x64</td>
<td>3</td>
<td>CNN</td>
<td>CIFAR-10</td>
<td>86.12</td>
</tr>
<tr>
<td>XNOR-RRAM</td>
<td>128x128</td>
<td>3</td>
<td>CNN</td>
<td>CIFAR-10</td>
<td>86.08</td>
</tr>
<tr>
<td>CNN Algorithm</td>
<td>-</td>
<td>-</td>
<td>CNN</td>
<td>CIFAR-10</td>
<td>88.47</td>
</tr>
<tr>
<td>CNN FP</td>
<td>-</td>
<td>-</td>
<td>CNN</td>
<td>CIFAR-10</td>
<td>89.98</td>
</tr>
</tbody>
</table>

The detailed structure of the convolutional neural network is the following:

Table 1.17: CNN structure from [19]

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th># IFMAP</th>
<th># OFMAP</th>
<th>kernel size</th>
<th># subarrays 64x64</th>
<th># subarrays 128x128</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Convol</td>
<td>3</td>
<td>128</td>
<td>3x3</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>2</td>
<td>Convol</td>
<td>128</td>
<td>128</td>
<td>3x3</td>
<td>36</td>
<td>9</td>
</tr>
<tr>
<td>3</td>
<td>Convol</td>
<td>256</td>
<td>256</td>
<td>3x3</td>
<td>72</td>
<td>18</td>
</tr>
<tr>
<td>4</td>
<td>Convol</td>
<td>256</td>
<td>256</td>
<td>3x3</td>
<td>144</td>
<td>36</td>
</tr>
<tr>
<td>5</td>
<td>Convol</td>
<td>256</td>
<td>512</td>
<td>3x3</td>
<td>288</td>
<td>72</td>
</tr>
<tr>
<td>6</td>
<td>Convol</td>
<td>512</td>
<td>512</td>
<td>3x3</td>
<td>576</td>
<td>144</td>
</tr>
<tr>
<td>7</td>
<td>F.Conn</td>
<td>8192</td>
<td>1024</td>
<td>-</td>
<td>2048</td>
<td>512</td>
</tr>
<tr>
<td>8</td>
<td>F.Conn</td>
<td>1024</td>
<td>1024</td>
<td>-</td>
<td>256</td>
<td>64</td>
</tr>
<tr>
<td>9</td>
<td>F.Conn</td>
<td>1024</td>
<td>10</td>
<td>-</td>
<td>16</td>
<td>8</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>2436</td>
<td>863</td>
</tr>
</tbody>
</table>
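
The subarray counts in the table above are consistent with a simple tiling rule: the weight matrix of a layer has (# IFMAP × kernel area) rows and (# OFMAP) columns, and it is cut into square tiles of the subarray size. This rule is an assumption made here for illustration (it is not stated explicitly in the text), and the sketch only checks it against layers 2, 6, 7 and 9; the names are hypothetical.

```python
from math import ceil


def subarrays_needed(n_ifmap: int, n_ofmap: int, kernel: int, size: int) -> int:
    """Tiles of size x size needed for an unrolled binary weight matrix.

    Assumption: rows = n_ifmap * kernel^2, columns = n_ofmap, and partially
    filled tiles still occupy a whole subarray.
    """
    rows = n_ifmap * kernel * kernel
    cols = n_ofmap
    return ceil(rows / size) * ceil(cols / size)


# (name, # IFMAP, # OFMAP, kernel side); fully connected layers use kernel = 1.
LAYERS = [
    ("layer 2 (conv)", 128, 128, 3),
    ("layer 6 (conv)", 512, 512, 3),
    ("layer 7 (fc)", 8192, 1024, 1),
    ("layer 9 (fc)", 1024, 10, 1),
]

if __name__ == "__main__":
    for name, n_in, n_out, k in LAYERS:
        print(name,
              subarrays_needed(n_in, n_out, k, 64),
              subarrays_needed(n_in, n_out, k, 128))
```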

The parameters related to this network are reported in the following table:

Table 1.18: Parameters on the CNN based on CIFAR-10 from [19]

<table>
<thead>
<tr>
<th># employed</th>
<th>Subarray size</th>
<th>MLSA bit-width</th>
<th>Area [mm²]</th>
</tr>
</thead>
<tbody>
<tr>
<td>2436</td>
<td>64x64</td>
<td>3</td>
<td>202.67</td>
</tr>
<tr>
<td>863</td>
<td>128x128</td>
<td>3</td>
<td>40.5</td>
</tr>
</tbody>
</table>

In both cases (MNIST and CIFAR-10) the best solution is the 128x128 array with a 3-bit MLSA; for CIFAR-10 in particular, the energy efficiency is 141.18 TOPS/W [19].
1.4.3 MAGIC-Memristor-Aided Logic [20]

Introduction

A memristor is a device whose resistance changes depending on the current flowing through it. Since a memristor has two distinct resistance states, these can be used as logic "1" and "0" to realize logic functions: the state in which the current is higher (low resistance) is considered logic "1", the other one logic "0".

![Memristor behavior](image)

Figure 1.36: Memristor behavior from [20] depending on the current flow direction.

A NOR Gate can be implemented as shown in the following figure:

![NOR Gate](image)

Figure 1.37: NOR Gate with memristors from [20]
Considering Figure 1.37, the two parallel memristors are the inputs, while the last one (on the right) is the output. The operations to perform a NOR can be summarized as follows:

1. The output memristor is initialized to low resistance (logic 1);

2. A voltage $V_0$ is applied. If both input memristors are in the high-resistance state (input "00"), the current flowing through the output memristor is not sufficient to change its state, so it remains in low resistance (logic 1). If at least one input is logic 1 (01, 10 or 11), the current exceeds the threshold of the output memristor, which therefore changes its state to logic 0.

For the input combination "00", the voltage across the output memristor has to be lower than $V_{T,OFF}$; in the other cases, it has to be higher than this value. Moreover, when an input memristor is "0", the applied voltage $V_0$ could in the meantime switch that input to logic "1". To avoid this, $V_0$ has to be lower than the threshold voltage $V_{T,ON}$: $V_0$ has to be designed properly, as indicated in [20].
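
The two steps above can be illustrated with a purely resistive voltage-divider model. This is only a behavioral sketch: the resistance values, $V_0$ and the threshold are hypothetical numbers chosen so that the conditions described above hold, and they are not taken from [20].

```python
R_ON = 1e3      # low-resistance state, logic 1 (hypothetical value, ohms)
R_OFF = 100e3   # high-resistance state, logic 0 (hypothetical value, ohms)
V0 = 1.0        # applied voltage (hypothetical value, volts)
V_T_OFF = 0.3   # switching threshold of the output memristor (hypothetical)


def resistance(bit: int) -> float:
    """Resistance of a memristor storing the given logic value."""
    return R_ON if bit else R_OFF


def magic_nor(in1: int, in2: int) -> int:
    """Behavioral NOR: two parallel input memristors in series with the output."""
    r1, r2 = resistance(in1), resistance(in2)
    r_inputs = r1 * r2 / (r1 + r2)           # parallel combination of the inputs
    r_out = R_ON                             # step 1: output initialized to logic 1
    v_out = V0 * r_out / (r_inputs + r_out)  # voltage across the output memristor
    # Step 2: the output switches to high resistance only above the threshold.
    return 0 if v_out > V_T_OFF else 1


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, magic_nor(a, b))
```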

**In-Memory structure**

These memristors can be placed inside a crossbar array in order to be integrated in an in-memory solution.

![Figure 1.38: Memristor-based crossbar array: configuration for NOR logic gate from [20]](image-url)
In Figure 1.38, the memristors are organized in a crossbar array. The NOR function is implemented by applying the voltage $V_0$ to the inputs and GND to the output. The result of the NOR function is a current that flows through the row lines into a dedicated analog circuit, which translates the current into a logic state. [20] reports the delay of the NOR operation, which depends on $V_0$: for $V_0 = 1V$, it is equal to 1.3 ns, considering the slowest computation. This approach also enables the realization of other logic gates.

### 1.4.4 Mixed-precision architecture based on computational memory for training deep neural networks [21]

**Introduction**

[21] proposes a mixed approach based on a crossbar array of memristors and a high-precision digital unit, which is able to perform both in-memory and high-precision computations. The latter are useful in the training phase, in which a high degree of precision is required and the weights are therefore not binarized. The weights in the crossbar array are modified by changing the resistances by means of programming pulses.

**The architecture**

The architecture of the system is presented in the following figure from [21]:

![Diagram of mixed-precision architecture](image)

Figure 1.39: Principle scheme of the mixed precision architecture. Source: [21]
In Figure 1.39, the crossbar array on the right performs multiplications and stores the weights and on the left the high precision computational unit trains the neural network. The working principle is the following:

1. The inputs (neurons’ activations) are fed to the crossbar array from the high-precision unit and converted into analog voltages by means of DACs;

2. The crossbar array performs the computation: each column outputs a current equal to the sum of the products between the weights stored in that column (as conductances) and the input voltages;

3. The currents are then converted into digital values by means of ADCs. The resulting digital vector is the result of the computation.
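
The three steps above can be sketched as an idealized matrix-vector product with quantized inputs and outputs. The uniform quantizer, the voltage range and the choice of the ADC full scale are assumptions made for illustration; only the 8-bit default resolution corresponds to the value chosen in [21].

```python
def quantize(x: float, bits: int, full_scale: float) -> float:
    """Idealized uniform DAC/ADC on the range [0, full_scale]."""
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), full_scale)
    return round(x / full_scale * levels) / levels * full_scale


def crossbar_mvm(activations, conductances, bits=8, v_max=1.0, i_max=None):
    """Return the digitized column currents I_j = sum_i V_i * G[i][j]."""
    n_rows, n_cols = len(conductances), len(conductances[0])
    # Step 1: DAC, activations in [0, 1] become voltages in [0, v_max].
    volts = [quantize(a * v_max, bits, v_max) for a in activations]
    # Step 2: each column sums the currents of its cells (Kirchhoff's law).
    currents = [sum(volts[i] * conductances[i][j] for i in range(n_rows))
                for j in range(n_cols)]
    # Step 3: ADC. The full-scale current is a free design choice here
    # (assumption: the largest current a column could ever produce).
    if i_max is None:
        i_max = v_max * max(sum(col) for col in zip(*conductances))
    return [quantize(i, bits, i_max) for i in currents]


if __name__ == "__main__":
    print(crossbar_mvm([1.0, 0.0], [[1.0, 2.0], [3.0, 4.0]]))
```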

The same crossbar can be used to perform the backpropagation, in which the errors are converted into voltages. All the considerations made in subsection 1.1.4 are still valid, so the same learning rule is applied in [21]. Since the conductances are subject to variations, a device is updated only when the accumulated weight update reaches a multiple of the smallest conductance change that can be reliably achieved [21].
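
The update rule described above can be sketched as follows: the digital unit accumulates the weight updates in high precision and programs the device only by whole multiples of the smallest reliable step. The class name, the value of `EPSILON` and the ideal (noise-free) device response are assumptions made for illustration.

```python
EPSILON = 0.25  # smallest reliable weight change per pulse (hypothetical value)


class MixedPrecisionWeight:
    """One synaptic weight with a high-precision update accumulator."""

    def __init__(self, weight: float = 0.0):
        self.weight = weight  # coarse value stored in the crossbar device
        self.chi = 0.0        # high-precision accumulator in the digital unit

    def accumulate(self, delta_w: float) -> int:
        """Accumulate an update and return the number of pulses applied."""
        self.chi += delta_w
        pulses = int(self.chi / EPSILON)  # whole multiples, truncated toward 0
        if pulses != 0:
            # Assumption: an ideal device changes by exactly EPSILON per pulse.
            self.weight += pulses * EPSILON
            self.chi -= pulses * EPSILON
        return pulses


if __name__ == "__main__":
    w = MixedPrecisionWeight()
    for delta in (0.125, 0.125, -0.5):
        print(w.accumulate(delta), w.weight, w.chi)
```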

**Neural network structure** The MLP network in [21] is built for the MNIST dataset. Its structure is 784-250-10 and the inputs are 28x28 images. The hidden layer and the output layer use the sigmoid as activation function. The NN is trained for 10 epochs and the corresponding floating-point implementation (64 bit) achieves an accuracy of **98%** with SGD.

**Inaccuracies** This architecture is affected by several sources of inaccuracy. The conductances do not have precise values, because they depend on the physical properties of the material used; in particular, they are subject to granularity, stochasticity and an asymmetric conductance response. The weight update process is heavily influenced by the conductances, and the computations for both the output classification and the weight updates depend on these variations: as mentioned before, the crossbar array is also used to compute the multiplications of the backpropagation. The noise (in particular the read noise) is another important factor that has to be considered, because it degrades the accuracy [21]. The DAC/ADC inaccuracies also influence the system behavior: in particular, choosing a low resolution leads to very low accuracy (about 50% from [21]). In [21], an 8-bit resolution is chosen for both DACs and ADCs, in order to avoid degradation. The architecture has been trained and tested taking into account all these variations: after 10 training epochs, it reaches an accuracy of 97.78%.

1.4.5 A hardware neural network for handwritten digits recognition using binary RRAM as synaptic weight element [22]

Introduction

[22] proposes a binary neural network based on RRAM devices, which implements a 784-10 MLP network for handwritten digit recognition (MNIST dataset). The network is realized as a resistive crossbar array, in which the columns are the outputs and the rows the inputs. The architecture achieves 81% accuracy with a custom training procedure, instead of the classical SGD based one.

Network structure

The network structure is the classical crossbar array, in which the input is an MNIST image that has been vectorized (from 28x28 to 784x1). The greyscale input values are mapped to the voltage range (0; 0.1 V), and the outputs are currents which are compared with each other: the output class is given by the column with the maximum current. The following figure reports the structure from [22].
Training strategy [22]

The memristors modify their resistance according to the applied voltage: if the voltage is positive and larger than a certain threshold $V_{T,\text{on}}$, the memristor switches to LRS and so the bit stored is a zero (Set). Otherwise, if the applied voltage is negative and lower than $V_{T,\text{off}}$, the resistance is reset to HRS. To train this network, the recognition result is considered: if it is not correct, the column that produced the (wrong) result and the column of the correct class are "stochastically reset" and "stochastically set" respectively, in order to decrease the current in the "recognition result" node and to increase the current in the expected node. The "stochastic reset" process is performed by applying a sweeping, increasing voltage to the output node, in order to generate a negative voltage across the memristors involved. As soon as any memristor is reset, the applied voltage is removed. In [22], the threshold voltages of the memristors were chosen randomly.
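
The training strategy described above can be sketched in software as follows. This is a strongly simplified model and not the procedure of [22]: the voltage sweep with random thresholds is replaced by the random choice of one cell among those driven by a non-zero input, and all names are hypothetical.

```python
import random

# Labels for the two resistance states (no logic value is implied here).
LRS, HRS = "LRS", "HRS"


def column_currents(inputs, weights):
    """Current of column j, modeled as the sum of inputs over its LRS cells."""
    n_cols = len(weights[0])
    return [sum(v for v, row in zip(inputs, weights) if row[j] == LRS)
            for j in range(n_cols)]


def classify(inputs, weights):
    """The recognized class is the column with the maximum current."""
    currents = column_currents(inputs, weights)
    return currents.index(max(currents))


def train_step(inputs, label, weights, rng):
    """On a wrong result, reset one cell of the winning column and set one
    cell of the expected column, both chosen at random (simplification)."""
    predicted = classify(inputs, weights)
    if predicted == label:
        return
    active = [i for i, v in enumerate(inputs) if v > 0]
    can_reset = [i for i in active if weights[i][predicted] == LRS]
    can_set = [i for i in active if weights[i][label] == HRS]
    if can_reset:
        weights[rng.choice(can_reset)][predicted] = HRS  # "stochastic reset"
    if can_set:
        weights[rng.choice(can_set)][label] = LRS        # "stochastic set"


if __name__ == "__main__":
    w = [[LRS, HRS], [LRS, HRS], [HRS, LRS]]
    x = [0.1, 0.1, 0.0]
    print(column_currents(x, w))
    train_step(x, 1, w, random.Random(0))
    print(column_currents(x, w))
```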

Results

Table 1.19: Results from [22]. When more than one array is used, the recognition result improves: the arrays are used in parallel and the output is evaluated in the same way explained before.

<table>
<thead>
<tr>
<th>Total images</th>
<th>Training images</th>
<th>Dataset</th>
<th>Array size</th>
<th># arrays</th>
<th>Accuracy [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>60000</td>
<td>10000</td>
<td>MNIST</td>
<td>784x10</td>
<td>20</td>
<td>81</td>
</tr>
<tr>
<td>60000</td>
<td>10000</td>
<td>MNIST</td>
<td>784x10</td>
<td>1</td>
<td>62</td>
</tr>
</tbody>
</table>
The robustness of the network with respect to variations of Vset/Vreset has also been evaluated in [22], which shows that the network still works well when 50% of the RRAM cells are disabled.

1.4.6 Challenges of emerging memory and memristor based circuits: Nonvolatile logics, IoT security, deep learning and neuromorphic computing [23]

Introduction

[23] explores the NVM technology and its real applications and proposes a comparison between different realizations (such as RRAM, ReRAM, Memristors, PCM, STT etc). [23] has been reported in this section because it provides very interesting considerations on memory technologies.

Write voltages of emerging NVM

Figure 1.41 shows all recent emerging NVM technologies which are better in terms of performance than Flash:

![Figure 1.41: Write voltages of different technologies. Source: [23]](image-url)
The analyzed parameters are the write voltage and the write time and, as can be seen, only three kinds of NVM fall inside the flash area. Resistive memories are therefore very good in terms of energy efficiency, because the required write voltage and write time are lower than those of flash memories.

**NvLogics: non-volatile computational units**

In a classical system, when the power is turned off, the logic circuits have to move their data to an NVM, in order to keep them saved for the next operations. The systems discussed in [23] instead use local NVM inside the computational part: since the new technologies based on RRAM, MTJs, etc. enable fast data writing at low power consumption, switching a circuit off and on is not so expensive. Another important parameter is the resistance ratio of the resistive memories: if it is small, there is no evident difference between the on and off states, so circuits able to sense small resistance differences have to be employed, considering also the presence of sneak currents between cells and of other leakage currents. These parasitic effects are reduced by using solutions such as 1T1R cells or similar. The following figure reports a simplified implementation of a 3-2 network with RRAMs in 1T1R configuration from [23]: the weights are stored in the RRAMs and the products are performed by the array itself, so by applying the binary inputs V0, V1, V2 the corresponding word lines are enabled and the current flowing through each bitline is sensed by a sense amplifier, which generates the binary output.
But there are some problems that have to be solved [23]:

1. High performance and MLC cells with low power consumption have not been reached yet;

2. Parasitic currents (like sneak current) are still present also with 1T1R configuration;

3. Since the computation in the array is based on an analog approach, a good interface between the array itself and CPU is needed.
1.5 SRAM based

This section discusses solutions based on SRAM implementations.

1.5.1 In-Memory Area-Efficient Signal Streaming Processor Design for Binary Neural Networks [24]

Introduction and architecture

[24] proposes an in-memory architecture based on BNNs, so XNOR and bit-counting operations are performed. The implementation uses the concept of synapse configuration table (SCT), which is explained in the following parts.

The NN depicted in Figure 1.43 has 3 input activations (named $A_{11}, A_{12}, A_{13}$) and 2 output activations ($A_{21}, A_{22}$). There are also some numbers reported next to the output neurons (in this case +2 and -1): these represent the bias values that have to be added to the neuron’s function to obtain the corresponding output activation. In Figure 1.43, the network is not fully connected, so a general representation of these kinds of networks is needed. By considering ternary networks, the NN in Figure 1.43 can be implemented by performing some transformations illustrated in the same figure. The steps to compute a neuron’s output are the following:

1. XNOR bitwise operation to compute the products between input-weight;

2. The sum in a BNN is computed by a pop-counting operation:

$$\text{PopCount} = \text{number\_of\_1s} - \text{number\_of\_0s} \tag{1.86}$$

3. Biases are added with the pop-count;

4. Activation function: the output of a neuron is the sign of the previous calculations.
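
The four steps above can be sketched for a single neuron as follows. The bit encoding (1 for +1, 0 for -1) and the handling of a zero sum (mapped to 1) are assumptions made for illustration; the function name is hypothetical.

```python
def binary_neuron(activations, weights, bias=0):
    """Output bit of a binary neuron; bits encode +1 as 1 and -1 as 0."""
    # Step 1: bitwise XNOR gives 1 where input and weight agree.
    xnor = [1 - (a ^ w) for a, w in zip(activations, weights)]
    # Step 2: pop-count as in Equation 1.86.
    ones = sum(xnor)
    popcount = ones - (len(xnor) - ones)
    # Step 3: the bias is added to the pop-count.
    total = popcount + bias
    # Step 4: sign activation (assumption: a zero sum is mapped to 1).
    return 1 if total >= 0 else 0


if __name__ == "__main__":
    print(binary_neuron([1, 0, 1], [1, 1, 1]))
    print(binary_neuron([1, 0, 1], [1, 1, 1], bias=-2))
```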

Figure 1.43: An example of a 3-2 BNN from [24] and its transformation into a fully connected configuration. The synapse configuration table is reported, indicating the meaning of the connections. The fully connected network has been implemented considering also the bias and mask signals. At the end, the three pop-counting results are added together and the sign of the result is taken, which defines the output.

An implementation of these steps has been represented in Figure 1.43. For each neuron, there is a set of (weight,bias,mask) bits that determines the meaning of the connection and the corresponding value of the weight to be multiplied with the input activation. In the example in Figure 1.43, the SCT is the following [24]:

<table>
<thead>
<tr>
<th>Connection</th>
<th>Line</th>
<th>Bias</th>
<th>Mask</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>Normal</td>
<td>—</td>
<td>0</td>
<td>0</td>
<td>$W \times A$</td>
</tr>
<tr>
<td>Masked</td>
<td>—</td>
<td>0</td>
<td>1</td>
<td>No conn</td>
</tr>
<tr>
<td>Biased</td>
<td>—</td>
<td>1</td>
<td>0</td>
<td>$W \times A - 1$</td>
</tr>
<tr>
<td></td>
<td>—</td>
<td>1</td>
<td>1</td>
<td>$W \times A + 1$</td>
</tr>
</tbody>
</table>
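
The meaning of the (bias, mask) bits in the table above can be expressed as a small decoding function. This is only an illustrative reading of the table: the ±1 encoding of weights and activations and the value 0 used for a masked connection are assumptions.

```python
def sct_contribution(weight: int, activation: int, bias: int, mask: int) -> int:
    """Contribution of one connection according to its (bias, mask) bits.

    weight and activation are in {-1, +1}; bias and mask are the SCT bits.
    """
    product = weight * activation
    if bias == 0 and mask == 0:
        return product        # normal connection: W x A
    if bias == 0 and mask == 1:
        return 0              # masked: no connection (assumed to add 0)
    if bias == 1 and mask == 0:
        return product - 1    # biased connection: W x A - 1
    return product + 1        # bias = 1, mask = 1: W x A + 1


if __name__ == "__main__":
    for bias_bit in (0, 1):
        for mask_bit in (0, 1):
            print(bias_bit, mask_bit, sct_contribution(1, -1, bias_bit, mask_bit))
```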
As can be seen in Table 1.20, the input activations are arranged on the rows, while the outputs are on the columns. If this SCT is implemented in an SRAM, the rows correspond to the addresses, while the columns correspond to the bitlines: if one row is accessed at a time, only one input at a time can be processed. This SCT configuration is in fact called OPNE (output parallel neural engine) [24]: the inputs are given serially, while the outputs are generated simultaneously when the scan over all the inputs has finished. Moreover, if the network is extended to a 3-2-3 structure, the hidden layer takes the inputs (coming from the previous layer) in parallel, so a new SCT configuration has to be considered. In this case the synapse configuration table is called IPNE (input parallel neural engine) [24]:

Table 1.21: IPNE SCT from [24]

<table>
<thead>
<tr>
<th>A21</th>
<th>A22</th>
</tr>
</thead>
<tbody>
<tr>
<td>Weight</td>
<td>Bias</td>
</tr>
<tr>
<td>A31</td>
<td>Weight</td>
</tr>
<tr>
<td>A32</td>
<td>Weight</td>
</tr>
<tr>
<td>A33</td>
<td>Weight</td>
</tr>
</tbody>
</table>

The inputs now address multiple columns and only one output is provided at a time. After an IPNE layer (which takes inputs in parallel and gives outputs serially), an OPNE (which takes inputs serially and provides outputs in parallel) can be connected without any interface circuitry [24].

**General case** The OPNE and IPNE configurations can also be used with NNs different from the 3-2 example discussed before: they can be extended to a general case with H outputs/inputs. In particular, an OPNE takes 1 bit serially and produces H outputs in parallel, while an IPNE takes H inputs in parallel and gives 1 output serially.
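
The two access patterns can be sketched functionally as follows. This is only an illustration of the serial/parallel data flow: weights and activations are encoded as ±1, bias and mask bits are omitted, a zero sum is mapped to +1, and all names are hypothetical.

```python
def opne(inputs, weights):
    """Output-parallel engine: one input per cycle, H accumulators in parallel.

    weights[l][h] is the +/-1 weight from input l to output h.
    """
    n_out = len(weights[0])
    acc = [0] * n_out
    for a, row in zip(inputs, weights):   # one SRAM row is read per cycle
        for h in range(n_out):            # performed in parallel in hardware
            acc[h] += a * row[h]
    return [1 if s >= 0 else -1 for s in acc]   # H outputs available at once


def ipne(inputs, weights):
    """Input-parallel engine: all H inputs at once, one output per cycle.

    weights[n][h] is the +/-1 weight from input h to output n.
    """
    for row in weights:                   # one output neuron per cycle
        s = sum(a * w for a, w in zip(inputs, row))   # adder tree
        yield 1 if s >= 0 else -1


if __name__ == "__main__":
    # 3-2-3 example: an OPNE (3 inputs, 2 outputs) followed by an IPNE.
    hidden = opne([1, -1, 1], [[1, -1], [1, 1], [-1, 1]])
    print(hidden, list(ipne(hidden, [[1, 1], [-1, 1], [-1, -1]])))
```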

**Additional details [40]**

In this part, some additional details are presented from [40], which implements the same architecture but explains it in more detail.

**Batch normalization** Batch normalization is an essential element of binary/ternary neural networks in order to obtain a high accuracy with extremely approximated weights [40]. Following section 1.2.4 of the introduction, the sign activation function can be applied to the batch normalization formula as follows:

\[
\hat{Y} = \text{sign} \left( \gamma \left( \frac{Y - \mu}{\sigma} \right) + \beta \right) \text{ from [40]} \tag{1.87}
\]

Where:

- \(Y\) is the weighted sum between \(W\) and activations (output of a neuron without activation function applied);
- \(\mu, \sigma^2\) are mean and variance of \(Y\) (over all input images);
- \(\gamma, \beta\) are scaling and offset factors

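As done in section 1.2.4 in the ternary network explanation, and assuming $\gamma > 0$ (so that dividing the argument of the sign by $\gamma/\sigma$ does not change the result), Equation 1.87 can be transformed into: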

\[
\hat{Y} = \text{sign} \left( Y + \left( -\mu + \frac{\sigma}{\gamma} \beta \right) \right) = \text{sign} (Y + \text{bias}) \text{ from [40]} \tag{1.88}
\]

The bias value is added after the pop-counting, requiring an additional space of memory for the bias term.
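
The equivalence between Equation 1.87 and Equation 1.88 can be checked numerically. The sketch below assumes $\gamma > 0$ and $\sigma > 0$; the numerical values of the parameters are arbitrary and chosen only for illustration.

```python
def sign(x: float) -> int:
    """Sign function returning +1 or -1 (zero is mapped to +1 here)."""
    return 1 if x >= 0 else -1


def bn_sign(y, mu, sigma, gamma, beta):
    """Equation 1.87: sign of the batch-normalized weighted sum."""
    return sign(gamma * (y - mu) / sigma + beta)


def folded_sign(y, mu, sigma, gamma, beta):
    """Equation 1.88: batch normalization folded into a single bias term."""
    bias = -mu + sigma * beta / gamma
    return sign(y + bias)


if __name__ == "__main__":
    params = dict(mu=1.5, sigma=2.0, gamma=0.5, beta=-0.25)  # arbitrary values
    for y in range(-4, 5):
        print(y, bn_sign(y, **params), folded_sign(y, **params))
```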

**Computation cycle** Considering a network with \(L \times H\) size (where \(L\) is the number of rows in the SRAM and \(H\) is the number of outputs which an OPNE produces per time), from [40]:

- OPNE produces a result after \(L+1\) cycles;
- IPNE can then start computing and, in the meantime, OPNE can fetch new data;
- IPNE produces a result after only 1 cycle (when all the inputs are available from OPNE), and this output is used immediately by the OPNE of the next layer.

Results

In [24], an OPNE-IPNE pair is considered as a PIM. Some parameters are reported:

<table>
<thead>
<tr>
<th>#PIMS</th>
<th>6</th>
</tr>
</thead>
<tbody>
<tr>
<td>H</td>
<td>144</td>
</tr>
<tr>
<td>L</td>
<td>484</td>
</tr>
<tr>
<td>Frequency [MHz]</td>
<td>400</td>
</tr>
<tr>
<td>Peak performance [GSOPS]</td>
<td>691</td>
</tr>
<tr>
<td>#neurons</td>
<td>3768</td>
</tr>
<tr>
<td>#synapses</td>
<td>836000</td>
</tr>
<tr>
<td>Power consumption [W]</td>
<td>0.6</td>
</tr>
<tr>
<td>Area [mm²]</td>
<td>3.9</td>
</tr>
<tr>
<td>Energy efficiency [TSOPS/W]</td>
<td>1.2</td>
</tr>
<tr>
<td>Area Efficiency [TSOPS/mm²]</td>
<td>0.177</td>
</tr>
</tbody>
</table>
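
The derived figures of the table above can be cross-checked with a few lines of arithmetic. The formula used for the peak performance (each OPNE and each IPNE performing H synapse operations per cycle) is an assumption made here because it reproduces the tabulated value; it is not stated in the text.

```python
N_PIMS = 6        # number of OPNE-IPNE pairs
H = 144           # parallel width of each engine
FREQ_HZ = 400e6   # clock frequency
POWER_W = 0.6     # power consumption
AREA_MM2 = 3.9    # area

# Assumption: every cycle each OPNE and each IPNE performs H synapse operations.
peak_sops = N_PIMS * 2 * H * FREQ_HZ
energy_eff_tsops_per_w = peak_sops / 1e12 / POWER_W
area_eff_tsops_per_mm2 = peak_sops / 1e12 / AREA_MM2

if __name__ == "__main__":
    print(f"Peak performance:  {peak_sops / 1e9:.1f} GSOPS")
    print(f"Energy efficiency: {energy_eff_tsops_per_w:.2f} TSOPS/W")
    print(f"Area efficiency:   {area_eff_tsops_per_mm2:.3f} TSOPS/mm^2")
```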

In the table, the term SOPS indicates “synapse operations per second”, where a synapse operation is simply a multiplication and an addition. H is the number of parallel inputs of an IPNE, while L is the number of words in an SRAM array. Since there are 6 PIMs, the network structure is the following: 484-144-484-144-484-144-484-144-484(10). The critical path of this architecture is in the IPNE adder tree, since all the computations are performed in parallel. An additional implementation of a CNN has been analyzed by [40], in which the structure used is the following:

Table 1.23: Accuracy results of a CNN implementation from [40]

<table>
<thead>
<tr>
<th># Layers</th>
<th>Type</th>
<th>Kernel sizes</th>
<th>Stride</th>
<th># output channels</th>
<th>OFMAP size</th>
<th># of OFMAP</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Input</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>22x22</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>Conv</td>
<td>5x5</td>
<td>4</td>
<td>4</td>
<td>6x6</td>
<td>4</td>
</tr>
<tr>
<td>2</td>
<td>Conv</td>
<td>5x5</td>
<td>4</td>
<td>12</td>
<td>6x6</td>
<td>12</td>
</tr>
<tr>
<td>3</td>
<td>F.Conn</td>
<td>432-144</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>4</td>
<td>F.Conn</td>
<td>144-10</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

The accuracy reported for this network on the MNIST dataset is 80%.

1.5.2 Deep learning consideration with novel approach - look-up-table based processing conjugated memory [25]

Introduction

The MLCS solution is able to implement some operations in memory, because it combines SRAM with look-up tables. A given memory cell can be viewed either as LUT logic or as a simple memory element. If the LUT functionality is considered, the result of the operation is simply obtained by using the inputs as addresses, so no computation is performed with this approach.

Typical structure

The typical structure used to implement an in-memory neural network is illustrated in Figure 1.44 from [25]:

The image is fed to the array on the left, while the weights are precharged from the upper side, so that the multiplication is performed by the array. This operation is repeated for all the cells with an operating frequency of 700 MHz. An MLCS is a simple unit based on an SRAM (256 words x 8 bits, from [25]) which is used as a LUT: it is addressed in order to output the result of a specific logic function. In particular, it is able to perform a multiplication between two 4-bit numbers ($2^8 = 256$ words).
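
The LUT-based multiplication can be sketched as a 256-word table in which the two 4-bit operands form the 8-bit address. The way the operands are concatenated into the address is an assumption made for illustration; the largest product (15 × 15 = 225) fits in the 8-bit word.

```python
# 256 words x 8 bits: address = (a << 4) | b, stored word = a * b.
# Assumption: the first operand occupies the upper 4 address bits.
LUT = [(addr >> 4) * (addr & 0xF) for addr in range(256)]


def lut_multiply(a: int, b: int) -> int:
    """Multiply two 4-bit numbers with a single table access."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4-bit unsigned integers")
    return LUT[(a << 4) | b]


if __name__ == "__main__":
    print(lut_multiply(7, 9), lut_multiply(15, 15))
```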

**Structure of MLCS for DL**

Taking as an example a 16x16 input image (256 pixels), all the inputs have to be connected to the second layer, so 256 multiply-accumulate operations are required. If the clock is 700 MHz and the 256 operations are executed one after the other, the effective speed is reduced to 700 MHz / 256 ≈ 2.7 MHz. By using 256 parallel units, the maximum frequency of 700 MHz per layer can be achieved, also with low power, because the LUT-based multiplier is only addressed without any calculation.
Rough Performance Estimation

These results have been taken from another paper ([48]) that is focused on the same approach. The correctness of these results is not guaranteed: they are reported only as a reference.

**Power Consumption [48]** A comparison between pure logic vs FPGA vs LUT approach is reported in the following Table 1.24:

Table 1.24: Relative power results from [48]

<table>
<thead>
<tr>
<th>Power</th>
<th>Pure logic</th>
<th>MLCS</th>
<th>FPGA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Relative ratio</td>
<td>1</td>
<td>0.05</td>
<td>0.1</td>
</tr>
</tbody>
</table>

The power of the MLCS is about one twentieth of that of conventional pure logic, because a single address access to the LUT memory is enough for the calculation.

**Processing speed [48]** The maximum SRAM frequency is fixed to 1GHz, due to SRAM wiring penalties. The pure logic approach with pipeline achieves 4GHz. Here is reported a comparison among different architectures:

Table 1.25: Speed comparison from [48]

<table>
<thead>
<tr>
<th>Band frequency</th>
<th>Pure logic (8/64bits)</th>
<th>8bits</th>
<th>64bits</th>
</tr>
</thead>
<tbody>
<tr>
<td>Speed</td>
<td>4GHz</td>
<td>1GHz</td>
<td>1GHz</td>
</tr>
<tr>
<td></td>
<td></td>
<td>500MHz</td>
<td>250MHz</td>
</tr>
</tbody>
</table>

**Area Comparison [48]** This comparison is done on an 8-bit multiplier. The LUT-based SRAM circuit needs 4096 SRAM cells ($0.5\,\mu m^2$ per memory cell [48]), for a total of about $2048\,\mu m^2$, i.e. $2.05 \times 10^3\,\mu m^2$ (TSMC 65nm). An SRAM memory cell is often 1/3 smaller than a logic gate, but $4 \times$ as many memory cells are needed to build the LUT, plus the overheads coming from registers, I/Os, etc. [48]. As a consequence, the area of the SRAM-LUT is 7 times larger than that of pure logic.
1.5.3 A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm [26]

Introduction

[26] proposes an implementation of a neurosynaptic core based on an SRAM crossbar array. The architecture is event-based, which corresponds to the brain’s way of computing. A realistic neuron model is implemented in [26], in which synapses (connections between neurons), axons and neuron cores are integrated in memory: the chip has **256 digital neurons** and 1024 rows (axons), so the array dimensions are 1024x256.

<table>
<thead>
<tr>
<th>Network structure</th>
<th>Area ( [mm^2] )</th>
<th>Technology</th>
<th>Energy per spike ( [pJ] )</th>
</tr>
</thead>
<tbody>
<tr>
<td>1024x256</td>
<td>4.2</td>
<td>45nm SOI</td>
<td>45</td>
</tr>
</tbody>
</table>

**Architecture specification** [26]

In Figure 1.45, the neurosynaptic core is composed of \( K \) axons (rows), \( K \times M \) synapses and \( M \) neurons. The blue circles indicate the intersections between axons and columns, which represent the weights. At the end of each column there is a neuron, indicated by the red box. At each time instant \( t \), an activity bit \( A_j(t) \) indicates whether a particular neuron fired at the previous time \( t-1 \). Each axon has an associated value \( G_j \), which indicates the type of connection (0, 1, 2), for example inhibitory or excitatory [26]. The synapse value of a neuron \( i \) is indicated as \( S_i^{G_j} \) in [26], so a neuron’s input is defined as in [26]:

\[
A_j(t) \times W_{ji} \times S_i^{G_j} \tag{1.89}
\]
The membrane potential of neuron \( i \) is updated as in [26], where:

- \( V_i(t) \): membrane potential;
- \( L_i \): leak;
- \( \theta \): threshold.

\[
V_i(t + 1) = V_i(t) + L_i + \sum_{j=1}^{K} \left[ A_j(t) \times W_{ji} \times S_i^{G_j} \right] \tag{1.90}
\]

When \( V(t) \) is higher than \( \theta \), the neuron produces a spike and its membrane potential is reset to 0.
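
The membrane update of Equation 1.90 and the spike condition can be sketched for a single neuron as follows. In this simplified model the threshold check is performed at every update step; the argument names and the example values are hypothetical.

```python
def update_neuron(v, leak, theta, activity, weights_col, syn_values, axon_types):
    """One time step of Equation 1.90 for a single neuron i.

    activity[j]    : A_j(t), 0 or 1
    weights_col[j] : W_ji, 0 or 1 (crossbar bit)
    syn_values[g]  : S_i^g, signed synapse value for axon type g
    axon_types[j]  : G_j, type of axon j
    Returns the new membrane potential and the spike bit.
    """
    v += leak + sum(a * w * syn_values[g]
                    for a, w, g in zip(activity, weights_col, axon_types))
    if v > theta:
        return 0, 1   # the potential is reset to 0 and a spike is produced
    return v, 0


if __name__ == "__main__":
    syn = {0: -2, 1: 3}   # hypothetical inhibitory and excitatory values
    print(update_neuron(0, -1, 3, [1, 1, 0], [1, 1, 1], syn, [1, 1, 0]))
    print(update_neuron(2, 0, 3, [1, 0, 0], [1, 1, 1], syn, [0, 1, 1]))
```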

**Implementation [26]**

The following figure reports the architecture of the neurosynaptic core:

![Figure 1.45: Structure of the neurosynaptic core from [26]](image)

The communication between the blocks of the architecture is event-driven, and therefore clockless. In order to correctly synchronize all the operations, handshake signals have been implemented. All the neurons are implemented as stand-alone elements: no multiplexed structure has been used in [26], so that all the computations are performed in parallel. The steps that the architecture executes in the processing flow are the following:

1. The addresses are fed to the crossbar one at a time. The corresponding row is activated and the connections (weights) and the type of connection ($G_j$) are read;
2. All the connections of type 1 are sent to the neuron, which performs the membrane update in Equation 1.90;
3. Once all the neurons are updated, the address read procedure is finished;
4. Every time 1 ms has passed (after the completion of the steps described so far), a Sync signal is sent to the neurons, which check whether the membrane potential is higher than $\theta$. If so, the membrane potential is reset to 0 and a spike is produced (logic "1" coming from the corresponding neuron).

**Results**

The results and some useful parameters from [26] are reported:

Table 1.28: Network parameters for 1024x256 crossbar array dimensions from [26]

<table>
<thead>
<tr>
<th>Parameters</th>
<th>1024x256</th>
</tr>
</thead>
<tbody>
<tr>
<td>Network structure</td>
<td>1024x256</td>
</tr>
<tr>
<td># Transistors</td>
<td>3.8 million</td>
</tr>
<tr>
<td># Neurons</td>
<td>256</td>
</tr>
<tr>
<td>Neuron’s area [$\mu$m$^2$]</td>
<td>3325</td>
</tr>
<tr>
<td>Bitcell area [$\mu$m$^2$]</td>
<td>1.3</td>
</tr>
<tr>
<td>Delay [ms/img]</td>
<td>1</td>
</tr>
<tr>
<td>Vdd [V]</td>
<td>0.85</td>
</tr>
<tr>
<td>Energy per spike [pJ/spike]</td>
<td>45</td>
</tr>
<tr>
<td>Worst case energy [pJ]</td>
<td>11520</td>
</tr>
</tbody>
</table>
The accuracy results from [26], considering a network structure of 484x256, are the following:

Table 1.29: Accuracy results from [26]

<table>
<thead>
<tr>
<th>Accuracy test</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dataset</td>
<td>MNIST</td>
</tr>
<tr>
<td># neurosynaptic cores</td>
<td>2 (excitatory and inhibitory)</td>
</tr>
<tr>
<td>Network structure</td>
<td>484x256</td>
</tr>
<tr>
<td># Training images</td>
<td>50000</td>
</tr>
<tr>
<td># Test images</td>
<td>10000</td>
</tr>
<tr>
<td>Accuracy [%] (neurosynaptic core)</td>
<td>89</td>
</tr>
<tr>
<td>Accuracy [%] (real-value weights)</td>
<td>94</td>
</tr>
</tbody>
</table>

The network is realized with 2 neurosynaptic cores of 484x256 which are configured with excitatory and inhibitory \( G_j \) bits.
1.6 DRAM Based

1.6.1 XNOR-POP: A processing-in-memory architecture for binary Convolutional Neural Networks in Wide-IO2 DRAMs [27]

Introduction

[27] proposes a novel architecture based on DRAM, which is able to implement an XNOR-NET: the XNOR operations are performed inside the memory and their results are transferred through TSVs to the logic layer, in which the population count is computed. TSVs enable power saving and a reduction of the wire length and, consequently, of the delay.

![Architecture proposed by [27]. Source: [27]](image_url)

The architecture is depicted in Figure 1.46: each DRAM layer (8 Gb) has 8 channels of 64 bits and each channel has 4 banks [27].

**XNOR-NET CNN** The XNOR-NET CNN has the following building blocks:
The output of the convolution is obtained as:

\[ Y_{\text{conv}} = (I \otimes W) \cdot K \alpha \tag{1.91} \]

As already mentioned, the batch normalization applied to a XNOR-NET can be reduced simply into the following equation:

\[
y_{(\text{batch})} = \begin{cases} 
1, & \text{if } x \geq \mu - \frac{\beta}{\gamma \sqrt{\sigma^2 + \varepsilon}} \\
0, & \text{otherwise}
\end{cases} 
\]  

(1.92)

So a simple comparator can be used.

**Binary Convolution: XNOR-Popcount**

**XNOR-Dram** The structure of a bank is reported in the following figure from [27]:

---

Figure 1.47: Building blocks of a XNOR-NET from [27]
The functional steps are the following from [27]:

1. At the beginning, all lines are precharged to 1/2 Vdd;

2. WL is activated: the local sense amplifier senses the difference between Local bit line and $\overline{\text{Local bit line}}$;

3. Cell content is restored by Local sense amplifier. The local bit lines are attached to global bit lines through switches.

A XNOR operation is performed by an additional block inserted after the global sense amplifier. The operational steps to compute the XNOR of $A$ and $B$, $\overline{A \oplus B}$, are the following from [27]:

1. $A$ and $\overline{A}$ are fetched from the subarray 0 and memorized in Global sense amplifier;

2. Global sense amplifier/Sub0 connection is detached;

3. Local sense amplifier charges subarray 1;

4. $B$ is read from subarray 1 and sent to the XNOR engine;

5. The connection between XNOR engine/global sense amplifier is attached and a result is produced and memorized;

6. XNOR/Subarray 1 are disconnected from global bit lines;
7. Local sense amplifier precharges subarray 1 again;

8. Global sense amplifier precharges the global bitlines.

The banks are organized so that the input is stored in subarray 0 and the corresponding weight in subarray 1. According to [27], the total latency of this operation is 128ns, which can be reduced to 78ns when the loop unrolling technique is used. The results produced in the DRAM are sent through TSVs to the logic die, which performs the popcount with the unit adopted in [49]: two such units are required, to count 1s and 0s respectively. For pooling, a 16-bit comparator is used; the matrix $K$ and $\alpha$ are also computed in the pooling phase.
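The XNOR and popcount steps described above can be sketched in software. This is a minimal functional sketch, not the circuit of [27]: the bit encoding (1 for +1, 0 for -1) and the function names are assumptions made for illustration.

```python
def xnor_popcount_dot(inputs, weights):
    """Dot product of two {-1,+1} vectors using XNOR and popcount.

    Assumed encoding: bit 1 stands for +1, bit 0 stands for -1.
    """
    if len(inputs) != len(weights):
        raise ValueError("vectors must have the same length")
    # XNOR is 1 where the two bits agree, i.e. where the product is +1.
    xnor = [1 - (a ^ b) for a, b in zip(inputs, weights)]
    ones = sum(xnor)            # first counter: number of 1s
    zeros = len(xnor) - ones    # second counter: number of 0s
    return ones - zeros


def bipolar_dot(inputs, weights):
    """Reference result computed directly on the +1/-1 values."""
    return sum((2 * a - 1) * (2 * b - 1) for a, b in zip(inputs, weights))


I = [1, 0, 1, 1, 0, 0, 1, 0]
W = [1, 1, 0, 1, 0, 1, 1, 0]
print(xnor_popcount_dot(I, W), bipolar_dot(I, W))
```

Both functions return the same value, which is why the multiply-accumulate of a binary convolution can be replaced by bitwise logic followed by counting.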

Results

At the beginning, the architecture has to fetch the weights and arrange them in the banks in order to perform all the operations explained so far. If the network is very deep, the weights can occupy a lot of memory. The following table reports the results from [27], comparing the accuracy of each floating point network with that of its XNOR-Net implementation:
Table 1.30: Accuracy and performance results of the architecture with different neural network models. Source:[27]

<table>
<thead>
<tr>
<th>Network used</th>
<th>Dataset</th>
<th>Structure</th>
<th>Accuracy (FP) [%]</th>
<th>Accuracy XNOR [%]</th>
<th>Frames per second</th>
</tr>
</thead>
<tbody>
<tr>
<td>LeNet-5</td>
<td>MNIST</td>
<td>Layer 1 to 3 Conv</td>
<td>99.1</td>
<td>97.2</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Layer 4 to 5 Pooling</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Layer 6 Fully connected</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MLP</td>
<td>MNIST</td>
<td>1 to 5 Fully connected</td>
<td>98.5</td>
<td>96.9</td>
<td>-</td>
</tr>
<tr>
<td>CNP</td>
<td>MNIST</td>
<td>1 to 3 Conv</td>
<td>97</td>
<td>96.1</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td></td>
<td>4 to 5 Pooling</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>6 Fully connected</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SCNN</td>
<td>MNIST</td>
<td>1 to 2 Conv</td>
<td>99</td>
<td>97.8</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td></td>
<td>3 to 4 Fully connected</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MCDNN</td>
<td>MNIST</td>
<td>1 to 3 Conv</td>
<td>96.8</td>
<td>95.7</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td></td>
<td>4 to 6 Pooling</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>7 to 9 Fully connected</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AlexNet</td>
<td>ImageNet</td>
<td>1 to 5 Conv</td>
<td>80.2</td>
<td>69.2</td>
<td>3390</td>
</tr>
<tr>
<td></td>
<td></td>
<td>6 to 8 Pooling</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>9 to 10 Fully connected</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ResNet-18</td>
<td>ImageNet</td>
<td>1 to 18 Conv</td>
<td>89.2</td>
<td>73.2</td>
<td>1391</td>
</tr>
<tr>
<td></td>
<td></td>
<td>19 to 20 Pooling</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>21 Fully connected</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
1.7 OOM implementations

This section presents particular mixed implementations and computation methods that can be employed to realize a neural network. The NNs described below are not realized in-memory but as hardware accelerators: the term OOM stands for out of memory.

1.7.1 Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing [28]

Introduction

[28] addresses the processing of raw data, such as data coming from sensors. One possible way to operate on such data is to employ a NN combined with near-data computing. [28] introduces a new way of computation based on a stochastic-binary approach (SC), where a bit sequence represents a probability. Its implementation is cheaper than the classical binary approach, but it requires a longer computation time and consequently higher energy [28]. The precision can be reduced in order to save energy and time. The stochastic approach is used only in the first layer of the neural network.

Architecture and considerations

The SC is based on the probabilities, and so a bitstream in SC has the following meaning:

\[ X = 01101010 \rightarrow \text{Probability} = \frac{\#1s}{\text{length}} = \frac{4}{8} = 1/2 \]

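The probability in this case is 0.5, because four of the 8 bits are 1s. The arithmetic functions are easily implemented: multiplication is simply realized with an AND logic gate.

\[ p_1 = 0.5 \rightarrow X_1 = 0110 \]
\[ p_2 = 0.25 \rightarrow X_2 = 0100 \]
\[ p_1 \cdot p_2 = 0.5 \cdot 0.25 = 0.125 \]
\[ Y = X_1 \text{ AND } X_2 = 0100 \rightarrow p_Y = 0.25 \]

With these short, overlapping streams the AND output encodes 0.25 rather than the exact product 0.125: the result approaches $p_1 \cdot p_2$ only when the two bitstreams are uncorrelated and long enough.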
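The AND-based multiplication can be tried in software. The sketch below is only illustrative: it assumes independent, randomly generated bitstreams, and the helper names (`to_stream`, `value`, `sc_multiply`) are hypothetical rather than taken from [28].

```python
import random


def to_stream(p, length, rng):
    """Encode probability p as a random unipolar bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def value(stream):
    """Probability represented by a bitstream: number of 1s over length."""
    return sum(stream) / len(stream)


def sc_multiply(x, y):
    """Stochastic multiplication: bitwise AND of the two streams."""
    return [a & b for a, b in zip(x, y)]


rng = random.Random(0)  # fixed seed, so the run is repeatable
for length in (8, 64, 4096):
    x = to_stream(0.5, length, rng)
    y = to_stream(0.25, length, rng)
    # The estimate tends towards 0.5 * 0.25 = 0.125 as the streams grow.
    print(length, value(sc_multiply(x, y)))
```

Short streams give coarse estimates, while longer ones get closer to the exact product at the cost of more clock cycles.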

The stochastic circuits used in [28] are reported in the following figure:

![Diagram](image)

Figure 1.49: (a) Multiplier; (b) Binary - Stochastic converter; (c) Stochastic - Binary converter; (d) Multiplexer adder with random input r; (e) Improved version of the adder, without the random input. Source: [28]

This kind of computation presents some errors, depending on the positions of the incoming bits. One way to improve the precision is to lengthen the bit sequence; the precision of SC, in bits, is given by:

\[ \text{Precision} = \log_2(\text{length}) \]  \hspace{1cm} (1.93)

Here length is the bitstream size. A probability lies only in the range [0,1], but this limitation can be easily overcome by interpreting the value of X as \(2p_X - 1\) [28]. From a stochastic point of view, an adder is implemented as a multiplexer whose selector is a random value $r$ with probability \(P(r)\).
This implementation has been improved in [28] so as to eliminate the additional random input: considering the circuit in Figure 1.49 (e), at each clock cycle, if X and Y are the same, Y is propagated to the output [28]; otherwise, the state of the TFF is changed. To understand how it works, consider the following example from [28]:

1. Initial TFF state = 0;
2. X = 0100 1010 (3/8);

By performing all the computations (shown in Figure 1.50), the output bitstream is 00100010 (1/4). In fact:

\[
Z_0 = 0.5 \cdot (3/8 + 1/4) = 5/16 \approx 1/4
\]  \hspace{1cm} (1.94)

If the initial TFF state is 1, the result is 01001010. Among the other circuits depicted in Figure 1.49, the binary to stochastic converter is designed as a comparator whose inputs are a random number generator and the binary input: if the latter is higher than the randomly generated number, the output is 1, otherwise 0 (Figure 1.49 (b)) [28]. Similarly, the conversion from stochastic to binary can be performed by a binary counter which counts the total number of 1s in the bit sequence (Figure 1.49 (c)) [28].
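The behaviour of the TFF-based adder can be checked with a short simulation. Two points are assumptions rather than statements from the text: the second input stream Y = 0010 0010 (only its value, 1/4, appears in equation 1.94), and the rule that, when X and Y differ, the current TFF state is sent to the output before being toggled. With these assumptions the sketch reproduces both output bitstreams quoted above.

```python
def tff_adder(x, y, state=0):
    """Scaled stochastic addition 0.5 * (x + y) without a random input."""
    out = []
    for a, b in zip(x, y):
        if a == b:
            out.append(a)       # equal bits are propagated unchanged
        else:
            out.append(state)   # assumed: emit the TFF state ...
            state ^= 1          # ... then toggle it
    return out


def bits(text):
    """Turn a string such as '0100 1010' into a list of bits."""
    return [int(c) for c in text if c in "01"]


def as_text(stream):
    return "".join(str(b) for b in stream)


X = bits("0100 1010")  # 3/8, from the example
Y = bits("0010 0010")  # 1/4, assumed bit pattern
print(as_text(tff_adder(X, Y, state=0)))
print(as_text(tff_adder(X, Y, state=1)))
```

The two runs differ only in the initial TFF state, which decides whether the odd mismatch is rounded down (1/4) or up (3/8) around the exact value 5/16.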
Stochastic binary neural network design

In order to implement the stochastic approach, [28] considers the LeNet-5 neural network, which is composed of the following layers:

Table 1.31: LeNet-5 structure Source: [50]

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th># Channels input</th>
<th>IFMAP size</th>
<th>Kernel size</th>
<th>OFMAP size</th>
<th># Channels output</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Input</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>28x28</td>
<td>1</td>
<td>-</td>
</tr>
<tr>
<td>1</td>
<td>Conv</td>
<td>1</td>
<td>28x28</td>
<td>5x5</td>
<td>28x28</td>
<td>32</td>
<td>-</td>
</tr>
<tr>
<td>2</td>
<td>Max Pool</td>
<td>32</td>
<td>28x28</td>
<td>2x2</td>
<td>14x14</td>
<td>32</td>
<td>-</td>
</tr>
<tr>
<td>3</td>
<td>Conv</td>
<td>32</td>
<td>14x14</td>
<td>5x5</td>
<td>14x14</td>
<td>32</td>
<td>-</td>
</tr>
<tr>
<td>4</td>
<td>Max Pool</td>
<td>32</td>
<td>14x14</td>
<td>2x2</td>
<td>7x7</td>
<td>32</td>
<td>-</td>
</tr>
<tr>
<td>5</td>
<td>FC</td>
<td>-</td>
<td>128</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>50 % dropout</td>
</tr>
<tr>
<td>6</td>
<td>FC</td>
<td>-</td>
<td>10</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>25% dropout &amp; softmax</td>
</tr>
</tbody>
</table>

Figure 1.50: Example of computations of the new stochastic adder. Source: [28]

The NN uses bipolar operations, but the bipolar approach of SC is not used because it degrades the accuracy. [28] adopts unipolar operations instead, splitting the weights into positive and negative bit-streams $w_{\text{pos}}$ and $w_{\text{neg}}$: two different dot products are computed (positive and negative) and converted to the binary domain, and the sign function is then performed by a simple comparator.
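The weight splitting can be illustrated numerically. The sketch below works on real values instead of bitstreams, so it shows only the arithmetic idea; the function names and the convention that a tie gives +1 are assumptions.

```python
def split_weights(weights):
    """Split signed weights into two non-negative (unipolar) vectors."""
    w_pos = [max(w, 0.0) for w in weights]
    w_neg = [max(-w, 0.0) for w in weights]
    return w_pos, w_neg


def sign_of_dot(x, weights):
    """Sign of the dot product obtained by comparing two unipolar sums."""
    w_pos, w_neg = split_weights(weights)
    positive = sum(xi * wi for xi, wi in zip(x, w_pos))
    negative = sum(xi * wi for xi, wi in zip(x, w_neg))
    # The comparator replaces the subtraction followed by a sign function.
    return 1 if positive >= negative else -1


x = [0.2, 0.9, 0.5]            # unipolar inputs in [0, 1]
w = [0.5, -0.25, -0.75]        # signed weights
print(sign_of_dot(x, w))
```

Since both partial sums are non-negative, each of them can be represented by a unipolar stream, and only their comparison is needed to obtain the sign.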

Results

The results from [28] are now reported. The architecture has been tested on the MNIST dataset, as reported in Table 1.32, with different bitstream lengths, in order to see how the evaluated parameters change. The power is normalized to the throughput because, depending on the application, the throughput can be chosen arbitrarily.

Table 1.32: Performance and accuracy results. Comparison with the classical binary approach and the discussed one. Source: [28]
1.7.2 Towards Near Data Processing of Convolutional Neural Networks [29]

Introduction

[29] proposes an approach in which the memory wall problem is mitigated by near-data processing (NDP). Integrating logic with memory is usually very expensive in terms of performance; [29] therefore studies a 3D stacked structure connected via TSVs and applies it to a CNN architecture. The memory chosen by [29] is an HMC (Hybrid Memory Cube) divided into vaults.

HMC Structure The HMC is made of 4 to 8 DRAM layers, across which the image is split. They are stacked on top of each other and connected by TSVs, as already mentioned. The bottom layer hosts a computational unit which performs all the computations that a CNN requires [29]. Each DRAM layer is divided into 16 parts; a stack of these parts coming from different layers is called a vault [29], and each vault is divided into two parts called banks. The architecture of the system is reported in Figure 1.51 from [29]. The HMC has 4 layers of 4Gb each (2GB in total). Each vault controller contains a CLU (CNN logic unit), which computes the convolution operation for its vault. [29] adopts the double precision floating point representation of the numbers (so it is not a binary network). The CLU contains a floating point multiplier, an adder, some registers (which store the bias value and the partial results) and an SRAM (which contains the filter weights, the same for all the CLUs): all the elements needed to compute a convolution.
Computational steps When the host processor asserts a start signal, the computation begins and proceeds as follows:

1. The kernel’s elements are loaded into the CLU SRAM and an image element is loaded into the CLU from the DRAM banks;

2. The floating point multiplier in the CLU multiplies the weights stored in the SRAM by the incoming image element. If required, a bias element is added by means of the floating point adder;

3. Results are sent back to the memory die. Some of them could be partial results, because the memory is split and so some elements of the image could be located into a different vault as shown in Figure 1.52.

![Figure 1.52: Complete and partial result computation. Source: [29]](image)

The partial result is stored locally in order to be considered at the end of computation of the following vaults.

4. Partial results are then added together by means of inter-vault connections and then are written in the memory dies.
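The steps above can be sketched with a one-dimensional example in which the image is split across vaults and the windows straddling a vault boundary produce partial results. The 1D layout, the even split and all the function names are simplifying assumptions for illustration, not the actual data mapping of [29].

```python
def vault_partial_sums(segment, offset, kernel, n_outputs):
    """Partial results computed by one vault for every output position.

    segment: image elements stored in this vault
    offset:  index of segment[0] in the full image
    """
    k = len(kernel)
    partial = [0.0] * n_outputs
    for local_idx, pixel in enumerate(segment):
        pos = offset + local_idx
        # A pixel at position pos belongs to the windows starting at
        # pos - k + 1 ... pos (clipped to the valid output range).
        for o in range(max(0, pos - k + 1), min(n_outputs, pos + 1)):
            partial[o] += kernel[pos - o] * pixel
    return partial


def conv_across_vaults(image, kernel, n_vaults, bias=0.0):
    """Add the partial results of all the vaults (inter-vault step)."""
    n_outputs = len(image) - len(kernel) + 1
    size = -(-len(image) // n_vaults)  # ceiling division
    totals = [bias] * n_outputs
    for v in range(n_vaults):
        segment = image[v * size:(v + 1) * size]
        partial = vault_partial_sums(segment, v * size, kernel, n_outputs)
        totals = [t + p for t, p in zip(totals, partial)]
    return totals


def conv_reference(image, kernel, bias=0.0):
    """Same convolution computed on the unsplit image."""
    k = len(kernel)
    return [sum(kernel[j] * image[o + j] for j in range(k)) + bias
            for o in range(len(image) - k + 1)]


image = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]
print(conv_across_vaults(image, kernel, n_vaults=2))
print(conv_reference(image, kernel))
```

With two vaults, the windows starting at positions 2 and 3 use elements of both halves, so their values are complete only after the partial results are added together.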

**Results**

The results of this implementation from [29] are reported here, together with the network structure employed by [29].
Table 1.33: Results and network structure (Source: [29]) of the floating point architecture

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th>IFMAP size</th>
<th>kernel size</th>
<th>CPU-Based</th>
<th>CLU-Based</th>
<th>Area [mm²]</th>
<th>Power [W]</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Conv</td>
<td>128x128x3</td>
<td>5x5x3</td>
<td>1.149</td>
<td>0.0158</td>
<td>11.0619</td>
<td>0.1848</td>
</tr>
<tr>
<td>2</td>
<td>Conv</td>
<td>124x124x3</td>
<td>5x5x3</td>
<td>1.0808</td>
<td>0.0153</td>
<td>10.4054</td>
<td>0.1779</td>
</tr>
<tr>
<td>3</td>
<td>Conv</td>
<td>120x120x3</td>
<td>4x4x3</td>
<td>0.6643</td>
<td>0.0086</td>
<td>6.396</td>
<td>0.1148</td>
</tr>
<tr>
<td>4</td>
<td>Conv</td>
<td>117x117x4</td>
<td>4x4x4</td>
<td>0.832</td>
<td>0.0165</td>
<td>8.0103</td>
<td>0.2057</td>
</tr>
<tr>
<td>5</td>
<td>Conv</td>
<td>114x114x5</td>
<td>3x3x5</td>
<td>0.808</td>
<td>0.0094</td>
<td>7.7753</td>
<td>0.118</td>
</tr>
<tr>
<td>6</td>
<td>Conv</td>
<td>112x112x3</td>
<td>5x5x3</td>
<td>0.873</td>
<td>0.0105</td>
<td>8.4049</td>
<td>0.1415</td>
</tr>
<tr>
<td>7</td>
<td>Conv</td>
<td>108x108x3</td>
<td>5x5x3</td>
<td>0.809</td>
<td>0.0102</td>
<td>7.7888</td>
<td>0.1354</td>
</tr>
</tbody>
</table>

1.7.3 Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks [30]

Introduction

[30] proposes an energy efficient and reconfigurable architecture based on a chain of interconnected processing elements (PEs). A CNN can be implemented with different architectures:

- **Memory-centric**[30]: data are not reused in the processor, so they are always fetched from the memory. PEs in the CPU are simply stacked and are not interconnected to each other.

  (Pro) Reconfigurability;
  (Con) Low efficiency.

- **2D Spatial**[30]: data are reused in the processor, since a connection between one PE and the following one exists. This solution reduces the data fetching from the memory because each PE keeps frequently used data locally and passes them to the following PE if needed.

  (Pro) Reduced data movements;
  (Con) High power-area cost.

- **1D-Chain**[30]: PEs are arranged as a chain and driven by an FSM. The sequential circuit preloads the kernel parameters and, after that, the IFMAPs are streamed along the chain in order to compute the CNN results.

  (Pro) Better energy efficiency;
  (Pro) Data reusability;
  (Pro) High reconfigurability and so high performance.

**Chain-NN: 1D Chain Architecture**

**1D Chain architecture** An example of chain NN is reported in Figure 1.53, considering a kernel size of 3. Each chain is mapped to a convolution kernel window: the inputs are sent serially to the chain and each PE performs a MAC operation with its kernel weight. This architecture works well, but when some pixels are not included in a convolutional window, additional clock cycles are required to fetch the new pixels, which decreases the throughput (in particular, with \( K = 3 \) and \( \text{stride} = 2 \), at most 6 pixels match between two different convolutional windows, so at least 3 pixels have to be fetched). For this reason, a dual channel architecture has been designed in [30].

Figure 1.53: Chain NN architecture with \( k = 3 \), where \( k \) is the kernel size. 9 processing elements are needed because each PE uses a different weight. Each PE contains a MAC and a register, and the corresponding outputs can optionally be pipelined in order to improve performance (red dashed lines). Example of computation. Source: [30]

\[
k_1 \cdot (1) + k_2 \cdot (2) + k_3 \cdot (3) + k_4 \cdot (4) + \\
k_5 \cdot (5) + k_6 \cdot (6) + k_7 \cdot (7) + k_8 \cdot (8) + k_9 \cdot (9)
\]
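The sum above can be reproduced with a purely functional model of the chain, in which every PE holds one kernel weight and adds its product to the partial sum received from the previous PE. This is a sketch under assumptions: the class and function names are hypothetical, and the cycle-by-cycle streaming and pipelining of [30] are not modelled.

```python
class ProcessingElement:
    """One PE of the chain: a stored kernel weight and a MAC."""

    def __init__(self, weight):
        self.weight = weight

    def mac(self, pixel, partial_sum):
        # Multiply the incoming pixel by the local weight and accumulate.
        return partial_sum + self.weight * pixel


def chain_convolve(window_pixels, kernel_weights):
    """Pass the partial sum along the chain, one PE per kernel weight."""
    chain = [ProcessingElement(w) for w in kernel_weights]
    partial = 0
    for pe, pixel in zip(chain, window_pixels):
        partial = pe.mac(pixel, partial)
    return partial


pixels = [1, 2, 3, 4, 5, 6, 7, 8, 9]       # pixels (1) ... (9) of the window
kernel = [1, 0, -1, 1, 0, -1, 1, 0, -1]    # k1 ... k9, example values
print(chain_convolve(pixels, kernel))
```

The value leaving the last PE is the output of the convolution window, which is why a 3x3 kernel needs 9 PEs.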

Dual channel [30] solves this problem by increasing the number of data fetched by a single PE. This implementation is called dual channel architecture: the column-wise scanning is maintained, but this time at least \( 2K-1 \) row elements are fed to the PE. The PE fetches the even columns (evenIF) and, after \( K+1 \) clock cycles, the odd columns (oddIF) with the following order:
In this way the Dual-channel architecture can continuously perform new convolutional operations without waiting times. Inside each PE, there is an internal storage (kMemory) that keeps the kernels (which are the same, since it is a CNN).

### Results

The results from [30] are reported here. The implemented network is AlexNet (convolutional layers only, without the fully-connected part) and its structure is reported in the following table.

**Table 1.34: Results and network structure. Source: [30]. The implementation is in fixed-point precision. One operation, as counted in OPS (operations per second), is a multiplication and an accumulation.**

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th>IFMAP size</th>
<th>kernel size</th>
<th>Time required [ms]</th>
<th>Memory required [MB]</th>
<th>Total power [mW]</th>
<th>Throughput [GOPS]</th>
<th>Power efficiency [GOPS/W]</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Conv</td>
<td>227x227x3</td>
<td>3x3</td>
<td>29.2</td>
<td>44.9</td>
<td>567.5</td>
<td>806.4</td>
<td>1421</td>
</tr>
<tr>
<td>2</td>
<td>Conv</td>
<td>55x55x96</td>
<td>3x3</td>
<td>43.83</td>
<td>175.3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>Conv</td>
<td>27x27x256</td>
<td>3x3</td>
<td>58.43</td>
<td>312.1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>Conv</td>
<td>13x13x384</td>
<td>3x3</td>
<td>102.53</td>
<td>234.3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>Conv</td>
<td>13x13x256</td>
<td>3x3</td>
<td>159.35</td>
<td>156.2</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
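As a quick consistency check on Table 1.34, the power efficiency in the last column is the ratio between the throughput and the total power. The snippet below only redoes this arithmetic with the values of the table; the variable names are illustrative.

```python
throughput_gops = 806.4   # Throughput [GOPS], from Table 1.34
total_power_mw = 567.5    # Total power [mW], from Table 1.34

# Convert mW to W, then divide: GOPS / W.
power_efficiency = throughput_gops / (total_power_mw / 1000.0)
print(round(power_efficiency))  # matches the 1421 reported in the table
```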
1.7.4 An Energy-Efficient Architecture for Binary Weight Convolutional Neural Networks [31]

Introduction

[31] proposes a low power BCNN architecture, which can therefore be used in embedded systems. It can realize very deep CNNs and it is compatible with BinaryConnect or BWN. [31] analyzes only the convolutional layers.

Background

**Binary weight CNN** [31] analyzes BinaryConnect and BWN. The second differs from the first only by the scaling factor $\alpha$, given by [31]:

$$\alpha_o^{(\ell)} = \frac{\|W_{o,fp}\|_{\ell_1}}{n}$$  \hfill (1.95)

So the corresponding output of the BWN is given by [31]:

$$y_{o,bwn}^{(\ell)}(j,i) = \alpha_o^{(\ell)} \times y_{o,bc}^{(\ell)}(j,i)$$  \hfill (1.96)

where the subscripts $bc$ and $fp$ denote BinaryConnect and floating point respectively. As already mentioned, computing the $\alpha$ coefficient requires the full-precision weights. A basic stage of a BCNN is reported in the following table:

Table 1.35: Basic stages of a binary convolutional neural network. Source: [31]

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>BCNN</td>
<td>Binary convolution</td>
</tr>
<tr>
<td>2</td>
<td>Scaling</td>
<td>$y = \alpha x$</td>
</tr>
<tr>
<td>3</td>
<td>Batch norm.</td>
<td>Apply batch normalization formula</td>
</tr>
<tr>
<td>4</td>
<td>ReLU</td>
<td>Max(0,x)</td>
</tr>
<tr>
<td>5</td>
<td>Max Pooling</td>
<td>Downsampling</td>
</tr>
</tbody>
</table>
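Equations 1.95 and 1.96 can be illustrated on a single dot product. The sketch below is a minimal example under assumptions: the function names are hypothetical, a weight equal to 0 is binarized to +1, and a flat list stands for one filter.

```python
def binarize(weights):
    """Sign of each full-precision weight (0 is mapped to +1, assumed)."""
    return [1 if w >= 0 else -1 for w in weights]


def scaling_factor(weights):
    """alpha = l1 norm of the full-precision weights divided by n (1.95)."""
    return sum(abs(w) for w in weights) / len(weights)


def binaryconnect_output(x, weights):
    """Only additions/subtractions: the weights are +1 or -1."""
    return sum(xi * bi for xi, bi in zip(x, binarize(weights)))


def bwn_output(x, weights):
    """BWN output = alpha times the BinaryConnect output (1.96)."""
    return scaling_factor(weights) * binaryconnect_output(x, weights)


w_fp = [0.5, -0.25, 0.75, -0.5]   # full-precision weights
x = [1.0, 2.0, 3.0, 4.0]          # inputs kept in full precision
print(scaling_factor(w_fp), binaryconnect_output(x, w_fp), bwn_output(x, w_fp))
```

The scaling is applied once per output, so the cost of the multiplication by $\alpha$ is small compared with the accumulation.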
Algorithmic optimizations for BCNNs

The following optimizations are implemented in [31]:

1. [31] proposes the 1's complement to further reduce the complexity of the system, since the 2's complement requires an additional sum. This approximation introduces an error of 15% on CIFAR-10 with the VGG-16 architecture. From [31], the error can be mathematically defined as:

\[ x^* = x - n \]  

(1.97)

Since the numbers of +1 and -1 weights are roughly equal, the error can be estimated in advance and compensated by knowing the number of -1s ($n$).

2. Since the max pooling layer selects only the maximum among its inputs, the computations spent on the other values are wasted. [31] therefore applies an earlier pooling. This technique simply changes the order of the layers: the pooling layer is performed right after the convolution, as shown in the following table:

Table 1.36: Basic stages with earlier pooling Source: [31]. Compared to Table 1.35, the pooling layer is placed as second in the order.

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>BCNN</td>
<td>Binary convolution</td>
</tr>
<tr>
<td>2</td>
<td>Max Pooling</td>
<td>Downsampling</td>
</tr>
<tr>
<td>3</td>
<td>Scaling</td>
<td>$y = \alpha x$</td>
</tr>
<tr>
<td>4</td>
<td>Batch normalization</td>
<td>Apply batch normalization formula</td>
</tr>
<tr>
<td>5</td>
<td>ReLU</td>
<td>Max(0,x)</td>
</tr>
</tbody>
</table>

3. Batch normalization transformation: since the batch normalization is defined as

\[ y_{\text{batch}} = \frac{y_{\text{conv}} \times \alpha - \mu}{\sigma} \]  

(1.98)
[31] considers the terms $m = \frac{\mu}{\alpha}$ and $n = \alpha/\sigma$ and transforms the equation into:

$$y_{batch} = (y_{conv} - m) \times n$$  \hspace{1cm} (1.99)

Since $m$ and $n$ can be computed in advance, no divisions are performed at run time.

4. Quantization of the activations: the ReLU layer has to be quantized somehow. [31] proposes a method based on an equal-distance nonuniform quantization which produces a maximum accuracy loss of 1.7%.
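The batch normalization transformation of item 3 (equations 1.98 and 1.99) can be verified numerically. In this sketch the function names and the numerical values are illustrative assumptions; the two divisions are confined to `precompute`, which would run once, offline.

```python
import math


def batch_norm_direct(y_conv, alpha, mu, sigma):
    """Equation 1.98: scaling followed by normalization (one division)."""
    return (y_conv * alpha - mu) / sigma


def precompute(alpha, mu, sigma):
    """Offline step: m = mu / alpha and n = alpha / sigma."""
    return mu / alpha, alpha / sigma


def batch_norm_transformed(y_conv, m, n):
    """Equation 1.99: one subtraction and one multiplication."""
    return (y_conv - m) * n


alpha, mu, sigma = 0.5, 2.0, 4.0
m, n = precompute(alpha, mu, sigma)
for y_conv in (-3.0, 0.0, 10.0):
    print(batch_norm_direct(y_conv, alpha, mu, sigma),
          batch_norm_transformed(y_conv, m, n))
```

Both expressions give the same result, because $(y_{conv} - \mu/\alpha) \cdot \alpha/\sigma = (y_{conv} \cdot \alpha - \mu)/\sigma$.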

**Architecture**

**Top-level architecture** The architecture proposed by [31] is reported in Figure 1.55:

![Architecture Diagram](image_url)

Figure 1.55: Architecture. Source: [31]
The architecture in Figure 1.55 [31] is composed of: an image memory (IMEM) that stores two rows of the IFMAP (size $C_{in} \times 2 \times w_{in}$); a filter memory (FMEM) that contains the filter elements; several PUs that compute the convolution operations, each of them processing an OFMAP; an input feature map summation unit (ISU) that adds all the OFMAPs and produces the neuron’s output; an accumulation array (ACCA) which accumulates the exceeding IFMAPs, if their number is higher than that of the available PUs; a Neuron PU (NPU) that computes scaling, batch normalization, ReLU and max-pooling and produces 256 output neurons per clock cycle [31]; a central control unit that schedules the architecture [31].

**Processing unit** The PU computes the convolution: it is composed of multiple filters (MFIR), whose outputs are added together in multiple fast adder units (FAUs), built as an optimized compressor tree based on 4:2 and 3:2 compressor circuits.

**Adder tree** All the multiplications have been removed from the binary convolutional layers, so the critical path lies in the accumulation part. In one convolution, an output neuron is obtained by adding together $w_{kernel} \times h_{kernel} \times \text{window}_{\text{size}} = 36$ data.

![Adder tree](image_url)

Figure 1.56: 4:2 compressors used in [31]
The compressor tree is made of 3:2 and 4:2 compressors; the latter do not have a carry chain, so the delay is not heavily affected. The signals Coutk and Cink are not used, which produces an approximated result.

**Approximate Binary Multiplier** Since the design uses 1’s complement representation, an approximated version can be used, in which the adder which adds 1 to obtain the 2’s complement is not implemented.

![Figure 1.57: Approximate multiplier. Source: [31]](image)

This implementation brings a 60% area reduction.
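The effect of dropping the "+1" of the 2's complement can be reproduced with integers, since in Python `~v` equals `-v - 1`, that is, the 1's complement negation. The sketch below is an illustration under assumptions (function names and data are hypothetical); it shows that the accumulated result deviates from the exact one by the number of -1 weights, as in the relation $x^* = x - n$ given earlier.

```python
def dot_exact(values, signs):
    """Binary-weight dot product with exact (2's complement) negation."""
    return sum(v if s > 0 else -v for v, s in zip(values, signs))


def dot_ones_complement(values, signs):
    """Same dot product, but negation is only a bitwise inversion."""
    # ~v == -v - 1: the adder that would add 1 is not implemented.
    return sum(v if s > 0 else ~v for v, s in zip(values, signs))


values = [3, 5, 2, 7]
signs = [1, -1, -1, 1]            # binary weights
n_minus = signs.count(-1)         # number of -1 weights (n)

exact = dot_exact(values, signs)
approx = dot_ones_complement(values, signs)
print(exact, approx, exact - n_minus)
```

Because the deviation depends only on how many weights are -1, it is known as soon as the weights are fixed and can be compensated afterwards.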

**Approximate Adder [31]** The adder tree inaccuracies have been alleviated by dividing the adder into two parts, (N:k) and (k-1:0) respectively. The input carry bit of the (N:k) part is taken from the k-th bit of the input data in the (k-1:0) part:

1. For the higher (N-k)-bit subadder, the input carry bit Cin is approximately speculated using the k-th bit of one of the input data, reducing the datapath delay and the hardware complexity;

2. When k is set to half of the word size, the hardware efficiency gain reaches its maximum, but the error rate increases with k. The error obtained with this approach is $\pm 2^k$, where k is the split position.
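The split adder can be modelled on unsigned integers. The sketch below makes one explicit assumption about an ambiguous detail: the speculated carry is taken as the most significant bit of the lower part (bit k-1) of the first operand. The function name is hypothetical. Under this assumption, a wrong speculation changes the result by exactly $2^k$ in either direction.

```python
def approximate_add(a, b, k):
    """Add two non-negative integers with a speculated carry at bit k (k >= 1)."""
    mask = (1 << k) - 1
    low = (a & mask) + (b & mask)           # exact sum of the lower parts
    # Assumed speculation: use bit k-1 of operand a instead of the real
    # carry out of the lower part, so the upper part does not wait for it.
    speculated_carry = (a >> (k - 1)) & 1
    high = (a >> k) + (b >> k) + speculated_carry
    return (high << k) | (low & mask)


k = 4
errors = set()
for a in range(256):
    for b in range(256):
        errors.add(approximate_add(a, b, k) - (a + b))
print(sorted(errors))
```

The exhaustive loop over 8-bit operands collects every possible deviation from the exact sum, which makes the $\pm 2^k$ bound visible.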

**NPU, batch normalization unit [31]** As already said, the NPU computes the batch normalization, the activation (ReLU) and the max pooling; in particular, 8 convolved neurons pass through it [31]. In order to save power, a "disable" signal is used to turn off the NPU when it is not needed. Since the ReLU function is 0 for all the elements below 0 (max(0,x)), a latch (driven by the sign of the incoming bits) is introduced before the quantization unit in order to block the data propagation when the result coming from the adder is negative.

Implementation results and comparison

Results from [31] are reported in Table 1.38, including the structure of the VGG-16 CNN model used. The technology used is 130nm, while it is possible to show that from 90nm the frequency reaches a maximum value of 650MHz. The bit-lengths and memory sizes of the different building blocks in the architecture are reported in Table 1.37:

Table 1.37: Bit-lengths and memory used. Source: [31]

<table>
<thead>
<tr>
<th>Bit-Lengths</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>IMEM</td>
<td>ISU Adders (k=3)</td>
<td>ACCA Adders (k=4)</td>
<td>NPU</td>
</tr>
<tr>
<td>6</td>
<td>13</td>
<td>15</td>
<td>16</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Memory sizes [KB]</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>IMEM</td>
<td>FMEM</td>
<td>ACCA</td>
<td>NPU</td>
</tr>
<tr>
<td>21.6</td>
<td>295</td>
<td>53.76</td>
<td>2</td>
</tr>
</tbody>
</table>
Table 1.38: Results Source: [31] with corresponding CNN structure. An OPS is a MAC operation per second.

<table>
<thead>
<tr>
<th>Technology</th>
<th>Dataset</th>
</tr>
</thead>
<tbody>
<tr>
<td>130nm SMIC</td>
<td>ImageNet</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>CNN STRUCTURE</th>
<th>Total Power [mW]</th>
<th>Frequency [MHz]</th>
<th>Peak performance [TOPS]</th>
<th>Area [mm²]</th>
<th>Core voltage [V]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Layer</td>
<td>Type</td>
<td>IFMAP size</td>
<td>Kernel size</td>
<td># operations [MOPs]</td>
<td>Time required [ms]</td>
</tr>
<tr>
<td>1</td>
<td>Conv</td>
<td>224x224x3</td>
<td>3x3x64</td>
<td>183.04</td>
<td>0.34</td>
</tr>
<tr>
<td>2</td>
<td>Conv + Pooling</td>
<td>224x224x64</td>
<td>3x3x128</td>
<td>3709.01</td>
<td>4.17</td>
</tr>
<tr>
<td>3</td>
<td>Conv</td>
<td>112x112x128</td>
<td>3x3x128</td>
<td>1854.5</td>
<td>1.94</td>
</tr>
<tr>
<td>4</td>
<td>Conv + Pooling</td>
<td>112x112x128</td>
<td>3x3x128</td>
<td>3704.19</td>
<td>3.86</td>
</tr>
<tr>
<td>5</td>
<td>Conv</td>
<td>56x56x128</td>
<td>3x3x256</td>
<td>1852.1</td>
<td>1.94</td>
</tr>
<tr>
<td>6</td>
<td>Conv</td>
<td>56x56x256</td>
<td>3x3x256</td>
<td>3701.78</td>
<td>3.89</td>
</tr>
<tr>
<td>7</td>
<td>Conv + Pooling</td>
<td>56x56x256</td>
<td>3x3x256</td>
<td>3701.78</td>
<td>3.89</td>
</tr>
<tr>
<td>8</td>
<td>Conv</td>
<td>28x28x256</td>
<td>3x3x512</td>
<td>1850.89</td>
<td>2.22</td>
</tr>
<tr>
<td>9</td>
<td>Conv</td>
<td>28x28x512</td>
<td>3x3x512</td>
<td>3700.58</td>
<td>4.45</td>
</tr>
<tr>
<td>10</td>
<td>Conv + Pooling</td>
<td>28x28x512</td>
<td>3x3x512</td>
<td>3700.58</td>
<td>4.45</td>
</tr>
<tr>
<td>11</td>
<td>Conv</td>
<td>14x14x512</td>
<td>3x3x512</td>
<td>925.15</td>
<td>1.3</td>
</tr>
<tr>
<td>12</td>
<td>Conv</td>
<td>14x14x512</td>
<td>3x3x512</td>
<td>925.15</td>
<td>1.3</td>
</tr>
<tr>
<td>13</td>
<td>Conv + Pooling</td>
<td>14x14x512</td>
<td>3x3x512</td>
<td>925.15</td>
<td>1.3</td>
</tr>
<tr>
<td>Total</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>30733.9</td>
<td>35.05</td>
</tr>
</tbody>
</table>
Chapter 2

Comparisons

This chapter proposes a performance comparison among the different state-of-the-art architectures analyzed so far.

2.1 Algorithm accuracies

All the accuracies obtained by the different algorithm implementations are compared. To compare the results correctly, the same reference architecture (such as AlexNet) and the same dataset are chosen.

TOP-1 and TOP-5 errors

The top-1 error comparison is reported in the following plot:
The architecture analyzed is AlexNet in all the cases, and the same dataset (ImageNet) has been chosen. Starting from the left side, the first bar refers to the classical floating point implementation, whose accuracy reaches 56.6% [11]. Passing to the binarized alternatives, the accuracy obtained by BWN [12] (binary weight network) is very high: here the weights are binarized with an extra scaling factor, while the inputs are kept in floating point precision. Reducing the precision of the weights while keeping a scaling factor yields an accuracy similar to that of the ideal floating point case. If the weights are binarized, the convolution operations (MACs) are transformed into simple additions/subtractions, because the weights assume only ±1 values. Among XNOR-Net [12] (both weights and inputs binarized, with extra scaling factors $K$ and $\alpha$), BinaryConnect [13] (only the weights are binarized, without any extra scaling factor) and BW-BI [13] (both inputs and weights binarized, without scaling factors), the first one reaches better results than the others, because some additional terms are used during the convolution operation and they guarantee a good result:

\[ \text{Conv}_{\text{XNOR}} \simeq (\text{sign}(I) \odot \text{sign}(W)) \cdot \alpha K \]  

(2.1)

BinaryConnect and BW-BI (BNN) can be used in small datasets, while XNOR-Net represents a good trade-off between complexity and accuracy.

The top-5 error values for the different networks analyzed are reported in the following figure:

Figure 2.2: top-5 errors for the same dataset ImageNet. AlexNet: [11], XNOR-Net: [12], BWN: [12], BinaryConnect: [13]. BW-BI: [13]
Also in the case depicted in Figure 2.2, the architecture analyzed is AlexNet for the first five cases: XNOR-Net reaches a very good accuracy. The last bar reports a deeper full-precision convolutional neural network (ResNet-18), which has up to 18 convolutional layers: its accuracy is higher than that of AlexNet because increasing the number of layers improves the accuracy. The last graph proposed is the top-1 error for the CIFAR-10 dataset, which is far less complicated than ImageNet, so the accuracy is expected to be higher than in the previous plots for all the cases:

![Accuracy comparison for CIFAR-10 dataset](image)

Figure 2.3: Accuracy comparison for CIFAR-10 dataset. XNOR-Net: [12], BWN: [12], BinaryConnect: [13], Ternary: [14]

The ternary network has also been analyzed: setting some of the weights of the network to 0 produces an acceptable accuracy for simple datasets with lower power consumption.

2.1.1 Performance comparisons

⚠️ ATTENTION ⚠️

The comparisons presented here are rough estimations! The reported data are rescaled assuming linear dependencies. Specific cases and distinctions (for example the area of the array with respect to the area of the entire chip) are not considered, since the data provided by the documents may not make this distinction. Some assumptions have been made (as described in the following part), but their correctness and accuracy are not guaranteed. Not all the documents analyzed have been included in the comparison, since some of them focus on a different topic (for example accuracy) and do not provide the required parameters. For a more accurate comparison, the individual cases should be reproduced with the same benchmark model (such as AlexNet) and the same neural network type (CNN or MLP).

Number of neurons

The number of neurons has been computed to compare the architectures in terms of performance: it allows some network parameters (such as power, energy and area) to be normalized across different structures. The values obtained with this normalization are expressed as parameter per number of neurons. To illustrate the estimation, consider the AlexNet network as an example:

Taking the first layer, its output has 96 feature maps of 55x55 pixels. Each pixel can be considered as a neuron, so the total number of neurons in the first layer is:

\[
\#\text{neurons}_{1st} = 55 \times 55 \times 96 = 290400 \quad (2.2)
\]

Adding the numbers of neurons of the individual layers gives the total value:

\[
\#\text{neurons}_{\text{AlexNet}} = 659272 \quad (2.3)
\]

The number of neurons to use depends on which parameter has to be evaluated: for instance, if area/neuron is considered, the value to use is the effective number of neurons processed by the architecture at a time. If instead the energy is considered, the value becomes the total number of neurons of a particular benchmark model (AlexNet for instance). In the following parts, each case is analyzed and the number of neurons (real or effective) is chosen.
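
The neuron count described above can be sketched in a few lines of Python. Only the first-layer shape (55x55x96) and the total are stated in the text; the remaining layer shapes are the usual AlexNet ones and are an assumption of this sketch, as are the names `ALEXNET_LAYERS`, `count_neurons` and `per_neuron`.

```python
from math import prod

# Output shape of each layer. Shapes after the first one are assumed
# (standard AlexNet), not taken from the text.
ALEXNET_LAYERS = [
    (55, 55, 96),
    (27, 27, 256),
    (13, 13, 384),
    (13, 13, 384),
    (13, 13, 256),
    (4096,),
    (4096,),
    (1000,),
]

def count_neurons(layers):
    """Sum the number of output elements (neurons) over all layers."""
    return sum(prod(shape) for shape in layers)

def per_neuron(value, layers):
    """Normalize a network-level figure (energy, area, ...) per neuron."""
    return value / count_neurons(layers)

first_layer = prod(ALEXNET_LAYERS[0])
total = count_neurons(ALEXNET_LAYERS)
print(first_layer, total)
```

Under these assumed shapes the sum reproduces the total quoted for AlexNet.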

The number of neurons of the other networks has been computed as follows:

1. **MLC-STT** [15]: by looking at the scheme proposed in section 1.3.1, the convolutional neural network has a number of neurons equal to:

   \[
   \#\text{neurons}_{\text{MLC}} = 6 \times 28 \times 28 + 16 \times 10 \times 10 + 84 = 6388 \quad (2.4)
   \]

   This is a binary neural network (XNOR-Net).

2. **SOT** [16]: the SOT architecture uses AlexNet BCNN as reference network, as described in section 1.3.2. The total number of neurons is 659272, as already said. This is a binary neural network (XNOR-Net).

3. **OPNE-IPNE** [26]: the network realized in [26] is a MLP with 6 PIMs of 484-144 neurons as described in Table 1.5.1. The total number of neurons is given by:

   \[
   \#\text{neurons}_{\text{OPNE-IPNE}} = 6 \times (144 + 484) = 3768 \quad (2.5)
   \]

   This is a ternary neural network (XNOR-Net).

4. Neurosynaptic core [26]: MLP with 1024x256 structure. The total number of neurons in this case is 256;

5. **XNOR-RRAM** [19]: only the MLP structure has been considered for the neuron’s count. In particular, as reported in section 1.4.2:

   \[
   \#\text{neurons}_{\text{XNOR-RRAM}} = 3 \times 512 + 10 = 1546 \quad (2.6)
   \]

   This is a binary neural network (XNOR-Net).

6. Mixed-precision [21]: the network structure is 784-250-10 and so the total number of neurons is 260. Weights are binary while inputs are converted into a voltage range;

7. Synaptic weight [22]: two MLPs are implemented, a single 784x10 layer and 20 parallel 784x10 layers. The total number of neurons is 10 and 200 respectively. Arranging the layers in parallel improves the accuracy. Weights are binary, while inputs are converted into a voltage range;

8. Stochastic [28]: LeNet-5 network topology is considered. The stochastic approach is used only in the first layer of this CNN, so the total number of neurons is obtained as:

\[
\#\text{neurons}_{\text{stochastic}} = 32 \times 28 \times 28 = 25088
\] (2.7)

9. HMC [29]: convolutional neural network computed by fetching one layer at a time into the 3D stacked memory. Since the performance provided in [29] refers to a single convolutional layer at a time, a rough estimation can be made considering:

\[
\#\text{neurons}_{\text{HMC}} = \text{mean}(124 \times 124 \times 3, 120 \times 120 \times 3, 117 \times 117 \times 4, 114 \times 114 \times 5, 112 \times 112 \times 3, 108 \times 108 \times 3) = 46948
\] (2.8)

This implementation is in floating-point representation.

10. Chain-NN [30]: the first five convolutional layers of the AlexNet fixed-point are used in this implementation, so:

\[
\#\text{neurons}_{\text{Chain-NN}} = 55 \times 55 \times 96 + 27 \times 27 \times 256 + 13 \times 13 \times 384 + 13 \times 13 \times 256 = 585184
\] (2.9)

11. Energy-efficient [31]: the benchmark model used is VGG-16 (section 1.7.4) with Binary Weight network (BWN) approximation, so \( \alpha \) is computed. The total number of neurons is given by:

\[
\begin{aligned}
\#\text{neurons}_{\text{ee}} ={} & 224 \times 224 \times 64 + 112 \times 112 \times 128 + 112 \times 112 \times 128 \\
& + 56 \times 56 \times 128 + 56 \times 56 \times 256 + 56 \times 56 \times 256 \\
& + 28 \times 28 \times 256 + 28 \times 28 \times 512 + 28 \times 28 \times 512 \\
& + 14 \times 14 \times 512 + 14 \times 14 \times 512 + 14 \times 14 \times 512 \\
& + 7 \times 7 \times 512 = 9759232
\end{aligned}
\quad (2.10)
\]

A comparison of the number of neurons of the analyzed state-of-the-art implementations is presented below.

Table 2.1: Number of neurons (real) of the analyzed architectures

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Number of neurons (real)</th>
<th>Network type</th>
<th>Technology</th>
</tr>
</thead>
<tbody>
<tr>
<td>MLC-STT[15]</td>
<td>6388</td>
<td>XNOR-NET (binary)</td>
<td>MTJ</td>
</tr>
<tr>
<td>SOT [16]</td>
<td>659272</td>
<td>XNOR-NET (binary)</td>
<td>MTJ</td>
</tr>
<tr>
<td>OPNE-IPNE [24]</td>
<td>3768</td>
<td>XNOR-NET (ternary)</td>
<td>SRAM</td>
</tr>
<tr>
<td>Neurosynaptic core [26]</td>
<td>256</td>
<td>Binary weight</td>
<td>SRAM</td>
</tr>
<tr>
<td>XNOR-RRAM (MLP) [19]</td>
<td>1546</td>
<td>XNOR-NET (binary)</td>
<td>RRAM</td>
</tr>
<tr>
<td>Synaptic weight [22]</td>
<td>10</td>
<td>Binary weight</td>
<td>RRAM</td>
</tr>
<tr>
<td></td>
<td>200 (20 layers)</td>
<td>Binary weight</td>
<td>RRAM</td>
</tr>
<tr>
<td>Stochastic [28]</td>
<td>25088</td>
<td>Fixed point</td>
<td>OOM</td>
</tr>
<tr>
<td>HMC [29]</td>
<td>281688</td>
<td>Floating point</td>
<td>OOM</td>
</tr>
<tr>
<td>Chain-NN [30]</td>
<td>585184</td>
<td>Fixed point</td>
<td>OOM</td>
</tr>
<tr>
<td>Energy efficient [31]</td>
<td>9759232</td>
<td>BWN</td>
<td>OOM</td>
</tr>
</tbody>
</table>
Normalized energy

Figure 2.5 reports a comparison of the normalized energy of the different architectures:

\[ \text{NormEnergy} = \frac{\text{Energy[J]}}{\#\text{neurons}} \]  (2.11)

Figure 2.5: Energy comparison: the higher is better. MLC-STT: [15], SOT: [16], OPNE-IPNE: [40], Neurosynaptic core: [26], Stochastic: [28], CPU-CLU: [29]

\[ f \left( \frac{\text{Energy}}{1J} \right) = \frac{\log \left( \frac{\text{NormEnergy}}{1J} \right)}{\min \left[ \log \left( \frac{\text{NormEnergy}}{1J} \right) \right]} \quad (2.12) \]

The minimum value of the logarithm at the denominator is negative, so the bar chart is mapped from negative to positive Y values, resulting in Figure 2.5. The absolute energy values of the different architectures are the following:

Table 2.2: Energy values taken from the documents. The HMC values are computed as the sum of the reported energies, and the network considered is the entire CNN proposed by [29]. The OPNE-IPNE energy value has been derived from [40]: in its result section, the total energy of 0.73J refers to 1 million transactions of the MNIST dataset, so the value reported here is obtained by dividing 0.73J by 1 million.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Number of neurons</th>
<th>Energy [J]</th>
<th>Energy/neuron [J]</th>
</tr>
</thead>
<tbody>
<tr>
<td>MLC-STT</td>
<td>6388</td>
<td>380.0n[15]</td>
<td>59.48p</td>
</tr>
<tr>
<td>SOT</td>
<td>659272</td>
<td>310.4µ[16]</td>
<td>470.9p</td>
</tr>
<tr>
<td>OPNE-IPNE</td>
<td>3768</td>
<td>730.0n [40]</td>
<td>193.7p</td>
</tr>
<tr>
<td>Neurosynaptic core</td>
<td>256</td>
<td>-</td>
<td>45.0p[26]</td>
</tr>
<tr>
<td>Stochastic</td>
<td>25088</td>
<td>542.4n [28]</td>
<td>21.6p</td>
</tr>
<tr>
<td>HMC</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CPU</td>
<td>281688</td>
<td>59.8 [29]</td>
<td>0.212m</td>
</tr>
<tr>
<td>CLU</td>
<td>281688</td>
<td>1.1 [29]</td>
<td>3.83µ</td>
</tr>
</tbody>
</table>
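
As a rough sketch, the normalization of equation (2.11) and the rescaling of equation (2.12) can be reproduced from the values of Table 2.2 as follows. The names `ENERGY`, `normalized_energy` and `rescale` are illustrative; the base of the logarithm is not stated in the text, and base 10 is assumed here (the ratio does not depend on the base).

```python
import math

# (energy [J], number of neurons) as listed in Table 2.2
ENERGY = {
    "MLC-STT": (380.0e-9, 6388),
    "SOT": (310.4e-6, 659272),
    "OPNE-IPNE": (730.0e-9, 3768),
    "Stochastic": (542.4e-9, 25088),
    "HMC (CPU)": (59.8, 281688),
    "HMC (CLU)": (1.1, 281688),
}

def normalized_energy(energy_j, neurons):
    """Equation (2.11): energy per neuron in joules."""
    return energy_j / neurons

def rescale(norm):
    """Equation (2.12): log of each value divided by the smallest log."""
    logs = {name: math.log10(value) for name, value in norm.items()}
    lowest = min(logs.values())   # most negative value, i.e. lowest energy
    return {name: value / lowest for name, value in logs.items()}

norm = {name: normalized_energy(e, n) for name, (e, n) in ENERGY.items()}
norm["Neurosynaptic core"] = 45.0e-12   # given directly per neuron in the table
score = rescale(norm)

for name in sorted(score, key=score.get, reverse=True):
    print(f"{name:20s} {norm[name]:.3e} J/neuron   f = {score[name]:.3f}")
```

Since all normalized energies are below 1 J, every logarithm is negative and the rescaled value lies between 0 and 1, with 1 assigned to the lowest energy per neuron.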

The results of the in-memory implementations are compatible with the expectations:

- **MLC-STT**: since this architecture is based on MTJs, which are analog components, the energy consumption is not heavily affected, because the operations are performed through current manipulations and resistance variations in the mesh array. A single neuron consists of a pair of resistances and a modified sensing circuit that performs the product computation required in the XNOR-Net. Additional external units are employed to compute the batch normalization, pooling, K and \( \alpha \) coefficients.

- **SOT**: similarly to the previous case, the neuron is composed of a single MTJ, so two active cells are needed to compute a logic function at the same time, requiring more energy. The computation is performed by a simple current comparator, which performs all the logical functions required.

- **OPNE-IPNE (SRAM)** [24]: the computation cycle in [40] alternates OPNE and IPNE computation. Once the OPNE part finishes, IPNE starts to produce a serial output. The serial output is then fetched by the following OPNE, which starts its computation concurrently:

Figure 2.6: Macro-pipeline structure [40]. Once OPNE terminates, IPNE starts producing a serial output: this is processed by the following OPNE.

All the components work in parallel until the data stream is finished. Since an OPNE and an IPNE are composed of XNOR gates and adders that compute the pop-counting operation, this implies higher energy consumption. The network is composed of 484 OPNEs, each with an XNOR, an adder and a register for accumulation. The 144 IPNEs are instead composed of 144 XNORs and an adder tree: extending this consideration to the overall dimension of the neural network, the energy reaches 730 nJ for a single dataset transaction. The energy/neuron is worse than in the MLC solution (probably because of the latter's compact structure and the multiple bits stored in a single cell) but better than in SOT.

- **Neurosynaptic core**: this is based on an analog computation, which updates the membrane potential of the neuron, producing a spike under a particular condition. SRAM is used as an analog component, reducing the energy required.

Considering now the OOM architectures:

- Stochastic [28]: it produces the best result in terms of energy required per neuron, because of its simple computation. The bit-stream length is 8 bits, and multipliers and adders are replaced by simple AND gates and multiplexers respectively, which reduces the energy of the system. The architecture uses fixed-point precision and only one layer of the chosen CNN is performed with a stochastic approach.

- HMC [29]: since the structure of this network is based on a hybrid memory cube, the data travel along distances that are reduced by means of TSVs. The big drawback of in-memory structures in terms of energy is the internal interconnections, which are based on very long bit-lines/word-lines. In the HMC structure, the data are fetched from the DRAM layers and processed in the floating-point units; since the computation is divided into vaults, all the logic units perform the convolution simultaneously. The total number of vaults is 16, so in the worst case all the computational units are active at the same time. As expected, these architectures have the worst energy results with respect to the other cases: the CLU performs better than the CPU.
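
The replacement of multipliers and adders by AND gates and multiplexers mentioned for the stochastic architecture can be illustrated with a small Python simulation. This is only a sketch of the general stochastic-computing idea, not of the circuit in [28]: a unipolar encoding (each bit is 1 with a probability equal to the encoded value) is assumed, and the stream length used here is much longer than the 8 bits quoted above so that the statistical error is small. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_stream(p, length):
    """Unipolar encoding (assumed): each bit is 1 with probability p."""
    return rng.random(length) < p

def from_stream(bits):
    """Decode a bit-stream as the fraction of ones."""
    return float(np.mean(bits))

N = 1 << 16                  # stream length chosen for the sketch only
a, b = 0.5, 0.25
sa, sb = to_stream(a, N), to_stream(b, N)

# AND gate: the output bit is 1 only if both inputs are 1, so the
# fraction of ones approximates the product a * b.
product = from_stream(sa & sb)

# Multiplexer with a random select line: each output bit comes from one
# of the two streams, so the fraction of ones approximates (a + b) / 2.
select = to_stream(0.5, N)
scaled_sum = from_stream(np.where(select, sa, sb))

print(product, scaled_sum)
```

The multiplexer yields a scaled sum rather than a plain sum, which keeps the result inside the representable range [0, 1].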

Normalized latency: rough estimation

Some papers also give an indication of latency. In order to properly evaluate the differences between the architectures, the following procedure has been used:

1. A neural network (CNN or MLP) has a certain number of layers, which is determined by simply counting them. In the case of CNNs, pooling and normalization layers are not considered;

2. The architectures analyzed in the documents are not the same! In certain cases, the dimensions of the array influence the delay (for example the XNOR-RRAM implementation in section 1.4.2). In these cases a normalization has to be considered, but the dependency between delay and array size may not be linear. Each architecture processes a convolutional window with a certain number of neurons at a time. This number influences the array dimensions (in the case of in-memory architectures). The problem is also present in MLPs, because they can be considered as convolutional networks too. Normalizing by the number of effective neurons processed at a time rescales the resulting delays to networks of the same dimensions.

A rough estimation has been done considering all the cases:

- **MLC-STT** [15]: by looking at the MLC architecture in section 1.3.1, it is possible to observe that there are 7 layers. Excluding the pooling layers and the last output layer (which simply gives the classification):

  \[
  \#\text{layers}_{\text{MLC-STT}} = 4 \tag{2.13}
  \]

  The number of neurons depends on which layer is considered during the actual computation, so a mean value can be computed:

  \[
  \#\text{neurons}_{\text{MLC-STT}(\text{eff})} = \text{mean}(6 \times 28 \times 28, 16 \times 10 \times 10, 120, 84) = 1627 \tag{2.14}
  \]

  The delay value given by [15] is a cycle time, in which a single convolution is performed. It is equal to 27.24ns and it can be considered as a time/(layer·neuron), since a single convolution provides the output of a single neuron, so no normalization by layers or neurons is required.

  \[
  \text{Delay}_{\text{normalized}(\text{MLC})} = 27.24\text{ns} \tag{2.15}
  \]

- **SOT** [16]: the SOT architecture uses AlexNet, so the total number of layers is 8. Also here a mean value is considered, that indicates how many neurons are processed per time:

  \[
  \#\text{neurons}_{\text{SOT}(\text{eff})} = \text{mean}(55 \times 55 \times 48 \times 2, 27 \times 27 \times 128 \times 2, 13 \times 13 \times 192 \times 2, 13 \times 13 \times 192 \times 2, 13 \times 13 \times 128 \times 2, 2048 \times 2, 2048 \times 2, 1000) = 82409 \tag{2.16}
  \]
The delay given by [16] is 10.7 ms, which is the total delay with batch normalization, scaling factors and convolution computation. This value can be divided by the number of layers and the neurons:

\[
\text{Delay}_{\text{normalized(SOT)}} = \frac{10.7 \text{ms}}{8 \cdot 82409} \approx 16.3 \text{ns} \quad (2.17)
\]

- **OPNE-IPNE** [40]: since the network is an MLP with 6 PIMs of 484-144 neurons, the total number of layers is 13. The number of neurons processed per time is always 484 (by observing the macro-pipelined structure in Figure 2.6), so:

\[
\#\text{neurons}_{\text{OPNE-IPNE(eff)}} = 484 \quad (2.18)
\]

The clock frequency of 400MHz is given and, for this estimation, one neuron computes its output after 484 clock cycles, since the OPNE computation constrains the required time as indicated in Figure 2.6:

\[
\text{Delay}_{\text{normalized(OPNE-IPNE)}} = T_{\text{ck}} \times 484 = \frac{484}{f_{\text{ck}}} = 2.5 \text{ns} \times 484 \approx 1.21 \mu s \quad (2.19)
\]

- **Neurosynaptic core** [26]: with an MLP structure, the total number of neurons is always 256 with only 1 layer. The delay of 1ms indicated by [26] is a cycle time, and so every millisecond the output is evaluated:

\[
\text{Delay}_{\text{normalized(Neurosynaptic)}} = \frac{1\text{ms}}{256} \approx 3.91 \mu s \quad (2.20)
\]

- **XNOR-RRAM** [19]: here the case is a little bit different, because the CNN is implemented with subarrays having the same dimensions (128x128 as already said in Table 1.4.2) [19] arranged in parallel. Their outputs are fetched by the remaining logic, that computes the convolution. Also the MLSA bit level influences the delay value because of its complexity. For the sake of simplicity, the total number of layers in the MLP implementation is considered, which results equal to 4, while the total number of neurons is given by:

\[
\#\text{neurons}_{\text{XNOR-RRAM(eff)}} = \text{mean}(512, 512, 512, 10) \approx 386 \quad (2.21)
\]

The value given by [19] is the delay of a single subarray of 128x128, without considering the cost of the decoding and other computations normally executed in a XNOR-Net, and it is equal to 16.69ns, which is already the delay required for the computation of a single neuron, so:

\[ \text{Delay}_{\text{normalized(XNOR-RRAM)}} = 16.69\text{ns} \quad (2.22) \]

- HMC [29]: since the architecture fetches one layer at a time into the 3D stacked memory, the average of the per-layer delays given by [29] is estimated, so the number of layers used in the normalization is 1. [29] provides two delay results: one refers to a CPU floating-point implementation, the other to the HMC (CLU) (already specified in section 1.7.2). The mean number of neurons processed at a time is given by:

\[
\#\text{neurons}_{\text{HMC(eff)}} = \text{mean}(124 \times 124 \times 3, 120 \times 120 \times 3, 117 \times 117 \times 4, 114 \times 114 \times 5, 112 \times 112 \times 3, 108 \times 108 \times 3) = 46948 \quad (2.23)
\]

So the corresponding delays are computed from the data provided in seconds:

\[
\text{Delay}_{\text{normalized(HMC-CPU)}} = \frac{1.149 + 1.0808 + 0.6643 + 0.832 + 0.808 + 0.873 + 0.809}{7 \cdot 46948} \approx 18.91\mu s \quad (2.24)
\]

\[
\text{Delay}_{\text{normalized(HMC-CLU)}} = \frac{0.0138 + 0.0133 + 0.0086 + 0.0165 + 0.0091 + 0.0105 + 0.0102}{7 \cdot 46948} \approx 249.52\text{ns} \quad (2.25)
\]

- Chain-NN [30]: only 5 convolutional layers of AlexNet are used in [30]. The delay of 353.17ms given by [30] is a total delay, so it has to be divided by the total number of layers and by the number of neurons. As specified in [30], the number of neurons processed at a time is equal to \(C_{in} \times (2K - 1) \times w_{in}\), where K is the kernel size of the AlexNet layer. Also here an average is considered:

\[ \#\text{neurons}_{\text{chain-NN(eff)}} = \text{mean}(224 \times (2 \times 11 - 1), 55 \times 96 \times (2 \times 5 - 1), 27 \times 256 \times (2 \times 3 - 1), 13 \times 384 \times (2 \times 3 - 1), 13 \times 384 \times (2 \times 3 - 1)) \approx 27340 \quad (2.26) \]

The resulting delay is:

\[ \text{Delay}_{\text{normalized(Chain-NN)}} = \frac{353.17\,ms}{5 \cdot 27340} \approx 2.58\,\mu s \quad (2.27) \]

- Energy-efficient [31]: VGG-16 structure has 13 layers and the value given by [31] of 35.1ms is the total delay of the architecture that has to be divided by the total number of layers and by the number of neurons. Considering the overall benchmark model, the effective number of neurons can be computed as:

\[ \#\text{neurons}_{\text{ee(eff)}} = \text{mean}(224 \times 4 \times 64, 112 \times 4 \times 128, 112 \times 4 \times 128, 56 \times 4 \times 128, 56 \times 4 \times 256, 56 \times 4 \times 256, 28 \times 4 \times 256, 28 \times 4 \times 512, 28 \times 4 \times 512, 14 \times 4 \times 512, 14 \times 4 \times 512, 14 \times 4 \times 512, 7 \times 4 \times 512) = 43008 \quad (2.28) \]

So the delay is:

\[ \text{Delay}_{\text{normalized(ee)}} = \frac{35.1\,ms}{13 \cdot 43008} \approx 63\,ns \quad (2.29) \]
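
The normalization used in the cases above can be summarized with a short Python sketch. The helper name `normalized_delay` and the dictionary layout are illustrative assumptions; the figures are the ones quoted in the text for the SOT, Chain-NN and Energy-efficient cases, plus the cycle-based OPNE-IPNE estimate.

```python
def normalized_delay(total_delay_s, layers, neurons_eff):
    """Total delay divided by the number of layers and effective neurons."""
    return total_delay_s / (layers * neurons_eff)

# (total delay [s], number of layers, effective neurons), from the text
CASES = {
    "SOT": (10.7e-3, 8, 82409),
    "Chain-NN": (353.17e-3, 5, 27340),
    "Energy-efficient": (35.1e-3, 13, 43008),
}

delays = {name: normalized_delay(*args) for name, args in CASES.items()}

# Cycle-based case (OPNE-IPNE): number of clock cycles times clock period.
f_ck = 400e6                       # clock frequency in Hz
delays["OPNE-IPNE"] = 484 / f_ck   # 484 cycles per neuron output

for name, value in delays.items():
    print(f"{name:18s} {value * 1e9:10.2f} ns")
```

Architectures whose source already reports a per-neuron cycle time (MLC-STT, XNOR-RRAM) need no further division and are therefore left out of the sketch.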

In order to compare the values obtained, a bar plot is provided, in which the Y axis reports the normalized latency values rescaled from 0 to 1, as already done in the previous case:

Figure 2.7: Delay comparison: the higher is better. MLC-STT [15], SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19], HMC [29], Chain-NN [30], Energy-efficient [31]

1. MLC-STT [15]: the delay obtained is very small because of the internal structure: multiple bits are stored in the same cell, and this reduces the costs. The network used is smaller than the other ones, and this can affect the delay, since the bit-line/word-line lengths increase with the network’s complexity. All the computations are done in parallel, so a convolutional window is computed by multiple CIM arrays and this speeds up the execution. Since it is an analog solution that computes the logic operations by comparing current levels, it is expected to be faster than a digital one: in general this holds for all the analog solutions analyzed;

2. SOT [16]: this case is very similar to the previous one, since it is an analog solution. The cells are composed of a single MTJ [16] and multiple cells are activated simultaneously to perform a logic function. In general, the performance is expected to be comparable to the previous case based on MLC;

3. **OPNE-IPNE** [40]: the cost of the neuron’s intrinsic structure (composed of three RAM cells, an XNOR, an adder and a register, as already said) is heavy in terms of delay. The architecture is constrained by the clock frequency, the corresponding critical path and the **OPNE** computation, which influences the execution time;

4. Neurosynaptic core [26]: the delay is determined by the cycle time of 1ms, so an entire result is produced with a frequency of 1kHz. The architecture is based on the neuron’s membrane potential update, which is an analog approach based on comparators that slows the system;

5. XNOR-RRAM [19]: better results are obtained in this case thanks to parallel computation with multiple sub-arrays. A logical operation is performed by fetching data from two RRAM cells, and the resulting current is compared with a reference. Also in this case, the considerations made for an analog solution are valid;

6. HMC [29]: this case is very interesting, because the performance obtained with a CPU is also reported. Here the computations are in floating point and, as expected, the CPU case is the worst in terms of execution time. The usage of a 3D memory stack (CLU) reduces the execution time thanks to the reduced wire lengths, the main characteristic of 3D memories with logic. This gives better performance than the **OPNE-IPNE** case, because with shorter wires the clock frequency can be higher;

7. Chain-NN [30]: a good result is also obtained by Chain-NN, which is a pipelined structure in fixed-point representation. The pipeline reduces the critical path delay and the fixed-point numbers reduce the computational cost;

8. Energy-efficient [31]: this architecture is a BWN OOM that reduces the data fetching from the external memory by using efficient techniques. Computations are performed in an approximate form (CA1 instead of CA2) with error compensation: this reduces the ±1 multiplier to a simple structure, as indicated in section 1.7.4. The usage of simpler logical structures reduces the clock frequency required and speeds up the operations: the delay is comparable to an analog solution.

Normalized area: rough estimation

Also in this case, a similar approach has been applied. The area taken from the documents has been divided by the effective number of neurons that a specific architecture processes at a time, in order to obtain a comparable normalized area per neuron. The areas are listed below:

Table 2.3: Area values of different architectures.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Number of neurons (eff)</th>
<th>Area $[mm^2]$</th>
<th>Area/# neurons $[mm^2]$</th>
</tr>
</thead>
<tbody>
<tr>
<td>SOT [16]</td>
<td>82409</td>
<td>5.28</td>
<td>$64.1 \times 10^{-6}$</td>
</tr>
<tr>
<td>OPNE-IPNE [40]</td>
<td>3768</td>
<td>3.90</td>
<td>$1.0 \times 10^{-3}$</td>
</tr>
<tr>
<td>Neurosynaptic core [26]</td>
<td>256</td>
<td>4.20</td>
<td>$16.4 \times 10^{-3}$</td>
</tr>
<tr>
<td>XNOR-RRAM [19] (MLP)</td>
<td>386</td>
<td>1.686</td>
<td>$4.36 \times 10^{-3}$</td>
</tr>
<tr>
<td>Stochastic [28]</td>
<td>25088</td>
<td>1.32</td>
<td>$52.7 \times 10^{-6}$</td>
</tr>
<tr>
<td>HMC [29]</td>
<td>46948</td>
<td>729.00</td>
<td>$15.5 \times 10^{-3}$</td>
</tr>
<tr>
<td>Energy-efficient [31]</td>
<td>43008</td>
<td>44.96</td>
<td>$1.0 \times 10^{-3}$</td>
</tr>
</tbody>
</table>

The number of neurons of the OPNE-IPNE [40] is the real number of neurons, since the area refers to the entire implementation with 6 PIMs. The same holds for the Neurosynaptic core. The stochastic case instead considers only the first layer of the CNN, so also in this case the effective number of neurons is the real one already calculated. The values of the other cases are taken from the previous part. All the areas indicated in the table are obtained from the papers, except for the XNOR-RRAM one, which has been computed as:

\[
\text{Area}_{\text{XNOR-RRAM}} = \#\text{sub-arrays} \cdot \text{Area}_{\text{subarray}(128 \times 128)} = 36 \times 46824.1 \mu m^2 \simeq 1.69 mm^2 \quad (2.30)
\]
The best results are obtained in the stochastic [28] and SOT [16] cases. The stochastic architecture is very efficient in terms of computational blocks used: a convolution operation (which is composed of multiply-and-accumulate sequences) is simply realized with ANDs (multiply) and multiplexers (accumulate). This solution is probably also the slowest one, since each number is transformed into a bit sequence (for example the number 15 represented on 4 bits is transformed into a sequence of 15 bits). The SOT case reaches a good area performance, since each cell of the array is composed of a single MTJ. HMC [29] has poor area performance due to its 3D structure and floating-point representation: the whole chip has an area of $729 \text{ mm}^2$, since floating-point units are required and they increase the area occupation. The OPNE-IPNE [40] value is very good considering the neuron’s structure, and is similar to the XNOR-RRAM [19] case. The area provided by [40] is the chip area, so it also includes the control circuit. The neurosynaptic core [26] occupation is degraded by the analog circuits that perform the membrane potential update: this is a slow and large but energy-efficient solution. Last but not least, the Energy-efficient solution [31] reaches good performance in terms of area (since the implemented neural network is a binary weight network, \( \mathbf{K} \) is not computed).

2.1.2 Conclusions

1. **MLC-STT**: very good in terms of energy/latency. Since it is an analog solution, it is subject to noise errors and unwanted parasitic effects. Small networks (or array partitioning) can be realized with this solution;

2. **SOT**: good in terms of energy/latency/area. The same considerations made for MLC are valid;

3. **OPNE-IPNE**: good energy/latency/area. Since it is a synchronous architecture, latency is influenced by the critical path, but the calculation accuracy is higher than in the previous cases since it is entirely implemented in digital logic;

4. Neurosynaptic core: worse latency/area results, but good energy is achieved. The motivations have already been explained;

5. **XNOR-RRAM**: very good in terms of latency and area, so a fast solution can be realized with RRAM technology. Partitioning is required to reduce the parasitic effects;

6. **HMC (CLU)**: worse area/latency/energy performance w.r.t. the others, but it reaches the highest precision thanks to floating-point computations. It is an interesting application of a 3D memory, which reaches very good performance w.r.t. the CPU-based neural network implementation;

7. **Energy-efficient**: good in terms of area/latency thanks to its optimizations. Computations are performed in fixed point and multipliers are replaced by approximated adders, in order to reduce power, latency and energy.
Chapter 3

Software implementation

The neural network type that has been chosen is the XNOR-Net with the MNIST dataset, since it offers good trade-offs in terms of accuracy, power and latency. The steps used to implement an XNOR-Net are the following:

1. Neural network implementation and training with Python using TensorFlow and Keras;

2. Parameters extracted from the Python implementation are fed to a MATLAB model for verification purposes.

The chosen dataset is the smallest one, because of the shorter simulation/synthesis time required by the VHDL model.

3.1 Network model

The neural network model used in this analysis is the following one:

As can be seen, the network structure is composed of **input-image**, **max-pooling**, **convolution**, **batch normalization**, **ReLU**, **flatten** and **fully connected** layers. No layer uses padding (the padding is zero): the dimensions of the feature maps, strides and filters have been chosen accordingly, so that no input resizing is needed. The layers are described below, after introducing some notation:

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$w_{in}$</td>
<td>Input image dimension</td>
</tr>
<tr>
<td>$w_{filter}$</td>
<td>Kernel window dimension</td>
</tr>
<tr>
<td>$w_{out}$</td>
<td>OFMAP size, output dimension</td>
</tr>
<tr>
<td>$c_{out}$</td>
<td>Number of output channels</td>
</tr>
<tr>
<td>$c_{in}$</td>
<td>Number of input channels</td>
</tr>
<tr>
<td>$h$</td>
<td>Height of the memory</td>
</tr>
<tr>
<td>$w$</td>
<td>Width of the memory</td>
</tr>
</tbody>
</table>

Table 3.1: Notations used

- The **input layer** is a matrix of dimension 28x28x1 pixels which represents a digit from **MNIST** handwritten digit dataset;
- The **Max-pooling** layer is the first computational layer encountered in the neural network. The parameters used in this layer are the following:

\[
\begin{aligned}
  w_{in} &= 28 \\
  w_{filter} &= 2 \\
  \text{stride} &= 2 \\
  w_{out} &= \frac{w_{in} - w_{filter}}{\text{stride}} + 1 = \frac{28 - 2}{2} + 1 = 14
\end{aligned}
\]

The pooling layer has been placed before the convolutional layer: this technique, called earlier pooling and already used in [31], decreases the power required and the computational complexity of the following layers, since it reduces the dimensions of the input from 28x28 to 14x14, so the convolutional layer has to process a smaller input image without losing too much precision;

- **Convolutional layer** takes the pooled image and convolves it with 6 different kernels and so 6 OFMAPs are obtained, one for each kernel. The parameters used in this layer are the following:

\[
\begin{aligned}
  w_{in} &= 14 \\
  w_{filter} &= 2 \\
  \text{stride} &= 1 \\
  w_{out} &= \frac{w_{in} - w_{filter}}{\text{stride}} + 1 = \frac{14 - 2}{1} + 1 = 13
\end{aligned}
\]

Here no bias values have been applied in order to reduce the complexity;

- After the convolutional layer, **Batch normalization** ("batchnorm") is applied. Each IFMAP is normalized w.r.t. the mean and variance computed over a batch. The batchnorm layer is useful during the training phase, because it speeds up the convergence of the training algorithm (SGD for example) and improves stability. Two further learnable factors, $\gamma$ and $\beta$, are considered in the batchnorm, producing the following output:

\[
\hat{x} = \frac{x - \mu}{\sigma} \cdot \gamma + \beta \quad (3.1)
\]

- **ReLU** is the activation function used in the network to allow better training performances. Compared to other activation functions, this is the simplest one, since it simply takes the maximum between 0 and its input. In hardware, this is simply realized by a multiplexer selecting between input and 0 based on the sign of the input itself.

\[ ReLU = \max(0, x) \]  \hspace{1cm} (3.2)

- **Flatten** layer transforms the inputs (IFMAPs) into a vector, so if 6 IFMAPs of 13x13 pixels are considered, the output vector dimension is given by:

\[ w_{out} = w_{in}^2 \cdot c_{out} = 13 \cdot 13 \cdot 6 = 1014 \]  \hspace{1cm} (3.3)

This vector is fed to the fully connected part (MLP);

- **Fully connected** takes the output of the flatten layer and, by means of an MLP with size 1014-10, gives the classification at the output. The highest value among the last 10 neurons corresponds to the output classification.
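
The layer sizes listed above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `output_width` is an assumption introduced here, while the numeric parameters are the ones given in the list.

```python
def output_width(w_in, w_filter, stride):
    """Output width of a pooling/convolution window without padding."""
    return (w_in - w_filter) // stride + 1


# Parameters taken from the layer descriptions above.
w_pool = output_width(w_in=28, w_filter=2, stride=2)      # max pooling
w_conv = output_width(w_in=w_pool, w_filter=2, stride=1)  # convolution
n_kernels = 6                                             # number of OFMAPs
flatten_size = w_conv * w_conv * n_kernels                # flatten layer

print(w_pool, w_conv, flatten_size)
```

The three printed values correspond to the pooled width, the OFMAP width and the length of the vector fed to the fully connected part.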

3.2 Network’s computational model

As already said, the network model is a XNOR-Net, which means that the convolution is approximated as

\[ I \ast W \approx (\text{sign}(I) \otimes \text{sign}(W)) \cdot K \alpha \]  \hspace{1cm} (3.4)

where \( \otimes \) denotes the XNOR-bitcount operation, and K and \( \alpha \) are defined as:

\[
K = |\text{Input}| \ast \begin{bmatrix}
\frac{1}{w_{filter}^2} & \frac{1}{w_{filter}^2} & \cdots \\
\frac{1}{w_{filter}^2} & \frac{1}{w_{filter}^2} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
\]  \hspace{1cm} (3.5)
The size of the $\frac{1}{w_{filter}^2}$-matrix is the same as that of the kernel.

$$\alpha = \frac{\sum_{i=1}^{N} |W_i|}{N}$$  \hspace{1cm} (3.6)

As can be seen, $K$ is obtained by filtering the absolute input with a matrix whose elements are all equal, while $\alpha$ is a scalar. An example is reported here:

Two computations are reported: the real case, which simply executes the sum of products of the kernel with the windowed part of the input and the XNOR net. The steps to compute the output in the second case are:

1. Computation of $\alpha$ as the mean of the absolute values of the kernel elements;

2. Computation of $K$ as the windowed part of input convolved with the matrix
defined before. In this case the computation is defined as:

$$K(1,1) = | -0.4| \cdot \frac{1}{2^2} + |0.2| \cdot \frac{1}{2^2} + |0.3| \cdot \frac{1}{2^2} + |0.4| \cdot \frac{1}{2^2} = 0.325 \quad (3.7)$$

The other values are obtained with the same approach. $K$ is the mean of the absolute values of the input sub-matrix covered by the convolutional window. If more than one input channel is evaluated, $K$ is computed as the element-wise absolute sum over all the IFMAPs divided by the number of channels and convolved with the $\frac{1}{w_{filter}^2}$-matrix defined before:

$$K = \left( \frac{1}{c_{in}} \sum_{c=1}^{c_{in}} |\text{Inputs}(:,:,c)| \right) \ast \begin{bmatrix} \frac{1}{w_{filter}^2} & \frac{1}{w_{filter}^2} & \cdots \\ \frac{1}{w_{filter}^2} & \frac{1}{w_{filter}^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \quad (3.8)$$

3. Binarization of inputs/weights by taking the sign. When the input/weight is 0, the sign function returns -1, so:

$$\text{Binarize}(x) = \begin{cases} -1, \text{ when } x \leq 0 \\ +1, \text{ when } x > 0 \end{cases} \quad (3.9)$$

4. Binary convolution between the binary-input and binary kernel. Considering, for example, the first result, this step is performed as:

$$\text{BinConv}(1,1) = -1 \cdot (-1) + 1 \cdot 1 + 1 \cdot 1 + (-1) \cdot 1 = 2 \quad (3.10)$$

5. Xnor-convolution: the output is then computed by the element-wise multiplication between the binary OFMAP and $K$. This new matrix is then multiplied by the scalar alpha. Considering the first element of the OFMAP:

$$\text{OFMAP}(1,1) = \text{BinConv}(1,1) \cdot \alpha \cdot K(1,1) = 2 \cdot 0.4 \cdot 0.325 \approx 0.26 \quad (3.11)$$
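
The five steps can be sketched for a single convolutional window in plain Python. The window values are the ones appearing in Equation 3.7; the kernel values are a hypothetical choice (not taken from the original example) picked so that their mean absolute value is 0.4, as in Equation 3.11. The function names are assumptions introduced for this sketch.

```python
def sign(x):
    """Binarization of Equation 3.9: 0 is mapped to -1."""
    return 1 if x > 0 else -1


def xnor_window(window, kernel):
    """Approximate one convolution output as BinConv * alpha * K."""
    n = len(kernel)
    alpha = sum(abs(w) for w in kernel) / n   # step 1, Equation 3.6
    k = sum(abs(i) for i in window) / n       # step 2, Equation 3.7
    bin_conv = sum(sign(i) * sign(w)          # steps 3 and 4
                   for i, w in zip(window, kernel))
    return bin_conv, alpha, k, bin_conv * alpha * k   # step 5


window = [-0.4, 0.2, 0.3, 0.4]    # 2x2 input window, row by row
kernel = [-0.5, 0.3, 0.6, -0.2]   # hypothetical 2x2 kernel, row by row

bin_conv, alpha, k, approx = xnor_window(window, kernel)
exact = sum(i * w for i, w in zip(window, kernel))    # real-valued case
print(bin_conv, round(alpha, 3), round(k, 3), round(approx, 3), round(exact, 3))
```

Printing the exact sum of products next to the approximated value gives a feeling for the error introduced by the binarization on this particular window.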
These assumptions are valid in the case of a convolutional layer. If the fully connected layer is considered, the $K$ computation is different, since the sub-matrix considered has the size of the entire input.

![Fully connected layer - toy example](image)

**Figure 3.3**: Fully connected layer - toy example

In order to compute $K$, the thing to consider is that the actual dimension of the kernel is equal to the number of input neurons. $K$ becomes a scalar which is simply given by the mean of absolute input values:

$$K_{fc} = \frac{\sum_{i=1}^{N} |I_i|}{N}$$  \hspace{1cm} (3.12)

### 3.2.1 Python code

A software implementation of the proposed neural network has been realized in Python, with source code based on [51], since Python with TensorFlow and Keras allows a very easy and straightforward realization and training of every kind of neural network (from CNNs to MLPs). In order to reduce the complexity of the synthesis and simulation of the VHDL implementation, the simplest CNN structure containing all the most used components of a neural network has been chosen: **max pooling**, **convolution**, **batch normalization**, **ReLU**, **flatten** and **fully connected**. An extract of the Python code from [51] is reported here:

1. `# nn parameters`: specification of the parameters used in the neural network. `batch_size` is the number of images processed together during forward/backward propagation in the training process; `epochs` specifies the number of times the entire training set is fed through the neural network. The training set size in this case is equal to 60000; `nb_channel`, `img_rows` and `img_cols` specify the input dimensions, which are equal to 28x28x1; `nb_classes` indicates the total number of classes that can be obtained at the output, so if MNIST is used, 10 classes can be recognized (from 0 to 9).

```python
from binary_ops import binary_tanh as binary_tanh_op
from xnor_layers import XnorDense, XnorConv2D
H = 1.
# nn parameters
batch_size = 10
epochs = 5
nb_channel = 1
img_rows = 28
img_cols = 28
nb_classes = 10
use_bias = False
```

2. `# learning rate schedule`: specifies the behavior of the learning rate \( \eta \) during the training phase;

```python
# learning rate schedule
lr_start = 1e-3
lr_end = 1e-4
lr_decay = (lr_end / lr_start)**(1. / epochs)
```

3. `# BatchNorm`: the parameters specified for the batch normalization layer are *epsilon* and *momentum*. *epsilon* is a small additive term that improves numerical stability during the training process and it is used as follows:

\[
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta \tag{3.13}
\]

In the VHDL implementation, this term has been neglected, since it is very small (\( \sim 10^{-5} \)). The parameter *momentum* controls how the BatchNorm layer updates the running mean and variance in each iteration step. In Keras the momentum weights the previous running value, so a momentum of 0.9 translates to:

\[
\mu(t) = \text{momentum} \cdot \mu(t - 1) + (1 - \text{momentum}) \cdot \text{batch\_mean}(t)
\]

\[
= 0.9 \cdot \mu(t - 1) + 0.1 \cdot \text{batch\_mean}(t)
\]

The momentum parameter therefore keeps memory of the previous mean values, improving stability.

```python
# BatchNorm
epsilon = 1e-6
momentum = 0.9
```

4. `# MNIST loading`: loads the MNIST dataset, which is composed of images of 28x28 pixels with values in the range 0 to 255. The training and test images are reshaped and then scaled between 0 and 1.

```python
# MNIST loading
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 1, 28, 28)
X_test = X_test.reshape(10000, 1, 28, 28)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train = X_train / 255
X_test = X_test / 255

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes) * 2 - 1 # -1 or 1 for hinge loss
Y_test = np_utils.to_categorical(y_test, nb_classes) * 2 - 1
```

5. `# Neural network realization`: each layer is added sequentially. A XNOR-Net is not a standard network, so ad-hoc layers are designed for the convolutional and fully connected parts. In the code they are called `XnorConv2D` and `XnorDense`, and they execute the operations already described for computing $\alpha$ and $\mathbf{K}$. The optimizer is set to Adam, already described in section 1.2.4.

```python
# Neural network realization
model = Sequential()
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), name='pool1', input_shape=(nb_channel, img_rows, img_cols)))
model.add(XnorConv2D(6, kernel_size=(2, 2), strides=(1, 1), H=H, padding='valid', use_bias=use_bias, name='conv1'))  # 6 kernels, as in the model summary
model.add(BatchNormalization(epsilon=0, momentum=momentum, axis=1, name='bn1', mode=0, trainable=True))
model.add(Activation('ReLU', name='act1'))
model.add(Flatten())
# dense1
model.add(XnorDense(nb_classes, use_bias=use_bias, name='dense3'))
opt = Adam(lr=lr_start)
model.compile(loss='squared_hinge', optimizer=opt, metrics=['acc'])
model.summary()
```

6. Learning rate scheduler and model building: this part trains the network with the options described. At the end of the training, the accuracy is evaluated and trained network parameters can be saved.

```python
lr_scheduler = LearningRateScheduler(lambda e: lr_start * lr_decay ** e)
history = model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_test, Y_test), callbacks=[lr_scheduler])
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
```
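
The learning rate schedule of point 2 can be made concrete with a short stand-alone sketch. It assumes, as in the `LearningRateScheduler` callback used above, that the epoch index passed to the schedule starts from 0; the function name `learning_rate` is introduced here only for illustration.

```python
lr_start = 1e-3
lr_end = 1e-4
epochs = 5

# Constant multiplicative factor applied at every epoch.
lr_decay = (lr_end / lr_start) ** (1.0 / epochs)


def learning_rate(epoch):
    """Learning rate at a given epoch index (0-based, assumed)."""
    return lr_start * lr_decay ** epoch


for e in range(epochs):
    print(e, learning_rate(e))
```

With a 0-based index the rate decays geometrically from `lr_start`, and `lr_end` would be reached exactly at index `epochs`, that is one step after the last training epoch.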

**Training and binarization**

During training, the gradient to propagate is computed using derivatives, as already explained in subsection 1.1.4. Weights and inputs are binarized with sign functions, whose derivative is equal to 0 almost everywhere, so the gradient would vanish. In binary neural networks an alternative binarization process is therefore used; the source code from [51] is reported here:

```python
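import keras.backend as K  # assumed import: K is the Keras backend alias

def round_through(x):
    # Forward pass: round(x); backward pass: gradient of the identity
    rounded = K.round(x)
    return x + K.stop_gradient(rounded - x)

def _hard_sigmoid(x):
    # Linear function 0.5 * x + 0.5 clipped to [0, 1]
    x = (0.5 * x) + 0.5
    return K.clip(x, 0, 1)

def binary_sigmoid(x):
    return round_through(_hard_sigmoid(x))

def binary_tanh(x):
    # Rescale {0, 1} to {-1, +1}
    return 2 * round_through(_hard_sigmoid(x)) - 1

def binarize(W, H=1):
    # Binarize the weights to {-H, +H}
    Wb = H * binary_tanh(W / H)
    return Wb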

The operations performed in this code are the following:

1. `round_through(x)`: takes x as input and rounds it to the nearest integer. The term wrapped in `K.stop_gradient` is treated as a constant during back-propagation: the forward pass returns the rounded value, while the gradient is the one of the identity function.

2. `hard_sigmoid(x)`: the linear function \( 0.5 \cdot x + 0.5 \) clipped between the values 0 and 1. The corresponding output is:

\[
\text{hard}\_\text{sigmoid}(x) = \begin{cases} 
0, & \text{when } x \leq -1 \\
0.5 \cdot x + 0.5, & \text{when } -1 < x < +1 \\
1, & \text{when } x \geq 1
\end{cases}
\]  

3. `binary_tanh(x)`: is obtained by applying the `round_through(x)` function to the `hard_sigmoid(x)` and rescaling the result between ±1.
When an input/weight has to be binarized, the function \( \text{binarize}(W, H=1) \) is called and the final result is:

\[
\begin{aligned}
\text{binarize}(x) &= \text{binary\_tanh}(x) \\
\text{binary\_tanh}(x) &= 2 \times \text{round\_through}(\text{hard\_sigmoid}(x)) - 1 \\
\text{round\_through}(\text{hard\_sigmoid}(x)) &= \text{round}(\text{hard\_sigmoid}(x))
\end{aligned}
\]

By flattening the equations:

\[
\text{binarize}(x) = \begin{cases} 
2 \times \text{round}(0) - 1, & \text{when } x \leq -1 \\
2 \times \text{round}(0.5 \cdot x + 0.5) - 1, & \text{when } -1 < x < +1 \\
2 \times \text{round}(1) - 1, & \text{when } x \geq 1 
\end{cases}
\] (3.15)

Rounding in the (-1,1) interval can be split into two parts: if the value of hard_sigmoid(x) is less than or equal to 0.5, that is if \( x \leq 0 \), the result is rounded to 0, otherwise to 1. This procedure provides a piece-wise approximation of the sign function and of the estimators used in the training process. Note that the value 0 is mapped to -1 instead of +1. In VHDL this binarization method is not used, because it is useful only during the training process: it is sufficient to take the sign of the incoming input (with the exception of 0, which is mapped to -1 as in Equation 3.9).
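
The forward behavior of the binarization can be reproduced without Keras. The following is only a sketch in plain Python: it ignores the gradient handling of `round_through` and assumes that the backend rounding resolves ties to the nearest even integer, as Python's built-in `round` does.

```python
def hard_sigmoid(x):
    """Linear function 0.5 * x + 0.5 clipped to [0, 1]."""
    return min(max(0.5 * x + 0.5, 0.0), 1.0)


def binarize(x):
    """Forward pass of binary_tanh with H = 1."""
    # round(0.5) == 0 (tie to even), so x = 0 ends up at -1.
    return 2 * round(hard_sigmoid(x)) - 1


samples = [-2.0, -0.3, 0.0, 0.3, 2.0]
print([binarize(v) for v in samples])
```

The sample list covers the three regions of Equation 3.15 plus the tie case at 0.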

**Output of the program**

At the beginning of the program, Python reports the neural network’s structure summary:

<table>
<thead>
<tr>
<th>Layer (type)</th>
<th>Output Shape</th>
<th>Param #</th>
</tr>
</thead>
<tbody>
<tr>
<td>pool1 (MaxPooling2D)</td>
<td>(None, 1, 14, 14)</td>
<td>0</td>
</tr>
<tr>
<td>conv1 (XnorConv2D)</td>
<td>(None, 6, 13, 13)</td>
<td>24</td>
</tr>
<tr>
<td>bn1 (BatchNormalization)</td>
<td>(None, 6, 13, 13)</td>
<td>24</td>
</tr>
<tr>
<td>act1 (Activation)</td>
<td>(None, 6, 13, 13)</td>
<td>0</td>
</tr>
</tbody>
</table>
During training, the accuracy is continuously evaluated in order to monitor how the recognition rate of the network evolves (train accuracy). At the end of an epoch, the network is tested and the corresponding result is called test accuracy. The accuracy trend of this simple neural network is the following:

![Accuracy Trend](image)

*Figure 3.4: Accuracies’ trend over 5 epochs and batch size of 10*
In general, increasing the number of epochs improves the accuracy until the network saturates: the saturation level is determined by the complexity of the network itself and by the total number of learnable parameters available. Another important parameter is the batch size: the smaller it is, the higher the accuracy achievable within an epoch, as reported in Figure 3.5. The following plot shows how the number of epochs influences the accuracy:

**Figure 3.5:** Accuracies’ trend over 5 epochs and batch size of 100
As can be seen, the train accuracy resembles a logarithmic curve: it grows quickly at first and then flattens, so a saturation is always reached.

**Approximations**

Once the network is trained, some approximations have to be considered in order to simplify the VHDL implementation. One of the most computationally intensive parts of the neural network is the **fully connected layer**, since the computation of $\alpha$ and $K$ involves a very large number of elements. In the fully connected layer $K$ becomes the mean of the absolute values of the inputs, so for the neural network model in Figure 3.1 it is a calculation over 1014 values, while $\alpha$ is the mean of the absolute values of each set of weights, considered separately, as shown in the following figure:

Figure 3.7: Example of $K$ and $\alpha_i$ computation in the fully connected layer

To avoid this computational bottleneck, it is possible to neglect the computation of $K$ and $\alpha$ for the fully connected layer and to approximate the output of a neuron as:

$$I \ast W \approx (\text{sign}(I) \otimes \text{sign}(W)) \cdot K \alpha \approx (\text{sign}(I) \otimes \text{sign}(W))$$ (3.16)

In order to evaluate the impact of this approximation, the neural network has been trained with the original configuration and then in the computational model these two parameters have been removed:

```python
class XnorDense(BinaryDense):
    def call(self, inputs, mask=None):
        inputs_a, inputs_b = xnorize(inputs, 1., axis=1,
                                    keepdims=True) # (nb_sample, 1)
        kernel_a, kernel_b = xnorize(self.kernel, self.H, axis=0,
                                    keepdims=True) # (1, units)
        print(K.get_value(kernel_a))
        # Original: XNOR dot product scaled by alpha (kernel_a) and K (inputs_a)
        output = K.dot(inputs_b, kernel_b) * kernel_a * inputs_a
        # Approximated: scaling factors neglected
        # output = K.dot(inputs_b, kernel_b)
```
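
The difference between the original and the approximated fully connected output can also be illustrated without Keras. The sketch below is not the code of [51]: the function `xnor_dense`, its `scaled` flag and the toy values are assumptions introduced here, following Equations 3.12 and 3.16.

```python
def sign(x):
    return 1 if x > 0 else -1


def xnor_dense(inputs, weights, scaled=True):
    """Outputs of a binary fully connected layer.

    inputs  : list of N activations
    weights : one list of N weights per output neuron
    scaled  : if False, K and alpha are neglected (Equation 3.16)
    """
    n = len(inputs)
    k = sum(abs(i) for i in inputs) / n          # scalar K (Equation 3.12)
    outputs = []
    for w in weights:
        alpha = sum(abs(v) for v in w) / n       # one alpha per neuron
        pop = sum(sign(i) * sign(v) for i, v in zip(inputs, w))
        outputs.append(pop * k * alpha if scaled else pop)
    return outputs


# Hypothetical toy values: 4 inputs, 3 output neurons.
inputs = [0.5, -1.0, 0.25, 0.75]
weights = [[0.2, -0.4, 0.1, 0.3],
           [-0.6, 0.6, 0.6, -0.6],
           [0.1, 0.1, -0.1, 0.1]]

full = xnor_dense(inputs, weights, scaled=True)
approx = xnor_dense(inputs, weights, scaled=False)
print(full, approx)
```

Since $K$ is a positive scalar common to all neurons, it cannot change which output is the largest; only the per-neuron $\alpha_i$ can. In this toy case both versions select the same neuron, which is not guaranteed in general.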

To properly evaluate the effect of this approximation, an MLP network has been built with several fully connected layers:
Python’s output reports the total number of trainable parameters and the network’s accuracy:

Total params: 10,039,336
Test accuracy: 0.9181

After training with 1 epoch, the network achieves an accuracy of 0.9181. Trying to neglect $K$ and $\alpha$ in the fully connected computation, the final accuracy is:

Test accuracy: 0.9155

This result is very good: removing two very computationally intensive parts of the neural network does not heavily influence the total accuracy. The activations coming from the fully connected layer already contain the classification result without the scaling factors, although batch normalization layers are required to maintain good accuracy. It is also possible to evaluate the impact of this approximation on the neural network used as reference (Figure 3.1):

Original accuracy = 0.8332
Approximated = 0.8338
Chapter 4

Hardware implementations

In this chapter, two different VHDL implementations of the neural network depicted in Figure 4.1 are reported: an OOM structure and an In-Memory structure.

These implementations are discussed and compared in terms of performance. The steps used in hardware design flow are the following:

1. VHDL fixed-point implementation is realized and simulated with Modelsim;
2. Synthesis and network analysis in terms of area, timing and power are performed using Synopsys Design Compiler. Power estimation has been performed considering the worst case: all the switching activities are equal to 1;

3. Place and Route phase with Cadence Innovus;

4. Post Place & Route power estimation using .vcd file.

The possibility to implement any kind of neural network model is discussed after all the hardware design explanations. The model depicted in Figure 4.1 is used as a starting point, since it is a very simple and straight-forward example.

## 4.1 OOM implementation

As can be seen in Figure 4.1, the neural network is composed of several layers with different tasks. Each of them is now discussed.

### 4.1.1 Max pooling layer

Max pooling computes the maximum value of a subset of the input and provides it at the output. The important parameters of the max pooling layer are $w_{in}$, $w_{filter}$ and stride, which determine the size of the input, the size of the sliding window and the corresponding step size.

**Input selection**

Considering that the input’s form is a matrix, it is possible to associate to each cell an index, representing the address of the considered input value:
In the example proposed in Figure 4.2, the first output cell (index 0) is obtained by computing the maximum value of the highlighted input cells with indexes 0,1,4,5, the second one from the cells 1,2,5,6, and so on. In VHDL it is possible to transform the input image matrix into a vector and to store it into a register file, as shown in Figure 4.2. In general, \( w_{filter}^2 \) inputs are fetched and the addressing is performed according to the following pseudocode:

```vhdl
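current_address[N] = initial_address;
counter = 0;
if (clk'event and clk = '1')
    counter++;
    if (counter < w_out)        -- move along the current row
        for i = 1:N
            current_address[i] = current_address[i] + stride;
        end
    else                        -- wrap to the next row of windows
        for i = 1:N
            current_address[i] = current_address[i] + w_in*stride - (w_in - w_filter);
        end
        counter = 0;
    end
end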
```

Figure 4.2: Max pooling: indexing example with \( w_{in} = 4 \), \( w_{filter} = 2 \) and stride = 1
In the pseudocode, the current address is computed considering the stride, $w_{in}$ and $w_{filter}$ as follows:

1. At the beginning, $current\_address[N]$ is set to $initial\_address$, which are the addresses of the first pooling window (in the example $0,1,4,5$);

2. For each positive clock event, a counter is increased;

3. If the value of counter is less than $w_{out}$ (the number of pooling windows in a row), the value added to each $current\_address[i]$ is the value of stride. In the example, if $counter < 3$ then $current\_address[i] = current\_address[i] + 1$;

4. If counter has reached the value of $w_{out}$, the pooling window has reached the end of the input columns and it has to be shifted also by rows: in the example, if $current\_address = \{2,3,6,7\}$ the following addresses should be $current\_address = \{4,5,8,9\}$. To do this, 2 instead of 1 has to be added to $current\_address$, and the update becomes

$$current\_address[i] = current\_address[i] + w_{in} \times stride - (w_{in} - w_{filter})$$

This formula has been found experimentally, in fact by multiplying $w_{in}$ and stride and adding to $current\_address$, the address value obtained moves the pooling window by a number of rows equal to stride. Finally, the pooling region has to be shifted by columns by subtracting $(w_{in} - w_{filter})$.

5. counter is reset to 0.
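
The addressing steps above can be simulated in software to check the update rule. This is a behavioral sketch, not the VHDL input selection circuit: the function name and the way the row wrap is detected (counting the windows already produced) are assumptions made for illustration.

```python
def pooling_window_addresses(w_in, w_filter, stride):
    """List the register-file addresses of every pooling window."""
    w_out = (w_in - w_filter) // stride + 1
    # initial_address: cells of the first window (0,1,4,5 in the example)
    current = [r * w_in + c for r in range(w_filter) for c in range(w_filter)]
    windows = []
    for n in range(w_out * w_out):
        windows.append(list(current))
        if (n + 1) % w_out != 0:
            step = stride                                # same row
        else:
            step = w_in * stride - (w_in - w_filter)     # next row
        current = [a + step for a in current]
    return windows


for window in pooling_window_addresses(w_in=4, w_filter=2, stride=1):
    print(window)
```

Running it with the parameters of the reference network (`w_in=28`, `w_filter=2`, `stride=2`) gives one address set per output cell of the pooling layer.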
The circuit shown in Figure 4.3 works as follows: when enabled, the first multiplexer (left) propagates the precharging values toward the output, if enable precharge is '1' and terminal count '0'. Input preset value is stored into the final register and the adder adds every clock cycle the input value with the stored one. If terminal count is '1', it means that current_address has to be added by the new value instead of stride. This circuit is replicated for each element and so there are $w^2_{filter}$ input selection circuits in parallel that are implemented in an external unit, which is not synthesized.

**Max comparator**

Once the $w^2_{filter}$ number of inputs are selected, they are fed to the max comparator. It takes one input per clock cycle and compares it with a previously stored result: if it is higher than the stored one, the new value is saved and replaces the older one. The pseudocode is the following:

```
previous_value = -2^(n_bit);    --minimum
counter = 0;
if(clk'event and clk='1')
    if(input(counter) > previous_value)
        previous_value = input(counter);
    end
    counter++;
end
```

1. At the beginning, the `previous_value` is set to the minimum value achievable, which is equal to \(-2^{n\_bit}\);

2. At each clock cycle, an input is selected and compared with the `previous_value`. If the input is higher, it will be stored.

Since the comparator works sequentially, the time required to compute the maximum of a single window is equal to \(w_{filter}^2 \times t_{ck}\).

**Control unit**

The FSM that controls the max pooling layer has the following structure:

![ FSM Diagram ]

Figure 4.4: Max pooling layer FSM

At the beginning of the algorithm the variable `do_pool` is checked: if it is equal to 0, the pooling part is skipped by asserting the done signal, otherwise the computations can start.

- Precharge decoder: the input selection precharges the `initial_address` registers (Figure 4.3), in order to select the inputs (enable and enable precharge window are set to '1');

- Do pooling: waiting state, that allows to store the values of `initial_address` into the corresponding registers and to start the pooling process;

- Comparator computing: the comparator starts to compute the maximum value and in the meanwhile a counter starts. The inputs are held until the comparator has finished, which is signaled by `terminal count cmp`;

- Clear comparator: once the comparator has finished, the result will be stored and the comparator’s register will be cleared. At the same time, the input selection will be enabled and `initial_address` will be incremented, pointing to the new set of data required for the incoming computation;

- Done: once the pooling process has finished (signaled by a counter with `terminal count pool`), the FSM passes into a done state, where `done` signal is asserted and then into `wait for start`, in which the system is waiting for a new start. The counter that asserts the signal `terminal count pool` counts until \( w_{out}^2 \), in order to process all the outputs of the pooling layer.

Figure 4.5: Timing diagram of the max pooling layer. Starting from idle, the FSM moves to precharge decoder (PD), in which the external decoder is precharged with its initial values. During do pooling, the inputs are provided to the max pooling layer and the computation starts with comparator computing, in which Count comparator is increased until it reaches the value $w_{filter(pool)}^2 - 1$, which in the neural network model depicted in Figure 4.1 is equal to 3 (4-1). When the terminal count CMP is asserted, the FSM moves to clear comparator (CC), in which the value stored inside the comparator is reset to the minimum. The result is stored inside the RF Pool, which is placed outside the chip (see Figure 4.27) and is addressed by count out pool. The entire procedure is repeated until count out pool reaches the terminal count pool, which is asserted when count out pool is equal to $w_{out(pool)}^2$, which in the neural network model in Figure 4.1 is 196. At this point, done and wait for start are reached, where the FSM waits for a new start signal.
**Scheduling**

Since the max-pooling layers are the same for both OOM and In-Memory implementations, the scheduling is analyzed only once. By looking at the control unit depicted in Figure 4.4, the duration of the states are the following:

<table>
<thead>
<tr>
<th>State</th>
<th>Required clock cycles</th>
<th>Multiplicity</th>
</tr>
</thead>
<tbody>
<tr>
<td>idle</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>weights_precharge</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>precharge_decoder</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>do_pooling</td>
<td>1</td>
<td>\(w_{\text{out(pool)}}^2\)</td>
</tr>
<tr>
<td>comparator_computing</td>
<td>\(w_{\text{filter}}^2\)</td>
<td>\(w_{\text{out(pool)}}^2\)</td>
</tr>
<tr>
<td>clear_comparator</td>
<td>1</td>
<td>\(w_{\text{out(pool)}}^2\)</td>
</tr>
<tr>
<td>done</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Max pooling fetches data from the RF INPUT Image and performs the pooling operation. The total number of data fetched depends on stride, \(w_{\text{in(pooling)}}\) and \(w_{\text{filter(pooling)}}\), which define the size of the output data, equal to \(w_{\text{out(pooling)}}^2\). For each set of data, the max comparison requires \(w_{\text{filter(pooling)}}^2\) clock cycles to complete. In the case of the neural network depicted in Figure 4.1, the total time required is:

\[
\begin{aligned}
\text{Pool time} &= \left(1 + 1 + 1 + w_{\text{out(pool)}}^2 \times (1 + w_{\text{filter}}^2 + 1) + 1\right) \times t_{ck} \\
&= (3 + 196 \times (1 + 4 + 1) + 1) \times t_{ck} = 1180 \times t_{ck}
\end{aligned} \tag{4.1}
\]
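
The cycle count of Equation 4.1 follows directly from the state table and can be recomputed for other layer sizes. A minimal sketch, where the function name `pool_cycles` is an assumption introduced here:

```python
def pool_cycles(w_in, w_filter, stride):
    """Clock cycles needed by the max pooling FSM (Equation 4.1)."""
    w_out = (w_in - w_filter) // stride + 1
    # idle, weights_precharge, precharge_decoder and done: one cycle each
    fixed = 1 + 1 + 1 + 1
    # do_pooling, comparator_computing, clear_comparator: once per output
    per_output = 1 + w_filter ** 2 + 1
    return fixed + w_out ** 2 * per_output


print(pool_cycles(w_in=28, w_filter=2, stride=2))
```

Multiplying the result by $t_{ck}$ gives the pooling time of the network in Figure 4.1.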

4.1.2 Convolutional and fully-connected layers

The main parts composing a convolutional layer are the alpha computation, the \(K\) computation and the XNOR Unit. Batch normalization and ReLU blocks are usually placed after a convolution, so they are integrated inside the convolutional layer entity. The fully connected layer is quite similar to the convolution, since it requires the same procedures (except for \(\alpha\) and \(K\)). The main parts composing the convolutional/fully-connected layer (from now on called convolutional layer) are now discussed, together with how the fully connected part can be integrated in the same architecture.

**XNOR Unit**

In order to realize the convolution operation, the signs of weights and inputs are XNORed and the XNOR results are then pop-counted. The inputs that have to be XNORed with the weights are the ones selected by the kernel window, as already discussed in section 4.1.1. Once the inputs are properly selected, their signs are stored into a register file and the corresponding output is fed to the XNOR circuitry.

![Diagram](image)

**Figure 4.6:** Example of a $w_{in} = 4, w_{filter} = 2, stride = 1$ input selection and saving circuits. The inputs are selected by the input selector and their signs are stored into the register file ($s(0), s(1), s(4), s(5)$). Then, once the saving procedure is completed, the inputs are fetched from the **Binary Input RF** and XNORed with the weights' signs. The XNOR results are selected by a multiplexer (Incoming bit).

The incoming bit (on the right of Figure 4.6) is fed to the pop-counting circuit, which transforms 1,0 into +1,-1 respectively. Once the multiplexer has completed the scanning, the final pop-result is obtained. Pop-counting has the following definition:

\[
\text{Pop-Count} = \#1s - \#0s
\]

A very simple circuit is used to perform the pop-counting:

If the incoming bit is equal to ’1’, Cin of the first FA is ’0’ while the FAs’ input becomes ”0001”, which is added to the value stored in the register. Otherwise, if the incoming bit is ’0’, Cin is ’1’ and the input becomes ”1110”: including Cin, the number added to the stored result is ”1111”, that is -1. The times required by the whole process of saving the inputs and pop-counting are given by:

\[
t_{\text{save}} = w_{out}^2 \times t_{ck} \]

\[
t_{\text{pop}} = w_{out}^2 \times w_{filter}^2 \times t_{ck}
\]
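
The equivalence between XNOR plus pop-counting and a dot product of ±1 values can be verified with a small behavioral sketch. The bit encoding (1 for +1, 0 for -1) follows the text; the function names are assumptions introduced here and the code does not model the full-adder chain bit by bit.

```python
from itertools import product


def xnor_popcount(input_bits, weight_bits):
    """Serial pop-count: +1 for each XNOR result '1', -1 for each '0'."""
    acc = 0
    for a, b in zip(input_bits, weight_bits):
        incoming_bit = 1 - (a ^ b)          # XNOR of the two sign bits
        acc += 1 if incoming_bit else -1    # adds "0001" or "1111"
    return acc


def signed_dot(input_bits, weight_bits):
    """Reference: dot product after mapping 1 -> +1 and 0 -> -1."""
    to_pm1 = lambda bit: 1 if bit else -1
    return sum(to_pm1(a) * to_pm1(b) for a, b in zip(input_bits, weight_bits))


print(xnor_popcount([0, 1, 1, 1], [0, 1, 1, 0]))

# Exhaustive check on all pairs of 4-bit sign vectors.
all_equal = all(xnor_popcount(i, w) == signed_dot(i, w)
                for i in product([0, 1], repeat=4)
                for w in product([0, 1], repeat=4))
print(all_equal)
```

The exhaustive loop covers every combination for a $2 \times 2$ window, so the accumulated value always equals the number of ones minus the number of zeros of the XNOR results.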

**Multiple input channels** When a convolution has to be performed on multiple input channels, the entire architecture is parallelized in order to process all the matrices at the same time. The output equation becomes the following:

\[
\text{Conv} = \sum_{i=1}^{c_{in}} [(\text{sign}(I) \otimes \text{sign}(W))_i \cdot K \cdot \alpha] = \left[ \sum_{i=1}^{c_{in}} (\text{sign}(I) \otimes \text{sign}(W))_i \right] \cdot K \cdot \alpha
\]

(4.5)

Figure 4.8: Multiple input channels architecture. The XNOR and pop-counting units are replicated for a number of input channels times, obtaining a parallel computation. Each channel contribution is added in the output computer unit.

For each channel there is a pop-counting unit. Each XNOR unit performs the computations in parallel and the final contribution is added serially in the output
computer, that fetches one by one the OutPops.

**Alpha computation**

In order to compute the convolution, $\alpha$ is also required, which is the mean of the absolute values of the considered kernel. The alpha computation starts during the pop-counting phase and requires $w_{filter}^2$ clock cycles. The alpha unit is very simple: it is composed of an adder, an absolute value block, a register (which saves the partial results) and a divider.

![Diagram of Alpha computational unit](image)

Figure 4.9: Alpha computational unit: example with $w_{filter} = 2$. The input multiplexer has been instantiated into an external unit, in order to reduce the total number of inputs of the chip.

As can be seen, the multiplexer is driven by the same counter used in Figure 4.6, since it has to choose one out of $w_{filter}^2$ possibilities. This multiplexer is instantiated outside the chip in order to reduce the total number of inputs of the chip: considering for example $w_{filter} = 5$ and $n_{bit} = 18$, the total number of bits would be equal to $w_{filter}^2 \times n_{bit} = 450$, while placing the multiplexer outside the chip requires only $n_{bit}$ input bits. These considerations also apply to the following parts. When the alpha computation is not required anymore, it is disabled by using **Enable alpha**.

**Multiple input channels** When multiple input channels are considered, before computing $\alpha$, the architecture has to consider all the kernels and to perform the absolute sum of each kernel element as follows:
for i=1:c_in
    abs_weights(:,:,i) = abs_weights(:,:,i) + abs(kernel(:,:,i));
end

alpha = mean(abs_weights(:,:));

In formulas:

\[
\alpha = \frac{\sum_{i=1}^{c_{in} \times w_{filter}^2} |W_i|}{w_{filter}^2 \times c_{in}} \]
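
As a quick check of the multi-channel definition, $\alpha$ can be computed for a small set of kernels. This sketch assumes that the sum runs over all $c_{in} \times w_{filter}^2$ weights of one output channel; the function name and the kernel values are hypothetical.

```python
def alpha_multi_channel(kernels):
    """Mean absolute weight over all input channels of one output channel.

    kernels: list of c_in kernels, each given as w_filter rows of
             w_filter weights.
    """
    c_in = len(kernels)
    w_filter = len(kernels[0])
    abs_sum = sum(abs(w) for kernel in kernels for row in kernel for w in row)
    return abs_sum / (w_filter ** 2 * c_in)


# Hypothetical example: c_in = 2, w_filter = 2.
kernels = [[[0.5, -0.5], [0.25, -0.25]],
           [[1.0, -1.0], [0.5, 0.5]]]
print(alpha_multi_channel(kernels))
```

In hardware the same quantity is accumulated over $w_{filter}^2$ clock cycles, with the adder tree summing the $c_{in}$ contributions of each cycle.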

To compute it, the alpha circuit has been modified as follows:

Figure 4.10: Alpha computation unit in case of multiple input channels. An adder tree adds all the multiplexed weights from the \( c_{in} \) inputs. The last division is performed also by the number of input channels. Re-timing technique has been used for the loop register, in order to reduce the critical path caused by an adder tree and a divider.
**Multiple output channels** When multiple output channels are considered (as in Figure 4.1), a multiplexer is placed at the alpha unit input for each input channel to select the weight set to consider. The final architecture is depicted in Figure 4.11:

![Figure 4.11: Alpha computation unit in case of multiple output/input channels](image)

Depending on the output channel considered (addressed by Channel selected), the multiplexer selects a weight-set for each input channel and the computational scheduling of alpha unit is then executed. All the inputs contributions are then added in the adder tree and divided by $w_{filter}^2 \times c_{in}$. Placing the multiplexers outside the chip is fundamental for very large networks, since this approach reduces the maximum number of input bits from $w_{filter}^2 \times c_{out} \times c_{in} \times n_{bit}$ to $c_{in} \times n_{bit}$. Considering for example the first layer of AlexNet and imposing $n_{bit} = 18$, the total number of
input bits can be computed as:

\[ w_{\text{filter}}(\text{AlexNet}) = 11 \]
\[ c_{\text{in}} = 3 \]
\[ c_{\text{out}} = 96 \]
\[ n_{\text{bit}} = 18 \]

Number of bits (Mux inside) = \( w_{\text{filter}}^2 \times c_{\text{out}} \times c_{\text{in}} \times n_{\text{bit}} = 121 \times 96 \times 3 \times 18 = 627264 \)

Number of bits (Mux outside) = \( c_{\text{in}} \times n_{\text{bit}} = 54 \)
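The two expressions can be evaluated for any layer with a few lines of Python. This is only an arithmetic sketch; the function name `alpha_unit_input_bits` is hypothetical.

```python
def alpha_unit_input_bits(w_filter, c_in, c_out, n_bit):
    """Return (bits with the multiplexers inside, bits with them outside)."""
    mux_inside = w_filter ** 2 * c_out * c_in * n_bit   # every weight is a chip input
    mux_outside = c_in * n_bit                          # one weight per input channel
    return mux_inside, mux_outside


# First layer of AlexNet with 18-bit weights
inside, outside = alpha_unit_input_bits(w_filter=11, c_in=3, c_out=96, n_bit=18)
print(inside, outside)
```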

**Division process**  About the division, some details have to be discussed. By defining \( n_{\text{bit}} \) and \( n_{\text{bit}_{\text{fractional}}} \) as two parameters indicating how many bits are used to represent an input/weight, the resulting fixed point value is split as follows:

![Fixed point representation](image)

Figure 4.12: Fixed point representation: example with \( n_{\text{bit}} = 18 \) and \( n_{\text{bit}_{\text{fractional}}} = 10 \)

To compute \( \frac{1}{w_{\text{filter}}^2} \), the division process has to consider the following steps:

1. Division between \( 2^{n_{\text{bit}}-1} \) and the term to divide. The result is on \( n_{\text{bit}} \).

   Considering the following example with \( w_{\text{filter}}^2 = 4 \):

   \[ \text{Div} = \frac{2^{n_{\text{bit}}-1}}{w_{\text{filter}}^2} = \frac{2^{17}}{4} = 32768 \quad (4.7) \]

   Representing this result on 18 bits:
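2. The fixed point division result is obtained by taking \( n_{\text{bit}_{\text{fractional}}} \) bits of the previous result, from bit \( n_{\text{bit}} - 2 \) downward (bits 16 down to 7 in this example), and placing them in the fractional field (bits 9 down to 0), while the integer field is set to zero. In general, the operation performed is the following: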

\[
\begin{aligned}
&\texttt{to\_divide(n\_bit-1 downto n\_bit\_fractional) <= (others => '0');} \\
&\texttt{to\_divide(n\_bit\_fractional-1 downto 0) <= div(n\_bit-2 downto n\_bit-2-(n\_bit\_fractional-1));}
\end{aligned}
\]
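The two division steps can be reproduced with integer arithmetic. The sketch below is a software model under the assumption that the slice taken from `div` is the one given by the assignment above; the function name `fixed_point_reciprocal` is hypothetical.

```python
def fixed_point_reciprocal(divisor, n_bit=18, n_bit_fractional=10):
    """Fixed point encoding of 1/divisor (integer field forced to zero)."""
    # Step 1: integer division of 2^(n_bit-1) by the term to divide.
    div = (1 << (n_bit - 1)) // divisor
    # Step 2: keep bits (n_bit-2) downto (n_bit-2-(n_bit_fractional-1))
    # and place them in the fractional field.
    low = n_bit - 2 - (n_bit_fractional - 1)
    mask = (1 << n_bit_fractional) - 1
    return (div >> low) & mask


raw = fixed_point_reciprocal(4)     # example with w_filter^2 = 4
value = raw / (1 << 10)             # interpret the 10 fractional bits
print(raw, value)
```

With $w_{filter}^2 = 4$ the model returns the fixed point encoding of 0.25; for divisors that are not powers of two the result is truncated toward zero.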

**K computation**

The matrix K has to be computed during the Binary input RF precharging, since, once it is precharged, the fixed point input values are not considered anymore to avoid wasting of power caused by data migration. The design is based on the following considerations:

1. The precharging phase has a duration equal to \(w^2_{out}\), since all the \(w^2_{filter}\) inputs are precharged at the same time in Binary input RF;

2. During this period of time, K unit has to compute all the K values and to store them into a register file (called k_array);

3. K values must be ready before the convolution computation.

In order to do this, the data precharging is stopped every time a new input-set is ready, to allow the K computation unit to complete the computation. A valid output from K computation unit is achieved after \(w^2_{filter}\) clock cycles. The corresponding scheduling obtained is depicted in Figure 4.13:
In general, the total number of input-sets to be precharged inside the Binary Input RF is equal to \( w_{out}^2 \). As a consequence, the maximum time required to complete the computation, with \( t_{ck} \) the clock period, is equal to:

\[
\text{Max clock cycles} = w_{out}^2 \times (w_{filter}^2 + 1) \times t_{ck}\tag{4.8}
\]

After the completion of the \( K \) computation, the data are ready to be processed, so pop-counting part can start.
**Multiple input channels** By looking at Equation 3.8, the absolute sum of the inputs has to be considered. In order to reduce the complexity of Equation 3.8, the following transformation is applied:

$$K = \sum_{c=1}^{c_{in}} |\text{Inputs}(:,:,c)| \times \begin{bmatrix} \frac{1}{w_{filter}^2} & \frac{1}{w_{filter}^2} & \cdots \\ \frac{1}{w_{filter}^2} & \frac{1}{w_{filter}^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} =$$

$$= \sum_{c=1}^{c_{in}} \left[ \sum_{i=1}^{w_{filter}^2} \text{SelectedInputs}(c)(i) \right] \times \frac{1}{w_{filter}^2 \times c_{in}}$$

Here SelectedInputs is the input-set selected in a given clock cycle so, considering the example in Figure 4.13, the sets are 0,1,4,5 → 1,2,5,6 and so on. The channel contributions are added by an adder tree and the result is divided by $w_{filter}^2 \times c_{in}$. Since each convolutional layer has a different number of input channels, the output of the adder tree is chosen by a multiplexer driven by the conv_z variable, which indicates how many input channels are used in that layer: if conv_z is equal to 1, only one input channel is considered, so the output is simply given by the first $K$ scheduled unit. When conv_z is equal to 2, two channels are selected, so the contribution of the second parallel $K$ scheduled unit also has to be considered.
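A minimal software model of the $K$ computation is sketched below. It assumes a square input, unit stride and no padding, which matches the input-sets 0,1,4,5 → 1,2,5,6 of the example but is otherwise an assumption; the function name `k_matrix` and the list-based layout are hypothetical.

```python
def k_matrix(inputs, w_filter):
    """K values for a list of c_in square feature maps (lists of lists)."""
    c_in = len(inputs)
    w_in = len(inputs[0])
    w_out = w_in - w_filter + 1                 # unit stride, no padding (assumed)
    scale = 1.0 / (w_filter ** 2 * c_in)        # final division
    k = []
    for r in range(w_out):
        row = []
        for c in range(w_out):
            acc = 0.0
            for channel in inputs:              # adder tree over the channels
                for i in range(w_filter):       # w_filter^2 selected inputs
                    for j in range(w_filter):
                        acc += abs(channel[r + i][c + j])
            row.append(acc * scale)
        k.append(row)
    return k


# 4x4 single-channel input whose entries equal their flat index (0..15)
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
k = k_matrix([image], w_filter=2)
print(k[0][0], k[0][1])   # windows {0,1,4,5} and {1,2,5,6}
```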

Figure 4.14: Example of $K$ unit with $w_{filter} = 2$ and multiple input channels. The input multiplexer has been integrated into an external unit in order to reduce the number of simultaneous inputs to the architecture. Since $\text{conv}_z$ is fixed, the multiplexer selects only one input at a time. The register indicated by the red arrow has been moved from its original location by applying re-timing: this technique avoids having multiple adders connected to the final multiplier, reducing the critical path delay. The last term ($\frac{1}{w_{filter}^2 \times c_{in}}$) is taken directly from the alpha unit.

**Fully connected integration**

The fully connected layer has the same computational flow as the convolutional part, but $\alpha$ and $K$ are not considered, for the reasons explained in section 3.2.1. To describe how the fully connected layer is computed, the following example can be considered:
Figure 4.15: Example of fully connected layer integration. The data precharging pattern is inverted to compute the output values of the neurons o0 and o1. *number of fc parameters* indicates the number of input neurons, which in this example is equal to 3. In the real case depicted in Figure 4.1, *number of fc parameters* = 1014.

In Figure 4.15, the fully connected layer has been implemented in the original convolutional structure. The data precharging pattern is inverted to compute the output values as follows:

\[
\begin{aligned}
o0 &= \text{Pop}(i_0 \oplus w_1, i_1 \oplus w_3, i_2 \oplus w_5) \\
o1 &= \text{Pop}(i_0 \oplus w_2, i_1 \oplus w_4, i_2 \oplus w_6)
\end{aligned}
\tag{4.10}
\]

Since the fully connected layer in the model proposed in Figure 4.1 has 1014 binarized inputs and 10 outputs, the **Binary input RF** has to have at least 1014 columns and a number of rows equal to the maximum \(w_{out}^2(i)\) over all the layers in the neural network, which in Figure 4.1 is equal to:

\[
\text{\#Rows} = \max(w_{out}^2(i)) = 13 \times 13 = 169 \tag{4.11}
\]

Since the word length is very large, the chip would need at least 1014 input bits for each **Binary input RF**. To eliminate this drawback, the fully connected process has been serialized to reduce the bit length of the input data, as depicted in Figure 4.16:
Figure 4.16: Fully connected layer scheduling. Inputs and weights are divided into subgroups of $L$ elements and precharged inside the **Binary Input RF**. At each cycle, once the pop-counting has finished, a new set of inputs/weights is precharged in the **Binary Input RF** and the pop-counting part starts again. The register file **RF TMP Pop** holds the temporary values of pop-counting and it is addressed by the counter: the total number of registers used in **RF TMP Pop** is equal to the number of output neurons that, as in Figure 4.1, is equal to 10.
The fully connected scheduling works as follows:

- ①: a sub-group of inputs/weights is precharged into the **Binary Input RF** and fed to the XNOR inputs. The width of the sub-groups (L) is defined according to the dimension of the first fully connected layer (number of fc parameters), which in the model depicted in Figure 4.1 is 1014: choosing the minimum size that is also a divisor of number of fc parameters brings advantages in terms of energy and power consumption. The constraint that has to be respected in the definition of W is:

\[
\begin{aligned}
W &\geq w_{filter}^2 \\
W &\geq L
\end{aligned}
\tag{4.12}
\]

For the neural network depicted in Figure 4.1, W=6. For each fully connected layer, the value of L is defined dynamically, according to the algorithm. As shown in Figure 4.16, the fully connected layer model (⑤) is replicated in the **Binary Input RF**: the first row is dedicated to the output of the first neuron, the second row to the second neuron, and so on. Each row is fetched and the XNORs between inputs and weights are computed: the XNOR results are multiplexed and the Incoming bit is fed to the pop-counting unit;

- ③: the pop-count result is added with a previously stored value, which is addressed by the counter: if count=0, it means that the first line of RF TMP POP is addressed and the pop-counting result refers to the first output neuron. RF TMP Pop is useful in the OOM implementation, since once the first pop-count result has been computed, count increments selecting the second line and the temporary pop-count result is stored;

- ④: after the pop-counting procedure for the first neuron has finished, count increments selecting the second line. A new pop-count for the second neuron starts and finishes when all the inputs have been multiplexed. The temporary value is stored into the RF TMP Pop;

- The entire procedure is executed for each output neuron: when count=9, the first fully connected layer cycle has been completed and the entire procedure starts again from the first output neuron. A new set of values (indicated by ②) is then precharged. Considering again the first output neuron, a new pop-counting procedure starts and the resulting value is added to the value stored in the RF TMP Pop.

This procedure ends when all the inputs/weights are considered for each output neuron, requiring a number of iterations that are equal to:

\[
n_{\text{iter}} = \frac{\text{number of fc parameters}}{L} \quad (4.13)
\]

Considering the neural network model depicted in Figure 4.1, since the number of input neurons is 1014 and \(L = 6\), the time duration can be computed as:

\[
\text{Time duration} = \frac{\text{number of fc parameters}}{L} \times (L \times w_{\text{out (fc)}}) \times t_{ck} = \frac{1014}{6} \times (6 \times 10) \times t_{ck} = 10140 \times t_{ck} \quad (4.14)
\]
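The serialized schedule can be modelled in software as follows. This is a behavioural sketch only: the XNOR is modelled as bit equality, one evaluation cycle is counted per multiplexed bit, and the names `fc_popcount_serial` and `rf_tmp_pop` are hypothetical stand-ins for the units described above.

```python
def fc_popcount_serial(inputs, weights, L):
    """Serialized XNOR/pop-count of a fully connected layer.

    inputs:  list of n bits (0/1)
    weights: one list of n bits per output neuron
    L:       sub-group width, assumed to divide n
    Returns (pop-count per neuron, number of evaluation cycles).
    """
    n = len(inputs)
    if n % L != 0:
        raise ValueError("L must divide the number of fc parameters")
    n_iter = n // L
    rf_tmp_pop = [0] * len(weights)          # one temporary register per neuron
    cycles = 0
    for it in range(n_iter):                 # a new sub-group is precharged
        chunk = slice(it * L, (it + 1) * L)
        for neuron, w_row in enumerate(weights):
            for i_bit, w_bit in zip(inputs[chunk], w_row[chunk]):
                rf_tmp_pop[neuron] += 1 if i_bit == w_bit else 0   # XNOR
                cycles += 1                  # one multiplexed bit per cycle
    return rf_tmp_pop, cycles


# Model of Figure 4.1: 1014 inputs, 10 output neurons, L = 6
inputs = [i % 2 for i in range(1014)]
weights = [[(i + k) % 2 for i in range(1014)] for k in range(10)]
pops, cycles = fc_popcount_serial(inputs, weights, L=6)
print(cycles)
```

The accumulated result does not depend on L, since the sub-groups only change the order in which the bits are visited.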

**Output computation, Batch normalization and ReLU**

Once \(\alpha\) and \(K\) are computed (convolution case) and pop-counting routine has finished, the output is simply obtained by the product of these three values. Batch normalization takes the values of the convolution/fully connected layer and computes the batch-normalized output as already explained. ReLU layer is very simple, since it computes the maximum between 0 and its input. In hardware it is realized with a multiplexer that chooses between "0" and input by using the input’s sign as select.

**Multiple input channels** In the convolutional case, in order to consider the contributions of the parallel architectures, an accumulator is used to add all the contributions. The formula for the output computation becomes:

\[
Conv = \sum_{c=1}^{c_{in}} (\text{sign} (I) \odot \text{sign} (W))_c \cdot K \alpha \quad (4.15)
\]
Figure 4.17: Convolution computation unit. Example of a 4 input channels output computer unit, with batch normalization and ReLU. \( \alpha \) is delayed by a register in order to reduce the critical path. A, B are the batch normalization terms. The path indicated by the red arrow has been retimed to reduce the critical path delay.

The computational steps executed by this circuit are indicated by the circled numbers in Figure 4.17:

- ①: the first multiplexer is able to choose an OutPop(i) out of 4 possible inputs. Each OutPop(i) indicates a pop-counting result of an input channel and each of them is added in an accumulator unit;

- ②: the output computer computes the convolution result by multiplying the pop-counting value by $\alpha$ and $K$, as indicated in Equation 4.15;

- ③: the multiplexer selects between the output computation and "0", based on compute_batch AND do_batch. The signal compute_batch is asserted by the control unit when the output computer has finished its computation. The end of the output computer's calculation is signaled by the counter's terminal count, which is reached when the counting value is equal to the number of input channels of the layer considered, indicated as conv_z (in the convolutional computation, it has a different meaning with respect to the fully connected one, reported in section 4.1.2). The signal do_batch is handled by the user, who can decide whether Batch Normalization has to be used in that particular layer or not, so it is defined in the testbench (discussed in subsection 4.1.5);

- ④: a multiplexer selects between the output of the register and FcPop, based on Fully connected layer, a signal that indicates whether the layer considered is convolutional or fully connected. Fully connected layer is handled by the user who defines the neural network model, so it is declared in the testbench. When Fully connected layer is '1', FcPop is chosen, which is the output coming from the pop-counting of the first input channel (in the OOM implementation, it corresponds to the output of the RF TMP POP, as depicted in Figure 4.19), because in the fully connected part the computations of $\alpha$ and $K$ are neglected. Only the first channel's output is required in the fully connected case: one XNOR pop-counting unit is sufficient to perform the computation because, by definition, the fully connected layer takes an input vector instead of matrices;

- ⑤: a multiplexer chooses between the batch normalization output and the output of ④: if the batch normalization is disabled, the output computation or FcPop is chosen, depending on the layer type. The terms A and B are given as inputs to compute the batch normalization; the computational cost of the BatchNorm layer can be reduced by considering:

\[
\text{BatchNorm} = \frac{x - \mu}{\sigma} \times \gamma + \beta = \frac{x}{\sigma} \times \gamma + \beta - \frac{\mu}{\sigma} \times \gamma \tag{4.16}
\]

\[
A = \frac{\gamma}{\sigma} \tag{4.17}
\]

\[
B = \beta - \frac{\mu}{\sigma} \times \gamma \tag{4.18}
\]

\[
\text{BatchNorm} = x \times A + B \tag{4.19}
\]

- ⑥: after ReLU has performed the maximum between 0 and the input, the user can choose if the ReLU’s output has to be considered or not with do_relu signal.
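The steps above can be summarized by a short floating-point model. It is a sketch under stated assumptions: the pop-count values are taken as already expressed in the numeric range expected by Equation 4.15, and the function names `fold_batchnorm`, `relu` and `layer_output` are hypothetical.

```python
def fold_batchnorm(gamma, beta, mu, sigma):
    """Precompute A and B so that BatchNorm(x) = x * A + B."""
    a = gamma / sigma
    b = beta - mu / sigma * gamma
    return a, b


def relu(x):
    # In hardware: a multiplexer between "0" and the input, selected by the sign.
    return x if x > 0 else 0.0


def layer_output(out_pop, k, alpha, a, b, do_batch=True, do_relu=True):
    """Accumulate the channel pop-counts, scale by K and alpha, then BN and ReLU."""
    x = sum(out_pop) * k * alpha      # accumulator followed by the two products
    if do_batch:
        x = x * a + b                 # single multiply-add instead of the full formula
    if do_relu:
        x = relu(x)
    return x


a, b = fold_batchnorm(gamma=2.0, beta=1.0, mu=4.0, sigma=2.0)
y = layer_output([1, 2, 3, 4], k=0.5, alpha=2.0, a=a, b=b)
print(a, b, y)
```

Folding the statistics into A and B moves the subtraction and the division out of the datapath, leaving one multiplication and one addition per output.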

**Multiplication process**  To multiply two fixed point numbers it is sufficient to consider the following scheme:

![Multiplication scheme (labels: 2*n_bit, n_bit, n_bit_fractional)](image)

Figure 4.18: Multiplication scheme: example with n_bit=18 and n_bit_fractional = 10

The multiplication generates an output on 2*n_bit bits but, since the architecture has a finite precision, the result is truncated to n_bit bits as shown in Figure 4.18.
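A possible software model of this truncation is sketched below. It assumes that the n_bit result keeps the bits of the full product starting at position n_bit_fractional, so that the binary point stays aligned; this choice and the function name `fixed_mul` are assumptions, since the exact slice is defined by the figure.

```python
def fixed_mul(a, b, n_bit=18, n_bit_fractional=10):
    """Multiply two signed fixed point integers and truncate to n_bit bits."""
    full = a * b                           # product with 2*n_bit_fractional fractional bits
    shifted = full >> n_bit_fractional     # drop the extra fractional bits (truncation)
    wrapped = shifted & ((1 << n_bit) - 1) # keep only n_bit bits
    if wrapped >= 1 << (n_bit - 1):        # reinterpret as two's complement
        wrapped -= 1 << n_bit
    return wrapped


one_and_half = 1536        # 1.5  * 2^10
two_and_quarter = 2304     # 2.25 * 2^10
product = fixed_mul(one_and_half, two_and_quarter)
print(product, product / 1024)
```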

**Entire datapath**

The following figure presents an example of a convolutional layer with \( C_{out} = 2 \) and \( c_{in} = 4 \): the components in the dashed red boxes are not included in the convolutional layer entity. Since the architecture has multiple input channels, the number of parallel Max pooling layers, Input selectors, XNOR units and pop-counting units is equal to $c_{in}$. Starting from the left side (1), the inputs are Initial_input provided by Input source, which can be the RF Input Image, the max pooling output or the convolutional layer output itself, as depicted in Figure 4.27. The inputs are fed to Input selector, which fetches the values to be convolved and stores their sign into the Binary input RF (2). The $c_{in}$ multiplexers shown in (2) are driven by count K and each of them takes one of $w^2_{filter}$ inputs: the blue arrow means that each multiplexer has $w^2_{filter}$ inputs of $n_{bit}$ bits each. These multiplexers have already been discussed in the K computation unit part (Figure 4.14), and they feed the architecture with one input pixel at a time. Binary values are fetched from Binary input RF and the convolution and $\alpha$ computation processes start. When the pop-count terminates (3), the output is computed by Convolution computation unit, and batch normalization and ReLU are finally applied (4). At the end, the result is stored into the Temporary RF CNV, new binary values are fetched from Binary input RF and the entire process is repeated. Once an entire OFMAP has been computed, the results are moved in parallel from Temporary RF CNV to Output register files. After that, a new output channel is processed: since the architecture depicted in Figure 4.19 has $C_{out} = 2$, multiple output kernels are used. For this purpose, a multiplexer selects the kernel values depending on the output channel considered (signaled by Channel selected), and the entire procedure is repeated. The functionality of the convolutional layer entity is explained in the control unit part.
Figure 4.19: Entire convolutional layer datapath: example with $C_{out} = 2$, $C_{in} = 4$. The area highlighted by the red dashed line is implemented in an external unit.
**Control unit**

Figure 4.20: FSM of the convolutional/fully connected layers. The term "TC" indicates terminal count.
**Convolution algorithm** The convolutional part of the control unit is now explained. Fully connected layer is a signal that indicates whether the layer considered is fully connected or convolutional, so in this part Fully connected layer is '0'.

- ①: dummy state that allows the first row of the Binary input RF to be precharged and the timing requirements to be met. This state is particularly useful in the IN-MEMORY architecture, since the weights are also stored inside the Binary input RF;

- ②: during the Initial stage, the K computational unit starts its computation. Terminal count K is reached when the first useful result of K is available, after $w_{filter}^2$ clock cycles (in the example proposed in Figure 4.13, after 4 clock cycles, since $w_{filter} = 2$). When it is reached, Terminal count SRAM is tested and, if it is '0', the FSM moves to Input precharge, where a new input set is provided and stored inside the Binary Input RF, and the computed value of K is saved inside k_array. The FSM then returns to Initial stage, where a new value of K is computed with the new input-set. This routine ends when Terminal count SRAM is asserted, meaning that all the values have been precharged in the Binary Input RF. Figure 4.21 reports the timing diagram of the K computation unit;
Figure 4.21: Timing diagram for the K computation considering only one input channel. When Enable K is asserted during initial stage, K computation starts to address one out of $w_{filter}^2$ inputs with Count K, and the corresponding sum is obtained in OutRegSum. This phase lasts for $w_{filter}^2$ clock cycles, that in this example it is equal to 4. After that, during Input precharge (IP), a new input set is provided and the just computed value of K is stored inside K array.

- ③: during Input precharge, the binary inputs are precharged in Binary input RF and K array starts to store the K results;

- ④: dummy state that waits for the last precharge in the Binary Input RF;
Figure 4.22: Example of timing diagram for **Evaluation** state, considering only one pop-counting unit. When **Counter SRAM** has terminated, **terminal count SRAM** is ’1’, allowing the FSM to move from **Initial stage** toward **Evaluation**. During this state, pop-counting is enabled and **Count pop** starts. **OutPop(0)** changes its value according to the xnor values: this procedure terminates when all the filter elements have been considered, so after \( w^2_{filter} \) clock cycles. In the meanwhile, alpha can start its computation.

- ⑤: during **Evaluation**, the columns of the **Binary Input RF** are scanned and the pop-counting procedure is performed. This state ends when **Terminal count pop** is asserted, i.e. when the value of **count pop** is equal to \( w^2_{filter} \);

- ⑥: the output computer starts to perform its evaluation. The output computation phase has a duration given by the total number of input channels in the layer considered: **Counter** (Figure 4.17) asserts **terminal count OC** when all the inputs have been scanned, so \( c_{\text{in}} \) clock cycles are required. Once the output computer has finished its computation (signaled by Terminal count OC), the batch normalization and ReLU computations can start. During the state highlighted by ⑦, the result is stored in the temporary convolutional register (Temporary RF CNV) depicted in Figure 4.27 and the Counter SRAM is increased, so that another row of the Binary Input RF can be addressed: if Terminal count SRAM is equal to 1, the Binary Input RF has been completely scanned and the convolution is completed, otherwise the FSM returns to Evaluation, in which the pop-counting operation is executed again;

- ⑨ and ⑩: once all the outputs have been computed, the FSM waits for a clock cycle and then stores the results by moving in parallel the content of Temporary RF CNV into the Output register files (Figure 4.27);

- ⑪: since more than one output channel may be present, during the state Change channel out a counter is increased, which selects the next set of weights by driving the signal Channel selected (Figure 4.27). If Terminal count ch out is not reached, meaning that there are still output channels to consider, the FSM goes to Alpha computing, in which the alpha unit is enabled to process the new weights and compute the new value of alpha (Figure 4.9). The entire procedure is then repeated for the new channel;

- ⑬: once all the output channels have been processed, the convolutional layer has finished and asserts the Done signal. In this state, the FSM waits for a new start signal;
Figure 4.23: Timing diagram for the convolution computation. The FSM moves from Evaluation (IE) to output computation when terminal count pop counting is ‘1’. During output computation, the values of $K$ (selected by the counter SRAM) and alpha are fed to the output computer, which performs the product between the OutPop result and these two values, obtaining Output computation (see Figure 4.17). The FSM waits until terminal count OC is asserted, which happens when the output computer has scanned all the parallel input channels (Figure 4.17), so after $c_{in}$ clock cycles. Since in the reference architecture depicted in Figure 4.1 there is only one parallel input channel, the FSM passes immediately to the batch normalization state, which computes Batch Normalization/ReLU within a clock cycle. Moving to the increase batch state, the Counter SRAM is enabled and its count is increased, in order to consider another binary input set from the Binary Input RF and a new value of $K$, which is addressed by the counter itself. At the same time, the convolution result is saved inside a temporary register file (Temporary CNV RF in Figure 4.27). The procedure restarts with evaluation.
Figure 4.24: Timing diagram for multiple output channels handling. From **increase batch** (IB), the FSM moves toward **wait for last result**, since the **Counter SRAM** has reached the end of counting. The last valid data is saved inside the **Temporary CNV RF** (Figure 4.27) and, consequently, the entire content of the register file is stored in the **output register files** (Figure 4.27) during store results. At this point the channel is changed by increasing **channel selected**, which selects another weights set. Alpha is computed again and the entire process described in the previous parts is repeated.
**Fully connected algorithm**  Now the functionality of the FSM when Fully connected layer is '1' is analyzed.

- ③: when a fully connected layer is considered, the Input precharge phase is executed until terminal count SRAM is asserted. The binary values are stored inside the Binary Input RF, without the execution of the K computation;

- a: The evaluation of the fully connected part starts and terminates when Terminal count pop is asserted. This is '1' when all the L elements defined in the FC scheduling (section 4.1.2) are selected by the multiplexer, which is addressed by Counter Pop in Figure 4.6, so considering Figure 4.16, when count pop reaches 5 (0 to 5);

- b: once the pop-counting for FC has finished, the temporary result will be saved into the RF TMP POP (Figure 4.19);

- c: when the Terminal count SRAM is asserted, meaning that the precharging of Binary Input RF has finished, a counter that handles the FC input scheduling (as depicted in Figure 4.16) is increased. This counter allows to choose a new set of fc inputs/weights as follows:

  \[
  \texttt{InputRed = Inputs\_fc(L*(count\_fc+1)-1 downto L*(count\_fc));}
  \]

  \[
  \texttt{WeightsRed = Weights\_fc(L*(count\_fc+1)-1 downto L*(count\_fc));}
  \]

  Inputs\_fc/Weights\_fc refer to the entire word of fully connected inputs/weights. In the neural network model depicted in Figure 4.1, Inputs\_fc/Weights\_fc have a length of number\_of\_fc\_parameter = 1014 bits, while L (the width of the fc word fed into the Binary Input RF, as described in Figure 4.16) is equal to 6. terminal count fc is asserted when the whole 1014-bit input word has been scanned; in formulas:

  \[
  \text{Terminal\_count\_value} = \frac{1014}{6} = 169
  \]

The total number of iterations \(n_{\text{iter}}\) is equal to 169.
- ④: once all the inputs of the fully connected layer are considered, the output pop-counted values are scanned one by one in order to be stored in the Temporary CNV RES. In this state, compute_batch signal is equal to 1, in order to perform the batch normalization if requested;

- ⑤: during this state, data migrates from the Temporary CNV RES to the first Output register files. Moreover, a .txt file is generated containing the FC results.
Figure 4.25: Timing diagram of the fully connected part. After weight precharge (WP), the FSM starts to save the binary values inside the Binary Input RF during Input precharge, as already discussed. After that, evaluation can start, in particular the first line addressed by Counter SRAM is pop-counted. The pop-counting procedure has a time duration equal to $L \times t_{ck}$, that in the neural network model depicted in Figure 4.1 is equal to $6 \times t_{ck}$. Once Count Pop has reached 5, the FSM moves to save tmp results fc (ST), in which the temporary result of the pop-counting procedure is saved inside the RF TMP POP (depicted in Figure 4.19) and the last register of the pop-counting unit is cleared (Figure 4.7). A new evaluation procedure starts, but now the second row of the Binary Input RF is considered, since Counter SRAM is increased. The entire procedure for the first part of the fc scheduling (discussed in section 4.1.2) ends when the value of Counter SRAM is equal to the number of output neurons, that in the neural network model depicted in Figure 4.1, it is 10. After that, the state Increase fc increases the value of Count fc, which allows to select another inputs/weights set, as reported in section 4.1.2. These computational steps are repeated for $n_{iter} = \frac{\text{number of fc parameters}}{L}$ number of times.
**Scheduling**

This layer has two different schedulings, since both convolutional and fully connected computations are performed. By looking at the control unit depicted in Figure 4.20, it is possible to compute the clock cycles required by each state, as already done in the max-pooling part. For this purpose, the neural network model depicted in Figure 4.1 is considered.

- Convolution: it starts by storing the binary values inside the Binary Input RF while, at the same time, the $K$ computation is performed, requiring $w_{out}^2 \times (w_{filter}^2 + 1)$ clock cycles. The convolutional process takes, one by one, each row of the Binary Input RF and computes the pop-counting in $w_{filter}^2$ clock cycles. After that, the output computation, batch normalization/ReLU and result storing are performed, taking $c_{in}$, 1 and 1 clock cycles respectively: the entire procedure is repeated for each output ($w_{out}^2$). These steps, together with alpha_computation, store_res, change_channel_out and wait_for_last_result, have to be performed for each output channel ($c_{out}$), since every time a different channel is considered, a new convolution starts.
Table 4.2: Clock cycles required by the convolutional algorithm for the OOM architecture.

<table>
<thead>
<tr>
<th>State</th>
<th>Required clock cycles</th>
<th>Multiplicity</th>
</tr>
</thead>
<tbody>
<tr>
<td>idle</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>weights_precharge</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>initial_stage</td>
<td>\( w_{\text{filter}}^2 \)</td>
<td>\( w_{\text{out}}^2 \)</td>
</tr>
<tr>
<td>input_precharge</td>
<td>1</td>
<td>\( w_{\text{out}}^2 \)</td>
</tr>
<tr>
<td>wait_for_last_precharge</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>evaluation</td>
<td>\( w_{\text{filter}}^2 \)</td>
<td>\( c_{\text{out}} \times w_{\text{out}}^2 \)</td>
</tr>
<tr>
<td>output_computation</td>
<td>\( c_{\text{in}} \)</td>
<td>\( c_{\text{out}} \times w_{\text{out}}^2 \)</td>
</tr>
<tr>
<td>batch_normalization</td>
<td>1</td>
<td>\( c_{\text{out}} \times w_{\text{out}}^2 \)</td>
</tr>
<tr>
<td>increase_batch</td>
<td>1</td>
<td>\( c_{\text{out}} \times w_{\text{out}}^2 \)</td>
</tr>
<tr>
<td>wait_for_last_result</td>
<td>1</td>
<td>\( c_{\text{out}} \)</td>
</tr>
<tr>
<td>store_results</td>
<td>1</td>
<td>\( c_{\text{out}} \)</td>
</tr>
<tr>
<td>change_channel_out</td>
<td>1</td>
<td>\( c_{\text{out}} \)</td>
</tr>
<tr>
<td>alpha_computing</td>
<td>1</td>
<td>\( c_{\text{out}} \)</td>
</tr>
<tr>
<td>done</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Considering the neural network model depicted in Figure 4.1, the total number of clock cycles of the convolution algorithm is:

\[
\begin{aligned}
\text{Convolution\_cycles} &= 1 + 1 + w_{\text{out}}^2 \times (w_{\text{filter}}^2 + 1) + 1 \\
&\quad + c_{\text{out}} \times w_{\text{out}}^2 \times (w_{\text{filter}}^2 + 1 + c_{\text{in}} + 1) \\
&\quad + c_{\text{out}} \times (1 + 1 + 1 + 1) + 1 \\
&= 4 + 169 \times (4 + 1) + 6 \times 169 \times (4 + 1 + 1 + 1) + 6 \times 4 = 7971
\end{aligned}
\]
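The same count can be obtained for other layer shapes by summing the rows of Table 4.2, each weighted by its multiplicity. The sketch below is only a restatement of the table; the function name `convolution_cycles` is hypothetical.

```python
def convolution_cycles(w_out, w_filter, c_in, c_out):
    """Clock cycles of the convolutional algorithm (rows of Table 4.2)."""
    fixed = 1 + 1 + 1 + 1                           # idle, weights_precharge,
                                                    # wait_for_last_precharge, done
    precharge = w_out ** 2 * (w_filter ** 2 + 1)    # initial_stage + input_precharge
    per_output = w_filter ** 2 + c_in + 1 + 1       # evaluation, output_computation,
                                                    # batch_normalization, increase_batch
    per_channel = 1 + 1 + 1 + 1                     # wait_for_last_result, store_results,
                                                    # change_channel_out, alpha_computing
    return (fixed + precharge
            + c_out * w_out ** 2 * per_output
            + c_out * per_channel)


# Values used in the worked example: w_out = 13, w_filter = 2, c_in = 1, c_out = 6
total = convolution_cycles(w_out=13, w_filter=2, c_in=1, c_out=6)
print(total)
```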

- Fully connected: the process starts with the weights and inputs precharging (weights\_precharge, input\_precharge) in the Binary Input RF, which requires at least \( w_{\text{out(fc)}} + 1 \) clock cycles. After the precharging phase has finished, the evaluation starts (evaluation\_fc) and terminates only when it has scanned all the fully connected contributions of the current sub-group, requiring L clock cycles (as already explained for Figure 4.16): the temporary results are saved into the RF TMP POP (save tmp res fc). This procedure, based on evaluation and result saving, is repeated for all the output neurons, which are \( w_{\text{out}(fc)} \). After all the temporary results are obtained, the increase\_fc state increases count\_fc. Once the next inputs/weights have been selected, the entire procedure is repeated a number of times equal to \( n_{\text{iter}} \), which is defined as:

\[
{n_{\text{iter}}} = \frac{\text{number of fc parameters}}{L} = \frac{1014}{6} = 169
\]  

In fact, as already said in the fc scheduling (section 4.1.2), fetching 1014 inputs with \( L = 6 \) requires 169 iterations. At the end of the algorithm, the results are scanned in order to be saved outside the neural_network (scan\_fc), and store\_fc\_res signals the datasave to store the FC results.

Table 4.3: Clock cycles required by the fully connected layer algorithm.

<table>
<thead>
<tr>
<th>State</th>
<th>Required clock cycles</th>
<th>Multiplicity</th>
</tr>
</thead>
<tbody>
<tr>
<td>idle</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>weights_precharge</td>
<td>1</td>
<td>\( n_{\text{iter}} \)</td>
</tr>
<tr>
<td>input_precharge</td>
<td>\( w_{\text{out}(fc)} \)</td>
<td>\( n_{\text{iter}} \)</td>
</tr>
<tr>
<td>evaluation_fc</td>
<td>\( L \)</td>
<td>\( n_{\text{iter}} \times w_{\text{out}(fc)} \)</td>
</tr>
<tr>
<td>save_tmp_results_fc</td>
<td>1</td>
<td>\( n_{\text{iter}} \times w_{\text{out}(fc)} \)</td>
</tr>
<tr>
<td>increase_fc</td>
<td>1</td>
<td>\( n_{\text{iter}} \)</td>
</tr>
<tr>
<td>scan_fc</td>
<td>\( w_{\text{out}(fc)} \)</td>
<td>1</td>
</tr>
<tr>
<td>store_fc_res</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>done</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Considering the neural network model depicted in Figure 4.1, the total number of cycles required is given by:

\[
\begin{aligned}
\text{FC\_cycles} &= 1 + n_{\text{iter}} \times \big(1 + w_{\text{out}(fc)} + w_{\text{out}(fc)} \times (L + 1) + 1\big) + w_{\text{out}(fc)} + 1 + 1 \\
&= 1 + 169 \times (1 + 10 + 10 \times (6 + 1) + 1) + 10 + 1 + 1 = 13871
\end{aligned}
\]
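As a cross-check, the per-state costs of Table 4.3 can be summed in a few lines of Python. This is only a sketch of the arithmetic: the function name and its arguments are illustrative, and it assumes that the number of fc parameters is an exact multiple of L, as in Figure 4.1.

```python
def oom_fc_cycles(n_fc_parameters, L, w_out_fc):
    """Clock cycles of the OOM fully connected layer, summed from Table 4.3."""
    n_iter = n_fc_parameters // L  # assumes an exact division (1014 / 6 = 169)
    per_iteration = (
        1                        # weights_precharge
        + w_out_fc               # input_precharge
        + w_out_fc * (L + 1)     # evaluation_fc + save_tmp_results_fc per neuron
        + 1                      # increase_fc
    )
    # idle + all iterations + scan_fc + store_fc_res + done
    return 1 + n_iter * per_iteration + w_out_fc + 1 + 1


print(oom_fc_cycles(n_fc_parameters=1014, L=6, w_out_fc=10))  # 13871
```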

### 4.1.3 Flatten layer

The flatten layer takes the convolutional results and vectorizes them. As shown in Figure 4.19, the output register files placed at the end store the outputs of the convolutional process, using the approach illustrated in Figure 4.2. The total number of output register files is equal to the maximum $c_{out}$ of the considered neural network, and the procedure used to flatten the outputs is depicted in the following figure:

![Flatten layer diagram](image)

Figure 4.26: Example of flattening procedure. Each matrix represents a convolutional output channel.

And the corresponding algorithm:

```plaintext
for j=0:w_out**2-1
    for i=0:c_out-1
        if (output_convolution_stored(i)(j)==zeros)
            flat(i+j*(c_out)) = '0';
        else
            flat(i+j*(c_out)) =
                not(output_convolution_stored(i)(j)(n_bit-1));
        end
    end
end
```
This layer is simply implemented with two nested generate statements.
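
For readers who prefer an executable form, the pseudocode above can be mirrored in Python. This is a behavioral sketch only, not the VHDL generate statements: it assumes the stored values are signed (two's complement) integers, so that the negated sign bit is 1 for positive values and 0 for negative ones, and the function name is illustrative.

```python
def flatten(output_convolution_stored):
    """Binarize and vectorize the channels, interleaving them position by position.

    output_convolution_stored[i][j] is the signed value at position j of channel i.
    """
    c_out = len(output_convolution_stored)
    n_positions = len(output_convolution_stored[0])  # w_out ** 2
    flat = [0] * (c_out * n_positions)
    for j in range(n_positions):
        for i in range(c_out):
            value = output_convolution_stored[i][j]
            # Zero maps to 0; otherwise NOT(sign bit): 1 if positive, 0 if negative.
            flat[i + j * c_out] = 1 if value > 0 else 0
    return flat


# Two channels with three positions each.
print(flatten([[3, -2, 0], [-1, 5, 7]]))  # [1, 0, 0, 1, 0, 1]
```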

### 4.1.4 Neural network entity

The top entity of the project is called `neural_network` and contains the convolutional and max pooling layers. Figure 4.27 shows an example of a neural network with $c_{out} = 2$ and $c_{in} = 4$.

Figure 4.27: Example of a neural network top entity with $c_{out} = 2$, $c_{in} = 4$. The hardware inside the dashed border is included in the neural network top entity. This scheme is valid for both the OOM and In-Memory architectures.

**Convolutional data flow**

- Several register files are placed in an external unit and store the values that the neural network needs to work properly. The external inputs, such as the input image and the convolutional weights, are fed together to the neural network. Starting from the RF INPUT Image, each output is fed to two multiplexers connected to the Input selector POOL and the Input selector CONV respectively (1), since the neural network circuit can perform either pooling or convolution on the input image, depending on the initial model. Accordingly, cond1 is defined as:

  `cond1 <= do_pool AND to_integer(unsigned(iteration_cycle))=0;`

- The variable iteration_cycle indicates the layer considered in the neural network model: if it is equal to 0, the layer is the first one, and so on. During an iteration cycle, both a pooling layer and a convolutional/fully connected layer can be executed, meaning that iteration_cycle is increased only when both pooling and convolution have completed their algorithms and asserted their done signals;

- If a max pooling computation is required in the first layer, the variable do_pool is equal to '1', which feeds the max-pooling layer with the input image. If pooling is performed on the input image (as done in the neural network model in Figure 4.1), the Input image goes into the Input selector POOL (2), which selects \( w^2_{filter} \) inputs out of \( w^2_{in} \) to feed the max pooling layers. Since the input image has 28x28x1 pixels, only the first pooling layer has to be considered.

- Once pooling has been computed, the values are stored inside the RF Pool and, since only one pooling layer is used in the neural network model in Figure 4.1, the first register file is precharged. Moving toward point 4, the pooled outputs can be selected by two parallel multiplexers, both connected to the Input preset selector CONV. The first one (piloted by the signal cond4) selects between the Input image and the pooling result: this is useful at the beginning of the algorithm, when the Input image may have to be processed directly by the convolutional layer instead of by max pooling. The cond4 signal is defined as:

  `cond4 <= do_pool AND to_integer(unsigned(iteration_cycle))=0;`

  When this condition is verified (cond4=1), the RF Pool's outputs are selected and sent to the multiplexer highlighted by 5. The cond2 signal is defined as:

  `cond2 <= to_integer(unsigned(iteration_cycle))/=0;`

  When iteration\_cycle is higher than 0, the Input image is no longer considered, since it has already been processed, so the multiplexer can only choose between the pooling results and the convolutional ones.

- Once the pooling results have been chosen, the Input selector CONV selects $w^2_{\text{filter}}$ out of $w^2_{\text{out}}$ input values and propagates them inside the convolutional layer. The yellow blocks named "MUXES" are placed outside the chip and indicate that several multiplexers each select only one out of $w^2_{\text{filter}}$ inputs: this strategy reduces the number of input bits of the architecture, as already discussed in the K and $\alpha$ computation parts (section 4.1.2).

- The convolutional layer computes the convolution results by taking $c_{\text{in}}$ inputs, depending on the neural network model considered (for example, Figure 4.1 uses only 1 input channel). In the example proposed in Figure 4.27, there are 2 output channels: this means that the convolution has to be executed with two different sets of input weights, which are fetched from the RF conv weights. As shown in Figure 4.27, there are 4 pairs of RF conv weights, since 2 output kernels are required for each input channel, giving $c_{\text{out}} \times c_{\text{in}} = 2 \times 4 = 8$ register files.

- Once a convolution result has been computed, it is provided to the Temporary CNV RES, which stores the temporary convolutional results one by one; when an entire output channel has been computed, the data are transferred in parallel to one of the output register files (addressed by channel selected, which indicates the output channel considered). The number of output register files is always equal to the maximum number of output channels in the neural network model, so in this case there are 2 register files, while for the model depicted in Figure 4.1 there must be at least 6.

- The point 6b indicates the register files configuration for the A and B inputs required by the batch normalization: when Fully connected layer=0, each RF A CONV, RF B CONV outputs are selected by channel selected.

- Once the convolutional results are stored inside the output register files, the convolutional layer's done signal is asserted and iteration_cycle is increased. From then on cond2 is always true, since iteration_cycle is different from 0, and cond3 is verified whenever do_pool is asserted, since it is defined as:

  `cond3 <= do_pool AND to_integer(unsigned(iteration_cycle))/=0;`

**Fully connected data flow**

- When Fully connected layer is equal to '1', the layer considered is a fully connected one. In this case the convolutional parameters are completely ignored by the convolutional layer, while the weights FC and the parameters highlighted by (8) are considered. The cond5 signal selects between the output of the convolutional layer and the Input image, and it is defined as:

  `cond5 <= fully_connected_layer AND to_integer(unsigned(iteration_cycle))/=0;`

  It is also possible to have a max-pooling layer followed by a fully connected one; in this case the multiplexer driven by cond6 can also choose the RF Pool's outputs. The cond6 signal is defined as:

  `cond6 <= fully_connected_layer='1' and do_pool='1';`

- The light blue multiplexer selects between the fc weights and the sign of the convolutional inputs, based on the fully connected layer signal, since for a fully connected layer the architecture has to precharge the fully connected binary weights, instead of the convolutional binary inputs, inside the Binary Input RF.

- When iteration_cycle is equal to 0 and fully connected layer is equal to '1', the first layer considered is a fully connected one and, consequently, the input image is used as the FC input. The flatten layers vectorize the matrix input and the corresponding output vector is fed to the FC scheduling, already discussed in section 4.1.2 (Figure 4.16).
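
The six multiplexer conditions defined in the two data flows can be collected in a small behavioral model. The Python sketch below only restates the Boolean definitions given in the text; the function name is illustrative, and the value of do_pool for the second layer of Figure 4.1 is an assumption made for the example.

```python
def mux_conditions(do_pool, fully_connected_layer, iteration_cycle):
    """Evaluate cond1..cond6 as defined in the text, using Python booleans."""
    first_layer = iteration_cycle == 0
    return {
        "cond1": do_pool and first_layer,
        "cond2": not first_layer,
        "cond3": do_pool and not first_layer,
        "cond4": do_pool and first_layer,
        "cond5": fully_connected_layer and not first_layer,
        "cond6": fully_connected_layer and do_pool,
    }


# First iteration of Figure 4.1: max pooling followed by convolution.
print(mux_conditions(do_pool=True, fully_connected_layer=False, iteration_cycle=0))
# Second iteration: fully connected layer (do_pool=False is assumed here).
print(mux_conditions(do_pool=False, fully_connected_layer=True, iteration_cycle=1))
```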

Register files dimensions

The dimensions of each register file are now analyzed:

• The RF INPUT Image holds the input image values, which are 28x28x1 pixels of n_bit each;

• RF Conv weights: they hold the weights used in the convolutional process. Looking at the architecture in Figure 4.27, 2 kernels of $w^2_{\text{filter}}$ weights are needed, each weight with a bit length of n_bit. The outputs are $w^2_{\text{filter}}$ weights of $n_{\text{bit}}$ each, which are selected by the MUXES in order to feed only one of them at a time to the convolutional layer;

• RF A conv, RF B conv: hold the values of A,B of the convolutional layer. The total number of registers with a bitlength of n_bit in each register file is equal to $c_{\text{out}}$, that in Figure 4.27 is equal to 2. The output is a single value of n_bit that depends on which output channel is considered;

• RF Weights FC: holds the values of the weights of the fully connected layer. The total number of weights required is equal to $\text{number\_of\_fc\_parameters} \times w_{\text{out(fc)}}$ (that in Figure 4.1 is equal to 1014 for each output, so $1014 \times 10 = 10140$);

• Temporary RF CNV: holds the temporary values of the convolution/fully connected layer. It is a register file with $w^2_{\text{out}}$ locations of n_bit each;

• Output register files: each register file has a number of registers equal to $w^2_{\text{out}}$ of n_bit each. The total number of register files used is equal to $c_{\text{out}}$, since each channel has to be stored. In the example proposed in Figure 4.27, $c_{\text{out}} = 2$;
• RF A FC, RF B FC: they store the A and B parameters for the batch normalization in the fully connected layer. BatchNorm applied to a fully connected layer consists of normalizing all the neurons' outputs, so these register files require at least \( w_{out(fc)} \) registers of \( n\_bit \) each. Considering the example in Figure 4.1, \( w_{out(fc)} \) is equal to 10.

The layer parameters tell the convolutional layer the dimensions of the layer under examination, since each layer has different parameters (\( w_{filter}, w_{out}, w_{in}, ... \)). They define the terminal counts of the counters used in the entity, some constant values (such as \( \frac{1}{w_{filter}^2} \) in the Alpha computer and K unit) and so on.
Control unit

Figure 4.28: Neural network’s FSM
• **Parameters precharge**: all the parameters are precharged in the register files. Each register file receives its values one at a time, while the different register files are loaded in parallel. This state terminates when the signal done acq is asserted: it is driven by the external data generator and is equal to '1' when all the inputs are stored. Considering the neural network in Figure 4.1, this state finishes when all the fully connected weights have been read, since with 10140 values they are the largest parameter set. This part will be explained in subsection 4.3.1;

• **Start pool**: pooling layer starts the computation and waits until the end. If the pooling layer is not performed (do_pool=0), the max-pooling’s control unit asserts immediately done pooling;

• **Start convolution init**: once pooling asserts done_pooling, the convolution/fully connected can start the computation;

• **Wait for done**: FSM waits until the end of the convolution, signaled by done conv;

• When done conv is '1', the convolutional layer has finished its computation (convolution or fully connected). At this point, the terminal count of the iteration cycle is tested: if it is '0', the architecture has not yet processed all the layers defined in the neural network model and the convolutional layer has to be reused for another computation. The counter that handles iteration cycle is then enabled and its value increases, and the input parameters of the convolutional layer change according to the layer to be examined. The FSM moves back to parameters precharge, since the new layer needs different values, which are stored again in the register files. This is a very important concept, because it allows the architecture to be reused;

• **Done**: the neural network has finished and the classification result is available.
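
The state sequence described above can be traced with a short behavioral model. This Python sketch is not the VHDL control unit: it collapses every state into a single step, uses illustrative snake_case state names, and assumes that the terminal count is reached when iteration_cycle equals the number of convolution/fully connected layers minus one.

```python
def top_level_fsm(n_layers):
    """Return the (state, iteration_cycle) pairs visited by the top-level FSM."""
    trace = []
    iteration_cycle = 0
    while True:
        for state in ("parameters_precharge", "start_pool",
                      "start_convolution_init", "wait_for_done"):
            trace.append((state, iteration_cycle))
        # done_conv asserted: test the terminal count of iteration_cycle.
        if iteration_cycle == n_layers - 1:
            break
        iteration_cycle += 1  # reuse the same hardware for the next layer
    trace.append(("done", iteration_cycle))
    return trace


# Figure 4.1: two iterations (pool + convolution, then fully connected).
for state, layer in top_level_fsm(n_layers=2):
    print(layer, state)
```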

### 4.1.5 VHDL implementation

From a VHDL point of view, a package has been defined that makes it possible to implement every kind of neural network. Once the model has been designed, two different parameter sets (called fixed and variable parameters) are chosen accordingly:

- Fixed parameters: define the worst case dimensions of the network, such as the Binary Input RF size, the $w_{out}$, $w_{in}$ and $w_{filter}$ values and so on;

- Variable parameters: define the actual layer’s dimensions and type. They are addressed by iteration cycle and allow to dynamically program the behavior of the architecture, based on the layer examined.

Fixed parameters

In order to implement the neural network model depicted in Figure 4.1, the following fixed parameters have been used:

```
-----------------FIXED PARAMETERS---------------------
constant w_out : integer := 14 ;
constant h_max : integer := 169 ;
constant w_in : integer := 28 ;
constant w_filter : integer := 4 ;
constant w_filter_s : integer := 2 ;
constant width_sram : integer := 6 ;
constant number_of_output_channels : integer := 6 ;
constant number_of_input_channels : integer := 1 ;
constant number_of_fc_parameters : integer := 1014 ;
constant number_of_sum_elements : integer := 11 ;
constant number_of_neurons_output : integer := 10 ;
constant input_image_size_x : integer := 28 ;
constant input_image_size_z : integer := 1 ;
constant flatten_x : integer := 169 ;
constant flatten_z : integer := 6 ;
---------- counters and other parameters -----------
constant n_bit_channel_sel : integer := 8 ;
constant n_bit_cnw_pos_type : integer := 9 ;
constant n_bit_counter_k : integer := 8 ;
constant n_bit_counter_sram : integer := 8 ;
constant number_of_bits_counter : integer := 10 ;
constant count_fc_terminal_count : integer := 169 ;
```


---
- **w_out**: the maximum output dimension is defined by the pooling layer, which takes as input a matrix of \( w_{in}^2 = 28 \times 28 \) pixels and processes it with **stride** = 2 and **w_filter** = 2, so:

\[
w_{out} = \frac{w_{in} - w_{filter}}{\text{stride}} + 1 = \frac{28 - 2}{2} + 1 = 14
\] (4.24)

Since the convolutional layer after the max-pooling has a 14x14 input matrix, its output dimension is 13x13, which is smaller than the output size of the max-pooling layer. For this reason, **w_out** is fixed to 14;

- **h_max**: defines the number of rows (H) of the **Binary Input RF**, which has to be equal to the maximum \( w_{out}^2 \) dimension of the convolutional layer. Since the neural network model depicted in Figure 4.1 has only one convolutional layer, **h_max** is fixed to \( 13 \times 13 = 169 \);

- **w_in**: defines the input size, which is equal to 28, since the input image has 28x28 pixels;

- **w_filter_s** and **w_filter**: they represent \( w_{filter} \) and \( w_{filter}^2 \) respectively. They define the maximum number of inputs given simultaneously to the convolutional/max pooling layers. Considering the neural network model depicted in Figure 4.1, the maximum number of inputs is equal to 2x2=4, since both the max pooling and the convolutional layers have the same kernel size;

- **width_sram**: represents the W dimension of the **Binary Input RF**. For the reasons explained in section 4.1.2, it is set to 6. This dimension must be greater than or equal to both L and the maximum number of kernel elements (**w_filter**), which in the model depicted in Figure 4.1 is 4;

- **number_of_output_channels**: defines the maximum number of output channels in the model. Considering Figure 4.1, the maximum number of output channels is 6;

- **number_of_input_channels**: maximum number of input channels processed simultaneously in the architecture. Considering Figure 4.1, only 1 channel is fed to the convolutional/pooling layers. The convolutional layer is followed by a fully connected layer, which takes the vectorized input;

- **number_of_fc_parameters**: total number of inputs required by the fully connected layer. This number is defined by the vectorization process, which takes as input the 13x13x6 IFMAPs and vectorizes them into a 1014-element vector;

- **number_of_sum_elements**: defines the maximum number of bits required to perform the pop-counting operation. This value has been computed considering the worst case, which is the fully connected layer, since a sum among 1014 elements (number_of_fc_parameters) has to be computed. If they are all equal to -1, the sum is -1014 and the number of bits (including the sign bit) has to be at least:

\[
\text{number\_of\_sum\_elements} = \lceil \log_2(|-1014|) \rceil + 1 = 10 + 1 = 11
\] (4.25)

- **number_of_neurons_output**: maximum number of output neurons of the fully connected part. In the model proposed in Figure 4.1, it is equal to 10;

- **input_image_size_x** and **input_image_size_z**: define the maximum input dimensions. Since MNIST has been used, they are equal to 28 and 1;

- **flatten_x** and **flatten_z**: define how to transform the output matrix convolution into a vector. The vectorization procedure has been already presented in subsection 4.1.3.

The other parameters are the number of bits required by the counters in the architecture.
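
Several of the fixed parameters follow directly from the model dimensions, and the derivation can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are illustrative, the convolution stride of 1 is inferred from the 14x14 to 13x13 reduction, and the bit width is computed as the smallest two's complement width that holds any sum between -N and +N.

```python
def output_size(w_in, w_filter, stride):
    """Output width of a pooling or convolution window sliding over w_in."""
    return (w_in - w_filter) // stride + 1


def signed_sum_bits(n_terms):
    """Two's complement bits needed for a sum of n_terms values equal to +1 or -1."""
    return n_terms.bit_length() + 1


w_out = output_size(w_in=28, w_filter=2, stride=2)         # max pooling: 14
conv_out = output_size(w_in=w_out, w_filter=2, stride=1)   # convolution: 13 (stride 1 assumed)
h_max = conv_out ** 2                                      # 169 rows of the Binary Input RF
number_of_fc_parameters = h_max * 6                        # 13 x 13 x 6 = 1014
number_of_sum_elements = signed_sum_bits(number_of_fc_parameters)  # 11
count_fc_terminal_count = number_of_fc_parameters // 6     # width_sram = 6, so 169

print(w_out, h_max, number_of_fc_parameters,
      number_of_sum_elements, count_fc_terminal_count)
```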

**Variable parameters**

The variable parameters play an important role in the architecture, since they allow to dynamically program the behavior of the neural network:

    ----------VARIABLE PARAMETERS----------
    constant n_layers : integer := 2 ;
    constant number_of_layers : std_logic_vector ( n_layers-1 downto 0 ) := "01";
    constant conv_layer_size_x : int_vect ( n_layers-1 downto 0 ) := ( 1 ,14 );

As can be seen, they are all vectors. Each element is selected by the variable iteration_cycle, which keeps track of which layer is going to be computed. The layer type parameters define the behavior of the network:

1. do_batch_layer: selects, for each layer, the value of the variable do_batch used in the convolution computation unit (Figure 4.17), which is obtained as

    do_batch <= do_batch_layer(to_integer(unsigned(iteration_cycle)));

    When the selected element is '1', do_batch is equal to '1' and batch normalization is computed. In the first layer, batch normalization is computed;

2. do_pool_layer: defines whether the pooling layer has to be computed or not. From this vector, the variable do_pool is obtained, which is used in the FSM of the pooling layer (Figure 4.4):

    do_pool <= do_pool_layer(to_integer(unsigned(iteration_cycle)));

3. do_relu_layer: defines the variable do_relu as:

    do_relu <= do_relu_layer(to_integer(unsigned(iteration_cycle)));

    It is used in the convolution computation unit (Figure 4.17). In the first layer, ReLU is computed;

4. fully_connected_layer: when '1', fully connected computation is considered.

The other parameters have the following meanings:

- **n_layers**: defines the dimensions of the parameters’ vectors;

- **number_of_layers**: maximum number of layers to be considered in the neural network. Considering Figure 4.1, there are only three layers (max-pool, convolution, fully connected). Since the variable iteration_cycle is incremented only when a convolution/fully connected layer terminates its computation, and the max pooling computation is integrated in the convolution computation, number_of_layers is equal to 2 ("01"). This is also the terminal count for the iteration_cycle variable, so only the first two elements of each vector can be selected;

- **convolutional_layer_size_x**: defines the dimensions of the IFMAP of the layer considered. The first layer is a convolutional layer with the dimensions defined in Figure 4.1: \( w_{in} = 14 \). The second layer is a fully connected layer and, since it is a different type of computation, this value is set to 1 and is not considered;

- **kernel_size_xy** and **kernel_size_xy_pow** are \( w_{filter} \) and \( w_{filter}^2 \) respectively. Considering Figure 4.1, the first convolutional layer has a kernel size of 2x2, so kernel_size_xy = 2 and kernel_size_xy_pow = 4. The value kernel_size_xy_pow also defines the terminal count pop used in the convolutional layer's control unit (Figure 4.20) and indicates how many columns of the Binary input RF have to be considered for the pop-computation. In the case of the convolutional layer, this is equal to 4, while for the fully connected layer it is equal to 6 for the reasons explained in the fc scheduling in section 4.1.2: for the fully connected layer, kernel_size_xy_pow indicates the L value reported in Figure 4.16;

- **convolutional_layer_size_z**: defines the number of input channels processed simultaneously by the convolutional layer. The first convolutional layer has only 1 input channel. In the case of a fully connected layer, it has a different meaning: this value is equal to 169 and indicates the number of groups \( n_{iter} \) into which the FC scheduling divides the fc inputs (discussed in section 4.1.2). Considering \( \text{number\_of\_fc\_parameters} = 1014 \) and \( \text{kernel\_size\_xy\_pow} = 6 \):

\[
n_{iter} = \text{convolutional\_layer\_size\_z} = \frac{1014}{6} = 169
\]

In other words, the 1014 FC inputs/weights are processed 6 at a time, 169 times.

- **stride_sel_c**: stride values used in the convolutional layers. In the fully connected layer this value is not used;

- **output_size_conv and output_size_conv_pow**: refer to \( w_{out} \) and \( w^2_{out} \) respectively. Considering Figure 4.1, the convolutional layer has \( w_{out} = 13 \) and \( w^2_{out} = 169 \). The value of output_size_conv_pow is used as terminal count SRAM in the convolutional layer’s control unit (Figure 4.20) and defines how many rows of the Binary input RF have to be considered in the computation. In the fully connected layer, since there are only 10 output neurons (Figure 4.1), output_size_conv_pow is equal to 10.

The same considerations are valid for the pooling layer.
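
The way iteration_cycle selects one element from each parameter vector can be illustrated in Python. The sketch below lists only the parameters whose values are stated in the text for both layers of Figure 4.1; the dictionary layout and the function name are illustrative, and Python index 0 corresponds to the first layer (the rightmost element of a VHDL `downto` vector).

```python
# Per-layer values quoted in the text: index 0 = convolution, index 1 = fully connected.
VARIABLE_PARAMETERS = {
    "conv_layer_size_x": [14, 1],
    "kernel_size_xy_pow": [4, 6],             # w_filter^2 for conv, L for fc
    "convolutional_layer_size_z": [1, 169],   # c_in for conv, n_iter for fc
    "output_size_conv_pow": [169, 10],        # w_out^2 for conv, output neurons for fc
    "fully_connected_layer": [False, True],
}


def layer_parameters(iteration_cycle):
    """Select the parameters of the layer addressed by iteration_cycle."""
    return {name: values[iteration_cycle]
            for name, values in VARIABLE_PARAMETERS.items()}


fc = layer_parameters(1)
# n_iter * L recovers number_of_fc_parameters for the fully connected layer.
print(fc["convolutional_layer_size_z"] * fc["kernel_size_xy_pow"])  # 1014
```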
## 4.2 In-memory implementation

The original OOM circuit has been revised in order to implement an In-Memory alternative. The XNOR Unit is integrated into a memory array, allowing near-data computation and reducing the von Neumann bottleneck. Since the pop-counting circuitry is composed of very simple elements (memory element + full adder), it can be implemented in a memory-like structure, as already done for the XNOR Unit. All the other components (such as the $K$, $\alpha$ and convolution computational units) remain the same.

### 4.2.1 Convolutional/fully connected layer

Figure 4.30 shows the XNOR part integrated in memory: for each memory cell (represented by a rectangle and implemented as a flip-flop), there is an XNOR gate that executes $w_i \oplus m_j$. Once the memory is precharged, the computation starts and all the XNOR gates provide a result at the same time: for each wordline, there is a multiplexer piloted by count pop, which selects which XNOR result to consider for the pop-counting part. At the end of the pop circuits, there is a multiplexer that selects one of the pop results to be considered for the output computer. Having only one result at a time makes it possible to reuse the convolution computation unit of the OOM implementation, depicted in Figure 4.17. The remaining parts are the same as in Figure 4.19.

Figure 4.30: Example of XNOR in memory with $w_{in} = 4$, $w_{filter} = 2$ and $W = 4$. For each memory cell there is a XNOR gate that computes the xnor between the binary weights (first row) and the corresponding binary inputs. At the end of each row (excluding the first one reserved to the binary weights), there is a multiplexer which selects the Incoming bit as discussed in the OOM implementation. For each incoming bit there is a pop-counting unit and each pop-output is selected by a final multiplexer.
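
The row-parallel XNOR and pop-counting scheme of Figure 4.30 can be illustrated with a short behavioral model. The Python sketch below is not the hardware description: it assumes the usual binary network encoding in which bit 1 stands for +1 and bit 0 for -1, so that the signed sum equals twice the number of matching bits minus the row length, and all names are illustrative.

```python
def xnor_popcount(weights, inputs):
    """Signed dot product of two +1/-1 vectors stored as bits (1 -> +1, 0 -> -1)."""
    assert len(weights) == len(inputs)
    # XNOR is 1 exactly when the two bits are equal.
    matches = sum(1 for w, m in zip(weights, inputs) if w == m)
    return 2 * matches - len(weights)


def in_memory_rows(weights, rows):
    """Every wordline holds one input window; all rows are evaluated together."""
    return [xnor_popcount(weights, row) for row in rows]


weights = [1, 0, 1, 1]                            # first row: binary weights
rows = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 1]]  # binary inputs, one window per row
print(in_memory_rows(weights, rows))  # [2, -4, 4]
```

In the In-Memory architecture each row has its own pop-counting unit, so the list above corresponds to results that become available together, whereas a serial scheme would produce them one after the other.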

The following figure shows the entire in-memory architecture of the convolutional layer:

Figure 4.31: Example of an in-memory convolutional layer architecture with $c_{in} = 4$ and $c_{out} = 2$.

To implement the fully connected layer, the same approach as in the OOM architecture has been used, which has already been described in section 4.1.2. The main difference with respect to the OOM architecture is that the RF TMP pop is not used, because the temporary values are already stored inside each pop-counting unit. It is sufficient to switch the output multiplexer, depicted in Figure 4.30, to directly obtain the correct pop-counting value.

Control unit

The following figure shows the FSM of the convolutional/fully connected layer for the In-Memory implementation. The numbers and letters depicted in Figure 4.32 indicate the differences between the OOM control unit (Figure 4.20) and the In-Memory one. The other states execute the same operations already described in the OOM implementation.

- **①**: after batch normalization, the change cnv res (change convolution result) state is executed. This state changes the result selected by the last multiplexer depicted in Figure 4.30, since count mux out is increased;

- **②**: the terminal count change cnv res is tested. It is equal to ’1’ when all the pop-counting outputs are scanned by the last multiplexer in Figure 4.30 that, in the neural network model in Figure 4.1, happens when count mux out is equal to 169. In this case, the output will be stored to Temporary RF CNV, otherwise a new output computation is performed;

- **③**: the most important difference between the In-Memory and the OOM architectures is located in the fully connected part. Since the in-memory architecture has multiple pop-counting units, the temporary result is already stored inside them. Once the evaluation fc phase has terminated, the FSM moves directly to increase fc, allowing to select another set of fc inputs/weights (as already described in section 4.1.2) and to speed-up FC computation.
Figure 4.32: FSM of the convolutional/fully connected layer of the In-Memory implementation.

Figure 4.33: Timing diagram of the convolution computation in the In-Memory architecture. Starting from **Weights precharge** (WP), the binary weights are precharged inside the first row of the **XNOR UNIT**. During the **Initial stage**, the \( K \) computation starts, requiring \( w_{\text{filter}}^2 \) clock cycles. The binary inputs are precharged inside the memory array during **Input precharge** (IP), in which the **Counter SRAM** is also increased. During **evaluation**, the \( \alpha \) computation starts and the pop-counting results are computed in parallel, requiring \( w_{\text{filter}}^2 \) clock cycles: this is the most important difference with respect to the OOM architecture, in which the evaluation process has to be repeated for each output (Figure 4.23). After pop-counting has finished, the **output computation** (OC), **batch normalization** (BN) and **ReLU** computations are performed and repeated for each output. In **Change CNV Res** (CNV), **count mux out** is increased and the final multiplexer in Figure 4.30 addresses another output. The procedure finishes when **count mux out** is 168; at this point the second weight set is selected, \( \alpha \) is computed again and the FSM restarts with **evaluation** (EV).

Figure 4.34: The algorithm starts with the **Weights precharge** (WP) state, in which the binary fc inputs are precharged in the first row of the **XNOR Memory**, because of the inverted precharging order between weights and inputs (section 4.1.2). During **input precharge**, the fully connected weights are also stored inside the memory. **Evaluation fc** starts and ends within 6 clock cycles, since $L = 6$: in this phase, all the parallel pop-counting units compute at the same time the partial results of the $w_{out(fc)}$ neurons, which are 10 in the neural network model depicted in Figure 4.1. After **evaluation fc**, the FSM increases **count fc** during **increase fc** (IFC), following the fc scheduling already explained in section 4.1.2. At this point the algorithm starts again from **weights precharge**. Comparing this with the timing diagram of the fully connected layer for the OOM case (Figure 4.25) shows the main difference between the two: OOM has to perform the pop-counting calculations serially, storing the partial results inside the RF TMP POP, while the In-Memory alternative performs the computation in parallel, without the need to store the partial results, since they are held by the last register of the pop-counting units (Figure 4.7).

Scheduling

The clock cycles required to compute both the convolutional and the fully connected parts are also provided for the In-Memory case:

- Convolution: the process starts with the precharging of the binary inputs/weights (states weights_precharge, initial_stage, input_precharge and wait_for_last_precharge), which requires at least $3 + w_{out}^2 \times (w_{filter}^2 + 1)$ clock cycles, as already explained in the OOM part. After that, the data are ready to be processed: evaluation begins and performs all the pop computations in parallel, so that the outputs are ready within $w_{filter}^2$ clock cycles. These values are selected by the last multiplexer, piloted by count mux out in Figure 4.30, and output computation is performed, requiring $c_{in}$ clock cycles for each output. After output computation, batch normalization and ReLU are performed and need only 1 clock cycle. This procedure is repeated for all the $w_{out}^2$ outputs. When all the outputs have been computed, the results can be stored (store_results) and the output channel can be changed (change_channel_out). By changing the kernel, the entire procedure is repeated $c_{out}$ times, each time also requiring a new alpha computation. Detailed information on the duration of each state is reported in Table 4.4.

Table 4.4: Clock cycles required by the convolutional algorithm for the In-Memory architecture.

<table>
<thead>
<tr>
<th>State</th>
<th>Required clock cycles</th>
<th>Multiplicity</th>
</tr>
</thead>
<tbody>
<tr>
<td>idle</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>weights_precharge</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>initial_stage</td>
<td>\( w_{filter}^2 \)</td>
<td>\( w_{out}^2 \)</td>
</tr>
<tr>
<td>input_precharge</td>
<td>1</td>
<td>\( w_{out}^2 \)</td>
</tr>
<tr>
<td>wait_for_last_precharge</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>evaluation</td>
<td>\( w_{filter}^2 \)</td>
<td>\( c_{out} \)</td>
</tr>
<tr>
<td>batch_normalization</td>
<td>1</td>
<td>\( c_{out} \times w_{out}^2 \)</td>
</tr>
<tr>
<td>output_computation</td>
<td>\( c_{in} \)</td>
<td>\( c_{out} \times w_{out}^2 \)</td>
</tr>
<tr>
<td>change_conv_res</td>
<td>1</td>
<td>\( c_{out} \times w_{out}^2 \)</td>
</tr>
<tr>
<td>store_results</td>
<td>1</td>
<td>\( c_{out} \)</td>
</tr>
<tr>
<td>change_channel_out</td>
<td>1</td>
<td>\( c_{out} \)</td>
</tr>
<tr>
<td>alpha_computing</td>
<td>1</td>
<td>\( c_{out} \)</td>
</tr>
<tr>
<td>done</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

For the neural network model depicted in Figure 4.1, the total convolution delay clock cycles are equal to:

\[
\begin{aligned}
\text{Convolution\_cycles} &= 1 + 1 + (w_{filter}^2 + 1) \times w_{out}^2 + 1 \\
&\quad + c_{out} \times (w_{filter}^2 + w_{out}^2 \times (1 + c_{in} + 1)) \\
&\quad + c_{out} \times (1 + 1 + 1) + 1 \\
&= 1 + 1 + (4 + 1) \times 169 + 1 + 6 \times (4 + 169 \times (1 + 1 + 1)) \\
&\quad + 6 \times 3 + 1 = 3933
\end{aligned}
\tag{4.27}
\]
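The count above can be cross-checked by summing cycles times multiplicity over the states of Table 4.4. The following is a minimal Python sketch, not part of the thesis flow (which uses MATLAB and VHDL); the function name is hypothetical and $w_{out} = 13$ is taken from the value $w_{out}^2 = 169$ used in Equation 4.27.

```python
def in_memory_conv_cycles(w_filter, w_out, c_in, c_out):
    """Sum (required cycles x multiplicity) over the states of Table 4.4."""
    states = [
        # (state name, required clock cycles, multiplicity)
        ("idle", 1, 1),
        ("weights_precharge", 1, 1),
        ("initial_stage", w_filter**2, w_out**2),
        ("input_precharge", 1, w_out**2),
        ("wait_for_last_precharge", 1, 1),
        ("evaluation", w_filter**2, c_out),
        ("batch_normalization", 1, c_out * w_out**2),
        ("output_computation", c_in, c_out * w_out**2),
        ("change_conv_res", 1, c_out * w_out**2),
        ("store_results", 1, c_out),
        ("change_channel_out", 1, c_out),
        ("alpha_computing", 1, c_out),
        ("done", 1, 1),
    ]
    return sum(cycles * multiplicity for _, cycles, multiplicity in states)


# Reference model of Figure 4.1 (w_out = 13 is assumed from w_out^2 = 169).
total = in_memory_conv_cycles(w_filter=2, w_out=13, c_in=1, c_out=6)
print(total)
```

Keeping the states in a list makes it easy to see which rows of the table dominate when a parameter is swept.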

- Fully connected: weights/inputs are precharged, requiring \( w_{\text{out}(fc)} + 1 \) clock cycles. Evaluation starts (evaluation\_fc) and terminates when all the columns of the custom memory have been scanned (L clock cycles). The results are already stored inside the pop-counting units, so the algorithm moves to increase\_fc, in which count\_fc is incremented. The entire procedure is repeated \( n_{\text{iter}} \) times to complete the fully connected layer. After that, the outputs are scanned and saved to the external memory, requiring \( w_{\text{out}(fc)} \) clock cycles.

Table 4.5: Clock cycles required by the fully connected layer algorithm for the In-Memory architecture.

<table>
<thead>
<tr>
<th>State</th>
<th>Required clock cycles</th>
<th>Multiplicity</th>
</tr>
</thead>
<tbody>
<tr>
<td>idle</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>weights_precharge</td>
<td>1</td>
<td>( n_{\text{iter}} )</td>
</tr>
<tr>
<td>initial_stage</td>
<td>( w_{\text{out}(fc)} )</td>
<td>( n_{\text{iter}} )</td>
</tr>
<tr>
<td>evaluation_fc</td>
<td>( L )</td>
<td>( n_{\text{iter}} )</td>
</tr>
<tr>
<td>increase_fc</td>
<td>1</td>
<td>( n_{\text{iter}} )</td>
</tr>
<tr>
<td>scan_fc</td>
<td>( w_{\text{out}(fc)} )</td>
<td>1</td>
</tr>
<tr>
<td>store_fc_res</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>done</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

where \( n_{\text{iter}} \) is:

\[
n_{\text{iter}} = \frac{\text{number\_of\_fc\_parameters}}{L} = \frac{1014}{6} = 169
\tag{4.28}
\]

Considering the neural network model depicted in Figure 4.1, the total number of cycles required is given by:

\[
\begin{aligned}
\text{FC\_cycles} &= 1 + n_{\text{iter}} \times (1 + w_{\text{out}(fc)} + L + 1) + w_{\text{out}(fc)} + 1 + 1 \\
&= 1 + 169 \times (1 + 10 + 6 + 1) + 10 + 1 + 1 = 3055
\end{aligned}
\tag{4.29}
\]
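The same bookkeeping can be written as a small function. This is a minimal Python sketch under the assumption that the number of fully connected parameters is an exact multiple of $L$; the function name is hypothetical.

```python
def in_memory_fc_cycles(n_fc_params, L, w_out_fc):
    """Clock cycles of the In-Memory fully connected layer (Table 4.5)."""
    if n_fc_params % L != 0:
        raise ValueError("n_fc_params is assumed to be a multiple of L")
    n_iter = n_fc_params // L
    # weights_precharge + initial_stage + evaluation_fc + increase_fc
    per_iteration = 1 + w_out_fc + L + 1
    # idle + iterations + scan_fc + store_fc_res + done
    return 1 + n_iter * per_iteration + w_out_fc + 1 + 1


fc_total = in_memory_fc_cycles(n_fc_params=1014, L=6, w_out_fc=10)
print(fc_total)
```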

### 4.3 Memories’ sizes

The neural network model depicted in Figure 4.1 is used as reference. These considerations are valid for both the In-Memory and OOM architectures, since they are very similar to each other.

### 4.3.1 Parameters precharging

During the parameter precharge phase, executed at the beginning of the algorithms, the values are stored inside the register files. Each parameter is presented as a parallel $n_{bit}$ word, one parameter per clock cycle. Taking the convolutional weights precharging as an example:

**Figure 4.35:** Data precharging scheduling. One $n_{bit}$ datum per clock cycle is stored in the register files.

This scheduling is valid for all the parameters of the network except the fully connected weights, which are fed to the memory already binarized in order to reduce the total number of input bits and the required memory. However, a wordlength of 1014 would require at least 1014 input bits, so the precharge scheduling of the fully connected part has been changed on the basis of the following considerations:

1. The fully connected computation requires 1014 weights for each output neuron. The model in Figure 4.1 has 10 output neurons, giving a 10x1014 bit matrix, where the first number indicates the rows and the second one the columns. The straightforward way to store it is to feed 1014 bits at a time, one group for each output neuron;

2. By inverting the precharging order, only 10 bits are given in input at a time, for 1014 times.

The total time required by data precharging is given by the maximum precharging time of all the required parameters, which are:

- Input image: $w_{in}^2 = 784$ clock cycles required;

- Convolutional weights: \( w_{\text{filter}}^2 \times c_{\text{out}} = 4 \times 6 = 24 \) clock cycles required;

- Fully connected weights: \( \text{number}\_\text{of}\_\text{fc}\_\text{parameters} = 1014 \) clock cycles required;

- A,B convolutional parameters: the batch normalization parameters are used for each output channel, so \( c_{\text{out}} = 6 \) clock cycles required;

- A,B fully connected parameters: not considered in the neural network model depicted in Figure 4.1 but, in general, \( w_{\text{out}(fc)} \) batch normalization parameters are needed in the fully connected computation (one for each output).

The precharge time is equal to:

\[
\text{Precharge time} = \max(1014, 784, 24, 10, 6) \times t_{\text{ck}} = 1014 \times t_{\text{ck}} \quad (4.30)
\]
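The inverted feeding order and the resulting precharge time can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: the function names and the toy weight matrix are hypothetical, and the parameter streams are assumed to load in parallel, which is why the maximum is taken.

```python
def fc_feed_schedule(weights):
    """Turn a (neurons x parameters) bit matrix into per-cycle input words.

    Each returned word holds one bit per output neuron, so the input bus
    only needs to be as wide as the number of output neurons.
    """
    return [list(word) for word in zip(*weights)]


def precharge_cycles(w_in, w_filter, c_out, n_fc_params, w_out_fc):
    """Cycles needed by each parameter stream and their maximum (Eq. 4.30)."""
    streams = {
        "input_image": w_in**2,
        "conv_weights": w_filter**2 * c_out,
        "fc_weights": n_fc_params,
        "AB_conv": c_out,
        "AB_fc": w_out_fc,
    }
    return max(streams.values()), streams


# Toy matrix: 2 output neurons, 3 binarized weights each (hypothetical values).
toy = [[1, 0, 1],
       [0, 0, 1]]
schedule = fc_feed_schedule(toy)  # 3 cycles, 2 bits per cycle

cycles, streams = precharge_cycles(w_in=28, w_filter=2, c_out=6,
                                   n_fc_params=1014, w_out_fc=10)
print(schedule, cycles)
```

For the reference model the fully connected weights are the longest stream, so they set the precharge time.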

### 4.3.2 Memory required

The total memory needed to store all the parameters of the architecture can be computed from Table 4.6. The values used for the evaluation refer to the neural network model depicted in Figure 4.1 and are the following:

\[
\begin{aligned}
  w_{\text{in}} &= 28 \\
  \text{image\_size\_z} &= 1 \\
  w_{\text{out}} &= 14 \\
  w_{\text{filter}} &= 2 \\
  c_{\text{out}} &= 6 \\
  c_{\text{in}} &= 1 \\
  w_{\text{out}(fc)} &= 10 \\
  \text{number\_of\_fc\_parameters} &= 1014 \\
  n_{\text{bit}} &= 18
\end{aligned}
\]
Table 4.6: Memory required with $n_{\text{bit}} = 18$. All the parameters used in these computations are defined in the fixed parameters part in section 4.1.5 and the model used is depicted in Figure 4.1

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Size</th>
<th>Memory [kB]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Output convolution</td>
<td>$n_{\text{bit}} \times (c_{\text{out}} + 1) \times w_{\text{out}}^2$</td>
<td>3.0870</td>
</tr>
<tr>
<td>Output pooling</td>
<td>$n_{\text{bit}} \times w_{\text{out}}^2 \times c_{\text{in}}$</td>
<td>0.441</td>
</tr>
<tr>
<td>Convolution weights</td>
<td>$n_{\text{bit}} \times w_{\text{filter}}^2 \times c_{\text{out}} \times c_{\text{in}}$</td>
<td>0.054</td>
</tr>
<tr>
<td>$A_{\text{conv}}, B_{\text{conv}}$</td>
<td>$2 \times n_{\text{bit}} \times c_{\text{out}}$</td>
<td>0.027</td>
</tr>
<tr>
<td>$A_{\text{FC}}, B_{\text{FC}}$</td>
<td>$2 \times n_{\text{bit}} \times w_{\text{out}(fc)}$</td>
<td>0.045</td>
</tr>
<tr>
<td>Fully connected weights</td>
<td>$w_{\text{out}(fc)} \times \text{number of fc parameters}$</td>
<td>1.268</td>
</tr>
<tr>
<td>Input image</td>
<td>$w_{\text{in}}^2 \times \text{image size} \times n_{\text{bit}}$</td>
<td>1.764</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>6.686</td>
</tr>
</tbody>
</table>
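The entries of Table 4.6 can be reproduced with a few lines of Python. This is a sketch under two assumptions that are not stated explicitly in the table: 1 kB is taken as 1000 bytes, and the $A$, $B$ rows count both vectors (hence the factor 2), which is what matches the listed 0.027 kB and 0.045 kB. The function name is hypothetical.

```python
def memory_bits(n_bit, w_in, image_size_z, w_out, w_filter,
                c_in, c_out, w_out_fc, n_fc_params):
    """Bits required by each stored quantity of Table 4.6."""
    return {
        "output_convolution": n_bit * (c_out + 1) * w_out**2,
        "output_pooling": n_bit * w_out**2 * c_in,
        "convolution_weights": n_bit * w_filter**2 * c_out * c_in,
        "A_conv_B_conv": 2 * n_bit * c_out,      # assumed: A and B together
        "A_fc_B_fc": 2 * n_bit * w_out_fc,       # assumed: A and B together
        "fc_weights": w_out_fc * n_fc_params,    # already binarized, 1 bit each
        "input_image": w_in**2 * image_size_z * n_bit,
    }


def to_kB(bits):
    """Convert bits to kB, assuming 1 kB = 1000 bytes."""
    return bits / 8 / 1000


bits = memory_bits(n_bit=18, w_in=28, image_size_z=1, w_out=14, w_filter=2,
                   c_in=1, c_out=6, w_out_fc=10, n_fc_params=1014)
total_kB = to_kB(sum(bits.values()))
for name, value in bits.items():
    print(f"{name}: {to_kB(value):.4f} kB")
print(f"total: {total_kB:.4f} kB")
```

Calling `memory_bits` in a loop over `n_bit` gives the trend of the required memory for other word lengths.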

The number of bits $n_{\text{bit}}$ is fixed to 18, as discussed in section 4.5. In general, the required memory as a function of $n_{\text{bit}}$ follows the trend reported in the following plot:

Figure 4.36: Memory required as a function of $n_{\text{bit}}$ for the neural network model depicted in Figure 4.1

### 4.4 Timing comparison

To perform a timing comparison between the OOM and the In-Memory architectures, the neural network model depicted in Figure 4.1 is used.

### 4.4.1 OOM implementation

The OOM architecture requires an amount of time that can be defined by considering all the stages of the neural network and the time durations of each state of the FSMs, that have been already computed. The total time required by the algorithm is given by the sum of the following contributions:

1. Data acquisition: the time required is equal to the maximum number of parameters that have to be fetched from the data generator, which for the case reported in Figure 4.1 is number_of_fc_parameters = 1014. This procedure is performed every time a done signal from the convolutional layer is asserted. The image is precharged only at the beginning: once it has been stored, it is not precharged again. The precharging phase delay contribution is given by:

\[
\begin{aligned}
\text{Delay}_{\text{data (acq)}} = \big[ &\max(w_{\text{filter}}^2, \text{number\_of\_fc\_parameters}, w_{\text{in}}^2, A_{\text{conv}}, A_{\text{FC}}) \\
&+ (n_{\text{layers}} - 1) \times \max(w_{\text{filter}}^2, \text{number\_of\_fc\_parameters}, A_{\text{conv}}, A_{\text{FC}}) \big] \times t_{\text{ck}}
\end{aligned}
\tag{4.31}
\]

Where \( n_{\text{layers}} \) is the number of convolutional/fully connected layers in the network. To reduce the equation’s length:

\[
\text{Delay}_{\text{data (acq)}} = (\phi + (n_{\text{layers}} - 1) \times \psi) \times t_{\text{ck}}
\tag{4.32}
\]

\[
\phi = \max(w_{\text{filter}}^2, \text{number\_of\_fc\_parameters}, w_{\text{in}}^2, A_{\text{conv}}, A_{\text{FC}})
\]

\[
\psi = \max(w_{\text{filter}}^2, \text{number\_of\_fc\_parameters}, A_{\text{conv}}, A_{\text{FC}})
\]

2. Max pooling: the total Pool time, already reported by Equation 4.1, is given by:

\[
\begin{aligned}
\text{Pool\_time} &= (1 + 1 + 1 + w_{out(pool)}^2 \times (1 + w_{filter}^2 + 1) + 1) \times t_{ck} \\
&= (3 + 196 \times (1 + 4 + 1) + 1) \times t_{ck} = 1180 \times t_{ck}
\end{aligned}
\tag{4.33}
\]

3. Perform convolution: from Equation 4.21:

\[
\begin{aligned}
\text{Convolution\_cycles} &= 1 + 1 + w_{out}^2 \times (w_{filter}^2 + 1) + 1 \\
&\quad + c_{out} \times w_{out}^2 \times (w_{filter}^2 + 1 + c_{in} + 1) \\
&\quad + c_{out} \times (1 + 1 + 1 + 1) + 1 \\
&= 4 + 169 \times (4 + 1) + 6 \times 169 \times (4 + 1 + 1 + 1) + 6 \times 4 = 7971
\end{aligned}
\tag{4.34}
\]

The time spent precharging the values can be distinguished from the time spent performing the convolution as follows:

- The term \(3 + w_{out}^2 \times (w_{filter}^2 + 1)\) is formed by the contributions of the idle, initial\_stage, input\_precharge and wait\_for\_last\_precharge states. During these periods, the Binary input RF is precharged, so:

\[
\text{Precharging\_Values} = 3 + w_{out}^2 \times (w_{filter}^2 + 1)
\]

(4.35)

- The remaining terms in the Equation 4.34 come from the convolution computation delay:

\[
\text{Perform\_convolution} = c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2) + c_{out} \times 4 + 1
\]

(4.36)

4. Fully connected layer:

\[
\begin{aligned}
\text{FC\_cycles} &= 1 + n_{iter} \times (1 + w_{out(fc)} + w_{out(fc)} \times (L + 1) + 1) + w_{out(fc)} + 1 + 1 \\
&= 1 + 169 \times (1 + 10 + 10 \times (6 + 1) + 1) + 10 + 1 + 1 = 13871
\end{aligned}
\tag{4.37}
\]

Table 4.7: Timing of the OOM architecture. The reference neural network model is in Figure 4.1

<table>
<thead>
<tr>
<th>Operation</th>
<th>Time required</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Data acquisition</td>
<td>$(\phi + (n_{layers} - 1) \times \psi) \times t_{ck}$</td>
<td>$2 \times 1014 \times t_{ck}$</td>
</tr>
<tr>
<td>Max Pooling</td>
<td>$(3 + w_{out(pool)}^2 \times (2 + w_{filter}^2) + 1) \times t_{ck}$</td>
<td>$1180 \times t_{ck}$</td>
</tr>
<tr>
<td><strong>Convolutional layer</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Precharge binary values</td>
<td>$(3 + w_{out}^2 \times (w_{filter}^2 + 1)) \times t_{ck}$</td>
<td>$848 \times t_{ck}$</td>
</tr>
<tr>
<td>Perform convolution</td>
<td>$(c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2) + c_{out} \times 4 + 1) \times t_{ck}$</td>
<td>$7123 \times t_{ck}$</td>
</tr>
<tr>
<td><strong>Fully connected layer</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Perform computation</td>
<td>$[3 + n_{iter} \times (2 + w_{out(fc)} + w_{out(fc)} \times (L + 1)) + w_{out(fc)}] \times t_{ck}$</td>
<td>$13871 \times t_{ck}$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$25050 \times t_{ck}$</td>
</tr>
</tbody>
</table>
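The rows of Table 4.7 can be recomputed from the network parameters. The following Python sketch is only a cross-check of the arithmetic: the function and key names are hypothetical, $A_{conv}$ and $A_{FC}$ inside the maxima are assumed to stand for the number of those parameters ($c_{out}$ and $w_{out(fc)}$), and $n_{layers} = 2$ is assumed for one convolutional plus one fully connected layer.

```python
def oom_timing(p):
    """Clock cycles of each OOM contribution, following Table 4.7."""
    n_iter = p["n_fc_params"] // p["L"]
    # phi includes the input image, psi does not (the image is loaded once).
    phi = max(p["w_filter"]**2, p["n_fc_params"], p["w_in"]**2,
              p["c_out"], p["w_out_fc"])
    psi = max(p["w_filter"]**2, p["n_fc_params"], p["c_out"], p["w_out_fc"])
    return {
        "data_acquisition": phi + (p["n_layers"] - 1) * psi,
        "max_pooling": 3 + p["w_out_pool"]**2 * (2 + p["w_filter_pool"]**2) + 1,
        "precharge_binary": 3 + p["w_out"]**2 * (p["w_filter"]**2 + 1),
        "perform_convolution": (p["c_out"] * p["w_out"]**2
                                * (p["w_filter"]**2 + p["c_in"] + 2)
                                + p["c_out"] * 4 + 1),
        "fully_connected": (3 + n_iter * (2 + p["w_out_fc"]
                                          + p["w_out_fc"] * (p["L"] + 1))
                            + p["w_out_fc"]),
    }


# Reference model of Figure 4.1.
reference = dict(w_in=28, w_filter=2, w_filter_pool=2, w_out_pool=14,
                 w_out=13, c_in=1, c_out=6, n_fc_params=1014, L=6,
                 w_out_fc=10, n_layers=2)
oom = oom_timing(reference)
oom_total = sum(oom.values())
print(oom, oom_total)
```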

### 4.4.2 In-memory implementation

Here the main differences lie in the convolutional and fully connected layers:

1. Perform convolution: after the binary values are precharged, since there are \(w_{out}^2\) pop-counting units in parallel, the pop-count operation lasts \(w_{filter}^2 \times t_{ck}\). Once pop-counting has been performed, the final multiplexer (Figure 4.30) needs \(w_{out}^2\) clock cycles to select all the inputs. This procedure is repeated for each output channel, but the inputs coming from the XNOR UNIT in memory are not fetched, because the computation is already performed inside the memory: the evaluation phase can start immediately, reducing the total number of clock cycles required.

\[
\text{Convolution\_cycles} = [3 + (w_{filter}^2 + 1) \times w_{out}^2 + c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2)) + c_{out} \times 3 + 1] \times t_{ck} \tag{4.38}
\]

2. Fully connected layer: for the same reasons, once the inputs/weights are precharged the fully connected layer can be computed in parallel, without the need to fetch each row of the XNOR UNIT in memory.

\[
\text{FC\_cycles} = [3 + n_{\text{iter}} \times (2 + w_{out(fc)} + L) + w_{out(fc)}] \times t_{ck} \tag{4.39}
\]
Table 4.8: Timing of the In-Memory architecture. The reference neural network model is in Figure 4.1

<table>
<thead>
<tr>
<th>Operation</th>
<th>Time required</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Data acquisition</td>
<td>$(\phi + (n_{layers} - 1) \times \psi) \times t_{ck}$</td>
<td>$2 \times 1014 \times t_{ck}$</td>
</tr>
<tr>
<td>Max Pooling</td>
<td>$(3 + w_{out(pool)}^2 \times (2 + w_{filter}^2) + 1) \times t_{ck}$</td>
<td>$1180 \times t_{ck}$</td>
</tr>
<tr>
<td>Convolutional layer</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Precharge binary values</td>
<td>$(3 + (w_{filter}^2 + 1) \times w_{out}^2) \times t_{ck}$</td>
<td>$848 \times t_{ck}$</td>
</tr>
<tr>
<td>Perform convolution</td>
<td>$(c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2)) + c_{out} \times 3 + 1) \times t_{ck}$</td>
<td>$3085 \times t_{ck}$</td>
</tr>
<tr>
<td>Fully connected layer</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Perform computation</td>
<td>$(1 + n_{iter} \times (w_{out(fc)} + L + 2) + w_{out(fc)} + 2) \times t_{ck}$</td>
<td>$3055 \times t_{ck}$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$10196 \times t_{ck}$</td>
</tr>
</tbody>
</table>
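The two layer-dependent rows of Table 4.8, and the resulting ratios with respect to the OOM architecture, can be checked as follows. This is a minimal Python sketch: the function name is hypothetical, and the OOM values and the contributions shared by both architectures are simply copied from Table 4.7 and Table 4.8.

```python
def in_memory_layer_cycles(w_filter, w_out, c_in, c_out,
                           n_fc_params, L, w_out_fc):
    """Convolution and FC cycles of the In-Memory architecture (Table 4.8)."""
    n_iter = n_fc_params // L
    conv = c_out * (w_filter**2 + w_out**2 * (c_in + 2)) + c_out * 3 + 1
    fc = 3 + n_iter * (2 + w_out_fc + L) + w_out_fc
    return conv, fc


# Contributions common to both architectures (data acquisition,
# max pooling, binary precharge), copied from the tables.
SHARED = 2 * 1014 + 1180 + 848
# OOM layer delays copied from Table 4.7.
OOM_CONV, OOM_FC = 7123, 13871

conv, fc = in_memory_layer_cycles(w_filter=2, w_out=13, c_in=1, c_out=6,
                                  n_fc_params=1014, L=6, w_out_fc=10)
in_memory_total = SHARED + conv + fc
oom_total = SHARED + OOM_CONV + OOM_FC
total_ratio = oom_total / in_memory_total
fc_ratio = OOM_FC / fc
print(in_memory_total, round(total_ratio, 2), round(fc_ratio, 2))
```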

The main timing differences are located in the convolutional and fully connected computations: thanks to the parallel In-Memory computation, the total delay is reduced by a factor of $\sim 2.46$ (considering the same clock period for both architectures). By looking at the convolutional delay expressions:

\[
\text{Delay OOM}_{\text{convolution}} = (c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2) + c_{out} \times 4 + 1) \times t_{ck} \tag{4.40}
\]

\[
\text{Delay In-Memory}_{\text{convolution}} = (c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2)) + c_{out} \times 3 + 1) \times t_{ck} \tag{4.41}
\]

Since the data are computed directly in the memory array and $w_{out}^2$ parallel pop-counting units are used, there is no need to fetch them from the memory and to pop-count them one by one as the OOM does. This allows the In-Memory architecture to reduce the computational time by transforming the dominant part of the delay equation from $c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2)$ to $c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2))$. In the fully connected layer the gain is much more evident:

\[
\text{Delay OOM}_{\text{FC}} = [3 + n_{iter} \times (2 + w_{out(fc)} + w_{out(fc)} \times (L + 1)) + w_{out(fc)}] \times t_{ck} \tag{4.42}
\]

\[
\text{Delay In-Memory}_{\text{FC}} = (1 + n_{iter} \times (w_{out(fc)} + L + 2) + w_{out(fc)} + 2) \times t_{ck} \tag{4.43}
\]

The In-Memory architecture has a big advantage in terms of delay w.r.t. the OOM one, because it uses multiple pop-counting units and an XNOR gate for each binary weight/input pair. As already said, once the pop-counting procedure has finished, there is no need to save the results into an external register file (RF TMP POP). In the In-Memory architecture, at the end of the algorithm, they are simply selected by the last multiplexer in Figure 4.30. For these reasons the delay ratio reaches $\sim 4.54 \times$ for the FC layer. The following figures report the total time required in both the OOM and In-Memory cases, in order to verify the correctness of the computations:

\begin{equation}
\text{Delay Ratio} = \frac{150.333\,\mu s}{61.209\,\mu s} \approx 2.456
\end{equation}

There is a small difference between the computed value (2.46) and the simulated one (2.456): the reason is that some states of the neural network’s control unit (such as the reset and idle states) are not considered in the computation.

### 4.4.3 General cases

General cases are analyzed by sweeping the parameters of the network, in order to evaluate the timing ratio between the two alternatives (OOM/In-Memory), considering the same clock period and the reference architecture in Figure 4.1. The accuracy is evaluated in all the cases, with a batch size of 100 and 5 training epochs, in order to see the impact of the different choices on the architecture’s precision:

Figure 4.39: Speedup vs $C_{out}$: a higher $C_{out}$ increases the time ratio, but the complexity of the architecture grows (a higher number of parameters is required).

Figure 4.40: Speedup vs stride $conv$: increasing the stride worsens both accuracy and speedup, but the complexity of the network decreases.

Figure 4.41: Speedup vs $w_{\text{filter(pool)}}$: the speedup ratio decreases and the accuracy is worse, since the higher the $w_{\text{filter(pool)}}$, the lower the quality of the input image.

Figure 4.42: Speedup vs stride pool: the speedup ratio decreases and the accuracy is worse, since a higher stride implies a lower-quality input image.

Figure 4.43: Speedup vs $w_{\text{out}(fc)}$: higher is better, but in the case reported in Figure 4.1 no more than 10 outputs are used. If the neural network is structured with more than one fully connected layer, this brings some advantages.

Figure 4.44: Speedup vs $w_{\text{filter(conv)}}$: increasing $w_{\text{filter(conv)}}$ also increases the speedup, but the accuracy is degraded.

Regarding the total delay of the architectures, it can be demonstrated that the delay of the OOM case is always higher than the In-Memory alternative, considering the following equations for the convolution algorithm from the timing computation:

\[
\text{Delay}_{\text{OOM (convolution)}} = [3 + w_{\text{out}}^2 \times (w_{\text{filter}}^2 + 1) + c_{\text{out}} \times w_{\text{out}}^2 \times (w_{\text{filter}}^2 + c_{\text{in}} + 2) + c_{\text{out}} \times 4 + 1] \times t_{ck}
\]

(4.45)

\[
\text{Delay}_{\text{In-Memory (convolution)}} = [3 + w_{\text{out}}^2 \times (w_{\text{filter}}^2 + 1) + c_{\text{out}} \times (w_{\text{filter}}^2 + w_{\text{out}}^2 \times (c_{\text{in}} + 2)) + c_{\text{out}} \times 3 + 1] \times t_{ck}
\]

(4.46)

By imposing the \( \text{Delay}_{\text{In-Memory (convolution)}} \) equation less than \( \text{Delay}_{\text{OOM (convolution)}} \):

\[
\text{Delay}_{\text{In-Memory (convolution)}} < \text{Delay}_{\text{OOM (convolution)}}
\]

\[
w_{\text{filter}}^2 + w_{\text{out}}^2 \times (c_{\text{in}} + 2) + 3 < w_{\text{out}}^2 \times (w_{\text{filter}}^2 + c_{\text{in}} + 2) + 4
\]

(4.47)

\[
w_{\text{filter}}^2 + 3 < w_{\text{out}}^2 \times w_{\text{filter}}^2 + 4
\]

\[
w_{\text{filter}}^2 - 1 < w_{\text{out}}^2 \times w_{\text{filter}}^2
\]

Since \( w_{\text{out}} \geq 1 \), the right-hand side is at least \( w_{\text{filter}}^2 \), which is already greater than \( w_{\text{filter}}^2 - 1 \): \( \text{Delay}_{\text{In-Memory (convolution)}} \) is therefore always less than \( \text{Delay}_{\text{OOM (convolution)}} \). By performing the same steps for the fully connected computational delay:

\[
\text{Delay}_{\text{OOM (FC)}} = [1 + n_{\text{iter}} \times (w_{\text{out (fc)}} + 2 + w_{\text{out (fc)}} \times (L + 1)) + w_{\text{out (fc)}} + 2] \times t_{ck}
\]

(4.48)

\[
\text{Delay}_{\text{In-Memory (FC)}} = [1 + n_{\text{iter}} \times (w_{\text{out (fc)}} + L + 2) + w_{\text{out (fc)}} + 2] \times t_{ck}
\]

(4.49)

Imposing the inequality, it is always verified:

\[
\begin{aligned}
\text{Delay}_{\text{In-Memory (FC)}} &< \text{Delay}_{\text{OOM (FC)}} \\
w_{out(fc)} + L + 2 &< w_{out(fc)} + 2 + w_{out(fc)} \times (L + 1) \\
L &< w_{out(fc)} \times (L + 1) \\
\frac{L}{L + 1} &< w_{out(fc)}
\end{aligned}
\]

(4.50)

\(w_{out(fc)}\) is always greater than or equal to 1 and the ratio \(\frac{L}{L + 1}\) is always less than 1, so the inequality is verified.

The following plots represent the convolutional and fully connected delay ratios between the OOM and In-Memory architectures. Delay ratios have been evaluated by sweeping two parameters at a time. Starting from the convolutional computation, the X-Y meshes used are:

- \(c_{in}\) - \(w_{filter}\) (fixed parameters: \(w_{in} = 28\), \(\text{stride} = 1\) and \(c_{out} = 1\));
- \(c_{in}\) - \(c_{out}\) (\(w_{in} = 28\), \(w_{filter} = 2\), \(\text{stride} = 1\));
- \(w_{filter}\) - \(c_{out}\) (\(w_{in} = 28\), \(\text{stride} = 1\), \(c_{in} = 1\));
- \(w_{in}\) - \(w_{filter}\) (\(c_{in} = 1\), \(c_{out} = 1\), \(\text{stride} = 1\)).

Regarding the fully connected computation, the dependency of the delay ratio on \(w_{out(fc)}\) and \(n_{iter}\) has been evaluated considering \(\text{number\_of\_fc\_parameters} = 1000\). The values of \(n_{iter}\) are chosen with the `divisors` function in MATLAB, which generates the following vector:

\[
n_{iter} = \left(1 \quad 2 \quad 4 \quad 5 \quad 8 \quad 10 \quad 20 \quad 25 \quad 40 \quad 50 \quad 100 \quad 125 \quad 200 \quad 250 \quad 500 \quad 1000\right)
\]

(4.51)
Figure 4.45: $c_{in} - w_{filter}$ plot for a convolutional computation. Increasing $c_{in}$ decreases the delay ratio: as Equation 4.45 and Equation 4.46 show, the ratio tends towards 1 for high values of $c_{in}$. The delay ratio increases with higher values of $w_{filter}$.

Figure 4.46: $c_{in}$ - $c_{out}$ plot for a convolutional computation. For $c_{in}$, the same considerations made in Figure 4.45 are valid. Increasing $c_{out}$ makes the delay ratio rise slowly, like a logarithm, until it saturates, since the limit of the delay ratio for $c_{out} \to \infty$ is a constant.

Figure 4.47: $w_{filter} - c_{out}$ plot for a convolutional computation. The In-Memory architecture gains the most in delay over the OOM one with high values of $w_{filter}$ and $c_{out}$. Considering for example the first layer of AlexNet, which has 96 OFMAPs and $w_{filter} = 11$, the delay ratio is $\sim 27\times$.

Figure 4.48: \( w_{in} \) - \( w_{filter} \) plot for a convolutional computation. Increasing \( w_{in} \) leaves the delay ratio approximately unchanged, while the \( w_{filter} \) dependency is the same described in Figure 4.47.

Figure 4.49: $w_{out(fc)}$-$n_{iter}$ plot for a fully connected layer. Increasing both quantities brings relevant benefits in terms of delay ratio. In particular, with high values of $n_{iter}$ the In-Memory architecture takes advantage of a more scheduled FC computation (Figure 4.16). This is a very important result, because a high $n_{iter}$ implies a smaller array, since $W \geq \frac{\text{number of fc parameters}}{n_{iter}} = L$, which further reduces the power consumption, area and energy consumption of the In-Memory architecture.
Considerations

From the previous plots, the following considerations can be made for convolutional computation:

1. Increasing $w_{\text{filter}}$ increases the delay ratio, which benefits more complex networks (such as AlexNet, in which the first layer has a kernel size of 11x11). In combination with a higher number of output channels ($c_{\text{out}}$), this behavior is much more evident (Figure 4.47);

2. A higher value of $w_{\text{in}}$ (which translates into $w_{\text{out}} = \frac{w_{\text{in}} - w_{\text{filter}}}{\text{stride}} + 1$) does not bring relevant advantages in terms of delay ratio, which stays approximately constant;

3. Higher values of $c_{\text{in}}$ reduce the delay ratio, because the two delay expressions become similar.

Regarding the FC layer, a higher value of $n_{\text{iter}}$ increases the delay ratio and, by allowing a smaller array (Figure 4.49), also improves the power consumption. The In-Memory architecture gains a relevant advantage when the fully connected computation is performed.
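These trends can be checked numerically from Equation 4.45 and Equation 4.46. The following Python sketch uses hypothetical function names and assumes that the plotted ratio includes the precharge term, as in the two equations. With the fixed parameters of the $w_{filter}$ - $c_{out}$ sweep ($w_{in} = 28$, stride 1, $c_{in} = 1$), it gives a ratio close to the $\sim 27\times$ quoted for $w_{filter} = 11$ and $c_{out} = 96$.

```python
def conv_output_width(w_in, w_filter, stride):
    """Output width of a convolution without padding."""
    return (w_in - w_filter) // stride + 1


def conv_delay_ratio(w_in, w_filter, stride, c_in, c_out):
    """OOM / In-Memory convolution delay ratio (Eq. 4.45 over Eq. 4.46)."""
    w_out = conv_output_width(w_in, w_filter, stride)
    precharge = 3 + w_out**2 * (w_filter**2 + 1)
    oom = (precharge + c_out * w_out**2 * (w_filter**2 + c_in + 2)
           + c_out * 4 + 1)
    in_memory = (precharge + c_out * (w_filter**2 + w_out**2 * (c_in + 2))
                 + c_out * 3 + 1)
    return oom / in_memory


# Kernel size and output channels of AlexNet's first layer, evaluated with
# the fixed parameters of the w_filter - c_out sweep.
alexnet_like = conv_delay_ratio(w_in=28, w_filter=11, stride=1, c_in=1, c_out=96)

# Trend checks on single points of the other sweeps.
low_c_in = conv_delay_ratio(28, 2, 1, c_in=1, c_out=1)
high_c_in = conv_delay_ratio(28, 2, 1, c_in=64, c_out=1)
wide_input = conv_delay_ratio(56, 2, 1, c_in=1, c_out=1)
print(alexnet_like, low_c_in, high_c_in, wide_input)
```

Note that the real AlexNet input size and stride differ from these sweep settings, so the value only illustrates the trend.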

### 4.5 Choosing the number of bits ($n_{\text{bit}}$)

An important parameter is the number of bits for the fixed point implementation. In order to properly choose this value, a fixed-point neural network model has been implemented in MATLAB (discussed in chapter 5). Following Figure 4.1, the MATLAB code takes the pre-trained parameters, input images and labels from Python and computes the accuracy for each combination $[n_{\text{bit}}, n_{\text{bit, fractional}}]$ simply considering:

$$\text{Accuracy} = \frac{\text{Score}}{\#\text{Mnist images}}$$ (4.52)

The number of bits dedicated to the integer part is fixed, and it constrains the number of fractional ones. To have a full precision computation, the integer part has to hold the largest pop-counting result, since the pop-counting part counts all the inputs: its magnitude needs $\text{floor}(\log_2(\text{number\_of\_fc\_parameters}) + 1) = 10$ bits, to which one sign bit is added:

\[ n\_bit\_integer = n\_bit - n\_bit\_fractional \]

\[ n\_bit\_integer = \text{floor}(\log_2(\text{number\_of\_fc\_parameters}) + 1) + 1 = 11 \]

The analysis is focused on the accuracy which derives from a different number of fractional bits. The result is reported in Figure 4.50:

Figure 4.50: Accuracy vs number of bits. The total number of images tested is 10000. The reference accuracy is set to 0.8338 from section 3.2.1

If a new neural network model is considered, the \( n\_bit \) analysis can be performed again in order to find the best trade-off between accuracy and complexity.
Chapter 5

Verification

In order to check the results given by the VHDL fixed point model, the verification steps include:

1. The realization of the Python program, from which the results of the single layers have been extracted;

2. Design of MATLAB floating point model and validity verification by comparing the Python results and the MATLAB ones;

3. Derivation of a fixed-point MATLAB model;

4. Comparison between the VHDL results and MATLAB ones.

In order to convert a floating point value to a fixed point, the following formula has been used:

\[ \text{quantization}(x) = \text{fix} \left( \frac{x}{2^{-n_{\text{bit\_fractional}}}} \right) \]  \hspace{1cm} (5.1)

where $\text{fix}(y)$ rounds both positive and negative results toward 0. For a multiplication result, the steps to compute its fixed-point equivalent can be derived from the following example with $A = 0.25$, $B = 0.125$ and $n_{\text{bit\_fractional}} = 4$:

- Full precision multiplication gives 0.03125:

\[
\begin{array}{rl}
A = & 00.0100 \; \times \\
B = & 00.0010 \; = \\
\hline
& 00|00.0000|1000
\end{array}
\]

- The result is truncated after the 4th fractional bit, giving 0 as final result. From a software point-of-view, the corresponding operation is $\text{floor}(x)$, so:

```matlab
tmp = X.*2^(n_bit_fractional);
tmp = floor(tmp);
quantized = tmp.*2^(-n_bit_fractional);
```
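The two rounding rules can be compared side by side. The following Python sketch is only an illustration (the thesis model is written in MATLAB); the function names are hypothetical, `math.trunc` plays the role of MATLAB's `fix`, and the negative input value is an arbitrary example.

```python
import math


def quantize_fix(x, n_bit_fractional):
    """Equation 5.1: scale by 2**n_bit_fractional and round toward zero."""
    return math.trunc(x * 2**n_bit_fractional)


def truncate_product(a, b, n_bit_fractional):
    """Multiply, then drop the bits below the last fractional position."""
    scaled = a * b * 2**n_bit_fractional
    return math.floor(scaled) * 2.0**(-n_bit_fractional)


# Worked example from the text: 0.25 * 0.125 = 0.03125, lost with 4 bits.
product = truncate_product(0.25, 0.125, 4)

# fix and floor agree for positive values and differ for negative ones.
code_pos = quantize_fix(0.25, 4)
code_neg_fix = quantize_fix(-0.3, 4)
code_neg_floor = math.floor(-0.3 * 2**4)
print(product, code_pos, code_neg_fix, code_neg_floor)
```

The difference between the two rules only shows up for negative values, where rounding toward zero and rounding toward minus infinity pick different neighbours.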

Now the entire MATLAB program is reported and explained:

1. Loading of the Python trained parameters by using the `readNPY` function, which allows MATLAB to read Numpy vectors:

```matlab
Image = readNPY('./Image.npy'); % Input images: 10000 testing images
Ws_Conv = readNPY('./Parameters/Ws_1.npy'); % Convolutional layer's weights
Ws_FC = readNPY('./Parameters/Ws_2.npy'); % Fully connected layer's weights
mu = readNPY('./Parameters/mu.npy'); % Mean for batchnorm
sigma = readNPY('./Parameters/sigma.npy'); % Std deviation for batchnorm
scale = readNPY('./Parameters/scale.npy'); % Scale for batchnorm
offset = readNPY('./Parameters/offset.npy'); % Offset for batchnorm
labels = readNPY('./Parameters/labels.npy'); % Labels for the accuracy
```

2. Saving the extracted parameters for the VHDL simulation:

```matlab
path = './VHDL_MODEL/INPUT_PARAMETERS_VHDL/';
s = strcat(path,'Ws_Conv.txt');
s1 = strcat(path,'*.txt');
delete(s1)  % Remove the text files of previous runs
sz = size(Weights_NN);
for i=1:sz(4)
    vect = Ws_Conv(:,:,i)';
    mat = vect(:)';
    dlmwrite(s,mat,'delimiter','\t','precision','%6f','-append');
end
% Batch normalization constants
A = scale.*(sqrt(sigma).^(-1));
B = -mu.*(sqrt(sigma).^(-1)).*scale + offset;
dlmwrite(strcat(path,'Ws_FC.txt'),Ws_FC,...
    'delimiter','\t','precision','%6f','-append');
dlmwrite(strcat(path,'Aone.txt'),A,...
    'delimiter','\t','precision','%6f','-append');
dlmwrite(strcat(path,'Bone.txt'),B,...
    'delimiter','\t','precision','%6f','-append');
dlmwrite(strcat(path,'Image.txt'),...
    Image(:,:,1),'delimiter','\t','precision','%6f','-append');
```

3. Reading of Python results:

```matlab
conv_out = readNPY('./Parameters/conv.npy');  % Convolutional layer output
fully = readNPY('./Parameters/fully.npy');   % Fully connected layer output
batch_norm_output = readNPY('./Parameters/batch.npy');  % Batch normalization output
ReLU_out = readNPY('./Parameters/ReLU.npy');  % ReLU output
```

4. Setting the fixed-point number of bits by using two global variables (n_bit and n_bit_fractional respectively) with SetGlobals(18,7). By choosing SetGlobals(0,0), the computation is performed in floating point representation:

```matlab
SetGlobals(18,7);  % 18 = n_bit; 7 = n_bit_fractional
```

5. Parameters' quantization. `Ws_FC` are not quantized since only the sign is taken:

```matlab
if to_quantize == 1
    QntImage = quantization(Image);
    QntWS_Conv = quantization(Ws_Conv);
    QntA = quantization(A);
    QntB = quantization(B);
end
```

6. Neural network's realization:

```matlab
XNOR_NET = 1; % Sets the computational model to the XNOR_NET one.

% Max pooling layer %
pool = Max_pooling_layer(QntImage,2,2); % (Input argument, w_filter size, stride)

% Convolutional layer %
[K_conv,alpha_conv,conv_xnor] = Convolutional_layer(pool,QntWS_Conv,1,1,XNOR_NET,0); % (Input argument, Weights, Number of input channels, stride, XNOR_NET computational model, disable k computation).

% Batch normalization layer %
[conv_xnor,vectors] = Batchnorm(conv_xnor,A,B); % (Input argument, A,B constants)

% ReLU %
conv_xnor = max(0,conv_xnor);

% Flatten layer %
input_fc = flatten_layer(conv_xnor);

% Fully connected %
disable_k = 1;
[K,alpha,output_full] = Fully_connected_layer(input_fc,XNOR_NET,Ws_FC,disable_k);
```

7. To validate the results, the absolute difference is taken between the Python and floating-point MATLAB outputs, and between the fixed-point MATLAB and VHDL outputs. If any of the difference values is higher than a certain threshold, the computational model is considered wrong.
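
The validation rule of the last step can be sketched in a few lines. This Python sketch is only illustrative: the function names, the sample values and the threshold of 1e-6 are hypothetical, since the threshold used in the thesis is not stated here.

```python
def max_abs_difference(reference, candidate):
    """Largest element-wise absolute difference between two flat lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def is_model_valid(reference, candidate, threshold):
    """The model is rejected as soon as one difference exceeds the threshold."""
    return max_abs_difference(reference, candidate) <= threshold


# Hypothetical outputs of two models for the same layer.
python_out = [0.125, 0.0, 0.328125]
matlab_out = [0.125, 0.0000003, 0.328125]
ok = is_model_valid(python_out, matlab_out, threshold=1e-6)
bad = is_model_valid(python_out, [0.125, 0.0, 0.5], threshold=1e-6)
print(ok, bad)
```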

### 5.1 VHDL’s output

To ease the verification procedure, the results have been printed to a file in matrix form by using a data save. A toy example of convolutional/fully connected outputs is reported, with a 5x5 input image, 2x2 kernel size, stride=1, 3 output channels and 4 output neurons. The last results are produced by the fully connected layer, and the maximum value represents the final classification result.

```
convolution.txt
1.250000e-01 1.250000e-01 1.250000e-01 1.250000e-01
1.250000e-01 1.250000e-01 1.250000e-01 1.250000e-01
1.250000e-01 1.250000e-01 1.250000e-01 1.250000e-01
1.250000e-01 1.250000e-01 1.250000e-01 1.250000e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

3.281250e-01 3.281250e-01 3.281250e-01 3.281250e-01
3.281250e-01 3.281250e-01 3.281250e-01 3.281250e-01
3.281250e-01 3.281250e-01 3.281250e-01 3.281250e-01
3.281250e-01 3.281250e-01 3.281250e-01 3.281250e-01
3.281250e-01 3.281250e-01 3.281250e-01 3.281250e-01

-5.200000e+01
-1.200000e+02
-3.200000e+01
-6.200000e+01
```

MATLAB reads these values with `dlmread` and compares them with its own approximated fixed-point computation.
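For readers who prefer to inspect the dump outside MATLAB, a small Python stand-in for the reading step is sketched below. It assumes, as in the listing above, that the numeric blocks are separated by blank lines; the helper name `read_blocks` and the shortened sample text are hypothetical.

```python
def read_blocks(text):
    """Split a dump into blocks separated by blank lines.

    Each block is returned as a list of rows, each row a list of floats.
    """
    blocks, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            current.append([float(f) for f in fields])
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Shortened, hypothetical sample in the same format as the listing above.
sample = """1.250000e-01 1.250000e-01
0.000000e+00 0.000000e+00

-5.200000e+01
-1.200000e+02
-3.200000e+01
-6.200000e+01
"""

blocks = read_blocks(sample)
for index, block in enumerate(blocks):
    print(f"block {index}: {len(block)} rows x {len(block[0])} columns")
```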
5.2 MATLAB’s output

First, the floating-point model is verified by comparing the output of every layer with the Python one. For demonstration purposes, only the first result of the convolutional layer is reported, after batch normalization and ReLU:

```matlab
if(validation_python==1)
    diff = abs(conv_xnor-conv_out); % conv_xnor is the output given by the MATLAB model, while conv_out the python's one
    fprintf('Maximum Convolution difference between python-MATLAB is:');
    max(diff(:))
end
```

Maximum Convolution difference between python-MATLAB is:

```
ans =

    single

    3.3930e-07
```

As can be seen, the two models differ only by a very small amount, caused by the precision of the saving procedure; this difference can be neglected.

MATLAB FP model's first ReLU output

(5.2)

Python model's first ReLU output

(5.3)

To check the VHDL results, the MATLAB model is switched to fixed-point computation and the outputs are compared:

MATLAB Fixed point model's first ReLU output

(5.4)

VHDL model's first ReLU output

(5.5)

MATLAB checks the values given by the VHDL and, if the model is correct, reports the following messages:

First convolution (1) is correct!
First convolution (2) is correct!
First convolution (3) is correct!
First convolution (4) is correct!
First convolution (5) is correct!
First convolution (6) is correct!
Output FC results are correct!!

The results of the fixed-point and floating-point models differ only slightly, since the network has few layers and the approximation introduced by the fixed-point representation has a limited effect on the computation. When some inputs are classified, the fully connected outputs of the VHDL model are the following. The maximum of the 10 values gives the classification result: the first value refers to "0" and the last one to "9".

5.3 Other neural network models

To show that the VHDL architecture can implement different kinds of neural networks, other neural network models are proposed in the following part.

5.3.1 MLP Implementation

The network structure is the following:

![MLP model diagram](image)

Figure 5.2: MLP model. The network has 15 layers and achieves \( \sim 90\% \) accuracy on the MNIST dataset.

To implement an MLP, the variable and fixed parameters discussed in subsection 4.1.5 must be changed according to the new structure. For comparison, these values are reported for both the MLP and the original neural network model.

```
-----------------FIXED PARAMETERS----------------------
-----------------MLP NETWORK----------------------
constant w_out: integer:=14;
constant h_max: integer:=196;
constant w_in: integer:=28;
constant w_filter: integer:=4;
constant w_filter_s: integer:=2;
constant width_sram:integer:=14;
constant number_of_output_channels: integer:=1;
constant number_of_input_channels: integer:=1;
constant number_of_fc_parameters:integer:=798;
constant number_of_neurons_output: integer:=196;
constant input_image_size_x:integer:=28;
constant input_image_size_z:integer:=1;
constant flatten_x:integer:=196;
constant flatten_z:integer:=1;

-------------------FIXED PARAMETERS-------------------
-------------------ORIGINAL NETWORK-------------------
constant w_out: integer:=14;
constant h_max: integer:=169;
constant w_in: integer:=28;
constant w_filter: integer:=4;
constant w_filter_s: integer:=2;
constant width_sram:integer:=6;
constant number_of_output_channels: integer:=6;
constant number_of_input_channels: integer:=1;
constant number_of_fc_parameters:integer:=1014;
constant number_of_neurons_output: integer:=10;
constant input_image_size_x:integer:=28;
constant input_image_size_z:integer:=1;
constant flatten_x:integer:=169;
constant flatten_z:integer:=6;
```

1. h_max changes from 169, the value used for the initial neural network model (Figure 4.1), to 196, since the largest layer now has 196 output neurons. For this reason, 196 memory rows are required;

2. The width_sram chosen is 14, since it is a divisor of both 784 (input layer size) and 196 (hidden layer size). The iterations required for the input and hidden layers are given by:

\[
n_{\text{iter}}(\text{MLP-Input}) = \frac{784}{14} = 56 \tag{5.6}
\]

\[
n_{\text{iter}}(\text{MLP-hidden}) = \frac{196}{14} = 14 \tag{5.7}
\]

3. number_of_output_channels: an MLP network requires no output channels, since it operates on vectors instead of matrices; the parameter is therefore left at 1;

4. number_of_fc_parameters: in the MLP case, the maximum input size of the fully connected layer is 784. For reasons related to the VHDL implementation, it is increased to 798 (784 + 14) to avoid addressing problems;

5. flatten_x and flatten_z: since there is only one output register file, the flattening procedure has to be performed on only 1 register file of size 196.
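The iteration counts in (5.6) and (5.7) can be reproduced with a few lines of Python. This is a sketch under the assumption that every layer input is streamed in chunks of exactly `width_sram` values; the helper name `iterations` is introduced here only for illustration.

```python
def iterations(input_size, width):
    """Number of width-sized chunks needed to stream input_size values."""
    if input_size % width != 0:
        raise ValueError(f"{width} is not a divisor of {input_size}")
    return input_size // width


WIDTH_SRAM = 14
layer_inputs = {"MLP-Input": 784, "MLP-hidden": 196}

n_iter = {name: iterations(size, WIDTH_SRAM) for name, size in layer_inputs.items()}
print(n_iter)  # {'MLP-Input': 56, 'MLP-hidden': 14}

# Value used for number_of_fc_parameters: largest FC input plus one extra word.
print(784 + WIDTH_SRAM)  # 798
```

A width that does not divide a layer size (for example 15) is rejected, which is the reason a common divisor of 784 and 196 is chosen.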

Considering now the variable parameters:

---------------VARIABLE PARAMETERS----------------- 
---------------MLP NETWORK---------------------- 

constant n_layers : integer := 4; 
constant number_of_layers : std_logic_vector ( n_layers-1 downto 0 ) := "0011"; 
constant conv_layer_size_x : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant conv_layer_size_x_pow: int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant conv_layer_size_y : int_vect ( n_layers-1 downto 0 ) := conv_layer_size_x; 
constant conv_layer_size_z : int_vect ( n_layers-1 downto 0 ) := (14,14,14,56); 
constant kernel_size_xy_pow : int_vect ( n_layers-1 downto 0 ) := (14,14,14,14); 
constant kernel_size_xy : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant kernel_size_z : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant stride_sel_c : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant output_size_conv : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant output_size_conv_pow : int_vect ( n_layers-1 downto 0 ) := (10,196,196,196); 

---------------POOLING------------------------ 
constant pool_filter_size : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant pool_x_size : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant pool_out_size : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant pool_filter_size_s : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant pool_stride : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant pool_x_size_pow : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 
constant pool_z_size : int_vect ( n_layers-1 downto 0 ) := (1,1,1,1); 

---------------Layer types--------------------- 
constant do_batch_layer : std_logic_vector ( n_layers-1 downto 0 ) := "1111"; 
constant do_pool_layer : std_logic_vector ( n_layers-1 downto 0 ) := "0000"; 
constant do_relu_layer : std_logic_vector ( n_layers-1 downto 0 ) := "0111"; 
constant fully_connected : std_logic_vector ( n_layers-1 downto 0 ) := "1111"; 

---------------VARIABLE PARAMETERS----------------- 
---------------ORIGINAL NETWORK------------------ 
constant n_layers : integer := 2; 
constant number_of_layers : std_logic_vector ( n_layers-1 downto 0 ) := "01"; 
constant conv_layer_size_x : int_vect ( n_layers-1 downto 0 ) := (1,14); 

constant conv_layer_size_x_pow : int_vect ( n_layers-1 downto 0 ) := ( 1 , 196 );
constant conv_layer_size_z : int_vect ( n_layers-1 downto 0 ) := ( 169 , 1 );
constant kernel_size_xy_pow : int_vect ( n_layers-1 downto 0 ) := ( 6 , 4 );
constant kernel_size_xy : int_vect ( n_layers-1 downto 0 ) := ( 1 , 2 );
constant kernel_size_z : int_vect ( n_layers-1 downto 0 ) := ( 1 , 6 );
constant stride_sel_c : int_vect ( n_layers-1 downto 0 ) := ( 1 , 1 );
constant output_size_conv : int_vect ( n_layers-1 downto 0 ) := ( 1 , 13 );
constant output_size_conv_pow : int_vect ( n_layers-1 downto 0 ) := ( 10 , 169 );

-------------
-------------POOLING-------------
constant pool_filter_size : int_vect ( n_layers-1 downto 0 ) := ( 1 , 4 );
constant pool_x_size : int_vect ( n_layers-1 downto 0 ) := ( 1 , 28 );
constant pool_out_size : int_vect ( n_layers-1 downto 0 ) := ( 1 , 196 );
constant pool_filter_size_s : int_vect ( n_layers-1 downto 0 ) := ( 1 , 2 );
constant pool_stride : int_vect ( n_layers-1 downto 0 ) := ( 1 , 2 );
constant pool_x_size_pow : int_vect ( n_layers-1 downto 0 ) := ( 1 , 784 );
constant pool_z_size : int_vect ( n_layers-1 downto 0 ) := ( 1 , 1 );

-------------
-------------Layer types-------------
constant do_batch_layer : std_logic_vector ( n_layers-1 downto 0 ) := "01" ;
constant do_pool_layer : std_logic_vector ( n_layers-1 downto 0 ) := "01" ;
constant do_ReLU_layer : std_logic_vector ( n_layers-1 downto 0 ) := "01" ;
constant fully_connected : std_logic_vector ( n_layers-1 downto 0 ) := "10" ;

1. The number of layers n_layers is four, since there are 4 fully connected computations: batch normalization and ReLU are considered part of the fully connected layer. Dropout layers are not needed in the classification routine, since they are used only during training to prevent overfitting;

2. The useful parameters of the convolutional part are conv_layer_size_z, kernel_size_xy_pow and output_size_conv_pow. conv_layer_size_z, as already said, indicates the number of iterations \( n_{iter} \) required by the considered layer to compute the fully connected output. kernel_size_xy_pow indicates the size of the FC input (L) (discussed in section 4.1.2): in this network, L is equal to 14 in all cases. output_size_conv_pow indicates the number of output neurons of the considered layer;

3. Since no pooling layer is present in the neural network model depicted in Figure 5.2, the values specified in the pooling vectors are not considered;

4. Batch normalization and the fully connected computation are performed in every layer, so do_batch_layer and fully_connected are always enabled. In the same way, do_relu_layer enables ReLU in every layer except the last one ("0111"), while do_pool_layer is always disabled ("0000").
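Because the VHDL vectors are declared with a `(n_layers-1 downto 0)` range, the first value written in each aggregate belongs to the last layer. The following Python sketch reorders the MLP vectors so that layer 0 comes first; it assumes that index 0 is the first layer executed, which agrees with the first layer requiring 56 iterations. The helper name `per_layer` is hypothetical.

```python
def per_layer(downto_values):
    """Reorder values written for a (n-1 downto 0) range so that index 0 comes first."""
    return list(reversed(downto_values))


L = 14  # FC input width used by every layer of the MLP

n_iter = per_layer((14, 14, 14, 56))               # conv_layer_size_z
neurons = per_layer((10, 196, 196, 196))           # output_size_conv_pow
relu = [bit == "1" for bit in per_layer("0111")]   # do_relu_layer

for layer, (iters, outputs, has_relu) in enumerate(zip(n_iter, neurons, relu)):
    print(f"layer {layer}: inputs={iters * L}, outputs={outputs}, relu={has_relu}")
```

Read this way, the four layers take 784, 196, 196 and 196 inputs and produce 196, 196, 196 and 10 outputs, with ReLU disabled only on the output layer.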

Results verification

The results are verified with the same approach described in Figure 5.1. Since vectors flow through the fully connected layers, the results are saved as vectors and compared with those provided by MATLAB. The vectors of the four layers have sizes 196, 196, 196 and 10 respectively, and are separated by |---------------|.

An output example for the digit "7" is reported in the following part: the classification result, given by the position of the maximum value in the output of the last fully connected layer, is 7 in both cases.

\[
\begin{align*}
FC_1 &= \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0.414 \\ 0.484 \\ 0.32 \\ 0 \end{pmatrix} \\
FC_2 &= \begin{pmatrix} 0 \\ 0 \\ 0.719 \\ 0 \\ 0.227 \\ 0.195 \\ ... \end{pmatrix} \\
FC_3 &= \begin{pmatrix} 0 \\ 0.703 \\ 1.04 \\ 0 \\ 0 \\ 0 \end{pmatrix} \\
FC_4 &= \begin{pmatrix} -1.6 \\ -1.31 \\ -0.906 \\ -1.05 \\ -1.44 \\ -1.04 \\ -1.21 \\ 0.875 \\ -1.12 \\ -1.51 \end{pmatrix}
\end{align*}
\]

VHDL Results

<table>
<thead>
<tr>
<th>\( FC_1 \)</th>
<th>\( FC_2 \)</th>
<th>\( FC_3 \)</th>
<th>\( FC_4 \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.000000e+00</td>
<td>0.000000e+00</td>
<td>0.000000e+00</td>
<td>-1.601562e+00</td>
</tr>
<tr>
<td>0.000000e+00</td>
<td>0.000000e+00</td>
<td>7.031250e-01</td>
<td>-1.312500e+00</td>
</tr>
<tr>
<td>0.000000e+00</td>
<td>0.000000e+00</td>
<td>0.000000e+00</td>
<td>-9.062500e-01</td>
</tr>
<tr>
<td>4.140625e-01</td>
<td>7.187500e-01</td>
<td>1.039062e+00</td>
<td>-1.046875e+00</td>
</tr>
<tr>
<td>4.843750e-01</td>
<td>0.000000e+00</td>
<td>0.000000e+00</td>
<td>-1.437500e+00</td>
</tr>
<tr>
<td>3.203125e-01</td>
<td>2.265625e-01</td>
<td>0.000000e+00</td>
<td>-1.039062e+00</td>
</tr>
<tr>
<td>0.000000e+00</td>
<td>1.953125e-01</td>
<td>0.000000e+00</td>
<td>-1.210938e+00</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
<td>-1.125000e+00</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>-1.507812e+00</td>
</tr>
</tbody>
</table>

The output of the MATLAB program is the following:

(1) fully connected layer is correct!
(2) fully connected layer is correct!
(3) fully connected layer is correct!
Classification is correct!
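The classification rule used in this verification (the index of the maximum output of the last fully connected layer) can be checked directly on the \( FC_4 \) values reported by MATLAB above. The snippet is only an illustration of the rule, not part of the test bench.

```python
# FC_4 values as reported by the MATLAB model for the input digit "7".
fc4 = [-1.6, -1.31, -0.906, -1.05, -1.44, -1.04, -1.21, 0.875, -1.12, -1.51]

# Position 0 corresponds to digit "0" and position 9 to digit "9".
predicted_digit = max(range(len(fc4)), key=lambda i: fc4[i])
print(predicted_digit)  # 7
```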

5.3.2 Fashion-MNIST neural network model

To validate the VHDL implementation with another CNN and another dataset, the neural network model depicted in Figure 5.4 has been implemented with Fashion-MNIST. This dataset contains 60000 greyscale images of 28x28 pixels divided into 10 classes, which are identified by the following one-hot positions:

1. T-shirt/top;
2. Trouser;
3. Pullover;
4. Dress;
5. Coat;
6. Sandal;
7. Shirt;
8. Sneaker;
9. Bag;
10. Ankle boot.

Figure 5.3: Fashion MNIST dataset

Figure 5.4: **CNN** model used for fashion-MNIST dataset. All convolutional layers have a kernel size of 5x5x6 with stride 1. Max pooling layers have a kernel size of 2x2 with stride 2. After each fully connected layer there is a batch normalization computation, in order to reduce the inaccuracies caused by the approximated computation introduced in section 4.1.2. This model is able to achieve up to 70% accuracy.

This neural network has been designed to verify that different CNN configurations can be implemented: here the max-pooling layers are placed after the convolutional layers, unlike in the original neural network model depicted in Figure 4.1. The model in Figure 5.4 also uses multiple input channels, since the first convolution produces 6 OFMAPs that are fed to the following stages. The package that defines the variable/fixed parameters is:

```
--- FIXED PARAMETERS ---
--- FASHION MNIST CNN ---
constant w_out: integer:=24;
constant h_max: integer:=576;
constant w_in: integer:=28;
constant w_filter: integer:=25;
constant w_filter_s: integer:=5;
constant width_sram: integer:=32;
constant number_of_output_channels: integer:=6;
constant number_of_input_channels: integer:=6;
constant number_of_fc_parameters: integer:=152;
constant number_of_neurons_output: integer:=120;
constant input_image_size_x: integer:=28;
constant input_image_size_z: integer:=1;
constant flatten_x: integer:=16;
constant flatten_z: integer:=6;
constant n_layers: integer:=5;
--- VARIABLE PARAMETERS ---
constant number_of_layers: std_logic_vector(n_layers-1 downto 0):= "00100";
constant conv_layer_size_x: int_vect(n_layers-1 downto 0):=(1,1,1,12,28);
constant conv_layer_size_x_pow: int_vect(n_layers-1 downto 0):=(1,1,1,144,784);
constant conv_layer_size_y: int_vect(n_layers-1 downto 0):=conv_layer_size_x;
constant conv_layer_size_z: int_vect(n_layers-1 downto 0):=(6,6,3,6,1);
constant kernel_size_xy_pow: int_vect(n_layers-1 downto 0):=(14,20,32,25,25);
constant kernel_size_xy: int_vect(n_layers-1 downto 0):=(1,1,1,5,5);
constant kernel_size_z: int_vect(n_layers-1 downto 0):=(1,1,1,6,6);
constant stride_sel_c: int_vect(n_layers-1 downto 0):=(1,1,1,1,1);
constant output_size_conv: int_vect(n_layers-1 downto 0):=(1,1,1,8,24);
constant output_size_conv_pow: int_vect(n_layers-1 downto 0):=(11,84,120,64,576);
--- POOLING ---
constant pool_filter_size: int_vect(n_layers-1 downto 0):=(1,1,4,4,1);
constant pool_x_size:int_vect(n_layers-1 downto 0):=(1,1,8,24,1);
constant pool_out_size:int_vect(n_layers-1 downto 0):=(1,1,16,144,1);
constant pool_filter_size_s:int_vect(n_layers-1 downto 0):=(1,1,2,2,1);
constant pool_stride:int_vect(n_layers-1 downto 0):=(1,1,2,2,1);
constant pool_x_size_pow:int_vect(n_layers-1 downto 0):=(1,1,64,576,1);
constant pool_z_size:int_vect(n_layers-1 downto 0):=(1,1,6,6,1);

constant do_batch_layer: std_logic_vector(n_layers-1 downto 0):= "11100";
constant do_pool_layer:std_logic_vector(n_layers-1 downto 0):= "00110";
constant do_relu_layer: std_logic_vector(n_layers-1 downto 0):= "10000";
constant fully_connected:std_logic_vector(n_layers-1 downto 0):= "11100";
```

Starting from the fixed parameters:

- **w_out**: the maximum output dimension is given by the first convolutional layer, which convolves the 28x28 IFMAP with 6 kernels of size 5x5 and stride 1:

\[
w_{out(max)} = \frac{w_{in} - w_{filter}}{stride} + 1 = \frac{28 - 5}{1} + 1 = 24 \tag{5.8}
\]

- **h_max**: maximum number of rows of the custom memories, which is equal to \(w_{out(max)}^2\), so 576;

- **w_filter**: maximum number of kernel elements used in the architecture, which is 5x5 = 25. In (5.8) and (5.9) the symbol \(w_{filter}\) denotes instead the kernel side, 5, which the package stores as w_filter_s;

- **width_sram**: fixed to 32. The constraints on W are:

\[
\begin{aligned}
W & \geq w_{filter}^2 = 25 \\
W & \geq L
\end{aligned} \tag{5.9}
\]

In this architecture there are 3 fully connected layers, and the value of L is different for each of them. The first one takes 96 inputs, the second 120 and the last one 84. In the first case L is fixed to 32, since it is the smallest divisor of 96 that is not lower than \(w_{filter}^2 = 25\). In the second case L is fixed to 20, and in the third one \(L = 14\). Consequently, \(W=32\).

- **number_of_output_channels**: the architecture produces 6 output channels in both convolutions;

- **number_of_input_channels**: the second convolution takes 6 input channels and produces 6 OFMAPs. For this reason, both architectures have to be replicated 6 times, imposing $c_{in} = 6$;

- **number_of_fc_parameters**: the maximum size of the fully connected layers is 120, but in the VHDL implementation it is set to 152 to avoid indexing errors in the fc scheduling. This value has been obtained as $120 + 32$, where 32 is W;

- **flatten_x** and **flatten_z**: since the output size before the first fully connected layer is 4x4x6, the flatten layer has to consider 16 elements of each of the 6 **Output register files**.
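The fixed parameters above follow from the feature-map sizes along the network. The short Python sketch below propagates the sizes through the two convolution/pooling stages and derives w_out, h_max, the flatten size and the first value of L. It assumes convolutions and pooling without padding, with the 2x2 pooling window and stride 2 given in the package; the helper names are hypothetical.

```python
def out_size(w_in, w_filter, stride):
    """Output width of a convolution or pooling window applied without padding."""
    return (w_in - w_filter) // stride + 1


def first_divisor_at_least(n, lower):
    """Smallest divisor of n that is not lower than the given bound."""
    return next(d for d in range(lower, n + 1) if n % d == 0)


conv1 = out_size(28, 5, 1)      # first convolution  -> 24 (w_out)
pool1 = out_size(conv1, 2, 2)   # first max pooling  -> 12
conv2 = out_size(pool1, 5, 1)   # second convolution -> 8
pool2 = out_size(conv2, 2, 2)   # second max pooling -> 4

h_max = conv1 ** 2                          # 576 memory rows
flatten_x, flatten_z = pool2 ** 2, 6        # 16 elements in each of 6 register files
fc1_inputs = flatten_x * flatten_z          # 96 inputs of the first FC layer
L1 = first_divisor_at_least(fc1_inputs, 5 * 5)  # 32, hence W = 32

print(conv1, pool1, conv2, pool2, h_max, fc1_inputs, L1)
```

The intermediate sizes 12 and 8 match the conv_layer_size_x and output_size_conv entries of the package.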

Regarding the variable parameters, only the relevant changes are discussed, since the considerations made before remain valid for the others:

- **conv_layer_size_x**: the first layer takes as input the entire image of 28x28 pixels. The second one takes the convolved and pooled image of size 12x12. The remaining numbers refer to the fully connected layers, in which this parameter is not used;

- **kernel_size_xy_pow**: in the convolution computation it indicates the number of kernel elements, which is 25, while in the FC part it indicates the L sizes;

- **kernel_size_z**: 6 kernels are used in the convolutions, since 6 OFMAPs are produced in both cases;

- **conv_layer_size_z**: the first layer takes only one input channel, since the image has 28x28x1 pixels. The second one takes 6 input channels, as reported in **Figure 5.4**. The value of the third layer is equal to 3 and indicates $n_{iter}$ in the fully connected computation: since the first FC layer takes 96 inputs and L is fixed to 32, the total number of iterations required to perform the fc scheduling (section 4.1.2) is equal to 3 ($32 \times 3 = 96$). The same considerations hold for the other two FC layers.
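The whole conv_layer_size_z vector can be rebuilt from the quantities discussed above. In this sketch the convolutional entries are the input channel counts and the FC entries are the iteration counts $n_{iter}$; the list names are hypothetical, and layer 0 is assumed to be the first layer executed, so the list is reversed to match the `(n_layers-1 downto 0)` declaration.

```python
conv_input_channels = [1, 6]   # first and second convolution
fc_inputs = [96, 120, 84]      # inputs of the three FC layers
fc_L = [32, 20, 14]            # L chosen for each FC layer

# Every L must divide the corresponding FC input size.
assert all(n % l == 0 for n, l in zip(fc_inputs, fc_L))
fc_iterations = [n // l for n, l in zip(fc_inputs, fc_L)]

layer_order = conv_input_channels + fc_iterations   # layer 0 first
vhdl_order = tuple(reversed(layer_order))           # as written in the package

print(fc_iterations)  # [3, 6, 6]
print(vhdl_order)     # (6, 6, 3, 6, 1)
```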

The MATLAB output is the following:

First convolution (1) is correct!
First convolution (2) is correct!
First convolution (3) is correct!
First convolution (4) is correct!
First convolution (5) is correct!
First convolution (6) is correct!
Second convolution (1) is correct!
Second convolution (2) is correct!
Second convolution (3) is correct!
Second convolution (4) is correct!
Second convolution (5) is correct!
Second convolution (6) is correct!
(1) FC is correct!
(2) FC is correct!
(3) FC is correct!

For demonstration purposes, only the first OFMAP of the second max pooling layer is reported, since the other matrices are very large:

VHDL output

```
7.812500e-03 9.375000e-02 5.234375e-01 1.039062e+00
3.320312e-01 4.648438e-01 6.718750e-01 5.546875e-01
-8.945312e-01 -1.554688e+00 -1.492188e+00 -1.625000e+00
```

MATLAB’s output

```
Pool2(:, :, 1) =

    0.00781    0.0938    0.523     1.04
    0.332      0.465     0.672     0.555
    0.738      0.672    -0.652    -0.98
   -0.895     -1.55     -1.49     -1.62
```
Chapter 6

Synthesis - Place & Route

6.1 Original architecture

This section reports the synthesis and place & route results obtained for the neural network model depicted in Figure 4.1. The dimensions of the architecture are summarized in the following table:

Table 6.1: Dimensions of the top entity in terms of # input bits, XNOR Gates, Pop units etc. The reference neural network is depicted in Figure 4.1.

<table>
<thead>
<tr><th>Layer</th><th>Type</th><th>Parameter</th><th>Formula</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>CONVOLUTIONAL LAYER</td><td>Memory sizes</td><td>W</td><td>\( \frac{\text{number\_of\_fc\_parameters}}{n_{\text{iter}}} \)</td><td>6</td></tr>
<tr><td></td><td></td><td>H</td><td>\( h_{\text{max}} \)</td><td>169</td></tr>
<tr><td></td><td></td><td>Z</td><td>\( c_{\text{in}} \)</td><td>1</td></tr>
<tr><td></td><td>Number of XNOR GATES</td><td>In MEMORY</td><td>\( W \times H \times c_{\text{in}} \)</td><td>1014</td></tr>
<tr><td></td><td></td><td>OOM</td><td>\( W \times c_{\text{in}} \)</td><td>6</td></tr>
<tr><td></td><td>Number of POP Units</td><td>In MEMORY</td><td>\( H \times c_{\text{in}} \)</td><td>169</td></tr>
<tr><td></td><td></td><td>OOM</td><td>\( c_{\text{in}} \)</td><td>1</td></tr>
<tr><td></td><td>Input sizes [# bits]</td><td>Input filters</td><td>\( c_{\text{in}} \times n_{\text{bit}} \)</td><td>18</td></tr>
<tr><td></td><td></td><td>Input image</td><td>\( c_{\text{in}} \times n_{\text{bit}} \)</td><td>18</td></tr>
<tr><td></td><td></td><td>Input weights FC</td><td>\( W \)</td><td>6</td></tr>
<tr><td></td><td></td><td>Input image FC</td><td>\( W \)</td><td>6</td></tr>
<tr><td></td><td></td><td>A,B</td><td>\( n_{\text{bit}} \)</td><td>18</td></tr>
<tr><td></td><td>Output sizes [# bits]</td><td>Binary inputs/Binary weights</td><td>\( W \times c_{\text{in}} \)</td><td>6</td></tr>
<tr><td></td><td></td><td>Output convolution</td><td>\( n_{\text{bit}} \times c_{\text{in}} \)</td><td>18</td></tr>
<tr><td>POOL</td><td>Input sizes [# bits]</td><td>Input pool</td><td>\( c_{\text{in}} \times n_{\text{bit}} \)</td><td>18</td></tr>
<tr><td></td><td>Output sizes [# bits]</td><td>Output pool</td><td>\( c_{\text{in}} \times n_{\text{bit}} \)</td><td>18</td></tr>
</tbody>
</table>

6.1.1 Synthesis

The two circuits are synthesized with Synopsys Design Compiler in a 45nm CMOS technology, and the differences in terms of area, timing and power are analyzed. Power is estimated in the worst case, with a switching activity equal to '1' on every node of the network. The options used for the synthesis are the following:

1. Chosen clock period of 5.5ns:
   ```
   create_clock -name MY_CLK -period 5.5 clk;
   ```

2. Clock uncertainty (jitter) is applied:
   ```
   set_clock_uncertainty 0.07 [get_clocks MY_CLK];
   ```

3. Delay of inputs/outputs:
   ```
   set_input_delay 0.5 -max -clock MY_CLK [remove_from_collection [all_inputs] clk]
   set_output_delay 0.5 -max -clock MY_CLK [all_outputs]
   ```

4. For the sake of simplicity, the input capacitance of BUF_X4 (equal to 3.40 fF) is chosen as the load of the outputs:
   ```
   set OLOAD [load_of NangateOpenCellLibrary/BUF_X4/A]
   set_load $OLOAD [all_outputs]
   ```

Results

Table 6.2: Results in terms of area, critical path delay, power, total energy and time required by the two architectures for the neural network model depicted in Figure 4.1

<table>
<thead>
<tr><th>Architecture</th><th>Area [$mm^2$]</th><th>Critical path delay [ns]</th><th>Power [mW]</th><th>Time required [$\mu$s]</th><th>Total energy [nJ]</th></tr>
</thead>
<tbody>
<tr><td>In Memory architecture</td><td>0.0923</td><td>4.22</td><td>12.9</td><td>61.209</td><td>789.6</td></tr>
<tr><td>OOM architecture</td><td>0.0564</td><td>4.38</td><td>8.85</td><td>150.333</td><td>1330.4</td></tr>
</tbody>
</table>

**Critical path delay** The critical path is formed by a multiplier and an adder of the batch normalization unit, as depicted in Figure 4.17. The two critical path delays differ because of the different synthesis choices made by Synopsys. A part of the timing report of both architectures is reported in the following figure:

The worst case is considered, so the critical path delays of the two architectures are treated as equal. The other results are now discussed.

**Area** The main contributions to the area of the designs are analyzed, taking into account only the main differences between the two architectures:

1. Number of XNOR gates: the two architectures have a different number of XNOR gates, which can be obtained from the dimensions of the XNOR UNIT (OOM structure) and of the XNOR UNIT in memory (In-memory structure):

\[
\begin{aligned}
\#XNOR\ Gates(\text{in memory}) &= w \times h = 6 \times 169 = 1014 \\
\#XNOR\ Gates(\text{OOM}) &= w \times 1 = 6 \\
XNOR\ Gate\_ratio &= \frac{1014}{6} = 169
\end{aligned}
\]

2. Pop-counting units: in the OOM architecture there is only one pop-count unit, since the structure is serialized, while in the in-memory architecture there are 169 of them:

\[
Pop\_ratio = \frac{169}{1} = 169 \tag{6.1}
\]

These two contributions produce the following area ratio:

\[ AR = \frac{Area_{OOM}}{Area_{In-Memory}} = \frac{0.0564}{0.0923} \approx 0.611 \tag{6.2} \]
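The counts and the area ratio above can be recomputed from the dimensions of Table 6.1 and the areas of Table 6.2. The Python sketch below only repeats this arithmetic; the variable names are chosen for the example.

```python
W, H, C_IN = 6, 169, 1   # memory width, memory rows, input channels (Table 6.1)

xnor_in_memory = W * H * C_IN   # one XNOR gate per memory cell
xnor_oom = W * C_IN             # a single row of XNOR gates
pop_in_memory = H * C_IN        # one pop-count unit per memory row
pop_oom = C_IN                  # serialized structure

area_oom_mm2, area_in_memory_mm2 = 0.0564, 0.0923
area_ratio = area_oom_mm2 / area_in_memory_mm2

print(xnor_in_memory, xnor_oom, xnor_in_memory // xnor_oom)   # 1014 6 169
print(pop_in_memory // pop_oom)                               # 169
print(round(area_ratio, 3))                                   # 0.611
```

Although the in-memory design has 169 times more XNOR gates and pop-count units, its area is only about 1.6 times larger, since these units are only part of the whole design.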

**Power** As expected, the power is worse for the in-memory implementation, since Synopsys is not able to implement the XNOR-UNIT and pop-counting parts as in-memory designs. The estimates for the In-Memory case are pessimistic: the memory has been implemented as a register file, and a flip-flop is more complex than a custom memory cell composed of a memory element and an XNOR gate. The in-memory architecture consumes \( \approx 1.45 \times \) more power than the OOM counterpart.

**Total energy** The total energy of each architecture has been evaluated as the product of its power and its total delay. The energy ratio is given by:

\[ ER = \frac{Energy_{OOM}}{Energy_{In-Memory}} = \frac{1330.4nJ}{789.6nJ} \approx 1.7 \tag{6.3} \]

The In-Memory architecture consumes \( \sim 1.7 \times \) less energy than the OOM counterpart. This is a very good result, because the main goal of an In-Memory architecture is to change the design approach in order to reduce both the energy consumed and the computational delay, by placing the computational elements (in this case XNOR gates and full-adders) near the memory elements. Consequently, the Von Neumann bottleneck is reduced.
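The two energies in (6.3) follow from the power and time values of Table 6.2, since 1 mW multiplied by 1 µs gives 1 nJ. A minimal Python check is sketched below; the function name `energy_nj` is hypothetical.

```python
def energy_nj(power_mw, time_us):
    """Energy in nJ: 1 mW * 1 us = 1e-3 W * 1e-6 s = 1e-9 J."""
    return power_mw * time_us


energy_in_memory = energy_nj(12.9, 61.209)
energy_oom = energy_nj(8.85, 150.333)
energy_ratio = energy_oom / energy_in_memory

print(f"In-Memory: {energy_in_memory:.1f} nJ")   # 789.6 nJ
print(f"OOM:       {energy_oom:.1f} nJ")         # 1330.4 nJ
print(f"ER:        {energy_ratio:.1f}")          # 1.7
```

The in-memory design draws more power but finishes about 2.5 times sooner, which is why its energy is lower.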

### 6.1.2 Place & Route

In this section, the Place & Route results are reported for both architectures:

Figure 6.2 shows that the OOM structure is less complex than the In-Memory alternative, which confirms the expectations on power consumption discussed in the synthesis part. The power reports of the two architectures for a clock period of 5.5ns are now reported. They are obtained with .vcd files, so that the switching activity of the circuits is also taken into account.

**OOM implementation**

* Power Units = 1mW

<table>
<thead>
<tr>
<th>Cell</th>
<th>Internal Power</th>
<th>Switching Power</th>
<th>Total Power</th>
<th>Leakage Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total ( 26504 of 26504 )</td>
<td>6.747</td>
<td>0.2858</td>
<td>8.08</td>
<td>1.047</td>
</tr>
<tr>
<td>Total Capacitance</td>
<td>1.444e-10 F</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**In-memory implementation**

* Power Units = 1mW

<table>
<thead>
<tr>
<th>Cell</th>
<th>Internal Power</th>
<th>Switching Power</th>
<th>Total Power</th>
<th>Leakage Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total ( 46104 of 46104 )</td>
<td>10.98</td>
<td>1.861</td>
<td>14.54</td>
<td>1.705</td>
</tr>
<tr>
<td>Total Capacitance</td>
<td>2.554e-10 F</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

In the OOM case, the load capacitance used by Synopsys affects the power estimate, which is higher than the value computed by Innovus. In the In-Memory case, the interconnections have a large impact on power consumption, which exceeds the synthesis estimate by 1.64 mW. The energy ratio can be evaluated taking the interconnection contributions into account:

\[
ER_{\text{Place\&Route}} = \frac{8.08 mW \times 150.333 \mu s}{14.54 mW \times 61.209 \mu s} \approx 1.37 \tag{6.4}
\]
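The effect of the interconnections on the energy ratio can be quantified by evaluating the same ratio with the synthesis powers of Table 6.2 and with the Place & Route powers reported above. The Python sketch below assumes the execution times are unchanged (150.333 µs and 61.209 µs); the function name is hypothetical.

```python
def energy_ratio(p_oom_mw, t_oom_us, p_in_mem_mw, t_in_mem_us):
    """OOM energy divided by In-Memory energy (mW * us = nJ on both sides)."""
    return (p_oom_mw * t_oom_us) / (p_in_mem_mw * t_in_mem_us)


T_OOM_US, T_IN_MEM_US = 150.333, 61.209

synthesis = energy_ratio(8.85, T_OOM_US, 12.9, T_IN_MEM_US)
place_route = energy_ratio(8.08, T_OOM_US, 14.54, T_IN_MEM_US)

print(f"synthesis ratio:     {synthesis:.3f}")
print(f"place & route ratio: {place_route:.3f}")
print(f"reduction:           {synthesis - place_route:.3f}")
```

The ratio drops from about 1.7 after synthesis to a value close to the reported 1.37 after Place & Route, so the energy advantage of the In-Memory architecture shrinks by roughly 0.3 but is preserved.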

Since the In-Memory architecture has a more complex structure than the OOM counterpart, its energy is degraded more by the interconnections: as a consequence, the energy ratio decreases by \(\sim 0.33\) with respect to the synthesis estimate. Regarding the critical path delay, after the Place\&Route phase the worst slack values have been analyzed for both architectures with \(t_{ck} = 5.5\) ns. Their values are taken from neural_network_postRoute_hold.slk:

Worst slack (In-Memory) = 0.005 ns
Worst slack (OOM) = 0.005 ns

The interconnections increase the minimum clock period from 4.22ns to \(\sim 5.5\) ns.

### 6.2 MLP architecture

The results for the MLP architecture depicted in Figure 5.2 are reported. The dimensions of both architectures (OOM and In-Memory) are the following:

Table 6.3: Dimensions of the top entity. The reference neural network is depicted in Figure 5.2. A dash (-) indicates a don't care: since this is an MLP architecture, the corresponding parameters are fixed to their minimum size.

<table>
<thead>
<tr><th>Layer</th><th>Type</th><th>Parameter</th><th>Formula</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>CONVOLUTIONAL LAYER</td><td>Memory sizes</td><td>W</td><td>number_of_fc_parameters</td><td>14</td></tr>
<tr><td></td><td></td><td>H</td><td>\( h_{max} \)</td><td>196</td></tr>
<tr><td></td><td></td><td>Z</td><td>\( c_{in} \)</td><td>1</td></tr>
<tr><td></td><td>Number of XNOR GATES</td><td>IN MEMORY</td><td>\( W \times H \times c_{in} \)</td><td>2744</td></tr>
<tr><td></td><td></td><td>OOM</td><td>\( W \times c_{in} \)</td><td>14</td></tr>
<tr><td></td><td>Number of POP Units</td><td>IN MEMORY</td><td>\( H \times c_{in} \)</td><td>196</td></tr>
<tr><td></td><td></td><td>OOM</td><td>\( c_{in} \)</td><td>1</td></tr>
<tr><td></td><td>Input sizes [# bits]</td><td>Input filters</td><td>\( c_{in} \times n_{bit} \)</td><td>-</td></tr>
<tr><td></td><td></td><td>Input image</td><td>\( c_{in} \times n_{bit} \)</td><td>-</td></tr>
<tr><td></td><td></td><td>Input weights FC</td><td>\( W \)</td><td>14</td></tr>
<tr><td></td><td></td><td>Input image FC</td><td>\( W \)</td><td>14</td></tr>
<tr><td></td><td></td><td>A,B</td><td>\( n_{bit} \)</td><td>18</td></tr>
<tr><td></td><td>Output sizes [# bits]</td><td>Binary inputs/Binary weights</td><td>\( W \times c_{in} \)</td><td>-</td></tr>
<tr><td></td><td></td><td>Output convolution</td><td>\( n_{bit} \times c_{in} \)</td><td>18</td></tr>
</tbody>
</table>

6.2.1 Synthesis & Place-Route chips

Figure 6.4: OOM chip implementing the neural network model depicted in Figure 5.2

Figure 6.5: In-Memory chip implementing the neural network model depicted in Figure 5.2

The results of the synthesis in terms of area, critical path delay and power are reported in Table 6.4. The time required by the two architectures can be computed considering that there are 4 fully connected layers. The steps are the following:

1. At the beginning of the algorithm, the data are precharged inside the memories. The total number of clock cycles required can be computed also considering that this procedure is performed everytime \texttt{iteration\_cycle} increases:

\[
\text{Delay}_{\text{Data\_acq}} = (\phi + (n_{\text{layers}} - 1) \times \psi) \times t_{\text{ck}} = 4 \times 798 \times t_{\text{ck}} \quad (6.5)
\]

2. The fully connected layers have the following delays for the In-Memory and OOM alternatives:

\[
\text{Delay}_{\text{FC(In-Memory)}} = [1 + n_{\text{iter}} \times (w_{\text{out}}(fc) + L + 2) + w_{\text{out}}(fc) + 2] \times t_{\text{ck}} \quad (6.6)
\]

\[
\text{Delay}_{\text{FC(OOM)}} = [3 + n_{\text{iter}} \times (2 + w_{\text{out}}(fc) + w_{\text{out}}(fc) \times (L + 1)) + w_{\text{out}}(fc)] \times t_{\text{ck}} \quad (6.7)
\]

The computations of the layers are the following:

(a) First layer \(n_{\text{iter}} = 56, \ w_{\text{out}}(fc) = 196, \ L = 14\):

\[
\text{Delay}_{\text{FC(In-Memory)}} = [1 + 56 \times (196 + 14 + 2) + 196 + 2] \times t_{\text{ck}}
\]

\[
= 12071 \times t_{\text{ck}}
\]

\[
\text{Delay}_{\text{FC(OOM)}} = [3 + 56 \times (196 + 2 + 196 \times (14 + 1)) + 196] \times t_{\text{ck}}
\]

\[
= 175927 \times t_{\text{ck}}
\]

(b) Second layer \(n_{\text{iter}} = 14, \ w_{\text{out}}(fc) = 196, \ L = 14\):

\[
\text{Delay}_{\text{FC(In-Memory)}} = [1 + 14 \times (196 + 14 + 2) + 196 + 2] \times t_{\text{ck}}
\]

\[
= 3167 \times t_{\text{ck}}
\]

\[
\text{Delay}_{\text{FC(OOM)}} = [3 + 14 \times (196 + 2 + 196 \times (14 + 1)) + 196] \times t_{\text{ck}}
\]

\[
= 44131 \times t_{\text{ck}}
\]
(c) Third layer $n_{iter} = 14, w_{out(fc)} = 196, L = 14$:

\[
\text{Delay}_{\text{FC(In-Memory)}} = [1 + 14 \times (196 + 14 + 2) + 196 + 2] \times t_{ck} \\
= 3167 \times t_{ck}
\]

\[
\text{Delay}_{\text{FC(OOM)}} = [3 + 14 \times (196 + 2 + 196 \times (14 + 1)) + 196] \times t_{ck} \\
= 44131 \times t_{ck}
\]

(d) Fourth layer $n_{iter} = 14, w_{out(fc)} = 10, L = 14$:

\[
\text{Delay}_{\text{FC(In-Memory)}} = [1 + 14 \times (10 + 14 + 2) + 10 + 2] \times t_{ck} \\
= 377 \times t_{ck}
\]

\[
\text{Delay}_{\text{FC(OOM)}} = [3 + 14 \times (10 + 2 + 10 \times (14 + 1)) + 10] \times t_{ck} \\
= 2281 \times t_{ck}
\]

The final delays of the FC layers are:

\[
\text{Delay}_{\text{FC(In-Memory)}} = 18782 \times t_{ck}
\]

\[
\text{Delay}_{\text{FC(OOM)}} = 266470 \times t_{ck}
\]
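
The per-layer cycle counts above can be checked with a short script. The following is a minimal sketch, assuming the two fully connected delay formulas of step 2 are applied exactly as written; the function and variable names are illustrative and are not part of the original design.

```python
def fc_cycles_in_memory(n_iter, w_out, L):
    """Clock cycles of one FC layer on the In-Memory architecture."""
    return 1 + n_iter * (w_out + L + 2) + w_out + 2


def fc_cycles_oom(n_iter, w_out, L):
    """Clock cycles of one FC layer on the OOM architecture (Eq. 6.7)."""
    return 3 + n_iter * (2 + w_out + w_out * (L + 1)) + w_out


# (n_iter, w_out(fc), L) for the four FC layers of the MLP, as listed in (a)-(d)
MLP_LAYERS = [(56, 196, 14), (14, 196, 14), (14, 196, 14), (14, 10, 14)]

in_memory_total = sum(fc_cycles_in_memory(*layer) for layer in MLP_LAYERS)
oom_total = sum(fc_cycles_oom(*layer) for layer in MLP_LAYERS)

print("In-Memory FC cycles:", in_memory_total)
print("OOM FC cycles:", oom_total)
```

The results are expressed in clock cycles; multiplying by $t_{ck}$ gives the delay.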

Considering all the contributions:

\[
\text{Delay}_{\text{In-Memory}} = (21974 + \text{overheads}) \times t_{ck}
\]

\[
\text{Delay}_{\text{OOM}} = (269606 + \text{overheads}) \times t_{ck}
\]

\[
DR = \frac{269606}{21974} \approx 12.27
\]

From ModelSim, the actual times required by the architectures to perform the algorithm with a clock period of 6 ns are given by:

Figure 6.6: Computational delay of the In-Memory architecture, implementing the neural network model depicted in Figure 5.2.

Figure 6.7: Computational delay of the OOM architecture, implementing the neural network model depicted in Figure 5.2.

The resulting ratio is $\sim 12.26$. From a delay point of view, the In-Memory architecture is much more efficient than the OOM one at performing the fully connected computations, because of the parallelization technique and the possibility to perform the XNOR and pop-counting operations directly inside the memory array.

Table 6.4: Results in terms of area, critical path delay, power, total energy and time required by the two architectures for the neural network model depicted in Figure 5.2

<table>
<thead>
<tr>
<th></th>
<th>Area [mm$^2$]</th>
<th>Critical path delay [ns]</th>
<th>Power [mW]</th>
<th>Time required [ms]</th>
<th>Total energy [µJ]</th>
</tr>
</thead>
<tbody>
<tr>
<td>In Memory architecture</td>
<td>0.1055</td>
<td>4.220</td>
<td>15.1</td>
<td>0.132</td>
<td>1.99</td>
</tr>
<tr>
<td>OOM architecture</td>
<td>0.0876</td>
<td>4.32</td>
<td>14.32</td>
<td>1.62</td>
<td>23.2</td>
</tr>
</tbody>
</table>

The powers in Table 6.4 are comparable, which is a very good result considering the dimensions and gate count of the In-Memory architecture with respect to the OOM one. The energy ratio is given by:

\[ ER = \frac{23.2\,\mu\text{J}}{1.99\,\mu\text{J}} \approx 11.7\times \quad (6.8) \]

From the energy consumption point of view as well, the In-Memory architecture is far more efficient than the OOM one.
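
The time and energy columns of Table 6.4 can be cross-checked from the cycle counts. The sketch below assumes that the time required is the number of clock cycles multiplied by the 6 ns clock period (overheads excluded) and that the total energy is power multiplied by time; the helper names are illustrative. Small differences from the tabulated values come from rounding.

```python
T_CK_NS = 6.0  # clock period used in the ModelSim simulation


def time_ms(cycles, t_ck_ns=T_CK_NS):
    """Execution time in ms for a given number of clock cycles."""
    return cycles * t_ck_ns * 1e-6  # ns -> ms


def energy_uj(power_mw, t_ms):
    """Energy in µJ: mW multiplied by ms gives µJ."""
    return power_mw * t_ms


# Cycle counts (without overheads) and synthesized power from Table 6.4
t_in_memory = time_ms(21974)
t_oom = time_ms(269606)
e_in_memory = energy_uj(15.1, t_in_memory)
e_oom = energy_uj(14.32, t_oom)
energy_ratio = e_oom / e_in_memory

print(f"In-Memory: {t_in_memory:.3f} ms, {e_in_memory:.2f} uJ")
print(f"OOM:       {t_oom:.3f} ms, {e_oom:.2f} uJ")
print(f"Energy ratio: {energy_ratio:.1f}")
```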

### 6.3 Fashion-MNIST CNN

The case of fashion-MNIST CNN depicted in Figure 5.4 is now discussed. The parameters chosen for this model assume the following values:

1. The number of bits used in this architecture is 16, with 8 fractional bits and 8 integer bits;
2. The number of input channels is equal to 6, since there are 2 convolutional layers and the second one takes 6 input channels;
3. \( w_{filter}^2 \) is fixed to 25 for the convolutional part, while max-pooling only has a 2x2 kernel size;
4. \( W \) is fixed to 32, for the reasons already explained in subsection 5.3.2;
5. \( H \) is 576, from the first convolutional layer’s output size, which is given by 24x24 OFMAPs.

Table 6.5: Dimensions of the top entity. The reference neural network is depicted in Figure 5.4.

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th>Parameter</th>
<th>Formula</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVOLUTIONAL LAYER</td>
<td>Memory sizes</td>
<td>W</td>
<td>number of FC parameters / n_iter</td>
<td>32</td>
</tr>
<tr>
<td></td>
<td></td>
<td>H</td>
<td>h_max</td>
<td>576</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Z</td>
<td>c_in</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>Number of XNOR GATES</td>
<td>IN MEMORY</td>
<td>W × H × c_in</td>
<td>110592</td>
</tr>
<tr>
<td></td>
<td></td>
<td>OOM</td>
<td>W × c_in</td>
<td>192</td>
</tr>
<tr>
<td></td>
<td>Number of POP Units</td>
<td>IN MEMORY</td>
<td>H × c_in</td>
<td>3456</td>
</tr>
<tr>
<td></td>
<td></td>
<td>OOM</td>
<td>c_in</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>Input sizes [# bits]</td>
<td>Input filters</td>
<td>c_in × n_bit</td>
<td>96</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Input image</td>
<td>c_in × n_bit</td>
<td>96</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Input weights FC</td>
<td>W</td>
<td>32</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Input image FC</td>
<td>W</td>
<td>32</td>
</tr>
<tr>
<td></td>
<td></td>
<td>A,B</td>
<td>n_bit</td>
<td>16</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Binary inputs/Binary weights</td>
<td>W × c_in</td>
<td>192</td>
</tr>
<tr>
<td></td>
<td>Output sizes [# bits]</td>
<td>Output convolution</td>
<td>n_bit × c_in</td>
<td>16</td>
</tr>
<tr>
<td>POOL</td>
<td>Input sizes [# bits]</td>
<td>Input pool</td>
<td>c_in × n_bit</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Output sizes [# bits]</td>
<td>Output pool</td>
<td>c_in × n_bit</td>
<td></td>
</tr>
</tbody>
</table>

This network is composed of 2 convolutional layers, 2 pooling layers and 3 fully connected layers. The total delay can be computed as:

1. Precharge

\[
\text{Delay}_{\text{Data\_acq}} = (\phi + (n_{\text{layers}} - 1) \times \psi) \times t_{ck} = (784 + 4 \times 152) \times t_{ck} \quad (6.9)
\]

2. First convolutional layer: \( w_{filter}^2 = 25, w_{out}^2 = 576, c_{out} = 6, c_{in} = 1 \)

\[
\text{Delay}_{\text{In-Memory(convolution)}} = [3 + w_{out}^2 \times (w_{filter}^2 + 1) + (c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2)) + c_{out} \times 3 + 1)] \times t_{ck} \\
= 25516 \times t_{ck}
\]

\[
\text{Delay}_{\text{OOM(convolution)}} = [3 + w_{out}^2 \times (w_{filter}^2 + 1) + (c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2) + c_{out} \times 4 + 1)] \times t_{ck} \\
= 111772 \times t_{ck}
\]

3. Max pooling: $w_{filter}^2 = 4$, $w_{out(pool)}^2 = 144$

$$\text{Delay}_{\text{Pooling}} = (4 + w_{out(pool)}^2 \times (2 + w_{filter}^2)) \times t_{ck}$$

$$= (4 + 144 \times (2 + 4)) \times t_{ck} = 868 \times t_{ck} \quad (6.10)$$

4. Second convolutional layer: $w_{filter}^2 = 25$, $w_{out}^2 = 64$, $c_{out} = 6$, $c_{in} = 6$

$$\text{Delay}_{\text{In-Memory(convolution)}} = [3 + w_{out}^2 \times (w_{filter}^2 + 1) + (c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2)) + c_{out} \times 3 + 1)] \times t_{ck}$$

$$= 4908 \times t_{ck}$$

$$\text{Delay}_{\text{OOM(convolution)}} = [3 + w_{out}^2 \times (w_{filter}^2 + 1) + (c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2) + c_{out} \times 4 + 1)] \times t_{ck}$$

$$= 14364 \times t_{ck}$$

5. Max pooling: $w_{filter}^2 = 4$, $w_{out(pool)}^2 = 16$

$$\text{Delay}_{\text{Pooling}} = (4 + w_{out(pool)}^2 \times (2 + w_{filter}^2)) \times t_{ck}$$

$$= (4 + 16 \times (2 + 4)) \times t_{ck} = 100 \times t_{ck} \quad (6.11)$$

6. First fully connected layer: $w_{out(fc)} = 120$, $L = 32$, $n_{iter} = \frac{96}{32} = 3$

$$\text{Delay}_{\text{FC(In-Memory)}} = [1 + n_{iter} \times (w_{out(fc)} + L + 2) + w_{out(fc)} + 2] \times t_{ck}$$

$$= 585 \times t_{ck}$$

$$\text{Delay}_{\text{FC(OOM)}} = [3 + n_{iter} \times (2 + w_{out(fc)} + w_{out(fc)} \times (L + 1)) + w_{out(fc)}] \times t_{ck}$$

$$= 12369 \times t_{ck}$$


7. Second fully connected layer: $w_{out(fc)} = 84, L = 20, n_{iter} = \frac{120}{20} = 6$

$$\begin{align*}
\text{Delay}_{\text{FC(In-Memory)}} &= [1 + n_{iter} \times (w_{out(fc)} + L + 2) + w_{out(fc)} + 2] \times t_{ck} \\
&= 723 \times t_{ck} \\
\text{Delay}_{\text{FC(OOM)}} &= [3 + n_{iter} \times (2 + w_{out(fc)} + w_{out(fc)} \times (L + 1)) + \\
&\quad + w_{out(fc)}] \times t_{ck} \\
&= 11187 \times t_{ck}
\end{align*}$$

8. Third fully connected layer: $w_{out(fc)} = 11, L = 14, n_{iter} = \frac{84}{14} = 6$

$$\begin{align*}
\text{Delay}_{\text{FC(In-Memory)}} &= [1 + n_{iter} \times (w_{out(fc)} + L + 2) + w_{out(fc)} + 2] \times t_{ck} \\
&= 176 \times t_{ck} \\
\text{Delay}_{\text{FC(OOM)}} &= [3 + n_{iter} \times (2 + w_{out(fc)} + w_{out(fc)} \times (L + 1)) + \\
&\quad + w_{out(fc)}] \times t_{ck} \\
&= 1082 \times t_{ck}
\end{align*}$$

Considering all the contributions:

$$\begin{align*}
\text{Delay}_{\text{In-Memory}} &= (34268 + \text{overheads}) \times t_{ck} \\
\text{Delay}_{\text{OOM}} &= (153134 + \text{overheads}) \times t_{ck} \\
DR &= \frac{153134}{34268} \approx 4.47
\end{align*}$$
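
The contributions of steps 1 to 8 can be summed programmatically. The following sketch assumes the convolution, pooling and fully connected formulas exactly as given in the steps above (the pooling delay being $4 + w_{out(pool)}^2 \times (2 + w_{filter}^2)$ cycles); the function names are illustrative only.

```python
def conv_cycles_in_memory(w_filter2, w_out2, c_in, c_out):
    """Clock cycles of one convolutional layer, In-Memory architecture."""
    return (3 + w_out2 * (w_filter2 + 1)
            + c_out * (w_filter2 + w_out2 * (c_in + 2)) + c_out * 3 + 1)


def conv_cycles_oom(w_filter2, w_out2, c_in, c_out):
    """Clock cycles of one convolutional layer, OOM architecture."""
    return (3 + w_out2 * (w_filter2 + 1)
            + c_out * w_out2 * (w_filter2 + c_in + 2) + c_out * 4 + 1)


def pool_cycles(w_filter2, w_out_pool2):
    """Clock cycles of one max-pooling layer (same for both architectures)."""
    return 4 + w_out_pool2 * (2 + w_filter2)


def fc_cycles_in_memory(n_iter, w_out, L):
    return 1 + n_iter * (w_out + L + 2) + w_out + 2


def fc_cycles_oom(n_iter, w_out, L):
    return 3 + n_iter * (2 + w_out + w_out * (L + 1)) + w_out


PRECHARGE = 784 + 4 * 152                       # Eq. 6.9, in clock cycles
CONV = [(25, 576, 1, 6), (25, 64, 6, 6)]        # (w_filter^2, w_out^2, c_in, c_out)
POOL = [(4, 144), (4, 16)]                      # (w_filter^2, w_out(pool)^2)
FC = [(3, 120, 32), (6, 84, 20), (6, 11, 14)]   # (n_iter, w_out(fc), L)

shared = PRECHARGE + sum(pool_cycles(*p) for p in POOL)
in_memory_total = (shared
                   + sum(conv_cycles_in_memory(*c) for c in CONV)
                   + sum(fc_cycles_in_memory(*f) for f in FC))
oom_total = (shared
             + sum(conv_cycles_oom(*c) for c in CONV)
             + sum(fc_cycles_oom(*f) for f in FC))
delay_ratio = oom_total / in_memory_total

print(in_memory_total, oom_total, round(delay_ratio, 2))
```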

The real delay ratio is $\sim 4.4 \times$. The synthesis results are the following:

Table 6.6: Results in terms of area, critical path delay, power, total energy and time required by the two architectures for the neural network model depicted in Figure 5.4

<table>
<thead>
<tr>
<th></th>
<th>Area $[\text{mm}^2]$</th>
<th>Critical path delay $[\text{ns}]$</th>
<th>Power $[\text{mW}]$</th>
<th>Time required $[\text{ms}]$</th>
<th>Total energy $[\mu\text{J}]$</th>
</tr>
</thead>
<tbody>
<tr>
<td>In Memory architecture</td>
<td>1.68</td>
<td>4.11</td>
<td>254.5</td>
<td>0.210</td>
<td></td>
</tr>
<tr>
<td>OOM architecture</td>
<td>1.10</td>
<td>4.14</td>
<td>193.30</td>
<td>0.923</td>
<td></td>
</tr>
</tbody>
</table>

The energy ratio is $\sim 3.34 \times$, which is a very good result considering the dimensions of the network.
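
As a consistency check, and assuming that the total energy is obtained as power multiplied by the time required (the same relation that links the columns of Table 6.4), the power and time values of Table 6.6 give approximately:

\[
E_{\text{In-Memory}} \approx 254.5\,\text{mW} \times 0.210\,\text{ms} \approx 53.4\,\mu\text{J}, \qquad
E_{\text{OOM}} \approx 193.30\,\text{mW} \times 0.923\,\text{ms} \approx 178.4\,\mu\text{J}
\]

\[
ER \approx \frac{178.4\,\mu\text{J}}{53.4\,\mu\text{J}} \approx 3.34
\]

These energy values are derived here from the tabulated power and time; they are not taken from the synthesis reports.
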
6.4 General cases

To evaluate the performance of the network with different parameters, several syntheses have been performed. In each of them, power, energy, area and timing are evaluated, and the results of the In-Memory and OOM architectures are compared. The same architecture implemented for the neural network model in Figure 4.1 is used, and only two parameters are swept at a time. The parameters chosen are the following:

1. $n_{bit}$: considering the plot in Figure 4.50, the evaluation goes from 12 bits (11 integer and 1 fractional) to 21 bits (11 integer and 10 fractional);

2. The $w_{filter}$ parameter is swept from its initial value ($w_{filter} = 2$) to $w_{filter} = 11$, emulating the kernel size in deeper neural networks, such as AlexNet;

3. The number of input channels $c_{in}$ is swept from 2 to 7, in order to evaluate the cost in terms of performance of having parallel architectures working at the same time;

4. The $H$ size is swept to evaluate the impact of having a bigger OFMAP.

Regarding the energy estimations, only a convolution computation with $c_{out} = 1$ has been considered (except for the $H$ case, in which the fully connected algorithm is also considered), because it represents the worst case in terms of delay ratio with respect to the fully connected case, which is depicted in Figure 4.49. Starting from the delay ratio trend, the energy ratio is obtained as the power ratio multiplied by the delay ratio. This procedure produces worst-case energy results, since a neural network (CNN or MLP) always includes fully connected layers, which are where the actual gain in terms of delay is obtained. To better understand this consideration, the original network depicted in Figure 4.1 is taken as an example.

1. The power values for OOM and In-Memory architectures are 8.85 mW and 12.9 mW respectively, producing an energy ratio of $\sim 1.7$, as already discussed in section 6.1.1;
2. Considering only one convolution computation with $c_{out} = 6$ and the same dimensions of the model (Figure 4.1), the resulting delay ratio is:

\[
DR = \frac{3 + w_{out}^2 \times (w_{filter}^2 + 1) + (c_{out} \times w_{out}^2 \times (w_{filter}^2 + c_{in} + 2) + c_{out} \times 4 + 1)}{3 + w_{out}^2 \times (w_{filter}^2 + 1) + (c_{out} \times (w_{filter}^2 + w_{out}^2 \times (c_{in} + 2)) + c_{out} \times 3 + 1)}
\]

\[
= \frac{3 + 169 \times (4 + 1) + (6 \times 169 \times 7 + 6 \times 4 + 1)}{3 + 169 \times 5 + (6 \times (4 + 169 \times 3) + 6 \times 3 + 1)} = \frac{7971}{3933} \approx 2.03 \quad (6.12)
\]

The delay ratio is degraded with respect to its actual value ($\sim 2.46$), which also includes the fully connected part; this reduces the energy ratio to $ER = 1.39$ (from $\sim 1.7$). In general, a smaller $c_{out}$ worsens the delay ratio: performing the same computation with $c_{out} = 1$ gives $DR = 1.49$.
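
The dependence of the delay ratio on $c_{out}$ can be explored numerically. The following is a minimal sketch, assuming the convolution delay formulas of Equation 6.12, the dimensions of the model in Figure 4.1 ($w_{filter}^2 = 4$, $w_{out}^2 = 169$, $c_{in} = 1$) and the power values quoted above; the function names are illustrative.

```python
def conv_cycles_in_memory(w_filter2, w_out2, c_in, c_out):
    return (3 + w_out2 * (w_filter2 + 1)
            + c_out * (w_filter2 + w_out2 * (c_in + 2)) + c_out * 3 + 1)


def conv_cycles_oom(w_filter2, w_out2, c_in, c_out):
    return (3 + w_out2 * (w_filter2 + 1)
            + c_out * w_out2 * (w_filter2 + c_in + 2) + c_out * 4 + 1)


def delay_ratio(w_filter2, w_out2, c_in, c_out):
    """Delay ratio OOM / In-Memory for a single convolution computation."""
    return (conv_cycles_oom(w_filter2, w_out2, c_in, c_out)
            / conv_cycles_in_memory(w_filter2, w_out2, c_in, c_out))


# Power values quoted in the text for the network of Figure 4.1 (in mW)
P_OOM_MW, P_IN_MEMORY_MW = 8.85, 12.9
power_ratio = P_OOM_MW / P_IN_MEMORY_MW

dr_cout6 = delay_ratio(4, 169, 1, 6)
dr_cout1 = delay_ratio(4, 169, 1, 1)
er_cout6 = power_ratio * dr_cout6  # energy ratio = power ratio x delay ratio

for c_out in (1, 2, 4, 6):
    print(c_out, round(delay_ratio(4, 169, 1, c_out), 2))
```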

Figure 6.10: Area, CP delay, Power vs $c_{in}$ - $w_{filter}$ for the OOM architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$). Power vs $c_{in}$ - $w_{filter}$: power increases almost linearly with $c_{in}$, because more parallel architectures are working at the same time. With higher $w_{filter}$, the power rises almost exponentially, because a larger memory array is required and more XNOR gates are used. Area vs $c_{in}$ - $w_{filter}$ behaves in the same way. CP delay vs $c_{in}$ - $w_{filter}$: it remains almost constant, since it is determined by a multiplier-adder sequence. For a higher $c_{in}$, more adders are used in the adder trees of the K-α computations (Figure 4.11 and Figure 4.13), but the critical path remains the same.

Figure 6.11: Area, Critical path delay, Power vs $c_{in}$ - $w_{filter}$ for the In-Memory architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$). The same considerations made for Figure 6.10 are valid here. The maximum power reached in this case is $\sim 260\,\text{mW}$, compared with $\sim 230\,\text{mW}$ in the previous case. Considering the higher number of logic gates required by the In-Memory architecture, this is a very good result, which also reduces the computational time normally required by the OOM architecture.

Figure 6.12: Area ratio, Critical path delay ratio, Power ratio vs $c_{in} - w_{filter}$ obtained as OOM/In-Memory ($H = 169, c_{out} = 1, W = w_{filter}^2$). Increasing $c_{in}$ reduces the power and area ratios, since the In-Memory architecture requires more building blocks than the OOM one. Increasing $w_{filter}$ brings power benefits to the In-Memory architecture, since the registers start to have a predominant contribution with respect to the sequential/combinational power: since the architectures have approximately the same number of registers, the power ratio tends towards 1 for $w_{filter} \rightarrow \infty$. From a power consumption point of view, it is convenient to implement an In-Memory architecture with high $c_{in}$ and $w_{filter}$.

The plot in Figure 6.12 leads to an important consideration: the higher \( w_{filter} \) and \( c_{in} \) are, the lower the power ratio and the area ratio. It means that for the In-Memory architecture it is convenient to realize an architecture with high \( w_{filter} \) and \( c_{in} \), as already said. To understand the behavior of the power ratio (PR), it is sufficient to analyze the following power reports:

**OOM architecture’s power report with \( c_{in} = 7, w_{filter} = 11 \)**

<table>
<thead>
<tr>
<th>Power Group</th>
<th>Internal Power</th>
<th>Switching Power</th>
<th>Leakage Power</th>
<th>Total Power ( % )</th>
</tr>
</thead>
<tbody>
<tr>
<td>io_pad</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>memory</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>black_box</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>clock_network</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>register</td>
<td>1.9380e+05</td>
<td>22.3686</td>
<td>1.1818e+07</td>
<td>2.0563e+05 ( 90.96%)</td>
</tr>
<tr>
<td>sequential</td>
<td>5.5253e-02</td>
<td>7.8112e-03</td>
<td>2.7743e+03</td>
<td>2.8373 ( 0.00%)</td>
</tr>
<tr>
<td>combinational</td>
<td>1.7291e+03</td>
<td>7.9070e+03</td>
<td>1.0788e+07</td>
<td>2.0425e+04 ( 9.04%)</td>
</tr>
</tbody>
</table>

Total: 1.9552e+05 uW, 7.9294e+03 uW, 2.2609e+07 nW, 2.2606e+05 uW

**In-Memory architecture’s power report with \( c_{in} = 7, w_{filter} = 11 \)**

<table>
<thead>
<tr>
<th>Power Group</th>
<th>Internal Power</th>
<th>Switching Power</th>
<th>Leakage Power</th>
<th>Total Power ( % )</th>
</tr>
</thead>
<tbody>
<tr>
<td>io_pad</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>memory</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>black_box</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>clock_network</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000 ( 0.00%)</td>
</tr>
<tr>
<td>register</td>
<td>2.0721e+05</td>
<td>64.1904</td>
<td>1.2895e+07</td>
<td>2.2016e+05 ( 83.71%)</td>
</tr>
<tr>
<td>sequential</td>
<td>8.9353e-03</td>
<td>2.7972e-03</td>
<td>38.9396</td>
<td>5.0672e-02 ( 0.00%)</td>
</tr>
<tr>
<td>combinational</td>
<td>8.3320e+03</td>
<td>1.6143e+04</td>
<td>1.8368e+07</td>
<td>4.2838e+04 ( 16.29%)</td>
</tr>
</tbody>
</table>

Total: 2.1554e+05 uW, 1.6207e+04 uW, 3.1263e+07 nW, 2.6300e+05 uW

The highest power contribution is related to the registers, so the combinational power overhead for the In-Memory case decreases with higher \( c_{in}, w_{filter} \).
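
The weight of each power group can be recomputed from the two reports. The sketch below uses only the per-group total power values (in µW) copied from the tables above; the dictionary and function names are illustrative.

```python
# Total power per group in µW, copied from the two power reports above
OOM_REPORT = {
    "register": 2.0563e5,
    "sequential": 2.8373,
    "combinational": 2.0425e4,
}
IN_MEMORY_REPORT = {
    "register": 2.2016e5,
    "sequential": 5.0672e-2,
    "combinational": 4.2838e4,
}


def shares(report):
    """Percentage of the total power consumed by each power group."""
    total = sum(report.values())
    return {group: 100.0 * power / total for group, power in report.items()}


# Power ratio obtained as OOM / In-Memory, as in Figure 6.12
power_ratio = sum(OOM_REPORT.values()) / sum(IN_MEMORY_REPORT.values())

print(shares(OOM_REPORT))
print(shares(IN_MEMORY_REPORT))
print(round(power_ratio, 2))
```

With these numbers the registers account for roughly 91% of the OOM power and 84% of the In-Memory power, so the extra combinational logic of the In-Memory design weighs relatively little on the total.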
Figure 6.13: Energy ratio vs $c_{in}$ & $w_{filter}$ ($H = 169, c_{out} = 1, W = w_{filter}^2$). The delay ratio with respect to $c_{in} - w_{filter}$ depicted in Figure 4.45 has been multiplied by the obtained power ratio. The result shows that the In-Memory architecture becomes more efficient in terms of energy for higher values of $w_{filter}$. Consequently, the effect of the increase of $c_{in}$ is reduced. This is a very good result, since for very deep networks such as AlexNet the In-Memory architecture reaches better energy results.

Figure 6.14: Area, Critical path delay, Power vs $n_{bit}$ - $w_{filter}$ for the OOM architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$, $c_{in} = 1$). As $n_{bit}$ increases, power and area also rise, since a higher number of bits implies more complicated operators (adders, multipliers, etc.). In the critical path delay case, it is possible to see a peak located at 19 bits: from the timing report, the critical path is located in the divider of the $\alpha$ unit. As already seen in Figure 6.10, with high values of $w_{filter}$, both area and power rise exponentially.

Figure 6.15: Area, Critical path delay, Power vs $n_{bit} - w_{filter}$ for the In-Memory architecture ($H = 169$, $c_{out} = 1$, $W = w_{filter}^2$, $c_{in} = 1$). The same considerations made for Figure 6.14 are valid here.

In Figure 6.15 and Figure 6.16 there is a peak located at $n_{bit} = 19$. By looking at the timing reports for $n_{bit} = 19$ and $n_{bit} = 20$ of the OOM architecture, it is possible to see that the critical path in the first case is located in the divider of the $\alpha$ computation unit, while in the second one it is located in the batch normalization unit (adder-multiplier):

**OOM architecture’s timing report with $n_{bit} = 19$, $w_{filter} = 11$**

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr (ns)</th>
<th>Path (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>cnv_layer/alpha_computation/div_103/U27/ZN (OR2_X1)</td>
<td>0.06</td>
<td>0.15 f</td>
</tr>
<tr>
<td>cnv_layer/alpha_computation/div_103/U26/ZN (NOR3_X1)</td>
<td>0.06</td>
<td>0.21 r</td>
</tr>
<tr>
<td>cnv_layer/alpha_computation/div_103/U25/ZN (NAND2_X1)</td>
<td>0.03</td>
<td>0.24 f</td>
</tr>
<tr>
<td>cnv_layer/alpha_computation/div_103/U24/ZN (NOR3_X1)</td>
<td>0.06</td>
<td>0.30 r</td>
</tr>
<tr>
<td>...</td>
<td></td>
<td></td>
</tr>
<tr>
<td>clock MY_CLK (rise edge)</td>
<td>5.50</td>
<td>5.50</td>
</tr>
<tr>
<td>clock network delay (ideal)</td>
<td>0.00</td>
<td>5.50</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.07</td>
<td>5.43</td>
</tr>
<tr>
<td>cnv_layer/alpha_computation/r2/dout_reg[0]/CK (DFFR_X1)</td>
<td>0.00</td>
<td>5.43 r</td>
</tr>
<tr>
<td>library setup time</td>
<td>-0.04</td>
<td>5.39</td>
</tr>
<tr>
<td>data required time</td>
<td></td>
<td>5.39</td>
</tr>
</tbody>
</table>

---

**In-Memory architecture’s timing report with $n_{bit} = 20$, $w_{filter} = 11$**

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr (ns)</th>
<th>Path (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>cnv_layer/batch_normalization/U10/Z (BUF_X1)</td>
<td>0.06</td>
<td>0.73 f</td>
</tr>
<tr>
<td>cnv_layer/batch_normalization/U6/Z (BUF_X1)</td>
<td>0.04</td>
<td>0.77 f</td>
</tr>
<tr>
<td>cnv_layer/batch_normalization/U3/ZN (INV_X1)</td>
<td>0.13</td>
<td>0.90 r</td>
</tr>
<tr>
<td>cnv_layer/batch_normalization/U16/ZN (AOI22_X1)</td>
<td>0.06</td>
<td>0.96 f</td>
</tr>
<tr>
<td>cnv_layer/batch_normalization/U51/ZN (INV_X2)</td>
<td>0.11</td>
<td>1.07 r</td>
</tr>
<tr>
<td>cnv_layer/batch_normalization/bb/inputs[1] (bnorm_n_bit20_multiplication_{sx_extreme28})</td>
<td>0.00</td>
<td>1.07 r</td>
</tr>
<tr>
<td>...</td>
<td></td>
<td></td>
</tr>
<tr>
<td>clock MY_CLK (rise edge)</td>
<td>5.50</td>
<td>5.50</td>
</tr>
<tr>
<td>clock network delay (ideal)</td>
<td>0.00</td>
<td>5.50</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.07</td>
<td>5.43</td>
</tr>
<tr>
<td>output external delay</td>
<td>-0.50</td>
<td>4.93</td>
</tr>
<tr>
<td>data required time</td>
<td></td>
<td>4.93</td>
</tr>
</tbody>
</table>

---

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr (ns)</th>
<th>Path (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>data required time</td>
<td></td>
<td>4.93</td>
</tr>
<tr>
<td>data arrival time</td>
<td></td>
<td>-4.57</td>
</tr>
<tr>
<td>slack (MET)</td>
<td></td>
<td>0.36</td>
</tr>
</tbody>
</table>
Figure 6.16: Area ratio, Critical path delay ratio, Power ratio vs $n_{\text{bit}}$ - $w_{\text{filter}}$ obtained as OOM/In-Memory ($H = 169$, $c_{\text{out}} = 1$, $W = w_{\text{filter}}^2$, $c_{\text{in}} = 1$). For high values of $n_{\text{bit}}$, the area and power ratios increase. This implies that the In-Memory architecture gains a performance advantage if a more precise representation is used.

Figure 6.17: Area, Critical path delay, Power vs $\sqrt{H} - c_{in}$ for the OOM architecture ($c_{out} = 1$, $W = w_{filter}^2 = 4$). The larger $\sqrt{H}$ is, the higher the power consumption and area are, since the registers become very large (exponential trend). Regarding $c_{in}$, as already said, power and area increase almost linearly. The critical path delay remains almost the same for each value of $\sqrt{H} - c_{in}$.

Figure 6.18: Area, Critical path delay, Power vs $\sqrt{H}$ - $c_{in}$ for the In-Memory architecture ($c_{out} = 1$, $W = w_{filter}^2 = 4$). The same considerations made for Figure 6.17 are valid in this case. The power and area values reached are higher than in the previous case, because of the higher number of registers and logic gates.

Figure 6.19: Area ratio, Critical path delay ratio, Power ratio vs $\sqrt{H} - c_{in}$, obtained as OOM/In-Memory ($c_{out} = 1$, $W = w_{filter}^2 = 4$). By increasing both $c_{in}$ and $\sqrt{H}$, the power and area ratios decrease, because of the higher number of logic gates inside the In-Memory architecture.

Figure 6.20: Energy ratio vs $\sqrt{H} - c_{in}$, obtained as OOM/In-Memory ($c_{out} = 1$, $W = w_{filter}^2 = 4$). This is the worst case: by increasing both $c_{in}$ and $\sqrt{H}$ the energy ratio decreases, because of the higher number of logic gates inside the In-Memory architecture. With higher values of both $w_{filter}$ and $c_{out}$, the energy ratio will decrease for the motivations explained before.

Figure 6.21: Energy ratio vs $\sqrt{H} - c_{in}$ for the fully connected algorithm, obtained as OOM/In-Memory ($c_{out} = 1$, $W = w_{filter}^2 = 4$, $number\_of\_fc\_parameters = 1000$, $n_{iter} = 250$). In this case, the energy ratio increases considerably, since the fully connected algorithm is far more efficient in the In-Memory case than in the OOM one. Depending on the algorithm type, the performance can be better or worse: a higher number of fully connected layers with a high value of $n_{iter}$ implies an In-Memory architecture that is more efficient than its OOM counterpart.

It is possible to delineate the behavior of the performance in both cases, by considering a mean of the obtained values of energy, power, area, delay and timing:

![Performance ratios diagram](image)

**Figure 6.22**: Mean Delay, Power, Area, Timing and Energy ratios, obtained as $\frac{OOM}{In-Memory}$. If the ratio value is higher than 1, it means that the In-Memory architecture obtained a better result. As expected, In-Memory alternative is more efficient in terms of Energy/Delay than OOM counterpart.

**Figure 6.22** gives an important confirmation of the advantages of an In-Memory design with respect to a classical Von Neumann based one: placing very simple elements (such as XNOR gates and full-adders) near the memory makes it possible to reduce fetching latency, energy consumption and computational delay.

6.5 **State-of-the-art comparisons**

⚠️ **ATTENTION** ⚠️

The following performance comparisons are based on the assumptions made in chapter 2, in which a linear dependency between the evaluated parameter and the network’s complexity is used. The correctness of the obtained values is not guaranteed.
6.5.1 Number of neurons

The examined architecture is the original one, which implements the model depicted in Figure 4.1. The total number of layers is 3 while the # of neurons can be computed as:

\[ \#Neurons_{O.W.} = 14 \times 14 + 13 \times 13 \times 6 + 10 = 1220 \quad (6.13) \]

Where the acronym O.W. stands for Our Work.

6.5.2 Results

The following part reports the results in terms of energy consumption, latency and area, rescaled by the number of neurons, as already done in chapter 2:

Figure 6.23: Energy comparison: the higher is better. MLC-STT [15], SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], Stochastic [28], CPU-CLU [29].

The energy values obtained for the two architectures are given by:

\[
\text{NormEnergy}_{\text{In-Memory}} = \frac{0.79 \mu J}{1220} \approx 647.5 \text{pJ}
\]

\[
\text{NormEnergy}_{\text{OOM}} = \frac{1.33 \mu J}{1220} \approx 1.09 \text{nJ}
\]

The resulting performance is very good, especially for the In-Memory case. Binarizing the network and transforming the MAC operations into an XNOR and pop-counting sequence decreases the energy required. By designing a custom memory cell, the energy consumed can be further reduced, because the inefficiency coming from the usage of a flip-flop in the model would be eliminated.

Figure 6.24: Delay comparison: the higher is better. MLC-STT [15], SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19], HMC [29], Chain-NN [30], Energy-efficient [31].
Also the delay values obtained are very good compared to the other implementations. They are obtained as:

\[
\text{Delay}_{\text{normalized(O.W. (In-Memory))}} = \frac{0.061ms}{3 \times 1220} \approx 16.7\text{ns}
\]

\[
\text{Delay}_{\text{normalized(O.W. (OOM))}} = \frac{0.15ms}{3 \times 1220} \approx 40.98\text{ns}
\]

The parallelization of the XNOR and pop-counting procedure makes it possible to reach very high efficiency in terms of latency, obtaining results that are comparable with RRAM implementations which, by their nature, are very fast.

Figure 6.25: Area comparison: the higher is better. SOT [16], OPNE-IPNE [40], Neurosynaptic core [26], XNOR-RRAM [19] (MLP), Stochastic [28], HMC [29], Energy-efficient [31]

\[
\text{Area}_{\text{normalized(O.W. (In-Memory))}} = \frac{0.0923\,\text{mm}^2}{1220} \approx 75.6 \times 10^{-6}\,\text{mm}^2
\]

\[
\text{Area}_{\text{normalized(O.W. (OOM))}} = \frac{0.0564\,\text{mm}^2}{1220} \approx 46.2 \times 10^{-6}\,\text{mm}^2
\]

In this last case, the area of the O.W. In-Memory architecture reaches a value which is comparable with the SOT In-Memory architecture [16] and the Stochastic [28] case (in fact, the latter has a similar computational complexity, since the multiplication is performed by an AND gate and the sum by a multiplexer). The O.W. OOM case reaches the best area, because of its simplicity and serialization.
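
The normalization used in this section can be summarized in a few lines. The following is a minimal sketch, assuming that energy and area are divided by the number of neurons and that delay is divided by the number of layers times the number of neurons, as in the expressions above; the function names are illustrative.

```python
N_NEURONS = 14 * 14 + 13 * 13 * 6 + 10  # Eq. 6.13
N_LAYERS = 3


def per_neuron(value):
    """Normalize an energy or area figure by the number of neurons."""
    return value / N_NEURONS


def delay_per_layer_and_neuron(delay_s):
    """Normalize a delay (in seconds) by layers x neurons."""
    return delay_s / (N_LAYERS * N_NEURONS)


energy_in_memory_j = per_neuron(0.79e-6)             # about 647.5 pJ
energy_oom_j = per_neuron(1.33e-6)                   # about 1.09 nJ
delay_in_memory_s = delay_per_layer_and_neuron(0.061e-3)  # about 16.7 ns
delay_oom_s = delay_per_layer_and_neuron(0.15e-3)         # about 40.98 ns
area_in_memory_mm2 = per_neuron(0.0923)              # about 75.6e-6 mm^2
area_oom_mm2 = per_neuron(0.0564)                    # about 46.2e-6 mm^2

print(N_NEURONS, energy_in_memory_j, delay_in_memory_s, area_in_memory_mm2)
```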
Chapter 7

Conclusions and future work

As already discussed, the In-Memory architecture mitigates the Von Neumann bottleneck. In general, increasing the size of the neural network by choosing a deeper model increases the dimensions of the circuit and, consequently, the power consumption and area of both solutions. The In-Memory architecture has a big advantage with respect to its OOM counterpart: the estimations in the Synthesis & Place&Route chapter represent worst-case values, since the XNOR unit and the pop-counting can be realized inside a memory array, without employing discrete gates and flip-flops, which are far more complicated than a custom memory cell. The power reports show that, for higher dimensionality, the most important power contribution is given by the registers: designing a custom memory cell makes it possible to reduce this drawback.

7.1 Future work

**New pop-counting design** Regarding the pop-counting unit, it is possible to optimize the design for the In-Memory part considering the following equations:

\[
\text{Pop-Counting} = \#1s - \#0s
\]

\[
\text{Pop-Counting} = \#1s - (\text{length(Word)} - \#1s) \quad (7.1)
\]

\[
\text{Pop-Counting} = 2 \times \#1s - \text{length(Word)}
\]

It is therefore sufficient to count the number of ones inside the word, which can be done by a chain of half-adders instead of full-adders. A circuit that can perform this operation is the following:

![Modified pop-counting circuit for the In-Memory architecture.](image)

The number of logic gates is reduced by a factor of 5/2, since a full-adder (FA) contains 5 logic gates and a half-adder (HA) only 2.
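
The equivalence stated by Equation 7.1 can be verified in software. The sketch below is a behavioral check only, not a model of the half-adder circuit; the function names and the 8-bit word length are arbitrary choices made for illustration.

```python
from itertools import product


def pop_counting_reference(word):
    """Pop-counting as defined in the text: number of ones minus number of zeros."""
    ones = sum(word)
    zeros = len(word) - ones
    return ones - zeros


def pop_counting_ones_only(word):
    """Same quantity from the count of ones alone: 2 * #1s - length(Word)."""
    return 2 * sum(word) - len(word)


# Exhaustive check over all words of a given length (8 bits is an arbitrary choice)
WORD_LENGTH = 8
all_match = all(
    pop_counting_reference(w) == pop_counting_ones_only(w)
    for w in product((0, 1), repeat=WORD_LENGTH)
)
print(all_match)
```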

**Beyond-CMOS technology** By employing Beyond-CMOS technologies, it is possible to further improve the performance: resistive technologies, such as MTJ and RRAM, can be used to realize the XNOR unit and the pop-counting part.
Bibliography




