# POLITECNICO DI TORINO

Master's Degree in Computer Engineering



Master's Degree Thesis

# A FPGA-based tensor accelerator for Machine Learning

Supervisors

Prof. Paolo BERNARDI

Prof. Pedro P.M. TRANCOSO

Candidate

Francesco ANGIONE

A.Y. 2019/2020

### Abstract

Neural Network inference consists largely of multiplications and additions, the basic operations of tensor convolution, and data, especially weight tensors, are reused across several executions. These operations can be executed on a CPU but, as is well known, they are independent of each other and can therefore be executed in parallel on parallel architectures such as GPUs or domain-specific hardware platforms. In the following pages, the state of the art in accelerating Neural Network inference is explored, from the newest GPGPU architecture proposed by NVIDIA to the domain-specific accelerators from Google, NVIDIA, and Habana.

Building on this survey of the state of the art, a hardware accelerator capable of executing tensor convolution, the compute- and memory-intensive operation of a Neural Network, is designed from scratch. It is also designed to accommodate the different data types required by Neural Network models, ranging from 8-, 16-, 32-, and 64-bit integers to 32-bit floating point and 16-bit brain floating point. The work starts with the development of the hardware system, continues with the development of a software library capable of using the underlying hardware, and ends with integration into a popular Machine Learning framework, TensorFlow.

The work is carried out on configurable hardware, an FPGA, which makes it possible to explore different design points, in terms of latency and number of processing elements, for different Neural Network models and data types. Moreover, the impact of integrating the accelerator into the Neural Network model is measured and compared across different platforms. Energy consumption is also estimated for the case of deployment on mobile devices.

Keywords: Computer, science, computer science, engineering, hardware, accelerator, machine learning.

# Acknowledgements

It is always hard to write this part of a work. I would say it is the hardest part, more than the technical one.

However, let me try to address it anyway. I apologize in advance if I forget something.

This work is the sum of five years of experience, both technical and non-technical, and it was developed during a terrible event, a pandemic, which literally stopped the entire world and caused death, issues, and debates. However, as the human race has always been, we are resilient, and we tried as much as we could not to let the world stop, especially thanks to technology, the Internet, and all the related services. We are only human, but we can do whatever we can imagine, especially in Computer Science.

First, I would like to thank my family for all their support and presence, even when I was going against the current in my life. I would like to thank both my supervisors, Prof. Paolo Bernardi and Prof. Pedro Petersen Moura Trancoso, for believing in me without any guarantees about the final work, and for their support throughout this journey.

To the day in which I learned how to read, an important pillar of my life.

To the people who have contributed, for better or for worse, to making me the person I am today.

To my past and future failures, on which I have built and will build myself.

To my feelings, which remind us how fragile we are but, at the same time, that we are human beings, and that we gather our strength from them. Sapere aude.

Francesco Angione, Gothenburg, October 2020

# Contents

| Section | Title                                    |
|---------|------------------------------------------|
|         | List of Figures                          |
|         | List of Tables                           |
| 1       | Introduction                             |
| 2       | Background                               |
| 2.1     | Overview                                 |
| 2.2     | Machine Learning                         |
| 2.2.1   | Brain Inspired                           |
| 2.2.1.1 | Neural Networks                          |
| 2.2.1.2 | Spiking Neural Networks                  |
| 2.3     | Machine Learning Quantization            |
| 2.4     | Applications                             |
| 3       | State-of-the-Art                         |
| 3.1     | Overview                                 |
| 3.2     | GPU                                      |
| 3.2.1   | Nvidia Ampere A100 Tensor Core GPU       |
| 3.3     | Domain Specific Hardware Platform        |
| 3.3.1   | NVDLA                                    |
| 3.3.1.1 | NVDLA Software                           |
| 3.3.2   | Google TPU                               |
| 3.3.3   | Habana Goya HL-1000                      |
| 4       | System Development                       |
| 4.1     | Overview                                 |
| 4.2     | Software                                 |
| 4.3     | System Level                             |
| 4.4     | DTPU, the hardware accelerator           |
| 4.4.1   | Real Implementation                      |
| 4.4.2   | High Level State Machine of Control Unit |
| 4.4.3   | Datapath                                 |
| 4.4.3.1 | Filter&Select and Compact&Select         |
| 4.4.3.2 | Matrix Multiplication Unit               |
| 5       | Results                                  |
| 5.1     | Evaluation metrics                       |
| 5.2     | Utilization Factor                       |
| 5.3     | Energy and Power Consumption             |
| 5.4     | Throughput                               |
| 5.5     | Latency                                  |
| 5.6     | Accuracy                                 |
| 6       | Conclusion                               |
| 6.1     | Discussion                               |
| 6.2     | Future Works                             |
|         | Bibliography                             |
| A       | Accelerator library                      |
| B       | Top level entity of DTPU core            |
| C       | Results for different frequencies        |

# List of Figures

| Figure                                                              |
|---------------------------------------------------------------------|
| Classification of AI with emphasis on machine learning and its sub- |
| Example of a Neural Network                                         |
| Approximation of floating-point values to integer values            |
| Streaming Multiprocessor Architecture [21]                          |
| Matrix Multiplication in Tensor Core [21]                           |
| Matrix Multiply Accumulate [21]                                     |
| NVDLA Software stack [24]                                           |
| Google TPU architecture [1]                                         |
| Google TPU Software Stack [25]                                      |
| Habana Goya Software Stack [4]                                      |
| Average execution time divided by type of operations                |
| Zynq 7000 SoC [32]                                                  |
| Logical view of DTPU accelerator                                    |
| SMAC and SMUL details                                               | 39                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                                                     | classification A parallelism between a human-brain neuron and a neuron in a Brain Inspired Network Example of a Neural Network Approximation of floating-point values to integer values  Streaming Multiprocessor Architecture [21] Matrix Multiplication in Tensor Core [21] Mixed Precision Schema of a FMA unit in Tensor Core Unit [21] Sparsity Optmization of a weight tensor [21] Matrix Multiply Accumulate [21] Software stack [21] Comparsion of two possible NVDLA system [22] Internal architecture of NVDLA small system, Secondary DBB not considered [22] NVDLA Software stack[24] Google TPU architecture[1] Google TPU Software Stack [25] High level view of Goya architecture [4] Habana Goya Software Stack [4] |

| Figure | Caption |
|--------|---------|
| 4.15 | DSP Slice Functionality [38] |
| 5.1  | Post Implementation Utilization Factor of integer 8 bit PEs and clock frequency of 30 MHz |
| 5.2  | Post Implementation Utilization Factor of integer 16 bit PEs and clock frequency of 30 MHz |
| 5.3  | Post Implementation Utilization Factor of integer 32 bit PEs and clock frequency of 30 MHz |
| 5.4  | Post Implementation Utilization Factor of integer 64 bit PEs and clock frequency of 30 MHz |
| 5.5  | Post Implementation Utilization Factor of bfp 16 bit PEs and clock frequency of 30 MHz |
| 5.6  | Post Implementation Utilization Factor of fp 32 bit PEs and clock frequency of 30 MHz |
| 5.7  | Post Implementation Power Consumption of Processing System for integer 8 PEs |
| 5.8  | Post Implementation Static Power Consumption Programmable logic for integer 8 PEs |
| 5.9  | Post Implementation Dynamic Power Consumption per Programmable logic with integer 8 PEs |
| 5.10 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 8 PEs |
| 5.11 | Post Implementation Power Consumption of Processing System for integer 16 PEs |
| 5.12 | Post Implementation Static Power Consumption Programmable logic for integer 16 PEs |
| 5.13 | Post Implementation Dynamic Power Consumption per Programmable logic with integer 16 PEs |
| 5.14 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 16 PEs |
| 5.15 | Post Implementation Power Consumption of Processing System for integer 32 PEs |
| 5.16 | Post Implementation Static Power Consumption Programmable logic for integer 32 PEs |
| 5.17 | Post Implementation Dynamic Power Consumption per Programmable logic with integer 32 PEs |
| 5.18 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 32 PEs |
| 5.19 | Post Implementation Power Consumption of Processing System for integer 64 PEs |
| 5.20 | Post Implementation Static Power Consumption Programmable logic for integer 64 PEs |
| 5.21 | Post Implementation Dynamic Power Consumption per Programmable logic with integer 64 PEs |
| 5.22 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 64 PEs |
| 5.23 | Post Implementation Power Consumption for bfp16 PEs |
| 5.24 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and bfp16 PEs |
| 5.25 | Post Implementation Power Consumption for fp32 PEs |
| 5.26 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and fp32 PEs |
| 5.27 | Comparison of Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and a MXU 3x3 |
| 5.28 | Roofline model of the accelerator with a MXU size of 8x8 |
| 5.29 | Roofline model of the accelerator with a MXU size of 8x8 and vectorized PEs |
| 5.30 | Total Execution time of Invoke method (left) in the configuration with accelerator and MNIST model |
| 5.31 | Total Execution time of Invoke method (left) in the configuration with accelerator and Cifar10 model |
| C.1  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 50 MHz and integer 8 PEs |
| C.2  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 80 MHz and integer 8 PEs |
| C.3  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 100 MHz and integer 8 PEs |
| C.4  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 120 MHz and integer 8 PEs |
| C.5  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 50 MHz and integer 16 PEs |
| C.6  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 80 MHz and integer 16 PEs |
| C.7  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 100 MHz and integer 16 PEs |
| C.8  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 50 MHz and integer 32 PEs |
| C.9  | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 80 MHz and integer 32 PEs |
| C.10 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 100 MHz and integer 32 PEs |
| C.11 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 50 MHz and integer 64 PEs |
| C.12 | Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 60 MHz and integer 64 PEs |

# List of Tables

| Table | Caption |
|-------|---------|
| 5.1 | Execution Time for different platform and model, integer 8 |
| 5.2 | Execution Time for different platform and model, integer 16 |
| 5.3 | Execution Time for different platform and model, integer 32 |
| 5.4 | Accuracy Output<sup>1</sup> with Convolution on integer 8 |
| 5.5 | Accuracy Output<sup>1</sup> with Convolution on integer 16 |
| 5.6 | Accuracy Output<sup>1</sup> with Convolution on integer 32 |

# 1

# Introduction

Machine learning is one of today's most prominent technologies, as it is used to solve complex problems that would otherwise be very hard or costly to solve with traditional methods. Speech and image recognition, as well as many other complex decision-making problems such as self-driving vehicles, are successfully addressed with machine learning and deep learning.

In recent years, the number of published papers on machine learning has grown exponentially, and the success of machine learning has been driven by the availability of hardware able to meet its demands in terms of storage and compute capacity. As problems scale, however, so do the demands, and thus companies have started to develop, deploy and sell their own hardware platforms, such as the Tensor Processing Unit [1] from Google, NVDLA [2] from Nvidia, and Gaudi [3] and Goya [4], for training and inference respectively, from Habana (acquired by Intel).

Commodity hardware is not the most efficient way to execute machine learning, so researchers are looking at flexible hardware solutions [5] [6] that can satisfy the demands of different machine learning models at lower cost and energy consumption, so that they can also be deployed on mobile devices. Moreover, during the inference process a model does not need high-precision computations [7] [8] to achieve high accuracy in its outputs. Hardware accelerators, if designed correctly, are known to deliver significant improvements in terms of both latency and energy efficiency [9]. Thus, obtaining the best solution across every metric requires a hardware-software co-design, which in turn requires the hardware designer to have a basic knowledge of machine learning algorithms.

Machine learning includes two processes, training and inference. The training process is carried out away from the field, on powerful machines, exploiting different algorithms for optimizing the models in terms of memory footprint and data type, and feedback mechanisms for fine-tuning the weight values. The inference process, on the other hand, is the execution of the trained model: the inputs are applied and the correct outputs are expected. It is done in the field, for example on a mobile device, which is area and energy constrained. The inference process consists mostly of multiplications and additions, and on a conventional CPU-based system they are executed sequentially, increasing the latency of the model and the energy consumption due to data movement.

Thus, the goal of this work is to develop from scratch a hardware accelerator that implements a tensor-based convolution. Exploiting a non-von Neumann architecture, together with data locality and reuse of the weights, reduces the CPU workload and boosts the model's performance. The use of different arithmetic data types can drastically reduce the computational cost without reducing the final accuracy of the neural network [7] [10]. From a hardware perspective, the use of different arithmetic precisions [11], such as integer operations instead of floating-point operations, can lead to benefits in terms of area, energy consumption and latency.
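To make the multiply-and-add structure of a convolution concrete, the following is a minimal software sketch, not the accelerator's implementation. The function name `conv2d_valid`, the example values, and the choice of no padding and unit stride are assumptions made only for illustration. It counts the multiply-accumulate (MAC) operations that a sequential CPU would execute one after another, and shows that the same few kernel weights are reused at every output position.

```python
def conv2d_valid(image, kernel):
    """Direct 2D convolution as used in ML frameworks (cross-correlation),
    with no padding and stride 1. Returns the output and the MAC count."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    macs = 0
    for r in range(oh):
        for c in range(ow):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # One multiply-accumulate; kernel[i][j] is reused
                    # for every output position (r, c).
                    acc += image[r + i][c + j] * kernel[i][j]
                    macs += 1
            out[r][c] = acc
    return out, macs


# Hypothetical 4x4 input and 2x2 kernel.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0],
          [0, -1]]
out, macs = conv2d_valid(image, kernel)
print(out)   # 3x3 output
print(macs)  # 36 MACs: 9 output positions times 4 kernel weights
```

Every output element is independent of the others, which is why the inner MACs can be distributed over parallel processing elements.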

In order to explore different design points of the accelerator, in terms of size and latency, the work is deployed on an FPGA and integrated into a common ML framework, TensorFlow. Accuracy of operations, reliability, performance and energy efficiency are evaluated and compared with the same models executed on a GPU.

# 2

# Background

Can machines think?

— Alan Turing, Computing Machinery and Intelligence

## 2.1 Overview

In the past decade many companies have started to advertise the use of AI (Artificial Intelligence) in their products and software applications, even when they are using only a subfield of AI. Despite its recent growth, the AI field is not a recent invention: one of its roots is a theoretical paper by Alan Turing, published in the journal Mind in 1950 [12].

AI is generally defined as intelligence demonstrated by machines, where a machine is any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals [13]. In general, the term "artificial intelligence" is used when machines mimic the cognitive functions of the human mind, i.e. learning and problem solving.

Under this definition, AI is too vast to be studied and simulated as a whole [13]. Therefore, it has been divided into subfields, characterized by different traits, such as knowledge representation, planning, learning, natural language processing, perception, motion and manipulation, social intelligence and general intelligence.

AI can be seen as a general-purpose technology: it does not excel at a specific task, and the tasks themselves are not even well characterized.



Figure 2.1: Classification of AI with emphasis on machine learning and its subclassification

## 2.2 Machine Learning

A particularly interesting subcategory of AI in computer science is machine learning. It is the study of algorithms that carry out a specific task (image recognition, computer vision, etc.) without explicitly programming the machine, relying instead on patterns and inference in order to make decisions. This approach can solve problems that are tricky or infeasible for conventional algorithms.

A peculiarity of a machine learning model is that it involves two processes, training and inference. Inference is the process in which a conclusion is given at the end of an evaluation, i.e. the input stimuli are applied to the model and the output is observed. The training process has to be completed before the model is put in the field, that is before the inference process, otherwise the latter can give wrong results. As the name suggests, in this process the model learns how to behave, adjusting the weights according to the applied inputs and the expected outputs.

Besides this type of training, and according to [13], others exist, characterized by approach, type of data and task:

- Supervised learning builds a mathematical model of a set of data that contains both the inputs and the desired outputs.
- Unsupervised learning takes a set of data that contains only inputs and finds structure in the data.
- Semi-supervised learning falls between unsupervised learning and supervised learning.
- Reinforcement learning concerns how software agents should take actions in order to maximize some notion of cumulative reward.
- Self learning is a type of learning with no external rewards and no external teacher advice.
- Feature learning, also called representation learning, often attempts to transform the data while preserving the information it contains. It is used as a preprocessing step before any classification or prediction.
- Sparse dictionary learning is a feature learning method in which a training example is represented as a linear combination of basis functions and is assumed to be a sparse matrix.
- Anomaly detection, also known as outlier detection, identifies rare items, events or observations that are significantly different from the majority of the data.
- Association rule learning is a rule-based method for discovering relationships between variables in large databases.

The machine learning space is also divided into other types of models, such as decision trees, support vector machines, regression analysis, Bayesian networks and genetic algorithms. As can be seen in Figure 2.1, brain-inspired machine learning is also divided into subcategories.

### 2.2.1 Brain Inspired

Brain-inspired networks are based on algorithms that take their basic functionalities from our understanding of how the brain operates, and try to mimic them.



Figure 2.2: A parallelism between a human-brain neuron and a neuron in a Brain Inspired Network<sup>1</sup>

In the human brain, the basic computational unit is the neuron. A neuron receives input signals from its dendrites and produces an output signal along its axon, which interacts with other neurons via synaptic weights. The synaptic weights are obtained through a learning process, which may or may not strengthen them.

#### 2.2.1.1 Neural Networks

Neural networks (or artificial neural networks) can be represented as graphs in which every node is interconnected to others using edges, which have a weight properly tuned during the training process.

As mentioned before, every node of a neural network is called an artificial neuron (a loose model of its biological counterpart), and the connections (synapses in the biological brain) transmit information from one neuron to another. As in Figure 2.2, a neuron receives signals, processes them internally, and then propagates the result to the neurons connected to it. The information exchanged between one neuron and another is a real number, the result of a non-linear function applied to the weighted sum of all its inputs.
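The computation performed by a single artificial neuron can be sketched in a few lines. This is only an illustration: the function names, the optional bias term and the choice of a sigmoid as the non-linear function are assumptions, since the text does not prescribe a specific activation.

```python
import math


def sigmoid(z):
    """One possible non-linear activation (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Non-linear function applied to the weighted sum of the inputs."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# Hypothetical inputs and edge weights: 1.0*0.5 + 2.0*(-0.25) + (-1.0)*0.0 = 0
y = neuron_output([1.0, 2.0, -1.0], [0.5, -0.25, 0.0])
print(y)  # sigmoid(0) = 0.5
```

The weighted sum is exactly the multiply-and-add pattern that dominates inference, while the activation is applied once per neuron.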

Figure 2.3 shows an example of a neural network.



Figure 2.3: Example of a Neural Network

As can be seen in Figure 2.3, a neural network is always divided into layers, of which only the input and output layers are visible from the external world; as a consequence, the internal layers are called hidden layers. When an input vector is applied, it propagates from the left side of the network to its right side through the layers and the neurons that compose each layer. It is worth mentioning that layers may perform different kinds of computation on their inputs. Moreover, *deep* neural networks are named after their large number of hidden layers.
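The layer-by-layer propagation can be sketched as follows, assuming fully connected layers. The helper names (`dense_layer`, `forward`), the weights and the activations are hypothetical and chosen only so that the result is easy to check by hand.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense_layer(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the input weights of one neuron."""
    return [activation(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Propagate an input vector through the layers, from input to output."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 2.0, 3.0]], [0.5], identity),
]
output = forward([2.0, 1.0], layers)
print(output)  # hidden layer gives [1.0, 1.5, 0.0], output layer gives [4.5]
```

Each layer may use a different activation, which reflects the remark that layers can perform different kinds of computation on their inputs.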

In the early development of artificial neural networks (ANNs) the goal was to solve problems in the same way a human brain would. However, over time the aim moved to performing specific tasks, leading to architectures that differ from both the biological brain and brain-inspired networks (Spiking Neural Networks).

<sup>1</sup>Figures under CC license

Depending on how the edges are connected and on the topology, an artificial neural network can be classified into several sub-types:

- Feed forward: the data move only from the input layer to the output layer, without cycles in the graph.
- Regulatory feedback: it provides feedback connections back to the same inputs that activate them, reducing requirements during learning. It also makes learning and updating much easier.
- Recurrent neural network: it propagates data both forward and backward, from later processing stages to earlier stages.
- Modular: several small networks cooperate or compete to solve problems.
- Physical: it is based on electrically adjustable resistance material to simulate artificial synapses.

#### 2.2.1.2 Spiking Neural Networks

Spiking neural networks (SNNs) are artificial neural networks that more closely mimic natural neural networks [14].

In addition to neuronal and synaptic state, SNNs add the concept of time to their operational model. The idea is that neurons in an SNN do not activate at each propagation cycle, but only when their activation level reaches a specific value. The current activation level is modeled as a differential equation and is normally considered to be the neuron's state.
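As an illustration of a neuron whose state follows a differential equation and which fires only at a threshold, the sketch below integrates a leaky integrate-and-fire model with the explicit Euler method. The choice of this particular model, the equation dv/dt = -leak * v + I(t), and all parameter values are assumptions; the text does not commit to a specific neuron model.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.1, dt=1.0, v_reset=0.0):
    """Euler integration of dv/dt = -leak * v + I(t).
    The neuron emits a spike (1) when v reaches the threshold, then resets."""
    v = v_reset
    spikes = []
    for current in input_current:
        v += dt * (-leak * v + current)
        if v >= threshold:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes


# Hypothetical constant input over 10 time steps.
spikes = simulate_lif([0.3] * 10)
print(spikes)  # the neuron fires on every fourth step
```

With this input the state grows as 0.3, 0.57, 0.813 and crosses the threshold on the fourth step, so the neuron stays silent for most propagation cycles.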

In principle, SNNs can be applied to the same applications as artificial neural networks. Moreover, SNNs can model the brain of biological organisms without prior knowledge of the environment. Thus, SNNs have been useful in neuroscience for evaluating the reliability of hypotheses on biological neural circuits, but not in engineering.

SNNs still lag behind ANNs in terms of accuracy, but the gap is decreasing and has vanished for some tasks [15]. However, computer architectures based on SNNs have a huge energy footprint compared to other types of architecture [16].

## 2.3 Machine Learning Quantization

A reduction in the computational demand and memory footprint of machine learning algorithms, together with an increase in their power efficiency, can be achieved through quantization, a set of techniques for converting, and mapping, input values from a large set to output values in a smaller set.

The idea of quantization is not recent; it has been around since the birth of digital electronics. Imagine taking a picture with a phone's camera: the real world is analog, and the camera captures it and converts it into a digital format. Despite the high quality of today's pictures, quantization is not lossless, since it is practically impossible to fully represent the analog world in the digital domain.

A trivial quantization example for a neural network model is given in Figure 2.4, where a potentially infinite set of values (floating-point) is mapped to a finite set of values (integer).



Figure 2.4: Approximation of floating-point values to integer values

It has been shown that even if a model is quantized, for example from fp32 to integer32, its accuracy remains good, and the accuracy drop between the two data representations is negligible [7].

Several quantization techniques can be applied, together or separately, to already trained ML models (post-training quantization):

- Linear quantization: data are directly scaled by taking their maximum value and normalizing them so that they fall in the desired range.
- Outlier channel splitting (OCS) [17]: linear quantization is sensitive to large inputs. The idea of OCS is to reduce the value of outliers (for both weights and activations) by duplicating the node and halving its output or its weights. This transformation leaves the functionality of the node unchanged while narrowing the weight/activation distribution, which allows a better linear quantization.
- Analytical Clipping for Integer Quantization [18]: it represents the state of the art for post-training quantization techniques. It consists in applying a clipping function over a given range in order to reduce the quantization noise.
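The first technique in the list, and the reason why outliers hurt it, can be sketched as follows. This is a minimal symmetric linear quantizer written for illustration: the function names, the symmetric signed range and the example values are assumptions, and the clipping threshold is simply picked by hand, whereas [18] derives it analytically.

```python
def linear_quantize(values, num_bits=8, clip=None):
    """Symmetric linear quantization to signed integers.
    The scale comes from the maximum absolute value, or from `clip` if given;
    values beyond the range saturate at the largest representable integer."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values) if clip is None else clip
    if max_abs == 0:
        return [0] * len(values), 1.0
    scale = max_abs / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate real values."""
    return [q * scale for q in quantized]


# Hypothetical weights without outliers: the int8 range is used fully.
weights = [0.02, -0.5, 0.31, 1.27, -1.27]
q, scale = linear_quantize(weights)
print(q, scale)

# A single outlier stretches the scale and the small weight collapses to 0.
q_outlier, _ = linear_quantize([0.02, 10.0])
# Clipping the range keeps resolution for the small weight; the outlier saturates.
q_clipped, _ = linear_quantize([0.02, 10.0], clip=1.27)
print(q_outlier, q_clipped)
```

The last two calls show the trade-off that both OCS and clipping try to manage: a narrower range gives finer resolution to the bulk of the values at the cost of distorting the outliers.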

Alternatively, quantization-aware training can also be performed [19].

Quantization relaxes the requirements on the hardware: floating-point operations are well known to be much more expensive than integer operations from many perspectives, and replacing them consequently reduces the power consumption of the algorithm. It is also important to mention that the data traffic between the memory and the hardware is reduced, because the data are more compact.
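The reduction in data traffic follows directly from the element sizes. The short calculation below is a sketch under stated assumptions: the layer shape (a 3x3 convolution with 64 input and 64 output channels) is hypothetical, and only the storage sizes of the data types are taken as given.

```python
# Storage size in bytes of one element for each data type.
BYTES_PER_ELEMENT = {"fp32": 4, "bfp16": 2, "int8": 1}


def tensor_bytes(num_elements, dtype):
    """Bytes that must be moved to transfer the whole tensor once."""
    return num_elements * BYTES_PER_ELEMENT[dtype]


# Hypothetical weight tensor: 3x3 kernels, 64 input and 64 output channels.
num_weights = 3 * 3 * 64 * 64
fp32_bytes = tensor_bytes(num_weights, "fp32")
int8_bytes = tensor_bytes(num_weights, "int8")
print(fp32_bytes, int8_bytes, fp32_bytes // int8_bytes)
```

For this layer the weights shrink from 147456 bytes to 36864 bytes, a factor of four less traffic for every transfer of the tensor.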

Nowadays edge devices, including GPUs, take advantage of lower-precision and quantized operations. Thus, quantization of machine learning algorithms is a de facto standard for edge inference.

## 2.4 Applications

In principle, AI can be applied to any intellectual task [13]. Machine learning applications, in particular, span a variety of different domains:

- Healthcare, where it is mainly used for classification purposes.
- Automotive, where it is used in self-driving cars.
- Finance and economics, to detect charges or claims outside the norm and flag them for human investigation. Banks use it for organizing operations, maintaining book-keeping, investing in stocks and managing properties.
- Cybersecurity, to automatically sort the data in networks into high-risk and low-risk information.
- Government, where, paired with facial recognition systems, it may be used for mass surveillance.
- Video games, in which it is routinely used to generate dynamic, purposeful behavior in non-player characters.
- Military, to enhance communications, sensors, integration and interoperability.
- Hospitality, to reduce staff load and increase efficiency.
- Advertising, where it is used to predict the behavior of customers from their digital footprint in order to target them with personalized promotions.
- Art, where it has inspired numerous creative applications, including its use to produce visual art.

However, all machine learning applications are characterized by the need for a huge data set for the training process.

# 3

# State-of-the-Art

## 3.1 Overview

The role of machine learning has grown continuously in the past few years, and a lot of effort has gone into developing good software APIs in order to address different needs and domains.

In principle, all machine learning algorithms can be run on the CPU, which already runs the OS and other application software. This leads to overheads, especially in terms of memory accesses, which are expensive in terms of energy and latency.

Analyzing machine learning algorithms, it becomes evident that they massively repeat the same operations and access data according to regular patterns. Thus, with the advent of the General Purpose GPU programming paradigm, implementing those algorithms on a GPU, which matches their requirements regarding massive operations and data reuse, has given many advantages in terms of latency and energy efficiency. However, the capability of GPUs to run machine learning algorithms has been pushed almost to its maximum by the increasing computational demands of modern neural networks. Therefore other solutions have been explored, such as the development of specific hardware platforms.

## 3.2 GPU

Moore's law is reaching its end from the point of view of CPUs. However, it seems that GPUs can still carry Moore's law forward [20]. For this reason, increasing efforts, especially from companies, have been made to develop GPUs with ever higher performance per watt.

As already mentioned, with the advent of the general purpose GPU programming paradigm, more and more machine learning algorithms have been designed to run on the GPU, reaping the benefits of that type of architecture.

As a consequence, companies such as Nvidia have started to develop GPUs for boosting the performance of machine learning applications.

### 3.2.1 Nvidia Ampere A100 Tensor Core GPU

The Nvidia Ampere A100 Tensor Core GPU has been announced recently and it is one of the most performant GPUs. Its new generation of Tensor Core Units allows massive increases in throughput and efficiency. It is able to deliver up to 624 TFLOPS<sup>1</sup> for training and inference machine learning applications.

The GPU is composed of multiple GPU processing clusters (GPCs), texture processing clusters (TPCs) and streaming multiprocessors (SMs). The core of the GPU is the Streaming Multiprocessor, which builds on the SMs of the Volta and Turing GPUs.



Figure 3.1: Streaming Multiprocessor Architecture [21]

The SM is composed of integer, FP32 and FP64 units, together with Tensor Core Units designed specifically for deep learning. It also introduces new data types for the computation in the tensor core, such as binary, integer 8 and 4 bits, floating-point 64, 32 and 16, and bfp16 (the throughput of the tensor core computation is the same for fp16 and bfp16). The Ampere SM can execute workloads that mix computation and addressing calculations so efficiently thanks to independent, parallel integer and floating-point data paths.

<sup>1</sup>Tera floating-point operations per second

Matrix-matrix multiplication operations are at the core of neural network training and inference, and are used to multiply large matrices of input data and weights in the connected layers of the network. The idea is illustrated in Figure 3.2 and compared to previous architectures.



Figure 3.2: Matrix Multiplication in Tensor Core [21]

The Ampere A100 GPU contains 108 Streaming Multiprocessors and 432 third-generation Tensor Cores. According to Figure 3.3, the Tensor Core Units multiply in FP16 and accumulate in FP32, leading to a further reduction of latency and energy consumption.



Figure 3.3: Mixed Precision Schema of a FMA unit in Tensor Core Unit [21]
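The reason for keeping the accumulator in FP32 can be illustrated with a small software model. The sketch below is not Nvidia's implementation: it only emulates FP16 and FP32 rounding with Python's `struct` module, and the helper names (`round_fp16`, `round_fp32`, `dot`) are hypothetical.

```python
import struct

def round_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """Dot product with FP16 operands and a configurable accumulator rounding."""
    acc = 0.0
    for x, y in zip(a, b):
        product = round_fp16(x) * round_fp16(y)  # FP16 inputs, product not rounded
        acc = round_acc(acc + product)           # accumulator rounded after each add
    return acc

ones = [1.0] * 4096
fp32_acc = dot(ones, ones, round_fp32)  # FP16 multiply, FP32 accumulate
fp16_acc = dot(ones, ones, round_fp16)  # FP16 multiply, FP16 accumulate
print(fp32_acc, fp16_acc)
```

In this toy case the FP32 accumulator reaches 4096, while the FP16 accumulator stops growing at 2048, because 2049 is not representable in half precision. Narrow multipliers combined with a wider accumulator therefore keep the cost of the multiplication low without losing the sum.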

A novel approach for doubling the throughput of deep neural networks has been introduced in this architecture. At the end of the training process, only a subset of the weights is necessary to execute a neural network correctly, so the remaining weights can be removed.

Based on training feedback, weights can be adapted at runtime during training without any impact on the final accuracy. Thus, thanks to the sparsity of the weight tensors, the inference process can be accelerated. The training process can also be accelerated by exploiting sparsity, but sparsity has to be introduced at the beginning of the process to achieve any benefit.



Figure 3.4: Sparsity Optimization of a weight tensor [21]

The approach in Figure 3.4 doubles the throughput by skipping the zeros. It also reduces the memory footprint and increases the effective memory bandwidth.

Following this idea, NVIDIA has introduced a new set of instructions for inference: sparse Matrix Multiply-Accumulate (MMA). These instructions skip the matrix entries that contain zero values, increasing the Tensor Core throughput. An example can be seen in Figure 3.5, where the light blue matrix has a sparsity of 50%. It is also important to mention that the non-zero entries of the light blue matrix are matched with the correct entries of the red one.



Figure 3.5: Matrix Multiply Accumulate [21]
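The idea of skipping zeros while still matching each non-zero weight with the right operand can be sketched in software. The following is only an illustrative model, not the actual sparse MMA instruction: the compressed format (a list of column index and value pairs per row) and the function names are assumptions made for the example.

```python
def compress_row(row):
    """Keep only the non-zero entries of a row, with their column indices."""
    return [(col, value) for col, value in enumerate(row) if value != 0]

def sparse_matmul(sparse_a, b):
    """Multiply a row-compressed matrix by a dense matrix, skipping zeros."""
    n_cols = len(b[0])
    result = []
    for row in sparse_a:
        out = [0] * n_cols
        for col, value in row:               # only non-zero weights are visited
            for j in range(n_cols):
                out[j] += value * b[col][j]  # the index selects the matching row of b
        result.append(out)
    return result

weights = [[1, 0, 0, 2],
           [0, 3, 4, 0]]                     # 50% of the entries are zero
activations = [[1, 2],
               [3, 4],
               [5, 6],
               [7, 8]]
compressed = [compress_row(r) for r in weights]
product = sparse_matmul(compressed, activations)
print(product)
```

Only half of the multiplications of the dense product are executed, and the stored column index plays the role of the matching logic between the two matrices.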


The software stack and the CUDA Toolkit include libraries that have been custom-tuned to provide high multi-GPU performance for each of the deep learning frameworks in Figure 3.6.



Figure 3.6: Software stack [21]

By combining powerful hardware with software tailored to deep learning, the platform provides developers and researchers with solutions for high-performance GPU-accelerated deep learning application development, testing, and network training.

# 3.3 Domain Specific Hardware Platform

Instead of making GPUs suitable for Machine Learning applications as well, some companies have designed and deployed special-purpose hardware accelerators.

## 3.3.1 NVDLA

The Nvidia Deep Learning Accelerator is a free, open source hardware platform from Nvidia, highly customizable and modular, which makes it possible to design and deploy deep learning inference hardware.

The architecture comes in two configurations:



Figure 3.7: Comparison of two possible NVDLA systems [22]

As already mentioned, the aim of the work is to develop a hardware accelerator for machine learning suitable for mobile devices. Therefore from now on the NVDLA small system will be considered and analyzed.

The internal architecture of the NVDLA small system is:



Figure 3.8: Internal architecture of NVDLA small system, Secondary DBB not considered [22]

According to Figure 3.7, in the Small configuration of the accelerator the processor is in charge of programming and scheduling the operations on the NVDLA and, as a consequence, handles the start/end of operations and possible interrupts, all of them through the CSB (Configuration Space Bus) interface, which is AXI Lite compliant [23].

Data movements to/from memory are handled by the internal memory controller through the DBB (Data BackBone) interface, which is AXI [23] compliant.

The internal architecture of NVDLA is composed of various engines. Each one of them is able to perform specific Machine Learning operations:

- Convolution Core: it is paired with the Convolution Buffer, its private memory for the data (inputs and weights). It is used to accelerate the convolution algorithms.
- Activation engine (Single Data point Operations): it performs post-processing operations at the single data element level, such as bias addition, non-linear functions, PReLU (Parametric Rectified Linear Unit) and format conversion when the software requires different precisions for different hardware layers.
- Pooling engine (Planar Data Operations): it is designed to execute pooling layers, i.e. it performs operations along the width and height plane.
- Local response normalization engine (Cross Channel Data operations): it is designed to address local response normalization layers.
- Reshape (Data memory and reshape operations): it transforms the data mapping format without any data calculation.
- Bridge DMA: it is in charge of copying data from the Main Memory to the SRAM of the accelerator, and is only available in the large configuration of the system.

It is also worth mentioning that the engines can either work separately on independent tasks or operate in a fused fashion, in which all of them are pipelined and work as a single entity.

According to the developers, the configurability of the cores ranges from the arithmetic precision to the theoretical throughput that a single unit can achieve (by increasing the number of internal Processing Elements). Moreover, since the engine units are independent of each other, those not needed by the application and the model used can be safely removed from the design.

#### 3.3.1.1 NVDLA Software

It is also worth mentioning that the accelerator already comes with a basic software stack:



Figure 3.9: NVDLA Software stack [24]

The compilation tools are in charge of converting an existing pre-trained model into a set of hardware layers (for the desired precision) and programming sequences suitable for the NVDLA. The output of this process is an Nvidia Loadable file suitable for the runtime environment.

The runtime environment has been designed for a system in which an OS is present. It is composed of two parts: the User Mode Driver (UMD) and the Kernel Mode Driver (KMD).

The User Mode Driver loads the loadable file in memory and submits the operations to the KMD. It is also in charge of data movement from/to the accelerator.

The KMD is in charge of submitting operations to the accelerator through low-level functions, scheduling the operations and handling the interrupts.

Both the KMD and the UMD are wrapped into portability layers which are, respectively, hardware dependent and OS dependent. In principle, to migrate the software to another OS or hardware platform it is enough to modify only the portability layers.

## 3.3.2 Google TPU

Google developed its own application-specific integrated circuit for neural networks, which is tightly integrated with the TensorFlow software. It includes:

- Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units for matrix operations
- Unified Buffer (UB): 24 MB of SRAM that works as registers
- Activation Unit (AU): Hardwired activation functions

In Figure 3.10 a general view of TPU architecture is presented.



Figure 3.10: Google TPU architecture [1]

Rather than being tightly integrated with a CPU, the TPU is designed as a coprocessor to which the instructions are sent by the host server, rather than fetched by the TPU itself.

The matrix multiplication unit reuses both inputs many times while producing the output, avoiding the overhead of continuously reading data from memory. Only spatially adjacent ALUs are connected together, which keeps the wires short and energy-efficient. The ALUs only perform computations in fixed patterns.

As far as the software stack is concerned, the TPU can be programmed for a wide variety of neural network models. To program it, API calls from the TensorFlow graph are converted into TPU instructions.



Figure 3.11: Google TPU Software Stack [25]

## 3.3.3 Habana Goya HL-1000

Habana's Goya is a processor dedicated to inference workloads. It is designed to deliver superior performance, power efficiency and cost savings for data centers and other emerging applications.

It allows the adoption of different deep learning models and is not limited to specific domains. Moreover, the performance requirements and accuracy can be user-defined.

In Figure 3.12 a high level view of the Goya architecture can be appreciated.



Figure 3.12: High level view of Goya architecture [4]

It is based on scalable, fully programmable Tensor Processing Cores, specifically designed for deep learning workloads.

It also provides other flexible features such as GEMM operation acceleration, dedicated hardware for special functions, tensor addressing, latency-hiding capabilities and support for different data types in the TPC (FP32, INT32, INT16, INT8, UINT32, UINT16, UINT8).

Regarding the software stack, it can be interfaced with all deep learning frameworks. However, a model first has to be converted into an internal representation, as can be seen in Figure 3.13.



Figure 3.13: Habana Goya Software Stack [4]

It also supports quantization of models trained in floating-point format with near-zero accuracy loss.

# 4

## System Development

#### 4.1 Overview

As already mentioned, the use of custom hardware for a specific application can have big benefits especially in terms of energy consumption and latency.

The inference process of a neural network is mainly characterized by a massive number of multiplications and additions. Data fetches from main memory follow regular patterns, and it has been shown that those data, in particular the weights, are reused across several executions of the neural network model.

As a consequence, executing a neural network model on a von Neumann machine leads to performance degradation, even in a cache-based system, since the CPU has to request the data from main memory, execute the operation on those data and then save the result back to main memory before moving on to the next data.

The introduction of vector instructions in modern processors brings only a slight performance benefit. Meanwhile, the drastic increase in the number of layers has made neural networks suitable for many applications, which translates into a massive increase in the operations needed to execute them, as can be observed in the following figure:



**Figure 4.1:** Average execution time divided by type of operations

Given the rapidly growing number of operations in a neural network, it becomes evident that executing them on a CPU cannot meet real-time application requirements.

Instead, the designed accelerator has a dataflow architecture, with emphasis on weight data reuse, and it is able to execute a tensor convolution. The basic idea is a computation matrix in which every entry is a processing element that operates on the incoming data and on the weights, which have already been loaded so that they can be reused.

The custom hardware accelerator is not useful on its own: it has to be integrated into an ML framework for its benefits to be appreciated. A preliminary survey of which ML framework would allow a custom hardware accelerator to be integrated with minimal changes to the model code and its definitions showed that TensorFlow, an end-to-end open source machine learning platform [26], suits these needs.

The workflow of the hardware-software development is illustrated in the following:



Figure 4.2: Development workflow

The entire work is implemented on a PYNQ-Z2 board from TUL, based on a Zynq-7000 SoC [27]. To speed up the development process and to use the built-in libraries for the AXI protocol and the DMA transfers, the software is partially developed in the PYNQ environment of the board [28], which is based on Python, a language that has become a de facto standard [29].

Using Python as the base software makes it easy to integrate with a high-level Machine Learning framework, TensorFlow in this case.

#### 4.2 Software

Since the focus of the work is the inference process, pre-trained models are needed, and TensorFlow Hub [30] comes in handy for this purpose: it provides pre-trained machine learning models for different domains. Moreover, TensorFlow supports post-training quantization of a model to different arithmetic precisions. As can be seen in Figure 4.2, the quantization process is done offline.

The choice of the stable release 2.1 of TensorFlow is dictated by the possibility of using Delegates (i.e. hardware accelerators or GPUs) in its neural network models.

A delegate is a way to hand part or all of the graph execution over to another executor. Every model is represented internally as a graph (with a relative order of execution for its nodes), and every node of the graph is described as a set of operations to be applied to the node's inputs. Because every node is described by a set of operations, it is easy to determine in advance which parts of the graph can be executed on the accelerator. This is done at the beginning, when both the model and the accelerator library are loaded, as represented in the following figure:



Figure 4.3: Execution Graph
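The partitioning step can be sketched as follows. This is a simplified model and not the TensorFlow Lite delegate API: the graph is reduced to an execution-ordered list of (name, operation) pairs, and `SUPPORTED_OPS` and `partition` are hypothetical names chosen for the example.

```python
SUPPORTED_OPS = {"CONV_2D"}  # assumption: operations claimed by the accelerator library

def partition(graph, supported=SUPPORTED_OPS):
    """Split an execution-ordered list of nodes into CPU and delegate segments."""
    segments = []
    for name, op in graph:
        target = "delegate" if op in supported else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(name)        # extend the current segment
        else:
            segments.append((target, [name]))   # start a new segment
    return segments

model = [("conv1", "CONV_2D"), ("relu1", "RELU"), ("pool1", "MAX_POOL_2D"),
         ("conv2", "CONV_2D"), ("conv3", "CONV_2D"), ("fc", "FULLY_CONNECTED")]
segments = partition(model)
print(segments)
```

Consecutive nodes assigned to the same executor are grouped, so that each group can be handed over in a single call instead of one call per node.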

It is worth mentioning that TensorFlow is open source and that, since no binary installation of its 2.1 release is provided for Arm processors, it has been cross-compiled from source for the PYNQ-Z2 board.

TensorFlow requires the accelerator library to be a shared library compatible with the C Python API. In addition, the code for using the accelerator had already been written in Python within the PYNQ environment. Therefore, to allow code reuse and reduce the development time, the Python code has been embedded in the C code (taken from a TensorFlow example of a delegate library), adding callbacks to the Python code<sup>1</sup>.

This has been possible thanks to the Python library *CFFI* (C Foreign Function Interface) [31], which is also able to produce a Python-API-compatible shared library as output. The following figure shows the flow chart between TensorFlow Lite and the accelerator library:



Figure 4.4: Flow Chart with accelerator

<sup>1</sup>See Appendix A

#### 4.3 System Level

As can be seen from Figure 4.5, the Zynq-7000 SoC is divided into two main blocks:

- Processing System: The processing system (referred to as *processing system7* in Figure 4.6) is in charge of running the OS and the Machine Learning application. As a consequence, it also runs the software needed to program the accelerator registers and to move data between main memory and the accelerator.
- Programmable Logic: The programmable logic (PL) hosts the entire design, from the accelerator itself to the DMAs and the AXI interconnections.



Figure 4.5: Zynq 7000 SoC [32]

Furthermore, the Programmable Logic in Figure 4.6 hosts:

- AXI interconnections: IP cores from Xilinx [33] [34] used to connect and correctly address the entities in the Programmable Logic.
- AXI DMA: IP core from Xilinx [35] which allows data movement between main memory and the accelerator memories. Several single-channel DMAs have been used instead of a single DMA with multiple channels, because the PYNQ environment only provides drivers for the single-channel DMA.
- DTPU: the actual hardware accelerator.
- XADC: IP core from Xilinx [36] which measures the temperature of the SoC, the voltages and the currents at run time.

In the following figure, the schematic of the overall design in the PL is presented.



Figure 4.6: System view hosted in the PL <sup>2</sup>

<sup>2</sup>Except for the Zynq Processing System

#### 4.4 DTPU, the hardware accelerator

The hardware accelerator, named *Cogitantium*<sup>3</sup>, *The Dumb Tensor Processing Unit*, is in charge of carrying out the tensor convolution of the neural network model, exploiting a dataflow architecture for the input data and data reuse for the weight data.

Figure 4.7 presents the logical block diagram of the accelerator.



Figure 4.7: Logical view of DTPU accelerator

#### 4.4.1 Real Implementation

The work is not focused on developing embedded memories and AXI interfaces. Therefore, a Xilinx IP core that includes all the necessary sub-components has been used [37], leading to the actual block diagram shown in Figure 4.8.



Figure 4.8: Real RTL view of DTPU accelerator

<sup>3</sup>Thoughtful

This has allowed the work to focus entirely on the DTPU core<sup>4</sup>, which has become:



Figure 4.9: RTL view of DTPU core

The sub-units are:

- L/S array: it provides the data for the Matrix Multiplication Unit. In particular, the weight data are reused across several executions and are therefore loaded only once.
- Control Unit: it is in charge of handling the handshake signals that transfer the ownership of the data (transferred by the DMA from the Main Memory), loading the weights and activations into their respective units and saving the results to the output FIFO. Since it is a dataflow architecture, there is no control flow of the data in the core, which has allowed the Control Unit to be kept as simple as possible.
- Matrix Multiplication Unit (MXU): it is the computation unit of the hardware accelerator. It executes the tensor convolution for different arithmetic precisions.

<sup>4</sup>See Appendix B

#### 4.4.2 High Level State Machine of Control Unit

The dataflow architecture has allowed the control unit to be designed as simply as possible, as presented in the figure below:



Figure 4.10: A high level view of Control Unit

In which:

- *Idle* state waits for the start signal from the *axis accelerator adapter* (generated when all the data have been transferred<sup>5</sup>).
- Fetch CSR Memory state retrieves from the CSR memory the desired data precision for the computation and the starting address of the weight memory. It also notifies the axis accelerator adapter that it is ready<sup>6</sup>.
- Load data in L/S array state loads the correct weight values (retrieved from the weight memory) and the activation data into the correct L/S units. The number of active L/S units is computed at run time: it depends on the currently required data precision and on the fixed number of rows and columns in the MXU.

<sup>5</sup>Input Data, Weight data and CSR data

<sup>6</sup>The ready signal is used as handshake between the core and the axis accelerator adapter for transferring the ownership of the data

- Compute state activates the MXU and waits for the end of the computation before committing the results to the output FIFO.
- Save to output FIFO state saves the data stored in the active L/S units to the output FIFO.
- Done state, depending on whether the input FIFO is empty or not, either continues the computation with the next activation data or returns to the idle state, notifying the axis accelerator adapter of the end of the computation<sup>7</sup>.
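The state sequence described above can be summarized with a small behavioural model. The sketch below is not the RTL of the Control Unit: the signal names are hypothetical, every state except Compute is assumed to last a single step, and the Done state is assumed to loop back to the load state while activations remain in the input FIFO.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    FETCH_CSR = auto()
    LOAD = auto()
    COMPUTE = auto()
    SAVE = auto()
    DONE = auto()

def next_state(state, start=False, compute_done=False, fifo_empty=True):
    """Return the next state of the control unit for the given inputs."""
    if state is State.IDLE:
        return State.FETCH_CSR if start else State.IDLE
    if state is State.FETCH_CSR:
        return State.LOAD
    if state is State.LOAD:
        return State.COMPUTE
    if state is State.COMPUTE:
        return State.SAVE if compute_done else State.COMPUTE
    if state is State.SAVE:
        return State.DONE
    # DONE: assumed to go back to LOAD while the input FIFO still holds data
    return State.IDLE if fifo_empty else State.LOAD

# Walk through one full computation with a single set of activations.
trace = [State.IDLE]
trace.append(next_state(trace[-1], start=True))
for _ in range(2):
    trace.append(next_state(trace[-1]))
trace.append(next_state(trace[-1], compute_done=True))
trace.append(next_state(trace[-1]))
trace.append(next_state(trace[-1], fifo_empty=True))
print([s.name for s in trace])
```

Because the next state depends only on the current state and on three status signals, the control logic stays small, which is consistent with the absence of data-dependent control flow in the core.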

#### 4.4.3 Datapath

As is well known, the execution of ML models is memory intensive and consists of a massive number of multiply and accumulate operations. In addition, during the execution of ML models some memory locations are accessed frequently. Therefore, a dataflow architecture that exploits local data reuse and computes multiplications and additions massively in parallel can boost performance.

The DTPU core has been designed according to these ideas, and the datapath of the core is presented in Figure 4.11 as a block diagram.



Figure 4.11: A detailed view of the DTPU core datapath. Enable and reset signals for clocked units have been omitted to improve readability.

<sup>7</sup>The notification of the end of computation allows the axis accelerator adapter to put the results on the output master AXI stream interface in order to be transferred by the DMA

In Figure 4.11, the brawn of the accelerator is the MXU wrapper, which contains the symmetric matrix of MACs with variable precision. Regarding the other blocks:

- Activation Decoder: it generates the right activation signals for the L/S units, depending on the current data precision and MXU size.
- Muxes and DeMuxes: their purpose is to feed the right data from/to memory to/from the right units. The counter (from 0 to ROWS-1) in the Mux for the output FIFO is used to save one data word into the FIFO at every clock cycle.
- Filter&Select: depending on the precision, it provides the correct data to the correct computation units.
- Compact&Select: it is the complement of the Filter&Select unit. It compacts the output data from the MXU wrapper and feeds the store registers.
- L/S weight Units: the name L/S has been kept for consistency even though these units do not store anything, since the weights are loaded only once (stationary weights) and kept until the next full execution.
- L/S Activation Units: they are in charge of loading the data from the input FIFO into banks of flip-flops, while at the same time they can save the results to be submitted later to the output FIFO.

#### 4.4.3.1 Filter&Select and Compact&Select

In principle, each Processing Element in the MXU wrapper has to be provided with a weight and an activation (and, as a consequence, each has to be provided by its own Load Unit). However, since the data width of the memories and FIFOs has been fixed to its maximum, 64 bits, a computation on 8-bit integers with an 8x8 MXU would fetch 8 values from the input FIFO (and save as many to the output FIFO) and 64 values from the weight memory. In this scenario all the flip-flops of the L/S units (both activation and weight) would sample values in which the upper 56 bits are always unused, wasting time on memory accesses and energy on unused data.

A clever solution is to pack the data before sending them to the accelerator. However, packing requires the data to be unpacked internally and the results to be packed again before they are committed to the output FIFO. Unpacking and packing are done by the Filter&Select and Compact&Select units, respectively. Returning to the previous example (8-bit integer computation, 8x8 MXU and 64-bit memory data), unpacking and packing make it possible to use only one L/S unit for the activations (8 for the L/S weight units) for both the load and the store operations. With a single active L/S unit and 8-bit integer computation, one 8-bit activation has to be distributed to each column of the MXU, and this is done by the Filter&Select unit. Before being committed to the output FIFO, the 8-bit results are compacted into a single 64-bit word by the Compact&Select unit.

A visual distribution of the data can be seen in Figure 4.12. The same applies to each row of L/S Weight units.



Figure 4.12: Data Distribution of Filter&Select unit for a MXU size of 8x8
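The packing done on the CPU and the unpacking done by the Filter&Select unit can be modelled with a few bit operations. This is a minimal sketch under stated assumptions: values are treated as unsigned, the first value is placed in the least significant bits (the actual lane order of the design is not stated here), and `pack` and `unpack` are hypothetical helper names.

```python
WORD_BITS = 64  # data width of the memories and FIFOs

def pack(values, bits):
    """Pack `bits`-wide unsigned values into one 64-bit word, first value lowest."""
    assert len(values) * bits <= WORD_BITS
    mask = (1 << bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * bits)
    return word

def unpack(word, bits):
    """Split a 64-bit word into its `bits`-wide fields, one per MXU column."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(WORD_BITS // bits)]

activations = [1, 2, 3, 4, 5, 6, 7, 8]   # one 8-bit activation per column
word = pack(activations, 8)
print(hex(word), unpack(word, 8))
```

With 8-bit data, the eight activations of an 8x8 MXU travel in a single 64-bit transfer instead of eight, and applying `pack` to the results models what the Compact&Select unit does before the output FIFO.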

If the required precision is 16 bits, with the same MXU size, two L/S units for the activations are activated (2\*ROWS for the L/S weight units) and feed their respective columns.

The reason for the two active L/S units is that only four 16-bit values can be packed into 64 bits. As the MXU size increases, the L/S units are activated accordingly. For example, with a 16x16 MXU and 8-bit integers, two L/S units are activated (four units for a 16-bit integer computation).

This approach also comes with the overhead of packing and unpacking the data on the CPU but, on the other hand, memory data movement is reduced and the effective bandwidth is increased, with a reduction in energy consumption (thanks also to the smaller number of active L/S units).

It is also worth mentioning that MXU sizes that are powers of two maximize the memory bandwidth.
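The examples above suggest a simple rule for the number of active L/S units. The formula below is inferred from those examples rather than taken from the RTL, and the function name is hypothetical.

```python
import math

WORD_BITS = 64  # data width of the memories and FIFOs

def active_ls_units(rows, columns, bits):
    """Return (activation units, weight units) active for a given precision."""
    values_per_word = WORD_BITS // bits            # values packed in one word
    activation_units = math.ceil(columns / values_per_word)
    weight_units = activation_units * rows         # same packing on every row
    return activation_units, weight_units

for size, bits in [(8, 8), (8, 16), (16, 8), (16, 16)]:
    print(size, bits, active_ls_units(size, size, bits))
```

For an 8x8 MXU this gives one activation unit and 8 weight units at 8 bits, and two activation units and 2\*ROWS weight units at 16 bits, matching the cases discussed in the text.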

#### 4.4.3.2 Matrix Multiplication Unit

The Matrix Multiplication Unit (referred to as MXU) is the muscle of the accelerator, where the convolution is done. As the name suggests, it is organized as a matrix:



Figure 4.13: MXU internal structure and weights distribution

Every sub-unit has its own weight value (distributed by the L/S weight units combined with the Filter&Select units, see Figure 4.11).

It is a homogeneous unit, except for the first column, which does not accumulate. In addition, as can be seen from the block diagram, there is no control flow between the processing units. There is only data exchange from the previous unit to the next one (along both axes).

This matrix configuration of the hardware allows a massive number of multiplications and accumulations to be performed at the same time. In particular, it can compute:

$$MAC_{OPS} = ROWS \times COLUMNS \quad \text{per } \#\text{clock cycles required for a single unit}, \qquad \text{with } Throughput = ROWS$$
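A purely functional model can make the structure concrete. The sketch below ignores timing and precision selection and is not derived from the RTL: it assumes that one activation is shared by each column and that the partial sums travel along each row, starting from the multiply-only first column.

```python
def mxu(weights, activations):
    """Functional model of a weight-stationary ROWS x COLUMNS MXU."""
    results = []
    for row in weights:                        # one result per row of PEs
        partial = row[0] * activations[0]      # first column: multiply only (SMUL)
        for w, a in zip(row[1:], activations[1:]):
            partial += w * a                   # other columns: multiply-accumulate (SMAC)
        results.append(partial)
    return results

def mac_ops(rows, columns):
    """Number of multiply(-accumulate) operations performed per pass."""
    return rows * columns

weights = [[1, 2],
           [3, 4]]                             # loaded once, then kept stationary
first = mxu(weights, [5, 6])
second = mxu(weights, [1, 1])                  # same weights, new activations
print(first, second, mac_ops(2, 2))
```

The weights are passed unchanged to both calls, which mirrors the weight reuse across activations, and each pass produces ROWS results.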

The MXU can be synthesized according to different criteria. In particular, the processing elements can be generated for a single data precision, from integer 8/16/32/64 to floating-point 32 or brain floating-point 16, or for several precisions at the same time. The data precision is then chosen via software and controlled using the signals in Figure 4.14.

A detailed view of the Processing Elements, SMAC (Sub unit Multiply and Accumulate) and SMUL (Sub unit Multiply), is given in Figure 4.14.





Figure 4.14: SMAC and SMUL details

It is important to mention that the sub-units always receive data on 64 bits, even though internally they may or may not use all of them, depending on the values of the *select precision* and *active chain* signals.

In the full integer configuration (64-bit wide operations), besides being able to compute on different data widths (i.e. 8/16/32/64), the processing elements can compute vectorized operations. With the *active chain* signal asserted (it is active low; otherwise a 64-bit computation is performed) and the data width fixed to 64 bits, a processing element can compute two 8-bit, one 16-bit and one 32-bit operations at the same time (multiplication for SMUL, multiplication and addition for SMAC). However, this comes with the overhead of correctly packing and unpacking the data on the CPU before transferring them to the accelerator.

SMAC and SMUL units have been designed internally using Vivado DSP primitives [38], whose general schema is shown in Figure 4.15:



Figure 4.15: DSP Slice Functionality [38]

This allows two computations (referring to SMAC) to fit in one single unit<sup>8</sup>, maximizing the resource utilization.

When the synthesis process reaches the maximum DSP utilization, it does not switch automatically to the fabric for those primitives. To maximize the resource usage of the FPGA, the DSP primitives have therefore been generated for both fabric and DSP blocks. In this way, the generation algorithm for the MXU uses DSP primitives up to the maximum allowed for the given board and then starts to use the fabric. This approach has allowed an almost full utilization of the FPGA resources.
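The allocation policy can be sketched as a simple budgeted assignment. This is an illustrative model, not the actual generation script: the row-major traversal order, the function name and the budget of 220 DSP blocks used in the example are assumptions.

```python
def assign_primitives(rows, columns, dsp_budget, dsp_per_pe=1):
    """Map each PE to 'dsp' until the budget is exhausted, then to 'fabric'."""
    layout = []
    remaining = dsp_budget
    for _ in range(rows):
        row = []
        for _ in range(columns):
            if remaining >= dsp_per_pe:
                row.append("dsp")          # PE built on DSP primitives
                remaining -= dsp_per_pe
            else:
                row.append("fabric")       # PE built on LUTs and flip-flops
        layout.append(row)
    return layout

def count(layout, kind):
    """Count how many PEs use the given implementation."""
    return sum(row.count(kind) for row in layout)

small_pes = assign_primitives(16, 16, dsp_budget=220)                # 1 DSP per PE
wide_pes = assign_primitives(8, 8, dsp_budget=220, dsp_per_pe=14)    # 14 DSPs per PE
print(count(small_pes, "dsp"), count(small_pes, "fabric"))
print(count(wide_pes, "dsp"), count(wide_pes, "fabric"))
```

Once the budget runs out, every additional PE is placed in the fabric, which is where a sudden rise in LUT usage would be expected.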

<sup>8</sup>Only for integer 8 and 16

# 5

## Results

If you can not measure something, you can not improve it.

— William Thomson Kelvin

#### 5.1 Evaluation metrics

Generally speaking, in computer science every domain and application may have different evaluation metrics. For example, the energy efficiency of a CPU is a key metric in embedded systems, while for a high-performance CPU latency and throughput are the dominant metrics. Evaluation metrics therefore depend strongly on the end users, and designers have to make assumptions about the end users' intentions and applications.

In this work the assumptions are that the accelerator will be deployed in an embedded system and that it should give the user a certain degree of flexibility in running neural network models. Thus, as suggested in [5], the following metrics are used:

- Accuracy, quality of the final result of inference process.
- Throughput, for measuring real time performance. It depends on the number of internal computation cores.
- Latency, for interactive applications.
- Energy and power.
- Hardware cost (Utilization Factor in case of an FPGA) of chip area and process technology.

#### 5.2 Utilization Factor

An important aspect of an embedded system is the on-die area it occupies. These kinds of systems are usually deployed on tightly area-constrained chips so that their presence is hidden from the user. Therefore, it is important to measure and understand how the FPGA utilization of the design (used as the area measurement in this case) behaves as the size of the Matrix Multiplication Unit, and with it the throughput, increases.

The Utilization Factor, composed of Look-Up Table, Flip-Flop and Digital Signal Processor usage, is expected to increase with the size of the Matrix Multiplication Unit and with the bit width of the computation units.

In the following figures, utilization results are presented for each data type, with the Matrix Multiplication Unit size pushed as far as the timing requirements are met:

• Integer 8 bit:



**Figure 5.1:** Post Implementation Utilization Factor of integer 8 bit PEs and clock frequency of 30 MHz

• Integer 16 bit:



**Figure 5.2:** Post Implementation Utilization Factor of integer 16 bit PEs and clock frequency of 30 MHz

• Integer 32 bit:



**Figure 5.3:** Post Implementation Utilization Factor of integer 32 bit PEs and clock frequency of 30 MHz

• Integer 64 bit:



**Figure 5.4:** Post Implementation Utilization Factor of integer 64 bit PEs and clock frequency of 30 MHz

• Brain Floating point 16:



**Figure 5.5:** Post Implementation Utilization Factor of bfp 16 bit PEs and clock frequency of 30 MHz

• Floating point 32:



**Figure 5.6:** Post Implementation Utilization Factor of fp 32 bit PEs and clock frequency of 30 MHz

It can be seen that the trend for integer 8 and 16 is similar (Figures 5.1 and 5.2), because there is a one-to-one mapping between a PE and a DSP entity on the board. The DSP entities actually work on 16 bits, and when the 8-bit units are used the upper 8 bits are gated to zero.

As long as the PEs are mapped onto DSPs, the utilization of LUTs and FFs is linear in the size of the Matrix Multiplication Unit, while the DSP utilization is quadratic. Once the DSP entities are fully utilized, the PEs start to be implemented in logic, and a sudden rise in the LUT utilization can be seen in all the previous graphs.

It is also worth mentioning that the 64-bit integer PEs are a special case of FPGA utilization: they reach full utilization sooner than the other designs. Every PE in this configuration uses 14 DSP entities, in order to also support the vectorized operations mentioned previously.

Regarding the floating-point units, the same hardware unit has been used for both fp32 and bfp16, since they have the same exponent bit length but different mantissa lengths (this also gives them the same numerical stability). The synthesis process is relied upon to optimize the different units properly and to discard the unused hardware where necessary. In fact, comparing Figure 5.5 with Figure 5.6, there is a slight difference in the utilization of the LUTs and a more noticeable difference in FF utilization.

When the clock frequency is increased, the FPGA utilization that can be reached is lower, since at larger Matrix Multiplication Unit sizes the design is no longer able to meet the timing requirements, especially for the floating-point units.

### 5.3 Energy and Power Consumption

Energy and power consumption are important factors: a mobile device has a limited battery capacity, while data centers face stringent power ceilings due to cooling costs.

According to the Vivado power estimation manual [39], the static power is calculated over all the FPGA resources, because it is hard to estimate the static power of a single design. In the following figures, the power consumption estimated by Vivado is presented for each data type and for different clock frequencies:

• Integer 8:



**Figure 5.7:** Post Implementation Power Consumption of Processing System for integer 8 PEs



**Figure 5.8:** Post Implementation Static Power Consumption Programmable logic for integer 8 PEs



**Figure 5.9:** Post Implementation Dynamic Power Consumption per Programmable logic with integer 8 PEs

The previous figures show how the power consumption behaves for different Matrix Multiplication Unit sizes and different clock frequencies (see Appendix C). Each value is expressed as a percentage of the total power consumption of the SoC (processing system and programmable logic). The power consumption of the processing system and the static power consumption have a smaller impact on the total as the Matrix Multiplication Unit and the frequency increase. On the other hand, the dynamic power consumption in Figure 5.9 grows, as expected, with the size of the Matrix Multiplication Unit and with the design frequency.

In the following figures, the dynamic power consumption of each entity in the FPGA (as a percentage of the dynamic power in Figure 5.9) is analyzed for different clock frequencies.

**Figure 5.10:** Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 8 PEs

As is well known, the clock distribution is one of the main sources of power consumption, and this is confirmed by all the previous figures. The interconnections, called *signals* in the figures, are also power hungry (a bigger MXU leads to more, and longer, interconnections between PEs). In fact, the clock distribution networks and the interconnections dominate the dynamic power consumption of the programmable logic. The logic entity accounts for all the power consumed by the FFs and LUTs; their power consumption appears to decrease, but it is only their percentage of the total dynamic power that decreases.
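The difference between an absolute value and a share of the total can be made concrete with a small sketch. All the numbers below are hypothetical and chosen only to illustrate the effect; they are not taken from the Vivado reports.

```python
def shares(power_mw: dict[str, float]) -> dict[str, float]:
    """Convert absolute per-entity power into percentages of the total."""
    total = sum(power_mw.values())
    return {name: 100.0 * value / total for name, value in power_mw.items()}


# Hypothetical values in mW: the logic power is unchanged, while the
# clock and signal power grow with the size of the MXU.
small_mxu = {"clocks": 10.0, "signals": 10.0, "logic": 20.0}
large_mxu = {"clocks": 40.0, "signals": 40.0, "logic": 20.0}

print(shares(small_mxu)["logic"])  # 50.0
print(shares(large_mxu)["logic"])  # 20.0
```

In this example the logic entity consumes the same power in both designs, yet its share drops because the other entities grow.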

It is worth mentioning that the PEs (at least the majority of them) are implemented using the DSP entities. However, the power consumed by those entities is almost negligible: according to the datasheet [32], the DSPs are low-power entities in the FPGA.

As for the BRAM, I/O and XADC, their share of the power consumption changes only slightly as the impact of the other entities increases or decreases.

#### • Integer 16:

For the following figures, the same considerations made for integer 8 still hold, since the PEs are implemented on the same DSP entities, but without the upper 8 bits of the input values gated to zero.



**Figure 5.11:** Post Implementation Power Consumption of Processing System for integer 16 PEs



**Figure 5.12:** Post Implementation Static Power Consumption Programmable logic for integer 16 PEs



**Figure 5.13:** Post Implementation Dynamic Power Consumption per Programmable logic with integer 16 PEs



**Figure 5.14:** Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 16 PEs

#### • Integer 32:

From now on, the MXU sizes and the frequencies shown are reduced, mainly because big designs (with almost full FPGA utilization) are no longer able to meet the timing requirements.



**Figure 5.15:** Post Implementation Power Consumption of Processing System for integer 32 PEs

It is worth mentioning that the power consumed by the DSP entities is higher than for the integer 8 and 16 PEs. The main reason is that there is no longer a one-to-one mapping between PEs and DSP entities.



**Figure 5.16:** Post Implementation Static Power Consumption Programmable logic for integer 32 PEs

**Figure 5.17:** Post Implementation Dynamic Power Consumption per Programmable logic with integer 32 PEs



**Figure 5.18:** Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 32 PEs

#### • Integer 64:



**Figure 5.19:** Post Implementation Power Consumption of Processing System for integer 64 PEs

As mentioned in the Utilization section, the 64-bit integer PEs use 14 DSP entities each (to make it possible to compute vectorized operations on data). This heavy DSP utilization per PE also affects the power consumed by the DSPs, as can be seen in the figures.



**Figure 5.20:** Post Implementation Static Power Consumption Programmable logic for integer 64 PEs



**Figure 5.21:** Post Implementation Dynamic Power Consumption per Programmable logic with integer 64 PEs



**Figure 5.22:** Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and integer 64 PEs

#### • Brain floating point 16:



**Figure 5.23:** Post Implementation Power Consumption for bfp16 PEs



**Figure 5.24:** Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and bfp16 PEs

#### • Floating point 32:



**Figure 5.25:** Post Implementation Power Consumption for fp32 PEs

For bfp16 and fp32, most of the power is consumed by the interconnections and the logic, mainly because the PEs are implemented in logic.



**Figure 5.26:** Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and fp32 PEs

So far, the focus has been on how much the single entities and the different types of power contribute to the total power consumption. It is also worth comparing the absolute values for different data precisions, as in Figure 5.27.



**Figure 5.27:** Comparison of Post Implementation Dynamic Power Consumption per entities in Programmable Logic with a clock frequency of 30 MHz and a MXU 3x3

As is well known from the literature, and as the other figures and observations have also shown, the power consumption per entity grows with the bitwidth (in the case of integers) and with the complexity (in the case of floating point).

The power consumed by the DSPs (the entities in which the integer PEs are implemented) is negligible for integer 8 and 16; it starts to grow slowly with integer 32 and explodes with the 64-bit PEs, whose high DSP utilization has a large impact on their power consumption.

### 5.4 Throughput

By definition, throughput is the number of units of information a system can process in a given time. For the designed accelerator, it is equal to the number of rows in the Matrix Multiplication Unit. Normalized with respect to the clock frequency, this value is constant for all data types and frequencies.



**Figure 5.28:** Roofline model of the accelerator with a MXU size of 8x8

The theoretical throughput is given by the roofline model (Figure 5.28) and is equal to the number of rows in the Matrix Multiplication Unit. The assumption is that enough data are provided to the accelerator to keep all the Processing Elements working on useful data; if this assumption is not met, the throughput decreases. In Figure 5.28, the different slopes for different data widths represent the different numbers of internal memory accesses needed to retrieve data for all the Processing Elements.

The throughput can be further increased in the 64-bit configuration of the Processing Elements. As already mentioned, those 64-bit units are able to compute vectorized instructions and therefore increase the number of computations per cycle. However, this comes with the overhead of more memory accesses, as can be seen in Figure 5.29.
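The roofline bound itself can be written in a few lines. In the sketch below, the peak value of 8 results per cycle follows from the text (number of rows of an 8x8 MXU), while the bandwidth value is purely hypothetical and stands for the slope that changes with the data width; the function names are also hypothetical.

```python
def attainable_throughput(intensity: float, peak: float, bandwidth: float) -> float:
    """Roofline bound: memory limited below the ridge point, compute limited above it."""
    return min(peak, bandwidth * intensity)


def ridge_point(peak: float, bandwidth: float) -> float:
    """Operational intensity at which the design becomes compute bound."""
    return peak / bandwidth


PEAK = 8.0       # results per cycle for an 8x8 MXU (number of rows, from the text)
BANDWIDTH = 2.0  # hypothetical: data delivered per cycle by the internal memory

for intensity in (1.0, 2.0, 4.0, 8.0):
    print(intensity, attainable_throughput(intensity, PEAK, BANDWIDTH))
print("ridge point:", ridge_point(PEAK, BANDWIDTH))
```

A wider data type needs more memory accesses per operand, which corresponds to a lower `BANDWIDTH` value and therefore to a ridge point further to the right.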



**Figure 5.29:** Roofline model of the accelerator with a MXU size of 8x8 and vectorized PEs

### 5.5 Latency

In a real-time application, the most important factor is latency, that is, the execution time of a task. Here, latency is measured as the average execution time of a neural network model on different platforms. In addition, in the CPU + accelerator configuration, the models are executed on the target with different clock frequencies and data types in the Programmable Logic, which results in a different overall latency (and power consumption).

The following tables present the execution time for different data types and models (with the accelerator clock frequency fixed at 100 MHz).
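An average latency of this kind can be measured with a small timing harness such as the one sketched below. The number of runs and the warm-up phase are assumptions, since they are not specified here, and the workload is only a stand-in for one inference call.

```python
import time
from statistics import mean
from typing import Callable


def average_latency_ms(task: Callable[[], object], runs: int = 10, warmup: int = 1) -> float:
    """Average wall-clock time of task() in milliseconds over several runs."""
    for _ in range(warmup):
        task()  # discard warm-up runs (caches, lazy initialization)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append((time.perf_counter() - start) * 1e3)
    return mean(samples)


# Stand-in workload; in practice this would wrap one inference call.
print(f"{average_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```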

| Model    | CPU (host)<sup>1</sup> | GPU (host)<sup>2</sup> | CPU (Pynq Z2 board)<sup>3</sup> | CPU (Pynq Z2 board) + accelerator |
|----------|------------------------|------------------------|---------------------------------|-----------------------------------|
| MNIST    | 0.3 ms                 | 5.7 ms                 | 2.9 ms                          | 509 ms                            |
| Cifar 10 | 20 ms                  | 22 ms                  | 160 ms                          | 13356 ms                          |

**Table 5.1:** Execution Time for different platform and model, integer 8

| Model    | CPU (host)<sup>1</sup> | GPU (host)<sup>2</sup> | CPU (Pynq Z2 board)<sup>3</sup> | CPU (Pynq Z2 board) + accelerator |
|----------|------------------------|------------------------|---------------------------------|-----------------------------------|
| MNIST    | 0.3 ms                 | 5.7 ms                 | 2.9 ms                          | 503 ms                            |
| Cifar 10 | 20 ms                  | 22 ms                  | 160 ms                          | 13178 ms                          |

**Table 5.2:** Execution Time for different platform and model, integer 16

| Model    | CPU (host)<sup>1</sup> | GPU (host)<sup>2</sup> | CPU (Pynq Z2 board)<sup>3</sup> | CPU (Pynq Z2 board) + accelerator |
|----------|------------------------|------------------------|---------------------------------|-----------------------------------|
| MNIST    | 0.3 ms                 | 5.7 ms                 | 2.9 ms                          | 496.9 ms                          |
| Cifar 10 | 20 ms                  | 22 ms                  | 160 ms                          | 13218 ms                          |

**Table 5.3:** Execution Time for different platform and model, integer 32

<sup>1</sup> Intel i7-6700HQ, 2.60 GHz

<sup>2</sup> NVIDIA GeForce GTX 960M, 1.176 GHz

<sup>3</sup> Arm dual-core Cortex-A9, 650 MHz

Looking at the previous tables, the latency does not change with the data precision. This is due to the hardware structure: the Matrix Multiplication Unit is built in such a way that the latency between one operation and the next is always 3 clock cycles (for integer operations).

It is worth analyzing the increase in latency in the configuration with the accelerator, since one of the main goals was to reduce latency.

Consider the following figures:



**Figure 5.30:** Total Execution time of Invoke method (left) in the configuration with accelerator and MNIST model



**Figure 5.31:** Total Execution time of Invoke method (left) in the configuration with accelerator and Cifar10 model

As mentioned before, the most compute-intensive part is always the convolution operations. Introducing the hardware accelerator and its library comes with several overheads, as can be seen in Figures 5.30 and 5.31:

- Data exchange between C and Python: the accelerator library has been developed in Python, with a C interface to TensorFlow Lite. This means that every matrix (input, output and weight) is copied to the Python sublayer for further processing. Migrating the whole accelerator library to C code would remove this overhead.
- Rebuilding of the output matrix: after every computation executed by the accelerator, the library parses the accelerator's output and rebuilds the output matrix according to the current execution indexes. This overhead can be removed by preprocessing the model before deployment, transforming the matrices into a format suitable for the accelerator, as most accelerators on the market do.
- Hardware execution time (and data transfer to the accelerator): this is the actual execution time of the hardware plus the data transfer to and from the accelerator. It is also bounded by the fixed internal memory access. It can be reduced by increasing the frequency of the Programmable Logic.
- Other internal operations: these include the time for reshaping the input matrix into a format suitable for the accelerator and for saving the output matrix back from Python to C. The reshaping can be removed by preprocessing the model before deployment, transforming the matrices into a format suitable for the accelerator. Moreover, the migration to a complete C implementation would remove the overhead of saving back the output matrix.
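A breakdown of this kind can be collected with per-phase time probes. The following is only a minimal sketch of the idea, not the probe implementation used in the library: the class name, the phase labels and the stand-in workloads are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TimeProbes:
    """Accumulates wall-clock time per named phase."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def probe(self, phase: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[phase] += time.perf_counter() - start

    def breakdown(self) -> dict:
        """Share of the total measured time spent in each phase, in percent."""
        total = sum(self.totals.values())
        if total == 0:
            return {phase: 0.0 for phase in self.totals}
        return {phase: 100.0 * t / total for phase, t in self.totals.items()}


probes = TimeProbes()
with probes.probe("data exchange C/Python"):
    payload = list(range(50_000))   # stand-in for copying tensors
with probes.probe("hardware execution"):
    result = sum(payload)           # stand-in for the accelerator call
with probes.probe("rebuild output matrix"):
    rebuilt = [result] * 1_000      # stand-in for output reshaping
print(probes.breakdown())
```

Expressing each phase as a percentage of the total makes it easy to see which overhead is worth removing first.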

Another reason for the latency overheads should be mentioned. This work accelerates ML execution by accelerating only predefined TensorFlow operations (the most compute-intensive ones). All the previous overheads result from this approach, because additional software, hardware and data transfer layers have been added inside the accelerated TensorFlow functions. This is in slight contrast with the approaches analyzed in the State-of-the-art chapter, where the entire (preprocessed) model is loaded onto the accelerator once and then only the prediction is sent to the CPU.

Taking into account all the previous details and suggestions, the latency of the model can be pushed down to that of the CPU-only execution, with the benefits of lower power consumption and reduced CPU load.

### 5.6 Accuracy

The accuracy of the inference process of a Machine Learning model is how close the prediction is to the actual value. For example, for the MNIST model, it is how accurately the network predicts a digit when that digit is given as input.

In the following, the accuracy for different data widths and models is presented with reference to the actual value, which in this case is the inference without the hardware accelerator. Moreover, the data provided to the accelerator are bounded by the filter size of the weights, which is always fixed to 3x3 for the models used. Therefore, the MXU used for the following comparison is the standard one, 8x8.

| Model    | $\leq \pm 5\%$ | $\pm 5\% \div 25\%$ | $\pm 25\% \div 50\%$ | $\pm 50\% \div 75\%$ | $\geq \pm 75\%$ |
|----------|----------------|---------------------|----------------------|----------------------|-----------------|
| MNIST    |                |                     | X                    |                      |                 |
| Cifar 10 |                |                     |                      |                      | X               |

**Table 5.4:** Accuracy Output<sup>1</sup> with Convolution on integer 8

| Model    | $\leq \pm 5\%$ | $\pm 5\% \div 25\%$ | $\pm 25\% \div 50\%$ | $\pm 50\% \div 75\%$ | $\geq \pm 75\%$ |
|----------|----------------|---------------------|----------------------|----------------------|-----------------|
| MNIST    |                |                     |                      | X                    |                 |
| Cifar 10 |                |                     |                      | X                    |                 |

**Table 5.5:** Accuracy Output<sup>1</sup> with Convolution on integer 16

| Model    | $\leq \pm 5\%$ | $\pm 5\% \div 25\%$ | $\pm 25\% \div 50\%$ | $\pm 50\% \div 75\%$ | $\geq \pm 75\%$ |
|----------|----------------|---------------------|----------------------|----------------------|-----------------|
| MNIST    |                |                     |                      | X                    |                 |
| Cifar 10 |                |                     |                      |                      | X               |

**Table 5.6:** Accuracy Output<sup>1</sup> with Convolution on integer 32

It is immediately evident that the output prediction with the accelerator integrated differs greatly from the expected one.

One reason for the wrong predictions may lie in the input data fed to the model, which are completely random. Being fed with random, and probably unreasonable, data, the prediction accuracy is degraded.

<sup>1</sup> The accuracy is measured as the difference between the reference accuracy (the model's output in the CPU-only execution) and the output accuracy of the CPU + hardware accelerator execution, expressed as a percentage of the reference accuracy. It refers to the main prediction, the one with the highest value.
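The metric described in the footnote can be sketched as follows. This is one possible reading of the definition: it assumes that the accelerated output is compared at the class index of the top CPU-only prediction, which is not stated explicitly. The function names and the score vectors are hypothetical.

```python
def top_prediction_deviation(reference: list[float], accelerated: list[float]) -> float:
    """Percentage difference of the top reference score, relative to the reference."""
    top = max(range(len(reference)), key=reference.__getitem__)
    return 100.0 * abs(reference[top] - accelerated[top]) / abs(reference[top])


def deviation_bin(deviation: float) -> str:
    """Map a deviation to the column ranges used in Tables 5.4 to 5.6."""
    for upper, label in ((5, "<= 5%"), (25, "5%-25%"), (50, "25%-50%"), (75, "50%-75%")):
        if deviation <= upper:
            return label
    return ">= 75%"


# Hypothetical score vectors for a 3-class model.
cpu_only = [0.10, 0.80, 0.10]
with_accelerator = [0.30, 0.50, 0.20]
d = top_prediction_deviation(cpu_only, with_accelerator)
print(f"{d:.1f}% -> {deviation_bin(d)}")
```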

A further improvement from the hardware point of view, which could improve the output accuracy, would be to accumulate at a different bitwidth precision, for example computing multiplications on 8-bit integers and accumulating the values on 16-bit integers. Moreover, as for every software product, the fact that no bugs have been detected does not mean that the written software is bug-free.
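The effect of the accumulator width can be shown with a small software model. The sketch below emulates two's-complement wraparound in pure Python; it illustrates the general idea only and is not a model of the actual PE datapath. The function names and the input values are hypothetical.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Emulate two's-complement wraparound of a signed register of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def dot(a: list[int], b: list[int], acc_bits: int) -> int:
    """Dot product of int8 vectors with an accumulator of acc_bits bits."""
    acc = 0
    for x, y in zip(a, b):
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


activations = [100, 100, 100]
weights = [2, 2, 2]
print(dot(activations, weights, acc_bits=8))   # 88: the accumulator overflows
print(dot(activations, weights, acc_bits=16))  # 600: the exact result
```

With an 8-bit accumulator the result wraps around and becomes meaningless, while a 16-bit accumulator holds the exact sum for the same 8-bit operands.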

# 6

### Conclusion

#### 6.1 Discussion

A large portion of the inference process for neural networks consists of multiply and add computations, the basic operations of tensor convolutions, and data, especially weight tensors, are reused across several executions. Consequently, a hardware accelerator has been developed to speed up ML models and reduce their power consumption (especially on mobile devices). It is also designed to accommodate the different data types requested by neural network models, ranging from integer 8/16/32/64 to floating-point 32 and brain floating-point 16.

The approach of the work has been a hardware/software co-design, in order to accommodate the most compute-intensive request of machine learning, the tensor convolution. Therefore, the hardware core for tensor convolution has been developed from scratch, while the common components, such as memories and bus interfaces, have been chosen from the ones available in the tools.

Moving up one step at a time in the abstraction level, the accelerator library has been developed and deployed. In order to complete it within a fixed time, the core of the library has been developed in Python and interfaced with a C-code template provided by the developers of the ML framework used. This has led to a hybrid library that encapsulates a frozen Python code layer called from the C code; the latter is only in charge of retrieving the data and passing them to the Python layer.

Moving one more step up in the abstraction, the ML-framework level is reached. At this level, the most popular ML framework, TensorFlow, has been chosen. It also offers the possibility of delegating part of the execution graph to coprocessors or GPUs. Moreover, existing TensorFlow pretrained models have been quantized for different bitwidths and data precisions.

It is possible to build a custom hardware accelerator for a specific ML operation and then integrate it into a framework without changing either the model or the framework. The bottom-up approach and the delegate class available in TensorFlow have made it possible to fully tailor a new class of hardware accelerators that can accommodate different needs (i.e. depending on which part of the model has to be accelerated). As it has been organized, by changing the core software in the Python code and the core in the hardware, the work can also be used to address different model operations.
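As an illustration of what quantizing to a given bitwidth means, the sketch below applies a generic symmetric linear quantization in pure Python. It is not the quantization procedure used in this work, which relies on the TensorFlow tooling; the function names and the weight values are hypothetical.

```python
def quantize(values: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric linear quantization of values to signed integers of `bits` bits."""
    qmax = (1 << (bits - 1)) - 1
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map quantized integers back to approximate real values."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
for bits in (8, 16):
    q, scale = quantize(weights, bits)
    error = max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))
    print(bits, q, f"max error {error:.2e}")
```

A wider integer type leaves a smaller reconstruction error, at the cost of the higher resource and power figures reported in Chapter 5.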

#### 6.2 Future Works

For every human artifact, there is always work left to do. In addition, for computer engineering artifacts there is an important further step, which is the optimization of the software (and in this case also of the hardware). In particular:

- Software optimization and migration to a full C code implementation to further reduce the latency.
- Hardware optimization.
- In-depth software/hardware testing to find additional bugs.
- Power estimation using the simulation's switching activity in order to obtain a more precise and reliable power consumption estimate.
- Comparison of model execution on different state-of-the-art platforms.

By following the previous recommendations, the work may reach a competitive level, comparable to that of GPUs or other hardware platforms.

## Bibliography

- [1] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", CoRR 2017, abs/1704.04760.
- [2] Nvidia, NVDLA, http://nvdla.org/index.html#.
- [3] Habana, "Gaudi<sup>TM</sup> Training Platform White Paper", **2019**.
- [4] Habana, "Goya<sup>TM</sup> Inference Platform White Paper", **2019**.
- [5] V. Sze, Y. H. Chen, T. Yang, J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey", *Proceedings of the IEEE* vol. 105 no. 12 pp. 2295-2329 Dec. 2017.
- [6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, "A domain-specific architecture for deep neural networks", *Communications of the ACM* vol. 61 pp. 50-59 August 2018.
- [7] Y. Cai, C. Liang, Z. Tang, H. Li, S. Gong, "Deep Neural Network with Limited Numerical Precision", **2018**, (Eds.: J. Abawajy, K.-K. R. Choo, R. Islam), 42–50.
- [8] J. Johnson, "Rethinking floating point for deep learning", Facebook AI Research 2018.
- [9] A. Rahman, S. Oh, J. Lee, K. Choi, "Design space exploration of FPGA accelerators for convolutional neural networks", **1996**.
- [10] J. T. Johnston, S. R. Young, C. D. Schuman, J. Chae, D. D. March, R. M. Patton, T. E. Potok, "Fine-Grained Exploitation of Mixed Precision for Faster CNN Training", 2019, 9–18.
- [11] H. Zhang, H. J. Lee, S.-B. Ko, "Efficient Fixed/Floating-Point Merged Mixed-Precision Multiply-Accumulate Unit for Deep Learning Processors", **2018**.
- [12] A. Turing, "Computing machinery and intelligence", Mind 1950.
- [13] S. Russell, P. Norvig, Artificial intelligence: a modern approach. Pearson Education Limited, 2016.
- [14] W. Maass, "Networks of spiking neurons: The third generation of neural network models", Neural Networks 1997, 10, 1659 –1671.
- [15] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, A. Maida, "Deep learning in spiking neural networks", *Neural Networks* **2019**, *111*, 47–63.
- [16] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. D. Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza, J. Kusnitz, M. Debole, S. Esser, T. Delbruck, M. Flickner, D. Modha, "A Low Power, Fully Event-Based Gesture Recognition System", IBM research 2017.
- [17] R. Zhao, Y. Hu, J. Dotzel, C. D. Sa, Z. Zhang, "Improving Neural Network Quantization without Retraining using Outlier Channel Splitting", **2019**.

- [18] R. Banner, Y. Nahshan, D. Soudry, "Post training 4-bit quantization of convolutional networks for rapid-deployment", **2019**.
- [19] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients", **2018**.
- [20] Chien-Ping Lu in Proceedings of 2010 International Symposium on VLSI Design, Automation and Test, **2010**, pp. 5–5.
- [21] Nvidia, "NVIDIA A100 Tensor Core GPU Architecture, Unprecedented acceleration at every scale", **2020**.
- [22] Nvidia, NVDLA Hardware Architectural Specification, http://nvdla.org/hw/v1/hwarch.html.
- [23] Arm, "AMBA® AXI<sup>TM</sup> and ACE<sup>TM</sup> Protocol Specification", **2011**.
- [24] Nvidia, NVDLA Software Manual, http://nvdla.org/sw/contents.html.
- [25] Google, An in-depth look at Google's first Tensor Processing Unit (TPU), https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu.
- [26] TensorFlow, https://www.tensorflow.org/overview.
- [27] Xilinx, "Zynq-7000 SoC, Technical Reference Manual", 2018.
- [28] Xilinx, PYNQ, http://www.pynq.io/board.
- [29] G. Corradi, "The value of Python Productivity: extreme edge analytics on Xilinx Zynq portfolio", Xilinx 2018.
- [30] TensorFlow Hub, https://www.tensorflow.org/hub/overview.
- [31] C Foreign Function Interface for Python, https://cffi.readthedocs.io/en/latest/.
- [32] Xilinx, "Zynq-7000 SoC Data Sheet: Overview", 2018.
- [33] Xilinx, "AXI Interconnect v2.1 LogiCORE IP Product Guide", 2017.
- [34] Xilinx, "Vivado Design Suite, AXI Reference Guide", 2017.
- [35] Xilinx, "AXI DMA v7.1 LogiCORE IP Product Guide", 2019.
- [36] Xilinx, "7 Series FPGAs and Zynq-7000 SoC XADC Dual 12-Bit 1 MSPS Analog-to-Digital Converter, User Guide", **2018**.
- [37] Xilinx, "AXI4-Stream Accelerator Adapter v2.1 LogiCORE IP Product Guide", **2015**.
- [38] Xilinx, "7 Series DSP48E1 Slice, User Guide", 2018.
- [39] Xilinx, "Vivado Design Suite User Guide: Power Analysis and Optimization", **2020**.

# A

## Accelerator library

#### Script for creating library:

```
import cffi
import sys
sys.path.append('/usr/local/lib')

##############################################################
# The Frankenstein, a mix of C and Python
# create .so library from PYNQ python code for DTPU accelerator
# on board compiling, it requires
# to have tensorflow/tensorflow/lite in /usr/include/pythonX.X
# from r2.1 branch
##############################################################
ffibuilder = cffi.FFI()

# Functions inside extern "Python" are implemented in the Python layer and
# called from C; the remaining ones are exported by the C++ delegate.
# NOTE: the return type of SelectDataTypeComputation is assumed to be bool,
# matching SelectDataTypeComputation_p.
ffibuilder.cdef("""
extern "Python" {
bool destroy_p(void);
bool CopyFromBufferHandle_p(void);
bool CopyToBufferHandle_p(void);
void FreeBufferHandle_p(void);
bool SelectDataTypeComputation_p(int);
bool Init_p(int, int, int);
bool Prepare_p(int);
bool Invoke_p(bool, int);
void load_overlay(void);
bool ResetHardware_p(void);
void push_weight_to_heap(void *, int *, int);
void push_input_tensor_to_heap(void *, int *, int);
void push_output_tensor_to_heap(void *, int *, int);
bool print_power_consumption_p(void);
bool start_power_consumption(void);
void activate_time_probe_p(bool);
bool print_python_time_probes(void);
}
void * tflite_plugin_create_delegate();
void tflite_plugin_destroy_delegate(void *, void *);
bool SelectDataTypeComputation(int);
bool print_power_consumption();
bool measure_power_consumption();
bool print_execution_stats();
bool activate_time_probe(bool); """)

# The C++ delegate source is compiled together with the generated wrapper.
cpp_file = open("./DTPU_delegate.cpp", "r")
ffibuilder.set_source("dtpu_lib", cpp_file.read(), source_extension=".cpp",
    extra_compile_args=['-Wno-unused-result', '-Wsign-compare', '-DNDEBUG',
        '-g', '-fwrapv', '-O2', '-Wall', '-Wstrict-prototypes', '-g',
        '-fdebug-prefix-map=/build/python3.5.2=.',
        '-specs=/usr/share/dpkg/no-pie-compile.specs',
        '-fstack-protector-strong', '-Wformat', '-Werror=format-security',
        '-I/usr/local/include', '-L/usr/local/lib'],
    extra_link_args=['-Wl,-Bsymbolic-functions',
        '-specs=/usr/share/dpkg/no-pie-link.specs', '-Wl,-z,relro',
        '-specs=/usr/share/dpkg/no-pie-compile.specs',
        '-D_FORTIFY_SOURCE=2', '-fPIC'],
    libraries=['pthread', 'expat', 'z', 'dl', 'util', 'm', 'tensorflow'])

# if you want to simply access a global variable you just use its name.
# However to change its value you need to use the global keyword.
python_file = open("./DTPU_delegate.py", "r")
ffibuilder.embedding_init_code(python_file.read())

ffibuilder.compile(target="DTPU_delegate.*", verbose=True)

cpp_file.close()
python_file.close()
```

#### C++ code of the library:

```
/// release dependent libraries tensorflow r2.1
#include <tensorflow/lite/c/builtin_op_data.h>
#include <tensorflow/lite/c/c_api_internal.h>
#include <tensorflow/lite/builtin_ops.h>
#include <tensorflow/lite/context_util.h>
#include <tensorflow/c/c_api.h>
#include <vector>
#include <stdio.h>
#include <time.h>
#define DEBUG 1

// Functions implemented in the Python sublayer (declared extern "Python" in the cffi script)
static bool destroy_p(void);
static bool CopyFromBufferHandle_p(void);
static bool CopyToBufferHandle_p(void);
static void FreeBufferHandle_p(void);
static bool SelectDataTypeComputation_p(int);
static bool Init_p(int, int, int);
static bool Prepare_p(int);
static bool Invoke_p(bool, int);
static void load_overlay(void);
static bool ResetHardware_p(void);
static void push_weight_to_heap(void *, int *, int);
static void push_input_tensor_to_heap(void *, int *, int);
static void push_output_tensor_to_heap(void *, int *, int);
static bool print_power_consumption_p(void);
static bool start_power_consumption(void);
static void activate_time_probe_p(bool);
static bool print_python_time_probes(void);

/*
  possible operations
    kTfLiteBuiltinAdd = 0,
    kTfLiteBuiltinConcatenation = 2,
    kTfLiteBuiltinConv2d = 3,
    kTfLiteBuiltinDepthwiseConv2d = 4,
    kTfLiteBuiltinDepthToSpace = 5,
    kTfLiteBuiltinFullyConnected = 9,
    kTfLiteBuiltinMul = 18,
    kTfLiteBuiltinSub = 41,
    kTfLiteBuiltinDelegate = 51,
    kTfLiteBuiltinAddN = 106,
*/
struct timespec ts_start, ts_end;

// Data type configuration; NO_FP stays at -1 until a data type is selected
int bit_width_computation;
int NO_FP = -1;
bool signed_computation = false;
bool only_con2d = false;
```

```
// time probes
bool time_probe = false;
int n_execution = 0;
double avg_time_delegate;
double avg_time_data_exchange;

using namespace tflite;

#ifdef __cplusplus
extern "C" {
#endif
// This is where the execution of the operations or whole graph happens.
// The class below has an empty implementation just as a guideline
// on the structure.
class DTPU_delegate {
 public:
  // Returns true if my delegate can handle this type of op.
  static bool SupportedOp(const TfLiteRegistration* registration) {
    // from builtin_ops.h
    #ifdef DEBUG
    printf("[DEBUG - C]--- Supported Operation of DTPU delegate class ---- \n");
    #endif
    switch (registration->builtin_code) {
      /*case kTfLiteBuiltinConv2d:
        only_con2d=true;
        #ifdef DEBUG
        printf("[DEBUG-C]-- Supported operations only 2d convolution ----\n");
        #endif
        */
      case kTfLiteBuiltinDepthwiseConv2d:
        #ifdef DEBUG
        printf("[DEBUG - C]--Hello world! I can make 2D convolution and depth wise 2D convolution ---\n");
        #endif
        return true;
      default:
        return false;
    }
  }

  // Any initialization code needed
  bool Init(TfLiteContext* context, const TfLiteDelegateParams* delegate_params) {
    #ifdef DEBUG
    printf("[DEBUG - C]--- Init of DTPU delegate class --- \n");
    #endif

    #ifdef DEBUG
    printf("[DEBUG - C]--- Init of DTPU delegate class check if tensors indexes are equal to the ones in the Invoke --- \n");
    for (int input_index : TfLiteIntArrayView(delegate_params->input_tensors)) {
      printf("[DEBUG - C]--- Init of DTPU delegate class getting tensors %d--- \n", input_index);
    }
    #endif
    // reset the statistics collected by the time probes
    if (time_probe) {
      avg_time_delegate = 0.00;
      avg_time_data_exchange = 0.00;
      n_execution = 0;
    }
    // instantiate buffers and soft reset of accelerator
    return Init_p(context->tensors_size, delegate_params->input_tensors->size,
                  delegate_params->output_tensors->size);
  }

  // Any preparation work needed (e.g. allocate buffers)
  TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
    #ifdef DEBUG
    printf("[DEBUG - C]--- Prepare of DTPU delegate class --- \n");
    #endif
    // initialize, link the buffers according to the size of node data
    // kTfLiteMmapRo aka weights
    int num_weight_tensor = 0;
    // set precision check
    if (NO_FP == -1) {
      printf("ERROR! Need to execute SelectDataTypeComputation function before calling the Tensorflow Interpreter \n");
      return kTfLiteError;
    }
```

```
    for (int input_index : TfLiteIntArrayView(node->inputs)) {
      // one of this should be the weight tensor
      auto& in_t = context->tensors[input_index];
      if (in_t.allocation_type == kTfLiteMmapRo) {
        num_weight_tensor++;
        #ifdef DEBUG
        printf("[DEBUG -C]---found a tensor weight %d----\n", input_index);
        #endif
        // get dimension of tensors
        // push to python sublayer
        if (!NO_FP) {
          switch (bit_width_computation) {
            default:
            case 8:
              #ifdef DEBUG
              if (signed_computation) {
                printf("[DEBUG-C]---- kTfLiteInt8 ----\n");
              } else {
                printf("[DEBUG-C]---- kTfLiteUInt8 -----\n");
              }
              #endif
              if (signed_computation) {
                push_weight_to_heap(in_t.data.int8, in_t.dims->data, in_t.dims->size);
              } else {
                push_weight_to_heap(in_t.data.uint8, in_t.dims->data, in_t.dims->size);
              }
              break;
            case 16:
              #ifdef DEBUG
              printf("[DEBUG-C]---- kTfLiteInt16 ----\n");
              #endif
              push_weight_to_heap(in_t.data.i16, in_t.dims->data, in_t.dims->size);
              break;
            case 32:
              #ifdef DEBUG
              printf("[DEBUG-C]---- kTfLiteInt32 ----\n");
              #endif
              push_weight_to_heap(in_t.data.i32, in_t.dims->data, in_t.dims->size);
              break;
```

```
case 64:
170
             #ifdef DEBUG
171
                    printf("[DEBUG-C]---- kTfLiteInt64 ----\n");
172
             #endif
173
                    push_weight_to_heap( in_t.data.i64, in_t.dims->
174
                        data, in_t.dims->size);
                    break;
176
177
         else { // use fp units
178
           switch (bit_width_computation){
179
           case 16:
                  if (context->allow_fp32_relax_to_fp16 && NO_FP==3 )
                     \{ // NO FP==3 \rightarrow fp active and bfp active \}
                     #ifdef DEBUG
182
                printf("[DEBUG-C]--- kTfLitefloat32 relaxed aka
183
                   bfp16 ----\n");
                #endif
                  // remembedr f16 is TfLiteFloat16 *
186
                /*typedef struct {
                      uint16 t data;
188
                    } TfLiteFloat16;
189
                    * /
                push_weight_to_heap( in_t.data.f16, in_t.dims->data,
191
                    in_t.dims->size);
193
                  break;
           case 32:
                #ifdef DEBUG
                printf("[DEBUG-C]---- kTfLitefloat32 ----\n");
                #endif
197
                  push_weight_to_heap( in_t.data.f, in_t.dims->data,
198
                       in t.dims->size);
                  break;
           default:
                printf("[DEBUG-C]--- ERROR! no fp precision defined
201
                         ---\n"):
                break;
         }
203
205
206
207
       #ifdef DEBUG
208
       printf("[DEBUG-C]--- number of weights found= %d \n",
209
          num_weight_tensor);
       #endif
         if (Prepare_p(num_weight_tensor)){
211
```

```
return kTfLiteOk;
213
        return kTfLiteError;
214
215
    // Actual running of the delegate subgraph.
    TfLiteStatus Invoke(TfLiteContext* context, TfLiteNode* node) {
      struct timespec ts_start, ts_end;
      int curr_input = 0;
      #ifdef DEBUG
      printf("[DEBUG - C]--- Invoke of DTPU delegate class --- \n");
      printf("[DEBUG - C]--- Invoke of DTPU delegate class getting tensors --- \n");
      #endif

      if (time_probe) {
        if (!timespec_get(&ts_start, TIME_UTC)) {
          fprintf(stderr, "error during the acquisition of start time!\n");
          exit(-1);
        }
      }
      // run inference on the delegate and data transfer to/from memory/accelerator
      for (int input_index : TfLiteIntArrayView(node->inputs)) {
        // one of these should be the weight tensor
        #ifdef DEBUG
        printf("[DEBUG - C]--- Invoke of DTPU delegate class getting tensors %d--- \n", input_index);
        #endif
        TfLiteTensor in_t = context->tensors[input_index];
        if (!(in_t.allocation_type == kTfLiteMmapRo)) { // because the weights have been transferred in the Prepare method
          if (curr_input != 0) {
            curr_input = input_index;
          }
          // get dimension of tensors
          // push to python sublayer
          if (!NO_FP) {
            switch (bit_width_computation) {
              default:
              case 8:
                #ifdef DEBUG
                if (signed_computation) {
                  printf("[DEBUG-C]---- kTfLiteInt8 ----\n");
                } else {
                  printf("[DEBUG-C]---- kTfLiteUInt8 -----\n");
                }
                #endif
                if (signed_computation) {
                  push_input_tensor_to_heap(in_t.data.int8, in_t.dims->data, in_t.dims->size);
                } else {
                  push_input_tensor_to_heap(in_t.data.uint8, in_t.dims->data, in_t.dims->size);
                }
                break;
              case 16:
                #ifdef DEBUG
                printf("[DEBUG-C]---- kTfLiteInt16 ----\n");
                #endif
                push_input_tensor_to_heap(in_t.data.i16, in_t.dims->data, in_t.dims->size);
                break;
              case 32:
                #ifdef DEBUG
                printf("[DEBUG-C]---- kTfLiteInt32 ----\n");
                #endif
                push_input_tensor_to_heap(in_t.data.i32, in_t.dims->data, in_t.dims->size);
                break;
              case 64:
                #ifdef DEBUG
                printf("[DEBUG-C]---- kTfLiteInt64 ----\n");
                #endif
                push_input_tensor_to_heap(in_t.data.i64, in_t.dims->data, in_t.dims->size);
                break;
            }
          } else { // use fp units
            switch (bit_width_computation) {
              case 16:
                if (context->allow_fp32_relax_to_fp16 && NO_FP == 3) { // NO_FP==3 -> fp active and bfp active
                  #ifdef DEBUG
                  printf("[DEBUG-C]--- kTfLitefloat32 relaxed aka bfp16 ----\n");
                  #endif
                  push_input_tensor_to_heap(in_t.data.f16, in_t.dims->data, in_t.dims->size);
                }
                break;
              case 32:
                #ifdef DEBUG
                printf("[DEBUG-C]---- kTfLitefloat32 ----\n");
                #endif
                push_input_tensor_to_heap(in_t.data.f, in_t.dims->data, in_t.dims->size);
                break;
              default:
                printf("[DEBUG-C]---- ERROR! no fp precision defined ---\n");
                break;
            }
          }
        }
      }

      for (int output_index : TfLiteIntArrayView(node->outputs)) {
        auto& out_t = context->tensors[output_index];
        // get dimension of tensors
        // push to python sublayer
        #ifdef DEBUG
        printf("[DEBUG - C]--- Invoke of DTPU delegate class getting output tensors %d--- \n", output_index);
        #endif

        if (!NO_FP) {
          switch (bit_width_computation) {
            default:
            case 8:
              #ifdef DEBUG
              if (signed_computation) {
                printf("[DEBUG-C]---- kTfLiteInt8 -----\n");
              } else {
                printf("[DEBUG-C]---- kTfLiteUInt8 -----\n");
              }
              #endif
              if (signed_computation) {
                push_output_tensor_to_heap(out_t.data.int8, out_t.dims->data, out_t.dims->size);
              } else {
                push_output_tensor_to_heap(out_t.data.uint8, out_t.dims->data, out_t.dims->size);
              }
              break;
            case 16:
              #ifdef DEBUG
              printf("[DEBUG-C]---- kTfLiteInt16 ----\n");
              #endif
              push_output_tensor_to_heap(out_t.data.i16, out_t.dims->data, out_t.dims->size);
              break;
            case 32:
              #ifdef DEBUG
              printf("[DEBUG-C]---- kTfLiteInt32 ----\n");
              #endif
              push_output_tensor_to_heap(out_t.data.i32, out_t.dims->data, out_t.dims->size);
              break;
            case 64:
              #ifdef DEBUG
              printf("[DEBUG-C]---- kTfLiteInt64 ----\n");
              #endif
              push_output_tensor_to_heap(out_t.data.i64, out_t.dims->data, out_t.dims->size);
              break;
          }
        } else { // use fp units
          switch (bit_width_computation) {
            case 16:
              if (context->allow_fp32_relax_to_fp16 && NO_FP == 3) { // NO_FP==3 -> fp active and bfp active
                #ifdef DEBUG
                printf("[DEBUG-C]--- kTfLitefloat32 relaxed aka bfp16 ----\n");
                #endif
                push_output_tensor_to_heap(out_t.data.f16, out_t.dims->data, out_t.dims->size); // a uint16 pointer
              }
              break;
            case 32:
              #ifdef DEBUG
              printf("[DEBUG-C]---- kTfLitefloat32 ----\n");
              #endif
              push_output_tensor_to_heap(out_t.data.f, out_t.dims->data, out_t.dims->size);
              break;
            default:
              printf("[DEBUG-C]--- ERROR! no fp precision defined ---\n");
              break;
          }
        }
      }

      if (time_probe) {
        if (!timespec_get(&ts_end, TIME_UTC)) {
          fprintf(stderr, "error during the acquisition of end time!\n");
          exit(-1);
        }
        // update average and execution time (milliseconds)
        avg_time_data_exchange += ts_end.tv_sec * 1000 + ((double) ts_end.tv_nsec) / 1000000
                                - ts_start.tv_sec * 1000 - ((double) ts_start.tv_nsec) / 1000000;
        n_execution++;
      }

      if (time_probe) {
        if (!timespec_get(&ts_start, TIME_UTC)) {
          fprintf(stderr, "error during the acquisition of start time!\n");
          exit(-1);
        }
      }
      if (Invoke_p(only_con2d, curr_input)) {
        if (time_probe) {
          if (!timespec_get(&ts_end, TIME_UTC)) {
            fprintf(stderr, "error during the acquisition of end time!\n");
            exit(-1);
          }
          avg_time_delegate += ts_end.tv_sec * 1000 + ((double) ts_end.tv_nsec) / 1000000
                             - ts_start.tv_sec * 1000 - ((double) ts_start.tv_nsec) / 1000000;
        }
        return kTfLiteOk;
      }
      return kTfLiteError;
    }
};
```
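
The `Prepare` and `Invoke` methods above repeat the same decision three times: from `NO_FP`, `bit_width_computation` and `signed_computation` they choose which field of the tensor's data union is handed to the Python sublayer. The following is a minimal Python sketch of that decision as a single function, written only to make the selection rule easier to read. The function name `select_ptr_field` and the `allow_fp16_relax` parameter (standing in for `context->allow_fp32_relax_to_fp16`) are assumptions introduced for illustration and are not part of the library.

```python
def select_ptr_field(no_fp, bit_width, signed, allow_fp16_relax=True):
    """Return the name of the data-union field the listing would read.

    Hypothetical helper: it mirrors the nested switch statements of the
    C++ listing and returns None where the listing pushes nothing.
    """
    if not no_fp:
        # Integer path: 16/32/64 have their own case labels.
        int_fields = {16: "i16", 32: "i32", 64: "i64"}
        if bit_width in int_fields:
            return int_fields[bit_width]
        # 8 bit shares its label with `default`, so any other width lands here.
        return "int8" if signed else "uint8"
    # Floating-point path.
    if bit_width == 16:
        # bfloat16 is pushed only when relaxation is allowed and NO_FP == 3.
        return "f16" if (allow_fp16_relax and no_fp == 3) else None
    if bit_width == 32:
        return "f"
    return None  # the listing only prints an error in this case


if __name__ == "__main__":
    for cfg in [(0, 8, True), (0, 8, False), (0, 64, False), (3, 16, False), (1, 32, False)]:
        print(cfg, "->", select_ptr_field(*cfg))
```

Expressing the rule once makes it visible that the integer `default` label falls through to the 8-bit case, while the floating-point `default` pushes no tensor at all.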

```
TfLiteStatus SelectDataTypeComputation(int data_type) {
  #ifdef DEBUG
  printf("[DEBUG - C]--- SelectDataTypeComputation of DTPU delegate class --- \n");
  #endif
  int precision = data_type & 0x000f;
  signed_computation = ((data_type & 0x00100) >> 8) == 1 ? true : false;
  NO_FP = (data_type & 0x060) >> 5;
  switch (precision) {
    default:
    case 1: // INT8
      bit_width_computation = 8;
      break;
    case 3: // INT16
      bit_width_computation = 16;
      break;
    case 7: // INT32
      bit_width_computation = 32;
      break;
    case 15: // INT64
      bit_width_computation = 64;
      break;
  }
  // check compatibility of signed and unsigned
  if (signed_computation && bit_width_computation != 8) {
    printf("ERROR-> signed/unsigned distinction is only compatible with 8 bit computation");
    return kTfLiteError;
  }
  if (SelectDataTypeComputation_p(data_type)) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

TfLiteStatus ResetHardware() {
  #ifdef DEBUG
  printf("[DEBUG - C]--- Reset underlying hardware --- \n");
  #endif
  if (ResetHardware_p()) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

```
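
`SelectDataTypeComputation` above unpacks three fields from a single integer: the precision code in bits 0 to 3, the floating-point mode in bits 5 and 6, and the signed flag in bit 8. A minimal Python sketch of the same decoding follows; the masks, shifts and precision codes are taken from the listing, while the function name `decode_data_type` and the returned tuple layout are assumptions made for illustration.

```python
def decode_data_type(data_type):
    """Split a packed data_type word into (bit_width, signed, fp_mode).

    Hypothetical helper mirroring the masks used in the C++ listing.
    """
    precision = data_type & 0x000F           # bits 0..3: precision code
    signed = ((data_type & 0x0100) >> 8) == 1  # bit 8: signed flag
    fp_mode = (data_type & 0x0060) >> 5      # bits 5..6: floating-point mode
    # Precision codes from the listing; unknown codes fall back to 8 bit,
    # as the `default` label does in the C++ switch.
    bit_width = {0x1: 8, 0x3: 16, 0x7: 32, 0xF: 64}.get(precision, 8)
    return bit_width, signed, fp_mode


if __name__ == "__main__":
    # 16-bit precision code with both floating-point bits set.
    print(decode_data_type(0x3 | (0x3 << 5)))
    # Signed 8-bit integer computation.
    print(decode_data_type(0x1 | (0x1 << 8)))
```

Note that the listing additionally rejects the signed flag for any width other than 8 bit; the sketch leaves that check to the caller.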

```
// Create the TfLiteRegistration for the Kernel node which will replace
// the subgraph in the main TfLite graph.
TfLiteRegistration GetMyDelegateNodeRegistration() {
  // This is the registration for the Delegate Node that gets added to
  // the TFLite graph instead of the subGraph it replaces.
  // It is treated as an OP node. But in our case
  // Init will initialize the delegate
  // Invoke will run the delegate graph.
  // Prepare for preparing the delegate.
  // Free for any cleaning needed by the delegate.
  #ifdef DEBUG
  printf("[DEBUG - C] --- get delegate node registration function ---\n");
  #endif
  TfLiteRegistration kernel_registration;
  kernel_registration.builtin_code = kTfLiteBuiltinDelegate;
  kernel_registration.custom_name = "DTPU_delegate";
  kernel_registration.free = [](TfLiteContext* context, void* buffer) -> void {
    delete reinterpret_cast<DTPU_delegate*>(buffer);
  };
  kernel_registration.init = [](TfLiteContext* context, const char* buffer,
                                size_t) -> void* {
    // In the node init phase, initialize MyDelegate instance
    const TfLiteDelegateParams* delegate_params =
        reinterpret_cast<const TfLiteDelegateParams*>(buffer);
    DTPU_delegate* my_delegate = new DTPU_delegate;
    if (!my_delegate->Init(context, delegate_params)) {
      return nullptr;
    }
    return my_delegate;
  };
  kernel_registration.invoke = [](TfLiteContext* context,
                                  TfLiteNode* node) -> TfLiteStatus {
    DTPU_delegate* kernel = reinterpret_cast<DTPU_delegate*>(node->user_data);
    return kernel->Invoke(context, node);
  };
  kernel_registration.prepare = [](TfLiteContext* context,
                                   TfLiteNode* node) -> TfLiteStatus {
    DTPU_delegate* kernel = reinterpret_cast<DTPU_delegate*>(node->user_data);
    return kernel->Prepare(context, node);
  };

  return kernel_registration;
}

// TfLiteDelegate methods
// interface to tensorflow runtime
TfLiteStatus DelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
  // Claim all nodes that can be evaluated by the delegate and ask the
  // framework to update the graph with delegate kernel instead.
  // Reserve 1 element, since we need first element to be size.
  #ifdef DEBUG
  printf("[DEBUG - C] ---- preparing the delegate ----\n");
  #endif
  std::vector<int> supported_nodes(1);
  TfLiteIntArray* plan;
  TF_LITE_ENSURE_STATUS(context->GetExecutionPlan(context, &plan));
  TfLiteNode* node;
  TfLiteRegistration* registration;
  for (int node_index : tflite::TfLiteIntArrayView(plan)) {
    TF_LITE_ENSURE_STATUS(context->GetNodeAndRegistration(
        context, node_index, &node, &registration));
    if (DTPU_delegate::SupportedOp(registration)) {
      supported_nodes.push_back(node_index);
    }
  }
  // Set first element to the number of nodes to replace.
  supported_nodes[0] = supported_nodes.size() - 1;
  TfLiteRegistration my_delegate_kernel_registration =
      GetMyDelegateNodeRegistration();

  // This call splits the graph into subgraphs; subgraphs that can be
  // handled by the delegate are replaced with
  // 'my_delegate_kernel_registration'
  return context->ReplaceNodeSubsetsWithDelegateKernels(
      context, my_delegate_kernel_registration,
      reinterpret_cast<TfLiteIntArray*>(supported_nodes.data()), delegate);
}

void FreeBufferHandle(TfLiteContext* context, TfLiteDelegate* delegate,
                      TfLiteBufferHandle* handle) {
  #ifdef DEBUG
  printf("[DEBUG - C]--- Do any cleanups---\n");
  #endif
  FreeBufferHandle_p();
}

TfLiteStatus CopyToBufferHandle(TfLiteContext* context,
                                TfLiteDelegate* delegate,
                                TfLiteBufferHandle buffer_handle,
                                TfLiteTensor* tensor) {
  #ifdef DEBUG
  printf("[DEBUG - C]--- Copies data from tensor to delegate buffer if needed.---\n");
  #endif
  if (CopyToBufferHandle_p()) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

TfLiteStatus CopyFromBufferHandle(TfLiteContext* context,
                                  TfLiteDelegate* delegate,
                                  TfLiteBufferHandle buffer_handle,
                                  TfLiteTensor* tensor) {
  #ifdef DEBUG
  printf("[DEBUG - C]---Copies the data from delegate buffer into the tensor raw memory---\n");
  #endif
  if (CopyFromBufferHandle_p()) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

TfLiteStatus activate_time_probe(bool activate) {
  #ifdef DEBUG
  printf("[DEBUG-C]---- activating time probes ----\n");
  #endif
  if (!time_probe && activate) {
    time_probe = true;
    #ifdef DEBUG
    printf("[DEBUG-C]--- activated time probes ----\n");
    #endif
    activate_time_probe_p(activate);
  } else {
    printf("ATTENTION! Time probes are not active\n");
  }

  return kTfLiteOk;
}
```

```
TfLiteStatus print_execution_stats() {
  #ifdef DEBUG
  printf("[DEBUG - C]---- printing time probes of the library ----\n");
  #endif
  printf("If you are seeing too many zeros you probably did not set the time probes variable to true!\n");
  // print c time probes
  printf("Overall time of delegate invoke: %3f [ms]\n", avg_time_delegate / n_execution);
  printf("Data exchange between interfaces (C->Python->C): %3f [ms]\n", avg_time_data_exchange / n_execution);
  // print python time probes
  if (print_python_time_probes()) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

TfLiteStatus measure_power_consumption() {
  #ifdef DEBUG
  printf("[DEBUG - C]---Measuring power consumption of the accelerator during invoke ----\n");
  #endif
  if (start_power_consumption()) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

TfLiteStatus print_power_consumption() {
  #ifdef DEBUG
  printf("[DEBUG - C]--- Printing power consumption of the accelerator during invoke ----\n");
  #endif
  if (print_power_consumption_p()) {
    return kTfLiteOk;
  }
  return kTfLiteError;
}

```
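
The time probes in `Invoke` accumulate elapsed milliseconds from two `timespec` readings, and `print_execution_stats` divides the accumulated totals by `n_execution`. The arithmetic can be sketched in Python as below. The class name `TimeProbe` and its methods are assumptions introduced for illustration; the zero-sample guard is an addition of this sketch and is not present in the listing, where a division by a zero `n_execution` is possible if the probes were never activated.

```python
def elapsed_ms(start, end):
    """Elapsed time in milliseconds between two (tv_sec, tv_nsec) pairs.

    Same formula as the listing: seconds * 1000 + nanoseconds / 1e6.
    """
    start_ms = start[0] * 1000 + start[1] / 1_000_000
    end_ms = end[0] * 1000 + end[1] / 1_000_000
    return end_ms - start_ms


class TimeProbe:
    """Hypothetical accumulator mirroring avg_time_* and n_execution."""

    def __init__(self):
        self.total_ms = 0.0
        self.n_execution = 0

    def add(self, start, end):
        self.total_ms += elapsed_ms(start, end)
        self.n_execution += 1

    def average(self):
        # Guard added in this sketch: report 0.0 when nothing was measured.
        if self.n_execution == 0:
            return 0.0
        return self.total_ms / self.n_execution


if __name__ == "__main__":
    probe = TimeProbe()
    probe.add((1, 500_000_000), (2, 250_000_000))
    probe.add((10, 0), (10, 250_000_000))
    print("average [ms]:", probe.average())
```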

```
// instantiate the delegate, it returns null if there is an
TfLiteDelegate* tflite_plugin_create_delegate() {
  // char** argv, char** argv2, size_t argc, void (*report_error)(const char*))
  TfLiteDelegate* delegate = new TfLiteDelegate;

  delegate->data_ = nullptr;
  delegate->flags = kTfLiteDelegateFlagsNone;
  delegate->Prepare = &DelegatePrepare;
  // This cannot be null.
  delegate->CopyFromBufferHandle = &CopyFromBufferHandle;
  // This can be null.
  delegate->CopyToBufferHandle = &CopyToBufferHandle;
  // This can be null.
  delegate->FreeBufferHandle = &FreeBufferHandle;
  // load overlay
  load_overlay();
  #ifdef DEBUG
  printf("[DEBUG - C] ---the delegate method of DTPU is born for TensorFlow %s---\n", TF_Version());
  #endif
  return delegate;
}

void tflite_plugin_destroy_delegate(void* delegate_op, void* argtypes) {
  // destroy the delegate
  TfLiteDelegate* delegate = (TfLiteDelegate*) delegate_op;
  #ifdef DEBUG
  printf("[DEBUG - C]----cleaning memory -> callback of python function ---\n");
  #endif
  if (!destroy_p()) {
    printf("ERROR IN FREEING BUFFERS!");
  }
  // free(argtypes);
  delete delegate; // allocated with new in tflite_plugin_create_delegate
}
#ifdef __cplusplus
} // extern "C"
#endif
```
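
In `DelegatePrepare` the vector `supported_nodes` is reinterpreted as a `TfLiteIntArray`, which is why its first element is reserved for the element count: the C structure stores an `int size` followed by the `int data[]` entries. The Python sketch below illustrates that memory layout with the standard `struct` module. The helper names are assumptions for illustration only, and the sketch assumes 4-byte integers in native byte order.

```python
import struct


def build_node_array(node_indices):
    """Return [count, idx0, idx1, ...], the layout built in DelegatePrepare."""
    return [len(node_indices), *node_indices]


def pack_as_int_array(values):
    """Pack the list as consecutive 4-byte ints (native byte order)."""
    return struct.pack(f"={len(values)}i", *values)


def unpack_int_array(raw):
    """Read the buffer back as a (size, data) pair, size first."""
    (size,) = struct.unpack_from("=i", raw, 0)
    data = struct.unpack_from(f"={size}i", raw, struct.calcsize("=i"))
    return size, list(data)


if __name__ == "__main__":
    layout = build_node_array([2, 5, 9])  # three supported node indexes
    raw = pack_as_int_array(layout)
    print(layout, len(raw), unpack_int_array(raw))
```

Because the count travels in the same buffer as the indexes, the runtime can receive the whole list through a single pointer, which is what the `reinterpret_cast` in the listing relies on.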

Frozen Python code in the accelerator library:

```
from dtpu_lib import ffi
from pynq import Overlay
from pynq import allocate
from pynq import MMIO
from pynq import Xlnk
from pynq.lib import dma
import numpy as np
import math
import thread
import sys
import time
import struct # see https://docs.python.org/3/library/struct.html#struct-examples
_DEBUG_PRINT=True
_TIME_PROBES=False
##### memory map of xadc #####
C_BASEADDRESS=0x43C10000 #
SRR=0x0 # w software reset register
SR=0x04 # r status register
AOSR=0x08 # r alarm output status register
CONVSTR=0x0C # w Bit[0] = ADC convert start register (3) Bit[1] = Enable temperature update logic Bit[17:2] = Wait cycle for temperature update
SYSMONRR=0x10 # w xadc hard macro reset register
GIER=0x5C # rw global interrupt enable register
IPISR=0x60 # r toggle on write ip interrupt status register
IPIER=0x68 # rw ip interrupt enable register
TEMPERATURE=0x200 # The 12-bit Most Significant Bit (MSB) justified result of on-device temperature measurement is stored in this register
VCC_INT=0x204 # The 12-bit MSB justified result of on-device VCCINT supply monitor measurement is stored in this register.
VCC_AUX=0x208 # The 12-bit MSB justified result of on-device VCCAUX Data supply monitor measurement is stored in this register
VP_VN=0x20C # rw When read: The 12-bit MSB justified result of A/D conversion on the dedicated analog input channel (Vp/Vn) is stored in this register. When written: Write to this register resets the XADC hard macro
VREF_P=0x210 # r The 12-bit MSB justified result of A/D conversion on the reference input VREFP is stored in this register.
VREF_N=0x214 # r The 12-bit MSB justified result of A/D conversion on the reference input VREFN is stored in this register.
VCC_BRAM=0x218 # r The 12-bit MSB justified result of A/D conversion on the reference input VBRAM is stored in this register
SUPPLY_A_OFFSET=0x220 # r The calibration coefficient for the supply sensor offset of ADC A is stored in this register
ADC_A_OFFSET=0x224 # r The calibration coefficient for the ADC A offset calibration is stored in this register.
ADC_A_GAIN_ERR=0x228 # r The calibration coefficient for the gain error of ADC A is stored in this register.
DEV_CORE_SUPPLY=0x234 # r The VCCINT of PSS core supply. Present only on Zynq-7000 devices.
DEV_AUX_CORE_SUPPLY=0x238 # r The VCCAUX of PSS core supply. Present only on Zynq-7000 devices.
DEV_CORE_MEM_SUPPLY=0x23C # r The VCCMEM of PSS core supply. Present only on Zynq-7000 devices
# v aux p/n
V_AUX_0=0x240 # r The 12-bit MSB justified result of A/D conversion on the auxiliary analog input 0 is stored in this register.
V_AUX_1=0x244 # r
V_AUX_2=0x248 # r
V_AUX_3=0x24C # r
V_AUX_4=0x250 # r
V_AUX_5=0x254 # r
V_AUX_6=0x258 # r
V_AUX_7=0x25C # r
V_AUX_8=0x260 # r
V_AUX_9=0x264 # r
V_AUX_10=0x268 # r
V_AUX_11=0x26C # r
V_AUX_12=0x270 # r
V_AUX_13=0x274 # r
V_AUX_14=0x278 # r
V_AUX_15=0x27C # r
## value of 12 bit msb
MAX_TMP=0x280
MAX_VCC_INT=0x284
MAX_VCC_AUX=0x288
MAX_V_BRAM=0x28C
MIN_TMP=0x290
MIN_VCC_INT=0x294
MIN_VCC_AUX=0x298
MIN_V_BRAM=0x29C
MAX_VCC_PINT=0x2A0 # r
MAX_VCC_PAUX=0x2A4 # r
MAX_VCC_DDRO=0x2A8 # r
MIN_VCC_PINT=0x2B0 # r
MIN_VCC_PAUX=0x2B4 # r
MIN_VCC_DDRO=0x2B8 # r
SUPPLY_B_OFFSET=0x2C0 # r The calibration coefficient for the supply sensor offset of ADC B is stored in this register
ADC_B_OFFSET=0x2C4 # r The calibration coefficient for the ADC B offset calibration is stored in this register.
ADC_B_GAIN_ERR=0x2C8 # r The calibration coefficient for the gain error of ADC B is stored in this register.
FLAGS=0x2FC # The 16-bit register gives general status information of ALARM, Over Temperature (OT), Disable XADC information. Whether the XADC is using the internal reference voltage or external reference voltage is also p
CONF_REG_0=0x300 # rw
CONF_REG_1=0x304 # rw
CONF_REG_2=0x308 # rw
SEQ_REG_0=0x320 # r/w adc channel selection
SEQ_REG_1=0x324 # r/w adc channel selection
SEQ_REG_2=0x328 # r/w adc channel average enable
SEQ_REG_3=0x32C # r/w adc channel average enable
SEQ_REG_4=0x330 # r/w adc channel analog input mode
SEQ_REG_5=0x334 # r/w adc channel analog input mode
SEQ_REG_6=0x338 # r/w adc channel acquisition
SEQ_REG_7=0x33C # r/w adc channel acquisition
ALLARM_THRESHOLD_0=0x340 # rw The 12-bit MSB justified alarm threshold register 0 Temperature Upper
ALLARM_THRESHOLD_1=0x344 # rw The 12-bit MSB justified alarm threshold register 1 VCCINT Upper
ALLARM_THRESHOLD_2=0x348 # rw The 12-bit MSB justified alarm threshold register 2 VCCAUX Upper
ALLARM_THRESHOLD_3=0x34C # rw The 12-bit MSB justified alarm threshold register 3 OT Upper
ALLARM_THRESHOLD_4=0x350 # rw The 12-bit MSB justified alarm threshold register 4 Temperature Lower
ALLARM_THRESHOLD_5=0x354 # rw The 12-bit MSB justified alarm threshold register 5 VCCINT Lower
ALLARM_THRESHOLD_6=0x358 # rw The 12-bit MSB justified alarm threshold register 6 VCCAUX Lower
ALLARM_THRESHOLD_7=0x35C # rw The 12-bit MSB justified alarm threshold register 7 OT Lower
ALLARM_THRESHOLD_8=0x360 # rw The 12-bit MSB justified alarm threshold register 8 VBRAM Upper
ALLARM_THRESHOLD_9=0x364 # rw The 12-bit MSB justified alarm threshold register 9 VCCPint Upper This register is only on Zynq-7000 devices.
ALLARM_THRESHOLD_10=0x368 # rw The 12-bit MSB justified alarm threshold register 10 VCCPaux Upper This register is only on Zynq-7000 devices
ALLARM_THRESHOLD_11=0x36C # rw The 12-bit MSB justified alarm threshold register 11 VCCDDRO Upper This register is only on Zynq-7000 devices
ALLARM_THRESHOLD_12=0x370 # rw The 12-bit MSB justified alarm threshold register 12 VBRAM Lower
ALLARM_THRESHOLD_13=0x374 # rw The 12-bit MSB justified alarm threshold register 13 VCCPint Lower This register is only on Zynq-7000 devices
ALLARM_THRESHOLD_14=0x378 # rw The 12-bit MSB justified alarm threshold register 14 VCCPaux Lower This register is only on Zynq-7000 devices
ALLARM_THRESHOLD_15=0x37C # rw The 12-bit MSB justified alarm threshold register 15 VCCDDRO Lower This register is only on Zynq-7000 devices
BASE_ADDRESS_ACCELERATOR=0x43C00000
ADDRESS_RANGE_ACCELERATOR=0x10000
# address reg offset
CTRL = 0x0000
STATUS = 0x0004
IARG_RQT_EN = 0x0010
OARG_RQT_EN = 0x0014
CMD = 0x0028
OARG_LENGTH_MODE = 0x003C
ISCALAR_FIFO_RST = 0x0040
OSCALAR_FIFO_RST = 0x0044
ISCALAR_RQT_EN = 0x0048
OSCALAR_RQT_EN = 0x004C
ISCALAR0_DATA = 0x0080
ISCALAR1_DATA = 0x0084
ISCALAR2_DATA = 0x0088
ISCALAR3_DATA = 0x008C
ISCALAR4_DATA = 0x0090
ISCALAR5_DATA = 0x0094
ISCALAR6_DATA = 0x0098
ISCALAR7_DATA = 0x009C
ISCALAR8_DATA = 0x00A0
ISCALAR9_DATA = 0x00A4
ISCALAR10_DATA = 0x00A8
ISCALAR11_DATA = 0x00AC
ISCALAR12_DATA = 0x00B0
ISCALAR13_DATA = 0x00B4
ISCALAR14_DATA = 0x00B8
ISCALAR15_DATA = 0x00BC
OSCALAR0_DATA = 0x00C0
OSCALAR1_DATA = 0x00C4
OSCALAR2_DATA = 0x00C8
OSCALAR3_DATA = 0x00CC
OSCALAR4_DATA = 0x00D0
OSCALAR5_DATA = 0x00D4
OSCALAR6_DATA = 0x00D8
OSCALAR7_DATA = 0x00DC
IARG0_STATUS = 0x0100
IARG1_STATUS = 0x0104
IARG2_STATUS = 0x0108
IARG3_STATUS = 0x010C
IARG4_STATUS = 0x0110
IARG5_STATUS = 0x0114
IARG6_STATUS = 0x0118
IARG7_STATUS = 0x011C
OARG0_STATUS = 0x0140
OARG1_STATUS = 0x0144
OARG2_STATUS = 0x0148
OARG3_STATUS = 0x014C
OARG4_STATUS = 0x0150
OARG5_STATUS = 0x0154
OARG6_STATUS = 0x0158
OARG7_STATUS = 0x015C
ISCALAR0_STATUS = 0x0180
ISCALAR1_STATUS = 0x0184
ISCALAR2_STATUS = 0x0188
ISCALAR3_STATUS = 0x018C
ISCALAR4_STATUS = 0x0190
ISCALAR5_STATUS = 0x0194
ISCALAR6_STATUS = 0x0198
ISCALAR7_STATUS = 0x019C
ISCALAR8_STATUS = 0x01A0
ISCALAR9_STATUS = 0x01A4
ISCALAR10_STATUS = 0x01A8
ISCALAR11_STATUS = 0x01AC
ISCALAR12_STATUS = 0x01B0
ISCALAR13_STATUS = 0x01B4
ISCALAR14_STATUS = 0x01B8
ISCALAR15_STATUS = 0x01BC
OSCALAR0_STATUS = 0x01C0
OSCALAR1_STATUS = 0x01C4
OSCALAR2_STATUS = 0x01C8
OSCALAR3_STATUS = 0x01CC
OSCALAR4_STATUS = 0x01D0
OSCALAR5_STATUS = 0x01D4
OSCALAR6_STATUS = 0x01D8
OSCALAR7_STATUS = 0x01DC
OSCALAR8_STATUS = 0x01E0
OSCALAR9_STATUS = 0x01E4
OSCALAR10_STATUS = 0x01E8
OSCALAR11_STATUS = 0x01EC
OSCALAR12_STATUS = 0x01F0
OSCALAR13_STATUS = 0x01F4
OSCALAR14_STATUS = 0x01F8
OSCALAR15_STATUS = 0x01FC
OARG0_LENGTH = 0x0200
OARG1_LENGTH = 0x0204
OARG2_LENGTH = 0x0208
OARG3_LENGTH = 0x020C
OARG4_LENGTH = 0x0210
OARG5_LENGTH = 0x0214
OARG6_LENGTH = 0x0218
OARG7_LENGTH = 0x021C
OARG0_TDEST = 0x0240
OARG1_TDEST = 0x0244
OARG2_TDEST = 0x0248
OARG3_TDEST = 0x024C
OARG4_TDEST = 0x0250
OARG5_TDEST = 0x0254
OARG6_TDEST = 0x0258
OARG7_TDEST = 0x025C
###########   CSR DEFINITIONS        ##########
###########   MEMORY MAP             ##########
###########   bitwidth 8             ##########
###########   see csr_definition.vh  ##########
ARITHMETIC_PRECISION=0
FP_MODE=1
BATCH_SIZE=2 # aka active rows
TRANSPARENT_DELAY_REGISTER=3
DEBUG=4
TEST_OPTIONS=5
ACTIVATE_CHAIN=0x1
INT8=0x1
INT16=0x3
INT32=0x7
INT64=0xF
# precision of fp computation is tuned using the
# integer precision
ACTIVE_FP=1
ACTIVE_BFP=0x03
ROUNDING=0x00
NO_FP=0x00
SIGNED=0x1
NO_SIGNED=0x0
WMEM_STARTING_ADDRESS=0 # 32 MSB
#### accelerator adapter command ##############
CMD_UPDATE_IN_ARG=0x0
CMD_UPDATE_OUT_ARG=0x1
CMD_EXECUTE_STEP=0x2
CMD_EXECUTE_CONTINOUS=0x4
CMD_STOP_EXECUTE_CONTINOUS=0x5
```
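The command opcodes and CSR fields above are later combined with shifts, for example `(CMD_EXECUTE_STEP<<16)` for a command and `(address<<32) | (NO_FP<<8) | (ACTIVATE_CHAIN<<4) | precision` for the precision register. The following is a minimal sketch of that packing, assuming the field positions used by the driver code in this listing; the helper names `command_word` and `precision_csr_word` are hypothetical and do not appear in the original library.

```python
# Constants repeated from the listing so that the sketch is self-contained.
CMD_UPDATE_IN_ARG = 0x0
CMD_UPDATE_OUT_ARG = 0x1
CMD_EXECUTE_STEP = 0x2
ACTIVATE_CHAIN = 0x1
NO_FP = 0x00
INT8, INT16, INT32, INT64 = 0x1, 0x3, 0x7, 0xF


def command_word(opcode, argument_mask=0):
    """Hypothetical helper: opcode from bit 16 upwards, argument mask in the low bits."""
    return (opcode << 16) | argument_mask


def precision_csr_word(wmem_start, precision, fp_mode=NO_FP, chain=ACTIVATE_CHAIN):
    """Hypothetical helper: 64-bit word stored at csr_buffer[ARITHMETIC_PRECISION].

    Assumed field positions (taken from the driver code, not from the RTL):
    weight-memory start address from bit 32, FP mode from bit 8,
    chain activation at bit 4, integer precision in the low nibble.
    """
    return (wmem_start << 32) | (fp_mode << 8) | (chain << 4) | precision


if __name__ == "__main__":
    print(hex(command_word(CMD_EXECUTE_STEP)))
    print(hex(command_word(CMD_UPDATE_IN_ARG, 4)))
    print(hex(precision_csr_word(2, INT8)))
```

Keeping the shifts in one place makes it harder for the field positions to drift apart between the call sites of the driver.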

```
BASE_ADDRESS_INTC=0x40800000
ADDRESS_RANGE_INTC=0x10000
BASE_ADDRESS_DMA_INFIFO=0x40400000
ADDRESS_RANGE_DMA_INFIFO=0x10000
BASE_ADDRESS_DMA_WM=0x40410000
ADDRESS_RANGE_DMA_WM=0x10000
accelerator=None
infifo_buffer_transfer=None
output_fifo_buffer=None
weight_buffer=None
csr_buffer=None
overlay=None
driver_csr=None
driver_wm=None
driver_fifo_in=None
driver_fifo_out=None
### DESIGN DEPENDENT DEFINITION #####
WMEM_SIZE=16384 # 1 Mbytes
CSRMEM_SIZE=1024
INFIFO_SIZE=2048 # 1 Kbytes
OUTFIFO_SIZE=2048 # 1 Kbytes
ROWS=0
COLUMNS=0
DATAWIDTH=64
BUFFER_DEPTH=2
output_size=0
input_size=0
tot_size_weight=0
tot_size_input=0
tot_size_output=0
curr_data_precision=INT8
curr_bitwidth_data_computation=8
# default is 1 byte signed for integer;
# lower case -> signed, upper case -> unsigned
PACK_TYPE="b"
DTYPE_NP=np.uint8
FP=False
BFP=False
size_tot=0
num_weight=0
global_iteration=1 ## at least one execution of the tensor accelerator
global_iteration_shift_wm=[]
weight_tensors=[]
input_tensors=[]
output_tensors=[]
output_tensors_p=[]
```

```
weight_buffer_multiple=[]
index_wm=0

class Tensor:
    def __init__(self, data, tot_dim, size_l):
        self.tot_dim=tot_dim
        self.data=data
        self.size_l=size_l

filter_height=0
filter_width=0
##############################
##### time probes #####
##############################
avg_hw_execution=0.0
n_execution=0
avg_hw_execution_internal=0.00
n_execution_internal=0
########################
##### XADC ########
########################
xadc_mon=None
##### Retrieve and display power consumption
##### Supply sensor: Vccint, Vccaux, Vccbram
#####                Vccpint, Vccpaux, Vccoddr
##### Nominal values of resistances and Vcc ######
# from vivado report power
# [V]
vcc_pl_int_nom=1.00
vcc_pl_aux_nom=1.80
vcc_pl_bram_nom=1.00
vcc_ps_int_nom=1.80
vcc_ps_aux_nom=1.80
vcc_ddr_nom=1.50
# equivalent series resistances of capacitor -> worst case
# [ohm]
r_pl_int=225
r_pl_aux=300
r_pl_bram=225
r_ps_int=225
r_ps_aux=400
r_ddr=0.005
n_sample=1
ps_power=0
pl_power=0
mem_power=0
ps_power_max=sys.float_info.min
pl_power_max=sys.float_info.min
mem_power_max=sys.float_info.min
```

```
ps_power_min=sys.float_info.max
pl_power_min=sys.float_info.max
mem_power_min=sys.float_info.max
tmp_max=sys.float_info.min
tmp_min=sys.float_info.max
tmp_avg=0.00

def sample_power(threadName, delay):
    global ps_power
    global pl_power
    global mem_power
    global n_sample
    global xadc_mon
    global vcc_ps_aux_nom
    global ps_power_max
    global pl_power_max
    global mem_power_max
    global ps_power_min
    global pl_power_min
    global mem_power_min
    global tmp_max
    global tmp_min
    global tmp_avg
    while True:
        time.sleep(0.8/1000)
        # each XADC reading carries a 12-bit code in bits [15:4]
        vcc_pl_int=(xadc_mon.read(VCC_INT) & 0x0000FFF0) >> 4
        vcc_pl_int=(vcc_pl_int*vcc_ps_aux_nom)/4096
        vcc_pl_aux=(xadc_mon.read(VCC_AUX) & 0x0000FFF0) >> 4
        vcc_pl_aux=(vcc_pl_aux*vcc_ps_aux_nom)/4096
        vcc_pl_bram=(xadc_mon.read(VCC_BRAM) & 0x0000FFF0) >> 4
        vcc_pl_bram=(vcc_pl_bram*vcc_ps_aux_nom)/4096
        vcc_ps_int=(xadc_mon.read(DEV_CORE_SUPPLY) & 0x0000FFF0) >> 4
        vcc_ps_int=(vcc_ps_int*vcc_ps_aux_nom)/4096
        vcc_ps_aux=(xadc_mon.read(DEV_AUX_CORE_SUPPLY) & 0x0000FFF0) >> 4
        vcc_ps_aux=(vcc_ps_aux*vcc_ps_aux_nom)/4096
        vcc_ddr=(xadc_mon.read(DEV_CORE_MEM_SUPPLY) & 0x0000FFF0) >> 4
        vcc_ddr=(vcc_ddr*3)/4096
        n_sample+=1
        ps_power_i=(((vcc_ps_int_nom-vcc_ps_int)/r_ps_int)*vcc_ps_int_nom
                    + ((vcc_ps_aux_nom-vcc_ps_aux)/r_ps_aux)*vcc_ps_aux_nom)
        pl_power_i=(((vcc_pl_int_nom-vcc_pl_int)/r_pl_int)*vcc_pl_int_nom
                    + ((vcc_pl_aux_nom-vcc_pl_aux)/r_pl_aux)*vcc_pl_aux_nom
                    + ((vcc_pl_bram_nom-vcc_pl_bram)/r_pl_bram)*vcc_pl_bram_nom)
        mem_power_i=((vcc_ddr-vcc_ddr_nom)/r_ddr)*vcc_ddr
```
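The sampling loop above extracts a 12-bit code from bits [15:4] of each XADC status register and scales it. The sketch below isolates that conversion, assuming the transfer functions documented for the Xilinx XADC (a 3 V full scale for the on-chip supply sensors and the 503.975/4096 temperature slope that also appears later in this listing). The function names are hypothetical. Note that the listing scales most rails by `vcc_ps_aux_nom` instead of 3, so the full scale is left as a parameter here.

```python
def xadc_code(raw):
    """Extract the 12-bit ADC code stored in bits [15:4] of a status register."""
    return (raw & 0x0000FFF0) >> 4


def xadc_supply_voltage(raw, full_scale=3.0):
    """Convert a supply-sensor reading to volts.

    full_scale=3.0 is an assumption based on the XADC supply-sensor transfer
    function; pass another value to reproduce the scaling used in the listing.
    """
    return xadc_code(raw) * full_scale / 4096


def xadc_temperature_celsius(raw):
    """Convert a temperature-sensor reading to degrees Celsius."""
    return xadc_code(raw) * 503.975 / 4096 - 273.15


if __name__ == "__main__":
    sample = 0x8000  # hypothetical raw register value (mid-scale code)
    print(xadc_supply_voltage(sample), xadc_temperature_celsius(sample))
```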

```
        ## update max
        if pl_power_i > pl_power_max:
            pl_power_max=pl_power_i
        if ps_power_i > ps_power_max:
            ps_power_max=ps_power_i
        if mem_power_i > mem_power_max:
            mem_power_max=mem_power_i
        # update min
        if pl_power_i < pl_power_min:
            pl_power_min=pl_power_i
        if ps_power_i < ps_power_min:
            ps_power_min=ps_power_i
        if mem_power_i < mem_power_min:
            mem_power_min=mem_power_i
        ## update values for the averages
        ps_power+=ps_power_i
        pl_power+=pl_power_i
        mem_power+=mem_power_i
        # temperature
        tmp=(xadc_mon.read(TEMPERATURE) & 0x0000FFF0) >> 4
        tmp=(tmp*503.975)/4096 - 273.15
        ## update max
        if tmp > tmp_max:
            tmp_max=tmp
        ## update min
        if tmp < tmp_min:
            tmp_min=tmp
        tmp_avg+=tmp

########## LOAD DESIGN ############
@ffi.def_extern()
def load_overlay():
    global accelerator
    global overlay
    global xadc_mon
    global ROWS
    global COLUMNS
    global ps_power
    global pl_power
    global mem_power
    global ps_power_max
    global pl_power_max
    global mem_power_max
    global ps_power_min
    global pl_power_min
    global mem_power_min
    global tmp_max
```
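The running maximum values in this listing start from `sys.float_info.min`, which is the smallest positive normalized float and not the most negative one, so a maximum that should be negative would never be recorded. A minimal sketch of the same bookkeeping with infinite sentinels follows; the `RunningStats` class is a hypothetical helper, not part of the original library.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical accumulator for the min/max/average of a sampled quantity."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf    # any real sample is smaller than +inf
    maximum: float = -math.inf   # any real sample is larger than -inf

    def update(self, sample):
        self.count += 1
        self.total += sample
        self.minimum = min(self.minimum, sample)
        self.maximum = max(self.maximum, sample)

    @property
    def mean(self):
        return self.total / self.count if self.count else math.nan


if __name__ == "__main__":
    stats = RunningStats()
    for sample in (-2.0, -0.5, -1.0):  # made-up samples, all negative
        stats.update(sample)
    print(stats.minimum, stats.maximum, stats.mean)
```

One instance per monitored quantity (PS power, PL power, memory power, temperature) would replace the separate global counters, at the cost of departing from the structure used in the listing.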

```
    global tmp_min
    global tmp_avg
    ## modify this part for choosing a different overlay and
    ## recompile the library
    f_clk="30mhz"
    datawidth="only_integer8"
    mxu_size="mxu_8x8"
    ROWS=8
    COLUMNS=8
    print("Hardware design space points", f_clk, " ", " ", mxu_size, " ", datawidth)
    overlay=Overlay("/home/xilinx/dtpu_configurations/"+datawidth+"/"+f_clk+"/"
                    + mxu_size+"/pynqz2.bit") # tcl is also parsed
    overlay.download() # Explicitly download bitstream to PL
    if overlay.is_loaded():
        # Checks if a bitstream is loaded
        if _DEBUG_PRINT: print("[DEBUG-PYTHON] ---- overlay is loaded ----")
    else:
        if _DEBUG_PRINT: print("[DEBUG-PYTHON] ---- overlay is not loaded ----")
        exit(-1)
    if overlay.monitor is not None:
        xadc_mon=overlay.monitor.xadc_wiz_0_0
        xadc_mon.write(SRR,0x0000000A) # reset
    else:
        print("ERROR NO XADC")
    if overlay.dtpu is not None:
        accelerator=overlay.dtpu.axis_accelerator_ada
    else:
        print("ERROR NO ACCELERATOR")
        exit(-1)
    overlay.reset()
    # clean power variable
    n_sample=1
    ps_power=0
    pl_power=0
    mem_power=0
    ps_power_max=sys.float_info.min
    pl_power_max=sys.float_info.min
    mem_power_max=sys.float_info.min
    ps_power_min=sys.float_info.max
    pl_power_min=sys.float_info.max
    mem_power_min=sys.float_info.max
    tmp_max=sys.float_info.min
    tmp_min=sys.float_info.max
    tmp_avg=0.00
```

```
@ffi.def_extern()
def Init_p(tot_tensors,input_tens_size,output_tens_size):
    global accelerator
    global overlay
    global size_tot
    global input_size
    global output_size
    global avg_hw_execution
    global n_execution
    global avg_hw_execution_internal
    global n_execution_internal
    global tmp_avg
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] ---- Init_p function ----")
    ## soft reset and accelerator configuration
    accelerator.write(CTRL,0x0000001)
    accelerator.write(CTRL,0x0000000)
    accelerator.write(IARG_RQT_EN,0x000000007) ## all data available: csr, weights and data
    accelerator.write(OARG_LENGTH_MODE,0x00000001) # software mode
    accelerator.write(OARG0_LENGTH,OUTFIFO_SIZE) # size outfifo
    accelerator.write(ISCALAR_RQT_EN,0) # no input scalar
    accelerator.write(OSCALAR_RQT_EN,0) # no output scalar
    accelerator.write(OARG0_TDEST,0) # only one output
    size_tot=tot_tensors
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- total tensors",size_tot,"----")
    input_size=input_tens_size
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- in tensors",input_size,"----")
    output_size=output_tens_size
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- out tensors",output_tens_size,"----")
    n_execution=0
    avg_hw_execution=0.00
    avg_hw_execution_internal=0.00
    n_execution_internal=0
    tmp_avg=0.00
    return True

@ffi.def_extern()
def SelectDataTypeComputation_p(data_type):
    global curr_data_precision
    global curr_bitwidth_data_computation
    global PACK_TYPE
    global FP
    global BFP
```

```
    global DTYPE_NP
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] ---- SelectDataTypeComputation DTPU class ----")
    if data_type!=0:
        # case switch
        if ((data_type)&0x00000f)==INT8:
            curr_data_precision=INT8
            curr_bitwidth_data_computation=8
            if (data_type&0x00100)==SIGNED:
                PACK_TYPE="b"
                DTYPE_NP=np.int8
            else:
                PACK_TYPE="B"
                DTYPE_NP=np.uint8
        elif ((data_type)&0x00000f)==INT16:
            curr_data_precision=INT16
            curr_bitwidth_data_computation=16
            if (data_type&0x00100)==SIGNED:
                PACK_TYPE="h"
                DTYPE_NP=np.int16
            else:
                PACK_TYPE="H"
                DTYPE_NP=np.uint16
        elif ((data_type)&0x00000f)==INT32:
            curr_data_precision=INT32
            curr_bitwidth_data_computation=32
            if (data_type&0x00100)==SIGNED:
                PACK_TYPE="i"
                DTYPE_NP=np.int32
            else:
                PACK_TYPE="I"
                DTYPE_NP=np.uint32
        elif ((data_type)&0x00000f)==INT64:
            curr_data_precision=INT64
            curr_bitwidth_data_computation=64
            if (data_type&0x00100)==SIGNED:
                PACK_TYPE="q"
                DTYPE_NP=np.int64
            else:
                PACK_TYPE="Q"
                DTYPE_NP=np.uint64
        else:
            print("ERROR PYTHON! Setting the Data type of computation")
        # floating point check
        if ((data_type & 0x000060) >> 5)==ACTIVE_FP:
            FP=True
            BFP=False
            PACK_TYPE="f"
```

```
            DTYPE_NP=np.float32
        elif ((data_type & 0x000060) >> 5)==ACTIVE_BFP:
            FP=True
            BFP=True
            PACK_TYPE="e"
            DTYPE_NP=np.uint16 ## according to tensorflow bfp16 representation
        else:
            FP=False
            BFP=False
    else:
        curr_data_precision=INT8
        curr_bitwidth_data_computation=8
        FP=False
        BFP=False
    if _DEBUG_PRINT:
        print("[DEBUG-PYTHON]---- precision default 8 bit signed ----")
        print("[DEBUG-PYTHON]---- Signed : ",PACK_TYPE.islower()," type: ",
              curr_data_precision," ->",curr_bitwidth_data_computation,"----")
    return True

@ffi.def_extern()
def push_input_tensor_to_heap(tensor, size, dim_size):
    global input_tensors
    global tot_size_input
    # push the tensor to the heap for handling its transfer in Prepare_p
    tot_size=1
    if not(FP) or not(BFP):
        if PACK_TYPE.islower(): # signed
            if curr_data_precision==INT8:
                tensor_i=ffi.cast("int8_t *",tensor)
            elif curr_data_precision==INT16:
                tensor_i=ffi.cast("int16_t *",tensor)
            elif curr_data_precision==INT32:
                tensor_i=ffi.cast("int32_t *",tensor)
            else: # int64
                tensor_i=ffi.cast("int64_t *",tensor)
        else: # unsigned
            if curr_data_precision==INT8:
                tensor_i=ffi.cast("uint8_t *",tensor)
            elif curr_data_precision==INT16:
                tensor_i=ffi.cast("uint16_t *",tensor)
            elif curr_data_precision==INT32:
                tensor_i=ffi.cast("uint32_t *",tensor)
            else: # int64
                tensor_i=ffi.cast("uint64_t *",tensor)
```
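The chain of `if`/`elif` branches above maps each precision code to a bit width, a `struct` format character and a NumPy dtype, with lower case meaning signed and upper case meaning unsigned. As an illustration only, the same mapping can be written as a lookup table. The sketch below assumes the precision codes defined earlier in the listing and takes the signedness as an explicit boolean; `integer_format` and `pack_element` are hypothetical names.

```python
import struct

import numpy as np

# Precision codes repeated from the listing.
INT8, INT16, INT32, INT64 = 0x1, 0x3, 0x7, 0xF

# precision code -> (bit width, signed struct code, signed dtype, unsigned dtype)
_INTEGER_FORMATS = {
    INT8: (8, "b", np.int8, np.uint8),
    INT16: (16, "h", np.int16, np.uint16),
    INT32: (32, "i", np.int32, np.uint32),
    INT64: (64, "q", np.int64, np.uint64),
}


def integer_format(precision, signed):
    """Return (bit width, struct format character, NumPy dtype) for a precision code."""
    bits, code, signed_dtype, unsigned_dtype = _INTEGER_FORMATS[precision]
    if signed:
        return bits, code, signed_dtype
    return bits, code.upper(), unsigned_dtype


def pack_element(value, precision, signed):
    """Serialize one element as little-endian bytes using the selected format."""
    _, code, _ = integer_format(precision, signed)
    return struct.pack("<" + code, value)


if __name__ == "__main__":
    print(integer_format(INT16, signed=True))
    print(pack_element(-1, INT16, signed=True))
```

An unknown precision code raises `KeyError` here, whereas the listing prints an error message and continues.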

```
    else:
        if BFP:
            tensor_i=ffi.cast("uint16_t *",tensor)
        else:
            tensor_i=ffi.cast("float *",tensor)
    size_i=ffi.cast("int *",size)
    tot_size=1
    size_l=4*[1]
    data_p=[]
    for i in range(dim_size):
        size_l[i]=size[i]
        tot_size*=size[i]
    tot_size_input+=tot_size
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]---- size of tensor input ",tot_size_input,"----")
    for i in range(tot_size):
        data_p.append(tensor_i[i])
    input_tensors.append(Tensor(data_p,tot_size,size_l))

@ffi.def_extern()
def push_output_tensor_to_heap(tensor, size, dim_size):
    global output_tensors
    global tot_size_output
    global output_tensors_p
    # push the tensor to the heap for handling its transfer in Prepare_p
    tot_size=1
    output_tensors_p.append(tensor)
    if not(FP) or not(BFP):
        if PACK_TYPE.islower(): # signed
            if curr_data_precision==INT8:
                tensor_i=ffi.cast("int8_t *",tensor)
            elif curr_data_precision==INT16:
                tensor_i=ffi.cast("int16_t *",tensor)
            elif curr_data_precision==INT32:
                tensor_i=ffi.cast("int32_t *",tensor)
            else: # int64
                tensor_i=ffi.cast("int64_t *",tensor)
        else: # unsigned
            if curr_data_precision==INT8:
                tensor_i=ffi.cast("uint8_t *",tensor)
            elif curr_data_precision==INT16:
                tensor_i=ffi.cast("uint16_t *",tensor)
            elif curr_data_precision==INT32:
                tensor_i=ffi.cast("uint32_t *",tensor)
            else: # int64
                tensor_i=ffi.cast("uint64_t *",tensor)
    else:
        if BFP:
```
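Brain floating-point tensors are handled in this listing as raw `uint16_t` values. A bfloat16 number has the same sign and exponent layout as an IEEE 754 float32 and keeps only the upper 16 bits, so the conversion can be sketched with NumPy views. The sketch below truncates instead of rounding, which is an assumption made for brevity; the function names are hypothetical.

```python
import numpy as np


def float32_to_bfloat16_bits(values):
    """Return the bfloat16 bit patterns (as uint16) of float32 values, by truncation."""
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)


def bfloat16_bits_to_float32(bits16):
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    wide = np.asarray(bits16, dtype=np.uint16).astype(np.uint32) << 16
    return wide.view(np.float32)


if __name__ == "__main__":
    patterns = float32_to_bfloat16_bits([1.0, -2.0, 3.14159])
    print([hex(int(p)) for p in patterns])
    print(bfloat16_bits_to_float32(patterns))
```

Values that need more than 8 significant bits lose precision in the round trip, with a relative error below 2^-7 under truncation.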

```
            tensor_i=ffi.cast("uint16_t *",tensor)
        else:
            tensor_i=ffi.cast("float *",tensor)
    size_i=ffi.cast("int *",size)
    tot_size=1
    size_l=4*[1]
    data_p=[]
    for i in range(dim_size):
        size_l[i]=size[i]
        tot_size*=size[i]
    tot_size_output+=tot_size
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]---- size of tensor output ",tot_size,"----")
    for i in range(tot_size):
        data_p.append(0)
    output_tensors.append(Tensor(data_p,tot_size,size_l))

@ffi.def_extern()
def push_weight_to_heap(tensor, size, dim_size):
    global weight_tensors
    global tot_size_weight
    # push the tensor to the heap for handling its transfer in Prepare_p
    tot_size=1
    if not(FP) or not(BFP):
        if PACK_TYPE.islower(): # signed
            if curr_data_precision==INT8:
                tensor_i=ffi.cast("int8_t *",tensor)
            elif curr_data_precision==INT16:
                tensor_i=ffi.cast("int16_t *",tensor)
            elif curr_data_precision==INT32:
                tensor_i=ffi.cast("int32_t *",tensor)
            else: # int64
                tensor_i=ffi.cast("int64_t *",tensor)
        else: # unsigned
            if curr_data_precision==INT8:
                tensor_i=ffi.cast("uint8_t *",tensor)
            elif curr_data_precision==INT16:
                tensor_i=ffi.cast("uint16_t *",tensor)
            elif curr_data_precision==INT32:
                tensor_i=ffi.cast("uint32_t *",tensor)
            else: # int64
                tensor_i=ffi.cast("uint64_t *",tensor)
    else:
        if BFP:
            tensor_i=ffi.cast("uint16_t *",tensor)
        else:
            tensor_i=ffi.cast("float *",tensor)
    size_i=ffi.cast("int *",size)
```

```
    tot_size=1
    size_l=4*[1]
    data_p=[]
    for i in range(dim_size):
        size_l[i]=size[i]
        tot_size*=size[i]
    tot_size_weight+=tot_size
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]---- size of tensor weight ",tot_size_weight,"----")
    for i in range(tot_size):
        data_p.append(tensor_i[i])
    weight_tensors.append(Tensor(data_p,tot_size,size_l))

@ffi.def_extern()
def Prepare_p(weight_num):
    global output_fifo_buffer
    global infifo_buffer_transfer
    global weight_buffer
    global csr_buffer
    global overlay
    global driver_wm
    global driver_csr
    global driver_fifo_in
    global driver_fifo_out
    global num_weight
    global global_iteration
    global global_iteration_shift_wm
    global curr_data_precision
    global weight_tensors
    global filter_height
    global filter_width
    global weight_buffer_multiple
    global index_wm
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- Prepare_p of DTPU class ----")
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- in size",input_size,"output size",output_size," ----")
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- weight size",weight_num," ----")
    # allocate buffers for data transfer
    num_weight=weight_num
    filter_height=num_weight*[0]
    filter_width=num_weight*[0]
    ## symmetric input/output fifo
    output_fifo_buffer=allocate(shape=(INFIFO_SIZE,),dtype='u8')
    weight_buffer=allocate(shape=(WMEM_SIZE,),dtype='u8')
    csr_buffer=allocate(shape=(CSRMEM_SIZE,),dtype='u8')
    infifo_buffer_transfer=allocate(shape=(INFIFO_SIZE,),dtype='u8')
```

```
    driver_wm=overlay.axi_dma_weight_mem
    driver_csr=overlay.axi_dma_csr_mem
    driver_fifo_in=overlay.axi_dma_infifo
    driver_fifo_out=overlay.axi_dma_outfifo
    ######## populate buffers pack depending on the precision ########
    if _DEBUG_PRINT:
        print("[DEBUG - PYTHON] --- Prepare_p of DTPU class ",num_weight,"weight to transfer ----")
        for i in range(num_weight):
            tmp=weight_tensors[i]
            print("[DEBUG-PYTHON] ---- weight ",i,"----")
            print("[DEBUG-PYTHON] ---- size ",*tmp.size_l,"----")
            for j in range(tmp.tot_dim):
                print(tmp.data[i],end=" ")
            print("",end="\n")
    index_wm=0 # it eats the first data?
    shift=int(64/curr_bitwidth_data_computation)
    # if it fits in the accelerator memory
    iter=int(tot_size_weight/(WMEM_SIZE*(64/curr_bitwidth_data_computation)))
    # always 4D tensors
    # assumption is that the filter sizes always fit the accelerator
    if False:
        weight_buffer_multiple=1
        for w_ind in range(1): # pack only the weight for depthwise convolution
            tmp=np.array(weight_tensors[w_ind].data,dtype=DTYPE_NP)
            tmp=tmp.reshape(*weight_tensors[w_ind].size_l)
            filter_height[w_ind],filter_width[w_ind]=tmp.shape[1:3]
            for i in range(len(tmp)):
                for l in range(weight_tensors[w_ind].size_l[3]):
                    global_iteration_shift_wm.append(index_wm)
                    for j in range(len(tmp[i])):
                        # boundary check
                        shift=int(64/curr_bitwidth_data_computation)
                        if shift > len(tmp[i]):
                            shift=len(tmp[i])
                        weight_buffer[index_wm]=np.uint64(int.from_bytes(
                            tmp[i,j,0:shift,l],byteorder="little",signed=False))
                        index_wm+=1
```

```
                    for j in range(ROWS-len(tmp[i])):
                        weight_buffer[index_wm]=0
                        index_wm+=1 # padding with zeros
    else:
        # print("it requires multiple iterations for the weight matrix")
        # multiple iteration on total weight 1MB should be enough
        weight_buffer_multiple=[]
        for w_ind in range(1): # pack only the weight for depthwise convolution
            tmp=np.array(weight_tensors[w_ind].data,dtype=DTYPE_NP)
            tmp=tmp.reshape(*weight_tensors[w_ind].size_l)
            filter_height[w_ind],filter_width[w_ind]=tmp.shape[1:3]
            for i in range(len(tmp)):
                for l in range(weight_tensors[w_ind].size_l[3]):
                    global_iteration_shift_wm.append(index_wm)
                    for j in range(len(tmp[i])):
                        # boundary check
                        shift=int(64/curr_bitwidth_data_computation)
                        if shift > len(tmp[i]):
                            shift=len(tmp[i])
                        weight_buffer_multiple.append(np.uint64(int.from_bytes(
                            tmp[i,j,0:shift,l],byteorder="little",signed=False)))
                        index_wm+=1
                    for j in range(ROWS-len(tmp[i])):
                        weight_buffer[index_wm]=0
                        index_wm+=1 # padding with zeros
    if _DEBUG_PRINT:
        for i in range(10):
            print(hex(weight_buffer[i]))
    ######################################
    ##### transferring data ######
    ######################################
    weight_buffer.flush()
    return True

@ffi.def_extern()
def Invoke_p(only_conv2d,input_shift):
    global infifo_buffer_transfer
    global driver_csr
    global driver_wm
    global driver_fifo_in
    global driver_fifo_out
    global csr_buffer
    global weight_buffer
    global output_fifo_buffer
    global accelerator
    global global_iteration
```
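`Prepare_p` packs groups of `64/curr_bitwidth_data_computation` weights into one 64-bit little-endian word, because the DMA buffers hold `u8` (unsigned 64-bit) entries. The sketch below shows one way to do this kind of packing with NumPy views. It is not the exact layout produced by the listing, which walks the 4D weight tensor row by row; the function name `pack_words` and the zero padding of the last word are assumptions.

```python
import numpy as np


def pack_words(values, dtype):
    """Pack a flat sequence of elements into little-endian 64-bit words.

    The tail is padded with zeros so that the element count is a multiple
    of the number of elements per word (8 for int8, 4 for int16, ...).
    """
    flat = np.ascontiguousarray(values, dtype=dtype).ravel()
    per_word = 8 // flat.dtype.itemsize
    padding = (-flat.size) % per_word
    padded = np.concatenate([flat, np.zeros(padding, dtype=flat.dtype)])
    # Force a little-endian element layout so that the result does not depend
    # on the host byte order, then reinterpret 8 bytes at a time.
    little = padded.astype(padded.dtype.newbyteorder("<"))
    return little.view("<u8")


if __name__ == "__main__":
    words = pack_words([1, 2, 3, 4, 5, 6, 7, 8, 9], np.uint8)
    print([hex(int(w)) for w in words])
```

The first element ends up in the least significant byte of each word, which matches the `byteorder="little"` argument used in the listing.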

```
    global global_iteration_shift_wm
    global curr_data_precision
    global input_tensors
    global output_tensors
    global filter_width
    global filter_height
    global tot_size_output
    global tot_size_input
    global output_tensors_p
    global avg_hw_execution
    global n_execution
    global avg_hw_execution_internal
    global n_execution_internal
    global weight_buffer_multiple
    ######## populate buffers pack depending on the precision ########
    tmp=[]
    if _DEBUG_PRINT:
        print("[DEBUG - PYTHON] --- Invoke_p of DTPU class",
              input_size-num_weight,"input tensors to transfer ----")
        for i in range(input_size-num_weight):
            tmp=input_tensors[i]
            print("[DEBUG-PYTHON] ---- input tensor ",i,"----")
            print("[DEBUG-PYTHON] ---- size ",*tmp.size_l,"----")
            for j in range(tmp.tot_dim):
                print(tmp.data[j],end=" ")
    index=0
    shift=int(64/curr_bitwidth_data_computation)
    # check if it fits the inputs
    # always 4D tensors
    # assumption is that the filter sizes always fit the accelerator
    # then compact
    ## split the input shape into submatrices equal to filter sizes
    applyed_weight=0
    # over allocate input_fifo_buffer
    input_fifo_buffer=[]
    for w_ind in range(len(input_tensors)):
        tmp=np.array(input_tensors[w_ind].data,dtype=DTYPE_NP)
        tmp=tmp.reshape(*input_tensors[w_ind].size_l)
        for batch in range(len(tmp)):
            for channel in range(tmp.shape[-1]):
```

```
                tmp_s=tmp[batch,:,:,channel]
                # iteration for the whole matrix
                for i in range(len(tmp_s)-filter_height[applyed_weight]):
                    for j in range(len(tmp_s[i])-filter_width[applyed_weight]):
                        tmp_ss=tmp_s[i:i+filter_height[applyed_weight],
                                     j:j+filter_width[applyed_weight]]
                        for row in range(len(tmp_ss)):
                            shift=int(64/curr_bitwidth_data_computation)
                            if shift > len(tmp_ss):
                                shift=len(tmp_ss)
                            input_fifo_buffer.append(np.uint64(int.from_bytes(
                                tmp_ss[row,0:shift],byteorder="little",signed=False)))
                            index+=1
    input_fifo_buffer=np.array(input_fifo_buffer,dtype='u8')
    input_fifo_buffer=np.reshape(input_fifo_buffer,newshape=(index,))
    if _DEBUG_PRINT:
        for i in range(10):
            print(hex(input_fifo_buffer[i]))
    # iterate on the output matrix with also multiple weight iteration and inputs
    ## assumption is that the output tensor is always one!
    ## getting the output matrix structure
    #accelerator.write(CMD,(0x0000000 | (CMD_EXECUTE_CONTINOUS<<16)))
    output_matrix=np.array(output_tensors[0].data,dtype=DTYPE_NP)
    output_matrix=output_matrix.reshape(*output_tensors[0].size_l)
    point_wise=np.array(weight_tensors[1].data,dtype=DTYPE_NP)
    ########################################
    ###### depthwise convolution ########
    ########################################
    if _DEBUG_PRINT:
        print("[DEBUG-PYTHON] ---- depthwise convolution ----")
    if _TIME_PROBES:
        start_time=time.time()
    for shift_w in range(math.ceil(len(weight_buffer_multiple)/WMEM_SIZE)):
        ##############################################
        ##### program the dma for the weight #########
        ##############################################
        if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- transferring weight buffer ----")
        weight_buffer[0:len(weight_buffer_multiple[WMEM_SIZE*(shift_w):WMEM_SIZE*(shift_w+1)])]=\
            weight_buffer_multiple[WMEM_SIZE*(shift_w):WMEM_SIZE*(shift_w+1)]
```
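The nested loops above slide a `filter_height` x `filter_width` window over every channel plane of the input. The same patch extraction can be sketched with NumPy's `sliding_window_view` (available from NumPy 1.20). The helper name `valid_windows` is hypothetical. A full sweep without padding has `H - filter_height + 1` row positions and `W - filter_width + 1` column positions; the bounds written in the listing stop one position earlier in each direction.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def valid_windows(plane, filter_height, filter_width):
    """Return all windows of a 2D plane.

    The result has shape (H - fh + 1, W - fw + 1, fh, fw) and is a read-only
    view, so no patch data is copied.
    """
    return sliding_window_view(plane, (filter_height, filter_width))


if __name__ == "__main__":
    plane = np.arange(16).reshape(4, 4)      # made-up 4x4 channel plane
    windows = valid_windows(plane, 3, 3)
    kernel = np.ones((3, 3), dtype=plane.dtype)
    # A reference depthwise result for one channel: multiply and accumulate.
    print((windows * kernel).sum(axis=(2, 3)))
```

A software reference like this is convenient for checking the values read back from the output FIFO.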

```
        driver_wm.sendchannel.transfer(weight_buffer)
        driver_wm.sendchannel.wait()
        for batch_i in range(input_tensors[0].size_l[0]):
            for channel_i in range(input_tensors[0].size_l[-1]):
                ###############################################
                ###### program the dma for the csr reg ########
                ###############################################
                if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- transferring csr buffer for weight ----")
                csr_buffer[ARITHMETIC_PRECISION]=((global_iteration_shift_wm[channel_i]<<32)
                    | (NO_FP<<8) | (ACTIVATE_CHAIN<<4) | (curr_data_precision))
                #csr_buffer.flush()
                driver_csr.sendchannel.transfer(csr_buffer)
                #driver_csr.sendchannel.wait()
                for infifo_shift in range(math.ceil(input_fifo_buffer.size/INFIFO_SIZE)):
                    ####################################################
                    ##### program the dma for the in/out fifos #########
                    ####################################################
                    if _TIME_PROBES:
                        start_time_i=time.time()
                    if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- transferring input buffer",infifo_shift," ----")
                    infifo_buffer_transfer[0:input_fifo_buffer[INFIFO_SIZE*(infifo_shift):INFIFO_SIZE*(infifo_shift+1)].size]=\
                        input_fifo_buffer[INFIFO_SIZE*(infifo_shift):INFIFO_SIZE*(infifo_shift+1)]
                    driver_fifo_in.sendchannel.transfer(infifo_buffer_transfer)
                    #driver_fifo_in.sendchannel.wait()
                    accelerator.write(OARG0_LENGTH,OUTFIFO_SIZE) # size
                    accelerator.write(CMD,(0x0000000 | (CMD_EXECUTE_STEP<<16)))
                    accelerator.write(CMD,((CMD_UPDATE_OUT_ARG<<16)|(1)))
                    driver_fifo_out.recvchannel.transfer(output_fifo_buffer)
                    if _DEBUG_PRINT: print("[DEBUG-PYTHON]---- getting output data ----")
                    driver_fifo_out.recvchannel.wait()
                    if _TIME_PROBES:
                        end_time_i=time.time()
                        avg_hw_execution_internal+=end_time_i-start_time_i
                        n_execution_internal+=1
                    if _DEBUG_PRINT: print(output_fifo_buffer)
                    accelerator.write(CMD,((CMD_UPDATE_IN_ARG<<16)|(4))) # update input fifo
```
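The input stream is sent to the accelerator in slices of at most `INFIFO_SIZE` words, with `math.ceil(size/INFIFO_SIZE)` iterations and a shorter final slice. The sketch below isolates that slicing with plain NumPy arrays, without any DMA calls. The generator name `fifo_chunks` is hypothetical. Unlike the listing, it clears the unused tail of the staging buffer, so that words left over from the previous transfer are not sent again; this is a design choice of the sketch, not a description of the original behaviour.

```python
import math

import numpy as np


def fifo_chunks(words, fifo_size):
    """Yield (staging_buffer, valid_words) pairs covering `words` in order."""
    staging = np.zeros(fifo_size, dtype=words.dtype)
    for k in range(math.ceil(words.size / fifo_size)):
        chunk = words[k * fifo_size:(k + 1) * fifo_size]
        staging[:chunk.size] = chunk
        staging[chunk.size:] = 0          # clear stale words from the last pass
        yield staging.copy(), chunk.size


if __name__ == "__main__":
    stream = np.arange(5, dtype=np.uint64)   # made-up input stream
    for buffer, valid in fifo_chunks(stream, 2):
        print(buffer, valid)
```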

```
                    ###### unpack the output buffer depending on the precision #######
                    ## get values from output fifo buffer and put them
                    ## into an array in order to sum all the data
                    for i in range(output_matrix.shape[1]-1):
                        for j in range(output_matrix.shape[2]-1):
                            tmp_sum=np.zeros(shape=(ROWS,int(64/curr_bitwidth_data_computation)),
                                             dtype=DTYPE_NP)
                            tmp_data=output_fifo_buffer[
                                channel_i*(ROWS*COLUMNS)+i*ROWS+j*COLUMNS:
                                channel_i*(ROWS*COLUMNS)+(i+1)*ROWS+(i+1)*COLUMNS]
                            tmp_sum=np.frombuffer(tmp_data.tobytes(),dtype=DTYPE_NP)
                            # reshuffle and check if it is worth it
                            #if tmp_data.size >0:
                            #    for row in range(len(tmp_data)):
                            #        if row in tmp_data:
                            #            tmp_sum[row]=np.frombuffer(tmp_data[row].tobytes(),dtype=DTYPE_NP) #convert(tmp_data[row])
                            output_matrix[batch_i,i,j,channel_i]=np.multiply(
                                tmp_sum.sum(dtype=DTYPE_NP),point_wise[channel_i],dtype=DTYPE_NP)
                accelerator.write(CMD,((CMD_UPDATE_IN_ARG<<16)|(1))) # update csr
            accelerator.write(CMD,((CMD_UPDATE_OUT_ARG<<16)|(1)))
        accelerator.write(CMD,((CMD_UPDATE_IN_ARG<<16)|(2))) # update w memory
    #if _DEBUG_PRINT:
    #    print("[DEBUG-PYTHON]---- point wise convolution ----")
    if _TIME_PROBES:
        end_time=time.time()
        avg_hw_execution+=end_time-start_time
        n_execution+=1
    accelerator.write(STATUS,0x00000003) ## clear status
    #accelerator.write(CMD,((CMD_UPDATE_IN_ARG<<16)|(1))) # update csr
    #accelerator.write(CMD,(0x0000000 | (CMD_STOP_EXECUTE_CONTINOUS<<16))) # stop accelerator
    ####### point wise convolution ######## moved inside previous loop
```

```
    #for batch_i in range(len(output_matrix)):
    #    for i in range(len(output_matrix[batch_i])):
    #        for j in range(len(output_matrix[batch_i,i])):
    #            for channel_i in range(len(output_matrix[batch_i,i,j])):
    #                output_matrix[batch_i,i,j,channel_i]=output_matrix[batch_i,i,j,channel_i]*weight_tensors[1].data[channel_i]
    if _DEBUG_PRINT: print("[DEBUG-PYTHON] --- accelerator done ----")
    if _DEBUG_PRINT:
        print("[DEBUG-PYTHON] ---- final output data to tensorflow ----")
        print(output_matrix)
    # copy the output matrix to tensorflow environment: ffi.memmove(dest, src, nbytes)
    ffi.memmove(ffi.buffer(output_tensors_p[0],output_matrix.nbytes),
                output_matrix,output_matrix.nbytes)
    # save the pointer to the output and then substitute the
    # values into the point wise convolution
    # clean up input/output
    input_tensors=[]
    output_tensors=[]
    tot_size_input=0
    tot_size_output=0
    del input_fifo_buffer
    return True

@ffi.def_extern()
def ResetHardware_p():
    global accelerator
    global overlay
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- Reset hardware p function ----")
    overlay.reset()
    accelerator.write(CTRL,0x0000001)
    accelerator.write(CTRL,0x0000000)
    return True

@ffi.def_extern()
def destroy_p():
    global infifo_buffer_transfer
    global output_fifo_buffer
    global csr_buffer
    global weight_buffer
    global accelerator
    global overlay
    global global_iteration_shift_wm
    global weight_tensors
```

```
    global input_tensors
    global output_tensors
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- destroying the buffers ----")
    infifo_buffer_transfer.freebuffer()
    output_fifo_buffer.freebuffer()
    csr_buffer.freebuffer()
    weight_buffer.freebuffer()
    del accelerator
    del overlay
    del global_iteration_shift_wm
    del weight_tensors
    del input_tensors
    del output_tensors
    return True

@ffi.def_extern()
def CopyFromBufferHandle_p():
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- copying from the delegate and buffers ----")
    return True

@ffi.def_extern()
def CopyToBufferHandle_p():
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- copying to the delegate and buffers ----")
    return True

@ffi.def_extern()
def FreeBufferHandle_p():
    global output_fifo_buffer
    global csr_buffer
    global weight_buffer
    global driver_csr
    global driver_wm
    global driver_fifo_in
    global driver_fifo_out
    global accelerator
    if _DEBUG_PRINT: print("[DEBUG - PYTHON] --- freeing buffers ----")
    output_fifo_buffer.freebuffer()
    csr_buffer.freebuffer()
    weight_buffer.freebuffer()
    del accelerator
    del driver_csr
    del driver_wm
    del driver_fifo_in
    del driver_fifo_out

@ffi.def_extern()
```

```
def start_power_consumption():
    global xadc_mon
    if _DEBUG_PRINT: print("[DEBUG-PYTHON] ---- start measurement power consumption ----")
    if xadc_mon is not None:
        try:
            _thread.start_new_thread( sample_power, ("Sampling power", 0.5 ) ) # every 1ms
        except:
            print("Error: unable to start thread")
    return True

@ffi.def_extern()
def print_power_consumption_p():
    global xadc_mon
    global ps_power
    global pl_power
    global mem_power
    if _DEBUG_PRINT: print("[DEBUG_PYTHON] ---- printing power consumption from xadc readings ----")
    ### Retrieve and display current temperature ###
    tmp=( xadc_mon.read(TEMPERATURE) & 0x0000FFF0) >> 4
    tmp = (tmp * 503.975)/4096 - 273.15
    print("Current temperature:", round(tmp,3)," C")
    print("Average execution temperature:", round(tmp_avg/n_sample,3)," C")
    print("Max temperature:", round(tmp_max,3) ," C")
    print("Min temperature:", round(tmp_min,3) ," C")
    # printing power consumption
    tot_power=ps_power+pl_power+mem_power
    print("Average power consumption=", round(tot_power*1000/n_sample,5)," mWatt")
    print("---> Processing System:",round(ps_power*1000/n_sample,5)," mWatt")
    print("---> Programmable Logic:",round(pl_power*1000/n_sample,5)," mWatt")
    print("---> Memory:",round(mem_power*1000/n_sample,3)," mWatt")
    print("Maximum power consumption")
    print("---> Processing System:",round(ps_power_max*1000,5)," mWatt")
    print("---> Programmable Logic:",round(pl_power_max*1000,5)," mWatt")
    print("---> Memory:",round(mem_power_max*1000,3)," mWatt")
    print("Minimum power consumption")
    print("---> Processing System:",round(ps_power_min*1000,5)," mWatt")
```

```
    print("---> Programmable Logic:",round(pl_power_min*1000,5)," mWatt")
    print("---> Memory:",round(mem_power_min*1000,5)," mWatt")
    return True

@ffi.def_extern()
def activate_time_probe_p(activate):
    global _TIME_PROBES
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]--- activating time probe in python ----")
    if not(_TIME_PROBES) and activate:
        print("Time probes activated")
        _TIME_PROBES=True

@ffi.def_extern()
def print_python_time_probes():
    if _DEBUG_PRINT: print("[DEBUG-PYTHON]---- printing python time probes ----")
    print("Hardware execution time and rebuilding output matrix:", avg_hw_execution/n_execution," [s]")
    print("Hardware execution time:", avg_hw_execution_internal/n_execution_internal," [s]")
    #print("Hardware calls:", n_execution_internal)
    return True
```
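
The temperature printout in `print_power_consumption_p` converts a raw XADC register value into degrees Celsius, and the power printouts average accumulated readings over `n_sample`. The sketch below isolates both computations so they can be checked without the board. It assumes, as the listing does, that the 12-bit ADC code sits in bits 15:4 of the register value and that the accumulated power is expressed in watt. The function names `xadc_raw_to_celsius` and `average_milliwatt` are illustrative and not part of the original library, and the inputs are made-up values rather than measurements.

```python
def xadc_raw_to_celsius(raw_register):
    """Convert a raw XADC temperature register value to degrees Celsius."""
    # Keep bits 15:4, which hold the 12-bit ADC code (same mask and shift as the listing).
    code = (raw_register & 0x0000FFF0) >> 4
    # Same transfer function as the listing: scale the code, then offset to Celsius.
    return (code * 503.975) / 4096 - 273.15


def average_milliwatt(accumulated_watt, n_sample):
    """Average an accumulated power reading (assumed in watt) and express it in mW."""
    return accumulated_watt * 1000 / n_sample


# Hypothetical inputs, not measurements from the board.
print(round(xadc_raw_to_celsius(0x0000), 3))   # smallest ADC code
print(round(xadc_raw_to_celsius(0xFFF0), 3))   # largest ADC code
print(round(average_milliwatt(3.0, 4), 5))     # 3 W accumulated over 4 samples
```

Keeping the conversion in a separate function also makes it easier to reuse it for the average, maximum and minimum temperatures printed by the same routine.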

## B

## Top level entity of DTPU core

```
//
// Filename      : dtpu_core.v
// Created On    : 2020-04-22 17:05:56
// Last Modified : 2020-05-20 15:03:03
// Revision      :
// Author        : Angione Francesco
// Company       : Chalmers University of Technology, Sweden - Politecnico di Torino, Italy
// Email         : francescoangione8@gmail.com
//
// Description   : Cogitantium, the dumb tensor processor unit, top level entity of the accelerator
//
//

`timescale 1ns / 1ps
`include "precision_def.vh"
//`define DUMMY

module dtpu_core
#(parameter DATA_WIDTH_MAC=64,
    ROWS=3,
    COLUMNS=3,
    SIZE_WMEMORY=8196,
    ADDRESS_SIZE_WMEMORY=32,
    ADDRESS_SIZE_CSR=32,
    SIZE_CSR=1024,
    DATA_WIDTH_CSR=8,
    DATA_WIDTH_WMEMORY=64,
    DATA_WIDTH_FIFO_IN=64,
    DATA_WIDTH_FIFO_OUT=64,
    MAX_BOARD_DSP = 220
    )
```

```
    (
    input wire clk,
    (* X_INTERFACE_INFO = "xilinx.com:signal:reset:1.0 aresetn RST" *)
    (* X_INTERFACE_PARAMETER = "POLARITY ACTIVE_LOW" *)
    input wire aresetn,
    input wire test_mode,
    input wire enable,

    ///// CSR INTERFACE //////
    (* X_INTERFACE_PARAMETER = "MASTER_TYPE BRAM_CTRL, MEM_ECC no, MEM_WIDTH 8, MEM_SIZE 1024 " *)
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface EN" *)
    output wire csr_ce,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface DOUT" *)
    input wire [DATA_WIDTH_CSR-1:0] csr_dout,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface DIN" *)
    output wire [DATA_WIDTH_CSR-1:0] csr_din,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface WE" *)
    output wire csr_we,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface ADDR" *)
    output wire [ADDRESS_SIZE_CSR-1:0] csr_address,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface CLK" *)
    output wire csr_clk,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 csr_mem_interface RST" *)
    output wire csr_reset,

    ///// WEIGHT MEMORY //////
    (* X_INTERFACE_PARAMETER = "MASTER_TYPE BRAM_CTRL, MEM_ECC no, MEM_WIDTH 64, MEM_SIZE 8192 " *)
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface EN" *)
    output wire wm_ce,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface DOUT" *)
    input wire [DATA_WIDTH_WMEMORY-1:0] wm_dout,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface DIN" *)
```

```
    output wire [DATA_WIDTH_WMEMORY-1:0] wm_din,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface WE" *)
    output wire wm_we,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface ADDR" *)
    output wire [ADDRESS_SIZE_WMEMORY-1:0] wm_address,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface CLK" *)
    output wire wm_clk,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram_rtl:1.0 weight_mem_interface RST" *)
    output wire wm_reset,

    //////// INPUT DATA FIFO ///////////////
    ///////// using stream axi
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_fifo_read:1.0 input_fifo RD_DATA" *)
    input wire [DATA_WIDTH_FIFO_IN-1:0] infifo_dout,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_fifo_read:1.0 input_fifo RD_EN" *)
    output wire infifo_read,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_fifo_read:1.0 input_fifo EMPTY_N" *)
    input wire infifo_is_empty,

    /////// OUTPUT DATA FIFO ////////////
    //////// using stream axi
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_fifo_write:1.0 output_fifo WR_DATA" *)
    output wire [DATA_WIDTH_FIFO_OUT-1:0] outfifo_din,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_fifo_write:1.0 output_fifo WR_EN" *)
    output wire outfifo_write,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_fifo_write:1.0 output_fifo FULL_N" *)
    input wire outfifo_is_full,

    /////// CONTROL FROM/TO PS //////////
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_handshake_rtl:1.0 control_interface ap_start" *)
```

```
    input wire cs_start,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_handshake_rtl:1.0 control_interface ap_ready" *)
    output wire cs_ready,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_handshake_rtl:1.0 control_interface ap_done" *)
    output wire cs_done,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_handshake_rtl:1.0 control_interface ap_continue" *)
    input wire cs_continue,
    (* X_INTERFACE_INFO = "xilinx.com:interface:acc_handshake_rtl:1.0 control_interface ap_idle" *)
    output wire cs_idle,

    // debug state
    output wire [3:0] state,
    output wire [3:0] d_out
    );

    ///***************************///
    ///// ----- Cogitantium ----- /////
    ///// the dumb tensor processing unit ////

    wire [COLUMNS*ROWS*DATA_WIDTH_FIFO_OUT-1:0] weight_to_mxu;
    wire [COLUMNS*DATA_WIDTH_FIFO_IN-1:0] input_data_to_mxu;
    wire [ROWS*DATA_WIDTH_FIFO_OUT-1:0] output_data_from_mxu;
    wire enable_deskew_ff_i,enable_enskew_ff_i;
    wire [`LOG_ALLOWED_PRECISIONS-1:0] data_precision;
    wire enable_i;
    wire enable_load_array;
    wire [ROWS * COLUMNS -1:0] read_weight_memory;
    wire [COLUMNS:0] enable_load_activation_data;
    wire [COLUMNS:0] enable_store_activation_data;
    wire enable_cnt;
    wire ld_max_cnt;
    wire enable_cnt_weight;
    wire ld_max_cnt_weight;
    wire enable_chain;
    wire ld_weight_page_cnt;
    wire [1:0] enable_fp_unit;

    wire [ADDRESS_SIZE_WMEMORY-1:0] start_value_wm;
```

```
    wire [$clog2(COLUMNS):0] max_cnt_from_cu;
    wire [$clog2(ROWS*COLUMNS):0] max_cnt_weight_from_cu;
    wire reset_i;

    assign d_out=data_precision;

    assign reset_i=~aresetn;

    ///// MATRIX MULTIPLICATION UNIT ////////
    mxu_wrapper
    #(.M(ROWS), // matrix row -> weights
        .K(COLUMNS), // matrix columns -> input data
        .max_data_width(DATA_WIDTH_MAC), // it must be a divisor of 64
        .MAX_BOARD_DSP(MAX_BOARD_DSP)
        ) engine
        (
            .data_type(data_precision),
            .reset(reset_i),
            .clk(clk),
            .enable(enable_i),
            .enable_chain(enable_chain),
            .enable_fp_unit(enable_fp_unit),
            .enable_in_ff(enable_enskew_ff_i),
            .enable_out_ff(enable_deskew_ff_i),
            .test_mode(test_mode),
            .input_data(input_data_to_mxu),
            .weight(weight_to_mxu),
            .y(output_data_from_mxu)
        );

    control_unit #( .DATA_WIDTH_FIFO_IN(DATA_WIDTH_FIFO_IN),
                .DATA_WIDTH_FIFO_OUT(DATA_WIDTH_FIFO_OUT),
                .DATA_WIDTH_WMEMORY(DATA_WIDTH_WMEMORY),
                .DATA_WIDTH_CSR(DATA_WIDTH_CSR),
                .ROWS(ROWS),
                .COLUMNS(COLUMNS),
                .ADDRESS_SIZE_CSR(ADDRESS_SIZE_CSR),
                .ADDRESS_SIZE_WMEMORY(ADDRESS_SIZE_WMEMORY))
    cu(
        .clk(clk),
        .reset(reset_i),
        .test_mode(test_mode),
        .glb_enable(enable),
        .enable_mxu(enable_i),
        .csr_address(csr_address),
```

```
        .csr_dout(csr_dout),
        .csr_ce(csr_ce),
        .csr_reset(csr_reset),
        .csr_we(csr_we),
        .wm_ce(wm_ce),
        .wm_reset(wm_reset),
        .wm_we(wm_we),
        .infifo_is_empty(infifo_is_empty),
        .infifo_read(infifo_read),
        .outfifo_is_full(outfifo_is_full),
        .outfifo_write(outfifo_write),
        .cs_continue(cs_continue),
        .cs_done(cs_done),
        .cs_idle(cs_idle),
        .cs_ready(cs_ready),
        .cs_start(cs_start),
        .state_out(state),
        .enable_deskew_ff(enable_deskew_ff_i),
        .enable_enskew_ff(enable_enskew_ff_i),
        .enable_fp_unit(enable_fp_unit),
        .enable_chain(enable_chain),
        .enable_load_array(enable_load_array),
        .data_precision(data_precision),
        .read_weight_memory(read_weight_memory),
        .enable_load_activation_data(enable_load_activation_data),
        .enable_store_activation_data(enable_store_activation_data),
        .enable_cnt(enable_cnt),
        .ld_max_cnt(ld_max_cnt),
        .enable_cnt_weight(enable_cnt_weight),
        .ld_max_cnt_weight(ld_max_cnt_weight),
        .ld_weight_page_cnt(ld_weight_page_cnt),
        .start_value_wm(start_value_wm),
        .max_cnt_from_cu(max_cnt_from_cu), // it depends on the current bitwidth [$clog2(COLUMNS):0]
        .max_cnt_weight_from_cu(max_cnt_weight_from_cu) //[$clog2(ROWS):0]
        );

    /////// LOAD AND STORE ARRAY //////////
    `ifndef DUMMY

    ls_array
    #(
        .ROWS(ROWS),
```

```
        .COLUMNS(COLUMNS),
        .data_in_width(DATA_WIDTH_FIFO_IN),
        .data_in_mem(DATA_WIDTH_WMEMORY),
        .address_leng_wm(ADDRESS_SIZE_WMEMORY),
        .size_wmemory(SIZE_WMEMORY)) ls_array_inst
    (
    .clk(clk),
    .reset(reset_i),
    .enable_load_array(enable_load_array),
    .data_precision(data_precision),
    .read_weight_memory(read_weight_memory),
    .infifo_read(infifo_read),
    .outfifo_write(outfifo_write),
    .input_data_from_fifo(infifo_dout), //[data_in_width-1:0]
    .data_to_fifo_out(outfifo_din), //[data_in_width-1:0]
    .data_from_weight_memory(wm_dout), //[data_in_mem-1:0]
    .data_from_mxu(output_data_from_mxu), //[data_in_width*ROWS-1:0]
    .data_to_mxu(input_data_to_mxu), //[data_in_width*COLUMNS-1:0]
    .weight_to_mxu(weight_to_mxu), //[data_in_width*ROWS-1:0]
    .wm_address(wm_address), //[address_leng_wm-1:0]
    .enable_load_activation_data(enable_load_activation_data),
    .enable_store_activation_data(enable_store_activation_data),
    .enable_cnt(enable_cnt),
    .ld_max_cnt(ld_max_cnt),
    .enable_cnt_weight(enable_cnt_weight),
    .ld_max_cnt_weight(ld_max_cnt_weight),
    .ld_weight_page_cnt(ld_weight_page_cnt),
    .start_value_wm(start_value_wm),
    .max_cnt_from_cu(max_cnt_from_cu), // it depends on the current bitwidth [$clog2(COLUMNS):0]
    .max_cnt_weight_from_cu(max_cnt_weight_from_cu) //[$clog2(ROWS):0]
    );

    `endif

    `ifdef DUMMY
    always @(posedge(clk)) begin
        if(reset_i) begin
            input_data_from_fifo <=0;
            weight_from_memory <=0;
        end else begin
            if (enable_load_array && infifo_read ) begin
```

```
                input_data_from_fifo <= infifo_dout;
                weight_from_memory <= wm_dout;
            end
        end
    end
    // dummy assignment for 3 columns and rows
    assign outfifo_din=( outfifo_write ? input_data_to_fifo:64'b0);

    `endif

    // same clock for bram interface
    assign csr_clk=clk;
    assign wm_clk=clk;

endmodule
```
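
The internal buses of `dtpu_core` are flattened vectors whose widths follow from the module parameters. As a rough aid for checking a configuration before synthesis, the sketch below recomputes those widths in Python for the default parameter values of the listing. The helper names `clog2` and `dtpu_bus_widths` are illustrative only and do not belong to the design; `clog2` is meant to mirror Verilog's `$clog2` for positive integers.

```python
def clog2(value):
    """Ceiling of log2 for a positive integer, mirroring Verilog's $clog2."""
    if value < 1:
        raise ValueError("value must be a positive integer")
    return (value - 1).bit_length()


def dtpu_bus_widths(rows=3, columns=3, width_fifo_in=64, width_fifo_out=64):
    """Bit widths of the internal wires declared in dtpu_core."""
    return {
        # wire [COLUMNS*ROWS*DATA_WIDTH_FIFO_OUT-1:0] weight_to_mxu
        "weight_to_mxu": columns * rows * width_fifo_out,
        # wire [COLUMNS*DATA_WIDTH_FIFO_IN-1:0] input_data_to_mxu
        "input_data_to_mxu": columns * width_fifo_in,
        # wire [ROWS*DATA_WIDTH_FIFO_OUT-1:0] output_data_from_mxu
        "output_data_from_mxu": rows * width_fifo_out,
        # wire [$clog2(COLUMNS):0] max_cnt_from_cu (one extra bit, range ends at 0)
        "max_cnt_from_cu": clog2(columns) + 1,
        # wire [$clog2(ROWS*COLUMNS):0] max_cnt_weight_from_cu
        "max_cnt_weight_from_cu": clog2(rows * columns) + 1,
    }


# Default 3x3 array with 64-bit FIFO data, as in the parameter list of the listing.
for name, width in dtpu_bus_widths().items():
    print(name, width)
```

Calling `dtpu_bus_widths(rows=4, columns=4)` gives the corresponding widths for a larger array, which is a quick way to see how the weight bus grows with the number of processing elements.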

## C

## Results for different frequencies



**Figure C.1:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 50 MHz and integer 8 PEs



**Figure C.2:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 80 MHz and integer 8 PEs



**Figure C.3:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 100 MHz and integer 8 PEs



**Figure C.4:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 120 MHz and integer 8 PEs



**Figure C.5:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 50 MHz and integer 16 PEs



**Figure C.6:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 80 MHz and integer 16 PEs



**Figure C.7:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 100 MHz and integer 16 PEs



**Figure C.8:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 50 MHz and integer 32 PEs



**Figure C.9:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 80 MHz and integer 32 PEs



**Figure C.10:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 100 MHz and integer 32 PEs



**Figure C.11:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 50 MHz and integer 64 PEs



**Figure C.12:** Post Implementation Dynamic Power Consumption per entity in Programmable Logic with a clock frequency of 60 MHz and integer 64 PEs