# POLITECNICO DI TORINO

Master degree course in Electronic Engineering

Master Degree Thesis in Integrated System Architecture

# Next Generation Hardware Acceleration opportunities in Data Centers



**Supervisor** prof. Maurizio MARTINA **Candidate** Giuseppe Dongiovanni Mancino 245205

External Supervisor Haute Ecole d'Ingénierie et de Gestion du Canton de Vaud (HEIG-VD) prof. Alberto DASSATTI

Anno Accademico 2019-2020

To my beloved family and friends

#### Abstract

The world has entered the Era of Data. The role of data centers is increasingly relevant because of the growing amount of data that has to be processed and stored. Scientific and engineering interest therefore focuses on new technologies that increase the computational power and storage capacity of data centers while, at the same time, reducing their energy footprint.

Some of the available technological solutions that can be used to meet the requirements for performance and energy efficiency are presented in the first part of this thesis. They are the well-established NVMe protocol, designed to become an industry standard that exploits the performance of the PCIe interface it leverages, and computational storage. The combination of these two technologies is the enabling framework for the near-data computing paradigm, also known as smart storage. The idea is simple: move the computation close to the data, reducing data movement, the main source of energy dissipation, without compromising processing performance.

Then the attention is focused on the development of a computational storage based on the NVMe protocol: the prototype is built on a Xilinx FPGA board, on the basis of an open-source project of an NVMe SSD controller.

The resulting prototype is a scalable and standard-compliant solution: it has been developed to experiment with and explore the capabilities of the technologies and standards used, and the possible benefits that they can provide to data centers. Extensive benchmark tests have been carried out to characterize the device and discover its performance limits, in terms of both data rate and latency. At the same time, a general test application has been integrated in the prototype to evaluate, as a real-world example, the complexity involved in deploying hardware accelerators.

# Acknowledgements

The candidate would like to warmly thank the REDS institute, attached to the Department of Information and Communications Technology (ICT) of the HEIG-VD, for the resources provided, and Professor Dassatti and the whole team for their guidance and help during these months. Moreover, the candidate would like to thank Mr. Kibin Park, one of the PhD students working on the OpenSSD project, for his help in the early stages of the thesis.

# Contents

- List of Tables
- List of Figures
- 1 Introduction
  - 1.1 Possible Solutions
  - 1.2 Thesis Structure
- 2 State of Art
  - 2.1 PCI Express
  - 2.2 NVM Express
    - 2.2.1 Overview
  - 2.3 Computational Storage
    - 2.3.1 Commercial Hardware Accelerators
- 3 Com…
  - 3.1 Cosmos+ OpenSSD Project
  - 3.2 Computational Storage Development
- 4 Conclusion
- Bibliography

# **List of Tables**

| Table | Title                                           |
|-------|-------------------------------------------------|
| 3.1   | Set Parameter - Feature Identifiers Assignation |
| 4.1   | Iometer Sequential Read Test - block size 4 kB  |
| 4.2   | Latency Sequential Read Test - block size 4 kB  |

# List of Figures

| Figure | Title |
|--------|-------|
| 1.1 | Energy Forecast - Nature, How to stop data centres from gobbling up the world's electricity |
| 1.2 | Comparison between SATA, SAS and NVMe - StorageIOblog.com |
| 2.1 | Comparison between SATA and PCIe in number of IOPS - Design & Reuse |
| 2.2 | PCIe Generations - Wikipedia, PCI Express |
| 2.3 | General diagram of a Multi-core Host NVMe Management - IDF13, Optimized Interface for PCI Express SSDs |
| 2.4 | NVMe Hierarchy Types - SDC19, Managing Capacity in NVM Express SSDs |
| 2.5 | Namespace Management and Attachment Commands - SNIA DSI Conference 2015, Creating Higher Performance Solid State Storage with Non-Volatile Memory Express |
| 2.6 | NVMe-oF General Scheme - NVM Express NVMe Over Fabrics |
| 2.7 | NVMe-oF: Protocol Options - Flash Memory Summit 2020, NVMe-oF™ Enterprise Appliances |
| 2.8 | Different implementations of Computational Storage - SNIA Computational Storage 2019, What happens when Compute meets Storage? |
| 2.9 | Peer-to-Peer with an NVMe CSx Device - MSST 2019, How NVM Express and Computational Storage can make your AI Applications Shine! |
| 2.10 | NoLoad® CSP - SNIA SDC2017, An NVMe-based Offload Engine for Storage Acceleration |
| 3.1 | OpenSSD project History - Cosmos+ OpenSSD 2017 Tutorial |
| 3.2 | Cosmos+ OpenSSD project System Overview - Cosmos+ OpenSSD 2017 Tutorial |
| 3.3 | Cosmos+ OpenSSD project System Design |
| 3.4 | Nand Modules Organization - Cosmos+ OpenSSD 2017 Tutorial |
| 3.5 | Command Priority - Cosmos+ OpenSSD 2017 Tutorial |
| 3.6 | Firmware Overall Sequence - Cosmos+ OpenSSD 2017 Tutorial |
| 3.7 | First Adaptation of the Cosmos+ OpenSSD project System Design |
| 3.8 | Functionality test - Power-up, Partitioning and Reset - Development PC |
| 3.9 | Functionality test - Power-up - Host PC |
| 3.10 | Functionality test - Formatting operation - Host PC |
| 3.11 | Functionality test - Data Correctness - Host PC |
| 3.12 | Functionality test - NVMe Identify Command - Host PC |
| 3.13 | Functionality test - NVMe Namespace List - Host PC |
| 3.14 | Performance test - dd function, write and read - Base Version |
| 3.15 | Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Base Version |
| 3.16 | Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Base Version |
| 3.17 | Performance test - Read Latency - Base Version |
| 3.18 | Terminal Error - head autoDmaRx stuck |
| 3.19 | Waveform Error - head autoDmaRx stuck |
| 3.20 | Waveform Error - Submitted DMA Request |
| 3.21 | Performance test - dd function, write and read - Optimized Version |
| 3.22 | Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Optimized Version |
| 3.23 | Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Optimized Version |
| 3.24 | General Architecture of a FPGA-based Computational Storage |
| 3.25 | 32bit AXI DMA Adaptation of the Cosmos+ OpenSSD project System Design |
| 3.26 | Performance test - dd function, write and read - 32bit AXI DMA Version |
| 3.27 | Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - 32bit AXI DMA Adaptation |
| 3.28 | Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - 32bit AXI DMA Adaptation |
| 3.29 | Performance test - Read Latency - 32bit AXI DMA Adaptation |
| 3.30 | System Design of the first version of the Computational Storage based on the Cosmos+ OpenSSD project |
| 3.31 | FSM of the first version of the Hardware Accelerator |
| 3.32 | NVMe Set Parameter Admin Command - Addition of 0x7 to a 32-bit sequence - Host and Developer PCs |
| 3.33 | Performance test - dd function, write and read - Computational storage first prototype |
| 3.34 | Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Computational storage first prototype |
| 3.35 | Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Computational storage first prototype |
| 3.36 | Performance test - Read Latency - Computational storage first prototype |
| 3.37 | Performance test - dd function, write and read - 128bit AXI DMA Version |
| 3.38 | Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - 128bit AXI DMA Adaptation |
| 3.39 | Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - 128bit AXI DMA Adaptation |
| 3.40 | Performance test - Read Latency - 128bit AXI DMA Adaptation |
| 3.41 | General diagram of the Hardware Accelerator |
| 3.42 | FSMs of the second version of the Hardware Accelerator |
| 3.43 | CTR Mode - NIST, Recommendation for Block Cipher Modes of Operation: Methods and Techniques |
| 3.44 | Verilog Test-bench - AES-128 Encryption |
| 3.45 | Verilog Test-bench - AES-128 Decryption |
| 3.46 | Verilog Test-bench - AES-256 Encryption |
| 3.47 | Verilog Test-bench - AES-256 Decryption |
| 3.48 | Module execution - AES-128 Terminal |
| 3.49 | Module execution - AES-128 Configuration |
| 3.50 | Module execution - AES-128 Encryption |
| 3.51 | Module execution - AES-256 Terminal |
| 3.52 | Module execution - AES-256 Configuration |
| 3.53 | Module execution - AES-256 Encryption |
| 3.54 | General Execution - AES-128 Host PC Terminal |
| 3.55 | General Execution - AES-128 Developer PC Terminal |
| 3.56 | General Execution - AES-128 Encryption Waveform |
| 3.57 | General Execution - AES-256 Host PC Terminal |
| 3.58 | General Execution - AES-256 Developer PC Terminal |
| 3.59 | General Execution - AES-256 Encryption Waveform |
| 3.60 | Performance test - dd function, write and read - 128bit AXI DMA Version |
| 3.61 | Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Computational Storage |
| 3.62 | Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Computational Storage |
| 3.63 | Performance test - Read Latency - Computational Storage |

# Chapter 1 Introduction

Even if we have never set foot in a data center, we live in a data-driven society and depend on the services provided by data centers almost as much as on other primary services, such as the water supply.

Every time we use social networks, check our bank balance, or create a backup of our data in the cloud, we are interacting with a data center: nowadays, living without them seems nearly impossible.

A data center, also called a server farm, is a facility that hosts a large number of networked servers, routers and storage devices. It is used by governmental organizations and companies for remote storage and large-scale data handling and processing, and operates 24/7 to provide secure and continuous service[1].

The requirements that a data center should meet to provide an optimal service are:

- High performance (in terms of latency, IOPS and bandwidth);
- High memory capacity, to match the increase of data that have to be stored and/or computed;
- Non-volatile capabilities, to protect data even in case of power outage (backup server);
- Lower power consumption, to reduce costs for energy supply and cooling;
- Efficient management, to use the resources in the best way adapting to the workload;
- Flexibility, to easily deploy new technologies and/or applications while, at the same time, leaving room for growth.

Through the years, the required processing capabilities of the data centers increased in order to match the increase in demand of bandwidth and computing. This increase in performance can be provided in two main ways:

- increasing the workload on server processors, resulting in an excess of consumed power and an increase in temperature;
- increasing the elasticity, i.e. the ability to respond to workload changes, through over-provisioning: underutilising hardware in order to be able to meet peaks in demand is not sustainable, since a facility consumes a lot of energy even when idle[2].

However, as demand for Internet traffic grows exponentially, the information and communications technology could lead to an explosion in energy consumption if eco-sustainable strategies are not implemented.

As shown in Figure 1.1, data centers account for more than one-third of the worst-case expected consumption for ICT[3].

The Storage Networking Industry Association (SNIA) has been working to search for improvements to increase the energy efficiency of data centers, regarding server and storage devices, through its Green Storage Initiative. Companies are now paying more attention to the type of technology they are going to deploy. As a matter of fact, they are in need of lower power consumption as much as higher performance and additional capacity [4].

# **1.1 Possible Solutions**

To deal with these issues, current data centers must adopt new technologies and resources.

A first solution is to exploit hardware accelerators to offload the processors for different tasks, resulting in lower power consumption and higher energy efficiency: in addition, they can offer higher throughput and lower latency compared to common server processors[5].

Common hardware accelerators can be divided into the following main groups, based on the platform used for their integration[6]:

- ASIC: an integrated circuit designed for a specific application, which improves the overall system speed because it focuses on performing just one or a few functions;
- GPU: originally designed for handling images, GPUs offer more flexibility and programmability than ASICs: nowadays they are able to support different compute-intensive applications.

This specialization results in GPUs consisting of a large number of simple processing cores with simple control logic, since they handle data with little to no branching or data dependences, leading to very high parallelism.




Figure 1.1: Energy Forecast - Nature, How to stop data centres from gobbling up the world's electricity

Different GPU organizations exhibit different performance and energy characteristics, according to the needs of the user: for example, high-end GPUs are mostly designed for hardware acceleration, with an architecture that can be more complex internally, but with less general diversity thanks to the organization in clusters (GPCs), the large number of cores and the vast on-chip memory resources, all useful to achieve high performance.

- FPGA: as opposed to the fixed design of the ASIC, FPGAs consist of an array of logic blocks, DSPs, on-chip BRAMs, and routing channels, resulting in extreme flexibility of configuration. They can be customized to fit the needs of specific computations, with an efficiency that can approach that of a custom architecture.

In FPGA, custom data paths can transfer data directly between computing units and exploit data locality thanks to the distributed on-chip BRAMs, which results in less access to the external memory.

This extreme flexibility, however, comes at a cost. First, FPGAs are not as space efficient as their ASIC or GPU counterparts, which have higher internal density. Second, with FPGAs, developers must verify that their design complies with timing and space requirements.

Finally, lacking a fixed structure, FPGAs require the use of hardware synthesis tools to create a configuration file that defines the logic architecture on the device.

An alternative approach to realize reconfigurable hardware accelerators is the use of architectures with higher abstraction level, such as CGRAs (Coarse-Grained Reconfigurable Architecture), offering shorter compilation times but lower configuration flexibility.

- FPGA w/ SoC: recent commercial SoCs also include reconfigurable hardware fabrics, which are able to implement custom functions, such as hardware accelerators.

With these devices it is possible to achieve hardware/software solutions in a single chip, with the need to apply codesign approaches and hardware/software partitioning. This means that the developer needs to separate the application in parts that run on the CPU and those in the reconfigurable hardware.

The development of embedded applications targeting these reconfigurable systems is sped up by the progress of hardware synthesis tools, shortening the time to market of the final product.

However, moving data can be more expensive than processing it, and even more so if the transfers take place through an interconnect network. In fact, transferring data from storage systems to processors (and vice versa) is one of the major obstacles to meeting performance and energy efficiency requirements.

To remove this obstacle, it is necessary to change a basic concept from "move data to the process" to "move process to data". This approach, called in-situ processing (ISP) and already present in frameworks such as Hadoop, can be fully exploited thanks to the architecture of modern solid-state drives (SSDs) and the availability of powerful embedded processors, leading to the creation of computational storage. The embedded processor has access to the data stored in the NAND flash memory through a high-speed, low-power bus, which avoids both the workload on the host processor and the transfers to and from the host memory, and so reduces energy consumption[7].
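The difference between the two concepts can be illustrated with a toy model that only counts the bytes crossing the host interconnect. This is a minimal sketch, not the method used in this thesis: the record size, the function names and the 1% selectivity are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

RECORD_SIZE = 4096  # assumed record size in bytes (illustrative only)


@dataclass
class TransferReport:
    bytes_over_bus: int  # bytes that crossed the host interconnect
    matches: int         # records that satisfied the predicate


def host_side_filter(records, predicate):
    """'Move data to the process': every record is shipped to the host CPU."""
    moved = len(records) * RECORD_SIZE
    hits = [r for r in records if predicate(r)]
    return TransferReport(bytes_over_bus=moved, matches=len(hits))


def in_storage_filter(records, predicate):
    """'Move process to data': the filter runs next to the data,
    so only the matching records are shipped to the host."""
    hits = [r for r in records if predicate(r)]
    moved = len(hits) * RECORD_SIZE
    return TransferReport(bytes_over_bus=moved, matches=len(hits))


def is_hit(record):
    """Hypothetical predicate with 1% selectivity."""
    return record % 100 == 0


if __name__ == "__main__":
    data = list(range(1000))
    print("host side :", host_side_filter(data, is_hit))
    print("in storage:", in_storage_filter(data, is_hit))
```

Both functions return the same number of matches; only the traffic differs, and the gap grows as the predicate becomes more selective.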

Computational storages and how they are implemented will be dealt with in detail in section 2.3.

In order to fully exploit the potential of SSDs, PCI Express (PCIe), the interconnect bus closest to the host CPU, has been widely deployed in data centers, thanks to its ability to provide higher bandwidth and lower latency. PCIe is currently one of the fastest I/O data highways available for storage, capable of supporting fast processors with a high number of cores and heavy traffic[8]. However, the software protocols that define the traffic flow need improvement to match the high performance of PCIe.

A new storage protocol that exploits PCIe, called Non-Volatile Memory Express (NVMe)[9], has been designed to take full advantage of the capabilities of SSDs; its first specification was published in March 2011.

NVMe allows applications that require high-performance servers and access to local storage via fast I/O data highways to reach their performance potential.

NVMe provides what data centers and hardware accelerators require. In particular:

- Low latency, given by the direct CPU connection;
- High throughput;
- Low CPU overhead;
- Multi-core awareness, and so the ability to scale to the hundreds of cores in data centers;
- Management at scale.

A data center needs to manage everything on a single network: if the management is not efficient and network traffic is heavy, management would take up part of the bandwidth that could be sold to customers, resulting in a loss of profit. A simple network, easy to control and analyse, is needed.

NVMe can easily manage any number of storage devices and hardware accelerators, provided they are all NVMe devices, thanks to the NVMe Admin commands, which can, for example, update the firmware, format or repair a drive, and manage devices.

From the point of view of performance, the NVMe protocol reduces wait time and latency through a more effective use of the PCIe data highway. In fact, while SATA only allows a single command queue that holds 32 commands, NVMe enables 64K queues with 64K commands each[10].



Figure 1.2: Comparison between SATA, SAS and NVMe - StorageIOblog.com

NVMe has been designed with more and deeper queues, supporting a larger number of commands in those queues. In this way, SSDs can receive a large number of commands, exploit their internal parallelism, and better optimize command execution, achieving much higher concurrent IOPS.
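To give a sense of scale, the queue figures quoted above can be turned into an upper bound on outstanding commands. The sketch below uses the rounded "64K" values from the text rather than exact limits from the specification, and the one-queue-per-core policy is a hypothetical assumption used only to illustrate multi-core awareness.

```python
# Queue figures as quoted in the text (rounded "64K" values).
SATA_QUEUES, SATA_DEPTH = 1, 32
NVME_QUEUES, NVME_DEPTH = 64 * 1024, 64 * 1024


def max_outstanding(queues: int, depth: int) -> int:
    """Upper bound on the number of commands that can be queued at once."""
    return queues * depth


def queues_for_cores(cores: int, available_queues: int) -> int:
    """Hypothetical policy: one I/O queue per core, capped by the protocol limit."""
    return min(cores, available_queues)


if __name__ == "__main__":
    sata = max_outstanding(SATA_QUEUES, SATA_DEPTH)
    nvme = max_outstanding(NVME_QUEUES, NVME_DEPTH)
    print(f"SATA: {sata} outstanding commands")
    print(f"NVMe: {nvme} outstanding commands ({nvme // sata}x SATA)")
    # With 128 cores, SATA forces every core to share a single queue,
    # while NVMe can dedicate one queue to each core.
    print("queues usable by 128 cores:",
          queues_for_cores(128, SATA_QUEUES), "vs",
          queues_for_cores(128, NVME_QUEUES))
```

Real devices typically expose far fewer queues than the protocol maximum, so these numbers are a ceiling, not an expected operating point.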

Moreover, the specifications of the NVMe protocol include many features for power management: in particular, the presence of non-operational power states is of great significance to further reduce the idle power of the devices[11], thus providing an opportunity to reduce the energy impact of over-provisioning.

Summing up, NVMe provides both flexibility and compatibility, offers lower latency, and allows a higher number of concurrent I/O operations to be completed, thanks also to the PCIe interface it leverages, while having higher power efficiency. Last but not least, the NVMe protocol has been developed with the idea of creating a free standard that would be the same for everybody: for all these reasons, more and more companies, large and small, are investing in the development of the NVMe protocol.

The Internet of Things and, in the near future, Artificial Intelligence will totally change the technology landscape: their requirements for data and processing power are massive, leading to a huge increase in power consumption. NVMe is a first step that can have a big impact on the future of ICT: shared solutions must be found in order to satisfy the demands but, at the same time, stop data centers and all the ICT infrastructure from gobbling up the world's electricity[3].

# **1.2 Thesis Structure**

The thesis is organized in four main chapters. A background chapter follows this introduction and presents the relevant aspects of NVMe and computational storage. In particular, it first describes in detail the reasons that led to the adoption of PCIe and the NVMe protocol; the discussion then moves to the state of the art of computational storage.

The third chapter describes all the steps that were necessary to create an NVMe hardware accelerator, starting from the base project Cosmos+ OpenSSD to the final hardware accelerator, and the results of the various tests.

Then the final chapter sums up the obtained results, presenting the encountered problems and the possible developments for future work.

# Chapter 2

# State of Art

# 2.1 PCI Express

PCI Express (Peripheral Component Interconnect Express) is a high-speed serial computer expansion bus standard, directly connected to the motherboard.

Its direct competitor was Serial ATA (SATA)[8], a computer bus interface that connects host bus adapters to mass storage devices: initially designed for interfacing with hard-disk drives, the SATA interface became the IOPS bottleneck with the arrival of solid-state drives.

To keep up with the increasing speed of SSDs, SATA could not overcome its upper limit of 6 Gb/s (about 750 MB/s raw, or about 600 MB/s of payload after 8b/10b encoding) without major, time-consuming changes and without solutions that would be less power efficient and more expensive.

Thanks to better performance, due to low latency and high bandwidth, PCIe is becoming the predominant interface for storage devices[12].



Figure 2.1: Comparison between SATA and PCIe in number of IOPS - Design & Reuse

However, SATA and PCIe still coexist today: for example, SATA SSDs offer low cost and better performance than SATA HDDs, which makes them widely used in consumer applications; on the other hand, PCIe SSDs are more expensive but deliver much higher performance, meeting the requirements of industrial applications such as data centers.

A PCIe link is characterized by its number of lanes, and the standard has undergone several revisions, reaching very high throughput with the newer versions, as can be seen in Figure 2.2.

| PCI Express version | Introduced | Line code | Transfer rate | Throughput ×1 | Throughput ×2 | Throughput ×4 | Throughput ×8 | Throughput ×16 |
|---------------------|------------|-----------|---------------|---------------|---------------|---------------|---------------|----------------|
| 1.0                 | 2003       | 8b/10b    | 2.5 GT/s      | 250 MB/s      | 0.50 GB/s     | 1.0 GB/s      | 2.0 GB/s      | 4.0 GB/s       |
| 2.0                 | 2007       | 8b/10b    | 5.0 GT/s      | 500 MB/s      | 1.0 GB/s      | 2.0 GB/s      | 4.0 GB/s      | 8.0 GB/s       |
| 3.0                 | 2010       | 128b/130b | 8.0 GT/s      | 984.6 MB/s    | 1.97 GB/s     | 3.94 GB/s     | 7.88 GB/s     | 15.75 GB/s     |
| 4.0                 | 2017       | 128b/130b | 16.0 GT/s     | 1969 MB/s     | 3.94 GB/s     | 7.88 GB/s     | 15.75 GB/s    | 31.51 GB/s     |
| 5.0                 | 2019       | 128b/130b | 32.0 GT/s     | 3938 MB/s     | 7.88 GB/s     | 15.75 GB/s    | 31.51 GB/s    | 63.02 GB/s     |
| 6.0 (planned)       | 2021       | 128b/130b | 64.0 GT/s     | 7877 MB/s     | 15.75 GB/s    | 31.51 GB/s    | 63.02 GB/s    | 126.03 GB/s    |

PCI Express link performance

Figure 2.2: PCIe Generations - Wikipedia, PCI Express
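The throughput figures in the table follow directly from the transfer rate and the line-code overhead. The sketch below reproduces that arithmetic; the function and dictionary names are illustrative and are not taken from any PCIe tool or library.

```python
# Line-code efficiency: payload bits carried per transmitted bit.
ENCODING_EFFICIENCY = {"8b/10b": 8 / 10, "128b/130b": 128 / 130}


def lane_throughput_mb_s(transfer_rate_gt_s, line_code):
    """Usable throughput of one lane, in one direction, in MB/s (1 MB = 10^6 bytes)."""
    payload_bits_per_s = transfer_rate_gt_s * 1e9 * ENCODING_EFFICIENCY[line_code]
    return payload_bits_per_s / 8 / 1e6


def link_throughput_gb_s(transfer_rate_gt_s, line_code, lanes):
    """Usable throughput of a link with the given number of lanes, in GB/s."""
    return lane_throughput_mb_s(transfer_rate_gt_s, line_code) * lanes / 1e3
```

For instance, `lane_throughput_mb_s(8.0, "128b/130b")` gives about 984.6 MB/s, the Gen3 ×1 entry, and `link_throughput_gb_s(5.0, "8b/10b", 4)` gives 2.0 GB/s for a Gen2 ×4 link.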

NAND chips in consumer SSDs usually have a bus bandwidth of around 200MB/s: with 4-8 chips, a typical SSD can reach transfer speeds of up to 1.6GB/s, which can easily be supported by a PCIe Gen2 x4 link (2.0GB/s).

# 2.2 NVM Express

However, PCIe alone is not enough as a storage protocol: Non-Volatile Memory Express (NVMe) is a communication transfer protocol designed to address the needs of both Enterprise and Client systems. It has been developed specifically for PCIe-based SSDs, supports all form factors (U.2, M.2, AIC, EDSFF) and provides the capabilities to meet the demands of cloud, internet portal data centers and other high-performance computing environments.

### 2.2.1 Overview

The benefits that NVMe provides are[12, 13]:

- Driver standardization: NVMe is an open standard supported by the major operating systems, and future NVMe features, such as vendor-specific commands, can be integrated in the standard driver;
- Performance increase, since multi-queue management and the submission of a practically unlimited number of commands are possible, as shown in Figure 1.2, and the SATA bottleneck has been removed;
- Scalability, with headroom for improvements;
- Optimized register interface and command set, to simplify host software and device firmware;
- Reduced power consumption, resulting in a lower Total Cost of Ownership (TCO) and carbon footprint.

Another important feature for data centers is the possibility to exploit the parallel processing capabilities of multi-core processors: the ownership of queues, their priority and the arbitration mechanisms can be shared between different CPU cores, achieving higher IOPS and lower data latency.



Figure 2.3: General diagram of a Multi-core Host NVMe Management - IDF13, Optimized Interface for PCI Express SSDs

As shown in Figure 2.3, each core has one or more I/O submission queues, a completion queue and an MSI-X interrupt.

While the per-core queues carry the I/O commands, the submission and completion queues of the controller management carry the Admin commands, which are used to obtain information about the NVMe device, modify its configuration or create the I/O Submission and Completion Queues.

A write access from the host to the NVMe controller of a device can be described in the following way[15]:

- The host places new commands in the Submission Queue and writes the Submission Queue doorbell register (tail/head mechanism) of the NVMe controller, to inform it that new commands are ready;
- The NVMe controller fetches the commands, including all the necessary information (source and destination address, data size, priority, etc.), from the Submission Queue in host memory and processes them;
- The NVMe controller manages the data transfer and writes the completion of the commands in the host Completion Queue. Once all the commands have been executed, the NVMe controller generates the MSI-X interrupt;
- Finally, the host processes the completions and updates the Completion Queue doorbell register.
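The four steps above can be sketched as a toy model of one queue pair. This is a deliberately simplified illustration: the class and method names are hypothetical, the rings live in ordinary Python lists instead of host memory, and the phase tag that a real NVMe host uses to detect new completion entries is replaced by a direct comparison of head and tail indices.

```python
class QueuePair:
    """Toy model of one NVMe Submission/Completion Queue pair (illustrative only)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.sq = [None] * depth   # submission ring (in host memory on real hardware)
        self.cq = [None] * depth   # completion ring
        self.sq_tail = 0           # host-side producer index
        self.sq_doorbell = 0       # last value written to the SQ tail doorbell
        self.sq_head = 0           # controller-side consumer index
        self.cq_tail = 0           # controller-side producer index
        self.cq_head = 0           # host-side consumer index (CQ head doorbell)
        self.interrupts = 0        # number of MSI-X interrupts raised

    def host_submit(self, commands):
        """Step 1: queue the commands, then ring the doorbell once."""
        for cmd in commands:
            next_tail = (self.sq_tail + 1) % self.depth
            if next_tail == self.sq_head:
                raise RuntimeError("submission queue full")
            self.sq[self.sq_tail] = cmd
            self.sq_tail = next_tail
        self.sq_doorbell = self.sq_tail

    def controller_process(self):
        """Steps 2-3: fetch commands, post completions, raise one interrupt."""
        posted = 0
        while self.sq_head != self.sq_doorbell:
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq[self.cq_tail] = {"cid": cmd["cid"], "status": 0}
            self.cq_tail = (self.cq_tail + 1) % self.depth
            posted += 1
        if posted:
            self.interrupts += 1

    def host_reap(self):
        """Step 4: consume the completions and advance the CQ head doorbell."""
        done = []
        while self.cq_head != self.cq_tail:
            done.append(self.cq[self.cq_head])
            self.cq_head = (self.cq_head + 1) % self.depth
        return done
```

Calling `host_submit`, `controller_process` and `host_reap` in sequence for a batch of commands yields one completion entry per command and a single interrupt for the whole batch.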

## 2.2.2 NVMe Hierarchy

An NVMe subsystem consists of different elements:

- Namespaces: arrays of logical blocks;
- NVM Sets: groups of one or more namespaces;
- Endurance Groups: consisting of a fixed or variable number of NVM Sets;
- Domains: consisting of Endurance Groups, one or more NVMe controllers, etc.

The two types of Endurance Group identify two methods of management [16]:

- fixed capacity management, as shown in Figure 2.4, for drives that must satisfy the requirements of a hyperscale system. If the number of namespaces and NVM Sets per Endurance Group is fixed to 1 (NVM Sets are not needed anymore), the host will have a higher workload but will be able to optimize the wear leveling of the storage;
- variable capacity management, for a higher degree of customization.

An NVMe namespace is a storage volume of non-volatile memory formatted into logical blocks. It ranges from LBA 0 to LBA (n-1), where LBA stands for Logical Block Address and n is the size of the namespace in logical blocks, and it is backed by some capacity of non-volatile memory.

Namespaces may be created and deleted using the Namespace Management commands of the NVMe controller of the device, and each namespace is independent of the other namespaces.

An identifier (NSID) is provided by the controller, using Namespace Attachment Commands, in order for the host to have access to a namespace. A namespace, with a given NSID, can be accessed by multiple NVMe controllers[17], as shown in Figure 2.5.

## 2.2.3 NVMe-oF

NVMe over PCIe is limited to local use. Therefore, the natural evolution of the NVMe protocol is the possibility of accessing an external NVMe device through a network (Figure 2.6).




Figure 2.4: NVMe Hierarchy Types - SDC19, Managing Capacity in NVM Express SSDs

This evolution makes it possible to access NVMe devices over a shared network, keeping the benefits of the NVMe protocol. The additional advantages that NVMe-oF provides are:

- sharing and provisioning;
- data and workload migration;
- better efficiency;
- better data protection.



Figure 2.5: Namespace Management and Attachment Commands - SNIA DSI Conference 2015, Creating Higher Performance Solid State Storage with Non-Volatile Memory Express



Figure 2.6: NVMe-oF General Scheme - NVM Express, NVMe Over Fabrics

The direct consequence of NVMe-oF is the further consolidation of the position of data centers as a single, shared and efficient infrastructure for both SAN (storage area network) and DAS (direct-attached storage). Moreover, since any NVMe device can be accessed over a network, part of the local CPU workload can be offloaded to remote hardware accelerators, which may even perform it more efficiently[20].

NVMe-oF is transport agnostic: the same protocol can run over different transports, such as RoCE, TCP and Fibre Channel.

For Enterprise Storage, Fibre Channel fabric is the best choice: the Fibre Channel Protocol is stable, reliable, mature, very efficient and high speed, and it offers consistently high performance.

However, not all organizations and customers can adopt fabrics like Fibre Channel or RoCE: TCP is the most widespread and the simplest option, does not require special hardware or networks, being based on an Ethernet fabric, and can provide high performance if the network design is properly set up[21, 22].



Figure 2.7: NVMe-oF: Protocol Options - Flash Memory Summit 2020, NVMe-oF<sup>™</sup> Enterprise Appliances

# 2.3 Computational Storage

Computational storage is an architecture that couples hardware accelerators to the storage. The benefits of this architecture are:

- CPU offload, with the reduction of the host processor workload;
- increase in performance, due to the hardware acceleration;
- near storage computing, with the reduction of the amount of data that must move between the storage plane and the compute plane, increasing efficiency.

As a future perspective for the NVMe protocol, there is the possibility to create a new namespace type, the NVMe Computation Namespace. Operating systems will be able to treat these computation namespaces differently from storage namespaces. For instance, the computational storage will be seen as </dev/nvmeXcsY> and, if possible, the user space will not know or care whether it is local (over PCIe) or remote (over Fabrics)[24, 25].

Computational storage, generally referred to as CSx, can be mainly implemented, as shown in Figure 2.8, in three different ways:

- Computational Storage Processor (CSP);
- Computational Storage Drive (CSD);
- Computational Storage Array (CSA).



Figure 2.8: Different implementations of Computational Storage - SNIA Computational Storage 2019, What happens when Compute meets Storage?

Any type of computational storage provides computational storage services (CSS) that can be fixed (FCSS) or general/programmable purpose (PCSS)[26].

The CSP is the basic implementation of a computational storage: it is a component that provides CSS to a storage, but does not provide any persistent storage internally. The CSD, instead, is a component that provides CSS and persistent storage, with the possibility to access directly both or only one between the CSP and the storage. Finally, the CSA is a collection of computational storage drives, computational storage processors and/or storage, managed by a control software.

Moreover, Peer-to-Peer (P2P) operations can be achieved using computational storage.

As shown in Figure 2.9, the usual route followed to perform a data computation has the following steps:

- copy data from the storage to the DDR;
- compute data in the CPU;
- write back the data in the storage.

Using computational storage to process the data instead of the CPU, and an NVMe Controller Memory Buffer (CMB) to store the Submission and Completion Queues, it is possible to process a huge amount of data without any workload for the host CPU, apart from the usual tasks like security.

The benefits that can be achieved are [27]:

- reduced data movement;
- CPU offload of processing and DMA traffic;
- power efficiency.

Figure 2.9: Peer-to-Peer with an NVMe CSx Device - MSST 2019, How NVM Express and Computational Storage can make your AI Applications Shine!

## 2.3.1 Commercial Hardware Accelerators

Some of the companies involved in the development of the NVMe protocol work on creating computational storage.

The product that is going to be presented has been created by Eideticom, founded in 2016 with the sole objective of developing Computational Storage solutions for cloud and data centers. It is called NoLoad<sup>®</sup> CSP<sup>1</sup>, the first NVMe-based one to be created (August 2019), as certified by the UNH-IOL.

The purpose of Eideticom's NoLoad CSP is to accelerate storage and intensive workloads, reducing the utilization of the host CPU. The CSP is Plug-and-Play: it uses the drivers that are available on all major operating systems.

Moreover, it supports all types of form factors, P2P, CMB and NVMe-oF, and provides different types of computation services, like compression, encryption or data analytics.

Different demonstrations were carried out by partners like Bittware and Xilinx, respectively with the FPGA platforms 250-U2 and Alveo U50.

<sup>1</sup>Eideticom NoLoad: https://www.eideticom.com/uploads/images/NoLoad\_Product\_Spec.pdf



Figure 2.10: NoLoad<sup>®</sup> CSP - SNIA SDC2017, An NVMe-based Offload Engine for Storage Acceleration

On the other hand, there are several vendors developing CSDs not based on NVMe. Examples of FPGA-based CSDs are the ScaleFlux 2000 Series<sup>2</sup>, over PCIe, and the Samsung SmartSSD drive<sup>3</sup>.

However, while the SmartSSD memory is managed by a Samsung SSD controller, the ScaleFlux 2000 is practically an open-channel SSD: the flash translation layer (FTL) is not implemented in the FPGA, but runs in software on the host system[29].

<sup>2</sup>ScaleFlux 2000 Series: http://scaleflux.com/product.html

<sup>3</sup>Samsung SmartSSD: https://www.nimbix.net/samsungsmartssd

# Chapter 3 Computational Storage Project

In this chapter all the steps necessary to build an FPGA-based NVMe Computational Storage are analyzed. However, due to its complexity, it is first necessary to obtain the NVMe communication interface: an NVMe Controller. Therefore, the open-source project Cosmos+ OpenSSD has been chosen as the basis for this work, since it matches the requirements.

# 3.1 Cosmos+ OpenSSD Project

The first step consists of analysis and adaptation of the OpenSSD project from the original custom board to the Xilinx Zynq-7000 SoC ZC706.

### 3.1.1 Overview

Cosmos OpenSSD is an open-source, FPGA-based SSD controller project that has been developed since 2014 by the HYU ENC Lab of Hanyang University in South Korea, for research and education purposes[30].

A first version of the project was based on Indilinx Barefoot, a SoC over SATA2. The version of the project analyzed here is Cosmos+ OpenSSD: developed in 2016, this version of the SSD controller supports the NVMe protocol. The project has been developed using the Xilinx developer tools, Vivado Design Suite and SDK.

The custom Cosmos+ FPGA board has the following main features:

- FPGA Xilinx Zynq-7000 with a dual-core ARM Cortex-A9 at 1GHz;
  - 1GB of DDR3;
  - AXI4-lite bus width of 32 bits;
  - AXI4 bus width of 64 bits;
- dual PCIe Gen2 x8 End-Points (Cabled PCIe Interface);
- additional interfaces (JTAG, USB, Ethernet);
- up to 2 NAND Flash Modules, with 8 flash package slots each.

|                       | Jasmine OpenSSD         | Cosmos OpenSSD             | Cosmos+ OpenSSD            |
|-----------------------|-------------------------|----------------------------|----------------------------|
| Released in           | 2011                    | 2014                       | 2016                       |
| SSD Controller        | Indilinx Barefoot (SoC) | HYU Tiger3 (FPGA)          | HYU Tiger4 (FPGA)          |
| Host Interface        | SATA2                   | PCIe Gen2 4-lane<br>(AHCI) | PCIe Gen2 8-lane<br>(NVMe) |
| Maximum Capacity      | 128 GB (32 GB/module)   | 256 GB (128 GB/module)     | 2 TB (1 TB/module)         |
| NAND Data Interface   | SDR (Asynchronous)      | NVDDR (Synchronous)        | NVDDR2 (Toggle)            |
| ECC Type and Strength | BCH, 16 bits/512 B      | BCH, 32 bits/2 KB          | BCH, 26 bits/512 B         |

Figure 3.1: OpenSSD project History - Cosmos+ OpenSSD 2017 Tutorial

Figure 3.2 illustrates the internal system overview of the project.



Figure 3.2: Cosmos+ OpenSSD project System Overview - Cosmos+ OpenSSD 2017 Tutorial

The Zynq processor is connected to the host through the Host Interface, called NVMe Host Controller, which is responsible for:

- handling the data from the host to the buffer with a DMA engine;
- automated completion of the NVMe I/O commands, without involving the Flash Translation Layer (FTL), which will be described later.

The NAND Flash Controller is the interface between the NAND Flash and the processor. It consists, as shown in the system block design in Figure 3.3, of three different hardware IP blocks:

- Tiger4 NSC;
- Tiger4 Shared KES;
- V2NFC.



Figure 3.3: Cosmos+ OpenSSD project System Design

The Tiger4 NSC IP is responsible for handling the commands and data coming from the processing system: the commands, consisting of information such as source and destination of the operation, are written by the firmware driver in the Tiger4 NSC registers and then processed. The data, instead, undergo different manipulations: in particular, they pass through the module responsible for error detection and correction, the Tiger4 Shared KES.

Finally, the data are handled by the V2NFC block, that physically performs the low-level I/O operations in the NAND Flash Modules.

The operations have to be scheduled in order to be performed on the right package and die. As shown in Figure 3.4, each NAND module has up to 4 channels, one every two packages, and each channel has at most 8 ways, corresponding to the maximum number of connected dies. The number of channels is equal to the number of NAND Flash Controllers.
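As a quick check of the organization just described, the sketch below derives the maximum number of channels and dies, assuming both NAND modules are fully populated. The constant names and the channel-first striping used in `die_address` are illustrative assumptions, not taken from the Cosmos+ firmware.

```python
MODULES = 2                # up to 2 NAND Flash Modules on the board
PACKAGES_PER_MODULE = 8    # 8 flash package slots per module
PACKAGES_PER_CHANNEL = 2   # one channel every two packages
WAYS_PER_CHANNEL = 8       # at most 8 dies (ways) per channel

channels_per_module = PACKAGES_PER_MODULE // PACKAGES_PER_CHANNEL
total_channels = MODULES * channels_per_module
total_dies = total_channels * WAYS_PER_CHANNEL


def die_address(die_index):
    """Map a flat die index to (channel, way).

    Channel-first striping is an assumption made for illustration:
    consecutive indices land on different channels.
    """
    if not 0 <= die_index < total_dies:
        raise IndexError("die index out of range")
    return die_index % total_channels, die_index // total_channels
```

With these figures the board offers at most 8 channels and 64 dies, which is the parallelism the scheduler has to exploit.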



Figure 3.4: Nand Modules Organization - Cosmos+ OpenSSD 2017 Tutorial

In the first version of the project, Cosmos, the way scheduling was managed by the NFC block and the channel scheduling by the firmware Flash Translation Layer (FTL); in Cosmos+, both channel and way scheduling are managed by the FTL, providing more flexibility.

The other main features of the FTL are:

- Least Recently Used (LRU) data buffer management;
- priority command scheduling, as shown in Figure 3.5, with the aim of enhancing the multi-channel and multi-way parallelism;
- on-demand garbage collection, triggered only when there are no more free user blocks in a die.

Garbage collection is needed to recover free blocks for write requests: a victim block containing invalid data is selected, its valid data are copied to a free block and the victim block is then erased.
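A minimal sketch of this procedure is given below. It assumes a greedy victim policy (the block with the most invalid pages), which is a common choice but is not confirmed by the source as the policy of the Cosmos+ firmware; the data layout and names are hypothetical.

```python
def garbage_collect(blocks, free_ids):
    """Reclaim one block.

    blocks:   dict mapping block id -> list of (data, valid) pages
    free_ids: set of ids of erased blocks
    Returns (victim, target), or None if nothing can be reclaimed.
    """
    # Only used blocks that hold at least one invalid page are candidates.
    candidates = {
        b: pages for b, pages in blocks.items()
        if b not in free_ids and any(not valid for _, valid in pages)
    }
    if not candidates or not free_ids:
        return None

    # Greedy policy (assumption): pick the block with the most invalid pages.
    victim = max(candidates,
                 key=lambda b: sum(not valid for _, valid in candidates[b]))
    target = free_ids.pop()

    # Copy only the valid pages to the free block, then erase the victim.
    blocks[target] = [(data, True) for data, valid in blocks[victim] if valid]
    blocks[victim] = []
    free_ids.add(victim)
    return victim, target
```

After the call the victim is back in the free pool and the target block holds only valid pages, so its remaining pages are available for new writes.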

However, while supporting the garbage collection, the firmware does not support the wear leveling of the flash memory.

A schematic description of the firmware execution is shown in Figure 3.6.

In order to be executed, a received I/O command is first transformed into Slice Requests, whose number depends on the number of logical blocks requested. Then each Slice Request undergoes a second transformation into DMA Requests and, if there is no buffer hit for read operations, NAND Requests. These requests are organized in different queues: one for the free requests, three for the requests that are going to be executed (one for each aforementioned type of request), and two for the blocked requests, either for buffer or for row address dependencies. Finally, the requests that are not blocked are scheduled and executed.

| Command                     | Priority |
|-----------------------------|----------|
| LLSCommand_RxDMA            | 0        |
| LLSCommand_TxDMA            | 0        |
| V2FCommand_StatusCheck      | 1        |
| V2FCommand_ReadPageTrigger  | 2        |
| V2FCommand_BlockErase       | 3        |
| V2FCommand_ProgramPage      | 4        |
| V2FCommand_ReadPageTransfer | 5        |

Figure 3.5: Command Priority - Cosmos+ OpenSSD 2017 Tutorial

Figure 3.6: Firmware Overall Sequence - Cosmos+ OpenSSD 2017 Tutorial
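The priorities of Figure 3.5 can be turned into a small scheduling sketch. It assumes that a lower value means a higher priority and that commands with equal priority are served in arrival order; both are assumptions, and the real firmware scheduler also accounts for channel and way availability, which is ignored here. The class name is hypothetical.

```python
import heapq
from itertools import count

# Priorities as listed in Figure 3.5 (lower value served first: assumption).
COMMAND_PRIORITY = {
    "LLSCommand_RxDMA": 0,
    "LLSCommand_TxDMA": 0,
    "V2FCommand_StatusCheck": 1,
    "V2FCommand_ReadPageTrigger": 2,
    "V2FCommand_BlockErase": 3,
    "V2FCommand_ProgramPage": 4,
    "V2FCommand_ReadPageTransfer": 5,
}


class CommandScheduler:
    """Priority queue of pending commands, FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def submit(self, command, payload=None):
        entry = (COMMAND_PRIORITY[command], next(self._arrival), command, payload)
        heapq.heappush(self._heap, entry)

    def next_command(self):
        """Return the (command, payload) to execute next, or None if idle."""
        if not self._heap:
            return None
        _, _, command, payload = heapq.heappop(self._heap)
        return command, payload
```

Submitting a page program, a read trigger, an Rx DMA and a status check, in that order, would return them as Rx DMA, status check, read trigger and page program.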

### 3.1.2 Project Adaptation

The available platform is the Xilinx ZC706. In comparison with the original custom board, it has:

- the same Zynq-7000 SoC FPGA;
- a 4-lane Gen2 PCIe connector, instead of the 8-lane one of the custom board;
- no NAND module.

The first step to adapt the project is to modify the hardware<sup>1</sup>: there is no NAND module, therefore the entire NAND Flash Controller hardware is not necessary.

At the same time, the NVMe Host Controller has to be changed from the 8-lane PCIe to the 4-lane one: this can be done by modifying the configuration of the Xilinx PCIe Core IP. The result is shown in Figure 3.7. Likewise, the constraint files have to be modified or removed, in order to match the pinout of the Xilinx ZC706[31].



Figure 3.7: First Adaptation of the Cosmos+ OpenSSD project System Design

A BRAM Controller has been added to provide a destination address for the firmware channel, substituting the Tiger4NSC one: however, it has no active role in the operations.

After exporting the new hardware file (.hdf), it is necessary to create a new project with the new platform specifications.

Some modifications have also been made to the firmware:

- allocation of the memory array <MemSpace> in the DDR: the storage capacity is 64 MB, since the DDR is already used by the firmware FTL;
- different organization of the memory management unit (MMU) table and of the memory mapping, given the presence of the memory array;
- changes to the memory dimensions and to the number of channels and ways;
- bypass of the status check and ECC functions;
- replacement of the NAND operations, represented by the writing of the commands in the Tiger4 registers, with the MemCpy function.

<sup>1</sup>This modified project and all the following ones have been uploaded to the GitHub repository https://github.com/giuseppedongiovanni/nvme\_comp\_storage

The memory array <MemSpace> is the replacement for the SSD; however, the available storage is much smaller than what the firmware expects. The FTL configuration parameters, which would normally be left untouched, therefore have to be changed, so that memory locations needed for the firmware execution are not corrupted.
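The idea of replacing NAND operations with plain memory copies can be illustrated with the following sketch. It is written in Python for brevity, whereas the firmware is C; the class name mirrors <MemSpace>, but the methods and the 4096-byte page size are assumptions made for the example. The bounds check plays the role of the adjusted FTL parameters: it keeps every access inside the reserved array.

```python
PAGE_SIZE = 4096  # bytes per page (assumption for this example)


class MemSpace:
    """RAM-backed stand-in for the NAND array: program/read become memory copies."""

    def __init__(self, capacity_bytes):
        self.pages = capacity_bytes // PAGE_SIZE
        self._mem = bytearray(self.pages * PAGE_SIZE)

    def _offset(self, page):
        # Reject addresses outside the reserved region, so that a wrong
        # geometry cannot overwrite unrelated memory.
        if not 0 <= page < self.pages:
            raise IndexError("page outside the reserved memory array")
        return page * PAGE_SIZE

    def program_page(self, page, data):
        """Equivalent of a NAND page program: copy the data into the array."""
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page long")
        off = self._offset(page)
        self._mem[off:off + PAGE_SIZE] = data

    def read_page(self, page):
        """Equivalent of a NAND page read: copy the data out of the array."""
        off = self._offset(page)
        return bytes(self._mem[off:off + PAGE_SIZE])
```

No erase operation is needed in this model, since a RAM page can be overwritten in place.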

### 3.1.3 Functionality Tests

The first test that has been carried out is the functionality one, consisting of a routine of power-up, partitioning and reset from both developer and host side, as shown respectively in Figures 3.8, 3.9 and 3.10. Then the data correctness, that corresponds to writing a file and reading it back, is verified in Figure 3.11 by the matching MD5 values (practically the digital fingerprint of a file).
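The data correctness check can also be scripted instead of comparing `md5sum` outputs by hand. The sketch below uses the standard `hashlib` module; the helper names are illustrative, and the paths passed to them would be the original file and the copy read back from the NVMe device.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(original, read_back):
    """True when the file read back from the device equals the original."""
    return md5_of(original) == md5_of(read_back)
```

A call such as `files_match("./Downloads/test_rand_2MB", "./Desktop/read_test")` would then reproduce the comparison of Figure 3.11.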

Finally, it is possible to send some NVMe Admin Commands to verify the presence of the namespace and to obtain information about the device. As can be seen in Figure 3.13, the namespace id is set to 14740, equal to the storage capacity in terms of NVMe blocks (4096 bytes each): this namespace, defined by the project creators, has been attached by the controller to NSID 1, so the device is seen as <nvme0n1>.
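The relation between the reported namespace size and the capacity in bytes is simple arithmetic, sketched below with the values visible in the `id-ns` output (`nsze = 0x3994`) and the 4096-byte block size stated above. The variable and function names are illustrative.

```python
BLOCK_SIZE = 4096          # bytes per NVMe logical block, as stated in the text
nsze = 0x3994              # namespace size in logical blocks, from id-ns

capacity_bytes = nsze * BLOCK_SIZE
capacity_mib = capacity_bytes / (1024 * 1024)
last_lba = nsze - 1        # the namespace spans LBA 0 .. n-1


def lba_to_byte_offset(lba):
    """Byte offset of a logical block inside the namespace."""
    if not 0 <= lba <= last_lba:
        raise ValueError("LBA outside the namespace")
    return lba * BLOCK_SIZE
```

Here `nsze` evaluates to 14740 blocks, that is 60,375,040 bytes or about 57.6 MiB, with valid addresses from LBA 0 to LBA 14739.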

### 3.1.4 Performance Tests

Different performance tests are carried out: the <dd> utility and the Iometer software are used to evaluate IOPS and bandwidth, while to measure the latency timers are used both in the firmware and in a C program on the host side.

The Figures 3.14 to 3.16 refer to the performed tests for the read operation.

The results obtained are unexpected: performing operations from a DDR memory to a host through a PCIe bus should imply very high performance, in the order of GB/s for the bandwidth. Instead, the measured bandwidth is around 100 MB/s, with only 7000 IOPS.

More information can be obtained from the analysis of the latency test results, shown in Figure 3.17.

The pie chart is divided in two main parts: the green one is related to the time spent in the device firmware, while the orange one includes all the other contributions, in particular the host, the PCIe bus and the NVMeHostController of the device.

The firmware execution takes up 73% of the total latency: in particular, the <MemCpy> function accounts for more than half of this time, slowing down the entire execution. The obtained copy speed is about 150 MB/s, while a single-port DDR3 with a 32-bit bus width at 533 MHz has a maximum theoretical bandwidth of about 4 GB/s.
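The theoretical figure quoted above can be checked with a short calculation: a double-data-rate memory transfers data on both clock edges, so the peak bandwidth is the clock frequency times two times the bus width in bytes. The function name is illustrative.

```python
def ddr_peak_bandwidth_mb_s(clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak bandwidth in MB/s of a DDR interface (two transfers per clock cycle)."""
    return clock_mhz * transfers_per_clock * bus_width_bits / 8


peak_mb_s = ddr_peak_bandwidth_mb_s(533, 32)   # 32-bit DDR3 at 533 MHz
measured_mb_s = 150                            # observed <MemCpy> speed
utilisation = measured_mb_s / peak_mb_s
```

The peak comes out at 4264 MB/s, consistent with the roughly 4 GB/s stated in the text, so the measured copy speed uses only about 3.5% of the theoretical bandwidth.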




Figure 3.8: Functionality test - Power-up, Partitioning and Reset - Development PC

The cause of this problem lies in the project itself: both the processing system and the NVMeHostController IP generate heavy memory traffic through the same port of the DDR, which for this reason becomes the bottleneck of the project.

On the other hand, no write test results are available: if the device is stressed with long or continuous write operations, the DMA of the NVMe Host Controller freezes, resulting in a Timeout Abort Error on the host.

In Figure 3.18 different variables were printed in the terminal to trace the cause of the error: in particular, head.autoDmaRx is the hardware counter of the completed DMA requests, while tail.autoDmaRx is the software counter of the submitted DMA requests; when the two counters coincide, the DMA operation is completed. It is possible to see that the DMA is stuck at head.autoDmaRx = 0xAA, although 2 more requests are present in the queue, since tail.autoDmaRx = 0xAC.
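The relation between the two counters can be expressed as a small helper. The 8-bit counter width used for the wrap-around is an assumption made for this example (the values shown fit in one byte, but the source does not state the width), and the function names are hypothetical.

```python
COUNTER_BITS = 8  # assumed width of the head/tail counters


def outstanding_requests(head, tail, bits=COUNTER_BITS):
    """Number of submitted DMA requests not yet completed by the hardware."""
    return (tail - head) % (1 << bits)


def dma_completed(head, tail):
    """The DMA operation is complete when the two counters coincide."""
    return head == tail
```

With the values of Figure 3.18, `outstanding_requests(0xAA, 0xAC)` returns 2, matching the two requests still waiting in the queue.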



Figure 3.9: Functionality test - Power-up - Host PC



Figure 3.10: Functionality test - Formatting operation - Host PC

| Terminal output (Host PC) |
|---------------------------|
| reds@reds:~\$ sudo cp ./Downloads/test_rand_2MB /media/reds/nvme_folder/write_test |
| [sudo] password for reds: |
| reds@reds:~\$ sudo cp /media/reds/nvme_folder/write_test ./Desktop/read_test |
| reds@reds:~\$ diff ./Downloads/test_rand_2MB ./Desktop/read_test |
| reds@reds:~\$ md5sum /media/reds/nvme_folder/write_test |
| reds@reds:~\$ md5sum ./Downloads/test_rand_2MB |
| 79c2ecc8ef24cb656352d3c19a65410a ./Downloads/test_rand_2MB |
| reds@reds:~\$ md5sum ./Desktop/read_test |
| 79c2ecc8ef24cb656352d3c19a65410a ./Desktop/read_test |
| File manager view: 44 MB Volume containing write_test (2.0 MB, modified 20:47) |

Figure 3.11: Functionality test - Data Correctness - Host PC

As can be seen in Figures 3.19 and 3.20, the counter X of the submitted requests is incremented up to 0x30; however, the last increment of head.autoDmaRx corresponds to the count 0x2e.

Given multiple factors, among which the difficulty of reproducing and backtracking the error and the altered timing due to the modifications, a solution to


| reds@rea | ds | :~\$ su | do nvme id-ctrl -H /dev/nvme0                           |
|----------|----|---------|---------------------------------------------------------|
| NVME Ide | en | tify Co | ontroller:                                              |
| vid      |    | 0x1ed   |                                                         |
| ssvid    |    | 0x1ed   |                                                         |
| sn       |    | SSDD5:  | 15T                                                     |
| mn       |    | Cosmos  | 5+ OpenSSD                                              |
| fr       |    | TYPE00  | 905                                                     |
| rab      |    |         |                                                         |
| ieee     |    | 5cd2e4  | 4                                                       |
| cmic     |    |         |                                                         |
| [2:2]    |    |         | PCI                                                     |
| [1:1]    |    |         | Single Controller                                       |
| [0:0]    |    |         | Single Port                                             |
| mdts     |    |         |                                                         |
| cntlid   |    | 9       |                                                         |
| ver      |    | 0       |                                                         |
| rtd3r    |    | Θ       |                                                         |
| rtd3e    |    | 0       |                                                         |
| oaes     |    | 0       |                                                         |
| [8:8]    |    | 0       | Namespace Attribute Changed Event Not Supported         |
| ctratt   |    | 0       |                                                         |
| [0:0]    |    |         | 128-bit Host Identifier Not Supported                   |
| oacs     |    |         |                                                         |
| [8:8]    |    |         | Doorbell Buffer Config Not Supported                    |
| [7:7]    |    |         | Virtualization Management Not Supported                 |
| [6:6]    |    |         | NVMe-MI Send and Receive Not Supported                  |
| [5:5]    |    | 0       | Directives Not Supported                                |
| [4:4]    |    |         | Device Self-test Not Supported                          |
| [3:3]    |    |         | NS Management and Attachment Not Supported              |
| [2:2]    |    |         | FW Commit and Download Not Supported                    |
| [1:1]    |    |         | Format NVM Not Supported                                |
| [0:0]    |    |         | Security Send and Receive Not Supported                 |
| acl      |    |         |                                                         |
| aerl     |    |         |                                                         |
| frmw     |    | 0x3     |                                                         |
| [4:4]    |    |         | Firmware Activate Without Reset Not Supported           |
| [3:1]    |    | 0x1     | Number of Firmware Slots                                |
| [0:0]    |    | 0x1     | Firmware Slot 1 Read-Only                               |
| lpa      |    |         |                                                         |
| [2:2]    |    | 0       | Extended data for Get Log Page Not Supported            |
| [1:1]    |    | 0       | Command Effects Log Page Not Supported                  |
| [0:0]    |    | 0       | SMART/Health Log Page per NS Not Supported              |
| elpe     |    | 8       |                                                         |
| npss     |    | 0       |                                                         |
| avscc    |    | 0       |                                                         |
| [0:0]    |    |         | Admin Vendor Specific Commands Uses Vendor Specific For |

reds@reds:~\$ sudo nvme get-ns-id /dev/nvme0

nvme0n1: namespace-id:14740

reds@reds:~\$ sudo nvme id-ns /dev/nvme0n1

NVME Identify Namespace 14740:

| Field  | Value                                   |
|--------|-----------------------------------------|
| nsze   | 0x3994                                  |
| ncap   | 0x3994                                  |
| nuse   | 0x3994                                  |
| nsfeat | 0                                       |
| nlbaf  | 0                                       |
| flbas  | 0                                       |
| mc     | 0                                       |
| dpc    | 0                                       |
| dps    | 0                                       |
| nmic   | 0                                       |
| rescap | 0                                       |
| fpi    | 0                                       |
| nawun  | 0                                       |
| nawupf | 0                                       |
| nacwu  | 0                                       |
| nabsn  | 0                                       |
| nabo   | 0                                       |
| nabspf | 0                                       |
| noiob  | 0                                       |
| nvmcap | 0                                       |
| nguid  | 000000000000000000000000000000000000000 |
| eui64  | 000000000000000                         |

Figure 3.12: Functionality test - NVMe Identify Command - Host PC

Figure 3.13: Functionality test - NVMe Namespace List - Host PC

| Terminal (reds@reds: ~)                                                     |
|-----------------------------------------------------------------------------|
| reds@reds:~\$ sudo dd if=./Downloads/test_rand_8MB of=/dev/nvme0n1 bs=4096  |
| [sudo] password for reds:                                                   |
| 2000+0 records in                                                           |
| 2000+0 records out                                                          |
| 8192000 bytes (8.2 MB, 7.8 MiB) copied, 0.0530143 s, 155 MB/s               |
| reds@reds:~\$ sudo dd if=/dev/nvme0n1 of=./Desktop/read_test bs=4096 count=1500 |
| 1500+0 records in                                                           |
| 1500+0 records out                                                          |
| 6144000 bytes (6.1 MB, 5.9 MiB) copied, 0.0644825 s, 95.3 MB/s              |

Figure 3.14: Performance test - dd function, write and read - Base Version

the problem was not found.

#### **3.1.5** Optimized Firmware Version

At first, the poor performance was attributed to the scheduling of the NAND requests, on the assumption that the <MemCpy> function was extremely fast. An optimized version of the firmware was therefore developed, in which the converted slice requests are executed directly, without creating any NAND request.

Although this version worked correctly, its performance was only slightly better than that

| Iometer result (All Managers)  | Value                    |
|--------------------------------|--------------------------|
| Total I/Os per Second          | 7457.48                  |
| Total MBs per Second (Decimal) | 30.55 MBPS (29.13 MiBPS) |
| Average I/O Response Time (ms) | 0.2681                   |
| Maximum I/O Response Time (ms) | 0.3203                   |
| % CPU Utilization (total)      | 0.32 %                   |
| Total Error Count              | 0                        |

Figure 3.15: Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Base Version

| Iometer result (All Managers)  | Value                      |
|--------------------------------|----------------------------|
| Total I/Os per Second          | 7136.47                    |
| Total MBs per Second (Decimal) | 116.92 MBPS (111.51 MiBPS) |
| Average I/O Response Time (ms) | 0.2802                     |
| Maximum I/O Response Time (ms) | 0.3427                     |
| % CPU Utilization (total)      | 0.31 %                     |
| Total Error Count              | 0                          |

Figure 3.16: Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Base Version

of the original firmware, as shown in Figures 3.21, 3.22 and 3.23: this helped to identify the real problem, but led to no significant improvement. For this reason, and in order to modify the original code as little as possible to preserve stability, this version was discarded and no further optimization was carried out.
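As a consistency check on the measurements reported for the base and optimized versions, the bandwidth shown by Iometer should equal the IOPS multiplied by the transfer size, and the rate printed by dd should equal the number of records times the block size divided by the elapsed time. The following C sketch only illustrates this arithmetic: the function names are hypothetical and are not part of the firmware, and "4k" and "16k" are assumed to mean 4096 and 16384 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth in decimal MB/s for a given IOPS figure and transfer size in bytes. */
double iops_to_mbps(double iops, uint32_t transfer_bytes)
{
    return iops * (double)transfer_bytes / 1e6;
}

/* Data rate in decimal MB/s as computed by dd:
 * records * block size / elapsed time. */
double dd_rate_mbps(uint64_t records, uint32_t block_bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)records * (double)block_bytes / seconds / 1e6;
}
```

For instance, 7457.48 IOPS with 4096-byte transfers (Figure 3.15) correspond to about 30.55 MB/s, and 2000 records of 4096 bytes written in 0.0530143 s (Figure 3.14) correspond to about 154.5 MB/s, in agreement with the reported values.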



Figure 3.17: Performance test - Read Latency - Base Version



Figure 3.18: Terminal Error - head.autoDmaRx stuck



Figure 3.19: Waveform Error - head.autoDmaRx stuck



Figure 3.20: Waveform Error - Submitted DMA Request



Figure 3.21: Performance test - dd function, write and read - Optimized Version

| Iometer result (All Managers)  | Value                    |
|--------------------------------|--------------------------|
| Total I/Os per Second          | 8142.45                  |
| Total MBs per Second (Decimal) | 33.35 MBPS (31.81 MiBPS) |
| Average I/O Response Time (ms) | 0.2456                   |
| Maximum I/O Response Time (ms) | 0.2925                   |
| % CPU Utilization (total)      | 0.29 %                   |
| Total Error Count              | 0                        |

Figure 3.22: Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Optimized Version

| Iometer result (All Managers)  | Value                      |
|--------------------------------|----------------------------|
| Total I/Os per Second          | 7491.65                    |
| Total MBs per Second (Decimal) | 122.74 MBPS (117.06 MiBPS) |
| Average I/O Response Time (ms) | 0.2669                     |
| Maximum I/O Response Time (ms) | 0.3373                     |
| % CPU Utilization (total)      | 0.30 %                     |
| Total Error Count              | 0                          |

Figure 3.23: Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Optimized Version

# **3.2** Computational Storage Development

The objective is to develop a hardware accelerator consisting of a wrapper that can host different types of acceleration cores. However, before starting the development of the hardware accelerator, the project must be modified by adding the AXI DMA IP, in order to match the general architecture of an FPGA-based Computational Storage Drive, shown in Figure 3.24.



Figure 3.24: General Architecture of a FPGA-based Computational Storage

## **3.2.1** 32-bit AXI DMA

At first, the AXI DMA was used as a replacement for the <MemCpy> function, being simply closed in loop-back. The only firmware modifications concern the configuration of the DMA and, as mentioned above, the substitution of the <MemCpy> call with a DMA transfer.

Some preliminary performance tests were carried out to compare this version with the one using the <MemCpy> function.

Figure 3.26 shows the <dd> test: however, due to the small amount of data transferred, its results are not accurate. Instead, as can be seen in Figures 3.27 to 3.29, despite an increase in latency, both IOPS and bandwidth increase slightly: the DMA reduces the workload of the processing system and thus favors a higher throughput.




Figure 3.25: 32bit AXI DMA Adaptation of the Cosmos+ OpenSSD project System Design



Figure 3.26: Performance test - dd function, write and read - 32bit AXI DMA Version

| Iometer result (All Managers)  | Value                    |
|--------------------------------|--------------------------|
| Total I/Os per Second          | 8435.41                  |
| Total MBs per Second (Decimal) | 34.55 MBPS (32.95 MiBPS) |
| Average I/O Response Time (ms) | 0.2371                   |
| Maximum I/O Response Time (ms) | 0.3156                   |
| % CPU Utilization (total)      | 0.17 %                   |
| Total Error Count              | 0                        |

Figure 3.27: Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - 32bit AXI DMA Adaptation



Figure 3.28: Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - 32bit AXI DMA Adaptation



Figure 3.29: Performance test - Read Latency - 32bit AXI DMA Adaptation

## **3.2.2** First Prototype

The first prototype of the hardware accelerator is a simple block that adds the value of a parameter to the data to be processed. Its primary goal was to provide a better understanding of the AXI4-Lite and AXI4-Stream protocols and of the NVMe Admin Commands.



Figure 3.30: System Design of the first version of the Computational Storage based on the Cosmos+ OpenSSD project

As shown in Figure 3.30, the accelerator is placed between two Data FIFOs, in order to decouple the DMA transfer from the computation.

The designed Finite State Machine (FSM) is shown in Figure 3.31.

The Configuration and Transfer\_Configuration states handle, respectively, the modification and the reading of the configuration and parameter registers through the Set Parameters NVMe Admin Commands. Since the registers are custom, the feature identifiers of these commands (the field that indicates which feature the attributes are being specified for) are chosen from the vendor-specific group [32].

On the acceleration branch, the different states handle all the combinations of the AXI4-Stream protocol signals and recover any input data that would otherwise be lost under particular conditions (e.g. output not ready on the last element).
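The data-recovery behaviour described above can be illustrated with a cycle-level C model of a single AXI4-Stream stage. This is only a sketch under stated assumptions, not the FSM of Figure 3.31: it assumes a one-word holding register that captures an accepted input word whenever the output is stalled, it ignores TLAST, and all the names (`stream_stage`, `stage_clock`, `run_stream`) are hypothetical. The back-pressure is driven by a 32-bit pattern, one bit per cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     out_valid;   /* output register holds a word          */
    uint32_t out_data;
    bool     hold_valid;  /* holding register holds a stalled word */
    uint32_t hold_data;
} stream_stage;

/* Input-side ready: a word is accepted only while the holding register is free. */
bool stage_ready(const stream_stage *s)
{
    return !s->hold_valid;
}

/* Advance one clock cycle. Returns true when a word is delivered downstream
 * in this cycle; the word is then written to *m_data. */
bool stage_clock(stream_stage *s, bool s_valid, uint32_t s_data,
                 bool m_ready, uint32_t *m_data)
{
    bool accept = s_valid && stage_ready(s);
    bool fire   = s->out_valid && m_ready;

    if (fire)
        *m_data = s->out_data;

    if (fire || !s->out_valid) {
        /* Output register is free: refill it, oldest word first. */
        if (s->hold_valid) {
            s->out_data   = s->hold_data;
            s->hold_valid = false;
            s->out_valid  = true;
        } else if (accept) {
            s->out_data  = s_data;
            s->out_valid = true;
        } else {
            s->out_valid = false;
        }
    } else if (accept) {
        /* Output stalled: park the accepted word instead of losing it. */
        s->hold_data  = s_data;
        s->hold_valid = true;
    }
    return fire;
}

/* Push n words through the stage; bit (cycle % 32) of ready_pattern is the
 * downstream ready signal. Returns the number of words delivered. */
size_t run_stream(const uint32_t *in, size_t n, uint32_t *out,
                  uint32_t ready_pattern)
{
    stream_stage st = {0};
    size_t sent = 0, got = 0;

    for (size_t cycle = 0; cycle < 64 * (n + 1) && got < n; cycle++) {
        bool m_ready  = (ready_pattern >> (cycle % 32)) & 1u;
        bool s_valid  = sent < n;
        bool accepted = s_valid && stage_ready(&st);  /* sampled before the edge */
        uint32_t word = 0;

        if (stage_clock(&st, s_valid, s_valid ? in[sent] : 0u, m_ready, &word))
            out[got++] = word;
        if (accepted)
            sent++;
    }
    return got;
}
```

With this structure, the input is stalled (ready deasserted) only while the holding register is occupied, so every accepted word is eventually delivered in order, whatever the back-pressure pattern.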

The Accelerator has two configurations: the first is a simple pass-through, necessary for the power-on procedure, while the second is the acceleration configuration which, as previously said, simply adds a parameter to the data. The values of the configuration and parameter registers can be changed through the NVMe Set Parameter Command, as shown in Figure 3.32.
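The behaviour of the two configurations can be summarized by a small software model. The sketch below is not the hardware description of the accelerator: the feature identifiers 0xC0 and 0xC1 and the configuration value 1 are those appearing in Figure 3.32, while the encoding of the pass-through configuration as 0, the wrap-around on 32-bit overflow, and all the names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature identifiers used by the nvme-cli calls in Figure 3.32. */
#define FID_ACCEL_CONFIG 0xC0u
#define FID_ACCEL_PARAM  0xC1u

/* Assumed encoding of the two configurations (0 is hypothetical). */
#define ACCEL_PASS_THROUGH 0u
#define ACCEL_ADD_PARAM    1u

typedef struct {
    uint32_t config;  /* configuration register */
    uint32_t param;   /* parameter register     */
} accel_regs;

/* Model of the register update triggered by the admin command.
 * Returns 0 on success, -1 for an unknown feature identifier or an
 * unsupported configuration value. */
int accel_set_feature(accel_regs *r, uint8_t fid, uint32_t value)
{
    switch (fid) {
    case FID_ACCEL_CONFIG:
        if (value != ACCEL_PASS_THROUGH && value != ACCEL_ADD_PARAM)
            return -1;
        r->config = value;
        return 0;
    case FID_ACCEL_PARAM:
        r->param = value;
        return 0;
    default:
        return -1;
    }
}

/* Data path on 32-bit words: pass-through, or addition of the parameter
 * (unsigned arithmetic, so the sum wraps modulo 2^32). */
void accel_process(const accel_regs *r, const uint32_t *in,
                   uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (r->config == ACCEL_ADD_PARAM) ? in[i] + r->param : in[i];
}
```

In this model, the two commands of Figure 3.32 correspond to `accel_set_feature(&regs, 0xC0, 1)` followed by `accel_set_feature(&regs, 0xC1, 7)`, after which every processed word is increased by 7.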

The tests in Figures 3.33 to 3.36 show a performance degradation for this solution, due to the insertion of the FIFOs and of the Accelerator block.



Figure 3.31: FSM of the first version of the Hardware Accelerator

| Host PC terminal (reds@reds: ~)                               |
|---------------------------------------------------------------|
| reds@reds:~\$ sudo nvme set-feature /dev/nvme0 -f 0xC0 -v 0x1 |
| [sudo] password for reds:                                     |
| set-feature:c0 (Unknown), value:0x0000001                     |
| reds@reds:~\$ sudo nvme set-feature /dev/nvme0 -f 0xC1 -v 0x7 |
| set-feature:c1 (Unknown), value:0x000007                      |

| Developer PC terminal            |
|----------------------------------|
| Done Admin Command OPC: C        |
| Done Admin Command OPC: 6        |
| Done Admin Command OPC: 6        |
| Done Admin Command OPC: 6        |
| Set Accelerator Configuration: 1 |
| Set Feature FID:C0               |
| Done Admin Command OPC: 9        |
| Set Accelerator Parameter: 7     |
| Set Feature FID:C1               |
| Done Admin Command OPC: 9        |

Figure 3.32: NVMe Set Parameter Admin Command - Addition of 0x7 to a 32-bit sequence - Host and Developer PCs
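The two commands visible in Figure 3.32 can be generated programmatically. The following is a minimal sketch, not part of the thesis firmware or tooling: the helper name `set_feature_argv` is hypothetical, and it only builds the argument list for the `nvme set-feature` invocation shown in the figure, after checking that the Feature Identifier falls in the vendor-specific range (C0h to FFh) of the NVMe specification. The mapping of FID C0h to the configuration register and C1h to the parameter register is taken from the figure.

```python
VENDOR_FID_MIN, VENDOR_FID_MAX = 0xC0, 0xFF  # vendor-specific Feature Identifier range


def set_feature_argv(device, fid, value):
    """Build the nvme-cli argument list for a vendor-specific Set Features command.

    Hypothetical helper: it only formats the command, it does not run it.
    """
    if not VENDOR_FID_MIN <= fid <= VENDOR_FID_MAX:
        raise ValueError(f"FID 0x{fid:02X} is outside the vendor-specific range")
    return ["nvme", "set-feature", device, "-f", f"0x{fid:02X}", "-v", f"0x{value:X}"]


# FID 0xC0: accelerator configuration register, FID 0xC1: accelerator parameter register
config_cmd = set_feature_argv("/dev/nvme0", 0xC0, 0x1)
param_cmd = set_feature_argv("/dev/nvme0", 0xC1, 0x7)
print(" ".join(config_cmd))
print(" ".join(param_cmd))
```

The resulting lists could be passed to `subprocess.run` with elevated privileges; that step is left out here because it requires the prototype to be attached.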

## **3.2.3** 128-bit AXI DMA

As discussed in the next section, the chosen acceleration core works on 128-bit data packets. To avoid adding complexity to the accelerator and to achieve higher performance, the AXI DMA bus width has to be changed from 32 to 128 bits.



Figure 3.33: Performance test - dd function, write and read - Computational storage first prototype



Figure 3.34: Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Computational storage first prototype



Figure 3.35: Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Computational storage first prototype



Figure 3.36: Performance test - Read Latency - Computational storage first prototype

Only two firmware modifications have to be carried out in order to perform the 128-bit DMA transfer:

- allocation of a new memory array for the spare data, called <SpaceArray>, to avoid conflicts for consecutive transfers;
- use of an attribute to specify the minimum alignment for the memory arrays, that has to be set to 16 bytes.
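The 16-byte figure follows from the bus width: one 128-bit beat carries 128 / 8 = 16 bytes, so a buffer used by the DMA should start at an address that is a multiple of 16. The sketch below only illustrates this arithmetic; the function names are hypothetical and the addresses are made up. In C firmware built with a GCC-style toolchain, such a constraint is commonly expressed with `__attribute__((aligned(16)))`, which is assumed here to be the kind of attribute the text refers to.

```python
BEAT_BYTES = 128 // 8  # one 128-bit AXI data beat is 16 bytes wide


def is_aligned(addr, alignment=BEAT_BYTES):
    """Return True if addr is a multiple of the alignment."""
    return addr % alignment == 0


def align_up(addr, alignment=BEAT_BYTES):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


# Illustrative addresses only: a buffer at 0x1004 is not usable as is,
# the next suitable start address is 0x1010.
print(is_aligned(0x1004), hex(align_up(0x1004)))
```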

As for the previous versions, some preliminary tests are carried out to evaluate the performance of the new configuration.



Figure 3.37: Performance test - dd function, write and read - 128bit AXI DMA Version

The performance achieved by the 128-bit AXI DMA version, in terms of bandwidth, IOPS and latency, is better than that of the 32-bit one, thanks to the faster data transfer to and from the PS. However, as seen in section 3.1.1, the PS can provide only a 64-bit AXI interface, which limits the possible performance increase.

| Iometer result (All Managers)  | Value                    |
|--------------------------------|--------------------------|
| Total I/Os per Second          | 10727.36                 |
| Total MBs per Second (Decimal) | 43.94 MBPS (41.90 MiBPS) |
| Average I/O Response Time (ms) | 0.1964                   |
| Maximum I/O Response Time (ms) | 0.2019                   |
| % CPU Utilization (total)      | 0.21 %                   |
| Total Error Count              | 0                        |

Figure 3.38: Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - 128bit AXI DMA Adaptation

| Iometer result (All Managers)  | Value                       |
|--------------------------------|-----------------------------|
| Total I/Os per Second          | 10586.29                    |
| Total MBs per Second (Decimal) | 173.45 MBPS (165.41 MiBPS)  |
| Average I/O Response Time (ms) | 0.1889                      |
| Maximum I/O Response Time (ms) | 0.2408                      |
| % CPU Utilization (total)      | 0.30 %                      |
| Total Error Count              | 0                           |
| Status                         | Test Completed Successfully |

Figure 3.39: Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - 128bit AXI DMA Adaptation
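The throughput reported by Iometer in Figures 3.38 and 3.39 can be cross-checked from the I/O rate and the transfer size, since throughput = IOPS × transfer size. The short sketch below redoes this arithmetic with the values read from the two figures; the function name is hypothetical, and 4k and 16k are assumed to mean 4096 and 16384 bytes.

```python
def throughput(iops, transfer_bytes):
    """Return (decimal MB/s, binary MiB/s) for a given I/O rate and transfer size."""
    bytes_per_second = iops * transfer_bytes
    return bytes_per_second / 1e6, bytes_per_second / 2**20


mb_4k, mib_4k = throughput(10727.36, 4 * 1024)     # Figure 3.38
mb_16k, mib_16k = throughput(10586.29, 16 * 1024)  # Figure 3.39

print(f"4k : {mb_4k:.2f} MBPS ({mib_4k:.2f} MiBPS)")
print(f"16k: {mb_16k:.2f} MBPS ({mib_16k:.2f} MiBPS)")
```

Both results agree with the values displayed by Iometer. Since the I/O rate is almost the same in the two tests, the throughput grows roughly in proportion to the transfer size.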

## **3.2.4** Second Prototype

The first prototype is not well suited as a general wrapper, because the AXI-4 stream protocol makes it difficult to manage cores with different timing. Therefore, the previously external FIFOs are included in the hardware of this second prototype, in order to separate the input and the output AXI-4 stream signals, as shown in Figure 3.41.

In this second prototype there are two FSMs, as shown in Figure 3.42: the modification and the reading of the registers through NVMe Admin Commands are now separated from the acceleration branch.

Figure 3.40: Performance test - Read Latency - 128bit AXI DMA Adaptation

Figure 3.41: General diagram of the Hardware Accelerator

The acceleration branch is simpler than in the first prototype, since the acceleration process is independent of the AXI-4 stream protocol. The WAIT\_PIPE state is used to wait for the core to be ready, in case its output is not immediately available.

### **AES** Core and CTR Configuration

In order to verify the correct functioning of the block, it was necessary to choose the acceleration function and, consequently, the core to be inserted.

The sought acceleration function has to be:

- widespread and standard;
- format-preserving: because of construction limits (DMA transfer), the output data format has to be the same as the input one;
- usable on a data stream, to achieve high performance.

Figure 3.42: FSMs of the second version of the Hardware Accelerator

Therefore, the chosen core is an encryption core based on the Advanced Encryption Standard (AES)[33], called "tiny\_aes". AES, which is a subset of the Rijndael cipher, was standardized by the National Institute of Standards and Technology (NIST) in 2001 as FIPS PUB 197 and is accepted all over the world. AES is a symmetric encryption algorithm: a single key is used for both encryption and decryption. It preserves the data format, since it works on a fixed-length data block of 128 bits and produces an output block of the same size. The key size, on the other hand, can be 128, 192 or 256 bits.

The encryption process consists of several rounds of transformations, whose number depends on the length of the key.
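The dependence of the number of rounds on the key length is fixed by the AES standard: with a key of Nk 32-bit words, the cipher performs Nr = Nk + 6 rounds. A minimal sketch of this relation follows; the function name is hypothetical.

```python
def aes_rounds(key_bits):
    """Number of AES rounds for a given key size, Nr = Nk + 6 with Nk = key_bits / 32."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192 or 256 bits")
    return key_bits // 32 + 6


for bits in (128, 192, 256):
    print(f"AES-{bits}: {aes_rounds(bits)} rounds")
```

AES-128 therefore uses 10 rounds, AES-192 uses 12 and AES-256 uses 14.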

Five modes of operation were defined [34]:

- Electronic Code Book (ECB) mode: the simplest mode and generally not recommended. All the input data blocks, called plain text, are encrypted independently with the same key: this allows parallel encryption, but identical plain text blocks produce identical cipher text blocks, resulting in a low level of security;
- Cipher Block Chaining (CBC) mode: the input block is XORed with an initialization vector (IV) and then encrypted. The resulting cipher text is used as IV for the next block;
- Cipher FeedBack (CFB) mode: the key and the IV are the inputs of the encryption block, whose result is XORed with the plain text. As in CBC, the resulting cipher text is used as IV for the next block;
- Output FeedBack (OFB) mode: as in CFB, the key and the IV are the inputs of the encryption block, whose result is XORed with the plain text to give the cipher text. However, the IV for the next block is given by the result of the encryption block;
- Counter (CTR) mode: in this mode the IV is a counter. As in CFB, the key and the IV are the inputs of the encryption block, whose result is XORed with the plain text to give the cipher text. However, the IV for the next block is given by the incremented counter.

Apart from ECB, each mode has advantages and disadvantages, and the choice depends on the application. Given the requirements for the accelerator, the best choice is the CTR mode: the output of the encryption block depends only on the key and the counter, so it can be prepared in advance for both encryption and decryption, and the blocks can be processed in parallel. This makes it possible to achieve high speed while ensuring security. The hardware accelerator is therefore modified according to the scheme shown in Figure 3.43, in which each block corresponds to a pipeline stage.

It is fundamental that a data block uses the same value of the counter in both encryption and decryption. Otherwise, the operation is not completed correctly.
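The role of the counter can be illustrated with a small software model of the CTR structure. This is only a sketch under explicit assumptions: `toy_block_encrypt` is a stand-in built on SHA-256 and is NOT AES, the counter is assumed to be incremented as a 128-bit big-endian integer, and all names are hypothetical. Only the mode structure (encrypt the counter, XOR with the data) reflects the description above.

```python
import hashlib

BLOCK = 16  # block size in bytes (128 bits)


def toy_block_encrypt(key, block):
    """Stand-in for the AES block encryption (NOT AES): a keyed hash cut to 16 bytes."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_process(key, initial_counter, data):
    """CTR mode: the same routine encrypts and decrypts.

    Each keystream block depends only on the key and the counter value,
    so the blocks are independent and could be computed in parallel.
    """
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        counter = (initial_counter + i // BLOCK) % (1 << 128)
        keystream = toy_block_encrypt(key, counter.to_bytes(BLOCK, "big"))
        chunk = data[i:i + BLOCK]
        out += bytes(a ^ b for a, b in zip(chunk, keystream))
    return bytes(out)


key = bytes(range(16))                   # illustrative key
ctr0 = 0xF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF  # initial counter used in the NIST examples
plain = b"sixteen byte blk" * 4          # four 16-byte blocks

cipher = ctr_process(key, ctr0, plain)
same_counter = ctr_process(key, ctr0, cipher)       # recovers the plain text
wrong_counter = ctr_process(key, ctr0 + 1, cipher)  # does not
print(same_counter == plain, wrong_counter == plain)
```

The output has the same length as the input, and decryption succeeds only when each block is processed with the counter value used during encryption.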

As described in the project specification[33], there are three cores available, one for each key size: due to the limit in FPGA resources, only the AES-128 and AES-256 are included in the hardware accelerator.

#### **Functionality Tests**

To verify the correct behavior of the developed accelerator, the test vectors provided by NIST [34] are used for both encryption and decryption operations.

By construction of the cores, the input and the output of the core have to be reversed in order to be processed correctly. This is due to the opposite orientation of the software and hardware vectors.

If the NIST test vectors are considered to be already in a reversed state, only the output of the core is inverted, while the key has to be inserted as it is. Likewise, to verify that the results match, both plain-text and cipher-text have to be reversed.
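The reversal can be modelled in software to prepare or check test data. The sketch below assumes that "reversed" means a full bit-order reversal of the 128-bit vector (bit 0 swapped with bit 127, and so on); this is an interpretation, although the values legible in Figure 3.44 appear consistent with it. The function name is hypothetical, and the constant is the first cipher-text block of the NIST CTR-AES128 example.

```python
def bit_reverse_128(value):
    """Reverse the bit order of a 128-bit vector (bit 0 <-> bit 127)."""
    if not 0 <= value < (1 << 128):
        raise ValueError("value does not fit in 128 bits")
    return int(format(value, "0128b")[::-1], 2)


# First cipher-text block of the NIST CTR-AES128 encryption example
nist_cipher_block = 0x874D6191B620E3261BEF6864990DB6CE

# Under the bit-reversal assumption, this is the word as it leaves the core
core_side = bit_reverse_128(nist_cipher_block)
print(f"{core_side:032x}")

# Reversing twice gives back the original vector
print(bit_reverse_128(core_side) == nist_cipher_block)
```

Under this assumption the reversed block reads 736db0992616f7d864c7046d8986b2e1.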



Figure 3.43: CTR Mode - NIST, Recommendation for Block Cipher Modes of Operation: Methods and Techniques

The tests are performed in three different ways:

- Module Verilog test-bench;
- Module execution;
- General execution.

As shown in the test-bench waveforms (Figures 3.44 to 3.46), the first vectors are, respectively, the plain-text blocks, the initial counter and the key from the NIST test vectors.

The plain-text is reversed in the <data\_in\_pipe> vector to ensure the correct behavior: the match of the results can be verified through the <reverse\_data> vector, which is the inversion of the accelerator output.

For the module execution test, the comparison between the obtained cipher-text and NIST test vectors is directly executed by the software. In this type of test the configuration still takes place with direct writing of configuration and parameter registers.

Figures 3.48 and 3.51 show the messages printed by the software after the correct completion of the operations, for both configurations.

Figures 3.49, 3.50, 3.52 and 3.53 instead show the complete execution waveforms, for the parameter configuration and for the encryption phase only.

As mentioned previously, comparing the two encryption waveforms it is possible to notice the different time spent in the WAIT\_PIPE state (code 2): the AES-256 core needs more rounds of transformations than the AES-128 one to complete the operation.


Waveform window Accelerator_tb_behav.wcfg, 51,000 ps to 55,200 ps. Constant inputs and values of the output vectors:

| Signal                                | Value (hex)                      |
|---------------------------------------|----------------------------------|
| vector_in[0][127:0]                   | 6bc1bee22e409f96e93d7e117393172a |
| vector_in[1][127:0]                   | ae2d8a571e03ac9c9eb76fac45af8e51 |
| vector_in[2][127:0]                   | 30c81c46a35ce411e5fbc1191a0a52ef |
| vector_in[3][127:0]                   | f69f2445df4f9b17ad2b417be66c3710 |
| start_counter[127:0]                  | f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff |
| key                                   | 2b7e151628aed2a6abf7158809cf4f3c |
| FIFO_DATA_OUT[127:0] (first block)    | 736db0992616f7d864c7046d8986b2e1 |
| reverse_data[127:0] (first block)     | 874d6191b620e3261bef6864990db6ce |
| reverse_data[127:0] (second block)    | 9806f66b7970fdff8617187bb9fffdff |
| reverse_data[127:0] (third block)     | 5ae4df3edbd5d35e5b4f09020db03eab |
| reverse_data[127:0] (fourth block)    | 1e031dda2fbe03d1792170a0f3009cee |

The remaining signals (clk, data_in_pipe, crypto_128, crypto_128_pipe, computed_data) change at every clock cycle.

Figure 3.44: Verilog Test-bench - AES-128 Encryption



Figure 3.45: Verilog Test-bench - AES-128 Decryption

Before performing the general test, it is necessary to define in the firmware the Feature Identifiers of the Set Feature command. Their assignment to the configuration and parameter registers is shown in Table 3.1.

Finally, it is possible to perform the general test. Figures 3.54 and 3.55 show, from both the host and the developer PC sides, the configuration phase and the write operation that sends the test vectors: <test\_128.bin> is a binary file in which the NIST vectors have been reversed. The printed output, which consists of the hexadecimal values of the bytes saved in the MemSpace array, is equivalent to the reversed test vectors.

Write operations always refer to a whole page (16kB): therefore, as can be seen in Figure 3.31, the amount of computed data is much larger than the 64 bytes of the binary file.
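The difference between the size of the test file and the amount of processed data can be quantified with a short calculation. The sketch assumes that 16kB means 16384 bytes and that the accelerator consumes one 128-bit (16-byte) block at a time; the constant names are hypothetical.

```python
PAGE_BYTES = 16 * 1024   # one page, assuming 16kB = 16384 bytes
BLOCK_BYTES = 128 // 8   # one 128-bit AES block
FILE_BYTES = 64          # size of the test binary file

blocks_per_page = PAGE_BYTES // BLOCK_BYTES  # blocks processed for one page write
test_blocks = FILE_BYTES // BLOCK_BYTES      # blocks that carry the NIST vectors

print(blocks_per_page, test_blocks, blocks_per_page // test_blocks)
```

Only the first 4 of the 1024 blocks of the page carry the test vectors, so the processed data is 256 times larger than the file.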





Figure 3.46: Verilog Test-bench - AES-256 Encryption

[Waveform capture from Accelerator_tb_behav.wcfg, 55,000 ps to 59,200 ps. Signals: Accelerator_tb/clk, vector_in[0..3][127:0], start_counter[127:0], data_in_pipe[127:0], crypto_256[127:0], crypto_256_pipe[127:0], computed_data[127:0], FIFO_DATA_OUT[127:0]]

Figure 3.47: Verilog Test-bench - AES-256 Decryption

| Terminal ready                                               |                                                |
|--------------------------------------------------------------|------------------------------------------------|
| [!] MMU has been enabled.                                    |                                                |
| Hello COSMOS+ OpenSSD !!!                                    |                                                |
| Configuration - Press 'X' to start                           |                                                |
| Configuration: 02; Key 128bit = 2B7E1516-28AED2A6-ABF71588-9 | F4F3C; IV: F0F1F2F3-F4F5F6F7-F8F9FAFB-FCFDFEFF |
| Encryption-Decryption - Press 'X' to start                   |                                                |
| transfer Dev-DMA successful                                  |                                                |
| transfer DMA-Dev successful                                  |                                                |
| Encryption completed successfully                            |                                                |
| transfer Dev-DMA successful                                  |                                                |
| transfer DMA-Dev successful                                  |                                                |
| Decryption completed successfully                            |                                                |
|                                                              |                                                |

Figure 3.48: Module execution - AES-128 Terminal

The same operations are performed for the AES-256 core in Figures 3.57 to 3.59.




Figure 3.50: Module execution - AES-128 Encryption



Figure 3.51: Module execution - AES-256 Terminal



Figure 3.52: Module execution - AES-256 Configuration



Figure 3.53: Module execution - AES-256 Encryption

| Feature Identifier | Register Name          | Notes                                                  |
|--------------------|------------------------|--------------------------------------------------------|
| 0xC0               | Configuration Register | 0 = Pass through; 1 = Adder; 2 = AES-128; 8 = AES-256  |
| 0xC1               | Parameter 1 Register   | LSW of the key vector                                  |
| 0xC2               | Parameter 2 Register   |                                                        |
| 0xC3               | Parameter 3 Register   |                                                        |
| 0xC4               | Parameter 4 Register   | MSW of the 128-bit key vector                          |
| 0xC5               | Parameter 5 Register   |                                                        |
| 0xC6               | Parameter 6 Register   |                                                        |
| 0xC7               | Parameter 7 Register   |                                                        |
| 0xC8               | Parameter 8 Register   | MSW of the 256-bit key vector                          |
| 0xCA               | Parameter A Register   | LSW of the IV vector                                   |
| 0xCB               | Parameter B Register   |                                                        |
| 0xCC               | Parameter C Register   |                                                        |
| 0xCD               | Parameter D Register   | MSW of the IV vector                                   |

Table 3.1: Set Parameter - Feature Identifiers Assignment
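The register assignment in Table 3.1 can be sketched as a small helper that splits a key and an IV into 32-bit words and pairs each word with its Feature Identifier, least significant word first. This is only an illustration under stated assumptions: the function names (`split_words_lsw_first`, `feature_writes`) and the mode labels are hypothetical, and the sketch only builds the list of (FID, value) pairs; it does not issue any NVMe command.

```python
from typing import List, Tuple

CONFIG_FID = 0xC0    # Configuration Register (Table 3.1)
KEY_FID_BASE = 0xC1  # Parameter 1 Register: LSW of the key vector
IV_FID_BASE = 0xCA   # Parameter A Register: LSW of the IV vector

# Configuration values from Table 3.1; the string labels are hypothetical.
AES_MODES = {"aes128": (2, 16), "aes256": (8, 32)}  # label -> (value, key bytes)


def split_words_lsw_first(value_hex: str) -> List[int]:
    """Split a big-endian hex string into 32-bit words, least significant first."""
    data = bytes.fromhex(value_hex)
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 32 bits")
    words = [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]
    return words[::-1]


def feature_writes(mode: str, key_hex: str, iv_hex: str) -> List[Tuple[int, int]]:
    """Return the (Feature Identifier, 32-bit value) pairs for one AES setup."""
    config_value, key_len = AES_MODES[mode]
    if len(bytes.fromhex(key_hex)) != key_len:
        raise ValueError("key length does not match the selected mode")
    if len(bytes.fromhex(iv_hex)) != 16:
        raise ValueError("IV must be 128 bits")
    writes = [(CONFIG_FID, config_value)]
    writes += [(KEY_FID_BASE + i, w)
               for i, w in enumerate(split_words_lsw_first(key_hex))]
    writes += [(IV_FID_BASE + i, w)
               for i, w in enumerate(split_words_lsw_first(iv_hex))]
    return writes


if __name__ == "__main__":
    key = "603deb1015ca71be2b73aef0857d77811f352c073b6108d72d9810a30914dff4"
    iv = "f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
    for fid, value in feature_writes("aes256", key, iv):
        print(f"FID 0x{fid:02X} <- 0x{value:08X}")
```

With the AES-256 key and IV used in the general execution test, the sketch yields the same parameter order that appears in the developer terminal (Parameter 1 = 0x0914DFF4 up to Parameter 8 = 0x603DEB10, then Parameter A = 0xFCFDFEFF up to Parameter D = 0xF0F1F2F3).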



Figure 3.54: General Execution - AES-128 Host PC Terminal



Figure 3.55: General Execution - AES-128 Developer PC Terminal



Figure 3.56: General Execution - AES-128 Encryption Waveform



Figure 3.57: General Execution - AES-256 Host PC Terminal

| Set Accelerator Configuration: 8 |
|----------------------------------|
| Set Feature FID:C0 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 1: 914DFF4 |
| Set Feature FID:C1 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 2: 2D9810A3 |
| Set Feature FID:C2 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 3: 3B6108D7 |
| Set Feature FID:C3 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 4: 1F352C07 |
| Set Feature FID:C4 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 5: 857D7781 |
| Set Feature FID:C5 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 6: 2B73AEF0 |
| Set Feature FID:C6 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 7: 15CA71BE |
| Set Feature FID:C7 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter 8: 603DEB10 |
| Set Feature FID:C8 |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter A: FCFDFEFF |
| Set Feature FID:CA |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter B: F8F9FAFB |
| Set Feature FID:CB |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter C: F4F5F6F7 |
| Set Feature FID:CC |
| Done Admin Command OPC: 9 |
| Set Accelerator Parameter D: F0F1F2F3 |
| Set Feature FID:CD |
| Done Admin Command OPC: 9 |
| Done Admin Command OPC: 6 |
| Encryption completed successfully |
| Mem[16384]:D6 -Mem[16385]:83 -Mem[16386]:7D -Mem[16387]:47 -Mem[16388]:74 -Mem[16389]:02 -Mem[16390]:F9 -Mem[16391]:69 -Mem[16392]:97 -Mem[16393]:BC -Mem[16394]:7E -Mem[16395]:88 |
| Mem[16396]:CE -Mem[16397]:C9 -Mem[16398]:E8 -Mem[16399]:5 |
| Mem[16400]:75 -Mem[16401]:84 -Mem[16402]:51 -Mem[16403]:EA -Mem[16404]:78 -Mem[16405]:C0 -Mem[16406]:35 -Mem[16407]:39 -Mem[16408]:79 -Mem[16409]:ED -Mem[16410]:F6 -Mem[16411]:35 |
| Mem[16412]:A2 -Mem[16413]:F5 -Mem[16414]:71 -Mem[16415]:8A - |
| Mem[16416]:0C -Mem[16417]:13 -Mem[16418]:38 -Mem[16419]:62 -Mem[16420]:C5 -Mem[16421]:3A -Mem[16422]:27 -Mem[16423]:88 -Mem[16424]:A7 -Mem[16425]:DF -Mem[16426]:83 -Mem[16427]:98 |
| Mem[16428]:58 -Mem[16429]:50 -Mem[16430]:4A -Mem[16431]:F |
| Mem[16432]:6F -Mem[16433]:F9 -Mem[16434]:24 -Mem[16435]:A2 -Mem[16436]:FB -Mem[16437]:F2 -Mem[16438]:D9 -Mem[16439]:E8 -Mem[16440]:B5 -Mem[16441]:D4 -Mem[16442]:82 -Mem[16443]:DE |
| Mem[16444]:67 -Mem[16445]:36 -Mem[16446]:EC -Mem[16447]:08 - |
| Done Admin Command OPC: 6 |

Figure 3.58: General Execution - AES-256 Developer PC Terminal
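The log in Figure 3.58 pairs each accelerator parameter with a Set Features admin command (opcode 09h) whose feature identifier lies in the vendor-specific range C0h to FFh defined by the NVMe specification. As a minimal sketch of what such a command could look like on the wire, the snippet below packs a 64-byte submission queue entry in Python. Placing the 32-bit parameter in Command Dword 11 is an assumption made for illustration; the function names and the command identifier are hypothetical and are not taken from the prototype firmware.

```python
import struct

OPC_SET_FEATURES = 0x09            # NVMe admin opcode for Set Features
VENDOR_FID_RANGE = range(0xC0, 0x100)  # vendor-specific feature identifiers


def is_vendor_specific(fid):
    """Return True if the feature identifier is in the vendor-specific range."""
    return fid in VENDOR_FID_RANGE


def build_set_features_sqe(cid, fid, value, save=False):
    """Pack a 64-byte submission queue entry for a Set Features command.

    Assumption: the accelerator parameter travels in Command Dword 11.
    """
    if not 0 <= fid <= 0xFF:
        raise ValueError("feature identifier must fit in 8 bits")
    dwords = [0] * 16
    # CDW0: opcode in bits 7:0, command identifier in bits 31:16
    dwords[0] = OPC_SET_FEATURES | ((cid & 0xFFFF) << 16)
    # CDW10: feature identifier in bits 7:0, Save bit in bit 31
    dwords[10] = fid | (int(save) << 31)
    # CDW11: feature-specific value (here, one 32-bit accelerator parameter)
    dwords[11] = value & 0xFFFFFFFF
    return struct.pack("<16I", *dwords)


# Values as they appear in the log: parameter C with FID CCh
sqe = build_set_features_sqe(cid=1, fid=0xCC, value=0xF4F5F6F7)
print(len(sqe), sqe[:4].hex(), is_vendor_specific(0xCC))
```

On a Linux host the same kind of command is normally issued through the standard NVMe driver, for instance with the `set-feature` subcommand of nvme-cli, so no custom driver is involved.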



Figure 3.59: General Execution - AES-256 Encryption Waveform

#### **Performance Tests**

As in the previous cases, performance tests are carried out to evaluate IOPS, bandwidth and latency of the final computational storage, configured here to perform AES-256 encryption.

As can be seen in Figures 3.60 to 3.63, inserting the hardware accelerator causes almost no performance degradation.



Figure 3.60: Performance test - dd function, write and read - 128-bit AXI DMA Version

Iometer results display (All Managers), test completed successfully:

| Metric                         | Value                    |
|--------------------------------|--------------------------|
| Total I/Os per Second          | 10658.55                 |
| Total MBs per Second (Decimal) | 43.66 MBPS (41.63 MiBPS) |
| Average I/O Response Time (ms) | 0.1876                   |
| Maximum I/O Response Time (ms) | 0.2293                   |
| % CPU Utilization (total)      | 0.25 %                   |
| Total Error Count              | 0                        |

Figure 3.61: Performance test - Iometer Sequential Read transfer size = 4k, block size = 4k - Computational Storage
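The Iometer readings of Figure 3.61 can be cross-checked with simple arithmetic: bandwidth is IOPS multiplied by the transfer size, and Little's law relates IOPS and average response time to the average number of requests in flight. The sketch below assumes that the 4k transfer size means 4096 bytes; the estimate of outstanding requests is an inference from the reported averages, not a setting stated in the text.

```python
TRANSFER_BYTES = 4096  # assumption: "4k" transfer size means 4 KiB


def bandwidth(iops, transfer_bytes):
    """Return (decimal MB/s, binary MiB/s) for a given I/O rate."""
    bytes_per_second = iops * transfer_bytes
    return bytes_per_second / 1e6, bytes_per_second / 2**20


def average_outstanding(iops, avg_response_ms):
    """Little's law: requests in flight = completion rate x time in system."""
    return iops * (avg_response_ms / 1000.0)


iops = 10658.55          # Total I/Os per Second reported by Iometer
avg_response_ms = 0.1876  # Average I/O Response Time reported by Iometer

mbps, mibps = bandwidth(iops, TRANSFER_BYTES)
in_flight = average_outstanding(iops, avg_response_ms)
print(f"{mbps:.2f} MB/s, {mibps:.2f} MiB/s, {in_flight:.2f} requests in flight")
```

Under these assumptions the computed bandwidth matches the 43.66 MBPS (41.63 MiBPS) shown by Iometer, and the averages are consistent with about two requests being outstanding at any time.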



Figure 3.62: Performance test - Iometer Sequential Read transfer size = 16k, block size = 4k - Computational Storage



Figure 3.63: Performance test - Read Latency - Computational Storage

# Chapter 4 Conclusion

This research aimed to explore the capabilities of NVMe-based computational storage as an opportunity to take data centers to the next level in terms of performance and energy efficiency.

The developed computational storage is easily configurable from the host side thanks to the NVMe protocol on which it is based: no additional software or driver installation is required, since the NVMe driver is open source and supported in all major distributions. This requires developing an NVMe controller as the interface block of the computational storage; the controller, however, complies with specifications common to all vendors. Moreover, the results of the performance tests, summarized in Tables 4.1 and 4.2, show that the developed hardware accelerator performs well: it does not introduce significant losses with respect to the standard version (without the acceleration block). For example, for 4 kB sequential reads the Final Prototype reaches 10658 IOPS against 10727 IOPS for the 128-bit AXI DMA version, while the total read latency grows from 178 µs to 182.2 µs.

However, performance limits prevented a detailed analysis of how increasing the operating frequency of the DMA and accelerator blocks would affect the results. The measured performance is much lower than expected because it is heavily affected by the high memory traffic on the DDR; a platform with more internal resources or expansion possibilities could solve this problem. Furthermore, a solution must be found for the problem affecting the write operation, so that the device finally becomes bidirectional.

Further improvements are possible: the original project was not designed to perform this type of operation, so a dedicated firmware could enhance performance by reducing the software latency. Lastly, thanks to the NVMe protocol, the computational storage could work in a peer-to-peer network, whose exploitation would further reduce CPU workload and data movement, resulting in higher efficiency and lower energy consumption.

| Version         | IOPS (transfer size 4 kB) | Bandwidth [MBPS] (transfer size 4 kB) | IOPS (transfer size 16 kB) | Bandwidth [MBPS] (transfer size 16 kB) |
|-----------------|---------------------------|---------------------------------------|----------------------------|----------------------------------------|
| Base            | 7457                      | 30.55                                 | 7136                       | 116.92                                 |
| 32-bit AXI DMA  | 8435                      | 34.55                                 | 8359                       | 136.95                                 |
| First Prototype | 6738                      | 27.60                                 | 6674                       | 109.36                                 |
| 128-bit AXI DMA | 10727                     | 43.94                                 | 10586                      | 173.45                                 |
| Final Prototype | 10658                     | 43.66                                 | 10478                      | 171.68                                 |

Table 4.1: Iometer Sequential Read Test - block size 4 kB
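To put the values of Table 4.1 into proportion, the following sketch computes the relative change of the Final Prototype with respect to the 128-bit AXI DMA version. Treating the 128-bit AXI DMA row as the accelerator-free counterpart of the Final Prototype is an assumption based on the order of the rows; the dictionary layout and the function name are illustrative only.

```python
# Columns: IOPS at 4 kB, MBPS at 4 kB, IOPS at 16 kB, MBPS at 16 kB (Table 4.1)
LABELS = ("IOPS, 4 kB", "MBPS, 4 kB", "IOPS, 16 kB", "MBPS, 16 kB")
TABLE_4_1 = {
    "128-bit AXI DMA": (10727, 43.94, 10586, 173.45),
    "Final Prototype": (10658, 43.66, 10478, 171.68),
}


def relative_change_percent(baseline, value):
    """Signed change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


changes = {
    label: relative_change_percent(base, final)
    for label, base, final in zip(
        LABELS, TABLE_4_1["128-bit AXI DMA"], TABLE_4_1["Final Prototype"]
    )
}

for label, change in changes.items():
    print(f"{label}: {change:+.2f} %")
```

With these figures the loss attributable to the accelerator is roughly 0.6 % for 4 kB transfers and about 1 % for 16 kB transfers.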

| Version         | Firmware latency [µs] | Data Transfer latency [µs] | Host & Hardware latency [µs] | Total latency [µs] |
|-----------------|-----------------------|----------------------------|------------------------------|--------------------|
| Base            | 136.6                 | 105.6                      | 52.1                         | 188.7              |
| 32-bit AXI DMA  | 140.9                 | 120.9                      | 66.2                         | 207.1              |
| First Prototype | 140.7                 | 121.6                      | 74.3                         | 215                |
| 128-bit AXI DMA | 114.6                 | 97.4                       | 63.6                         | 178                |
| Final Prototype | 114.9                 | 97.4                       | 67.3                         | 182.2              |

Table 4.2: Latency Sequential Read Test - block size 4kB
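The columns of Table 4.2 appear to be related as Total = Firmware + Host & Hardware, with Data Transfer being a part of the firmware time; this reading is inferred from the numbers rather than stated here. The sketch below checks that relation within a rounding tolerance and derives the latency added by the accelerator, again assuming that the 128-bit AXI DMA row is the accelerator-free counterpart of the Final Prototype.

```python
# Columns: Firmware, Data Transfer, Host & Hardware, Total, all in microseconds
LATENCY_US = {
    "Base": (136.6, 105.6, 52.1, 188.7),
    "32-bit AXI DMA": (140.9, 120.9, 66.2, 207.1),
    "First Prototype": (140.7, 121.6, 74.3, 215.0),
    "128-bit AXI DMA": (114.6, 97.4, 63.6, 178.0),
    "Final Prototype": (114.9, 97.4, 67.3, 182.2),
}
TOLERANCE_US = 0.25  # allowance for rounding in the reported totals


def sum_mismatch(row):
    """Absolute difference between Firmware + Host & Hardware and Total."""
    firmware, _data_transfer, host_hw, total = row
    return abs(firmware + host_hw - total)


mismatches = {name: sum_mismatch(row) for name, row in LATENCY_US.items()}

baseline_total = LATENCY_US["128-bit AXI DMA"][3]
final_total = LATENCY_US["Final Prototype"][3]
overhead_us = final_total - baseline_total
overhead_percent = 100.0 * overhead_us / baseline_total

for name, mismatch in mismatches.items():
    print(f"{name}: mismatch {mismatch:.1f} us")
print(f"accelerator overhead: {overhead_us:.1f} us ({overhead_percent:.1f} %)")
```

Under these assumptions the accelerator adds about 4.2 µs, roughly 2.4 % of the total read latency. The 128-bit AXI DMA row sums to 178.2 µs against a reported total of 178 µs, which is why the check uses a tolerance.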

# **Bibliography**

- [1] Roderick Bauer, *The Challenges of Opening a Data Center*, 2018, accessed June 2020, <https://www.backblaze.com/blog/choosing-data-center/>.
- [2] Kachris, Christoforos, Falsafi, Babak, Soudris, Dimitrios, Energy-Efficient Servers and Cloud, *Hardware Accelerators in Data Centers*, 2019.
- [3] Nicola Jones, How to stop data centres from gobbling up the world's electricity, Nature, September 2018, accessed June 2020, <https://www.nature.com/articles/d41586-018-06610-y>.
- [4] Wayne M. Adams, Power consumption in data centers is a global problem, Data Center Dynamics (DCD), November 2018, accessed June 2020, <https://www.datacenterdynamics.com/en/opinions/power-consumption-data-centers-global-problem/>.
- [5] Kachris, Christoforos, Falsafi, Babak, Soudris, Dimitrios, The Era of Accelerators in the Data Centers, *Hardware Accelerators in Data Centers*, 2019.
- [6] João M.P. Cardoso, José Gabriel F. Coutinho, Pedro C. Diniz, High-performance embedded computing, *Embedded Computing for High Performance*, 2017.
- [7] Mahdi Torabzadehkashi, Siavash Rezaei, Ali HeydariGorji, Hosein Bobarshad, Vladimir Alves, Nader Bagherzadeh, *Computational storage: an efficient and scalable platform for big data and HPC applications*, Journal of Big Data, 2019.
- [8] Rino Micheloni, The Need for High Speed Interfaces, *Solid-State-Drives (SSDs) Modeling: Simulation Tools & Strategies*, 2017.
- [9] NVM Express, What is NVMe?, accessed June 2020, <https://nvmexpress.org/>.
- [10] Greg Schulz, Doug Rollins, Why NVMe Should Be in Your Data Center, Micron, accessed June 2020, <https://www.micron.com/solutions/technical-briefs/why-nvme-should-be-in-your-data-center>.
- [11] NVM Express, How does power management work with NVMe technology?, accessed June 2020, <https://nvmexpress.org/faq-items/how-does-powermanagement-work-with-nvme-technology/>.
- [12] IP-Maker, NVMe IP for Enterprise SSD, Design & Reuse, accessed June 2020, <https://www.design-reuse.com/articles/39493/nvme-ip-for-enterprise-ssd.html>.

- [13] IP-Maker, Conquering the challenges of PCIe with NVMe in order to deliver highly competitive Enterprise PCIe SSD, Design & Reuse, accessed June 2020, <https://www.design-reuse.com/articles/34744/conquering-the-challenges-of-pcie-with-nvme.html>.
- [14] Wikipedia, PCI Express Wikipedia, The free encyclopedia, 2020, accessed June 2020, <https://en.wikipedia.org/wiki/PCI\_Express>.
- [15] Amber Huffman, NVM Express: Optimized Interface for PCI Express SSDs, Intel Developer Forum (IDF), 2013.
- [16] Mark Carlson, Paul Suhler, Managing Capacity in NVM Express SSDs, Samsung Developer Conference (SDC), 2019.
- [17] J Metz, Creating Higher Performance Solid State Storage with Non-Volatile Memory Express, Data Storage Innovation (DSI) Conference SNIA<sup>™</sup>, 2015.
- [18] Brandon Hoff, NVMe over Fabrics, NVM Express, 2017.
- [19] Kamal Hyder, Manoj Wadekar, Yaniv Romem, Nishant Lodha, NVMe-oF<sup>™</sup> Enterprise Appliances, Flash Memory Summit, 2020.
- [20] Pure Storage, PERCHÉ NVMe-oF È IL FUTURO?, accessed June 2020, <https://www.purestorage.com/it/resources/glossary/nvme-over-fabrics.html>.
- [21] Mike Kieran, When You're Implementing NVMe Over Fabrics, the Fabric Really Matters, NetApp, March 2019, accessed June 2020, <https://blog.netapp.com/nvme-over-fabric/>.
- [22] Franco Benuzzi, NVMe e NVMe-oF: i nuovi protocolli di accesso ai dischi flash, lantechlongwave, June 2019, accessed June 2020, <https://www.lantechlongwave.it/archivio/nvme-nvme-of-nuovi-protocolliaccesso-ai-dischi-flash/>.
- [23] BittWare, *Traditional vs. Computational Storage*, accessed June 2020, <https://www.bittware.com/fpga/storage/>.
- [24] Stephen Bates, Accelerating RocksDB with Eideticom's NoLoad<sup>®</sup>, Samsung Developer Conference (SDC), 2019.
- [25] Stephen Bates, Richard Mataya, Accelerating Applications with NVM Express<sup>™</sup> Computational Storage, NVMe<sup>®</sup> Annual Members Meeting and Developer Day, 2019.
- [26] Scott Shadley, Nick Adams, David Slik, What happens when Compute meets Storage?, SNIA Computational Storage, 2019.
- [27] Stephen Bates, *How NVM Express and Computational Storage can make your AI Applications Shine!*, Massive Storage Systems and Technology (MSST), 2019.
- [28] Sean Gibb, Stephen Bates, An NVMe-based Offload Engine for Storage Acceleration, SNIA Storage Developer Conference (SDC), 2017.
- [29] Daniel Robinson, Computational Storage Winds Its Way Towards The Mainstream, TheNextPlatform, February 2020, accessed June 2020, <https://www.nextplatform.com/2020/02/25/computational-storage-windsits-way-towards-the-mainstream/>.

- [30] Yong Ho Song, Cosmos+ OpenSSD 2017 Tutorial, HYU ENC Lab of Hanyang University, 2017.
- [31] Xilinx, ZC706 Evaluation Board for the Zynq-7000 XC7Z045 SoC, User Guide, Page 46-47, 2019.
- [32] NVM Express, NVM Express<sup>™</sup> Base Specification Revision 1.4, Page 206-207, June 2019.
- [33] Homer Hsing, AES Core Specification, OpenCores, February 2013, accessed June 2020, <https://opencores.org/projects/tiny\_aes>.
- [34] Morris Dworkin, Recommendation for Block Cipher Modes of Operation: Methods and Techniques, NIST Special Publication 800-38A, December 2001.