There are options outside of GPUs (Graphics Processing Units) when it comes to deploying a neural network, namely the FPGA (Field Programmable Gate Array). Before delving into FPGAs and their implementation though, it’s good to understand a bit about GPU architecture and why GPUs are a staple for neural networks.
Popular libraries such as TensorFlow use CUDA (Compute Unified Device Architecture) to process data on GPUs, harnessing their parallel computing power. This practice is called GPGPU (General-Purpose GPU) programming, and it has been adopted for deep learning models, which require many thousands of arithmetic operations.
The deep convolutional neural network requires filters to be slid across pixel regions while outputting a weighted sum at each position, as sketched below. For each layer this process is repeated thousands of times with varying filters of the same size. Naturally, deep models get computationally heavy, and GPUs come in handy.
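To make that concrete, here is a minimal C sketch of the core computation: a single 3x3 filter sliding over a one-channel image. The filter size, memory layout, and function name are illustrative assumptions, not TensorFlow internals.

```c
/* Minimal sketch: slide a 3x3 filter across a single-channel image,
 * writing the weighted sum of each pixel region to the output. */
void conv2d(const float *image, int height, int width,
            const float filter[3][3], float *output)
{
    for (int row = 0; row <= height - 3; row++) {
        for (int col = 0; col <= width - 3; col++) {
            float sum = 0.0f;
            for (int i = 0; i < 3; i++)        /* weighted sum over the  */
                for (int j = 0; j < 3; j++)    /* region under the filter */
                    sum += image[(row + i) * width + (col + j)] * filter[i][j];
            output[row * (width - 2) + col] = sum;
        }
    }
}
```

A real layer repeats this for many filters and channels at every position, which is exactly the kind of independent, repetitive arithmetic a GPU parallelises well.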
TensorFlow is built on the back of CUDA, which saves the end user from implementing parallel code and understanding the architecture of their chip. Its convenience and high degree of optimisation make it perfect for widespread use.
FPGAs did not offer such a convenient solution until recently; using them required a deep understanding of how hardware works. Recent progress has made them more accessible, though, and there's more on that to come later.
Overview
This article assumes little to no knowledge of how different hardware platforms function. It goes over the following:
- GPUs, CPUs and CUDA
- Advantages and design of FPGAs
- HDL as a method for FPGA deployment
- HLS as a method for FPGA deployment
- FPGA deployment using LeFlow
- Features of LeFlow for optimisation
Why Are GPUs Sometimes Better than CPUs?
CPUs (Central Processing Units) are designed for serial operations and advanced control logic. This is reflected in their design, which contains fewer cores and more cache memory for quickly fetching complex instructions.
GPUs, however, have hundreds of smaller cores for simple computation, and thus a higher throughput than CPUs.
CUDA accesses a GPU's many cores by abstracting them into grids of blocks. Each block contains up to 1,024 threads, and a grid can hold up to 65,535 blocks along each of its outer dimensions. Every thread executes a short program, and the catch is that it can run in parallel with other threads. TensorFlow takes advantage of this pattern to improve processing power, often running hundreds to thousands of threads simultaneously.
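As a rough illustration of that block/thread model (a minimal sketch, not how TensorFlow's kernels are actually written), here is a CUDA kernel in which every thread computes one element of a vector sum:

```cuda
#include <cuda_runtime.h>

/* Each thread computes one output element; thousands of these
 * threads run concurrently across the GPU's cores. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Host-side launch: split n elements into blocks of 256 threads.
 * vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);               */
```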
To learn more about using CUDA, visit Nvidia's Developer Blog or check out the book CUDA By Example.
Neural Network Hardware
TensorFlow is divided into two parts: library and runtime.
The library handles the creation of a computational graph (the neural network), and the runtime executes that graph on some hardware platform.
The preferred platform is a GPU; however, there is an alternative: the FPGA.
Why Use FPGAs?
FPGAs can implement circuits with thousands of compute and memory units operating in parallel, so they work similarly to GPUs with their CUDA threads. Their adaptable architecture enables additional optimisations for an increase in throughput. This volume of possible calculations makes FPGAs a viable alternative to GPUs.
Comparatively, FPGAs have lower power consumption and can be optimal for embedded applications. They are also an accepted standard in safety-critical operations such as ADAS (Advanced Driver Assistance Systems) in the automotive industry.
Furthermore, FPGAs can implement custom data types, whereas GPUs are limited by their architecture. With neural networks changing in many ways and reaching into more industries, it is useful to have the adaptability FPGAs offer.
Now You Must Be Wondering, What Are FPGAs?
An FPGA (Field Programmable Gate Array) is a customisable hardware device. It can be thought of as a sea of floating logic gates. A designer comes along and writes a program in a hardware description language (HDL), such as Verilog or VHDL. That program dictates which connections are made and how they are implemented using digital components. HDL code is also referred to as RTL (register-transfer level) code.
FPGAs are easy to spot: look for an oversized Arduino.
Just kidding, they come in all shapes and sizes.
Using software analogous to a compiler, HDL is synthesized (figuring out which gates to use) and then routed (connecting the parts together) to form an optimised digital circuit. These tools (HDL editing, synthesis, routing, timing analysis, testing) are all encompassed in software suites such as the Xilinx Design Tools and Quartus Prime.
Currently, models get trained on a GPU but are then deployed on an FPGA for real-time processing.
Then Why Don’t We Use FPGAs Instead?
For FPGAs, the tricky part is implementing ML frameworks, which are written in higher-level languages such as Python. HDL isn't inherently a programming platform; it is code written to define hardware components such as registers and counters. HDL languages include Verilog and VHDL.
Shown below is a snippet of code used to create a serial bit detector.
If you’re unfamiliar with it, try guessing what it does.
Done? Even if you stare at it for a while, it isn’t obvious.
Most of the time, FSMs (Finite State Machines) are used to split the task up into states with input-dependent transitions. All of this is done before programming, to figure out how the circuit will behave on each clock cycle. This state diagram, as shown below, then gets converted into blocks of HDL code.
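As a sketch of the idea, here is a hypothetical serial bit detector written as a C state machine; the 1-0-1 target pattern is an illustrative assumption, and an HDL version would encode the same states and transitions in registers and combinational logic, advancing once per clock cycle.

```c
/* Hypothetical FSM: detect the bit pattern 1-0-1 on a serial input,
 * consuming one bit per step (one "clock cycle"). */
typedef enum { IDLE, GOT_1, GOT_10 } state_t;

int step(state_t *state, int bit)   /* returns 1 when 1-0-1 is seen */
{
    switch (*state) {               /* input-dependent transitions  */
    case IDLE:   *state = bit ? GOT_1 : IDLE;   return 0;
    case GOT_1:  *state = bit ? GOT_1 : GOT_10; return 0;
    case GOT_10:
        if (bit) { *state = GOT_1; return 1; }  /* pattern complete */
        *state = IDLE;
        return 0;
    }
    return 0;
}
```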
Back to the topic: the main point is that there is no direct translation from a loop in Python to a bunch of wires in Verilog.
Given the possible complexity of a design, it can be very difficult to debug for further optimisation. There are no abstractions to simplify the process as there are in CUDA, where a thread can be selected and modified.
So Should We Stick to GPUs?
Well no, FPGAs aren’t useless.
One way to work around the programming problem is to use HLS (high-level synthesis) tools such as LegUp to generate Verilog programs for deployment. HLS tools allow designers to avoid writing HDL from scratch and instead use a more intuitive, algorithmic programming language (C).
HLS tools abstract away hardware-level design, similar to how CUDA automatically sets up concurrent blocks and threads when a model is run.
HLS tools take C code as input, which gets mapped to an LLVM IR (intermediate representation); from there, the tools convert the procedural description into a hardware implementation.
Their role in FPGA design is shown below.
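For a concrete sense of the input, here is a small C function of the kind an HLS tool can turn into hardware; the function name and sizes are illustrative assumptions, not a LegUp-specific example.

```c
/* Fixed loop bounds and simple arrays let an HLS tool map this loop
 * onto multipliers, an adder, and registers in the FPGA fabric. */
#define N 8

int dot_product(const int a[N], const int b[N])
{
    int acc = 0;
    for (int i = 0; i < N; i++)   /* the tool may unroll or pipeline this */
        acc += a[i] * b[i];
    return acc;
}
```

Because everything here is statically sized, the tool can decide at compile time how much hardware each operation needs.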
More on LLVM IRs
LLVM is not an acronym; it is a compiler library that constructs assembly-like instructions (IRs). These programs are easier for HLS tools to process and can be used to create synthesizable code for an FPGA.
IRs are used to describe source code in a general format, allowing it to be used by various programs.
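As a rough, hand-written illustration (approximate IR shown as comments, not exact compiler output), here is how a one-line C function might lower to IR-style instructions:

```c
int add_scaled(int a, int b)
{
    return a + 2 * b;
    /* Approximate LLVM IR for the statement above:
     *   %mul = mul nsw i32 %b, 2
     *   %add = add nsw i32 %a, %mul
     *   ret i32 %add
     */
}
```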
To learn more about LLVM and IRs, refer to Dr. Chisnall's slides.