Deep Learning Hardware: Know Your Options

There are options outside of GPUs (Graphics Processing Units) when it comes to deploying a neural network, namely the FPGA (Field Programmable Gate Array). Before delving into FPGAs and their implementation though, it’s good to understand a bit about GPU architecture and why GPUs are a staple for neural networks.

Popular libraries such as Tensorflow run using CUDA (Compute Unified Device Architecture) to process data on GPUs, harnessing their parallel computing power. This approach is called GPGPU (General Purpose GPU) programming. It has been adapted to deep learning models, which require at least thousands of arithmetic operations.

A deep convolutional neural network, as shown below, slides filters across pixel regions, outputting a weighted sum at each step. For each layer this process is repeated thousands of times with different filters of the same size. Deep models therefore get computationally heavy, and GPUs come in handy.

Source: Hinton G.E. et al. via research paper; the first layer contains 253,440 weights, therefore at least that many calculations
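
To make the arithmetic concrete, here is a minimal sketch of the weighted sum computed for a single output position; the names, sizes and layout (row-major image, square K×K filter, no bounds checking) are illustrative assumptions rather than anything taken from the paper above.

    // Weighted sum for ONE output pixel of a convolution (illustrative sketch).
    float conv_at(const float *image, int width,
                  const float *filter, int K,
                  int row, int col) {
        float sum = 0.0f;
        for (int i = 0; i < K; i++) {         // slide over filter rows
            for (int j = 0; j < K; j++) {     // slide over filter columns
                sum += filter[i * K + j] * image[(row + i) * width + (col + j)];
            }
        }
        return sum;                           // one weighted sum per output position
    }

Every output pixel of every filter in every layer repeats a loop like this, which is why the operation count grows so quickly.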

Tensorflow can be built on the back of CUDA, which saves the end user from implementing parallel code and understanding the architecture of their chip. Its convenience and high degree of optimization make it well suited for widespread use.

FPGAs did not offer such a convenient solution until recently: using them required a deep understanding of how hardware works. Recent progress has made them more accessible, and there is more on that to come later.

Overview

This article assumes little to no knowledge of how different hardware platforms function. It goes over the following:

  • GPUs, CPUs and CUDA
  • Advantages and design of FPGAs
  • HDL as a method for FPGA deployment
  • HLS as a method for FPGA deployment
  • FPGA deployment using LeFlow
  • Features of LeFlow for optimisation

Why Are GPUs Sometimes Better than CPUs?

CPUs (Central Processing Units) are designed for serial operations and advanced logic. This is reflected in their design, which contains fewer cores and more cache memory to quickly fetch complex instructions.

Source: Elkaduwe et al. via research paper; one letter makes all the difference

GPUs, however, have hundreds of smaller cores for simple computations, and thus a higher throughput than CPUs.

Source: self; the kernel<<<...>>>() launch notation indicates the number of parallel processes to run, equal to the size of the variable grid. The function being run is named kernel.

CUDA accesses a GPU's many cores by abstracting them into a grid of blocks, where each block contains up to 1,024 threads and a grid can hold tens of thousands of blocks. Every thread executes a short program, and the catch is that it can run in parallel with other threads. Tensorflow takes advantage of this pattern to improve processing power, often running hundreds to thousands of threads simultaneously.
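
Below is a minimal sketch of that launch notation; the kernel body, block size and scaling operation are illustrative assumptions, not the author's original code.

    // Each thread scales one array element; grid * BLOCK threads run in parallel.
    #include <cstdio>

    __global__ void kernel(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index for this thread
        if (i < n) data[i] *= factor;                   // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));    // unified memory for simplicity
        for (int i = 0; i < n; i++) data[i] = 1.0f;

        const int BLOCK = 256;
        int grid = (n + BLOCK - 1) / BLOCK;             // enough blocks to cover n elements
        kernel<<<grid, BLOCK>>>(data, 2.0f, n);         // launch grid * BLOCK parallel threads
        cudaDeviceSynchronize();                        // wait for the GPU to finish

        printf("data[0] = %f\n", data[0]);              // prints 2.0 if the kernel ran
        cudaFree(data);
        return 0;
    }

The <<<grid, BLOCK>>> arguments are the abstraction described above: CUDA maps the requested blocks and threads onto the GPU's cores without the programmer managing the hardware directly.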

To learn more about using CUDA, visit Nvidia's Developer Blog or check out the book CUDA by Example.

Neural Network Hardware

Tensorflow is divided into two sections: library and runtime.

The library side builds a computational graph (the neural network), and the runtime executes that graph on some hardware platform.

The preferred platform is a GPU; however, there is an alternative: FPGAs.

Why Use FPGAs?

FPGAs can implement circuits with thousands of parallel compute and memory units, so they can work similarly to GPUs running many CUDA threads. Their architecture is adaptable, enabling additional optimisations for higher throughput. The possible volume of calculations thus makes FPGAs a viable alternative to GPUs.

Comparatively, FPGAs have lower power consumption and can be optimal for embedded applications. They are also an accepted standard in safety-critical operations such as ADAS (Advanced Driver Assistance Systems) in the automotive industry.

Source: Ford Motor Company via Wikimedia (CC); a safety-critical application well suited to FPGAs: a collision warning system

Furthermore, FPGAs can implement custom data types, whereas GPUs are limited by their fixed architecture. With neural networks evolving in many ways and reaching more industries, the adaptability FPGAs offer is useful.

Now You Must Be Wondering, What Are FPGAs?

An FPGA (Field Programmable Gate Array) is a customisable hardware device. It can be thought of as a sea of floating logic gates. A designer comes along and writes a program using a hardware description language (HDL), such as Verilog or VHDL. That program dictates what connections are made and how they are implemented using digital components. Another term for HDL code is RTL (register-transfer level) code.

FPGAs are easy to spot: look for an oversized Arduino.

Source: Paulo Matias via Wikimedia (CC); Altera DE2-115 board, one of many FPGAs that can be used

Just kidding, they come in all shapes and sizes.

Using software analogous to a compiler, HDL is synthesized (figuring out which gates to use), then routed (connecting the parts together) to form an optimized digital circuit. These tools (HDL editing, synthesis, routing, timing analysis, testing) are all encompassed in software suites such as Xilinx Design Tools and Quartus Prime.

Currently, models get trained on a GPU but are then deployed on an FPGA for real-time processing.

Then Why Don’t We Use FPGAs Instead?

For FPGAs, the tricky part is implementing ML frameworks, which are written in higher-level languages such as Python. HDL isn't inherently a programming platform; it is code written to define hardware components such as registers and counters. HDL languages include Verilog and VHDL.

Shown below is a snippet of some code used to create a serial bit detector.

If you’re unfamiliar with it, try guessing what it does.

Source: Self

Done? Even if you stare at it for a while, it isn’t obvious.

Most of the time, FSMs (Finite State Machines) are used to split the task up into states with input-dependent transitions. All of this is done before programming, to figure out how the circuit will behave on each clock cycle. This diagram, as shown below, then gets converted into blocks of HDL code.

Source: Maggyero via Wikimedia (CC); a possible FSM diagram
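
As a software analogy only (not the original Verilog, and with a hypothetical target pattern of 101), the state/transition idea looks roughly like this in C:

    // Software analogy of an FSM: detect the serial bit pattern 1-0-1.
    // In hardware, the switch becomes next-state logic feeding a clocked register.
    typedef enum { S0, S1, S10 } state_t;     // nothing seen, saw "1", saw "10"

    int fsm_step(state_t *state, int bit) {   // returns 1 when the pattern completes
        switch (*state) {
            case S0:  *state = bit ? S1 : S0;  return 0;
            case S1:  *state = bit ? S1 : S10; return 0;
            case S10: *state = bit ? S1 : S0;  return bit;  // a 1 here closes "101"
        }
        return 0;
    }

Each call corresponds to one clock cycle: the current state and the input bit decide the next state and the output.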

Back to the topic: the main point is that there is no direct translation from a loop in Python to a bunch of wires in Verilog.

Given the possible complexity of a design, it can be very difficult to debug for further optimisation. There are no abstractions to simplify the process, as there are in CUDA, where a thread can be selected and modified.

So Should We Stick to GPUs?

Well no, FPGAs aren’t useless.

One way to work around the programming problem is to use HLS (high-level synthesis) tools such as LegUp to generate programs in Verilog for deployment. HLS tools allow designers to avoid writing HDL from scratch and instead use a more intuitive, algorithmic programming language (C).

HLS tools abstract away hardware-level design, similar to how CUDA automatically sets up concurrent blocks and threads when the model is run.

HLS tools require C code as input, which gets mapped to an LLVM IR (intermediate representation). The tools then convert this procedural description into a hardware implementation.
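
For a feel of the input side, here is a minimal sketch of the kind of plain C an HLS tool can work with; the function, names and sizes are made up for illustration rather than taken from any LegUp example.

    // A plain C loop of the sort an HLS tool can map to a pipelined datapath.
    #define N 128

    int dot_product(const int a[N], const int b[N]) {
        int acc = 0;
        for (int i = 0; i < N; i++) {
            acc += a[i] * b[i];   // iterations can be unrolled or pipelined in hardware
        }
        return acc;
    }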

Their role in FPGA design is shown below.

Source: Greg S. via University of Florida slides; HLS tools produce HDL, which allows register-transfer (RT) synthesis into a digital circuit, finally deployed on an FPGA

More on LLVM IRs

LLVM is not an acronym; it is a library that constructs assembly-like instructions (IRs). These programs are easier for HLS tools to process and can be used to create synthesizable code for an FPGA.

IRs are used to describe source code in a general format, allowing it to be used by various programs.
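
To give a feel for what an IR looks like, here is a tiny C function and, in the comment, roughly the kind of LLVM IR a compiler might emit for it (hand-simplified, so treat it as an approximation):

    // A trivial C function...
    int add(int a, int b) {
        return a + b;
    }

    /* ...and, approximately, its LLVM IR:
       define i32 @add(i32 %a, i32 %b) {
       entry:
         %sum = add nsw i32 %a, %b
         ret i32 %sum
       }
       Each instruction is a simple, typed, assembly-like operation, which is
       what makes the IR easier for HLS tools to analyse than raw C. */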

To learn more about LLVM and IRs, refer to Dr. Chisnall's slides.

