Machine learning for object recognition: why FPGAs have an edge over other embedded hardware platforms
By Wafy Butty
FPGA Specialist Engineer (Central Europe), Future Electronics
Read this to find out about:
- The inherent advantages of FPGAs for running math-heavy object-recognition computations
- The scope to implement object recognition on small, low-power, resource-constrained FPGAs
- How the Future Electronics Avalanche development board gives a head-start to new image-recognition development projects
The application of Artificial Intelligence (AI) has gained broad public attention in a handful of high-performance systems: examples include the machines which can beat humans at complex games such as chess and Go, and IBM’s Watson AI platform, which provides, for instance, natural-language diagnostic support to medical practitioners.
These pioneering uses of AI are based in the computer science world, and draw on the massive computing resources of data centres operated by large companies such as IBM, Google and Microsoft.
Now, however, interest in the application of AI is growing in the embedded world, where edge computing resources are a tiny fraction of those of a data centre. Already, the sweet spots of AI for embedded developers are emerging. AI applications known to be feasible even with the constrained computing resources of a microcontroller, applications processor or FPGA include machine condition monitoring for predictive maintenance, and object classification or recognition.
Object recognition is a particularly exciting function for embedded AI, because of the wide range of use cases for it:
- In Advanced Driver Assistance Systems (ADAS), a smart camera might be trained to identify road signs such as speed limits and to alert the driver via a head-up display, for instance if the car’s speed exceeds the limit
- A smart multimedia advertising panel can detect a passer-by, estimate his or her age and gender, and display appropriate advertising content
- Automatic doors could use a camera to identify objects in view, and choose to open for humans in proximity, but not for other objects such as animals
- Drones are an increasingly important tool in smart agriculture systems. Object detection capability embedded in an autonomous drone could enable it to identify objects from the air, such as farm buildings, trees, hedges and animals
- In industrial processes, a machine may be used to perform visual inspection for quality-control purposes, for instance to evaluate the ripeness and appearance of food, or to verify that a PCB assembly contains the correct components
The creation of an embedded device which is capable of recognising a certain category or categories of objects involves machine learning, a technology which is now the subject of a large body of literature. It is not the purpose of this article to shed light on the process of machine learning itself, nor on the functions of data collection and labelling, model training, and optimisation of a neural network algorithm.
Instead, this article looks at the narrow question of hardware evaluation: which type of platform is best suited to the task of running an object-recognition algorithm and the associated system functions? And how might the developer expect a silicon manufacturer to support the compilation of the algorithm to the hardware target?
The importance of math functions
The embedded developer community tends to be divided into tribes: a developer is normally a user of either a microcontroller, or an applications processor, or an FPGA. When it comes to the implementation of an object-recognition system, the FPGA user enjoys some important advantages.
The first arises from the typical composition of the host system in which the object-recognition function is embedded. It might include:
- One or more image sensors with different resolutions and interfaces.
- Some type of output device. The output could be a display connected via, for example, an HDMI, eDP or 7:1 LVDS video interface. Or it could be a data interface such as Ethernet at a data rate of 1Gbit/s, 5Gbits/s or 10Gbits/s, or PCIe, to allow for further processing.
- As well as drawing on a large embedded memory, the host system might also use an external DDR3 or DDR4 memory for frame buffering.
- The system might implement a custom image signal-processing chain, or use off-the-shelf image-processing IP.
In addition, the object-recognition algorithm itself is essentially a complex set of mathematical operations performed in parallel. This calls for extensive digital-signal processing resources and a large number of I/Os for shifting data into and out of memory at high speed.
FPGAs are particularly well suited to this combination of requirements. Widely used in telecoms and networking equipment, FPGAs are excellent handlers of high-speed data streams. Their basic building blocks, Logic Elements (LEs), are readily configured to perform logic functions in parallel – the hallmark of the FPGA, distinguishing it from the sequential processing mode of a microcontroller or applications processor. This tends to mean that an FPGA can perform neural networking functions faster, while using less power and less hardware resource, than an MCU or applications processor.
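To put the arithmetic workload in perspective, the short sketch below counts the multiply-accumulate (MAC) operations needed by a single convolutional layer; the layer dimensions used here are hypothetical and chosen purely for illustration.

```python
# Rough estimate of the multiply-accumulate (MAC) workload of one
# convolutional layer. All dimensions below are illustrative only.
def conv_layer_macs(out_h, out_w, out_channels, in_channels, k_h, k_w):
    """Each output pixel of each output channel needs one MAC per
    input channel per kernel element."""
    return out_h * out_w * out_channels * in_channels * k_h * k_w

# Example: a 3x3 convolution producing a 112x112x64 feature map
# from a 64-channel input.
macs = conv_layer_macs(112, 112, 64, 64, 3, 3)
print(f"MACs per frame for this layer: {macs:,}")   # ~462 million

# At 30 frames per second this single layer alone needs roughly 13.9 billion
# MACs per second - work that an FPGA can spread across many DSP blocks
# operating in parallel.
print(f"MACs per second at 30 fps: {macs * 30:,}")
```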
Interestingly, for all the computational complexity of neural networks for object detection, they do not always require the massive array of LEs offered by high-end FPGAs available from Xilinx or Altera. In fact, successful implementations of camera-based AI have been made even on small FPGAs containing fewer than 10,000 LEs. Lattice Semiconductor, for instance, supplies the Himax HM01B0 UPduino shield, a modular development board for AI applications using visual and sound inputs, running on a Lattice iCE40 UltraPlus FPGA which contains just 5,300 Look-Up Tables (LUTs).
Fig. 1: the architecture of the PolarFire series of FPGAs from Microchip. (Image credit: Microchip)
For many object-recognition applications, mid-range FPGAs such as Microchip’s PolarFire series provide an ideal balance between capability, cost, size and power consumption. The features of the PolarFire MPF300T, for instance, include 300,000 LEs, 924 multiply-accumulate math blocks (18x18), and 20.6Mbits of RAM (see Figure 1). The biggest device in the PolarFire family has around 500,000 LEs, 1,480 math blocks, and 33Mbits of RAM.
The device’s features are closely aligned to the system requirements of machine vision equipment handling images at up to 4K resolution. An MPF300T can provide:
- Image sensor interfaces such as MIPI-CSI2 Rx supporting data rates up to 1.5Gbits/s/lane and SLVS-EC at 2.3Gbits/s/lane and 4K resolution
- Comprehensive support for connectivity and display interfaces
- An IP portfolio for image signal processing, HDMI 2.0, CoaXPress 6.25G, DisplayPort 1.4a, HD-SDI (HD/3G) and more.
- 512 user I/Os supporting data transfers at up to 1.6Gbits/s, plus 16 transceiver lanes operating at up to 12.7Gbits/s
- High-speed memory interfaces to DRAM technologies up to DDR4
- Ethernet interfaces operating at 1Gbits/s on general-purpose I/O with CDR function, and up to 10Gbits/s on high-speed SERDES channels
- Up to 50% lower power consumption than competing FPGAs
- Live debugging capability without the need to reconfigure the FPGA
A particular feature of PolarFire FPGAs is the way that they perform math operations in DSP blocks: a PolarFire DSP block can perform up to four 9-bit operations per clock cycle, whereas some other FPGAs typically perform only two operations per clock cycle. This means that a PolarFire device can perform the same number of math operations at half the clock frequency. The biggest PolarFire device can perform around 1.48 tera-operations per second.
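As a worked example of how a headline figure of this kind can be derived, the sketch below multiplies the number of math blocks by the operations per cycle and an assumed clock frequency. The 250MHz clock used here is an assumption chosen for illustration, not a quoted specification.

```python
# Illustrative back-of-the-envelope throughput estimate for the largest
# PolarFire device. The 250 MHz clock frequency is an assumption; actual
# achievable clock rates depend on the design.
math_blocks = 1480          # math blocks in the largest PolarFire device
ops_per_cycle = 4           # up to four 9-bit operations per block per cycle
clock_hz = 250e6            # assumed DSP clock frequency (illustrative)

tera_ops = math_blocks * ops_per_cycle * clock_hz / 1e12
print(f"Estimated peak throughput: {tera_ops:.2f} tera-operations/s")  # ~1.48
```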
Expert support for neural network compilation
The hardware configuration of PolarFire FPGAs, then, is ideally suited to embedded systems that perform object recognition. An FPGA’s implementation of object recognition is underpinned by the way the neural network is compiled for it; the network itself will normally have been trained in the cloud using a framework such as Caffe, TensorFlow or Keras.
For compilation to a PolarFire FPGA, Microchip collaborates with ASIC Design Services (ADS). The latter has developed Core Deep Learning (CDL), a scalable, flexible software framework optimized for convolutional neural networks – the type of neural network commonly used for object recognition as well as other machine-learning functions. CDL takes as its input a trained neural network from the Caffe framework and renders it as a SystemVerilog file to be implemented in the PolarFire logic fabric.
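To show the kind of small convolutional network that might be trained before being handed to such a compilation flow, the sketch below defines a toy classifier in Keras. This is purely illustrative: the CDL flow described here takes its input from the Caffe framework, and the layer sizes and class count used are hypothetical, not those of any network shipped with CDL.

```python
# A minimal, hypothetical CNN definition for illustration only.
# The CDL flow described in the article works from a trained Caffe network;
# Keras is used here simply because it is a familiar way to show the structure.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),           # RGB input frame
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(20, activation="softmax"),       # e.g. 20 object classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
# After training, the network description and weights would be exported and
# passed to a tool such as CDL for conversion to FPGA logic.
```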
The CDL framework’s functions include:
- Full pipeline from convolutional neural network description to FPGA implementation
- Network re-training for memory footprint minimization
- Support for various network layers:
  - Convolutional layers, which can implement filters of any size and type
  - Fully connected layers
  - Pooling layers, supporting arbitrary kernel sizes
  - Activation layers
- Support for padding
- AXI interface to external memory
An important advantage of CDL is the scope it gives to add constraints: the system developer can specify the features of the hardware target at which the compiled neural network is aimed. The PolarFire family, for instance, stretches from the MPF100T with 109,000 LEs to the MPF500T with 481,000 LEs. CDL then attempts to compile the trained network, subject to the specified user constraints, so that it fits in the target FPGA.
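The sketch below illustrates the kind of constraint the compiler has to satisfy, using a simple check of whether a quantized network’s weight memory fits in a candidate device’s on-chip RAM. This is not the CDL interface: the function, the 8-bit quantization and the 2-million-parameter example network are hypothetical, and the MPF100T RAM figure is an assumption (only the MPF300T and MPF500T figures are quoted above).

```python
# Hypothetical pre-check of whether a quantized network's weights fit the
# on-chip RAM of a candidate PolarFire device. Not the CDL API; it only
# illustrates the kind of resource constraint the compiler must respect.
TARGET_RAM_MBITS = {"MPF100T": 7.6, "MPF300T": 20.6, "MPF500T": 33.0}
# (MPF300T and MPF500T RAM figures are quoted in the article; the MPF100T
#  figure here is an assumption for illustration.)

def weights_fit(num_weights, bits_per_weight, target):
    """Return True if the weight memory alone fits in the target's RAM."""
    needed_mbits = num_weights * bits_per_weight / 1e6
    return needed_mbits <= TARGET_RAM_MBITS[target]

# Example: a 2-million-parameter network quantized to 8 bits per weight.
for part in TARGET_RAM_MBITS:
    print(part, "fits" if weights_fit(2_000_000, 8, part) else "does not fit")
```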
Platform for rapid development of object-recognition systems
The quickest way for developers to start experimenting with the object-recognition capability of the PolarFire family is to use the Avalanche board (part number AVMPF300TS-00) supplied by Future Electronics (see Figure 2). This is a complete object-recognition demonstration system based on an MPF300T PolarFire FPGA with 256Mbits of DDR3 memory, 64Mbits of serial Flash, a Gigabit Ethernet interface and a webcam-style image sensor.
A PC application supplied with the board prepares the video stream and encapsulates it in an Ethernet-based link. The Avalanche board itself is programmed with the TINY YOLOv2 convolutional neural network, which is pre-trained for object detection on the Pascal VOC dataset of 20 classes of common objects.
The Avalanche board returns the classification results produced by the TINY YOLOv2 algorithm to the user interface, which overlays them on the real-time video stream.
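As an illustration of this last step, the sketch below shows how detections might be drawn onto frames on the host PC. The per-detection format (class label, confidence and bounding box) is a placeholder assumption; the actual message format used by the Avalanche demonstration is not described here.

```python
# A minimal sketch of overlaying detection results on live video frames.
# The (label, confidence, (x, y, w, h)) tuple format is an assumption used
# for illustration, not the Avalanche demonstration's actual protocol.
import cv2  # OpenCV, used here only for drawing on frames

def draw_detections(frame, detections):
    """Draw each (label, confidence, (x, y, w, h)) detection on the frame."""
    for label, conf, (x, y, w, h) in detections:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {conf:.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame

# Example usage with a single hypothetical detection:
# frame = cv2.imread("frame.png")
# frame = draw_detections(frame, [("cat", 0.87, (120, 80, 200, 150))])
# cv2.imshow("Avalanche demo", frame); cv2.waitKey(0)
```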
Fig. 2: the Future Electronics Avalanche board is supplied with an example neural network capable of recognising common objects such as cows and cats. (Image credit: Future Electronics)
This MPF300T-based system can reliably recognize, in still or video images, the 20 object types in the dataset, including:
- cows
- cats
- birds
- cars
- people
- bottles
Users of the Avalanche board can start by exploring the Future Electronics demonstration and its supporting documentation, and then go on to experiment with different neural network implementations to see the variations in performance (speed, accuracy) and resource usage that result from changes in the way that the neural network is optimised in the Caffe training framework, or in the training data on which it learns.
Interested developers may apply for an Avalanche board from any branch of Future Electronics, and the company’s team of machine learning and FPGA specialists will be pleased to provide advice on starting a new object-recognition application.