FPGAs offer an early route to architectural specialization for high-performance computing and machine learning. Architectural specialization is one way to keep improving performance as the slowing of Moore's Law limits the gains available from process technology alone. The idea is to use application-specific hardware to accelerate applications, or parts of them, so that more efficient hardware does the work, whether the goal is lower power consumption or higher performance.
Given the inherent cost of building computing hardware for a single application or workflow, this strategy cannot be applied to every application. However, by grouping related problems or identifying key workloads and code that benefit from acceleration, specialization is likely to become an important part of improving application performance. Some applications are well suited to accelerators such as GPUs or FPGAs. Neither GPU acceleration nor architectural specialization is a new concept, and some experts predict that both will become increasingly common as a way to improve performance and reduce the energy costs of future systems.
The European Organization for Nuclear Research (CERN) is using Xilinx FPGAs to accelerate inference and sensor preprocessing workloads in the search for dark matter. The researchers behind the project are using FPGAs and other CERN computing resources to process massive amounts of high-energy particle physics data at very high speed in order to find clues to the origin of the universe. This requires real-time filtering of sensor data to identify new particle structures that may contain evidence of dark matter and other physical phenomena.
In today's big data era, businesses and consumers are overwhelmed by massive amounts of data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. This data comes in a variety of formats, from structured numeric data in traditional databases to unstructured text documents, email, video, audio, and financial transaction records.
Effective analysis of this data is key to generating insights and driving better decisions, and machine learning (ML) algorithms are widely used in modern data analysis. One class of ML algorithm, the deep neural network (DNN), and the convolutional variant in particular, is widely used in image classification. The current generation of DNNs, such as AlexNet and VGG, relies on dense floating-point matrix multiplication (GEMM). GEMM has regular parallelism and demands high floating-point throughput (measured in TFLOPS, tera floating-point operations per second), so it maps well onto GPUs.
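To make the regularity of GEMM concrete, here is a minimal pure-Python sketch of dense matrix multiplication. The names `gemm` and `gemm_flops` are illustrative and do not come from any library or from the Intel study. The point is only that every output element is an independent dot product, which is what allows a GPU to spread the work across many threads.

```python
def gemm(a, b):
    """Dense matrix product C = A x B for lists of lists (illustrative only)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    # Each c[i][j] depends only on row i of A and column j of B,
    # so all m*n dot products could be computed in parallel.
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_flops(m, k, n):
    """Conventional operation count: one multiply and one add per inner step."""
    return 2 * m * k * n


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = gemm(a, b)               # [[19.0, 22.0], [43.0, 50.0]]
ops = gemm_flops(2, 2, 2)    # 16
```

The operation count grows as the product of the three dimensions, which is why sustained TFLOPS is the usual yardstick for this workload.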
Although FPGAs are more energy efficient than GPUs (important in today's IoT market), their performance on DNNs has not matched that of GPUs. A series of tests conducted by Intel evaluated the two latest generations of Intel FPGAs (Arria 10 and Stratix 10) against the latest high-performance GPU (Titan X Pascal) on DNN computations. DNNs have traditionally run on GPUs because their data-parallel computations have regular parallelism and require high floating-point throughput. Each generation of GPU has added more floating-point units, more on-chip RAM, and higher memory bandwidth to deliver more floating-point operations per second.
However, computations with irregular parallelism can be challenging for GPUs because of issues such as thread divergence. In addition, because a GPU supports only a fixed set of native data types, custom data types may not be processed efficiently, which leaves hardware resources underused and performance unsatisfactory. Several trends in next-generation FPGAs may change this picture. First, next-generation FPGAs integrate more on-chip RAM. Second, technologies such as HyperFlex can significantly increase clock frequency. Third, more DSPs are available. Fourth, the integration of HBM memory increases off-chip bandwidth. Finally, next-generation FPGAs will use more advanced process technologies such as 14 nm CMOS.
Intel's Stratix 10 FPGA has more than 5,000 hardened floating-point units (DSPs), more than 28 MB of on-chip RAM (M20Ks), integrated high-bandwidth memory (up to 4 stacks at 250 GB/s each, or 1 TB/s in total), and higher clock frequencies from the new HyperFlex technology, giving a peak FP32 throughput of 9.2 TFLOPS. In addition, the FPGA development environment and toolset are constantly evolving to support higher levels of abstraction, making FPGA programming more accessible to developers.
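The headline figures above follow from simple arithmetic, sketched below. The HBM numbers (4 stacks, 250 GB/s each) come from the text. The DSP count of 5760 and the 800 MHz clock are assumptions chosen for illustration, since the text only says "more than 5000" and gives no frequency; the sketch also assumes each unit completes one fused multiply-add (2 FLOPs) per cycle. The function names are hypothetical.

```python
def aggregate_bandwidth_gb_s(stacks, gb_s_per_stack):
    """Total off-chip bandwidth when all memory stacks are used in parallel."""
    return stacks * gb_s_per_stack


def peak_flops(dsp_units, clock_hz, flops_per_unit_per_cycle=2):
    """Theoretical peak: every unit issues a multiply-add on every cycle."""
    return dsp_units * clock_hz * flops_per_unit_per_cycle


# Figures from the text: 4 HBM stacks at 250 GB/s each.
hbm_gb_s = aggregate_bandwidth_gb_s(4, 250)   # 1000 GB/s, i.e. 1 TB/s

# ASSUMPTION: 5760 DSP units at 800 MHz (not stated in the text).
peak = peak_flops(5760, 800_000_000)          # 9_216_000_000_000 FLOP/s
peak_tflops = peak / 1e12                     # about 9.2 TFLOPS
```

Under these assumed values the result lands close to the quoted 9.2 TFLOPS peak; real workloads reach only a fraction of a theoretical peak.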
Intel is currently studying various GEMM operations for next-generation DNNs. It has developed a DNN hardware acceleration template for FPGAs that provides first-class hardware support for sparse matrix algorithms and custom data types. The template is designed to support a variety of next-generation DNNs and can be customized to generate an optimized FPGA hardware instance for a given DNN variant.
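As a rough illustration of what a sparse matrix algorithm has to do, the following sketch multiplies a pruned weight matrix by a vector using the compressed sparse row (CSR) layout. This is a generic textbook technique, not Intel's template, and the helper names `to_csr` and `csr_matvec` are hypothetical. Only nonzero weights are stored and multiplied, but the inner loop length varies from row to row, which is the kind of irregularity that fixed-width parallel hardware handles poorly.

```python
def to_csr(dense):
    """Convert a dense matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)      # keep only the surviving weights
                col_idx.append(j)     # remember which column each came from
        row_ptr.append(len(values))   # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W x, touching only the nonzero entries of W."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        # Row lengths differ, so the work per row is irregular.
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


weights = [[0, 2, 0, 0],
           [1, 0, 0, 3],
           [0, 0, 0, 0]]   # 9 of 12 weights pruned to zero
x = [1, 2, 3, 4]
values, col_idx, row_ptr = to_csr(weights)
y = csr_matvec(values, col_idx, row_ptr, x)   # [4, 13, 0]
```

Here only 3 multiplications are performed instead of 12, at the cost of indirect indexing through `col_idx`.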
The template was used to run and evaluate key matrix multiplication operations of next-generation DNNs on current and next-generation FPGAs (Arria 10 and Stratix 10) and on the latest high-performance Titan X Pascal GPU. The results show that, on GEMM operations for pruned, Int6, and binarized DNNs, the Stratix 10 FPGA delivers 1.1 times, 1.5 times, and 5.4 times the performance of the Titan X Pascal GPU, respectively.
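For readers unfamiliar with binarized networks, the sketch below shows the standard trick that makes them cheap in hardware: when weights and activations are restricted to +1 and -1, a dot product reduces to an XOR followed by a population count. This is a generic illustration, not the implementation used in the study, and the names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(signs):
    """Pack +1/-1 values into an int: bit i is 1 for +1 and 0 for -1."""
    word = 0
    for i, s in enumerate(signs):
        if s not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    # XOR marks positions where the signs differ; each contributes -1,
    # and every other position contributes +1.
    mismatches = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))              # 2
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))     # 2
```

Bitwise logic of this kind maps directly onto FPGA lookup tables, which is one plausible reason the reported gap is widest for binarized networks.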
These tests also show that Arria 10 and Stratix 10 FPGAs offer favorable energy efficiency (TOP/s/W) relative to the Titan X GPU, with both devices 3 to 10 times more energy efficient than the Titan X. Although GPUs have long been the undisputed choice for DNNs, these performance comparisons between two generations of Intel FPGAs (Arria 10 and Stratix 10) and the latest Titan X GPU show that current trends in DNN algorithms favor FPGAs, and that FPGAs may even provide better performance.