With the widespread application of AI, deep learning has become the mainstream method of AI research and application. Faced with the parallel operation of massive data, AI's requirements for computing power continue to increase, and higher requirements are placed on hardware computing speed and power consumption.
At present, in addition to general-purpose CPUs, some chip processors such as GPUs, NPUs, and FPGAs that are hardware-accelerated play their respective advantages in different applications of deep learning, but are they good or bad?
Taking face recognition as an example, the distribution of computing power required to process the basic process and corresponding functional modules is as follows:
Why is there such an application distinction?
What's the point?
In order to know the answer, we need to understand the CPU, GPU, NPU, FPGA from their respective principles, architecture and performance characteristics.
First, let's take a look at the architecture of a general-purpose CPU.
The CPU (Central Processing Unit) is a very large-scale integrated circuit. The main logical architecture includes the control unit Control, the arithmetic unit ALU and the cache, and the data, control, and Status bus (Bus).
It is the calculation unit, the control unit and the storage unit.
The architecture diagram is as follows:
The CPU follows the Von Neumann architecture, and its core is to store programs and execute them sequentially. The CPU architecture requires a large amount of space for the storage unit (Cache) and control unit (Control). In contrast, the calculation unit (ALU) occupies only a small part, so it is extremely affected by large-scale parallel computing capabilities Limits, and is better at logical control.
The CPU cannot do a lot of matrix data parallel computing, but the GPU can.
The GPU (Graphics Processing Unit), which is a graphics processor, is a large-scale parallel computing architecture composed of a large number of computing units. It is designed to process multiple tasks simultaneously.
Why GPU can achieve the ability of parallel computing? GPU also includes basic computing unit, control unit and storage unit, but the architecture of GPU is different from that of CPU, as shown in the figure below:
Compared with the CPU, less than 20% of the CPU chip space is ALU, and more than 80% of the GPU chip space is ALU. That is, the GPU has more ALUs for data parallel processing.
The neural network models AlexNet, VGG-16, and Restnet152 built on Darknet perform predictions on ImageNet classification tasks on GPU Titan X, CPU Intel i7-4790K (4 GHz):
It can be seen that GPUs are much more efficient at processing neural network data than CPUs.
Summary GPU has the following characteristics:
1. Multi-threading, which provides the basic structure of multi-core parallel computing, and the number of cores is very large, which can support parallel computing of large amounts of data.
2. It has higher fetching speed.
3. Higher floating-point arithmetic capabilities.
Therefore, the GPU is more suitable than the CPU for a large amount of training data, a large number of matrices, and a convolution operation in deep learning.
Although the GPU shows its advantages in parallel computing capabilities, it cannot work alone. It requires the cooperative processing of the CPU. The construction of the neural network model and the transfer of data flow are still performed on the CPU. At the same time, there are problems of high power consumption and large volume.
Higher-performance GPUs have larger volumes, higher power consumption, and higher prices, and will not be usable for some small devices and mobile devices.
Therefore, a special-purpose chip NPU with small size, low power consumption, high computing performance, and high computing efficiency was born.
NPU (Neural Networks Process Units) neural network processing units. The NPU works by simulating human neurons and synapses at the circuit layer, and directly processing large-scale neurons and synapses with a deep learning instruction set. One instruction completes the processing of a group of neurons. Compared with the CPU and GPU, the NPU integrates storage and computing through synaptic weights, thereby improving operating efficiency.
The NPU is constructed by imitating biological neural networks. The CPU and GPU processors need to process neurons with thousands of instructions. The NPU can complete only one or a few of them. Therefore, it has obvious advantages in the processing efficiency of deep learning.
The experimental results show that the performance of the NPU is 118 times that of the GPU at the same power consumption.
Like the GPU, the NPU also needs the cooperative processing of the CPU to complete specific tasks. Below, we can look at how the GPU and NPU work together with the CPU.
The GPU is currently only a simple parallel matrix multiplication and addition operation. The construction of the neural network model and the transfer of data flow are still performed on the CPU.
The CPU loads the weight data, constructs a neural network model according to the code, and transmits the matrix operations of each layer to the GPU through class library interfaces such as CUDA or OpenCL to achieve parallel computing and output the results; the CPU then schedules the calculation of the matrix data of the lower neuron groups until the The output layer of the network is calculated and the final result is obtained.
CPU and GPU interaction process:
1 Get GPU information and configure GPU id
2 Load neuron parameters to the GPU
3 GPU-accelerated neural network computing
4 receive GPU calculation results
NPU is different from GPU acceleration, which is mainly reflected in that the calculation results of neurons in each layer do not need to be output to main memory, but are passed to the neurons in the lower layer to continue calculation according to the connection of the neural network. Promotion.
The CPU passes the compiled neural network model file and weight file to a dedicated chip for loading and completes hardware programming.
During the entire operation process of the CPU, the CPU mainly implements the loading of data and the control of business processes. The interaction process is as follows:
1Open NPU dedicated chip device
2 Pass in the model file to get the model task
3 Get task input and output information
4 Copy input data into model memory
5 Run the model and get the output data
In addition to the NPU, FPGAs also have a fight on power and computing power.
Field-programmable gate array (FPGA) is called field-programmable gate array, and users can repeat programming according to their own needs. Compared with CPU and GPU, it has the characteristics of high performance, low power consumption, and hardware programming.
The basic principle of FPGA is to integrate a large number of digital circuit basic gate circuits and memories in the chip, and the user can define the connection between these gate circuits and memories by burning the FPGA configuration file. This burn-in is not a one-time operation. You can repeatedly write definitions and repeat configurations.
The internal structure of the FPGA is shown in the following figure:
The FPGA's Programable Logic Blocks contain many functional units, which are composed of LUT (Look-up Table) and triggers. FPGA implements the user's algorithm directly through these gate circuits, without the translation of the instruction system, the execution efficiency is higher.
We can compare
Features of CPU / GPU / NPU / FPGA
70% of the transistors are used to build the cache, and there are also some control units with fewer computing units, which are suitable for logic control operations.
Most of the transistors constitute computing units with low computational complexity and are suitable for large-scale parallel computing. Mainly used in big data, background server, image processing.
Simulate neurons at the circuit level, and integrate storage and calculation through synaptic weights. One instruction completes the processing of a group of neurons, improving operation efficiency. Mainly used in the field of communications, big data, image processing.
Programmable logic, high computing efficiency, closer to the underlying IO, logic editable through redundant transistors and wiring. Essentially it has no instructions, no shared memory, and higher computing efficiency than CPU and GPU. Mainly used in smart phones, portable mobile devices, and automobiles.
As the most common part, the CPU cooperates with other processors to complete different tasks. The GPU is suitable for large-scale data training and matrix convolution operations in back-end servers in deep learning. NPU and FPGA have great advantages in terms of performance, area, and power consumption, which can better accelerate neural network calculations. The FPGA is characterized by the use of a hardware description language, and the development threshold is relatively high compared to the GPU and NPU.
It can be said that each processor has its advantages and disadvantages. In different application scenarios, it is necessary to weigh the advantages and disadvantages according to the needs and select the appropriate chip.