Single-chip microcomputers are now used in a large number of embedded systems, and their range of applications continues to expand. For many years, however, designers have been troubled by the reliability of single-chip microcomputer systems, and in control systems that demand high reliability this is often the main factor limiting their use.
The reliability of a single-chip microcomputer system results from the combined effects of the system itself (its software and hardware) and its working environment, so reliability should be analyzed and designed from both aspects. For the system itself, the key to reliability is whether it can carry out its functions correctly during operation while effectively suppressing interference signals, including those that arrive directly from outside the system. Defective systems often guarantee their functions only at the logical level; they give too little consideration to problems that may arise during operation and take inadequate countermeasures, so that when interference actually strikes, the system fails. The reliability of any system is also relative: a system that works well in one environment may be unstable in another, which shows how much the environment matters to reliable operation. While the system is designed for its operating environment, measures should also be taken to improve that environment and reduce interference from it, although such measures are often limited in effect.
There are many methods and measures for improving the reliability of single-chip microcomputer systems. In general, the measures should be chosen according to the specific reliability problems the system faces and the factors that cause them. They serve two purposes: first, to reduce the external factors that make the system unreliable; second, to improve the system's own immunity to interference and reduce its instability. Filtering, isolation, and shielding, which are used to suppress power-supply noise and environmental interference, serve the first purpose; watchdog circuits, software anti-interference techniques, and backup techniques serve the second. Measures of the first type are the more commonly used. They are simple to apply and work well, but the improvement they bring is limited and in many cases does not meet the system's requirements. Measures of the second type can raise reliability further and are widely used in high-reliability designs. The following sections analyze some issues that arise in using the second type.
2.1 Improving system reliability with watchdog timer technology
Watchdog technology is now widely used and relatively mature, and it is well supported. Almost every processor manufacturer now produces single-chip microcomputers with built-in watchdog timers, and many standalone watchdog timer chips are also available. Because such a circuit is easy to implement, its implementation details are not discussed here; only the reentrancy problem that arises from using it is analyzed. With a watchdog timer in place, once the program runs away the watchdog resets the system, which restarts from the beginning and thus leaves the abnormal operating state. This, however, requires the system to be reentrant. Reentrancy can be defined as follows: when a microprocessor system is reset and restarted, its external operations do not change because of the restart, or change only within tolerable limits, so that the continuity and ordering of the system's external operations, and hence its ultimate safety and reliability, are preserved.
If a system's external control operations depend only on its current inputs, the system is almost completely reentrant. If, on the contrary, its external outputs depend not only on the current inputs but also on the system's historical state, and that state is not retained or is destroyed when the system re-enters, then its external operations after the restart may be completely wrong. The watchdog timer takes such a system out of the abnormal running state, but the state it re-enters is not normal either, so the system is defective and cannot be used.
Therefore, for a system that uses a watchdog circuit to improve reliability, the reentrancy of the system must be strictly guaranteed.
For a system whose outputs depend on historical state, reentrancy can be ensured by saving that state in RAM: a buffer dedicated to the historical state is set up in the microcontroller's internal memory or in its external expansion memory. As long as the system does not lose power, the saved data can be reused when the system re-enters. If the power supply cannot be guaranteed, a backup battery should also be considered to keep the RAM data safe and stable. For systems that are not very time-critical, EEPROM or Flash ROM can also be used to save the historical data.
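The history buffer described above can be sketched in C as follows. This is only an illustration under stated assumptions: the record fields, the marker value `HISTORY_MAGIC`, and the additive checksum are hypothetical and not taken from the original design. On a real target the buffer would also have to be placed in a RAM region that the startup code does not clear, which is compiler-specific and not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical history record: placeholders for whatever state the
   external outputs depend on. The fields are chosen so that the struct
   has no padding bytes. */
typedef struct {
    uint16_t setpoint;
    uint8_t  mode;
    uint8_t  step;
} history_t;

typedef struct {
    uint16_t  magic;     /* marks the buffer as completely written */
    history_t data;
    uint16_t  checksum;  /* inverted byte sum over data */
} history_buf_t;

#define HISTORY_MAGIC 0xA55AU  /* assumed marker value */

uint16_t history_checksum(const history_t *h)
{
    const uint8_t *p = (const uint8_t *)h;
    uint16_t sum = 0;
    size_t i;

    for (i = 0; i < sizeof *h; i++)
        sum = (uint16_t)(sum + p[i]);
    return (uint16_t)~sum;
}

/* Save the current state. The marker is cleared first so that a reset
   in the middle of an update leaves a buffer that fails validation. */
void history_save(history_buf_t *buf, const history_t *h)
{
    assert(buf != NULL && h != NULL);
    buf->magic = 0;
    memcpy(&buf->data, h, sizeof *h);
    buf->checksum = history_checksum(&buf->data);
    buf->magic = HISTORY_MAGIC;
}

/* Called after a reset. Returns 1 and copies the saved state to *out
   if the buffer is valid; returns 0 if defaults must be used instead. */
int history_restore(const history_buf_t *buf, history_t *out)
{
    assert(buf != NULL && out != NULL);
    if (buf->magic != HISTORY_MAGIC)
        return 0;
    if (buf->checksum != history_checksum(&buf->data))
        return 0;
    memcpy(out, &buf->data, sizeof *out);
    return 1;
}
```

After a watchdog reset, the startup code would call `history_restore` before driving any output and fall back to safe defaults when it returns 0, for example after a cold power-up or when the saved data has been corrupted.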
2.2 Software anti-interference technology
A system may malfunction because of various disturbances and unstable factors, and some countermeasures can be taken in program design. Traditional software filtering and software redundancy, which are often used to suppress interference signals, are typical examples. Design experience shows that software locks and program traps are also useful; these are aimed mainly at program runaway. When interference makes a program run away, the program counter may end up in one of two areas: another address within the program area, or a blind area of the program space, that is, a region that holds no valid program instructions. The first case can be handled with software locks. To protect external operations, each relatively independent program block verifies a preset password before or during its execution, and its operations take effect only if the password matches. The correct password is set by the calling (upper-level) program only when the block is entered through the normal transfer path; otherwise the check fails, and the program is forced to an error-handling routine that restores normal operation.
When the program executes in its normal sequence, every program block passes its check and runs correctly. Now suppose that interference causes a runaway in which execution jumps from within block SUB-PRO1 directly to block SUB-PRO3. The password check in SUB-PRO3 then fails, and control is transferred to the error-handling routine, which prevents an erroneous operation.
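A software lock of this kind might look like the following C sketch. The key values, the variable names, and the counters that stand in for real external operations are all assumptions made for illustration; the original article does not give its implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical passwords, one per program block. The value 0 is
   reserved to mean "locked". */
#define KEY_SUB_PRO1 0x3CU
#define KEY_SUB_PRO3 0xC3U

volatile uint8_t lock_key;   /* written by the caller just before a block is entered */
unsigned error_count;        /* number of failed password checks */
unsigned operations_done;    /* stands in for real external operations */

/* Error handling: lock everything and record the event. A real system
   would also bring its outputs to a safe state here. */
void error_handler(void)
{
    lock_key = 0;
    error_count++;
}

/* Returns 1 if the block was entered through the normal path. The key
   is consumed so that it cannot unlock a second entry. */
int block_unlock(uint8_t expected)
{
    assert(expected != 0);
    if (lock_key != expected) {
        error_handler();
        return 0;
    }
    lock_key = 0;
    return 1;
}

void sub_pro1(void)
{
    if (!block_unlock(KEY_SUB_PRO1))
        return;
    operations_done++;
}

void sub_pro3(void)
{
    if (!block_unlock(KEY_SUB_PRO3))
        return;
    operations_done++;
}

/* Normal transfer path: the upper-level program sets each password
   immediately before calling the corresponding block. */
void run_sequence(void)
{
    lock_key = KEY_SUB_PRO1;
    sub_pro1();
    lock_key = KEY_SUB_PRO3;
    sub_pro3();
}
```

Calling `sub_pro3()` directly, without the caller having set `KEY_SUB_PRO3`, models the runaway case: the check fails, no operation is performed, and `error_count` is incremented.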
Program traps are designed mainly to catch a program that has run away into a blind area of the program space. ROM space outside the program code is normally left empty: when the program is written into ROM, this unused space is left as all 1s or all 0s, so a program that jumps into it behaves unpredictably. Program traps can be used to capture such jumps. For example, suppose the system has 32 KB of program space and the compiled program occupies 18 KB, leaving 14 KB unused. A trap program, consisting of several NOP instructions followed by an LJMP instruction to a recovery entry point, can be placed in this area.
Trap segments of this kind are repeated to cover all of the remaining program space. The number of NOP instructions in each segment affects both the capture success rate and the capture time: the more NOP instructions there are, the higher the success rate, but the longer the capture takes and the longer the program stays out of control; with fewer NOP instructions the opposite holds. This is because a capture is certain only when the program lands on a NOP instruction or on the first byte of the LJMP instruction; if it lands on either of the last two bytes of the LJMP instruction, the result of execution is unpredictable. If the captured program jumps back to the beginning of the program, the reentrancy of the program must also be considered.
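The trade-off can be made concrete with a short C sketch that builds the fill pattern for an unused ROM region, as a build tool might do, and computes the share of landing addresses that are captured with certainty. The opcodes are the standard 8051 encodings (NOP is 0x00, LJMP is 0x02 followed by a 16-bit address, high byte first); the function names, the choice of five NOP instructions, and the target address 0x0030 used in the note below are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OP_NOP  0x00U  /* 8051 NOP */
#define OP_LJMP 0x02U  /* 8051 LJMP addr16, address high byte first */

/* Fill rom[0..len) with trap segments of `nops` NOPs followed by
   LJMP target. Leftover bytes are placed at the start as extra NOPs,
   so the region always ends with a complete LJMP. Returns the number
   of complete segments written. */
size_t trap_fill(uint8_t *rom, size_t len, unsigned nops, uint16_t target)
{
    size_t seg = (size_t)nops + 3U;
    size_t pos = 0;
    size_t count = 0;
    size_t lead = len % seg;
    unsigned i;

    assert(rom != NULL);
    while (pos < lead)
        rom[pos++] = OP_NOP;
    while (pos < len) {
        for (i = 0; i < nops; i++)
            rom[pos++] = OP_NOP;
        rom[pos++] = OP_LJMP;
        rom[pos++] = (uint8_t)(target >> 8);
        rom[pos++] = (uint8_t)(target & 0xFFU);
        count++;
    }
    return count;
}

/* Share of landing addresses, in parts per thousand, on which capture
   is certain: the NOPs and the first byte of LJMP, i.e. nops + 1 out of
   every nops + 3 bytes. */
unsigned trap_capture_permille(unsigned nops)
{
    return (1000U * (nops + 1U)) / (nops + 3U);
}
```

With five NOP instructions per segment, each segment is 8 bytes long, so the 14 KB region of the example holds 1792 segments, and a runaway that lands at a uniformly random address in the region is captured with certainty in 6 of every 8 cases. Raising the NOP count increases that share but lengthens the worst-case time before the LJMP is reached.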
2.3 Using a backup system to improve reliability
Backup systems are widely used in many important control systems, although mostly in industrial control computers or larger systems. Depending on the situation, a backup system may be an online backup system or a standby backup system. In an online backup system both CPUs are working. They may have equal status, or one may be the master CPU and the other the slave CPU. With equal status, the two CPUs jointly determine the system's external operations, and a fault in either CPU causes external operations to be prohibited. With one master and one slave, the master CPU usually implements the system's control logic and the slave CPU monitors the working state of the master. When the slave detects that the master is operating abnormally, it restores the master by forcibly resetting it. To ensure that the slave itself is working normally, its state is in turn monitored by the master, which can likewise take measures to restore the slave when it becomes abnormal; the two CPUs thus monitor each other. In a specific design, the master and slave CPUs can exchange monitoring information in many ways, for example through shared memory (such as common information stored in dual-port RAM) or through handshake signals.
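One side of such mutual monitoring over a handshake line can be sketched in C as below. The scheme assumed here, in which each CPU toggles an output line periodically and the other CPU counts timer ticks since the last toggle, is only one possible arrangement; the structure, names, and timeout are hypothetical and not taken from the original design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State kept by one CPU about its peer's heartbeat line. */
typedef struct {
    uint8_t  last_level;     /* last sampled level of the peer's line */
    uint16_t silent_ticks;   /* ticks since the line last toggled */
    uint16_t timeout_ticks;  /* assumed limit before the peer is reset */
    uint16_t resets_issued;  /* how many times the peer has been reset */
} peer_monitor_t;

void monitor_init(peer_monitor_t *m, uint8_t level, uint16_t timeout_ticks)
{
    assert(m != NULL && timeout_ticks > 0);
    m->last_level = level;
    m->silent_ticks = 0;
    m->timeout_ticks = timeout_ticks;
    m->resets_issued = 0;
}

/* Call once per timer tick with the current level of the peer's line.
   Returns 1 when the peer has been silent for too long and should be
   reset; the caller then pulses the peer's reset pin (not shown). */
int monitor_tick(peer_monitor_t *m, uint8_t level)
{
    assert(m != NULL);
    if (level != m->last_level) {
        /* The peer toggled its line, so it is still running. */
        m->last_level = level;
        m->silent_ticks = 0;
        return 0;
    }
    m->silent_ticks++;
    if (m->silent_ticks >= m->timeout_ticks) {
        m->silent_ticks = 0;
        m->resets_issued++;
        return 1;
    }
    return 0;
}
```

Both CPUs would run the same routine against each other's line, which gives the mutual monitoring described above. The timeout has to be longer than the peer's slowest legitimate toggle interval, or a healthy peer would be reset.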
In a specific system design, satisfactory stability and reliability can usually be achieved only by combining several measures. Different systems have different control objects and operating environments, so the main interference problems they face differ, and so do the appropriate countermeasures. What they have in common is that relying on a single measure to make the system reliable is unrealistic; several measures must be combined and directed at the main problems.
A design example is given below to further illustrate some common methods for improving system reliability design.
In a satellite communication system, the front-end low-noise amplifier (LNA) must operate at a constant temperature (40 °C) in order to reduce the system's phase noise, whereas the ambient temperature at the amplifier's field location ranges from -40 to +60 °C. The amplifier must therefore be housed in a dedicated thermostatic enclosure that can both heat and cool: resistance heating is used for heating, and a semiconductor cooling chip for cooling. To prevent a controller failure from letting the enclosure temperature run out of control, which could damage the low-noise amplifier and disrupt the whole system, the enclosure controller is built mainly around a master-slave dual-CPU system.
The master CPU measures the temperatures of the heater, of the cooling plate, and inside and outside the enclosure, and carries out the main control tasks. It is an AT89S52 single-chip microcomputer, which includes a watchdog timer. A MAX707 is added externally as a power-monitoring circuit: besides providing a reliable reset signal to the master CPU, it can raise a power-fail interrupt request so that field data are saved in time when power is lost. The heating rod is powered from 220 V AC, and the cooling chip from a regulated 15 V DC supply. To keep the high-voltage, high-current circuits from interfering with the low-voltage circuitry, the control signals generated by the master CPU reach the drive circuits through optical isolation.
The slave CPU is an AT89C2051, which is mainly responsible for monitoring the working state of the master CPU and the power supply voltage. When a power failure occurs, the voltage comparator in the AT89C2051 detects the change, and the slave CPU, running from the backup battery, reports the failure to the monitoring station through the RS-485 port.
The master and slave CPUs monitor each other. They exchange handshake signals over the I/O port lines that connect them, check each other's working state, and take corresponding measures to keep the system's external operations safe. With these measures in place the system has run stably and reliably since it was put into operation, with no unexplained crashes or loss of control, which indicates that the design is successful. Past experience suggests that a system of this kind designed without these combined measures usually develops problems after 1 to 2 weeks of continuous operation.
This paper has analyzed the causes of failure in single-chip microcomputer systems, discussed measures for improving system reliability, and proposed a comprehensive design approach that combines them. Its successful application in the thermostat controller for a low-noise amplifier shows that the approach is effective in ensuring system reliability.