Systematic methodology is the key to quickly and effectively troubleshooting control circuits.
Even the most experienced troubleshooter must rely on a systematic process to solve problems on today’s complex control circuits. At a high level, a good troubleshooting process is simple. First, investigate the symptoms. Then identify the possible causes. Next, test the system to verify or rule out those causes. After correcting the problem, follow through by monitoring the operation to make sure you’ve pinpointed the root cause and by completing any required documentation. Let’s take a closer look at each of these steps.
Investigate the symptoms. Make sure you understand the system. Pull any available documentation, whether online or hardcopy. Look for schematics and piping and instrumentation diagrams, as well as loop sheets. Talk to the operators and anyone else familiar with the operation. Look up operations and maintenance records and control and configuration parameters. Some of this information may be available from the PLC or DCS or other online databases.
Because you often won’t know where the problem lies, keep the big picture in mind. Start by breaking down even the most complex system into the following elements:
Process controller: most often involving a microprocessor.
Input field devices: sensors of some type that monitor the process.
Output field devices: drives, valves, and alarms that receive a command signal from a control element.
Connectivity elements: wires, cables, and buses.
Don’t forget one more element: the people who can affect the process and its control system.
Since the wiring and the inputs and outputs (I/Os) are the most vulnerable elements in a system, you’ll want to examine them first. As you talk to people and review information, look for a recurrence or pattern. If you see a pattern, is it related to shift changes, process changes, or any other recurring event? Use your judgment on when to quit gathering information, but make sure the data displayed by the human-machine interface (HMI) match what the operator tells you.
Understand everything the operator did in response to the problem. Walk down the system or process to make sure the field conditions match those reported by the operator HMI.
Identify possible causes. Analyze the system with an open mind, systematically eliminating components and functional elements from the overall process as unlikely trouble spots. Start by following the logic through from input to output. What happens in the cause-and-effect chain? Compare the current symptoms with the action that the specified decision logic or control algorithm should produce. As you eliminate some process elements as possible causes, you can also start building and prioritizing your list of most likely causes, keeping in mind that you’ll want to test the system to eliminate these possibilities.
You can usually eliminate simultaneous, unrelated problems as being too unlikely. If you can link a problem to one likely cause, do so. At this stage, don’t look for interrelated, multiple causes. Your first priority should be to get the operation back up and running. Tackle complex situations after a quick fix gets things going. Just don’t forget to use your company’s work procedures to highlight the open job. Operations people sometimes confuse a quick fix with a problem solution, so be very clear that your fix is temporary.
As you prioritize possible causes, go back to your sources of information. Maintenance records can help you decide that one component has been much more trouble-prone than another. For example, construction work in the area might lead you to suspect damaged cabling rather than an I/O board failure, because cabling running through the plant is more likely to suffer damage than is an I/O board inside a cabinet.
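One way to make this prioritization explicit is to score each candidate cause from the maintenance records and add weight for known exposure. The following is only a minimal sketch: the function name, the component names, and the idea of an additive "exposure bonus" are illustrative assumptions, not an established method.

```python
from collections import Counter


def rank_causes(candidates, maintenance_log, exposure_bonus=None):
    """Order candidate causes by past failure count plus an optional bonus.

    candidates: component names under suspicion (hypothetical names)
    maintenance_log: iterable of component names, one per past failure record
    exposure_bonus: optional dict of extra weight for components exposed to
        a known hazard, such as cabling near construction work (assumption)
    """
    exposure_bonus = exposure_bonus or {}
    history = Counter(maintenance_log)
    scores = {c: history[c] + exposure_bonus.get(c, 0) for c in candidates}
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


log = ["io_board", "field_cable", "field_cable", "limit_switch"]
ranked = rank_causes(
    ["io_board", "field_cable", "limit_switch"],
    log,
    exposure_bonus={"field_cable": 2},
)
print(ranked)
```

The ranking is only a starting order for testing. It does not replace judgment about which tests are quick, safe, and non-disruptive.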
Test possible causes. When you’ve narrowed your probable cause list down to a manageable size, you can begin testing. Once the process is back up and running, first do those tests that don’t interrupt operations. Quick and easy tests can save you time in eliminating potential causes, so do those early in your troubleshooting. In many cases you need to look at, listen to, or feel specific components. When working around or with energized equipment, don’t take chances with safety. In all cases, follow established and required safety procedures.
As stated before, inputs and outputs are usually the first place you should look for problems. Most inputs and outputs fall into one of two categories: discrete devices with two states (on or off), or analog devices that can send and receive continuously varying signals.
Common discrete devices include limit switches, solenoid valves, indicators, and alarms. When PLCs send signals to a master PLC or DCS, they count as discrete devices. Common analog devices include resistance temperature detectors (RTDs); thermocouples; transmitters for pressure, level, temperature, and flow; valves; analytical field devices like pH sensors; and variable speed drives.
Discrete field devices typically use low-voltage DC. A variation in these voltages usually indicates a problem. Some drift is acceptable, but anything more than 5% to 10% in either direction, at either end of the range, calls for a close look.
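The 5% to 10% rule of thumb can be turned into a quick check. The sketch below assumes a 0 V to 24 V DC signal and expresses the deviation as a percentage of that span, so the same test works at both the low and high ends of the range. The 24 V level, the span-based percentage, and the function names are assumptions for illustration.

```python
def deviation_pct_of_span(measured_v, expected_v, low_v=0.0, high_v=24.0):
    """Signed deviation from the expected level, as a percent of signal span."""
    span = high_v - low_v
    if span <= 0:
        raise ValueError("high_v must be greater than low_v")
    return (measured_v - expected_v) / span * 100.0


def classify_reading(measured_v, expected_v, watch_pct=5.0, fault_pct=10.0,
                     low_v=0.0, high_v=24.0):
    """Return 'ok', 'watch', or 'investigate' for one voltage reading."""
    dev = abs(deviation_pct_of_span(measured_v, expected_v, low_v, high_v))
    if dev > fault_pct:
        return "investigate"
    if dev > watch_pct:
        return "watch"
    return "ok"


print(classify_reading(23.5, 24.0))  # high state, small drift
print(classify_reading(22.0, 24.0))  # high state, sagging
print(classify_reading(3.0, 0.0))    # low state, not returning to zero
```

Check your own devices’ specifications for the real thresholds. The 5% and 10% defaults simply mirror the range given above.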
Use a scope to check a discrete signal. Rise and fall times that aren’t instantaneous usually indicate a fault in the sensor itself, which can typically be attributed to sticking contacts in a mechanical switch or an impending failure in a solid-state device. High signals that aren’t flat usually indicate loose ground connections, ground loops, or improper shield connections. Low signals that aren’t flat are often noisier than the high signals and usually indicate a grounding or shield problem. Noisy low signals can also indicate an improperly wired field device.
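Most scopes report rise time and ripple directly, but if you export the samples, the same two checks can be reproduced offline. This is a rough sketch under stated assumptions: the 10%-to-90% rise-time definition, the sample data, and the function names are not taken from the text above, and no pass/fail limits are implied.

```python
def first_rising_crossing(times, volts, threshold, start=1):
    """Return (time, index) where volts first rises through threshold.

    The time is linearly interpolated between the two samples that straddle
    the threshold. Returns (None, None) if the signal never crosses it.
    """
    for i in range(max(start, 1), len(volts)):
        v0, v1 = volts[i - 1], volts[i]
        if v0 < threshold <= v1:
            frac = (threshold - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1]), i
    return None, None


def rise_time_10_90(times, volts, low_v, high_v):
    """10%-to-90% rise time of the first rising edge, or None if not found."""
    span = high_v - low_v
    t10, i10 = first_rising_crossing(times, volts, low_v + 0.1 * span)
    if t10 is None:
        return None
    # Start at the same segment: a fast edge can cross both levels in it.
    t90, _ = first_rising_crossing(times, volts, low_v + 0.9 * span, start=i10)
    return None if t90 is None else t90 - t10


def plateau_ripple(volts, threshold):
    """Peak-to-peak spread of the samples at or above threshold."""
    high = [v for v in volts if v >= threshold]
    return max(high) - min(high) if high else 0.0


# Hypothetical exported samples: time in ms, signal in volts.
times_ms = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
volts = [0.0, 0.0, 12.0, 24.0, 23.4, 24.0]
rise_ms = rise_time_10_90(times_ms, volts, low_v=0.0, high_v=24.0)
ripple_v = plateau_ripple(volts, threshold=21.6)
print(rise_ms, ripple_v)
```

Compare the results against a known-good channel of the same type rather than an absolute number, since acceptable edge speed depends on the device.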
If a measurement suddenly jumps to its minimum or maximum, you’ve most likely got a sensor, wiring, or other I/O problem that should be relatively easy to find. Often the best place to check is the field termination assembly because it lets you divide the process loop in half.
More gradual changes could indicate much more complex and hard-to-pin-down problems like a change in valve stiction (static friction), a subtle change in the process materials, or a drift in instrument calibration. Your job will be much easier if you’re working with a DCS because you can pull up the history of each signal loop and look for changes over time.
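If you can export a loop’s history, a simple first pass is to separate a sudden jump to a range limit from a slow drift. The sketch below assumes a 4 mA to 20 mA signal and uses arbitrary thresholds (a single step of at least half the span for "sudden", a net change of at least 5% of span for "gradual"). All of these values and names are illustrative assumptions.

```python
def classify_change(history, low_limit=4.0, high_limit=20.0,
                    step_fraction=0.5, drift_fraction=0.05):
    """Label a signal history as 'sudden', 'gradual', or 'stable'.

    history: readings in time order, oldest first (must not be empty)
    """
    if not history:
        raise ValueError("history must contain at least one reading")
    span = high_limit - low_limit
    steps = [abs(b - a) for a, b in zip(history, history[1:])]
    largest_step = max(steps, default=0.0)
    at_limit = history[-1] <= low_limit or history[-1] >= high_limit
    # A big single step that ends pinned at a limit points toward the
    # sensor, wiring, or I/O rather than the process itself.
    if at_limit and largest_step >= step_fraction * span:
        return "sudden"
    if abs(history[-1] - history[0]) >= drift_fraction * span:
        return "gradual"
    return "stable"


print(classify_change([12.0, 12.1, 12.0, 4.0]))
print(classify_change([12.0, 12.4, 12.8, 13.2, 13.6]))
print(classify_change([12.0, 12.1, 12.0, 12.1]))
```

A "gradual" result only says the signal has moved. Telling stiction from calibration drift or a process change still takes the loop history and field checks.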
Sometimes the only way to test a circuit is to see how the system reacts to a manual input. When working with PLCs you’re forcing the contacts. When working with continuous process control loops, you’re bumping the system. If you can’t manually force the system to respond to your input, you probably have a problem with the outputs. If the outputs respond properly to manual inputs, you can probably eliminate outputs and look more closely at the field transmitters, proximity switches, and other related input devices.
Be careful testing the process this way. Forcing contacts, adjusting timers and counters, changing set points, and tinkering with loop tuning parameters or the control program is risky business that can have disastrous results. Coordinate closely with the process operator. Be sure you know what limits the process can tolerate so you don’t destabilize the system or crash it.
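The narrowing logic of the manual-input test can be written down as a small decision helper. This is a sketch only: the function name, the output tags, and the result format are hypothetical, and it records what you observed rather than forcing anything itself.

```python
def narrow_after_manual_test(responses):
    """Suggest where to look next after a manual-input (force or bump) test.

    responses: dict mapping an output name to True if that output reacted
        correctly to the manual input, False if it did not
    """
    failed = sorted(name for name, ok in responses.items() if not ok)
    if failed:
        # Outputs that ignore a manual command are the first suspects.
        return {"suspect": "outputs", "check": failed}
    # Outputs behave, so attention shifts to transmitters, proximity
    # switches, and other input devices.
    return {"suspect": "inputs", "check": []}


# Hypothetical output tags recorded during a coordinated test.
result = narrow_after_manual_test({"valve_FV101": True, "alarm_horn": False})
print(result)
```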
Follow through. Follow through with careful replacement of faulty parts, a period of monitoring the operation, and documentation of what you did according to your plant’s requirements. If your action was a quick fix to get equipment up and running, follow your plant’s root-cause analysis procedure to get to the bottom of the problem.
All the sophisticated equipment and software in the world is useless if the troubleshooters who use it fail to follow a systematic process and make full use of the tools available. Take the time to understand what you’re doing. Don’t be afraid to ask for training if you need it. Then conduct your troubleshooting methodically all the way through to root cause and you’ll have the respect of management and your peers.