Achieving High Availability: The Impact of Redundancy to the Rack

Corded equipment often gets left out of high-reliability planning.

To build highly available electrical infrastructures, you need redundant components like power distribution units (PDUs), UPSs, and generators. Yet many installers fail to consider redundancy near the point of use, especially with the single-corded equipment so prevalent in today’s common rack-mounted configurations.

Rack redundancy should be part of any power quality strategy because it facilitates the maintenance and repair that are essential to power quality and uptime at the point of use.

At the rack level—home to servers, mass storage assemblies, routers, and hubs—redundancy allows for routine maintenance and quick repair. Its absence means electrical systems are often neglected until they fail, resulting in power corruption or downtime.

Approaches to rack redundancy. A typical backup for rack-mounted equipment is a single rack-mount UPS. You usually see this configuration in commercial office networks, small data centers, and wiring closets. In data centers with hundreds of racks, distributed 3-phase power from a central UPS is a more common configuration because it’s often easier to manage a small number of large UPSs than it is to manage a large number of small UPSs. Yet neither configuration offers power redundancy in the system. Given the high quality of today’s UPS systems, either configuration will work well in many situations, but they present a single point of failure that may be unacceptable for maximizing uptime in mission-critical situations.

Fig. 1 (right) demonstrates power distribution in a mission-critical data center, such as a credit-card processing facility. Here, the UPS systems are redundant. The PDU is equipped with a static transfer switch that will transfer power from the primary to the secondary. However, the static transfer switch, downstream subpanel, equipment power cord, and all terminations are single points of failure. This makes maintenance difficult, because you can’t take these components offline without dropping the load. As with UPS systems, the issue is not component quality or reliability, which are typically quite high, but the existence of a single point of failure.

The configuration in Fig. 2 (right) addresses this limitation by pushing redundancy toward the load, using an extra PDU and subpanel. Notice the rack-mounted transfer switch, or point-of-use (POU) switch. It allows you to conduct maintenance upstream of the POU switch without taking down the load. However, the POU switch and the equipment power supply remain single points of failure.

The configuration in Fig. 3 (right) shows full redundancy to the load, achieved with dual-corded equipment. This scenario is identical to that of Fig. 2 except the POU switch has been replaced with dual-corded equipment. You even have a redundant power strip.

Availability analysis approach. Does bringing redundancy closer to the load work? To find out, we conducted three availability analyses, one for each of the configurations described above. We used a method called “linear combinatorial analysis,” which takes defined reliability data and builds a system model that represents the configuration under analysis. The data for this analysis came primarily from the IEEE Gold Book (Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems) and a military handbook: MIL-HDBK-217, Reliability Prediction of Electronic Equipment (See Sidebar on PQ3).

In this analysis, we modeled all key components, including terminations, circuit breakers, UPS systems, the static transfer switch (STS) and point-of-use (POU) switch, and power distribution units (PDUs). As with any availability analysis, you must make certain assumptions to create a valid model. Our assumptions were:

All components exhibit a constant failure rate.
The failure rate of the wiring is low and is not modeled.
Downtime caused by human error is not accounted for here.
Silicon-controlled rectifier (SCR) controls are 100% reliable.

These assumptions let the analysis focus on the power system itself. Most downtime in data centers is due to human error, which we cannot control. We have taken human error out of the model so we can examine the power system design, which we can control. This improves the accuracy, relevance, and usefulness of the analysis.
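To make the constant-failure-rate assumption concrete, here is a minimal Python sketch of the standard textbook relationships it implies. The mean time to repair (MTTR) parameter and all function names are assumptions introduced for illustration; they are not taken from the analysis or its data sources.

```python
import math

def failure_rate(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) implied by an MTBF."""
    return 1.0 / mtbf_hours

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of running mission_hours without a failure.

    With a constant failure rate, time to failure is exponentially
    distributed, so R(t) = exp(-t / MTBF).
    """
    return math.exp(-failure_rate(mtbf_hours) * mission_hours)

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the component is up.

    mttr_hours is a hypothetical repair time, not a value from the article.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For example, a component with a hypothetical MTBF of 999 hours and an MTTR of 1 hour would be available 99.9% of the time.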

Results. We measured availability at the outlets supplying power to the critical load, and the components common to the three configurations share the same mean time between failures (MTBF) data. For a dropped load to occur in the case of a single-corded load using an STS, both UPSs or the transformers would have to fail at the same time, or any component downstream of the transformer would have to fail.

For a dropped load to occur in the case of the single-corded load that uses a POU switch, at least one component from each path upstream of the switch would have to fail simultaneously. The Achilles' heel of this configuration: a failure of the POU switch would also result in a dropped load.

In the case of a dual-corded load, at least one component from each path upstream of the load would have to fail simultaneously for a dropped load to occur. Without the POU switch to act as a single point of failure, the configuration offers higher reliability. In all cases, redundant feeds from the UPS do not share the same bus. This is critical to ensuring fault isolation. The Table at right summarizes the results of the three availability calculations.
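The failure logic described for the three configurations can be sketched as series and parallel combinations of component availabilities. The Python below is a simplified illustration, not the model used in the analysis: the component availabilities are hypothetical placeholders (not values from the Table), the topologies are reduced to a few elements each, and failures are assumed to be independent, which is why the redundant feeds must not share a bus.

```python
from math import prod

def series(*availabilities: float) -> float:
    """Every element must work, so the availabilities multiply."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """At least one element must work: 1 minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)

# Hypothetical component availabilities, chosen only for illustration.
UPS = 0.9999
PDU = 0.99999
SUBPANEL = 0.99999
STS = 0.99999
POU_SWITCH = 0.999995
POWER_STRIP = 0.99999

def single_corded_sts() -> float:
    """Redundant UPSs, then a single chain of STS, PDU, subpanel, and strip."""
    return series(parallel(UPS, UPS), STS, PDU, SUBPANEL, POWER_STRIP)

def single_corded_pou() -> float:
    """Two independent upstream paths joined by one POU switch at the rack."""
    path = series(UPS, PDU, SUBPANEL)
    return series(parallel(path, path), POU_SWITCH)

def dual_corded() -> float:
    """Two independent paths all the way to the load, with no shared element."""
    path = series(UPS, PDU, SUBPANEL, POWER_STRIP)
    return parallel(path, path)
```

With these placeholder numbers the ordering matches the reasoning in the text: the STS configuration is the least available, the POU configuration is better, and the dual-corded configuration is best. The equipment power supply is left out of the sketch for brevity.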

This analysis demonstrates that if you don’t use dual-corded equipment in a critical application, you should at least use a POU switch with single-corded loads. One of the principles for increasing the availability to critical loads is to bring redundancy as close to the load as possible. You can see this by comparing the availability of the single-corded load using an STS to that of the single-corded load using a POU switch, which represents a difference of nearly 1 hr of downtime per year.
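Availability figures are easier to compare when converted to expected downtime per year. The following Python sketch shows that conversion; the function names are hypothetical, and any availability values passed in are up to the reader, since the Table's figures are not reproduced here.

```python
HOURS_PER_YEAR = 8760.0  # 365 days x 24 hours

def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0 to 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR

def downtime_saved_hours(lower: float, higher: float) -> float:
    """Downtime avoided per year by moving from the lower to the higher availability."""
    return annual_downtime_hours(lower) - annual_downtime_hours(higher)
```

As a point of reference, an availability of 0.9999 corresponds to roughly 0.88 hours of downtime per year, so a difference of nearly 1 hr per year reflects a change in availability on the order of one part in ten thousand.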

In the end. In general, equipment with one power cord can be a liability when trying to develop a high-availability network or IT infrastructure. This is true of all mission-critical equipment. To maximize uptime, you must remove as many single points of failure in the power distribution system as you can.

Most servers and routers are available with dual cords, but many low-end hubs or PCs have only one. The decision to implement POU switches or larger static transfer switches for single-corded loads depends on the network architecture. For example, a large data center that uses POU switches has many small transfer switches to manage, a potentially challenging task. However, a failure in a network employing one larger transfer switch can bring down a large portion of equipment; a failure in a smaller switch will only bring down one rack. Also consider the time required to repair or replace these switches and the cost of keeping spares. Larger switches pose more challenges in this regard.

Ultimately, architectures with complete system redundancy will provide the highest levels of availability, be resilient to component failure, and allow for concurrent maintenance. Anything less results in an availability compromise and an increased threat of downtime.

Avelar is an availability engineer with APC’s Availability Science Center, West Kingston, R.I.