Since data is crucial to most organizations' operations, it must be preserved, protected and accessible at all times. Periodic inspections of data centers are very important to ensuring the reliability, continuity and sustainability of the systems they house. In fact, such inspections are often mandated by user-founded organizations such as the Uptime Institute and/or by insurance carriers, who do not want to pay damages for lost data due to failed equipment.
One important tool for performing data center inspections is the thermal imager, also known as an infrared (IR) camera. The following step-by-step account describes how to use a thermal imager to inspect data center systems from the electrical source – a transformer or substation – to the server racks and everything in between, including the critical heating, ventilation and air-conditioning (HVAC) system.
Why thermal imaging?
A thermal imager displays and can store two-dimensional images of an object's surface temperatures. Using an imager, you can easily detect anomalies in the temperatures of electrical or mechanical components – items that are hotter or colder than similar objects in the same environment. Overheating components usually indicate a potential problem that requires maintenance before failure occurs. In data centers, where cooling is important to keep servers from overheating, uncharacteristically cool surfaces might also indicate a problem, perhaps an imbalance in the HVAC system that requires correcting.
In addition to easily detecting comparative temperatures of equipment surfaces, thermal cameras can also record actual surface temperatures. This helps detect situations such as an overheating transformer or motor, allowing for repair or replacement before failure.
When thermal images reveal potential problems, capture them on the imager and upload them to a computer that runs software for reporting and analysis. By regularly monitoring equipment and keeping a thermal "track record" on your computer for long-term comparison, you can better detect abnormal readings and changes in the trend. To ensure the consistency required for side-by-side comparison, follow a pre-established sampling route and scan the same objects or areas each time from the same vantage points. Along with repair records, thermal trending information provides a documented data trail for insurance carriers, management, and any others who require confirmation of a reliable operation.
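As a rough illustration of thermal trending, the following Python sketch compares the latest reading at each inspection point with the average of that point's earlier readings. It is not a feature of any imager or reporting software: the inspection point names, the temperatures and the 10 °F alert limit are all assumptions made up for the example.

```python
from statistics import mean

# Hypothetical trend log: inspection point -> surface temperatures (°F),
# oldest first, one reading per inspection from the same vantage point.
# All names and values are invented for illustration.
trend_log = {
    "main switchboard, bus connection A": [98.0, 99.5, 98.7, 112.4],
    "PDU 3, breaker terminal 12": [88.1, 87.6, 88.9, 88.3],
}

# Assumed alert limit; pick a value that suits your own equipment.
RISE_LIMIT_F = 10.0


def flag_rising_points(log, limit_f=RISE_LIMIT_F):
    """Return points whose latest reading exceeds the mean of the
    earlier readings by more than limit_f, with the size of the rise."""
    flagged = {}
    for point, readings in log.items():
        if len(readings) < 2:
            continue  # no history to compare against yet
        baseline = mean(readings[:-1])
        rise = readings[-1] - baseline
        if rise > limit_f:
            flagged[point] = round(rise, 1)
    return flagged


flagged_points = flag_rising_points(trend_log)
print(flagged_points)
```

A comparison like this is only meaningful if every reading for a point was taken on the same route, from the same vantage point and under a similar load.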
What to scan
In a data center, the components are like a series of dominoes. If one fails, it takes everything downstream with it. It makes sense to "begin at the beginning," at what the National Electrical Code calls "the source" – typically a transformer, perhaps a substation. For a meaningful inspection session, the system must be operating and should be pulling as large an electrical load as possible. More current running through the wires produces more heat energy, and that's what an infrared camera "sees."
- Transformers are usually owned by the electric utility, although sometimes they are the property of the data center's owner. On transformers, check the secondary windings and coils. Look at terminations and lugs (bolted connections) "inside of the box." Look for thermal anomalies, i.e., differences in temperature – ΔTs – of similar components. Also, look for physical damage and debris that might interfere with the operation of the transformer, and scan for load imbalance. The latter is signaled by a ΔT between circuit phases.
- Many data centers have an alternate source of power for redundancy. This second source could be another utility transformer on a different grid or a standby generator. Alternate power sources must also be scanned and inspected while they are in use and under load.
- Standby generators should be inspected while they are powered up with everything downstream running off them. Here, too, check lugs and terminations and look for damage and debris. To detect problems with cooling or exhaust systems, you'll need to record actual temperatures rather than observing ΔTs.
- When a transfer switch is functioning correctly, it senses where the power is coming from (main or standby) and switches to that source. Don't overlook that switch during your inspection, because if it fails, it won't matter how good maintenance procedures are downstream. With current running through the transfer switch, scan it and look for heating that might signal loose connections (e.g., insufficient torque or compression on a lug or termination).
- The main switchboard is a large enclosure with many switches. The cabinet houses various components including busbars, bolted connections and fuse clips. Look for thermal anomalies in connections (including bus connections), terminations, fuses and fuse clips. Also look for imbalance, damage, and debris.
- A UPS (uninterruptible power supply) is usually immediately downstream of the switchboard. When inspecting a UPS, scan the input connections, the terminals, and the inverter section, where there are small fuses and capacitors. Under load, use your thermal imager to check the battery section. Look at terminal posts, casings and feeders. A bad cell heats up very quickly under load. Immediately after the load scan, scan the batteries again with the load removed. Bad cells cool very quickly when the load is removed. Finally, check the on-board transformer (if present).
- Power distribution units (PDUs) are downstream of the UPS and are typically located close to the servers, to which they distribute power. Normally, a PDU will have a circuit breaker panel and sometimes a transformer. In scanning PDUs, look at lugs and terminals, including circuit breaker terminals. Visually check for damage and debris, and if a PDU is not a straight-through-voltage model, scan the on-board transformer.
- Server racks are becoming increasingly compact, opening up space for more servers in existing data centers, but they are also increasing demand on the centers' power and cooling capabilities. In fact, the heat generated by today's blade servers has some experienced thermographers reporting that they no longer spend much time scanning server racks. The high heat makes it difficult to compare temperatures. Still, the thermal imager is useful for monitoring power strips and power supplies built into the racks as well as wiring connections, plugs and plug strips. Look for overheating due to loose connections and loose or bent plugs. A thermal scan can also detect broken cords and broken conductors in wires. To detect the latter condition, look for what is called "the barber pole effect," in which you can observe the thermal differences of the twisted strands.
You should also monitor the areas where built-in fans draw air into server racks and expel heat from them. Both a thermal imager and a temperature/airflow meter are useful for monitoring air cooling effectiveness. Generally speaking, you can 1) map cooling patterns into, out of and around server racks and 2) confirm whether cooling is adequate. Such monitoring identifies where to install perforated panels to improve circulation or blanking plates to keep hot air from entering empty slots on unfilled racks. These strategies help many data center users keep their servers cool enough to maintain their server warranties.
- HVAC systems are essential in data centers because of the amount of heat generated by servers, especially the latest generation of blade servers. A data center's AC is typically provided by either a split system or a chilled-water system, which ideally will maintain the temperature in the center between 65 °F and 72 °F. Many servers are designed to shut down automatically when their temperature exceeds 75 °F or 76 °F.
Scan your AC system's fuses, terminations, lugs, and crimped or bolted connections. Also check mechanical components for overheating that signals misalignment (in drives), unbalance (in fans) or degradation (in motors and bearings). An infrared image will also reveal a refrigerant leak if it is blowing against the cabinet.
Split systems and chilled-water systems with cooling towers have outside components as well as inside components. For example, a split system's evaporator coil is typically inside the building while the condensing unit is outside. Check the evaporator coil for icing, but be aware that there's no point in checking the AC system inside if you are not going to go outside. There are usually fuses and terminations (lugs) outside, and, if there's a cooling tower, there are motors. Use your thermal imager to check flow and find leaks in the towers.
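Several of the checks in the list above come down to comparing similar components under the same load, for instance the ΔT between circuit phases that signals a load imbalance or a loose connection. The Python sketch below shows that comparison for three phase lugs. The spot temperatures and the 15 °F follow-up limit are assumptions chosen for illustration and are not taken from any standard or from the imager's software.

```python
# Hypothetical spot temperatures (°F) read from one thermal image of the
# three phase lugs on a single piece of equipment. Values are invented.
phase_temps_f = {"A": 104.2, "B": 105.0, "C": 131.6}

# Assumed follow-up limit, for illustration only.
DELTA_T_LIMIT_F = 15.0


def phase_delta_t(temps):
    """Return (hottest phase, ΔT between the hottest and coolest phase)."""
    hottest = max(temps, key=temps.get)
    delta_t = temps[hottest] - min(temps.values())
    return hottest, delta_t


hottest_phase, delta_t = phase_delta_t(phase_temps_f)
needs_follow_up = delta_t > DELTA_T_LIMIT_F
print(f"Hottest phase: {hottest_phase}, delta T: {delta_t:.1f} °F, "
      f"follow up: {needs_follow_up}")
```

A ΔT above the chosen limit is a prompt for closer inspection, not a diagnosis. Deciding what is actually wrong still depends on someone who knows the electrical system.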
Regarding training, Fluke recommends two to three days of training for users of higher-end cameras. Operating the hardware is not the difficult part. Making good diagnoses is the challenging aspect of thermal imaging. The keys to success are collecting good, reliable, repeatable data and then reviewing that data through the eyes of someone knowledgeable about electrical systems. This strategy will result in good judgments about what, if anything, is wrong and how to correct it. Sound judgments about thermal scans of a data center require good training, technical knowledge, and practical field experience.
¹Most of the information in this Application Note is based on an interview with Paul Twite, a thermographer with 24-7 Power, located in Edina, Minnesota; Phone: 952-944-8900; Fax: 952-746-1958; Toll Free: 1-866-269-1767.
²For a detailed discussion of emissivity, read "Emissivity: Understanding the difference between apparent and actual infrared temperatures," by L. Terry Clausing, P.E., ASNT Certified NDT Level III T/IR. The Application Note is available for downloading at Fluke's Digital Library Fulfillment Center accessible from www.fluke.com.