Why Thermal and Power Quality Monitoring Are Critical for CDU Troubleshooting

by Jason Axelson, Subject Matter Expert at Fluke, Power Quality

In today's high-density server environments, artificial intelligence (AI) workloads and other processing-intensive computing applications generate heat faster than airflow alone can dissipate. Data centers are increasingly adopting liquid cooling for more effective thermal management.

Continuous thermal imaging and power quality monitoring function as an early warning system against cyberattacks.

Liquid coolant circulated directly to the heat source absorbs heat produced by high-density racks and transfers it away from sensitive electronics. Coolant distribution units (CDUs) are the primary systems for managing this process. CDUs maintain temperatures and liquid pressure within optimal ranges. This ensures efficient heat removal, prevents equipment from overheating, and maintains stable operating conditions for critical systems.

When CDUs fail, data centers experience hot spots, thermal throttling, and hardware shutdowns. Given that uptime is a top priority, CDU performance is crucial. It protects critical workloads, helps meet service level agreements (SLAs), and prevents costly downtime.

Guidelines from the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) and frameworks from the International Organization for Standardization (ISO) establish standards and best practices for monitoring and maintaining cooling systems to ensure performance and reliability. However, despite their importance to overall system efficiency, CDUs are often overlooked in data center infrastructure management (DCIM) dashboards. When monitoring is minimal or absent, maintenance often becomes reactive, providing operators with little warning before a problem escalates to a system shutdown.

Thermal imaging and power quality monitoring quickly pinpoint cooling inefficiencies and electrical issues that threaten CDU performance. This enables data center teams to address problems before they affect system uptime.

Thermal Imaging

Thermal imaging allows for non-contact inspections while a system operates under normal conditions. It provides insights into developing problems, offering early warning signs for issues such as uneven cooling, blocked coils, and fluid imbalances. These issues can lead to overheating and SLA penalties. This critical insight enables teams to act proactively, long before overheating or a system shutdown occurs. It also supports compliance and minimizes downtime, because teams do not have to wait for a DCIM system to detect changing conditions and raise an alert.

Thermal imaging also verifies thermal output, providing actual results instead of relying on system-reported values. It helps identify airflow performance degradation over time and serves as an essential preventive maintenance tool in Tier II/III facilities with mixed infrastructure. For example, a Fluke Ti480 Pro Thermal Imager™ can identify one underperforming CDU in a row of CDUs. This provides teams with time to intervene before the issue impacts the server.
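One way to make the row comparison concrete is to check a temperature taken from each CDU in a thermal image against the median for the row. The following is a minimal sketch under stated assumptions, not a Fluke workflow: the function name, the CDU labels, the sample temperatures, and the 5 °C tolerance are all hypothetical and would need to come from site baselines.

```python
from statistics import median

def flag_outlier_cdus(temps_c, tolerance_c=5.0):
    """Return labels of CDUs whose temperature differs from the
    row median by more than tolerance_c (hypothetical threshold)."""
    row_median = median(temps_c.values())
    return sorted(
        name for name, temp in temps_c.items()
        if abs(temp - row_median) > tolerance_c
    )

# Hypothetical surface temperatures (deg C) read from one thermal image
readings = {"CDU-1": 41.8, "CDU-2": 42.3, "CDU-3": 49.6, "CDU-4": 42.0}
print(flag_outlier_cdus(readings))  # prints ['CDU-3']
```

The median is used rather than the mean so that a single hot unit does not shift the reference point it is compared against.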

Power Quality Monitoring

CDUs rely on stable power to drive pumps that circulate coolant through the system and to power the controls, sensors, and processors that manage liquid flow. If the power supply fluctuates, drops, or fails, the pumps may stop circulating coolant, and the control system may malfunction.

Power quality monitoring helps identify four key problems before they cause downtime:

  • Voltage Sags (short drops in voltage) can cause pumps to slow down or stall. This reduces coolant flow and causes temperatures to rise. Sags may also cause control electronics to reset, glitch, or lose data, resulting in erratic operation. Align power quality analyzer settings with the commissioning requirements of the CDU manufacturer.
  • Harmonics (distorted electrical waveforms) can cause motors to run hotter and less efficiently. This increases wear and impacts cooling performance. Harmonics can also overload neutral conductors in three-phase systems, which raises energy costs and risks power supply failures. Poor power factor, often caused by non-linear loads, further stresses CDU motors and reduces efficiency.
  • Transients (brief spikes or dips in voltage) can cause pump motor damage or insulation breakdown. They may also damage microprocessors, corrupt firmware, or permanently damage control boards.
  • Single-Phasing (loss of a phase) is readily detected by a power quality analyzer. A phase loss can prevent a stopped pump motor from starting or cause a running pump motor to overheat.
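Some of the conditions in the list above can be expressed as simple checks on exported measurement data. The sketch below assumes an analyzer export of per-phase RMS voltages and harmonic magnitudes. The function names, the sample values, and the thresholds (90% of nominal for a sag, 10% for a lost phase) are illustrative assumptions. Actual limits should come from the applicable standard and the CDU manufacturer's commissioning requirements. Total harmonic distortion (THD) is computed with the usual definition: the root-sum-square of the harmonic magnitudes divided by the fundamental.

```python
import math

def classify_rms(v_rms, v_nominal, sag_pu=0.90, loss_pu=0.10):
    """Classify one RMS voltage reading relative to nominal.
    Thresholds are illustrative assumptions, not standard limits."""
    ratio = v_rms / v_nominal
    if ratio < loss_pu:
        return "interruption or phase loss"
    if ratio < sag_pu:
        return "sag"
    return "normal"

def thd_percent(fundamental, harmonics):
    """THD as a percentage of the fundamental magnitude."""
    return 100.0 * math.sqrt(sum(h * h for h in harmonics)) / fundamental

# Hypothetical per-phase RMS readings on a 480 V system
phases = {"L1": 479.0, "L2": 402.0, "L3": 12.0}
for phase, volts in phases.items():
    print(phase, classify_rms(volts, v_nominal=480.0))

# Hypothetical harmonic magnitudes (same units as the fundamental)
print(round(thd_percent(100.0, [3.0, 4.0]), 1))  # prints 5.0
```

Transients are omitted here because they are typically far shorter than one cycle and require waveform capture at a high sampling rate rather than RMS values.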

Achieving Compliance with Global Standards Using Thermal Imaging and Power Quality Monitoring

As critical components for maintaining the correct environments for data center operations, CDUs are covered by many certification standards. These include Tier certification standards, ASHRAE TC 9.9, and the EU Code of Conduct for Data Centers. Preventive diagnostics, such as those performed with a Fluke thermal imager and a Fluke power quality analyzer, help ensure compliance with these codes. They achieve this by identifying thermal and power issues before these issues lead to localized overheating, equipment stress, or cascading thermal failures.

Additionally, data collected during preventive diagnostics provides traceable proof of compliance with EN 50160, IEEE 1159, and IEC 61000-4-30. This documentation is critical during audits, insurance claims, or disputes with power suppliers.

Finally, continuous thermal imaging and power quality monitoring provide another layer of protection from cyberattacks. They function as an early warning system. CDUs often connect with DCIMs, which makes them Internet of Things (IoT) vectors and introduces the possibility of cyberattacks. If a malicious actor gains access to a CDU control system and alters pump speeds, coolant flow, or temperature setpoints, any resulting changes in heat distribution or electrical load may become visible before alarms trigger. Thermal imaging can reveal unexpected hot spots, uneven cooling patterns, or rapid temperature fluctuations that deviate from established baselines. Concurrently, power quality monitoring can detect anomalies such as unusual load profiles, voltage sags or swells, and harmonic distortion caused by irregular control behavior.

Together, these monitoring methods help distinguish between normal equipment faults and potentially malicious interference. This provides operators with an opportunity to respond before an attack causes widespread disruption.
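Comparing a new reading against an established baseline, as described above, can be sketched as a simple deviation test. This is an illustrative assumption rather than a method from any Fluke product: the function name, the sample values, and the three-sigma limit are hypothetical. A flagged reading indicates only that behavior has changed. Deciding whether the cause is a fault or interference still requires investigation.

```python
from statistics import mean, stdev

def deviates_from_baseline(baseline, reading, max_sigma=3.0):
    """Return True if reading lies more than max_sigma sample standard
    deviations from the baseline mean (hypothetical limit)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change counts as a deviation
        return reading != mu
    return abs(reading - mu) / sigma > max_sigma

# Hypothetical baseline temperatures (deg C) for one CDU under normal load
baseline_temps = [42.0, 42.4, 41.8, 42.1, 42.2, 41.9]
print(deviates_from_baseline(baseline_temps, 42.3))  # prints False
print(deviates_from_baseline(baseline_temps, 47.5))  # prints True
```

The same test could be applied to power quality values such as load current or harmonic distortion, provided a baseline is recorded for each.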

About the Author

Jason is a subject matter expert at Fluke specializing in power quality, electrical test equipment, and product applications. With deep experience supporting both customers and distribution partners, he helps professionals select, operate, and troubleshoot a wide range of diagnostic tools — including power quality analyzers, battery testers, acoustic imagers, and thermal imagers. Jason regularly leads application-based training sessions, using his hands-on knowledge to bridge the gap between technical challenges and practical solutions across industries. Connect with Jason on LinkedIn.
