IT Operations February 21, 2026 · 6 min read

Data Center Modernization for AI Workloads: What Organizations Need to Know in 2026

Organizations are increasingly moving AI workloads into production and discovering that their existing data center infrastructure wasn't designed for them. The gap between what a traditional enterprise data center was built to do and what AI at scale demands is wider than most infrastructure teams expect, and it usually becomes apparent only in the middle of a deployment that has stalled.

The planning failure is predictable. AI workload requirements aren't obvious from a distance. They look like compute requirements until an organization gets specific — and then the power density numbers, the cooling constraints, and the network throughput requirements make clear that traditional infrastructure assumptions don't transfer.

Why AI Workloads Are Different

Traditional enterprise compute is CPU-centric, operates at moderate power densities, and follows workload patterns that are relatively predictable — batch jobs, application servers, database queries. Infrastructure built for these workloads is designed around assumptions that held true for decades.

AI workloads break most of those assumptions. GPU-dependent computation draws power at densities that are an order of magnitude above what traditional CPU infrastructure requires. A single rack densely packed with GPU servers can draw 40–80kW or more. A comparable rack of traditional compute might draw 5–8kW. The difference isn't incremental — it requires a fundamentally different approach to power delivery, thermal management, and physical infrastructure.
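
The gap is straightforward to sketch with arithmetic. The server counts and per-server draws below are illustrative assumptions, not vendor specifications:

```python
# Illustrative comparison of rack power density: GPU vs. traditional compute.
# All figures are assumptions chosen for the arithmetic, not vendor specs.

def rack_power_kw(servers_per_rack: int, watts_per_server: float) -> float:
    """Total rack draw in kW for a uniform population of servers."""
    return servers_per_rack * watts_per_server / 1000

# Assumed GPU rack: 4 eight-GPU training servers at ~10.5 kW each under load.
gpu_rack = rack_power_kw(servers_per_rack=4, watts_per_server=10_500)

# Assumed traditional rack: 20 dual-socket 1U servers at ~350 W each.
cpu_rack = rack_power_kw(servers_per_rack=20, watts_per_server=350)

print(f"GPU rack: {gpu_rack:.0f} kW, traditional rack: {cpu_rack:.0f} kW "
      f"(~{gpu_rack / cpu_rack:.0f}x)")
# -> GPU rack: 42 kW, traditional rack: 7 kW (~6x)
```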

Uptime Institute research on power density trends has documented that average data center power density per rack has been roughly doubling every four to five years, driven almost entirely by AI adoption. Organizations that haven't revisited their infrastructure assumptions in the past three years are working from a baseline that no longer reflects the requirements they're about to face.

Network throughput requirements compound the challenge. AI training workloads require high-bandwidth connections between compute nodes and between compute and storage — connections that 1Gbps or 10Gbps switching architectures cannot support at the scale required for serious training runs. Inference workloads have different but equally demanding requirements, particularly for latency-sensitive applications.

The Power and Cooling Challenge

Cooling is the constraint most organizations encounter first. Traditional air cooling — the raised floor and hot aisle/cold aisle containment architecture that most enterprise data centers use — becomes inadequate above approximately 15–20kW per rack. Above that threshold, air simply can't remove heat fast enough to keep GPU hardware within its operating temperature range.
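
The physics behind that threshold can be sketched with the common sea-level rule of thumb CFM ≈ 3.16 × watts / ΔT(°F). The rack powers and the 25°F server delta-T below are assumptions, but the trend is the point: the airflow a single rack needs quickly becomes impractical to deliver.

```python
# Approximate airflow needed to remove a rack's heat load with air cooling,
# using the common sea-level rule of thumb CFM ~= 3.16 * watts / delta_T_F.
# The 25 F delta-T across the servers and the rack powers are assumptions.

def required_cfm(rack_watts: float, delta_t_f: float = 25.0) -> float:
    """Cubic feet per minute of airflow to carry away rack_watts of heat."""
    return 3.16 * rack_watts / delta_t_f

for kw in (8, 20, 40, 80):
    print(f"{kw:>3} kW rack -> ~{required_cfm(kw * 1000):,.0f} CFM")
# ->   8 kW rack -> ~1,011 CFM
#     20 kW rack -> ~2,528 CFM
#     40 kW rack -> ~5,056 CFM
#     80 kW rack -> ~10,112 CFM
```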

Liquid cooling is the answer, but the term covers a range of approaches with different cost, complexity, and disruption profiles. Direct liquid cooling routes coolant directly to heat sources on the hardware — CPU and GPU dies — and is increasingly supported by server manufacturers as a standard configuration option. Immersion cooling submerges hardware in dielectric fluid and is more effective at extreme densities but requires specialized tanks, fluid management systems, and compatible hardware.

NVIDIA's data center design guidance for GPU deployments addresses these requirements explicitly. Organizations that plan GPU deployments without working through the thermal and power delivery implications typically discover the constraint mid-deployment — after hardware has been purchased and racked, and after the operational timeline has been committed to stakeholders.

Power delivery infrastructure — PDUs, UPS systems, generators — often requires parallel evaluation. A facility designed with power distribution rated for 5kW racks doesn't automatically support 40kW racks. Upgrading power delivery is expensive and time-consuming, and in many co-location environments, the available power envelope is a fixed constraint that can't be expanded without renegotiating the facility agreement.

Network Architecture for AI

AI workloads require low-latency, high-bandwidth connections between compute nodes. For distributed training — where a model is trained across multiple GPU servers simultaneously — the network becomes a performance-critical component, not just a connectivity layer. Bottlenecks in inter-node communication directly constrain training throughput and extend training runs in ways that aren't recoverable by adding more compute.

InfiniBand is the standard interconnect for high-performance AI training environments. 100GbE and 400GbE Ethernet are increasingly common alternatives for organizations that prefer Ethernet-based architectures. Traditional 10Gbps switching, which is adequate for most enterprise application workloads, is insufficient for serious AI training infrastructure.
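
A simplified back-of-the-envelope calculation shows why. The sketch below assumes data-parallel training with a ring all-reduce, which moves roughly twice the gradient payload per node per step, and ignores the overlap of communication with compute; the model size and gradient precision are illustrative assumptions:

```python
# Rough per-step gradient synchronization time for data-parallel training.
# Assumes a ring all-reduce (~2x the gradient payload per node per step) and
# ignores compute/communication overlap, protocol overhead, and topology.
# Model size and gradient precision are illustrative assumptions.

def allreduce_seconds(params_billion: float, bytes_per_param: int,
                      link_gbps: float) -> float:
    payload_bytes = params_billion * 1e9 * bytes_per_param
    link_bytes_per_second = link_gbps * 1e9 / 8
    return 2 * payload_bytes / link_bytes_per_second

for gbps in (10, 100, 400):
    t = allreduce_seconds(params_billion=7, bytes_per_param=2, link_gbps=gbps)
    print(f"{gbps:>3} Gbps links -> ~{t:.1f} s of communication per step")
# ->  10 Gbps links -> ~22.4 s of communication per step
#    100 Gbps links -> ~2.2 s of communication per step
#    400 Gbps links -> ~0.6 s of communication per step
```

Tens of seconds of synchronization per step makes 10Gbps links a hard ceiling on training throughput, no matter how many GPUs are added.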

Edge AI deployments have a different profile. Inference at the edge — running a model locally rather than calling a centralized API — prioritizes low-latency response over raw throughput. The infrastructure requirements are more modest in terms of power and cooling, but security considerations for endpoint hardware are more complex, particularly in environments with physical access risks or sensitive data at the point of collection.

The Hybrid Approach Most Organizations Should Consider

Not every organization needs to build an on-premises GPU cluster. For many, the more practical path is a hybrid model that matches workload characteristics to the most appropriate infrastructure.

  • Cloud GPU instances (AWS, Azure, Google Cloud) are well-suited for development, experimentation, and production workloads with unpredictable or variable demand. The cost is higher per compute-hour than on-premises at sustained utilization, but the absence of capital commitment and the ability to scale down to zero make cloud the right choice for workloads that aren't running continuously.
  • On-premises GPU infrastructure makes economic and operational sense for high-volume production inference workloads, for organizations with data sovereignty requirements that prohibit processing certain data in cloud environments, and for regulated data that can't leave a controlled environment. At sustained high utilization, the total cost of ownership for on-premises typically becomes favorable within 18–36 months (a rough break-even sketch follows this list).
  • Co-location offers a middle path: the organization owns the hardware and controls the environment, but the data center infrastructure (power, cooling, physical security, connectivity) is provided by the facility. This reduces the capital requirement for infrastructure upgrades while maintaining hardware ownership and data control.
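
That 18–36 month figure depends heavily on utilization, hardware pricing, and facility costs. A minimal break-even sketch, with every number a placeholder assumption to be replaced by real quotes, looks like this:

```python
# Minimal cloud-vs-on-prem break-even sketch for a GPU workload.
# Every figure is a placeholder assumption; substitute quotes from your cloud
# provider, hardware vendor, and facility before drawing any conclusion.

cloud_rate_per_gpu_hour = 3.00      # assumed on-demand price, USD
gpus = 8
utilization = 0.80                  # fraction of hours the GPUs are busy

onprem_capital = 300_000            # assumed servers + networking, USD
onprem_monthly_opex = 4_000         # assumed power, cooling, space, support, USD

hours_per_month = 730
cloud_monthly = cloud_rate_per_gpu_hour * gpus * hours_per_month * utilization

for month in range(1, 61):
    cloud_total = cloud_monthly * month
    onprem_total = onprem_capital + onprem_monthly_opex * month
    if cloud_total >= onprem_total:
        print(f"Break-even around month {month} "
              f"(cloud ~${cloud_total:,.0f} vs on-prem ~${onprem_total:,.0f})")
        break
else:
    print("No break-even within five years at these assumptions")
```

At these assumed numbers the crossover lands around month 30; drop utilization to 40% and it moves out past eight years, which is the core of the cloud-for-variable, on-premises-for-sustained split described above.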

The most expensive mistake in AI infrastructure planning is committing to a deployment model before understanding the workload requirements. Organizations that start with the use case — what they're running, at what volume, with what latency requirements, and with what data handling constraints — and then select infrastructure to match, consistently achieve better outcomes than organizations that start with a technology commitment and work backward.

What to Evaluate Before Committing to AI Infrastructure

Before making infrastructure commitments for AI workloads, organizations should work through a structured evaluation across five dimensions (a minimal gap-check sketch follows the list):

  • Power capacity. Current available power per rack vs. required power density for target GPU hardware. This number must be confirmed with the facility — not estimated from general specifications.
  • Cooling architecture. Maximum thermal load the current cooling infrastructure supports, and whether liquid cooling is feasible in the facility without major structural modification.
  • Network throughput. Available bandwidth from compute to storage and between nodes, and whether current switching infrastructure supports the interconnect speeds required for the target workload.
  • Security architecture. GPU infrastructure introduces specific security considerations: model intellectual property protection, data exfiltration risks during training runs, and physical security for hardware in environments where the hardware itself has significant value.
  • Total cost of ownership. Capital cost, operational cost, and the cost of any facility upgrades required, compared against cloud alternatives at equivalent utilization levels and timeframes.
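
The first three dimensions reduce to a simple required-versus-available comparison. The sketch below is illustrative only; the field names and figures are assumptions, and the real values must come from the facility and the hardware vendor, as noted above.

```python
# Simple required-vs-available gap check for the physical dimensions above.
# Field names and example figures are illustrative assumptions; confirm real
# values with the facility and the hardware vendor before committing.

requirements = {            # target AI workload
    "power_kw_per_rack": 42,
    "cooling_kw_per_rack": 42,
    "network_gbps_per_node": 200,
}
facility = {                # current environment, confirmed with the facility
    "power_kw_per_rack": 8,
    "cooling_kw_per_rack": 15,
    "network_gbps_per_node": 10,
}

for dimension, required in requirements.items():
    available = facility[dimension]
    status = "ok" if available >= required else f"gap: {required - available} short"
    print(f"{dimension:<22} required={required:<4} available={available:<4} {status}")
```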

Organizations that work through this evaluation before making commitments avoid the most common and expensive failure mode in AI infrastructure deployment: discovering a blocking constraint after the timeline and budget have been set.

The Ascend Infrastructure assessment evaluates your current environment against the requirements of your target AI workloads — identifying gaps in power capacity, cooling architecture, network throughput, and security controls before you commit. DOYB's Data Center services provide design and implementation support for organizations building AI-capable infrastructure.

Sources:

[1] Uptime Institute — Research and Reports on Power Density Trends — https://uptimeinstitute.com/resources/research-and-reports

[2] NVIDIA Data Center — GPU Infrastructure Design Guidance — https://www.nvidia.com/en-us/data-center/
