NVIDIA GeForce RTX 5090 and RTX 6000 PRO Affected by Virtualization Bug
NVIDIA’s flagship consumer GPU, the GeForce RTX 5090, and the high-end RTX 6000 PRO from the ProViz lineup are affected by a significant virtualization bug. The issue was brought to light by developers at CloudRift, a company specializing in GPU cloud solutions for AI development. According to their findings, both the RTX 5090 and RTX 6000 PRO can become completely unresponsive after several days or weeks of continuous operation in virtualized environments.
The problem manifests unpredictably, with the affected GPUs freezing at random intervals and failing to recover. Notably, CloudRift’s team has tested a range of other GPUs—including the NVIDIA H100, B200, and the previous-generation RTX 4090—none of which exhibited similar issues. Even the top-tier server-grade B200 GPU from the Blackwell family remains unaffected, suggesting the bug is specific to the latest consumer and ProViz models.
Technical Details: PCIe FLR and GPU Lockup
The root of the issue appears to lie in how these GPUs handle virtualization, particularly when they are passed through to virtual machines using KVM and VFIO. During the normal shutdown or migration of a virtual machine, the host system performs a PCIe function-level reset (FLR) to clean up the device state. Instead of returning to a ready state, however, the RTX 5090 and RTX 6000 PRO stop responding. The Linux kernel waits for the device and eventually times out with the error message: “not ready 65535ms after FLR; giving up.” That behavior suggests a hardware-level problem with the affected GPUs.
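For operators who want to check whether a host has hit this failure, the kernel message quoted above can be searched for in the kernel log. The following is a minimal Python sketch, not an official diagnostic tool. It assumes the message is preceded by the device's PCI address in the usual domain:bus:device.function form; the function name find_flr_timeouts and the sample log line are hypothetical and exist only for illustration.

```python
import re

# Assumption: the kernel prints the PCI address (e.g. 0000:01:00.0) followed by
# ": " right before the message quoted in the article. Only the message text
# itself comes from the source; the surrounding layout is assumed.
FLR_TIMEOUT = re.compile(
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"not ready (?P<ms>\d+)ms after FLR; giving up"
)


def find_flr_timeouts(log_text):
    """Return a list of (pci_address, waited_ms) for each FLR timeout found."""
    return [
        (match.group("bdf"), int(match.group("ms")))
        for match in FLR_TIMEOUT.finditer(log_text)
    ]


if __name__ == "__main__":
    # Hypothetical log excerpt; replace with real kernel log text.
    sample = (
        "[123456.789] vfio-pci 0000:01:00.0: "
        "not ready 65535ms after FLR; giving up\n"
    )
    for address, waited_ms in find_flr_timeouts(sample):
        print(f"{address}: no response {waited_ms} ms after FLR")
```

The function takes plain text, so it can be fed saved kernel log output, for instance the text printed by dmesg. A match only shows that a reset timed out on that device; it does not by itself identify the cause.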
CloudRift has taken the unusual step of offering a $1,000 bug bounty to anyone who can provide a solution, underlining the severity and complexity of the issue. Reports from the Level1Techs forums indicate that this is not an isolated incident, with multiple users encountering similar problems in their own virtualization setups.
Current Workarounds and NVIDIA’s Response
NVIDIA has acknowledged the virtualization bug and is actively investigating the matter. As a temporary mitigation, the company recommends installing a specific Proxmox kernel version using the command apt install proxmox-kernel-6.14.8-2-bpo12-pve/stable. However, this workaround does not fully resolve the underlying issue, and virtual machine environments remain susceptible to GPU lockups.
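After installing the package and rebooting, it can be useful to confirm that the host actually booted into the recommended kernel. The sketch below is one way to do that in Python. It assumes the running kernel's release string equals the package name with the proxmox-kernel- prefix removed, which is a guess about Proxmox naming rather than something stated in the reports; the helper names are hypothetical.

```python
import platform

# Package name taken from the apt command in the article. The "/stable" suffix
# in that command selects the apt target release and is not part of the name.
WORKAROUND_PACKAGE = "proxmox-kernel-6.14.8-2-bpo12-pve"
PACKAGE_PREFIX = "proxmox-kernel-"


def expected_release(package=WORKAROUND_PACKAGE):
    """Derive the kernel release string assumed to match the package name."""
    if not package.startswith(PACKAGE_PREFIX):
        raise ValueError(f"unexpected package name: {package}")
    # Assumption: release string == package name minus the prefix.
    return package[len(PACKAGE_PREFIX):]


def is_workaround_kernel(running_release=None):
    """Compare the running (or supplied) kernel release with the expected one."""
    if running_release is None:
        running_release = platform.release()
    return running_release == expected_release()


if __name__ == "__main__":
    print("running kernel:", platform.release())
    print("expected kernel:", expected_release())
    print("workaround kernel active:", is_workaround_kernel())
```

A matching kernel only confirms that the mitigation is in place. As noted above, it does not rule out further lockups.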
The broader virtualization and AI development communities are now awaiting an official fix, which may arrive as a driver update, a Linux kernel patch, or potentially both. Until then, users of the GeForce RTX 5090 and RTX 6000 PRO should be aware of the risks associated with deploying these GPUs in virtualized environments.