The driver now intelligently merges adjacent kernels on the fly, reducing global memory round-trips. In tests with popular transformer architectures, this slashed latency by nearly 27% without any code changes.
Speaking with a senior AI infrastructure engineer at a major cloud provider (who requested anonymity due to NDA), we learned that the R555 driver series was internally delayed by four months due to a "catastrophic" bug involving Multi-Instance GPU (MIG) partitioning. cuda driver release news exclusive