Is it worth it to use the official NVidia PyTorch image?

TL/DR

  • Significant Performance Boost: Using the official NVidia PyTorch Docker image resulted in a 50% increase in TFLOPS for a specific matrix multiplication task compared to a native PyTorch installation.
  • Essential for DGX Spark: If you’re running workloads on a DGX Spark (or similar NVidia hardware), the official image appears to be a must-have for maximizing performance.
  • Custom Optimizations: The performance gain likely stems from NVidia’s custom PyTorch fork and optimized PTX code within their official image.
  • Code: the repo with the experiment code is linked at the end of the post.
  • Important note: This is a simple check using matrix multiplication, just to verify whether there is any difference - and there is. I will try to run a proper PyTorch benchmark in the next post.

Recently, I stumbled upon a fascinating code snippet by @awnihannun that piqued my interest. It prompted me to conduct a quick experiment on our DGX Spark to see if NVidia’s official PyTorch Docker image truly offers a performance advantage over a standard PyTorch installation.

The official NVidia PyTorch images, available on NGC, are often touted for their optimized performance. But how much of a difference do they actually make in a real-world scenario?

The Experiment Setup

To test this, I adapted Awni Hannun’s code to perform a specific, compute-intensive operation: 8192x8192 bfloat16 square matrix multiplication, repeated 50 times per iteration, for 100 repetitions. The goal was to measure the average TFLOPS achieved in both environments.
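
Before diving into the code, it helps to account for the amount of work involved: a standard d x d matrix multiplication costs roughly 2*d^3 floating-point operations, and the script divides the total by 1024^4 ("binary" tera) rather than 1e12, so the absolute TFLOPS figures it reports come out slightly conservative. A quick sketch of that accounting (my addition, not part of the benchmark itself):

# FLOP accounting for the benchmark below (assumes 2*d^3 FLOPs per d x d matmul)
d = 8192
matmuls_per_call = 50
repetitions = 100

total_flops = 2 * d**3 * matmuls_per_call * repetitions
print(f"{total_flops / 1e12:.1f} decimal teraFLOPs")      # ~5497.6
print(f"{total_flops / 1024**4:.1f} 'binary' teraFLOPs")  # 5000.0 (the unit the script uses)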

The Code:

import time
import torch

d = 8192
x = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")
y = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")

def fun(x):
    # 50 chained bfloat16 matmuls; each iteration multiplies the running result by y transposed
    for _ in range(50):
        x = x @ y.T
    return x

# Warm-up
for _ in range(10):
    fun(x)
    torch.cuda.synchronize()

tic = time.time()
repetitions = 100
for _ in range(repetitions):
    fun(x)
    torch.cuda.synchronize()
toc = time.time()

s = (toc - tic)
msec = 1e3 * s
tf = (d**3) * 2 * 50 * repetitions / (1024**4)  # total work in "tera" FLOPs: 2*d^3 per matmul, 50 matmuls per call, `repetitions` calls
print(f"{msec=:.3f}")
tflops = tf / s  # achieved throughput: total work divided by elapsed time
print(f"{tflops=:.3f}")

This code sets up two large bfloat16 tensors on the GPU and then performs a series of matrix multiplications. The torch.cuda.synchronize() calls ensure that all GPU operations are complete before timing.
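
Wall-clock timing is fine for a loop this long, since torch.cuda.synchronize() forces the queued work to finish before the timer is read. For finer-grained measurements, CUDA events could be used instead; a minimal sketch of that alternative (my addition, assuming x, fun and repetitions from the script above):

# Alternative timing with CUDA events instead of time.time()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(repetitions):
    fun(x)
end.record()
torch.cuda.synchronize()              # wait for all queued work before reading the events

elapsed_ms = start.elapsed_time(end)  # elapsed time in milliseconds
print(f"{elapsed_ms=:.3f}")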

Native PyTorch Run

First, I ran the code with a native PyTorch installation on the DGX Spark.
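
To confirm which build is actually in use in a given environment, the version and device strings can be printed directly (a small check of my own, not part of the original script):

import torch

# Report the PyTorch build, the CUDA toolkit it was compiled against, and the GPU
print(torch.__version__)              # e.g. 2.9.0+cu130 for the native install
print(torch.version.cuda)
print(torch.cuda.get_device_name(0))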

PyTorch Version: 2.9.0+cu130

Results: msec=87755.411 tflops=56.977

Dockerized PyTorch Run (NVidia Official Image)

Next, I executed the same code within the official NVidia PyTorch Docker image.

PyTorch Version: 2.9.0a0+145a3a7bda.nv25.10 (note the nv25.10 suffix, which indicates NVidia’s custom build)

Results: msec=57511.067 tflops=86.940

Analysis and Conclusion

During both runs, the DGX GPU was fully utilized, which suggests the observed difference comes from more efficient kernels rather than from one run simply failing to keep the GPU busy.
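
If you want to check utilization from Python rather than with nvidia-smi, something like the following should work (torch.cuda.utilization relies on NVML via the pynvml package, so that needs to be available):

import torch

# Instantaneous GPU utilization (percent) and memory currently allocated by PyTorch
print(torch.cuda.utilization(0))
print(torch.cuda.memory_allocated(0) / 1024**2, "MiB allocated")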

Comparing the results:

  • Native Run: Approximately 57 TFLOPS
  • Dockerized Run: Approximately 87 TFLOPS

That’s roughly a 50% increase in throughput just from switching to the official NVidia PyTorch image!
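
For completeness, here is the speedup computed directly from the reported numbers (a quick sanity check of the rough 50% figure):

# Speedup derived from the measured results above
native_tflops = 56.977
docker_tflops = 86.940

speedup = docker_tflops / native_tflops
print(f"{speedup:.3f}x  (~{(speedup - 1) * 100:.0f}% more throughput)")  # ~1.526x, about 53%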

Prompt

This post was generated using AI - here is the prompt, in case you do not have time to read the whole story.

Write Blog post using markdown for jekyll blog. 

title:  Is it worht to use offcial NVidia PyTorch image?

Add sections like
TL/DR - key points of the post
<main body> - all the sections with main contnent


Based on [@awnihannun](https://x.com/awnihannun) [code](https://x.com/awnihannun/status/1982880363765768288) i have executed some small tests on DGX Spark to check if [NVidia official PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) image  allows you to run your code faster

Modified code to run  8192 bfloat16 square matrix multiplication 100 times and compute average tflops  


Code
import time
import torch
d = 8192
x = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")
y = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")

def fun(x):
    for _ in range(50):
        x = x @ y.T
    return x

for _ in range(10):
    fun(x)
    torch.cuda.synchronize()

tic = time.time()
repetitions = 100
for _ in range(repetitions):
    fun(x)
    torch.cuda.synchronize()
toc = time.time()
s = (toc - tic)
msec = 1e3 * s
tf = (d**3)  * 2 * 50 * repetitions / (1024 **4)
print(f"{msec=:.3f}")
tflops = tf / s
print(f"{tflops=:.3f}")


Results:
For native run:
2.9.0+cu130
msec=87755.411
tflops=56.977


For dockerized run
2.9.0a0+145a3a7bda.nv25.10
msec=57511.067
tflops=86.940

During both runs DGX GPU is fully utilized. 

Hard to say what differs inside of NVidia image - for sure PyTorch use PTX  + the pytorch code is a custom internal NVidia fork.

As shown by the results - IT IS A MUST to use official NVidia official PyTorch image to squeeze out more performance from the DGX Spark. 

That is a simple check just to verify if there is any difference - and there is - deeper and broader experiments could be done what are the differences for other operations and so one. 

[repo](https://github.com/riomus/dgx-spark-performance)
