TL;DR
- Significant Performance Boost: Using the official NVIDIA PyTorch Docker image yielded roughly a 50% increase in TFLOPS on a matrix multiplication benchmark compared to a native PyTorch installation.
- Essential for DGX Spark: If you’re running workloads on a DGX Spark (or similar NVIDIA hardware), the official image appears to be a must-have for maximizing performance.
- Custom Optimizations: The performance gain likely stems from NVIDIA’s custom PyTorch fork and optimized PTX code inside the official image.
- Code: The experiment code is available in this [repo](https://github.com/riomus/dgx-spark-performance).
- Important Note: This is a simple matmul check, meant only to verify whether there is any difference at all - and there is. I will try to run a proper PyTorch benchmark in a follow-up post.
Recently, I stumbled upon a fascinating [code snippet](https://x.com/awnihannun/status/1982880363765768288) by [@awnihannun](https://x.com/awnihannun) that piqued my interest. It prompted me to run a quick experiment on our DGX Spark to see whether NVIDIA’s official PyTorch Docker image truly offers a performance advantage over a standard PyTorch installation.
The official NVIDIA PyTorch images, available on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), are often touted for their optimized performance. But how much of a difference do they actually make in a real-world scenario?
The Experiment Setup
To test this, I adapted Awni Hannun’s code to perform a specific, compute-intensive operation: 8192x8192 bfloat16 matrix multiplications, chained 50 times per call and timed over 100 repetitions. The goal was to measure the average TFLOPS achieved in each environment.
The Code:
```python
import time

import torch

d = 8192
# Two random 8192x8192 matrices, cast to bfloat16 and moved to the GPU.
x = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")
y = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")


def fun(x):
    # Chain 50 matrix multiplications; each one is 2 * d^3 FLOPs.
    for _ in range(50):
        x = x @ y.T
    return x


# Warm-up so one-off initialization costs stay out of the measurement.
for _ in range(10):
    fun(x)
torch.cuda.synchronize()

tic = time.time()
repetitions = 100
for _ in range(repetitions):
    fun(x)
torch.cuda.synchronize()
toc = time.time()

s = toc - tic
msec = 1e3 * s
# Total work in units of 2^40 FLOPs: 2 * d^3 per matmul, 50 matmuls
# per call, `repetitions` calls overall.
tf = (d**3) * 2 * 50 * repetitions / (1024**4)
print(f"{msec=:.3f}")
tflops = tf / s
print(f"{tflops=:.3f}")
```
This code sets up two large bfloat16 tensors on the GPU and then performs a series of matrix multiplications. Because CUDA kernel launches are asynchronous, the torch.cuda.synchronize() calls ensure that all queued GPU work has actually finished - once before the timer starts and once before it stops.
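Wall-clock timing bracketed by synchronize() works fine here; an alternative is CUDA events, which record timestamps on the GPU itself. Here is a minimal sketch, assuming fun, x, and repetitions are defined as in the script above:

```python
# Minimal sketch: timing the same loop with CUDA events instead of time.time().
# Assumes `fun`, `x`, and `repetitions` from the script above.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(repetitions):
    fun(x)
end.record()
torch.cuda.synchronize()  # wait until both events have been recorded

elapsed_ms = start.elapsed_time(end)  # milliseconds between the two events
print(f"{elapsed_ms=:.3f}")
```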
Native PyTorch Run
First, I ran the code with a native PyTorch installation on the DGX Spark.
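The version string reported below can be confirmed with a standard check:

```python
import torch

# NVIDIA's NGC builds carry an "nv" suffix in this string; stock builds do not.
print(torch.__version__)
```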
PyTorch Version: 2.9.0+cu130
Results:
```
msec=87755.411
tflops=56.977
```
Dockerized PyTorch Run (NVIDIA Official Image)
Next, I executed the same code inside the official NVIDIA PyTorch Docker image.
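For context, launching the NGC container looks roughly like this. Note that this is a sketch rather than the exact command I used: the 25.10-py3 tag is an assumption inferred from the nv25.10 version string below, and bench.py is a hypothetical name for the script above.

```bash
# Sketch only: the 25.10-py3 tag is assumed from the nv25.10 build string.
docker run --gpus all --rm -it \
  -v "$PWD":/workspace/host \
  nvcr.io/nvidia/pytorch:25.10-py3 \
  python /workspace/host/bench.py  # bench.py is a hypothetical script name
```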
PyTorch Version: 2.9.0a0+145a3a7bda.nv25.10 (note the nv25.10 suffix, indicating NVIDIA’s custom build)
Results:
```
msec=57511.067
tflops=86.940
```
Analysis and Conclusion
During both runs, the DGX GPU was fully utilized, so the observed difference comes down to software efficiency rather than resource contention.
Comparing the results:
- Native Run: Approximately 57 TFLOPS
- Dockerized Run: Approximately 87 TFLOPS
That’s roughly a 50% increase in throughput (86.9 vs. 57.0 TFLOPS) just from switching to the official NVIDIA PyTorch image!
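As a sanity check on these numbers: 2 * 8192^3 is exactly 2^40, so each matmul counts as exactly one "TFLOP" in the script's 2^40-based units, and the full timed run is 50 * 100 = 5000 of them. The reported figures then follow directly from the wall-clock times:

```python
# Pure arithmetic cross-check of the reported TFLOPS (no GPU required).
d = 8192
per_matmul = 2 * d**3 / 1024**4      # exactly 1.0 in 2^40-based units
total_tflop = per_matmul * 50 * 100  # 5000.0 for the whole timed run

print(total_tflop / 87.755411)  # ~56.977 TFLOPS (native run)
print(total_tflop / 57.511067)  # ~86.940 TFLOPS (NGC image run)
print(86.940 / 56.977)          # ~1.53, i.e. roughly a 50% speedup
```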
Prompt
This post was generated using AI - here is the verbatim prompt, in case you do not have time to read the full story.
```
Write Blog post using markdown for jekyll blog.
title: Is it worht to use offcial NVidia PyTorch image?
Add sections like
TL/DR - key points of the post
<main body> - all the sections with main contnent
Based on [@awnihannun](https://x.com/awnihannun) [code](https://x.com/awnihannun/status/1982880363765768288) i have executed some small tests on DGX Spark to check if [NVidia official PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) image allows you to run your code faster
Modified code to run 8192 bfloat16 square matrix multiplication 100 times and compute average tflops
Code
import time
import torch
d = 8192
x = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")
y = torch.randn(size=(d, d)).to(torch.bfloat16).to("cuda")
def fun(x):
for _ in range(50):
x = x @ y.T
return x
for _ in range(10):
fun(x)
torch.cuda.synchronize()
tic = time.time()
repetitions = 100
for _ in range(repetitions):
fun(x)
torch.cuda.synchronize()
toc = time.time()
s = (toc - tic)
msec = 1e3 * s
tf = (d**3) * 2 * 50 * repetitions / (1024 **4)
print(f"{msec=:.3f}")
tflops = tf / s
print(f"{tflops=:.3f}")
Results:
For native run:
2.9.0+cu130
msec=87755.411
tflops=56.977
For dockerized run
2.9.0a0+145a3a7bda.nv25.10
msec=57511.067
tflops=86.940
During both runs DGX GPU is fully utilized.
Hard to say what differs inside of NVidia image - for sure PyTorch use PTX + the pytorch code is a custom internal NVidia fork.
As shown by the results - IT IS A MUST to use official NVidia official PyTorch image to squeeze out more performance from the DGX Spark.
That is a simple check just to verify if there is any difference - and there is - deeper and broader experiments could be done what are the differences for other operations and so one.
[repo](https://github.com/riomus/dgx-spark-performance)
```