More performance checks of Nvidia PyTorch on DGX Spark

This post continues the discussion from my last post, following an insightful exchange with Sebastian Raschka (@rasbt). We dive deeper into the performance characteristics of Nvidia PyTorch on the DGX Spark, focusing on different data types and common AI workloads.

TL/DR - Key Points

  • Docker is Crucial for GEMM-heavy Workloads: For tasks relying heavily on General Matrix Multiply (GEMM) operations, such as Large Language Models (LLMs), running within the Nvidia PyTorch Docker container provides significant performance gains.
  • GEMM Benefits from Docker: Regardless of data type (fp32, fp16, bf16), GEMM operations are consistently faster inside the Docker environment. The difference is less pronounced for bf16, but still present.
  • Convolutional Operations Less Affected: For convolutional layers, we observed no noticeable performance difference between running inside or outside the Docker container.
  • Mixed Precision (fp16, bf16) Insights: While GEMM generally benefits, the impact on convolutional networks is minimal, suggesting that other bottlenecks might be at play.
  • DGX Spark Memory Bandwidth: The DGX Spark appears to be limited by memory bandwidth, meaning workloads with high memory I/O will see performance constraints.
  • UMA Architecture Implications: Given the DGX Spark’s unified memory architecture (UMA), frequently loading/unloading gradients or weights between host and device gains nothing and only adds copy overhead.
  • TorchBench Results:
    • ResNet50: No noticeable difference for training or evaluation, aligning with its primarily convolutional nature.
    • BERT & nanoGPT: Both showed noticeable speedups for training and evaluation when run inside the Docker container, reinforcing the benefit for GEMM-heavy models.

Testing Methodology

Our testing involved running various PyTorch benchmarks and custom scripts on the DGX Spark, both directly on the host system and within the recommended Nvidia PyTorch Docker container. We focused on three floating-point precisions: fp32 (single precision), fp16 (half precision), and bf16 (bfloat16).

GEMM Performance

General Matrix Multiply (GEMM) operations are the backbone of many modern neural networks, especially transformers and large language models. Our tests consistently showed that GEMM operations are significantly faster when executed inside the Nvidia PyTorch Docker container, regardless of the data type used.

While fp32 and fp16 saw substantial improvements, bf16 also benefited, albeit with a slightly smaller performance delta. This highlights the importance of leveraging the optimized libraries and configurations provided within the Nvidia Docker images for maximum GEMM throughput.
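
As a rough sketch of what these measurements look like (the matrix size, iteration count, and TFLOP/s estimate are illustrative choices, not the exact scripts from the repo), run the snippet below both on the host and inside the container and compare the numbers:

```python
import torch
from torch.utils.benchmark import Timer

N = 4096  # square GEMM; each multiply costs roughly 2 * N^3 FLOPs
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    a = torch.randn(N, N, device="cuda", dtype=dtype)
    b = torch.randn(N, N, device="cuda", dtype=dtype)
    # torch.utils.benchmark.Timer handles warm-up and CUDA synchronization.
    m = Timer(stmt="a @ b", globals={"a": a, "b": b}).timeit(50)
    tflops = 2 * N**3 / m.mean / 1e12
    print(f"{dtype}: {m.mean * 1e3:.2f} ms per GEMM, ~{tflops:.1f} TFLOP/s")
```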

Convolutional Network Performance

In contrast to GEMM, our benchmarks for convolutional layers revealed no noticeable performance difference between running inside or outside the Docker environment. This suggests that for workloads primarily dominated by convolution operations, the overhead or optimizations provided by the Docker container might not be as critical. This observation could be due to convolution operations being less sensitive to specific library versions or GPU driver optimizations that the Docker image might bundle.
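
For comparison, the convolution side of the test boils down to something like this (a single ResNet-style 3x3 convolution with made-up batch and channel sizes; the repo's scripts may differ):

```python
import torch
from torch.utils.benchmark import Timer

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).to("cuda", dtype)
    x = torch.randn(32, 256, 56, 56, device="cuda", dtype=dtype)

    def fwd():
        # Forward pass only; no_grad keeps autograd bookkeeping out of the timing.
        with torch.no_grad():
            return conv(x)

    m = Timer(stmt="fwd()", globals={"fwd": fwd}).timeit(50)
    print(f"{dtype}: {m.mean * 1e3:.2f} ms per forward pass")
```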

Training Step Sample

To understand the real-world impact, we ran a sample training step, which often combines various operations including GEMM. As expected, given that a significant portion of a typical training step involves GEMM, we observed a noticeable speedup when running inside the Docker container. This further reinforces the recommendation to use Docker for training scenarios where GEMM is a dominant factor.
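
As a minimal illustration of such a step (a toy two-layer MLP trained under bf16 autocast with placeholder sizes, not the actual benchmark model from the repo):

```python
import torch
from torch.utils.benchmark import Timer

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

def train_step():
    opt.zero_grad(set_to_none=True)
    # The matmuls inside the Linear layers dominate, so this step behaves
    # much like the raw GEMM benchmark above.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()  # bf16 autocast needs no GradScaler
    opt.step()

print(Timer(stmt="train_step()", globals={"train_step": train_step}).timeit(20))
```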

Memory Bandwidth Considerations

Memory bandwidth is essentially the same in both test environments, native and Docker-based.
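
A quick way to sanity-check this is a large on-device copy and the effective GB/s it achieves; the buffer size below is an arbitrary choice, and the result is only a rough proxy for sustained memory bandwidth:

```python
import torch
from torch.utils.benchmark import Timer

n_bytes = 2 * 1024**3  # 2 GiB source buffer (arbitrary size)
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
dst = torch.empty_like(src)

m = Timer(stmt="dst.copy_(src)", globals={"dst": dst, "src": src}).timeit(20)
# A copy reads the source and writes the destination, so count the bytes twice.
print(f"~{2 * n_bytes / m.mean / 1e9:.0f} GB/s effective copy bandwidth")
```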

TorchBench Results

To provide a broader perspective, we evaluated several models from the TorchBench suite:

  • ResNet50: For both training and evaluation, ResNet50, being a predominantly convolutional neural network, showed no significant performance difference whether run inside or outside the Docker container. This aligns with our specific convolutional layer findings.

  • BERT (Bidirectional Encoder Representations from Transformers): As a model heavily reliant on transformer blocks and thus GEMM operations, BERT demonstrated noticeable speedups for both training and evaluation when run inside the Docker environment. This is a clear indicator of the benefits of Docker for LLM-type architectures.

  • nanoGPT: Similar to BERT, nanoGPT, a minimalistic implementation of GPT, also showed improved performance for both training and evaluation within the Docker container. This further underscores the importance of the Docker environment for transformer-based models.

Visualizing the Results

For a detailed breakdown of the performance metrics, see the chart below (click to load the full image; red results are from experiments inside the container):

![Result charts](/assets/img/performance.png)

Conclusion

In summary, for deep learning workloads on the DGX Spark, the use of Nvidia PyTorch Docker images is strongly recommended, especially for models that are heavily reliant on GEMM operations, such as Large Language Models (LLMs) and transformer-based architectures. While convolutional networks might not see the same dramatic improvements, the overall ecosystem and optimized libraries within the Docker environment still offer a more robust and performant experience.

It’s also crucial to be mindful of the DGX Spark’s memory bandwidth limitations and its UMA architecture. Optimizing data movement and keeping computations on-device as much as possible will yield the best results.
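
To make that last point concrete, here is a simplified sketch of the offload pattern to avoid (the layer and sizes are placeholders): on a discrete GPU with limited VRAM, shuttling weights to the CPU can be worth it, but on a unified-memory machine like the DGX Spark it mostly adds copies and synchronization.

```python
import torch

layer = torch.nn.Linear(8192, 8192, device="cuda")
x = torch.randn(64, 8192, device="cuda")

# Offload-style round trip: sensible on a VRAM-starved discrete GPU,
# but on shared memory it only spends time copying the same bytes around.
layer.to("cpu")
layer.to("cuda")
y = layer(x)

# Preferred here: keep weights, activations and gradients resident and
# skip the host/device shuffling entirely.
y = layer(x)
```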

You can find all the code and benchmarks used for this analysis in the [dgx-spark-performance repository](https://github.com/riomus/dgx-spark-performance).

Prompt

This post was generated using AI. Here is the prompt, in case you do not have time to read the full story:

Write Blog post using markdown for jekyll blog. 

title:  More performance checks of Nvidia PyTorch on DGX Spark

Add sections like
TL/DR - key points of the post
<main body> - all the sections with main contnent

That is a continuation of [last post](https://bartusiak.ai/2025/11/04/is-it-worth.html) after a discussion with 
[Sebastian Raschka @rasbt](https://x.com/RomanBartusiak/status/1985772482553430427).

1. Tested fp32 fp16 and bf16
2. GEMM is always faster no matter of data type -but for bf16 difference is smaller 
3. Conv - no noticible difference
4. Train step sample - noticible faster in docker - same as GEMM as it is mainly GEMM
5. Memory bandwidth - no noticible difference
6. TorchBench

    * resnet50 - no noticible difference for train/eval as it is mainly conv
    * BERT - faster inside of docker for both train and eval
    * nanogpt - faster inside of docker for both train and eval

Results img url /assets/img/performance.png


As in previous post - using docker image is a must for cases where GEMM performance is important (LLMs for sure, conv based - not so much)
DGS Spark is limited by memory bandwidth - so a lot of IO on memory will litmit the performance
DGX Spark is UMA -  it is not worth to load/unload gradients/weights 

[repo](https://github.com/riomus/dgx-spark-performance)
