Recent posts
07 Nov 2025
This post continues from my last post, following an insightful discussion with Sebastian Raschka @rasbt. We dive deeper into the performance characteristics of Nvidia PyTorch on the DGX Spark, focusing specifically on various data types and common AI workloads.
TL/DR - Key Points
- Docker is Crucial for GEMM-heavy Workloads: For tasks relying heavily on General Matrix Multiply (GEMM) operations, such as Large Language Models (LLMs), running within the Nvidia PyTorch Docker container provides significant performance gains.
- GEMM Benefits from Docker: Regardless of data type (fp32, fp16, bf16), GEMM operations are consistently faster inside the Docker environment. The difference is less pronounced for bf16, but still present (a minimal timing sketch follows this list).
- Convolutional Operations Less Affected: For convolutional layers, we observed no noticeable performance difference between running inside or outside the Docker container.
- Mixed Precision (fp16, bf16) Insights: While GEMM generally benefits, the impact on convolutional networks is minimal, suggesting that other bottlenecks might be at play.
- DGX Spark Memory Bandwidth: The DGX Spark appears to be limited by memory bandwidth, meaning workloads with high memory I/O will see performance constraints.
- UMA Architecture Implications: Given the DGX Spark’s Unified Memory Architecture (UMA), it’s generally not efficient to frequently load/unload gradients or weights, as this can introduce unnecessary overhead.
- TorchBench Results:
- ResNet50: No noticeable difference for training or evaluation, aligning with its primarily convolutional nature.
- BERT & nanoGPT: Both showed noticeable speedups for training and evaluation when run inside the Docker container, reinforcing the benefit for GEMM-heavy models.
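For reference, here is a minimal sketch of the kind of comparison behind these numbers: time a large GEMM and a conv2d forward pass per data type, run the script once natively and once inside the NVIDIA PyTorch container, and compare. The sizes and iteration counts below are illustrative, not the exact configuration used for these results.

```python
# Minimal GEMM vs. conv2d timing sketch per dtype (illustrative sizes, not the
# exact benchmark behind the post). Run natively and inside the container, compare.
import torch

def bench(fn, iters=20, warmup=3):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    n = 8192
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    conv = torch.nn.Conv2d(64, 64, 3, padding=1).to("cuda", dtype)
    x = torch.randn(32, 64, 224, 224, device="cuda", dtype=dtype)
    gemm_ms = bench(lambda: a @ b)
    conv_ms = bench(lambda: conv(x))
    print(f"{dtype}: GEMM {gemm_ms:.2f} ms, conv2d {conv_ms:.2f} ms")
```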
More …
04 Nov 2025
TL/DR
- Significant Performance Boost: Using the official NVidia PyTorch Docker image resulted in a 50% increase in TFLOPS for a specific matrix multiplication task compared to a native PyTorch installation.
- Essential for DGX Spark: If you’re running workloads on a DGX Spark (or similar NVidia hardware), the official image appears to be a must-have for maximizing performance.
- Custom Optimizations: The performance gain likely stems from NVidia’s custom PyTorch fork and optimized PTX code within their official image.
- Code: repo with the experiment code
- Important note: This is a simple matmul check just to verify whether there is any difference - and there is. A minimal version of the check is sketched below; I will try to run the PyTorch benchmarks in the next post.
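For illustration, a minimal stand-in for that check (the full version lives in the linked repo): measure achieved TFLOPS for one large bf16 matmul, then run the same script natively and inside the NVIDIA PyTorch container and compare. The matrix size and iteration count here are arbitrary.

```python
# Minimal TFLOPS check for a single large bf16 matmul (stand-in for the repo code).
import torch

n, iters = 8192, 50
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

for _ in range(5):              # warm-up so cuBLAS heuristics settle
    a @ b
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    a @ b
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000 / iters
tflops = 2 * n ** 3 / seconds / 1e12    # 2*n^3 FLOPs per n-by-n matmul
print(f"torch {torch.__version__}: {tflops:.1f} TFLOPS")
```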
More …
31 Oct 2025
TL/DR
If you’re working with large language models (LLMs) on systems like the DGX Spark, and encountering “out of memory” errors despite having seemingly ample RAM (e.g., 128GB for a 7B parameter model), the culprit might be your operating system’s caching mechanisms. The solution is often as simple as dropping system caches.
- DGX Spark uses UMA (Unified Memory Architecture): CPU and GPU share the same memory.
- OS Caching: The OS aggressively uses memory for caches, which might not be visible to GPU tools.
- CUDA vs. Actual Usage: The memory usage shown in the DGX Dashboard (reported via the CUDA API) can look high even with no model loaded, because OS caches count against it (a quick way to observe this is sketched after this list).
- The Fix: Clear system caches with `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`.
- This is mentioned in the NVidia docs - check it here
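A quick way to observe this from Python (assuming a recent PyTorch; `torch.cuda.mem_get_info` wraps `cudaMemGetInfo`): check how much memory CUDA reports as free before and after dropping the caches.

```python
# On a UMA system the free memory that CUDA reports shrinks as the OS page cache
# grows, so compare the figure before and after dropping caches.
import torch

free, total = torch.cuda.mem_get_info()
print(f"CUDA-visible free memory: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")

# Then, in a shell:
#   sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
# and re-run this snippet; the free figure should jump back up.
```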
More …
28 Oct 2025
TL/DR
The NVIDIA DGX Spark is a powerful little devbox for local model development, boasting 128GB of unified memory despite its compact size. To truly unleash its potential with tools like Unsloth, you need to navigate a few key challenges:
- Official NVIDIA PyTorch Image is Key: Leverage NVIDIA’s optimized PyTorch Docker image for maximum performance on the GB10 chip.
- UV for Dependency Management: Use `uv` to create a virtual environment, allowing you to pin specific library versions while utilizing the optimized PyTorch from the base image.
- Block PyTorch with UV: Prevent `uv` from reinstalling PyTorch by using `override-dependencies` in your `pyproject.toml`.
- TORCH_CUDA_ARCH_LIST Override: Correctly set `TORCH_CUDA_ARCH_LIST` to 12.0 (or unset it) for a successful xformers compilation; a sanity-check sketch for the finished environment follows this list.
- Custom xformers Build: Install `xformers` from a custom source branch that supports CUDA 12.1 until the official merge.
- Upgrades: When upgrading the base image, the virtual environment needs to be recreated.
- Full repo with code: code is here
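As a rough sanity check for the finished setup (a sketch, not part of the linked repo): print the PyTorch build details, the compute capability, and the `TORCH_CUDA_ARCH_LIST` value, then smoke-test the custom xformers build with a tiny attention call. The tensor shapes below are arbitrary.

```python
# Sanity check for the container + uv setup: confirm the base image's PyTorch is the
# one the virtual environment sees, inspect the arch list, and smoke-test xformers.
import os
import torch

print("torch:", torch.__version__)                 # NVIDIA container builds carry their own version tag
print("CUDA runtime:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("TORCH_CUDA_ARCH_LIST:", os.environ.get("TORCH_CUDA_ARCH_LIST", "<unset>"))

try:
    import xformers
    import xformers.ops as xops
    q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
    out = xops.memory_efficient_attention(q, q, q)  # tiny self-attention smoke test
    print("xformers:", xformers.__version__, "OK", out.shape)
except Exception as exc:                            # ImportError or kernel-dispatch failure
    print("xformers check failed:", exc)
```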
More …