Recent posts
07 Nov 2025
This post continues from my last post, following an insightful discussion with Sebastian Raschka @rasbt. We dive deeper into the performance characteristics of Nvidia PyTorch on the DGX Spark, focusing specifically on various data types and common AI workloads.
TL/DR - Key Points
- Docker is Crucial for GEMM-heavy Workloads: For tasks relying heavily on General Matrix Multiply (GEMM) operations, such as Large Language Models (LLMs), running within the Nvidia PyTorch Docker container provides significant performance gains.
- GEMM Benefits from Docker: Regardless of data type (fp32, fp16, bf16), GEMM operations are consistently faster inside the Docker environment. The difference is less pronounced for bf16, but still present (a minimal timing sketch follows this list).
- Convolutional Operations Less Affected: For convolutional layers, we observed no noticeable performance difference between running inside or outside the Docker container.
- Mixed Precision (fp16, bf16) Insights: While GEMM generally benefits, the impact on convolutional networks is minimal, suggesting that other bottlenecks might be at play.
- DGX Spark Memory Bandwidth: The DGX Spark appears to be limited by memory bandwidth, meaning workloads with high memory I/O will see performance constraints.
- UMA Architecture Implications: Given the DGX Spark’s Unified Memory Architecture (UMA), it’s generally not efficient to frequently load/unload gradients or weights, as this can introduce unnecessary overhead.
- TorchBench Results:
- ResNet50: No noticeable difference for training or evaluation, aligning with its primarily convolutional nature.
- BERT & nanoGPT: Both showed noticeable speedups for training and evaluation when run inside the Docker container, reinforcing the benefit for GEMM-heavy models.
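For reference, here is a minimal sketch of the kind of comparison behind these numbers: time a large GEMM and a conv2d forward pass per data type, run the script once natively and once inside the NVIDIA PyTorch container, and compare. The sizes and iteration counts below are illustrative, not the exact configuration used for these results.

```python
# Minimal GEMM vs. conv2d timing sketch per dtype (illustrative sizes, not the
# exact benchmark behind the post). Run natively and inside the container, compare.
import torch

def bench(fn, iters=20, warmup=3):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    n = 8192
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    conv = torch.nn.Conv2d(64, 64, 3, padding=1).to("cuda", dtype)
    x = torch.randn(32, 64, 224, 224, device="cuda", dtype=dtype)
    gemm_ms = bench(lambda: a @ b)
    conv_ms = bench(lambda: conv(x))
    print(f"{dtype}: GEMM {gemm_ms:.2f} ms, conv2d {conv_ms:.2f} ms")
```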
More …
04 Nov 2025
TL/DR
- Significant Performance Boost: Using the official NVidia PyTorch Docker image resulted in a 50% increase in TFLOPS for a specific matrix multiplication task compared to a native PyTorch installation.
- Essential for DGX Spark: If you’re running workloads on a DGX Spark (or similar NVidia hardware), the official image appears to be a must-have for maximizing performance.
- Custom Optimizations: The performance gain likely stems from NVidia’s custom PyTorch fork and optimized PTX code within their official image.
- Code: repo with the experiment code
- Important note: This is a simple matmul check just to verify whether there is any difference - and there is. A minimal version of the check is sketched below; I will try to run the PyTorch benchmarks in the next post.
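For illustration, a minimal stand-in for that check (the full version lives in the linked repo): measure achieved TFLOPS for one large bf16 matmul, then run the same script natively and inside the NVIDIA PyTorch container and compare. The matrix size and iteration count here are arbitrary.

```python
# Minimal TFLOPS check for a single large bf16 matmul (stand-in for the repo code).
import torch

n, iters = 8192, 50
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

for _ in range(5):              # warm-up so cuBLAS heuristics settle
    a @ b
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    a @ b
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000 / iters
tflops = 2 * n ** 3 / seconds / 1e12    # 2*n^3 FLOPs per n-by-n matmul
print(f"torch {torch.__version__}: {tflops:.1f} TFLOPS")
```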
More …
31 Oct 2025
TL/DR
If you’re working with large language models (LLMs) on systems like the DGX Spark, and encountering “out of memory” errors despite having seemingly ample RAM (e.g., 128GB for a 7B parameter model), the culprit might be your operating system’s caching mechanisms. The solution is often as simple as dropping system caches.
- DGX Spark uses UMA (Unified Memory Architecture): CPU and GPU share the same memory.
- OS Caching: The OS aggressively uses memory for caches, which might not be visible to GPU tools.
- CUDA vs. Actual Usage: The memory usage shown in the DGX Dashboard (reported via the CUDA API) can look high even with no model loaded, because OS caches count against it (a quick way to observe this is sketched after this list).
- The Fix: Clear system caches with `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`.
- This is mentioned in the NVidia docs - check it here
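A quick way to observe this from Python (assuming a recent PyTorch; `torch.cuda.mem_get_info` wraps `cudaMemGetInfo`): check how much memory CUDA reports as free before and after dropping the caches.

```python
# On a UMA system the free memory that CUDA reports shrinks as the OS page cache
# grows, so compare the figure before and after dropping caches.
import torch

free, total = torch.cuda.mem_get_info()
print(f"CUDA-visible free memory: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")

# Then, in a shell:
#   sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
# and re-run this snippet; the free figure should jump back up.
```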
More …
28 Oct 2025
TL/DR
The NVIDIA DGX Spark is a powerful little devbox for local model development, boasting 128GB of unified memory despite its compact size. To truly unleash its potential with tools like Unsloth, you need to navigate a few key challenges:
- Official NVIDIA PyTorch Image is Key: Leverage NVIDIA’s optimized PyTorch Docker image for maximum performance on the GB10 chip.
- UV for Dependency Management: Use `uv` to create a virtual environment, allowing you to pin specific library versions while utilizing the optimized PyTorch from the base image.
- Block PyTorch with UV: Prevent `uv` from reinstalling PyTorch by using `override-dependencies` in your `pyproject.toml`.
- TORCH_CUDA_ARCH_LIST Override: Correctly set `TORCH_CUDA_ARCH_LIST` to 12.0 (or unset it) for a successful xformers compilation; a sanity-check sketch for the finished environment follows this list.
- Custom xformers Build: Install `xformers` from a custom source branch that supports CUDA 12.1 until the official merge.
- Upgrades: When upgrading the base image, the virtual environment needs to be recreated.
- Full repo with code: code is here
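As a rough sanity check for the finished setup (a sketch, not part of the linked repo): print the PyTorch build details, the compute capability, and the `TORCH_CUDA_ARCH_LIST` value, then smoke-test the custom xformers build with a tiny attention call. The tensor shapes below are arbitrary.

```python
# Sanity check for the container + uv setup: confirm the base image's PyTorch is the
# one the virtual environment sees, inspect the arch list, and smoke-test xformers.
import os
import torch

print("torch:", torch.__version__)                 # NVIDIA container builds carry their own version tag
print("CUDA runtime:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("TORCH_CUDA_ARCH_LIST:", os.environ.get("TORCH_CUDA_ARCH_LIST", "<unset>"))

try:
    import xformers
    import xformers.ops as xops
    q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
    out = xops.memory_efficient_attention(q, q, q)  # tiny self-attention smoke test
    print("xformers:", xformers.__version__, "OK", out.shape)
except Exception as exc:                            # ImportError or kernel-dispatch failure
    print("xformers check failed:", exc)
```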
More …