Recent posts

More performance checks of NVIDIA PyTorch on DGX Spark

This post continues my previous one, following an insightful discussion with Sebastian Raschka (@rasbt). We dive deeper into the performance characteristics of NVIDIA PyTorch on the DGX Spark, focusing on different data types and common AI workloads.

TL/DR - Key Points

  • Docker is Crucial for GEMM-heavy Workloads: For tasks relying heavily on General Matrix Multiply (GEMM) operations, such as Large Language Models (LLMs), running within the NVIDIA PyTorch Docker container provides significant performance gains.
  • GEMM Benefits from Docker: Regardless of data type (fp32, fp16, bf16), GEMM operations are consistently faster inside the Docker environment. The difference is less pronounced for bf16, but still present (a minimal timing sketch follows this list).
  • Convolutional Operations Less Affected: For convolutional layers, we observed no noticeable performance difference between running inside or outside the Docker container.
  • Mixed Precision (fp16, bf16) Insights: While GEMM generally benefits, the impact on convolutional networks is minimal, suggesting that other bottlenecks might be at play.
  • DGX Spark Memory Bandwidth: The DGX Spark appears to be limited by memory bandwidth, meaning workloads with high memory I/O will see performance constraints.
  • UMA Architecture Implications: Given the DGX Spark’s unified memory architecture (UMA), frequently loading/unloading gradients or weights is generally not worthwhile, as it only introduces unnecessary overhead.
  • TorchBench Results:
    • ResNet50: No noticeable difference for training or evaluation, aligning with its primarily convolutional nature.
    • BERT & nanoGPT: Both showed noticeable speedups for training and evaluation when run inside the Docker container, reinforcing the benefit for GEMM-heavy models.
More …

Is it worth using the official NVIDIA PyTorch image?

TL/DR

  • Significant Performance Boost: Using the official NVIDIA PyTorch Docker image resulted in a 50% increase in TFLOPS for a specific matrix multiplication task compared to a native PyTorch installation.
  • Essential for DGX Spark: If you’re running workloads on a DGX Spark (or similar NVIDIA hardware), the official image appears to be a must-have for maximizing performance.
  • Custom Optimizations: The performance gain likely stems from NVIDIA’s custom PyTorch fork and optimized PTX code within their official image (a quick way to check which build you are running is sketched below).
  • Code: repo with the code for the experiment
  • Important note: This is a simple matmul check, done only to verify whether there is any difference at all - and there is. I’ll try to run a PyTorch benchmark in the next post.
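A quick way to confirm which build you are actually benchmarking (a plain pip-installed PyTorch vs. the container’s NVIDIA fork) is to print the build metadata. This is just an inspection sketch; the comments about expected values are assumptions, not measured results.

```python
import torch

# Which PyTorch build and CUDA stack is this environment actually using?
# The NVIDIA container ships its own build, so these values will differ
# from a plain pip-installed torch.
print("torch version:", torch.__version__)         # container builds usually carry a custom suffix
print("built for CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("TF32 matmul allowed:", torch.backends.cuda.matmul.allow_tf32)
```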
More …

Oh no, 128GB is not enough for 7B parameters!

TL/DR

If you’re working with large language models (LLMs) on systems like the DGX Spark, and encountering “out of memory” errors despite having seemingly ample RAM (e.g., 128GB for a 7B parameter model), the culprit might be your operating system’s caching mechanisms. The solution is often as simple as dropping system caches.

  • DGX Spark uses UMA (Unified Memory Architecture): CPU and GPU share the same memory.
  • OS Caching: The OS aggressively uses memory for caches, which might not be visible to GPU tools (see the sketch after this list for a quick way to check).
  • CUDA vs. Actual Usage: DGX Dashboard’s memory usage (via CUDA API) might show high usage even without a model loaded due to OS caches.
  • The Fix: Clear system caches with sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'.
  • This is mentioned in the NVIDIA docs - check it here
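A small sketch, assuming a Linux host, of how to watch the relevant numbers straight from /proc/meminfo before and after dropping caches:

```python
def meminfo(*fields: str) -> dict:
    """Read selected fields from /proc/meminfo and return them in MiB."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in fields:
                values[key] = int(rest.split()[0]) // 1024  # kB -> MiB
    return values

# Run before and after: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
# 'Cached' should shrink noticeably, freeing unified memory for the model.
print(meminfo("MemTotal", "MemAvailable", "Cached"))
```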
More …

Unsloth your DGX Spark

TL/DR

The NVIDIA DGX Spark is a powerful little devbox for local model development, boasting 128GB of unified memory despite its compact size. To truly unleash its potential with tools like Unsloth, you need to navigate a few key challenges:

  • Official NVIDIA PyTorch Image is Key: Leverage NVIDIA’s optimized PyTorch Docker image for maximum performance on the GB10 chip.
  • UV for Dependency Management: Use uv to create a virtual environment, allowing you to pin specific library versions while utilizing the optimized PyTorch from the base image.
  • Block PyTorch with UV: Prevent uv from reinstalling PyTorch by using override-dependencies in your pyproject.toml.
  • TORCH_CUDA_ARCH_LIST Override: Set TORCH_CUDA_ARCH_LIST to 12.0 (or unset it, as appropriate) so that xformers compiles successfully.
  • Custom xformers Build: Install xformers from a custom source branch that supports CUDA 12.1 until it is officially merged (a quick smoke test is sketched after this list).
  • Upgrades: When upgrading the base image, the virtual environment needs to be recreated.
  • Full repo with code: code is here
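Once the environment is in place, a short smoke test like the one below (a hypothetical check, not part of the original setup steps) can confirm that the custom xformers build actually runs memory-efficient attention on the GPU:

```python
import torch
import xformers.ops as xops

# Shapes follow xformers' (batch, seq_len, heads, head_dim) convention;
# the sizes here are arbitrary and only meant to exercise the kernel.
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)
print("xformers memory-efficient attention OK:", tuple(out.shape), out.dtype)
```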
More …