One of the features I was most excited about in [CUDA 6] is [nvBLAS], the drop-in replacement library for BLAS. The idea is to set `LD_PRELOAD` when launching IDL so that BLAS routines are resolved from nvBLAS instead of the BLAS implementation that would normally be found, which in IDL's case is the `idl_lapack.so` distributed with IDL. I've had problems getting it to work so far, though.
Right now, the way I'm starting IDL is something like the following:

    $ export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf
    $ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64
    $ LD_PRELOAD=/usr/local/cuda-6.0/lib64/libnvblas.so idl

IDL does seem to be picking up nvBLAS, because it crashes if I don't set `LD_LIBRARY_PATH` and `NVBLAS_CONFIG_FILE` (I'm using a default configuration file). But so far I have not measured any difference in the performance of `MATRIX_MULTIPLY` with and without nvBLAS. I will continue to test this, because it's too interesting to pass up, and there are a couple of possible explanations that I haven't explored yet:
1. I'm not using big enough matrices at 5000 x 5000 elements.
2. I'm not setting something in the configuration file specified by `NVBLAS_CONFIG_FILE`.
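
One way to check the second possibility is to turn on nvBLAS's trace logging and see whether any BLAS calls are intercepted at all. The following is a minimal sketch, not a tested setup: the keywords are the ones I understand the nvBLAS documentation to describe, while the temporary file location and the CPU BLAS path (`/usr/lib/libopenblas.so`) are placeholders that would need to match the actual system.

```shell
# Hypothetical location; substitute the path used for NVBLAS_CONFIG_FILE.
conf="${TMPDIR:-/tmp}/nvblas.conf"

cat > "$conf" <<'EOF'
# CPU BLAS that nvBLAS falls back to for calls it does not offload.
# This path is an assumption; point it at a real CPU BLAS library.
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
# Use all CUDA-capable GPUs that are detected.
NVBLAS_GPU_LIST ALL
# Write a log file and record each intercepted BLAS call in it.
NVBLAS_LOGFILE nvblas.log
NVBLAS_TRACE_LOG_ENABLED
EOF

export NVBLAS_CONFIG_FILE="$conf"
```

If `nvblas.log` shows no trace entries after a `MATRIX_MULTIPLY` call, the BLAS calls are probably not reaching nvBLAS at all, which would point to a loading problem rather than a tuning problem. Keep in mind that nvBLAS only intercepts Level 3 BLAS routines such as the GEMM family.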
I don't imagine the speedup from nvBLAS is going to be amazing, because memory transfer between host and GPU will eat into the performance, but you can't beat not having to change any code at all. I am also worried that the way IDL loads dynamic libraries might keep this from working.
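
To get a feel for how much memory transfer might matter, here is a rough back-of-envelope estimate for multiplying two n x n matrices. Every figure in it is an assumption rather than a measurement: double-precision elements (8 bytes), about 6 GB/s of host-to-GPU bandwidth, and about 1000 GFLOP/s of sustained GPU throughput. The `estimate` function is just a hypothetical helper for this sketch.

```shell
# estimate N [BYTES_PER_ELEMENT] [PCIE_GB_PER_S] [GPU_GFLOPS]
# Rough cost of C = A * B for N x N matrices. Defaults are assumptions.
estimate() {
  awk -v n="$1" -v b="${2:-8}" -v pcie="${3:-6}" -v gflops="${4:-1000}" 'BEGIN {
    bytes = 3 * n * n * b      # A and B copied in, C copied out
    flops = 2 * n * n * n      # one multiply and one add per inner-loop step
    printf "data moved: %.0f MB\n", bytes / 1e6
    printf "transfer:   %.2f s\n", bytes / (pcie * 1e9)
    printf "compute:    %.2f s\n", flops / (gflops * 1e9)
  }'
}

estimate 5000
```

Under these assumptions, the 5000 x 5000 case moves 600 MB, with about 0.10 s of transfer against about 0.25 s of compute, so the transfer is a noticeable but not dominant share. Since compute grows as n^3 and transfer only as n^2, larger matrices should tilt the balance further toward the GPU.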
[CUDA 6]: http://docs.nvidia.com/cuda/index.html "CUDA Toolkit Documentation"
[nvBLAS]: http://docs.nvidia.com/cuda/nvblas/index.html "NVBLAS :: CUDA Toolkit Documentation"