[QR factorization][QR] of a matrix `A` is the process of determining an orthogonal matrix `Q` and upper triangular matrix `R` such that $$A = QR$$ Using a QR factorization is faster and much more numerically stable than finding an inverse matrix of `A` to solve the system of equations `Ax = b`. QR factorization is used in fundamental linear algebra IDL routines such as `LA_EIGENPROBLEM`, `LA_LEAST_SQUARES`, `LA_SVD`, and others. I was recently using [MAGMA]'s `[SDCZ]GEQRF` within [GPULib] to retrieve a QR factorization and found excellent performance improvements by using the GPU.

Also, the full decomposition into `Q` and `R` matrices is seldom needed, so the results are not returned in a naive manner. I will show how to retrieve `Q` and `R` for the GPULib routines (it's slightly different than the CPU version).

[QR]: http://en.wikipedia.org/wiki/QR_decomposition "QR decomposition"
[MAGMA]: http://icl.cs.utk.edu/magma/ "MAGMA"
[GPULib]: http://www.txcorp.com/home/gpulib "GPULib"

Performance of MAGMA for QR factorization was excellent and it scaled quite well compared to the CPU as shown below.

For small matrices of size 500 x 500, I got about a 25x speedup, but on larger 5000 x 5000 element matrices about a 435x speedup.[^1]

Next, we'll go through how to do perform QR factorization using GPULib and verify that we are really factoring `A` correctly. First, as always, we initialize GPULib:

IDL> gpuinit
GPULib Full 1.7.0 (Revision: 2876M)
Graphics card: Tesla C2070, compute capability: 2.0, memory: 1205 MB available, 1279 MB total
CUDA version: 5.5
MAGMA version: 1.4.0
Checking GPU memory allocation...cudaSuccess

Next, we make a simple test case and transfer it to the GPU:

IDL> a = [[9.0, 2.0, 6.0], [4.0, 8.0, 7.0]]
IDL> da = gpuputarr(a)

There are a few variables we need to setup:

IDL> dims = size(da, /dimensions)
IDL> m = dims[0]
IDL> n = dims[1]

The GPU routine we will use, `GPUSGEQRF`, is a call directly into the [MAGMA] C layer, so we need to setup a GPU workspace variable of the correct size as well as create a CPU variable to return the multipliers in `tau`:

IDL> info = 0L
IDL> tau = fltarr(m < n) IDL> nb = gpu_get_sgeqrf_nb(m)
IDL> lwork = (2L * (m < n) + (n + 31L) /32L * 32L ) * nb IDL> dworkspace = gpufltarr(lwork)

We are ready to call the routine to perform the QR factorization:

IDL> status = gpusgeqrf(m, n, da->_getHandle(), m, tau, dworkspace->_getHandle(), info)

Part of the result is done in place in `da` (the other portion of the result is in the CPU variable `tau`):

IDL> result = gpugetarr(da)

`Q` is decomposed into `H` *reflector* matrices such that: $$Q = H_1 H_2 . . . H_k$$ where `k = min(m, n)`, $$H_i = I - \tau_i v v^T$$ and `v` is stored in `da`:

v[i + 1:m - 1L] = result[i + 1:m - 1L, i]

The code to reconstruct `Q` is therefore:

IDL> q = identity(m)
IDL> .run
- for i = 0L, n - 1L do begin
- v = fltarr(m)
- v[i] = 1.0
- v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
- q_sub = (identity(m) - tau[i] * (v # v))
- q = q ## q_sub
- endfor
- end

`R` is stored in the other triangle of `da`:

IDL> r = fltarr(m, n)
IDL> for i = 0L, n - 1L do r[0L:i, i] = result[0L:i, i]

To verify the result, `r` is upper triangular (IDL and standard mathematical notation have different conventions for displaying matrices):

IDL> print, r
-11.0000 0.00000 0.00000
-8.54546 -7.48166 0.00000

`Q` is orthogonal:

IDL> print, transpose(q) # q
1.00000 0.00000 5.96046e-08
0.00000 1.00000 -2.98023e-08
5.96046e-08 -2.98023e-08 1.00000

And, $$A = QR$$

IDL> print, transpose(q) # r
9.00000 2.00000 6.00000
4.00000 8.00000 7.00000

*Full disclosure: I work for Tech-X and I am the product manager for the GPULib.*

[^1]: CPU was 6 core Intel Core i7-3960X @ 3.30GHz, GPU Tesla C2070 with 1279 MB memory