[QR factorization][QR] of a matrix `A` is the process of determining an orthogonal matrix `Q` and an upper triangular matrix `R` such that $$A = QR$$ Solving the system of equations `Ax = b` with a QR factorization is faster and much more numerically stable than computing the inverse of `A`. QR factorization underlies fundamental IDL linear algebra routines such as `LA_EIGENPROBLEM`, `LA_LEAST_SQUARES`, and `LA_SVD`. I was recently using [MAGMA]'s `[SDCZ]GEQRF` within [GPULib] to compute a QR factorization and found excellent performance improvements from using the GPU.
Also, the full decomposition into explicit `Q` and `R` matrices is seldom needed, so the results are returned in a packed form rather than as two separate matrices. I will show how to retrieve `Q` and `R` from the GPULib routines (the procedure is slightly different than for the CPU version).
[QR]: http://en.wikipedia.org/wiki/QR_decomposition "QR decomposition"
[MAGMA]: http://icl.cs.utk.edu/magma/ "MAGMA"
[GPULib]: http://www.txcorp.com/home/gpulib "GPULib"
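As an aside, the speed and stability claim is easy to demonstrate outside of IDL. Here is a minimal NumPy/SciPy sketch (illustrative only, not part of the GPULib workflow) of solving `Ax = b` through a QR factorization instead of an explicit inverse:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

# A = QR with Q orthogonal and R upper triangular, so Ax = b
# becomes R x = Q^T b, which back-substitution solves directly.
Q, R = np.linalg.qr(A)
x = solve_triangular(R, Q.T @ b)

print(np.max(np.abs(A @ x - b)))  # residual is tiny: no inverse was formed
```

The same structure (factor once, then a cheap triangular solve) is what the LAPACK- and MAGMA-based routines exploit internally.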
Performance of MAGMA for QR factorization was excellent, and it scaled quite well compared to the CPU: for small matrices of size 500 x 500 I measured about a 25x speedup, while on larger 5000 x 5000 matrices the speedup was about 435x.[^1]
Next, we'll go through how to perform QR factorization using GPULib and verify that we are really factoring `A` correctly. First, as always, we initialize GPULib:
IDL> gpuinit
GPULib Full 1.7.0 (Revision: 2876M)
Graphics card: Tesla C2070, compute capability: 2.0, memory: 1205 MB available, 1279 MB total
CUDA version: 5.5
MAGMA version: 1.4.0
Checking GPU memory allocation...cudaSuccess
Next, we make a simple test case and transfer it to the GPU:
IDL> a = [[9.0, 2.0, 6.0], [4.0, 8.0, 7.0]]
IDL> da = gpuputarr(a)
There are a few variables we need to set up:
IDL> dims = size(da, /dimensions)
IDL> m = dims[0]
IDL> n = dims[1]
The GPU routine we will use, `GPUSGEQRF`, calls directly into the [MAGMA] C layer, so we need to set up a GPU workspace variable of the correct size, as well as a CPU variable, `tau`, to return the multipliers:
IDL> info = 0L
IDL> tau = fltarr(m < n)
IDL> nb = gpu_get_sgeqrf_nb(m)
IDL> lwork = (2L * (m < n) + (n + 31L) / 32L * 32L) * nb
IDL> dworkspace = gpufltarr(lwork)
We are ready to call the routine to perform the QR factorization:
IDL> status = gpusgeqrf(m, n, da->_getHandle(), m, tau, dworkspace->_getHandle(), info)
Part of the result is stored in place in `da` (the other portion of the result is in the CPU variable `tau`):
IDL> result = gpugetarr(da)
`Q` is decomposed into `H` *reflector* matrices such that: $$Q = H_1 H_2 \cdots H_k$$ where `k = min(m, n)`, $$H_i = I - \tau_i v v^T$$ and `v` is stored in `da`:
v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
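For real data, each reflector $H_i$ is orthogonal exactly when $\tau_i = 2 / (v^T v)$, which is how LAPACK-style routines choose $\tau$ given the implicit leading 1 in `v`. A quick NumPy sketch of that property (illustrative only, not GPULib code):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(5)
v[0] = 1.0                    # LAPACK convention: implicit leading 1
tau = 2.0 / (v @ v)           # this choice makes H orthogonal (real case)

H = np.eye(5) - tau * np.outer(v, v)
print(np.allclose(H.T @ H, np.eye(5)))  # True: H is orthogonal
```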
The code to reconstruct `Q` is therefore:
IDL> q = identity(m)
IDL> .run
 for i = 0L, n - 1L do begin
   v = fltarr(m)
   v[i] = 1.0
   v[i + 1:m - 1L] = result[i + 1:m - 1L, i]
   q_sub = (identity(m) - tau[i] * (v # v))
   q = q ## q_sub
 endfor

 end
`R` is stored in the other triangle of `da`:
IDL> r = fltarr(m, n)
IDL> for i = 0L, n - 1L do r[0L:i, i] = result[0L:i, i]
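The same packed convention is used by LAPACK's `?GEQRF` on the CPU, so the whole unpacking above can be cross-checked with SciPy. Here is a sketch using `scipy.linalg.lapack.dgeqrf` on the same test matrix (written in math orientation, so rows and columns are swapped relative to the IDL listing):

```python
import numpy as np
from scipy.linalg.lapack import dgeqrf

# Same test matrix: columns [9, 2, 6] and [4, 8, 7].
A = np.array([[9.0, 4.0],
              [2.0, 8.0],
              [6.0, 7.0]])
m, n = A.shape

# Packed result: R in the upper triangle, Householder vectors below it.
qr_packed, tau, work, info = dgeqrf(A)
assert info == 0

# Rebuild Q = H_1 H_2 ... H_k from the packed vectors and tau.
Q = np.eye(m)
for i in range(n):
    v = np.zeros(m)
    v[i] = 1.0
    v[i + 1:] = qr_packed[i + 1:, i]
    Q = Q @ (np.eye(m) - tau[i] * np.outer(v, v))

R = np.triu(qr_packed)           # the upper triangle is R
print(np.allclose(Q @ R, A))     # True: A = QR recovered
```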
To verify the result, first check that `r` is upper triangular (note that IDL and standard mathematical notation use different conventions for displaying matrices):
IDL> print, r
11.0000 0.00000 0.00000
8.54546 7.48166 0.00000
`Q` is orthogonal:
IDL> print, transpose(q) # q
1.00000 0.00000 5.96046e-08
0.00000 1.00000 2.98023e-08
5.96046e-08 2.98023e-08 1.00000
And, $$A = QR$$
IDL> print, transpose(q) # r
9.00000 2.00000 6.00000
4.00000 8.00000 7.00000
*Full disclosure: I work for Tech-X and I am the product manager for GPULib.*
[^1]: CPU was a 6-core Intel Core i7-3960X @ 3.30 GHz; GPU was a Tesla C2070 with 1279 MB of memory.