n smallest elements of an array is a common problem with an informative solution. It is not necessary to sort the entire array to solve this problem. To solve it, I’m going to use
HISTOGRAM in a slightly non-intuitive manner (or maybe it is intuitive if you’ve read J.D. Smith’s Histogram Tutorial)
The code quoted here is from a simplified
MG_N_SMALLEST routine that finds only the smallest elements of an array (the actual code can also find the largest elements of the array).
function mg_n_smallest, data, n<br />compile_opt strictarr ; use histogram to find a set with more elements than n ; of smallest elements nData = n_elements(data) nBins = nData / n h = histogram(data, nbins=nBins, reverse_indices=ri) ; loop through the bins until we have n or more elements nCandidates = 0L for bin = 0L, nBins - 1L do begin nCandidates += h[bin] if (nCandidates ge n) then break endfor ; get the candidates and sort them candidates = ri[ri : ri[bin + 1L] - 1L] sortedCandidates = sort(data[candidates]) ; return the proper n of them return, (candidates[sortedCandidates])[0:n-1L] end
This routine assumes that the data set is close to uniformly distributed. It uses a histogram to find a superset of candidates for the
n smallest, then sorts them to find exactly the right ones. If the data is non-uniform, the sorting step will have to sort more than
n elements, slowing the routine to the extent of the non-uniformity.
Let’s follow through the lines of code with a simple example data set.
IDL> data = randomu(seed, 10) IDL> n = 3 IDL> print, data 0.115125 0.283748 0.942899 0.770475 0.818332 0.138929 0.175479 0.663068 0.806682 0.974345 IDL> nData = n_elements(data) IDL> nBins = nData / n
The calculation of the histogram is (hopefully) where all the work will be done. (Remember, in bad cases, the
SORT routine will take up the most time.)
IDL> h = histogram(data, nbins=nBins, reverse_indices=ri) IDL> print, h 4 5 1 IDL> print, ri 4 8 13 14 0 1 5 6 2 3 4 7 8 9
The four elements in the first bin of the histogram are the candidates, but which elements are they? The output from the
REVERSE_INDICES keyword of
HISTOGRAM indicates which elements fall in each bin. It is an index array in two parts; what makes it confusing is the first part is an index into the second part. In our example here, the parts are divided as (
ri below with the indices of the elements of
ri above it):
0 1 2 3 | 4 5 6 7 8 9 10 11 12 13 4 8 13 14 | 0 1 5 6 2 3 4 7 8 9
The first part,
4 8 13 14, gives the starting indices of each of the three bins (plus a dummy last bin to indicate where the real last bin ends). These indices are indices into
ri itself. So the first bin starts at index 4 (of
ri), the second bin starts at index 8, and the third bin starts at index 13. The bins end one before the next bin starts (which is why the dummy last bin is there). So we have, the first bin in
ri[4:7], the second bin in
ri[8:12], and the third bin in
ri[13:13]. Now, the “values” in the bins are actually indices into the original data set. So values in the first bin are:
IDL> print, ri[4:7] 0 1 5 6 IDL> print, d[ri[4:7]] 0.115125 0.283748 0.138929 0.175479
If you wanted to do this in general for bin
i, it would be
ri[ri[i]:ri[i+1] - 1]
We’ll use this formula in a second. Back to our example, we now have the (four) candidates for the three smallest elements of the array. In the general case, there might not be
n elements here, so we might have to include more bins than just the first one.
nCandidates = 0L for bin = 0L, nBins - 1L do begin nCandidates += h[bin] if (nCandidates ge n) then break endfor
This code will stop as soon as we have have enough candidates, i.e.
n or more. We know how many bins (and hence how many candidates), but we don’t know which elements are the candidates yet. They are:
candidates = ri[ri : ri[bin + 1L] - 1L]
We sort the candidates to find the smallest.
sortedCandidates = sort(data[candidates])
Now that we have sorted the candidates, just return the first
n of them (remember that
SORT returns indices into the original data array).
IDL> print, (candidates[sortedCandidates])[0:n-1L] 0 5 6
Here is the source code for