Finding the n smallest elements of an array is a common problem with an informative solution. It is not necessary to sort the entire array to solve this problem. To solve it, I’m going to use HISTOGRAM in a slightly non-intuitive manner (or maybe it is intuitive if you’ve read J.D. Smith’s Histogram Tutorial)

Find the n smallest elements of a data set

The code quoted here is from a simplified MG_N_SMALLEST routine that finds only the smallest elements of an array (the actual code can also find the largest elements of the array).

function mg_n_smallest, data, n<br />compile_opt strictarr
  ; use histogram to find a set with more elements than n
  ; of smallest elements
  nData = n_elements(data)
  nBins = nData / n
  h = histogram(data, nbins=nBins, reverse_indices=ri)

  ; loop through the bins until we have n or more elements
  nCandidates = 0L
  for bin = 0L, nBins - 1L do begin
    nCandidates += h[bin]
    if (nCandidates ge n) then break

  ; get the candidates and sort them
  candidates = ri[ri[0] : ri[bin + 1L] - 1L]
  sortedCandidates = sort(data[candidates])

  ; return the proper n of them
  return, (candidates[sortedCandidates])[0:n-1L]

This routine assumes that the data set is close to uniformly distributed. It uses a histogram to find a superset of candidates for the n smallest, then sorts them to find exactly the right ones. If the data is non-uniform, the sorting step will have to sort more than n elements, slowing the routine to the extent of the non-uniformity.

Let’s follow through the lines of code with a simple example data set.

IDL> data = randomu(seed, 10)
IDL> n = 3
IDL> print, data
0.115125 0.283748 0.942899 0.770475 0.818332
0.138929 0.175479 0.663068 0.806682 0.974345
IDL> nData = n_elements(data)
IDL> nBins = nData / n

The calculation of the histogram is (hopefully) where all the work will be done. (Remember, in bad cases, the SORT routine will take up the most time.)

IDL> h = histogram(data, nbins=nBins, reverse_indices=ri)
IDL> print, h
4 5 1
IDL> print, ri
4 8 13 14 0
1 5 6 2 3
4 7 8 9

The four elements in the first bin of the histogram are the candidates, but which elements are they? The output from the REVERSE_INDICES keyword of HISTOGRAM indicates which elements fall in each bin. It is an index array in two parts; what makes it confusing is the first part is an index into the second part. In our example here, the parts are divided as (ri below with the indices of the elements of ri above it):

0 1 2 3 | 4 5 6 7 8 9 10 11 12 13
4 8 13 14 | 0 1 5 6 2 3 4 7 8 9

The first part, 4 8 13 14, gives the starting indices of each of the three bins (plus a dummy last bin to indicate where the real last bin ends). These indices are indices into ri itself. So the first bin starts at index 4 (of ri), the second bin starts at index 8, and the third bin starts at index 13. The bins end one before the next bin starts (which is why the dummy last bin is there). So we have, the first bin in ri[4:7], the second bin in ri[8:12], and the third bin in ri[13:13]. Now, the “values” in the bins are actually indices into the original data set. So values in the first bin are:

IDL> print, ri[4:7]
0 1 5 6
IDL> print, d[ri[4:7]]
0.115125 0.283748 0.138929 0.175479

If you wanted to do this in general for bin i, it would be

ri[ri[i]:ri[i+1] - 1]

We’ll use this formula in a second. Back to our example, we now have the (four) candidates for the three smallest elements of the array. In the general case, there might not be n elements here, so we might have to include more bins than just the first one.

nCandidates = 0L
for bin = 0L, nBins - 1L do begin
  nCandidates += h[bin]
  if (nCandidates ge n) then break

This code will stop as soon as we have have enough candidates, i.e. n or more. We know how many bins (and hence how many candidates), but we don’t know which elements are the candidates yet. They are:

candidates = ri[ri[0] : ri[bin + 1L] - 1L]

We sort the candidates to find the smallest.

sortedCandidates = sort(data[candidates])

Now that we have sorted the candidates, just return the first n of them (remember that SORT returns indices into the original data array).

IDL> print, (candidates[sortedCandidates])[0:n-1L]
0 5 6

Here is the source code for MG_N_SMALLEST (docs).