## ★ Finding the n smallest elements in an array

posted Fri 2 Jun 2006 by Michael Galloy under IDL, OptimizationFinding the `n`

smallest elements of an array is a common problem with an informative solution. It is not necessary to sort the entire array to solve this problem. To solve it, Iâ€™m going to use `HISTOGRAM`

in a slightly non-intuitive manner (or maybe it is intuitive if youâ€™ve read J.D. Smithâ€™s Histogram Tutorial)

The code quoted here is from a simplified `MG_N_SMALLEST`

routine that finds only the smallest elements of an array (the actual code can also find the largest elements of the array).

```
function mg_n_smallest, data, n<br />compile_opt strictarr
; use histogram to find a set with more elements than n
; of smallest elements
nData = n_elements(data)
nBins = nData / n
h = histogram(data, nbins=nBins, reverse_indices=ri)
; loop through the bins until we have n or more elements
nCandidates = 0L
for bin = 0L, nBins - 1L do begin
nCandidates += h[bin]
if (nCandidates ge n) then break
endfor
; get the candidates and sort them
candidates = ri[ri[0] : ri[bin + 1L] - 1L]
sortedCandidates = sort(data[candidates])
; return the proper n of them
return, (candidates[sortedCandidates])[0:n-1L]
end
```

This routine assumes that the data set is close to uniformly distributed. It uses a histogram to find a superset of candidates for the `n`

smallest, then sorts them to find exactly the right ones. If the data is non-uniform, the sorting step will have to sort more than `n`

elements, slowing the routine to the extent of the non-uniformity.

Letâ€™s follow through the lines of code with a simple example data set.

```
IDL> data = randomu(seed, 10)
IDL> n = 3
IDL> print, data
0.115125 0.283748 0.942899 0.770475 0.818332
0.138929 0.175479 0.663068 0.806682 0.974345
IDL> nData = n_elements(data)
IDL> nBins = nData / n
```

The calculation of the histogram is (hopefully) where all the work will be done. (Remember, in bad cases, the `SORT`

routine will take up the most time.)

```
IDL> h = histogram(data, nbins=nBins, reverse_indices=ri)
IDL> print, h
4 5 1
IDL> print, ri
4 8 13 14 0
1 5 6 2 3
4 7 8 9
```

The four elements in the first bin of the histogram are the candidates, but which elements are they? The output from the `REVERSE_INDICES`

keyword of `HISTOGRAM`

indicates which elements fall in each bin. It is an index array in two parts; what makes it confusing is the first part is an index into the second part. In our example here, the parts are divided as (`ri`

below with the indices of the elements of `ri`

above it):

```
0 1 2 3 | 4 5 6 7 8 9 10 11 12 13
4 8 13 14 | 0 1 5 6 2 3 4 7 8 9
```

The first part, `4 8 13 14`

, gives the starting indices of each of the three bins (plus a dummy last bin to indicate where the real last bin ends). These indices are indices into `ri`

itself. So the first bin starts at index 4 (of `ri`

), the second bin starts at index 8, and the third bin starts at index 13. The bins end one before the next bin starts (which is why the dummy last bin is there). So we have, the first bin in `ri[4:7]`

, the second bin in `ri[8:12]`

, and the third bin in `ri[13:13]`

. Now, the â€śvaluesâ€ť in the bins are actually indices into the original data set. So values in the first bin are:

```
IDL> print, ri[4:7]
0 1 5 6
IDL> print, d[ri[4:7]]
0.115125 0.283748 0.138929 0.175479
```

If you wanted to do this in general for bin `i`

, it would be

```
ri[ri[i]:ri[i+1] - 1]
```

Weâ€™ll use this formula in a second. Back to our example, we now have the (four) candidates for the three smallest elements of the array. In the general case, there might not be `n`

elements here, so we might have to include more bins than just the first one.

```
nCandidates = 0L
for bin = 0L, nBins - 1L do begin
nCandidates += h[bin]
if (nCandidates ge n) then break
endfor
```

This code will stop as soon as we have have enough candidates, i.e. `n`

or more. We know how many bins (and hence how many candidates), but we donâ€™t know which elements are the candidates yet. They are:

```
candidates = ri[ri[0] : ri[bin + 1L] - 1L]
```

We sort the candidates to find the smallest.

```
sortedCandidates = sort(data[candidates])
```

Now that we have sorted the candidates, just return the first `n`

of them (remember that `SORT`

returns indices into the original data array).

```
IDL> print, (candidates[sortedCandidates])[0:n-1L]
0 5 6
```

Here is the source code for `MG_N_SMALLEST`

(~~docs~~).

July 22nd, 2014 at 11:50 am

[…] have a more detailed description of what MG_N_SMALLEST is doing in an older article. I like this routine as a good example of using HISTOGRAM and its REVERSE_INDICES keyword. It also […]