When working with data files at my day job, I often come across directories containing a large number of files of several distinct types. It would be useful to produce a listing of the files clustered into these types. I wrote cls (Clustered ls) to find patterns in filenames and display the files grouped by those patterns.

For example, in a directory with 7552 files, the ls output is difficult to scan. cls produces a much more compact listing:

$ cls
*.ls [4 files]
20180101_*_kcor_l1.5.fts.gz [1335 files]
20180101_*_kcor_l1.5.gif [1335 files]
20180101_*_kcor_l1.5_avg.fts.gz [167 files]
20180101_*_kcor_l1.5_avg.gif [167 files]
20180101_*_kcor_l1.5_avg_cropped.gif [167 files]
20180101_*_kcor_l1.5_cropped.gif [1335 files]
20180101_*_kcor_l1.5_nrgf.fts.gz [167 files]
20180101_*_kcor_l1.5_nrgf.gif [167 files]
20180101_*_kcor_l1.5_nrgf_cropped.gif [167 files]
20180101_180511_kcor_l1.5_extavg.fts.gz
20180101_180511_kcor_l1.5_extavg.gif
20180101_180511_kcor_l1.5_extavg_cropped.gif
20180101_180511_kcor_l1.5_nrgf_extavg.fts.gz
20180101_180511_kcor_l1.5_nrgf_extavg.gif
20180101_180511_kcor_l1.5_nrgf_extavg_cropped.gif
20180101_kcor_l1.5.{mp4,tarlist,tgz}
20180101_kcor_l1.5_nrgf_cropped.mp4
20180101_kcor_l1.5_{cropped,nrgf}.mp4
20180101_kcor_minus.mp4
20180102_*_kcor_l1.5.fts.gz [675 files]
20180102_*_kcor_l1.5.gif [675 files]
20180102_*_kcor_l1.5_avg.fts.gz [84 files]
20180102_*_kcor_l1.5_avg.gif [84 files]
20180102_*_kcor_l1.5_avg_cropped.gif [84 files]
20180102_*_kcor_l1.5_cropped.gif [675 files]
20180102_*_kcor_l1.5_nrgf.fts.gz [84 files]
20180102_*_kcor_l1.5_nrgf.gif [84 files]
20180102_*_kcor_l1.5_nrgf_cropped.gif [84 files]

The arguments for the script are quite simple right now:

usage: cls [-h] [-m MAX_LISTED] [files [files ...]]

Clustered ls 0.0.1

positional arguments:
  files                 path specification to check

optional arguments:
  -h, --help            show this help message and exit
  -m MAX_LISTED, --max-listed MAX_LISTED
                        max number of files to explicitly list
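
The usage text above has the shape that Python's argparse module generates. As a rough sketch (not the script's actual source), a parser along the following lines would accept the same arguments. The function name build_parser is hypothetical, and the default of 3 is taken from the description of -m in this post.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the command line; names mirror the usage text.
    parser = argparse.ArgumentParser(prog="cls", description="Clustered ls 0.0.1")
    parser.add_argument("files", nargs="*", help="path specification to check")
    parser.add_argument(
        "-m", "--max-listed", type=int, default=3,
        help="max number of files to explicitly list",
    )
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list instead of sys.argv so the example is repeatable.
    args = build_parser().parse_args(["-m", "4", "a-1.log", "a-2.log"])
    print(args.max_listed, args.files)
```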

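The -m argument sets the maximum number of files in a cluster that are listed explicitly; a cluster with more files than that is collapsed to “*” with a file count. The default is 3. For example, the line in the above output: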

*.ls [4 files]

is changed to:

{okcgif,okfgif,okl1gz,oknrgf}.ls

if you use:

$ cls -m 4
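
The cutoff can be pictured as a small formatting rule. The sketch below is an illustration under assumptions, not the code from cls: format_cluster and its parameters are made-up names, and it assumes a cluster is described by a common prefix, a common suffix, and the list of parts that vary.

```python
def format_cluster(prefix, variants, suffix, max_listed=3):
    """Render one cluster as a single line (hypothetical helper)."""
    if len(variants) == 1:
        # A lone file is printed as-is.
        return prefix + variants[0] + suffix
    if len(variants) <= max_listed:
        # Small clusters list every varying part inside braces.
        return prefix + "{" + ",".join(sorted(variants)) + "}" + suffix
    # Larger clusters collapse to a wildcard plus a count.
    return "{}*{} [{} files]".format(prefix, suffix, len(variants))


if __name__ == "__main__":
    names = ["okcgif", "okfgif", "okl1gz", "oknrgf"]
    print(format_cluster("", names, ".ls"))                # *.ls [4 files]
    print(format_cluster("", names, ".ls", max_listed=4))  # {okcgif,okfgif,okl1gz,oknrgf}.ls
```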

The current code for cls uses a naive method for determining the clusters and does not try to produce optimal ones. Because there can be multiple ways to cluster a list of files, the output can vary. For example, in a directory with the following files:

$ ls
a-1.log  a-2.log  b-1.log  b-2.log  c-1.log  c-2.log

cls can produce two different clusterings depending on the order in which the files are given:

$ cls *
a-{1,2}.log
b-{1,2}.log
c-{1,2}.log
$ cls *-{1,2}.log
{a,b,c}-1.log
{a,b,c}-2.log
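
To see why the order matters, here is a minimal sketch of one possible naive approach: a greedy, first-fit pass over the names. It is an assumption for illustration and not necessarily what cls does. The names tokenize and cluster_names are hypothetical, names are split on "-", "_" and ".", and a cluster may vary in only one token position.

```python
import re


def tokenize(name):
    # Split on '-', '_' and '.', keeping the delimiters as their own tokens.
    return re.split(r"([-_.])", name)


def cluster_names(names):
    """Greedy first-fit clustering: each name joins the first cluster it fits."""
    clusters = []  # each cluster: first member's tokens, varying index, values seen
    for name in names:
        tokens = tokenize(name)
        for c in clusters:
            if len(c["tokens"]) != len(tokens):
                continue
            diff = [i for i, (a, b) in enumerate(zip(c["tokens"], tokens)) if a != b]
            if c["vary"] is None and len(diff) == 1:
                # Second member fixes which token position is allowed to vary.
                c["vary"] = diff[0]
                c["values"] = [c["tokens"][diff[0]], tokens[diff[0]]]
                break
            if c["vary"] is not None and diff == [c["vary"]]:
                c["values"].append(tokens[c["vary"]])
                break
        else:
            # No existing cluster fits, so start a new one.
            clusters.append({"tokens": tokens, "vary": None, "values": []})

    lines = []
    for c in clusters:
        out = list(c["tokens"])
        if c["vary"] is not None:
            out[c["vary"]] = "{" + ",".join(c["values"]) + "}"
        lines.append("".join(out))
    return lines


if __name__ == "__main__":
    # Order produced by the shell for `*`.
    print(cluster_names(["a-1.log", "a-2.log", "b-1.log", "b-2.log", "c-1.log", "c-2.log"]))
    # Order produced by the shell for `*-{1,2}.log`.
    print(cluster_names(["a-1.log", "b-1.log", "c-1.log", "a-2.log", "b-2.log", "c-2.log"]))
```

Because each name is placed as soon as it fits somewhere, whichever neighbor comes second decides the position that varies, and the rest of the cluster follows from that choice.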

See my scripts repo for the cls code. It is a proof-of-concept prototype written in Python 3.