Clustered ls
posted Tue 28 Aug 2018 by Michael Galloy under IDL
When working with data files at my day job, I often come across directories containing a large number of files of several distinct types. It would be useful to produce a listing of the files clustered into these types. I wrote cls (Clustered ls) to find patterns in filenames for display.
For example, in a directory with 7552 files, the ls output is difficult to parse easily. But cls produces a much more compact listing that can be scanned:
$ cls
*.ls [4 files]
20180101_*_kcor_l1.5.fts.gz [1335 files]
20180101_*_kcor_l1.5.gif [1335 files]
20180101_*_kcor_l1.5_avg.fts.gz [167 files]
20180101_*_kcor_l1.5_avg.gif [167 files]
20180101_*_kcor_l1.5_avg_cropped.gif [167 files]
20180101_*_kcor_l1.5_cropped.gif [1335 files]
20180101_*_kcor_l1.5_nrgf.fts.gz [167 files]
20180101_*_kcor_l1.5_nrgf.gif [167 files]
20180101_*_kcor_l1.5_nrgf_cropped.gif [167 files]
20180101_180511_kcor_l1.5_extavg.fts.gz
20180101_180511_kcor_l1.5_extavg.gif
20180101_180511_kcor_l1.5_extavg_cropped.gif
20180101_180511_kcor_l1.5_nrgf_extavg.fts.gz
20180101_180511_kcor_l1.5_nrgf_extavg.gif
20180101_180511_kcor_l1.5_nrgf_extavg_cropped.gif
20180101_kcor_l1.5.{mp4,tarlist,tgz}
20180101_kcor_l1.5_nrgf_cropped.mp4
20180101_kcor_l1.5_{cropped,nrgf}.mp4
20180101_kcor_minus.mp4
20180102_*_kcor_l1.5.fts.gz [675 files]
20180102_*_kcor_l1.5.gif [675 files]
20180102_*_kcor_l1.5_avg.fts.gz [84 files]
20180102_*_kcor_l1.5_avg.gif [84 files]
20180102_*_kcor_l1.5_avg_cropped.gif [84 files]
20180102_*_kcor_l1.5_cropped.gif [675 files]
20180102_*_kcor_l1.5_nrgf.fts.gz [84 files]
20180102_*_kcor_l1.5_nrgf.gif [84 files]
20180102_*_kcor_l1.5_nrgf_cropped.gif [84 files]
The arguments for the script are quite simple right now:
usage: cls [-h] [-m MAX_LISTED] [files [files ...]]
Clustered ls 0.0.1
positional arguments:
files path specification to check
optional arguments:
-h, --help show this help message and exit
-m MAX_LISTED, --max-listed MAX_LISTED
max number of files to explicitly list
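A command line interface like the one in the help text above could be declared with argparse roughly as follows. This is a minimal sketch, not the actual cls source: the function name build_parser is hypothetical, and the default of 3 for -m is assumed from the behavior described in this post.

```python
import argparse


def build_parser():
    """Build a parser mirroring the help text shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="cls", description="Clustered ls 0.0.1")
    # Zero or more paths; with none given, a real tool would presumably
    # fall back to the current directory (an assumption here).
    parser.add_argument("files", nargs="*", help="path specification to check")
    parser.add_argument(
        "-m", "--max-listed", type=int, default=3,
        help="max number of files to explicitly list",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.max_listed, args.files)
```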
The -m argument sets the maximum number of files to list explicitly within a cluster, 3 by default; larger clusters are collapsed to “*” with a file count. For example, the line in the above output:
*.ls [4 files]
is changed to:
{okcgif,okfgif,okl1gz,oknrgf}.ls
if you use:
$ cls -m 4
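The cutoff rule can be sketched in a few lines of Python. The function format_cluster and its parameters are hypothetical names chosen for illustration, and the sketch assumes a cluster is a shared prefix and suffix around one varying part; the real cls code may represent clusters differently.

```python
def format_cluster(prefix, variants, suffix, max_listed=3):
    """Render one cluster as a single output line (illustrative only)."""
    if len(variants) == 1:
        # A lone file is printed as-is.
        return prefix + variants[0] + suffix
    if len(variants) > max_listed:
        # Too many variants to list: collapse to "*" and report the count.
        return f"{prefix}*{suffix} [{len(variants)} files]"
    # Few enough variants: list them explicitly in braces.
    return prefix + "{" + ",".join(variants) + "}" + suffix


names = ["okcgif", "okfgif", "okl1gz", "oknrgf"]
print(format_cluster("", names, ".ls"))                # *.ls [4 files]
print(format_cluster("", names, ".ls", max_listed=4))  # {okcgif,okfgif,okl1gz,oknrgf}.ls
```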
The current code for cls uses a naive method for determining the clusters and does not try to produce optimal ones. The output can therefore vary, because there can be multiple ways to cluster a list of files. For example, in a directory with the following files:
$ ls
a-1.log a-2.log b-1.log b-2.log c-1.log c-2.log
cls can produce two different clusterings depending on how the original files are ordered:
$ cls *
a-{1,2}.log
b-{1,2}.log
c-{1,2}.log
$ cls *-{1,2}.log
{a,b,c}-1.log
{a,b,c}-2.log
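To see why input order matters, here is one possible greedy scheme that reproduces both results above. It is a sketch under stated assumptions, not the algorithm cls actually uses: names are split into tokens on "-", "_" and ".", each name joins the first existing cluster that differs from it in exactly one token position, and input names are assumed to be distinct. The names tokenize, cluster and render are hypothetical.

```python
import re


def tokenize(name):
    # Split on '-', '_' and '.', keeping the delimiters as separate tokens.
    return [t for t in re.split(r"([-_.])", name) if t]


def cluster(names):
    """Greedy, order-dependent clustering (illustrative only)."""
    clusters = []  # each: {"tokens": [...], "var": index or None, "values": [...]}
    for name in names:
        tokens = tokenize(name)
        for c in clusters:
            if len(c["tokens"]) != len(tokens):
                continue
            # Positions that differ, ignoring the cluster's varying slot.
            diff = [i for i, (a, b) in enumerate(zip(c["tokens"], tokens))
                    if a != b and i != c["var"]]
            if c["var"] is None and len(diff) == 1:
                # The second member fixes which position varies.
                c["var"] = diff[0]
                c["values"] = [c["tokens"][diff[0]], tokens[diff[0]]]
                break
            if c["var"] is not None and not diff:
                c["values"].append(tokens[c["var"]])
                break
        else:
            # No cluster accepted the name, so it starts a new one.
            clusters.append({"tokens": tokens, "var": None, "values": []})
    return clusters


def render(c):
    parts = list(c["tokens"])
    if c["var"] is not None:
        parts[c["var"]] = "{" + ",".join(c["values"]) + "}"
    return "".join(parts)


if __name__ == "__main__":
    by_letter = ["a-1.log", "a-2.log", "b-1.log", "b-2.log", "c-1.log", "c-2.log"]
    by_number = ["a-1.log", "b-1.log", "c-1.log", "a-2.log", "b-2.log", "c-2.log"]
    print([render(c) for c in cluster(by_letter)])
    print([render(c) for c in cluster(by_number)])
```

Because the first two names fix which position varies, the same six files end up grouped by letter in one ordering and by number in the other. An optimal method would have to compare the candidate clusterings rather than commit to the first match.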
See my scripts repo for the code for cls. This code is a proof-of-concept prototype written in Python 3.