I’ve got a rather large dataset that I need to do a lot of processing on, over several iterations. It’s a 20GB gzipped file of flat text, and I’m impatient and don’t like not knowing things!
My new favourite Linux command-line tool, pv (pipe viewer), is totally awesome. Check this out:
pv -cN source < urls.gz | zcat | pv -cN zcat | perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b unless $b =~ /ac\.uk/; print $c unless $c =~ /ac\.uk/' | pv -cN perl | gzip | pv -cN gzip > hosts.gz
zcat: 93.4GiB 1:33:18 [26.6MiB/s] [ <=> ]
perl: 85.7GiB 1:33:18 [25.3MiB/s] [ <=> ]
source: 13.2GiB 1:33:17 [3.57MiB/s] [===============================================> ] 67% ETA 0:44:41
gzip: 12.7GiB 1:33:18 [3.51MiB/s] [ <=> ]
I’m basically splitting some text, removing stuff I don’t want and doing:
zcat urls.gz | perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b unless $b =~ /ac\.uk/; print $c unless $c =~ /ac\.uk/' | gzip > hosts.gz
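To make that concrete, here’s the same perl filter over a couple of made-up lines (the field layout here is my guess, since the real data isn’t shown):

printf 'id1|www.example.com|mirror.example.net|2009\nid2|foo.ac.uk|bar.example.org|2009\n' | perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b unless $b =~ /ac\.uk/; print $c unless $c =~ /ac\.uk/'

which prints the non-ac.uk hosts from each line:

www.example.com
mirror.example.net
bar.example.org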
But at appropriate moments I’ve piped the output into the pv pipe viewer tool to report some metrics. FYI, the -N flag lets me set a name for each pv instance, and the -c flag enables cursor positioning so that multiple instances of pv can each draw on their own line!
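If you haven’t used pv before, here’s the same trick in miniature (the file names are just placeholders):

pv -cN raw < access.log | gzip | pv -cN gzip > access.log.gz

Two named gauges, one per pipeline stage; without -c their progress lines would scribble over each other.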
The reason pipe viewer is totally cool is the extra sneaky data we get!
Pipe Viewer Is Magic
Because the first instance of pv is reading the urls.gz file itself, it can display how much of the file it’s processed and roughly how long until it completes. MOST USEFUL THING EVER!

Also, I had no idea how large the dataset would be uncompressed, and was hesitant to extract it without knowing how much disk it would need. We can see from the pv instance named zcat that zcat has so far spat out 93.4GB of data; since we’re 67% of the way through the input, we can predict the extracted file would come to roughly 140GB. How cool is that?

We can also tell from the pv named perl that splitting and removing the data we don’t want has so far shaved off around 8GB, which is kinda interesting to splurge over for a bit. And lastly, the pv instance named gzip is telling us the size of the output file we’ve generated so far.
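That 140GB figure is just straight-line extrapolation from pv’s own numbers above (93.4GiB out of zcat with 67% of the compressed input consumed):

echo 'scale=1; 93.4 / 0.67' | bc

which gives 139.4, near enough 140. And if you trust an estimate like that, pv will take it via its -s flag, so even a mid-pipeline gauge can show a percentage and ETA instead of a bare byte count, e.g. zcat urls.gz | pv -cN zcat -s 140G | … (the 140G here being our guess, not a measured size).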
This is totally rad.
Note: Many thanks to Norway for forcing me to rewrite my initial one-liner of
zcat urls.gz | sed 's/|/ /g' | while read a b c d ; do echo $b ; echo $c ; done | grep -v ac.uk$ | gzip > hosts.gz
by glaring at me.
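For the curious: beyond aesthetics, the rewrite also helps because a shell while read loop processes every line inside the shell itself, which hurts at this scale. A rough way to see the difference on a sample (illustrative only, not a proper benchmark):

zcat urls.gz | head -n 1000000 > sample.txt
time sh -c 'sed "s/|/ /g" sample.txt | while read a b c d ; do echo $b ; echo $c ; done > /dev/null'
time perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b; print $c' sample.txt > /dev/null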