I’ve got a rather large dataset that I need to do a lot of processing on, over several iterations, it’s a 20gb zip file, flat text, and I’m impatient and don’t like not knowing things!
My new favourite Linux command line tool, pv (pipe viewer) is totally awesome. Check this out:
<br /> pv -cN source < urls.gz | zcat | pv -cN zcat | perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b unless $b =~ /ac\.uk/; print $c unless $c =~ /ac\.uk/' | pv -cN perl | gzip | pv -cN gzip > hosts.gz<br /> zcat: 93.4GiB 1:33:18 [26.6MiB/s] [ <=> ]<br /> perl: 85.7GiB 1:33:18 [25.3MiB/s] [ <=> ]<br /> source: 13.2GiB 1:33:17 [3.57MiB/s] [===============================================> ] 67% ETA 0:44:41<br /> gzip: 12.7GiB 1:33:18 [3.51MiB/s] [ <=> ]<br />
