Parallelize repetitive tasks on list of files

Many repetitive jobs can be performed more efficiently if you use more of your computer's resources (i.e. CPUs and RAM). Below is an example of running multiple jobs in parallel.

Suppose you have a <list of files>, say the output of ls. Suppose also that these files are bz2-compressed and that the following tasks need to be performed on each of them, in order.

  1. Decompress the bz2 files using bzcat to stdout
  2. Filter (i.e. grep) for lines containing specific keyword(s) using grep <some key word>
  3. Pipe the output to be concatenated into one single gzipped file using gzip

Running this using a while-loop may look like this:

# filelist.txt holds one file name per line
while read -r line
do
     ## grab lines with puppies in them and append them, gzipped, to the output
     bzcat "$line" | grep puppies | gzip >> output.gz
done < filelist.txt
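A variation on the loop above, offered as a sketch rather than the one right way: compress once at the end of the loop instead of appending a separate gzip stream per file. The function name filter_sequential and its three arguments are assumptions made for illustration; it expects bzcat, grep and gzip on the PATH and a list file with one file name per line.

```shell
#!/usr/bin/env bash
# filter_sequential KEYWORD LISTFILE OUTFILE   (hypothetical helper)
# Decompress every file named in LISTFILE, keep the lines that contain
# KEYWORD, and gzip the combined result in a single pass.
filter_sequential() {
    local keyword="$1"
    local listfile="$2"
    local outfile="$3"
    local file

    while IFS= read -r file
    do
        # grep exits non-zero when a file has no matching line;
        # that is not an error here, so ignore it
        bzcat "$file" | grep -e "$keyword" || true
    done < "$listfile" | gzip > "$outfile"
}
```

It could be called as filter_sequential puppies filelist.txt output.gz. The files are still processed one at a time; only the compression step changes.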

Using GNU Parallel, we can run 3 jobs at once by simply doing:

parallel -j 3 "bzcat {} | grep puppies" ::: $( cat filelist.txt ) | gzip > output.gz

This command is simple and concise, and it is more efficient when the number of files and the file sizes are large. parallel starts the jobs, the option -j 3 runs 3 of them at a time, and the arguments after ::: supply the input, one file name per job. The combined output of all jobs is then piped to gzip > output.gz.
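If file names may contain spaces, the $( cat filelist.txt ) expansion splits them into several arguments. A minimal sketch that avoids this, assuming GNU Parallel (not the moreutils tool of the same name): read the list with ::::, which takes one argument per line of the file, and add -k so the output follows the order of the input. The function name filter_parallel and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# filter_parallel LISTFILE OUTFILE [JOBS]   (hypothetical helper)
# Run up to JOBS (default 3) "bzcat | grep" pipelines at once, one per
# file named in LISTFILE, and gzip the combined output into OUTFILE.
filter_parallel() {
    local listfile="$1"
    local outfile="$2"
    local jobs="${3:-3}"

    # :::: reads one argument per line from the list file
    # -k   keeps the output in the same order as the input
    # {}   is replaced by the file name, quoted for the shell
    parallel -j "$jobs" -k 'bzcat {} | grep puppies' :::: "$listfile" \
        | gzip > "$outfile"
}
```

It could be called as filter_parallel filelist.txt output.gz 3. Without -k the output of each job is still kept together, but the jobs appear in the order in which they finish.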
