Parallelize repetitive tasks on a list of files
Many repetitive jobs can be performed more efficiently if you utilize more of your computer's resources (i.e. CPUs and RAM). Below is an example of running multiple jobs in parallel.
Suppose you have a list of files, say the output from ls. Suppose also that these files are bz2-compressed and the following tasks need to be performed on them, in order:
- Decompress the bz2 files to stdout using bzcat
- Filter (grep) lines containing specific keyword(s) using grep <some keyword>
- Pipe the output to be concatenated into one single gzipped file using gzip
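As a sketch, the three steps above chain together into one pipeline for a single file. The file name input.bz2 and the sample contents here are made up for illustration:

```shell
# create a sample bz2-compressed file (for demonstration only)
printf 'I love puppies\nno match here\npuppies again\n' | bzip2 > input.bz2

# decompress -> filter -> compress into a single gzipped file
bzcat input.bz2 | grep puppies | gzip > output.gz

# inspect the result; shows the two matching lines
zcat output.gz
```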
Running this using a while-loop may look like this:

filenames="file_list.txt"
while read -r line
do
    ## grab lines with puppies in them
    bzcat "$line" | grep puppies | gzip >> output.gz
done < "$filenames"
Using GNU Parallel, we can run 3 parallel jobs at once by simply doing:

parallel -j 3 "bzcat {} | grep puppies" ::: $( cat file_list.txt ) | gzip > output.gz
This command is simple, concise, and more efficient when the number of files and the file sizes are large. The jobs are initiated by parallel: the option -j 3 runs up to 3 jobs at a time, and the input arguments for the jobs are supplied after :::. The combined output is eventually piped to gzip > output.gz.