Speed up data processing using GNU Parallel

GNU Parallel (https://www.gnu.org/software/parallel/) is a command-line utility for running tasks in parallel across multiple CPU cores. It promises to speed up computationally intensive tasks on multi-core machines.


To test this tool we’ll download a set of Digital Elevation Models (DEM) from Corsica (.ASC files) and process them in GDAL (reproject and convert to TIFF files).

First let’s download the DEM files from https://geoservices.ign.fr/rgealti (a 285 MB 7z archive):

$ wget --no-check-certificate 'https://wxs.ign.fr/9u5z4x13jqu05fb3o9cux5e1/telechargement/prepackage/RGEALTI-5M_PACK_CORSE_16-06-2021$RGEALTI_2-0_5M_ASC_LAMB93-IGN78C_D02A_2020-04-16/file/RGEALTI_2-0_5M_ASC_LAMB93-IGN78C_D02A_2020-04-16.7z'

After extracting the archive there are a lot of folders with long names, but somewhere among them are 220 ASC files. An example filename is ‘RGEALTI_FXX_1155_6135_MNT_LAMB93_IGN78C.asc’. We can load them in QGIS (using projection EPSG:2154).
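To track down the ASC files inside the nested folders, a recursive search is one option. The snippet below is a minimal sketch: `count_asc` is a hypothetical helper name (not part of the original workflow), and it assumes the archive has already been extracted somewhere below the directory you pass in.

```shell
# Hypothetical helper: count the .asc files anywhere below a directory
# (defaults to the current directory when no argument is given).
count_asc() {
  find "${1:-.}" -type f -name '*.asc' | wc -l | tr -d ' '
}
```

Running `count_asc .` from the extraction directory should report 220 if the dataset matches the one described here. Note that the commands further down use `find *.asc`, which only sees files in the current directory, so they need to be run from the folder that actually holds the ASC files.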


Now let’s process the 220 files with GNU Parallel:

$ time find *.asc | parallel --bar 'gdalwarp -s_srs EPSG:2154 -t_srs EPSG:4326 {} {=s:RGEALTI:warped:;s:asc:tif:=}'

The result of this process is a set of 220 GeoTIFFs in EPSG:4326.

A lot of things are happening in this single line command:

– Get a list of ASC files (find *.asc)

– Run gdalwarp on each file (reprojecting from EPSG:2154 to EPSG:4326)

– Name the output files: replace the prefix ‘RGEALTI’ with ‘warped’ and the extension ‘asc’ with ‘tif’

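The syntax ‘{=s:RGEALTI:warped:;s:asc:tif:=}’ for building the output file names is a bit cryptic at first, but it’s very powerful: the text between ‘{=’ and ‘=}’ is a Perl expression that GNU Parallel applies to each input file name.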
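To see what the two substitutions do, they can be tried on the example filename without running gdalwarp at all. The sketch below uses sed as a stand-in for the Perl expression (an assumption made purely for illustration; GNU Parallel itself evaluates the expression in Perl). For these simple patterns both tools replace only the first match.

```shell
# The same two substitutions as in the {= ... =} expression, run through sed
# so the renaming can be checked without touching any data.
in='RGEALTI_FXX_1155_6135_MNT_LAMB93_IGN78C.asc'
out=$(printf '%s\n' "$in" | sed -e 's:RGEALTI:warped:' -e 's:asc:tif:')
echo "$out"   # prints warped_FXX_1155_6135_MNT_LAMB93_IGN78C.tif
```

Because the second pattern is not anchored, it replaces the first ‘asc’ found anywhere in the name; that is fine for these filenames, but anchoring it to the extension would be stricter. Adding `--dry-run` to the parallel command is another way to check: it prints the generated gdalwarp commands instead of running them.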

The process with Parallel took 1m32.028s on my machine for 220 ASC files.

But how long does it take when using only 1 CPU? For this we add ‘-j 1’ (one job at a time) and run again:

$ time find *.asc | parallel -j 1 --bar 'gdalwarp -s_srs EPSG:2154 -t_srs EPSG:4326 {} {=s:RGEALTI:warped:;s:asc:tif:=}'

Now it takes 7m1.548s.

So the parallel run is about 4.6 times faster, which saves a lot of time! Imagine the time saved when running on very large datasets on a machine with many CPUs…
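The speedup figure comes from dividing the two wall-clock times. Here is a small sketch of that arithmetic, where `to_seconds` is a hypothetical helper that converts the ‘XmY.YYYs’ format printed by `time` into seconds:

```shell
# Hypothetical helper: convert a "7m1.548s"-style duration to seconds.
to_seconds() {
  printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

t_parallel=$(to_seconds 1m32.028s)   # run with the default number of jobs
t_single=$(to_seconds 7m1.548s)      # run with -j 1

# Speedup = single-job time / parallel time
speedup=$(LC_ALL=C awk -v s="$t_single" -v p="$t_parallel" 'BEGIN { printf "%.2f", s / p }')
echo "speedup: ${speedup}x"   # prints speedup: 4.58x
```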

The nice thing is that we can apply GNU Parallel to almost any kind of data processing that can be split into independent tasks, so this is a huge time/money saver.
