Speed up data processing using GNU Parallel

GNU Parallel (https://www.gnu.org/software/parallel/) is a command-line utility for running tasks in parallel across multiple CPU cores. It promises to speed up computationally intensive tasks on multi-core machines.

https://www.amazon.com/GNU-Parallel-2018-Ole-Tange/dp/1387509888

To test this tool we’ll download a set of Digital Elevation Models (DEMs) of Corsica (.asc files) and process them with GDAL (reproject and convert to GeoTIFF files).

First let’s download the DEM files from https://geoservices.ign.fr/rgealti (a 285 MB 7z archive):

$ wget --no-check-certificate 'https://wxs.ign.fr/9u5z4x13jqu05fb3o9cux5e1/telechargement/prepackage/RGEALTI-5M_PACK_CORSE_16-06-2021$RGEALTI_2-0_5M_ASC_LAMB93-IGN78C_D02A_2020-04-16/file/RGEALTI_2-0_5M_ASC_LAMB93-IGN78C_D02A_2020-04-16.7z'

After extracting the archive there are a lot of folders with long names, but somewhere in them are 220 ASC files. An example filename is ‘RGEALTI_FXX_1155_6135_MNT_LAMB93_IGN78C.asc’. We can load them in QGIS (using projection EPSG:2154):

[Image: the DEM tiles loaded in QGIS]
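The commands further down list the files with ‘find *.asc’, which only sees files in the current directory. Since the archive scatters the ASC files over nested folders, it can help to gather them in one place first. Here is a minimal sketch of that step; the helper name ‘collect_asc’ and the directory names ‘extracted’ and ‘dem’ are my own placeholders, and the 7z call assumes p7zip is installed.

```shell
# Hypothetical helper: move every .asc file found under a source tree
# into one flat destination directory.
# The destination must lie outside the source tree.
collect_asc() {
  src_dir="$1"
  dest_dir="$2"
  mkdir -p "$dest_dir"
  find "$src_dir" -type f -name '*.asc' -exec mv {} "$dest_dir"/ \;
}

# Example usage (assumption: p7zip is installed, archive downloaded as above):
#   7z x RGEALTI_2-0_5M_ASC_LAMB93-IGN78C_D02A_2020-04-16.7z -oextracted
#   collect_asc extracted dem
#   cd dem
```

After this, running the processing commands from inside ‘dem’ picks up all the tiles.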

Now let’s process the 220 files with GNU Parallel (run this from the directory that contains the ASC files):

$ time find *.asc | parallel --bar 'gdalwarp -s_srs EPSG:2154 -t_srs EPSG:4326 {} {=s:RGEALTI:warped:;s:asc:tif:=}'

The result of this process is a set of 220 GeoTIFFs in EPSG:4326.
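With this many files it is worth checking that every input actually produced an output. A small sketch of such a check, assuming the inputs and the ‘warped…tif’ outputs sit in the same directory; the function name ‘check_outputs’ is a placeholder of mine, not part of GDAL or GNU Parallel.

```shell
# Hypothetical sanity check: compare the number of .asc inputs with the
# number of warped*.tif outputs in a directory (default: current directory).
# Returns success only when both counts match.
check_outputs() {
  dir="${1:-.}"
  n_in=$(find "$dir" -maxdepth 1 -type f -name '*.asc' | wc -l | tr -d ' ')
  n_out=$(find "$dir" -maxdepth 1 -type f -name 'warped*.tif' | wc -l | tr -d ' ')
  echo "inputs: $n_in, outputs: $n_out"
  [ "$n_in" -eq "$n_out" ]
}

# Example usage:
#   check_outputs . || echo "some tiles were not converted"
```

This only counts files; opening a few of the results with gdalinfo or in QGIS is still a good idea.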

A lot of things are happening in this single-line command:

– Get a list of ASC files (find *.asc)

– Run gdalwarp on every file (reprojecting from EPSG:2154 to EPSG:4326)

– Name the output files: replace the prefix ‘RGEALTI’ with ‘warped’ and the extension ‘asc’ with ‘tif’

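The syntax ‘{=s:RGEALTI:warped:;s:asc:tif:=}’ for building the output file names is a bit cryptic at first, but it’s very powerful: everything between ‘{=’ and ‘=}’ is a Perl expression that GNU Parallel applies to the input value, in this case two substitutions.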
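To see what those two substitutions do to a filename without running gdalwarp at all, the same expressions can be fed to sed, which accepts the same ‘s:old:new:’ form. This is only an illustration of the renaming; the function name ‘output_name’ is made up for this example.

```shell
# Hypothetical helper: apply the same two substitutions as the
# {= ... =} expression, using sed instead of GNU Parallel's Perl.
# Each substitution replaces only the first match in the name.
output_name() {
  printf '%s\n' "$1" | sed 's:RGEALTI:warped:; s:asc:tif:'
}

output_name 'RGEALTI_FXX_1155_6135_MNT_LAMB93_IGN78C.asc'
# prints: warped_FXX_1155_6135_MNT_LAMB93_IGN78C.tif

# With GNU Parallel installed, --dry-run prints the generated commands
# instead of executing them:
#   find *.asc | parallel --dry-run 'gdalwarp -s_srs EPSG:2154 -t_srs EPSG:4326 {} {=s:RGEALTI:warped:;s:asc:tif:=}'
```

Note that the first match wins, so a path with ‘RGEALTI’ in a folder name would have the folder part rewritten rather than the filename.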

The process with Parallel took 1m32.028s on my machine for 220 ASC files.

But how long does it take when using only 1 CPU? By default GNU Parallel runs one job per CPU core, so to find out we add ‘-j 1’ (at most one job at a time) and run again:

$ time find *.asc | parallel -j 1 --bar 'gdalwarp -s_srs EPSG:2154 -t_srs EPSG:4326 {} {=s:RGEALTI:warped:;s:asc:tif:=}'

Now it takes 7m1.548s.

So the parallel run is about 4.6 times faster (roughly 92 seconds versus 422 seconds), which saves a lot of time! Imagine the time saved when running on very large datasets on a machine with many CPUs…
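The speedup is just the ratio of the two wall-clock times. For anyone who wants to repeat the comparison with their own timings, here is a small sketch that converts the ‘XmY.YYYs’ values printed by ‘time’ into seconds and divides them; the function names are placeholders of mine.

```shell
# Hypothetical helpers for comparing two timings in the "7m1.548s"
# format that bash's time keyword prints.

# Convert "<minutes>m<seconds>s" to plain seconds.
to_seconds() {
  printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times faster the second timing is than the first.
speedup() {
  LC_ALL=C awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup 7m1.548s 1m32.028s
# prints: 4.58
```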

The nice thing is that GNU Parallel can be applied to any kind of data processing that splits into independent tasks, so this is a huge time/money saver.
