seek-optimized ZIP (SOZip)

Introduction

Geographical file formats tend to be huge and they are frequently composed of multiple files. 

For example, to get worldwide railroad data we have to download a zip file and unzip it to get a shapefile:

$ wget http://www.naturalearthdata.com/download/10m/cultural/ne_10m_railroads.zip
$ unzip ne_10m_railroads.zip

The result is a set of files: ne_10m_railroads.shp, ne_10m_railroads.shx, ne_10m_railroads.dbf and so on.

After downloading and unzipping we can start processing the data.

Because the zip files can be huge, these steps can take a lot of time. It would save time if we could process the remote zip file directly instead.

There is now a solution for this use case: a slightly modified zip file called a Seek-Optimized ZIP file (SOZip).

The SOZip specification can be found at https://github.com/sozip/sozip-spec/blob/master/sozip_specification.md; the current version is 0.5.
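To get a feel for what "seek-optimized" means, here is a minimal Python sketch of the general idea as I understand it: compress the data in fixed-size chunks, flush the compressor at every chunk boundary, and remember where each chunk starts in the compressed stream. This is only an illustration with the standard zlib module, not the actual SOZip file layout. The function names compress_with_index and read_chunk are made up for this example; the chunk size of 32768 is the value reported by the sozip tool.

```python
import zlib

CHUNK_SIZE = 32768  # same value the sozip tool reports as chunk_size


def compress_with_index(data, chunk_size=CHUNK_SIZE):
    """Compress data as raw deflate and return (compressed, offsets).

    offsets[i] is the position in the compressed stream where the
    i-th uncompressed chunk starts.
    """
    comp = zlib.compressobj(level=6, wbits=-15)  # wbits=-15: raw deflate, as used in zip
    out = bytearray()
    offsets = []
    for start in range(0, len(data), chunk_size):
        offsets.append(len(out))
        out += comp.compress(data[start:start + chunk_size])
        # A full flush ends on a byte boundary and resets the dictionary,
        # so decompression can start again right after this point.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets


def read_chunk(compressed, offsets, index):
    """Decompress only chunk number `index`, without touching earlier chunks."""
    begin = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(compressed)
    decomp = zlib.decompressobj(wbits=-15)
    return decomp.decompress(compressed[begin:end])
```

With such an index, a reader that needs bytes from the middle of a large file only has to fetch and decompress the chunks that cover them. The compressed stream as a whole is still ordinary deflate data, so a reader that ignores the index can decompress it from start to end as usual.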

Let’s see how this all works.

Create SOZip files

First download and unzip the ‘classic’ zip file:

$ wget http://www.naturalearthdata.com/download/10m/cultural/ne_10m_railroads.zip
$ unzip ne_10m_railroads.zip

There is a new GDAL command line tool to create seek-optimized zips: sozip (https://github.com/rouault/gdal/blob/sozip/doc/source/programs/sozip.rst).

To run the ‘sozip’ command I’ll use the Docker image ‘rouault/sozip’, which contains a new version of GDAL (https://gdal.org/) that includes the sozip tool.

First open a terminal in the sozip Docker image:

$ docker run -it rouault/sozip

To create the SOZip file ‘railroads.zip’ containing the shapefile binaries we can execute:

$ sozip railroads.zip *
0...10...20...30...40...50...60...70...80...90...100 - done.

We can now validate the zip file:

$ sozip --validate railroads.zip
* File ne_10m_railroads.dbf has a valid SOZip index, using chunk_size = 32768.
* File ne_10m_railroads.shp has a valid SOZip index, using chunk_size = 32768.
railroads.zip is a valid .zip file, and contains 2 SOZip-enabled file(s)

Use SOZip files

There are some tools that can handle SOZip files, see ‘Software implementations’ at https://github.com/sozip/sozip-spec/blob/master/sozip_specification.md#annex-a-software-implementations

For developers there is a Python module ‘sozipfile’ (https://github.com/sozip/sozipfile), which is a fork of the Python ‘zipfile’ module (https://docs.python.org/3/library/zipfile.html).
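Since a SOZip file is still a regular zip file, the standard ‘zipfile’ module should be able to read it as well, just without using the seek optimization. A small sketch that lists the entries of a zip file; the helper name list_members is my own, not part of any library:

```python
import zipfile


def list_members(zip_path):
    """Return (name, compressed size, uncompressed size) for each entry in the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.compress_size, info.file_size)
                for info in zf.infolist()]
```

Calling list_members("railroads.zip") on the file created above should return one tuple per shapefile component.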

I’ll use GDAL tooling (ogrinfo) to prove the interoperability. We have to use the prefix ‘/vsizip/’ to access the local zip file.

$ ogrinfo /vsizip/railroads.zip 
INFO: Open of `/vsizip/railroads.zip'
      using driver `ESRI Shapefile' successful.
1: ne_10m_railroads (Line String)

The ogrinfo tool has detected one shapefile (ne_10m_railroads). We can now use all the regular GDAL tooling (for example to convert it to other formats) on this file.

The killer feature is that we can run the same command on a remote (cloud) file. I’ve copied the SOZip file to https://bertt.github.io/vsizip/data/railroads.zip to test. In this case we have to use the prefix ‘/vsizip/vsicurl/’:

$ ogrinfo /vsizip/vsicurl/https://bertt.github.io/vsizip/data/railroads.zip
INFO: Open of `/vsizip/vsicurl/https://bertt.github.io/vsizip/data/railroads.zip'
      using driver `ESRI Shapefile' successful.
1: ne_10m_railroads (Line String)
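The only difference between the local and the remote case is the path prefix. A tiny Python sketch of that rule, using only the two prefixes shown in this post; the helper name vsi_zip_path is hypothetical:

```python
def vsi_zip_path(location):
    """Build the GDAL virtual file system path for a local or remote zip file."""
    if location.startswith(("http://", "https://")):
        # Remote zip: read it over HTTP through /vsicurl/
        return "/vsizip/vsicurl/" + location
    # Local zip: only the /vsizip/ prefix is needed
    return "/vsizip/" + location
```

The resulting string can be passed to ogrinfo, or to ogr.Open() when the GDAL Python bindings are installed.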

Conclusion

With the seek-optimized ZIP technique we can save time by skipping the download and unzipping of large zip files. The nice thing is that a SOZip file is just a regular zip file, so all the existing zip tooling will still work. Another big bonus is that this technique works on all kinds of zipped data, not just geographical formats.

A future version of QGIS will support SOZip files, so we can expect to see more uptake of this useful new technique.
