Convert PDF to text using OCR

The Dutch Public Access to Government Information Act (Wet openbaarheid van bestuur, WOB) regulates the right of citizens to receive information from the government with regard to administrative matters.

The information is usually delivered in the form of large PDFs; a recent one was a PDF of 425 pages (1 GB).

The PDFs contain image scans of documents, emails and notes, so they are not easy to use for text analysis and search.

Let’s see if we can use Optical Character Recognition (OCR) to export the PDFs to plain text.

As a first step we want to create an image (PNG) of every page of the PDF. The tool ‘pdftoppm’ can do this:

$ mkdir -p images
$ pdftoppm -png wob-documenten.pdf ./images/wob-documenten
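pdftoppm renders at 150 DPI unless told otherwise, and Tesseract tends to do better at around 300 DPI. Whether that helps for these scans is untested. A minimal sketch, assuming two hypothetical helpers (`render_pages`, `count_pages`) and a 300 DPI default that is a guess rather than a measured value:

```shell
# Hypothetical helper: render every page of a PDF to PNG at a chosen resolution.
# Usage: render_pages <pdf> <output-dir> [dpi]
render_pages() {
  rp_pdf=$1
  rp_dir=$2
  rp_dpi=${3:-300}   # assumption: 300 DPI; pdftoppm itself defaults to 150
  mkdir -p "$rp_dir" || return 1
  # Output files are named <prefix>-<page>.png
  pdftoppm -r "$rp_dpi" -png "$rp_pdf" "$rp_dir/$(basename "$rp_pdf" .pdf)"
}

# Count the PNG files in a directory, to compare with the expected page count.
count_pages() {
  cp_n=0
  for cp_f in "$1"/*.png; do
    if [ -e "$cp_f" ]; then
      cp_n=$((cp_n + 1))
    fi
  done
  echo "$cp_n"
}
```

For example, `render_pages wob-documenten.pdf ./images` followed by `count_pages ./images` should print the number of pages in the PDF. Higher resolutions produce larger images and slower OCR, so the value is a trade-off.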

The ‘images’ subdirectory now contains 425 PNG images. The first page looks like:

[Image: PNG rendering of the first page]

Now we want to run OCR on all the PNGs. We’ll use Tesseract, a well-known OCR engine (https://github.com/tesseract-ocr/tesseract).

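cd images
# OCR every page image; tesseract appends .txt to the output base name,
# so wob-documenten-001.png becomes wob-documenten-001.png.txt
for FILE in *.png; do
  echo "$FILE"
  tesseract "$FILE" "$FILE"
done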
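Tesseract uses its English model unless a language is given, and the WOB documents are in Dutch. Below is a minimal sketch of the same loop with a language option, assuming the Dutch language data (`nld`) is installed. The helper name `ocr_pages` and the skip-if-done check are illustrative additions, not part of the original script:

```shell
# Hypothetical helper: OCR every PNG in a directory with a given language model.
# Usage: ocr_pages <dir> [lang]
ocr_pages() {
  op_dir=$1
  op_lang=${2:-nld}   # assumption: Dutch traineddata is installed
  for op_img in "$op_dir"/*.png; do
    [ -e "$op_img" ] || continue      # no PNG files: nothing to do
    op_base=${op_img%.png}            # tesseract appends .txt to this base
    if [ -s "$op_base.txt" ]; then
      continue                        # already processed; allows resuming
    fi
    echo "$op_img"
    tesseract "$op_img" "$op_base" -l "$op_lang"
  done
}
```

Called as `ocr_pages ./images`, this writes `wob-documenten-001.txt` next to `wob-documenten-001.png`. Note that this drops the `.png` part from the text file name, unlike the loop above. Skipping non-empty text files means an interrupted run can be restarted without redoing finished pages.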

This creates 425 text files; the process took 15 minutes on my machine. The text of the first page looks like:

App wisseling Gerard Beverdam (Nederlands Dagblad)

[09:08, 23-04-2020] Gerard Beverdam: Goedemorgen Anna Sophia, er is straks een gesprek van
minister Grapperhaus met de kerken. Kunnen wij na afloop over de inhoud contact hebben?

[14:54, 23-04-2020] Anna Sophia Posthumus: Dag Gerard, ik moet nog terugkoppeling krijgen
van het overleg. Bel je zo!

[14:55, 23-04-2020] Gerard Beverdam: **

0001 14826351

Sometimes the OCR fails to recognize characters, for example in handwritten notes, but that is expected.
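Pages where OCR recovered little or nothing can be flagged for manual review by counting the words in each text file. A minimal sketch, assuming a hypothetical helper `sparse_pages` and an arbitrary threshold of 10 words; a low count is only a hint, since short pages also score low:

```shell
# Hypothetical helper: list text files containing fewer than <min> words.
# Usage: sparse_pages <dir> [min-words]
sparse_pages() {
  sp_dir=$1
  sp_min=${2:-10}   # assumption: arbitrary threshold, tune per dataset
  for sp_txt in "$sp_dir"/*.txt; do
    [ -e "$sp_txt" ] || continue
    sp_words=$(wc -w < "$sp_txt" | tr -d ' ')
    if [ "$sp_words" -lt "$sp_min" ]; then
      echo "$sp_txt"
    fi
  done
}
```

Running `sparse_pages ./images` prints one file name per line for each page below the threshold.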

The dataset is reduced from a 1 GB PDF to 2.5 MB of plain text files that are searchable, linkable and easy to use in text analytics.
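With one text file per page, a plain `grep` is enough to find the pages that mention a name. A minimal sketch, where `search_pages` is a hypothetical helper name:

```shell
# Hypothetical helper: list the text files that mention a term (case-insensitive).
# Usage: search_pages <term> <dir>
search_pages() {
  # -F: treat the term as a fixed string, -i: ignore case,
  # -l: print only the names of matching files
  grep -F -i -l -- "$1" "$2"/*.txt
}
```

For example, `search_pages "Grapperhaus" ./images` lists every page file containing that name. The `*.txt` pattern also matches the `*.png.txt` names produced by the loop above.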

Look at https://github.com/bertt/wob for the scripts, input PDF document and the resulting png/txt files.
