Scanning and OCR on Arch Linux

bike9876@エボ猫.コム

These notes describe setting up scanning from a generic scanner (connected to my local network) to my Arch Linux system. They also cover OCR, both while scanning and on existing pdfs.

Install packages

Install sane:

sudo pacman -S sane

scanimage (see below) suits me fine for most scanning, but you might want to also install:

And for scanning with OCR:

sudo pacman -S tesseract tesseract-data-eng

(if a language other than English is required, run

sudo pacman -S tesseract tesseract-data

and select the tesseract-data package(s) to install)

A graphical program for scanning with OCR:

sudo pacman -S gscan2pdf

For OCR of pdfs on the command line:

yay -S ocrmypdf

Check whether the scanner supports AirScan by entering its details at https://mfi.apple.com/account/airprint-search.

If supported, install sane-airscan:

sudo pacman -S sane-airscan

Check for connected scanners

Run

scanimage -L

I have an HP OfficeJet Pro 6960 connected to the network (and which supports AirScan), and get output:

device `escl:https://192.168.X.Y:443' is a HP OfficeJet Pro 6960 [E3C811] platen,adf scanner
device `hpaio:/net/officejet_pro_6960?ip=192.168.X.Y&queue=false' is a Hewlett-Packard officejet_pro_6960 all-in-one
device `airscan:e0:HP OfficeJet Pro 6960 [E3C811]' is a eSCL HP OfficeJet Pro 6960 [E3C811] ip=192.168.X.Y

(IP addresses have been redacted)

Notes:

Binary output like the following may also appear:

O0W
   Hs<M´M
        q.[lNJ+"ߏ}sszr}?[<|:B{3v=ɇsi"1vƦ?X9oI
                                             2?ȋ ])^uFF{ԞN75)#|
...

which is in fact the output from the printer when accessed via its web interface (gzipped). I don’t know why this is appearing or how to suppress it.

Scanning

With scanimage

I always want the airscan device to be used. For my HP OfficeJet Pro 6960, I have found that:

Setting the environment variable SANE_DEFAULT_DEVICE sets the default device scanimage will use. So for my HP6960 I have

export SANE_DEFAULT_DEVICE='airscan:e0:HP OfficeJet Pro 6960 [E3C811]'

(I set this in ~/local/profile.d/210sane, which is read in and executed by my ~/.bash_profile - see my notes on Bash configuration.)
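Rather than hard-coding the device name, the profile snippet could look it up. This is only a sketch: the function name first_airscan_device is made up for illustration, and it assumes scanimage -f '%d%n' prints one device name per line (as described in the scanimage man page).

```shell
# Print the first device name on stdin that uses the airscan backend.
# Exits non-zero if no such device is listed.
first_airscan_device() {
    grep -m 1 '^airscan:'
}

# Intended use in a profile script. Left commented out because it needs a
# reachable scanner, and probing the network can take a few seconds:
#   SANE_DEFAULT_DEVICE=$(scanimage -f '%d%n' | first_airscan_device) &&
#       export SANE_DEFAULT_DEVICE
```

The downside is a slower login whenever the scanner has to be probed, so a fixed export is probably still the better choice for a single scanner.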

For a single page on the scanner

Scan to file page.png with

scanimage --resolution=300 -o page.png

Change the output file suffix to .tif (or .tiff), .jpg (or .jpeg), .pdf, or .pnm [1] to save as a tiff file etc. (The output format can also be specified with --format=tiff etc.) The supported resolutions can be seen by running scanimage -h. If I’m scanning a page smaller than A4, I specify -x and -y args (see table at bottom).

If I need to apply an additional rotation I use something like:

scanimage --resolution=300 --format=png | magick - -rotate 90 page.png

(the imagemagick package is needed) or

scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.jpg

etc., except if the desired output is a pdf, in which case use [2]:

scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.pdf

Can also scan to a pdf with OCR (so there is a “text layer” as well as the scanned image in the pdf - so the document can be searched and text can be selected from the document):

scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng pdf

which saves to foo.pdf. Allow for multiple languages (apart from just English) with eg -l eng+fra (need to install the relevant tesseract-data-<lang> packages; tesseract uses three-letter language codes, so French is fra, not fr).
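If the set of languages varies from document to document, a tiny helper can build the -l argument. A minimal sketch; join_langs is a made-up name, and the codes passed to it must match installed tesseract-data packages (tesseract --list-langs shows what is available).

```shell
# Join language codes with "+" to form the value for tesseract's -l option.
join_langs() {
    local IFS='+'          # "$*" joins arguments with the first character of IFS
    printf '%s\n' "$*"
}

join_langs eng fra   # prints: eng+fra

# Intended use (needs a scanner, so shown as a comment):
#   scanimage --resolution=300 --format=jpeg |
#       tesseract - foo -l "$(join_langs eng fra)" pdf
```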

Consider changing the tesseract thresholding method if not all the page text is successfully processed. See the output of

tesseract --print-parameters | grep thresholding_

for options, so eg

scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng -c thresholding_method=2 pdf

for Sauvola thresholding.
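To compare the methods on a difficult page, one option is to scan once to an image file and run tesseract over it with each method in turn. A sketch under these assumptions: the values 0, 1 and 2 are the ones I understand Tesseract 5 to accept (legacy Otsu, Leptonica Otsu, Sauvola; check against the --print-parameters output above), and try_thresholding and the TESSERACT override variable are hypothetical names, the latter only there to allow a dry run.

```shell
# Run tesseract on one saved scan with each thresholding method, writing
# <base>-t0.pdf, <base>-t1.pdf and <base>-t2.pdf for side-by-side comparison.
# Set TESSERACT=echo to print the commands instead of running them.
try_thresholding() {
    local image=$1 base=$2 method
    for method in 0 1 2; do
        ${TESSERACT:-tesseract} "$image" "$base-t$method" -l eng \
            -c thresholding_method="$method" pdf
    done
}

# Example: scanimage --resolution=300 -o page.png && try_thresholding page.png foo
```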

(Ways to see how much of the text has been processed include:

For multiple pages from the automatic document feeder (ADF)

(The scanner has to support an ADF - see the output of scanimage -h for the --source option.)

scanimage --batch --resolution=300 --source=ADF --format=png

The scanned pages are saved as out1.png, out2.png etc.

If there are more than 9 pages, then better to specify an additional argument --batch=out%02d.png (for example), so the pages will be saved as out01.png, …, out09.png, out10.png, … (and so will sort nicely).
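The padding width can be derived from the expected page count. A small sketch; batch_pattern is a made-up helper name, and the page count is whatever upper bound you expect for the stack in the feeder.

```shell
# Print a --batch pattern wide enough for the given page count,
# e.g. 12 pages -> out%02d.png, 150 pages -> out%03d.png.
batch_pattern() {
    local pages=$1 ext=$2
    # ${#pages} is the number of digits in the page count.
    printf 'out%%0%dd.%s\n' "${#pages}" "$ext"
}

# Intended use (needs a scanner with an ADF, so shown as a comment):
#   scanimage --batch="$(batch_pattern 12 png)" --resolution=300 --source=ADF --format=png
```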

Change --format=... to output files in a different format. If you specify --format=pdf then the scanned pages are saved as a single multi-page file out1.pdf.

If I’m scanning a page smaller than A4, I specify -l, -x and -y args (see table at bottom).

As for the single page example above, can also scan to a pdf with OCR:

scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg | tesseract - foo -l eng -c stream_filelist=true pdf

which saves to foo.pdf. Note that it leaves files out1.jpg, out2.jpg, … (one file for each page) that probably should then be deleted.
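One way to avoid the leftover page files is to run the pipeline inside a temporary directory and delete it afterwards. This is a sketch rather than something I have tested on many scanners; scan_adf_to_pdf is a made-up name, and it assumes (as in the command above) that --batch-print emits file names relative to the current directory.

```shell
# Scan every page in the ADF to <base>.pdf with an OCR text layer.
# The per-page JPEGs live in a temporary directory that is removed afterwards.
scan_adf_to_pdf() {
    local base=$1 workdir status
    workdir=$(mktemp -d) || return 1
    (
        # Subshell: the cd does not affect the caller's working directory.
        cd "$workdir" || exit 1
        scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg |
            tesseract - result -l eng -c stream_filelist=true pdf
    )
    status=$?    # exit status of tesseract, the last command in the pipeline
    if [ "$status" -eq 0 ]; then
        mv "$workdir/result.pdf" "$base.pdf" || status=$?
    fi
    rm -rf "$workdir"
    return "$status"
}

# Example: scan_adf_to_pdf foo    (writes foo.pdf in the current directory)
```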

With gscan2pdf

gscan2pdf is an easy to use graphical program to scan pages and apply OCR (it is a graphical layer over scanimage and tesseract). See https://gscan2pdf.sourceforge.net/.

The only gotcha I found is knowing which scanning device to use (if there is more than 1 device found).

The options listed by gscan2pdf correspond to the “description” part of the output of scanimage -L.

More precisely, the options listed have the form <vendor> <model>. To get the mapping from device name to <vendor> <model>, use

scanimage -f $'%d: %v %m\n'

which outputs for my HP6960 printer scanner:

escl:https://192.168.X.Y:443: HP OfficeJet Pro 6960 [E3C811]
hpaio:/net/officejet_pro_6960?ip=192.168.X.Y&queue=false: Hewlett-Packard officejet_pro_6960
airscan:e0:HP OfficeJet Pro 6960 [E3C811]: eSCL HP OfficeJet Pro 6960 [E3C811]

So if I want the airscan device, I choose eSCL HP OfficeJet Pro 6960 [E3C811] as my scanner for gscan2pdf.
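The lookup can be scripted so that only the description for one backend is printed. A sketch: describe_device is a made-up name, and a tab is used as the separator instead of ": " because device names themselves contain colons.

```shell
# Read "device<TAB>vendor model" lines on stdin and print the description
# of the first device whose name starts with the given prefix.
describe_device() {
    awk -F '\t' -v prefix="$1" 'index($1, prefix) == 1 { print $2; exit }'
}

# Intended use (needs a scanner, so shown as a comment):
#   scanimage -f $'%d\t%v %m\n' | describe_device airscan:
```

For my scanner this should print the eSCL HP OfficeJet Pro 6960 [E3C811] entry to pick in gscan2pdf.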

OCR of existing pdfs

ocrmypdf does a good job of adding an OCR text layer to existing scanned pdfs (see https://github.com/ocrmypdf/OCRmyPDF).

Typical usage is:

ocrmypdf -l eng --deskew --jobs 4 input.pdf output.pdf

-l eng means recognise English text; use eg -l eng+fra for English and French if the relevant tesseract data files are installed (tesseract language codes are three letters). --jobs 4 means use up to 4 cpu cores simultaneously.

As above, consider changing the tesseract thresholding method (eg --tesseract-thresholding=sauvola) if not all the page text is successfully processed.
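For a whole directory of scanned pdfs, the same command can be run in a loop. A minimal sketch; ocr_dir is a made-up name, and the output goes to a separate directory so the originals are left untouched.

```shell
# Run ocrmypdf on every pdf in <src>, writing results with the same file
# names into <dest>. Returns non-zero if any file failed.
ocr_dir() {
    local src=$1 dest=$2 pdf status=0
    mkdir -p "$dest" || return 1
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue    # the glob matched nothing
        ocrmypdf -l eng --deskew --jobs 4 "$pdf" "$dest/$(basename "$pdf")" || status=1
    done
    return "$status"
}

# Example: ocr_dir ~/scans ~/scans/ocr
```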

Table of arguments to scanimage for smaller scan sizes

The HP6960 has a scan size of 215.9mm x 300.567mm (from output of scanimage -h).

Run scanimage -l <l> -x <x> -y <y> for paper sizes smaller than A4, where <l>, <x> and <y> are given by the following table. (The values depend on whether the page is being scanned on the flatbed or via the ADF [3]. Blank entries mean I never scan by that method.)

Paper size                                     Flatbed [4]      ADF
Aerogramme letter paper (199x245mm unfolded)   -x 199 -y 245
Letter writing envelopes (143x95mm)            -x 143 -y 95
Letter writing paper (135x175mm)               -x 135 -y 175    -l 40.45 -x 135 -y 177
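The ADF entry for letter writing paper is consistent with the feeder centring the page: (215.9 - 135) / 2 = 40.45. Assuming that holds for other sizes too, and taking the 2mm of extra length from that one entry as a rule of thumb, the arguments can be computed. Both assumptions are inferred from a single data point, not from scanner documentation, and adf_args is a made-up name.

```shell
# Print scanimage geometry arguments for feeding a page through the ADF.
# Assumptions: the feeder centres the page on a 215.9 mm wide bed, and
# 2 mm of extra length is enough to cover imperfect feeding.
adf_args() {
    awk -v w="$1" -v h="$2" -v bed=215.9 -v slack=2 \
        'BEGIN { printf "-l %g -x %g -y %g\n", (bed - w) / 2, w, h + slack }'
}

adf_args 135 175   # prints: -l 40.45 -x 135 -y 177

# Intended use (unquoted on purpose so the arguments are split):
#   scanimage $(adf_args 135 175) --batch --resolution=300 --source=ADF --format=png
```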

  [1] All these formats except pnm record the resolution (in dpi) of the image. Don’t use pnm if you need to determine the physical size of the document from the image file.

  [2] Specifying --format=pdf currently results in an error (not sure why). I suggest jpeg encoding with scanimage, as this is how scanimage encodes images internally when outputting a pdf file. png, tiff, or pnm could also be specified as the format; they will result in different final file sizes. Using pnm will lose information about the physical size of the image.

  [3] For the ADF, the <y> argument is made slightly larger than the paper length to allow for the scanner not feeding the paper perfectly.

  [4] With the page aligned with the top right of the scanner flatbed.