Scanning and OCR on Arch Linux

bike9876@エボ猫.コム

This describes setting up scanning from a generic scanner (connected to my local network) to my Arch Linux system. It also covers OCR, both while scanning and on existing pdfs.

Install packages

Install sane:

sudo pacman -S sane

scanimage (included in sane package - see below for usage) suits me fine for most scanning, but you might want to also install:

simple-scan (sudo pacman -S simple-scan)
(more advanced) xsane (yay -S xsane - see https://github.com/Jguer/yay#user-content-installation on how to install yay)

And for scanning with OCR:

sudo pacman -S tesseract tesseract-data-eng

(if a language other than English is required, run

sudo pacman -S tesseract tesseract-data

and select the tesseract-data package(s) to install)

A graphical program for scanning with OCR:

sudo pacman -S gscan2pdf

For OCR of pdfs on the command line:

yay -S ocrmypdf

Check if scanner supports AirScan: enter scanner details in https://mfi.apple.com/account/airprint-search.

If supported, install sane-airscan:

sudo pacman -S sane-airscan

If AirScan is supported, can disable all other sane drivers:

sudo sed -i -E 's/^([^#].*)/#\1/' /etc/sane.d/dll.conf
cd /etc/sane.d/dll.d
for f in *; do if [[ $f != airscan ]]; then sudo sed -i -E 's/^([^#].*)/#\1/' "$f"; fi; done
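
Before running the sed commands above with sudo, it can be reassuring to see what the expression does. A minimal sketch, using a made-up sample in place of a real dll.conf (the function name comment_active_lines is only for illustration):

```shell
# Same sed expression as above, but reading stdin and writing stdout
# (no -i, no sudo), so nothing on disk is modified.
comment_active_lines() {
    sed -E 's/^([^#].*)/#\1/'
}

# Made-up sample resembling a dll.conf: two active drivers,
# one already-commented driver and one blank line.
printf 'net\n#abaton\n\nescl\n' | comment_active_lines
```

Active lines gain a leading #, while blank lines and existing comments are left alone. To preview the change to the real file, something like comment_active_lines < /etc/sane.d/dll.conf | diff /etc/sane.d/dll.conf - should work.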

Check for connected scanners

Run

scanimage -L

I have an HP OfficeJet Pro 6970 connected to the network (and which supports AirScan).

If I disable all the non-airscan drivers (as described above), I get output:

device `airscan:w0:HP OfficeJet Pro 6970' is a WSD HP OfficeJet Pro 6970 ip=192.168.X.Y, 2a01:4b00:A:B:C:D:E:F

(IP addresses have been redacted.)

If I did not disable all the non-airscan drivers, I get output:

device `escl:https://192.168.X.Y:443' is a HP OfficeJet Pro 6970 [FC7836] platen,adf scanner
device `hpaio:/net/officejet_pro_6970?ip=192.168.1.X&queue=false' is a Hewlett-Packard officejet_pro_6970 all-in-one
device `airscan:e0:HP OfficeJet Pro 6970 [FC7836]' is a eSCL HP OfficeJet Pro 6970 [FC7836] ip=192.168.X.Y, 2a01:4b00:A:B:C:D:E:F

(Again, IP addresses have been redacted.)

Notes on the output with all drivers enabled:

If not already started, the avahi-daemon service will be started (and the user’s password will be prompted for to enable this).
I only get the device `hpaio: ... line if the hplip package has been installed (which is needed for printing to work).
I actually get a load of “garbage” at the start of the output of scanimage:

O0W
   Hs<M´M
        q.[lNJ+"ߏ}sszr}?[<|:B{3v=ɇsi"1vƦ?X9oI
                                             2?ȋ ])^uFF{ԞN75)#|
...

which is in fact the output from the printer when accessed via its web interface (gzipped). I don’t know why this is appearing or how to suppress it.

Scanning

With `scanimage`

I always want the airscan device to be used. For my HP OfficeJet Pro 6970, I have found that:

the escl device does not work correctly for multiple pages fed from the ADF (only the first page is captured).
the hpaio device does not work at all.

I therefore disable all sane drivers apart from airscan (see above), and set the environment variable SANE_DEFAULT_DEVICE to the airscan device reported by scanimage -L:

export SANE_DEFAULT_DEVICE='airscan:w0:HP OfficeJet Pro 6970'

(I set this in ~/local/profile.d/210sane, which is read in and executed by my ~/.bash_profile - see my notes on Bash configuration.)
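
Rather than copying the device name by hand, it could be pulled out of the scanimage -L output. A minimal sketch, assuming the output lines have the form shown earlier (device `NAME' is a ...); the function name first_airscan_device is hypothetical:

```shell
# Print the name of the first airscan device found in `scanimage -L`
# output read from stdin. Assumes lines of the form:
#   device `NAME' is a ...
first_airscan_device() {
    sed -n "s/^device \`\(airscan:[^']*\)' is a .*/\1/p" | head -n 1
}

# In a profile file one could then write (commented out here, as it
# needs the scanner to be reachable):
# SANE_DEFAULT_DEVICE=$(scanimage -L | first_airscan_device)
# export SANE_DEFAULT_DEVICE
```

Note that scanimage -L probes the network, so running it at every login would probably slow the shell start-up. Running it once and pasting the result into the export line may be the better trade-off.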

For a single page on the scanner

Scan to file page.png with

scanimage --resolution=300 -o page.png

Change the output file suffix to .tif (or .tiff), .jpg (or .jpeg), .pdf, or .pnm ¹ to save as a tiff file etc. (Can also specify the output format with --format=tiff etc.) The supported resolutions can be seen by running scanimage -h. If I’m scanning a page smaller than A4, I specify -x and -y args (see table at bottom).

If I need to apply an additional rotation I use something like:

scanimage --resolution=300 --format=png | magick - -rotate 90 page.png

(the imagemagick package is needed) or

scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.jpg

etc EXCEPT if the desired output is a pdf, in which case use ²

scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.pdf

Can also scan to a pdf with OCR (so there is a “text layer” as well as the scanned image in the pdf - so the document can be searched and text can be selected from the document):

scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng pdf

which saves to foo.pdf. Allow for multiple languages (apart from just English) with eg -l eng+fra (tesseract uses three-letter language codes; need to install the relevant tesseract-data-<lang> packages).
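
Tesseract names its language data with three-letter codes (eng, fra, deu, ...), and a missing one makes the OCR step fail. A minimal sketch of a pre-flight check, assuming tesseract --list-langs prints one code per line on stdout (the function name have_tesseract_langs is hypothetical):

```shell
# Succeeds only if every language in a spec such as "eng+fra" appears
# as a whole line in the output of `tesseract --list-langs`.
have_tesseract_langs() {
    local spec=$1 installed lang
    installed=$(tesseract --list-langs 2>/dev/null) || return 1
    for lang in $(printf '%s' "$spec" | tr '+' ' '); do
        printf '%s\n' "$installed" | grep -qx -- "$lang" || {
            echo "missing tesseract language data: $lang" >&2
            return 1
        }
    done
}

# Example use (commented out, as it needs the scanner):
# have_tesseract_langs eng+fra &&
#     scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng+fra pdf
```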

Consider changing the tesseract thresholding method if not all the page text is successfully processed. See the output of

tesseract --print-parameters | grep thresholding_

for options, so eg

scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng -c thresholding_method=2 pdf

for Sauvola thresholding.

(Ways to see how much of the text has been processed include:

Run pdftotext (in the poppler package on Arch Linux) on the pdf: eg pdftotext file.pdf file.txt. Compare file.txt with the pdf file.
Open the processed pdf with a pdf viewer and try to select all the text.
Open the pdf with gscan2pdf (see below for gscan2pdf): gscan2pdf file.pdf and view the “Text layer”. )
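
To choose a thresholding method with less trial and error, the same saved page image can be run through each method and the recognised word counts compared. A rough sketch, assuming the values 0, 1 and 2 are all valid for thresholding_method in the installed tesseract (check the --print-parameters output) and that a higher word count is a fair proxy for better coverage; the function name compare_thresholding is hypothetical:

```shell
# For one image file, run tesseract with each thresholding method and
# print how many words of plain text were recognised.
compare_thresholding() {
    local image=$1 lang=${2:-eng} method words
    for method in 0 1 2; do
        # "stdout" as the output base makes tesseract print the text
        # instead of writing a file.
        words=$(tesseract "$image" stdout -l "$lang" \
            -c thresholding_method="$method" 2>/dev/null | wc -w)
        printf 'thresholding_method=%s: %s words\n' "$method" "$((words))"
    done
}

# Example use (commented out, as it needs a scanned image):
# scanimage --resolution=300 -o page.jpg
# compare_thresholding page.jpg eng
```

A word count cannot tell whether the words are correct, so the winning method still deserves a quick look with one of the checks above.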

For multiple pages from the automatic document feeder (ADF)

(The scanner has to support an ADF - see the output of scanimage -h for the --source option.)

scanimage --batch --resolution=300 --source=ADF --format=png

The scanned pages are saved as out1.png, out2.png etc.

If there are more than 9 pages, then better to specify an additional argument --batch=out%02d.png (for example), so the pages will be saved as out01.png, …, out09.png, out10.png, … (and so will sort nicely).
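
The same idea extends to stacks of 100 or more pages, which need three digits. A small sketch that builds the pattern from an estimated page count (the function name batch_pattern is hypothetical):

```shell
# Print a --batch pattern whose zero padding is as wide as the number
# of digits in the given page count.
batch_pattern() {
    local pages=$1 ext=${2:-png}
    printf 'out%%0%dd.%s\n' "${#pages}" "$ext"
}

batch_pattern 12        # prints out%02d.png
batch_pattern 150 jpg   # prints out%03d.jpg
```

It would be used as, for example, scanimage --batch="$(batch_pattern 12)" --resolution=300 --source=ADF --format=png.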

Change --format=... to output files in a different format. If --format=pdf is specified then the scanned pages are saved as a single multi-page file out1.pdf.

If I’m scanning a page smaller than A4, I specify -l, -x and -y args (see table at bottom).

As for the single page example above, can also scan to a pdf with OCR:

scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg | tesseract - foo -l eng -c stream_filelist=true pdf

which saves to foo.pdf. Note that it leaves files out1.jpg, out2.jpg, … (one file for each page) that probably should then be deleted.
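
One way to avoid the leftover page images is to run the scan in a temporary directory and delete it afterwards. A minimal sketch using the same scanimage and tesseract arguments as above; the function name scan_adf_to_pdf and the use of mktemp are my assumptions, not part of the original workflow:

```shell
# Scan all ADF pages to <base>.pdf with OCR, keeping the intermediate
# out*.jpg files in a throwaway directory.
scan_adf_to_pdf() {
    local base=$1 lang=${2:-eng} workdir
    # tesseract runs inside the temporary directory, so make the
    # output base absolute first.
    case $base in /*) ;; *) base=$PWD/$base ;; esac
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg |
            tesseract - "$base" -l "$lang" -c stream_filelist=true pdf
    )
    local status=$?
    rm -rf -- "$workdir"    # removes the per-page jpg files
    return "$status"
}

# Example use (commented out, as it needs the scanner):
# scan_adf_to_pdf foo eng
```

The return status is that of tesseract (the last command in the pipeline), so a failed scan may go unnoticed unless the output pdf is checked.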

With `gscan2pdf`

gscan2pdf is an easy to use graphical program to scan pages and apply OCR (it is a graphical layer over scanimage and tesseract). See https://gscan2pdf.sourceforge.net/.

The only gotcha I found is knowing which scanning device to use (if there is more than 1 device found).

The options listed by gscan2pdf correspond to the “description” part of the output of scanimage -L.

More precisely, the options listed have the form <vendor> <model>. To get the mapping from device name to <vendor> <model>, use

scanimage -f $'(%d) %v %m\n'

which outputs, for my HP6970 printer/scanner:

(airscan:w0:HP OfficeJet Pro 6970) WSD HP OfficeJet Pro 6970

So the device name listed by gscan2pdf would be WSD HP OfficeJet Pro 6970. As I only have 1 device listed, there is no issue here, but this is useful if more devices are listed.

OCR of existing pdfs

ocrmypdf does a good job of adding an OCR text layer to existing scanned pdfs (see https://github.com/ocrmypdf/OCRmyPDF).

Typical usage is:

ocrmypdf -l eng --deskew --jobs 4 input.pdf output.pdf

-l eng means recognise English text; can have eg -l eng+fra if the relevant tesseract data files are installed. --jobs 4 means use up to 4 cpu cores simultaneously.

As above, consider changing the tesseract thresholding method (eg --tesseract-thresholding=sauvola) if not all the page text is successfully processed.
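
For a whole folder of scanned pdfs, the typical command above can be wrapped in a loop. A minimal sketch with the same ocrmypdf options; the function name ocr_all_pdfs and the directory layout are hypothetical:

```shell
# Run ocrmypdf on every pdf in a source directory, writing results
# with the same file names into a destination directory.
ocr_all_pdfs() {
    local src=$1 dst=$2 f
    mkdir -p -- "$dst" || return 1
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue    # no pdfs matched the glob
        ocrmypdf -l eng --deskew --jobs 4 "$f" "$dst/$(basename -- "$f")" ||
            echo "failed: $f" >&2
    done
}

# Example use:
# ocr_all_pdfs ~/scans ~/scans/ocr
```

Failures are reported on stderr and the loop carries on, so one bad file does not stop the rest.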

Table of arguments to scanimage for smaller scan sizes

The HP6970 has a scan size of 215.9mm x 300.558mm (from output of scanimage -h).

Run scanimage -l <l> -x <x> -y <y> for paper sizes smaller than A4, where <l>, <x> and <y> are given by the following table. (The values depend on whether the page is being scanned on the flatbed, or via the ADF. ³ - blank entries mean I never scan by that method.)

Paper Size	Flatbed ⁴	ADF
Aerogramme letter paper (199x245mm unfolded)	-x 199 -y 245
Letter writing envelopes (143x95mm)	-x 143 -y 95
Letter writing paper (135x175mm)	-x 135 -y 175	-l 40.45 -x 135 -y 177
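
The -l value in the ADF column equals half the difference between the scanner width and the paper width: (215.9 - 135) / 2 = 40.45. That would be consistent with the feeder centring the page, though this is only inferred from the numbers in the table. Under that assumption, the offset for other paper widths could be computed with a sketch like this (the function name adf_left_offset is hypothetical):

```shell
# Left offset in mm for a page centred in the ADF:
# (scanner width - paper width) / 2, to two decimal places.
adf_left_offset() {
    awk -v bed="$1" -v paper="$2" 'BEGIN { printf "%.2f\n", (bed - paper) / 2 }'
}

adf_left_offset 215.9 135   # prints 40.45
```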

¹ All these formats except for pnm record the resolution (in dpi) of the image. Don’t use pnm if you need to determine the physical size of the document from the image file.
² Specifying --format=pdf results in an error currently (not sure why). Suggest jpeg encoding with scanimage as this is how scanimage encodes images internally when outputting a pdf file. Could also specify png, tiff, or pnm as the format - they will result in different final file sizes. Using pnm will lose information about the physical size of the image.
³ For the ADF, the <y> argument is made slightly larger than the paper length to allow for the scanner not feeding the paper perfectly.
⁴ With the page aligned with the top right of the scanner flatbed.