bike9876@エボ猫.コム
This describes setting up scanning from a generic scanner (connected to my local network) to my Arch Linux system. It also covers OCR, both while scanning and on existing pdfs.
Install sane:
sudo pacman -S sane
scanimage (see below) suits me fine for most scanning,
but you might want to also install:
simple-scan (sudo pacman -S simple-scan)
xsane (yay -S xsane - see https://github.com/Jguer/yay#user-content-installation on how to install yay)

And for scanning with OCR:
sudo pacman -S tesseract tesseract-data-eng
(if a language other than English is required, run
sudo pacman -S tesseract tesseract-data
and select the tesseract-data package(s) to install)
A graphical program for scanning with OCR:
sudo pacman -S gscan2pdf
For OCR of pdfs on the command line:
yay -S ocrmypdf
Check whether the scanner supports AirScan by entering its details at https://mfi.apple.com/account/airprint-search.
If supported, install sane-airscan:
sudo pacman -S sane-airscan
Run
scanimage -L
I have an HP OfficeJet Pro 6960 connected to the network (and which supports AirScan), and get output:
device `escl:https://192.168.X.Y:443' is a HP OfficeJet Pro 6960 [E3C811] platen,adf scanner
device `hpaio:/net/officejet_pro_6960?ip=192.168.X.Y&queue=false' is a Hewlett-Packard officejet_pro_6960 all-in-one
device `airscan:e0:HP OfficeJet Pro 6960 [E3C811]' is a eSCL HP OfficeJet Pro 6960 [E3C811] ip=192.168.X.Y
(IP addresses have been redacted)
Notes:
The device `hpaio: ... line appears if the
hplip package has been installed (which is needed for
printing to work).
The output can also include binary garbage such as:
O0W
Hs<M´M
q.[lNJ+"ߏ}sszr}?[<|:B{3v=ɇsi"1vƦ?X9oI
2?ȋ ])^uFF{ԞN75)#|
...
which is in fact the output from the printer when accessed via its web interface (gzipped). I don’t know why this is appearing or how to suppress it.
scanimage

I always want the airscan device to be used. For my HP
OfficeJet Pro 6960, I have found that:
The escl device does not work correctly for multiple
pages fed from the ADF (only the first page is captured).
The hpaio device does not work at all.

Setting the environment variable SANE_DEFAULT_DEVICE
sets the default device scanimage will use. So for my
HP6960 I have
export SANE_DEFAULT_DEVICE='airscan:e0:HP OfficeJet Pro 6960 [E3C811]'
(I set this in ~/local/profile.d/210sane, which is read
in and executed by my ~/.bash_profile - see my notes on Bash configuration.)
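Rather than hard-coding the device name, the airscan entry could be picked out of the device list. A minimal sketch, assuming the wanted device name starts with airscan: (as it does for my scanner); pick_airscan_device is a hypothetical helper name, not part of sane:

```shell
# Hypothetical helper: print the first device name on stdin that starts with "airscan:".
pick_airscan_device() {
    grep -m 1 '^airscan:'
}

# Intended use (needs a scanner on the network; '%d%n' prints one device name per line):
#   SANE_DEFAULT_DEVICE=$(scanimage -f '%d%n' | pick_airscan_device)
#   export SANE_DEFAULT_DEVICE
```

Probing the network on every login is slow, so a fixed export as above is probably still the better choice for a profile script.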
Scan to file page.png with
scanimage --resolution=300 -o page.png
Change the output file suffix to .tif (or .tiff), .jpg (or .jpeg),
.pdf, or .pnm [1] to save as a tiff file etc. (Can also specify the
output format with --format=tiff etc.) The supported resolutions can be seen
by running scanimage -h. If I’m scanning a page smaller
than A4, I specify -x and -y args (see table
at bottom).
If I need to apply an additional rotation I use something like:
scanimage --resolution=300 --format=png | magick - -rotate 90 page.png
(the imagemagick package is needed) or
scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.jpg
etc., EXCEPT if the desired output is a pdf, in which case use the following [2]:
scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.pdf
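The suffix rule for these pipelines can be wrapped in a small function. This is only a sketch: scan_format_for is a hypothetical helper name, and the pdf case simply encodes the jpeg workaround described above.

```shell
# Hypothetical helper: choose the scanimage --format value for a given output file name.
scan_format_for() {
    case "${1##*.}" in
        png)        echo png ;;
        jpg|jpeg)   echo jpeg ;;
        tif|tiff)   echo tiff ;;
        pnm)        echo pnm ;;
        pdf)        echo jpeg ;;   # scan as jpeg and let magick write the pdf
        *)          echo "unsupported suffix: $1" >&2; return 1 ;;
    esac
}

# Intended use (needs a scanner and the imagemagick package):
#   out=page.pdf
#   scanimage --resolution=300 --format="$(scan_format_for "$out")" | magick - -rotate 90 "$out"
```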
Can also scan to a pdf with OCR (so the pdf has a “text layer” as well as the scanned image, and the document can be searched and text selected from it):
scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng pdf
which saves to foo.pdf. Allow for multiple languages
(apart from just English) with eg -l eng+fra (tesseract uses
three-letter language codes; need to install the relevant
tesseract-data-<lang> packages).
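Before scanning, it can be worth checking that every language in the -l argument is installed. A minimal sketch, assuming tesseract --list-langs writes a header line followed by one language code per line to stdout; missing_langs is a hypothetical helper name:

```shell
# Hypothetical helper: print each language in a spec such as "eng+fra" that is
# not in the list of installed languages read from stdin.
missing_langs() {
    local spec=$1 installed lang
    installed=$(cat)
    printf '%s\n' "$spec" | tr '+' '\n' | while IFS= read -r lang; do
        # -x matches whole lines only, -F treats the code as a fixed string
        printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Intended use:
#   tesseract --list-langs | missing_langs eng+fra
```

No output means all the requested languages are available.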
Consider changing the tesseract thresholding method if not all the page text is successfully processed. See the output of
tesseract --print-parameters | grep thresholding_
for options, so eg
scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng -c thresholding_method=2 pdf
for Sauvola thresholding.
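One rough way to compare thresholding methods is to count the words that end up in the text layer. This is a sketch only: text_layer_words is a hypothetical helper name, and a higher count is a hint rather than proof of better OCR.

```shell
# Hypothetical helper: count the words in a pdf's text layer
# (pdftotext is in the poppler package; "-" sends the text to stdout).
text_layer_words() {
    pdftotext "$1" - | wc -w
}

# Intended use: OCR the same scan twice and compare the counts.
#   scanimage --resolution=300 --format=jpeg > page.jpg
#   tesseract page.jpg default -l eng pdf
#   tesseract page.jpg sauvola -l eng -c thresholding_method=2 pdf
#   text_layer_words default.pdf; text_layer_words sauvola.pdf
```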
(Ways to see how much of the text has been processed include:
Run pdftotext (in the poppler package on Arch Linux) on the pdf: eg pdftotext file.pdf file.txt. Compare file.txt with the pdf file.
Use gscan2pdf: run gscan2pdf file.pdf and view the “Text layer”.)

Multiple pages can be scanned in one go from the ADF.
(The scanner has to support an ADF - see the output of
scanimage -h for the --source option.)
scanimage --batch --resolution=300 --source=ADF --format=png
The scanned pages are saved as out1.png, out2.png, etc.
If there are more than 9 pages, it is better to give the batch
argument as --batch=out%02d.png (for example), so the pages
will be saved as out01.png, …, out09.png,
out10.png, … (and so will sort nicely).
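The effect of zero padding on sort order can be seen without a scanner. A small sketch; batch_names is a hypothetical helper that just expands a printf-style pattern of the kind --batch accepts:

```shell
# Hypothetical helper: print the file names for pages 1..$2 using batch pattern $1.
batch_names() {
    local pattern=$1 pages=$2 i
    for i in $(seq 1 "$pages"); do
        printf "$pattern\n" "$i"
    done
}

batch_names 'out%d.png' 10 | LC_ALL=C sort | tail -n 1     # last name is out9.png, not out10.png
batch_names 'out%02d.png' 10 | LC_ALL=C sort | tail -n 1   # last name is out10.png
```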
Change --format=... to output files in a different
format. If --format=pdf is specified then the scanned pages are
saved as a single multi-page file out1.pdf.
If I’m scanning a page smaller than A4, I specify -l,
-x and -y args (see table at bottom).
As for the single page example above, can also scan to a pdf with OCR:
scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg | tesseract - foo -l eng -c stream_filelist=true pdf
which saves to foo.pdf. Note that it leaves files
out1.jpg, out2.jpg, … (one file for each page)
that probably should then be deleted.
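One way to avoid the leftover page files is to run the pipeline in a temporary directory and remove it afterwards. A sketch under stated assumptions: scan_adf_ocr is a hypothetical helper name, its argument is an output base name relative to the current directory, and the scanimage and tesseract options are the ones used above.

```shell
# Hypothetical helper: scan from the ADF and OCR into "$1.pdf" in the current
# directory, keeping the per-page jpeg files in a temporary directory.
scan_adf_ocr() {
    local base workdir status
    base="$PWD/$1"                      # absolute, because tesseract runs elsewhere
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg |
            tesseract - "$base" -l eng -c stream_filelist=true pdf
    )
    status=$?                           # exit status of tesseract
    rm -rf -- "$workdir"                # removes out1.jpg, out2.jpg, ...
    return "$status"
}

# Intended use (needs a scanner with an ADF):
#   scan_adf_ocr foo      # writes foo.pdf
```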
gscan2pdf

gscan2pdf is an easy to use graphical program to scan
pages and apply OCR (it is a graphical layer over scanimage and
tesseract). See https://gscan2pdf.sourceforge.net/.
The only gotcha I found is knowing which scanning device to choose if more than one device is found.
The options listed by gscan2pdf correspond to the “description” part
of the output of scanimage -L.
More precisely, the options listed have the form
<vendor> <model>. To get the mapping from
device name to <vendor> <model>, use
scanimage -f $'%d: %v %m\n'
which outputs for my HP6960 printer scanner:
escl:https://192.168.1.47:443: HP OfficeJet Pro 6960 [E3C811]
hpaio:/net/officejet_pro_6960?ip=192.168.1.47&queue=false: Hewlett-Packard officejet_pro_6960
airscan:e0:HP OfficeJet Pro 6960 [E3C811]: eSCL HP OfficeJet Pro 6960 [E3C811]
So if I want the airscan device, I choose
eSCL HP OfficeJet Pro 6960 [E3C811] as my scanner for
gscan2pdf.
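The lookup from device name to the label gscan2pdf shows can also be scripted. A minimal sketch, assuming a tab is used as the separator in the scanimage -f format (device names and models can contain spaces and colons); device_description is a hypothetical helper name:

```shell
# Hypothetical helper: given "device<TAB>vendor model" lines on stdin,
# print the "vendor model" part for the device named in $1.
device_description() {
    awk -F '\t' -v dev="$1" '$1 == dev { print $2 }'
}

# Intended use (bash $'...' quoting turns \t and \n into a real tab and newline):
#   scanimage -f $'%d\t%v %m\n' | device_description "$SANE_DEFAULT_DEVICE"
```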
ocrmypdf does a good job of adding an OCR text layer to
existing scanned pdfs (see https://github.com/ocrmypdf/OCRmyPDF).
Typical usage is:
ocrmypdf -l eng --deskew --jobs 4 input.pdf output.pdf
-l eng means recognise English text; can have eg
-l eng+fra if the relevant tesseract data files are
installed. --jobs 4 means use up to 4 cpu cores
simultaneously.
As above, consider changing the tesseract thresholding method (eg
--tesseract-thresholding=sauvola) if not all the page text
is successfully processed.
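To process a whole directory of scanned pdfs, the command can be put in a loop. A sketch only: ocr_pdfs is a hypothetical helper name, and the ocrmypdf options are just the typical ones shown above.

```shell
# Hypothetical helper: run ocrmypdf on every pdf in directory $1,
# writing the results under the same file names into directory $2.
ocr_pdfs() {
    local src=$1 dest=$2 f
    mkdir -p -- "$dest" || return 1
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        ocrmypdf -l eng --deskew --jobs 4 "$f" "$dest/$(basename "$f")" ||
            echo "ocrmypdf failed for $f" >&2
    done
}

# Intended use:
#   ocr_pdfs ~/scans ~/scans/ocr
```

Writing into a separate directory keeps the original scans untouched if an OCR run goes wrong.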
The HP6960 has a scan size of 215.9mm x 300.567mm (from output of
scanimage -h).
Run scanimage -l <l> -x <x> -y <y> for
paper sizes smaller than A4, where <l>,
<x> and <y> are given by the
following table. (The values depend on whether the page is being scanned
on the flatbed or via the ADF [3]; blank entries mean I
never scan by that method.)
| Paper Size | Flatbed [4] | ADF |
|---|---|---|
| Aerogramme letter paper (199x245mm unfolded) | -x 199 -y 245 | |
| Letter writing envelopes (143x95mm) | -x 143 -y 95 | |
| Letter writing paper (135x175mm) | -x 135 -y 175 | -l 40.45 -x 135 -y 177 |
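The -l value in the ADF column equals (215.9 - 135) / 2 = 40.45, ie the offset that centres the page across the scan width. Assuming the feeder centres narrower paper (an inference from that one table entry, not something checked for other sizes), the offset for another width could be computed like this; adf_left_offset is a hypothetical helper name:

```shell
# Hypothetical helper: left offset (mm) that centres a page of width $1 (mm)
# in a feeder of width $2 (mm, default 215.9 as reported for the HP6960).
adf_left_offset() {
    LC_ALL=C awk -v paper="$1" -v bed="${2:-215.9}" \
        'BEGIN { printf "%.2f\n", (bed - paper) / 2 }'
}

adf_left_offset 135    # prints 40.45, the -l value in the table
```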
[1] All these formats except pnm record the resolution (in dpi) of the image. Don’t use pnm if you need to determine the physical size of the document from the image file.
[2] Specifying --format=pdf currently results in an error (not sure why). I suggest jpeg encoding with
scanimage as this is how scanimage encodes images
internally when outputting a pdf file. Could also specify
png, tiff, or pnm as the format -
they will result in different final file sizes. Using pnm
will lose information about the physical size of the image.
[3] For the ADF, the <y> argument is made
slightly larger than the paper length to allow for the scanner not
feeding the paper perfectly.
[4] With the page aligned with the top right of the scanner flatbed.