bike9876@エボ猫.コム
This describes setting up scanning from a generic scanner (connected to my local network) to my Arch Linux system. It also describes OCR, both while scanning and on existing pdfs.
Install sane:
sudo pacman -S sane
scanimage (included in the sane package - see below for usage) suits me fine for most scanning, but you might also want to install:
- simple-scan (sudo pacman -S simple-scan)
- xsane (yay -S xsane - see https://github.com/Jguer/yay#user-content-installation on how to install yay)
And for scanning with OCR:
sudo pacman -S tesseract tesseract-data-eng
(if a language other than English is required, run
sudo pacman -S tesseract tesseract-data
and select the tesseract-data package(s) to install)
A graphical program for scanning with OCR:
sudo pacman -S gscan2pdf
For OCR of pdfs on the command line:
yay -S ocrmypdf
Check if scanner supports AirScan: enter scanner details in https://mfi.apple.com/account/airprint-search.
If supported, install sane-airscan:
sudo pacman -S sane-airscan
If AirScan is supported, can disable all other sane drivers:
sudo sed -i -E 's/^([^#].*)/#\1/' /etc/sane.d/dll.conf
cd /etc/sane.d/dll.d
for f in *; do if [[ $f != airscan ]]; then sudo sed -i -E 's/^([^#].*)/#\1/' "$f"; fi; done
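The commands above edit the files in place. A slightly more cautious variant is sketched below: it first copies each file to a separate backup directory, so that the change can be undone and no stray copies end up in /etc/sane.d/dll.d (where they might be read as configuration). The function name comment_out_drivers and the backup location are made up for illustration.

```shell
# comment_out_drivers BACKUP_DIR FILE...: save a copy of each FILE in
# BACKUP_DIR, then comment out every line of FILE that is neither empty nor
# already a comment (same sed expression as above).
# Hypothetical helper, not part of sane.
comment_out_drivers() {
    backup_dir=$1
    shift
    mkdir -p "$backup_dir" || return 1
    for f in "$@"; do
        backup=$backup_dir/$(basename "$f")
        cp -p "$f" "$backup" || return 1
        # Read the untouched copy, write the commented version over the original
        sed -E 's/^([^#].*)/#\1/' "$backup" > "$f" || return 1
    done
}
# Example (as root; /root/sane-backup is an arbitrary choice):
#   comment_out_drivers /root/sane-backup /etc/sane.d/dll.conf
```

To undo the change, copy the saved file back over the edited one.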
Run
scanimage -L
I have an HP OfficeJet Pro 6970 connected to the network (and which supports AirScan).
If I disable all the non-airscan drivers (as described above), I get output:
device `airscan:w0:HP OfficeJet Pro 6970' is a WSD HP OfficeJet Pro 6970 ip=192.168.X.Y, 2a01:4b00:A:B:C:D:E:F
(IP addresses have been redacted).
If I did not disable all the non-airscan drivers, I get output:
device `escl:https://192.168.X.Y:443' is a HP OfficeJet Pro 6970 [FC7836] platen,adf scanner
device `hpaio:/net/officejet_pro_6970?ip=192.168.1.X&queue=false' is a Hewlett-Packard officejet_pro_6970 all-in-one
device `airscan:e0:HP OfficeJet Pro 6970 [FC7836]' is a eSCL HP OfficeJet Pro 6970 [FC7836] ip=192.168.X.Y, 2a01:4b00:A:B:C:D:E:F
(again, IP addresses have been redacted).
Notes on the output with all drivers enabled:
- The avahi-daemon service will be started (and the user’s password will be prompted for to enable this).
- The device `hpaio: ... line appears only if the hplip package has been installed (which is needed for printing to work).
- The output also contains binary data such as:
O0W
Hs<M´M
q.[lNJ+"ߏ}sszr}?[<|:B{3v=ɇsi"1vƦ?X9oI
2?ȋ ])^uFF{ԞN75)#|
...
which is in fact the output from the printer when accessed via its web interface (gzipped). I don’t know why this is appearing or how to suppress it.
scanimage
I always want the airscan device to be used. For my HP OfficeJet Pro 6970, I have found that:
- The escl device does not work correctly for multiple pages fed from the ADF (only the first page is captured).
- The hpaio device does not work at all.
I therefore disable all sane drivers apart from airscan (see above), and set the environment variable SANE_DEFAULT_DEVICE to the airscan device reported by scanimage -L:
export SANE_DEFAULT_DEVICE='airscan:w0:HP OfficeJet Pro 6970'
(I set this in ~/local/profile.d/210sane, which is read in and executed by my ~/.bash_profile - see my notes on Bash configuration.)
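Instead of copying the device name by hand, it could be picked out of scanimage's device list. The following is only a sketch: first_airscan_device is a made-up helper name, and it assumes the first airscan device listed is the wanted one. Querying the scanner from a profile script also means a network probe at every login, so hard-coding the name as above may well be preferable.

```shell
# first_airscan_device: read device names (one per line) on stdin and print
# the first one that belongs to the airscan backend.
# Hypothetical helper name.
first_airscan_device() {
    grep '^airscan:' | head -n 1
}
# Example for a bash profile snippet ($'...' is bash quoting):
#   SANE_DEFAULT_DEVICE=$(scanimage -f $'%d\n' | first_airscan_device)
#   export SANE_DEFAULT_DEVICE
```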
Scan to file page.png with
scanimage --resolution=300 -o page.png
Change the output file suffix to .tif (or .tiff), .jpg (or .jpeg), .pdf, or .pnm [1] to save as a tiff file etc. (The output format can also be specified with --format=tiff etc.) The supported resolutions can be seen by running scanimage -h. If I’m scanning a page smaller than A4, I specify -x and -y args (see table at bottom).
If I need to apply an additional rotation I use something like:
scanimage --resolution=300 --format=png | magick - -rotate 90 page.png
(the imagemagick package is needed) or
scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.jpg
etc., except if the desired output is a pdf, in which case use [2]:
scanimage --resolution=300 --format=jpeg | magick - -rotate 90 page.pdf
Can also scan to a pdf with OCR (so there is a “text layer” as well as the scanned image in the pdf - so the document can be searched and text can be selected from the document):
scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng pdf
which saves to foo.pdf. To recognize several languages (not just English), use eg -l eng+fra (the relevant tesseract-data-<lang> packages need to be installed; tesseract uses three-letter language codes, so French is fra).
Consider changing the tesseract thresholding method if not all the page text is successfully processed. See the output of
tesseract --print-parameters | grep thresholding_
for options, so eg
scanimage --resolution=300 --format=jpeg | tesseract - foo -l eng -c thresholding_method=2 pdf
for Sauvola thresholding.
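One rough way to compare thresholding methods is to count the words that ended up in the text layer of each resulting pdf, using pdftotext from the poppler package. This is only a sketch: count_words is a made-up helper, the file names in the example are hypothetical, and a higher count is merely a hint (noise can be recognized as "words" too).

```shell
# count_words: print the number of whitespace-separated words read on stdin.
# Hypothetical helper name.
count_words() {
    wc -w | tr -d '[:space:]'
}
# Example, assuming foo-otsu.pdf and foo-sauvola.pdf were produced with
# different thresholding_method values ("-" makes pdftotext write to stdout):
#   pdftotext foo-otsu.pdf - | count_words
#   pdftotext foo-sauvola.pdf - | count_words
```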
(Ways to see how much of the text has been processed include:
- Run pdftotext (in the poppler package on Arch Linux) on the pdf, eg pdftotext file.pdf file.txt, and compare file.txt with the pdf file.
- Open the pdf in gscan2pdf (described below): run gscan2pdf file.pdf and view the “Text layer”.)
(The scanner has to support an ADF - see the output of scanimage -h for the --source option.)
scanimage --batch --resolution=300 --source=ADF --format=png
The scanned pages are saved as out1.png, out2.png etc.
If there are more than 9 pages, it is better to give the --batch option a file name pattern, eg --batch=out%02d.png, so the pages will be saved as out01.png, …, out09.png, out10.png, … (and so will sort nicely).
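The effect of the zero-padded pattern can be tried without a scanner. The sketch below assumes the --batch pattern behaves like a printf format (which the out%02d.png example suggests); batch_names is a made-up helper that just prints the names such a pattern would yield.

```shell
# batch_names PATTERN COUNT: print the file names a printf-style PATTERN
# yields for pages 1..COUNT, one per line.
# Hypothetical helper; it does not call scanimage.
batch_names() {
    i=1
    while [ "$i" -le "$2" ]; do
        # PATTERN is deliberately used as the printf format string
        printf "$1\n" "$i"
        i=$((i + 1))
    done
}
# Example: compare the sorted order for both patterns
#   batch_names 'out%d.png' 10 | LC_ALL=C sort
#   batch_names 'out%02d.png' 10 | LC_ALL=C sort
```

With the plain %d pattern, out10.png sorts before out2.png; with %02d the sorted order matches the page order.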
Change --format=... to output files in a different format. If --format=pdf is specified, the scanned pages are saved as a single multi-page file out1.pdf.
If I’m scanning a page smaller than A4, I specify -l,
-x and -y args (see table at bottom).
As in the single page example above, can also scan to a pdf with OCR:
scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg | tesseract - foo -l eng -c stream_filelist=true pdf
which saves to foo.pdf. Note that it leaves files
out1.jpg, out2.jpg, … (one file for each page)
that probably should then be deleted.
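To avoid cleaning up the page images by hand, the same pipeline could be run inside a temporary directory that is removed afterwards. This is a sketch only: scan_adf_ocr is a made-up function name, and it assumes scanimage --batch-print prints the page file names relative to the current directory, as in the command above.

```shell
# scan_adf_ocr BASENAME: scan all pages from the ADF and OCR them into
# BASENAME.pdf, keeping the intermediate page images in a throwaway directory.
# Hypothetical helper; options are the ones used in the command above.
scan_adf_ocr() {
    base=$1
    case $base in
        /*) ;;                       # already an absolute path
        *)  base=$PWD/$base ;;       # make it absolute before changing directory
    esac
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        scanimage --resolution=300 --batch --batch-print --source=ADF --format=jpeg |
            tesseract - "$base" -l eng -c stream_filelist=true pdf
    )
    status=$?                        # exit status of tesseract
    rm -rf "$workdir"                # removes the page images as well
    return "$status"
}
# Example:
#   scan_adf_ocr foo        # writes foo.pdf in the current directory
```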
gscan2pdf
gscan2pdf is an easy-to-use graphical program to scan pages and apply OCR (it is a graphical layer over scanimage and tesseract). See https://gscan2pdf.sourceforge.net/.
The only gotcha I found is knowing which scanning device to use (if there is more than 1 device found).
The options listed by gscan2pdf correspond to the “description” part
of the output of scanimage -L.
More precisely, the options listed have the form
<vendor> <model>. To get the mapping from
device name to <vendor> <model>, use
scanimage -f $'(%d) %v %m\n'
which outputs, for my HP6970 printer/scanner:
(airscan:w0:HP OfficeJet Pro 6970) WSD HP OfficeJet Pro 6970
So the device name listed by gscan2pdf would be
WSD HP OfficeJet Pro 6970. As I only have 1 device listed,
there is no issue here, but this is useful if more devices are
listed.
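When several devices are listed, the reverse lookup (from the name shown by gscan2pdf back to the sane device name) can be sketched as below. device_for is a made-up helper, and the tab-separated format string is my choice here rather than anything gscan2pdf requires; a tab is used so that spaces in the device name do not confuse the parsing.

```shell
# device_for DESCRIPTION: read "device<TAB>vendor model" lines on stdin and
# print the device name whose "vendor model" part equals DESCRIPTION.
# Hypothetical helper name.
device_for() {
    awk -F '\t' -v want="$1" '$2 == want { print $1 }'
}
# Example ($'...' is bash quoting, as in the command above):
#   scanimage -f $'%d\t%v %m\n' | device_for 'WSD HP OfficeJet Pro 6970'
```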
ocrmypdf does a good job of adding an OCR text layer to
existing scanned pdfs (see https://github.com/ocrmypdf/OCRmyPDF).
Typical usage is:
ocrmypdf -l eng --deskew --jobs 4 input.pdf output.pdf
-l eng means recognize English text; can have eg -l eng+fra if the relevant tesseract data files are installed. --jobs 4 means use up to 4 cpu cores simultaneously.
As above, consider changing the tesseract thresholding method (eg --tesseract-thresholding=sauvola) if not all the page text is successfully processed.
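For a whole directory of scanned pdfs, the command can be wrapped in a loop. A minimal sketch, in which the function names and the -ocr.pdf naming scheme are made up for illustration:

```shell
# ocr_output_name FILE.pdf: print the name used for the OCRed copy
# (FILE-ocr.pdf). Hypothetical naming scheme.
ocr_output_name() {
    echo "${1%.pdf}-ocr.pdf"
}
# ocr_all FILE...: run ocrmypdf (same options as above) on each file,
# stopping at the first failure.
ocr_all() {
    for f in "$@"; do
        ocrmypdf -l eng --deskew --jobs 4 "$f" "$(ocr_output_name "$f")" || return 1
    done
}
# Example:
#   ocr_all scans/*.pdf
```

The originals are left untouched, so the results can be compared before anything is deleted.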
The HP6970 has a scan size of 215.9mm x 300.558mm (from output of
scanimage -h).
Run scanimage -l <l> -x <x> -y <y> for paper sizes smaller than A4, where <l>, <x> and <y> are given by the following table. (The values depend on whether the page is being scanned on the flatbed or via the ADF [3]; blank entries mean I never scan by that method.)
| Paper Size | Flatbed [4] | ADF |
|---|---|---|
| Aerogramme letter paper (199x245mm unfolded) | -x 199 -y 245 | |
| Letter writing envelopes (143x95mm) | -x 143 -y 95 | |
| Letter writing paper (135x175mm) | -x 135 -y 175 | -l 40.45 -x 135 -y 177 |
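The table can also be kept as a small shell function so the values need not be retyped. This is just a sketch: paper_args and the short paper names (aerogramme, envelope, letter) are made up here, and the values are specific to this scanner.

```shell
# paper_args PAPER METHOD: print the scanimage geometry options from the
# table above. PAPER is aerogramme, envelope or letter (hypothetical short
# names); METHOD is flatbed or adf.
paper_args() {
    case "$1/$2" in
        aerogramme/flatbed) echo "-x 199 -y 245" ;;
        envelope/flatbed)   echo "-x 143 -y 95" ;;
        letter/flatbed)     echo "-x 135 -y 175" ;;
        letter/adf)         echo "-l 40.45 -x 135 -y 177" ;;
        *) echo "paper_args: no table entry for $1 via $2" >&2
           return 1 ;;
    esac
}
# Example (the substitution is left unquoted on purpose so that it splits
# into separate options):
#   scanimage --resolution=300 $(paper_args envelope flatbed) -o envelope.png
```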
[1] All these formats except pnm record the resolution (in dpi) of the image. Don’t use pnm if you need to determine the physical size of the document from the image file.
[2] Specifying --format=pdf currently results in an error (not sure why). I suggest jpeg encoding with scanimage, as this is how scanimage encodes images internally when outputting a pdf file. Could also specify png, tiff, or pnm as the format - they will result in different final file sizes. Using pnm will lose information about the physical size of the image.
[3] For the ADF, the <y> argument is made slightly larger than the paper length to allow for the scanner not feeding the paper perfectly.
[4] With the page aligned with the top right of the scanner flatbed.