Difference between revisions of "Tesseract"

From OpenKM Documentation
 
Tesseract is an open source OCR engine adopted by Google. It can natively read TIFF documents and achieves a high recognition rate on images scanned at a resolution of 300 dpi and converted to line art (1-bit color).
#REDIRECT [[Third-party software integration: OCR Tesseract]]
 
 
If you are using a computer with Debian / Ubuntu, the installation is straightforward:
 
 
 
$ aptitude install tesseract-ocr
 
 
 
To add support for the English language, also install:
 
 
 
$ aptitude install tesseract-ocr-eng
 
 
 
You can also download Windows executables for tesseract-2.04 at http://code.google.com/p/tesseract-ocr/downloads/list.
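
A quick way to check the installation is to run Tesseract on a TIFF image. The sketch below assumes the command-line form <code>tesseract image outputbase -l lang</code> and that the English language data is installed; <code>ocr_tiff</code> and <code>scan.tif</code> are hypothetical names used only for illustration.

```shell
# Hypothetical helper: run Tesseract on one TIFF file with English language data.
# Tesseract appends ".txt" to the output base, so scan.tif produces scan.txt.
ocr_tiff() {
    input="$1"
    output_base="${input%.*}"   # strip the file extension to get the output base
    tesseract "$input" "$output_base" -l eng
}

# Usage (assumes scan.tif exists in the current directory):
#   ocr_tiff scan.tif && cat scan.txt
```

If the command fails or the text file is empty, check that the image is a TIFF at around 300 dpi in 1-bit color, as recommended at the top of this page.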
 
 
 
Now you have to tell OpenKM to use this OCR application. Edit the file [[OpenKM.cfg]]:
 
 
 
$ vim OpenKM.cfg
 
 
 
And set the system.ocr property to the path of the tesseract executable:
 
 
 
<source lang="java">
 
system.ocr=/usr/local/bin/tesseract
 
</source>
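
The right value for system.ocr depends on how Tesseract was installed. The following is a minimal sketch for finding it; <code>find_tesseract</code> is a hypothetical helper, and the candidate list is an assumption (<code>/usr/bin/tesseract</code> is a typical location for packaged installs and is not taken from this page).

```shell
# Hypothetical helper: print a system.ocr line for the first tesseract executable found.
# Checks the PATH first, then a few candidate locations (assumed, adjust as needed).
find_tesseract() {
    for candidate in "$(command -v tesseract 2>/dev/null)" \
                     /usr/local/bin/tesseract \
                     /usr/bin/tesseract \
                     /opt/tesseract/bin/tesseract; do
        if [ -n "$candidate" ] && [ -x "$candidate" ]; then
            echo "system.ocr=$candidate"
            return 0
        fi
    done
    echo "tesseract executable not found" >&2
    return 1
}

find_tesseract || echo "Install Tesseract first, then set system.ocr manually."
```

The printed line can be copied into OpenKM.cfg as it is.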
 
 
 
For more info, go to http://code.google.com/p/tesseract-ocr/ and [http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html Tesseract - Summary & first experiences].
 
 
 
You can download Tesseract 3 for Windows from [http://code.google.com/p/tesseract-ocr/ tesseract-ocr Google Code]. To install Tesseract 3 on Ubuntu, add the PPA and install Tesseract OCR 3.0 SVN:
 
 
 
  $ sudo add-apt-repository ppa:alex-p/notesalexp
 
  $ sudo apt-get update
 
  $ sudo apt-get install tesseract-ocr tesseract-ocr-eng
 
 
 
{{Warning|Add the PPA, install the latest Tesseract and then disable the PPA again, because it contains a lot of bleeding-edge packages!}}
 
 
 
  $ sudo add-apt-repository -r ppa:alex-p/notesalexp
 
 
 
There is also another interesting free OCR application called OCRopus. It has many improvements over Tesseract but is at an early stage of development. The last released version (0.3.1) is quite usable and works very well, but it has to be compiled from source, which is currently a difficult task. Visit http://code.google.com/p/ocropus/ for more info.
 
 
 
== Compile from source code ==
 
 
 
You can download the source code from http://code.google.com/p/tesseract-ocr/ and compile it yourself. Also download the language files you need and uncompress them in the same folder as the application.
 
 
 
$ sudo aptitude install build-essential libtiff4-dev
 
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
 
$ tar xzvf tesseract-2.04.tar.gz
 
$ cd tesseract-2.04
 
$ ./configure --prefix=/opt/tesseract
 
$ make
 
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
 
$ tar xzvf tesseract-2.00.eng.tar.gz
 
$ sudo make install
 
 
 
The executable should be located at /opt/tesseract/bin/tesseract. More info about compilation at:
 
 
 
* http://code.google.com/p/tesseract-ocr/wiki/ReadMe
 
* http://code.google.com/p/tesseract-ocr/wiki/FAQ
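
When Tesseract is compiled with the prefix used here, the system.ocr property in OpenKM.cfg has to point to <code>/opt/tesseract/bin/tesseract</code>. A minimal sketch of updating the property from the shell; <code>set_ocr_path</code> is a hypothetical helper, and the location of OpenKM.cfg is an assumption that depends on your installation.

```shell
# Hypothetical helper: set system.ocr in a properties file.
# Any existing system.ocr line is dropped and the new value is appended.
set_ocr_path() {
    cfg="$1"
    ocr_path="$2"
    if [ ! -f "$cfg" ]; then
        echo "config file not found: $cfg" >&2
        return 1
    fi
    # grep -v exits non-zero when nothing else is left in the file, hence "|| true".
    grep -v '^system\.ocr=' "$cfg" > "$cfg.tmp" || true
    echo "system.ocr=$ocr_path" >> "$cfg.tmp"
    mv "$cfg.tmp" "$cfg"
}

# Usage (assumes OpenKM.cfg is in the current directory):
#   set_ocr_path OpenKM.cfg /opt/tesseract/bin/tesseract
```

All other properties are kept, although the system.ocr line moves to the end of the file.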
 
 
 
== External links ==
 
* [http://groups.google.com/group/tesseract-ocr Tesseract OCR Google Groups]
 
* [http://triviaatwork.blogspot.com/2009/08/first-interactions-with-tesseract-ocr.html First Interactions with Tesseract OCR on Ubuntu Linux]
 
 
 
[[Category: Installation Guide]]
 

Latest revision as of 10:14, 11 January 2012