Difference between revisions of "Third-party software integration: OCR"

From OpenKM Documentation
(Compile from source)

Revision as of 11:58, 29 March 2010

Tesseract is an open source OCR engine adopted by Google. It reads TIFF documents natively and achieves a high recognition rate on images scanned at 300 dpi and converted to line art (1-bit color).

If you are using Debian or Ubuntu, you can install it directly from the package repositories:

$ aptitude install tesseract-ocr

To add support for the English language, also install:

$ aptitude install tesseract-ocr-eng

You can also download Windows executables for tesseract-2.04 at http://code.google.com/p/tesseract-ocr/downloads/list.

Now you have to tell OpenKM to use this OCR application. Edit the file OpenKM.cfg:

$ vim OpenKM.cfg

Then set the system.ocr property to the full path of the tesseract executable on your system (running "which tesseract" shows where it is installed):

system.ocr=/usr/local/bin/tesseract
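
Before restarting OpenKM, it can help to confirm that the configured path really points to an executable. The following is only a minimal sketch and not part of OpenKM: the function name check_ocr_path is hypothetical, and it assumes that OpenKM.cfg contains plain key=value lines like the one shown above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with OpenKM): read the system.ocr
# property from a key=value file and check that it names an executable.
# Usage: check_ocr_path OpenKM.cfg
check_ocr_path() {
    cfg="$1"
    # Print the value of every system.ocr line and keep the last one.
    ocr_path=$(sed -n 's/^system\.ocr=//p' "$cfg" | tail -n 1)
    if [ -z "$ocr_path" ]; then
        echo "system.ocr is not set in $cfg"
        return 1
    fi
    # -x is also true for searchable directories, so exclude them.
    if [ -x "$ocr_path" ] && [ ! -d "$ocr_path" ]; then
        echo "OK: $ocr_path"
        return 0
    fi
    echo "Not executable: $ocr_path"
    return 2
}
```

After loading the function into a shell, run check_ocr_path OpenKM.cfg from the folder that contains the configuration file. A return code of 1 means the property is missing, and 2 means the path does not point to an executable file.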

For more info, go to http://code.google.com/p/tesseract-ocr/.

Another interesting free OCR application is OCRopus. It has many improvements over Tesseract but is still at an early stage of development. The last released version (0.3.1) is quite usable and works very well, but it has to be compiled from source, which is currently a difficult task. Visit http://code.google.com/p/ocropus/ for more info.

Compile from source code

You can download the source code from http://code.google.com/p/tesseract-ocr/ and compile it yourself. Also download the language files you need and uncompress them in the same folder as the application.

$ sudo aptitude install libtiff4-dev
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
$ tar xzvf tesseract-2.04.tar.gz
$ cd tesseract-2.04
$ ./configure --prefix=/opt/tesseract
$ make
$ sudo make install

The executable should be located at /opt/tesseract/bin/tesseract.
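
To check the freshly built binary, you can run it once on a sample image. The sketch below assumes the usual command-line form "tesseract imagename outputbase -l lang", which writes the recognized text to outputbase.txt. The function name ocr_smoke_test and the sample file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical smoke test for a Tesseract binary (illustrative only).
# Usage: ocr_smoke_test /opt/tesseract/bin/tesseract sample.tif
ocr_smoke_test() {
    bin="$1"
    image="$2"
    if [ ! -x "$bin" ]; then
        echo "tesseract not found at $bin"
        return 1
    fi
    if [ ! -f "$image" ]; then
        echo "image not found: $image"
        return 1
    fi
    # Write the result into a fresh temporary folder.
    outbase=$(mktemp -d)/result
    # Tesseract appends .txt to the output base name.
    "$bin" "$image" "$outbase" -l eng || return 2
    if [ -s "$outbase.txt" ]; then
        echo "recognized text written to $outbase.txt"
    else
        echo "no text recognized"
        return 3
    fi
}
```

A non-empty result file only shows that the binary runs and finds its language data. To judge recognition quality, use a 300 dpi, 1-bit TIFF as described at the top of this page.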