Difference between revisions of "Third-party software integration: OCR"

Revision as of 08:46, 14 June 2011

There is also another interesting free OCR application called OCRopus. It has many improvements over Tesseract but is at an early stage of development. The latest released version (0.3.1) is quite usable and works very well, but it has to be compiled from source, which is a difficult task. Visit http://code.google.com/p/ocropus/ for more info.

Compile from source code

You can download the source code from http://code.google.com/p/tesseract-ocr/ and compile it yourself. Also download the language files you need and uncompress them in the same folder as the application.

$ sudo aptitude install build-essential libtiff4-dev
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
$ tar xzvf tesseract-2.04.tar.gz
$ cd tesseract-2.04
$ ./configure --prefix=/opt/tesseract
$ make
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
$ tar xzvf tesseract-2.00.eng.tar.gz
$ sudo make install
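
After the install step, a quick check that the binary landed under the chosen prefix can save some debugging later. The following is only a minimal sketch: check_tesseract_install is a hypothetical helper name, and /opt/tesseract is assumed because it is the --prefix passed to ./configure above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract or OpenKM): checks that a
# regular, executable file named "tesseract" exists under <prefix>/bin.
check_tesseract_install() {
    prefix=$1
    binary="$prefix/bin/tesseract"
    if [ -f "$binary" ] && [ -x "$binary" ]; then
        echo "found: $binary"
    else
        echo "missing: $binary" >&2
        return 1
    fi
}

# /opt/tesseract is the prefix used in the ./configure call above.
check_tesseract_install /opt/tesseract || echo "check the 'make install' output"
```

If you configured a different prefix, pass that directory instead.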

The executable should be located at /opt/tesseract/bin/tesseract. More info about compilation at:

OpenKM 5.1 OCR configuration

Starting with OpenKM 5.1 you can choose between several OCR engines:

OCR Engine	Text Extractor	Image Formats	Default program arguments
Tesseract 2.x	com.openkm.extractor.Tesseract2TextExtractor	TIFF	Config.SYSTEM_OCR ${fileIn} ${fileOut}
Tesseract 3.x	com.openkm.extractor.Tesseract3TextExtractor	TIFF PNG JPG GIF	Config.SYSTEM_OCR ${fileIn} ${fileOut}
Cuneiform	com.openkm.extractor.CuneiformTextExtractor	TIFF PNG JPG GIF	Config.SYSTEM_OCR ${fileIn} -o ${fileOut}

So, to pass a command-line parameter to your tesseract executable, include it in the system.ocr property:

system.ocr=/usr/bin/tesseract -l esp
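
To see what the resulting command line looks like, the "Default program arguments" column of the table above can be expanded by hand. The sketch below only mimics that expansion for illustration. build_ocr_command, the engine keywords and the file paths are hypothetical and are not part of OpenKM.

```shell
#!/bin/sh
# Hypothetical helper: prints the command line that results from combining
# the system.ocr value with the argument layout listed in the table above.
build_ocr_command() {
    engine=$1       # tesseract2, tesseract3 or cuneiform (names made up here)
    system_ocr=$2   # value of the system.ocr property
    file_in=$3      # stands in for ${fileIn}
    file_out=$4     # stands in for ${fileOut}
    case "$engine" in
        tesseract2|tesseract3)
            # Config.SYSTEM_OCR ${fileIn} ${fileOut}
            printf '%s %s %s\n' "$system_ocr" "$file_in" "$file_out" ;;
        cuneiform)
            # Config.SYSTEM_OCR ${fileIn} -o ${fileOut}
            printf '%s %s -o %s\n' "$system_ocr" "$file_in" "$file_out" ;;
        *)
            echo "unknown engine: $engine" >&2
            return 1 ;;
    esac
}

# The property value from the example above, with made-up file names.
build_ocr_command tesseract3 '/usr/bin/tesseract -l esp' /tmp/scan.tif /tmp/scan
```

Any extra options placed in system.ocr therefore end up before the input and output file arguments.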

In this OpenKM version you can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at http://wiki.services.openoffice.org/wiki/Dictionaries. After downloading one, set this configuration property to the path of the dictionary file:

system.openoffice.dictionary=/path/to/dictionary.dic

Software required

You can enable any of these text extractors by adding it to the textFilterClasses param of the SearchIndex section in your repository.xml file.
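
A quick way to confirm that an extractor class is actually listed is to search repository.xml for its class name. This is a minimal sketch under stated assumptions: extractor_listed is a hypothetical helper, the location of repository.xml depends on your installation, and the check is a plain text match, so it would also match an entry that has been commented out.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given class name appears anywhere
# in the given repository.xml file.
extractor_listed() {
    repo_xml=$1
    class_name=$2
    # -F: treat the class name as a fixed string, -q: no output, status only
    grep -F -q -- "$class_name" "$repo_xml"
}

# ./repository.xml is a placeholder path; adjust it to your installation.
if extractor_listed ./repository.xml com.openkm.extractor.Tesseract3TextExtractor 2>/dev/null; then
    echo "extractor is listed"
else
    echo "extractor not listed (or repository.xml not found)"
fi
```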

You can download Tesseract 3 for Windows from the tesseract-ocr Google Code project. To install Tesseract 3 on Ubuntu, add the PPA and install Tesseract OCR 3.0 SVN:

 $ sudo add-apt-repository ppa:alex-p/notesalexp
 $ sudo apt-get update
 $ sudo apt-get install tesseract-ocr tesseract-ocr-eng

After installing the latest Tesseract, disable the PPA again, because it contains a lot of bleeding-edge packages:

 $ sudo add-apt-repository -r ppa:alex-p/notesalexp

Starting with OpenKM 5.1 we offer integration with Cognitive OpenOCR (Cuneiform). This OCR engine does a very good job, improving on Tesseract's conversion ratios.

You can grab binaries from http://pkgs.org/package/cuneiform.

External links
