Difference between revisions of "Third-party software integration: OCR"

From OpenKM Documentation
{{TOCright}} __TOC__
+
Starting with OpenKM 5.1.9 you can choose between several OCR engines:
 
 
Tesseract is an open source OCR engine adopted by Google. It works very well. It natively reads TIFF documents and achieves a high recognition rate on images scanned at 300 dpi and converted to line art (1-bit color).
 
 
 
On Debian / Ubuntu, installation is much simpler:
 
 
 
$ aptitude install tesseract-ocr
 
 
 
To add support for the English language, also install:

$ aptitude install tesseract-ocr-eng

You can also download Windows executables for tesseract-2.04 at http://code.google.com/p/tesseract-ocr/downloads/list.
 
 
 
Now you have to tell OpenKM to use this OCR application. Edit the file [[OpenKM.cfg]]:
 
 
 
$ vim OpenKM.cfg
 
 
 
And set the system.ocr property to the path of the tesseract executable:
 
 
 
<source lang="java">
 
system.ocr=/usr/local/bin/tesseract
 
</source>
 
 
 
For more info, go to http://code.google.com/p/tesseract-ocr/ and [http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html Tesseract - Summary & first experiences].
 
 
 
Another interesting free OCR application is OCRopus. It offers many improvements over Tesseract but is at an early stage of development. The latest released version (0.3.1) is quite usable and works very well, but it has to be compiled, which is currently a difficult task. Visit http://code.google.com/p/ocropus/ for more info.
 
 
 
== Compile from source code ==
 
 
 
You can download the source code from http://code.google.com/p/tesseract-ocr/ and compile it yourself. Also download the language files you need and uncompress them in the same folder as the application.
 
 
 
$ sudo aptitude install build-essential libtiff4-dev
 
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
 
$ tar xzvf tesseract-2.04.tar.gz
 
$ cd tesseract-2.04
 
$ ./configure --prefix=/opt/tesseract
 
$ make
 
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
 
$ tar xzvf tesseract-2.00.eng.tar.gz
 
$ sudo make install
 
 
 
The executable should be located at /opt/tesseract/bin/tesseract. More info about compilation at:
 
 
 
* http://code.google.com/p/tesseract-ocr/wiki/ReadMe
 
* http://code.google.com/p/tesseract-ocr/wiki/FAQ
 
 
 
== OpenKM 5.1 OCR configuration ==
 
Starting with OpenKM 5.1 you can choose between several OCR engines:
 
  
  
 
{| align="center" border="1" cellpadding="5" cellspacing="0"
 
{| align="center" border="1" cellpadding="5" cellspacing="0"
! OCR Engine || Text Extractor || Image Formats || Default program arguments
+
! OCR Engine || Text Extractor || Image Formats || Program arguments
 
|-
 
|-
| Tesseract 2.x || com.openkm.extractor.Tesseract2TextExtractor || TIFF || Config.SYSTEM_OCR ${fileIn} ${fileOut}
+
| [[Tesseract|Tesseract 2.x]] || com.openkm.extractor.Tesseract2TextExtractor || TIFF || /path/to/tesseract ${fileIn} ${fileOut}
 
|-
 
|-
| Tesseract 3.x || com.openkm.extractor.Tesseract3TextExtractor || TIFF PNG JPG GIF || Config.SYSTEM_OCR ${fileIn} ${fileOut}
+
| [[Tesseract|Tesseract 3.x]] || com.openkm.extractor.Tesseract3TextExtractor || TIFF PNG JPG GIF || /path/to/tesseract ${fileIn} ${fileOut}
 
|-
 
|-
| Cuneiform || com.openkm.extractor.CuneiformTextExtractor || TIFF PNG JPG GIF || Config.SYSTEM_OCR ${fileIn} -o ${fileOut}
+
| [[Cuneiform]] || com.openkm.extractor.CuneiformTextExtractor || TIFF PNG JPG GIF || /path/to/cuneiform ${fileIn} -o ${fileOut}
 +
|-
 +
| Abby || com.openkm.extractor.AbbyTextExtractor || TIFF PNG JPG GIF || /path/to/abby ${fileIn} -o ${fileOut}
 
|-
 
|-
 
|}
 
|}
 +
 +
{{Advice|Check this [http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison Linux OCR Software Comparison].}}
  
 
So, if you want to pass a command line parameter to your tesseract executable, you should use this configuration:
 
  
  system.ocr=/usr/bin/tesseract -l esp
+
  system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l esp
 +
 
 +
{{Note|You need to modify the '''registered.text.extractors''' configuration property to match the OCR engine you have configured using '''system.ocr'''. By default only the Cuneiform text extractor is enabled. To configure Tesseract, remove the Cuneiform extractor and add the Tesseract extractor.}}
 +
 
 +
You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at [http://extensions.services.openoffice.org/en/dictionaries OpenOffice.org Dictionary Repository]. After downloading one, set this configuration property to the path of the dictionary file:
 +
 
 +
system.openoffice.dictionary=/path/to/dictionary.(oxt|zip)
 +
 
 +
Since OpenKM 5.1.10 there is a new configuration property that makes it possible to perform OCR on pages scanned upside down. This optional property is called '''system.ocr.rotate''' and is defined as a semicolon-separated list of degrees by which to rotate the pages. For example:
 +
 
 +
  system.ocr.rotate=90;180;270;
  
 
=== Software required ===
 
 
You can enable any of these text extractors by adding it to the '''textFilterClasses''' param of the '''SearchIndex''' section in your repository.xml file.
 
  
You can download Tesseract 3 for Windows from [http://code.google.com/p/tesseract-ocr/ tesseract-ocr Google Code]. To install Tesseract 3 in Ubuntu, add the PPA and install Tesseract OCR 3.0 SVN:
+
Starting with OpenKM 5.1 we offer integration with [http://en.openocr.org/ Cognitive OpenOCR (Cuneiform)]. This OCR engine does a very good job and improves on Tesseract's recognition rates.
 
 
  $ sudo add-apt-repository ppa:alex-p/notesalexp
 
  $ sudo apt-get update
 
  $ sudo apt-get install tesseract-ocr tesseract-ocr-eng
 
 
 
{{Warning|You must add the PPA, install the latest Tesseract and then disable the PPA as it contains a lot of bleeding edge packages!}}
 
  
  $ sudo add-apt-repository -r ppa:alex-p/notesalexp
+
* [[Tesseract]]
 +
* [[Cuneiform]]
  
[[Category: Installation Guide]]
+
=== Older OpenKM versions ===
 +
Starting with OpenKM 5.1.8, the Cuneiform configuration changed: the parameters are now set in the '''system.ocr''' configuration property, which should be set to "/usr/bin/cuneiform ${fileIn} -o ${fileOut}". See [[Migration from 5.1.7 to 5.1.8]] for more info. In older OpenKM releases the correct configuration was "/usr/bin/cuneiform".
  
== External links ==
+
Starting with OpenKM 5.1.9, the Tesseract configuration changed: the parameters are now set in the '''system.ocr''' configuration property, which should be set to "/usr/bin/tesseract ${fileIn} ${fileOut}". See [[Migration from 5.1.8 to 5.1.9]] for more info. In older OpenKM releases the correct configuration was "/usr/bin/tesseract".
* [http://groups.google.com/group/tesseract-ocr Tesseract OCR Google Groups]
 
* [http://triviaatwork.blogspot.com/2009/08/first-interactions-with-tesseract-ocr.html First Interactions with Tesseract OCR on Ubuntu Linux]
 
  
 
[[Category: Installation Guide]]
 

Latest revision as of 11:20, 25 March 2013

Starting with OpenKM 5.1.9 you can choose between several OCR engines:


{| align="center" border="1" cellpadding="5" cellspacing="0"
! OCR Engine || Text Extractor || Image Formats || Program arguments
|-
| Tesseract 2.x || com.openkm.extractor.Tesseract2TextExtractor || TIFF || /path/to/tesseract ${fileIn} ${fileOut}
|-
| Tesseract 3.x || com.openkm.extractor.Tesseract3TextExtractor || TIFF PNG JPG GIF || /path/to/tesseract ${fileIn} ${fileOut}
|-
| Cuneiform || com.openkm.extractor.CuneiformTextExtractor || TIFF PNG JPG GIF || /path/to/cuneiform ${fileIn} -o ${fileOut}
|-
| Abby || com.openkm.extractor.AbbyTextExtractor || TIFF PNG JPG GIF || /path/to/abby ${fileIn} -o ${fileOut}
|}

{{Advice|Check this [http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison Linux OCR Software Comparison].}}

So, if you want to pass a command line parameter to your tesseract executable, you should use this configuration:

system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l esp
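
The way the ${fileIn} and ${fileOut} placeholders turn into a concrete command line can be illustrated with a small Python sketch. This is not OpenKM's own code; the function name build_ocr_command and the sample paths are assumptions made for illustration only. It splits the configured value into arguments first and substitutes afterwards, so a file path containing spaces stays a single argument.

```python
import shlex
from string import Template


def build_ocr_command(system_ocr, file_in, file_out):
    """Expand ${fileIn}/${fileOut} in a system.ocr value and return an argv list.

    Hypothetical helper: it only mimics the placeholder syntax shown above.
    """
    values = {"fileIn": file_in, "fileOut": file_out}
    # Split on whitespace first, then substitute inside each token.
    tokens = shlex.split(system_ocr)
    return [Template(token).substitute(values) for token in tokens]


if __name__ == "__main__":
    command = build_ocr_command(
        "/usr/bin/tesseract ${fileIn} ${fileOut} -l esp",
        "/tmp/scan page.tif",  # made-up input path
        "/tmp/out",            # made-up output base name
    )
    print(command)
    # ['/usr/bin/tesseract', '/tmp/scan page.tif', '/tmp/out', '-l', 'esp']
```

The resulting list could be passed to subprocess.run without a shell, which avoids quoting problems with unusual file names.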

{{Note|You need to modify the registered.text.extractors configuration property to match the OCR engine you have configured using system.ocr. By default only the Cuneiform text extractor is enabled. To configure Tesseract, remove the Cuneiform extractor and add the Tesseract extractor.}}

You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at the OpenOffice.org Dictionary Repository. After downloading one, set this configuration property to the path of the dictionary file:

system.openoffice.dictionary=/path/to/dictionary.(oxt|zip)
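
Before pointing the property at a downloaded file, it can help to confirm that the package really is a ZIP-based archive containing dictionary data. The sketch below is only a sanity check under that assumption (an .oxt extension is a ZIP archive, and spelling dictionaries normally ship .dic and .aff files); the function name list_dictionary_files is hypothetical, and nothing here reflects how OpenKM itself reads the file.

```python
import zipfile


def list_dictionary_files(path):
    """Return the .dic/.aff entries found inside an .oxt or .zip package."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a ZIP-based package")
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith((".dic", ".aff"))
        )
```

An empty result suggests the download is not a dictionary package, in which case it is worth checking the file before setting the property.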

Since OpenKM 5.1.10 there is a new configuration property that makes it possible to perform OCR on pages scanned upside down. This optional property is called system.ocr.rotate and is defined as a semicolon-separated list of degrees by which to rotate the pages. For example:

 system.ocr.rotate=90;180;270;
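
As a rough illustration of how such a value might be consumed, the Python sketch below parses the semicolon-separated list and tries each angle in turn until some text is recognized. The retry strategy, the function names and the ocr_at callback are all assumptions for illustration; the page does not describe OpenKM's actual algorithm.

```python
def parse_rotations(value):
    """Parse a system.ocr.rotate value such as '90;180;270;' into integers."""
    pieces = (piece.strip() for piece in value.split(";"))
    # The trailing ';' produces an empty piece, which is skipped.
    return [int(piece) for piece in pieces if piece]


def ocr_with_rotations(ocr_at, rotations):
    """Try the unrotated page first, then each configured angle.

    ocr_at is a hypothetical callable taking an angle in degrees and
    returning the recognized text for the page rotated by that angle.
    """
    for angle in [0] + list(rotations):
        text = ocr_at(angle).strip()
        if text:
            return text
    return ""
```

With this sketch, a page scanned upside down would yield nothing at 0 and 90 degrees and would be picked up on the 180 degree attempt.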

=== Software required ===

You can enable any of these text extractors by adding it to the textFilterClasses param of the SearchIndex section in your repository.xml file.

Starting with OpenKM 5.1 we offer integration with Cognitive OpenOCR (Cuneiform). This OCR engine does a very good job and improves on Tesseract's recognition rates.

=== Older OpenKM versions ===

Starting with OpenKM 5.1.8, the Cuneiform configuration changed: the parameters are now set in the system.ocr configuration property, which should be set to "/usr/bin/cuneiform ${fileIn} -o ${fileOut}". See Migration from 5.1.7 to 5.1.8 for more info. In older OpenKM releases the correct configuration was "/usr/bin/cuneiform".

Starting with OpenKM 5.1.9, the Tesseract configuration changed: the parameters are now set in the system.ocr configuration property, which should be set to "/usr/bin/tesseract ${fileIn} ${fileOut}". See Migration from 5.1.8 to 5.1.9 for more info. In older OpenKM releases the correct configuration was "/usr/bin/tesseract".
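
The difference between the old and new forms of system.ocr can be summarized with a small Python sketch that converts a bare executable path into the placeholder form quoted above. The helper name upgrade_system_ocr and its engine argument are hypothetical; the migration pages referenced above remain the authoritative instructions.

```python
# Argument templates taken from the values quoted in this section.
NEW_STYLE_ARGUMENTS = {
    "tesseract": "${fileIn} ${fileOut}",
    "cuneiform": "${fileIn} -o ${fileOut}",
}


def upgrade_system_ocr(value, engine):
    """Return a system.ocr value in the newer placeholder form.

    Values that already contain a placeholder are returned unchanged.
    """
    if "${fileIn}" in value or "${fileOut}" in value:
        return value
    return f"{value.strip()} {NEW_STYLE_ARGUMENTS[engine]}"
```

For example, upgrade_system_ocr("/usr/bin/cuneiform", "cuneiform") gives "/usr/bin/cuneiform ${fileIn} -o ${fileOut}", matching the value recommended for 5.1.8 and later.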