Difference between revisions of "Third-party software integration: OCR"

Latest revision as of 11:20, 25 March 2013

Starting with OpenKM 5.1.9 you can choose between several OCR engines:

OCR Engine	Text Extractor	Image Formats	Program arguments
Tesseract 2.x	com.openkm.extractor.Tesseract2TextExtractor	TIFF	/path/to/tesseract ${fileIn} ${fileOut}
Tesseract 3.x	com.openkm.extractor.Tesseract3TextExtractor	TIFF PNG JPG GIF	/path/to/tesseract ${fileIn} ${fileOut}
Cuneiform	com.openkm.extractor.CuneiformTextExtractor	TIFF PNG JPG GIF	/path/to/cuneiform ${fileIn} -o ${fileOut}
Abby	com.openkm.extractor.AbbyTextExtractor	TIFF PNG JPG GIF	/path/to/abby ${fileIn} -o ${fileOut}

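For a comparison of these engines, see this Linux OCR Software Comparison (http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison).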

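The OCR command is set in the system.ocr configuration property. To pass an extra command-line parameter to the tesseract executable, such as the -l language option, append it after the ${fileIn} and ${fileOut} placeholders: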

system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l esp
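To illustrate how a command template like this can be turned into an actual command line, here is a minimal Python sketch. It is not OpenKM's implementation: the function name build_ocr_command and the sample file paths are hypothetical, and the only assumption taken from this page is that ${fileIn} and ${fileOut} stand for the input image and the output file.

```python
import shlex

def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn}/${fileOut} in a system.ocr-style template.

    Hypothetical helper: the template is split into arguments first and
    the placeholders are substituted afterwards, so a path containing
    spaces stays a single argument.
    """
    args = []
    for token in shlex.split(template):
        token = token.replace("${fileIn}", file_in)
        token = token.replace("${fileOut}", file_out)
        args.append(token)
    return args

# Sample paths are made up for the example.
cmd = build_ocr_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut} -l esp",
    "/tmp/scan 01.tif",
    "/tmp/scan01",
)
print(cmd)
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems with unusual file names.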

You need to modify the registered.text.extractors configuration property to match the OCR engine you have configured in system.ocr. By default only the Cuneiform text extractor is enabled. To use Tesseract, remove the Cuneiform extractor and add the Tesseract extractor that matches your Tesseract version.
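The pairing between engine, extractor class and accepted image formats can be written down as a small lookup, sketched here in Python. The class names and formats are copied from the table above; the EXTRACTORS dictionary and the extractor_for function are hypothetical names used only for illustration and are not part of OpenKM.

```python
# Engine -> (text extractor class, accepted image formats), as listed
# in the table on this page.
EXTRACTORS = {
    "Tesseract 2.x": ("com.openkm.extractor.Tesseract2TextExtractor",
                      {"TIFF"}),
    "Tesseract 3.x": ("com.openkm.extractor.Tesseract3TextExtractor",
                      {"TIFF", "PNG", "JPG", "GIF"}),
    "Cuneiform": ("com.openkm.extractor.CuneiformTextExtractor",
                  {"TIFF", "PNG", "JPG", "GIF"}),
    "Abby": ("com.openkm.extractor.AbbyTextExtractor",
             {"TIFF", "PNG", "JPG", "GIF"}),
}

def extractor_for(engine, image_format):
    """Return the extractor class for an engine, or raise ValueError
    if the table does not list the image format for that engine."""
    class_name, formats = EXTRACTORS[engine]
    if image_format.upper() not in formats:
        raise ValueError(f"{engine} does not accept {image_format}")
    return class_name

print(extractor_for("Tesseract 3.x", "png"))
```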

You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at the OpenOffice.org Dictionary Repository (http://extensions.services.openoffice.org/en/dictionaries). After downloading one, set this configuration property to the path of the dictionary file:

system.openoffice.dictionary=/path/to/dictionary.(oxt|zip)

Since OpenKM 5.1.10 there is a new configuration property that makes it possible to perform OCR on pages that were scanned upside down or otherwise rotated. This optional property is called system.ocr.rotate and is defined as a semicolon-separated list of rotation angles in degrees. For example:

 system.ocr.rotate=90;180;270;
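As a rough illustration of how such a list could be consumed, the Python sketch below parses the value and retries recognition at each angle. This is an assumption about the general technique, not a description of OpenKM's code: the function names are hypothetical, and the policy of trying the unrotated page first and stopping at the first non-empty result is chosen only for the example.

```python
def parse_rotations(value):
    """Parse a system.ocr.rotate-style value such as "90;180;270;".

    Empty entries (for example after the trailing ';') are skipped.
    """
    degrees = []
    for part in value.split(";"):
        part = part.strip()
        if part:
            degrees.append(int(part) % 360)
    return degrees

def ocr_with_rotations(recognize, rotations):
    """Try the unrotated page, then each rotation in order.

    `recognize(angle)` is a caller-supplied callable (hypothetical) that
    runs OCR on the page rotated by `angle` degrees and returns text.
    Returns (angle, text) for the first non-empty result, or (None, "").
    """
    for angle in [0] + rotations:
        text = recognize(angle)
        if text.strip():
            return angle, text
    return None, ""

print(parse_rotations("90;180;270;"))
```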

Software required

You can enable any of these text extractors by adding it to the textFilterClasses param of the SearchIndex section in your repository.xml file.

Starting with OpenKM 5.1 we offer integration with Cognitive OpenOCR (Cuneiform). This OCR engine does a very good job and improves on Tesseract's conversion ratios.

Older OpenKM versions

Starting from OpenKM 5.1.8, the Cuneiform configuration changed: the parameters are now set in the system.ocr configuration property, which should be set to "/usr/bin/cuneiform ${fileIn} -o ${fileOut}". See Migration from 5.1.7 to 5.1.8 for more info. In older OpenKM releases the correct configuration was "/usr/bin/cuneiform".

Starting from OpenKM 5.1.9, the Tesseract configuration changed in the same way: the parameters are now set in the system.ocr configuration property, which should be set to "/usr/bin/tesseract ${fileIn} ${fileOut}". See Migration from 5.1.8 to 5.1.9 for more info. In older OpenKM releases the correct configuration was "/usr/bin/tesseract".
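The change from the old to the new value format can be sketched as a small conversion function in Python. This is only an illustration of the two rules described above, not an official migration tool: the names DEFAULT_ARGS and upgrade_ocr_setting are hypothetical, and the sketch assumes the executable path contains no spaces.

```python
import posixpath

# Placeholder arguments per engine, as given in the text above.
DEFAULT_ARGS = {
    "cuneiform": "${fileIn} -o ${fileOut}",
    "tesseract": "${fileIn} ${fileOut}",
}

def upgrade_ocr_setting(value):
    """Convert an old-style system.ocr value to the new template form.

    Values that already contain ${fileIn}, empty values, and values for
    unknown executables are returned unchanged.
    """
    value = value.strip()
    if not value or "${fileIn}" in value:
        return value
    parts = value.split(None, 1)
    executable = parts[0]
    extra = parts[1] if len(parts) > 1 else ""
    args = DEFAULT_ARGS.get(posixpath.basename(executable))
    if args is None:
        return value
    # Placeholders go right after the executable; extra options follow.
    return " ".join(p for p in (executable, args, extra) if p)

print(upgrade_ocr_setting("/usr/bin/cuneiform"))
print(upgrade_ocr_setting("/usr/bin/tesseract"))
```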

@@ Line 15: / Line 15: @@
 |}
+{{Advice|Check this [http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison Linux OCR Software Comparison].}}
-{{Note|Starting from OpenKM 5.1.8 Cuneiform configuration was changed and the parameters are, set in '''system.ocr''' configuration. Should be set to "/usr/bin/cuneiform ${fileIn} -o ${fileOut}". See [[Migration from 5.1.7 to 5.1.8]] for more info. In older OpenKM releases the right configuration was "/usr/bin/cuneiform".}}
+So, if you want to pass a command line parameter to your tesseract executable, you should use this configuration:
-{{Note|Starting from OpenKM 5.1.9 Tesseract configuration has changed and the parameters are, set in '''system.ocr''' configuration. Should be set to "/usr/bin/tesseract ${fileIn} ${fileOut}". See [[Migration from 5.1.8 to 5.1.9]] for more info. In older OpenKM releases the right configuration was "/usr/bin/tesseract".}}
+ system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l esp
-So, if you want to pass a command line parameter to your tesseract executable, you should use this configuration:
+{{Note|You need to modify the '''registered.text.extractors''' configuration property to match the OCR engine you have configured using '''system.ocr'''. By default only Cuneiform text extractor is enabled. If you want to configure Tesseract remove the Cuneiform extractor and add the Tesseract extractor.}}
- system.ocr=/usr/bin/tesseract -l esp
+You can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language specific dictionaries at [http://extensions.services.openoffice.org/en/dictionaries OpenOffice.org Dictionary Repository]. After download, set this configuration property with the path to the dictorionay file:
-In this OpenKM version you can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language specific dictionaries at [http://extensions.services.openoffice.org/en/dictionaries OpenOffice.org Dictionary Repository]. After download, set this configuration property with the path to the dictorionay file:
   system.openoffice.dictionary=/path/to/dictionary.(oxt|zip)
-Since OpenKM 5.1.10 you have a new configuration property which make possible to perform OCR in upside down scanned pages. This optional configuration property is called '''system.ocr.rotation''' and is defined as a list of degrees to rotate the pages. For example:
+Since OpenKM 5.1.10 you have a new configuration property which make possible to perform OCR in upside down scanned pages. This optional configuration property is called '''system.ocr.rotate''' and is defined as a list of degrees to rotate the pages. For example:
    system.ocr.rotate=90;180;270;
@@ Line 39: / Line 38: @@
 * [[Tesseract]]
 * [[Cuneiform]]
+=== Older OpenKM versions ===
+Starting from OpenKM 5.1.8 Cuneiform configuration was changed and the parameters are, set in '''system.ocr''' configuration. Should be set to "/usr/bin/cuneiform ${fileIn} -o ${fileOut}". See [[Migration from 5.1.7 to 5.1.8]] for more info. In older OpenKM releases the right configuration was "/usr/bin/cuneiform".
+Starting from OpenKM 5.1.9 Tesseract configuration has changed and the parameters are, set in '''system.ocr''' configuration. Should be set to "/usr/bin/tesseract ${fileIn} ${fileOut}". See [[Migration from 5.1.8 to 5.1.9]] for more info. In older OpenKM releases the right configuration was "/usr/bin/tesseract".
 [[Category: Installation Guide]]
