Difference between revisions of "Third-party software integration: OCR"
Revision as of 10:05, 11 January 2012
Starting with OpenKM 5.1 you can choose between several OCR engines:
OCR Engine | Text Extractor | Image Formats | Program arguments |
---|---|---|---|
Tesseract 2.x | com.openkm.extractor.Tesseract2TextExtractor | TIFF | Config.SYSTEM_OCR ${fileIn} ${fileOut} |
Tesseract 3.x | com.openkm.extractor.Tesseract3TextExtractor | TIFF PNG JPG GIF | Config.SYSTEM_OCR ${fileIn} ${fileOut} |
Cuneiform | com.openkm.extractor.CuneiformTextExtractor | TIFF PNG JPG GIF | Config.SYSTEM_OCR ${fileIn} -o ${fileOut} |
Abby | com.openkm.extractor.AbbyTextExtractor | TIFF PNG JPG GIF | Config.SYSTEM_OCR ${fileIn} -o ${fileOut} |

Starting from OpenKM 5.1.8, the Cuneiform configuration has changed: the parameters are now set in the system.ocr configuration property, which should be set to "/usr/bin/cuneiform ${fileIn} -o ${fileOut}". See Migration from 5.1.7 to 5.1.8 for more info.
Starting from OpenKM 5.1.9, the Tesseract configuration has changed: the parameters are now set in the system.ocr configuration property, which should be set to "/usr/bin/tesseract ${fileIn} ${fileOut}". See Migration from 5.1.8 to 5.1.9 for more info.
So, if you want to pass a command-line parameter to your tesseract executable, use a configuration like this:
system.ocr=/usr/bin/tesseract -l esp
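To illustrate how a system.ocr value with ${fileIn} and ${fileOut} placeholders can be turned into a runnable command, here is a minimal Java sketch. It is an assumption for illustration only, not OpenKM's actual implementation: the class OcrCommand, the method expand and the file paths are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of OpenKM: shows one way a system.ocr-style
// command template could be expanded into an argument list.
class OcrCommand {

    // Replaces ${fileIn} and ${fileOut} in each whitespace-separated token.
    static List<String> expand(String template, String fileIn, String fileOut) {
        List<String> args = new ArrayList<>();
        // Split before substituting, so a path containing spaces
        // still ends up as a single argument.
        for (String token : template.trim().split("\\s+")) {
            args.add(token.replace("${fileIn}", fileIn)
                          .replace("${fileOut}", fileOut));
        }
        return args;
    }

    public static void main(String[] argv) {
        // Template taken from the Tesseract note above; the paths are made up.
        List<String> cmd = expand("/usr/bin/tesseract ${fileIn} ${fileOut}",
                                  "/tmp/scan.tif", "/tmp/scan");
        System.out.println(cmd);
        // The list could then be passed to new ProcessBuilder(cmd) to run the engine.
    }
}
```

Splitting the template into tokens first, rather than substituting into the whole string, avoids having to quote file names that contain spaces.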
In this OpenKM version you can also use an OpenOffice.org dictionary to enhance the OCR process. You can find these language-specific dictionaries at the OpenOffice.org Dictionary Repository. After downloading one, set this configuration property to the path of the dictionary file:
system.openoffice.dictionary=/path/to/dictionary.(oxt|zip)
Software required
You can enable any of these text extractors by adding it to the textFilterClasses param of the SearchIndex section in your repository.xml file.
Starting with OpenKM 5.1 we offer integration with Cognitive OpenOCR (Cuneiform). This OCR engine does a very good job, improving on Tesseract's conversion ratios.