Creating automatic key extraction training files
Latest revision as of 10:03, 16 April 2013
Automatic key extraction was removed from OpenKM Community 6.2.4 and OpenKM Professional 6.2.15 due to obsolete and removed libraries.
Creating training files is easy: you only need to create pairs of files that KEA will use to build its extraction model.
The main file analyzed by KEA must be a plain-text file, foo.txt (PDF, DOC, RTF, and other file types must first be converted to txt). Each foo.txt must have a matching foo.key file. The foo.key file contains the keys that identify the document; those keys must be present in your thesaurus.
Example of foo.key
AMARANTHUS
PLANT PRODUCTION
GEOGRAPHICAL DISTRIBUTION
NUTRITIVE VALUE
SEEDS
MERCHANTS
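Because every key must exist in the thesaurus, it can help to check a .key file before training. The following is a minimal sketch, not part of KEA or OpenKM: it assumes one key per line (as in the example above) and that the thesaurus terms are already available as a plain list of strings. The function names and the case-insensitive comparison are assumptions made for illustration.

```python
def parse_keys(key_text):
    """Return the non-empty lines of a .key file's content, one key per line."""
    return [line.strip() for line in key_text.splitlines() if line.strip()]


def keys_missing_from_thesaurus(keys, thesaurus_terms):
    """Return the keys that have no match among the thesaurus terms.

    Assumption: matching ignores case and surrounding whitespace.
    """
    known = {term.strip().casefold() for term in thesaurus_terms}
    return [key for key in keys if key.casefold() not in known]


if __name__ == "__main__":
    keys = parse_keys("AMARANTHUS\nPLANT PRODUCTION\nSEEDS\n")
    # Hypothetical thesaurus terms, for illustration only.
    terms = ["Amaranthus", "Seeds"]
    print(keys_missing_from_thesaurus(keys, terms))
```

Any key reported by this check should either be corrected in the .key file or added to the thesaurus before the model is built.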
Both files, along with all the other file pairs, must be placed in the same directory. The path of that directory is what KEA uses to create the model. Take a look at Automatic_key_extraction_full_example and its use of the trainingFolder parameter, which the application uses to create the KEA model.
You need a significant number of document pairs to build a good key extraction model. A set of 100 or more files (depending on how large your thesaurus is, among other factors) is a good size to start with.
We suggest taking a look at the KEA project to see how those files are defined in its training folder [1].
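Before pointing trainingFolder at a directory, a quick check that every foo.txt has its foo.key (and how many complete pairs exist) can save a failed training run. This is a hedged sketch under the assumption that all pairs sit directly in one flat directory; the function name is hypothetical and nothing here calls KEA itself.

```python
from pathlib import Path


def scan_training_folder(training_folder):
    """Compare the .txt and .key files in a folder by base name.

    Returns three sorted lists of base names:
    complete pairs, .txt files without a .key, and .key files without a .txt.
    """
    folder = Path(training_folder)
    txt_names = {path.stem for path in folder.glob("*.txt")}
    key_names = {path.stem for path in folder.glob("*.key")}
    return (
        sorted(txt_names & key_names),
        sorted(txt_names - key_names),
        sorted(key_names - txt_names),
    )


if __name__ == "__main__":
    pairs, missing_key, missing_txt = scan_training_folder(".")
    print(f"{len(pairs)} complete pairs")
    print("txt without key:", missing_key)
    print("key without txt:", missing_txt)
```

The number of complete pairs can then be compared against the suggested starting size of about 100 files.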
How to optimize the model
The KEA model is a living thing. The idea is that users tune the KEA model from within OpenKM. To do this, we suggest creating some metadata (a property group) that indicates a user has validated a document's keys, that is, a flag marking documents that can be used to create a new model. After some time has passed, you can write a small application that extracts the relevant documents (txt files can be created easily using the OpenOffice conversion) along with their key files (built from the keywords assigned to the documents).
As your repository grows, your KEA model will become more effective.
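The export step described above can be sketched as follows. This is only an illustration under stated assumptions: the ExportedDocument record, its fields, and the write_training_pairs function are hypothetical and are not OpenKM API. The sketch assumes the documents have already been fetched from the repository, converted to plain text, and marked as validated through the property group.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExportedDocument:
    """Hypothetical record for a document already pulled from the repository."""
    name: str        # base name used for foo.txt / foo.key
    text: str        # document content already converted to plain text
    keywords: list   # keywords assigned to the document
    validated: bool  # value of the "validated" flag in the property group


def write_training_pairs(documents, training_folder):
    """Write a .txt/.key pair for each validated document that has keywords.

    Returns the base names of the pairs that were written.
    """
    folder = Path(training_folder)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in documents:
        # Skip documents no user has validated, or that carry no keywords.
        if not doc.validated or not doc.keywords:
            continue
        (folder / f"{doc.name}.txt").write_text(doc.text, encoding="utf-8")
        # One key per line, matching the foo.key layout.
        key_body = "\n".join(doc.keywords) + "\n"
        (folder / f"{doc.name}.key").write_text(key_body, encoding="utf-8")
        written.append(doc.name)
    return written
```

Running such an export periodically and rebuilding the model from the resulting folder is one way to let the model follow the growth of the repository.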