Creating automatic key extraction training files

Creating training files is easy: you simply need to create pairs of files that KEA will use to build the KEA extraction model.

The main file to be analyzed by KEA must be a plain text file, for example foo.txt (if you have a PDF, DOC, RTF or other type of file, it must first be converted to txt). Each foo.txt file must have a matching foo.key file. The foo.key file contains the keywords that identify the document, and those keywords must be present in your thesaurus.

Example of a foo.key file:

 AMARANTHUS
 PLANT PRODUCTION
 GEOGRAPHICAL DISTRIBUTION
 NUTRITIVE VALUE
 SEEDS
 MERCHANTS

All these file pairs, together with the other pairs, must be placed in a single directory. That directory path is what KEA uses to create the model. Take a look at Automatic_key_extraction_full_example and the use of the trainingFolder parameter used by the application to create the KEA model; a sketch of a simple folder check is shown below.
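
Below is a minimal sketch, using plain java.io only (no KEA or OpenKM classes), that checks that every foo.txt in the training folder has its matching foo.key file before the model is built. The folder path is a hypothetical example; replace it with the path you pass as the trainingFolder parameter.

 import java.io.File;
 
 public class CheckTrainingFolder {
     public static void main(String[] args) {
         // Hypothetical path: use the same directory you pass to KEA
         // as the trainingFolder parameter.
         File folder = new File("/home/openkm/kea/training");
         File[] files = folder.listFiles();
         if (files == null) {
             System.out.println("Training folder not found: " + folder);
             return;
         }
         for (File txt : files) {
             if (txt.getName().endsWith(".txt")) {
                 // Each foo.txt must be accompanied by a foo.key
                 String base = txt.getName().substring(0, txt.getName().length() - 4);
                 File key = new File(folder, base + ".key");
                 if (!key.exists()) {
                     System.out.println("Missing key file for: " + txt.getName());
                 }
             }
         }
     }
 }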

You need a significant number of document pairs in order to build a good key extraction model. Around 100 files or more (depending on how large your thesaurus is, among other factors) is a good size to start with.

We suggest you take a look at the KEA project to see how these files are defined in the training folder [1].

How to optimize the model

The KEA model is a living thing: the idea is that users keep tuning the KEA model in OpenKM. To do this, we suggest creating some metadata to indicate that a user has validated the keywords of a document (a flag marking documents that can be used to create a new model). After some time has passed, you can create a minimal application to extract the relevant documents (using OpenOffice conversion, txt files can be created easily) and their key files too (the keywords assigned to the documents), as sketched below.
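
As a rough illustration of that export step, here is a minimal sketch that writes a foo.key file (one thesaurus term per line) next to an already converted foo.txt. The writeKeyFile helper, the paths and the keyword list are hypothetical placeholders, not OpenKM API calls; the real keywords would come from the documents your users have validated.

 import java.io.FileWriter;
 import java.io.IOException;
 import java.util.Arrays;
 import java.util.List;
 
 public class ExportKeyFile {
 
     // Writes basePath + ".key" with one keyword per line, matching the
     // foo.key format shown above. basePath is the path without extension,
     // e.g. "/home/openkm/kea/training/foo".
     public static void writeKeyFile(String basePath, List<String> keywords) throws IOException {
         try (FileWriter fw = new FileWriter(basePath + ".key")) {
             for (String kw : keywords) {
                 fw.write(kw + "\n");
             }
         }
     }
 
     public static void main(String[] args) throws IOException {
         // Hypothetical example: keywords validated by a user for a
         // document already converted to foo.txt in the training folder.
         writeKeyFile("/home/openkm/kea/training/foo",
                 Arrays.asList("AMARANTHUS", "PLANT PRODUCTION", "SEEDS"));
     }
 }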

As your repository grows, your KEA model will become more accurate.