Automatic key extraction full example
Latest revision as of 10:02, 16 April 2013
Warning: Automatic key extraction was removed from OpenKM Community 6.2.4 and OpenKM Professional 6.2.15 due to obsolete and removed libraries.
SVN checkout modules
To create the KEA model, you must check out the openkm and thesaurus modules.
Select the SVN repository type and enter the URL https://openkm.svn.sourceforge.net/svnroot/openkm/trunk/openkm for the openkm module.
Select the SVN repository type and enter the URL https://openkm.svn.sourceforge.net/svnroot/openkm/trunk/thesaurus for the thesaurus module.
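If you prefer the command line to an IDE wizard, the same two checkouts can be sketched as follows. This is only a sketch: it assumes the svn command-line client is installed, the helper names module_url and checkout_modules are made up for this example, and the repository URLs are the ones quoted above, which may no longer be reachable.

```shell
#!/bin/sh
# Repository root taken from the two URLs quoted above (may be offline today).
SVN_ROOT="https://openkm.svn.sourceforge.net/svnroot/openkm/trunk"

# module_url (hypothetical helper): prints the checkout URL for one module.
module_url() {
    printf '%s/%s\n' "$SVN_ROOT" "$1"
}

# checkout_modules (hypothetical helper): checks out both modules into
# directories named after them, stopping at the first failure.
checkout_modules() {
    for module in openkm thesaurus; do
        svn checkout "$(module_url "$module")" "$module" || return 1
    done
}
```

Source the script and run checkout_modules from the directory that should hold both working copies.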
Installing openkm classes into maven repository
Make sure openkm is installed in your local Maven repository. To do so, run the following command in the openkm module:
mvn clean package install -Dmaven.test.skip=true
Downloading AGROVOC thesaurus
We'll use AGROVOC for testing purposes. You can download it from http://oaei.ontologymatching.org/2007/environment/ (please read the terms of use).
Copy the files ag_skos_20070219.rdf and agrovoc_oaei2007.owl into the thesaurus/src/test/resources/vocabulary folder.
The vocabulary folder contains a testdocs folder with some AGROVOC training documents for creating the KEA model.
Create runtime configuration
Now we can create the runtime configuration. To train the KEA model, execute the ModelBuilder class with the following parameters, in this order:
sourceFolder
trainingFolder
vocabularyFile
vocabularyType
stopwordFile
modelFileName
porterStemmerClass
stopwordClass
language
documentEncoding
testDocs (optional)
In my case:
sourceFolder=/home/jllort/softwareFactoryGalileo/thesaurus/vocabulary (all other paths are relative to sourceFolder)
trainingFolder=testdocs/en/train
vocabularyFile=ag_skos_20070219.rdf
vocabularyType=skos
stopwordFile=stopwords_en.txt
modelFileName=ag_skos_20070219.model
porterStemmerClass=com.openkm.kea.stemmers.PorterStemmer
stopwordClass=com.openkm.kea.stopwords.StopwordsEnglish
language=en
documentEncoding=UTF-8
testDocs=testdocs/en/test
The program arguments for the ModelBuilder class are therefore "/home/jllort/softwareFactoryGalileo/thesaurus/vocabulary testdocs/en/train ag_skos_20070219.rdf skos stopwords_en.txt ag_skos_20070219.model com.openkm.kea.stemmers.PorterStemmer com.openkm.kea.stopwords.StopwordsEnglish en UTF-8 testdocs/en/test" and the VM argument is "-Xmx526M", as you can see in the next screenshot.
The classpath must be set as shown:
If all goes well, a file called ag_skos_20070219.model (the name given in modelFileName) is generated in the vocabulary folder.
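Because ModelBuilder takes its parameters by position, it is easy to swap two of them. The sketch below assembles the argument string from named variables in the order listed above. The variable names and the model_builder_args helper are invented for this example, and it assumes none of the values contain spaces.

```shell
#!/bin/sh
# Values copied from the example above; adjust kea_source_folder to your checkout.
kea_source_folder="/home/jllort/softwareFactoryGalileo/thesaurus/vocabulary"
kea_training_folder="testdocs/en/train"
kea_vocabulary_file="ag_skos_20070219.rdf"
kea_vocabulary_type="skos"
kea_stopword_file="stopwords_en.txt"
kea_model_file_name="ag_skos_20070219.model"
kea_stemmer_class="com.openkm.kea.stemmers.PorterStemmer"
kea_stopword_class="com.openkm.kea.stopwords.StopwordsEnglish"
kea_language="en"
kea_document_encoding="UTF-8"
kea_test_docs="testdocs/en/test"   # optional: set to "" to omit it

# model_builder_args (hypothetical helper): prints the arguments in the
# positional order listed above, appending testDocs only when it is set.
model_builder_args() {
    args="$kea_source_folder $kea_training_folder $kea_vocabulary_file"
    args="$args $kea_vocabulary_type $kea_stopword_file $kea_model_file_name"
    args="$args $kea_stemmer_class $kea_stopword_class"
    args="$args $kea_language $kea_document_encoding"
    if [ -n "$kea_test_docs" ]; then
        args="$args $kea_test_docs"
    fi
    printf '%s\n' "$args"
}
```

The printed line can be pasted into the program arguments field of the run configuration; the VM argument -Xmx526M is set separately.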
Copying vocabulary files into OpenKM
Create a folder called vocabulary in %JBOSS_HOME% and copy into it the files ag_skos_20070219.rdf, agrovoc_oaei2007.owl, ag_skos_20070219.model, and stopwords_en.txt.
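On a Unix-like system the copy step might look like the sketch below. It assumes all four files sit in one source folder (for example the vocabulary folder used for training) and that the JBoss home directory is passed as a path; install_vocabulary is a name made up for this example.

```shell
#!/bin/sh
# install_vocabulary (hypothetical helper): copies the four files named
# above from a source folder into <JBoss home>/vocabulary.
# Usage: install_vocabulary SOURCE_DIR JBOSS_HOME_DIR
install_vocabulary() {
    src="$1"
    dest="$2/vocabulary"
    mkdir -p "$dest" || return 1
    for f in ag_skos_20070219.rdf agrovoc_oaei2007.owl \
             ag_skos_20070219.model stopwords_en.txt; do
        # Stop early with a message instead of leaving a partial copy unnoticed.
        if [ ! -f "$src/$f" ]; then
            echo "missing: $src/$f" >&2
            return 1
        fi
        cp "$src/$f" "$dest/" || return 1
    done
}
```

For example: install_vocabulary /home/jllort/softwareFactoryGalileo/thesaurus/vocabulary "$JBOSS_HOME"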
Configuring OpenKM.cfg
Thesaurus configuration values
kea.thesaurus.owl.file=/vocabulary/agrovoc_oaei2007.owl
kea.thesaurus.base.url=http://www.fao.org/aos/agrovoc
kea.thesaurus.tree.root=SELECT DISTINCT UID, TEXT FROM {UID} Y {OBJECT}, {UID} rdfs:label {TEXT} ; [rdfs:subClassOf {CLAZZ}] where not bound(CLAZZ) and lang(TEXT)="en" USING NAMESPACE foaf=<http://xmlns.com/foaf/0.1/>, dcterms=<http://purl.org/dc/terms/>, rdf=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, owl=<http://www.w3.org/2002/07/owl#>, rdfs=<http://www.w3.org/2000/01/rdf-schema#>, skos=<http://www.w3.org/2004/02/skos/core#>, dc=<http://purl.org/dc/elements/1.1/>
kea.thesaurus.tree.childs=SELECT DISTINCT UID, TEXT FROM {UID} rdfs:subClassOf {CLAZZ}, {UID} rdfs:label {TEXT} where xsd:string(CLAZZ) = "RDFparentID" and lang(TEXT)="en" USING NAMESPACE foaf=<http://xmlns.com/foaf/0.1/>, dcterms=<http://purl.org/dc/terms/>, rdf=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, owl=<http://www.w3.org/2002/07/owl#>, rdfs=<http://www.w3.org/2000/01/rdf-schema#>, skos=<http://www.w3.org/2004/02/skos/core#>, dc=<http://purl.org/dc/elements/1.1/>
KEA model configuration values
kea.thesaurus.skos.file=/vocabulary/ag_skos_20070219.rdf
kea.thesaurus.vocabulary.serql=SELECT X,UID FROM {X} skos:prefLabel {UID} WHERE lang(UID) ="en" USING NAMESPACE rdf=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, skos=<http://www.w3.org/2004/02/skos/core#>,rdfs=<http://www.w3.org/2000/01/rdf-schema#>,dc=<http://purl.org/dc/elements/1.1/>, dcterms=<http://purl.org/dc/terms/>, foaf=<http://xmlns.com/foaf/0.1/>
kea.model.file=/vocabulary/ag_skos_20070219.model
kea.stopwords.file=/vocabulary/stopwords_en.txt
kea.automatic.keyword.extraction.number=10
kea.automatic.keyword.extraction.restriction=on
kea.automatic.keyword.extraction.restriction is an optional parameter that restricts extraction to words that are present in the thesaurus.
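With this many properties it is easy to leave one out. The sketch below lists the property names from the two blocks above that do not appear in a given OpenKM.cfg. It is only a rough check: missing_kea_keys is a name invented for this example, it looks for the text "name=" anywhere in the file (so a commented-out line also counts), and it skips the optional restriction parameter.

```shell
#!/bin/sh
# missing_kea_keys (hypothetical helper): prints every expected property
# name for which the file contains no "name=" text.
# Usage: missing_kea_keys /path/to/OpenKM.cfg
missing_kea_keys() {
    cfg="$1"
    for key in kea.thesaurus.owl.file kea.thesaurus.base.url \
               kea.thesaurus.tree.root kea.thesaurus.tree.childs \
               kea.thesaurus.skos.file kea.thesaurus.vocabulary.serql \
               kea.model.file kea.stopwords.file \
               kea.automatic.keyword.extraction.number; do
        # -F treats the key as a fixed string, so its dots are not wildcards.
        if ! grep -qF "$key=" "$cfg"; then
            printf '%s\n' "$key"
        fi
    done
}
```

An empty result means all nine mandatory names were found; it does not validate the values or the SeRQL queries.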
Creating thesaurus
Log in to OpenKM as a user with administrator privileges, go to the Administration tab, and select the Generate Thesaurus option. Then select the "show level" and execute the "send" option.
Please be patient: building the whole thesaurus takes some time. Depending on your hardware configuration (RAM), the process could take some hours to finish.
After the thesaurus has been created, its folder representation appears in your OpenKM desktop.
Automatic key extraction in a newly uploaded document
Upload a new document, for example one of the documents from testdocs/en/train.
The JBoss console logs the keyword extraction for the uploaded document.
The extracted keywords then appear for that document in the OpenKM UI.