Keyphrase extraction summarization

Written by Ana Canteli on 3 April 2019

Automatic summarization is the process by which software produces a summary that condenses the content of a document. Solutions capable of multi-document summarization take variables such as length, style and syntax into account.

Text summarization follows two main approaches: extraction and abstraction. Extractive methods select a set of words (keyword extraction) or sentences (sentence extraction) from the original text to build the summary, as in single-document summarization. Abstractive methods construct an internal semantic representation and then use natural language generation techniques to create a summary as close as possible to what a human would write. This article focuses on the extractive approach, which is widely used today; search engines are just one example.
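
To make the two extractive variants concrete, here is a minimal sketch in Python that assumes a plain word-frequency score. The function names, the tiny stop-word list and the scoring rule are illustrative assumptions, not the way OpenKM or any search engine works; production systems rely on richer features.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "it", "that", "for", "on", "with", "as", "by"}

def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def extract_keywords(text, top_n=3):
    """Keyword extraction: return the top_n most frequent non-stop words."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_n)]

def extract_sentences(text, top_n=1):
    """Sentence extraction: return the top_n sentences with the highest
    summed word frequency, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in tokenize(sentences[i])),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:top_n])]

sample = ("Invoices are stored in the repository. "
          "The repository indexes invoices by keyword. "
          "Users like coffee.")
print(extract_keywords(sample, top_n=2))
print(extract_sentences(sample, top_n=1))
```

Both functions only select material that already exists in the text, which is what distinguishes extraction from abstraction.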

Keywords and key phrases are widely used to manage large digital libraries. They describe the content of files and provide useful semantic metadata for many purposes. In academic content, authors manually include a selection of keywords that represent the article, which helps with information retrieval. Identifying the relevant words, and the position of key sentences in the text, is essential for indexing content, guiding users as they search, and improving their experience in both search and retrieval. This task is called keyword indexing. However, most texts lack this information, so automatic keyword extraction has become essential in a world where the volume of information and documentation grows exponentially.

Internet users rely on search engines such as Google or Bing every day, probably without realizing that each search actually queries information that has already been analyzed and identified.

Search engines apply powerful machine learning algorithms and data mining techniques to large volumes of data (big data). These algorithms identify, filter and evaluate which keywords are relevant for each type of search, which gives an idea of what the content is about and, in turn, helps users reach it.

In short, the process by which search engines, which millions of people use daily, establish the subject of a web page in the form of keywords and phrases is a critical part of indexing, and it is what later helps us locate the information through those same search engines.

Correct indexing makes information immediately identifiable and locatable, fulfilling the two main objectives of the process:

  • provide a mechanism to identify and locate information

  • save time

For organizations, organizing, classifying and retrieving information within the entity requires a significant investment in human resources, time and money. Keyword and sentence extraction are therefore part of the solution for better information management in companies.

The OpenKM document management system provides an environment in which data and information management is transparently incorporated into business processes. When a document enters the DMS, the system automatically submits the file to a text extraction process. The software, which integrates the automatic keyphrase extraction service KEA (Keyphrase Extraction Algorithm) through its REST API, can then identify and extract significant keywords from the text. In addition, this service lets us choose and apply the keyword extraction model that best suits our needs.
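
As an illustration of what a keyphrase extractor measures: the published description of KEA scores candidate phrases with two features, TF×IDF (how frequent a phrase is in the document relative to how common it is across a corpus) and first occurrence (how early the phrase first appears), and passes them to a Naive Bayes model. The sketch below computes simplified versions of those two features for single words. It is not KEA's or OpenKM's code; the function names, the smoothing choice and the toy corpus are assumptions made for this example.

```python
import math

def tf_idf(term, doc, corpus):
    """Term frequency in `doc` times the log-scaled rarity of `term` across `corpus`."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    # Simplification (assumption): a term absent from the corpus counts as seen once.
    idf = math.log2(len(corpus) / max(1, docs_with_term))
    return tf * idf

def first_occurrence(term, doc):
    """Relative position (0.0 = start) of the first appearance of `term` in `doc`."""
    return doc.index(term) / len(doc)

# Toy corpus of already tokenized documents (hypothetical data).
corpus = [
    ["invoice", "client", "invoice", "total"],
    ["client", "contract"],
    ["client", "report"],
]
doc = corpus[0]
for term in sorted(set(doc)):
    print(term, round(tf_idf(term, doc, corpus), 3), first_occurrence(term, doc))
```

In this toy corpus the word "client" appears in every document, so its TF×IDF is zero, whereas "invoice" is frequent in the first document and absent elsewhere, which makes it the stronger keyword candidate.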

The automatic extraction of keywords can be used in various stages of document management:

  • Classification of documents: OpenKM allows categories to be assigned to documents, records, folders, and emails (including attachments) in parallel with the chosen folder structure. For example, we can organize the folder tree in alphabetical order and, at the same time, assign the category Document Type, Project Department or Location to the documentation stored in that alphabetically organized taxonomy. These options give us an alternative way to navigate the document repository. From the Categories menu, we can view the documentation using these criteria. And in the search engine, we can search for all the documentation related to the marketing and sales department, and the system will return all the content that meets this condition, regardless of its location in the repository.

  • Indexing of documents: automatic keyword extraction assigns indexing terms to documents to facilitate their retrieval. The terms, which come from the body of the document, describe the content to be indexed. The document management search engine retrieves information based on the assigned keywords. Through the Keyword Cloud functionality, we can see the set of terms in the repository to which nodes are linked (keywords can index files, folders, documents of all kinds, e-mails, etc.) and combine them to obtain different sets of content. If, for example, I select the keyword Customer_A, the document management system will show me all the content related to this customer. If I select the keyword Customer_A plus the keyword Invoice, the system will show me only the invoices of Customer A.

  • Thesaurus: In OpenKM it is possible to create and compile a thesaurus, a list of controlled words or terms used to represent concepts in the domain to which the files belong. A thesaurus is closely related to the Semantic Web, the set of activities developed by the W3C to create structured content that machines can process (today a large part of the information on the web is unstructured). The thesaurus contains a documentary language made up of standardized terms and the semantic and functional relationships established between them. Semantic relationships can be of equivalence, association or hierarchy. The thesaurus is very useful for information retrieval in closed document repositories.
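
The keyword combination described under indexing above amounts to a set intersection: a node is returned only if it carries every selected keyword. A minimal sketch follows, assuming a hypothetical in-memory index; the paths, the `find_by_keywords` helper and the data are invented for illustration and do not represent OpenKM's API or storage model.

```python
def find_by_keywords(index, selected):
    """Return, in sorted order, the paths whose keyword set contains every selected keyword."""
    wanted = set(selected)
    return sorted(path for path, keywords in index.items() if wanted <= set(keywords))

# Hypothetical repository index: node path -> keywords assigned to that node.
index = {
    "/clients/a/invoice-001.pdf": {"Customer_A", "Invoice"},
    "/clients/a/contract.pdf": {"Customer_A", "Contract"},
    "/clients/b/invoice-007.pdf": {"Customer_B", "Invoice"},
}

print(find_by_keywords(index, ["Customer_A"]))
print(find_by_keywords(index, ["Customer_A", "Invoice"]))
```

Selecting one keyword returns everything linked to that customer; adding a second keyword narrows the result to the nodes that carry both.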

You will find more information on automatic summarization and automatic keyword extraction in the OpenKM documentation and at http://community.nzdl.org/kea/index.html
