Arabic Information Extraction Methods: A Survey

Article Fingerprint
Research ID X02T4

Abstract

The IR systems developed for Western languages, such as English, perform well on those languages, but they do not achieve the same performance when applied to Eastern languages such as Arabic. This is because the Arabic language has a different and complex structure and morphology: polysemy, irregular and inflected derived forms, variant spellings of certain words, variant writings of certain character combinations, and short (diacritic) and long vowels. In addition, an Arabic word is derived from a root by attaching affixes according to a regular set of word patterns. To address these problems, several methods have been proposed. The aim of this paper is to survey these methods. Although we do not claim that this is an exhaustive study, this work covers nearly 20 different methods. The main approaches applied in these methods are morphological or statistical analysis. To extract information from an Arabic document, the methods based on both approaches must answer the following question: “How can we find the root of the word we are searching for?” To find a word in an Arabic dictionary, we must first extract the root of this word and then look up this root in the dictionary, because the vocabulary of the Arabic language is essentially built by derivation from roots. The roots are words composed of three to five consonants. This work will contribute to improving the performance of Arabic information retrieval systems, because Arabic information extraction methods are the kernel of such systems.
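
To make the root-finding question concrete, the following is a minimal sketch of the morphological approach described above: remove diacritics, strip one prefix and one suffix, then match the remaining stem against a few word patterns. It is not taken from any of the surveyed methods. The affix lists, the three patterns, and the function names `strip_affixes` and `extract_root` are illustrative assumptions, and real systems use far larger affix and pattern inventories and handle ambiguity that this sketch ignores.

```python
import re

# Code points written as escapes to avoid confusion between look-alike letters.
ALEF, WAW, MEEM = "\u0627", "\u0648", "\u0645"

# Short vowels and related marks (U+064B..U+0652) plus the tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

# Illustrative affix lists (assumptions), longer prefixes listed first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ون", "ين", "ات", "ان", "ها", "ية", "ة", "ه", "ي"]


def strip_affixes(word: str) -> str:
    """Remove diacritics, then at most one prefix and one suffix.

    An affix is only removed if at least three letters remain,
    since roots have three to five consonants.
    """
    stem = DIACRITICS.sub("", word)
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= 3:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[:-len(suffix)]
            break
    return stem


def extract_root(word: str) -> str:
    """Return a candidate root by matching the stem against three patterns.

    Falls back to the stem itself when no pattern applies.
    """
    stem = strip_affixes(word)
    # Four letters with ALEF second: drop the ALEF.
    if len(stem) == 4 and stem[1] == ALEF:
        return stem[0] + stem[2:]
    # Four letters starting with MEEM: drop the MEEM.
    if len(stem) == 4 and stem[0] == MEEM:
        return stem[1:]
    # Five letters, MEEM first and WAW fourth: keep letters 2, 3 and 5.
    if len(stem) == 5 and stem[0] == MEEM and stem[3] == WAW:
        return stem[1:3] + stem[4]
    return stem


if __name__ == "__main__":
    for w in ["الكاتب", "مكتوب", "المعلمون"]:
        print(w, "->", extract_root(w))
```

Under these assumptions, words such as "الكاتب" and "مكتوب" both reduce to the same three-consonant candidate root, which is the key that would then be looked up in the dictionary. A purely pattern-based sketch like this can return a wrong root when a root consonant coincides with a pattern letter; that kind of ambiguity is one reason statistical analysis is used alongside morphological analysis.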

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

Not applicable

Data Availability

The datasets used in this study are openly available at [repository link] and the source code is available on GitHub at [GitHub link].

Funding

This work did not receive any external funding.

Related Research

  • Classification

    FOR Code: 091599

  • Version of record

    v1.0

  • Issue date

    29 April 2019

  • Language

    English

Open Access
Research Article
CC-BY-NC 4.0