Arabic Information Extraction Methods: A Survey

Abstract

The IR systems developed for Western languages, such as English, perform well on those languages, but they do not reach the same performance when applied to Eastern languages such as Arabic. This is because the Arabic language has a different and complex structure and morphology: polysemy, irregular and inflected derived forms, variant spellings of certain words, variant written forms of certain character combinations, and short (diacritic) and long vowels. In addition, an Arabic word is derived from a root by attaching affixes according to a regular set of word patterns. To address these problems, several methods have been proposed. The aim of this paper is to survey these methods. Although we do not claim that this is an exhaustive study, the work covers nearly 20 different methods. The main approaches applied in these methods are morphological or statistical analyses. To extract information from an Arabic document, methods based on either approach must answer the following question: “How can we find the root of the word we are searching for?” To find a word in an Arabic dictionary, we must first extract the root of the word and then look that root up in the dictionary, because the vocabulary of the Arabic language is essentially built by derivation from roots. Roots are words composed of three to five consonants. This work will contribute to improving the performance of Arabic information retrieval systems, since Arabic information extraction methods are the kernel of such systems.
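The root lookup described above can be illustrated with a minimal sketch. It is not one of the surveyed methods: the affix lists, the weak-letter set, the tiny root dictionary, and the function names (`strip_affixes`, `drop_weak_infixes`, `extract_root`) are all assumptions chosen for illustration. The sketch strips listed prefixes and suffixes, drops long vowels from the interior of the remaining stem, and accepts a candidate only if it appears in the dictionary of known roots.

```python
# Toy data, assumed for illustration only; real systems use far larger
# affix inventories, pattern tables, and root dictionaries.
PREFIXES = ["ال", "و", "ب", "م", "ي", "ت"]
SUFFIXES = ["ون", "ات", "ين", "ة", "ه"]
WEAK_LETTERS = set("اوي")           # long vowels that often appear as infixes
KNOWN_ROOTS = {"كتب", "درس", "علم"}  # hypothetical root dictionary


def strip_affixes(word):
    """Yield the word and every form reachable by removing listed affixes."""
    seen = set()
    stack = [word]
    while stack:
        current = stack.pop()
        # Skip forms already visited or too short to hold a triliteral root.
        if current in seen or len(current) < 3:
            continue
        seen.add(current)
        yield current
        for prefix in PREFIXES:
            if current.startswith(prefix):
                stack.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix):
                stack.append(current[:-len(suffix)])


def drop_weak_infixes(stem):
    """Remove weak letters from the interior of the stem, keeping both ends."""
    inner = "".join(ch for ch in stem[1:-1] if ch not in WEAK_LETTERS)
    return stem[0] + inner + stem[-1]


def extract_root(word):
    """Return the first candidate found in KNOWN_ROOTS, or None."""
    for stem in strip_affixes(word):
        for candidate in (stem, drop_weak_infixes(stem)):
            if candidate in KNOWN_ROOTS:
                return candidate
    return None


if __name__ == "__main__":
    for word in ["الكاتب", "مكتوب", "المدرسة", "يكتبون"]:
        print(word, "->", extract_root(word))
```

The dictionary check is what keeps the affix stripping from over-reducing a word: a candidate is accepted only when it is a known root, which mirrors the "extract the root, then find it in the dictionary" order stated in the abstract. Irregular forms and diacritics are deliberately not handled here.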

  • Language & Pages

    English, 11-28

  • Classification

    FOR Code: 091599