Title
CURE: Collection for Urdu Information Retrieval Evaluation and Ranking
Authors
Abstract
Urdu is a widely spoken language with 163 million speakers worldwide. Information Retrieval (IR) for Urdu merits special attention from the research community because of the language's rich morphology and large number of speakers. In general, IR evaluation has not been extensively explored for Urdu. The most important missing element is a standardized evaluation corpus specific to Urdu. In this work, we propose and construct a standard test collection of Urdu documents for IR evaluation, named Collection for Urdu Retrieval Evaluation (CURE). We select 1,096 unique documents for 50 diverse queries from a large collection of 0.5 million crawled documents using two IR models. The purpose of the test collection is the evaluation of IR models, ranking algorithms, and different natural language processing (NLP) techniques. Next, we perform binary relevance judgments on the selected documents. We also build two further language resources, for lemmatization and query expansion, specific to our test collection. The test collection is evaluated using four retrieval models, as well as with a stop-word list, lemmatization, and query expansion. Furthermore, error analysis is performed for each query with different NLP techniques. To the best of our knowledge, this work is the first attempt to prepare a standardized IR evaluation test collection for the Urdu language.
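The selection and judgment steps described in the abstract follow the general shape of pooling-based test collection construction. The sketch below is illustrative only and is not the authors' code: it assumes that candidate documents are the union of the top results from each IR model, and that systems are later scored with mean average precision against binary judgments. The abstract names neither a pool depth nor a specific metric, so `build_pool`, `average_precision`, `mean_average_precision`, and the `depth` parameter are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Set


def build_pool(rankings: List[List[str]], depth: int) -> Set[str]:
    """Union of the top-`depth` document IDs from each system's ranking.

    Assumption: pooling by union; the abstract only says two IR models
    were used to select documents.
    """
    pool: Set[str] = set()
    for ranking in rankings:
        pool.update(ranking[:depth])
    return pool


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against binary judgments."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    scores = [
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: two hypothetical systems' rankings for a single query.
    pool = build_pool([["d1", "d2", "d3"], ["d2", "d4", "d5"]], depth=2)
    print(sorted(pool))

    # Toy binary judgments (sets of relevant document IDs) and one run.
    qrels = {"q1": {"d1", "d4"}, "q2": {"d7"}}
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d7"]}
    print(mean_average_precision(runs, qrels))
```

Under these assumptions, the same judgments can be reused to compare any number of retrieval configurations, for example with and without lemmatization or query expansion, by scoring each configuration's run against the fixed `qrels`.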