Paper Title

Cluster-Based Information Retrieval by using (K-means)- Hierarchical Parallel Genetic Algorithms Approach

Authors

Toman, Sarah Hussein, Abed, Mohammed Hamzah, Toman, Zinah Hussein

Abstract

Cluster-based information retrieval is one of the information retrieval (IR) tools that organize, extract features from, and categorize web documents according to their similarity. Unlike traditional approaches, cluster-based IR is fast at processing large document datasets. To improve the quality of retrieved documents, increase the efficiency of IR, and reduce the number of irrelevant documents returned by a user search, this paper proposes a (K-means)-Hierarchical Parallel Genetic Algorithms approach (HPGA) that combines the K-means clustering algorithm with a hybrid parallel genetic (PG) algorithm built from multi-deme and master/slave PG algorithms. K-means is used to cluster the population into k subpopulations; the clusters most relevant to the query are then processed in parallel by the two levels of genetic parallelism. Irrelevant documents are thus excluded from the subpopulations, which improves the quality of the results. Three common datasets (NLP, CISI, and CACM) are used to compute the average recall, precision, and F-measure. Finally, the precision values on the three datasets are compared with those of Genetic-IR and Classic-IR. Compared with IR-GA, the proposed approach improved precision by 45% on CACM, 27% on CISI, and 25% on NLP. Compared with Classic-IR, the gains of (K-means)-HPGA were 47% on CACM, 28% on CISI, and 34% on NLP.
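The cluster-filtering step and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, cosine similarity to the cluster centroid is an assumed relevance measure (the abstract does not say which one is used), the cluster labels are assumed to come from an earlier K-means run, and the genetic stage that would process the kept subpopulation is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_relevant_clusters(doc_vectors, labels, query, top_m):
    """Return the sorted ids of documents in the top_m clusters nearest the query.

    Assumption: labels[i] is the cluster of document i from a prior K-means run.
    """
    clusters = {}
    for doc_id, label in enumerate(labels):
        clusters.setdefault(label, []).append(doc_id)
    ranked = sorted(
        clusters,
        key=lambda c: cosine(centroid([doc_vectors[i] for i in clusters[c]]), query),
        reverse=True,
    )
    return sorted(i for c in ranked[:top_m] for i in clusters[c])

def precision_recall_f(retrieved, relevant):
    """Precision, recall, and F-measure (harmonic mean) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy data (made up for illustration): five documents in three clusters.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.8, 0.2], [0, 0, 1]]
labels = [0, 0, 1, 1, 2]
query = [1, 0.2, 0]

subpopulation = select_relevant_clusters(docs, labels, query, top_m=1)
precision, recall, f_measure = precision_recall_f(subpopulation, relevant={0, 1, 2})
print(subpopulation, precision, recall, f_measure)
```

With `top_m=1` only the cluster closest to the query survives, so precision is high while recall drops because relevant document 2 sits in a discarded cluster. Raising `top_m` trades precision for recall, which is the kind of trade-off the averaged measures in the abstract summarize.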
