An Investigation of the Factors Affecting Machine Learning Classification

Paper title

An investigation on the factors affecting machine learning classifications in $γ$-ray astronomy

Authors

Luo, Shengda, Leung, Alex P., Hui, C. Y., Li, K. L.

Abstract

We have investigated a number of factors that can have significant impacts on the classification performance of $γ$-ray sources detected by Fermi Large Area Telescope (LAT) with machine learning techniques. We show that a framework of automatic feature selection can construct a simple model with a small set of features which yields better performance over previous results. Secondly, because of the small sample size of the training/test sets of certain classes in $γ$-ray, nested re-sampling and cross-validations are suggested for quantifying the statistical fluctuations of the quoted accuracy. We have also constructed a test set by cross-matching the identified active galactic nuclei (AGNs) and the pulsars (PSRs) in the Fermi LAT eight-year point source catalog (4FGL) with those unidentified sources in the previous 3$^{\rm rd}$ Fermi LAT Source Catalog (3FGL). Using this cross-matched set, we show that some features used for building classification model with the identified source can suffer from the problem of covariate shift, which can be a result of various observational effects. This can possibly hamper the actual performance when one applies such model in classifying unidentified sources. Using our framework, both AGN/PSR and young pulsar (YNG)/millisecond pulsar (MSP) classifiers are automatically updated with the new features and the enlarged training samples in 4FGL catalog incorporated. Using a two-layer model with these updated classifiers, we have selected 20 promising MSP candidates with confidence scores $>98\%$ from the unidentified sources in 4FGL catalog which can provide inputs for a multi-wavelength identification campaign.
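The abstract does not say how covariate shift was diagnosed, so the following is only an illustrative sketch of one common check, not the authors' method. It compares the distribution of a single feature in the training sample against the same feature in a test sample using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The function name `ks_statistic` and the toy feature values are hypothetical and are not taken from the paper or from any catalog.

```python
from bisect import bisect_right


def ks_statistic(train_values, test_values):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 for identical samples, 1.0 for fully separated ones.
    """
    a = sorted(train_values)
    b = sorted(test_values)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    gap = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so the largest gap is reached at one of those points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical toy values for one feature (not real catalog data).
train_feature = [0.1, 0.2, 0.3, 0.4, 0.5]
test_feature = [0.6, 0.7, 0.8, 0.9, 1.0]

same = ks_statistic(train_feature, train_feature)
shifted = ks_statistic(train_feature, test_feature)
print(f"identical samples: {same:.2f}, shifted samples: {shifted:.2f}")
```

A large statistic for a feature suggests that its distribution differs between the two samples, which is the situation the abstract warns can hamper a classifier trained on identified sources when it is applied to unidentified ones. Any threshold for flagging a feature would be a separate choice that this sketch does not make.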
