Paper Title
Using Machine Learning to Identify the Most At-Risk Students in Physics Classes
Paper Authors
Abstract
Machine learning algorithms have recently been used to predict students' performance in an introductory physics class. The prediction model classified students as either likely to receive an A or B, or likely to receive a C, D, or F or to withdraw from the class. Early prediction could help direct educational interventions and allocate educational resources. However, the performance metrics used in that study become unreliable when used to classify whether a student would receive an A, B, or C (the ABC outcome) or would receive a D or F or withdraw (W) from the class (the DFW outcome), because the outcome is substantially unbalanced, with between 10\% and 20\% of the students receiving a D, F, or W. This work presents techniques to adjust the prediction models, along with alternative model performance metrics that are more appropriate for unbalanced outcome variables. These techniques were applied to three samples drawn from introductory mechanics classes at two institutions ($N=7184$, $1683$, and $926$). Applying the same methods as the earlier study produced a very inaccurate classifier that classified only 16\% of the DFW cases correctly; tuning the model increased the DFW classification accuracy to 43\%. Using a combination of institutional and in-class data improved DFW accuracy to 53\% by the second week of class. As in the prior study, demographic variables such as gender, underrepresented minority status, first-generation college student status, and low socioeconomic status were not important variables in the final prediction models.
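To illustrate why overall accuracy can mislead when only 10\% to 20\% of students fall in the DFW group, the following minimal sketch computes several metrics from a 2x2 confusion matrix. The function name `classification_metrics`, the counts, and the choice of balanced accuracy and Cohen's kappa are assumptions made for illustration only; they are not taken from the paper's data and are not necessarily the exact metrics the authors report.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute metrics from a 2x2 confusion matrix, with DFW as the positive class.

    tp: DFW students predicted DFW      fn: DFW students predicted ABC
    fp: ABC students predicted DFW      tn: ABC students predicted ABC
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    dfw_accuracy = tp / (tp + fn)  # fraction of DFW cases found (sensitivity)
    abc_accuracy = tn / (tn + fp)  # fraction of ABC cases found (specificity)
    balanced_accuracy = (dfw_accuracy + abc_accuracy) / 2
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return {
        "accuracy": accuracy,
        "dfw_accuracy": dfw_accuracy,
        "abc_accuracy": abc_accuracy,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
    }


# Hypothetical class of 1000 students, 150 of whom (15%) have a DFW outcome.
# Case 1: a trivial model that predicts ABC for every student.
always_abc = classification_metrics(tp=0, fn=150, fp=0, tn=850)

# Case 2: a model that finds only 16% of the DFW students (24 of 150).
untuned = classification_metrics(tp=24, fn=126, fp=17, tn=833)

for name, metrics in [("always ABC", always_abc), ("untuned", untuned)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this hypothetical sample, both models have an overall accuracy of about 85\%, even though the first identifies no DFW students and the second identifies only 16\% of them. Metrics that treat the two outcome classes separately, such as DFW accuracy or balanced accuracy, expose the difference that overall accuracy hides.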