Paper Title
CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures
Paper Authors
Paper Abstract
Predicting contextualised engagement in videos is a long-standing problem, commonly attempted by exploiting the number of views or the associated likes using different computational methods. The past decade has seen a boom in online learning resources, and during the pandemic there has been an exponential rise in online teaching videos without much quality control. The quality of the content could be improved if creators received constructive feedback on it. Employing an army of domain-expert volunteers to provide feedback on the videos might not scale. As a result, there has been a steep rise in computational methods that predict a user engagement score, i.e., the level to which a user would tend to engage with the content. A drawback of current methods is that they model the various features separately, in a cascaded approach that is prone to error propagation. Moreover, most of them do not provide crucial explanations of how the creator could improve the content. In this paper, we propose a new unified model, CLUE, for the educational domain, which learns from features extracted from freely available public online teaching videos and provides explainable feedback on the video along with a user engagement score. Given the complexity of the task, our unified framework employs different pre-trained models working together as an ensemble of classifiers. Our model exploits various multi-modal features to capture the complexity of language, context-agnostic information, the textual emotion of the delivered content, animation, the speaker's pitch, and speech emotions. Under a transfer-learning setup, the overall model, in the unified space, is fine-tuned for downstream applications.
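The abstract does not specify the exact architecture, so the following is only a minimal illustrative sketch of the general idea it describes: per-modality features produced by pre-trained models are projected into a shared (unified) space, fused, and passed to a head that predicts an engagement score and can be fine-tuned for downstream tasks. The class name, feature dimensions, and the assumption that modality features are pre-extracted are all hypothetical, not taken from the paper.

```python
# Illustrative sketch only (assumed dimensions and module names); PyTorch.
import torch
import torch.nn as nn

class UnifiedEngagementModel(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, visual_dim=512, unified_dim=256):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.text_proj = nn.Linear(text_dim, unified_dim)
        self.audio_proj = nn.Linear(audio_dim, unified_dim)
        self.visual_proj = nn.Linear(visual_dim, unified_dim)
        # Head that maps the fused representation to an engagement score;
        # under a transfer-learning setup this part would be fine-tuned.
        self.head = nn.Sequential(
            nn.Linear(3 * unified_dim, unified_dim),
            nn.ReLU(),
            nn.Linear(unified_dim, 1),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        fused = torch.cat(
            [self.text_proj(text_feat),
             self.audio_proj(audio_feat),
             self.visual_proj(visual_feat)],
            dim=-1,
        )
        return self.head(fused).squeeze(-1)  # predicted engagement score

# Random tensors stand in for outputs of pre-trained text/audio/visual encoders.
model = UnifiedEngagementModel()
score = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(score.shape)  # torch.Size([4])
```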