Title
Systematic Analysis of Experiment Precision Measures and Methods for Experiments Comparison
Authors
Abstract
The notion of experiment precision quantifies the variance of user ratings in a subjective experiment. Although measures exist that assess subjective experiment precision, no systematic analysis of these measures is available in the literature. To the best of our knowledge, there is also no systematic framework in the Multimedia Quality Assessment (MQA) field for comparing subjective experiments in terms of their precision. Therefore, the main idea of this paper is to propose a framework for comparing subjective experiments in the MQA field based on appropriate experiment precision measures. We present three experiment precision measures and three related experiment precision comparison methods. We systematically analyse the performance of the proposed measures and methods, both through a simulation study (varying user rating variance and bias) and using data from four real-world Quality of Experience (QoE) subjective experiments. In the simulation study we focus on crowdsourcing QoE experiments, since they are known to generate ratings with higher variance and bias than traditional subjective experiment methodologies. We conclude that our proposed measures and related comparison methods properly capture experiment precision, both when tested on simulated and on real-world data. One of the measures also proves capable of handling even significantly biased responses. We believe our experiment precision assessment framework will help compare different subjective experiment methodologies. For example, it may help decide which methodology results in more precise user ratings. This may potentially inform future standardisation activities.
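The simulation setup mentioned in the abstract (user ratings with varying variance and bias) can be sketched as follows. This is only an illustrative model under stated assumptions: the additive rating model, the 1-to-5 scale, the parameter values, and the per-stimulus-standard-deviation precision proxy are assumptions for this sketch, not the paper's actual measures or methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ratings(n_users, n_stimuli, true_quality, rater_sd, rater_bias_sd):
    """Simulate a subjective experiment in which every user rates every stimulus.
    Assumed model: rating = true quality + per-user bias + zero-mean noise,
    clipped to a 1-5 rating scale."""
    bias = rng.normal(0.0, rater_bias_sd, size=(n_users, 1))      # per-user bias
    noise = rng.normal(0.0, rater_sd, size=(n_users, n_stimuli))  # rating noise
    return np.clip(true_quality + bias + noise, 1.0, 5.0)

def precision_proxy(ratings):
    """A simple illustrative precision proxy (NOT one of the paper's measures):
    the mean per-stimulus standard deviation. Lower means more precise."""
    return ratings.std(axis=0, ddof=1).mean()

# Hidden "true" quality of 20 stimuli on a 1-5 scale (hypothetical values).
true_q = rng.uniform(1.5, 4.5, size=20)

# A traditional lab-style experiment vs. a noisier crowdsourcing-style one.
lab = simulate_ratings(24, 20, true_q, rater_sd=0.5, rater_bias_sd=0.2)
crowd = simulate_ratings(24, 20, true_q, rater_sd=1.0, rater_bias_sd=0.6)

print(precision_proxy(lab) < precision_proxy(crowd))  # lab ratings are more precise
```

Under this toy model, the crowdsourcing-style experiment yields a higher per-stimulus rating spread, matching the abstract's remark that crowdsourcing tends to produce ratings with higher variance and bias.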