Privacy and Uniqueness of Neighborhoods in Social Networks

Paper Title

Privacy and Uniqueness of Neighborhoods in Social Networks

Authors

Daniele Romanini, Sune Lehmann, Mikko Kivelä

Abstract

The ability to share social network data at the level of individual connections is beneficial to science: not only for reproducing results, but also for researchers who may wish to use it for purposes not foreseen by the data releaser. Sharing such data, however, can lead to serious privacy issues, because individuals could be re-identified, not only based on possible nodes' attributes, but also from the structure of the network around them. The risk associated with re-identification can be measured and it is more serious in some networks than in others. Various optimization algorithms have been proposed to anonymize the network while keeping the number of changes minimal. However, existing algorithms do not provide guarantees on where the changes will be made, making it difficult to quantify their effect on various measures. Using network models and real data, we show that the average degree of networks is a crucial parameter for the severity of re-identification risk from nodes' neighborhoods. Dense networks are more at risk, and, apart from a small band of average degree values, either almost all nodes are re-identifiable or they are all safe. Our results allow researchers to assess the privacy risk based on a small number of network statistics which are available even before the data is collected. As a rule-of-thumb, the privacy risks are high if the average degree is above 10. Guided by these results we propose a simple method based on edge sampling to mitigate the re-identification risk of nodes. Our method can be implemented already at the data collection phase. Its effect on various network measures can be estimated and corrected using sampling theory. These properties are in contrast with previous methods arbitrarily biasing the data. In this sense, our work could help in sharing network data in a statistically tractable way.
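
The abstract states that the effect of edge sampling on network measures can be estimated and corrected using sampling theory. The following is a minimal sketch of that idea for one measure, the average degree. It is not the authors' implementation. It assumes uniform, independent edge sampling (each edge kept with probability `keep_prob`) and a synthetic Erdős–Rényi test graph; the function names and parameter values are hypothetical and chosen only for illustration. Under this assumption the expected average degree of the sampled network is `keep_prob` times the original, so dividing by `keep_prob` gives an unbiased estimate of the original value.

```python
import random


def random_graph(num_nodes, avg_degree, rng):
    """Erdos-Renyi G(n, p) edge list with expected average degree avg_degree."""
    p = avg_degree / (num_nodes - 1)
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < p
    ]


def sample_edges(edges, keep_prob, rng):
    """Keep each edge independently with probability keep_prob (assumed scheme)."""
    return [e for e in edges if rng.random() < keep_prob]


def average_degree(num_nodes, edges):
    """Average degree of an undirected simple graph: 2 * |E| / |V|."""
    return 2 * len(edges) / num_nodes


rng = random.Random(0)          # fixed seed so the run is repeatable
n = 1000                        # hypothetical network size
edges = random_graph(n, 20, rng)

keep_prob = 0.25                # hypothetical sampling probability
sampled = sample_edges(edges, keep_prob, rng)

k_full = average_degree(n, edges)
k_sampled = average_degree(n, sampled)
k_corrected = k_sampled / keep_prob   # sampling-theory correction

print(f"original average degree:  {k_full:.2f}")
print(f"sampled average degree:   {k_sampled:.2f}")
print(f"corrected estimate:       {k_corrected:.2f}")
```

In this sketch the sampled network's average degree falls below the rule-of-thumb value of 10 mentioned in the abstract, while the corrected estimate stays close to the original. How much sampling is needed for a given privacy level, and how other measures are corrected, is addressed in the paper itself and is not covered here.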
