Unicity (data analysis)

Unicity ($$\varepsilon_p$$) is a risk metric for measuring the re-identifiability of high-dimensional anonymous data. First introduced in 2013, unicity is measured by the number of points p needed to uniquely identify an individual in a data set. The fewer points needed, the more unique the traces are and the easier they would be to re-identify using outside information.

In a high-dimensional human behavioural data set, such as mobile phone meta-data, there exist potentially thousands of different records for each person. In the case of mobile phone meta-data, credit card transaction histories and many other types of personal data, each record includes the time and location of an individual.

In research, unicity is widely used to illustrate the re-identifiability of anonymous data sets. In 2013, researchers from the MIT Media Lab showed that only 4 points were needed to uniquely identify 95% of individual trajectories in a de-identified data set of 1.5 million mobility trajectories. These points were location-time pairs recorded at a resolution of 1 hour and 0.15 km² to 15 km². These results were shown to hold true for credit card transaction data as well, with 4 points being enough to re-identify 90% of trajectories. Further research studied the unicity of the apps installed by people on their smartphones, the trajectories of vehicles, mobile phone data from Boston and Singapore, and public transport data in Singapore obtained from smartcards.

Measuring unicity
Unicity ($$\varepsilon_p$$) is formally defined as the expected value of the fraction of uniquely identifiable trajectories, given p points selected from those trajectories uniformly at random. A full computation of $$\varepsilon_p$$ for a data set $$D$$ requires picking p points uniformly at random from each trajectory $$T_i \in D$$, and then checking whether any other trajectory also contains those p points. Averaging over all possible sets of p points for each trajectory results in a value for $$\varepsilon_p$$. This is usually prohibitively expensive, as it requires considering every possible set of p points for each trajectory in the data set, and trajectories sometimes contain thousands of points.
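
One way to write this definition out explicitly is sketched below. It assumes that each trajectory $$T_i$$ is treated as a set of points containing at least p points; the subset symbol $$I$$ and the indicator $$\mathbb{1}[\cdot]$$ are notation introduced here for illustration and are not taken from the original publications.

$$
\varepsilon_p = \frac{1}{|D|} \sum_{T_i \in D} \frac{1}{\binom{|T_i|}{p}} \sum_{\substack{I \subseteq T_i \\ |I| = p}} \mathbb{1}\left[\, \{ T \in D : I \subseteq T \} = \{ T_i \} \,\right]
$$

Under this reading, the inner sum runs over $$\binom{|T_i|}{p}$$ subsets per trajectory. For a single trajectory of 1,000 points and p = 4, that is already about $$4.1 \times 10^{10}$$ subsets, each of which must be compared against the rest of $$D$$.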

Instead, unicity is usually estimated using sampling techniques. Specifically, given a data set $$D$$, the estimated unicity is computed by sampling a fraction $$S$$ of the trajectories in $$D$$ and then checking whether each trajectory $$T_j \in S$$ is unique in $$D$$ given p points selected at random from $$T_j$$. The fraction of $$S$$ that is uniquely identifiable is then the unicity estimate.
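
The sampling procedure can be illustrated with a minimal Python sketch. It is not a reference implementation: the function name `estimate_unicity`, its parameters, and the representation of the data set as a dictionary mapping a trajectory identifier to a set of (time, location) tuples are all assumptions made for this example. Trajectories with fewer than p points are skipped when sampling, which is also a choice made here rather than one stated in the definition.

```python
import random


def estimate_unicity(dataset, p, sample_fraction=1.0, seed=None):
    """Estimate unicity by sampling (illustrative sketch, hypothetical API).

    dataset: dict mapping a trajectory id to a set of hashable, comparable
             points, e.g. (hour, cell_id) tuples (assumed representation).
    p: number of points assumed known to an attacker.
    sample_fraction: fraction of eligible trajectories to test.
    """
    rng = random.Random(seed)

    # Only trajectories with at least p points can supply p distinct points.
    eligible = [tid for tid, pts in dataset.items() if len(pts) >= p]
    if not eligible:
        raise ValueError("no trajectory has at least p points")

    n = max(1, round(sample_fraction * len(eligible)))
    sampled_ids = rng.sample(eligible, n)

    unique = 0
    for tid in sampled_ids:
        # Draw p points uniformly at random from this trajectory.
        # Sorting first makes the draw reproducible for a fixed seed.
        known = set(rng.sample(sorted(dataset[tid]), p))

        # Count every trajectory in the full data set containing all p points.
        matches = sum(1 for pts in dataset.values() if known <= pts)

        # The trajectory always matches itself, so one match means unique.
        if matches == 1:
            unique += 1

    return unique / n


# Toy data: three short trajectories of (hour, cell_id) points.
toy = {
    "a": {(8, 1), (9, 2), (10, 3)},
    "b": {(8, 1), (9, 2), (11, 4)},
    "c": {(8, 5), (9, 6), (10, 7)},
}
print(estimate_unicity(toy, p=3, seed=0))
```

Each sampled trajectory is compared against every trajectory in the data set, so the cost grows with the product of the sample size and the size of $$D$$. For large data sets, an inverted index from points to trajectory identifiers would be a natural optimisation, although it is omitted here for brevity.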