Data Disclosure Risk Evaluation

Slim Trabelsi, Vincent Salzgeber, Michele Bezzi, Gilles Montagnon
SAP Labs France
{slim.trabelsi, vincent.salzgeber, michele.bezzi, gilles.montagnon}@sap.com

Abstract

Many companies have to share various types of information containing private data without being aware of the threats related to such uncontrolled disclosure. We therefore propose a solution that supports these companies in evaluating the disclosure risk for all their types of data, by recommending the safest configurations using a smart bootstrapping system.

1. Introduction

Large organizations hold thousands of terabytes of data about their customers and activities. They often have to release data files containing private information to third parties for data analysis, application testing or support. To preserve individuals' privacy and comply with privacy regulations, parts of the released datasets have to be hidden or anonymized, using various anonymization techniques, before data release. Accordingly, finding the best set of masking transformations, which reduces risk but still preserves the relevant information, is the main issue of the masking process. To this extent, a quantitative evaluation of the privacy risk (privacy metrics) may help the user choose the appropriate set of transformations to apply to the dataset. Various privacy risk measures have been proposed so far [2,3,5,6,7]; they typically rely on statistical analysis of the dataset, so most of them depend on the actual content of the data to be released. In other words, risk assessment has to be run over each dataset before release. Whenever data are collected or released, a disclosure policy should be defined and agreed with the data owner. Privacy regulations impose specific constraints on some information but, in addition, there may be potential sources of disclosure of sensitive information not explicitly covered by privacy laws. Two different problems arise: first, deciding whether a piece of data has to be considered private or not; second, assessing whether the exposure of non-private data could be used by correlation algorithms to infer

hidden private data. The second task is particularly challenging and cannot be handled manually for large datasets, where the number of possible combinations of fields is extremely large. In practice, disclosure policies are typically written by human users (security experts or not) who cannot predict all the combinations of data that could ease the guessing of private data contained in the dataset. In other cases, policy writers are not security experts and may expose sensitive data without being aware of the impact of such exposure. To address some of these issues, we propose a tool that automates this process, giving the user feedback on the re-identification risk of disclosing certain information and proposing safe combinations to guide the disclosure. Although privacy risk estimators have already been developed in some specific contexts (statistical databases), they have had limited impact, since they are often too specific to a given context and do not provide the user with the feedback needed to mitigate the risk. In addition, they can be computationally expensive on large datasets. The proposed tool extends previous work in three related directions. First, it proposes a risk evaluation model for fields that are often not considered in standard anonymization processes (the so-called sensitive data of [6]), because they are unlikely to be used for re-identification. Second, it analyzes a given disclosure policy, computes the risk for a given dataset (against a target risk value) and, where needed, suggests additional fields to mask in order to bring the risk below the set threshold. Third, it provides a caching system to optimize this process, which on large datasets may otherwise be extremely long, limiting applicability. This paper is organized as follows: Section 2 gives an overview of related work on risk estimation and anonymization. Section 3 describes our entropy-based risk estimation model adapted to sensitive data. Section 4 is dedicated to the new bootstrapping system that we propose to make the risk analysis faster while keeping it reliable. Section 5 describes the implementation of our risk estimation tool. Finally, we conclude with a discussion of our results and directions for future work.

2. Related Work

Outsourced datasets can be represented as a table of rows (or records) and columns (or fields). Each row is termed a tuple; a tuple contains a relationship among the set of values associated with a person. Tuples within a table are not necessarily unique. Each column is called an attribute and denotes a field of information, that is, a set of possible values; attributes within a table are unique. Attributes include identifiers, which identify a subject univocally, such as name, social security number, phone number and passport number. In addition, other attributes, usually called quasi-identifiers [1], may be present, such as birth date, gender and postal code. Although these attributes may appear less dangerous from a privacy point of view, they can still be used in combination to re-identify the subject [2]. In order to estimate the privacy risk arising from combinations of quasi-identifiers, Samarati et al. [2] proposed the k-anonymity model. The concept of k-anonymity requires that any combination of quasi-identifier values in the dataset can be indistinctly matched to at least k respondents. k-anonymity provides a valuable tool for estimating privacy risk; still, it has a few shortcomings that limit its applicability in real-world applications, including weak performance and the need for strong assumptions on the attacker's possible knowledge. Recently, several studies demonstrated that k-anonymity cannot prevent attribute disclosure, since it does not consider the diversity of the private data. The notion of l-diversity was proposed in [5] to address this weakness; l-diversity requires that each equivalence class has at least l well-represented values for each sensitive attribute. The importance of taking sensitive information into account was demonstrated by Narayanan and Shmatikov [9], who successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information, using the Internet Movie Database (www.imdb.com) as the source of background knowledge. Solutions were proposed to distinguish between the risk impact of the attribute types (identifiers, quasi-identifiers, and sensitive data). Truta et al. proposed in [6] and [7] a weighting system

to increase the importance of some attributes according to the data owner's priorities. The latter system is subjective, since there is no methodology for filling the weighting vector; as a result it is imprecise and can severely affect the risk estimation. In our previous work [3] we proposed an alternative solution that combines two approaches for assessing disclosure risks: estimating the rareness of an element type and estimating the probability of re-identification using Shannon entropy as an uncertainty measure. In this paper we use this entropy-based method as part of our risk estimation system.
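To make these notions concrete, consider a toy release with three quasi-identifiers and one sensitive attribute (illustrative values, not taken from any dataset used in this paper):

BirthYear  Gender  ZIP    Diagnosis
1980       M       75001  Flu
1980       M       75001  Diabetes
1975       F       06000  Flu
1975       F       06000  Flu

Every quasi-identifier combination matches exactly two records, so the table is 2-anonymous. The first equivalence class holds two distinct Diagnosis values and is therefore 2-diverse, but the second class exposes its single sensitive value (Flu) to anyone who can place a respondent in it; this is precisely the attribute disclosure that l-diversity is designed to prevent.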

3. Risk Estimation Model

3.1 Use Case and Test Conditions

As a scenario, such a mechanism could be used for sales data in worldwide supermarkets. Besides its business purpose, a trademark also guarantees worldwide confidentiality and privacy to its customers and providers. Let us assume the following actors:
• Global Supermarket Chain (GSC): a trademark, in the supermarket business, used everywhere in the world.
• MyShop: a shop that is part of the chain, existing in many countries and on the Internet (under the same name). MyShop sells any kind of goods coming from many providers; it uses an ERP to store all information about customers, sales, delivery, providers, financial results, etc. Each MyShop worldwide has an independent management but needs to comply with local regulation as well as the WWTrademark rules.
• MyShopAdmin: super-user responsible for company data.
• Risk Evaluation Engine: MyShop's application used to enforce GSC and local policies.
• AnyCustomer: a customer buying goods in the shop using a loyalty card, or a registered customer on the Internet.
• GoodsProvider: provides a specific type of goods to the shop. A type of goods includes many different products. GoodsProvider sells its goods to many different shops and places in the world.
• For statistical analysis and revenue increase, GoodsProvider asks MyShop for a set of data. MyShop should not know what processing is done on the data, as it may reveal GoodsProvider's strategy. GoodsProvider therefore needs the real dataset with the information relationships among product, country of order, customer profile and customer behavior. Since this is a contextual and behavioral analysis, all data should be preserved and cannot be anonymized. With this information, GoodsProvider is then able to optimize the product range for all its customers, including MyShop's and any other shop's. MyShopAdmin should select the data set excluding information forbidden by policies, but also any other data that may reveal it. A first human selection is done.

Figure 1. Global Supermarket Chain

• The engine checks the selection and evaluates the re-identification risks related to the disclosed data. Data content is analyzed to provide different metrics for the data selection. These metrics are used to assess the risk of revealing forbidden fields, and a risk estimation report is provided. If the risk exceeds the threshold fixed by GSC, MyShopAdmin can ask for the safest disclosure strategy: the risk estimation engine will propose the closest disclosure policies that comply with the risk threshold. MyShopAdmin selects one of them to be enforced by the engine, and the data is ready to be exported. GoodsProvider receives the data and can run or outsource the analysis.

In the rest of the paper we adopt this scenario in our tests and proofs of concept. Our risk analysis is executed over a shopping store database containing 14 identifiers and quasi-identifiers (information directly related to the customer identity), plus 21 types of sensitive data (information indirectly related to the customer identity, representing in our scenario all the products bought by a single person). We used a Pentium IV machine with a 2.8 GHz CPU, 2 GB RAM and a MySQL 5.1 database connector. The size of the database varies between 10,000 and 1,000,000 entries. The entries related to the personal identifiers are generated using a coherent database generator called Fake Name Generator (http://www.fakenamegenerator.com/). This generator provides fake personal information with a realistic and coherent semantic meaning (e.g. zip codes really correspond to the city and the street names, SSNs are consistent with the birth dates and the zip codes, etc.).

3.2 Attacker Model

The attacker aims to re-identify released data by linking them with some external register that has overlapping attributes with the released dataset. We consider the worst-case scenario, where the external data source coincides with the original database. The re-identification procedure consists in estimating, for each row in the attacker's data table, the probability of linking it with a record (a row in the original data table). The risk is composed of two sub-computations: the personal risk and the sensitive risk.

3.3 Statistical Risk Estimation

The system we propose here aims at estimating the privacy risk of disclosing some attributes of a dataset. Since it is the core of a tool for supporting a user in writing a disclosure policy, performance is crucial. The main idea is to check all the possible combinations of to-be-disclosed attributes in a dataset, and estimate the probability that they can be used to infer private information. Comparing with a pre-defined threshold, we can then help the user decide whether or not to disclose certain data. The method is composed of two steps:
• We pre-compute the possible inferences between attributes, with the corresponding inference probabilities. The outcome of this process may be represented as a graph with edges associated to these probabilities.
• Based on the attributes the user wants to disclose, we estimate whether it is possible (at a certain accuracy level) to infer sensitive attributes (that the user does not want to disclose). Once those risky combinations are identified, they are prompted to the user, so he can increase the protection by masking additional information.
We analyze the two steps in detail.

3.3.1 Defining Risky Combinations

We propose to use a structure (matrix or graph) describing all the data types contained in the dataset


with an indication of all the possible correlations between the different fields. For example, with a dataset from the scenario, the structure will contain information such as:

SocialID ← Date ⊕ Country ⊕ City ⊕ Gender
ZipCode ← City ⊕ Street

A risk level value can be added to each attribute combination in order to estimate the retrieval probability of a dataset disclosure. This value can be calculated using an entropy-based method [3] or selected manually (by a security expert). The structure will then contain more detailed information like:

SocialID ←(60%) Date ⊕ Country ⊕ City ⊕ Gender
ZipCode ←(80%) City ⊕ Street

In order to automate the reasoning, and to provide an efficient structure in which we can store all the information about risk estimations, we propose to use a binary matrix model (Figure 2):

Att1  Att2  Att3  Att4  Risk
 0     0     0     1     r1
 0     0     1     0     r2
 0     0     1     1     r3
 ...   ...   ...   ...   ...
 1     1     1     1     rn

Figure 2. Risk Estimation Matrix

This matrix contains all the attributes of the dataset (Att columns), all the possible combinations of these attributes (rows) and the disclosure risk probability of each combination (Risk column). The value 0 corresponds to a displayed attribute, whereas the value 1 corresponds to a hidden attribute.

3.3.2 Risk Estimation of Selected Attributes

Say the user wants to disclose a set of attributes O_i, whereas the other attributes S_i are masked. Our system provides an efficient way to detect whether, by combining the disclosed attributes O_i, some private attribute value S_j may be indirectly disclosed. In other words, we want to estimate this probability P; if its value is larger than a certain threshold (e.g. 1/N_r, with N_r the number of disclosed records), the private data is not safe enough. To this scope, we compute the risk for all combinations of O and compare it with the threshold. The risk estimation structure is then used to provide the estimation for any requested combination. If the requested combination exceeds the maximum risk threshold, the matrix is then used to

retrieve the closest combinations that satisfy the threshold. For example, if the requested combination concerns the attribute vector [1,1,0,1,0] with a risk of 0.6, exceeding the threshold of 0.4, we explore the matrix in order to find the closest combination that satisfies the threshold. Closest means at the minimum Hamming distance from the requested vector, such as [0,1,0,1,0], [1,0,0,1,0], or [1,1,0,0,0].
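A minimal Java sketch of this lookup, assuming the matrix rows are encoded as bitmasks (bit set to 1 = hidden attribute, as in Figure 2) with a pre-computed risk per mask; the class name and the brute-force scan are our illustration, not the paper's implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class RiskMatrix {
    // Bit i of the mask set to 1 means attribute i is hidden (as in Figure 2);
    // the value is the pre-computed disclosure risk for that combination.
    private final Map<Integer, Double> riskByMask = new HashMap<>();

    public void put(int mask, double risk) { riskByMask.put(mask, risk); }

    /** Returns the safe mask (risk <= threshold) at minimum Hamming distance
     *  from the requested mask, or -1 if no stored mask satisfies the threshold. */
    public int closestSafeMask(int requested, double threshold) {
        int best = -1;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : riskByMask.entrySet()) {
            if (e.getValue() > threshold) continue;              // not safe enough
            int dist = Integer.bitCount(e.getKey() ^ requested); // Hamming distance
            if (dist < bestDist) { bestDist = dist; best = e.getKey(); }
        }
        return best;
    }
}
```

With the paper's example, a request encoding [1,1,0,1,0] (risk 0.6) against a 0.4 threshold would return whichever of the distance-one neighbours ([0,1,0,1,0], [1,0,0,1,0], [1,1,0,0,0]) is stored with a risk below 0.4.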

3.4 Entropy-based Method for Measuring the Disclosure Risk

The entropy-based method [3] adopted in our system enables the estimation of the rareness of particular values in a population. This estimation can be used as a measure of disclosure risk in personal data release. The computation is row-based, and we only consider the columns of personal nature. Let us consider a database S containing personal and private data. A subset R of this dataset, described through a disclosure policy, will be released to an external non-trusted party. R will not contain direct identifiers, but could contain quasi-identifiers and sensitive data. According to our attacker model, the malicious user will try to re-identify released data by linking them to an external dataset S'. Our goal is then to estimate the re-identification probability P(R|S'). Since we do not have any indication about the richness of S', we assume the worst case, where S' ~ S. We consider the simplest case, where a selected record s can be linked to $k_s$ indistinguishable records (k-anonymity [2]), and we therefore get the uniform distribution over the $k_s$ records $r_1, r_2, \ldots, r_{k_s}$:

$$P(r \mid s) = \begin{cases} \dfrac{1}{k_s} & \text{if } r \in \{r_1, \ldots, r_{k_s}\} \\[4pt] 0 & \text{otherwise} \end{cases}$$

Intuitively, the more uncertain this mapping, the lower the disclosure risk. Shannon's entropy can be used to estimate this uncertainty:

$$H(R \mid s) = -\sum_{r \in R} P(r \mid s)\, \log_2 P(r \mid s) \;\overset{\text{unif. distr.}}{=}\; \log_2 k_s$$

This measures the risk at the level of a single record s: it represents the average number of binary questions we have to ask to identify the corresponding r given s. We now estimate the expected number of correct matches for personal columns, $E_{CM,P}$, as follows:

$$E_{CM,P} = \sum_{s \in S} \frac{1}{2^{H(R \mid s)}} \;\overset{\text{unif. distr.}}{=}\; \sum_{s \in S} \frac{1}{k_s} = \#\text{ of distinguishable records}$$
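As a sketch of this computation (the grouping of each record's disclosed quasi-identifier values into a key is our illustration, not the paper's code), $E_{CM,P}$ reduces to counting equivalence classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PersonalRisk {
    /** rows: for each record, the list of disclosed quasi-identifier values.
     *  Since the sum over s of 1/k_s contributes exactly one unit per
     *  equivalence class, E_CM,P equals the number of distinct combinations. */
    public static double expectedCorrectMatches(List<List<String>> rows) {
        Map<List<String>, Integer> classSizes = new HashMap<>();
        for (List<String> row : rows) classSizes.merge(row, 1, Integer::sum);
        double ecm = 0.0;
        for (Map.Entry<List<String>, Integer> e : classSizes.entrySet()) {
            int k = e.getValue();   // k_s for every record in this class
            ecm += k * (1.0 / k);   // k records, each contributing 1/k
        }
        return ecm;                 // = number of distinguishable classes
    }
}
```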

We applied the entropy-based method to our Webstore database, taking into account all the possible combinations of the identifier attributes, and compared the results obtained with four different database sizes. In Figure 3 we notice that the risk value is inversely proportional to the dataset size. This is due to the number of distinct values, which decreases as the number of entries increases: the more the entities in the dataset share the same values, the fewer unique combinations we obtain, and the lower the re-identification risk.


Figure 3. Entropy-based Risk Estimation Results

3.5 Sensitive Information Risk Estimation

3.5.1 Impact of Sensitive Information

During our tests, we noticed that the entropy-based method applied to all the element types of the dataset (identifiers, quasi-identifiers, and sensitive data) provides misleading results, in particular an overestimation of the sensitive data impact [4]. This overestimation is due to the rareness of some combinations of sensitive attributes. When we made a semantic analysis of the attributes, we noticed that in some cases these combinations do not really affect the privacy of the data holder. For example, in our case study, the combination of some common sensitive attributes like wine, bread, milk and fruit could produce rare combinations without representing any real threat to the consumer identity. While some studies completely neglect the impact of such data [2], we decided to propose an alternative method that mitigates this impact while still taking into account the risk related to rare sensitive attribute values. In fact, we believe that the main risk related to sensitive data comes from the rareness of some attribute values (like limited editions) and not from the rareness of combinations of common attributes.

3.5.2 Concept

We decided to follow a column-based method, considering only the columns of sensitive nature. We claim that each sensitive column can, in certain circumstances, help to recover the identity of a person. For example, consider a column that represents a certain kind of property (e.g. wine categories), where many people share the same common brands (e.g. Bordeaux, Chardonnay, Côtes de Provence) and only very few people have special entries (e.g. Champagne Millésime de Dom Pérignon). The persons sharing the common brands are not likely to be disclosed, but we gain additional information about those few people with special entries, and therefore the risk of disclosing these persons' identities rises. The expected number of correct matches for sensitive data, $E_{CM,S}$, is computed as follows:

$$E_{CM,S} = \sum_{s \in S} \;\sum_{\text{sensitive columns } c_i} \frac{1}{k_{s,c_i}}$$

The global risk, including all the types of data, is the sum of the personal and sensitive correct-match expectations divided by the total number of records:

$$\text{Global Risk} = \frac{E_{CM,P} + E_{CM,S}}{\text{Total \# of records}}$$
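A column-wise sketch of the sensitive-data term and the final score (the per-column value counting is our reading of the formula above; names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GlobalRisk {
    /** columns: one list per sensitive column, holding every record's value
     *  in that column. For each record s and column c_i, k_{s,c_i} is the
     *  number of records sharing the same value in that column. */
    public static double expectedCorrectMatchesSensitive(List<List<String>> columns) {
        double ecm = 0.0;
        for (List<String> column : columns) {
            Map<String, Integer> counts = new HashMap<>();
            for (String v : column) counts.merge(v, 1, Integer::sum);
            for (String v : column) ecm += 1.0 / counts.get(v); // 1 / k_{s,ci}
        }
        return ecm;
    }

    public static double globalRisk(double ecmPersonal, double ecmSensitive,
                                    int totalRecords) {
        return (ecmPersonal + ecmSensitive) / totalRecords;
    }
}
```

A value bought by a single customer contributes a full unit of risk for that column, while a brand shared by a thousand customers contributes almost nothing, which is exactly the mitigation described above.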

3.5.3 Results

We then compared the risk estimation results obtained by applying the entropy-based method to all the attributes of the dataset (including identifiers and sensitive information) with those of our mitigation method (called Hybrid) and with the entropy-based method applied only to the identifier attributes. We notice in Figure 4 that the entropy-based method applied to all the attributes estimates a risk of 100% for any database size, which is not realistic given the data displayed. On the contrary, our hybrid approach concentrates the estimated risk on customers who bought very rare items, which facilitate re-identification using external data sources. We also notice in this test that the results of our hybrid system and of the identifiers-only (underestimating) approach converge as the dataset size increases. This phenomenon is due to the harmonization of the distribution of values, which results from the dilution of rare items as the number of buyers grows: the greater the number of customers, the higher the probability that several of them buy the same rare items.

Figure 4. Risk Estimation Impact on divulging Sensitive Information

3.6 Semantic Risk Estimation

The statistical risk estimation techniques described above do not take into account semantic relationships between different attributes of the same dataset. Doan and Halevy [8] raised the semantic integration problem of solving semantic conflicts between heterogeneous data sources. Sometimes distinct attributes describe the same type of information (e.g. the same field name in different languages, or age and school curriculum may yield the same information). Such attributes must be semantically analyzed in order to find the relationships between these different entities. Data types can be categorized and described using specific structures, such as ontologies, in order to maintain relationships between all the dataset elements. These ontologies are created by security experts for a specific domain; each company should have its own domain ontology related to its own dataset.
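As an illustration only (the field names and the flat synonym map stand in for a real ontology, which the paper does not specify), semantically equivalent fields can be collapsed to one canonical type before the statistical analysis runs:

```java
import java.util.Map;

public class SemanticNormalizer {
    // Hypothetical ontology excerpt: distinct field names mapped to one semantic type.
    private static final Map<String, String> CANONICAL = Map.of(
        "birth_date",        "BIRTH_DATE",
        "date_de_naissance", "BIRTH_DATE",  // same field name in another language
        "age",               "BIRTH_DATE",  // derivable, treated as equivalent
        "school_year",       "BIRTH_DATE"); // curriculum level can reveal age

    /** Attributes sharing a canonical type are analyzed as a single attribute. */
    public static String canonicalType(String fieldName) {
        return CANONICAL.getOrDefault(fieldName.toLowerCase(), fieldName);
    }
}
```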

4. Bootstrapping System for Risk Estimation

4.1 Concept

Estimating the disclosure risk of a dataset implies identifying all the clusters of identical data values per attribute and per combination of attributes. Since there are 2^(#attributes) possible combinations of attributes, this task becomes complex, and the complexity rises further with the number of entries (rows) and the number of attributes (columns) of the dataset. In our tests, we spent more than 12 hours estimating the disclosure risks of a 250,000-entry database. In industry, datasets are really huge, and pre-computing the risks could take days, which is of little use to an administrator who wants to check immediately the robustness of his disclosure policy. For this reason we came up with a new bootstrapping system that the risk estimation tool can use to find the least risky combination satisfying the administrator's requirements. The bootstrapping system is based on a particular usage of cached results from previous estimations, which provide indications about the risk values per combination even though the dataset entries are permanently changing. First, the dataset administrator asks the system to estimate the risk of a combination of attributes against a risk threshold. The system estimates the risk of this combination; if the risk value is over the threshold, instead of estimating all the possible combinations in order to propose a less risky formula, we use the results of a previous full estimation to get the closest combination that requires the minimum of changes and satisfies the administrator's threshold. This combination is then re-estimated by the system in order to verify compliance with the threshold.
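A minimal sketch of this cache-then-verify loop (the estimator callback and cache layout are our assumptions; the real tool also checks the contextual conditions listed in Section 4.2):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

public class BootstrapCache {
    // Risk values per attribute mask, recorded during an earlier full estimation.
    private final Map<Integer, Double> cachedRisk = new HashMap<>();

    public void record(int mask, double risk) { cachedRisk.put(mask, risk); }

    /** Suggests the cached mask closest (Hamming) to the request whose cached
     *  risk satisfies the threshold, then re-estimates it on the live dataset. */
    public int suggest(int requestedMask, double threshold, IntToDoubleFunction estimator) {
        double risk = estimator.applyAsDouble(requestedMask);
        cachedRisk.put(requestedMask, risk);          // keep the cache fresh
        if (risk <= threshold) return requestedMask;  // already safe
        int best = -1;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : cachedRisk.entrySet()) {
            if (e.getValue() > threshold) continue;
            int d = Integer.bitCount(e.getKey() ^ requestedMask);
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        if (best >= 0) {
            double verified = estimator.applyAsDouble(best); // verify on current data
            cachedRisk.put(best, verified);
            if (verified > threshold) return -1;             // cache was stale
        }
        return best;
    }
}
```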

4.2 Assumptions

The bootstrapping system requires some assumptions to be efficient:
• The bootstrapping cache should be used over a dataset that shares the same contextual conditions (e.g. temporal, geographical), in order to maintain coherence between the distribution of values in the entries and in the cache. For example, for a supermarket database in the south of France, we should not use during summer a bootstrapping cache recorded during winter.
• The attribute types should, of course, remain the same.
• The size of the dataset: in order to verify the distribution variability of the dataset entries, we checked whether the database size affects the risk evaluation. To perform this comparison, we compared four pairs of databases of the same sizes but with different data values, and measured the average difference of risks, the maximum difference of risks and the average difference of distinct values. According to the results shown in Figure 5, the average difference is very close to zero, and the maximum difference between the same attribute combinations is in the worst case less than 1.1% (the worst case being the smallest database). This result confirms that the bootstrap system is best applied to datasets with approximately the same entry size (we are testing bootstraps with different sizes). A sketch of how these compatibility conditions can be checked is given after Figure 5.


Figure 5. Impact of the Database Size on the Risk Estimation
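As announced above, a hypothetical compatibility test reflecting the three assumptions; the 100,000-entry stabilization point and 150,000-entry coherence range come from the study reported in Figure 6 below, and reading the coherence range as a tolerated size difference is our interpretation:

```java
import java.util.List;

public class CacheCompatibility {
    private static final int MIN_RELIABLE_SIZE = 100_000; // stabilization point (Figure 6)
    private static final int COHERENCE_RANGE   = 150_000; // tolerated size difference

    /** True if a cached bootstrap may be reused for the dataset at hand. */
    public static boolean isUsable(String cacheContext, String dataContext,
                                   List<String> cacheAttributeTypes,
                                   List<String> dataAttributeTypes,
                                   int cacheSize, int dataSize) {
        return cacheContext.equals(dataContext)                   // same contextual conditions
            && cacheAttributeTypes.equals(dataAttributeTypes)     // same attribute types
            && dataSize >= MIN_RELIABLE_SIZE                      // size regime where risk is stable
            && Math.abs(cacheSize - dataSize) <= COHERENCE_RANGE; // sizes within coherence range
    }
}
```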

The ideal scenario, where the dataset to be evaluated has the same size as the bootstrapping database, is not realistic in practice. For this reason, we decided to study the evolution of the mean risk distance for different dataset sizes, in order to find a threshold beyond which the size no longer impacts the risk values. We compared the maximum risk difference and the average risk difference between databases in steps of 5,000 entries. This study enables us to specify a coherence range within which the bootstrapping database size is compatible with the datasets to be evaluated.

Figure 6. Coherence Range for the Bootstrap Size

According to the results shown in Figure 6, we notice a stabilization of the risk distance starting from the 100,000-entry database. This means that the bootstrapping-based risk estimation is reliable for databases bigger than 100,000 entities, with a coherence range of 150,000.

4.3 Results

After configuring our bootstrapping system according to the assumptions described above, we compared the average time consumption of the risk estimation tool between the bootstrap-enabled configuration and the classic one. We tested four datasets containing the same types of attributes with different values and different sizes (where 1,000,000 entries represent the worst case for the processing time). For a maximum error rate of 1.05%, we obtain the clear saving of time shown in Figure 7 for any dataset size. This makes the risk estimation and the combination suggestion quasi-instantaneous (less than one minute).

Figure 7. Time Consumption Comparison

In order to make the bootstrap prediction as realistic as possible, we propose to update the bootstrapping system with the new values computed at every new risk evaluation. With this method we can guarantee the freshness of the risk evaluations for the most popular attribute combinations.

5. Implementation

One of our objectives in this study was to provide a simple Java tool intended for non-expert users who have no idea of the risks incurred during a dataset release. We propose a user-friendly graphical interface (see Figure 8) that lets the user connect to his dataset, create a disclosure policy manually (by clicking on the attributes to display and the attributes to hide) or load a predefined XML file, and set a maximum risk threshold to satisfy. By clicking on the Compute Risk button, the tool displays the risk associated with the current disclosure policy proposed by the user. If the risk exceeds the threshold value, the bootstrapping system proposes to the user several less risky combinations that are close to his initial disclosure preferences.

Figure 8. Disclosure Policy Risk Evaluation Tool

6. Conclusion

We addressed in this paper two major issues related to disclosure risk estimation. The first is the impact of taking sensitive information into account during the risk evaluation process. We proposed a new hybrid approach for evaluating the disclosure risk that computes the rareness of identifier attribute combinations and the rareness of value occurrences per sensitive attribute. This approach provides a realistic estimation of the disclosure risk. The second major issue treated in this paper is performance enhancement for disclosure risk estimation. In order to propose to the user a set of safe disclosure combinations, the risk estimator engine has to estimate all the possible combinations of the dataset attributes; as we demonstrated in our study, such computations are time-, CPU- and memory-consuming. For this reason we proposed a new bootstrapping approach that relies on previous risk estimations to provide a high-performance risk evaluation accessible immediately to the user. To our knowledge, this system is the first solution addressing the performance problem in the risk estimation area, which makes our tool adaptable to industry requirements.

7. Acknowledgements

The authors wish to thank Stuart Short for his precious comments. This work was supported by the European Community's Seventh Framework Program through the project PrimeLife. The research leading to these results has received funding from the European Community's Seventh Framework Program (FP7/2007-2013) under grant agreement n° 216483. The information in this document is provided "as is", and no guarantee or warranty is given that the information is fit for any particular purpose. The above referenced consortium members shall have no liability for damages of any kind, including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials, subject to any liability which is mandatory due to applicable law. Copyright 2009 by Slim Trabelsi and Michele Bezzi.

8. References

[1] T. Dalenius, "Finding a needle in a haystack – or identifying anonymous census records", Journal of Official Statistics, 2(3):329-336, 1986.

[2] P. Samarati and L. Sweeney, "Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression", in Proceedings of the IEEE Symposium on Research in Security and Privacy, 1998.

[3] M. Bezzi, "An entropy-based method for measuring anonymity", in IEEE/CreateNet SECOVAL Workshop on the Value of Security through Collaboration, September 2007.

[4] R. Parameswaran and D. Blough, "A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data", in Workshop Proceedings of the 2005 IEEE International Conference on Data Mining, Houston, Texas, pages 18-25.

[5] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity", in Proceedings of the 22nd International Conference on Data Engineering (ICDE), 3-8 April 2006, Atlanta, GA, USA, page 24, 2006.

[6] T. M. Truta, F. Fotouhi, and D. Barth-Jones, "Disclosure Risk Measures for Microdata", in Proceedings of the 15th International Conference on Scientific and Statistical Database Management (SSDBM'03), July 2003, Cambridge, MA, USA.

[7] T. M. Truta, F. Fotouhi, and D. Barth-Jones, "Assessing global disclosure risk in masked microdata", in WPES '04: Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society, New York, NY, USA: ACM Press, 2004, pp. 85-93.

[8] A. Doan and A. Halevy, "Semantic integration research in the database community: A brief survey", AI Magazine, Special Issue on Semantic Integration, 2005.

[9] A. Narayanan and V. Shmatikov, "Robust de-anonymization of large sparse datasets (how to break anonymity of the Netflix prize dataset)", in Proceedings of the 29th IEEE Symposium on Security and Privacy, May 2008.