Privacy-preserving data sharing via probabilistic modeling

How probabilistic modeling contributes to differentially private data sharing

Intro

During the last decade, scientific research on big data has grown rapidly, and more and more data have been openly released on the Internet. However, this trend also raises a major privacy challenge. Even if publishers delete the fields that contain private information, a record can often still be linked to a specific individual, especially in the medical field. In this context, existing privacy-preserving methods, such as pseudonymization, are not enough to prevent re-identification through cross-comparison with other data sources.

To address these problems, Rubin [1] used multiple imputation to generate synthetic microdata. Dunning and Kresman [2] proposed transferring data anonymously as the roots of a polynomial. Dwork [3] introduced differential privacy, which gives a rigorous guarantee that the output of a computation is barely affected by the presence or absence of any single record. Today, differential privacy is widely used to protect privacy while maintaining the statistical features of the data.

The definition and techniques

Differential privacy offers a way to generate a privacy-preserving dataset that follows the same distribution as the original, and it makes the level of privacy protection quantifiable. This section therefore discusses the definition of differential privacy and briefly introduces two differentially private data-sharing techniques.

Differential privacy

Assume there are two databases $D_1$ and $D_2$ that differ in at most one record, and let $D_1$ be the smaller one, so that $D_1 \subseteq D_2$ and $D_2$ contains exactly one additional record. A randomized function $\mathcal{K}$ gives $\epsilon$-differential privacy if, for all sets of outputs $S \subseteq \mathrm{Range}(\mathcal{K})$, it satisfies the following formula [4]:

$$Pr[\mathcal{K}(D_1)\in S] \leq \exp(\epsilon) \times Pr[\mathcal{K}(D_2) \in S]$$

This definition says that the presence or absence of any single record does not significantly change the output distribution of queries to the database, and the parameter $\epsilon$ bounds how much it can change. For instance, if a person applies for insurance, the probability of any particular quoted price changes by a factor of at most $\exp(\epsilon)$ whether or not that person's data is in the database. Differential privacy provides this guarantee even in the worst case, where the attacker has full knowledge of $D_1$. However, the guarantee applies to single records: it is weaker for group privacy, where several records may belong to the same individual or to a correlated group. We should therefore be wary of the blurring of privacy boundaries caused by large-scale group data leaks.
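As a concrete illustration (my own sketch, not part of the original text), the classic Laplace mechanism satisfies this definition for a counting query: adding noise with scale $1/\epsilon$ to a query whose value changes by at most 1 between neighbouring databases gives $\epsilon$-differential privacy. The example below assumes a simple "count the smokers" query over toy records.

```python
import numpy as np

def count_smokers(db):
    """Counting query: how many records have the 'smoker' flag set."""
    return sum(1 for record in db if record["smoker"])

def laplace_mechanism(db, query, epsilon, sensitivity=1.0):
    """Release query(db) with Laplace noise of scale sensitivity/epsilon.

    For a counting query the sensitivity is 1 (adding or removing one
    record changes the count by at most 1), so this release satisfies
    epsilon-differential privacy.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return query(db) + noise

# Two neighbouring databases: D2 contains exactly one extra record.
d1 = [{"smoker": True}, {"smoker": False}]
d2 = d1 + [{"smoker": True}]

epsilon = np.log(2)  # output probabilities on D1 and D2 differ by at most a factor of 2
print(laplace_mechanism(d1, count_smokers, epsilon))
print(laplace_mechanism(d2, count_smokers, epsilon))
```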

Furthermore, the choice of $\epsilon$ is largely a social question rather than a purely technical one, which means there is no formula that computes the "right" value. In practice, the value of $\epsilon$ can be related to the probability of a leak and the harm caused if one occurs. If the data only involve general information, such as purchase records, which are difficult to cross-reference, then $\ln{2}$ or $\ln{3}$ might be acceptable for $\epsilon$. By contrast, if the data contain sensitive information, such as disease status, the value of $\epsilon$ should be smaller [3].
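To make these suggested values concrete, here is a short worked example (added for illustration). With $\epsilon = \ln 2$ the bound in the definition becomes

$$Pr[\mathcal{K}(D_1)\in S] \leq e^{\ln 2} \times Pr[\mathcal{K}(D_2)\in S] = 2 \times Pr[\mathcal{K}(D_2)\in S],$$

so the probability of any output can at most double when one record is added or removed; with $\epsilon = \ln 3$ the factor is 3, and with $\epsilon = 1$ it is $e \approx 2.72$.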

However, we must point out that differential privacy cannot provide an absolute guarantee that privacy will not be leaked. Nevertheless, if releasing a dataset benefits society, differential privacy is a powerful tool for bounding the privacy risk of the released version.

Data-sharing techniques

In the context of differential privacy, Leoni [5] points out that data-sharing techniques follow two main approaches: input perturbation and synthetic microdata. In input perturbation, noise is added to the original data to produce a masked dataset. The idea is not new, but it has a well-known weakness: if the noise is symmetric about the mean, it can be averaged away by asking the same question many times. Moreover, this method is only suitable for specific data types, such as set-valued data [4].
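A minimal sketch (my own illustration, assuming a numeric query answered with fresh, independent, zero-mean Laplace noise each time) of why repeated queries defeat naive perturbation unless the privacy budget is tracked across queries:

```python
import numpy as np

true_answer = 42.0          # exact result of the query (hypothetical)
epsilon = 0.5               # nominal privacy parameter of a single answer
scale = 1.0 / epsilon       # Laplace scale for a sensitivity-1 query

# Ask the same question many times; each answer gets fresh symmetric noise.
answers = true_answer + np.random.laplace(0.0, scale, size=10_000)

# Averaging cancels the zero-mean noise, so an attacker recovers the
# true answer almost exactly.
print(answers.mean())   # ~42.0
```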

Therefore, synthetic-data-based techniques are used in most situations. Recent research has built powerful general-purpose models that are shown to be efficient [6]. However, these techniques do not incorporate prior knowledge into the model, which limits the utility of the data: prior knowledge can reduce ambiguity and supply information such as structural zeros in the model. As a result, if modelers have prior knowledge about the data and make use of it, building the model is much more efficient than using general-purpose, purely data-driven models. Moreover, because real datasets are often sparse, the data usually need to be smoothed in order to preserve the key attributes.
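As a small sketch of my own (the variable names and counts are hypothetical, not taken from the cited work), the example below encodes a structural zero from domain knowledge and applies additive smoothing to a sparse contingency table before using it as a sampling distribution:

```python
import numpy as np

# Sparse counts of (sex, pregnancy-related diagnosis) from a small dataset.
# Rows: female, male; columns: diagnosis absent, diagnosis present.
counts = np.array([[120.0, 3.0],
                   [ 95.0, 0.0]])

# Structural zero from domain knowledge: the (male, present) cell is
# impossible, so it must stay exactly zero and must not be smoothed.
structural_zero = np.array([[False, False],
                            [False, True]])

# Additive (Laplace) smoothing on the remaining cells avoids assigning
# zero probability to rare but possible combinations.
alpha = 0.5
smoothed = np.where(structural_zero, 0.0, counts + alpha)
probabilities = smoothed / smoothed.sum()
print(probabilities)
```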

Experiments and discussion of results

In this section, we discuss how to produce a new dataset that maintains the statistical features of the original dataset under differential privacy. Specifically, we introduce the workflow of differentially private data sharing and the calculation method, and we analyze the results obtained with probabilistic modeling.

The workflow of differentially private data sharing

Earlier differentially private data-sharing models share a similar structure: the only learning source is the dataset itself, from which a generator is trained to produce synthetic data. This pipeline is shown in Fig. 1.


Fig. 1: Standard differentially private data-sharing workflow [4].

Because the relationships between observed variables can be organized as a probabilistic network, prior knowledge from the research field can be encoded as a Bayesian model, and this model can then be included in the generation process. The pipeline of the Bayesian model together with the data is shown in Fig. 2.


Fig. 2: Bayesian differential privacy data release [4].

In order to release a new synthetic dataset, the posterior predictive distribution is used to generate new records [4]. For a dataset $X$ and a probabilistic model $p(X|\theta)$, where $\theta$ denotes the model parameters, the posterior predictive distribution is defined by the following formula: $$p(\tilde{X}|X)=\int_{\mathrm{Supp}(\theta)}p(\tilde{X}|\theta)\,p(\theta|X)\,d\theta,$$ where $\mathrm{Supp}(\theta)$ is the set of possible values of $\theta$ and $\tilde{X}$ denotes new data. The posterior predictive distribution gives the probability of future data from the perspective of the current dataset. Therefore, if the model is trained well, the posterior predictive distribution is the natural choice for generating synthetic data. The process for generating a new record is as follows: first, draw parameters $\tilde{\theta}$ from the posterior distribution $p(\theta|X)$, then use $\tilde{\theta}$ to generate a new data point from the probabilistic model $p(\tilde{X}|\tilde{\theta})$.
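The following minimal sketch (my own example, using a conjugate Beta–Bernoulli model rather than the authors' model, and omitting the differential-privacy noise added during inference) shows this two-step generation process: draw $\tilde{\theta}$ from the posterior $p(\theta|X)$, then draw a synthetic record from $p(\tilde{X}|\tilde{\theta})$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed binary data X, e.g. whether each patient has a diagnosis.
X = rng.binomial(1, 0.3, size=200)

# Beta(a0, b0) prior on theta; with Bernoulli data the posterior is
# Beta(a0 + sum(X), b0 + n - sum(X)).
a0, b0 = 1.0, 1.0
a_post, b_post = a0 + X.sum(), b0 + len(X) - X.sum()

def sample_synthetic(n):
    """Sample n records from the posterior predictive p(x_tilde | X)."""
    theta_tilde = rng.beta(a_post, b_post, size=n)   # step 1: theta ~ p(theta | X)
    return rng.binomial(1, theta_tilde)              # step 2: x ~ p(x | theta)

synthetic = sample_synthetic(200)
print(X.mean(), synthetic.mean())   # the synthetic rate tracks the observed rate
```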

Reproducing statistical features

In order to measure the effect of the algorithm, a medical study was used to demonstrate its effectiveness [4]. The domain knowledge states that alcohol-related death rates differ between genders, and that the end date of any record falls into one of two cases: it either ends at a fixed study end date or at the date of death [7]. After encoding this knowledge into the model, the Bayesian differentially private model was trained and synthetic data were generated.
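Purely for illustration (this is not the authors' model; the priors, dates, and rates below are hypothetical), one way to encode such knowledge is to give each gender its own mortality parameter and to record whether a synthetic follow-up ends in death or in administrative censoring:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-gender Beta priors expressing the belief that
# alcohol-related mortality differs between genders.
priors = {"male": (2.0, 8.0), "female": (1.0, 12.0)}
study_end = np.datetime64("2012-12-31")   # fixed censoring date

def synthetic_record(gender):
    a, b = priors[gender]
    death_rate = rng.beta(a, b)        # draw a mortality risk from the prior
    died = rng.random() < death_rate
    if died:
        # End date is a death date somewhere inside the follow-up period.
        end = np.datetime64("2005-01-01") + rng.integers(0, 2900)
    else:
        # Otherwise the record is censored at the end of the study.
        end = study_end
    return {"gender": gender, "died": died, "end_date": end}

print(synthetic_record("female"))
```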

The preliminary analysis in Figure 3 shows that the key statistical finding, that people with diabetes have a higher risk of respiratory disease, has been preserved [4]. However, the outcome differs between genders. For the male group, the statistical feature was preserved well even under a strict privacy guarantee ($\epsilon = 1$). For the female group, achieving the same accuracy required relaxing the privacy restriction ($\epsilon = 4$). Moreover, the amount of prior knowledge also has a significant impact on the probability of successfully rediscovering the pattern. Figure 3 compares three prior-knowledge models and their influence on extracting the statistical feature. In general, the more prior knowledge we can encode into the model, the better the statistical features are reproduced; the large percentage differences in the female group, shown in Figure 3, illustrate this point.


Fig. 3: Encoding prior knowledge into the generative model improves performance [4].

Lastly, the synthetic dataset can be used in other studies while maintaining its statistical features, and differential privacy protects the underlying data from leakage. As a result, the processed dataset can be made available for future research without fear of privacy disclosure.

Discussion and conclusions

Although we cannot keep all statistical features without loss, differential privacy still provides a way to maintain some of the most important ones. Besides, differential privacy gives strong guarantees even in the case of partial data leakage. Furthermore, with prior knowledge, the performance of differential privacy improves significantly on datasets where it would otherwise perform poorly.

Moreover, with a probabilistic model, the probability of reproducing a discovery in the synthetic data can be increased by encoding prior knowledge. The parameters of the probabilistic model are learned from the data by inference, and the model then uses these parameters to generate new records. In this way, the repeated-query issue described earlier no longer arises, because the noise is introduced once when learning the model rather than for every answer. This approach also turns the data-release problem into a modeling problem, which researchers and engineers can tackle by building Bayesian models. Additionally, how salient a feature is in each group affects how well it is preserved at a given privacy strength, so choosing the strength of differential privacy becomes a hyper-parameter problem.

Furthermore, probabilistic modeling enables practitioners to guide the model toward learning the right structure, while privacy is preserved throughout the process. With this approach, differentially private data sharing can be applied to many different kinds of datasets, allowing downstream analysis tasks to process them at no further privacy cost.

The emergence of probabilistic modeling enables people to release data with privacy guarantees and better preservation of statistical features. In the future, the choice of $\epsilon$ may no longer be a purely social question: the model could automatically choose the value of $\epsilon$ that best balances privacy and accuracy.
