A Probabilistic Inference Attack On Suppressed Social Networks
by
BARIŞ ALTOP
Submitted to the Graduate School of Sabancı University in partial fulfillment of the requirements for the degree of
Master of Science
Sabancı University
Spring, 2011
© Barış Altop 2011
All Rights Reserved
A PROBABILISTIC INFERENCE ATTACK ON SUPPRESSED SOCIAL NETWORKS
Barış ALTOP
Computer Science and Engineering, Master’s Thesis, 2011
Thesis Supervisors: Assist. Prof. Dr. Mehmet Ercan Nergiz, Assoc. Prof. Dr. Yücel Saygın
Keywords: Social Networks, Inference Attack, Privacy, Private Data Protection, Classification
Abstract
Social Networks (SNs) are now widely used by internet users to share personal information. Such networks are so rich in information content that there is public and commercial benefit in sharing them with third parties. However, the information stored in SNs is mostly person-specific and subject to privacy concerns. One way to address these privacy issues is to give control of the data to the users, enabling them to suppress data that they choose not to share with third parties.
Unfortunately, the above-mentioned preference-based suppression techniques are not sufficient to protect privacy, mainly because they do not allow users to control data about other users they are linked with. Information about neighbours becomes an inference channel in an SN when there is a known correlation between the existence of a link between two users and the users having the same sensitive information.
In this thesis, we propose a probabilistic inference attack on suppressed social network data that can successfully predict a suppressed label by looking at neighbouring users' data. The attack algorithm is designed for a realistic adversary that knows, from background or external sources, the correlations between labels and links in the SN. We experimentally show that it is possible to recover the majority of the suppressed labels of users even in a highly suppressed SN.
BASTIRILMIŞ SOSYAL AĞLARDA OLASILIKSAL BİR ÇIKARIM SALDIRISI
Barış ALTOP
Computer Science and Engineering, Master's Thesis, 2011
Thesis Supervisors: Assist. Prof. Dr. Mehmet Ercan Nergiz, Assoc. Prof. Dr. Yücel Saygın
Keywords: Social Networks, Inference Attack, Privacy, Private Data Protection, Classification
Özet
Social Networks are widely used by today's internet users to share personal information. Because the information content of such networks is very rich, sharing them with third parties brings public and commercial benefit. However, the information stored in social networks is mostly person-specific and subject to privacy concerns. One way to address these privacy issues is to give users control of their own data, enabling them to hide the data of their choosing from third parties by suppressing it.
Unfortunately, the above-mentioned preference-based suppression techniques are not sufficient to ensure privacy. The main reason is that such protection systems do not give their users any control over the data shared by the other users they are linked with. There is a correlation, in terms of data similarity, between linked users, and this correlation creates a data inference channel between two neighbouring users. In this thesis, we propose a probabilistic inference attack that can discover users' suppressed information in a suppressed social network by looking at neighbouring users' data. The attack algorithm is designed for a realistic adversary that knows the links in the social network and the correlations between labels. We experimentally show that it is possible to infer the majority of users' suppressed labels even in highly suppressed social networks.
to my beloved family
Acknowledgements
I wish to express my sincere gratitude to Assoc. Prof. Yücel Saygın and Assist. Prof. Dr. Mehmet Ercan Nergiz for their continuous support and worthwhile guidance throughout my master's studies. Assoc. Prof. Yücel Saygın's support, from my application process until today, was always a great motivation during my studies. I am also thankful to Assist. Prof. Dr. Mehmet Ercan Nergiz for believing in the topic I was interested in. In addition, I am thankful to my thesis defense committee members, Assoc. Prof. Berrin Yanıkoğlu, Assoc. Prof. Erkay Savaş and Assoc. Prof. Cem Güneri, for their support and presence.
I appreciate Duygu Karaoğlan's help during the implementation process. I would also like to thank my friends Emre Kaplan, İsmail Fatih Yıldırım, Burcu Özçelik, Yarkın Doröz and Erman Pattuk for their help in the curriculum courses. Duygu Karaoğlan deserves special thanks for her precious and continuous support.
Last, but not the least, I am immensely thankful to my family, for being
there when I needed them to be, for believing in me and supporting me
throughout all my decisions.
Contents
1 Introduction 1
2 Background Information and Problem Definition 4
2.1 Structure of Social Networks (SNs) . . . . 4
2.2 Problem Definition . . . . 8
2.3 Related Work . . . . 8
2.3.1 Tabular Data Publishing . . . . 9
2.3.2 Complex Data Publishing . . . 15
2.3.3 SN Data Publishing . . . 17
3 Motivation and Contribution of the Thesis 22
3.1 Motivation . . . 22
3.2 Contribution of the Thesis . . . 24
4 Proposed Probabilistic Inference Attack (PIA) 27
4.1 Methodology . . . 27
4.1.1 Anonymization Process . . . 28
4.2 Algorithm . . . 30
4.3 Complexity Analysis . . . 37
5 Performance Evaluation 40
5.1 Test Bench . . . 40
5.2 Test Cases . . . 41
5.2.1 Synthetic Data Creation . . . 41
5.2.2 Real Data . . . 44
5.3 Results . . . 45
5.3.1 Synthetic Data Results . . . 46
5.3.2 Real Data Results . . . 51
5.4 Evaluation . . . 55
6 Conclusion and Future Work 57
List of Figures
1 Example of a Social Network . . . . 6
2 Suppressed version of SN from Figure 1 . . . . 7
3 Sample Domain Generalization Hierarchy . . . 12
4 Algorithm 5 runtime . . . 45
5 Run Time of PIA with Synthetic Data . . . 47
6 Erroneous Node Count with Synthetic Data . . . 49
7 Erroneous Node Percentage with Synthetic Data . . . 50
8 Erroneous Node Percentage in Synthetic Data for label outcome 51
9 Run Time of PIA with Real Data . . . 52
10 Erroneous Node Count with Real Data . . . 53
11 Erroneous Node Percentage with Real Data . . . 54
12 Erroneous Node Percentage in Real Data for label outcome . . 55
List of Tables
1 A Fictitious Tabular Data . . . 10
2 Suppressed version of Data from Table 1 . . . 10
3 Over-anonymized Data . . . 11
4 Under-anonymized Data . . . 11
5 A dataset without personal identifiers . . . 13
6 2-anonymous and 2-diverse version of Table 5 . . . 13
7 Tabular representation of spouse relationship . . . 15
8 Path coordinates in spatio-temporal data . . . 16
9 Defined Symbols used in the algorithm . . . 28
10 Facts & Figures for synthetic data . . . 38
11 Facts & Figures for real data . . . 38
12 Part of file for synthetic data creation . . . 41
13 Suppression rates for synthetic data . . . 47
14 Suppression rates for real data . . . 52
15 Number of errors and the error rate in synthetic data size 25000 59
16 Number of errors and the error rate in real data size 783 . . . 59
1 Introduction
Social Networks (SNs) [19] are among the most popular communication and sharing platforms on today's Internet. SNs are vast in size and can carry personal and sensitive information about an individual, such as political views, religion, and sexual orientation. This raises serious privacy concerns when SN data is published for research purposes or released to third parties for business purposes. Even without a direct transfer of SN data, a simple internet user can easily get access to many profiles and much information just by searching publicly available SN data, i.e. by finding people with open profiles using web crawlers, as elaborated in [38, 14].
Given such a threat, most service providers offer various privacy policies for their registered users, most of which allow users to choose what information to share and whom to share it with. For example, a user can specify that her age be publicly available while suppressing the political group she is a member of. However, to what extent such policies address privacy concerns remains an open question. The main problem with such preference-based protection mechanisms is that users cannot decide what the other people they are connected with are sharing. Additionally, a user may share some information without exact knowledge of its consequences, or connected people may share information about the user, likewise without considering the aftermath [23, 8, 17].
SNs are not just a way to keep records, like hospital databases or voter lists; in SNs, people project their daily social lives onto the public internet. As in real life, people make mistakes and can cause an information breach for someone else, for example by publicly asking someone about a private disease. Privacy in SNs can also be breached through emotions. People act, just as in real life, on emotions like anger, sadness or grudges, and so are more willing to share private information about others purely based on the emotions they hold against them. For example, if two best friends start to hate each other, they may post information publicly against each other. Beyond these two factors, sometimes users disclose their own information with the help of their neighbours. This is because people tend to build relations around similar backgrounds or facts, like school, age, political views, religious views or sexual orientation. A person may hide his/her information, but the network he/she creates is itself a way to define him/her.
Such information disclosed by neighbours serves as an inference channel for any suppressed data if the adversary knows that some correlation exists between the existence of a link between two users and the users having the same sensitive information. For example, even though a user chooses to suppress her membership of a political group, the adversary can look for memberships disclosed by her friends. If a sufficient number of her friends specify their membership of the same political group, an adversary, assuming such groups tend to form cliques in social networks, can predict her membership with high probability. Besides these information retrieval techniques, an adversary can also be a moderator or owner of such groups in an SN, giving him/her the ability to collect more accurate data and to extend his/her prediction radius within the SN.
In this thesis, we propose a probabilistic inference attack, which predicts the suppressed sensitive information from a highly suppressed SN with high success rate given the network structure and the degree of correlation between links and labels. The attack algorithm returns, for each node, a probability that the node has a specific label (e.g., being a member of a group). The sketch of the algorithm is as follows: For each node and label (e.g., sensitive information) in the SN, the attack algorithm assigns a probability function for the likelihood of the node to have the label. As the correlations are known, the probability function for a node is defined in terms of probabilities of neighbouring nodes (e.g., probabilities that they have the label). This creates a system of equations to solve for the probabilities. In order to solve the large system of equations, we propose an iterative algorithm. Basically, we start with an initial state for all probabilities and iteratively update probabilities based on the probability function. The algorithm returns the probabilities when the system converges to a final state. We experimentally show that the attack algorithm predicts the suppressed labels with high success rates even in a highly suppressed social network.
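The iterative scheme sketched above can be illustrated in code. This is a toy sketch, not the thesis's actual PIA implementation: the function name `infer_labels` and the correlation function `label_prob` (standing in for the adversary's knowledge K2) are assumptions for illustration.

```python
# Toy sketch of the iterative inference idea: each suppressed node's
# probability of having the label is repeatedly re-estimated from its
# neighbours' current probabilities until the system converges.

def infer_labels(adjacency, observed, label_prob, tol=1e-6, max_iter=1000):
    """adjacency: {node: set of neighbour nodes}
    observed:   {node: 1, 0, or None}  (None = suppressed '*')
    label_prob: assumed adversary knowledge, mapping the expected
                fraction of labelled neighbours to P(v.l = 1)."""
    # Initial state: known labels are fixed; unknowns start at 0.5.
    p = {v: (observed[v] if observed[v] is not None else 0.5)
         for v in adjacency}
    for _ in range(max_iter):
        delta = 0.0
        for v in adjacency:
            if observed[v] is not None:   # never overwrite known labels
                continue
            neigh = adjacency[v]
            r = sum(p[u] for u in neigh) / len(neigh)
            new = label_prob(r)
            delta = max(delta, abs(new - p[v]))
            p[v] = new
        if delta < tol:                   # converged to a fixed point
            break
    return p

# Hypothetical correlation: the more labelled neighbours, the likelier
# the node has the label itself.
probs = infer_labels(
    adjacency={1: {2, 3}, 2: {1, 3}, 3: {1, 2}},
    observed={1: 1, 2: 1, 3: None},
    label_prob=lambda r: r,
)
```

Here node 3's probability is pulled toward 1 because both of its neighbours are known to carry the label; with a different `label_prob` the same fixed-point loop models anti-correlated labels equally well.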
The rest of the thesis is organized as follows: Section 2 gives background
on social networks, followed by related work on data publishing. In Section
3, we present the motivation of the thesis and its contributions. Then we
describe our algorithm in detail in Section 4. Section 5 evaluates the proposed
attack algorithm based on test cases. Finally, we conclude in Section 6.
2 Background Information and Problem Definition
In this section, we formally define a social network in our domain, state what the adversary knows, and formally present the problem definition.
2.1 Structure of Social Networks (SNs)
SNs can be viewed as graphs [6], which consist of vertices (nodes) and edges. In an SN, each user (i.e. profile owner) is a node of the graph, and any relationship between two users is an edge between them. Depending on the SN, these edges can vary in weight and direction: e.g. the "friendship" relationship between two users is an undirected edge, whereas a "following"/"follower" [15] relationship is a directed edge. If the SN has different types of relationship between two users, each type of relationship can be represented by a different weight on the edge.
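A minimal sketch of this graph view, assuming an adjacency-map representation (the class and method names here are illustrative, not from the thesis):

```python
# Minimal sketch of an SN as a graph: users are nodes, relationships
# are weighted edges. An undirected "friendship" is mirrored in both
# directions; a directed "follows" edge is stored only one way.

class SocialNetwork:
    def __init__(self):
        self.edges = {}  # node -> {neighbour: relationship weight}

    def add_user(self, v):
        self.edges.setdefault(v, {})

    def add_friendship(self, u, v, weight=1):
        # undirected edge: store it on both endpoints
        self.add_user(u); self.add_user(v)
        self.edges[u][v] = weight
        self.edges[v][u] = weight

    def add_follow(self, follower, followee, weight=1):
        # directed edge: stored only on the follower's side
        self.add_user(follower); self.add_user(followee)
        self.edges[follower][followee] = weight

sn = SocialNetwork()
sn.add_friendship("alice", "bob")
sn.add_follow("alice", "carol")
```

Different relationship types can coexist by using distinct weights (or richer edge attributes) in the per-node map.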
Social Network: In our domain a social network is an undirected graph SN = (V, E), where each node v ∈ V is a user and each e ∈ E is an edge, defined as e = (v_i, v_j) with v_i, v_j ∈ V. There is an edge between v_i and v_j if and only if there exists an e ∈ E such that e = (v_i, v_j). A node can have multiple edges to different nodes, but there cannot be a node without any edges.
For the network we have a set of labels L, each representing a piece of sensitive information. For each user v ∈ V and label ℓ ∈ L, either the user has the label, which we denote as v.ℓ = 1, or does not have the label, which we denote as v.ℓ = 0. For example, the set of labels could be:

L = {age > 30, location = Europe, political view = right}   (2.1)

Each of these labels is referred to as ℓ_i with i ∈ [1, 3]. Hence, notation like v.ℓ_2 = 0 means that user v is not in Europe, and v.ℓ_3 = 1 means that v has a right-wing political view.
Suppressed Social Network: We say a suppressed social network SN* = (V', E') is derived from a social network SN = (V, E) iff the following conditions are met: 1. There is a one-to-one correspondence between v ∈ V, e ∈ E and v' ∈ V', e' ∈ E'. 2. For all matched v, v' and ℓ ∈ L: if v.ℓ = 1, then either v'.ℓ = 1 or v'.ℓ = * (representing unknown); else if v.ℓ = 0, then v'.ℓ = *.
So a suppressed SN* has the same network structure as its corresponding SN. The only difference is that some of the labels in SN* are set to *, representing unknown. An example of an SN subgraph and its suppressed version can be seen in Figures 1 and 2.
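A sketch of deriving SN* from SN under this definition: the structure is untouched, a label v.ℓ = 1 is either kept or starred, and v.ℓ = 0 is always starred. The per-label keep probability used here is an illustrative stand-in for individual user preferences, not part of the thesis's definition.

```python
import random

# Sketch of deriving a suppressed network SN* from SN: a label value 1
# is kept with probability keep_prob, otherwise replaced by '*'; a
# label value 0 is always replaced by '*' (per the definition above).

def suppress_labels(labels, keep_prob=0.5):
    """labels: {node: {label_name: 0 or 1}} -> suppressed copy."""
    suppressed = {}
    for v, lab in labels.items():
        suppressed[v] = {}
        for name, value in lab.items():
            if value == 1 and random.random() < keep_prob:
                suppressed[v][name] = 1
            else:
                suppressed[v][name] = "*"   # unknown
    return suppressed

orig = {1: {"age>30": 1}, 2: {"age>30": 0}}
supp = suppress_labels(orig, keep_prob=1.0)
```

With `keep_prob=1.0`, node 1's declared label survives while node 2's 0-valued label still becomes `*` — matching the asymmetry in the definition, where absence of a label is never published explicitly.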
Figure 1: Example of a Social Network
Figure 2: Suppressed version of SN from Figure 1
Neighbour Set: The λ-neighbour set N_v^λ of a node v w.r.t. a label ℓ in a social network SN = (V, E) is defined as the set of nodes that are connected to v and have label value λ: N_v^λ = {v' | ∃e = (v, v') ∈ E, v'.ℓ = λ}. N_v returns all neighbours of node v (i.e., N_v = N_v^0 ∪ N_v^1 ∪ N_v^*).
In Figure 2, the *-neighbour set of v_3 is N_{v_3}^* = {v_1, v_6}. Similarly, N_{v_3}^1 = {v_2}.
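The λ-neighbour set is straightforward to compute; a minimal sketch, using the Figure 2 example around v_3 (function and variable names are illustrative):

```python
# Sketch of the lambda-neighbour set: the neighbours of v whose
# (possibly suppressed) label value equals lam.

def neighbour_set(edges, labels, v, lam):
    """edges: {node: set of neighbours}; labels: {node: 0, 1 or '*'}."""
    return {u for u in edges[v] if labels[u] == lam}

# The subgraph of Figure 2 around v3: v3 is linked to v1, v2 and v6,
# where v2 discloses the label and v1, v6 have it suppressed.
edges = {3: {1, 2, 6}, 1: {3}, 2: {3}, 6: {3}}
labels = {1: "*", 2: 1, 6: "*", 3: "*"}

star_neigh = neighbour_set(edges, labels, 3, "*")   # {1, 6}
one_neigh = neighbour_set(edges, labels, 3, 1)      # {2}
```

The full neighbour set N_v is then the union of the three λ-neighbour sets for λ ∈ {0, 1, *}.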
2.2 Problem Definition
In our domain, the data holder has a social network SN; however, only a suppressed version SN* of SN is released, due to a preference-based privacy policy. To ease discussion, we assume, without loss of generality, that the network has only one label ℓ. We assume the adversary has access to the following information:

1. K_1: The released suppressed network SN*.
2. K_2: For a node v in the unsuppressed SN, P(v.ℓ = 1 | |N_v^1| / |N_v| = r) for all r.

Note that the above adversary is realistic. The knowledge in item 2 can be obtained approximately by an adversary who is a user in the social network and can see a subgraph of the network; it can also be obtained from other public networks or derived from domain knowledge. In this thesis, we propose an attack algorithm for such an adversary that will compute the following probability:

P(v.ℓ = 1 | K_1, K_2)
2.3 Related Work
Privacy breaches in published data sets were first shown in [39], where the authors were able to obtain sensitive information about people from datasets without unique identifiers such as names or SSNs. Since then, many different privacy models and anonymization techniques [34, 36] have been proposed to prevent attacks by different adversaries.
The first set of solutions for privacy-preserving data publishing focused primarily on tabular data, in which each individual has a single record. We now summarize the earlier research on tabular data publishing, but it should be noted that, since SN data inherits a network structure and the location of individuals inside the structure gives away sensitive information, the techniques proposed for tabular data cannot be used to de-identify SN data.
2.3.1 Tabular Data Publishing
Tabular data [12] is a way of organizing data in rows and columns, where rows represent records and columns represent the attributes of each record. In contrast to graphs, individual records are not linked to each other. Every record consists of many attributes and, depending on the dataset, there may be a number of sensitive attributes¹ [7] for each record. Tables 1 and 2 give an example of tabular data and its publishing methods, where all attributes are considered sensitive information. These attributes are considered sensitive because they can be linked with information in different tables, which makes them quasi-identifiers [40].
¹ A sensitive attribute is personal information or an opinion that can be used to classify people into groups after re-identification, e.g. diseases, memberships, etc.
Table 1: A Fictitious Tabular Data

Name          Age  Sex  Zip
John Doe      25   M    34141
Jane Doe      22   F    34140
Mark Johnson  34   M    34138
John Smith    19   M    34139
Sue Anne      43   F    34141

Table 2: Suppressed version of Data from Table 1

Name  Age       Sex  Zip
*     [25, 34]  *    3414*
*     [15, 24]  *    3414*
*     [25, 34]  *    3413*
*     [15, 24]  *    3413*
*     [35, 44]  *    3414*
Anonymization techniques like k-anonymity [40, 39], ℓ-diversity and δ-presence try to anonymize tabular data before releasing it publicly, while also trying to keep a useful level of information available in the suppressed versions for research. That is, if the data is over-anonymized, the released data will not contain any usable information, whereas under-anonymizing will lead to total re-identification of the data (Tables 3, 4).
Table 3: Over-anonymized Data

Name  Age   Sex  Zip
*     < 50  *    341**
*     < 50  *    341**
*     < 50  *    341**
*     < 50  *    341**
*     < 50  *    341**

Table 4: Under-anonymized Data

Name  Age  Sex  Zip
*     25   M    34141
*     25   F    34140
*     35   M    34138
*     20   M    34139
*     45   F    34141
Anonymization techniques must be improved or revised according to new adversary knowledge. Adversaries gather information from all sorts of sources and combine it into one big table for future use. Their main goal with published tabular data is to link the suppressed records with the data they have in hand.
Anonymization
k-Anonymity [40] was the first technique offered to anonymize datasets so as to make them resistant to re-identification [2]. The re-identification process in tabular data publishing is accomplished by combining two different datasets that share similar attributes but have different anonymized attributes. Tabular data like U.S. voter lists were preferred for this purpose because anyone could buy them from government agencies; they contained sensitive information such as age, sex and zip code, and were used to match information from different records in order to re-identify suppressed data.
When k-anonymity is applied to a dataset, it ensures that any combination of the quasi-identifiers is matched by at least k indistinguishable records. In other words, when a specific value is queried on the dataset, the result set will contain k identical records for any attribute or queried attribute set. This is achieved through a domain generalization hierarchy (DGH) [40] on each sensitive attribute, where the levels of generalization are viewed as a tree: the smaller the height, i.e. the closer to the root, the more general the values become, as seen in Figure 3.
341**
├── 3413*
│   ├── 34138
│   └── 34139
└── 3414*
    ├── 34140
    └── 34141

[0-100]
├── [0-49]
│   ├── [0-24]
│   └── [25-49]
└── [50-100]
    ├── [50-74]
    └── [75-100]

Figure 3: Sample Domain Generalization Hierarchy (Zip Code and Age)
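A DGH can be sketched as a parent map: generalizing a value one level means following its parent pointer toward the root. The encoding below of the zip-code hierarchy from Figure 3 is an illustrative assumption about how one might store it.

```python
# Sketch of a domain generalization hierarchy (DGH) as a parent map:
# generalizing a value by one level follows its parent pointer, as in
# the zip-code tree of Figure 3.

ZIP_DGH = {
    "34138": "3413*", "34139": "3413*",
    "34140": "3414*", "34141": "3414*",
    "3413*": "341**", "3414*": "341**",
    "341**": None,  # root: fully generalized
}

def generalize(value, dgh, levels=1):
    """Walk up the hierarchy by the requested number of levels,
    stopping at the root if it is reached earlier."""
    for _ in range(levels):
        parent = dgh.get(value)
        if parent is None:        # already at the root
            break
        value = parent
    return value
```

For example, `generalize("34141", ZIP_DGH)` yields `"3414*"`, and two levels reach the root `"341**"`; a k-anonymizer would pick, per attribute, the lowest level at which every equivalence class has at least k records.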
Using DGHs, the k-anonymity algorithm produces a suppressed dataset in which any record has at least (k − 1) identical records, regardless of the attribute combination queried on.
Despite this new technique, Machanavajjhala et al. [34] have shown that sensitive information is still not secure against re-identification attacks, stressing adversary knowledge as the cause, and propose a new model, ℓ-diversity. In k-anonymity, any query will return k indistinguishable records. If these k records share the same value for the sensitive attribute, that information is leaked: if an adversary knows that the person he/she is searching for is in the returned set of k records, then the adversary can conclude that the sensitive attribute has that definite value. To cover this defect, the authors propose ℓ-diversity, where within each set of k records the sensitive attribute takes at least ℓ distinct values (Tables 5, 6).
Table 5: A dataset without personal identifiers

Age  Sex  Zip    Disease
16   M    34106  Cancer
25   F    34107  Flu
20   M    34107  Cancer
30   F    34106  Cold

Table 6: 2-anonymous and 2-diverse version of Table 5

Age      Sex  Zip    Disease
[16-25]  *    3410*  Cancer
[16-25]  *    3410*  Flu
[20-30]  *    3410*  Cancer
[20-30]  *    3410*  Cold
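Checking these two properties together is mechanical: group records by their quasi-identifier tuple, then require each group to have at least k members and at least ℓ distinct sensitive values (distinct ℓ-diversity). This is an illustrative sketch, with Table 6 hard-coded as test data.

```python
from collections import defaultdict

# Sketch of a combined k-anonymity / distinct l-diversity check:
# every quasi-identifier equivalence class must contain at least k
# records and at least l distinct sensitive values.

def is_k_anonymous_l_diverse(records, qi_keys, sens_key, k, l):
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in qi_keys)].append(rec[sens_key])
    return all(len(g) >= k and len(set(g)) >= l for g in groups.values())

# Table 6: two equivalence classes, each 2-anonymous and 2-diverse.
table6 = [
    {"age": "[16-25]", "zip": "3410*", "disease": "Cancer"},
    {"age": "[16-25]", "zip": "3410*", "disease": "Flu"},
    {"age": "[20-30]", "zip": "3410*", "disease": "Cancer"},
    {"age": "[20-30]", "zip": "3410*", "disease": "Cold"},
]
ok = is_k_anonymous_l_diverse(table6, ("age", "zip"), "disease", k=2, l=2)
```

The same data fails for k=3, since each equivalence class holds only two records.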
Yet this new level of anonymity was also not sufficient, owing to continuously increasing adversary knowledge. As a result, a third major anonymization method was offered, again based on k-anonymity: δ-presence [36]. In contrast to ℓ-diversity, δ-presence shows that for some sensitive information it may not even be possible to achieve ℓ-diversity. If the sensitive attribute has only two unique values, v_i.sen ∈ {0, 1}, i.e. it is either true or false for each record, then it may be impossible to reach ℓ-diversity in a k-anonymous dataset. Consider n records, of which m satisfy v_i.sen = 1; hence (n − m) satisfy v_i.sen = 0, with m ≥ (n − m). According to these values, the average value for ℓ would be ℓ = n / (n − m), making the data n/(n − m)-diverse at best. In order to overcome this deficiency and to make the dataset resistant to publicly available sets of information, the δ-presence algorithm ensures that the probability of any record from the publicly available set being linked to the original data lies between (δ_min, δ_max). This algorithm relies on adversary knowledge; however, adversary knowledge may increase over time and must be updated regularly for the algorithm to produce the same level of anonymity every time.
As explained in this section, the early phases of privacy protection were based on tabular data publishing. The main threat to privacy was different tabular datasets published with different anonymizations, allowing adversaries to link corresponding tables in order to re-identify the original data.
2.3.2 Complex Data Publishing
Differentiating from tabular data, complex data is a set of records where multiple records combine to create a new record set. Complex datasets can be represented in multiple tabular forms or in other forms, such as graphs.
The complexity comes from what information is stored within the dataset and from the relations among records, as in relational databases [35]. Let us assume we have a table similar to Table 1 and we would also like to store the spouse relationship. Using the data from the main table, we create a second table, e.g. MarriedTo (Table 7).
Table 7: Tabular representation of spouse relationship

Spouse1       Spouse2
John Doe      Jane Doe
Mark Johnson  Tara Johnson
Brad Smith    Sarah Smith
Compared to a single table, complex datasets contain more information per record through the multiple relation sets between tables. This information and its meaning are being researched intensively under the topic of data mining [25, 21]. Using data mining techniques, the information within the tables is interpreted into meaningful conclusions. For example, each transaction in a supermarket could reveal correlated information, such as "people who buy diapers also buy milk". Such examples can be found in most everyday data.
Complex data is also studied for privacy, because most datasets can serve adversaries as background knowledge for data mining applications. One way to anonymize the data is to use the k-anonymity technique on multiple related tables [37]. However, datasets differ, and k-anonymity alone may not be enough to publish the data securely. Datasets like spatio-temporal data [20] are especially important. They store coordinates and timestamps for a person, which can be collected through GPS-enabled devices such as GPS navigators, smartphones, GSM carriers and digital cameras. This information can be expressed in a multi-relational table, where one table holds the information on the individual (Table 1) and the other holds the paths as coordinates (Table 8). In Table 8 the paths are stored in a comma-separated format, where each element is a coordinate (x_i, y_i, t_i), with x_i the horizontal and y_i the vertical displacement and t_i the timestamp at which the person was located at those coordinates.
Table 8: Path coordinates in spatio-temporal data

ID  Path
1   {(x_1, y_1, t_1), (x_2, y_2, t_2), (x_3, y_3, t_3)}
2   {(x_12, y_12, t_12), (x_13, y_13, t_13), (x_14, y_14, t_14), (x_15, y_15, t_15)}
As mentioned above, there are adversaries for this information, too. It has been shown that, even when anonymized, the paths of individuals can be retrieved [30, 32, 31]. This information can be used against a person to leak private information: "Person X goes to the hospital every week" translates into "possible chronic disease carrier" for an insurance company.
The difficulty of anonymizing such data comes from the information it carries. Each coordinate must be handled individually in order to suppress key information, so that an adversary cannot correctly recreate the original path. The path thus becomes a function in the x,y-coordinate system, and there may be many ways to recreate it. One way to visualize paths is to graph them onto a map, but graphs are mostly used to display networks rather than directional single-line paths.
2.3.3 SN Data Publishing
Once SNs became popular among internet users, the number of accounts increased, and SNs came to hold more data than any other kind of dataset. Adversaries therefore created crawlers [38] to harvest the data on SNs; they were not just storing it in their previously created tabular data, but also in graph form, in order to analyze the SN. The main reason is that each user account is connected to other accounts, which makes inference between these connected users possible. This follows from the graph structure of SNs: an SN is not just a collection of profiles, row by row, but an interconnected network where people from the same interest groups, schools, locations, etc. relate to each other through friendship links. Moreover, when an account is first opened in an SN, the privacy settings are set to public by default. During the time it takes the user to figure out how to set his/her privacy preferences, most of his/her sensitive data, e.g. age, sex, location, education, religious and political views, can be retrieved by adversaries.
Adversaries vary by type and by the information they seek, but the information they retrieve gets aggregated and distributed through the internet. Inexperienced adversaries search SNs by jumping through links, while more advanced ones use crawlers to harvest the data. Crawlers are web scripts which do the same job as the inexperienced adversaries in an automated fashion.
In recent years, many publishing methods for SN data have been discussed [26, 41, 42]. The main goal of these anonymization-based publishing techniques is to create a version of the original graph that mimics the relationships and label distribution of its source. Let SN be the original network and SN* be an anonymized version of it, i.e. SN* is derived from SN. The desired SN* is such that the probability of identifying any node of SN given SN* is smaller than a threshold value ε (Equation 2.2). At the same time, the anonymized network SN* must return the same probabilistic results as the original network SN for each query, up to a very small noise factor δ (Equation 2.4). SN* will only return results for countable queries, e.g. "the probability of a node v_i having label ℓ_j = 1" (Equation 2.3), due to the individual privacy requirements it has to meet.

P(v_i | SN) < ε,  for all v_i in SN*   (2.2)

q ≡ [v_i.ℓ_j = 1]   (2.3)

Q(q, SN) = Q(q, SN*) ± δ   (2.4)
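Condition (2.4) can be sketched as a utility check for one concrete countable query, the fraction of nodes carrying a label. This is an illustrative sketch; the query and function names are assumptions, and counting a suppressed `*` as "no label" is a simplification.

```python
# Sketch of checking condition (2.4): a counting query q (here "the
# fraction of nodes with label value 1") must give nearly the same
# answer on the original network and its anonymized version, up to a
# noise factor delta. A suppressed '*' is counted as not matching.

def query_label_fraction(labels, label_value=1):
    """Fraction of nodes whose label equals label_value."""
    return sum(1 for v in labels.values() if v == label_value) / len(labels)

def within_noise(original_labels, anon_labels, delta):
    q_orig = query_label_fraction(original_labels)
    q_anon = query_label_fraction(anon_labels)
    return abs(q_orig - q_anon) <= delta

orig = {1: 1, 2: 0, 3: 1, 4: 0}          # original labels
anon = {1: 1, 2: "*", 3: 1, 4: 0}        # one label suppressed
```

Here suppressing a 0-valued label leaves the query answer unchanged, so the pair passes for any δ ≥ 0; suppressing a 1-valued label would shift the answer and could violate the bound.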
When publishing graph data, the key to anonymization does not lie in suppressing the labels of vertices; instead it considers the position of each vertex, which is expressed through its neighbours [42]. Neighbours are the vertices linked to a node v_i through its edges. For instance, in Figure 1, node v_3 has the neighbours {v_1, v_2, v_6} via the edges {e_13, e_23, e_36}. Backstrom et al. [16] explain passive and active attacks that use this neighbouring property of SNs. Adversaries may actively use the SN and create a small subnetwork of their own, which they then match against the graph that is going to be published. If they can find their group within the anonymized version of the graph, they can extend from that point to label anonymized nodes.
Hay et al. [26] use perturbation on links in order to obscure the neighbouring relationships, making the graph more securely anonymized. They concentrate on anonymizing the edges, removing or adding edges in order to create similar vertices according to an algorithm. Similarly, Zhou and Pei [42] create a k-anonymous graph based on the neighbours of vertices, i.e. producing k identical nodes for any query. Differing from these two works, Wei et al. [41] anonymize both the labels and the neighbours of the nodes. They propose three algorithms to achieve this anonymization: first they create subgraphs in which each node has the same label; then they add or remove edges based on subgraph averages, such that at the end each node has exactly the same degree as its neighbours; finally, they anonymize the connectivity among subgraphs, again based on k-anonymity. Unlike these anonymization methods, we concentrate on nodes that do not carry any label at all. Moreover, anonymized published graph data is not what the adversaries in our proposed setting deal with, because we assume that adversaries collect their data with crawlers; using crawler data with our algorithm we infer on the graph, so no third-party anonymized published data is used.
It was He et al. [27] who brought the idea of inference into SNs. In their research they assume that people in an SN tend to build relationships with others such as classmates, co-workers, or fellow townsmen. Using these relationships and a Bayesian network [28, 29, 24] representation, they infer one label of each node. They assume that the adversary is aware of one's relationships with others, i.e. the adversary knows all the neighbours, and the social groups of the neighbours, of a given v_i ∈ V. Using this information, they show that for a given node v_i they can infer the labels ℓ_j by analyzing the correct group of its neighbours. Thus, for each node with v_i.ℓ_j = *, they select the neighbours with the correct relevance to this label and come to a decision based on this deductive algorithm. In contrast to He et al. [27], our inference algorithm does not concentrate on specific groups to infer the labels of nodes, since we assume the adversary does not know which node belongs to which social group.
Recently, Lindamood et al. [33] showed that inference attacks are a major threat to SNs. They proposed a classification algorithm together with a way to prevent inference attacks: by removing edges between nodes and suppressing labels on nodes, they drive the classifier to a point where it cannot make a confident decision from the information that remains.
The authors [33] collect the data through a SN crawler, described in [38, 14]. They perform three different tests in order to reach the most attack-tolerant version of suppression. When removing 10 links per node and 10 labels from each node, the classifier's decision ends up between 0.52 and 0.48, which makes it impossible to infer the class of any node. Our work differs from theirs from the beginning, because they suppress data that is collected through crawlers. What we want to show is that after the adversary collects the data using a crawler, it is still possible to infer the labels of the remaining unidentified nodes.
3 Motivation and Contribution of the Thesis
This section explains why we selected this subject and summarizes the contributions of the thesis.
3.1 Motivation
As SNs became popular, many different ways of collecting their data became possible. Despite all the preference-based privacy options, social network systems are still weak at protecting personal sensitive data. Although several algorithms for suppressing networks have been proposed, they assume that these networks consist only of edges and nodes, and they ignore adversaries that try to recover key sensitive information about a node.
As described in Section 2.3, the relationships among users can reveal key information about the connected parties. The denser the relationship between users, the more precise the information that can be retrieved about them, regardless of the number of captured profiles. Even if a user suppresses all of his/her sensitive information from third parties, i.e., no one besides his/her friends can see any information on the profile, his/her friends' information may still reveal sensitive facts such as age, location, or political view.
Inference using information on neighbouring nodes does not require much adversary knowledge; the publicly available data is more than adequate to mount such an inference attack. The idea behind the inference rests on the fact that people cannot control what others share, or how that information relates to their own private data. One may decide not to share a fact about his/her personal life, but other users can still share it publicly in many different ways: some may reveal a personal fact about themselves that, unknowingly or willingly, also describes related users. This kind of privacy breach is called neighbour sharing, because a neighbouring node informs the adversary about a user's sensitive information.
This breach in privacy is exploited by active adversaries, who not only rely on data gathered by crawlers but also personally inspect accounts for such information. Sometimes users in social networks act on emotional factors and share key information about their neighbours without thinking of the consequences. Most of the time, however, neighbour sharing does not stem from emotional factors; it follows from the basic information users share. People tend to form friendships and relations based on common factors such as school, work, political view, religious view, or sexual orientation. Using such related neighbours, the adversary can guess information that some users have suppressed.
Our primary objective is to use this structure of relationships and show that it is possible to infer suppressed information without using any other data sets or related information.
3.2 Contribution of the Thesis
We try to show that, in SNs, withholding information is not a definitive solution for privacy protection. Among the proposed network suppression algorithms, the main issue is to anonymize the network in such a way that it still makes sense and carries information similar to the original network N. We will show that if the suppressed data remains correlated with its original state, as defined by Equation 2.4, then simple adversary knowledge may be enough to recover key information about each person individually.
As mentioned in Section 3.1, neighbouring users have a high probability of sharing common factors or the same information in one or more attributes. In large social networks, it would be very difficult to inspect each account for neighbour sharing; hence, exploiting the similarities among users is a much more efficient way to realize our proposal.
In the early phases, we concentrated on special interest groups in SNs, which indicate that a person belongs to an idea or ideological group. We selected this as the primary adversary behaviour for classifying people even when their profiles were closed to outside viewing. However, it produced a low level of connectivity among users: through the special interest groups we could access many people's key information, but the relations among them, and their mutual relations, were too sparse to use in an inference attack. Therefore, we shifted our objective to the network itself rather than the information-providing entity, in our case the special interest groups. Using the network itself increases the number of mutual neighbours among the vertices, which allows better results within each subgraph. In other words, the more connected the network is, the better it can be inferred.
Our adversary knowledge is based only on the measurable tendency of users to connect to similar users. Considering each label separately, an adversary can easily derive the ratio of connections, i.e., edges, among users that do or do not have the label, as in Equations 3.1 and 3.2.
P(v_j.ℓ = 1 | v_i.ℓ = 1)    (3.1)

P(v_j.ℓ = 1 | v_i.ℓ = 0)    (3.2)
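As a minimal sketch of how an adversary might estimate these correlations from crawled data (the graph representation and the function name are our own illustration, not from the thesis):

```python
def link_label_correlations(labels, edges):
    """Estimate P(v_j.l = 1 | v_i.l = 1) and P(v_j.l = 1 | v_i.l = 0)
    from an observed (crawled) graph with a single binary label.
    `labels` maps node -> 0/1; `edges` is a list of (v_i, v_j) pairs."""
    hits = {0: 0, 1: 0}   # edges where the other endpoint has label 1
    total = {0: 0, 1: 0}  # edges incident to a node with label b
    for u, v in edges:
        for a, b in ((u, v), (v, u)):  # view each edge from both endpoints
            total[labels[a]] += 1
            hits[labels[a]] += labels[b]
    p1_given_1 = hits[1] / total[1] if total[1] else 0.0  # Eq. 3.1
    p1_given_0 = hits[0] / total[0] if total[0] else 0.0  # Eq. 3.2
    return p1_given_1, p1_given_0
```

The estimates converge to the true conditional probabilities as the crawled subgraph grows.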
Although we started our research in the direction of a binomial distribution [1] attack, it evolved into a multinomial inference attack due to the change in information source.
We developed a probabilistic inference attack that can recover a highly suppressed SN. We assume that an average adversary is capable of creating a crawler for a SN. Although many accounts are closed to public viewing, the adversary may use his/her own account for the crawler to retrieve better results. Several studies on SNs [33, 27] also built crawlers, each returning more than 50000 accounts; hence the assumed adversary knowledge is fair in our case. We probabilistically re-identify the label value of each node individually by looking at its neighbours whose labels are not suppressed.
Each node v_i may or may not have neighbours with the suppressed label ℓ_j; in both situations we can conclude the value of v_i.ℓ_j. The key point here is that the adversary knows the label distribution within the SN or sub-SN, i.e., the adversary knows the ratio of each label value present in the graph (Equation 3.3). Knowing this value, the adversary can attack the suppressed graph SN* and re-identify it even if edges have been removed or perturbed.
R_ℓ = |v_i.ℓ = 0| / |v_i.ℓ = 1|    (3.3)

R_{ℓ=0} = |v_i.ℓ = 0| / n    (3.4)

R_{ℓ=1} = |v_i.ℓ = 1| / n    (3.5)
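The label-distribution ratios of Equations 3.3-3.5 are straightforward to compute from the unsuppressed part of a crawl; a minimal sketch (the dictionary representation is our assumption):

```python
def label_distribution(labels):
    """Compute the ratios of Equations 3.3-3.5 for one binary label.
    `labels` maps each (unsuppressed) node to its label value 0 or 1."""
    n = len(labels)
    n0 = sum(1 for value in labels.values() if value == 0)
    n1 = n - n0
    r = n0 / n1 if n1 else float("inf")  # Eq. 3.3: R_l = |l=0| / |l=1|
    r0 = n0 / n                          # Eq. 3.4: fraction with label 0
    r1 = n1 / n                          # Eq. 3.5: fraction with label 1
    return r, r0, r1
```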
4 Proposed Probabilistic Inference Attack (PIA)
This section will explain our contribution in detail. First, we will define our assumptions. Then we will describe the evolution of our methods. Finally, we will explain our algorithm in detail.
4.1 Methodology
Our aim is, given K_1 and K_2, to find for any suppressed node the probability of having label 1. However, a direct formulation would cause a recursive call cycle as the number of suppressed nodes increases: if a suppressed node has a suppressed neighbour, which in turn has suppressed neighbours, and so on, we would recursively reach every node through neighbour relations and keep going until the last node, or worse, never terminate if there is a cycle. Hence we changed our model to a heuristic one: instead of considering all suppressed nodes at once, we look at them one by one and update the probabilities accordingly. The heuristic model, which will be explained shortly, relies on single comparisons and is faster. Our proposed algorithm therefore works iteratively using the distribution of the labels, with different phases for computing the inference rates.
We assume that we have obtained a part of a SN with n nodes that can be classified into m labels.
Table 9: Defined Symbols used in the algorithm

Value                                                  Symbol
Number of nodes                                        n
Number of labels                                       m
Label set                                              ℓ
Unique node                                            v_i
List of nodes                                          L[n]
Number of connections of node                          N_i
Connected nodes of node v_i                            F_ij
Unique connection of a node                            NF_k
Unique label                                           ℓ_j
Ratio of connections of a node having same label value RN_j
Label of node                                          v_i.ℓ
Inference ratio of a node                              IRN_i
Anonymization rate                                     A
4.1.1 Anonymization Process
This anonymization process was developed for testing the attack algorithm, which will be explained in Section 4.2. Anonymization can be performed while creating the network graph, or when receiving it as input, by randomly selecting some nodes. Algorithm 6 showed how a single-label classification can be generated; with one addition to the same algorithm, we can create an anonymized version of the network that keeps the same number of friends per node and the same connections.
A user-supplied input determines the rate of anonymization. If a random value is smaller than this rate, the node is anonymized and stored in a separate list, as detailed in Algorithm 1.
Algorithm 1 Anonymization Process in Network Generation
(1) a ← 0
(2) while a < n do
(3)   b ← readLineFromFile(filePath)
(4)   numberOfFriends ← random(n/100)
(5)   label ← random(0, 1)
(6)   if label < P(ℓ) then
(7)     nodeLabel ← 0
(8)   else
(9)     nodeLabel ← 1
(10)  end if
(11)  anonymization ← random(0, 1)
(12)  if anonymization < A then
(13)    L'[a] ← node(b, *, numberOfFriends)
(14)  else
(15)    L'[a] ← node(b, nodeLabel, numberOfFriends)
(16)  end if
(17)  L[a] ← node(b, nodeLabel, numberOfFriends)
(18)  a ← a + 1
(19) end while
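Under the assumption that nodes are simple (name, label, friend-count) tuples, Algorithm 1 can be rendered in Python roughly as follows (the function name and data layout are ours; `names` stands in for the file read on line (3)):

```python
import random

def generate_anonymized_network(names, a_rate, p_label, n):
    """Sketch of Algorithm 1: build the original node list L and its
    anonymized copy L_anon, where '*' marks a suppressed label.
    `a_rate` is the anonymization rate A and `p_label` is P(l)."""
    L, L_anon = [], []
    for a in range(n):
        name = names[a]
        number_of_friends = random.randrange(max(1, n // 100))
        node_label = 0 if random.random() < p_label else 1
        if random.random() < a_rate:
            L_anon.append((name, "*", number_of_friends))  # suppressed copy
        else:
            L_anon.append((name, node_label, number_of_friends))
        L.append((name, node_label, number_of_friends))    # ground truth
    return L, L_anon
```

Keeping the untouched list L alongside the anonymized list L_anon is what later allows the attack's predictions to be scored against the ground truth.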
When real data is used instead of generated synthetic data, this process is performed after the nodes are created, because the input methods differ: in the real-data version, each node has a separate file for its connections, and the nodes of each label outcome are separated into different files, e.g., class1.txt, class2.txt, etc. In addition, since these parts of the algorithm only generate test cases, they will be excluded when a real anonymized dataset is used. As mentioned in Section 2, the suppressed data will mimic the original data, because even when anonymized this set of records must make sense without breaching the privacy of individuals.
4.2 Algorithm
In this section we describe the PIA algorithm in depth. The PIA algorithm, shown in Algorithm 4, is designed for a single-label classification, e.g.
{age<30, age>=30}, {left-wing, right-wing}, etc. It runs on the anonymized network data, searching for anonymous nodes and calculating each one's probability of belonging to a class.
Our algorithm consists of three parts. First, it finds the unlabeled nodes in the network. Then, for each unlabeled node, it checks how many unlabeled/suppressed friends the node has; depending on the outcome, it chooses one of two probability functions and computes the probability of the node belonging to a class. Finally, after a probability has been computed for each unlabeled node, it compares these probabilities with the threshold values and decides the node's label.
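The three parts above can be sketched as a driver loop; `prob_pure` and `prob_mixed` are placeholders for the two probability functions of part 2, and the dictionary-of-lists graph layout is our assumption:

```python
def pia(graph, labels, prob_pure, prob_mixed, threshold):
    """Driver for the three parts of the PIA algorithm (hypothetical
    helper names). `labels[v]` is 0, 1, or '*'; `graph[v]` lists v's
    neighbours; `prob_pure`/`prob_mixed` compute P(v.l = 1) depending
    on whether v has suppressed neighbours."""
    # Part 1: collect the suppressed nodes (cf. Algorithm 2).
    suppressed = [v for v in graph if labels[v] == "*"]
    decisions = {}
    for v in suppressed:
        # Part 2: pick the probability function by neighbourhood content.
        has_suppressed_friend = any(labels[u] == "*" for u in graph[v])
        p = prob_mixed(v) if has_suppressed_friend else prob_pure(v)
        # Part 3: compare with the threshold and decide the label.
        decisions[v] = 1 if p >= threshold else 0
    return decisions
```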
The first part of the algorithm, shown in Algorithm 2, works iteratively: it goes over the list of nodes and collects the ones whose labels were suppressed, as produced by Algorithm 1.
Algorithm 2 Get Suppressed Nodes
(1) a ← 0
(2) while a < n do
(3)   if L[a].ℓ = * then
(4)     L'.push(L[a])
(5)   end if
(6)   a ← a + 1
(7) end while
(8) return L'

Algorithm 3 Total number of suppressed nodes in the graph
(1) i ← 0
(2) b ← 0
(3) while i < n do
(4)   a ← 0
(5)   while a < N_i do
(6)     if F_ia.ℓ = * then
(7)       b ← b + 1
(8)     end if
(9)     a ← a + 1
(10)  end while
(11)  i ← i + 1
(12) end while
(13) return b
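Assuming nodes are stored as dictionaries with a `label` field and `F[i]` lists the neighbour indices of node i (our data layout, not the thesis's), Algorithms 2 and 3 translate roughly to:

```python
def get_suppressed_nodes(L):
    """Algorithm 2: return the nodes whose label is suppressed ('*')."""
    return [node for node in L if node["label"] == "*"]

def count_suppressed_neighbours(L, F):
    """Algorithm 3: total number of suppressed entries over every node's
    neighbour list, resetting the inner counter for each node."""
    b = 0
    for i in range(len(L)):
        for j in F[i]:
            if L[j]["label"] == "*":
                b += 1
    return b
```

Note that a suppressed node may be counted once per incident edge in Algorithm 3, since the count runs over neighbour lists rather than over distinct nodes.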
The second part of the algorithm applies one of two probability equations to the nodes returned from part 1, depending on the number of suppressed connections the selected node has. If the node in question is not connected to any other suppressed node, the inference is based purely on the label distribution of its connected peers. As Equation 4.1 describes, we check the label of each connected node to determine the connectivity of the selected node to each label value; again, this is the single-label classification version of the equation. Then, by comparing these values with each other and with the class occurrence rates, as in Equation 4.2, we can infer the label value of the node.
c_1 = N^0_{v_i},  c_2 = N^1_{v_i}    (4.1)

c_1/c_2 > (1 − K^1_{2,v_i}) / (1 − K^0_{2,v_i})  ⇒  v_i.ℓ = 0
c_1/c_2 < (1 − K^1_{2,v_i}) / (1 − K^0_{2,v_i})  ⇒  v_i.ℓ = 1    (4.2)