A Probabilistic Inference Attack On Suppressed Social Networks
by
BARIŞ ALTOP
Submitted to the Graduate School of Sabancı University in partial fulfillment of the requirements for the degree of
Master of Science
Sabancı University
Spring, 2011
© Barış Altop 2011
All Rights Reserved
A PROBABILISTIC INFERENCE ATTACK ON SUPPRESSED SOCIAL NETWORKS
Barış ALTOP
Computer Science and Engineering, Master’s Thesis, 2011
Thesis Supervisors: Assist. Prof. Dr. Mehmet Ercan Nergiz, Assoc. Prof. Dr. Yücel Saygın
Keywords: Social Networks, Inference Attack, Privacy, Private Data Protection, Classification
Abstract
Social Networks (SNs) are now widely used by internet users to share personal information. Such networks are so rich in information content that there is public and commercial benefit in sharing them with third parties. However, the information stored in SNs is mostly person-specific and subject to privacy concerns. One way to address these privacy issues is to give control of the data to the users, enabling them to suppress data that they choose not to share with third parties.
Unfortunately, the above-mentioned preference-based suppression techniques are not sufficient to protect privacy, mainly because they do not allow users to control data about other users they are linked with. Information about neighbours becomes an inference channel in an SN when there is a known correlation between the existence of a link between two users and the users having the same sensitive information.
In this thesis, we propose a probabilistic inference attack on suppressed social network data that can successfully predict a suppressed label by looking at neighbouring users' data. The attack algorithm is designed for a realistic adversary that knows, from background or external sources, the correlations between labels and links in the SN. We experimentally show that it is possible to recover the majority of the suppressed labels of users even in a highly suppressed SN.
BASTIRILMIŞ SOSYAL AĞLARDA OLASILIKSAL BİR ÇIKARIM SALDIRISI
Barış ALTOP
Computer Science and Engineering, Master's Thesis, 2011
Thesis Supervisors: Assist. Prof. Dr. Mehmet Ercan Nergiz, Assoc. Prof. Dr. Yücel Saygın
Keywords: Social Networks, Inference Attack, Privacy, Private Data Protection, Classification
Özet
Social Networks are widely used by today's internet users to share personal information. Because the information content of such networks is very rich, sharing them with third parties brings public and commercial benefit. However, the information stored in social networks is mostly person-specific and subject to privacy concerns. One way to address these privacy issues is to give users control of their own data, enabling them to hide the data of their choosing from third parties by suppressing it.
Unfortunately, the above-mentioned preference-based suppression techniques are not sufficient to ensure privacy. The main reason is that such protection systems do not give their users any control over the data shared by the other users they are linked with. There is a correlation, in terms of data similarity, between linked users, and this correlation creates a data inference channel between two neighbouring users. In this thesis, we propose a probabilistic inference attack that can discover users' suppressed information in a suppressed social network by looking at neighbouring users' data. The attack algorithm is designed for a realistic adversary that knows the links in the social network and the correlations between labels. We experimentally show that it is possible to infer the majority of users' suppressed labels even in highly suppressed social networks.
to my beloved family
Acknowledgements
I wish to express my sincere gratitude to Assoc. Prof. Yücel Saygın and Assist. Prof. Dr. Mehmet Ercan Nergiz for their continuous support and worthwhile guidance throughout my master's studies. Assoc. Prof. Yücel Saygın's support, from my application process until today, was always a great motivation during my studies. I am also thankful to Assist. Prof. Dr. Mehmet Ercan Nergiz for believing in the topic I was interested in. In addition, I am thankful to my thesis defense committee members, Assoc. Prof. Berrin Yanıkoğlu, Assoc. Prof. Erkay Savaş and Assoc. Prof. Cem Güneri, for their support and presence.
I appreciate Duygu Karaoğlan's help during the implementation process. I would also like to thank my friends Emre Kaplan, İsmail Fatih Yıldırım, Burcu Özçelik, Yarkın Doröz and Erman Pattuk for their help in the curriculum courses. Duygu Karaoğlan deserves special thanks for her precious and continuous support.
Last, but not the least, I am immensely thankful to my family, for being
there when I needed them to be, for believing in me and supporting me
throughout all my decisions.
Contents
1 Introduction 1
2 Background Information and Problem Definition 4
2.1 Structure of Social Networks (SNs) . . . . 4
2.2 Problem Definition . . . . 8
2.3 Related Work . . . . 8
2.3.1 Tabular Data Publishing . . . . 9
2.3.2 Complex Data Publishing . . . 15
2.3.3 SN Data Publishing . . . 17
3 Motivation and Contribution of the Thesis 22
3.1 Motivation . . . 22
3.2 Contribution of the Thesis . . . 24
4 Proposed Probabilistic Inference Attack (PIA) 27
4.1 Methodology . . . 27
4.1.1 Anonymization Process . . . 28
4.2 Algorithm . . . 30
4.3 Complexity Analysis . . . 37
5 Performance Evaluation 40
5.1 Test Bench . . . 40
5.2 Test Cases . . . 41
5.2.1 Synthetic Data Creation . . . 41
5.2.2 Real Data . . . 44
5.3 Results . . . 45
5.3.1 Synthetic Data Results . . . 46
5.3.2 Real Data Results . . . 51
5.4 Evaluation . . . 55
6 Conclusion and Future Work 57
List of Figures
1 Example of a Social Network . . . . 6
2 Suppressed version of SN from Figure 1 . . . . 7
3 Sample Domain Generalization Hierarchy . . . 12
4 Algorithm 5 runtime . . . 45
5 Run Time of PIA with Synthetic Data . . . 47
6 Erroneous Node Count with Synthetic Data . . . 49
7 Erroneous Node Percentage with Synthetic Data . . . 50
8 Erroneous Node Percentage in Synthetic Data for label outcome 51
9 Run Time of PIA with Real Data . . . 52
10 Erroneous Node Count with Real Data . . . 53
11 Erroneous Node Percentage with Real Data . . . 54
12 Erroneous Node Percentage in Real Data for label outcome . . 55
List of Tables
1 A Fictitious Tabular Data . . . 10
2 Suppressed version of Data from Table 1 . . . 10
3 Over-anonymized Data . . . 11
4 Under-anonymized Data . . . 11
5 A dataset without personal identifiers . . . 13
6 2-anonymous and 2-diverse version of Table 5 . . . 13
7 Tabular representation of spouse relationship . . . 15
8 Path coordinates in spatio-temporal data . . . 16
9 Defined Symbols used in the algorithm . . . 28
10 Facts & Figures for synthetic data . . . 38
11 Facts & Figures for real data . . . 38
12 Part of file for synthetic data creation . . . 41
13 Suppression rates for synthetic data . . . 47
14 Suppression rates for real data . . . 52
15 Number of errors and the error rate in synthetic data size 25000 59
16 Number of errors and the error rate in real data size 783 . . . 59
1 Introduction
Social Networks (SNs) [19] are among the most popular communication and sharing platforms on today's Internet. SNs are vast in size and can carry personal and sensitive information about an individual, such as political views, religion, and sexual orientation. This raises serious privacy concerns when SN data is published for research purposes or released to third parties for business purposes. Even without a direct transfer of SN data, a simple internet user can easily get access to many profiles and much information just by searching publicly available SN data, i.e. by finding people with open profiles using web crawlers, as elaborated in [38, 14].
Given such a threat, most service providers offer various privacy policies for their registered users, most of which allow users to choose what information to share and whom to share it with. For example, a user can specify that her age be publicly available while suppressing the political group she is a member of. However, to what extent such policies address privacy concerns remains an open question. The main problem with such preference-based protection mechanisms is that users cannot decide what the other people they are connected with are sharing. Additionally, a user may share some information without exact knowledge of its consequences, or connected people may share information about the user, likewise without considering the aftermath [23, 8, 17].
SNs are not just a way to keep records, like hospital databases or voter lists; in SNs, people project their daily social lives onto the public internet. As in real life, people make mistakes and can cause an information breach for someone else, for example by publicly asking someone about a private disease. Privacy in SNs can also be breached through emotions. People act, just as in real life, on emotions like anger, sadness or grudges, and so are more willing to share private information about others purely based on the emotions they hold against them. For example, if two best friends start to hate each other, they may post information publicly against each other. Beyond these two factors, sometimes users disclose their own information with the help of their neighbours. This is because people tend to build relations around similar backgrounds or facts, like school, age, political views, religious views or sexual orientation. A person may hide his/her information, but the network he/she creates is itself a way to define him/her.
Such information disclosed by neighbours serves as an inference channel for any suppressed data if the adversary knows that some correlation exists between the existence of a link between two users and the users having the same sensitive information. For example, even though a user chooses to suppress her membership of a political group, the adversary can look for memberships disclosed by her friends. If a sufficient number of her friends specify their membership of the same political group, an adversary, assuming such groups tend to form cliques in social networks, can predict her membership with high probability. Besides these information retrieval techniques, an adversary can also be a moderator or owner of such groups in an SN, giving him/her the ability to collect more accurate data and to extend his/her prediction radius within the SN.
In this thesis, we propose a probabilistic inference attack, which predicts the suppressed sensitive information from a highly suppressed SN with high success rate given the network structure and the degree of correlation between links and labels. The attack algorithm returns, for each node, a probability that the node has a specific label (e.g., being a member of a group). The sketch of the algorithm is as follows: For each node and label (e.g., sensitive information) in the SN, the attack algorithm assigns a probability function for the likelihood of the node to have the label. As the correlations are known, the probability function for a node is defined in terms of probabilities of neighbouring nodes (e.g., probabilities that they have the label). This creates a system of equations to solve for the probabilities. In order to solve the large system of equations, we propose an iterative algorithm. Basically, we start with an initial state for all probabilities and iteratively update probabilities based on the probability function. The algorithm returns the probabilities when the system converges to a final state. We experimentally show that the attack algorithm predicts the suppressed labels with high success rates even in a highly suppressed social network.
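The iterative scheme sketched above can be illustrated in code. This is a toy sketch, not the thesis's actual PIA implementation: the function name `infer_labels` and the correlation function `label_prob` (standing in for the adversary's knowledge K2) are assumptions for illustration.

```python
# Toy sketch of the iterative inference idea: each suppressed node's
# probability of having the label is repeatedly re-estimated from its
# neighbours' current probabilities until the system converges.

def infer_labels(adjacency, observed, label_prob, tol=1e-6, max_iter=1000):
    """adjacency: {node: set of neighbour nodes}
    observed:   {node: 1, 0, or None}  (None = suppressed '*')
    label_prob: assumed adversary knowledge, mapping the expected
                fraction of labelled neighbours to P(v.l = 1)."""
    # Initial state: known labels are fixed; unknowns start at 0.5.
    p = {v: (observed[v] if observed[v] is not None else 0.5)
         for v in adjacency}
    for _ in range(max_iter):
        delta = 0.0
        for v in adjacency:
            if observed[v] is not None:   # never overwrite known labels
                continue
            neigh = adjacency[v]
            r = sum(p[u] for u in neigh) / len(neigh)
            new = label_prob(r)
            delta = max(delta, abs(new - p[v]))
            p[v] = new
        if delta < tol:                   # converged to a fixed point
            break
    return p

# Hypothetical correlation: the more labelled neighbours, the likelier
# the node has the label itself.
probs = infer_labels(
    adjacency={1: {2, 3}, 2: {1, 3}, 3: {1, 2}},
    observed={1: 1, 2: 1, 3: None},
    label_prob=lambda r: r,
)
```

Here node 3's probability is pulled toward 1 because both of its neighbours are known to carry the label; with a different `label_prob` the same fixed-point loop models anti-correlated labels equally well.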
The rest of the thesis is organized as follows: Section 2 gives background
on social networks, followed by related work on data publishing. In Section
3, we present the motivation of the thesis and its contributions. Then we
describe our algorithm in detail in Section 4. Section 5 evaluates the proposed
attack algorithm based on test cases. Finally, we conclude in Section 6.
2 Background Information and Problem Definition
In this section, we formally define a social network in our domain, state what the adversary knows, and formally present the problem definition.
2.1 Structure of Social Networks (SNs)
SNs can be viewed as graphs [6], which consist of vertices (nodes) and edges. In an SN, each user (i.e. profile owner) is a node of the graph, and any relationship between two users is an edge between them. Depending on the SN, these edges can vary in weight and direction: e.g. the "friendship" relationship between two users is an undirected edge, whereas a "following"/"follower" [15] relationship is a directed edge. If the SN has different types of relationship between two users, each type of relationship can be represented by a different weight on the edge.
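A minimal sketch of this graph view, assuming an adjacency-map representation (the class and method names here are illustrative, not from the thesis):

```python
# Minimal sketch of an SN as a graph: users are nodes, relationships
# are weighted edges. An undirected "friendship" is mirrored in both
# directions; a directed "follows" edge is stored only one way.

class SocialNetwork:
    def __init__(self):
        self.edges = {}  # node -> {neighbour: relationship weight}

    def add_user(self, v):
        self.edges.setdefault(v, {})

    def add_friendship(self, u, v, weight=1):
        # undirected edge: store it on both endpoints
        self.add_user(u); self.add_user(v)
        self.edges[u][v] = weight
        self.edges[v][u] = weight

    def add_follow(self, follower, followee, weight=1):
        # directed edge: stored only on the follower's side
        self.add_user(follower); self.add_user(followee)
        self.edges[follower][followee] = weight

sn = SocialNetwork()
sn.add_friendship("alice", "bob")
sn.add_follow("alice", "carol")
```

Different relationship types can coexist by using distinct weights (or richer edge attributes) in the per-node map.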
Social Network: In our domain a social network is an undirected graph SN = (V, E), where each node v ∈ V is a user and each e ∈ E is an edge, defined as e = (v_i, v_j) with v_i, v_j ∈ V. There is an edge between v_i and v_j if and only if there exists an e ∈ E such that e = (v_i, v_j). A node can have multiple edges to different nodes, but there cannot be a node without any edges.
For the network we have a set of labels L, each representing a piece of sensitive information. For each user v ∈ V and label ℓ ∈ L, either the user has the label, which we denote as v.ℓ = 1, or does not have the label, which we denote as v.ℓ = 0. For example, the set of labels could be:

L = {age > 30, location = Europe, political view = right}   (2.1)

Each of these labels is referred to as ℓ_i with i ∈ [1, 3]. Hence, notation like v.ℓ_2 = 0 means that user v is not in Europe, and v.ℓ_3 = 1 means that v has a right-wing political view.
Suppressed Social Network: We say a suppressed social network SN* = (V', E') is derived from a social network SN = (V, E) iff the following conditions are met: 1. There is a one-to-one correspondence between v ∈ V, e ∈ E and v' ∈ V', e' ∈ E'. 2. For all matched v, v' and ℓ ∈ L: if v.ℓ = 1, then either v'.ℓ = 1 or v'.ℓ = * (representing unknown); else if v.ℓ = 0, then v'.ℓ = *.
So a suppressed SN* has the same network structure as its corresponding SN. The only difference is that some of the labels in SN* are set to *, representing unknown. An example of an SN subgraph and its suppressed version can be seen in Figures 1 and 2.
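A sketch of deriving SN* from SN under this definition: the structure is untouched, a label v.ℓ = 1 is either kept or starred, and v.ℓ = 0 is always starred. The per-label keep probability used here is an illustrative stand-in for individual user preferences, not part of the thesis's definition.

```python
import random

# Sketch of deriving a suppressed network SN* from SN: a label value 1
# is kept with probability keep_prob, otherwise replaced by '*'; a
# label value 0 is always replaced by '*' (per the definition above).

def suppress_labels(labels, keep_prob=0.5):
    """labels: {node: {label_name: 0 or 1}} -> suppressed copy."""
    suppressed = {}
    for v, lab in labels.items():
        suppressed[v] = {}
        for name, value in lab.items():
            if value == 1 and random.random() < keep_prob:
                suppressed[v][name] = 1
            else:
                suppressed[v][name] = "*"   # unknown
    return suppressed

orig = {1: {"age>30": 1}, 2: {"age>30": 0}}
supp = suppress_labels(orig, keep_prob=1.0)
```

With `keep_prob=1.0`, node 1's declared label survives while node 2's 0-valued label still becomes `*` — matching the asymmetry in the definition, where absence of a label is never published explicitly.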
Figure 1: Example of a Social Network
Figure 2: Suppressed version of SN from Figure 1
Neighbour Set: The λ-neighbour set N_v^λ of a node v w.r.t. a label ℓ in a social network SN = (V, E) is defined as the set of nodes that are connected to v and have label value λ: N_v^λ = {v' | ∃e = (v, v') ∈ E, v'.ℓ = λ}. N_v returns all neighbours of node v (i.e., N_v = N_v^0 ∪ N_v^1 ∪ N_v^*).
In Figure 2, the *-neighbour set of v_3 is N_{v_3}^* = {v_1, v_6}. Similarly, N_{v_3}^1 = {v_2}.
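The λ-neighbour set is straightforward to compute; a minimal sketch, using the Figure 2 example around v_3 (function and variable names are illustrative):

```python
# Sketch of the lambda-neighbour set: the neighbours of v whose
# (possibly suppressed) label value equals lam.

def neighbour_set(edges, labels, v, lam):
    """edges: {node: set of neighbours}; labels: {node: 0, 1 or '*'}."""
    return {u for u in edges[v] if labels[u] == lam}

# The subgraph of Figure 2 around v3: v3 is linked to v1, v2 and v6,
# where v2 discloses the label and v1, v6 have it suppressed.
edges = {3: {1, 2, 6}, 1: {3}, 2: {3}, 6: {3}}
labels = {1: "*", 2: 1, 6: "*", 3: "*"}

star_neigh = neighbour_set(edges, labels, 3, "*")   # {1, 6}
one_neigh = neighbour_set(edges, labels, 3, 1)      # {2}
```

The full neighbour set N_v is then the union of the three λ-neighbour sets for λ ∈ {0, 1, *}.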
2.2 Problem Definition
In our domain, the data holder has a social network SN; however, only a suppressed version SN* of SN is released, due to a preference-based privacy policy. To ease discussion, we assume, without loss of generality, that the network has only one label ℓ. We assume the adversary has access to the following information:

1. K_1: The released suppressed network SN*.
2. K_2: For a node v in the unsuppressed SN, P(v.ℓ = 1 | |N_v^1| / |N_v| = r) for all r.

Note that the above adversary is realistic. The knowledge in item 2 can be obtained approximately by an adversary who is a user in the social network and can see a subgraph of the network; it can also be obtained from other public networks or derived from domain knowledge. In this thesis, we propose an attack algorithm for such an adversary that will compute the following probability:

P(v.ℓ = 1 | K_1, K_2)
2.3 Related Work
Privacy breaches in published data sets were first shown in [39], where the authors were able to obtain sensitive information about people from datasets without unique identifiers such as names or SSNs. Since then, many different privacy models and anonymization techniques [34, 36] have been proposed to prevent attacks by different adversaries.
The first set of solutions for privacy-preserving data publishing focused primarily on tabular data, in which each individual has a single record. We now summarize the earlier research on tabular data publishing, but it should be noted that, since SN data inherits a network structure and the location of individuals inside the structure gives away sensitive information, the techniques proposed for tabular data cannot be used to de-identify SN data.
2.3.1 Tabular Data Publishing
Tabular data [12] is a way of organizing data in rows and columns, where rows represent records and columns represent the attributes of each record. In contrast to graphs, individual records are not linked to each other. Every record consists of many attributes and, depending on the dataset, there may be a number of sensitive attributes¹ [7] for each record. Tables 1 and 2 give an example of tabular data and its publishing methods, where all attributes are considered sensitive information. These attributes are considered sensitive because they can be linked with information in different tables, which makes them quasi-identifiers [40].
¹ A sensitive attribute is personal information or an opinion that can be used to classify people into groups after re-identification, e.g. diseases, memberships, etc.
Table 1: A Fictitious Tabular Data

Name          Age  Sex  Zip
John Doe      25   M    34141
Jane Doe      22   F    34140
Mark Johnson  34   M    34138
John Smith    19   M    34139
Sue Anne      43   F    34141

Table 2: Suppressed version of Data from Table 1

Name  Age       Sex  Zip
*     [25, 34]  *    3414*
*     [15, 24]  *    3414*
*     [25, 34]  *    3413*
*     [15, 24]  *    3413*
*     [35, 44]  *    3414*
Anonymization techniques like k-anonymity [40, 39], ℓ-diversity and δ-presence try to anonymize tabular data before releasing it publicly, while also trying to keep a useful level of information available in the suppressed versions for research. That is, if the data is over-anonymized, the released data will not contain any usable information, whereas under-anonymizing will lead to total re-identification of the data (Tables 3, 4).
Table 3: Over-anonymized Data

Name  Age   Sex  Zip
*     < 50  *    341**
*     < 50  *    341**
*     < 50  *    341**
*     < 50  *    341**
*     < 50  *    341**

Table 4: Under-anonymized Data

Name  Age  Sex  Zip
*     25   M    34141
*     25   F    34140
*     35   M    34138
*     20   M    34139
*     45   F    34141
Anonymization techniques must be improved or revised according to new adversary knowledge. Adversaries gather information from all sorts of sources and combine it into one big table for future use. Their main goal with published tabular data is to link the suppressed records with the data they have in hand.
Anonymization
k-Anonymity [40] was the first technique offered to anonymize datasets so as to make them resistant to re-identification [2]. The re-identification process in tabular data publishing is accomplished by combining two different datasets that share similar attributes but have different anonymized attributes. Tabular data like U.S. voter lists were preferred for this purpose because anyone could buy them from government agencies; they contained sensitive information such as age, sex and zip code, and were used to match information from different records in order to re-identify suppressed data.
When k-anonymity is applied to a dataset, it ensures that any combination of the quasi-identifiers is matched by at least k indistinguishable records. In other words, when a specific value is queried on the dataset, the result set will contain k identical records for any attribute or queried attribute set. This is achieved through a domain generalization hierarchy (DGH) [40] on each sensitive attribute, where the levels of generalization are viewed as a tree: the smaller the height, i.e. the closer to the root, the more general the values become, as seen in Figure 3.
341**
├── 3413*
│   ├── 34138
│   └── 34139
└── 3414*
    ├── 34140
    └── 34141

[0-100]
├── [0-49]
│   ├── [0-24]
│   └── [25-49]
└── [50-100]
    ├── [50-74]
    └── [75-100]

Figure 3: Sample Domain Generalization Hierarchy (Zip Code and Age)
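A DGH can be sketched as a parent map: generalizing a value one level means following its parent pointer toward the root. The encoding below of the zip-code hierarchy from Figure 3 is an illustrative assumption about how one might store it.

```python
# Sketch of a domain generalization hierarchy (DGH) as a parent map:
# generalizing a value by one level follows its parent pointer, as in
# the zip-code tree of Figure 3.

ZIP_DGH = {
    "34138": "3413*", "34139": "3413*",
    "34140": "3414*", "34141": "3414*",
    "3413*": "341**", "3414*": "341**",
    "341**": None,  # root: fully generalized
}

def generalize(value, dgh, levels=1):
    """Walk up the hierarchy by the requested number of levels,
    stopping at the root if it is reached earlier."""
    for _ in range(levels):
        parent = dgh.get(value)
        if parent is None:        # already at the root
            break
        value = parent
    return value
```

For example, `generalize("34141", ZIP_DGH)` yields `"3414*"`, and two levels reach the root `"341**"`; a k-anonymizer would pick, per attribute, the lowest level at which every equivalence class has at least k records.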
Using DGHs, the k-anonymity algorithm produces a suppressed dataset in which any record has at least (k − 1) identical records, regardless of the attribute combination queried on.
Despite this new technique, Machanavajjhala et al. [34] have shown that sensitive information is still not secure against re-identification attacks, stressing adversary knowledge as the cause, and propose a new model, ℓ-diversity. In k-anonymity, any query will return k indistinguishable records. If these k records share the same value for the sensitive attribute, that information is leaked: if an adversary knows that the person he/she is searching for is in the returned set of k records, then the adversary can conclude that the sensitive attribute has that definite value. To cover this defect, the authors propose ℓ-diversity, where within each set of k records the sensitive attribute takes at least ℓ distinct values (Tables 5, 6).
Table 5: A dataset without personal identifiers

Age  Sex  Zip    Disease
16   M    34106  Cancer
25   F    34107  Flu
20   M    34107  Cancer
30   F    34106  Cold

Table 6: 2-anonymous and 2-diverse version of Table 5

Age      Sex  Zip    Disease
[16-25]  *    3410*  Cancer
[16-25]  *    3410*  Flu
[20-30]  *    3410*  Cancer
[20-30]  *    3410*  Cold
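Checking these two properties together is mechanical: group records by their quasi-identifier tuple, then require each group to have at least k members and at least ℓ distinct sensitive values (distinct ℓ-diversity). This is an illustrative sketch, with Table 6 hard-coded as test data.

```python
from collections import defaultdict

# Sketch of a combined k-anonymity / distinct l-diversity check:
# every quasi-identifier equivalence class must contain at least k
# records and at least l distinct sensitive values.

def is_k_anonymous_l_diverse(records, qi_keys, sens_key, k, l):
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in qi_keys)].append(rec[sens_key])
    return all(len(g) >= k and len(set(g)) >= l for g in groups.values())

# Table 6: two equivalence classes, each 2-anonymous and 2-diverse.
table6 = [
    {"age": "[16-25]", "zip": "3410*", "disease": "Cancer"},
    {"age": "[16-25]", "zip": "3410*", "disease": "Flu"},
    {"age": "[20-30]", "zip": "3410*", "disease": "Cancer"},
    {"age": "[20-30]", "zip": "3410*", "disease": "Cold"},
]
ok = is_k_anonymous_l_diverse(table6, ("age", "zip"), "disease", k=2, l=2)
```

The same data fails for k=3, since each equivalence class holds only two records.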
Yet this new level of anonymity was also not sufficient, owing to continuously increasing adversary knowledge. As a result, a third major anonymization method was offered, again based on k-anonymity: δ-presence [36]. In contrast to ℓ-diversity, δ-presence shows that for some sensitive information it may not even be possible to achieve ℓ-diversity. If the sensitive attribute has only two unique values, v_i.sen ∈ {0, 1}, i.e. it is either true or false for each record, then it may be impossible to reach ℓ-diversity in a k-anonymous dataset. Consider n records, of which m satisfy v_i.sen = 1; hence (n − m) satisfy v_i.sen = 0, with m ≥ (n − m). According to these values, the average value for ℓ would be ℓ = n / (n − m), making the data n/(n − m)-diverse at best. In order to overcome this deficiency and to make the dataset resistant to publicly available sets of information, the δ-presence algorithm ensures that the probability of any record from the publicly available set being linked to the original data lies between (δ_min, δ_max). This algorithm relies on adversary knowledge; however, adversary knowledge may increase over time and must be updated regularly for the algorithm to produce the same level of anonymity every time.
As explained in this section, the early phases of privacy protection were based on tabular data publishing. The main threat to privacy was different tabular datasets published with different anonymizations, allowing adversaries to link corresponding tables in order to re-identify the original data.
2.3.2 Complex Data Publishing
Differentiating from tabular data, complex data is a set of records where multiple records combine to create a new record set. Complex datasets can be represented in multiple tabular forms or in other forms, such as graphs.
The complexity comes from what information is stored within the dataset and from the relations among records, as in relational databases [35]. Let us assume we have a table similar to Table 1 and we would also like to store the spouse relationship. Using the data from the main table, we create a second table, e.g. MarriedTo (Table 7).
Table 7: Tabular representation of spouse relationship

Spouse1       Spouse2
John Doe      Jane Doe
Mark Johnson  Tara Johnson
Brad Smith    Sarah Smith
Compared to a single table, complex datasets contain more information per record through the multiple relation sets between tables. This information and its meaning are being researched intensively under the topic of data mining [25, 21]. Using data mining techniques, the information within the tables is interpreted into meaningful conclusions. For example, each transaction in a supermarket could reveal correlated information, such as "people who buy diapers also buy milk". Such examples can be found in most everyday data.
Complex data is also studied for privacy, because most datasets can serve adversaries as background knowledge for data mining applications. One way to anonymize the data is to use the k-anonymity technique on multiple related tables [37]. However, datasets differ, and k-anonymity alone may not be enough to publish the data securely. Datasets like spatio-temporal data [20] are especially important. They store coordinates and timestamps for a person, which can be collected through GPS-enabled devices such as GPS navigators, smartphones, GSM carriers and digital cameras. This information can be expressed in a multi-relational table, where one table holds the information on the individual (Table 1) and the other holds the paths as coordinates (Table 8). In Table 8 the paths are stored in a comma-separated format, where each element is a coordinate (x_i, y_i, t_i), with x_i the horizontal and y_i the vertical displacement and t_i the timestamp at which the person was located at those coordinates.
Table 8: Path coordinates in spatio-temporal data

ID  Path
1   {(x_1, y_1, t_1), (x_2, y_2, t_2), (x_3, y_3, t_3)}
2   {(x_12, y_12, t_12), (x_13, y_13, t_13), (x_14, y_14, t_14), (x_15, y_15, t_15)}
As mentioned above, there are adversaries for this information, too. It has been shown that, even when anonymized, the paths of individuals can be retrieved [30, 32, 31]. This information can be used against a person to leak private information: "Person X goes to the hospital every week" translates into "possible chronic disease carrier" for an insurance company.
The difficulty of anonymizing such data comes from the information it carries. Each coordinate must be handled individually in order to suppress key information, so that an adversary cannot correctly recreate the original path. The path thus becomes a function in the x,y-coordinate system, and there may be many ways to recreate it. One way to visualize paths is to graph them onto a map, but graphs are mostly used to display networks rather than directional single-line paths.
2.3.3 SN Data Publishing
Once SNs became popular among internet users, the number of accounts increased, and SNs came to hold more data than any other kind of dataset. Adversaries therefore created crawlers [38] to harvest the data on SNs; they were not just storing it in their previously created tabular data, but also in graph form, in order to analyze the SN. The main reason is that each user account is connected to other accounts, which makes inference between these connected users possible. This follows from the graph structure of SNs: an SN is not just a collection of profiles, row by row, but an interconnected network where people from the same interest groups, schools, locations, etc. relate to each other through friendship links. Moreover, when an account is first opened in an SN, the privacy settings are set to public by default. During the time it takes the user to figure out how to set his/her privacy preferences, most of his/her sensitive data, e.g. age, sex, location, education, religious and political views, can be retrieved by adversaries.
Adversaries vary by type and by the information they seek, but the information they retrieve gets aggregated and distributed through the internet. Inexperienced adversaries search SNs by jumping through links, while more advanced ones use crawlers to harvest the data. Crawlers are web scripts which do the same job as the inexperienced adversaries in an automated fashion.
In recent years, many publishing methods for SN data have been discussed [26, 41, 42]. The main goal of these anonymization-based publishing techniques is to create a version of the original graph that mimics the relationships and label distribution of its source. Let SN be the original network and SN* be an anonymized version of it, i.e. SN* is derived from SN. The desired SN* is such that the probability of identifying any node of SN given SN* is smaller than a threshold value ε (Equation 2.2). At the same time, the anonymized network SN* must return the same probabilistic results as the original network SN for each query, up to a very small noise factor δ (Equation 2.4). SN* will only return results for countable queries, e.g. "the probability of a node v_i having label ℓ_j = 1" (Equation 2.3), due to the individual privacy requirements it has to meet.

P(v_i | SN) < ε,  for all v_i in SN*   (2.2)

q ≡ [v_i.ℓ_j = 1]   (2.3)

Q(q, SN) = Q(q, SN*) ± δ   (2.4)
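Condition (2.4) can be sketched as a utility check for one concrete countable query, the fraction of nodes carrying a label. This is an illustrative sketch; the query and function names are assumptions, and counting a suppressed `*` as "no label" is a simplification.

```python
# Sketch of checking condition (2.4): a counting query q (here "the
# fraction of nodes with label value 1") must give nearly the same
# answer on the original network and its anonymized version, up to a
# noise factor delta. A suppressed '*' is counted as not matching.

def query_label_fraction(labels, label_value=1):
    """Fraction of nodes whose label equals label_value."""
    return sum(1 for v in labels.values() if v == label_value) / len(labels)

def within_noise(original_labels, anon_labels, delta):
    q_orig = query_label_fraction(original_labels)
    q_anon = query_label_fraction(anon_labels)
    return abs(q_orig - q_anon) <= delta

orig = {1: 1, 2: 0, 3: 1, 4: 0}          # original labels
anon = {1: 1, 2: "*", 3: 1, 4: 0}        # one label suppressed
```

Here suppressing a 0-valued label leaves the query answer unchanged, so the pair passes for any δ ≥ 0; suppressing a 1-valued label would shift the answer and could violate the bound.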
When publishing graph data, the key to anonymization does not lie in suppressing the labels of vertices; instead it considers the position of each vertex, which is expressed through its neighbours [42]. Neighbours are the vertices linked to a node v_i through its edges. For instance, in Figure 1, node v_3 has the neighbours {v_1, v_2, v_6} via the edges {e_13, e_23, e_36}. Backstrom et al. [16] explain passive and active attacks that use this neighbouring property of SNs. Adversaries may actively use the SN and create a small subnetwork of their own, which they then match against the graph that is going to be published. If they can find their group within the anonymized version of the graph, they can extend from that point to label anonymized nodes.
Hay et al. [26] use perturbation on links in order to obscure the neighbouring relationships, making the graph more securely anonymized. They concentrate on anonymizing the edges, removing or adding edges in order to create similar vertices according to an algorithm. Similarly, Zhou and Pei [42] create a k-anonymous graph based on the neighbours of vertices, i.e. producing k identical nodes for any query. Differing from these two works, Wei et al. [41] anonymize both the labels and the neighbours of the nodes. They propose three algorithms to achieve this anonymization: first they create subgraphs in which each node has the same label; then they add or remove edges based on subgraph averages, such that at the end each node has exactly the same degree as its neighbours; finally, they anonymize the connectivity among subgraphs, again based on k-anonymity. Unlike these anonymization methods, we concentrate on nodes that do not carry any label at all. Moreover, anonymized published graph data is not what the adversaries in our proposed setting deal with, because we assume that adversaries collect their data with crawlers; using crawler data with our algorithm we infer on the graph, so no third-party anonymized published data is used.
It was He et al. [27] who brought the idea of inference into SNs. In their research they assume that people in an SN tend to build relationships with others such as classmates, co-workers, or fellow townsmen. Using these relationships and a Bayesian network [28, 29, 24] representation, they infer one label of each node. They assume that the adversary is aware of one's relationships with others, i.e. the adversary knows all the neighbours, and the social groups of the neighbours, of a given v_i ∈ V. Using this information, they show that for a given node v_i they can infer the labels ℓ_j by analyzing the correct group of its neighbours. Thus, for each node with v_i.ℓ_j = *, they select the neighbours with the correct relevance to this label and come to a decision based on this deductive algorithm. In contrast to He et al. [27], our inference algorithm does not concentrate on specific groups to infer the labels of nodes, since we assume the adversary does not know which node belongs to which social group.
Recently, Lindamood et al. [33] showed that inference attacks are a major threat to SNs. They proposed a classification algorithm together with a way to prevent inference attacks: by removing edges between nodes and suppressing labels on nodes, they drive the classifier to a point where it cannot make a confident decision from the information that remains.
The authors [33] collect the data through a SN crawler, described in [38, 14]. They perform three different tests in order to reach the most attack-tolerant version of suppression. When removing 10 links per node and 10 labels from each node, the classifier's decision ends up between 0.52 and 0.48, which makes it impossible to infer the class of any node. Our work differs from theirs from the beginning, because they suppress data that is collected through crawlers. What we want to show is that after the adversary collects the data using a crawler, it is still possible to infer the labels of the remaining unidentified nodes.
3 Motivation and Contribution of the Thesis
This section explains why we selected this subject and summarizes the contributions of the thesis.
3.1 Motivation
As SNs became popular, many different ways of collecting their data became possible. Despite all the preference-based privacy options, social network systems are still weak at protecting personal sensitive data. Although several algorithms for suppressing networks have been proposed, they assume that these networks consist only of edges and nodes, and they ignore adversaries that try to recover key sensitive information about a node.
As described in Section 2.3, the relationships among users can reveal key information about the connected parties. The denser the relationship between users, the more precise the information that can be retrieved about them, regardless of the number of captured profiles. Even if a user suppresses all of his/her sensitive information from third parties, i.e., no one besides his/her friends can see any information on the profile, his/her friends' information may still reveal sensitive facts such as age, location, or political view.
Inference using information on neighbouring nodes does not require much adversary knowledge; the publicly available data is more than adequate to mount such an inference attack. The idea behind the inference rests on the fact that people cannot control what others share, or how that information relates to their own private data. One may decide not to share a fact about his/her personal life, but other users can still share it publicly in many different ways: some may reveal a personal fact about themselves that, unknowingly or willingly, also describes related users. This kind of privacy breach is called neighbour sharing, because a neighbouring node informs the adversary about a user's sensitive information.
This breach in privacy is exploited by active adversaries, who not only rely on data gathered by crawlers but also personally inspect accounts for such information. Sometimes users in social networks act on emotional factors and share key information about their neighbours without thinking of the consequences. Most of the time, however, neighbour sharing does not stem from emotional factors; it follows from the basic information users share. People tend to form friendships and relations based on common factors such as school, work, political view, religious view, or sexual orientation. Using such related neighbours, the adversary can guess information that some users have suppressed.
Our primary objective is to use this structure of relationships and show that it is possible to infer suppressed information without using any other data sets or related information.
3.2 Contribution of the Thesis
We try to show that, in SNs, withholding information is not a definitive solution for privacy protection. Among the proposed network suppression algorithms, the main issue is to anonymize the network in such a way that it still makes sense and carries information similar to the original network N. We will show that if the suppressed data remains correlated with its original state, as defined by Equation 2.4, then simple adversary knowledge may be enough to recover key information about each person individually.
As mentioned in Section 3.1, neighbouring users have a high probability of sharing common factors or the same information in one or more attributes. In large social networks, it would be very difficult to inspect each account for neighbour sharing; hence, exploiting the similarities among users is a much more efficient way to realize our proposal.
In the early phases, we concentrated on special interest groups in SNs, which indicate that a person belongs to an idea or ideological group. We selected this as the primary adversary behaviour for classifying people even when their profiles were closed to outside viewing. However, it produced a low level of connectivity among users: through the special interest groups we could access many people's key information, but the relations among them, and their mutual relations, were too sparse to use in an inference attack. Therefore, we shifted our objective to the network itself rather than the information-providing entity, in our case the special interest groups. Using the network itself increases the number of mutual neighbours among the vertices, which allows better results within each subgraph. In other words, the more connected the network is, the better it can be inferred.
Our adversary knowledge is based only on the measurable tendency of users to connect to similar users. Considering each label separately, an adversary can easily derive the ratio of connections, i.e., edges, among users that do or do not have the label, as in Equations 3.1 and 3.2.
P(v_j.ℓ = 1 | v_i.ℓ = 1)    (3.1)

P(v_j.ℓ = 1 | v_i.ℓ = 0)    (3.2)
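As a minimal sketch of how an adversary might estimate these correlations from crawled data (the graph representation and the function name are our own illustration, not from the thesis):

```python
def link_label_correlations(labels, edges):
    """Estimate P(v_j.l = 1 | v_i.l = 1) and P(v_j.l = 1 | v_i.l = 0)
    from an observed (crawled) graph with a single binary label.
    `labels` maps node -> 0/1; `edges` is a list of (v_i, v_j) pairs."""
    hits = {0: 0, 1: 0}   # edges where the other endpoint has label 1
    total = {0: 0, 1: 0}  # edges incident to a node with label b
    for u, v in edges:
        for a, b in ((u, v), (v, u)):  # view each edge from both endpoints
            total[labels[a]] += 1
            hits[labels[a]] += labels[b]
    p1_given_1 = hits[1] / total[1] if total[1] else 0.0  # Eq. 3.1
    p1_given_0 = hits[0] / total[0] if total[0] else 0.0  # Eq. 3.2
    return p1_given_1, p1_given_0
```

The estimates converge to the true conditional probabilities as the crawled subgraph grows.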
Although we started our research in the direction of a binomial distribution [1] attack, it evolved into a multinomial inference attack due to the change in information source.
We developed a probabilistic inference attack that can recover a highly suppressed SN. We assume that an average adversary is capable of creating a crawler for a SN. Although many accounts are closed to public viewing, the adversary may use his/her own account for the crawler to retrieve better results. Several studies on SNs [33, 27] also built crawlers, each returning more than 50000 accounts; hence the assumed adversary knowledge is fair in our case. We probabilistically re-identify the label value of each node individually by looking at its neighbours whose labels are not suppressed.
Each node v_i may or may not have neighbours with the suppressed label ℓ_j; in both situations we can conclude the value of v_i.ℓ_j. The key point here is that the adversary knows the label distribution within the SN or sub-SN, i.e., the adversary knows the ratio of each label value present in the graph (Equation 3.3). Knowing this value, the adversary can attack the suppressed graph SN* and re-identify it even if edges have been removed or perturbed.
R_ℓ = |v_i.ℓ = 0| / |v_i.ℓ = 1|    (3.3)

R_{ℓ=0} = |v_i.ℓ = 0| / n    (3.4)

R_{ℓ=1} = |v_i.ℓ = 1| / n    (3.5)
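The label-distribution ratios of Equations 3.3-3.5 are straightforward to compute from the unsuppressed part of a crawl; a minimal sketch (the dictionary representation is our assumption):

```python
def label_distribution(labels):
    """Compute the ratios of Equations 3.3-3.5 for one binary label.
    `labels` maps each (unsuppressed) node to its label value 0 or 1."""
    n = len(labels)
    n0 = sum(1 for value in labels.values() if value == 0)
    n1 = n - n0
    r = n0 / n1 if n1 else float("inf")  # Eq. 3.3: R_l = |l=0| / |l=1|
    r0 = n0 / n                          # Eq. 3.4: fraction with label 0
    r1 = n1 / n                          # Eq. 3.5: fraction with label 1
    return r, r0, r1
```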
4 Proposed Probabilistic Inference Attack (PIA)
This section will explain our contribution in detail. First, we will define our assumptions. Then we will describe the evolution of our methods. Finally, we will explain our algorithm in detail.
4.1 Methodology
Our aim is, given K_1 and K_2, to find for any suppressed node the probability of having label 1. However, a direct formulation would cause a recursive call cycle as the number of suppressed nodes increases: if a suppressed node has a suppressed neighbour, which in turn has suppressed neighbours, and so on, we would recursively reach every node through neighbour relations and keep going until the last node, or worse, never terminate if there is a cycle. Hence we changed our model to a heuristic one: instead of considering all suppressed nodes at once, we look at them one by one and update the probabilities accordingly. The heuristic model, which will be explained shortly, relies on single comparisons and is faster. Our proposed algorithm therefore works iteratively using the distribution of the labels, with different phases for computing the inference rates.
We assume that we have obtained a part of a SN with n nodes that can be classified into m labels.
Table 9: Defined Symbols used in the algorithm

Value                                                  Symbol
Number of nodes                                        n
Number of labels                                       m
Label set                                              ℓ
Unique node                                            v_i
List of nodes                                          L[n]
Number of connections of node                          N_i
Connected nodes of node v_i                            F_ij
Unique connection of a node                            NF_k
Unique label                                           ℓ_j
Ratio of connections of a node having same label value RN_j
Label of node                                          v_i.ℓ
Inference ratio of a node                              IRN_i
Anonymization rate                                     A
4.1.1 Anonymization Process
This anonymization process was developed for testing the attack algorithm, which will be explained in Section 4.2. Anonymization can be performed while creating the network graph, or when receiving it as input, by randomly selecting some nodes. Algorithm 6 showed how a single-label classification can be generated; with one addition to the same algorithm, we can create an anonymized version of the network that keeps the same number of friends per node and the same connections.
A user-supplied input determines the rate of anonymization. If a random value is smaller than this rate, the node is anonymized and stored in a separate list, as detailed in Algorithm 1.
Algorithm 1 Anonymization Process in Network Generation
(1) a ← 0
(2) while a < n do
(3)   b ← readLineFromFile(filePath)
(4)   numberOfFriends ← random(n/100)
(5)   label ← random(0, 1)
(6)   if label < P(ℓ) then
(7)     nodeLabel ← 0
(8)   else
(9)     nodeLabel ← 1
(10)  end if
(11)  anonymization ← random(0, 1)
(12)  if anonymization < A then
(13)    L'[a] ← node(b, *, numberOfFriends)
(14)  else
(15)    L'[a] ← node(b, nodeLabel, numberOfFriends)
(16)  end if
(17)  L[a] ← node(b, nodeLabel, numberOfFriends)
(18)  a ← a + 1
(19) end while
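Under the assumption that nodes are simple (name, label, friend-count) tuples, Algorithm 1 can be rendered in Python roughly as follows (the function name and data layout are ours; `names` stands in for the file read on line (3)):

```python
import random

def generate_anonymized_network(names, a_rate, p_label, n):
    """Sketch of Algorithm 1: build the original node list L and its
    anonymized copy L_anon, where '*' marks a suppressed label.
    `a_rate` is the anonymization rate A and `p_label` is P(l)."""
    L, L_anon = [], []
    for a in range(n):
        name = names[a]
        number_of_friends = random.randrange(max(1, n // 100))
        node_label = 0 if random.random() < p_label else 1
        if random.random() < a_rate:
            L_anon.append((name, "*", number_of_friends))  # suppressed copy
        else:
            L_anon.append((name, node_label, number_of_friends))
        L.append((name, node_label, number_of_friends))    # ground truth
    return L, L_anon
```

Keeping the untouched list L alongside the anonymized list L_anon is what later allows the attack's predictions to be scored against the ground truth.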
When real data is used instead of generated synthetic data, this process is performed after the nodes are created, because the input methods differ: in the real-data version, each node has a separate file for its connections, and the nodes of each label outcome are separated into different files, e.g., class1.txt, class2.txt, etc. In addition, since these parts of the algorithm only generate test cases, they will be excluded when a real anonymized dataset is used. As mentioned in Section 2, the suppressed data will mimic the original data, because even when anonymized this set of records must make sense without breaching the privacy of individuals.
4.2 Algorithm
In this section we describe the PIA algorithm in depth. The PIA algorithm, shown in Algorithm 4, is designed for a single-label classification, e.g.
{age<30, age>=30}, {left-wing, right-wing}, etc. It runs on the anonymized network data, searching for anonymous nodes and calculating each one's probability of belonging to a class.
Our algorithm consists of three parts. First, it finds the unlabeled nodes in the network. Then, for each unlabeled node, it checks how many unlabeled/suppressed friends the node has; depending on the outcome, it chooses one of two probability functions and computes the probability of the node belonging to a class. Finally, after a probability has been computed for each unlabeled node, it compares these probabilities with the threshold values and decides the node's label.
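The three parts above can be sketched as a driver loop; `prob_pure` and `prob_mixed` are placeholders for the two probability functions of part 2, and the dictionary-of-lists graph layout is our assumption:

```python
def pia(graph, labels, prob_pure, prob_mixed, threshold):
    """Driver for the three parts of the PIA algorithm (hypothetical
    helper names). `labels[v]` is 0, 1, or '*'; `graph[v]` lists v's
    neighbours; `prob_pure`/`prob_mixed` compute P(v.l = 1) depending
    on whether v has suppressed neighbours."""
    # Part 1: collect the suppressed nodes (cf. Algorithm 2).
    suppressed = [v for v in graph if labels[v] == "*"]
    decisions = {}
    for v in suppressed:
        # Part 2: pick the probability function by neighbourhood content.
        has_suppressed_friend = any(labels[u] == "*" for u in graph[v])
        p = prob_mixed(v) if has_suppressed_friend else prob_pure(v)
        # Part 3: compare with the threshold and decide the label.
        decisions[v] = 1 if p >= threshold else 0
    return decisions
```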
The first part of the algorithm, shown in Algorithm 2, works iteratively: it goes over the list of nodes and collects the ones whose labels were suppressed, as produced by Algorithm 1.
Algorithm 2 Get Suppressed Nodes
(1) a ← 0
(2) while a < n do
(3)   if L[a].ℓ = * then
(4)     L'.push(L[a])
(5)   end if
(6)   a ← a + 1
(7) end while
(8) return L'

Algorithm 3 Total number of suppressed nodes in the graph
(1) i ← 0
(2) b ← 0
(3) while i < n do
(4)   a ← 0
(5)   while a < N_i do
(6)     if F_ia.ℓ = * then
(7)       b ← b + 1
(8)     end if
(9)     a ← a + 1
(10)  end while
(11)  i ← i + 1
(12) end while
(13) return b
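Assuming nodes are stored as dictionaries with a `label` field and `F[i]` lists the neighbour indices of node i (our data layout, not the thesis's), Algorithms 2 and 3 translate roughly to:

```python
def get_suppressed_nodes(L):
    """Algorithm 2: return the nodes whose label is suppressed ('*')."""
    return [node for node in L if node["label"] == "*"]

def count_suppressed_neighbours(L, F):
    """Algorithm 3: total number of suppressed entries over every node's
    neighbour list, resetting the inner counter for each node."""
    b = 0
    for i in range(len(L)):
        for j in F[i]:
            if L[j]["label"] == "*":
                b += 1
    return b
```

Note that a suppressed node may be counted once per incident edge in Algorithm 3, since the count runs over neighbour lists rather than over distinct nodes.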
The second part of the algorithm applies one of two probability equations to the nodes returned from part 1, depending on the number of suppressed connections the selected node has. If the node in question is not connected to any other suppressed node, the inference is based purely on the label distribution of its connected peers. As Equation 4.1 describes, we check the label of each connected node to determine the connectivity of the selected node to each label value; again, this is the single-label classification version of the equation. Then, by comparing these values with each other and with the class occurrence rates, as in Equation 4.2, we can infer the label value of the node.
c_1 = N^0_{v_i},  c_2 = N^1_{v_i}    (4.1)

c_1/c_2 > (1 − K^1_{2,v_i}) / (1 − K^0_{2,v_i})  ⇒  v_i.ℓ = 0
c_1/c_2 < (1 − K^1_{2,v_i}) / (1 − K^0_{2,v_i})  ⇒  v_i.ℓ = 1    (4.2)