Quantifying interdependent risks in genomic privacy


MATHIAS HUMBERT, CISPA, Saarland University

ERMAN AYDAY, Bilkent University

JEAN-PIERRE HUBAUX, EPFL

AMALIO TELENTI, Human Longevity Inc.

The rapid progress in human-genome sequencing is leading to a high availability of genomic data. These data are notoriously very sensitive and stable in time, and highly correlated among relatives. In this article, we study the implications of these familial correlations on kin genomic privacy. We formalize the problem and detail efficient reconstruction attacks based on graphical models and belief propagation. With our approach, an attacker can infer the genomes of the relatives of an individual whose genome or phenotype is observed by notably relying on Mendel’s Laws, statistical relationships between the genomic variants, and between the genome and the phenotype. We evaluate the effect of these dependencies on privacy with respect to the amount of observed variants and the relatives sharing them. We also study how the algorithmic performance evolves when we take these various relationships into account. Furthermore, to quantify the level of genomic privacy as a result of the proposed inference attack, we discuss possible definitions of genomic privacy metrics, and compare their values and evolution. Genomic data reveals Mendelian disorders and the likelihood of developing severe diseases, such as Alzheimer’s. We also introduce the quantification of health privacy, specifically, the measure of how well the predisposition to a disease is concealed from an attacker. We evaluate our approach on actual genomic data from a pedigree and show the extent of the threat by combining data gathered from a genome-sharing website as well as an online social network.

CCS Concepts: • Security and privacy → Pseudonymity, anonymity and untraceability; Privacy protections; • Applied computing → Genomics;

Additional Key Words and Phrases: Genomic privacy, inference, metrics, kinship

ACM Reference Format:

Mathias Humbert, Erman Ayday, Jean-Pierre Hubaux, and Amalio Telenti. 2017. Quantifying interdependent risks in genomic privacy. ACM Trans. Priv. Secur. 20, 1, Article 3 (February 2017), 31 pages.

DOI: http://dx.doi.org/10.1145/3035538

1. INTRODUCTION

Thanks to the plummeting costs of molecular profiling, biomedical researchers have access to an increasing amount of genomic data, a key enabler toward a more personalized, precise, and predictive medicine. In addition to research purposes, genomic data is being used by individuals to learn about their (genetic) predispositions to diseases or their ancestries. This biomedical data revolution has spawned the emergence of health-related websites and online social networks (OSNs), in which individuals share their genomic data. Thus, currently, tens of thousands of genomes are available online.

A major issue stemming from this increasing availability of genomic data is privacy. First, it has been shown that, even if genomic data is anonymized, it is possible to reidentify its owners by various means [Gymrek et al. 2013; Humbert et al. 2015b; Sweeney et al. 2013]. Second, there is an increasing number of individuals who share their genomes online, sometimes with their real identifiers (e.g., on OpenSNP.org [Greshake et al. 2014]). Access to such sensitive data can lead to discrimination in access to insurance and employment [Ayday et al. 2015].

Erman Ayday is supported by funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 707135 and by the Scientific and Technological Research Council of Turkey, TUBITAK, under Grant No. 115C130.

Authors’ addresses: M. Humbert, CISPA, Saarland University, Computer Science Department, Campus E9 1, Room 3.18, Saarbrücken, 66123, Germany; email: humbert@cs.uni-saarland.de; E. Ayday, Department of Computer Engineering, Room EA529, Bilkent University Engineering Building, Bilkent, Ankara, 06800, Turkey; email: erman@cs.bilkent.edu.tr; J.-P. Hubaux, EPFL, BC 207, Station 14, Lausanne, 1015, Switzerland; email: jean-pierre.hubaux@epfl.ch; A. Telenti, Human Longevity, Inc., 4570 Executive Rd., San Diego, CA 92121, USA; email: atelenti@humanlongevity.com.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2017 ACM 2471-2566/2017/02-ART3 $15.00

These concerns are exacerbated by the fact that genomic data of family members is highly correlated, leading to interdependent privacy risks. These risks have been publicized by the story of the Lacks family.¹ However, given the trend toward genomic-data sharing, the Lacks family is far from the only family whose privacy is threatened by these interdependent risks. We have shown the extent of this threat by using an OSN as a side channel to gather familial information [Humbert et al. 2013].

In this work, we quantify the interdependent risks stemming from familial correlations in genomic privacy. Focusing on the most common variant in the human population, single nucleotide polymorphism (SNP), and considering the intragenome statistical correlations (referred to as linkage disequilibrium), we quantify the loss in genomic privacy of individuals when one or more of their family members’ genomes are (either partially or fully) revealed. To achieve this goal, we design efficient inference algorithms that mimic the adversarial reconstruction attack. We present a Bayesian network model that takes into account the statistical relationships between the relatives’ genomes, and between the genome and the phenotype. We further extend this model to a factor graph representation in order to include intragenome correlations. In order to infer the values of the unknown SNPs in linear complexity, we make use of the belief propagation algorithm, run either on a junction tree (which is a transformation of the Bayesian network that removes its loops), or on the factor graph. In the latter case, as the factor graph contains loops, the algorithm is carried out multiple times until the probability distributions converge to a stable state. Then, using various metrics, we quantify the genomic privacy of individuals and show the decrease in their level of genomic privacy caused by the published genomes of their family members. We also quantify the health privacy of the individuals regarding their (genetic) predisposition to certain serious diseases given current medical knowledge. We evaluate the proposed inference attacks and show their efficiency and accuracy by using real genomic data of a pedigree. More importantly, by using genomic and phenotypic data and pedigree information collected from a genome-sharing website and an OSN, we show that inference attacks do not threaten just the Lacks family.

This article is a revised and extended version of our paper [Humbert et al. 2013], and contains the following additional contributions:

—We present a new framework for the inference attack that considers only the genomic correlations between familial members. We show that this new framework enables performance of an exact inference in a single iteration of our belief propagation algorithm. We also include analytic and empirical evaluations of its computational complexity.

—We add a new layer to this new framework that enables taking additional information about relatives’ phenotypes into account to improve the inference attack.

¹ http://www.nytimes.com/2013/03/24/opinion/sunday/the-immortal-life-of-henrietta-lacks-the-sequel.html?pagewanted=all.


Fig. 1. Reproduction and SNP. Each parent produces gametes that are derived from one’s genome. The offspring’s genome is the combination of these two gametes. As an example, the SNP circled on the offspring’s genome is homozygous-minor for the offspring but heterozygous for the parents.

—We update the results of the inference attack by conducting several new experiments.

—We thoroughly evaluate the relation between various metrics, and draw conclusions about the most appropriate metric in different settings.

—We carry out new experiments by making use of phenotypic information disclosed by OpenSNP users in combination with their genomic data.

—We include a performance evaluation, and a discussion about the potential improvements of the proposed inference attacks.

2. BACKGROUND

In this section, we briefly introduce the relevant genetic principles, as well as some important tools for modeling data dependencies and running inference efficiently.

2.1. Genomics 101

DNA is a double-helix structure that consists of two complementary polymer chains. Genetic information is encoded on the DNA as a sequence of nucleotides (A, T, G, C); human DNA includes around 3 billion nucleotide pairs. With the decreasing cost of DNA sequencing, genomic data is currently being used mainly in the following two areas: (i) clinical diagnostics, for personalized genomic medicine and genetic research (e.g., genomewide association studies), and (ii) direct-to-consumer genomics, for genetic risk estimation of various diseases or for recreational activities such as ancestry search. In the following, we briefly introduce some concepts about the human genome and reproduction that we use throughout this article.

2.1.1. Single Nucleotide Polymorphism. Human beings have 99.9% of their DNA in common. Thus, there is no need to focus on the whole DNA structure, but rather on the variants. SNP is the most common DNA variation in the human population. An SNP occurs when a nucleotide (at a specific position on the DNA) varies between individuals of a given population (as illustrated in Figure 1). There are approximately 50 million SNP positions in the human population.² Recent discoveries show that the susceptibility of an individual to several diseases can be computed from the individual’s SNPs [Johnson and O’Donnell 2009]. For example, it has been reported that two particular SNPs (rs7412 and rs429358) on the Apolipoprotein E (ApoE) gene indicate an (increased) risk for Alzheimer’s disease. SNPs carry privacy-sensitive information about individuals’ health; thus, we will quantify health privacy focusing on individuals’ published (or inferred) SNPs and the diseases that they reveal.

² http://www.ncbi.nlm.nih.gov/projects/SNP/.

Table I. Mendelian Inheritance Probabilities $F_R(X^i_M, X^i_F, X^i_C)$ for an SNP $g_i$, Given the Genotypes of the Parents

                      Father
    Mother            $X^i_F = 0$       $X^i_F = 1$           $X^i_F = 2$
    $X^i_M = 0$       (1, 0, 0)         (0.5, 0.5, 0)         (0, 1, 0)
    $X^i_M = 1$       (0.5, 0.5, 0)     (0.25, 0.5, 0.25)     (0, 0.5, 0.5)
    $X^i_M = 2$       (0, 1, 0)         (0, 0.5, 0.5)         (0, 0, 1)

Note: The probabilities of the child’s genotype are represented in parentheses. Each table entry represents $(P(X^i_C = 0 \mid X^i_M, X^i_F),\ P(X^i_C = 1 \mid X^i_M, X^i_F),\ P(X^i_C = 2 \mid X^i_M, X^i_F))$.

Two different nucleotides (called alleles) can usually be observed at a given SNP position: (i) the major allele is the most frequently observed nucleotide, and (ii) the minor allele is the rare nucleotide.³ For each SNP position, we represent the major allele as B and the minor allele as b (where both B and b are in {A, T, G, C}).

Furthermore, each SNP position contains two nucleotides (one inherited from the mother and one from the father, as we will discuss next). Thus, the content of an SNP position can be in one of the following states: (i) BB (homozygous-major genotype), if an individual receives the same major allele from both parents; (ii) Bb (heterozygous genotype), if an individual receives a different allele from each parent (one minor and one major); or (iii) bb (homozygous-minor genotype), if an individual inherits the same minor allele from both parents. For simplicity of presentation, in the rest of the article, we encode BB with 0, Bb with 1, and bb with 2. Finally, each SNP $g_i$ is assigned a minor allele frequency (MAF), $p^i_{maf}$, which represents the frequency at which the minor allele b of the corresponding SNP occurs in a given population (typically, $0 < p^i_{maf} < 0.5$).

2.1.2. Reproduction. Mendel’s First Law states that alleles are passed independently from parents to children for different meioses (the process of cell division necessary for reproduction). For each SNP position, a child inherits one allele from the mother and one from the father, as shown in Figure 1. Each allele of a parent is passed on to a child with equal probability of 0.5. Let $F_R(X^i_M, X^i_F, X^i_C)$ be the function modeling the Mendelian inheritance for an SNP $g_i$, where M, F, and C represent mother, father, and child, respectively. We illustrate the Mendelian inheritance probabilities in Table I.

Based on $F_R(X^i_M, X^i_F, X^i_C)$, we can say that, given both parents’ genomes, a child’s genome is conditionally independent of all other ancestors’ genomes.
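The Mendelian inheritance probabilities of Table I follow directly from the rule that each parent transmits one of its two alleles with probability 0.5. The following is a minimal Python sketch of that rule, not code from the article; the function name `mendel` and the use of exact fractions are our own illustrative choices.

```python
from fractions import Fraction

def mendel(mother, father):
    """Return (P(child=0), P(child=1), P(child=2)) given the parents' genotypes.

    Genotypes are encoded as the number of minor alleles (0 = BB, 1 = Bb, 2 = bb).
    """
    # Each parent transmits one of its two alleles with probability 0.5,
    # so the chance of passing on the minor allele b is genotype / 2.
    pm = Fraction(mother, 2)
    pf = Fraction(father, 2)
    return ((1 - pm) * (1 - pf),            # child receives B from both parents
            pm * (1 - pf) + (1 - pm) * pf,  # child receives exactly one b
            pm * pf)                        # child receives b from both parents

# Rebuild Table I: rows are the mother's genotype, columns the father's.
table = {(m, f): mendel(m, f) for m in range(3) for f in range(3)}
for m in range(3):
    print([tuple(float(p) for p in table[(m, f)]) for f in range(3)])
```

Because the child's distribution depends only on the two parental genotypes, the same function can serve as the local conditional probability of every non-founder in a pedigree.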

2.1.3. Linkage Disequilibrium. As we discussed before, DNA sequences are highly correlated between close relatives, but there also exist correlations between different SNPs in the DNA. Linkage disequilibrium (LD) [Falconer and Mackay 1996] defines a correlation that appears between any pair of SNPs in the whole genome due to the population’s genetic history. Because of LD, the content of an SNP can be inferred from the contents of other SNPs.

³ The two alleles for the SNP position highlighted in Figure 1 are G and A.

For example, assume that $g_i$ and $g_j$ are in LD with each other. Let $(A_1, A_2)$ and $(B_1, B_2)$ be the potential alleles for SNP $g_i$ and $g_j$, respectively. Further, let $(p_1, p_2)$ and $(q_1, q_2)$ be the allele probabilities of $(A_1, A_2)$ and $(B_1, B_2)$, respectively, provided by population statistics. That is, the probability that an individual in a given population will have allele $A_1$ at SNP $g_i$ is $p_1$, and so on. If there were no LD (i.e., if $g_i$ and $g_j$ were independent), the probability that an individual would have both $A_1$ and $B_1$ at $g_i$ and $g_j$ would be $p_1 q_1$. However, due to correlations between $g_i$ and $g_j$, this probability is in reality equal to $p_1 q_1 + D$, where $D$ represents the discrepancy between the probability computed under the independence assumption between the two SNPs and the probability in a given population. In Table II, we illustrate this LD relationship for all possible combinations of $(A_1, A_2)$ and $(B_1, B_2)$. We note that $D$ can be either negative or positive, depending on the LD values. Another relevant metric to capture LD is the correlation coefficient $r$, expressed as $r = D / \sqrt{p_1 p_2 q_1 q_2}$, where $r = 1$ represents the strongest LD relationship.

Table II. Linkage Disequilibrium (LD) between Two SNPs $g_i$ and $g_j$ with Potential Alleles $(A_1, A_2)$ and $(B_1, B_2)$, Respectively

                            $A_1$, $P(A_1) = p_1$           $A_2$, $P(A_2) = p_2$
    $B_1$, $P(B_1) = q_1$   $P(A_1 B_1) = p_1 q_1 + D$      $P(A_2 B_1) = p_2 q_1 - D$
    $B_2$, $P(B_2) = q_2$   $P(A_1 B_2) = p_1 q_2 - D$      $P(A_2 B_2) = p_2 q_2 + D$
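The LD relationship of Table II and the correlation coefficient $r$ can be checked numerically. The sketch below is illustrative only: the function names `ld_table` and `ld_r`, and the numeric values of $p_1$, $q_1$, and $D$, are hypothetical and not taken from the article.

```python
import math

def ld_table(p1, q1, D):
    """Joint allele probabilities of Table II for allele frequencies p1, q1 and offset D."""
    p2, q2 = 1.0 - p1, 1.0 - q1
    table = {
        ("A1", "B1"): p1 * q1 + D,
        ("A2", "B1"): p2 * q1 - D,
        ("A1", "B2"): p1 * q2 - D,
        ("A2", "B2"): p2 * q2 + D,
    }
    # D is only admissible if every entry remains a valid probability.
    if min(table.values()) < -1e-12:
        raise ValueError("D is too large in magnitude for these allele frequencies")
    return table

def ld_r(p1, q1, D):
    """Correlation coefficient r = D / sqrt(p1 * p2 * q1 * q2)."""
    return D / math.sqrt(p1 * (1.0 - p1) * q1 * (1.0 - q1))

# Hypothetical values chosen only for illustration.
p1, q1, D = 0.6, 0.7, 0.05
table = ld_table(p1, q1, D)
r = ld_r(p1, q1, D)
print(table, round(r, 3))
```

Note that adding and subtracting $D$ in this pattern leaves the marginal allele probabilities unchanged: the two entries for $A_1$ still sum to $p_1$, and the four entries still sum to 1.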

2.2. Probabilistic Inference

In this section, we introduce the mathematical models and algorithms that form the basis of efficient inference methods.

2.2.1. Probabilistic Graphical Models. Probabilistic graphical models are well suited to representing dependencies between random variables [Koller and Friedman 2009]. Such graph-based models can express conditional dependencies (e.g., Bayesian networks), joint dependencies (e.g., Markov random fields), or both (e.g., chain graphs). In graphical models, each node represents a random variable and arrows represent the dependencies between them. Such models are very useful to represent the factorization of the joint distribution of a large set of random variables, and thereby dramatically reduce the complexity of, for example, the computation of marginal probabilities. If the graphical model contains loops or cycles,⁴ it is possible to eliminate these by clustering variables into single nodes (called cliques) and building a maximum spanning tree (called junction or clique tree [Jensen and Jensen 1994]) of cliques. A more generic model that can represent both directed and undirected graphs is the factor graph. Contrary to the junction tree, it enables finding approximate solutions in situations in which exact inference is computationally intractable. A factor graph is a bipartite graph with one set of vertices representing the random variables and the other set representing the (local) functions that factor the (global) joint probability function (based on the dependencies between the variables). A variable node is connected to a factor node if and only if the variable is an argument of the local function corresponding to the factor node.

2.2.2. Belief Propagation. Belief propagation [Pearl 1988] is a message-passing algorithm for performing inference on graphical models. It is also known as the sum-product algorithm [Kschischang et al. 2001]. It is typically used to compute marginal distributions of unobserved variables conditioned on observed variables. Computing marginal distributions is hard in general, as it might require summing over an exponentially large number of terms. The belief-propagation algorithm applies to various types of graphical models, such as Bayesian networks or Markov random fields. If the underlying graphical model contains no (directed or undirected) cycle, the belief-propagation algorithm leads to exact inference, that is, exact posterior marginal probabilities given the observed variables. If the graphical model is not a tree or polytree (not cycle-free), we can either transform it into a junction tree and then run belief propagation on it and get the exact solution or perform loopy belief propagation, which yields an approximate solution [Murphy et al. 1999]. The second approach is typically used when the junction-tree approach is computationally intractable, and often gives good approximate results. Belief propagation is commonly used in artificial intelligence and information theory. It has demonstrated empirical success in numerous applications, including LDPC codes [Pishro-Nik and Fekri 2004], reputation management [Ayday and Fekri 2012a, 2012b], and recommender systems [Ayday et al. 2012].

⁴ There exists a cycle between $X_1$ and $X_k$ in a graph if $X_1 = X_k$ and, for every $i = 1, \ldots, k-1$, we have either a directed or undirected edge between $X_i$ and $X_{i+1}$ with, for at least one $i$, a directed edge. A loop is defined similarly except that it also allows for a reverse-directed edge between $X_i$ and $X_{i+1}$ (i.e., directed edge between $X_{i+1}$ and $X_i$). See Section 2.2 of Koller and Friedman [2009] for further details.

As factor graphs are the most generic representation of graphical models, we will explain the generic belief-propagation algorithm on them.⁵ We assume that the joint distribution $g(x_1, \ldots, x_n)$ factors into a product of several local functions, or factors, $f_a(x_a)$:

$$g(x_1, \ldots, x_n) = \prod_{a \in A} f_a(x_a), \tag{1}$$

where $A$ is a discrete index set (of factor nodes), and $x_a$ is a subset of $\{x_1, \ldots, x_n\}$ representing the set of variable nodes connected to factor node $a$. The belief-propagation algorithm simply works by passing messages between the $|A|$ factor nodes (representing the factors $f_1(x_1)$ to $f_{|A|}(x_{|A|})$) and the $n$ variable nodes (representing the random variables $x_1$ to $x_n$) on the bipartite factor graph. The message $m_{a \to i}(x_i)$ from the factor node $a$ to the variable node $i$ can be interpreted as a statement about the relative probabilities that $i$ is in its different states based on the function $f_a$. The message $n_{i \to a}(x_i)$ from the variable node $i$ to the factor node $a$ can be interpreted as a statement about the relative probabilities that node $i$ is in different states based on all the information node $i$ has except for that based on the function $f_a$. The messages are updated according to the following rules [Pearl 1988; Kschischang et al. 2001]:

$$n_{i \to a}(x_i) = \frac{1}{Z} \prod_{c \in N(i) \setminus a} m_{c \to i}(x_i) \tag{2}$$

and

$$m_{a \to i}(x_i) = \sum_{x_a \setminus x_i} f_a(x_a) \prod_{j \in N(a) \setminus i} n_{j \to a}(x_j). \tag{3}$$

Here, $N(i) \setminus a$ denotes all the nodes that are neighbors of node $i$ except for node $a$. Further, $\sum_{x_a \setminus x_i}$ denotes a sum over all the variables $x_a$ that are arguments of $f_a$, except $x_i$. $Z$ is a normalization factor that is needed so that the resulting messages represent probability mass functions. At the beginning, messages are initialized as follows: $n_{i \to a}(x_i) = 1$ and $m_{a \to i}(x_i) = f_a(x_i)$. Then, at the end of the algorithm, after convergence, the (estimated) marginal distribution of $x_i$ is given by the product of the messages received by the variable node:

$$P(x_i) = \frac{1}{Z} \prod_{c \in N(i)} m_{c \to i}(x_i), \tag{4}$$

where $Z$ is such that $\sum_{x_i} P(x_i) = 1$. Note that, if the underlying graphical model is a tree, convergence can be reached after computing each message only once (for every factor and variable node). Otherwise, there is no guarantee of convergence to the true marginal in the general case, but there exist sufficient conditions for convergence [Mooij and Kappen 2007]. Nor are there any fixed convergence or error rates in general. We describe how many iterations of message computation for every node are needed in our context in Sections 3.4 and 6.1. Finally, note that exact and approximate marginalization is NP-hard in general, but, in our genomic setting, it can be solved in linear time in the number of factor nodes (or variable nodes). We refer the reader to Section 3.4 for more details on the computational complexity in our setting.

⁵ Interested readers can check Kschischang et al. [2001] to see how it applies to other graphical models, such as Bayesian networks.

Fig. 2. Overview of the proposed framework to quantify kin genomic privacy. SNP $g_i$ of relative $r_j$ is represented by $x^i_j \in \{0, 1, 2\}$. The set of SNPs of individual $r_j$ is represented by vector $x_j$. Given its health and genomic privacy, the family should ideally decide whether to reveal less or more of their genomic data via the genomic-privacy preserving mechanism (GPPM).
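The message-passing rules in Equations (2) to (4) can be illustrated on a very small factor graph. The following Python sketch is not the implementation used in the article: it updates all messages synchronously, initializes every message to 1 for simplicity, and uses a hypothetical trio with a Hardy-Weinberg prior (our assumption) and an illustrative MAF of 0.2. All function and variable names are our own. Since this factor graph is a tree, the result can be compared against exhaustive marginalization.

```python
from itertools import product

def normalize(v):
    s = sum(v)
    return [x / s for x in v] if s else v

def sum_product(card, factors, iters=10):
    """card: {variable: number of states}; factors: list of (scope, f), f(*states) -> float."""
    nbrs = {v: [a for a, (scope, _) in enumerate(factors) if v in scope] for v in card}
    n = {(v, a): [1.0] * card[v] for v in card for a in nbrs[v]}  # variable -> factor
    m = {}                                                        # factor -> variable
    for _ in range(iters):
        # Eq. (3): sum the factor over all arguments except the target variable.
        for a, (scope, f) in enumerate(factors):
            for v in scope:
                out = [0.0] * card[v]
                for states in product(*(range(card[u]) for u in scope)):
                    w = f(*states)
                    for u, s in zip(scope, states):
                        if u != v:
                            w *= n[(u, a)][s]
                    out[states[scope.index(v)]] += w
                m[(a, v)] = normalize(out)
        # Eq. (2): product of the messages from all other neighboring factors.
        for v in card:
            for a in nbrs[v]:
                out = [1.0] * card[v]
                for c in nbrs[v]:
                    if c != a:
                        out = [x * y for x, y in zip(out, m[(c, v)])]
                n[(v, a)] = normalize(out)
    # Eq. (4): the belief is the normalized product of all incoming messages.
    beliefs = {}
    for v in card:
        b = [1.0] * card[v]
        for c in nbrs[v]:
            b = [x * y for x, y in zip(b, m[(c, v)])]
        beliefs[v] = normalize(b)
    return beliefs

def brute_force(card, factors):
    """Exact marginals by enumerating every joint assignment (for comparison only)."""
    names = list(card)
    marg = {v: [0.0] * card[v] for v in names}
    for states in product(*(range(card[v]) for v in names)):
        x = dict(zip(names, states))
        w = 1.0
        for scope, f in factors:
            w *= f(*(x[u] for u in scope))
        for v in names:
            marg[v][x[v]] += w
    return {v: normalize(marg[v]) for v in names}

def mendel(xm, xf, xc):
    """Table I entry P(child = xc | mother = xm, father = xf)."""
    pm, pf = xm / 2, xf / 2
    return [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf][xc]

maf = 0.2                                             # hypothetical MAF
prior = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]  # assumption: Hardy-Weinberg
card = {"M": 3, "F": 3, "C": 3}
factors = [
    (("M",), lambda x: prior[x]),
    (("F",), lambda x: prior[x]),
    (("M", "F", "C"), mendel),
    (("C",), lambda x: 1.0 if x == 2 else 0.0),       # observation: child is bb
]
bp = sum_product(card, factors)
exact = brute_force(card, factors)
print(bp["M"], exact["M"])
```

In this example the observed homozygous-minor child rules out a homozygous-major mother, so the mother's belief concentrates on genotypes 1 and 2. On loopy graphs the same routine would only give an approximation, and the number of iterations would have to be chosen by monitoring convergence.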

3. THE PROPOSED FRAMEWORK

In this section, we formalize our approach and present the different components that will allow us to quantify kin genomic privacy. Figure 2 gives an overview of the framework.

3.1. Notations and Definitions

The SNPs of all relatives are represented by the random variable $X$ that takes values in the set $\mathcal{X} = \{0, 1, 2\}^{n \times m}$, where $n$ is the number of relatives in the targeted family, $m$ is the number of SNPs in a single DNA sequence, and 0, 1, 2 encode the number of minor alleles at every considered SNP. Moreover, the hidden SNPs are represented by the random variable $X_H$ (that takes values in the set $\mathcal{X}_H$), and the SNPs observed by the adversary by the random variable $X_O$ (that takes values in the set $\mathcal{X}_O$). We define $R = \{r_1, r_2, \ldots, r_n\}$ to be the set of relatives in the targeted family (whose family tree, showing the familial connections between the relatives, is denoted as $T$) and $G = \{g_1, g_2, \ldots, g_m\}$ to be the set of SNPs (i.e., positions in the DNA sequence). Let $X^i_j$, respectively, $x^i_j \in \{0, 1, 2\}$, denote the random variable representing SNP $g_i$ of individual $r_j$, respectively, its value. Furthermore, we let $x_i = [x^1_i\ x^2_i\ \cdots\ x^m_i]$ represent the values of the SNPs of individual $r_i$ and let $x \in \mathcal{X}$ be the $n \times m$ matrix representing the values of the SNPs of all relatives:

$$x = \begin{bmatrix} x^1_1 & x^2_1 & \cdots & x^m_1 \\ x^1_2 & x^2_2 & \cdots & x^m_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_n & x^2_n & \cdots & x^m_n \end{bmatrix} \tag{5}$$

$F_R(X^i_M, X^i_F, X^i_C)$ is the function representing the Mendelian inheritance probabilities (in Table I), where M, F, and C represent mother, father, and child, respectively. The $m \times m$ matrix $l$ represents the pairwise LD values between the SNPs in $G$, which can be expressed by the correlation coefficient $r$; $l_{ij}$ refers to the matrix entry at row $i$ and column $j$. $l_{ij} > 0$ if SNPs $g_i$ and $g_j$ are in LD, and $l_{ij} = 0$ if these two SNPs are independent (i.e., there is no LD between them). The $m$-size vector $p_{maf} = [p^1_{maf}\ p^2_{maf}\ \cdots\ p^m_{maf}]$ represents the minor allele probabilities/frequencies (MAFs) of the SNPs in $G$. Finally, note that, for any $r_k \in R$, $g_i \in G$, and $g_j \in G$, the joint probability $P(X^i_k, X^j_k)$ can be derived from $l_{ij}$, $p^i_{maf}$, and $p^j_{maf}$.

The adversary carries out a reconstruction attack to infer the value $x_H \in \mathcal{X}_H$ by relying on background knowledge, $F_R(X^i_M, X^i_F, X^i_C)$, $l$, $p_{maf}$, and on the adversary’s observation $x_O \in \mathcal{X}_O$.⁶ After carrying out this reconstruction attack, we evaluate genomic and health privacy of the family members based on the adversary’s success and certainty about the targeted SNPs and the predispositions to diseases that they reveal. Finally, we discuss some ideas to preserve the individuals’ genomic and health privacy.

3.2. Adversary Model

An adversary is defined by the adversary’s objective(s), capabilities, and knowledge. The objective of the adversary is to compute the values of the targeted SNPs for one or more members of a targeted family by using (i) the available genomic data of one or more family members, (ii) the familial relationships between the family members, (iii) the rules of reproduction (in Section 2.1.2), (iv) the MAFs of the nucleotides, and (v) the population LD values between the SNPs. We note that (i) and (ii) can be gathered online from genome-sharing websites and OSNs, and that (iii), (iv), and (v) are publicly known information. Note that, in the future, the increasing possibility to accurately sequence as well as impute the actual haplotypes carried by an individual in each of the copies of the diploid genome will allow a more accurate inference of relatives’ genotype than relying on population LD patterns only.

Various attacks can be launched, depending on the adversary’s interest. The adversary might want to infer one particular SNP of a specific individual (targeted-SNP-targeted-relative attack) or one particular SNP of multiple relatives in the targeted family (targeted-SNP-multiple-relatives attack) by observing one or more other relatives’ SNP at the same position. Furthermore, the adversary might also want to infer multiple SNPs of the same individual (multiple-SNP-targeted-relative attack) or multiple SNPs of multiple family members (multiple-SNP-multiple-relatives attack) by observing SNPs at various positions of different relatives. The statistical inference model presented in this article applies to all these attacks.

3.3. Inference Attack

We formulate the reconstruction attack (on determining the values of the targeted SNPs) as finding the marginal probability distributions of the random variable $X_H$ representing the hidden SNPs, given the observed values $x_O$, familial relationships $T$, and the publicly available statistical information. We represent the marginal distribution of an SNP $g_i$ for an individual $r_j$ as $P(X^i_j = x^i_j \mid X_O = x_O)$.

These marginal probability distributions could traditionally be extracted from $P(X_H = x_H \mid X_O = x_O, F_R(X^i_M, X^i_F, X^i_C), l, T, p_{maf})$, which is the joint probability distribution function of the hidden SNPs, given the available side information and the observed SNPs. Then, clearly, each marginal probability distribution could be obtained as follows:

$$P(X^i_j = x^i_j \mid X_O = x_O) = \sum_{x_H \in \mathcal{X}_H \setminus X^i_j} P(X_H = x_H, X^i_j = x^i_j \mid X_O = x_O, F_R, l, T, p_{maf}), \tag{6}$$

where $X_H$ is here the random variable representing all hidden SNPs except SNP $g_i$ of relative $r_j$. However, the number of terms in Equation (6) grows exponentially with the number of variables, making the computation infeasible considering the scale of the human genome (which includes tens of millions of SNPs). In the worst case, the computation of the marginal probabilities has a complexity of $O(3^{nm})$. Thus, we propose to factorize the joint probability distribution function into products of simpler local functions, each of which depends on a subset of variables. These local functions represent the dependencies (due to LD and reproduction) between the different SNPs in $X$. Then, by running the belief-propagation algorithm on graphical models, we can compute the marginal probability distributions in linear complexity (with respect to both $n$ and $m$).

We present first the inference attack that takes only the familial correlations into account, which enables efficient performance of an exact inference, and then present the model for which both familial and LD correlations are considered. The former attack is typically sufficient if the adversary has access to the full set of SNPs of interest of the target’s relatives, whereas the latter can improve the attack’s accuracy if the adversary does not observe all SNPs of interest in the genomes of the target’s family members. For the second inference attack, due to the number and type of correlations, and the subsequent complexity of performing an exact inference, we make use of loopy belief propagation, which provides an approximate solution.

3.3.1. Inference Attack Without LD Correlations. Under the assumption that there is no LD correlation between SNPs, the random variables $X^i$, each representing a column of matrix $x$, are pairwise mutually independent, that is, $X^i \perp X^j$, $\forall g_i, g_j \in G$, $g_i \neq g_j$. We can then express the marginal distribution of $X^i_j$ in Equation (6) as

$$P(X^i_j = x^i_j \mid X^i_O = x^i_O) = \sum_{x^i_H \in \mathcal{X}^i_H \setminus X^i_j} P(X^i_H = x^i_H, X^i_j = x^i_j \mid X^i_O = x^i_O, F_R, T, p_{maf}), \tag{7}$$

where the set $\mathcal{X}^i_H$ is of maximal size $3^{n-1}$, which can still be computationally intractable if we deal with a large family. However, contrary to the general case, we can here compute the exact marginal distributions in linear time by modeling the various dependencies with a Bayesian network framework and applying the junction tree algorithm on it. In general, due to Mendelian inheritance laws, the joint distribution $P(X^i)$ can be factored as follows:

$$P(X^i) = \prod_{r_j \in \text{founders}} P(X^i_j) \prod_{r_k \in R \setminus \text{founders}} P(X^i_k \mid X^i_{m(k)}, X^i_{f(k)}), \tag{8}$$

where the founders are the relatives who have no ancestor in the family tree $T$, and $m(k)$, $f(k)$ are the indices of the mother, respectively, the father, of $r_k$. $P(X^i_j)$ is given by the MAFs $p_{maf}$, and $P(X^i_k \mid X^i_{m(k)}, X^i_{f(k)})$ by the Mendelian inheritance probabilities


Fig. 3. Graphical models representing familial dependencies. (a) Bayesian network representing a trio (mother, father, and child). (b) Bayesian network with two parents and two siblings. (c) Junction tree (made of two cliques) corresponding to the Bayesian network in (b).

child), which is also the main basic building block of our Bayesian-network representation of familial genetic dependencies. In this example, the joint distribution in Equation (8) can be factored as $P(X^i) = P(X^i_1) P(X^i_2) P(X^i_3 \mid X^i_1, X^i_2)$. As mentioned in Section 2.2, we can efficiently compute the exact marginal distributions on polytrees by using belief propagation. However, as soon as sibling relationships appear in the family tree $T$, the underlying Bayesian network is no longer a polytree⁷ and the belief propagation does not necessarily converge to the exact marginal probabilities. In this case, in order to perform exact inference, we first need to transform the Bayesian network into a junction tree. Figures 3(b) and 3(c) show a simple example of a Bayesian network with undirected cycles and its corresponding junction tree.
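To make the factorization in Equation (8) concrete, the following Python sketch evaluates it for the two-parent, two-sibling family of Figure 3(b) and obtains a marginal by the exhaustive summation of Equation (7). It is a sketch under stated assumptions rather than the article's implementation: the founder prior assumes Hardy-Weinberg equilibrium (the text only says it is given by the MAFs), the MAF of 0.2 is hypothetical, and all names are illustrative. Exhaustive summation is only viable for such a small family; it is shown here to check what an exact junction-tree inference should return.

```python
from itertools import product

def founder_prior(x, maf):
    """P(X = x) for a founder. Assumption: Hardy-Weinberg equilibrium."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2][x]

def mendel(xm, xf, xc):
    """Table I entry P(child = xc | mother = xm, father = xf)."""
    pm, pf = xm / 2, xf / 2
    return [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf][xc]

def joint(x, parents, maf):
    """Eq. (8): x maps relative -> genotype, parents maps child -> (mother, father)."""
    p = 1.0
    for r, g in x.items():
        if r in parents:                      # non-founder: Mendelian factor
            mo, fa = parents[r]
            p *= mendel(x[mo], x[fa], g)
        else:                                 # founder: population prior
            p *= founder_prior(g, maf)
    return p

def marginal(target, observed, relatives, parents, maf):
    """Eq. (7): sum the joint over all hidden genotypes, then normalize."""
    hidden = [r for r in relatives if r not in observed]
    post = [0.0, 0.0, 0.0]
    for states in product(range(3), repeat=len(hidden)):
        x = dict(observed)
        x.update(zip(hidden, states))
        post[x[target]] += joint(x, parents, maf)
    z = sum(post)
    return [v / z for v in post]

relatives = ["mother", "father", "sib1", "sib2"]
parents = {"sib1": ("mother", "father"), "sib2": ("mother", "father")}

# The adversary observes that sib1 is homozygous-minor (bb) and targets sib2.
posterior = marginal("sib2", {"sib1": 2}, relatives, parents, maf=0.2)
prior = marginal("sib2", {}, relatives, parents, maf=0.2)
print(prior, posterior)
```

Under these assumptions, observing one sibling shifts the other sibling's distribution markedly toward the minor allele, which is the kind of interdependent leakage the framework sets out to quantify.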

The procedure to construct the junction tree is as follows. First, we have to transform the directed graph into an undirected one, and moralize it, that is, connect all unconnected parents (nodes that have outgoing edges connecting the same node in the directed graph). Second, we triangulate the resulting undirected graph, meaning that we remove all cycles containing four nodes or more by connecting some of these nodes together. More precisely, for any given cycle in the undirected graph, this step creates an edge between any two nonsuccessive nodes in the cycle. This step is not needed in our genetic case because all cycles are already of length 3. Third, we remove cycles by clustering nodes belonging to the same cycle into cliques. In this process, it is important to build cliques with the smallest number of variables⁸ to minimize the inference computational burden. In our case, all cliques will be of size 3 (representing mother–father–child). Then, all cliques sharing the same variables are still connected by edges, which usually yields a loopy graph. In order to remove these cycles, we form a maximum spanning tree of cliques and ensure that if a variable is in two cliques, then it is in every clique along the path connecting the two cliques. If this property holds, local propagation of information will lead to global consistency. Finally, we apply the belief-propagation algorithm on the resulting junction tree, first passing messages⁹ upward, from the leaves to the root, and then downward, from the root to the leaves, which eventually provides the marginal probabilities of all cliques. If we are interested in the marginal probability of a given variable in a clique, we simply sum out all other variables in the clique.
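The construction steps above can be sketched in a few lines of Python for a pedigree. This is an illustrative sketch, not the article's code: it takes the statement that all cliques are mother–father–child trios as given, builds the maximum spanning tree with Kruskal's algorithm (our choice of algorithm), and uses a hypothetical three-generation family. The message-passing step itself is omitted.

```python
from itertools import combinations

def moralize(parents):
    """Undirected moral graph: parent-child edges plus an edge between co-parents."""
    edges = set()
    for child, (mo, fa) in parents.items():
        edges |= {frozenset((mo, child)), frozenset((fa, child)), frozenset((mo, fa))}
    return edges

def family_cliques(parents):
    """One clique of size 3 per child, as stated in the text for pedigrees."""
    return [frozenset((mo, fa, child)) for child, (mo, fa) in parents.items()]

def junction_tree(cliques):
    """Maximum spanning tree over cliques; edge weight = number of shared variables."""
    candidates = sorted(
        ((len(cliques[a] & cliques[b]), a, b)
         for a, b in combinations(range(len(cliques)), 2)
         if cliques[a] & cliques[b]),
        reverse=True)
    root = list(range(len(cliques)))          # union-find forest over clique indices

    def find(i):
        while root[i] != i:
            i = root[i]
        return i

    tree = []
    for _, a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:                          # adding this edge does not close a cycle
            root[ra] = rb
            tree.append((a, b, cliques[a] & cliques[b]))
    return tree

# Hypothetical pedigree: grandparents GM and GF, their child P, spouse S,
# and two grandchildren C1 and C2 (child -> (mother, father)).
parents = {"P": ("GM", "GF"), "C1": ("P", "S"), "C2": ("P", "S")}
moral = moralize(parents)
cliques = family_cliques(parents)
tree = junction_tree(cliques)
for a, b, sep in tree:
    print(sorted(cliques[a]), "--", sorted(sep), "--", sorted(cliques[b]))
```

In this example the two sibling cliques are linked through their shared parents, and the grandparents' clique attaches through the shared variable P, so every variable appearing in two cliques also appears on the path between them.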

⁷ Its underlying undirected graph is not a tree (it contains a loop made of the siblings and their parents).

⁸ Note that the size of the largest clique is called the treewidth and determines the complexity of the algorithm (which is exponential in the treewidth).

3.3.2. Inference Attack With Phenotypic Information. It could also happen that the adversary gets access to phenotypic data, such as physical traits or diseases. Such data can be found online, on health-related OSNs (such as PatientsLikeMe or OpenSNP) or traditional OSNs. We show here how the Bayesian network framework can be easily expanded to take this type of information into account in our inference attack.

Fig. 4. Bayesian network representing a trio (mother, father, and child), and two SNPs $g_i$ and $g_k$ influencing a disease l.

Figure 4 illustrates how phenotypic nodes can be included in the Bayesian network of Figure 3(a) that represents a single SNP. This updated Bayesian network shows two SNPs, $g_i$ and $g_k$, of a trio, and a single phenotype l. Here, it is assumed that two SNPs directly influence the phenotype, but there could be from one to many depending on the phenotype. The new layer of phenotypic information adds a number of nodes in the Bayesian network equal to n times the total number of phenotypic traits/diseases. Assuming that a single phenotype is observed, influenced by two SNPs, the general joint distribution presented in Equation (8) is updated as follows:

$$P(X^i, X^k, Y^l) = \prod_{r_p \in \text{founders}} P(X^i_p)\,P(X^k_p) \prod_{r_c \in R \setminus \text{founders}} P(X^i_c \mid X^i_{m(c)}, X^i_{f(c)})\,P(X^k_c \mid X^k_{m(c)}, X^k_{f(c)}) \times \prod_{r_j \in R} P(Y^l_j \mid X^i_j, X^k_j). \tag{9}$$

The resulting Bayesian network is not a polytree if it includes sibling relationships or phenotypes influenced by more than one SNP. In this case, as explained in Section 3.3.1, we have to first transform the Bayesian network into a junction tree. The process is the same as in the case without phenotypic data. After the moralization step (in which graphical parents are connected), all cycles are also of length 3, including those induced by the phenotype nodes. We evaluate this framework with real user data in Section 5.2.

3.3.3. Inference Attack With LD Correlations. Once we take into account correlations within the same genomic sequence, the Bayesian network representation does not fit well as it cannot represent undirected dependencies, such as the pairwise joint probabilities given by LD. Also, constructing a junction tree from a Bayesian network containing many cycles because of new nodes representing LD correlations would become intractable. A factor graph model is better suited, as it can take both conditional and joint local probabilities into account. It is a bipartite graph that consists of variable nodes, representing random variables, and factor nodes, representing functions that factor the global joint probability. Following Kschischang et al. [2001], we form a factor graph by setting a variable node $x^i_j$ for each random variable $X^i_j$ ($g_i \in G$ and $r_j \in R$). We use two types of factor nodes:¹⁰ (i) the familial factor node, representing the familial relationships and reproduction, and (ii) the LD factor node, representing the LD relationships between the SNPs.

¹⁰ For the sake of clarity, we do not include the variable and factor nodes related to phenotypic information, but the model also applies to them.

Fig. 5. The factor graph representation of a trio (mother, father, and child) and 3 SNPs per family member. The square, circle, and hexagonal nodes represent the familial factor nodes, variable nodes, and LD factor nodes, respectively. The message passing described in the main text is between the nodes $x^1_1$, $f^1_3$, and $h^{1,2}_1$, highlighted in the graph.

Our factor graph contains loops because of LD nodes and sibling relationships (if any). We summarize the connections between the variable and factor nodes here (Figure 5):

—Each variable node $x^i_j$ has its familial factor node $f^i_j$ to which it is connected. Furthermore, $x^i_k$ ($k \neq j$) is also connected to $f^i_j$ if k is the mother or father of j (in T). Thus, the maximum degree of a familial factor node is 3.

—Variable nodes $x^j_i$ and $x^m_i$ are connected to an LD factor node $h^{j,m}_i$ if SNP $g_j$ is in LD with SNP $g_m$. Since the LD relationships are pairwise between the SNPs, the degree of an LD factor node is always 2.
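The two connection rules above can be sketched in a few lines of Python. The function names, the tuple encoding of nodes, and the chosen LD pairs are assumptions made for illustration; they are not taken from our implementation.

```python
def build_factor_graph(parents, n_snps, ld_pairs):
    """Return {factor node: [variable nodes]}; a variable node is (snp, person).

    parents:  {person: (mother, father)} or {person: ()} for founders.
    ld_pairs: SNP index pairs (a, b) that are in LD (hypothetical input).
    """
    factors = {}
    for person, pars in parents.items():
        for snp in range(1, n_snps + 1):
            # familial factor node: the person plus the parents present in the tree
            factors[("f", snp, person)] = [(snp, person)] + [(snp, p) for p in pars]
        for a, b in ld_pairs:
            # LD factor node: two SNPs of the same person
            factors[("h", a, b, person)] = [(a, person), (b, person)]
    return factors

def neighbors(factors, variable):
    """Factor nodes attached to a given variable node."""
    return {name for name, variables in factors.items() if variable in variables}

# Trio with 3 SNPs; the LD pairs (1, 2) and (1, 3) are chosen for illustration.
trio = {"r1": (), "r2": (), "r3": ("r1", "r2")}
graph = build_factor_graph(trio, 3, [(1, 2), (1, 3)])
mother_snp1 = neighbors(graph, (1, "r1"))
```

Here `mother_snp1` contains the mother's own familial factor node, the child's familial factor node, and the two LD factor nodes of the mother that involve SNP 1.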

Given the conditional dependencies caused by reproduction and LD, the global distribution $P(X_H = x_H \mid X_O = x_O, F_R(X^i_M, X^i_F, X^i_C), L, T, p_{\mathrm{maf}})$ can be factored into products of several local functions, each having a subset of variables from x as arguments:

$$P(X_H = x_H \mid X_O = x_O, F_R(X^i_M, X^i_F, X^i_C), L, T, p_{\mathrm{maf}}) = \frac{1}{Z} \left[ \prod_{g_i \in G} \prod_{r_j \in R} f^i_j\big(x^i_j, x^i_{m(j)}, x^i_{f(j)}, F_R(X^i_M, X^i_F, X^i_C), p_{\mathrm{maf}}\big) \right] \times \left[ \prod_{r_i \in R} \prod_{(j,m)\ \text{s.t.}\ l_{jm} \neq 0} h^{j,m}_i\big(x^j_i, x^m_i, l_{jm}\big) \right], \tag{10}$$

where Z is the normalization constant, and $x^i_{m(j)}$ and $x^i_{f(j)}$ are the SNPs $g_i$ of the mother and the father of $r_j$, respectively (if they exist in T).

Next, we introduce the messages between the factor and the variable nodes to compute the marginal probability distributions using belief propagation. We denote the messages from the variable nodes to the factor nodes as μ. We also denote the messages from familial factor nodes to variable nodes as λ, and from LD factor nodes to variable nodes as β. Let $X^{(\nu)} = \{x_j^{i\,(\nu)} : r_j \in R, g_i \in G\}$ be the collection of variables representing the values of the variable nodes at iteration ν of the algorithm. The message $\mu^{(\nu)}_{i \to k}(x_j^{i\,(\nu)})$ denotes the probability of $x_j^{i\,(\nu)} = \ell$ ($\ell \in \{0, 1, 2\}$) at the νth iteration. Furthermore, $\lambda^{(\nu)}_{k \to i}(x_j^{i\,(\nu)})$ denotes the probability that $x_j^{i\,(\nu)} = \ell$, for $\ell \in \{0, 1, 2\}$, at the νth iteration given $x^i_{m(j)}$, $x^i_{f(j)}$, $F_R(X^i_M, X^i_F, X^i_C)$, and $p_{\mathrm{maf}}$. Finally, $\beta^{(\nu)}_{k \to i}(x_j^{i\,(\nu)})$ denotes the probability that $x_j^{i\,(\nu)} = \ell$, for $\ell \in \{0, 1, 2\}$, at the νth iteration given the LD relationships between the SNPs.

For clarity of presentation, we choose a simple family tree consisting of a trio (i.e., mother, father, and child) and 3 SNPs (i.e., |R| = 3 and |G| = 3). In Figure 5, we show how the trio and SNPs are represented on a factor graph, where $r_1$ represents the mother, $r_2$ represents the father, and $r_3$ represents the child. Furthermore, the 3 SNPs are $g_1$, $g_2$, and $g_3$. We describe the message exchange between the variable node representing the first SNP of the mother ($x^1_1$), the familial factor node of the child ($f^1_3$), and the LD factor node $h^{1,2}_1$. The belief propagation algorithm iteratively exchanges messages between the factor and the variable nodes in Figure 5, updating the beliefs on the values (in $x_H$) of the targeted SNPs at each iteration, until convergence. We denote the variable and factor nodes $x^1_1$, $f^1_3$, and $h^{1,2}_1$ with the letters i, k, and z, respectively. The variable nodes generate their messages (μ) and send them to their neighbors. Variable node i forms $\mu^{(\nu)}_{i \to k}(x_1^{1\,(\nu)})$ by multiplying all information it receives from its neighbors excluding the familial factor node k.¹¹ Therefore, the message from variable node i to the familial factor node k at the νth iteration is given by

$$\mu^{(\nu)}_{i \to k}\big(x_1^{1\,(\nu)}\big) = \frac{1}{Z} \times \prod_{w \in (\sim k)} \lambda^{(\nu-1)}_{w \to i}\big(x_1^{1\,(\nu-1)}\big) \times \prod_{y \in \{z,\, h^{1,3}_1\}} \beta^{(\nu-1)}_{y \to i}\big(x_1^{1\,(\nu-1)}\big), \tag{11}$$

where Z is a normalization constant, and the notation (∼k) means all familial factor node neighbors of the variable node i, except k. This computation is repeated for every neighbor of each variable node. It is important to note that the message in Equation (11) is valid if the value of $x^1_1$ is hidden from the adversary. However, the value of $x^1_1$ can also be observed by the adversary. In this case, if $x^1_1 = \rho$ ($\rho \in \{0, 1, 2\}$), then $\mu^{(\nu)}_{i \to k}(x_1^{1\,(\nu)} = \rho) = 1$ and $\mu^{(\nu)}_{i \to k}(x_1^{1\,(\nu)}) = 0$ for other potential values of $x^1_1$ (regardless of the values of the messages received by the variable node i from its neighbors).

Next, the factor nodes generate their messages. The message from the familial factor node k to the variable node i at the νth iteration is formed using the principles of belief propagation as

$$\lambda^{(\nu)}_{k \to i}\big(x_1^{1\,(\nu)}\big) = \sum_{\{x^1_2,\, x^1_3\}} f^1_3\big(x^1_1, x^1_{m(1)}, x^1_{f(1)}, F_R(X^i_M, X^i_F, X^i_C), p_{\mathrm{maf}}\big) \prod_{y \in \{x^1_2,\, x^1_3\}} \mu^{(\nu)}_{y \to k}\big(x_1^{1\,(\nu)}\big). \tag{12}$$

Note that $f^1_3(x^1_1, x^1_{m(1)}, x^1_{f(1)}, F_R(X^i_M, X^i_F, X^i_C), p_{\mathrm{maf}}) \propto P(x^1_1 \mid x^1_{m(1)}, x^1_{f(1)}, F_R(X^i_M, X^i_F, X^i_C))$, and this probability is computed using Table I. Furthermore, if the degree of the familial factor node is 1 for a particular SNP, then the local function corresponding to the familial factor node depends only on the MAF of the corresponding SNP. For example, the degree of $f^1_1$ (in Figure 5(c)) is 1; thus, $f^1_1(x^1_1, x^1_{m(1)}, x^1_{f(1)}, F_R(X^i_M, X^i_F, X^i_C), p_{\mathrm{maf}}) \propto P(x^1_1 \mid p^1_{\mathrm{maf}})$. This computation must be performed for every neighbor of each familial factor node.
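To make the familial factor concrete, the following Python sketch computes a factor-to-variable message in the standard sum-product form for one SNP of a trio. It assumes Mendelian inheritance without mutation (each parent transmits one of its two alleles with probability 1/2) and encodes a genotype as its number of minor alleles; the function names are hypothetical.

```python
from itertools import product

def mendel(child, mother, father):
    """P(child genotype | parents' genotypes), genotypes in {0, 1, 2}."""
    pm, pf = mother / 2.0, father / 2.0   # chance each parent passes the minor allele
    return {0: (1 - pm) * (1 - pf),
            1: pm * (1 - pf) + (1 - pm) * pf,
            2: pm * pf}[child]

def familial_message_to_mother(mu_father, mu_child):
    """Message from the child's familial factor node to the mother's variable node.

    Sums the factor over the father's and child's genotypes, weighted by their
    incoming variable-to-factor messages, then normalizes.
    """
    msg = [sum(mendel(c, m, f) * mu_father[f] * mu_child[c]
               for f, c in product(range(3), repeat=2))
           for m in range(3)]
    total = sum(msg)
    return [v / total for v in msg]

# Child observed as homozygous minor (one-hot message), father hidden (all ones).
lam = familial_message_to_mother([1.0, 1.0, 1.0], [0.0, 0.0, 1.0])
```

With these inputs, the message rules out genotype 0 for the mother, since she must have transmitted a minor allele to the child.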

11_{The message}_μ(ν) i_→z(x11


Similarly, the message from the LD factor node z to the variable node i at the νth iteration is formed as

$$\beta^{(\nu)}_{z \to i}\big(x_1^{1\,(\nu)}\big) = \sum_{x^2_1} h^{1,2}_1\big(x^1_1, x^2_1, l_{1,2}\big) \prod_{y \in \{x^2_1\}} \mu^{(\nu)}_{y \to z}\big(x_1^{1\,(\nu)}\big). \tag{13}$$

As before, this computation is performed for every neighbor of each LD factor node. We further note that $h^{1,2}_1(x^1_1, x^2_1, l_{1,2}) \propto P(x^1_1, x^2_1)$, which is derived from $l_{1,2}$, $p^1_{\mathrm{maf}}$, and $p^2_{\mathrm{maf}}$. The algorithm proceeds to the next iteration in the same way as the νth iteration.
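One possible way to obtain the pairwise joint probability used by an LD factor node is sketched below in Python. This is an illustration under explicit assumptions that the text does not spell out here: the LD value is a signed allelic correlation r, haplotype frequencies follow the standard relation with the disequilibrium coefficient D, and a genotype is the sum of two independently drawn haplotypes (random mating). The function name `ld_joint` is hypothetical.

```python
from itertools import product
from math import sqrt

def ld_joint(r, p1, p2):
    """Joint P(x1, x2) of two SNP genotypes from correlation r and the two MAFs."""
    d = r * sqrt(p1 * (1 - p1) * p2 * (1 - p2))    # disequilibrium coefficient D
    hap = {(1, 1): p1 * p2 + d,                    # both minor alleles on one haplotype
           (1, 0): p1 * (1 - p2) - d,
           (0, 1): (1 - p1) * p2 - d,
           (0, 0): (1 - p1) * (1 - p2) + d}
    if min(hap.values()) < 0:
        raise ValueError("r is not attainable for these allele frequencies")
    joint = {}
    for (a1, b1), (a2, b2) in product(hap, repeat=2):   # two independent haplotypes
        key = (a1 + a2, b1 + b2)                        # minor-allele counts per SNP
        joint[key] = joint.get(key, 0.0) + hap[(a1, b1)] * hap[(a2, b2)]
    return joint

joint = ld_joint(0.8, 0.3, 0.3)
independent = ld_joint(0.0, 0.3, 0.3)
```

When r = 0, the joint distribution reduces to the product of the two per-SNP priors, so the LD factor node carries no information.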

The algorithm starts at the variable nodes. Thus, at the first iteration of the algorithm (i.e., ν = 1), the variable node i sends messages to its neighboring factor nodes based on the following rules: (i) if the value of $x^1_1$ is hidden from the adversary, $\mu^{(1)}_{i \to k}(x_1^{1\,(1)}) = 1$ for all potential values of $x^1_1$, and (ii) if the value of $x^1_1$ is observed by the adversary and $x^1_1 = \rho$ ($\rho \in \{0, 1, 2\}$), $\mu^{(1)}_{i \to k}(x_1^{1\,(1)} = \rho) = 1$ and $\mu^{(1)}_{i \to k}(x_1^{1\,(1)}) = 0$ for other potential values of $x^1_1$. The iterations stop when all variable nodes have converged to stable distributions. The marginal probability of each variable in $X_H$ is given by multiplying all incoming messages at each variable node representing an unobserved SNP, as in Equation (4). Note that the factor graph could also embed phenotypic information by adding one factor node and one variable node per phenotype and individual. We do not present it here for the sake of clarity and conciseness.
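The initialization rules and the final marginalization can be sketched as follows in Python. The function names and the convergence tolerance `tol` are assumptions introduced for this illustration.

```python
def initial_message(observed=None):
    """First-iteration message of a variable node over the genotypes {0, 1, 2}."""
    if observed is None:
        return [1.0, 1.0, 1.0]                       # hidden SNP: uninformative
    return [1.0 if g == observed else 0.0 for g in range(3)]   # observed SNP

def marginal(incoming):
    """Normalized product of all incoming factor-to-variable messages."""
    belief = [1.0, 1.0, 1.0]
    for msg in incoming:
        belief = [b * m for b, m in zip(belief, msg)]
    total = sum(belief)
    return [b / total for b in belief]

def converged(old, new, tol=1e-6):
    """Stop criterion: the belief changed by less than tol in every entry."""
    return max(abs(a - b) for a, b in zip(old, new)) < tol

# Two illustrative incoming messages (one familial, one LD) for a hidden SNP.
belief = marginal([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])
```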

3.4. Computational Complexity

The computational complexity of the inference without LD correlations is linear in the number of nodes n (i.e., number of family members) in the original Bayesian network, linear in the number of SNPs m, and exponential in the treewidth, that is, the maximum number of variables in cliques. In our case, the treewidth is 3, which is negligible compared to n and m. We can thus state that the computational complexity is O(nm). Note that, in general, finding an optimal triangulation ordering to construct the junction tree is NP-hard, but, in our case, all the cycles are already of size 3 after the moralization step; thus, there is no need to triangulate the graph. The same analysis applies for the inference with phenotypic information. Therefore, the computational complexity increases linearly with the number of phenotypes times the number of SNPs influencing each phenotype times the number of family members sharing each phenotype.

The computational complexity of the inference with LD correlations is proportional to the number of factor nodes. In our setting, there are nm familial factor nodes and a maximum of nm(m − 1)/2 LD factor nodes. Thus, the worst-case computational complexity per iteration is O(nm²). However, as each SNP is in LD with a limited number of other SNPs, the matrix L is sparse and the number of LD factor nodes grows with m rather than with m(m − 1)/2, especially if we focus on SNPs in strong LD only. Thus, the average computational complexity per iteration is O(nm). Based on our experiments, we can state that the number of iterations before convergence is a small constant, between 7 and 15. Note, finally, that this complexity can be further reduced by using techniques similar to those developed for message-passing decoding of LDPC codes (e.g., working in the log domain [Chen et al. 2002]). We implement the proposed attack and evaluate its performance in practice in Section 6.1.
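The gap between the worst case and the sparse case can be checked with a short Python computation. The function name is hypothetical; the example values (n = 11 relatives, m = 1000 SNPs, 13 LD partners per SNP on average) are those of the setting evaluated in Section 4.3.

```python
def factor_node_counts(n, m, ld_per_snp=None):
    """Return (familial, LD) factor-node counts for n relatives and m SNPs."""
    familial = n * m
    if ld_per_snp is None:
        ld = n * m * (m - 1) // 2       # worst case: every pair of SNPs is in LD
    else:
        ld = n * m * ld_per_snp // 2    # sparse case: each LD pair counted once
    return familial, ld

worst = factor_node_counts(11, 1000)        # (11000, 5494500)
sparse = factor_node_counts(11, 1000, 13)   # (11000, 71500)
```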

3.5. Privacy Metrics

A crucial step toward protecting kin genomic privacy is to quantify the privacy loss induced by the release of genomic information. Through the inference attack, the adversary infers the targeted SNPs belonging to the members of a targeted family by using background knowledge and observed genomic data (of the family members). The inferred information can be expressed as the posterior distribution $P(X_H = x_H \mid X_O = x_O, F_R, L, T, p_{\mathrm{maf}})$. Moreover, each posterior marginal probability distribution is represented as $P(X^i_j = \hat{x}^i_j \mid X_O = x_O)$, $\forall r_j \in R, g_i \in G$.¹² We propose to quantify kin genomic privacy by measuring the expected estimation error (incorrectness) and the uncertainty of the adversary.¹³

Correctness was already proposed in the context of location privacy [Shokri et al. 2011]. In our scenario, correctness quantifies the adversary’s success in inferring the targeted SNPs. That is, it quantifies the expected distance between the adversary’s estimate of the value of an SNP, $\hat{x}^i_j$, and the true value of the corresponding SNP, $x^i_j$. This distance can be expressed as the expected estimation error as follows:

$$E^i_j = \sum_{\hat{x}^i_j \in \{0,1,2\}} P\big(X^i_j = \hat{x}^i_j \mid X_O = x_O\big)\, \big\lVert x^i_j - \hat{x}^i_j \big\rVert. \tag{14}$$

Note that $\lVert \cdot \rVert$ can be any norm, such as the L1 or L2 (Euclidean) norms. We select the L1 norm in our evaluation as it is the most intuitive and most representative of the discrepancy that we want to measure. If we rely on the Hamming distance¹⁴ instead, the expected estimation error becomes equal to $1 - P(\hat{x}^i_j = x^i_j)$, that is, 1 minus the probability of success (or success rate). We discuss this further in Section 4.2.
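Equation (14) and its Hamming-distance variant can be sketched in Python as follows. The function name and the example posterior are illustrative assumptions.

```python
def expected_error(posterior, true_value, distance="l1"):
    """Expected estimation error of one SNP for a posterior over {0, 1, 2}."""
    if distance == "l1":
        dist = lambda x, y: abs(x - y)
    elif distance == "hamming":
        dist = lambda x, y: 0 if x == y else 1
    else:
        raise ValueError("unknown distance")
    return sum(p * dist(true_value, guess) for guess, p in enumerate(posterior))

posterior = [0.1, 0.2, 0.7]      # illustrative posterior; the true value is 2
err_l1 = expected_error(posterior, 2)                 # 0.1*2 + 0.2*1 + 0.7*0
err_hamming = expected_error(posterior, 2, "hamming") # 1 - 0.7
```

With the L1 norm the error of a single SNP ranges from 0 to 2, whereas with the Hamming distance it ranges from 0 to 1.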

Privacy can also be represented as the adversary’s uncertainty [Diaz et al. 2003; Serjantov and Danezis 2003], that is, the ambiguity of $P(X^i_j = \hat{x}^i_j \mid X_O = x_O)$. This uncertainty is generally considered to be maximum if the posterior distribution is uniform. This definition of uncertainty can be quantified as the (normalized) entropy of $P(X^i_j = \hat{x}^i_j \mid X_O = x_O)$ as follows:

$$H^i_j = \frac{-\sum_{\hat{x}^i_j \in \{0,1,2\}} P\big(X^i_j = \hat{x}^i_j \mid X_O = x_O\big) \log P\big(X^i_j = \hat{x}^i_j \mid X_O = x_O\big)}{\log(3)} := \frac{H(X^i_j \mid X_O)}{\log(3)}. \tag{15}$$

The higher the entropy, the higher the uncertainty.
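A minimal Python sketch of Equation (15) follows; the function name is hypothetical, and terms with zero probability are skipped, following the usual convention 0 log 0 = 0.

```python
from math import log

def normalized_entropy(posterior):
    """Entropy of a posterior over {0, 1, 2}, divided by log(3)."""
    return sum(-p * log(p) for p in posterior if p > 0) / log(3)

uniform = normalized_entropy([1 / 3, 1 / 3, 1 / 3])   # maximum uncertainty
certain = normalized_entropy([0.0, 0.0, 1.0])         # no uncertainty
```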

Finally, we propose another entropy-based metric that quantifies the mutual dependence between the hidden genomic data that the adversary is trying to reconstruct and the observed data. This is quantified by mutual information $I(X^i_j; X_O) = H(X^i_j) - H(X^i_j \mid X_O)$ [Agrawal and Aggarwal 2001]. As privacy decreases with mutual information, we propose the following (normalized) privacy metric:

$$I^i_j = 1 - \frac{H(X^i_j) - H(X^i_j \mid X_O)}{H(X^i_j)} = \frac{H(X^i_j \mid X_O)}{H(X^i_j)}. \tag{16}$$

We can then evaluate the genomic privacy of an individual rj by computing the average of the per-SNP values over all SNPs gi ∈ G, for any of the three aforementioned metrics. We can also compute the average over all SNPs of all family members to get the global privacy level of a whole family.
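The mutual-information-based metric of Equation (16) and the averaging step can be sketched in Python as follows. The function names are hypothetical, and, as in Equation (15), H(X | X_O) is taken to be the entropy of the posterior given the observed values. The sketch assumes a prior with nonzero entropy.

```python
from math import log

def entropy(dist):
    """Shannon entropy (natural log), skipping zero-probability terms."""
    return sum(-p * log(p) for p in dist if p > 0)

def mi_privacy(prior, posterior):
    """1 - I(X; X_O) / H(X), i.e., H(X | X_O) / H(X), for one SNP."""
    return entropy(posterior) / entropy(prior)

def average_privacy(per_snp_values):
    """Privacy of an individual: mean of the per-SNP values."""
    return sum(per_snp_values) / len(per_snp_values)

prior = [0.49, 0.42, 0.09]                        # illustrative prior for one SNP
unchanged = mi_privacy(prior, prior)              # nothing learned
revealed = mi_privacy(prior, [0.0, 1.0, 0.0])     # SNP fully inferred
individual = average_privacy([unchanged, revealed])
```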

¹² We use here $\hat{x}^i_j$ to refer to the estimate of $x^i_j$.

¹³ These metrics are not specific to the proposed inference attack; they can be used to quantify genomic privacy in general.

14_xi

Fig. 6. Family tree of CEPH/Utah Pedigree 1463 consisting of the 11 family members that were considered. The symbols ♂ and ♀ represent the male and female family members, respectively.

If we are interested in a more tangible privacy, we can also convert the per-SNP genomic-privacy metrics into health-privacy metrics. To quantify an individual’s health privacy, we focus on the individual’s predisposition to different diseases. Let $S_d$ be the set of SNPs that are associated with a disease d. Then, a metric quantifying the health privacy for an individual $r_i$ regarding the disease d can be defined as follows:

$$D^d_i = \frac{1}{\sum_{k: g_k \in S_d} c_k} \sum_{k: g_k \in S_d} c_k\, G^k_i, \tag{17}$$

where $G^k_i$ is the genomic privacy of an SNP $g_k$ for individual $r_i$, computed using Equation (14), (15), or (16), and $c_k$ is the contribution of SNP $g_k$ to disease d.¹⁵ Other health-privacy metrics based on nonlinear combinations of genotypes or combinations of alleles will be defined in future work. Note that health-privacy metrics are valid at a given time, and cannot be used to evaluate future privacy provision, as genome research can change knowledge on the contribution of SNPs to diseases.
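Equation (17) is a contribution-weighted average, which the following Python sketch illustrates. The SNP identifiers, privacy values, and contributions are made up for the example and do not come from any medical study.

```python
def health_privacy(snp_privacy, contributions):
    """Weighted average of per-SNP privacy over the SNPs associated with a disease.

    snp_privacy:   {snp_id: genomic privacy of that SNP for the individual}
    contributions: {snp_id: contribution c_k of that SNP to the disease}
    """
    total = sum(contributions.values())
    return sum(c * snp_privacy[k] for k, c in contributions.items()) / total

# Hypothetical disease influenced by two SNPs with unequal contributions.
privacy = {"snp_a": 0.4, "snp_b": 0.8}
weights = {"snp_a": 3.0, "snp_b": 1.0}
d = health_privacy(privacy, weights)   # (3*0.4 + 1*0.8) / 4
```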

4. EVALUATION

In this section, we first evaluate the performance of the proposed inference attack, then compare the entropy-based metrics with respect to the expected estimation error, and finally evaluate the accuracy of the inference attack with and without considering the LD between SNPs.

For this evaluation, we use the CEPH/Utah Pedigree 1463 that contains the partial DNA sequences of 17 family members (4 grandparents, 2 parents, and 11 children) [Drmanac et al. 2010]. We note in Figure 6 that we use only 5 (out of 11) children for our evaluation because (i) 11 is much above the average number of children per family, and (ii) we observe that the strength of the adversary’s inference does not increase further (due to the children’s revealed genomes) when more than 5 children’s genomes are revealed. As the SNPs related to important diseases, such as Alzheimer’s, are not included in this dataset, we quantify health privacy in Section 5 by using the data collected from a genome-sharing website.

To quantify the genomic privacy of the individuals in the CEPH family, we focus on their SNPs on chromosome 1 (which is the largest chromosome). We make use of the three base metrics introduced in Section 3.5, and rely on the L1 norm to measure the distance between two SNP values in Equation (14), meaning that the distance for a single SNP can go from 0 to 2. We aggregate the per-SNP metrics by averaging them over all considered SNPs. We study the relationship between these metrics in Section 4.2.

¹⁵ These contributions are determined as a result of medical studies. Some SNPs might increase (or decrease) the risk for a disease more than others.

Note that, for the inference without LD, we made use of the MATLAB implementation of the junction tree algorithm provided in the Bayes Net Toolbox [Murphy et al. 2001] and, for the inference with LD, we implemented our own factor graph and loopy belief-propagation algorithm in Python.

4.1. Inference Without LD Correlations

First, we assume that the adversary targets one family member and tries to infer SNPs by using the published SNPs of other family members without considering the LD between the SNPs. We select an individual from the CEPH family and denote that person as the target individual. We construct G, the set of SNPs that we consider for evaluation, from all 81,899 available SNPs on chromosome 1. Thus, the random variable $X_H$ represents the hidden 81,899 SNPs of the target individual that we want to infer. Furthermore, the random variable $X_O$ represents the 81,899 SNPs of each of the other observed family members. That is, we sequentially reveal all 81,899 SNPs on chromosome 1 of all family members (excluding the target individual). The exact sequence of the family members (whose SNPs are revealed) is indicated on the figure of each evaluation. Note that we changed the order compared to the conference paper [Humbert et al. 2013] in order to convey new and complementary messages. In this endeavor, we also included Table III.

In Figure 7, we show the evolution of the genomic privacy of three target individuals from the CEPH family (in Figure 6): (i) grandparent (GP1), (ii) parent (P5), and (iii) child (C7). We note that all entropy-based metrics for each target individual start from the same values. This is logical, as these do not depend on the actual SNP values, but rather only on the MAFs given by population statistics. We also observe that the parent’s genomic privacy decreases to a lower level than the child’s genomic privacy, which itself degrades more than the grandparent’s (e.g., the adversary’s error for the grandparent’s genome does not go below 0.3). Compared to the graphs in [Humbert et al. 2013], the observation of GP3’s, GP4’s, and P6’s genomes has an impact on GP1’s and P5’s privacy. This is due to the fact that, here, we reveal the children’s genomes first, which creates a conditional probabilistic dependence between the genomes on the P5 and P6 sides of the pedigree tree.

We observe in Figure 7(a) that the grandparent’s genomic privacy is mostly affected by the SNPs of the first revealed children (C7, C8), as well as by those of the spouse (GP2) and the child (P5). Table III also shows that the observation of only P5 already considerably decreases the genomic privacy of GP1, and the observation of both P5 and GP2 decreases it to its minimal value. Thus, in some scenarios, it is not necessary to observe many relatives to threaten an individual’s genomic privacy. We also observe (in Figure 7(b)) that, by revealing all family members’ SNPs (except P5), the adversary can almost reach an estimation error of 0 about P5’s genome. The target parent’s genomic privacy decreases significantly, essentially with the observation of the children’s and spouse’s SNPs. GP1 and GP2 do not have much influence, in part because they are observed at the end. Table III shows that, if we observe only GP1 and GP2, we can reduce the genomic privacy of P5 by 50%, which is more than with the observation of two children (40%), or one child and the spouse (35%).

We observe in Figure 7(c) that C7’s genomic privacy already decreases significantly with the observation of one parent (P5) and two siblings (C8 and C9). We also notice that, once P5 is known, the disclosure of GP1 and GP2’s genomes has no impact on C7’s privacy. In the same way, we observe that once both parents’ genomes are revealed, the knowledge of an additional child’s genome does not help the attacker. Indeed, as each new offspring is created independently of another (except in the case of twins), each sibling’s genomic inheritance is independent of the others given his/her parents’ genetic background. This is confirmed by Table III, where we see that the observation of C8 in addition to P5 and P6 does not change C7’s privacy. As in the other cases, Table III tells us that we can infer a lot of genomic information by knowing only a few relatives’ genomes. For instance, P5’s observation already reduces the privacy by 30%. Moreover, the observation of the two parents provides the minimal privacy level that C7 can expect in this scenario.

Fig. 7. Metrics for measuring personal genomic privacy. Evolution of the average genomic privacy measured with our three base metrics defined in Section 3.5 for the (a) grandparent (GP1), (b) parent (P5), and (c) child (C7) by gradually revealing other relatives’ genomes. We reveal all 81,899 SNPs on chromosome 1 of other family members while inferring the 81,899 SNPs of the targeted individual (GP1, P5, or C7). The x-axis represents the cumulative disclosure sequence. The order of disclosure has been chosen such that the results provide new insights on how relatives affect personal genomic privacy compared to previous work. We note that x = 0 represents the prior distribution, when no genomic data is observed by the adversary. (d) Per-SNP comparison of the two entropy-based metrics with regard to the expected estimation error, with data points taken from the same scenario as (c). Each point in the two plots represents the expected estimation error (x-axis) and the normalized entropy (y-axis, top) or 1 − mutual information (y-axis, bottom) for a single SNP of child C7 for a different amount of observed kin genomic information (from 0 to 10 relatives, as for (c)). The closer to the x = y line the points are, the more correlated two metrics are.

Fig. 8. Empirical cumulative distribution function (CDF) of (a) the success rate, and our three base metrics: (b) expected estimation error, (c) normalized entropy, and (d) 1 − (normalized) mutual information. We plot here the CDF of the per-SNP privacy levels of parent P5. We selected 5 out of the 11 disclosure scenarios of Figure 7(b), specifically, (i) no disclosure (“prior”); disclosure (ii) of C7 only; (iii) of C7, C8, GP3, GP4; (iv) of C7, C8, C9, C10, GP3, GP4; and (v) of C7, C8, C9, C10, P6, GP1, GP3, GP4.

Instead of averaging the privacy levels over the whole set of SNPs in G, Figure 8 depicts the cumulative distribution function (CDF) of the per-SNP privacy levels under five different settings of Figure 7(b). In addition to the three base metrics, we also plot the success rate, that is, $P(\hat{x}^i_j = x^i_j)$ (Figure 8(a)). We first note that the success rate and the expected estimation error CDFs are very symmetric around the diagonal. We also observe that when no relative is observed, around half of the SNPs have a success rate greater than 0.5, whereas, once C7’s SNPs are observed, half of the SNPs have a success rate higher than 0.7. Moreover, under no observation, only 20% of the SNPs can be guessed with success higher than or equal to 0.9, whereas this percentage goes up to 65% when six of P5’s relatives are observed and 87% when nine of P5’s relatives are revealed. We also show the percentage of SNPs with success higher than or equal to 0.9 for different scenarios in Table III. We notice that, for example, by observing only the two parents of P5 (GP1 and GP2), the percentage of SNPs inferred with 0.9 success increases to 57%.
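The quantities read off these CDFs can be computed with a few lines of Python. The function names are hypothetical and the per-SNP success rates below are made-up values used only to show the computation, not results from the dataset.

```python
def empirical_cdf(values):
    """Sorted (value, cumulative fraction) pairs of a list of per-SNP values."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

def fraction_at_least(values, threshold):
    """Fraction of SNPs whose value is higher than or equal to the threshold."""
    return sum(v >= threshold for v in values) / len(values)

success_rates = [0.95, 0.5, 0.9, 0.7]          # illustrative per-SNP success rates
cdf = empirical_cdf(success_rates)
high_confidence = fraction_at_least(success_rates, 0.9)
```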

4.2. Metrics Comparison

First, we compare the success rate, a metric proposed in Wagner [2015], with the expected estimation error. As mentioned in Section 3.5, if we use the Hamming distance between $x^i_j$ and $\hat{x}^i_j$ in our estimation error metric, the expected estimation error is simply equal to 1 minus the success rate. By comparing Figures 8(a) and 8(b), we note that these metrics are really symmetric and opposite, even if we use the L1 norm for the estimation error. This leads us to conclude that the estimation error is as intuitive as the success rate and that it is a suitable privacy metric as it increases monotonically with privacy, whereas the success rate decreases with privacy.

Table III. Absolute and Relative Levels of Genomic Privacy of the Grandparent (GP1), Parent (P5), and Child (C7) Given the Observation of 0 to 3 of Their Relatives

H \ O           ∅        P5         P5, GP2    C7, GP2    C7, C8, GP2
GP1   E_j       0.446    0.322      0.309      0.404      0.385
      relative  100%     72%        69%        91%        86%
      ∗         20%      28%        29%        23%        23%

H \ O           ∅        GP1, GP2   C7, C8     C7, P6     GP1, GP2, C7
P5    E_j       0.48     0.242      0.286      0.312      0.203
      relative  100%     50%        60%        65%        42%
      ∗         20%      57%        38%        29%        57%

H \ O           ∅        P5         P5, C8     P5, P6     P5, P6, C8
C7    E_j       0.489    0.344      0.301      0.182      0.182
      relative  100%     70%        62%        37%        37%
      ∗         20%      28%        40%        64%        64%

Note: We use here the expected estimation error $E_j$ to measure the genomic privacy of GP1, P5, and C7 (first two rows for each individual, second row representing the relative error with respect to the error without observations) but also the success rate (third row, denoted with ∗). Here, we represent the percentage of SNPs for which the success rate is higher than 0.9, that is, $P(x^i_j = \hat{x}^i_j) > 0.9$.
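The relative levels reported in Table III are the expected estimation errors divided by the error without observations. A short Python check, using the absolute values from the table (the function name is hypothetical), reproduces them:

```python
def relative_levels(errors):
    """Percentages of each error relative to the first one (no observation)."""
    prior = errors[0]
    return [round(100 * e / prior) for e in errors]

gp1 = relative_levels([0.446, 0.322, 0.309, 0.404, 0.385])
p5 = relative_levels([0.48, 0.242, 0.286, 0.312, 0.203])
c7 = relative_levels([0.489, 0.344, 0.301, 0.182, 0.182])
```

For example, the value of 50% for P5 given GP1 and GP2 corresponds to the 50% reduction of P5’s genomic privacy discussed in Section 4.1.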

In Figure 7(d), we compare both our entropy-based metrics with the estimation error, point by point, over all 81,899 SNPs of chromosome 1 and for all values aggregated in Figure 7(c) to measure C7’s privacy evolution. Apart from the fact that normalized entropy slightly overestimates the expected estimation error, it grows quite similarly to the estimation error, especially in the estimation error range [0, 0.5], where the majority of the points are located. We also observe that the third metric, 1 − (normalized) mutual information, is worse than the normalized entropy in approximating the estimation error. This is corroborated by Figure 8, which shows that the normalized entropy empirical CDFs are closer to those of the estimation error than the empirical CDFs of the mutual information-based metric. This motivates us to rely on the normalized entropy to quantify privacy in Section 5 when we do not know the ground truth.

4.3. Inference With LD Correlations

Next, we include the LD relationships and observe the change in the inference power of the adversary using the LD values. We construct G from 1000 SNPs on chromosome 1. Among these 1000 SNPs, each SNP is in LD with 13 other SNPs, on average. Furthermore, the strength of the LD varies between 0.5 and 1 (where r = 1 represents the strongest LD relationship, as discussed before). As before, we define a target individual from the CEPH family, construct the set $X_H$ from the individual’s SNPs, and sequentially reveal other family members’ SNPs to observe the decrease in the genomic privacy of the target individual. We note that individuals do not always reveal their whole genome, and that they may disclose different parts of their genomes (e.g., different sets of SNPs). Thus, we assume that, for each family member (except for the target individual), the adversary does not observe the full set of 1000 SNPs, but rather only a fraction of them. Specifically, we assume that people reveal 25%, 50%, or 75% of their genomic data, and that they reveal different subsets of their SNPs. Figure 9(a) shows the evolution of genomic privacy (measured by the expected estimation error) of parent P5 with and without making use of LD correlations. First, we observe that LD clearly improves the inference attack, and thus decreases genomic privacy compared to the case when LD is not used. We also note that the smaller the percentage of observed SNPs, the higher the effect of LD correlations on P5’s privacy. This is due to the fact that LD correlations help fill in the missing SNPs. We also observe that the more relatives reveal their SNPs, the smaller the gap between the privacy with and without LD.

Fig. 9. Evaluation of the impact of LD correlations on genomic privacy. (a) Evolution of parent P5’s privacy with and without considering LD. For each family member, we reveal 250, 500, or 750 randomly picked SNPs (among the 1000 SNPs in G), following the same order of familial disclosure as in Figure 7(b). Privacy level is measured using the expected estimation error as the base metric. Note that x = 0 represents the prior distribution, when no genomic data is revealed. (b) Evolution of the global privacy of a family by gradually revealing 10% of its SNPs.

Finally, we also evaluate the global inference power of the adversary when inferring multiple SNPs among all family members, given a subset of SNPs belonging to some family members and considering the LD correlations between SNPs. That is, we evaluate the inference power of the adversary for different fractions of observed data for the family members. Using a set of 100 SNPs for every family member, we construct $X_H$ from (κ × 100 × n) SNPs, randomly selected from all family members, where n is the number of family members in the family tree (n = 11 for this scenario), and κ ∈ {0, 0.1, . . . , 0.9, 1}. We assume that the SNPs that are not in $X_H$ are observed by the adversary (i.e., in $X_O$), and we evaluate the inference power of the adversary for the SNPs represented by $X_H$, for different values of κ. In Figure 9(b), we observe a very fast decrease in the global genomic privacy (privacy of all family members), showing that the observation of a small portion of the family’s SNPs can have a huge impact on genomic privacy. For instance, the estimation error is decreased by almost a factor of 3 by observing only the first 10% of the SNPs.
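The random split between hidden and observed SNPs can be sketched in Python as follows. The function name, the member labels, and the fixed seed are assumptions made for this illustration; the sizes (11 family members, 100 SNPs each) are those of the scenario described above.

```python
import random

def split_hidden_observed(members, n_snps, kappa, seed=0):
    """Randomly place a fraction kappa of all (snp, member) positions in X_H."""
    rng = random.Random(seed)                       # fixed seed for reproducibility
    positions = [(snp, person) for person in members for snp in range(n_snps)]
    n_hidden = round(kappa * len(positions))
    hidden = set(rng.sample(positions, n_hidden))   # X_H: to be inferred
    observed = set(positions) - hidden              # X_O: seen by the adversary
    return hidden, observed

family = ["member%d" % i for i in range(1, 12)]     # 11 hypothetical labels
hidden, observed = split_hidden_observed(family, 100, 0.9)
```

With κ = 0.9, 990 of the 1100 positions are hidden and the remaining 110 (the first 10% revealed) are observed.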

5. EXPLOITING GENOME-SHARING WEBSITES

We present here two concrete attacks that can be carried out using existing genome-sharing websites and OSNs.

5.1. Cross-Website Attack with Online Social Networks

In order to show that the proposed inference attack threatens not only the Lacks family, but potentially all families, we collected publicly available data from a genome-sharing website and familial information from an OSN, and evaluated the decrease in genomic and health privacy of people caused by the observation of their relatives’ data.