Private algorithms for the protected in social network search

Michael Kearns, Aaron Roth, Zhiwei Steven Wu, and Grigory Yaroslavtsev

Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104

Edited by Salil Vadhan, Harvard University, Cambridge, MA, and accepted by the Editorial Board November 23, 2015 (received for review May 30, 2015)

data privacy | social networks | counterterrorism

Abstract. Motivated by tensions between data privacy for individual citizens and societal priorities such as counterterrorism and the containment of infectious disease, we introduce a computational model that distinguishes between parties for whom privacy is explicitly protected, and those for whom it is not (the targeted subpopulation). The goal is the development of algorithms that can effectively identify and take action upon members of the targeted subpopulation in a way that minimally compromises the privacy of the protected, while simultaneously limiting the expense of distinguishing members of the two groups via costly mechanisms such as surveillance, background checks, or medical testing. Within this framework, we provide provably privacy-preserving algorithms for targeted search in social networks. These algorithms are natural variants of common graph search methods, and ensure privacy for the protected by the careful injection of noise in the prioritization of potential targets. We validate the utility of our algorithms with extensive computational experiments on two large-scale social network datasets.

Significance. Motivated by tensions between data privacy for individual citizens and societal priorities such as counterterrorism, we introduce a computational model that distinguishes between parties for whom privacy is explicitly protected and those for whom it is not (the "targeted" subpopulation). Within this framework, we provide provably privacy-preserving algorithms for targeted search in social networks. We validate the utility of our algorithms with extensive computational experiments on two large-scale social network datasets.

Author contributions: M.K., A.R., Z.S.W., and G.Y. designed research, performed research, analyzed data, and wrote the paper. Ordering of authors is alphabetical. The authors declare no conflict of interest. This article is a PNAS Direct Submission; S.V. is a guest editor invited by the Editorial Board. To whom correspondence should be addressed: mkearns@cis.upenn.edu. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510612113/-/DCSupplemental.

The tension between the useful or essential gathering and analysis of data about citizens and the privacy rights of those citizens is at a historical peak. Perhaps the most striking and controversial recent example is the revelation that US intelligence agencies systematically engage in "bulk collection" of civilian "metadata" detailing telephonic and other types of communication and activities, with the alleged purpose of monitoring and thwarting terrorist activity (1). Other compelling examples abound, including in medicine (patient privacy vs. preventing epidemics), marketing (consumer privacy vs. targeted advertising), and many other domains.

Debates about (and models for) data privacy often have an "all or nothing" flavor: privacy guarantees are either provided to every member of a population, or else privacy is deemed to be a failure. This dichotomy is appropriate only if all members of the population have an equal right to, or demand for, privacy. Few would argue that actual terrorists should have such rights, which leads to difficult questions about the balance between protecting the rights of ordinary citizens and using all available means to prevent terrorism. [A recent National Academies study (2) concluded that there are not (yet) technological alternatives to bulk collection and analysis of civilian metadata, in the sense that such data are essential in current counterterrorism practices.] A major question is whether and when individual privacy should be sacrificed in service of such societal priorities. Similarly, in the medical domain, epidemics (such as the recent international outbreak of Ebola; ref. 3) have raised serious debate about the clear public interest in controlling contagion versus the privacy rights of the infected and those who care for them. The model and results in this paper represent a step toward explicit acknowledgment of such trade-offs, and algorithmic methods for their management.

The scenarios sketched above can be broadly modeled by a population divided into two types. There is a protected subpopulation that enjoys (by law, policy, or choice) certain privacy guarantees. For instance, in the examples above, these protected individuals might be nonterrorists, or uninfected citizens (and perhaps informants and health care professionals). They are to be contrasted with the "unprotected" or targeted subpopulation, which does not share those privacy assurances. A key assumption of the model we will introduce is that the protected or targeted status of individual subjects is not known, but can be discovered by (possibly costly) measures, such as surveillance or background investigations (in the case of terrorism) or medical tests (in the case of disease). Our overarching goal is to allow parties such as intelligence or medical agencies to identify and take appropriate action on the targeted subpopulation, while also providing privacy assurances for the protected individuals who are not the specific targets of such efforts, all while limiting the cost and extent of the background investigations needed.
As a concrete example of the issues we are concerned with, consider the problem of using social network data (for example, telephone calls, emails, and text messages between individuals) to search for candidate terrorists. One natural and broad approach would be to use common graph search methods: beginning from known terrorist "seed" vertices in the network, neighboring vertices are investigated, in an attempt to grow the known subnetwork of targets. This general practice is sometimes referred to as contact chaining (section 3.1 of ref. 2). A major concern is that such search methods will inevitably encounter protected citizens, and that even taking action against only discovered targeted individuals may compromise the privacy of the protected.

To rigorously study the trade-offs between privacy and societal interests discussed above, our work introduces a formal model for privacy of network data that provides provable assurances only to the protected subpopulation, and gives algorithms that allow effective investigation of the targeted population. These algorithms are deliberately "noisy" and are privacy-preserving versions of the widely used graph search methods mentioned above, and as such represent only mild, but important, departures from commonly used approaches.
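As a point of reference, the following is a minimal sketch of nonprivate contact chaining as just described: breadth-first expansion from known seed vertices, investigating each neighbor encountered and expanding only through confirmed targets. The adjacency-dict representation and the investigate function are illustrative assumptions on our part, not part of the paper's formal model (which is introduced below).

```python
from collections import deque

def contact_chain(graph, seeds, investigate, budget):
    """Nonprivate contact chaining: starting from known targeted 'seed'
    vertices, repeatedly investigate neighbors of discovered targets.

    graph: dict mapping each vertex to a set of neighboring vertices.
    seeds: iterable of vertices already known to be targeted.
    investigate: callable(v) -> bool, True if v is targeted (costly).
    budget: maximum number of investigations allowed.
    """
    discovered = set(seeds)          # confirmed targeted vertices
    examined = set(seeds)            # vertices whose status is known
    frontier = deque(seeds)
    spent = 0
    while frontier and spent < budget:
        u = frontier.popleft()
        for v in graph[u]:
            if v in examined or spent >= budget:
                continue
            examined.add(v)
            spent += 1
            if investigate(v):       # costly status-bit query
                discovered.add(v)
                frontier.append(v)   # grow the subnetwork of targets
    return discovered
```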
At the highest level, one can think of our algorithms as outputting a list of confirmed targeted individuals discovered in the network, for whom any subsequent action (e.g., publication in a most-wanted list, further surveillance, or arrest in the case of terrorism; medical treatment or quarantine in the case of epidemics) will not compromise the privacy of the protected. The key elements of our model include the following:

i) Network data collected over a population of individuals and consisting of pairwise contacts (physical, social, electronic, financial, etc.). The contacts or links of each individual comprise the private data they desire to protect. We assume a third party (such as an intelligence agency or medical organization) has direct access to this network data and would like to discover and act upon targeted individuals.

ii) For each individual, an immutable "status bit" that determines their membership in the targeted subpopulation (e.g., involvement in terrorism or infection). These status bits can be discovered by the third party, but only at some nontrivial cost (such as further surveillance or medical testing), and thus there is a budget limiting the number of status bits that an algorithm can reveal. One might assume or hope that in practice this budget is sufficient to investigate a number of individuals on the order of the targeted subpopulation size (so they can all be discovered), but considerably smaller than that needed to investigate every member of the general population.

iii) A mathematically rigorous notion of individual data privacy (based on the widely studied differential privacy; ref. 4) that provides guarantees of privacy for the network data of only the protected individuals, while allowing the discovery of targeted individuals. Informally, this notion guarantees that, compared with a counterfactual world in which any protected individual arbitrarily changed any part of their data, or even removed themselves entirely from the computation, their risk (measured with respect to the probability of arbitrary events) has not substantially increased.

We emphasize two important points about our model. First, we assume that the process of investigating an individual to determine their status bit is unobservable, and leaks no information itself. This assumption is justified in some settings—for example, when the investigation involves secretly intercepting digital communications, like emails and phone calls, or when it involves performing tests on materials (like blood samples) or information already obtained. However, our model does not fit situations in which the investigations themselves are observable—for example, if the investigation requires interviewing an individual's family, friends, and colleagues—because the very fact that an individual was chosen for an investigation (regardless of its outcome) might disclose their private data. The second point is that "privacy" is a word with many meanings, and it is important to distinguish the types of privacy that we aim to protect (see, for example, Solove's taxonomy of privacy; ref. 5). Our goal is to quantify informational privacy—that is, how much information about a protected individual can be deduced from the output of an analysis.
However, it is important to note that the status-bit investigations our algorithms make, even if unobservable, constitute a privacy loss that Solove calls "intrusion." Our results can be viewed as providing a quantitative trade-off between informational privacy and intrusion: using our algorithms, it is possible to guarantee more informational privacy at the cost of a higher degree of intrusion, and vice versa.

Our main results are:

i) The introduction of a broad class of graph search algorithms designed to find and identify targeted individuals. This class of algorithms is based on a general notion of a statistic of proximity—a network-based measure of how "close" a given individual v is to a certain set of individuals S. For instance, one such closeness measure is the number of short paths in the network from v to members of S. Our (necessarily randomized) algorithms add noise to such statistics to prioritize which status bits to query (and thus how to spend the budget).

ii) A theoretical result proving a quantitative privacy guarantee for this class of algorithms, where the level of privacy depends on a measure of the sensitivity of the statistic of proximity to small changes in the network.

iii) Extensive computational experiments demonstrating the effectiveness of our privacy-preserving algorithms on real social network data. These experiments demonstrate that, in addition to the privacy guarantees, our algorithms are also useful, in the sense that they find almost as many members of the targeted subpopulation as their nonprivate counterparts. The experiments allow us to quantify the loss in effectiveness incurred by the gain in privacy.

Our formal framework is, to our knowledge, the first to introduce explicit protected and targeted subpopulations with qualitatively differing privacy rights. This is in contrast to the quantitative distinction proposed by Dwork and McSherry (6), which still does not allow for the explicit discovery of targeted individuals. We also note that our definition of privacy can be expressed in the Blowfish privacy framework (7) (although this had not previously been done). Our algorithms are the first to provide mathematically rigorous privacy guarantees for the protected while still allowing effective discovery of the targeted. More generally, we believe our work takes one of the first steps toward richer privacy models that acknowledge and manage the tensions between different levels of privacy guarantees for different subgroups.

Preliminaries

Consider a social network in which each individual has an immutable status bit specifying whether they are a member of the targeted subpopulation; if not, we say they are part of the protected subpopulation. These status bits induce a partition of the population: we write T to denote the targeted subpopulation and P to denote the protected subpopulation. Individuals correspond to the vertices V of the network, and the private data of each individual v is the set of edges incident to v. We assume that the value of an individual's status bit is not easily observed, but can be discovered through (possibly costly) investigation. Our goal is to develop search algorithms that identify members of the targeted subpopulation while preserving the privacy of the edge sets of the protected population. Any practical algorithm must operate under an investigation budget, which limits the number of status bits that are examined.
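The following is a minimal sketch, not from the paper, of how such a budgeted investigation oracle might be represented; the class and exception names are illustrative.

```python
class BudgetExceeded(Exception):
    """Raised when the investigation budget is exhausted."""

class BudgetedOracle:
    """Wraps the hidden status bits behind a costly, budgeted query.

    targeted: set of vertices in the targeted subpopulation T
              (hidden from the search algorithm except via queries).
    budget:   maximum number of status-bit investigations allowed.
    """
    def __init__(self, targeted, budget):
        self._targeted = set(targeted)
        self.budget = budget
        self.examined = {}            # vertex -> revealed status bit

    def investigate(self, v):
        if v in self.examined:        # re-querying a vertex is free
            return self.examined[v]
        if self.budget <= 0:
            raise BudgetExceeded(v)
        self.budget -= 1
        bit = v in self._targeted
        self.examined[v] = bit
        return bit
```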
Our goal is a total number of status-bit examinations that is on the order of the size of the targeted subpopulation T, which may be much smaller than the size of the protected population P. This is the source of the tension we study: because the budget is limited, it is necessary to exploit the private edge set to guide our search (i.e., we cannot simply investigate the entire population), but we wish to do so in a way that does not reveal much about the edges incident to any specific protected individual.

The privacy guarantee we provide is a variant of differential privacy, an algorithmic definition of data privacy. It formalizes the requirement that arbitrary changes to a single individual's private data should not significantly affect the output distribution of the data analysis procedure, and so guarantees that the analysis leaks little information about the private data of any single individual. We introduce the definition of differential privacy specialized for the network setting. We treat a network as a collection of vertices representing individuals, each represented as a list of its edges (which form the private data of that vertex). For a network G and a vertex v, let D_v(G) be the set of edges incident to v in G. Let G_n be the family of all n-vertex networks.

Definition 1. [Vertex differential privacy (4, 8–10)] The networks G, G′ in G_n are neighboring if one can be obtained from the other by an (arbitrary) rewiring of the edges incident to a single vertex v; i.e., if for some vertex v, D_u(G) \ {(u, v)} = D_u(G′) \ {(u, v)} for all u ≠ v. An algorithm A: G_n → O satisfies ε-differential privacy if for every event S ⊆ O and all neighboring networks G, G′ ∈ G_n,

    Pr[A(G) ∈ S] ≤ e^ε · Pr[A(G′) ∈ S].

Differential privacy is an extremely strong guarantee—it has many interpretations (see discussion in ref. 11), but most straightforwardly it promises the following: simultaneously for every individual v, and simultaneously for any event S that they might be concerned about, event S is almost no more likely to occur given that individual v's data are used in the computation than if they were replaced by an arbitrarily different entry. Here, "almost no more likely" means that the probability that the bad event S occurs has increased by a multiplicative factor of at most e^ε, which we term the risk multiplier. As the privacy parameter ε approaches 0, the risk multiplier approaches 1, meaning that individual v's data have no effect at all on the probability of a bad outcome. The smaller the risk multiplier, the more meaningful the privacy guarantee. It will be easier for us to reason directly about the privacy parameter ε in our analyses, but semantically it is the risk multiplier e^ε that measures the quality of the privacy guarantee, and it is this quantity that we report in our experiments.

Differential privacy promises the same protections for every individual in a network, which is incompatible with our setting. We want to be able to identify members of the targeted population, and to do so, we must be able to make arbitrary inferences from their network data. Nevertheless, we want to give strong privacy guarantees to members of the protected subpopulation. This motivates our variant of differential privacy, which redefines the neighboring relation between networks. In contrast to the definition of neighbors given above, we now say that two networks are neighbors if and only if one can be obtained from the other by arbitrarily rewiring the edges incident to a single member of the protected population only. Crucially, two networks are not considered to be neighbors if they differ in either:

i) the way in which they partition vertices between the protected and targeted populations P and T, or

ii) any edges that connect pairs of vertices u, v ∈ T that are both members of the targeted population.

What this means is that we are offering no guarantees about what an observer can learn about either the status bit of an individual (protected vs. targeted) or the set of edges incident to targeted individuals. However, we are still promising that no observer can learn much about the set of edges incident to any member of the protected subpopulation.
This naturally leads us to the following definition.

Definition 2. (Protected differential privacy) Two networks G, G′ in G_n are neighboring if (i) they share the same partition into P and T, and (ii) G can be obtained from G′ by rewiring the set of edges incident to a single vertex v ∈ P. An algorithm A: G_n → O satisfies ε-protected differential privacy if for any two neighboring networks G, G′ ∈ G_n and for any event S ⊆ O,

    Pr[A(G) ∈ S] ≤ e^ε · Pr[A(G′) ∈ S].

Formally, our network analysis algorithms take as input a network and a method by which they may query whether a vertex v is a member of the protected population P or not. The class of algorithms we consider are network search algorithms—they aim to identify some subset of the targeted population. Our formal model is agnostic as to what action is ultimately taken on the identified members (for example, in a medical application they might be quarantined, in a security application they might be arrested, etc.). From the perspective of informational privacy, all that is relevant is that the set of members of the targeted population we ultimately identify is observable. Hence, without loss of generality, we can abstract away the action taken and simply view the output of the mechanism as an ordered list of individuals who are confirmed to be targeted.

Our privacy definition promises that what an observer learns about an individual "Alice" (e.g., that Alice is in contact with a particular individual Bob, or with an entire class of individuals, such as members of a religious group) is almost independent of Alice's connections, so long as Alice is not herself a member of the targeted population. On the other hand, it does not prevent an observer from learning that Alice exists at all. This models a setting in which (for example) a national government has access to an index of all of its citizens (through birth and immigration records), but nevertheless would like to protect information about their interactions with each other.

We note that the Blowfish privacy framework gives a general definition of privacy with different neighboring relations (7). Our definition can be seen as an instantiation of this general framework. This is in contrast to other kinds of relaxations of differential privacy, which relax the worst-case assumptions on the prior beliefs of an attacker, as in Bassily et al. (12), or the worst-case collusion assumptions on collections of data analysts, as in Kearns et al. (13). Several works have also proposed assigning different differential privacy parameters to different individuals (see, e.g., ref. 14). However, this is not compatible with identifying members of a targeted population.

Algorithmic Framework

The key element in our algorithmic framework is the notion of a statistic of proximity (SoP), a network-based measure of how close an individual is to another set of individuals in a network. Formally, an SoP is a function f that takes as input a graph G, a vertex v, and a set of targeted vertices S ⊆ T, and outputs a numeric value f(G, v, S). Examples of such functions include the number of common neighbors between v and the vertices in S, and the number of short paths from v to S. In our use, when we compute a statistic of proximity, the set S will always be the set of vertices confirmed through the status-bit investigations so far to be members of the targeted population. Hence, the SoP should be viewed as a measure of closeness in the network to the known portion of the targeted subnetwork.
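For concreteness, here is a minimal sketch (using the same illustrative adjacency-dict representation as in the earlier sketches; the function name is ours) of one SoP of this kind, and the one used later in our experiments: the size of the union, across all w in S, of the common neighbors of v and w. As noted in the Experimental Evaluation, this statistic has sensitivity 1.

```python
def common_neighbors_sop(graph, v, S):
    """Statistic of proximity f(G, v, S): the number of vertices that are
    neighbors both of v and of at least one confirmed target in S.

    graph: dict mapping each vertex to a set of neighboring vertices.
    v:     the candidate vertex being scored.
    S:     set of vertices already confirmed to be targeted.
    """
    neighbors_of_S = set()
    for w in S:
        neighbors_of_S |= graph[w]
    return len(graph[v] & neighbors_of_S)
```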
Algorithms in our framework rely on the SoP to prioritize which status bits to examine. Because the value of the SoP depends on the private data of a vertex, we perturb the values of the SoP by adding noise with scale proportional to its sensitivity, which captures the magnitude by which a single protected vertex can affect the SoP of some targeted vertex. The sensitivity of the SoP f, denoted Δ(f), is defined to be the maximum of |f(G, t, S) − f(G′, t, S)| over all choices for the subset T of targeted vertices, all neighboring pairs of graphs G and G′, all t ∈ T, and all S ⊆ T. Crucially, note that in this definition—in contrast to what is typically required in standard differential privacy—we are concerned only with the degree to which a protected individual can affect the SoP of a targeted individual.

Fig. 1. Visual comparison of the nonprivate algorithm Target (Left) and the private algorithm PTarget (Right) on a small portion of the IMDB network (see Experimental Evaluation for more details). For each algorithm, blue indicates protected vertices that have been examined, red indicates targets that have been examined, and gray vertices have not been examined yet. Both algorithms begin with the same seed target vertex, and by directed statistic-first search discover a subnetwork of targeted individuals (central red edges). As a consequence, many protected vertices are discovered and examined as well. Because of the added noise, PTarget explores the network in a more diffuse fashion, which in this case permits it to find an additional subnetwork of targets toward the right side of the network. The primary purpose of the noise, however, is to provide privacy for protected vertices.

We next describe the nonprivate version of our search algorithm, Target(k, f). Our motivation in choosing this particular algorithm is simplicity: it is the most straightforward type of contact-chaining algorithm that ignores privacy entirely and simply uses the given SoP to prioritize investigations. For any fixed SoP f, Target proceeds in k rounds, each corresponding to the identification of a new connected component in the subgraph induced by T. The algorithm must be started with a seed vertex—a preidentified member of the targeted population. Each round of the algorithm consists of two steps:

i) Statistic-first search: Given a seed targeted vertex, the algorithm iteratively grows a discovered component of targeted vertices by examining, in order of their SoP values (computed with respect to the set S of individuals already identified as being members of the targeted population), the vertices that neighbor the previously discovered targeted vertices.
This continues until every neighbor of the discovered members of the targeted population has been examined, and all of them have been found to be members of the protected population. We note that this procedure discovers every member of the targeted population that is part of the same connected component as the seed vertex in the subgraph induced by only the members of the targeted population.

ii) Search for a new component: Following the completion of statistic-first search, the algorithm must find a new vertex in the targeted population to serve as an initial vertex for a new round of statistic-first search. To do this, the algorithm computes the value of the SoP for all vertices whose status bit has not already been examined, using as the input set S the set of already discovered members of the targeted population. It then sorts all of the vertices in decreasing order of their SoP values and begins examining their status bits in this order. The first vertex that is found to be a member of the targeted population is then used as the seed vertex in the next round. In the SI Appendix, we present a slight variant of this procedure that, instead of running for a fixed number of rounds, allows the search algorithm to halt if it is unable to find any new targeted vertices after some number of examinations.

The algorithm outputs discovered targeted individuals as they are found, and so its output can be viewed as an ordered list of individuals who are confirmed to be from the targeted population.

The private version of the targeting algorithm, PTarget(k, f, ε), is a simple variant of the nonprivate version. The statistic-first search stage remains unchanged, and only the search for a new component is modified, via randomization. In the private variant, when the algorithm computes the value of the SoP f on each unexamined vertex, it perturbs each of these values independently with noise sampled from the Laplace distribution Lap(Δ(f)/ε), where ε is a parameter. [We use Lap(b) to denote the Laplace distribution centered at 0 with probability density function Pr(x) = (1/2b) exp(−|x|/b).] Finally, it examines the vertices in sorted order of their perturbed SoP values. We prove the following theorem, deferring details of the proof and the algorithm to the SI Appendix.

Theorem 1. Given any k ≥ 1 and ε > 0 and a fixed SoP f, the algorithm PTarget(k, f, ε) recovers k connected components of the subgraph induced by the targeted vertices and satisfies ((k − 1) · ε)-protected differential privacy.
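Before discussing the theorem, the following is a minimal sketch, under the illustrative representations used in the earlier sketches, of the randomized new-component search in PTarget: Laplace noise with scale Δ(f)/ε is added to each unexamined vertex's SoP value, and status bits are then examined in decreasing order of the noisy scores. This is our own sketch of the mechanism as described in the text, not the authors' reference implementation.

```python
import numpy as np

def private_new_component_search(graph, sop, S, examined, oracle,
                                 sensitivity, epsilon):
    """Randomized search for a seed of a new targeted component (PTarget).

    Each unexamined vertex's statistic of proximity to the confirmed
    targets S is perturbed with Laplace noise of scale sensitivity/epsilon;
    vertices are then investigated in decreasing order of the noisy scores
    until a new targeted vertex is found (or the budget is exhausted).
    """
    scale = sensitivity / epsilon
    candidates = [v for v in graph if v not in examined]
    noisy = {v: sop(graph, v, S) + np.random.laplace(0.0, scale)
             for v in candidates}
    for v in sorted(candidates, key=noisy.get, reverse=True):
        if oracle.investigate(v):     # costly status-bit query
            return v                  # seed vertex for the next round
    return None
```

The deterministic statistic-first search within an already discovered component is left exactly as in Target; only this step is randomized.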
There are two important things to note about this theorem. First, we obtain a privacy guarantee despite the fact that the statistic-first search portion of our algorithm is not randomized—only the search for new components employs randomness. Intuitively, the reason that statistic-first search can remain unmodified and deterministic is that as long as we remain within a connected component of targeted vertices, we will eventually output only those vertices, and thus we are not compromising the privacy of protected vertices. It is only when we search for a new targeted component via protected vertices and the SoP that we must randomize—for instance, to provide privacy to protected "bridge" vertices between targeted components. See the SI Appendix for the detailed technical argument. Second, the privacy cost of the algorithm grows only with k, the number of disjoint connected components of targeted individuals (disjoint in the subgraph defined on targeted individuals), and not with the total number of individuals examined, or even the total number of targeted individuals identified. Hence, the privacy cost can be very small on graphs in which the targeted individuals lie in only a small number of connected components or "cells." Both of these features are unusual compared with the typical guarantees obtainable under the standard notion of differential privacy.

Because PTarget adds randomness for privacy, it examines a different set of vertices than Target does. Fig. 1 provides a sample visualization of the contrasting behavior of the two algorithms. Although theorems comparing the utility of Target and PTarget are possible, they require assumptions ensuring that the chosen SoP is sufficiently informative, in the sense of separating the targeted from the protected by a wide enough margin. In particular, one needs to rule out cases in which all unexplored targeted vertices are deemed closer to the current set than all protected vertices, but only by an infinitesimal amount, in which case the noise added by PTarget eradicates all signal. In general, such scenarios are unrealistic, so instead of comparing utility theoretically, we now provide an extensive empirical comparison.

Experimental Evaluation

In this section we empirically demonstrate the utility of our private algorithm PTarget by comparing its performance to that of its nonprivate counterpart Target. (No institutional approval was required for the experiments described.) We report on computational experiments performed on real social network data drawn from two sources—the paper coauthorship network of the Digital Bibliography and Library Project (DBLP) (dblp.uni-trier.de/xml/) and the coappearance network of film actors of the Internet Movie Database (IMDB) (www3.ul.ie/gd2005/dataset.html)—whose macroscopic properties are described in Table 1. These data sources provide us with naturally occurring networks, but not a targeted subpopulation. Although one could attempt to use communities within each network (e.g., all coauthors within a particular scientific subtopic), our goal was to perform large-scale experiments in which the component structure of targeted vertices (which we shall see is the primary determinant of performance) could be more precisely controlled. We thus used a simple parametric stochastic diffusion process (described in the SI Appendix) to generate the targeted subpopulation in each network. We then evaluate our private search algorithm PTarget on these networks and compare its performance to the nonprivate variant Target. For brevity, we describe our results only for the IMDB network; results for the DBLP network are quite similar.
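As a rough illustration (ours, not the authors' experimental code) of how the plotted quantities below can be computed from a run's ordered investigation log, assuming access to the ground-truth targeted set used to generate each trial:

```python
import numpy as np

def targets_found_curve(examined_in_order, targeted):
    """curve[i] = number of confirmed targets after i+1 investigations,
    the quantity plotted on the y axis against the budget on the x axis."""
    found = 0
    curve = []
    for v in examined_in_order:
        found += int(v in targeted)
        curve.append(found)
    return curve

def average_curves(curves):
    """Average performance over repeated randomized runs (e.g., 200 trials),
    truncating to the shortest run so the curves align."""
    n = min(len(c) for c in curves)
    return np.mean([c[:n] for c in curves], axis=0)
```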
In our experiments, we fix a particular SoP between v and S: the size of the union, across all w in S, of the common neighbors of v and w. Here S is the subset of vertices representing the already discovered members of the targeted population. This SoP has sensitivity 1, and so can be used in our algorithm while adding only a small amount of noise. In particular, the private algorithm PTarget adds noise sampled from the Laplace distribution Lap(20) to the SoP when performing new-component search. By Theorem 1, such an instantiation of PTarget guarantees ((k − 1)/20)-protected differential privacy if it finds k targeted components.

Table 1. Social network datasets used in the experiments

Network  No. vertices  No. edges   Edge relation
DBLP     956,043       3,738,044   Scientific paper coauthorship
IMDB     235,710       4,587,715   Movie coappearance

The main trade-off we explore is the number of members of the targeted population that are discovered by the algorithms (the y axis in the ensuing plots) as a function of the budget, or number of status bits that have been investigated so far (the x axis in the ensuing plots). In each plot, the parameters of the diffusion model described above were fixed and used to stochastically generate targeted subpopulations of the fixed networks given by our social network data. By varying these parameters, we can investigate performance as a function of the underlying component structure of the targeted subnetwork. As we shall see, in terms of relative performance, there are effectively three different regimes of the diffusion model (i.e., targeted subpopulation) parameter space. In all of them PTarget compares favorably with Target, but to different extents and for different reasons, which we now discuss. We also plot the growth of the risk multiplier for PTarget, which remains less than 2 in all three regimes. On each plot, there is a single blue curve showing the performance of the (deterministic) algorithm Target, and multiple red curves showing the performance across 200 runs of our (randomized) algorithm PTarget.

Fig. 2. Performance for the case in which there is a dominant component in the targeted subpopulation. Left shows the number of targeted vertices found as a function of the budget used, for both the (deterministic) nonprivate algorithm Target (blue) and several representative runs of the randomized private algorithm PTarget (red). Colored circles indicate points at which the corresponding algorithm has first discovered a new targeted component. Right shows average performance over 200 trials for the private algorithm with 1-SD error bars, together with the private algorithm's risk multiplier (with error bars). In this regime, after a brief initial flurry of small-component discovery, both algorithms find the dominant component, so the private performance closely tracks the nonprivate, and the private algorithm's risk multiplier quickly levels off at around only 1.17.

The first regime (Fig. 2) occurs when the largest connected component of the targeted subnetwork is much larger than all of the other components. In this regime, if both algorithms begin at a seed vertex inside the largest component, there is effectively no difference in performance, as both algorithms remain inside this component for the duration of their budget and find identical sets of targeted individuals.
More generally, if the algorithms begin at a seed outside the largest component, relative performance is a race to find this component; the private algorithm lags slightly because of the added noise, but is generally quite competitive (see Fig. 2).

Fig. 3. Same format as in Fig. 2, but for a case in which the component sizes are more evenly distributed, though still relatively large. The performance of both algorithms is hampered by the longer time spent investigating nontargeted vertices (note the smaller scale of the y axis compared with Fig. 2). Targeted component discovery is now more diffuse. The private algorithm remains competitive but lags slightly, and as per Theorem 1 the risk multiplier grows (but remains modest) as more targeted components are discovered.

The second regime (Fig. 3) occurs when the component sizes are more evenly distributed, but there remain a few significantly larger components. In this setting, both algorithms spend more of their budget outside the targeted subpopulation "searching" for these components. Here the performance of the private algorithm lags more significantly—because both algorithms behave the same when inside a component, the smaller the components are, the more detrimental the noise is to the private algorithm (though again we see particular runs in which the randomness of the private algorithm permits it to actually outperform the nonprivate one).

Fig. 4. A case with a highly fragmented targeted subpopulation. Both algorithms now spend most of their budget investigating nontargeted vertices and suffer accordingly.

The third regime (Fig. 4) occurs when all of the targeted components are small, and thus both algorithms suffer accordingly, discovering only a few targeted individuals; but again the private algorithm compares favorably with the nonprivate one, finding only a few fewer targeted vertices.
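To make the privacy accounting concrete, a short worked calculation (ours, using only quantities stated above): with sensitivity Δ(f) = 1 and Laplace scale 20, each new-component search is ε = 1/20 private, so after k targeted components have been discovered the guarantee is ((k − 1)/20)-protected differential privacy and the reported risk multiplier is e^{(k−1)/20}. The particular values of k below are illustrative.

```latex
% Risk multiplier implied by the experimental instantiation
% (\Delta(f) = 1,\ \mathrm{Lap}(20) \Rightarrow \varepsilon = 1/20 per new component):
e^{(k-1)\varepsilon} = e^{(k-1)/20},
\qquad k = 4 \;\Rightarrow\; e^{0.15} \approx 1.16,
\qquad k = 14 \;\Rightarrow\; e^{0.65} \approx 1.92 .
```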
Conclusion

We view the work presented here as a proof of concept: despite the fact that using network analysis to identify members of a targeted population is intrinsically contrary to the privacy of the targeted individuals, we have shown that there is no inherent reason why informational privacy guarantees cannot be given to individuals who are not members of the targeted population, and that these privacy guarantees need not severely harm our ability to find targeted individuals. Our work is of course not a complete solution to the practical problem, which can differ from our simple model in many ways. Here we highlight just one interesting modeling question for future work: Is it possible to give rigorous privacy guarantees to members of the protected population when membership in the targeted population is defined as a function of the individuals' private data? In our model, we avoid this question by endowing the algorithm with a costly "investigation" operation, which we assume can infallibly determine an individual's targeted status—but it would be interesting to extend our style of analysis to situations in which this kind of investigation is not available.

References

1. Greenwald G (June 6, 2013) NSA collecting phone records of millions of Verizon customers daily. The Guardian. Available at www.theguardian.com/world/2013/jun/06/nsa-phone-records-verizon-court-order.
2. National Research Council (2015) Bulk Collection of Signals Intelligence: Technical Options (The National Academies Press, Washington, DC).
3. Allen J (October 25, 2014) U.S. nurse quarantined over Ebola calls treatment "frenzy of disorganization." Reuters. Available at www.reuters.com/article/health-ebola-usaobama-idUSL6N0SK0IN20141026.
4. Dwork C, McSherry F, Nissim K, Smith A (2006) Calibrating noise to sensitivity in private data analysis. Proceedings of the Third Theory of Cryptography Conference (Springer, New York), pp 265–284.
5. Solove DJ (2006) A taxonomy of privacy. Univ Pa Law Rev 154(3):477–564.
6. Dwork C, McSherry FD (2010) Selective privacy guarantees. US Patent 7,818,335.
7. He X, Machanavajjhala A, Ding B (2014) Blowfish privacy: Tuning privacy-utility tradeoffs using policies. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (ACM, New York), pp 1447–1458.
8. Hay M, Li C, Miklau G, Jensen D (2009) Accurate estimation of the degree distribution of private networks. Ninth IEEE International Conference on Data Mining (IEEE, New York), pp 169–178.
9. Kasiviswanathan SP, Nissim K, Raskhodnikova S, Smith A (2013) Analyzing graphs with node differential privacy. Theory of Cryptography Conference (Springer, New York), pp 457–476.
10. Blocki J, Blum A, Datta A, Sheffet O (2013) Differentially private data analysis of social networks via restricted sensitivity. Proceedings of the 4th Conference on Innovations in Theoretical Computer Science (ACM, New York), pp 87–96.
11. Dwork C, Roth A (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9(3-4):211–407.
12. Bassily R, Groce A, Katz J, Smith A (2013) Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy. IEEE 54th Annual Symposium on Foundations of Computer Science (IEEE, New York), pp 439–448.
13. Kearns M, Pai MM, Roth A, Ullman J (2014) Mechanism design in large games: Incentives and privacy. Am Econ Rev 104(5):431–435.
14. Alaggan M, Gambs S, Kermarrec A-M (2015) Heterogeneous differential privacy. arXiv:1504.06998.