Privacy and Cooperation in Peer-to-Peer Systems

Dissertation

for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft,

by the authority of the Rector Magnificus, prof. ir. K. C. A. M. Luyben, chair of the Board for Doctorates,

to be defended publicly on Friday 22 May 2015 at 15:00

by

Nicolaas Simon Marinus Zeilemaker

engineer in computer science


Composition of the doctoral committee:

Rector Magnificus, chair

Prof. dr. ir. H. J. Sips, Technische Universiteit Delft, promotor

Dr. ir. J. A. Pouwelse, Technische Universiteit Delft, copromotor

Prof. dr. T. Strufe, Technische Universität Darmstadt

Prof. dr. ir. A. K. Pras, Universiteit Twente

Prof. dr. A. Hanjalic, Technische Universiteit Delft

Prof. dr. C. Witteveen, Technische Universiteit Delft

Prof. dr. M. J. G. van Eeten, Technische Universiteit Delft

The work described in this thesis has been carried out in the ASCI graduate school. ASCI dissertation series number 327.

This work was supported by the Future and Emerging Technologies programme FP7-COSI-ICT of the European Commission through the QLectives project (grant no.: 231200).

Printed by: Proefschriftmaken.nl || Uitgeverij BOXPress

Published by: Uitgeverij BOXPress, ’s-Hertogenbosch

Front & Back: Janna Alberts

© 2015 Nicolaas Simon Marinus Zeilemaker

ISBN 978-94-6295-195-2

An electronic version of this dissertation is available at


Acknowledgments

Although I never had any problems implementing/building stuff, I needed quite some help to be able to sell my ideas to the research community. This, however, never seemed to bother either my promotor or co-promotor, who were confident that a suitable venue for my ideas would be found. This clear vote of confidence helped me to see the upsides of yet another reject.

Johan, talking with you remains an adventure to this day. Before starting a discussion, I would have a general idea or direction of where I wanted to go with a paper. Afterwards, however, the direction had usually changed completely. Not because you convinced me, or tricked me into going in another direction, but because of seemingly random remarks you made during the discussion. This is a skill which I have yet to discover in anyone else, and it was a great help to me during my time as a PhD student.

Henk, without your advice on the structure and layout of a paper, this thesis would not have been possible. I know now that all details matter when presenting an idea in a paper, and that without proper structure not a single reader/reviewer will grasp the gems hidden beneath.

Janna, although you did not make any technical contributions to my work, you acted as a sparring partner, helping me to understand the core of my ideas. Moreover, you (almost) never complained when I started running a new experiment during dinner, while on holiday, or when we were almost going to sleep. Finally, when travelling to conferences you were a great help, supporting me in preparing my presentation, and not being focussed on shopping at all... This helped me to get some perspective on doing a PhD: it's probably not going to change the world.

Contents

4 Open2Edit: a Peer-to-Peer platform for collaboration
4.1 Introduction
4.2 Related work
4.3 Motivation
4.4 Dispersy
4.4.1 Overview
4.4.2 Permissions
4.5 Open2Edit design
4.5.1 CommunityOverlay
4.5.2 CommunityDiscoveryOverlay
4.5.3 Authentication
4.5.4 Flexibility
4.5.5 Detecting conflicts
4.6 Tribler deployment
4.6.1 Tribler overview
4.6.2 Implementation details
4.6.3 User interface
4.7 Experiments
4.7.1 DAS-4 emulation
4.7.2 Internet deployment
4.8 Discussion
4.9 Conclusion

5 Building a Privacy-Preserving Semantic Overlay for Peer-to-Peer Networks
5.1 Introduction
5.2 Related Work
5.3 Preliminaries
5.3.1 RSA
5.3.2 Paillier
5.4 Privacy-Preserving Protocols
5.4.1 Private Set Intersection Protocol I
5.4.2 Private Set Intersection Protocol II
5.4.3 Private Set Intersection Protocol III
5.4.4 Security Discussion
5.4.5 Improving Discovery Speed
5.5 Experimental Results
5.5.1 Dataset
5.5.2 Experimental setup
5.5.3 Results

6 4P: Performant Private Peer-to-Peer File Sharing
6.1 Introduction
6.2 Related Work
6.3 Cost and Limitations of Current Systems
6.3.1 Bandwidth Cost
6.3.2 Limitations
6.4 4P Design
6.5 4P Implementation
6.5.1 Semantic Overlay
6.5.2 Search Messages
6.5.3 Downloading Content
6.5.4 Choosing the initial TTL, IEP, and FEP
6.6 Security
6.7 Experiments
6.7.1 Dataset
6.7.2 Emulation setup
6.7.3 Evaluated strategies
6.7.4 Results
6.8 Conclusion

7 ReClaim: a Privacy-Preserving Decentralized Social Network
7.1 Introduction
7.2 Related Work
7.3 Goals and Security model
7.4 Features
7.4.1 Establishing a new Friendship
7.4.2 Locating Friends
7.4.3 Posting a Message
7.4.4 Unfriending
7.5 Protocol Details
7.5.1 PSI protocol
7.5.2 Message Distribution
7.5.3 Message Body
7.5.5 Peercache
7.6 Application Design
7.7 Evaluation
7.7.1 Dataset
7.7.2 Setup
7.7.3 Bootstrapping
7.7.4 Message distribution
7.8 Conclusion

8 Conclusion
8.1 Conclusions
8.2 Suggestions for future work

Bibliography

Summary

Samenvatting (summary in Dutch)

Curriculum Vitæ

1

Introduction

In a traditional centralized network architecture, the central server (or a group of servers) needs to be able to handle all requests issued by clients. As shown by popular websites such as YouTube¹ and Facebook², this is not necessarily a problem, but it requires significant investments in infrastructure. For example, YouTube uses a large network of servers distributed across different continents to be able to serve data to clients from a server which is geographically close to them [94].

In contrast, in a Peer-to-Peer (P2P) network the resources available at clients are employed. For instance, by using spare bandwidth at clients, a P2P network reduces the bandwidth requirements at the server. This lowers the initial investment, as fewer servers need to be bought. However, the protocols which allow clients to collaborate need to be carefully designed, as clients in a P2P network are highly susceptible to attacks. Attackers can try to exploit their willingness to collaborate in order to gain knowledge, perform synchronized attacks on others, or free-ride on the system. Moreover, attackers can more easily target and influence a single client in a P2P network; hence a need arises for collaboration between honest clients to detect possible malicious acts.

A considerable amount of research focuses on collaboration and pooling of resources, improving efficiency, reducing the vulnerability to attacks, etc. However, most research efforts investigate solutions which overcome bandwidth bottlenecks in order to efficiently share files or create backup solutions. This overlooks that collaboration in a P2P network can improve upon other kinds of centralized solutions as well.

In this thesis, we focus on collaboration between clients to distribute messages, improve privacy, and implement decentralized search. Choosing a decentralized over a centralized solution will initially make otherwise trivial problems (such as making a piece of information available to all clients) substantially more difficult. However, after overcoming those initial problems, a decentralized solution requires a smaller investment in servers, has lower running costs, improves privacy because not all data has to be stored in the same place, and can be made more robust against attacks as there is no central point of failure.

¹ www.youtube.com
² www.facebook.com

Figure 1.1: Different overlay structures (panels: Centralized, Unstructured, Structured, Hybrid)

Nevertheless, in order to exploit those benefits fully, care must be taken when designing the collaboration protocols. Poorly designed protocols can cause even greater privacy problems compared to a centralized solution, as an attacker would only be required to listen and record the messages sent by clients to construct a dataset consisting of potentially privacy sensitive information.

1.1. Collaboration in P2P

In a P2P network, the roles of the participating peers (or clients) are equal. The term peer stands for “a person of the same age, status, or ability as another specified person”³. Without any specific roles, a P2P network depends on collaboration between peers in order to achieve a working system.

Figure 1.1 gives an overview of the different overlay (network) structures used in P2P networks. In the following subsections, we give an example of each of the overlay structures. Moreover, we elaborate on current state-of-the-art P2P collaboration schemes, their goals, and their security against malicious peers.

1.1.1. File Sharing

Using the available bandwidth at peers can substantially reduce the bandwidth required at a server. In a P2P file-sharing network, the server could be compared to the first peer injecting a file into the network. This peer alone usually does not have the bandwidth to send a full copy of the file to all other peers requesting it. However, after the initial peer has uploaded one or more copies of the file to others, some of these other peers will assist the initial peer by uploading the file to others.

One of the first P2P file-sharing systems, Napster, was cited as the “destructor” of copyright [53]. When introduced in 1999, it allowed its users to easily share files. Searching for files relied on central servers, but after receiving a list of files which were ready to be downloaded, a peer made a direct connection to another peer in order to download a file. Peers in the Napster network downloaded a file from a single source, which would upload the complete file. However, a peer could manually choose between sources to improve its download speed.

Gnutella [82] removed the need for centralized search servers by implementing a decentralized search protocol. It used an unstructured overlay to connect peers with each other in a more or less random manner. An unstructured overlay has low maintenance costs, but searching it efficiently can be hard. Initially, the developers opted for flooding. This is a technique in which a search message includes a TTL (time-to-live) property which is decremented whenever a peer receives the message. Each peer forwards a search message to all of its neighbors (active connections) while the TTL is larger than 0. The TTL value is set by the initial peer and determines how many other peers the message reaches. However, even small values for the TTL result in a massive number of search messages being sent.
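To make the cost of flooding concrete, the following is a minimal sketch in Python rather than the actual Gnutella protocol. It assumes a random overlay built by a hypothetical helper `build_random_overlay`, assumes that peers forward a query only the first time they see it, and counts every forwarded copy as one message.

```python
from collections import deque
import random


def build_random_overlay(num_peers, degree, seed=0):
    """Connect every peer to `degree` randomly chosen others (undirected)."""
    rng = random.Random(seed)
    neighbors = {peer: set() for peer in range(num_peers)}
    for peer in range(num_peers):
        others = [p for p in range(num_peers) if p != peer]
        for other in rng.sample(others, degree):
            neighbors[peer].add(other)
            neighbors[other].add(peer)
    return neighbors


def flood(neighbors, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}              # peers that already processed the query
    messages = 0
    queue = deque([(source, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:       # TTL exhausted: do not forward any further
            continue
        for neighbor in neighbors[peer]:
            messages += 1        # every forwarded copy costs one message
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return len(seen) - 1, messages


overlay = build_random_overlay(num_peers=1000, degree=4)
for ttl in range(1, 6):
    reached, messages = flood(overlay, source=0, ttl=ttl)
    print(f"TTL={ttl}: reached {reached} peers using {messages} messages")
```

In this sketch the number of messages grows roughly geometrically with the TTL until the whole overlay has been covered, and many of those messages are duplicates delivered to peers that have already seen the query.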

eMule [54] started off as an open-source client for the eDonkey network, but its developers quickly extended its functionality by integrating a Kademlia-based DHT. A DHT (Distributed Hash Table) is a popular method to create a structured overlay. In contrast to an unstructured overlay, maintaining it costs more, but searching or looking up values in a DHT requires fewer messages to be sent. Every peer in a DHT initially generates an identifier, usually by hashing its own IP address. The identifiers position peers in the DHT ring, in which the peer with identifier 0 is positioned at the top, peer 1 next to it clockwise, etc. Next, peers maintain a connection to their neighbors both before and after them, and to peers which are located further away in the ring at increasingly larger distances. The distance between two identifiers is computed as their XOR distance. When searching a DHT, a peer first converts the keyword or keywords into an identifier by applying the same hash function. The peer then forwards the search message to the peer whose identifier is the closest to this hashed keyword. This peer will do the same, until the message arrives at the peer responsible for the keyword, which then replies.

This causes a search message to be forwarded O(log N) times, as Kademlia [63] provably contacts this many peers during a lookup. Reducing both the latency and the number of messages sent makes it an efficient method for looking up search results. However, in order to allow the responsible peer to reply with a list of results, every peer in the network needs to announce what it is sharing. Therefore, each peer sends a message per keyword it has matching files for to the responsible peer. Without optimizations, this approach results in a similar overhead compared to flooding an unstructured overlay.
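The lookup described above can be sketched as follows. This is a simplified illustration, not eMule's or Kademlia's implementation: it assumes 16-bit identifiers, a single contact per bucket, and a hypothetical set of peer addresses and keyword, whereas deployed DHTs use 160-bit identifiers, several contacts per bucket, and parallel requests.

```python
import hashlib
import random

ID_BITS = 16  # toy identifier size; an assumption for readability


def make_id(text):
    """Hash a string (peer address or keyword) into the identifier space."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << ID_BITS)


def build_routing_table(own_id, all_ids, rng):
    """Keep one random contact per bucket; bucket i holds the peers whose
    highest differing bit with own_id is bit i."""
    buckets = {}
    for other in all_ids:
        if other != own_id:
            bucket = (own_id ^ other).bit_length() - 1
            buckets.setdefault(bucket, []).append(other)
    return [rng.choice(members) for members in buckets.values()]


def lookup(start_id, key_id, tables):
    """Greedily forward towards the peer whose identifier is XOR-closest."""
    current, hops = start_id, 0
    while True:
        best = min(tables[current], key=lambda peer: peer ^ key_id)
        if (best ^ key_id) >= (current ^ key_id):
            return current, hops   # no known contact is closer: responsible
        current, hops = best, hops + 1


rng = random.Random(1)
# Hypothetical peer addresses, hashed into identifiers.
peer_ids = sorted({make_id(f"10.0.{i // 256}.{i % 256}") for i in range(500)})
tables = {pid: build_routing_table(pid, peer_ids, rng) for pid in peer_ids}
key = make_id("ubuntu")            # hypothetical search keyword
responsible, hops = lookup(peer_ids[0], key, tables)
print(f"peer {responsible} is responsible for the keyword, found in {hops} hops")
```

Each hop clears the highest remaining bit of the XOR distance for which a closer peer exists, so the number of hops is bounded by the identifier length and in practice grows with the logarithm of the number of peers.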

Kazaa [58] attempted to combine the best of both worlds. Its overlay, at its core, is an unstructured one, but in this unstructured overlay some peers are more important than others. These peers are called superpeers, and they index the “ordinary” peers connected to them. Superpeers are peers which have a fast Internet connection, a powerful CPU, no firewall, and have been online for a long time. Amongst superpeers a separate superpeer network is constructed, which consists of far fewer peers. Whenever an ordinary peer wants to search the network, it sends the query to the superpeer it is connected to, which then floods the superpeer network and replies with the result list. Because the superpeer network contains fewer peers, and the peers it does contain have above-average performance, flooding this network can achieve excellent results. A network consisting of ordinary peers and superpeers is often referred to as a hybrid overlay.

BitTorrent [74] focuses on improving the speed at which peers in a P2P network can download files. It splits a file into pieces, and uses a descriptor called a torrent to store their checksums. Moreover, it uses an identifier called the infohash to distinguish files. The infohash is computed by hashing the info part of the torrent, which contains, among other things, the filename and the checksums of the pieces. Using the checksums, a peer can download multiple pieces in parallel and verify each piece individually. This overcomes a substantial bandwidth bottleneck, as a peer can now download from multiple peers at the same time. The group of peers which are downloading/uploading the same file is often referred to as a swarm.
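A minimal sketch of per-piece verification is given below. BitTorrent uses SHA-1 for its piece checksums; the tiny piece size and the `toy_infohash` encoding are assumptions made for illustration, as real torrents hash a bencoded info dictionary.

```python
import hashlib

PIECE_SIZE = 4  # bytes; tiny on purpose, real torrents use far larger pieces


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file content into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_checksums(data, piece_size=PIECE_SIZE):
    """One SHA-1 digest per piece, as stored in the descriptor."""
    return [hashlib.sha1(p).digest() for p in split_into_pieces(data, piece_size)]


def verify_piece(index, piece, checksums):
    """Check a piece received from any peer against the descriptor."""
    return hashlib.sha1(piece).digest() == checksums[index]


def toy_infohash(filename, checksums):
    """Hash of the descriptor's info part (name plus piece checksums).
    Hypothetical encoding: real torrents hash a bencoded dictionary."""
    info = filename.encode("utf-8") + b"".join(checksums)
    return hashlib.sha1(info).hexdigest()


content = b"peer-to-peer data"
checksums = piece_checksums(content)
pieces = split_into_pieces(content)
print(len(pieces), "pieces, infohash", toy_infohash("example.bin", checksums))
print(verify_piece(2, pieces[2], checksums))      # an untouched piece
print(verify_piece(2, b"evil", checksums))        # a corrupted piece
```

Because every piece carries its own checksum, a peer can accept pieces from different sources in any order and discard a corrupted piece without restarting the download.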

Another improvement BitTorrent incorporated was rarest-first downloading, which improves the overall health of the swarm in which peers are downloading/uploading. Each peer using this policy maintains a list of the pieces available at its directly connected neighbors, and then downloads the pieces which are the “rarest” first. Downloading the rarest pieces first helps the overall piece availability, and improves the ability to barter (trade) for pieces with other peers [57].
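The rarest-first policy can be sketched in a few lines. The neighbor names and piece sets are hypothetical, and ties are broken by piece index only to keep the example deterministic.

```python
from collections import Counter


def rarest_first(missing, neighbor_pieces):
    """Order the pieces we still miss by how few neighbors hold them."""
    availability = Counter()
    for pieces in neighbor_pieces.values():
        availability.update(pieces)
    # Pieces that no neighbor holds cannot be requested at all.
    candidates = [p for p in missing if availability[p] > 0]
    return sorted(candidates, key=lambda p: (availability[p], p))


neighbor_pieces = {"a": {0, 1, 2}, "b": {1, 2}, "c": {2, 3}}
order = rarest_first({0, 1, 2, 3, 4}, neighbor_pieces)
print(order)
```

Pieces 0 and 3 are each held by a single neighbor and are therefore requested before pieces 1 and 2; piece 4 is skipped because no neighbor can provide it.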

Finally, BitTorrent incorporated an incentive mechanism to promote uploading. Without an incentive mechanism, there is no reason for peers to upload while downloading a file. But because of tit-for-tat, the incentive mechanism implemented by BitTorrent, each peer will only upload to the peers which have given it the most in return. These peers are unchoked, which allows them to send requests for pieces. Additionally, each peer has an optimistic unchoke slot to allow a random peer to prove that it is willing to reciprocate. Although tit-for-tat does not solve the incentive problem (peers can still leave before uploading as much as they have downloaded), it improves upon all existing systems.
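One round of the unchoking decision can be sketched as follows. The number of regular slots, the peer names, and the byte counts are assumptions for illustration; actual clients differ in slot counts and in how they measure recent contributions.

```python
import random


def choose_unchoked(received_bytes, regular_slots=3, rng=random):
    """Return (regular, optimistic) unchoke decisions for one round.

    received_bytes maps each neighbor to the bytes it recently gave us.
    """
    ranked = sorted(received_bytes, key=received_bytes.get, reverse=True)
    regular = ranked[:regular_slots]          # reward the best uploaders
    choked = ranked[regular_slots:]
    # Optimistic unchoke: give one random other peer a chance to reciprocate.
    optimistic = rng.choice(choked) if choked else None
    return regular, optimistic


received = {"a": 500, "b": 0, "c": 900, "d": 100, "e": 0}
regular, optimistic = choose_unchoked(received, regular_slots=2,
                                      rng=random.Random(0))
print("unchoked:", regular, "optimistic unchoke:", optimistic)
```

The optimistic slot is what allows newcomers without any upload history to start trading.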

1.1.2. Reputation

Being able to determine the reputation of a peer, which reflects its long-term behaviour, can improve both the performance and the security of a P2P network. As described above, BitTorrent lacks an incentive mechanism that prevents peers from leaving before they upload as much as they downloaded. Similarly, peers in Gnutella should ignore or neglect those which do not share. As shown by Adar et al. [4], almost 70% of peers in Gnutella do not share any files (and hence only download) and roughly 50% of replies to search queries are from the top 1% of sharing hosts (overloading those in the process).

A reputation system capable of recording/representing this behaviour in a reputation score will allow peers to avoid such free-riders and improve the download speed of peers which upload at least as much as they download. Generally, a reputation system warns honest peers about maliciously behaving others, thereby allowing them to be more vigilant while interacting with them.

However, creating a distributed reputation system has proved to be very hard, as it must be able to resist attackers trying to exploit it. A reputation system which is easily exploitable might even cause more harm than good to a P2P network, as honest peers will trust malicious peers instead of being cautious with everyone.

An example of a deployed reputation system is BarterCast [67]. In this system, peers collect BarterCast records describing the upload and download traffic between two peers. Both peers sign this record, indicating that both agree on the content. By collecting BarterCast records, a peer can estimate the reputation score of the peers it is interacting with, based on their previous download/upload behaviour. However, a peer cannot collect all BarterCast records as there are simply too many, and hence it needs to decide which to collect/keep. Therein lies the difficulty of a reputation system: a peer needs to predict which peers it needs to collect BarterCast records from, and which peers it can ignore. Moreover, it needs to do so without being susceptible to attacks. A recent improvement to BarterCast uses similarity between peers to determine which records are interesting and which are not [34].

1.1.3. Database Synchronization

Synchronizing a database between two or more peers is a well-defined problem. In 1987, Demers et al. [35] outlined methods to achieve consistency across many nodes linked together by a large, heterogeneous, slightly unreliable, and slowly changing network. Moreover, they stated that these nodes can only be consistent after all updating activity has stopped.

Compared to file sharing, database synchronization is a slightly different problem as databases can change while synchronizing. Synchronizing a dynamic/changing file prevents peers from computing checksums in advance and hence prevents other peers from verifying those after receiving part of it. Moreover, a method is required which is able to detect which parts of a database are missing/updated.

A common approach is to use Merkle Trees [66], as implemented in Amazon’s Dynamo [32] distributed key-value store. Each peer builds a Merkle Tree of its current database by first hashing each row, and then building a tree by inserting parents which are the hash of their children. The root hash of a Merkle Tree can be used to check whether there are any differences between two databases, and if so, by traversing the tree peers can efficiently determine where the changes are located.
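The comparison can be sketched as follows. This is a minimal illustration and not Dynamo's implementation: it assumes that both databases hold the same number of rows in the same order, uses SHA-256 as the hash function, and duplicates the last hash on levels with an odd number of nodes.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build_merkle_tree(rows):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [h(row) for row in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # odd count: duplicate last hash
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_rows(tree_a, tree_b):
    """Walk down from the root, descending only into unequal subtrees."""
    if tree_a[-1] == tree_b[-1]:
        return []                          # equal roots: databases match
    suspects = [0]
    for depth in range(len(tree_a) - 2, -1, -1):
        level_a, level_b = tree_a[depth], tree_b[depth]
        children = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if child < len(level_a) and level_a[child] != level_b[child]:
                    children.append(child)
        suspects = children
    return suspects


rows_a = [b"row-0", b"row-1", b"row-2", b"row-3", b"row-4"]
rows_b = [b"row-0", b"row-1", b"row-2", b"row-3 (updated)", b"row-4"]
tree_a, tree_b = build_merkle_tree(rows_a), build_merkle_tree(rows_b)
print(differing_rows(tree_a, tree_b))
```

Only the hashes on the path from the root to a changed row need to be exchanged, so two peers with nearly identical databases can locate a difference with a logarithmic number of comparisons.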


1.2. Privacy in P2P

A P2P network can also yield substantial privacy improvements over a solution which uses a central server. By removing the central component, a P2P network gets rid of the single point of storage of privacy-sensitive information, which is highly susceptible to attacks. Moreover, a peer in a P2P network has more control over what is revealed and to whom than in a typical client-server setup, wherein a client can only decide to either trust a server or not.

In the following subsections, we will elaborate on current state-of-the-art P2P privacy preserving protocols, their goals and their security against malicious peers.

1.2.1. Hiding your Location

Tor⁴ is an open-source project aiming to improve the privacy and security of its users on the Internet. Tor uses virtual tunnels to hide the location, or IP address, of the original user requesting a website. All data sent over the tunnels is encrypted; only the source and endpoint of a tunnel can read the messages. The endpoint of a tunnel needs to be able to read the message, as it will act as if it is connecting to the website itself.

Tor employs onion routing in order to create the virtual tunnels. Users sending a message over a virtual tunnel encrypt it multiple times, once for each hop in the tunnel, each time with a separate key shared only with that hop. After receiving a message, each hop in the tunnel decrypts it before forwarding it to the next hop, peeling the layers of the onion. The endpoint does the same, and ends up with an unencrypted request. When the endpoint wants to send a reply, it encrypts it with its own key, and each hop in the virtual tunnel does the same. Finally, when the source receives the encrypted reply, only it can decrypt it, as it has the keys of every hop in the tunnel.
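The layering can be illustrated with a toy example. The cipher below is a simple XOR keystream derived from SHA-256, chosen only because it needs nothing beyond the standard library; it is not secure, it is not what Tor uses, and the hop keys are hypothetical values that a real client would negotiate with each hop.

```python
import hashlib


def keystream_xor(key, data):
    """Toy symmetric cipher: XOR with a SHA-256 based keystream.
    Applying it twice with the same key restores the input. NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(message, hop_keys):
    """Source side: add one layer per hop, innermost layer for the endpoint."""
    for key in reversed(hop_keys):
        message = keystream_xor(key, message)
    return message


def forward_through_tunnel(onion, hop_keys):
    """Each hop in turn peels its own layer before forwarding."""
    for key in hop_keys:
        onion = keystream_xor(key, onion)
    return onion


hop_keys = [b"key-hop-1", b"key-hop-2", b"key-endpoint"]   # hypothetical keys
onion = wrap(b"GET /index.html", hop_keys)
after_first_hop = keystream_xor(hop_keys[0], onion)
request = forward_through_tunnel(onion, hop_keys)
print(request)
```

After the first hop has peeled its layer, the message is still unreadable; only after the endpoint removes the last layer does the original request appear. The reply travels in the opposite direction, with every hop adding its layer and the source removing all of them.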

After some time, Tor destroys its current tunnels and builds new ones. This prevents endpoints from being able to collect too much information. However, the protocols a user runs on top of the virtual tunnels can still leak privacy-sensitive information, and the user itself is still vulnerable to attacks. Recently this vulnerability was exploited by the FBI, which injected JavaScript into a website to break out of the Tor tunnels and make users expose their location⁵.

1.2.2. Hiding your Search Queries

Searching could also be made more privacy preserving using a P2P network. By hiding the user issuing a query, it can be made more difficult to create individual profiles of users as is commonly performed by centralized websites.

Popular examples of P2P networks which aim to provide their users with privacy-preserving search are Freenet [22], Gnunet [13], OneSwarm [46], and RetroShare [79]. These networks employ well-known techniques from cryptography in order to protect the privacy of their users. One example is path length obfuscation, wherein peers construct paths of an unknown length between the user searching and the user replying to a query. Another is end-to-end encryption, wherein all messages sent between the user searching and the user replying are encrypted. Finally, the latter two networks use a Friend-to-Friend network, wherein users have to manually connect to others by establishing friendships with them.

⁴ https://www.torproject.org/

Most networks achieve a certain level of anonymity against collaborating peers in the network; however, none are able to resist a local eavesdropper. Such an attacker is able to observe all network traffic received by and sent from a node, and hence can detect when a peer creates a new query, replies to a query, etc. This does not necessarily mean that the attacker can read the queries, as those could be sent in encrypted form, but simply that it can prove that a user is issuing them.

1.2.3. Hiding your Social Network

Social networks are another prime example of a field wherein a P2P network could potentially improve the privacy of its users. Instead of relying on the security of centralized websites, peers can limit the exposure of their data by storing it in encrypted form and deciding who can decrypt it.

We distinguish two approaches to hiding your social network. First, approaches similar to Scramble [10] use an existing social network to relay encrypted updates to friends. Scramble uses broadcast encryption to send multiple friends an encrypted update which only needs to be stored once at an external storage provider. Broadcast encryption is a technique which allows multiple users using different decryption keys to decrypt the same file. Using a browser plugin, Scramble hides the decryption process from its users, letting them communicate more securely with ease. However, because they freeride on an existing network, these approaches can be easily blocked.

The second approach is to implement a new social network which does not rely on existing centralized social networks. Systems such as Diaspora (a federated approach) [1], Safebook [31], and Cachet [71] fall into this category. Although Diaspora is the most well known, it does not have a truly decentralized design, but allows users to set up their own server, which can then be integrated into the current system. Safebook and Cachet both rely on a DHT to exchange updates between users. However, Safebook additionally requires a centralized identity server which hands out identities after verifying an existing one. Cachet does away with that requirement, but exposes the social graph in order to optimize performance.

1.3. Popular Attacks on P2P systems

Without a reputation system, peers in a P2P network rely on an incentive mechanism designed to be robust to strategic manipulation. Such incentive mechanisms are often derived from game theory, wherein the behaviour of users is modelled when confronted with strategic decisions [77].

An example from game theory is the public goods game, wherein each player has a private pool of tokens. In each round, a player has to decide how many of their private tokens to put in a public pot. The tokens in the public pot are multiplied by a factor, and then divided equally over all players. In this game, the total profit is maximized if all players put all of their tokens in the public pot. However, a player that does not contribute will still receive a payout. Hence, a rational player would not contribute at all, and only profit from the contributions of others. This is a very successful strategy which allows a single player to freeride on the system. However, if all players employ this strategy, no one will benefit from the multiplication factor of the public pot.
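The payoffs in one round can be worked out with a few lines of Python. The numbers used here (four players, ten tokens each, a multiplication factor of 2) are assumptions chosen for illustration; the dilemma arises whenever the factor is larger than 1 but smaller than the number of players.

```python
def payoffs(contributions, endowment, factor):
    """Return each player's token count after one round of the game."""
    pot = sum(contributions) * factor
    share = pot / len(contributions)       # the pot is divided equally
    return [endowment - c + share for c in contributions]


everyone_contributes = payoffs([10, 10, 10, 10], endowment=10, factor=2)
one_freerider = payoffs([0, 10, 10, 10], endowment=10, factor=2)
nobody_contributes = payoffs([0, 0, 0, 0], endowment=10, factor=2)
print(everyone_contributes, one_freerider, nobody_contributes)
```

With full cooperation every player ends up with 20 tokens. A single freerider ends up with 25 tokens while the contributors drop to 15, and if nobody contributes everyone keeps their original 10 tokens.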

The incentive mechanism of BitTorrent is meant to prevent others from freeriding. However, without initial risk taking (the optimistic unchoke slot), the protocol cannot bootstrap, as no preexisting trust relationships exist. Locher et al. [59] have shown that this mechanism can be easily exploited: if an attacker simply connects to many peers, its probability of being selected as the randomly chosen, optimistically unchoked peer increases, and it receives bandwidth for free. Similar to the public goods game, this is a successful strategy as long as it is not employed by all peers in the network.

A more general attack, which does not target a specific P2P protocol but is applicable to most, is called the sybil attack [37]. Herein an attacker poses as multiple peers by creating multiple identities. Using these identities an attacker can, for instance, gain reputation by creating fake upload records between them. This results in a single identity having a great reputation, as it seems that it uploaded to many other peers, and numerous identities which have a bad reputation. By simply throwing away the identities with a bad reputation, the attacker now has a great reputation without needing to upload to anyone.

Somewhat similar is the eclipse attack [88], wherein an attacker joins the P2P network more than once, and uses those peers to control the view of others. Eclipsing a peer consists of controlling all or most of its connections to the network; by doing this the attacker can greatly influence the view this peer has of the network. Moreover, after eclipsing a peer, an attacker can gain much more knowledge about its behaviour, greatly impacting its privacy. For example, the attacker can detect which search queries the peer issues, which files it is uploading, etc.

Both the sybil and eclipse attacks have been well researched, and numerous papers have been published on evaluating their impact [52,62], reducing it [85,100], or even preventing the attacks from happening [60,88]. However, most “solutions” either rely on a trusted third party (TTP) in charge of handing out identities, or use another network (such as a social network) wherein edges between peers indicate friendship instead of being created at random. The idea is that attackers have great difficulty creating edges, as users in the social network will not accept friend requests from strangers.

1.4. Research Questions

Improving collaboration while maintaining privacy poses many challenges, and as such not many solutions have been developed which are ready for deployment with respect to efficiency and practicality.

In this thesis we aim to address the following research questions:

How can we efficiently synchronize messages in a P2P network? While a lot of research effort was put into analyzing and optimizing file sharing in a P2P network, much less is known about efficiently synchronizing messages. Although this is seemingly a trivial task, a P2P network suffers from intermittent connectivity between peers, large delays on links, and the complete absence of end-to-end paths. To overcome these problems and still be able to synchronize messages between all peers in a P2P network, the need for a stateless, NAT-puncturing synchronization method arises. Moreover, we envision a scenario in which all peers must be able to create messages, without knowing beforehand which of them will. This prevents peers from being able to detect when they have received/synchronized all messages, as new messages are created at irregular intervals.

How can we use a P2P network to build a collaborative platform that does not depend on central servers? Websites such as Wikipedia and StackExchange have shown that it is possible to build a system in which user knowledge, judgement, and expertise are donated without any apparent remuneration in return. Additionally, P2P networks have been shown to scale to accommodate substantial numbers of concurrent users, without the need for servers. By combining a P2P network with the methods used by collaboration websites to encourage user participation, a hybrid system arises which, in contrast to the centralized collaboration websites, does not need large initial investments or monthly costs to maintain an existing infrastructure. However, the latency introduced by the P2P network might cause collaboration to fail, as users spend effort on fixing/improving older versions of an article which has already been fixed.

How can we cluster peers by taste without exposing their privacy? It has been shown by Voulgaris et al. [97] that clustering peers by taste, or preferences, can result in a P2P network in which peers can efficiently perform keyword search. However, in order to allow a peer to find other peers which are similar to it taste-wise, protocol developers often resorted to letting peers broadcast their preferences in plain text. This results in a peer fully exposing its preferences for items or content, and therefore revealing potentially privacy-sensitive information. Can we employ methods from the cryptographic domain to improve upon this? And what are the costs associated with using those methods?

How can we query a P2P network efficiently while hiding the identity of the issuer? A semantic overlay, a P2P network in which we cluster peers by taste, holds great promise if we manage to build it in a private manner. However, even after constructing a semantic overlay, care must be taken in order to not reveal the source of a search query. Typically, a peer in a semantic overlay would send a query to its neighbors, which then reply with a result list. Hence, any peer receiving a query will know that the peer which sent the query is the source. Can we improve upon this? And what are the costs associated with hiding the source of a query? Can a semantic overlay reduce the bandwidth costs compared to currently deployed networks?

How can we enable peers to communicate with their friends privately, without requiring them to be online at the same time?

In an online social network (OSN), friends communicate with each other using a central server storing all their messages, likes, images, etc. Companies like Facebook provide their users with such a service; in return, however, they use the private data of users to attract advertisers in order to be able to run and maintain the infrastructure required to handle a network consisting of more than 200bn friendships. A P2P network could be a viable alternative, as it does not require the same amount of infrastructure to keep running, and hence does not require a similar investment in infrastructure.

1.5. Contribution and thesis outline

The contributions of this thesis are as follows:

Introducing Tribler, the P2P client used as a testbed for our P2P experiments (Chapter 2)

Tribler is an open-source software project initiated in 2005. It was funded by multiple European research grants, which over time helped it to become a viable testbed for testing new P2P protocols. Currently, Tribler has more than 1500 users daily. In this thesis, Tribler was used to evaluate the performance of Dispersy (Chapter 3) and Open2Edit (Chapter 4). This chapter is largely based on our work published in ACM MM 2011 [102]:

N. Zeilemaker, M. Capotă, A. Bakker, and J. Pouwelse. Tribler: P2P media search and sharing. In Proceedings of the 19th ACM international conference on Multimedia, 2011.

Synchronizing messages in a P2P network with Dispersy (Chapter 3)

Synchronizing messages in a P2P network, or in a network which isn’t necessarily client-server, has been well researched, going back as far as Demers et al. [35], who in 1987 described a theoretical model outlining the synchronization performance of a single message being created by a peer in a random network. Using this theoretical model, we design and implement a fully decentralized database built around a semi-random network. Additionally, we integrate NAT-traversal, to allow the database to continue to operate even when deployed in a challenged network such as the Internet. Finally, we extensively evaluate the performance of Dispersy, measuring propagation and synchronization speed, its bandwidth requirements, churn resilience, and throughput. This chapter is largely based on our work published in ACM SAC 2014 [107]:

N. Zeilemaker, B. Schoon, and J. Pouwelse. Large-scale message synchronization in challenged networks. In Proceedings of the 29th Annual ACM Symposium on Applied Computing (SAC), 2014.

Improving decentralized collaboration between users with Open2Edit (Chapter 4)

P2P file-sharing networks owe much of their success to the efficiency with which they can combine user-contributed resources. Similarly, websites such as Wikipedia and StackExchange connect experts with novices in such a manner that experts donate their knowledge without any remuneration in return. In this chapter, we introduce Open2Edit, a platform which enables developers to create decentralized versions of collaboration websites. In contrast to related works, Open2Edit does not implement a decentralized Wikipedia, but aims to be a flexible platform upon which collaborative systems can be developed. Using Open2Edit, we implement a YouTube-like media sharing system, integrate this system into Tribler, and show that users collaborated in discovering the best channels by casting over 100,000 votes. This chapter is largely based on our work published in IFIP Networking 2013 [101]:

N. Zeilemaker, M. Capotă, and J. Pouwelse. Open2edit: A peer-to-peer platform for collaboration. In IFIP Networking Conference, 2013.


(Chapter 5)

In this chapter, we evaluate three methods which are able to compute the cardinality of the intersection of two sets without disclosing the sets themselves. This problem is referred to as the Private Set Intersection Cardinality (PSI-C) problem, and we modify three existing, well-known solutions to it. We implement a semantic overlay wherein peers compute their similarity to others using the three methods, and show their performance differences, comparing the CPU time, bandwidth, and speed of discovering the most similar peers. This chapter is largely based on our work published in IEEE WIFS 2013 [103]:

N. Zeilemaker, Z. Erkin, P. Palmieri, and J. Pouwelse. Building a privacy-preserving semantic overlay for peer-to-peer networks. In Proceedings of the IEEE International Workshop on Information Forensics and Security, 2013.

Performant private file-sharing using 4P (Chapter 6)

A semantic overlay holds great promise in creating scalable decentralized search. However, searching on top of a semantic overlay will typically still expose the initiator of a query. In this chapter, we show that by combining a semantic overlay with existing privacy-enhancing features we can substantially reduce the number of messages sent per query while still achieving a high recall. As we show in Chapter 5, it is possible to create a semantic overlay in a private manner, and by combining this with probabilistic query forwarding, path uncertainty, and encrypted links, 4P implements a performant private P2P file sharing system ready for deployment. We extend a PSI-C method described in Chapter 5 to allow two peers to agree upon a session key which is kept safe from a man-in-the-middle attack. During the evaluation, we compare the performance of 4P to Gnutella, OneSwarm, and RetroShare. This chapter is largely based on our work published in IEEE P2P 2014 [105]:

N. Zeilemaker, J. Pouwelse, and H. Sips. 4P: Performant Private Peer-to-Peer File Sharing. In Proceedings of the IEEE International Conference on Peer-to-Peer Computing (P2P), 2014.

Creating a Privacy Preserving online social network (Chapter 7)

Using a P2P network to create a more privacy-preserving online social network has been widely researched. However, most papers either focus on storing the data in encrypted form while still letting users access it, or are built around a DHT wherein a user relies on strangers to store its data. A semantic overlay, wherein peers only connect to their friends and friends-of-friends, might be better suited, as these peers have a greater incentive to store your data. In this chapter, we introduce ReClaim, a privacy-preserving decentralized social network which builds upon our previous work of Chapter 5 to construct a semantic overlay which connects peers to their friends and friends-of-friends. Using those connections, we create a social network which uses friends to store messages. Every wallpost a user creates is encrypted with the public key of each destination, and then stored at the friends of each destination. This allows two friends to continue to communicate even when they are not online at the same time. Using Bloom filters, a peer can detect which of the messages destined for its friends it has not yet stored locally. This chapter is largely based on our work published in USENIX FOCI 2014 [104]:

N. Zeilemaker and J. Pouwelse. ReClaim: a Privacy-Preserving Decentralized Social Network. In Proceedings of the USENIX Workshop on Free and Open


2

Tribler: Peer-to-Peer Media Search and Sharing

Tribler is an open-source software project facilitating search, streaming and sharing content using P2P technology. Over 1,200,000 people have used Tribler since the project started in 2005. The Tribler P2P core supports BitTorrent-compatible downloading, video on demand and live streaming. Aside from a regular desktop GUI that runs on multiple OSes, it can be installed as a browser plug-in, currently used by Wikipedia. Additionally, it runs on a 450 MHz processor, showcasing future TV support. We continuously work on extensions and test out novel research ideas within our user base, resulting in sub-second content search, a reputation system for rewarding upload, and channels for content publishing and spam prevention. Presently, over 2000 channels have been created, enabling rich multimedia communities without requiring any server.

Parts of this chapter have been published in ACM MM 2011 [102].


2.1. Introduction

Since 2001, BitTorrent has transformed the manner in which files are transferred over the Internet. While not the first P2P file-sharing application, it became the most successful because of its unprecedented efficiency. However, BitTorrent fails to properly address important issues, such as search, video on demand and long-term contribution incentives. In this chapter, we introduce Tribler, our BitTorrent-based open-source project that adds missing features and new use-cases, while remaining fully backwards compatible.

BitTorrent introduced a new mechanism for transferring files that maximizes the resource contribution of peers. In BitTorrent, before the transfer begins, files are split into small pieces which can then be individually downloaded from different peers. This way, downloaders (called leechers) can upload completed pieces to other leechers, without the need to have the whole file first. Furthermore, this uploading is encouraged by the tit-for-tat incentive mechanism built into BitTorrent: a leecher will rank its peers by upload speed and will upload in return only to the fastest uploaders. Peers which have completed the file, called seeders, can help others by sending them pieces for free.
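The ranking step of tit-for-tat can be illustrated with a minimal sketch. The function name `select_unchoked`, the peer identifiers, and the number of upload slots are assumptions made for this illustration; they are not taken from the BitTorrent specification or from Tribler.

```python
def select_unchoked(upload_rates, slots=4):
    """Return the peers we upload to: the `slots` fastest uploaders.

    upload_rates maps a peer id to the rate (bytes/s) at which that peer
    has recently uploaded to us. The default of 4 slots is an assumption.
    """
    ranked = sorted(upload_rates, key=lambda peer: upload_rates[peer], reverse=True)
    return ranked[:slots]


# Hypothetical measurements for four connected leechers.
rates = {"peer_a": 120_000, "peer_b": 5_000, "peer_c": 80_000, "peer_d": 0}
unchoked = select_unchoked(rates, slots=2)
```

With two slots, only the two peers that upload fastest to us are served in return, which is what gives leechers a reason to contribute.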

To protect against transfer errors and malicious modifications, each file piece is accompanied by its SHA-1 cryptographic hash. The peer that initiates the transfer of a file will first record these hashes in a metadata file called a torrent. This torrent has to be distributed out-of-band to the users interested in downloading the file, e.g., through email or the Web. Only after getting hold of the torrent can a BitTorrent client start downloading the files. This is one of the problems of the original BitTorrent protocol. Searching for torrents is also not part of the BitTorrent protocol and is normally done through centralized websites.
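The per-piece hashing described above can be sketched as follows. This is a simplified illustration and not the torrent file format itself: the helper names and the tiny piece size are assumptions chosen for readability.

```python
import hashlib

PIECE_SIZE = 4  # bytes; deliberately tiny for illustration (an assumption)


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Split the file content into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(data, piece_size=PIECE_SIZE):
    """Compute the SHA-1 hash of every piece, as recorded by the initiating peer."""
    return [hashlib.sha1(piece).digest() for piece in split_into_pieces(data, piece_size)]


def verify_piece(index, piece, hashes):
    """Check a received piece against the hash recorded for its index."""
    return hashlib.sha1(piece).digest() == hashes[index]


content = b"hello peer-to-peer"
hashes = piece_hashes(content)
pieces = split_into_pieces(content)
```

A downloader that holds `hashes` can accept or reject each piece independently, regardless of which peer sent it.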

The original BitTorrent protocol had another important design flaw. Peer discovery was done through centralized servers called trackers. The address of a tracker is recorded in the torrent and every interested peer will announce itself to the tracker. In turn, the tracker will provide the peer a list of other peers currently uploading or downloading the file. Whenever the tracker is offline, new peers cannot join the swarm (the group of peers sharing the file) because there is no other way to discover it.

Initiated in 2005, Tribler [3] is an open-source research project funded by multiple European research grants, which addresses the problems of the original BitTorrent protocol and extends it beyond file sharing to new areas such as video on demand and live streaming. The existence of Tribler was made possible by the open nature of BitTorrent. The protocol specification is freely available [23] and the original BitTorrent software used an open-source license [14]. After extensive research and numerous incremental improvements, Tribler has implemented and deployed fully decentralized remote search, channels, reputation, video on demand, and live streaming.


Figure 2.1: Tribler architecture. The diagram shows the following components: Overlay/Dispersy; Protocols (BarterCast Reputation, Remote Search, Channels, Torrent Collecting); Download Engines (Video On Demand, Live Streaming); the libTribler “glue”; and the GUI (SwarmPlayer, Tribler). The arrows are labelled “Run on top of” and “Interact with”.

2.2. Design and Features

Tribler consists of two base components: the core, which implements all communication protocols, and the GUI, which presents the features of the software to the user. As part of the project we have implemented a secondary GUI by repackaging the core as a browser plug-in called SwarmPlayer. This section first describes the overall structure of the core, then the Tribler GUI, and finally the SwarmPlayer.

2.2.1. Architecture

The core of Tribler can be further divided into three parts: (1) the overlay which allows for communication between Tribler peers, (2) the protocols implementing remote search, torrent collecting etc., and (3) the download engine, which allows users to download files in BitTorrent swarms.

One of the missing features of BitTorrent is cross-swarm identification of peers. Tribler addresses this by using public key cryptography to generate an identifier called PermID for each peer. This PermID is stored separately from the Tribler installation files, so that upgrading or reinstalling Tribler does not cause it to change. The PermID allows us to identify and authenticate messages from users, allowing for a secure method of communication.

A key feature of Tribler is its custom overlay network. Initially implemented as a BitTorrent swarm with a predefined identifier, it allowed Tribler peers to exchange information using custom overlay communication protocols. At the same time, the overlay does not require more TCP sockets to be opened in addition to regular BitTorrent. More recently however, Tribler switched to a new method which uses a separate overlay per protocol, and UDP instead of TCP to be able to puncture NAT-firewalls.

In the search overlay, Tribler peers exchange preference lists containing infohashes of downloaded torrents. Infohashes are commonly used as identifiers for BitTorrent swarms. By comparing its own downloads to those of others, a peer can calculate its similarity to them. Peers with a high similarity, those which have downloaded the same torrents, are called taste buddies, and connections to those peers are kept open. During the BuddyCast handshake, in addition to the preference list, peers also exchange which torrent files they have collected recently. Additionally, instead of sending the preference lists in plain-text, peers in the search overlay construct a Bloom filter containing their preferences. Bloom filters are compact hash-representations which allow for efficient membership testing. A peer receiving this Bloom filter can use it to check if it has any overlapping preferences, but determining the preferences of a peer by testing all possible infohashes will result in a lot of false positives, due to the compression of the Bloom filter. This provides some degree of protection from attackers trying to determine the preferences of a peer.
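To make the notion of taste buddies concrete, the sketch below ranks peers by the overlap between download sets. The similarity metric is not specified in this section, so the use of the Jaccard index, as well as the names `similarity` and `taste_buddies`, are assumptions of this example; it also operates on plain sets rather than on Bloom filters.

```python
def similarity(own, other):
    """Jaccard similarity between two sets of infohashes (assumed metric)."""
    if not own and not other:
        return 0.0
    return len(own & other) / len(own | other)


def taste_buddies(own, peers, count=10):
    """Return the ids of the `count` peers most similar to our own downloads."""
    ranked = sorted(peers, key=lambda peer: similarity(own, peers[peer]), reverse=True)
    return ranked[:count]


# Hypothetical preference lists; real infohashes are 20-byte SHA-1 digests.
own_downloads = {"a", "b", "c"}
known_peers = {
    "p1": {"a", "b", "c"},
    "p2": {"a"},
    "p3": {"x", "y"},
}
buddies = taste_buddies(own_downloads, known_peers, count=2)
```

Here the peer with identical downloads ranks first and the peer without any overlap is not selected.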

2.2.2. Torrent Collecting

After discovering new infohashes using the search overlay, a peer will start collecting the associated torrents. An algorithm determines which torrents from this list are the most interesting and should be collected. It does so by sorting the torrents based on their popularity and on whether or not they exist in a channel. Popular torrents will thus be requested first.

Torrents are downloaded using PPSP. PPSP can be compared to BitTorrent as it splits files into smaller pieces which then can be downloaded in parallel from multiple sources. However, in contrast to BitTorrent, PPSP uses a Merkle tree to both identify and verify pieces of the data. A peer can therefore verify pieces of a PPSP swarm using only the roothash of the Merkle tree. In the search overlay, peers specify a roothash for each torrent file they recently downloaded.
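The idea of verifying individual pieces against a single roothash can be sketched with a generic binary Merkle tree. This is an illustration under stated assumptions and not the exact tree layout used by PPSP: the choice of SHA-1, the all-zero padding hash, and the function names are ours.

```python
import hashlib

EMPTY = b"\x00" * 20  # padding hash for absent leaves (an assumption of this sketch)


def sha1(data):
    return hashlib.sha1(data).digest()


def build_tree(pieces):
    """Return the tree as a list of levels: leaf hashes first, root level last."""
    leaves = [sha1(piece) for piece in pieces]
    width = 1
    while width < len(leaves):
        width *= 2
    leaves += [EMPTY] * (width - len(leaves))  # pad to a power of two
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sha1(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def proof_for(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify(piece, index, proof, roothash):
    """Recompute the path to the root and compare it with the known roothash."""
    digest = sha1(piece)
    for sibling in proof:
        if index % 2 == 0:
            digest = sha1(digest + sibling)
        else:
            digest = sha1(sibling + digest)
        index //= 2
    return digest == roothash


pieces = [b"piece-0", b"piece-1", b"piece-2"]
levels = build_tree(pieces)
roothash = levels[-1][0]
```

A receiver that only knows `roothash` can check any piece once the sender supplies the piece together with its sibling hashes.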

When requesting a torrent, peers use the associated roothash to download the file from the sending peer. While searching, a torrent is fetched when the user wants to see its details, e.g., file name, number of peers in the swarm. The first 25 torrents are prefetched to improve the speed of showing the user details of these highly relevant torrents. If no peer is willing to send the torrent, it is downloaded using the BitTorrent extension for sending torrents [43].

After receiving a torrent, a peer will store its information in an embedded database called the MegaCache. This is a local SQLite database, in which we store information from up to 50,000 torrents, and which is used during search.

2.2.3. Remote Search

A Tribler peer will, by default, maintain connections to 10 random peers and 10 taste buddies. By doing so, it will create a semantic overlay consisting of peers which are very similar to it. Having connections to similar peers allows us to implement remote search in a very efficient manner. By only using these connected peers, instead of flooding the network as is usually done, we can show very relevant results in less than a second.

In addition, due to each Tribler peer storing information for up to 50,000 relevant torrents, local results can be shown to the user even quicker, resulting in a very responsive search. Together, the databases of the connected peers can store up to 1,000,000 relevant torrents (20 connected peers with up to 50,000 torrents each), improving the overall recall of the system.

2.2.4. Video on Demand

Another novel feature in Tribler is our video on demand (VOD) implementation. BitTorrent, by default, uses a piece picking policy which downloads the rarest piece. As a result, a peer maximizes the demand for the pieces it downloads, allowing the peer to trade them with others. By modifying the piece picking policy of BitTorrent, we can allow users to start playing the video even before completing the download. Tribler uses an internal HTTP server to determine the playback position of the video player and adjust the piece priorities accordingly. This allows users to not only start playing the video, but also seek through the file before the download is complete. Pieces of a file which are before the playback position are only downloaded after all pieces after the playback position have been downloaded.
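The VOD ordering described above can be sketched as follows. The function name and arguments are hypothetical, and requesting the pieces strictly in order within each group is a simplifying assumption; the text only states that pieces before the playback position come last.

```python
def vod_piece_order(num_pieces, have, playback_piece):
    """Order the missing pieces for VOD playback.

    Pieces at or after the playback position are requested first; pieces
    before the playback position are requested only after all of those.
    """
    ahead = [i for i in range(playback_piece, num_pieces) if i not in have]
    behind = [i for i in range(playback_piece) if i not in have]
    return ahead + behind


# A file of 8 pieces, of which pieces 0, 1 and 5 are already downloaded,
# with the user having sought to piece 4.
order = vod_piece_order(num_pieces=8, have={0, 1, 5}, playback_piece=4)
```

In this example, pieces 4, 6 and 7 are fetched before the skipped pieces 2 and 3.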

2.2.5. Live Streaming

A slightly different technique, compared to VOD, is used to provide users with live streaming using PPSP. Since the pieces are not known in advance (the video content is generated live), it is impossible to construct the Merkle tree and hence compute its roothash. The solution is to let the seeder sign all pieces using a public/private key pair, and let the leechers verify them using the public key instead of the roothash. This way, all peers that receive the torrent can verify the authenticity of the pieces using the public key.

Live streaming can last an indefinitely long time. Hence, old pieces cannot be expected to remain available.

2.2.6. Reputation

Currently, we are experimenting with a global reputation system for Tribler peers. Having such a system will allow peers to identify others who are behaving well over a long time, not only in a single swarm, as is the case with normal BitTorrent. Lazy freeriders, peers who after downloading a torrent leave immediately, will have a lower reputation than peers who stay and seed. Cooperative behavior is thus encouraged, leading to better performance.


The system we implemented, called BarterCast [67], creates records stating how much data has been exchanged between two peers. By collecting these BarterCast records, each peer can create a subjective graph of the whole Tribler population, whose edges describe the data exchanged between peers. Using this graph, a peer can estimate the contribution another peer has made to the system. To protect from malicious peers that are misreporting data exchanges, a peer computes the actual reputation of other peers using a maximum flow algorithm starting from itself. This reputation can then be used to order the peers which are requesting pieces, or even decide to not send any pieces to peers with a low reputation.
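The role of the maximum flow computation can be illustrated with a small sketch. It uses a textbook Edmonds-Karp implementation on a subjective graph given as a dictionary of dictionaries. The scoring function `subjective_score`, which takes the difference of the flows in both directions, is an assumption of this example and not necessarily the formula used by BarterCast.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; capacity[u][v] is the data sent from u to v."""
    residual = {}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)  # reverse edge for the residual graph
    if source == sink or source not in residual or sink not in residual:
        return 0
    total = 0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Determine the bottleneck on the path, then update the residuals.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def subjective_score(graph, me, other):
    """Assumed score: flow from `other` towards `me` minus the flow back."""
    return max_flow(graph, other, me) - max_flow(graph, me, other)


# Hypothetical records: mallory claims a large upload to her accomplice sybil.
graph = {
    "alice": {"me": 10},
    "mallory": {"sybil": 1000},
    "sybil": {"alice": 1},
}
```

Because the flow starts from the evaluating peer's own edges, the inflated record between mallory and sybil contributes at most the single unit that actually reached a peer we have exchanged data with.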

Most research effort is put into the manner in which we collect the BarterCast records, i.e., choosing the peers we synchronize with. Peers select those to synchronize with according to heuristics, which have a substantial impact on the speed of synchronization and on the accuracy.

2.2.7. Channels

Since the release of Tribler 5.2 we have implemented fully decentralized channels that each user can utilize to publish torrents they like. Other users can mark a channel as favorite, causing it to become more popular. Furthermore, channels can be marked as spam when users determine that the torrents provided are fake. Using the total numbers of users marking a channel as a favorite and as spam, a popularity metric is calculated and used to sort the channels. ChannelCast, the overlay protocol used to discover and update channels, exploits this popularity to ensure the right channels are discovered faster.

Channels allow for co-operation and communication. Moving away from a single injector publishing all content, users will be able to post comments and iteratively improve the metadata (title, description, etc.). A distributed permission system will allow for granting users specific permissions within a channel. Currently over 2000 channels have been created.

2.2.8. GUI

As part of the standalone Tribler client, we have implemented a GUI using the open-source wxPython toolkit, which allows us to implement a GUI once and run it on multiple platforms (Windows, Mac OS X, Linux). A big effort has been put into creating a GUI which is fully resizable and not dependent on a specific platform. Native GUI elements are used, which are positioned relative to other elements, for instance by specifying that a button is positioned below another element, or to the right of it. This allows us to provide the same functionality on all platforms, while still having a slightly different layout due to differences in the native elements.

The Tribler GUI includes search, channels and a library view. VOD is available for torrents with video content; the embedded version of the open-source VLC media player is used for playback. We implemented some additional features, such as the “Network Buzz” which shows the user what popular content is currently available in the Tribler network. Furthermore, the GUI provides several hints to users suggesting a specific action or promoting the use of a feature of the program.


Figure 2.2: Tribler GUI showing search results

of a list of results, which expand after being selected by the user to show the torrent details. This view includes a short overview, but additionally provides specific information using tabs. Hiding these details was a design choice as we expect most users to be intimidated by them.

2.2.9. Other deployments: TV and Browser

As part of the European-funded P2P-Next project, we developed two other deployments based on the same Tribler core. Pioneer has created a set-top box called NextShare TV which is able to run Python and allows users to do BitTorrent live streaming and VOD on their TV. The set-top box is a 450 MHz machine, with a custom GUI designed to be used on TVs [91].

Another deployment of the Tribler core is called SwarmPlayer and provides users with the same functionality as NextShare TV, but is packaged as a browser plug-in. It uses the combination of the HTML 5 video tag and a custom tribe:// protocol to download and stream the video data in the browser. After a torrent has been completely downloaded, it will be automatically seeded. Videos that have not been completely downloaded, due to the user browsing away from the Web page are considered unwanted videos and are removed. Seeding is supported through a limited cache which is normally transparent to the user. Manual control over the seeding is available using a separate HTML GUI, where users can stop and start torrents, as well as modify advanced settings like the upload speed limit.


2.3. Usage

Tribler serves two purposes; it is a BitTorrent client with additional features and, as such, it is used by people all over the world; on the other hand, it is also a tool for experimenting with and evaluating protocols and algorithms. In this section, we present both these use-cases.

2.3.1. Users

Since its release in 2006, Tribler has been downloaded over 1.2 million times. All big releases result in a temporary increase in usage. Tribler 5.3 was downloaded 10,000 times within a three day period, Tribler 6.0 roughly 15,000 times. These temporary increases of usage are usually caused by articles on popular websites such as Slashdot, Ars Technica, and BBC News. After the dust settles, our loyal user base continues to use Tribler and participates in the forums to help us improve the client. In March 2011, we had more than 5000 unique Tribler users downloading more than 6500 torrents (of which 2500 using VOD).

SwarmPlayer, our other deployment of the Tribler core, has been incorporated into Wikipedia and VODO, where it has been downloaded by 6000 users and 13,000 users, respectively.

2.3.2. Scientists

Any researcher can use Tribler to experiment with and evaluate algorithms they are developing. Tribler is very well suited for this since it has both a stable user base, allowing for long experiments, and it gets downloaded by many users at every major release. These spikes in usage provide researchers with large amounts of data which have been used to conduct numerous experiments.

There are two methods implemented to collect data, with user consent. The first is simply using a Tribler instance as an “ordinary” peer and keeping a log of all received responses. In the search overlay, for instance, we can record search queries issued by peers, which can then be used to improve the relevance ranking of the search results by tuning them for queries our users issue. The second is periodically pushing information to a server, wherein participating peers send a file to a predefined server.

After the initial release of our BarterCast reputation system, the data collected using a crawler resulted in a clear understanding that simply using one-hop gossip was not sufficient. Simulations based on the data collected resulted in a complete redesign using BarterCast records signed by both peers that are involved in the data exchange. Signed BarterCast records will allow for full gossip, due to the records being immutable. Using this scheme and other modifications resulted in a large improvement in the error estimation and coverage of the reputation system [33].

2.4. Conclusion

With Tribler we have successfully deployed a BitTorrent client that pushes the boundaries of P2P technology. It allows a stable user base to benefit from a set of novel features, including remote search, VOD, live streaming, reputation, and channels. Testing research ideas with real users has always proven to be a significant challenge. Researchers therefore often rely on simulations instead of real systems with thousands of users. With Tribler we are providing scientists with an open platform to conduct experiments on.

All features have been implemented using open-source tools, readily available for all, while the Tribler source code is also available to be modified. Furthermore, SwarmPlayer and NextShare TV show how the implementation versatility of the Tribler core allows for different GUIs and use cases.

Finally, due to indirect and direct support from European research programmes, the platform has been available for over nine years with steady improvements; twelve big releases have already been completed, with more scheduled in coming years.


3

Large-Scale Message Synchronization in Challenged Networks

In this chapter we introduce Dispersy, a message synchronization platform capable of running inside a challenged network. Dispersy uses Bloom filters to let peers advertise their local state and receive missing messages. However, in contrast to previous work, the efficiency and effectiveness of our design allows for deployment in networks of unprecedented scale. We show, through extensive experimental evidence, that peers have no difficulties in synchronizing over 100,000 messages.

We integrate in Dispersy a NAT traversal technique able to puncture 77% of NAT-firewalls. Not puncturing these firewalls would prevent up to 64% of peers from receiving any synchronization requests. Implementing a NAT traversal technique proved essential when we used Dispersy to extend the functionalities of a BitTorrent client. To date, over 350,000 users used our modified BitTorrent client which included Dispersy, synchronizing in overlays consisting of more than 500,000 messages.

We emulate an overlay consisting of 1000 unmodified Dispersy peers in order to show the propagation and synchronization speed, bandwidth requirements, churn resilience, and overall throughput of Dispersy. Dispersy is able to synchronize a single message to all peers within 19 synchronization steps, withstand an average session-time of 30 seconds, and achieve an average throughput of 10,000 messages/s.

Parts of this chapter have been published in ACM SAC 2014 [107].


3.1. Introduction

With Dispersy we present a large-scale message synchronization platform which is deployed to over 350,000 users. Dispersy is designed to run inside a challenged network, and is able to deal with communication limitations such as intermittent connectivity, large delay between links, and absence of end-to-end paths. Messages are application dependent; however, Dispersy adds optional headers describing if and to whom a message needs to be synchronized, the id and/or signature of the creator, etc.

Challenging conditions can be found in a wide range of networks, e.g., vehicular ad-hoc networks (VANETs), Peer-to-Peer networks (P2P), and delay tolerant networks (DTNs). As a result, message synchronization, or more generally data dissemination, while dealing with those conditions is a large established field, with researchers writing over 20,000 papers in 2013 alone. However, in contrast to most of the proposed designs, which never see an actual implementation or are not even suitable to be implemented, we design, implement, emulate, and deploy Dispersy into the wild.

We define message synchronization as creating consistency in a network of peers by resolving differences in messages received between two peers at a time. Peers regularly request missing messages by advertising their locally available messages using a Bloom filter. A Bloom filter is a compact hash-representation which allows for membership testing. Upon receiving such a request, a peer uses the Bloom filter to test if it has messages available which the other lacks. Although such an approach has been proposed before, we implement additional heuristics which allow a Dispersy overlay to scale to a much larger extent than previous works. Moreover, we implement a NAT traversal technique allowing peers to communicate in spite of NAT-firewalls.

When designing a message synchronization platform for a challenged network we cannot assume the possibility of a connection between any two peers. Therefore, Dispersy does not require a central server in order to operate. Moreover, all peers in the network perform the same tasks, improving its robustness but more importantly removing the dependency on peers which might have more computational/infrastructural resources.

3.1.1. Contribution

We propose, implement, evaluate, and deploy a message synchronization platform capable of running inside a challenged network. Whereas previous works have used Bloom filters to synchronize messages, we improve their approach to the point of finally being able to synchronize at a much larger scale. Heuristics allow us to synchronize a larger number of messages while reducing the CPU time required to do so considerably.

We thoroughly test the performance of our platform by emulating several overlays, measuring the propagation and synchronization speed, bandwidth requirements, churn resilience, and overall throughput. Finally, we show how we used our platform to implement and deploy an application which was used by over 350,000 users creating more than 500,000 messages. In contrast, previous work [56] struggles to synchronize more than 10,000 messages due to very high false positive rates.

3.1.2. Background and Related Work

More than 25 years ago, Demers et al. [35] outlined methods for creating consistency across many nodes using a large, heterogeneous, slightly unreliable, and slowly changing network. Demers et al. introduced the concept by stating that nodes can only be consistent after all updating activity has stopped. They describe three different methods of distributing updates: direct mail, in which an update is sent from the initial node to all other nodes; anti-entropy, in which a node chooses another node at random and resolves the differences between the two; and finally, rumor mongering, in which a node, after receiving a new update (a hot rumor), will keep forwarding this update until it goes cold (due to receiving it from many other nodes).

Introduced by Bloom [15], Bloom filters have become popular due to being a very space-efficient method for membership testing. Bloom filters use a hash area consisting of N bits, which are initially set to 0. For each item that needs to be “stored” in the Bloom filter, K distinct addresses are generated using a hash function. The bits addressed in the Bloom filter are set to 1. Upon checking if an item is part of a Bloom filter, the same addresses are generated, but now the bits addressed are checked to be 1. If not all bits are 1, then this item is not in the Bloom filter. When all bits are set to 1, this item might be part of the Bloom filter or (due to the compression of the Bloom filter) could be a false positive. Because Bloom filters do not have false negatives, they can serve as a drop-in optimization to reduce disk I/O. For example, if we put all files which are on disk in a Bloom filter and the Bloom filter returns false, then a file is not on disk. If it returns true, then the file might be on disk. An overview of the use of Bloom filters in a range of network applications can be found in [19].
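The insert and membership test described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Dispersy: the class name `BloomFilter`, its parameters, and the choice of deriving the K addresses from salted SHA-256 digests are all assumptions made for this example. Unlike the description above, the K addresses are not forced to be distinct.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: N bits, K addresses per item."""

    def __init__(self, n_bits, k_hashes):
        self.n_bits = n_bits
        self.k_hashes = k_hashes
        self.bits = bytearray((n_bits + 7) // 8)  # all bits start at 0

    def _addresses(self, item):
        # Derive K addresses by salting one hash function with an index.
        # Salted SHA-256 is an illustrative assumption; the addresses of
        # one item may occasionally coincide.
        for i in range(self.k_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        # "Store" an item by setting each addressed bit to 1.
        for addr in self._addresses(item):
            self.bits[addr // 8] |= 1 << (addr % 8)

    def __contains__(self, item):
        # False: the item was definitely never added.
        # True: the item was added, or this is a false positive.
        return all(self.bits[addr // 8] & (1 << (addr % 8))
                   for addr in self._addresses(item))


# Disk example from the text: a negative answer avoids a disk lookup.
on_disk = BloomFilter(n_bits=8192, k_hashes=3)
on_disk.add(b"report.pdf")
print(b"report.pdf" in on_disk)  # an added item is always reported present
```

A positive answer still has to be confirmed against the disk, whereas a negative answer can be trusted as is.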

Vehicular ad hoc networks (VANETs) are quickly becoming more commercially relevant because of recent advances in inter-vehicular communications and their associated costs. In many respects VANETs can be considered as a P2P system, as they are not restricted by power requirements, suffer from excessive churn, and have to be able to scale across millions of peers without the use of central components. Churn is the rate at which peers join and leave a network. In VANETs churn is caused by the movement of vehicles, which causes them to have a very small connection window wherein communication between two nodes is possible [45].

Lee et al. [56] envisioned a vehicular sensing platform able to provide proactive urban monitoring services by using sensors found in vehicles to detect events from urban streets, e.g. recognizing license plates. Their MobEyes middleware runs inside a vehicle and creates summaries of detected events. Those summaries are then pushed by the creator for a period of time to all one-hop neighbors, and are harvested in parallel using a Bloom filter. By building a Bloom filter of the already-harvested, still-valid summary packets, all other peers within the broadcasting range can detect the missing summaries. Lee et al. use Bloom filters which are 1024 bytes in size combined with a periodically changing hash function to reduce the number of false positives.

Figure 3.1: The four step Dispersy synchronization method: (1) peer selection, (2) Bloom filter creation, (3) NAT traversal, (4) receive missing messages.

Fixing the size of the Bloom filter without performing a selection of summary packets limits the number of summaries which can be synchronized. Inserting 2000 summaries into this Bloom filter results in a false positive rate of 10%; inserting 10000 results in a false positive rate of roughly 70%. Incurring such a high false positive rate will greatly reduce the efficiency of the synchronization.
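The growth of the false positive rate in a fixed-size filter can be illustrated with the standard approximation (1 - e^(-kn/m))^k for a filter of m bits holding n items with k hash functions. The sketch below is an illustration only: the function name `false_positive_rate` is hypothetical, and the number of hash functions is not stated in this section, so several values of k are tried and the output need not match the quoted percentages exactly.

```python
import math


def false_positive_rate(n_bits, n_items, k_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k of the false positive rate."""
    return (1.0 - math.exp(-k_hashes * n_items / n_bits)) ** k_hashes


N_BITS = 1024 * 8  # a 1024-byte Bloom filter

for n_items in (2000, 10000):
    for k in (1, 2, 3):  # k is an assumption; it is not given in the text
        rate = false_positive_rate(N_BITS, n_items, k)
        print(f"items={n_items:5d} k={k} rate={rate:.3f}")
```

Whatever value of k is assumed, the rate rises sharply once the number of inserted items exceeds the number of bits, which is why a fixed 1024-byte filter cannot cover an unbounded set of summaries.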

3.2. System Design

In Dispersy peers communicate in one or more overlays. An overlay is an additional network built on top of an existing one, most commonly the Internet. Within an overlay, peers synchronize by advertising their locally available messages to others.

3.2.1. Overlay Definition

The design of Dispersy is modular and extensible and lets applications define their own overlay types for specific goals. By defining an overlay, an application can determine which messages are going to be used, and which need to be synchronized. The body of a message is application specific, but its headers are added by Dispersy and can specify whether and to whom this message needs to be synchronized, the id and/or signature of the creator, etc.

After defining an overlay type, a peer needs to instantiate it. Instantiating an overlay requires a peer to generate an identifier for it. This identifier has to be known to all peers who attempt to join this instance. Dispersy overlays are completely separate from each other, as messages contain the identifier of the overlay they are part of and thus are not shared between overlay instances.

After joining an overlay, peers synchronize by advertising their locally available messages using a Bloom filter. At a fixed interval, each peer sends its Bloom filter to another peer in what we call a synchronization-step or step. This Bloom filter allows those two peers to compare their local messages and resolve the differences.
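A single synchronization step between two peers can be sketched as follows. This is a simplified illustration under stated assumptions and not Dispersy's actual code: the function names (`addresses`, `build_filter`, `missing_messages`, `sync_step`) are hypothetical, the Bloom filter is represented as a Python set of bit positions for brevity, the bit addresses come from salted SHA-256 digests, and peer selection and NAT traversal are left out.

```python
import hashlib


def addresses(message, n_bits, k_hashes):
    """K bit addresses for a message (salted SHA-256, an illustrative choice)."""
    for i in range(k_hashes):
        digest = hashlib.sha256(bytes([i]) + message).digest()
        yield int.from_bytes(digest[:8], "big") % n_bits


def build_filter(messages, n_bits=8192, k_hashes=3):
    """Requester side: advertise local messages as the set of bits set to 1."""
    bits = set()
    for message in messages:
        bits.update(addresses(message, n_bits, k_hashes))
    return bits, n_bits, k_hashes


def missing_messages(advertised, local_messages):
    """Responder side: select messages the requester does not appear to have."""
    bits, n_bits, k_hashes = advertised
    return [m for m in local_messages
            if not all(a in bits for a in addresses(m, n_bits, k_hashes))]


def sync_step(requester, responder):
    """One step: requester advertises, responder replies, requester stores."""
    reply = missing_messages(build_filter(requester), responder)
    requester.update(reply)
    return reply


# Two peers holding partially overlapping message sets.
alice = {b"m1", b"m2"}
bob = {b"m2", b"m3"}
reply = sync_step(alice, bob)
print(sorted(reply))
```

Because a Bloom filter has no false negatives, the responder never resends a message the requester already holds. A false positive only delays a message until a later step, which is one reason keeping the false positive rate low matters for synchronization speed.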