Understanding Operation and User Behavior in Peer-to-Peer Systems

Dissertation

for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft,

by the authority of the Rector Magnificus prof.ir. K.C.A.M. Luyben, chairman of the Board for Doctorates,

to be defended publicly on Wednesday 30 January 2013 at 10:00 by Boxun ZHANG

Master of Science in Information Technology, Kungliga Tekniska högskolan (KTH), born in Beijing, China

Composition of the doctoral committee:

Rector Magnificus, chairman

Prof.dr.ir. H.J. Sips Technische Universiteit Delft, promotor

Dr.ir. J.A. Pouwelse Technische Universiteit Delft, copromotor

Prof.dr.ir. J.J. Lukkien Technische Universiteit Eindhoven

Prof.dr. E.W. Biersack EURECOM

Prof.dr.-ing. W. Kellerer Technische Universität München

Prof.dr.ir. R.E. Kooij Technische Universiteit Delft

Prof.dr.ir. A.P. de Vries Technische Universiteit Delft

Prof.dr.ir. D.H.J. Epema Technische Universiteit Delft, reserve member

Cover designed by: Chen Wang

Published and distributed by: Boxun Zhang E-mail: zhangboxun@gmail.com

ISBN: 978-90-79982-15-8

Keywords: Peer-to-Peer systems, BitTorrent, Spotify, Measurement, User behavior.

Copyright © 2011 by Boxun Zhang.

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission of the author. Printed in The Netherlands by: Wöhrmann Print Service.

The work described in this thesis has been carried out in the ASCI graduate school. ASCI dissertation series number 273.

This work was supported by the European Community’s Seventh Framework Programme through the P2P-NEXT project (grant no. 216217).

Acknowledgments

It has been a great pleasure to spend four years in the PDS group, where I have met and worked with many talented and amazing colleagues and good friends. This thesis would not have been written without their support. I would like to begin by thanking my official and unofficial supervisors: Henk, Alex, Dick, and Johan. I am very lucky to have had the opportunity to work with them at different times during my PhD.

Henk, my promotor, thanks for accepting me as a PhD student in the PDS group. I am very grateful for the guidance and confidence you gave me in the initial phase of my PhD. You always encouraged me to try out new ideas and never pressured me for publications. The freedom you gave me was crucial for me to find the topics I really like. You have also taught me much about research, from which I have benefited throughout my PhD. Last but not least, thank you for running the awesome PDS group.

Alex, my unofficial supervisor, thanks for being both a wonderful supervisor and a great friend. We started the whole work of the P2P Trace Archive from a corridor talk together with Johan, and since then you have guided me through the research jungle with great care. Besides the excellence of your research, you are also very knowledgeable about many other things. Needless to say, you deserve the title “Lucky Golden Boss”.

Dick, it has been a great pleasure to work with you. Thanks for the guidance you gave me during my PhD. I am very grateful for the huge amount of effort you spent in helping me with writing and formulating research questions. Your precision in research has influenced me greatly.

Johan, thanks for managing the P2P-Next project and bringing so many brilliant people to our group. Thanks for the freedom you gave me and the many interesting ideas you brought. I wish you all the best for the on-going projects and proposals.

In the PDS group, I am extremely lucky to have many amazing colleagues: Rameez, Dave, Ali, Tamas, Lucia, Naza, Adele, Mihai, Niels, Rahim, Ricardo, Arno, Boudewijn, Otto, Siqi, Jianbin, Jie, Yong, and Kefeng. I have had a great time and enjoyed every moment with you. I wish you the best for your research and future. Nitin and Dimitra, you are great office mates and friends. I think our office has the most laughter and is the only office that has a “graffiti” whiteboard with real scientific ideas.

years; Ilse, Rina, Monique, Esther, thanks for taking care of the administrative issues for the PDS group.

Gunnar, Javier, Guido, Marcus, thanks for your support and help during my internship at Spotify. I am very excited about joining Spotify after my PhD and working with you again.

Flutra, it was great to meet you in Delft and Stockholm. All the best!

Xuetao, you have been a great friend from Sweden to the Netherlands. Best wishes to you and your family.

Di, Yingchao, Chen, we have had a great time together in Delft in the past 3 years. I am very happy to see that you have finished (or almost finished) your master studies, and I wish you a very bright future. Please visit me in Sweden, or I will see you in Beijing, sometime in the future.

Yue, Yanjie, you are awesome friends and thanks for the joy you gave me every time I was in Beijing. Cheers!

Finally, I want to dedicate this thesis to my parents. Thanks for your unconditional love and support. I am very grateful that you are always supportive and encouraging of my decisions. Without you standing by me, sharing the joy and difficulty, it would not have been possible for me to get through my PhD.

5 Understanding User Behavior in Spotify
5.1 Spotify Overview
5.1.1 Clients
5.1.2 Protocol Overview
5.1.3 Connection Behavior
5.2 Dataset
5.2.1 Trace collection
5.2.2 Data Sanitization
5.2.3 Terminology
5.3 System Dynamics
5.3.1 Session arrival
5.3.2 Session length
5.3.3 Playback arrival
5.4 User behavior
5.4.1 Device switch patterns
5.4.2 Favorite times of day
5.4.3 Correlations of successive sessions
5.5 Related Work
5.6 Conclusion
6 Conclusion
6.1 Conclusions
6.2 Future work
Summary
Samenvatting (Dutch summary)
Curriculum vitae

Chapter 1 Introduction

In the last decade, Peer-to-Peer (P2P) applications, particularly file-sharing applications such as Gnutella and BitTorrent, have gained tremendous popularity. It is reported [7] that P2P file-sharing applications at some point contributed up to 80% of Internet traffic, and Cisco [8] forecasts that P2P file-sharing traffic will keep growing until 2015. To date, BitTorrent is the most popular P2P file-sharing application. BitTorrent, Inc. reports [22] over 150 million monthly active users of its popular BitTorrent clients. Compared to the traditional client/server architecture, BitTorrent can significantly reduce the server-side bandwidth required for distributing bulky files among large numbers of users. Shortly after the release of BitTorrent, countless BitTorrent communities emerged on the Internet, distributing a wide variety of content. Besides file-sharing, P2P technology has also been adopted in many other data-intensive applications, such as video streaming [56,85,152], music streaming [73], VoIP [48], and online storage [128], in order to increase service scalability and reduce server-side bandwidth requirements.

Despite the boom of P2P applications and the significant amount of effort spent in measuring those systems, some important knowledge is still lacking. First, the understanding of different techniques for measuring P2P systems is insufficient. This in turn can lead to misinterpretation of measurement results. Secondly, the knowledge of operation and usage patterns in different P2P systems and communities is not comprehensive enough. This knowledge is crucial for getting the big picture of the P2P ecosystem, and it can also lead to a better understanding of user behavior in P2P systems. Thirdly, until now, there has been no in-depth study of the flashcrowd phenomenon in BitTorrent (the rapid arrival of large numbers of users in a swarm), which hampers the development of flashcrowd/churn handling mechanisms for BitTorrent-like systems. Finally, compared to P2P video streaming systems, little is known about the user behavior in peer-assisted music streaming services like Spotify [73]. Without this knowledge, it is very difficult to develop and optimize P2P algorithms for music streaming.

of the missing knowledge mentioned above. Among all P2P systems, we focus on BitTorrent and Spotify. BitTorrent is the most popular P2P file-sharing application, and it still shows great potential for the future. Spotify is arguably the most popular music streaming service today. The knowledge about the user behavior in Spotify [73] can be used for developing P2P music streaming systems, and it can also provide valuable insights into music streaming services in general. In this thesis, we first investigate the factors causing sampling biases (the differences between the measured and the real characteristics) in BitTorrent measurements. This knowledge is key for understanding the measurement results in all empirical P2P studies. Secondly, to provide a comprehensive overview of P2P systems, we study the usage patterns of various P2P systems by analyzing more than 20 traces hosted in our P2P Trace Archive [3]. Thirdly, we focus on the flashcrowd phenomenon in BitTorrent, which can have a significant impact on large numbers of BitTorrent users. We study the properties, such as arrival time, duration, and magnitude, of the flashcrowds that occur in two of the largest public BitTorrent trackers today. Finally, we shift our focus from file-sharing applications to a popular commercial P2P music streaming service: Spotify. We study the user behavior in Spotify by analyzing a massive dataset, which was collected by Spotify between 2010 and 2011. In particular, we are the first to study, with a large dataset, the realistic pattern of users switching across multiple devices, including both desktops and mobiles.

1.1 P2P systems

The most common applications of P2P technology are file-sharing, streaming, and Internet telephony. In this section, we first introduce the popular P2P file-sharing and streaming systems. Secondly, we give an overview of some other P2P applications such as Internet telephony and online storage.

1.1.1 Early P2P file-sharing applications

The first well-known P2P file-sharing application is Napster [99], which was released in 1999. In Napster, an index of all shared files is maintained by a cluster of servers, and every Napster peer keeps a persistent connection to those servers [116]. Files in Napster can be located by querying the index servers and then transferred between peers over direct connections. In 2001, the FastTrack [80] protocol and its client Kazaa [78] were released. In FastTrack, the index and meta-data of the shared files are not stored in centralized servers but in decentralized super nodes that are topologically near ordinary nodes. Besides Kazaa, the FastTrack protocol has also been implemented in several clients such as iMesh [57] and Morpheus [100]. Gnutella is another popular P2P file-sharing protocol released around 2000, and it has been implemented by many clients like LimeWire [86] and Morpheus [100]. In the Gnutella network, there is no centralized server or supernode. Instead, peers in Gnutella form an overlay network by maintaining a set of neighbors. A Gnutella peer can search for a file by flooding a query through its neighbors [70, 113, 114, 116], and the flood of a query is limited by the TTL (Time-to-Live) specified by the user. Since no centralized component exists in Gnutella, multiple disjoint Gnutella overlays can coexist at the same time. In 2004, eDonkey [132] became the most popular P2P file-sharing system [76]. Similar to Napster, the files shared in eDonkey are indexed at centralized servers. Besides content search, eDonkey servers also facilitate locating sources of files and punching through firewalls/NATs.
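The TTL-limited query flooding described above for Gnutella can be illustrated with a small sketch. It is a simplified model rather than the actual Gnutella wire protocol: the overlay is assumed to be a plain adjacency dictionary, and the names `flood_query`, `overlay`, and `has_file` are hypothetical.

```python
from collections import deque


def flood_query(overlay, origin, ttl, has_file):
    """Flood a query from `origin` and return the peers that report a hit.

    overlay  -- dict mapping each peer to a list of its neighbors (assumed model)
    ttl      -- maximum number of hops the query may travel
    has_file -- callable returning True if a peer shares the requested file
    """
    hits = []
    seen = {origin}                      # peers that already received the query
    frontier = deque([(origin, ttl)])    # (peer, remaining hops) pairs
    while frontier:
        peer, remaining = frontier.popleft()
        if peer != origin and has_file(peer):
            hits.append(peer)
        if remaining == 0:
            continue                     # TTL exhausted: do not forward further
        for neighbor in overlay.get(peer, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits


# Two disjoint overlays: A-B-C-D and E-F. Peers C, D, and F share the file.
overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"],
           "E": ["F"], "F": ["E"]}
holders = {"C", "D", "F"}
print(flood_query(overlay, "A", 2, holders.__contains__))  # ['C']
print(flood_query(overlay, "A", 3, holders.__contains__))  # ['C', 'D']
```

In this toy overlay, peer F is never found from A regardless of the TTL, which mirrors the remark that disjoint Gnutella overlays can coexist.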

1.1.2 BitTorrent

In 2001, the BitTorrent [21, 29] protocol and its first client were released. BitTorrent is usually considered a second-generation P2P file-sharing application, because unlike early file-sharing systems, BitTorrent does not address the problem of locating resources but instead focuses on improving the file-sharing efficiency. In BitTorrent, peers download small pieces of a file instead of the whole file, and they upload to each other while downloading. A leecher is a peer that does not have the complete copy of a file, and a seeder is a peer that owns or has downloaded the complete file. A swarm is a group of peers that are downloading the same file. In swarms, BitTorrent uses a tracker, a centralized component, to provide peer discovery services. As one of the most popular P2P protocols, BitTorrent has been implemented in many clients, such as uTorrent [23], Vuze [136], and Tribler [107], and BitTorrent-like protocols have been used by many P2P streaming systems [35, 98].

To download a file in BitTorrent, a user first needs to download the so-called torrent of that file from a website like ThePirateBay [5]. This torrent file contains the meta-data of the file to be downloaded, such as file size, hashes of file pieces, and the location(s) of one or multiple trackers. Then, the BitTorrent client (peer) asks the tracker for a list of random peers that are already in the swarm, and connects to those peers to start the download. BitTorrent employs two core algorithms, the Tit-for-Tat (TFT) algorithm for peer selection and the Rarest-First (RF) algorithm for piece selection, to improve the download performance. The TFT algorithm encourages a peer to contribute its upload bandwidth by letting a peer only achieve decent download rates if it provides high upload rates to other peers. The RF algorithm ensures that the file pieces with the fewest replicas in a peer’s neighborhood are always downloaded first, which leads to an even distribution of all file pieces in a swarm and solves the last-piece problem [77].
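The Rarest-First idea can be sketched in a few lines. This is a minimal illustration, not the code of any BitTorrent client: the function name `rarest_first`, the set-based piece maps, and the random tie-breaking are assumptions made for the example.

```python
import random
from collections import Counter


def rarest_first(my_pieces, neighbor_pieces, rng=random):
    """Return the index of the next piece to request, or None.

    my_pieces       -- set of piece indices this peer already holds
    neighbor_pieces -- dict mapping each neighbor to the set of pieces it holds
    rng             -- source of randomness for breaking ties (assumed policy)
    """
    # Count how many neighbors hold each piece.
    replicas = Counter()
    for pieces in neighbor_pieces.values():
        replicas.update(pieces)

    # Only pieces we still miss and that some neighbor can provide are candidates.
    candidates = {p: n for p, n in replicas.items() if p not in my_pieces}
    if not candidates:
        return None

    # Pick among the pieces with the fewest replicas in the neighborhood.
    fewest = min(candidates.values())
    rarest = sorted(p for p, n in candidates.items() if n == fewest)
    return rng.choice(rarest)


neighbors = {"p1": {0, 1, 2}, "p2": {0, 1}}
print(rarest_first({0}, neighbors))  # 2: one replica, whereas piece 1 has two
```

Because each peer counts replicas only among its own neighbors, the selection needs no global knowledge of the swarm.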

1.1.3 BitTorrent communities

As we stated earlier, the BitTorrent protocol does not provide a function for searching files. Instead, the search function is “implemented” by numerous BitTorrent websites or communities like ThePirateBay [5] and TvTorrents [51]. The BitTorrent ecosystem is highly diverse and consists of various types of communities. From a content perspective, there are communities like ThePirateBay that provide general types of content, while other communities are dedicated to specific content, such as VODO [6] (movies released under Creative Commons licenses) and Khan Academy [2] (videos of educational lectures). From a membership perspective, there are public and private BitTorrent communities. Public BitTorrent communities usually place no limitations on new user registration and do not enforce sharing. In contrast, many private communities require invitations for user registration and adopt various forms of sharing-ratio enforcement [51, 84, 94, 151] in order to solve the problem of free-riding [64, 87]. In this thesis, we focus mainly on public BitTorrent communities, since most of the largest BitTorrent communities nowadays fall into this category.

1.1.4 P2P streaming systems

Besides file-sharing, P2P technology has also been adopted in many video-streaming systems [12,35,56,81,85,98,133,152], including both live streaming and Video-on-Demand (VoD). Compared to traditional video streaming websites, the bandwidth requirement of those services is significantly reduced [55]. Today, most popular P2P video streaming systems are operated as commercial services in China, such as PPLive [56], PPStream [133], and UUSee [85]. PPStream, one of the most popular P2P VoD systems in China, claims [133] that it has nearly 300 million subscribers and over 25 million daily users. P2P technology is also used in the music streaming service Spotify. Spotify provides instant access to a vast music catalogue, and it has gained worldwide popularity in the past few years. Spotify uses both back-end servers and a proprietary P2P network to transfer data. Notably, Spotify is currently the only music streaming service that employs P2P technology, which makes it a unique use case for P2P streaming technology.

1.1.5 Other P2P applications

Besides file-sharing and streaming, P2P technology has also been used in many other applications. Skype uses a decentralized supernode network [48] to achieve a scalable service, and it recently reported [115] reaching 35 million concurrent online users. Firecoral [129], a web browser extension platform, uses P2P technology to facilitate large-scale web content distribution. FS2You [128], one of the most popular online storage services in China, adopts P2P technology to reduce the bandwidth requirement on its servers for serving millions of users. P2P technology is also used in Bitcoin, the popular digital currency, which is accepted nowadays by organizations such as the Internet Archive and the Free Software Foundation.

1.2 Research context: the P2P-Next project

The research presented in this thesis has been carried out in the context of the P2P-Next project. P2P-Next is a European Union-funded research project with 21 partners, including research institutes, industrial companies, and media providers. The goal of the project is to build the next generation content delivery platform using P2P technology, which will enable users to stream (live or VoD) interactive content on a PC or set-top box. The P2P platform used in P2P-Next is built upon the Tribler project, which was created by the Parallel and Distributed Systems (PDS) group at Delft University of Technology. The Tribler project aims to provide fully-decentralized content discovery and distribution. The core of the project is the Tribler client, which is an enhanced BitTorrent client with many advanced features, such as live streaming, Video-on-Demand, and decentralized content discovery. The Tribler client is released under an open source license, and it is currently available on Windows, Mac OS, and Linux. As of 2012, Tribler has been downloaded over one million times, and it is used by thousands of users every week.

In the past few years, much research on Tribler-related topics has been conducted in the PDS group, and the research results have been published on a broad range of topics. Pouwelse et al. [108], Iosup et al. [59], and Wojciechowski et al. [138] have conducted extensive measurements of BitTorrent communities. Iosup et al. [58] also created the Grid Workload Archive, which has inspired our work on the P2P Workload Archive (Chapter 3). Meulpolder et al. presented an analysis of bandwidth allocation in BitTorrent swarms [96] and proposed BarterCast [95], a decentralized reputation system for BitTorrent. Mol et al. presented the Give-to-Get algorithm [98] for P2P Video-on-Demand and studied the effect of firewalls in BitTorrent [97]. Rahman et al. proposed an effort-based incentive mechanism for BitTorrent [111] and presented a design space analysis for modeling incentives in distributed systems [112]. D’Acunto et al. presented peer selection strategies [30] and the bandwidth allocation under flashcrowds [32] for BitTorrent-like VoD systems. Jia et al. presented an analysis of the sharing ratio enforcement in BitTorrent communities [63]. Delaviz et al. proposed an improvement on the accuracy and coverage of BarterCast [38] and designed a sybil-resistant reputation mechanism [39] for BitTorrent.

1.3 P2P measurement studies

Due to the complexity and heterogeneity of P2P systems, substantial effort has been spent to measure and understand the usage patterns of those systems. In this section, we introduce the most important measurement studies of P2P systems carried out in the last decade, in order to provide an overview of empirical P2P research and the background of this thesis.

1.3.1 Early P2P file-sharing systems

Many empirical studies have investigated the network characteristics of early P2P file-sharing systems such as Gnutella and eDonkey. Ripeanu et al. [113] analyze a dataset of half a million Gnutella nodes. They find that Gnutella is not a pure power-law network, and its virtual topology does not match well with the underlying Internet topology. Stutzbach et al. [124] find that the diameter of the Gnutella network does not increase as the network size increases because of the high connectivity among ultrapeers, and the two-tier structure of Gnutella can stay balanced even under high churn. Markatos et al. [91] find that the traffic in the Gnutella network is bursty at several time scales, and the submitted queries exhibit a significant amount of locality. Tutschku et al. [132] find that in eDonkey, there is a significant difference between the download stream and the non-download stream, and a single model for P2P flows can lead to significant mischaracterization of the eDonkey traffic.

Besides network properties, many measurement studies of early P2P file-sharing systems focused on peer behavior and the properties of shared content. Saroiu et al. [116] observe a significant amount of heterogeneity in both Gnutella and Napster, and find that peers in those networks tend to misreport information deliberately. Leibowitz et al. [78] find that the Kazaa traffic is highly concentrated around a small number of large and popular items, and they further suggest that the small-world characteristics of the data exchange graph of Kazaa can be used to build efficient data distribution mechanisms. Good et al. [47] find that many Kazaa users unknowingly share their private and personal files, which can be easily exploited by malicious peers. Shin et al. [121] find that more than 10% of Kazaa nodes were infected with more than 40 viruses during the measurement period, and more than 70% of the infected nodes might be used for relaying spam emails. Similar to the findings in [78], Handurukande et al. [52] observe that a small fraction of peers in the eDonkey network share most of the files in terms of both size and number of files, and they also find that the popularity of very popular files tends to remain stable over time. Le Fessant et al. [76] find that there is significant locality of interest for audio files and that a certain degree of clustering among peers exists.

1.3.2 BitTorrent

Extensive empirical studies of various BitTorrent-related topics have been conducted in the past few years. Here, we introduce those studies from the aspects of download performance, session characteristics, sharing behavior, community operation, peer connectivity, and traffic locality.

Download performance. Performance is arguably the most important topic in BitTorrent research, hence it has received much attention from the research community. Izal et al. [61] conducted the first measurement study of BitTorrent by analyzing a five-month log of a tracker. They show that BitTorrent is an efficient alternative to traditional server-based content distribution, and their data suggests that BitTorrent can sustain large flashcrowds. However, a measurement of the SuprNova community conducted by Pouwelse et al. [108] shows that BitTorrent cannot handle flashcrowds well in some swarms, that peers can experience low download speed, and that the number of active users in BitTorrent heavily depends on the availability of the global components (trackers) in BitTorrent. The inefficiency of BitTorrent in handling flashcrowds was further confirmed by a measurement study of two BitTorrent communities by Guo et al. [50]. They observe that peers in flashcrowds suffer from much lower download rates than peers joining after flashcrowds, and that peers can also experience fluctuating download performance. They further propose a new system to improve the download performance in BitTorrent, where trackers can facilitate inter-swarm collaboration among peers with exchange-based incentives. Zhang et al. [142] characterize the flashcrowds that occur in two of the largest public BitTorrent trackers, and they find that on average there is a nearly 7-fold decrease in download rate for peers in flashcrowds. These measurements suggest that the churn handling in BitTorrent is far from optimal and further research is needed. From the protocol perspective, Legout et al. analyze a dataset collected by an instrumented client. They show that the piece picking algorithm used by BitTorrent can achieve near-optimal diversity of pieces in a BitTorrent swarm and that the peer selection algorithm is fair and robust to free riders.

Session characteristics. There are many empirical BitTorrent studies reporting statistics about BitTorrent user sessions, such as arrival rates, session length, and seeding time. Such knowledge is crucial for understanding the system dynamics in BitTorrent and improving BitTorrent algorithms. Pouwelse et al. [108] report that most peers leave the system soon after finishing their download: only 17% and 3.1% of peers stay in the swarms for more than one hour and more than 10 hours after finishing their downloads, respectively. Guo et al. [50] find that the peer seeding time in two large public BitTorrent communities can be modeled as an exponential distribution. Stutzbach et al. [125] find that the session length of BitTorrent peers can be described by a Weibull or log-normal distribution, and they observe much longer seeding times in swarms sharing legal torrents than in swarms sharing copyrighted content [108]. They also notice that BitTorrent sessions are less predictable than sessions in Kad and Gnutella in terms of session length and peer availability. Besides session length and seeding time, Zhang et al. [143] find that the multi-session downloading behavior also varies dramatically among BitTorrent communities: in some BitTorrent communities, most peers can finish their download within one session, while in other communities, large fractions of peers spend multiple sessions to download one file.

Sharing behavior. BitTorrent is designed based on the assumption that peers contribute resources like bandwidth to others. Sharing is critical for content availability and download performance in BitTorrent. Until now, many measurements have been performed to understand the sharing behavior of BitTorrent users under different sharing-ratio policies. Piatek et al. [105] study the altruism in BitTorrent by analyzing a large dataset of two public communities without sharing-ratio enforcement. They find that the dominant performance effect in BitTorrent is the altruistic contribution from a minority of high capacity peers, which has little to do with the Tit-for-Tat algorithm. By using BitTyrant, a modified BitTorrent client, they show that it is possible for selfish peers to improve download performance while reducing the level of contribution to other peers. Meulpolder et al. [94] observe that the sharing-ratio enforcement employed in private BitTorrent communities can significantly change the behavior of BitTorrent users, which in turn leads to a higher seeder-leecher ratio, higher download performance, and better peer connectivity. Andrade et al. [10] show that the incentive mechanism adopted by BitTorrent can discourage free-riding but fails to lead to cooperation in some BitTorrent communities, where the social and economic characteristics play a more important role in determining the level of cooperation among peers. In addition, it is shown in [10, 94] that the Tit-for-Tat algorithm employed by BitTorrent can become less effective in the presence of large numbers of seeders or sharing-ratio enforcement.

Community operation. The BitTorrent ecosystem has demonstrated significant heterogeneity since the beginning. Although numerous measurements have been performed, it is still not easy to draw the complete picture of the BitTorrent ecosystem. The recent rise of private BitTorrent communities, also known as the Darknet, makes this task even more challenging. Zhang et al. [143] performed a comparative study of the traces hosted in the P2P Trace Archive, and show that BitTorrent communities differ dramatically in file popularity, file size, system dynamics, and peers’ sharing behavior (due to different community policies). Dán et al. [34] analyze a dataset collected from 800 BitTorrent trackers and find that Zipf’s law can only be used to describe the short-term but not the long-term popularity. A recent large-scale measurement conducted by Wojciechowski et al. [138] covers nearly 800 trackers and over 10 million swarms. They observe that nearly 20% of the trackers measured advertise false information to peers (spam trackers), and they find many swarms that are much bigger than those previously reported [61,108]. With a dataset collected by an instrumented client, Menasche et al. [93] show that in BitTorrent communities, content unavailability is a serious problem, and even a limited amount of bundling can significantly reduce the content unavailability. Zhang et al. [151] unravel the characteristics of the BitTorrent “darknet” by analyzing a dataset collected from over 800 private BitTorrent communities. They find that the size of each private community is very small but the aggregated size of the private communities is huge.
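To make the reference to Zipf’s law concrete: under Zipf’s law, the popularity of the item at rank r is proportional to r^(-s) for some exponent s, so popularity versus rank forms a straight line on a log-log plot. The sketch below estimates s with an ordinary least-squares fit on the log-log values. This simple estimator is an assumption chosen for illustration, not the method used in [34], and the name `zipf_exponent` is hypothetical.

```python
import math


def zipf_exponent(counts):
    """Estimate the Zipf exponent s from per-item popularity counts.

    Fits log(count) = c - s * log(rank) by least squares, where rank 1 is
    the most popular item. Items with a zero count are ignored.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two items with a positive count")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx   # the slope is -s, so negate it


# Synthetic download counts that follow count = 1000 / rank (exponent 1).
synthetic = [1000.0 / rank for rank in range(1, 51)]
print(round(zipf_exponent(synthetic), 6))  # 1.0
```

A poor straight-line fit over a long observation window would be one simple indication that a single Zipf exponent does not describe long-term popularity.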

Peer connectivity. Many studies have been conducted to obtain knowledge of peer connectivity in BitTorrent, since good peer connectivity is crucial for most P2P algorithms to function properly. Pouwelse et al. [108] report that around 40% of the IP addresses in a swarm of a popular movie were behind firewalls and hence not reachable from their instrumented client. Later, Mol et al. [97] study the behavior of firewalled BitTorrent peers, and they observe that around 70% of peers are behind firewalls at any time in the studied communities. A more recent study by D’Acunto et al. [31] even reports that almost 90% of peers were behind a firewall or various types of NAT. From a community perspective, Meulpolder et al. observe that the type of BitTorrent community can lead to a very different fraction of firewalled peers. In the public community EZTV, at most 47% of peers are unconnectable, while the private Polish tracker has only 20% unconnectable peers on average.

Traffic locality. Recently, many ISPs have started throttling BitTorrent traffic, because BitTorrent can generate a huge amount of inter-AS and inter-ISP traffic. Thus, several measurement studies have been performed in order to understand the severity of this problem and propose possible solutions. By analyzing the dataset used in [61], Karagiannis et al. [67] show that BitTorrent is not locality-aware, so it can significantly increase the inter-ISP traffic. However, they find that BitTorrent peers that are downloading the same file overlap 30-70% of the time, so the inter-ISP traffic can be decreased greatly by introducing minor modifications to the BitTorrent protocol. Contrary to the findings in [67], Piatek et al. [106] show, by analyzing a large-scale dataset, that the “win-win” solution for ISPs and BitTorrent users (better download performance for BitTorrent users and less wide-area traffic for ISPs) is unlikely. Their analysis indicates that although certain traffic locality exists in BitTorrent, simple heuristics are not sufficient to exploit it and can even hamper network robustness. Hoßfeld et al. [54] study a large dataset of BitTorrent swarms to derive characteristics for traffic optimization. They observe that the probability of finding peers within the same AS in a swarm is very small, which makes it difficult for locality-aware mechanisms to avoid inter-AS traffic.

1.3.3 P2P streaming systems

Nowadays, most of the popular P2P streaming systems are video streaming systems, such as PPLive and UUSee. As we mentioned earlier, all those systems are operated in China and have gained tremendous popularity in the past few years. The only P2P music streaming service is Spotify, which was launched in Sweden. Compared to BitTorrent, far fewer measurements have been conducted for P2P streaming systems. This is because, to the best of our knowledge, all the deployed P2P streaming systems are operated as commercial services and they all use proprietary clients and protocols, which makes external measurements difficult. As a result, most of the measurement studies of P2P streaming systems are carried out by the companies operating those services, and the quality of those “official” datasets is also far superior to that of datasets collected by external measurements, in terms of both accuracy and comprehensiveness.

Huang et al. [56] present the usage pattern of PPLive, one of the most popular P2P VoD systems, by analyzing a dataset of ten movies collected from the PPLive log server. They show that the inter-arrival times of users watching the same movie follow an exponential distribution. Interestingly, a large fraction of users spend less than 10 minutes in one movie channel but stay much longer in the system, which makes it possible for those peers to help peers watching movies in other channels. The peer population in each movie channel exhibits a strong daily pattern, peaking during the lunch break and in the late evening.

Yu et al. [141] present an analysis of a dataset collected from a large VoD system (PowerInfo) deployed by China Telecom, which covers more than 200 days and 150,000 users. Similar to PPLive, the peer population in PowerInfo exhibits a strong daily pattern with peaks during the lunch break and in the evening. The users in PowerInfo are also impatient: half of the users in each video channel spend less than 10 minutes there, which is defined as the “intro-sampling” behavior. Interestingly, the session length is inversely correlated with the video popularity, which can be exploited to improve video caching. In addition, unlike in PPLive, the user arrivals in PowerInfo channels cannot be modeled as a Poisson process, because such a model underestimates the probability of small arrivals and overestimates the probability of large arrivals.

Liu et al. [85] analyze a 200 GB dataset collected from the P2P VoD service UUSee to examine the efficacy of the network coding algorithm. They find that when users watch normal-quality videos in UUSee, the download rate is always higher than the required playback rate, which is not the case for users watching high-quality videos under high churn. Furthermore, the peer contribution in high-quality movie channels is not significant due to the throttling of upload speed by ISPs. They also show that the signaling overhead in UUSee is around 5%, which is much lower than the 10% overhead reported for PPLive [56].

Hei et al. [53] conduct an external measurement of PPLive using a crawler that is able to capture PPLive’s gossiping protocol. They observe a strong daily pattern of the peer population in the PPLive network and a significant flashcrowd during the Chinese spring festival gala show. At the end of their analysis, the authors suggest that more dedicated nodes need to be deployed to sustain high-quality playback, as the upload bandwidth of many peers is not sufficient. In addition, better peering and chunk scheduling strategies are necessary to improve the performance of live streaming in UUSee, since in some live channels, the playback positions of some peers are minutes behind those of others.

Silverston et al. [122] performed another external measurement of P2P streaming systems during the 2006 FIFA World Cup, covering PPLive, PPStream, SopCast, and TVAnts. They find that those four streaming systems exhibit very different traffic characteristics, and that they adopt different peering and piece downloading policies. Despite these differences, the peer lifetimes in all those systems follow a Weibull distribution, with different mean values.

Kreitz et al. [73] conducted the first study of Spotify, the popular peer-assisted music streaming service. The authors show that by using P2P technology, a local cache, and prefetching in Spotify, only 8.8% of the music streamed comes from Spotify servers and the median playback latency is only 265 ms. In addition, the traffic overhead in Spotify is around 5%, which is significantly lower than that in PPLive [56]. Our recently submitted study [150] (Chapter 5) presents a much more detailed and in-depth analysis of user behavior in Spotify.

1.4 P2P measurement approaches

In this section, we introduce the commonly used techniques for measuring P2P systems, and we discuss their strengths and limitations. Traces collected using these techniques are studied in Chapter 2, Chapter 3, and Chapter 4.

1.4.1 Instrumenting clients

One of the most common approaches for measuring P2P systems is deploying an instrumented client [27, 53, 60, 61, 77, 88, 108, 113, 114, 116, 122, 132] in a real P2P network to collect interesting information from other peers. This approach can be used for measuring P2P systems that use open protocols, such as Gnutella and BitTorrent. An instrumented client or software script communicates with other peers in a P2P network by following the P2P protocol completely or partially. For example, in our studies of BitTorrent communities, Pouwelse et al. [108] and Meulpolder et al. [94] used a customized script that follows the BitTorrent protocol to collect information about BitTorrent peers without downloading the shared files. The collected information includes download progress, session length, seeding time, and connectivity. This approach has two limitations. First, the measurement capacity is usually limited in terms of the number of queries that can be sent in a short time period or the number of peers that can be connected to simultaneously, which can lead to poor peer coverage in large P2P systems with high churn. Secondly, it may be difficult to proactively find or connect to peers behind a firewall or NAT.


1.4.2 Sniffing traffic

The instrumented client approach is not feasible for measuring systems that use proprietary protocols, like PPLive and Skype. Alternatively, such systems can be measured by sniffing the traffic of their clients locally [48, 49, 80, 139]. Using this approach requires certain knowledge of the protocols used by the measured system, which is usually obtained by analyzing the traffic generated by the local client. Similar to the instrumented client approach, this approach has limited measurement power. In addition, it can be difficult to assess the accuracy and biases of the measurement results due to a lack of complete knowledge of the protocol. Other studies [9, 17, 42, 66, 89, 119] measure P2P traffic by deploying sniffing clients at ISPs or in the networks of large organizations. This approach can achieve nearly complete peer coverage in the measured network, but it is not easy for most researchers to deploy measurement software at the ISP or organization level.

1.4.3 Scraping websites

The previous approaches can be used for measuring various P2P systems. Now, we will introduce approaches used specifically for measuring BitTorrent. One common approach for measuring BitTorrent is measuring (or scraping) the websites of BitTorrent communities [11, 18, 50, 51, 94, 146, 151], since many such communities publish statistics about their service with varying levels of detail. Such information can be easily obtained by scraping the website regularly. However, this approach has several limitations. First, the data collected with this approach can be limited, as it can only contain information published by the website. Secondly, the collected data can be erroneous, as many BitTorrent websites get overloaded occasionally and the web page layout can be changed without informing the users. Thirdly, the IP address and/or the user account used for the measurement can be banned from the measured website because of too frequent sampling.

1.4.4 Scraping trackers

Another widely used approach for measuring BitTorrent is scraping BitTorrent trackers. As introduced earlier, a tracker is a centralized component in BitTorrent that provides a peer discovery service, so it maintains the list of all participating peers in a swarm. During the download, BitTorrent peers regularly (normally every 30 minutes) report the total amount of downloaded bytes to the tracker. However, the scraping API of a tracker only returns the numbers of leechers and seeders in a swarm [33, 61, 142]. This lack of detailed information, such as peer download rates, severely limits the usage of such datasets. Nevertheless, since one tracker can serve thousands of swarms simultaneously, scraping trackers is efficient for obtaining high-level information about large numbers of swarms. Like BitTorrent websites, a tracker can also get overloaded, which can result in erroneous reports. This approach has been used in BTWorld [138], the long-term BitTorrent measurement carried out in our group, which monitors the swarm population of hundreds of trackers.
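To make the shape of tracker-scrape data concrete, the following is a minimal sketch of how a scrape response could be turned into per-swarm seeder and leecher counts. It assumes the widely used scrape convention in which the tracker returns a bencoded dictionary whose `files` entry maps each info-hash to the keys `complete` (seeders) and `incomplete` (leechers); the function names `bdecode` and `swarm_sizes` are hypothetical and not taken from any measurement tool used in this thesis.

```python
def bdecode(data: bytes, i: int = 0):
    """Decode one bencoded value starting at offset i.

    Returns a pair (value, offset just past the value).
    """
    c = data[i:i + 1]
    if c == b"i":                       # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                       # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                       # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


def swarm_sizes(scrape_response: bytes):
    """Map each info-hash to (seeders, leechers).

    Assumes the conventional 'files' / 'complete' / 'incomplete' keys.
    """
    decoded, _ = bdecode(scrape_response)
    return {
        info_hash: (stats.get(b"complete", 0), stats.get(b"incomplete", 0))
        for info_hash, stats in decoded.get(b"files", {}).items()
    }
```

Note that such a response contains only aggregate counters per swarm, which is exactly why scrape-based datasets cannot provide per-peer information such as download rates.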

1.5 Problem statement

In this thesis, we will address the following problems:

What are the sampling biases in BitTorrent measurements? Similarly to early Internet measurement efforts [15, 44], due to the size of the complete network, all BitTorrent measurements have employed data sampling techniques, from periodic measurements to a focus on specific BitTorrent communities. Although measurements play a crucial role in empirical studies and much effort [19, 49, 52, 61, 108, 118] has been put in the last decade into empirical measurements of P2P file-sharing systems including BitTorrent, there currently exists no comprehensive evaluation of the sampling biases introduced by measurements, that is, of the ability of measurements to objectively represent the characteristics of BitTorrent. To use measurement results of BitTorrent correctly, it is necessary to understand the strengths and limitations of different measurement techniques in terms of sampling biases.

How do P2P systems differ from each other and evolve over time? Although many P2P measurements have been carried out in the last decade [48, 52, 70, 74, 135], few measurement datasets are publicly available, and for these few the data are presented in different formats. As a consequence, many P2P algorithms and methods lack a realistic evaluation. More importantly, without public datasets, it is hard to know the differences among P2P systems or communities. Without this knowledge, it is not possible to evaluate models and algorithms for different scenarios. Therefore, it is crucial for the community to have a public archive that hosts available P2P traces and facilitates their re-use, and a comparative analysis of traces of different P2P systems can provide valuable insights for the research community.

What are the characteristics of flashcrowds in BitTorrent? Flashcrowds [13, 14, 41, 65] have been observed in traditional web systems and can lead to decreased responsiveness and increased backlogs. Similarly, flashcrowds have also been observed in BitTorrent [50, 61, 108, 142]. However, only a few algorithms [20] have been proposed to address BitTorrent flashcrowds. In fact, not much is known about their patterns of occurrence and characteristics in real-world deployments. As a result, many basic questions about BitTorrent flashcrowds remain unanswered, such as How often do they occur?, How long do they last?, and Are BitTorrent peers joining flashcrowds worse off than peers joining regular swarms? Because several measurement [61] and analytical [109] studies have shown that BitTorrent can achieve efficient large-scale content distribution, flashcrowds have largely been ignored as a potential problem for BitTorrent users. Therefore, it is important to understand how flashcrowds can affect BitTorrent users and what the properties of flashcrowds are. Such knowledge can provide valuable insights for developing effective flashcrowd handling mechanisms.

How do users behave in Spotify, the popular peer-assisted music streaming service? The user behavior in P2P music streaming services can be significantly different from that in P2P video streaming services, because of the fundamental difference between music listening and video watching. A music streaming application can turn a smart phone into an MP3 player with millions of tracks that can be used at almost any time and anywhere, which is not practical for video streaming. Despite the worldwide popularity of Spotify, little is known about the behavior of its users. This knowledge is key for understanding the usage pattern of P2P music streaming services. In addition, the explosively increasing adoption of smart phones and tablets makes it urgent to understand the usage pattern of music streaming services on mobile platforms, which can provide valuable insights for developing mobile P2P streaming applications.

1.6 Contributions and thesis outline

The main contributions of this thesis are:

Sampling bias in BitTorrent measurements (Chapter 2). In this chapter, we present an extensive investigation of the factors that cause inaccuracy in and variability among BitTorrent measurements. To the best of our knowledge, we are the first to propose a method for exposing sampling bias in BitTorrent measurements. Our method includes a taxonomy of the sources of sampling bias comprising two axes, data source selection and data volume reduction, which cover the measurement level, the community type, the measurement type (active and passive), the sampling rate and duration, and the number of measured communities and swarms. Then, we evaluate the effects of the different sources of sampling bias using fifteen real traces taken from nine BitTorrent communities. Our results indicate that current measurement techniques can lead to a significant sampling bias in the measurement results. Based on our findings, we have formulated three recommendations to improve future BitTorrent measurements. This chapter is based on [148].

Comparative analysis of P2P traces (Chapter 3). In this chapter, we introduce the Peer-to-Peer Trace Archive, which facilitates the exchange of P2P traces, and we conduct an extensive analysis of the traces in the archive. We first analyze comparatively the traces collected from various P2P systems, including file-sharing, VoIP, and video-streaming systems, and then we examine the sharing behavior in multiple BitTorrent communities. The analysis results show that many characteristics, such as arrival rates, session length, file size, peer bandwidth, and sharing behavior, differ significantly across P2P systems and communities, and that some of them also evolve rapidly over the years. We also find that peer/session identification methods can significantly influence the analysis results. Currently, the P2P Trace Archive hosts more than 20 traces. This chapter is based on [143, 146].

Studying flashcrowds in BitTorrent (Chapter 4). In this chapter, we conduct the first comprehensive study of BitTorrent flashcrowds. We propose a model of flashcrowds that characterizes their properties, such as their durations and their magnitudes, and we develop a flashcrowd identification algorithm that is able to identify BitTorrent flashcrowds from the evolution of swarm sizes. We study three datasets collected in 2009 and 2010 from OpenBitTorrent and PublicBitTorrent, two of the largest public BitTorrent trackers nowadays, which provide us with two million swarms in total. Finally, we perform an analysis and statistical modeling of the BitTorrent flashcrowds identified from our datasets, revealing for their properties the best-fitting probability distributions and the parameters of the best fits. Our findings can be readily used to generate synthetic yet realistic workloads for simulation studies and for real-system tuning. Furthermore, the results can be used to improve the BitTorrent protocol and to build effective flashcrowd mitigation mechanisms. This chapter is based on [142].

Understanding user behavior in Spotify (Chapter 5). In this chapter, we study the user behavior in Spotify, the peer-assisted music streaming service, using a massive dataset. We focus our study on system dynamics and individual user behavior. For system dynamics, we first investigate the session arrival patterns, the playback arrival patterns, and the daily variation of session length. For individual users, we separately investigate the multi-device and single-device behavior. Our analysis of multi-device behavior reveals how Spotify users switch between multiple devices, and we analyze the temporal properties of user sessions. For single-device behavior, we focus on correlations between the length and downtime of successive sessions. To the best of our knowledge, we are the first to study the user behavior in music streaming services using a large real-world dataset. Our main findings include: session arrivals, playback arrivals, and session length all exhibit strong daily patterns in Spotify; Spotify users have a strong “inertia” to continue their next session on the same device; and users have their favorite times of day to use the Spotify service. Our findings are not only key for improving the system design of Spotify, but also provide valuable insights for understanding user behavior in music streaming systems in general. This chapter is based on [150].

Conclusions and future work (Chapter 6). In this chapter, we summarize the most important findings and results in this thesis, and we discuss potential future work.


Sampling Bias in BitTorrent Measurements

As we mentioned in the introduction, BitTorrent has been by far the most popular P2P file-sharing system of the past decade, and much effort has been put into measuring BitTorrent, with the purpose of understanding and improving its use. Similarly to early Internet measurement efforts [15, 44], due to the size of the complete network, all BitTorrent measurements have employed data sampling techniques, from periodic measurements to a focus on specific BitTorrent communities. Despite this situation, there currently exists no comprehensive evaluation of the sampling biases introduced by BitTorrent measurements, that is, of the ability of these measurements to objectively represent the characteristics of BitTorrent. This work presents the first such investigation.

Understanding sampling biases in BitTorrent measurements can benefit BitTorrent research in the following ways. First, it can lead to a better understanding of the commonalities of and differences among different parts of the BitTorrent network by explicitly comparing the measurement results. In the Internet community, this “search for invariants” process [15] fostered many new research opportunities [44]. Of the many empirical BitTorrent measurements [11, 59, 61, 108], only some [11, 59] consider aspects of the sampling bias problem. Second, understanding sampling biases leads to a better understanding of the usage of measurement techniques, which is key to designing and improving BitTorrent measurements. It is symptomatic of the current (lack of) understanding of BitTorrent measurement techniques that there is no agreement on the share of Internet traffic due to BitTorrent, though caching companies have put forth estimates of over 50% in 2008 [7] and 30% in 2005 [103]. Towards understanding sampling biases in BitTorrent measurements, our main contribution in this chapter is threefold:

1. We propose a method for exposing the sampling biases in BitTorrent measurements that focuses on both the measured BitTorrent components and the volume of the measured data (Section 2.2);


2. Using fifteen diverse BitTorrent datasets (Section 2.3) we show that the measured BitTorrent components and the volume of the measured data can both significantly bias measurement results (Section 2.4);

3. We formulate recommendations to improve future BitTorrent measurements, and estimate the costs of implementing these recommendations (Section 2.5).

2.1 Background

A P2P system is a system that uses P2P technology to provide a set of services; together, these services form an application such as file sharing. We call peers the participants in a P2P system that contribute to or use the system’s resources and services. A peer is completely disconnected until it joins the system, and is active until it leaves the system. A real user may run several peer sessions; these sessions do not overlap in time. We call a swarm the group of peers, from all the peers in a P2P system, that interact with each other for a specific goal, such as transferring a file. A swarm starts being active when the first peer joins that swarm, and ends its activity when its last peer leaves. The lifetime of a swarm is the period between the start and the end of the swarm. A community is the group of peers who are or can easily become aware of the existence of each other’s swarms.

Our view on P2P systems considers three levels of operation. A P2P system includes at least a peer level, but may also include any of the community and swarm levels. The definitions of community, swarm, and peers presented here are general for P2P systems, though their implementation may differ with the P2P protocol. For example, BitTorrent and eDonkey interpret, and thus implement, the swarm concept differently. In this chapter we focus on BitTorrent, which includes all three levels of operation. The files transferred in BitTorrent contain two parts: the raw file data and a metadata (directory and information) part. Peers interested in a file obtain the file’s metadata from a web site (the community level of BitTorrent) and use the peer location services offered by a tracker (the swarm level of BitTorrent) to find other peers interested in sharing the file. The raw file is then exchanged between peers (the peer level of BitTorrent). To facilitate this exchange, the raw data are split into smaller parts, called chunks. Thus, to obtain a complete file, a user has to obtain all the chunks by using all three levels of operation. We distinguish between the complete and the transient swarm population: we define the population of a swarm as the set of all peers ever present in the swarm at any time during the measurement, and a snapshot of a swarm as the set of peers present at a specific time during the measurement.
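The distinction between population and snapshot can be illustrated with a minimal sketch. It assumes that a measurement has already been reduced to a list of sessions with known join and leave times; the `Session` record and the function names are hypothetical and serve only to restate the definitions above in executable form.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Session(NamedTuple):
    peer_id: str
    join: float    # time at which the peer joined the swarm
    leave: float   # time at which the peer left the swarm


def population(sessions: Iterable[Session]) -> Set[str]:
    """All peers ever present in the swarm during the measurement."""
    return {s.peer_id for s in sessions}


def snapshot(sessions: Iterable[Session], t: float) -> Set[str]:
    """Peers present in the swarm at time t (join inclusive, leave exclusive)."""
    return {s.peer_id for s in sessions if s.join <= t < s.leave}


def swarm_lifetime(sessions: List[Session]) -> Tuple[float, float]:
    """First join and last leave observed for the swarm."""
    return min(s.join for s in sessions), max(s.leave for s in sessions)


sessions = [
    Session("p1", 0, 30),
    Session("p2", 10, 20),
    Session("p1", 40, 50),   # the same peer returns in a later session
    Session("p3", 45, 60),
]
print(population(sessions))      # the three distinct peers
print(snapshot(sessions, 15))    # peers present at t = 15
print(swarm_lifetime(sessions))  # (0, 60)
```

In this toy trace the population contains three peers, although no single snapshot ever contains more than two of them, which is why the two notions have to be kept apart when reporting swarm sizes.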


2.2 Exposing the Sampling Bias

In this section we introduce our method for exposing sampling bias in BitTorrent measurements.

Our method focuses on two main questions that define a measurement process: What is the relationship between the data source and bias? and What is the relationship between the data volume and bias? The first question stems from the complexity and scale of BitTorrent. For example, there currently exist tens of popular BitTorrent communities, many operating independently from the others and having specific usage characteristics. The second question expresses the trade-off between accuracy and measurement cost (see also Section 2.5).

2.2.1 Method

We say that a measurement conducted on a BitTorrent component is affected by a sampling bias if the sampled characteristics are significantly different from the real characteristics of that BitTorrent component. To analyze the sampling bias we need an understanding of the real characteristics (the ground truth), a conceptual framework for understanding the differences between the sampled and real characteristics, and metrics to quantify these differences.

Ground Truth: The characteristics of BitTorrent are largely unknown: there currently exists no model of a dynamic and heterogeneous BitTorrent swarm, and scientists do not possess even a single complete measurement dataset comprising every message exchanged between the peers of a BitTorrent community of significant size. Thus, and similarly to the situation of exposing sampling biases for Internet measurements [44, 75], we need to trace the presence of sampling bias without a ground truth. Instead, we make the observation that if measurements are unbiased, the measured characteristics should remain the same regardless of the sampling. Following this observation, we define for a measurement the complete dataset, which is the practical equivalent of the ground truth, as the dataset collected with maximal measurement capability. For example, if a real measurement has sampled data every 5 minutes, it can be used to understand the sampling bias resulting from larger sampling intervals, such as 10 minutes.
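The idea of using the complete dataset as a stand-in for the ground truth can be sketched as follows. The sketch assumes that the complete dataset is a list of evenly spaced snapshots, each holding the set of peer identifiers seen at that time; the names `subsample` and `peers_seen` and the toy data are hypothetical.

```python
from typing import List, Set, Tuple

Snapshot = Tuple[int, Set[str]]  # (minutes since start, peer ids observed)


def subsample(snapshots: List[Snapshot], base_interval: int,
              target_interval: int) -> List[Snapshot]:
    """Emulate a coarser sampling interval by keeping every k-th snapshot.

    Assumes the snapshots are evenly spaced at base_interval minutes.
    """
    if base_interval <= 0 or target_interval % base_interval != 0:
        raise ValueError("target interval must be a multiple of the base interval")
    step = target_interval // base_interval
    return snapshots[::step]


def peers_seen(snapshots: List[Snapshot]) -> Set[str]:
    """Union of all peers observed in the given snapshots."""
    seen: Set[str] = set()
    for _, peers in snapshots:
        seen |= peers
    return seen


# Toy complete dataset sampled every 5 minutes.
complete = [
    (0, {"a", "b"}),
    (5, {"b", "c"}),
    (10, {"b"}),
    (15, {"d"}),
    (20, {"b", "e"}),
]
coarse = subsample(complete, base_interval=5, target_interval=10)
missed = peers_seen(complete) - peers_seen(coarse)
print(sorted(missed))  # peers only visible at the finer sampling interval
```

Here the short-lived peers `c` and `d` are visible only in the 5-minute data, so any characteristic computed from the 10-minute data would silently exclude them.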

Conceptual Framework: We use the term variability when comparing properties, e.g., the average peer download speed or the empirical distribution of file sizes for a community, as measured across different BitTorrent components when using the same measurement technique. We also use the term accuracy when examining how data collected with different techniques compare with the complete dataset.

Metrics: We estimate the sampling bias using two metrics:

• The Coverage metric, defined as the percentage of sampled peers or events, out of the peers or events comprised in the complete dataset.

• The Error/deviation of values metric, which mimics traditional statistical approaches for comparing probability distributions of random variables. The Kolmogorov-Smirnov test [82] uses the D characteristic to estimate the maximum distance between the cumulative distribution functions (CDFs) of two random variables. Similarly, we use the D characteristic to compare the measured and the complete dataset values. Following traditional work on computer workload modeling (see [43] and the references within), we say that measurements resulting in errors above 10% (D metric above 0.1) have very low accuracy, and that measurements with 5–10% error have low accuracy.
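The two metrics above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the analysis code behind the results in this chapter; the function names are hypothetical, and the label returned for errors below 5% is an assumption, since the text only names the two low-accuracy classes.

```python
from bisect import bisect_right
from typing import Iterable, Sequence


def coverage(sampled: Iterable[str], complete: Iterable[str]) -> float:
    """Percentage of the complete dataset's peers (or events) that were sampled."""
    complete_set = set(complete)
    if not complete_set:
        raise ValueError("the complete dataset must not be empty")
    return 100.0 * len(set(sampled) & complete_set) / len(complete_set)


def ks_distance(measured: Sequence[float], complete: Sequence[float]) -> float:
    """Kolmogorov-Smirnov D: maximum distance between the two empirical CDFs."""
    xs, ys = sorted(measured), sorted(complete)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The supremum of |F_x - F_y| is attained at one of the observed values.
    for v in set(xs) | set(ys):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def accuracy_label(d: float) -> str:
    """Classify D using the thresholds from the text (10% and 5%)."""
    if d > 0.10:
        return "very low accuracy"
    if d >= 0.05:
        return "low accuracy"
    return "not flagged"  # assumed label; the text defines no name for this case


print(coverage({"a", "b"}, {"a", "b", "c", "d"}))          # 50.0
print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))              # 0.5
print(accuracy_label(ks_distance([1, 2, 3, 4], [3, 4, 5, 6])))
```

Computing D directly from the empirical CDFs keeps the comparison free of distributional assumptions, which matches the situation in which neither dataset is known to follow a particular model.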

2.2.2 Data Sources

Depending on the selection of the data source, we distinguish three main sources of variability or accuracy:

1. The Measurement level. In Section 2.1 we have defined three levels for a P2P application: community, swarm, and peer. Measuring at any single level may result in measurement inaccuracy. For example, measurements taken at the peer level, with an instrumented peer as the measurement tool, may fail to contact all peers in a swarm, since peers have limited uptime (presence) in the system.

2. The Community type. Many types of communities exist in the BitTorrent world, and this diversity leads to high variability in measurement results. We categorize BitTorrent communities based on the type of content they share, either general or specific content. The specific content may be further divided into content sub-types such as video, operating system, etc.; Garbacki et al. [45] have identified around 200 content sub-types for the SuprNova community.

3. The Passive vs. Active Measurements. Following the terminology introduced in our previous work [59], peer-level measurements are active if the instrumented peers acting as measurement probes initiate contact with other BitTorrent peers, and passive if they wait for externally initiated contacts. In contrast to passive measurements, active measurements require that the other peers are accessible, for example, that they are not behind a firewall. The 2007 measurement by Xie et al. [139] shows that up to 90% of the peers in a live streaming application are firewalled, and that less than 20% of them bypass the firewalls.

2.2.3 Data Volume

The data volume is another major discriminant among measurements:

1. Sampling rate and Duration. Since peer-to-peer systems have properties that evolve over time, measurements have to observe the same property repeatedly. The data volume is then the product of the sampling rate and the measurement duration; reducing either leads to lower data volumes, but may also lead to inaccuracy. Rates of one sample every 2.5 [59, 108] to 30 minutes [11], and durations of a few days [59] to a few months [61], have been used in practice.

2. Number of communities and Number of swarms. BitTorrent communities may share properties, and within a community the most populated swarms may account for most of the traffic. Thus, including fewer communities and swarms in the measurement may reduce the volume of acquired data without reducing accuracy. Until the recent study of four communities [11], measurements have often focused on one community [59, 108], and even on only one swarm [61].

3. Long-term dynamics. Many BitTorrent communities have changed significantly over time or have even disappeared. Thus, measurements should make efforts to capture long-term system dynamics, including seasonal and multi-year patterns. In practice, the only long-term BitTorrent measurements are the five-month study of Izal et al. [61] and our own year-long measurement of SuprNova [108].
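As a small worked example of the first item, the number of samples taken per monitored swarm follows directly from the sampling interval and the measurement duration. The sketch below uses the two sampling intervals mentioned above (2.5 and 30 minutes); the 7-day duration and the function name `sample_count` are illustrative assumptions.

```python
def sample_count(interval_minutes: float, duration_days: float) -> int:
    """Number of samples per monitored swarm for a fixed sampling interval."""
    if interval_minutes <= 0 or duration_days < 0:
        raise ValueError("interval must be positive and duration non-negative")
    return int(duration_days * 24 * 60 // interval_minutes)


fine = sample_count(2.5, 7)    # one sample every 2.5 minutes for a week
coarse = sample_count(30, 7)   # one sample every 30 minutes for a week
print(fine, coarse, fine // coarse)
```

Under these assumptions, moving from a 30-minute to a 2.5-minute interval multiplies the data volume per swarm by twelve, which is the cost side of the trade-off discussed in Section 2.5.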

2.3 The Collected Traces

To understand sampling bias in BitTorrent measurements, we have acquired 15 long-term traces from 9 BitTorrent communities, which together comprise hundreds of thousands of peers and are responsible for transferring over 13 petabytes of data yearly. In this chapter, we investigate nine BitTorrent datasets, as summarized in Table 2.1; for a complete description of the traces see our technical report [147]. Traces studied in this work are available at the Peer-to-Peer Trace Archive.

To ensure heterogeneity among the limited number of traces, we have taken into account the following controllable factors when collecting the traces. From the community perspective, the traces focus on communities with either general or very specific types of content; for example, the id software community focuses on sharing demos of games commercialized by id software. From the community size perspective, the traces range from the largest communities in the world at the time of the data collection (T1’04/SuprNova in 2004 and T2’05/PirateBay in 2005) down to small communities, both in terms of the number of users and the number of shared files. In terms of duration, all the traces but T2’05 are long-term. The T2’05 trace was collected during just a few days in mid-2005, but it still contains the largest number of peers and events relative to the measurement duration. With respect to the biases introduced by the very long-term community evolution, several of the collected traces include two datasets, one acquired in 2005 and one acquired in 2009.


ID      Trace description (content type)  Period                      Sampling   Torrents  Sessions    Traffic
                                                                      (minutes)                        (GB/day)
T1’04   BT-TUD-1, SuprNova (General)      Oct 2003 to Dec 2004        60         32,452    n/a         n/a
                                          06 Dec 2003 to 17 Jan 2004  2.5        120       28,423,470  n/a
T2’05   BT-TUD-2, PirateBay (General)     05-11 May 2005              2.5        2,000     35,881,338  32,000
T3’05   LegalTorrents.com (General)       22 Mar to 19 Jul 2005       5          41        n/a         698
T3’09   LegalTorrents.com (General)       24 Sep 2009 onwards         5          183       n/a         1,100
T4’05   etree.org (Recorded events)       22 Mar to 19 Jul 2005       15         52        165,168     9
T4’09   etree.org (Recorded events)       24 Sep 2009 onwards         15         45        169,768     143
T5’05   tlm-project.org (Linux)           22 Mar to 30 Apr 2005       10         264       149,071     735
T5’09   tlm-project.org (Linux)           24 Sep 2009 onwards         10         74        21,529      15
T6’05   transamrit.net (Slackware Unix)   22 Mar to 19 Jul 2005       5          14        130,253     258
T6’09   transamrit.net (Slackware Unix)   24 Sep 2009 onwards         5          60        61,011      840
T7’05   unix-ag.uni-kl.de (Knoppix)       22 Mar to 19 Jul 2005       5          11        279,323     493
T7’09   unix-ag.uni-kl.de (Knoppix)       24 Sep 2009 onwards         5          12        160,522     348
T8’05   idsoftware.com (Game Demos)       22 Mar to 19 Jul 2005       5          13        48,271      19
T8’09   idsoftware.com (Game Demos)       24 Sep 2009 onwards         5          37        14,697      12
T9’05   boegenielsen.net (Knoppix)        22 Mar to 19 Jul 2005       5          15        36,391      308

Table 2.1: Datasets used in this work. Datasets T1’04 and T2’05 have been previously analyzed [59, 108].

2.4 The Results

In this section we investigate the effects of the different ways of data source selection and data volume reduction on the variability and accuracy of BitTorrent measurements.

2.4.1 Data Source Selection

Here we assess the effects of the selection of the data source on the accuracy and variability of BitTorrent measurements.

Finding: Measurements performed at a single operational level of BitTorrent can lead to very low accuracy.

When measuring swarm dynamics, we observe in T1’04 that in 4 out of 10 swarms, peer-level measurements capture far fewer peers than swarm-level measurements. For swarm 003 of T1’04, during the flashcrowd (see Figure 2.1), the peer-level coverage drops to about 70% of the swarm-level coverage. Later, due to the overloaded infrastructure, the peer-level coverage even drops to 50% for about half the duration of the flashcrowd. Similar effects can be observed when comparing measurements at the community and the swarm level in many communities. Taking T7’05 as an example, the community throughput as obtained from community-level measurements is more than 50% higher than the swarm-level numbers, and so the latter are very inaccurate (see Figure 2.2).

Figure 2.1: Comparison of swarm dynamics resulting from swarm-level measurement and peer-level measurement.

Figure 2.2: Comparison of cumulative community throughput resulting from community-level measurement and swarm-level measurement.

Finding: Measuring different BitTorrent communities may lead to very different results.

As discussed in Section 2.3, the diversity in community types contributes to a high variability in the measurement results of different communities. For several BitTorrent communities, we show in Figures 2.3, 2.4, and 2.5 the cumulative distribution functions (CDFs) of the file size, the upload speed, and the session length, which all differ significantly among communities. Furthermore, for these communities we do not see a correlation between these characteristics and the focus of the community on general versus specific content. We also observe similar differences in the distributions of the swarm sizes and the download speeds in several communities.

Figure 2.3: Comparison of file size distributions in six BitTorrent communities.

Figure 2.4: Comparison of upload speed in four BitTorrent communities.
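One way to quantify how much two communities' CDFs differ is the largest vertical gap between their empirical CDFs (the two-sample Kolmogorov-Smirnov statistic). This is a technique swapped in here for illustration; the text above compares the CDFs visually and does not state that this statistic was used. The function names and the file sizes below are hypothetical, and both sample lists are assumed to be non-empty.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return the empirical CDF of a non-empty list of samples as a function."""
    ordered = sorted(samples)
    total = len(ordered)

    def cdf(x):
        # Fraction of samples that are less than or equal to x.
        return bisect_right(ordered, x) / total

    return cdf


def max_cdf_gap(samples_a, samples_b):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a = ecdf(samples_a)
    cdf_b = ecdf(samples_b)
    # The gap can only change at observed sample values.
    points = sorted(set(samples_a) | set(samples_b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


# Hypothetical file sizes in MB for two communities.
community_a = [100, 200, 300, 400]
community_b = [300, 400, 700, 1200]

gap = max_cdf_gap(community_a, community_b)
print(gap)
```

A gap close to 0 means the two distributions are nearly identical, while a gap close to 1 means they barely overlap.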

Figure 2.5: Comparison of session length in four BitTorrent communities.

Finding: The results of passive and active measurements differ significantly.

This is because the presence of firewalled peers is significant in BitTorrent, and the firewalled and non-firewalled peers have different uptime. For example, less than 60% of the peers are non-firewalled in the T1’04 (SuprNova) trace [108]. An in-depth analysis of the impact of (the fraction of) firewalled peers on upload/download performance in four communities was presented by Mol et al. [97]; their analysis also covers the data of T2’05 (The Pirate Bay), which were collected using both active and passive measurements. It turns out that over 65% of the peers discovered using the active measurements are firewalled, and that 96% of the swarms have over 50% firewalled peers. The same study found that, because BitTorrent rewards peers with connectivity, non-firewalled peers exhibit 80% less uptime than firewalled peers. However, it is not possible to only perform passive measurements, since in this case it is costly to guarantee coverage and a steady sampling rate.

2.4.2 Sampling Rate and Duration

In order to understand the effect of the sampling rate on the measurement accuracy, we take the original datasets obtained in our measurements, with their original sampling rates and durations, as the basis. From these datasets we derive new datasets with other (lower) sampling rates by sampling the original datasets at various intervals. Similarly, we derive datasets with shorter durations by simply taking contiguous pieces of the appropriate lengths of the original datasets. We then compare the properties of the original and newly obtained datasets.
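The derivation of the reduced datasets can be sketched as follows. This is a minimal illustration under assumed data structures, not the tooling used for the thesis: a dataset is taken to be a time-ordered list of `(timestamp_in_minutes, set_of_peer_ids)` snapshots, and the names `downsample` and `shorten` are hypothetical.

```python
def downsample(snapshots, factor):
    """Keep every `factor`-th snapshot, starting with the first one.

    With an original interval of 2.5 minutes, factor=3 yields a dataset
    with a 7.5-minute sampling interval.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return snapshots[::factor]


def shorten(snapshots, start, count):
    """Take a contiguous piece of `count` snapshots beginning at index `start`."""
    return snapshots[start:start + count]


# Hypothetical dataset: 8 snapshots taken 2.5 minutes apart.
original = [(i * 2.5, {f"peer{i}", f"peer{i + 1}"}) for i in range(8)]

every_third = downsample(original, 3)   # snapshots at 0.0, 7.5, and 15.0 minutes
first_four = shorten(original, 0, 4)    # the first four consecutive snapshots

print([timestamp for timestamp, _ in every_third])
print([timestamp for timestamp, _ in first_four])
```

Because both functions only select from the original snapshots, any property computed on the derived datasets can be compared directly against the same property computed on the original.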

Figure 2.6: Transient peer coverage resulting from measuring at different sampling intervals: (a) one swarm; (b) multiple swarms.

Figure 2.7: Measurement results become more unstable with large sampling intervals. (T1.003.)

Finding: When measuring at the peer level, increasing the sampling interval leads to a higher inaccuracy and variance.

Figure 2.6 shows how the average snapshot coverage of several swarms drops when the sampling interval is increased from 2.5 minutes (the original sampling interval) to 30 minutes. Figure 2.7 shows the statistics (min, max, median, and first and third quartile) of the distribution of the snapshot coverage of a single swarm obtained at different multiples of the original sampling interval. Only when the sampling interval does not exceed 7.5 (15) minutes is the median snapshot coverage at least 90% (80%).
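The statistics reported for Figure 2.7 can be sketched as a five-number summary over per-snapshot coverage values. The coverage definition used here is an assumption made for illustration, not necessarily the one used in the thesis: for each window of `factor` consecutive original snapshots, coverage is the fraction of peers seen anywhere in the window that also appear in the one snapshot that the coarser sampling would keep. All names and peer sets are hypothetical.

```python
from statistics import quantiles


def snapshot_coverage(original, factor):
    """Coverage of each retained snapshot at `factor` times the original interval.

    `original` is a time-ordered list of sets of peer identifiers.
    The first snapshot of every window is the one that would be kept.
    """
    coverages = []
    for start in range(0, len(original), factor):
        window = original[start:start + factor]
        seen = set().union(*window)
        if seen:  # skip windows in which no peer was observed at all
            coverages.append(len(window[0]) / len(seen))
    return coverages


def five_number_summary(values):
    """Return (min, first quartile, median, third quartile, max).

    Uses the inclusive quantile method; needs at least two values.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical peer sets for four consecutive original snapshots.
original_snapshots = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d"}]

full_rate = snapshot_coverage(original_snapshots, 1)
half_rate = snapshot_coverage(original_snapshots, 2)
summary = five_number_summary(half_rate)
print(full_rate, half_rate, summary)
```

Under this definition the coverage at the original sampling rate is 1 by construction, and it can only stay equal or decrease as the windows grow, which matches the direction of the effect described above.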
