Distributed Hash Table Algorithms

L3S Research Center, University of Hannover
Wolf-Tilo Balke and Wolf Siberski, 14.11.2007

*With slides provided by Stefan Götz and Klaus Wehrle (University of Tübingen) and from Peer-to-Peer Systems and Applications, Springer LNCS 3485

Review: DHT Basics ("Purple Rain")
- Objects need a unique key
- The key is hashed to an integer value (hash function, e.g. SHA-1); see the sketch below
  - Huge key space, e.g. 2^128
- The key space is partitioned
  - Each peer gets its own key range
- DHT goals
  - Efficient routing to the responsible peer
  - Efficient routing table maintenance
[Ring diagram: the key space is split into ranges such as 3485-610, 611-709, ..., 1008-1621, 1622-2010, 2011-2206, 2207-2905, 2906-3484, each assigned to one peer; the example key "Purple Rain" hashes to 2313]
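
To make the hashing step concrete, here is a minimal Python sketch (not part of the original slides): it hashes a key such as "Purple Rain" with SHA-1 and maps it onto a tiny example ring. The ring size of 2^12 and the peer ids are illustrative assumptions.

import hashlib
from bisect import bisect_left

RING_BITS = 12  # tiny ring (0..4095) for illustration; real DHTs use e.g. 2**128

def ring_position(key: str) -> int:
    """Hash a key with SHA-1 and map it onto the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

def responsible_peer(key_pos: int, peer_ids) -> int:
    """Consistent hashing: the peer with the smallest id >= key_pos (wrapping around)."""
    peers = sorted(peer_ids)
    i = bisect_left(peers, key_pos)
    return peers[i % len(peers)]

if __name__ == "__main__":
    peers = [610, 709, 1621, 2010, 2206, 2905, 3484]   # hypothetical peer ids
    pos = ring_position("Purple Rain")
    print(f"'Purple Rain' -> ring position {pos}, handled by peer {responsible_peer(pos, peers)}")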

DHT Design Space
- Minimal routing table
  - Peer state O(1), avg. path length O(n)
  - Brittle network
- Maximal routing table
  - Peer state O(n), path length O(1)
  - Very inefficient routing table maintenance
[Ring diagrams illustrating both extremes on the example key ranges]

DHT Routing Tables
- Usual routing table
  - Peer state O(log n), path length O(log n)
  - Compromise between routing efficiency and maintenance efficiency
[Ring diagram with the example key ranges]

DHT Algorithms
- Chord
- Pastry
- Symphony
- Viceroy
- CAN

Pastry: Basics
- 128-bit circular id space
- Routing table elements
  - Leaf set: key space proximity
  - Routing table: long-distance links
  - Neighborhood set: nodeIds close in network proximity
- Basic routing:
  If the target key is in key space proximity (covered by the leaf set), use the direct leaf set link;
  else use a link from the routing table to resolve the next digit of the target key.

Pastry: Leaf sets
- Each node maintains the IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively
  - Routing efficiency/robustness
  - Fault detection (keep-alive)
  - Application-specific local coordination

Pastry: Routing table
- L nodes in the leaf set
- log_{2^b} N rows (actually at most log_{2^b} 2^128 = 128/b)
- 2^b columns
- L network neighbors
(A worked example of these dimensions follows below.)
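
As a quick sanity check on these dimensions, the sketch below (an illustration, not Pastry code) computes the table size for a chosen digit width b, leaf set size L, and network size N, following the 2^b * log_{2^b} N + 2L estimate used later in the slides.

import math

def pastry_table_dimensions(b: int, N: int, L: int, id_bits: int = 128):
    """Rows actually populated (~log_{2^b} N), the 128/b upper bound,
    2^b columns, and the rough total state including 2L leaf/neighbor entries."""
    rows_used = math.log(N, 2 ** b)      # rows populated for N nodes
    rows_max = id_bits / b               # bound from the 128-bit id space
    cols = 2 ** b
    total = cols * rows_used + 2 * L     # ~ 2^b * log_{2^b} N + 2L entries
    return rows_used, rows_max, cols, total

if __name__ == "__main__":
    rows, rows_max, cols, total = pastry_table_dimensions(b=4, N=10**6, L=16)
    print(f"rows ~ {rows:.1f} (max {rows_max:.0f}), columns = {cols}, "
          f"total state ~ {total:.0f} entries")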

Pastry: Routing
- log_{2^b} N steps
- O(log N) state
[Ring diagram: Route(d46a1c) issued at node 65a1fc is forwarded via d13da3, d4213f, and d462ba to d467c4, the node numerically closest to the key; d471f1 is a nearby node on the ring]

Pastry: Routing procedure

If (destination is within range of our leaf set)
    forward to the numerically closest leaf set member
else
    let l = length of the shared prefix
    let d = value of the l-th digit in D's address
    if (R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node* that
        (a) shares at least as long a prefix, and
        (b) is numerically closer than this node

*from the leaf set, routing table, or network neighbor set
(A Python sketch of this procedure follows below.)
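
The following Python sketch mirrors the forwarding decision above for nodeIds written as hex strings. The data layout (leaf_set as a list of ids, routing_table[l][d], and a generic fallbacks list) is a simplifying assumption for illustration, not the actual Pastry implementation.

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits two nodeIds have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def numeric_distance(a: str, b: str) -> int:
    """Absolute distance between two ids interpreted as hex numbers."""
    return abs(int(a, 16) - int(b, 16))

def route(node_id, dest, leaf_set, routing_table, fallbacks):
    """One forwarding decision (simplified sketch).
    leaf_set      -- ids numerically close to node_id
    routing_table -- routing_table[l][d] holds an id sharing l digits with
                     node_id and having digit d at position l, or None
    fallbacks     -- any other known ids (leaf set, table, neighbors)"""
    if dest == node_id:
        return node_id
    # 1) Destination within the leaf set range: deliver to the closest member.
    if leaf_set:
        lo = min(int(x, 16) for x in leaf_set)
        hi = max(int(x, 16) for x in leaf_set)
        if lo <= int(dest, 16) <= hi:
            return min(leaf_set + [node_id], key=lambda x: numeric_distance(x, dest))
    # 2) Otherwise resolve the next digit of dest via the routing table.
    l = shared_prefix_len(node_id, dest)
    d = int(dest[l], 16)
    if routing_table[l][d] is not None:
        return routing_table[l][d]
    # 3) Fallback: a known node with an equally long prefix that is numerically closer.
    better = [x for x in fallbacks
              if shared_prefix_len(x, dest) >= l
              and numeric_distance(x, dest) < numeric_distance(node_id, dest)]
    return min(better, key=lambda x: numeric_distance(x, dest)) if better else node_id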

Pastry: Routing Properties
- O(log N) routing table size
  - 2^b * log_{2^b} N + 2L entries
- O(log N) message forwarding steps
- Network stability:
  - Delivery guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
- Number of routing hops:
  - No failures: < log_{2^b} N on average, 128/b + 1 maximum
  - During failure recovery: O(N) worst case, average case much better

Pastry: Node addition
[Ring diagram: the new node X = d46a1c contacts the nearby node A = 65a1fc, which routes a "join" message with key d46a1c via d13da3, d4213f, and d462ba to Z = d467c4, the existing node numerically closest to X; d471f1 is also shown on the ring]

Routing table maintenance
- Leaf set
  - Copy from a neighbor
  - Extend by sending a request to the right/left boundary leaf link
- Routing table
  - Collect routing tables from peers encountered during network entry
    - Works because the peers encountered share the same prefix
  - Can be incomplete
- Network neighbor set
  - Probe nodes from the collected routing tables
  - Request neighborhood sets of known nearby nodes

Pastry: Locality properties
- Assumption: a scalar proximity metric
  - e.g. ping/RTT delay, # IP hops, geographical distance
  - A node can probe its distance to any other node
- Proximity invariant:
  - Each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix

Pastry: Geometric Routing in proximity space
[Two views of the same Route(d46a1c) lookup from 65a1fc via d13da3, d4213f, d462ba, d467c4: one in the nodeId space, one in the proximity space]
- The network distance covered by each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^(b*l))
- The distance increases monotonically (the message takes larger and larger strides)

Pastry: Locality properties
- Each routing step is local, but there is no guarantee of a globally shortest path
- Nevertheless, simulations show:
  - The expected distance traveled by a message in the proximity space is within a small constant of the minimum
- Among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first

Pastry: Node addition details
- The new node X contacts a nearby node A
- A routes a "join" message toward X's id, which arrives at Z, the node closest to X
- X obtains its leaf set from Z and the i-th row of its routing table from the i-th node on the path from A to Z
- X informs any nodes that need to be aware of its arrival

Node departure/failure
- Leaf set repair (eager: all the time):
  - Leaf set members exchange keep-alive messages
  - Request the set from the furthest live node in the set
- Routing table repair (lazy: upon failure):
  - Get the entry from peers in the same row; if not found, from higher rows
- Neighborhood set repair (eager)

Pastry: Average # of hops
[Plot: average number of hops vs. number of nodes (1,000 to 100,000), Pastry compared with log(N); L = 16, 100,000 random queries]

Pastry distance vs IP distance
[Plot: distance traveled by a Pastry message vs. distance between source and destination, mean = 1.59; GATech topology, 500,000 hosts, 60,000 nodes, 20,000 random messages]

Pastry Summary
- Usual DHT scalability
  - Peer state O(log N)
  - Avg. path length O(log N)
- Very robust
  - Different routes possible
  - Lazy routing table updates are sufficient
- Network proximity aware
  - No IP network detours

Symphony
- Symphony DHT
  - Map the nodes and keys to the ring
  - Link every node with its successor and predecessor
  - Add k random long links, with probability proportional to 1/(d * log N), where d is the distance on the ring
  - Lookup time O(log^2 N)
  - If k = log N, lookup time is O(log N)
  - Easy to insert and remove nodes (periodic refreshes of the links)

Symphony in a Nutshell
- Nodes are arranged on a unit circle (perimeter = 1)
- Arrival: a node chooses its position on the circle uniformly at random
- Each node has 1 short link (to the next node on the circle) and k long links
- Adaptation of the small-world idea [Kleinberg00]: long links are chosen from the probability distribution function p(x) = 1/(x log n), where n = #nodes (a sampling sketch follows below)
- Simple greedy routing: "Forward along the link that minimizes the absolute distance to the destination."
- Average lookup latency: O((log^2 n) / k) hops
- Fault tolerance: no backups for long links! Only short links are fortified for fault tolerance
[Ring diagram distinguishing nodes, short links, and long links]
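
Long-link distances can be drawn from p(x) = 1/(x log n) with inverse-transform sampling: assuming the natural logarithm, the CDF on [1/n, 1] inverts to x = n^(u-1) for u uniform in [0, 1). A minimal sketch, with n taken from the node's current size estimate:

import random

def long_link_distances(n_estimate: int, k: int):
    """Sample k long-link distances on the unit ring from the small-world
    pdf p(x) = 1/(x ln n), x in [1/n, 1], via inverse-transform sampling."""
    return [n_estimate ** (random.random() - 1.0) for _ in range(k)]

def long_link_targets(my_position: float, n_estimate: int, k: int):
    """Clockwise ring positions this node would link to; the live node
    managing each position becomes the long-distance neighbor."""
    return [(my_position + d) % 1.0 for d in long_link_distances(n_estimate, k)]

if __name__ == "__main__":
    print(long_link_targets(my_position=0.25, n_estimate=1000, k=4))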

Network Size Estimation Protocol
- Problem: what is the current value of n, the total number of nodes?
- x = length of an arc (between a node and its successor); 1/x = estimate of n
- 3 arcs are enough (a sketch follows below)
[Ring diagram illustrating the measured arcs]
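
A minimal sketch of this estimator, assuming the node can measure the arc (ring distance) between a few nodes and their successors; the three sample arcs are hypothetical values.

def estimate_network_size(arc_lengths):
    """Each arc is the unit-ring distance between a node and its successor;
    with n nodes placed uniformly an arc is ~1/n long, so the inverse of the
    average arc length estimates n."""
    avg = sum(arc_lengths) / len(arc_lengths)
    return 1.0 / avg

if __name__ == "__main__":
    # Hypothetical arc lengths measured around the node's own position.
    print(round(estimate_network_size([0.0011, 0.0008, 0.0013])))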

Step 0: Symphony
- p(x) = 1/(x log n)
- Symphony: "Draw from the PDF k times"
[Plot: probability distribution over the distance to the long-distance neighbor, on [0, 1]]

Step 1: Step-Symphony
- p(x) = 1/(x log n)
- Step-Symphony: "Draw from the discretized PDF k times"
[Plot: discretized probability distribution over the distance to the long-distance neighbor]

Step 2: Divide PDF into log n Equal Bins
- Step-Partitioned-Symphony: "Draw exactly once from each of k bins"
[Plot: the PDF divided into bins over the distance to the long-distance neighbor]

Step 3: Discrete PDF
- Chord: "Draw exactly once from each of log n bins"
- Each bin is essentially a point
[Plot: discrete PDF over the distance to the long-distance neighbor]

Two Optimizations
- Bi-directional routing
  - Exploit both outgoing and incoming links
  - Route to the neighbor that minimizes the absolute distance to the destination
  - Reduces average latency by 25-30%
- 1-Lookahead
  - Keep a list of each neighbor's neighbors
  - Reduces average latency by 40%
- Both optimizations are also applicable to other DHTs

Symphony: Summary (1)
- Distributed hashing in a small world
- Like Chord:
  - Overlay structure: ring
  - Key ID space partitioning
- Unlike Chord:
  - Routing table
    - Two short links to the immediate neighbors
    - k long-distance links for jumping
    - Long-distance links are built probabilistically: peers are selected using a probability distribution function (PDF), exploiting the characteristics of a small-world network
  - Dynamically estimates the current system size

Symphony: Summary (2)
- Each node has k = O(1) long-distance links
  - Lookup: expected path length O((log^2 N)/k) hops
  - Join & leave: expected O(log^2 N) messages
- Compared with Chord:
  - Discards the strong requirements on the routing table (finger table)
  - Relies on the small-world property to reach the destination

Viceroy network
- Arrange nodes and keys on a ring (as usual)

Viceroy network
- Assign each node a level value
  - Chosen uniformly from the set {1, ..., log n}
  - Estimate n by taking the inverse of the distance between the node and its successor (see the sketch below)
  - Easy to update
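
A hedged sketch of the level assignment, assuming ring positions in [0, 1) and log base 2 for the number of levels; both choices are illustrative, not prescribed by the slides.

import math
import random

def choose_level(my_position: float, successor_position: float) -> int:
    """Estimate n as the inverse of the arc to the successor and pick a
    level uniformly at random from {1, ..., round(log2 n_est)}."""
    arc = (successor_position - my_position) % 1.0
    n_est = max(2.0, 1.0 / arc)
    top_level = max(1, round(math.log2(n_est)))
    return random.randint(1, top_level)

if __name__ == "__main__":
    print(choose_level(0.40, 0.4012))   # arc ~ 0.0012 -> n_est ~ 833 -> levels 1..10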

Viceroy network
- Create a ring of the nodes within the same level

Downward links
- For a peer with key x at level i:
  - Link to the direct successor peer on level i+1
  - Long link to the peer at x + 2^-i on level i+1

Upward links
- For each peer with key x at level i:
  - Predecessor link on level i-1
  - Long link to the peer at x - 2^-i on level i-1

Butterfly links
- Each node x at level i has two downward links to level i+1:
  - A left link to the first node of level i+1 after position x on the ring
  - A right link to the first node of level i+1 after position x + (1/2)^i (a selection sketch follows below)
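
To make the selection rule concrete, the sketch below picks the left and right down-links for a node at position x on level i; representing the network as a flat list of (position, level) pairs is an assumption made purely for illustration.

def first_node_clockwise(start: float, level: int, nodes):
    """nodes: list of (position, level) pairs; return the position of the
    first node of the given level encountered clockwise from `start`."""
    candidates = [p for p, lvl in nodes if lvl == level]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (p - start) % 1.0)   # clockwise distance

def butterfly_down_links(x: float, i: int, nodes):
    """Left and right down-links of a level-i node at position x."""
    left = first_node_clockwise(x, i + 1, nodes)
    right = first_node_clockwise((x + 0.5 ** i) % 1.0, i + 1, nodes)
    return left, right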

Viceroy
- Emulates the butterfly network
[Diagram: nodes 000-111 arranged across levels 1-4 in a butterfly pattern]
- Logarithmic path length between any two nodes in the network
- Constant degree per node

Viceroy Summary
- Scalability: optimal peer state
  - Peer state O(1)
  - Avg. path length O(log N)
- Complex algorithm
- Network proximity not taken into account

CAN: Overview
- Early and successful algorithm (Sylvia Ratnasamy et al., 2001)
- Simple & elegant
  - Intuitive to understand and implement
  - Many improvements and optimizations exist
- Main responsibilities:
  - CAN is a distributed system that maps keys onto values
  - Keys are hashed into a d-dimensional space
  - Interface: insert(key, value), retrieve(key)

CAN
- Virtual d-dimensional Cartesian coordinate system on a d-torus
  - Example: 2-d space [0,1] x [0,1]
- Dynamically partitioned among all nodes
- A pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node whose zone contains P
- An entry (K,V) is retrieved by applying the same hash function to map K to P and requesting the entry from the node whose zone contains P
  - If P is not contained in the zone of the requesting node or its neighboring zones, the request is routed toward the neighbor in the zone nearest to P

CAN
- State of the system at time t
[Diagram: a 2-dimensional coordinate space divided into zones, with peers and resources]
- In this 2-dimensional space, a key is mapped to a point (x,y) (see the sketch below)
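
A sketch of the key-to-point mapping for d = 2; deriving the coordinates from slices of a single SHA-1 digest is an illustrative choice, the slides only require a uniform hash function.

import hashlib

def key_to_point(key: str, d: int = 2):
    """Map a key to a point in the d-dimensional unit cube by cutting a
    SHA-1 digest into d chunks and scaling each to [0, 1)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    chunk = len(digest) // d
    coords = []
    for i in range(d):
        raw = int.from_bytes(digest[i * chunk:(i + 1) * chunk], "big")
        coords.append(raw / 2 ** (8 * chunk))
    return tuple(coords)

if __name__ == "__main__":
    print(key_to_point("Purple Rain"))     # a point (x, y) in [0,1)^2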

CAN: Routing
- d-dimensional space with n zones
- Two zones are neighbors if they overlap in d-1 dimensions
- Algorithm: choose the neighbor nearest to the destination (see the sketch below)
[Diagram: a query for a key mapped to point (x,y) is forwarded greedily toward peer Q(x,y)]
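
A sketch of this greedy step, using torus distance so that coordinates wrap around; reducing each zone to its centre point is a simplification for illustration.

import math

def torus_distance(p, q):
    """Euclidean distance on the unit d-torus (each coordinate wraps at 1)."""
    return math.sqrt(sum(min(abs(a - b), 1 - abs(a - b)) ** 2 for a, b in zip(p, q)))

def next_hop(my_zone_centre, neighbour_centres, target_point):
    """Greedy CAN step: forward to the neighbour whose zone (centre) is
    closest to the target point; return None if no neighbour improves."""
    best = min(neighbour_centres, key=lambda c: torus_distance(c, target_point))
    if torus_distance(best, target_point) < torus_distance(my_zone_centre, target_point):
        return best
    return None   # this node's zone is closest to (or contains) the target

if __name__ == "__main__":
    print(next_hop((0.25, 0.25), [(0.75, 0.25), (0.25, 0.75)], (0.9, 0.2)))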

CAN: Construction - Basic Idea
1) The new node discovers some node "I" already in the CAN (via a bootstrap node)
2) The new node picks a random point (x,y) in the coordinate space
3) "I" routes a join message to (x,y) and thereby discovers node J, whose zone contains (x,y)
4) J's zone is split in half; the new node owns one half (a splitting sketch follows below)
[Sequence of diagrams showing the bootstrap node, the new node, I, J, and the zone split]
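
A sketch of step 4; the (lo, hi) box representation is assumed for illustration, and splitting along the longest side here is a simplification (CAN itself splits along dimensions in a fixed order).

def split_zone(zone):
    """zone: (lo, hi) tuples of d coordinates. Split in half along the
    longest side; return (remaining_half_for_J, half_for_new_node)."""
    lo, hi = zone
    dim = max(range(len(lo)), key=lambda i: hi[i] - lo[i])   # longest side
    mid = (lo[dim] + hi[dim]) / 2
    keep = (lo, tuple(mid if i == dim else h for i, h in enumerate(hi)))
    give = (tuple(mid if i == dim else l for i, l in enumerate(lo)), hi)
    return keep, give

if __name__ == "__main__":
    j_zone = ((0.0, 0.0), (0.5, 1.0))      # J owns the left half of the space
    print(split_zone(j_zone))              # J keeps the bottom-left quarter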

CAN Improvement: Multiple Realities
- Build several CAN networks; each network is called a reality
- Routing:
  - Jump between realities
  - Choose the reality in which the distance to the target is shortest

CAN Improvement: Multiple Dimensions
[Diagram: a CAN with a higher-dimensional coordinate space]

CAN: Multiple Dimensions vs. Multiple Realities
- More dimensions -> shorter paths
- More realities -> more robustness
- Trade-off?

CAN: Summary
- Inferior scalability
  - Peer state O(d)
  - Avg. path length O(d * N^(1/d))
- Useful for spatial data!

Spectrum of DHT Protocols

  Topology                 Protocol   #links                        Latency
  Deterministic            CAN        O(d)                          O(d * N^(1/d))
                           Chord      O(log N)                      O(log N)
  Partly randomized        Viceroy    O(1)                          O(log N)
                           Pastry     O((2^b - 1) * (log_2 N)/b)    O((log_2 N)/b)
  Completely randomized    Symphony   O(k)                          O((log^2 N)/k)

Latency vs State Maintenance
[Scatter plot for a network of n = 2^15 nodes: average latency vs. number of TCP connections per node, comparing Viceroy, Symphony, CAN, Chord, and Pastry]