L3S Research Center, University of Hannover
Distributed Hash Table Algorithms
Wolf-Tilo Balke and Wolf Siberski, 14.11.2007
*With slides provided by Stefan Götz and Klaus Wehrle (University of Tübingen) and
Peer-to-Peer Systems and Applications, Springer LNCS 3485
Review: DHT Basics: "Purple Rain"
- Objects need a unique key
- The key is hashed to an integer value
  - Hash function (e.g. SHA-1): "Purple Rain" → 2313 (see the sketch below)
  - Huge key space, e.g. 2^128
- Key space is partitioned
  - Each peer gets its own key range
- DHT goals
  - Efficient routing to the responsible peer
  - Efficient routing table maintenance
[Figure: ring partitioned into key ranges (3485-610, 611-709, 1008-1621, 1622-2010, 2011-2206, 2207-2905, 2906-3484); the hashed key 2313 falls into the range 2207-2905]
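As a minimal sketch of this key-to-peer mapping (assuming SHA-1 hashing and successor-style range assignment; the helper names and the tiny example ring are illustrative, not from the slides):

    import hashlib
    from bisect import bisect_left

    def dht_key(name: str, key_space: int = 2**160) -> int:
        """Hash an object name to an integer key (SHA-1, as on the slide)."""
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest, "big") % key_space

    def responsible_peer(key: int, peer_ids: list[int]) -> int:
        """Successor-style assignment: the peer with the smallest id >= key
        is responsible; wrap around at the end of the ring."""
        peers = sorted(peer_ids)
        return peers[bisect_left(peers, key) % len(peers)]

    # Tiny example with the ring from the figure (key space 0..4095):
    print(responsible_peer(2313, [610, 709, 1621, 2010, 2206, 2905, 3484, 3485]))  # -> 2905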
DHT Design Space
- Minimal routing table
  - Peer state O(1), avg. path length O(n)
  - Brittle network
- Maximal routing table
  - Peer state O(n), path length O(1)
  - Very inefficient routing table maintenance
[Figure: the ring of key ranges shown twice: once with only successor links (minimal) and once fully meshed (maximal)]
DHT Routing Tables
- Usual routing table
  - Peer state O(log n), path length O(log n)
  - Compromise between routing efficiency and maintenance efficiency
[Figure: ring of key ranges with O(log n) long-distance links per peer]
DHT Algorithms
- Chord
- Pastry
- Symphony
- Viceroy
- CAN
Pastry: Basics
- 128-bit circular nodeId space
- Routing table elements
  - Leaf set: key space proximity
  - Routing table: long-distance links
  - Neighborhood set: network proximity
- Basic routing:
    If (target key in key space proximity)
        use direct leaf set link
    else
        use link from routing table to resolve next digit of target key
Pastry: Leaf sets
- Each node maintains the IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively.
  - Routing efficiency/robustness
  - Fault detection (keep-alive)
  - Application-specific local coordination
Pastry: Routing table
- L nodes in the leaf set
- log_{2^b} N rows (actually log_{2^b} 2^128 = 128/b)
- 2^b columns
- L network neighbors
Pastry: Routing
- log_{2^b} N steps
- O(log N) state
[Figure: Route(d46a1c) starting at node 65a1fc and passing d13da3, d4213f, d462ba toward d467c4/d471f1; each hop shares a longer nodeId prefix with the key]
Pastry: Routing procedure

    If (destination D is within range of our leaf set)
        forward to the numerically closest member
    else
        let l = length of the shared prefix
        let d = value of the l-th digit in D's address
        if (R[l][d] exists)
            forward to R[l][d]
        else
            forward to a known node* that
            (a) shares at least as long a prefix
            (b) is numerically closer than this node

    *from the leaf set, routing table, or network neighbors
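As a runnable sketch of this procedure (assuming hex digits, i.e. b = 4, and a plain dict for the routing table; the state layout here is my own simplification, not Pastry's actual data structures):

    def shared_prefix_len(a: str, b: str) -> int:
        """Number of leading hex digits two nodeIds share."""
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def num_dist(a: str, b: str) -> int:
        """Numeric distance between two nodeIds (hex strings)."""
        return abs(int(a, 16) - int(b, 16))

    def next_hop(self_id: str, dest: str, leaf_set: list[str],
                 routing_table: dict, others: list[str]) -> str:
        """One routing step: leaf set first, then prefix routing, then the
        rare-case fallback to any known node that is strictly closer."""
        if self_id == dest:
            return self_id
        if leaf_set:
            ids = [int(n, 16) for n in leaf_set]
            if min(ids) <= int(dest, 16) <= max(ids):
                # Destination lies in the leaf set range: deliver to the
                # numerically closest member (possibly ourselves).
                return min(leaf_set + [self_id], key=lambda n: num_dist(n, dest))
        l = shared_prefix_len(self_id, dest)
        entry = routing_table.get((l, dest[l]))  # row l, column = l-th digit
        if entry is not None:
            return entry
        # Rare case: any known node with an at-least-as-long prefix that is
        # numerically closer than this node.
        known = leaf_set + others + list(routing_table.values())
        closer = [n for n in known
                  if shared_prefix_len(n, dest) >= l
                  and num_dist(n, dest) < num_dist(self_id, dest)]
        return min(closer, key=lambda n: num_dist(n, dest)) if closer else self_id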
Pastry: Routing Properties
- O(log N) routing table size
  - 2^b · log_{2^b} N + 2L entries (worked example below)
- O(log N) message forwarding steps
- Network stability
  - Delivery guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
- Number of routing hops
  - No failures: < log_{2^b} N on average, 128/b + 1 maximum
  - During failure recovery: O(N) worst case, average case much better
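A quick sanity check of these formulas with illustrative parameters (b = 4, L = 16, N = 10^6 are my choices, not from the slide):

    import math

    b, L, N = 4, 16, 10**6            # illustrative parameters
    rows = math.log(N, 2**b)          # log_{2^b} N ≈ 4.98 populated rows
    table_size = 2**b * rows + 2 * L  # ≈ 16 * 4.98 + 32 ≈ 112 entries
    max_hops = 128 / b + 1            # worst case in a 128-bit id space
    print(round(rows, 2), round(table_size), max_hops)  # 4.98 112 33.0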
Pastry: Node addition
[Figure: new node X = d46a1c contacts nearby node A = 65a1fc; A issues Route(d46a1c), which passes d13da3, d4213f, d462ba and arrives at Z = d467c4, the node numerically closest to X]
Routing table maintenance
- Leaf set
  - Copy from a neighbor
  - Extend by sending a request over the right/left boundary leaf link
- Routing table
  - Collect routing tables from peers encountered during network entry (works because the peers encountered share the same prefix)
  - Can be incomplete
- Network neighbor set
  - Probe nodes from the collected routing tables
  - Request the neighborhood sets of known nearby nodes
Pastry: Locality properties
- Assumption: scalar proximity metric
  - e.g. ping/RTT delay, # IP hops, geographical distance
  - A node can probe its distance to any other node
- Proximity invariant
  - Each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
Pastry: Geometric routing in proximity space
[Figure: the route for key d46a1c (65a1fc → d13da3 → d4213f → d462ba → d467c4) drawn both in nodeId space and in proximity space]
- The network distance covered by each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^(bl))
- The distance increases monotonically (the message takes larger and larger strides)
Pastry: Locality properties
- Each routing step is local, but there is no guarantee of a globally shortest path
- Nevertheless, simulations show:
  - The expected distance traveled by a message in the proximity space is within a small constant of the minimum
- Among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first
Pastry: Node addition details
- The new node X contacts a nearby node A
- A routes a "join" message with key X; it arrives at Z, the node numerically closest to X
- X obtains its leaf set from Z and the i-th row of its routing table from the i-th node on the path from A to Z
- X informs any nodes that need to be aware of its arrival
Node departure/failure
- Leaf set repair (eager, i.e. all the time)
  - Leaf set members exchange keep-alive messages
  - On failure, request the set from the furthest live node in the set
- Routing table repair (lazy, i.e. upon failure)
  - Get a replacement entry from peers in the same row; if none is found, from higher rows
- Neighborhood set repair (eager)
Pastry: Average # of hops
[Plot: average number of hops vs. number of nodes (1,000 to 100,000) for Pastry and log(N); Pastry tracks the logarithmic curve closely; L = 16, 100k random queries]
Pastry: distance traveled vs. IP distance
[Plot: distance traveled by a Pastry message vs. the distance between source and destination; mean ratio = 1.59; GATech topology, 0.5M hosts, 60K nodes, 20K random messages]
Pastry Summary
- Usual DHT scalability
  - Peer state O(log N)
  - Avg. path length O(log N)
- Very robust
  - Different routes possible
  - Lazy routing table updates are sufficient
- Network proximity aware
  - No IP network detours
Symphony
- Symphony DHT
  - Map the nodes and keys to the ring
  - Link every node with its successor and predecessor
  - Add k random long links, chosen with probability proportional to 1/(d · log N), where d is the distance on the ring (see the sampling sketch below)
  - Lookup time O(log^2 N)
  - If k = log N, lookup time O(log N)
  - Easy to insert and remove nodes (perform periodic refreshes for the links)
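A minimal sketch of how such long links can be drawn, using inverse transform sampling of the harmonic density p(x) = 1/(x ln n) on [1/n, 1] (the function names are mine; the slides specify the distribution, not this code):

    import random

    def draw_long_link_distance(n: int) -> float:
        """Sample a ring distance x in [1/n, 1] with density p(x) = 1/(x ln n).

        The CDF is F(x) = 1 + ln(x)/ln(n), so inverting F at a uniform
        u in [0, 1] gives x = n^(u - 1).
        """
        u = random.random()
        return n ** (u - 1)

    def long_link_targets(my_pos: float, k: int, n: int) -> list[float]:
        """Positions (on the unit ring) of k long-distance neighbors."""
        return [(my_pos + draw_long_link_distance(n)) % 1.0 for _ in range(k)]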
Symphony in a Nutshell
- Nodes are arranged on a unit circle (perimeter = 1)
- Arrival: a node chooses its position along the circle uniformly at random
- Each node has 1 short link (to the next node on the circle) and k long links
- Adaptation of the small-world idea [Kleinberg00]
  - Long links are chosen from the probability distribution function p(x) = 1/(x log n), where n = #nodes (how to estimate n? see the next slide)
- Simple greedy routing: "Forward along the link that minimizes the absolute distance to the destination." (sketched below)
- Average lookup latency: O((log^2 n)/k) hops
- Fault tolerance: no backups for long links! Only short links are fortified for fault tolerance.
[Figure: unit circle showing a node, its short link, and its long links]
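A sketch of this greedy rule on the unit ring (helper names are mine):

    def abs_ring_dist(a: float, b: float) -> float:
        """Shorter of the two ways around the unit circle."""
        d = (b - a) % 1.0
        return min(d, 1.0 - d)

    def greedy_next_hop(dest: float, links: list[float]) -> float:
        """Forward along whichever link (short or long) minimizes the
        remaining absolute distance to the destination."""
        return min(links, key=lambda pos: abs_ring_dist(pos, dest))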
Network Size Estimation Protocol
- Problem: what is the current value of n, the total number of nodes?
- x = length of an arc between adjacent nodes; 1/x = estimate of n (sketch below)
- 3 arcs are enough.
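A toy illustration of this estimator, assuming a node can observe the arc lengths between a few adjacent nodes (function name is mine):

    def estimate_n(arc_lengths: list[float]) -> float:
        """Estimate the network size from observed arc lengths.

        With n nodes placed uniformly at random on a unit circle, the
        expected arc length is 1/n, so len(arcs)/sum(arcs) estimates n.
        Per the slide, three arcs already give a usable estimate.
        """
        return len(arc_lengths) / sum(arc_lengths)

    # e.g. three observed arcs of roughly 1/100 each -> estimate near 100
    print(estimate_n([0.011, 0.009, 0.0105]))  # ≈ 98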
Step 0: Symphony
- PDF: p(x) = 1/(x log n)
- Symphony: "Draw from the PDF k times"
[Plot: continuous PDF over the distance to the long-distance neighbor (x-axis from 0 through ¼ and ½ to 1)]
Step 1: Step-Symphony
- PDF: p(x) = 1/(x log n), discretized
- Step-Symphony: "Draw from the discretized PDF k times"
[Plot: step-function version of the same PDF over the distance to the long-distance neighbor]
Step 2: Divide the PDF into log n Equal Bins
- Step-Partitioned-Symphony: "Draw exactly once from each of k bins"
[Plot: the discretized PDF divided into bins over the distance to the long-distance neighbor]
Step 3: Discrete PDF
- Chord: "Draw exactly once from each of log n bins"
- Each bin is essentially a point.
[Plot: the PDF collapsed to log n discrete points over the distance to the long-distance neighbor]
Two Optimizations
- Bi-directional routing
  - Exploit both outgoing and incoming links!
  - Route to the neighbor that minimizes the absolute distance to the destination
  - Reduces avg. latency by 25-30%
- 1-Lookahead
  - Keep a list of each neighbor's neighbors (sketched below)
  - Reduces avg. latency by 40%
- Both optimizations are also applicable to other DHTs
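A sketch of 1-lookahead on top of the greedy rule from before; the `neighbors_of` map (each neighbor's neighbor list) is my own representation, not Symphony's actual data structure:

    def abs_ring_dist(a: float, b: float) -> float:
        """Shorter of the two ways around the unit circle."""
        d = (b - a) % 1.0
        return min(d, 1.0 - d)

    def lookahead_next_hop(dest: float, links: list[float],
                           neighbors_of: dict[float, list[float]]) -> float:
        """Pick the link whose own best onward link lands closest to the
        destination, i.e. judge each neighbor by a two-hop horizon."""
        def best_after(pos: float) -> float:
            onward = neighbors_of.get(pos, []) + [pos]
            return min(abs_ring_dist(p, dest) for p in onward)
        return min(links, key=best_after)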
Symphony: Summary (1)
- Distributed hashing in a small world
- Like Chord:
  - Overlay structure: ring
  - Key ID space partitioning
- Unlike Chord:
  - Routing table:
    - Two short links for the immediate neighbors
    - k long-distance links for jumping
    - Long-distance links are built probabilistically: peers are selected using a probability distribution function (PDF)
    - Exploits the characteristics of a small-world network
  - Dynamically estimates the current system size
Symphony: Summary (2)
- Each node has k = O(1) long-distance links
  - Lookup: expected path length O((log^2 N)/k) hops
  - Join & leave: expected O(log^2 N) messages
- Compared with Chord:
  - Discards the strong requirements on the routing table (finger table)
  - Relies on the small world to reach the destination
Viceroy network
- Arrange nodes and keys on a ring
  - As usual
Viceroy network
- Assign each node a level value (sketch below)
  - Chosen uniformly from the set {1, …, log n}
  - Estimate n by taking the inverse of the distance between the node and its successor
  - Easy to update
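A minimal sketch of this level assignment (log taken base 2 here; the function name is mine):

    import math
    import random

    def viceroy_level(my_pos: float, succ_pos: float) -> int:
        """Choose a level uniformly from {1, ..., log n}, where n is
        estimated as the inverse of the ring distance to the successor."""
        n_estimate = 1.0 / ((succ_pos - my_pos) % 1.0)
        max_level = max(1, round(math.log2(n_estimate)))
        return random.randint(1, max_level)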
Viceroy network
- Create a ring of the nodes within the same level
Downward links
- For a peer with key x at level i:
  - Link to the direct successor peer on level i+1
  - Long link to the peer at x + (½)^i on level i+1
Upward links
- For each peer with key x at level i:
  - Predecessor link on level i-1
  - Long link to the peer at x - (½)^i on level i-1
Butterfly links
- Each node x at level i has two downward links to level i+1:
  - A left link to the first node of level i+1 after position x on the ring
  - A right link to the first node of level i+1 after position x + (½)^i
Viceroy
- Emulating the butterfly network
[Figure: butterfly network over positions 000-111 with levels 1-4]
- Logarithmic path length between any two nodes in the network
- Constant degree per node
Viceroy Summary
- Scalability: optimal peer state
  - Peer state O(1)
  - Avg. path length O(log N)
- Complex algorithm
- Network proximity is not taken into account
CAN: Overview
- Early and successful algorithm
- Simple & elegant
  - Intuitive to understand and implement
  - Many improvements and optimizations exist
  - Sylvia Ratnasamy et al., 2001
- Main responsibilities:
  - CAN is a distributed system that maps keys onto values
  - Keys are hashed into a d-dimensional space
  - Interface: insert(key, value), retrieve(key)
CAN
- Virtual d-dimensional Cartesian coordinate system on a d-torus
  - Example: 2-d space [0,1] × [0,1]
- Dynamically partitioned among all nodes
- A pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node whose zone contains P
- An entry (K,V) is retrieved by applying the same hash function to map K to P and requesting the entry from the node whose zone contains P
  - If P is not contained in the zone of the requesting node or its neighboring zones, the request is routed toward the neighbor in the zone nearest P
CAN: State of the system at time t
[Figure: 2-dimensional coordinate space partitioned into zones, one per peer; resources are points within zones]
- In this 2-dimensional space, a key is mapped to a point (x,y)
CAN: Routing
- d-dimensional space with n zones
- Two zones are neighbors if their extents overlap in d-1 dimensions
- Algorithm: choose the neighbor nearest to the destination (sketched below)
[Figure: greedy path from peer Q(x,y) through neighboring zones to the zone holding the key]
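A minimal sketch of the key-to-point mapping and this greedy step, assuming each node knows its neighbors' zone centers (the names and data layout are my own simplification):

    import hashlib

    def can_point(key: str, d: int = 2) -> tuple[float, ...]:
        """Hash a key to a point in the d-dimensional unit torus."""
        h = hashlib.sha1(key.encode()).digest()
        chunk = len(h) // d
        return tuple(int.from_bytes(h[i*chunk:(i+1)*chunk], "big")
                     / 2**(8*chunk) for i in range(d))

    def torus_dist(p, q) -> float:
        """Euclidean distance on the unit torus (wrap-around per axis)."""
        return sum(min(abs(a - b), 1 - abs(a - b))**2 for a, b in zip(p, q))**0.5

    def next_zone(target, neighbor_centers):
        """Greedy CAN step: forward to the neighbor whose zone center is
        nearest to the target point."""
        return min(neighbor_centers, key=lambda c: torus_dist(c, target))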
CAN: Construction - Basic Idea
CAN: Construction
[Figure: a new node and the CAN bootstrap node]
CAN: Construction
1) Discover some node "I" already in CAN
[Figure: the bootstrap node refers the new node to node I]
CAN: Construction
2) Pick a random point (x,y) in the space
CAN: Construction
3) I routes to (x,y) and discovers node J
[Figure: the join request travels from I to node J, whose zone contains (x,y)]
CAN: Construction
4) Split J's zone in half; the new node owns one half (sketch below)
[Figure: J's zone is halved and one half is handed to the new node]
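A sketch of the zone split in step 4, assuming zones are axis-aligned boxes and the split dimension alternates by depth (the `Zone` type is mine; the slides only prescribe halving the zone):

    from dataclasses import dataclass

    @dataclass
    class Zone:
        lo: tuple[float, float]   # lower corner (2-d example)
        hi: tuple[float, float]   # upper corner
        depth: int = 0            # how many splits produced this zone

    def split_zone(z: Zone) -> tuple[Zone, Zone]:
        """Halve a zone along one dimension (alternating x, y, x, ...);
        the old owner keeps one half, the joining node takes the other."""
        dim = z.depth % 2
        mid = (z.lo[dim] + z.hi[dim]) / 2
        hi_a = list(z.hi); hi_a[dim] = mid   # first half ends at mid
        lo_b = list(z.lo); lo_b[dim] = mid   # second half starts at mid
        return (Zone(z.lo, tuple(hi_a), z.depth + 1),
                Zone(tuple(lo_b), z.hi, z.depth + 1))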
CAN Improvement: Multiple Realities
- Build several CAN networks in parallel
- Each network is called a reality
- Routing
  - Jump between realities
  - Choose the reality in which the distance to the target is shortest
CAN Improvement: Multiple Dimensions
CAN: Multiple Dimensions vs. Multiple Realities
- More dimensions → shorter paths
- More realities → more robustness
- Trade-off?
CAN: Summary
- Inferior scalability
  - Peer state O(d)
  - Avg. path length O(d · N^(1/d))
- Useful for spatial data!
Spectrum of DHT Protocols

    Topology                 Protocol   #links                      Latency
    Deterministic            CAN        O(d)                        O(d · N^(1/d))
    Deterministic            Chord      O(log N)                    O(log N)
    Partly randomized        Viceroy    O(1)                        O(log N)
    Partly randomized        Pastry     O((2^b - 1) · (log_2 N)/b)  O((log N)/b)
    Completely randomized    Symphony   O(k)                        O((log^2 N)/k)
Latency vs. State Maintenance
[Scatter plot: average latency vs. # TCP connections per node for CAN, Chord, Pastry, Symphony, and Viceroy; network size n = 2^15 nodes]