G2 specs part1
NOTE: This is an old document and should not be used for new implementations. It is here because it may be useful reference material.
I realise this has been quite a long time in coming, however I'm a strong believer in getting some field experience with new ideas before asserting that they will work (especially in a distributed environment). This has been productive and as a direct result of the experience gained, several significant changes have been evolved into the proposal over the past few weeks. That process of testing is not yet complete, however I'm now confident that the basic ideas are relatively sound.
I'm also very interested in the thoughts of everyone here in the GDF. I tend to have strongly held views of how things should work, but the problem with a lone designer is that you rarely cover all the possible angles. It's only by working with a group of appropriately qualified people that there is any hope of exploring a problem fully. So, I invite everyone to comment, criticise and discuss.
Finally, to reiterate some of the stuff I posted last week: the entire Gnutella2 concept is intended to be an open advancement in the Gnutella world. I recognise that the GDF is not a standards body and our purpose here is not to flesh out a new specification, however I hope that this effort will spark some intelligent discussion of the ideas contained within.
The Gnutella2 concept contains four parts:
- A new protocol
- A new data transport architecture
- New base services, including search
- An implementation standard
While I feel that all four parts are important to move forward with, I suspect that the third part, the new base services and in particular the new search mechanism is likely to be the area of greatest interest for everyone here. Thus I'll focus on the searching aspect of Gnutella2 in this post, and save discussion of the other items for later.
(A final word on the first two points in this post: I do acknowledge that it is entirely possible to implement most if not all new features within the framework of the original Gnutella spec, working around its limitations with things like GGEP. However I would argue that in doing that we are limiting ourselves unnecessarily, not to mention producing less than elegant designs and increasing overall complexity. The original specification united us as developers; however, if we are unable to move logically beyond it now, I think that reflects an inability on our part to work together.)
In any case, this post is focused on the search architecture of Gnutella2. The search framework is built upon the previous two levels of implementation, however it is possible (and more meaningful) to discuss it at a higher level now. I'll detail the specifics of my implementation later.
- 1 Background
- 2 Real World Considerations
- 3 Gnutella2 Solutions
- 4 Calculations
- 5 Hub Management
- 6 Security
- 7 Conclusion
Background
It's important to take a step back and analyse exactly what a search system should be able to do. In an ideal world of unlimited bandwidth and CPU resources, it should be possible to execute a search and locate every matching object present on the network instantly. In the real world compromises must be made, but the choice of compromise is important.
An Analysis of Gnutella1
In the original (Gnutella1) architecture, the compromise is one of search scope. Performing a search has a more or less fixed cost, occurs over a more or less fixed time interval and covers an area of more or less fixed size (but which is randomly selected). Unfortunately, when mapped from the technical view to the outcome view, it is evident that this compromise is not very desirable: In an ideal scenario, Gnutella1 searches tend to find every matching result within a fixed (but unknown) area of the network. There is little or no opportunity to scale the search to match the rarity of the target object (TTL and number of injection points can be scaled, but have a high cost).
An Optimal Search System
It is my view that an optimal search system, under real world limitations should have the following properties:
- It should be able to guarantee that if a matching object exists somewhere, it can be found
- It should have a means of scaling to match the rarity of the object that is not associated with a high cost to the network
- In a situation of limited resources, the ability to find at least some instances of a rare object should be of higher importance than the ability to find every instance of a common object.
In other words, the principal difference between my "optimal" search objectives and the Gnutella1 model is that I am placing a higher priority on the ability to find an object anywhere than on the ability to find every object in a limited range. My rationale for this is that:
- The ability to locate at least one instance of an object is very important, because without it no further action is possible
- Having located one instance of an object, there are other means of finding additional locations (if desired), such as a download mesh
- In a distributed hash table environment, a general search system only needs to locate a URN matching a search, after which the DHT can efficiently provide additional locations where that object may be found.
- Last but not least, the inability to find an object is demoralising for users
The third point there is quite important – I'm very keen on the use of a large scale distributed hash table within Gnutella in the near future (some work is going on there already). Thus the design of this search system is such that it is not superseded by, but rather forms a part of, a DHT-enhanced search solution.
A search model meeting these objectives
It is well understood that in order to design a distributed search system which has at least one scalable variable, there needs to be a means of controlling the progress of the search. (Gnutella1 of course fails on this principle, as it involves "unleashing" a search and waiting for the results.) Controlled searches thus lend themselves to an iterative, rather than recursive, approach. There are three commonly known approaches:
- The random walker approach, in which the search maintains periodic contact with its originator, allowing the originator to continue or abort progress.
- The random or crawling mesh approach, in which the originator iteratively contacts nodes on the network until all nodes have been contacted or a desired condition has been reached.
- The guided mesh approach, in which the originator iteratively contacts nodes seeking a particular node which is responsible for a particular URN (DHT).
At the most simplistic level, the Gnutella2 network search model takes the second approach: an iterative crawl of all nodes on the network until a desired condition is reached. It also makes use of the random walker model for another, non-search function, however that is outside the scope of this discussion. It is envisioned that the third DHT approach will be incorporated in the future as a more efficient way of locating nodes with an instance of a URN obtained from the current search model.
In this respect, the use of an iterative crawl is similar to the GUESS proposal. The differences between this implementation and GUESS become apparent in the process and optimisations described below.
Real World Considerations
Implementing a node crawl directly is not viable for a public, large scale network for several reasons:
- There are too many nodes to contact
- The cost of contacting a node is too high for the expected number of matching objects on that node (the probability of one match is often a lot lower than 1%)
- Not all nodes can be contacted directly (NATs, firewalls)
- Most nodes do not have the resources to handle and respond to queries from every other node
Increasing Network Density
One of the solutions is of course something we have been using for some time in the Gnutella world – organising nodes into a two level hierarchy, in which "high capacity" nodes shield "low capacity" nodes. High capacity nodes have been called Ultrapeers or "Supernodes" in the past, although I have used the term "Hubs" as it seems to make the purpose much more clear to the general user population.
Hubs solve several issues:
- Hubs can effectively answer queries for their leaves, so:
- Only hubs need to be contacted in a mesh crawl, reducing the number of nodes to contact
- Contacting one hub allows a large number of nodes to be searched, increasing the probability of locating an object and thus reducing the relative cost of contact
- Hubs must be able to be contacted directly, allowing un-contactable nodes to exist and participate as leaves
- Hubs can shield their leaves from the bulk of the bandwidth and CPU requirements
These facts are not new, and many of them apply to traditional Gnutella1 as well. However in the Gnutella2 world, the reduced cost of contact becomes a positive factor where it previously was not. Similarly, the total number of nodes (now hubs) which could be contacted for a complete network-wide search is reduced.
Problems with Gnutella Hubs
It is interesting to note that several other successful distributed networks use a similar model, grouping a large number of leaf nodes under a hub node, although different terminology is often used. The server-based model is obviously the extreme version of this, where all participants in the searchable area are connected to a common server. Moving away from this to a more distributed model, additional servers (-> hubs) are employed, with searches being processed by each.
What these other networks tend to rely upon, however, is having a very large number of leaf nodes per hub. This very high leaf density makes searching very efficient, because the relative cost of contacting each hub and the total number of hubs are both low. In the Gnutella world, our ultrapeers/hubs tend to have far fewer leaves. This presents a problem: there are too many hubs to contact, and the relative cost per hub is too high.
These problems must be solved, or the time necessary to execute a search will be far too high for reasonably rare and very rare objects.
Gnutella2 Solutions
The Gnutella2 model solves the "too many hubs" problem in two ways.
Optimising Hub to Leaf Communication
The first solution tackles the problem at its cause: to reduce the number of hubs, the number of leaves per hub must be increased. The limiting factor on the number of leaves per hub is the capacity of a particular hub to support a large number of leaves. Some aspects of this limitation are fixed, for example the amount of CPU power and bandwidth available to a particular hub. This can only be addressed by choosing better hubs (which is also important).
However, other aspects are highly variable and good targets for optimisation: the cost of each additional leaf can be reduced. Gnutella2 does this by:
- Leaves send their hub an extremely large query hash table which applies to all query traffic. This results in a substantially lower bandwidth requirement for each leaf, at the cost of additional memory on the hub. The latter is reduced by storing the tables in compressed hierarchical form to balance memory and CPU requirements. Patch tables are sent in simple 1-bit packed form, and the necessary levels are generated and compressed by the hub upon receipt of new or patch data.
- All hub to leaf traffic is compressed using the deflate algorithm. This provides a significant reduction in bandwidth requirement for each leaf, once again at the cost of state memory and some CPU on the hub. The CPU requirement is not as significant as might be expected, as the large query hash tables filter the majority of traffic, leaving only a small minority for compression. The combination of these two systems really squeezes the absolute most out of hub to leaf bandwidth. The memory requirement can be halved by not compressing leaf to hub traffic, which is almost non-existent in Gnutella2.
- Leaves are given the option of uploading a compressed snapshot of their library to their hub. I have not finished experimenting with this yet, but to spell out the obvious advantages and disadvantages: hub to leaf traffic is eliminated, memory and CPU requirements on the hub increase. By using a similar compressed hierarchical storage and access system I expect to be able to make the impact manageable.
The savings in bandwidth requirement per leaf provided by these optimisations, at minimal extra cost to the hub in terms of CPU and memory requirements, allow a typical hub to serve a much larger number of leaves than it would normally be able to. This helps to increase the average leaves per hub, and with appropriate hub promotion control, reduce the total number of hubs. To map that back to the original assertion, it reduces the relative cost of contacting each hub and therefore also the amount of time taken to perform a search.
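The query hash table idea in the first bullet can be sketched in a few lines. This is a minimal illustration only: the table size, the SHA-1-based hash, and the helper names are assumptions chosen for clarity, not the actual G2 wire format (which uses a packed bit array with its own hash function and patch protocol).

```python
import hashlib

TABLE_BITS = 20                 # assumed table size: 2^20 slots
TABLE_SIZE = 1 << TABLE_BITS

def qht_hash(word: str) -> int:
    """Map a keyword to a table slot (illustrative hash, not the real G2 one)."""
    digest = hashlib.sha1(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % TABLE_SIZE

def build_table(keywords) -> set:
    """A leaf builds its table from the keywords of its shared objects."""
    return {qht_hash(w) for w in keywords}

def may_match(table: set, query_words) -> bool:
    """The hub forwards a query to a leaf only if every word could be present.

    False positives are possible (hash collisions); false negatives are not,
    which is why the filter is safe to apply to all query traffic."""
    return all(qht_hash(w) in table for w in query_words)

leaf_table = build_table(["blue", "sky", "music"])
assert may_match(leaf_table, ["blue", "sky"])       # possible hit: forward
assert not may_match(leaf_table, ["blue", "rain"])  # definite miss: filter out
```

The key property is that the table only ever produces false positives, never false negatives, so the hub can drop the vast majority of queries without ever hiding a real match.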
Leveraging Inter-Hub Communication
Gnutella2 also makes another key optimisation. Assuming the total number of hubs cannot easily be reduced any further, there are other ways to reduce the relative cost of contacting each hub and reduce the effective number of hubs which actually need to be contacted.
In the Gnutella2 search model, when a particular hub receives a query from a query originator, it not only processes it locally and forwards it to its leaves (where relevant), it also forwards it to its neighbouring hubs. These hubs in turn process the query locally and forward it to their leaves (where relevant), however they do not forward it any further. To make a comparison with the Gnutella1 broadcast model, this is similar to executing a query with TTL=2. In Gnutella2 there are no TTLs involved, rather fixed routing rules are used to ensure that any hub receiving a query will always forward it to its leaves, and may forward it to its neighbouring hubs if the origin was not already a neighbouring hub. The "two level" approach is not arbitrarily chosen, and its significance is detailed later.
This behaviour means that for the cost of sending a query to a single hub, a small "cluster" of connected hubs are actually searched. For a typical hub with 5 neighbours, 6 hubs and their leaves are searched at once. In essence, "hub clustering" offsets Gnutella's lower average leaves per hub. The elimination of broadcast "flood" traffic also means that hubs could actually have a higher number of neighbour hubs, further increasing the cluster size and lowering the cost per contact.
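The fixed routing rules described above can be sketched as follows. This is a toy model, not the real packet handling: the dictionary layout and function names are assumptions, but the rule itself is the one in the text — leaves are always candidates, and neighbouring hubs receive the query only when it did not arrive from one of them.

```python
def route_query(hub, origin_is_neighbour_hub: bool):
    """Fixed G2 routing rules (sketch): always forward to (matching) leaves;
    forward to neighbouring hubs only if the query did not arrive from one.
    This bounds propagation to two hub hops without any TTL field."""
    targets = list(hub["leaves"])            # leaves are always candidates
    if not origin_is_neighbour_hub:
        targets += hub["neighbours"]         # one extra hop, never more
    return targets

hub = {"leaves": ["leaf1", "leaf2"], "neighbours": ["hubA", "hubB"]}

# Query arriving from a search originator: the whole cluster is searched.
assert route_query(hub, origin_is_neighbour_hub=False) == [
    "leaf1", "leaf2", "hubA", "hubB"]

# Query relayed by a neighbouring hub: leaves only, so it stops here.
assert route_query(hub, origin_is_neighbour_hub=True) == ["leaf1", "leaf2"]
```

Because the forwarding decision depends only on where the query came from, no TTL counter needs to travel with the packet, and a malformed TTL can never cause a flood.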
At first examination the cluster effect would appear only to benefit the search originator by reducing the amount of outbound query traffic and the time taken to contact every node (if required). In the unoptimised case, every hub would still need to receive every query, even if it was forwarded by an intermediary hub. The benefits to the search originator are important, however further optimisations allow the cluster effect to benefit hubs as well.
At the first level, hub to hub links are compressed using the deflate algorithm in both directions. This is an important effect, as the nature of UDP query traffic is that it cannot generally be compressed (it's too small and there is no reliable context). Thus receiving a query via UDP is much less efficient in terms of bandwidth than receiving one via TCP. But as we know, UDP has other benefits. A compromise must be struck, and the retransmission of query packets within the hub cluster provides an excellent opportunity. A single inbound UDP (inefficient) query is received by one hub, which then forwards it to the other 4+ hubs in the cluster via TCP (efficient). One (random) hub takes the extra load; the others (the majority) receive the benefit.
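The "no reliable context" point is worth making concrete. A stateless datagram must be compressed alone, while a TCP link can run one deflate stream across all packets, so repeated query structure costs almost nothing after the first packet. A small sketch with Python's zlib (the packet contents here are invented for illustration):

```python
import zlib

# Fifty near-identical query packets, as a hub cluster might relay them.
packets = [b"QUERY search=ubuntu iso " + bytes([i]) for i in range(50)]

# Datagram case: each packet compressed alone, no shared context (UDP).
udp_bytes = sum(len(zlib.compress(p)) for p in packets)

# Stream case: one compressor persists across packets (the TCP hub link).
comp = zlib.compressobj()
tcp_bytes = 0
for p in packets:
    tcp_bytes += len(comp.compress(p))
    tcp_bytes += len(comp.flush(zlib.Z_SYNC_FLUSH))  # deliverable boundary

# The shared dictionary makes the stream dramatically cheaper.
assert tcp_bytes < udp_bytes
```

In practice each standalone deflate of a tiny packet is dominated by header overhead, while the stream spends that overhead once, which is exactly why relaying the query over compressed TCP within the cluster pays off.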
At the next level up, query retransmission within the hub cluster is filtered by very large query hash tables. Gnutella2 hubs regularly exchange query hash table data, which is a composite of the hub's own available objects and the hash tables of its leaves. Because retransmission is restricted to a neighbour and its leaves, the use of filtering tables at this level is perfectly acceptable (and accurate). The hash table associated with a particular hub at an instant in time is the same no matter which neighbour views it.
The use of query hash tables in hub to hub links makes it possible to filter out query traffic which would not result in a hit, reducing the hub to hub bandwidth requirements. This not only makes the hub cluster concept more efficient, but also allows hubs to have more leaves. Through clever hash table tricks, the size of the hash tables employed on hub to hub links can be larger than that retained on hub to leaf links (to offset the higher density). Methods of patching hash tables as they change to reduce update bandwidth are well documented also.
Calculations
The primary focus of these solutions has been to increase the efficiency of the system in a situation of limited resources: decreasing the number of hubs to be contacted, and decreasing the effective cost of each contact (which includes both the cost of sending the query and the cost of receiving and processing it).
While the optimisations described previously provide significant performance improvements, they also add complexity. For a small network this complexity is not particularly necessary. However as the number of nodes on the network grows, the benefits of higher leaf density and lower cost of contact become much greater.
For a 500,000 node network with presently loaded ultrapeers averaging about 150 leaves each, there will be approximately 3333 ultrapeers. Every one of these 3333 ultrapeers must be contacted to achieve a global search, incurring the cost of transmission on the search originator and the cost of receiving on each ultrapeer (plus leaf costs depending on how that is set up). None of these queries can be compressed.
If the same network is built with optimised hubs averaging 500 nodes each, there will be approximately 1000 hubs. If each hub is connected to an average of 5 neighbours, less than 170 hubs need to be contacted by the search originator via UDP. The originator can do that faster and more reliably. The other approximately 830 hubs and their leaves are still effectively searched, but incur a reduced (compressed) penalty, if any at all (QHT filtering).
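The arithmetic behind those two paragraphs, spelled out (the node counts and cluster size are the ones assumed in the text):

```python
nodes = 500_000

# Gnutella1-style: ~150 leaves per ultrapeer, every ultrapeer contacted
# directly by the search originator, with no compression possible.
ultrapeers = nodes // 150                      # ~3333 ultrapeers
g1_direct_contacts = ultrapeers

# Gnutella2-style: ~500 nodes served per hub, clusters of 1 hub + 5
# neighbours, so one UDP contact searches six hubs' worth of nodes.
hubs = nodes // 500                            # 1000 hubs
cluster_size = 6
g2_direct_contacts = -(-hubs // cluster_size)  # ceiling division: 167

assert ultrapeers == 3333
assert hubs == 1000
assert g2_direct_contacts < 170
```

So the originator's direct (uncompressed UDP) contacts drop by roughly a factor of twenty, while the remaining ~830 hubs are reached over compressed, QHT-filtered TCP links or not at all.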
Hub Management
This section deals with some of the more mundane but important aspects of operation.
Crawling the Network
It is not viable for every node to maintain an accurate table of every hub on the network, if the network is of any reasonable size. The storage requirement is not significant, but the overhead in maintaining it is. It makes more sense to progressively discover the extent of the network as it is traversed. This provides a sweep of the current extent over a period of time (i.e. not a snapshot), but it's good enough.
To achieve this, each hub which is contacted must provide additional information about where the search originator might look next. In a simple system this is quite straightforward, however the use of the hub clustering technique (while vastly increasing efficiency) complicates matters somewhat.
Upon receiving a valid query from a search originator, a hub immediately returns a query acknowledgement. The acknowledgement is tagged for the specific search ID and includes two lists of information: the hubs which have received the query (including the current hub itself), and a selection of hubs to try.
The list of hubs which received the query will be every neighbouring Gnutella2 hub, including hubs for which the query failed the QHT test. The query originator adds these hubs to the "done list", and avoids sending them direct queries. Note that additional information is also returned, such as the leaf count for each hub, so that the originator can display search statistics.
The list of other hubs to try is more complex. The hubs in the list must be accurate, or time and resources are wasted on a large scale. The only hubs which the hub can be 100% sure of are its neighbours, but they are not eligible because they have already been queried. To solve this problem, every Gnutella2 node knows every neighbour, and every neighbour's neighbour. This applies partially to leaves too, in that leaves must know their neighbours and their neighbours' neighbours; however only hubs are included in the data (leaf addresses are of no use). This information is kept up to date quite stringently.
As a result, a hub sending a query acknowledgement packet can include its neighbours as "searched", and its neighbours' neighbours as "try these". This allows a logical crawling process to occur unimpeded by hub clusters.
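The crawl logic that falls out of this acknowledgement scheme can be sketched as an iterative loop. The `query_hub` callback and the tuple it returns are hypothetical stand-ins for the real query/acknowledgement exchange: it yields the "searched" list (the contacted hub plus its neighbours) and the "try these" list (neighbours' neighbours).

```python
from collections import deque

def crawl(start_hubs, query_hub):
    """Iterative network crawl (sketch).

    query_hub(addr) stands in for sending a query and receiving the
    acknowledgement: it returns (searched, try_hubs). Hubs on the done
    list are never contacted again, so clusters are not double-searched."""
    done = set()
    contacted = []
    todo = deque(start_hubs)
    while todo:
        hub = todo.popleft()
        if hub in done:
            continue                      # already covered by some cluster
        searched, try_hubs = query_hub(hub)
        done.update(searched)             # whole cluster marked as searched
        contacted.append(hub)
        todo.extend(h for h in try_hubs if h not in done)
    return contacted

# Toy topology: contacting "A" also covers its neighbours B and C, and
# suggests D (a neighbour's neighbour) as the next contact point.
topo = {"A": (["A", "B", "C"], ["D"]), "D": (["D", "E"], [])}
assert crawl(["A"], lambda h: topo[h]) == ["A", "D"]
```

The real process would also stop early once the desired result condition is reached; the loop above runs to exhaustion only because the toy network is tiny.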
General Hub Discovery
Crawling through the network logically has some benefits; however a more random approach can also be beneficial. To aid this, a few random hubs from the active hub cache are also returned in query acknowledgements. Random cached hubs are exchanged between all nodes on the network, much like the "X-Try-Ultrapeers" header, except as a continual process. Finally, hub walker advertisements use a random walker approach to advertise long-running hubs to distant points in the network. All indirect hub references are covered by a last-seen time, allowing more recently confirmed hubs to be prioritised and limiting the impact of lost hubs. (Connection attempts on former ultrapeer IPs have been a significant problem in Gnutella1.)
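A last-seen-ordered hub cache like the one described is simple to model. The class and method names below are illustrative, not part of any G2 specification; the one behaviour taken from the text is that more recently confirmed hubs are preferred, which naturally starves out stale former-hub addresses.

```python
import heapq
import time

class HubCache:
    """Minimal hub cache keyed by address, ordered by last-seen time."""

    def __init__(self):
        self._seen = {}                       # addr -> last-seen timestamp

    def add(self, addr, seen_at=None):
        """Record a sighting, keeping only the most recent timestamp."""
        ts = time.time() if seen_at is None else seen_at
        if ts > self._seen.get(addr, 0.0):
            self._seen[addr] = ts

    def best(self, n):
        """Most recently confirmed hubs first: these are the safest to try."""
        return heapq.nlargest(n, self._seen, key=self._seen.get)

cache = HubCache()
cache.add("1.2.3.4:6346", seen_at=100)   # stale reference
cache.add("5.6.7.8:6346", seen_at=200)   # recently confirmed
assert cache.best(1) == ["5.6.7.8:6346"]
```

A real implementation would also evict entries past some age threshold, which is the other half of limiting connection attempts to hubs that no longer exist.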
Security
The term "security" can have many meanings in a peer-to-peer context:
- Preventing denial of service attack upon the network
- Preventing the network from being used to launch an attack on other hosts
- Preventing the originator of a search from being identified
- Preventing the originator of a shared object from being identified
- Preventing false shared objects from being propagated
The first two points relate to securing the network structure against misuse. The next two relate to anonymity, while the last is more difficult to classify. As a developer I am most interested in the first two items, and also to some extent the last item.
Risks Associated With UDP
Any distributed network (and particularly those which involve the trading of host addresses) must take special consideration of possible misuse of its nodes. In the original Gnutella network there is already a significant potential to launch denial of service attacks upon third party hosts by misrepresenting host addresses, which then propagate through the network, resulting in thousands of unsolicited connection attempts. Gnutella2 attempts to address this somewhat by keeping track of last seen times for all hub addresses, however there is a larger security concern: UDP.
At the lowest level, TCP connections are reasonably protected by the need to establish a stream between two hosts. Forging the return address of a connection establishment packet results only in a single (small) response packet being directed at the target host. Once the stream is established, traffic cannot be directed to other hosts.
UDP on the other hand does not benefit from the security of a stream connection. A received datagram may not necessarily have originated from where its header would suggest. In a distributed search scenario this can have a very serious consequence: forging a query packet for a common object and sending it to a hub could result in a large volume of search results being delivered to the forged, third party host address. The more common the query term and the more leaves attached to the hubs (including the effects of hub clustering), the larger the volume of return traffic. In this situation the network is acting as a traffic amplifier, allowing the attacker to anonymously transmit a small volume, but hit their target with a large and untraceable amount of traffic.
Similarly, attacks can be directed upon the network itself. A malicious host could flood one or more hubs with query traffic using random return addresses to thwart any flood control mechanism, causing an overload. While nothing can be done about unsolicited traffic to a node on the network (which is completely independent from the peer to peer system anyway), the use of UDP can eliminate the important ability to detect and ignore such abuse.
Gnutella2 Solution – Query Keys
A solution is built into the Gnutella2 query model. Before a search originator can transmit a query to a particular hub, it must obtain a "query key" for that hub and include it in the transmission. Query keys are unique to the hub generating them and the return address with which they are associated and generally expire after a relatively long period of time.
Originators request a query key by sending a query key request to a hub, which includes the intended return address for the key. The hub generates a key unique to that return address and dispatches it there. This makes it impossible to get a query key for an IP which you do not control, and prevents keys from being shared between two nodes (key stealing).
When receiving a query from a foreign node, Gnutella2 hubs check the query key against the one they have issued for the query's requested return address and proceed only if there is a match. If and only if the query key matches will the query acknowledgement be sent, or the search processed locally and forwarded to local leaves and neighbouring hubs.
This has several positive effects. If a query does not have a valid key for the receiving hub, it will not be processed and will thus not generate a traffic amplification effect which may be used in a denial of service attack on a third party host. Secondly, the query key mechanism ensures that queries are only processed if they were transmitted from a host which has control over the host in the return address (in the normal case, this is the same host). This means that flood control mechanisms can remain just as effective as in the TCP case. Similarly, host blocking is possible and more viable than in Gnutella1 (where the source of a query could not be verified).
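One common way to realise such a scheme is to derive the key from a per-hub secret and the return address, so nothing needs to be stored per requester. The HMAC construction, key length, and function names below are assumptions for illustration; the document does not specify how keys are generated, only that they are bound to the return address and issued by sending them to that address.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(16)   # per-hub secret; rotating it expires all issued keys

def issue_key(return_addr: str) -> bytes:
    """Derive a key bound to the return address. The hub sends the key TO
    that address, so only a host controlling the address ever learns it."""
    return hmac.new(SECRET, return_addr.encode(), hashlib.sha1).digest()[:4]

def verify(query_key: bytes, return_addr: str) -> bool:
    """Process a query only if its key matches the claimed return address."""
    return hmac.compare_digest(query_key, issue_key(return_addr))

key = issue_key("10.0.0.5:6346")
assert verify(key, "10.0.0.5:6346")        # legitimate originator
assert not verify(key, "10.0.0.9:6346")    # forged return address: rejected
```

Deriving rather than storing keys keeps per-hub state constant regardless of how many originators request keys, at the cost that rotating the secret invalidates all outstanding keys at once.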
Conclusion
This document has covered the high level search model used in Gnutella2, as currently deployed in the Shareaza 1.7 betas. I've focused purely on search related issues here, as I think that searching is probably the area of most interest to everyone at the moment, and is definitely the area where there is likely to be some interesting discussion.
Gnutella2 consists of a lot more than just a revised search system, which I will cover soon. The actual implementation of the query model allows for a lot more flexibility than we have in the Gnutella1 framework, in terms of the way information is requested and the way it is returned, as well as what kinds of information are dealt with. Also:
- A new generic and extensible packet format, on which all of the new functionality is built
- The unreliable and semi-reliable UDP transports, including encapsulation, fragmentation, retransmission, etc
- The specific packets involved in the search system
- Other base services
November 17, 2002