Object Search Mechanism

Introduction

The distributed object search mechanism is a core component of the Gnutella2 architecture. It allows a search client to locate objects distributed throughout the network efficiently, requesting and receiving a subset of the total information known about each object.

Model

The Gnutella2 object search mechanism is best described as an "iterative crawl" of the network, with a series of important optimisations derived from the network topology and components:

  • A search client iteratively contacts known hubs with its query
  • Hubs match the query against the cached hash tables of their neighbours
  • Where a table hit occurs, the query is forwarded once only
  • The single injected query thus effectively covers the logical hub cluster, which is the basic searchable unit
  • Nodes which actually receive the filtered query process it and send results to the search client directly
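
As a rough illustration, the crawl reduces to a loop over a hub cache. The sketch below is minimal and self-contained: the Hub structure and the send_keyed_query stand-in are hypothetical, and a real client speaks G2 packets over UDP, obtains a query key per hub first (see the walkthrough below), and collects results asynchronously.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Hub:
        address: str
        neighbours: list = field(default_factory=list)

    def send_keyed_query(hub):
        # Stand-in for one /Q2 round trip: the acknowledgement reports
        # the cluster members now covered, plus the 2-hop hubs the
        # cluster knows about that have not yet been searched.
        done = [hub] + hub.neighbours
        try_hubs = [m for n in hub.neighbours for m in n.neighbours]
        return done, try_hubs

    def crawl(start_hubs, max_steps=10):
        cache, queried = list(start_hubs), set()
        for _ in range(max_steps):
            frontier = [h for h in cache if h.address not in queried]
            if not frontier:
                break                               # frontier exhausted
            hub = random.choice(frontier)
            done, try_hubs = send_keyed_query(hub)
            queried.update(h.address for h in done) # whole cluster covered
            cache.extend(try_hubs)                  # feeds the next step
        return queried

Each iteration injects exactly one query and retires an entire hub cluster from the frontier, which keeps the crawl's cost proportional to the number of clusters rather than the number of individual hubs.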

This model has a number of important advantages:

  • Wide network coverage is provided by iterative query injection
  • Effective leaf density is increased by querying hub clusters rather than single hubs or even single nodes
  • Mutual filtering (hubs cache their neighbours' hash tables) and two-level filtering (leaf tables aggregated at hubs) are combined with static link compression
  • The load on each hub cluster is spread equally between its member hubs, providing a larger search base with a lower cost on the aggregation points
  • Hubs need a greater inbound bandwidth capacity than outbound bandwidth capacity, which ideally suits asymmetric Internet connections
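
The hash-table filtering above can be pictured as a Bloom-filter-style bitmap per link. The table size and hash function below are illustrative assumptions; the actual Gnutella2 query hash table format defines its own dimensions, word hashing, and wire encoding.

    class QueryHashTable:
        def __init__(self, bits=20):            # 2^20 slots: assumed size
            self.size = 1 << bits
            self.table = bytearray(self.size // 8)

        def _slot(self, word):
            return hash(word.lower()) % self.size  # illustrative hash only

        def add(self, word):                    # a leaf adds shared keywords
            s = self._slot(word)
            self.table[s // 8] |= 1 << (s % 8)

        def merge(self, other):                 # two-level filtering: leaf
            for i, b in enumerate(other.table): # tables are OR-ed into the
                self.table[i] |= b              # hub's aggregate table

        def might_match(self, words):
            # Forward only if every keyword could be present: false
            # positives are possible, false negatives are not.
            return all(self.table[(s := self._slot(w)) // 8] >> (s % 8) & 1
                       for w in words)

    qht = QueryHashTable()
    for word in ("blue", "whale", "mpeg"):
        qht.add(word)
    assert qht.might_match(["blue", "whale"])   # unrelated words almost
                                                # certainly fail the test

A hub keeps one such table per leaf plus an aggregate table that it exchanges with its neighbouring hubs, so a single bitmap test decides whether a query needs to cross a given link at all.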

Search Process Walkthrough

When a search client wishes to execute a new search, the following events happen:

  • The search client selects an eligible hub from its global hub cache which has a recent timestamp, has not been contacted more recently than it allows, and has not yet been queried in this search.
  • If a query key is not yet available for this hub, a query key request is dispatched.
  • Once a query key is available, the search client sends a keyed query to the hub.
  • Upon receiving the query, the hub checks the query key for validity (the key mechanism is sketched after this list).
  • The hub then responds with a query acknowledgement packet, containing a list of neighbouring hubs which have now been searched and a list of 2-hop hubs which have not yet been searched.
  • The search client adds the list of searched hubs to its "don't try again" list, and adds the list of "try hubs" to the global hub cache for future selection.
  • Meanwhile, the hub examines the query to make sure that it has not received it before. Duplicate queries are dropped.
  • The hub then matches the query against the query hash table of each connected node, and sends it to those nodes which may be able to field a match (see the hub-side sketch after this list).
  • While this is occurring, the hub processes the query locally and returns any results it may have.
  • Directly attached leaves with a potential match receive the query and process it locally. They may elect to return results directly to the search client, or return them to their hub for dispatch.
  • Other hubs in the hub cluster which received the query now examine it to ensure they have not processed it before. They do not send an acknowledgement.
  • Assuming it is new, these hubs match the query against the query hash tables of their leaves, but not of their neighbouring hubs. Leaves with a potential match receive a copy of the query, and each hub also processes it locally.
  • Once again, the hub returns its own results and may forward results for its leaves if they do not wish to dispatch them directly.
  • Meanwhile, the search client receives any search results generated by the hub cluster.
  • The search client makes a decision on whether it should continue its search. If so, the process starts again.
  • The search client will not requery any of the hubs in the hub cluster, but it has a list of nearby hubs so that the crawl can continue.
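
The query key step exists so that a hub can verify that the return address in a query really belongs to the client asking; otherwise an attacker could direct result floods at a victim. The standard leaves key derivation to each hub; the HMAC scheme below is just one plausible, stateless way to do it, and every name in it is hypothetical.

    import hmac, hashlib, os

    SECRET = os.urandom(16)             # per-hub secret; rotate periodically

    def issue_query_key(requester_ip: str) -> bytes:
        # Returned in answer to a query key request from this address.
        return hmac.new(SECRET, requester_ip.encode(),
                        hashlib.sha256).digest()[:4]

    def key_is_valid(query_key: bytes, return_ip: str) -> bool:
        # A keyed query is honoured only if its key matches the return
        # address it names, proving the requester can receive at that IP.
        return hmac.compare_digest(query_key, issue_query_key(return_ip))

    key = issue_query_key("203.0.113.7")
    assert key_is_valid(key, "203.0.113.7")
    assert not key_is_valid(key, "198.51.100.9")

Because the key only reaches the client via the address it claims, possessing a valid key is proof of reachability, and the hub needs to store nothing per client.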
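
Hub-side handling of an incoming query, as described in the walkthrough, amounts to duplicate suppression plus one hash-table test per link. The sketch below is self-contained but heavily simplified: it skips key validation, acknowledgements, and local matching, and its set-based might_match stands in for a real query hash table.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Link:                         # a connected leaf or neighbouring hub
        name: str
        keywords: set = field(default_factory=set)  # stand-in for its QHT
        inbox: list = field(default_factory=list)

        def might_match(self, words):   # AND semantics, like a real QHT
            return all(w in self.keywords for w in words)

    @dataclass
    class Hub:
        leaves: list
        neighbours: list
        seen: dict = field(default_factory=dict)    # query GUID -> arrival

        def handle_query(self, guid, words, from_neighbour_hub=False):
            if guid in self.seen:
                return                  # already handled: drop the duplicate
            self.seen[guid] = time.time()
            targets = list(self.leaves)
            if not from_neighbour_hub:  # forward across the cluster once only
                targets += self.neighbours
            for link in targets:
                if link.might_match(words):
                    link.inbox.append((guid, words))

Queries arriving from a neighbouring hub are matched against leaf tables only, which is exactly the "forwarded once only" rule that keeps a single injection from echoing around the cluster.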

Performance Considerations

Hubs should acknowledge /Q2 messages as quickly as possible with a /QA, not 5 or 10 minutes later. Why? Because the querying host will probably time out, assume the remote hub is offline or too busy, and start querying other hubs... some of which may already have been covered by the first hub's cluster. This creates a nasty avalanche effect: busy hubs receive extra traffic to acknowledge, which delays their responses further and provokes still more useless querying.

The more delay there is between receiving a /Q2 and acknowledging it with a /QA, the less effective the hub clustering topology becomes.
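
As a back-of-the-envelope illustration of the avalanche, consider the worst case where a client whose /QA never arrives in time later queries every member of the already-covered cluster separately. The timeout and delay figures below are assumptions for illustration, not values from the standard.

    def extra_q2_traffic(cluster_size, qa_delay, client_timeout=10.0):
        # If the /QA beats the client's timeout, one /Q2 covers the whole
        # cluster and nothing is wasted. If it does not, the client never
        # learns which hubs were covered, so in the worst case it later
        # sends a separate /Q2 to every other member it discovers.
        if qa_delay <= client_timeout:
            return 0
        return cluster_size - 1

    assert extra_q2_traffic(cluster_size=6, qa_delay=2.0) == 0
    assert extra_q2_traffic(cluster_size=6, qa_delay=300.0) == 5

The extra queries land on hubs that are, by hypothesis, already busy, which is what turns one slow acknowledgement into cluster-wide load.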