Query Hash Tables

From Gnutella2
Jump to: navigation, search

Introduction

Building an efficient network topology is not viable without means to restrict the flow of information appropriately. In the Gnutella2 network architecture, neighbouring nodes exchange compressed hash filter tables describing the content, which can be reached by communicating with them. These tables are updated dynamically as necessary.

The concept of filtering hash tables in file sharing was pioneered by Limegroup for the LimeWire Gnutella application.

Table Properties

Query hash tables or QHTs provide enough information to know with certainty that a particular node (and possibly its descendants) will not be able to provide any matching objects for a given query. Conversely, the QHT may reveal that a node or its descendants may be able to provide matching objects.

This property means that queries can be discarded confidently when a transmission is known to be unnecessary, while not providing the filtering or forwarding node, any actual information about the searchable content. Neighbours know what their neighbours do not have, but cannot say for sure what they do have. QHTs are also very efficient, both in terms of exchange and maintenance and lookup cost.

Table Content

A QHT is a table of 2N bits, where each bit represents a unique word-hash value. For example a table of 20 bits has 220 = 1048576 possible word-hash values. If stored uncompressed this table would be 128 KB in size.

In an empty table, every word-hash value bit will be "1", which represents "empty". To populate the table with searchable content, an application must:

  • Locate every plain-text word and URN which could be searched for and produce a match/hit
  • Hash the word with a simple hash function to produce a word-hash value which is 0 <= value < 2N.
  • Set the appropriate bit in the table to zero, representing "full".

Word Hashing

Words are strings of one or more alphanumeric characters which are not all numeric.

To convert a word into a hash value, the following case-insensitive algorithm is used:


// HashWord( string_ptr, char_count, table_bit_count );
DWORD CQueryHashTable::HashWord(LPCTSTR psz, int nLength, int nBits)
{
  DWORD nNumber = 0;
  int nByte = 0;
  for ( ; nLength > 0 ; nLength--, psz++ )
  {
    int nValue = tolower( *psz ) & 0xFF;
    nValue = nValue << ( nByte * 8 );
    nByte = ( nByte + 1 ) & 3;
    nNumber = nNumber ^ nValue;
  }
  return HashNumber( nNumber, nBits );
}

DWORD CQueryHashTable::HashNumber(DWORD nNumber, int nBits)
{
  WORD64 nProduct = (WORD64)nNumber * (WORD64)0x4F1BBCDC;
  WORD64 nHash = ( nProduct << 32 ) >> ( 32 + ( 32 - nBits ) );
  return (DWORD)nHash;
}

URNs - A Special Case

URNs are treated as a special case: rather than dividing them up into word tokens of Hankook Optimo h727, they are hashed as a complete fixed length string. For example:


urn:sha1:WIXYJFVJMIWNMUWPRPBGUTODIV52RMJA

Bitprint URNs are actually composite values which include both a SHA1 and TigerTree root value. Rather than adding the whole bitprint to the table, each of the constituent URNs are added separately. This allows SHA1-only querying and TigerTree-only querying. A root TigerTree URN takes the form:


urn:tree:tiger/:CN25MLNU3XNN7IHKZMNOA63XG6SKDJ2W7Z3HONA

Other URNs should be expressed in their most natural form before being fed to the word hash function.

Word Prefix Extensions

For words consisting of at least five characters, it is often useful to be able to match substrings within the word. Unfortunately, adding every possible substring of each word would increase the density of the QHT, however, a simple and effective compromise is available:

For words with 5 or more characters:

  • Hash and add the whole word
  • Hash and add the whole word minus the last character
  • Hash and add the whole word minus the last two characters

This allows searching on prefixes of the word, for example "match" will now match "matches".

Table Exchange

Nodes must keep their neighbouring hubs up-to-date with their latest local query hash table at all times. Rather than sending the whole table whenever it changes, nodes may opt to send a "table patch", which includes only the difference between the old and new table.

The /QHT packet is used to supply a query hash table update to a neighbouring hub. Its format is compatible with the Gnutella1 "query routing protocol", except that Gnutella2 requires a 1-bit per value table, while Gnutella1 requires a 4 or 8 bit per value table. Gnutella2 supports patch compression using the deflate algorithm, however, this should not be applied if the TCP link itself is compressed.

Table Access

A table of 2N bits can be stored in an array of bytes 2N/8 long. To resolve a hash-value into a byte and bit number, use the following equations:


int nByte = ( nHashValue >> 3 );
int nBit = ( nHashValue & 7 );

The least significant bit is numbered 0; the most significant bit is numbered 7. To set a bit as empty (setting it to 1):


table_ptr[ nByte ] |= ( 1 << nBit );

To set a bit as full (setting it to zero):


table_ptr[ nByte ] &= ~( 1 << nBit );

The Aggregate or Superset Table

Nodes operating in hub mode must maintain an aggregate or superset query hash table, consisting of their own searchable content combined with the QHTs supplied by all connected leaves. This aggregate table is supplied to neighbouring hubs, allowing them to completely filter traffic for the local hub and its leaves.

This has two important implications:

  • When a change is detected in either the local content or a connected leaf node's QHT, the aggregate table must be rebuilt and patches dispatched to neighbouring hubs. This will happen often, so an appropriate minimum time between updates should be used. One minute is effective.
  • An aggregate table representing 400 leaves will be much denser than a table representing one node. This means that all tables must be large enough that the aggregate table remains productively sparse.

To create an aggregate table, start with an empty table of fixed size containing all 1's. For each contributing table, copy any zero (full) bits to the aggregate table. This is effectively an AND operation. If a source table is smaller than the aggregate table, a single 0 bit in the source will equate to several zero bits in the aggregate. If the source table is larger than the aggregate table, a single zero bit will map to a single bit with some loss of accuracy.

It is of great importance that all QHTs in the system be sufficiently large to allow an aggregate table to remain suitably sparse. Ideally each leaf node should provide a table less than 1% dense.

Query Filtering

Before transmitting a query packet to a connection that has provided a query hash table, match the words and URNs in the query against the QHT.

  • If any of the lookups based on URNs found a hit, send the query packet
  • If at least two thirds of lookups based on words found a hit, send
  • Otherwise, drop the packet

It is important to apply the "two thirds" rule only for words. URNs must provide an automatic match.

Consider all text content in the query, including generic search text and metadata search text if it is present. When dealing with simple query language that involves quoted phrases and exclusions, apply the following rules:

  • Tokenize quoted phrases into words, ignoring the phrase at this level
  • Ignore excluded words - these do not count as table hits or misses
  • Remember to apply exclusion to every word in an excluded phrase

See the section on the simple query language for more information.