Query Hash Tables: Difference between revisions

From Gnutella2
Jump to navigation Jump to search
No edit summary
 
No edit summary
Line 25: Line 25:
== Table Content ==
== Table Content ==


A QHT is a table of 2^N bits, where each bit represents a unique word-hash value.
A QHT is a table of 2<sup>N</sup> bits, where each bit represents a unique word-hash value.
For example a table of 20 bits has 2^20 = 1048576 possible word-hash values. If
For example a table of 20 bits has 2<sup>20</sup> = 1048576 possible word-hash values. If
stored uncompressed this table would be 128 KB in size.
stored uncompressed this table would be 128 KB in size.


Line 33: Line 33:


* Locate every plain-text word and URN which could be searched for and produce a match/hit
* Locate every plain-text word and URN which could be searched for and produce a match/hit
* Hash the word with a simple hash function to produce a word-hash value which is 0 <= value < (2^N).
* Hash the word with a simple hash function to produce a word-hash value which is 0 <= value < 2<sup>N</sup>.
* Set the appropriate bit in the table to zero, representing "full".
* Set the appropriate bit in the table to zero, representing "full".


Line 121: Line 121:
value table. Gnutella2 supports patch compression using the deflate algorithm,
value table. Gnutella2 supports patch compression using the deflate algorithm,
however this should not be applied if the TCP link itself is compressed.
however this should not be applied if the TCP link itself is compressed.
== Table Access ==
A table of 2<sup>N</sup> bits can be stored in an array of bytes 2<sup>N</sup>/8 long. To resolve a
hash-value into a byte and bit number, use the following equations:
<pre>
<nowiki>
int nByte = ( nHashValue >> 3 );
int nBit = ( nHashValue & 7 );
</nowiki>
</pre>
The least significant bit is numbered 0; the most significant bit is numbered 7. To set
a bit as empty (setting it to 1):
<pre>
<nowiki>
table_ptr[ nByte ] |= ( 1 << nBit );
</nowiki>
</pre>
To set a bit as full (setting it to zero):
<pre>
<nowiki>
table_ptr[ nByte ] &= ~( 1 << nBit );
</nowiki>
</pre>

Revision as of 04:55, 20 March 2005

Introduction

Building an efficient network topology is not viable without means to restrict the flow of information appropriately. In the Gnutella2 network architecture, neighbouring nodes exchange compressed hash filter tables describing the content which can be reached by communicating with them. These tables are updated dynamically as necessary.

The concept of filtering hash tables in file sharing was pioneered by Limegroup for the LimeWire? Gnutella1 application.

Table Properties

Query hash tables or QHTs provide enough information to know with certainty that a particular node (and possibly its descendants) will not be able to provide any matching objects for a given query. Conversely the QHT may reveal that a node or its descendants may be able to provide matching objects.

This property means that queries can be discarded confidently when a transmission is known to be unnecessary, while not providing the filtering or forwarding node any actual information about the searchable content. Neighbours know what their neighbours do not have, but cannot say for sure what they do have. QHTs are also very efficient both in terms of exchange and maintenance and lookup cost.

Table Content

A QHT is a table of 2N bits, where each bit represents a unique word-hash value. For example a table of 20 bits has 220 = 1048576 possible word-hash values. If stored uncompressed this table would be 128 KB in size.

In an empty table, every word-hash value bit will be "1", which represents "empty". To populate the table with searchable content, an application must:

  • Locate every plain-text word and URN which could be searched for and produce a match/hit
  • Hash the word with a simple hash function to produce a word-hash value which is 0 <= value < 2N.
  • Set the appropriate bit in the table to zero, representing "full".

Word Hashing

Words are strings of one or more alphanumeric characters which are not all numeric.

To convert a word into a hash value, the following case-insensitive algorithm is used:


// HashWord( string_ptr, char_count, table_bit_count );
DWORD CQueryHashTable::HashWord(LPCTSTR psz, int nLength, int nBits)
{
  DWORD nNumber = 0;
  int nByte = 0;
  for ( ; nLength > 0 ; nLength--, psz++ )
  {
    int nValue = tolower( *psz ) & 0xFF;
    nValue = nValue << ( nByte * 8 );
    nByte = ( nByte + 1 ) & 3;
    nNumber = nNumber ^ nValue;
  }
  return HashNumber( nNumber, nBits );
}

DWORD CQueryHashTable::HashNumber(DWORD nNumber, int nBits)
{
  WORD64 nProduct = (WORD64)nNumber * (WORD64)0x4F1BBCDC;
  WORD64 nHash = ( nProduct << 32 ) >> ( 32 + ( 32 - nBits ) );
  return (DWORD)nHash;
}

URNs - A Special Case

URNs are treated as a special case: rather than dividing them up into word tokens, they are hashed as a complete fixed length string. For example:


urn:sha1:WIXYJFVJMIWNMUWPRPBGUTODIV52RMJA

Bitprint URNs are actually composite values which include both a SHA1 and TigerTree root value. Rather than adding the whole bitprint to the table, each of the constituent URNs are added separately. This allows SHA1-only querying and TigerTree-only querying. A root TigerTree URN takes the form:


urn:tree:tiger/:CN25MLNU3XNN7IHKZMNOA63XG6SKDJ2W7Z3HONA

Other URNs should be expressed in their most natural form before being fed to the word hash function.

Word Prefix Extensions

For words consisting of at least five characters, it is often useful to be able to match substrings within the word. Unfortunately adding every possible substring of each word would increase the density of the QHT, however a simple and effective compromise is available:

For words with 5 or more characters:

  • Hash and add the whole word
  • Hash and add the whole word minus the last character
  • Hash and add the whole word minus the last two characters

This allows searching on prefixes of the word, for example "match" will now match "matches".

Table Exchange

Nodes must keep their neighbouring hubs up to date with their latest local query hash table at all times. Rather than sending the whole table whenever it changes, nodes may opt to send a "table patch", which includes only the difference between the old and new table.

The /QHT packet is used to supply a query hash table update to a neighbouring hub. Its format is compatible with the Gnutella1 "query routing protocol", except that Gnutella2 requires a 1-bit per value table while Gnutella1 requires a 4 or 8 bit per value table. Gnutella2 supports patch compression using the deflate algorithm, however this should not be applied if the TCP link itself is compressed.

Table Access

A table of 2N bits can be stored in an array of bytes 2N/8 long. To resolve a hash-value into a byte and bit number, use the following equations:


int nByte = ( nHashValue >> 3 );
int nBit = ( nHashValue & 7 );

The least significant bit is numbered 0; the most significant bit is numbered 7. To set a bit as empty (setting it to 1):


table_ptr[ nByte ] |= ( 1 << nBit );

To set a bit as full (setting it to zero):


table_ptr[ nByte ] &= ~( 1 << nBit );