Datatypes

From Gnutella2
Revision as of 16:19, 18 January 2014 by Ram (Multi-Byte Integers: added blurb about variable-length encoding)

Introduction

The format of a packet payload is defined by the packet type and can consist of any binary data; however, there are a number of conventions in place for serializing common datatypes.

Multi-Byte Integers

Multi-byte integers are serialized in the byte order of the topmost packet. Little-endian is the default byte order; however, a packet can select big-endian byte order instead.
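As a rough illustration of the rule above, the following sketch decodes an unsigned integer in either byte order. The function name `g2_read_uint` and the `big_endian` parameter are hypothetical; how a packet actually signals its byte order is described elsewhere and is simply assumed here to be available to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode an n-byte unsigned integer (1 <= n <= 8)
 * from buf, using the byte order of the enclosing packet. */
static uint64_t g2_read_uint(const uint8_t *buf, size_t n, int big_endian)
{
    assert(buf != NULL && n >= 1 && n <= 8);
    uint64_t value = 0;
    for (size_t i = 0; i < n; i++) {
        /* Little-endian: the first byte is the least significant.
         * Big-endian: the first byte is the most significant. */
        size_t shift = big_endian ? 8 * (n - 1 - i) : 8 * i;
        value |= (uint64_t)buf[i] << shift;
    }
    return value;
}
```

Assembling the value byte by byte, rather than casting the buffer to an integer pointer, keeps the result independent of the host's own byte order and alignment rules.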

Some values can also be serialized with their most significant zero bytes stripped off, which is called variable-length encoding. This is suitable for values that are usually small, because it avoids transmitting redundant zero bytes over the network.

A variable-length encoding of a value less than 256 requires a single byte, a value less than 65536 requires 2 bytes, and so on. This is the type of encoding used for serializing the length of each G2 packet, for instance.
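The size rule above can be sketched as follows. Both function names are hypothetical, the example is limited to 32-bit values, and it assumes little-endian output and that zero still occupies one byte. The decoder has to learn the byte count from elsewhere in the packet, which this section does not describe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed once the most significant
 * zero bytes are dropped. Assumption: zero still takes one byte. */
static size_t g2_varlen_size(uint32_t value)
{
    size_t n = 1;
    while (value > 0xFFu) {
        value >>= 8;
        n++;
    }
    return n;
}

/* Write value into out using the minimum number of bytes, in little-endian
 * order (the default). Returns the number of bytes written (1 to 4). */
static size_t g2_varlen_encode(uint32_t value, uint8_t out[4])
{
    assert(out != NULL);
    size_t n = g2_varlen_size(value);
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(value >> (8 * i));
    return n;
}
```

For example, 255 fits in one byte, while 256 and 65535 each need two and 65536 needs three.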

Network/Node Addresses

A network or node address consists of a physical address and a port number, and is of variable length, depending on the address family. In IPv4, a network/node address is six bytes long: 4 bytes for the IP address followed by 2 bytes for the port number, as follows:


typedef struct
{
    BYTE  ip[4];  /* IPv4 address: four 8-bit integers */
    SHORT port;   /* 16-bit port number, in the packet's byte order */
} IPV4_ENDPOINT;

Note that this is considered an array of four 8-bit integers (bytes), followed by a 16-bit integer (short). Byte order does not affect the individual bytes, but it does affect the 16-bit port number.

IPv6 addresses are longer and are not yet defined within the scope of Gnutella2. However, applications should be aware that a node address that is not 6 bytes long belongs to a different address family.
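A minimal sketch of reading such an address is shown below. The type `ipv4_endpoint`, the function `g2_parse_ipv4_endpoint`, and the `big_endian` parameter are illustrative names rather than part of the specification. Any address that is not exactly 6 bytes long is rejected as belonging to another address family.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative counterpart of IPV4_ENDPOINT using fixed-width types. */
typedef struct {
    uint8_t  ip[4];
    uint16_t port;
} ipv4_endpoint;

/* Returns 1 and fills *out on success; returns 0 when len is not 6,
 * i.e. when the address is not an IPv4 network/node address. */
static int g2_parse_ipv4_endpoint(const uint8_t *buf, size_t len,
                                  int big_endian, ipv4_endpoint *out)
{
    assert(buf != NULL && out != NULL);
    if (len != 6)
        return 0;
    /* The four address bytes are copied as-is; byte order does not apply. */
    memcpy(out->ip, buf, 4);
    /* Only the 16-bit port depends on the packet's byte order. */
    out->port = big_endian ? (uint16_t)((buf[4] << 8) | buf[5])
                           : (uint16_t)(buf[4] | (buf[5] << 8));
    return 1;
}
```

Reading the fields one at a time, instead of copying the six bytes straight over the struct, avoids relying on the compiler's struct layout or on the host's byte order.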

GUIDs

Globally unique identifiers (GUIDs) are used to identify nodes on the network. GUIDs are serialized as an array of 16 bytes.
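Because a GUID is serialized as plain bytes, it can presumably be copied and compared without regard to byte order. The sketch below assumes this; the names `g2_guid`, `g2_read_guid`, and `g2_guid_equal` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define G2_GUID_SIZE 16   /* illustrative constant name */

typedef struct {
    uint8_t bytes[G2_GUID_SIZE];
} g2_guid;

/* Copy a GUID out of a payload. The caller must ensure that at least
 * 16 bytes are available at buf. */
static g2_guid g2_read_guid(const uint8_t *buf)
{
    assert(buf != NULL);
    g2_guid guid;
    memcpy(guid.bytes, buf, G2_GUID_SIZE);
    return guid;
}

/* Two GUIDs identify the same node when all 16 bytes match. */
static int g2_guid_equal(const g2_guid *a, const g2_guid *b)
{
    return memcmp(a->bytes, b->bytes, G2_GUID_SIZE) == 0;
}
```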

Strings

Strings are encoded in UTF-8 and serialized as a zero-terminated sequence of 8-bit integers.

A zero character (0x00) marks the end of the string. However, if the string data reaches the end of the packet (or child packet) payload, the terminator is not required. This means that packets whose payload consists of a single string do not need to include a zero terminator, and their payload length will be exactly the byte length of the encoded string.
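The optional terminator can be handled by bounding the search to the bytes left in the payload, as in this sketch. The function name `g2_string_length` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the byte length of the string that starts
 * at buf, where remaining is the number of payload bytes left.
 * *consumed receives the number of bytes the string occupies in the
 * payload, including the terminator when one is present. */
static size_t g2_string_length(const uint8_t *buf, size_t remaining,
                               size_t *consumed)
{
    assert(buf != NULL && consumed != NULL);
    const uint8_t *nul = memchr(buf, 0x00, remaining);
    if (nul == NULL) {
        /* No terminator: the string runs to the end of the payload. */
        *consumed = remaining;
        return remaining;
    }
    size_t len = (size_t)(nul - buf);
    *consumed = len + 1;   /* skip the terminator as well */
    return len;
}
```

A parser would advance by `consumed` bytes to reach whatever follows the string, and would treat the first `len` bytes as the UTF-8 data.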

UTF-8 encoding is required for all strings present in the packet payload. This means that 7-bit characters may be passed as-is, while extended characters are encoded with multi-byte sequences.

All applications must be able to parse UTF-8 encoded strings; however, it is up to the individual application whether to store the string as Unicode or convert it to the local code page for processing. When a packet must be processed and forwarded, the original packet must be forwarded rather than a regenerated version. This ensures that both locally unsupported encodings and packet extensions are preserved.

Applications should never send ANSI strings directly if they contain extended characters, that is, bytes with the most significant bit (MSB) set. Such strings must be encoded as UTF-8 first. Otherwise, decoding may fail, and the packet will either be discarded or yield bogus information.
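One way a receiver might detect such strings is to check that the bytes are well-formed UTF-8 before using them. The validator below is a sketch, not something the specification prescribes, and `g2_is_valid_utf8` is a hypothetical name. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int g2_is_valid_utf8(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint8_t lead = buf[i];
        size_t extra;    /* continuation bytes expected after lead */
        uint32_t cp;     /* code point being decoded */
        uint32_t min;    /* smallest code point valid for this length */

        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return 0;   /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1)
            return 0;    /* sequence is cut off by the end of the data */
        for (size_t k = 1; k <= extra; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;    /* overlong form, out of range, or surrogate */
        i += extra + 1;
    }
    return 1;
}
```

For instance, the word "café" sent in a single-byte code page ends in the lone byte 0xE9, which this check rejects, whereas its UTF-8 form ends in the two bytes 0xC3 0xA9 and passes.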