UDP Transceiver

From Gnutella2
Revision as of 15:50, 27 March 2005 by Kath (Talk | contribs) (Introduction)

Jump to: navigation, search

Introduction

The User Datagram Protocol (UDP) is invaluable in peer to peer systems because it provides a relatively low-cost (low-overhead) method of sending short, irregular messages to a very large number of peers on demand. Establishing a TCP stream connection to a peer simply to deliver a single packet of information is wasteful in data volume and time for the peers involved and state-aware network devices along the route, for example, network address translation facilities. When dealing with a large number of peers quickly, these costs become unbearable. UDP provides a solution and makes this kind of interaction possible.

However, the delivery of UDP packets is not reliable: packets may be lost en-route for a number of reasons. Often this behaviour is desirable, for example when the destination node's connection is highly congested, UDP packets are likely to be discarded. If the content was not critical, this loss is appropriate as the host's resources are effectively unavailable. In other scenarios involving critical payloads, UDP's lack of reliability is a problem: either the sender needs to make sure the receiver gets the payload, or it needs to know definitively that the receiver was unavailable.

The Gnutella2 network solves this problem by implementing a selectively engaged reliability layer on top of the basic UDP protocol. This reliability layer shares some common functionality with TCP, but importantly does not provide any connection state and thus retains the efficiency originally sought in UDP.

This allows Gnutella2 to select the most optimal communication medium for each and every transmission it needs to perform:

  • If a significant volume of data is to be exchanged, or subsequent data will be exchanged with the same destination, a TCP connection is established
  • If a small volume of important data is to be exchanged in a once-off operation or irregularly, reliable UDP is used
  • If a small volume of unimportant data is to be exchanged in a once-off operation or irregularly, unreliable UDP is used

UDP

Gnutella2 semi-reliable communication is transported using the UDP protocol. The port number for receiving UDP is the same as the port number listening for TCP connections.

Encoding

Implementing an additional reliable protocol within UDP requires a small control header before the payload itself. This header is put to good use:

A small signature identifies the packet as a Gnutella2 semi-reliable UDP datagram. This allows the same port to be used for receiving UDP traffic for other protocols if desired, and offers some simple protection against random, unexpected traffic. A content code identifies the payload as a Gnutella2 packet stream, allowing future protocols to be added within the same reliability layer if desired. Flags allow additional attributes to be specified, such as inline stateless compression of the payload (which is a required feature). The header has a fixed size of 8 bytes, and is represented by the following C structure:


#pragma pack(1)
typedef struct
{
   CHAR szTag3;
   BYTE nFlags;
   WORD nSequence;
   BYTE nPart;
   BYTE nCount;
} GND_HEADER;

The members of the structure are detailed below:

  • szTag - contains a three byte encoding protocol identifier, in this case "GND" for "GNutella Datagram". If this signature is not present the packet should not be decoded as a Gnutella2 reliability layer transmission.
  • nFlags - contains flags which modify the content of the packet. The low-order nibble is reserved for critical flags: if one of these bits is set but the decoding software does not understand the meaning, the packet must be discarded. The high-order nibble is reserved for non-critical flags: when set these bits may be interpreted, but an inability to interpret a bit does not cause the packet to be discarded. Currently defined flags are:
  • 0x01 - Deflate
When the deflate bit is set, the entire payload is compressed with the deflate algorithm. The compression method used is the Deflate Compression Data Format (RFC 1951). On top of this compression a ZLIB ‘wrapper’ is applied (RFC 1950, ZLIB Compressed Data Format.) The ZLIB wrapper ensures packet integrity, among other things. Note that the entire payload must be reassembled in the correct order before it can be deflated if the packet was fragmented. Fragments are not compressed separately!
  • 0x02 - Acknowledge Me
When the acknowledge me bit is set, the sender is expecting an acknowledgement for this packet.
  • nSequence - contains the sequence number of the packet. This sequence number is unique to the sending host only. It is not unique to the pair of the sending host and receiving host as in TCP, as there is no concept of connection state. Sequence numbers on consecutive packets need not be increasing (although that is convenient) – they must only be different. If a packet is fragmented, all of its fragments will have the same sequence number. Byte order is unimportant here.
  • nPart - contains the fragment part number (1 <= nPart <= nCount)
  • nCount - contains the number of fragment parts in this packet. On a transmission, this value will be non-zero (all packets must have at least one fragment). If nCount is zero, this is an acknowledgement (see below).

Fragmentation

Large packets must be fragmented before they can be sent through most network interfaces. Different network media have different MTUs, and it is difficult to predict what the lowest common size will be. Fragmentation and reassembly is performed by the existing Internet protocols, however there are two important reasons why the reliability layer performs its own fragmentation:

  • Sockets implementations specify a maximum datagram size. This is adequate for the vast majority of transmissions, but it is desirable to have the transparent ability to send larger packets without worrying about the host implementation.
  • When the Internet protocols fragment a packet and one or more fragments are lost, it may decide to discard the whole packet in an unreliable datagram protocol. The Gnutella2 reliability layer can compensate by retransmitting the whole packet, which would then be re-fragmented and each fragment resent - however this wastes the fragments that were successfully received before. Managing fragmentation natively allows this optimisation.

Each node determines its own MTU, often based on a best guess combined with information from the host's sockets implementation. Packets exceeding this size are fragmented into multiple datagrams of the appropriate size. Each datagram has the same sequence number and the same fragment count (nCount), but a different fragment number (nPart).

Transmission Process

When a packet is to be transmitted the network layer must:

  • Cache the payload
  • Allocate a new locally and temporally unique sequence number
  • Derive the appropriate number of fragments
  • Queue the fragments for dispatch
  • If the fragments do not need to be acknowledged, the packet can be flushed now

The payload will generally be cached for an appropriate timeout period, or until the data cache becomes full at which time older payloads can be discarded. Fragments are dispatched according to the dispatch algorithm of choice, and the sender listens for acknowledgements. When an acknowledgement is received:

  • Lookup the sent packet by sequence number
  • Mark the nPart fragment as received and cancel any retransmissions of this part
  • If all fragments have been acknowledged, flush this packet from the cache

If a fragment has been transmitted but has not been acknowledged within the timeout, it should be retransmitted. A finite number of retransmissions are allowed before the packet as a whole expires, at which time it is assumed that the packet was not received.

Receive Process

When a new datagram is received, the network layer must:

  • If the acknowledge bit was set, send an acknowledge packet for this sequence number and part number, with nCount set to zero (ack)
  • Lookup any existing packet by the sending IP and sequence number
  • If there is no existing packet, create a new packet entry with the IP, sequence number, fragment count and flags
  • If there was an existing packet, make sure it is not marked as done - if so, abort
  • Add the transmitted fragment to the (new or old) packet entry
  • If the packet now has all fragments, mark it as done and decode it and pass it up to the application layer
  • Leave the packet on the cache even if it was finished, in case any parts are retransmitted
  • Expire old packets from the receive buffer after a timeout or if the buffer is full

Dispatch Algorithm

Fragment datagrams need to be dispatched intelligently to spread the load on network resources and maximise the chance that the receiver will get the message. To do this, the dispatch algorithm should take into account several points:

  • Prioritize acknowledgements.
  • If fragments are waiting to be sent to a number of hosts, do not send to the same host twice in a row. Alternating or looping through the target hosts achieves the same data rate locally, but spreads out the load over downstream links.
  • Do not exceed or approach the capacity of the local network connection. If a host has a 128 kb/s outbound bandwidth, dispatching 32 KB of data in one second will likely cause massive packet loss, leading to a retransmission.
  • After considering the above points, prefer fragments that were queued recently to older packets. A LIFO or stack type approach means that even if a transmitter is becoming backed up, some fragments will get there on time while others will be delayed. A FIFO approach would mean that a backed up host delivers every fragment late.

Parameters

The recommended parameters for the reliability layer are as follows:


MTU = 500 bytes
Transmit Retransmit Interval = 10 seconds
Transmit Packet Timeout / Expire = 26 seconds (allows for two retransmissions before expiry)
Receive Packet Expiry = 30 seconds (allows 10 seconds beyond final retransmission)

Performance Considerations

Relatively low-level network implementations such as this are reasonably complicated, but must operate fast. It is desirable to avoid runtime memory allocations in network code as much as possible, and particularly at this level.

It should be noted that in almost all cases, transmissions to "untested" nodes are single fragment. Replies on the other hand are often larger, and may be deflated in many fragments. This is optimal because attempting to contact a node which may be unavailable involves a retransmission of only a single fragment.

Flow control is an important topic, however it is handled at a higher layer. The UDP reliability layer is only responsible for guaranteeing delivery of selected datagrams.

Only critical transmissions whose reception cannot otherwise be inferred should have the acknowledge request bit set.