UDP Transceiver: Difference between revisions
(Added proposed acknowledgment extension) |
(→Parameters: MTU should really be 476 bytes, not 500) |
||
Line 172: | Line 172: | ||
</nowiki> | </nowiki> | ||
</pre> | </pre> | ||
The UDP/IP header being typically 28 bytes, it is best to limit the payload of the messages to '''476 bytes'''. That way, with the 8-byte header we're topping plus the UDP/IP header, the whole message will not be greater than 512 bytes. The recommended MTU should therefore be 476 rather than the legacy 500. --RAM, 05:19, 23 September 2012 (PDT) | |||
== Performance Considerations == | == Performance Considerations == |
Revision as of 12:19, 23 September 2012
<< TCP Stream Connection and Handshaking | Packet Structure >> | Main Page
Introduction
The User Datagram Protocol (UDP) is invaluable in peer to peer systems because it provides a relatively low-cost (low-overhead) method of sending short, irregular messages to a very large number of peers on demand. Establishing a TCP stream connection to a peer simply to deliver a single packet of information is wasteful in data volume and time for the peers involved and state-aware network devices along the route, for example, network address translation facilities. When dealing with a large number of peers quickly, these costs become unbearable. UDP provides a solution and makes this kind of interaction possible.
However, the delivery of UDP packets is not reliable: packets may be lost en-route for a number of reasons. Often this behaviour is desirable, for example when the destination node's connection is highly congested, UDP packets are likely to be discarded. If the content was not critical, this loss is appropriate as the host's resources are effectively unavailable. In other scenarios involving critical payloads, UDP's lack of reliability is a problem: either the sender needs to make sure the receiver gets the payload, or it needs to know definitively that the receiver was unavailable.
The Gnutella2 network solves this problem by implementing a selectively engaged reliability layer on top of the basic UDP protocol. This reliability layer shares some common functionality with TCP, but importantly does not provide any connection state and thus retains the efficiency originally sought in UDP.
This allows Gnutella2 to select the most optimal communication medium for each and every transmission it needs to perform:
- If a significant volume of data is to be exchanged, or subsequent data will be exchanged with the same destination, a TCP connection is established
- If a small volume of important data is to be exchanged in a once-off operation or irregularly, reliable UDP is used
- If a small volume of unimportant data is to be exchanged in a once-off operation or irregularly, unreliable UDP is used
UDP
Gnutella2 semi-reliable communication is transported using the UDP protocol. The port number for receiving UDP is the same as the port number listening for TCP connections.
Encoding
Implementing an additional reliable protocol within UDP requires a small control header before the payload itself. This header is put to good use:
A small signature identifies the packet as a Gnutella2 semi-reliable UDP datagram. This allows the same port to be used for receiving UDP traffic for other protocols if desired, and offers some simple protection against random, unexpected traffic. A content code identifies the payload as a Gnutella2 packet stream, allowing future protocols to be added within the same reliability layer if desired. Flags allow additional attributes to be specified, such as inline stateless compression of the payload (which is a required feature). The header has a fixed size of 8 bytes, and is represented by the following C structure:
#pragma pack(1) typedef struct { CHAR szTag3; BYTE nFlags; WORD nSequence; BYTE nPart; BYTE nCount; } GND_HEADER;
Byte format:
0 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7| Byte +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | szTag | nFlags | 0-3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | nSequence | nPart | nCount | 4-7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The members of the structure are detailed below:
- szTag - contains a three byte encoding protocol identifier, in this case "GND" for "GNutella Datagram". If this signature is not present the packet should not be decoded as a Gnutella2 reliability layer transmission.
- nFlags - contains flags which modify the content of the packet. The low-order nibble is reserved for critical flags: if one of these bits is set but the decoding software does not understand the meaning, the packet must be discarded. The high-order nibble is reserved for non-critical flags: when set these bits may be interpreted, but an inability to interpret a bit does not cause the packet to be discarded. Currently defined flags are:
- 0x01 - Deflate
- When the deflate bit is set, the entire payload is compressed with the deflate algorithm. The compression method used is the Deflate Compression Data Format (RFC 1951). On top of this compression a ZLIB ‘wrapper’ is applied (RFC 1950, ZLIB Compressed Data Format). The ZLIB wrapper ensures packet integrity, among other things. Note that the entire payload must be reassembled in the correct order before it can be deflated if the packet was fragmented. Fragments are not compressed separately!
- 0x02 - Acknowledge Me
- When the acknowledge me bit is set, the sender is expecting an acknowledgement for this packet.
- nSequence - contains the sequence number of the packet. This sequence number is unique to the sending host only. It is not unique to the pair of the sending host and receiving host as in TCP, as there is no concept of connection state. Sequence numbers on consecutive packets need not be increasing (although that is convenient) – they must only be different. If a packet is fragmented, all of its fragments will have the same sequence number. Byte order is unimportant here.
- nPart - contains the fragment part number (1 <= nPart <= nCount)
- nCount - contains the number of fragment parts in this packet. On a transmission, this value will be non-zero (all packets must have at least one fragment). If nCount is zero, this is an acknowledgement (see below).
Fragmentation
Large packets must be fragmented before they can be sent through most network interfaces. Different network media have different MTUs, and it is difficult to predict what the lowest common size will be. Fragmentation and reassembly is performed by the existing Internet protocols, however, there are two important reasons why the reliability layer performs its own fragmentation:
- Sockets implementations specify a maximum datagram size. This is adequate for the vast majority of transmissions, but it is desirable to have the transparent ability to send larger packets without worrying about the host implementation.
- When the Internet protocols fragment, a packet and one or more fragments are lost, it may decide to discard the whole packet in an unreliable datagram protocol. The Gnutella2 reliability layer can compensate by retransmitting the whole packet, which would then be re-fragmented and each fragment resent - however, this wastes the fragments that were successfully received before. Managing fragmentation natively allows this optimisation.
Each node determines its own MTU, often based on a best guess combined with information from the host's sockets implementation. Packets exceeding this size are fragmented into multiple datagrams of the appropriate size. Each datagram has the same sequence number and the same fragment count (nCount), but a different fragment number (nPart).
Transmission Process
When a packet is to be transmitted, the network layer must:
- Cache the payload
- Allocate a new locally and temporally unique sequence number
- Derive the appropriate number of fragments
- Queue the fragments for dispatch
- If the fragments do not need to be acknowledged, the packet can be flushed now
The payload will generally be cached for an appropriate timeout period, or until the data cache becomes full, at which time older payloads can be discarded. Fragments are dispatched according to the dispatch algorithm of choice, and the sender listens for acknowledgements. When an acknowledgement is received:
- Lookup the sent packet by sequence number
- Mark the nPart fragment as received and cancel any retransmissions of this part
- If all fragments have been acknowledged, flush this packet from the cache
If a fragment has been transmitted but has not been acknowledged within the timeout, it should be retransmitted. A finite number of retransmissions are allowed before the packet as a whole expires, at which time it is assumed that the packet was not received.
Receive Process
When a new datagram is received, the network layer must:
- If the acknowledge bit was set, send an acknowledge packet for this sequence number and part number, with nCount set to zero (ack)
- Lookup any existing packet by the sending IP and sequence number
- If there is no existing packet, create a new packet entry with the IP, sequence number, fragment count and flags
- If there was an existing packet, make sure it is not marked as done - if so, abort
- Add the transmitted fragment to the (new or old) packet entry
- If the packet now has all fragments, mark it as done and decode it and pass it up to the application layer
- Leave the packet on the cache even if it was finished, in case any parts are retransmitted
- Expire old packets from the receive buffer after a timeout, or if the buffer is full
Dispatch Algorithm
Fragment datagrams need to be dispatched intelligently, to spread the load on network resources and maximise the chance that the receiver will get the message. To do this, the dispatch algorithm should take into account several points:
- Prioritize acknowledgements.
- If fragments are waiting to be sent to a number of hosts, do not send to the same host twice in a row. Alternating, or looping through the target hosts achieves the same data rate locally, but spreads out the load over downstream links.
- Do not exceed or approach the capacity of the local network connection. If a host has a 128 kb/s outbound bandwidth, dispatching 32 KB of data in one second will likely cause massive packet loss, leading to a retransmission.
- After considering the above points, prefer fragments that were queued recently to older packets. A LIFO or stack type approach means that even if a transmitter is becoming backed up, some fragments will get there on time, while others will be delayed. A FIFO approach would mean that a backed up host delivers every fragment late.
Parameters
The recommended parameters for the reliability layer are as follows:
MTU = 500 bytes Transmit Retransmit Interval = 10 seconds Transmit Packet Timeout / Expire = 26 seconds (allows for two retransmissions before expiry) Receive Packet Expiry = 30 seconds (allows 10 seconds beyond final retransmission)
The UDP/IP header being typically 28 bytes, it is best to limit the payload of the messages to 476 bytes. That way, with the 8-byte header we're topping plus the UDP/IP header, the whole message will not be greater than 512 bytes. The recommended MTU should therefore be 476 rather than the legacy 500. --RAM, 05:19, 23 September 2012 (PDT)
Performance Considerations
Relatively low-level network implementations such as this are reasonably complicated, but must operate fast. It is desirable to avoid runtime memory allocations in network code as much as possible, and particularly at this level.
It should be noted that in almost all cases, transmissions to "untested" nodes are single fragment. Replies on the other hand are often larger, and may be deflated in many fragments. This is optimal, because attempting to contact a node which may be unavailable, involves a retransmission of only a single fragment.
Flow control is an important topic, however, it is handled at a higher layer. The UDP reliability layer is only responsible for guaranteeing delivery of selected datagrams.
Only critical transmissions whose reception cannot otherwise be inferred, should have the acknowledge request bit set.
Extension Proposal
The following is still a proposoal, but since there is no G2 Development Forum where this can be discussed, I thought I would document it here. 05:07, 23 September 2012 (PDT)
This extension should be totally backward compatible with existing implementations provided the above page specifications were strictly adhered to. Here it goes:
Because acknowledgment messages can be lost in the way or arrive out of order, it is best to include as much of the reception state as possible so that the sending party can optimize retransmissions.
In order to do that, the following extensions to the original specifications has been added by gtk-gnutella:
- Cumulative Acknowledgements: when the flag 0x10 is set, it tells the other party that all the fragments up to nPart have been received.
- Extended Acknowledgments: when the flag 0x20 is set, it tells the other party that an acknowledgment payload is present. It immediately follows the header and is architected thusly:
0 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7| Byte +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | nReceived | missingBits | 0-3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
nReceived is the total amount of parts successfully received so far.
missingBits is a bitfield that is read as a big-endian number and which contains set bits for the parts still missing. The base part number is normally 0, unless flag 0x10 was set, in which case the nPart value indicates the base. The rule is then that if bit b is set, then the fragment number b + base + 1 is still missing.
In the unlikely case where missingBits is too small to hold all the missing parts, only the ones that can be represented are included, the nReceived field being there to provide additional information.
Proper generation and parsing of the missingBits field is crucial, so to remove any ambiguity, it is best to interpret missingBits as a number. Then bit 0 is the bit corresponding to 2^0, bit n is the bit corresponding to 2^n.
Bit 0 corresponds to fragment #1, unless the 0x10 flag was set. In that case, if for instance nPart is 3, then it means fragments #1, #2 and #3 were already received. The base is therefore 3, and if bit 0 is set in missingBits, it means fragment #4 (0 + 3 + 1) is still missing.
This extended acknowledgment lets the sending party optimize its retransmissions even when some acknowledgments are lost.
Extended Acknowlegments are only useful when the total amount of fragments is 3 or above. Indeed, with only 2 fragments, the Cumulative Acknowledgment lets the receiving party know about the whole reception state.
When the amount of fragments is 3 or more and only a Cumulative Acknowledgment is sent out, it implicitly denies reception of any other fragments. This optimizes bandwidth since the 4 extra bytes sent out will only be required for large messages (more than 2 fragments) in case fragments are received out-of-order.