UDP Transceiver: Difference between revisions
(→Extension Proposal: Added delayed acknowledgments + negotiation sub-sections) |
(→Negotiation: Clarified processing contract for improved acknowledgments) |
||
(4 intermediate revisions by the same user not shown) | |||
Line 220: | Line 220: | ||
'''nReceived''' is the total amount of parts successfully received so far. | '''nReceived''' is the total amount of parts successfully received so far. | ||
'''missingBits''' is a bitfield that is read as a big-endian number and which contains set bits for the parts still missing. The ''base'' part number is normally 0, unless flag '''0x10''' was set, in which case the '''nPart''' value indicates the base. The rule is then that if bit ''b'' is set, then the fragment number ''b'' + ''base'' + 1 is still missing. | '''missingBits''' is a bitfield that is read as a '''big-endian number''' and which contains set bits for the parts still missing. The ''base'' part number is normally 0, unless flag '''0x10''' was set, in which case the '''nPart''' value indicates the base. The rule is then that if bit ''b'' is set, then the fragment number ''b'' + ''base'' + 1 is still missing. | ||
In the unlikely case where missingBits is too small to hold all the missing parts, only the ones that can be represented are included, the '''nReceived''' field being there to provide additional information. | In the unlikely case where missingBits is too small to hold all the missing parts, only the ones that can be represented are included, the '''nReceived''' field being there to provide additional information. | ||
Proper generation and parsing of the missingBits field is crucial, so to remove any ambiguity, it is best to interpret missingBits as a number. Then bit 0 is the bit corresponding to 2^0, bit ''n'' is the bit corresponding to 2^n. | Proper generation and parsing of the '''missingBits''' field is crucial, so to remove any ambiguity, it is best to interpret missingBits as a number. Then bit 0 is the bit corresponding to 2^0, bit ''n'' is the bit corresponding to 2^n. In the above framing pictogram, bit 0 is the 7th bit of the 3rd byte (because the number in '''missingBits''' is encoded in ''big-endian''). | ||
Bit 0 corresponds to fragment #1, unless the '''0x10''' flag was set. In that case, if for instance '''nPart''' is 3, then it means fragments #1, #2 and #3 were already received. The base is therefore 3, and if bit 0 is set in '''missingBits''', it means fragment #4 (0 + 3 + 1) is still missing. | Bit 0 corresponds to fragment #1, unless the '''0x10''' flag was set. In that case, if for instance '''nPart''' is 3, then it means fragments #1, #2 and #3 were already received. The base is therefore 3, and if bit 0 is set in '''missingBits''', it means fragment #4 (0 + 3 + 1) is still missing. | ||
Line 244: | Line 244: | ||
* '''Improved Acknowledgements''': when the flag '''0x10''' is set in emitted fragments, it tells the other party that it can reply with the ''Cumulative Acknowledgments'' or the ''Extended Acknowledgment'' bits in its acknowledgment messages. | * '''Improved Acknowledgements''': when the flag '''0x10''' is set in emitted fragments, it tells the other party that it can reply with the ''Cumulative Acknowledgments'' or the ''Extended Acknowledgment'' bits in its acknowledgment messages. | ||
When the ''Improved Acknowledgments'' flag is '''missing''', the receiver '''must fall back''' to the ''legacy'' acknowledgment protocol. Since '''0x10''' is in the high-nibble part of the flag field, no legacy implementation will choke upon seeing it, and can safely ignore it. This guarantees the sender that the message will be processed correctly by legacy implementations. | When the ''Improved Acknowledgments'' flag is '''missing''', the receiver '''must fall back''' to the ''legacy'' acknowledgment protocol. Since '''0x10''' is in the high-nibble part of the flag field, no legacy implementation will choke upon seeing it, and they can safely ignore it. This guarantees the sender that the message will be processed correctly by legacy implementations. | ||
When the ''Improved Acknowledgment'' bit is seen, the receiver is naturally free to ignore it altogether (as legacy servents will do) or chose to reply only with ''Cumulative Acknowledgements'' and never send ''Extended Acknowledgments'', for instance. | |||
However, when the sender sets the ''Improved Acknowledgment'' bit, it '''must''' be prepared to receive both ''Cumulative Acknowledgments'' '''and''' ''Extended Acknowledgments''. It is safe to ignore the extension part of the latter though, but the former must be fully handled. | |||
=== Delayed Acknowledgments === | === Delayed Acknowledgments === | ||
Line 250: | Line 254: | ||
To maximize the usefulness of ''Cumulative Acknowledgments'', the receiver can delay its acknowledgments for a small period of time (100 ms, say) in the hope that more fragments of the message will arrive meanwhile, provided the message has more than 1 fragment naturally. When the 100 ms delay is expired, it can then acknowledge everything it got so far. Performance will not be impacted by this small delaying, but it can save a few acknowledgments for larger messages. | To maximize the usefulness of ''Cumulative Acknowledgments'', the receiver can delay its acknowledgments for a small period of time (100 ms, say) in the hope that more fragments of the message will arrive meanwhile, provided the message has more than 1 fragment naturally. When the 100 ms delay is expired, it can then acknowledge everything it got so far. Performance will not be impacted by this small delaying, but it can save a few acknowledgments for larger messages. | ||
This means a sender must use a transmission policy whereby it first sends '''all the fragments''' of each message at once | This means a sender must use a transmission policy whereby it first sends '''all the fragments''' of each message at once, then waits for acknowledgments or timeouts before retransmitting the um-acknowledged fragments. In other words, the sender should not send a fragment, wait for its acknowledgement before sending the next one: this would delay the reception of the entire message and it would prevent using the optimizations that this extension proposal wants to promote! | ||
=== Author === | === Author === | ||
This proposal was created by Raphael Manfredi, last modified on 01:03, 26 September 2012 (PDT). | This proposal was created by Raphael Manfredi, last modified on 01:03, 26 September 2012 (PDT). |
Latest revision as of 15:58, 29 September 2012
<< TCP Stream Connection and Handshaking | Packet Structure >> | Main Page
Introduction
The User Datagram Protocol (UDP) is invaluable in peer to peer systems because it provides a relatively low-cost (low-overhead) method of sending short, irregular messages to a very large number of peers on demand. Establishing a TCP stream connection to a peer simply to deliver a single packet of information is wasteful in data volume and time for the peers involved and state-aware network devices along the route, for example, network address translation facilities. When dealing with a large number of peers quickly, these costs become unbearable. UDP provides a solution and makes this kind of interaction possible.
However, the delivery of UDP packets is not reliable: packets may be lost en-route for a number of reasons. Often this behaviour is desirable, for example when the destination node's connection is highly congested, UDP packets are likely to be discarded. If the content was not critical, this loss is appropriate as the host's resources are effectively unavailable. In other scenarios involving critical payloads, UDP's lack of reliability is a problem: either the sender needs to make sure the receiver gets the payload, or it needs to know definitively that the receiver was unavailable.
The Gnutella2 network solves this problem by implementing a selectively engaged reliability layer on top of the basic UDP protocol. This reliability layer shares some common functionality with TCP, but importantly does not provide any connection state and thus retains the efficiency originally sought in UDP.
This allows Gnutella2 to select the most optimal communication medium for each and every transmission it needs to perform:
- If a significant volume of data is to be exchanged, or subsequent data will be exchanged with the same destination, a TCP connection is established
- If a small volume of important data is to be exchanged in a once-off operation or irregularly, reliable UDP is used
- If a small volume of unimportant data is to be exchanged in a once-off operation or irregularly, unreliable UDP is used
UDP
Gnutella2 semi-reliable communication is transported using the UDP protocol. The port number for receiving UDP is the same as the port number listening for TCP connections.
Encoding
Implementing an additional reliable protocol within UDP requires a small control header before the payload itself. This header is put to good use:
A small signature identifies the packet as a Gnutella2 semi-reliable UDP datagram. This allows the same port to be used for receiving UDP traffic for other protocols if desired, and offers some simple protection against random, unexpected traffic. A content code identifies the payload as a Gnutella2 packet stream, allowing future protocols to be added within the same reliability layer if desired. Flags allow additional attributes to be specified, such as inline stateless compression of the payload (which is a required feature). The header has a fixed size of 8 bytes, and is represented by the following C structure:
#pragma pack(1) typedef struct { CHAR szTag[3]; BYTE nFlags; WORD nSequence; BYTE nPart; BYTE nCount; } GND_HEADER;
Byte format:
0 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7| Byte +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | szTag | nFlags | 0-3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | nSequence | nPart | nCount | 4-7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The members of the structure are detailed below:
- szTag - contains a three byte encoding protocol identifier, in this case "GND" for "GNutella Datagram". If this signature is not present the packet should not be decoded as a Gnutella2 reliability layer transmission.
- nFlags - contains flags which modify the content of the packet. The low-order nibble is reserved for critical flags: if one of these bits is set but the decoding software does not understand the meaning, the packet must be discarded. The high-order nibble is reserved for non-critical flags: when set these bits may be interpreted, but an inability to interpret a bit does not cause the packet to be discarded. Currently defined flags are:
- 0x01 - Deflate
- When the deflate bit is set, the entire payload is compressed with the deflate algorithm. The compression method used is the Deflate Compression Data Format (RFC 1951). On top of this compression a ZLIB ‘wrapper’ is applied (RFC 1950, ZLIB Compressed Data Format). The ZLIB wrapper ensures packet integrity, among other things. Note that the entire payload must be reassembled in the correct order before it can be deflated if the packet was fragmented. Fragments are not compressed separately!
- 0x02 - Acknowledge Me
- When the acknowledge me bit is set, the sender is expecting an acknowledgement for this packet.
- nSequence - contains the sequence number of the packet. This sequence number is unique to the sending host only. It is not unique to the pair of the sending host and receiving host as in TCP, as there is no concept of connection state. Sequence numbers on consecutive packets need not be increasing (although that is convenient) – they must only be different. If a packet is fragmented, all of its fragments will have the same sequence number. Byte order is unimportant here.
- nPart - contains the fragment part number (1 <= nPart <= nCount)
- nCount - contains the number of fragment parts in this packet. On a transmission, this value will be non-zero (all packets must have at least one fragment). If nCount is zero, this is an acknowledgement (see below).
Fragmentation
Large packets must be fragmented before they can be sent through most network interfaces. Different network media have different MTUs, and it is difficult to predict what the lowest common size will be. Fragmentation and reassembly is performed by the existing Internet protocols, however, there are two important reasons why the reliability layer performs its own fragmentation:
- Sockets implementations specify a maximum datagram size. This is adequate for the vast majority of transmissions, but it is desirable to have the transparent ability to send larger packets without worrying about the host implementation.
- When the Internet protocols fragment, a packet and one or more fragments are lost, it may decide to discard the whole packet in an unreliable datagram protocol. The Gnutella2 reliability layer can compensate by retransmitting the whole packet, which would then be re-fragmented and each fragment resent - however, this wastes the fragments that were successfully received before. Managing fragmentation natively allows this optimisation.
Each node determines its own MTU, often based on a best guess combined with information from the host's sockets implementation. Packets exceeding this size are fragmented into multiple datagrams of the appropriate size. Each datagram has the same sequence number and the same fragment count (nCount), but a different fragment number (nPart).
Transmission Process
When a packet is to be transmitted, the network layer must:
- Cache the payload
- Allocate a new locally and temporally unique sequence number
- Derive the appropriate number of fragments
- Queue the fragments for dispatch
- If the fragments do not need to be acknowledged, the packet can be flushed now
The payload will generally be cached for an appropriate timeout period, or until the data cache becomes full, at which time older payloads can be discarded. Fragments are dispatched according to the dispatch algorithm of choice, and the sender listens for acknowledgements. When an acknowledgement is received:
- Lookup the sent packet by sequence number
- Mark the nPart fragment as received and cancel any retransmissions of this part
- If all fragments have been acknowledged, flush this packet from the cache
If a fragment has been transmitted but has not been acknowledged within the timeout, it should be retransmitted. A finite number of retransmissions are allowed before the packet as a whole expires, at which time it is assumed that the packet was not received.
Receive Process
When a new datagram is received, the network layer must:
- If the acknowledge bit was set, send an acknowledge packet for this sequence number and part number, with nCount set to zero (ack)
- Lookup any existing packet by the sending IP and sequence number
- If there is no existing packet, create a new packet entry with the IP, sequence number, fragment count and flags
- If there was an existing packet, make sure it is not marked as done - if so, abort
- Add the transmitted fragment to the (new or old) packet entry
- If the packet now has all fragments, mark it as done and decode it and pass it up to the application layer
- Leave the packet on the cache even if it was finished, in case any parts are retransmitted
- Expire old packets from the receive buffer after a timeout, or if the buffer is full
Dispatch Algorithm
Fragment datagrams need to be dispatched intelligently, to spread the load on network resources and maximise the chance that the receiver will get the message. To do this, the dispatch algorithm should take into account several points:
- Prioritize acknowledgements.
- If fragments are waiting to be sent to a number of hosts, do not send to the same host twice in a row. Alternating, or looping through the target hosts achieves the same data rate locally, but spreads out the load over downstream links.
- Do not exceed or approach the capacity of the local network connection. If a host has a 128 kb/s outbound bandwidth, dispatching 32 KB of data in one second will likely cause massive packet loss, leading to a retransmission.
- After considering the above points, prefer fragments that were queued recently to older packets. A LIFO or stack type approach means that even if a transmitter is becoming backed up, some fragments will get there on time, while others will be delayed. A FIFO approach would mean that a backed up host delivers every fragment late.
Parameters
The recommended parameters for the reliability layer are as follows:
MTU = 500 bytes Transmit Retransmit Interval = 10 seconds Transmit Packet Timeout / Expire = 26 seconds (allows for two retransmissions before expiry) Receive Packet Expiry = 30 seconds (allows 10 seconds beyond final retransmission)
The UDP/IP header being typically 28 bytes, it is best to limit the payload of the messages to 476 bytes. That way, with the 8-byte header we're topping plus the UDP/IP header, the whole message will not be greater than 512 bytes. The recommended MTU should therefore be 476 rather than the legacy 500. --RAM, 05:19, 23 September 2012 (PDT)
Performance Considerations
Relatively low-level network implementations such as this are reasonably complicated, but must operate fast. It is desirable to avoid runtime memory allocations in network code as much as possible, and particularly at this level.
It should be noted that in almost all cases, transmissions to "untested" nodes are single fragment. Replies on the other hand are often larger, and may be deflated in many fragments. This is optimal, because attempting to contact a node which may be unavailable, involves a retransmission of only a single fragment.
Flow control is an important topic, however, it is handled at a higher layer. The UDP reliability layer is only responsible for guaranteeing delivery of selected datagrams.
Only critical transmissions whose reception cannot otherwise be inferred, should have the acknowledge request bit set.
Extension Proposal
The following is still a proposoal, but since there is no G2 Development Forum where this can be discussed, I thought I would document it here. 05:07, 23 September 2012 (PDT)
This extension should be totally backward compatible with existing implementations provided the above page specifications were strictly adhered to. Here it goes:
Improved Acknowledgments
Because acknowledgment messages can be lost in the way or arrive out of order, it is best to include as much of the reception state as possible so that the sending party can optimize retransmissions.
In order to do that, the following extensions to the original specifications has been added by gtk-gnutella:
- Cumulative Acknowledgements: when the flag 0x10 is set, it tells the other party that all the fragments up to nPart have been received.
- Extended Acknowledgments: when the flag 0x20 is set, it tells the other party that an acknowledgment payload is present. It immediately follows the header and is architected thusly:
0 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7| Byte +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | nReceived | missingBits | 0-3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
nReceived is the total amount of parts successfully received so far.
missingBits is a bitfield that is read as a big-endian number and which contains set bits for the parts still missing. The base part number is normally 0, unless flag 0x10 was set, in which case the nPart value indicates the base. The rule is then that if bit b is set, then the fragment number b + base + 1 is still missing.
In the unlikely case where missingBits is too small to hold all the missing parts, only the ones that can be represented are included, the nReceived field being there to provide additional information.
Proper generation and parsing of the missingBits field is crucial, so to remove any ambiguity, it is best to interpret missingBits as a number. Then bit 0 is the bit corresponding to 2^0, bit n is the bit corresponding to 2^n. In the above framing pictogram, bit 0 is the 7th bit of the 3rd byte (because the number in missingBits is encoded in big-endian).
Bit 0 corresponds to fragment #1, unless the 0x10 flag was set. In that case, if for instance nPart is 3, then it means fragments #1, #2 and #3 were already received. The base is therefore 3, and if bit 0 is set in missingBits, it means fragment #4 (0 + 3 + 1) is still missing.
This extended acknowledgment lets the sending party optimize its retransmissions even when some acknowledgments are lost.
Extended Acknowledgments are only useful when the total amount of fragments is 3 or above. Indeed, with only 2 fragments, the Cumulative Acknowledgment lets the receiving party know about the whole reception state.
When the amount of fragments is 3 or more and only a Cumulative Acknowledgment is sent out, it implicitly denies reception of any other fragments. This optimizes bandwidth since the 4 extra bytes sent out will only be required for large messages (more than 2 fragments) in case fragments are received out-of-order.
Negotiation
Before the recipient can use this extension, it must be certain that the sender will understand its semantics. Indeed, imagine the case where a 3-fragment message is sent and completely received, and therefore the recipient only sends a Cumulative Acknowledge for fragment #3. A legacy sender would think fragments #1 and #2 are still missing.
To signify to the recipient that the sender will understand the new acknowledgments, it must set the flag 0x10 in all the fragments it sends. This new flag is therefore defined as:
- Improved Acknowledgements: when the flag 0x10 is set in emitted fragments, it tells the other party that it can reply with the Cumulative Acknowledgments or the Extended Acknowledgment bits in its acknowledgment messages.
When the Improved Acknowledgments flag is missing, the receiver must fall back to the legacy acknowledgment protocol. Since 0x10 is in the high-nibble part of the flag field, no legacy implementation will choke upon seeing it, and they can safely ignore it. This guarantees the sender that the message will be processed correctly by legacy implementations.
When the Improved Acknowledgment bit is seen, the receiver is naturally free to ignore it altogether (as legacy servents will do) or chose to reply only with Cumulative Acknowledgements and never send Extended Acknowledgments, for instance.
However, when the sender sets the Improved Acknowledgment bit, it must be prepared to receive both Cumulative Acknowledgments and Extended Acknowledgments. It is safe to ignore the extension part of the latter though, but the former must be fully handled.
Delayed Acknowledgments
To maximize the usefulness of Cumulative Acknowledgments, the receiver can delay its acknowledgments for a small period of time (100 ms, say) in the hope that more fragments of the message will arrive meanwhile, provided the message has more than 1 fragment naturally. When the 100 ms delay is expired, it can then acknowledge everything it got so far. Performance will not be impacted by this small delaying, but it can save a few acknowledgments for larger messages.
This means a sender must use a transmission policy whereby it first sends all the fragments of each message at once, then waits for acknowledgments or timeouts before retransmitting the um-acknowledged fragments. In other words, the sender should not send a fragment, wait for its acknowledgement before sending the next one: this would delay the reception of the entire message and it would prevent using the optimizations that this extension proposal wants to promote!
Author
This proposal was created by Raphael Manfredi, last modified on 01:03, 26 September 2012 (PDT).