| 
									
										
										
										
											2005-06-23 12:22:36 -07:00
										 |  |  | TCP protocol | 
					
						
							|  |  |  | ============ | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-02-17 22:21:04 -08:00
										 |  |  | Last updated: 9 February 2008 | 
					
						
							| 
									
										
										
										
											2005-06-23 12:22:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | Contents | 
					
						
							|  |  |  | ======== | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | - Congestion control | 
					
						
							|  |  |  | - How the new TCP output machine [nyi] works | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Congestion control | 
					
						
							|  |  |  | ================== | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The following variables are used in the tcp_sock for congestion control: | 
					
						
							|  |  |  | snd_cwnd		The size of the congestion window | 
					
						
							|  |  |  | snd_ssthresh		Slow start threshold. We are in slow start if | 
					
						
							|  |  |  | 			snd_cwnd is less than this. | 
					
						
							|  |  |  | snd_cwnd_cnt		A counter used to slow down the rate of increase | 
					
						
							|  |  |  | 			once we exceed slow start threshold. | 
					
						
							|  |  |  | snd_cwnd_clamp		This is the maximum size that snd_cwnd can grow to. | 
					
						
							|  |  |  | snd_cwnd_stamp		Timestamp for when congestion window last validated. | 
					
						
							|  |  |  | snd_cwnd_used		Used as a highwater mark for how much of the | 
					
						
							|  |  |  | 			congestion window is in use. It is used to adjust | 
					
						
							|  |  |  | 			snd_cwnd down when the link is limited by the | 
					
						
							|  |  |  | 			application rather than the network. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | As of 2.6.13, Linux supports pluggable congestion control algorithms. | 
					
						
							|  |  |  | A congestion control mechanism can be registered through functions in | 
					
						
							|  |  |  | tcp_cong.c. The functions used by the congestion control mechanism are | 
					
						
							|  |  |  | registered via passing a tcp_congestion_ops struct to | 
					
						
							|  |  |  | tcp_register_congestion_control. As a minimum name, ssthresh, | 
					
						
							| 
									
										
										
										
											2014-02-12 17:35:21 +04:00
										 |  |  | cong_avoid must be valid. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-06-23 12:22:36 -07:00
										 |  |  | Private data for a congestion control mechanism is stored in tp->ca_priv. | 
					
						
							|  |  |  | tcp_ca(tp) returns a pointer to this space.  This is preallocated space - it | 
					
						
							|  |  |  | is important to check the size of your private data will fit this space, or | 
					
						
							|  |  |  | alternatively space could be allocated elsewhere and a pointer to it could | 
					
						
							|  |  |  | be stored here. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | There are three kinds of congestion control algorithms currently: The | 
					
						
							|  |  |  | simplest ones are derived from TCP reno (highspeed, scalable) and just | 
					
						
							|  |  |  | provide an alternative the congestion window calculation. More complex | 
					
						
							|  |  |  | ones like BIC try to look at other events to provide better | 
					
						
							|  |  |  | heuristics.  There are also round trip time based algorithms like | 
					
						
							|  |  |  | Vegas and Westwood+. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Good TCP congestion control is a complex problem because the algorithm | 
					
						
							|  |  |  | needs to maintain fairness and performance. Please review current | 
					
						
							|  |  |  | research and RFC's before developing new modules. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The method that is used to determine which congestion control mechanism is | 
					
						
							|  |  |  | determined by the setting of the sysctl net.ipv4.tcp_congestion_control. | 
					
						
							|  |  |  | The default congestion control will be the last one registered (LIFO); | 
					
						
							| 
									
										
										
										
											2008-02-17 22:21:04 -08:00
										 |  |  | so if you built everything as modules, the default will be reno. If you | 
					
						
							|  |  |  | build with the defaults from Kconfig, then CUBIC will be builtin (not a | 
					
						
							|  |  |  | module) and it will end up the default. | 
					
						
							| 
									
										
										
										
											2005-06-23 12:22:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | If you really want a particular default value then you will need | 
					
						
							|  |  |  | to set it with the sysctl.  If you use a sysctl, the module will be autoloaded | 
					
						
							|  |  |  | if needed and you will get the expected protocol. If you ask for an | 
					
						
							|  |  |  | unknown congestion method, then the sysctl attempt will fail. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | If you remove a tcp congestion control module, then you will get the next | 
					
						
							| 
									
										
										
										
											2006-10-03 22:53:09 +02:00
										 |  |  | available one. Since reno cannot be built as a module, and cannot be | 
					
						
							| 
									
										
										
										
											2005-06-23 12:22:36 -07:00
										 |  |  | deleted, it will always be available. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | How the new TCP output machine [nyi] works. | 
					
						
							|  |  |  | =========================================== | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | Data is kept on a single queue. The skb->users flag tells us if the frame is | 
					
						
							|  |  |  | one that has been queued already. To add a frame we throw it on the end. Ack | 
					
						
							|  |  |  | walks down the list from the start. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | We keep a set of control flags | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	sk->tcp_pend_event | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		TCP_PEND_ACK			Ack needed | 
					
						
							|  |  |  | 		TCP_ACK_NOW			Needed now | 
					
						
							|  |  |  | 		TCP_WINDOW			Window update check | 
					
						
							|  |  |  | 		TCP_WINZERO			Zero probing | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	sk->transmit_queue		The transmission frame begin | 
					
						
							|  |  |  | 	sk->transmit_new		First new frame pointer | 
					
						
							|  |  |  | 	sk->transmit_end		Where to add frames | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	sk->tcp_last_tx_ack		Last ack seen | 
					
						
							|  |  |  | 	sk->tcp_dup_ack			Dup ack count for fast retransmit | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Frames are queued for output by tcp_write. We do our best to send the frames | 
					
						
							|  |  |  | off immediately if possible, but otherwise queue and compute the body | 
					
						
							|  |  |  | checksum in the copy.  | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | When a write is done we try to clear any pending events and piggy back them. | 
					
						
							|  |  |  | If the window is full we queue full sized frames. On the first timeout in | 
					
						
							|  |  |  | zero window we split this. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | On a timer we walk the retransmit list to send any retransmits, update the | 
					
						
							|  |  |  | backoff timers etc. A change of route table stamp causes a change of header | 
					
						
							|  |  |  | and recompute. We add any new tcp level headers and refinish the checksum | 
					
						
							|  |  |  | before sending.  | 
					
						
							|  |  |  | 
 |