The ONL NPR Tutorial



More Useful Code

Contents

    Introduction
    Plugin Chaining
    Dynamically Forwarding Meta-Packets
    Queueing Packets Inside A Plugin
    Generating Random Numbers
    Shaping Traffic
    Handling Control Messages

Introduction

This page describes some useful plugin code fragments. During the discussion, you will be introduced to some useful plugins that do queue management, packet scheduling and traffic shaping.


Plugin Chaining

You may need to chain together multiple plugins. For example, you might do some packet processing in one plugin and then pass the packet to the delay plugin. Most of the simple plugins (e.g., mycount) send meta-packets to the Queue Manager (QM). We have also seen two alternatives to sending to the QM: dropping a packet (nstats) and sending a meta-packet to the MUX block with a non-zero plugin tag so that it can be sent to the PLC (Parse, Lookup and Classify) block for reclassification (TOStag). The nstats plugin drops every meta-packet it receives by sending it to the Freelist Manager, which then frees up the space associated with the meta-packet (e.g., the DRAM packet buffer and the SRAM packet descriptor). The TOStag plugin sets the plugin tag field of each meta-packet to one more than the TOS field in the IP packet header before forwarding it to the MUX block. Another useful destination is a different plugin microengine.

void handle_pkt()
{
  dl_source_packet(dlFromBlock);	// get the next meta-packet from the input ring

  handle_pkt_user( );			// plugin-specific processing

  dl_sink_packet(dlNextBlock);		// send the meta-packet to the next block
}

As shown above, handle_pkt() in the TOStag plugin gets a packet from the input ring indicated by dlFromBlock, processes the packet with handle_pkt_user(), and then sends the meta-packet to the block indicated by dlNextBlock. Recall that the code at the end of handle_pkt_user() did the processing necessary to send the meta-packet to the MUX block. You can take a similar approach when sending a meta-packet to another plugin.

Forwarding a meta-packet to a plugin microengine involves two simple steps: set dlNextBlock to the input ring of the target plugin ME (one of PACKET_IN_RING_0 through PACKET_IN_RING_4), and build the outgoing 6-word meta-packet for that ring (done below by helper_set_meta_default()).

There are four details that you must consider when implementing these two steps; the code walked through below illustrates them.

This section describes one family of plugins where dlNextBlock never changes after it is set during initialization in plugin_init_user(). The next section describes the case when dlNextBlock can be changed dynamically through the control message interface.

[[ plugin-chain-resize.png Figure ]]

By convention, plugins with names ending in ++ (e.g., shaper++, delay++, erd++) have been written so that if they are loaded into plugin ME 4, they will forward all meta-packets to the QM; otherwise, they will forward to ME k+1 if they are loaded into ME k. Consider, for example, the following configuration in which PLC sends meta-packets to a traffic shaper plugin. From there, the meta-packets go to a delay plugin, then the Early Random Drop plugin, and then the Queue Manager (if not dropped by erd++).

ME   Plugin     Description         From   To
2    shaper++   Traffic shaper      PLC    ME 3
3    delay++    Delay               ME 2   ME 4
4    erd++      Early Random Drop   ME 3   QM

Three pieces of code support this plugin chaining paradigm: the initialization of dlNextBlock in plugin_init_user(), the use of dlNextBlock in handle_pkt_user(), and the helper function helper_set_meta_default(), which writes the meta-packet into the input ring of the next block.

Note that dlNextBlock is a write-once variable that is initialized by every thread context in plugin_init_user() and never modified thereafter. The value of dlNextBlock is then used by handle_pkt_user() to steer meta-packets to the correct IXP processing block.
    void plugin_init_user() {
1	if(ctx() == 0) {	... initialization for thread 0 ...	}
2
3	if( pluginId == 0 )		dlNextBlock = PACKET_IN_RING_1;
4	else if( pluginId == 1 )	dlNextBlock = PACKET_IN_RING_2;
5	else if( pluginId == 2 )	dlNextBlock = PACKET_IN_RING_3;
6	else if( pluginId == 3 )	dlNextBlock = PACKET_IN_RING_4;
7	else				dlNextBlock = QM;
    }

The plugin_init_user() code in lines 3-7 for the erd++ plugin (shown above) initializes dlNextBlock based on which ME the plugin is running on. This initialization code appears in all chained plugins (i.e., the ones with names ending in ++). Note that lines 3-7 are executed by all thread contexts. The declaration of the variable dlNextBlock is in the global declaration area (outside the scope of any function):

__declspec(gp_reg) int dlNextBlock;	// declared in global area

Because dlNextBlock is declared outside the scope of a function and is not declared to be shared, each thread context will have its own copy.
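
To make the distinction concrete, here is a minimal illustration (not taken from the tutorial code) contrasting the two kinds of declarations; both forms appear in the plugins discussed on this page:

	__declspec(gp_reg) int dlNextBlock;		// per-context: one copy per thread
	__declspec(shared gp_reg) int sharedNextBlock;	// shared: one copy for all threads on the ME

The shared form is exactly what the setNxtBlk plugin uses later on this page, when several threads must agree on the next block.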

The erd++ plugin probabilistically drops a packet if its destination queue is one whose length should be managed and that queue's length is larger than a threshold. If it determines that the packet should be dropped (code not shown), it sets the variable droppkt to 1. Otherwise, it sets droppkt to 0.

    void handle_pkt_user( )  {
1	int			droppkt;
2
3	... Set droppkt to 1 if the packet should be dropped ...
4
5	if( droppkt ) {
7	    onl_api_plugin_cntr_inc(pluginId, DROP_COUNT);
8	    ++ndrops;
9	    if ( helper_set_meta_default( DROP ) != 0 ) {
10		helper_set_errno( BAD_NXTBLK );
11	    }
12	} else {
13	    if ( helper_set_meta_default( dlNextBlock ) != 0 ) {
14		helper_set_errno( BAD_NXTBLK );
15	    }
16	}
    }

Lines 5-16 set the fields in the outgoing meta-packet based on the value of droppkt. If the packet should be dropped, line 9 calls helper_set_meta_default() with the argument DROP to initialize the fields so that the meta-packet will be sent to the Freelist Manager. Otherwise, line 13 calls helper_set_meta_default() with the argument dlNextBlock so that the meta-packet will be sent to the next plugin in the chain or to the Queue Manager.

Finally, we come to helper_set_meta_default(), the function that actually puts the meta-packet into the input ring of the next packet processing block. Below is its control structure:

    static __forceinline int
    helper_set_meta_default( __declspec(gp_reg) int nextBlk ) {
1	__declspec(gp_reg) int	out_port;
2
3	dlNextBlock = nextBlk;
4
5	if( nextBlk == QM ) {
6		... insert meta-packet into Queue Manager's input ring
7	} else if( nextBlk == DROP ) {
8		... insert meta-packet into Freelist Manager's input ring
9	} else if( nextBlk == MUX ) {
10		... insert meta-packet into MUX's input ring
11	} else if( (nextBlk == PACKET_IN_RING_0)	||
12		   (nextBlk == PACKET_IN_RING_1)	||
13		   (nextBlk == PACKET_IN_RING_2)	||
14		   (nextBlk == PACKET_IN_RING_3)	||
15		   (nextBlk == PACKET_IN_RING_4)		) {
16		... insert meta-packet into plugin ME's input ring
17	} else if( nextBlk == DO_NOTHING ) {	// do nothing
19	} else {				// all other options
20		return -1;	// error
21	}
22	return 0;
    }

The control structure (and processing) is this complicated because the format of the meta-packet depends on the destination IXP block. For example, the Queue Manager accepts a 12-byte (3-word) meta-packet, but a plugin accepts a 24-byte (6-word) meta-packet. This difference is obvious when we look at the expansions of lines 6 and 16.

Line 6 is the code for sending the meta-packet to the Queue Manager.

	...
5	if( nextBlk == QM ) {
6.1	    __declspec(gp_reg) int	out_port;
6.2	    out_port = (ring_in.uc_mc_bits >> 3) & 0x7;
6.3	    onl_api_update_ring_out_to_qm(
6.4			ring_in.buf_handle_lo24, 
6.5			out_port,
6.6			(((out_port+1) << 13) | ring_in.qid), 
6.7			ring_in.l3_pkt_len);
7	} else if( nextBlk == DROP ) {
	...

[[ send-to-QM-resize.png Figure ]]

The figure above shows that the 3-word outgoing meta-packet to be sent to the QM is constructed from the incoming 6-word meta-packet. Four fields are passed to onl_api_update_ring_out_to_qm(): the low 24 bits of the buffer handle (buf_handle_lo24), the output port (extracted from the uc_mc_bits field), the raw qid (formed from the output port and the qid field), and the L3 packet length (l3_pkt_len).
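
For example, a packet destined for queue 64 on output port 4 would be given the raw qid ((4+1) << 13) | 64 = 41024; this same encoding appears again later in the priq plugin's callback() function.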

Line 16 is the code for sending the meta-packet to another plugin. Because the meta-packet to be sent to another plugin is 6 words instead of 3, we use a different function for sending to the next plugin in the plugin chain.

	...
11	} else if( (nextBlk == PACKET_IN_RING_0)	||
12		   (nextBlk == PACKET_IN_RING_1)	||
13		   (nextBlk == PACKET_IN_RING_2)	||
14		   (nextBlk == PACKET_IN_RING_3)	||
15		   (nextBlk == PACKET_IN_RING_4)		) {
16.1	onl_api_update_ring_out_to_plugin(
16.2			ring_in.buf_handle_lo24, 
16.3			(ring_in.uc_mc_bits >> 3) & 0x7,
16.5			ring_in.in_port,
16.6			ring_in.plugin_tag,
16.7			ring_in.stats_index, 
16.8			0,
16.9			ring_in.qid, 
16.10			ring_in.nh_eth_daddr_hi32,
16.11			ring_in.nh_eth_daddr_lo16,
16.12			ring_in.eth_type,
16.13			ring_in.uc_mc_bits,
16.14			ring_in.l3_pkt_len);
17	} else if( nextBlk == DO_NOTHING ) {
	...

The function onl_api_update_ring_out_to_plugin() fills in the 12 fields in the 6-word output ring buffer by copying them from the input ring buffer.

The concept of a plugin chain or pipeline is a useful packet processing paradigm. This section has shown that it is fairly easy to provide a standard approach to implementation. Although the end of this section discussed the details behind the implementation, the user interface is straightforward, and the functionality can be included by copying the code fragments cited above.


Dynamically Forwarding Meta-Packets

The previous section described the plugin chain concept. Its implementation used dlNextBlock as a write-once variable. It is not much harder to use dlNextBlock in a more dynamic way, where its value can change many times. There is one technical difficulty that must be addressed:

Each thread has its own dlNextBlock variable that indicates the next IXP processing block. In some applications, the different instances need not be kept consistent. But in applications that require consistency, the instances of dlNextBlock need to be updated whenever one instance changes.
This consistency can be maintained by storing the value in a shared variable. In the setNxtBlk plugin example below, the official value of the next processing block is stored in the shared variable sharedNextBlock.

The setNxtBlk plugin is a simple example of how to dynamically set the value of dlNextBlock. For example, it is possible to load the setNxtBlk plugin onto microengine 0 and the mycount plugin onto microengine 4 and configure setNxtBlk to send meta-packets to the mycount plugin on ME 4.

By default, the setNxtBlk plugin sends meta-packets to the Queue Manager. The user can change this behavior by sending it the "next=" control message. For example, to send meta-packets to plugin microengine 4, the user would enter the control message "next= PLUGIN4" (note the space character after the = character). The setNxtBlk plugin recognizes the following values for the next block:

Plugin Command     Next Block      Internal Constant
"next= PLUGIN0"    Plugin ME 0     PACKET_IN_RING_0
"next= PLUGIN1"    Plugin ME 1     PACKET_IN_RING_1
"next= PLUGIN2"    Plugin ME 2     PACKET_IN_RING_2
"next= PLUGIN3"    Plugin ME 3     PACKET_IN_RING_3
"next= PLUGIN4"    Plugin ME 4     PACKET_IN_RING_4
"next= MUX"        MUX             MUX
"next= DROP"       Drop Packet     DROP
"next= QM"         Queue Manager   QM
(otherwise)        Queue Manager   QM

Two functions in the setNxtBlk plugin contain the necessary code to allow dlNextBlock to be changed: handle_pkt_user() and handle_msg(). We discuss only the part of handle_msg() which is unique to this plugin.

    void handle_msg() {
1	__declspec(local_mem) char inmsgstr[28];	// inbound
2	__declspec(local_mem) char outmsgstr[28];	// outbound
3	__declspec(sram) char sram_inmsgstr[28];
4	__declspec(sram) char sram_outmsgstr[28];
5	...
6	char SET_next[8]	= "next=";
7	...
8	char BAD_OP_msg[8]	= "BAD OP";
9	char NEED_ARG_msg[12]= "NEED ARG";
10	...
11	if( ... ) {
12	...
13	} else if( strncmp_sram(sram_inmsgstr, SET_next, 5) == 0 ) {
14    	    __declspec(sram) char *valptr;
15    	    valptr = helper_nxt_token( sram_inmsgstr, 28 ); 
16    	    if( valptr == 0 ) {
17		memcpy_lmem_sram( outmsgstr, NEED_ARG_msg, 12 );
18    	    } else {
19		sharedNextBlock = str2dlNextBlock( valptr );
20		dlNextBlock = sharedNextBlock;
21		memcpy_lmem_sram( outmsgstr, valptr, 21 );
22    	    }
23    	}
    }

The user sets the value of dlNextBlock by entering a control message such as "next= PLUGIN4". The function str2dlNextBlock() (line 19) translates the external string value ("PLUGIN4" in this example) entered by the user into the corresponding internal constant (PACKET_IN_RING_4, which has value 4), and its return value is used to set the shared variable sharedNextBlock. Note that although line 20 sets dlNextBlock to the new value sharedNextBlock, handle_msg() runs in the control message handling thread, which is different from the packet handling thread(s). Thus, it is necessary to save this new value in the shared variable sharedNextBlock so that the value of dlNextBlock in the packet handling thread(s) can be updated to this new value.
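
The tutorial does not show str2dlNextBlock() itself. Below is a minimal, hypothetical sketch of what it might look like, assuming strncmp_sram() can compare an SRAM string against a string constant as in the handle_msg() fragment above; the actual setNxtBlk source may differ.

	// Hypothetical sketch -- not the actual setNxtBlk source.
	static __forceinline int
	str2dlNextBlock( __declspec(sram) char *word ) {
	    if( strncmp_sram(word, "PLUGIN0", 7) == 0 )	return PACKET_IN_RING_0;
	    if( strncmp_sram(word, "PLUGIN1", 7) == 0 )	return PACKET_IN_RING_1;
	    if( strncmp_sram(word, "PLUGIN2", 7) == 0 )	return PACKET_IN_RING_2;
	    if( strncmp_sram(word, "PLUGIN3", 7) == 0 )	return PACKET_IN_RING_3;
	    if( strncmp_sram(word, "PLUGIN4", 7) == 0 )	return PACKET_IN_RING_4;
	    if( strncmp_sram(word, "MUX", 3) == 0 )	return MUX;
	    if( strncmp_sram(word, "DROP", 4) == 0 )	return DROP;
	    return QM;			// "QM" and anything unrecognized
	}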

1   volatile __declspec(shared gp_reg) int sharedNextBlock;
2   ...
3   void handle_pkt_user() {
4	++npkts;
5	onl_api_plugin_cntr_inc(pluginId, 0);	// Incr global plugin cntr 0
6
7	dlNextBlock = sharedNextBlock;
8	helper_set_meta_default( dlNextBlock );
9	if( dlNextBlock == MUX )	helper_inc_meta_mux_tag( );
10  }

Line 7 is where the packet handling thread(s) update the value of dlNextBlock. Otherwise, handle_pkt_user() is almost identical to the one in the TOStag plugin. It uses the helper_set_meta_default() function (line 8) to set the outgoing meta-packet fields based on the value of dlNextBlock and the meta-packet fields in the input ring buffer. Line 9 increments the plugin tag field if the meta-packet is going to the MUX block, allowing the user to install a filter that matches this plugin tag value.

Queueing Packets Inside A Plugin

We saw earlier that the delay plugin used a queue that was internal to the plugin to store meta-packets until their delay time had expired. The tutorial page Tour_Of_The_Delay_Plugin describes the functions that implement a FIFO queue. This section discusses how other plugins have used that code base to implement their own versions of FIFO queues:

Plugin     Description         Queue Usage                                         Special Queue Feature(s)
delay++    Delay packets       Packets delayed by fixed amount                     Support plugin chaining
shaper++   Shape traffic       Packets delayed to conform to traffic descriptor    Delay varies to meet traffic descriptor
priq       Priority queueing   Hold medium and low-priority packets                Two queues with one common free list

The queueing code used in these plugins was copied from the original delay plugin and customized to varying degrees to fit each plugin's special requirements. After reviewing the delay plugin, we describe the changes to the queue management code required for each of these plugins. You will see that the delay++ plugin required only a change in the format of the queue item structure, while the priq plugin required rewriting the free space management routines entirely.

The delay Plugin

Recall that the delay queue was a list of items where each item consisted of time (the time when the meta-packet should be forwarded to the Queue Manager), four fields from the incoming meta-packet (buf_handle, out_port, qid, l3_pkt_len) and next (the address of the next item on the list):

1	struct delay_item_tag {
2	    union tm_tag	time;		// time to leave
3	    unsigned int	buf_handle;	// meta-packet
4	    unsigned int	out_port;	// .
5	    unsigned int	qid;		// .
6	    unsigned int	l3_pkt_len;	// .
7	    struct delay_item_tag *next;
8	};
9	struct delayq_tag {
10	    unsigned long		ninq;	// # in delay queue
11	    struct delay_item_tag	*hd;	// head ptr
12	    struct delay_item_tag	*tl;	// tail ptr
13	    struct delay_item_tag	*free_hd;	// free list
14	};
15	
16	#define MAX_QUEUE_SZ	35000		// max #items in queue
17	__declspec(shared, sram) struct delayq_tag	delayq;	// queue descriptor

Access to the queue and the free space is provided by the queue descriptor delayq (line 17), which is the structure defined in lines 9-14. Lines 1-8 define the structure of an item on the queue, which is 28 bytes (each integer and pointer is 4 bytes and a time is 8 bytes). Since each plugin has access to its own 1 MB of SRAM, we can have over 37,000 items in the queue. Line 16 defines the number of items on the initial free list to be 35,000, which is well below the 37,000 items provided by 1 MB.
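
The queue manipulation functions themselves are covered on the Tour_Of_The_Delay_Plugin page. As a reminder of the basic mechanics, here is a minimal, hypothetical sketch (not the tutorial's actual routines, and omitting any locking between thread contexts) of allocating an item from the free list and appending it to the tail of the FIFO:

	// Hypothetical sketch of the free-list and FIFO mechanics.
	static __forceinline struct delay_item_tag *
	delayq_alloc( void ) {
	    __declspec(sram) struct delay_item_tag	*item;
	    item = delayq.free_hd;
	    if( item != 0 )	delayq.free_hd = item->next;	// pop free list
	    return item;			// 0 ==> out of free space
	}

	static __forceinline void
	delayq_enq( __declspec(sram) struct delay_item_tag *item ) {
	    item->next = 0;
	    if( delayq.tl != 0 )	delayq.tl->next = item;	// append after tail
	    else			delayq.hd = item;	// queue was empty
	    delayq.tl = item;
	    ++delayq.ninq;
	}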

The delay++ Plugin

The delay++ plugin is just the chained-plugin version of the delay plugin. Because the plugin may need to forward meta-packets to another plugin rather than just to the Queue Manager, it must store the entire incoming meta-packet and not just the fields needed by the Queue Manager.

1	// sizeof(struct item_tag) = 36 ==> 29,127 items in 1 MB
2	struct item_tag {
3	    union tm_tag	tdepart;	// time for pkt to leave
4	    plugin_out_data	metapkt;
5	    struct item_tag	*next;
6	};
7
8	struct queue_tag {
9	    unsigned long	npkts;		// #pkts in queue
10	    unsigned long	nbytes;		// #bytes in queue
11	    unsigned long	maxinq;		// max #pkts in queue
12	    unsigned long	ndrops;		// #overflows from queue
13	    unsigned long	nerrs;		// #errors other than drops
14	    struct item_tag	*hd;		// head ptr
15	    struct item_tag	*tl;		// tail ptr
16	    struct item_tag *free_hd;		// free list
17	};
18
19	#define	MAX_QUEUE_SZ	29000
20	__declspec(shared, sram) struct queue_tag queue;  // queue descriptor

Lines 2-6 define the structure of an item used in the delay++ plugin. Since an entire meta-packet (line 4) is six words (24 bytes), an item is now 36 bytes instead of 28 bytes. The effect of this change is a reduction in the maximum number of items to the 29,000 shown in line 19. The additions to the queue descriptor in lines 9-13 are extensions to the statistics collected by this version of the queue management routines. Of course, there are additional lines of code that initialize and maintain these variables. No other major changes were made to the queueing code.

The shaper++ Plugin

The shaper++ plugin is a traffic shaper that implements traffic shaping using a token bucket; that is, the output of a traffic shaper with burst size B and rate R has a long-term average rate of R, with an initial burst of up to B bytes allowed after a sufficiently long idle period. It is similar to the delay++ plugin except that the callback() function adds tokens at rate R and forwards the first meta-packet in the queue only when there are enough tokens in the token bucket. Thus, it delays meta-packets by a variable amount instead of the fixed amount used in the delay plugin.

1	// sizeof(struct item_tag) = 32 ==> 32,768 items in 1 MB
2	struct item_tag {
3	    plugin_out_data	metapkt;
4	    unsigned int	iplen;
5	    struct item_tag	*next;
6	};
7	...
8	#define	MAX_QUEUE_SZ	32000

The only change is that an item contains the length of the IP datagram (iplen) instead of the departure time; since a time is 8 bytes and an integer is 4, an item is four bytes smaller.

The priq Plugin

The priq plugin implements priority queueing that handles three traffic priorities: high, medium and low. To do this, it maintains two internal queues: one for medium priority packets and one for low priority packets. All high-priority packets get sent immediately to queue 64 at the output port. Medium and low priority packets get forwarded in priority order and only if queue 64 is empty. It uses the same queueing structure as the delay plugin except that: items carry no departure time; there are two queue descriptors instead of one; and the free list pointer is shared by both queues and kept outside the queue descriptors.

1	// sizeof(struct item_tag) = 20 ==> 52,428 items in 1 MB
2	struct item_tag {
3	    unsigned int	buf_handle;	// meta-packet
4	    unsigned int	out_port;	// .
5	    unsigned int	qid;		// .
6	    unsigned int	l3_pkt_len;	// .
7	    struct item_tag	*next;
8	};
9
10	#define N		2
11	#define	MAX_QUEUE_SZ	40000
12	__declspec(shared sram) struct queue_tag queue[N];	// descriptor
13	__declspec(sram) struct item_tag * __declspec(shared sram) free_hd;
		// shared free-list pointer: the pointer itself resides in SRAM and is shared; it points to items in SRAM

There are now two queue descriptors (line 12) instead of one: one for medium priority packets and one for low priority packets. Furthermore, the free list pointer has been separated out from the queue descriptor (line 13). This required rewriting the queue management routines (not shown): queue_init(), queue_enq(), queue_pop(), queue_alloc() and queue_free().

The declaration of the free list pointer in line 13 looks strange. Here is where the heterogeneous memory hierarchy rears its head! There is a big semantic difference between the declaration in line 13 and these two declarations:

(A)	__declspec(shared sram) struct item_tag * free_hd;
(B)	__declspec(shared sram) struct item_tag * __declspec(sram) free_hd;

The declaration labeled (A) is equivalent to the one labeled (B), since the Intel C compiler assumes that a variable is in SRAM unless told otherwise. The memory type specification (i.e., __declspec()) is a modifier of the type immediately to its right. Thus, the leftmost part of (B) (__declspec(shared sram) struct item_tag *) says that free_hd is a pointer to a struct item_tag that is stored in SRAM and is shared by multiple threads; that is, the item pointed to is shared (not necessarily the pointer). The rightmost part of (B) (__declspec(sram)) says that free_hd itself (the pointer) is stored in SRAM. But because there is no shared attribute there, each thread context gets its own instance of free_hd, which is NOT what we want.

We want to say that there is only ONE copy of the pointer which is shared among all of the thread contexts. So, the rightmost memory modifier should have been written as __declspec(shared sram) because the pointer itself is shared. Furthermore, since the shared modifier in the leftmost part of line (B) is irrelevant, we can omit it. The result of these changes is line 13.
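
If it helps, this placement rule parallels the way the const qualifier binds in standard C:

	const char *p;		// the characters pointed to are const; p itself may change
	char * const q;		// q itself is const; the characters may change

In the same way, the __declspec() to the left of the * describes the pointed-to object, while the one to the right of the * describes the pointer variable itself.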

Below is the handle_pkt_user() function, which enqueues packets on the proper internal queue. The three flow priorities are GOLD_FLOW (high priority), SILVER_FLOW (medium priority) and BRONZE_FLOW (low priority). The priq plugin forwards all packets to queue RESQ (= 64) on port 4, in priority order. The callback() thread forwards medium and low priority packets, while the handle_pkt() thread immediately forwards high priority packets.

    void handle_pkt_user( )  {
1	... Compute qid and out_port ...
2	... Update counters ...
3	if( qid == GOLD_FLOW ) {		// high priority
4	    helper_set_meta_default( QM );
5	    helper_set_meta_qid( out_port, RESQ );
6	} else {
7	    rawqid = FormRawQid( out_port, RESQ );
8	    if( qid == SILVER_FLOW ) {		// medium priority
9		ninq = queue_enq( &queue[SILVERQ],
10					ring_in.buf_handle_lo24,
11					out_port,
12					rawqid,
13					ring_in.l3_pkt_len );
14	    } else {				// low priority
15		ninq = queue_enq( &queue[BRONZEQ],
16					ring_in.buf_handle_lo24,
17					out_port,
18					rawqid,
19					ring_in.l3_pkt_len );
20	    }
21	    if( ninq == -1) {			// out of free space
22		helper_set_out_to_DROP( );
23	    } else {				// OK
24		helper_set_out_to_DO_NOTHING( );
25	    }
26	}
    }

The callback() thread forwards the highest priority packet from the medium and low priority queues when queue 64 at the output port is empty. If queue 64 is not empty, it just sleeps for about 10 usec. (The constant 41024 passed to onl_api_getQueueParams() below is the raw qid for port 4, queue 64, using the ((out_port+1) << 13) | qid encoding seen earlier: 5*8192 + 64 = 41024.)

    void callback() {
1	__declspec( gp_reg ) onl_api_qparams	qparams;
2
3	onl_api_getQueueParams( 41024, &qparams );	// 5*8192+64
4
5	if( qparams.length == 0 ) {		// empty high-priority queue
6	    if( queue[SILVERQ].npkts > 0 ) {
7		helper_send_from_queue_to_QM( &queue[SILVERQ] );
8	    } else if( queue[BRONZEQ].npkts > 0 ) {
9		helper_send_from_queue_to_QM( &queue[BRONZEQ] );
10	    }
11	}
12	sleep( SLEEP_CYCLES );
    }

Generating Random Numbers

The erd++ plugin implements an early random drop algorithm which probabilistically drops packets once a queue gets above a given threshold. The global variables used in the algorithm are shown below:

#define N 4					// one entry per target queue (64-67)
__declspec(shared gp_reg) unsigned int qlen[N]; // queue length (bytes)
__declspec(shared gp_reg) unsigned int qthresh[N]; // ERD threshold; i.e.,
					// when to start dropping (bytes)
__declspec(shared gp_reg) unsigned int dropmask[N]; // (2^K - 1)

The callback() thread reads the length of one of the target queues (64-67) of a specified port and stores the values in the shared array qlen[N]. It continuously cycles through these four queues so that each queue length has been read once after about 40 usec have elapsed. The indices of the three arrays above correspond to queues 64-67 respectively.

The array qthresh[N] contains the drop thresholds for the four queues; its default values are computed by plugin_init_user(). Currently, each default drop threshold is set to one-fourth of the queue's threshold from its RLI setting, but the values can be changed through the control message interface.

The array dropmask[N] is used to select the bits from random integers that are used to decide whether to drop packets from overpopulated queues. Currently, the default value is 0x3f, but it can be changed through the control message interface. It is used in the drop() function below, which returns 1 if a packet should be dropped and 0 otherwise.

    static __forceinline int
    drop(	__declspec(gp_reg) unsigned int qlen,
		__declspec(gp_reg) unsigned int qthresh,
		__declspec(gp_reg) unsigned int dropmask ) {
1	int			randint;
2	__declspec(gp_reg)	int dropit;
3
4	if( qlen < qthresh )	return 0;
5
6	randint = rand();
7	randint = randint & dropmask;	// rightmost K bits
8
9	if( randint == 0 )	dropit = 1;
10	else			dropit = 0;
11
12	return dropit;
    }

At the heart of the drop() function is the library function rand() in line 6, which returns a pseudo-random unsigned integer between 0 and RAND_MAX = 32767. The current default value of all dropmask[] entries is 0x3f, which, when used with the bitwise AND operator as in line 7, selects the rightmost six bits of the random integer and stores the result back in randint. Lines 9 and 10 determine the return value of the drop() function: if the value of randint is 0, it returns 1 (i.e., drop the packet); otherwise it returns 0. The effect in the long run is to drop packets with probability 1/64 whenever the queue is overpopulated.

The approach taken by drop() assumes that the denominator of the drop probability is an integer power of 2. Below is an approach for returning a 1 with probability m/n where m and n are arbitrary integers with m no more than n:

1	randint = rand();
2	randint = randint % n;
3	if( randint < m )	dropit = 1;
4	else			dropit = 0;
Lines 1-2 compute a random integer between 0 and n-1, so the probability that the result is between 0 and m-1 inclusive is m/n. (Strictly speaking, the modulo operation introduces a slight bias whenever n does not evenly divide RAND_MAX+1, but for small n the bias is negligible.)

Below is the handle_pkt_user() function that calls the drop() function and then either drops the meta-packet or forwards it to the next block.

    void handle_pkt_user( )  {
1	__declspec(gp_reg) unsigned int qid;
2	int			droppkt;
3
4	qid = ring_in.qid & 0x1fff;			// external qid
5	if( qid == 64 ) {
6		droppkt = drop( qlen[0], qthresh[0], dropmask[0] );
7	} else if( qid == 65 ) {
8		droppkt = drop( qlen[1], qthresh[1], dropmask[1] );
9	} else if( qid == 66 ) {
10		droppkt = drop( qlen[2], qthresh[2], dropmask[2] );
11	} else if( qid == 67 ) {
12		droppkt = drop( qlen[3], qthresh[3], dropmask[3] );
13	} else {
14		droppkt = 0;	// forward pkt untouched
15	}
16
17	if( droppkt ) {
18	    if ( helper_set_meta_default( DROP ) != 0 )		... error ...
19	} else {
20	    if ( helper_set_meta_default( dlNextBlock ) != 0 )	... error ...
21	}
    }

If you look at the sequence of random integers produced by rand(), you might be surprised to see that the identical sequence is produced each time you reload the plugin. That is because the numbers are produced by an algorithm; the integers are not truly random but pseudo-random, i.e., they pass some statistical tests indicating that they appear random. If you want to generate a different sequence, you can initialize the pseudo-random number generator with a seed value using the srand() library function:

	int	seed = 4321;
	...
	srand( seed );

The erd++ plugin allows the user to set the seed through the "seed=" control message. Those interested in its implementation can look at the handle_msg() function in the erd++.c source code.
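
A hypothetical sketch of that branch, patterned after the "next=" handler shown earlier (see erd++.c for the real code):

	char SET_seed[8]	= "seed=";
	...
	} else if( strncmp_sram(sram_inmsgstr, SET_seed, 5) == 0 ) {
	    __declspec(sram) char *valptr;
	    valptr = helper_nxt_token( sram_inmsgstr, 28 );
	    if( valptr == 0 ) {
		memcpy_lmem_sram( outmsgstr, NEED_ARG_msg, 12 );
	    } else {
		srand( helper_atou_sram( valptr ) );	// re-seed the generator
		memcpy_lmem_sram( outmsgstr, valptr, 21 );
	    }
	}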


Shaping Traffic

The shaper++ plugin shapes traffic using a token bucket: it maintains a constant output rate of rate_Kbps Kbps when sufficiently backlogged, but allows a burst of at most bucketsz bytes after it has been idle long enough. It ensures that the following bound is maintained during any period when the queue is backlogged:

Number of bits transmitted in an interval of t seconds <= 1000*rate_Kbps*t + 8*bucketsz

The handle_pkt_user() just queues incoming packets, and the callback() thread maintains the bound above. The callback() thread adds tokens to the variable token_cnt at a rate of rate_Kbps Kbps such that:

token_cnt = min{ token_cnt + rate_Kbps*(tnow - told), bucketsz }

It forwards the next meta-packet when token_cnt is at least equal to the IP datagram length, and it then decrements token_cnt by the datagram length. Initially, token_cnt is equal to bucketsz, so a meta-packet arriving at an idle shaper will be forwarded immediately, since bucketsz is usually chosen to be at least as large as a maximum-sized packet.

The intellectual center of the shaper++ plugin is the callback() thread. It adds tokens to the token counter, dequeues and forwards packets as long as there are enough tokens, and decreases the token counter according to the packets that it forwards. A single token was chosen to be equal to 0.0001 bits so that rates between 1 Kbps and 1 Gbps could be accurately supported with a sleep interval of 10 usec. This choice means that there are 80,000 tokens per byte.
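
To check that the token arithmetic in line 9 below is consistent with this choice, consider rate_Kbps = 1000 and a callback interval of 10 usec (tdiff_nsec = 10,000). Each callback then adds (10,000 * 1000)/100 = 100,000 tokens, which is 100,000/80,000 = 1.25 bytes' worth; at 100,000 callbacks per second this is 125,000 bytes/sec = 1,000,000 bits/sec = 1000 Kbps, as desired.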

    __declspec(shared gp_reg) union tm_tag told;// last callback time
    ...
    #define TOKENS_PER_BYTE	80000		// 1 token = 0.0001 bits
    void callback() {
1	__declspec(gp_reg) unsigned int	pktlen_tokens;
2	union tm_tag	tnow;
3	long long	tdiff_nsec;
4	int		rc;
5	    // update token counter
6	tnow.tm2.lo = local_csr_read( local_csr_timestamp_low );
7	tnow.tm2.hi = local_csr_read( local_csr_timestamp_high );
8	tdiff_nsec = diff_nsec( tnow.tm, told.tm );
9	token_cnt = token_cnt + (tdiff_nsec*rate_Kbps)/100;
10	if( token_cnt > TOKENS_PER_BYTE*bucketsz )
11		{ token_cnt = TOKENS_PER_BYTE*bucketsz; }
12	told.tm = tnow.tm;
13	    // forward packets as long as there are enough tokens
14	while( queue.npkts > 0 ) {
15	    pktlen_tokens = TOKENS_PER_BYTE*queue.hd->iplen;
16	    if( token_cnt >= pktlen_tokens ) {	// fwd first pkt
17		rc = helper_send_from_queue( &queue, dlNextBlock );
18		token_cnt -= pktlen_tokens;
19		if( rc != 0 )	... error ...
20	    } else	break;
21	}
22
23	sleep( SLEEP_CYCLES );
    }

Handling Control Messages

Two useful functions for processing control messages are helper_count_words() and helper_tokenize(). As their names imply, helper_count_words() returns the number of whitespace-separated words in a message string, and helper_tokenize() locates the next word and terminates it with a NUL byte, returning a pointer to it.
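
Neither helper's implementation is shown in the tutorial. A minimal, hypothetical sketch of helper_count_words() matching the behavior described above might be:

	// Hypothetical sketch -- counts whitespace-separated words
	// in a NUL-terminated SRAM string.
	static __forceinline unsigned int
	helper_count_words( __declspec(sram) char *s ) {
	    unsigned int	nwords;
	    int			inword;
	    nwords = 0;
	    inword = 0;
	    for( ; *s != '\0'; ++s ) {
		if( *s == ' ' || *s == '\t' )	inword = 0;
		else if( !inword )		{ inword = 1; ++nwords; }
	    }
	    return nwords;
	}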

The code fragment below from the handle_msg() function of the shaper++ code demonstrates how these two functions are used in processing a control message such as "params= 1000 3000", which sets the token bucket average rate to 1000 Kbps and the bucket size to 3000 bytes. The basic idea is to use helper_count_words() to check that an operation has enough arguments, and to use helper_tokenize() to find the next word and terminate it with a NUL byte so that conversion functions such as helper_atou_sram() can be applied to the word.

    void handle_msg() {
	... other declarations ...
	__declspec(local_mem) char	outmsgstr[28];
	__declspec(sram) char		sram_inmsgstr[28];
	char SET_params[8]  = "params=";
	... other operations ...
1	} else if( strncmp_sram(sram_inmsgstr, SET_params, 7) == 0 ) {
2	    char	*cmnd_word;	// points to input command field
3	    char	*rate_word;	// points to input rate(Kbps) field
4	    char	*bucketsz_word;	// points to input bucketsz(bytes) field
5	    unsigned int nwords;
6
7	    nwords = helper_count_words( sram_inmsgstr );
8	    if( nwords != 3 ) {
9		memcpy_lmem_sram( outmsgstr, NEED_ARG_msg, 12 );
10	    } else {
11		cmnd_word = helper_tokenize( sram_inmsgstr );	// get command
12		rate_word = helper_tokenize( cmnd_word+strlen(cmnd_word)+1 );
13		bucketsz_word = helper_tokenize( rate_word+strlen(rate_word)+1 );
14
15		rate_Kbps = helper_atou_sram( rate_word );
16		bucketsz = helper_atou_sram( bucketsz_word );
17		helper_sram_outmsg_2ul( rate_Kbps, bucketsz, outmsgstr );
18	    }
19	} else ... other operations ...
    }

To see the code in action, consider the control message "params= 1000 3000", which attempts to set the long-term average rate to 1,000 Kbps and the bucket size to 3,000 bytes. helper_count_words() in line 7 returns 3, so the check in line 8 passes. Lines 11-13 tokenize the three words in turn: cmnd_word points to "params=", rate_word points to "1000" and bucketsz_word points to "3000". Lines 15 and 16 then convert the two argument strings to the unsigned integers 1000 and 3000 and store them in rate_Kbps and bucketsz, and line 17 formats the two new values into the outbound reply message.


 Revised:  Fri Apr 3, 2009 
