ONL NPR Tutorial

Tour Of The Delay Plugin

Content

Timer Concepts
The Delay Queue
The plugin_init_user() Function
The handle_pkt() And handle_pkt_user() Functions
The callback() Function
Makefile Settings

The delay plugin shows how you can queue a packet for a fixed period of time (its delay) and then forward the packet when its delay has expired. The features found in the plugin that are different from what is found in the mycount plugin are:

A periodic thread (callback()) that determines if a packet should be forwarded
How to get the current time
The use of SRAM for a delay queue
A new paradigm for packet handling that is different from the sequence: get packet, process packet, forward packet.
New settings for loading the plugin due to the use of SRAM
The use of control messages defined in SRAM

These features make the plugin much more complex than the mycount plugin. The first version of the delay plugin was created by starting with the mycount plugin and incrementally adding and testing each feature. An understanding of how these features are implemented should help plugin developers write other comparable plugins (e.g., queue management, packet scheduling).

The basic idea behind the delay plugin is that the handle_pkt_user() thread enqueues an arriving packet onto the delay queue, and the callback() thread dequeues any packet in the delay queue whose forwarding time has arrived. In order to support this paradigm, the following major changes were made to the mycount plugin code:

handle_pkt_user() was modified so that it enqueued the meta-packet onto the delay queue rather than calling dl_sink_packet() to forward the meta-packet.
callback() was written to periodically check the delay queue for expired meta-packets and forward them.
New concurrent-access queue management routines were written.
The Makefile was modified to safely allocate compiler-usable SRAM regions among a maximum of five plugin instances.

Timer Concepts

IXP timestamps are 64 bits and are read in both handle_pkt_user() to record the arrival time of a packet and callback() to see if it is time to forward the first packet in the delay queue. Reading a timestamp involves reading two 32-bit microengine CSRs (Control Status Registers) that form a 64-bit timestamp that increments once every 16 clock cycles. Since an IXP runs at 1.4 GHz, one tick (16 clock cycles) is 11.42857 nsec. A typical code snippet for atomically reading the time stamp is:

	union tm_tag {
	    long long tm;
	    struct {
	    	unsigned long	hi;
	    	unsigned long	lo;
	    } tm2;
	};
	union tm_tag	y;

	y.tm2.lo = local_csr_read( local_csr_timestamp_low );	// must be first
	y.tm2.hi = local_csr_read( local_csr_timestamp_high );

Note that the order of the two calls is important because the reading of the local_csr_timestamp_low CSR latches the other CSR so that when you finally do read local_csr_timestamp_high, it contains a value that is consistent with the other CSR; i.e., the two statements act as an atomic read of the timestamp.
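The tick arithmetic above can be checked on an ordinary host. The following is a minimal sketch, not the plugin's code: the names ts_combine(), msec_to_ticks() and ticks_to_nsec() are hypothetical, the two CSR halves are passed in as plain arguments instead of being read with local_csr_read(), and the 1.4 GHz clock and 16-cycle tick are taken from the text.

```c
#include <stdint.h>

#define IXP_CLOCK_HZ     1400000000ULL  /* 1.4 GHz, from the text */
#define CYCLES_PER_TICK  16ULL          /* timestamp increments every 16 cycles */

/* Combine the two 32-bit timestamp halves into one 64-bit value.
 * On the ME the halves would come from the two CSR reads (low first);
 * here they are arguments so the sketch runs on any host. */
static inline uint64_t ts_combine(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Convert a delay in milliseconds to timestamp ticks:
 * 1 msec = 1,400,000 cycles = 87,500 ticks. */
static inline uint64_t msec_to_ticks(uint64_t msec)
{
    return msec * (IXP_CLOCK_HZ / 1000ULL) / CYCLES_PER_TICK;
}

/* Convert ticks to whole nanoseconds. One tick is 16 / 1.4 GHz = 80/7 nsec,
 * so the reduced fraction is used to keep intermediate values small. */
static inline uint64_t ticks_to_nsec(uint64_t ticks)
{
    return ticks * 80ULL / 7ULL;
}
```

For example, msec_to_ticks(1) is 87,500 ticks, and ticks_to_nsec(87500) converts that back to 1,000,000 nsec.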

The Delay Queue

Each item in the delay queue represents one meta-packet. The queue is a standard forward-linked list in which each item contains the time the meta-packet should be forwarded and a copy of the meta-packet fields:

	struct delay_item_tag {
	    union tm_tag time;			// time to leave
	    unsigned int buf_handle;		// meta-packet
	    unsigned int out_port;		// .
	    unsigned int qid;			// .
	    unsigned int l3_pkt_len;		// .
	    struct delay_item_tag *next;
	};

We describe the process involved in developing the interface functions to give you some insight into how other similar functions should be developed.

Also recall that plugins have been allocated a 5 MB region of SRAM for their own use. We describe the initialization of the delay queue so that you can understand how that SRAM region is used and the changes to the Makefile required to support the loading of all five plugin MEs with the delay plugin.

The delay queue interface functions are:

delayq_init(): Initialize the delay queue structure within the plugin SRAM area.
delayq_enq(): Enqueue an item onto the delay queue.
delayq_pop(): Remove the first item from the delay queue.

Take Note: There is nothing unusual about these functions except that the code recognizes that we must allocate space for the queue items from the predesignated 5 MB plugin SRAM region. But we first developed most of the code on a general-purpose machine and then made the necessary code modifications to accommodate the IXP before compiling and testing the plugin in the IXP environment. This approach makes development quite fast because you can use normal debugging tools in the general-purpose environment and only have to resort to primitive debug messages to debug the IXP-specific parts of the delay queue code. We highly recommend this approach whenever developing complicated code that is not IXP-specific. For example, we recommend this approach when developing a queue management (e.g., RED) or packet scheduling plugin.
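To illustrate the host-first approach, here is a minimal sketch of what a general-purpose version of the three interface functions might look like. It is not the plugin's code: the hq_ names are hypothetical, the item pool is a small static array instead of the plugin SRAM region, there is no lock, and the return conventions (queue length or -1 from the enqueue, 0 or -1 from the pop) are assumptions modeled on how the plugin code calls these functions.

```c
#include <stddef.h>
#include <stdint.h>

#define HQ_MAX_ITEMS 8              /* small pool for host testing (hypothetical) */

struct hq_item {
    uint64_t        time;           /* departure time in ticks */
    unsigned int    buf_handle;     /* meta-packet fields */
    unsigned int    out_port;
    unsigned int    qid;
    unsigned int    l3_pkt_len;
    struct hq_item *next;
};

struct hq_queue {
    unsigned long   ninq;           /* number of queued items */
    struct hq_item *hd;             /* head: next item to depart */
    struct hq_item *tl;             /* tail: latest arrival */
    struct hq_item *free_hd;        /* free list */
};

static struct hq_item hq_pool[HQ_MAX_ITEMS];   /* stands in for the SRAM region */

/* Chain every pool item onto the free list and empty the queue. */
static inline int hq_init(struct hq_queue *q)
{
    int i;

    for (i = 0; i < HQ_MAX_ITEMS - 1; i++)
        hq_pool[i].next = &hq_pool[i + 1];
    hq_pool[HQ_MAX_ITEMS - 1].next = NULL;

    q->free_hd = &hq_pool[0];
    q->hd = q->tl = NULL;
    q->ninq = 0;
    return 0;
}

/* Append an item at the tail; returns the new queue length, or -1 if full. */
static inline long hq_enq(struct hq_queue *q, uint64_t time,
                          unsigned int buf_handle, unsigned int out_port,
                          unsigned int qid, unsigned int l3_pkt_len)
{
    struct hq_item *it = q->free_hd;

    if (it == NULL)
        return -1;                  /* free list exhausted */
    q->free_hd = it->next;

    it->time = time;
    it->buf_handle = buf_handle;
    it->out_port = out_port;
    it->qid = qid;
    it->l3_pkt_len = l3_pkt_len;
    it->next = NULL;

    if (q->tl == NULL)
        q->hd = it;                 /* queue was empty */
    else
        q->tl->next = it;
    q->tl = it;
    return (long)++q->ninq;
}

/* Copy the head item to *out, unlink it, and return its slot to the free list.
 * Returns 0 on success, -1 if the queue is empty. */
static inline int hq_pop(struct hq_queue *q, struct hq_item *out)
{
    struct hq_item *it = q->hd;

    if (it == NULL)
        return -1;
    *out = *it;
    out->next = NULL;

    q->hd = it->next;
    if (q->hd == NULL)
        q->tl = NULL;               /* queue is now empty */
    it->next = q->free_hd;
    q->free_hd = it;
    --q->ninq;
    return 0;
}
```

Because every packet gets the same fixed delay, appending at the tail keeps the list ordered by departure time, so only the head ever needs to be examined.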

1	int
2	delayq_init( __declspec(shared, sram) struct delayq_tag *qptr ) {
3	    int		i;
4	    int		K = MAX_QUEUE_SZ-1;
5	    struct delay_item_tag *item_ptr;
6	
7	    if ( pluginId == 0)		item_ptr = (struct delay_item_tag *) 0xC0100000;
8	    else if ( pluginId == 1)	item_ptr = (struct delay_item_tag *) 0xC0200000;
9	    else if ( pluginId == 2)	item_ptr = (struct delay_item_tag *) 0xC0300000;
10	    else if ( pluginId == 3)	item_ptr = (struct delay_item_tag *) 0xC0400000;
11	    else if ( pluginId == 4)	item_ptr = (struct delay_item_tag *) 0xC0500000;
12	    else	return -1;
13	
14	    qptr->free_hd = item_ptr;	// queue descriptor
15	    qptr->hd = qptr->tl = 0;
16	    qptr->ninq = 0;
17	
18	    (item_ptr+K)->next = 0;
19	    for (i=0; i<K; i++) {
20		item_ptr->next = item_ptr+1;
21		++item_ptr;
22	    }
23	
24	    return 0;
25	}

The NPR uses most of SRAM (e.g., for buffer descriptors) but allows the plugin user to use 5 MB of it starting at memory location 0xC0100000. The delay plugin divides that region into five 1 MB regions, and plugin ME k (k = 0, ..., 4) uses region k. Lines 7-12 implement this decision by initializing item_ptr to point to the proper 1 MB SRAM region. (Note that the values of item_ptr are separated by 0x00100000, or 2^20, bytes.) The rest of delayq_init() uses item_ptr to initialize the queue descriptor and the freelist.

The rest of the delayq_init() code is straightforward. Lines 14-16 initialize the queue descriptor so that the freelist pointer points to the beginning of the 1 MB SRAM region (line 14); the head and tail pointers are 0 (line 15); and the number in the queue is 0 (line 16). Lines 18-22 create the freelist by setting the next pointer of each item to point to the following item; the next pointer of the last item is 0 (line 18).
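The if/else chain in lines 7-12 is equivalent to simple address arithmetic. The sketch below shows that arithmetic in host-side C; the function name plugin_region_base() and the macro names are hypothetical, and returning 0 for an out-of-range ID is a convention chosen for this example only.

```c
#include <stdint.h>

#define PLUGIN_SRAM_BASE  0xC0100000u   /* start of the 5 MB plugin area (from the text) */
#define PLUGIN_REGION_SZ  0x00100000u   /* 1 MB (2^20 bytes) per plugin ME */
#define NUM_PLUGIN_MES    5u

/* Return the base address of the 1 MB region for plugin ME `id`,
 * or 0 if `id` is not in the range 0..4. */
static inline uint32_t plugin_region_base(unsigned int id)
{
    if (id >= NUM_PLUGIN_MES)
        return 0;
    return PLUGIN_SRAM_BASE + id * PLUGIN_REGION_SZ;
}
```

For example, plugin_region_base(2) evaluates to 0xC0300000, matching line 9 of the listing.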

Two other lines are worth discussing: lines 2 and 4. First, the easy one. The name MAX_QUEUE_SZ appears in line 4. This is the maximum number of items (and therefore meta-packets) that can be queued. It is defined to be 35,000. The implication is that for maximum-sized packets (1,500 bytes), the plugin can support a bandwidth-delay product (BDP) of about 420 Mb (35,000 x 1,500 bytes x 8 bits). A 420 Mb BDP means that you can have at most a 420 msec delay at 1 Gbps, which is not unreasonable.
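The capacity implied by MAX_QUEUE_SZ can be worked out with a few lines of host-side C. This is only a sketch for checking the arithmetic; the function names are hypothetical. The pool_bytes() helper takes the item size as a parameter because the real size depends on how the microengine C compiler lays out struct delay_item_tag; the 28 bytes used in the note below is an assumption (the field sizes summed, with a 4-byte pointer and no padding).

```c
#include <stdint.h>

/* Bandwidth-delay product, in bits, that a queue of `items` packets
 * of `pkt_bytes` bytes each can hold. */
static inline uint64_t queue_bdp_bits(uint64_t items, uint64_t pkt_bytes)
{
    return items * pkt_bytes * 8ULL;
}

/* Longest delay, in whole milliseconds, that `bdp_bits` can absorb
 * when packets arrive at `rate_bps` bits per second. */
static inline uint64_t max_delay_msec(uint64_t bdp_bits, uint64_t rate_bps)
{
    return bdp_bits * 1000ULL / rate_bps;
}

/* SRAM needed for a pool of `items` queue items of `item_bytes` each. */
static inline uint64_t pool_bytes(uint64_t items, uint64_t item_bytes)
{
    return items * item_bytes;
}
```

With 35,000 items and 1,500-byte packets, queue_bdp_bits() gives 420,000,000 bits, which max_delay_msec() turns into 420 msec at 1 Gbps. Under the 28-byte assumption, pool_bytes(35000, 28) is 980,000 bytes, which fits in a 1 MB (1,048,576-byte) region.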

Second, in line 2 qptr contains the address of the queue descriptor. The queue descriptor contains the freelist pointer free_hd, the head and tail pointers hd and tl, and the population counter ninq. Note that qptr has been declared to be in SRAM and is shared among threads in the ME. The queue descriptor is actually defined and statically allocated near the beginning of the source code file and then later, its address is passed into delayq_init():

	struct delay_item_tag {
	    union tm_tag	time;		// time to leave
	    unsigned int	buf_handle;	// meta-packet
	    unsigned int	out_port;	// .
	    unsigned int	qid;		// .
	    unsigned int	l3_pkt_len;	// .
	    struct delay_item_tag *next;
	};
	struct delayq_tag {
	    unsigned long		ninq;	// # in delay queue
	    struct delay_item_tag	*hd;	// head ptr
	    struct delay_item_tag	*tl;	// tail ptr
	    struct delay_item_tag	*free_hd;	// free list
	};
	
	__declspec(shared, sram) struct delayq_tag	delayq;
	...
	void plugin_init_user()
	{
	    ...
	    if ( delayq_init( &delayq ) != 0 )	errno = BAD_DELAYQ_INIT;
	    ...
	}

The reason that delayq is declared shared is that both the handle_pkt() thread (via handle_pkt_user()) and the callback() thread update the delay queue. Also, because the queue is shared, we must protect the updating of the queue with a lock. This is shown in the abbreviated delayq_pop() code snippet below:

1	#define UNLOCKED 0
2	#define LOCKED   1
3	__declspec(shared gp_reg) unsigned int	delayq_lock;
4	...
5	int
6	delayq_pop( __declspec(shared, sram) struct delayq_tag *qptr ) {
7	    struct delay_item_tag	*item;
8	
9	    while( delayq_lock == LOCKED )	ctx_swap();
10	    delayq_lock = LOCKED;
11
12	    ... Pop front item from queue and return to freelist ...
13	
14	    delayq_lock = UNLOCKED;
15	    return 0;
16	}

Line 9 yields the CPU (context switches to the next thread by calling ctx_swap()) if some other thread has already acquired the lock delayq_lock. Otherwise, line 10 acquires the lock. Because ME threads give up the processor only at explicit swap points, no other thread can run between the test in line 9 and the assignment in line 10. After updating the delay queue, line 14 releases the lock. The code fragment denoted by line 12 is ordinary, straightforward code for removing the item from the queue and returning the space to the freelist.
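The locking pattern can be factored into a pair of helpers. The sketch below is a host-side illustration only: hq_lock_acquire(), hq_lock_release() and host_ctx_swap() are hypothetical names, and host_ctx_swap() is a stub that stands in for ctx_swap() by simulating "another thread ran and released the lock" so that the example terminates in a single-threaded test.

```c
#define UNLOCKED 0
#define LOCKED   1

static unsigned int  hq_lock = UNLOCKED;
static unsigned long hq_yields = 0;     /* counts simulated context swaps */

/* Host stand-in for ctx_swap(). On the ME, other threads would run here
 * and the lock holder would eventually release the lock; the stub
 * simulates that outcome by releasing the lock itself. */
static inline void host_ctx_swap(void)
{
    ++hq_yields;
    hq_lock = UNLOCKED;
}

/* Spin, yielding between checks, until the lock is free; then take it. */
static inline void hq_lock_acquire(void)
{
    while (hq_lock == LOCKED)
        host_ctx_swap();
    hq_lock = LOCKED;
}

/* Release the lock so that a waiting thread can proceed. */
static inline void hq_lock_release(void)
{
    hq_lock = UNLOCKED;
}
```

A plain test-then-set like this is safe only when threads are scheduled cooperatively. On a host with preemptive threads, the test and the assignment would have to be replaced by an atomic operation.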

The fact that the delay queue descriptor delayq is declared to be in SRAM is an issue if we want to have more than one ME run the delay plugin. This issue is addressed in the section Makefile Settings.

The plugin_init_user() Function

The plugin_init_user() function looks identical to the one for the mycount plugin except that it now must initialize the delay queue lock (line 6) and the delay queue (line 7):

1	void plugin_init_user()
2	{
3	    if( ctx() == 0 )
4	    {
5		npkts = 0;		// #pkts seen by plugin
6		delayq_lock = UNLOCKED;
7		if ( delayq_init( &delayq ) != 0 )	errno = BAD_DELAYQ_INIT;
8	    }

Line 7 indicates that if delayq_init() returns an error, errno is set so that the user can query for the latest error. We don't do much more with errors because the plugin can't do anything about them.

The handle_pkt() And handle_pkt_user() Functions

There are two changes that need to be made to the mycount plugin to handle a meta-packet that arrives at the delay plugin:

We need to queue the meta-packet instead of sending it to the Queue Manager. So, now handle_pkt_user() queues the meta-packet along with its departure time, and handle_pkt() calls dl_sink_nopacket() instead of dl_sink_packet().
We need to send the meta-packet to the Queue Manager after it has been properly delayed. So, now the callback() thread periodically checks to see if a departure time has been reached and forwards those qualified meta-packets.

1	void handle_pkt()
2	{
3	    dl_source_packet(dlFromBlock);
4	    default_format_out_data(dlNextBlock);
5	    handle_pkt_user();
6	    dl_sink_nopacket();
7	}

The only change to handle_pkt() is in line 6 where we now call dl_sink_nopacket(), which does the same thing as dl_sink_packet() except that it doesn't forward any packet. The plugin-specific processing is done in handle_pkt_user().

We do not show the dl_sink_nopacket() code because it is just the dl_sink_packet() code with all of the lines that do meta-packet forwarding deleted, leaving only the code that passes control to the next thread.

1	void handle_pkt_user()  {
2	    __declspec(gp_reg) buf_handle_t	buf_handle;
3	    __declspec(gp_reg) onl_api_buf_desc	bufDescriptor;
4	    __declspec(gp_reg) unsigned int	out_port;
5	    __declspec(gp_reg) unsigned int	qid;
6	    unsigned int	bufDescPtr;
7	    unsigned long	ninq;
8	    union tm_tag	current;// current time
9	    union tm_tag	depart;	// depart time
10	    struct delay_item_tag *delay_item_ptr;
11	    unsigned int	nticks;	// 16 cycles = 1 tick
12	
13	    ++npkts;		// pkt counter
14	
15	    current.tm2.lo = local_csr_read( local_csr_timestamp_low );
16	    current.tm2.hi = local_csr_read( local_csr_timestamp_high );
17	    nticks = helper_msec2cycles( delay ) >> 4;
18	    depart.tm = current.tm + nticks;
19	
20	    out_port = (ring_in.uc_mc_bits >> 3) & 0x7;
21	    qid = ((out_port+1) << 13) | ring_in.qid;
22	    ninq = delayq_enq( &delayq, depart.tm, ring_in.buf_handle_lo24,
23					out_port, qid, ring_in.l3_pkt_len );
24	    if( ninq == -1 ) {
25	        errno = BAD_ENQ;
26		++ndrops;	// number of drops
27		onl_api_set_out_to_DROP();
28	        dl_sink_packet(dlNextBlock);
29	    } else {
30	        if ( ninq > maxinq )	maxinq = ninq;
31	    }
32	}

Lines 6-11: Note that because none of these declarations have a __declspec(), the compiler assumes that these variables will be in SRAM.
Line 13: This is the standard packet counter update found in mycount.
Lines 15-18: Compute the packet's departure time in clock ticks. Recall that each tick is equal to 16 clock cycles. So, lines 15-16 read the current 64-bit timestamp; line 17 converts the delay from milliseconds to clock ticks; and line 18 computes the departure time, which is passed to delayq_enq() in line 22.
Lines 20-21: Compute the output port and qid fields that will go into the outgoing meta-packet and are put into the delay queue item by the call to delayq_enq() in lines 22-23. These computations are normally done in default_format_out_data().
Lines 22-28: Enqueue the item onto the delay queue. If the call to delayq_enq() fails (returns -1), errno is set, the drop counter is incremented, and the packet is dropped.
Line 30: Update the maximum number of packets seen in the delay queue.
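The bit manipulation in lines 18-21 can be tried out on a host. The sketch below restates those expressions as small functions; the function names are hypothetical, and the inputs are passed as plain integers instead of being read from ring_in.

```c
#include <stdint.h>

/* Output port: the 3-bit field in bits 5..3 of uc_mc_bits (line 20). */
static inline unsigned int out_port_from_bits(unsigned int uc_mc_bits)
{
    return (uc_mc_bits >> 3) & 0x7;
}

/* Outgoing qid: (out_port + 1) shifted left by 13 bits,
 * ORed with the incoming qid (line 21). */
static inline unsigned int make_qid(unsigned int out_port, unsigned int in_qid)
{
    return ((out_port + 1) << 13) | in_qid;
}

/* Departure time: 64-bit arrival timestamp plus the delay in ticks (line 18).
 * The addition is done in 64 bits, so a carry out of the low word is kept. */
static inline uint64_t depart_time(uint64_t now_ticks, uint32_t delay_ticks)
{
    return now_ticks + delay_ticks;
}
```

For example, a uc_mc_bits value of 0x18 selects output port 3, and make_qid(3, 5) yields 0x8005.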

The callback() Function

The callback() function is responsible for checking for meta-packets that should be forwarded to the Queue Manager because they have met their delay requirement. Because it is a thread, it will be activated every time control is passed back to it from the handle_msg() thread.

1	void callback()
2	{
3	    union tm_tag current;
4	
5	    if ( delayq.ninq > 0 ) {
6		current.tm2.lo = local_csr_read( local_csr_timestamp_low );
7		current.tm2.hi = local_csr_read( local_csr_timestamp_high );
8		if ( current.tm >= delayq.hd->time.tm ) {	// time to leave
9		    dl_sink_delay( );
10		} else {
11		    sleep( SLEEP_CYCLES );
12		}
13	    } else {
14		sleep( SLEEP_CYCLES );
15	    }
16	}

The code is straightforward. If there is at least one packet in the delay queue (line 5), callback() reads the current time (lines 6-7) and forwards the first packet in the queue if that packet's departure time has been reached (lines 8-9). In line 9, dl_sink_delay() dequeues the first meta-packet and forwards it to the Queue Manager; it is essentially the same code as dl_sink_packet() except that it gets its meta-packet from the delay queue. If the departure time has not been reached, the thread sleeps for 10 microseconds (line 11). SLEEP_CYCLES is a constant equal to 14,000, and the argument to sleep() is a count of 1.4 GHz IXP cycles, so 14,000 cycles is 10 microseconds. If there are no packets in the delay queue, the thread also sleeps for 10 microseconds (line 14).
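The decision made on each activation of callback() can be written as a pure function, which makes it easy to test on a host. This is a sketch with hypothetical names (callback_decide(), cycles_to_usec(), enum cb_action); it mirrors the tests in the listing but does not call dl_sink_delay() or sleep().

```c
#include <stdint.h>

#define SLEEP_CYCLES   14000u   /* from the text */
#define IXP_CLOCK_MHZ  1400u    /* 1.4 GHz = 1400 cycles per microsecond */

enum cb_action { CB_SLEEP, CB_FORWARD };

/* Forward the head packet only if the queue is non-empty and the head's
 * departure time has been reached; otherwise sleep. */
static inline enum cb_action callback_decide(unsigned long ninq,
                                             uint64_t now,
                                             uint64_t head_time)
{
    if (ninq > 0 && now >= head_time)
        return CB_FORWARD;
    return CB_SLEEP;
}

/* Convert a cycle count to whole microseconds at 1.4 GHz. */
static inline unsigned int cycles_to_usec(unsigned int cycles)
{
    return cycles / IXP_CLOCK_MHZ;
}
```

Here cycles_to_usec(SLEEP_CYCLES) is 10. One consequence of polling this way is that a packet may leave somewhat later than its nominal departure time, by up to roughly one sleep period plus whatever time the other threads hold the ME.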

Makefile Settings

Earlier we mentioned that the delay plugin uses SRAM memory references that are determined by the microengine C compiler. Furthermore, those SRAM memory references are constrained to a 1,024,000-byte region in SRAM Bank 3. The Makefile we used for compiling/linking the mycount plugin contains the following line where "..." denotes some omitted fields:

	LDFLAGS=-g -p -f ... -sr3 0x00706000:0x000FA000

The "-sr3 0x00706000:0x000FA000" tells the microengine image linker to use the SRAM region in bank (channel) 3 starting at location 0x00706000 and extending for 0x000FA000 bytes (1,024,000 bytes). But if we used this for the delay plugin, all of the declarations with SRAM variables would be assigned from this region. These variables include the queue descriptor and many of the handle_pkt_user() variables. This would work if we loaded only one delay plugin. But if we loaded two delay plugins or any two plugins that used SRAM references generated by the microengine C compiler, the set of SRAM references from one plugin would overlap those of the other.

The Makefile for the delay plugin partitions the 1,024,000 bytes in SRAM bank 3 into five regions of 204,800 bytes (0x00032000) each and assigns these partitions to the five plugin MEs. It does this by defining five different loader settings and using the setting appropriate for each of the five MEs:

	LDFLAGS0=-g -p -f ... -sr3 0x00706000:0x00032000
	LDFLAGS1=-g -p -f ... -sr3 0x00738000:0x00032000
	LDFLAGS2=-g -p -f ... -sr3 0x0076A000:0x00032000
	LDFLAGS3=-g -p -f ... -sr3 0x0079C000:0x00032000
	LDFLAGS4=-g -p -f ... -sr3 0x007CE000:0x00032000

Here, LDFLAGSi is used when linking the image for plugin ME i. With these settings, we can load the delay plugin into all five of the plugin MEs, and they won't interfere with each other's SRAM regions. But you could decide to use a different mapping.
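If you do choose a different mapping, the start addresses can be computed rather than worked out by hand. The following host-side sketch splits a region into equal parts; the struct and function names are hypothetical, and it assumes that the total size divides evenly by the number of parts and that the part index is in range.

```c
#include <stdint.h>

struct sram_part {
    uint32_t start;     /* first byte of the partition */
    uint32_t size;      /* partition length in bytes */
};

/* Split [base, base + total) into n equal partitions and return partition i.
 * Assumes n > 0, total is a multiple of n, and i < n. */
static inline struct sram_part sram_partition(uint32_t base, uint32_t total,
                                              unsigned int n, unsigned int i)
{
    struct sram_part p;

    p.size  = total / n;
    p.start = base + i * p.size;
    return p;
}
```

With base 0x00706000, total 0x000FA000 and n = 5, partition 1 starts at 0x00738000 and every partition is 0x00032000 bytes long, which reproduces the -sr3 arguments shown above.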

Revised: Fri, Nov 7, 2008
