NPR Tutorial >> Writing A Plugin | TOC |
The delay plugin shows how you can queue a packet for a fixed period of time (its delay) and then forward the packet when its delay has expired. The features found in the plugin that are different from what is found in the mycount plugin are:
These features make the plugin much more complex than the mycount plugin. The first version of the delay plugin was created by starting with the mycount plugin and incrementally adding and testing each feature. An understanding of how these features are implemented should help plugin developers write other comparable plugins (e.g., queue management, packet scheduling).
The basic idea behind the delay plugin is that the handle_pkt_user() thread enqueues an arriving packet onto the delay queue, and the callback() thread dequeues any packet in the delay queue whose forwarding time has arrived. In order to support this paradigm, the following major changes were made to the mycount plugin code:
IXP timestamps are 64 bits and are read in both handle_pkt_user() to record the arrival time of a packet and callback() to see if it is time to forward the first packet in the delay queue. Reading a timestamp involves reading two 32-bit microengine CSRs (Control Status Registers) that form a 64-bit timestamp that increments once every 16 clock cycles. Since an IXP runs at 1.4 GHz, one tick (16 clock cycles) is 11.42857 nsec. A typical code snippet for atomically reading the time stamp is:
    union tm_tag {
        long long tm;               // full 64-bit timestamp
        struct {
            unsigned long hi;       // upper 32 bits
            unsigned long lo;       // lower 32 bits
        } tm2;
    };

    union tm_tag y;

    y.tm2.lo = local_csr_read( local_csr_timestamp_low );   // must be first
    y.tm2.hi = local_csr_read( local_csr_timestamp_high );
Note that the order of the two calls is important because the reading of the local_csr_timestamp_low CSR latches the other CSR so that when you finally do read local_csr_timestamp_high, it contains a value that is consistent with the other CSR; i.e., the two statements act as an atomic read of the timestamp.
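The tick arithmetic above can be checked on a general-purpose machine. The following is a minimal host-side sketch, assuming the 1.4 GHz clock and 16-cycle tick stated above; the function names are hypothetical and are not part of the NPR plugin API.

```c
#include <stdint.h>

/* Values taken from the text: 1.4 GHz clock, one tick every 16 cycles. */
#define IXP_CLOCK_HZ    1400000000ULL
#define CYCLES_PER_TICK 16ULL

/* Combine the two 32-bit CSR halves into one 64-bit tick count. */
uint64_t join_timestamp(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Convert a delay in milliseconds to timestamp ticks. */
uint64_t msec_to_ticks(uint64_t msec)
{
    return msec * (IXP_CLOCK_HZ / 1000ULL) / CYCLES_PER_TICK;
}

/* Convert ticks to nanoseconds, truncated.
 * One tick is 16 / 1.4 nsec, i.e. 80/7 nsec. */
uint64_t ticks_to_nsec(uint64_t ticks)
{
    return ticks * 80ULL / 7ULL;
}
```

For example, msec_to_ticks(1) is 87,500, so a 50 msec delay corresponds to 4,375,000 ticks.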
Each item in the delay queue represents one meta-packet. The queue is a standard forward-linked list in which each item contains the time the meta-packet should be forwarded and a copy of the meta-packet fields:
    struct delay_item_tag {
        union tm_tag time;            // time to leave
        unsigned int buf_handle;      // meta-packet
        unsigned int out_port;        //  .
        unsigned int qid;             //  .
        unsigned int l3_pkt_len;      //  .
        struct delay_item_tag *next;
    };

We describe the process involved in developing the interface functions to give you some insight into how other similar functions should be developed.
Also recall that plugins have been allocated a 5 MB region of SRAM for their own use. We describe the initialization of the delay queue so that you can understand how that SRAM region is used and the changes to the Makefile required to support the loading of all five plugin MEs with the delay plugin.
The delay queue interface functions are:
Take Note: There is nothing unusual about these functions except that the code recognizes that we must allocate space for the queue items from the predesignated 5 MB plugin SRAM region. But we first developed most of the code on a general-purpose machine and then made the necessary code modifications to accommodate the IXP before compiling and testing the plugin in the IXP environment. This approach makes development quite fast because you can use normal debugging tools in the general-purpose environment and only have to resort to primitive debug messages to debug the IXP-specific parts of the delay queue code. We highly recommend this approach whenever developing complicated code that is not IXP-specific. For example, we recommend this approach when developing a queue management (e.g., RED) or packet scheduling plugin.
     1  int
     2  delayq_init( __declspec(shared, sram) struct delayq_tag *qptr ) {
     3      int i;
     4      int K = MAX_QUEUE_SZ-1;
     5      struct delay_item_tag *item_ptr;
     6
     7      if      ( pluginId == 0) item_ptr = (struct delay_item_tag *) 0xC0100000;
     8      else if ( pluginId == 1) item_ptr = (struct delay_item_tag *) 0xC0200000;
     9      else if ( pluginId == 2) item_ptr = (struct delay_item_tag *) 0xC0300000;
    10      else if ( pluginId == 3) item_ptr = (struct delay_item_tag *) 0xC0400000;
    11      else if ( pluginId == 4) item_ptr = (struct delay_item_tag *) 0xC0500000;
    12      else return -1;
    13
    14      qptr->free_hd = item_ptr;       // queue descriptor
    15      qptr->hd = qptr->tl = 0;
    16      qptr->ninq = 0;
    17
    18      (item_ptr+K)->next = 0;
    19      for (i=0; i<K; i++) {
    20          item_ptr->next = item_ptr+1;
    21          ++item_ptr;
    22      }
    23
    24      return 0;
    25  }
The NPR uses most of SRAM (e.g., for buffer descriptors) but allows the plugin user to use 5 MB of it starting at memory location 0xC0100000. The delay plugin assumes that we will divide up that region into five 1 MB regions and that plugin ME k will use the kth region. Lines 7-12 implement this decision by initializing item_ptr to point to the proper 1 MB SRAM region. (Note that the values of item_ptr are separated by an amount equal to 0x00100000 or 2^20.) The rest of delayq_init() uses item_ptr to initialize the queue descriptor and the freelist.
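Because the five base addresses are evenly spaced, the if/else chain can also be written as a single computation. The sketch below is one possible host-testable formulation, not the plugin's actual code; the macro and function names are hypothetical, and returning 0 for an invalid ID is an assumption made for illustration.

```c
#include <stdint.h>

#define PLUGIN_SRAM_BASE   0xC0100000u  /* start of the 5 MB plugin region */
#define PLUGIN_REGION_SIZE 0x00100000u  /* 1 MB (2^20 bytes) per plugin ME */
#define NUM_PLUGIN_MES     5

/* Return the base address of the 1 MB region for plugin ME plugin_id,
 * or 0 if plugin_id is outside 0..4 (assumed error convention). */
uint32_t plugin_region_base(int plugin_id)
{
    if (plugin_id < 0 || plugin_id >= NUM_PLUGIN_MES)
        return 0;
    return PLUGIN_SRAM_BASE + (uint32_t)plugin_id * PLUGIN_REGION_SIZE;
}
```

On the IXP the result would be cast to a struct delay_item_tag pointer, as lines 7-11 do with the literal addresses.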
The rest of the delayq_init() code is straightforward. Lines 14-16 initialize the queue descriptor so that the freelist pointer points to the beginning of the 1 MB SRAM region (line 14); the head and tail pointers are 0 (line 15); and the number in the queue is 0 (line 16). Lines 18-22 create the freelist by setting the next pointer of each item to point to the following item structure; the last item's next pointer is 0 (line 18).
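In the spirit of developing on a general-purpose machine first, the same freelist construction can be exercised on a host. This is a minimal sketch under stated assumptions: a static array stands in for the 1 MB SRAM region, the timestamp union is replaced by a plain 64-bit field, MAX_QUEUE_SZ is shrunk for testing, and the host_ names are hypothetical.

```c
#include <stddef.h>

#define MAX_QUEUE_SZ 8   /* small value for host testing; the plugin uses 35,000 */

struct delay_item {
    unsigned long long time;          /* time to leave */
    unsigned int buf_handle, out_port, qid, l3_pkt_len;
    struct delay_item *next;
};

struct delayq {
    unsigned long ninq;               /* # in delay queue */
    struct delay_item *hd, *tl;       /* head and tail pointers */
    struct delay_item *free_hd;       /* free list */
};

/* Host stand-in for the plugin's 1 MB SRAM region. */
static struct delay_item item_pool[MAX_QUEUE_SZ];

/* Initialize the descriptor and chain every pool item onto the freelist. */
int host_delayq_init(struct delayq *q)
{
    int i;

    if (q == NULL)
        return -1;
    q->free_hd = item_pool;
    q->hd = q->tl = NULL;
    q->ninq = 0;
    for (i = 0; i < MAX_QUEUE_SZ - 1; i++)
        item_pool[i].next = &item_pool[i + 1];
    item_pool[MAX_QUEUE_SZ - 1].next = NULL;   /* end of freelist */
    return 0;
}

/* Walk the freelist and count its items (useful as a sanity check). */
int host_freelist_len(const struct delayq *q)
{
    int n = 0;
    const struct delay_item *p;

    for (p = q->free_hd; p != NULL; p = p->next)
        ++n;
    return n;
}
```

After initialization, the freelist should hold exactly MAX_QUEUE_SZ items and the queue itself should be empty.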
Two other lines are worth discussing: lines 2 and 4. First, the easy one. The name MAX_QUEUE_SZ appears in line 4. This is the maximum number of items (and therefore meta-packets) that can be queued. It is defined to be 35,000. The implication is that for maximum-sized packets (1,500 bytes), the plugin can support a bandwidth-delay product (BDP) of about 420 Mb (35,000 x 1,500 bytes x 8 bits). A 420 Mb BDP means that you can have at most a 420 msec delay at 1 Gbps, which is not unreasonable.
Second, in line 2 qptr contains the address of the queue descriptor. The queue descriptor contains the freelist pointer free_hd, the head and tail pointers hd and tl, and the population counter ninq. Note that qptr has been declared to be in SRAM and is shared among threads in the ME. The queue descriptor is actually defined and statically allocated near the beginning of the source code file and then later, its address is passed into delayq_init():
    struct delay_item_tag {
        union tm_tag time;            // time to leave
        unsigned int buf_handle;      // meta-packet
        unsigned int out_port;        //  .
        unsigned int qid;             //  .
        unsigned int l3_pkt_len;      //  .
        struct delay_item_tag *next;
    };

    struct delayq_tag {
        unsigned long ninq;               // # in delay queue
        struct delay_item_tag *hd;        // head ptr
        struct delay_item_tag *tl;        // tail ptr
        struct delay_item_tag *free_hd;   // free list
    };

    __declspec(shared, sram) struct delayq_tag delayq;
    ...
    void plugin_init_user()
    {
        ...
        if ( delayq_init( &delayq ) != 0 ) errno = BAD_DELAYQ_INIT;
        ...
    }
The reason that delayq is declared shared is that both the handle_pkt() thread (via handle_pkt_user()) and the callback() thread update the delay queue. Also, because the queue is shared, we must protect the updating of the queue with a lock. This is shown in the abbreviated delayq_pop() code snippet below:
     1  #define UNLOCKED 0
     2  #define LOCKED   1
     3  __declspec(shared gp_reg) unsigned int delayq_lock;
     4  ...
     5  int
     6  delayq_pop( __declspec(shared, sram) struct delayq_tag *qptr ) {
     7      struct delay_item_tag *item;
     8
     9      while( delayq_lock == LOCKED ) ctx_swap();
    10      delayq_lock = LOCKED;
    11
    12      ... Pop front item from queue and return to freelist ...
    13
    14      delayq_lock = UNLOCKED;
    15      return 0;
    16  }

Line 9 yields the CPU (context switches to the next thread by calling ctx_swap()) if some other thread has already acquired the lock delayq_lock. Otherwise, line 10 acquires the lock. After updating the delay queue, line 14 releases the lock. The code fragment denoted by line 12 is ordinary, straightforward code for removing the item from the queue and returning the space to the freelist.
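To make the locked update concrete, here is a hedged host-side sketch of an enqueue and a pop that follow the same acquire/update/release pattern. It is not the plugin's actual code: the structures are trimmed, the q_ names and ctx_swap_stub() are hypothetical, and returning -1 on a full or empty queue is an assumption. The check-then-set lock is only safe because threads on a microengine are switched cooperatively; on a preemptive host with real threads, a proper mutex would be needed instead.

```c
#include <stddef.h>

#define QSZ      4
#define UNLOCKED 0
#define LOCKED   1

struct item  { unsigned long long time; unsigned int buf_handle; struct item *next; };
struct queue { unsigned long ninq; struct item *hd, *tl, *free_hd; };

static struct item pool[QSZ];          /* stand-in for the SRAM item region */
static unsigned int q_lock = UNLOCKED;

static void ctx_swap_stub(void) { /* host stand-in: no other thread to yield to */ }

static void q_acquire(void) { while (q_lock == LOCKED) ctx_swap_stub(); q_lock = LOCKED; }
static void q_release(void) { q_lock = UNLOCKED; }

void q_init(struct queue *q)
{
    int i;
    for (i = 0; i < QSZ - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[QSZ - 1].next = NULL;
    q->free_hd = pool;
    q->hd = q->tl = NULL;
    q->ninq = 0;
    q_lock = UNLOCKED;
}

/* Append an item at the tail; returns the new queue length, or -1 if full. */
long q_enq(struct queue *q, unsigned long long time, unsigned int buf_handle)
{
    struct item *it;
    long n;

    q_acquire();
    it = q->free_hd;
    if (it == NULL) { q_release(); return -1; }   /* freelist exhausted */
    q->free_hd = it->next;
    it->time = time;
    it->buf_handle = buf_handle;
    it->next = NULL;
    if (q->tl == NULL) q->hd = it; else q->tl->next = it;
    q->tl = it;
    n = (long)++q->ninq;
    q_release();
    return n;
}

/* Remove the head item and return it to the freelist; -1 if queue is empty. */
int q_pop(struct queue *q)
{
    struct item *it;

    q_acquire();
    it = q->hd;
    if (it == NULL) { q_release(); return -1; }
    q->hd = it->next;
    if (q->hd == NULL) q->tl = NULL;
    it->next = q->free_hd;              /* push onto the freelist */
    q->free_hd = it;
    --q->ninq;
    q_release();
    return 0;
}
```

Note that every early return releases the lock first; forgetting that on an error path would leave the other thread spinning in ctx_swap() forever.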
The fact that the delay queue descriptor delayq is declared to be in SRAM is an issue if we want to have more than one ME run the delay plugin. This issue is addressed in the section Makefile Settings.
The plugin_init_user() function looks identical to the one for the mycount plugin except that it now must initialize the delay queue (line 7) and the delay queue lock (line 6):

    1  void plugin_init_user()
    2  {
    3      if( ctx() == 0 )
    4      {
    5          npkts = 0;                  // #pkts seen by plugin
    6          delayq_lock = UNLOCKED;
    7          if ( delayq_init( &delayq ) != 0 ) errno = BAD_DELAYQ_INIT;
    8      }
    9  }

Line 7 shows that if delayq_init() returns an error, errno is set so that the user can query for the latest error. We don't do much more with errors because the plugin can't do anything about them.

There are two changes that need to be made to the mycount plugin when a meta-packet arrives at the delay plugin:
    1  void handle_pkt()
    2  {
    3      dl_source_packet(dlFromBlock);
    4      default_format_out_data(dlNextBlock);
    5      handle_pkt_user();
    6      dl_sink_nopacket();
    7  }

The only change to handle_pkt() is in line 6, where we now call dl_sink_nopacket(), which does the same thing as dl_sink_packet() except that it doesn't forward any packet. handle_pkt_user() still does the plugin-specific processing.

We do not show the dl_sink_nopacket() code because it is just the dl_sink_packet() code with all of the lines that do meta-packet forwarding deleted, leaving only the code that passes control to the next thread.
     1  void handle_pkt_user() {
     2      __declspec(gp_reg) buf_handle_t buf_handle;
     3      __declspec(gp_reg) onl_api_buf_desc bufDescriptor;
     4      __declspec(gp_reg) unsigned int out_port;
     5      __declspec(gp_reg) unsigned int qid;
     6      unsigned int bufDescPtr;
     7      unsigned long ninq;
     8      union tm_tag current;   // current time
     9      union tm_tag depart;    // depart time
    10      struct delay_item_tag *delay_item_ptr;
    11      unsigned int nticks;    // 16 cycles = 1 tick
    12
    13      ++npkts;                // pkt counter
    14
    15      current.tm2.lo = local_csr_read( local_csr_timestamp_low );
    16      current.tm2.hi = local_csr_read( local_csr_timestamp_high );
    17      nticks = helper_msec2cycles( delay ) >> 4;
    18      depart.tm = current.tm + nticks;
    19
    20      out_port = (ring_in.uc_mc_bits >> 3) & 0x7;
    21      qid = ((out_port+1) << 13) | ring_in.qid;
    22      ninq = delayq_enq( &delayq, depart.tm, ring_in.buf_handle_lo24,
    23                         out_port, qid, ring_in.l3_pkt_len );
    24      if( ninq == -1 ) {
    25          errno = BAD_ENQ;
    26          ++ndrops;           // number of drops
    27          onl_api_set_out_to_DROP();
    28          dl_sink_packet(dlNextBlock);
    29      } else {
    30          if ( ninq > maxinq ) maxinq = ninq;
    31      }
    32  }
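The bit manipulation and the departure-time computation in handle_pkt_user() can be isolated and checked on a host. The sketch below mirrors the expressions in the listing; the function names are hypothetical, and the delay is passed in as a cycle count (what helper_msec2cycles() appears to return) rather than computed here.

```c
#include <stdint.h>

/* Output port: the 3-bit field in bits 5..3 of uc_mc_bits. */
unsigned int extract_out_port(unsigned int uc_mc_bits)
{
    return (uc_mc_bits >> 3) & 0x7;
}

/* Queue ID: (out_port + 1) placed above bit 12, OR'ed with the incoming qid. */
unsigned int make_qid(unsigned int out_port, unsigned int ring_qid)
{
    return ((out_port + 1) << 13) | ring_qid;
}

/* Departure time in ticks: now plus the delay, where the delay in
 * cycles is divided by 16 (shifted right by 4) to get ticks. */
uint64_t depart_time(uint64_t now_ticks, unsigned int delay_cycles)
{
    return now_ticks + (uint64_t)(delay_cycles >> 4);
}
```

For instance, a 1 msec delay is 1,400,000 cycles at 1.4 GHz, which the shift turns into 87,500 ticks.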
The callback() function is responsible for checking for meta-packets that should be forwarded to the Queue Manager because they have met their delay requirement. Because it is a thread, it will be activated every time control is passed back to it from the handle_msg() thread.
     1  void callback()
     2  {
     3      union tm_tag current;
     4
     5      if ( delayq.ninq > 0 ) {
     6          current.tm2.lo = local_csr_read( local_csr_timestamp_low );
     7          current.tm2.hi = local_csr_read( local_csr_timestamp_high );
     8          if ( current.tm >= delayq.hd->time.tm ) {   // time to leave
     9              dl_sink_delay( );
    10          } else {
    11              sleep( SLEEP_CYCLES );
    12          }
    13      } else {
    14          sleep( SLEEP_CYCLES );
    15      }
    16  }

The code is straightforward. If there is at least one packet in the delay queue and the first packet's departure time has been reached, callback() forwards that packet (lines 5-9). In line 9, dl_sink_delay() dequeues the first meta-packet and forwards it to the Queue Manager; it is essentially the same code as dl_sink_packet() except that it gets its meta-packet from the delay queue. If the first packet's departure time has not yet arrived, the thread sleeps for 10 microseconds (line 11). SLEEP_CYCLES is a constant equal to 14,000, and the argument to sleep() is a count of IXP clock cycles; at 1.4 GHz, 14,000 cycles is 10 microseconds. If there are no packets in the delay queue, the thread likewise sleeps for 10 microseconds (line 14).

Earlier we mentioned that the delay plugin uses SRAM memory references that are determined by the microengine C compiler. Furthermore, those SRAM memory references are constrained to a 1,024,000-byte region in SRAM Bank 3. The Makefile we used for compiling/linking the mycount plugin contains the following line where "..." denotes some omitted fields:
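The decision made by callback() reduces to a small pure function, which makes it easy to test off the IXP. This is a sketch only; the enum, the function names, and the cycles-to-microseconds helper are hypothetical, with the clock rate and SLEEP_CYCLES value taken from the text.

```c
#include <stdint.h>

#define IXP_CLOCK_HZ 1400000000ULL
#define SLEEP_CYCLES 14000ULL

enum cb_action { CB_FORWARD, CB_SLEEP };

/* Forward only when the queue is non-empty and the head item's
 * departure time has been reached; otherwise sleep. */
enum cb_action callback_decide(unsigned long ninq, uint64_t now, uint64_t head_depart)
{
    if (ninq > 0 && now >= head_depart)
        return CB_FORWARD;
    return CB_SLEEP;
}

/* Convert a cycle count to microseconds (truncated). */
uint64_t cycles_to_usec(uint64_t cycles)
{
    return cycles * 1000000ULL / IXP_CLOCK_HZ;
}
```

One consequence of polling is that a packet may leave somewhat later than its nominal departure time, by up to roughly one sleep interval plus whatever time the other threads hold the microengine.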
LDFLAGS=-g -p -f ... -sr3 0x00706000:0x000FA000
The "-sr3 0x00706000:0x000FA000" tells the microengine image linker to use the SRAM region in bank (channel) 3 starting at location 0x00706000 and extending for 0x000FA000 bytes (1,024,000 bytes). But if we used this for the delay plugin, all of the declarations with SRAM variables would be assigned from this region. These variables include the queue descriptor and many of the handle_pkt_user() variables. This would work if we loaded only one delay plugin. But if we loaded two delay plugins, or any two plugins that used SRAM references generated by the microengine C compiler, the set of SRAM references from one plugin would overlap those of the other.

The Makefile for the delay plugin partitions the 1,024,000 bytes in SRAM bank 3 into five regions of 204,800 bytes (0x00032000) each and assigns these partitions to the five plugin MEs. It does this by defining five different loader settings and then using the setting appropriate for each of the five MEs:

    LDFLAGS0=-g -p -f ... -sr3 0x00706000:0x00032000
    LDFLAGS1=-g -p -f ... -sr3 0x00738000:0x00032000
    LDFLAGS2=-g -p -f ... -sr3 0x0076A000:0x00032000
    LDFLAGS3=-g -p -f ... -sr3 0x0079C000:0x00032000
    LDFLAGS4=-g -p -f ... -sr3 0x007CE000:0x00032000

Here, LDFLAGSi is used when linking the image for plugin ME i. With these settings, we can load the delay plugin into all five of the plugin MEs, and they won't interfere with each other's SRAM region. But you could decide to use a different mapping.
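If you do choose a different mapping, it may help to generate the -sr3 values rather than compute them by hand. The following is a small host-side sketch that reproduces the five settings above from the region's base and total size; the macro and function names are hypothetical, and only the -sr3 field is produced (the omitted "..." fields are left alone).

```c
#include <stdint.h>
#include <stdio.h>

#define SR3_BASE 0x00706000u   /* start of the compiler-managed SRAM region */
#define SR3_SIZE 0x000FA000u   /* total size of that region in bytes */
#define NUM_MES  5

/* Size of each ME's partition: an equal split of the region. */
uint32_t sr3_part_size(void)
{
    return SR3_SIZE / NUM_MES;
}

/* Base address of the partition for plugin ME me (0..4). */
uint32_t sr3_part_base(int me)
{
    return SR3_BASE + (uint32_t)me * sr3_part_size();
}

/* Write "-sr3 0xBASE:0xSIZE" for ME me into buf; returns snprintf's result. */
int sr3_option(int me, char *buf, size_t len)
{
    return snprintf(buf, len, "-sr3 0x%08X:0x%08X",
                    (unsigned)sr3_part_base(me), (unsigned)sr3_part_size());
}
```

Calling sr3_option() for ME 0 through 4 yields the same base:size pairs as LDFLAGS0 through LDFLAGS4, and the last partition ends exactly at 0x00800000.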
Revised: Fri, Nov 7, 2008