The ONL NPR Tutorial


Introduction

Network Processors and the IXP 2800/2850

To understand the plugin framework provided by ONL, it is helpful to understand a little bit about Network Processors (NPs) and the Intel IXP 2800/2850 in particular. First, NP products have been developed for use in conventional routers as replacements for high-throughput, packet processing ASICs (Application Specific Integrated Circuits). Because NPs are programmable, they enable more rapid development and more rapid correction of design errors. [[ ixp-diagram.png Figure ]]

It is worth repeating what was said earlier in NPR Tutorial => Packet Processing about the IXP. The IXP has some unusual architectural features because it was designed specifically for rapid development of high-performance networking applications. There are 16 multi-threaded microengine (ME) cores that operate in parallel to do most of the packet processing. The MEs are organized so that packets can be streamed through the main processing blocks in a pipelined manner.

Ten of the 16 MEs form the fast path through the NPR, and five of the MEs are set aside for plugins (the remaining ME handles statistics). NPR users can load predefined (standard) plugins or user-developed plugins onto any of these MEs. Typically, the PLC (Parse, Lookup and Classify) block sends a meta-packet to a plugin by inserting the meta-packet into the plugin's input ring buffer, and the plugin inserts the meta-packet into the Queue Manager's input ring buffer when it is done with the packet. But an advanced plugin developer may elect to use the five plugin input ring buffers in a different manner. The NPR also sets aside 4KB of the scratchpad memory and 5MB of SRAM exclusively for use by plugins.

Furthermore, within each ME, the threading mechanism is used to deal with the memory-processing speed gap that exists in modern processors by overlapping memory access with processing. One of the most striking differences with general-purpose processors is the memory subsystem. There are no memory caches since networking applications exhibit little reference locality. Generally, Dynamic RAM (DRAM) is used for packet buffers. Static RAM (SRAM) is used for packet meta-data, ring buffers for inter-block communication, and large system tables. One of the SRAM channels also supports a Ternary Content-Addressable Memory (TCAM) which is used for IP route lookup and packet classification. The scratchpad memory is used for smaller ring buffers and tables.

Since caches are relatively ineffective for networking workloads, the IXP provides hardware multithreading to cope with the memory latency gap. Each of the MEs has eight separate sets of processor registers (including the Program Counter) which form the ME's hardware thread contexts. An ME can switch from one context to another in two clock cycles, allowing it to stay busy doing useful work, even when several of its hardware threads are suspended waiting for data from non-local memory.

Multi-threading can be used in a variety of ways, but there are some common usage patterns that are well-supported by hardware mechanisms. Perhaps the most commonly used (and simplest) pattern involves a group of threads that operate in round-robin fashion using hardware signals to pass control explicitly from one thread to the next.

[[ three-threads-resize.png Figure ]]
In this example, the first thread starts by reading a data item (e.g., a packet pointer) from a shared input queue, then issues a memory read request before passing control to the second thread. The second thread reads the next data item from the shared input queue, issues its own memory read request and passes control to the third thread. By the time the third thread issues its memory read request, the first thread has its data and is ready to continue. Notice how this allows the processor to stay busy, in spite of the long memory latencies. Also, note that the round-robin processing ensures that packets are processed in order. This technique works well when the variation in processing times from one packet to the next is bounded (the common case) and is straightforward to implement.

Although implementing a plugin that operates at line speed may require the use of multiple threads, simple, proof-of-concept plugins are often written using just one packet processing thread. Often (as described below), plugins use one of the eight possible threads to handle control messages from the Xscale control processor and, in some cases, one other thread to handle periodic operations.

There are two other aspects of the MEs that are important to understand. First, each ME has a small (8 KB), dedicated program store. This limits the number of different functions that can be implemented by a single ME. Also, the MEs have dedicated FIFOs between consecutive pairs of MEs (Next Neighbor Rings) which support pipelined processing.

 

Microengine C and the Intel IXP

Plugins are written in microengine C, which is the standard C-like language provided by Intel for use on the IXP MEs. The most important differences between microengine C and ANSI C are dictated by the IXP architecture:

Although these restrictions may seem onerous, simple plugins are, for the most part, not hard to write because the NPR architecture provides a framework that handles interaction between MEs along with a set of simple functions for processing packets and communicating with other MEs and the Xscale control processor.
 

Programming Framework

To help users who are unfamiliar with this programming environment, we have developed a framework that lowers the entry barrier for writing simple to moderately complex plugins. Note that users are not required to use the framework. Users who are already experts with the IXP can do whatever they wish with the five plugin MEs. The framework consists of a basic plugin structure that handles tasks common to most plugins and a plugin API that provides many functions that are useful for packet processing.

In the basic plugin architecture, there are three different types of tasks (packet processing, control message processing, and periodic computation) which are statically mapped to threads at plugin compile time.

For example, a plugin ME might have six packet processing threads (threads 0 to 5), one control message processing thread (thread 6) and one periodic computation thread (thread 7). In pseudo-code form, the main routine looks like this:
Initialize plugin;
c := Get current hardware context;
IF( 0 <= c <= 5 )	LOOP {  handle_pkt( ); }	// process pkt
ELSEIF( c == 6 )	LOOP {  handle_msg( ); }	// process msg
ELSE			LOOP {  callback( ); }		// process periodic
It should be understood that this code is run by each of the eight hardware threads on one plugin ME.

After initialization, when all eight hardware threads are employed, the threads execute in round-robin fashion: thread k does some processing and then explicitly passes control to thread ((k+1) mod 8). Note that each thread calls the appropriate function based on the value of c, the hardware context. In the case of handle_pkt(), it will block and release control if there is no available meta-packet in its input ring buffer. Similarly, handle_msg() will block and release control if there is no available control message from the Xscale control processor. And the periodic thread (thread 7) executes a computation when it is given control, if it is time for a periodic computation, and then passes control to thread 0 (the first packet processing thread).

[[ plugin-path-resize.png Figure ]]

Processing a Packet

In the simplest case, handle_pkt() extracts a meta-packet from its input ring buffer and calls a user-supplied function to do the actual plugin processing. When the function returns, the packet is inserted into the outgoing ring buffer. Typically, the plugin sends the packet directly to the Queue Manager so that it can be sent out to an external link. But the packet can also be sent back to MUX resulting in the packet being matched against routes and filters in the TCAM a second time. This is useful if something in the packet, such as the destination IP address, has changed and the packet needs to be re-routed. Packets can also be redirected to the next plugin ME via a next neighbor ring. In fact, plugins even have the ability to send packets to any other plugin ME by writing directly to the five ring buffers leading from PLC to the plugins.

Processing a Control Message

A user can send a control message to a plugin through the RLI. For example, the delay plugin allows the user to retrieve the number of packets processed by the plugin and to set the delay. If the plugin has been loaded into NPR 2, plugin ME 0, a user can set the delay to 20 msec by sending the message "delay= 20" to NPR 2, plugin ME 0 using the Edit => Send Command to Plugin menu item in the Plugin Table window for NPR 2. In this case, the RLI sends the message to the ONL testbed, and, in particular, to NPR 2's Xscale control processor to be forwarded to the control message ring buffer for plugin ME 0. The control message handling thread in the plugin will read the message, process it, and return a reply message to the Xscale, which will forward the response to the RLI.

A message sent to a plugin is limited to 28 characters. There currently is no standard for the syntax of plugin messages other than that the incoming message is a sequence of white-space-separated ASCII words. Each plugin is left to interpret its own set of messages. For example, in the delay plugin example above, the command "delay= 20" sets the delay to 20 msec. The plugin expects the first word to be a keyword ("=counts", "reset", "delay=", etc.) and any other words to be the operands. There is no requirement or convention (yet) that the equal sign ("=") has a special meaning other than being a part of a keyword.

Performing a Periodic Computation

Some plugins may need to do processing that is not dictated purely by packet arrivals. For example, the delay plugin must periodically check to see if any meta-packets in its queue need to be forwarded to the Queue Manager. In such a case, the plugin developer can elect to use thread 7 as a periodic-computation thread which sleeps for a configurable time and then calls another user-provided function to do the periodic processing.

Although thread 7 has traditionally been used as the periodic-computation thread, the plugin developer is free to define more than one such thread and assign them to other hardware threads besides thread 7.

Helper Functions

To support plugin developers, we provide a plugin API. The API consists of helper functions for common packet processing steps as well as functions that hide some of the complexity of interacting with packets and packet meta-data. Some examples are reading packet headers from DRAM into local structures, incrementing local counters, reading and writing meta-data, preparing packets for sending to other blocks in the router, and computing checksums. Much of the complexity in these functions deals with reading or writing potentially unaligned memory so that the plugin developer need not worry about such things.

 

Learning How to Write Your First Plugin

In this section, we explain the basics of writing your first simple plugin. We use the mycount and delay plugins to illustrate these basics. The mycount plugin is a simple plugin and is not particularly useful except to demonstrate the basics of writing a plugin. The plugin demonstrates these concepts:

The plugin counts the number of meta-packets that it receives and uses one of the 25 public plugin counters ONL_ROUTER_PLUGIN_x_CNTR_y to store the meta-packet count, making it easy to chart the number of meta-packets. Here, x ranges over the plugin numbers 0 to 4, and y ranges over the counter numbers 0 to 4.

In comparison to the mycount plugin, the delay plugin is much more complex in that it:

  1. Demonstrates how to support a periodic computation;
  2. Supports a large number of control messages;
  3. Uses the 5 MB plugin SRAM region for its delay queue; and
  4. Uses SRAM storage that is automatically allocated by the microengine C compiler.
Furthermore, we show how linker flags can be used to map the automatically allocated SRAM memory to specific regions of SRAM memory that are different for different MEs.

This section begins with a Quick Start page that shows you how to create your own copy of the mycount plugin and to get it to display debug messages. This page presents recipes and gives very little explanation of what is really happening. Although much of the inner workings of a plugin may be a mystery after reading the Quick Start page, you can often write (and test) an extremely simple plugin by modifying the mycount plugin code.

But in order to write any non-trivial plugin, you will need to know a little about the inner workings of a simple plugin like the mycount plugin. So, we give a detailed tour of the mycount plugin and then discuss the following skills and concepts associated with writing a simple plugin:

Once you understand these concepts, you will be able to write a simple plugin and to understand a more complicated plugin like the delay plugin.

After discussing these basics, we discuss the new concepts used by the delay plugin and then discuss what a plugin must do to:


 Revised:  Thu, Oct 23, 2008 
