The ONL NPR Tutorial

Packet Processing

Introduction

We present a high-level description of the NPR's architecture, just enough to deepen your understanding of the packet forwarding and monitoring features of the RLI and the programmability of the NPRs. As shown in the following table, other pages cover additional architecture topics in greater detail:

Example Topic(s)                                                    Primary Audience                          Link
IXP architecture, IXP programming concepts                          Plugin Developer, NPR Developer           The IXP in Brief
Plugin API, plugin programming tips                                 Plugin Developer                          The Plugin Framework
Data plane, control plane                                           Plugin Developer, NPR Developer           Operational Principles of the NPR
Packet classification, packet scheduling, interblock communication  Advanced Plugin Developer, NPR Developer  Implementation Details of the NPR

The IXP 2800

[[ ixp-diagram.png Figure ]]

The IXP 2800 has some unusual architectural features because it was designed specifically for rapid development of high-performance networking applications rather than for general-purpose computing. The block diagram above shows the major components of the IXP. Sixteen multi-threaded microengine (ME) cores operate in parallel to do most of the packet processing. Within each ME, the threading mechanism hides the gap between memory and processor speeds in modern processors by overlapping memory access with processing.

One of the most striking differences from general-purpose processors is the memory subsystem. There are no memory caches, since networking applications exhibit little reference locality. Generally, Dynamic RAM (DRAM) is used for packet buffers. Static RAM (SRAM) is used for packet meta-data, ring buffers for inter-block communication, and large system tables. One of the SRAM channels also supports a Ternary Content-Addressable Memory (TCAM), which is used for IP route lookup and packet classification. The scratchpad memory is used for smaller ring buffers and tables.

The IXP 2800 has the following types of memory (fastest first):

Type                        Relative Size      Where
General-Purpose Registers   Small (256)        Each ME
Program Store               Small (8K words)   Each ME
Local Memory                Small              Each ME
Scratchpad Memory           Small              One (shared within 1 NPR)
TCAM                        Small              One (shared between 2 NPRs)
SRAM                        Moderate           Four (4) channels (shared)
DRAM                        Large              Three (3) channels (shared)

Data Plane Basics

[[ data-plane.png Figure ]]

Packets are processed in a pipelined fashion. In the diagram above, packets enter from the left. The main processing path through the NPR (the fast path) involves the following sequence of processing blocks:

Block                Abbreviation   Main Function(s)
Receive              Rx             Store packet into DRAM buffer
Multiplexer          Mux            Create packet's meta-data in SRAM; multiplex packets from other blocks
Parse, Lookup, Copy  PLC            Inspect packet header; forward to another block based on result
Queue Manager        QM             Schedule packet for transmission
Header Format        HF             Prepare Ethernet header
Transmit             Tx             Transfer packet to external link; deallocate packet buffer

The main data flow proceeds in a pipelined fashion starting with the Receive (Rx) block and ending with the Transmit (Tx) block. Packets received from the external links are passed to the Rx block which stores them into DRAM packet buffers. Rx allocates a new buffer for each incoming packet and passes a meta-packet to the next block in the pipeline. A meta-packet is a message that contains a packet pointer and some meta-data (data about the packet).
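As a concrete illustration, a meta-packet can be modeled as a small C struct carrying a buffer reference plus a few bytes of meta-data. The field names and widths below are illustrative assumptions, not the NPR's actual format:

```c
#include <stdint.h>

/* Hypothetical layout of a meta-packet: the small message passed
 * between pipeline blocks instead of the packet itself.  Field
 * names and widths are illustrative assumptions. */
typedef struct {
    uint32_t buf_handle;   /* reference to the packet's DRAM buffer */
    uint16_t pkt_length;   /* total packet length in bytes          */
    uint8_t  in_port;      /* interface the packet arrived on       */
    uint8_t  flags;        /* per-packet control bits               */
} meta_packet;

/* Blocks forward this ~8-byte record; the packet body stays in its
 * DRAM buffer, avoiding memory-to-memory copies. */
static inline meta_packet make_meta(uint32_t buf, uint16_t len, uint8_t port)
{
    meta_packet m = { buf, len, port, 0 };
    return m;
}
```

Passing an 8-byte record between blocks is far cheaper than copying a packet body that may be over a thousand bytes long.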

It is worth repeating that the information passed between blocks consists of packet references and selected pieces of header information, not packet data. This avoids potentially large, slow memory-to-memory copies. In most cases we use the terms meta-packet and packet interchangeably when we mean meta-packet; context should make clear when packet refers to the actual packet data, which is almost never copied.

The Multiplexer (Mux) block serves two purposes. First, it creates and stores in SRAM the meta-data (e.g., a packet's length) needed by other blocks. Second, it multiplexes packets coming from blocks other than Rx back into the main pipeline. This includes packets coming from the XScale or plugins. A simple user-configurable priority is used to determine how to process packets from the different sources.

The Parse, Lookup, and Copy (PLC) block is the heart of the router. Packet headers are inspected to form a lookup key which is used to find matching route and filter entries in the TCAM. Based on the result of the TCAM lookup, PLC takes one of five actions.

It is worth noting that three MEs are used to implement PLC processing. All three MEs run the same PLC code block, with the only constraint being that packets leave the PLC in the same order that they arrived. This approach yields higher performance than the alternative of pipelining the Parse, Lookup, and Copy functions across separate MEs, primarily due to the nature of the operations in PLC. Parse alternates between computation and high-latency DRAM reads of packet headers. Lookup spends most of its time waiting on TCAM responses. And Copy is computation-bound because complex route and filter results must be interpreted. Combining all three functions in each ME provides enough computation for each thread to overlap its memory operations with processing.

The Queue Manager (QM) block places incoming packets into one of 8K per-interface queues. A weighted deficit round robin (WDRR) scheduling algorithm selects packets to send to Header Format, with one scheduling thread per external interface. Associated with each queue are a WDRR quantum and a discard threshold, both configurable from the RLI. When the number of bytes in a queue exceeds its discard threshold, newly arriving packets for that queue are dropped.
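The quantum/threshold mechanics above can be sketched in C. This is a minimal illustration assuming fixed-size packets, not the NPR's actual 8K-queue implementation:

```c
#include <stdint.h>

/* Sketch of weighted deficit round robin (WDRR) with per-queue
 * discard thresholds, as described above.  Assumes fixed-size
 * packets for simplicity. */
#define PKT_LEN 500u   /* fixed packet size for this sketch */

typedef struct {
    uint32_t bytes;      /* bytes currently enqueued                */
    uint32_t quantum;    /* WDRR credit granted each scheduling round */
    uint32_t deficit;    /* accumulated, unspent credit             */
    uint32_t threshold;  /* discard threshold in bytes              */
} wdrr_queue;

/* Enqueue: drop the newly arriving packet if the queue already
 * holds more bytes than its discard threshold. */
int wdrr_enqueue(wdrr_queue *q)
{
    if (q->bytes > q->threshold)
        return 0;                 /* packet dropped */
    q->bytes += PKT_LEN;
    return 1;
}

/* One WDRR round: grant each backlogged queue its quantum, then
 * send packets while credit remains.  Returns bytes sent. */
uint32_t wdrr_round(wdrr_queue *qs, int n)
{
    uint32_t sent = 0;
    for (int i = 0; i < n; i++) {
        wdrr_queue *q = &qs[i];
        if (q->bytes == 0) {      /* idle queues keep no credit */
            q->deficit = 0;
            continue;
        }
        q->deficit += q->quantum;
        while (q->bytes >= PKT_LEN && q->deficit >= PKT_LEN) {
            q->deficit -= PKT_LEN;
            q->bytes   -= PKT_LEN;
            sent       += PKT_LEN;
        }
    }
    return sent;
}
```

A queue with a quantum twice as large as another's receives roughly twice the bandwidth over many rounds, which is how the RLI's per-queue quantum setting translates into a bandwidth share.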

The Header Format (HF) block prepares the outgoing Ethernet header information for each packet. It ensures that multiple copies of a packet (that have potentially different Ethernet addresses) are handled correctly.

The Transmit (Tx) block transfers each packet from its DRAM buffer to the proper external link, and deallocates the buffer.

There are two additional blocks (not shown) which are used by all the other blocks in the router. The Freelist Manager (FM) block reclaims resources associated with a packet (e.g., the DRAM buffer and the SRAM meta-data), making them available for re-allocation. Whenever a packet is dropped, or when Tx has transmitted a packet, the packet reference is sent to the FM.
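The buffer recycling the FM performs can be sketched as a simple stack of free buffer handles. This is an illustration of the idea, not the NPR's FM implementation:

```c
#include <stdint.h>

/* Sketch of a freelist manager: buffers returned after a drop or a
 * successful transmit are handed back out for new packets.  A LIFO
 * stack of buffer handles; sizes are illustrative. */
#define NBUFS 8

typedef struct {
    uint32_t free[NBUFS];
    int      top;
} freelist;

void fl_init(freelist *fl)
{
    fl->top = 0;
    for (uint32_t h = 0; h < NBUFS; h++)
        fl->free[fl->top++] = h;      /* all buffers start free */
}

/* Rx calls this to obtain a buffer for an incoming packet. */
int fl_alloc(freelist *fl, uint32_t *handle)
{
    if (fl->top == 0)
        return 0;                     /* out of buffers */
    *handle = fl->free[--fl->top];
    return 1;
}

/* Tx (after transmit), or any block dropping a packet, returns
 * the buffer handle here for re-allocation. */
void fl_release(freelist *fl, uint32_t handle)
{
    fl->free[fl->top++] = handle;
}
```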

The Statistics (Stats) block keeps track of the 64K counters that can potentially be updated as packets progress through the router. Other blocks in the system issue counter update requests to Stats. For example, there are per-port receive and transmit counters which are updated whenever packets are successfully received or transmitted. There are also SRAM counters for each route or filter entry that are updated before and after packets are processed by the QM.

Stats Counters And Indices

The NPR keeps track of a large number of counts. The most basic counters allow you to monitor bandwidth and packet rates coming into and out of ports. The RLI makes these counters visible to the user through RLI menu items. But it is possible for the NPR to have many more counters (up to 256K) that can be updated as packets progress through the router.

There is a 1 MB region of SRAM that holds 4-byte counters for monitoring. These counters fall into two groups: register counters and stats counters.

The register counters are listed in the Summary Information => Counters page. Some documents refer to these counters as global register counters. The 64 register counters keep track of such things as queue lengths and packet drop counts.

The RLI has menu items for some of these counters (e.g., Monitoring => ReadQLength in the port menu), while others must be accessed through the Monitoring => ReadRegisterByte and Monitoring => ReadRegisterPacket menu items in the NPR icon menu. For example, there is no menu item for the number of packets dropped by the Queue Manager, but you can find out how many packets a Queue Manager has dropped by reading register counter 31.

The remaining counters are stats counters that are accessed through stats indices. A stats index is an integer that provides access to a group of four 4-byte counters: a pre-queue packet count, a pre-queue byte count, a post-queue packet count, and a post-queue byte count.

The pre-queue counters are updated before a packet is processed by the Queue Manager, and the post-queue counters are updated after a packet is processed by the Queue Manager. One stats index is created for each route table entry and each filter.
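The addressing scheme implied above (a 1 MB SRAM region of 4-byte counters, so 256K counters in all, grouped four per stats index) can be sketched in C. The group layout and names here are assumptions:

```c
#include <stdint.h>

/* Sketch of the stats-counter layout: a stats index selects a group
 * of four 4-byte counters in a 1 MB SRAM region (1 MB / 4 bytes =
 * 256K counters, i.e. 64K four-counter groups).  The exact group
 * layout is an assumption. */
enum { PRE_PKTS = 0, PRE_BYTES = 1, POST_PKTS = 2, POST_BYTES = 3 };

static uint32_t sram_counters[1u << 18];   /* 256K counters = 1 MB */

static inline uint32_t *stats_group(uint32_t index)
{
    return &sram_counters[index * 4];      /* four counters per index */
}

/* Pre-queue update: issued before the QM processes the packet. */
void stats_pre(uint32_t index, uint32_t pkt_len)
{
    uint32_t *g = stats_group(index);
    g[PRE_PKTS]  += 1;
    g[PRE_BYTES] += pkt_len;
}

/* Post-queue update: issued after the QM processes the packet. */
void stats_post(uint32_t index, uint32_t pkt_len)
{
    uint32_t *g = stats_group(index);
    g[POST_PKTS]  += 1;
    g[POST_BYTES] += pkt_len;
}
```

Comparing a group's pre-queue and post-queue counts reveals how many packets (and bytes) the queue dropped for that route or filter.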

The Control Plane

[[ control-plane.png Figure ]]

The software running on the (ARM-based) XScale control processor (CP) implements the NPR's control plane.

An NPR runs the Linux operating system on the XScale, with a user-space control daemon on top of Linux. The XScale daemon starts the router by loading the MEs with the base router code (i.e., everything except plugins) and enabling the ME threads to run. Once the data path blocks have been loaded successfully, the daemon is ready to handle several types of user requests.

All messages generated by the XScale are sent back through Mux to PLC, allowing users to add filters to redirect these packets to plugins for special processing.

Programmability

Plugins and filters are powerful mechanisms that allow the user to customize the operations on micro-flows and macro-flows and to customize the processing of individual packets.

Filters

Filters are used to modify the default path that a packet takes through the router. Without filters, each packet is forwarded through the router's fast path to the output port determined from a search of the route table. That is, the packet's destination IP address is compared to the route table, and the entry with the longest matching address prefix is returned. The result contains the external interface to which the packet should be forwarded. With filters, matching can be done on more fields than just the destination IP address, and the packet disposition can include more than just output-port forwarding. Matching is based on the following fields: source and destination IP addresses, transport-layer port number, protocol type, TCP flags, TTL, IP option, and plugin tag. Packet disposition includes output port/queue forwarding, dropping, duplication (auxiliary), multicast, plugin ME, and sampling. In addition, IP address values can include ranges of addresses expressed in CIDR notation.

[[ filter.png Figure ]]

The figure above shows an RLI dialog box for adding filters. In this example, the beginning of any HTTP (destination port 80) flow from hosts in the 192.168.0.0/16 subnet going to host 192.168.1.64 will be matched. The TCP flags indicate that only TCP SYN packets (and not SYN-ACKs) will match. This filter directs matching TCP SYN packets to the plugin running on ME 2 (output plugin) and then on to queue 100 (qid) at output port 4.

There are actually two different filter types in the NPR: primary filters and auxiliary filters. A matching primary filter can override a route entry if it has a higher priority than the matching route entry. A matching auxiliary filter, on the other hand, does not override either a route entry or a filter entry. Instead it is used to effectively duplicate a packet (it actually creates a reference to the packet). All routes are assigned the same priority, while each filter has its own priority. When a packet matches multiple primary filters, the highest-priority filter is selected unless the route priority is higher; in that case, the matching route determines how the packet is forwarded. For auxiliary filters, the highest-priority matching auxiliary filter is considered the match. A single packet can thus match one primary filter or route entry and one auxiliary filter, effectively resulting in two independent copies of the packet.
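The selection rule just described can be sketched in C: the best matching primary filter wins only if it outranks the (shared) route priority, while the best auxiliary filter is chosen independently. The structures and the route priority value are illustrative assumptions:

```c
#include <stddef.h>

/* Sketch of primary/auxiliary filter resolution as described above.
 * Structures and the route priority value are illustrative. */
typedef struct {
    int priority;
    int is_aux;       /* nonzero for auxiliary filters          */
    int matched;      /* TCAM match result, precomputed here    */
} filter;

#define ROUTE_PRIORITY 50   /* all routes share one priority (assumed value) */

/* Returns the winning primary filter, or NULL if the route wins. */
const filter *select_primary(const filter *fs, int n)
{
    const filter *best = NULL;
    for (int i = 0; i < n; i++)
        if (fs[i].matched && !fs[i].is_aux &&
            (!best || fs[i].priority > best->priority))
            best = &fs[i];
    if (best && best->priority > ROUTE_PRIORITY)
        return best;
    return NULL;            /* route entry determines forwarding */
}

/* Returns the winning auxiliary filter (extra copy), or NULL. */
const filter *select_aux(const filter *fs, int n)
{
    const filter *best = NULL;
    for (int i = 0; i < n; i++)
        if (fs[i].matched && fs[i].is_aux &&
            (!best || fs[i].priority > best->priority))
            best = &fs[i];
    return best;
}
```

Note the two selections are independent: a packet can take both a primary/route disposition and an auxiliary disposition at once, yielding the two effective copies mentioned above.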

Auxiliary filters are normally used for packet monitoring. In addition, auxiliary filters support four sampling rates. A sampling rate R means that approximately R% of matching packets will be selected. By default the available rates are 100%, 50%, 25%, and 12%, but these values are configurable. Each NPR supports 32K primary filters and 16K auxiliary filters.
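One cheap way to implement such rates is a per-filter counter and mask, assuming the default rates correspond to powers of two (100%, 1/2, 1/4, and 1/8, with 12% displayed for 12.5%). That power-of-two reading, and the mask scheme itself, are assumptions, not the NPR's documented implementation:

```c
#include <stdint.h>

/* Sketch of deterministic, mask-based sampling for auxiliary
 * filters.  Assumes rates are powers of two; illustrative only. */
typedef struct {
    uint32_t count;   /* packets matched so far                     */
    uint32_t mask;    /* 0 -> 100%, 1 -> 50%, 3 -> 25%, 7 -> 12.5%  */
} sampler;

/* Returns 1 if this matching packet should be selected. */
int sample(sampler *s)
{
    return (s->count++ & s->mask) == 0;
}
```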

Plugins

The plugin framework allows users to process packets after the PLC block. Recall two concepts from the discussion above: packets are passed between blocks as meta-packets, and filters can direct matching packets to a plugin ME.

Users are free to load any combination of code blocks onto these five MEs. By default, each plugin ME reads meta-packets from its own input ring and writes to either the Queue Manager's input ring or the Multiplexer's input ring, but the plugin writer is free to modify this behavior. For example, all five plugins could be programmed to act as a pipeline.
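The default behavior just described amounts to a read-process-write loop over rings. Below is a minimal sketch of a pass-through plugin step; the ring layout and sizes are illustrative assumptions:

```c
#include <stdint.h>

/* Sketch of the default plugin loop: read meta-packets from the
 * plugin's own input ring and write them to the Queue Manager's
 * input ring.  Ring layout is an illustrative assumption. */
#define RING_SLOTS 16   /* must divide 2^32 for the wrap arithmetic */

typedef struct {
    uint32_t slot[RING_SLOTS];   /* meta-packet handles */
    unsigned head, tail;
} ring;

int ring_put(ring *r, uint32_t h)
{
    if (r->tail - r->head == RING_SLOTS)
        return 0;                        /* ring full */
    r->slot[r->tail++ % RING_SLOTS] = h;
    return 1;
}

int ring_get(ring *r, uint32_t *h)
{
    if (r->head == r->tail)
        return 0;                        /* ring empty */
    *h = r->slot[r->head++ % RING_SLOTS];
    return 1;
}

/* One iteration of a pass-through plugin: move every available
 * meta-packet from the plugin's input ring to the QM's input ring.
 * (A real plugin would also handle the QM ring being full.) */
int plugin_step(ring *my_input, ring *qm_input)
{
    uint32_t h;
    int moved = 0;
    while (ring_get(my_input, &h) && ring_put(qm_input, h))
        moved++;
    return moved;
}
```

A pipeline of plugins falls out naturally: plugin N simply writes to plugin N+1's input ring instead of the QM's.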

Plugins are written in microengine C, the standard C-like language provided by Intel. The most important differences between microengine C and ANSI C are dictated by the IXP architecture.

Also, the programming model relies on each ME's non-preemptive threads to overlap memory access latency with ME processing.

The NPR provides a programming framework to help users who are unfamiliar with this programming environment and thus lower the entry barrier for writing simple to moderately complex plugins. Note that users are not required to use the framework. Users who are already experts with the IXP can do whatever they wish with the five plugin MEs. The framework consists of a basic plugin structure that handles tasks common to most plugins and a plugin API that provides many useful packet processing functions. Details about the programming framework are provided later in the pages on plugin programming.


 Revised:  Tue, Aug 12, 2008 

  
  
