NPR Tutorial >> Writing A Plugin | TOC |
Even though you may have successfully compiled your plugin, there still may be bugs in the code that will cause the plugin to misbehave. Typical signs that there is something wrong with your plugin include:
We gave an example of testing the mycount plugin in the page NPR Tutorial => Writing A Plugin => Quick Start. But that plugin was very simple and known to have no bugs. You are likely to encounter unexpected bugs when writing your first plugin that significantly departs from any working plugin.
You cannot approach the writing, testing and debugging of a new plugin with a cavalier attitude. Such an approach will almost certainly mean little progress even after many hours of debugging. This situation is due to several factors:
You will not be able to set breakpoints and examine memory. Instead, you will have to insert calls to onl_api_debug_message() whenever you want to display state information.
Bad operations can lead to an unresponsive plugin. You will have to reload the plugin or perhaps restart the entire experiment.
Some of the functions in the ONL API already use part of the 640 words of local memory in each ME. This limitation is somewhat relieved by using SRAM, but you still need to avoid extravagant memory usage. Variable declarations that don't include the __declspec() modifier are allocated from a region of SRAM whose size is set by a linker flag (this limit can be modified to some degree), and the standard Makefile allocates each plugin ME only approximately 200,000 bytes for these compiler-generated SRAM variables. A plugin such as the delay plugin may need more SRAM (e.g., for an internal queue). For that purpose, the standard Makefile gives each plugin ME 1 MB of SRAM for its own use, but the user code must manage this space.
There are a number of pitfalls that can trip up a beginner or at the very least be annoying. For example: An assignment statement will cause an implicit context switch if it needs to access SRAM or DRAM. That is, the machine code generated by the statement will include an instruction to do a context switch and will not resume until the thread gets control again. This can lead to a race condition.
These factors lead to the following approach to code development and testing:
This reduces the amount of new code that needs to be written and allows you to begin with a majority of the code in solid, working order.
For Example: When writing the delay plugin, the most complicated related plugin we had was the mycount that counted and forwarded packets without delay. In order to extend the mycount plugin, we had to add these features:
- queue management routines for delayed packets
- additional variables to track queue statistics
- delay packets by a fixed number of milliseconds
Incremental steps will facilitate forward progress. The important point is to make the step small enough so that the plugin behavior can be verified from debug output.
For Example: When developing the delay plugin, the first plugin involved a few small changes to the handle_pkt_user() and handle_msg() routines:
- handle_pkt_user() enqueued a new meta-packet and then immediately dequeued and forwarded the meta-packet.
- handle_msg() was extended to recognize operations for returning the value of some queue management variables.
This would be a small step if the queue management routines were already debugged. During the testing of the very first plugin version, you are just trying to see if you can get one or a few packets through the plugin and whether key plugin variables have correct values.
You will be able to use a debugger and a standard operating system on a general-purpose machine. Obviously, you cannot test anything having to do with the IXP's memory hierarchy this way.
For Example: Before doing our first delay plugin test described above, we debugged the parts of the queue management routines that were unrelated to the memory hierarchy. The file ~onl/npr/plugins/delay/list.c is a version of the off-line test routine we wrote to test the queue management functions.
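An off-line test of this kind might look roughly like the following. This is a minimal sketch that runs on an ordinary host, not the contents of list.c: the queue type pktq_t, its functions, the capacity PKTQ_CAP and the handle values are all hypothetical names chosen for illustration, and a fixed-capacity ring buffer stands in for whatever structure the real plugin uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-capacity FIFO of packet handles (illustration only). */
#define PKTQ_CAP 8u

typedef struct {
    unsigned int slot[PKTQ_CAP]; /* stored packet handles */
    size_t head;                 /* index of the oldest entry */
    size_t count;                /* number of queued entries */
} pktq_t;

static void pktq_init(pktq_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* Returns 0 on success, -1 if the queue is full (the caller would drop). */
static int pktq_enqueue(pktq_t *q, unsigned int handle)
{
    if (q->count == PKTQ_CAP)
        return -1;
    q->slot[(q->head + q->count) % PKTQ_CAP] = handle;
    q->count++;
    return 0;
}

/* Returns 0 and stores the oldest handle in *out, or -1 if the queue is empty. */
static int pktq_dequeue(pktq_t *q, unsigned int *out)
{
    if (q->count == 0)
        return -1;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % PKTQ_CAP;
    q->count--;
    return 0;
}

/* Off-line self-test: returns how many handles came back in FIFO order. */
static unsigned int pktq_selftest(void)
{
    pktq_t q;
    unsigned int h = 0, i, in_order = 0;
    int rc;

    pktq_init(&q);
    rc = pktq_dequeue(&q, &h);
    assert(rc == -1);                 /* an empty queue yields nothing */

    for (i = 0; i < 3; i++) {         /* advance head so later entries wrap */
        rc = pktq_enqueue(&q, i);
        assert(rc == 0);
        rc = pktq_dequeue(&q, &h);
        assert(rc == 0 && h == i);
    }
    for (i = 0; i < PKTQ_CAP; i++) {  /* fill to capacity */
        rc = pktq_enqueue(&q, 100u + i);
        assert(rc == 0);
    }
    rc = pktq_enqueue(&q, 999u);
    assert(rc == -1);                 /* overflow is rejected */

    for (i = 0; i < PKTQ_CAP; i++) {  /* drain and check FIFO order */
        rc = pktq_dequeue(&q, &h);
        assert(rc == 0);
        if (h == 100u + i)
            in_order++;
    }
    return in_order;
}
```

Calling pktq_selftest() from a small main() on a workstation exercises the empty, full and wrap-around cases under an ordinary debugger before any of the code touches the IXP.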
The testing sequence described later takes this approach.
The page Basic Debug Messages described some functions for logging debug messages to a file. Control messages handled by handle_msg() can also be used for debugging. An alternative technique that can be used for higher-bandwidth traffic is described in the page Handling Errors. Also, the value of a key variable can be easily charted if the value is stored in one of the five public ME counters.
For Example: The delay plugin responds to the command =counts which returns the values of npkts (number of packets), maxinq (maximum number of queued packets), and ndrops (number of packets dropped). The values of npkts and ndrops can be compared to the packet statistics from ping and UDP iperf. The value of maxinq can be compared to the expected bandwidth-delay product.
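The expected value of maxinq can be estimated on a host before running the experiment. The helper below is a minimal sketch, assuming constant-rate traffic and a fixed packet size; the name expected_queue_pkts is hypothetical, and the observed maxinq may differ by a packet or so because of arrival timing.

```c
#include <assert.h>

/* Expected number of packets held in a delay queue at steady state:
 * (arrival rate in bits/s * delay in s) / bits per packet, rounded up.
 * Hypothetical host-side helper, not part of the ONL API. */
static unsigned long expected_queue_pkts(unsigned long rate_bps,
                                         unsigned long delay_ms,
                                         unsigned long pkt_bytes)
{
    unsigned long long bits_in_flight, bits_per_pkt;

    assert(pkt_bytes > 0);
    bits_in_flight = (unsigned long long)rate_bps * delay_ms / 1000u;
    bits_per_pkt = (unsigned long long)pkt_bytes * 8u;
    return (unsigned long)((bits_in_flight + bits_per_pkt - 1) / bits_per_pkt);
}
```

For example, expected_queue_pkts(1000000, 50, 1000) evaluates to 7: 1 Mbps for 50 ms is 50,000 bits, which is 6.25 packets of 8,000 bits each, rounded up.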
A good source for existing code is the code for standard plugins. See the .c and .h files in subdirectories of ~onl/npr/plugins/. Another good source is the plugin framework source code, which is in the directory ~onl/npr/pluginFramework/. Occasionally, users have found some useful code in the subdirectories of ~onl/npr/onl_router/, which contains code used by the plugin framework code.
Our first objective should be to see if low-bandwidth traffic can flow through the plugin and the plugin continues to respond to control messages. Then, we can use incrementally more difficult tests as each step seems to succeed:
The first four tests were demonstrated for the mycount plugin
in the Quick Start page.
The sections below expand on that mycount example.
You can follow these steps using your own plugin.
Test 1 (Load)
Do the following:
You can still send control messages to the plugin even if you have not created a filter to direct packets to the plugin. The mycount plugin should respond to three commands:
Successful return values from each of these commands indicate that the plugin executed its initialization properly, the hardware threads are context switching, and the handle_msg() routine is responding. This would be a good sign that nothing terrible has happened yet.
Sometimes users have an explicit "hello" or "version" command where the plugin responds with a version number. This is also a good idea when you have a short debug cycle and you want to make sure that you are executing the correct version of the plugin. Although you should see a consistent view of your files on all machines, sometimes the Network File System (NFS) may be slow in updating the version of the plugin on the NPR control processor.
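The idea of a version command can be sketched on a host as follows. Everything here is an assumption made for illustration: the command name "=version", the version string, and the function dispatch_command() are hypothetical, and the real handle_msg() receives and returns its data through the plugin framework rather than through plain C strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical version string; change it on every rebuild during a short
 * debug cycle so a stale plugin image is easy to spot. */
#define PLUGIN_VERSION "delay-0.3"

/* Host-side stand-in for the command dispatch done inside handle_msg().
 * Writes a NUL-terminated reply into reply[0..cap-1] and returns its length,
 * or -1 if the command is not recognized or the reply would not fit. */
static int dispatch_command(const char *cmd, char *reply, size_t cap)
{
    int n;

    assert(cmd != NULL && reply != NULL && cap > 0);
    if (strcmp(cmd, "=version") == 0)
        n = snprintf(reply, cap, "%s", PLUGIN_VERSION);
    else
        return -1;                     /* unknown command */
    if (n < 0 || (size_t)n >= cap)
        return -1;                     /* reply would have been truncated */
    return n;
}
```

If the reply does not show the version string you just compiled in, the control processor is probably still loading an older copy of the plugin.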
There are three possible bad outcomes:
This is the easy case since you can insert additional debug code to locate the problem.
This normally doesn't occur unless you use pointers or you make a bad function call. Although the problem(s) could be anywhere, it is likely in the new code that you wrote. You should insert debug code to see what code has executed. Begin by inserting code in the user initialization routine plugin_init_user() that shows that your variables have been initialized properly and that the handle_msg() routine has been called.
Note: It is possible that the problem is in the handle_msg() code itself. A common problem is to try to create a string using a message buffer that is too small. This may happen when several variables have very large values (because they were not initialized) and the concatenated ASCII values exceed the 28-byte message body limit. Hopefully, the message routines would check for this, but this isn't always the case.
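One defensive pattern is to format the reply with an explicit length check instead of trusting the values to be small. The following host-side sketch assumes that the 28-byte limit includes the terminating NUL and that a bounded formatter such as snprintf() is available; neither assumption is confirmed for the ME environment, and format_counts() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

#define MSG_BODY_MAX 28  /* message body limit quoted in the text */

/* Formats three counters into a reply buffer of MSG_BODY_MAX bytes.
 * Returns the reply length, or -1 (leaving "ERR" in the buffer) if the
 * formatted text would not fit. */
static int format_counts(char *out, unsigned long npkts,
                         unsigned long maxinq, unsigned long ndrops)
{
    int n;

    assert(out != NULL);
    n = snprintf(out, MSG_BODY_MAX, "%lu %lu %lu", npkts, maxinq, ndrops);
    if (n < 0 || n >= MSG_BODY_MAX) {
        /* Too long: report the overflow instead of writing past the buffer. */
        snprintf(out, MSG_BODY_MAX, "ERR");
        return -1;
    }
    return n;
}
```

Three uninitialized 32-bit counters can each print as ten digits, so together with the two separators they need 32 characters and trip this check, whereas small values such as 3, 1 and 0 fit easily.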
This is similar to the previous case but you may be able to locate the problem area more quickly. The good news is that some of the code seems to work (at least once).
Now, direct packets to the plugin with a filter, and see if you can send a few ping packets through the plugin (e.g., "ping -c 3 n2p3"). Then, follow up with some control messages to see if the plugin is still responsive. If this test is successful, it indicates that the plugin can handle low-intensity traffic and works at some rudimentary level.
There are mainly four bad outcomes:
This is the easy case since the incorrect values may pinpoint the problem code where you can insert additional debug code.
Hopefully, the debug log file will give some indication of where the problem is located. Quite often, a plugin becomes unresponsive to control messages because an assignment has run off the end of an array. Check the debug and control message calls themselves.
The fundamental question is "How far did the first missing packet get?" It is possible that the packet got to the destination, but that the ICMP echo reply packet never made it back to the sender. Here are some things to consider:
- Did the packet get to the input port of the NPR with the plugin?
You can chart the packet counter at the input port (Port X => Monitoring => RXPKT), which increments for every packet seen coming from port X by the NPR's RX block.
- Did the packet get to the plugin?
You can chart Plugin Counter 0 (Monitoring => Plugin Counter), which increments for every packet seen by handle_pkt_user().
- Did the packet get out of the plugin's NPR?
You can chart the packet counter at the output port (Port X => Monitoring => TXPKT), which increments for every packet seen going out of port X by the NPR's TX block.
- Did the packet get to the next hop?
The RXPKT and TXPKT counters can be monitored along the entire packet path to see where the lost packet was last seen. If the next hop is a host, you can run netstat -i eth1 before and after sending traffic to see how many packets made it to the destination host.
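Once the counter increases have been read off along the path, finding where a packet disappeared is a simple scan. The sketch below assumes you have recorded, in path order, how much each counter grew during the test; the function name first_lossy_hop() and the array layout are hypothetical, and counters that also see unrelated traffic can read higher than the number of packets you sent.

```c
#include <assert.h>
#include <stddef.h>

/* delta[i] is the increase observed at the i-th counter along the path
 * (for example: input RXPKT, Plugin Counter 0, output TXPKT, next-hop RXPKT).
 * Returns the index of the first counter that saw fewer packets than were
 * sent, or -1 if every counter saw them all. */
static int first_lossy_hop(const unsigned long *delta, size_t nhops,
                           unsigned long sent)
{
    size_t i;

    assert(delta != NULL || nhops == 0);
    for (i = 0; i < nhops; i++) {
        if (delta[i] < sent)
            return (int)i;
    }
    return -1;
}
```

With three pings sent and increases of 3, 3, 2 and 2, the function returns 2: the packet reached the plugin but was not seen by the TX block, so the plugin's forwarding path is the place to add debug output.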
This is like the last Test 2 case, and you will need to insert debug code to isolate the problem.
Now, send a continuous sequence of ping packets by omitting
the "-c 3" flag.
The plugin should be able to handle this traffic load even if you
are outputting debug messages.
Then, repeat the verification steps in the previous test.
Test 5 (UDP/TCP Traffic)
Now, you are ready to send UDP packets but remember that you will lock up the plugin if you send more than a few packets per second and the plugin is sending debug messages. So, you may want to send the UDP packets at a low rate. For example, the iperf command:
iperf -c n2p3 -u -b 12k
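The packet rate implied by such a command can be checked with a one-line calculation. This is a sketch under two assumptions that are not stated in the text: that "12k" means 12,000 bits per second and that iperf sends its default UDP datagram of 1,470 bytes. The helper name pkts_per_sec() is hypothetical.

```c
#include <assert.h>

/* Packets per second produced by a constant-bit-rate sender.
 * pkt_bytes is the size counted against the rate (for iperf UDP traffic,
 * the datagram payload). */
static double pkts_per_sec(double rate_bps, unsigned int pkt_bytes)
{
    assert(pkt_bytes > 0);
    return rate_bps / (8.0 * pkt_bytes);
}
```

Under those assumptions, pkts_per_sec(12000.0, 1470) is roughly 1.02, so the command above sends about one packet per second, which is slow enough to leave debug messages enabled.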
Even if your plugin is for TCP packets, it might be worthwhile to test a version of the plugin with UDP traffic, even if it means changing the plugin code. The difference between UDP and TCP traffic when using iperf is that the UDP traffic will be nearly constant rate but the TCP traffic will be bursty. In fact, during the slow start phase, there may be pairs of back-to-back packets spaced by approximately the transmission delay of a single packet. Furthermore, the average input rate may reach twice the bottleneck capacity. This bursty behavior will be more stressful than that of UDP iperf traffic. One way to limit this burstiness is to set the bottleneck capacity to some low rate, but note that the smallest NPR port rate is 0.683 Mbps, which translates to about 57 packets per second for maximum-sized (1,500-byte) packets.
Revised: Fri, Jan 30, 2009