stringSub Plugin

<< Function >>

o Substitutes each occurrence of the character sequence 'hello' in a packet
  with the character sequence 'adieu'.  Note that no substitution will occur
  if an occurrence of 'hello' straddles a packet boundary.
o This version of stringSub is both more general and more restrictive than
  the standard plugin version.  The standard plugin version only handles
  UDP packets with a payload that is 32 bytes or smaller but allows the user
  to change the search and replacement character sequences through the control
  interface.  This version handles arbitrary length packets but only searches
  for 'hello' and replaces it with 'adieu'.
o Like the 'vowel' plugin, the code in handle_pkt_user() contains two
  algorithms:
  1) SCAN_DRAM:  scan DRAM byte-by-byte; and
  2) SCAN_LMEM:  scan "32-byte chunks" in local memory that have been read
  from DRAM.
  The SCAN_LMEM algorithm here is much more complicated because 'hello' can
  straddle two chunks, so two chunks must be scanned before discovering
  that the first chunk needs to be modified and written back to DRAM.  Also,
  the algorithm uses chunks that are aligned on a quadword boundary, meaning
  that the first and last chunks may be only partially occupied by the packet.
  See "Methods" section for a description of the slow SCAN_DRAM and faster
  SCAN_LMEM methods.  SCAN_DRAM is the default.
o WARNING:  The plugin doesn't use a callback thread.  So, don't define
  CALLBACK_THREAD in the Makefile (see "Lessons Learned").

<< Methods >>

o SCAN_DRAM

	Let:	q be a ptr to dram
	Then:	Access each byte using *q++
	Details:
		q = start of payload;
		N = payload size - 4;
		for( i=0; i<N; ) {
		    if( is_hello_dram(q) ) {
			set_adieu_dram( q );
			++my_nmods;
			q += 5;
			i += 5;
		    } else {
			q++;
			i++;
		    }
		}
	# NOTE:
	# is_hello_dram(q) is used instead of strncmp_dram(q, "hello")
	# because the user doesn't have any personal DRAM to use.
	# is_hello_dram() uses immediate data for 'h', 'e', 'l', 'l' and 'o'.
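
	Sketch:	The loop above can be tried out on an ordinary byte buffer.
		This is a minimal host-side sketch, not the plugin code:
		is_hello(), set_adieu() and scan_bytes() are hypothetical
		stand-ins for the DRAM routines, and the loop bound is written
		as i + 5 <= len (the same as i < N) so that it stays safe for
		payloads shorter than 4 bytes when the length is unsigned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for is_hello_dram(): compares against immediate
 * character constants, so no pattern string has to live in memory. */
static int is_hello(const unsigned char *q)
{
    return q[0] == 'h' && q[1] == 'e' && q[2] == 'l' &&
           q[3] == 'l' && q[4] == 'o';
}

/* Hypothetical stand-in for set_adieu_dram(). */
static void set_adieu(unsigned char *q)
{
    q[0] = 'a'; q[1] = 'd'; q[2] = 'i'; q[3] = 'e'; q[4] = 'u';
}

/* Scan payload[0..len-1] in place; return the number of substitutions. */
static unsigned scan_bytes(unsigned char *payload, size_t len)
{
    unsigned nmods = 0;
    size_t i = 0;

    assert(payload != NULL || len == 0);

    /* Stop once fewer than 5 bytes remain: 'hello' cannot fit there. */
    while (i + 5 <= len) {
        if (is_hello(payload + i)) {
            set_adieu(payload + i);
            ++nmods;
            i += 5;             /* skip past the replaced bytes */
        } else {
            ++i;
        }
    }
    return nmods;
}
```

	Usage:	calling scan_bytes(buf, len) on one packet payload returns the
		value that would be added to my_nmods for that packet.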

o SCAN_LMEM

	Idea:	Form a ring buffer of 2 32-byte chunks in a char array.
		Search for 'hello' and replace with 'adieu' where the search
			can spill over to the next chunk.
	Let:	n be the number of bytes remaining to be processed
		buf[k] be the current local memory buffer character under the
			cursor where k is the current cursor index
		m be the number of leading bytes matching 'hello' 
	Then:
		Read into local memory the first 2 chunks;
		Initialize k and n;
		while( there are chunks to scan ) { 
		    while( current chunk is not done ) {
			m = nmatch_hello_lmem( k, buf );// #of leading bytes
		    					// matching 'hello'
			if( m == 0 ) {		// no match
			    k = (k+1)%64;
			    --n;
			} else if( m == 5 ) {	// complete match
			    set_adieu_lmem( k, buf );	// Do substitution
			    if( k < 59 )	k += 5;
			    else		k = (k+5)%64;
			    n = n - 5;
			} else {		// partial match 1 <= m < 5
			    k = (k+m)%64;
			    n = n-m;
			}
		    }
		    Write chunk to dram if dirty;
		    if( n > 32 )	Read next chunk into ring buffer;
		}
		if( last chunk is dirty)	Write chunk to dram;

	Def:  a chunk is a 32-byte memory block that is aligned on a
		quadword boundary.

			start of payload
			     |
			     v
			 -------------------------------
			|    chunk 0	|    chunk 1	|
			 -------------------------------
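
	Sketch:	The chunk layout for a payload follows from this definition.
		The code below is an illustration only; it assumes byte
		addressing and 32-bit addresses, and struct chunk_layout and
		chunk_layout_for() are hypothetical names that do not come
		from the plugin.

```c
#include <assert.h>
#include <stdint.h>

#define QWORD_BYTES 8u    /* a quadword is 8 bytes */
#define CHUNK_BYTES 32u   /* a chunk is 4 quadwords */

/* Hypothetical description of how a payload maps onto aligned chunks. */
struct chunk_layout {
    uint32_t first_chunk_addr;  /* quadword-aligned address of chunk 0 */
    uint32_t lead_pad;          /* bytes in chunk 0 before the payload */
    uint32_t nchunks;           /* chunks needed to cover the payload */
    uint32_t tail_used;         /* bytes of the last chunk that are in use,
                                   counted from the start of that chunk */
};

static struct chunk_layout chunk_layout_for(uint32_t payload_addr,
                                            uint32_t payload_len)
{
    struct chunk_layout c;
    uint32_t span;

    assert(payload_len > 0);

    c.lead_pad = payload_addr % QWORD_BYTES;
    c.first_chunk_addr = payload_addr - c.lead_pad;

    /* Bytes from the start of chunk 0 to the end of the payload. */
    span = c.lead_pad + payload_len;
    c.nchunks = (span + CHUNK_BYTES - 1) / CHUNK_BYTES;
    c.tail_used = span - (c.nchunks - 1) * CHUNK_BYTES;
    return c;
}
```

	For example, an 84-byte payload starting 4 bytes past a quadword
	boundary needs 3 chunks, and only the first 24 bytes of the last
	chunk are in use.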

	Note 1: The chunk array contains 2 chunks and is treated like a ring
		buffer where buf[0] follows buf[63].  This is needed to
		handle the most difficult case where "hello" straddles 2
		chunks and you need to modify the previous chunk because
		of the "adieu" substitution.
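
	Sketch:	The wrap-around matching can be illustrated on a plain 64-byte
		char array.  The two functions below are hypothetical
		host-side versions of nmatch_hello_lmem() and set_adieu_lmem();
		the real routines may differ.  The small pattern tables are a
		convenience for the sketch and are not meant to reflect how
		the plugin stores the characters.

```c
#include <assert.h>

#define RING_BYTES 64   /* two 32-byte chunks */

/* Number of leading bytes of 'hello' found at buf[k], buf[k+1], ...
 * with every index taken modulo 64 (0 = no match, 5 = complete match). */
static int nmatch_hello(unsigned k, const char *buf)
{
    static const char pat[5] = { 'h', 'e', 'l', 'l', 'o' };
    unsigned m = 0;

    assert(k < RING_BYTES);
    while (m < 5 && buf[(k + m) % RING_BYTES] == pat[m])
        ++m;
    return (int)m;
}

/* Overwrite the 5 bytes starting at buf[k] with 'adieu', wrapping from
 * buf[63] to buf[0] when the match straddles the end of the array. */
static void set_adieu(unsigned k, char *buf)
{
    static const char rep[5] = { 'a', 'd', 'i', 'e', 'u' };
    unsigned i;

    assert(k < RING_BYTES);
    for (i = 0; i < 5; ++i)
        buf[(k + i) % RING_BYTES] = rep[i];
}
```

	Skipping m bytes after a partial match loses nothing because 'h'
	occurs only as the first character of 'hello', so a new match
	cannot begin inside the bytes that were already matched.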

<< Control Messages >>

Type	Command Semantics			Example
Set	alg= X	Set algorithm to X where 0	IN:  "alg= 1"
		means SCAN_DRAM and anything	OUT: "1"
		else means SCAN_LMEM
Get	=vers	Display version number		IN:  "=vers"
						OUT: "1.3"
Get	=counts	Display counts			IN:  "=counts"
		 (npkts, nbytes, nmods)		OUT: "1 118 34"
Get	=alg	Display algorithm		IN:  "=alg"
						OUT: "0"
Get	=errno	Display errno[0-4]		IN:  "=errno"
						OUT: "0 0 0 0 0"
Misc	reset	Reset parameters, nbytes[],	IN:  "reset"
		nvowels[], errno[] counters,	OUT: "OK"
		etc.
Misc	debug	Toggle debug_on			IN:  "debug"
						OUT: "0"

<< Testing >>

o Use the 'nc' command to send UDP pkts

	server> nc -w 1 -udl 2000 | tee outfile	# listen on port 2000
	client> ping -c 3 192.168.2.32 2000	# prime arp tables
	client> nc -w 1 -u 192.168.2.32 2000 < file.in
						# send contents of file.in
						#  (assume n2p1 rcvr) using udp
  - If you do not use the "-w 1" argument, enter ctrl-c to terminate nc.
  - An alternative to ping is to run 1 or 2 experiments to get arp tables primed

o I tried 4 test cases

	Input File	Output File	#Subst	Description
	----------------------------------------------------
	file.in		out1		2	1-pkt test
	file2.in	out2		24	2-pkt test
	file3.in	out3		327	3-pkt test
	file4.in	out4		471	3-pkt test (each 1024 bytes)

	Note:  file3.in contains 2132 bytes and 8*41 = 328 instances of "hello",
	but one instance straddles 2 pkts, so only 327 are substituted.
	nc sends out 3 pkts with udp payload lengths of 1024, 1024 and 84
	bytes (1024 + 1024 + 84 = 2132).

o Both algorithms produced the expected output files and =counts replies
  except that the SCAN_LMEM algorithm doesn't send out the last pkt of
  file3.in (see "Things To Do").

<< Things To Do >>

o Bug in SCAN_LMEM algorithm when input file is file3.in: the SCAN_LMEM
  algorithm doesn't send out the last pkt (84 bytes) even though the
  plugin processes it and counts the correct number of substitutions.
  The output file indicates that the second pkt was written out OK, but
  the third pkt was never received (confirmed by tcpdump).

o Instead of selecting the algorithm through the control interface
  ('alg= X'), you can reduce the memory usage and increase running speed by
  using conditional compilation to select the algorithm.
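
  A minimal sketch of that idea follows.  It is an assumption about how the
  selection could look, not the plugin's code: the SCAN_ALG macro,
  selected_alg() and scan_payload() are hypothetical names, and the two
  scan bodies are placeholders.  Only one body is compiled in, so the
  unused algorithm costs no code space and no run-time test.

```c
#include <assert.h>
#include <stddef.h>

#define SCAN_DRAM 0   /* same numbering as the 'alg= X' control message */
#define SCAN_LMEM 1

/* Hypothetical build-time switch, e.g. -DSCAN_ALG=SCAN_LMEM in the
 * Makefile.  SCAN_DRAM stays the default. */
#ifndef SCAN_ALG
#define SCAN_ALG SCAN_DRAM
#endif

#if SCAN_ALG != SCAN_DRAM && SCAN_ALG != SCAN_LMEM
#error "SCAN_ALG must be SCAN_DRAM or SCAN_LMEM"
#endif

/* Report which algorithm was compiled in (could back the '=alg' reply). */
static int selected_alg(void)
{
    return SCAN_ALG;
}

/* Scan one payload with the compiled-in algorithm; returns the number of
 * substitutions.  Both bodies are placeholders in this sketch. */
static unsigned scan_payload(unsigned char *payload, size_t len)
{
    assert(payload != NULL || len == 0);
#if SCAN_ALG == SCAN_LMEM
    /* Placeholder: the chunked local-memory scan would go here. */
    (void)payload; (void)len;
    return 0;
#else
    /* Placeholder: the byte-by-byte DRAM scan would go here. */
    (void)payload; (void)len;
    return 0;
#endif
}
```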

<< Lessons Learned >>

o If you don't plan to use a callback thread, change the Makefile to not define
  CALLBACK_THREAD.  If you do include the callback thread, you must have
  callback() sleep; otherwise handle_msg() won't get many cycles and will
  appear to be dead.

o onl_api_ua_read_8W_dram() actually reads whole quadwords (8 bytes each)
  and then copies out the bytes you requested.  It calls dram_read() to do
  the reading from DRAM, and reads either 4 or 5 quadwords depending on
  whether or not the address is aligned on a quadword boundary (4 quadwords
  is 32 bytes or 8 words).
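
  The 4-versus-5 count is just the number of quadwords that a 32-byte range
  touches.  The helper below is a hypothetical illustration of that
  arithmetic (it assumes byte addresses and is not part of the ONL API):

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-byte quadwords touched by the byte range
 * [addr, addr + nbytes).  Hypothetical helper for illustration only. */
static uint32_t quadwords_spanned(uint32_t addr, uint32_t nbytes)
{
    uint32_t first, last;

    assert(nbytes > 0);
    first = addr / 8;                   /* quadword holding the first byte */
    last  = (addr + nbytes - 1) / 8;    /* quadword holding the last byte */
    return last - first + 1;
}
```

  A 32-byte read from an aligned address touches 4 quadwords; any unaligned
  start address makes the range spill into a fifth.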

