Coding Challenge - The Bit Twiddling Challenge - Results

Publish Date: Sep 12, 2006 | 0 Ratings | 0.00 out of 5 |  PDF

Table of Contents

  1.  Performance
  2. Other Platforms:

 Coding Challenge Winners

  The Philips NatLab TMA team: Joris Robijn (intern), Martin Vernhout, Henk Hessel, Adrie Raaijmakers and Albert Geven. 

1.  Performance

On a P4 2GHz Intel computer the times were:

Philips NatLab TMA team   61
Craig Smith *   61
Konstantin Shifershteyn   62
Konstantin Shifershteyn   64
Edward Fäldt   67
Konstantin Shifershteyn   69
Bart Michiels   71
Martin Gustafsson   71
Philips Research, T&MA team   73
Paul Brown   78

Jaegen Milley   83
Teddy Roerby   87
Martin Gustafsson   91
Edward Fäldt   91
Remco Breen   99
Jaegen Milley   100
Greg Falcon *   101
Christian Altenbach   102
Oleg Chutko   104
Greg Falcon *   104
...
xxx 13803

A * next to the name means the submission is from NI and is there just for the competition.

Optimization is not really the most important aspect of software development, and in fact it is a common justification for code that is difficult to read and maintain. There are times however when it is necessary to meet certain speed requirements, and in the past I have heard many individuals state that because they wanted it to be FAST, they were going to write it in C, assembly, use upper case variable names, or other not-so-scientific methods for getting fast code. Hopefully, optimization challenges,like this one, can help all of us learn some coding techniques that produce fast LabVIEW code but code that is still easy to read and maintain.

The name of the challenge was the Bit Twiddler, and one of the main goals was to see how different people approach bit operations in LabVIEW and learn how to do them efficiently. The contest had very good participation, and the times for submissions ranged from .06 seconds to almost 14 seconds. About one third of the entries took more than a second to finish, and about ten percent finished in under 100ms, with the majority being in the 100ms to 1 sec range.

Let's look at solutions from different time scales and compare them.

 
enlarge

This solution executed in 13745ms. The leftmost loop builds an array of Booleans from the eight bit string characters. The 1D array of Booleans is then reshaped into a 2D of Nx9 Booleans. The rightmost loop takes a row at a time, computes the parity and creates an array of U8s (either the first eight bits or a ?). Then it converts the array to a string. One small thing to notice is that the loop produces an array of int32s, which are then converted into U8s by the coercion dot on the [U8] to string conversion. A better approach is to convert each int32 to a U8 inside the loop andbuild the array of the correct type and size.

Here is a different approach that executes in 1001ms. The left portion takes the U8 byte array and ANDs it with various masks to create eight arrays each containing only one bit of information. They are then interleaved into one larger array, converted into an array of Booleans, and decimated to produce nine Boolean arrays where a character and its parity bit are spread across the nine arrays. These arrays are recombined to form a 2D array of bits for the value and a 1D array for the parity value. The familiar loop on the right then produces the numeric array which is converted into the string. Note that the loop again produces an array of int32s, and instead of an expensive coercion dot to convert an array of int32 to an array of U8, this solution uses an explicit cast. But, even though this time the coercion is explicit, it is identical in execution to the coercion dot, and again, placing the Cast inside the loop should speed this solution up quite a bit.

This solution also makes lots of intermediate data -- 900 U8 bytes turninto 7200 U8 bytes, then into 7200 Booleans, then into a 2D array of 6400and a 1D array of 800.  But, the integer and Boolean nodes are actually quite efficient at operating on entire arrays at once.  Comparing the operation of a built-in node on an array and a Loop containing the node and operating on individual elements, you will find that the node is not only smaller and more elegant on the diagram, but runs faster as well.

enlarge

The solutions that break the one second barrier, including this one that timed in at 311ms, take another approach -- they do all of the work in one loop. The upper left computes how many characters will be output. Inside the loop, the indexing code to figure out which bytes to work on is a bit tricky, but indexes two U8s and builds an I16 from them. The I16 is then bit rotated so that the parity is in the upper and the character byte is in the lower. Then the I16 is split again, the parity checked and the appropriate character output. Notice this solution uses a lookup table, but only to store the value of the parity bit, not the value of the character to output.

enlarge

This solution took 245ms, so it is similar in speed, but notice how the indexing logic is written. Yes, that is the same as *9/8, but done in bit shifting and with no comments. Who said optimized code was hardto read? More interesting though, is the use of the 3D lookup table, 256x256x8 which encodes the shifted combinations of two byte pairs, and attempts to make the most of the precomputed table. Interestingly, this approach is still four times slower than the winners.

By combining lookup table and the two byte at a time approach, these single loop solutions made it into the mid 100ms range, but they are still a ways off from the winners. If two bytes at a time is a good approach, perhaps more is better.

enlarge

And in fact, getting below 100ms the solutions all look to be taking advantage of the fact that nine input bytes result in eight output bytes. Each loop iteration takes the nine bytes apart, combine, shift, and mask them to produce eight nine-bit values, which are then turned into the eight output characters, in this case using the familiar 512 element lookup table with parity already computed.

This approach is typically called "unrolling the loop", and means that the diagram is taking advantage of the fact that the data is clumped in a certain way, or the processor prefers to operate on the data in a certain clump size. Unrolling a loop also means fewer loop iterations, so the loop overhead is reduced compared to the code inside it. In this case, however, the loop overhead is quite small compared to the processing going on and the benefit comes from making better use of the indexed values and simplifying the indexing code. Looking back at the solutions that worked on two bytes at a time, notice that iteration zero indexes elements 0 and 1. Iterationone indexes elements 1 and 2, iteration two bytes 2 and 3, and so on. In fact most of the elements are needed for two different iterations. The unrolled solution shows this in detail as each of the nine bytes are shuffled into the sixteen bytes of the eight I16s. Also notice how the index computation is much simpler without the 9/8 math going on each iteration,and instead one indexer is going up by 8, the other by 9.

The downside of unrolling the loop is that the loop now works well on ninebytes, but what happens when there aren't nine characters on the input. Ifthere are 5 bytes, or 40 bits encoding four nine bit characters, using 36of the bits with four unused, how does this solution work? To handle this, the outside indexing code has become more complex precomputing theoutput array so that computation on bytes 6 through 9 never gets stored inthe result. This also takes advantage of the LabVIEW approach to array indexing. Indexing past the end of an array will not crash your program, and it is defined to return default values, zeroes in this case. The zeroesare shifted, masked, and even sent to the replace. If the initial arrayis sized to hold only four elements, then the replace only works on those, and the zeroes do not become a part of the output. This makes it much simpler to unroll a loop, but it is always necessary to think about and test the input sizes that do not fit the loop's stride. Either preallocate the array and replace to avoid storing bad values, or pad the input array to be a sized as a multiple of the stride and shrink the output array tokeep only the valid data.

This above diagram, submitted by the Philips NatLab TMA team is the P4 winner, taking 61ms and is still reasonably readable. The entry from Craig Smith, who works on the LabVIEW development team and is therefore just a virtual submission, is almost identical except it does not have the single frame sequence. The variation from Konstantin uses bundling and cluster array conversions instead of the index array and replace array element.

Congratulations to everyone in the top ten, they are all very good entries, and to everyone who spent the time to scratch your chin and look for a faster approach to the problem, thank you for making the contest a good one. I hope that everyone learned something from trying your hand at the problem, and I hope that this challenge uncovered some of the techniques that can make a LabVIEW diagram run fast, but still maintain its readability.

 

Back to Top

2. Other Platforms:


Just For Fun
In the contest rules, it was stated that the official results would be for a P4 computer, but just for fun, let's look at how these solutions fare on different types of computers. I ran them on a P3, a dual P4, and a dual PPC. None of the entries really benefitted from the dual processor, but the architecture changes did affect rankings somewhat.  The P3 architecture for some reason favored Konstantin's bundling and unbundling approach, but the top entries still stay pretty close to one another.  The interesting results below come from the PPC.  The overall times are somewhat slower as this is an 866 Mhz PPC, but notice that the top ranked VI is now well out in front of the other entries.

Martin Gustafsson   108
Martin Gustafsson   135
Konstantin Shifershteyn   139
Philips NatLab TMA team   169
Bart Michiels   171
Craig Smith   172
Craig Smith   172
Konstantin Shifershteyn   174
Konstantin Shifershteyn   175
Jaegen Milley   175


enlarge
This code snippet seems to be the important difference.  It merges the bytes together and operates on I32s instead of I8s.  It also uses a large lookup table which is basically two 512 lookup tables side by side to take two nine bit chars and produce two eight bit chars in a single lookup.  The reason for the drastically different speed is that the Intel architectureis more tuned for accessing individual bytes than the PPC chip.

Go here for further discussion.

Back to Top

Bookmark & Share

Ratings

Rate this document

Answered Your Question?
Yes No

Submit