Monday, June 25, 2007

The Flatten Benchmark Results

Four Flatten Scheme Functions Benchmarked

Finally, the results are in, and the graphs are ready to present. After my last article, I received two (2) new excellent functions from two different readers, and they both do very well, beating my own version of flatten when benchmarked on a favorable Scheme implementation and given a favorable range of list complexity. What I show in todays article are the results of four (4) functions when run on four (4) different Scheme implementations, and for a range of list complexities starting at complexities of 1.93 x 10^3 up to 1.93 x 10^8, and on two Scheme implementations up to 3.86 x 10^8.

The complexity of a list was determined by a simple Scheme function, included in the source, that sums the number of pair?'s and atom?'s in a list. This turns out to be a very accurate measure of the input length to a function, and a vast improvement over simply reporting the "length" of a list. Throughout this article, and on the graphs, it will be referred to as the "Complexity" of a list.

The list was constructed by successive cons'ing of what, in the benchmark source, I call a list-segment. Each list segment was composed of three sub-lists. The final list segment looked like this:

 (b (c (d (e (f (g (h (i (j (k (l
    (m n)
 o) p) q) r) s) t) u) v) w) x) y)
 z a b c d e f g h i j k l m
 n o p q r s t u v w x y z a
  (d (i q r) (j s t)) 
  (f (k u v) (l w x)))
  (g (m y z) (n aa bb)) 
  (h (o cc dd) (p ee ff))))

When I refer to lists of a certain length, I'm not referring to the Scheme function by the same name, but rather to the number of list-segments, as above, iteratively cons together. Thus when I say I started with lists of length 10 up to 100 step 10, I mean to say ten (10) list-segments up to one hundred (100) list-segments step ten (10), whose (length test-list) => just happens to be 10 up to 100, but whose complexity is equal to 1930 up to 19,300. Lists of length two (2) million, therefore, had complexity of 3.86 x 10^8.

The graphs show the relationship between the "Adjusted Real Time (ms)" of execution in milliseconds on the y axis vs. the unit-less "List Complexity" on the x axis. The "Adjusted Real Time" is simply the real time of execution for a function, extrapolated to equal the real time of execution for a list of given complexity, if that function had executed one thousand (10^3) times over the same list. This was done to provide continuity over all graphs. The actual test runs were conducted using list lengths, that started at lists of length 10 up to 100 step 10, run for 1000 iterations, and progressed, by orders of magnitude, up to lists of length one million, and for two Scheme implementations, up to lists of length two million. But as the list lengths increased, I necessarily had to reduce the number of iterations, in order for the results to be collected in a reasonable amount of time. You will, therefore, see "Adjusted Real Time" reported, when testing lists with lengths over one million, which were only executed for a single iteration, but have been adjusted by multiplying the actual real time of execution by 10^3. Like wise, if an actual run was made for 100 iterations, then the "Adjusted Real Time" is equal to the actual real time of execution times ten. All runs for lists shorter than length one hundred thousand were run for at least ten iterations. Naturally, the precision of the numbers reported for lists of greater complexity suffers from this method, however, as you will see, this loss of precision was not significant, compared to the effects of the ever dwindling available memory that showed up in all Scheme implementations, more notably in some than others.

You may be wondering why I chose to scale up the different runs according to list length instead of complexity. The answer is that it was a matter of convenience on my part. It was just very simple to increment each run by powers of ten, which is easy to achieve using lengths. I covered every order of magnitude with ten data points for each function, and, in the cases where I was able to execute out to lists of length two million, I gathered ten data points for the span between one and two million, in steps of 100,000 for every function. The complexities of each list were calculated just prior to each data point collected. The benchmark could have been designed in such a way as to pre-calculate all the complexities, and store them in a top-level vector, but I have designed the benchmark to accept an arbitrary list, as well as all the other test parameters, interactively to facilitate early experiments. When I settled on the common testing parameters to use across all Scheme implementations, I simply fed them in via a shell script that ran all the tests for a given implementation as one job.

As introduction, I present the four (4) functions and give specifics on their origin and a brief description of the algorithm that they employ. The two reader contributed functions go by the names "flatten-Eugene" and "flatten-Sacundim", after the names of their respective authors. The other two functions are named for features they employ, so the function "flatten-accum-cons-sans-reverse", uses an accumulator style of function, which employs the Scheme primitive "cons" to build its result list back to front, but doesn't actually reverse its result list. Finally, the function "flatten-set-cdr!" is the function I wrote and presented in my previous article, which employs the Scheme primitive "set-cdr!" to build its list in the order of visitation of the elements of the input list.

It may seem unfair to compare a function, that returns a reversed list as its result, to the other functions, which all build their lists in the order of visitation of the input list. While that is true, "flatten-accum-cons-sans-reverse" was included as a kind of gold standard for the other functions to beat, and as you'll see, Eugene's function did just that in data collected on several test runs.


This function serves as the gold standard. I wrote it myself, but I'm not the original author, as it is too obvious a solution to have not previously been written. Unfortunately, I don't know who the original author is, and perhaps no one does. If you happen to know this bit of Scheme trivia, I will be glad to share it with the rest of the readers and give the original author their due credit. Since it doesn't reverse its result list, it can only be employed in applications in which the order of visitation of its argument need not be preserved. As far as I know, it is the fastest way to non-destructively flatten a list in Scheme.

(define (flatten-accum-cons-sans-reverse x:xs)
  (define (f x:xs result)
      [(null? x:xs) result]
      [(pair? (car x:xs)) 
        (f (cdr x:xs) (f (car x:xs) result))]
      [else (f (cdr x:xs) 
        (cons (car x:xs) result))]))
  (f x:xs '()))


This is the function I posted in my initial article. It is a diversion from accumulator style functions in that instead of passing the result list around as an argument, it uses the lexically bound identifier "last-elt" to keep track of the last pair in the "let bound" result list, thus allowing set-cdr! to mutate "last-elt" to establish the link to each successive cons pair? constructed in the result list. I know this is in violation of "The Eleventh Commandment", as per Friedman and Felleisen in their classic 1996 book, "The Seasoned Schemer", but I will let the results speak for themselves.

(define (flatten-set-cdr! x:xs)
  (let* ([result (cons '() '())]
         [last-elt result])
    (define (f x:xs)
        [(null? x:xs) result]
        [(pair? (car x:xs)) 
          (f (car x:xs)) (f (cdr x:xs))]
          (set-cdr! last-elt 
                    (cons (car x:xs) '()))
          (set! last-elt (cdr last-elt)) 
          (f (cdr x:xs))]))
    (f x:xs)
    (cdr result)))


This is an elegant function which uses two helper functions to avoid the necessity to reverse its list. Eugene Wallingford wrote this function as the answer to an exercise in the 1st ed. of EOPL, which he taught from at The Univ. of Northern Iowa, until the 2nd ed. of "Essentials of Programming Languages" dropped its emphasis on Scheme. The original problem asked for the student to write a function to flatten a proper list of symbol?'s. Thus, in Eugene's function, as he sent it to me, he used the predicate symbol? where each of the other functions use pair?. To put his function on equal ground with all the other heterogeneous list flattening functions, I made the trivial conversion from symbol? to pair?, which simply required the reversal of the order of the two expressions within the "if" expression of "flatten-sym-expr". In results from tests not shown in this article, there was no noticeable degradation in the performance of Eugene's function for having made the change.

(define flatten-with-accumulator
  (lambda (slst symbols-so-far)
    (if (null? slst)
         (car slst)
           (cdr slst) symbols-so-far)))))

(define flatten-sym-expr
  (lambda (sym-expr symbols-so-far)
    (if (pair? sym-expr)
          sym-expr symbols-so-far)
        (cons sym-expr symbols-so-far))))

(define flatten-Eugene
  (lambda (slst)
    (flatten-with-accumulator slst '())))


This function is an ingenious attempt to improve on flatten-set-cdr! by placing all recursive calls in tail position, and employing an additional argument within the loop, that maintains a list of outstanding non-null cdr's of previous pairs, that remain to be processed after the car of a pair? has been processed. It does this at the expense of adding another conditional statement inside the loop, and with an additional cons cell constructed for each pair? seen who's cdr is not null. It was left by a brave reader as a comment, without an email address to reply to the reader directly, so I know very little about the details of its origin, other than what was said in the comment, which you can find at the bottom of my previous article. I'm not even aware of the authors full name. None-the-less, as I promised, I placed the function in with the pack to be benchmarked, and it turned in a very fine performance, especially when compiled.

(define (flatten-Sacundim list-of-lists)
  (let* ([result (cons 'will-be-thrown-out '())]
         [last-pair result])
    (let loop ([remainder list-of-lists]
               [stack '()])
        [(null? remainder)
         (unless (null? stack)
           (loop (car stack) (cdr stack)))]
        [(pair? (car remainder))
         (loop (car remainder) 
               (cons (cdr remainder) stack))]
         (set-cdr! last-pair
                   (cons (car remainder) '()))
         (set! last-pair (cdr last-pair))
         (loop (cdr remainder) stack)]))
    (cdr result)))

The astute reader may be asking why the identifier "last-pair" is colored differently than other identifiers. The answer is that the program I wrote and use to syntax colorize Scheme code posted to this blog, is at once perceptive enough to recognize the identifier for a popular srfi-1 function of the same name, but lacks a sufficiently sophisticated parser to see that the identifier is lexically bound within the function. It's on my very long TODO list before I release "format4blog", which I have previously posted about. If you are in need of this type of service for your own blog, I highly recommend you use Jens Axel Søgarrd's Paste Scheme service.


Here are the graphed results presented grouped by Scheme implementation in alphabetical order. Within a Scheme implementation they are presented by first showing an overview graph of the entire data set for an implementation, followed by graphs of the same data set, truncated at descending list complexity, by steps of powers of ten. After the graphs, I'll provide some analysis, from my perspective, on the results, and finally end with what I feel is the take home message from this effort.

To make full disclosure, let me tell you that all tests were conducted on a Mac Pro Server, OS X v10.4.9, with 4gb's of physical memory installed, and two duo core Xeon Processors, each running at 3gHz. No other applications were in the foreground while these tests were being conducted. There were some background processes that I could not safely squelch during testing, but none were actively allocating memory to any measurable extent, and likewise, all 4 cpu's showed no significant activity prior to the beginning of a test run.







Petite Chez Scheme



I typically pull images off my web site, but this time around I thought I would give Flickr a try. The graphs load quite quickly for me, and my internet connection is a standard residential ADSL line to my home office. Let me know how this works out for you; if I get enough feedback that they are taking too long to load, I'll move them back onto my website and/or reduce their size.

I received excellent support from Felix at Chicken. Once he gave me a few pearls on the proper way to setup the compiler, all the problems I had mentioned in my prior post regarding segment faults disappeared. Those errors, were in fact, due solely to my having turned on the -unsafe optimization prior to having made certain there were no undefined identifiers loose at the top level. Chicken's compiler is quite fast; the benchmark code compiles nearly instantly on my box, even with every optimization I could turn on employed.

One of the first things you might notice about the first graph is that it appears some what jagged. This is the version of the graph, even after I manually removed obvious outliers; i.e, those data points where the execution time actually went down with increased complexity. I did not have to make such modifications to any of the other three implementations, which were, by and large, all monotonically increasing in execution time. My speculation is that Chicken's GC is collecting garbage in larger chunks than the others, because when the execution times suddenly jumped in places, I found that it did not always happen for all functions, and no function was immune from the anomaly. I had the choice of showing all the data, which would have appeared quite erratic, or doing my best at preserving the relative ranking among functions at each point retained, and presenting what I did, which amounts to a manual smoothing filter applied to the actual data. Consequently, there are somewhat fewer actual data points on the chicken graphs.

The other thing to notice about the Chicken graphs, is that they are fairly linear through out the set of graphs. Due to the design choices made by Chickens athors involving memory allocation, I was not able to test beyond lists of length one million. This was confirmed to be legitimate by Felix. Therefore, Chicken was never stressed to the same degree as MzScheme and Petite Chez Scheme. Tests on DrScheme were likewise terminated at lists of one million. One look at MzScheme's entire data set graph will give you a good idea of the other end of the spectrum regarding design choices, when an implementation is stressed with very little available memory left. Eugene's function is quite notable in its performance with the Chicken compiler. It starts off beating the other two functions quite soundly, but then ends up being the slowest when the lists grew to enormous complexity. In fact, I just noticed that I missed another place where Eugene's function beats the gold standard; on the other graphs I tried to place an annotation whenever this occurred. It is also notable how well Sacundim's function does with the very complex lists when compiled with Chicken. In places throughout the range of complexities Sacundim's function is seen to do better than my own function, and although the precision is poor at the end, you can just make out that Sacundim's function beats mine at the final data point. If you take a look at the graphs for all Scheme implementations, you'll see that flatten-Sacundim is more typically running slightly slower than, but nearly parrallel with flatten-set-cdr!

Turning my attention to the DrScheme graphs, you'll notice that here too I terminated the graph at lists of length one-million. I didn't actually see DrScheme crash from heap exhaustion on this set of graphs, but under a previous version of DrScheme, circa v369.100, it did. For safety sake, and due to the increasing run times, I never took it out past one million on the graphs I'm presenting today.

Looking at the overall all graph for DrScheme, you immediately notice its smooth increasing slope. This is most certainly not a result of the linear functions being tested, but rather to DrScheme taking an increasingly longer time to process the memory requirements of these functions. As DrScheme is using MzScheme as its underlying Scheme engine, you only need look at MzSchemes overall graph to see where DrScheme was headed if I had opted to continue. You should also take note that flatten-Sacundim in particular, but also flatten-Eugene, appear to be doing rather poorly in the later part of the overall graph, but this is atypical of both functions performance over the whole set of DrScheme graphs. As you see, I have made a couple of notations where flatten-Eugene actually does better than the gold standard. Another thing to note is that there is a distinct cross over point between flatten-set-cdr! and flatten-Eugene on the overall graph. This is a recurring theme in the graphs for all implementations. Flatten-set-cdr!, for what ever reason, seems to be less effected by the underlying implementations poor response to lists of enormous complexity than the other functions, except in the case of the Chicken compiler where flatten-Sacundim puts in its best performance.

Although, I've mentioned it on several occasions already, as we turn our attention to the overall graph for MzScheme, its quite dramatic. To place the entire data set on the same graph, you have to accommodate for the seemingly exponential rise in run times seen in the last few data points. It is amazing that MzScheme didn't crash by looking at the overall graph. I brought this to the attention of Mathew Flatt, who is the go to guy for PLT software, and he indicated that their was a change made to how they handle very large objects that coincides with the time frame in which I noticed a change in behavior of my benchmark. That is, previously it had failed to make it out to lists of length two million, crashing before all functions had made it. Now it doesn't crash the system after upgrading to v370.3, but we get these extraordinary jumps in run times. Mind you, these are lists of enormous complexity, that in real world applications would rarely, if ever, be constructed. However, it is possible to do without the exponential rise in run times, as you can see by looking at Petites overall graph.

It is somewhat surprising to see the difference in ranking between flatten-Eugene on DrScheme vs. MzScheme. Flatten-set-cdr! is fairly consistently the fastest function, other than the gold standard, on the entire set of graphs for MzScheme, while under DrScheme, it only begins to out perform flatten-Eugene when the lists become very complex.

Finally, we have the Petite Chez Scheme graphs to analyze. Here, even more dramatically than with MzScheme, you see how poorly flatten-Eugene appears to be performing on the initial chart. But a quick glance through the entire set reveals, that flatten-Eugene firmly had the competition beat right up until the final run. Again, for whatever reason, flatten-set-cdr! is less effected by these enormous lists than flatten-Eugene. The other quality of the Petite graphs that you can't escape, is their very smooth linear look, save for the overall graph, which clearly shows the same nonlinear growth rate in execution times as it encounters the largest lists. However, this time, the change happens over a broader spectrum of list lengths, so that even though the overall graph looks like its going north in a hurry, it retains its smoothness. It would be fascinating to have a deeper understanding of what engineering trade-offs each implementation made that produces these wide difference in behavior when stressed with having to allocate memory close to the limit of what is available on a given computer. I suspect if I had the luxury of running this benchmark on a 64bit platform with 32gb of physical memory that things might look quite different indeed.

Take Home Message

When I started this article, I greatly under estimated the time required to publish results that were as creditable as I wanted them to be. It was a real wake up call for me to see the differences in performance of these functions, all of which are based on algorithm's that are O(n) in their complexity, as they were executed under vastly different list complexities, and Scheme implementations. The results say one thing loud and clear. If you are going to select one function over another, and performance matters, then you really need to know the platform on which it will be operating, and the types of input, to which, it will most likely be applied. There are simply too many variables, outside of the programmers design of a function, to take into account for anyone to claim their function is the best implementation of a class of function, under all conditions.

For instance, I rarely use MzScheme in my own work, because of the value added for me by working in the ever helpful and forgiving DrScheme development environment, and I would never be flattening lists of the complexity for which flatten-set-cdr! is particularly well suited. So, for me, it makes more sense to use flatten-Eugene in my library. On the other hand, if you are an Emacs wizard, and you use MzScheme directly from within the editor, then flatten-set-cdr! would likely be your best choice for performance across the board. If you are one of the many fans of Petite Chez Scheme, your going to want to use flatten-Eugene for your everyday list flattening needs, but if you had a particular application in which you anticipated encountering lists of enormous size, then you might consider flatten-set-cdr! for its robust response to the larger lists. Flatten-Sacundim makes a good showing in the compiled environment of Chicken, so on that platform you would have even more choices to pick from. And then you might be a person for whom style is the most important thing for you in Scheme. For those, you cannot escape the elegance of the flatten-Eugene function, that manages to do so well in so many likely scenarios, and has sole claim to the ability to beat the gold standard, a function that can only be used in special cases.

I am also quite aware of the limited scope, within which, I have been able to run this benchmark. There are dozens of other Scheme implementations not represented here, either due to my inexperience with them, my inability to get them to compile on my box, or because it was going to be a far greater task to make a single source file portable enough to run my benchmark on them. The other shortcoming of these results, is that they all came from a single machine using a single operating system. I had hoped to repeat all the runs on my Windows XP box, which also has two (2) 3gHz Xeon processors, and 4gb's of memory, but I simply couldn't afford the time. You could also make a the valid criticism that I stuck with a single form of list-segment throughout the tests I'm reporting in this article. The truth is, I had very much wanted to publish results for varying list structures, but again, I felt obligated to release these results sooner rather than later, because of the kind contribution of my readers. I have made the Scheme benchmark source available to the public at this link on my web site, for those who might want to run the test on their own platform. It is not published on the web site itself, so you'll need to use the link provided here on my blog. If you do download the source, please pay special attention to the notes I have made throughout the code, as this benchmark code is intended only for those who will take on the responsibility for what it might do to their system. In preparing this article I discovered the limits of the various Scheme implementations by encountering several crashes along the way. None of these crashes effected my Mac Pro, or had any lingering effects on the crashed Scheme systems. However, you may not be so lucky.

As always, it was my pleasure to bring this article to you. Please leave a comment as to your take on the article. I'm always open to criticism, and accolades alike.



Sunday, June 03, 2007

A Better Flatten

A Better Flatten Algorithm in Scheme

After a little nudging from my friend Robby Findler to clean up my algorithmic act, I hunkered down with DrScheme to come up with the best list flattening function that I could dream up. For the last few days I've been doing some serious benchmarking between various functions that all do the same job, they decompose a list. It turns out that their are a cluster of four or five functions that all operate in O(n) time, but they differ by small factors in their actual run times. It also turns out that the type of list being flattened can make an otherwise excellent algorithm look very bad. One benchmark in particular proved to spoil a very popular algorithm, that builds the list back to front, and then applies reverse! on the result. I simply handed the algorithm the answer to begin with, and it fell behind other algorithms which had previously never made it into the top five. That is, I presented the function with a completely flat list as the input.

No destructive algorithms were considered, but mutation during construction of the newly produced flattened list was most definitely on the table. The fastest algorithm was a version of a function that normally required reversal to generate the list in the order visited in the input list. There are times when the order would be of no importance to the application, so I left it in the pack as the gold standard to beat.

What I came up with is an algorithm that needs no reversal of its result, and is faster than all but the one algorithm that doesn't reverse its list, and the difference between the two is very small. The function is quite robust in handling differently structured lists; never leaving its top ranking regardless of what list I through at it.

However, I still wasn't satisfied that I knew enough about my function to call the race. So I grabbed a Scheme to C compiler and ran the benchmarks all over again. On small lists, with lengths less than 100,000 (which means there were actually many times that number of elements, since `length' only measures the cdr spine of a list), the results of compiling were about what I had been told. I saw something like an order of magnitude of speed boost, depending on the function. There were some ranking changes within the slower algorithms, and to my surprise, some segment faults on some of these functions. I was hoping that I could get beyond lists of length 2,000,000, which is where, even with 4gb of memory, I hit a wall with DrScheme. However, to my surprise, the compiled code wasn't able to go beyond 1,000,000 before it faulted because of an out of memory allocation error that crashed the entire application. Further, as the length of the lists increased, the gap between the compiled codes performance and that of DrScheme shrank, depending on the algorithm, to somewhere between two to four times faster than the byte code of DrScheme.

It was quite a lesson in just how well DrScheme has been engineered. I saw more program failures over the course of one day than I've seen in my entire experience with DrScheme. I should divulge that this was the first time I had used the compiler, so in the right hands things may have worked out quite nicely.

Update: 2007-06-26 07:37am EDT.

So the record is clear, all the problems I encountered with the Chicken compiler were due to my own lack of experience with this fine piece of software. The segment faults turned out to be uncaught unbound identifiers at top level, which were not reported by the compiler, because I had, from my lack of experience with Chickens compiler, turned on the option -unsafe, and by doing so, it gladly accepted unbound identifiers, because in this mode the burden of correctness lays squarely at the feet of the programmer. And I was foolish enough to run my code for the first time with this option selected. I later contacted Felix at Chicken, who was enormously helpful in getting me squared away on all my concerns, including the ability to use `eval' in a compiled program to allow the interactive specification of the list structure to be tested. It turns out that my last line in the above paragraph was correct. In the right hands, Chicken is a very fast compiler that produces excellent run times. Please see my follow up post for the results of Chickens benchmark on four different functions, where you will see that, unlike most other Scheme implementations, Chicken produced code that remained linear throughout its operating range. I regret having made any comments before talking with the experts at Chicken, who were quite gracious and helpful in getting me on the right track.

If your interested in the function I wrote, visit where I discuss more details. I would be particularly interested in anyone that believes there is a faster algorithm for non-destructively flattening a list. Leave a comment with your algorithm, and I'll post it, even if it turns out that its just a worthy contender.

For those of you who don't want to visit the Cookbook here is the code

(define (flatten x:xs)
  (let* ([result (cons () ())][last-elt result])
    (define (f x:xs)
        [(null? x:xs) result]
        [(pair? (car x:xs)) (f (car x:xs)) (f (cdr x:xs))]
        [else (set-cdr! last-elt (cons (car x:xs) ()))
              (set! last-elt (cdr last-elt))
              (f (cdr x:xs))]))
    (f x:xs)
    (cdr result)))

Note: An astute reader has correctly pointed out that the nil's in this code are not quoted. DrScheme employs an extension to its reader that allows () to be evaluated as '(), saving the programmer from placing all those tic marks in the code, when it is obvious that they should be quoted. As the reader points out, this is not standard R5RS Scheme. And, indeed I had to make a glogal replace on all my nils to compile it. DrScheme employs many reader extensions that are not strictly R5RS Scheme, as do many other implementations. However, for the purest, and to help out when trying to implement portable Scheme code, DrScheme retains a language level which only accepts strict R5RS Scheme. I'm more egalitarian in my approach to implementations of Scheme. So when a better method of writing Scheme is offered, I usually adopt it, especially if it will save me keystrokes, and clean up the appearance of the code at the same time. If you are not using DrScheme, you may likely need to change the code above by adding tic marks in front of each nil. C'est la vie.

Please do leave a comment regarding the function. I'd really enjoy hearing about your own favorite version of this function. I've built up quite a collection on the way to designing this one, and there are some subtle design choices to make. It would be great to see how other people approached the problem.