Sunday, June 03, 2007

A Better Flatten

A Better Flatten Algorithm in Scheme

After a little nudging from my friend Robby Findler to clean up my algorithmic act, I hunkered down with DrScheme to come up with the best list flattening function that I could dream up. For the last few days I've been doing some serious benchmarking between various functions that all do the same job, they decompose a list. It turns out that their are a cluster of four or five functions that all operate in O(n) time, but they differ by small factors in their actual run times. It also turns out that the type of list being flattened can make an otherwise excellent algorithm look very bad. One benchmark in particular proved to spoil a very popular algorithm, that builds the list back to front, and then applies reverse! on the result. I simply handed the algorithm the answer to begin with, and it fell behind other algorithms which had previously never made it into the top five. That is, I presented the function with a completely flat list as the input.

No destructive algorithms were considered, but mutation during construction of the newly produced flattened list was most definitely on the table. The fastest algorithm was a version of a function that normally required reversal to generate the list in the order visited in the input list. There are times when the order would be of no importance to the application, so I left it in the pack as the gold standard to beat.

What I came up with is an algorithm that needs no reversal of its result, and is faster than all but the one algorithm that doesn't reverse its list, and the difference between the two is very small. The function is quite robust in handling differently structured lists; never leaving its top ranking regardless of what list I through at it.

However, I still wasn't satisfied that I knew enough about my function to call the race. So I grabbed a Scheme to C compiler and ran the benchmarks all over again. On small lists, with lengths less than 100,000 (which means there were actually many times that number of elements, since `length' only measures the cdr spine of a list), the results of compiling were about what I had been told. I saw something like an order of magnitude of speed boost, depending on the function. There were some ranking changes within the slower algorithms, and to my surprise, some segment faults on some of these functions. I was hoping that I could get beyond lists of length 2,000,000, which is where, even with 4gb of memory, I hit a wall with DrScheme. However, to my surprise, the compiled code wasn't able to go beyond 1,000,000 before it faulted because of an out of memory allocation error that crashed the entire application. Further, as the length of the lists increased, the gap between the compiled codes performance and that of DrScheme shrank, depending on the algorithm, to somewhere between two to four times faster than the byte code of DrScheme.

It was quite a lesson in just how well DrScheme has been engineered. I saw more program failures over the course of one day than I've seen in my entire experience with DrScheme. I should divulge that this was the first time I had used the compiler, so in the right hands things may have worked out quite nicely.

Update: 2007-06-26 07:37am EDT.

So the record is clear, all the problems I encountered with the Chicken compiler were due to my own lack of experience with this fine piece of software. The segment faults turned out to be uncaught unbound identifiers at top level, which were not reported by the compiler, because I had, from my lack of experience with Chickens compiler, turned on the option -unsafe, and by doing so, it gladly accepted unbound identifiers, because in this mode the burden of correctness lays squarely at the feet of the programmer. And I was foolish enough to run my code for the first time with this option selected. I later contacted Felix at Chicken, who was enormously helpful in getting me squared away on all my concerns, including the ability to use `eval' in a compiled program to allow the interactive specification of the list structure to be tested. It turns out that my last line in the above paragraph was correct. In the right hands, Chicken is a very fast compiler that produces excellent run times. Please see my follow up post for the results of Chickens benchmark on four different functions, where you will see that, unlike most other Scheme implementations, Chicken produced code that remained linear throughout its operating range. I regret having made any comments before talking with the experts at Chicken, who were quite gracious and helpful in getting me on the right track.

If your interested in the function I wrote, visit schemecookbook.org/Cookbook/ListFlatten where I discuss more details. I would be particularly interested in anyone that believes there is a faster algorithm for non-destructively flattening a list. Leave a comment with your algorithm, and I'll post it, even if it turns out that its just a worthy contender.

For those of you who don't want to visit the Cookbook here is the code

(define (flatten x:xs)
  (let* ([result (cons () ())][last-elt result])
    (define (f x:xs)
      (cond
        [(null? x:xs) result]
        [(pair? (car x:xs)) (f (car x:xs)) (f (cdr x:xs))]
        [else (set-cdr! last-elt (cons (car x:xs) ()))
              (set! last-elt (cdr last-elt))
              (f (cdr x:xs))]))
    (f x:xs)
    (cdr result)))

Note: An astute reader has correctly pointed out that the nil's in this code are not quoted. DrScheme employs an extension to its reader that allows () to be evaluated as '(), saving the programmer from placing all those tic marks in the code, when it is obvious that they should be quoted. As the reader points out, this is not standard R5RS Scheme. And, indeed I had to make a glogal replace on all my nils to compile it. DrScheme employs many reader extensions that are not strictly R5RS Scheme, as do many other implementations. However, for the purest, and to help out when trying to implement portable Scheme code, DrScheme retains a language level which only accepts strict R5RS Scheme. I'm more egalitarian in my approach to implementations of Scheme. So when a better method of writing Scheme is offered, I usually adopt it, especially if it will save me keystrokes, and clean up the appearance of the code at the same time. If you are not using DrScheme, you may likely need to change the code above by adding tic marks in front of each nil. C'est la vie.

Please do leave a comment regarding the function. I'd really enjoy hearing about your own favorite version of this function. I've built up quite a collection on the way to designing this one, and there are some subtle design choices to make. It would be great to see how other people approached the problem.

Enjoy!

--kyle

14 comments:

Brad Lucier said...

The empty list is not self-quoting in R5RS; change the three instances of

()

to

'()

How did you generate the test trees for your timing tests?

Kyle Smith said...

That's a reader extension in DrScheme, which saves one from quoting the obvious. I'm afraid I've grown so accustom to it that I never bother quoting the empty list.

The lists were generated using various combinations of three lists. A deeply nested list, called abc, that look like this. '(a (b (c ... x) y) z). A list that had the structure of a binary search tree, which took a few more elements than the abc list to construct, which I did merely by hand. And a flat list '(a b ... y z). A combination of these three lists appended together was what I called a segment. When you see the numbers for the lengths of the lists, they roughly coincide with the number of segments in the test-list. So the lengths are less than the actual number of elements in the test-list.

For the cookbook, which is designed for all levels of Scheme users, I believe I quoted the nils, so as not to confuse the newcomers.

Thanks for your comment. I hope I answered your question adequately.

Enjoy!

--kyle

sacundim said...

Can we actually see the code that generates your test lists? It's probably best if you want people to be sure they can replicate your experiment.

sacundim said...

The only possible improvement I can see: one of your calls to f is not tail recursive. Clearly you can't do this computation in constant time, but by extenalizing the stack, you could in principle save the cost of the function applications.

I tested my version of that in DrScheme 370, however, and there was no measurable improvement at all. It still might help in other systems...

My code (which I don't know how to get properly formatted for this blog comment system):

(define (flatten3 list-of-lists)
(let* ((result (cons 'this-element-will-be-thrown-out '()))
(last-pair result))
(let loop ((remainder list-of-lists)
(stack '()))
(cond ((null? remainder)
(unless (null? stack)
(loop (car stack) (cdr stack))))
((pair? (car remainder))
(loop (car remainder) (cons (cdr remainder) stack)))
(else
(set-cdr! last-pair
(cons (car remainder) '()))
(set! last-pair (cdr last-pair))
(loop (cdr remainder) stack))))
(cdr result)))

Kyle Smith said...

Hi Sacumdim,

Not to worry, I'll have your function nicely indented in my upcoming article. Your observation about other systems is spot on. A very different algorithm submitted earlier shows better performance when benchmarked on a different implementation, while being rather slower when run on MzScheme. I'm glad you submitted when you did, as I was just beginning to write the up coming article to compare the two functions. I will add your function to the group and publish the results, along with the code used to benchmark the functions; I notice that you had requested that in an earlier comment.

Thank you for your submission. Its especially interesting, since it's an attempt at improving my function, which will make the upcoming article nicely balanced.

--kyle

Christophe (vincenz) said...

Hello Kyle,

First of all, thanks for your comment @ my blog. I was considering trying the results you have here in Haskell. Obviously Haskell being strongly typed, I would not be able to use plain lists. But I could easily encode an N-ary tree structure.

That being said, I was hoping you'd post the code used to generate the 'lists'. I expect the behaviour in Haskell to be quite different. For one, tail-calls are not a good idea for data-structures because they go against laziness. Therefore, it will be interesting to see how these algorithms behave in Haskell (as far as is possible, of course, side-effects less common in Haskell).

Best regards,
Christophe

Kyle Smith said...

Hi Christophe,

The benchmark code is located in my latest article along with the graphed results, including two new functions submitted by readers. You can find the results, my own analysis, and conclusions there. One reader tried a fully tail recursive version of my own code, which does well when compiled on the Chicken -> C compiler, and another, submitted by the Chair of CS at Northern Iowa University, does exceptionally well on both DrScheme and Petite Chez Scheme. My own function did well in all the tests run on MzScheme, the underling Scheme implementation of DrScheme, and on all platforms when the lists become enormously large. I measured a metric I've come to call the complexity of a list, which is the sum of all pair?s and atom?s in a list. So, when I say enormously large, I'm talking about > 1.93 x 10^8 thru 3.86 x 10^8, which consume on the order of 1GB of memory. When you take into account that at the end of constructing the newly flattened list, which is a copy, that means conservatively 2GB of memory just for the lists alone.

Take a look at the source code. It would be very cool to see the different functions implemented in Haskell. I'm betting that Haskell would turn out better results, although AFAIK, you wouldn't be able to emulate the functions that use mutation in the construction of the result list, since I believe Haskell is purely functional. Still, the function from Eugene Wallingford, from Iowa, is quite elegant, an doesn't employ any mutation. Neither does the function I use as the "Gold Standard", a function that builds its result list back to front, producing a reversed result list. Eugene's function manages, at several data points, to achieve better run times..

Have a look; if you just ported those two functions to Haskell it would make for a great comparison. Meanwhile, I'll see if I can't port your Stack computer compiler using Scheme macros. That ought to be straight forward, and may actually end up taking less lines of code, given that no type information need be specified.

DrScheme now sports a "Typed Scheme" language level, which I've never used. I figure if I'm going to use strict typing, I'd rather learn how to do it natively in Haskell. Scheme's utility is rooted in its dynamic typing, but it places a lot of responsibility on the programmer to insure type safety Where as, Haskell's strength comes in some part, from its highly developed strict typing, which obviously makes it possible to do optimizations, during compilation, that aren't possible in Scheme.

What text(s) did you find useful in learning Haskell? I picked up "The Haskell School of Expression", Paul Hudak, from Yale, in it's 7th printing, 2007. But Hudak seems to take forever to dish out the language. I think its better suited for someone who has never programmed before. I'd rather get a text that assumes the reader has a certain acumen in programming at the outset.

Thanks for dropping by,

--kyle

Christophe (vincenz) said...

Hello Kyle,

Well, I personally used the YAHT. While it covers some very basic things, and skims over monads a bit too quickly, I found it a good source to learn Haskell, especially coming from O'Caml.

As for the stack-compiler, the notations of the types are purely for documentation purposes, they are not actually required. The compiler will infer those.

I have read about the Typed Scheme effort (on reason I learned about your blog is because I'm subscribed to Planet Scheme as well as Planet Haskell). It seems interesting, and I hope it will lead somewhere besides just academic interest. I think soft-typing is an interesting concept, where you can allow for soft degradation in typing to the runtime system, and choose when you want to have efficiency and when you want to have flexibility. (Though I wonder how type-inference would work with it).

One reason why I'm perhaps slightly wary is for instance that scheme code typically returns #f when no value should be returned. And thus you basically flatten the (Maybe a) type from Haskell. (Aka, if your function actually returns a boolean, how can you determine whether it's no value or a false value. The distinction between : Nothing and Just False).

Anyways, that was somewhat offtopic. I will definitely look at your code, though from a cursory glance, it will require indepth analysis (did I see continuations in there?).

Btw, what I'm wondering, and maybe you know more about this, given your knowledge on continuations. Can the algorithms that use mutation be rewritten to use continuations instead? Is there some sort of duality to these two concepts?


-- Christophe

Christophe (vincenz) said...

Hmm, damn, I had filled in an entire comment, but I feared it got lost :( So here it goes again, probably with some thoughts lost...

Anyways, I will take a look at the code and see if I can port it to Haskell :). For the tutorial, I used YAHT. It also covers the very beginner basics, but then it speeds up. I used it when going from O'Caml to Haskell, and as such it was a great guide. IIRC, his treatment of monads helped me most.

Regarding your comment about the shortness of the implementation of the mini-compiler in Haskell, you should know that the type-annotations are purely optional. It's just a common-practice to add them for top-levels as they act like handy documentation (which are then also verified :). However if you leave them out, the compiler will easily infer them. Only in rare cases when doing advanced type-hacking are hints required for the compiler.

I know about Typed Scheme, in fact the reason that I know about your blog is because I am registered to Planet Scheme (and Planet Haskell). Typed Scheme seems like an interesting concept, and I hope that it will grow into something more than just a toy-example, such that one an use soft-typing: go for static typing where efficiency is required (this might get me a bit flamed, but I'm always under the impression that dynamic typing is going to be inherently just slightly slower), and dynamic typing when flexibility is required. Just not sure how well that will work with type-inference. Also, in general, it will require schemers to change some std idioms. For instance, AFAIK, schemers will return #f when a function returns no value, thereby flattening the value space. Now if you have a function that optionally returns a bool, you're stuck. In Haskell, you'd have three values: Just True, Just False, Nothing. But this is perhaps a bit too much side-track, we'll have to see where Typed Scheme goes.

Btw, regarding the implementations, I have a question that maybe you can answer and that's been mulling in my mind. As stated, some flatten-functions use mutable data. However, can these be implemented using continuations (is there some sort of duality between state and continuations?). If so, I could recode them to use continuations in Haskell.

Cheers,

-- Christophe

Christophe (vincenz) said...

Doh... you have comment moderation. Sorry about the multiplicity of comments.. Feel free to keep whichever version you prefer.

Kyle Smith said...

Hi Christophe,

I routinely publish comments as soon as I see that its not junk, so as I pulled myself away from hacking I noticed your name in my email box, and just published them without concern, by email. It was only afterward that I noticed you had duplicated a comment. No worries.

I actually found YAHT last night, although I believe your source is from the Univ. of Utah, and I found it on wikibooks.org. I spent some time at the Univ. of Utah in the "Medical Imaging Research Lab" (MIRL) as a medical student, and then again as an intern. Great school. We were using ecg synchronized MRI to differentiate myocardial infarct of the endocardium, which is dead tissue, from what is called "dormant" myocardium. The thing about dormant myocardium is it appears to be dead on an echocardiogram, but it is actually salvageable if you can re-perfuse it within a window of time, which is on the order of days, instead of hours. The scientists at MIRL are outstanding; it was one of the highlights of my time spent in medical school.

I actually hacked up a quick virtual stack machine in Scheme, but couldn't stop myself from adding features. Now the "Dual Stack Virtual Machine" (DSVM) is equipped with separate cache and instruction stacks. The cache is an internal stack used to store potentially n arguments before it gets to the operand, since Scheme's (+), (-), (*), & (/) functions have indefinite arity, and it would also facilitate implementing list operands. Then I decided that the DSVM should be running in its own thread, which is what I was just finishing up when I noticed your email. So, I don't think there will be any chance of my Scheme implementation using less lines of code than yours, even with the type hints left in your code.

Actually, it may never see the light of day, since I just discovered Intel's latest version of their C/C++ compiler, which supports OpenMP, and more significantly auto vectorization as well as auto parallelism. That means on my Dual core Dual Xeon Mac Pro, it will, even without OpenMP pragmas in the code, determine if a block of code is safe to either vectorize, using the advanced instruction set of the Xeon Processor, or partition into separate threads on separate cores, or both. I've run some simple benchmarks on serial code, gcc 4.01 v.s. icc 10.0 with the -openmp, -parallel and -xT options enabled in icc, and the results are really impressive.

I've had a rather dormant sourceforge.net project open for several months for the purpose of writing a new Scheme implementation, and I think the idea of writing one that took advantage of hardware that is quite common today, but vastly under-utilized, because of the complexity of maintaining both a multiprocessor version and a single processor version, which keeps software vendors from going down that road, is exciting. There are an extraordinary number of papers that have been published in the last five years about the advantages of running the garbage collector on a separate processor. The whole picture is starting to come into focus with where I want to go with the implementation. It just may come to life after all.

You had a comment about boolean functions. There are a couple of ways to deal with it, but the one I use is to wrap the function in an error handler that returns (void) if it encounters an exception during evaluation. Which gives you a tri-valued boolean function, and the programmer can take what ever measure is appropriate if void is returned.

On the topic of continuations, they do indeed serve as a functional method of maintaining state. And with first class continuations; i.e., call/cc, the expressive power they afford the programmer is vast. However, even with the great advances in the technology of implementing continuations directly on the stack, even in uncooperative environments, such as C/C++, there is a not insignificant performance penalty for their use. In the case of the flatten functions, where performance was my major design goal, I didn't even pursue that avenue.

The more efficient, and Scheme like idiom typically employed is to add additional arguments to a recursive function if state information needs to be held over between calls to the same function. This usually allows you to maintain proper tail recursion on recursive calls as well, which is a big win, since with proper tail recursive functions, you can get by without building a new stack frame for every invocation of the function, which also keeps the stack from growing without bound.

There is also a technique, which is frowned upon by Scheme purists, that involves placing a let bound state identifier behind the initial lambda in a function definition. These let bound identifiers most definitely cause the function to behave in different ways, over time, even when given the same arguments; so they wouldn't fit into a strict functional language paradigm. Still, the technique, if employed with great moderation, can make for a very effective and quick implementations of a toggle function, where you definitely want that behavior.

Having said all that, you should know that first class continuations are capable of implementing nearly any control structure that you could dream up. Because of there expense, computationally, however, I try to use the smallest hammer when ever possible. In the flatten-benchmark.ss code I use first class continuations for portability sake, when a much cheaper, and less dangerous alternative exists called, let/ec. It's like a normal call/cc invocation, but its extent is limited to merely escaping from within its own scope; i.e., you can't stash it away in an object, and expect it to work in some distant function, like you can with call/cc continuations. Not all Scheme implementations have let/ec, but I still wanted the convenience of escaping from a block of code at the first opportunity, so that's why you see call/cc in the code. Their really acting as as structured GOTO statements to exit nested code blocks directly.

I'll give it some thought and see If I can't come up with a flatten function that employs call/cc or maybe a delimited continuation, in its implementation. However, I don't think they would be competitive with the other implementations, from a performance standpoint. I'm usually not that concerned about performance; I kind of got caught up in the whole benchmark thing by accident more than design. I'm glad for the experience, though, because it forced me to look into GC's to a much greater extent than I ever had done before. As Will Clinger told me, most GC's do quite well when not heavily loaded, you are only really interested in the differences in performance when you are running out of memory. And there is quite a body of research being conducted about just that topic. With the introduction of C# using a GC, and Microsoft's experimental f# functional language, I expect you'll see a lot of innovation in this area, because it has the caché of mainstream IT, and not just some Scheme related research, which production programmers tend to shy away from. The hardware folks are also providing facilities to improve performance of GC's, with the introduction of read and write locks at various levels of granularity. Industry sees this now as a much cheaper way to engineer software, when you consider the costs of fixing a single memory allocation oversight on the part of a single programmer.

Well, I'm very much looking forward to my introduction to Haskell. I had been working my way through a text on SML/NJ, but realized that the truly novel ideas were in Haskell. It will be interesting to see if I can resist the ease of hacking up something in Scheme long enough to be properly introduced to Haskell.

Let me know if anything needs a better explanation in the code. I'd be more than happy to help.

Take care,

--kyle

Christophe (vincenz) said...

Hello Kyle,

As promised, I did the tests in Haskell. One reader commented that I should change the optimization flags, so I will be redoing the experiments when I have some time (took several hours of running, and then manually copy-pasting the numbers to average them and get them in to a spreadsheet, I should've outputted to a saner format...)

Flatten Benchmark in Haskell

Cheers,
christophe

Mark said...

Anyone tried doing this without mutations?
I come up with the function below, which should run in O(n) time but I don't have a good guess at the constants.

(define flatten
(lambda (L)
(letrec ((iter
(lambda (L R)
(cond
((null? L) null)
((list? (car L)) (iter (car L) (iter (cdr L) R)))
((null? (cdr L)) (cons (car L) R))
(else (cons (car L) (iter (cdr L) R)))))))
(iter L null))))

Ethan Herdrick said...

This flatten doesn't work if any elements of the tree are dotted pairs. Why?