Skip to main content

Golfing Run Length Encoding in Haskell

Haskell: step by step refactoring to concision

Once I was playing with and working on a Haskell clone of the old Gradius arcade games, Monadius. Most of my changes were not particularly interesting (cleaning up, Cabalizing, fixing warnings, switching all Integer to Int and so on), but in its Demo.hs, I found an interesting solution to a problem, and it seems like a good small example of Haskell abstraction.


One of the problems with language advocacy is that it’s hard to have good examples, since someone who is writing an essay probably will only come up with trivial ones, and someone dealing with meaty problems isn’t thinking about advocacy but how to solve their problem. I hope this example will be a little more substantial than the one-liners examples of closures or partial application.

Level Data In Monadius

Suppose we have levels which are specified by a pair of numbers and then a long list of numbers, often very repetitious. Perhaps a particular level might be represented this way:

level1 :: ((Int, Int), [Int])
level1 = ((2,1),[0,0,0,0,0,0,0,0,0,0,0,0,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,

This is an ugly way to define a level. We could just scrap this representation as Int completely and perhaps define it using a hypothetical enumerated type as

level1 :: ((Geometry,Geometry),[Enemy])
level1 = ((Tall,Narrow), [Flyer,Flyer,Flyer,Flyer,Flyer,Flyer,Shooter,PowerUp,Boss..])

But this representation, while certainly more understandable, is still very repetitious. It will take even more space to write down.

We could auto-derive the Read and Show typeclasses for Geometry and Enemy datatypes, and then we could store level1, level2 etc. as some sort of “levelinformation.dat” file, and read the level information in at runtime.

But that makes installation and running more difficult1 still possible, , and it doesn’t address the core issue: the information is written in a way that is so ungainly that it’s next to impossible to write or even modify!

The Solution

We need some way of expressing this more concisely, of, in short, compressing it. In this vein of thought, our first observation should be that we do not need to resort to fancy compression libraries or anything; there is an obvious way to compress it—the entire thing is a series of repeated numbers.

We should be able to replace the lengthy enumeration of fragments like [0,0,0,0,0,0,0,0,0,0,0,0] with something simpler. For example, the number of repetitions, and what is to be repeated: (12,0).

This run-length encoding is shorter and easier to modify. It is possible that there may be a performance benefit here, as we’ve gotten rid of a large constant that would have to be defined in the program itself and instead replaced it with a shorter function which evaluates to the same thing. (If we’re only on level 1, we don’t need to carry around the expanded version of the other levels, and when we go to level 2, level 1 will be garbage-collected.)

Writing It

So, what is the type of our decompressing function? Well, we need to turn a (Int,Int) into a [Int]; even better, we want to turn a whole list of (Int,Int)s into a single list of Ints. Thus, our end goal is going to be a function of this type:

rleDecode :: [(Int,Int)] -> [Int]

Let us tackle the single tuple example first. The second entry defines what we need, and the first entry defines how many we need, and that’s our type right there:

rleRecursiveDecode :: (Int, a) -> [a]

We could write a primitive recursive function that takes the parameter, decreases the counter by 1, and cons on a copy of the second entry. It could look like this:

rleRecursiveDecode (0,_) = []
rleRecursiveDecode (n,b) = b : (rleRecursiveDecode (n-1,b))

But this really is not the best way to go. It is not necessarily easy to follow, and if there is one thing I have learned about Haskell programming, it is that the most obvious approach (in this case, primitive recursion) may not be the best way to go.

This is code that could have as well been written in Scheme or something. It is complicated because we are trying to ensure that we generate an item for the list only at the exact moment we need to add it into the list; we are programming as if our language is strict, in other words. So, what is the lazy way of doing things? Why do we like laziness to begin with?

One of the key rationale for lazy evaluation is that it promotes separation of concerns—one function doesn’t need to know when or how much has been done by another function. In this case, what concerns ought to be separated? Well, our desired list has 2 properties: its length, and the single repeated item.

Since we can create infinite lists, there is no need for the length to be set at the same time as the repeated-item is being generated. We can create an infinite list containing the item, and we demand as many as we need. Lazy evaluation means our infinite list doesn’t blow up. Simple enough!

The generation is too easy for words:

-- Define an infinite list of our item.
duplicate :: a -> [a]
duplicate a = a : duplicate a

But what’s our demand function? A little thinking and we know that we have a list, a number, and we get back a list. Int -> [a] -> [a] describes a few functions, the second of which turns out to be what we want: take.

rleLazyDecode :: (Int,Int) -> [Int]
rleLazyDecode (n,b) = take n (duplicate b)

Now, duplicate is a simple enough function to define, but another type-search will show that a -> [a] is already defined in the standard libraries—repeat.

So we scrap duplicate:

rleLazyDecode (n,b) = take n (repeat b)

Might as well remove the parentheses:

rleLazyDecode (n,b) = take n $ repeat b

We can do better. What is the type of \n b -> take n $ repeat b? It is Int -> a -> [a], and a Hoogle type-search turns up an exact match: replicate. So we can improve it further:

rleLazyDecode (n,b) = replicate n b

Are we quite done? Well, one could ask: rleLazyDecade is now basically nothing but an alias or new name for replicate, except for how it converts from a tuple (n,b) to normal uncurried arguments n b. Is there any function which will do this conversion for us? Well, we have a function replicate :: a -> b -> c, and data (a,b) passed into said function to produce a c, so we want a function with the type signature (a -> b -> c) -> (a,b) -> c, which turns out to be something called uncurry. Having reasoned our way this far, the implementation is much clearer than the type:

uncurry :: (a -> b -> c) -> (a, b) -> c
uncurry f (a,b) = f a b

That’s what we were just doing manually. So now the next version is clear:

rleLazyDecode = uncurry replicate

A satisfying, short, functional, and lazy one-liner. From here the definition of rleDecode is almost trivial: we extend it to a list of tuples by throwing in a map, and we turn the resulting list of lists into a single list by way of concat:

rleDecode ns = concat $ map rleLazyDecode ns

We can tweak this further, as concat . map is a common enough idiom that there is a shortcut:

rleDecode ns = concatMap rleLazyDecode ns

As before, we could have found concatMap by searching for the type signature of concat . map: (a1 -> [a]) -> [a1] -> [a]. Aw heck—let’s make it points-free like rleLazyDecode:

rleDecode = concatMap rleLazyDecode

And then we substitute in the rleLazyDecode definition:

rleDecode = concatMap (uncurry replicate)

We can actually go even further into the realms of incomprehensibility. It turns out that lists are a monad. This means we can use bind and all the rest of the operations defined by the Monad typeclass to operate on lists and other things. So we can write this bizarre (but short and effective) version of rleDecode:

rleDecode = (uncurry replicate =<<)

And we are done! We can now represent the first level like this:

d1 = ((2,1),d)
  where d = rleDecode [(5, 3), (10, 67), (6, 3), (5, 67), (29, 3), (2, 11),
     (29, 3), (5, 35), (3, 3), (3, 35), (24, 3), (2, 35), (19, 3), (7, 19), (21, 51),
     (63, 35), (15, 43), (5, 11), (8, 43), (8, 35), (3, 3), (9, 35), (20, 3), (52,
     67), (32, 3), (13, 35), (3, 3), (11, 35), (5, 3), (9, 35), (4, 3), (6, 35), (4,
     3), (5, 35), (13, 3), (14, 19), (4, 3), (4, 19), (6, 3), (4, 19), (7, 3), (5,
     19), (8, 3), (8, 19), (5, 83), (3, 67), (2, 3), (12, 19), (5, 83), (17, 19), (9,
     83), (2, 19), (5, 3), (20, 67), (1, 75), (38, 11), (1, 43), (4, 35), (7, 3),
     (57, 67), (2, 3), (5, 19), (17, 3), (6, 19), (8, 3), (6, 67), (7, 83), (59, 3),
     (9, 19), (1, 51), (17, 35), (20, 3), (14, 35), (7, 3), (2, 35), (11, 43), (6,
     11), (7, 3), (26, 67), (7, 3), (13, 35), (5, 43), (7, 11), (2, 3), (5, 67), (1,
     83), (2, 19), (2, 3), (10, 67), (21, 3), (147, 0)]

Much nicer! And since our list-processing code is polymorphic, so even if we replace the arbitrary numbers by constructors, we change nothing.

(Thanks to the denizens of #haskell for refactoring tips, and Don Stewart’s blog entry on run-length encoding/decoding using Arrows.)

  1. Although Cabal makes this easy with its Paths_* mechanism.↩︎