soc reloaded: Outcomes

This blog is kind of a final report for the summer. I had a progress report drafted for about two weeks, so let me paste that here just for the record. To read about the actual results, please skip to the “State of the Code” section below.

The Last Progress Report

This is the second (and last) progress update for this summer of code project. It was written something like a week before the pencil’s down, but I got disheartened and instead of finishing and posting, I went on to code some more. Here it goes…

Since my last report, I have decided to turn somewhat more radical again. The original plan was to stick with the darcs codebase and do most (all) of the work within that, based primarily on writing tests for the testsuite and not exposing anything of the new functionality in a user-visible fashion. I changed my mind about this. The main reason was that the test environment, as it is, makes certain properties hard to express: a typical test-suite works with assertions (HUnit) and invariants (QC). In this environment, expressing ideas like “the displayed patches are aesthetically pleasing” or “the files in the repository have reasonable shape” is impractical at best.

An alternative would have been to make myself a playground using the darcs library to expose the new code. But the fact is, our current codebase is entrenched in all kinds of legacy issues, like handling filenames and duplicated code. It makes the experimenter’s life harder than necessary, and it also involves rebuilding a whole lot of code that I never use, over and over.

All in all, I made a somewhat bold decision to cut everything that lived under Darcs.Patch (plus a few dependencies, as few as possible) into a new library, which I named patchlib, in the best tradition of cmdlib, pathlib and fslib. At that point, I also removed custom file path handling from that portion of code, removed the use of a custom Printer (a pretty-printer implementation) module and a made few other incompatible changes.

Of course, the testing code went along. The net result, at least for me, was an ability to build and test a much smaller piece of self-contained code. It also allowed me to experiment with APIs a bit, where those were used all over darcs, which made it, within the big codebase, impossible to advance without expending disproportionate amount of work on every change. Of course, part of that will be paid back when we decide to port darcs over to use patchlib.

I originally planned this report for the start of this week, but then I got caught in a big refactor of the ApplyMonad/Apply classes (again). The refactor was triggered by the need to pretty-print patches, which is not a completely easy task (made more complicated by the fact that UUIDs are meaningless for the user, so formatting patches without context is essentially useless now). Anyway, I am now much happier with how the ApplyMonad class looks (the ApplyMonadBase thing was genuinely hideous… good riddance). As a net result the ApplyMonad class and, even more importantly, the ApplyMonad transformer (used in applyTo{Tree,State} among other things) is substantially easier to use on the client side (while maybe very slightly harder on the provider side, it is also much clearer, IMHO). Overall win.

As for formatting and summarising patches, I have created a new Display class (I plan to nuke the existing viewing classes more or less, when I manage to make meaningful Display instances for V1 Prim). The API lets you format patches based on their ending or starting context. Since bits of the patches need to be fetched from the hashed store, the display needs to run in a LoadMonad. Since any reasonable patch formatter also needs to pass state around (and since the actual type of state passed around will be different for different Prim implementations, we hide the state in a type family of monad transformers; this also opens the option to use something else than StateT when appropriate).

(This is the end of the “progress” post. The remainder more or less describes the end state of the project.)

State of the Code

You can look at the code in my darcs repositories. Specifically, to play around with a “demo”, you need pathlib, fslib, patchlib, cmdlib and gorsvet. A bit more about the individual libraries:

About patchlib

One of the main problems of the darcs codebase today is insufficient modularity, and patchlib is an experiment in an attempt to address that concern. Efforts to bring about significant improvement of the situation from “inside” (by restructuring existing code) have to date failed. Even though there have been local improvements, the overall problem stubbornly persists.

Hand in hand with modularity problems come issues with unclear and underspecified (both internal end external) APIs. Since the separation between different components of darcs is blurry at best, the pressure to introduce clean, testable interfaces is minimal. The external library, on the other hand, is forced to put up a presentable façade. Luckily, the Patch subhierarchy in darcs is, compared to remainder of darcs, in a fairly good shape in this respect.

Eventually, patchlib should provide leverage to work with darcs(-style) patches, including at least:

With a homogenous set of APIs, mostly mediated by type classes and appropriate instances, to:

About V3 Prims in patchlib

The version 3 primitives in their current incarnation in patchlib have the following traits (based on the list from the proposal):

About gorsvet

Gorsvet is a toy implementation of a repository layer that uses V3 primitive patches. (Un)fortunately, the V3 prims violate fundamental assumptions made by the repository and command layers of darcs, which means that an integration is substantially more expensive than fits a single SoC project. However, as outlined above, it is useful to be able to play around with the V3 prim patches in a realistic environment. Therefore, gorsvet. I made it into a rather thin user shell based on cmdlib and a prototype repository layer. The UI more or less resembles darcs (without interactivity, since that’d be superfluous for a tool of this scope). You are of course welcome to try the experimental tool out: the online help should give you an idea how to use it.

One thing I would like to discuss a bit here is the repository format. Since the patch types are incompatible anyway, we are fully liberated from backward compatibility considerations. The next darcs repository format can be designed from scratch, keeping in mind the shortcomings of the previous two formats. The implementation in gorsvet is a peek at what the result might look like. Anyway, we still need a better “composite” patch layer, which represents conflicts (and sits one level above primitive patches), since the current (version 2) composite patches in darcs are quite unsatisfactory. That also means we have plenty of time to play around with the repository format, which is more or less independent of the composite patch format.

As for the format itself, I went for as simple as possible (but no simpler). So far, I have 2 files and a hashed store: .gorsvet/hashed is a sha256-based, uncompressed “dumping grounds” for stuff of all kinds. The files are (not implemented as of this writing) .gorsvet/index (with the same purpose as _darcs/index — fast, efficient working copy access) and .gorsvet/meta — a small set of root pointers. This has two very beneficial side effects: very atomic oupdates and transactional semantics (free transaction rollback). Compression and garbage collection of the hashed store can then be sorted out separately (and does not affect semnatics of the repository either way). There are currently 4 root pointers in meta: shadow, pristine, inventory and order. The pristine is the same thing as in darcs, shadow is a similar thing, but at any time reflects the current state of the working copy (it is automatically updated from working every time, before anything else happens with the repository). The reason is mainly that the working copy has entirely wrong semantics for diffing V3 primitive patches: most importantly, UUID tracking is implemented through shadow.

The remaining two root pointers represent two new data structures (and inventory is different from what darcs calls an inventory). The order simply lists patchinfo hashes (a handle for each patch that does not change through commutation) in the application order for the repository: in this sense, order replaces hashed_inventory known from darcs. It is pretty compact, but on its own also useless (since it give us no way to get the patches themselves). The inventory then, on the other hand, is an efficient map (currently a sorted pair list, written and read as a Data.Map) from patchinfo hashes to the current patch storage hashes. Moreover, the patchinfo objects are, unlike in darcs, stored in the hashed store as separate entities, and the patchinfo hash can be used to efficiently fetch the patchinfo itself. Therefore, the named patch can be assembled from inventory and order puts the patch into correct context.

Apart from storing and showing things, I have also implemented a “pull” command for gorsvet, but it’s currently fairly unusable, since any conflict automatically means failure (there is no layer to handle conflicts; we could use version 2 darcs patches, but I think it would constitute a dangerous slippery slope: we definitely want a more solid implementation, and also want to avoid a double transition).

Future Work

The obvious future work lies in the conflict handling. There are two main options in this regard: either re-engineer a patch-level, commute-based representation of conflicts (in the spirit of mergers and conflictors), as V3 “composite” patches, or alternatively, use a non-patch based mechanism for tracking conflicts and resolutions. It’s still somewhat early to decide which is a better choice, and they come with different trade-offs. Nevertheless, the decision, and the implementation, constitute a major step towards darcs 3.

The other major piece of work that remains is the repository format: in this area, I have done some research in both the previous and this year’s project, but there are no definitive answers, even less an implementation. I think we now have a number of good ideas on how to approach this. We do need to sort out a few issues though, and the decision on the conflict layer also influences the shape of the repository.

Each of these two open problems is probably about the size of an ambitious SoC project. On top of that, a lot of integration work needs to happen to actually make real use of the advancements. We shall see how much time and resources can be found for advancing this cause, but I am relatively optimistic: the primitive level has turned out fairly well, and to me it seems that shedding the shackles of legacy code sprawl can boost the project as a whole significantly forward.

(PS: While I agree, on the theoretical level, that nuking significant amounts of legacy code carries non-trivial risks, for a small volunteer project like darcs it is imperative to be fun. And a trench war against legacy code is not fun. Writing new things and exploring possibilities, on the other hand, is fun. Which means we need a bit more of the latter and a bit less of the former, even though the project sits in the more conservative camp — afterall, we handle rather precious data… Even when taking natural conservativism into account, a well-motivated, honest subsystem rewrite is better than a half-hearted, someone-has-to-do-it maintenance of a piece of code that everyone hates…)