“Co-dfns: A Data Parallel Compiler Hosted on the GPU”, 2019-11:
This work describes a general, scalable method for building data-parallel-by-construction tree transformations that exhibit simplicity, directness of expression, and high performance on both CPU and GPU architectures, whether executed on interpreted or compiled platforms, across a wide range of data sizes. This is exemplified and expounded by the exposition of a complete compiler for a lexically scoped, functionally oriented commercial programming language, Dyalog APL.
The entire source code of the Co-dfns compiler [see also BFQN] written in this method requires only 17 lines of simple code, compared to roughly 1,000 lines of equivalent code in the domain-specific compiler construction framework Nanopass, and requires no domain-specific techniques, libraries, or infrastructure support. It requires no sophisticated abstraction barriers to retain its concision and simplicity of form [see idiomatic APL].
The execution performance of the compiler scales along multiple dimensions: it consistently outperforms the equivalent traditional compiler by orders of magnitude in memory usage and run time at all data sizes, and it achieves this performance on both interpreted and compiled platforms, across CPU and GPU hardware, using a single source for both architectures with no hardware-specific annotations or code.
It does not use any novel domain-specific inventions of technique or process, nor does it use any sophisticated language or platform support. Indeed, the source uses no branching, conditionals, if statements, pattern matching, ADTs, recursion, explicit looping, or other non-trivial control or dispatch, nor any specialized data models.
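The flat, array-based tree representation underlying this style can be sketched in plain Python (a hypothetical illustration of the general technique, not the Co-dfns source, which is written in APL): the AST is kept as a parent vector, so a pass operates on whole arrays at once rather than recursing over a linked structure.

```python
# Hypothetical sketch of an array-style tree pass (not the Co-dfns
# source): the AST is a flat parent vector p, where p[i] is the parent
# of node i and the root points to itself. A pass is then a whole-array
# computation, here the depth of every node via pointer jumping.

def depths(p):
    """Compute the depth of every node simultaneously.

    Each iteration replaces every pointer with its grandparent's
    pointer, so the loop runs O(log depth) times rather than once per
    node -- the kind of step that maps naturally onto GPU hardware.
    """
    n = range(len(p))
    d = [0 if p[i] == i else 1 for i in n]   # 1 hop to parent; root = 0
    q = list(p)
    while q != [q[q[i]] for i in n]:         # until all pointers hit the root
        d = [d[i] + d[q[i]] for i in n]      # accumulate the jumped hops
        q = [q[q[i]] for i in n]             # pointer jumping
    return d

# A small tree: node 0 is the root, nodes 1-2 its children, 3-6 leaves.
p = [0, 0, 0, 1, 1, 2, 2]
print(depths(p))   # [0, 1, 1, 2, 2, 2, 2]
```

In APL the comprehensions above collapse into primitive indexing and reduction operations, which is how such passes avoid explicit loops and branches entirely.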
[Keywords: compilers, tree transformations, GPU, APL, array programming]
[“In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to 10× faster for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program.”]