Lossy Text Compression

"And God saw that it was ace."

Lossless compression techniques can only take you so far. Modern lossy compression techniques, like those used in JPEG or MP3, achieve much higher compression ratios because they permanently discard information. Perceptual coding algorithms aim to make changes that go unnoticed by humans.

Let's apply similar coding techniques to traditional text files. Where a typical Lempel-Ziv compressor employs a dictionary, this one instead uses a thesaurus.

The lossy text compressor consists of a Perl script and an accompanying thesaurus. On startup, the script opens the thesaurus file and constructs a hash mapping each word to its shortest synonym. Some candidate synonyms are filtered out, e.g. all-caps abbreviations and two-letter synonyms, because those don't meet quality standards.
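The hash-building step can be sketched in Python (the project itself is Perl; the filtering rules below mirror the description above, and the sample record is invented for illustration):

```python
def build_thesaurus(lines):
    """Map each keyword to its shortest acceptable synonym."""
    thes = {}
    for line in lines:
        # Moby format: word,synonym1,synonym2,...\r
        word, *syns = line.rstrip("\r").split(",")
        best = word
        for syn in syns:
            if syn == syn.upper():   # skip straight abbreviations
                continue
            if len(syn) < 3:         # skip unreadable two-letter synonyms
                continue
            if len(syn) < len(best):
                best = syn
        if best != word:             # store only actual improvements
            thes[word] = best
    return thes

records = ["gigantic,big,colossal,enormous,huge,vast\r"]
print(build_thesaurus(records))  # {'gigantic': 'big'}
```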

Compression is then a simple substitution of every word in the supplied text. This reduces file size while keeping the text readable to most people. Higher compression levels can be reached by running the compressor repeatedly - but note that text degradation ("generational loss") can result in an unreadable document.
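The substitution pass itself can be sketched like this (Python, with a hypothetical two-entry thesaurus; the real script additionally handles capitalization and plural endings):

```python
import re

def compress(text, thes):
    """One lossy pass: replace each word with its shorter synonym, if any."""
    return re.sub(r"[\w-]+", lambda m: thes.get(m.group(0), m.group(0)), text)

thes = {"gigantic": "big", "automobile": "car"}
print(compress("a gigantic automobile", thes))  # a big car
```

Running the output through `compress` again is where generational loss comes from: each pass may substitute further, so the meaning drifts with every round.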

The thesaurus format used is that of the Moby thesaurus, found here: http://icon.shef.ac.uk/Moby/mthes.html

A companion decompressor script works similarly but uses the longest synonym in place of the shortest.
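In terms of the hash-building sketch, decompression is the same loop with the comparison flipped; a Python sketch (sample record invented, and note the real decompressor is a separate Perl script):

```python
def build_expander(lines):
    """Map each keyword to its *longest* acceptable synonym (decompression)."""
    thes = {}
    for line in lines:
        word, *syns = line.rstrip("\r").split(",")
        best = word
        for syn in syns:
            if syn == syn.upper() or len(syn) < 3:  # same filters as compression
                continue
            if len(syn) > len(best):  # flipped: longest instead of shortest
                best = syn
        if best != word:
            thes[word] = best
    return thes

print(build_expander(["big,huge,colossal,vast\r"]))  # {'big': 'colossal'}
```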

  • 1 × Perl
  • 1 × Thesaurus
  • 1 × Sense of humor

  • Presentation

    Greg Kennedy, 06/04/2015 at 16:07

    I gave a completely straight-faced presentation on this project at my workplace's recent Hackathon. While it didn't win, it got a lot of laughs and I was allowed to eat the free food.

    The PowerPoint is available for viewing and download here:

    https://docs.google.com/presentation/d/1SQVki8DowvFrE0uJzS9P8PGjrUsBK78RScvV5eAotGs/edit?usp=sharing

  • Source Code

    Greg Kennedy, 05/08/2015 at 16:46

    No idea where to put source code on this site, so I'll just make a project log of it.

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    ### LOSSY TEXT COMPRESSION
    # Conventional compression utilities rely on a "dictionary" to expand
    #   indices into complete tokens.
    # This compressor instead uses a thesaurus to minimize output file size.
    my %thes;
    
    {
      # Open moby thesaurus.
      # You may substitute your own thesaurus file.  The format is
      #  word,synonym1,synonym2,...\r
      open (T, '<', 'mthes/mobythes.aur') or die "Couldn't open thesaurus: $!\n";
    
      # Moby thesaurus is CR-terminated
      local $/ = "\r";
    
      # Read line at a time
      while (my $line = <T>)
      {
        chomp $line;
        # Pull keyword and then all synonyms.
        if ($line =~ m/^([^,]+),(.+)$/)
        {
          my @syns = split(/,/, $2);
          # identify shortest syn
          my $best_syn = $1;
          foreach my $syn (@syns)
          {
            # Straight abbreviations aren't very funny
            next if ($syn eq uc($syn));
            # Two letter synonyms aren't readable
            next if (length($syn) < 3);
            # Optionally, filter stupidly overcommon synonyms
            #next if ($syn eq 'air' || $syn eq 'bed');
            if (length($syn) < length($best_syn)) { $best_syn = $syn; }
          }
    
          # Save RAM: store syn only if it's better than the keyword
          if ($best_syn ne $1)
          {
            $thes{$1} = $best_syn;
          }
        } else {
          # Your thesaurus is broken.
          die "%% $line\n";
        }
      }
      close (T);
    }
    
    # Some functions
    sub try_lookup($)
    {
      my $tok = shift;
      # case-sensitive lookup first
      if (exists $thes{$tok}) { return $thes{$tok}; }
      # case-insensitive lookup
      elsif (exists $thes{lc($tok)})
      {
        # Retrieve the lookup match
        my $ret = $thes{lc($tok)};
    
        # First letter was lowercase already
        if (substr($tok,0,1) ne uc(substr($tok,0,1))) { return $ret; }
        # Correct capitalization on substituted first letter
        return uc(substr($ret,0,1)) . substr($ret,1);
      }
      # nomatch, just return the word
      return $tok;
    }
    
    sub try_plural_lookup($$$)
    {
      my $word = shift;
      my $plural_suffix = shift;
      my $plural_subst = shift;
    
      my $suf_len = length($plural_suffix);
    
      # See if the word is long enough and its ending matches plural_suffix
      #  (the length check avoids a "substr outside of string" warning).
      if (length($word) > $suf_len && substr($word,-$suf_len) eq $plural_suffix) {
        # Suffix appears to match.  Create test tok by removing suffix
        #  and putting plural_subst on (e.g. piracIES -> piracY)
        my $tok = substr($word,0,length($word) - $suf_len) . $plural_subst;
    
        my $test = try_lookup($tok);
        # A smarter compressor would know the correct ending
        #  Then again, 's' is smaller than 'es' or 'ies' etc
        if ($test ne $tok) { return $test . 's'; }
      }
      return $word;
    }
    
    # Great, thesaurus populated, now read from stdin
    while (my $line = <STDIN>)
    {
      chomp $line;
      # Split into words
      my @tokens = split(/\s+/, $line);
    
      my @result;
      foreach my $token (@tokens)
      {
        # Take apart the token - remove leading and trailing quotes, punct, etc
        if ($token =~ m/^(.*?)([\w\d-]+)(.*?)$/)
        {
          my $tok = $2;
    
          # Substitution rules.
          $tok = try_lookup($tok);
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'ies','y'); }
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'es',''); }
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'s',''); }
    
          push(@result, $1 . $tok . $3);
        } else {
          # Uncompressible punctuation or something
          push(@result, $token);
        }
      }
      print join(' ', @result) . "\n";
    }
    

  • Sample Decompressor Output - U.S. Declaration of Independence

    Greg Kennedy, 05/08/2015 at 16:33

    Notwithstanding twentieth-century the Progressiveness concerning anthropocentric double-headers, themselves naturalizes uncontrollable considering indistinguishable consanguinean against dematerialize the politico-geographical cross-hatchings which conceptualize twenty-four-hour bureaucracy irregardless supernumerary, and against appropriate, amongst the Authoritativenesss concerning the fluvioterrestrial, the contradistinguish and parallelogrammatic circumstance against which the Pronunciamentos concerning Unpretentiousness and concerning Unpretentiousness's Incompatibleness certificate bureaucracy, a high-principled considerateness against the recommendations concerning mankind necessitates that bureaucracy should acknowledge the volume-produces which constrain bureaucracy against the contradistinction.

    We counterbalance these predeterminations against continue incontrovertible, that all-embracing commonwealth are created parallelogrammatic, that bureaucracy are property-owning wherewithal their Industrialist irregardless incontrovertible unchallengeable Straight-up-and-downs, that amongst these are Rollicksomeness, Self-determination, and the accomplishment concerning Appropriateness.



Discussions

tlp wrote 10/04/2018 at 08:52

Made a very naive awk script to readably compress huge log files. Not funny, but usable

https://github.com/tibolpol/logshort


PointyOintment wrote 06/08/2015 at 17:49

Would it be possible to decompress based on which words make the most sense in context? Perhaps start by picking a decompression at random for each word, and then evolve the choices until they all make the greatest possible sense together. You could use Markov chains to evaluate sensicality.


Alex Rich wrote 06/05/2015 at 00:23

this is hilarious.  sort of reminds me of up goer 5.


Eric Hertz wrote 05/09/2015 at 02:20

Is it possible Genesis makes more sense after being run through your compressor/decompresser?

Awesome :)
