Lossy Text Compression

"And God saw that it was ace."

Lossless compression techniques can only take you so far. Modern lossy compression techniques, like those used in JPEG or MP3, achieve much higher compression ratios because they permanently discard information. Perceptual coding algorithms aim to make changes that go unnoticed by humans.

Let's apply similar coding techniques to traditional text files. Where a typical Lempel-Ziv compressor employs a dictionary, this one instead uses a thesaurus.

The lossy text compressor consists of a Perl script and an accompanying thesaurus. On startup, the script opens the thesaurus file and constructs a hash mapping each word to its shortest synonym. Some candidate synonyms are filtered out, e.g. all-caps abbreviations and two-letter synonyms, because those don't meet quality standards.
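The hash-building step can be sketched in Python (the project itself is Perl; the filtering rules below mirror the description above, and the sample record is invented for illustration):

```python
def build_thesaurus(lines):
    """Map each keyword to its shortest acceptable synonym."""
    thes = {}
    for line in lines:
        # Moby format: word,synonym1,synonym2,...\r
        word, *syns = line.rstrip("\r").split(",")
        best = word
        for syn in syns:
            if syn == syn.upper():   # skip straight abbreviations
                continue
            if len(syn) < 3:         # skip unreadable two-letter synonyms
                continue
            if len(syn) < len(best):
                best = syn
        if best != word:             # store only actual improvements
            thes[word] = best
    return thes

records = ["gigantic,big,colossal,enormous,huge,vast\r"]
print(build_thesaurus(records))  # {'gigantic': 'big'}
```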

Compression is then a simple substitution of every word in the supplied text. This reduces file size while keeping the text readable to most people. Higher compression levels can be reached by running the compressor repeatedly - but note that text degradation ("generational loss") can result in an unreadable document.
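The substitution pass itself can be sketched like this (Python, with a hypothetical two-entry thesaurus; the real script additionally handles capitalization and plural endings):

```python
import re

def compress(text, thes):
    """One lossy pass: replace each word with its shorter synonym, if any."""
    return re.sub(r"[\w-]+", lambda m: thes.get(m.group(0), m.group(0)), text)

thes = {"gigantic": "big", "automobile": "car"}
print(compress("a gigantic automobile", thes))  # a big car
```

Running the output through `compress` again is where generational loss comes from: each pass may substitute further, so the meaning drifts with every round.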

The thesaurus format used is that of the Moby thesaurus, found here: http://icon.shef.ac.uk/Moby/mthes.html

A companion decompressor script works similarly but uses the longest synonym in place of the shortest.
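In terms of the hash-building sketch, decompression is the same loop with the comparison flipped; a Python sketch (sample record invented, and note the real decompressor is a separate Perl script):

```python
def build_expander(lines):
    """Map each keyword to its *longest* acceptable synonym (decompression)."""
    thes = {}
    for line in lines:
        word, *syns = line.rstrip("\r").split(",")
        best = word
        for syn in syns:
            if syn == syn.upper() or len(syn) < 3:  # same filters as compression
                continue
            if len(syn) > len(best):  # flipped: longest instead of shortest
                best = syn
        if best != word:
            thes[word] = best
    return thes

print(build_expander(["big,huge,colossal,vast\r"]))  # {'big': 'colossal'}
```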

  • 1 × Perl
  • 1 × Thesaurus
  • 1 × Sense of humor

  • Presentation

    Greg Kennedy, 06/04/2015 at 16:07

    I gave a completely straight-faced presentation on this project at my workplace's recent Hackathon. While it didn't win, it got a lot of laughs and I was allowed to eat the free food.

    The PowerPoint is available for viewing and download here:

    https://docs.google.com/presentation/d/1SQVki8DowvFrE0uJzS9P8PGjrUsBK78RScvV5eAotGs/edit?usp=sharing

  • Source Code

    Greg Kennedy, 05/08/2015 at 16:46

    No idea where to put source code on this site, so I'll just make a project log of it.

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    ### LOSSY TEXT COMPRESSION
    # Conventional compression utilities rely on a "dictionary" to expand
    #   indices into complete tokens.
    # This compressor instead uses a thesaurus to minimize output file size.
    my %thes;
    
    {
      # Open moby thesaurus.
      # You may substitute your own thesaurus file.  The format is
      #  word,synonym1,synonym2,...\r
      open (T, '<', 'mthes/mobythes.aur') or die "Couldn't open thesaurus: $!\n";
    
      # Moby thesaurus is CR-terminated
      local $/ = "\r";
    
      # Read line at a time
      while (my $line = <T>)
      {
        chomp $line;
        # Pull keyword and then all synonyms.
        if ($line =~ m/^([^,]+),(.+)$/)
        {
          my @syns = split(/,/, $2);
          # identify shortest syn
          my $best_syn = $1;
          foreach my $syn (@syns)
          {
            # Straight abbreviations aren't very funny
            next if ($syn eq uc($syn));
            # Two letter synonyms aren't readable
            next if (length($syn) < 3);
            # Optionally, filter stupidly overcommon synonyms
            #next if ($syn eq 'air' || $syn eq 'bed');
            if (length($syn) < length($best_syn)) { $best_syn = $syn; }
          }
    
          # Save RAM: store syn only if it's better than the keyword
          if ($best_syn ne $1)
          {
            $thes{$1} = $best_syn;
          }
        } else {
          # Your thesaurus is broken.
          die "%% $line\n";
        }
      }
      close (T);
    }
    
    # Some functions
    sub try_lookup($)
    {
      my $tok = shift;
      # case-sensitive lookup first
      if (exists $thes{$tok}) { return $thes{$tok}; }
      # case-insensitive lookup
      elsif (exists $thes{lc($tok)})
      {
        # Retrieve the lookup match
        my $ret = $thes{lc($tok)};
    
        # First letter was lowercase already
        if (substr($tok,0,1) ne uc(substr($tok,0,1))) { return $ret; }
        # Correct capitalization on substituted first letter
        return uc(substr($ret,0,1)) . substr($ret,1);
      }
      # nomatch, just return the word
      return $tok;
    }
    
    sub try_plural_lookup($$$)
    {
      my $word = shift;
      my $plural_suffix = shift;
      my $plural_subst = shift;
    
      my $suf_len = length($plural_suffix);
    
      # See if the word is long enough and its ending matches plural_suffix
      #  (the length check avoids a "substr outside of string" warning).
      if (length($word) > $suf_len && substr($word,-$suf_len) eq $plural_suffix) {
        # Suffix appears to match.  Create test tok by removing suffix
        #  and putting plural_subst on (e.g. piracIES -> piracY)
        my $tok = substr($word,0,length($word) - $suf_len) . $plural_subst;
    
        my $test = try_lookup($tok);
        # A smarter compressor would know the correct ending
        #  Then again, 's' is smaller than 'es' or 'ies' etc
        if ($test ne $tok) { return $test . 's'; }
      }
      return $word;
    }
    
    # Great, thesaurus populated, now read from stdin
    while (my $line = <STDIN>)
    {
      chomp $line;
      # Split into words
      my @tokens = split(/\s+/, $line);
    
      my @result;
      foreach my $token (@tokens)
      {
        # Take apart the token - remove leading and trailing quotes, punct, etc
        if ($token =~ m/^(.*?)([\w\d-]+)(.*?)$/)
        {
          my $tok = $2;
    
          # Substitution rules.
          $tok = try_lookup($tok);
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'ies','y'); }
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'es',''); }
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'s',''); }
    
          push(@result, $1 . $tok . $3);
        } else {
          # Uncompressible punctuation or something
          push(@result, $token);
        }
      }
      print join(' ', @result) . "\n";
    }
    

  • Sample Decompressor Output - U.S. Declaration of Independence

    Greg Kennedy, 05/08/2015 at 16:33

    Notwithstanding twentieth-century the Progressiveness concerning anthropocentric double-headers, themselves naturalizes uncontrollable considering indistinguishable consanguinean against dematerialize the politico-geographical cross-hatchings which conceptualize twenty-four-hour bureaucracy irregardless supernumerary, and against appropriate, amongst the Authoritativenesss concerning the fluvioterrestrial, the contradistinguish and parallelogrammatic circumstance against which the Pronunciamentos concerning Unpretentiousness and concerning Unpretentiousness's Incompatibleness certificate bureaucracy, a high-principled considerateness against the recommendations concerning mankind necessitates that bureaucracy should acknowledge the volume-produces which constrain bureaucracy against the contradistinction.

    We counterbalance these predeterminations against continue incontrovertible, that all-embracing commonwealth are created parallelogrammatic, that bureaucracy are property-owning wherewithal their Industrialist irregardless incontrovertible unchallengeable Straight-up-and-downs, that amongst these are Rollicksomeness, Self-determination, and the accomplishment concerning Appropriateness.



Discussions

tlp wrote 10/04/2018 at 08:52

Made a very naive awk script to readably compress huge log files. Not funny, but usable

https://github.com/tibolpol/logshort


PointyOintment wrote 06/08/2015 at 17:49

Would it be possible to decompress based on which words make the most sense in context? Perhaps start by picking a decompression at random for each word, and then evolve the choices until they all make the greatest possible sense together. You could use Markov chains to evaluate sensicality.


Alex Rich wrote 06/05/2015 at 00:23

this is hilarious.  sort of reminds me of up goer 5.


Eric Hertz wrote 05/09/2015 at 02:20

Is it possible Genesis makes more sense after being run through your compressor/decompresser?

Awesome :)
