
[–]NotElonMuzk 3 points

Using a predictive text model to encode and decode is a really bad idea. It's not meant to compute stuff. Use with caution.

[–]DehiXeM[S] 0 points

Yes, that's a fun idea: train it on a custom compression algo to save tokens in the long run!

[–]gregnr 2 points

Clever idea, but I don't think this will produce the results you expect. text-davinci-003 is surprisingly good at understanding simple encodings (e.g. base64):

```
Using the following encoded string, summarize and list the key points described in its contents:

SGVsbG8gd29ybGQhIFRvZGF5IGlzIGEgZ3JlYXQgZGF5IGZvciBhIHdhbGsgaW4gdGhlIHBhcmsuIFdoZXJlIGlzIFdhbGRvPyBIZSBtdXN0IGJlIGhlcmUgc29tZXdoZXJlLiBDYW4geW91IGZpbmQgaGltPyBXaGF0IGNvbG9yIHNoaXJ0IGlzIGhlIHdlYXJpbmc/

Key Points:
- Greeting: "Hello world!"
- Activity: Going for a walk in the park
- Question: Where is Waldo?
- Task: Find him
- Details: What color shirt is he wearing?
```

And this is likely because the model has been trained with enough base64 data to know common character patterns and how they translate between their encoded/decoded forms. It's not perfect though:

```
Encode the following text to base64:

Hello world! Today is a great day for a walk in the park. Where is Waldo? He must be here somewhere... Can you find him? What color shirt is he wearing?

SGVsbG8gd29ybGQhIFRvZGF5IGlzIGEgZ3JlYXQgZGF5IGZvciBhIHdhbGsgaW4gdGhlIHBhcmsuIFdoZXJlIGlzIFdhbGRvPyBIZSBtdXN0IGJlIGhlcmUgc29tZXdoZXJlLi4uIENhbiB5b3UgZmluZCBoaW0/IFdoYXQgY29sb3Igc2hpcnQgaXMgaGUgd2FzaW5nPw==
```

If you decode that, the last word is wasing instead of wearing (still impressive though).
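You can verify that locally with a couple of lines of standard-library Python:

```
import base64

# The model's attempt at base64-encoding the text above
encoded = "SGVsbG8gd29ybGQhIFRvZGF5IGlzIGEgZ3JlYXQgZGF5IGZvciBhIHdhbGsgaW4gdGhlIHBhcmsuIFdoZXJlIGlzIFdhbGRvPyBIZSBtdXN0IGJlIGhlcmUgc29tZXdoZXJlLi4uIENhbiB5b3UgZmluZCBoaW0/IFdoYXQgY29sb3Igc2hpcnQgaXMgaGUgd2FzaW5nPw=="

# Prints the sentence back with the corrupted last word ("wasing")
print(base64.b64decode(encoded).decode("utf-8"))
```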

Compression is different. Depending on the algorithm, similar text could compress into significantly different outputs. This is a lot harder to train a language model to understand. Have you confirmed that the LZ4 output you are getting from text-davinci-003 is correct?

For argument's sake, even if we assume that text-davinci-003 can understand compressed data, we likely won't be saving tokens in the end. Tokens are based more on semantic meaning than on the general "~4 characters per token" rule. If you run the encoded string in your example through OpenAI's tokenizer (https://beta.openai.com/tokenizer), it actually produces 249 tokens, compared to the 34 tokens in your plain text version.
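You can reproduce this kind of comparison locally. Here's a minimal sketch using OpenAI's tiktoken library (my assumption, since the web tokenizer isn't scriptable; text-davinci-003 maps to the p50k_base encoding):

```
import base64
import tiktoken  # pip install tiktoken

# encoding_for_model("text-davinci-003") resolves to p50k_base
enc = tiktoken.encoding_for_model("text-davinci-003")

plain = ("Hello world! Today is a great day for a walk in the park. "
         "Where is Waldo? He must be here somewhere... "
         "Can you find him? What color shirt is he wearing?")
b64 = base64.b64encode(plain.encode("utf-8")).decode("ascii")

# The BPE vocabulary has almost no useful merges for base64 text, so the
# "encoded" form costs far more tokens than the plain text it represents.
print(len(enc.encode(plain)), "tokens (plain)")
print(len(enc.encode(b64)), "tokens (base64)")
```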

[–]DehiXeM[S] 0 points

> Have you confirmed that the LZ4 output you are getting from text-davinci-003 is correct?

Not confirmed, and I would be surprised if it were correct.

> it actually produces 249 tokens, compared to the 34 tokens in your plain text version.

Ah, good point. It's true that compression in that sense wouldn't save much.

Maybe we could create a compression algorithm based on GPT-3's token dictionary and train the model to understand it. Perhaps in the long run, with a fine-tuned model, it could save some tokens.
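Roughly what I have in mind, as a sketch (the codebook and placeholder strings are hypothetical, and whether a fine-tuned model would reliably learn them is exactly the open question):

```
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")

# Hypothetical codebook: verbose boilerplate mapped to terse placeholders.
# Nothing here is a real scheme; it just illustrates the idea.
CODEBOOK = {
    "Wanted to explore this possibility with the community": "@A",
    "pass in a compressed string with all the contextualization": "@B",
}

def compress(text: str) -> str:
    for phrase, code in CODEBOOK.items():
        text = text.replace(phrase, code)
    return text

prompt = ("Wanted to explore this possibility with the community, "
          "we could save a lot of tokens if we could "
          "pass in a compressed string with all the contextualization "
          "to the prompt.")

short = compress(prompt)
print(len(enc.encode(prompt)), "tokens before")
print(len(enc.encode(short)), "tokens after")
```

The catch is that the model has to learn the codebook somehow; if it's taught via the prompt itself, the codebook costs the very tokens you were trying to save.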

[–]DehiXeM[S] 0 points

So far it's a struggle.

Encoding prompt not working:

```
Given a plaintext string, encode the string using the LZ4 algorithm and provide the encoded string.

Plaintext: """Wanted to explore this possibility with the community, we could save a lot of tokens if we could pass in a compressed string with all the contextualization to the prompt."""

Encoded String: 0800000088b4c4a4ce4d4e4e4d4b4f4b4d4e4d4d4f4e4d4f4e4c4f4e2c207765636f6d652e202121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212
```
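For comparison, here's what real LZ4 output looks like when generated locally (using the third-party lz4 package); it bears no resemblance to what the model produced:

```
import lz4.frame  # pip install lz4

plaintext = ("Wanted to explore this possibility with the community, "
             "we could save a lot of tokens if we could pass in a "
             "compressed string with all the contextualization to the prompt.")

compressed = lz4.frame.compress(plaintext.encode("utf-8"))
print(compressed.hex())  # starts with LZ4's frame magic bytes, 04224d18

# The real round trip works; the model's "encoded string" is not a valid frame.
assert lz4.frame.decompress(compressed).decode("utf-8") == plaintext
```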

Decoding prompt not working:
```
Given a string of characters encoded using the LZ4 algorithm, decode the string and provide the plaintext.

Encoded string: """0800000088b4c4a4ce4d4e4e4d4b4f4b4d4e4d4d4f4e4d4f4e4c4f4e2c207765636f6d652e202121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212"""

Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
12
Dec
Decoded string: Welcome. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```

[–]Prayers4Wuhan 0 points

I just spent a lot of time on a Python script that works on its own but doesn't work with ChatGPT. I'm pretty sure it can't actually execute code and is just guessing.

```
import zlib
import codecs
import base64

def decode_and_decompress(compressed_data: bytes) -> str:
    # Reverse the pipeline: base64 -> zlib-compressed bytes -> UTF-8 text
    newbytes = base64.b64decode(compressed_data)
    newbytes_decoded = codecs.decode(newbytes, "zlib")
    stringdata = newbytes_decoded.decode('utf-8')
    return stringdata

c_bytes = b'eJxLTsxTqMwvVUhJTc7PLShKLS5WKMnILFZILygx1lPwVEjJzysBieRlg5UlJ+YBAM91EfQ='
print(decode_and_decompress(c_bytes))
```

[–]Prayers4Wuhan 0 points

```
import zlib
import codecs
import base64

def compress_and_encode(data: str) -> bytes:
    # UTF-8 text -> zlib-compressed bytes -> base64 (safe to paste into chat)
    b = data.encode('utf-8')
    b2 = codecs.encode(b, "zlib")
    base64_bytes = base64.b64encode(b2)
    return base64_bytes

def decode_and_decompress(compressed_data: bytes) -> str:
    # Reverse the pipeline: base64 -> zlib-compressed bytes -> UTF-8 text
    newbytes = base64.b64decode(compressed_data)
    newbytes_decoded = codecs.decode(newbytes, "zlib")
    stringdata = newbytes_decoded.decode('utf-8')
    return stringdata

data = 'can you decompress this gpt3'
compressed_data = compress_and_encode(data)
print('Original data: ' + data)
print(compressed_data)

decompressed_data = decode_and_decompress(compressed_data)
print(decompressed_data)

print('just the bytes example to see if gpt3 chat can decompress this')
c_bytes = b'eJxLTsxTqMwvVUhJTc7PLShKLS5WKMnILFZILygx1lPwVEjJzysBieRlg5UlJ+YBAM91EfQ='
print(decode_and_decompress(c_bytes))
```

[–]WesternLettuce0 0 points

Cool! Following

[–]lgastako 0 points

It actually can't do compression or decompression (or any other encoding/decoding); it can only try to probabilistically predict which words, letters, etc. will appear in text talking about compressing/decompressing. If you try to compress/decompress phrases that are commonly used in examples, like "Hello world" or "The quick brown fox jumped over the lazy dog", then it will have seen enough examples for the predictions to be right a surprising percentage of the time. But if you give it a significant amount of new data, it's unlikely to be able to do anything with it except hallucinate.

Edit: My guess would be that the best way to explore the limits of this is to put it in Linux terminal mode and then just use Unix utilities for encoding/decoding. But ultimately I don't think it'll be useful in the way that you'd like.

[–]uniformly 0 points

Love the ideas. How about this approach: https://rachelelrom.com/2019/04/01/how-to-get-more-content-into-less-space/ Train a cheaper model to do this and have it "compress" without losing any actual content?

[–]DehiXeM[S] 1 point

Combining this approach with converting the text into some (transparent enough) abbreviations could have a significant impact.
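For example, a (hypothetical) reversible abbreviation table applied before prompting; the model never has to learn a binary format, just tolerate terse English:

```
# Hypothetical abbreviation table: expandable back to the original text,
# and "transparent enough" that the model might read it without training.
ABBREVIATIONS = {
    "compression": "cmprsn",
    "algorithm": "algo",
    "community": "cmty",
    "contextualization": "ctx",
}

def abbreviate(text: str) -> str:
    for word, abbr in ABBREVIATIONS.items():
        text = text.replace(word, abbr)
    return text

def expand(text: str) -> str:
    # Note: naive replace is order-sensitive; a real scheme would
    # match on word boundaries to avoid substring collisions.
    for word, abbr in ABBREVIATIONS.items():
        text = text.replace(abbr, word)
    return text

msg = "ask the community for a compression algorithm for the contextualization"
assert expand(abbreviate(msg)) == msg
print(abbreviate(msg))

# Caveat: this only saves tokens if the short forms actually tokenize to
# fewer tokens than the originals; rare letter clusters often do not.
```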