
[–]NotElonMuzk 3 points

Using a predictive text model to encode and decode is a really bad idea. It's not meant to compute stuff. Use with caution.

[–]DehiXeM[S] 0 points

Yes, that's a fun idea: train it on a custom compression algo to save tokens in the long run!

[–]gregnr 2 points

Clever idea, but I don't think this will produce the results you expect. text-davinci-003 is surprisingly good at understanding simple encodings (e.g. base64):

```
Using the following encoded string, summarize and list the key points described in its contents:

SGVsbG8gd29ybGQhIFRvZGF5IGlzIGEgZ3JlYXQgZGF5IGZvciBhIHdhbGsgaW4gdGhlIHBhcmsuIFdoZXJlIGlzIFdhbGRvPyBIZSBtdXN0IGJlIGhlcmUgc29tZXdoZXJlLiBDYW4geW91IGZpbmQgaGltPyBXaGF0IGNvbG9yIHNoaXJ0IGlzIGhlIHdlYXJpbmc/

Key Points:
- Greeting: "Hello world!"
- Activity: Going for a walk in the park
- Question: Where is Waldo?
- Task: Find him
- Details: What color shirt is he wearing?
```

And this is likely because the model has been trained with enough base64 data to know common character patterns and how they translate between their encoded/decoded forms. It's not perfect though:

```
Encode the following text to base64:

Hello world! Today is a great day for a walk in the park. Where is Waldo? He must be here somewhere... Can you find him? What color shirt is he wearing?

SGVsbG8gd29ybGQhIFRvZGF5IGlzIGEgZ3JlYXQgZGF5IGZvciBhIHdhbGsgaW4gdGhlIHBhcmsuIFdoZXJlIGlzIFdhbGRvPyBIZSBtdXN0IGJlIGhlcmUgc29tZXdoZXJlLi4uIENhbiB5b3UgZmluZCBoaW0/IFdoYXQgY29sb3Igc2hpcnQgaXMgaGUgd2FzaW5nPw==
```

If you decode that, the last word is wasing instead of wearing (still impressive though).
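You can verify that locally with a couple of lines of standard-library Python:

```
import base64

# The model's attempt at base64-encoding the text above
encoded = "SGVsbG8gd29ybGQhIFRvZGF5IGlzIGEgZ3JlYXQgZGF5IGZvciBhIHdhbGsgaW4gdGhlIHBhcmsuIFdoZXJlIGlzIFdhbGRvPyBIZSBtdXN0IGJlIGhlcmUgc29tZXdoZXJlLi4uIENhbiB5b3UgZmluZCBoaW0/IFdoYXQgY29sb3Igc2hpcnQgaXMgaGUgd2FzaW5nPw=="

# Prints the sentence back with the corrupted last word ("wasing")
print(base64.b64decode(encoded).decode("utf-8"))
```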

Compression is different. Depending on the algorithm, similar text could compress into significantly different outputs. This is a lot harder to train a language model to understand. Have you confirmed that the LZ4 output you are getting from text-davinci-003 is correct?

For argument's sake, even if we assume that text-davinci-003 can understand compressed data, we likely won't be saving tokens in the end. Tokens are based more on semantic meaning than on the general "~4 characters per token" rule. If you run the encoded string in your example through OpenAI's tokenizer (https://beta.openai.com/tokenizer), it actually produces 249 tokens, compared to the 34 tokens in your plain text version.
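You can reproduce this kind of comparison locally. Here's a minimal sketch using OpenAI's tiktoken library (my assumption, since the web tokenizer isn't scriptable; text-davinci-003 maps to the p50k_base encoding):

```
import base64
import tiktoken  # pip install tiktoken

# encoding_for_model("text-davinci-003") resolves to p50k_base
enc = tiktoken.encoding_for_model("text-davinci-003")

plain = ("Hello world! Today is a great day for a walk in the park. "
         "Where is Waldo? He must be here somewhere... "
         "Can you find him? What color shirt is he wearing?")
b64 = base64.b64encode(plain.encode("utf-8")).decode("ascii")

# The BPE vocabulary has almost no useful merges for base64 text, so the
# "encoded" form costs far more tokens than the plain text it represents.
print(len(enc.encode(plain)), "tokens (plain)")
print(len(enc.encode(b64)), "tokens (base64)")
```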

[–]DehiXeM[S] 0 points

> Have you confirmed that the LZ4 output you are getting from text-davinci-003 is correct?

Not confirmed, and I would be surprised if it were correct.

> it actually produces 249 tokens, compared to the 34 tokens in your plain text version.

Ah, good point. It's true that compression in that sense wouldn't save much.

Maybe we could create a compression algorithm based on GPT-3's token dictionary and train the model to understand it. Perhaps in the long run, with a fine-tuned model, it could save some tokens.
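Roughly what I have in mind, as a sketch (the codebook and placeholder strings are hypothetical, and whether a fine-tuned model would reliably learn them is exactly the open question):

```
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")

# Hypothetical codebook: verbose boilerplate mapped to terse placeholders.
# Nothing here is a real scheme; it just illustrates the idea.
CODEBOOK = {
    "Wanted to explore this possibility with the community": "@A",
    "pass in a compressed string with all the contextualization": "@B",
}

def compress(text: str) -> str:
    for phrase, code in CODEBOOK.items():
        text = text.replace(phrase, code)
    return text

prompt = ("Wanted to explore this possibility with the community, "
          "we could save a lot of tokens if we could "
          "pass in a compressed string with all the contextualization "
          "to the prompt.")

short = compress(prompt)
print(len(enc.encode(prompt)), "tokens before")
print(len(enc.encode(short)), "tokens after")
```

The catch is that the model has to learn the codebook somehow; if it's taught via the prompt itself, the codebook costs the very tokens you were trying to save.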

[–]DehiXeM[S] 0 points

So far it's a struggle.

Encoding prompt not working:

```
Given a plaintext string, encode the string using the LZ4 algorithm and provide the encoded string.

Plaintext: """Wanted to explore this possibility with the community, we could save a lot of tokens if we could pass in a compressed string with all the contextualization to the prompt."""

Encoded String: 0800000088b4c4a4ce4d4e4e4d4b4f4b4d4e4d4d4f4e4d4f4e4c4f4e2c207765636f6d652e202121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212
```
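For comparison, here's what real LZ4 output looks like when generated locally (using the third-party lz4 package); it bears no resemblance to what the model produced:

```
import lz4.frame  # pip install lz4

plaintext = ("Wanted to explore this possibility with the community, "
             "we could save a lot of tokens if we could pass in a "
             "compressed string with all the contextualization to the prompt.")

compressed = lz4.frame.compress(plaintext.encode("utf-8"))
print(compressed.hex())  # starts with LZ4's frame magic bytes, 04224d18

# The real round trip works; the model's "encoded string" is not a valid frame.
assert lz4.frame.decompress(compressed).decode("utf-8") == plaintext
```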

Decoding prompt not working:
```
Given a string of characters encoded using the LZ4 algorithm, decode the string and provide the plaintext.

Encoded string: """0800000088b4c4a4ce4d4e4e4d4b4f4b4d4e4d4d4f4e4d4f4e4c4f4e2c207765636f6d652e202121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212121212"""

Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
Dec
12
Dec
Decoded string: Welcome. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```

[–]Prayers4Wuhan 0 points

I just spent a lot of time on a Python script that works on its own but doesn't work with ChatGPT. I'm pretty sure it can't actually execute code and is just guessing.

```
import zlib
import codecs
import base64

def decode_and_decompress(compressed_data: bytes) -> str:
    # Reverse the pipeline: base64 -> zlib-compressed bytes -> UTF-8 text
    newbytes = base64.b64decode(compressed_data)
    newbytes_decoded = codecs.decode(newbytes, "zlib")
    stringdata = newbytes_decoded.decode('utf-8')
    return stringdata

c_bytes = b'eJxLTsxTqMwvVUhJTc7PLShKLS5WKMnILFZILygx1lPwVEjJzysBieRlg5UlJ+YBAM91EfQ='
print(decode_and_decompress(c_bytes))
```

[–]Prayers4Wuhan 0 points

```
import zlib
import codecs
import base64

def compress_and_encode(data: str) -> bytes:
    # UTF-8 text -> zlib-compressed bytes -> base64 (safe to paste into chat)
    b = data.encode('utf-8')
    b2 = codecs.encode(b, "zlib")
    base64_bytes = base64.b64encode(b2)
    return base64_bytes

def decode_and_decompress(compressed_data: bytes) -> str:
    # Reverse the pipeline: base64 -> zlib-compressed bytes -> UTF-8 text
    newbytes = base64.b64decode(compressed_data)
    newbytes_decoded = codecs.decode(newbytes, "zlib")
    stringdata = newbytes_decoded.decode('utf-8')
    return stringdata

data = 'can you decompress this gpt3'
compressed_data = compress_and_encode(data)
print('Original data: ' + data)
print(compressed_data)

decompressed_data = decode_and_decompress(compressed_data)
print(decompressed_data)

print('just the bytes example to see if gpt3 chat can decompress this')
c_bytes = b'eJxLTsxTqMwvVUhJTc7PLShKLS5WKMnILFZILygx1lPwVEjJzysBieRlg5UlJ+YBAM91EfQ='
print(decode_and_decompress(c_bytes))
```

[–]WesternLettuce0 0 points

Cool! Following

[–]lgastako 0 points

It actually can't do compression or decompression (or any other encoding/decoding); it can only try to probabilistically predict which words, letters, etc. will appear in text talking about compressing/decompressing. If you try to compress/decompress phrases that are commonly used in examples, like "Hello world" or "The quick brown fox jumped over the lazy dog", then it will have seen enough examples for the predictions to be right a surprising percentage of the time. But if you give it a significant amount of new data, it's unlikely to be able to do anything with it except hallucinate.

Edit: My guess would be that the best way to explore the limits of this is to put it in Linux terminal mode and then just use Unix utilities for encoding/decoding. But ultimately I don't think it'll be useful in the way that you'd like.

[–]uniformly 0 points

Love the ideas. How about this approach: https://rachelelrom.com/2019/04/01/how-to-get-more-content-into-less-space/ Train a cheaper model to do this and have it "compress" without losing any actual content?

[–]DehiXeM[S] 1 point

Combining this approach with converting the text into some (transparent enough) abbreviations could have a significant impact.
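For example, a (hypothetical) reversible abbreviation table applied before prompting; the model never has to learn a binary format, just tolerate terse English:

```
# Hypothetical abbreviation table: expandable back to the original text,
# and "transparent enough" that the model might read it without training.
ABBREVIATIONS = {
    "compression": "cmprsn",
    "algorithm": "algo",
    "community": "cmty",
    "contextualization": "ctx",
}

def abbreviate(text: str) -> str:
    for word, abbr in ABBREVIATIONS.items():
        text = text.replace(word, abbr)
    return text

def expand(text: str) -> str:
    # Note: naive replace is order-sensitive; a real scheme would
    # match on word boundaries to avoid substring collisions.
    for word, abbr in ABBREVIATIONS.items():
        text = text.replace(abbr, word)
    return text

msg = "ask the community for a compression algorithm for the contextualization"
assert expand(abbreviate(msg)) == msg
print(abbreviate(msg))

# Caveat: this only saves tokens if the short forms actually tokenize to
# fewer tokens than the originals; rare letter clusters often do not.
```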