Typopy (github.com)
submitted 1 day ago* (last edited 18 hours ago) by invalidusernamelol@hexbear.net to c/programming@hexbear.net
 

I got bored today and made a little Python script that takes text and spits out a version of it with typos that maintains readability.

The algorithm is really simple: shuffle every run of ASCII letters while keeping the first and last letters in place. I added some options to preserve double letters and to prevent the shuffle from moving a letter to the other side of the word.
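
Here's a minimal sketch of that core idea (not the actual Typopy code, which is in the repo, and leaving out the double-letter and distance options):

```python
import random
import re

def typoize(text: str) -> str:
    """Shuffle the interior letters of each ASCII-letter run,
    keeping the first and last letters fixed."""
    def shuffle_word(match: re.Match) -> str:
        word = match.group(0)
        if len(word) < 4:
            return word  # one or zero interior letters, nothing to shuffle
        interior = list(word[1:-1])
        random.shuffle(interior)
        return word[0] + "".join(interior) + word[-1]

    return re.sub(r"[A-Za-z]+", shuffle_word, text)

print(typoize("The algorithm is really simple"))
# e.g. "The agloirthm is relaly smpile"
```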

I don't think this has any real-world applications beyond maybe messing with the text on your site when you detect a bot. ChatGPT can pretty easily decode the typos in my initial testing, but I'm not sure it would do as well if its training data were polluted with this type of text obfuscation.

top 2 comments
tricerotops@hexbear.net 3 points 1 day ago

was gonna ask how LLMs deal with it, but I bet these sorts of errors are common enough that a shuffled word ends up being understood as the same word

invalidusernamelol@hexbear.net 2 points 1 day ago* (last edited 1 day ago)

I think it's because the lexical distance between a "readable" typo and the original word is so small that they can end up with similar tokens. The gzipped sizes of all the outputs are virtually identical, which is a good tell.
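
A quick way to check that, using the typoize() sketch from the post (sizes will vary a little run to run):

```python
import gzip

def gzip_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

sample = open("sample.txt").read()  # hypothetical input file
# each independently shuffled output compresses to nearly the same size;
# typoize() is the sketch from the post above
sizes = [gzip_size(typoize(sample)) for _ in range(5)]
print(gzip_size(sample), sizes)
```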

So yeah, for small inputs it's useless, but I still think you could use it to gum up scrapers while keeping the response text both easily decoded and human-readable.

Also, since it looks like real text and will share a lot of the same tokens while being wildly incorrect, if they don't filter or parse it out before using it, you'll get weird behavior in whatever they're trying to steal the text for.

It almost certainly works way better on text input that has a lot of longer words, since anything under 4 letters is unchanged. Longer words do get harder for the reader to parse too, though, so it's still kinda meh.

I could try adding some complexity by avoiding swaps between letters that are close together on a QWERTY layout?
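
A rough sketch of what that check could look like (a hypothetical adjacency table for a US QWERTY layout; real keyboards are staggered, so the diagonal neighbors are approximate). The shuffle could then re-roll any permutation where are_qwerty_adjacent() is true for a swapped pair:

```python
# Hypothetical adjacency table for a standard US QWERTY layout
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def qwerty_neighbors(ch: str) -> set:
    """Letters on the same or adjacent rows, within one column of ch."""
    ch = ch.lower()
    neighbors = set()
    for r, row in enumerate(QWERTY_ROWS):
        if ch not in row:
            continue
        i = row.index(ch)
        for rr in (r - 1, r, r + 1):
            if 0 <= rr < len(QWERTY_ROWS):
                for j in (i - 1, i, i + 1):
                    if 0 <= j < len(QWERTY_ROWS[rr]) and QWERTY_ROWS[rr][j] != ch:
                        neighbors.add(QWERTY_ROWS[rr][j])
    return neighbors

def are_qwerty_adjacent(a: str, b: str) -> bool:
    return b.lower() in qwerty_neighbors(a)

print(are_qwerty_adjacent("e", "r"))  # True, same row
print(are_qwerty_adjacent("e", "p"))  # False
```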

Edit: So I did some testing with tiktoken, the tokenizer package that OpenAI uses, and the token count on the OG Bee Movie script is ~26k while the typoized version is ~36k. ChatGPT does seem to notice when you give it a typoized input, but since the tokens are expanded, it seems to be doing a lot more work on their end to "decode" your input.
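
The counting part is just a few lines, something like this (assuming the cl100k_base encoding; tiktoken ships several):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assuming this encoding

def token_count(text: str) -> int:
    return len(enc.encode(text))

script = open("bee_movie.txt").read()  # hypothetical filename
print(token_count(script))           # ~26k on the original script
print(token_count(typoize(script)))  # ~36k; typoize() from the sketch above
```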

I'm definitely just running anything I send to some chatbot through this, just so I can balloon their token burn rate lol