was gonna ask how llms deal with it but i bet these sorts of errors are common enough that it ends up being understood as the same word
I think because the lexical distance between a "readable" typo and the original word is so small, they can end up with similar tokens. The gzipped sizes of all the outputs are virtually identical, which is a good tell.
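A quick way to see the gzip tell: compress the original text and a hand-typoized copy and compare sizes. This is just a sketch with made-up example strings, but since the scrambled version contains the same letters in roughly the same places, it compresses to nearly the same size.

```python
import gzip

def gzip_size(text: str) -> int:
    # Compressed size is a rough proxy for information content:
    # same letters, same repetition structure -> similar size.
    return len(gzip.compress(text.encode("utf-8")))

# Hypothetical example: a sentence and a hand-typoized copy of it,
# repeated so gzip has some structure to work with.
original  = "the typoizer scrambles interior letters of longer words " * 20
scrambled = "the tzpioyer sbcrmales initeorr lrttees of lgoner wrods " * 20

print(gzip_size(original), gzip_size(scrambled))
```

The two sizes land within a few bytes of each other, which is exactly the "virtually identical" signal described above.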
So yeah, for small instances it's useless, but I still think it's enough that you could use it to gum up scrapers while still having the response text be easily decoded and human readable.
Also, since it looks like real text and shares a lot of the same tokens but is super incorrect, if they don't filter it or parse it out before using it, you'll get weird behavior in whatever model is trying to steal it.
It almost certainly works way better on text input that has a lot of longer words, since anything under 4 letters is unchanged. Longer words do get harder for the reader to parse too, though, so it's still kinda meh.
I could try and add some complexity by avoiding swapping letters that are close on a qwerty layout?
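For anyone curious, here's a minimal sketch of a typoizer along the lines described: it leaves words under 4 letters alone and shuffles the interior letters of everything else, keeping the first and last letters in place (the classic typoglycemia trick). The actual tool's swap rule isn't shown in the thread, so treat the shuffle as an assumption; punctuation handling is ignored for simplicity.

```python
import random

def typoize_word(word: str) -> str:
    # Words under 4 letters pass through unchanged, as described above.
    if len(word) < 4:
        return word
    # Keep the first and last letters; shuffle the interior so the word
    # stays human-readable but tokenizes differently.
    interior = list(word[1:-1])
    random.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]

def typoize(text: str) -> str:
    # Naive whitespace split; trailing punctuation would get treated
    # as part of the word in this sketch.
    return " ".join(typoize_word(w) for w in text.split())

print(typoize("scrambling the interior letters of longer words"))
```

The QWERTY-distance idea would slot in here: instead of a uniform shuffle, weight swaps toward letters that sit far apart on the keyboard so the result reads less like a plausible fat-finger typo.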
Edit: So I did some testing with tiktoken, the tokenizer package that OpenAI uses, and the token count of the OG Bee Movie script is ~26k while the typoized version is ~36k. It does seem to notice when you give it typoized input, but since the tokens are expanded, it ends up doing a lot more work on their end to "decode" your input.
I'm definitely just running anything I send to some chatbot through this just so I can balloon their token burn rate lol