Character Prefix Conditioning

(cursor.com)

29 points | by mntruell 438 days ago

7 comments

kcarnold 435 days ago
This was the subject of https://arxiv.org/abs/2412.03719. (I suspect you can do simpler than the paper's solution if you're only interested in the top-k.)
A related topic is "token healing", although some implementations (unfortunately including the one in HuggingFace Transformers) make some big assumptions that aren't always true (like treating spaces as special).
viraptor 435 days ago
> Can you construct an efficient algorithm for sampling from q(tk∣t1,…,tk−1), that minimizes calls to the original language model?
I feel like I'm missing some issue here... Can't you query stopping at the last full token boundary, then reject any results which don't match the character prefix and continue from there with the completion? Kind of like when you mask the invalid actions when reinforcement training on games? Or is that losing too much info?
[-]
- sshh12 435 days ago
  I asked o1 to figure this out and this is essentially what it came up with as well.
  https://chat.sshh.io/share/HIzUotMYVxFhRde94ZYJJ
teaearlgraycold 435 days ago
Not sure if this is free labor or a means to source candidates.
do_not_redeem 435 days ago
So here is ChatGPT's token list: https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...
Is there some reason it isn't alphabetical? (More specifically, lexically sorted by codepoint) If you had a model with sorted tokens, you'd be able to solve this by constraining output to tokens with the desired prefix, probably with some mechanism similar to how this works: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
[-]
- pizza 435 days ago
  They are sorted, but in a way such that, roughly speaking, rank is proportional to inverse frequency (like Zipf’s law, but also permitting merging of subwords to be ranked). This is actually extremely important because it makes the otherwise very-high-cardinality categorical feature of target predicted argmax vocab dictionary key index slightly smoother and slightly more predictable for the model
- versteegen 435 days ago
  516 instances of "\r\n"!
amanrs 435 days ago
This is harder than it looks.
First "token-healing" doesn't work. Consider the case "app" where the most likely options are "ap|praisal" or "apple|sauce". You can't just sample all tokens that start with app, or you'd miss appraisal.
Second, it's easy to come up with a naive algorithm that samples from the true distribution. It's very difficult to make this algorithm efficient.
yorwba 435 days ago
Ideally you'd have a language model that can predict a good continuation after any byte. If an existing model can't do that because it's too reliant on a specific tokenization, you might nonetheless be able to fine-tune it until it can gracefully handle the unexpected tokenizations that result from splitting at a random byte.
[-]
- kevmo314 435 days ago
  Such a model will always be less performant than one on tokens, as you're effectively switching to one byte per token. Solving this problem in code is much cheaper.
  [-]
  - yorwba 435 days ago
    I don't mean switching to one byte per token, but switching to training on the token distribution that results from cutting off the input at arbitrary bytes. The bytes per token should be basically unchanged, as only the end gets a bit shorter.
    [-]
    - kevmo314 435 days ago
      Yeah I've tried that approach. The model ends up needing to learn every combination of tokens. For example, the word "apple" now has six bytes positions it can be split on and the model suddenly needs to learn that all six will yield the same output attention state.
      It ends up being O(max token length) more complex and so you end up needing a proportionally larger model to accommodate it.
      [-]
      - pizza 435 days ago
        Seems like we should just use gradual annealing of tokens to more fine grained single character tokens over the course of training then
        [-]
        kevmo314 435 days ago
        I believe that's similar to the idea behind https://github.com/facebookresearch/blt
_andrei_ 434 days ago
Do they have the solution (I'd assume Supermaven which they recently acquired does) and are they just giving out an interesting challenge to readers? Or... what is this?