If you assume that A * 10 isn't going to overflow, so that ASL A shifts a 0 into the carry flag (so no need for CLC), then instead of using the undocumented RRA opcode, you can just do:
sta $00
asl a
asl a
adc $00
asl a
This is also 7 bytes, but faster, since adc $00 takes 3 cycles vs. 5 for rra $00.
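A quick sanity check of the shift-and-add identity behind that sequence (a minimal Python sketch of the five instructions, assuming the carry stays clear because nothing overflows; `times10` is a made-up name, not anything from the article):

```python
def times10(a):
    """Simulate: sta $00 / asl a / asl a / adc $00 / asl a."""
    mem = a                # sta $00: stash A in zero page
    a = (a << 1) & 0xFF    # asl a: A*2 (bit shifted out assumed 0)
    a = (a << 1) & 0xFF    # asl a: A*4
    a = (a + mem) & 0xFF   # adc $00: A*5 (carry assumed clear)
    a = (a << 1) & 0xFF    # asl a: A*10
    return a

# Holds for every A where A*10 still fits in one byte (A <= 25):
assert all(times10(a) == a * 10 for a in range(26))
```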
The A = max(A, X) example is certainly interesting, but not very useful since it loops through the code twice (very slow) and assumes that $8a is available. The much faster obvious version only adds one byte:
stx $00
cmp $00
bcs done
txa
done:
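That branch can be sketched in Python to confirm it leaves max(A, X) in A (a toy simulator written for this comment, not the optimizer's actual machine model; recall that CMP sets the carry when A >= operand):

```python
def max_ax(a, x):
    """Simulate: stx $00 / cmp $00 / bcs done / txa / done:"""
    mem = x               # stx $00: stash X in zero page
    carry = a >= mem      # cmp $00: carry set iff A >= operand
    if not carry:         # bcs done: branch skips txa when A >= X
        a = x             # txa: otherwise copy X into A
    return a              # done:

# Exhaustive over all byte pairs:
assert all(max_ax(a, x) == max(a, x)
           for a in range(256) for x in range(256))
```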
Sure. Note that I picked those examples to demonstrate the two fairly quirky classes of things the optimizer tends to find. If the programmer has different requirements they can specify that, and it'll spit out the examples you gave (or something equivalent).
I like the idea of exhaustive search, which the simplicity of the 6502 seems ideally suited for, but the search speed seems a bit limiting. I wonder if there's not potential for more generation restriction (e.g. code can only use a specific N bytes of zero page) and heavy search pruning to speed it up? If it could generate optimal 20-30 op sequences in semi-reasonable time that'd make it very useful.
Having it only use operations that use a specific set of zero-page addresses is already supported, yep!
20-30 ops is probably impossible, unfortunately. The combinatorial explosion is just too enormous.
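To put a rough number on that explosion: if each instruction slot can hold on the order of a hundred distinct opcode/operand combinations (an assumed round figure, just for illustration), the raw sequence count grows as 100^k:

```python
# Back-of-the-envelope growth of the raw (unpruned) search space,
# assuming ~100 instruction/operand choices per slot.
CHOICES = 100
for k in (5, 10, 20, 30):
    print(f"{k:2d} ops: 100**{k} = 1e{2 * k} candidate sequences")
```

Even aggressive pruning has to claw back dozens of orders of magnitude before 20-30 ops becomes searchable.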
e-graphs have been repeatedly reinvented across decades for many purposes. One of them is superoptimization.
https://en.wikipedia.org/wiki/E-graph
https://www.cs.cornell.edu/courses/cs6120/2025sp/blog/supero...
https://github.com/philzook58/awesome-egraphs
Reminds me a bit of https://pubby.games/codegen.html, though this project's approach seems way more refined and useful.
Interesting and fun read - we are well into the terrain of what was completely impossible to do back then. Now I can't wait to see a faster AppleSoft ROM ;-)
That’s incredibly clever and a fun read. Well done!
I imagine lots of demo coders glancing back and forth between that writeup and their own carefully hand-tuned assembly.
Thank you!
Demo coding is indeed the primary use case for this, and the reason why I started tinkering on it in the first place. That, and people who make homebrew NES / C64 video games should find it fairly useful for optimizing tight loops and such.
Great