Theseus, a Static Windows Emulator

(neugierig.org)

92 points | by zdw 2 days ago

6 comments

  • da-x 38 minutes ago
    The issue with static recompilation is edge cases such as self-modifying code, code that copies variations of itself onto the stack to be executed, jump tables whose structure cannot be inferred from analysis without solving the halting problem, and all sorts of other tricks that eventually necessitate the use of a JIT. I wrote a static recompiler for Win32 in 2012 for fun, and I've seen old 1990s programs do all of these things.
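    A minimal sketch (in C, with invented names) of the jump-table problem: in the compiled binary this is a single indirect call, and enumerating its targets statically requires knowing every index the program can compute at runtime.

```c
/* Hypothetical sketch: dispatch through a function-pointer table.
 * In the binary this appears as one indirect call
 * ("call [table + eax*4]"); a static recompiler cannot enumerate
 * the possible targets without knowing every opcode value that can
 * reach this point, which in general reduces to the halting problem. */
typedef int (*handler_t)(int, int);

static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }

static handler_t table[] = { op_add, op_sub };

int dispatch(int opcode, int a, int b) {
    /* the target is computed at runtime, not visible statically */
    return table[opcode & 1](a, b);
}
```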
  • jcranmer 7 hours ago
    So I've definitely played around with a lot of these ideas, because the thing that I'm trying to emulate/port/reverse engineer is a 16-bit Windows application, which coincidentally means that a lot of the existing infrastructure just keels over and dies in supporting it.

    I originally started the emulation route, but gave that up when I realized just how annoying it was to emulate the libc startup routine that invoked WinMain, and from poking around at the disassembly, I really didn't need to actually support anything before WinMain. And then poking further, I realized that if I patched the calls to libc, I only had to emulate a call to malloc at the, well, malloc level, as opposed to implementing the DOS interrupt that's equivalent to sbrk. (And a side advantage to working with Win16 code is that, since every inter-segment call requires a relocation, you can very easily find all the calls to all libc methods, even if the disassembler is still struggling with all of the jump tables.)

    After realizing that the syscalls to modify the LDT still exist and work on Linux (they don't on Windows), I switched from trying to write my own 386 emulator to natively executing all the 16-bit code inside a 64-bit process, relying on the existing relocation methods to insert the thunking between 16-bit and 64-bit code [1]. A side advantage was that this also made it very easy to call individual methods given their address, rather than emulating from the beginning of main, which lets me inspect what those methods are doing much more closely.
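    For the curious, a minimal Linux-specific sketch of that LDT trick (the function name is invented, and the base/limit values are illustrative):

```c
#define _GNU_SOURCE
#include <asm/ldt.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Install a 16-bit code segment descriptor into the LDT via the
 * modify_ldt syscall, which Linux still supports (Windows does not
 * expose an equivalent). Returns 0 on success, -1 on failure. */
int install_16bit_code_segment(int entry, unsigned long base) {
    struct user_desc desc;
    memset(&desc, 0, sizeof desc);
    desc.entry_number   = entry;
    desc.base_addr      = base;
    desc.limit          = 0xffff;  /* one 64 KiB segment */
    desc.seg_32bit      = 0;       /* 16-bit default operand/address size */
    desc.contents       = MODIFY_LDT_CONTENTS_CODE;
    desc.read_exec_only = 0;       /* readable and executable */
    desc.useable        = 1;
    return (int)syscall(SYS_modify_ldt, 1, &desc, sizeof desc);
}
```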

    Decompilation I've tried to throw in the mix a couple of times, but... 16-bit code really just screws with everything. Modern compilers don't have the ontology to really support it. Doing my own decompilation to LLVM IR runs into the issue that LLVM isn't good at optimizing all of the implemented-via-bitslicing logic, and I still have to write all of the patterns manually just to simplify the code for "x[i]" where x is a huge pointer. And because it's hard to reinsert recompiled, decompiled code back into the binary (tools don't speak Win16 file formats anymore), it's difficult to verify that any decompilation is correct.
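    A sketch of the huge-pointer pattern referred to above (the function name is invented; __AHINCR is a real Win16 constant, 8 in protected mode and 0x1000 in real mode): every "x[i]" on a huge pointer must carry offset overflow into the segment, and this is the code a decompiler has to recognize and simplify everywhere it appears.

```c
#include <stdint.h>

/* Hypothetical sketch of Win16 huge-pointer indexing: a huge
 * pointer spans more than 64 KiB, so advancing it past an offset
 * overflow must bump the segment/selector by AHINCR rather than
 * just wrapping the 16-bit offset. */
#define AHINCR 8  /* protected-mode selector increment */

void huge_index(uint16_t *seg, uint16_t *off, uint32_t byte_index) {
    uint32_t linear = *off + byte_index;
    *seg += (uint16_t)((linear >> 16) * AHINCR);  /* 64 KiB carries */
    *off  = (uint16_t)linear;                     /* wrapped offset */
}
```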

    [1] The thunking works by running 16-bit code in one thread and 64-bit code in another thread and using hand-rolled spinlocks to wait for responses to calls. It's a lot easier than trying to make sure that all the registers and the stack are in the appropriate state.
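    A sketch of that two-thread scheme in plain C (all names invented; the real mechanism crosses between 16-bit and 64-bit code, which C alone obviously cannot show): the caller posts a request, spins until the worker signals completion, then reads the result.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int call_pending = 0;   /* set by the calling thread */
static atomic_int call_done    = 0;   /* set by the worker thread */
static int call_arg, call_result;

/* Stand-in for the thread that runs the foreign-mode code. */
static void *worker(void *unused) {
    (void)unused;
    while (!atomic_load(&call_pending))  /* spin until a call arrives */
        ;
    call_result = call_arg * 2;          /* stand-in for the real call */
    atomic_store(&call_done, 1);
    return NULL;
}

/* Post a call and spin-wait for the response. */
int remote_call(int arg) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    call_arg = arg;
    atomic_store(&call_pending, 1);
    while (!atomic_load(&call_done))     /* spin for the response */
        ;
    pthread_join(t, NULL);
    return call_result;
}
```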

    • evmar 6 hours ago
      [post author] I went down some similar paths in retrowin32, though 32-bit x86 is likely easier.

      I was also surprised by how much goop there is between startup and main. In retrowin32 I just implemented it all, though I wonder how much I could get away with not running it in the Theseus replace-some-parts model.

      I mostly relied on my own x86 emulator, but I also implemented the thunking between 64-bit and 32-bit mode just to see what it was like. It definitely involved some asm, but once I wrapped my head around it, it wasn't so bad; check out the 'trans64' and 'trans32' snippets in https://github.com/evmar/retrowin32/blob/ffd8665795ae6c6bdd7... for, I believe, all of it. One reframing that helped me (after a few false starts) was to put as much code as possible in my high-level language and use asm only to bridge to it.

      • jcranmer 1 hour ago
        Yeah, 32-bit x86 is somewhat easier because everything's in the same flat address space, and you at least have a system-wide code32 GDT entry, which means you can avoid futzing around with the LDT. With 16-bit you get to deal with segmented memory, and the cherry on top is that gdb just stops being useful, since it doesn't know anything about segmented memory (to be fair, I don't think Linux even makes it possible to query the LDT of another process, even with ptrace).

        As for trying to ignore everything before main... well, the main benefit for me was being able to avoid emulating DOS interrupts entirely, between skipping the calls that set up various global variables, stubbing out some of the libc implementations, and manually marking in the emulator that code page X was 32-bit (something else that sends tools into a tizzy: a function switching from 16-bit to 32-bit mid-assembly).

        16-bit is weird and kinda fun to work with at times... but there's also a reason that progress on this is incredibly slow for me.

    • trollbridge 2 hours ago
      Interesting work. I’m also doing the same thing with a Win16 Visual Basic app (with some third party libraries).

      Eventually I just decided to run the app inside an accurate emulator with a full Windows 3.1 environment since nothing else was stable.

    • charcircuit 7 hours ago
      Have you tried just asking an LLM to reverse it to C and then work from there?
      • trollbridge 2 hours ago
        I tried that, and ran into roadblocks since the app under test is old Visual Basic (which is half compiled and half interpreted) and then it used a third party library which has quite sophisticated anti-decompilation features.
  • mmastrac 9 hours ago
    I think "recompiler" is the accepted term in the art, is it not?
    • evmar 6 hours ago
      Looking through Wikipedia at least, it's not exactly clear to me. They have separate pages for 'binary recompiler' and 'binary translation' that link to each other, and the latter is more about going between architectures (which is the main objective here).
  • nxobject 7 hours ago
    A follow-up question I'd like to ask to the author: if I understand correctly, one of the ideas is that you can write "high-level", almost platform-independent translations like:

        regs.eax = 3;
        regs.eax = add(regs.eax, 4);
        windows_api();  // some native implementation of the API that was called
    
    and rely on optimizers to do as much "lowering" or specialization to a platform as possible. I'd like to ask: how reliable have optimizers been in getting those results (e.g. how much fiddling with various -Oxxx flags)?
    • evmar 6 hours ago
      It depends on what results you’re expecting! Relative to an interpreter, even the simplest unoptimized translation of that code already yields significantly more efficient code at runtime.

      In the post there is a godbolt link showing a compiler inlining a simple add, but a real implementation of x86 add would be much more complex.
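      For a taste of that complexity, here's a hedged sketch of a flag-faithful 32-bit add (struct and names invented, not retrowin32's actual code): a real x86 add must update the carry, zero, sign, and overflow flags in addition to producing the sum.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags struct; real emulators often defer or batch
 * these computations, since most flag results are never consumed. */
struct flags { bool cf, zf, sf, of; };

uint32_t x86_add(uint32_t a, uint32_t b, struct flags *f) {
    uint32_t r = a + b;
    f->cf = r < a;                      /* unsigned carry out */
    f->zf = r == 0;
    f->sf = (int32_t)r < 0;
    f->of = ((a ^ r) & (b ^ r)) >> 31;  /* signed overflow */
    return r;
}
```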

      I have read about other projects where the authors put some effort into getting exactly the machine code they wanted out. For example, maybe you want the virtual regs.eax to actually live in a machine register, and one way you might be able to convince a compiler to do that is by passing it around as a function parameter everywhere instead of as a struct field. I have not investigated this myself.
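      A sketch of that parameter-threading idea (all names invented): the struct version forces loads and stores through memory on every access, while the threaded version lets the compiler keep the virtual eax in a machine register across translated blocks.

```c
#include <stdint.h>

/* Struct-based version: each access to the virtual register is a
 * load/store against the regs struct in memory. */
struct regs { uint32_t eax; };
static void block_struct(struct regs *r) { r->eax += 4; }

/* Parameter-threaded version: the virtual eax flows through call
 * arguments and return values, so the compiler can keep it in a
 * machine register. */
static uint32_t block_param(uint32_t eax) { return eax + 4; }

uint32_t run_param(uint32_t eax) {
    eax = block_param(eax);   /* translated block 1 */
    eax = block_param(eax);   /* translated block 2 */
    return eax;
}
```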

  • mysterydip 9 hours ago
    That’s a really clever idea/solution I hadn’t thought of before, and yet it makes “of course, duh” sense as well.
  • pema99 8 hours ago
    See also the author's previous project retrowin32: https://github.com/evmar/retrowin32