Large-scale PowerPC recompiler rework

Exzap commented

2023-01-30 02:16:29 -03:00

(Migrated from github.com)

Disclaimer: This is work-in-progress. I'm opening this draft PR for visibility, so others can track progress and know not to alter recompiler code. Work started on this in November and the ETA for completion is somewhere in the span of the next few months, depending on my motivation.

Goals

I originally started work on the recompiler in 2014 and since then I have learned a lot more about state-of-the-art compiler and IR design. While I'm generally happy with the quality of our code translation, some of the design choices I made along the way make it hard to introduce further optimizations or fixes. A lot of the complexity currently falls on the x86-64 backend, which means that all of it would have to be reimplemented when targeting another architecture.

Overall, the idea is to make both the front-end (PPC to IR) and the back-end (IR to x86-64) as "dumb" as possible so that all the complex logic can be shifted to operate on platform-independent IR, lowering the burden on platform-specific code.

State

Please do not report bugs yet. In fact, I don't recommend trying this out; it's an active construction site.

- [x] Reorganized file and folder structure to be more modular
- [x] Modernize C-style code to use C++ features where it makes sense
- [x] Fundamentally rework PPC basic block handler to be more flexible. Support non-continuous functions and potentially allow for complex inlining
- [x] Support for bool-based jumps and bool registers instead of having PowerPC CR logic embedded into the IR
- [x] Allow PowerPC CR bits, SPRs and XER carry bit to reside in registers and participate in register allocation
- [ ] Avoid complex instructions in the IR when they could be implemented using basic operations only
  - [x] LSWI / STSWI
  - [x] SRAW / SRAWI
  - [x] BDNZ
  - [x] LWARX / STWCX
  - [x] ADDC and other arithmetic instructions with carry
  - [ ] DCBZ
  - [x] MFCR / MTCRF
  - [ ] RLWIMI
  - [ ] SLW / SRW
- [x] Support typed registers. For now everything is either a 32-bit integer or a 2x64-bit paired single register
- [x] Switch floating-point logic over to the newer register allocator that is currently only used for integer registers
- [ ] Support for calls to native code in arbitrary locations of the IR program. Currently calling external code is done hackily via macro instructions which need per-backend implementation
- [ ] Rework floating-point register handling. This is a big chapter on its own and I'll expand on this once I get to it
- [x] Optimize! This includes bringing back optimizations lost with the restructuring as well as adding some new ones
  - [x] Added a new dead code elimination pass
  - [x] x86 specific: Conditional jumps will use eflags instead of emulating PPC CRx bits where possible
  - [x] Fix loop detection and move register loads/stores out of loops where possible
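
To illustrate the "basic operations only" item, here is a minimal sketch of how RLWIMI (rotate left word immediate then mask insert) breaks down into a rotate, two ANDs and an OR. It is not code from this PR and does not use Cemu's IR; the helper names `rotl32`, `ppcMask` and `lowerRlwimi` are made up for the example, and it only models the register result, not the optional CR0 update.

```cpp
#include <cstdint>

// Rotate left by sh (0..31) using only shifts and OR.
constexpr uint32_t rotl32(uint32_t value, unsigned sh)
{
    return (value << (sh & 31)) | (value >> ((32 - sh) & 31));
}

// Build the PowerPC rotate mask covering bits mb..me.
// PowerPC numbers bits from the most significant bit (bit 0) downwards.
constexpr uint32_t ppcMask(unsigned mb, unsigned me)
{
    const uint32_t fromBegin = 0xFFFFFFFFu >> mb;        // bits mb..31
    const uint32_t untilEnd = 0xFFFFFFFFu << (31 - me);  // bits 0..me
    // mb > me means the mask wraps around the word boundary.
    return (mb <= me) ? (fromBegin & untilEnd) : (fromBegin | untilEnd);
}

// RLWIMI rA, rS, SH, MB, ME: insert the rotated rS into rA under the mask.
constexpr uint32_t lowerRlwimi(uint32_t ra, uint32_t rs,
                               unsigned sh, unsigned mb, unsigned me)
{
    const uint32_t mask = ppcMask(mb, me);
    const uint32_t rotated = rotl32(rs, sh);
    return (rotated & mask) | (ra & ~mask);
}
```

Because SH, MB and ME are immediates, a front-end following this approach could compute the mask at translation time, leaving only generic rotate/AND/OR operations for the back-end to handle.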

I know a lot of these are pretty abstract, so in the future I might add a few before-vs-after code examples to this text.
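
As one possible illustration of the dead code elimination item, the sketch below runs a backward liveness scan over a single straight-line block of a toy IR. Everything in it (the `Instr` layout, the `Op` values, `eliminateDeadCode`) is hypothetical and much simpler than a real recompiler IR; it only shows the general technique, not how this PR implements it.

```cpp
#include <set>
#include <vector>

enum class Op { LoadImm, Add, Store };

struct Instr
{
    Op op;
    int dst;   // virtual register written, or -1 if none
    int srcA;  // virtual register read, or -1 if unused
    int srcB;  // virtual register read, or -1 if unused
};

// Stores are observable outside the block, so they are never removed.
inline bool hasSideEffect(const Instr& instr)
{
    return instr.op == Op::Store;
}

// Walk the block backwards, tracking which registers are still needed.
// 'live' starts as the set of registers that must survive past the block.
std::vector<Instr> eliminateDeadCode(const std::vector<Instr>& block,
                                     std::set<int> live)
{
    std::vector<Instr> kept;
    for (auto it = block.rbegin(); it != block.rend(); ++it)
    {
        const bool resultUsed = it->dst >= 0 && live.count(it->dst) > 0;
        if (!hasSideEffect(*it) && !resultUsed)
            continue;  // result is never read: drop the instruction
        if (it->dst >= 0)
            live.erase(it->dst);   // this write satisfies later reads
        if (it->srcA >= 0)
            live.insert(it->srcA); // operands become live above this point
        if (it->srcB >= 0)
            live.insert(it->srcB);
        kept.push_back(*it);
    }
    // 'kept' was built back to front, so restore the original order.
    return std::vector<Instr>(kept.rbegin(), kept.rend());
}
```

In this toy model, an instruction whose only consumer is itself dead gets removed in the same pass, because the dead consumer never marks its operands as live.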

Q&A

Will this PR add ARM support?

No. But it will make adding a new target architecture a lot easier and if I am motivated enough I'll look into adding an aarch64 backend after this is done.

Will this make Cemu faster?

Maybe? After everything is done the recompiler should output faster code, but CPU execution speed generally isn't a bottleneck in Cemu so it's hard to predict whether there will be an actual difference.

What about the proposed plan to use LLVM?

I did quite a bit of research on that. The biggest downside is that LLVM is still quite JIT-unfriendly and comes with significant bloat. Not saying that it wouldn't work, but the cons outweigh the pros in my opinion. Plus we already have a pretty sophisticated recompiler and it would be a waste to throw it away.
On a personal note, I enjoy working on custom solutions more than plugging in libraries, so it's easier for me to stay motivated and make progress. In terms of total effort, both solutions are about the same.


👍 28 ❤️ 21 🎉 3 🚀 10

Wunkolo commented

2023-01-30 02:43:54 -03:00

(Migrated from github.com)

What would be the scope of changing the x64 emitter over to something like xbyak (https://github.com/herumi/xbyak/)?

With the current x64 emitter (https://github.com/cemu-project/Cemu/blob/main/src/Cafe/HW/Espresso/Recompiler/PPCRecompilerX64.h), adding a new instruction or class of instructions means implementing the encoding for those instructions (REX, VEX, EVEX, ModR/M, SIB, etc.) from scratch, then implementing the new instruction itself, and then detecting the relevant CPUID flags. That redundant work could probably be pushed onto a proven library.


👍 2

Exzap commented

2023-01-30 03:28:26 -03:00

(Migrated from github.com)

Thanks for pointing out Xbyak, I wasn't aware of it. The assemblers I looked at were always a bit overkill for our purposes, usually focusing on a human-friendly API rather than a simple interface for machine-generated code. We only need a very thin emitter, and Xbyak seems to be exactly that.

As part of this rework I also started a new "cleaner" x86-64 high-performance emitter which I auto-generate from encoding tables. The effort for this is relatively minimal, but using a premade emitter would certainly cut down the effort even further. I'll think about it.
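
For readers unfamiliar with what a table-driven emitter looks like, here is a rough sketch covering only the 64-bit register-to-register forms of a few ALU instructions. It is neither Xbyak's API nor the emitter from this PR; `Reg64`, `AluOp`, `kRegRegTable` and `emitRegReg` are names invented for the example. The opcode bytes are the standard `op r/m64, r64` encodings.

```cpp
#include <cstdint>
#include <vector>

// Register numbers as used by the x86-64 encoding.
enum Reg64 : uint8_t
{
    RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
    R8, R9, R10, R11, R12, R13, R14, R15
};

enum class AluOp { Add, Or, And, Sub, Xor, Cmp, Mov };

struct Encoding
{
    AluOp op;
    uint8_t opcode;  // opcode byte of the "op r/m64, r64" form
};

// One row per instruction; a generator could produce this from a spec file.
constexpr Encoding kRegRegTable[] = {
    {AluOp::Add, 0x01}, {AluOp::Or, 0x09},  {AluOp::And, 0x21},
    {AluOp::Sub, 0x29}, {AluOp::Xor, 0x31}, {AluOp::Cmp, 0x39},
    {AluOp::Mov, 0x89},
};

// Appends the bytes for "op dst, src" and returns false for unknown ops.
bool emitRegReg(std::vector<uint8_t>& out, AluOp op, Reg64 dst, Reg64 src)
{
    for (const Encoding& entry : kRegRegTable)
    {
        if (entry.op != op)
            continue;
        // REX prefix 0100WRXB: W=1 selects 64-bit operands,
        // R extends ModRM.reg (src), B extends ModRM.rm (dst).
        out.push_back(static_cast<uint8_t>(0x48 | ((src >> 3) << 2) | (dst >> 3)));
        out.push_back(entry.opcode);
        // ModRM with mod=11 (register direct), reg=src, rm=dst.
        out.push_back(static_cast<uint8_t>(0xC0 | ((src & 7) << 3) | (dst & 7)));
        return true;
    }
    return false;
}
```

For example, `emitRegReg(code, AluOp::Add, RAX, RCX)` appends the bytes `48 01 C8`, which is `add rax, rcx`. Memory operands, immediates and VEX/EVEX forms would each need additional table columns and encoding paths, which is the part a premade library would save.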


👍 10 🚀 3

amayra commented

2023-05-16 18:51:13 -04:00

(Migrated from github.com)

Did you drop this project?

Exzap commented

2023-05-17 06:33:47 -04:00

(Migrated from github.com)

Nah just busy with other stuff. I'll get back to this eventually

iMonZ commented

2023-09-26 16:48:47 -03:00

(Migrated from github.com)

> Nah just busy with other stuff. I'll get back to this eventually

Thanks! ARM64 support would finally make the Cemu emulator complete and future-proof!


Wunkolo commented

2023-09-26 17:40:03 -03:00

(Migrated from github.com)

On ARM64: I've been using oaknut (https://github.com/merryhime/oaknut) on other projects. It is structured very similarly to xbyak.


Gabezin64 commented

2023-10-12 21:20:31 -03:00

(Migrated from github.com)

This will finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?

Exzap commented

2023-10-13 10:34:18 -03:00

(Migrated from github.com)

> This will finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?

That's a graphical issue. It's unaffected by this CPU rework.



Large-scale PowerPC recompiler rework #641
