Matt Ownby's Cool Projects: The next generation of emulation design?

So as I sat here at 1 AM feeding my newborn infant (that makes 4 kids!), my mind drifted to Star Rider emulation once again and the inherent inaccuracies with the way that Daphne (and also MAME last time I checked) handle emulating the 6809E CPU, the PIA6821, and the Williams Special Chips.

Daphne emulates a multi-CPU game by having each CPU take turns and execute a small chunk of cycles before switching to next CPU and doing the same. In the case of Star Rider, where each CPU runs at the same speed (1 MHz), Daphne actually executes a single instruction before switching to the next CPU which gives a pretty dang accurate experience. But even this is not really good enough to achieve "perfect emulation" because Star Rider (and other Williams games) utilize a design which requires the CPU to be halted on a regular basis or to not run on a steady clock. Daphne assumes that the clock will always be steady and that the CPU will never be halted, so this presents a bit of a problem. To get perfect emulation, one needs to look to FPGA solutions and forget about desktop PC solutions. Or does one?

I realized a few things tonight. If I am going to take Daphne to the "next level" and emulate a game like Star Rider "perfectly" so that real hardware and Daphne and run side by side and retain identical state over a period of days (with allowance for clock frequency drift between the two), I am going to have to ditch the traditional model of "execute N cycles as fast as possible in isolation".

Instead, a more ideal way to emulate Star Rider and other Williams games is to emulate the actual clock that all of the CPUs use and sync each CPU to this clock. Star Rider has a single 24 MHz clock on the VGG board which is split up (via PROM) into a 12 MHz, 6 MHz, 4 MHz, 2 MHz, and 1 MHz clock. But all of these clocks are in sync because they all derive from the parent 24 MHz clock. The PIF board has its own 4 MHz clock (which creates a 1 MHz clock for its CPU), and I haven't looked at the sound board that much to know where its clock is coming from, but it probably is also running at 1 MHz. It is quite convenient that the three 6809E CPUs for this game are all running at the same clock speed because that means they can all be clocked as if from a single source. Not all games will have this benefit, so this approach can't be necessarily applied to every scenario out there. But I think for Star Rider (and probably other Williams games) it would work to emulate the game on a clock level!

One reason that I didn't even consider doing this in the past is because of performance. Emulating the clock of a CPU is a lot slower than just running a bunch of instructions and keeping track of how many cycles it is supposed to take. But new modern CPUs keep adding greater capacity to execute multiple tasks in parallel (by adding CPU cores) and it has been a problem for me to figure out how to utilize this parallelism. The cool thing about emulating a CPU on a clock level is that multiple emulated CPUs can somewhat execute independently of each other which is perfect for a multi-threaded architecture!

It basically would work something like this:

When the clock is about to do a Low->High transition, the state of all inputs would be sent to every thread (emulated CPU) and then each thread would process this clock transition in parallel and ignore propagation delay (which I think is okay to do since this is a digital clocked system). Each thread would report any change to the CPU's output (for example, the R/W pin changing or data being put on the bus, etc) to the master thread. Then all of the output changes would be consumed by the master thread, triggering any emulation-related callbacks (such as updating video output), and the master thread would figure out the new input state. This new input state would be sent to every thread (emulated CPU) and they'd all perform the High->Low transition in parallel and this process would repeat infinitely.

The idea is that as long as a thread has an input state, it can execute its own clock transition and figure out its own output state without having to wait for another thread to accomplish the same task.

This assumes that the hardware design is such that propagation delay can be completely ignored due to the clock speed(s) of the system being chosen conservatively enough.

How would this type of system perform? I frankly have no idea! I don't even know if it would be worth using multiple threads to do this BUT my experience over the years tells me that there is a good chance that performance would be good enough that an emulated system like this would run at full speed on a modern PC.

Heck, at this level, even the video drawing hardware could be emulated! The possibilities! THE POSSIBILITIES!!! :)

As I alluded to in a previous blog post, I am working with Sean Riddle on an FPGA replacement design for the Williams Special Chip at the moment and once we have this thing perfected (hehe), someone could make an emulated version of it (FPGA isn't the same kind of emulation in my opinion) using the technique that I described above.

Matt Ownby's Cool Projects

Friday, August 8, 2014

The next generation of emulation design?

1 comment: