HexTrain on MTXPlus+

HexTrain is a very demanding game from a timing point of view. It runs on MEMU, REMEMOrizer, REMEMOTECH, and CFX but, at this time, not on MTXPlus+.

Background

All of the systems that work have either a 9929 VDP, or an equivalent that requires no delays between accesses. The MTXPlus+ has a 9958, which is touted as "compatible" with the 9938 and thus the 9918/9929.

The best knowledge that we have about the 9918/9929 VDP is that successive data inputs/outputs must be 2us apart if video output is disabled (blanked), in the vertical blank, or in text mode. In the active part of the display, in other modes, this must be 8us. This corresponds to 8T or 32T apart for a 4MHz Z80. In practice the fastest a Z80 can output is once every 11T, so it is only outputs during the active part of the display that need delays. No restrictions are known for accesses to the VDP control port.
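The arithmetic behind those figures can be checked with a short sketch. It assumes an exact 4MHz clock, and the names used (`CLOCK_HZ`, `us_to_tstates`, `needs_delay`) are made up for illustration; none of this is HexTrain code.

```python
CLOCK_HZ = 4_000_000  # assumption: exactly 4MHz Z80 clock
OUT_T = 11            # OUT (n),A takes 11 T-states


def us_to_tstates(us):
    """Convert microseconds to Z80 T-states at CLOCK_HZ."""
    return round(us * CLOCK_HZ / 1_000_000)


def tstates_to_us(t):
    """Convert Z80 T-states to microseconds at CLOCK_HZ."""
    return t * 1_000_000 / CLOCK_HZ


BLANKED_GAP_T = us_to_tstates(2)  # blanked, vertical blank or text mode
ACTIVE_GAP_T = us_to_tstates(8)   # active display in other modes


def needs_delay(required_gap_t, achieved_gap_t=OUT_T):
    """True if back-to-back OUTs are faster than the VDP allows."""
    return achieved_gap_t < required_gap_t
```

Back-to-back OUTs are 11T (2.75us) apart, so `needs_delay(BLANKED_GAP_T)` is false and `needs_delay(ACTIVE_GAP_T)` is true, which is the conclusion drawn above.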

The 9958 can be configured (set bit 1 of register R#9, the *NT bit) to work in 50Hz rather than 60Hz mode. Doing so creates a larger vertical blank time, which HexTrain relies on.
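To get a feel for how much extra time 50Hz mode gives, here is a rough sketch. It assumes the nominal line counts of 313 (50Hz) and 262 (60Hz), a 192 line active display, exact 50Hz/60Hz frame rates and a 4MHz Z80. It counts every line outside the active display, which is a looser notion than the vertical blank proper, so treat the results as ballpark figures only.

```python
ACTIVE_LINES = 192
TOTAL_LINES = {50: 313, 60: 262}  # assumption: nominal line counts per frame


def non_active_lines(hz):
    """Scan lines per frame that are outside the active display."""
    return TOTAL_LINES[hz] - ACTIVE_LINES


def non_active_tstates(hz, clock_hz=4_000_000):
    """Approximate Z80 T-states per frame spent outside the active display."""
    frame_t = clock_hz / hz
    return frame_t * non_active_lines(hz) / TOTAL_LINES[hz]
```

Under these assumptions there are 121 such lines at 50Hz against 70 at 60Hz, so roughly 30900T against roughly 17800T per frame.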

The 9958 can also be configured (set bit 2 of register R#25, the WTE bit) to keep the Z80 waiting until its internal 9958-to-VRAM transfer is complete. This is presumably to allow faster CPUs.

Finally, MTXPlus+ has a state-machine in its GALs that causes it to add additional wait-states when certain ports are accessed. Tony knows when/why this is necessary, although if the 9958 were a true drop-in-replacement for the 9918/9929, it wouldn't be. This state-machine can be disabled through a jumper.

How HexTrain works

HexTrain outputs static text, such as the dashboard above the active video area, by blanking the VDP output, updating the VDP at full speed, and later re-enabling VDP output.

HexTrain includes machine generated Z80 code to repaint the HexTrain video area. This is called the instant the vertical blank starts, and is specifically designed to output data as close as possible to the known 9918/9929 timings (as stated above). To this end, it starts by outputting data (to control and data ports) at full speed, and does this until it knows it is within 8 scan-lines of the end of the vertical blank. At this point it switches to a mode in which data outputs are 32T apart. This mode lasts 8+64 scan lines, i.e. until the electron gun reaches the area of the screen with the video in it.

The machine generated Z80 code includes OUTs to send data and also to set up VDP addresses. It doesn't add delays between both bytes of an address (assuming it needs to write both), but if we are in the active area, it does ensure the following data output is 32T later.

Any subsequent updates to the screen (eg: to update the score) are done with 32T between data outputs.
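The pacing scheme described above can be sketched as a toy code generator. This is not the HexTrain compiler, which isn't shown on this page. `schedule` and `fast_budget_t` are hypothetical names; `fast_budget_t` stands in for "the cycle count at which we are within 8 scan-lines of the end of the vertical blank", and padding with NOPs and measuring the gap from the previous OUT to either port are simplifications of mine.

```python
OUT_T, NOP_T, LD_IMM_T = 11, 4, 7  # Z80 T-states for OUT (n),A / NOP / LD A,n
SLOW_GAP_T = 32                    # spacing needed in the active display


def schedule(ops, fast_budget_t):
    """ops: ("data", byte) or ("addr", lo, hi). Returns (listing, total_t)."""
    listing, now, last_out = [], 0, None

    def emit(text, t):
        nonlocal now
        listing.append(text)
        now += t

    def out(port, value, paced):
        nonlocal last_out
        emit(f"LD A,{value:03X}H", LD_IMM_T)
        # Past the fast phase, pad data outputs until 32T have elapsed.
        if paced and now >= fast_budget_t and last_out is not None:
            while now - last_out < SLOW_GAP_T:
                emit("NOP", NOP_T)
        last_out = now  # T-state at which this OUT starts
        emit(f"OUT ({port}),A", OUT_T)

    for op in ops:
        if op[0] == "data":
            out(1, op[1], paced=True)
        else:
            # Both address bytes are written without any delay between them.
            out(2, op[1], paced=False)
            out(2, op[2], paced=False)
    return listing, now
```

For two data bytes, `schedule([("data", 0x40), ("data", 0x41)], fast_budget_t=0)` pads with 4 NOPs and takes 52T, whereas a large `fast_budget_t` gives 36T and no padding. A real generator would presumably pick a tighter mix of instructions than NOPs alone.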

HexTrain's model of instruction timings has been validated by instrumenting MEMU and comparing how long the HexTrain compiler thinks a video update takes to run against how long the Z80 emulation in MEMU thinks it takes to run - the two match exactly.

Strategy

One approach is to run at 4MHz and take at face value the statement that a 9958 is compatible with the 9918/9929. I might expect that if I operate a 9958 only in a 9918 mode, then the delay requirements of the 9958 would be the same as or better than those of the 9918/9929. Therefore we might expect the 9958 would never ask the Z80 to wait (assuming we enabled that feature) and in principle the MTXPlus+ GALs need never do so either - however Tony believes the GALs need to add waits, and the Observations below appear to support this.

Another approach is to say that "hey, the 9958 has a mechanism to cause the Z80 to wait if you try to output too fast, and more than that, MTXPlus+ has an additional level of such protection". This would suggest that HexTrain code could be slowed down if the 9958 timing requirements are worse than with 9918/9929. However, we can perhaps compensate for this by running MTXPlus+ at a faster clock speed. But the net effect is that we should never get screen corruption, and yet we do.

Observations

Tests with MTXPlus+ running at 4MHz.

From the code supplied by Martin, it appears the 9958 feature where it can tell the Z80 to wait is enabled at startup.

At Memofest 2016 I hadn't realised that we were running in 60Hz mode. I thought that had been switched in the MTXPlus+ BIOS.

This code has been verified to successfully select 50Hz mode :-

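LD A,2         ; value for R#9 with bit 1 (the *NT bit) set
OUT (2),A      ; first byte to control port: register value
LD A,089H      ; 080H + 9 selects a write to register R#9
OUT (2),A      ; second byte to control port: register select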

If Dave sets the jumper to remove Tony's extra wait-states, the dashboard in the top 1/3 of the screen is corrupted.

With or without 50Hz mode the HexTrain video area is corrupted.

With or without Tony's extra wait-states the HexTrain video area is corrupted.

Martin's code to initialise the 9958 doesn't appear to set any mode/flags that would slow down the timings.

Theory

Tony asserts that the 9958 is not timing compatible with the 9918/9929.

This is certainly consistent with the observation that the top 1/3 can't be quickly repainted in the vertical blank. It is also consistent with his reading of the rather detailed timing diagrams linked from the bottom of this page, and also the traces from Dave's logic analyser.

Martin points out the less than precise wording in the 9958 spec around when the WTE bit applies. It says :-

However, WAIT function is not provided for incomplete access to the 
register and the color palette or for the data ready status of commands.

Tony also thinks WTE can only add waits to data writes until the prior 9958-to-VRAM transfer completes, i.e. it doesn't add waits to subsequent address and register writes.

This wouldn't matter if the 9958-to-VRAM transfer had its own latched copy of the VRAM address as well as the VRAM data byte to write there. However the thinking is that it doesn't, and so it is possible that outputting data to the 9958 followed very quickly by a new address can cause the data to go to the new (rather than the intended) address.

HexTrain's compiler is designed to be as quick as possible, as no such timing constraint exists on the 9918/9929. As a pathological example, if HexTrain wants to output data byte 040H, followed by an output address setup of 00040H, it will generate :-

LD A,040H
OUT (1),A      ; 11T data
OUT (2),A      ; 11T lo address
OUT (2),A      ; 11T hi address

The problem is that the "lo address" is written before the 9958-to-VRAM data transfer can occur, even at 4MHz.

Often the generated code won't be as bad, it might be :-

LD A,040H
OUT (1),A      ; 11T data
LD A,register  ;  4T
OUT (2),A      ; 11T lo address
...

or :-

LD A,040H
OUT (1),A      ; 11T data
LD A,literal   ;  7T
OUT (2),A      ; 11T lo address
...

but it's still too fast.
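The suspected race can be illustrated with a toy model. This models the theory above and not the real 9958 internals: it assumes a single shared address register that each address byte updates immediately, and a pending data byte that is only written to VRAM `commit_delay_t` T-states after the OUT. The name `simulate` and the value used for `commit_delay_t` are assumptions; 32T is simply borrowed from the 8us figure quoted for the 9918/9929.

```python
def simulate(writes, commit_delay_t):
    """writes: (t_states, port, value) in time order; port 1 = data, 2 = address.
    Returns {vram_address: value} under the no-address-latch hypothesis."""
    vram, addr = {}, 0
    low_byte_next = True  # the next port 2 write is the low address byte
    pending = None        # (commit_time, value) awaiting transfer to VRAM

    def commit(now_t):
        nonlocal addr, pending
        if pending is not None and pending[0] <= now_t:
            vram[addr] = pending[1]  # uses whatever address is current NOW
            addr = (addr + 1) & 0x3FFF
            pending = None

    for t, port, value in writes:
        commit(t)
        if port == 1:
            commit(float("inf"))  # WTE-style: earlier data finishes first
            pending = (t + commit_delay_t, value)
        elif low_byte_next:
            addr = (addr & 0x3F00) | value
            low_byte_next = False
        else:
            addr = ((value & 0x3F) << 8) | (addr & 0x00FF)
            low_byte_next = True
    commit(float("inf"))
    return vram
```

With the pathological sequence `[(0, 1, 0x40), (11, 2, 0x40), (22, 2, 0x40)]` and `commit_delay_t=32`, the model puts the byte at 0040H rather than the intended 0000H. With `commit_delay_t=8` the byte lands at 0000H, as it would if the transfer always completed before the next OUT.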

Solution?

If MTXPlus+ intends to be 100% hardware compatible with an MTX, then Tony's GALs would need to add additional wait states to cover cases like the one HexTrain is exposing. However, this in itself wouldn't actually achieve 100% compatibility, as it doesn't address the fundamental problem that the 9958 is slower than the 9918/9929 in some cases. This means the MTXPlus+ would be running slower than the MTX in such cases. So perhaps 100% hardware compatibility isn't a realistic goal. At least not with a 9958 in the design.

Another approach, specifically for HexTrain, is for the HexTrain compiler to generate its Z80 code differently for MTXPlus+. It could add delays between data writes and address writes always (not just in the active area, as it does today).

This would bloat the size of the generated Z80 code fragments, and some of them are already very close indeed to the maximum code size limit (12KB).
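A back-of-envelope sketch of that code size cost follows. The required gap between a data write and the following address write on the 9958 isn't known here, so `required_gap_t` is a parameter, and the fragment size and transition count in the example are invented numbers. Padding with 1 byte, 4T NOPs is also just an assumption about how the delay would be made.

```python
import math

NOP_T = 4                     # T-states per NOP
NOP_BYTES = 1                 # bytes per NOP
CODE_LIMIT_BYTES = 12 * 1024  # maximum code size limit mentioned above


def padding_nops(required_gap_t, natural_gap_t):
    """NOPs needed to stretch natural_gap_t up to at least required_gap_t."""
    return max(0, math.ceil((required_gap_t - natural_gap_t) / NOP_T))


def padded_size(code_bytes, transitions, required_gap_t, natural_gap_t=11):
    """Fragment size after padding every data-to-address transition."""
    nops = padding_nops(required_gap_t, natural_gap_t)
    return code_bytes + transitions * nops * NOP_BYTES
```

For example, if 32T were needed, back-to-back OUTs (11T apart) would need 6 NOPs each, and a hypothetical 11000 byte fragment with 300 such transitions would grow to 12800 bytes, which is over the limit.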

On the flip side, we know 9958 WTE holds back data writes, so I can stop the compiler adding delays in code that runs during the active time, and the resulting code size reduction may well offset the additional delays described above.

But there is a residual problem: the HexTrain compiler keeps track of how many cycles have elapsed as the code it is generating is executed. Ordinarily this is used to choose between fast and slow updates, and we can argue this isn't needed any more, just do fast updates always and let WTE pace us. However, it's also used in the game loop, to pace the overall speed of the game.

The WTE (and Tony's extra-wait GAL) will be adding wait cycles and it's actually quite hard to work out in advance how many of these will be added and when. That requires a very detailed understanding of the internals of the 9958 and Tony's GAL.

Perhaps the way forward is to make an estimate that however many cycles the Z80 would ordinarily need to update the HexTrain video, there will be 10% (or some other number) extra wait cycles added. This would require some experimentation to tune it, and to see how workable it is.
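A minimal sketch of that estimate, and of how the percentage might be tuned from a measurement, is below. The function names are made up, the 10% default is only the guess from the paragraph above, and `FRAME_T_50HZ` assumes a 4MHz Z80 and exactly 50 frames per second.

```python
FRAME_T_50HZ = 4_000_000 // 50  # assumption: T-states per frame at 4MHz, 50Hz


def with_waits(nominal_t, wait_percent=10):
    """Nominal T-states plus an estimated allowance for wait cycles.
    Integer arithmetic, with the allowance rounded up."""
    extra = -(-nominal_t * wait_percent // 100)
    return nominal_t + extra


def measured_percent(nominal_t, measured_t):
    """Percentage of extra cycles seen in a measurement, for tuning."""
    return 100 * (measured_t - nominal_t) / nominal_t
```

So an update the compiler counts as 1000T would be budgeted as `with_waits(1000)`, i.e. 1100T. If a logic analyser trace showed it actually took 1130T, `measured_percent(1000, 1130)` suggests trying 13% next.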

Links


This page maintained by Andy Key
andy.z.key@googlemail.com