Now and then somebody speaks of the "game" Shadow of the Beast and uses it as an example of something the Amiga range of computers could do but the Atari ST could never dream of achieving.
This is totally true: even with all the time and talent in the universe, there is no way a standard Atari 520 ST could compete in what is, after all, mostly a technical demo designed to exploit every single feature the Amiga had.
The actual point of contention is whether the Atari ST port of Shadow of the Beast was as good as it could possibly have been, considering the usual constraints of time, budget, etc...
The Amiga original version
The game was originally developed on the Amiga by Martin Edmondson and Paul Howarth of Reflections, with great music by David Whittaker, and was published by Psygnosis in 1989.
From a technical point of view, it's brilliant: there are tons of colors, multiple layers of parallax scrolling, large sprites, good moody music, etc...
As a game, it's abysmally bad: it's at the same level of complexity as the average 8-bit platformer, with monsters and dangers being just triggers that launch things along very predictable trajectories.
Still, it's an interesting game from a technical point of view, and that's probably why so many Atari ST demo groups have made their own demo versions showing the key points of the Amiga game.
And just to illustrate a bit, for those who don't know the game, here are a few more screenshots[1] before I speak of the ST version.
Just go to Moby Games if you want to see more screenshots; you can also find a number of longplays on YouTube.
Atari ST demos
Here are a few of the "Shadow of the Beast" Atari demos I could remember; if you can think of others, just tell me and I will add them :)
What these demos all have in common is that they run at a fixed 50 Hz, most of them use some form of overscan[2], and they have colorful graphics, but they also made some technical choices about what to display or not.
A good example is that both Funvision and Pendragons decided to have multicolor mountains, while both Equinox and NeXT used just a black shape for the mountains.
The reason is simple: the Amiga was able to display more bitplanes independently than the ST, so achieving something similar on the ST requires a lot of real-time masking and blitting instead of simply changing some screen memory addresses... which is extremely costly.
But we all know that demos can do things that games can't, because demos don't need to be (very) interactive, there are no highscores to maintain, no monsters to handle, etc...
Time to look at the Atari ST version of the game.
The Atari ST port
The Atari ST version was made by Mark McCubbin of Eldritch the Cat, and was also released by Psygnosis, in 1990.
Since Eldritch the Cat had previously released a very decent action game - Projectyle - featuring some fast and nice multi-directional scrolling, everybody was expecting the Atari ST version of Shadow of the Beast to be in very good hands...
When the game finally arrived in the hands of reviewers and players, the hype quickly dropped.
It was not a catastrophe: the graphics were definitely recognizable, the gameplay and layout were mostly the same, and most of the elements were there, which is more than we could say of the ports to 8-bit computers.
Still, it was a bit like being told to close your eyes to receive a great surprise, and when you open them all you see is something disappointing.
Let's try to list the differences:
- The game does not run at 50fps
- There is a big BEAST logo on top of the screen over a black background
- The color changes in the background are wobbling around
- The fast scrolling wall/barrier at the bottom is not there
- The intermediate layer with the small trees is gone
- The center layer elements all share a very bland palette
- The air balloon and the moon in the background are missing
- There is no in-game music
- There is no highcolor intro picture[3]
- There is no intro with parallax and big logo
That's quite a lot of differences, and it's very common to hear that shortcuts were made in the development of games or ports in general because of budget issues, time pressure, etc...
Quality Claims
Being a professional game developer myself, I'm totally aware of this problem, and I know many games on many machines have been impacted by that, but what really niggled my spider senses was some references to an interview of the author that kept popping up all over the place (on Facebook, Atari Mania, etc...), leading to this post from dlfrsilver on Atari Forum:
shadow of the beast ST (Mark mc cubbin here)
Postby dlfrsilver » Mon Feb 02, 2009 8:22 am
I have mailed Mark Mc Cubbin about shadow of the beast ST and a game called tentacle. He has given some bits of what happened with beast ST version:
"Good questions, Shadow of the Beast did very well on the ST (it was no.1 two years in a row in the sales charts). As far as the details, happy to share them.
I know some ST demo scene guys thought it could have been done better and in reality there are areas where it could have been better, however, game development is about compromise and this was the case here. The first version of the game I did used all the same art as the Amiga running at 60/50fps, however, it required an ungodly amount of memory since the most obvious tricks were pre-computing the parallax scrolling ( for the underground section, which was only 2 layers ), then, simply movem.l the tiles to the screen as needed. The reality is thought that this wasn't reasonable ( it would have been a 1meg only game versus 512k ).
After several iterations/tickes, it was decided that this version would be launched either as 1 meg only or on the STE (and of course then we could use the HW scrolling too). In the end the version that was used for the underground sections used 1 bit plane for the background layer of parallax and 2 bit planes for the from layer, this allowed each layer to be independently drawn to the screen as fast as possible without having to have a huge pre-compute buffer. Although I still precomputed the shifted blocks, by carefully arranging the palette so that the odd and even colors were the same for the first 8 colors in the palette it meant that I could draw anything into the first bit-plane and it wouldn't affect the front layer. For sprites, due to memory constraints, again I couldn't pre-shift those for speed ( as always the fastest way draw was movem.l ), so I used another trick which was movep.l, which allowed you to write the graphics data on odd boundries across the ST interleaved screen. I had custom sprite routines for odd and even boundries for speed (versions that used movep and versions that were just straight movem).
There were similar compromises for the tree-sections ( the 11 layers of parallax sections ), I wrote a copper emulator on Timer B that heavily modified the palette to get all the colors needed, the large trees were all sprites pre-computed. Although of course, doing this reduced the overall CPU time available but was necessary to get the additional colors to get it looking half-way decent. The reality is, if I didn't actually have to have a game in there yeah, I could have had the nicely pre-computed scrolling doing it's thing, looking like the Amiga version ( I already had this for first prototype ). Psygnosis decided not to do a 1 meg only version or even the STE version, unfortunately. Around the same time frame we ended up doing the ST version of Flimbo's Quest, this was also a bit of a show-case on the Amiga. This time we could actually get two full color layers running because we could pre-compute the parallax.... Still, overall, I was fairly happy with it given the constraints, many of which you don't have for the demo-scene :)"
Based on this email exchange, the conclusions had been drawn: the Atari ST version had been done by an experienced developer who did all he could possibly have done, and the author of the post concluded:
"I guess now why the ST version needs to be powered at 20mhz instead of 8mhz to be smooth :D"
And the same message kept getting repeated again and again[4].
Since I did not have enough concrete elements to form a solid opinion on the topic, I decided to look inside the actual code.
Reverse Engineering - Summarized
What follows was originally posted on Facebook and Atari Forum while I was digging in the code. I had written a summary on the original Atari Forum thread, but this forum does not show any pictures if people are not logged in, so I decided to write this blog post.
I started the disassembly process from the original version of the game in STX (Pasti) format to make sure that the code I would look at had not been tampered with during the cracking process, but I also got the DBUG "hdd/falcon compatible" version, as well as two "filed" versions (one from Automation and one from Medway Boys) because games in file format are often easier to work with.
I used a mix of Easy Rider and Steem Boiler 3.9.4 for debugging/tracing, and then applied a bunch of patches to my version to make sure the code did what I thought it was supposed to do.
The bottom line is: NO, Shadow of the Beast is not as good as it could have been, not by a long shot. The code is at best "good enough" for a prototype, and most of the patterns I've seen in it suggest a "beginner++" level at best.
Among the issues:
- The coder NEVER uses absolute short addressing mode, so all the color changes take way longer than necessary
- All the color changes are done with move.w: he never uses move.l to write two consecutive colors, and he never uses (an)+ or movem.l either
- He uses JSR in many places where he could have used BSR
- He loves to JSR / RTS (instead of JMP)
- The famed "Copper Emulator" is a mess, a lot of time is spent jumping to empty routines
- The code is full of mulu #400 and divu #25
- He often uses clr.l dn instead of moveq #0
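A few of the patterns listed above can be put side by side. This is only an illustrative sketch in Devpac/vasm syntax, not code taken from the game: the label some_routine is a placeholder, and the cycle counts are the nominal figures from the 68000 timing tables (on the ST they often end up rounded to a multiple of four).

```asm
; Illustrative only, not taken from the game.
; Cycle counts are nominal 68000 timings.

        move.w  d0,$00ff8240.l      ; absolute long:  16 cycles, 6 bytes
        move.w  d0,$ffff8240.w      ; absolute short: 12 cycles, 4 bytes

        clr.l   d0                  ; 6 cycles
        moveq   #0,d0               ; 4 cycles, same result in d0

        jsr     some_routine        ; absolute long call: 20 cycles, 6 bytes
        bsr.w   some_routine        ; PC-relative call:   18 cycles, 4 bytes
        rts

some_routine:                       ; hypothetical placeholder
        rts
```

None of these changes is big on its own; the point is that they are free, and they add up when the same pattern is repeated on every raster line.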
Now, from what I've learned:
- The Eldritch the Cat logo is just an IFF/ILBM animation in ANIM format, so whoever feels like doing a SOTB++ can easily change that.
- The code is a mess, but a proper disassembly should probably not be too difficult to understand
- Adding STe palettes is trivial
- There's probably enough CPU time to be recovered from the small things I pointed out to either add music (hint: you have three chiptunes from FFT in the Phaleon you can use for that) or put back the missing scrolling barrier at the bottom
You can play with the disassembled "loader source code": Just assemble it with vasm, then copy it over the floppy with the original files (or overwrite the original beast.prg).
vasm.exe -m68000 -Ftos -noesc -no-opt -o %SOTB%beast2.prg %SOTB%beast.s
You can also use genst/devpac if you want :)
Reverse Engineering - The Long Version
For this long version, I'm just going to copy-paste the messages I wrote on Facebook when I was digging in the code.
Ok, so thanks to GGN I got a filed version, and using Steem SSE "Debug Mode" I'm able to look at the code. I only looked at the Timer B (because the colors tend to flicker) and I was wondering what the code was doing:
From what I see, the code starts by forcing the low resolution mode (move.b #$0,$ffff8260), not even bothering to use .w addressing mode to make it faster.
Then it saves d0 and a0 in two locations ($53dd8 and $53ddc), and they get restored from there at the end.
Then it loads into d0 a 16-bit value from $53a64, multiplies it by four, and uses it to look up a routine to which it jumps.
When it's back, it reloads the value, increments it, checks if it reached 50 ($32), and if so forces it back to zero.
And finally it resets the IERA and IMRA registers, still ignoring .w addressing mode.
I've not looked at anything else in the code, but as far as I'm concerned it's about at the level of the first "learn making rasters using Timer B" articles from the first issues of ST Magazine :p
So, prove me wrong, but this can be improved by:
- Using short addressing everywhere
- Not changing the resolution back every hbl
- Add #4 each time instead of add #1 and multiply by four
- Possibly reverse the table and decrement instead of comparing to #50
Does that make sense?
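To make the suggestions above more concrete, here is a rough sketch of what a leaner dispatcher could look like. It is not the game's code: all the labels (timer_b, band_offset, band_table, dummy_band) are made up for the example, the resolution write is simply dropped, and the end-of-interrupt line assumes the MFP runs in software end-of-interrupt mode, which I have not checked for this game.

```asm
; Sketch of a leaner Timer B dispatcher (hypothetical names, not the
; game's code). The band offset is stored pre-multiplied by 4 and
; counts down, so no multiply and no compare against 50 are needed.

BANDS   equ     50

timer_b:
        movem.l d0/a0,-(sp)             ; save the work registers on the stack
        move.w  band_offset(pc),d0      ; 0, 4, 8 ... (BANDS-1)*4
        lea     band_table(pc),a0
        move.l  0(a0,d0.w),a0           ; address of this band's routine
        jsr     (a0)                    ; change the colours
        lea     band_offset(pc),a0
        subq.w  #4,(a0)                 ; next band (table is in reverse order)
        bpl.s   .done
        move.w  #(BANDS-1)*4,(a0)       ; wrapped: restart from the last entry
.done:
        movem.l (sp)+,d0/a0
        bclr    #0,$fffffa0f.w          ; assumed: software end-of-interrupt, Timer B
        rte

band_offset:
        dc.w    (BANDS-1)*4

band_table:                             ; reverse order: last entry is used first
        rept    BANDS
        dc.l    dummy_band
        endr

dummy_band:                             ; stand-in for the real colour routines
        rts
```

The subq/bpl pair gets the wrap-around test for free from the flags set by the decrement, which is the whole reason for storing the table backwards.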
So, I've been digging a little bit more, and I have to say that things don't look too good for the supposed quality of the code.
Basically, using Steem debug, I've traced the HBL routine, and in the red frame you can see the way the color palettes are changed, using 16 move.w abs.l,abs.l to change each of the 16 colors.
If my cycle table is correct, each of these instructions takes 28 clock cycles, so 16*28=448 clock cycles to change the full palette, plus the rts and the rest of the hbl code I posted earlier in the thread.
For reference, on the Atari ST, each frame has 160256 clock cycles (at 50 Hz refresh rate) spread over 313 scanlines, which gives us 512 clock cycles per scanline.
Seen like that, you may realize that 448 is kind of big: it's 87.5% of a scanline.
And just for the lol, remember that Spectrum 512 (a drawing program) changes the entire color palette three times per scanline, so it kind of looks like our small code sample could be optimized.
So first, one could notice that on both the left side and the right side the addresses are just incremented by two on each instruction (ff8240, 42, 44, 46, ...), so why not, instead of copying the colors one by one, do it two by two, using .L instead of .W?
The result is that instead of 16 move.w, we use 8 move.l, which if we stick to the absolute long addressing mode results in 8 times 36 clock cycles instead of 16 times 28 cycles, which gives us 288 clock cycles instead of 448. Not sure what you think, but to me that sounds like a "small" improvement (and just to be clear, this takes LESS code than the original code, so whatever the original programmer thought of demosceners' tricks that use up room does not apply here).
Can we do better?
Well, instead of using $00ff8240.l he could just have used $ffff8240.w (short addressing mode), which is available for the entire set of hardware registers on the ST, so instead of move.l abs.l,abs.l (36 cycles) we can use move.l abs.l,abs.w (32 cycles), which reduces the code to an even shorter 32*8 = 256 cycles (and also happens to remove 2 bytes from each instruction!).
Assuming you can afford to save/restore two registers, you can also do this:
lea $6eb5c,a0 ; 12
lea $ffff8240.w,a1 ; 8
move.l (a0)+,(a1)+ ; 20
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
That's 12+8+20*8=180 cycles, so we are already almost 2.5 times faster than the original code, and it takes less room.
So far, all the code I've seen is representative of this way of coding, but I need to dig more to find more.
If somebody can find me a good disassembler, compatible with vasm/Devpac syntax, that can disassemble some binary block which was assembled at some arbitrary absolute address, that would help :)
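Since this code runs inside an interrupt, the "save/restore two registers" part has a cost too. As a sketch (nominal 68000 timings, and assuming the stack is used for the saving, which is my choice and not something the game does), the complete version could look like this:

```asm
; Palette copy wrapped for use inside an interrupt (sketch).
; $6eb5c is the source address from the listing above.

        movem.l a0-a1,-(sp)         ; 24 cycles
        lea     $6eb5c,a0           ; 12
        lea     $ffff8240.w,a1      ;  8
        rept    8
        move.l  (a0)+,(a1)+         ; 20 each, 160 in total
        endr
        movem.l (sp)+,a0-a1         ; 28 cycles
                                    ; total: 24+12+8+160+28 = 232 cycles
```

Even with the register saving included, 232 cycles is still close to half of the original 448.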
Thanks!
I've been poking a lot at the color changes, because that was the easiest thing to investigate: just break with the debugger and look at the content of $70 (that's the address of the VBL routine), or use the Steem "Break on Timer B", which works as well.
This time I just looked at what was lying around in the main code: I just stopped randomly at some point and started tracing.
What you see on the screenshot is a bit of a routine which is called multiple times, and because it's well known that the 68000 has a multiplication instruction, the coder decided to use it twice; after all, a mulu #400 only takes about 48 cycles.
And because the 68000 is super fast, it makes sense to use "jsr (a4) / rts" when you could just have skipped the return part and done "jmp (a4)".
And no, that's not "super elite demoscene coding practices that don't work in the real world of game developers", it's just what people do because it's faster and takes less room.
So much fun.
Next time somebody tells you that the ST version of Shadow of the Beast is "as good as it could ever have been", just answer that they don't know what they are talking about, and ask them to please come back later when they have actual arguments to offer.
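For the curious, here is a sketch of the two usual alternatives to the patterns mentioned above: a small lookup table instead of mulu #400, and a tail jump instead of jsr/rts. This is not the game's routine. The names (times400_then_go, mul400, ROWS) are made up, a0 is assumed to be free as a scratch register, and the index is assumed to stay below ROWS so that the product fits in 16 bits.

```asm
; Sketch only: hypothetical routine, not the game's code.
; Assumes 0 <= d0.w < ROWS, so that d0*400 fits in 16 bits.
; Unlike mulu, only the low word of d0 is updated.

ROWS    equ     64                      ; assumed upper bound for the index

times400_then_go:
        add.w   d0,d0                   ; index * 2 = offset into a word table
        lea     mul400(pc),a0
        move.w  0(a0,d0.w),d0           ; d0.w = index * 400, no mulu needed
        jmp     (a4)                    ; tail call: the rts of the routine at
                                        ; (a4) returns straight to our caller

mul400:                                 ; 0, 400, 800, ...
value   set     0
        rept    ROWS
        dc.w    value
value   set     value+400
        endr
```

With nominal timings, the add/lea/move sequence is 4+8+14 = 26 cycles against about 48 for the mulu, and jmp (a4) costs 8 cycles where jsr (a4) followed by an rts costs 16+16.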
One of the things I always wondered was how Shadow of the Beast on the ST could manage to have unstable rasters, despite having a narrower than normal window size (it's 288 pixels wide, not 320) and not using the border color for the rasters.
The explanation is kind of sad really.
Earlier I explained the issue with the Timer B doing very complicated things very inefficiently, so I decided to bite the bullet and try to fix it; after all, being all "your code sucks" is one thing, fixing it is another :p
So, armed with the Medway Boys version of the game, I added some extremely ugly "patching" code that just replaces the existing code with some small code snippets, to see if that has a perceptible effect.
The top screenshot shows the position of the color gradient change on the screen (set to yellow so it's obvious).
As you can see the color change happens very late in the scanline, and since it's not stable it's very perceptible.
The reason why that one is particularly visible is that the code is WEIRD: it looks like the guy manually patched the code after it was already assembled; I can't find any other explanation for this code snippet:
$5406A:
MOVE.W $6EB3A,$FF825A.L
MOVE.W $6EB3C,$FF825C.L
MOVE.W $6EB3E,$FF825E.L
RTS
$5408A:
JSR $5406A
MOVE.W $6EB40,$FF8258.L
RTS
The JSR (A0) actually jumps to $5408A, which immediately calls $5406A to change three colors, then returns and finally changes $ff8258, which is the color used for the rasters.
Just applying the fix I suggested earlier (removing the resolution change, storing d0/a0 more efficiently and replacing the .L by .W on the register addresses) has a quite massive impact: that by itself allows the color change to almost be on the left side of the screen...
And my final change was to just change the background color first, and there, problem solved, as you can see on the last picture.
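As an illustration of that last change, this is roughly what the reordered routine could look like. It is a sketch and not the exact patch I applied: the label is made up, and the data addresses are simply the ones from the listing above.

```asm
; Sketch of the reordered routine (same data addresses as the listing
; above; the label is hypothetical). The visible raster colour goes
; first, and every hardware register uses absolute short addressing.

raster_band:
        move.w  $6eb40,$ffff8258.w      ; raster colour first: 24 cycles
        move.w  $6eb3a,$ffff825a.w      ; then the three other colours
        move.w  $6eb3c,$ffff825c.w
        move.w  $6eb3e,$ffff825e.w
        rts
```

In the original path, the jsr (20 cycles), the three move.w (3*28) and the rts (16) burn 120 cycles before the write to $ff8258 even starts, which is why the change lands so far to the right on the scanline.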
After that I published a summary article on the Atari Forum, and a user named SwapD0 applied some disassembly magic to generate an even better disassembly, which I looked at to do a final set of checks:
On the Atari Forum, SwapD0 used his disassembler to generate this semi-readable source code:
http://defence-force.org/.../atari/ShadowOfTheBeast/beast.s
I will leave it as an exercise for the readers to search for the following labels in the code and see if they can come up with more efficient ways to do the same thing:
- l001973 and l001998 (did we really need this 4(a1) ?)
- l002301 (too complicated to use movem.l ?)
- l002302 (lots of clr.l)
- l002363 (interesting combined use of pre decrement and add)
(followed by me wondering about some code I did not understand, but which happens to be the actual FDC/DMA loading code - thanks Orion Replicants for the explanation)
There you go, it's about as complete as it will ever be. I guess having it all on the blog will help future software archaeologists more than if it's spread over multiple places.
1. Courtesy of Moby Games
2. Some software trickery that allows the programmer to extend the display area by disabling some or all of the borders around the screen, thus allowing up to 416x276 pixels to be displayed instead of the usual 320x200
3. Doable on the ST with Spectrum 512
4. Denis - 21/06/2015: "As the coder explained it, this game was VERY hard to convert on ST, because every trick and hardware effects needed to be converted as software routines, as well as preshift the graphics, this leading to insane memory requirements (2mb!) when the amiga release needed only 512kb to run. The ST to get in software the same result as the Amiga version would need a 68000 running at 20 Mhz...... The STE hardware scroll brings more problems than it solves, because there are no sprites hardware assistance on it. What's the point of having a scroll running at 50 fps when the computer has so many sprites to put on screen at high speed ? the STE (never mind the ST) was not made for that."