Dbug's blog - Shifter Distortions

Sun 18th October 2020

Horizontal Distortion
ST vs STe
How it's done on the Amiga
How it's done on the STe
Timings
Register trickery
Inner loop
Alternative methods

A few days ago, I saw on Twitter an interesting link to an article from Mark Wrobel explaining how the classic horizontal distortion effect was done on the Amiga computer using the copper, complete with some source code and pictures to explain it all.

Since I had done that on the Atari STe back in the day, I answered that maybe I should explain how to do it on the Atari, so here we are!

Horizontal Distortion

Before going into details, I suggest you take a look at the video of what the effect looks like:

I hope you now have a better idea of what this is about, but just in case: what you just saw is the good old golden mask of the Egyptian King Tutankhamun, distorted per scanline in left and right panoramic overscan (408x200 resolution) at 50 frames per second on an Atari 520 STe.

ST vs STe

I must start with a disclaimer: the method I'm presenting here only works on an STe computer (Mega STe included). It will not work on the ST/STF machines, because they lack the ability to change the screen address after the screen has started being refreshed, and it will not work on either the TT or the Falcon, because on these machines sync code is pretty much impossible to do.

Atari 520 STe (From the Save The Earth demo)

That being said, we can start!

There are quite a few differences in architecture between the Atari and Amiga computers:

The Amiga has many specialized co-processors (to draw objects, sprites, change registers, play sound)
The memory access time allocated to the 68000 is fixed on the Atari ST, but changes on the Amiga depending on what else is running at the same time
The Amiga bitplanes have independent pointers that can be moved separately, while on the Atari you always have one single pointer to a fixed, interleaved bitplane structure

Many of the things the Amiga can do in hardware by just changing a couple of parameters or registers can only be done on the Atari ST by brute-forcing copies of data using the CPU.¹

The Atari 520 STe internals

While the STe brought some significant improvements (replay of sampled sounds, blitter, more colors, more flexibility in changing the screen addresses), it is still quite lacking. On the specific topic of "wobbling the entire screen", though, I believe we are not that far from what the Amiga can do!

How it's done on the Amiga

I will let you read the article (linked above) for the details, but basically the whole idea is to use the Copper co-processor to change the screen address on each new scanline.

The Copper is a very specialized device which reads a list of instructions that are basically "wait for this" and "write that there". In terms of "distorting the screen", that means a list of 200 pairs of "wait until the next line starts" and "these are the values to write in the screen registers to point to this particular part of the picture".

Then, each frame, the program "patches" the Copper list to recompute the screen address values based on whatever parameters it has (in this case, a basic sine curve).
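To make the "wait for this / write that there" idea concrete, here is a toy model in Python. It is not Amiga code: the register name `SCREEN_PTR`, the base address, the stride and the sine parameters are all made up for illustration, and a real Copper list uses several hardware registers per line.

```python
import math

LINES = 200  # number of visible scanlines, as in the article


def sine_offsets(frame, amplitude=16):
    """Hypothetical per-line offsets following a basic sine curve."""
    return [round(amplitude * math.sin((line + frame) * 2 * math.pi / 64))
            for line in range(LINES)]


def build_copper_list(base_address, stride, offsets):
    """Build a toy list of ("WAIT", line) and ("MOVE", register, value) entries."""
    program = []
    for line in range(LINES):
        program.append(("WAIT", line))
        # "SCREEN_PTR" is a made-up name standing in for the real screen registers.
        program.append(("MOVE", "SCREEN_PTR",
                        base_address + line * stride + offsets[line]))
    return program


def run(program):
    """Interpret the list and return the value written on each scanline."""
    shown = {}
    current_line = None
    for instruction in program:
        if instruction[0] == "WAIT":
            current_line = instruction[1]
        else:
            shown[current_line] = instruction[2]
    return shown
```

"Patching" the list each frame then amounts to rebuilding (or overwriting) the `MOVE` values with `sine_offsets(frame)` for the new frame number, while the `WAIT` entries stay untouched.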

What is interesting here is that this system frees the programmer from having to deal with timings: this is all offloaded to a cycle-accurate specialized device (that we do not have on the Atari ST). On the other hand, this also means that some additional bus usage is generated, since the data is written by the CPU and read by the Copper, which itself changes the registers².

How it's done on the STe

On the Atari STe, we do not have a copper, but we can synchronize the 68000 with the video display by reading the Shifter's internal registers.

The whole idea is to do exactly the same thing as on the Amiga, except instead of writing a Copper list, we use the 68000 as if it was a mix between the Copper and the 68000, to both compute the values and directly write them in the Shifter registers.

Regarding the Shifter registers, we need to use the following:

$ffff8205 : Video Address Counter (High Byte)
$ffff8207 : Video Address Counter (Mid Byte)
$ffff8209 : Video Address Counter (Low Byte)
$ffff8265 : Video Pixel Offset

On the Atari ST, the Video Address Counter could only be read, but on the STe Atari made it possible to change the values during the display of the picture, between the scanlines (so basically during the invisible horizontal move to the next line).

The Video Pixel Offset is also a new register, which allows us to adjust the position by 0 to 15 pixels. By using a combination of the four registers we can effectively force the Shifter to display any place we select in memory, as long as we do it when the Shifter accepts it.

Timings

As I mentioned earlier, the Atari ST cycle usage is highly predictable, and fortunately for us, the situation did not change on the STe:

You can open borders (which increases the displayable area), play sampled sounds, use the new registers, etc., and none of that has any impact on the number of clock cycles allocated to the CPU.

The method we are using is very similar to the one I presented in the previous article about the Mind Bender illusion effect. Knowing that each scanline matches 512 clock cycles on the 68000 side, it's just a matter of doing this:

wait for scanline 0
change video address registers
wait for scanline 1
change video address registers
wait for scanline 2
change video address registers
wait for scanline 3
(...)

Before doing that, we need to make sure we are properly synchronized with the display, because if we start in the middle of a scanline the Shifter will not accept the values. For that we use the time-tested "invert wait delay":


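 move.b #0,$ffff8209.w      ; Force the low byte of the video counter to zero
 moveq #16,d2               ; d2 = 16, used to invert the delay
.wait_sync:
 move.b $ffff8209.w,d0      ; Read the low byte of the video counter
 beq.s .wait_sync           ; Still zero: the display has not started yet
 sub.b d0,d2                ; d2 = 16 - counter value
 lsl.b d2,d0                ; Variable-length shift that absorbs the jitter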

The idea is simple:

We force the video counter lower byte to zero, and we wait for it to become non-zero - the indication that the video display actually started.

At this point, the value we just read from $ffff8209 may be 2, 4, 6, ... so we need to compensate for this delay in order to reach a perfect sync point, which we do by simply using the lsl instruction.

The 68000 has no barrel shifter, so the larger the amount you want to shift, the longer it takes. As it happens, the number of cycles taken by each shifted bit matches the speed at which the Shifter loads data from memory, so all we have to do is subtract the value we read from 16 in order to get an inverted delay:

The later we get out of the loop, the less we shift, and when the lsl is done, we are at a fixed point in time.
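The compensation can be checked with a few lines of Python. This is a simplified model, not an emulation: it assumes the low byte of the counter advances by 2 every 4 cycles, uses the 68000 timing of 6 + 2n cycles for a register shift by n bits, and ignores the constant cost of the other instructions.

```python
LSL_BASE_CYCLES = 6          # fixed cost of lsl.b Dx,Dy on a 68000
LSL_CYCLES_PER_BIT = 2       # extra cost for each shifted bit
CYCLES_PER_COUNTER_UNIT = 2  # assumption: the counter gains 2 every 4 cycles


def cycles_until_synced(counter_value):
    """Cycles elapsed since the display started, once the lsl has completed.

    counter_value is what was read from $ffff8209 (2, 4, 6, ...).
    """
    elapsed_at_read = counter_value * CYCLES_PER_COUNTER_UNIT
    shift_count = 16 - counter_value  # what sub.b d0,d2 leaves in d2
    lsl_cycles = LSL_BASE_CYCLES + LSL_CYCLES_PER_BIT * shift_count
    return elapsed_at_read + lsl_cycles


# Whatever value we happened to read, we end up at the same point in time.
sync_points = {cycles_until_synced(value) for value in range(2, 18, 2)}
```

In this model every possible read lands on the same cycle count, which is exactly the property the "invert wait delay" relies on: two cycles lost waiting are two cycles saved in the shift.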

From there, it's just a matter of staying in sync, which means counting all the instructions!

Oh, last thing: Don't forget to disable the automatic optimizations from your assembler!

Register trickery

As usual in demo code, the idea is to optimize the base effect so you can maximize the free CPU time you can use for other interesting things.

Since we are going to change the screen address every single line, it's important this is done efficiently with as few modified registers as possible.

There are multiple ways of doing it, but I do like movep.l³, so what I tend to do is the following:


 lea $ffff8260.w,a0     ; 8/2 resolution
 lea $ffff820a.w,a1     ; 8/2 frequency
 (...)
 movep.l d4,-5(a1)      ; 24/6 - SCREEN ADDRESS ($ffff8205/07/09/0B)
 move.b d4,91(a1)       ; 12/3 - PIXEL SHIFT ($ffff8265)

Obviously you are not obliged to use d4 and a1; you are free to use any registers you want. In my case it is a side effect of having a0 and a1 already set up for the left-right overscan.

What this code does is write the three top bytes of d4 to the three Video Address Counter registers⁴, and then write the lowest byte to the Pixel Shift register.

This way I can keep the complete information for a scanline in one single 32-bit value, by storing the address shifted left by 8, plus the fine pixel shift in the lowest byte.
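As a sanity check of that packing, here is a small Python model of where each byte ends up. The function names are mine, and the register addresses are the ones listed earlier in the article.

```python
def pack_scanline(address, pixel_shift):
    """Pack a 24-bit screen address and a 0-15 pixel shift into one 32-bit value."""
    assert 0 <= address < (1 << 24)
    assert 0 <= pixel_shift <= 15
    return (address << 8) | pixel_shift


def movep_l(value, base):
    """Model movep.l: four bytes written to every other address, high byte first."""
    return {base + 2 * i: (value >> (24 - 8 * i)) & 0xFF for i in range(4)}


def write_scanline(value):
    """Byte writes performed by the movep.l / move.b pair."""
    writes = movep_l(value, 0xFFFF8205)  # movep.l d4,-5(a1)
    writes[0xFFFF8265] = value & 0xFF    # move.b d4,91(a1)
    return writes


writes = write_scanline(pack_scanline(0x012345, 7))
```

With the address $012345 and a pixel shift of 7, the three counter registers receive $01, $23 and $45, while the 7 goes both to the harmless $ffff820b and to the pixel offset register at $ffff8265.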

Inner loop

When you put it all together, you get an inner loop that does the following:

Set the screen address
Open the left border
Compute the address of the next scanline
Open the right border
Loop back for the next scanline

Visually it looks like that:

Panoramic Overscan

The top area is where we wait for the screen to start. The first line synchronization happens somewhere at the start of the top red line, then we have 199 lines with the left and right borders opened, and then it is back to black at the bottom of the screen.

Here is what the code looks like, and no, I did not pre-compute anything, so the address calculation is kind of ugly :) In the comments, timings are given as cycles/nops, or as nops only when there is a single number.


 move.w #199-1,d7       ; 8/2
king_tut_picture_loop: 
 movep.l d4,-5(a1)      ; 24/6 - SCREEN ADDRESS ($ffff8205/07/09/0B)
 move.b d4,91(a1)       ; 12/3 - PIXEL SHIFT ($ffff8265)

 move.w a0,(a0)         ; 8/2 - LEFT BORDER OPENING
 move.b d0,(a0)         ; 8/2 - LEFT BORDER OPENING

 pause 62

 moveq.l #0,d1          ; 1
 moveq.l #0,d2          ; 1
 and.w #510,d3          ; 2 - Stay in the sine table
 move.w (a2,d3),d1      ; 4 - Fetch offset from the sine table
 addq.w #2,d3           ; 1 

 move.l d5,d4           ; 1
  
 move.w d1,d2           ; 1
 and.w #15,d1           ; 2 
 move.b d1,d4           ; 1 Pixel shift

 lsr #4,d2              ; 4
 lsl #3,d2              ; 3
 lsl #8,d2              ; 6
 add.l d2,d4            ; 2 Final offset

 move.b d0,(a1)         ; 8/2 - RIGHT BORDER OPENING
 move.w a0,(a1)         ; 8/2 - RIGHT BORDER OPENING

 add.l d6,d5            ; 2

 pause 15

 dbra d7,king_tut_picture_loop     ; 12/3 taken, 16/4 not-taken
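For readers who do not speak 68000, the address calculation in the middle of the loop can be mirrored in Python. This is a sketch under a few assumptions: the shifts without a size suffix are treated as word-sized, the values in the sine table are non-negative pixel offsets below 512 (so nothing is lost in the 16-bit shifts), and the function name is mine.

```python
def next_line_value(line_base, offset_pixels):
    """Mirror the register arithmetic of the inner loop for one scanline.

    line_base is d5: the address of the start of the line, shifted left by 8.
    offset_pixels is the word fetched from the sine table (d1).
    """
    pixel_shift = offset_pixels & 15             # and.w #15,d1
    d4 = (line_base & 0xFFFFFF00) | pixel_shift  # move.l d5,d4 / move.b d1,d4
    d2 = (offset_pixels >> 4) & 0xFFFF           # lsr #4,d2: whole 16-pixel blocks
    d2 = (d2 << 3) & 0xFFFF                      # lsl #3,d2: 8 bytes per block
    d2 = (d2 << 8) & 0xFFFF                      # lsl #8,d2: align with the address
    return (d4 + d2) & 0xFFFFFFFF                # add.l d2,d4


# A line starting at $020000, displaced by 37 pixels:
value = next_line_value(0x020000 << 8, 37)
address, pixel_shift = value >> 8, value & 0xFF
```

37 pixels is two full blocks of 16 pixels (16 bytes in the interleaved 4-bitplane format) plus 5 remaining pixels, so the line is fetched from $020010 with a pixel shift of 5.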

On a side note, we have about 308 free clock cycles (the 62 and 15 pause blocks) in each scanline. That's about 60%, enough, for example, to perform a complete color palette change on each scanline.
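The numbers can be double-checked by adding up the timing comments of the listing. The small Python tally below assumes that `pause n` burns n nops of 4 cycles each and that the `dbra` is taken, and groups the instructions under labels of my own.

```python
CYCLES_PER_NOP = 4
CYCLES_PER_SCANLINE = 512

# (label, cost in nops) for one pass through the loop, dbra taken
INNER_LOOP = [
    ("movep.l screen address", 6),
    ("move.b pixel shift", 3),
    ("left border opening", 2 + 2),
    ("pause 62", 62),
    ("sine table fetch", 1 + 1 + 2 + 4 + 1),
    ("move.l d5,d4", 1),
    ("pixel shift extraction", 1 + 2 + 1),
    ("offset shifts and add", 4 + 3 + 6 + 2),
    ("right border opening", 2 + 2),
    ("add.l d6,d5", 2),
    ("pause 15", 15),
    ("dbra taken", 3),
]

total_nops = sum(cost for _, cost in INNER_LOOP)
free_nops = sum(cost for label, cost in INNER_LOOP if label.startswith("pause"))

total_cycles = total_nops * CYCLES_PER_NOP
free_cycles = free_nops * CYCLES_PER_NOP
free_ratio = free_cycles / CYCLES_PER_SCANLINE
```

The total comes out at 128 nops, which is exactly the 512 cycles of a scanline, and the two pause blocks account for 77 of them.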

You can find the complete source code, including the RMAC assembler and PictConv picture converter executables in the Defence Force SVN repository in the Sine Wave Tutorial.

Alternative methods

Obviously, if you are allergic to synchronized code, you can use Timer B to do the same thing, but using an IRQ is not without drawbacks: you can forget about overscan, and the overhead of calling and returning from an IRQ is not particularly light.

Personally I hate the MFP, the less I have to touch it, the better, and wasting CPU on context saving on an 8 MHz machine just makes my skin crawl :D

But that's your choice!

I hope you now have a better understanding of how this effect works.

1. Probably a reason why Atari ST demo coders became kind of hardcore optimizers.↩
2. This is not particularly problematic if you only change a couple of values per scanline, but heavy Copper usage can impact the CPU significantly.↩
3. When you need to use movep you're already f*d.↩
4. The lowest byte is written to $ffff820b which does not actually exist↩