The random ramblings of a French programmer living in Norway...
2020
← Dungeon Master introKeyboard handling (1) →
  Mind Bender
Sat 15th August 2020   
Size coding is an interesting challenge: After spending a day at work using a computer that counts memory in gigabytes and storage in terabytes, it's refreshing to go back to basics.

Back in the days, people HAD to write compact and efficient code, simply because the hardware would not allow for anything else.

These days it's more like an intellectual puzzle game, which I find deeply satisfying.

Mind Bender, my entry to Sommarhack 2020's 256 bytes intro competition is not my first attempt at size coding.

Oric minigames

If you've followed my productions over time you probably already know three of my Oric minigames:
4K Kong
4K Kong
  • 4K Kong
  • Cyclotron
  • 3K FreeCell
The three games all have in common the fact that they use quite a lot more memory than their claimed size:

They are all stored on disk as compressed executables, and decompressed in memory before they run.

Additionally, they are all written for the 6502 CPU, which may not be very advanced in term of supported operations, but has the benefit of using 8 bit opcodes, resulting in quite compact code compared to the 68000 cpu which supports a rich instruction set and quite advanced addressing modes, but requires 16 bit opcodes for each instruction.

Cyclotron
Cyclotron
The main issue in using a compressor, is that the decompressor itself takes room, and unfortunately on most platform this is not supported by the operating system, so the decompressor has to be part of your executable - uncompressed -.

If your target size is quite large (all things considered), then you can definitely pay the cost of having the depacker, but when targeting very small sizes, the trade off become very hard to justify.

If you have to drop the idea of using compressed executables, then the set of tricks you can use changes drastically!

3K FreeCell
3K FreeCell
A common source of size executable is loop unrolling: To offset the cpu cost of testing "are we there yet", it is common to do more than one operation on each loop, so for example instead of testing the end of the buffer to recopy every byte, you can do it every two bytes, 4 bytes, 8 bytes, or even more if you really want to maximize the performance.

Obviously this "copy pasted"1 code takes room, but if your executable is compressed, things like sequences of "move.l (a0)+,(a1)+" will be encoded in just a few bytes.

When using compressed executables, size coding become a game of "making patterns that can be repeated efficiently".

Procedural image

The "Mind Bender" idea started when Wietze contacted me on IRC to ask me if I would participated with a remote entry for Sommarhack 2020.

The main topic was "medium resolution", which was kind of inspiring to me since I've done trickery in both Low and High resolution, but not quite anything in Med rez, so I said yes.

Then some time later there started to be people playing with bootsectors, like in the good old days, and somebody suggested that if there was enough entries then they could add a 256 bytes intro category to Sommarhack... and since I was starting to get a bit short on time, I decided to try that first, and if I still had time, then eventually the Med Rez challenge.

Then at about the same time, Patapom shared this picture on Facebook, with the following comment

"TIEEEEEEENS ! CERVEAU DE MERDE !!!"
(Take that, shitty brain!!!)

Patapom inspiration
Patapom inspiration

It's one of these typical pictures that seems to move despite being totally static, and I wondered if it was possible to redo in 256 bytes, because after all it's just a chessboard with an alternate set of colors and some small flowery patterns at the intersection between tiles.

How hard can that be?

Changelog

Looking at the change log, it actually took about three days.

I basically followed by usual process, which starts by trying to get something running at all, even if it's ugly, slow, too big, etc... It's pointless to spend time optimizing at the start if you don't even have a rough idea of what a basic implementation would look like!

Then, when you have something running, look at it in detail (in this case, by tracing the code line by line in the debugger) to get a better idea of what is actually taking room or could be differently.

So, let's see how the process day by day

Day One (2020-05-28)

1207 bytes - first working version of the effect with the visible distortion
1051 bytes - removed setup and debug infos
930 bytes - can't quit anymore, no restore, no keypress
482 bytes - not sure what I changed...
478 bytes - replaced supexec by super
463 bytes - not clearing IMRA and IMRB anymore
331 bytes - Replaced the nops by a dbra
325 bytes - Removed one useless color change
296 bytes - Replaced the color mapping table by a rotated bitfield

So, the first implementation took 1207 bytes, which is quite far from the 256 bytes target!

Here is what it looks like:

First prototype
First prototype

It's not particularly awesome, the black bars are horrible, the dots are truncated and not centered, but the visual effect kind of work, so there's still some hope :D

Looking at the code history, the drop from 1207 bytes to 1051 bytes was caused by removing some of the uncessary system setup and restore code which I reused from a previous intro, as well as disabling debug information (symbols) from the final executable file format.

To drop from 1051 to 482 I introduced conditional assembly to switch between "development" and "release" modes: In the final version it was not necessary to be able to properly handle the keyboard and cleanly return to the operating system, so in "final release" mode I simply skipped all this part!

Simple gain really :)

The drop to 478 bytes was calling an alternative function of the operating system to switch to Supervisor mode2which did the same thing in less bytes.


ifne development_mode
; Development mode: We call the main routine in supervisor mode
; and when we come back, we return to the caller
move.l #super_main,-(sp)
move.w #$26,-(sp) ; XBIOS: SUPEXEC
trap #14
addq.w #6,sp

clr.w -(sp) ; GEMDOS: PTERM(0)
trap #1
else
; Final Release: We switch to supervisor, but never come back
; so no need to correct the stack pointer
clr.l -(sp)
move.w #$20,-(sp) ; GEMDOS: SUPER
trap #1
endc


To achieve 463 bytes I stopped setting the values of the MFP IMRA and IMRB registers (since I was not relying on any timer).

To go down to 331 bytes, I replaced a sequences of nops by a timed loop of equivalent duration:


rept 67
nop
endr


vs


moveq #22-1,d0
delay
dbra d0,delay


And finally to go down to 296 bytes, I replaced a small table of bytes used to define the alternate color patterns of dots from


SECTION DATA

colors
dc.b 0,1,1,0,1,0,0,1,0,1,1,0,1,0,0,1
dc.b 0,1,1,0,1,0,0,1,0,1,1,0,1,0,0,1
(...)
lea colors,a3
add d1,a3
(...)
move.b (a3)+,d2


to


move.l #%0110100101101001,d3

move.l d3,d4
lsr.l #1,d3
(...)
lsr.l #1,d4
addx d2,d2


This basically achieve the same result of having a list of values that change3.

Day Two (2020-05-29)

284 bytes - Replaced the blocks 2 and 5 nops by single instructions taking the same time
282 bytes - Replaced a move.b #0 by a sf
265 bytes - Changed the generation of the bitmap pattern
260 bytes - Replaced a few move.w #0, by clr.w
258 bytes - Accessed $ff8240 through an address register
256 bytes - Played with the initial temporisation code

To go down from 296 to 284 bytes, I basically replaced the remaining blocks of nops by neutral instructions that took the same amount of time to execute but using less room in memory, such as


lsl.l #6,d3


which takes 20 clock cycles so is the equivalent of


nop
nop
nop
nop
nop


where each of the 5 nop take 4 cycle to execute.

Obviously you better have to be sure that d3 can be modified at this place in the code!

Regarding the move.b #0 and move.w #0, each of these takes 4 bytes: 2 bytes for the move instruction opcode, and 2 bytes for the "0" immediate payload, which neither SF (Set False) or CLR require - easy gains.

At this point, we are down to 256 bytes (from the original 1207) and it still look exactly the same.

256 bytes version
256 bytes version

So the big question is: Now that we have something that technically fits the constraints and could be sent to the Sommarhack people, is it possible to reduce even more the size so we can make it prettier?

Let's try!

Day Three (2020-05-30)

244 bytes - Used the default screen instead of setting up mine
240 bytes - Changed the pattern generation to use bitplans offsets
238 bytes - Optimize the color setting for the bitmaps
254 bytes - Added a text and some fancy colors on the top
252 bytes - Reduced a constant value from .l to .w
250 bytes - Removed a sf $ffff8209.w which I did not need anymore
240 bytes - Reorganized the post video sync delay loop

The big win was achieve on day 3, when I decided to use the screen information setup in the system instead of defining my own screen.

A typical screen setup would be something either involving a call to a XBIOS function, or manual setup of screen registers:


; Set the screen on the buffer
move.l #screen,d0
sf.b d0
move.l d0,a0
lsr.l #8,d0
move.b d0,$ffff8203.w
lsr.l #8,d0
move.b d0,$ffff8201.w

(...)

SECTION BSS

save_screen_addr_1 ds.b 1
save_screen_addr_2 ds.b 1
save_vbaselo ds.b 1

even

screen ds.l 32000/4


while in practice we can just read the current screen address in page 4 system's variables:


move.l $44e.w,a0 ; screenpt


The main issue, is that you inherit of the screen as it was, which means in our case, a big ugly green image with a title bar with the name of our executable!

So the question become: How can I efficiently erase the screen?
Obviously we can just use brute force and some code to loop around to erase the 32000 bytes we found by reading $44e, but I decided to go for something a little bit fancy, and instead use the print function of the Gemdos to display a text!


;
; Print the Dbug title
;
pea Message(pc)
move.w #9,-(sp)
trap #1

(...)

Message
dc.b 27,"b",3 ; Set 1 as Ink color
dc.b 27,"E" ; Erase Screen
dc.b "Dbug"
dc.b 0


In case you did not know, the Atari ST system print function supports VT52 control codes, which means a single print out can do things like changing the color of the text, of the background, enable inverted video, move the cursor, etc... so I used that to select the color of my "Dbug" text while also deleting the entire image, just in a few additional bytes.

The rest of the optimisation was just some small things, like replacing the setting of two colors from two instructions to one, from


move.w #$707,$ffff8242.w ; The Purple crosses
move.w #$777,$ffff8246.w ; The White crosses


to


move.l #$07770707,$ffff8242.w ; The White and Purple crosses


And the rest is some small tweaks here and there, like playing with the color registers during the display to generate some gradient colors which are more satisfying to the eye.

Anyway, here is the final result:

The image generated by my 256 bytes program
The image generated by my 256 bytes program

Implementation notes

If you don't know the Atari ST, it may not be obvious what I did.

The image generated by my program appears to occupy the entire screen, including the borders, which normally cannot contain any graphics and only show the content of the first entry of the color palette (register 0 at address $ffff8240).

The normal display area is 320x200, which is much smaller than the actual viewable area - in this case 384x270 but it's more or less variable depending of the machines and screens -.

You can visualize the two areas on this annotated screenshot:

Image layout in detail
Image layout in detail

If you look at the inner part of the image, you will notice that the "Dbug @ Sommarhack 2020" text, as well as all the small white and purple patterns are inside the 320x200 rectangle: The reason is that I'm not doing any of the usual "fullscreen/overscan" trickery normally used to coerce and Atari ST in displaying actual bitmap in the borders.

The reason why you can see the green chessboard tiles over all the screen, is that it's not made of bitmap, it's only done by changing the background color register during the entire display:

change color 0 to dark green
wait
wait
wait
change color 0 to light green
wait
wait
wait
rince and repeat

if you do that often enough, and using the right timings, you can easily do vertical bands of alternative colors, and if every "n" lines, you add a few more additional delays, you can shift your entire pattern of column by one, resulting in the chessboard pattern you are seeing.

The nice thing about the Atari ST, is that the timings of the screen and instructions is 100% reliable and predictable, which makes such a code relatively easy to do, so here are some numbers for you:
  • The CPU runs at 8mhz (8 millions cycles per second)
  • At 50hz (screen refresh) we have 160256 cycles per frame (roughly 8mz/50hz)
  • These 160256 cycles are spread of 313 scanlines, with 512 cycles per scanline
When you know these numbers, and when you know how long each instruction take, it's quite easy to synchronize the color changes with the display and give the impression that things happen in a certain way :)

Anyway, here is the complete source code of what is actually assembled in the final release4.

I hope that was interesting for you, if you have questions don't hesitate to ask :)



opt o+,w+

SECTION TEXT

; ------------------
; Program start
; ------------------
ProgStart
clr.l -(sp)
move.w #$20,-(sp) ; GEMDOS: SUPER
trap #1

super_main
lea $ffff8240.w,a6
move.l #$07770727,$ffff8242.w ; The White and Purple crosses

; Print the Dbug title
pea Message(pc)
move.w #9,-(sp)
trap #1

; Draw the cross patterns
move.l $44e.w,a0 ; screenpt
lea 160*5(a0),a0

move.w #%0110100101101001,d3

moveq #7-1,d1
loop_y
move.l a0,a1

move.l d3,d4
lsr.l #1,d3

moveq #10-1,d0
loop_x
move.w d4,d2
lsr.l #1,d4
and #2,d2

lea 0(a1,d2.w),a2

lea Patterns(pc),a3
moveq #10-1,d5
loop_pattern
move.w (a3)+,(a2)
lea 160(a2),a2
dbra d5,loop_pattern

lea 16(a1),a1

dbra d0,loop_x
lea 160*30(a0),a0
dbra d1,loop_y

; VSync and disable interrupts
move.w #37,-(sp) ; VSYNC
trap #14

; Main loop
move.w #$2700,sr
loop
move.w #$345,(a6) ; Light blue-gray background

move.w #$370,d6 ; Color of first tile
move.w #$263,d7 ; Color of second tile
move.w #$fff,8(a6)


; Line zero synchronisation
moveq #14,d2
.wait_sync
move.b $ffff8209.w,d0
beq.s .wait_sync
sub.b d0,d2
lsl.b d2,d0

; To leave room for the "Dbug" line
; 64 nops*8 =
move.w #157-1,d0
delay
subq.w #1,8(a6)
dbra d0,delay

; The alternated color grid
moveq #10-1,d1
loop_lines

move.w #242-1,d0
loop_squares
move.w d6,(a6) ; 8/2 background color change

lsl.l #8,d3 ; 24/6 (DELAY)

move.w d7,(a6) ; 8/2 background color change

move.w (a6),(a6) ; (DELAY)

dbra d0,loop_squares ; 12/3 if branches / 16/4 if not taken

addq #1,d6
nop

dbra d1,loop_lines
bra.s loop

Patterns
dc.w %0000000001100000
dc.w %0000000001100000
dc.w %0000000011110000
dc.w %0000001111111100
dc.w %0000111111111111
dc.w %0000111111111111
dc.w %0000001111111100
dc.w %0000000011110000
dc.w %0000000001100000
dc.w %0000000001100000

Message
dc.b 27,"b",4 ; Set 4 as Ink color
dc.b 27,"E" ; Erase Screen
dc.b "Dbug @ Sommarhack 2020"
dc.b 0

end



1. In practice we tend to use macro functionalities in our assemblers, like REPT-ENDR in Devpac
2. On the Atari ST, you are not allowed to access hardware registers and system variables as a simple user, you need to escalate to supervisor mode
3. I originally choose to have a table because I was not certain I only have two values there
4. I removed the system saving/restoring code but you can see it on the SVN repository
comments powered by Disqus