$S@g·, g@S$$$ $$$$$$+-ù ú. `@g, .ú- -+g@ ½üýø^_ +-Ä- - $$$$$$-- -Ä-$b,`$-- --- y '-- Ä- -$$ g@S$$$-Ä- ---+ «¬¬«¬ $$$$$$ ¬¬«« $$l $ ¬¬«¬« $ «¬«¬¬«¬ $$ $$$$$$ «¬«¬¬« ¬¬«¬« $$$$$$ ¬«¬¬ $$$ l «¬«¬« $ ¬«¬¬«¬« $$ $(dt)$ ¬«¬¬«¬ +-- Ä -l$$ü$$-Ä-- -$$$ $Ä- --- $`+,._ _,$l $$$$$$-- Ä- -+ $$$y$$ Sü',$ ü $$,øý½SQ$ ½üýø^~ ,g@$$ ,g@S@, $$$l `ýÓ*S,_ _`*S$$b._ ,SSü' This article comes from Hugi issue 15 ú Released in May 1999 ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ OPTIMIZING "PUTBITS" PART II TAD ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ OPTIMIZING "PUTBITS" PART II ÄÄÄ( TAD )ÄÄÄ Introduction After reading Hugi #14, and especially Dario Phong's "Optimizing Putbits", I thought there was quite a bit more which needed to be said on this subject. And so here is my contribution, which I hope will inspire others to take this subject still further. As Dario Phong correctly said, the "putbits" function is very heavily used when writing a compressor (or "packer" as many call it), so it seems sensible to try and optimize it to reduce the amount of time needed to compress a file. But, this should be done after first optimizing the rest of the compressor, because let's face it, searching for the longest string match in a large block, or the most optimum packing sequence can take a 100 or a 1000 times more time than the "Putbits" routine which outputs the finished compression token. If anything the "Getbits" routine (used in the depacker/decompressor) would be a far better place to spend your time optimizing, after, of course, the actual compression algorithm itself. In this article I will only focus on optimizing the Putbits" routine, and I leave the rest of your super-compression engine for you to optimize. There is no real brain-blasting methods or fantastic new algorithms here, most of the techniques have been know about for years. The marker bit for example was used in the ZX Spectrum tape load/save routines and probably before years before 1982. Terminology "Bitstream": A continous block of bytes treated as one long series of bits. "Token": This represents a command or a number of data bytes. The token can range from a single bit upto 16 or more in length. "Putbits": This takes a token and writes all of its bits into the Bitstream. It has to deal with crossing byte boundaries and writing bit(s) to the correct bit positions in each Bitstream byte. "Sequence": I use the term sequence and string to mean the same thing. A continious block of bytes of a certain length. A string is more usually used when dealing with ASCII characters, so sequence seems a better choice. "Bitrack": A temporary byte (or word) into which a number of bits are shifted into, until an entire byte (or word) has been filled. After which it is written to disk/memory and the Bitrack is re-initialised, so the process can start again. Bit-by-Bit I will first decribe a 1 bit "Putbits" routine in which the "newbit" is only 1 in length and has the value 0 or 1. Initialise: BitRack = 0 BitPosition = 0 Putbit: BitRack = BitRack OR (newbit << BitPosition) BitPosition + 1 If BitPosition = 8 then OutputByte(BitRack) BitRack = 0 BitPosition = 0 This would output bits into the bitstream in this order: --------------------> bitstream hgfedcba ponmlkji xwvutsrq .... etc ... byte 0 byte 1 byte 2 where the very first bit is written is to bit #0 of the first byte. Countdown A much easier way to understand (and code) is to start at bit 7 and countdown until we go past bit 0 and then output the byte before repeating the process again from bit 7. Initialise: BitPostion = 7 BitRack = 0 PutBit: BitRack = BitRack OR (newbit << BitPosition) BitPosition - 1 IF BitPosition < 0 then OutputByte(BitRack) BitRack = 0 BitPosition = 7 Now bits are written in this order: --------------------> bitstream abcdefgh ijklmnop qrstuvwx .... etc ... byte 0 byte 1 byte 2 The Marker Bit We can actually get rid of the BitPosition variable and combine it with the Bitrack. I will use some 80x86 for this. Rather than taking the input bit, shifting it to the correct bit-location and combining it with our bitstream byte, we can simplify this by using a rotating byte (word, or dword etc.) instead. Initialise: mov [BitRack], 01h ; Bit 0 = marker bit PutBit: mov al, ; AL = new bit shr al, 1 ; shift the bit into CF rcl [BitRack], 1 ; rotate CF into Bitrack jnc short notfull mov al, [BitRack] call OutputByte ; OutputByte(BitRack) mov [BitRack], 01h ; re-init the BitRack marker notfull: This time the BitRack itself acts as the bit counter. It is initialised with 01h (binary 00000001). As each new bit is rotated into the BitRack the marker bit will be also be rotated until finally the marker bit falls into the CF. At this point an entire 8 bits have been stored in BitRack in the order abcdefgh, we just have to output the BitRack and re-initialise it with the marker value. As you can see it is far more efficent in terms of speed and register usage than using the "BitPosition" method. This time bits are written into byte starting from bit 7: --------------------> bitstream abcdefgh ijklmnop qrstuvwx .... etc ... byte 0 byte 1 byte 2 PLEASE NOTE: In order to output the final BitRack, you must do this: FlushMarker: mov [BitRack], 01h ; Is the BitRack empty ? jz short rackempty flushloop:: shl [BitRack], 1 ; shift in 0's until jnc short flushloop ; the marker-bit falls out mov al, [BitRack] ; then write the final byte call OutputByte ; OutputByte(BitRack) rackempty: Without the above above code the final byte of the bitstream would be missing. The SHL instruction is VERY important, it aligns the remaining bits in the BitRack to begin at bit position 7, without it our marker-bit would still be in the BitRack somewhere, with the last 0 to 6 bits after it. (E.g. 00001xxx where 'xxx' are the last 3 bits written into the BitRack with our markerbit before it). Multiple-Bits Using the "Putbit" routine is fine for single bit tokens, but for multiple bits a loop would be needed, and so it would be much, much slower. A far better method is to write multiple bits into the bitrack at once as this avoids having to loop for each and every bit. token - This is the value we wish to write into the bitstream. tokenbits - The number of bits to write. (max. 16 bits) Initialise: BitPos = 0 BitRack = 0 PutBits: BitRack = BitRack OR ( token << BitPos ) BitPos + tokenbits While BitPos >= 8 OutputByte(BitRack) Bitrack = token >> (BitPos AND 7) BitPos - 8 Wend Now bits are written into byte starting from bit 0: --------------------> bitstream hgfedcba ponmlkji xwvutsrq .... etc ... byte 0 byte 1 byte 2 Now in 80x86: dx = token ch = tokenbits PutBits: mov cl, [BitPos] mov al, dl shl al, cl or [BitRack], al ; Bitrack OR (token << BitPos) add cl,ch ; BitPos + tokenbits whilelp: cmp cl, 8 ; is BitRack full ? jb short wendlp ; no... mov al,[BitRack] call OutputByte ; OutputByte(BitRack) push cx mov ax, dx and cl, 7 shr ax, cl mov [BitRack], al ; Bitrack = token >> (BitPos AND 7) pop cx sub cl, 8 ; BitPos - 8 jmp whilelp wendlp: mov [BitPos], cl Could do Better Now the above looks like a nice, neat loop, but in fact it's a very general solution. The loop itself is only entered when the BitRack has been filled with 8 bits. In which case the BitRack is output and the remaining bits in the token are placed in the BitRack and the loop repeats, until the entire token has been shifted into the BitRack. But when the BitRack is full (and the while loop is entered) the remaining bits from the token are written at bit position 0 in the Bitrack. Here is a much better solution: dx = token ch = tokenbits (0 to 16) PutBits: mov cl, [BitPos] mov ax, dx shl ax, cl or al, [BitRack] ; AX = BitRack OR (token << BitPos) add cl, ch ; BitPos + tokenbits cmp cl, 8 jb short notfull ; BitRack is NOT full? call OutputByte ; OutputByte(AL) mov al, ah ; AL = BitRack (bits 15..8) sub cl, 8 cmp cl, 8 jb short notfull ; BitRack is NOT full? call OutputByte ; OutputByte(AL) sub cl, 8 mov ah, dh mov al, 0 rol ax, cl ; AL = overflow byte (bits 23..16) notfull: mov [BitRack], al mov [BitPos], cl To understand this, look at this terrible ASCII. Here is the worst possible case, BitPos = 7 tokenbits = 16 bit 15 0 aaaaaaaa bbbbbbbb <-- the token 7 0 _rrrrrrr <-- the BitRack After shifting the BitRack looks like this: 23 16 8 0 _aaaaaaa abbbbbbb b_______ Combined with the BitRack: 23 16 8 0 _aaaaaaa abbbbbbb brrrrrrr Improvements Well, I have only used 8086 instructions and 16-bit registers to make this as compatible as possible. You will no doubt be using 32-bit registers and addressing-modes which will certainly help speed things up a great deal. The SHLD and SHRD instructions might be useful. The BitRack is probably best extended to 32-bits to reduce memory accesses. Using Protected-Mode will again help to speed things and of course it allows huge buffers to be easily maintained without having to juggle segment registers. "RAW" bytes I almost forgot this tip. Roughly about 50% (hopefully less) of the time you will be forced to output "RAW" 8 bit bytes because you are unable to compress data. For example, the first time you encounter a byte which hasn't been seen before you need some way to encode it. This is normally done with a single-bit prefix '0' or '1' so when decompression takes place we can decode "RAW" bytes quickly without too much overhead. Now, rather than encode every 8-bit byte using our "putbits" routine which has to perform shifting, byte boundary checks and so on... We can output the byte directly into the bitstream. This means "RAW" bytes are encoded and decoded without the need to shift or deal with 8 bits across byte boundaries in the bitstream. To use this method you must use some kind of output buffer which allows random access because you will need to go back and insert/overwrite a previously written byte later on. You must also reserve space in the bitstream for each BitRack image BEFORE you output any "RAW" bytes. +------>--------->------+ +>+ | | | | xxxxxxxx xxxxxxxx xxxxxxxx bitrack bitrack bitrack image 1 image 2 image 3 In the above diagram we have 3 RAW bytes and 24 BitRack bits (denoted by the 'x' characters). Because we are writing BitRack images and RAW bytes at different speeds we must be VERY careful how we output bytes. We must reserve space for the BitRack image byte before writing any RAW bytes. byte OutputBuffer[1000] ; an example output buffer OutAddr = 0 Initialise: RackAddr = OutAddr ; reserve space for the OutAddr + 1 ; 1st BitRack image BitRack = 0 BitPos = 0 We have 2 output routines now, Put1Bit and PutRAW: Put1Bit: BitRack = BitRack OR (newbit << BitPos) BitPos + 1 if BitPos = 8 then OutputBuffer[RackAddr] = BitRack RackAddr = OutAddr OutAddr + 1 BitRack = 0 BitPos = 0 And the PutRAW routine: PutRAW: OutputBuffer[OutAddr] = rawbyte OutAddr + 1 The reason why we reserve a byte for the BitRack BEFORE it is full, is because when we come to depack the bitstream we need to read the BitRack image first, to know what to do with the following bitstream bytes. Closing Words Well, that's all for putbits. Don't worry if you can't understand it all in one go, just start off with simple code and work you way up slowly, making sure you understand things fully before moving onto something more difficult. Enjoy. Regards TAD #:o)