$S@g·,                               g@S$$$
                 $$$$$$+-ù ú. `@g,          .ú-  -+g@ ½üýø^_
          +-Ä- - $$$$$$-- -Ä-$b,`$-- --- y '-- Ä- -$$ g@S$$$-Ä- ---+
           «¬¬«¬ $$$$$$ ¬¬«« $$l $ ¬¬«¬« $ «¬«¬¬«¬ $$ $$$$$$ «¬«¬¬«
           ¬¬«¬« $$$$$$ ¬«¬¬ $$$ l «¬«¬« $ ¬«¬¬«¬« $$ $(dt)$ ¬«¬¬«¬
          +-- Ä -l$$ü$$-Ä-- -$$$ $Ä- --- $`+,._  _,$l $$$$$$-- Ä- -+
                 $$$y$$      Sü',$       ü         $$,øý½SQ$
                 ½üýø^~      ,g@$$     ,g@S@,      $$$l
                             `ýÓ*S,_  _`*S$$b._   ,SSü'

         This article comes from Hugi issue 15 ú Released in May 1999
         ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ


ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ
OPTIMIZING "PUTBITS" PART II                                               TAD
ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ



                          OPTIMIZING "PUTBITS" PART II

                                 ÄÄÄ( TAD )ÄÄÄ



 Introduction

 After  reading Hugi #14, and especially  Dario Phong's "Optimizing Putbits", I
 thought  there  was quite a bit more which  needed to be said on this subject.
 And  so here is my contribution, which I hope will inspire others to take this
 subject still further.

 As  Dario Phong correctly said, the  "putbits"  function is very heavily  used
 when  writing a compressor (or "packer" as many call it), so it seems sensible
 to try and optimize it to reduce the amount of time needed to compress a file.

 But,  this  should be done after first  optimizing the rest of the compressor,
 because  let's  face  it, searching for the  longest  string  match in a large
 block,  or  the most optimum packing sequence can  take  a 100 or a 1000 times
 more  time  than the "Putbits" routine  which outputs the finished compression
 token.

 If anything the "Getbits" routine (used in the depacker/decompressor) would be
 a far better place to spend your time optimizing, after, of course, the actual
 compression algorithm itself.

 In  this  article I will only focus  on optimizing the Putbits" routine, and I
 leave the rest of your super-compression engine for you to optimize.

 There is no real brain-blasting methods or fantastic new algorithms here, most
 of  the techniques have been know about  for years. The marker bit for example
 was  used in the ZX Spectrum tape load/save routines and probably before years
 before 1982.


 Terminology

 "Bitstream": A continous block of bytes treated as one long series of bits.

 "Token": This  represents a command or a  number of data bytes.  The token can
 range from a single bit upto 16 or more in length.

 "Putbits": This takes a token and writes  all of its bits into  the Bitstream.
 It has to deal with crossing byte boundaries and writing bit(s) to the correct
 bit positions in each Bitstream byte.

 "Sequence": I  use  the term sequence and  string  to  mean the same thing.  A
 continious  block of bytes of a certain  length. A string is more usually used
 when dealing with ASCII characters, so sequence seems a better choice.

 "Bitrack": A temporary byte (or word) into which a number of bits  are shifted
 into,  until  an  entire  byte (or word)  has  been  filled. After which it is
 written  to disk/memory and the Bitrack  is re-initialised, so the process can
 start again.


 Bit-by-Bit

 I will first decribe a 1 bit "Putbits" routine in which the "newbit" is only 1
 in length and has the value 0 or 1.

        Initialise:

                BitRack = 0
                BitPosition = 0

        Putbit:

                BitRack = BitRack OR (newbit << BitPosition)
                BitPosition + 1

                If BitPosition = 8 then OutputByte(BitRack)
                                        BitRack = 0
                                        BitPosition = 0

 This would output bits into the bitstream in this order:

                --------------------> bitstream

                hgfedcba ponmlkji  xwvutsrq .... etc ...

                 byte 0    byte 1    byte 2

 where the very first bit is written is to bit #0 of the first byte.


 Countdown

 A  much easier way to understand (and code) is to start at bit 7 and countdown
 until  we go past bit 0 and then  output the byte before repeating the process
 again from bit 7.

        Initialise:

                BitPostion = 7
                BitRack = 0

        PutBit:
                BitRack = BitRack OR (newbit << BitPosition)
                BitPosition - 1

                IF BitPosition < 0 then OutputByte(BitRack)
                                        BitRack = 0
                                        BitPosition = 7

 Now bits are written in this order:

                --------------------> bitstream

                abcdefgh ijklmnop  qrstuvwx .... etc ...

                 byte 0    byte 1    byte 2


 The Marker Bit

 We  can  actually get rid of the  BitPosition variable and combine it with the
 Bitrack. I will use some 80x86 for this.

 Rather  than taking the input bit, shifting it to the correct bit-location and
 combining it with our bitstream byte, we can simplify this by using a rotating
 byte (word, or dword etc.) instead.

        Initialise:

                mov     [BitRack], 01h  ; Bit 0 = marker bit

        PutBit:
                mov     al, <newbit>    ; AL = new bit
                shr     al, 1           ; shift the bit into CF

                rcl     [BitRack], 1    ; rotate CF into Bitrack
                jnc     short notfull
                mov     al, [BitRack]
                call    OutputByte      ; OutputByte(BitRack)
                mov     [BitRack], 01h  ; re-init the BitRack marker
        notfull:

 This  time the BitRack itself acts as  the bit counter. It is initialised with
 01h  (binary 00000001). As each new bit is rotated into the BitRack the marker
 bit will be also be rotated until finally the marker bit falls into the CF. At
 this point an entire 8 bits have been stored in BitRack in the order abcdefgh,
 we just have to output the BitRack and re-initialise it with the marker value.

 As  you  can see it is far more  efficent in terms of speed and register usage
 than using the "BitPosition" method.

 This time bits are written into byte starting from bit 7:

                --------------------> bitstream

                abcdefgh ijklmnop  qrstuvwx .... etc ...

                 byte 0    byte 1    byte 2

 PLEASE NOTE: In order to output the final BitRack, you must do this:

        FlushMarker:
                mov     [BitRack], 01h  ; Is the BitRack empty ?
                jz      short rackempty
        flushloop::
                shl     [BitRack], 1    ; shift in 0's until
                jnc     short flushloop ; the marker-bit falls out
                mov     al, [BitRack]   ; then write the final byte
                call    OutputByte      ; OutputByte(BitRack)
        rackempty:

 Without the above above code the final byte of the bitstream would be missing.
 The  SHL  instruction is VERY important, it  aligns  the remaining bits in the
 BitRack  to begin at bit position 7,  without it our marker-bit would still be
 in  the BitRack somewhere, with the last 0  to 6 bits after it. (E.g. 00001xxx
 where  'xxx'  are the last 3 bits  written into the BitRack with our markerbit
 before it).


 Multiple-Bits

 Using  the  "Putbit" routine is fine for  single  bit tokens, but for multiple
 bits a loop would be needed, and so it would be much, much slower.

 A far better method is to write multiple bits into the bitrack at once as this
 avoids having to loop for each and every bit.

                token           - This is the value we wish to
                                  write into the bitstream.

                tokenbits       - The number of bits to write.
                                  (max. 16 bits)

        Initialise:

                BitPos = 0
                BitRack = 0

        PutBits:
                BitRack = BitRack OR ( token << BitPos )
                BitPos + tokenbits

                While BitPos >= 8
                                OutputByte(BitRack)
                                Bitrack = token >> (BitPos AND 7)
                                BitPos - 8
                Wend

 Now bits are written into byte starting from bit 0:

                --------------------> bitstream

                hgfedcba ponmlkji  xwvutsrq .... etc ...

                 byte 0    byte 1    byte 2

 Now in 80x86:

                dx = token
                ch = tokenbits

        PutBits:
                mov     cl, [BitPos]
                mov     al, dl
                shl     al, cl
                or      [BitRack], al   ; Bitrack OR (token << BitPos)
                add     cl,ch           ; BitPos + tokenbits

        whilelp:
                cmp     cl, 8           ; is BitRack full ?
                jb      short wendlp    ; no...

                mov     al,[BitRack]
                call    OutputByte      ; OutputByte(BitRack)

                push    cx
                mov     ax, dx
                and     cl, 7
                shr     ax, cl
                mov     [BitRack], al   ; Bitrack = token >> (BitPos AND 7)
                pop     cx
                sub     cl, 8           ; BitPos - 8
                jmp     whilelp
        wendlp:

                mov     [BitPos], cl


 Could do Better

 Now  the  above looks like a nice, neat  loop, but in fact it's a very general
 solution.  The  loop itself is only entered  when  the BitRack has been filled
 with 8 bits. In which case the BitRack is output and the remaining bits in the
 token  are placed in the BitRack and  the loop repeats, until the entire token
 has been shifted into the BitRack.

 But  when  the BitRack is full (and  the  while loop is entered) the remaining
 bits from the token are written at bit position 0 in the Bitrack.

 Here is a much better solution:

                dx = token
                ch = tokenbits          (0 to 16)

        PutBits:
                mov     cl, [BitPos]
                mov     ax, dx
                shl     ax, cl
                or      al, [BitRack]   ; AX = BitRack OR (token << BitPos)
                add     cl, ch          ; BitPos + tokenbits

                cmp     cl, 8
                jb      short notfull   ; BitRack is NOT full?
                call    OutputByte      ; OutputByte(AL)

                mov     al, ah          ; AL = BitRack (bits 15..8)
                sub     cl, 8
                cmp     cl, 8
                jb      short notfull   ; BitRack is NOT full?
                call    OutputByte      ; OutputByte(AL)

                sub     cl, 8
                mov     ah, dh
                mov     al, 0
                rol     ax, cl          ; AL = overflow byte (bits 23..16)
        notfull:
                mov     [BitRack], al
                mov     [BitPos], cl

 To understand this,  look at this terrible ASCII.  Here is the  worst possible
 case,

                BitPos = 7
                tokenbits = 16


           bit 15               0
                aaaaaaaa bbbbbbbb       <-- the token

                7      0
                _rrrrrrr                <-- the BitRack

 After shifting the BitRack looks like this:

               23      16       8        0
                _aaaaaaa abbbbbbb b_______

 Combined with the BitRack:

               23      16       8        0
                _aaaaaaa abbbbbbb brrrrrrr


 Improvements

 Well,  I have only used 8086 instructions and 16-bit registers to make this as
 compatible  as  possible.  You  will no  doubt  be  using 32-bit registers and
 addressing-modes which will certainly help speed things up a great deal.

 The SHLD and SHRD instructions might be useful.

 The  BitRack  is probably best extended  to 32-bits to reduce memory accesses.
 Using  Protected-Mode will again help to speed  things and of course it allows
 huge  buffers  to  be  easily  maintained  without  having  to  juggle segment
 registers.


 "RAW" bytes

 I almost forgot this tip.

 Roughly  about  50% (hopefully less) of the  time you will be forced to output
 "RAW"  8 bit bytes because you are  unable to compress data.  For example, the
 first  time  you encounter a byte which  hasn't been seen before you need some
 way to encode it. This is normally done with a single-bit prefix '0' or '1' so
 when  decompression takes place we can  decode "RAW" bytes quickly without too
 much overhead.

 Now, rather than encode every 8-bit byte using our "putbits" routine which has
 to  perform shifting, byte boundary checks and so on... We can output the byte
 directly into the bitstream.

 This  means  "RAW" bytes are encoded and  decoded without the need to shift or
 deal with 8 bits across byte boundaries in the bitstream.

 To use this method you must use some kind of output buffer which allows random
 access  because  you  will need to  go  back and insert/overwrite a previously
 written byte later on.

 You must also reserve space in the bitstream for each BitRack image BEFORE you
 output any "RAW" bytes.

               +------>--------->------+      +>+
               |                       |      | |
        xxxxxxxx <RAW byte> <RAW byte> xxxxxxxx xxxxxxxx <RAW byte>

         bitrack                       bitrack  bitrack
         image 1                       image 2  image 3

 In  the above diagram we have 3 RAW  bytes and 24 BitRack bits (denoted by the
 'x' characters).

 Because  we  are writing BitRack images and  RAW  bytes at different speeds we
 must  be  VERY  careful  how we output  bytes.  We  must reserve space for the
 BitRack image byte before writing any RAW bytes.

        byte OutputBuffer[1000]         ; an example output buffer

        OutAddr = 0


        Initialise:
                RackAddr = OutAddr      ; reserve space for the
                OutAddr + 1             ; 1st BitRack image

                BitRack = 0
                BitPos = 0

 We have 2 output routines now, Put1Bit and PutRAW:

        Put1Bit:
                BitRack = BitRack OR (newbit << BitPos)
                BitPos + 1
                if BitPos = 8 then OutputBuffer[RackAddr] = BitRack

                                   RackAddr = OutAddr
                                   OutAddr + 1

                                   BitRack = 0
                                   BitPos = 0

 And the PutRAW routine:

        PutRAW:
                OutputBuffer[OutAddr] = rawbyte
                OutAddr + 1

 The reason why we reserve a byte for the BitRack BEFORE it is full, is because
 when  we come to depack the bitstream we need to read the BitRack image first,
 to know what to do with the following bitstream bytes.


 Closing Words

 Well,  that's  all for putbits. Don't worry  if you can't understand it all in
 one  go,  just start off with simple code  and  work you way up slowly, making
 sure you understand things fully before moving onto something more difficult.

 Enjoy.

                                Regards
                                        TAD #:o)