$S@g·,                               g@S$$$
                 $$$$$$+-ù ú. `@g,          .ú-  -+g@ ½üýø^_
          +-Ä- - $$$$$$-- -Ä-$b,`$-- --- y '-- Ä- -$$ g@S$$$-Ä- ---+
           «¬¬«¬ $$$$$$ ¬¬«« $$l $ ¬¬«¬« $ «¬«¬¬«¬ $$ $$$$$$ «¬«¬¬«
           ¬¬«¬« $$$$$$ ¬«¬¬ $$$ l «¬«¬« $ ¬«¬¬«¬« $$ $(dt)$ ¬«¬¬«¬
          +-- Ä -l$$ü$$-Ä-- -$$$ $Ä- --- $`+,._  _,$l $$$$$$-- Ä- -+
                 $$$y$$      Sü',$       ü         $$,øý½SQ$
                 ½üýø^~      ,g@$$     ,g@S@,      $$$l
                             `ýÓ*S,_  _`*S$$b._   ,SSü'

         This article comes from Hugi issue 15 ú Released in May 1999
         ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ


ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ
A HUFFMAN, BUT NO TREE?                                                    TAD
ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ



                            A HUFFMAN, BUT NO TREE?

                                 ÄÄÄ( TAD )ÄÄÄ



 Introduction

 You  have probably heard this name a thousand times, and may have even coded a
 few  programs which use it. The algorithm  for creating a Huffman tree is well
 known, and it's so elegant that it makes the rest of your code look messy.

 But  (don't  you just hate that word?),  but, after creating the tree we still
 need  to gather the unique prefix codes to each leaf on the tree. This usually
 means a recursive routine which can eat up a lot of stack space.

 So  in  this article I will try to  explain the Huffman algorithm and the part
 which  is  all  too often absent from  other  documents - gathering the prefix
 codes from the tree.

 To make this article a little more interesting I will not use a tree!!


 Terminology

 "Bitstream": A continuous block of bytes treated as one long series of bits.

 "Token": This  represents a command or a  number of data bytes.  The token can
 range from a single bit upto 16 or more in length.

 "Putbits": This takes a token and writes  all of its bits into  the Bitstream.
 It has to deal with crossing byte boundaries and writing bit(s) to the correct
 bit positions in each Bitstream byte.


 What's a Huffman tree?

 Well,  Huffman  isn't  really a tree, it's  more  of  a coding technique which
 encodes  "symbols" (bytes, or characters for example)  using a variable number
 of bits to form an unique prefix code. The number of bits which are used for a
 prefix  is based upon the frequency that the symbol occurs in the data stream.
 If  a  symbol  occurs very often, then it  is  given  the least number of bits
 possible,  if a symbol occurs very rarely then it is given a greater number of
 bits.

 First  of all we need to know the "frequency" (how often each symbol occurs in
 the data stream), this is an easy task.

 Before  we  can count the frequency of each  symbol we first need to reset all
 the  counters  to  zero,  I  will use the  same  loop  to  zero out some other
 variables which will be explained and used later on.

        .DATA
 freqs        dw    256 dup (?)
 prefixes     dw    256 dup (?)
 prefbits     dw    256 dup (?)
 links        dw    256 dup (?)

        .CODE

              sub   ax, ax
              sub   bx, bx            ; n = 0 to 255
 clrcounts:
              mov   freqs[bx], ax     ; freqs[n] = 0
              mov   prefixes[bx], ax  ; prefixes[n] = 0
              mov   prefbitx[bx], ax  ; prefbits[n] = 0
              mov   links[bx], ax     ; links[n] = 0
              add   bx, 2
              cmp   bx, (2*256)
              jnz   short clrcounts


 Now let's scan the entire data stream and count each byte's frequency.

              [DS:SI] --> data stream
              CX = Number of Bytes in data stream

              mov   ah, 0
 freqloop:
              lodsb                   ; n = InputByte()
              mov   bx, ax
              add   bx, bx
              inc   freqs[bx]         ; freqs[n] + 1
              loop  freqloop

 So now the word array 'freqs' has the frequency of each byte.


 Building The Tree

 And now the Huffman algorithm.

 a) Collect the frequencies of each symbol in our data stream (done).
 b) Create nodes for each symbol (done).


 1) Find the two nodes with the two lowest frequencies.
    I'll call them "alpha" and "beta".

 2) For every child node of "alpha" add a '0' bit to the prefix.
 3) For every child node of "beta" add a '1' bit to the prefix.

 4) Join  alpha and beta into one parent  node, and give it the total frequency
    of alpha + beta.

 5) Remove the alpha and beta nodes, then add the parent node.

 6) Repeat this process from 1) until only one node is left.


 And now the code.

 The  following  code  has been taken from  some  old  source code and slightly
 modified to make it more readible, it should work, but don't sue me. X8-]

 There  is  only one loop. It is quite big  so I have broken it down into small
 stages.

 Stage 1) Find the node with the lowest freq.

 alpha        equ <bx>
 beta         equ <di>

 buildtree:
              sub si, si             ; for i = 0 to 255
              mov alpha, si
              mov dx, 0FFFFh         ; lowest freq = MAX number
 find1:
              mov ax, freqs[si]
              test ax, ax
              jz short skip1         ; ignore if freq[i] = 0
              cmp ax, dx
              jae short skip1        ; greater than lowest so far?
              mov dx, ax             ; keep lowest freq
              mov alpha, si          ; alpha = i
 skip1:
              add si, 2
              cmp si, (2*256)
              jnz short find1        ; check all nodes

 Stage 2) Find the node with the 2nd lowest freq.

              sub si, si             ; for i = 0 to 255
              mov cx, -1             ; lowest count = -1
 find2:
              mov ax, freqs[si]
              test ax, ax
              jz short skip2         ; ignore if freq[i] = 0
              cmp ax, cx
              jae short skip2        ; greater than lowest so far?
              cmp si, alpha
              jz short skip2         ; ignore if alpha = i
              mov cx, ax             ; keep lowest freq
              mov beta, si           ; beta = i
 skip2:
              add si, 2
              cmp si, (2*256)
              jnz short find2        ; check all nodes

 Stage 3) Check if we only have 1 lowest node.

              inc cx                 ; lowest beta freq = -1?
              jz short done


 Stage 4) Add a '0' to all of alpha's nodes.

              mov ax, alpha
 add_0bit:
              mov si, ax
              shr prefix[si], 1      ; shift in a 0 bit
              inc prefbits[si]
              mov ax, links[si]
              test ax, ax
              jnz short add_0bit     ; repeat for all of alpha's nodes

 Stage 5) Combine alpha + beta nodes (join beta to end of alpha link-list).

              mov links[si], beta

 Stage 6) Add a '1' to all of beta's nodes.

              mov ax, beta
 add_1bit:
              mov si, ax
              stc
              rcr prefix[si], 1       ; shift in a 1
              mov ax, links[si]
              test ax, ax
              jnz short add_1bit      ; repeat for all of beta's nodes

 Stage 7) Combine alpha and beta into one node (alpha).

              mov ax, freqs[beta]
              add freqs[alpha], ax    ; freq[alpha] + freq[beta]

 Stage 8) Remove beta from our node list (just zero its freq).

              mov freq[beta], 0

 And repeat the loop...

              jmp buildtree

 done:

 And  that,  ladies and gentlemen, is the  Huffman  tree built!! As you can see
 there is NO recursion (in fact, NO stack space was used).


 Using our Huffman 'Tree'

 Okay, now after building it, we had better use it.

              [DS:SI] --> data stream
              DX = Number of Bytes in data stream (NOTE: DX )

 freqloop:
              lodsb                   ; n = InputByte()
              mov ah, 0
              add ax, ax
              mov bx, ax
              mov ax, prefix[bx]      ; ax = prefix[i]
              mov cx, prefbits[bx]    ; cx = prefbits[i]
              call PutBits            ; PutBits(AX, CX)
              dec dx
              jnz short freqloop      ; repeat for all bytes...

 Of course you need to write your own "PutBits" routine (hint: look in a recent
 issue of Hugi).


 Improvements

 Okay,  the weakest part is the searching for the two nodes with the two lowest
 frequencies.  I  actually  wrote  the above  code  to  test out this tree-less
 Huffman idea, and as far as I can tell... it works.

 There  is  also some wasted bytes in the  'prefbits' array. It is defined as a
 WORD  array, but in fact a BYTE array  would work and save 256 bytes. But then
 again you would need to convert the current word index into a byte index so it
 probably  isn't  worth  it, unless you are  using  p-mode and the groovy S-I-B
 addressing modes.


 Closing Words

 If  the  code  doesn't  work, then... sorry, but  I  want  this  to be used as
 inspiration for YOU to write your own code, instead of a cut-n-paste.

 As  you  will no doubt want to use this  in p-mode then a total re-write is in
 order anyway, you can then speed up the lame search loops.

 If  anyone  has any suggestions to speed up  this  method then why not write a
 short article for Hugi? I am sure this would make other coders happy too.

 I  haven't  seen that many Huffman algorithms,  well actually about two. And I
 have  not seen one which uses this nice linked-list approach to build the tree
 and gather the prefix codes.

 Enjoy.

                                Regards
                                        TAD #:o)