$S@g·, g@S$$$ $$$$$$+-ł ś. `@g, .ś- -+g@ ½üżų^_ +-Ä- - $$$$$$-- -Ä-$b,`$-- --- y '-- Ä- -$$ g@S$$$-Ä- ---+ «¬¬«¬ $$$$$$ ¬¬«« $$l $ ¬¬«¬« $ «¬«¬¬«¬ $$ $$$$$$ «¬«¬¬« ¬¬«¬« $$$$$$ ¬«¬¬ $$$ l «¬«¬« $ ¬«¬¬«¬« $$ $(dt)$ ¬«¬¬«¬ +-- Ä -l$$ü$$-Ä-- -$$$ $Ä- --- $`+,._ _,$l $$$$$$-- Ä- -+ $$$y$$ Sü',$ ü $$,ųż½SQ$ ½üżų^~ ,g@$$ ,g@S@, $$$l `żÓ*S,_ _`*S$$b._ ,SSü' This article comes from Hugi issue 14 ś Released in February 1999 ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ OPTIMIZING THE PUTBITS DARIO PHONG ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ OPTIMIZING THE PUTBITS ÄÄÄ( Dario Phong / PhyMosys )ÄÄÄ Table of contents ž Introduction ž The 'proc' ž First optimization ž Second optimization ž The same in pmode ž Get bits ž Limiting the buffer ž Improving this doc ž Closing words ž Contacting the author Introduction This article is aimed to a coder who has already implemented a compressor, knows how ASM works and how to optimize for Pentium. If you have never done any of those things, then better do them, and when it's done, read this. Everybody who has even learned ASSEMBLER has also 'studied' the putpixel. A very optimized routine. Why? That's for three reasons: first is the way for understanding how the VRAM works, second it's used always in video routines, and third, everybody showed his putpixel. Now think about the putbits. It's almost the most used routine from compressors. A lot of clocks cycles are spent in your putbits. Do you really think you can optimize you compressor/parser/... enough? I hope so, but a fast putbits is always needed, so here we talk about it. I know the best for optimizing is pmode, but I still don't work on it (soon I'll do so ;-), so we will use real mode, and we will optimize for a plain Pentium (586), so specific optimizations for other procesors will not be put here, unless someone really feels that's necessary. And I don't think so. E-) This doc isn't intended to be a plain source-code, so you better understand how everything works. The 'proc' The routine is simple: Every time it's called it needs a byte of source, the bits to put into the output buffer, and the number of bits to write. Then the routine puts the bits, and if the byte is full (= we wrote 8 bits) we update the output buffer. For putting the bit in the right position we shift it and then 'or' the output byte. Here it is the putbits that I use, at least till now: ;-) ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł ; Putbits ; Input: cx-> number of bits to write bl->bits to be written ; Output: nothing ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł put_bits: test cx,cx ;I used that. Put it ONLY if you know jz short @@ptb_fin ;that maybe you will try to write 0 bits ;but better avoid writing 0 bits ;-D push ax es di ;Save only the registers that you know ;are currently used. mov es,buf_com ;ds is faster than any other segment ;register. But it's needed for some data @@ptb_more_bits_: ;in my 'main' loop OE-) push cx ;We need cx for the shl/cl xor ax,ax ;clear it mov bh,ptb_byte ;get the byte to finally write mov al,bl ;get the input byte (where the bits are) shr al,1 ;put the first bit in the carry flag adc ah,0 ;Then to ah. I know there's salc ;but I think it's an undocumented inst. ;from Intel's Pentium, so... mov bl,al ;save the shifted byte mov cl,ptb_totalbits ;how many bits we wrote shl ah,cl ;Put the bit to its position or bh,ah ;and then update the output byte mov ptb_byte,bh ;save the output byte ;now let's see if we must update the output buffer inc ptb_totalbits ;we wrote one bit cmp ptb_totalbits,8 ;did we wrote the full byte? jb short @@ptb_no_byte ;90h mov ptb_totalbits,0 ;start again to write the bits mov di,ptr_com ;the pointer to the input buffer mov es:[di],bh ;the output byte inc di ;next position mov ptr_com,di ;save it for next time inc bytes_com ;the number of written bits mov ptb_byte,0 ;erase the output byte @@ptb_no_byte: pop cx ;the number of bits to write loop short @@ptb_more_bits_ ;are we done? pop di es ax ;restore the used registers @@ptb_fin: ret Problems? Here? A lot. There's no pairing, never. The shl/cl takes 4 clocks, push&pop isn't very funny, of course everything can be worse, the data may be misaligned, or in the same dword-boundary... Let's count the clocks: 6+(16*number of bits)+6(if we update out. buffer). This is approximately the clocks for that piece of code, of course assuming that all the data is in the level 1 cache (that's like assuming Bill Gates is documenting every secret of Windows ;-). To write one bit and update the buffer, this will take 28 clocks. I tried it with a program that reads the counters, and it took 65 clocks! Writing 8 bits, so writing the byte to the output buffer too: 252 clocks. First optimization We will try to do more pairing, and loading, and saving the data only when it starts and when ends: ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł ; Putbits ; Input: dx-> number of bits to write al->bits to be written ; Output: nothing ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł put_bits: mov bh,ptb_byte mov cl,ptb_totalbits mov es,buf_com ;yes, still using es }E-) mov di,ptr_com push cx di ;this pairs @@ptb_more_bits_: ;the 'main' loop xor ah,ah ;1 u shr al,1 ;2 u adc ah,0 ;3 u shl ah,cl ;4-7 u - the bits written or bh,ah ;8 u inc cl ;1 v cmp cl,8 ;9 u jb short @@ptb_no_byte ;9 v xor cx,cx ;1 u - the counter of written bits mov es:[di],bh ;2 u inc di ;2 v - next byte mov ptr_com,di ;3 u - save only if incremented inc bytes_com ;3 v xor bh,bh ;4 u - output byte @@ptb_no_byte: dec dx ;10 u - the counter jnz short @@ptb_more_bits_ ;10 v mov ptb_byte,bh mov ptb_totalbits,cl pop di cx ;this pairs too ret This version has been debugged at least twice or more. It works. And it's better than the first. 7+(10*number of bits)+4(if we update out. buffer) It's ok, don't you think? We improved the pairing, and sure that writing for example 5 bits is far better than in the first attempt. The real timing for this is 51 clocks for one bit. For 8 bits 145! We have reduced the number of clock cycles from 252 to 145. Quite good. E-) The bit organization is the following: 0000 0000<- first bit ||--- second bit |---- third bit ... Second optimization One of the fucking things in this loop is the 4-clock shl/cl. How can we avoid it? Easy, every time we put a new bit we shift the byte to the left. This implies two things: we only spend 1 clock with the shift, BUT the order of the bits in the byte is inverted, so: first bit->0000 0000 second bit--|| |- last bit third bit----| If you choose this putbits, then you MUST change your getbits, but not a lot. Your first getbits shifted to the left. This one will shift to the right. ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł ; Putbits ; Input: cl-> number of bits to write al->bits to be written ; Output: nothing ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł put_bits: push di mov es,buf_com ;some initalizations mov di,ptr_com mov ah,ptb_byte mov ch,ptb_totalbits @@ptb_more_bits_: ;the 'main' loop shr al,1 ;u 1 - put bit in the carry flag adc ah,0 ;u 2 - put it in output byte inc ch ;v 2 - the count of written bits cmp ch,8 ;3 u jb short @@ptb_no_byte ;3 v xor ch,ch ;1 u - the counter of written bits mov es:[di],ah ;2 u inc di ;2 v mov ptr_com,di ;3 u inc bytes_com ;3 v xor ah,ah ;4 u - clear output byte @@ptb_no_byte: shl ah,1 ;u 4 - put the bit in its position dec cl ;5 u - the counter jnz short @@ptb_more_bits_ ;5 v mov ptb_byte,ah mov ptb_totalbits,ch pop di ret There are some things which have to be mentioned: Why is the 'shl ah,1' after the comparison and not just after the 'inc ch'? Easy, when you have written 8 bits, you shift again to the left, then bit #7 will go out. In that way they are only shifted if the byte isn't full. Once the updating of the buffer 'shl ah,1' works, but don't worry, the content of ah is 0, so there's no problem with it. As you see, the body of the loop is smaller and all the instructions are simple and only one clock. I will not care more for 'theory' timing, so here it's the real timing: One bit: 41 - Eight bits: 110 It is better than the previous version, and it's more than 50% faster than the first. Hey, I'm proud of it. OE-) I must say that the first was very poorly coded, so it wasn't a very difficult job. E-) The same in pmode Well, since I'm now learning pmode I decided to do the same version but in pmode. This is not really very different. Anyway here it goes: ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł ; Putbits. Pmode. ; Input: cl-> number of bits to write al->bits to be written ; Output: nothing ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com ;ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł÷“÷ł put_bits: mov edi,ptr_com ;I like pmode E-) mov ah,ptb_byte mov ch,ptb_totalbits push edi @@ptb_more_bits_: ;the 'main' loop shr al,1 ;u 1 - put bit in the carry flag adc ah,0 ;u 2 - put it in output byte inc ch ;v 2 - the counter of written bits cmp ch,8 ;3 u jb short @@ptb_no_byte ;3 v xor ch,ch ;1 u - the counter of written bits mov [edi],ah ;2 v inc edi ;3 u inc bytes_com ;3 v - I reduced 2 clocks putting mov ptr_com,edi ;3 u - <- this there xor ah,ah ;4 u - clear output byte @@ptb_no_byte: shl ah,1 ;u 4 - put the bit in his position dec cl ;5 u - the counter jnz short @@ptb_more_bits_ ;5 v mov ptb_byte,ah mov ptb_totalbits,ch pop edi ret This is, at least till someone publishes a better one, the best put_bits out there. Putting one bit: 39 clocks. Putting 8: 105 clocks. Get bits This document isn't a class of "How to cut'n'paste", this is a "Learn and implement" class, or at least I wanted it to be in that way, but optimization requires examples, so the source code is here. But there's still the get bits. Once you've learned all this stuff, write your own. If you get good results, show us your implementation. Did you really think you would get no homework? X-D Remember that if you use the last version for getbits you must use shr, and if you use one of the others, shl. Or something like that. OE-) Limiting the buffer Today I was planning the memory needed for my new compressor (really a lot E-). And as memory is limited, you have to choose well how you will distribute it. Let's think. A lot of memory for the input file. More memory for the data structures needed by the compressor... So we now have almost no free mem, and we still need memory for the output buffer. And I came to this solution: The output buffer will be 5k long or something like that, the put_bits will have a counter with all the bytes written in this buffer (this is not bytes_com), and when it's 5k (so the buffer is full) we write it to the output file, and start again writing in the 5k buffer. This has not been tested yet, so maybe 5k is not enough and spends a lot of time writing. Improving this doc Of course this doc isn't finished, it still needs YOUR help. If you improve the put_bits, or do a good get_bits, send them to me, and a new version of that doc will be released. You will be properly credited. Let's do a good piece of optimization. Of course other ideas around that will be accepted, making the put_bits get the source bits from a word, etc. If you read something that doesn't make sense or something, let me know. If you have any idea, email me: dario_phong@mixmail.com Closing words My words have arrived at the end, at least till someone else wants to put his work in favour of everybody. Nothing else to say. Well, something else: If I don't get any feedback I will not update this doc. E-( Today I've written one article and finished it, and I still haven't coded my new compressor in pmode. }E-) Contacting the author Contacting me is as simple as sending an email. E-) dario_phong@mixmail.com I check it almost every week, so don't worry if my email is a little bit late. And check my page, and join my email list (I'll send you an email, when I've updated my page with new things to download). WWW: http://www.geocities.com/SiliconValley/Byte/6789/ - DAriO PhONG^PhyMosys, Barcelona 20-29/1/1999