$S@g·, g@S$$$ $$$$$$+-ù ú. `@g, .ú- -+g@ ½üýø^_ +-Ä- - $$$$$$-- -Ä-$b,`$-- --- y '-- Ä- -$$ g@S$$$-Ä- ---+ «¬¬«¬ $$$$$$ ¬¬«« $$l $ ¬¬«¬« $ «¬«¬¬«¬ $$ $$$$$$ «¬«¬¬« ¬¬«¬« $$$$$$ ¬«¬¬ $$$ l «¬«¬« $ ¬«¬¬«¬« $$ $(dt)$ ¬«¬¬«¬ +-- Ä -l$$ü$$-Ä-- -$$$ $Ä- --- $`+,._ _,$l $$$$$$-- Ä- -+ $$$y$$ Sü',$ ü $$,øý½SQ$ ½üýø^~ ,g@$$ ,g@S@, $$$l `ýÓ*S,_ _`*S$$b._ ,SSü' This article comes from Hugi issue 15 ú Released in May 1999 ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ RLE: A BASIC DATA COMPRESSION SCHEME DARIO PHONG ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ RLE: A BASIC DATA COMPRESSION SCHEME ÄÄÄ( DARiO PhONG / PhyMosys )ÄÄÄ Introduction_____ Hi! Welcome to this little article about RLE. This is a compression scheme used in PCX, but we'll not learn the RLE scheme of PCX, 'coz it's byte-based output. I'll present you with an RLE encoder with bit-based output. If you know how it works, then you are probably saying: "RLE is old and almost no one uses it, why the hell should I learn it?" Well, I can't answer you! Unless you use it for (simple) image compression or for the output of BWT, a new compression algorithm that makes a transformation in the text. If you want to learn it, read the related articles in Hugi #13 and #14 or check out some good references available on the net. I recommend you [2], 'coz [1] is a little difficult. How it works_____ RLE scheme is based on the assumption that a byte will appear again, just after it has occurred. In 'aaaab', we have four consecutive 'a's. How RLE compresses is easy: 1- Write an 'a' and say that there are 3 more 'a's. 2- Write a 'b'. So for letting the decoder know when there's a literal byte with a repetition we'll use a flag, a prefix, only one bit. If it's '0' then the next is a literal byte, if it was '1' then we have a byte and the lenght, of how many more bytes with the same value follow. Of course you now need a routine that lets you write bits, but I will not discuss that 'coz I have written an article about this topic. Just read Hugi #14 or go to my homepage and download it. Encoder implementation_____ First read the whole input file to a buffer. Then the loop starts. Read the current byte. Then compare it with the following, till you find one that is different, and keep track of the bytes read. In Asm: mov edi,offset buffer ;where the file is @@rle_inner: xor ebx,ebx ;it keeps track of the repeated bytes mov al,[edi] ;the current byte, note that I read @@rle_inner_comp: ;only ONCE inc ebx ;next byte (at the start, the next) cmp al,[edi+bx] ;compare it with the following bytes jz short @@rle_inner_comp ;while is equal jump In C: next=0; byte=*(buffer); while(byte==*(buffer+next)) ++next; If there was no byte repeated at all then (in the Asm version) we'll set EBX to '1' just decrement it, and we'll know that there was no repetition. dec ebx jz short @@rle_no_rep I'll not support C anymore, 'coz it's so easy that you can easily implement it yourself. If there was no byte at all, then write an uncompressed byte. @@rle_no_rep: push ax ;in AL we have the current byte xor eax,eax ;flag of uncompressed byte mov ecx,1 ;write one bit call put_bits ;write the flag pop ax ;restore it mov ecx,8 ;write the whole byte call put_bits inc edi ;next byte cmp edi,buffer+imput_lenght ;precalculate this after the loop jne short @@rle_inner ;while there's a byte to scan If we have some byte repeated, then write the byte and the length: push ax ;save the current byte mov eax,1 ;flag of repetition mov ecx,1 ;write one bit call put_bits ;write the flag pop ax ;restore it mov ecx,8 ;write the whole byte call put_bits ;so the decoder knows what to repeat mov eax,ebx ;in EBX we have the bytes repeated mov ecx,5 ;length from 0-31 call put_bits ;write the length inc edi ;the uncompressed byte add edi,ebx ;the repeated bytes cmp edi,buffer+imput_lenght;precalculate this after the loop jne short @@rle_inner Now you have a simple RLE encoder. Note that in order to avoid bugs, you MUST change the next byte to the last. Imagine you have 03h as the last byte, then the next byte (it will not be compressed) MUST be different from 03h, for example 04h. Why? This is 'coz our parser only stops when are no more repetitions. In case the last byte of the file and the next in memory are equal, our parser will continue parsing beyond the END of the file, and then the program will hang. The length_____ Of course you must care that the length isn't greater than the maximum, with something like that: cmp ebx,31 ;31 is our maximum jbe short @@rle__ mov ebx31 @@rle__: But if we want to make this better, we can only improve the way the length is saved, so let's have a look at it. We use 5 bits, so the range is from 0-31, but we never use the 0, 'coz we never have a repetition of 0, so let's do the range from 1-32, much better. The encoder only has to decrement EBX, and the decoder increments it. Note that if the minimum was 2 the encoder would add 2, and the decoder subtract 2. Another way of saving the length could be making it variable, depending of a flag, so: 1 if 0: 3 if 1: 6 Keep the first trick that we used in your mind: For the second length our range will be 1-64 but in fact we never use the 1-8 'coz it's used in the first length (3 bits), so its range can be: (1-64)+8 = 9-72. The enconder has to test whether the length is lower than 9. If it is, it has to emit a 1 and the length; if it isn't, then a 0 and the length. Decoder_____ The decoder is even easier than the encoder, as always. We just have to get a bit and check it: mov edi,offset buffer ;the output buffer @@rled_inner: mov edx,1 ;how many bits we want call get_bits ;bits in AL dec al ;check the bit-flag (prefix) jz short @@rled_decode ;if 1 a match (1-1=0) then zf on Now in case the conditional jump has not been taken, we just read a byte and put it in the buffer: mov edx,8 ;get a whole byte call get_bits mov [edi],al ;put the byte inc edi ;next byte cmp edi,buffer+output_lenght;do I have to explain this again? jne short @@rled_inner ;next one jmp short @@rled_end ;end, don't execute any more code Now in case it was a 'match': @@rled_decode: mov edx,8 call get_bits ;get the byte to copy mov [edi],al ;put it in the buffer inc edi ;next byte push ax ;save it mov edx,1 ;check the flag for the length call get_bits dec al ;see what of the lengths is jz short @@rled_big ;we have the 'big' length mov edx,3 ;the bits needed for the length call get_bits ;get the 'little one' mov ecx,eax ;save the length and ecx,0111b ;erase the other bits inc ecx ;the range 1-8 pop ax ;restore the byte to write jmp short @@rled_finally_decode ;let's decode it @@rled_big: mov edx,6 ;the length call get_bits ;get the 'big one' mov ecx,eax ;save the length and ecx,0111111b ;mask it, so we have no problems add ecx,9 ;the range 9-72 pop ax ;restore the byte to write @@rled_finally_decode: mov [edi],al ;put byte in the buffer inc edi ;next byte dec ecx ;more bytes to copy? jnz short @@rled_finally_decode ;keep copying cmp edi,buffer+output_lenght jne short @@rled_inner @@rled_end: ret ;end of the proc Here is almost all the code you need for a fast RLE encoder and decoder. I said almost all, 'coz you still have to encode the length, well and get the params, get the mem... About getting the params, there's a little article at my h-page. References_____ [1] http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/ src-rr-124.html The original BWT paper. [2] http://www.dogma.net/markn Look for the article section. If you want more compression related links, visit my homepage at: http://www.geocities.com/SiliconValley/Byte/6789/ Closing words_____ Now you have already implemented it, but you see that it doesn't have good ratios. What else can you do? You can improve it with Huffman, or Shannon-Fano, or even arithmetic coding... But this scheme (RLE) is only good for images, with loooong repetitions, and for the output of a BWT-block. If you want another scheme, you can try lz77. From my homepage you can download an article about it. If you still have any doubt, or just want to talk about compression, send me an email: dario_phong@mixmail.com Now, just good luck with your coding! - DARiO PhONG, Barcelona 14/02/1999