Lemmingologist	05 Aug 2005 14:48:46 Explanation of Lemmings file compression The levels, graphics, and sound files for PC Lemmings are all compressed using the .DAT compression format. A .DAT file contains one or more data items (for example, levels), and each item in the file is stored as a "block." The compressed data are read in bitwise, starting at the end of a block and working backwards. A few pieces of information, however, are stored in a header at the beginning of each block, and must be read in before jumping to the end. Header Byte 0: instructions for handling the tail end of the block. There are eight different possibilities, represented by the values 00 through 07. The instructions depend on the structure of the end of the file as well, which can be determined by looking at the last bit. If the last bit in the file is 0, then the instructions are determined as follows: 00 - skip last 10 bits, read next 3 bits 01 - skip last 9 bits, read next 3 bits 02 - skip last 8 bits, read next 3 bits 03 - skip last 7 bits, read next 3 bits 04 - skip last 2 bits, read next 2 bits, skip next 4 bits, read next 1 bit 05 - skip last 2 bits, read next 3 bits, skip next 3 bits 06 - skip last 2 bits, read next 3 bits, skip next 2 bits 07 - skip last 2 bits, read next 3 bits, skip next 1 bit If the last bit in the file is 1, the instructions are slightly different: 00 - skip last 11 bits, read next 8 bits 01 - skip last 10 bits, read next 8 bits 02 - skip last 9 bits, read next 8 bits 03 - skip last 8 bits, read next 8 bits 04 - skip last 3 bits, read next 4 bits, skip next 4 bits, read next 4 bits 05 - skip last 3 bits, read next 4 bits, skip next 3 bits, read next 4 bits 06 - skip last 3 bits, read next 4 bits, skip next 2 bits, read next 4 bits 07 - skip last 3 bits, read next 4 bits, skip next 1 bit, read next 4 bits Byte 1: checksum. Don't know how this is calculated, but it doesn't appear to make any difference. Bytes 2-5: four-byte big-endian integer indicating the size, in bytes, of the uncompressed block. Bytes 6-9: four-byte big-endian integer indicating the size, in bytes, of the compressed block. To be continued in reply, as there appears to be a limit on message length.

Lemmingologist	05 Aug 2005 14:50:04 Lemmings file compression continued Data Once the header has been read, skip to the end of the block and follow the instructions indicated by byte 0 of the header. Remember, the data are being read backwards, which means that "next" really means the bits that come before in the compressed file. Also, integers which are not in the header should be read in reverse bit order (for example, 110 = 3, and 01000000 = 2). The bits that were read, rather than skipped, form an integer specifying the first "length." If 8 bits were read instead of 3 (that is, the last bit of the file is 1), add 8 to this length. Now, do the following until all of the data have been read (reading backwards in the file from the current position, one bit at a time, and making sure not to treat the header as data): (1) Read in a number of bytes equal to length + 1, and write them directly to the decompressed file, reversing the order of the bits in each byte. Remember that these bytes may start in the middle of a file byte - do not skip any bits. Proceed to (2). (2) The next few bits indicate the nature of the bits that come after them. (Since bits are being read backwards one at a time, the actual order of these bits in the file will be reversed). 00 - read next 3 bits as an integer. This number is the next length. Repeat (1). 01 - there will be 2 repeated bytes. Read next 8 bits as an offset. Proceed to (3). 100 - there will be 3 repeated bytes. Read next 9 bits as an offset. Proceed to (3). 101 - there will be 4 repeated bytes. Read next 10 bits as an offset. Proceed to (3). 110 - read next 8 bits as an integer, "n". There will be n + 1 repeated bytes. Read the next 12 bits as an offset. Proceed to (3). 111 - read next 8 bits as an integer, "n". The next length is equal to n + 8. Repeat (1). (3) Copy the indicated number of repeated bytes from the decompressed file, starting from the position equal to the number of bytes already decompressed, minus the offset, minus 1. Paste these bytes at the end of the decompressed file. For example, if 50 bytes have been decompressed so far, and the offset is 20, copy bytes starting from position 50 - 20 - 1 = 29. If there are, for example, 5 repeated bytes, then copy 5 bytes starting from byte number 29, and paste them at the end of the file. Repeat (2). After all of the data in the block have been read and the decompressed file has been written, the order of the bytes in the decompressed file will then need to be reversed, as the original file was read and compressed in reverse order. This must be done bytewise - read the last byte of the decompressed file, paste it in as the first byte of the new file; read the second to last byte and paste it in as the new second byte, and so on. And that is all there is to the Lemmings compression format! I have included program code for compression and decompression as replies to this post. These programs should be able to be translated into any programming language which can read and write single bytes (built-in methods for reading and writing single bits are not necessary). Be aware, though, that the algorithms I am using are slow when handling large files, and could use some optimization. These programs are intended as a starting point for anyone who would like to write a Lemmings graphics editor. Which, I imagine, will inevitably happen sometime (see my other post - Instructions for editing graphics files).

Lemmingologist	05 Aug 2005 14:53:30 Decompression code dim byte, bytePos as integer dim in, out as binaryStream function ReadBits(count as integer) as integer dim i, sum as integer for i=count-1 downto 0 sum=sum+pow(2,i)(byte mod 2) byte=byte\2 if bytePos < 7 then bytePos=bytePos+1 else bytePos=0 byte=in.ReadByte in.position=in.position-2 end if next return sum end function dim i, bits, start, blockSize, compressedBlockSize, fileSize, itemCount, repeat, offset, length as integer dim stop, endOfFile as boolean dim file as folderItem file=GetOpenFolderItem("type") //displays "Open File" dialog box fileSize=0 itemCount=0 if file <> nil then while not endOfFile in=file.OpenAsBinaryFile(false) out=GetFolderItem("temp").CreateBinaryFile("type") //creates a binary file named temp in.position=fileSize start=in.ReadByte bits=in.ReadByte blockSize=in.ReadLong compressedBlockSize=in.ReadLong in.position=fileSize+compressedBlockSize-1 byte=in.ReadByte bytePos=0 in.position=in.position-2 if start > 0 then if ReadBits(1)=0 then if start < 4 then bits=ReadBits(9-start) length=ReadBits(3) elseif start=4 then length=ReadBits(3)2 bits=ReadBits(4) length=length+ReadBits(1) else length=ReadBits(4) bits=ReadBits(8-start) end if else if start < 4 then bits=ReadBits(10-start) length=ReadBits(8)+8 else bits=ReadBits(2) length=ReadBits(4)*16 bits=ReadBits(8-start) length=length+ReadBits(4)+8 end if end if else bits=ReadBits(8) if ReadBits(1)=0 then length=ReadBits(4) else bits=ReadBits(2) length=ReadBits(8)+8 end if end if while in.position >= fileSize+5 for i=0 to length bits=ReadBits(8) out.WriteByte bits next stop=(in.position < fileSize+5) while not stop if ReadBits(1) = 0 then //0 if ReadBits(1) = 0 then //00 length=ReadBits(3) stop=true else //01 repeat=1 end if else //1 if ReadBits(1) = 0 then //10 if ReadBits(1) = 0 then //100 repeat=2 else //101 repeat=3 end if else //11 if ReadBits(1) = 0 then //110 repeat=ReadBits(8) else //111 length=ReadBits(8)+8 stop=true end if end if end if if not stop then if repeat <= 3 then offset=ReadBits(7+repeat) else offset=ReadBits(12) end if for i=0 to repeat out.position=out.position-offset-1 bits=out.ReadByte out.position=out.position+offset out.WriteByte bits next end if wend wend in.close out.close in=GetFolderItem("temp").OpenAsBinaryFile(false) out=GetFolderItem(file.name+" Item "+str(itemCount)).CreateBinaryFile("type") in.position=in.length-1 for i=1 to in.length-blockSize bits=in.ReadByte in.position=in.position-2 next for i=1 to blockSize out.WriteByte in.ReadByte in.position=in.position-2 next in.close out.close GetFolderItem("temp").delete fileSize=fileSize+compressedBlockSize if fileSize >= file.length then endOfFile=true end if itemCount=itemCount+1 wend end if

Lemmingologist	05 Aug 2005 14:54:53 Re: Explanation of Lemmings file compression function Binary(byte as integer) as string dim i, b as integer dim bin as string b=byte for i=7 downto 0 bin=bin+str(b\pow(2,i)) b=b mod pow(2,i) next return bin end function function FindRepeatedBytes(str1 as string, str2 as string, maxDistance as integer) as integer dim pos, curInStr, skip, length as integer dim source, find as string source=right(str1, maxDistance8) find=str2 do //inStr is a search function which returns 0 if the string to be found does not occur, //and n+1 if n characters precede its first occurrence curInStr=inStr(source, find) if curInStr > 0 then if curInStr mod 8 = 1 then pos=skip+curInStr\8+1 end if skip=skip+ceil(curInStr/8) source=right(source, len(source)-ceil(curInStr/8)8) end if loop until curInStr=0 return pos end function dim i, j, byte, position, count, offset, length, max as integer dim newBytes, compressed, raw as string dim in, out as binaryStream dim file as folderItem dim stop as boolean file=GetOpenFolderItem("type") //displays "Open File" dialog box if file <> nil then in=file.OpenAsBinaryFile(false) in.position=in.length-1 while not stop count=0 do byte=in.ReadByte if in.position=1 then stop=true else in.position=in.position-2 end if newBytes=newBytes+Binary(byte) if len(raw) < max8 then offset=len(raw)\8-position else offset=max-position end if if len(newBytes) <= 32 then max=pow(2, 6+len(newBytes)\8) else max=4096 end if position=FindRepeatedBytes(raw, newBytes, max) count=count+1 loop until position=0 or count > 256 or stop if len(newBytes) > 16 then if not stop then in.position=in.position+1 count=count-1 newBytes=left(newBytes, len(newBytes)-8) end if if length > 0 then if length <= 8 then compressed=left(compressed, len(compressed)-length8)+right(Binary(length-1), 5)+right(compressed, length8) else compressed=left(compressed, len(compressed)-length8)+"111"+Binary(length-9)+right(compressed, length8) end if length=0 end if select case count case 2 compressed=compressed+"01"+Binary(offset) case 3 compressed=compressed+"100"+right(Binary(offset\256), 1)+Binary(offset mod 256) case 4 compressed=compressed+"101"+right(Binary(offset\256), 2)+Binary(offset mod 256) else compressed=compressed+"110"+Binary(count-1)+right(Binary(offset\256), 4)+Binary(offset mod 256) end select else if length+count <= 264 then compressed=compressed+newBytes length=length+count else compressed=left(compressed, len(compressed)-length8)+"111"+Binary(length-9)+right(compressed, length8) compressed=compressed+newBytes length=count end if end if raw=raw+newBytes if len(raw) > 32768 then raw=right(raw, 32768) end if newBytes="" wend if length > 0 then if length <= 8 then compressed=left(compressed, len(compressed)-length8)+right(Binary(length-1), 5)+right(compressed, length8) else compressed=left(compressed, len(compressed)-length8)+"111"+Binary(length-9)+right(compressed, length8) end if length=0 end if out=GetFolderItem(file.name+" Compressed").CreateBinaryFile("type") //creates a binary file with the original file name plus "Compressed" j=len(compressed) mod 8 out.WriteByte j if left(compressed, 1) = "0" then if j > 4 then for i=7 downto j newBytes=newBytes+"0" next compressed=left(compressed, 5)+newBytes+right(compressed, len(compressed)-5) elseif j=4 then compressed=left(compressed, 4)+"0000"+right(compressed, len(compressed)-4) elseif j=0 then compressed=compressed+"00000000" end if else for i=7 downto j newBytes=newBytes+"0" next if j=0 then compressed=newBytes+compressed elseif j < 3 then compressed=left(compressed, j)+newBytes+right(compressed, len(compressed)-j) elseif j < 5 then compressed=left(compressed, 3)+newBytes+right(compressed, len(compressed)-3) else compressed=left(compressed, 7)+newBytes+right(compressed, len(compressed)-7) end if end if max=ceil(len(compressed)/8) out.WriteByte 0 out.WriteLong in.length out.WriteLong max+10 for i=1 to max byte=0 for j=7 downto 0 byte=byte+val(right(compressed,1))pow(2,j) compressed=left(compressed, len(compressed)-1) next out.WriteByte byte next in.close out.close end if

the guest	05 Aug 2005 18:40:46 Re: Explanation of Lemmings file compression Can you give a semi-English explanation of what you do for compression?

Mindless	06 Aug 2005 00:54:35 Re: Explanation of Lemmings file compression What language is this code in? It's not any syntax I recognize... although it looks alot like BASIC. Edit: After a quick search, it appears to be REALbasic... :-/

the guest	06 Aug 2005 01:11:44 Re: Explanation of Lemmings file compression Just curious: "Lemmingologist", you don't happen to be the author of LemEdit, are you? I'm guessing this for no reason other than the fact that your knowledge appears to somewhat match the features and limitations of LemEdit, including the significant fact that you don't have info on the VGASPECx.DAT files, just like the last version of LemEdit we get also don't have support for them yet. The reason we (well, at least I) want to know is because, well, LemEdit could use some fixes to work better on modern Windows machines......

Mindless	06 Aug 2005 04:38:39 Re: Explanation of Lemmings file compression Just curious: "Lemmingologist", you don't happen to be the author of LemEdit, are you? I'm guessing this for no reason other than the fact that your knowledge appears to somewhat match the features and limitations of LemEdit, including the significant fact that you don't have info on the VGASPECx.DAT files, just like the last version of LemEdit we get also don't have support for them yet. The reason we (well, at least I) want to know is because, well, LemEdit could use some fixes to work better on modern Windows machines...... Sorry to break your theory, but (s)he can't be, since in order to compress .DATs (LemEdit does ;P) you MUST know how to create the correct checksum.

the guest	06 Aug 2005 15:03:52 Re: Explanation of Lemmings file compression Good point. I honestly haven't been really trying to thoroughly read whatever (s)he posted. X_X

Lemmingologist	13 Aug 2005 11:18:59 Compression instructions Start at the end of the uncompressed file, and repeat the following until reaching the beginning: Read in bytes, one at a time, backwards toward the beginning, until a string (two bytes or longer) is found which has already been read. At the end of the compressed file, first indicate the number of non-repeating bytes that were read in before the repeated string, minus one (call this number n). This is done in the following manner: If n is less than 8, write two 0-bits, then write n as a three-bit integer. If n is greater than or equal to 8, write three 1-bits, then write n - 8 as an eight-bit integer. For example, if 23 non-repeating bytes were read before the repeated string, then n = 23 - 1 = 22. 22 - 8 = 14, so the output to the compressed file would then be 111 followed by 14 in binary, which is 00001110. Now append the non-repeating bytes to the compressed file. Then, add two or three bits indicating the number of repeating bytes, using the codes below: � � �01 - 2 repeated bytes � � �100 - 3 repeated bytes � � �101 - 4 repeated bytes � � �110 - anywhere from 2 to 256 repeated bytes. Follow this with the number of repeated bytes minus one, written as an eight-bit integer. Next, write the offset of the repeated string, which is equal to the total number of bytes read so far from the uncompressed file, minus the number of bytes read before the previous occurrence of the repeated string. The number of bits used to write the offset depends on the repeat indicator code used above: � � 01 - offset is an eight-bit integer � � �100 - offset is a nine-bit integer � � �101 - offset is a ten-bit integer � � �110 - offset is a twelve-bit integer This means that the maximum offset is 255, 511, 1023, and 4095 for each of the above codes respectively. If the string does not occur within 4096 bytes of the current position in the file, it must not be counted as a repeated string. After all of the data have been read, reverse the order of the bits (not bytes) in the compressed file. Then, at the beginning of the new compressed file, add the ten-byte header: Byte 0: the remainder when the number of bits in the compressed file is divided by 8. Byte 1: checksum. This is ignored by Lemmings, so a zero can be used here. Bytes 2-5: four-byte big-endian integer indicating the size, in bytes, of the uncompressed file. Bytes 6-9: four-byte big-endian integer indicating the size, in bytes, of the compressed file, including the ten-byte header. Finally, some 0-bits need to be added near the end of the new compressed file. The placement of these bits depends on the value of byte 0 of the header: 00 - add eight 0-bits at the end of the file. 01 - if the last bit of the file is 1, add seven 0-bits before the last bit of the file. Otherwise, do nothing. 02 - if the last bit of the file is 1, add six 0-bits before the last two bits of the file. Otherwise, do nothing. 03 - if the last bit of the file is 1, add five 0-bits before the last three bits of the file. Otherwise, do nothing. 04 - add four 0-bits before the last four bits of the file. 05 - if the last bit of the file is 0, add three 0-bits before the last five bits of the file. If the last bit is 1, add three 0-bits before the last seven bits of the file. 06 - if the last bit of the file is 0, add two 0-bits before the last five bits of the file. If the last bit is 1, add two 0-bits before the last seven bits of the file. 07 - if the last bit of the file is 0, add one 0-bit before the last five bits of the file. If the last bit is 1, add one 0-bit before the last seven bits of the file.

ccexplore	13 Aug 2005 14:00:36 Re: Compression instructions Read in bytes, one at a time, backwards toward the beginning, until a string (two bytes or longer) is found which has already been read. Ok, that's really the only part I cared about, since it was the part that affects tradeoffs between performance and effectiveness. It sounds like your code searches for all possible substrings, correct? So that in "abcdef", it will search "ab", "abc", "abcd" and "abcde"? I'm curious what sorts of compression ratio you've gotten, and also how long it takes to compress data from the vgagr# files.

Mindless	13 Aug 2005 16:27:00 Re: Compression instructions Byte 1: checksum. This is ignored by Lemmings, so a zero can be used here. Hmmm... Crap... (s)he's right... of course, my Lemmings game is cracked, so they may have the checksum checking code removed...

Mindless	13 Aug 2005 16:49:45 Re: Explanation of Lemmings file compression If the last bit in the file is 0, then the instructions are determined as follows: ... If the last bit in the file is 1, the instructions are slightly different: ... By this do you mean the last bit in the .DAT file or the last bit in the compressed data section?

ccexplore	13 Aug 2005 16:52:05 Re: Explanation of Lemmings file compression Yeah, interestingly it works apparently. Then again, I don't think I ever said that it wouldn't work either. �In the code it definitely passes back the result of the checksum comparison, but whether the result was used or not I can't tell since I'm not going to go through all the places where the function is called and see whether the caller bothers to check the return value. Anyhow, it's just a dirt-simple checksum. �Probably the simplest aspect of the whole thing. Now of course, whether it works for LemEdit or not is another question which I haven' tested (not that it matters for me).

Mindless	13 Aug 2005 17:03:09 Re: Explanation of Lemmings file compression Now of course, whether it works for LemEdit or not is another question which I haven' tested (not that it matters for me). No checksum mismatch warnings from LemEdit... Edit: Oooo... wow, I crashed LemEdit.

covox	14 Aug 2005 00:28:39 Re: Explanation of Lemmings file compression Great stuff. Unsure quite what language the code is (at first glance it looks distinctly VBish), however thankfully it's quite pseudocodey. I can't thank you enough, this is indeed a Christmas miracle. Currently I'm writing a C port of your proceedure (tried starting from scratch using your instructions, my method invoked far too many dynamic memory hissy fits :/). Hopefully I'm aiming toward some sort of multitool to export graphics, sound and level metadata (XML, almost done) to more modern file formats/organized hierachy. If anyone's in the mood for more source code I'm happy to post what I have. Well, maybe once I've cleaned it up a bit ;)

Mindless	14 Aug 2005 02:02:04 Re: Explanation of Lemmings file compression Hopefully I'm aiming toward some sort of multitool to export graphics, sound and level metadata (XML, almost done) to more modern file formats/organized hierachy. Could you elaborate on this a bit? I'm not sure what it is you're trying to do. XML?

ccexplore	14 Aug 2005 02:31:55 Re: Explanation of Lemmings file compression Currently I'm writing a C port of your proceedure (tried starting from scratch using your instructions, my method invoked far too many dynamic memory hissy fits :/). You can skip this step if you want, I already have working C++ code for compression and decompression. (But I hope you will credit me for it if you do decide to use it.) E-mail me for more info.

covox	14 Aug 2005 04:17:48 Re: Explanation of Lemmings file compression Could you elaborate on this a bit? I'm not sure what it is you're trying to do. XML? Ooh-er, could've explained myself better there ;P At some point, I'm going to have a shot at making an open-source extendable Lemmings clone. It'll be done in C and SDL, which means (hopefully) that porting and drop-in improvements should be trivial. I'm developing this on a gcc/linux base, which should compile under Windows with mingw32. No idea if VS6/VS.NET/Microsoft Magic Mystery Compiler will touch it however ;/ The Pingus guys have done a great job, however I'm aiming toward something more lightweight and ultra-portable. Oh, and ClanLib is waaay too large to stick onto a PDA. Heaven help you if you change versions :/ To start with, I'm just trying to make the game data manageable. So all the terrain/animations will be converted to indexed PNG, and the level information converted to XML documents. XML is nice because both people and computers can read it :) Each level will be stored in an XML file, and there should be some mechanism which refreshes the index (containing all the levels/different games) each time you run the program. For example, if you were to develop a level called "Taste the golden spray", and you want it positioned as Level 12 of Fun on a levelset called "My Generic Custom Game", you could add this XML to the appropriate file: <?xml version="1.0" encoding="UTF-8"?> <!-- Games available are Classic, ONML, Your_game_name_here --> <LemLevel game="My Generic Custom Game" difficulty="1" level="12"> <title>Taste the golden spray!</title> <settings levelset="2" release="80" save="80" speed="20" time="180"/> <skills climber="0" floater="0" bomber="0" blocker="0" builder="0" basher="0" miner="0" digger="0"/> <!-- BEGIN level metadata --> <Interactives> <interactive piecenum="4" xoffset="0" yoffset="0" flag_ground="0" flag_nooverdraw="0" flag_upsidedown="1" /> ... </Interactives> <Terrains> <terrain piecenum="12" xoffset="0" yoffset="0" flag_black="1" flag_nooverdraw="0" flag_upsidedown="0" /> ... </Terrains> <Steels> <steel xoffset="0" yoffset="0" xwidth="16" ywidth="16" /> ... </Steels> </LemLevel> It's very preliminary at the moment, any suggestions are more than welcome. There's a good chance that the schema will be extended at some point to include a broader definition of interactive objects (think the player-controlled cannons in Lemmings 2). You can skip this step if you want, I already have working C++ code for compression and decompression. (But I hope you will credit me for it if you do decide to use it.) E-mail me for more info. Thanks greatly for the offer, however I'd feel more confident if all the code is pure C. But then again, I haven't had time to compile test my attempt yet, so I may come crying back :)

ccexplore (not logged in)	14 Aug 2005 04:55:45 Re: Explanation of Lemmings file compression Thanks greatly for the offer, however I'd feel more confident if all the code is pure C. But then again, I haven't had time to compile test my attempt yet, so I may come crying back :) For the decompression I actually have a C version. I don't have and can't give you a C++ version for the compression though, because it uses <list>, and C, of course, lacks any sort of standardized library for collections. Then again, since you mentioned "open source", it may be best anyway if you write the code yourself based on Lemmingologist's posts (mine might be a bit too close to what's in the original game itself to be safely open-sourced).

Mindless	14 Aug 2005 05:44:04 Re: Explanation of Lemmings file compression I don't have and can't give you a C++ version for the compression though, because it uses <list>, and C, of course, lacks any sort of standardized library for collections. That'd be why I haven't ported the compression code to FreeBasic (I gave up on QBX because of memory issues). I have no idea what list is... FreeBasic is awesome: the power of C, yet the syntax of QB (and they're not finished yet)

Lemmingologist	20 Aug 2005 13:36:29 Compression code function Binary(byte as integer) as string dim i, b, power as integer dim bin as string b=byte for i=7 downto 0 power=pow(2,i) bin=bin+str(b\power) b=b mod power next return bin end function function FindRepeatedBytes(str1 as string, str2 as string, maxDistance as integer) as integer dim pos, curInStr, skip as integer dim source, find as string source=str1 find=str2 do //inStr is a search function which returns 0 if the string to be found does not occur, //and n+1 if n characters precede its first occurrence curInStr=inStr(source, find) if curInStr > 0 then if curInStr mod 8 = 1 then pos=skip+curInStr\8+1 end if skip=skip+ceil(curInStr/8) source=right(source, len(source)-ceil(curInStr/8)8) end if loop until pos > 0 or curInStr=0 return pos end function function Reverse(bytes as string) as string dim i as integer dim b, n as string b=bytes for i=1 to len(bytes)\8 n=n+right(b, 8) b=left(b, len(b)-8) next return n end function function XOR(byte1 as integer, byte2 as string) as integer dim i, b1, power, result as integer dim b2 as string b1=byte1 b2=byte2 for i=7 downto 0 power=pow(2,i) if str(b1\power) <> left(b2, 1) then result=result+power end if b1=b1 mod power b2=right(b2, i) next return result end function dim i, j, byte, position, readPosition, count, offset, length as integer dim newBytes, compressed, raw as string dim in, out as binaryStream dim file as folderItem dim stop as boolean file=GetOpenFolderItem("type") if file <> nil then in=file.OpenAsBinaryFile(false) readPosition=in.length-1 while not stop ProgressBar1.value=in.position/in.length1000\1 count=0 do stop=readPosition < 0 if not stop then in.position=readPosition byte=in.ReadByte readPosition=readPosition-1 newBytes=Binary(byte)+newBytes end if offset=position+count-2 if not stop then position=FindRepeatedBytes(raw, newBytes) count=count+1 end if loop until position=0 or count > 256 or stop stop=readPosition < 0 if len(newBytes) > 16 then if stop then if position=0 then count=count-1 newBytes=left(newBytes, 8) end if else readPosition=readPosition+1 count=count-1 newBytes=right(newBytes, len(newBytes)-8) end if if length > 0 then if length <= 8 then compressed=left(compressed, len(compressed)-length8)+right(Binary(length-1), 5)+right(compressed, length8) else compressed=left(compressed, len(compressed)-length8)+"111"+Binary(length-9)+right(compressed, length8) end if length=0 end if if count=2 and offset < 256 then compressed=compressed+"01"+Binary(offset) elseif count=3 and offset < 512 then compressed=compressed+"100"+right(Binary(offset\256), 1)+Binary(offset mod 256) elseif count=4 and offset < 1024 then compressed=compressed+"101"+right(Binary(offset\256), 2)+Binary(offset mod 256) else compressed=compressed+"110"+Binary(count-1)+right(Binary(offset\256), 4)+Binary(offset mod 256) end if if stop and position=0 then if length < 264 then compressed=compressed+Reverse(newBytes) length=length+1 else compressed=left(compressed, len(compressed)-length8)+"111"+Binary(length-9)+right(compressed, length8) compressed=compressed+Reverse(newBytes) length=1 end if end if else if length+count <= 264 then compressed=compressed+Reverse(newBytes) length=length+count else compressed=left(compressed, len(compressed)-2648)+"11111111111"+right(compressed, 2648) compressed=compressed+Reverse(newBytes) length=count end if end if raw=left(newBytes+raw, 32768) newBytes="" wend if length > 0 then if length <= 8 then compressed=left(compressed, len(compressed)-length8)+right(Binary(length-1), 5)+right(compressed, length8) else compressed=left(compressed, len(compressed)-length8)+"111"+Binary(length-9)+right(compressed, length8) end if end if out=GetFolderItem(file.name+" Compressed").CreateBinaryFile("type") j=len(compressed) mod 8 out.WriteByte j for i=7 downto j newBytes=newBytes+"0" next if left(compressed, 1) = "0" then if j < 4 then compressed=newBytes+compressed elseif j=4 then compressed=left(compressed, 4)+newBytes+right(compressed, len(compressed)-4) else compressed=left(compressed, 5)+newBytes+right(compressed, len(compressed)-5) end if else if j=0 then compressed=newBytes+compressed elseif j <= 4 then compressed=left(compressed, j)+newBytes+right(compressed, len(compressed)-j) else compressed=left(compressed, 7)+newBytes+right(compressed, len(compressed)-7) end if end if length=len(compressed)\8 for i=1 to length byte=XOR(byte, left(right(compressed, i8), 8)) next byte=byte\128+(byte\64 mod 2)2+(byte\32 mod 2)4+(byte\16 mod 2)8+(byte\8 mod 2)16+(byte\4 mod 2)32+(byte\2 mod 2)64+(byte mod 2)128 out.WriteByte byte out.WriteLong in.length out.WriteLong length+10 for i=1 to length byte=0 for j=7 downto 0 byte=byte+val(right(compressed,1))*pow(2,j) compressed=left(compressed, len(compressed)-1) next out.WriteByte byte next in.close out.close end if

Lemmingologist	22 Aug 2005 10:21:44 Re: Compression instructions Ok, that's really the only part I cared about, since it was the part that affects tradeoffs between performance and effectiveness. It sounds like your code searches for all possible substrings, correct? �So that in "abcdef", it will search "ab", "abc", "abcd" and "abcde"? I'm curious what sorts of compression ratio you've gotten, and also how long it takes to compress data from the vgagr# files. Yes, if "abcdef" is a repeated string, it will search for all of those substrings until it reaches "abcdefg", which has not occurred before in the file. It will then back up one space in the uncompressed file, remove the "g", and add code for the repeating "abcdef". I think my FindRepeatingBytes function is highly inefficient, though, as it searches through the file from beginning to end, when it is really looking for the last occurrence of the substring. This is probably why it takes a fairly long time (several minutes) to compress the VGAGR files. The compressed files are much smaller than the ones generated by LemEdit, as LemEdit only looks for repeating strings which are 16 bytes or less, and my compressor takes advantage of every bit in the "repeat number byte" as well as every bit allocated to the offset. They are still longer than the original compressed files, however. The Lemmings compressor must have been more clever about where to begin and end repeated substrings, unlike my "greedy" algorithm, which simply looks for the longest possible string starting from wherever it happens to be in the file.

Lemmingologist	22 Aug 2005 10:32:40 Re: Explanation of Lemmings file compression By this do you mean the last bit in the .DAT file or the last bit in the compressed data section? The compressed data section. I suppose I should have said "block", rather than "file".

ccexplore (not logged in)	22 Aug 2005 11:05:14 Re: Compression instructions the substring. This is probably why it takes a fairly long time (several minutes) to compress the VGAGR files. Several minutes?!? X_X I guess that won't do for end-user purposes. ;) (Yes, I understand it's never intended as such.) I still like to know how yours compared with DMA's (ie. the original DAT files from the game). Are yours only larger by a few bytes, or a lot more? As a reference, the algorithm I have is generally a few bytes larger for level DAT files, about 20% larger for VGASPEC files, and unfortunately, nearly twice as large for VGAGR files. On the other hand, it operates with no perceptible delay (ie. under a second or less). (Though as a practical matter, you can probably get away with not doing any real compression at all, so I guess this is more an academic exercise than anything. :-/)

Lemmingologist	22 Aug 2005 12:35:48 Re: Compression instructions Several minutes?!? �X_X �I guess that won't do for end-user purposes. �;) (Yes, I understand it's never intended as such.) I still like to know how yours compared with DMA's (ie. the original DAT files from the game). �Are yours only larger by a few bytes, or a lot more? As a reference, the algorithm I have is generally a few bytes larger for level DAT files, about 20% larger for VGASPEC files, and unfortunately, nearly twice as large for VGAGR files. �On the other hand, it operates with no perceptible delay (ie. under a second or less). (Though as a practical matter, you can probably get away with not doing any real compression at all, so I guess this is more an academic exercise than anything. �:-/) My files are (roughly) 1% larger for VGAGR, 4% larger for levels, and 7% larger for VGASPEC. Obviously, this is not worth all the extra time my program takes. The slow speed is probably due to the fact that I'm using a high-level framework and a general-purpose search function, though. You could probably afford to increase your level of compression without sacrificing too much speed.

Lemmingologist	23 Aug 2005 12:19:28 Decompression code (bug fix) dim byte, bytePos as integer dim in, out as binaryStream function ReadBits(count as integer) as integer dim i, sum as integer for i=count-1 downto 0 sum=sum+pow(2,i)(byte mod 2) byte=byte\2 if bytePos < 7 then bytePos=bytePos+1 else bytePos=0 byte=in.ReadByte in.position=in.position-2 end if next return sum end function dim i, bits, start, blockSize, compressedBlockSize, fileSize, itemCount, repeat, offset, length as integer dim stop, endOfFile as boolean dim file as folderItem file=GetOpenFolderItem("type") //displays "Open File" dialog box fileSize=0 itemCount=0 if file <> nil then while not endOfFile in=file.OpenAsBinaryFile(false) out=GetFolderItem("temp").CreateBinaryFile("type") //creates a binary file named temp in.position=fileSize start=in.ReadByte bits=in.ReadByte blockSize=in.ReadLong compressedBlockSize=in.ReadLong in.position=fileSize+compressedBlockSize-1 byte=in.ReadByte bytePos=0 in.position=in.position-2 if start > 0 then if ReadBits(1)=0 then if start < 4 then bits=ReadBits(9-start) length=ReadBits(3) elseif start=4 then length=ReadBits(3)2 bits=ReadBits(4) length=length+ReadBits(1) else length=ReadBits(4) bits=ReadBits(8-start) end if else if start < 4 then bits=ReadBits(10-start) length=ReadBits(8)+8 elseif start=4 then bits=ReadBits(2) length=ReadBits(1)128 bits=ReadBits(4) length=length+ReadBits(7)+8 else bits=ReadBits(2) length=ReadBits(4)16 bits=ReadBits(8-start) length=length+ReadBits(4)+8 end if end if else bits=ReadBits(8) if ReadBits(1)=0 then length=ReadBits(4) else bits=ReadBits(2) length=ReadBits(8)+8 end if end if while in.position >= fileSize+5 for i=0 to length bits=ReadBits(8) out.WriteByte bits next stop=(in.position < fileSize+5) while not stop if ReadBits(1) = 0 then //0 if ReadBits(1) = 0 then //00 length=ReadBits(3) stop=true else //01 repeat=1 offset=ReadBits(8) end if else //1 if ReadBits(1) = 0 then //10 if ReadBits(1) = 0 then //100 repeat=2 offset=ReadBits(9) else //101 repeat=3 offset=ReadBits(10) end if else //11 if ReadBits(1) = 0 then //110 repeat=ReadBits(8) offset=ReadBits(12) else //111 length=ReadBits(8)+8 stop=true end if end if end if if not stop then for i=0 to repeat out.position=out.position-offset-1 bits=out.ReadByte out.position=out.position+offset out.WriteByte bits next end if wend wend in.close out.close in=GetFolderItem("temp").OpenAsBinaryFile(false) out=GetFolderItem(file.name+" Item "+str(itemCount)).CreateBinaryFile("type") in.position=in.length-1 for i=1 to in.length-blockSize bits=in.ReadByte in.position=in.position-2 next for i=1 to blockSize out.WriteByte in.ReadByte in.position=in.position-2 next in.close out.close GetFolderItem("temp").delete fileSize=fileSize+compressedBlockSize if fileSize >= file.length then endOfFile=true end if itemCount=itemCount+1 wend end if

Lemmingologist	23 Aug 2005 13:18:51 Re: Explanation of Lemmings file compression I have updated my compression and decompression code (above on this page). The changes include: � Checksum calculation (thanks to ccexplore) � Support for 12-bit offsets with less than 5 repeating bytes � Fixed a bug in compressor which sometimes caused the first byte of the file to be encoded incorrectly. � Increased compression speed by optimizing searches for repeated strings

Lemmingologist

05 Aug 2005 14:48:46

Explanation of Lemmings file compression

The levels, graphics, and sound files for PC Lemmings are all compressed using the .DAT compression format. A .DAT file contains one or more data items (for example, levels), and each item in the file is stored as a "block." The compressed data are read in bitwise, starting at the end of a block and working backwards. A few pieces of information, however, are stored in a header at the beginning of each block, and must be read in before jumping to the end.

Header

Byte 0: instructions for handling the tail end of the block. There are eight different possibilities, represented by the values 00 through 07. The instructions depend on the structure of the end of the file as well, which can be determined by looking at the last bit.
If the last bit in the file is 0, then the instructions are determined as follows:
00 - skip last 10 bits, read next 3 bits
01 - skip last 9 bits, read next 3 bits
02 - skip last 8 bits, read next 3 bits
03 - skip last 7 bits, read next 3 bits
04 - skip last 2 bits, read next 2 bits, skip next 4 bits, read next 1 bit
05 - skip last 2 bits, read next 3 bits, skip next 3 bits
06 - skip last 2 bits, read next 3 bits, skip next 2 bits
07 - skip last 2 bits, read next 3 bits, skip next 1 bit
If the last bit in the file is 1, the instructions are slightly different:
00 - skip last 11 bits, read next 8 bits
01 - skip last 10 bits, read next 8 bits
02 - skip last 9 bits, read next 8 bits
03 - skip last 8 bits, read next 8 bits
04 - skip last 3 bits, read next 4 bits, skip next 4 bits, read next 4 bits
05 - skip last 3 bits, read next 4 bits, skip next 3 bits, read next 4 bits
06 - skip last 3 bits, read next 4 bits, skip next 2 bits, read next 4 bits
07 - skip last 3 bits, read next 4 bits, skip next 1 bit, read next 4 bits

Byte 1: checksum. Don't know how this is calculated, but it doesn't appear to make any difference.
Bytes 2-5: four-byte big-endian integer indicating the size, in bytes, of the uncompressed block.
Bytes 6-9: four-byte big-endian integer indicating the size, in bytes, of the compressed block.

To be continued in reply, as there appears to be a limit on message length.

Lemmingologist

05 Aug 2005 14:50:04

Lemmings file compression continued

Data

Once the header has been read, skip to the end of the block and follow the instructions indicated by byte 0 of the header. Remember, the data are being read backwards, which means that "next" really means the bits that come before in the compressed file. Also, integers which are not in the header should be read in reverse bit order (for example, 110 = 3, and 01000000 = 2). The bits that were read, rather than skipped, form an integer specifying the first "length." If 8 bits were read instead of 3 (that is, the last bit of the file is 1), add 8 to this length. Now, do the following until all of the data have been read (reading backwards in the file from the current position, one bit at a time, and making sure not to treat the header as data):

(1) Read in a number of bytes equal to length + 1, and write them directly to the decompressed file, reversing the order of the bits in each byte. Remember that these bytes may start in the middle of a file byte - do not skip any bits. Proceed to (2).

(2) The next few bits indicate the nature of the bits that come after them. (Since bits are being read backwards one at a time, the actual order of these bits in the file will be reversed).
00 - read next 3 bits as an integer. This number is the next length. Repeat (1).
01 - there will be 2 repeated bytes. Read next 8 bits as an offset. Proceed to (3).
100 - there will be 3 repeated bytes. Read next 9 bits as an offset. Proceed to (3).
101 - there will be 4 repeated bytes. Read next 10 bits as an offset. Proceed to (3).
110 - read next 8 bits as an integer, "n". There will be n + 1 repeated bytes. Read the next 12 bits as an offset. Proceed to (3).
111 - read next 8 bits as an integer, "n". The next length is equal to n + 8. Repeat (1).

(3) Copy the indicated number of repeated bytes from the decompressed file, starting from the position equal to the number of bytes already decompressed, minus the offset, minus 1. Paste these bytes at the end of the decompressed file.
For example, if 50 bytes have been decompressed so far, and the offset is 20, copy bytes starting from position 50 - 20 - 1 = 29. If there are, for example, 5 repeated bytes, then copy 5 bytes starting from byte number 29, and paste them at the end of the file. Repeat (2).

After all of the data in the block have been read and the decompressed file has been written, the order of the bytes in the decompressed file will then need to be reversed, as the original file was read and compressed in reverse order. This must be done bytewise - read the last byte of the decompressed file, paste it in as the first byte of the new file; read the second to last byte and paste it in as the new second byte, and so on.

And that is all there is to the Lemmings compression format! I have included program code for compression and decompression as replies to this post. These programs should be able to be translated into any programming language which can read and write single bytes (built-in methods for reading and writing single bits are not necessary). Be aware, though, that the algorithms I am using are slow when handling large files, and could use some optimization. These programs are intended as a starting point for anyone who would like to write a Lemmings graphics editor. Which, I imagine, will inevitably happen sometime (see my other post - Instructions for editing graphics files).

Lemmingologist

05 Aug 2005 14:53:30

Decompression code

dim byte, bytePos as integer
dim in, out as binaryStream

function ReadBits(count as integer) as integer
dim i, sum as integer

for i=count-1 downto 0
sum=sum+pow(2,i)*(byte mod 2)
byte=byte\2
if bytePos < 7 then
bytePos=bytePos+1
else
bytePos=0
byte=in.ReadByte
in.position=in.position-2
end if
next

return sum
end function

dim i, bits, start, blockSize, compressedBlockSize, fileSize, itemCount, repeat, offset, length as integer
dim stop, endOfFile as boolean
dim file as folderItem

file=GetOpenFolderItem("type") //displays "Open File" dialog box
fileSize=0
itemCount=0
if file <> nil then
while not endOfFile
in=file.OpenAsBinaryFile(false)
out=GetFolderItem("temp").CreateBinaryFile("type") //creates a binary file named temp
in.position=fileSize
start=in.ReadByte
bits=in.ReadByte
blockSize=in.ReadLong
compressedBlockSize=in.ReadLong
in.position=fileSize+compressedBlockSize-1
byte=in.ReadByte
bytePos=0
in.position=in.position-2

if start > 0 then
if ReadBits(1)=0 then
if start < 4 then
bits=ReadBits(9-start)
length=ReadBits(3)
elseif start=4 then
length=ReadBits(3)*2
bits=ReadBits(4)
length=length+ReadBits(1)
else
length=ReadBits(4)
bits=ReadBits(8-start)
end if
else
if start < 4 then
bits=ReadBits(10-start)
length=ReadBits(8)+8
else
bits=ReadBits(2)
length=ReadBits(4)*16
bits=ReadBits(8-start)
length=length+ReadBits(4)+8
end if
end if
else
bits=ReadBits(8)
if ReadBits(1)=0 then
length=ReadBits(4)
else
bits=ReadBits(2)
length=ReadBits(8)+8
end if
end if

while in.position >= fileSize+5
for i=0 to length
bits=ReadBits(8)
out.WriteByte bits
next
stop=(in.position < fileSize+5)

while not stop
if ReadBits(1) = 0 then //0
if ReadBits(1) = 0 then //00
length=ReadBits(3)
stop=true
else //01
repeat=1
end if
else //1
if ReadBits(1) = 0 then //10
if ReadBits(1) = 0 then //100
repeat=2
else //101
repeat=3
end if
else //11
if ReadBits(1) = 0 then //110
repeat=ReadBits(8)
else //111
length=ReadBits(8)+8
stop=true
end if
end if
end if

if not stop then
if repeat <= 3 then
offset=ReadBits(7+repeat)
else
offset=ReadBits(12)
end if
for i=0 to repeat
out.position=out.position-offset-1
bits=out.ReadByte
out.position=out.position+offset
out.WriteByte bits
next
end if
wend
wend

in.close
out.close
in=GetFolderItem("temp").OpenAsBinaryFile(false)
out=GetFolderItem(file.name+" Item "+str(itemCount)).CreateBinaryFile("type")
in.position=in.length-1
for i=1 to in.length-blockSize
bits=in.ReadByte
in.position=in.position-2
next
for i=1 to blockSize
out.WriteByte in.ReadByte
in.position=in.position-2
next
in.close
out.close
GetFolderItem("temp").delete

fileSize=fileSize+compressedBlockSize
if fileSize >= file.length then
endOfFile=true
end if
itemCount=itemCount+1
wend
end if

Lemmingologist

05 Aug 2005 14:54:53

Re: Explanation of Lemmings file compression

function Binary(byte as integer) as string
dim i, b as integer
dim bin as string

b=byte
for i=7 downto 0
bin=bin+str(b\pow(2,i))
b=b mod pow(2,i)
next

return bin
end function

function FindRepeatedBytes(str1 as string, str2 as string, maxDistance as integer) as integer
dim pos, curInStr, skip, length as integer
dim source, find as string

source=right(str1, maxDistance*8)
find=str2

do
//inStr is a search function which returns 0 if the string to be found does not occur,
//and n+1 if n characters precede its first occurrence
curInStr=inStr(source, find)
if curInStr > 0 then
if curInStr mod 8 = 1 then
pos=skip+curInStr\8+1
end if
skip=skip+ceil(curInStr/8)
source=right(source, len(source)-ceil(curInStr/8)*8)
end if
loop until curInStr=0

return pos
end function

dim i, j, byte, position, count, offset, length, max as integer
dim newBytes, compressed, raw as string
dim in, out as binaryStream
dim file as folderItem
dim stop as boolean

file=GetOpenFolderItem("type") //displays "Open File" dialog box
if file <> nil then
in=file.OpenAsBinaryFile(false)
in.position=in.length-1

while not stop
count=0
do
byte=in.ReadByte
if in.position=1 then
stop=true
else
in.position=in.position-2
end if
newBytes=newBytes+Binary(byte)
if len(raw) < max*8 then
offset=len(raw)\8-position
else
offset=max-position
end if
if len(newBytes) <= 32 then
max=pow(2, 6+len(newBytes)\8)
else
max=4096
end if
position=FindRepeatedBytes(raw, newBytes, max)
count=count+1
loop until position=0 or count > 256 or stop

if len(newBytes) > 16 then
if not stop then
in.position=in.position+1
count=count-1
newBytes=left(newBytes, len(newBytes)-8)
end if

if length > 0 then
if length <= 8 then
compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
else
compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
end if
length=0
end if

select case count
case 2
compressed=compressed+"01"+Binary(offset)
case 3
compressed=compressed+"100"+right(Binary(offset\256), 1)+Binary(offset mod 256)
case 4
compressed=compressed+"101"+right(Binary(offset\256), 2)+Binary(offset mod 256)
else
compressed=compressed+"110"+Binary(count-1)+right(Binary(offset\256), 4)+Binary(offset mod 256)
end select
else
if length+count <= 264 then
compressed=compressed+newBytes
length=length+count
else
compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
compressed=compressed+newBytes
length=count
end if
end if

raw=raw+newBytes
if len(raw) > 32768 then
raw=right(raw, 32768)
end if
newBytes=""
wend

if length > 0 then
if length <= 8 then
compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
else
compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
end if
length=0
end if

out=GetFolderItem(file.name+" Compressed").CreateBinaryFile("type") //creates a binary file with the original file name plus "Compressed"
j=len(compressed) mod 8
out.WriteByte j

if left(compressed, 1) = "0" then
if j > 4 then
for i=7 downto j
newBytes=newBytes+"0"
next
compressed=left(compressed, 5)+newBytes+right(compressed, len(compressed)-5)
elseif j=4 then
compressed=left(compressed, 4)+"0000"+right(compressed, len(compressed)-4)
elseif j=0 then
compressed=compressed+"00000000"
end if
else
for i=7 downto j
newBytes=newBytes+"0"
next
if j=0 then
compressed=newBytes+compressed
elseif j < 3 then
compressed=left(compressed, j)+newBytes+right(compressed, len(compressed)-j)
elseif j < 5 then
compressed=left(compressed, 3)+newBytes+right(compressed, len(compressed)-3)
else
compressed=left(compressed, 7)+newBytes+right(compressed, len(compressed)-7)
end if
end if

max=ceil(len(compressed)/8)
out.WriteByte 0
out.WriteLong in.length
out.WriteLong max+10

for i=1 to max
byte=0
for j=7 downto 0
byte=byte+val(right(compressed,1))*pow(2,j)
compressed=left(compressed, len(compressed)-1)
next
out.WriteByte byte
next

in.close
out.close
end if

the guest

05 Aug 2005 18:40:46

Re: Explanation of Lemmings file compression

Can you give a semi-English explanation of what you do for compression?

Mindless

06 Aug 2005 00:54:35

Re: Explanation of Lemmings file compression

What language is this code in? It's not any syntax I recognize... although it looks alot like BASIC.

Edit: After a quick search, it appears to be REALbasic... :-/

the guest

06 Aug 2005 01:11:44

Re: Explanation of Lemmings file compression

Just curious: "Lemmingologist", you don't happen to be the author of LemEdit, are you?

I'm guessing this for no reason other than the fact that your knowledge appears to somewhat match the features and limitations of LemEdit, including the significant fact that you don't have info on the VGASPECx.DAT files, just like the last version of LemEdit we get also don't have support for them yet.

The reason we (well, at least I) want to know is because, well, LemEdit could use some fixes to work better on modern Windows machines......

Mindless

06 Aug 2005 04:38:39

Re: Explanation of Lemmings file compression

Sorry to break your theory, but (s)he can't be, since in order to compress .DATs (LemEdit does ;P) you MUST know how to create the correct checksum.

the guest

06 Aug 2005 15:03:52

Re: Explanation of Lemmings file compression

Good point. I honestly haven't been really trying to thoroughly read whatever (s)he posted. X_X

Lemmingologist

13 Aug 2005 11:18:59

Compression instructions

Start at the end of the uncompressed file, and repeat the following until reaching the beginning:

Read in bytes, one at a time, backwards toward the beginning, until a string (two bytes or longer) is found which has already been read. At the end of the compressed file, first indicate the number of non-repeating bytes that were read in before the repeated string, minus one (call this number n). This is done in the following manner:
If n is less than 8, write two 0-bits, then write n as a three-bit integer.
If n is greater than or equal to 8, write three 1-bits, then write n - 8 as an eight-bit integer.
For example, if 23 non-repeating bytes were read before the repeated string, then n = 23 - 1 = 22. 22 - 8 = 14, so the output to the compressed file would then be 111 followed by 14 in binary, which is 00001110.
Now append the non-repeating bytes to the compressed file.

Then, add two or three bits indicating the number of
repeating bytes, using the codes below:
� � �01 - 2 repeated bytes
� � �100 - 3 repeated bytes
� � �101 - 4 repeated bytes
� � �110 - anywhere from 2 to 256 repeated bytes. Follow this with the number of repeated bytes minus one, written as an eight-bit integer.

Next, write the offset of the repeated string, which is equal to the total number of bytes read so far from the uncompressed file, minus the number of bytes read before the previous occurrence of the repeated string.
The number of bits used to write the offset depends on the repeat indicator code used above:
� � 01 - offset is an eight-bit integer
� � �100 - offset is a nine-bit integer
� � �101 - offset is a ten-bit integer
� � �110 - offset is a twelve-bit integer
This means that the maximum offset is 255, 511, 1023, and 4095 for each of the above codes respectively. If the string does not occur within 4096 bytes of the current position in the file, it must not be counted as a repeated string.

After all of the data have been read, reverse the order of the bits (not bytes) in the compressed file. Then, at the beginning of the new compressed file, add the ten-byte header:
Byte 0: the remainder when the number of bits in the compressed file is divided by 8.
Byte 1: checksum. This is ignored by Lemmings, so a zero can be used here.
Bytes 2-5: four-byte big-endian integer indicating the size, in bytes, of the uncompressed file.
Bytes 6-9: four-byte big-endian integer indicating the size, in bytes, of the compressed file, including the ten-byte header.

Finally, some 0-bits need to be added near the end of the new compressed file. The placement of these bits depends on the value of byte 0 of the header:
00 - add eight 0-bits at the end of the file.
01 - if the last bit of the file is 1, add seven 0-bits before the last bit of the file. Otherwise, do nothing.
02 - if the last bit of the file is 1, add six 0-bits before the last two bits of the file. Otherwise, do nothing.
03 - if the last bit of the file is 1, add five 0-bits before the last three bits of the file. Otherwise, do nothing.
04 - add four 0-bits before the last four bits of the file.
05 - if the last bit of the file is 0, add three 0-bits before the last five bits of the file. If the last bit is 1, add three 0-bits before the last seven bits of the file.
06 - if the last bit of the file is 0, add two 0-bits before the last five bits of the file. If the last bit is 1, add two 0-bits before the last seven bits of the file.
07 - if the last bit of the file is 0, add one 0-bit before the last five bits of the file. If the last bit is 1, add one 0-bit before the last seven bits of the file.

ccexplore

13 Aug 2005 14:00:36

Re: Compression instructions

Read in bytes, one at a time, backwards toward the beginning, until a string (two bytes or longer) is found which has already been read.

Ok, that's really the only part I cared about, since it was the part that affects tradeoffs between performance and effectiveness.

It sounds like your code searches for all possible substrings, correct? So that in "abcdef", it will search "ab", "abc", "abcd" and "abcde"?

I'm curious what sorts of compression ratio you've gotten, and also how long it takes to compress data from the vgagr# files.

Mindless

13 Aug 2005 16:27:00

Re: Compression instructions

Byte 1: checksum. This is ignored by Lemmings, so a zero can be used here.

Hmmm...
Crap... (s)he's right... of course, my Lemmings game is cracked, so they may have the checksum checking code removed...

Mindless

13 Aug 2005 16:49:45

Re: Explanation of Lemmings file compression

If the last bit in the file is 0, then the instructions are determined as follows:
...

If the last bit in the file is 1, the instructions are slightly different:

...

By this do you mean the last bit in the .DAT file or the last bit in the compressed data section?

ccexplore

13 Aug 2005 16:52:05

Re: Explanation of Lemmings file compression

Yeah, interestingly it works apparently.

Then again, I don't think I ever said that it wouldn't work either. �In the code it definitely passes back the result of the checksum comparison, but whether the result was used or not I can't tell since I'm not going to go through all the places where the function is called and see whether the caller bothers to check the return value.

Anyhow, it's just a dirt-simple checksum. �Probably the simplest aspect of the whole thing.

Now of course, whether it works for LemEdit or not is another question which I haven' tested (not that it matters for me).

Mindless

13 Aug 2005 17:03:09

Re: Explanation of Lemmings file compression

Now of course, whether it works for LemEdit or not is another question which I haven' tested (not that it matters for me).

No checksum mismatch warnings from LemEdit...

Edit: Oooo... wow, I crashed LemEdit.

covox

14 Aug 2005 00:28:39

Re: Explanation of Lemmings file compression

Great stuff. Unsure quite what language the code is (at first glance it looks distinctly VBish), however thankfully it's quite pseudocodey. I can't thank you enough, this is indeed a Christmas miracle.

Currently I'm writing a C port of your proceedure (tried starting from scratch using your instructions, my method invoked far too many dynamic memory hissy fits :/). Hopefully I'm aiming toward some sort of multitool to export graphics, sound and level metadata (XML, almost done) to more modern file formats/organized hierachy. If anyone's in the mood for more source code I'm happy to post what I have. Well, maybe once I've cleaned it up a bit ;)

Mindless

14 Aug 2005 02:02:04

Re: Explanation of Lemmings file compression

Hopefully I'm aiming toward some sort of multitool to export graphics, sound and level metadata (XML, almost done) to more modern file formats/organized hierachy.

Could you elaborate on this a bit? I'm not sure what it is you're trying to do. XML?

ccexplore

14 Aug 2005 02:31:55

Re: Explanation of Lemmings file compression

Currently I'm writing a C port of your proceedure (tried starting from scratch using your instructions, my method invoked far too many dynamic memory hissy fits :/).

You can skip this step if you want, I already have working C++ code for compression and decompression. (But I hope you will credit me for it if you do decide to use it.) E-mail me for more info.

covox

14 Aug 2005 04:17:48

Re: Explanation of Lemmings file compression

Could you elaborate on this a bit? I'm not sure what it is you're trying to do. XML?

Ooh-er, could've explained myself better there ;P At some point, I'm going to have a shot at making an open-source extendable Lemmings clone. It'll be done in C and SDL, which means (hopefully) that porting and drop-in improvements should be trivial. I'm developing this on a gcc/linux base, which should compile under Windows with mingw32. No idea if VS6/VS.NET/Microsoft Magic Mystery Compiler will touch it however ;/

The Pingus guys have done a great job, however I'm aiming toward something more lightweight and ultra-portable. Oh, and ClanLib is waaay too large to stick onto a PDA. Heaven help you if you change versions :/

To start with, I'm just trying to make the game data manageable. So all the terrain/animations will be converted to indexed PNG, and the level information converted to XML documents. XML is nice because both people and computers can read it :)

Each level will be stored in an XML file, and there should be some mechanism which refreshes the index (containing all the levels/different games) each time you run the program. For example, if you were to develop a level called "Taste the golden spray", and you want it positioned as Level 12 of Fun on a levelset called "My Generic Custom Game", you could add this XML to the appropriate file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Games available are Classic, ONML, Your_game_name_here -->
<LemLevel game="My Generic Custom Game" difficulty="1" level="12">
  <title>Taste the golden spray!</title>
  <settings levelset="2" release="80" save="80" speed="20" time="180"/>
  <skills climber="0" floater="0" bomber="0" blocker="0" builder="0" basher="0" miner="0" digger="0"/>

  <!-- BEGIN level metadata -->
  <Interactives>
    <interactive piecenum="4" xoffset="0" yoffset="0" flag_ground="0" flag_nooverdraw="0" flag_upsidedown="1" />
    ...
  </Interactives>
  <Terrains>
    <terrain piecenum="12" xoffset="0" yoffset="0" flag_black="1" flag_nooverdraw="0" flag_upsidedown="0" />
    ...
  </Terrains>
  <Steels>
    <steel xoffset="0" yoffset="0" xwidth="16" ywidth="16" />
    ...
  </Steels>
</LemLevel>

It's very preliminary at the moment, any suggestions are more than welcome. There's a good chance that the schema will be extended at some point to include a broader definition of interactive objects (think the player-controlled cannons in Lemmings 2).

You can skip this step if you want, I already have working C++ code for compression and decompression. (But I hope you will credit me for it if you do decide to use it.) E-mail me for more info.

Thanks greatly for the offer, however I'd feel more confident if all the code is pure C. But then again, I haven't had time to compile test my attempt yet, so I may come crying back :)

ccexplore (not logged in)

14 Aug 2005 04:55:45

Re: Explanation of Lemmings file compression

Thanks greatly for the offer, however I'd feel more confident if all the code is pure C. But then again, I haven't had time to compile test my attempt yet, so I may come crying back :)

For the decompression I actually have a C version. I don't have and can't give you a C++ version for the compression though, because it uses <list>, and C, of course, lacks any sort of standardized library for collections.

Then again, since you mentioned "open source", it may be best anyway if you write the code yourself based on Lemmingologist's posts (mine might be a bit too close to what's in the original game itself to be safely open-sourced).

Mindless

14 Aug 2005 05:44:04

Re: Explanation of Lemmings file compression

I don't have and can't give you a C++ version for the compression though, because it uses <list>, and C, of course, lacks any sort of standardized library for collections.

That'd be why I haven't ported the compression code to FreeBasic (I gave up on QBX because of memory issues). I have no idea what list is...

FreeBasic is awesome: the power of C, yet the syntax of QB (and they're not finished yet)

Lemmingologist

20 Aug 2005 13:36:29

Compression code

function Binary(byte as integer) as string
dim i, b, power as integer
dim bin as string

b=byte
for i=7 downto 0
power=pow(2,i)
bin=bin+str(b\power)
b=b mod power
next

return bin
end function

function FindRepeatedBytes(str1 as string, str2 as string, maxDistance as integer) as integer
dim pos, curInStr, skip as integer
dim source, find as string

source=str1
find=str2
do
//inStr is a search function which returns 0 if the string to be found does not occur,
//and n+1 if n characters precede its first occurrence
curInStr=inStr(source, find)
if curInStr > 0 then
if curInStr mod 8 = 1 then
pos=skip+curInStr\8+1
end if
skip=skip+ceil(curInStr/8)
source=right(source, len(source)-ceil(curInStr/8)*8)
end if
loop until pos > 0 or curInStr=0

return pos
end function

function Reverse(bytes as string) as string
dim i as integer
dim b, n as string

b=bytes
for i=1 to len(bytes)\8
n=n+right(b, 8)
b=left(b, len(b)-8)
next

return n
end function

function XOR(byte1 as integer, byte2 as string) as integer
dim i, b1, power, result as integer
dim b2 as string

b1=byte1
b2=byte2
for i=7 downto 0
power=pow(2,i)
if str(b1\power) <> left(b2, 1) then
result=result+power
end if
b1=b1 mod power
b2=right(b2, i)
next

return result
end function

dim i, j, byte, position, readPosition, count, offset, length as integer
dim newBytes, compressed, raw as string
dim in, out as binaryStream
dim file as folderItem
dim stop as boolean

file=GetOpenFolderItem("type")
if file <> nil then
in=file.OpenAsBinaryFile(false)
readPosition=in.length-1

while not stop
ProgressBar1.value=in.position/in.length*1000\1
count=0
do
stop=readPosition < 0
if not stop then
in.position=readPosition
byte=in.ReadByte
readPosition=readPosition-1
newBytes=Binary(byte)+newBytes
end if
offset=position+count-2
if not stop then
position=FindRepeatedBytes(raw, newBytes)
count=count+1
end if
loop until position=0 or count > 256 or stop
stop=readPosition < 0

if len(newBytes) > 16 then
if stop then
if position=0 then
count=count-1
newBytes=left(newBytes, 8)
end if
else
readPosition=readPosition+1
count=count-1
newBytes=right(newBytes, len(newBytes)-8)
end if

if length > 0 then
if length <= 8 then
compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
else
compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
end if
length=0
end if

if count=2 and offset < 256 then
compressed=compressed+"01"+Binary(offset)
elseif count=3 and offset < 512 then
compressed=compressed+"100"+right(Binary(offset\256), 1)+Binary(offset mod 256)
elseif count=4 and offset < 1024 then
compressed=compressed+"101"+right(Binary(offset\256), 2)+Binary(offset mod 256)
else
compressed=compressed+"110"+Binary(count-1)+right(Binary(offset\256), 4)+Binary(offset mod 256)
end if

if stop and position=0 then
if length < 264 then
compressed=compressed+Reverse(newBytes)
length=length+1
else
compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
compressed=compressed+Reverse(newBytes)
length=1
end if
end if
else
if length+count <= 264 then
compressed=compressed+Reverse(newBytes)
length=length+count
else
compressed=left(compressed, len(compressed)-264*8)+"11111111111"+right(compressed, 264*8)
compressed=compressed+Reverse(newBytes)
length=count
end if
end if

raw=left(newBytes+raw, 32768)
newBytes=""
wend

if length > 0 then
if length <= 8 then
compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
else
compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
end if
end if

out=GetFolderItem(file.name+" Compressed").CreateBinaryFile("type")
j=len(compressed) mod 8
out.WriteByte j

for i=7 downto j
newBytes=newBytes+"0"
next
if left(compressed, 1) = "0" then
if j < 4 then
compressed=newBytes+compressed
elseif j=4 then
compressed=left(compressed, 4)+newBytes+right(compressed, len(compressed)-4)
else
compressed=left(compressed, 5)+newBytes+right(compressed, len(compressed)-5)
end if
else
if j=0 then
compressed=newBytes+compressed
elseif j <= 4 then
compressed=left(compressed, j)+newBytes+right(compressed, len(compressed)-j)
else
compressed=left(compressed, 7)+newBytes+right(compressed, len(compressed)-7)
end if
end if

length=len(compressed)\8
for i=1 to length
byte=XOR(byte, left(right(compressed, i*8), 8))
next
byte=byte\128+(byte\64 mod 2)*2+(byte\32 mod 2)*4+(byte\16 mod 2)*8+(byte\8 mod 2)*16+(byte\4 mod 2)*32+(byte\2 mod 2)*64+(byte mod 2)*128
out.WriteByte byte
out.WriteLong in.length
out.WriteLong length+10

for i=1 to length
byte=0
for j=7 downto 0
byte=byte+val(right(compressed,1))*pow(2,j)
compressed=left(compressed, len(compressed)-1)
next
out.WriteByte byte
next

in.close
out.close
end if

Lemmingologist

22 Aug 2005 10:21:44

Re: Compression instructions

Ok, that's really the only part I cared about, since it was the part that affects tradeoffs between performance and effectiveness.

It sounds like your code searches for all possible substrings, correct? �So that in "abcdef", it will search "ab", "abc", "abcd" and "abcde"?

I'm curious what sorts of compression ratio you've gotten, and also how long it takes to compress data from the vgagr# files.

Yes, if "abcdef" is a repeated string, it will search for all of those substrings until it reaches "abcdefg", which has not occurred before in the file. It will then back up one space in the uncompressed file, remove the "g", and add code for the repeating "abcdef". I think my FindRepeatingBytes function is highly inefficient, though, as it searches through the file from beginning to end, when it is really looking for the last occurrence of the substring. This is probably why it takes a fairly long time (several minutes) to compress the VGAGR files. The compressed files are much smaller than the ones generated by LemEdit, as LemEdit only looks for repeating strings which are 16 bytes or less, and my compressor takes advantage of every bit in the "repeat number byte" as well as every bit allocated to the offset. They are still longer than the original compressed files, however. The Lemmings compressor must have been more clever about where to begin and end repeated substrings, unlike my "greedy" algorithm, which simply looks for the longest possible string starting from wherever it happens to be in the file.

Lemmingologist

22 Aug 2005 10:32:40

Re: Explanation of Lemmings file compression

By this do you mean the last bit in the .DAT file or the last bit in the compressed data section?

The compressed data section. I suppose I should have said "block", rather than "file".

ccexplore (not logged in)

22 Aug 2005 11:05:14

Re: Compression instructions

the substring. This is probably why it takes a fairly long time (several minutes) to compress the VGAGR files.

Several minutes?!? X_X I guess that won't do for end-user purposes. ;) (Yes, I understand it's never intended as such.)

I still like to know how yours compared with DMA's (ie. the original DAT files from the game). Are yours only larger by a few bytes, or a lot more?

As a reference, the algorithm I have is generally a few bytes larger for level DAT files, about 20% larger for VGASPEC files, and unfortunately, nearly twice as large for VGAGR files. On the other hand, it operates with no perceptible delay (ie. under a second or less).

(Though as a practical matter, you can probably get away with not doing any real compression at all, so I guess this is more an academic exercise than anything. :-/)

Lemmingologist

22 Aug 2005 12:35:48

Re: Compression instructions

Several minutes?!? �X_X �I guess that won't do for end-user purposes. �;) (Yes, I understand it's never intended as such.)

I still like to know how yours compared with DMA's (ie. the original DAT files from the game). �Are yours only larger by a few bytes, or a lot more?

As a reference, the algorithm I have is generally a few bytes larger for level DAT files, about 20% larger for VGASPEC files, and unfortunately, nearly twice as large for VGAGR files. �On the other hand, it operates with no perceptible delay (ie. under a second or less).

(Though as a practical matter, you can probably get away with not doing any real compression at all, so I guess this is more an academic exercise than anything. �:-/)

My files are (roughly) 1% larger for VGAGR, 4% larger for levels, and 7% larger for VGASPEC. Obviously, this is not worth all the extra time my program takes. The slow speed is probably due to the fact that I'm using a high-level framework and a general-purpose search function, though. You could probably afford to increase your level of compression without sacrificing too much speed.

Lemmingologist

23 Aug 2005 12:19:28

Decompression code (bug fix)

dim byte, bytePos as integer
dim in, out as binaryStream

function ReadBits(count as integer) as integer
dim i, sum as integer

for i=count-1 downto 0
sum=sum+pow(2,i)*(byte mod 2)
byte=byte\2
if bytePos < 7 then
bytePos=bytePos+1
else
bytePos=0
byte=in.ReadByte
in.position=in.position-2
end if
next

return sum
end function

dim i, bits, start, blockSize, compressedBlockSize, fileSize, itemCount, repeat, offset, length as integer
dim stop, endOfFile as boolean
dim file as folderItem

file=GetOpenFolderItem("type") //displays "Open File" dialog box
fileSize=0
itemCount=0
if file <> nil then
while not endOfFile
in=file.OpenAsBinaryFile(false)
out=GetFolderItem("temp").CreateBinaryFile("type") //creates a binary file named temp
in.position=fileSize
start=in.ReadByte
bits=in.ReadByte
blockSize=in.ReadLong
compressedBlockSize=in.ReadLong
in.position=fileSize+compressedBlockSize-1
byte=in.ReadByte
bytePos=0
in.position=in.position-2

if start > 0 then
if ReadBits(1)=0 then
if start < 4 then
bits=ReadBits(9-start)
length=ReadBits(3)
elseif start=4 then
length=ReadBits(3)*2
bits=ReadBits(4)
length=length+ReadBits(1)
else
length=ReadBits(4)
bits=ReadBits(8-start)
end if
else
if start < 4 then
bits=ReadBits(10-start)
length=ReadBits(8)+8
elseif start=4 then
bits=ReadBits(2)
length=ReadBits(1)*128
bits=ReadBits(4)
length=length+ReadBits(7)+8
else
bits=ReadBits(2)
length=ReadBits(4)*16
bits=ReadBits(8-start)
length=length+ReadBits(4)+8
end if
end if
else
bits=ReadBits(8)
if ReadBits(1)=0 then
length=ReadBits(4)
else
bits=ReadBits(2)
length=ReadBits(8)+8
end if
end if

while in.position >= fileSize+5
for i=0 to length
bits=ReadBits(8)
out.WriteByte bits
next
stop=(in.position < fileSize+5)

while not stop
if ReadBits(1) = 0 then //0
if ReadBits(1) = 0 then //00
length=ReadBits(3)
stop=true
else //01
repeat=1
offset=ReadBits(8)
end if
else //1
if ReadBits(1) = 0 then //10
if ReadBits(1) = 0 then //100
repeat=2
offset=ReadBits(9)
else //101
repeat=3
offset=ReadBits(10)
end if
else //11
if ReadBits(1) = 0 then //110
repeat=ReadBits(8)
offset=ReadBits(12)
else //111
length=ReadBits(8)+8
stop=true
end if
end if
end if

if not stop then
for i=0 to repeat
out.position=out.position-offset-1
bits=out.ReadByte
out.position=out.position+offset
out.WriteByte bits
next
end if
wend
wend

in.close
out.close
in=GetFolderItem("temp").OpenAsBinaryFile(false)
out=GetFolderItem(file.name+" Item "+str(itemCount)).CreateBinaryFile("type")
in.position=in.length-1
for i=1 to in.length-blockSize
bits=in.ReadByte
in.position=in.position-2
next
for i=1 to blockSize
out.WriteByte in.ReadByte
in.position=in.position-2
next
in.close
out.close
GetFolderItem("temp").delete

fileSize=fileSize+compressedBlockSize
if fileSize >= file.length then
endOfFile=true
end if
itemCount=itemCount+1
wend
end if

Lemmingologist

23 Aug 2005 13:18:51

Re: Explanation of Lemmings file compression

I have updated my compression and decompression code (above on this page). The changes include:
� Checksum calculation (thanks to ccexplore)
� Support for 12-bit offsets with less than 5 repeating bytes
� Fixed a bug in compressor which sometimes caused the first byte of the file to be encoded incorrectly.
� Increased compression speed by optimizing searches for repeated strings