Fast and Easy Data Compressor

Data compression is easy if you know some theory behind, like Markov models and such. However, although the theory is kind of complex, compressing data is not so difficult if you take the direct results from the theory. For example, compressing means “predicting” the next character after a set of characters given as input, based on the current history of the document. If you predict the character correctly, you don’t have to output anything, as it follows the “pattern” or “rules” of the text. Suppose we have an input document composed of n characters, c0 to cn−1. Suppose also that we are processing the document at character i, that is, ci. We maintain a 2D-matrix, P with the predictions for a given pair of characters. Then, the compression is simple: if P(ci−2,ci−1) = ci then, don’t output anything (the actual character has been predicted); if not, output the character ci literally, and update P so that the next time P (ci−2, ci−1) = ci. Surprisingly, this simple mechanism gives fair compression ratios. However, you have to be careful with the codification of the compressed text, because you have to be able to tell whether a character has been predicted or not. For that purpose, the output is divided in groups, G0 to Gm, with m = ⌈n⌉. Each group Gk contains six elements (bytes) and a descriptor byte: 6 Gk =⟨bk,g0k,g1k,g2k,g3k,g4k,g5k⟩ Thus, for the first group, g0 refers to c0 , g10 to c1 and so on. For the Gk group, g0k refers to c6∗k, g1k refers to c6∗k+1, and so on. The value of each gjk is as follows: { gjk = λ if P (c6∗k+j−2, c6∗k+j−1) = c6∗k+j c6∗k+j if P (c6∗k+j−2, c6∗k+j−1) ̸= c6∗k+j Note that λ means that no character is produced, so that the actual size of the group can be less than 7 bytes (producing compression). Notealsothatgjk areonlyvalidforthosekandjsothat6∗k+j<n.Thismeansthelastgroup can be incomplete to suit real file length. bk is defined as follows: bk = 64 + Suppose also that c−1 = 0 and c−2 = 0. ∑ i|gik =λ The purpose of this problem is to build a compressor following this specification. Input The input is formed by a series of n bytes (characters) finalized by an EOF. That is, you have to take all the input as the text to be compressed (including spaces and carriage return). 2i

2/2 Output The output is the set of m = ⌈n⌉ groups (G0 ...Gm−1) in form of ASCII characters, as described previously. 6 Sample explanations: In spite of the description above, consider the Sample Input below as two different files (one per non-empty line) just to illustrate the way the compressor works. Then the results at the Sample Output depends only of the corresponding line. Case 1: In the second group (the one starting with “q”) the first “o” is predicted because in the first “lo”, space after “l” was followed by an “o”. Also, the 5th and 6th characters of this group are predicted. Therefore, the descriptor byte is: “q” = ASCII(113) = 1110001b. The output is 23 bytes long, and the input 24 bytes long, so in this short text, the algorithm compressed 1 bytes out of 24 (4.16% compression). Case 2: In this case, 10 bytes are compressed (20.4% compression). Sample Input a lo loco lo coloco lola football is football and basketball is basketball Sample Output @a lo lqco @ colocI lla @footba@ll is |foGand@ baskettblVibs_lA