Humanzip copyright (C) 2007 Matthew Strait

*** Purpose and Description ***

humanzip is a compression program that operates on text files.  Unlike 
most compression algorithms, its output is human readable.  Indeed, it 
is explicitly meant to be read by humans and might even be easier to read 
than the original.

humanzip compresses files by looking for common strings of words and 
replacing them with single symbols. The idea is to reduce the screen and 
print size of documents.  humanzip does not explicitly try to reduce the 
size of the file as measured in bytes, although this usually happens 
incidentally.  For instance, lines in a file that looks like this:

	This is a test, please panic.
	This is a test, hide under your desk.
	This is a test, close the curtains.

might be converted to:

	Å, please panic.
	Å, hide under your desk.
	Å, close the curtains.

A key is included at the top of the compressed file.  It is always given 
with lines like "å/Å - this is a test", where å represents "this is a 
test" and Å represents "This is a test".
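
As a rough illustration of the key format described above, here is a 
minimal Python sketch.  It is not taken from humanzip's source; the 
function names (parse_key_line, expand, abbreviate) are made up for this 
example, and it assumes the uppercase symbol always stands for the phrase 
with its first letter capitalized.

```python
def parse_key_line(line):
    """Turn a key line such as 'å/Å - this is a test' into a symbol table."""
    symbols, phrase = line.rstrip("\n").split(" - ", 1)
    lower_sym, upper_sym = symbols.split("/")
    # Assumption: the lowercase symbol maps to the phrase as written and the
    # uppercase symbol maps to the same phrase with a capitalized first letter.
    return {lower_sym: phrase, upper_sym: phrase[0].upper() + phrase[1:]}


def expand(text, table):
    """Replace every symbol in text with the phrase it stands for."""
    for symbol, phrase in table.items():
        text = text.replace(symbol, phrase)
    return text


def abbreviate(text, table):
    """Inverse of expand: replace each phrase with its symbol."""
    for symbol, phrase in table.items():
        text = text.replace(phrase, symbol)
    return text


table = parse_key_line("å/Å - this is a test")
print(abbreviate("This is a test, please panic.", table))
print(expand("Å, hide under your desk.", table))
```

Plain string replacement like this ignores word boundaries, so treat it 
only as a picture of the idea, not as a substitute for humanunzip.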

Don't expect dramatic compression here.  Most files will be reduced by 5-15%.

humanunzip will (in theory) restore files exactly to their original 
state. I've included in this package a very short shell script called 
"dotest.sh" to test this.

If you don't like the exact way that humanzip has chosen abbreviations, 
you can change it around by hand.  As long as you follow the format, 
humanunzip will still work.

Interestingly, sometimes you'll end up with a slightly smaller file if 
you humanzip it and then [gb]zip it than if you just [gb]zip it.  But 
sometimes it goes the other way, and in any case it's almost never going 
to be worth the trouble.

*** Original motivation ***

I compile and print a listing of all Magic: The Gathering cards several 
times a year so that I can carry it around with me when I play.  There 
are upwards of 8500 different cards, each with an average of ~150 
characters of text.  Even if I use a very small font and take care to 
avoid wasting space on the paper, the result is well over 100 pages.

However, this text is very repetitive.  The word "creature" appears over 
10,000 times, and "target" over 3600 times, for instance.  There are 
roughly 1300 repetitions of "until end of turn" and 900 of "comes into 
play" and so forth.  By replacing these words and phrases with single 
characters, the print size can be reduced by ~13%.  (I also use some 
other tricks which are specific to Magic to reduce it somewhat further. 
For instance, each time the name of a card appears in that card's entry 
after the first time, I replace it with "«".)

Once you've gotten used to it, "T: « deals 1 Đ to Ŧ Ĉ or ƣ" is no harder 
to read than "T: Prodigal Sorcerer deals 1 damage to target creature or 
player", and maybe easier because your eyes don't have to move as far.

I imagine that there are other people out there with lengthy, repetitive 
text which might benefit from some human readable compression.  While of 
course text on any topic can be operated on by humanzip, I envision it 
being used mainly on technical documents.  (Not only does it not work as 
well on narrative, since it is less repetitive, but I'm guessing most 
people would just be annoyed.)

*** Limitations ***

The programs you use for manipulating text must support UTF-8.  If 
between these quotation marks: "Å", you do not see capital A with a 
small circle above it, you're in trouble.  Fortunately, modern GNU/Linux 
distributions have full support for UTF-8, so you're probably fine.

Because humanzip uses non-ASCII UTF-8 characters to replace the strings 
it finds, files that already contain such characters cannot be 
compressed.  In fact, humanzip will refuse to compress any file that has 
any non-ASCII characters (that is, bytes whose most significant bit is 
1).  In the future, humanzip might be able to work around this by just 
not using any characters that are already there, but right now it just 
plays it safe.
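
The check described above amounts to looking at the high bit of every 
byte.  Here is a small Python sketch of that test, plus one guess at 
what the possible future workaround could look like.  Neither function 
comes from humanzip itself; the names is_pure_ascii and unused_symbols 
are hypothetical.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if no byte has its most significant bit set."""
    return all(byte < 0x80 for byte in data)


def unused_symbols(text: str, candidates: str) -> list:
    """Sketch of the possible workaround: keep only the candidate
    replacement characters that do not already occur in the text."""
    return [ch for ch in candidates if ch not in text]


print(is_pure_ascii(b"This is a test, please panic."))
print(is_pure_ascii("Å, please panic.".encode("utf-8")))
print(unused_symbols("Å, please panic.", "ÅĐŦ"))
```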

humanzip is optimized to work on English text.  This means two things: 
(1) it has a blacklist of English 'grammar' words that would be 
annoying to replace and (2) it looks for English-style plurals so that 
it can treat the singular and plural as the same word.  Neither of these 
will prevent it from working on non-English text, but the result will 
not be as pleasing in some cases.
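
To give a feel for what these two English-specific steps involve, here 
is a deliberately naive Python sketch.  The word list and the plural 
rule are stand-ins invented for this example; humanzip's real blacklist 
and plural handling are not reproduced here and are likely more careful.

```python
from collections import Counter

# Hypothetical blacklist; not the list that humanzip actually uses.
GRAMMAR_WORDS = {"the", "a", "an", "of", "to", "and", "or", "is"}


def singular_form(word):
    """Naive plural folding: drop one trailing 's', but not from 'ss'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def count_candidates(text):
    """Count words, skipping blacklisted ones and folding plurals together."""
    counts = Counter()
    for raw in text.split():
        word = raw.strip('.,;:!?"()').lower()
        if not word or word in GRAMMAR_WORDS:
            continue
        counts[singular_form(word)] += 1
    return counts


print(count_candidates("Target creature and target creatures."))
```

On non-English text the blacklist simply never matches and the plural 
rule fires on the wrong words, which is consistent with the behaviour 
described above: it still works, just less pleasingly.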
