What is Data Compression?


Word in Binary

Data Compression in Technology:

Though you may not realise it, data compression affects most aspects of computing today. In fact, many websites use compression to reduce the amount of data they send, which saves both bandwidth and time.

As a developer, you're probably familiar with archive utilities that compress files into archives with one of these extensions:

  • Ace
  • Rar
  • Zip
  • BZ2
There are links to some libraries with source code.

Compressing Data, Music, Speech and Video:

If you visit a website, the picture files that your browser downloads (GIFs, JPGs and PNGs) are all compressed, as is video on DVD or online.

Windows also allows you to compress drives or folders if they are formatted with NTFS. A music file on a CD may be 20 MB in size. If converted to MP3, it would shrink down to perhaps 3 MB thanks to compression. Other areas of compression include speech, which is a particularly important technology for cell phones as it allows more traffic, and video compression, which is a key part of cable TV, DVDs and portable game consoles.

Lossless or Lossy Compression?:

There are two types of compression.

  • Lossless
  • Lossy
Lossless compression guarantees that what is compressed can be recovered without any data loss. As a developer, you wouldn't be too impressed if your backup archive corrupted your files.
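
As a small illustration of that guarantee, here is a minimal sketch using Python's standard zlib module. The sample text is made up for the example; the point is only that the decompressed bytes match the original exactly.

```python
import zlib

# Made-up sample data: any bytes would do.
original = b"Backups must come back byte for byte. " * 20

compressed = zlib.compress(original)     # lossless DEFLATE compression
restored = zlib.decompress(compressed)   # exact inverse of compress()

print(len(original), "->", len(compressed), "bytes")
print("identical after round trip:", restored == original)
```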

Lossy compression, though, is used for photographs and video, where some loss of image quality is unnoticeable and so acceptable. An uncompressed photograph can easily be compressed as a JPG to half its size or smaller without the loss being noticed. Too much compression, though, and it starts pixelating, with small square blobs appearing.

How does Compression Work?:

In one word: redundancy. Extra information that is unnecessary is removed. Take a look at this sentence. "Yu cn undrstnd ths, evn wth mny ltrs msng". Hopefully you could make out that it said "You can understand this, even with many letters missing".

Those missing letters are redundant in English, as you can still understand the sentence without them. In 8-bit text, where each character can take 2^8 = 256 different values, fewer than 80 are needed to write English.
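
To see redundancy at work, the sketch below compares how well Python's standard zlib module compresses English text and random bytes of the same length. The repeated sentence and the helper name compressed_ratio are assumptions chosen for the demonstration. Repeating one sentence exaggerates the effect compared with ordinary prose.

```python
import random
import zlib

def compressed_ratio(data: bytes) -> float:
    """Return the compressed size as a fraction of the original size."""
    return len(zlib.compress(data, 9)) / len(data)

# Highly redundant input: one English sentence repeated many times.
english = b"You can understand this, even with many letters missing. " * 100

# Random bytes of the same length have no redundancy to remove.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(english)))

print(len(set(english)), "distinct byte values used out of 256")
print(f"English text: {compressed_ratio(english):.1%} of original size")
print(f"Random bytes: {compressed_ratio(noise):.1%} of original size")
```

The random data should come out no smaller than it went in, because there is nothing unnecessary to remove.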

Compressing English Text:

Using a-z, A-Z, 0-9, space, comma, semi-colon, dash, colon and a few other characters, only about 80 characters out of 256 are needed: less than a third of the possible values.

If we restrict ourselves to 7 bits instead of 8, we can compress losslessly by 12.5%. The "Word in Binary" example shows the word "Word" with its binary ASCII representation. This is 32 bits long (4 x 8). By dropping the first bit of each character (always 0 in ASCII), we end up with 28 bits. To be usable, the text needs to be at least 8 bytes long (8 characters compress to 7 bytes), but most files are longer than 8 bytes!
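
The bit-dropping idea can be sketched in a few lines of Python. The function names pack7 and unpack7 are made up for this illustration, and the sketch assumes plain ASCII input. The character count has to be stored separately, because the last byte may contain padding bits.

```python
def pack7(text: str) -> bytes:
    """Pack ASCII characters into bytes using 7 bits per character."""
    data = text.encode("ascii")           # raises if a character needs 8 bits
    bits = "".join(f"{b:07b}" for b in data)
    bits += "0" * (-len(bits) % 8)        # pad the final byte with zeros
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack7(packed: bytes, count: int) -> str:
    """Recover `count` characters from the output of pack7()."""
    bits = "".join(f"{b:08b}" for b in packed)
    return "".join(chr(int(bits[i * 7:i * 7 + 7], 2)) for i in range(count))

packed = pack7("Wordword")
print(len("Wordword"), "characters ->", len(packed), "bytes")
print(unpack7(packed, 8))
```

Note that the 4-character "Word" still occupies 4 whole bytes after packing (28 bits rounded up to 32), which is why the saving only shows up once the text reaches 8 characters.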

Better Compression with LZW:

This isn't a very good method as compression goes. Other lossless compression methods such as LZW can shrink repetitive text down to 20% or less of its size. Dropping one bit only shrinks it to 87.5% of its size, a saving of just 12.5%!

LZW (short for Lempel, Ziv and Welch, the names of its inventors) is a dictionary approach to compression: it builds a dictionary of phrases it has already seen and replaces repeats with short codes. It was patented for around 17 years, and the last of the patents ran out in 2004. The patents were the subject of much controversy because the GIF format (introduced by CompuServe in the late 80s) used LZW compression.
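
The dictionary idea can be illustrated with a minimal textbook-style sketch. It assumes every input character has a code point below 256, and it outputs a list of integer codes rather than a packed bit stream, so it is not what GIF or any particular library does byte for byte. The function names are made up for the example.

```python
def lzw_compress(text: str) -> list:
    """Encode text as a list of dictionary codes."""
    table = {chr(i): i for i in range(256)}   # start with all single characters
    phrase = ""
    codes = []
    for ch in text:
        if phrase + ch in table:
            phrase += ch                       # keep extending a known phrase
        else:
            codes.append(table[phrase])        # emit the longest known phrase
            table[phrase + ch] = len(table)    # learn the new, longer phrase
            phrase = ch
    if phrase:
        codes.append(table[phrase])
    return codes

def lzw_decompress(codes: list) -> str:
    """Rebuild the text, regenerating the same dictionary on the fly."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        out.append(entry)
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)

sample = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(sample)
print(len(sample), "characters ->", len(codes), "codes")
print(lzw_decompress(codes) == sample)
```

The decompressor never receives the dictionary. It rebuilds the same one from the codes as it goes, which is what makes the scheme practical.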

Patent Problems:

Unisys, which held the LZW patent, one day announced that anyone developing software to read or write GIF files must pay licence fees. It's because of this that the PNG format was developed. GIF, though, was an excellent format for outputting charts or text, as it used lossless compression.

A few years ago I wrote a DLL that provided graphical output for a website. It could output either GIFs or JPGs. The same chart was 9KB in size as a GIF file but 45KB as a JPG. The GIF looked better as well! LZW is still used in Adobe Acrobat (PDF) files. The patent has now expired.

Other Compression Techniques:

There are a variety of compression techniques around, and several libraries, both open source and commercial, are available to implement compression.

How does Lossy Compression Work?:

It varies. MP3 discards sounds that the human ear is unlikely to notice, while JPEG performs a mathematical transformation (the discrete cosine transform) on the image in small 8 x 8 pixel sections, then discards or coarsens the values that fall beyond a quality cutoff. With lower quality settings, artifacts start to appear. This is a bit like pixelation in video images.
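
The transform-then-discard idea can be sketched on a single row of eight pixel values. This is a simplified illustration, not the JPEG algorithm itself. It uses a one-dimensional cosine transform and simply zeroes the weakest coefficients, whereas real JPEG works on two-dimensional blocks and uses quantization tables. The pixel values and function names are made up for the example.

```python
import math

def dct(block):
    """Orthonormal DCT-II of a list of samples."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(block)))
    return out

def idct(coeffs):
    """Inverse of dct() (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1 if k == 0 else 2) / n) * c *
                       math.cos(math.pi * (i + 0.5) * k / n)
                       for k, c in enumerate(coeffs)))
    return out

def lossy(block, keep):
    """Zero all but the `keep` largest-magnitude coefficients, then rebuild."""
    coeffs = dct(block)
    ranked = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(ranked[:keep])
    pruned = [c if k in kept else 0.0 for k, c in enumerate(coeffs)]
    return idct(pruned)

pixels = [52, 55, 61, 66, 70, 61, 64, 73]    # one row of made-up grey levels
print([round(v) for v in lossy(pixels, 8)])  # every coefficient kept
print([round(v) for v in lossy(pixels, 3)])  # only the 3 strongest kept
```

Keeping all eight coefficients reproduces the row, while keeping only a few gives an approximation. The fewer coefficients kept, the smaller the data and the more visible the artifacts.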

There are two MP3 libraries in the C Code Library.

Compression Today:

To keep devices small but still powerful, compression techniques must be used. Here are some examples:

  • Birthday cards that sing "Happy Birthday" when you open them.
  • Cartridge games for games consoles.
  • Films on DVD.
  • Digital still and video cameras, to increase storage capacity.
Compression is used increasingly everywhere in electronics.

But it's not over yet. The present-day algorithms for both lossy and lossless compression are good, but there are probably better ones still waiting to be discovered. Invent a better compression algorithm and you could become rich and famous.

©2014 About.com. All rights reserved.