Image Compression with Multipixels

Published 27 Feb 2016
Figure 0. Original image, and different levels of compression with multipixels.

Digital images, depending on their quality, can take up huge amounts of storage space, and the number of imaging devices in use keeps increasing. Because transfer rates are limited and bandwidth has a cost, reducing the number of bits needed to represent an image is an important problem. Several compression algorithms take advantage of statistical redundancy, or of the limited capabilities of the human eye, to encode the same information using fewer bits. One such algorithm consists of creating pixels of different sizes for areas that have roughly the same color, thus reducing the total number of pixels. In this post, I present an implementation of that algorithm, along with encoding and decoding routines for a proposed set of storage rules that reduce the footprint of such images. All the code for encoding and decoding, as well as a nice visualization script, is available on the Github repository for the project.

Introduction

Data compression is the use of mathematical tools to describe a given set of data with a set of rules that is smaller than the data itself. There are two kinds of compression: lossless and lossy. As their names indicate, the former refers to methods in which no data is lost during compression, while the latter refers to processes that approximate the original data well enough to be considered equivalent, even though, strictly speaking, information is lost and cannot be recovered. Lossy compression is widely used for sound and images because, to a human observer, the compressed and uncompressed signals are virtually identical. Lossless compression plays a more important role when every bit of information has to be analysed or processed. Compression algorithms have been used for a wide variety of purposes, but mainly to reduce the storage space of files or to boost data transfer rates. Because storage space and total data transfer can usually be translated into costs, these algorithms provide a direct and measurable benefit.

There are several ways of compressing a set of data. One is to approximate a function with a series of simpler terms, such as a trigonometric Fourier series, as is done in the JPEG algorithm. The image is treated as a two-dimensional function and decomposed into a series of periodic functions with different frequencies and amplitudes; this is possible because every periodic function can be represented as an infinite sum of sines and cosines. Multilayer neural networks can also be used for this purpose, since they provide a means of reproducing an output given a specific input, and there are learning algorithms such as backpropagation that reduce the error between the desired and the actual output.

What these methods do is eliminate redundancy in the data set. One technique for doing so without approximation functions is called pattern substitution: it consists of identifying redundant patterns in the data and substituting them with a reduced form or code.

Method

For two-dimensional discrete signals like images, an example of a pattern substitution method is to group together a set of adjacent pixels of roughly the same color, creating a multipixel that spans several single pixels in both directions, horizontal and vertical. To identify every multipixel in the image, each one is tagged with 7 properties instead of the 3 colorspace components a single pixel usually has (a minimal data structure holding these properties is sketched after the list). Those 7 properties are:

  • Horizontal coordinate of the top-left corner
  • Vertical coordinate of the top-left corner
  • Width of the multipixel
  • Height of the multipixel
  • Red component of the multipixel’s color
  • Green component of the multipixel’s color
  • Blue component of the multipixel’s color
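
As a concrete illustration, here is one way the 7 properties could be held in code. This is only a sketch: the field names and types are my own and do not necessarily match the implementation in the repository.

```python
from dataclasses import dataclass

@dataclass
class Multipixel:
    """One multipixel: position, size, and average color."""
    x: int       # horizontal coordinate of the top-left corner
    y: int       # vertical coordinate of the top-left corner
    width: int   # width of the multipixel, in single pixels
    height: int  # height of the multipixel, in single pixels
    r: int       # red component of the average color (0-255)
    g: int       # green component of the average color (0-255)
    b: int       # blue component of the average color (0-255)
```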

The proposed method creates candidate multipixels and calculates the average color of each one. The variance of the candidate is then compared against a threshold; if the variance is higher than the threshold, the candidate is discarded and split into two new candidate multipixels that are analysed recursively. The process begins with the entire image as the first candidate multipixel and runs until the multipixels are too small to be divided any further. This smallest size can be chosen so as to reduce the total number of pixels. The variance of a multipixel is calculated as follows:


Equation 1.

Where μ is the mean RGB value in that multipixel.
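
In code, that check could look roughly like the sketch below. I am reading Equation 1 as the mean squared deviation of each pixel’s RGB value from the mean color μ of the candidate region, summed over the three channels; the exact normalization in the original may differ.

```python
import numpy as np

def region_variance(image: np.ndarray, x: int, y: int, w: int, h: int) -> float:
    """Variance of the w x h region at (x, y) of an H x W x 3 RGB image."""
    region = image[y:y + h, x:x + w].astype(np.float64)
    mu = region.reshape(-1, 3).mean(axis=0)        # mean RGB value of the multipixel
    return float(((region - mu) ** 2).sum(axis=2).mean())
```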

The process of splitting the image alternates between horizontal and vertical divisions. If the image is wider than it is tall, the first division is vertical and the next division is horizontal; the alternation is inverted if the image is taller than it is wide. The sections need not be an even number of pixels long, so the two halves of a split may differ in size by one pixel.

If the variance of the multipixel falls within the given threshold, the candidate multipixel is no longer split and can be stored.
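
Putting the previous paragraphs together, the whole compression pass could be sketched as follows. This is my own reading of the procedure, not the code from the repository: it reuses the hypothetical `Multipixel` and `region_variance` helpers from the earlier sketches, and `min_size` here is a minimum area, which is only one way to express "too small to be divided any further".

```python
def compress(image, x, y, w, h, threshold, min_size, vertical, out):
    """Recursively split the w x h region at (x, y) into multipixels, appending them to `out`."""
    region = image[y:y + h, x:x + w].astype(float)
    if region_variance(image, x, y, w, h) <= threshold or w * h <= min_size:
        r, g, b = region.reshape(-1, 3).mean(axis=0).round().astype(int)
        out.append(Multipixel(x, y, w, h, int(r), int(g), int(b)))    # store the candidate
    elif (vertical and w > 1) or h <= 1:                              # vertical cut: split the width
        compress(image, x, y, w // 2, h, threshold, min_size, not vertical, out)
        compress(image, x + w // 2, y, w - w // 2, h, threshold, min_size, not vertical, out)
    else:                                                             # horizontal cut: split the height
        compress(image, x, y, w, h // 2, threshold, min_size, not vertical, out)
        compress(image, x, y + h // 2, w, h - h // 2, threshold, min_size, not vertical, out)

# Starting call: the entire image is the first candidate, and the first cut is
# vertical when the image is wider than it is tall.
# compress(img, 0, 0, img.shape[1], img.shape[0], threshold=100.0, min_size=1,
#          vertical=img.shape[1] >= img.shape[0], out=pixels)
```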

Encoding and decoding

Before trying to encode a compressed image, it is necessary to establish a set of rules. The file has a header that is 16 bytes long. After the header comes the list of pixels, each a fixed number of bits long, and after all the pixels a two-byte string indicates the end of the file. The file will then be about $NPix·BPP/8+18$ bytes long, with the pixel data rounded up to a whole number of bytes. The values $NPix$ and $BPP$, as well as the other fields a multipixel needs, are described as follows:

Number of pixels (NPix), for which we reserve 4 bytes, or a maximum pixel count of 4295 megapixels.

The number of bits per pixel ($BPP$) is a variable value: depending on the size of the image, small images may be addressable with 8 bits, while really big ones may need, say, 12 bits. This value is the minimum number of bits needed to describe a pixel of the given image. A multipixel can be as big as the image itself, so the minimum number of bits needed to describe a pixel’s size is $\lceil\log_2(W+1)\rceil+\lceil\log_2(H+1)\rceil$, where $W$ and $H$ are the image’s width and height, respectively. Then we need to express the pixel’s coordinates, which describe the position of the upper-left corner of the multipixel. Since the multipixel can be as small as a single pixel, the coordinates have the same range as the pixel’s size, which means they need the same number of bits; thus the total number of bits per pixel will be:

$BPP = 2\left(\lceil\log_2(W+1)\rceil + \lceil\log_2(H+1)\rceil\right) + 24$

Equation 2.
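
For instance, Equation 2 (as written above, with the 24 bits of color described later) can be evaluated with a couple of lines of Python; the function name is just for illustration.

```python
import math

def bits_per_pixel(width: int, height: int) -> int:
    """Total bits per multipixel: size + coordinates + 24-bit RGB color."""
    size_bits = math.ceil(math.log2(width + 1)) + math.ceil(math.log2(height + 1))
    return 2 * size_bits + 24  # coordinates need the same number of bits as the size

# Example: a 1920 x 1080 image needs 11 + 11 bits for the size, the same again
# for the coordinates, plus 24 bits of color, for 68 bits per multipixel.
print(bits_per_pixel(1920, 1080))  # 68
```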

Then come the width and height of the image, with 16 bits assigned to each, making for a maximum image size of 65536 x 65536 pixels.

To add a little more information about what the file is, the first 2 bytes of the header are set to an arbitrary value that identifies the format; the decoder can check them right after opening the file and avoid reading through an entire incompatible file. Additionally, an ending string can be set for the header, indicating the start of the image data.

Then we can start with the pixels’ information. Because multipixels cannot be sorted by position the way normal pixels are, they are stored in the order they are generated during the compression process, so both the horizontal (PosX) and vertical (PosY) coordinates of each one have to be specified. The next two values have the same size because a single pixel can be as big as the image, so the width (WPix) and height (HPix) of each pixel are specified too. The last 24 bits from Equation 2 are for the 3-component vector in RGB colorspace, with one byte for each of its elements, as is usually done in other formats.
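
Packing one pixel record into its $BPP$ bits might then look like the sketch below. It reuses the hypothetical `Multipixel` from the earlier sketch and assumes the field order of the storage rule shown further down; the real encoder in the repository may pack the bits differently.

```python
import math

def pack_pixel(px: Multipixel, W: int, H: int) -> int:
    """Pack one record (PosX, PosY, WPix, HPix, R, G, B) into a single BPP-bit integer."""
    xbits = math.ceil(math.log2(W + 1))   # bits for a horizontal coordinate or a width
    ybits = math.ceil(math.log2(H + 1))   # bits for a vertical coordinate or a height
    value = px.x                          # PosX occupies the most significant bits
    value = (value << ybits) | px.y       # PosY
    value = (value << xbits) | px.width   # WPix
    value = (value << ybits) | px.height  # HPix
    for channel in (px.r, px.g, px.b):    # 24 bits of color, one byte per channel
        value = (value << 8) | channel
    return value                          # to be written out using exactly BPP bits
```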

After all pixels have been written, it is convenient to have an ending string in case the data has to be restored from an unknown filesystem.

Thus, the proposed storing rules would look like this:

\0x20\0xB2IMG[Width][Height][BPP][NPix]\0xBE\0x70[PosX][PosY][WPix][HPix][R][G][B] ... [PosX][PosY][WPix][HPix][R][G][B]\0xBE\0x71
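
As an example of what writing such a file might look like, here is a sketch of the header using Python’s struct module. The exact field widths are my guess at a layout that adds up to the 16-byte header described above (2 magic bytes, "IMG", 16-bit width and height, a 1-byte BPP, a 4-byte NPix, and the 2-byte header terminator); the real format in the repository may pack things differently.

```python
import struct

def write_header(f, width: int, height: int, bpp: int, npix: int) -> None:
    """Write the 16-byte header: magic, 'IMG', Width, Height, BPP, NPix, terminator."""
    f.write(b"\x20\xB2" + b"IMG")                             # 2 magic bytes + format name
    f.write(struct.pack(">HHBI", width, height, bpp, npix))   # 2 + 2 + 1 + 4 bytes
    f.write(b"\xBE\x70")                                      # end-of-header string
```

Each pixel record would then be written out using exactly BPP bits, and the file closed with the two-byte string \0xBE\0x71.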

Results

After running the algorithm against a set of 5 different images, which can be seen in the PDF version of the paper at the end of the post, the efficiency of the algorithm was measured. The source images were already compressed with the JPEG algorithm, so the size of an uncompressed image was estimated from the image dimensions, assuming that every pixel occupies 3 bytes; the file size would then be 3·W·H bytes, where W is the width of the image and H is its height. As one would expect, the multipixel algorithm performs well when compressing images with very flat colors and not too many details. Both algorithms performed worst on the same image, and the multipixel algorithm performed best on an image with very flat colors. The second-worst image for the multipixel algorithm has a very grainy region, which requires many different small pixels. The file sizes of the output images are shown in Fig. 1. Both algorithms show a very high compression ratio for the tested images, as can be seen in Fig. 2, averaging a 12.31-fold reduction in size for JPEG and 10.23-fold for the multipixel algorithm, with maxima of 25.6-fold and 19.1-fold and minima of 1.3-fold and 7.1-fold for the multipixel algorithm and JPEG, respectively.

Figure 1. File sizes of the output images.

It is also important to measure the quality of the compressed images, and not only the compression ratios, because an image can be compressed with an extremely high ratio but such terrible quality that the two images no longer resemble each other. The quality is compared by measuring the differences between the two images, adding them all together across the three channels, and normalizing by dividing by the total number of pixels times three. To get a sense of scale, this measure was computed between two completely different images, giving an average difference of about 80 per pixel per channel, while a perfect copy has a difference of 0 per pixel per channel. Fig. 3 shows the quality of every compressed image; the result is consistent with the other two plots, where the image with the least quality is also the image with the least compression ratio.
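
In code, that quality measure amounts to a mean difference per pixel per channel. The sketch below is my reading of it, assuming the per-pixel differences are taken in absolute value; it is not the exact script from the repository.

```python
import numpy as np

def mean_difference(original: np.ndarray, compressed: np.ndarray) -> float:
    """Average absolute difference per pixel per channel between two H x W x 3 images."""
    diff = np.abs(original.astype(np.float64) - compressed.astype(np.float64))
    return float(diff.sum() / (original.shape[0] * original.shape[1] * 3))
```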

Figure 2. Compression ratios of the tested images.

Conclusion

The presented algorithm can compress images at a high ratio under the right circumstances, i.e. relatively big areas with flat colors. Nevertheless, the resulting images have a rough texture due to the nature of the algorithm. One way to correct that would be to blur the areas where the biggest pixels are, as shown in Fig. 4, where the image was blurred selectively, giving it a much closer resemblance to the original. Nonetheless, when the quality of both images was compared with the method described in the previous section, the quality of the blurred image was slightly lower than that of the multipixel image, although the difference does not seem to be statistically significant.

Figure 3. Compression quality of each compressed image.

The format used to store the compressed images reserves more bits than are commonly needed for the size and coordinates of the pixels, because multipixels as big as the whole image are rare. As a result, for small pixels these 4 fields have several unused bits that are simply set to zero, which leaves room for further compression using statistical methods.

Figure 4. Selective blurring.

Alberto Barquin is a master’s student of Space Science. He’s from Mexico City, and lives in Europe. He’s a self-taught programmer, and computer-vision enthusiast. To know more about him, check out his website. You can also find this article in PDF form.

Also, check out the repository of the project on Github. The code is easy to follow, and there’s a great visualization tool to see how the compression happens live.

Population and GDP across Mexico

Published 22 Dec 2015

I found a TopoJSON map of Mexico - made from Mike Bostock’s very easy-to-follow example, and a dataset of population and GDP per municipality; so I set out to make a small visualization with it, and the result is the following map.

Note: The bottom bars represent population and GDP per municipality. On these bars, the coloring of each municipality is by quintiles over GDP. Notice how 20% of all GDP is produced in only 2.36% of municipalities, and so on. The visualization can be manipulated by clicking on those bars.

I liked the visualization because we can see a lot of interesting patterns. First of all, we can see that Mexico has become a diverse and well-distributed economy. A few municipalities with very high GDP lie in the area around the Cantarell Complex, the largest oilfield in the country; but the top 10 most productive municipalities don’t really concentrate in a single area (Monterrey, Guadalajara, Mexico City, Toluca and the Cantarell region). As we go further and further down the ranking, we can see regions all over the country lighting up: some large cities, some touristic destinations, industrial centers, and border cities.

Unfortunately, the map allows us to visualize two other facts that are not so uplifting. On the one hand, we can observe economic inequality: A total of 140 municipalities produce 90% of the GDP in the country, but house only 50% of the population. This means that 50% of the population lives very close to the main areas of production in the country, while the other 50% lives in areas that are generally less prosperous.

The other saddening fact evidenced by the map is poverty in the south of the country. Particularly in the state of Oaxaca. Although it is also fair to say that municipalities in Oaxaca are smaller, and the economic output of the state may have been broken down into smaller parts. (Visualization on GDP per state pending!).

Anyway, hope you liked the map! All comments are welcome! : )

Also, for those curious on details on the data, you can find the information on the API that I used, as well as the indicators on the National Institute of Geography and Statistics’ website.

The Power Law distribution - and how I couldn't find one

Published 27 Oct 2015

Today I’m going to write about power law distributions. Power laws are very common in statistics, because they model many common phenomena very closely. The following image shows the typical shape of a power-law distribution. A few elements (in green) contain a very large proportion of the area of the curve, and the rest (in yellow) is a very long tail of ‘average-valued’ elements.

![Typical power law](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Long_tail.svg/300px-Long_tail.svg.png) `Typical form of a power law`

As I mentioned, this kind of distribution is observed in many phenomena. Actually, power laws have been observed in physics for a while now, but with the new ubiquity of data on human activities, there is a sort of hype around them. They are popping up everywhere to explain human-related phenomena. For instance, the frequencies of words in human languages show a power-law distribution: approximately 80% of what we say can be said with only 20% of the words of the English language (e.g. 7% of the words in a book are just “the”). This is true for many human languages!

Another phenomenon that shows power-law behavior is the [amount of friends on Facebook](http://www.facebook.com/notes/facebook-data-team/anatomy-of-facebook/10150388519243859). Most people have only a few friends, but there is also a significant number of people in the ‘long tail’. That is, a significant number of people who have a large number of friends (more than 500).

The power law typically appears to describe phenomena that are about popularity. Facebook friends, twitter followers, sales of books (a few books account for most of the sales), total grossing of movies, size of cities, number of speakers of languages, distribution of household incomes, level of collaboration on online communities, and countless other phenomena show power law behavior or something very close to it!

With all this excitement about power laws, naturally many researchers have jumped on the power-law train and made unsustainable claims about their data, for instance Clay Shirky’s claims about the in-degree distribution of blogs. It wasn’t long until someone set out to kill a few birds in a single shot: [Clauset et al looked at 24 distributions](http://arxiv.org/pdf/0706.1062v2.pdf) that had been claimed to be power laws by other researchers (such as word frequency, terrorist attack fatalities, religious sect size, surname frequency, in-degrees per URL, etc.), and found that only one of the 24 was a clear power law. Pretty disappointing, huh? Clauset et al were nice enough to provide code for others to test their data for power-law-ness, and the code was later translated to Python too.

Well, I ran into my own dataset that I suspected of being a power law. I found some [data of registered businesses in Mexico](http://www3.inegi.org.mx/sistemas/descarga/), published by the National Institute of Geography and Statistics (INEGI). Particularly, data of “manufacturing businesses” - which includes bakeries and tortilla makers. I grouped the data by business type, and found that most of the businesses are of very few types (20% of registered manufacturing businesses are tortilla makers!). Here is my data on a log-log scale:

You can see it’s definitely not a straight line, so there is no way it could fit a power law, right? Well, as it turns out, when researchers fit their data to power laws, they typically do it for a fraction of their data (the tail, or the beginning of it); so upon first impression it was not entirely certain whether we had a power law on our hands or not - even if only for the beginning of the distribution. To test that, I used the code from Clauset et al’s paper, which includes a Maximum Likelihood Estimator of the exponent, as well as a Kolmogorov-Smirnov goodness-of-fit test. Spoiler alert: the fit was really bad, and the P value was 1 or something embarrassing.
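
For anyone who wants to run the same kind of test on their own data, a similar check can be done with the `powerlaw` Python package, which is a different implementation from the plfit translation I used. The data below is synthetic and only stands in for the real business counts.

```python
import numpy as np
import powerlaw

# Synthetic stand-in data; in my case this would be the number of
# registered businesses per business type.
counts = np.random.zipf(2.0, 1000)

fit = powerlaw.Fit(counts, discrete=True)       # MLE estimate of the exponent and xmin
print(fit.power_law.alpha, fit.power_law.xmin)

# Compare the power-law fit against a log-normal: R > 0 favors the power law,
# and p says whether the comparison is statistically significant.
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)
```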

And that’s the story of how I failed to fit my first power law. The code, which is nothing more than loading the data and applying plfit, is available on Github for the curious. I’ll be back soon with more on this data, which I find pretty interesting. I will probably try to fit it to another distribution (Log-Normal?). Until then.

Playing with csv data

Published 26 Oct 2015

Recently I’ve been working on a paper to classify relationships between pairs of nodes in a graph. To do the classification, I designed a few features -about 20- for each pair of nodes. I wrote a script that calculates the features for each pair of nodes, and outputs it into a csv file. If you are curious about the project, you may check it out on Github.

My graph has about 6,000 nodes, so there’s approximately 18,000,000 pairs of nodes for which I generate features, so the dataset is pretty large (3.8 G for the file that contains the main features).

Originally, I was doing the main analysis in Python and scikit-learn, but after some issues with the classification I had to dive into the features, and before I got any code down, I wanted to do some exploration. I started with the basic console utilities in Linux: grep, awk, sed, etc.; but those only took me so far. I needed something that had CSV built into its philosophy. That’s when I found csvkit, a set of Python utilities to work with csv files in the console. It provides basic functionality like csvcut to slice, mince and mix the csv files as you wish; or csvgrep to run good old string matching on particular columns of the file. It also includes more advanced functionality, like csvstat, which will run some basic statistics functions on your csv file or stream.

Because they are coded in Python, they have the disadvantage that they might be slower than the good old Unix alternatives. As an example, I found that normal grep was faster than csvgrep, with the disadvantage that you don’t have an easy way of finely matching columns. The pack also has a bunch of other scripts that I haven’t tried, but that might work for your application. You can take a look at the documentation, where they go over the different kinds of use cases, and utilities provided in csvkit.

It’s all a matter of finding the right tool for what you need. I am very pleased with csvkit, and I recommend you try it too. It might even save you the need to write any code for your exploratory data analysis!

The Chess Battlefield

Published 25 Oct 2015

I recently analyzed the millionbase, a large dataset of approximately 2.2 million chess games. There have been several examples of analyses of the millionbase, such as the Chess Piece Journeys, and Survival of Pieces in Chess, which was first posted on Quora.

Inspired by these, I decided to analyze how pieces are captured on the Chess Board. The result is the following interactive visualization:

The results are interesting - although they might be expected by anyone who’s played chess for long enough. The center of the board is where most piece captures occur. A lot of that comes from pawns, which are captured very heavily in rows 4 and 5. Observe that the distribution becomes much less concentrated for pieces that are not pawns. Queens, bishops, and knights are often captured in the middle of the board, but their distributions have a much higher variance. Rooks, on the other hand, are most often captured on the edges of the board.

It is interesting to observe the patterns that form for black and white pieces, and for different pieces. Go ahead and spend some time looking at it. :)

The data, as well as the code for the analysis and visualization are available on my Git repository. Feel free to ask any questions, or use the code as you see fit. Please give attribution when you use the visualizations.

Hello, World!

Published 24 Oct 2015

This is my first post on this blog. From now on, I’ll try to post regularly on things like data analysis, Mexico, Korea, public policy, and software - and generally things that keep me interested. I’ll try to keep it interesting too.

Welcome to my blog. : )