Keep Building

Novelty is overrated

Nov 21, 2014

Binarization Part 2

If we confine ourselves to binarizing documents, we want to find out every pixel that is ink and mark those as black. All the other pixels are the page background and can be marked as white.

Even this formulation papers over ambiguity. At the edge of every letter, there will be pixels, each of which cover an area which is partially covered in ink. Those pixels will be an intermediate value in the greyscale image between the ink and the page. This is one way in which we are losing information when we go to a binary image.

Before going into Howe's algorithm, I want to outline the two most common kinds of binarization algorithms. There are many variants on each of these methods.

Global Thresholds

One class of methods involves coming up with a single global threshold for the whole image. Any pixels lighter than this threshold become white and any pixel darker than the threshold become black. There are various ways to decide the threshold. You can analyze the distribution of pixel brightness and try to find a point between the cluster of page pixels and the cluster of ink pixels.

The big disadvantage is that it only works well when the page is extremely evenly lit. If one part of the page is darker than another, then a single threshold will cause similar pixels to be classified differently. Letters on one part of the page can look bolder than letters on another part. In the worst case, entire letters or features can disappear because the threshold that might work one place won't work everywhere.

Adaptive Thresholds

While I am pretty good at evenly lighting my books when scanning, I still see this kind of binarization artifact when using global thresholds. So I looked into alternatives and most of these alternatives are adaptive thresholding techniques. Adaptive thresholding means that you calculate a potentially different threshold for each pixel based on the neighborhood it is in.

This is a great advantage when dealing with uneven light. If the top part of the page is a bit brighter, then it's threshold will be a bit higher. These algorithms tend to be slower and they also tend to introduce more black speckles into the result. But some of the best algorithms currently available use an adaptive threshold.

Other Methods

Howe's algorithm uses neither kind of technique. It works by mapping the binarization problem to the maxflow/mincut problem. I will talk about how this works next time.