May 10, 2015
It has been a while since I have posted to the blog. If you follow along with this guide, you will see why.
Step 1: Select a Scanner Design
Step 2: Build All The Things
Step 3: Ship it
Hundreds of pounds of plywood, two car loads of bubble wrap, dozens of boxes...
Now I need a nap.
Nov 22, 2014
The Maximum Flow problem is well-studied and there are very efficient algorithms to solve it. It turns out to be equivalent to finding the Minimum Cut of a network as well. If we can map the binarization problem to either of these problems, the result should be fast and accurate. Let's look at what the problems are and then see how to map it.
Suppose you have a network of pipes in a water system. The pipes may be different sizes (water capacities) and connect to each other at various junctions. Suppose you have one water supplier, called the source. And somewhere else you have one customer who wants as much water as possible, called the sink. The question is, given a particular configuration of pipes and junctions, what is the most water you can supply to the customer over the network?
In this example, the maximum flow possible through the network is 7 gallons per second.
Let's say you are a secret agent, trying to prevent the villain from escaping to safety. The villain will try to take a train from a starting city to their safehouse. Since the villain is wily, you know that if there is any path to safety, they will find it. But you only have so many resources and each train line you shut down has a cost. Shutting down all the train lines is far too expensive, so what is the cheapest set of train lines to shut down to guarantee success?
These two problems happen to be completely equivalent. The water source is the villain's starting city. The water sink is the villain's destination. The size of a pipe is equivalent to the cost of shutting down a train line. Once you map the one problem to the other, the total flow of water you can get through the water system is equal to the cost of trapping the villain.
In order to keep the villain away from safety at minimal cost, you have to cut off three routes whose cost adds up to $7 million. This is not a coincidence. When the flow is maximized, the water goes through pipes which collectively form a bottleneck. And those pipes, when removed, partition the graph at minimum cost.
We can turn the binarization problem into an instance of the mincut problem by constructing a graph from the pixels of a picture. Then when we run the mincut algorithm, some of the pixels will end up on the side of the source/villain. The rest will be on the side of the customer/escape. We set one group to be white and the other to be black.
Our main graph will form a lattice where every pixel has an edge to the four adjacent pixels. If the two pixels are in different groups, then the edge will have to be cut and this imposes a penalty. The bigger the weight, the more penalty is assessed when the two are in different groups and so the more likely they are to be in the same group.
Then we connect every pixel to both the source and sink, but vary the weights. Here is a simple way to do it. The brighter the pixel is, the more weight we assign to its link to the source and the less weight we assign its link to the sink. So bright pixels are biased toward one side and dark pixels towards the other.
The combination of these two kinds of weights helps ensure that contiguous regions of darker pixels are all tagged the same and helps reduce the chance of random speckles. If you want to see this method in action, take a look here.
Howe's binarization algorithm builds on this basic system. It builds a graph like I have described here and runs mincut on it. The tricky bit is the methods it uses for determining the weights between adjacent pixels and between the pixels and the source/sink.
Nov 21, 2014
If we confine ourselves to binarizing documents, we want to find out every pixel that is ink and mark those as black. All the other pixels are the page background and can be marked as white.
Even this formulation papers over ambiguity. At the edge of every letter, there will be pixels, each of which cover an area which is partially covered in ink. Those pixels will be an intermediate value in the greyscale image between the ink and the page. This is one way in which we are losing information when we go to a binary image.
Before going into Howe's algorithm, I want to outline the two most common kinds of binarization algorithms. There are many variants on each of these methods.
One class of methods involves coming up with a single global threshold for the whole image. Any pixels lighter than this threshold become white and any pixel darker than the threshold become black. There are various ways to decide the threshold. You can analyze the distribution of pixel brightness and try to find a point between the cluster of page pixels and the cluster of ink pixels.
The big disadvantage is that it only works well when the page is extremely evenly lit. If one part of the page is darker than another, then a single threshold will cause similar pixels to be classified differently. Letters on one part of the page can look bolder than letters on another part. In the worst case, entire letters or features can disappear because the threshold that might work one place won't work everywhere.
While I am pretty good at evenly lighting my books when scanning, I still see this kind of binarization artifact when using global thresholds. So I looked into alternatives and most of these alternatives are adaptive thresholding techniques. Adaptive thresholding means that you calculate a potentially different threshold for each pixel based on the neighborhood it is in.
This is a great advantage when dealing with uneven light. If the top part of the page is a bit brighter, then it's threshold will be a bit higher. These algorithms tend to be slower and they also tend to introduce more black speckles into the result. But some of the best algorithms currently available use an adaptive threshold.
Howe's algorithm uses neither kind of technique. It works by mapping the binarization problem to the maxflow/mincut problem. I will talk about how this works next time.
Nov 20, 2014
In this multi-part series, I am going to talk about my recent discoveries in the realm of document binarization. I will start with the basics and background and then build up an explanation of Nicholas Howe's remarkable algorithm for doing it piece by piece.
I have been obsessing about this algorithm over the past week, picking it apart and trying to improve upon it. So far my efforts to improve it have been fruitless. But these efforts have given me a newfound respect for just how hard the binarization problem is.
For those who want to skip ahead and read about the algorithm from primary sources, you can find details here:
- The first paper describes the base algorithm.
- A second paper shows a method for automatically tuning parameters for the base algorithm. This idea is what really piqued my initial interest because all of these algorithms have parameters that drastically impact their effectiveness and finding the right parameters is time consuming work.
- Nicholas Howe has also released his Matlab code and it is by converting and tweaking the code that I really came to understand how it all works.
Binarization is the process of turning a grayscale image into a black and white image. Every pixel is converted from a gray pixel into either a completely white or a completely black pixel.
One reason that people are interested in binarization is that it is an essential step for OCR algorithms that convert a document into plain text and other image processing algorithms. Another reason is that a binarized document can be compressed down a lot more, an order of magnitude at least.
The latter is what I am most interested in. With binarization and the best compression techniques, you a scanned 400 page book can be compressed down to about 10 MB. That is small enough to carry thousands of books on any modern tablet. Once I scan library, I want to be able to carry the whole kit and kaboodle with me all the time.
One issue with binarization is that there is no single right way to binarize every image. It inevitably means a loss of information. There are dozens of methods and each one will work well for different purposes or with different source material.
I want to binarize documents rather than photographs or illustrations. My materials will be relatively evenly lit and printed with ink. I want to maintain readability and preserve small scale features like serifs and thin strokes in a letter. And I want an automated process. If I have to spend a lot of manual labor on each page, then books with hundreds of pages are not feasible.