Nov 22, 2014
The Maximum Flow problem is well-studied and there are very efficient algorithms to solve it. It turns out to be equivalent to finding the Minimum Cut of a network as well. If we can map the binarization problem to either of these problems, the result should be fast and accurate. Let's look at what the problems are and then see how binarization maps onto them.
Suppose you have a network of pipes in a water system. The pipes may be different sizes (water capacities) and connect to each other at various junctions. Suppose you have one water supplier, called the source. And somewhere else you have one customer who wants as much water as possible, called the sink. The question is, given a particular configuration of pipes and junctions, what is the most water you can supply to the customer over the network?
In this example, the maximum flow possible through the network is 7 gallons per second.
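The original diagram is not reproduced here, so as a stand-in, here is a sketch of a small made-up pipe network solved with the networkx library. The node names and capacities are my own invention, not the ones from the post's picture.

```python
import networkx as nx

# A hypothetical pipe network; capacities are in gallons per second.
G = nx.DiGraph()
G.add_edge('source', 'a', capacity=4)
G.add_edge('source', 'b', capacity=5)
G.add_edge('a', 'b', capacity=2)
G.add_edge('a', 'sink', capacity=3)
G.add_edge('b', 'sink', capacity=6)

flow_value, _ = nx.maximum_flow(G, 'source', 'sink')
# flow_value is 9: the two pipes leaving the source carry 4 + 5, and the
# rest of the network has enough capacity to deliver all of it.
```

The pipes out of the source form a cut of total capacity 9, and a flow of 9 is achievable, so that is the maximum.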
Let's say you are a secret agent, trying to prevent the villain from escaping to safety. The villain will try to take a train from a starting city to their safehouse. Since the villain is wily, you know that if there is any path to safety, they will find it. But you only have so many resources and each train line you shut down has a cost. Shutting down all the train lines is far too expensive, so what is the cheapest set of train lines to shut down to guarantee success?
These two problems happen to be completely equivalent. The water source is the villain's starting city. The water sink is the villain's destination. The size of a pipe is equivalent to the cost of shutting down a train line. Once you map one problem to the other, the maximum flow of water you can get through the water system equals the minimum cost of trapping the villain.
In order to keep the villain away from safety at minimal cost, you have to cut off three routes whose cost adds up to $7 million. This is not a coincidence. When the flow is maximized, the water goes through pipes which collectively form a bottleneck. And those pipes, when removed, partition the graph at minimum cost.
We can turn the binarization problem into an instance of the mincut problem by constructing a graph from the pixels of a picture. Then when we run the mincut algorithm, some of the pixels will end up on the side of the source/villain. The rest will be on the side of the customer/escape. We set one group to be white and the other to be black.
Our main graph will form a lattice where every pixel has an edge to the four adjacent pixels. If the two pixels are in different groups, then the edge will have to be cut and this imposes a penalty. The bigger the weight, the more penalty is assessed when the two are in different groups and so the more likely they are to be in the same group.
Then we connect every pixel to both the source and sink, but vary the weights. Here is a simple way to do it. The brighter the pixel is, the more weight we assign to its link to the source and the less weight we assign its link to the sink. So bright pixels are biased toward one side and dark pixels towards the other.
The combination of these two kinds of weights helps ensure that contiguous regions of darker pixels are all tagged the same and helps reduce the chance of random speckles. If you want to see this method in action, take a look here.
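The construction above can be sketched in a few lines of Python using networkx. To be clear, this is a minimal illustration of the graph construction, not Howe's algorithm: the simple brightness-based weights and the `smoothness` parameter are my own stand-ins for the cleverer weights discussed below.

```python
import networkx as nx

def mincut_binarize(image, smoothness=50):
    """Binarize a grayscale image (nested lists, values 0-255) via min cut."""
    rows, cols = len(image), len(image[0])
    G = nx.DiGraph()
    for r in range(rows):
        for c in range(cols):
            b = image[r][c]
            # Bright pixels get a strong link to the source, dark pixels
            # a strong link to the sink, so each is biased toward one side.
            G.add_edge('source', (r, c), capacity=b)
            G.add_edge((r, c), 'sink', capacity=255 - b)
            # Lattice edges: splitting adjacent pixels costs `smoothness`.
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if nr < rows and nc < cols:
                    G.add_edge((r, c), (nr, nc), capacity=smoothness)
                    G.add_edge((nr, nc), (r, c), capacity=smoothness)
    _, (source_side, _) = nx.minimum_cut(G, 'source', 'sink')
    # Pixels cut off on the source side become white, the rest black.
    return [[255 if (r, c) in source_side else 0 for c in range(cols)]
            for r in range(rows)]
```

On a tiny image with a bright top row and a dark bottom row, the cheapest cut severs the two lattice edges between the rows, so the bright pixels come out white and the dark ones black.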
Howe's binarization algorithm builds on this basic system. It builds a graph like I have described here and runs mincut on it. The tricky bit is the methods it uses for determining the weights between adjacent pixels and between the pixels and the source/sink.
Nov 21, 2014
If we confine ourselves to binarizing documents, we want to find out every pixel that is ink and mark those as black. All the other pixels are the page background and can be marked as white.
Even this formulation papers over some ambiguity. At the edge of every letter, there will be pixels that each cover an area only partially covered in ink. In the grayscale image, those pixels will have a value somewhere between the ink and the page. This is one way in which we lose information when we go to a binary image.
Before going into Howe's algorithm, I want to outline the two most common kinds of binarization algorithms. There are many variants on each of these methods.
One class of methods involves coming up with a single global threshold for the whole image. Any pixel lighter than this threshold becomes white and any pixel darker becomes black. There are various ways to decide the threshold. You can analyze the distribution of pixel brightness and try to find a point between the cluster of page pixels and the cluster of ink pixels.
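One classic way to find that point between the two clusters is Otsu's method, which picks the threshold that maximizes the variance between the two classes. A minimal sketch (the function name and histogram-based layout are my own):

```python
def otsu_threshold(pixels):
    """Pick a global threshold by maximizing between-class variance."""
    # Build a histogram of the 256 gray levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]          # pixels at or below t (ink side)
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg  # pixels above t (page side)
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (total_sum - sum_bg) / weight_fg
        # Between-class variance: large when the two clusters are far apart.
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t
```

Every pixel above the returned threshold is then set to white and everything else to black.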
The big disadvantage is that it only works well when the page is extremely evenly lit. If one part of the page is darker than another, then a single threshold will cause similar pixels to be classified differently. Letters on one part of the page can look bolder than letters on another part. In the worst case, entire letters or features can disappear because a threshold that works in one place won't work everywhere.
While I am pretty good at evenly lighting my books when scanning, I still see this kind of binarization artifact when using global thresholds. So I looked into alternatives and most of these alternatives are adaptive thresholding techniques. Adaptive thresholding means that you calculate a potentially different threshold for each pixel based on the neighborhood it is in.
This is a great advantage when dealing with uneven light. If the top part of the page is a bit brighter, then its threshold will be a bit higher. These algorithms tend to be slower and they also tend to introduce more black speckles into the result. But some of the best algorithms currently available use an adaptive threshold.
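The simplest adaptive schemes threshold each pixel against the mean of its neighborhood, minus a small bias so flat background regions stay white. Here is a sketch of that idea; the window radius and bias values are arbitrary assumptions, and real implementations compute the local means far more efficiently with integral images:

```python
def adaptive_threshold(image, radius=1, bias=10):
    """Binarize by comparing each pixel to its local neighborhood mean."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Mean over a (2*radius+1) square window, clipped at the edges.
            vals = [image[rr][cc]
                    for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            local_mean = sum(vals) / len(vals)
            # The bias keeps uniform regions from dissolving into speckle.
            row.append(255 if image[r][c] > local_mean - bias else 0)
        out.append(row)
    return out
```

On a bright patch with a single dark pixel in the middle, only the dark pixel falls below its local mean and comes out black.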
Howe's algorithm uses neither kind of technique. It works by mapping the binarization problem to the maxflow/mincut problem. I will talk about how this works next time.
Nov 20, 2014
In this multi-part series, I am going to talk about my recent discoveries in the realm of document binarization. I will start with the basics and background and then build up, piece by piece, an explanation of Nicholas Howe's remarkable algorithm.
I have been obsessing about this algorithm over the past week, picking it apart and trying to improve upon it. So far my efforts to improve it have been fruitless. But these efforts have given me a newfound respect for just how hard the binarization problem is.
For those who want to skip ahead and read about the algorithm from primary sources, you can find details here:
- The first paper describes the base algorithm.
- A second paper shows a method for automatically tuning parameters for the base algorithm. This idea is what really piqued my initial interest because all of these algorithms have parameters that drastically impact their effectiveness and finding the right parameters is time consuming work.
- Nicholas Howe has also released his Matlab code and it is by converting and tweaking the code that I really came to understand how it all works.
Binarization is the process of turning a grayscale image into a black and white image. Every pixel is converted from a gray pixel into either a completely white or a completely black pixel.
One reason that people are interested in binarization is that it is an essential step for OCR algorithms, which convert a document into plain text, and for other image processing algorithms. Another reason is that a binarized document can be compressed down a lot more, an order of magnitude at least.
The latter is what I am most interested in. With binarization and the best compression techniques, a scanned 400-page book can be compressed down to about 10 MB. That is small enough to carry thousands of books on any modern tablet. Once I scan my library, I want to be able to carry the whole kit and caboodle with me all the time.
One issue with binarization is that there is no single right way to binarize every image. It inevitably means a loss of information. There are dozens of methods and each one will work well for different purposes or with different source material.
I want to binarize documents rather than photographs or illustrations. My materials will be relatively evenly lit and printed with ink. I want to maintain readability and preserve small scale features like serifs and thin strokes in a letter. And I want an automated process. If I have to spend a lot of manual labor on each page, then books with hundreds of pages are not feasible.
Nov 18, 2014
The tag line of this blog is "novelty is overrated" because we are far too concerned with being original. It is very easy to have a great idea and then become discouraged when you see that others have had similar ideas. But this is an impulse we must strive against. Your great idea is still great even if others have done something similar. There are still many good reasons to build it.
Build it Anyway
The most important reason to build your idea even if it is unoriginal is that there is so much you will learn by building it. Thinking up ideas doesn't teach you anything. It is only by implementing them that you can gain true mastery. And the new things you learn will spur you on to bigger things. The next idea you have will incorporate the new things you have learned and will be even better.
It is also the case that your idea might be superficially similar to something that already exists but the details will differ. The only way to find that out is to flesh out the details by building it. The truth is that every idea is well anchored in the intellectual context around it. Improvements and novel ideas are usually incremental at the time and it is only in retrospect that we look back and see it as a great leap.
In the age of search engines, it is trivial to find out who else has an idea. And it is almost always the case that you will discover that your idea is not unique. One way to avoid discouragement is to bask in blissful ignorance. Avoid the temptation to find out how unoriginal your idea really is.
When I came up with the title of this blog, I deliberately avoided searching to see if there were any other blogs with the same name. I don't have the discouragement of knowing about the half dozen other blogs with the same name. I haven't wasted days fruitlessly trying to come up with a new name that nobody has ever used but still familiar enough to be attractive.
In general, there are many reasons to read about similar ideas when working on a project. You might look for inspiration to see what can be done. Or you might be doing research to improve your techniques. But if you ever find yourself starting to compare your idea to what is out there or wonder if it is original, stop. Go work on your idea for a while and regain your enthusiasm.
The concern with originality is a stultifying influence on creativity. But the notion of completely novel ideas that spring fully formed from the mind of a unique genius is not only wrong but harmful.
Every idea derives from what came before. And the great ideas occur in a social context where people collaborate or compete and use the best parts of multiple projects to make the result even better. Einstein and Newton worked and talked with other very smart people, many of whom had groundbreaking ideas of their own.
If you write your idea down in just a couple of sentences, there is not enough detail to see what is really new. Any true innovation is in the thousands of details that come from developing your idea. The difference between the useless theory of atoms developed in Ancient Greek philosophy and modern atomic theory is in the details that are used to form the basis for chemistry and physics.
So this blog will not be a chronicle of new inventions. Everything I build is rooted in the work of those who came before. I wasn't the first person to use Pelican to host a static blog and Pelican was not the first static blog generator. But my journey is still my own and is still interesting. And Pelican is still a great tool.
The tag line is there to remind me that it is ok to walk a path others have gone down and that there are still interesting things to be found even if I am not blazing a new trail for the first time.
Nov 17, 2014
The overall setup of my blogging engine is now complete. I had a few requirements which are fairly unusual:
- Host it myself entirely in Amazon S3
- Edit and upload each post entirely using my tablet (an iPad)
- Upload images and automatically paste links while editing a post
- Automatic backup of the entire blog to Dropbox each time I post
- Ensure that I can figure out which backup blog post points to which backup image
The only reasonable way to host a blog on S3 is to use a static site generator. I chose Pelican, though Jekyll and Octopress are both good alternatives. These are all command line tools that turn a folder of CommonMark documents into a static set of HTML and CSS files that can then be uploaded for hosting anywhere.
Publishing on iOS
None of them have anything remotely like an iOS client. The closest thing to a console on iOS is Editorial, a text editor that is programmable in Python. I found a way to run an SFTP client inside Editorial, and that is how I get from a piece of text to a post.
After completing my post in Editorial, I just invoke the upload workflow to push the post to the input folder on my VPS. The VPS is configured to run Pelican, and the upload is immediately detected by my daemon, which uses inotifywait to watch for file changes. The daemon then uploads the files generated by Pelican to S3 for hosting.
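The daemon itself is not shown in the post, but the change-detection half can be sketched as a snapshot diff. To be clear, this is a hedged Python approximation of my own devising: the real daemon uses inotifywait, which receives push events from the kernel instead of comparing snapshots, and the function names here are assumptions.

```python
import os

def snapshot(folder):
    """Record the modification time of every file in a folder."""
    return {name: os.path.getmtime(os.path.join(folder, name))
            for name in os.listdir(folder)}

def changed_files(folder, previous):
    """Return files that are new or modified since the `previous` snapshot,
    along with the fresh snapshot to use on the next pass."""
    current = snapshot(folder)
    changed = [name for name, mtime in current.items()
               if previous.get(name) != mtime]
    return changed, current
```

A loop around `changed_files` would then, whenever the list is non-empty, run Pelican on the input folder and sync the generated output to the S3 bucket.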
At first, I was planning on using the Transmit iOS app to upload images directly to an S3 bucket. But there is no url scheme or x-callback-url support in Transmit which would make the whole process cumbersome.
So instead, I decided to reuse the SFTP code in my post upload workflow in Editorial. My picture workflow lets me choose one or more pictures from the camera roll and then inserts a CommonMark image tag for each URL. When I upload the blog post, the images themselves are copied from my VPS to S3 at the same time by my daemon.
Since I am authoring and publishing directly from my tablet, I need to make sure that I have everything backed up in multiple places. I will retain a copy on my tablet, and there will be an extra copy on my VPS, but it still feels a bit fragile. So every post and image is also uploaded to Dropbox. This is easy to do in Editorial, which can be linked to a Dropbox account. The real trick will be handling sync if and when I decide to add a way to upload blog posts from my laptop.
Nov 15, 2014
This is a mobile test post. I have written it in Editorial on iOS. If everything works, then I will be able to upload it to the origin and it will automatically post.
The workflow I am using is the SFTP workflow created by Gabe from Macdrifter.
Now things are mostly working. With the current setup, I can automatically prepend the date to the file and backup to Dropbox. The next step will be to figure out how I can host images and upload them from my tablet.
Here is a test of the picture hosting. You can see a piece of my workflow in this image.
Nov 13, 2014
This is my first blog post for the Pelican blog system. If everything goes well, then this will cause my watch script to automatically build the templated website for the first time.
Well, here goes. Time to save.
Here we go again. The first time, the whole site got built but it skipped this entry because it didn't have a date.
Second time is a charm.