A very interesting problem, indeed. I presume, not a complex one either. Here is what the problem statement looks like. “Given a set of N points in a plane, find the closest pair points”. A question which is always asked in interviews in various forms, here we will not just look at the solution but will also understand the intuition behind the solution.
In the adjoining image, we have five points (A, B, C, D, E) in the plane. As a result of the solution we need to return one pair which has the smallest distance between them. Of course there can be multiple such pairs, we are interested in the first one.
Approaching “Closest Pair Points” problem
Well the image above clearly draws a picture that we can evaluate all possible pairs (one pair is represented by one line in the graph). This possibly leads to (N/2) * ( N-1 ) pairs and hence a minimum order O(N2) comparisons. This seems to be a quadratic order solution with almost constant space to store the current pair with minimum distance.
Can we do better?
That is the question we need to ask ourselves and the answer is Yes, we can do better and here is how. While learning Algorithms and Data structures in previous posts we understood that strategies like divide and conquer offer a faster running time if we can break down the problem into smaller consumable pieces. Here is a refresher for Divide and Conquer.
How do we know that we have to use Divide and Conquer?
Well, we need to understand the nature of the problem. Here are the hints:
- The problem contains many points and the distance between a pair is independent on the other points in the neighborhood.
- That proves that we can divide the problem into smaller pieces without affecting the outcome.
- Also, on subsequent divisions we will reach a point where each of the sub problems will be in its minimum state of two points.
- If we can find the minimum distance for each sub problems, then the minimum of all the minimums will give us the required answers.
This gives us sufficient hint that we need to divide and conquer. 🙂
With this basic idea, let us try and draft a solution. Couple of questions here:
- How do we divide, on what basis?
- How do we calculate the distance? Euclidean or Geometric Mean Distance?
To answer the first question, we would choose to divide it in a way such that both divided sections must get almost equal number of points (this is required for having a fair branch on both sides and having equal recursion traces for both branches of division. We can randomly choose to divide based on the x coordinates or the y coordinates. Here we use the x coordinate of each point.
To answer the second question, let us measure the Euclidean distance, however it wont affect the solution no matter which distance we measure.
It is pretty clear that we need to divide at the median of all the x coordinates. The point which equally divides the points based on their x co-ordinates.
Divide based on x coordinate
We can sort the array of points (against their x coordinates) using merge sort or any other efficient sorting algorithm and then choose the middle element to define the partition.
Once this point is available, we have exactly half points to the left and half to the right. Here I present an image with the coordinate axis in place.
P and Q are the two sections divided by the blue line and contain almost equal number of points. Now if we can find the closest pair in P and closest pair in Q, we might get a possible solution. Let us say that the closest pair in P is separated by distance d1 and closest pair in Q is separated by d2 then the closest pair distance among all the points could be d = minimum(d1, d2).
Hold on! did we look at the boundary (blue line), is there a possibility of d3 being the shortest distance and the pair connected by d3 being the closest pair?
Of course there is a chance. Let us evaluate that as well. Our solution never took into account the cross over cases. There can be two points, one in P and the other in Q extremely close to the border and the distance between them can be smaller than d.
Does that mean, we need to evaluate each point in P against each point in Q to figure this out? If yes then what would be the possible running time?
If we do so, we will end up pairing N/2 points against another N/2 points and hence the total number of pairs would be N/2 * N/2 which is N2/4. Not very good compared to our quadratic solution.
So, how do we achieve a smaller running time. Till now we have completed sorting of the inputs which took around O(NlogN) time. We would like to do another logN or constant time job to get to an efficient solution. If it exceeds beyond that, then we do not achieve what we claimed.
Let us try to restrict the points in P and Q for checking the cross over. Here are few hints:
- We do not need to test for any point with x coordinate less than x-d in P and any coordinate greater than x+d in Q.
- Even if we eliminate the points mentioned in 1. We still can have many points lying in the area x-d and x+d.
Why is point 1 true?
Yes it is true because we have already established that the closest pair is at a distance d and we are testing for the crossover condition. So any point with x coordinate less than x-d in P cannot possibly pair up with any coordinate in Q to make the closest pair with distance less than d (it will take at least d distance to reach to the boundary line, crossing would take more).
The same argument goes for x+d in Q.
Now as per point 2, if there are K points with x coordinates between x-d and x and K is proportional to N then again it is a quadratic solution. Hence, we need to limit the comparisons which has to be made within this range as well.
Let us zoom into the image and see how can we restrict the number of comparisons. In the below image I have zoomed around the border line and shown the grids such that each edge of a cell in the grid is of distance d/2.
Now with the help of geometry, we can prove that not more than one point can lie in any of the cells. Remember we already found that the minimum distance between two points in P or Q is not less than d. In a square cell of edge d/2, the maximum distance is d/√2 . which is smaller than d, hence no two points can lie in any square cell.
Imagine, that a point A (the red point) to be tested is lying on any of the grid lines in P, then the cells in Q which actually needs a testing against A would be a matrix of size 4*2 because these cells can possibly contain a point which may result in a smaller pair distance than d. This proves that at max for each point in P we need to pair it up with 8 points in Q. You can probably get to a lower number with some more intelligence.
The point I want to bring in here is that its a constant number. Say that you want to check for more cells, it would still be a constant, 16 at max but not more than that. Which means for each point in with x coordinate greater than x-d/2 we can actually test 8 points in Q which lie within certain rectangular bounds.
This will add up a linear term to the running time of the algorithm which is cn, we established that c can be a bigger number but still linear. Hence the total running time would be NlogN + cN, which makes it O(NlogN).
You can download the source code for this post from our github repository.
The source code in the above github link can be further optimized to bring down the constant term. Please let me know by comments if it needs further improvements. You can also run it without the graphics class, I have tested it for 10000 random test cases each with 30 points and it seems to work fine. The code contains both an exhaustive check and a efficient skipping as discussed in this post. Both the outputs match.
Hope this helps. Thanks and happy learning. If you are interested in reading the related Wiki Closest Pair Points on Wikipedia