Median of a Stream of Integers

Why does this work?

It works because the LOW and HIGH lists divide the stream in two equal parts. Sometimes there is one extra number in either the LOW list or the HIGH list.

Explanation Condition 1

Let us evaluate first condition, say that both the lists have equal number of elements at one given point.

  • The median at that point is the average of two numbers (smallest of the HIGH list and biggest of the LOW list)
  • If a new number is to be added, the count of numbers will become odd and hence the median must come from one of the lists.
  • If this incoming number is smaller than the median, it is going to reduce the median value, hence it definitely has to be a part of the LOW list for the above point to be true.
  • Hence, the highest number of the LIST is the median.

Explanation Condition 2

Let us evaluate second condition, say that HIGH list has more numbers of elements at one given point.

  • This means the current median came from the HIGH list. It is the smallest number of the HIGH list.
  • If a new number has to be added, it must be such that after addition both LOW and HIGH lists have same size.
  • If this new number is smaller than median, it can easily be added to the LOW list and the sizes can be made equal.
  • If this new number is bigger than the median, it should ideally go to the HIGH list, but this list already has one extra number.
  • Hence, we need to remove a number from the HIGH list and add it to the LOW list to make some space in the HIGH list
    • Which number to remove?
    • The LOW list contains numbers which are smaller than the median, and we can say the smallest number in the HIGH list is the only one which can be safely moved to the other list and the LOW list constraint can still be satisfied.
    • Basically we are compensating the removal of this number by adding a bigger number to the HIGH list, hence the median will always be bigger the current median and hence shifting the smallest number from HIGH to LOW list is acceptable.
  • Hence, the average of the two numbers (biggest from the LOW list and smallest from the HIGH list) is the new median. Because these two are the numbers in the middle of the stream.

The third condition, where HIGH list has less numbers of elements compare to the LOW list at one given point is exactly reverse of the second condition.

Interesting fact about Medians

If the median of a list of numbers is M then

  • Adding a number N into the stream, such that N > M  will increase the value of the new median.
  • Adding a number N into the stream, such that N < M will decrease the value of the new median.
  • Adding a number N into the stream, such that N = M will not change the median.

Choosing the Data Structure for the above algorithm

A close observation shows that we are mostly interested the lowest number higher than the median and the highest number lower than the median. This means that we need some data structure which can provide easy access to these extremes.

Another important requirement here is to remove the highest or the lowest and insert it in another list. The insertion and removals need to be faster here.

As we are reading about Heaps, we know that this data structure provides log(N) running time for all these operations. The trick here is to use two Heaps, one for the LOW and one for the HIGH.

  • The LOW list needs to be represented by a Max Heap because we always need the biggest number from that list.
  • The HIGH list needs to be represented by a Min Heap because we always need the smallest number from that list

Note: Of course this problem can be soled by many other ways, which we will discuss later, but most of the places you will find it under the Heaps category.

Source Code

We will take he help of code generated in the last post Building Heaps. Here is the remaining code, which reads from the console and prints the median every time.

You can download the code from github

Analysis

For each of the element in the stream, there are very specific operations:

  • comparing the heap sizes which is O(1)
  • comparing the input element to one of the max or min heap root, which is O(1)
  • Insertion into one of the heaps, also includes increase or decrease key, which is O(logN)
  • Extracting the min or max from the heap, which also involves heapify , hence its cost is O(logN)
  • Calculating Average of two numbers, which is O(1)

Asymptotically speaking, the running time of the algorithm is O(logN) for each element. If there are N elements the time will be O(NlogN) in the worst case.

Side Note: The asymptotic running time is similar to sorting and someone can argue, that maintain a sorted list and inserting the incoming number at teh right position will do the same trick .

Which is true if you use sorts like heap sort which will always promise O(NlogN) running time, even in worst case.

In case you are facing any difficulty understanding this, let me know in comment and I will be happy to address your concerns.