Reading integer streams and return N th order statistics

Problem

Imagine you are reading in a stream of integers. Periodically, you wish to be able to look up the rank of a number x (the number of values less than or equal to  x).
Implement the data structures and algorithms to support these operations. 
The following methods need to be implemented.

// method to take new number x
void track(int x); 

// method to return the number of values that are less than or equal to x (not including x itself)
int get_rank(int x); 

Example:

Stream (in order of appearance): 5, 1, 4, 4,  5,  9, 7, 13, 3

get_rank(1) = 0
get_rank(3) = 1
get_rank(4) = 3


There are three possible ways, where first two uses the array and the last uses the binary tree.

Each approach has the advantages and disadvantages. Which solution is better would depend on which operation is more important or frequent than the other.

For this question, I will assume, get_rank() is called more frequently than track() itself.

Binary search approach using Array, O(N) for track(), O(logN) for get_rank()


This approach scarifies the efficiency of track() for the sake of faster, get_rank() operation.

When the frequency of track(int x) method is less frequent than get_rank(), we can use array or std::vector to store the integers and sort the integers each time a new value is given.

Sorting the values each time takes O(N) since it is basically inserting the new number in the middle of array, which causes the values after the new number shifted to the right.

For get_rank(int x) operation, we can use binary search to find the position of the largest values that is equal or less than x, which results in O(logN) running time, where N is the number of values stored.

Here is the complete code in C++.

#include<vector>
#include<iostream>
using namespace std;
class NumberStream {
public:
// O(N)
void track(int n)
{
int i = 0;
for (i = 0; i < list.size(); i++)
{
if (list[i] > n)
break;
}
list.insert(list.begin()+i, n);
for( i = 0; i< list.size(); i++)
cout<<list[i]<<",";
cout<<endl;
}
// O( log N)
int get_rank(int x)
{
int pos = INT_MIN;
search(0, list.size()-1, x ,pos);
return pos;
}
private:
// this does not return the number correctly if non existing number is given.
void search(int s, int e, int target ,int& max)
{
if (s > e)
return;
int h = s+ (e-s)/2;
if (list[h] <= target)
{
max = (max < h)? h: max;
s = h + 1;
} else {
e = h -1;
}
search(s, e, target, max);
}
vector<int> list;
};
int main()
{
NumberStream s;
s.track(5);
s.track(1);
s.track(4);
s.track(4);
s.track(5);
s.track(9);
s.track(7);
s.track(9);
s.track(13);
s.track(3);
cout<<"rank of 1 : "<<s.get_rank(1)<<endl;
cout<<"rank of 3 : "<<s.get_rank(3)<<endl;
cout<<"rank of 4 : "<<s.get_rank(4)<<endl;
return 1;
}

On demand sorting using Linked list, O(1) for track(), O(NlogN) for get_rank()


If we favor the efficiency of track(int x) operation, we can use linked list to store the values and defer the sorting until get_rank(int x) is called.

In this case, track(int x) operation is simply adding a new value to the end of the linked list, which runs in O(1).

When track(int x) is called, we sort the values in O(NlogN) time and linearly search x while counting the preceding values, which runs in O(N).

Overall running time for track(int x) is O(NlogN).
You can avoid sorting the numbers each time by remembering whether the numbers are sorted.
Say we keep the flag, bSorted. We reset the flag to false when track(int x) is called and set it to true after sorting the values.  If the flag is true when get_rank(int x) we can skip the sorting.

This approach is quite straightforward. Therefore, I would not show any coding example.


Binary search tree approach, O(logN) for track(), O(logN) for get_rank()


Last way is to use binary tree to store the values in descending order, assuming we are using balanced binary search tree.

In this case, track(int x) operation is to insert a new value to the binary search tree, which runs in O(\log N).

get_rank(int x) is to find the N th order statistics. It can be achieved through in-order search. However, it would take O(N) time.

Considering that we are only interested in the values smaller than a given value, x, we can keep track of the number of nodes on the left side of the tree when adding the new element. (see Figure 1, for example tree).

Figure 1. Example binary tree with the left-side children counts



With this information present, we can now use the binary search for get_rank(int x) operation, which results in O(logN).

The last one is the most efficient but it would not be practical to use in the interview unless you are not required to write the code. Instead, you need to talk through the solution or write the pseudo code.

Balancing the binary tree will require the adjustment of the count of left side nodes in each node.
Unless you are confident that you can write both balancing routine less than 30 mins, I personally recommend you to go with the first solution and explain this solution as possible improvement.

The following code excludes the balancing the tree.

#include<iostream>
using namespace std;
struct node {
int v;
int lc;
node * l;
node * r;
node(int value): v(value), lc(0), l(0), r(0){}
};
class NumberStream {
public:
NumberStream():root(0){}
void track(int x)
{
node *prev = 0, *cur = root;
if (!root)
{
root = new node(x);
return;
}
while(cur)
{
prev = cur;
if (cur->v >= x)
{
cur->lc++;
cur = cur->l;
} else {
cur = cur->r;
}
}
if (prev->v >= x)
{
prev->l = new node(x);
} else {
prev->r = new node(x);
}
}
int get_rank(int x)
{
int count = 0;
node * cur = root;
while (cur)
{
if (cur->v == x)
{
count = cur->lc + count;
break;
} else if ( cur->v < x) {
count = count + cur->lc + 1;
cur= cur->r;
} else {
cur = cur->l;
}
}
return count;
}
private:
node *root;
};
int main()
{
NumberStream s;
s.track(5);
s.track(1);
s.track(4);
s.track(4);
s.track(5);
s.track(9);
s.track(7);
s.track(9);
s.track(13);
s.track(3);
cout<<"rank of 1 : "<<s.get_rank(1)<<endl;
cout<<"rank of 3 : "<<s.get_rank(3)<<endl;
cout<<"rank of 4 : "<<s.get_rank(4)<<endl;
return 1;
}



Here is python solution, this does not handle duplicate key correctly since it stop at the first occurrence of the matching key. We need better algorithm to handle duplicate key.

#this solution does not handle the duplicate key since it stops at the first appearance of the matching key.
class Node:
right = None
left = None
left_child_count = 0
value = None
def __init__(self, value, count = 0):
self.value = value
self.left_child_count = count
class BinarySearchTree:
root = None
def add(self, num):
if (self.root is None):
self.root = Node(num)
return
cur = self.root
is_right = False
while (cur is not None):
prev = cur
if (num >= cur.value):
cur = cur.right
is_right = True
else:
cur.left_child_count +=1
cur = cur.left
is_right = False
if (num >= prev.value):
prev.right = Node(num)
else:
prev.left = Node(num)
def search_rank(self, num):
cur = self.root
prev = None
rank = 0
while (cur is not None):
prev = cur
if num == cur.value:
rank += cur.left_child_count
if num > cur.value:
rank += cur.left_child_count+1
cur = cur.right
else:
cur = cur.left
return rank
class NumberRanker:
tree = BinarySearchTree()
def track(self, num):
self.tree.add(num)
def get_rank(self, num):
return self.tree.search_rank(num)
ranker = NumberRanker()
input = [5, 1, 4, 4, 5, 9, 7, 13, 3]
# 1,3,4,4,5,7,9,13
for i in input:
ranker.track(i)
print( "get_rank(1): ", ranker.get_rank(1))
print( "get_rank(3): ", ranker.get_rank(3))
print( "get_rank(4): ", ranker.get_rank(4))
print( "get_rank(10): ", ranker.get_rank(10))

Comments

Popular posts from this blog

Planting flowers with no adjacent flower plots

Find the shorted path from the vertex 0 for given list of vertices.