Wednesday, December 28, 2011

Largest ,Second Largest and kth largest & k largest elements in an array

Smallest and 2nd Smallest:
http://www.geeksforgeeks.org/to-find-smallest-and-second-smallest-element-in-an-array/#comment-2541
http://www.geeksforgeeks.org/maximum-and-minimum-in-an-array/

http://www.programcreek.com/2014/02/find-min-max-in-an-array-using-minimum-comparisons/

kth largest && k largest:

k largest/smallest:   
http://www.geeksforgeeks.org/k-largestor-smallest-elements-in-an-array/
kth largest/smallest:http://www.programcreek.com/2014/05/leetcode-kth-largest-element-in-an-array-java/ http://www.geeksforgeeks.org/kth-smallestlargest-element-unsorted-array/
All Methods described above are split based on approach below.
Heap Approach : Method 4 and 6 of 1st Link : More explanation below. http://www.geeksforgeeks.org/tournament-tree-and-binary-heap/

Randomised Partition for QuickSort:
Balanced Partition /  Order Statistics / Selection Algo for Quick Sort: Method 5 of 1st Link http://www.geeksforgeeks.org/kth-smallestlargest-element-unsorted-array-set-3-worst-case-linear-time/
http://en.wikipedia.org/wiki/Selection_algorithm 
http://net.pku.edu.cn/~course/cs101/resource/Intro2Algorithm/book6/chap10.htm 
http://c3p0demo.googlecode.com/svn/trunk/scalaDemo/script/Order_statistics.ppt 

Randomised/Balanced Partition Explained:
1) Use order statistic algorithm to find the kth largest element. Please see the topic selection in worst-case linear time O(n)
2) Use QuickSort Partition algorithm to partition around the kth largest number O(n).
3) Sort the k-1 elements (elements greater than the kth largest element) O(kLogk). This step is needed only if sorted output is required.
Time complexity: O(n) if we don’t need the sorted output, otherwise O(n+kLogk)
Details:
Use order statistic algorithm to find the kth largest element.
Selection Algorithm will give O(n) soln for this problem
Selection Algorithm gives median of an array in O(n). Use this median, which will
 be n/2th largest element in the array(obviously). Reduce the search 
space by half after getting this median. either k < n or k > n 
else if k ==n then we're done
If k < n then process the elements less than the median and recurse else otherwise.
Complexity : n + n/2 + n/4 ...
=> n (1 + 1/2 + 1/4 ....inf)
= n* ( 1/(1-1/2))
= O(n)
Time complexity: O(n)
Detailed Explanation:
We can use the Selection Algorithm as used in quicksort. It works as follows, select a pivot and partition the array to left and right subarrays such that, the elements that are smaller than the pivot value end up in the left group, and the ones that are and larger than or equal to the pivot are in the right group. Now, only the pivot is in its sorted position. The remaining elements are not sorted but their relative position to the pivot, whether they are on the left or right, is as in sorted order. Let’s say after partitioning the array the position of the pivot in the array is m. If m is equal to k, then the pivot is exactly the kth element that we’re looking for, so we return the pivot value. If m is less than k, then the kth element is in the right subarray. Else if m is greater than k, then the kth element is in the left subarray. So we can recursively call the same procedure and find the kth element. The code will make everything clear:
def partition1(arr, left, right, pivotIndex):
    arr[right], arr[pivotIndex]=arr[pivotIndex], arr[right]
    pivot=arr[right]
    swapIndex=left
    for i in range(left, right):
        if arr[i]<pivot:
            arr[i], arr[swapIndex] = arr[swapIndex], arr[i]
            swapIndex+=1
    arr[right], arr[swapIndex]=arr[swapIndex], arr[right]
    return swapIndex
 
def kthLargest1(arr, left, right, k):
    if not 1<=k<=len(arr):
        return
    if left==right:
        return arr[left]
    while True:
        pivotIndex=random.randint(left, right)
        pivotIndex=partition1(arr, left, right, pivotIndex)
        rank=pivotIndex-left+1
        if rank==k:
            return arr[pivotIndex]
        elif k<rank:
            return kthLargest1(arr, left, pivotIndex-1, k)
        else:
            return kthLargest1(arr, pivotIndex+1, right, k-rank)

The partition function divides the array to two subarrays as described above. The pivot used in partition is selected uniformly at random to potentially avoid worst case performance. We search for the kth element within the indexes [left, right]. Left is initially 0 and right is length of array – 1. If the rank of the pivot is equal to k after partitioning, then the pivot itself is the kth element so we return. Otherwise, we recurse by adjusting the bounds. If the rank of the pivot is greater than k, then we should continue our search in the left subarray, so the new array bounds become [left, pivotIndex-1]. Else if the rank of the pivot is less than k, then we continue to search in the right subarray, so the bounds become [pivotIndex+1, right]. The value of k also gets adjusted if we’re searching the right subarray since we change the left index, so the new value of k becomes k-rank, meaning we count out rank number of elements that we’re eliminating at the left portion of the array
The average time complexity of this approach is O(N). But worst case complexity is unfortunately O(N^2), which occurs if we make poor pivot selections that doesn’t partition the array well, and leaves most of the elements at one side and very few at the other. In the extreme case the partition results in 0 elements at one side and all others at the other side, when smallest or largest element is chosen as pivot. As a result we can only eliminate 1 element at each step, leading to quadratic time complexity. Conversely the best case performance occurs when the pivot divides the array into to equal sized partitions, which results in a linear complexity. Here are the recurrence relations for best and worst case, best case is linear worst case is quadratic. Average case also turns out to be linear (proof omitted).
Best case:
T(N) = T(\dfrac{N}{2}) + O(N) \rightarrow T(N) = \Omega(N)
Worst case:
T(N) = T(N-1) + O(N) \rightarrow T(N) = O(N^2)
There’s a very elegant algorithm that has worst case linear time performance, which is called Median of Medians Algorithm. It’s discovered by 5 great computer scientists, Manuel Blum (Blum speedup theorem), Robert Floyd (Floyd-Warshall shortest path algorithm), Vaughan Pratt (Pratt primality certificate), Ron Rivest (RSA cryptography algorithm), and Robert Tarjan (graph algorithms and data structures). Median of medians is a modified version of selection algorithm where we improve pivot selection to guarantee reasonable good worst case split. The algorithm divides the array to groups of size 5 (the last group can be of any size < 5). Then calculates the median of each group by sorting and selecting the middle element (sorting complexity of 5 elements is negligible). Finds the median of these medians by recursively calling itself, and selects the median of medians as the pivot for partition. Then it continues similar to the previous selection algorithm by recursively calling the left or right subarray depending on the rank of the pivot after partitioning. The partition function is slightly different though, partition1 function above takes the index of the pivot as input, partition2 here takes the value of the pivot as input, which is only a slight modification. Here is the code
def partition2(arr, left, right, pivot)
    swapIndex=left
    for i in range(left, right+1):
        if arr[i]<pivot:
            arr[i], arr[swapIndex] = arr[swapIndex], arr[i]
            swapIndex+=1
    return swapIndex-1
 
def kthLargest2(arr, left, right, k):
    length=right-left+1
    if not 1<=k<=length:
        return
    if length<=5:
        return sorted(arr[left:right+1])[k-1]
 
    numMedians=length/5
    medians=[kthLargest2(arr, left+5*i, left+5*(i+1)-1, 3) for i in range(numMedians)]
    pivot=kthLargest2(medians, 0, len(medians)-1, len(medians)/2+1)
    pivotIndex=partition2(arr, left, right, pivot)
    rank=pivotIndex-left+1
    if k<=rank:
        return kthLargest2(arr, left, pivotIndex, k)
    else:
        return kthLargest2(arr, pivotIndex+1, right, k-rank)
The worst case complexity of this approach is O(N) because the median of medians chosen as pivot is either greater than or less than at least 30% of the elements. So even in the worst case we can eliminate constant proportion of the elements at each iteration, which is what we wanted but couldn’t achieve with the previous approach. 
Tournament Tree (Winner Tree) and Binary Heap

Given a team of N players. How many minimum games are required to find second best player?
We can use adversary arguments based on tournament tree (Binary Heap).
Any player defeated by the best player can be the second-best, not always the one who lasts till the end. So, after getting the best player, we must carry out another tournament between the players defeated directly by the best to find the second-best player. 
7 is erroneously chosen as second best According to above figure we are required to carry out the extra tournament between the direct losers. From the above tree, the players who have been defeated by (9) are (8), (3) and (7). So we carry out another tournament:
Finding kth largest:
From the above picture we can see that we have found the max in the array. Now we know 9 is MAX but we can't say 7 is SECOND MAX since that would be wrong. Number of Comparison's to find MAX = (n/2) + logn(n) = (n-1) in the above case so the above algo is also optimal for finding MAX ; since conventional algo also takes the same. To find SECOND MAX take the players that were directly defeated by MAX. i.e. 9 DIRECTLY DEFEATED BY MAX = 8, 3, 7 If we play the above tournament among 8, 3, 7 we get 8 as second MAX we for this tournament we needed just 2 comparison's Total comparisons to find MAX and second MAX using tournament principle = (n/2) + log(n) + log(n) = (n-1) + log(n) Now to find kth MAX we can play (k/2) such tournaments since in one tournament we get MAX and second MAX..then exclude them and play the tournament again(k/2) times So total comparisons to find kth MAX using above method = (k/2) ((n-1) + log(n)) However in the conventional algo it would be (k/2 passes of array)(2n comparisons to find MAX and second MAX) = k * n So tournament algo is more optimal eg:: if k = 4 and n = 8 Comparison's by conventional algo = 4 * 8 = 32 Comparison's by Tournament algo = (4/2) ((7) + 3) = 2 * 10 = 20
Note:
While the naive algorithm does 2n-3 comparisons for a list with n numbers, the second one does only n+lg(n)-2. This algorithm can also be extended to find the k-th largest. To find the k-th largest, simply do k tournaments. The algorithm to find the k-th largest based on tournaments is also optimal.
One implementation problem is that in the first tournament, we must keep track of the players whom the eventual winner is defeating. However, since we don't know who the eventual winner will be, we must keep track of the players getting defeated by all the winners (who have not yet lost).
Alternative Questions:

kth largest element in a stream

You are reading a stream of integers and you cannot look back to the stream once it is passed. At any instant you have to find the kth largest element(if the stream is less than k you need to report smallest element) from the stream read so far. What is data structure you will use and approach to find the same. Strategy: Keep a max heap of size K. 
  • The max element will always be on the top.
  • Compare each incoming element with it.
  • In case, the new number is lesser than max, extract max from heap and insert the new number.
  • At any instance, you are keeping K minimum element.
Total complexity – n log k (where n is the number of elements parsed) http://www.geeksforgeeks.org/kth-largest-element-in-a-stream/ Alternative: You have a infinite stream of repeated number find top K frequent number from this stream and discussion on this question like which data structure will you use and time complexity etc.

2 comments: