Finding the Kth Largest Number in a Large Amount of Unordered Data

Find the Kth largest value in an unordered long array of millions. The requirement is, of course, that the faster you find it, the better.

Option 1: Sorting method

Sorting is the easiest idea to think of, with a complexity of O(nlogn). Various sorting algorithms can be thought of, such as fast sort, subsumption sort, insertion sort, monkey sort…etc.

However, in the field of engineering, the choice of a solution often cannot be evaluated using only the complexity of the algorithm.

The amount of data exchanged per sorting scheme
Amount of extra space requested
Average complexity
Worst complexity
Performance with different data volumes

So this time someone has to ask, how do I choose the right solution? Hey, then I have to mention that phrase again, benchmark everything! Although you surely know that I didn’t choose to use sorting to solve the Kth largest problem in the end, I still want to share with you some of my test conclusions.

In an unordered long array of 100w~1000w data size, the JDK’s own Array.sort() is faster than any of the sorting schemes.

The internal implementation of Array.sort is timsort, which is an optimized merge sort.

A little thought shows that sorting is not the optimal solution, because in my problem, I only need to find the Kth largest number, but the sorting solution goes to the trouble of sorting the entire array, which is not necessary

Option 2: Heap

Option 2: Heap For general top K problems, K is generally very small by default, so for general top K problems, you can choose to use heap to solve.

Heap has an important property: the value of each node is not greater than the value of its left and right child nodes, then the top element of the heap is the minimum value of the whole heap.

The JDK’s PriorityQueue implements the heap as a data structure, and specifies the comparator field to represent either the small top heap or the large top heap, with the default being natural ordering.

The idea of Top K problem is solved by the small top heap.

The small top heap maintains the maximum number of K elements currently scanned.
Each subsequent scan is added to the heap if it is larger than the top of the heap.
then delete the top of the heap.
And so on, until all elements are scanned.

Java implements the Kth largest integer code as follows.

public int findKthLargest(int[] nums, int k) {
  PriorityQueue<Integer> minQueue = new PriorityQueue<>(k);
  for (int num : nums) {
    if (minQueue.size() < k || num > minQueue.peek())
      minQueue.offer(num);
    if (minQueue.size() > k)
      minQueue.poll();
  }
  return minQueue.peek();
}

Returning to my problem of finding the Kth largest number, the range of K is not specified here, so in the worst case, K == N/2, either maintaining a small top heap with top K or a large top heap with top(N - K) would require O(N/2) of memory, which is shown to be a very large overhead for large amounts of data. So for my scenario (a competition), the heap solution can be rejected outright.

The heap solution is suitable for scenarios where K is small and it is very convenient to maintain the first K numbers.

Option 3: Quick Select

You may not have heard of Quick Select, but you must have heard of Quick Sort. In fact, the author of both algorithms is Hoare, and their ideas are very similar: pick a base element pivot, partition the array into two subarrays, throw the left subarray if it is larger than pivot, throw the right subarray if it is smaller than pivot, and then recursively cut the numerator array. Quick Select differs from Quick Sort in that it does not slice each subarray, but the target subarray. Second, Quick Select, like Quick Sort, is an unstable algorithm; pivot selection directly affects the algorithm’s performance, with a worst-case time complexity of O(n2).

I first encountered this algorithm when I was in college at ACM, and I remember that the amount of data was such that Quick Sort could not pass and Quick Select could.

The Java implementation of Quick Select is as follows.

public static long quickSelect(long[] nums, int start, int end, int k) {
        if (start == end) {
            return nums[start];
        }
        int left = start;
        int right = end;
        long pivot = nums[(start + end) / 2];
        while (left <= right) {
            while (left <= right && nums[left] > pivot) {
                left++;
            }
            while (left <= right && nums[right] < pivot) {
                right--;
            }
            if (left <= right) {
                long temp = nums[left];
                nums[left] = nums[right];
                nums[right] = temp;
                left++;
                right--;
            }
        }
        if (start + k - 1 <= right) {
            return quickSelect(nums, start, right, k);
        }
        if (start + k - 1 >= left) {
            return quickSelect(nums, left, end, k - (left - start));
        }
        return nums[right + 1];
    }

In the end, I chose to use Option 3: Quick Select as my solution to solve the Kth largest number, which is also the fastest solution under benchmark. In 10 queries, the sorting solution takes 6s, while the Quick Select solution takes only 300ms, which is a very big optimization.

Summary

This article briefly introduces the Top K problem for unordered arrays and the Kth largest number for unordered arrays, two very similar problems, and provides three common solutions. Of course, there are many variants of the problem, such as solving TopK on multi-core machines, multiple hosts, and even introducing the idea of exclusion and MapReduce, which are already considering other levels of optimization, so I won’t elaborate too much here.

Reference https://www.cnkirito.moe/topk/

Finding the Kth Largest Number in a Large Amount of Unordered Data

Table of Contents

top K questions

Option 1: Sorting method

Option 2: Heap

Option 3: Quick Select

Summary