Find the Kth largest value in an unordered long array of millions. The requirement is, of course, that the faster you find it, the better.

top K questions

As soon as the problem is described, many people will associate it with the top K problem, which is discussed extensively in both the algorithm and engineering fields, and it is easy to encounter similar problems in actual projects, so I also took the opportunity to summarize it into an article.

Common top K problems, and their variants.

  1. There are 10,000,000 records, and these query strings have a relatively high repetition rate of no more than 3,000,000 if the duplicates are removed. The higher the repetition of a query string, the more users query it, that is, the more popular it is. Please count the top 10 most popular query strings, and require no more than 1GB of memory.
  2. There are 10 files, each file is 1GB, and each line of each file holds the user’s query, and the query of each file may be duplicated. Please sort the queries by their frequency.
  3. There is a file of 1GB size, each line is a word, the size of the word does not exceed 16 bytes, the memory limit size is 1MB, please query the highest frequency of 100 words.
  4. Extracts the IP of the most visited website on a certain day.
  5. Find the 100 integers with the highest number of repetitions out of 1 billion integers.
  6. The input information for the search is a string, counting the top 10 most popular items among the 3 million input information, and each input is a string of no more than 255B, with a maximum memory usage of 1GB.
  7. There are 10 million ID numbers and their corresponding data, the ID numbers may be duplicated, find the ID number that appears most often.

These questions.

Description of the traditional top K problem.

The problem of finding the maximum top K numbers in a large amount of data is often called the top K problem. For example, in a search engine, counting the top 10 most popular search terms; in a song library, counting the top 10 most downloaded songs, etc.

Note that there is a slight difference between my problem and the traditional top K. The traditional top K problem requires the “first K numbers”, while what I have encountered is the “Kth number”. The difference is not that big, but it can make a big difference in the final solution we choose.

I will present some traditional top K problem solutions below, and I will intersperse each solution with a comparison of the applicable differences between the top K numbers and the Kth largest number. And, in my usual style, I will definitely release the code, so if you are looking for an answer to the problem of “finding the Kth largest number for a large amount of unordered data”, I believe you can directly copy my code.

Option 1: Sorting method

Sorting is the easiest idea to think of, with a complexity of O(nlogn). Various sorting algorithms can be thought of, such as fast sort, subsumption sort, insertion sort, monkey sort…etc.

However, in the field of engineering, the choice of a solution often cannot be evaluated using only the complexity of the algorithm.

  • The amount of data exchanged per sorting scheme
  • Amount of extra space requested
  • Average complexity
  • Worst complexity
  • Performance with different data volumes

So this time someone has to ask, how do I choose the right solution? Hey, then I have to mention that phrase again, benchmark everything! Although you surely know that I didn’t choose to use sorting to solve the Kth largest problem in the end, I still want to share with you some of my test conclusions.

In an unordered long array of 100w~1000w data size, the JDK’s own Array.sort() is faster than any of the sorting schemes.

The internal implementation of Array.sort is timsort, which is an optimized merge sort.

A little thought shows that sorting is not the optimal solution, because in my problem, I only need to find the Kth largest number, but the sorting solution goes to the trouble of sorting the entire array, which is not necessary

Option 2: Heap

Option 2: Heap For general top K problems, K is generally very small by default, so for general top K problems, you can choose to use heap to solve.

Heap has an important property: the value of each node is not greater than the value of its left and right child nodes, then the top element of the heap is the minimum value of the whole heap.

The JDK’s PriorityQueue implements the heap as a data structure, and specifies the comparator field to represent either the small top heap or the large top heap, with the default being natural ordering.

The idea of Top K problem is solved by the small top heap.

  • The small top heap maintains the maximum number of K elements currently scanned.
  • Each subsequent scan is added to the heap if it is larger than the top of the heap.
  • then delete the top of the heap.
  • And so on, until all elements are scanned.

Java implements the Kth largest integer code as follows.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public int findKthLargest(int[] nums, int k) {
  PriorityQueue<Integer> minQueue = new PriorityQueue<>(k);
  for (int num : nums) {
    if (minQueue.size() < k || num > minQueue.peek())
      minQueue.offer(num);
    if (minQueue.size() > k)
      minQueue.poll();
  }
  return minQueue.peek();
}

Returning to my problem of finding the Kth largest number, the range of K is not specified here, so in the worst case, K == N/2, either maintaining a small top heap with top K or a large top heap with top(N - K) would require O(N/2) of memory, which is shown to be a very large overhead for large amounts of data. So for my scenario (a competition), the heap solution can be rejected outright.

The heap solution is suitable for scenarios where K is small and it is very convenient to maintain the first K numbers.

Option 3: Quick Select

You may not have heard of Quick Select, but you must have heard of Quick Sort. In fact, the author of both algorithms is Hoare, and their ideas are very similar: pick a base element pivot, partition the array into two subarrays, throw the left subarray if it is larger than pivot, throw the right subarray if it is smaller than pivot, and then recursively cut the numerator array. Quick Select differs from Quick Sort in that it does not slice each subarray, but the target subarray. Second, Quick Select, like Quick Sort, is an unstable algorithm; pivot selection directly affects the algorithm’s performance, with a worst-case time complexity of O(n2).

I first encountered this algorithm when I was in college at ACM, and I remember that the amount of data was such that Quick Sort could not pass and Quick Select could.

The Java implementation of Quick Select is as follows.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
public static long quickSelect(long[] nums, int start, int end, int k) {
        if (start == end) {
            return nums[start];
        }
        int left = start;
        int right = end;
        long pivot = nums[(start + end) / 2];
        while (left <= right) {
            while (left <= right && nums[left] > pivot) {
                left++;
            }
            while (left <= right && nums[right] < pivot) {
                right--;
            }
            if (left <= right) {
                long temp = nums[left];
                nums[left] = nums[right];
                nums[right] = temp;
                left++;
                right--;
            }
        }
        if (start + k - 1 <= right) {
            return quickSelect(nums, start, right, k);
        }
        if (start + k - 1 >= left) {
            return quickSelect(nums, left, end, k - (left - start));
        }
        return nums[right + 1];
    }

In the end, I chose to use Option 3: Quick Select as my solution to solve the Kth largest number, which is also the fastest solution under benchmark. In 10 queries, the sorting solution takes 6s, while the Quick Select solution takes only 300ms, which is a very big optimization.

Summary

This article briefly introduces the Top K problem for unordered arrays and the Kth largest number for unordered arrays, two very similar problems, and provides three common solutions. Of course, there are many variants of the problem, such as solving TopK on multi-core machines, multiple hosts, and even introducing the idea of exclusion and MapReduce, which are already considering other levels of optimization, so I won’t elaborate too much here.


Reference https://www.cnkirito.moe/topk/