The problem encountered in this article is to de-duplicate the data from Hive SQL SELECT with certain columns as key. The following is a step-by-step discussion.

DISTINCT

When it comes to de-duplication, DISTINCT naturally comes to mind. But in Hive SQL, it has two problems.

  • DISTINCT will use all the columns from SELECT as keys for de-duplication. That is, as long as one column has different data, DISTINCT will consider it different and keep it.
  • DISTINCT will put all the data on one reducer, which will cause serious data skew and take a lot of time.

ROW_NUMBER() OVER

Two problems with DISTINCT can be solved with ROW_NUMBER() OVER. For example, if we want to de-duplicate the key by key1 and key2 columns, we would write code like this.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
WITH temp_table AS (
  SELECT
    key1,
    key2,
    [columns]...,
    ROW_NUMBER() OVER (
      PARTITION BY key1, key2
      ORDER BY column ASC
    ) AS rn
  FROM
    table
)

SELECT
  key1,
  key2,
  [columns]...
FROM
  temp_table
WHERE
  rn = 1;

In this way, Hive will put the data on different mappers with key1 and key2 as keys, and then arrange the data in ascending order by column for a group with the same key1 and key2, and finally keep the first data in each group after the arrangement. This completes the task of de-duplication by key1 and key2.

Note the role of PARTITION BY here: first, to break up the data by key1 and key2 to solve the above problem (2); second, to combine with ORDER BY and rn = 1 to group and de-weight the data by key1 and key2 to solve the above problem (1).

But obviously, this is not-elegant, and not surprisingly inefficient.

GROUP BY and COLLECT_SET / COLLECT_LIST

A core part of the ROW_NUMBER() OVER solution is to use PARTITION BY to group data by key, and the same can be done with GROUP BY. However, GROUP BY needs to be used in conjunction with an aggregation function. The combination of ORDER BY and the rn = 1 condition enables the keep first function. We need to consider what kind of aggregation function can achieve or indirectly achieve such a function? It is easy to think of COLLECT_SET and COLLECT_LIST.

So there is this code.

1
2
3
4
5
6
7
8
SELECT
  key1,
  key2,
  [COLLECT_LIST(column)[1] AS column]...
FROM
  temp_table
GROUP BY
  key1, key2

For columns other than key1 and key2, we use COLLECT_LIST to collect them and output the first collected result. The reason for using COLLECT_LIST instead of COLLECT_SET is that SET is unordered, so you can’t guarantee that the output columns are all from the same piece of data. If this is not required or restricted, then you can use COLLECT_SET, which is more resource efficient.

Compared to the previous approach, it is much more efficient because the sorting and (possibly) storing to disk actions are omitted. However, since it is (possibly) not stored to disk, the data in COLLECT_LIST is cached in memory. This approach may trigger an OOM if the number of duplicates is particularly large, so consider breaking the data up further and then merging it, or simply switch to the previous approach.