The problem encountered in this article is de-duplicating the data from a Hive SQL SELECT, using certain columns as the key. The following is a step-by-step discussion.
When it comes to de-duplication, DISTINCT naturally comes to mind. But in Hive SQL it has two problems:
(1) DISTINCT uses all the columns of the SELECT as the de-duplication key. That is, as long as one column differs, DISTINCT considers the rows different and keeps both.
(2) DISTINCT puts all the data on one reducer, which causes serious data skew and takes a lot of time.
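To illustrate problem (1), here is a minimal sketch; the table t and its columns key1, key2, and value are hypothetical names for illustration:

```sql
-- Hypothetical table t(key1, key2, value).
-- Intent: de-duplicate by key2 only. But DISTINCT keys on ALL selected
-- columns, so two rows that share key2 while differing in key1 or value
-- are treated as distinct and both kept.
SELECT DISTINCT key1, key2, value
FROM t;
```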
Both problems with DISTINCT can be solved with ROW_NUMBER() OVER. For example, if we want to de-duplicate by the key2 column, we would write code like this.
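A sketch of the pattern; the table t, the sort column key1, and the column value are assumed names for illustration:

```sql
-- Number the rows within each key2 group (ordered by key1, assumed here),
-- then keep only the first row of each group.
SELECT key1, key2, value
FROM (
    SELECT key1, key2, value,
           ROW_NUMBER() OVER (PARTITION BY key2 ORDER BY key1) AS rn
    FROM t
) tmp
WHERE rn = 1;
```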
In this way, Hive distributes the data across reducers with key2 as the key, then sorts the rows within each group sharing the same key2 in ascending order by the ORDER BY column, and finally keeps the first row of each group. This completes the task of de-duplication by key2.
Note the role of PARTITION BY here: first, it spreads the data out by key2, solving problem (2) above; second, combined with ORDER BY and rn = 1, it groups and de-duplicates the data by key2, solving problem (1) above.
But obviously this is not elegant and, unsurprisingly, not very efficient.
GROUP BY and COLLECT_SET / COLLECT_LIST
A core part of the ROW_NUMBER() OVER solution is using PARTITION BY to group the data by key, and the same can be done with GROUP BY. However, GROUP BY needs to be used in conjunction with an aggregation function. In the previous solution, the combination of ORDER BY and the rn = 1 condition implements the "keep first" behavior, so we need an aggregation function that achieves, or indirectly achieves, the same thing. It is easy to think of COLLECT_SET and COLLECT_LIST. So there is this code.
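A sketch of the GROUP BY approach; again, the table t and columns key1 and value are assumed names:

```sql
-- Collect each non-key column into a list per key2 group and
-- output the first collected element of each list.
SELECT COLLECT_LIST(key1)[0] AS key1,
       key2,
       COLLECT_LIST(value)[0] AS value
FROM t
GROUP BY key2;
```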
For columns other than key2, we use COLLECT_LIST to collect them and output the first element of each collected list. The reason for using COLLECT_LIST instead of COLLECT_SET is that a set is unordered, so you cannot guarantee that the output columns all come from the same row. If there is no such requirement or restriction, you can use COLLECT_SET, which is more resource-efficient.
Compared to the previous approach, this is much more efficient because the sort and the (possible) spill to disk are omitted. However, since nothing is spilled to disk, the data collected by COLLECT_LIST is cached in memory. If the number of duplicates per key is particularly large, this approach may trigger an OOM; in that case, consider breaking the data up further and then merging, or simply fall back to the previous approach.
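One way to sketch the "break up further, then merge" idea is a two-stage aggregation; the salt expression and all names here are assumptions, not from the original:

```sql
-- Stage 1: de-duplicate within finer groups (key2 plus a salt bucket),
-- so each COLLECT_LIST holds at most a fraction of the duplicates.
-- Stage 2: de-duplicate the much smaller stage-1 output by key2 alone.
SELECT COLLECT_LIST(key1)[0] AS key1,
       key2,
       COLLECT_LIST(value)[0] AS value
FROM (
    SELECT COLLECT_LIST(key1)[0] AS key1,
           key2,
           COLLECT_LIST(value)[0] AS value
    FROM t
    GROUP BY key2, pmod(hash(key1), 16)   -- spread duplicates over 16 buckets
) stage1
GROUP BY key2;
```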