Let’s start with the general conclusion (the full conclusion is at the end of the article):
- With the same semantics and an index: group by and distinct can both use the index, with the same efficiency.
- With the same semantics and no index: distinct is more efficient than group by. The reason is that both distinct and group by perform a grouping operation, but group by may also sort and trigger a filesort, which makes the SQL execute inefficiently.
Based on this conclusion, you may ask:
- Why are group by and distinct equally efficient when the semantics are the same and an index is available?
- Under what circumstances does group by perform a sorting operation?
With these two questions in mind, let’s find the answers. First, let’s look at the basic usage of distinct and group by.
Usage of distinct
The DISTINCT keyword returns only unique rows. It is placed before the first column of the query and applies to all the columns in the select list.
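As a minimal sketch of the placement rule described above (the table distinct_basic and its data are hypothetical, made up only for illustration):

```sql
-- Hypothetical demo table; names and data are illustrative only.
CREATE TABLE distinct_basic (
  id   INT PRIMARY KEY,
  name VARCHAR(20),
  age  INT
);
INSERT INTO distinct_basic VALUES
  (1, 'Ann', 18), (2, 'Bob', 18), (3, 'Cid', 20);

-- DISTINCT goes directly after SELECT, before the first column.
-- Two rows come back: 18 and 20 (row order is not guaranteed).
SELECT DISTINCT age FROM distinct_basic;

-- DISTINCT cannot be placed after another column; this is a syntax error:
-- SELECT name, DISTINCT age FROM distinct_basic;
```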
If a column contains NULL values and the DISTINCT clause is applied to that column, MySQL keeps one NULL and removes the others, because the DISTINCT clause treats all NULL values as the same value.
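The NULL handling can be checked with a small example. The table null_demo below is hypothetical:

```sql
-- Hypothetical demo table with two NULL ages.
CREATE TABLE null_demo (
  id  INT PRIMARY KEY,
  age INT
);
INSERT INTO null_demo VALUES
  (1, 18), (2, NULL), (3, NULL), (4, 20);

-- Three rows come back: 18, 20, and a single NULL
-- (row order is not guaranteed).
SELECT DISTINCT age FROM null_demo;
```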
distinct multi-column de-duplication
With multiple columns, distinct de-duplicates on the combination of the listed columns: a row counts as a duplicate only if every listed column matches another row.
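A short sketch of multi-column de-duplication, again on a hypothetical table (multi_demo):

```sql
-- Hypothetical demo table; rows 1 and 2 share the same (sex, age) pair.
CREATE TABLE multi_demo (
  id  INT PRIMARY KEY,
  sex VARCHAR(6),
  age INT
);
INSERT INTO multi_demo VALUES
  (1, 'male', 18), (2, 'male', 18), (3, 'female', 18), (4, 'male', 20);

-- Only rows whose sex AND age both match are collapsed.
-- Three rows come back: (male, 18), (female, 18), (male, 20).
SELECT DISTINCT sex, age FROM multi_demo;
```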
Use of group by
For basic de-duplication, group by is used in much the same way as distinct.
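To illustrate the similarity, the two queries below should return the same set of rows. The table groupby_basic is a hypothetical example:

```sql
-- Hypothetical demo table.
CREATE TABLE groupby_basic (
  id  INT PRIMARY KEY,
  sex VARCHAR(6),
  age INT
);
INSERT INTO groupby_basic VALUES
  (1, 'male', 18), (2, 'male', 18), (3, 'female', 18), (4, 'male', 20);

-- De-duplicate on (sex, age) with GROUP BY ...
SELECT sex, age FROM groupby_basic GROUP BY sex, age;

-- ... and with DISTINCT. Both return the same three (sex, age) pairs,
-- although the row order may differ between the two.
SELECT DISTINCT sex, age FROM groupby_basic;
```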
Single column de-duplication
The difference between the two syntaxes is that group by can de-duplicate on a single column while the query still returns other columns. group by works by collecting the rows into groups and returning one row for each group, and the de-duplication is based on the columns that follow group by.
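Here is a sketch of single-column de-duplication on a hypothetical table (single_demo). It wraps the extra column in MIN() so that the query is also accepted when the ONLY_FULL_GROUP_BY SQL mode is enabled; selecting a bare non-grouped column is only allowed when that mode is off.

```sql
-- Hypothetical demo table.
CREATE TABLE single_demo (
  id   INT PRIMARY KEY,
  name VARCHAR(20),
  age  INT
);
INSERT INTO single_demo VALUES
  (1, 'Ann', 18), (2, 'Bob', 18), (3, 'Cid', 20);

-- GROUP BY de-duplicates on age alone and still returns other data.
-- Two rows: (18, 'Ann', 2) and (20, 'Cid', 1).
SELECT age, MIN(name) AS sample_name, COUNT(*) AS cnt
FROM single_demo
GROUP BY age;

-- DISTINCT applies to the whole select list, so this de-duplicates
-- on the (age, name) pair and returns all three rows.
SELECT DISTINCT age, name FROM single_demo;
```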
Principles of distinct and group by
In most cases, DISTINCT can be regarded as a special GROUP BY. Both are implemented as a grouping operation, and both can be carried out with a loose index scan or a tight index scan (index scans will be covered in detail in other articles, so I won’t go into them here).
DISTINCT and GROUP BY can both use an index for scanning and lookup. For example, if we run EXPLAIN on a DISTINCT query and on the equivalent GROUP BY query over an indexed column and look only at the Extra column, we may see that both statements use a loose index scan, reported as Using index for group-by.
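The comparison can be sketched as follows. The table idx_demo and the index idx_age are hypothetical, and the plan actually chosen depends on the MySQL version, the data volume, and the table statistics, so the Extra values in the comments are expectations rather than guaranteed output.

```sql
-- Hypothetical demo table with a secondary index on age.
CREATE TABLE idx_demo (
  id  INT PRIMARY KEY,
  age INT,
  KEY idx_age (age)
);
INSERT INTO idx_demo VALUES
  (1, 18), (2, 18), (3, 20), (4, 20), (5, 22);

-- Compare the Extra column of the two plans. When the optimizer picks
-- a loose index scan, Extra includes "Using index for group-by";
-- on a very small table it may report plain "Using index" instead.
EXPLAIN SELECT DISTINCT age FROM idx_demo;
EXPLAIN SELECT age FROM idx_demo GROUP BY age;
```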
So, in general, DISTINCT and GROUP BY statements with the same semantics can be optimized with the same index optimization techniques.
However, before MySQL 8.0, GROUP BY by default performs an implicit sort on the grouping columns.
As a result, on those versions EXPLAIN for a GROUP BY on a column without a usable index reports both a temporary table and a filesort (Using temporary; Using filesort).
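A sketch of how this can be observed, using a hypothetical table (noidx_demo) with no index on the grouped column. The Extra values in the comments are what is typically seen; they are not guaranteed for every version and configuration.

```sql
-- Hypothetical demo table: age has no index.
CREATE TABLE noidx_demo (
  id  INT PRIMARY KEY,
  age INT
);
INSERT INTO noidx_demo VALUES
  (1, 18), (2, 18), (3, 20);

-- On MySQL 5.7 the Extra column typically shows
-- "Using temporary; Using filesort" because of the implicit sort;
-- on MySQL 8.0 it typically shows only "Using temporary".
EXPLAIN SELECT age FROM noidx_demo GROUP BY age;

-- The DISTINCT form needs the temporary table but no sort,
-- so Extra typically shows only "Using temporary".
EXPLAIN SELECT DISTINCT age FROM noidx_demo;
```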
For implicit sorting, we can refer to the official MySQL documentation:
GROUP BY implicitly sorts by default (that is, in the absence of ASC or DESC designators for GROUP BY columns). However, relying on implicit GROUP BY sorting (that is, sorting in the absence of ASC or DESC designators) or explicit sorting for GROUP BY (that is, by using explicit ASC or DESC designators for GROUP BY columns) is deprecated. To produce a given sort order, provide an ORDER BY clause.
So, prior to MySQL 8.0, GROUP BY sorted the results by default on the grouping columns (the columns that follow GROUP BY). When a suitable index is available, GROUP BY needs no additional sort, because the index already delivers the rows in order; when no index is available, the MySQL optimizer has to implement GROUP BY with a temporary table followed by a sort.
When the result set exceeds the temporary table size configured on the system, MySQL moves the temporary table to disk before continuing, and execution of the statement becomes very slow. This is why MySQL chose to deprecate implicit sorting.
For these reasons, MySQL changed this behavior in 8.0:
Previously (MySQL 5.7 and lower), GROUP BY sorted implicitly under certain conditions. In MySQL 8.0, that no longer occurs, so specifying ORDER BY NULL at the end to suppress implicit sorting (as was done previously) is no longer necessary. However, query results may differ from previous MySQL versions. To produce a given sort order, provide an ORDER BY clause.
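The two techniques mentioned in the quote can be sketched like this, on a hypothetical table (order_demo):

```sql
-- Hypothetical demo table.
CREATE TABLE order_demo (
  id  INT PRIMARY KEY,
  age INT
);
INSERT INTO order_demo VALUES
  (1, 20), (2, 18), (3, 18);

-- MySQL 5.7 and earlier: ORDER BY NULL suppresses the implicit sort.
-- On MySQL 8.0 the clause is accepted but no longer necessary.
SELECT age, COUNT(*) AS cnt
FROM order_demo
GROUP BY age
ORDER BY NULL;

-- Any version: when a specific order is needed, ask for it explicitly.
-- Rows come back as (18, 2) then (20, 1).
SELECT age, COUNT(*) AS cnt
FROM order_demo
GROUP BY age
ORDER BY age;
```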
Thus, we arrive at our conclusion:
With the same semantics and an index: group by and distinct can both use the index, with the same efficiency. group by and distinct are nearly equivalent, and distinct can be seen as a special group by.
With the same semantics and no index: distinct is more efficient than group by. Both distinct and group by perform a grouping operation, but before MySQL 8.0 group by also performs an implicit sort, which triggers a filesort and makes the SQL execute inefficiently.
Since MySQL 8.0, however, implicit sorting has been removed, so with the same semantics and no index, the execution efficiency of group by and distinct is nearly equivalent as well.
Reasons why group by is recommended
- The semantics of group by are clearer.
- group by can do more complex processing of the data.
The semantics of group by are clearer than those of distinct. And since the distinct keyword applies to every selected column, group by is more flexible when it comes to compound business logic.
group by can also perform more complex processing on the grouped data, such as filtering groups with having or computing values with aggregate functions.
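As an illustration of this extra flexibility, the sketch below combines aggregate functions with having. The table score_demo and its columns are hypothetical.

```sql
-- Hypothetical demo table.
CREATE TABLE score_demo (
  id    INT PRIMARY KEY,
  name  VARCHAR(20),
  age   INT,
  score INT
);
INSERT INTO score_demo VALUES
  (1, 'Ann', 18, 90), (2, 'Bob', 18, 70), (3, 'Cid', 20, 80);

-- Group by age, compute aggregates per group, and keep only the
-- groups with more than one member. DISTINCT alone cannot express this.
-- One row comes back, for age 18 with cnt = 2.
SELECT age, COUNT(*) AS cnt, AVG(score) AS avg_score
FROM score_demo
GROUP BY age
HAVING COUNT(*) > 1;
```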