I am reviewing and learning SQL, and I noticed something that seems curious to me.
Suppose I have a table called productos and one of its fields is categoria. When I run the following queries, I see that the result is the same:
SELECT DISTINCT categoria FROM productos;
and
SELECT categoria FROM productos GROUP BY categoria;
The difference I notice is that DISTINCT filters out the duplicates and respects the order in which they appear, while the statement that uses GROUP BY returns them in alphabetical order. Based on that, can it be said that the first statement executes faster? If so, when handling large volumes of data, would the difference in performance be considerable?
Although both techniques produce the same final result, they are not equally appropriate for what you want to achieve.
For the case you describe, the correct choice is DISTINCT, since it applies to the row, whereas GROUP BY was created to work with aggregate functions such as SUM(), MAX(), AVG(), etc.
The order would not be a problem, because an ORDER BY resolves the difference.
These links, although in English, raise the same question:
Is there any difference between GROUP BY and DISTINCT
What is the difference between GROUP BY and DISTINCT?
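A minimal sketch of the point about ORDER BY and aggregates. Python's built-in sqlite3 module is used here only as a convenient way to run the SQL; the extra columns and the sample rows are invented for illustration, and only productos and categoria come from the question.

```python
import sqlite3

# In-memory database; the rows below are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productos (id INTEGER PRIMARY KEY, nombre TEXT, categoria TEXT)"
)
conn.executemany(
    "INSERT INTO productos (nombre, categoria) VALUES (?, ?)",
    [
        ("mesa", "muebles"),
        ("silla", "muebles"),
        ("taladro", "herramientas"),
        ("martillo", "herramientas"),
        ("lampara", "iluminacion"),
    ],
)

# Without ORDER BY, neither query guarantees any particular row order.
# Adding the same ORDER BY to both makes the output identical.
with_distinct = conn.execute(
    "SELECT DISTINCT categoria FROM productos ORDER BY categoria"
).fetchall()
with_group_by = conn.execute(
    "SELECT categoria FROM productos GROUP BY categoria ORDER BY categoria"
).fetchall()
print(with_distinct)
print(with_group_by)

# GROUP BY becomes necessary once a value per group is wanted.
per_category = conn.execute(
    "SELECT categoria, COUNT(*) FROM productos "
    "GROUP BY categoria ORDER BY categoria"
).fetchall()
print(per_category)
```

The first two lists come out the same, and the last query shows the kind of per-group aggregate that DISTINCT alone cannot express.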
GROUP BY is used more for operations such as count, sum, etc.
Depending on the number of records in the table (talking about millions of records), the select (whether with distinct or with group by) will take more or less the same time.
If the table has hundreds of millions of records (100, 200, 500 million), it is sometimes best to first extract the data you want to group into a temporary table (select ... insert) and then run the distinct or the group by on the temporary table. The query is then considerably faster.
In addition to what Leandro comments, and as a faithful translation of one of the answers in the link he attaches: the answer varies between engines, but these two database engines give you an idea:
Answer:
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Perhaps there is a difference, if there are subqueries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
No difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
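One way to check the "same execution plan" claim on your own engine is to ask it for the plan of each query. The sketch below does this with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is an assumption made for illustration: SQL Server and Oracle have their own plan tools, and the plan text differs between engines and versions. The sample rows and the plan() helper are hypothetical.

```python
import sqlite3

# In-memory database with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE productos (id INTEGER PRIMARY KEY, categoria TEXT)")
conn.executemany(
    "INSERT INTO productos (categoria) VALUES (?)",
    [("a",), ("b",), ("a",), ("c",)],
)


def plan(sql):
    """Return the plan steps SQLite reports for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable description.
    return [row[-1] for row in rows]


plan_distinct = plan("SELECT DISTINCT categoria FROM productos")
plan_group_by = plan("SELECT categoria FROM productos GROUP BY categoria")

print("DISTINCT:", plan_distinct)
print("GROUP BY:", plan_group_by)
```

If the two plans match on your engine, there is no reason to expect a performance difference between the two forms there.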
DISTINCT removes duplicate records, while GROUP BY is meant to group records.
DISTINCT is executed as follows: the business_key values go to a temporary table.
GROUP BY is executed as follows: the business_key values are kept in a hashtable.
The first optimizes memory, while the second optimizes speed but requires a large amount of memory depending on the number of keys.
Greetings.
The first option just filters the rows as it finds them, but it has to go through all of them to get the result. When you use group by, the initial result is reprocessed to sort it by the grouping value, in your case by categoria. Without indexes, the first option is faster. However, if you put an index on the categoria field, then the query with group by is almost as fast. Keep in mind that each alternative should be chosen according to the result you need.
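The effect of an index can be inspected rather than guessed. A rough sketch, again using SQLite through Python's sqlite3 module as an assumed test bed: the index name idx_productos_categoria, the plan() helper, and the sample rows are all hypothetical, and other engines may plan the query differently.

```python
import sqlite3

# In-memory database with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE productos (id INTEGER PRIMARY KEY, categoria TEXT)")
conn.executemany(
    "INSERT INTO productos (categoria) VALUES (?)",
    [("c",), ("a",), ("b",), ("a",), ("c",)],
)

query = "SELECT categoria FROM productos GROUP BY categoria"


def plan(sql):
    """Return the plan steps SQLite reports for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


# Plan without any index on categoria.
plan_before = plan(query)

# Add the index and ask for the plan again.
conn.execute("CREATE INDEX idx_productos_categoria ON productos (categoria)")
plan_after = plan(query)

print("before index:", plan_before)
print("after index: ", plan_after)

# The rows returned are the same either way; only the access path may change.
result = sorted(conn.execute(query).fetchall())
print(result)
```

Comparing the two printed plans shows whether the engine switches from a scan with an extra sort or temporary structure to reading the index directly; actual timings on a large table are what settle the performance question.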