Quantcast
Channel: Teradata Forums - All forums
Viewing all articles
Browse latest Browse all 27759

difference between distinct and group by - response (3) by Santosh Kumar sarangi

$
0
0

Since DISTINCT redistributes the rows immediately, more data may move between the AMPs, where as  GROUP BY that only sends unique values between the AMPs.
So, we can say that  GROUP BY sounds more efficient. 
    But when you assume that data is nearly unique in a table, GROUP BY will  spend more time attempting to eliminate duplicates that do not exist at all.Therefore, it is wasting its  time to check for duplicates the first time. Then, it must redistribute the same amount of data .

Let us see how these steps are used in each case for elimination of Duplicates
(can be found out using explain plan)

DISTINCT
1. It reads each row on AMP
2. Hashes the column value identified in the distinct clause of select statement.
3. Then redistributes the rows according to row value into appropriate AMP
4. Once  redistribution is completed , it
    a. Sorts data to group duplicates on each AMP
    b. Will remove all the duplicates on each amp and sends the original/unique value

P.s: There are cases when "Error : 2646 No more Spool Space " . In such cases try using GROUP BY.

GROUP BY
1. It reads all the rows part of GROUP BY
2. It will remove all duplicates in each AMP for given set of values using "BUCKETS" concept
3. Hashes the unique values on each AMP
4. Then it will re-distribute them to particular /appropriate AMP's
5. Once  redistribution is completed , it
    a. Sorts data to group duplicates on each AMP
    b. Will remove all the duplicates on each amp and sends the original/unique value

Hence it is better to  go for
 

  • GROUP BY  -  when Many duplicates
  • DISTINCT        -  when few or no duplicates
  • GROUP BY -  SPOOL space is exceeded

Viewing all articles
Browse latest Browse all 27759

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>