The thing about the Primary Index is that ISN'T all about data distribution. If you select a PI that perfectly distributes the data across the AMPS, but is never used as a Join column or in a query predicate, you're doing yourself and your database a dis-service.
A key priority of the PI is that it should define the most common access path, some data skew is perfectly acceptable. There are probably as many definitions for accetable skew as there are people who can spell "skew"...so don't look for one "perfect" answer, personally I try to keep data skew below 20-25.
When you define the same PI on two commonly joined tables, you enable your TD system to create AMP local joins, with no redistribution. You can see this in the explain plan.
Remember, data has to be on the same AMP to be joined. If you have two tables with different PI's, the optimizer will pick the smaller (if it can) of the two tables, assuming you have current statistics collected, do a FTS on the smaller table, bring it into spool, re-hash the rows on that table in spool and push it back down to the AMPs according to the re-hashed column ( the PI of the larger table) and THEN join your rows.
If you can define the same PI on two freuqently joined tables and give up some skew, you can eliminate the FTS, re-hash and redistribution.
Consider the Sales Order Header and Sales Order Line Item tables, a fairly common occurence. The natural identifier for the Header Table and by default, initial PI selection, is some thing SalesOrderHeaderID. The Line Item table also has a natural identifier, but in this case it happens to be a composite PI, composed of SalesOrderHeaderID ( to relate each line back to its parent header row) and LineItemID (to differentiate each line item within the order).
If you select SalesOrderHeaderId, LineItemID as a composite PI on the Line Item Table, you've now hashed ALL of your Item rows to different amps than your header rows, perfectly distributed, but NOT co-located. The query that joins headers and line items now has to grab the line item table complete a full table scan, rehash it to the salesorderheaderid and then join it. All you've achieved is data distribution at the more complex processing.
If you accept some skew and define the PI of the LineItem table as SalesOrderHeaderId alone, the child rows will hash to the same amps as their parent rows, run the same query and you've now got amp local operations...
↧