Problem Description
A typical lookup kind of JOIN between FACT table with hundreds of millions of web clicks, and DIM table with 55M IP ranges in numeric style (e.g. downloaded from MaxMind).
The explain plan choose to replicate DIM to all AMP, which is reasonable; it scan and filter the FACT table in parallel; but then it will use "product join" to find the matching GEO_IP_LOCATION_ID. Quickly, sun query runs out of spool or gets killed by DBA.
Not sure if value-ordered unique secondary index on (start_ip_num, end_ip_num) will help, because it can only be created in TD 14.xx due to the BIGINT data type. But this is such a typical JOIN, there should be a more efficient way to do it, right?
-- DIM_GEO_IP_RANGE create set table DIM_GEO_IP_RANGE ( START_IP_NUM BIGINT NOT NULL, END_IP_NUM BIGINT NOT NULL, GEO_IP_LOCATION_ID INT ) UNIQUE PRIMARY INDEX(START_IP_NUM) ; -- FACT_WEB_CLICK create multiset table FACT_WEB_CLICK ( CLICK_TIME_UTC TIMESTAMP(3) NOT NULL, LOCAL_TIME_ZONE_OFFSET DECIMAL(4,2), WEBSITE_ID SMALLINT NOT NULL, SESSION_ID BIGINT NOT NULL, CLICK_ID INT NOT NULL, CLICK_TYPE SMALLINT, IP_NUM BIGINT NOT NULL, PAGE_ID INT NOT NULL, USER_ID INT, REF_STRING VARCHAR(256) CHAR SET LATIN, ) PRIMARY INDEX (SESSION_ID, CLICK_ID) PARTITION BY ( RANGE_N( CAST(CLICK_TIME_UTC as DATE) BETWEEN DATE '2013-01-01' AND DATE '2013-12-13' EACH INTERVAL '1' DAY) ); -- LOOKUP JOIN (explain plan has product join) explain select c.*, g.geo_ip_location_id from fact_web_click c join dim_geo_ip_range g on c.ip_num between g.start_ip_num and g.end_ip_num where c.date_sk = current_date - 1 and user_id is not null and website_id = 1234 ;
Forums: