Posted: Sun Mar 14, 2004 2:09 pm
Ray said it better than I have been able to. If your requirement is to put 30 million rows into a hash file for repetitive lookup, you will probably exceed the read caching capabilities. My technique is to partition the data so that you have a subset in each of N hash files, where each hash file is small enough to be cached. You also want as many transformation jobs as you have CPUs, so that every CPU is fully used. If you have 24 CPUs, then I suggest 24 hash lookups of your large dimension and 24 transformation jobs for the fact data, with each job handling a specific customer key set.

Rather than the "ends with 0..9" approach, which limits you to only 10 partitions, MOD lets you have as many partitions as you want. MOD returns the integer remainder of a division operation, so MOD(somenumber, 24) always returns 0..23. Every customer key passed through MOD(customer_key, 24) therefore yields a value from 0 to 23, giving you 24 hash files, each containing its respective set of customer keys. Each fact transformation job then limits itself to processing only the customers that pertain to it, i.e. those whose key MOD 24 equals that job's partition number.
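As a minimal sketch of the idea (in Python rather than DataStage BASIC, with hypothetical field and function names), here is the MOD partitioning scheme: the dimension is split into 24 small lookups, and each fact job processes only the keys whose MOD result matches its partition number:

```python
# Sketch of MOD-based partitioning: 24 buckets, one per CPU/job.
# customer_key, dimension_rows, fact_rows are illustrative names.

NUM_PARTITIONS = 24

def partition_of(customer_key: int) -> int:
    # MOD(customer_key, 24) in DataStage BASIC is customer_key % 24 here.
    return customer_key % NUM_PARTITIONS

def build_lookups(dimension_rows):
    # Split the big dimension into 24 small lookup tables (dicts here,
    # hash files in DataStage), each small enough to stay fully cached.
    lookups = [dict() for _ in range(NUM_PARTITIONS)]
    for row in dimension_rows:
        lookups[partition_of(row["customer_key"])][row["customer_key"]] = row
    return lookups

def transform_partition(fact_rows, lookup, my_partition):
    # Each of the 24 fact jobs touches only the rows in its own partition.
    for fact in fact_rows:
        key = fact["customer_key"]
        if partition_of(key) != my_partition:
            continue  # another job's row; skip it
        dim = lookup.get(key)  # lookup against a small, fully cached subset
        yield {**fact, **(dim or {})}
```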
This partitioning of reference data, instantiating the transformation, and distributing processing across multiple CPUs is EXACTLY what PX does for you automatically. Remember: divide the source data and conquer with multiple identical processes. You take your customer dimension table and push it into a PX reference dataset. If you have 24 CPUs and set up a node pool of 24 nodes (1 node ~ 1 CPU if you configure it that way), PX disperses the appropriate subset to each node. Then, when you process the fact rows and merge the fact data with the customer dimension dataset, PX again sends each fact row only to the node holding the matching customers. The partitioning keeps the right data together.
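To make the co-location point concrete, here is a rough Python simulation (hypothetical names; PX of course does all of this internally) of key-based partitioning: hashing both the dimension and the fact stream by the same key sends matching rows to the same node, so each node joins only its own slice:

```python
# Rough simulation of PX key-based partitioning: the same hash function
# routes dimension rows and fact rows with equal keys to the same node.

NUM_NODES = 24

def node_for(key: int) -> int:
    return hash(key) % NUM_NODES

def scatter(rows, key_field):
    # Distribute rows across nodes by the partitioning key.
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for(row[key_field])].append(row)
    return nodes

def parallel_merge(dim_rows, fact_rows):
    dim_nodes = scatter(dim_rows, "customer_key")
    fact_nodes = scatter(fact_rows, "customer_key")
    for n in range(NUM_NODES):  # each node works independently
        lookup = {d["customer_key"]: d for d in dim_nodes[n]}
        for fact in fact_nodes[n]:
            dim = lookup.get(fact["customer_key"])
            if dim:
                yield {**fact, **dim}  # the join happens entirely on node n
```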
I'm trying to get someone to look at the 7.0 manuals and confirm the caching limitations. I have never hit this limit, as long as the DS user has no limits on memory usage. I had one client with a 30 million row hash lookup requirement, partitioned into 16 lookups. Each job achieved full caching; watching glance or top/prstat showed each process using a large amount of memory. I'll try to dig out some glance screen snapshots to prove it.