Info Analyzer performance/sizing
Posted: Mon Jan 12, 2009 11:53 am
We have some large files (a few GB each) that we receive from outside vendors, and one major problem is the quality of the data in those files. We also have a fixed time window in which the profiling must complete so that we can report issues back to the vendors. We are likely to run column profiling as well as primary key inference against these files. The column type is varchar(256). Is there any guidance on hardware and storage? I am looking for the approximate number of CPUs/cores, memory size, and disk size needed to complete the job in approximately 2 hours.
Secondly, the file sizes are likely to grow down the line (but the profiling functionality will remain the same). Is there any way I can keep the same 2-hour window for column profiling and key inference by partitioning the data and processing it in parallel? If so, how would that be done?
Lastly, how do I perform "join" analysis between two files to determine the "join" key between them?
Thanks in advance.