
Profiling time and space?

Posted: Tue Oct 14, 2008 4:09 am
by vairus
Hi,

I'm new to Information Analyzer and I have a lot of questions.

I'm running column analysis on a 2.3 GB flat file with 6.6 million rows.

How much space will this table need in DB2?

I'm using two 3.6 GHz processors, a 160 GB hard disk, and 3 GB of RAM. How long will column analysis take for the table above?

While the analysis runs, it creates datasets in the ibm/informationserver/server/dataset folder and clears all the files after the process finishes. Is the dataset folder temporary space for IA?

How can I monitor the status of a job while it's running?

Thanks in advance, guys.

Vairamuthu

Posted: Tue Oct 14, 2008 2:25 pm
by Aruna Gutti
I can answer one of your questions.

You can monitor the Column Analysis jobs in the Director client, under the ANALYZERPROJECT project.
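
If you also want to monitor from a script rather than the Director client, the DataStage dsjob command-line tool reports the same job status and log entries. Here is a minimal Python sketch, assuming dsjob is on your PATH; the job name is hypothetical (list the real ones with dsjob -ljobs ANALYZERPROJECT):

```python
import subprocess

PROJECT = "ANALYZERPROJECT"
JOB = "ColumnAnalysisJob1"  # hypothetical; get real names via: dsjob -ljobs ANALYZERPROJECT

# Report the current state of the job (running, finished, aborted, ...).
subprocess.run(["dsjob", "-jobinfo", PROJECT, JOB], check=True)

# Summarise the job log, which includes per-stage progress messages.
subprocess.run(["dsjob", "-logsum", PROJECT, JOB], check=True)
```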

Posted: Tue Oct 14, 2008 3:11 pm
by ray.wurlod
Use the DB2 Control Center to see which table spaces are used by IADB and ANALYZERPROJECT, and the amount of space allocated to and used by each. Table space management is usually set to "automatic".
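
If you prefer a query to the GUI, the same numbers are exposed through DB2's SYSIBMADM administrative views (assuming a DB2 release that provides them). A minimal sketch using the ibm_db Python driver, with hypothetical connection details:

```python
import ibm_db

# Hypothetical connection details; substitute your own IADB host and credentials.
conn = ibm_db.connect("DATABASE=IADB;HOSTNAME=localhost;PORT=50000;"
                      "UID=db2admin;PWD=secret;", "", "")

# Allocated vs. used space per table space, much as the Control Center shows it.
sql = ("SELECT TBSP_NAME, TBSP_TOTAL_SIZE_KB, TBSP_USED_SIZE_KB, "
       "TBSP_UTILIZATION_PERCENT FROM SYSIBMADM.TBSP_UTILIZATION")
stmt = ibm_db.exec_immediate(conn, sql)

row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["TBSP_NAME"], row["TBSP_TOTAL_SIZE_KB"], row["TBSP_USED_SIZE_KB"])
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)
```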

Posted: Wed Oct 15, 2008 7:30 am
by vairus
Thanks for your replies, Aruna and Ray.

Posted: Tue Jan 06, 2009 2:28 pm
by mee
Can someone shed some light on what IA is doing under the covers? And why does it need the IADB and ANALYZERPROJECT tables?

Thanks in advance.

Posted: Tue Jan 06, 2009 3:11 pm
by ray.wurlod
Information Analyzer analysis tasks are run as DataStage jobs in the ANALYZERPROJECT DataStage project.

The Information Analyzer database (IADB) is used to store the results of analysis; for example, the results of column analysis are used in performing table analysis and cross-table analysis.

Posted: Tue Jan 06, 2009 4:39 pm
by mee
Ray, thanks for the response.

I do have some follow-up questions and would appreciate further clarity on these.

We have some large files (~ a few GB) that we need to get from outside vendors, and one major problem is the quality of the data files. We also have a fixed time window in which the profiling must complete and report issues back to the vendors. We are likely to do column profiling as well as primary key inference against these files. The columns are typed as varchar(256). What guidance can you offer on hardware and storage? I am looking for the approximate number of CPUs/cores, memory size, and disk size needed to complete the job in approximately 2 hours.
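
For a rough feel of what the 2-hour window implies, here is a back-of-envelope sketch; the file size, number of passes, and per-core rate are assumed placeholders, not measured Information Analyzer figures:

```python
# Rough sizing arithmetic only; benchmark your own stack before relying on it.
file_gb = 4.0          # "a few GB", per the question above
window_hours = 2.0
passes = 2.0           # assumption: column profiling + key inference each scan the data
assumed_mb_per_sec_per_core = 10.0  # placeholder rate; measure on your hardware

required_mb_per_sec = file_gb * 1024 * passes / (window_hours * 3600)
cores = required_mb_per_sec / assumed_mb_per_sec_per_core
print(f"~{required_mb_per_sec:.1f} MB/s sustained, ~{cores:.1f} cores at the assumed rate")
```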

Secondly, it's likely that file sizes will grow down the line (but the profiling functionality will remain the same). Is there any way I can maintain the same 2-hour window for column profiling and key inference by doing some data partitioning and parallel processing? If so, how would that be done?
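
Since IA analysis tasks run as DataStage parallel jobs (per Ray's reply above), the usual scaling lever is the parallel engine configuration file pointed to by APT_CONFIG_FILE: the engine partitions the data across the nodes the file defines, so adding nodes increases parallelism. A minimal two-node sketch, with hypothetical host and path names:

```
{
    node "node1"
    {
        fastname "iaserver"
        pools ""
        resource disk "/data/ds/n1" {pools ""}
        resource scratchdisk "/scratch/n1" {pools ""}
    }
    node "node2"
    {
        fastname "iaserver"
        pools ""
        resource disk "/data/ds/n2" {pools ""}
        resource scratchdisk "/scratch/n2" {pools ""}
    }
}
```

Each node entry adds a degree of parallelism; placing the disk and scratchdisk paths on separate physical storage generally helps throughput.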

Lastly, how do I perform "join" analysis between two files to determine the join key? (A sketch of the underlying idea follows below.)

Thanks in advance.
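
On the last question: in Information Analyzer this is cross-table (foreign key) analysis, but the underlying idea is straightforward to sketch: compare the distinct values of candidate columns and look for high containment. The Python below illustrates that idea only, not IA's implementation; the file and column names are hypothetical:

```python
import csv

def distinct_values(path, column):
    """Collect the distinct values of one column from a CSV file."""
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}

def containment(child, parent):
    """Fraction of child values found in parent; near 1.0 suggests a join/FK key."""
    return len(child & parent) / len(child) if child else 0.0

# Hypothetical files and columns; in practice, test every plausible column pair.
orders = distinct_values("orders.csv", "customer_id")
customers = distinct_values("customers.csv", "id")
print(f"containment: {containment(orders, customers):.2%}")
```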