Performance tuning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Performance tuning

Post by vij »

I have a job which helps me achieve the following functionality:
source:
col1 col2 col3
10 aaaa 123
10 aaaa 345
10 wqert 126
10 aaaa 789

output:

col1 col2 col3
10 aaaa 123,345,789
10 wqert 126
I have used a Sort stage to sort the records on col1 and col2. In the next stage, a Transformer, I compare the previous record's values with the current record's values and append the col3 value when a record is a duplicate (a record is a duplicate based on the values of col1 and col2), then remove the duplicates and pass only the last record of each group.
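To make the logic concrete, here is a minimal sketch of that previous/current comparison in plain Python, applied to data already sorted on col1 and col2 (this is only an illustration, not the actual Transformer derivation; the file names, whitespace delimiter, and absence of a header row are assumptions):

import csv

# Collapse consecutive rows that share (col1, col2), concatenating col3,
# on input that is ALREADY sorted on col1 and col2 (as the Sort stage guarantees).
def collapse(rows):
    prev_key = None
    values = []
    for col1, col2, col3 in rows:
        key = (col1, col2)
        if key != prev_key:
            if prev_key is not None:
                yield prev_key[0], prev_key[1], ",".join(values)
            prev_key, values = key, []
        values.append(col3)
    if prev_key is not None:
        yield prev_key[0], prev_key[1], ",".join(values)

with open("source.txt") as src, open("output.txt", "w", newline="") as out:
    reader = (line.split() for line in src)   # assumed whitespace-delimited, no header
    writer = csv.writer(out, delimiter=" ")
    for row in collapse(reader):
        writer.writerow(row)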

Now the problem is that this job receives millions of records from the input, and the job fails while sorting/removing duplicates because there is not enough space on the server. So I wanted to know: is there any way I can achieve this functionality through a ready-made stage in DataStage? Any alternative solution is welcome!

Please let me know.

Thanks in advance!
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Is your source a flat file or a database table? If the latter, do an ORDER BY at the source and get the data already sorted. If the former, you can use the OS-level sort, passing it the -T option to specify the temporary directory that holds the intermediate work files during sorting. If you have low disk space overall, then buy more disks.
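For illustration only, a sketch of that OS-level sort call, wrapped here in Python's subprocess (the space delimiter, key positions, file names, and the /big_fs/tmp path are all assumptions; -T is the standard Unix/GNU sort option mentioned above):

import subprocess

# Sort the flat file on col1 and col2, telling sort to keep its temporary
# work files on a filesystem with enough free space (-T).
subprocess.run(
    ["sort", "-t", " ", "-k", "1,1", "-k", "2,2",
     "-T", "/big_fs/tmp",            # assumed path with plenty of free space
     "-o", "source_sorted.txt",      # assumed output file name
     "source.txt"],                  # assumed input file name
    check=True,
)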
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How about using a Remove Duplicates stage? Or even a unique Sort?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

Yes, I am using a Remove Duplicates stage to remove the duplicates from the sorted and tagged data, passing the last record of each group to the target.

I have a file as the source and the target; I don't use a database table anywhere here.

My question is: is there any way I can reduce the load on each stage so that I don't hit the "no space" issue on the server?

As an alternative, I heard that there is something called a "vector stage" which can tag a column value with another value, but I want the same functionality driven by a condition (previous record value = current record value).

Please advise me accordingly.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Vectors won't help - it's still the same volume of data.

Get more space. You need it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

OK, thanks for the info, Roy. Leaving the space issue aside, at least from a time-consumption perspective, can you advise alternative logic or stages to achieve the same functionality?
Jai_sahaj
Participant
Posts: 7
Joined: Mon Nov 10, 2003 1:11 pm

Post by Jai_sahaj »

vij wrote: OK, thanks for the info, Roy. Leaving the space issue aside, at least from a time-consumption perspective, can you advise alternative logic or stages to achieve the same functionality?
I would add another Sort stage that generates a cluster key change column and derive col3 based on the value of that column, thus avoiding any string comparisons in the Transformer.
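A rough sketch of that idea in Python: the first function stands in for the Sort stage's key change column (a flag that is 1 on the first row of each group), and the second stands in for the downstream derivation that only tests the flag. The column names and sample values come from the original post; everything else is assumed.

# The Sort stage can emit an extra column (e.g. keyChange/clusterKeyChange)
# that is 1 on the first row of each (col1, col2) group and 0 otherwise.
def add_key_change(sorted_rows):
    prev_key = None
    for col1, col2, col3 in sorted_rows:
        key = (col1, col2)
        yield col1, col2, col3, 1 if key != prev_key else 0
        prev_key = key

# Downstream logic then only tests the integer flag -- no string comparisons.
def collapse_with_flag(flagged_rows):
    group = None
    for col1, col2, col3, key_change in flagged_rows:
        if key_change:
            if group is not None:
                yield group[0], group[1], ",".join(group[2])
            group = (col1, col2, [col3])
        else:
            group[2].append(col3)
    if group is not None:
        yield group[0], group[1], ",".join(group[2])

rows = [("10", "aaaa", "123"), ("10", "aaaa", "345"),
        ("10", "aaaa", "789"), ("10", "wqert", "126")]
for out in collapse_with_flag(add_key_change(rows)):
    print(out)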