Performance tuning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Performance tuning

Post by vij »

I have a job which helps me achieve the following functionality:
source:
col1 col2 col3
10 aaaa 123
10 aaaa 345
10 wqert 126
10 aaaa 789

output:

col1 col2 col3
10 aaaa 123,345,789
10 wqert 126
I have used a Sort stage to sort the records on col1 and col2. In the next stage, a Transformer, I compare the previous record's values with the current record's values and append the col3 value when a record is a duplicate (a record is a duplicate based on the values of col1 and col2), then remove the duplicates and pass only the last record of each group.
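To make the logic concrete, here is a minimal sketch of that previous/current comparison in plain Python, applied to data already sorted on col1 and col2 (this is only an illustration, not the actual Transformer derivation; the file names, whitespace delimiter, and absence of a header row are assumptions):

import csv

# Collapse consecutive rows that share (col1, col2), concatenating col3,
# on input that is ALREADY sorted on col1 and col2 (as the Sort stage guarantees).
def collapse(rows):
    prev_key = None
    values = []
    for col1, col2, col3 in rows:
        key = (col1, col2)
        if key != prev_key:
            if prev_key is not None:
                yield prev_key[0], prev_key[1], ",".join(values)
            prev_key, values = key, []
        values.append(col3)
    if prev_key is not None:
        yield prev_key[0], prev_key[1], ",".join(values)

with open("source.txt") as src, open("output.txt", "w", newline="") as out:
    reader = (line.split() for line in src)   # assumed whitespace-delimited, no header
    writer = csv.writer(out, delimiter=" ")
    for row in collapse(reader):
        writer.writerow(row)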

Now the problem is that this job receives millions of records from the input, and the job fails while sorting/removing duplicates because there is not enough space on the server. So I wanted to know: is there any way I can achieve this functionality through a ready-made stage in DataStage? Any alternative solution is welcome!

Please let me know.

Thanks in advance!
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Is your source a flat file or a database table? If the latter, do an ORDER BY at the source and get the data already sorted. If the former, you can use the OS-level sort, passing it the -T option to specify the temporary directory that holds the intermediate work files during sorting. If you have low disk space overall, then buy more disks.
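For illustration only, a sketch of that OS-level sort call, wrapped here in Python's subprocess (the space delimiter, key positions, file names, and the /big_fs/tmp path are all assumptions; -T is the standard Unix/GNU sort option mentioned above):

import subprocess

# Sort the flat file on col1 and col2, telling sort to keep its temporary
# work files on a filesystem with enough free space (-T).
subprocess.run(
    ["sort", "-t", " ", "-k", "1,1", "-k", "2,2",
     "-T", "/big_fs/tmp",            # assumed path with plenty of free space
     "-o", "source_sorted.txt",      # assumed output file name
     "source.txt"],                  # assumed input file name
    check=True,
)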
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How about using a Remove Duplicates stage? Or even a unique Sort?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

Yes, I am using a Remove Duplicates stage to remove the duplicates from the sorted and tagged data, passing the last record of each group to the target.

I have a file as the source and the target; I don't use a database table anywhere here.

My question is: is there any way I can reduce the load on each stage so that I don't hit the "no space" issue on the server?

As an alternative, I heard that there is something called a "vector stage" which can tag a column value with another value, but I want the same functionality driven by a condition (previous record value = current record value).

Please advise me accordingly.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Vectors won't help - it's still the same volume of data.

Get more space. You need it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

OK, thanks for the info, Roy. Leaving the space issue aside, at least from a time-consumption perspective, can you advise alternative logic or stages to achieve the same functionality?
Jai_sahaj
Participant
Posts: 7
Joined: Mon Nov 10, 2003 1:11 pm

Post by Jai_sahaj »

vij wrote: OK, thanks for the info, Roy. Leaving the space issue aside, at least from a time-consumption perspective, can you advise alternative logic or stages to achieve the same functionality?
I would add another Sort stage that generates a cluster key change column and derive col3 based on the value of that column, thus avoiding any string comparisons in the Transformer.
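A rough sketch of that idea in Python: the first function stands in for the Sort stage's key change column (a flag that is 1 on the first row of each group), and the second stands in for the downstream derivation that only tests the flag. The column names and sample values come from the original post; everything else is assumed.

# The Sort stage can emit an extra column (e.g. keyChange/clusterKeyChange)
# that is 1 on the first row of each (col1, col2) group and 0 otherwise.
def add_key_change(sorted_rows):
    prev_key = None
    for col1, col2, col3 in sorted_rows:
        key = (col1, col2)
        yield col1, col2, col3, 1 if key != prev_key else 0
        prev_key = key

# Downstream logic then only tests the integer flag -- no string comparisons.
def collapse_with_flag(flagged_rows):
    group = None
    for col1, col2, col3, key_change in flagged_rows:
        if key_change:
            if group is not None:
                yield group[0], group[1], ",".join(group[2])
            group = (col1, col2, [col3])
        else:
            group[2].append(col3)
    if group is not None:
        yield group[0], group[1], ",".join(group[2])

rows = [("10", "aaaa", "123"), ("10", "aaaa", "345"),
        ("10", "aaaa", "789"), ("10", "wqert", "126")]
for out in collapse_with_flag(add_key_change(rows)):
    print(out)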