Require solution on a requirement

Post questions here relating to DataStage Enterprise/PX Edition, covering areas such as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA

Require solution on a requirement

Post by creatingfusion »

The input data looks something like this:
Col_A|Col_B|Col_C|Col_Date1|Col_Date2
A|B|C|2001/01/01|2002/01/03
A|B|C|2001/01/02|2002/01/04
A|B|C|2001/01/03|2002/01/05

The output should be:
Col_A|Col_B|Col_C|Col_Date1|Col_Date2
A|B|C|2001/01/01|2002/01/05

Requirement: if Col_A, Col_B and Col_C are the same across the input rows, then the output should be a single row with Col_A, Col_B and Col_C, picking up the earliest date from Col_Date1 and the most recent date from Col_Date2.

Please suggest how this can be implemented using DataStage.

Note: I did post this topic earlier, but the answers were to use an Aggregator with min/max dates. Here, however, we need to handle 200 million records in the input file, so please suggest an optimal method for this - perhaps a current/previous comparison in stage variables, or something else that does not dump the data out for processing.
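
To make the grouping rule concrete, here is a minimal sketch of the intended result in plain Python (not DataStage); the column names and sample values are taken from the example above.

rows = [
    ("A", "B", "C", "2001/01/01", "2002/01/03"),
    ("A", "B", "C", "2001/01/02", "2002/01/04"),
    ("A", "B", "C", "2001/01/03", "2002/01/05"),
]

# One output row per (Col_A, Col_B, Col_C) key: earliest Col_Date1, most recent Col_Date2.
# yyyy/mm/dd strings compare correctly as text, so min/max on the strings is enough here.
groups = {}
for a, b, c, d1, d2 in rows:
    key = (a, b, c)
    if key not in groups:
        groups[key] = [d1, d2]
    else:
        groups[key][0] = min(groups[key][0], d1)
        groups[key][1] = max(groups[key][1], d2)

for (a, b, c), (d1, d2) in groups.items():
    print(f"{a}|{b}|{c}|{d1}|{d2}")   # prints: A|B|C|2001/01/01|2002/01/05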

Regards
Abhijit
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

Hmm, let's draw the logic of this first:

Sort data on Col_Date1 Ascending to get the earliest date.
Sort data on Col_Date2 Descending to get the recent date.

Now we have both and have stored them "somewhere".

We'll do the comparison now. If Col_A, Col_B and Col_C match, render an output row with their values, along with the stored values we produced earlier from the sorting.

Is the business logic correct here?

Ok, now let us put some rules:
1. I see that you don't need to partition your data, so use Entire on the first stage and Same for the rest.
2. Let us divide the data into three outputs at the beginning: the first with Col_A, Col_B, Col_C (no sorting or configuration required for this set), the second with Col_Date1, with stages to sort and pick the first value, and the third with Col_Date2, again sorting and picking the first value.
3. Now let's put those columns back together, logically something like a full outer join. (Not talking about the stage, just the logic.)
4. Now we have all the rows as
Col_A|Col_B|Col_C|Earliest_Date|Recent_Date
where Earliest_Date is the same across all records, and likewise Recent_Date.
5. The data here looks pretty and ready. Let's get into the comparison of Col_A, Col_B and Col_C. I smell a Transformer here. Use it, and pass the output along (a rough sketch of these steps follows this list).
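
A rough sketch of those five steps in plain Python (not DataStage stages), assuming the sorts and the "first value" picks are applied per (Col_A, Col_B, Col_C) key group; names and sample data are illustrative only.

rows = [
    ("A", "B", "C", "2001/01/01", "2002/01/03"),
    ("A", "B", "C", "2001/01/02", "2002/01/04"),
    ("A", "B", "C", "2001/01/03", "2002/01/05"),
]

# Second output: sort on Col_Date1 ascending, keep the first value seen per key (earliest).
earliest = {}
for a, b, c, d1, _ in sorted(rows, key=lambda r: r[3]):
    earliest.setdefault((a, b, c), d1)

# Third output: sort on Col_Date2 descending, keep the first value seen per key (most recent).
recent = {}
for a, b, c, _, d2 in sorted(rows, key=lambda r: r[4], reverse=True):
    recent.setdefault((a, b, c), d2)

# First output joined back with both stored values, one row per key.
for key in sorted({(a, b, c) for a, b, c, _, _ in rows}):
    print("|".join(key) + "|" + earliest[key] + "|" + recent[key])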

That's pretty much it. Hope that was simple enough for you to understand.
Not only thoughts, but a little bit of experience.
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA

Post by creatingfusion »

Thanks a lot for the detailed reply.

But since the number of records is so large, can the three sort stages be reduced to one, sorting on the keys and letting stage variables take care of the earliest and latest dates?

Otherwise, I was looking for some further optimization given the large number of records.

Please suggest.

Regards
Abhijit
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

Unfortunately, it is not possible for one sort stage to do the sorting in this specific case.

Imagine this:
You need to sort ascending at one point and descending at another. Doing both at the same time will get you incorrect results, logically.
Do a quick test in SQL on some table to see the logic behind what I mean.

However, you can still use the Aggregator stage to get the Max and Min values (provided the data type is date/timestamp). But what happens in the back end is pretty much sorting, so you won't notice much change in performance.

Optimization hint: for faster sorting/aggregation, you may use range partitioning. It needs a little bit of work, but it is definitely efficient.
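
As a toy illustration of the range-partition idea (plain Python; the boundary values are invented for the example): keys are compared against precomputed range boundaries so that all rows for a key land in the same partition, and sorting/aggregation can then run independently per partition.

import bisect

# Hypothetical split points, e.g. taken from a sample of the key values.
boundaries = ["D", "M", "S"]
num_partitions = len(boundaries) + 1

def partition_for(key: str) -> int:
    # bisect_right counts how many boundaries are <= key, which is the partition index.
    return bisect.bisect_right(boundaries, key)

for key in ["A", "F", "Q", "Z"]:
    print(key, "-> partition", partition_for(key))   # A->0, F->1, Q->2, Z->3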

Let me know how it goes.
Not only thoughts, but a little bit of experience.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

So, what is the problem with using the Aggregator stage as was originally suggested? It will do exactly what you want regarding the Min/Max, and as you are only performing Min/Max it should not add much overhead. Why are you worried about the volume?

Do you need to keep a row with "Col_A|Col_B|Col_C|Col_Date1|Col_Date2", or is this just for illustrative purposes within your question?

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA

Post by creatingfusion »

"Col_A|Col_B|Col_C|Col_Date1|Col_Date2", these being the columns in the output but aggregation on 2 million rows would be slow.

Using stage variables, would it be possible to sort only on the keys Col_A|Col_B|Col_C, then compare the Col_Date1 values and keep the earliest in one stage variable, and do the same with Col_Date2 in another stage variable to keep the most recent date? I tried this and checked that we are able to get the result in the stage variables, but the issue is the condition for writing the value to the output: when I do current <> previous it fires on the first record of the next group of keys, not on the last record of the current group.

Please suggest.
Thanks
Abhijit
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA

Post by creatingfusion »

Input -
Col_A|Col_B|Col_C|Col_Date1|Col_Date2

It's five different columns, delimited with pipes, shown here for illustration.

What I did is, for Col_Date1:

Stage variables (in order):
current        = Col_A : Col_B : Col_C
currentDt1     = Col_Date1
chkDt1         = If current = previous And Col_Date1 > prev_Col_Date1 Then prev_Col_Date1 Else Col_Date1
flag           = If current = previous Then 1 Else 0
previous       = current
prev_Col_Date1 = chkDt1

Output link constraint: flag = 0

Issue:
When the records are processed, the earliest date for a group is only complete once the last record of the group has been seen, but the flag becomes 0 on the first record of each group, so the row that gets written does not carry the accumulated date.
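
A small trace of that flag logic in plain Python (not DataStage syntax, data invented) showing why the row written under flag = 0 is the first record of each key group, before the running date is complete:

previous, prev_dt = "", ""
data = [("A|B|C", "2001/01/03"), ("A|B|C", "2001/01/01"), ("D|E|F", "2001/05/05")]
for current, col_date1 in data:
    flag = 1 if current == previous else 0
    chkDt1 = prev_dt if (current == previous and col_date1 > prev_dt) else col_date1
    if flag == 0:
        # Fires on the FIRST record of each group; for A|B|C this writes 2001/01/03,
        # although the group's earliest date (2001/01/01) is only known one row later.
        print("written:", current, chkDt1)
    previous, prev_dt = current, chkDt1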

Please suggest.

Thanks
Abhijit
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Require solution on a requirement

Post by chulett »

creatingfusion wrote: Note: I did post this topic earlier, but the answers were to use an Aggregator with min/max dates. Here, however, we need to handle 200 million records in the input file, so please suggest an optimal method for this.
I'm curious... did you even try using an aggregator to do this simple min/max grouping? Or did you simply assume it would somehow be less than 'optimal'? :?

Also key to this - is your input data sorted on those first three columns?
-craig

"You can never have too many knives" -- Logan Nine Fingers
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA

Re: Require solution on a requirement

Post by creatingfusion »

The Aggregator was tested, but it fills up the scratch disk. The records are sorted on the key fields.

Thanks
Abhijit
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

aggregation on 2 million rows would be slow
What is your definition of "slow"? I just ran a similar job (Proof of Concept) with 200 million rows in 5 minutes in a single partition. Is that too slow for you? All it required was converting the dates back and forth to an integer for the aggregator. I did that with modify stages.

For the transformer-based version: a single Sort, a Transformer and a Remove Duplicates stage are sufficient... no need for three sorts; that just wastes resources unnecessarily. The Transformer can keep the current min and max for a key value in stage variables and copy them to every output record, and the Remove Duplicates stage can then keep the last record for each key value.
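
A rough sketch of that pattern in plain Python (not DataStage syntax): the input is sorted on the key, a transformer-style pass copies the running min/max onto every row, and a keep-last dedup on the key then leaves one complete row per group. Names and sample data are illustrative only.

rows = sorted([
    ("A", "B", "C", "2001/01/02", "2002/01/04"),
    ("A", "B", "C", "2001/01/01", "2002/01/05"),
    ("A", "B", "C", "2001/01/03", "2002/01/03"),
])

# "Transformer": stage-variable-style running values, reset on key change.
augmented = []
prev_key, run_min, run_max = None, None, None
for a, b, c, d1, d2 in rows:
    key = (a, b, c)
    if key != prev_key:
        run_min, run_max = d1, d2
        prev_key = key
    else:
        run_min = min(run_min, d1)   # earliest Col_Date1 so far
        run_max = max(run_max, d2)   # most recent Col_Date2 so far
    augmented.append((a, b, c, run_min, run_max))

# "Remove Duplicates, retain last": the last row per key carries the final values.
dedup = {}
for a, b, c, mn, mx in augmented:
    dedup[(a, b, c)] = (mn, mx)
for (a, b, c), (mn, mx) in dedup.items():
    print(f"{a}|{b}|{c}|{mn}|{mx}")   # A|B|C|2001/01/01|2002/01/05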

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

When you ran your test with the Aggregator, what was the value of the "Method" option on the Aggregator Stage Properties tab? "Hash" or "Sort"?

(mine was Sort, BTW)
- james wiles


All generalizations are false, including this one - Mark Twain.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

To expand on the transformer solution: For IS 8.5, you could use the new LastRowInGroup() function to eliminate the need for the remove duplicates stage. The example I gave in the earlier post was for pre IS 8.5 versions.
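
For illustration, a plain-Python emulation (not the DataStage function itself) of the last-row-in-group idea: with key-sorted input, a row is the last of its group when the next row has a different key, and that is the point at which the accumulated min/max are complete and the single output row can be written.

def with_last_row_flag(sorted_rows):
    # Yield (row, is_last_in_group) for rows already sorted on the key columns.
    it = iter(sorted_rows)
    prev = next(it, None)
    for row in it:
        yield prev, row[:3] != prev[:3]
        prev = row
    if prev is not None:
        yield prev, True

rows = sorted([
    ("A", "B", "C", "2001/01/02", "2002/01/04"),
    ("A", "B", "C", "2001/01/01", "2002/01/05"),
])
run_min = run_max = None
for (a, b, c, d1, d2), last in with_last_row_flag(rows):
    run_min = d1 if run_min is None else min(run_min, d1)
    run_max = d2 if run_max is None else max(run_max, d2)
    if last:                          # the output link constraint would test this flag
        print(f"{a}|{b}|{c}|{run_min}|{run_max}")   # A|B|C|2001/01/01|2002/01/05
        run_min = run_max = None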

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.