Creating Buckets on the Basis of Value

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


anupam
Participant
Posts: 172
Joined: Fri Apr 04, 2003 10:51 pm
Location: India

Creating Buckets on the Basis of Value

Post by anupam »

Hi,

I have to create a job in which buckets are created on the basis of the value of an input column.

For example, I have one column, say Duration, with the values:
Duration
2
7
10
18
33
49
99
125
185
308
700

Now, in the output I want:
Duration_0_5
Duration_6_15
Duration_16_30
Duration_31_45
Duration_46_60
Duration_61_90
Duration_91_120
Duration_121_180
Duration_181_300
Duration_301_600
Duration_601_Above

For each output column I have to write an If condition comparing the value of Duration. For example, for Duration_6_15 I am writing:
"If Duration >=6 and Duration <= 15 then 1 else 0"

Similarly, I am writing If conditions for each column, which is slowing down the process.

Please suggest how to increase the speed. Is there any way in which these calculations need to be done only once?
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's a bit early in the morning, but...

Since you've got a logical progression to your 'bucket' ranges, perhaps you could do something with a MODULUS - in essence hashing it into the proper output bucket? Might take a little magical formula, but I think it would work out. Problem is I think you'll still need to do some sort of range checking...

A couple of other thoughts until the Universe Boys come in with something better. Perhaps a routine with a CASE statement might evaluate things 'faster'? Check from the top down and just do a 'greater than' to cut down the number of evaluations it needs to do. 'Course then you'd have added the overhead of the routine call to the picture, which may erase any performance gain. :cry:

It also seems like a hash file or a dynamic array could be leveraged for this. :? I'd be curious what the 'best practice' turns out to be here!
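
For illustration, a top-down routine along those lines might look something like the sketch below (untested; Arg1 is just the usual single-argument transform-function convention, and the bucket names mirror the list above). Each output column derivation could then do a single equality test against the returned name instead of a pair of range checks.

Code: Select all

* Sketch of a transform function: Arg1 = Duration, Ans = bucket name.
* Checking from the largest boundary down keeps the number of comparisons small.
BEGIN CASE
    Case Arg1 >= 601
        Ans = "Duration_601_Above"
    Case Arg1 >= 301
        Ans = "Duration_301_600"
    Case Arg1 >= 181
        Ans = "Duration_181_300"
    Case Arg1 >= 121
        Ans = "Duration_121_180"
    Case Arg1 >= 91
        Ans = "Duration_91_120"
    Case Arg1 >= 61
        Ans = "Duration_61_90"
    Case Arg1 >= 46
        Ans = "Duration_46_60"
    Case Arg1 >= 31
        Ans = "Duration_31_45"
    Case Arg1 >= 16
        Ans = "Duration_16_30"
    Case Arg1 >= 6
        Ans = "Duration_6_15"
    Case @TRUE
        Ans = "Duration_0_5"
End Case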
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

For large volumes of data, a hashed file containing 601 rows and two columns would perform best.

Assuming your hashed file is to be called RANGES, create it as a Type 2 with modulus 20 and separation 4. (Just trust me on this.)

Code: Select all

CREATE.FILE RANGES 2 20 4
Populate the file with the following code (or something similar, perhaps in a before-stage subroutine).

Code: Select all

 OPEN "RANGES" TO RANGES.FVAR
 THEN
    FILELOCK RANGES.FVAR
    FOR I = 0 TO 5
       RANGE = "Duration_0_5"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 6 TO 15
       RANGE = "Duration_6_15"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 16 TO 60
       RANGE = (INT(I/15)+1-(MOD(I,15)=0))*15
       RANGE = "Duration_" : RANGE - 14 : "_" : RANGE
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 61 TO 90
       RANGE = "Duration_61_90"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 91 TO 120
       RANGE = "Duration_91_120"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 121 TO 180
       RANGE = "Duration_121_180"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 181 TO 300
       RANGE = "Duration_181_300"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FOR I = 301 TO 600
       RANGE = "Duration_301_600"
       WRITE RANGE TO RANGES.FVAR,I
    NEXT I
    FILEUNLOCK RANGES.FVAR
    CLOSE RANGES.FVAR  ; * no longer needed
 END
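If you do run it as a before-stage subroutine, the shell is roughly the sketch below. The routine name PopulateRanges is made up; the (InputArg, ErrorCode) signature is the standard before/after routine interface, and setting ErrorCode to a non-zero value signals failure.

Code: Select all

      SUBROUTINE PopulateRanges(InputArg, ErrorCode)
* Sketch: the same population logic wrapped as a before-stage subroutine.
* ErrorCode = 0 tells DataStage the routine succeeded.
      ErrorCode = 0
      OPEN "RANGES" TO RANGES.FVAR THEN
         FILELOCK RANGES.FVAR
         FOR I = 0 TO 600
            BEGIN CASE
               CASE I <= 5
                  RANGE = "Duration_0_5"
               CASE I <= 15
                  RANGE = "Duration_6_15"
               CASE I <= 30
                  RANGE = "Duration_16_30"
               CASE I <= 45
                  RANGE = "Duration_31_45"
               CASE I <= 60
                  RANGE = "Duration_46_60"
               CASE I <= 90
                  RANGE = "Duration_61_90"
               CASE I <= 120
                  RANGE = "Duration_91_120"
               CASE I <= 180
                  RANGE = "Duration_121_180"
               CASE I <= 300
                  RANGE = "Duration_181_300"
               CASE @TRUE
                  RANGE = "Duration_301_600"
            END CASE
            WRITE RANGE TO RANGES.FVAR, I
         NEXT I
         FILEUNLOCK RANGES.FVAR
         CLOSE RANGES.FVAR
      END ELSE
         ErrorCode = 1   ;* could not open RANGES
      END
      RETURN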
Perform a lookup based on the number; if it is not found and the number is over 600, substitute the top range. For example:

Code: Select all

If inlink.number > 600 Then "Duration_601_Above" Else lookup.range
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I think what the poster is asking to do is a pivot. In other words, take the source row and, depending on the value, place it into the appropriate column. Rather than put an If Then Else condition on every column derivation, he wants to know if there's a faster, if not easier, way.

One method is to use a function to do the pivot with a case statement. The trick is going to be embedding delimiters. If your input file is two columns (one key and one attribute) and your output is 13 columns (one key and 12 attributes), then a simple function to handle the pivot is all that's needed. Try this:

Write a job like: seq --> xfm --> seq. Simply map the two source columns to the output sequential file and use a pipe delimiter on the output file. On the column derivation for the one attribute, use the following function:

Code: Select all

YourValue = Arg1

BEGIN CASE
    Case YourValue >= 601
        Position = 11
    Case YourValue >= 301
        Position = 10
    Case YourValue >= 181
        Position = 9
    Case YourValue >= 121
        Position = 8
    Case YourValue >= 91
        Position = 7
    Case YourValue >= 61
        Position = 6
    Case YourValue >= 46
        Position = 5
    Case YourValue >= 31
        Position = 4
    Case YourValue >= 16
        Position = 3
    Case YourValue >= 6
        Position = 2
    Case YourValue >= 0
        Position = 1
    Case @TRUE
        Position = 12
End Case

DynamicArray = ''  ;* start from an empty dynamic array
DynamicArray<Position> = YourValue
CONVERT @AM TO "|" IN DynamicArray
Ans = DynamicArray
What this function will do is determine the correct bucket the value goes into. Because it embeds the delimiter of the file in the string value returned, it correctly allocates the value into the appropriate column. You will NOT be able to VIEW DATA on the output stage, because the actual data will end up with more columns in it than defined. This is correct. Simply browse the file using pg or more and you will see that the data is in the correct column.
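To make the behaviour concrete, here is a walk-through with assumed sample data (key K1, Duration 7) and no quote character set on the output sequential stage:

Code: Select all

* Duration 7 matches "Case YourValue >= 6", so Position = 2.
* DynamicArray<2> = 7 builds an array whose field 1 is empty and field 2 is 7,
* and the CONVERT turns the field mark into "|", so the function returns "|7".
* The output stage then writes the row as:  K1||7
* i.e. field 1 = K1, field 2 (Duration_0_5) empty, field 3 (Duration_6_15) = 7.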
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Cool. 8) So I was on the right track.
-craig

"You can never have too many knives" -- Logan Nine Fingers
anupam
Participant
Posts: 172
Joined: Fri Apr 04, 2003 10:51 pm
Location: India

Post by anupam »

Hi all,

Ray, I tried the solution you suggested, but that is not what I want. I specifically want the value of Duration placed in the proper bucket when the condition is satisfied.

Kenneth, thanks a lot for the solution, but it is taking more time than the earlier design.

My only problem is performance. Kindly suggest some other way the speed can be increased, as we are supposed to process about 300 million records.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Well, how many CPUs do you have? Are you using job instantiation to divide and conquer your source file? If you have 24 CPUs, you could use 24 job instances, each processing 1/24th of the source data, and thus achieve 24X the throughput. What's sooo slow about that? Now, if you say you have 2 CPUs, then I understand.

If you have a simple job design as I showed, seq --> xfm --> seq, then you simply put a constraint in to "pick" 1 out of N rows from the source file and pivot it, where N is how many job instances you are using. You'll use a job parameter in the output sequential file name, and pass in to each job instance its own number and the N value (I like parameters called PartitionNumber and PartitionCount). Use a sequencer job and drop N (PartitionCount) job stages onto the canvas, setting PartitionNumber from 1 to N across them. You'll run N jobs and produce N output files. Concatenate the N files together and you're done. Voila! You're N times faster.
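
A minimal sketch of that "pick 1 out of N" constraint, using the PartitionNumber and PartitionCount job parameters described above (the exact expression and file name are only illustrative):

Code: Select all

* Transformer constraint: this instance keeps every Nth row of the source file
MOD(@INROWNUM, PartitionCount) + 1 = PartitionNumber

* Output sequential file name, one file per instance, for example:
* pivoted_durations_#PartitionNumber#.txt
Each instance reads the whole source file but pivots only its own slice, and the N output files are concatenated afterwards.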

Besides, don't you like the simplicity of the design I gave you?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
anupam
Participant
Posts: 172
Joined: Fri Apr 04, 2003 10:51 pm
Location: India

Post by anupam »

Hello Kenneth,

I am not questioning your design; the splitting of the files I have already done. Besides, I have used 11 sortcl scripts for aggregating the records in the next job. The speed I am getting for bucketing is about 3500 rows/sec, as I have many other buckets in this application as well.

My full design, if I told you all of it, is not at all simple; besides, I am not only looking for simple solutions.

After analysing it, I feel that 3500 rows/sec is not the desired speed for processing 300 million records. The application is supposed to run daily.

I really appreciate the design you have suggested, but speed is my main concern. I have tried various methods to increase the speed and to an extent I have been successful: I was getting 500 rows/sec initially and am now getting 3500 rows/sec. I want at least 10000 rows/sec.

Please suggest something, and don't take it so personally.

I am really sorry if I am being harsh.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Okay, well then here's a last parting attempt to help before going to sleep (it's 1:30 in the morning here).

If you're processing 300 million rows a day, I would hope that you have opportunities for incremental update rather than full re-aggregation.

Otherwise, you must keep in mind that your unit of measurement is not rows/second but rather bytes/second. A wider row (more columns) adds more data overhead, as well as transformation overhead, so you cannot rely on rows/second to measure this throughput. You still have not addressed my point about job instantiation. I'm going to highlight this so you don't miss it:
Instantiation is the best method for achieving multiples of processing improvement in Server jobs.
If your job is running at 3500 rows per second, then 3 instances of the job, each taking 1/3 of the source data, will use 3 CPUs and give you 10500 rows/second net throughput.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle