Creation buckets On the basis of Value
Moderators: chulett, rschirm, roy
Hi,
I have to create a job in which buckets are created on the basis of the value of an input column.
For example, I have one column, say Duration, with the values:
Duration
2
7
10
18
33
49
99
125
185
308
700
Now in the output I want:
Duration_0_5
Duration_6_15
Duration_16_30
Duration_31_45
Duration_46_60
Duration_61_90
Duration_91_120
Duration_121_180
Duration_181_300
Duration_301_600
Duration_601_Above
For each output column I have to write an If condition comparing the value of Duration. Say, for Duration_6_15 I am writing:
"If Duration >= 6 And Duration <= 15 Then 1 Else 0"
Similarly, for each column I am writing If conditions, which is slowing down the process.
Please suggest how to increase the speed.
Is there any way in which these calculations need to be done only once?
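The per-column approach described above can be sketched as follows (Python standing in for the DataStage column derivations, purely for illustration — the actual job uses one If/Then/Else per output column):

```python
# Illustrative sketch (Python, not DataStage): the per-column approach,
# where every output column evaluates its own range test for each row.
# Bucket boundaries are taken from the list in the post above.
BUCKETS = [
    ("Duration_0_5", 0, 5),
    ("Duration_6_15", 6, 15),
    ("Duration_16_30", 16, 30),
    ("Duration_31_45", 31, 45),
    ("Duration_46_60", 46, 60),
    ("Duration_61_90", 61, 90),
    ("Duration_91_120", 91, 120),
    ("Duration_121_180", 121, 180),
    ("Duration_181_300", 181, 300),
    ("Duration_301_600", 301, 600),
]

def bucket_flags(duration):
    # One comparison pair per column, mirroring the If/Then/Else derivations:
    # every row pays for all eleven tests, which is the performance problem.
    flags = {name: (1 if lo <= duration <= hi else 0) for name, lo, hi in BUCKETS}
    flags["Duration_601_Above"] = 1 if duration >= 601 else 0
    return flags
```

Each row triggers all eleven comparisons even though exactly one can succeed, which is what the replies below try to avoid.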
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
It's a bit early in the morning, but...
Since you've got a logical progression to your 'bucket' ranges, perhaps you could do something with a MODULUS - in essence hashing it into the proper output bucket? Might take a little magical formula, but I think it would work out. Problem is I think you'll still need to do some sort of range checking...
A couple of other thoughts until the Universe Boys come in with something better. Perhaps a routine with a CASE statement might evaluate things 'faster'? Check from the top down and just do a 'greater than' to cut down the number of evaluations it needs to do. 'Course then you'd have added the overhead of the routine call to the picture, which may erase any performance gain.
Also seems like a hash file or a dynamic array could be leveraged for this. Be curious what the 'best practice' turns out to be here!
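The top-down "greater than" idea can be sketched like this (Python for illustration; the boundaries are taken from the original post, and a routine doing this in DataStage BASIC would use a CASE statement as Ken shows below):

```python
# Hypothetical sketch of the top-down evaluation: check the highest
# boundary first and stop at the first match, so each row needs only
# one comparison per threshold until it hits its bucket, instead of a
# full lower-AND-upper test for every output column.
THRESHOLDS = [
    (601, "Duration_601_Above"),
    (301, "Duration_301_600"),
    (181, "Duration_181_300"),
    (121, "Duration_121_180"),
    (91,  "Duration_91_120"),
    (61,  "Duration_61_90"),
    (46,  "Duration_46_60"),
    (31,  "Duration_31_45"),
    (16,  "Duration_16_30"),
    (6,   "Duration_6_15"),
    (0,   "Duration_0_5"),
]

def bucket_name(duration):
    for lo, name in THRESHOLDS:
        if duration >= lo:
            return name
    return None  # negative or invalid input
```

Because the previous threshold already bounded the value from above, only the lower bound needs testing at each step.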
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
For large volumes of data, a hashed file containing 601 rows and two columns would perform best.
Assuming your hashed file is to be called RANGES, create it as a Type 2 with modulus 20 and separation 4. (Just trust me on this.)
Populate the file with the following code (or something similar, perhaps in a before-stage subroutine).
Perform a lookup based on the number; if it is not found and the number is over 600, substitute the top range. For example:
Code:
CREATE.FILE RANGES 2 20 4
Code:
OPEN "RANGES" TO RANGES.FVAR
THEN
FILELOCK RANGES.FVAR
FOR I = 0 TO 5
RANGE = "Duration_0_5"
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FOR I = 6 TO 15
RANGE = "Duration_6_15"
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FOR I = 16 TO 90
RANGE = (INT(I/15)+1-(MOD(I,15)=0))*15
RANGE = "Duration_" : RANGE - 14 : "_" : RANGE
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FOR I = 91 TO 120
RANGE = "Duration_91_120"
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FOR I = 121 TO 180
RANGE = "Duration_121_180"
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FOR I = 181 TO 300
RANGE = "Duration_181_300"
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FOR I = 301 TO 600
RANGE = "Duration_301_600"
WRITE RANGE TO RANGES.FVAR,I
NEXT I
FILEUNLOCK RANGES.FVAR
CLOSE RANGES.FVAR ; * no longer needed
END
Code:
If inlink.number > 600 Then "Duration_601_Above" Else lookup.range
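The approach can be sketched cross-language like this (Python dictionary standing in for the hashed file, purely as an illustration of the idea above): pay the bucketing cost once when building the 601-row table, and each data row then needs only a single keyed lookup.

```python
# Sketch (Python, not DataStage): precompute the bucket label for every
# possible value 0..600, so bucketing a row is one hashed lookup instead
# of a chain of comparisons. Edges taken from the buckets in the thread.
EDGES = [(0, 5), (6, 15), (16, 30), (31, 45), (46, 60), (61, 90),
         (91, 120), (121, 180), (181, 300), (301, 600)]

RANGES = {}
for lo, hi in EDGES:
    label = "Duration_%d_%d" % (lo, hi)
    for i in range(lo, hi + 1):
        RANGES[i] = label

def lookup_bucket(duration):
    # Values over 600 are not in the table; substitute the top range,
    # matching the "If inlink.number > 600" derivation above.
    if duration > 600:
        return "Duration_601_Above"
    return RANGES[duration]
```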
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I think what the poster is asking to do is a pivot. In other words, take the source row and, depending on the value, place it into the appropriate column. Rather than put an If Then Else condition on every column derivation, he wants to know if there's a faster, if not easier, way.
One method is to use a function to do the pivot with a case statement. The trick is going to be embedding delimiters. If your input file is two columns (one key and one attribute) and your output is 13 columns (one key and 12 attributes), then a simple function to handle the pivot is all that's needed. Try this:
Write a job like: seq --> xfm --> seq. Simply map the two source columns to the output sequential file, use a pipe delimiter on the output file. On the column derivation for the one attribute, use the following function:
What this function will do is determine the correct bucket the value goes into. Because it embeds the delimiter of the file in the string value returned, it correctly allocates the value into the appropriate column. You will NOT be able to VIEW DATA on the output stage, because the actual data will end up with more columns in it than defined. This is correct. Simply browse the file using pg or more and you will see that the data is in the correct column.
Code:
YourValue = Arg1
DynamicArray = "" ;* initialise before the indexed assignment below
BEGIN CASE
Case YourValue >= 601
Position = 11
Case YourValue >= 301
Position = 10
Case YourValue >= 181
Position = 9
Case YourValue >= 121
Position = 8
Case YourValue >= 91
Position = 7
Case YourValue >= 61
Position = 6
Case YourValue >= 46
Position = 5
Case YourValue >= 31
Position = 4
Case YourValue >= 16
Position = 3
Case YourValue >= 6
Position = 2
Case YourValue >= 0
Position = 1
Case @TRUE
Position = 12
End Case
DynamicArray<Position> = YourValue
CONVERT @AM TO "|" IN DynamicArray
Ans = DynamicArray
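The same pivot-with-embedded-delimiters trick can be sketched in Python (illustration only; the routine above is the DataStage BASIC version): place the value in the correct slot of a 12-position record and join with the output file's pipe delimiter, so the value lands in the right downstream column.

```python
# Python rendition of the pivot function above: top-down cutoffs choose
# a 1-based position, the value goes into that slot of a 12-slot record,
# and the slots are joined with the output file's pipe delimiter.
def pivot_row(value):
    cutoffs = [(601, 11), (301, 10), (181, 9), (121, 8), (91, 7), (61, 6),
               (46, 5), (31, 4), (16, 3), (6, 2), (0, 1)]
    position = 12  # fallback slot for negative/invalid values
    for cutoff, pos in cutoffs:
        if value >= cutoff:
            position = pos
            break
    slots = [""] * 12
    slots[position - 1] = str(value)
    return "|".join(slots)
```

As Ken notes, the string carries more delimiters than the stage's defined columns, which is why VIEW DATA complains but the flat file itself is correct.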
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hi all,
Ray, I tried the solution you suggested, but that is not what I want. I want specifically the value of Duration in the proper bucket when the condition is satisfied.
Kenneth Bland, thanks a lot for the solution, but it is taking more time than the earlier design.
My only problem is performance. Kindly suggest some other way to increase the speed, as we are supposed to process about 300 million records.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
Well, how many cpus do you have? Are you using job instantiation to divide-and-conquer process your source file? If you have 24 cpus, you could use 24 job instances to process 1/24th of the source data and thus achieve 24X the throughput. What's sooo slow about that? Now, if you say you have 2 cpus, then I understand.
If you have a simple job design as I showed, seq --> xfm --> seq, then you simply put a constraint in to "pick" 1 out of N rows from the source file and pivot it. N is how many job instances you are using. You'll use a job parameter in the output sequential file name, and pass in to each job instance its number and the N value (I like parameters called PartitionNumber and PartitionCount). Use a sequencer job and drop N (PartitionCount) job stages onto the canvas. For each job, set PartitionNumber from 1 to N. You'll run N jobs and produce N output files. Concatenate the N files together, and you're done. Voila! You're N times faster.
Besides, don't you like the simplicity of the design I gave you?
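The pick-1-of-N constraint can be sketched like this (Python for illustration; in the actual job this would be a Transformer constraint using the hypothetical PartitionNumber and PartitionCount job parameters named above):

```python
# Sketch of the divide-and-conquer instantiation: each of N job instances
# keeps only the rows whose position modulo N matches its PartitionNumber,
# writes its own output file, and the files are concatenated afterward.
def partition_rows(rows, partition_number, partition_count):
    # partition_number is 1-based, matching the PartitionNumber parameter.
    return [row for i, row in enumerate(rows)
            if i % partition_count == partition_number - 1]

# Usage: with PartitionCount = 3, the three instances split the load evenly
# and together cover every source row exactly once.
rows = list(range(9))
parts = [partition_rows(rows, n, 3) for n in (1, 2, 3)]
```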
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hello Kenneth,
I am not questioning your design, and splitting of the files I have already done. Besides, I have used 11 sortcl scripts for aggregating the records in the next job. The speed I am getting for bucketing is about 3500 rows/sec, as I have many other buckets also in this application.
If I told you my full design, it is not at all simple; besides, I don't want simple solutions either.
After analysing, I feel that 3500 rows/sec is not the desired speed to process 300 million records. The application is supposed to run daily.
I really appreciate the design you have suggested, but speed is my main concern. I have tried various methods to increase the speed, and to an extent I am successful. I was getting 500 rows/sec initially and now I am getting 3500 rows/sec. I want at least 10000 rows/sec.
Please suggest, and don't take it personally.
I am really sorry if I am being hard.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
Okay, well then here's a last parting attempt to help before going to sleep (it's 1:30 in the morning here).
If you're processing 300 million rows a day, I would hope that you have opportunities for incremental update rather than full re-aggregation.
Otherwise, you must keep in mind that your unit of measurement is not rows/second, but rather bytes/second. As a wider row (more columns) will add more data overhead, as well as transformation overhead, you cannot rely on rows/second to measure this throughput. You still have not addressed my point about job instantiation. I'm going to highlight this so you don't miss it:
If your job is running at 3500 rows per second, then 3 instances of the job, each sharing 1/3 of the source data, will use 3 CPUs and give you 10500 rows/second net throughput. Instantiation is the best method for achieving multiples of processing improvement in Server jobs.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle