Question: 1) Overflow 2) "Running job ..." for 4 days
Hi Experts:
Please help on these questions.
My file A has 36 million+ rows, and file B has 20 million+ rows. UNIX, QualityStage 7.0, default output.
1) Overflow:
In log/jobName.MTC.StepFREQLDA.log, there is a message:
"937 number of overflow areas allocated"; in jobName.MTC.StepFREQLDB.log: "695 number of overflow areas allocated". This job has 7 passes. No overflow message appears in the other log files.
If this is an overflow problem, how do I solve it?
2) The match job does not stop, but it seems the results have all been produced.
My match job was started last Friday evening and it is still running; the File Mode execution window shows the message "Running job ...", but since yesterday, January 31, 2010, 1:46:46 PM, nothing has been updated.
In the Data folder there are 4 files: jobname.RPT was modified at 1/31/2010 1:46:39 PM, and the OUT, RAS, and RBS files were modified at 1/31/2010 1:46:44 PM. The most recent log files are jobName.Step001.log and jobname.MTC.StepEXTRACTC.log, modified at 1/31/2010 1:46:46 PM. The other log files for steps 1-7 were modified before that time. I wonder if the process is hung on a file that is too large. Has the job finished?
3) Could you also point me to a good example for extracting output? I have a primary key in file A; file B does not have the related values. I want the output to include the primary key. I also want the match pass number in the output file.
Thanks in advance
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
This advice from kejuste can and should be ignored. We're talking blocks in QualityStage Match here - nothing whatsoever to do with any database!
That you're getting block overflow suggests that you're not discriminating particularly well with your choice of blocking fields. Try adding one or two more blocking fields in the pass or passes where this alert is being generated. In keeping with your original objectives, of course.
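To see why poorly discriminating blocking fields cause trouble: a two-file match scores every file-A record in a block against every file-B record sharing the same key, so the work grows with the product of the block sizes. A minimal sketch in Python (illustration only, not QualityStage internals; the function and toy data are hypothetical):

```python
from collections import Counter

def candidate_pairs(a_keys, b_keys):
    """Comparisons a two-file match must make: each file-A record in a
    block is scored against every file-B record with the same key."""
    a, b = Counter(a_keys), Counter(b_keys)
    return sum(a[k] * b[k] for k in a.keys() & b.keys())

# Toy data: 10,000 A rows and 6,000 B rows blocked on birth year alone
a_years = ["1970"] * 5000 + ["1971"] * 5000
b_years = ["1970"] * 3000 + ["1971"] * 3000
print(candidate_pairs(a_years, b_years))  # 30000000 comparisons

# Blocking the same rows on (year, surname initial) instead - assuming
# initials are roughly evenly spread - cuts the work by about 26x.
```

Adding a second or third blocking field multiplies the number of distinct block values, which is exactly why it shrinks each block below the overflow threshold.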
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
Daxinz,
Could you post your:
- blocking strategy for your 7 passes
- Match commands
With that info we could make better suggestions ...
Anyway, matching a 36M-row file against a 20M-row file using just regular QS jobs will take a considerable amount of time, even with the best blocking/matching strategy.
Regards
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses"
Thanks so much for your help. The following is the match/block setup:
Pass 1 Block: SSN.
Match: Last Name (UNCERT), First Name (UNCERT), Initial, Birth YY, Birth MM
Pass 2 Block: Last Name (char), First Name (char), Birth YY.
Match: Birth MM, Initial, SSN, Birth DD, ZIP
Pass 3 Block: Birth YYMM, Last Name.
Match: First Name, Initial, Birth DD, ZIP, SSN
Pass 4 Block: Birth YYMM, First Name.
Match: Last Name, Initial, Birth DD, ZIP, SSN
Pass 5 Block: First Name NYSIIS, Last Name (4 chars only).
Match: Birth YY, Birth MM, Birth DD, Initial, ZIP
Pass 6 Block: Last Name NYSIIS, First Name (4 chars only).
Match: Birth MM, Birth YY, Birth DD, Initial, ZIP, SSN
Pass 7 Block: First Name NYSIIS, Birth YY.
Match: Birth MM, Birth DD, Last Name NYSIIS, Initial, ZIP
Cut-off is 0 for testing.
I am thinking of changing the block in pass 7 to block on the Soundex of the last name and the Soundex of the first name, plus birth YY.
In addition, could you also check whether the overflow message indicates an error? If it does, how should I handle it?
After this match, I am going to have a larger file A and file B to match, which worries me a lot.
Thank you again.
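Since blocking on Soundex codes is being considered: Soundex collapses a name to its first letter plus three digits, so similar-sounding names share a block key. A sketch of the standard algorithm in Python, for illustration only (QS provides its own Soundex and NYSIIS routines, which should be used in the job itself):

```python
def soundex(name: str) -> str:
    """Standard 4-character Soundex: first letter plus three digits.
    H and W are transparent; vowels separate otherwise-equal codes."""
    codes = {}
    for letters, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = d
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    digits, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue                      # transparent: prev unchanged
        d = codes.get(ch, "")
        if d and d != prev:
            digits.append(d)
        prev = d                          # vowels reset prev to ""
    return (name[0] + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Note the trade-off: Soundex makes blocks coarser than exact-character blocking (more names collide), which helps recall but can make overflow worse, not better, on a 36M-row file.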
Last edited by daxinz on Tue Feb 02, 2010 12:40 pm, edited 3 times in total.
Last night, I ran the match with the first two passes only. It started at 10:30 last night and ended at 9:07 AM this morning. In the advanced running options, I increased all the buffers to five times their size.
It was successful, with no error messages. But there are still overflow-related messages in two log files:
jobName.MTC.StepFREQLDA.log(403): freqld(7192): 3326 number of overflow areas allocated
jobName.MTC.StepFREQLDB.log(65): freqld(21667): 3930 number of overflow areas allocated
Please help diagnose it with the information in my previous reply.
Thanks
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It may not be possible. It depends on your data. For example, if there are many records with the same birth month and first name, then you will tend to get overflow used in pass 4. There's a similar argument for pass 7. Match will still work when there is overflow - it just won't be as efficient. If the alerts are a concern to you and your matching model won't allow tighter blocking, demote the alerts using a message handler.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
Daxinz,
Basically the job is choking ... and that is causing the overflow. With a good blocking strategy you should be fine, even though the job will take time to exhaust that number of records. It's a matter of a divide-and-conquer strategy.
The size of the blocks is the most important factor in match performance. The size can be reduced by using more or different blocking variables. You should be able to add columns to your input file while creating it, like "first three chars of FirstName", just for blocking purposes.
Before we do more trial and error, it would be very helpful if you could state your match requirements - in layman's terms, "when are two records considered a match?"
Regards
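The "derived column just for blocking" idea above can be sketched as follows. This is a hypothetical Python illustration (field names invented); in QS 7 you would add the column while building the fixed-width input file, e.g. in SAS, and then name it as a blocking field in the pass:

```python
from collections import defaultdict

def blocking_key(rec):
    """Derived blocking column: first 3 chars of the first name
    plus birth year (field names here are hypothetical)."""
    return (rec["first_name"][:3].upper(), rec["birth_yy"])

recs = [{"first_name": "Jonathan", "birth_yy": "1970"},
        {"first_name": "Jon",      "birth_yy": "1970"},
        {"first_name": "Mary",     "birth_yy": "1970"}]

blocks = defaultdict(list)
for r in recs:
    blocks[blocking_key(r)].append(r)
# ("JON", "1970") now holds both Jon* records; "Mary" gets her own
# block, so she is never compared against the Jons in this pass.
```

A truncated prefix like this keeps near-matches ("Jon"/"Jonathan") in the same block while still splitting the file into many more, smaller blocks than birth year alone would.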
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses"
Please keep helping.
If I run pass 1 only, blocking on SSN, there is still an overflow message in jobName.MTC.StepFREQLDA.log, and only in that file.
50% of the records in my file A have no SSN, and a lot of the SSNs are invalid. The SSN is duplicated in some records. In file B the SSNs are unique, and all of them passed my validation.
Could you please check the messages from StepFREQLDA.log below and help me find some hints?
freqld(14636):Thu 04 Feb 2010 07:13:52 AM PST LOG: File [A] -- Working on record: 100000
...... ......
freqld(14636):Thu 04 Feb 2010 09:44:41 AM PST LOG: File [A] -- Working on record: 36100000
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: *** Indexing System I/O Statistics***
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 12000 number of control areas
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 10000 number of Buffers allocated
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 3327 number of overflow areas allocated
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 5843257 number of Physical writes
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 299163070 number of Logical writes
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 6799323 number of Physical reads
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 1288686480 number of Logical reads
freqld(14636):Thu 04 Feb 2010 10:13:22 AM PST LOG: 510668 number of records in file
What is the meaning of each line? For example, "5843257 number of Physical writes" - 5,843,257 rows? Which rows? Written where - to the overflow location as well? And "510668 number of records in file" - which records is it referring to?
In jobName.MTC.StepFREQLDB.log, the overflow message is:
freqld(9252):Thu 04 Feb 2010 10:29:39 AM PST LOG: 3929 number of overflow areas allocated
File B has far fewer records, and its SSNs are unique. Why is the number of overflow areas allocated higher?
In addition, it would help if I understood the meaning of the other log files. Is there a document that explains these files and their contents? If not, could anybody explain them, especially the contents related to overflow?
Also, can I capture the overflowed records for further matching?
Thanks
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
Daxinz,
The best way to help you is to know what your matching goals or requirements are. From there we can suggest a good blocking/matching strategy.
Meanwhile, here are some answers based on your post:
daxinz wrote: Please keep helping.
If I only run the pass 1 only, block on SSN. There are sill overflow message in jobName.MTC.StepFREQLDA.log, and only in this file.
50% of the records in my file A has no SSN, and a lot of them are illegal. SSN is duplicated in some records. In file B SSN are unique, and all passed my validation.
A --> All records containing an empty SSN will be in one BIG block, causing overflow. One way to avoid this is to convert all empty SSNs to null.
In addition, can I catch the records overflowed for further match?
All records in a block with overflow will be skipped and will be available in the next pass.
Thanks
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses"
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
Daxinz,
I notice that you are on version 7.x ... it is more difficult with that version.
The idea of converting empty SSNs to null values is to force QS to make those records residuals instead of creating a big block. In a blocking strategy, all records that don't fall into a block become residuals.
So while creating the fixed file with SAS, just replace the empty SSN values with a sequential number. Even better, concatenate the initial letters of the first and last names onto the number, making sure those numbers are unique and easy to identify for purging purposes. Because they will be unique, they won't fall into any block, so they will automatically become residuals.
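The placeholder-SSN trick above can be sketched like this. A hypothetical Python illustration (the source suggests doing it in SAS while building the fixed file; the record layout and field names here are invented):

```python
def fill_missing_ssn(records):
    """Replace empty SSNs with unique placeholders: the first/last
    initials plus a zero-padded sequence number. The letters guarantee
    no collision with a real 9-digit SSN and make purging easy."""
    seq = 0
    for rec in records:
        if not rec["ssn"].strip():
            seq += 1
            rec["ssn"] = "%s%s%07d" % (rec["first"][:1], rec["last"][:1], seq)
    return records

rows = [{"first": "John", "last": "Doe", "ssn": ""},
        {"first": "Ann",  "last": "Lee", "ssn": "123456789"}]
fill_missing_ssn(rows)
print(rows[0]["ssn"])   # JD0000001
```

Because each placeholder value occurs exactly once, no block forms around it in the SSN pass, so those records drop out as residuals and are picked up by the later name/date passes instead.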
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses"
Since you said in a previous reply that the overflowed records go into the next passes anyway, I would like to allow this overflow. SSN is a match field in later passes as an integer; I use CNT_DIFF with Param 1 = 1. The records in file B all have good SSNs.
My last question for this topic: is there any possible problem with allowing this overflow? Does it hide other overflows or other errors? If it is fine, I am going to close this topic.
Thank you so much.
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
Daxinz,
Not for that process itself, but a lack of resources can cause the server to stop spawning processes, and other jobs will fail.
Using a generated SSN for the records with an empty SSN will avoid the overflow and won't damage your next-pass matching strategy.
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses"