Partial Data Load

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Partial Data Load

Post by karthi_gana »

All,

I am facing this question in almost every interview.

The question is:

seqfile --> transformer --> odbc

seqfile = 10000 records

The job failed after loading 5000 records. How do I load only the remaining records?

What is the proper way to handle this? We are using 8.1 and 8.5.
Karthik
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

How have you been answering this question in all those interviews?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The proper way to handle it is to design a restart mechanism into your job design. Most of these will require you - within the job design itself or within a second job used for the "recovery phase" - to ascertain how many rows were successfully committed. This usually requires some way uniquely to identify the run that committed them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
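
Not DataStage code, but the shape of such a recovery phase can be sketched in a few lines of Python; sqlite3 merely stands in for the ODBC target, and the table, column and function names below (target_table, run_id, payload) are invented for illustration. It assumes a sequential load in source-file order and that every committed row carries the identifier of the run that wrote it.

Code:
import sqlite3

# Recovery-phase sketch: count the rows the aborted run committed,
# then reload only the rows after that point.
def rows_committed(conn, run_id):
    cur = conn.execute("SELECT COUNT(*) FROM target_table WHERE run_id = ?",
                       (run_id,))
    return cur.fetchone()[0]

def reload_remaining(conn, source_path, run_id):
    already_loaded = rows_committed(conn, run_id)
    with open(source_path) as src:
        for row_num, line in enumerate(src, start=1):
            if row_num <= already_loaded:
                continue                 # skip rows the aborted run committed
            conn.execute("INSERT INTO target_table (run_id, payload) "
                         "VALUES (?, ?)", (run_id, line.rstrip("\n")))
    conn.commit()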
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

chulett wrote:How have you been answering this question in all those interviews?
nope :)
Last edited by karthi_gana on Wed Aug 08, 2012 12:09 am, edited 1 time in total.
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

Could you explain it in detail? It would be helpful if you could explain it with a simple job design for both methods.
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

LoadId={aborted run}


I am not getting this.

Do you mean LoadId <> 1 or is null?

Can you give some other options for doing the same?
Karthik
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

The "LoadId" is a unique audit number to identify records from a particular run, in this case, the run that aborted. Only issue here of course is if it fails again so you would really need to keep track of LoadIds (and use an "in" clause) or re-use the same LoadId.

Alternatives would be to delete the previous set of records based on the LoadId, to be worthwhile, the loadid would need to be indexed, or prior to each run, check how many records loaded (x) based on loadid (or any other trick), have a transform in the middle that filters out the first (x) rows, you would need to load sequentially and probably work on sorted data to have any chance of working correctly.

Best choice really depends on how frequently you expect it to fail, how many records you are trying to load and how big the table is. At 10,000 records I can't see any dilemma since it's pretty small by current norms
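
A minimal sketch of the "delete the previous set of records based on the LoadId" alternative, again in Python with sqlite3 standing in for the target database and invented table/column names:

Code:
import sqlite3

# Roll back everything the aborted run wrote, identified by its LoadId,
# so the whole source file can simply be re-run with the same LoadId.
# As noted above, this is only worthwhile if load_id is indexed.
def remove_aborted_run(conn, load_id):
    conn.execute("DELETE FROM target_table WHERE load_id = ?", (load_id,))
    conn.commit()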
m_mani87
Participant
Posts: 45
Joined: Thu May 24, 2012 11:13 pm
Location: Coimbatore

Post by m_mani87 »

This would work, but it is a bit complex:

1) Take the source again and look it up against your target ODBC.
2) Prior to that, add a column dummy = 1 on the ODBC side and then do the lookup.
3) Pass the dummy through to the target as well.
4) The rows coming out of the lookup with dummy = 1 are the records already loaded.
5) Filter the records with the condition dummy <> 1 and load the remaining rows into the ODBC target.

Kindly suggest if anyone knows an easier solution.
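
A rough Python sketch of the lookup-and-filter idea above, with sqlite3 standing in for the ODBC stages and key_col/payload as invented column names; the set membership test plays the role of the lookup that sets dummy = 1:

Code:
import sqlite3

# Re-read the full source, mark rows whose key already exists in the
# target (the "dummy = 1" case), and insert only the rest.
def load_missing_rows(conn, source_rows):
    loaded_keys = {key for (key,) in conn.execute("SELECT key_col FROM target_table")}
    for key, payload in source_rows:
        if key in loaded_keys:           # dummy = 1: already loaded, drop it
            continue
        conn.execute("INSERT INTO target_table (key_col, payload) VALUES (?, ?)",
                     (key, payload))
    conn.commit()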
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

A slight correction to the previous post by m_mani87:

1) It assumes you have the ability to construct some sort of unique key from each record so you can match it to the ones already written out. If you can't, then an alternative is to use the "Row Number Column" option on the sequential file stage to have it generate a unique key for each row as it is read.

2) The ones where Dummy isn't found would be null, since it didn't find a match. That would allow you to add a constraint that only allowed records through for insertion where Dummy was null.

If you use either method correctly, it could allow the job to gradually work its way through the file, even if something caused several aborts.

Andy
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
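
Where no natural key exists, a generated row number (which is what the Sequential File stage's Row Number Column option provides) can serve as the lookup key instead. A tiny Python sketch of that idea, reusing the hypothetical load_missing_rows function sketched earlier:

Code:
# Attach a sequence number to each source row as it is read, mimicking
# the Row Number Column option; the number then acts as the lookup key.
def add_row_numbers(source_lines):
    for row_num, line in enumerate(source_lines, start=1):
        yield row_num, line.rstrip("\n")

# e.g. load_missing_rows(conn, add_row_numbers(open("seqfile.txt")))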
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

What about the resources and time consumed by the lookup operation itself? I'm assuming the source data is itself a load-ready file, a dataset, or a 'load-ready' work table.
rameshrr3
Premium Member
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

An indexed work table with a commit_grp_id and a row_seq_no may work. Have a one-time job do a parameterized select.

For partitioned loads, limit the number of partitions to 4 or 8, use modulus partitioning on the row_seq_no field (if possible), and select based on the minimum common record loaded. With a work table even a merge is possible in Oracle, but that would be bypassing ETL and would require an update statement.
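
A hedged Python sketch of the "minimum common record loaded" idea for a modulus-partitioned load; work_table, commit_grp_id and row_seq_no mirror the names above, everything else is invented, and sqlite3 stands in for the database:

Code:
import sqlite3

# Each partition (row_seq_no modulo the partition count) reports the highest
# row_seq_no it committed for this commit group. Every row at or below the
# smallest of those maxima was loaded by its own partition, so that is a safe
# restart point; rows between it and a partition's own maximum still need
# de-duplicating (or deleting) before the re-run, e.g. via the merge
# mentioned above.
def restart_point(conn, commit_grp_id, num_partitions):
    sql = ("SELECT row_seq_no % ? AS part, MAX(row_seq_no) "
           "FROM work_table WHERE commit_grp_id = ? GROUP BY part")
    max_per_part = dict(conn.execute(sql, (num_partitions, commit_grp_id)))
    if len(max_per_part) < num_partitions:
        return 0                         # a partition committed nothing: reload all
    return min(max_per_part.values())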


I'm still not sure how to achieve this using a dataset, though (assuming a partitioned load), while avoiding a lookup operation on the target. I'm thinking the overhead/time consumed by a lookup operation or CDC may itself be significant, given that this requirement primarily caters to a 'very large data volume' scenario.

Another option with a dataset would be to create multiple datasets for the load and loop through them, but that involves the overhead of calling the job successively and keeping track of the iteration count. So probably create as many datasets as there are partitions and load them one after another. If a job fails after loading 5 of 8 datasets, manually run the parameterized load for datasets 5 through 8 (assuming the dataset name has a partition number suffix).

There is no avoiding reloading dataset #5 in its entirety, though.
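
And a last Python sketch of that per-partition dataset idea: split the load into one dataset per partition, and after a failure re-run only the failed one onward. The load_one_dataset callable and the name suffix convention are invented for the example:

Code:
# Re-run datasets resume_from..num_partitions, one after another. After a
# failure while loading dataset 5 of 8, call with resume_from=5; dataset 5
# is still reloaded in its entirety, as noted above.
def load_datasets(load_one_dataset, num_partitions, resume_from=1):
    for part in range(resume_from, num_partitions + 1):
        load_one_dataset(f"load_ds.part{part}")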