Partial Data Load

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Partial Data Load

Post by karthi_gana »

All,

I am facing this question in almost every interview.

The question is:

seqfile --> transformer --> odbc

seqfile = 10000 records

The job failed after loading 5000 records. How do I load only the remaining records?

What is the proper way to handle this? We are using 8.1 and 8.5.
Karthik
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

How have you been answering this question in all those interviews?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The proper way to handle it is to design a restart mechanism into your job design. Most of these will require you - within the job design itself or within a second job used for the "recovery phase" - to ascertain how many rows were successfully committed. This usually requires some way uniquely to identify the run that committed them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
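
Not DataStage code, but the shape of such a recovery phase can be sketched in a few lines of Python; sqlite3 merely stands in for the ODBC target, and the table, column and function names below (target_table, run_id, payload) are invented for illustration. It assumes a sequential load in source-file order and that every committed row carries the identifier of the run that wrote it.

Code:
import sqlite3

# Recovery-phase sketch: count the rows the aborted run committed,
# then reload only the rows after that point.
def rows_committed(conn, run_id):
    cur = conn.execute("SELECT COUNT(*) FROM target_table WHERE run_id = ?",
                       (run_id,))
    return cur.fetchone()[0]

def reload_remaining(conn, source_path, run_id):
    already_loaded = rows_committed(conn, run_id)
    with open(source_path) as src:
        for row_num, line in enumerate(src, start=1):
            if row_num <= already_loaded:
                continue                 # skip rows the aborted run committed
            conn.execute("INSERT INTO target_table (run_id, payload) "
                         "VALUES (?, ?)", (run_id, line.rstrip("\n")))
    conn.commit()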
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

chulett wrote:How have you been answering this question in all those interviews?
nope :)
Last edited by karthi_gana on Wed Aug 08, 2012 12:09 am, edited 1 time in total.
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

Could you explain it in detail? It would be helpful if you could explain it with a simple job design for both methods.
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

LoadId={aborted run}


I am not getting this.

Do you mean LoadId <> 1 or is null?

Can you give some other options for doing the same?
Karthik
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

The "LoadId" is a unique audit number to identify records from a particular run, in this case, the run that aborted. Only issue here of course is if it fails again so you would really need to keep track of LoadIds (and use an "in" clause) or re-use the same LoadId.

Alternatives would be to delete the previous set of records based on the LoadId, to be worthwhile, the loadid would need to be indexed, or prior to each run, check how many records loaded (x) based on loadid (or any other trick), have a transform in the middle that filters out the first (x) rows, you would need to load sequentially and probably work on sorted data to have any chance of working correctly.

Best choice really depends on how frequently you expect it to fail, how many records you are trying to load and how big the table is. At 10,000 records I can't see any dilemma since it's pretty small by current norms
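
A minimal sketch of the "delete the previous set of records based on the LoadId" alternative, again in Python with sqlite3 standing in for the target database and invented table/column names:

Code:
import sqlite3

# Roll back everything the aborted run wrote, identified by its LoadId,
# so the whole source file can simply be re-run with the same LoadId.
# As noted above, this is only worthwhile if load_id is indexed.
def remove_aborted_run(conn, load_id):
    conn.execute("DELETE FROM target_table WHERE load_id = ?", (load_id,))
    conn.commit()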
m_mani87
Participant
Posts: 45
Joined: Thu May 24, 2012 11:13 pm
Location: Coimbatore

Post by m_mani87 »

This would work, but it is a bit complex:

1) Take the source again and look it up against your target ODBC.
2) Prior to that, add a column dummy = 1 on the ODBC side and then do the lookup.
3) Pass the dummy through to the target as well.
4) The rows coming out of the lookup with dummy = 1 are the records already loaded.
5) Filter the records with the condition dummy <> 1 and load the remaining rows into the ODBC target.

Kindly suggest if anyone knows an easier solution.
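
A rough Python sketch of the lookup-and-filter idea above, with sqlite3 standing in for the ODBC stages and key_col/payload as invented column names; the set membership test plays the role of the lookup that sets dummy = 1:

Code:
import sqlite3

# Re-read the full source, mark rows whose key already exists in the
# target (the "dummy = 1" case), and insert only the rest.
def load_missing_rows(conn, source_rows):
    loaded_keys = {key for (key,) in conn.execute("SELECT key_col FROM target_table")}
    for key, payload in source_rows:
        if key in loaded_keys:           # dummy = 1: already loaded, drop it
            continue
        conn.execute("INSERT INTO target_table (key_col, payload) VALUES (?, ?)",
                     (key, payload))
    conn.commit()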
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

A slight correction to the previous post by m_mani87:

1) It assumes you have the ability to construct some sort of unique key from each record so you can match it to the ones already written out. If you can't, then an alternative is to use the "Row Number Column" option on the sequential file stage to have it generate a unique key for each row as it is read.

2) The ones where Dummy isn't found would be null, since it didn't find a match. That would allow you to add a constraint that only allowed records through for insertion where Dummy was null.

If you use either method correctly, it could allow the job to gradually work its way through the file, even if something caused several aborts.

Andy
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
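
Where no natural key exists, a generated row number (which is what the Sequential File stage's Row Number Column option provides) can serve as the lookup key instead. A tiny Python sketch of that idea, reusing the hypothetical load_missing_rows function sketched earlier:

Code:
# Attach a sequence number to each source row as it is read, mimicking
# the Row Number Column option; the number then acts as the lookup key.
def add_row_numbers(source_lines):
    for row_num, line in enumerate(source_lines, start=1):
        yield row_num, line.rstrip("\n")

# e.g. load_missing_rows(conn, add_row_numbers(open("seqfile.txt")))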
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

What about the resources and time consumed by the lookup operation itself? I'm assuming the source data is itself a load-ready file, a dataset, or a 'load-ready' work table.
rameshrr3
Premium Member
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

An indexed work table with a commit_grp_id and a row_seq_no may work. Have a one-time job do a parameterized select.

For partitioned loads, limit the number of partitions to 4 or 8, use modulus partitioning on the row_seq_no field (if possible), and select based on the minimum common record loaded. With a work table even a merge is possible in Oracle, but that would be bypassing ETL and would require an update statement.
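
A hedged Python sketch of the "minimum common record loaded" idea for a modulus-partitioned load; work_table, commit_grp_id and row_seq_no mirror the names above, everything else is invented, and sqlite3 stands in for the database:

Code:
import sqlite3

# Each partition (row_seq_no modulo the partition count) reports the highest
# row_seq_no it committed for this commit group. Every row at or below the
# smallest of those maxima was loaded by its own partition, so that is a safe
# restart point; rows between it and a partition's own maximum still need
# de-duplicating (or deleting) before the re-run, e.g. via the merge
# mentioned above.
def restart_point(conn, commit_grp_id, num_partitions):
    sql = ("SELECT row_seq_no % ? AS part, MAX(row_seq_no) "
           "FROM work_table WHERE commit_grp_id = ? GROUP BY part")
    max_per_part = dict(conn.execute(sql, (num_partitions, commit_grp_id)))
    if len(max_per_part) < num_partitions:
        return 0                         # a partition committed nothing: reload all
    return min(max_per_part.values())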


I'm still not sure how to achieve this using a dataset, though (assuming a partitioned load), while avoiding a lookup operation on the target. I'm thinking the overhead/time consumed by a lookup operation or CDC may itself be significant, given that this requirement primarily caters to a 'very large data volume' scenario.

Another option with a dataset would be to create multiple datasets for the load and loop through them, but that involves the overhead of calling the job successively and keeping track of the iteration count. So probably create as many datasets as there are partitions and load them one after another. If a job fails after loading 5 of 8 datasets, manually run the parameterized load for datasets 5 through 8 (assuming the dataset name has a partition number suffix).

There is no avoiding reloading dataset #5 in its entirety, though.
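
And a last Python sketch of that per-partition dataset idea: split the load into one dataset per partition, and after a failure re-run only the failed one onward. The load_one_dataset callable and the name suffix convention are invented for the example:

Code:
# Re-run datasets resume_from..num_partitions, one after another. After a
# failure while loading dataset 5 of 8, call with resume_from=5; dataset 5
# is still reloaded in its entirety, as noted above.
def load_datasets(load_one_dataset, num_partitions, resume_from=1):
    for part in range(resume_from, num_partitions + 1):
        load_one_dataset(f"load_ds.part{part}")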