Large file to split

vsi · Post by **vsi** » Fri Dec 15, 2006 9:45 am

Hi folks,

I have a 120 GB source file that I need to split into several smaller files so I can work with it.
The file is loaded into multiple tables by multiple DataStage jobs.

I don't know the metadata I would need in order to split this file into separate files.

Please help me to solve this issue.

Thanks in advance.

DSguru2B · Post by **DSguru2B** » Fri Dec 15, 2006 9:48 am

vsi wrote: I don't know the metadata I would need in order to split this file into separate files.

Please help me to solve this issue.

Remember, DataStage is metadata driven. Even file-split logic will be metadata driven. If you have no idea about the metadata, how are you going to split the file?

vsi · Post by **vsi** » Fri Dec 15, 2006 10:02 am

Thanks for your response, DSguru2B.

Sorry, I have loaded the metadata,

and the job design is as follows:

seq.file -----> Transformer ------> multiple sequential files.

In the Transformer I used these constraints:

@INROWNUM<10000 ----- FIRSTFILE
@INROWNUM>10001AND@INROWNUM

I use a condition like this for each output sequential file,

but it is not working.

Is there any other method to split the file into multiple files?

Thanks in advance.

DSguru2B · Post by **DSguru2B** » Fri Dec 15, 2006 10:12 am

If you're on the server engine, use a Link Partitioner stage to achieve this.
If you're on PX, your condition should work:
for the first link use @INROWNUM < 10000,
for the second use @INROWNUM >= 10000 and @INROWNUM < 20000,
and so on.
What error are you getting?
You can also split the file at the OS level by using the split command.
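
For the OS-level route, here is a minimal sketch with the standard `split` utility, assuming one record per line. The file name `bigfile.dat`, the prefix `part_`, and the tiny 3-line chunk size are placeholders for the demo only; for chunks of the size discussed above you would pass something like `-l 10000`.

```shell
# Work in a scratch directory so the demo leaves nothing behind.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the real source file: 10 one-line records.
seq 1 10 > bigfile.dat

# Cut the file into pieces of at most 3 lines each: part_aa, part_ab, ...
split -l 3 bigfile.dat part_

# Check that no record was lost or reordered, then show the piece sizes.
cat part_* | cmp - bigfile.dat && echo "pieces match the original"
wc -l part_*
```

Splitting by line count never cuts a record in half, but it knows nothing about headers or record types, so it only fits a file where every line can stand on its own.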

narasimha · Post by **narasimha** » Fri Dec 15, 2006 10:39 am

vsi,

What environment are you working in?
In your post you have the Job Type as Server, but you are posting in the Parallel forum.
If the two are consistent, you will get a more specific answer.

thebird · Post by **thebird** » Fri Dec 15, 2006 11:14 am

Remember that in a Parallel job, @OUTROWNUM and @INROWNUM are evaluated separately on each node the job runs on. This means that if you give a constraint @INROWNUM<=1000 and the job runs on a 4-node configuration, the Transformer sends out 1000 rows from each node, giving you 4000 rows in your file.
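
The arithmetic can be reproduced outside DataStage. The sketch below is a simulation only, not DataStage code: it assumes the rows are dealt round-robin to 4 nodes, gives each simulated node its own counter standing in for @INROWNUM, and applies the constraint `<= 1000` on each node.

```shell
# Simulate 10000 input rows dealt round-robin to 4 nodes.
# Each node keeps its own counter, mimicking a per-partition @INROWNUM.
passed=$(seq 1 10000 | awk -v nodes=4 '
  {
    node = (NR - 1) % nodes      # which simulated node receives this row
    inrownum[node]++             # per-node row counter, starts at 1 on every node
    if (inrownum[node] <= 1000)  # the constraint, evaluated locally on the node
      count++
  }
  END { print count }
')
echo "rows passing the constraint: $passed"
```

Each of the 4 nodes lets its own first 1000 rows through, so the script reports 4000 rows rather than 1000.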

DSguru2B · Post by **DSguru2B** » Fri Dec 15, 2006 11:47 am

True. Use this amazing post by vmcburney to handle partition numbers and the number of partitions. Constrain it accordingly.
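
One way to constrain on partition information, assuming round-robin partitioning and the parallel Transformer system variables @PARTITIONNUM (zero-based) and @NUMPARTITIONS, is to derive a job-wide row number as `(@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1` and test that instead of @INROWNUM. This formula is an assumption about the approach, not a quote from the linked post. The awk simulation below only checks the arithmetic for 4 simulated nodes.

```shell
# Deal 10000 rows round-robin to 4 simulated nodes and rebuild the
# job-wide row number from the per-node counter and the node number:
#   global = (inrownum - 1) * nodes + node + 1
result=$(seq 1 10000 | awk -v nodes=4 '
  {
    node = (NR - 1) % nodes
    inrownum[node]++
    global = (inrownum[node] - 1) * nodes + node + 1
    if (global != NR) mismatches++   # NR is the true input row number
    if (global <= 1000) count++      # constraint on the job-wide number
  }
  END { print count, mismatches + 0 }
')
echo "passed and mismatches: $result"
```

With the derived number, only the first 1000 input rows pass, regardless of how many nodes share the work. The formula does not hold for other partitioning methods such as hash, where rows are not dealt out evenly.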

vsi · Post by **vsi** » Fri Dec 15, 2006 12:46 pm

Thanks for your responses, folks.

Environment: parallel
Version: 7.5.2
Operating system: Linux

Regarding the split command: the file has header and detail records.
For example, a header record carries customer id and customer group,
and the detail records that refer to it carry insurance, tax, address, zipcode, and so on.

The file is fixed width.

Even though it is a parallel job, the environment is not fully configured yet; I mean the APT configuration.
Please share your valuable ideas for resolving the issue.

Thanks in advance.

jgreve · Post by **jgreve** » Fri Dec 15, 2006 2:34 pm

Can you give me more information:

Why does splitting this file make your life better? What is the problem if you just leave it alone? (I am sincere about asking this - the reason you want to split the file influences the way you need to split it.)

Run these commands on your file and paste the output here:

Code:

ls -l file
wc file

What is the record format?
or... what are the record formats?
These detail records you mentioned:
is it like this kind of pattern?

Code:

HDR:vsi
DTL:tax#1
DTL:tax#2
DTL:tax#3
HDR:jgreve
DTL:tax#1
DTL:tax#2
HDR:wurlod
DTL:tax#1
DTL:tax#2
DTL:tax#3
DTL:tax#4

or like this:

Code:

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2
:vsi:tax#1 :tax#2 :tax#3:::::
:jgreve :tax#1 :tax#2::::::
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::

I can't tell you without knowing what your data looks like.
Ok... actually I could tell you to do something, but it is likely to cause harm instead of helping you.

Post some example records, if you can - change everybody's name
and tax id if you have to.

John G.


vsi · Post by **vsi** » Fri Dec 15, 2006 3:15 pm

Thanks for your response.

The reasons for splitting the file are:

1. The file is too large: 120 GB.

2. When I run parallel jobs on files this large, I get heap allocation errors.

3. The node configuration for our parallel environment is not done yet (it is still in progress).

4. The same file is the source for 11 ETL jobs.

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2
:vsi:tax#1 :tax#2 :tax#3:::::
:jgreve :tax#1 :tax#2::::::
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::

I will provide further details if you need them.

Thanks in advance.

thebird · Post by **thebird** » Fri Dec 15, 2006 3:22 pm

What is the problem? Can you be specific?

Were you not able to split the large file using a counter in the Transformer (as mentioned in Vincent's post, linked above by DSguru2B)?

narasimha · Post by **narasimha** » Fri Dec 15, 2006 3:38 pm

Another clarification needed

Does only the first line of your big file have the column names, like this?

Code:

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:vsi:tax#1 :tax#2 :tax#3::::: 
:jgreve :tax#1 :tax#2:::::: 
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::

If this is the case, you can use the split command, as mentioned by DSguru2B.

or

Do the column names repeat at regular or irregular intervals, like below?

Code:

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:vsi:tax#1 :tax#2 :tax#3::::: 
:jgreve :tax#1 :tax#2:::::: 
:wurlod :tax#1 :tax#2 :tax#3 :tax#4:::: 
:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:abc:tax#1 :tax#2 :tax#3::::: 
:def :tax#1 :tax#2:::::: 
:ghi :tax#1 :tax#2 :tax#3 :tax#4::::

Then we need a different approach.
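
For the repeating-header case, one possible approach (a sketch, not necessarily what was intended here) is to start a new output file every time the header line recurs. It assumes the header is repeated character for character; `sample.txt`, the shortened column list, and the `chunk_N.txt` names are made up for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small stand-in file with the header repeated once in the middle.
cat > sample.txt <<'EOF'
:NAME:TAX_1:TAX_2
:vsi:tax#1 :tax#2
:jgreve :tax#1 :
:NAME:TAX_1:TAX_2
:abc:tax#1 :tax#2
EOF

# Remember line 1 as the header; open a new chunk whenever it shows up again.
awk '
  NR == 1      { header = $0 }
  $0 == header { if (out) close(out); out = "chunk_" (++n) ".txt" }
               { print > out }
' sample.txt

wc -l chunk_*.txt
```

Every chunk then begins with its own copy of the header, so each one can be read with the same column definitions. Closing the previous chunk before opening the next keeps the number of open files at one, which matters when the header recurs many times.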

ray.wurlod · Post by **ray.wurlod** » Fri Dec 15, 2006 4:03 pm

Why am I paying more taxes than the others?

narasimha · Post by **narasimha** » Fri Dec 15, 2006 4:04 pm

You get more, you pay more

jgreve · Post by **jgreve** » Mon Dec 18, 2006 5:21 pm

ray.wurlod wrote:Why am I paying more taxes than the others?

Well, this is just a toy example. I didn't want to put all the tax fields in there, or we'd be looking at something like this to support your client list, yes?

Code:

TAX_1:TAX_2: ... :TAX_999:TAX_1000

I'm just ungruntled 'cuz I'm untangling my taxes for the year; god I hate paperwork.
John G.

DSXchange
