
Large file to split

Posted: Fri Dec 15, 2006 9:45 am
by vsi
Hi folks,

I have a 120 GB source file that I need to split into several files so that I can work with it.
The scenario is that this file is loaded into multiple tables
by multiple DataStage jobs.

I need to split this file into separate files, but I don't know the metadata.

Please help me to solve this issue.

Thanks in advance.

Re: Large file to split

Posted: Fri Dec 15, 2006 9:48 am
by DSguru2B
vsi wrote: I need to split this file into separate files, but I don't know the metadata.

Please help me to solve this issue.
Remember, DataStage is metadata driven. Even file split logic will be metadata driven. If you have no idea about the metadata, how are you going to split the file? :roll:

Re: Large file to split

Posted: Fri Dec 15, 2006 10:02 am
by vsi
Thanks for your response, DSguru.

I am sorry, I have loaded the metadata now,

and the job design is as follows:

seq.file -----> Transformer ------> multiple sequential files.

In the Transformer I used these constraints:

@INROWNUM<10000 ----- FIRSTFILE
@INROWNUM>10001AND@INROWNUM

I am using a condition like this for each output sequential file,

but it is not working.

Is there any other method to split the file into multiple files?

Thanks in advance.

Posted: Fri Dec 15, 2006 10:12 am
by DSguru2B
If you're on the server engine, then use a Link Partitioner to achieve this.
If you're on PX, then your condition should work.
For the first link have @INROWNUM < 10000,
for the second have @INROWNUM >= 10000 and @INROWNUM < 20000,
and so on....
What error are you getting?
You can also split the file at the OS level by using the split command.
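
To make the row-range idea concrete, here is a minimal Python sketch of splitting a file into pieces of a fixed number of rows, which is roughly what a line-count split at the OS level does. The function names, the output prefix, and the zero-padded numbering are made up for illustration; this is not what DataStage or the split command does internally.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_rows(rows: Iterable[str], rows_per_file: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most rows_per_file rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, rows_per_file))
        if not chunk:
            return
        yield chunk


def split_file(path: str, rows_per_file: int, prefix: str) -> List[str]:
    """Write each group of rows to its own file and return the file names.

    Only one group is held in memory at a time, so the size of the
    source file does not matter much here.
    """
    names = []
    with open(path, "r", newline="") as src:
        for i, chunk in enumerate(chunk_rows(src, rows_per_file)):
            name = f"{prefix}{i:04d}"  # hypothetical naming scheme
            with open(name, "w", newline="") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

Something like split_file("bigfile", 10000, "part_") would then write part_0000, part_0001 and so on. Note that this cuts purely by row count, so it only makes sense if every row can stand on its own.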

Posted: Fri Dec 15, 2006 10:39 am
by narasimha
vsi,

What environment are you working in?
In your post you have the Job Type as Server, but you are posting in the Parallel forum.
If the two are consistent, you will get a more specific answer.

Posted: Fri Dec 15, 2006 11:14 am
by thebird
Remember that if it is a Parallel job, then @OUTROWNUM and @INROWNUM are evaluated separately on each of the nodes for the job run. That means if you give a constraint @INROWNUM<=1000 and the job is run on a 4-node config, the Transformer sends out 1000 rows from each of the nodes, giving you 4000 rows in your file.
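
The effect is easy to reproduce outside DataStage. The following Python sketch only simulates it: it assumes the rows are dealt round-robin to the nodes and that each node keeps its own row counter starting at 1. The function names are hypothetical.

```python
from typing import List


def round_robin(rows: List[int], nodes: int) -> List[List[int]]:
    """Deal rows to nodes in turn (an assumed partitioning method)."""
    return [rows[p::nodes] for p in range(nodes)]


def rows_passing(rows: List[int], nodes: int, limit: int) -> int:
    """Count rows that pass a per-node 'row number <= limit' constraint."""
    passed = 0
    for part in round_robin(rows, nodes):
        # Each node numbers its own rows from 1, like a per-node @INROWNUM.
        for inrownum, _row in enumerate(part, start=1):
            if inrownum <= limit:
                passed += 1
    return passed


print(rows_passing(list(range(100000)), nodes=4, limit=1000))  # 4000
print(rows_passing(list(range(100000)), nodes=1, limit=1000))  # 1000
```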

Posted: Fri Dec 15, 2006 11:47 am
by DSguru2B
True. Use this amazing post by vmcburney to handle partition numbers and number of partitions. Constrain it accordingly.
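
The linked post is not reproduced here, so the following is only a sketch of one commonly used way to build a job-wide row number from the per-node counter. It assumes a zero-based partition number and a partition count are available (in a parallel Transformer these would correspond to system variables such as @PARTITIONNUM and @NUMPARTITIONS; please check your version's documentation), and that rows are distributed round-robin. With any other partitioning the numbers are still unique but not contiguous.

```python
def global_rownum(inrownum: int, partitionnum: int, numpartitions: int) -> int:
    """Combine a per-node row counter with the node's position.

    inrownum is 1-based within the node, partitionnum is 0-based.
    """
    return (inrownum - 1) * numpartitions + partitionnum + 1


def rows_passing_global(total_rows: int, nodes: int, limit: int) -> int:
    """Count rows passing 'global row number <= limit' across all nodes."""
    passed = 0
    for p in range(nodes):
        # Number of rows node p receives under round-robin distribution.
        part_size = len(range(p, total_rows, nodes))
        for inrownum in range(1, part_size + 1):
            if global_rownum(inrownum, p, nodes) <= limit:
                passed += 1
    return passed


print(rows_passing_global(100000, nodes=4, limit=1000))  # 1000
```

With a constraint written against this derived number instead of the raw per-node counter, the first output gets 1000 rows in total rather than 1000 per node.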

Posted: Fri Dec 15, 2006 12:46 pm
by vsi
Thanks for your responses, folks.

Environment: parallel
Version: 7.5.2
Operating system: Linux

Regarding the split command: the file has header and detail records,
for example:
customer id, customer group,
and with reference to this information: insurance, tax, address, zipcode, and so on.

The file is fixed width.

Even though it is a parallel job, the environment is not fully configured yet; I mean the APT configuration.
Please share your valuable ideas to resolve the issue.

Thanks in advance.

Can you give me more information:

Posted: Fri Dec 15, 2006 2:34 pm
by jgreve
Can you give me more information:

Why does splitting this file make your life better? What is the problem if you just leave it alone? (I am sincere about asking this - the reason you want to split the file influences the way you need to split it.)

Run these commands on your file and paste the output here:

Code:

ls -l file
wc file
What is the record format?
or... what are the record formats?
Do the detail records you mentioned
follow this kind of pattern?

Code:

HDR:vsi
DTL:tax#1
DTL:tax#2
DTL:tax#3
HDR:jgreve
DTL:tax#1
DTL:tax#2
HDR:wurlod
DTL:tax#1
DTL:tax#2
DTL:tax#3
DTL:tax#4
or like this:

Code:

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2
:vsi:tax#1 :tax#2 :tax#3:::::
:jgreve :tax#1 :tax#2::::::
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::
I can't tell you without knowing what your data looks like.
Ok... actually I could tell you to do something, but it is likely to cause harm instead of helping you.

Post some example records, if you can - change everybody's name
and tax id if you have to.

John G.

Re: Can you give me more information:

Posted: Fri Dec 15, 2006 3:15 pm
by vsi
Thanks for your response.

The reasons for splitting the file are:

1. The file is too large: 120 GB.

2. When I run the parallel jobs with large volumes like this, I get HEAP ALLOCATION ERRORS.

3. The node configuration for our parallel environment is not done yet (they are still working on it).

4. The same file is the source for 11 ETL jobs.

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2
:vsi:tax#1 :tax#2 :tax#3:::::
:jgreve :tax#1 :tax#2::::::
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::

I will provide any further details if you need them.

Thanks in advance.

Posted: Fri Dec 15, 2006 3:22 pm
by thebird
What is the problem? Can you be specific?

Were you not able to split the large file using a counter in the Transformer (as was mentioned in Vincent's post linked above by DSguru)?

Posted: Fri Dec 15, 2006 3:38 pm
by narasimha
Another clarification is needed.

Does only the first line of your big file have the column names, like this?

Code:

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:vsi:tax#1 :tax#2 :tax#3::::: 
:jgreve :tax#1 :tax#2:::::: 
:wurlod :tax#1 :tax#2 :tax#3 :tax#4:::: 
If this is the case, then as DSguru2B mentioned, you can use the split command.

Or

do the column names repeat at regular/irregular intervals, like below?

Code:

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:vsi:tax#1 :tax#2 :tax#3::::: 
:jgreve :tax#1 :tax#2:::::: 
:wurlod :tax#1 :tax#2 :tax#3 :tax#4:::: 
:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:abc:tax#1 :tax#2 :tax#3::::: 
:def :tax#1 :tax#2:::::: 
:ghi :tax#1 :tax#2 :tax#3 :tax#4:::: 
Then we need a different approach.
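
One quick way to find out which of the two cases applies is to scan the file once and note every line that looks like the first line. The Python sketch below does that; it assumes the column-name line is exactly the first line of the file (ignoring trailing whitespace), and the function name is made up.

```python
from typing import Iterable, List


def header_positions(lines: Iterable[str]) -> List[int]:
    """Return the 1-based line numbers whose text equals the first line."""
    positions = []
    header = None
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip()  # drop the newline and trailing blanks
        if header is None:
            header = text  # first line is taken as the header
        if text == header:
            positions.append(lineno)
    return positions
```

Because it reads line by line, it can be run directly on an open file handle, for example header_positions(open("bigfile")). A result of [1] means the column names appear only once; several entries mean they repeat, and the gaps between the entries show whether the interval is regular.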

Posted: Fri Dec 15, 2006 4:03 pm
by ray.wurlod
Why am I paying more taxes than the others? :( :cry: :x

Posted: Fri Dec 15, 2006 4:04 pm
by narasimha
You get more, you pay more :wink:

Posted: Mon Dec 18, 2006 5:21 pm
by jgreve
ray.wurlod wrote:Why am I paying more taxes than the others? :( :cry: :x
Well, this is just a toy example.
I didn't want to put all the
tax-fields in there, or we'd be looking
at something like this to support
your client list, yes? :wink:

Code:

TAX_1:TAX_2: ... :TAX_999:TAX_1000
I'm just ungruntled 'cuz I'm untangling my taxes for the year; god I hate paperwork.
John G.