Large file to split

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vsi
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Large file to split

Post by vsi »

Hi folks,

I have a 120 GB source file that I need to split into several files so that I can work with it. The scenario is that this file is loaded into multiple tables by multiple DataStage jobs.

The problem is that, in order to split this file into separate files, I don't know the metadata.

Please help me solve this issue.

Thanks in advance.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Re: Large file to split

Post by DSguru2B »

vsi wrote: The problem is that, in order to split this file into separate files, I don't know the metadata.

Please help me solve this issue.
Remember, DataStage is metadata driven. Even the file-split logic will be metadata driven. If you have no idea about the metadata, how are you going to split the file? :roll:
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
vsi
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Re: Large file to split

Post by vsi »

Thanks for your response, DSguru.

I am sorry - I have loaded the metadata, and the job design is as follows:

seq. file -----> Transformer ------> multiple sequential files

In the Transformer I used these constraints:

@INROWNUM < 10000 ----- FIRSTFILE
@INROWNUM > 10001 AND @INROWNUM < 20000 ----- SECONDFILE

Like this, I am using a condition for each different sequential file, but it is not working.

Is there any other method to split the file into multiple files?

Thanks in advance.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

If you're on the server engine, then use a Link Partitioner stage to achieve this.
If you're on PX, then your condition should work:
for the first link have @INROWNUM < 10000
for the second have @INROWNUM >= 10000 AND @INROWNUM < 20000
and so on....
What error are you getting?
You can also split the file at the OS level by using the split command.
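Something along these lines might work on Linux (just a sketch - the file name is a placeholder, and you would pick a line count or chunk size that suits your record length):

Code: Select all

# Sketch only: "bigfile.dat" and the chunk sizes are placeholders.
# -l keeps whole lines (records) together; here roughly 10 million lines per piece.
split -l 10000000 bigfile.dat chunk_

# Splitting by size is also possible, but -b can cut a record in half,
# so prefer -l (or GNU split's -C) when every record must stay on one line.
split -b 10G bigfile.dat chunk_
The resulting chunk_aa, chunk_ab, ... files can then be fed to the individual jobs.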
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

vsi,

What is the environment you are working in?
In your post you have the Job Type as Server, but you are posting in the Parallel forum.
Once both are consistent, you will get a more specific answer.
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

Remember that if it is a parallel job, then @OUTROWNUM and @INROWNUM are evaluated separately on each node of the job run. Meaning that if you give a constraint of @INROWNUM <= 1000 and the job is run on a 4-node configuration, the Transformer would send out 1000 rows from each of the nodes - giving you 4000 rows in your file.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

True. Use this amazing post by vmcburney to handle the partition number and the number of partitions, and constrain accordingly.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
vsi
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

Thanks for your responses, folks.

Environment: parallel
Version: 7.5.2
Operating system: Linux

Regarding the split command: the file has header and detail records.
For example, a header holds the customer id and customer group, and with reference to this information the detail records hold insurance, tax, address, zip code, and so on.

The file is also fixed-width.

Even though it is a parallel job, it is not fully configured - I mean the APT configuration.
Please give your valuable ideas to resolve this issue.

Thanks in advance.
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

Can you give me more information:

Post by jgreve »

Why does splitting this file make your life better? What is the problem if you just leave it alone? (I am sincere about asking this - the reason you want to split the file influences the way you need to split it.)

Run these commands on your file and paste the output here:

Code: Select all

ls -l file
wc file
What is the record format?
or... what are the record formats?
These detail records you mentioned:
do they follow this kind of pattern?

Code: Select all

HDR:vsi
DTL:tax#1
DTL:tax#2
DTL:tax#3
HDR:jgreve
DTL:tax#1
DTL:tax#2
HDR:wurlod
DTL:tax#1
DTL:tax#2
DTL:tax#3
DTL:tax#4
or like this:

Code: Select all

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2
:vsi:tax#1 :tax#2 :tax#3:::::
:jgreve :tax#1 :tax#2::::::
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::
I can't tell you without knowing what your data looks like.
Ok... actually I could tell you to do something, but it is likely to cause harm instead of helping you.

Post some example records, if you can - change everybody's name
and tax id if you have to.

John G.
vsi wrote: Thanks for your responses, folks.

Environment: parallel
Version: 7.5.2
Operating system: Linux

Regarding the split command: the file has header and detail records.
For example, a header holds the customer id and customer group, and with reference to this information the detail records hold insurance, tax, address, zip code, and so on.

The file is also fixed-width.

Even though it is a parallel job, it is not fully configured - I mean the APT configuration.
Please give your valuable ideas to resolve this issue.

Thanks in advance.
vsi
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Re: Can you give me more information:

Post by vsi »

Thanks for your response.

The reasons for splitting the file are:

1. The file is too large (120 GB).

2. When I run parallel jobs against files as large as this, I get heap allocation errors.

3. The configuration of the nodes for our parallel environment is not done yet (they are still working on it).

4. The same file is the source for 11 ETL jobs.

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2
:vsi:tax#1 :tax#2 :tax#3:::::
:jgreve :tax#1 :tax#2::::::
:wurlod :tax#1 :tax#2 :tax#3 :tax#4::::

I will provide any further details if you need them.

Thanks in advance.
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

What is the problem? Can you be specific?

Were you not able to split the large file in the Transformer using a counter (as mentioned in Vincent's post referenced above by DSguru2B)?
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Another clarification needed

Does only the first line of your big file have the column names, like this:

Code: Select all

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:vsi:tax#1 :tax#2 :tax#3::::: 
:jgreve :tax#1 :tax#2:::::: 
:wurlod :tax#1 :tax#2 :tax#3 :tax#4:::: 
If this is the case, then as mentioned by DSguru2B you can use the split command.


or

Or do the column names repeat at regular/irregular intervals, like below:

Code: Select all

:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:vsi:tax#1 :tax#2 :tax#3::::: 
:jgreve :tax#1 :tax#2:::::: 
:wurlod :tax#1 :tax#2 :tax#3 :tax#4:::: 
:NAME:TAX_1:TAX_2:TAX_3:TAX_4:TAX_5:TAX_6:TAX_7:TAX_8:ADDR_1:ADD_R2 
:abc:tax#1 :tax#2 :tax#3::::: 
:def :tax#1 :tax#2:::::: 
:ghi :tax#1 :tax#2 :tax#3 :tax#4:::: 
Then we need a different approach (see the sketch below).
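One possibility - only a rough sketch, and it assumes the file is plain text, starts with a header line, and that every repeated header can be recognized by a fixed prefix such as ":NAME:" (your real fixed-width file would need its real header marker) - is an OS-level awk pass that starts a new output file every so many header blocks, so a header always stays with its detail lines:

Code: Select all

# Sketch only: "bigfile.dat", the ":NAME:" prefix, per_file and the part_ names are placeholders.
awk -v per_file=100000 '
    /^:NAME:/ && (h++ % per_file == 0) {    # every per_file-th header opens a new chunk
        if (out != "") close(out)
        out = sprintf("part_%04d", ++p)
    }
    { print > out }                         # every line goes to the current chunk
' bigfile.dat
The per_file count and the header pattern are only examples; the point is that the cut points fall on header boundaries instead of at arbitrary line counts.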
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Why am I paying more taxes than the others? :( :cry: :x
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
narasimha
Charter Member
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

You get more, you pay more :wink:
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

Post by jgreve »

ray.wurlod wrote:Why am I paying more taxes than the others? :( :cry: :x
Well, this is just a toy example. I didn't want to put all the tax fields in there, or we'd be looking at something like this to support your client list, yes? :wink:

Code: Select all

TAX_1:TAX_2: ... :TAX_999:TAX_1000
I'm just ungruntled 'cuz I'm untangling my taxes for the year; god I hate paperwork.
John G.