How to split Array in datastage?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pkll
Participant
Posts: 73
Joined: Thu Oct 25, 2012 9:45 pm

How to split Array in datastage?

Post by pkll »

Hi,
I Have Source file Like,

I/P
-----
custname,custno
KIRAN,[323~96.949900~7.210000][384~120.000000~0.000000][443~0.000000~0.000000][463~0.000000~0.000000][763~0.000000~0.000000][783~0.000000~0.000000][704~0.000000~0.000000]
I have more than 100 array elements.
I want output like o/p
----
custname,custno,amount,balance
KIRAN,323,96.949900,7.210000
KIRAN,384,120.000000,0.000000
---
---
---
like this

Could you please help me split this array?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First let me get you into your own post and in the proper forum. No idea why you decided to jump on the end of this post but I've split yours out so you have control over the fate of the topic. Done!

Let me know if you're not on an 8.x release on a UNIX server.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Column 2 in your input looks like a variable-length "vector of subrecords."
You can use the restructure stages. Let me know what your data source is.


I have done similar work using the "Promote Subrecord" stage to convert a variable-length vector of subrecords to top-level fields. In fact, that structure had level 02 and level 03 fields; all were promoted to the top level using two Promote Subrecord stages in cascade (with a Copy stage in between for debugging).

The main stumbling block for you may be making DataStage ***recognize*** your input column as a vector of subrecords.

--Version 8.5 and above only:
Another option is Transformer stage looping, available in version 8.5 and above. You would need to know the number of subrecords in each input record and use that in the next Transformer's loop variables (put a Copy stage between the two Transformers for debugging).
Last edited by rameshrr3 on Fri Mar 08, 2013 11:39 am, edited 2 times in total.
sudha03_vpr
Participant
Posts: 34
Joined: Tue Feb 26, 2013 9:36 am
Location: Chicago

Post by sudha03_vpr »

Alternatively, you can use the Complex Flat File stage, which supports multiple record types. Which system generates this file?
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

To read it as a vector of subrecords from a file, you MUST do away with the enclosing brackets '[' and ']', and add another column to your schema defined as the link field for the variable-length vector. It will hold the count of opening (or closing) brackets, which is effectively the number of elements in the vector; tag the link field as such.
If the source cannot change its format, use a Column Import stage to read the entire second column as long text and then do the rest in a Transformer.
For the vector column, define the link field with whatever name you gave it. In the next Transformer, replace the closing brackets ']' with a delimiter ('~') so that every field of your subrecord has a trailing delimiter. Define your subrecord columns (subcol1, subcol2, subcol3) at level 02, and declare the vector column as UNKNOWN datatype with the Variable checkbox set.
Use this as input to the Promote Subrecord stage and list the vector column as the subrecord data column. The output link should include Name, Subcol1, Subcol2 and Subcol3, with no level set. It should work fine.
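A rough Python sketch may make clearer what the bracket-stripping plus Promote Subrecord steps accomplish on one record. This is illustration only, not DataStage code; the column names (custname, custno, amount, balance) come from the sample in this thread, and the helper name is made up:

```python
def split_array_column(custname, packed):
    """Split '[a~b~c][d~e~f]...' into one (custname, custno, amount, balance) row per element."""
    rows = []
    # Dropping the enclosing brackets (as the bracket-removal step does)
    # leaves '][' as the boundary between subrecord elements.
    for element in packed.strip("[]").split("]["):
        custno, amount, balance = element.split("~")  # '~' is the subrecord delimiter
        rows.append((custname, custno, amount, balance))
    return rows

rows = split_array_column(
    "KIRAN",
    "[323~96.949900~7.210000][384~120.000000~0.000000]",
)
# rows[0] == ("KIRAN", "323", "96.949900", "7.210000")
```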
prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

Hi,

If you are on version 8.5 or above, it can be done easily.

Your job flow will be like this:

Seqfile ----> TSFM1 ----> TSFM2 ----> Output

Logic in TSFM1:

StageVar = Dcount(DSLink2.custno, "][")
Loop Condition: @ITERATION <= StageVar

Output Derivation:

Code: Select all

Custname = DSLink2.Custname
Custno = If @ITERATION = 1 Then Right(Field(DSLink2.custno, "][", @ITERATION), Len(Field(DSLink2.custno, "][", @ITERATION)) - 1) Else If @ITERATION = StageVar Then Left(Field(DSLink2.custno, "][", @ITERATION), Len(Field(DSLink2.custno, "][", @ITERATION)) - 1) Else Field(DSLink2.custno, "][", @ITERATION)

The first element carries a leading '[' and the last a trailing ']', so those two iterations trim one character.
Logic in TSFM2:

Svar = Dcount(DSLink27.CUSTNO, "~")


Loop Condition:

@ITERATION <= Svar


Derivation:

CUSTNAME = DSLink27.CUSTNAME
Custno = StringToDecimal(Field(DSLink27.CUSTNO, "~", @ITERATION))
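For reference, the two-Transformer loop can be modelled outside DataStage. The Python below is illustration only; the function and column names echo this thread, and TSFM2 is shown producing the three output columns (custno, amount, balance) that the expected output lists:

```python
def tsfm1(custname, packed):
    """One output row per ']['-delimited element, mirroring the @ITERATION loop."""
    elements = packed.split("][")          # StageVar = Dcount(custno, "][") elements
    n = len(elements)
    out = []
    for i in range(n):                     # @ITERATION = 1 .. StageVar
        e = elements[i]
        if i == 0:
            e = e.lstrip("[")              # first element: drop the leading '['
        if i == n - 1:
            e = e.rstrip("]")              # last element: drop the trailing ']'
        out.append((custname, e))
    return out

def tsfm2(custname, element):
    """Split one 'custno~amount~balance' element into the final columns."""
    custno, amount, balance = element.split("~")
    return (custname, float(custno), float(amount), float(balance))

rows = [tsfm2(name, e)
        for name, e in tsfm1("KIRAN", "[323~96.949900~7.210000][384~120.000000~0.000000]")]
# rows == [("KIRAN", 323.0, 96.9499, 7.21), ("KIRAN", 384.0, 120.0, 0.0)]
```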
pkll
Participant
Posts: 73
Joined: Thu Oct 25, 2012 9:45 pm

Post by pkll »

Hi Prasson,

Thanks for your reply. It is working fine.
However, from Transformer to Transformer stage I am getting 31,000 records per second, while from Sequential File to Transformer I am getting only 1,500 records per second. I need to reach at least 10,000 records per second.

I have used a 4-node configuration and tried different partitioning techniques. I set the $APT_MONITOR_TIME (value=5) and $APT_MONITOR_SIZE (value=100000) environment variables, increased the array size in the target Oracle Connector stage, and in the source Sequential File stage set "number of readers per node" = 4.

Could you please help me get more performance (10,000 records per second) from the source Sequential File to the Transformer stage?

Regards,
pkll..
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Start by ceasing to believe that rows/second is in any way meaningful. It includes all wait time (while the stages are starting up), it is rounded to the nearest sampling interval, and unless row sizes are identical you're not comparing apples with apples anyway.

What's so magic about the 10000 figure?

Use DataStage Director's Monitor view to determine how equitably your records are split over nodes. What processing takes place within the Transformer stages that needs two of them? Can you combine the logic into a single Transformer stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

TSFM1 in the logic you use needs read-ahead buffering; the output link is much faster because you are doing row splits and increasing the number of rows. If performance is an issue, you can still go with the Promote Subrecord stage - trust me, it works. Use a tr -d command to remove the opening and closing brackets '[' and ']' before you read the file, and define the subrecord delimiter as '~'. Try running your job with $APT_DISABLE_COMBINATION = TRUE and see if your rows/sec changes.