inputs required on the design...

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Pavan_Yelugula
Premium Member
Posts: 133
Joined: Tue Nov 23, 2004 11:24 pm
Location: India

inputs required on the design...

Post by Pavan_Yelugula »

Hi All

We are working on an application that currently takes 7 hours to complete. The requirement is to bring it down to 5 hours.

The solution to be proposed should be a DataStage architecture.

The design that is in place is already a very good one, so we need to use every parallel capability of DataStage to beat it.

I need your advice on the following design. I identified the different parallel capabilities in DataStage and started thinking on my own.

What do you think of designing a job like this:

Job canvas
------------------------------------------------
ODBCTable---------> seqfile1

ODBCTable---------> seqfile2
-----------------------------------------------

The job contains 2 main streams. I know I am confusing DataStage by having multiple streams, but this kind of job is working fine.
Do you think I can go ahead with this, or is it going to snap as I go on?

It would be really helpful if you could give me some input on this.

If you can think of any out-of-the-box parallelism solutions, please share them with me.
Thanks and Regards
Pavan
andru
Participant
Posts: 21
Joined: Tue Mar 02, 2004 12:25 am
Location: Chennai

Post by andru »

The design will work fine with respect to performance, provided the tables referenced in the two streams are different. You can try using an IPC stage between the ODBC and sequential file stages; I have heard that using an IPC stage explicitly is more powerful than defining it implicitly in the project properties. Also, if the output sequential file is only going to be used as input to the next job, you can try using hash files instead of sequential files.
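For illustration, the explicit IPC placement would look something like this on the job canvas (a sketch only):

ODBCTable--------->IPC--------->SeqFile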
Pavan_Yelugula
Premium Member
Posts: 133
Joined: Tue Nov 23, 2004 11:24 pm
Location: India

Post by Pavan_Yelugula »

Hi Andru,
Thanks for the reply.

How is the design affected if I use the same table in both streams?

Will there be some kind of locking on the tables?
I am only using the same table in both places for reading data from it. I was under the impression that, since I am only reading from the table and not writing to it, the table I am using in both places will not be locked. Is my understanding wrong?

OK, if it is the same table, then:

job1
-----
ODBCTable-------->Seq1

job2
-----
ODBCTable--------->Seq2


sequence
--------
job1   job2   (side by side)


Will this get me the parallelism I am looking for?
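If I go with this two-job route, I could kick both off side by side from a small job-control routine. A minimal sketch (job names as above, error handling kept to a bare minimum):

* Start both extract jobs in parallel, then wait for each to finish.
hJob1 = DSAttachJob("job1", DSJ.ERRFATAL)
hJob2 = DSAttachJob("job2", DSJ.ERRFATAL)

* DSRunJob returns as soon as the run request is issued,
* so the two jobs execute at the same time.
ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)
ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)

* Block until both jobs have finished, then check their status.
ErrCode = DSWaitForJob(hJob1)
ErrCode = DSWaitForJob(hJob2)
If DSGetJobInfo(hJob1, DSJ.JOBSTATUS) = DSJS.RUNOK And DSGetJobInfo(hJob2, DSJ.JOBSTATUS) = DSJS.RUNOK Then
   Call DSLogInfo("Both extracts finished OK", "JobControl")
End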

Thanks and Regards
Pavan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The original design is perfectly valid and will give the same parallelism as the newer design.

In the original design, because there are no interdependencies, the separate Transformer stages - and therefore data streams - will be handled by separate processes. On a multiple CPU machine, it's highly likely that these will run on separate CPUs, but that's a decision made by the operating system, not by DataStage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Pavan_Yelugula
Premium Member
Posts: 133
Joined: Tue Nov 23, 2004 11:24 pm
Location: India

Post by Pavan_Yelugula »

Ray,
Can you also clarify the assumption I had that tables get locked when we write to them but not when we read from them?

I explained my concern in my second post in this thread.

Thanks and Regards
Pavan
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can use the native database connection (such as OCI for Oracle) to improve performance. A parallel run will work with a single tick in the job properties (to allow multiple instances), and invocation IDs give you a mechanism to keep each flow unique so the runs do not overwrite each other.

Ray may be able to advise better if you can provide information on which RDBMS you use and how you plan to break the data into multiple runs, i.e. do you intend to split on a date-of-birth field so that everything before 2000 goes in one stream and everything else goes in another? (For a sketch of that, see below.)
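For example, a multi-instance run driven from job control might look like this minimal sketch (the job name, invocation IDs, parameter names, and values are all invented for illustration; the ODBC stage would reference #WhereClause# in its user-defined SQL and the sequential file stage would use #OutFile# as its file name):

* Sketch only: two invocations of one multi-instance job, each with
* its own WHERE clause and output file so they do not overwrite each other.
hA = DSAttachJob("ExtractJob.Before2000", DSJ.ERRFATAL)
hB = DSAttachJob("ExtractJob.From2000", DSJ.ERRFATAL)

ErrCode = DSSetParam(hA, "WhereClause", "WHERE date_of_birth < '2000-01-01'")
ErrCode = DSSetParam(hB, "WhereClause", "WHERE date_of_birth >= '2000-01-01'")
ErrCode = DSSetParam(hA, "OutFile", "extract_before2000.txt")
ErrCode = DSSetParam(hB, "OutFile", "extract_from2000.txt")

* Start both invocations together, then wait for them to complete.
ErrCode = DSRunJob(hA, DSJ.RUNNORMAL)
ErrCode = DSRunJob(hB, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hA)
ErrCode = DSWaitForJob(hB)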

Also, how are you going to use the output file? Can you handle 2 separate files, or do you need one combined and ordered output file?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Most databases use an NR1W locking scheme (N readers or 1 writer). Therefore, unless another process is updating the table, you ought to have no problems whatsoever.

Some databases, such as Red Brick Warehouse, implement "query priority concurrency", so that an update can be suspended while a snapshot is taken for the queries to use.

In any case, your two streams selecting from the same table will not cause you any locking problems.

Other processes operating on the database may, but this is not something that is under DataStage's control, other than by setting a higher transaction isolation level. You can read about isolation levels in the ODBC manuals, which can be had from Microsoft Press (since ODBC is actually a Microsoft standard); most good technical bookstores should have them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Pavan_Yelugula
Premium Member
Posts: 133
Joined: Tue Nov 23, 2004 11:24 pm
Location: India

Post by Pavan_Yelugula »

Hi Ray,
That was really good; thanks for the clarification.

I also have a few small queries about my design.

I have huge tables (7 million records) in my database and also small tables (a few thousand records). I need to apply a set of business constraints before I take the data further for processing. The constraints are the same for all the tables, such as country_code checking, etc.

My query is: which is the best place to apply these business constraints?

1) I can apply the constraints in my ODBC stage, writing a SQL query that fetches only the data meeting the required constraints.
2) I can read all the data in from the table, then put in a Transformer and apply the constraints before I work any further.

Which of the two approaches should I follow? Is there a trade-off depending on the table size?

Also, please consider the fact that my database is on one server and my DataStage is on a different server. I feel it would be better to read all the data in and do the transformations in DataStage, because the database server is severely overloaded.

Any input would be really helpful.

Thanks and Regards
Pavan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Whenever possible, apply constraints as early as possible; in this case, when selecting from the source (see the sketch below). It will help if the constrained columns are indexed. By eliminating rows as early as possible you minimize the total volume of data to be processed, and therefore the total time to completion.

Always index foreign keys.

The only time this approach might not be possible is where any join columns are in disparate databases.
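As an illustration only, pushing the constraint into the source select would mean user-defined SQL in the ODBC stage along these lines (table and column names are invented, and #CountryCode# would be a job parameter):

SELECT cust_id, cust_name, country_code
FROM   customer
WHERE  country_code = '#CountryCode#'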
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.