
How to find out the bytes moved by a DataStage job

Posted: Wed Mar 25, 2015 8:58 am
by AshishDevassy
Hey,
I need to see how many bytes are moved by the DataStage jobs.

I tried a few environment variables, like cc_msg_level, and some APT variables, with no luck. The job logs just show the record counts.

Is there any way?

Posted: Wed Mar 25, 2015 10:57 am
by qt_ky
Bytes moved by a job will vary wildly with each different job design and could be meaningless (e.g. Job 1 moved 1 GB from stage A to stage B, 900 MB from stage B to stage C, 900 MB from C to D, etc.).

Bytes moved across each link may be a useful measurement, but I do not know of an environment variable you can set to learn this. I would be curious to learn if there is one.

Posted: Wed Mar 25, 2015 3:36 pm
by ray.wurlod
I'm not aware of any either, other than manually calculating row sizes based on the information in Chapter 2 of the Parallel Job Developer's Guide about the storage sizes of data types. And anything you put in to calculate the same would degrade the job's performance.

But, of course, any data moved around in parallel jobs uses buffers of various kinds, and buffers of other kinds again when re-partitioning. So you would need to round up to the nearest whole buffer size. For example, writing even one row into a Data Set will move at least 32 KB.
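
For illustration, a minimal Python sketch of the kind of manual calculation Ray describes: sum an assumed per-column storage size, then round the total up to whole buffers. The type sizes and the 32 KB buffer figure below are placeholder assumptions for the sketch, not values taken from the Parallel Job Developer's Guide.

    import math

    # Assumed storage sizes in bytes per data type (placeholders only --
    # check Chapter 2 of the Parallel Job Developer's Guide for the real values).
    TYPE_SIZES = {
        "int32": 4,
        "int64": 8,
        "sfloat": 4,
        "dfloat": 8,
        "date": 4,
        "timestamp": 9,
    }

    BUFFER_SIZE = 32 * 1024  # assumed minimum block moved, per Ray's 32 KB example

    def row_size(columns):
        """Estimate one row's size from (name, type, length) column definitions.
        Strings are costed at their declared length (worst case)."""
        total = 0
        for name, dtype, length in columns:
            if dtype in ("char", "varchar"):
                total += length
            else:
                total += TYPE_SIZES[dtype]
        return total

    def bytes_moved(columns, rows):
        """Round the raw row volume up to whole buffers."""
        raw = row_size(columns) * rows
        return math.ceil(raw / BUFFER_SIZE) * BUFFER_SIZE

    cols = [("id", "int32", 0), ("name", "varchar", 40), ("loaded", "timestamp", 0)]
    print(row_size(cols))         # estimated bytes per row
    print(bytes_moved(cols, 1))   # even one row costs a full 32 KB buffer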

Posted: Wed Mar 25, 2015 5:04 pm
by eostic
It would be very interesting to know "why"?

What is the requirement for having such detail at the "byte" level? And then further, as Ray notes, is the question about "real" bytes, accurately measured uniquely for every variable-length record? The internal memory representation? With buffers and whatever else internal needs to be applied? At the link level? Aggregated across all links and stages?

Lots of possibilities.

Ernie

Posted: Thu Mar 26, 2015 7:04 am
by chulett
Perhaps because Informatica does? :wink:

I've never found it all that useful, honestly, but it is shown in their monitor. Would also be curious as to the why of this.

Posted: Thu Mar 26, 2015 7:48 am
by AshishDevassy
Yeah, what Ray said was more or less a last resort.
The reason behind this exercise is to see the amount of data that we are moving using ETL. Management request :)
And it had me curious too.

I generated a performance report for a job, and that had a chart showing the bytes used by the job (doesn't look like the same thing).

I was thinking of going about it using the database, i.e. identify the records loaded last night and multiply the number of records by the bytes that each record takes.
Long and winding.

Posted: Thu Mar 26, 2015 8:52 am
by chulett
AshishDevassy wrote: I was thinking of going about it using the database, i.e. identify the records loaded last night and multiply the number of records by the bytes that each record takes.
Which number of bytes - the full size according to the metadata? The average record length derived from the data itself? Or the sum of the true size of each record based on the data in each row? Then there's the potential difference between the data size in flight versus in storage.

Sounds like fun regardless...
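
To make those three options concrete, here is a small Python sketch contrasting them on a couple of made-up rows: the full metadata size, the average record length derived from the data, and the exact per-row sum. The column widths and values below are invented for the example.

    # Three ways to count "bytes per record" for the same data (invented sample).
    # columns: (name, declared max bytes); rows: the actual values as stored.
    columns = [("cust_id", 10), ("cust_name", 40), ("city", 30)]
    rows = [
        ("1001", "Smith", "Austin"),
        ("1002", "Gopalakrishnan", "San Francisco"),
    ]

    metadata_size = sum(width for _, width in columns)               # full size per the metadata
    true_sizes = [sum(len(value) for value in row) for row in rows]  # true size of each row
    avg_record_len = sum(true_sizes) / len(rows)                     # average derived from the data

    print(metadata_size * len(rows))          # every field at its declared width
    print(round(avg_record_len) * len(rows))  # average record length times row count
    print(sum(true_sizes))                    # exact sum of the stored data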

Posted: Thu Mar 26, 2015 9:24 am
by AshishDevassy
For example:
Table A - bytes per record: 100
Number of records loaded last night: 500
Bytes loaded: 50,000

That's the best that I can think of ...

Posted: Thu Mar 26, 2015 9:29 am
by chulett
Sure... still just wondering where that 'bytes per record' value came from.

Posted: Thu Mar 26, 2015 10:37 am
by qt_ky
An alternative outside of DataStage would be to request a network person to monitor inbound and outbound network traffic with your DataStage server over an average 24 hour period and send you a report with the number of bytes for each. There are probably operating system commands that would let you do the same thing yourself too.
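
On a Linux server, one self-service version of this is to snapshot the kernel's per-interface byte counters at the start and end of the window and take the difference. A rough Python sketch, assuming /proc/net/dev is available and the interface name is known; note that this counts all traffic on the interface, not just job data.

    def iface_bytes(interface="eth0"):
        """Return (rx_bytes, tx_bytes) for one interface from /proc/net/dev (Linux)."""
        with open("/proc/net/dev") as f:
            for line in f:
                if ":" not in line:
                    continue                      # skip the two header lines
                name, stats = line.split(":", 1)
                if name.strip() == interface:
                    fields = stats.split()
                    return int(fields[0]), int(fields[8])   # rx bytes, tx bytes
        raise ValueError("interface not found: " + interface)

    # Capture once at the start of the 24 hour window and once at the end, then subtract.
    rx_start, tx_start = iface_bytes("eth0")
    # ... 24 hours later ...
    rx_end, tx_end = iface_bytes("eth0")
    print("inbound bytes:", rx_end - rx_start, "outbound bytes:", tx_end - tx_start)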

Posted: Sat Mar 28, 2015 8:52 am
by kduke
Network traffic is not all data. It includes security and SQL statements passed back and forth. Bytes written is the amount of work being done, so it is a better predictor of performance than rows per second.

EtlStats includes a routine which tries to estimate the average row length. Most databases will calculate this when you run statistics and store it in the information schema. Once you get this length, you still need to tie it to the link in the job which writes to that table. Getting the table name, job name and link name into a table, then the average row length into a table, then joining all of that with row counts will give you the number. Not a simple task. How much benefit will you get from knowing these numbers? Will it pay for the effort of tying row length to job and link name?

The EtlStats routine uses an estimate for VARCHAR fields and other fields where the length is either a maximum length or packed, like numbers. For instance, DECIMAL(18,2) does not take up 18 bytes; it is packed based on how the database works. VARCHAR(40) is based on average length, so in the routine you could hard code a percentage so that all varchars average, say, 50% or 70% of the maximum length. So at best it was an estimate. The average itself is only an estimate of how long a row is, but it is probably a better estimate than anything we could guess, so getting these numbers out of the information schema is probably best.

Oracle and DB2, I know, have an average row length; how each one calculates it, you would need to look up. So I think it is possible. Let us know if you do it and how you did it - it would be nice to see whether it was worth it or what you learned. Bytes per second should be consistent across all similar tables in a given database, and a faster server should improve these numbers, so load times should be easier to predict, in theory.

Good luck.
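
As a sketch of the catalog-based approach, assuming an Oracle target (USER_TABLES exposes AVG_ROW_LEN once statistics have been gathered) and assuming you already have last night's per-table row counts from your job links or EtlStats; the connection string, table names and row counts below are placeholders.

    # Estimate bytes loaded per table as AVG_ROW_LEN * rows loaded last night.
    # Assumes an Oracle catalog and the cx_Oracle driver; the DSN, credentials,
    # table names and row counts are placeholders.
    import cx_Oracle

    loaded_counts = {"CUSTOMER": 500, "ORDERS": 12000}   # e.g. from link row counts or EtlStats

    conn = cx_Oracle.connect("etl_monitor", "password", "dbhost/ORCLPDB1")
    cur = conn.cursor()

    total = 0
    for table, rows_loaded in loaded_counts.items():
        cur.execute("SELECT avg_row_len FROM user_tables WHERE table_name = :t", t=table)
        row = cur.fetchone()
        avg_row_len = row[0] if row and row[0] else 0    # 0 if statistics were never gathered
        table_bytes = avg_row_len * rows_loaded
        total += table_bytes
        print(table, table_bytes)

    print("estimated bytes loaded last night:", total)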

Posted: Wed Apr 01, 2015 7:40 am
by AshishDevassy
So I talked to the DBAs, and they said they will be able to give me the table names and the bytes per record (using something called DCLGENs).
Multiplying that by the count of records loaded by ETL would give a rough estimate of the data bytes moved by ETL nightly.

Posted: Wed Apr 01, 2015 9:53 am
by Cranie
This is very possible.

We have a process that captures the DS logs, design and run information. This runs every few hours. The job is DataStage BASIC and captures 1,500 job instance runs in about 10 minutes (so it's pretty quick and not resource hungry).

We can then extrapolate details such as how much data (bytes, rows) has gone down each link, and even down to the level of how much has gone down each partition. This is very useful in identifying skewed PX jobs or jobs with high CPU and low row counts, and gives you a first glance at what could be a focus point.

Most of these features are detailed in the BASIC documentation. I can see many benefits to having this.

Some of it needs to be guesswork - data is stored and compressed in some instances, and varchar fields may not always be the full 255 (we take an average). So while not an exact process, it is a very good indicator and gives you a sense of what the environment is doing. This is great for capacity planning too.
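
The capture process itself isn't shown here, but as a rough Python sketch of the extrapolation step Cranie describes: given per-partition row counts for each link (however they are harvested from the logs or run information) and an assumed average row length per link, estimate bytes per link and flag partition skew. All link names and numbers below are made up.

    # Per-partition row counts for each link (however they were harvested from the
    # run information) plus an assumed average row length -> bytes per link and skew.
    # All names and numbers are made up.
    link_partition_rows = {
        "lnk_src_to_xfm": [250000, 248000, 251000, 1000],    # last partition is starved
        "lnk_xfm_to_tgt": [187000, 188500, 186900, 187600],
    }
    avg_row_len = {"lnk_src_to_xfm": 120, "lnk_xfm_to_tgt": 95}   # assumed bytes per row

    for link, partitions in link_partition_rows.items():
        rows = sum(partitions)
        est_bytes = rows * avg_row_len[link]
        # Skew: busiest partition relative to a perfectly even spread.
        skew = max(partitions) / (rows / len(partitions))
        flag = "  <- check partitioning" if skew > 1.5 else ""
        print("%s: %d rows, ~%d bytes, skew %.2f%s" % (link, rows, est_bytes, skew, flag))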

Posted: Wed Apr 01, 2015 1:38 pm
by AshishDevassy
@Cranie - Can you give some more details of the DS job that you use to capture those details?