Equivalent of first and last functions in sql in datastage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Equivalent of first and last functions in sql in datastage

Post by varshanswamy »

Hi,
something equivalent of the below SQL

select
first(agent),
last(nav_path),
sum(page_cnt),
sum(tot_file_size_num),
web_site_id,
event_session_id
from
sessn_event
group by
web_site_id,
event_session_id

Is there any stage I can use to convert this SQL into a DataStage PX job?
The input data arrives in the form of a sequential file.

regards,
varsha
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Grouping and summing are the preserve of the Aggregator stage (though it can be done more clumsily in a Transformer stage). Many other aggregate functions are available but First and Last are not among them.
You therefore have two options: use a Transformer stage and perform your change detection (on rows sorted by the grouping columns), summing, and detection of First and Last using stage variables; or use a server Shared Container containing a server Aggregator stage, which does support Sum, First and Last as aggregate functions. For best performance, always supply input sorted by the grouping columns; in the server Aggregator stage, specify in the Input properties that the incoming data are sorted that way.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
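The stage-variable approach described above (sorted input, change detection on the grouping columns, running sums, first/last capture) can be sketched in plain Python. This is an illustrative sketch, not DataStage code; the field names mirror the SQL in the question.

```python
from itertools import groupby

def aggregate_sorted(rows):
    """One summary row per (web_site_id, event_session_id).

    Rows must arrive pre-sorted on the grouping columns, exactly as the
    Transformer/stage-variable technique requires.
    """
    keyfunc = lambda r: (r["web_site_id"], r["event_session_id"])
    for (site, session), grp in groupby(rows, key=keyfunc):
        grp = list(grp)  # one group = one run of rows with the same key
        yield {
            "web_site_id": site,
            "event_session_id": session,
            "first_agent": grp[0]["agent"],           # FIRST(agent)
            "last_nav_path": grp[-1]["nav_path"],     # LAST(nav_path)
            "sum_page_cnt": sum(r["page_cnt"] for r in grp),
            "sum_tot_file_size_num": sum(r["tot_file_size_num"] for r in grp),
        }
```

Note that, like the Transformer technique, this only works because the input is sorted: `groupby` (and a stage-variable change test) treats each contiguous run of equal keys as one group.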
varshanswamy

Post by varshanswamy »

ray.wurlod wrote:Grouping and summing are the preserve of the Aggregator stage (though it can be done more clumsily in a Transformer stage). [quoted in full above]

Hi,

The job is presently a server job and is part of the normal job flow.
I have now created a PX job with a Transformer and a Remove Duplicates stage to obtain the first and last values.
From a performance point of view, which would be faster: the server job called as a Shared Container within a PX job, or a pure PX job using a Transformer stage?
ray.wurlod

Post by ray.wurlod »

Try both and let us know.

There are too many variables to make a blanket statement that "method X will be faster". When you publish your results, therefore, please also include the hardware configuration (including processor speed), job design information, and number and size of rows processed. Ideally you should also include any strategies followed to eliminate cache effects, such as rebooting between tests, or running tests more than once each and in different order.
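The measurement discipline described above can be sketched as a small harness: run each candidate several times, in shuffled order, and keep every per-run timing so warm-cache effects remain visible. `candidates` holds zero-argument callables standing in for launching the two job designs; the names are hypothetical.

```python
import random
import time

def benchmark(candidates, repeats=3):
    """candidates: dict of name -> zero-arg callable.

    Returns name -> list of elapsed seconds, one entry per run.
    """
    timings = {name: [] for name in candidates}
    # Interleave the runs in random order so no design always benefits
    # from a freshly warmed cache.
    order = [name for name in candidates for _ in range(repeats)]
    random.shuffle(order)
    for name in order:
        start = time.perf_counter()
        candidates[name]()
        timings[name].append(time.perf_counter() - start)
    return timings
```

Reporting all the individual timings (rather than a single average) makes it easy to spot a first-run outlier caused by cache warm-up.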
varshanswamy

Post by varshanswamy »

The job I developed using Transformer and Remove Duplicates stages behaves strangely: the aggregated values are correct when I run it over a few records, but when I run it over the bulk of the records the results are not accurate. I have not used any partitioning or parallelism in the job, and we have only a single node, so parallelism cannot be the cause. I cannot work out why I am getting ambiguous results. The stages I used are Remove Duplicates (first and last), Transformer, and Data Set. Could anyone help me understand why I am getting such strange results?
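One common cause of this symptom (correct on small samples, wrong in bulk) is worth illustrating, though it is only a guess without the job details: a keep-first/keep-last Remove Duplicates only yields the intended row when the input is sorted on the grouping key and on the ordering column within each group. Otherwise "first" simply means "whichever row arrived first". A hedged sketch with illustrative field names:

```python
def remove_duplicates_keep_first(rows, key):
    """Keep the first row seen for each key, in arrival order."""
    seen, out = set(), []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

# Two events of one session, deliberately out of timestamp order.
events = [
    {"session": "s1", "ts": "13:54:08", "agent": "later"},
    {"session": "s1", "ts": "13:53:51", "agent": "earliest"},
]
key = lambda r: r["session"]

# Unsorted input: the kept row is NOT the earliest event.
unsorted_first = remove_duplicates_keep_first(events, key)[0]

# Sorting by (session, ts) first makes "keep first" mean "earliest event".
ordered = sorted(events, key=lambda r: (r["session"], r["ts"]))
sorted_first = remove_duplicates_keep_first(ordered, key)[0]
```

On a small, already-ordered sample the two answers coincide, which would explain why the job looks correct until it is run against the full volume.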
ray.wurlod

Post by ray.wurlod »

Not unless you tell us, completely and precisely, what your job is doing in these stages.
varshanswamy

Post by varshanswamy »

Hi,

What I am trying to do is the equivalent of

select
event_session_id,
first(ref_qry_strng_text),
first(agent),
first(page_id),
first(start_dt_tm),
last(start_dt_tm),
first(timediff),
web_site_id,
first(clnt_ipa_num),
sum(page_cnt),
first(link_key),
sum(tot_file_size_num),
last(nav_path)
from
sessn_event
group by
web_site_id,
event_session_id

using a set of Transformer and Remove Duplicates stages.

My input is made up of the following fields, in a dataset.

jsessionid
ref_qry_stng_text
agent
start_dt_tm
end_dt_tm
timediff
clnt_ipa_num
link_key
PAGE_ID
WEB_SITE_ID
page_cnt
SIZE_FILE_NUM
NAV_PATH

Here is an example of what I do in the job.

Input

-----------------------------------------------------------------------------------------------------------
"0005WAXIVD2YTQFIOERSFE4AVCZB4IV0",
"https://www1.net.hsbc.com/code/pro/en_U ... stid=17494",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
"2005-09-22 15:26:37",
"0",
"-4",
"217.45.222.121",
"2",
"5236",
"38",
"0",
" 00000000000000000000.",
"5236"
--------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"-",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 13:53:51",
"2005-09-22 13:54:08",
"-4",
"172.201.177.223",
"1",
"5234",
"37",
"1",
" 00000000000000000000.",
"5234"
---------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"https://www1.net.hsbc.com/code/public/e ... stid=16250",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 13:54:08",
"2005-09-22 13:54:09",
"-4",
"172.201.177.223",
"1",
"5234",
"37",
"0",
" 00000000000000000097.",
"5234-5234"
------------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"https://www1.net.hsbc.com/code/public/e ... stid=16250",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 13:54:09",
"2005-09-22 13:54:14",
"-4",
"172.201.177.223",
"1740363",
"5266",
"30",
"1",
" 00000000000000000000.",
"5234-5234-5266"
----------------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"-",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 13:54:14",
"2005-09-22 13:54:14",
"-4",
"172.201.177.223",
"1740363",
"5267",
"30",
"0",
" 00000000000000000000.",
"5234-5234-5266-5267"
---------------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"-",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 13:54:14",
"2005-09-22 13:54:19",
"-4",
"172.201.177.223",
"1740363",
"5268",
"30",
"0",
" 00000000000000000000.",
"5234-5234-5266-5267-5268"
-----------------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"-",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 13:54:19",
"2005-09-22 14:12:52",
"-4",
"172.201.177.223",
"1740363",
"5268",
"30",
"0",
" 00000000000000000000.",
"5234-5234-5266-5267-5268-5268"
--------------------------------------------------------------------------------------------------------------
"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0",
"-",
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",
"2005-09-22 14:12:52",
"0",
"-4",
"172.201.177.223",
"1740363",
"5277",
"30",
"0",
" 00000000000000000000.",
"5234-5234-5266-5267-5268-5268-5277"
------------------------------------------------------------------------------------------------------------------
"004CRI4FMNNH1QFIWESCFF4AVBZB4IV0",
"https://www2.net.hsbc.com/code/pro/en_U ... ozilla/4.0 (compatible; MSIE 6.0; Windows 98)",
"2005-09-22 20:38:08",
"0",
"-4",
"154.20.4.123",
"2","5236",
"38",
"1",
" 00000000000000000000.",
"5236"
---------------------------------------------------------------------------------------------------------------------

So the output for the jsessionid "001SW1IDO3JTJQFIOERCFFWAVCZB4IV0" would be

"001SW1IDO3JTJQFIOERCFFWAVCZB4IV0", - group by jsessionid
"https://www1.net.hsbc.com/code/public/e ... stid=16250", -first(ref_qry_strng_text)
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; .NET CLR 1.1.4322)",- first(agent)
"5234",-first(page_id)
"2005-09-22 13:53:51", - first(start_dt_tm)
"2005-09-22 14:12:52", - last(start_dt_tm)
"-4",-first(timediff)
"172.201.177.223",-first(clnt_ipa_num)
"2",-sum(page_cnt)
"30",-group by
"1",-first(link_key)
" 00000000000000000097.",- sum(tot_file_size_num)
"5234-5234-5266-5267-5268-5268-5277" - last(nav_path)


1. I sort the input data by web_site_id, jsessionid, then start_dt_tm.
2. I then split the records according to the first/last requirement: rows needed as "first" go down one link and rows needed as "last" go down a separate link. Values that need to be summed for a particular web_site_id, jsessionid and start_dt_tm are summed in the Transformer by comparing the previous and current values. For rows needed only as "first", the other values are set to zero (numeric) or '' (varchar).
3. I send the "first" rows through a Remove Duplicates stage and keep the first value.
4. I send the "last" rows through a Remove Duplicates stage and keep the last value.
5. I combine the first and last rows using a Funnel stage.
6. Finally, since I want the values grouped by web_site_id, jsessionid and start_dt_tm, I concatenate the first value with the last value for char columns, and add the first value to the last value for numeric columns.
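The steps above can be condensed into a short Python sketch: sort once, capture the first and last row per group, sum the measure, then combine. This is an illustration of the intended logic (assuming the grouping key is web_site_id plus jsessionid), not the DataStage job itself; field names are taken from the post.

```python
def summarise(rows):
    """Emulate the six-step first/last/sum pipeline on a list of dict rows."""
    key = lambda r: (r["web_site_id"], r["jsessionid"])
    # Step 1: sort by the grouping columns, then start_dt_tm.
    rows = sorted(rows, key=lambda r: key(r) + (r["start_dt_tm"],))
    firsts, lasts, sums = {}, {}, {}
    for r in rows:
        k = key(r)
        firsts.setdefault(k, r)                    # steps 2-3: first row per group
        lasts[k] = r                               # step 4: last row per group
        sums[k] = sums.get(k, 0) + r["page_cnt"]   # summed measure
    # Steps 5-6: combine first, last and sum into one output row per group.
    return [
        {
            "web_site_id": k[0],
            "jsessionid": k[1],
            "first_start_dt_tm": firsts[k]["start_dt_tm"],
            "last_start_dt_tm": lasts[k]["start_dt_tm"],
            "last_nav_path": lasts[k]["nav_path"],
            "sum_page_cnt": sums[k],
        }
        for k in firsts
    ]
```

Because the sort key ends with start_dt_tm, "first" and "last" here mean earliest and latest event in the session, which matches the expected output shown for jsessionid "001SW1IDO3JTJQFIOERCFFWAVCZB4IV0".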