Reading/Parsing a weblog file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Reading/Parsing a weblog file

Post by I_Server_Whale »

Hi All,

Need some help in reading/parsing a weblog file. The weblog file is a flatfile.

Example Records:

Code: Select all

^69.167.3.6^[09/Sep/2006:00:00:00 -0700]^GET /includes/css.jhtml?SectionID=null&id=null HTTP/1.1^200^21^0^80^http://www.xyz.com/help/stores/index.jhtml^Mozilla/4.0^profile=8155368725; flashCookie=6; SephSessID=69.167.3.6.29756115778517715; JSESSIONID=VGS0YTLX0FOEFLAUCLABXCQ; disable=disable; cmRS=t3=1157785249641&pi=Home^SephSessID=69.167.3.6.29756115778517715^

^129.123.104.5^[09/Sep/2006:00:00:00-0700]^GET/assets/dyn/category/C10610/C10610_hero.jpgHTTP/1.0^200^3572^0^80^http://www.xyz.com/browse/brand_hierarchy.jhtml?brandId=3866^Mozilla/4.0^language=EnglishUS; profile=8355672542; flashCookie=6; SephSessID=129.123.104.5.284841157783677969; JSESSIONID=BEE1UW1N1QOBNLAUCJBRXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=65^-^

^129.123.104.5^[09/Sep/2006:00:00:00 -0700]^GET /assets/dyn/category/C10610/C10610_hdr.jpg HTTP/1.0^200^1812^0^80^http://www.xyz.com/browse/brand_hierarchy.jhtml?brandId=3866^Mozilla/4.0^language=EnglishUS; profile=8355672542; flashCookie=6; SephSessID=129.123.104.5.284841157783677969; JSESSIONID=BEE1UW1N1QOBNLAUCJBRXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=65^-^

^65.101.153.101^[09/Sep/2006:00:00:00 -0700]^GET /browse/brand_hierarchy.jhtml?brandId=5843 HTTP/1.1^200^71387^0^80^http://www.xyz.com/;jsessionid=OD0DKHD4REI3HLAUCJABXCQ^Mozilla/5.0^language=EnglishUS; flashCookie=6;  profile=8369491739; SephSessID=65.102.160.146.102161157782878917; JSESSIONID=OD0DKHD4REI3HLAUCJABXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=33^SephSessID=65.102.160.146.102161157782878917^
For clarity, I have separated each incoming record with a new line. The delimiter in the above example file is '^'. And the corresponding metadata is as below:

IP_ADDRESS
HIT_DATE_TIME
URL_REQUEST
BYTES_SENT
TIME_TAKEN
COUNT
PORT_NO
REFER_URL
BROWSER
COOKIE_IN
COOKIE_OUT


Focussing on the field 'COOKIE_IN', the COOKIE_IN values for the above four records are:

Code: Select all

profile=8155368725; flashCookie=6; SephSessID=69.167.3.6.29756115778517715; JSESSIONID=VGS0YTLX0FOEFLAUCLABXCQ; disable=disable; cmRS=t3=1157785249641&pi=Home


language=EnglishUS; profile=8355672542; flashCookie=6; SephSessID=129.123.104.5.284841157783677969; JSESSIONID=BEE1UW1N1QOBNLAUCJBRXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=65


language=EnglishUS; profile=8355672542; flashCookie=6; SephSessID=129.123.104.5.284841157783677969; JSESSIONID=BEE1UW1N1QOBNLAUCJBRXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=65


language=EnglishUS; flashCookie=6;  profile=8369491739; SephSessID=65.102.160.146.102161157782878917; JSESSIONID=OD0DKHD4REI3HLAUCJABXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=33;
For each incoming record, I need to extract the 'profile' number in the above COOKIE_IN field which is of length 10.

Result:

Code: Select all

8155368725
8355672542
8355672542
8369491739
The problem here is that 'profile' number sits/positions itself at different places in the column 'COOKIE_IN'. For example, in the first record, profile is at the beginning. But for the second and third, it is at a different place.

If the profile is stationary at one place I can use the 'Field()' function to retrieve the required. But how do I do it if the position is varying. Can I uses MATCHES?

Any idea/solution/suggestion would be greatly appreciated.

Many Thanks,
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Have you installed the Click Pack for DataStage? From what I recall, it is free and installs several stages specifically geared to handling things like web logs. And it brings Perl integration into DataStage. Something to check out.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jdmiceli
Premium Member
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Hi Gents!

Doing this in Perl would not be too tough. :? Notice that I say that without actually having more than what I see here. It would probably take any hour or two to come up with something that could parse that formatted file into the structure you want. Would you want to drop it into a sequential file format to process through DataStage?

Pretty simple actually. Feel free to message me if you would like to take a look at this.

Bestest,
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If the string is "language=EnglishUS; profile=8355672542; flashCookie=6; SephSessID=129.123.104.5.284841157783677969; JSESSIONID=BEE1UW1N1QOBNLAUCJBRXCQ; disable=disable; ForeseeLoyalty_MID_kINYBhFpco=65 ", then couldn't you just use

Code: Select all

In.String[INDEX(In.String,"profile=",1)+8,10]
Post Reply