
Need suggestion to extract data from HTML string

Posted: Mon Apr 24, 2017 4:04 pm
by anajitKS
I have a requirement to extract data from an HTML string. Is there a good, easy way to do this in DataStage?

Any suggestion is appreciated.

Posted: Mon Apr 24, 2017 6:57 pm
by chulett
Seems to me the first answer is "depends". Can you post an example of the HTML and what data you are trying to extract from it, please?

Posted: Tue Apr 25, 2017 8:12 am
by anajitKS
Here is one example

<div class='container-fluid custPDPBucketContainer'><div class='row'><div class='col-md-12'><div class='row'><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>ITEM NUMBER</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-md-11'><span itemprop="sku">1445804</span></div></div><div class='row custPDPBucketHeader'><div class='col-md-12'>STONE DETAILS</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Minimum Carat Total Weight:</div><div class='col-xs-4 col-md-4'>1 1/8 ctw (1.11 - 1.19)</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Type:</div><div class='col-xs-4 col-md-4'>Diamond</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Shape:</div><div class='col-xs-4 col-md-4'>Round</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Color:</div><div class='col-xs-4 col-md-4'>IJ</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Clarity:</div><div class='col-xs-4 col-md-4'>I3</div></div></div><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>METAL</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Type:</div><div class='col-xs-4 col-md-4'>Gold</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Color:</div><div class='col-xs-4 col-md-4'>Yellow</div></div></div></div></div></div></div>

From this HTML we need to extract 'ITEM NUMBER', 1445804, 'STONE DETAILS',
'Minimum Carat Total Weight:', '1 1/8 ctw (1.11 - 1.19)', 'Stone Type:', and so on.

Posted: Tue Apr 25, 2017 9:56 am
by chulett
The first technical term that comes to mind is... yuck. :?

I don't see a good way to "manually" do this but perhaps others may have some suggestions. I would imagine you may need to leverage one of the many "HTML Parsers" out there or perhaps write something in C++ or Java. [shrug]
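As a rough illustration of the "HTML parser" route, here is a minimal sketch in Python using only the standard library's html.parser module. It is not a DataStage solution; the names TextCollector and extract_texts are made up for this example, and the sample string is a shortened piece of the HTML posted above. It simply collects every non-empty piece of text in document order.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every non-empty text node in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only chunks.
        text = data.strip()
        if text:
            self.texts.append(text)


def extract_texts(html):
    """Return the list of text fragments found in an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Shortened piece of the sample HTML from the post above.
sample = (
    "<div class='row custPDPBucketHeader'><div class='col-md-12'>ITEM NUMBER</div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'>Stone Type:</div>"
    "<div class='col-xs-4 col-md-4'>Diamond</div></div>"
)

print(extract_texts(sample))
```

The result is a flat list of headers, labels, and values; pairing each label with its value would still need logic that knows the layout of the rows.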

Posted: Tue Apr 25, 2017 10:43 am
by UCDI
If the XML format is reliably identical for each record, it can be done with simple substring logic.

For example, if you could seek
<span itemprop="sku">
to find the item number, and you can do it for all records, that would be simple.

If you can't, you have to parse the whole mess. DataStage has XML tools that can pull it apart into columns (the Hierarchical stage and the XML stages), if you have access to those and want to try setting that up. If not, Java, VB stages, or a C routine are all options.

I always attack XML with string processing first. If I can do what I need with dumb string matching, that is great. If not, I have to apply another method, and which one depends on how annoying the XML format is. The format does not have to be totally fixed for string processing to work. The tags you want just need to appear in a form you can find, such as "<tag>data"; it is fine if other tags are skipped or inserted. The trouble comes when the format is <tag><optional stuff or very deep nested junk>data AND the optional stuff is too complicated to reliably locate the data after it.

Ten minutes of analysis on the XML schema and example files should tell you whether string searching is even remotely possible. If not, it's a chore.
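To make the "seek a known tag, then read up to the next tag" idea concrete, here is a minimal sketch in Python (not DataStage code). The helper name text_after and the shortened record string are assumptions made for this example; inside DataStage the same logic would have to be expressed with whatever string functions your stage offers.

```python
def text_after(html, marker, terminator="<"):
    """Hypothetical helper: return the text between `marker` and the next
    `terminator`, or None when either one cannot be found."""
    start = html.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = html.find(terminator, start)
    if end == -1:
        return None
    return html[start:end].strip()


# Shortened piece of the sample record posted earlier in the thread.
record = (
    "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div>"
    "<div class='col-xs-7 col-md-7'>Stone Type:</div>"
    "<div class='col-xs-4 col-md-4'>Diamond</div>"
)

# The item number sits right after a tag that is easy to seek.
sku = text_after(record, '<span itemprop="sku">')

# A labelled value: seek the label plus the markup that follows it.
stone = text_after(record, "Stone Type:</div><div class='col-xs-4 col-md-4'>")

print(sku, stone)
```

This only holds up while the markup around each label stays byte-for-byte the same in every record; any change in attribute order, quoting, or whitespace makes the lookup return None.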

Posted: Tue Apr 25, 2017 10:45 am
by anajitKS
chulett wrote: The first technical term that comes to mind is... yuck. :?
I had the same reaction when it came up as a requirement. I just wanted to find out if anyone has any suggestions.

Posted: Tue Apr 25, 2017 11:06 am
by chulett
Of course, and you have a couple now.

How much does it matter that it isn't really XML but rather HTML? I was wondering if you could parse it as XML, but I would think you would need to make it "well formed" beforehand.
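One cheap way to test that idea outside DataStage is to hand a fragment to a strict XML parser and see whether it objects. The sketch below uses Python's standard xml.etree.ElementTree; the function name parse_as_xml and the shortened fragment are assumptions for illustration only, and real-world HTML (unclosed tags, entities such as &nbsp;) will often fail this check.

```python
import xml.etree.ElementTree as ET

# Shortened, single-rooted piece of the sample HTML from earlier in the thread.
fragment = (
    "<div class='row'>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'>Metal Type:</div>"
    "<div class='col-xs-4 col-md-4'>Gold</div></div>"
    "</div>"
)


def parse_as_xml(text):
    """Hypothetical helper: return all non-empty text values if `text` is
    well-formed XML, or None if the XML parser rejects it."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return [t.strip() for t in root.itertext() if t.strip()]


print(parse_as_xml(fragment))

# Once the input parses, individual fields can be addressed by attribute.
sku = ET.fromstring(fragment).find(".//span[@itemprop='sku']").text
print(sku)

# Typical HTML that is not well formed (unclosed <br>) is rejected.
print(parse_as_xml("<div>text<br></div>"))
```

If the check passes for a representative set of records, the XML tooling becomes a realistic option; if it fails, the input would need a cleanup step first.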