CFF Stage - Importing REDEFINES within the same record

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

munch9
Premium Member
Posts: 34
Joined: Fri Sep 15, 2006 7:26 am

Post by munch9 »

vivekgadwal,

I do not have a V8 installation to test this on but I suspect this is the same as on V7.5.3.

The problem you have is due to the different implementations of zoned decimal in ASCII and EBCDIC.

The values you are seeing in your file are { through I for positive 0-9 and } through R for negative 0-9.

This is because the source has overpunched the first nibble of the last byte with the hex value 'C' for positive and 'D' for negative. I assume this has come from a mainframe system (or at least something using EBCDIC)

To read these using a CFF you need to get the source file in a binary form rather than plain text.

As soon as you specify 'Character-set: ASCII & Data format: Text' as mentioned above, DataStage then assumes that your file is using the ASCII zoned decimal implementation.

In this case the first nibble of the last byte is overpunched with '3' for positive values and '7' for negative values.

The numbers that you say were read correctly:
There is another field EARNED_PREMIUM (which is part of the Occurs) whose definition is PIC S9(6)V9(2) and it has data like "0000407P".

When I do a view data on this, the tool is unable to decode the WRITTEN_PREMIUM field data, but it is giving me signed data for EARNED_PREMIUM field (like "-000040.70").
are in fact NOT being translated correctly.

'P' in EBCDIC = x'D7' so in zoned decimal is -7
'p' in ASCII = x'70' so in zoned decimal is -0

so in fact the value 0000407P should become -000040.77

(For some reason DataStage seems to ignore the case of the overpunched character when decoding ASCII zoned decimal, as upper case 'P' is actually x'50' but still displays as -0.)

The best option, if possible, would be to get this source file supplied in binary rather than translated to ASCII (hopefully just a change of FTP mode); otherwise the lookup suggested may be the easiest way.
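
If you do end up going the lookup route, it is really just a table of those 20 characters plus a sign flip. Purely as an illustration of the idea (not DataStage syntax - the sample value and the two-decimal scale are just examples):

Code: Select all

# Overpunch mapping after EBCDIC -> ASCII character translation:
# { A..I carry a positive sign for digits 0-9, } J..R carry a negative sign.
POSITIVE = {c: d for d, c in enumerate("{ABCDEFGHI")}
NEGATIVE = {c: d for d, c in enumerate("}JKLMNOPQR")}

def decode_zoned(text, scale=2):
    """Decode e.g. '0000407P' (PIC S9(6)V9(2)) into a signed number."""
    body, last = text[:-1], text[-1]
    if last in POSITIVE:
        digits, sign = body + str(POSITIVE[last]), 1
    elif last in NEGATIVE:
        digits, sign = body + str(NEGATIVE[last]), -1
    else:                              # plain trailing digit, no overpunch
        digits, sign = body + last, 1
    return sign * int(digits) / (10 ** scale)

print(decode_zoned("0000407P"))   # -40.77, i.e. -000040.77
print(decode_zoned("0000580}"))   # -58.0,  i.e. -000058.00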

Hope that helps.
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

munch9 wrote:vivekgadwal,

I do not have a V8 installation to test this on but I suspect this is the same as on V7.5.3.

The problem you have is due to the different implementations of zoned decimal in ASCII and EBCDIC.

The values you are seeing in your file are { through I for positive 0-9 and } through R for negative 0-9.

This is because the source has overpunched the first nibble of the last byte with the hex value 'C' for positive and 'D' for negative. I assume this has come from a mainframe system (or at least something using EBCDIC)

To read these using a CFF you need to get the source file in a binary form rather than plain text.
Thank you very much munch9!

I was unaware of these separate/different implementations in ASCII and EBCDIC. This file is indeed FTP-ed from the mainframe. Maybe Craig was pointing to this fact, but I did not catch it - my apologies for that.

munch9 wrote:The numbers that you say were read correctly
There is another field EARNED_PREMIUM (which is part of the Occurs) whose definition is PIC S9(6)V9(2) and it has data like "0000407P".

When I do a view data on this, the tool is unable to decode the WRITTEN_PREMIUM field data, but it is giving me signed data for EARNED_PREMIUM field (like "-000040.70").
Are in fact NOT being translated correctly.

'P' in EBCDIC = x'D7' so in zoned decimal is -7
'p' in ASCII = x'70' so in zoned decimal is -0

so in fact the value 0000407P should become -000040.77

(For some reason DataStage seems to ignore the case of the overpunched character when decoding ASCII zoned decimal, as upper case 'P' is actually x'50' but still displays as -0.)
I will request a binary file and see if this fixes it. So, if this is FTP-ed to me in binary format, then the encoding will be EBCDIC, right? Or ASCII binary?

Do you know why the Sequential File stage is at least showing me some data, erroneous at those zoned decimals of course, whereas CFF doesn't even grant me the pleasure of seeing the right data aligned to the right field? Do I have to set up something else inside the CFF which I did not do earlier?

Again, thank you very much for your suggestions. I will test this out and will post the results here.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
munch9
Premium Member
Posts: 34
Joined: Fri Sep 15, 2006 7:26 am

Post by munch9 »

Hi,

If the file is FTP'd in binary mode then the bytes will not be converted, so it should be read in as EBCDIC/binary in the CFF stage.

As the file was previously in ASCII format, the sequential stage would have been able to read much of the data and convert, albeit wrongly, a handful of the zoned values.
Chances are the CFF stage was not working due to some issue with the column layout/record length etc. It's easy to get incorrect lengths when allowing for the sign bytes etc., or it's possible that the conversion to ASCII resulted in some non-printable/control characters.
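
If you want to sanity-check what you actually received (i.e. whether the bytes are still EBCDIC or have already been translated to ASCII), a quick byte count is usually enough. A minimal sketch, with a made-up file name:

Code: Select all

# Rough heuristic: EBCDIC digits are x'F0'-x'F9' and the EBCDIC space is x'40',
# while ASCII digits are x'30'-x'39' and the ASCII space is x'20'.
def looks_like_ebcdic(path, sample_size=4096):
    with open(path, "rb") as f:
        data = f.read(sample_size)
    ebcdic_hits = sum(1 for b in data if 0xF0 <= b <= 0xF9 or b == 0x40)
    ascii_hits = sum(1 for b in data if 0x30 <= b <= 0x39 or b == 0x20)
    return ebcdic_hits > ascii_hits

print(looks_like_ebcdic("monthly_extract.dat"))   # hypothetical file name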

Once you have the new file give it a try and see how it goes.
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

munch9 wrote:Hi,

If the file is FTP'd in binary mode then the bytes will not be converted so should be read in as EBCDIC/binary in the CFF stage.

As the file was previously in ASCII format, the sequential stage would have been able to read much of the data and convert, albeit wrongly, a handful of the zoned values.
Chances are the CFF stage was not working due to some issue with the column layout/record length etc. It's easy to get incorrect lengths when allowing for the sign bytes etc., or it's possible that the conversion to ASCII resulted in some non-printable/control characters.

Once you have the new file give it a try and see how it goes.
Thanks for the quick response. I just made the request for a binary file, but they said that the full file cannot be generated because it is a monthly data file. However, they should provide me with whatever data they have so far.

I will test it out and post it here!
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

munch9 wrote:Hi,

If the file is FTP'd in binary mode then the bytes will not be converted so should be read in as EBCDIC/binary in the CFF stage.

As the file was previously in ASCII format, the sequential stage would have been able to read much of the data and convert, albeit wrongly, a handful of the zoned values.
Chances are the CFF stage was not working due to some issue with the column layout/record length etc. It's easy to get incorrect lengths when allowing for the sign bytes etc., or it's possible that the conversion to ASCII resulted in some non-printable/control characters.

Once you have the new file give it a try and see how it goes.
munch9,

I got the file FTP-ed as binary. However, there is something that I did not mention in my previous posts that is a major hindrance to this. The file that gets FTP-ed is read by a Unix script, which splits it into different child files, and the source for this job is one of those output files :( .

Do you (or anyone else) think that it is this processing by the script that is changing the implementation from EBCDIC zoned to ASCII zoned?

Looking forward to any suggestions...
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
munch9
Premium Member
Posts: 34
Joined: Fri Sep 15, 2006 7:26 am

Post by munch9 »

Your script is not changing the implementation.

The problem is that the file is created using EBCDIC, hence the { and } characters. If it is in ASCII by the time DataStage comes to read it, either by FTP converting it or some other means, the CFF stage assumes that the zoned part is using the ASCII values.

There may be a way to get DataStage to interpret it correctly as it is, but I personally don't know how.

Looks like you are back to the original suggestion of a lookup, or possibly replacing the Unix script with a DataStage job to read and split the original file?
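
Whatever ends up doing the split, the key point is that it has to treat the file as fixed-length binary records rather than text lines, so the bytes reach the CFF stage untranslated. Purely to illustrate that idea (the record length, record-type bytes and file names below are made up, not taken from your job):

Code: Select all

# Byte-preserving split of a fixed-length EBCDIC file: read binary, carve on
# record length, write binary - no character translation anywhere.
RECORD_LEN = 350                          # hypothetical record length

def split_binary(path):
    outputs = {}
    with open(path, "rb") as src:
        while True:
            rec = src.read(RECORD_LEN)
            if len(rec) < RECORD_LEN:     # stop on EOF / short trailing fragment
                break
            key = rec[0:2]                # hypothetical record-type bytes
            if key not in outputs:
                outputs[key] = open("child_%s.dat" % key.hex(), "wb")
            outputs[key].write(rec)
    for f in outputs.values():
        f.close()

split_binary("monthly_extract.dat")       # hypothetical file name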
munch9
Premium Member
Posts: 34
Joined: Fri Sep 15, 2006 7:26 am

Post by munch9 »

vivekgadwal ,

I just read back through and noticed that this was working ok in a server job. I've tried it out here and it seems fine too.

Would it be possible to add an extra step using a server job to decode these zoned values and feed this o/p into your parallel job?
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

munch9 wrote:vivekgadwal ,

I just read back through and noticed that this was working ok in a server job. I've tried it out here and it seems fine too.

Would it be possible to add an extra step using a server job to decode these zoned values and feed this o/p into your parallel job?
munch9,

This is exactly my work-around right now. However, the file is huge (more than a million rows), so, understandably, the performance of the Server job is quite slow.

The reason I left this topic open is in the hope of finding a solution. An IBM PMR has also been opened, but you might already know how slowly things work on that end. Also, I am pretty sure that many people are using zoned decimal files all the time and things are working for them. :)

My feeling is that I am not setting some property correctly inside the CFF stage, even though I have toggled almost all the available options in that stage.

I greatly appreciate your help on this.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just out of curiosity, there's nothing inherent in a Server job that makes it "slow" just because it is a Server job. For example, I've used one to parse 10 million Apache log records in less than four minutes. Is it specifically the CFF stage that is the bottleneck? How long did it actually take to run your million records through one?
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote:Just out of curiosity, there's nothing inherent in a Server job that makes it "slow" just because it is a Server job. For example, I've used one to parse 10 million Apache log records in less than four minutes. Is it specifically the CFF stage that is the bottleneck? How long did it actually take to run your million records through one?
I did not mean that it is causing a major bottleneck. It is a good solution. It is taking me around 5 minutes (give or take a few seconds) to unload into a flat file. But, if you could have that step within the PX job, would you rather go for that solution or stay with the Server job and add another job layer? I am thinking that if I could get this figured out, then I could use just one job (unload and process within the same PX job) instead of having 2 jobs (one for unload - Server job & one for processing - PX job). :)

My apologies for not making this point clear in my earlier post.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sure, I understand. Just wanted the point clarified as I have zero experience with the CFF stage so am following along so I can learn things about it... vicariously. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote:Sure, I understand. Just wanted the point clarified as I have zero experience with the CFF stage so am following along so I can learn things about it... vicariously. :wink:
Learning about this particular issue has been as much fun as it has been a pain in the neck. However, I am very glad Server has this functionality and the ability to read zoned decimals without much fuss in setting it up. People over at my workplace are a little averse to using Server jobs and I use this example to tout the strengths of the Server tool. :)
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You can always incorporate the server job functionality into a server shared container and use that in the parallel job.

Review Chapter 2 of Parallel Job Developer's Guide for restrictions - I don't believe you'll hit any of them in this case.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

ray.wurlod wrote:You can always incorporate the server job functionality into a server shared container and use that in the parallel job.

Review Chapter 2 of Parallel Job Developer's Guide for restrictions - I don't believe you'll hit any of them in this case.
Thanks Ray.

That is a very interesting suggestion. I will read that and try to incorporate it within this job.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India

Post by parag.s.27 »

vivekgadwal wrote:
chulett wrote:Worst case you could build a lookup with the 20 values to substitute for the last character and do all that while it's still a string. The link I posted showed you that a 'P' would be a -7 as the last digit. You'll probably want to handle the signage separately and multiply the result by -1 for the negative range.

Still pretty sure the stage should be able to handle them automagically, however.
Thanks Craig. What you proposed can be done but, as you noted, it should be done only in a worst-case scenario. Right now, to beat the clock, I am reading the file in a Server job using its CFF stage, which is reading the "Zoned" values along with the signs (I see '+' and '-' signs before the numbers). The job design is as follows:

Code: Select all

CFF ---> XFM ---> Seq File (comma delimited)
I am reading this Seq File (comma delimited) in my PX job (importing the definition, which was a pain in the neck as I cannot copy and paste the stage :) ).

However, one issue is arising now...after reading all the rows from the file (there are about 819,000 rows), the CFF stage is reading one more row (which doesn't exist in the file). That extra row does not have any data...only commas.

Code: Select all

,,,,,,....,+,+,+,,,...,+,...
When I try to view this in the vi editor, it shows some weird characters in some places in the row (which are defined as Character fields...so I believe the tool is probably inserting some @FM or @VM in there).

Questions:
1) Why is it reading an additional row? I have seen this file in a hex editor and all the rows have '0D' and '0A' as delimiters. However, even if I strip off these '0D's and run it, it is still reading this extra line.
2) If I check the option "Omit last new line" in the Server Seq File stage, the PX job is throwing an error saying that it is expecting a new line (as EOR) and it doesn't have one.

Please suggest...

Also, if anyone else has worked on PX CFF stage to get the Zoned decimal values, please help me in configuring the stage.

***UPDATE***
In the PX Sequential File stage, if I change the data type for the fields with zoned decimal to "DISPLAY_NUMERIC" (previously it was DECIMAL...which shows as COMP-3) and read the fixed-width file directly (not as mentioned above), I can see some rows being read and some are still being displayed as 000000.

Upon further research into the data, values like "0000274R" are being correctly read as "-000027.42" while values like "0000580}" are still "00000000"!! Is this because of the "{ or }" characters in the data? If so, shouldn't the stage read all the "Zoned Decimal" characters, which include "{" and "}"?? :shock:
Try one option; it solved a similar problem in CFF which I've already posted about in another post.

In the record option tab, select the encoding type as BIG-ENDIAN instead of Native Endian, and it should work. Also, you can remove the group column when making the final selection of columns.
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0