Non-printable Control Characters

logic · Post by **logic** » Fri May 11, 2007 6:53 pm

My Requirement is
I have to extract 100 million records from a teradata table and write it in to a flat file.
This data should be written in to multiple files based on a date range and generate a control report for each file which will give the Record Count and
give the sum of all the records in certain columns.

The Control Report Should like this

Control Report
File: fileX
Date Range Between 2000-01-02 and 2001-06-30
------------------------------------------------------------------
Record Count : yyyy
Column1 : $XXXXXX

Coulum2 : $xxxxx
Column10 : $xxxxx
.
.
.
.
.
.
.
.
.
.Column 55 : $xxx

I designed a job using teradata Ent stage and Sequential file stage.

This job will be executed multiple times using a wrapper.
Each time the job is executed bases on date rane it writes the output file and Control file.
I have written a shell script which be executed after the job. The shell script will generate the Control report with Record count and
sum of the columns.I am calling an awk script in the shell.This is the awk script. File_sum_ctl.awk

BEGIN {FS = "\t"}
{
s1 += $4;
s2 += $9;
s3 += $15;
s4 += $21;
s5 += $46;
s6 += $47;

}
END {
printf "RECORD COUNT ",NR;
printf ;
printf "Col4 $%5.2f\n",s1;
printf "Col9 $%5.2f\n",s2;
printf "Col15 $%5.2f\n",s3;
printf "Col21 $%5.2f\n",s4;
printf "Col46 $%5.2f\n",s5;
printf "Col47 $%5.2f\n",s6;
;
}

Here is the Shell script
ere is the script.this executes fine.can u plz check if it is parametrised.
#! /usr/bin/ksh
#Purpose : Create a ctl file having count date
#Version : 1.0
#History/date :
#Instructions :Use sh to execute the script
# : chmod this script file appropiately to avoid permission issues.
# : There is no validation for number of inputs.

SCRIPTS=$1
SRC_SYS_ID=$2
OUTBOUND_DIR=$3
Target_File_name=$4
Script_FileName=$5

if [[ ${4} = File.txt.gz ]]
then
echo ' Control Report'
echo ' '$4''
echo -------------------------------------------
gunzip -c $3/$2_$4|awk -f $1/File_sum_ctl.awk

else
echo ' Control Report'
echo ' '$4''
echo -------------------------------------------
echo Record Count :' '`gunzip -c $4|wc -l`
fi

I can use Aggregator to calculate the sum but I am told to use the script.

The issue know is
I data extracts are fine and the Control reports are generated correctly.Expect for 1 extract based on a daterange.
This is a multi instance job.
In one particular instance.The data is getting truncated and amount calculations are different from the amount calculation in teradata table.
for instance
Select sum(dt_range) from table is $5000
Control Report shows $4500

In the extract which has the issue for one particular record it has all 200 columns but target has only 25 columns. That means data is getting truncated or dropped.
I dont understand why the data is missing.when the job is simple transfer from Teradata ent stage to seq file stage.The job has only 2 stages.
I guess the source data is bad.It has Non-printable Control Characters so write a logic which will convert this data to '' (Empty).I wrote a logic
to convert the character ASCII values of the non-printable characters
http://www.robelle.com/library/smugbook/ascii.html
from this link I took the char values of the non-printable Control characters Convert(NULSOHSTXETXEOTENQACKBELBSHTLFVTFFCRSOSIDLEDC1DC2DC3DC4,'')
But this logic didnt take care of the issue.
Can anyone please help me

ray.wurlod · Post by **ray.wurlod** » Sat May 12, 2007 2:01 am

How did you represent the "ASCII values of control characters" in your Convert function?
Why do you believe control characters to be the root cause of the problem you are experiencing?

logic · Post by **logic** » Sat May 12, 2007 6:10 am

I did use this function for converting the ASCII characters values to ''(Empty cotes)

Convert(NULSOHSTXETXEOTENQACKBELBSHTLFVTFFCRSOSIDLEDC1DC2DC3DC4,'')

I used this function because my senior told me this is the root cause and use the convert function.
I am not sure what the cause is. Do you know any other reason why the columns are getting truncated

Thanks

chulett · Post by **chulett** » Sat May 12, 2007 7:23 am

I'm not sure 'control characters' really are your problem but this Convert definitely is not the way to go. I can see what you are attempting to do, but this isn't the way to do it. Since you've entered them all as just characters, each one individually will be converted to an empty string - first all N's then all U's, then L's - etc.

To pursue this, you'll need to use the CHAR function with their decimal values instead. For example:

Code: Select all

SOH  CHAR(1)
HT   CHAR(9)
DC1  CHAR(17)

Etc. Which means your Convert function would need to look more like:

Code: Select all

Convert(CHAR(1):CHAR(9):CHAR(17),"",YourField)

Fleshed out for all your codes, of course. If this doesn't "fix" your problem, you'll need to take another whack at explaining your problem. Concentrate on explaining just the problem area, we've got all the other stuff now. Examples are always good, too.

logic · Post by **logic** » Sat May 12, 2007 7:46 am

The source data has Decimals,Dates,Varchar and Char.I am using a Format function in the teradata Ent stage to transform all these decimals and dates to varchar. So my question is do I need to use this Convert function in the transformer for all the columns?. When I look at the data in the teradata SQL Assistant I find some special characters. If I look at the data in Unix ,vi editor I find special characters too .but the data is so huge ,100 million rows or something. How can I trace these characters in vi,editor?

chulett · Post by **chulett** » Sat May 12, 2007 8:04 am

Um... examples? 'Some special characters' tells us nothing.

Do you have access to a hex editor? That would help. Or an old school od or 'octal dump' in UNIX. The -A switch will let you get hex output if you're more familiar with that - check your man page.

logic · Post by **logic** » Sat May 12, 2007 9:18 am

Let me make sure iam heading in right way. Since existence of control characters in source has been told by somebody else so i want to make sure what could be other reasons for data to be truncated only in few rows. Since iam doing a direct dump from a Teradata tables to flat files using Teradata enterprise and Sequential Stage.

FYI :Both Source and Target has equal number of rows with few rows in Target truncated.

chulett · Post by **chulett** » Sat May 12, 2007 10:12 am

Again - without specific examples I'm not sure how anyone can know whether you're on the right track or heading down the rabbit hole.

You haven't shown us:

* What these 'truncated' values look like
* What these 'untruncated' values look like
* What 'special characters' you think you are seeing

Not sure about anyone else, but to me, without information along those lines we'd just be guessing.

logic · Post by **logic** » Sat May 12, 2007 1:25 pm

Each record has 240 columns and this is how source data table looks like

Source Data

36807951212 5 AZ 0.00 00003454 00 0 2007-01-01 -189.01 2 1 2 1 1 1
36874721212 6 NY 0.00 00003233 00 0 2006-01-01 121.01 2 1 2 1 2 1

When iam writing down to a target file the data for some records gets truncated and target file looks like this

Target Data

36807951212 5 AZ 0.00 00003454 00 0 2007-01-01 -189.01 2 1 2 1 1 1
36874721212 6 NY 0.00 00003233 00 0 2006-01-01

Since i've used only Teradata Enterprise Stage and Sequencital stage in the job . Iam wondering wat could be the issue.

ray.wurlod · Post by **ray.wurlod** » Sat May 12, 2007 3:09 pm

Also define "truncated". DataStage's data browser can be fooled into believing rows are truncated because VarChar columns have been padded with "\0" characters (which is the default setting of APT_STRING_PADCHAR environment variable). Viewing the output file with od -xc or a similar command (or vi) is a preferable mechanism for determining the existence of non-printable characters.

chulett · Post by **chulett** » Sat May 12, 2007 9:53 pm

When you say 'the target file looks like this' - what are you using to do the looking? 'View Data' in DataStage doesn't count, you need to do this from your operating system. The octal dump we've both now mentioned is one way, a hex editor or even vi can help.

logic · Post by **logic** » Mon May 14, 2007 9:50 pm

I did view the data from the Operating system using the command "file name |od -acxo".I could view the control characters here.I used the convert function and got rid of this.
Thanks for all your assistance

chulett · Post by **chulett** » Mon May 14, 2007 11:06 pm

So, would you mind sharing with us what control characters you found and your final Convert statement? That and mark the topic as 'Resolved'. Both would help future searches with a similar problem.

Thanks.

logic · Post by **logic** » Thu Oct 23, 2008 12:50 pm

I used the following logic with the below two stage variables
svControlChars=CHAR(0):CHAR(1):CHAR(2):CHAR(3):CHAR(4):CHAR(5):CHAR(6):CHAR(7):CHAR(8):CHAR(9):CHAR(10):CHAR(11):CHAR(12):CHAR(13):CHAR(14):CHAR(15):CHAR(16):CHAR(17):CHAR(18):CHAR(19):CHAR(20):CHAR(21):CHAR(22):CHAR(23):CHAR(24):CHAR(25):CHAR(26):CHAR(27):CHAR(28):CHAR(29):CHAR(30):CHAR(31):CHAR(127)
svReplaceChars=''
and in
In the derivation cell
Convert(svControlChars,svReplaceChars,column_name)

I was able to strip off the the control characters.

Thanks you all for your responses