Non-printable Control Characters
Moderators: chulett, rschirm, roy
My requirement is:
I have to extract 100 million records from a Teradata table and write them to a flat file.
The data should be split into multiple files based on a date range, and for each file a control report must be generated giving the record count and the sum of certain columns across all records.
The control report should look like this:
Control Report
File: fileX
Date Range Between 2000-01-02 and 2001-06-30
------------------------------------------------------------------
Record Count : yyyy
Column1  : $xxxxxx
Column2  : $xxxxx
Column10 : $xxxxx
...
Column55 : $xxx
I designed a job using a Teradata Enterprise stage and a Sequential File stage.
The job is executed multiple times by a wrapper; on each run it writes the output file and the control file for one date range.
I have written a shell script that runs after the job and generates the control report with the record count and the column sums. The shell script calls an awk script, File_sum_ctl.awk:
BEGIN {FS = "\t"}
{
    s1 += $4;
    s2 += $9;
    s3 += $15;
    s4 += $21;
    s5 += $46;
    s6 += $47;
}
END {
    printf "RECORD COUNT %d\n", NR;
    printf "Col4  $%5.2f\n", s1;
    printf "Col9  $%5.2f\n", s2;
    printf "Col15 $%5.2f\n", s3;
    printf "Col21 $%5.2f\n", s4;
    printf "Col46 $%5.2f\n", s5;
    printf "Col47 $%5.2f\n", s6;
}
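A quick way to sanity-check the awk logic is to run the same pattern, trimmed down to a single summed column, over a couple of tab-delimited test rows (the file name and values here are made up for illustration):

```shell
# Two tab-delimited rows with known amounts in column 4:
printf 'a\tb\tc\t10.50\nx\ty\tz\t4.25\n' > sample.txt

# Same structure as File_sum_ctl.awk, reduced to one summed column:
awk 'BEGIN {FS = "\t"}
     {s1 += $4}
     END {
       printf "RECORD COUNT %d\n", NR
       printf "Col4 $%5.2f\n", s1
     }' sample.txt
# prints: RECORD COUNT 2, then Col4 $14.75
```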
Here is the shell script. It executes fine; can you please check whether it is properly parameterised?
#!/usr/bin/ksh
# Purpose      : Create a ctl file containing the record count and column sums
# Version      : 1.0
# History/date :
# Instructions : Use ksh to execute the script.
#              : chmod this script file appropriately to avoid permission issues.
#              : There is no validation of the number of inputs.
SCRIPTS=$1
SRC_SYS_ID=$2
OUTBOUND_DIR=$3
Target_File_name=$4
Script_FileName=$5
if [[ ${Target_File_name} = File.txt.gz ]]
then
    echo ' Control Report'
    echo ' '${Target_File_name}
    echo -------------------------------------------
    gunzip -c ${OUTBOUND_DIR}/${SRC_SYS_ID}_${Target_File_name} | awk -f ${SCRIPTS}/${Script_FileName}
else
    echo ' Control Report'
    echo ' '${Target_File_name}
    echo -------------------------------------------
    echo 'Record Count : '`gunzip -c ${Target_File_name} | wc -l`
fi
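For what it's worth, the record-count branch boils down to gunzip -c piped into wc -l. A minimal self-contained check of just that pattern (the file name here is only an example):

```shell
# Create a small gzipped file with a known number of rows:
printf 'row1\nrow2\nrow3\n' > demo.txt
gzip -f demo.txt                          # produces demo.txt.gz

echo 'Record Count : '`gunzip -c demo.txt.gz | wc -l`
# prints: Record Count : 3
```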
I could use an Aggregator to calculate the sums, but I have been told to use the script.
The issue now is:
The data extracts are fine and the control reports are generated correctly, except for one extract based on a date range. This is a multi-instance job. In one particular instance the data is getting truncated, and the amount calculations differ from the amount calculations in the Teradata table.
For instance:
Select sum(dt_range) from the table gives $5000
The control report shows $4500
In the extract that has the issue, one particular record has all 200 columns in the source but only 25 columns in the target. That means data is getting truncated or dropped.
I don't understand why the data is missing when the job is a simple transfer from a Teradata Enterprise stage to a Sequential File stage; the job has only 2 stages.
I guess the source data is bad: it has non-printable control characters, so I was asked to write logic that converts them to '' (empty). I took the character values of the non-printable control characters from this link:
http://www.robelle.com/library/smugbook/ascii.html
and wrote:
Convert(NULSOHSTXETXEOTENQACKBELBSHTLFVTFFCRSOSIDLEDC1DC2DC3DC4,'')
But this logic didn't take care of the issue.
Can anyone please help me?
How did you represent the "ASCII values of control characters" in your Convert function?
Why do you believe control characters to be the root cause of the problem you are experiencing?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I used this function to convert the ASCII character values to '' (empty quotes):
Convert(NULSOHSTXETXEOTENQACKBELBSHTLFVTFFCRSOSIDLEDC1DC2DC3DC4,'')
I used this function because my senior told me this is the root cause and to use the Convert function.
I am not sure what the cause is. Do you know any other reason why the columns are getting truncated?
Thanks
I'm not sure 'control characters' really are your problem, but this Convert definitely is not the way to go. I can see what you are attempting to do, but this isn't the way to do it. Since you've entered them all as just characters, each one individually will be converted to an empty string - first all N's, then all U's, then all L's - etc.
To pursue this, you'll need to use the CHAR function with their decimal values instead. For example:

SOH CHAR(1)
HT CHAR(9)
DC1 CHAR(17)

Etc. Which means your Convert function would need to look more like:

Convert(CHAR(1):CHAR(9):CHAR(17),"",YourField)

Fleshed out for all your codes, of course. If this doesn't "fix" your problem, you'll need to take another whack at explaining your problem. Concentrate on explaining just the problem area; we've got all the other stuff now. Examples are always good, too.
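Outside the job itself, the same stripping can be sketched at the shell level with tr. This is only an illustration of the idea, not part of the DataStage fix: it deletes ASCII 0-8, 11-31 and 127 while keeping tab (9) and newline (10), and the file name and bytes are invented:

```shell
# A line with SOH (\001) and DEL (\177) embedded in it:
printf 'AB\001C\tD\177E\n' > dirty.txt

# Delete the control characters but keep \t and \n:
tr -d '\000-\010\013-\037\177' < dirty.txt > clean.txt
cat clean.txt
```

The result is the original line with the SOH and DEL bytes removed and the tab intact.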
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
The source data has Decimals, Dates, Varchar and Char columns. I am using a Format function in the Teradata Enterprise stage to transform all the decimals and dates to varchar. So my question is: do I need to use this Convert function in the transformer for all the columns? When I look at the data in Teradata SQL Assistant I find some special characters, and if I look at the data in the Unix vi editor I find special characters too. But the data is huge - 100 million rows or so. How can I trace these characters in the vi editor?
Um... examples? 'Some special characters' tells us nothing.
Do you have access to a hex editor? That would help. Or an old school od or 'octal dump' in UNIX. The -A switch will let you get hex output if you're more familiar with that - check your man page.
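As a concrete sketch of that od suggestion (the file name and the planted byte are invented for illustration), a stray control character shows up as an octal escape in od -c output and as a raw byte in hex output:

```shell
# Plant an ENQ (\005) between two fields:
printf 'NY\00500003233\n' > suspect.txt

od -c suspect.txt            # the bad byte appears as 005
od -A x -t x1 suspect.txt    # the same byte appears as hex 05
```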
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Let me make sure I am heading the right way. The existence of control characters in the source was suggested by somebody else, so I want to know what other reasons there could be for the data to be truncated in only a few rows, since I am doing a direct dump from Teradata tables to flat files using the Teradata Enterprise and Sequential stages.
FYI: both source and target have an equal number of rows, with a few rows in the target truncated.
Again - without specific examples I'm not sure how anyone can know whether you're on the right track or heading down the rabbit hole.
You haven't shown us:
* What these 'truncated' values look like
* What these 'untruncated' values look like
* What 'special characters' you think you are seeing
Not sure about anyone else, but to me, without information along those lines we'd just be guessing.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Each record has 240 columns, and this is how the source data table looks:
Source Data
36807951212 5 AZ 0.00 00003454 00 0 2007-01-01 -189.01 2 1 2 1 1 1
36874721212 6 NY 0.00 00003233 00 0 2006-01-01 121.01 2 1 2 1 2 1
When I am writing to the target file, the data for some records gets truncated and the target file looks like this:
Target Data
36807951212 5 AZ 0.00 00003454 00 0 2007-01-01 -189.01 2 1 2 1 1 1
36874721212 6 NY 0.00 00003233 00 0 2006-01-01
Since I've used only the Teradata Enterprise stage and the Sequential stage in the job, I am wondering what the issue could be.
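One quick way to locate truncated rows in a file this wide is to count fields per record with awk and flag any row that falls short of the expected 240. A scaled-down sketch (the tab delimiter and the expected width of 5 are assumptions made for this demo):

```shell
# Demo file: the second row is missing two of its five fields.
printf 'a\tb\tc\td\te\nf\tg\th\n' > rows.txt

awk -F'\t' 'NF != 5 {printf "line %d has %d fields\n", NR, NF}' rows.txt
# prints: line 2 has 3 fields
```

Against the real extract you would use -F for the actual delimiter and compare NF against 240.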
Also define "truncated". DataStage's data browser can be fooled into believing rows are truncated because VarChar columns have been padded with "\0" characters (which is the default setting of APT_STRING_PADCHAR environment variable). Viewing the output file with od -xc or a similar command (or vi) is a preferable mechanism for determining the existence of non-printable characters.
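To see what that "\0" padding looks like on disk, here is a small sketch (the padded value is fabricated): NUL bytes are invisible in many viewers but stand out in an octal dump:

```shell
# Simulate a VarChar value padded with NUL characters:
printf 'AZ\000\000\0000.00\n' > padded.txt

od -c padded.txt | head -1   # the pad bytes appear as \0
```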
Last edited by ray.wurlod on Sun May 13, 2007 4:11 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
When you say 'the target file looks like this' - what are you using to do the looking? 'View Data' in DataStage doesn't count, you need to do this from your operating system. The octal dump we've both now mentioned is one way, a hex editor or even vi can help.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
I used the following logic with the below two stage variables:
svControlChars=CHAR(0):CHAR(1):CHAR(2):CHAR(3):CHAR(4):CHAR(5):CHAR(6):CHAR(7):CHAR(8):CHAR(9):CHAR(10):CHAR(11):CHAR(12):CHAR(13):CHAR(14):CHAR(15):CHAR(16):CHAR(17):CHAR(18):CHAR(19):CHAR(20):CHAR(21):CHAR(22):CHAR(23):CHAR(24):CHAR(25):CHAR(26):CHAR(27):CHAR(28):CHAR(29):CHAR(30):CHAR(31):CHAR(127)
svReplaceChars=''
and in the derivation cell:
Convert(svControlChars,svReplaceChars,column_name)
With this I was able to strip off the control characters.
Thank you all for your responses.
"nulla tenaci, invia est via"