Removal of Repeated Characters

Posted: Wed Oct 19, 2016 3:48 pm
by oacvb
We have a requirement to remove repeated characters, such as AAA or BBB, from a Name field when they occur more than thrice. I tried the Convert function in a Transformer stage with the string AAA, but it removed the character even where it appeared only once. Please let me know how this can be implemented in a Transformer stage. I tried a server routine, but it can't be called from a parallel job.

Posted: Thu Oct 20, 2016 9:18 am
by qt_ky
You'll most likely have to write your own BASIC routine. Use a BASIC Transformer stage in a parallel job to call it. It's not shown in the Palette. Instead, go to Repository, Stage Types, Parallel, Processing, BASIC Transformer.

Posted: Thu Oct 20, 2016 12:54 pm
by ray.wurlod
You could write the routine in C++ and (having compiled and linked it and created a reference to it in DataStage) call it from a parallel Transformer stage.

Or you could use a BASIC Transformer stage in a parallel job.

Posted: Fri Oct 21, 2016 10:27 am
by UCDI
There is a recent thread on hand-rolling an e-replace function in C that is probably a good starting point if you are not strong at C. It would be similar to that, just a bit of adjusted logic.

You can probably also handle this with pattern action... I personally wouldn't, but you could.

Posted: Mon Nov 14, 2016 11:33 am
by abc123
Datastage 9.1 has eReplace in the parallel transformer stage.

Ray/qt_ky, if the OP was to replace strings such as AAAA with A using eReplace, how would he do it?

I would think that the OP would have to call eReplace 26 times in a nested manner. Agree?

Posted: Mon Nov 14, 2016 1:43 pm
by UCDI
26 for caps, 26 again for lower case, 10 more for digits, then symbols... and that assumes you can find a way to do it. With Unicode it would be intractable, and it is horrible even for plain ASCII.

The C or BASIC way would be to copy the original string into a new string, one byte at a time, dropping duplicates as you go (if current != previous, copy, else skip). This is an O(N) operation, which is pretty much as good as it gets here (you could also divide it across many threads if the strings were gigantic, but that is usually not necessary). It's a very simple and short chunk of code; I highly recommend doing it this way.
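For anyone who wants to see that copy loop spelled out, here is a minimal sketch in C. The function name collapse_repeats and the caller-supplied destination buffer are assumptions made for illustration; nothing here is a DataStage API.

```c
#include <stddef.h>

/* Hypothetical helper: copies src into dst, skipping any character that is
   equal to the input character just before it.
   dst must have room for at least strlen(src) + 1 bytes. Returns dst. */
char *collapse_repeats(const char *src, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        /* Always copy the first character; copy later ones only when they
           differ from the previous input character. */
        if (i == 0 || src[i] != src[i - 1]) {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0'; /* terminate the hand-built string */
    return dst;
}
```

Note that this collapses every run down to a single character, so a name like AARON would come out as ARON. That is stricter than the "more than thrice" rule in the original post.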

Posted: Mon Nov 14, 2016 3:09 pm
by qt_ky
No need to hard-code a scenario for every possible character...

Loop through the string from first to last position. Initialize a counter variable. Initialize a previous character variable.

If current character = previous character, increment a counter, else reset the counter.

If counter > 3 (or whatever your rule may be), then do something.
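The steps above could look roughly like this in C. This is only a sketch: the name limit_runs, the max_run parameter, and the caller-supplied buffer are all assumptions, and "do something" is read here as "drop the extra characters so that at most max_run copies remain". If the rule is instead to remove the whole run, the logic would need to change.

```c
#include <stddef.h>

/* Hypothetical helper: copies src into dst, keeping at most max_run
   consecutive copies of any character.
   dst must have room for at least strlen(src) + 1 bytes. Returns dst. */
char *limit_runs(const char *src, char *dst, unsigned max_run)
{
    size_t out = 0;
    unsigned count = 0; /* length of the current run so far */
    char prev = '\0';   /* previous input character */

    for (size_t i = 0; src[i] != '\0'; i++) {
        if (i > 0 && src[i] == prev) {
            count++;    /* same as previous: the run grows */
        } else {
            count = 1;  /* different character: a new run starts */
        }
        prev = src[i];

        /* Copy only while the run is within the allowed length. */
        if (count <= max_run) {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return dst;
}
```

Calling limit_runs(name, buf, 3) would leave runs of up to three characters alone and trim anything longer back to three.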

Posted: Mon Nov 14, 2016 3:40 pm
by abc123
qt_ky, I am assuming that you are talking about a parallel transformer loop, right?

Posted: Tue Nov 15, 2016 8:32 am
by qt_ky
The generic pseudo code I outlined would be valid for any programming language, BASIC routine, Parallel routine, etc.

In theory, it could even be done using looping within a Parallel Transformer stage, although it would be a bit cumbersome.

Posted: Tue Nov 15, 2016 9:05 am
by UCDI
For clarity, that is the same algorithm I said except I suggested copying into a temp for simplicity & speed in the low level languages. Details aside, I think this is the best algorithm for general strings (it can be improved for specific strings, of course).

Posted: Tue Nov 15, 2016 9:56 am
by qt_ky
Note there was one detail in the original post to remove repetitive characters that occur more than thrice... :wink:

Posted: Tue Nov 15, 2016 2:46 pm
by UCDI
That is true! The core algorithm is unchanged, but you do need to handle this detail. That looks to be extra convoluted in datastage transformer logic.

Posted: Thu Nov 17, 2016 9:38 am
by chucksmith
Back to the original question, please note that the convert() function deals with single byte comparison/conversion. The change() function deals with substrings.
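To illustrate why a per-character function behaved the way the original poster saw, here is a rough C analogue. It is not DataStage's implementation; delete_chars is a made-up name, and the sketch only assumes that each input character is tested on its own against the list of characters to remove.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of a per-character conversion with an empty
   replacement list: every character of src that appears anywhere in
   from_list is dropped, regardless of how many times it repeats.
   dst must have room for at least strlen(src) + 1 bytes. Returns dst. */
char *delete_chars(const char *src, const char *from_list, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        /* The test looks at one character at a time, so "AAA" as a list
           means nothing more than "the character A". */
        if (strchr(from_list, src[i]) == NULL) {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return dst;
}
```

With a per-character rule like this, passing "AAA" as the list turns ADAM into DM: every A goes, not just runs of three.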

Posted: Thu Nov 17, 2016 9:46 am
by asorrell
On a related note, because this is a per-character comparison of a string, it's going to really slow down your job, regardless of methodology. Be aware of the impact if this is a time-critical job that processes a lot of records.

Posted: Thu Nov 17, 2016 11:03 am
by UCDI
It shouldn't. I recently standardized a text file in a similar way (removal of extra characters). The file was about 60MB of text. Execution time was less than 3 seconds and that includes reading the file and writing the fixed file output on top of the processing time. That was a one-shot hack code, so I did not even multi-thread it. If I had multi-threaded it, it would have been ~ 1 second on a typical 4 cpu machine.

I've done a couple of string standardization routines for DataStage and they are invariably the fastest stages in the job.

Here is a quick, rough cut at it, since the topic has stayed alive for so long.
-----------

#include <string.h>

#define SLOTS    10   // number of output buffers used in rotation
#define SLOT_LEN 100  // max length of input string (including 0 end), tweak if needed.

char outbuff[SLOTS * SLOT_LEN]; // yes, it's an evil global variable.

char *strmax3(char *buff)
{
    // for parallel execution: a micro "memory manager".
    // tweak for your system, this is fine for 4-8 cpu/threads.
    // it's faster than allocating new memory for each input.
    static unsigned int which = 0;

    // null string or string too short to check: do nothing, return the input and exit.
    if (!buff) return buff;
    size_t len = strlen(buff);
    if (len < 4) return buff;
    // too long for one slot: return the input untouched rather than overrun the buffer.
    if (len >= SLOT_LEN) return buff;

    which = (which + 1) % SLOTS;
    char *out = &outbuff[which * SLOT_LEN];
    size_t dx = 0, lc;

    // seeds the algorithm to simplify code.
    out[dx++] = buff[0];
    out[dx++] = buff[1];
    out[dx++] = buff[2];
    for (lc = 3; lc < len; lc++)
    {
        // copy unless this char matches the three input chars before it,
        // i.e. unless it is the 4th or later char of a run.
        if (!(buff[lc] == buff[lc-1] && buff[lc] == buff[lc-2] && buff[lc] == buff[lc-3]))
            out[dx++] = buff[lc];
    }
    out[dx] = 0; // standard c end of string MUST be added to hand-cooked strings.
    return out;
}