duplicate rows

qwert · Post by **qwert** » Tue Aug 24, 2004 9:55 am

Hi,

Is there a DS stage that can help you to remove duplicate rows from a sequential file ?

Thanks

chulett · Post by **chulett** » Tue Aug 24, 2004 10:04 am

See if this post helps.

chulett · Post by **chulett** » Tue Aug 24, 2004 4:31 pm

Hmmm, someone seems to have "unreplied" to this.

Be happy to help with Stage Variables and detecting changes in repeating groups if there is a need...

w951bvc · Post by **w951bvc** » Tue Aug 24, 2004 7:45 pm

chulett wrote:Hmmm, someone seems to have "unreplied" to this. :? Be happy to help with Stage Variables and detecting changes in repeating groups if there is a need...

Umm, that was me. I had a follow-up question, but then realized my question bordered dangerously on being type where the person asking obviously didn't take much time to investigate on their own. Don't want to be that type of fellow! So I deleted the post, did some digging, and now am back.

So here goes...

Where I work, we've never used stage variables. All of our aggregations, checking for dupes, etc, have always been through hash files. Though it seems that instead of using hash files for duplicate identification, stage variables might be a much better alternative (depending on the cost of sorting the input file, of course).

I've searched through the various threads on this board, and read through the EVER-fabulous DS documentation (page 9-20 thru 9-23 in the Server Job Developer's Guide). But I'm missing a key point here somewhere, that I just can't seem to puzzle out or get to work.

Say I set up a stage variable, called SV. Now, I assign (by dragging) an input column called Name. In the constraint on my output link, to drop dupes, I would want something like "SV <> InputLink.Name".

But of course by this time in the transform, SV has already been assigned the current value of Name. What I really want is to compare the PREVIOUS value of SV, to the CURRENT value of Name. And that's where I'm dropping the ball. What exactly is the syntax that I'm looking for?

Since the Stage Variables pops up in the right of the transformer much like a link, I thought I could perhaps juggle the processing order of the links. But the stage variables aren't really "links", so that approach didn't work.

Embarrassingly ignorant,

->Richard Carpenter

chulett · Post by **chulett** » Tue Aug 24, 2004 8:28 pm

Ahhh, Richard... let's tawk.

First, just about stage variables in general. They've been quite the boon since they were added to DataStage as they've got a ton of uses other than in the context you are concerned with right now. A couple of examples off the top of my head:

*For repeated derivations. Why repeat the same derivation logic (typically to check the success of failure of something) when you can do it once and leverage it multiple times?

*For boolean values and to make complex derivations more self-documenting. A simple example that encapsulates these two points. A lookup needs to be checked and a value compared with the input data to determine subsequent values. Instead of repeating:

Code: Select all

If Not(Lookup_link.NOTFOUND) And Lookup_link.Old_Value = Input_link.New_Value Then X Else Y

perhaps several times - consider this. Create a stage variable called GoodRecord. In the derivation of the stage variable, put just the expression:

Code: Select all

Not(Lookup_link.NOTFOUND) And Lookup_link.Old_Value = Input_link.New_Value

This evaluates to @TRUE or @FALSE and sets the value of the stage variable accordingly. Now, in the derivations of the fields that need to make decisions based on this, you can simply say:

Code: Select all

If GoodRecord Then X Else Y

For what you need to do, some points to keep in mind. Stage variables are evaluated before the main derivations in the Transformer and they happen in the order listed. So, to check for changes in repeating groups you use two stage variables - one to hold the "old" value and one to hold the result of the comparison of old versus new. It could be as simple as a boolean variable called IsChanged and one called OldValue. Declare them in that order and then reference the OldValue in the check for IsChanged:

Code: Select all

IsChanged: Input.NewValue <> OldValue and @INROWNUM > 1
OldValue:  Input.NewValue

You may or may not need the special check for the first record depending on exactly what you are checking and how you set the Initial Value of the stage variables. Does that make sense?

I make use of the "boolean" stage variables quite a bit, especially for complex Constraints. Another interesting use is for a conditional counter, something that only gets incremented under certain conditions and not necessarily for every row. Accomplish this by setting the stage variable to itself when you don't want it to change:

Code: Select all

GoodCount:  If GoodRecord Then GoodCount + 1 Else GoodCount

Hope this helps peak your interest and you start leveraging Stage Variables in your jobs... you will find them very handy, IMHO.

qwert · Post by **qwert** » Wed Aug 25, 2004 8:00 am

Thanks for your answer . But I thought about a solution involving Aggregator stage. But once angain, thanks, this seems to be a good alternative.

chulett · Post by **chulett** » Wed Aug 25, 2004 8:57 am

Yup, the Aggregator is one answer when your data isn't coming in sorted.

w951bvc · Post by **w951bvc** » Wed Aug 25, 2004 11:25 am

Craig;

Thanks for the *extremely* clear and concise answer. This morning I was able to get everything finally working as it should, and am able to start realizing just some of the ways these variables things can be used.

A quick follow up. In your last example, you show how such variables can be used to count records that meet a certain condition. What method is there for "throwing" this variable someplace that can be used or reported on elsewhere?

I imagine I could simply stick it on the end of every row, and then just look at the last row. But if I'm writing to a something besides a sequential file, I lose my ability (I'm faily sure) to grab a definite "last row".

->Richard

ray.wurlod · Post by **ray.wurlod** » Wed Aug 25, 2004 4:44 pm

You can always run another output from the Transformer stage containing just the value in question, run that through an Aggregator stage with its aggregate rule set to "Last", and from there to wherever.

chulett · Post by **chulett** » Wed Aug 25, 2004 8:52 pm

Glad you found it helpful, Richard.

There's always more than one way to skin the DataStage cat, so in addition to what Ray suggested, something else to check out - the UserStatus area. It is an area each job has where a single piece of data can be written and retrieved at any time after the job completes. There are a couple of functions associated with it that you'd need to look up in the help. One is DSSetUserStatus and the other is an option of DSGetJobInfo.

I write a small custom routine to call DSSetUserStatus and call it (typicallly) "CallDSSetUSerStatus". All it does is take whatever is passed into it as Arg1 and call the function using it. I then pass back out Arg1 as Ans. This allows me to use this in a derivation inside a job, sometimes at the field level or perhaps in the derivation (or even Initial Value) of a Stage Variable. Because it is hash-based, it follows the "destructive overwrite" rules of last in, only thing out. Write to it as many times as you like and only the last thing written will be there when you go to retrieve it.

In order to retrieve it, check out the DSGetJobInfo function. I don't recall the exact syntax off the top of my head, but one of the Info Types will allow you to retrieve the contents of the User Status area at any time. All you need to do is attach to the job first. Don't forget to detach when done, it's only polite.

It can be used in Sequencer jobs to set Parameter Values for jobs downstream of activities that set it. See the Stage.$UserStatus entry in the External Parameter Helper thingy when picking parameter values.

A caveat: Recompiling or resetting a job will empty the User Status contents. Also note that, even though you can only store a single data item there, it can be a delimited string or a dynamic array which can then be parsed or decomposed after retrieval.

Enjoy!