Hash File, Create File Options max number for min modulus

ray.wurlod · Post by **ray.wurlod** » Mon Jan 05, 2004 6:16 pm

wdudek wrote:If we were to use this methodology, wouldn't we still need to pull all of the unidata data from our database and then use a reference to find the changed/new records?

No, if the logic for determining the changed records can be determined within UniData, you can simply add a selection criterion to your UniData stage (on the Selection tab on its Output link). For example, if your I-descriptor returned "changed" or "", then you would SELECT filename WITH

Code: Select all

Idescriptor_name = "changed"

This would actually reduce the number of rows that had to be selected from UniData, and probably therefore speed up the ETL process.

wdudek · Post by **wdudek** » Thu Jan 08, 2004 1:15 pm

ok, I'm taking in alot here. As for Kenneth's post, I looked up the original he mentioned. In it he recommends setting up an account specific to datastage. I believe that we do have this as our setup on the unix server looks like this /prod/db /prod/db/db_sql and datastage is looking at /prod/db/db_sql. When I went into this account and looked at one of the files I saw the following: I apologize about the formatting I played with it but couldn't get it to look too pretty.

LIST DICT LS_MASTER BY TYP BY @ID TYP LOC CONV NAME FORMAT SM ASSOC 13:46:31 Jan
08 2004 1
@ID............ TYP LOC.......... CONV NAME........... FORMAT SM ASSOC.....

@ID D 0 15L S
AMORT_RESIDUAL D 103 MD2, 14R S
ANNUAL_INT_RATE D 153 MD3 15R MV LS_MASTER_
ANNUAL_INT
_RATE
APPLICATION_NO D 114 MD0 14R S
APPROV_AUTH D 35 11R MV LS_MASTER_
APPROV_AUT
H_PH

(only a small portion)

From looking at this I am under the impression that we are using D (direct) Descriptors. Where can I find out about the F pointers or would these D's be F' if we were using them?

As for Ray's post, would them tmp file that the records whose id's have been changed be a subfile of the one whose records have been changed? I'm still reading the docs I downloaded the other day.

kcbland · Post by **kcbland** » Thu Jan 08, 2004 2:11 pm

A file has two elements: the data, and the dictionary. In the VOC is a File Pointer. I don't have access to Unidata, but you can see where a file is defined by looking at the definition.

In the VOC a file entry can be either a synonym (Q-Pointer) or file (F-pointer). Try doing CT VOC filename and looking at the results. If it looks like:

Code: Select all

<1> Q
<2> Account
<3> Filename

then it means that your file in this account is pointing to another file in another account.

If it looks like:

Code: Select all

<1> F
<2> /var/opt/wherever/xyz
<3> /var/opt/wherever/D_xyz

then it's pointing directly to the file. What you do is setup a new account and create F-pointers that point to the physical file but a different dictionary. You do this by creating the dictionary file in your datastage account. Then, noodle the pointer to look at the data. You really should have a unidata guy do this for you properly.

ray.wurlod · Post by **ray.wurlod** » Thu Jan 08, 2004 4:38 pm

All the data in a single record are stored in a delimited string. What you're seeing in the "D" type entries is that, for example, the AMORT_RESIDUAL field is the 103rd delimited substring, and that there is one value of AMORT_RESIDUAL per record (the "S" in the "SM" place), and this is stored with two implied decimal places (that is, 123456 stored means 1,234.56 when displayed for humans).
ANNUAL_INT_RATE is the 153rd delimited substring in the record, and there is potentially more than one ANNUAL_INT_RATE per record (the "MV" means "multi-valued" in the "SM" place). This is stored with three implied decimal places (that is, 1234567 stored means 1,234.567 when displayed for humans). There are other multi-valued fields that share a one-to-one relationship with ANNUAL_INT_RATE; these all belong to the "association" (= nested table) called LS_MASTER_ANNUAL_INT_RATE.

ray.wurlod · Post by **ray.wurlod** » Thu Jan 08, 2004 4:41 pm

wdudek wrote:As for Ray's post, would them tmp file that the records whose id's have been changed be a subfile of the one whose records have been changed? I'm still reading the docs I downloaded the other day.

My idea was that there would be something in the records themselves that would timestamp their most recent changes, for example a field (column) called something like LAST_UPDATE_DATE. This could be used to select only the records that have been changed since a particular date. It would be necessary that the application maintains this field.

I did not envisage a temp file of any kind. While this could be done, it would be an entirely separate file, not within the structure of the UniData file (table) containing the original data.

wdudek · Post by **wdudek** » Thu Jan 08, 2004 4:59 pm

Kenneth, thanks, I checked and yes they are f type pointers.

Ray, your idea is great and we use that methedology often in our oracle environment where we control everything, unfortunately we cannot add such a column, as the Unidata database is from third party vendor who supplies us the application which rides on it and generates the data that we our extracting from it. Unfortunately for us, this rigidity seems to be our biggest problem.

ray.wurlod · Post by **ray.wurlod** » Thu Jan 08, 2004 5:55 pm

Could still be worth checking the file dictionary to see whether such a column already exists! (Or ask the third-party vendor.) If it's there, your problems are non-existent.

kcbland · Post by **kcbland** » Thu Jan 08, 2004 8:31 pm

ray.wurlod wrote:Nice thought, but you can't create F pointers from DataStage to UniData.

No, not from DataStage, from an account setup for DataStage to use. Since dictionaries for files contain A, S, F, T, and I correlatives/descriptors, you will not only see physical data but virtual as well. When DataStage imports metadata, it imports the dictionary. You often have a lot of junk columns, the worst being T correlatives since they are actually accessing other hash files.

It is paramount that you only pull back the physical data in a hash file, no virtual columns. In addition, dictionaries are often malformed, incomplete, or outright wrong. It behooves you to have clean dictionaries that accurately depict the data within. For that, you can setup an account on the Universe/Unidata side for DataStage to use. There, setup F-pointers to point to the data files, but use locally created dictionaries. You will access that account rather than the normal account. This, IMHO, is the best way to make sure you get data, as well as the most performant extracts.