data originating from spreadsheets

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

data originating from spreadsheets

Post by datastage »

I despise data coming from a source of a spreadsheet or comma separated file created from a speadsheet, as they are more prone to have data integrity errors than a database table. As a best practice I always load this source data into a real database before doing anything else with that source, so I at least know that starting point is consisent with my other source starting points.

What I've yet to develop or decide upon is what is a good method to auto-validate sources in Excel or similar. I'm looking for a good way to create some reusable code, whether it be a Datastage Basic routine, vbscript, whatever, to finding blank rows, rows with fewer or more columns than expected, etc..

I guess my question is what comes to mind on how to build a reusable code set for analyzing and potentially fixing data quality and integrity problems from spreadsheet sourced data.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

Hi Byron,

Spreadsheets are simple and flexible.

This is good for newbies and freaks.

If you want to use them for Database purpose you need 100 % of discipline. They kind of discipline a Bookkeeper has in his blood.

If you have at least one person in your team with an other kind of blood you can get a lot of trouble, which is uncalculable.

So better use any DataBase for you Quality-Checks. There the discipline is given by itsself.

In MS-Access, which is a DataBase, for exemple you can use vbscript.

Wolfgang
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Isn't that what MetaRecon and Quality Manager were about (and INTEGRITY to a lesser extent)? These all have new names since being acquired and integrated by Ascential: ProfileStage, AuditStage and QualityStage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

Wolfgang/Ray...thanks for the replies. I do like ProfileStage and AuditStage, and AuditStage may be a good 'quick & easy' fit for the process on analyzing spreadsheets. Hopefully if Access is being used then fewer problems exist than in Excel.

The problem as a consultant I face is that client often have been using Excel to manage some smaller data sets for years, and I can't just walk in and get to to change to Access, MySql, or Sql Server just so that I have a cleaner source. If they have one or more of ASCL data quality/integrity products great, but often they don't.

This may be too much of a pipe dream, but I still wondering if there is a good method, probably with a script and maybe best as a UniVerse/DataStage Basic script to make it portable, to resuse some cleanup code that would only need minor modifications each time you need to analyze spreadsheet source data. It just seems there is too much flaky stuff that can go on, for instance you have a field of 2 digit numeric codes and then you find '!0' in the data. Excel allows the !, when any database wouldn't, so there are just more junk than normal in these.

Another thing that annoys me is when this type of thing gets cleaned up and can appear again in next weeks spreadsheet. I can always analyze data in transforms, or find load errors to a database, etc..., but I'm still eager to find a way to catch it first thing and even before I begin to code an ETL job or run a job.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Byron,

Simple enough to do with VB or C or C++. I had written a simple VB app which ran as a before job ExecDOS. This preprocess accepted paramters, started Excel (if needed), opened the spreadsheet, did whatever and then saved it as a CSV. Works well and is a fairly simple solution to processing Excel tables. The CSV then becomes your input to the DS Job.

A prereq for this is either to have Excel installed on the server or include the Excel runtime in your exe. (OLE Automation and Excel 10.0 Object Library)

If you would like the code, please let me know

Regards,
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Michael

Post the code. Sounds useful. Brent had a similar thing in Java. I will send if he will send it and I can post it.

Thanks Kim.
Mamu Kim
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Here it is - just paste into a module, change the startup object to "Sub Main" - no need for forms. The app accepts a parameter, which would be a fully qualified path to an existing spreadsheet.

Code: Select all

Option Explicit
Declare Function FindWindow Lib "user32" Alias _
"FindWindowA" (ByVal lpClassName As String, _
                ByVal lpWindowName As Long) As Long

Declare Function SendMessage Lib "user32" Alias _
"SendMessageA" (ByVal hWnd As Long, ByVal wMsg As Long, _
               ByVal wParam As Long, _
               ByVal lParam As Long) As Long

Public Sub Main()

    Dim xlpath As Variant
    Dim XLAPP As Excel.Application
    Dim fpath As Variant
    Dim strAbsPath As String            ' Result of parsing
    Dim intI As Integer                 ' Character position counter
    Dim ExcelWasNotRunning As Boolean   ' Flag for final release.
    Dim xlfile As Variant
    Dim myxl As Object

    xlpath = Trim$(Command())
    
    If xlpath = "" Then Exit Sub
    
' Find the position of the last separator character.
    
    For intI = Len(xlpath) To 1 Step -1
        If Mid$(xlpath, intI, 1) = "/" Or _
        Mid$(xlpath, intI, 1) = "\" Then Exit For
    Next intI

    strAbsPath = Left$(xlpath, intI)
    xlfile = Right$(xlpath, Len(xlpath) - intI)
    xlfile = Left$(xlfile, (Len(xlfile) - InStr(1, xlfile, ".") + 1))
    
    
    On Error Resume Next   ' Defer error trapping.
    Set XLAPP = GetObject(, "Excel.Application")
    If Err.Number <> 0 Then ExcelWasNotRunning = True
    Err.Clear   ' Clear Err object in case error occurred.

    DetectExcel

    Set myxl = GetObject(xlpath)
   
    myxl.Application.DisplayAlerts = False
    
    myxl.SaveAs strAbsPath & xlfile, xlCSV
   
   
    If ExcelWasNotRunning = True Then
        myxl.Application.Quit
    End If

    Set myxl = Nothing   ' Release reference to the
                        ' application and spreadsheet.
End Sub

Sub DetectExcel()
' Procedure dectects a running Excel and registers it.
   Const WM_USER = 1024
   Dim hWnd As Long
' If Excel is running this API call returns its handle.
   hWnd = FindWindow("XLMAIN", 0)
   If hWnd = 0 Then   ' 0 means Excel not running.
      Exit Sub
   Else
   ' Excel is running so use the SendMessage API
   ' function to enter it in the Running Object Table.
      SendMessage hWnd, WM_USER + 18, 0, 0
   End If
End Sub
The above only saved an existing .xls as a CSV, but could easily be converted to do some checking or whatever.

Regards,
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Very cool. I am sure I will use this.

Thanks.
Mamu Kim
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Kim,

You can probably do all kinds of things like make this a plugin, a DLL or whatever. This was a quick a dirty utility that I needed and took very little time to create.
changming
Participant
Posts: 68
Joined: Wed Oct 13, 2004 3:35 am

can you help me

Post by changming »

mhester wrote:Byron,

Simple enough to do with VB or C or C++. I had written a simple VB app which ran as a before job ExecDOS. This preprocess accepted paramters, started Excel (if needed), opened the spreadsheet, did whatever and then saved it as a CSV. Works well and is a fairly simple solution to processing Excel tables. The CSV then becomes your input to the DS Job.

A prereq for this is either to have Excel installed on the server or include the Excel runtime in your exe. (OLE Automation and Excel 10.0 Object Library)

If you would like the code, please let me know



Regards,
currently, i met a problem in reading Excel file. question one is how to read Excel file which is in a unix work station? Do i have to manually save the speadsheet on the sever? question two, if the spreadsheet have more than 2 sheets, how to read them seperately and load into deferent table?
hope you can give me a quick feedback. thanks
Post Reply