Vertica connection performance

buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Vertica connection performance

Post by buzzylee »

Hi experts,

I've recently been playing a lot with Vertica 6.0 connectivity under DataStage. Setting up the connection has its "gotchas" and isn't well documented, but what bothers me more is the ODBC Connector's performance when it's used in "write" mode.

It takes a minimum of 10-15 seconds to initialize every session; moreover, the more sessions you open (degree of parallelism), the slower it gets. In extreme cases we wait hours (sic!) for the job to reach the point of actual data load. Sounds insane, doesn't it? :)

We investigated the root cause - it looks like DataStage is indirectly responsible for it.

Let me explain - as you probably know, DataStage implicitly issues catalog queries whenever Connector-type stages are used. This is to validate the target table's structure, generate the DML command, etc. The problem is that these queries kill Vertica. I know - most of you will say "report it to Vertica support", and we did - but the fact is, there's no guarantee of a simple patch for it.

What I was interested to know is - has any of you ever faced this issue, and if so, what was the remediation (besides an HP support call)?

And secondly - is there any way to turn off the ODBC Connector's catalog metadata calls? For the record - turning off schema reconciliation doesn't help at all; DB catalog queries are still being issued.
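
A quick way to see these calls on the database side - this is only a sketch, assuming Vertica's v_monitor.query_requests system table (column names may differ by version) - is to watch for slow requests while a job initializes:

Code:

-- Show the most recent requests that took longer than a second,
-- which is where the session-initialization catalog queries land.
SELECT start_timestamp,
       request_duration_ms,
       LEFT(request, 120) AS request_text
FROM v_monitor.query_requests
WHERE request_duration_ms > 1000
ORDER BY start_timestamp DESC
LIMIT 20;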

And last but not least - using the ODBC Enterprise stage isn't an option. Even though it doesn't use catalog queries, it doesn't satisfy other project requirements (no support for data rejection).

Regards
Buzz
buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

Ok, we have a workaround for the issue. A very "brute force" workaround, but it works and satisfies us :lol:

It turned out the main evil comes from Vertica's built-in implementation of the SQLStatistics ODBC function. Its purpose is to return object statistics, and it's implemented as part of the ODBC driver. Vertica simply runs the following query underneath:

Code:

SELECT
    *
FROM
    (
        SELECT
            NULL AS catalog_name ,
            vs_projections.schemaname AS schema_name ,
            vs_projections.anchortablename AS table_name ,
            NULL AS non_unique ,
            NULL AS index_qualifier ,
            NULL AS index_name ,
            0 AS type ,
            NULL AS ordinal_position ,
            NULL AS column_name ,
            NULL AS asc_or_desc ,
            ROUND( AVG( total_row_count - deleted_row_count ) ) AS cardinality ,
            NULL AS pages ,
            NULL AS filter_condition
        FROM
            vs_projections JOIN storage_containers
                ON vs_projections.name = storage_containers.projection_name
        GROUP BY
            vs_projections.schemaname ,
            vs_projections.anchortablename
        ORDER BY
            non_unique ,
            type ,
            index_qualifier ,
            index_name ,
            ordinal_position
    ) AS vmd
WHERE
    schema_name ILIKE E'public' ESCAPE E'\\'
    AND table_name ILIKE E'test01' ESCAPE E'\\';

The only theoretical value it brings to the table is the "cardinality" column. I say "theoretical" because I can't find a reason why DataStage should need to know the target table's cardinality when writing data.

The above query becomes very slow once you install data model objects - mainly because of the slow response of the "storage_containers" system view, which additionally seems to be serialized (the more sessions we open, the slower the overall response gets). It looks like a poor system dictionary implementation on Vertica's side; we have a ticket open for it.

How to "fix" it? It turned out that all we have to do is to... change this query within ODBC library file (yes - binary hex editor to be used). What we did is simple elimination of "storage_containers" reference and stubbing "cardinality" column with some static value. The other option is to replace "storage_containers" with some user table that is even empty - it still works.

I know it sounds like a dodgy workaround, but it works :D

Cheers
Buzz
nikhil_bhasin
Participant
Posts: 50
Joined: Tue Jan 19, 2010 4:14 am

Re: Vertica connection performance

Post by nikhil_bhasin »

I was also using Vertica with DataStage 8.7. The way we tackled the loads was to create a load-ready, pipe-delimited sequential file and then use Vertica's COPY command to load the data. Believe me, ODBC can never give you that performance. You can try this approach and compare the times.
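
To give an idea of the load step - the table, columns, and paths here are made up for illustration - the COPY looks something like this:

Code:

-- Load a pipe-delimited, load-ready file produced by the DataStage job.
-- DIRECT writes straight to disk storage, which suits large batches.
COPY my_schema.my_table ( col1, col2, col3 )
FROM LOCAL '/staging/my_table.dat'
DELIMITER '|'
NULL ''
REJECTED DATA '/staging/my_table.rej'
DIRECT;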
buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

Hi Nikhil,

That's interesting, I will give it a try tomorrow.

I knew about that technique, but I had discarded it because my ODBC Connector's INSERT statement is actually executed by Vertica as a COPY command (using a pipe instead of a file). This is a Vertica ODBC driver feature that upgrades bulk INSERT operations into a COPY statement, so from this perspective I expected the proposed approach to be slower, since there is an extra I/O step (the sequential file write).
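
Roughly, the translation looks like this (column and table names are from my test setup; the COPY shape is the one I've seen Vertica execute):

Code:

-- What the ODBC Connector issues as a parameterized bulk insert:
INSERT INTO public.test01 ( col1, col2, col3 ) VALUES ( ?, ?, ? );

-- What the Vertica driver executes underneath (pipe, not file):
COPY public.test01 ( col1, col2, col3 ) FROM LOCAL STDIN NATIVE VARCHAR ENFORCELENGTH RETURNREJECTED AUTO NO COMMIT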

The other reason we preferred to stick with the ODBC Connector is metadata management - impact analysis / data lineage features come by default once you use the ODBC Connector, while with an extra sequential file we would need some manual / semi-automatic steps to stitch the sequential file to the shared table metadata. It's doable, but not a standard approach.

Anyway, I will let you know once I run the benchmark :)

Cheers
Buzz
nikhil_bhasin
Participant
Posts: 50
Joined: Tue Jan 19, 2010 4:14 am

Post by nikhil_bhasin »

Sure Buzzy... this connector that you are using - is it provided by Vertica itself?
Yeah, I also thought the extra I/O would be the problem, but we did face some issues while using the ODBC connector with Vertica, especially if you need to truncate the table before loading.
Do let me know your results.
buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

Hi again,

Answering your question - yes, we use the Vertica ODBC driver provided by HP / Vertica.

I ran the test I mentioned (Connector vs. Sequential File + COPY), and indeed the second approach seems to be slightly faster:

ODBC Connector / 1 node
Elapsed time: 1:02
Command used by Vertica:

Code:

COPY public.test01 ( col1, col2, col3 ) FROM LOCAL STDIN NATIVE VARCHAR ENFORCELENGTH RETURNREJECTED AUTO NO COMMIT

ODBC Enterprise / 1 node
Elapsed time: 5:11
Command used by Vertica:

Code:

COPY public.test01 ( col1, col2, col3 ) FROM LOCAL STDIN NATIVE VARCHAR ENFORCELENGTH RETURNREJECTED AUTO NO COMMIT

Sequential File + COPY / 1 node
Elapsed time: 0:56
Command used by Vertica:

Code:

COPY public.test01 ( col1, col2, col3 ) FROM LOCAL '/var/IBM/copy1.txt' DELIMITER ',' NULL '' ENCLOSED BY '"' ENFORCELENGTH RETURNREJECTED AUTO NO COMMIT;

It looks like even with the extra step in the last approach (sequential file materialization), it's still around 10% faster than the stdin COPY executed via ODBC. At the same time, ODBC Enterprise demonstrates really poor performance.

Note that I executed all scenarios in a single-node environment. Interestingly, when switching to 2-node mode, the ODBC approaches did get almost 2 times faster (0:36 for the ODBC Connector and 3:20 for ODBC Enterprise), but the sequential file approach (upgraded to a "file set") only improved by 15-20%, to 0:45. It looks like the COPY LOCAL command doesn't load data in parallel even when the number of input files is greater than 1. That's probably because the files live on the DataStage server, not on the Vertica hosts.
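
In theory, if the files were staged on the Vertica nodes themselves, a plain COPY (without LOCAL) can read one file per node in parallel. Purely a sketch - node names and paths below are hypothetical, and I haven't tested this here:

Code:

-- One input file per Vertica node; each node parses its own file.
COPY public.test01 ( col1, col2, col3 )
FROM '/data/part1.txt' ON v_vertdb_node0001,
     '/data/part2.txt' ON v_vertdb_node0002
DELIMITER ',' NULL '' ENCLOSED BY '"'
ENFORCELENGTH AUTO;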

Regards
Buzz
nikhil_bhasin
Participant
Posts: 50
Joined: Tue Jan 19, 2010 4:14 am

Post by nikhil_bhasin »

Hi Buzzy,
How much data did you use while performing these tests? We had a huge volume - 15-16 million rows to start with. It took around 12 minutes for the DataStage job to write to the sequential file and then another 1-2 minutes for the COPY command to load into the tables, using a 4-virtual-node config file.
buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

Hi Nikhil,

It was around 10 million records with an 80-byte row size. However, it's worth adding that our SAN disk system is pretty powerful and well sized - that's why the timings you see are achievable on a single-node config.

The other thing is that using more than 1 node with the sequential file approach won't help much, as it always uses a single write process. The File Set stage addresses I/O parallelism well.

Cheers
Buzz