Lookup Job Stuck for several hours before completion

gagan8877
Premium Member
Posts: 77
Joined: Mon Jun 19, 2006 1:30 pm

Lookup Job Stuck for several hours before completion

Post by gagan8877 »

Config File has 4 nodes.

-----------------------------

##I IIS-DSEE-TOIX-00059 21:46:28(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> Export complete; 6 records exported successfully, 0 rejected.
##I IIS-DSEE-TFSC-00010 21:46:28(000) <main_program> Step execution finished with status = OK.
##I IIS-DSEE-TCOA-00071 21:46:28(001) <main_program> Total virtual memory on node node0 is 6144 MB.
.
##I IIS-DSEE-TCOA-00071 21:46:28(002) <main_program> Config file checker on node node0 reports 0 error(s), 0 warning(s).
.
##I IIS-DSEE-TCOA-00071 21:46:28(003) <main_program> Config file checker on node node6 reports 0 error(s), 0 warning(s).
.
##I IIS-DSEE-TCOA-00071 21:46:28(004) <main_program> Config file checker on node node5 reports 0 error(s), 0 warning(s).
.
##I IIS-DSEE-TCOA-00071 21:46:28(005) <main_program> Total virtual memory on node node4 is 6144 MB.
.
##I IIS-DSEE-TCOA-00071 21:46:28(006) <main_program> Config file checker on node node4 reports 0 error(s), 0 warning(s).
.
##I IIS-DSEE-TCOA-00072 21:46:28(007) <main_program> 0 error(s) detected.
##I IIS-DSEE-TCOA-00073 21:46:28(008) <main_program> 0 warning(s) detected.


---------------------------

Job is simple:

1. Reads from a dataset (791.5 MB total, on 4 nodes)

Totals:
records : 3057400
blocks : 6332
bytes : 828555400
filesize: 829947904
min part: 207486976
max part: 207486976


2. Lookup against a Lookup File Set (5414 rows on each of 2 nodes; partitioning: Entire)

Totals:
filesize: 369280 (360 KB)
min part: 184640
max part: 184640

3. The Lookup outputs to a dataset - 592558548 rows, approx. 21 GB. During the write operation, the output dataset is repartitioned for the join in the downstream job.
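
(For reference, "Totals" figures like the above can be pulled with orchadmin; a minimal sketch - the path is a placeholder for wherever Directories.DATASETS points, and the exact flags vary by version (see orchadmin's built-in help):)

# Describe a parallel dataset: partition layout plus record/block/byte totals
orchadmin describe <DATASETS_dir>/jp_jdae3_TR_Awr_AsrPre_Master.ds_AsrPreMaster.ds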

If I look at the job monitoring window in Director:

- Job started @ 7:09 PM
- Current time: 10:30 PM
- Showing elapsed time: 00:43:49
- No. of rows on the output link: 591200424
- %CP @ 24

It seems the job gets stuck after 43 minutes, stays that way for the next 12 hours, and then finishes after about 13 hours.

--------------------------------------------------------------------
--------------------------------------------------------------------

OSH script
#################################################################
#### STAGE: ds_AsrPreMaster
## Operator
copy
## General options
[ident('ds_AsrPreMaster')]
## Inputs
0< [ds] '[&"Directories.DATASETS"]jp_jdae3_TR_Awr_AsrPre_Master.ds_AsrPreMaster.ds'
## Outputs
0> [modify (
distribution_method:not_nullable string[1]=distribution_method;
vendor_id:not_nullable string[8]=vendor_id;
vendor_num:not_nullable string[9]=vendor_num;
vendor_name:not_nullable string[34]=vendor_name;
division:not_nullable string[6]=division;
item:not_nullable string[7]=item;
sku_num:not_nullable string[3]=sku_num;
alpha_code:not_nullable string[3]=alpha_code;
facility_num:not_nullable string[6]=facility_num;
e3_whse_id:not_nullable string[3]=e3_whse_id;
e3_whse_name:not_nullable string[25]=e3_whse_name;
line_num:not_nullable string[2]=line_num;
format_num:not_nullable string[2]=format_num;
retail_rebuyer_num:not_nullable string[3]=retail_rebuyer_num;
e3_fmt_2f:not_nullable string[2]=e3_fmt_2f;
e3_rebuyer_2b:not_nullable string[2]=e3_rebuyer_2b;
billing_cost:not_nullable string[15]=billing_cost;
rim_rec_sell:not_nullable string[10]=rim_rec_sell;
item_sku_desc:not_nullable string[35]=item_sku_desc;
sku_id:not_nullable string[10]=sku_id;
sku_flag:not_nullable string[1]=sku_flag;
division_L3:not_nullable string[3]=division_L3;
division_L2:not_nullable string[2]=division_L2;
item_L5:not_nullable string[5]=item_L5;
sku_num_L4:not_nullable string[4]=sku_num_L4;
keep
distribution_method,vendor_id,vendor_num,vendor_name,
division,item,sku_num,alpha_code,
facility_num,e3_whse_id,e3_whse_name,line_num,
format_num,retail_rebuyer_num,e3_fmt_2f,e3_rebuyer_2b,
billing_cost,rim_rec_sell,item_sku_desc,sku_id,
sku_flag,division_L3,division_L2,item_L5,
sku_num_L4;
)] 'ds_AsrPreMaster:in_AsrPreMaster.v'
;
#### STAGE: ds_AsrPreMstr.lnk_lkp1m_ScoreAlphaStoreGrp_Part
## Operator
hash -key sku_id -cs -key store_L4 -cs
## General options
[ident('ds_AsrPreMstr.lnk_lkp1m_ScoreAlphaStoreGrp_Part')]
## Inputs
0< [] 'lkp1m_ScoreAlphaStoreGrp:lnk_lkp1m_ScoreAlphaStoreGrp.v'
## Outputs
0> [] 'lkp1m_ScoreAlphaStoreGrp:lnk_lkp1m_ScoreAlphaStoreGrp_Part.v'
;
#### STAGE: ds_AsrPreMstr.lnk_lkp1m_ScoreAlphaStoreGrp_Sort
## Operator
tsort -key sku_id -cs -asc -key store_L4 -cs -asc -nonStable
## General options
[ident('ds_AsrPreMstr.lnk_lkp1m_ScoreAlphaStoreGrp_Sort')]
## Inputs
0< [] 'lkp1m_ScoreAlphaStoreGrp:lnk_lkp1m_ScoreAlphaStoreGrp_Part.v'
## Outputs
0> [] 'lkp1m_ScoreAlphaStoreGrp:lnk_lkp1m_ScoreAlphaStoreGrp_Part_Sort.v'
;
#################################################################
#### STAGE: ds_AsrPreMstr
## Operator
copy
## General options
[ident('ds_AsrPreMstr')]
## Inputs
0< [] 'lkp1m_ScoreAlphaStoreGrp:lnk_lkp1m_ScoreAlphaStoreGrp_Part_Sort.v'
## Outputs
0>| [ds] '[&"Directories.DATASETS"]jp_jdae3_TR_AsrPreMaster.ds_AsrPreMstr.ds'
;
#################################################################
#### STAGE: lkp1m_ScoreAlphaStoreGrp
## Operator
lookup
## Operator options
-fileset '[&"Directories.DATASETS"]jp_jdae3_Standardization1.lfs_ScoreAlphaStoreGrp.lfs'
-transfer_adapter ' transfer* excludes alpha_code'
-ifNotFound reject
## General options
[ident('lkp1m_ScoreAlphaStoreGrp'); jobmon_ident('lkp1m_ScoreAlphaStoreGrp'); nocombine]
## Inputs
0< [] 'ds_AsrPreMaster:in_AsrPreMaster.v'
## Outputs
0> [modify (
keep
distribution_method,vendor_id,vendor_num,vendor_name,
division,item,sku_num,alpha_code,
facility_num,e3_whse_id,e3_whse_name,line_num,
format_num,retail_rebuyer_num,e3_fmt_2f,e3_rebuyer_2b,
billing_cost,rim_rec_sell,item_sku_desc,sku_id,
sku_flag,division_L3,division_L2,item_L5,
sku_num_L4,store_num,store_L4;)] 'lkp1m_ScoreAlphaStoreGrp:lnk_lkp1m_ScoreAlphaStoreGrp.v'
1> [] 'lkp1m_ScoreAlphaStoreGrp:rej_lkp1m_ScoreAlphaStoreGrp.v'
;
#################################################################
#### STAGE: seq_ScoreAlphaGrpRejects
## Operator
export
## Operator options
-schema record
{final_delim=end, delim=',', quote=double}
(
distribution_method:string[1];
vendor_id:string[8];
vendor_num:string[9];
vendor_name:string[34];
division:string[6];
item:string[7];
sku_num:string[3];
alpha_code:string[3];
facility_num:string[6];
e3_whse_id:string[3];
e3_whse_name:string[25];
line_num:string[2];
format_num:string[2];
retail_rebuyer_num:string[3];
e3_fmt_2f:string[2];
e3_rebuyer_2b:string[2];
billing_cost:string[15];
rim_rec_sell:string[10];
item_sku_desc:string[35];
sku_id:string[10];
sku_flag:string[1];
division_L3:string[3];
division_L2:string[2];
item_L5:string[5];
sku_num_L4:string[4];
)
-file '[&"Directories.REJECTS"]jp_jdae3_TR_AwrMaster.seq_ScoreAlphaGrpRejects.rej'
-overwrite
-rejects continue
## General options
[ident('seq_ScoreAlphaGrpRejects'); jobmon_ident('seq_ScoreAlphaGrpRejects')]
## Inputs
0< [] 'lkp1m_ScoreAlphaStoreGrp:rej_lkp1m_ScoreAlphaStoreGrp.v'
;
# End of OSH code


--------------------------------------------------------------------
--------------------------------------------------------------------

Time 10:53 PM, job still running.....


lsps -s
Total Paging Space Percent Used
6144MB 1%


df -g

Filesystem GB blocks Free %Used Iused %Iused Mounted on

/stag/lvds1 115.00 96.39 17% 636 1% /dataset1
/stag/lvds2 115.00 96.40 17% 636 1% /dataset2
/stag/lvds3 115.00 96.41 17% 638 1% /dataset3
/stag/lvisdk0 225.00 224.96 1% 25 1% /disk0
/stag/lvisdk1 225.00 224.96 1% 25 1% /disk1
/stag/lvisdk2 225.00 224.96 1% 25 1% /disk2
/stag/lvisdk3 225.00 224.96 1% 25 1% /disk3
/stag/lvds0 115.00 95.22 18% 642 1% /dataset0


The /dataset<n> filesystems are where the datasets live; the /disk<n> filesystems are the scratch disks.
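
(For context, here is roughly how those mounts would appear in a 4-node config file; a minimal sketch only - the fastname and node names are placeholder assumptions, since the job log above actually shows node0/node4/node5/node6:)

{
    node "node0"
    {
        fastname "rcm05"
        pools ""
        resource disk "/dataset0" {pools ""}
        resource scratchdisk "/disk0" {pools ""}
    }
    node "node1"
    {
        fastname "rcm05"
        pools ""
        resource disk "/dataset1" {pools ""}
        resource scratchdisk "/disk1" {pools ""}
    }
}

(node2 and node3 would follow the same pattern with /dataset2,/disk2 and /dataset3,/disk3.)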


vmstat -v shows:

3145728 memory pages
3039440 lruable pages
2632 free pages
1 memory pools
706983 pinned pages
80.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
29.1 numperm percentage
885361 file pages
0.0 compressed percentage
0 compressed pages
29.1 numclient percentage
90.0 maxclient percentage
885354 client pages
0 remote pageouts scheduled
2758 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
448515 external pager filesystem I/Os blocked with no fsbuf
71.8 percentage of memory used for computational pages


lparstat
System configuration: type=Shared mode=Uncapped smt=4 lcpu=4 mem=12288MB psize=2 ent=1.00

%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ----- ------ ------ ----- ----- ------ ----- -----
0.2 0.1 0.4 99.4 0.00 0.4 5.8 1067695908 19557411


lparstat -i
Node Name : rcm05
Partition Name : rcm07
Partition Number : 13
Type : Shared-SMT-4
Mode : Uncapped
Entitled Capacity : 1.00
Partition Group-ID : 32781
Shared Pool ID : 1
Online Virtual CPUs : 1
Maximum Virtual CPUs : 1
Minimum Virtual CPUs : 1
Online Memory : 12288 MB
Maximum Memory : 16384 MB
Minimum Memory : 10240 MB
Variable Capacity Weight : 100
Minimum Capacity : 0.80
Maximum Capacity : 1.00
Capacity Increment : 0.01
Maximum Physical CPUs in system : 64
Active Physical CPUs in system : 16
Active CPUs in Pool : 2
Shared Physical CPUs in system : 16
Maximum Capacity of Pool : 200
Entitled Capacity of Pool : 200
Unallocated Capacity : 0.00
Physical CPU Percentage : 100.00%
Unallocated Weight : 0
Memory Mode : Dedicated
Total I/O Memory Entitlement : -
Variable Memory Capacity Weight : -
Memory Pool ID : -
Physical Memory in the Pool : -
Hypervisor Page Size : -
Unallocated Variable Memory Capacity Weight: -
Unallocated I/O Memory entitlement : -
Memory Group ID of LPAR : -
Desired Virtual CPUs : 1
Desired Memory : 12288 MB
Desired Variable Capacity Weight : 100
Desired Capacity : 1.00
Target Memory Expansion Factor : -
Target Memory Expansion Size : -
Power Saving Mode : Disabled

mpstat -s

System configuration: lcpu=4 ent=1.0 mode=Uncapped

Proc0
0.40%
cpu0 cpu1 cpu2 cpu3
0.18% 0.10% 0.06% 0.06%




Can anyone throw some light on this? Why is it stuck? What should the next step be? What should I monitor, and how? Is there something else I can try?

Thanks
Gary
"A journey of a thousand miles, begins with one step"
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Two things come to mind:

1) Your lookup outputs nearly 200 times the number of source records (and roughly 27 times the data volume). Is this expected, and is it correct?
2) The job is sorting the data prior to writing the output dataset (shown in your OSH)

Most likely, the job is "stuck" (not really) sorting the output data. You can't see this happening in the performance statistics because you don't have an explicit Sort stage in the job design; rather, you have selected the sort option on the input link of the output dataset stage. That approach also doesn't give you certain options for tuning the memory/scratch usage of the sort.

What can you monitor?
- Disk activity...I bet during most of the twelve hours it's primarily on the scratch disks.
- Disk performance...how fast or slow is data transfer to and from the scratch and dataset disks? You can test this from the command line using the dd command, or in a simple DS job using Row Generator and a Sequential File stage.
- How to monitor? iostat, or better yet nmon and nmon_analyzer (Google for it). Gather the statistics for the entire job run...a 1 to 5 minute sample period should be enough; see the capture sketch below.
- Work with your system admins to determine whether the I/O channels are saturated during the job.
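
A minimal nmon capture sketch (the interval and count are just examples; size them to cover the whole job run):

# Write nmon data to a file for nmon_analyzer:
# -f = file/spreadsheet mode, -s 60 = 60-second samples, -c 780 = 780 snapshots (~13 hours)
nmon -f -s 60 -c 780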

Possible solutions (based on the above):

1) Verify the lookup's output. It may be correct, but outputting nearly 200x the input rows is suspicious to me.
2) Add a Sort stage in front of the output dataset and perform your partitioning and sorting there. Set the Sort's Restrict Memory Usage option to 200 (MB); the default is 20. This can have a profound effect when sorting multi-gigabyte files. (A sketch of the resulting OSH is below.)
3) Increasing the parallelism of the job may help if the I/O system throughput isn't maxed out yet.
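
For reference, here's roughly what option 2 would look like in the generated OSH, based on the tsort line already in your script; if I recall correctly, the Restrict Memory Usage property maps to tsort's -memory option (value in MB):

tsort -key sku_id -cs -asc -key store_L4 -cs -asc -nonStable -memory 200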

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
gagan8877
Premium Member
Premium Member
Posts: 77
Joined: Mon Jun 19, 2006 1:30 pm

Post by gagan8877 »

Hi James

Thanks for the valuable info.

"1) Your lookup outputs nearly 200 times the number of source records and more than 200 times the data volume. Is this expected and is it correct?"

Yes, this is correct.

"2) The job is sorting the data prior to writing the output dataset (shown in your OSH)"

I removed the sort/repartition and, yes, after that the job shows correct stats in the job monitor. It has now been running for 11 hours.

nmon is very useful and definitely better than vmstat etc. - thanks.

"Disk activity...I bet during most of the twelve hours it's primarily on the scratch disks."

Why does df -g show only 1% used on scratch disks (since yesterday)?

Filesystem GB blocks Free %Used Iused %Iused Mounted on
/stag/lvisdk0 225.00 224.96 1% 25 1% /disk0
/stag/lvisdk1 225.00 224.96 1% 25 1% /disk1
/stag/lvisdk2 225.00 224.96 1% 25 1% /disk2
/stag/lvisdk3 225.00 224.96 1% 25 1% /disk3


Is it possible that scratch space usage is limited (set by the admin) for the process running the job? Is there a command to find out?

Even the CPU usage never shows more than 35%; it's usually between 15 and 20.

topas_nmon──S=WLMsubclasses────Host=r5982pcm05─────Refresh=2 secs───10:05.38─────────────────────────────────────────────────────────────────────────┐
│ CPU-Utilisation-Small-View ───────────EntitledCPU= 1.00 UsedCPU= 0.292─────────────────────────────────────────────────────────────────────────────│
│Logical CPUs 0----------25-----------50----------75----------100 │
│CPU User% Sys% Wait% Idle%| | | | | │
│ 0 0.0 21.5 0.0 78.5|ssssssssss > | │
│ 1 0.0 4.8 0.0 95.2|ss > | │
│ 2 0.0 0.0 0.0 100.0| > | │
│ 3 0.0 0.0 0.0 100.0| > | │
│EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+ │
│ EC 0.6 16.4 0.0 12.1|ssssssssiiiiii-----------------------------------| │
│ VP 0.6 16.4 0.0 12.1|ssssssssiiiiii-----------------------------------| │
│EC= 29.2% VP= 29.2% +--No Cap---|------------|-SMT=4-----100% VP=1 CPU+

User% is usually <5%, sys% is close to 20%

Is it possible that CPU usage is limited (set by the admin) for the process running the job? Is there a command to find out? (A couple of commands I plan to try are below.)
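
(These are my guesses, on the assumption that AIX Workload Manager would be the mechanism for such a cap - the nmon header does mention WLM subclasses:)

wlmstat         # per-class resource usage; fails with an error if WLM isn't running
lsclass         # list the WLM classes (and their limits) the admin has defined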

"Increasing the parallelism of the job may help if the I/O system throughput isn't maxed out yet."

The %busy counter in nmon shows less than 4% activity on 4 of the disks. Others are at 0.

Disk Busy Read Write Transfers Size Peak% Peak KB/s qDepth │
│ Name KB/s KB/s /sec KB Read+Write or N/A │
│hdisk1 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk2 0% 0.0 0.0 0.0 0.0 67% 11528.2 -- │
│hdisk4 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk6 2% 0.0 126.0 31.5 4.0 5% 423.7 -- │
│hdisk11 2% 0.0 832.0 13.0 64.0 3% 2941.9 -- │
│hdisk8 0% 0.0 0.0 0.0 0.0 0% 4.0 -- │
│hdisk7 0% 0.0 0.0 0.0 0.0 8% 8.0 -- │
│hdisk3 0% 0.0 0.0 0.0 0.0 8% 0.0 -- │
│hdisk5 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk12 0% 0.0 0.0 0.0 0.0 1% 28.0 -- │
│hdisk10 0% 0.0 0.0 0.0 0.0 1% 8.0 -- │
│hdisk9 0% 0.0 0.0 0.0 0.0 0% 8.0 -- │
│hdisk14 0% 0.0 0.0 0.0 0.0 8% 0.0 -- │
│hdisk15 0% 0.0 0.0 0.0 0.0 1% 0.0 -- │
│hdisk13 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk18 0% 0.0 0.0 0.0 0.0 1% 22.0 -- │
│hdisk17 0% 0.0 0.0 0.0 0.0 8% 252.0 -- │
│hdisk16 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk0 0% 0.0 0.0 0.0 0.0 2% 116.0 -- │
│hdisk19 0% 0.0 0.0 0.0 0.0 1% 0.0 -- │
│hdisk22 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk21 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk20 0% 0.0 0.0 0.0 0.0 1% 0.0 -- │
│hdisk23 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk26 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk28 2% 0.0 832.0 13.0 64.0 3% 3070.1 -- │
│hdisk24 0% 0.0 0.0 0.0 0.0 0% 0.0 -- │
│hdisk25 1% 0.0 0.0 0.0 0.0 1% 0.0 -- │
│hdisk27 3% 0.0 1664.0 26.0 64.0 49% 13567.9 -- │
│hdisk29 0% 0.0 0.0 0.0 0.0 1% 0.0 -- │
│hdisk31 0% 0.0 8.0 2.0 4.0 14% 339.8 -- │
│hdisk30 0% 0.0 0.0 0.0 0.0 1% 0.0 -- │
│Totals(MB/s) Read=0.0 Write=3.4 Size(GB)=2396 Free(GB)=387


The ulimit -a output for the user id that's running the job shows:

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) unlimited
memory(kbytes) unlimited
coredump(blocks) 2097151
nofiles(descriptors) 131072
threads(per process) unlimited
processes(per user) unlimited


Shouldn't the CPU usage increase under normal circumstances on AIX when the job is running? So maybe there is some kind of restriction on it for the process.

Memory is maxed out but paging space is still at 1%:

Memory ───────────────────────────────────────────────────────────────────────────────│
│ Physical PageSpace | pages/sec In Out | FileSystemCache │
│% Used 99.9% 0.6% | to Paging Space 0.0 0.0 | (numperm) 27.8% │
│% Free 0.1% 99.4% | to File System 2816.0 848.0 | Process 47.4% │
│MB Used 12277.8MB 38.0MB | Page Scans 15825.1 | System 24.7% │
│MB Free 10.2MB 6106.0MB | Page Cycles 0.0 | Free 0.1% │
│Total(MB) 12288.0MB 6144.0MB | Page Steals 3728.0 | ------ │
│ | Page Faults 5261.0 | Total 100.0% │
│------------------------------------------------------------ | numclient 27.8% │
│Min/Maxperm 356MB( 3%) 10686MB( 90%) <--% of RAM | maxclient 90.0% │
│Min/Maxfree 960 1088 Total Virtual 18.0GB | User 72.1% │
│Min/Maxpgahead 2 8 Accessed Virtual 8.5GB 47.1%| Pinned 22.7% │
│ | lruable pages 3039440.


The FileSystem Cache does show usage, though, and that might be the scratch disk activity, but why is there no paging?
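
(To separate file-cache I/O from real paging, I plan to watch vmstat's interval view; my understanding is that on AIX, vmstat -I adds fi/fo columns for file pages in/out, while pi/po are paging-space pages:)

vmstat -I 5

If fi/fo are busy but pi/po stay at 0, it's filesystem caching (scratch/dataset I/O), not paging-space activity.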

Thanks again.
Gary
"A journey of a thousand miles, begins with one step"
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Did you simply disable the part/sort options in the stage (leaving no sorting/repartitioning in the job), or did you replace them with a separate Sort stage?

Does your config file specify the correct scratch and dataset storage locations?

Are the stats you just provided from AFTER the changes you made, or before?

If there is no sorting occurring in your job now, the scratch usage looks reasonable...just enough to buffer input data to the lookup while it writes the output data. If there's no sorting at all and it still takes 11 hours, then potential bottlenecks are filesystem throughput and the fact that the lookup is creating 200x as many output records as input (that takes some time, naturally).

What is the throughput of the storage locations (how fast can data be read from or written to the filesystems)? Use simple jobs to read and write data to the filesystems: DataSet->Copy, SeqFile->Copy, RowGen->DataSet, RowGen->SeqFile. Create large files (1 GB, 10 GB, 21 GB, for example) and time how long the jobs run. A quick dd-based check is sketched below.
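
If you want a quick command-line check before building the test jobs, dd will do; a rough sketch (paths and sizes are placeholders - use a file larger than RAM if you want to dodge the filesystem cache on the read):

# Sequential write of ~1 GB to a scratch filesystem, in 1 MB blocks
time dd if=/dev/zero of=/disk0/dd_throughput.tmp bs=1024k count=1024
# Sequential read of the same file back
time dd if=/disk0/dd_throughput.tmp of=/dev/null bs=1024k
rm /disk0/dd_throughput.tmp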

What are the characteristics of your record schema? A lot of VarChar(255) columns that normally contain very little actual data? That can cause the engine to move a lot of whitespace between stages while the data is "in-flight".
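
For illustration, the distinction in OSH schema terms (column names made up; my understanding is that bounded strings are carried at their full declared length, while unbounded ones travel at actual length plus a length prefix):

record (
    comments_bounded: string[max=255];
    comments_unbounded: string;
)

The bounded column is the VarChar(255) case; if it usually holds only a few characters, that's a lot of padding moving through the job.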

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.