Friday, April 10, 2009

Kludgey Shell Scripts, Perl and VM Disk Access

The tech company I work for is absurdly huge. I was recently asked by a co-worker to write a quick script to merge four flat files (from different DBs) into one file to import into another DB. Each line of the four files starts with an index field, so the plan was to start with the first file and append data from the matching lines of the other three to it.

My first thought was to slap together a shell script with grep and cut to build the merged list line by line. Here's the kludgey script:
---

#!/bin/ksh
F1="big_file"
F2="linked_file"
F3="slg_file"
F4="ei_file"
F_OUT="new_master_file"

CAT="/bin/cat"
CUT="/bin/cut"
ECHO="/bin/echo"
GREP="/bin/grep"
HEAD="/usr/bin/head"
RM="/bin/rm"
SED="/bin/sed"
SORT="/bin/sort"
TR="/usr/bin/tr"
WC="/usr/bin/wc"

WORDS=`$CAT $F1 | $CUT -d, -f1 | $SORT -u | $TR "\n" " "`

loop_cnt=0
$RM -f $F_OUT

for loop in $WORDS; do
    $ECHO -n "${loop}[$loop_cnt] "
    out1=`$GREP "^${loop}," $F1 | $HEAD -1`
    out2=`$GREP "^${loop}," $F2 | $HEAD -1 | $CUT -d, -f2`
    if [[ $out2 = "" ]]; then
        out2=" "
    fi
    out3=`$GREP "^${loop}," $F3 | $HEAD -1 | $CUT -d, -f2`
    if [[ $out3 = "" ]]; then
        out3=" "
    fi
    out4=`$GREP "^${loop}," $F4 | $HEAD -1 | $CUT -d, -f2`
    if [[ $out4 = "" ]]; then
        out4=" "
    fi
    $ECHO "${out1},${out2},${out3},${out4}" >> $F_OUT
    loop_cnt="$(($loop_cnt+1))"
done

---
It works; however, since $F1 has 5456 lines, the loop greps all four files once per index: four opens, scans and closes per iteration, which works out to about 21825 in all (counting the initial read of $F1), plus 5456 appends to $F_OUT. It's also worth noting that not all UNIX systems/shells will allow a $WORDS variable that big.

It took the Linux VM with SAN-backed drives almost 10 minutes and pegged the CPU the whole time. For a one-time thing, that would be fine (albeit ugly), but he needs the file updated every hour.

I didn't find this out until later, but for comparison I copied all the files to an equivalent Linux system with local drives: 24% CPU and done in just over 2 minutes. The high CPU overhead for storage access from a VM is kind of troubling. Large file access is fine, but lots of little requests seem to be hard on VMs.
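If I wanted to pin that down, a little Perl loop that just opens, reads a line from, and closes the same file a few thousand times would isolate the many-small-requests pattern from the grep work itself. This is only a sketch to run on both boxes, not something I benchmarked; the file name and iteration count are placeholders:

---

#!/usr/bin/perl -w
# Sketch: time a burst of tiny open/read/close cycles to compare local
# disk against SAN-backed VM storage.  "big_file" and the count of 20000
# are placeholders (roughly the number of scans the shell kludge does).
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $file = "big_file";
my $count = 20000;

my $start = [gettimeofday];
for (1 .. $count)
{
    open (my $fh, '<', $file) or die "Can't open $file.\n";
    my $line = <$fh>;    # read something so the open isn't a no-op
    close ($fh);
}
printf "%d open/read/close cycles in %.2f seconds\n", $count, tv_interval($start);

---

It wouldn't capture the full-file scans grep does, but if the VM takes dramatically longer on the same loop, that points at per-request overhead rather than raw throughput.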

For reference, I rewrote it in perl, which runs in about a second. It also reads each input file once, loads them into arrays and then writes out the new file in one shot.

----

#!/usr/bin/perl -w

my $F1="big_file";
my $F2="linked_file";
my $F3="slg_file";
my $F4="ei_file";
my $F_OUT="new_master_file";

open (INPUT, "< $F1") or die "Can't open file $F1.\n";
@master_lines=<INPUT>;
close (INPUT) or die "Can't close $F1.\n";

open (INPUT, "< $F2") or die "Can't open file $F2.\n";
@linked_lines=<INPUT>;
close (INPUT) or die "Can't close $F2.\n";

open (INPUT, "< $F3") or die "Can't open file $F3.\n";
@slg_lines=<INPUT>;
close (INPUT) or die "Can't close $F3.\n";

open (INPUT, "< $F4") or die "Can't open file $F4.\n";
@ei_lines=<INPUT>;
close (INPUT) or die "Can't close $F4.\n";

my $loop=0;
my $last_tid="last_value";
@toolid = ();
@toolname = ();
@nodeid = ();
@scmid = ();
@toollinked = ();
@slgserver = ();
@espapplication = ();
%tid_hash = ();

foreach $line (@master_lines)
{
    ($toolid[$loop],$toolname[$loop],$nodeid[$loop],$scmid[$loop])=split (/,/,$line);
    chomp ($scmid[$loop]);
    $toollinked[$loop]=" ";
    $slgserver[$loop]=" ";
    $espapplication[$loop]=" ";
    $tid_hash{$toolid[$loop]}=$loop;
    if ($loop==0)
    {
        $loop++;
    } else {
        $last_loop=$loop-1;
        if ($toolid[$loop] ne $toolid[$last_loop])
        {
            $loop++;
            $last_tid=$toolid[$loop];
        }
    }
}

$this_toolid="";
$this_linked="";
foreach $line (@linked_lines)
{
    ($this_toolid,$this_linked)=split (/,/,$line);
    chomp ($this_linked);
    if (exists ($tid_hash{$this_toolid}))
    {
        $toollinked[$tid_hash{$this_toolid}]=$this_linked;
    }
}

$this_toolid="";
$this_slgserver="";
foreach $line (@slg_lines)
{
    ($this_toolid,$this_slgserver)=split (/,/,$line);
    chomp ($this_slgserver);
    if (exists ($tid_hash{$this_toolid}))
    {
        $slgserver[$tid_hash{$this_toolid}]=$this_slgserver;
    }
}

$this_toolid="";
$this_espapplication="";
foreach $line (@ei_lines)
{
    ($this_toolid,$this_espapplication)=split (/,/,$line);
    chomp ($this_espapplication);
    if (exists ($tid_hash{$this_toolid}))
    {
        $espapplication[$tid_hash{$this_toolid}]=$this_espapplication;
    }
}

open (OUTPUT, "> $F_OUT") or die "Can't write to file $F_OUT.\n";
foreach $key (keys %tid_hash)
{
    $loop=$tid_hash{$key};
    print OUTPUT "$toolid[$loop],$toolname[$loop],$nodeid[$loop],$scmid[$loop],$toollinked[$loop],$slgserver[$loop],$espapplication[$loop]\n";
}
close (OUTPUT) or die "Can't write to $F_OUT.\n";

----
I stripped out the comments, but it's still a lot longer than the shell script. Programming historians will recognize the old-school blocking, but I find it easier to see where I forgot to close a block.

The interesting thing about this exercise is how much the scale matters. At one or two hundred indexes the shell script would have taken just a second or two. It still would have tied up the CPU, but it would have been a blip; I tested with a subset of 50 and never noticed a delay. It would still be a kludge, but it would have saved me from having to dust off my books to remember how to manipulate arrays and hashes in perl ... it's been years.
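For anyone whose perl is as rusty as mine was, the whole trick is a hash that maps each index value to its slot in a set of parallel arrays, so a lookup replaces a grep. A stripped-down sketch of just that idiom (the two data lines are made up, not from the real files):

---

#!/usr/bin/perl -w
# Tiny refresher on the hash-to-array-slot idiom; the data lines are
# invented examples, not the real DB extracts.
use strict;

my %row_for;    # index value -> position in the parallel arrays
my @name;
my $i = 0;
foreach my $line ("t1,alpha", "t2,beta")
{
    my ($id, $n) = split (/,/, $line);
    $row_for{$id} = $i;    # remember which slot this index landed in
    $name[$i] = $n;
    $i++;
}
print $name[ $row_for{"t2"} ], "\n";    # prints "beta" -- one hash lookup, no grep

---

That's essentially what %tid_hash is doing in the script above.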

And a truly large flat file would obviously have demanded a smarter solution. I was surprised at how big an impact merging a 5400-line file with three others of a few hundred lines each had on an otherwise zippy little system.
