Strawberry Perl + CPAN

Well, I’ve figured out one of the problems getting anything to work off CPAN. The installer didn’t have the brains to detect that CPAN configuration should work with $prefix where $prefix != C:strawberryperl. As in my case, the distribution is in C:DevFilesLanguagesPerl. Hehe.

It would be nice to be able to install modules and not have so many WTFs/Minute.

The STAMAN Project: Phase IV,

Having thought of tasks and storage formats, it’s now time to figure out an implementation language, i.e. what programming langauge am I going to write the task manager in.

O.K. based on what we’ve got so far, it is easy to infer the following is worth having:

  • Portable between systems—a must for me 😉
  • Easy access to SQLite—usually trivial.
  • Better tools than gmtime().
That means this page is rather useful, for what languages can be ruled out. In my répertoire this means Go (aww), Scheme, AWK, and shell languages can be skipped. Reason being the portability of Scheme bindings in general, and the others lacking sufficient portability (for my taste) at this time. That still leaves about 13 languages, lol. PHP, Java, Lua, JavaScript, and X86 assembly are easy for me to rule out. Reasons for that can all be easily guessed; at least if you remember how much I enjoy Suns Java tools. JS/Lua are great choices but I don’t want to screw with the bindings and stuff.

I’m not very interested in compiling SQLite in C/C++ on Windows, or the CLI binding everywhere. So this effectively makes the choice Perl, Python, or Ruby. Out of those three, none is perfect either: perl doesn’t come with the required database code, it just has the definitive interface for databases everybody mimics. Python and Ruby on the other hand, come with SQLite bindings—which many distributions separate out into separate packages. It’s just a lose, lose situation when you think about dependencies, but it does beat writing your own everything for every program. Sometimes. Setup with these three dynamic languages would be easy though, in so far as we’ve gotten with the above.

Time handling is another issue. Perl has fairly minimalist handling of time built in, but on the upside, if you need it, it’s probably three abreast on CPAN. Time::Format and the core Time::Piece module each come to mind. What isn’t built into Perl, often comes with it or can be added to it. Ruby provides a simple but effective Time class, that makes for more natural code than you may expect. More complex operations will require Googling for a Ruby gem, or hand coding it on demand. Python on the other hand provides a comprehensive datetime module, and supporting time and calendar modules, all out of the box! I would say Python takes the lead here.

Rule one of getting work done: know how to leverage libraries.

In terms of programming languages, Perl, Ruby, and Python are generally equal enough for most tasks, so long as you don’t shoot yourself in the foot. Some subtle differences that personally irk me:

  • Perls autovivification can be almost as much a miss-feature as it can be a convenience. You’ve just got to learn the damn language :-P.
  • Ruby functions are not first class objects! Some things can also be weird if you’re not used to Ruby.
  • Python doesn’t always stand up well to typos, especially if they involve indentation o/.

Because of how many lines of code I’ve done in Python over the years, I am more familiar with it’s set of “Irks” than Ruby’s, like wise I know Sh, C, and Perl more intimately than other languages, so I really know their irks. For perl, it’s mostly thin things that get in inconvenient when combing the warnings pragmata with the nature of perl syntax. They spiritually conflict at times. Under Ruby, I mostly find gripes that have a bigger place in programmer culture. My issues with Python generally have to do with trade-offs that I disagree with as a matter of my convenience, even if it usually results in a Good Thing overall. It comes from a C-oriented background meshed with a love for the Perl programming language.

This is a fact: you will always be irked by your programming language, if you use it enough. What can I say, nothing is perfect. Shoot!

For this particular application, there’s some things worth noting also: language portability. If the machine doesn’t run perl, it’s not a real computer. Most systems you’re likely to care about will run Ruby and Python, and there’s probably a crusty old version of Python for those you don’t (nor directly should). In contrast however, Perl is often a lower level of “Cross platform” behaviour than Ruby/Python. You’ll find this highlighted well in the Camel book. One reason I use Python frequently, it always behaves as expected without so many subtle hiccups.

How much this pertains to the current matter, i.e. implementing STAMAN. Perl is the most universally available language, and I’m more prone to need such a feature than most people. A plus over Ruby is no crappy 1.8.x/1.9.x porting issues…! Of course however, I have a camel to ask about minor details, hehe. In my experience the Python 2k/3k thing is less issue than Ruby’s for writing code yourself, more of an issue in leveraging existing code.

So I reckon, that means Perl or Ruby is best called for here. I exclude Python, because I just use the frick’n thing to often.

I’ve just managed to import my journal entries from October 2009, that should just leave the majority of September, than my transition from Live Journal to Blogger, should ‘technically’ be complete once and for all!

Give or take 10-20some entries, I would say I’m approaching 1800 posts since I started keeping a journal back in ’06. Planning to celebrate my 2000th entry, if I ever notice it’s passing :-o.

Also took some time to move one of my older projects to github. Really, it’s kind of cool: I sat down and read about 1500 out of 2300+ lines of perl code and could still understand it nearly a year later. Paged through the remaining ~800LOC, which was mostly trivial elements. Someday I need to get the ‘uncommitted’ test suite committed and work on some cleanup, but it reads easily enough. I don’t claim to be a genius, but hell, most of peoples maintainability comments about Perl are either due to Perl 4 experiences or shotty programmers if you ask me. Sure, I enjoy an occasional game of golf, but I like writing code that tends to explain itself.

Perl, the worst thing I can say about it is autovivification isn’t all it’s cracked up to be, and you quickly learn how to skip reading error messages and just go proof read your syntax. That’s kind of a downside of perl, to track down errors in Perl code, you kind of need to learn proof reading ;).

I like using my brain more than a debugger, but kind of like compilers that report useful info. I ain’t met many that actually do.

more tpsh: control flow stuff

I’ve been trying to find a way of hooking in proper shell control flow keywords into tpsh, without uglifing the existing code. At the moment, tpsh understands executing singular command sequences, scripts, and a queue of command sequences. It’s fairly easy to modify the parsing/lexing portions to adjust the internal data structures IAW control flow keywords, the problem is how to handle execution phase.

Originally, I had in mind setting up nested data structures and doing a delayed execution: evaluate the control flow statement and modify the data, then execute the remaining commands (e.g. if CMD0; then CMD1; else CMD2; fi, would execute CMD when the statement needs to be evaluated at execution phase, then reshape the strucure so that only CMD1 or CMD2 remains, and then feed that back into the executor).

Last night, I had an interesting idea… on the case specific code generation.

Shell control flow basically amounts to very simple if, while, for, and case statements; and the more modern until and select statements. Normal execution patterns amount to using a single command sequence (e.g. cat f0 f1 f2 f3 | sort > f), or a queue of such command sequences. Why not replace that executor with a section of code, that understands how to handle those as well as control flow (etc), and then generate the desired code to execute the result.

Exempli gratia:

tpsh_cgen( ( [ 'if', 'test command' ],
[ 'then', 'other commands' ],
....
[ 'else', 'other other commands' ],
....
[ 'fi' ] )
)

might return something like:

sub { 
if (evaluate the 'test command' and test the exit status)
{
execute the 'other commands'
}
else {
execute the 'other other commands'
}
}

and so on and so forth; so that if we call the generated code ref, we have a set of code that will execute the correct commands, whatever they may be, and with quite a lot less fuss.

You know it’s time to take a walk when….

if (expr) {
do this only
} elese {
do this instead
}

executes both blocks, and it takes you a minute to realize, that despite using Perls strict and warnings pragmata, you are to tired to notice you wrote ‘elese’ instead of ‘else’

Oy vey, what a life.

tpsh: test of expand_quotes()

$ echo 'hi bye' foo "$USER" and "~" or ~
expand_quotes ': echo | hi bye | foo "$USER" and "~" or ~
expand_quotes ": foo | $USER | and "~" or ~
expand_quotes ": and | ~ | or ~
hi bye foo $USER and ~ or /usr/home/Terry
$

# note:
# the 2 spaces /displayed/ between hi and bye are a bug in
# tpsh; echo'ing things to file via I/O redirection works
# properly. "$USER" is not expanded because expand_parameters()
# still needs adjustments.
#

tpsh_parse invokes expand_quotes() to break up its input line based on the shells quoting rules; and proceeds to go about it’s business. tpsh_lex() then accepts the token buffer and begins building a new data structure from it. The tokens from tpsh_parse get analyzed and reassembled “on the quotes”, i.e. it will do it’s check on ‘hi ‘ and ‘bye’ and the rest as separate elements; then reassemble the argument vectors as an array reference: becoming ‘hi bye’ again. (id est quote expansions add escapes to tell the lex phase where to rejoin things) After everything is said and done between parse and lex, the queue like data structure is ready, the argument vectors contained there in are ready to be mapped onto resolve_cmd() calls for execution.

To hunt down any other booboos in the expand_quotes() subroutine, I’ve made it display it’s work, so I can see how it detects what when testing the shell. basically as “expand_quotes QUOTE: unquoted | quoted | remainder”.

As one can guess from what the above shell snippet implies: quoting is handled recursively. Because I’m used to languages with finite stack space and no reliable tail call optimizations; I almost never write recursive functions of any kind, whether they are tco’able or not. Algorithmically, expand_quotes() is a very simple procedure.

It expects to be called with an input line; and treats multiple arguments accordingly (for now). Internally a dispatch table and token stack are maintained; the table contains references to anonymous subroutines, to which the scanned elements are delegated to for the proper expansions.

If no quotes are detected on the line, return the result of expanding it with the default delegate (for unquoted text).

Otherwise break the line on the first set of (matching) quotes.

Any text defined before the beginning quote must be unquoted; apply the default expansion from from the table.

The text between the matching quotes is quoted, apply the appropriate expansion form the table (i.e. ‘, “, or `).

Any text remaining after the matching quotes may or may not be quoted; invoke expand_quotes() on the remainder to find out, and apply the result.

Each expansion applied is pushed onto the token stack in the escaped form it expanded to (i.e. “‘hi bye'” becomes “hi bye”), and the stack is returned to the caller once processing is completed.

With refactoring, the procedure could likely be made tail recursive but I don’t think perl does TCO. Either way, the users fingers or (likely) the machine generating the inputs should run out of stack space before tpsh could pop a cork at the number of quotes lol. An earlier design for expand_quotes() had more in common with finite state machines (in so far as I’ve seen them implemented), but was a lot more contorted then expand_quotes()’ present shape :-/.

Current bugs are handling nested escaped quotes or multiple empty quotes (the spliter) and removing unquoted quotes (addition to delegate sub for unquoted text).

# bugs in expand_quotes
$ echo 'foo "bar'
expand_quotes ': echo | foo "bar |
foo "bar
$ echo "foo "bar"
expand_quotes ": echo | foo | bar"
foo bar"
#
# correct result would have been equal to the previous command
#
$ echo '' "" '' "" '""' '' "" '"' "'"
expand_quotes ': echo | | "" '' "" '""' '' "" '"' "'"
expand_quotes ": " | '' | " '""' '' "" '"' "'"
expand_quotes ': " | "" | '' "" '"' "'"
expand_quotes ': ' | "" | "' "'"
expand_quotes ": "' | ' |
" '' " "" ' "" "' '
#
# correct result would have been: "" " '
# at least, that's how all bourne based shells I
# know about treat it; I would prefer: "" " '
# i.e. without leading whitespace.
#

For some reason this makes me curious, has anyone ever explained why shell syntax allows “”” but not ”’ ? (the results being ” and unclosed quote /or syntax error respectively)

When trying to solve a programming problem, generally I try the most simple solution before I try something more complex: and then evaluate a neater method. I consider the implications solutions have on efficiency, but that is trying to avoid shooting myself in the foot later, rather then trying to optimize the code for a machine.

Some how, I think expanding quotes is just naturally recursive in my crazy brain :-D.

EDIT


commit aeac14bd177a93b84c138a0c62e2cda49e5fe15c
Author: Terry <***snip***>
Date: Tue Apr 7 22:24:35 2009 +0000

bugfix: parameters now expand within quotes via expand_quotes and may be escaped

commit 089fda7cca0049dcabdc8b9659f94dcae417074b
Author: Terry <***snip***>

bugfix: escaped quotes witihn quotes and multiple quotes handled correctly

previous behaviour:

$ echo 'foo "bar'
expand_quotes ': echo | foo "bar |
foo "bar
$ echo "foo "bar"
expand_quotes ": echo | foo | bar"
foo bar"
$ echo '' "" '' "" '""' '' "" '"' "'"
expand_quotes ': echo | | "" '' "" '""' '' "" '"' "'"
expand_quotes ": " | '' | " '""' '' "" '"' "'"
expand_quotes ': " | "" | '' "" '"' "'"
expand_quotes ': ' | "" | "' "'"
expand_quotes ": "' | ' |
" '' " "" ' "" "' '
$

new behaviour:

$ echo 'foo "bar'
expand_quotes ': echo | foo "bar |
foo "bar
$ echo "foo "bar"
expand_quotes ": echo | foo "bar |
foo "bar
$ echo '' "" '' "" '""' '' "" '"' "'"
expand_quotes ': echo | | "" '' "" '""' '' "" '"' "'"
expand_quotes ": | | '' "" '""' '' "" '"' "'"
expand_quotes ': | | "" '""' '' "" '"' "'"
expand_quotes ": | | '""' '' "" '"' "'"
expand_quotes ': | "" | '' "" '"' "'"
expand_quotes ': | | "" '"' "'"
expand_quotes ": | | '"' "'"
expand_quotes ': | " | "'"
expand_quotes ": | ' |
"" " '
$

casual fun with the Perl profiler

call: dproffpp -ap shift.pl

there are three sets of data, each meant to represent a small, medium, or large set. Each set is a list of words, 5, 25, and 100 words long respectively. (realistically the elements would average within 2 to 5 words inclusive). For simplicity N is 10.

There are 2 functions, xx and yy; representing different ways of solving the same problem: pretty printing the last N items of a given data set. In xx(), the set is reversed and then $#set -= N’d to clip all but the last N items, then reversed again to put it back into proper order. In yy() we avoid any reversals and just shift off the front of the list one at a time, until we reach N items left in the set. If the set contains less then N elements, no adjustment need be done.

Each function is called 3 times per iteration, once with each data set, over 3000 iterations (that’s 9000 calls to each function, or 3000 times with each set). The test was then executed 10 times.

The things so simple, it’s not important how long it takes to finish, but I’m interested in how big the the difference is between for (expr1; expr2; expr3) { shift @list } and $#list -= expr; and how much those two reversals hurt.

Every time, yy() ran faster by at least a half second. Then ran tests with xx() doing one reversal, then no reversals and yy() still beat it.

Now out of more curiosity, let’s see how larger data sets work. Each data set now contains 3 words instead of 1, and N is now 43; with the data sets being 5, 25, 100, 250, 500, and 1000 elements long.

A new function, zz() which is xx() without the reversals is also executed during the tests. After running the tests a short duration, it seems that the $#set -= N’ing is a bit faster, more so then the cost of the reversals.

here’s the new run down:

$ do_test() {
> local T=10
> while [ $T -gt 0 ]; do
> N=$1 dprofpp -ap shift.pl | tail -n 7 >> dp
> T=$(($T - 1))
> done
> }
$ for NUM in `builtin echo "3n10n43n51n227n"`; do do_test $NUM; done

The above (z)sh code will execute the test on shift.pl 10 times with an N of 3, 10, 43, 51, and then 227; appending the report (3 subs = (4+3) lines) to the file ‘dp’ for post-cpu meltdown review, otherwise we would have to take a look at all the I/O the tests generate before the report is printed by dprofpp.

Yes, I’m to damn lazy to use command history, let along retype the commands each time; why else would they have invented functions and loops 😛

About 15 minutes and 17 degrees Celsius later, some of the arithmetic involved finally caught up with my throbbing head.

Recap:

  • 5 tests, each test has a different value of N
  • 10 runs of each test, meaning 50 runs
  • each run examines the data sets 3000 times, for 150,000 examinations
  • each examination calls 3 functions once with each of 6 data sets, 18 function calls per examination.
  • The six data sets consists of a list of 5, 25, 100, 250, 500, and 1,000 elements; each element is 3 words long. So like 1880 elements in the data set, and 5,640,000 elements processed per examination

So that is what, maybe 2,700,000 function calls to &xx, &yy, and &zz; without counting the calls within those functions… and 846,000,000,000 list elements processed overall? After a little estimation based on the original data set/run time, I stopped counting after the estimated execution time passed 8 hours * X on this old laptop. Hmm, how’s that old saying go, curiosity fried the geeks cooling fan? lol.

I’m beginning to understand why some peoples workstations have like 32 ~ 64 GB of ECC RAM, and Symmetrical Multiple Processor (SMP) configurations to drool $! for, haha!

How the code comes out, when you’re ready to pass out

# message for commit 041a7343eb452b827dbd97a0c82c8538597f86f6:
#
# read built-in command implemented


sub read_bin {

my ($prmpt, $time);
my $line = "";
my @argv = @_;
my %opts = ( 'p=s' => $prmpt,
't=s' => $time,
'e' => sub { "no-op" },
);

do_getopt(@argv, %opts);
unless (@argv) {
warn "I can't read into thin air!";
return 0;
}

if ($prmpt and -t *STDIN) {
print $prmpt;
}

eval {
# remove custom die for ease of error check/report
local %SIG;
$SIG{__DIE__} = sub { die @_ };
$SIG{ALRM} = sub { die "timed-outn" };
if ($time) {
# an s, m, or h suffix causes sleep for sec, min, or hour
#
if ($time =~ /^(d*)([smh])/) {
if ($2 eq 's') {
$time = $1;
} elsif ($2 eq 'm') {
$time = $1 * 60;
} elsif ($2 eq 'h') {
$time = $1 * 3600;
} else {
warn "internal error on ", __LINE__;
# NOTREACHED
}
}
alarm $time;
}
chomp($line = );
alarm 0 if $time;
};
if ($@) {
warn $@ unless $@ eq "timed-outn";
# on time out, init the vars to empty strings
@ENV{@argv} = ('') x scalar @argv;
return 0;
} else {
# set each var to the words
#
# XXX because ifsplit has no notion of a &split 'LIMIT'
# if we used ifsplit here instead of a manual split,
# read x y
# foo bar ham
# would set $y to 'bar' instead of 'bar ham'
#
my $ifs = defined $ENV{IFS} ? qr/[$ENV{IFS}]/ : qr/s/;
@ENV{@argv} = grep !/^$/, split $ifs, $line, scalar(@argv);
return 1;
}
}

from the manual page updated in commit b1317d7e6e7f91b6c3a2650f44cd4f425e381d42 with message:

read built-in command documented

and blockquoted here using the pod2html output

read [-p prompt] [-t timeout] variable …]

Read a line from standard input, split by fields, and assign each field to the indicated variables. If the number of variables is less then the number of fields, the remaining fields will be stored ‘as is’ in the last variable. If there are more variables then fields, the excess variables will be undefined. A prompt may be printed before reading input, by using the -p option. The -t option may be used to specify a timeout in which to abort the operation, should the user take their sweet time about pressing CR. The timeout value can take an optional s, m, or h suffix to denote seconds, minutes, or hours. If no suffix is given, s will be assumed.

It’s not the greatest… but hey, I ain’t had any sleep since this mornings roll out… lol. It’ll do fine for an initial implementation, until I’ve actually got a functioning brain to deal with it 😛