Sunday, October 17, 2010

The other day I was thinking about a young semi-student programmer that I know, and thought about presenting him a small set of "Teeth cutting" exercises. Small tasks that would serve a double purpose, help me evaluate his present aptitude for development tasks, and try and prepare him a wee bit for what his future education is likely to throw out. Unlike what seems to be the most common norm in college environments, I can also gently push in more, ahem, practical directions that what most students I've met have learned. I still have yet to find out if the number of stupid programmers on earth is due to the schooling or the students. Alas, that's drifting off topic.

When I stopped thinking about the whole teeth cutting thing, I had done so because no ideas of what to use as a starting exercise had come to mind. Today while chatting, one did: a bare bones version of the UNIX tail program.

(06:30:27 PM) Spidey01: A first exercise:
 language: your choice
 description: implement a program called 'tail' that displays the last N lines of a file, where N is supplied by the user. It need not be a GUI, but can be if you wish.
  A/ Minimise the scope your variables are accessible from.
  B/ Describe the procedure (algorithm) you came up with for finding the last N lines in the file.
  C/ Think and discuss, is there a way to improve on your algorithm?

Tail is complex enough that some C implementations are horrendously overcomplicated, yet simple enough that it is an easily completed without a gruelling mental challenge. Especially if the -n option is the only one you care about. The choice of A was chosen it's a very common foul up among programmers, young and old a like.

I wrote a more complex program that that in C years ago as a learning process, that was more or less a fusion of the unix cat, head, and tail programs. Since the student in question was using Visual Basic .NET (oi), I opted to use C# so as to keep things at least, in the same runtime family. Here is a listing of the example code I wrote, the display here was done by feeding it into gvim and using :TOhtml to get syntax highlighted HTML to post here, than clipping a few things, hehe. The gvim theme is github.

  1 /**
  2  * comments having // style, are notes to young readers.
  3  *
  4  * CAVEATS:
  5  *  line numbers are represented by int, and thus have a size limit imposed by
  6  *  the 32-bit integer representation of the CLR.  Whether the users computer
  7  *  will run out of memory before that is irrelevant.
  8  *
  9  *  If there are less lines read than requested by the user, all lines are
 10  *  displayed without error message. I chose this because the error message
 11  *  would be more annoying than useful.
 12  */
 14 using System;
 15 using System.IO;
 16 using System.Collections.Generic;
 18 class Tail {
 19     enum ExitCode { // overkill
 20         Success=0,
 21         Failure=1,
 22         NotFound=127,
 23     }
 25     static void Main(string[] args) {
 26         if (args.Length != 2) {
 27             usage();
 28         }
 30         using (var s = new StreamReader(args[1])) {
 31             try {
 32                 var n = Convert.ToInt32(args[0]);
 33                 foreach (var line in tail(n, s)) {
 34                     Console.WriteLine(line);
 35                 }
 36             } catch (FormatException) {
 37                 die(ExitCode.Failure,args[0] + " is not a usable line number");
 38             } catch (OverflowException) {
 39                 die(ExitCode.Failure, args[0] + " to big a number!");
 40             }
 41         }
 42     }
 44     static void usage() {
 45             Console.WriteLine("usage: tail.exe number file");
 46             Console.WriteLine("number = number of lines to display from "
 47                               +"end of file");
 48             Console.WriteLine("file = file to read from tail");
 49             Environment.Exit((int)ExitCode.Success);
 50     }
 52     // Instead of doing the display work itself, returns a sequence of lines
 53     // to be displayed. This means this function could be easily used to fill
 54     // in a textbox in a GUI.
 55     //
 56     // It could also take a delegate object to do the display work, thus
 57     // improving runtime performance but that would be less flexible.  In this
 58     // particular programs case, just doing Console.WriteLine() itself would
 59     // be OK. See the foreach loop over tail() up in Main() for reference.
 60     //
 61     // This method also sucks up memory like a filthy whore because it stores
 62     // the whole file in memory as a IList<T>.  That's fine for a quick and
 63     // dirty protype. In real life, this should use a string[] array of length
 64     // 'n' and only store that many lines. That way it could handle files 5
 65     // billion lines long just as efficently as files 5 lines long.
 66     //
 67     // I chose not to make that change in this example, in order to make the
 68     // code as simple to read as possible.
 69     //
 70     // Incremental development +  code review = good idea.
 71     //
 72     static IEnumerable<string> tail(int n, TextReader s) {
 73         string line;
 74         var list = new List<string>();
 76         try {
 77             while ((line = s.ReadLine()) != null) {
 78                 list.Add(line);
 79             }
 80         } catch (OutOfMemoryException) {
 81             die(ExitCode.Failure, "out of memory");
 82         } catch (IOException) {
 83             die(ExitCode.Failure, "error reading from file");
 84         }
 86         if (n > list.Count) {  // a smart bounds check!
 87             n = list.Count;
 88         }
 90         // implecations of a GetRange() using a shallow copy rather than a
 91         // deep copy, are left as an exercise to the reader.
 92         return list.GetRange(list.Count - n, n);
 93     }
 95     static void die(ExitCode e, string message) {
 96         Console.Error.WriteLine("tail.exe: " + message);
 97         Environment.Exit((int)e);
 98     }
 99 }

This is a backtrace of the development process involved.

The program started simple: with the Main() method in Tail. The first thing I did was a simple check to see if args.Length was == 0, and exiting with a usage message. Then I remembered while writing the Console.WriteLine() calls for the usage message, that I really wanted (exactly) two arguments. That's how the test became what's written above in the code listing. A couple minutes later I moved the usage message code from inside that if statement to a usage() method. I did that to keep the Main() method more concise: unless you have reasons to groan over function call overhead, doing that is often a good idea. (Up to a point that is.)

From the get go, I knew I wanted the meat and potatoes to be in a method named tail() instead of Main(). For the same reasons that I created usage(). So that because a short using statement over a new StreamReader object.

First up was converting the string arg[0] from a string representation of a number, to an integeral representation of a number. At first I used uint (Unsigned Integer, 32-bit length) but later decided to make it plane int (Signed Integer, 32-bit length) because that's what subscripts into a collection are defined in terms of. I don't care if the user wants to display more than ~2,147,483,647 lines, it's only an example program, damn it! Because tail() shouldn't give a fuck about converting the programs argument vector to a type it can use (which obviously needs to be numeric), the caller (Main()) does the conversion. First tried args[0].ToUInt32() and when that compiled with errors, I hit Google. That gave me the System.Convert documentation on MSDN, from which it was easy to find the proper method. Because MSDN lists what exceptions System.Convert.ToInt32 can throw, and I know from experience that testing for such things is necessary ^_^, I quickly stubbed out catch clauses for FormatException and OverflowException. I wrote a simple set of messages to the standard output stream and an exit for each one. Than I converted them to using the standard error stream and wrote an enum called ErrorCodes, complete with casts to int when needed.

It was about this time, that I decided that implementing a simple method like Perls die() or BSDs err() would be convenient. Thus I implemented die() and replaced the repetitive error code. Functions are almost like a reusable template in that way. Then I decided that ExitCode was a better than for the enumeration than ErrorCodes, since it was being used more generally as an exit status (code) than an error report; unlike Microsoft I do not consider Success to be an error code ;). That was a simple global search and replace, or :%s/ErrorCodes/ExitCode/g in vim. Followed by a quick write (save) and recompile to test. Job done.

While I was at it, I also had an intentional bug encoded into the exception handlers for Convert, originally n variable was in a higher scope than the Convert (the using instead of try block). The error message for handling FormatException, used n.ToString() and the one for OverflowException used args[0]. The bug here was a subtle food for thought: one displays the result of the conversion, which might not match what the user supplied -> thus confusing the user. The other displayed what the user entered, which might not be what the program had tried to used. That also pushes an interesting thought on your stack, since the same data is used by both die()'s why do we have to write and maintain it twice? Alas, I realised the n variable was in too wide a scope and thus made that mind-play a moot point (by removing n from the scope of the catch statements). If you recall: using minimal scope for variables was actually the intent of the exercise, not error handling and code reuse.

Next I focused on implementing tail(). At first it was a simple. Just take a number and a StreamReader, and do a little loop over reading lines—for a quick test. When I checked the documentation on MSDN, I noticed that StreamReader was an implementation rather than a base class for TextReader. I always find that weird, but that's outside the scope of this journal entry. Thus I made the using statement in Main() create a StreamReader and pass it to tail(), now taking a TextReader. Originally it also had a void return type, and simply printed out its data. I did that to make testing easier. The comments above make a sufficient explanation of why IEnumerable is used, and what I've already written about StreamReader/TextReader may suggest why it doesn't return a straight string[] (e.g. array of strings).

The heart of it of course, is just feeding lines from a file into a generic List of strings. Since the exceptional possibilities are more straightforward, I wrote the catch blogs first. After that it is merely the question of extracting the correct lines from the tail end of the list. That's a simple one to one (1:1) abstraction to how you might do it manually. I believe simple is the best way to make a prototype. Since the student in question was joking about how his implementation would likely crash if the line numbers were out of whack from what's really in the file, I was sure to include a simple check. If the # of lines requested is greater than what really is there, just scale down. Volia. The comments at the top of the listing above, show why there is no error message displayed.

Extracting the items was a bit more of a question, my first implementation was a simple C-style for loop over the list using Console.WriteLine(). In the conversion to returning the data to be displaced, in which the tail() call in Main() became the above foreach loop. I added the comment about GetRange() more so as food for thought (from a code reuse and optimizational perspective). The math needed to extract the correct range of lines is trivial.

I then took a few moments to look at things over, doing a sort of code review. A few things were rearranged for clarity. I also introduced a bug, breaking the specification goals. If you look close enough at tail(), you will see that the variable line is only used inside the try block, yet it is declared at method scope. The #1 goal of the exercise was to avoid such things, hehe. I also thought about adjusting things to use an n sized cache of lines, rather than slurping the entire file in memory but decided against it. To keep the code easier to read, since the target-reader knows neather C# nor a lot of programming stuff, I just left comments noting that pro and contra of the matter.

Some people might find the method naming style odd for C#, but it's one that I've come to like, thanks to Go and C#. I.e. publicly exposed functions get NamesLikeThis and those that ain't, get namesLikeThis. Although personally I prefer C style names_like_this, aesthetically speaking.

The test file I used during the run was this:

line one
line two
line three
line four
line five

and most tests were done using various adjustments on:

terry@dixie$ gmcs tail.cs && mono tail.cs 2 test.txt

After sending the files over, I also whipped up a Visual Studio solution in MonoDevelop, and than realised that I left a rather unprofessional bug. If the filename in args[1] didn't exist, the program would crash. That was easily fixed on the fly.

Overall the program took about an hour and a half to write. For such a simple program, that's actually kind of a scar on my pride lol. But hey, I've barely written any code this month and I had to look up most of the system library calls in MSDN as I went along, I also tried to make it more polished than your typical example code. Which usually smells.

I can also think of a few ways to incrementally adopt that first exercise, into several other exercises. That might be useful.

No comments:

Post a Comment