Workaround for flask/babel/sphinx bug on Python 3+

I’m using Python 3.4 on Windows. Recently I tried to install and use Sphinx.  When I did, I encountered an error that ended with the string

an integer is required (got type str)

Googling that string, I found an explanation of the problem on stackoverflow, HERE. As Andy Skirrow wrote on August 3, 2015, this is a bug in the current distribution of babel.

a pickled file babel/global.dat is included in the package and this can’t be read by python 3 because it was created by a script running under python 2.

The problem (as I understand it) is that Python 2.x pickles/unpickles datetime objects as ASCII strings, but Python 3.x pickles/unpickles them as Unicode strings.
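As an aside, Python 3’s pickle.load accepts an encoding argument that can often read such Python 2 pickles directly. Here is a minimal sketch of the idea; to keep it runnable anywhere, it round-trips a date through a protocol-2 pickle in a single process rather than reading babel’s actual file:

```python
import datetime
import io
import pickle

# Round-trip a date through a protocol-2 pickle (the protocol family
# Python 2 used), then unpickle it with encoding="latin1" -- the
# setting usually needed when a pickle containing datetime objects
# really was written by Python 2.
buf = io.BytesIO()
pickle.dump(datetime.date(2015, 8, 3), buf, protocol=2)
buf.seek(0)
obj = pickle.load(buf, encoding="latin1")
print(obj)  # 2015-08-03
```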

The babel folks are working on it. But until they fix it, I needed a really simple-minded solution.  I needed something that would fit my brain.

Fortunately, I still had Python 2.7 installed on my PC.  So here is what I did.

  1. I went into the Python34 site-packages/babel folder, found the global.dat file, and copied it into the Python27 folder.
  2. I wrote a program fix_a to unpickle (load) the global.dat file and print it as a string. I saved this program in the Python27 folder, and ran it under Python 2.7.
  3. I wrote a program fix_b that imported the datetime module and re-pickled the string into a file named global.dat. I saved this program in the Python34 folder, and ran it under Python 3.4.
  4. I copied the new global.dat file over the original global.dat file in the Python34 site-packages/babel folder.

It worked.  Sphinx is now working fine.

The text of fix_a is:

import pickle
f = open("global.dat", "rb")
obj = pickle.load(f)
f.close()
print repr(obj)   # Python 2 print statement; output goes to stdout

and I ran it this way, from inside the Python27 folder:

python -m fix_a > junk.txt

The text of fix_b is:

import datetime
import pickle
d = [copied the text of junk.txt here]
f = open("global.dat", "wb")
pickle.dump(d, f)
f.close()

and I ran it this way, from inside the Python34 folder:

python -m fix_b

To save you some trouble, here is the Python 3.4 global.dat file that I made.  Because WordPress wouldn’t allow me to upload it with a “dat” extension, it has a “doc” extension.  When you download it, you should rename it and give it a “dat” extension.



enum in Python

Recently I was reading a post by Eli Bendersky (one of my favorite bloggers) and I ran across a sentence in which Eli says “It’s a shame Python still doesn’t have a functional enum type, isn’t it?”

The comment startled me because I had always thought that it was obvious how to do enums in Python, and that it was obvious that you don’t need any special language features to do it. Eli’s comment made me think that I might need to do a reality-check on my sense of what was and was not obvious about enums in Python.

So I googled around a bit and found that there are a lot of different ideas about how to do enums in Python. I found a very large set of suggestions on StackOverflow here and here and here. There is a short set of suggestions on Python Examples. The ActiveState Python Cookbook has a long recipe, and PEP-354 is a short proposal (that has been rejected). Surprisingly, I found only a couple of posts that suggested what had seemed to me to be THE obvious solution. The clearest was by snakile on StackOverflow.

Anyway, to end the suspense, the answer that seemed to me so obvious was this. An enum is an enumerated data type. An enumerated data type is a type, and a type is a class.

class           Color : pass
class Red      (Color): pass
class Yellow   (Color): pass
class Blue     (Color): pass

Which allows you to do things like this.

class Toy: pass

myToy = Toy()

myToy.color = "blue"  # note we assign a string, not an enum

if myToy.color not in (Red, Yellow, Blue):
    print("My toy has no color!!!")    # produces:  My toy has no color!!!

myToy.color = Blue   # note we use an enum

print("myToy.color is", myToy.color.__name__)  # produces: myToy.color is Blue
print("myToy.color is", myToy.color)           # produces: myToy.color is <class '__main__.Blue'>

if myToy.color is Blue:
    myToy.color = Red

if myToy.color is Red:
    print("my toy is red")   # produces: my toy is red
else:
    print("I don't know what color my toy is.")

So that’s what I came up with.

But with so many intelligent people all trying to answer the same question, and coming up with such a wide array of different answers, I had to fall back and ask myself a few questions.

  • Why am I seeing so many different answers to what seems like a simple question?
  • Is there one right answer? If so, what is it?
  • What is the best — or most widely-used, or most pythonic — way to do enums in Python?
  • Is the question really as simple as it seems?

For me, the jury is still out on most of these questions, but until they return with a verdict I have come up with two thoughts on the subject.

First, I think that many programmers come to Python with backgrounds in other languages — C or C++, Java, etc. Their experiences with other languages shape their conceptions of what an enum — an enumerated data type — is. And when they ask “How can I do enums in Python?” they’re asking a question like the question that sparked the longest thread of answers on StackOverflow:

I’m mainly a C# developer, but I’m currently working on a project in Python. What’s the best way to implement the equivalent of an enum [i.e. a C# enum] in Python?

So naturally, the question “How can I implement in Python the equivalent of the kind of enums that I’m familiar with in language X?” has at least as many answers as there are values of X.

My second thought is somewhat related to the first.

Python developers believe in duck typing. So a Python developer’s first instinct is not to ask you:

What do you mean by “enum”?

A Python developer’s first instinct is to ask you:

What kinds of things do you think an “enum” should be able to do?
What kinds of things do you think you should be able to do with an “enum”?

And I think that different developers probably have very different ideas about what one should be able to do with an “enum”. Naturally, that leads them to propose different ways of implementing enums in Python.

As a simple example, consider the question — Should you be able to sort enums?

My personal inclination is to say that — in the most conceptually pure sense of “enum” — the concept of sorting enums makes no sense. And my suggestion for implementing enums in Python reflects this. Suppose you implement a “Color” enum using the technique that I’ve proposed, and then try to sort enums.

# how do enumerated values sort?
colors = [Red, Yellow, Blue]
colors.sort()
for color in colors:
    print(color.__name__)

What you get is this:

Traceback (most recent call last):
  File "C:/Users/ferg_s/pydev/enumerated_data_types/", line 32, in <module>
TypeError: unorderable types: type() < type()

So that suits me just fine.

But I can easily imagine someone (myself?) working with an enum for, say, Weekdays (Sunday, Monday, Tuesday… Saturday). And I think it might be reasonable in that situation to want to be able to sort Weekdays and to do greater than and less than comparisons on them.
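For instance, here is one sketch (only one of many possible implementations, and not the technique I proposed above) of a sortable Weekday-style enum, built from plain instances that carry an ordering value:

```python
import functools

@functools.total_ordering
class Weekday:
    """An enum-like value that knows its position in the week."""
    def __init__(self, name, order):
        self.name = name
        self.order = order
    def __eq__(self, other):
        return isinstance(other, Weekday) and self.order == other.order
    def __lt__(self, other):
        return self.order < other.order
    def __repr__(self):
        return self.name

Sunday = Weekday("Sunday", 0)
Monday = Weekday("Monday", 1)
Tuesday = Weekday("Tuesday", 2)

print(sorted([Tuesday, Sunday, Monday]))  # [Sunday, Monday, Tuesday]
```

functools.total_ordering fills in the remaining comparison methods from __eq__ and __lt__, so these values can be sorted and compared with <, <=, >, and >=.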

So if we’re talking duck typing, I’m happy with enums/ducks that are motionless and silent. My only requirement is that they be different from everything else and different from each other. But I can easily imagine situations where one might reasonably need/want/prefer ducks that can form a conga line, dance, and sing a few bars. And for those situations, you obviously need more elaborate implementations of enums.

So, with these thoughts in mind, I’m inclined to think that there is no single, best way to implement an enum in Python. The concept of an enum is flexible enough to cover a variety of implementations offering a variety of features.


Python Decorators

In August 2009, I wrote a post titled Introduction to Python Decorators. It was an attempt to explain Python decorators in a way that I (and I hoped, others) could grok.

Recently I had occasion to re-read that post. It wasn’t a pleasant experience — it was pretty clear to me that the attempt had failed.

That failure — and two other things — have prompted me to try again.

  • Matt Harrison has published an excellent e-book Guide to: Learning Python Decorators.
  • I now have a theory about why most explanations of decorators (mine included) fail, and some ideas about how better to structure an introduction to decorators.

There is an old saying to the effect that “Every stick has two ends, one by which it may be picked up, and one by which it may not.” I believe that most explanations of decorators fail because they pick up the stick by the wrong end.

In this post I will show you what the wrong end of the stick looks like, and point out why I think it is wrong. And I will show you what I think the right end of the stick looks like.


The wrong way to explain decorators

Most explanations of Python decorators start with an example of a function to be decorated, like this:

def aFunction():
    print("inside aFunction")

and then add a decoration line, which starts with an @ sign:

@aDecorator
def aFunction():
    print("inside aFunction")

At this point, the author of the introduction often defines a decorator as the line of code that begins with the “@”. (In my older post, I called such lines “annotation” lines. I now prefer the term “decoration” line.)

For instance, in 2008 Bruce Eckel wrote on his Artima blog

A function decorator is applied to a function definition by placing it on the line before that function definition begins.

and in 2004, Phillip Eby wrote in an article in Dr. Dobb’s Journal

Decorators may appear before any function definition…. You can even stack multiple decorators on the same function definition, one per line.

Now there are two things wrong with this approach to explaining decorators. The first is that the explanation begins in the wrong place. It starts with an example of a function to be decorated and a decoration line, when it should begin with the decorator itself. The explanation should end, not start, with the decorated function and the decoration line. The decoration line is, after all, merely syntactic sugar — it is not at all an essential element in the concept of a decorator.

The second is that the term “decorator” is used incorrectly (or ambiguously) to refer both to the decorator and to the decoration line. For example, in his Dr. Dobb’s Journal article, after using the term “decorator” to refer to the decoration line, Phillip Eby goes on to define a “decorator” as a callable object.

But before you can do that, you first need to have some decorators to stack. A decorator is a callable object (like a function) that accepts one argument—the function being decorated.

So… it would seem that a decorator is both a callable object (like a function) and a single line of code that can appear before the line of code that begins a function definition. This is sort of like saying that an “address” is both a building (or apartment) at a specific location and a set of lines (written in pencil or ink) on the front of a mailing envelope. The ambiguity may be almost invisible to someone familiar with decorators, but it is very confusing for a reader who is trying to learn about decorators from the ground up.


The right way to explain decorators

So how should we explain decorators?

Well, we start with the decorator, not the function to be decorated.

We start with the basic notion of a function — a function is something that generates a value based on the values of its arguments.

We note that in Python, functions are first-class objects, so they can be passed around like other values (strings, integers, objects, etc.).

We note that because functions are first-class objects in Python, we can write functions that both (a) accept function objects as argument values, and (b) return function objects as return values. For example, here is a function foobar that accepts a function object original_function as an argument and returns a function object new_function as a result.

def foobar(original_function):

    # make a new function
    def new_function():
        pass   # some code

    return new_function

We define “decorator”.

A decorator is a function (such as foobar in the above example) that takes a function object as an argument, and returns a function object as a return value.

So there we have it — the definition of a decorator. Anything else that we say about decorators is a refinement of, or an expansion of, or an addition to, this definition of a decorator.

We show what the internals of a decorator look like. Specifically, we show different ways that a decorator can use the original_function in the creation of the new_function. Here is a simple example.

def verbose(original_function):

    # make a new function that prints a message when original_function starts and finishes
    def new_function(*args, **kwargs):
        print("Entering", original_function.__name__)
        original_function(*args, **kwargs)
        print("Exiting ", original_function.__name__)

    return new_function

We show how to invoke a decorator — how we can pass into a decorator one function object (its input) and get back from it a different function object (its output). In the following example, we pass the widget_func function object to the verbose decorator, and we get back a new function object to which we assign the name talkative_widget_func.

def widget_func():
    pass   # some code

talkative_widget_func = verbose(widget_func)

We point out that decorators are often used to add features to the original_function. Or more precisely, decorators are often used to create a new_function that does roughly what original_function does, but also does things in addition to what original_function does.

And we note that the output of a decorator is typically used to replace the original function that we passed in to the decorator as an argument. A typical use of decorators looks like this. (Note the change in the last line from the previous example.)

def widget_func():
    pass   # some code

widget_func = verbose(widget_func)

So for all practical purposes, in a typical use of a decorator we pass a function (widget_func) through a decorator (verbose) and get back an enhanced (or souped-up, or “decorated”) version of the function.

We introduce Python’s “decoration syntax” that uses the “@” to create decoration lines. This feature is basically syntactic sugar that makes it possible to re-write our last example this way:

@verbose
def widget_func():
    pass   # some code

The result of this example is exactly the same as the previous example — after it executes, we have a widget_func that has all of the functionality of the original widget_func, plus the functionality that was added by the verbose decorator.

Note that in this way of explaining decorators, the “@” and decoration syntax is one of the last things that we introduce, not one of the first.

And we absolutely do not refer to the “@verbose” line as a “decorator”. We might refer to it as, say, a “decorator invocation line” or a “decoration line” or simply a “decoration”… whatever. But that line is not a “decorator”.

The decoration line is a line of code. A decorator is a function — a different animal altogether.


Once we’ve nailed down these basics, there are a few advanced features to be covered.

  • We explain that a decorator need not be a function (it can be any sort of callable, e.g. a class).
  • We explain how decorators can be nested within other decorators.
  • We explain how decoration lines can be “stacked”. A better way to put it would be: we explain how decorators can be “chained”.
  • We explain how additional arguments can be passed to decorators, and how decorators can use them.
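To illustrate just the stacking/chaining point, here is a small made-up example; the decorator names shout and exclaim are mine, invented for the illustration:

```python
def shout(original_function):
    # wrap: upper-case whatever the wrapped function returns
    def new_function(*args, **kwargs):
        return original_function(*args, **kwargs).upper()
    return new_function

def exclaim(original_function):
    # wrap: append "!" to whatever the wrapped function returns
    def new_function(*args, **kwargs):
        return original_function(*args, **kwargs) + "!"
    return new_function

@shout
@exclaim
def greet(name):
    return "hello, " + name

# the stacked decoration lines are equivalent to:
#     greet = shout(exclaim(greet))
print(greet("world"))  # HELLO, WORLD!
```

The decorator nearest the function definition is applied first, which is why the exclamation point is added before the string is upper-cased.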

A decorators cookbook

The material that we’ve covered up to this point is what any basic introduction to Python decorators would cover. But a Python programmer needs something more in order to be productive with decorators. He (or she) needs a catalog of recipes, patterns, examples, and commentary that describes / shows / explains when and how decorators can be used to accomplish specific tasks. (Ideally, such a catalog would also include examples and warnings about decorator gotchas and anti-patterns.) Such a catalog might be called “Python Decorator Cookbook” or perhaps “Python Decorator Patterns”.

So that’s it. I’ve described what I think is wrong (well, let’s say suboptimal) about most introductions to decorators. And I’ve sketched out what I think is a better way to structure an introduction to decorators.

Now I can explain why I like Matt Harrison’s e-book Guide to: Learning Python Decorators. Matt’s introduction is structured in the way that I think an introduction to decorators should be structured. It picks up the stick by the proper end.

The first two-thirds of the Guide hardly talk about decorators at all. Instead, Matt begins with a thorough discussion of how Python functions work. By the time the discussion gets to decorators, we have been given a strong understanding of the internal mechanics of functions. And since most decorators are functions (remember our definition of decorator), at that point it is relatively easy for Matt to explain the internal mechanics of decorators.

Which is just as it should be.

Revised 2012-11-26 — replaced the word “annotation” with “decoration”, following terminology ideas discussed in the comments.


Unicode – the basics

An introduction to the basics of Unicode, distilled from several earlier posts. In the interests of presenting the big picture, I have painted with a broad brush — large areas are summarized; nits are not picked; hairs are not split; wind resistance is ignored.

Unicode = one character set, plus several encodings

Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.

  • The first — the idea of a character set — has absolutely nothing to do with computers.
  • The second — the idea of encodings for the Unicode character set — has everything to do with computers.

Character sets

The idea of a character set has nothing to do with computers. So let’s suppose that you’re a British linguist living in, say, 1750. The British Empire is expanding and Europeans are discovering many new languages, both living and dead. You’ve known about Chinese characters for a long time, and you’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.

Trying to deal with this huge mass of different characters, you get a brilliant idea — you will make a numbered list of every character in every language that ever existed.

You start your list with your own familiar set of English characters — the upper- and lower-case letters, the numeric digits, and the various punctuation marks like period (full stop), comma, exclamation mark, and so on. And the space character, of course.

01 a
02 b
03 c
26 z
27 A
28 B
52 Z
53 0
54 1
55 2
62 9
63 (space)
64 ? (question mark)
65 , (comma)
... and so on ...

Then you add the Spanish, French and German characters with tildes, accents, and umlauts. You add characters from other living languages — Greek, Japanese, Chinese, Korean, Sanskrit, Arabic, Hebrew, and so on. You add characters from dead alphabets — Assyrian cuneiform — and so on, until finally you have a very long list of characters.

  • What you have created — a numbered list of characters — is known as a character set.
  • The numbers in the list — the numeric identifiers of the characters in the character set — are called code points.
  • And because your list is meant to include every character that ever existed, you call your character set the Universal Character Set.

Congratulations! You’ve just invented (something similar to) the first half of Unicode — the Universal Character Set or UCS.
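In modern Python, the built-in ord and chr functions expose exactly this mapping between characters and code points:

```python
# a character's code point, and back again
print(ord("A"))        # 65
print(hex(ord("€")))   # 0x20ac
print(chr(0x20AC))     # €
```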


Now suppose you jump into your time machine and zip forward to the present. Everybody is using computers. You have a brilliant idea. You will devise a way for computers to handle UCS.

You know that computers think in ones and zeros — bits — and collections of 8 bits — bytes. So you look at the biggest number in your UCS and ask yourself: How many bytes will I need to store a number that big? The answer you come up with is 4 bytes, 32 bits. So you decide on a simple and straight-forward digital implementation of UCS — each number will be stored in 4 bytes. That is, you choose a fixed-length encoding in which every UCS character (code point) can be represented, or encoded, in exactly 4 bytes, or 32 bits.

In short, you devise the Unicode UCS-4 (Universal Character Set, 4 bytes) encoding, aka UTF-32 (Unicode Transformation Format, 32 bits).
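In Python you can see the fixed width directly; every character costs four bytes in UTF-32, no matter how small its code point (the big-endian variant is used here so that no byte order mark is added):

```python
# every code point occupies exactly 4 bytes in UTF-32
for ch in ["A", "é", "中"]:
    print(ch, len(ch.encode("utf-32-be")))  # 4, 4, 4
```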

UTF-8 and variable-length encodings

UCS-4 is simple and straight-forward… but inefficient. Computers send a lot of strings back and forth, and many of those strings use only ASCII characters — characters from the old ASCII character set. One byte — eight bits — is more than enough to store such characters. It is grossly inefficient to use 4 bytes to store an ASCII character.

The key to the solution is to remember that a code point is nothing but a number (an integer). It may be a short number or a long number, but it is only a number. We need just one byte to store the shorter numbers of the Universal Character Set, and we need more bytes only when the numbers get longer. So the solution to our problem is a variable-length encoding.

Specifically, Unicode’s UTF-8 (Unicode Transformation Format, 8 bit) is a variable-length encoding in which each UCS code point is encoded using 1, 2, 3, or 4 bytes, as necessary.

In UTF-8, if the first bit of a byte is a “0”, then the remaining 7 bits of the byte contain one of the 128 original 7-bit ASCII characters. If the first bit of the byte is a “1” then the byte is the first of multiple bytes used to represent the code point, and other bits of the byte carry other information, such as the total number of bytes — 2, or 3, or 4 bytes — that are being used to represent the code point. (For a quick overview of how this works at the bit level, see How does UTF-8 “variable-width encoding” work?)
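Python makes it easy to watch the variable-length behavior in action:

```python
# UTF-8 spends only as many bytes as each code point needs
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, 4
```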

Just use UTF-8

UTF-8 is a great technology, which is why it has become the de facto standard for encoding Unicode text, and is the most widely-used text encoding in the world. Text strings that use only ASCII characters can be encoded in UTF-8 using only one byte per character, which is very efficient. And if characters — Chinese or Japanese characters, for instance — require multiple bytes, well, UTF-8 can do that, too.

Byte Order Mark

Unicode fixed-length multi-byte encodings such as UTF-16 and UTF-32 store UCS code points (integers) in multi-byte chunks — 2-byte chunks in the case of UTF-16 and 4-byte chunks in the case of UTF-32.

Unfortunately, different computer architectures — basically, different processor chips — use different techniques for storing such multi-byte integers. In “little-endian” computers, the “little” (least significant) byte of a multi-byte integer is stored leftmost. “Big-endian” computers do the reverse; the “big” (most significant) byte is stored leftmost.

  • Intel computers are little-endian.
  • Motorola computers are big-endian.
  • Microsoft Windows was designed around a little-endian architecture — it runs only on little-endian computers or computers running in little-endian mode — which is why Intel hardware and Microsoft software fit together like hand and glove.

Differences in endian-ness can create data-exchange issues between computers. Specifically, the possibility of differences in endian-ness means that if two computers need to exchange a string of text data, and that string is encoded in a Unicode fixed-length multi-byte encoding such as UTF-16 or UTF-32, the string should begin with a Byte Order Mark (or BOM) — a special character at the beginning of the string that indicates the endian-ness of the string.

Strings encoded in UTF-8 don’t require a BOM, so the BOM is basically a non-issue for programmers who use only UTF-8.
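A quick Python check shows the BOM appearing in UTF-16 output and absent from UTF-8. (Whether the BOM comes out as \xff\xfe or \xfe\xff depends on your machine’s endian-ness.)

```python
text = "hi"
utf16 = text.encode("utf-16")   # BOM prepended in platform-native order
print(utf16[:2])                # b'\xff\xfe' on little-endian machines
print(text.encode("utf-8"))     # b'hi' -- no BOM needed
```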



Python’s magic methods

Here are some links to documentation of Python’s magic methods, aka special methods, aka “dunder” (double underscore) methods.

There are also a few other Python features that are sometimes characterized as “magic”.
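As a quick taste of what magic methods do, here is a tiny made-up class (the Money name and behavior are mine, purely for illustration) whose arithmetic, comparison, and printing are all driven by dunder methods:

```python
class Money:
    def __init__(self, cents):
        self.cents = cents
    def __add__(self, other):    # drives the + operator
        return Money(self.cents + other.cents)
    def __eq__(self, other):     # drives the == operator
        return isinstance(other, Money) and self.cents == other.cents
    def __repr__(self):          # drives how the object prints
        return "Money(%d)" % self.cents

print(Money(150) + Money(50))    # Money(200)
print(Money(200) == Money(200))  # True
```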

I’m sure there are other useful Web pages about magic methods that I haven’t found. If you know of one (and feel like sharing it) note that you can code HTML tags into a WordPress comment, like this, and they will show up properly formatted:

I found a useful discussion of magic methods at
<a href=""></a>



Gotcha — Mutable default arguments

Goto start of series

Note: examples are coded in Python 2.x, but the basic point of the post applies to all versions of Python.

There’s a Python gotcha that bites everybody as they learn Python. In fact, I think it was Tim Peters who suggested that every programmer gets caught by it exactly two times. It is called the mutable defaults trap. Programmers are usually bitten by the mutable defaults trap when coding class methods, but I’d like to begin by explaining it in functions, and then move on to talk about class methods.

Mutable defaults for function arguments

The gotcha occurs when you are coding default values for the arguments to a function or a method. Here is an example for a function named foobar:

def foobar(arg_string="abc", arg_list=[]):
    pass

Here’s what most beginning Python programmers believe will happen when foobar is called without any arguments:

A new string object containing “abc” will be created and bound to the “arg_string” variable name. A new, empty list object will be created and bound to the “arg_list” variable name. In short, if the arguments are omitted by the caller, the foobar will always get “abc” and [] in its arguments.

This, however, is not what will happen. Here’s why.

The objects that provide the default values are not created at the time that foobar is called. They are created at the time that the statement that defines the function is executed. (See the discussion at Default arguments in Python: two easy blunders: “Expressions in default arguments are calculated when the function is defined, not when it’s called.”)

If foobar, for example, is contained in a module named foo_module, then the statement that defines foobar will probably be executed at the time when foo_module is imported.

When the def statement that creates foobar is executed:

  • A new function object is created, bound to the name foobar, and stored in the namespace of foo_module.
  • Within the foobar function object, for each argument with a default value, an object is created to hold the default object. In the case of foobar, a string object containing “abc” is created as the default for the arg_string argument, and an empty list object is created as the default for the arg_list argument.

After that, whenever foobar is called without arguments, arg_string will be bound to the default string object, and arg_list will be bound to the default list object. In such a case, arg_string will always be “abc”, but arg_list may or may not be an empty list. Here’s why.

There is a crucial difference between a string object and a list object. A string object is immutable, whereas a list object is mutable. That means that the default for arg_string can never be changed, but the default for arg_list can be changed.

Let’s see how the default for arg_list can be changed. Here is a program. It invokes foobar four times. Each time that foobar is invoked it displays the values of the arguments that it receives, then adds something to each of the arguments.

def foobar(arg_string="abc", arg_list=[]):
    print arg_string, arg_list
    arg_string = arg_string + "xyz"
    arg_list.append("F")

for i in range(4):
    foobar()

The output of this program is:

abc [] 
abc ['F'] 
abc ['F', 'F'] 
abc ['F', 'F', 'F']

As you can see, the first time through, the arguments have exactly the defaults that we expect. On the second and all subsequent passes, the arg_string value remains unchanged — just what we would expect from an immutable object. The line

arg_string = arg_string + "xyz"

creates a new object — the string “abcxyz” — and binds the name “arg_string” to that new object, but it doesn’t change the default object for the arg_string argument.

But the case is quite different with arg_list, whose value is a list — a mutable object. On each pass, we append a member to the list, and the list grows. On the fourth invocation of foobar — that is, after three earlier invocations — arg_list contains three members.

The Solution
This behavior is not a wart in the Python language. It really is a feature, not a bug. There are times when you really do want to use mutable default arguments. One thing they can do (for example) is retain a list of results from previous invocations, something that might be very handy.

But for most programmers — especially beginning Pythonistas — this behavior is a gotcha. So for most cases we adopt the following rules.

  1. Never use a mutable object — that is: a list, a dictionary, or a class instance — as the default value of an argument.
  2. Ignore rule 1 only if you really, really, REALLY know what you’re doing.

So… we plan always to follow rule #1. Now, the question is how to do it… how to code foobar in order to get the behavior that we want.

Fortunately, the solution is straightforward. The mutable objects used as defaults are replaced by None, and then the arguments are tested for None.

def foobar(arg_string="abc", arg_list = None): 
    if arg_list is None: arg_list = [] 

Another solution that you will sometimes see is this:

def foobar(arg_string="abc", arg_list=None): 
    arg_list = arg_list or [] 

This solution, however, is not equivalent to the first, and should be avoided. See Learning Python p. 123 for a discussion of the differences. Thanks to Lloyd Kvam for pointing this out to me.
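To see one difference for yourself: the or version throws away any falsy argument the caller passes in, including a genuine empty list that the caller expected the function to fill, while the is None test replaces only the sentinel. A small demonstration (Python 3 syntax; the function names are mine):

```python
def add_item_or(item, lst=None):
    # BUGGY variant: "or" replaces ANY falsy argument,
    # including a real empty list supplied by the caller
    lst = lst or []
    lst.append(item)
    return lst

def add_item_none(item, lst=None):
    # correct variant: only the None sentinel is replaced
    if lst is None:
        lst = []
    lst.append(item)
    return lst

shared = []
add_item_or("x", shared)
print(shared)   # [] -- the caller's list was silently replaced

shared2 = []
add_item_none("x", shared2)
print(shared2)  # ['x'] -- the caller's list was updated as expected
```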

And of course, in some situations the best solution is simply not to supply a default for the argument.
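To pull the recommended fix together, here is a runnable version of foobar (in Python 3 syntax) showing that every call now gets a fresh list:

```python
def foobar(arg_string="abc", arg_list=None):
    # replace the None sentinel with a brand-new list on every call
    if arg_list is None:
        arg_list = []
    arg_list.append("F")
    return arg_string, arg_list

# four calls, four independent lists -- no shared default object
results = [foobar() for _ in range(4)]
print(results)  # [('abc', ['F']), ('abc', ['F']), ('abc', ['F']), ('abc', ['F'])]
```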

Mutable defaults for method arguments

Now let’s look at how the mutable arguments gotcha presents itself when a class method is given a mutable default for one of its arguments. Here is a complete program.

# (1) define a class for company employees 
class Employee:
    def __init__(self, arg_name, arg_dependents=[]): 
        # an employee has two attributes: a name, and a list of his dependents
        self.name = arg_name 
        self.dependents = arg_dependents
    def addDependent(self, arg_name): 
        # an employee can add a dependent by getting married or having a baby 
        self.dependents.append(arg_name)
    def show(self): 
        print "My name is.......: ", self.name 
        print "My dependents are: ", str(self.dependents)
        print
#   main routine -- hire employees for the company 

# (2) hire a married employee, with dependents 
joe = Employee("Joe Smith", ["Sarah Smith", "Suzy Smith"])

# (3) hire a couple of unmarried employees, without dependents 
mike = Employee("Michael Nesmith") 
barb = Employee("Barbara Bush")

# (4) mike gets married and acquires a dependent 
mike.addDependent("Nancy Nesmith")

# (5) now have our employees tell us about themselves
joe.show()
mike.show()
barb.show()

Let’s look at what happens when this program is run.

  1. First, the code that defines the Employee class is run.
  2. Then we hire Joe. Joe has two dependents, so that fact is recorded at the time that the joe object is created.
  3. Next we hire Mike and Barb.
  4. Then Mike acquires a dependent.
  5. Finally, the last three statements of the program ask each employee to tell us about himself.

Here is the result.

My name is.......:  Joe Smith 
My dependents are:  ['Sarah Smith', 'Suzy Smith']

My name is.......:  Michael Nesmith 
My dependents are:  ['Nancy Nesmith']

My name is.......:  Barbara Bush 
My dependents are:  ['Nancy Nesmith']

Joe is just fine. But somehow, when Mike acquired Nancy as his dependent, Barb also acquired Nancy as a dependent. This of course is wrong. And we’re now in a position to understand what is causing the program to behave this way.

When the code that defines the Employee class is run, objects for the class definition, the method definitions, and the default values for each argument are created. The constructor has an argument arg_dependents whose default value is an empty list, so an empty list object is created and attached to the __init__ method as the default value for arg_dependents.

When we hire Joe, he already has a list of dependents, which is passed in to the Employee constructor — so the arg_dependents attribute does not use the default empty list object.

Next we hire Mike and Barb. Since they have no dependents, the default value for arg_dependents is used. Remember — this is the empty list object that was created when the code that defined the Employee class was run. So in both cases, the empty list is bound to the arg_dependents argument, and then — again in both cases — it is bound to the self.dependents attribute. The result is that after Mike and Barb are hired, the self.dependents attribute of both Mike and Barb point to the same object — the default empty list object.

When Michael gets married, and Nancy Nesmith is added to his self.dependents list, Barb also acquires Nancy as a dependent, because Barb’s self.dependents variable name is bound to the same list object as Mike’s self.dependents variable name.

So this is what happens when mutable objects are used as defaults for arguments in class methods. If the defaults are used when the method is called, different class instances end up sharing references to the same object.

And that is why you should never, never, NEVER use a list or a dictionary as a default value for an argument to a class method. Unless, of course, you really, really, REALLY know what you’re doing.
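For reference, here is the standard fix, recast in Python 3 syntax as a minimal sketch: use None (which is immutable) as the default, and build a fresh list inside the constructor, so that each instance gets its own dependents list.

```python
class Employee:
    def __init__(self, arg_name, arg_dependents=None):
        self.name = arg_name
        # build a fresh list for each instance that doesn't supply one
        self.dependents = arg_dependents if arg_dependents is not None else []

    def addDependent(self, arg_name):
        self.dependents.append(arg_name)

mike = Employee("Michael Nesmith")
barb = Employee("Barbara Bush")
mike.addDependent("Nancy Nesmith")
print(mike.dependents)  # ['Nancy Nesmith']
print(barb.dependents)  # [] -- Barb no longer shares Mike's list
```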

Posted in Python gotchas | 6 Comments

Unicode for dummies — Encoding

Another entry in an irregular series of posts about Unicode.
Typos fixed 2012-02-22. Thanks Anonymous, and Clinton, for reporting the typos.

This is a story about encoding and decoding, with a minor subplot involving Unicode.

As our story begins — on a dark and stormy night, of course — we find our protagonist deep in thought. He is asking himself “What is an encoding?”

What is an encoding?

The basic concepts are simple. First, we start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being. I’m going to call that representation “plain text”. For English-language speakers, for example, English words printed on a page, or displayed on a screen, count as plain text.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation (let’s call that representation the “encoded text”), and we need to be able to translate the encoded text back into plain text. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

encoding and decoding

There are three points worth noting about this process.

The first point is that no information can be lost during encoding or decoding. It must be possible for us to send a message on a round-trip journey — from plain text to encoded text, and then back again from encoded text to plain text — and get back exactly the same plain text that we started with. That is why, for instance, we can’t use one natural language (Russian, Chinese, French, Navaho) as an encoding for another natural language (English, Hindi, Swahili). The mappings between natural languages are too loose to guarantee that a piece of information can make the round-trip without losing something in translation.

The requirement for a lossless round-trip means that the mapping between the plain text and the encoded text must be very tight, very exact. And that brings us to the second point.

In order for the mapping between the plain text and the encoded text to be very tight — which is to say: in order for us to be able to specify very precisely how the encoding and decoding processes work — we must specify very precisely what the plain text representation looks like.

Suppose, for example, we say that plain text looks like this: the 26 upper-case letters of the Anglo-American alphabet, plus the space and three punctuation symbols: period (full stop), question mark, and dash (hyphen). This gives us a plain-text alphabet of 30 characters. If we need numbers, we can spell them out, like this: “SIX THOUSAND SEVEN HUNDRED FORTY-THREE”.

On the other hand, we may wish to say that our plain text looks like this: 26 upper-case letters, 26 lower-case letters, 10 numeric digits, the space character, and a dozen types of punctuation marks: period, comma, double-quote, left parenthesis, right parenthesis, and so on. That gives us a plain-text alphabet of 75 characters.

Once we’ve specified exactly what a plain-text representation of a message looks like — a finite sequence of characters from our 30-character alphabet, or perhaps our 75-character alphabet — then we can devise a system (a code) that can reliably encode and decode plain-text messages written in that alphabet. The simplest such system is one in which every character in the plain-text alphabet has one and only one corresponding representation in the encoded text. A familiar example is Morse code, in which “SOS” in plain text corresponds to

                ... --- ...

in encoded text.
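The round-trip requirement can be seen in miniature with a toy Morse-style code. The sketch below (names invented for illustration) encodes and decodes messages over a two-character plain-text alphabet, and the round trip is exact:

```python
# a toy code over a two-character plain-text alphabet
MORSE = {"S": "...", "O": "---"}
DECODE = {code: letter for letter, code in MORSE.items()}

def encode(plain):
    # one and only one encoded representation per plain-text character
    return " ".join(MORSE[c] for c in plain)

def decode(encoded):
    return "".join(DECODE[chunk] for chunk in encoded.split(" "))

message = "SOS"
print(encode(message))                     # ... --- ...
assert decode(encode(message)) == message  # the round trip is lossless
```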

In the real world, of course, the selection of characters for the plain-text alphabet is influenced by technological limitations on the encoded text. Suppose we have several available technologies for storing encoded messages: one technology supports an encoded alphabet of 256 characters, another technology supports only 128 encoded characters, and a third technology supports only 64 encoded characters. Naturally, we can make our plain-text alphabet much larger if we know that we can use a technology that supports a larger encoded-text alphabet.

And the reverse is also true. If we know that our plain-text alphabet must be very large, then we know that we must find — or devise — a technology capable of storing a large number of encoded characters.

Which brings us to Unicode.


Unicode was devised to be a system capable of storing encoded representations of every plain-text character of every human language that has ever existed. English, French, Spanish. Greek. Arabic. Hindi. Chinese. Assyrian (cuneiform characters).

That’s a lot of characters.

So the first task of the Unicode initiative was simply to list all of those characters, and count them. That’s the first half of Unicode, the Universal Character Set. (And if you really want to “talk Unicode”, don’t call plain-text characters “characters”. Call them “code points”.)

Once you’ve done that, you’ve got to figure out a technology for storing all of the corresponding encoded-text characters. (In Unicode-speak, the encoded-text characters are called “code values”.)

In fact Unicode defines not one but several methods of mapping code points to code values. Each of these methods has its own name. Some of the names start with “UTF”, others start with “UCS”: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, and so on. The naming convention is “UTF-” followed by a number of bits, or “UCS-” followed by a number of bytes. Some (e.g. UCS-4 and UTF-32) are functionally equivalent. See the Wikipedia article on Unicode.

The most important thing about these methods is that some are fixed-width encodings and some are variable-width encodings. The basic idea is that the fixed-width encodings are very long — UCS-4 and UTF-32 are 4 bytes (32 bits) long — long enough to hold the biggest code value that we will ever need.

In contrast, the variable-width encodings are designed to be short, but expandable. UTF-8, for example, can use as few as 8 bits (one byte) to store ASCII character code points. But it also has a sort of “continued on the next byte” mechanism that allows it to use 2, 3, or even 4 bytes when it needs to (as it does, for example, for Chinese characters). For Western programmers, that means that UTF-8 is both efficient and flexible, which is why UTF-8 is the de facto standard encoding for exchanging Unicode text.
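You can watch UTF-8’s variable width at work from Python itself. This little sketch encodes a few characters and counts the bytes each one needs:

```python
# the number of bytes UTF-8 needs grows with the code point
for ch in ("A", "é", "中", "😀"):
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
# 'A' -> 1, 'é' -> 2, '中' -> 3, '😀' -> 4
```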

There is, then, no such thing as THE Unicode encoding system or method. There are several encoding methods, and if you want to exchange text with someone, you need explicitly to specify which encoding method you are using.

Is it, say, this.

encoding decoding UTF-8

Or this.

encoding decoding UTF-16

Or something else.

Which brings us back to something I said earlier.

Why encode something in Unicode?

At the beginning of this post I said

We start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

OK. So now it is time to explore those reasons. Why might we want to translate a message in a plain-text representation into some other representation?

One reason, of course, is that we want to keep a secret. We want to hide the plain text of our message by encrypting and decrypting it — basically, by keeping the algorithms for encoding and decoding secret and private.

But that is a completely different subject. Right now, we’re not interested in keeping secrets; we’re Python programmers and we’re interested in Unicode. So:

Why — as a Python programmer — would I need to be able to translate a plain-text message into some encoded representation… say, a Unicode representation such as UTF-8?

Suppose you are happily sitting at your PC, working with your favorite text editor, writing the standard Hello World program in Python (specifically, in Python 3+). This single line is your entire program.

                   print("Hello, world!")

Here, “Hello, world!” is plain text. You can see it on your screen. You can read it. You know what it means. It is just a string and you can (if you wish) do standard string-type operations on it, such as taking a substring (a slice).

But now suppose you want to put this string — “Hello, world!” — into a file and save the file on your hard drive. Perhaps you plan to send the file to a friend.

That means that you must eject your poor little string from the warm, friendly, protected home in your Python program, where it exists simply as plain-text characters. You must thrust it into the cold, impersonal, outside world of the file system. And out there it will exist not as characters, but as mere 1’s and 0’s, a jumble of dits and dots, charged and uncharged particles. And that means that your happy little plain-text string must be represented by some specific configuration of 1s and 0s, so that when somebody wants to retrieve that collection of 1s and 0s and convert it back into readable plain text, they can.

The process of converting a plain text into a specific configuration of 1s and 0s is a process of encoding. In order to write a string to a file, you must encode it using some encoding system (such as UTF-8). And to get it back from a file, you must read the file and decode the collection of 1s and 0s back into plain text.

The need to encode/decode strings when writing/reading them to/from files isn’t something new — it is not an additional burden imposed by Python 3’s new support for Unicode. It is something you have always done. But it wasn’t always so obvious. In earlier versions of Python, the default encoding scheme was ASCII. And because, in those olden times, ASCII was pretty much the only game in town, you didn’t need to specify that you wanted to write and read your files in ASCII. Python just assumed it by default and did it. But — whether or not you realized it — whenever one of your programs wrote or read strings from a file, Python was busy behind the scenes, doing the encoding and decoding for you.

So that’s why you — as a Python programmer — need to be able to encode and decode text into, and out of, UTF-8 (or some other encoding: UTF-16, ASCII, whatever). You need to encode your strings as 1s and 0s so you can put those 1s and 0s into a file and send the file to someone else.
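Here is what that looks like in Python 3, as a minimal sketch (the filename is made up). Opening the file in text mode with an explicit encoding makes Python do the encoding on the way out; opening it again in binary mode shows the raw bytes that actually went to disk:

```python
text = "Hello, world!"

# text mode + an explicit encoding: Python encodes on write
with open("hello.txt", "w", encoding="utf-8") as f:
    f.write(text)

# binary mode: read back the raw bytes that were actually stored
with open("hello.txt", "rb") as f:
    raw = f.read()

print(raw)                          # b'Hello, world!'
assert raw.decode("utf-8") == text  # decoding recovers the plain text
```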

What is plain text?

Earlier, I said that there were three points worth noting about the encoding/decoding process, and I discussed the first two. Here is the third point.

The distinction between plain text and encoded text is relative and context-dependent.

As programmers, we think of plain text as being written text. But it is possible to look at matters differently. For instance, we can think of spoken text as the plain text, and written text as the encoded text. From this perspective, writing is encoded speech. And there are many different encodings for speech as writing. Think of Egyptian hieroglyphics, Mayan hieroglyphics, the Latin alphabet, the Greek alphabet, Arabic, Chinese ideograms, wonderfully flowing Devanagari देवनागरी, sharp pointy cuneiform wedges, even shorthand. These are all written encodings for the spoken word. They are all, as Thomas Hobbes put it, “Marks by which we may remember our thoughts”.

Which reminds us that, in a different context, even speech itself — language — may be regarded as a form of encoding. In much of early modern philosophy (think of Hobbes and Locke) speech (or language) was basically considered to be an encoding of thoughts and ideas. Communication happens when I encode my thought into language and say something — speak to you. You hear the sound of my words and decode it back into ideas. We achieve communication when I successfully transmit a thought from my mind to your mind via language. You understand me when — as a result of my speech — you have the same idea in your mind as I have in mine. (See Ian Hacking, Why Does Language Matter to Philosophy?)

Finally, note that in other contexts, the “plain text” isn’t even text. Where the plain text is soundwaves (e.g. music), it can be encoded as an mp3 file. Where the plain text is an image, it can be encoded as a gif, or png, or jpg file. Where the plain text is a movie, it can be encoded as a wmv file. And so on.

Everywhere, we are surrounded by encoding and decoding.


I’d like to recommend Eli Bendersky’s recent post on The bytes/str dichotomy in Python 3, which prodded me — finally — to put these thoughts into writing. I especially like this passage in his post.

Think of it this way: a string is an abstract representation of text. A string consists of characters, which are also abstract entities not tied to any particular binary representation. When manipulating strings, we’re living in blissful ignorance. We can split and slice them, concatenate and search inside them. We don’t care how they are represented internally and how many bytes it takes to hold each character in them. We only start caring about this when encoding strings into bytes (for example, in order to send them over a communication channel), or decoding strings from bytes (for the other direction).

I strongly recommend Charles Petzold’s wonderful book Code: The Hidden Language of Computer Hardware and Software.

And finally, I’ve found Stephen Pincock’s Codebreaker: The History of Secret Communications a delightful read. It will tell you, among many other things, how the famous WWII Navaho codetalkers could talk about submarines and dive bombers… despite the fact that there are no Navaho words for “submarine” or “dive bomber”.

Posted in Unicode | 5 Comments

How to post source code on WordPress

This post is for folks who blog about Python (or any programming language for that matter) on WordPress.
Updated 2011-11-09 to make it easier to copy-and-paste the [sourcecode] template.

My topic today is How to post source code on WordPress.

The trick is to use the WordPress [sourcecode] shortcut tag, as documented at

Note that when the WordPress docs tell you to enclose the [sourcecode] shortcut tag in square — not pointy — brackets, they mean it. When you view your post as HTML, what you should see is square brackets around the shortcut tags, not pointy brackets.

Here is the tag I like to use for snippets of Python code.

[sourcecode language="python" wraplines="false" collapse="false"]
your source code goes here
[/sourcecode]

The default for wraplines is true, which causes long lines to be wrapped. That isn’t appropriate for Python, so I specify wraplines=”false”.

The default for collapse is false, which is what I normally want. But I code it explicitly, as a reminder that if I ever want to collapse a long code snippet, I can.

Here are some examples.

Note that

  • WordPress knows how to do syntax highlighting for Python. It uses Alex Gorbatchev’s SyntaxHighlighter.
  • If you hover your mouse pointer over the code, you get a pop-up toolbar that allows you to look at the original source code snippet, copy it to the clipboard, print it, etc.


First, a normal chunk of relatively short lines of Python code.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get the first character

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()


Here is a different code snippet. This one has a line containing a very long comment. Note that the long line is NOT wrapped, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to. That is because we have specified wraplines=”false”.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="false", so lines are NOT wrapped, but extend indefinitely, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to.


This is what a similar code snippet would look like if we had specified wraplines=true. Note that line 2 wraps around and there is no horizontal scrollbar.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and no horizontal scrollbar is needed.


Finally, the same code snippet with collapse=true, so the code snippet initially displays as collapsed. Clicking on the collapsed code snippet will cause it to expand.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and no horizontal scrollbar is needed.

As far as I can tell, once a reader has expanded a snippet that was initially collapsed, there is no way for him to re-collapse it. That would be a nice enhancement for WordPress — to allow a reader to collapse and expand a code snippet.

Here is a final thought about wraplines. If you specify wraplines=”false”, and a reader prints a paper copy of your post, the printed output will not show the scrollbar, and it will show only the portion of long lines that were visible on the screen. In short, the printed output might cut off the right-hand part of long lines.

In most cases, I think, this should not be a problem. The pop-up tools allow a reader to view or print the entire source code snippet if he wants to. Still, I can imagine cases in which I might choose to specify wraplines=”true”, even for a whitespace-sensitive language such as Python. And I can understand that someone else, simply as a matter of personal taste, might prefer to specify wraplines=”true” all of the time.

Now that I think of it, another nice enhancement for WordPress would be to allow a reader to toggle wraplines on and off.

Keep on bloggin’!

Posted in Miscellaneous | 23 Comments

Python3 pickling

Recently I was converting some old Python2 code to Python3 and I ran across a problem pickling and unpickling.

I guess I would say it wasn’t a major problem because I found the solution fairly quickly with a bit of googling around.

Still, I think the problem and its solution are worth a quick note.  Others will stumble across this problem in the future, especially because there are code examples floating around (in printed books and online posts) that will lead new Python programmers to make this very same mistake.

So let’s talk about pickling.

Suppose you want to “pickle” an object — dump it to a pickle file for persistent storage.

When you pickle an object, you do two things.

  • You open the file that you want to use as the pickle file. The open(…) returns a file handle object.
  • You pass the object that you want to pickle, and the file handle object, to pickle.dump().

Your code might look something like this. Note that this code is wrong. See below.

import pickle

fileHandle = open(pickleFileName, "w")
pickle.dump(objectToBePickled, fileHandle)

When I wrote code like this, I got back this error message:

Pickler(file, protocol, fix_imports=fix_imports).dump(obj)
TypeError: must be str, not bytes

Talk about a crappy error message!!!

After banging my head against the wall for a while, I googled around and quickly found a very helpful answer on StackOverflow.

The bottom line is that a Python pickle file is (and always has been) a byte stream. Which means that you should always open a pickle file in binary mode: “wb” to write it, and “rb” to read it. The Python docs contain correct example code.

My old code worked just fine running under Python2 (on Windows).  But with Python3’s new strict separation of strings and bytes, it broke. Changing “w” to “wb”, and “r” to “rb”, fixed it. 

One person who posted a question about this problem on the Python forum was aware of the issue, but confused because he was trying to pickle a string.

import pickle
a = "blah"
file = open('state', 'w')
pickle.dump(a, file)
I know of one easy way to solve this is to change the operation argument from ‘w’ to ‘wb’ but I AM using a string not bytes! And none of the examples use ‘wb’ (I figured that out separately) so I want to have an understanding of what is going on here.

Basically, regardless of the kind of object that you are pickling (even a string object), the object will be converted to a bytes representation and pickled as a byte stream. Which means that you always need to use “rb” and “wb”, regardless of the kind of object that you are pickling.
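Putting it all together, a correct round trip looks like this (the filename "state.pkl" is just for illustration):

```python
import pickle

obj = "blah"

with open("state.pkl", "wb") as f:  # "wb": a pickle stream is bytes
    pickle.dump(obj, f)

with open("state.pkl", "rb") as f:  # "rb" to read it back
    restored = pickle.load(f)

print(restored)  # blah
```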

Posted in Moving to Python 3 | Comments Off on Python3 pickling

Yet Another Lambda Tutorial

There are a lot of tutorials[1] for Python’s lambda out there. A very helpful one is Mike Driscoll’s discussion of lambda on the Mouse vs Python blog. Mike’s discussion is excellent: clear, straight-forward, with useful illustrative examples. It helped me — finally — to grok lambda, and led me to write yet another lambda tutorial.

Lambda is a tool for building functions

Lambda is a tool for building functions, or more precisely, for building function objects. That means that Python has two tools for building functions: def and lambda.

Here’s an example. You can build a function in the normal way, using def, like this:

def square_root(x): return math.sqrt(x)

or you can use lambda:

square_root = lambda x: math.sqrt(x)

Here are a few other interesting examples of lambda:

sum = lambda x, y:   x + y   #  def sum(x,y): return x + y

out = lambda   *x:   sys.stdout.write(" ".join(map(str,x)))

lambda event, name=button8.getLabel(): self.onButton(event, name)

What is lambda good for? Why do we need lambda?

Actually, we don’t absolutely need lambda; we could get along without it. But there are certain situations where it makes writing code a bit easier, and the written code a bit cleaner. What kind of situations? … Situations in which (a) the function is fairly simple, and (b) it is going to be used only once.

Normally, functions are created for one of two purposes: (a) to reduce code duplication, or (b) to modularize code.

  • If your application contains duplicate chunks of code in various places, then you can put one copy of that code into a function, give the function a name, and then — using that function name — call it from various places in your code.
  • If you have a chunk of code that performs one well-defined operation — but is really long and gnarly and interrupts the otherwise readable flow of your program — then you can pull that long gnarly code out and put it into a function all by itself.

But suppose you need to create a function that is going to be used only once — called from only one place in your application. Well, first of all, you don’t need to give the function a name. It can be “anonymous”. And you can just define it right in the place where you want to use it. That’s where lambda is useful.

But, but, but… you say.

  • First of all — Why would you want a function that is called only once? That eliminates reason (a) for making a function.
  • And the body of a lambda can contain only a single expression. That means that lambdas must be short. So that eliminates reason (b) for making a function.

What possible reason could I have for wanting to create a short, anonymous function?

Well, consider this snippet of code that uses lambda to define the behavior of buttons in a Tkinter GUI interface. (This example is from Mike’s tutorial.)

frame = tk.Frame(parent)

btn22 = tk.Button(frame,
        text="22", command=lambda: self.printNum(22))

btn44 = tk.Button(frame,
        text="44", command=lambda: self.printNum(44))

The thing to remember here is that a tk.Button expects a function object as an argument to the command parameter. That function object will be the function that the button calls when it (the button) is clicked. Basically, that function specifies what the GUI will do when the button is clicked.

So we must pass a function object in to a button via the command parameter. And note that — since different buttons do different things — we need a different function object for each button object. Each function will be used only once, by the particular button to which it is being supplied.

So, although we could code (say)

def __init__(self, parent):
    frame = tk.Frame(parent)

    btn22 = tk.Button(frame,
        text="22", command=self.buttonCmd22)

    btn44 = tk.Button(frame,
        text="44", command=self.buttonCmd44)

def buttonCmd22(self):
    self.printNum(22)

def buttonCmd44(self):
    self.printNum(44)

it is much easier (and clearer) to code

def __init__(self, parent):
    frame = tk.Frame(parent)

    btn22 = tk.Button(frame,
        text="22", command=lambda: self.printNum(22))

    btn44 = tk.Button(frame,
        text="44", command=lambda: self.printNum(44))

When a GUI program has this kind of code, the button object is said to “call back” to the function object that was supplied to it as its command. So we can say that one of the most frequent uses of lambda is in coding “callbacks” to GUI frameworks such as Tkinter and wxPython.
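You can see the callback pattern without any GUI at all. The sketch below uses a hypothetical stand-in for tk.Button that simply stores the function object it is given and invokes it when “clicked”:

```python
class Button:
    # a hypothetical stand-in for tk.Button: it stores a function object...
    def __init__(self, command):
        self.command = command

    def click(self):
        # ...and "calls back" to that function when clicked
        self.command()

pressed = []
btn22 = Button(command=lambda: pressed.append(22))
btn44 = Button(command=lambda: pressed.append(44))
btn44.click()
btn22.click()
print(pressed)  # [44, 22]
```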

This all seems pretty straight-forward. So…

Why is lambda so confusing?

There are four reasons that I can think of.

First Lambda is confusing because: the requirement that a lambda can take only a single expression raises the question: What is an expression?

A lot of people would like to know the answer to that one. If you Google around a bit, you will see a lot of posts from people asking “In Python, what’s the difference between an expression and a statement?”

One good answer is that an expression returns (or evaluates to) a value, whereas a statement does not. Unfortunately, the situation is muddled by the fact that in Python an expression can also be a statement. And we can always throw a red herring into the mix — assignment statements like a = b = 0 suggest that assignment statements return values, the way they do in C. (They do not. Chained assignment is special-cased syntax in Python; Python isn’t C.)[2]

In many cases when people ask this question, what they really want to know is: What kind of things can I, and can I not, put into a lambda? And the answer to that question is basically—

  • If it doesn’t return a value, it isn’t an expression and can’t be put into a lambda.
  • If you can imagine it in an assignment statement, on the right-hand side of the equals sign, it is an expression and can be put into a lambda.

Using these rules means that:

  1. Assignment statements cannot be used in lambda. In Python, assignment statements don’t return anything, not even None (null).
  2. Simple things such as mathematical operations, string operations, list comprehensions, etc. are OK in a lambda.
  3. Function calls are expressions. It is OK to put a function call in a lambda, and to pass arguments to that function. Doing this wraps the function call (arguments and all) inside a new, anonymous function.
  4. In Python 3, print became a function, so in Python 3+, print(…) can be used in a lambda.
  5. Even functions that return None, like the print function in Python 3, can be used in a lambda.
  6. Conditional expressions, which were introduced in Python 2.5, are expressions (and not merely a different syntax for an if/else statement). They return a value, and can be used in a lambda.
    lambda: a if some_condition() else b
    lambda x: 'big' if x > 100 else 'small'
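These rules are easy to check interactively. The sketch below puts a function call and a conditional expression into lambdas (the names greet and size are invented for illustration):

```python
# a function call (with arguments) is an expression, so it can be a lambda body;
# print() is a function in Python 3, so it qualifies even though it returns None
greet = lambda name: print("Hello,", name)
greet("world")

# a conditional expression returns a value, so it also qualifies
size = lambda x: "big" if x > 100 else "small"
print(size(500), size(5))  # big small
```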


Second Lambda is confusing because: the specification that a lambda can take only a single expression raises the question: Why? Why only one expression? Why not multiple expressions? Why not statements?

For some developers, this question means simply Why is the Python lambda syntax so weird? For others, especially those with a Lisp background, the question means Why is Python’s lambda so crippled? Why isn’t it as powerful as Lisp’s lambda?

The answer is complicated, and it involves the “pythonicity” of Python’s syntax. Lambda was a relatively late addition to Python. By the time that it was added, Python syntax had become well established. Under the circumstances, the syntax for lambda had to be shoe-horned into the established Python syntax in a “pythonic” way. And that placed certain limitations on the kinds of things that could be done in lambdas. Frankly, I still think the syntax for lambda looks a little weird. Be that as it may, Guido has explained why lambda’s syntax is not going to change. Python will not become Lisp.[3]

Third Lambda is confusing because: lambda is usually described as a tool for creating functions, but a lambda specification does not contain a return statement.

The return statement is, in a sense, implicit in a lambda. Since a lambda specification must contain only a single expression, and that expression must return a value, an anonymous function created by lambda implicitly returns the value returned by the expression. This makes perfect sense. Still— the lack of an explicit return statement is, I think, part of what makes it hard to grok lambda, or at least, hard to grok it quickly.
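A quick sketch makes the implicit return concrete:

```python
# the lambda's single expression is returned implicitly...
double = lambda x: x * 2

# ...which makes it equivalent to a def with an explicit return
def double_def(x):
    return x * 2

print(double(21))  # 42
assert double(21) == double_def(21)
```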

Fourth Lambda is confusing because: tutorials on lambda typically introduce lambda as a tool for creating anonymous functions, when in fact the most common use of lambda is for creating anonymous procedures.

Back in the High Old Times, we recognized two different kinds of subroutines: procedures and functions. Procedures were for doing stuff, and did not return anything. Functions were for calculating and returning values. The difference between functions and procedures was even built into some programming languages. In Pascal, for instance, procedure and function were different keywords.

In most modern languages, the difference between procedures and functions is no longer enshrined in the language syntax. A Python function, for instance, can act like a procedure, a function, or both. The (not altogether desirable) result is that a Python function is always referred to as a “function”, even when it is essentially acting as a procedure.

Although the distinction between a procedure and a function has essentially vanished as a language construct, we still often use it when thinking about how a program works. For example, when I’m reading the source code of a program and see some function F, I try to figure out what F does. And I often can categorize it as a procedure or a function — “the purpose of F is to do so-and-so” I will say to myself, or “the purpose of F is to calculate and return such-and-such”.

So now I think we can see why many explanations of lambda are confusing.

First of all, the Python language itself masks the distinction between a function and a procedure.

Second, most tutorials introduce lambda as a tool for creating anonymous functions, things whose primary purpose is to calculate and return a result. The very first example that you see in most tutorials (this one included) shows how to write a lambda to return, say, the square root of x.

But this is not the way that lambda is most commonly used, and is not what most programmers are looking for when they Google “python lambda tutorial”. The most common use for lambda is to create anonymous procedures for use in GUI callbacks. In those use cases, we don’t care about what the lambda returns, we care about what it does.
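As a minimal sketch of the idea (no GUI framework required), here the lambdas are invoked purely for their side effects; nobody looks at their return values:

```python
log = []

# Two anonymous "procedures": each lambda is called for what it
# *does* (appending to log), not for what it returns (which is None).
callbacks = {
    "save": lambda: log.append("saved"),
    "quit": lambda: log.append("quit"),
}

callbacks["save"]()
callbacks["quit"]()
print(log)  # ['saved', 'quit']
```

In a Tkinter or wxPython program, the dictionary values would instead be bound to buttons or menu items, but the principle is the same.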

This explains why most explanations of lambda are confusing for the typical Python programmer. He’s trying to learn how to write code for some GUI framework: Tkinter, say, or wxPython. He runs across examples that use lambda, and wants to understand what he’s seeing. He Googles for “python lambda tutorial”. And he finds tutorials that start with examples that are entirely inappropriate for his purposes.

So, if you are such a programmer — this tutorial is for you. I hope it helps. I’m sorry that we got to this point at the end of the tutorial, rather than at the beginning. Let’s hope that someday, someone will write a lambda tutorial that, instead of beginning this way

Lambda is a tool for building anonymous functions.

begins something like this

Lambda is a tool for building callback handlers.

So there you have it. Yet another lambda tutorial.


[1] Some lambda tutorials:

[2] In some programming languages, such as C, an assignment statement returns the assigned value. This allows chained assignments such as x = y = a, in which the assignment statement y = a returns the value of a, which is then assigned to x. In Python, assignment statements do not return a value. Chained assignment (or more precisely, code that looks like chained assignment statements) is recognized and supported as a special case of the assignment statement.
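To make the footnote concrete: chained assignment works as a special case, but an assignment used as an expression is rejected by the parser.

```python
x = y = 3      # chained assignment: recognized as a special case
print(x, y)    # 3 3

try:
    # C-style "assignment returns a value" is a syntax error in Python:
    compile("x = (y = 3)", "<demo>", "exec")
except SyntaxError:
    print("SyntaxError")
```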

[3] Python developers who are familiar with Lisp have argued for increasing the power of Python’s lambda, moving it closer to the power of lambda in Lisp. There have been a number of proposals for a syntax for “multi-line lambda”, and so on. Guido has rejected these proposals and blogged about some of his thinking about “pythonicity” and language features as a user interface. This led to an interesting discussion on Lambda the Ultimate, the programming languages weblog about lambda, and about the idea that programming languages have personalities.

Posted in Python features | 17 Comments

Read-Ahead and Python Generators

One of the early classics of program design is Michael Jackson’s Principles of Program Design (1975), which introduced (what later came to be known as) JSP: Jackson Structured Programming.

Back in the 1970’s, most business application programs did their work by reading and writing sequential files of records stored on tape. And it was common to see programs whose top-level control structure looked like (what I will call) the “standard loop”:

open input file F

while not EndOfFile on F:
    read a record
    process the record

close F

Jackson showed that this way of processing a sequence almost always created unnecessary problems in the program logic, and that a better way was to use what he called a “read-ahead” technique. 

In the read-ahead technique, a record is read from the input file immediately after the file is opened, and then a second “read” statement is executed after each record is processed.

This technique produces a program structure like this:

open input file F
read a record from F     # get first

while not EndOfFile on F:
    process the record
    read the next record from F  # get next

close F

I won’t try to explain when or why the read-ahead technique is preferable to the standard loop. That’s out of scope for this blog entry, and a good book on JSP can explain that better than I can. So for now, let’s just say that there are some situations in which the standard loop is the right tool for the job, and there are other situations in which read-ahead is the right tool for the job.

One of the joys of Python is that Python makes it so easy to do “standard loop” processing on a sequence such as a list or a string.

for item in sequence:
    process(item)

There are times, however, when you have a sequence that you need to process with the read-ahead technique.

With Python generators, it is easy to do. Generators make it easy to convert a sequence into a kind of object that provides both a get next method and an end-of-file mark.  That kind of object can easily be processed using the read-ahead technique.

Suppose that we have a list of items (called listOfItems) and we wish to process it using the read-ahead technique.

First, we create the “read-ahead” generator:

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None # return the "end of file mark" after the last item

Then we can write our code this way:

items = ReadAhead(listOfItems)
item = next(items)  # get first
while item:
    process(item)   # process the item
    item = next(items)  # get next

Here is a simple example.

We have a string (called “line”) consisting of characters. Each line consists of zero or more indent characters, some text characters, and (optionally) a special SYMBOL character followed by some suffix characters. For those familiar with JSP, the input structure diagram looks like this.

    - indent
        * one indent char
    - text
        * one text char
    - possible suffix
        o no suffix
        o suffix
            - suffix SYMBOL
            - suffix chars
                * one suffix char

We want to parse the line into 3 groups: indent characters, text characters, and suffix characters.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = next(chars)  # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = next(chars)

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = next(chars)

if c and c == SYMBOL:
    c = next(chars)  # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = next(chars)
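
Putting the pieces together, here is a self-contained, runnable version of the example. The values of INDENT_CHAR and SYMBOL are my own assumptions (a space and a “#”), chosen just for the demo:

```python
def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None  # the "end of file mark" after the last item

INDENT_CHAR = " "   # assumed: an indent character is a space
SYMBOL = "#"        # assumed: the suffix marker is a "#"
line = "    some text# a suffix"

indentCount = 0
textChars = []
suffixChars = []

chars = ReadAhead(list(line))
c = next(chars)  # get first

while c and c == INDENT_CHAR:
    indentCount += 1
    c = next(chars)

while c and c != SYMBOL:
    textChars.append(c)
    c = next(chars)

if c and c == SYMBOL:
    c = next(chars)  # read past the SYMBOL
    while c:
        suffixChars.append(c)
        c = next(chars)

print(indentCount)            # 4
print("".join(textChars))     # some text
print("".join(suffixChars))   # " a suffix" (with the leading space)
```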
Posted in Python & JSD | 7 Comments

In Java, what is the difference between an abstract class and an interface?

This post is about Java, and has nothing to do with Python.  I’ve posted it here so that it can be available to other folks who might find it useful. (And because I don’t have a Java blog!)

In Java, what is the difference between an abstract class and an interface?

This is a question that comes up periodically. When I Googled for answers to it, I didn’t very much like any of the answers that I found, so I wrote my own. For those who might be interested, here it is.

Q: What is the difference between an abstract class and an interface?

A: Good question.

To help explain, first let me introduce some terminology that I hope will help clarify the situation.

  • I will say that a fully abstract class is an abstract class in which all methods are abstract.
  • In contrast, a partially abstract class is an abstract class in which some of the methods are abstract, and some are concrete (i.e. have implementations).

Q: OK. So what is the difference between a fully abstract class and an interface?

A: Basically, none. They are the same.

Q: Then why does Java have the concept of an interface, as well as the concept of an abstract class?

A: Because Java doesn’t support multiple inheritance. Or rather I should say, it supports a limited form of multiple inheritance.

Q: Huh??!!!

A: Java has a rule that a class can extend only one abstract class, but can implement multiple interfaces (fully abstract classes).

There’s a reason why Java has such a rule.

Remember that a class can be an abstract class without being a fully abstract class. It can be a partially abstract class.

Now imagine that we have two partially abstract classes A and B. Both have some abstract methods, and both contain a non-abstract method called foo().

And imagine that Java allows a class to extend more than one abstract class, so we can write a class C that extends both A and B. And imagine that C doesn’t implement foo().

So now there is a problem. Suppose we create an instance of C and invoke its foo() method. Which foo() should Java invoke: the one inherited from A, or the one inherited from B?

Some languages allow multiple inheritance, and have a way to answer that question. Python for example has a “method resolution order” algorithm that determines the order in which superclasses are searched, looking for an implementation of foo().
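Python, for contrast, happily accepts the situation that Java forbids, and uses the MRO to decide which foo() wins:

```python
class A:
    def foo(self):
        return "A.foo"

class B:
    def foo(self):
        return "B.foo"

class C(A, B):   # multiple inheritance: allowed in Python
    pass         # C does not implement foo() itself

# The MRO searches C, then A, then B; so A's foo() is used.
print(C().foo())                        # A.foo
print([k.__name__ for k in C.__mro__])  # ['C', 'A', 'B', 'object']
```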

But the designers of Java made a different choice. They chose to make it a rule that a class can inherit from as many fully abstract classes as it wants, but can inherit from only one partially abstract class. That way, the question of which foo() to use will never come up.

This is a form of limited multiple inheritance. Basically, the rule says that you can inherit from (extend) as many classes as you want, but if you do, only one of those classes can contain concrete (implemented) methods.

So now we do a little terminology substitution:

abstract class = a class that contains at least one abstract method, and can also contain concrete (implemented) methods

interface =  a class that is fully abstract — it has abstract methods, but no concrete methods

With those substitutions, you get the familiar Java rule:

A class can extend at most one abstract class, but may implement many interfaces.

That is, Java supports a limited form of multiple inheritance.

Posted in Java and Python | 14 Comments

Newline conversion in Python 3

I use Python on both Windows and Unix.  Occasionally when running on Windows  I need to read in a file containing Windows newlines and write it out with Unix/Linux newlines.  And sometimes when running on Unix, I need to run the newline conversion in the other direction.

Prior to Python 3, the accepted way to do this was to read the data from the file in binary mode, convert the newline characters in the data, and then write the data out again in binary mode. The Tools/Scripts directory of the Python distribution contained two conversion scripts with illustrative examples. Here, for instance, is the key code for the Windows-to-Unix conversion:

        data = open(filename, "rb").read()
        newdata = data.replace("\r\n", "\n")
        if newdata != data:
            f = open(filename, "wb")
            f.write(newdata)
            f.close()
But if you try to do that with Python 3+, it won’t work: in binary mode Python 3 reads the data as bytes, and passing the strings "\r\n" and "\n" to bytes.replace() raises a TypeError.

The key to what will work is the new “newline” argument for the built-in file open() function. It is documented here.

The key point from that documentation is this:

newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

  • On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

So now when I want to convert a file from Windows-style newlines to Linux-style newlines, I do this:

filename = "NameOfFileToBeConverted"
fileContents = open(filename, "r").read()
f = open(filename, "w", newline="\n")
f.write(fileContents)
f.close()
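As a sanity check, here is the same conversion run against a throwaway file (the file name and sample content are arbitrary), inspecting the raw bytes before and after:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\r\n")            # Windows-style newlines

fileContents = open(path, "r").read()      # newline=None: "\r\n" comes back as "\n"
with open(path, "w", newline="\n") as f:   # "\n" is written out untranslated
    f.write(fileContents)

print(open(path, "rb").read())             # b'one\ntwo\n'
```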

Posted in Moving to Python 3 | 7 Comments

Why import star is a bad idea

When I was learning Python, I of course read the usual warnings. They told me: You can do

from something_or_other import *

but don’t do it. Importing star (asterisk, everything) is a Python Worst Practice.

Don’t “import star”!

But I was young and foolish, and my scripts were short. “How bad can it be?” I thought. So I did it anyway, and everything seemed to work out OK.

Then, like they always do, the quick-and-dirty scripts grew into programs, and then grew into a full-blown system. Before long I had a monster on my hands, and I needed a tool that would look through all of the scripts and programs in the system and do (at least) some basic error checking.

I’d heard good things about pyflakes, so I thought I’d give it a try.

It worked very nicely. It found the basic kinds of errors that I wanted it to find. And it was fast, so I could run it through a directory containing a lot of .py files and it would come out alive and grinning on the other side.

During the process, I learned that pyflakes is designed to be a bit on the quick and dirty side itself, with the quick making up for the dirty. As part of that design, it basically ignores star imports.  Oh, it warns you about the star imports.  What I mean is — it doesn’t try to figure out what is imported by the star import.

And that has interesting consequences.

Normally, if your file contains an undefined name — say TARGET_LANGAGE — pyflakes will report it as an error.

But if your file includes any star imports, and your script contains an undefined name like TARGET_LANGAGE, pyflakes won’t report the undefined name as an error.

My hypothesis is that pyflakes doesn’t report TARGET_LANGAGE as undefined because it can’t tell whether TARGET_LANGAGE is truly undefined, or was pulled in by some star import.

This is perfectly understandable. There is no way that pyflakes is going to go out, try to find the something_or_other module, and analyze it to see if it contains TARGET_LANGAGE. And if it doesn’t, but contains star imports, go out and look for all of the modules that something_or_other star imports, and then analyze them. And so on, and so on, and so on. No way!

So, since pyflakes can’t tell whether TARGET_LANGAGE is (a) an undefined name or (b) pulled in via some star import, it does not report TARGET_LANGAGE as an undefined name. Basically, pyflakes ignores it.

And that seems to me to be a perfectly reasonable way to do business, not just for pyflakes but for anything short of the Super Deluxe Hyperdrive model static code analyzer.

The takeaway lesson for me was that using star imports will cripple a static code analyzer. Or at least, cripple the feature that I most want a code analyser for… to find and report undefined names.

So now I don’t use star imports anymore.
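Crippling the analyzer isn’t the only hazard, either. A star import can silently rebind a name you were already using; here, as a small illustration, the builtin pow is replaced by math.pow:

```python
print(pow(2, 3))     # 8   -- the builtin pow, returns an int

from math import *   # quietly rebinds pow (among other names)

print(pow(2, 3))     # 8.0 -- now math.pow, returns a float
```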

There are a variety of alternatives.  The one that I prefer is the “import x as y” feature. So I can write an import statement this way:

import some_module_with_a_big_long_hairy_name as bx

and rather than coding

x = some_module_with_a_big_long_hairy_name.vector

I can code

x = bx.vector

Works for me.

Posted in Python features | 11 Comments

Learning Subversion: the mystery of .svn

If you are googling for “Subversion command line tutorial introduction for beginners”, read this first! This is for all Subversion newbies.

After using PVCS for many years, our office recently started moving to Subversion. Which means that recently I started trying to learn Subversion.

I was pressed for time. I was in a hurry. I was looking for something that would get me up and running quickly.

First, I got a copy of the free online Subversion documentation Version Control with Subversion.

Second, I got a copy of Mike Mason’s excellent Pragmatic Version Control Using Subversion (2nd ed.).

Third, I googled the Web looking for the kinds of things that you’d expect: Subversion tutorial introduction beginning beginners commands. And I found some good stuff.

But even after reading many of the online Subversion tutorials, I still could not grok Subversion. Different commands seemed to be doing the same thing, and the tutorials used a lot of terms that were never defined or explained: “versioned”, “unversioned”, “under version control”, and so on.

Gradually, I realized the problem. Many of the online tutorials and introductions try to explain how to use Subversion without explaining how Subversion works. They tell you what commands to issue, and when, but they don’t tell you why you are issuing the command at this particular time, or what the command is doing under the covers.

So I had to dig deeper.

What I found was that there was one particular piece of information missing from most of the tutorials and introductions that I found. If you don’t have that piece, nothing about Subversion makes much sense. With it, all of the other pieces of the puzzle fall into place.

So the purpose of this post is to tell you — the Subversion newbie — about that piece.

How Subversion Works

The basic unit of work for Subversion is a project.

A project is basically a directory.

Technically, a project is a subtree: a directory, including all of its files and subdirectories, and all of those subdirectories’ files and subdirectories, etc. But in order to keep things simple, I will talk as if a project is just a directory.

When you are working on a Subversion project, there are actually two directories that you are working with.

  • There is the repository, which is a directory (controlled by Subversion and running on a server somewhere) that contains the master copy of the project directory.
  • There is your own personal workingCopy, which is a directory (controlled by you) that exists on the file system of your own machine (that is, on the hard drive of your own PC).

But (and this is the piece that was missing) a workingCopy directory is not an ordinary directory.

The use of the expression “working copy” is one of the most confusing things about Subversion tutorials and even the Subversion documentation itself. When you encounter the expression “working copy” you assume that you are dealing with an ordinary filesystem directory that is being used to hold a copy of the files in your project. Not so!

In the context of Subversion, “working copy” is a very specific term of art — a Subversion-specific technical term. That is why in this post I avoid the expression “working copy” and instead use workingCopy.

So what is a Subversion workingCopy directory?

A workingCopy directory is a directory that has a hidden subdirectory called “.svn”.

The hidden .svn directory is what Subversion calls an “administrative directory”.

Note the leading period in “.svn”. On Unix systems, a directory whose name begins with a dot is a “hidden” (or “dotfile”) directory.

On your PC, the project’s top-level workingCopy directory has a hidden .svn subdirectory. And each of the subdirectories of the workingCopy directory (if it has any), and each of their subdirectories (if they have any), and so on, has its own hidden .svn subdirectory.

Having a hidden .svn subdirectory is what makes an ordinary file system directory into a Subversion workingCopy directory, a directory that Subversion can recognize and manage.

So, for a project named “ProjectX” the workingCopy directory will be named “ProjectX”. It might look like this:

	ProjectX
		.svn [DIRECTORY]

What is in a .svn subdirectory? What does a Subversion administrative directory contain?

The Subversion documentation says this about workingCopy directories:

A Subversion working copy is an ordinary directory tree on your local system, containing a collection of files. You can edit these files however you wish, and if they’re source code files, you can compile your program from them in the usual way. …

A working copy also contains some extra files, created and maintained by Subversion, to help it carry out these commands. In particular, each directory in your working copy contains a subdirectory named .svn, also known as the working copy’s administrative directory. The files in each administrative directory help Subversion recognize which files contain unpublished changes, and which files are out of date with respect to others’ work.

Here’s another clue: a passage from Pragmatic Version Control Using Subversion:

Subversion has a highly efficient network protocol and stores pristine copies of your working files locally, allowing a user to see what changes they’ve made without even contacting the server [where the central repository is stored].

So now we know what a Subversion administrative directory contains.

The .svn admin directory contains pristine (unchanged) copies of files that were downloaded from the repository. (It contains a few other things, too.)

Earlier, I said “When you are working on a Subversion project, there are actually TWO directories that you are working with… the repository and the working copy.” Now I want to change that. It would be more accurate to say that there are really THREE directories that you are working with:

  • the main ProjectX repository on the server
  • the ProjectX workingCopy directory on your PC, which contains editable (and possibly changed) copies of the files in the project … and also …
  • the hidden Subversion administrative directory, which contains (pristine, unchanged, and uneditable) copies of the files in the main ProjectX repository on the server.

That means that, on your PC, the ProjectX workingCopy directory looks like this.

	ProjectX
		.svn [DIRECTORY]

Now things start to become clearer…

Subversion introductions and tutorials often say things that are rather cryptic to someone who is trying to learn Subversion. Even HELP questions and FAQs posted on the Web can be mystifying. Now let’s see how some of those things make sense in light of our knowledge of the .svn subdirectory.

Showing file changes

The reason that Subversion can allow “a user to see what changes they’ve made without even contacting the server” is that the Subversion diff works only on the workingCopy directory on your own PC.

When Subversion shows file changes (that is, shows diffs) it is actually showing diffs between

  • your edited files in the workingCopy directory, and
  • the pristine copies of the those files that are being held in the .svn subdirectory of the workingCopy directory.

“unversioned” files vs. files “under version control”

Suppose that I make a change to one of my files in the ProjectX directory.

When I make the changes, my editor automatically creates a backup file in the same directory.

At this point, the backup file is what is called an “unversioned” file. It exists in the ProjectX directory, but not in the ProjectX/.svn directory, so Subversion knows nothing about it. That makes sense: we don’t want a backup file to be considered a project file anyway.

But suppose I want to add a new module to the project. If I simply create the new file in the ProjectX folder, it will be an “unversioned” file in just the same way that the backup file is: it will not exist in the ProjectX/.svn directory, so Subversion will know nothing about it.

So that is why Subversion has a “svn add” command. Running svn add on the new file adds it to the project by copying it into the ProjectX/.svn directory. At this point — after it has been added to the .svn subdirectory — the file is said to be “under version control”.

Note that — at this point — although the new file has been “added” to the copy of the project in the workingCopy, the main repository still doesn’t know anything about it: it hasn’t been added to the central repository on the server.

When I “commit” my changes, I send the files from my workingCopy to the main repository. Only after that happens does the new file truly become part of the project by becoming one of the files in the central repository.

Help! I’ve lost my .svn directory and I can’t get up!

Because a Subversion workingCopy directory needs a .svn subdirectory in order to work properly, you can have problems with Subversion if you accidentally delete the .svn subdirectory.

What is a “clean copy”?

In various tutorials, and in the Subversion docs, you will run across the expression “clean copy”. A “clean copy” is a copy of only the source-code files, without the .svn directory.

An introduction to Subversion (which is also a nice introduction to the TortoiseSVN open-source Windows GUI client for Subversion) explains things nicely.

If you look closely in your working copy, you may see an .svn folder in each folder of your working copy. The folders are hidden folders, so depending on the Windows settings you may not see them, but they are there. Those folders contain the information that Subversion uses to link your working copy to the repository.

If ever you need to get a copy of what’s in the repository, but without all the .svn folders (say for example you’re ready to publish it or hand the files over to your client), you can do an “SVN Export” into a new folder to get a “clean” copy of what’s in your repository.

Having the concept of a “clean copy” makes it easier to understand the next question…

Checkout vs. Export

A Frequently Asked Question about Subversion is What’s the difference between a “checkout” and an “export” from the repository?

The CollabNet docs say this:

They are the same except that Export doesn’t include the .svn folders and Checkout does include them. Also note that an export cannot be updated.

When you do a Subversion checkout, every folder and subfolder contains an .svn folder. These .svn folders contain clean copies of all files checked out and .tmp directories that contain temporary files created during checkouts, commits, updates and other operations.

An Export will be about half the size of a Checkout, due to the absence of the .svn folders that duplicate all content.

Note that the reason an exported folder cannot be updated is that the update command updates the .svn directory of a workingCopy, but an export does not create an .svn directory.

Note also that you can export from either the main repository or from the workingCopy .svn directory. See Subversion docs for export.

The (import, checkout) usage pattern for getting started with Subversion

Most “getting started with Subversion” tutorials start the same way. Assuming that you have some project files that you want to put into Subversion, you are told to:

  • do an import
  • do a checkout

in that order.

What you are not told is why you start with those two particular actions in that particular order.

But by now, knowing about the hidden .svn administrative directory and what it does, you can probably figure that out.

Import is the opposite of export. It takes a directory of files — a clean copy of the files, if you will — from your hard drive and copies them into the central Subversion repository on the server.

Always the next step is to do a checkout. Basically a checkout copies the project files from the central repository to a workingCopy directory on your PC. If the workingCopy directory does not exist on your PC, it is created.

The workingCopy directory contains everything you need in order to be able to work with Subversion, including an .svn administrative directory. As the CollabNet documentation (quoted earlier) says:

When you do a Subversion checkout, every folder and subfolder contains an .svn folder. These .svn folders contain clean copies of all files checked out and .tmp directories that contain temporary files created during checkouts, commits, updates and other operations.

So the second step — the checkout command — is absolutely necessary in order to get started. It creates a workingCopy directory containing the project files. Only after that happens are your files properly “under version control”.

checkin vs. commit

PVCS (and SourceSafe, and many other version control systems) work on a locking model. “Checking out” a file from the repository means that you get a local working copy of the file, and you lock the file in the repository. At that point, nobody can unlock it except you. Checking out a file gives you exclusive update privileges on it until you check it back in.

“Checking in” a file means that you copy your local working copy of the file back into the repository and you unlock the file in the repository.

It is possible to copy your local working copy of the file into the repository without unlocking the file in the repository. When you do this, you are in a sense “updating” the repository from the working copy.

Because of my familiarity with this kind of version control, I had a certain “mental model” of how a version control system works. And because of that mental model, many of the Subversion tutorials were quite confusing.

One source of confusion is the fact that (as we will see in the next section) the word “updating” in the context of Subversion means exactly the opposite of what it means in the context of PVCS.

One of the Subversion tutorials that I found said that you must checkout your workingCopy from the main repository, because you can’t do a checkin back to the main repository if you hadn’t checked it out. This was very confusing to an ex-PVCS user.

First, it suggested that Subversion works like PVCS: that there is a typical round-trip usage pattern consisting of

  • checking out (locking)
  • editing
  • checking in (unlocking)

But Subversion doesn’t work like this, at least not by default.

What the tutorial was trying to say, I think, was that in order to work with Subversion, you must create a workingCopy directory (that is, a directory that contains an .svn administrative subdirectory). And the way to create a workingCopy directory is to run a svn checkout command against the repository on the server.

Second, explaining things this way was confusing because Subversion doesn’t really have a checkin command. It does have a commit command, which some tutorials call a “checkin” command. But that command does not do the same thing as a PVCS checkin.

Ignore the fact that the short form of the commit command is ci (which stood for “checkin” in an earlier incarnation of Subversion). A Subversion “checkin” is the same thing as a “commit”, and has nothing to do with locking. It would really be helpful if all Subversion tutorials would stop using the term “checkin” and replace it with “commit”.

If you are used to working with a VCS that uses the “check out, edit, check in” paradigm, and you come to understand that Subversion’s commit is not the same as your old familiar check in, then your next question will almost certainly be:

Once you checkout a project into a working folder, how do you check it in a la SourceSafe? [Or PVCS, or other lock-based VCSs? — Steve Ferg]

I know there is “commit” which puts my changes into the repository, but I still have the files checked out under my working folder. What if I am done with a particular file and I don’t want to have it checked out? How do I check it back in?

You can read the answer here.

What does svn update do?

EXECUTIVE SUMMARY: svn update updates the workingCopy, not the repository.

The Subversion docs describe the update command this way:

When working on a project with a team, you’ll want to update your working copy to receive any changes other developers on the project have made since your last update. Use svn update to bring your working copy into sync with the latest revision in the repository:

Basically, what the update command does is to copy the project files from the central repository down to the .svn directory in your workingCopy.

This is something you should do frequently, because you don’t want the files in your workingCopy/.svn directory to get too far out of sync with the files in the central repository. And you don’t want to try to commit files if your workingCopy/.svn is out of sync with the central repository.

That means that as a general rule, you should always run an svn update:

  • just before you start making a new round of changes to your workingCopy, and
  • just before doing a commit.

Now, having mastered the concept of an .svn directory, we can Understand Many Things, even arcana such as why Serving websites from svn checkout considered harmful.

So that’s it.

This post contains information written by a Subversion newbie in the hopes that it will be useful to other Subversion newbies. But of course, having been written by a newb, there are all sorts of ways it could be wrong.

If you’re a Subversion expert (and it doesn’t take much to be more expert than I am) and you see something wrong, confused, or misleading here, please leave a comment. I, and future generations of Subversion newbies, will thank you for it.

Thanks to my co-workers Mark Thomas and Jason Herman for reviewing an earlier draft of this post.

Posted in Subversion | 11 Comments

How to fix a programmable Northgate keyboard

After my earlier post about Northgate keyboard repair it occurred to me that this information might be useful. I don’t think it can be found anywhere else on the Web.

Note that in the following slideshow (showing the repair of an Evolution keyboard) you can mouse-over the image. Controls will pop up that allow you to pause the show and to step forward and backward.


When programmable keyboards go bad

A while ago, one of my Northgate keyboards seemed spontaneously to sustain some kind of brain injury. A number of the keys seemed to have gone haywire. The left shift key didn’t work and several pairs of keys seemed to have exchanged places.

I talked with Bob Tibbetts of Northgate Keyboard Repair, and he explained the situation. Here is what I learned.

The Northgates are programmable keyboards — they contain a programmable chip. They were designed so that certain key combinations (e.g. pressing the left shift key four times) puts the keyboard (that is, the programmable chip) into programming mode.

Unfortunately the programmable chip had software that worked only with Windows 98 and earlier. If you are using a Northgate keyboard with any other system, the programmable chip is basically a bad chip and should be removed. (Bob noted that he removes the chip from any keyboards that he sells.)

Fixing the problem is a two-step process. First you “reboot” the keyboard into non-progamming mode, then you remove the chip.

You can just reboot the keyboard without removing the chip, of course, and that will fix the immediate problem. But as long as the programmable chip is still in the keyboard, similar problems can occur again at any time.

How to “reboot” the programmable keyboard

Shut the computer down. Don’t just log off or do a “soft” reboot. Power off.

Press the ESCAPE (ESC) key down and hold it down while you power up your PC. Do not release the ESC key until the computer beeps at you, or you have to do something like entering a password.

This should make the keyboard work normally. (If it doesn’t, then the problem was something other than the programmable chip.)

The anatomy of an Evolution keyboard

Working with Evolution keyboards is tricky because the Evolutions have the little GlidePoint touchpad in the middle of the top of the keyboard. There are short cables that go from the GlidePoint touchpad in the upper part of the keyboard to the “motherboard” in the bottom part of the keyboard.

Basically, the GlidePoint cables act as a sort of tether between the upper and lower halves of the keyboard. The cables are short, and virtually impossible to re-attach if you pull them loose. So you have to be careful not to pull them loose.

How to remove the programmable chip from an Evolution keyboard

First, make sure you have read “The anatomy of an Evolution keyboard” (above). Then …

“Reboot” the keyboard (see the instructions given above), then shut down (power off) your PC.

Turn the keyboard over, so that you are looking at the bottom of the keyboard.

Take the six screws (the ones holding the upper and lower parts of the keyboard together) out of the keyboard.

Turn the keyboard over, so that it is face up and you are looking at the keys.

DO NOT lift the top off of the keyboard.

Well, you can lift it a little. 

In the slideshow, you can see the top of the keyboard sitting on a little green box that lifts it about 2.75 inches (7 cm).  You can see the GlidePoint cables running from the touchpad in the top of the keyboard to the motherboard in the bottom of the keyboard. Those are the cables that you don’t want to disturb.

Lift the top half of the keyboard just enough to free it from the bottom half, then rotate the top clockwise about 4 or 5 inches, just enough to expose the programmable chip. Rotate the top using the location of the touchpad as the pivot point — that way you will disturb the touchpad cables as little as possible.

On the top right-hand side, locate the programmable chip. It is a small chip about 1/4″ x 3/8″ with 24C16 embossed on it.

Take a small screwdriver and pry the chip out. When you do this, you may break a few of the prongs that hold the chip to the motherboard. That’s OK. Bob Tibbetts suggested using a jeweler’s screwdriver. I used a small (but long) electrician’s screwdriver. I also found that once I had the chip lifted up, but not completely free of the motherboard, a needle-nose pliers was perfect for the final removal.

Around the edges of the chip socket, carefully cut off any remaining prongs. The goal is to leave no prongs sticking up that might touch each other or anything else. I think a “side cutter” pliers would be too big for this job. Something like a toenail clipper might be about right. I had only one prong left stuck in the motherboard, and I gently twisted it off with the needle-nose pliers.

Carefully lower the top of the keyboard back down onto the lower part.

Carefully turn the keyboard over, making sure to keep the two halves of the keyboard together.

Put the screws back in.

You’re done!

How to remove the programmable chip from a non-Evolution programmable keyboard

For other programmable Northgate keyboard models (models ending in a P for “programmable”) — 101P, 102P, Ultra TP and Ultra P — you can use basically the same procedure as described above for the Evolution.

The difference is that non-Evolution keyboards don’t have the GlidePoint touchpad embedded in the top of the keyboard. That means that you don’t need to worry about the GlidePoint cables, so you can lift the keyboard top completely off in order to access the programmable chip.

Posted in Keyboards | 5 Comments

Northgate keyboard repair

The best computer keyboards ever made (even when compared to the original IBM model M keyboards) were the Northgate Omnikey keyboards.  They were heavy keyboards built like tanks, featuring buckling spring key-switches notable for their distinctive clicking as you typed.  These were real keyboards — no crappy “rubber dome” key switches allowed.

Omnikey Ultra keyboard

I used only Northgate Omnikey Ultras for years, lugging them from job to job like an itinerant medieval carpenter carrying his tools with him from town to town, and using special keyboard plug adapters when keyboard plug design evolved first to PS/2 and then to USB.

But tools get worn and dirty and a few years ago my Ultras were terminally filthy and starting to fail.  That was when, thanks to the twin miracles of the Web and Google, I found Bob Tibbetts and his Northgate Keyboard Repair web site.  Bob belongs to the school of minimalist website design, but his keyboard expertise and repair skills are totally maximal, and he really saved my bacon keyboards.   He also, in a manner of speaking, saved my wrists.

After 25 years of coding, the joints in my hands and wrists were starting to protest.  I switched from using a mouse to using a trackball (I prefer a Logitech Cordless Optical Trackman), and that helped a lot.   Carpal tunnel syndrome forced a friend of mine to retire on disability and put The Fear into me.  A bout of online research convinced me that we really need more ergonomic keyboards, so I went shopping for one.

The major feature of an ergonomic keyboard is a split design in which the left and right halves of the keyboard  are split apart, separated by a few inches, and angled slightly so that you can type without bending your wrists.  The result is a keyboard that is shaped like a V rather than like a straight unbroken line. In a sense, the keyboard is bent so your wrists don’t have to be.

Northgate Evolution keyboard

What I really wanted, of course, was an ergonomic version of the Omnikey Ultra. 

One day, in an email to Bob, I mentioned that although I loved my Ultras (one of which Bob was cleaning and repairing at the time), what I really wished for was an ergonomic V-shaped version of the Ultra. 

Well, I nearly fell off my chair when Bob told me that such a thing actually existed.  It was called the Omnikey Evolution keyboard.  Evolutions were very advanced for their time, and very few were made.  But a few — new in the box — still existed, and he had a few for sale.

I immediately ordered one, tried it out, and loved it.  It is my favorite keyboard ever.  So I followed my Mom’s tongue in cheek advice (“Get ’em before the hoarders do.”) and got more.  I now own 5 — one for work, one for my home Vista machine, one for my home Linux machine, and two backups.

As I type this, it is almost midnight on March 11, 2011, and Bob has only 3 Evolution keyboards left. 

The good news is that if you have a beloved old Northgate that is showing its age, Northgate Keyboard Repair is still in the business of cleaning and repairing Northgate keyboards.

Finally, if you’re looking to purchase a keyboard with buckling spring key switches, you might check out the Customizer line of keyboards.  It is a reincarnation of the original IBM model M.

And keep on clicking…

## updated January 1, 2012

Posted in Keyboards | 14 Comments

An alternative to string interpolation

I sort of like this.

# ugly
msg = "I found %s files in %s directories" % (filecount,foldercount)

# better
def Str(*args): return "".join(str(x) for x in args)
msg = Str("I found ", filecount, " files in ", foldercount, " directories" )

You don’t have to call it “Str”, of course.
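Just to show it in action, here is a quick sanity check (it reuses the Str definition from above; the file and folder counts are made up):

```python
# the helper from above: join the str() of each argument
def Str(*args): return "".join(str(x) for x in args)

filecount, foldercount = 42, 7
msg = Str("I found ", filecount, " files in ", foldercount, " directories")
print(msg)  # -> I found 42 files in 7 directories
```

Note that there is no need to worry about %s versus %d: str() handles any type.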

Posted in Python features | 15 Comments

A Globals Module pattern

Two comments on my recent posts on a Globals Class pattern for Python and an Arguments Container pattern reminded me that there is one more container for globals that is worth noting: the module.

The idea is a simple one. You can use a module as a container.

Most introductions to Python tell you all about how to get stuff — that is, how to import stuff — *from* imported modules. They talk very little about writing stuff *to* imported modules. But it can be done.

Here is a simple example.

Let’s start with the intended container module, mem.py. I’d show you the contents of mem.py, except for the fact that there aren’t any: mem.py is empty.

Next let’s look at two modules that import and use mem.py.

The caller module is leader.py. Note that it imports mem and also imports the subordinate module, minion.  (Note the use of the print() function; we’re running Python 3 here.)

import mem
import minion

mem.x = "foo"
print("leader says:",mem.x)
minion.main()
print("leader says:",mem.x)
print()

mem.x = "bar"
print("leader says:",mem.x)
minion.main()
print("leader says:",mem.x)

The subordinate module is minion.py:

import mem

def main():
	print("minion says:",mem.x)
	mem.x = "value reset by minion from " + mem.x

If you run leader.py, it imports minion and mem, and uses mem as a container for the variable x.  It assigns a value to x in mem and calls minion.main(), which reads mem.x and resets mem.x’s value, which leader then reads.

When you run leader.py, you see this output:

leader says: foo
minion says: foo
leader says: value reset by minion from foo

leader says: bar
minion says: bar
leader says: value reset by minion from bar

Note that passes no arguments to minion.main() and minion.main() doesn’t return anything (other than None, of course). Leader and minion communicate solely by means of the variables set in mem. And the communication is clearly two-way. Leader sets values that minion reads, and minion sets values that leader reads.

So what we have here, in mem, is a truly global container. It is not “module global” as in the Globals Class pattern. It is “application global” — it is global across the multiple modules that make up an application.  In order to gain access to this container, modules simply import it.
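If you want to play with the idea in a single file, here is a sketch that uses types.ModuleType to stand in for the empty mem.py (that stand-in is just for demonstration; in real code mem.py would be an actual file on disk):

```python
import sys
import types

# create an empty module object and register it, standing in for mem.py
mem = types.ModuleType("mem")
sys.modules["mem"] = mem

mem.x = "foo"            # one module writes to the container...

import mem as elsewhere  # ...any other module just imports it
print(elsewhere.x)       # -> foo

elsewhere.x = "bar"      # writes made anywhere are visible everywhere,
print(mem.x)             # -> bar   because every importer gets the same object
```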

In keeping with the earlier posts’ grandiosity, I will call this use of an imported module the Globals Module pattern.

Every Python programmer is familiar with one special case of the Globals Module pattern. Just rename mem.py, stuff it with a bunch of constants or configuration variables, and you have a typical Python file for defining constants or setting configuration values. These values are “application global”, available to all modules in an application. All they have to do is import it.

Doing a bit of arm-waving, and christening a Globals Module pattern, does one thing.  It reminds us that modules — used as containers for “application global” values — aren’t limited to supplying constants and pre-set values. Modules can also be written to.  The communication between “normal” modules and Globals Modules is a two-way street.

Posted in Python Globals | Comments Off on A Globals Module pattern

An Arguments Container pattern

In a comment on my earlier post A Globals Class pattern for Python, Mike Müller wrote
“No need for globals. Just explicitly pass your container. In my opinion this is much easier to understand.”

Mike’s comment led me to some further thoughts on the subject.

Suppose you have a number of things — x, y, and z — that you want to make available to many functions in a module.

There are four strategies that you could use. You could

1. pass x, y, and z as individual arguments
2. make x, y, and z globals

or you could create a container C of some sort and

3. pass container C as an argument
4. make container C a global

So you have two basic questions to answer. When you make the things — x, y, and z — available:

A. Do you make them available in global variables, or in arguments that you pass around?

B. Do you make them available individually, or do you put them in some kind of container and make the container available?

My original post assumed that in at least some situations you might answer question A with “use global variables” and then went on to propose that in those situations the best answer to B is “put them in a container”.

Since the point of that post was to point out the usefulness of a class as a container, I called the proposed pattern the Globals Class pattern. But in most cases some other kind of container would do as well as a class. I could almost as easily have called the pattern the Globals Container pattern.

So if you look at these two questions — A and B — I think it is interesting where Mike and I differ, and where we agree.

Question A: args or globals

Where we differ, if you could call it that, is in the answer to A.

Mike wrote “No need for globals. Just explicitly pass your container. In my opinion this is much easier to understand.”

In my post I wrote “Sometimes globals are the best practical solution to a particular programming problem.” But that wasn’t really what the post was about. It was about the answer to question B.

So I can’t really say that Mike and I disagree very much. He says “I like apples”. I say “Sometimes I like an orange.”  No big deal.

Question B — multiple things or a single container

What is much more interesting is that we both agree on the answer to question B: use a container object.

But since I was talking about globals, I was talking about a container for globals.  Since Mike was talking about arguments, he was talking about a container for arguments.

Which means that we have two different patterns. My earlier post was about strategy 4 — a Globals Container pattern. Mike is talking about strategy 3 — what we might call an Arguments Container pattern.

As it happens, I had stumbled onto the Arguments Container pattern myself, not in Python but in Java. The circumstances were very similar to the circumstances that led to the Python Globals Class pattern. I had a lot of variables that I needed to pass around. As the code evolved, the argument lists got longer and harder to manage. Finally I just bundled all of the variables into a single container object and passed the container around. As I needed to add new arguments, I was able to add them to just one place — the container.

At the time, I felt sort of stupid doing this. I hadn’t ever heard of this as a programming technique.  It smacked of sneaking global variables in through the back door, and of course everybody knows that globals are always bad. But it worked, and it made my life a lot easier.

So now Mike comes along and proposes doing exactly the same thing. I feel relieved. I’m not the only one doing this. It may even be a Good Thing.

So I’m happy to announce — not the discovery, certainly — the christening of the Arguments Container pattern, which says, basically:

Sometimes when you have a lot of individual variables that you need to pass around to a lot of different functions or methods, the best solution is to put them into a container object and just pass the container object around.

This is not a specifically Python pattern. And in a way it is No Big Deal. But I’m doing a bit of shouting and arm-waving here because I think that somewhere there is probably at least one person for whom this post might be useful.
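Here is a minimal sketch of the pattern (all of the names, Args, read_data, and so on, are made up for illustration):

```python
class Args:
    """A container for the values that many functions need."""
    def __init__(self, infile, outfile, verbose=False):
        self.infile = infile
        self.outfile = outfile
        self.verbose = verbose

def read_data(args):
    # instead of read_data(infile, outfile, verbose, ...), pass one container
    return "data from " + args.infile

def write_data(args, data):
    return data + " -> " + args.outfile

args = Args("in.txt", "out.txt")
result = write_data(args, read_data(args))
print(result)  # -> data from in.txt -> out.txt
```

Adding a new "argument" later means touching only the Args class, not every function signature along the call chain.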

Posted in Python Globals | 5 Comments

A Globals Class pattern for Python

I’ve gradually been evolving a technique of coding in which I put module globals in a class. Recently I stumbled across Norman Matloff’s Python tutorial in which he recommends doing exactly the same thing, and it dawned on me that this technique constitutes a truly idiomatic Python design pattern.

A pattern needs a short, catchy name. I don’t think anyone has yet given this pattern a name, so I will propose the name Globals Class pattern.

I’m sure that many experienced Python programmers are already quietly using the Globals Class pattern. They may not see much point in making a big deal about it, or in giving it a name and decking it out with the fancy title of “design pattern”. But I think a little bit of hoopla is in order. This is a useful technique, and one worth pointing out for the benefit of those who have not yet discovered it.  A bit of cheering and arm-waving is in order, simply to catch some attention.

The technique is extremely simple.

  • You define a class at the beginning of your module.  This makes the class global.
  • Then, all of the names that you would otherwise declare global, you specify as attributes of the class.

Really, there is virtually nothing class-like about this class; for instance, you probably will never instantiate it. Instead of functioning like a true class, it functions as a simple container object.

I like to use the name “mem” (in my mind, short for “GlobalMemory”) for this class, but of course you can use any name you prefer.

All you really need is a single line of code.

        class mem: pass

That is enough to create your mem container. Then you can use it wherever you like.

        def doSomething():
            mem.counter = 0
        def doMore():
            mem.counter += 1
        def doSomethingElse():
            if mem.counter > 0:
                ...

If you wish, you can initialize the global variables when you create the class. In our example, we could move the initialization of mem.counter out of the doSomething() function and put it in the definition of the mem class.

        class mem:
            counter = 0

In a more elaborate version of this technique, you can define a Mem class, complete with methods, and make mem an instance of the class. Sometimes this can be handy.

        class Mem:
            def __init__(self):
                self.stupidErrorsCount = 0
                self.sillyErrorsCount  = 0

            def getTotalErrorsCount(self):
                return self.stupidErrorsCount + self.sillyErrorsCount

        # instantiate the Mem class to create a global mem object
        mem = Mem()

What’s the point?

So, what does the Globals Class pattern buy you?

1. First of all, you don’t have to go putting “global” statements all over your code.   The beauty of using a globals class is that you don’t need to have any “global” statements in your code.

There was a time — in the past, when I still used “global” — when I might find myself in a situation where my code was evolving and I needed to create more and more global variables. In a really bad case I might have a dozen functions, each of which declared a dozen global variables. The code was as ugly as sin and a maintenance nightmare.  But the nightmare stopped when I started putting all of my formerly global variables into a global class like mem.  I simply stopped using “global” and got rid of all those “global” statements that were cluttering up my code. 

So the moral of my story is this.  Kids, don’t be like me.  I started out using “global” and had to change.  I’m a recovering “global” user. 

Don’t you even start.  Skip the section on the “global” keyword in your copy of Beginners Guide to Learning Python for Dummies.  Don’t use “global” at all.  Just use a globals class.

2. I like the fact that you can easily tell when a variable is global simply by noticing the mem. modifier.

3. The global statement is redundant.  The Globals Class pattern relieves us of the burden of having to worry about it.

Python has the quirk that if X is a global, and a function only reads X, then within the function, X refers to the global. But if the function assigns a value to X anywhere in its body, then X is treated as local throughout the entire function.

So suppose that — as your code evolves — you add an assignment statement deep in the bowels of the function. The statement assigns a value to X. Then you have — as a side-effect of the addition of that statement — converted X (within the scope of the function) from a global to a local.

You might or might not want to have done that.  You might not even realize what you’ve done.   If you do realize what you’ve done, you probably need to add another statement to the function, specifying that X is global.  That is sort of a language wart. If you use the Globals Class pattern, you avoid that wart.
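The quirk is easy to demonstrate with a minimal example:

```python
x = 1

def read_only():
    return x      # no assignment anywhere: x refers to the global

def assigns():
    x = 2         # an assignment makes x local throughout this function
    return x

print(read_only())  # -> 1
print(assigns())    # -> 2
print(x)            # -> 1  (the global was never touched)
```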

4. I think the use of the Globals Class pattern makes the work of static code analyzers (e.g. PyFlakes) easier.

5. The Globals Class pattern makes it possible to create multiple, distinct groups of globals.

This can be useful sometimes. I have had modules that processed nested kinds of things: A, B, and C. It was helpful to have different groups of globals for the different kinds of things.

        class memA: pass
        class memB: pass
        class memC: pass

6. Finally, the Globals Class pattern makes it possible to pass your globals as arguments.

I have had the situation where a module grew to the point where it needed to be split into two modules. But the modules still needed to share a common global memory. With the Globals Class pattern, a module’s globals are actually attributes of an object, a globals class.  In Python, classes are first-class objects.  That means that a globals class can be passed — as a parameter — from a function in one module to a function in another module.
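A minimal sketch of that last point (the increment function here stands in for code that would live in a second module):

```python
class mem:
    counter = 0

def increment(container):
    # imagine this function living in another module: it receives
    # the globals class as an ordinary argument and updates it
    container.counter += 1

increment(mem)
increment(mem)
print(mem.counter)  # -> 2
```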

Is this really A Good Thing?

At this point I can hear a few stomachs churning. Mine is one of them. Because, as we all know, Global Variables are Always a Bad Thing.

But that proposition is debatable.  In any event, it is an issue that I’m not going to explore here.  For now, I prefer to take a practical, pragmatic position:

  • Sometimes globals are the best practical solution to a particular programming problem.
  • For the occasions when Globals are A Good Thing, it is handy to have a way to Do Globals in A Good Way.

So the bottom line for me is that there are occasions when some kind of globals-like technique is the best tool for the job.  And on those occasions the Globals Class pattern is a better tool for the job than globals themselves.

Posted in Python Globals | 8 Comments

How to open a web browser from Python

This goes under the Tips and Tricks category. 

Also under Stuff I wish I had known about a long time ago.

The trick is in the standard library, in the webbrowser module.

For documentation, see the webbrowser module in the Python standard library.

import webbrowser
new = 2 # open in a new tab, if possible

# open a public URL, in this case, the webbrowser docs
url = ""
webbrowser.open(url,new=new)

# open an HTML file on my own (Windows) computer
url = "file://X:/MiscDev/language_links.html"
webbrowser.open(url,new=new)
Posted in Python features | 4 Comments

Command-line syntax: some basic concepts

I’ve been reading about parsers for command-line arguments lately, for example Plac. And, as Michele Simionato says:

There is no want of command line arguments parsers in the Python world. The standard library alone contains three different modules: getopt (from the stone age), optparse (from Python 2.3) and argparse (from Python 2.7).

My reading has made me realize that there is an immense range of possible syntaxes for command-line arguments, and far less consensus and standardization than I thought. Although there are some general styles that programmers often use when implementing the command-line arguments for their applications, basically every programmer is free to do whatever he (or she) wants. The result is that whenever you encounter an application for the first time, you can’t safely assume anything about the syntax of its command-line arguments.

It also has made me wonder if anyone had ever written an overview of, or introduction to, the basic concepts involved in command line arguments. I searched the Web without finding one, so I thought it would be interesting to try to write one.  I can live with the risk that I’m re-inventing the wheel.

Of course, there may be something out there and I just missed it. So if you know of some other discussion of this topic, please leave a comment and tell me about it. And if there is something that I missed here, I’d appreciate a comment about that too.

What is a command line argument?

When you invoke an application from a command line, it is often useful to be able to send one or more pieces of information from the command line to the application. As a simple example, we might want to start a text editor and also tell it the name of a file that it should open, like this

          superedit a_filename.txt

In this example, “superedit” is the name of the application, and “a_filename.txt” is a command line argument: in this case, the name of a file.

It is possible to supply more than one command line argument

We often want to send an application multiple arguments, like this:

          rename file_a.txt  file_b.txt

Positional arguments, named arguments, and flags

There are three types of command line argument: positional arguments, named arguments, and flags.

  • A positional argument is a bare value, and its position in a list of arguments identifies it.
  • A named argument is a (key, value) pair, where the key identifies the value.
  • A flag is a stand-alone key, whose presence or absence provides information to the application.
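For what it’s worth, the three types map neatly onto Python’s argparse module (a sketch; the “rename” program and its option names are hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(prog="rename")
parser.add_argument("oldname")                    # positional: identified by position
parser.add_argument("--newname")                  # named: a (key, value) pair
parser.add_argument("-v", "--verbose",
                    action="store_true")          # flag: presence alone is the signal

ns = parser.parse_args(["file_a.txt", "--newname", "file_b.txt", "-v"])
print(ns.oldname, ns.newname, ns.verbose)  # -> file_a.txt file_b.txt True
```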

If we supplied the “rename” application with two positional arguments, like this

          rename file_a.txt  file_b.txt

then the position of the arguments identifies the value.

  • The value in position 1 (“file_a.txt”) is the current name of the file.
  • The value in position 2 (“file_b.txt”) is the requested new name of the file.

We could have written the “rename” application so that it requires two named arguments, like this

          rename  -oldname file_a.txt  -newname file_b.txt

A flag is an argument whose presence alone is enough to convey information to the application. A good example is the frequently-used “-v” or "--verbose" argument.

Although it is possible to think of flags as degenerate named arguments (named arguments that have a key but no value), I find it easier to think of flags as a distinct type of argument, different from named arguments.

Keyword arguments and options

I will use the term keyword argument to cover both named arguments and flags.

David Goodger notes (in the first comment on the first version of this post) that I am not using the traditional Unix command-line lexicon.  What I have called keyword arguments are — on Unix platforms — traditionally called options;  what I have called values are traditionally called option arguments; and what I have called positional arguments, the Open Group calls operands.  So I should probably say something about my choice of technical terminology.

For the purposes of this analysis, I prefer not to use the traditional Unix vocabulary of options, for a number of reasons.  First of all, the term option tends to be Unix-specific; on Windows the term parameter is more frequently used.  Second, the investigation began with command-line parsers, and in the context of a discussion of parsers and parsing, keyword argument seems a more traditional and appropriate term than option.  Third, the usual definition of option is not very useful.

Arguments are options if they begin with a hyphen.

And finally, the term option implies optionality.  Whether an argument is optional or required is a semantic issue rather than a syntactical issue.  At this point I’m interested in syntactical issues, so I want to use a semantically neutral vocabulary.  We can talk about options and optionality later, when we look at semantic concepts.

Keyword arguments require a sigil

When keyword arguments are used, there must be some mechanism for distinguishing a key from a value or from a positional argument. That mechanism is a “sigil”: a special character or string of characters that indicates the beginning of a key. In our example, the sigil was a dash (a hyphen).

On Windows, the sigil is typically a forward slash: “/”.

On Unix-like operating systems, the sigil is typically a dash "-".

Some applications use multiple sigils.  With the plus sign “+” as a sigil, for instance, it is possible to use flags to turn options on and off.

          attrib   -readonly    -archive     file_A.txt
          attrib   +readonly    +archive     file_A.txt

Single-character and multi-character keys

Some applications, especially on Unix, make a distinction between single-character keys and multi-character keys (“long options”), with a single-dash sigil "-" indicating the beginning of a single-character key, and a double dash "--" sigil indicating the beginning of a multi-character key. Often, an application will support both single-character and multi-character keys for the same argument. For example, the “rename” application might accept both this

          rename  -o file_a.txt  -n file_b.txt

and this

          rename  --oldname file_a.txt  --newname file_b.txt

Fixed-length and variable-length keys

The previous section describes what I think most Unix programmers would say is the difference between single-dash and double-dash keys. But I think it is actually wrong.

The real difference between a single-dash sigil "-" and a double dash "--" sigil is not the difference between one and many, but the difference between fixed-length and variable-length keys. (This is obscured by the fact that a single-character key is also automatically a fixed-length key.)

The thing that really makes keys that begin with a single dash different from keys that begin with a double dash is not that they are one character long, but that their length is fixed and known. For example, flag concatenation (see below) is possible because the flag keys have a known and fixed length. It doesn’t depend on the flag keys being one character long — it would work just as well if the length for flag keys was fixed at two or even three characters. And this is also true of the third technique for distinguishing keys from argument values (see the next section).

Named arguments require a mechanism to distinguish keys from argument values

One technique is to use whitespace to separate argument values from keys. We saw this in our earlier example

          rename  -o file_a.txt  -n file_b.txt

A second technique is to use a special (non-whitespace) character to separate argument values from keys. This special character could be any character that cannot occur in either the key or argument value.

On Unix, this is traditionally an equal sign “=”, like this.

          rename  -o=file_a.txt  -n=file_b.txt

On Windows and MS-DOS this is traditionally a colon “:”, like this.

          rename  /o:file_a.txt  /n:file_b.txt

An application might permit whitespace before and after the equal sign, like this.

          rename  -o = file_a.txt  -n = file_b.txt

A third technique is to use the known length of the key to distinguish the key from the argument value. Suppose the “rename” application uses only 1-character keys. Then it might accept arguments like this.

          rename  -ofile_a.txt  -nfile_b.txt
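A minimal sketch of this third technique, assuming fixed 1-character keys (the function name and the keys are illustrative, not part of any real "rename" program):

```python
def parse_fixed(arg, keys=("o", "n")):
    """Split an argument like '-ofile_a.txt' into its key and value,
    relying only on the known, fixed (1-character) key length."""
    if arg.startswith("-") and len(arg) > 2 and arg[1] in keys:
        return arg[1], arg[2:]
    return None

print(parse_fixed("-ofile_a.txt"))  # ('o', 'file_a.txt')
print(parse_fixed("-nfile_b.txt"))  # ('n', 'file_b.txt')
```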

Fixed-length keys make flag concatenation possible

Suppose that an application follows the convention that a single-dash sigil signals the start of a single-character flag argument. Then it can accept either this

          tar -x -v -f  some_filename.tar

or this, where several flag arguments are specified together

          tar -xvf some_filename.tar

Here is where the distinction between the single-dash sigil and the double-dash sigil becomes important.

  • "-xvf" indicates the concatenation of three single-character flags: “x”, “v”, and “f”.
  • "--xvf" (note the double dash) indicates a single multi-character flag: “xvf”.

Parsing the command line

In many of the examples that we’ve seen, parsing the command line is as simple as splitting it on whitespace. But the situation gets more complicated if values can contain whitespace. If that is true, then we need to support delimiters that can enclose values that contain whitespace.

Suppose we want to invoke a word-processor from the command line. And we want to specify two arguments on the command line: the name of the file, and the name of the author. This obviously will not work.

          superedit A Christmas Story.doc  Clement Moore

What we need is this.

          superedit "A Christmas Story.doc"  "Clement Moore"

Support of quoted values means that command-line parsers must be more sophisticated… just splitting the command line on whitespace won’t do the job. The command-line parser must recognize and correctly handle quote characters… and escaped quote characters inside of quoted strings.
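Python's standard-library shlex module implements exactly this kind of quote-aware splitting, following Unix quoting rules:

```python
import shlex

# shlex.split honors double (and single) quotes, so values
# containing whitespace survive the split intact.
args = shlex.split('superedit "A Christmas Story.doc" "Clement Moore"')
print(args)  # ['superedit', 'A Christmas Story.doc', 'Clement Moore']
```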

The most common delimiter for argument values is the double-quote symbol. But we might also (or instead) want to support single quotes, back ticks, parentheses, or square/wavy/pointy brackets. We can imagine a case in which a malevolent programmer wrote superedit to expect positional arguments like this.

          superedit (A Christmas Story.doc)  (Clement Moore)

… or named arguments like this.

          superedit filename(A Christmas Story.doc)  author(Clement Moore)

Sigils in positional arguments

Remember our “rename” application? It accepted arguments like this, where the dash is the sigil that introduces the key of a named argument.

          rename  -o file_a.txt  -n file_b.txt

But filenames can begin with dashes. We might need to write a command like this, which would cause problems.

          rename  -o -file_a.txt  -n -file_b.txt

So this is another reason why we might need to be able to quote argument values: to “hide” a sigil character inside a value.

          rename  -o "-file_a.txt"  -n "-file_b.txt"

The order of arguments

In the first version of this post, I wrote that:

It is a universally observed convention that

  • keyword arguments (named arguments and flags) are grouped together
  • positional arguments are grouped together
  • keyword arguments must be specified first, before specifying positional arguments

But that is wrong. It is a widely — but not universally — observed convention. As Eric wrote, in a comment on the first version of this post,

many modern programs allow keyword arguments to be specified after (or even between) positional arguments

And even very old programs do it too. The command-line syntax for Microsoft DOS’s dir command (roughly equivalent to Unix’s ls command) is basically

dir [filename] [switches]

with the filename positional argument appearing before the switches.

A separator between keyword arguments and positional arguments

Suppose we have an application “myprog” that accepts one or more keyword arguments that start with a dash sigil, followed by one or more positional arguments that supply filenames. And suppose that filenames can contain — and begin with — dashes.

We’re going to have a problem if we code this

          myprog -v -r -t -file_a.txt -file_b.txt  -file_c.txt

myprog is going to see “-file_a.txt” and (since it starts with a dash, the sigil) myprog will try to handle it like a keyword argument. Not good.

We could deal with this problem by routinely enclosing all filename positional arguments in quotes, but that would be clumsy and laborious.

          myprog -v -r -t "-file_a.txt" "-file_b.txt"  "-file_c.txt"

An alternative is to use a special string (typically double dashes "--") to indicate the beginning of positional arguments.

          myprog -v -r -t   --  -file_a.txt -file_b.txt  -file_c.txt

So now we have four basic kinds of arguments.

  • positional arguments
  • named arguments (key+value pairs)
  • flags
  • an indicator of the beginning of positional arguments ("--")
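Python's argparse module, for example, supports the "--" indicator out of the box (the program and argument names here are illustrative):

```python
import argparse

# "-v" is a flag; everything after "--" is taken as positional,
# even though the filenames begin with the dash sigil.
parser = argparse.ArgumentParser(prog="myprog")
parser.add_argument("-v", action="store_true")
parser.add_argument("files", nargs="*")

ns = parser.parse_args(["-v", "--", "-file_a.txt", "-file_b.txt"])
print(ns.files)  # ['-file_a.txt', '-file_b.txt']
```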

Argument semantics

To be expanded…

Optional arguments vs. required arguments

Relationships between different arguments

  • Aliases
  • Mutual exclusion
  • Mutual necessity


Other variations

In some conventions:

  • Multi-character keys may be abbreviated as long as the abbreviations are unique.
  • The value in a named argument is optional and may be omitted.
  • The value of a named argument may be a list, with items in the list separated by a colon or a comma.
  • A sigil character standing by itself (e.g. a single dash) is treated as a positional argument.

Command-line as a programming language

I think that the best way to think of a command-line, and its arguments, is as a statement in a command-line (CL) programming language, where each application defines its own CL language.

This means that — as far as an application is concerned — the process of using command-line arguments always looks like this:

  1. define (i.e. tell the parsing module about) the syntax rules of the CL language to be used
  2. define (i.e. tell the parsing module about) the semantics of the CL language
  3. call the parser to parse the command line and its arguments
  4. query the parser for information about the “tokens” (the command-line arguments) that it found

Step 2 — specifying the CL semantics — is the step in which the application specifies (for example) what named arguments and flags it accepts, and which are required. This step is necessary for the parser to do certain kinds of semantic checking: (for example) to automatically reject unrecognized keys, or to automatically report required arguments that were not provided.

Step 2 can be omitted, but only if the application itself will do the semantic checking rather than expecting the parsing module to do it.

The upside of doing step 2 is that it enables a smart CL parsing module automatically to generate user documentation for the CL language, and to dump that documentation to the screen when it finds a syntactic or semantic error in the command line, or when the command line is a request (e.g. “/?” or “-h”) for the command-line documentation.
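Python's argparse module is an example of such a smart parsing module. Here is a sketch using the hypothetical "rename" application from earlier (the keys are illustrative): once the application has told the parser about its named arguments (step 2), the parser can generate the user documentation by itself.

```python
import argparse

# Step 1 and step 2: describe the CL language to the parsing module.
parser = argparse.ArgumentParser(prog="rename")
parser.add_argument("-o", "--oldname", required=True, help="current file name")
parser.add_argument("-n", "--newname", required=True, help="new file name")

# Because the parser knows the semantics (which keys exist, which are
# required), it can produce the "-h" documentation automatically:
print(parser.format_help())
```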

Command-line meta-languages

CL languages are like markup languages. You can invent your own from scratch if you wish, but life is a lot easier if you at least follow some standard conventions when you do.

In the world of markup languages, such standard conventions are called meta-languages. The best-known markup meta-language is XML. XML is not a markup language; it is a markup meta-language … roughly: a style, or set of conventions, or template for creating specific markup languages.

XML is well-defined by the W3C. It would make sense to have similarly well-defined, carefully specified meta-languages for CL languages. Right now, I think we have two loosely-defined CL meta-languages, which I shall refer to as

  • WinCL (for Windows)
  • NixCL (for *nix platforms)

Traditionally (see the Wikipedia article on command line argument)

  • WinCL uses a slash as the sigil; NixCL uses a dash.
  • WinCL uses a colon as a key/value separator; NixCL uses an equal sign.
  • WinCL keywords traditionally consist of a single letter; NixCL is open to multi-character keywords (GNU “long options”).

As of July 25, 2010:

If it is (or becomes) possible to consider WinCL and NixCL to be well-defined CL meta-languages, then the first step of specifying a CL language for an application (which I gave earlier):

  • define (i.e. tell the parsing module about) the syntax rules of the CL language to be used

could be simply

  • tell the parsing module whether the CL language will be a WinCL or a NixCL language

An alternative is to use a parser utility designed specifically to handle WinCL or NixCL. Python’s optparse, for example, “supports only the most common command-line syntax and semantics conventionally used under Unix.” And if you aren’t familiar with those conventions, the documentation summarizes them.

Posted in Miscellaneous | 4 Comments

Unicode for dummies – just use UTF-8

Revised 2012-03-18 — fixed a bad link, and removed an incorrect statement about the origin of the terms “big-endian” and “little-endian”.

Commenting on my previous post about Unicode, an anonymous commentator noted that

the usage of the BOM [the Unicode Byte Order Mark] with UTF-8 is strongly discouraged and really only a Microsoft-ism. It’s not used on Linux or Macs and just tends to get in the way of things.

So it seems worthwhile to talk a bit more about the BOM.  And in the spirit of Beginners Introduction for Dummies Made Simple, let’s begin at the beginning: by distinguishing big and little from left and right.

Big and Little

“Big” in this context means “more significant”. “Little” means “least significant”.

Consider the year of American independence — 1776.  In the number 1776:

  • The least significant (“smallest”) digit is 6. It has the smallest magnitude: it represents 6 * 1, or 6.
  • The most significant (“biggest”) digit is 1. It has the largest magnitude: it represents 1 * 1000, or 1000.

So we say that 1 is located at the big end of 1776 and 6 is located at the small end of 1776.

Left and Right

The terms big-endian and little-endian derive from Jonathan Swift's satirical novel 'Gulliver’s Travels' by way of Danny Cohen in 1980.


Here are two technical terms: “big endian” and “little endian”.

According to Wikipedia, the terms Little-Endian and Big-Endian were introduced in 1980 by Danny Cohen in a paper called “On Holy Wars and a Plea for Peace”.

1776 is a “big endian” number because the “biggest” (most significant) digit is stored in the leftmost position. The big end of 1776 is on the left.

Big-endian numbers are familiar.  Our everyday “arabic” numerals are big-endian representations of numbers.  If we used a little-endian representation, the number 1776 would be represented as 6771.  That is, with the “little” end of 1776 — the “smallest” (least significant) digit — in the leftmost position.

What do you think? In Roman numerals, 1776 is represented as MDCCLXXVI. Are Roman numerals big-endian or little-endian?

So big and little are not the same as left and right.

Byte Order

Now we’re ready to talk about byte order. And specifically, byte-order in computer architectures.

Most computer (hardware) architectures agree on bits (ON and OFF) and bytes (a sequence of 8 bits), and byte-level endian-ness.  (Bytes are big-endian: the leftmost bit of a byte is the biggest.  See Understanding Big and Little Endian Byte Order.)

But problems come up when handling pieces of data, like large numbers and strings, that are stored in multiple bytes.  Different computer architectures use different endian-ness at the level of multi-byte data items (I’ll call them chunks of data).

In the memory of little-endian computers, the “little” end of a data chunk is stored leftmost. This means that a data chunk whose logical value is 0x12345678 is stored as 4 bytes with the least significant byte to the left, like this: 0x78 0x56 0x34 0x12.

  • For those (like me) who are still operating largely at the dummies level: imagine 1776 being stored in memory as 6771.

Big-endian hardware does the reverse. In the memory of big-endian computers, the “big” end of a data chunk is stored leftmost. This means that a data chunk of 0x12345678 is stored as 4 bytes with the most significant byte to the left, like this: 0x12 0x34 0x56 0x78.

  • For us dummies: imagine 1776 being stored in memory as 1776.
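The two layouts are easy to see with Python's struct module, using the 0x12345678 example:

```python
import struct

n = 0x12345678
print(struct.pack("<I", n).hex())  # 78563412 -- little endian: small end first
print(struct.pack(">I", n).hex())  # 12345678 -- big endian: big end first
```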

Here are some random (but curiously interesting) bits of information, courtesy of the Microsoft Support web-site article Explanation of Big Endian and Little Endian Architecture.

  • Intel computers are little endian.
  • Motorola computers are big endian.
  • RISC-based MIPS computers and the DEC Alpha computers are configurable for big endian or little endian.
  • Windows NT was designed around a little endian architecture, and runs only on little-endian computers or computers running in little-endian mode.

In summary, the byte order — the order of the bytes in multi-byte chunks of data — is different on big-endian and little-endian computers.

Which brings us to…

The Unicode Byte Order Mark

In this section, I’m going shamelessly to rip off information from Jukka K. Korpela’s outstanding Unicode Explained from O’Reilly (see the section on Byte Order starting on page 300). (See also Jukka’s valuable web page on characters and encodings.)

Suppose you’re running a big-endian computer, and create a file in Unicode’s UTF-16 (two-byte) format.

Note that the encoding is the Unicode UTF-16 (two-byte) encoding, not UTF-8 (one-byte). That’s an important aspect of the problem, as you will see.

You send the file out into the world, and it is downloaded by somebody running a little-endian computer. The recipient knows that the file is in UTF-16 encoding. But the bytes are not in the order that he (with his little-endian computer) expects. The data in the file appears to be scrambled beyond recognition.

The solution, of course, is simply to tell the recipient that the file was encoded in UTF-16 on a big-endian computer.  Ideally, we’d like for the data in the file itself to be able to tell the recipient the byte order (big endian or little endian) that was used when the data was encoded and stored in the file.

This is exactly what the Unicode byte order mark (BOM) is designed to do.

Unicode reserves one code point specifically for the purpose of indicating byte order: U+FEFF, the byte order mark.

A big-endian encoder writes U+FEFF as the bytes 0xFE 0xFF; a little-endian encoder writes it as 0xFF 0xFE. The byte-swapped code point, U+FFFE, is deliberately defined to be a noncharacter, so the reading is unambiguous. If the first two bytes of a file are 0xFE 0xFF or 0xFF 0xFE, then a Unicode decoder knows that those two bytes contain a Unicode BOM, knows the byte order of the file, and knows what to do with the BOM.

This also means that if you (in the role, say, of a forensic computer scientist) must process a mystery file, and you see that the file’s first two bytes contain one of the two Unicode BOMs, you can (with a high probability of being correct) infer that the file is encoded in Unicode UTF-16 format.
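Python's utf-16 codec demonstrates this: when decoding, it reads the BOM itself and chooses the right byte order, so the same character decodes correctly from either byte order.

```python
# BOM FE FF = big endian; BOM FF FE = little endian.
big = b"\xfe\xff\x00A"     # BOM, then U+0041 stored big end first
little = b"\xff\xfeA\x00"  # BOM, then U+0041 stored little end first

print(big.decode("utf-16"))     # A
print(little.decode("utf-16"))  # A
```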

So: Where’s the BOM?

In actual practice, most UTF-8 files do not include a BOM.  Why not?

A file that has been encoded using UTF-16 is an ordered sequence of 2-byte chunks. Knowing the order of the bytes within the chunks is crucial to being able to decode the file into the correct Unicode code points.  So a BOM should be considered mandatory for files encoded using UTF-16.

But a file in UTF-8 encoding is an ordered sequence of 1-byte chunks.  In UTF-8, a byte and a chunk are essentially the same thing.  So with UTF-8, the problem of knowing the order of the bytes within the chunks is simply a non-issue, and a BOM is pointless. And since the Unicode standard does not require the use of the BOM, virtually nobody puts a BOM in files encoded using UTF-8.
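Even so, a UTF-8 BOM (really just the UTF-8 encoding of U+FEFF, used as an encoding signature) does occasionally turn up, mostly from Windows tools. Python handles it with the utf-8-sig codec, which strips a leading BOM if one is present:

```python
import codecs

# The UTF-8 "BOM" is the three-byte UTF-8 encoding of U+FEFF.
print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'

data = codecs.BOM_UTF8 + "hi".encode("utf-8")
print(data.decode("utf-8-sig"))        # hi -- BOM stripped
print(repr(data.decode("utf-8")))      # '\ufeffhi' -- plain utf-8 keeps it
```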

Let’s do UTF-8… all the time!

It is important to recognize that UTF-8 is able to represent any character in the Unicode standard.  So there is a simple rule for coding English text (i.e. text that uses only or mostly ASCII characters) —

Always use UTF-8.

  • UTF-8 is easy to use. You don’t need a BOM.
  • UTF-8 can encode anything.
  • For English or mostly-ASCII text, there is essentially no storage penalty for using UTF-8. (Note, however, that if you’re encoding Chinese text, your mileage will differ!)

What’s not to like!!??

UTF-8? For every Unicode code point?!

How can you possibly encode every character in the entire Unicode character set using only 8 bits!!!!

Here’s where Joel Spolsky’s (Joel on Software) excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) comes in useful.  As Joel notes

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

This is the myth that Unicode is what is known as a Multibyte Character Set (MBCS) or Double-Byte Character Set (DBCS).   Hopefully, by now, this myth is dying.

In fact, UTF-8 is what is known variously as a

  • multibyte encoding
  • variable-width encoding
  • multi-octet encoding (For us dummies, octet == byte. For the difference, see page 46 of Korpela’s Unicode Explained.)

Here’s how multibyte encoding works in UTF-8.

  • ASCII characters are stored in single bytes.
  • Non-ASCII characters are stored in multiple bytes, in a “multibyte sequence”.
  • For non-ASCII characters, the first byte in a multibyte sequence is always in the range 0xC0 to 0xFD. The coding of the first byte indicates how many bytes follow, and so indicates the total number of bytes in the multibyte sequence.
  • In UTF-8, a multibyte sequence can contain as many as four bytes.
  • Originally a multibyte sequence could contain six bytes, but UTF-8 was restricted to four bytes by RFC 3629 in November 2003.

For a quick overview of how this works at the bit level, take a look at the answer by dsimard to the question How does UTF-8 “variable-width encoding” work? on stackoverflow.
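The variable widths are easy to verify in Python, which reports one byte for an ASCII character and up to four for characters further out in the Unicode range:

```python
# One character, one to four bytes:
# "A" is ASCII, "é" is U+00E9, "€" is U+20AC, "𝄞" is U+1D11E.
for ch in ["A", "é", "€", "𝄞"]:
    b = ch.encode("utf-8")
    print(ch, len(b), b.hex())
```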

Wrapping it all up

So that’s it. Our investigation of the BOM has led us to take a closer look at UTF-8 and multibyte encoding.

And that leads us to a nice place. For the most part, and certainly if you’re working with ASCII data, there is a simple rule.

Just use UTF-8 and forget about the BOM.

Posted in Unicode | 9 Comments

Unicode Beginners Introduction for Dummies Made Simple

I’ve been trying to grok Unicode, and it hasn’t been easy.  But today, I finally got it.  And, as it turns out, the basics of Unicode aren’t too difficult.

The problems that I’ve been having turn out not to be with Unicode, but with the introductions that I’ve found.  They’re pretty confusing.  Or maybe I should say, they just don’t fit my brain.   So the logical thing to do, of course, is to write the introduction that I would like to have read.

A lot of what I will write will be shamelessly ripped off from other descriptions that I’ve found useful, including

We start with the observation that “Unicode” is actually two separate and distinct things.  And the first of these things has nothing to do with computers.

Suppose you’re an English orientalist in, say, 1750.  You’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.  You get a brilliant idea.  You will make a list of all characters in all languages ever used.  Each will be identified by its own unique number.  So you start out making your list with your own good English characters.  You add in the cuneiform characters and the Sanskrit characters and Greek, Japanese, Chinese, and Korean characters. You add in characters for the funny squiggly/accented/umlauted characters in Spanish, French and German. And so on. And finally you have a very long list of about a zillion characters.

1 a
2 b
3 c
26 z
27 A
28 B
52 Z
53 (space)
54 ? (question mark)
55 , (comma)
... and so on ...

And (as I say) you did it all with your feather-quill pen. This has nothing to do with computers. It is simply about creating a numbered list of all known characters.

When you finish, you have a complete (you hope) set of characters. So you call it a “character set”. And because you’re in a funny mood, instead of calling the numeric identifiers “numeric identifiers”, you call them “code points”. And because your list is meant to include every character in the known universe, you call it the Universal Character Set, or UCS.

Congratulations! You’ve just invented the first, non-computer, half of Unicode, the Universal Character Set.

Now you borrow Guido’s time machine and fast-forward 260 years to 2010.  Everybody is using computers.  So you have a brilliant idea.  You will find a way for computers to handle your UCS.

Now computers think in 8-bit bytes.  So you think:  we’ll use one byte for each numeric identifier (code point)!  Great idea!  An 8-bit encoding.

The problem of course is that with 8 bits you can make only 256 different bit combinations.  And your list has way more than 256 characters.  So you think: we’ll use two bytes for each character!  Great idea!  A 16-bit encoding.

But there are still problems.  First, even two bytes are not enough to store a number as big as a zillion.  You figure that you’ll need at least 3 bytes to hold the biggest number on your list.  Second, even if you decided to use four bytes (32 bits) for each character, your list might still keep growing and someday even 32-bits might not be enough.  Third, you’re doing mostly English, and 8 bits is plenty for working with English.  So with a 16-bit encoding, you’re using twice as much storage as you really need (and, if you use a 32-bit encoding, you’re using four times as much as you need).

So you think:  Let’s just use an 8-bit encoding, but with a twist.  One of the bit combinations won’t identify a character at all, but will be sort of a continuation sign, saying (in essence) this character identifier is continued on the next several bytes.  So for the most part, you’ll use only one byte per character, but if you need a document to contain some exotic characters, you can do that.

Congratulations!  You’ve just invented UTF-8 — the 8-bit Unicode Transformation Format, a variable length encoding in which every UCS character (code point) can be encoded in 1 to 4 bytes.

Now you still have one last problem.  You’ve defined both a UTF-8 format and a UTF-16 format.  So you go to open a file and start reading.  You read the first two bytes.  How do you know what you’re reading?  Are the first two bytes two characters in UTF-8 encoding? or a single character in UTF-16 encoding?  What you need is a standard marker at the beginning of files to indicate what encoding the file is in.

Bingo.  You’ve just invented the Byte Order Mark, or BOM (aka “encoding signature”).  The BOM is a two-byte marker at the beginning of a file that tells what encoding the file is using.

So now, when you read a file, you first read the BOM, which tells you what encoding was used to create the file.  This allows you to decode the file into code points (however code points are represented internally in your programming language: Java, Python, whatever).  And when you write out a file, you choose the encoding to be used to encode your Unicode characters in bits.  You write the BOM, and then you write out your Unicode strings, specifying that same encoding when writing the bits and bytes to the file.

And that’s the basics.    In summary,

Unicode =
     UCS (definition of a universal character set)
   + UTF (techniques for encoding code points in bit-configurations)

The connection between characters and bit configurations is the numeric character identifier, the “code point”.

Character       Code point         Bit configuration
=========       ==========         =================
a                  1                 0000
b                  2                 0001
and so on.

There are lots more complicated details of course. But this is the basics.

For a follow-up post, see Unicode for dummies – just use UTF-8.

Posted in Unicode | 12 Comments

Multiple constructors in a Python class

In addition to working with Python, I also work with Java quite a lot.

When coding in Python, I occasionally encounter situations in which I wish I could code multiple constructors  (with different signatures)  for a class, the way you can in Java.  

Recently, someone else had the same desire, and posted his question on comp.lang.python. So I thought that I would post an example of the technique that I use, in case others might find it useful.  So here it is:


import pprint

class Vector:
    """Demo of a class with multiple signatures for the constructor."""

    def __init__(self, *args, **kwargs):
        if len(args) == 1:
            foundOneArg = True
            theOnlyArg = args[0]
        else:
            foundOneArg = False
            theOnlyArg = None

        if foundOneArg and isinstance(theOnlyArg, list):
            self.initializeFromList(theOnlyArg)
        elif foundOneArg and isinstance(theOnlyArg, Vector):
            self.initializeFromVector(theOnlyArg)
        else:
            self.initializeFromArgs(*args)

        pprint.pprint(self.values)  # for debugging only

    def initializeFromList(self, argList):
        self.values = [x for x in argList]

    def initializeFromVector(self, vector):
        self.values = [x for x in vector.values]

    def initializeFromArgs(self, *args):
        self.values = [x for x in args]
#------------ end of class definition ---------------------

v = Vector(1, 2, 3)
v = Vector([4, 5, 6])
q = Vector(v)
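For comparison, here is a sketch of a common alternative to argument-sniffing in __init__: give the class one plain constructor plus classmethod factory constructors, one per "signature". (The method names are illustrative.)

```python
import pprint

class Vector:
    """Same idea, but with one factory classmethod per signature."""
    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def fromList(cls, argList):
        return cls(argList)

    @classmethod
    def fromVector(cls, vector):
        return cls(vector.values)

v = Vector.fromList([4, 5, 6])
q = Vector.fromVector(v)
pprint.pprint(q.values)  # [4, 5, 6]
```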
Posted in Python features | 6 Comments

How do I reverse a string in Python 3?

With the improved support for Unicode in Python 3, more and more folks will be working with languages (Arabic, Hebrew, etc.) that read right-to-left rather than left-to-right. So more and more folks will have a need to reverse a string.

Unfortunately, Python doesn’t have a built-in function, nor do string objects have a built-in method, to do what they will want.  The obvious techniques don’t work. This:

            s = "a b c"
            s = reverse(s)
        except Exception as e:

            s = "a b c"
            s = reversed(s)
        except Exception as e:

            s = "a b c"
        except Exception as e:

            s = "a b c"
        except Exception as e:

produces this output

        name 'reverse' is not defined
        <reversed object at 0x00BAB5F0>
        'str' object has no attribute 'reverse'
        'str' object has no attribute 'reversed'

Fortunately, the solution is not too difficult. A little one-line function will do the trick.

I call the function “rev” rather than “reverse” on the chance that Python will eventually acquire its own builtin function named “reverse”.

        def rev(s): return s[::-1]

In a comment, Michael Watkins has noted another possible implementation of the “rev” function.

        def rev(s): return ''.join(reversed(s))


            s = "a b c"
            s = rev(s)
        except Exception as e:


        c b a
Posted in Moving to Python 3 | 12 Comments

What’s wrong with use cases?

I originally wrote this in 2002, when Use Cases seemed to be all the rage. I thought I would republish it here because the Use Case Approach — although now operating under a variety of new names — still seems to be popular. — Steve Ferg

What’s wrong with Use Cases?

For comparison, let’s start with a slightly different question.

What’s wrong with chocolate?

At one level, of course, nothing at all is wrong with chocolate. There aren’t many essential nutrients in chocolate, of course, but in and of itself that is not a problem. Consumed in moderation, as part of a well-rounded diet, chocolate adds variety and pleasure to the diet, and in some cases may provide needed calories.

But of course, that isn’t the end of the story. What’s wrong with chocolate is that it is delicious. Addictive, in fact! There was a reason for the invention of the word chocoholic. Where moderate consumption might be useful, chocolate tempts us to immoderate consumption. Where we should consume a variety of foods to obtain a balanced intake of nutrients, chocolate tempts us to ignore other foods and eat only chocolate. If we succumb to its charms, we eat an unbalanced diet. If we eat an unbalanced diet, we suffer. The consequences of excessive chocolate consumption are very real and easy to see (on the bathroom scale, for instance). But (and here is part of the problem) the causal connection between chocolate consumption and its consequences is (in contrast to chocolate’s easily-perceived immediate charms) virtually invisible.

The problem with chocolate, then, lies in a combination of things: it does not provide a complete, balanced nutritional intake; it is extremely attractive, tempting one to consume only chocolate and ignore other foods; and it is very difficult to make the connection between excessive chocolate consumption and its undesirable consequences.

Use Cases have the same problems, caused by the same combination of characteristics:

  • A set of Use Cases does not provide a system developer with all of the information that he needs about his client’s needs, in order to produce a system that meets those needs. Use Cases are nutritionally deficient.
  • Use Cases are extremely attractive. Much of the attraction is probably due to their simplicity. One doesn’t have to work very hard to understand the basic Use Case concepts. With such a low-effort, no-sweat requirements-gathering technique available, it is tempting indeed to believe that all one has to do when gathering requirements is to create a list of Use Cases.
  • The use of Use Cases, to the complete or virtual exclusion of other requirements-gathering techniques, has undesirable consequences, in the poor quality of the systems developed. But these consequences — the connection between requirements gathering and the eventual quality of the system as built — are largely invisible to both developers and developer management.

In fairness, it should be noted that the Use Case approach is not the culprit here. The connection between the quality of the requirements-gathering for a system, and the quality of the system that is eventually developed, seems to be invisible to most organizations regardless of whatever particular requirements-gathering methodology the organization is officially practicing. That’s part of the reason why we — as a profession and as an industry — continue to do such a poor job of requirements gathering, and continue to have so many system failures.

In short, the analog of the Chocoholic’s Diet (in which the unfortunate dieter consumes only, or mostly, chocolate) is what we might call the Use Case Approach (UCA) in which the unfortunate software development organization uses only, or mainly, Use Cases to gather and document client requirements. We have been warned for years, by everyone from the government to our own doctors, about the dangers of junk food diets, of which the Chocoholic’s Diet is an example. Now it is time to issue a warning about the dangers of junk requirements-gathering methodologies, of which the Use Case Approach is an example.

The time has come to issue such a warning because the Use Case Approach is quickly gaining in popularity in software development shops. In most cases, UCA is riding on the coat-tails of UML. There are some UML proponents who advocate the requirements-gathering equivalent of a balanced diet: a requirements-gathering methodology in which Use Cases figure as only one element of a well-rounded diet of requirements-gathering techniques. But for every such reasonable guru, there is at least one snake-oil salesman pushing Use Cases as The Solution For All Your Requirements Methodology Needs. Some of these gentlemen give lip service to the idea of a well-balanced diet, but it is only lip service. In their methodology books and courses, they pause briefly to give a cursory nod to (for instance) describing the problem domain, before spending the overwhelming bulk of discussion on Use Cases.

Unfortunately, the software development community seems to be eating it up. Many organizations see UCA as a silver bullet for their biggest requirements gathering problem… which, incidentally, seems often not to be How do you gather requirements effectively?, but What do you say when somebody asks you what your official requirements gathering methodology is? It used to be said that no one ever lost their job by buying IBM; today, no one ever loses their job by adopting Use Cases and UML.

This stampede of enthusiasm has produced the software development equivalent of a national health crisis. Just as the popularity of the “Sugar Buster’s Diet” has forced doctors and dietitians to issue warnings about the nutritional deficiencies of such a diet, the popularity of the Use Case Approach makes it imperative to point out the deficiencies of Use Cases as a requirements-gathering tool. That’s the purpose of this article.

Use Cases Unsupported by Domain Descriptions Are Vague

When we embark on a system development project, the system to be developed is the putative solution to some problem in the user’s environment (the “application domain”).

Naturally, we would expect the problem-solving method to look something like this:

  • (1) Study the problem until you are confident that you understand it.
  • (2) Describe a proposed solution for the problem. (In our field, the description of the proposed solution is a requirements specification document for a computer system.)
  • (3) Implement the solution. (That is, build the system.)

A Use Case — as a description of an actor’s interaction with the system-to-be — is both a description of the system’s user interface and an indirect description of some function that the system will provide. In short, as descriptions of the system-to-be, Use Cases belong in step 2 — describing the proposed solution to the problem. So the development of Use Cases has a place in the problem solving process… but that place is not as the first step, and it is not as the only step.

The first activity in the requirements-gathering process must be the study and description of the problem-environment, the application domain. To put it bluntly, the requirements analyst’s first job is to study and understand the problem, not to jump right in and start proposing a solution. (This is a major theme of Michael Jackson’s 1995 book, Software Requirements and Specifications and his new book, Problem Frames. In Software Requirements and Specifications, see especially the entry on “The Problem Context”.)

One way that the problem domain can be described is by creating a model. The developer begins by creating a model of the “real world”, i.e. the part of the real world that is relevant to the problem at hand, the part of reality with which the system is concerned and which furnishes its subject matter.

We start by modeling the real world (rather than describing the functionality that we wish the system-to-be to provide) because the model supplies essential components that we need in order to create our descriptions of the system functionality. In describing a university library application, for example, in the real-world model we would describe books and copies of books, describe what counts as being a university member for the purposes of using the library, and so on. In creating the domain model we, in a sense, construct a dictionary of words. We can then use those words when we write our descriptions of the functions and Use Cases that we wish the system to support. In the case of the university library, once we have described the objects in the model, we can specify any system functions that can be described using a vocabulary of “books”, “copies of books”, and “university members”.

Note that the scope of the vocabulary that was created in the domain model implicitly defines (and defines the limits of) a set of possible system functions. [Jackson, 1983, p. 64] We can specify any function that can be described using the vocabulary of words that appear in the model/dictionary, but we cannot specify a function if its description would require terms that are not in the model. For instance, we can describe the process of a member borrowing a book and the process of returning it, the process of reserving a book and the process of notifying a member when a reserved book becomes available for checkout, the process of sending out overdue notices, and so on. But we can not describe a Use Case in which a university member presses a button that triggers the rocket launch of a weather satellite into orbit. “Rocket”, “satellite”, and “launch” were not part of the conceptual vocabulary that was created in our domain description of the university library.
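The library example can be sketched directly in code. Here is a minimal domain model acting as the “dictionary of words” described above; all of the class and function names are my own illustration, not taken from any real system:

```python
# A minimal sketch of the library domain model described above.
# All names here are illustrative assumptions, invented for this example.

class Book:
    """A work (title and author), as opposed to a physical copy of it."""
    def __init__(self, title, author):
        self.title = title
        self.author = author

class Copy:
    """A physical copy of a Book; only copies sit on shelves and go on loan."""
    def __init__(self, book, barcode):
        self.book = book
        self.barcode = barcode
        self.on_loan = False

class Member:
    """A university member entitled to use the library."""
    def __init__(self, name):
        self.name = name
        self.loans = []

# Functions expressible in this vocabulary can now be specified:
def check_out(member, copy):
    """A member borrows a particular physical copy."""
    copy.on_loan = True
    member.loans.append(copy)

# But "member launches weather satellite" cannot even be written down:
# there is no Rocket or Satellite in the model's vocabulary.
```

Notice that the model already forces the Book/Copy distinction: `check_out` takes a `Copy`, not a `Book`, because only physical copies can be on loan.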

Note that the model of reality isn’t just something that is nice to have in order to support the descriptions of the Use Cases. It is an essential foundation for those descriptions. If we produce just the Use Cases without first creating the description of the problem domain, then the descriptions of the Use Cases are fundamentally flawed by using undefined terms. A Use Case description for Member Checks Out a Book will use the terms “library member” and “book”, but if those terms have not been defined earlier (in the model of the application domain), then the Use Case specification is necessarily vague (i.e. not clearly defined).

Consider the term “book”. In this context, the term “book” is ambiguous between book as a work of art, which has no physical location, and book as a physical object (a “copy of a book”) that does. Only after we have disambiguated the word “book”, can we explain, for instance, why the book involved in the Member Checks Out Book Use Case is not the same as the book in the Member Reserves Book Use Case. I often use the university library as a teaching example in my data modeling classes, just because of this ambiguity with the word “book”. It exposes the students to the issue of ambiguity in domain descriptions, and helps me make the point that one of the requirements analyst’s most important tasks is the detection and removal of such ambiguities.

Such ambiguities are quite real, and quite common. By now, you would expect that we would have recognized that fact, and have learned to deal with it. Yet one of the most common causes for major project problems is the failure of developers and their clients seriously to consider that ambiguity might exist in their requirements documents. Clients and developers tend to think that their primary job on the project is to describe the “requirements” for the system (that is, to describe what the system is supposed to do, the solution to the problem), not to describe the problem or the problem domain. Project participants often simply assume that the meanings of familiar terms are so clear and so well known to everyone present, that no explicit definition is necessary. This assumption is often false, and when it is false, the consequence is that significant ambiguities in the system specification remain hidden until they emerge and plague the later stages of the development project.

In a case that I heard about recently, a requirements analyst was working on a project for a large American railroad corporation. He was having problems capturing the requirements because his users did not use the word “train” consistently. To some of them, a “train” was a particular collection of rolling stock (a locomotive and all the cars it pulled). To others, a “train” was just the locomotive. To others, a “train” was a regularly scheduled run, as in “I’ll catch the 6 o’clock train to Boston”. To others, a “train” was a specific instance of a regularly scheduled run, so that the train that left for Boston today at 6 o’clock, and the train that left for Boston yesterday at the same time, are two different trains. And so on. In this case, fortunately, the ambiguities were so glaring that they could not be ignored. On many projects the ambiguities are not so intrusive, so they are left hidden, like ticking time bombs.

Use Cases Do Not Capture Important Information about the Problem Context

Another reason that we start development with capturing information about the real world is that there are properties of the real world that constrain the system, or that the system must know about, or that the system relies on, in order to satisfy the customer’s requirements. Here, the downside of the Use-Case approach is that it draws the requirements analyst’s attention away from the task of describing properties of the real world, and focuses his attention on the narrow area where the real world interacts directly with the system.

Diagram of two overlapping circles. One circle is labeled 'Real World' and the other is labeled 'Computer System'. The area where they overlap is labeled 'Use Cases'.

In embedded systems, the system’s reliance on properties of the surrounding real world is very real, and can often be safety-critical. In Software Requirements and Specifications (entry “Requirements”) Michael Jackson describes an incident in which an airplane overshot the runway when attempting to land. The runway was wet, and the plane’s wheels were aquaplaning instead of turning. The plane’s guidance system thought, in effect, “I’m on the ground, but my wheels aren’t turning. So I must not be moving,” and would not allow the pilot to engage reverse thrust. Aquaplaning, a very relevant property of the real world, was not considered by the developers when they created the guidance system. The consequence is a plane in a ditch past the end of the runway, instead of safely docked in the terminal. (Jackson says that the error, which could have been catastrophic, fortunately was not.)

For computer professionals, it is tempting to blame such problems on the clients. Knowing about aquaplaning, we are tempted to say, is our client’s problem. Our only job is to figure out how to make the system do what they tell us they want it to do, in their Use Case descriptions. But even if our clients will let us get away with wiggling out of all responsibility for planes in ditches (something that seems extremely unlikely), this still won’t cut it. As Jackson points out in a recent paper, it is almost impossible for software developers to build correct software if they don’t understand the problem domain, and how what they are doing relates to what happens there.

One potentially attractive view, is that the concerns of computer science are bounded by the interface between the computer and the world outside it…. So if we restrict our concerns to the behaviour of the computer itself we can set aside the disagreeably complex and informal nature of the problem world. It is somebody else’s task to grapple with that. …

Unfortunately, … the specification of computer behaviour at the interface, taken in isolation, is likely to be a description of arbitrary and therefore unintelligible behaviour. … Practising programmers who try to adhere to this doctrine will find themselves devoting their skills to tasks that seem at best arbitrary and at worst senseless. [Italics mine — Steve Ferg]

— Michael Jackson, “The Real World”

Rephrasing this point in terms of Use Cases: it is almost impossible for programmers to build correct software if they are given Use Case information but no information about the problem domain and how their program relates to what happens there. Yet this is what the Use Case Approach encourages requirements analysts to do. This is the real problem with the Use Case Approach. It discourages the requirements analyst from examining the problem domain, by focusing attention only on what happens at the system boundary.

For Some Jobs, Use Cases Are Just the Wrong Tool

For some applications, there clearly is domain information that must be specified in the system requirements, but which has no natural home in any particular Use Case. Often, for example, an important requirement for a system is that it enforce (or at least not violate) a set of business rules or governmental regulations. Other systems (e.g. systems in the physical and social sciences) are heavily algorithm-driven. The Use Case Approach provides no natural mechanism for capturing such mathematical algorithms, business rules, and government regulations. Certainly, it is very unnatural to embed them in the descriptions of specific Use Cases. Specifying a single big module that contains all of the customer’s business rules and (in UML terminology) extends every Use Case, is possible but clumsy and unnatural. Use Cases are simply not the best tool for capturing such requirements.
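What such requirements need is a home of their own. As a minimal sketch, a business rule can be captured once, as a named predicate in a rules module, and merely referenced from every Use Case it constrains; the rule itself and all the names below are hypothetical, invented for illustration:

```python
# Sketch: business rules kept as standalone, named predicates rather than
# duplicated inside individual Use Case descriptions.
# The rule and all names here are hypothetical, for illustration only.

MAX_LOANS = 6  # hypothetical library policy: loan limit per member

def may_borrow(current_loan_count, has_overdue_items):
    """Rule: a member may borrow only if under the loan limit
    and holding no overdue items."""
    return current_loan_count < MAX_LOANS and not has_overdue_items
```

Every relevant Use Case (Check Out Book, Renew Loan, and so on) then cites `may_borrow` by name instead of restating the rule, so a change in policy is made in exactly one place.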

Nobody Really Knows What a Use Case Description Looks Like

Nobody really knows what a Use Case description looks like. Use Cases can be written at a very high level of detail, or at a very low level, or anywhere in between.

In some organizations, Use Cases may be written at a very high level. Use Case descriptions that are written at too high a level are often useless. Sometimes, they are worse than useless, because they give the impression that the system requirements have been completely specified, when in fact that is not true at all. A recent case in point was a project to develop a securities information system. The customers knew that they wanted the system to generate reports, so the system specification included a Use Case to Run Reports. The problems with this Use Case didn’t emerge until the requirements analyst began to ask the customers about the contents of the reports that they wanted to generate. Then it emerged that the customer had in mind 140 reports of radically varying content, and the real requirement for the system — what the customer really needed — was a system that could store a history of all of the kinds of events that could affect securities. Virtually all of the system complexity, and all the information needed to design the system, was hidden under the single Use Case Run Reports.

In other organizations, Use Cases may be written at a detailed, implementation-specific level, describing the mechanics of the graphical user interface (GUI), complete with buttons, menus, and drop-down list boxes. Often, at a stage in the project when the primary concern should be understanding the business context and functions that the system must support, the client is instead engaging in premature user-interface design.

In addition, as the editor of observed, in many organizations the clients or end-users are the ones who write the Use Case descriptions. But without any training in user-interface design

… there is little hope that [the end-users] will make a good job of it. … The adoption of Rational Unified Process in its complete form is likely to set the development of good User Interface Design back by perhaps 20 years.

The result, in short, is not only that the user-interface design is being done at the wrong time, it is being done by the wrong people.

Use Cases Are Unsystematic

An important feature of any useful requirements-gathering methodology is that it provides a systematic approach to identifying all of the system requirements. A requirements-gathering methodology that consists of writing Use Cases is no methodology at all, because it provides no help in systematically addressing the problem. The only guidance that the Use Case approach provides is the most generic question possible: “What would you like to do with the system, and how would you like it to behave?”

As Ben Kovitz points out, writing software requirements by writing Use Cases as they come to mind, is the requirements-writing equivalent of programming by hacking around. What it produces is simply a sprawl of Use Cases. How can you tell if you’ve identified all of the Use Cases for the system? How can you tell if the Use Cases conflict? How can you tell if the Use Cases leave any gaps? It is not possible for an ordinary human to understand all of the ways that a grab-bag of 50, or 100, or 200 Use Cases are inter-related. Suppose the customer wants the system to support some new functionality. How can you tell which, if any, Use Cases will be affected by the change? There simply are no answers to these questions.

In short, as a methodology the Use Case Approach is too mushy to provide any real guidance in gathering requirements. The Use Case Approach, in fact, is not a methodology at all; it is merely a notation. Once an organization has settled on the template that it wants to use for describing Use Cases, that’s it. At that point, the organization has got all the methodology help that it is going to get out of the Use Case Approach.

Use Cases Are Not Object-Oriented

There is really nothing at all that makes the Use Case Approach especially “object-oriented”. Confined to describing behavior at the system boundary, a set of Use Cases describes neither objects in the real world nor objects inside the computer.

Because the Use Case Approach is not object-oriented, it is completely compatible with non-object oriented methods. As Grady Booch observed

[A] very real danger is that your development team will take [its] object-oriented scenarios and turn them into a very nonobject-oriented implementation. In more than one case, I’ve seen a project start with scenarios, only to end up with a functional architecture, because they treated each scenario independently, creating code for each thread of control represented by each scenario.

–entry “Scenarios” in [Booch, 1998]

There is nothing wrong with not being object-oriented, of course. If a technique is useful, it is useful, object-oriented or not. But given the popularity of the buzzword “object-oriented”, I think it is important to point out that the object-oriented-ness of Use Cases is a myth. Use Cases are enjoying their current popularity because the Three Amigos bundled them into UML along with other, truly object-oriented, techniques. Whatever object-oriented-ness Use Cases have, they have acquired solely by rubbing up against true object-oriented methodologies into whose company they have fortuitously fallen.

The Proper Use of Use Cases

The best metaphor for methodology skills is a handyman’s toolbox. No handyman could be successful if he relied on a single tool (a hammer, say) for everything that he did. Hammers don’t work well when you’re trying to drive a screw, nor when you need to cut a plank. You need different tools for different jobs.

A software developer is a software handyman. To do everything that he needs to do, a software developer needs a toolbox that contains a variety of tools. Use Case description shouldn’t be the only tool in the toolbox. It should simply be there along with the other tools, so that it is available when it is the right tool for the job.

Most of the problems with Use Cases that we’ve discussed, like the problems with chocolate, come not from the thing itself, but from its improper or excessive use. In spite of all of the problems we’ve discussed, Use Cases can be useful — when they are used properly. Most importantly, Use Cases can be an effective tool when they are developed in a disciplined manner, as part of a methodology that first creates a well-defined domain model.

The domain model provides an infrastructure for the requirements-gathering process, so that the development of Use Cases can proceed systematically, in a way that could never happen without the domain model. The domain model describes the objects in the problem domain, including the events in the lives of objects. Once the events in the lives of the objects have been described, the requirements analyst can approach the task of writing Use Cases in a systematic fashion, by writing a Use Case for each of the events. If the life of a library book includes events such as being acquired, being borrowed, being returned, etc., then the analyst will develop a Use Case for each of those events. Each Use Case answers questions derived from specific events: “How does the system get told that a book has been returned?” or “How does the system know when to generate overdue-notice letters?”
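The systematic derivation described above can even be sketched mechanically: walk the domain model’s event lists and emit one candidate Use Case, together with its analyst’s question, per event. The event lists below follow the library example; the data structure and wording are my own illustration:

```python
# A sketch of the derivation described above: the domain model records the
# events in the life of each object, and each event yields one candidate
# Use Case together with the question the analyst must answer.
# The event lists follow the library example; the structure is illustrative.

domain_events = {
    "Book copy": ["acquired", "borrowed", "returned", "discarded"],
    "Member": ["enrolled", "suspended"],
}

def candidate_use_cases(model):
    """Yield one (use_case_name, analyst_question) pair per domain event."""
    for entity, events in model.items():
        for event in events:
            yield (
                f"{entity} is {event}",
                f"How does the system get told that a "
                f"{entity.lower()} has been {event}?",
            )

for name, question in candidate_use_cases(domain_events):
    print(f"{name:22} -> {question}")
```

The point of the sketch is not the code, but the coverage argument: every event in the model gets a Use Case, so gaps are visible instead of accidental.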

In actual practice, of course, deriving Use Cases from the domain model is rarely as simple and mechanical as we’ve just described. But there are techniques for dealing with more intricate situations. The whole purpose of systems analysis methodologies, in fact, is to provide guidance for situations in which the process is not so simple and mechanical. But whether the process of developing a particular Use Case is simple or tricky, its foundation must always lie in a solid understanding of the problem, and a carefully-constructed description of the problem domain.

In summary, a set of Use Cases is a description of the system to be constructed, the thing to be built, the solution to the problem. But, as Michael Jackson points out, before you can effectively start building the solution to a problem, “first you must concentrate your attention on the application domain to learn what the problem is about.” [Jackson, 1995, p. 158]


Several of the ideas in this paper are derived from the works of Michael Jackson, and from Ben Kovitz’s short discussion of Use Cases in Practical Software Requirements (pp. 251-252). I hope that Ben and Michael will consider theft the sincerest form of flattery. Neither, of course, is responsible for any mistakes that I might have made in presenting their ideas, or for cases in which my opinion or terminology differs from theirs.


[Booch, 1998] Grady Booch, Best of Booch (ed. Ed Eykholt), Cambridge University Press, 1998

[Jackson, 1983] Michael Jackson, System Development, Prentice-Hall, 1983

[Jackson, 1995] Michael Jackson, Software Requirements and Specifications, Addison-Wesley, 1995

[Jackson, 2000] Michael Jackson, “The Real World” in Millennial Perspectives in Computer Science: Proceedings of the 1999 Oxford-Microsoft Symposium in Honour of Sir Anthony Hoare (ed. Jim Davies, Bill Roscoe, Jim Woodcock), Palgrave, 2000

[Kovitz, 1999] Benjamin L. Kovitz, Practical Software Requirements, Manning, 1999


Python Packages

I’ve avoided putting my Python modules into packages.  That’s because (like a lot of other folks) I don’t really understand how to create a package.  And that’s because (probably like a lot of other folks) I’m too time-constrained and over-worked to be able to grok anything except a very short, clear, and simple explanation of how packages work… and I never found one.

That’s all over now. In a post on comp.lang.python Steven D’Aprano has made it all beautifully clear. Here is an excerpt from his post. This is the part that really turned on the lights for me.

A package is a special arrangement of folder + modules.

To be a package, there must be a file called __init__.py in the folder, e.g.:

        parrot/
        +-- __init__.py
        +-- feeding/
            +-- __init__.py
            +-- eating.py
        +-- fighting.py
        +-- flying.py
        +-- sleeping.py
        +-- talking.py

This defines a package called parrot which includes a sub-package feeding and modules fighting, flying, sleeping and talking. You can use it by any variant of the following:

import parrot  # loads parrot/__init__.py
import parrot.talking  # loads parrot/talking.py
from parrot import sleeping
import parrot.feeding
from parrot.feeding.eating import eat_cracker
#... and so on ...

Common (but not compulsory) behaviour is for parrot/__init__.py to import all the modules in the package, so that the caller can do this:

import parrot

without needing to manually import sub-packages. The os module behaves similarly: having imported os, you can immediately use functions in os.path without an additional import.

Just dumping a bunch of modules into a folder doesn’t make it a package, it just makes it a bunch of modules in a folder. Unless that folder is in the PYTHONPATH, you won’t be able to import the modules because Python doesn’t look inside folders. The one exception is that it will look inside a folder for an __init__.py file, and if it finds one, it will treat that folder and its contents as a package.

And to distribute a collection of modules as a package….
  • Put your modules in a package,
  • tell the user to just place the entire package directory where they normally install Python code (usually site-packages — Steve Ferg), and
  • importing will just work
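The whole mechanism can be demonstrated end-to-end. The sketch below builds the parrot/ layout in a temporary directory and then imports it; the temp-directory scaffolding and the speak() function are my own illustration, not part of Steven’s post:

```python
# A runnable demonstration of the rule above: a folder becomes a package
# the moment it contains __init__.py. We build the parrot/ layout in a
# temporary directory, put that directory on the path, and import it.
import os
import sys
import tempfile

root = tempfile.mkdtemp()
pkg = os.path.join(root, "parrot")
os.makedirs(os.path.join(pkg, "feeding"))

# parrot/__init__.py imports its submodules, as described above.
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("from . import talking\nfrom . import feeding\n")
with open(os.path.join(pkg, "talking.py"), "w") as f:
    f.write("def speak():\n    return 'Polly wants a cracker'\n")
with open(os.path.join(pkg, "feeding", "__init__.py"), "w") as f:
    f.write("")  # an empty __init__.py is enough to make feeding a sub-package

sys.path.insert(0, root)  # the folder *containing* parrot/ must be on the path
import parrot

print(parrot.talking.speak())  # works without an explicit 'import parrot.talking'
```

This mirrors the os/os.path behaviour Steven mentions: because parrot/__init__.py imports its submodules, a single `import parrot` makes `parrot.talking` and `parrot.feeding` immediately usable.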

Thank you, Steven! You have no idea how much we needed that!
Updated 2011-11-06 to improve formatting


Diagram arrangements

Aesthetics matters!

For a number of years I did quite a lot of teaching — computer systems analysis and design, and also data modeling and database design. During the classes the students and I constructed a lot of diagrams of various kinds.

In my classes I tried to impress upon my students the importance of the arrangement of their diagrams. The layout of a diagram conveys important information about the basic structure of the domain — application, database, process, or whatever — that is being diagrammed. Aesthetics matters!

Here are some examples that illustrate that point. All of these diagrams are structurally identical. They all contain the same number of nodes, connected in the same way. But they are arranged differently, and the different arrangements tell very different stories about the basic structure of the domain being modeled.

Example 1. The domain is basically tree-structured, hierarchical.


Example 2. The domain has two basic parts. Each part is almost self-contained, with a very limited interface between the two parts.


Example 3. The domain is basically a sequence of parts. Some of the parts have sub-parts.


Example 4. Here is an example of how not to diagram a domain.

If your diagram looks like this, you need to untangle it. Make it tell a clear story about the basic structure of the domain.


Some tips about diagram arrangements

1. Try to avoid having connector lines crossing other connector lines.

2. Your diagram should have a clear flow or direction:

  • left to right
  • top to bottom
  • center, radiating out to edges

Diagrams that represent processes (e.g. process flow diagrams) have a natural flow based on the sequence in time of the processes being represented. The flow of the diagram nodes in space should reflect the succession of the processes in time. And it should be consistently in one direction: left to right, or top to bottom, or center toward periphery.

When diagramming static domains that don’t have a natural temporal flow (e.g. class diagrams, entity-relationship diagrams, database diagrams), the diagram’s direction of flow should be from the most important nodes (entities/classes/tables/modules) of the application toward the less important nodes.

Put the most important nodes at the beginning of the flow — at the left, top, or center of the diagram. Their position on the diagram helps to draw the viewer’s attention to them, and indicates that they are the important nodes in the domain.

3. If rule 1 (no crossing connector lines) conflicts with rule 2 (natural flow), rule 2 takes precedence.

Avoiding crossed lines is desirable, but the most important thing is that the nodes of the diagram have a clear, natural flow.

In developing a database design diagram, you might be tempted to put database lookup tables (reference tables, mapping tables, translation tables) at the center of the diagram. After all (you think), the database design contains many foreign keys that reference the lookup tables, so putting the lookup tables at the center of the diagram will avoid a lot of crossed lines.

True… but still not a good idea! The lookup tables are simply part of the mechanics of database design. They are not of core importance for the business application for which the database is being designed. They belong at the edges, or the bottom, of the diagram.

4. Switching your diagram from portrait mode to landscape mode (or vice versa) may make it easier to create a diagram with a natural shape.

Some diagrams are naturally tall and narrow (like example 2) while others are naturally short and wide (like example 3). Take advantage of the printer’s ability to switch between portrait and landscape mode.

About empty space

The most common mistake I see is to try to arrange a diagram so that it fits on a single sheet of paper, rather than arranging it to show its natural structure. The result often looks like example 4.

Do not do this!

The most important rule of all (because it is so often violated) is: Once you have a nice clean diagram that tells a clear story about the domain, do not rearrange it so that it will fit on one page.

Keep the size of your nodes fairly small. This will increase the amount of empty space on your diagram.

Don’t try to use all available space on the page. You want empty space in your diagram. Empty space gives you room in which to arrange your nodes in a way that has an obvious structure that tells a clear story.

If your diagram is too big to fit on one printed page:

  • Arrange it for printing across multiple sheets of paper.
  • Print it on a larger size sheet of paper.
  • Shrink the size of your nodes.
  • Split the diagram into two separate diagrams, and use off-page connector symbols to link the diagrams. A diagram like Example 2 is a good candidate for this technique.
  • Decompose your single diagram into a set of hierarchically decomposed diagrams (like the old dataflow diagrams of structured analysis).

If necessary, to avoid a lot of connector lines crossing all over the diagram to get to one lookup table, draw duplicate nodes representing the same lookup table on your diagram. Position them so that you can minimize crossed lines and so that they are in keeping with the direction/flow of the diagram.
