Death Swamp

Recently a friend sent me this. I recognized it instantly, although I never knew that it had a name.

There is a management technique called “death swamp” (or “death bog” or “fly paper”). It works this way. Occasionally some young fire-eater comes up with an idea to Do Something. The bureaucracy can’t simply reject his idea because then they’d have to give an explanation for why his idea was rejected. So they pat him on the back, agree that his idea is a good one, and encourage him to pursue it. In fact they think so highly of his idea that they helpfully volunteer information about How We Get Things Done Around Here. They provide a sheaf of forms and advice on how to get the ball rolling.

The young and inexperienced fire-eater happily starts down the road in the direction that has been pointed out to him. In short order he finds himself in a swamp of procedures and paperwork so thick that he is completely bogged down and making no progress. Eventually he gives up.

The next time he comes up with an idea, he is given the same forms again. This time, seeing the forms, he realizes his mistake. He politely accepts the forms and walks away. Around the corner, he throws the forms in the trash and gives up on his idea. Because by then he knows that the only way to escape the swamp is not to enter it in the first place.

enum in Python

Recently I was reading a post by Eli Bendersky (one of my favorite bloggers) and I ran across a sentence in which Eli says “It’s a shame Python still doesn’t have a functional enum type, isn’t it?”

The comment startled me because I had always thought that it was obvious how to do enums in Python, and that it was obvious that you don’t need any special language features to do it. Eli’s comment made me think that I might need to do a reality-check on my sense of what was and was not obvious about enums in Python.

So I googled around a bit and found that there are a lot of different ideas about how to do enums in Python. I found a very large set of suggestions on StackOverflow here and here and here. There is a short set of suggestions on Python Examples. The ActiveState Python Cookbook has a long recipe, and PEP-354 is a short proposal (that has been rejected). Surprisingly, I found only a couple of posts that suggested what had seemed to me to be THE obvious solution. The clearest was by snakile on StackOverflow.

Anyway, to end the suspense, the answer that seemed to me so obvious was this. An enum is an enumerated data type. An enumerated data type is a type, and a type is a class.

class           Color : pass
class Red      (Color): pass
class Yellow   (Color): pass
class Blue     (Color): pass

Which allows you to do things like this.

class Toy: pass

myToy = Toy()

myToy.color = "blue"  # note we assign a string, not an enum

if myToy.color is Color:
    pass
else:
    print("My toy has no color!!!")    # produces:  My toy has no color!!!

myToy.color = Blue   # note we use an enum

print("myToy.color is", myToy.color.__name__)  # produces: myToy.color is Blue
print("myToy.color is", myToy.color)           # produces: myToy.color is <class '__main__.Blue'>

if myToy.color is Blue:
    myToy.color = Red

if myToy.color is Red:
    print("my toy is red")   # produces: my toy is red
else:
    print("I don't know what color my toy is.")

So that’s what I came up with.

But with so many intelligent people all trying to answer the same question, and coming up with such a wide array of different answers, I had to fall back and ask myself a few questions.

  • Why am I seeing so many different answers to what seems like a simple question?
  • Is there one right answer? If so, what is it?
  • What is the best — or the most widely used, or the most pythonic — way to do enums in Python?
  • Is the question really as simple as it seems?

For me, the jury is still out on most of these questions, but until they return with a verdict I have come up with two thoughts on the subject.

First, I think that many programmers come to Python with backgrounds in other languages — C or C++, Java, etc. Their experiences with other languages shape their conceptions of what an enum — an enumerated data type — is. And when they ask “How can I do enums in Python?” they’re asking a question like the question that sparked the longest thread of answers on StackOverflow:

I’m mainly a C# developer, but I’m currently working on a project in Python. What’s the best way to implement the equivalent of an enum [i.e. a C# enum] in Python?

So naturally, the question “How can I implement in Python the equivalent of the kind of enums that I’m familiar with in language X?” has at least as many answers as there are values of X.

My second thought is somewhat related to the first.

Python developers believe in duck typing. So a Python developer’s first instinct is not to ask you:

What do you mean by “enum”?

A Python developer’s first instinct is to ask you:

What kinds of things do you think an “enum” should be able to do?
What kinds of things do you think you should be able to do with an “enum”?

And I think that different developers probably have very different ideas about what one should be able to do with an “enum”. Naturally, that leads them to propose different ways of implementing enums in Python.

As a simple example, consider the question — Should you be able to sort enums?

My personal inclination is to say that — in the most conceptually pure sense of “enum” — the concept of sorting enums makes no sense. And my suggestion for implementing enums in Python reflects this. Suppose you implement a “Color” enum using the technique that I’ve proposed, and then try to sort enums.

# how do enumerated values sort?
colors = [Red, Yellow, Blue]
colors.sort()
for color in colors:
    print(color.__name__)

What you get is this:

Traceback (most recent call last):
  File "C:/Users/ferg_s/pydev/enumerated_data_types/edt.py", line 32, in <module>
    colors.sort()
TypeError: unorderable types: type() < type()

So that suits me just fine.

But I can easily imagine someone (myself?) working with an enum for, say, Weekdays (Sunday, Monday, Tuesday… Saturday). And I think it might be reasonable in that situation to want to sort Weekdays and to do greater-than and less-than comparisons on them.

So if we’re talking duck typing, I’m happy with enums/ducks that are motionless and silent. My only requirement is that they be different from everything else and different from each other. But I can easily imagine situations where one might reasonably need/want/prefer ducks that can form a conga line, dance, and sing a few bars. And for those situations, you obviously need more elaborate implementations of enums.
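To give a feel for what a more elaborate implementation might look like, here is a rough sketch of my own (the Weekday class and its rank attribute are my inventions, not a recipe from any of the sources above). It trades bare classes for class instances that carry a sort key, so that the enumerated values can be compared and sorted:

import functools

@functools.total_ordering
class Weekday:
    def __init__(self, name, rank):
        self.name = name
        self.rank = rank               # the sort key
    def __eq__(self, other):
        return isinstance(other, Weekday) and self.rank == other.rank
    def __lt__(self, other):
        return self.rank < other.rank
    def __repr__(self):
        return self.name

Sunday  = Weekday("Sunday", 0)
Monday  = Weekday("Monday", 1)
Tuesday = Weekday("Tuesday", 2)

print(sorted([Tuesday, Sunday, Monday]))   # produces: [Sunday, Monday, Tuesday]

These ducks can form a conga line; the price is a somewhat noisier definition.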

So, with these thoughts in mind, I’m inclined to think that there is no single, best way to implement an enum in Python. The concept of an enum is flexible enough to cover a variety of implementations offering a variety of features.

Python Decorators

In August 2009, I wrote a post titled Introduction to Python Decorators. It was an attempt to explain Python decorators in a way that I (and I hoped, others) could grok.

Recently I had occasion to re-read that post. It wasn’t a pleasant experience — it was pretty clear to me that the attempt had failed.

That failure — and two other things — have prompted me to try again.

  • Matt Harrison has published an excellent e-book Guide to: Learning Python Decorators.
  • I now have a theory about why most explanations of decorators (mine included) fail, and some ideas about how better to structure an introduction to decorators.

There is an old saying to the effect that “Every stick has two ends, one by which it may be picked up, and one by which it may not.” I believe that most explanations of decorators fail because they pick up the stick by the wrong end.

In this post I will show you what the wrong end of the stick looks like, and point out why I think it is wrong. And I will show you what I think the right end of the stick looks like.

 

The wrong way to explain decorators

Most explanations of Python decorators start with an example of a function to be decorated, like this:

def aFunction():
    print("inside aFunction")

and then add a decoration line, which starts with an @ sign:

@myDecorator
def aFunction():
    print("inside aFunction")

At this point, the author of the introduction often defines a decorator as the line of code that begins with the “@”. (In my older post, I called such lines “annotation” lines. I now prefer the term “decoration” line.)

For instance, in 2008 Bruce Eckel wrote on his Artima blog

A function decorator is applied to a function definition by placing it on the line before that function definition begins.

and in 2004, Phillip Eby wrote in an article in Dr. Dobb’s Journal

Decorators may appear before any function definition…. You can even stack multiple decorators on the same function definition, one per line.

Now there are two things wrong with this approach to explaining decorators. The first is that the explanation begins in the wrong place. It starts with an example of a function to be decorated and a decoration line, when it should begin with the decorator itself. The explanation should end, not start, with the decorated function and the decoration line. The decoration line is, after all, merely syntactic sugar — it is not at all an essential element in the concept of a decorator.

The second is that the term “decorator” is used incorrectly (or ambiguously) to refer both to the decorator and to the decoration line. For example, in his Dr. Dobb’s Journal article, after using the term “decorator” to refer to the decoration line, Phillip Eby goes on to define a “decorator” as a callable object.

But before you can do that, you first need to have some decorators to stack. A decorator is a callable object (like a function) that accepts one argument—the function being decorated.

So… it would seem that a decorator is both a callable object (like a function) and a single line of code that can appear before the line of code that begins a function definition. This is sort of like saying that an “address” is both a building (or apartment) at a specific location and a set of lines (written in pencil or ink) on the front of a mailing envelope. The ambiguity may be almost invisible to someone familiar with decorators, but it is very confusing for a reader who is trying to learn about decorators from the ground up.

 

The right way to explain decorators

So how should we explain decorators?

Well, we start with the decorator, not the function to be decorated.

One
We start with the basic notion of a function — a function is something that generates a value based on the values of its arguments.

Two
We note that in Python, functions are first-class objects, so they can be passed around like other values (strings, integers, objects, etc.).

Three
We note that because functions are first-class objects in Python, we can write functions that both (a) accept function objects as argument values, and (b) return function objects as return values. For example, here is a function foobar that accepts a function object original_function as an argument and returns a function object new_function as a result.

def foobar(original_function):

    # make a new function
    def new_function():
        pass   # some code goes here

    return new_function

Four
We define “decorator”.

A decorator is a function (such as foobar in the above example) that takes a function object as an argument, and returns a function object as a return value.

So there we have it — the definition of a decorator. Anything else that we say about decorators is a refinement of, or an expansion of, or an addition to, this definition of a decorator.

Five
We show what the internals of a decorator look like. Specifically, we show different ways that a decorator can use the original_function in the creation of the new_function. Here is a simple example.

def verbose(original_function):

    # make a new function that prints a message when original_function starts and finishes
    def new_function(*args, **kwargs):
        print("Entering", original_function.__name__)
        result = original_function(*args, **kwargs)
        print("Exiting ", original_function.__name__)
        return result   # pass the original function's return value along

    return new_function

Six
We show how to invoke a decorator — how we can pass into a decorator one function object (its input) and get back from it a different function object (its output). In the following example, we pass the widget_func function object to the verbose decorator, and we get back a new function object to which we assign the name talkative_widget_func.

def widget_func():
    pass   # some code goes here

talkative_widget_func = verbose(widget_func)

Seven
We point out that decorators are often used to add features to the original_function. Or more precisely, decorators are often used to create a new_function that does roughly what original_function does, but also does things in addition to what original_function does.

And we note that the output of a decorator is typically used to replace the original function that we passed in to the decorator as an argument. A typical use of decorators looks like this. (Note the change to the last line from the previous example.)

def widget_func():
    pass   # some code goes here

widget_func = verbose(widget_func)

So for all practical purposes, in a typical use of a decorator we pass a function (widget_func) through a decorator (verbose) and get back an enhanced (or souped-up, or “decorated”) version of the function.

Eight
We introduce Python’s “decoration syntax” that uses the “@” to create decoration lines. This feature is basically syntactic sugar that makes it possible to re-write our last example this way:

@verbose
def widget_func():
    pass   # some code goes here

The result of this example is exactly the same as the previous example — after it executes, we have a widget_func that has all of the functionality of the original widget_func, plus the functionality that was added by the verbose decorator.

Note that in this way of explaining decorators, the “@” and decoration syntax is one of the last things that we introduce, not one of the first.

And we absolutely do not refer to line 1 as a “decorator”. We might refer to line 1 as, say, a “decorator invocation line” or a “decoration line” or simply a “decoration”… whatever. But line 1 is not a “decorator”.

Line 1 is a line of code. A decorator is a function — a different animal altogether.

 

Nine
Once we’ve nailed down these basics, there are a few advanced features to be covered.

  • We explain that a decorator need not be a function (it can be any sort of callable, e.g. a class).
  • We explain how decorators can be nested within other decorators.
  • We explain how decoration lines can be “stacked”. A better way to put it would be: we explain how decorators can be “chained”.
  • We explain how additional arguments can be passed to decorators, and how decorators can use them. (Both of these last two items are sketched below.)
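Here is a brief sketch of those last two items (my own example, not from any of the sources quoted above). The repeat function below is not itself a decorator; it is a function that builds and returns a decorator, customized by the argument it was given:

def verbose(original_function):            # the decorator from section Five
    def new_function(*args, **kwargs):
        print("Entering", original_function.__name__)
        result = original_function(*args, **kwargs)
        print("Exiting ", original_function.__name__)
        return result
    return new_function

def repeat(times):                         # builds and returns a decorator
    def decorator(original_function):
        def new_function(*args, **kwargs):
            for _ in range(times):
                original_function(*args, **kwargs)
        return new_function
    return decorator

@verbose             # applied second: wraps the repeating version
@repeat(times=3)     # applied first: wraps widget_func
def widget_func():
    print("widget_func doing its thing")

widget_func()   # produces one Entering line, three lines from widget_func, one Exiting line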

Ten — A decorators cookbook

The material that we’ve covered up to this point is what any basic introduction to Python decorators would cover. But a Python programmer needs something more in order to be productive with decorators. He (or she) needs a catalog of recipes, patterns, examples, and commentary that describes / shows / explains when and how decorators can be used to accomplish specific tasks. (Ideally, such a catalog would also include examples and warnings about decorator gotchas and anti-patterns.) Such a catalog might be called “Python Decorator Cookbook” or perhaps “Python Decorator Patterns”.



So that’s it. I’ve described what I think is wrong (well, let’s say suboptimal) about most introductions to decorators. And I’ve sketched out what I think is a better way to structure an introduction to decorators.

Now I can explain why I like Matt Harrison’s e-book Guide to: Learning Python Decorators. Matt’s introduction is structured in the way that I think an introduction to decorators should be structured. It picks up the stick by the proper end.

The first two-thirds of the Guide hardly talk about decorators at all. Instead, Matt begins with a thorough discussion of how Python functions work. By the time the discussion gets to decorators, we have been given a strong understanding of the internal mechanics of functions. And since most decorators are functions (remember our definition of decorator), at that point it is relatively easy for Matt to explain the internal mechanics of decorators.

Which is just as it should be.


Revised 2012-11-26 — replaced the word “annotation” with “decoration”, following terminology ideas discussed in the comments.



Unicode – the basics

An introduction to the basics of Unicode, distilled from several earlier posts. In the interests of presenting the big picture, I have painted with a broad brush — large areas are summarized; nits are not picked; hairs are not split; wind resistance is ignored.

Unicode = one character set, plus several encodings

Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.

  • The first — the idea of a character set — has absolutely nothing to do with computers.
  • The second — the idea of encodings for the Unicode character set — has everything to do with computers.

Character sets

The idea of a character set has nothing to do with computers. So let’s suppose that you’re a British linguist living in, say, 1750. The British Empire is expanding and Europeans are discovering many new languages, both living and dead. You’ve known about Chinese characters for a long time, and you’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.

Trying to deal with this huge mass of different characters, you get a brilliant idea — you will make a numbered list of every character in every language that ever existed.

You start your list with your own familiar set of English characters — the upper- and lower-case letters, the numeric digits, and the various punctuation marks like period (full stop), comma, exclamation mark, and so on. And the space character, of course.

01 a
02 b
03 c
...
26 z
27 A
28 B
...
52 Z
53 0
54 1
55 2
...
62 9
63 (space)
64 ? (question mark)
65 , (comma)
... and so on ...

Then you add the Spanish, French and German characters with tildes, accents, and umlauts. You add characters from other living languages — Greek, Japanese, Chinese, Korean, Sanskrit, Arabic, Hebrew, and so on. You add characters from dead alphabets — Assyrian cuneiform — and so on, until finally you have a very long list of characters.

  • What you have created — a numbered list of characters — is known as a character set.
  • The numbers in the list — the numeric identifiers of the characters in the character set — are called code points.
  • And because your list is meant to include every character that ever existed, you call your character set the Universal Character Set.

Congratulations! You’ve just invented (something similar to) the first half of Unicode — the Universal Character Set or UCS.

Encodings

Now suppose you jump into your time machine and zip forward to the present. Everybody is using computers. You have a brilliant idea. You will devise a way for computers to handle UCS.

You know that computers think in ones and zeros — bits — and collections of 8 bits — bytes. So you look at the biggest number in your UCS and ask yourself: How many bytes will I need to store a number that big? The answer you come up with is 4 bytes, 32 bits. So you decide on a simple and straight-forward digital implementation of UCS — each number will be stored in 4 bytes. That is, you choose a fixed-length encoding in which every UCS character (code point) can be represented, or encoded, in exactly 4 bytes, or 32 bits.

In short, you devise the Unicode UCS-4 (Universal Character Set, 4 bytes) encoding, aka UTF-32 (Unicode Transformation Format, 32 bits).
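In Python terms (a quick illustration of mine, not part of the original post), the fixed width is easy to see:

# Every character, even a plain "H", occupies exactly 4 bytes in UTF-32.
# ("utf-32-be" is the big-endian variant, which writes no byte order mark.)
print("Hi".encode("utf-32-be"))   # produces: b'\x00\x00\x00H\x00\x00\x00i'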

UTF-8 and variable-length encodings

UCS-4 is simple and straight-forward… but inefficient. Computers send a lot of strings back and forth, and many of those strings use only ASCII characters — characters from the old ASCII character set. One byte — eight bits — is more than enough to store such characters. It is grossly inefficient to use 4 bytes to store an ASCII character.

The key to the solution is to remember that a code point is nothing but a number (an integer). It may be a short number or a long number, but it is only a number. We need just one byte to store the shorter numbers of the Universal Character Set, and we need more bytes only when the numbers get longer. So the solution to our problem is a variable-length encoding.

Specifically, Unicode’s UTF-8 (Unicode Transformation Format, 8 bit) is a variable-length encoding in which each UCS code point is encoded using 1, 2, 3, or 4 bytes, as necessary.

In UTF-8, if the first bit of a byte is a “0”, then the remaining 7 bits of the byte contain one of the 128 original 7-bit ASCII characters. If the first bit of the byte is a “1”, then the byte is the first of multiple bytes used to represent the code point, and other bits of the byte carry other information, such as the total number of bytes — 2, or 3, or 4 bytes — that are being used to represent the code point. (For a quick overview of how this works at the bit level, see How does UTF-8 “variable-width encoding” work?)
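Python makes it easy to watch the variable-length behavior in action. A quick illustration (mine, not the original post's):

# Each character below needs one more UTF-8 byte than the one before it.
for ch in ("A", "é", "€", "𝄞"):
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")

# produces:
# 'A' -> 1 byte(s)
# 'é' -> 2 byte(s)
# '€' -> 3 byte(s)
# '𝄞' -> 4 byte(s)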

Just use UTF-8

UTF-8 is a great technology, which is why it has become the de facto standard for encoding Unicode text, and is the most widely-used text encoding in the world. Text strings that use only ASCII characters can be encoded in UTF-8 using only one byte per character, which is very efficient. And if characters — Chinese or Japanese characters, for instance — require multiple bytes, well, UTF-8 can do that, too.

Byte Order Mark

Unicode fixed-length multi-byte encodings such as UTF-16 and UTF-32 store UCS code points (integers) in multi-byte chunks — 2-byte chunks in the case of UTF-16 and 4-byte chunks in the case of UTF-32.

Unfortunately, different computer architectures — basically, different processor chips — use different techniques for storing such multi-byte integers. In “little-endian” computers, the “little” (least significant) byte of a multi-byte integer is stored leftmost. “Big-endian” computers do the reverse; the “big” (most significant) byte is stored leftmost.

  • Intel computers are little-endian.
  • Motorola computers are big-endian.
  • Microsoft Windows was designed around a little-endian architecture — it runs only on little-endian computers or computers running in little-endian mode — which is why Intel hardware and Microsoft software fit together like hand and glove.

Differences in endian-ness can create data-exchange issues between computers. Specifically, the possibility of differences in endian-ness means that if two computers need to exchange a string of text data, and that string is encoded in a Unicode fixed-length multi-byte encoding such as UTF-16 or UTF-32, the string should begin with a Byte Order Mark (or BOM) — a special character at the beginning of the string that indicates the endian-ness of the string.
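Python can show us the BOM at work. In this sketch (my example; note that the output of the third line depends on the endian-ness of your machine), the "utf-16" codec writes a BOM, while the explicitly-ordered variants do not:

print("A".encode("utf-16-le"))   # b'A\x00'         little-endian, no BOM
print("A".encode("utf-16-be"))   # b'\x00A'         big-endian, no BOM
print("A".encode("utf-16"))      # b'\xff\xfeA\x00' on a little-endian machine:
                                 # the BOM (0xFF 0xFE), then the encoded "A"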

Strings encoded in UTF-8 don’t require a BOM, so the BOM is basically a non-issue for programmers who use only UTF-8.


Resources

Python’s magic methods

Here are some links to documentation of Python’s magic methods, aka special methods, aka “dunder” (double underscore) methods.

There are also a few other Python features that are sometimes characterized as “magic”.

I’m sure there are other useful Web pages about magic methods that I haven’t found. If you know of one (and feel like sharing it) note that you can code HTML tags into a WordPress comment, like this, and they will show up properly formatted:

I found a useful discussion of magic methods at
<a href="http://www.somebodys_web_site.com/magic-methods">www.somebodys_web_site.com/magic-methods</a>

 

Gotcha — Mutable default arguments


Note: examples are coded in Python 2.x, but the basic point of the post applies to all versions of Python.

There’s a Python gotcha that bites everybody as they learn Python. In fact, I think it was Tim Peters who suggested that every programmer gets caught by it exactly two times. It is called the mutable defaults trap. Programmers are usually bitten by the mutable defaults trap when coding class methods, but I’d like to begin by explaining it in functions, and then move on to talk about class methods.

Mutable defaults for function arguments

The gotcha occurs when you are coding default values for the arguments to a function or a method. Here is an example for a function named foobar:

def foobar(arg_string = "abc", arg_list = []):
    ...

Here’s what most beginning Python programmers believe will happen when foobar is called without any arguments:

A new string object containing “abc” will be created and bound to the “arg_string” variable name. A new, empty list object will be created and bound to the “arg_list” variable name. In short, if the arguments are omitted by the caller, foobar will always get “abc” and [] as its arguments.

This, however, is not what will happen. Here’s why.

The objects that provide the default values are not created at the time that foobar is called. They are created at the time that the statement that defines the function is executed. (See the discussion at Default arguments in Python: two easy blunders: “Expressions in default arguments are calculated when the function is defined, not when it’s called.”)

If foobar, for example, is contained in a module named foo_module, then the statement that defines foobar will probably be executed at the time when foo_module is imported.

When the def statement that creates foobar is executed:

  • A new function object is created, bound to the name foobar, and stored in the namespace of foo_module.
  • Within the foobar function object, for each argument with a default value, an object is created to hold the default value. In the case of foobar, a string object containing “abc” is created as the default for the arg_string argument, and an empty list object is created as the default for the arg_list argument.

After that, whenever foobar is called without arguments, arg_string will be bound to the default string object, and arg_list will be bound to the default list object. In such a case, arg_string will always be “abc”, but arg_list may or may not be an empty list. Here’s why.

There is a crucial difference between a string object and a list object. A string object is immutable, whereas a list object is mutable. That means that the default for arg_string can never be changed, but the default for arg_list can be changed.

Let’s see how the default for arg_list can be changed. Here is a program. It invokes foobar four times. Each time that foobar is invoked it displays the values of the arguments that it receives, then adds something to each of the arguments.

def foobar(arg_string="abc", arg_list = []): 
    print arg_string, arg_list 
    arg_string = arg_string + "xyz" 
    arg_list.append("F")

for i in range(4): 
    foobar()

The output of this program is:

abc [] 
abc ['F'] 
abc ['F', 'F'] 
abc ['F', 'F', 'F']

As you can see, the first time through, the arguments have exactly the defaults that we expect. On the second and all subsequent passes, the arg_string value remains unchanged — just what we would expect from an immutable object. The line

arg_string = arg_string + "xyz"

creates a new object — the string “abcxyz” — and binds the name “arg_string” to that new object, but it doesn’t change the default object for the arg_string argument.

But the case is quite different with arg_list, whose value is a list — a mutable object. On each pass, we append a member to the list, and the list grows. On the fourth invocation of foobar — that is, after three earlier invocations — arg_list contains three members.

The Solution
This behavior is not a wart in the Python language. It really is a feature, not a bug. There are times when you really do want to use mutable default arguments. One thing they can do (for example) is retain a list of results from previous invocations, something that might be very handy.

But for most programmers — especially beginning Pythonistas — this behavior is a gotcha. So for most cases we adopt the following rules.

  1. Never use a mutable object — that is: a list, a dictionary, or a class instance — as the default value of an argument.
  2. Ignore rule 1 only if you really, really, REALLY know what you’re doing.

So… we plan always to follow rule #1. Now, the question is how to do it… how to code foobar in order to get the behavior that we want.

Fortunately, the solution is straightforward. The mutable objects used as defaults are replaced by None, and then the arguments are tested for None.

def foobar(arg_string="abc", arg_list = None): 
    if arg_list is None: arg_list = [] 
    ...

Another solution that you will sometimes see is this:

def foobar(arg_string="abc", arg_list=None): 
    arg_list = arg_list or [] 
    ...

This solution, however, is not equivalent to the first, and should be avoided. See Learning Python p. 123 for a discussion of the differences. Thanks to Lloyd Kvam for pointing this out to me.
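The difference, in brief: the or version replaces any false-y argument, not just None. An empty list passed in by the caller, for example, gets silently discarded in favor of a brand-new list. A small demonstration (my own):

def foobar_or(arg_string="abc", arg_list=None):
    arg_list = arg_list or []      # replaces ANY false-y value, including []
    return arg_list

def foobar_is_none(arg_string="abc", arg_list=None):
    if arg_list is None: arg_list = []
    return arg_list

shared = []                        # the caller wants foobar to use this list
print(foobar_or(arg_list=shared) is shared)        # False -- shared was discarded
print(foobar_is_none(arg_list=shared) is shared)   # True -- only None triggers the default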

And of course, in some situations the best solution is simply not to supply a default for the argument.

Mutable defaults for method arguments

Now let’s look at how the mutable arguments gotcha presents itself when a class method is given a mutable default for one of its arguments. Here is a complete program.

# (1) define a class for company employees 
class Employee:
    def __init__ (self, arg_name, arg_dependents=[]): 
        # an employee has two attributes: a name, and a list of his dependents 
        self.name = arg_name 
        self.dependents = arg_dependents
    
    def addDependent(self, arg_name): 
        # an employee can add a dependent by getting married or having a baby 
        self.dependents.append(arg_name)
    
    def show(self): 
        print
        print "My name is.......: ", self.name 
        print "My dependents are: ", str(self.dependents)
#--------------------------------------------------- 
#   main routine -- hire employees for the company 
#---------------------------------------------------

# (2) hire a married employee, with dependents 
joe = Employee("Joe Smith", ["Sarah Smith", "Suzy Smith"])

# (3) hire a couple of unmarried employees, without dependents 
mike = Employee("Michael Nesmith") 
barb = Employee("Barbara Bush")

# (4) mike gets married and acquires a dependent 
mike.addDependent("Nancy Nesmith")

# (5) now have our employees tell us about themselves 
joe.show() 
mike.show() 
barb.show()

Let’s look at what happens when this program is run.

  1. First, the code that defines the Employee class is run.
  2. Then we hire Joe. Joe has two dependents, so that fact is recorded at the time that the joe object is created.
  3. Next we hire Mike and Barb.
  4. Then Mike acquires a dependent.
  5. Finally, the last three statements of the program ask each employee to tell us about himself.

Here is the result.

My name is.......:  Joe Smith 
My dependents are:  ['Sarah Smith', 'Suzy Smith']

My name is.......:  Michael Nesmith 
My dependents are:  ['Nancy Nesmith']

My name is.......:  Barbara Bush 
My dependents are:  ['Nancy Nesmith']

Joe is just fine. But somehow, when Mike acquired Nancy as his dependent, Barb also acquired Nancy as a dependent. This of course is wrong. And we’re now in a position to understand what is causing the program to behave this way.

When the code that defines the Employee class is run, objects for the class definition, the method definitions, and the default values for each argument are created. The constructor has an argument arg_dependents whose default value is an empty list, so an empty list object is created and attached to the __init__ method as the default value for arg_dependents.

When we hire Joe, he already has a list of dependents, which is passed in to the Employee constructor — so the arg_dependents argument does not use the default empty list object.

Next we hire Mike and Barb. Since they have no dependents, the default value for arg_dependents is used. Remember — this is the empty list object that was created when the code that defined the Employee class was run. So in both cases, the empty list is bound to the arg_dependents argument, and then — again in both cases — it is bound to the self.dependents attribute. The result is that after Mike and Barb are hired, the self.dependents attribute of both Mike and Barb point to the same object — the default empty list object.

When Michael gets married, and Nancy Nesmith is added to his self.dependents list, Barb also acquires Nancy as a dependent, because Barb’s self.dependents variable name is bound to the same list object as Mike’s self.dependents variable name.

So this is what happens when mutable objects are used as defaults for arguments in class methods. If the defaults are used when the method is called, different class instances end up sharing references to the same object.

And that is why you should never, never, NEVER use a list or a dictionary as a default value for an argument to a class method. Unless, of course, you really, really, REALLY know what you’re doing.
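For the record, here is one way (a sketch using the None-default technique shown earlier) to repair the Employee constructor:

class Employee:
    def __init__(self, arg_name, arg_dependents=None):
        # each Employee gets his own fresh list unless the caller supplies one
        if arg_dependents is None:
            arg_dependents = []
        self.name = arg_name
        self.dependents = arg_dependents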

Backing up your email

Just in case someone might find this useful …

I recently had something bad happen to me. I use Thunderbird (on Windows Vista) as my email client. I asked Thunderbird to compact my email files, and it wiped out a bunch of my email messages. (I think that one of my email files must have been corrupt, and when I compacted it, the compaction process wiped out messages that should not have been wiped out.)

You can recover deleted email messages … but not after the email file has been compacted. So the messages were not recoverable. Bummer.

The upside is that this nasty incident led me to learn some things.

One thing that I learned was that the disk backup utility that I was using at the time did NOT back up my email files. The email files were stored in a directory called AppData, and the AppData directory is a “hidden” directory. So the backup utility didn’t see the AppData directory, and didn’t back it up. So I had no backup of the deleted messages.

Learning that led me to investigate ways to back up my email files, and I found this: Five ways to keep your emails backed up

For backing up Thunderbird files, it recommends MozBackup as being fast, free and easy to use. So I tried MozBackup, and those claims seem to be true.

Now I’m evaluating different disk backup options.

The take-away here is that you need to pay special attention to backing up your email files. So if you’re not backing up your email files, take a look at Five ways to keep your emails backed up (and read the comments, which are useful) or google something like “email backup”.

[Note that this applies only if you are using an email client such as Thunderbird, Outlook, Outlook Express, etc. If you don't use an email client, and do all of your email work through a Web interface to your Internet Service Provider, then this is not an issue.]

Unicode for dummies — Encoding

Another entry in an irregular series of posts about Unicode.
Typos fixed 2012-02-22. Thanks Anonymous, and Clinton, for reporting the typos.

This is a story about encoding and decoding, with a minor subplot involving Unicode.

As our story begins — on a dark and stormy night, of course — we find our protagonist deep in thought. He is asking himself “What is an encoding?”

What is an encoding?

The basic concepts are simple. First, we start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being. I’m going to call that representation “plain text”. For English-language speakers, for example, English words printed on a page, or displayed on a screen, count as plain text.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation (let’s call that representation the “encoded text”), and we need to be able to translate the encoded text back into plain text. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

[Diagram: encoding and decoding]

There are three points worth noting about this process.

The first point is that no information can be lost during encoding or decoding. It must be possible for us to send a message on a round-trip journey — from plain text to encoded text, and then back again from encoded text to plain text — and get back exactly the same plain text that we started with. That is why, for instance, we can’t use one natural language (Russian, Chinese, French, Navaho) as an encoding for another natural language (English, Hindi, Swahili). The mappings between natural languages are too loose to guarantee that a piece of information can make the round-trip without losing something in translation.

The requirement for a lossless round-trip means that the mapping between the plain text and the encoded text must be very tight, very exact. And that brings us to the second point.

In order for the mapping between the plain text and the encoded text to be very tight — which is to say: in order for us to be able to specify very precisely how the encoding and decoding processes work — we must specify very precisely what the plain text representation looks like.

Suppose, for example, we say that plain text looks like this: the 26 upper-case letters of the Anglo-American alphabet, plus the space and three punctuation symbols: period (full stop), question mark, and dash (hyphen). This gives us a plain-text alphabet of 30 characters. If we need numbers, we can spell them out, like this: “SIX THOUSAND SEVEN HUNDRED FORTY-THREE”.

On the other hand, we may wish to say that our plain text looks like this: 26 upper-case letters, 26 lower-case letters, 10 numeric digits, the space character, and a dozen types of punctuation marks: period, comma, double-quote, left parenthesis, right parenthesis, and so on. That gives us a plain-text alphabet of 75 characters.

Once we’ve specified exactly what a plain-text representation of a message looks like — a finite sequence of characters from our 30-character alphabet, or perhaps our 75-character alphabet — then we can devise a system (a code) that can reliably encode and decode plain-text messages written in that alphabet. The simplest such system is one in which every character in the plain-text alphabet has one and only one corresponding representation in the encoded text. A familiar example is Morse code, in which “SOS” in plain text corresponds to

                ... --- ...

in encoded text.

In the real world, of course, the selection of characters for the plain-text alphabet is influenced by technological limitations on the encoded text. Suppose we have several available technologies for storing encoded messages: one technology supports an encoded alphabet of 256 characters, another technology supports only 128 encoded characters, and a third technology supports only 64 encoded characters. Naturally, we can make our plain-text alphabet much larger if we know that we can use a technology that supports a larger encoded-text alphabet.

And the reverse is also true. If we know that our plain-text alphabet must be very large, then we know that we must find — or devise — a technology capable of storing a large number of encoded characters.

Which brings us to Unicode.

Unicode

Unicode was devised to be a system capable of storing encoded representations of every plain-text character of every human language that has ever existed. English, French, Spanish. Greek. Arabic. Hindi. Chinese. Assyrian (cuneiform characters).

That’s a lot of characters.

So the first task of the Unicode initiative was simply to list all of those characters, and count them. That’s the first half of Unicode, the Universal Character Set. (And if you really want to “talk Unicode”, don’t call plain-text characters “characters”. Call them “code points”.)

Once you’ve done that, you’ve got to figure out a technology for storing all of the corresponding encoded-text characters. (In Unicode-speak, the encoded-text characters are called “code values”.)

In fact Unicode defines not one but several methods of mapping code points to code values. Each of these methods has its own name. Some of the names start with “UTF”, others start with “UCS”: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, and so on. The naming convention is “UTF-<number of bits in a code value>” and “UCS-<number of bytes in a code value>”. Some (e.g. UCS-4 and UTF-32) are functionally equivalent. See the Wikipedia article on Unicode.

The most important thing about these methods is that some are fixed-width encodings and some are variable-width encodings. The basic idea is that the fixed-width encodings are very long — UCS-4 and UTF-32 are 4 bytes (32 bits) long — long enough to hold the biggest code value that we will ever need.

In contrast, the variable-width encodings are designed to be short, but expandable. UTF-8, for example, can use as few as 8 bits (one byte) to store the code points of Latin and ASCII characters. But it also has a sort of “continued on the next byte” mechanism that allows it to use 2 bytes or even 4 bytes if it needs to (as it might, for Chinese characters). For Western programmers, that means that UTF-8 is both efficient and flexible, which is why UTF-8 is the de facto standard encoding for exchanging Unicode text.

There is, then, no such thing as THE Unicode encoding system or method. There are several encoding methods, and if you want to exchange text with someone, you need explicitly to specify which encoding method you are using.

Is it, say, this.

[Diagram: encoding and decoding with UTF-8]

Or this.

[Diagram: encoding and decoding with UTF-16]

Or something else.
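Python can make the point concrete. In this little illustration (mine, not part of the original post), the same plain text produces quite different encoded text under different encoding methods:

s = "Hello"
print(s.encode("utf-8"))    # b'Hello'
print(s.encode("utf-16"))   # b'\xff\xfeH\x00e\x00l\x00l\x00o\x00' (on a little-endian machine)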

Which brings us back to something I said earlier.

Why encode something in Unicode?

At the beginning of this post I said

We start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

OK. So now it is time to explore those reasons. Why might we want to translate a message in a plain-text representation into some other representation?

One reason, of course, is that we want to keep a secret. We want to hide the plain text of our message by encrypting and decrypting it — basically, by keeping the algorithms for encoding and decoding secret and private.

But that is a completely different subject. Right now, we’re not interested in keeping secrets; we’re Python programmers and we’re interested in Unicode. So:

Why — as a Python programmer — would I need to be able to translate a plain-text message into some encoded representation… say, a Unicode representation such as UTF-8?

Suppose you are happily sitting at your PC, working with your favorite text editor, writing the standard Hello World program in Python (specifically, in Python 3+). This single line is your entire program.

                   print("Hello, world!")

Here, “Hello, world!” is plain text. You can see it on your screen. You can read it. You know what it means. It is just a string and you can (if you wish) do standard string-type operations on it, such as taking a substring (a slice).

But now suppose you want to put this string — “Hello, world!” — into a file and save the file on your hard drive. Perhaps you plan to send the file to a friend.

That means that you must eject your poor little string from the warm, friendly, protected home in your Python program, where it exists simply as plain-text characters. You must thrust it into the cold, impersonal, outside world of the file system. And out there it will exist not as characters, but as mere 1s and 0s, a jumble of dits and dots, charged and uncharged particles. And that means that your happy little plain-text string must be represented by some specific configuration of 1s and 0s, so that when somebody wants to retrieve that collection of 1s and 0s and convert it back into readable plain text, they can.

The process of converting a plain text into a specific configuration of 1s and 0s is a process of encoding. In order to write a string to a file, you must encode it using some encoding system (such as UTF-8). And to get it back from a file, you must read the file and decode the collection of 1s and 0s back into plain text.
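In Python 3 that looks like this (a minimal sketch; the file name is my own invention):

# encode on the way out ...
with open("hello.txt", "w", encoding="utf-8") as f:
    f.write("Hello, world!")       # the str is encoded to UTF-8 bytes

# ... and decode on the way back in
with open("hello.txt", "r", encoding="utf-8") as f:
    text = f.read()                # the UTF-8 bytes are decoded back into a str

print(text)                        # produces: Hello, world!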

The need to encode/decode strings when writing/reading them from/to files isn’t something new — it is not an additional burden imposed by Python 3’s new support for Unicode. It is something you have always done. But it wasn’t always so obvious. In earlier versions of Python, the encoding scheme was ASCII. And because, in those olden times, ASCII was pretty much the only game in town, you didn’t need to specify that you wanted to write and read your files in ASCII. Python just assumed it by default and did it. But — whether or not you realized it — whenever one of your programs wrote or read strings from a file, Python was busy behind the scenes, doing the encoding and decoding for you.

So that’s why you — as a Python programmer — need to be able to encode and decode text into, and out of, UTF-8 (or some other encoding: UTF-16, ASCII, whatever). You need to encode your strings as 1s and 0s so you can put those 1s and 0s into a file and send the file to someone else.

What is plain text?

Earlier, I said that there were three points worth noting about the encoding/decoding process, and I discussed the first two. Here is the third point.

The distinction between plain text and encoded text is relative and context-dependent.

As programmers, we think of plain text as being written text. But it is possible to look at matters differently. For instance, we can think of spoken text as the plain text, and written text as the encoded text. From this perspective, writing is encoded speech. And there are many different encodings for speech as writing. Think of Egyptian hieroglyphics, Mayan hieroglyphics, the Latin alphabet, the Greek alphabet, Arabic, Chinese ideograms, wonderfully flowing Devanagari देवनागरी, sharp pointy cuneiform wedges, even shorthand. These are all written encodings for the spoken word. They are all, as Thomas Hobbes put it, “Marks by which we may remember our thoughts”.

Which reminds us that, in a different context, even speech itself — language — may be regarded as a form of encoding. In much of early modern philosophy (think of Hobbes and Locke) speech (or language) was basically considered to be an encoding of thoughts and ideas. Communication happens when I encode my thought into language and say something — speak to you. You hear the sound of my words and decode it back into ideas. We achieve communication when I successfully transmit a thought from my mind to your mind via language. You understand me when — as a result of my speech — you have the same idea in your mind as I have in mine. (See Ian Hacking, Why Does Language Matter to Philosophy?)

Finally, note that in other contexts, the “plain text” isn’t even text. Where the plain text is soundwaves (e.g. music), it can be encoded as an mp3 file. Where the plain text is an image, it can be encoded as a gif, or png, or jpg file. Where the plain text is a movie, it can be encoded as a wmv file. And so on.

Everywhere, we are surrounded by encoding and decoding.


Notes

I’d like to recommend Eli Bendersky’s recent post on The bytes/str dichotomy in Python 3, which prodded me — finally — to put these thoughts into writing. I especially like this passage in his post.

Think of it this way: a string is an abstract representation of text. A string consists of characters, which are also abstract entities not tied to any particular binary representation. When manipulating strings, we’re living in blissful ignorance. We can split and slice them, concatenate and search inside them. We don’t care how they are represented internally and how many bytes it takes to hold each character in them. We only start caring about this when encoding strings into bytes (for example, in order to send them over a communication channel), or decoding strings from bytes (for the other direction).

I strongly recommend Charles Petzold’s wonderful book Code: The Hidden Language of Computer Hardware and Software.

And finally, I’ve found Stephen Pincock’s Codebreaker: The History of Secret Communications a delightful read. It will tell you, among many other things, how the famous WWII Navaho codetalkers could talk about submarines and dive bombers… despite the fact that there are no Navaho words for “submarine” or “dive bomber”.

How to post source code on WordPress

This post is for folks who blog about Python (or any programming language for that matter) on WordPress.
Updated 2011-11-09 to make it easier to copy-and-paste the [sourcecode] template.

My topic today is How to post source code on WordPress.

The trick is to use the WordPress [sourcecode] shortcut tag, as documented at http://en.support.wordpress.com/code/posting-source-code/.

Note that when the WordPress docs tell you to enclose the [sourcecode] shortcut tag in square — not pointy — brackets, they mean it. When you view your post as HTML, what you should see is square brackets around the shortcut tags, not pointy brackets.

Here is the tag I like to use for snippets of Python code.


[sourcecode language="python" wraplines="false" collapse="false"]
your source code goes here
[/sourcecode]


The default for wraplines is true, which causes long lines to be wrapped. That isn’t appropriate for Python, so I specify wraplines=”false”.

The default for collapse is false, which is what I normally want. But I code it explicitly, as a reminder that if I ever want to collapse a long code snippet, I can.


Here are some examples.

Note that

  • WordPress knows how to do syntax highlighting for Python. It uses Alex Gorbatchev’s SyntaxHighlighter.
  • If you hover your mouse pointer over the code, you get a pop-up toolbar that allows you to look at the original source code snippet, copy it to the clipboard, print it, etc.

(1)

First, a normal chunk of relatively short lines of Python code.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()

(2)

Here is a different code snippet. This one has a line containing a very long comment. Note that the long line is NOT wrapped, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to. That is because we have specified wraplines=”false”.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="false", so lines are NOT wrapped, but extend indefinitely, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to.

(3)

This is what a similar code snippet would look like if we had specified wraplines=true. Note that line 2 wraps around and there is no horizontal scrollbar.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and no horizontal scrollbar is provided.

(4)

Finally, the same code snippet with collapse=true, so the code snippet initially displays as collapsed. Clicking on the collapsed code snippet will cause it to expand.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and no horizontal scrollbar is provided.

As far as I can tell, once a reader has expanded a snippet that was initially collapsed, there is no way for him to re-collapse it. That would be a nice enhancement for WordPress — to allow a reader to collapse and expand a code snippet.


Here is a final thought about wraplines. If you specify wraplines=”false”, and a reader prints a paper copy of your post, the printed output will not show the scrollbar, and it will show only the portion of long lines that were visible on the screen. In short, the printed output might cut off the right-hand part of long lines.

In most cases, I think, this should not be a problem. The pop-up tools allow a reader to view or print the entire source code snippet if he wants to. Still, I can imagine cases in which I might choose to specify wraplines=”true”, even for a whitespace-sensitive language such as Python. And I can understand that someone else, simply as a matter of personal taste, might prefer to specify wraplines=”true” all of the time.

Now that I think of it, another nice enhancement for WordPress would be to allow a reader to toggle wraplines on and off.


Keep on bloggin’!

Python3 pickling

Recently I was converting some old Python2 code to Python3 and I ran across a problem pickling and unpickling.

I guess I would say it wasn’t a major problem because I found the solution fairly quickly with a bit of googling around.

Still, I think the problem and its solution are worth a quick note.  Others will stumble across this problem in the future, especially because there are code examples floating around (in printed books and online posts) that will lead new Python programmers to make this very same mistake.

So let’s talk about pickling.

Suppose you want to “pickle” an object — dump it to a pickle file for persistent storage.

When you pickle an object, you do two things.

  • You open the file that you want to use as the pickle file. The open(…) returns a file handle object.
  • You pass the object that you want to pickle, and the file handle object, to pickle.

Your code might look something like this. Note that this code is wrong. See below.

fileHandle = open(pickleFileName, "w")
pickle.dump(objectToBePickled, fileHandle)

When I wrote code like this, I got back this error message:

Pickler(file, protocol, fix_imports=fix_imports).dump(obj)
TypeError: must be str, not bytes

Talk about a crappy error message!!!

After banging my head against the wall for a while, I googled around and quickly found a very helpful answer on StackOverflow.

The bottom line is that a Python pickle file is (and always has been) a byte stream. Which means that you should always open a pickle file in binary mode: “wb” to write it, and “rb” to read it. The Python docs contain correct example code.

My old code worked just fine running under Python2 (on Windows).  But with Python3’s new strict separation of strings and bytes, it broke. Changing “w” to “wb”, and “r” to “rb”, fixed it.
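So the corrected version of the code shown earlier looks like this (a minimal sketch; the file name and sample object are my own inventions):

import pickle

pickleFileName = "state.pickle"
objectToBePickled = {"spam": 1, "eggs": [2, 3]}

with open(pickleFileName, "wb") as fileHandle:    # "wb", not "w"
    pickle.dump(objectToBePickled, fileHandle)

with open(pickleFileName, "rb") as fileHandle:    # "rb", not "r"
    restored = pickle.load(fileHandle)

print(restored == objectToBePickled)              # produces: True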


One person who posted a question about this problem on the Python forum was aware of the issue, but confused because he was trying to pickle a string.

import pickle
a = "blah"
file = open('state', 'w')
pickle.dump(a,file)

I know of one easy way to solve this is to change the operation argument from ‘w’ to ‘wb’ but I AM using a string not bytes! And none of the examples use ‘wb’ (I figured that out separately) so I want to have an understanding of what is going on here.

Basically, regardless of the kind of object that you are pickling (even a string object), the object will be converted to a bytes representation and pickled as a byte stream. Which means that you always need to use “rb” and “wb”, regardless of the kind of object that you are pickling.
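You can see this for yourself. A quick demonstration (mine):

import pickle
pickled = pickle.dumps("blah")     # pickle a str ...
print(type(pickled))               # ... and bytes come out: <class 'bytes'>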

Yet Another Lambda Tutorial

modified to use the WordPress [sourcecode] tag — 2012-01-14

There are a lot of tutorials[1] for Python’s lambda out there. One that I stumbled across recently and really found helpful was Mike Driscoll’s discussion of lambda on the Mouse vs Python blog.

When I first started learning Python, one of the most confusing concepts to get my head around was the lambda statement. I’m sure other new programmers get confused by it as well…

Mike’s discussion is excellent: clear, straight-forward, with useful illustrative examples. It helped me — finally — to grok lambda, and led me to write yet another lambda tutorial.



Lambda: a tool for building functions

Basically, Python’s lambda is a tool for building functions (or more precisely, function objects). That means that Python has two tools for building functions: def and lambda.

Here’s an example. You can build a function in the normal way, using def, like this:

import math
def square_root(x): return math.sqrt(x)

or you can use lambda:

square_root = lambda x: math.sqrt(x)

Here are a few other interesting examples of lambda:

import sys

sum = lambda x, y:   x + y   #  def sum(x,y): return x + y
out = lambda   *x:   sys.stdout.write(" ".join(map(str,x)))
# (a fragment from a GUI program: binds the button's current label as a default argument)
lambda event, name=button8.getLabel(): self.onButton(event, name)



What is lambda good for?

A question that I’ve had for a long time is: What is lambda good for? Why do we need lambda?

The answer is:

  • We don’t need lambda; we could get along all right without it. But…
  • there are certain situations where it is convenient — it makes writing code a bit easier, and the written code a bit cleaner.

What kind of situations?

Well, situations in which we need a simple one-off function: a function that is going to be used only once.

Normally, functions are created for one of two purposes: (a) to reduce code duplication, or (b) to modularize code.

  • If your application contains duplicate chunks of code in various places, then you can put one copy of that code into a function, give the function a name, and then — using that function name — call it from various places in your code.
  • If you have a chunk of code that performs one well-defined operation — but is really long and gnarly and interrupts the otherwise readable flow of your program — then you can pull that long gnarly code out and put it into a function all by itself.

But suppose you need to create a function that is going to be used only once — called from only one place in your application. Well, first of all, you don’t need to give the function a name. It can be “anonymous”. And you can just define it right in the place where you want to use it. That’s where lambda is useful.

But, but, but… you say.

  • First of all — Why would you want a function that is called only once? That eliminates reason (a) for making a function.
  • And the body of a lambda can contain only a single expression. That means that lambdas must be short. So that eliminates reason (b) for making a function.

What possible reason could I have for wanting to create a short, anonymous function?

Well, consider this snippet of code that uses lambda to define the behavior of buttons in a Tkinter GUI interface. (This example is from Mike’s tutorial.)

# (a snippet from inside a class method; assumes Tkinter has been imported as tk)
frame = tk.Frame(parent)
frame.pack()

btn22 = tk.Button(frame, 
        text="22", command=lambda: self.printNum(22))
btn22.pack(side=tk.LEFT)

btn44 = tk.Button(frame, 
        text="44", command=lambda: self.printNum(44))
btn44.pack(side=tk.LEFT)

The thing to remember here is that a tk.Button expects a function object as an argument to the command parameter. That function object will be the function that the button calls when it (the button) is clicked. Basically, that function specifies what the GUI will do when the button is clicked.

So we must pass a function object in to a button via the command parameter. And note that — since different buttons do different things — we need a different function object for each button object. Each function will be used only once, by the particular button to which it is being supplied.
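
Incidentally, this is also where a classic beginner's mistake shows up: accidentally calling the function instead of passing it. A quick sketch of the difference (using the same hypothetical printNum method):

# wrong: this calls self.printNum(22) immediately, and passes its
# return value (None) to the Button as its command
btn22 = tk.Button(frame, text="22", command=self.printNum(22))

# right: this passes a function object, which the Button calls later
btn22 = tk.Button(frame, text="22", command=lambda: self.printNum(22))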

So, although we could code (say)

def __init__(self, parent):
    """Constructor"""
    frame = tk.Frame(parent)
    frame.pack()

    btn22 = tk.Button(frame, 
        text="22", command=self.buttonCmd22)
    btn22.pack(side=tk.LEFT)

    btn44 = tk.Button(frame, 
        text="44", command=self.buttonCmd44)
    btn44.pack(side=tk.LEFT)

def buttonCmd22(self):
    self.printNum(22)

def buttonCmd44(self):
    self.printNum(44)

it is much easier (and clearer) to code

def __init__(self, parent):
    """Constructor"""
    frame = tk.Frame(parent)
    frame.pack()

    btn22 = tk.Button(frame, 
        text="22", command=lambda: self.printNum(22))
    btn22.pack(side=tk.LEFT)

    btn44 = tk.Button(frame, 
        text="44", command=lambda: self.printNum(44))
    btn44.pack(side=tk.LEFT)

When a GUI program has this kind of code, the button object is said to “call back” to the function object that was supplied to it as its command.

So we can say that one of the most frequent uses of lambda is in coding “callbacks” to GUI frameworks such as Tkinter and wxPython.



This all seems pretty straight-forward. So…

Why is lambda so confusing?

There are four reasons that I can think of.

First Lambda is confusing because: the requirement that a lambda can take only a single expression raises the question: What is an expression?

A lot of people would like to know the answer to that one. If you Google around a bit, you will see a lot of posts from people asking “In Python, what’s the difference between an expression and a statement?”

One good answer is that an expression returns (or evaluates to) a value, whereas a statement does not. Unfortunately, the situation is muddled by the fact that in Python an expression can also be a statement. And we can always throw a red herring into the mix — assignment statements like a = b = 0 suggest that assignment statements return values, the way they do in C. (They do not; Python merely recognizes chained assignment as a special case of the assignment statement.)[2]

In many cases when people ask this question, what they really want to know is: What kind of things can I, and can I not, put into a lambda?

And for that question, I think a few simple rules of thumb will be sufficient.

  • If it doesn’t return a value, it isn’t an expression and can’t be put into a lambda.
  • If you can imagine it in an assignment statement, on the right-hand side of the equals sign, it is an expression and can be put into a lambda.

Using these rules means that:

  1. Assignment statements cannot be used in lambda. In Python, assignment statements don’t return anything, not even None (null).
  2. Simple things such as mathematical operations, string operations, list comprehensions, etc. are OK in a lambda.
  3. Function calls are expressions. It is OK to put a function call in a lambda, and to pass arguments to that function. Doing this wraps the function call (arguments and all) inside a new, anonymous function.
  4. In Python 3, print became a function, so in Python 3+, print(…) can be used in a lambda.
  5. Even functions that return None, like the print function in Python 3, can be used in a lambda.
  6. Conditional expressions, which were introduced in Python 2.5, are expressions (and not merely a different syntax for an if/else statement). They return a value, and can be used in a lambda.
    lambda: a if some_condition() else b
    lambda x: 'big' if x > 100 else 'small'
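
To make those rules concrete, here are a few quick experiments (names invented for illustration):

add = lambda x, y: x + y                                # a simple expression: OK
evens = lambda nums: [n for n in nums if n % 2 == 0]    # list comprehension: OK
say = lambda *args: print(*args)                        # in Python 3, print() is a function: OK
size = lambda x: "big" if x > 100 else "small"          # conditional expression: OK

# bad = lambda x: y = x + 1
# ...is a SyntaxError: an assignment statement is not an expression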


Second Lambda is confusing because: the specification that a lambda can take only a single expression raises the question: Why? Why only one expression? Why not multiple expressions? Why not statements?

For some developers, this question means simply Why is the Python lambda syntax so weird? For others, especially those with a Lisp background, the question means Why is Python’s lambda so crippled? Why isn’t it as powerful as Lisp’s lambda?

The answer is complicated, and it involves the “pythonicity” of Python’s syntax. Lambda was a relatively late addition to Python. By the time that it was added, Python syntax had become well established. Under the circumstances, the syntax for lambda had to be shoe-horned into the established Python syntax in a “pythonic” way. And that placed certain limitations on the kinds of things that could be done in lambdas.

Frankly, I still think the syntax for lambda looks a little weird. Be that as it may, Guido has explained why lambda’s syntax is not going to change. Python will not become Lisp.[3]


Third Lambda is confusing because: lambda is usually described as a tool for creating functions, but a lambda specification does not contain a return statement.

The return statement is, in a sense, implicit in a lambda. Since a lambda specification must contain only a single expression, and that expression must return a value, an anonymous function created by lambda implicitly returns the value returned by the expression. This makes perfect sense.

Still — the lack of an explicit return statement is, I think, part of what makes it hard to grok lambda, or at least, hard to grok it quickly.


Fourth Lambda is confusing because: tutorials on lambda typically introduce lambda as a tool for creating anonymous functions, when in fact the most common use of lambda is for creating anonymous procedures.

Back in the High Old Times, we recognized two different kinds of subroutines: procedures and functions. Procedures were for doing stuff, and did not return anything. Functions were for calculating and returning values. The difference between functions and procedures was even built into some programming languages. In Pascal, for instance, procedure and function were different keywords.

In most modern languages, the difference between procedures and functions is no longer enshrined in the language syntax. A Python function, for instance, can act like a procedure, a function, or both. The (not altogether desirable) result is that a Python function is always referred to as a “function”, even when it is essentially acting as a procedure.

Although the distinction between a procedure and a function has essentially vanished as a language construct, we still often use it when thinking about how a program works. For example, when I’m reading the source code of a program and see some function F, I try to figure out what F does. And I often can categorize it as a procedure or a function — “the purpose of F is to do so-and-so” I will say to myself, or “the purpose of F is to calculate and return such-and-such”.

So now I think we can see why many explanations of lambda are confusing.

First of all, the Python language itself masks the distinction between a function and a procedure.

Second, most tutorials introduce lambda as a tool for creating anonymous functions, things whose primary purpose is to calculate and return a result. The very first example that you see in most tutorials (this one included) shows how to write a lambda to return, say, the square root of x.

But this is not the way that lambda is most commonly used, and is not what most programmers are looking for when they Google “python lambda tutorial”. The most common use for lambda is to create anonymous procedures for use in GUI callbacks. In those use cases, we don’t care about what the lambda returns, we care about what it does.

This explains why most explanations of lambda are confusing for the typical Python programmer. He’s trying to learn how to write code for some GUI framework: Tkinter, say, or wxPython. He runs across examples that use lambda, and wants to understand what he’s seeing. He Googles for “python lambda tutorial”. And he finds tutorials that start with examples that are entirely inappropriate for his purposes.

So, if you are such a programmer — this tutorial is for you. I hope it helps. I’m sorry that we got to this point at the end of the tutorial, rather than at the beginning. Let’s hope that someday, someone will write a lambda tutorial that, instead of beginning this way

Lambda is a tool for building anonymous functions.

begins something like this

Lambda is a tool for building callback handlers.



So there you have it. Yet another lambda tutorial.


Footnotes


[1] Some lambda tutorials:


[2] In some programming languages, such as C, an assignment statement returns the assigned value. This allows chained assignments such as x = y = a, in which the assignment statement y = a returns the value of a, which is then assigned to x. In Python, assignment statements do not return a value. Chained assignment (or more precisely, code that looks like chained assignment statements) is recognized and supported as a special case of the assignment statement.


[3] Python developers who are familiar with Lisp have argued for increasing the power of Python’s lambda, moving it closer to the power of lambda in Lisp. There have been a number of proposals for a syntax for “multi-line lambda”, and so on. Guido has rejected these proposals and blogged about some of his thinking about “pythonicity” and language features as a user interface. This led to an interesting discussion on Lambda the Ultimate, the programming languages weblog, about lambda and about the idea that programming languages have personalities.

Read-Ahead and Python Generators

One of the early classics of program design is Michael Jackson’s Principles of Program Design (1975), which introduced (what later came to be known as) JSP: Jackson Structured Programming.

Back in the 1970s, most business application programs did their work by reading and writing sequential files of records stored on tape. And it was common to see programs whose top-level control structure looked like (what I will call) the “standard loop”:

open input file F

while not EndOfFile on F:
    read a record
    process the record

close F

Jackson showed that this way of processing a sequence almost always created unnecessary problems in the program logic, and that a better way was to use what he called a “read-ahead” technique. 

In the read-ahead technique, a record is read from the input file immediately after the file is opened, and then a second “read” statement is executed after each record is processed.

This technique produces a program structure like this:

open input file F
read a record from F     # get first

while not EndOfFile on F:
    process the record
    read the next record from F  # get next

close F

I won’t try to explain when or why the read-ahead technique is preferable to the standard loop. That’s out of scope for this blog entry, and a good book on JSP can explain that better than I can. So for now, let’s just say that there are some situations in which the standard loop is the right tool for the job, and there are other situations in which read-ahead is the right tool for the job.

One of the joys of Python is that Python makes it so easy to do “standard loop” processing on a sequence such as a list or a string.

for item in sequence:
    processItem(item)

There are times, however, when you have a sequence that you need to process with the read-ahead technique.

With Python generators, it is easy to do. Generators make it easy to convert a sequence into a kind of object that provides both a get next method and an end-of-file mark.  That kind of object can easily be processed using the read-ahead technique.

Suppose that we have a list of items (called listOfItems) and we wish to process it using the read-ahead technique.

First, we create the “read-ahead” generator:

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None # return the "end of file mark" after the last item

Then we can write our code this way:

items = ReadAhead(listOfItems)
item = items.next()  # get first  (Python 2 style; in Python 3 use next(items))
while item:
    processItem(item)
    item = items.next()  # get next

Here is a simple example.

We have a string (called “line”) consisting of characters. Each line consists of zero or more indent characters, some text characters, and (optionally) a special SYMBOL character followed by some suffix characters. For those familiar with JSP, the input structure diagram looks like this.

line
    - indent
        * one indent char
    - text
        * one text char
    - possible suffix
        o no suffix
        o suffix
            - suffix SYMBOL
            - suffix chars
                * one suffix char

We want to parse the line into 3 groups: indent characters, text characters, and suffix characters.

# (INDENT_CHAR and SYMBOL are constants assumed to be defined elsewhere)
indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()
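
A side note: the code above is written in Python 2 style. In Python 3, the generator's .next() method became the built-in next() function, so the read-ahead loop would look like this:

items = ReadAhead(listOfItems)
item = next(items)   # get first
while item:
    processItem(item)
    item = next(items)   # get next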

In Java, what is the difference between an abstract class and an interface?

This post is about Java, and has nothing to do with Python.  I’ve posted it here so that it can be available to other folks who might find it useful. (And because I don’t have a Java blog!)

In Java, what is the difference between an abstract class and an interface?

This is a question that comes up periodically. When I Googled for answers to it, I didn’t very much like any of the answers that I found, so I wrote my own. For those who might be interested, here it is.

Q: What is the difference between an abstract class and an interface?

A: Good question.

To help explain, first let me introduce some terminology that I hope will help clarify the situation.

  • I will say that a fully abstract class is an abstract class in which all methods are abstract.
  • In contrast, a partially abstract class is an abstract class in which some of the methods are abstract, and some are concrete (i.e. have implementations).

Q: OK. So what is the difference between a fully abstract class and an interface?

A: Basically, none. They are the same.

Q: Then why does Java have the concept of an interface, as well as the concept of an abstract class?

A: Because Java doesn’t support multiple inheritance. Or rather I should say, it supports a limited form of multiple inheritance.

Q: Huh??!!!

A: Java has a rule that a class can extend only one abstract class, but can implement multiple interfaces (fully abstract classes).

There’s a reason why Java has such a rule.

Remember that a class can be an abstract class without being a fully abstract class. It can be a partially abstract class.

Now imagine that we have two partially abstract classes A and B. Both have some abstract methods, and both contain a non-abstract method called foo().

And imagine that Java allows a class to extend more than one abstract class, so we can write a class C that extends both A and B. And imagine that C doesn’t implement foo().

So now there is a problem. Suppose we create an instance of C and invoke its foo() method. Which foo() should Java invoke? A.foo() or B.foo()?
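
Here is a sketch of the dilemma in (deliberately illegal, hypothetical) Java code:

// NOT legal Java -- shown only to illustrate the problem
abstract class A {
    abstract void bar();
    void foo() { System.out.println("A.foo"); }   // concrete method
}

abstract class B {
    abstract void baz();
    void foo() { System.out.println("B.foo"); }   // concrete method
}

class C extends A, B {   // compile error: a class cannot extend two classes
    void bar() { }
    void baz() { }
    // C would inherit two different implementations of foo().
    // For "new C().foo()" -- which one should run?
}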

Some languages allow multiple inheritance, and have a way to answer that question. Python for example has a “method resolution order” algorithm that determines the order in which superclasses are searched, looking for an implementation of foo().

But the designers of Java made a different choice. They chose to make it a rule that a class can inherit from as many fully abstract classes as it wants, but can inherit from only one partially abstract class. That way, the question of which foo() to use will never come up.

This is a form of limited multiple inheritance. Basically, the rule says that you can inherit from (extend) as many classes as you want, but if you do, only one of those classes can contain concrete (implemented) methods.

So now we do a little terminology substitution:

abstract class = a class that contains at least one abstract method, and can also contain concrete (implemented) methods

interface =  a class that is fully abstract — it has abstract methods, but no concrete methods

With those substitutions, you get the familiar Java rule:

A class can extend at most one abstract class, but may implement many interfaces.

That is, Java supports a limited form of multiple inheritance.

Newline conversion in Python 3

I use Python on both Windows and Unix.  Occasionally when running on Windows  I need to read in a file containing Windows newlines and write it out with Unix/Linux newlines.  And sometimes when running on Unix, I need to run the newline conversion in the other direction.

Prior to Python 3, the accepted way to do this was to read data from the file in binary mode, convert the newline characters in the data, and then write the data out again in binary mode. The Tools/Scripts directory contained two scripts (crlf.py and lfcr.py) with illustrative examples. Here, for instance, is the key code from crlf.py (Windows to Unix conversion):

        data = open(filename, "rb").read()
        newdata = data.replace("\r\n", "\n")
        if newdata != data:
            f = open(filename, "wb")
            f.write(newdata)
            f.close()

But if you try to do that with Python 3+, it won’t work: in Python 3, reading a file in binary mode gives you a bytes object, and passing string arguments like "\r\n" to the replace method of a bytes object raises a TypeError.

The key to what will work is the new “newline” argument for the built-in file open() function. It is documented here.

The key point from that documentation is this:

newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

  • On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

So now when I want to convert a file from Windows-style newlines to Linux-style newlines, I do this:

filename = "NameOfFileToBeConverted"
fileContents = open(filename,"r").read()
f = open(filename,"w", newline="\n")
f.write(fileContents)
f.close()
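
And to convert in the other direction (Unix-style to Windows-style newlines), the same trick works; just give open() a different newline value:

filename = "NameOfFileToBeConverted"
fileContents = open(filename, "r").read()   # universal newlines: every line ending is read as "\n"
f = open(filename, "w", newline="\r\n")     # every "\n" written out becomes "\r\n"
f.write(fileContents)
f.close()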

Why import star is a bad idea

When I was learning Python, I of course read the usual warnings. They told me: You can do

from something_or_other import *

but don’t do it. Importing star (asterisk, everything) is a Python Worst Practice.

Don’t “import star”!

But I was young and foolish, and my scripts were short. “How bad can it be?” I thought. So I did it anyway, and everything seemed to work out OK.

Then, like they always do, the quick-and-dirty scripts grew into programs, and then grew into a full-blown system. Before long I had a monster on my hands, and I needed a tool that would look through all of the scripts and programs in the system and do (at least) some basic error checking.

I’d heard good things about pyflakes, so I thought I’d give it a try.

It worked very nicely. It found the basic kinds of errors that I wanted it to find. And it was fast, so I could run it through a directory containing a lot of .py files and it would come out alive and grinning on the other side.

During the process, I learned that pyflakes is designed to be a bit on the quick and dirty side itself, with the quick making up for the dirty. As part of this design, it basically ignores star imports. Oh, it warns you about the star imports. What I mean is — it doesn’t try to figure out what is imported by the star import.

And that has interesting consequences.

Normally, if your file contains an undefined name — say TARGET_LANGAGE — pyflakes will report it as an error.

But if your file includes any star imports, and your script contains an undefined name like TARGET_LANGAGE, pyflakes won’t report the undefined name as an error.

My hypothesis is that pyflakes doesn’t report TARGET_LANGAGE as undefined because it can’t tell whether TARGET_LANGAGE is truly undefined, or was pulled in by some star import.

This is perfectly understandable. There is no way that pyflakes is going to go out, try to find the something_or_other module, and analyze it to see if it contains TARGET_LANGAGE. And if it doesn’t, but contains star imports, go out and look for all of the modules that something_or_other star imports, and then analyze them. And so on, and so on, and so on. No way!

So, since pyflakes can’t tell whether TARGET_LANGAGE is (a) an undefined name or (b) pulled in via some star import, it does not report TARGET_LANGAGE as an undefined name. Basically, pyflakes ignores it.
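
Here is the dilemma in miniature (module name invented for illustration):

from some_or_other import *   # which names did this just define? pyflakes won't chase them down

print(TARGET_LANGAGE)   # misspelled and undefined, but not reported:
                        # for all pyflakes knows, the star import defined it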

And that seems to me to be a perfectly reasonable way to do business, not just for pyflakes but for anything short of the Super Deluxe Hyperdrive model static code analyzer.

The takeaway lesson for me was that using star imports will cripple a static code analyzer. Or at least, cripple the feature that I most want a code analyzer for… to find and report undefined names.

So now I don’t use star imports anymore.

There are a variety of alternatives.  The one that I prefer is the “import x as y” feature. So I can write an import statement this way:

import some_module_with_a_big_long_hairy_name as bx

and rather than coding

x = some_module_with_a_big_long_hairy_name.vector

I can code

x = bx.vector

Works for me.

Learning Subversion: the mystery of .svn

If you are googling for “Subversion command line tutorial introduction for beginners”, read this first! This is for all Subversion newbies.

After using PVCS for many years, our office recently started moving to Subversion. Which means that recently I started trying to learn Subversion.

I was pressed for time. I was in a hurry. I was looking for something that would get me up and running quickly.

First, I got a copy of the free online Subversion documentation Version Control with Subversion.

Second, I got a copy of Mike Mason’s excellent Pragmatic Version Control Using Subversion (2nd ed.).

Third, I googled the Web looking for the kinds of things that you’d expect: Subversion tutorial introduction beginning beginners commands. And I found some good stuff.

But even after reading many of the online Subversion tutorials, I still could not grok Subversion. Different commands seemed to be doing the same thing, and the tutorials used a lot of terms that were never defined or explained: “versioned”, “unversioned”, “under version control”, and so on.

Gradually, I realized the problem. Many of the online tutorials and introductions try to explain how to use Subversion without explaining how Subversion works. They tell you what commands to issue, and when, but they don’t tell you why you are issuing the command at this particular time, or what the command is doing under the covers.

So I had to dig deeper.

What I found was that there was one particular piece of information missing from most of the tutorials and introductions that I found. If you don’t have that piece, nothing about Subversion makes much sense. With it, all of the other pieces of the puzzle fall into place.

So the purpose of this post is to tell you — the Subversion newbie — about that piece.


How Subversion Works

The basic unit of work for Subversion is a project.

A project is basically a directory.

Technically, a project is a subtree: a directory, including all of its files and subdirectories, and all of those subdirectories’ files and subdirectories, etc. But in order to keep things simple, I will talk as if a project is just a directory.

When you are working on a Subversion project, there are actually two directories that you are working with.

  • There is the repository, which is a directory (controlled by Subversion and running on a server somewhere) that contains the master copy of the project directory.
  • There is your own personal workingCopy, which is a directory (controlled by you) that exists on the file system of your own machine (that is, on the hard drive of your own PC).

But (and this is the piece that was missing) a workingCopy directory is not an ordinary directory.

The use of the expression “working copy” is one of the most confusing things about Subversion tutorials and even the Subversion documentation itself. When you encounter the expression “working copy” you assume that you are dealing with an ordinary filesystem directory that is being used to hold a copy of the files in your project. Not so!

In the context of Subversion, “working copy” is a very specific term of art — a Subversion-specific technical term. That is why in this post I avoid the expression “working copy” and instead use workingCopy.

So what is a Subversion workingCopy directory?

A workingCopy directory is a directory that has a hidden subdirectory called “.svn”.

The hidden .svn directory is what Subversion calls an “administrative directory”.

Note the leading period in “.svn”. On Unix systems, a directory whose name begins with a dot is a “hidden” (or “dotfile”) directory.

On your PC, the project’s top-level workingCopy directory has a hidden .svn subdirectory. And each of the subdirectories of the workingCopy directory (if it has any), and each of their subdirectories (if they have any), and so on, has its own hidden .svn subdirectory.

Having a hidden .svn subdirectory is what makes an ordinary file system directory into a Subversion workingCopy directory, a directory that Subversion can recognize and manage.

So, for a project named “ProjectX” the workingCopy directory will be named “ProjectX”. It might look like this:

	ProjectX [DIRECTORY]
		projectx.py
		projectx_constants.py
		.svn [DIRECTORY]

What is in a .svn subdirectory? What does a Subversion administrative directory contain?

The Subversion documentation says this about workingCopy directories:

A Subversion working copy is an ordinary directory tree on your local system, containing a collection of files. You can edit these files however you wish, and if they’re source code files, you can compile your program from them in the usual way. …

A working copy also contains some extra files, created and maintained by Subversion, to help it carry out these commands. In particular, each directory in your working copy contains a subdirectory named .svn, also known as the working copy’s administrative directory. The files in each administrative directory help Subversion recognize which files contain unpublished changes, and which files are out of date with respect to others’ work.

Here’s another clue: a passage from Pragmatic Version Control Using Subversion:

Subversion has a highly efficient network protocol and stores pristine copies of your working files locally, allowing a user to see what changes they’ve made without even contacting the server [where the central repository is stored].

So now we know what a Subversion administrative directory contains.

The .svn admin directory contains pristine (unchanged) copies of files that were downloaded from the repository. (It contains a few other things, too.)

Earlier, I said “When you are working on a Subversion project, there are actually TWO directories that you are working with… the repository and the working copy.” Now I want to change that. It would be more accurate to say that there are really THREE directories that you are working with:

  • the main ProjectX repository on the server
  • the ProjectX workingCopy directory on your PC, which contains editable (and possibly changed) copies of the files in the project …and also …
  • the hidden Subversion administrative directory, which contains (pristine, unchanged, and uneditable) copies of the files in the main ProjectX repository on the server.

That means that, on your PC, the ProjectX workingCopy directory looks like this.

	ProjectX [DIRECTORY]
		projectx.py
		projectx_constants.py
		.svn [DIRECTORY]
			projectx.py
			projectx_constants.py

Now things start to become clearer…

Subversion introductions and tutorials often say things that are rather cryptic to someone who is trying to learn Subversion. Even HELP questions and FAQs posted on the Web can be mystifying. Now let’s see how some of those things make sense in light of our knowledge of the .svn subdirectory.


Showing file changes

The reason that Subversion can allow “a user to see what changes they’ve made without even contacting the server” is that the Subversion diff works only on the workingCopy directory on your own PC.

When Subversion shows file changes (that is, shows diffs) it is actually showing diffs between

  • your edited files in the workingCopy directory, and
  • the pristine copies of those files that are being held in the .svn subdirectory of the workingCopy directory.

“unversioned” files vs. files “under version control”

Suppose that I make a change to one of my files: to ProjectX/projectx_constants.py.

When I make the changes, my editor automatically creates a backup file: ProjectX/projectx_constants.py.bak.

At this point, ProjectX/projectx_constants.py.bak is what is called an “unversioned” file. It exists in the ProjectX directory, but not in the ProjectX/.svn directory, so Subversion knows nothing about it. That makes sense: we don’t want projectx_constants.py.bak to be considered a project file anyway.

But suppose I want to add a new module to the project, called projectx_utils.py. If I simply create the file in the ProjectX folder, it will be an “unversioned” file in just the same way that projectx_constants.py.bak is an unversioned file: it will not exist in the ProjectX/.svn directory, so Subversion knows nothing about it.

So that is why Subversion has a “svn add” command. The command svn add projectx_utils.py will add the file to the project by copying ProjectX/projectx_utils.py to ProjectX/.svn/projectx_utils.py. At this point — after it has been added to the .svn subdirectory — the file is said to be “under version control”.

Note that — at this point — although projectx_utils.py has been “added” to the copy of the project in the workingCopy, the main repository still doesn’t know anything about it — projectx_utils.py hasn’t been added to the central repository on the server.

When I “commit” my changes, I send the files from my workingCopy to the main repository. Only after that happens does the new file truly become part of the project by becoming one of the files in the central repository.
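
In command-line terms, the two steps look like this (commit message invented for illustration):

svn add projectx_utils.py
svn commit -m "add projectx_utils.py to ProjectX"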


Help! I’ve lost my .svn directory and I can’t get up!

Because a Subversion workingCopy directory needs a .svn subdirectory in order to work properly, you can have problems with Subversion if you accidentally delete the .svn subdirectory.


What is a “clean copy”?

In various tutorials, and in the Subversion docs, you will run across the expression “clean copy”. A “clean copy” is a copy of only the source-code files, without the .svn directory.

An introduction to Subversion (which is also a nice introduction to the TortoiseSVN open-source Windows GUI client for Subversion) explains things nicely.

If you look closely in your working copy, you may see an .svn folder in each folder of your working copy. The folders are hidden folders, so depending on the Windows settings you may not see them, but they are there. Those folders contain the information that Subversion uses to link your working copy to the repository.

If ever you need to get a copy of what’s in the repository, but without all the .svn folders (say for example you’re ready to publish it or hand the files over to your client), you can do an “SVN Export” into a new folder to get a “clean” copy of what’s in your repository.

Having the concept of a “clean copy” makes it easier to understand the next question…


Checkout vs. Export

A Frequently Asked Question about Subversion is What’s the difference between a “checkout” and an “export” from the repository?

The CollabNet docs say this:

They are the same except that Export doesn’t include the .svn folders and Checkout does include them. Also note that an export cannot be updated.

When you do a Subversion checkout, every folder and subfolder contains an .svn folder. These .svn folders contain clean copies of all files checked out and .tmp directories that contain temporary files created during checkouts, commits, update and other operations.

An Export will be about half the size of a Checkout due to the absence of the .svn folders that duplicate all content.

Note that the reason an exported folder cannot be updated is that the update command updates the .svn directory of a workingCopy, but an export does not create an .svn directory.

Note also that you can export from either the main repository or from the workingCopy .svn directory. See Subversion docs for export.


The (import, checkout) usage pattern for getting started with Subversion

Most “getting started with Subversion” tutorials start the same way. Assuming that you have some project files that you want to put into Subversion, you are told to:

  • do an import
  • do a checkout

in that order.

What you are not told is why you start with those two particular actions in that particular order.

But by now, knowing about the hidden .svn administrative directory and what it does, you can probably figure that out.

Import is the opposite of export. It takes a directory of files — a clean copy of the files, if you will — from your hard drive and copies them into the central Subversion repository on the server.

Always the next step is to do a checkout. Basically a checkout copies the project files from the central repository to a workingCopy directory on your PC. If the workingCopy directory does not exist on your PC, it is created.
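
In command-line terms, the (import, checkout) pattern looks like this (repository URL invented for illustration):

svn import ProjectX http://svn.example.com/repos/ProjectX -m "initial import of ProjectX"
svn checkout http://svn.example.com/repos/ProjectX ProjectX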

The workingCopy directory contains everything you need in order to be able to work with Subversion, including an .svn administrative directory. As the CollabNet documentation (quoted earlier) says:

When you do a Subversion checkout, every folder and subfolder contains an .svn folder. These .svn folders contain clean copies of all files checked out and .tmp directories that contain temporary files created during checkouts, commits, update and other operations.

So the second step — the checkout command — is absolutely necessary in order to get started. It creates a workingCopy directory containing the project files. Only after that happens are your files properly “under version control”.


checkin vs. commit

PVCS (and SourceSafe, and many other version control systems) work on a locking model. “Checking out” a file from the repository means that you get a local working copy of the file, and you lock the file in the repository. At that point, nobody can unlock it except you. Checking out a file gives you exclusive update privileges on it until you check it back in.

“Checking in” a file means that you copy your local working copy of the file back into the repository and you unlock the file in the repository.

It is possible to copy your local working copy of the file into the repository without unlocking the file in the repository. When you do this, you are in a sense “updating” the repository from the working copy.

Because of my familiarity with this kind of version control, I had a certain “mental model” of how a version control system works. And because of that mental model, many of the Subversion tutorials were quite confusing.

One source of confusion is the fact that (as we will see in the next section) the word “updating” in the context of Subversion means exactly the opposite of what it means in the context of PVCS.

One of the Subversion tutorials that I found said that you must checkout your workingCopy from the main repository, because you can’t do a checkin back to the main repository if you hadn’t checked it out. This was very confusing to an ex-PVCS user.

First, it suggested that Subversion works like PVCS: that there is a typical round-trip usage pattern consisting of

  • checking out (locking)
  • editing
  • checking in (unlocking)

But Subversion doesn’t work like this, at least not by default.

What the tutorial was trying to say, I think, was that in order to work with Subversion, you must create a workingCopy directory (that is, a directory that contains an .svn administrative subdirectory). And the way to create a workingCopy directory is to run a svn checkout command against the repository on the server.

Second, explaining things this way was confusing because Subversion doesn’t really have a checkin command. It does have a commit command, which some tutorials call a “checkin” command. But that command does not do the same thing as a PVCS checkin.

Ignore the fact that the short form of the commit command is ci (which stood for “checkin” in an earlier incarnation of Subversion). A Subversion “checkin” is the same thing as a “commit”, and has nothing to do with locking. It would really be helpful if all Subversion tutorials would stop using the term “checkin” and replace it with “commit”.

If you are used to working with a VCS that uses the “check out, edit, check in” paradigm, and you come to understand that Subversion’s commit is not the same as your old familiar check in, then your next question will almost certainly be:

Once you checkout a project into a working folder, how do you check it in a la SourceSafe? [Or PVCS, or other lock-based VCSs? -- Steve Ferg]

I know there is “commit” which puts my changes into the repository, but I still have the files checked out under my working folder. What if I am done with a particular file and I don’t want to have it checked out? How do I check it back in?

You can read the answer here.


What does svn update do?

EXECUTIVE SUMMARY: svn update updates the workingCopy, not the repository.

The Subversion docs describe the update command this way:

When working on a project with a team, you’ll want to update your working copy to receive any changes other developers on the project have made since your last update. Use svn update to bring your working copy into sync with the latest revision in the repository.

Basically, what the update command does is to copy the project files from the central repository down to the .svn directory in your workingCopy.

This is something you should do frequently, because you don’t want the files in your workingCopy/.svn directory to get too far out of sync with the files in the central repository. And you don’t want to try to commit files if your workingCopy/.svn is out of sync with the central repository.

That means that as a general rule, you should always run an svn update:

  • just before you start making a new round of changes to your workingCopy, and
  • just before doing a commit.

Now, having mastered the concept of an .svn directory, we can Understand Many Things, even arcana such as why Serving websites from svn checkout considered harmful.

So that’s it.

This post contains information written by a Subversion newbie in the hopes that it will be useful to other Subversion newbies. But of course, having been written by a newb, there are all sorts of ways it could be wrong.

If you’re a Subversion expert (and it doesn’t take much to be more expert than I am) and you see something wrong, confused, or misleading here, please leave a comment. I, and future generations of Subversion newbies, will thank you for it.

Thanks to my co-workers Mark Thomas and Jason Herman for reviewing an earlier draft of this post.

How to fix a programmable Northgate keyboard

After my earlier post about Northgate keyboard repair it occurred to me that this information might be useful. I don’t think it can be found anywhere else on the Web.

Note that in the following slideshow (showing the repair of an Evolution keyboard) you can mouse-over the image. Controls will pop up that allow you to pause the show and to step forward and backward.

[slideshow: repairing an Evolution keyboard]

When programmable keyboards go bad

A while ago, one of my Northgate keyboards seemed spontaneously to sustain some kind of brain injury. A number of the keys seemed to have gone haywire. The left shift key didn’t work and several pairs of keys seemed to have exchanged places.

I talked with Bob Tibbetts of Northgate Keyboard repair (http://www.northgate-keyboard-repair.com/) and he explained the situation. Here is what I learned.

The Northgates are programmable keyboards — they contain a programmable chip. They were designed so that certain key combinations (e.g. pressing the left shift key four times) puts the keyboard (that is, the programmable chip) into programming mode.

Unfortunately the programmable chip had software that worked only with Windows 98 and earlier. If you are using a Northgate keyboard with any other system, the programmable chip is basically a bad chip and should be removed. (Bob noted that he removes the chip from any keyboards that he sells.)

Fixing the problem is a two-step process. First you “reboot” the keyboard into non-programming mode, then you remove the chip.

You can just reboot the keyboard without removing the chip, of course, and that will fix the immediate problem. But as long as the programmable chip is still in the keyboard, similar problems can occur again at any time.

How to “reboot” the programmable keyboard

Shut the computer down. Don’t just log off or do a “soft” reboot. Power off.

Press the ESCAPE (ESC) key down and hold it down while you power up your PC. Do not release the ESC key until the computer beeps at you, or you have to do something like entering a password.

This should make the keyboard work normally. (If it doesn’t, then the problem was something other than the programmable chip.)

The anatomy of an Evolution keyboard

Working with Evolution keyboards is tricky because the Evolutions have the little GlidePoint touchpad in the middle of the top of the keyboard. There are short cables that go from the GlidePoint touchpad in the upper part of the keyboard to the “motherboard” in the bottom part of the keyboard.

Basically, the GlidePoint cables act as a sort of tether between the upper and lower halves of the keyboard. The cables are short, and virtually impossible to re-attach if you pull them loose. So you have to be careful not to pull them loose.

How to remove the programmable chip from an Evolution keyboard

First, make sure you have read “The anatomy of an Evolution keyboard” (above). Then …

“Reboot” the keyboard (see the instructions given above), then shut down (power off) your PC.

Turn the keyboard over, so that you are looking at the bottom of the keyboard.

Take the six screws (the ones holding the upper and lower parts of the keyboard together) out of the keyboard.

Turn the keyboard over, so that it is face up and you are looking at the keys.

DO NOT lift the top off of the keyboard.

Well, you can lift it a little. 

In the slideshow, you can see the top of the keyboard sitting on a little green box that lifts it about 2.75 inches (7 cm).  You can see the GlidePoint cables running from the touchpad in the top of the keyboard to the motherboard in the bottom of the keyboard. Those are the cables that you don’t want to disturb.

Lift the top half of the keyboard just enough to free it from the bottom half, then rotate the top clockwise about 4 or 5 inches, just enough to expose the programmable chip. Rotate the top using the location of the touchpad as the pivot point — that way you will disturb the touchpad cables as little as possible.

On the top right-hand side, locate the programmable chip. It is a small chip about 1/4″ x 3/8″ with 24C16 embossed on it.

Take a small screwdriver and pry the chip out. When you do this, you may break a few of the prongs that hold the chip to the motherboard. That’s OK. Bob Tibbetts suggested using a jeweler’s screwdriver. I used a small (but long) electrician’s screwdriver. I also found that once I had the chip lifted up, but not completely free of the motherboard, a needle-nose pliers was perfect for the final removal.

Around the edges of the chip socket, carefully cut off any remaining prongs. The goal is to leave no prongs sticking up that might touch each other or anything else. I think a “side cutter” pliers would be too big for this job. Something like a toenail clipper might be about right. I had only one prong left stuck in the motherboard, and I gently twisted it off with the needle-nose pliers.

Carefully lower the top of the keyboard back down onto the lower part.

Carefully turn the keyboard over, making sure to keep the two halves of the keyboard together.

Put the screws back in.

You’re done!

How to remove the programmable chip from a non-Evolution programmable keyboard

For other programmable Northgate keyboard models (models ending in a P for “programmable”) — 101P, 102P, Ultra TP and Ultra P — you can use basically the same procedure as described above for the Evolution.

The difference is that non-Evolution keyboards don’t have the GlidePoint touchpad embedded in the top of the keyboard. That means that you don’t need to worry about the GlidePoint cables, so you can lift the keyboard top completely off in order to access the programmable chip.

Northgate keyboard repair

The best computer keyboards ever made (even when compared to the original IBM model M keyboards) were the Northgate Omnikey keyboards.  They were heavy keyboards built like tanks, featuring buckling spring key-switches notable for their distinctive clicking as you typed.  These were real keyboards — no crappy “rubber dome” key switches allowed.

[photo: Omnikey Ultra keyboard]

I used only Northgate Omnikey Ultras for years, lugging them from job to job like an itinerant medieval carpenter carrying his tools with him from town to town, and using special keyboard plug adapters when keyboard plug design evolved first to PS/2 and then to USB.

But tools get worn and dirty and a few years ago my Ultras were terminally filthy and starting to fail.  That was when, thanks to the twin miracles of the Web and Google, I found Bob Tibbetts and his Northgate Keyboard Repair web site.  Bob belongs to the school of minimalist website design, but his keyboard expertise and repair skills are totally maximal, and he really saved my bacon (and my keyboards).   He also, in a manner of speaking, saved my wrists.

After 25 years of coding, the joints in my hands and wrists were starting to protest.  I switched from using a mouse to using a trackball (I prefer a Logitech Cordless Optical Trackman), and that helped a lot.   Carpal tunnel syndrome forced a friend of mine to retire on disability and put The Fear into me.  A bout of online research convinced me that I really needed a more ergonomic keyboard, so I went shopping for one.

The major feature of an ergonomic keyboard is a split design in which the left and right halves of the keyboard  are split apart, separated by a few inches, and angled slightly so that you can type without bending your wrists.  The result is a keyboard that is shaped like a V rather than like a straight unbroken line. In a sense, the keyboard is bent so your wrists don’t have to be.

[photo: Northgate Evolution keyboard]

What I really wanted, of course, was an ergonomic version of the Omnikey Ultra. 

One day, in an email to Bob, I mentioned that although I loved my Ultras (one of which Bob was cleaning and repairing at the time), what I really wished for was an ergonomic V-shaped version of the Ultra. 

Well, I nearly fell off my chair when Bob told me that such a thing actually existed.  It was called the Omnikey Evolution keyboard.  Evolutions were very advanced for their time, and very few were made.  But a few — new in the box — still existed, and he had a few for sale.

I immediately ordered one, tried it out, and loved it.  It is my favorite keyboard ever.  So I followed my Mom’s tongue-in-cheek advice (“Get ’em before the hoarders do.”) and got more.  I now own 5 – one for work, one for my home Vista machine, one for my home Linux machine, and two backups.

As I type this, it is almost midnight on March 11, 2011, and Bob has only 3 Evolution keyboards left. 

The good news is that if you have a beloved old Northgate that is showing its age, Northgate Keyboard Repair is still in the business of cleaning and repairing Northgate keyboards.

Finally, if you’re looking to purchase a keyboard with buckling spring key switches, you might check out the Customizer line of keyboards at pckeyboards.com.  It is a reincarnation of the original IBM model M.

And keep on clicking…

## updated January 1, 2012

What every beginning programmer should learn

Are you the kind of person who enjoys thinking about this kind of stuff?

Then seek therapy immediately!

Or, just for fun, consider this…

A young programmer, fresh out of computer programming school and still wet behind the ears, asks you — tough old seasoned professional software developer that you are — for suggestions for things that he/she should learn in order to become as tough, as old, as seasoned, and as professional as you. What do you tell him/her?

Here is a jumbled grab-bag of concepts, jargon, ideas, resources (books), etc.  Basically, a list of the tools in the toolbox of a working business software developer.  Or at least, the first cut at such a list.

If this was YOUR list for a beginning programmer, what things would you add to it, or change?


relational database concepts

data modeling (entity-relationship modeling)

  • Chen-style (relationships shown as nodes)
  • Information Engineering (IE) style (relationships shown as lines)

data-driven program design

  • “Principles of Program Design” by M. A. Jackson
  • “Jackson Structured Programming: A Practical Method of Programme Design” (Paperback) by Leif Ingevaldsson

software requirements and specifications

  • “Structured Analysis and System Specification” by Tom Demarco & P. J. Plauger (Paperback 1979) Data flow diagrams can still be useful tools in certain contexts, even though they are obsolete for software systems analysis.
  • “Software Requirements and Specifications:” by Michael Jackson     STRONGLY RECOMMENDED
  • “Problem Frames” by Michael Jackson

object-oriented programming

  • class, object (or instance), method, static method, instance variable
  • “Beginning Java Objects” by Jacquie Barker

domain-driven design

  • “Domain-Driven Design” by Eric Evans

event-driven programming

data structures and algorithms

  • variables (in non-object oriented languages), objects (in object-oriented languages), pointers (references)
  • stack, queue, LIFO, FIFO, linked list, tree, node (and how to program them)
  • recursion (for traversing trees)
  • “Thinking Recursively” or “Thinking Recursively in Java” by Eric Roberts

process modeling, entity life-history modeling

  • finite-state machine (FSM) and state-transition diagrams
  • “Software Engineering Fundamentals: Jackson Approach” (Paperback) Leif Ingevaldsson

popular software design methods/tools/ideas

  • UML and its various diagram types
  • design patterns –  “Head First Design Patterns” might be a painless introductory book

software development management

  • iterative/evolutionary development (as opposed to “big bang” development)
  • “waterfall” methods
  • “agile” methods

programming languages

  • Dynamically-typed language: Python (or Ruby)
  • Statically-typed language: Java (or C#)
  • A good IDE (Eclipse: don’t try to program Java without it!)

added January 5, 2010

propositional logic (to help with those tricky if-then-else tests)

  • and, or, negation, material implication
  • truth tables
  • modus ponens, modus tollens, De Morgan’s laws
  • proof, reductio ad absurdum
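
Just for fun, here is a brute-force Python check of De Morgan's laws over all possible truth values:

# De Morgan's laws: not(a and b) == (not a) or (not b)
#                   not(a or b)  == (not a) and (not b)
for a in (False, True):
    for b in (False, True):
        assert (not (a and b)) == ((not a) or (not b))
        assert (not (a or b)) == ((not a) and (not b))
print("De Morgan's laws hold for all truth values")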

sets and set theory

  • set, set membership
  • union, difference, intersection (and Venn diagrams, of course)

Python runs JSD

Of possible interest to those interested in systems analysis methodologies.

http://master.dl.sourceforge.net/project/pyjsd/MissGrantsControllerInJSD.pdf

A synopsis…

Miss Grant’s Controller: A JSD specification

For the purpose of writing computer software specifications, it is useful to view a computer software system as a software “machine” that transitions from state to state under the control of an input stream of events.

Traditionally, computer system specifications focus on the states and the transitions between the states: they view the system as a state machine and use state transition diagrams (STDs) to specify the behavior of the machine.

In contrast, Jackson System Development (JSD) specifications focus on the events and the sequence in which the events may occur. JSD views a software machine as a simulation in which model processes (coroutines running inside the system) simulate real-world processes. The model processes running in the machine are synchronized with their real-world counterparts by means of events – events sent from the real-world processes into the machine. JSD uses action structure diagrams to represent model processes.

I (and others, of course) believe that JSD-style specifications are a more useful tool than STDs and state machines for specifying many systems.

In this paper, I will present a small argument-by-example for JSD event-oriented specifications.

My example problem will be “Miss Grant’s controller”, which is based on an example problem from the introduction to Martin Fowler’s new (2010) book Domain-Specific Languages.
….

Since the 1980s, JSD experts have had a vision of executable JSD specifications. They were frustrated by the fact that a JSD model process is a coroutine, but COBOL – the programming language in use in the business community where JSD was most popular – did not support coroutines.

That situation has changed in the last few years, with the increasing acceptance of Python. Python, it turns out, is the ideal language for creating executable JSD specifications.
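
The full translation is in the linked PDF. But as a minimal sketch of the mechanism (not the actual Miss Grant controller), a JSD model process can be written as a Python generator-based coroutine that a driver feeds with events; the event names here are purely illustrative.

def model_process():
    # a JSD-style model process: a coroutine that consumes a stream of events
    while True:
        event = yield                    # suspend until the driver sends an event
        print("model process received:", event)

# a tiny driver that feeds the model process a sequence of events
process = model_process()
next(process)                            # prime the coroutine
for event in ["doorClosed", "lightOn", "drawerOpened"]:
    process.send(event)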

This is what Miss Grant’s controller looks like when the action structure diagram is translated into Python.

To run the model, we need to create some test data – a stream of sensor events – that we can feed to the controller. So here is the code for a Python driver program that creates a sequence of event objects and feeds them to the Python specification for Miss Grant’s controller.

Here is the output produced by a test run.

PyCharm gotchas

I’ve been playing with PyCharm, the new Python IDE.  I like it a lot.  But I’ve discovered a few major gotchas.

I should note that I come from a Windows background, and I’m used to standard Windows keyboard shortcuts and editor behavior.  So what I’m going to describe may be standard behavior for editors on *nix, and nothing surprising to say, Ubuntu programmers.  But I’m working on Windows, and on Windows PyCharm’s out-of-the-box configuration of keyboard shortcuts is definitely a big gotcha.

First of all, on Windows Ctrl+Z is the standard shortcut key for UNDO, and Ctrl+Y is the standard shortcut key for REDO (or in some contexts, DO AGAIN).  But in the default PyCharm configuration, Ctrl+Y is not REDO but DELETE LINE.  Imagine trying to REDO a series of commands by repeatedly punching Ctrl+Y…  and watching your source code evaporate!  That was me, brother.

Second, on most Windows editors (and things like Netbeans and Eclipse, too) if you try to close a file that has unsaved changes, the editor will warn you and ask you what you want to do.  Not so PyCharm.  If you click on a tab to close a file (a file with unsaved changes), PyCharm saves the file and closes it.  No warnings. 

As I write this, I realize that I don’t know how to tell PyCharm to close a file without also writing it to disk.

Third, PyCharm offers several alternate keyboard shortcut configurations… but they all seem to be *nix inspired.  There is no pre-built alternate shortcut configuration for Windows-like behavior. 

On the upside, though, PyCharm provides a lot of support for  keyboard shortcut customization, and you can customize the keyboard shortcuts to the behavior that you want.  The PyCharm online documentation is good, and detailed.  Very well done.

Sometime I may blog about what I like about PyCharm.  As I say, there is a lot to like, and I don’t want to slam it.  But consider this a heads-up.

An alternative to string interpolation

I sort of like this.

# ugly
msg = "I found %s files in %s directories" % (filecount,foldercount)

# better
def Str(*args): return "".join(str(x) for x in args)
:
:
msg = Str("I found ", filecount, " files in ", foldercount, " directories" )

You don’t have to call it “Str”, of course.

A Globals Module pattern

Two comments on my recent posts on a Globals Class pattern for Python and an Arguments Container pattern reminded me that there is one more container for globals that is worth noting: the module.

The idea is a simple one. You can use a module as a container.

Most introductions to Python tell you all about how to get stuff — that is, how to import stuff — *from* imported modules. They talk very little about writing stuff *to* imported modules. But it can be done.

Here is a simple example.

Let’s start with the intended container module, mem.py. I’d show you the contents of mem.py, except for the fact that there aren’t any. mem.py is empty.

Next let’s look at two modules that import and use mem.py.

The caller module is leader.py. Note that it imports mem and also imports the subordinate module, minion.  (Note the use of the print() function; we’re running Python 3 here.)

"leader.py"
import mem
import minion

mem.x = "foo"
print("leader says:",mem.x)
minion.main()
print("leader says:",mem.x)

print()

mem.x = "bar"
print("leader says:",mem.x)
minion.main()
print("leader says:",mem.x)

The subordinate module is minion.py.

"minion.py"
import mem

def main():
	print("minion says:",mem.x)
	mem.x = "value reset by minion from " + mem.x

If you run leader.py it imports minion and mem, and uses mem as a container for variable x.  It assigns a value to x in mem and calls minion, which reads mem.x and resets mem.x’s value, which leader then reads.

When you run leader.py, you see this output:

leader says: foo
minion says: foo
leader says: value reset by minion from foo

leader says: bar
minion says: bar
leader says: value reset by minion from bar

Note that leader.py passes no arguments to minion.main() and minion.main() doesn’t return anything (other than None, of course). Leader and minion communicate solely by means of the variables set in mem. And the communication is clearly two-way. Leader sets values that minion reads, and minion sets values that leader reads.

So what we have here, in mem, is a truly global container. It is not “module global” as in the Globals Class pattern. It is “application global” — it is global across the multiple modules that make up an application.  In order to gain access to this container, modules simply import it.

In keeping with the earlier posts’ grandiosity, I will call this use of an imported module the Globals Module pattern.

Every Python programmer is familiar with one special case of the Globals Module pattern. Just rename mem.py to config.py, stuff it with a bunch of constants or configuration variables, and you have a typical Python file for defining constants or setting configuration values. These values are “application global”, available to all modules in an application. All a module has to do is import config.py.
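
For instance, here is a sketch of such a config module (the names are illustrative):

"config.py"
DEBUG = False        # typical configuration values
MAX_RETRIES = 3

"anymodule.py"
import config

if config.DEBUG:
    print("will retry at most", config.MAX_RETRIES, "times")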

Doing a bit of arm-waving, and christening a Globals Module pattern, does one thing.  It reminds us that modules – used as containers for “application global” values – aren’t limited to supplying constants and pre-set values. Modules can also be written to.  The communication between “normal” modules and Globals Modules is a two-way street.

An Arguments Container pattern

In a comment on my earlier post A Globals Class pattern for Python, Mike Müller wrote
“No need for globals. Just explicitly pass your container. In my opinion this is much easier to understand.”

Mike’s comment led me to some further thoughts on the subject.

Suppose you have a number of things — x, y, and z — that you want to make available to many functions in a module.

There are four strategies that you could use. You could

1. pass x, y, and z as individual arguments
2. make x, y, and z globals

or you could create a container C of some sort and

3. pass container C as an argument
4. make container C a global

So you have two basic questions to answer. When you make the things — x, y, and z — available:

A. Do you make them available in global variables, or in arguments that you pass around?

B. Do you make them available individually, or do you put them in some kind of container and make the container available?

My original post assumed that in at least some situations you might answer question A with “use global variables” and then went on to propose that in those situations the best answer to B is “put them in a container”.

Since the point of that post was to point out the usefulness of a class as a container, I called the proposed pattern the Globals Class pattern. But in most cases some other kind of container would do as well as a class. I could almost as easily have called the pattern the Globals Container pattern.

So if you look at these two questions — A and B — I think it is interesting where Mike and I differ, and where we agree.

Question A: args or globals

Where we differ, if you could call it that, is in the answer to A.

Mike wrote “No need for globals. Just explicitly pass your container. In my opinion this is much easier to understand.”

In my post I wrote “Sometimes globals are the best practical solution to a particular programming problem.” But that wasn’t really what the post was about. It was about the answer to question B.

So I can’t really say that Mike and I disagree very much. He says “I like apples”. I say “Sometimes I like an orange.”  No big deal.

Question B — multiple things or a single container

What is much more interesting is that we both agree on the answer to question B: use a container object.

But since I was talking about globals, I was talking about a container for globals.  Since Mike was talking about arguments, he was talking about a container for arguments.

Which means that we have two different patterns. My earlier post was about strategy 4 – a Globals Container pattern. Mike is talking about strategy 3 — what we might call an Arguments Container pattern.

As it happens, I had stumbled onto the Arguments Container pattern myself, not in Python but in Java. The circumstances were very similar to the circumstances that led to the Python Globals Class pattern. I had a lot of variables that I needed to pass around. As the code evolved, the argument lists got longer and harder to manage. Finally I just bundled all of the variables into a single container object and passed the container around. As I needed to add new arguments, I was able to add them to just one place — the container.

At the time, I felt sort of stupid doing this. I hadn’t ever heard of this as a programming technique.  It smacked of sneaking global variables in through the back door, and of course everybody knows that globals are always bad. But it worked, and it made my life a lot easier.

So now Mike comes along and proposes doing exactly the same thing. I feel relieved. I’m not the only one doing this. It may even be a Good Thing.

So I’m happy to announce — not the discovery, certainly — the christening of the Arguments Container pattern, which says, basically:

Sometimes when you have a lot of individual variables that you need to pass around to a lot of different functions or methods, the best solution is to put them into a container object and just pass the container object around.
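
Here is a minimal sketch of the pattern (all names are illustrative):

class Args: pass

def process(args):
    if args.verbose:
        print("processing", args.infile, "->", args.outfile)

args = Args()
args.infile = "in.txt"
args.outfile = "out.txt"
args.verbose = True
process(args)    # one container argument instead of three separate arguments

When the code evolves and a new value is needed, you add one attribute to the container instead of changing every function signature along the call chain.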

This is not a specifically Python pattern. And in a way it is No Big Deal. But I’m doing a bit of shouting and arm-waving here because I think that somewhere there is probably at least one person for whom this post might be useful.

A Globals Class pattern for Python

I’ve gradually been evolving a technique of coding in which I put module globals in a class. Recently I stumbled across Norman Matloff’s Python tutorial in which he recommends doing exactly the same thing, and it dawned on me that this technique constitutes a truly idiomatic Python design pattern.

A pattern needs a short, catchy name. I don’t think anyone has yet given this pattern a name, so I will propose the name Globals Class pattern.

I’m sure that many experienced Python programmers are already quietly using the Globals Class pattern. They may not see much point in making a big deal about it, or in giving it a name and decking it out with the fancy title of “design pattern”. But I think a little bit of hoopla is in order. This is a useful technique, and one worth pointing out for the benefit of those who have not yet discovered it.  A bit of cheering and arm-waving is in order, simply to catch some attention.

The technique is extremely simple.

  • You define a class at the beginning of your module.  This makes the class global.
  • Then, all of the names that you would otherwise declare global, you specify as attributes of the class.

Really, there is virtually nothing class-like about this class; for instance, you probably will never instantiate it. Instead of functioning like a true class, it functions as a simple container object.

I like to use the name “mem” (in my mind, short for “GlobalMemory”) for this class, but of course you can use any name you prefer.

All you really need is a single line of code.

        class mem: pass

That is enough to create your mem container. Then you can use it wherever you like.

        def doSomething():
            mem.counter = 0
            ...
        def doMore():
            mem.counter += 1
            ...
        def doSomethingElse():
            if mem.counter > 0:
                ...

If you wish, you can initialize the global variables when you create the class. In our example, we could move the initialization of mem.counter out of the doSomething() function and put it in the definition of the mem class.

        class mem:
            counter = 0

In a more elaborate version of this technique, you can define a Mem class, complete with methods, and make mem an instance of the class. Sometimes this can be handy.

        class Mem:
            def __init__(self):
                self.stupidErrorsCount = 0
                self.sillyErrorsCount  = 0

            def getTotalErrorsCount(self):
                return self.stupidErrorsCount + self.sillyErrorsCount

        # instantiate the Mem class to create a global mem object
        mem = Mem()

What’s the point?

So, what does the Globals Class pattern buy you?

1. First of all, you don’t have to go putting “global” statements all over your code.  With a globals class, you don’t need any “global” statements at all.

There was a time — in the past, when I still used “global” — when I might find myself in a situation where my code was evolving and I needed to create more and more global variables. In a really bad case I might have a dozen functions, each of which declared a dozen global variables. The code was as ugly as sin and a maintenance nightmare.  But the nightmare stopped when I started putting all of my formerly global variables into a global class like mem.  I simply stopped using “global” and got rid of all those “global” statements that were cluttering up my code. 

So the moral of my story is this.  Kids, don’t be like me.  I started out using “global” and had to change.  I’m a recovering “global” user. 

Don’t you even start.  Skip the section on the “global” keyword in your copy of Beginners Guide to Learning Python for Dummies.  Don’t use “global” at all.  Just use a globals class.

2. I like the fact that you can easily tell when a variable is global simply by noticing the mem. prefix.

3. The “global” statement becomes unnecessary, and the Globals Class pattern relieves us of the burden of worrying about the quirk that makes it necessary in the first place.

Python has the quirk that if X is a global, and a function only reads X, then within the function, X is global. But if the function assigns a value to X, X is treated as local.

So suppose that — as your code evolves — you add an assignment statement deep in the bowels of the function. The statement assigns a value to X. Then you have — as a side-effect of the addition of that statement — converted X (within the scope of the function) from a global to a local.

You might or might not want to have done that.  You might not even realize what you’ve done.   If you do realize what you’ve done, you probably need to add another statement to the function, specifying that X is global.  That is sort of a language wart. If you use the Globals Class pattern, you avoid that wart.
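
For the record, here is a minimal demonstration of the quirk:

        counter = 0

        def read_it():
            print(counter)     # reading is fine: counter is the global

        def bump_it():
            counter += 1       # raises UnboundLocalError when called:
                               # the assignment makes counter local

        def bump_it_fixed():
            global counter     # the extra statement the wart forces on you
            counter += 1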

4. I think the use of the Globals Class pattern makes the work of static code analyzers (e.g. PyFlakes) easier.

5. The Globals Class pattern makes it possible to create multiple, distinct groups of globals.

This can be useful sometimes. I have had modules that processed nested kinds of things: A, B, and C. It was helpful to have different groups of globals for the different kinds of things.

        class memA: pass
        class memB: pass
        class memC: pass

6. Finally, the Globals Class pattern makes it possible to pass your globals as arguments.

I have had the situation where a module grew to the point where it needed to be split into two modules. But the modules still needed to share a common global memory. With the Globals Class pattern, a module’s globals are actually attributes of an object, a globals class.  In Python, classes are first-class objects.  That means that a globals class can be passed — as a parameter — from a function in one module to a function in another module.
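
A minimal sketch of what that can look like (the module names are illustrative):

        # shared.py
        class mem: pass

        # worker.py
        def report(g):
            print("counter is", g.counter)

        # main.py
        import shared
        import worker

        shared.mem.counter = 42
        worker.report(shared.mem)    # the globals class itself is the argument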

Is this really A Good Thing?

At this point I can hear a few stomachs churning. Mine is one of them. Because, as we all know, Global Variables are Always a Bad Thing.

But that proposition is debatable.  In any event, it is an issue that I’m not going to explore here.  For now, I prefer to take a practical, pragmatic position:

  • Sometimes globals are the best practical solution to a particular programming problem.
  • For the occasions when Globals are A Good Thing, it is handy to have a way to Do Globals in A Good Way.

So the bottom line for me is that there are occasions when some kind of globals-like technique is the best tool for the job.  And on those occasions the Globals Class pattern is a better tool for the job than globals themselves.

How to open a web browser from Python

This goes under the Tips and Tricks category. 

Also under Stuff I wish I had known about a long time ago.

The trick is in the standard library, in the webbrowser module.

"""
For documentation of the webbrowser module,
see http://docs.python.org/library/webbrowser.html
"""
import webbrowser
new = 2 # open in a new tab, if possible

# open a public URL, in this case, the webbrowser docs
url = "http://docs.python.org/library/webbrowser.html"
webbrowser.open(url, new=new)

# open an HTML file on my own (Windows) computer
url = "file://X:/MiscDev/language_links.html"
webbrowser.open(url, new=new)

Command-line syntax: some basic concepts

I’ve been reading about parsers for command-line arguments lately, for example Plac. And, as Michele Simionato says:

There is no want of command line arguments parsers in the Python world. The standard library alone contains three different modules: getopt (from the stone age), optparse (from Python 2.3) and argparse (from Python 2.7).

My reading has made me realize that there is an immense range of possible syntaxes for command-line arguments, and far less consensus and standardization than I thought. Although there are some general styles that programmers often use when implementing the command-line arguments for their applications, basically every programmer is free to do whatever he (or she) wants. The result is that whenever you encounter an application for the first time, you can’t safely assume anything about the syntax of its command-line arguments.

It also has made me wonder if anyone had ever written an overview of, or introduction to, the basic concepts involved in command line arguments. I searched the Web without finding one, so I thought it would be interesting to try to write one.  I can live with the risk that I’m re-inventing the wheel.

Of course, there may be something out there and I just missed it. So if you know of some other discussion of this topic, please leave a comment and tell me about it. And if there is something that I missed here, I’d appreciate a comment about that too.

What is a command line argument?

When you invoke an application from a command line, it is often useful to be able to send one or more pieces of information from the command line to the application. As a simple example, we might want to start a text editor and also tell it the name of a file that it should open, like this

          superedit a_filename.txt

In this example, “superedit” is the name of the application, and “a_filename.txt” is a command line argument: in this case, the name of a file.

It is possible to supply more than one command line argument

We often want to send an application multiple arguments, like this:

          rename file_a.txt  file_b.txt

Positional arguments, named arguments, and flags

There are three types of command line argument: positional arguments, named arguments, and flags.

  • A positional argument is a bare value, and its position in a list of arguments identifies it.
  • A named argument is a (key, value) pair, where the key identifies the value.
  • A flag is a stand-alone key, whose presence or absence provides information to the application.

If we supplied the “rename” application with two positional arguments, like this

          rename file_a.txt  file_b.txt

then the position of the arguments identifies the value.

  • The value in position 1 (“file_a.txt”) is the current name of the file.
  • The value in position 2 (“file_b.txt”) is the requested new name of the file.

We could have written the “rename” application so that it requires two named arguments, like this

          rename  -oldname file_a.txt  -newname file_b.txt

A flag is an argument whose presence alone is enough to convey information to the application. A good example is the frequently-used “-v” or "--verbose" argument.

Although it is possible to think of flags as degenerate named arguments (named arguments that have a key but no value), I find it easier to think of flags as a distinct type of argument, different from named arguments.

Keyword arguments and options

I will use the term keyword argument to cover both named arguments and flags.

David Goodger notes (in the first comment on the first version of this post) that I am not using the traditional Unix command-line lexicon.  What I have called keyword arguments are — on Unix platforms — traditionally called options;  what I have called values are traditionally called option arguments; and what I have called positional arguments, the Open Group calls operands.  So I should probably say something about my choice of technical terminology.

For the purposes of this analysis, I prefer not to use the traditional Unix vocabulary of options, for a number of reasons.  First of all, the term option tends to be Unix-specific; on Windows the term parameter is more frequently used.  Second, the investigation began with command-line parsers, and in the context of a discussion of parsers and parsing, keyword argument seems a more traditional and appropriate term than option.  Third, the usual definition of option is not very useful.

Arguments are options if they begin with a hyphen.

And finally, the term option implies optionality.  Whether an argument is optional or required is a semantic issue rather than a syntactical issue.  At this point I’m interested in syntactical issues, so I want to use a semantically neutral vocabulary.  We can talk about options and optionality later, when we look at semantic concepts.

Keyword arguments require a sigil

When keyword arguments are used, there must be some mechanism for distinguishing a key from a value or from a positional argument. That mechanism is a “sigil”: a special character or string of characters that indicates the beginning of a key. In our example, the sigil was a dash (a hyphen).

On Windows, the sigil is typically a forward slash: “/”.

On Unix-like operating systems, the sigil is typically a dash "-".

Some applications use multiple sigils.  With the plus sign “+” as a sigil, for instance, it is possible to use flags to turn options on and off.

          attrib   -readonly    -archive     file_A.txt
          attrib   +readonly    +archive     file_A.txt

Single-character and multi-character keys

Some applications, especially on Unix, make a distinction between single-character keys and multi-character keys (“long options”), with a single-dash sigil "-" indicating the beginning of a single-character key, and a double dash "--" sigil indicating the beginning of a multi-character key. Often, an application will support both single-character and multi-character keys for the same argument. For example, the “rename” application might accept both this

          rename  -o file_a.txt  -n file_b.txt

and this

          rename  --oldname file_a.txt  --newname file_b.txt

Fixed-length and variable-length keys

The previous section describes what I think most Unix programmers would say is the difference between single-dash and double-dash keys. But I think it is actually wrong.

The real difference between a single-dash sigil "-" and a double dash "--" sigil is not the difference between one and many, but the difference between fixed-length and variable-length keys. (This is obscured by the fact that a single-character key is also automatically a fixed-length key.)

The thing that really makes keys that begin with a single dash different from keys that begin with a double dash is not that they are one character long, but that their length is fixed and known. For example, flag concatenation (see below) is possible because the flag keys have a known and fixed length. It doesn’t depend on the flag keys being one character long — it would work just as well if the length for flag keys was fixed at two or even three characters. And this is also true of the third technique for distinguishing keys from argument values (see the next section).

Named arguments require a mechanism to distinguish keys from argument values

One technique is to use whitespace to separate argument values from keys. We saw this in our earlier example

          rename  -o file_a.txt  -n file_b.txt

A second technique is to use a special (non-whitespace) character to separate argument values from keys. This special character could be any character that cannot occur in either the key or argument value.

On Unix, this is traditionally an equal sign “=”, like this.

          rename  -o=file_a.txt  -n=file_b.txt

On Windows and MS-DOS this is traditionally a colon “:”, like this.

          rename  /o:file_a.txt  /n:file_b.txt

An application might permit whitespace before and after the equal sign, like this.

          rename  -o = file_a.txt  -n = file_b.txt

A third technique is to use the known length of the key to distinguish the key from the argument value. Suppose the “rename” application uses only 1-character keys. Then it might accept arguments like this.

          rename  -ofile_a.txt  -nfile_b.txt

Fixed-length keys make flag concatenation possible

Suppose that an application follows the convention that a single-dash sigil signals the start of a single-character flag argument. Then it can accept either this

          tar -x -v -f  some_filename.tar

or this, where several flag arguments are specified together

          tar -xvf some_filename.tar

Here is where the distinction between the single-dash sigil and the double-dash sigil becomes important; a toy illustration follows the list.

  • "-xvf" indicates the concatenation of three single-character flags: “x”, “v”, and “f”.
  • "--xvf" (note the double dash) indicates a single multi-character flag: “xvf”.

Parsing the command line

In many of the examples that we’ve seen, parsing the command line is as simple as splitting it on whitespace. But the situation gets more complicated if values can contain whitespace. If that is true, then we need to support delimiters that can enclose values that contain whitespace.

Suppose we want to invoke a word-processor from the command line. And we want to specify two arguments on the command line: the name of the file, and the name of the author. This obviously will not work.

          superedit A Christmas Story.doc  Clement Moore

What we need is this.

          superedit "A Christmas Story.doc"  "Clement Moore"

Support of quoted values means that command-line parsers must be more sophisticated… just splitting the command line on whitespace won’t do the job. The command-line parser must recognize and correctly handle quote characters… and escaped quote characters inside of quoted strings.
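
Python’s standard library acknowledges this: the shlex module does quote-aware splitting. A quick illustration:

import shlex

args = shlex.split('superedit "A Christmas Story.doc" "Clement Moore"')
print(args)    # prints: ['superedit', 'A Christmas Story.doc', 'Clement Moore']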

The most common delimiter for argument values is the double-quote symbol. But we might also (or instead) want to support single quotes, back ticks, parentheses, or square/wavy/pointy brackets. We can imagine a case in which a malevolent programmer wrote superedit to expect positional arguments like this.

          superedit (A Christmas Story.doc)  (Clement Moore)

… or named arguments like this.

          superedit filename(A Christmas Story.doc)  author(Clement Moore)

Sigils in positional arguments

Remember our “rename” application? It accepted arguments like this, where the dash is the sigil that introduces the key of a named argument.

          rename  -o file_a.txt  -n file_b.txt

But filenames can begin with dashes. We might need to write a command like this, which would cause problems.

          rename  -o -file_a.txt  -n -file_b.txt

So this is another reason why we might need to be able to quote argument values: to “hide” a sigil character inside a value.

          rename  -o "-file_a.txt"  -n "-file_b.txt"

The order of arguments

In the first version of this post, I wrote that:

It is a universally observed convention that
  • keyword arguments (named arguments and flags) are grouped together
  • positional arguments are grouped together
  • keyword arguments must be specified first, before specifying positional arguments

But that is wrong. It is a widely — but not universally — observed convention. As Eric wrote, in a comment on the first version of this post,

many modern programs allow keyword arguments to be specified after (or even between) positional arguments

And even very old programs do it too. The command-line syntax for Microsoft DOS’s dir command (roughly equivalent to Unix’s ls command) is basically

dir [filename] [switches]

with the filename positional argument appearing before the switches.

A separator between keyword arguments and positional arguments

Suppose we have an application “myprog” that accepts one or more keyword arguments that start with a dash sigil, followed by one or more positional arguments that supply filenames. And suppose that filenames can contain — and begin with — dashes.

We’re going to have a problem if we code this

          myprog -v -r -t -file_a.txt -file_b.txt  -file_c.txt

myprog is going to see “-file_a.txt” and (since it starts with a dash, the sigil) myprog will try to handle it like a keyword argument. Not good.

We could deal with this problem by routinely enclosing all filename positional arguments in quotes, but that would be clumsy and laborious.

          myprog -v -r -t "-file_a.txt" "-file_b.txt"  "-file_c.txt"

An alternative is to use a special string (typically double dashes "--") to indicate the beginning of positional arguments.

          myprog -v -r -t   --  -file_a.txt -file_b.txt  -file_c.txt

So now we have four basic kinds of arguments.

  • positional arguments
  • named arguments (key+value pairs)
  • flags
  • an indicator of the beginning of positional arguments ("--")

Argument semantics

To be expanded…

Optional arguments vs. required arguments

Relationships between different arguments

  • Aliases
  • Mutual exclusion
  • Mutual necessity


Other variations

In some conventions:

  • Multi-character keys may be abbreviated as long as the abbreviations are unique.
  • The value in a named argument is optional and may be omitted.
  • The value of a named argument may be a list, with items in the list separated by a colon or a comma.
  • A sigil character standing by itself (e.g. a single dash) is treated as a positional argument.

Command-line as a programming language

I think that the best way to think of a command-line, and its arguments, is as a statement in a command-line (CL) programming language, where each application defines its own CL language.

This means that — as far as an application is concerned — the process of using command-line arguments always looks like this:

  1. define (i.e. tell the parsing module about) the syntax rules of the CL language to be used
  2. define (i.e. tell the parsing module about) the semantics of the CL language
  3. call the parser to parse the command line and its arguments
  4. query the parser for information about the “tokens” (the command-line arguments) that it found

Step 2 — specifying the CL semantics — is the step in which the application specifies (for example) what named arguments and flags it accepts, and which are required. This step is necessary for the parser to do certain kinds of semantic checking: (for example) to automatically reject unrecognized keys, or to automatically report required arguments that were not provided.

Step 2 can be omitted, but only if the application itself will do the semantic checking rather than expecting the parsing module to do it.

The upside of doing step 2 is that it enables a smart CL parsing module automatically to generate user documentation for the CL language, and to dump that documentation to the screen when it finds a syntactic or semantic error in the command line, or when the command line is a request (e.g. “/?” or “-h”) for the command-line documentation.
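
Python’s argparse module (mentioned earlier) shows steps 1 through 4 in action. Here is a minimal sketch, using the earlier “rename” example; the argument names are illustrative.

import argparse

parser = argparse.ArgumentParser(description="rename a file")
parser.add_argument("-v", "--verbose", action="store_true",
                    help="print extra information")         # a flag
parser.add_argument("-o", "--oldname", required=True,
                    help="current name of the file")        # a named argument
parser.add_argument("newname",
                    help="requested new name of the file")  # a positional argument

args = parser.parse_args(["-v", "-o", "file_a.txt", "file_b.txt"])
print(args.verbose, args.oldname, args.newname)    # True file_a.txt file_b.txt

Because the parser has been told the semantics, it rejects unrecognized keys, complains about missing required arguments, and answers -h with generated documentation.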

Command-line meta-languages

CL languages are like markup languages. You can invent your own from scratch if you wish, but life is a lot easier if you at least follow some standard conventions when you do.

In the world of markup languages, such standard conventions are called meta-languages. The best-known markup meta-language is XML. XML is not a markup language; it is a markup meta-language … roughly: a style, or set of conventions, or template for creating specific markup languages.

XML is well-defined by the W3C. It would make sense to have similarly well-defined, carefully specified meta-languages for CL languages. Right now, I think we have two loosely-defined CL meta-languages, which I shall refer to as

  • WinCL (for Windows)
  • NixCL (for *nix platforms)

Traditionally (see the Wikipedia article on command line argument)

  • WinCL uses a slash as the sigil; NixCL uses a dash.
  • WinCL uses a colon as a key/value separator; NixCL uses an equal sign.
  • WinCL keywords traditionally consist of a single letter; NixCL is open to multi-character keywords (GNU “long options”).

As of July 25, 2010:

If it is (or becomes) possible to consider WinCL and NixCL to be well-defined CL meta-languages, then the first step of specifying a CL language for an application (which I gave earlier):

  • define (i.e. tell the parsing module about) the syntax rules of the CL language to be used

could be simply

  • tell the parsing module whether the CL language will be a WinCL or a NixCL language

An alternative is to use a parser utility that is designed to handle specifically WinCL or NixCL. Python’s optparse, for example, “supports only the most common command-line syntax and semantics conventionally used under Unix.” And if you aren’t familiar with those conventions, the documentation summarizes them.

Unicode for dummies – just use UTF-8

Revised 2012-03-18 — fixed a bad link, and removed an incorrect statement about the origin of the terms “big-endian” and “little-endian”.

Commenting on my previous post about Unicode, an anonymous commentator noted that

the usage of the BOM [the Unicode Byte Order Mark] with UTF-8 is strongly discouraged and really only a Microsoft-ism. It’s not used on Linux or Macs and just tends to get in the way of things.

So it seems worth-while to talk a bit more about the BOM.  And in the spirit of Beginners Introduction for Dummies Made Simple, let’s begin at the beginning: by distinguishing big and little from left and right.

Big and Little

“Big” in this context means “more significant”. “Little” means “least significant”.

Consider the year of American independence — 1776.  In the number 1776:

  • The least significant (“smallest”) digit is 6. It has the smallest magnitude: it represents 6 * 1, or 6.
  • The most significant (“biggest”) digit is 1. It has the largest magnitude: it represents 1 * 1000, or 1000.

So we say that 1 is located at the big end of 1776 and 6 is located at the small end of 1776.

Left and Right


Here are two technical terms: “big endian” and “little endian”.

These terms are derived from “Big End In” and “Little End In.”  According to Wikipedia, the terms Little-Endian and Big-Endian were introduced in 1980 by Danny Cohen in a paper called “On Holy Wars and a Plea for Peace”.

1776 is a “big endian” number because the “biggest” (most significant) digit is stored in the leftmost position. The big end of 1776 is on the left.

Big-endian numbers are familiar.  Our everyday “arabic” numerals are big-endian representations of numbers.  If we used a little-endian representation, the number 1776 would be represented as 6771.  That is, with the “little” end of 1776 — the “smallest” (least significant) digit — in the leftmost position.

What do you think? In Roman numerals, 1776 is represented as MDCCLXXVI. Are Roman numerals big-endian or little-endian?

So big and little are not the same as left and right.

Byte Order

Now we’re ready to talk about byte order. And specifically, byte-order in computer architectures.

Most computer (hardware) architectures agree on bits (ON and OFF) and bytes (a sequence of 8 bits), and byte-level endian-ness.  (Bytes are big-endian: the leftmost bit of a byte is the biggest.  See Understanding Big and Little Endian Byte Order.)

But problems come up when handling pieces of data, like large numbers and strings, that are stored in multiple bytes.  Different computer architectures use different endian-ness at the level of multi-byte data items (I’ll call them chunks of data).

In the memory of little-endian computers, the “little” end of a data chunk is stored leftmost. This means that a data chunk whose logical value is 0x12345678 is stored as 4 bytes with the least significant byte to the left, like this: 0x78 0x56 0x34 0x12.

  • For those (like me) who are still operating largely at the dummies level: imagine 1776 being stored in memory as 6771.

Big-endian hardware does the reverse. In the memory of big-endian computers, the “big” end of a data chunk is stored leftmost. This means that a data chunk of 0x12345678 is stored as 4 bytes with the most significant byte to the left, like this: 0x12 0x34 0x56 0x78.

  • For us dummies: imagine 1776 being stored in memory as 1776.

Here are some random (but curiously interesting) bits of information, courtesy of the Microsoft Support web-site article Explanation of Big Endian and Little Endian Architecture.

  • Intel computers are little endian.
  • Motorola computers are big endian.
  • RISC-based MIPS computers and the DEC Alpha computers are configurable for big endian or little endian.
  • Windows NT was designed around a little endian architecture, and runs only on little-endian computers or computers running in little-endian mode.

In summary, the byte order — the order of the bytes in multi-byte chunks of data — is different on big-endian and little-endian computers.

Which brings us to…

The Unicode Byte Order Mark

In this section, I’m going shamelessly to rip off information from Jukka K. Korpela’s outstanding Unicode Explained from O’Reilly (see the section on Byte Order starting on page 300). (See also Jukka’s valuable web page on characters and encodings.)

Suppose you’re running a big-endian computer, and create a file in Unicode’s UTF-16 (two-byte) format.

Note that the encoding is the Unicode UTF-16 (two-byte) encoding, not UTF-8 (one-byte). That’s an important aspect of the problem, as you will see.

You send the file out into the world, and it is downloaded by somebody running a little-endian computer. The recipient knows that the file is in UTF-16 encoding. But the bytes are not in the order that he (with his little-endian computer) expects. The data in the file appears to be scrambled beyond recognition.

The solution, of course, is simply to tell the recipient that the file was encoded in UTF-16 on a big-endian computer.  Ideally, we’d like for the data in the file itself to be able to tell the recipient the byte order (big endian or small endian) that was used when the data was encoded and stored in the file.

This is exactly what the Unicode byte order mark (BOM) is designed to do.

Unicode reserves the code point U+FEFF as the byte order mark (BOM). Written in big-endian order, it appears as the byte sequence 0xFE 0xFF; written in little-endian order, as 0xFF 0xFE.

The BOM is used for nothing else than to indicate byte order. (The byte-swapped value U+FFFE is deliberately defined so that it is never a valid character, which is what makes the trick unambiguous.) If the first two bytes of a file are 0xFE 0xFF or 0xFF 0xFE, then a Unicode decoder knows that those two bytes contain a Unicode BOM, and knows what to do with it.

This also means that if you (in the role, say, of a forensic computer scientist) must process a mystery file, and you see that the file’s first two bytes are 0xFE 0xFF or 0xFF 0xFE, you can (with a high probability of being correct) infer that the file is encoded in Unicode UTF-16 format.

So: Where’s the BOM?

In actual practice, most UTF-8 files do not include a BOM.  Why not?

A file that has been encoded using UTF-16 is an ordered sequence of 2-byte chunks. Knowing the order of the bytes within the chunks is crucial to being able to decode the file into the correct Unicode code points.  So a BOM should be considered mandatory for files encoded using UTF-16.

But a file in UTF-8 encoding is an ordered sequence of 1-byte chunks.  In UTF-8, a byte and a chunk are essentially the same thing.  So with UTF-8, the problem of knowing the order of the bytes within the chunks is simply a non-issue, and a BOM is pointless. And since the Unicode standard does not require the use of the BOM, virtually nobody puts a BOM in files encoded using UTF-8.
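
You can watch all of this from Python itself. A quick illustration (the utf-16 output shown here is from a little-endian machine, so your byte order may differ):

print("A".encode("utf-16"))        # b'\xff\xfeA\x00' : BOM FF FE, then the data
print("A".encode("utf-16-be"))     # b'\x00A'         : order explicit, so no BOM
print("A".encode("utf-8"))         # b'A'             : no BOM
print("A".encode("utf-8-sig"))     # b'\xef\xbb\xbfA' : the Microsoft-style UTF-8 BOM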

Let’s do UTF-8… all the time!

It is important to recognize that UTF-8 is able to represent any character in the Unicode standard.  So there is a simple rule for coding English text (i.e. text that uses only or mostly ASCII characters) —

Always use UTF-8.

  • UTF-8 is easy to use. You don’t need a BOM.
  • UTF-8 can encode anything.
  • For English or mostly-ASCII text, there is essentially no storage penalty for using UTF-8. (Note, however, that if you’re encoding Chinese text, your mileage will differ!)

What’s not to like!!??

UTF-8? For every Unicode code point?!

How can you possibly encode every character in the entire Unicode character set using only 8 bits!!!!

Here’s where Joel Spolsky’s (Joel on Software) excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) comes in useful.  As Joel notes

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

This is the myth that Unicode is what is known as a Multibyte Character Set (MBCS) or Double-Byte Character Set (DBCS).   Hopefully, by now, this myth is dying.

In fact, UTF-8 is what is known variously as a

  • multibyte encoding
  • variable-width encoding
  • multi-octet encoding (For us dummies, octet == byte. For the difference, see page 46 of Korpela’s Unicode Explained.)

Here’s how multibyte encoding works in UTF-8; a short demonstration follows the list.

  • ASCII characters are stored in single bytes.
  • Non-ASCII characters are stored in multiple bytes, in a “multibyte sequence”.
  • For non-ASCII characters, the first byte in a multibyte sequence is always in the range 0xC0 to 0xFD. The coding of the first byte indicates how many bytes follow, and so indicates the total number of bytes in the multibyte sequence.
  • In UTF-8, a multibyte sequence can contain as many as four bytes.
  • Originally a multibyte sequence could contain six bytes, but UTF-8 was restricted to four bytes by RFC 3629 in November 2003.
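
Here is that scheme at work, straight from Python, with the byte counts shown as output:

for ch in ("A", "é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")))

# output:
#   A 1      (ASCII: a single byte)
#   é 2
#   € 3
#   𝄞 4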

For a quick overview of how this works at the bit level, take a look at the answer by dsimard to the question How does UTF-8 “variable-width encoding” work? on stackoverflow.

Wrapping it all up

So that’s it. Our investigation of the BOM has led us to take a closer look at UTF-8 and multibyte encoding.

And that leads us to a nice place. For the most part, and certainly if you’re working with ASCII data, there is a simple rule.

Just use UTF-8 and forget about the BOM.