Backing up your email

Just in case someone might find this useful …

I recently had something bad happen to me. I use Thunderbird (on Windows Vista) as my email client. I asked Thunderbird to compact my email files, and it wiped out a bunch of my email messages. (I think that one of my email files must have been corrupt, and when I compacted it, the compaction process wiped out messages that should not have been wiped out.)

You can recover deleted email messages … but not after the email file has been compacted. So the messages were not recoverable. Bummer.

The upside is that this nasty incident led me to learn some things.

One thing that I learned was that the disk backup utility that I was using at the time did NOT back up my email files. The email files were stored in a directory called AppData, and the AppData directory is a “hidden” directory. So the backup utility didn’t see the AppData directory, and didn’t back it up. So I had no backup of the deleted messages.

Learning that led me to investigate ways to back up my email files, and I found this: Five ways to keep your emails backed up

For backing up Thunderbird files, it recommends MozBackup as being fast, free and easy to use. So I tried MozBackup, and those claims seem to be true.

Now I’m evaluating different disk backup options.

The take-away here is that you need to pay special attention to backing up your email files. So if you’re not backing up your email files, take a look at Five ways to keep your emails backed up (and read the comments, which are useful) or google something like “email backup”.

[Note that this applies only if you are using an email client such as Thunderbird, Outlook, Outlook Express, etc. If you don't use an email client, and do all of your email work through a Web interface to your Internet Service Provider, then this is not an issue.]

Unicode for dummies — Encoding

Another entry in an irregular series of posts about Unicode.
Typos fixed 2012-02-22. Thanks Anonymous, and Clinton, for reporting the typos.

This is a story about encoding and decoding, with a minor subplot involving Unicode.

As our story begins — on a dark and stormy night, of course — we find our protagonist deep in thought. He is asking himself “What is an encoding?”

What is an encoding?

The basic concepts are simple. First, we start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being. I’m going to call that representation “plain text”. For English-language speakers, for example, English words printed on a page, or displayed on a screen, count as plain text.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation (let’s call that representation the “encoded text”), and we need to be able to translate the encoded text back into plain text. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

[Figure: encoding and decoding]

There are three points worth noting about this process.

The first point is that no information can be lost during encoding or decoding. It must be possible for us to send a message on a round-trip journey — from plain text to encoded text, and then back again from encoded text to plain text — and get back exactly the same plain text that we started with. That is why, for instance, we can’t use one natural language (Russian, Chinese, French, Navaho) as an encoding for another natural language (English, Hindi, Swahili). The mappings between natural languages are too loose to guarantee that a piece of information can make the round-trip without losing something in translation.

The requirement for a lossless round-trip means that the mapping between the plain text and the encoded text must be very tight, very exact. And that brings us to the second point.

In order for the mapping between the plain text and the encoded text to be very tight — which is to say: in order for us to be able to specify very precisely how the encoding and decoding processes work — we must specify very precisely what the plain text representation looks like.

Suppose, for example, we say that plain text looks like this: the 26 upper-case letters of the Anglo-American alphabet, plus the space and three punctuation symbols: period (full stop), question mark, and dash (hyphen). This gives us a plain-text alphabet of 30 characters. If we need numbers, we can spell them out, like this: “SIX THOUSAND SEVEN HUNDRED FORTY-THREE”.

On the other hand, we may wish to say that our plain text looks like this: 26 upper-case letters, 26 lower-case letters, 10 numeric digits, the space character, and a dozen types of punctuation marks: period, comma, double-quote, left parenthesis, right parenthesis, and so on. That gives us a plain-text alphabet of 75 characters.

Once we’ve specified exactly what a plain-text representation of a message looks like — a finite sequence of characters from our 30-character alphabet, or perhaps our 75-character alphabet — then we can devise a system (a code) that can reliably encode and decode plain-text messages written in that alphabet. The simplest such system is one in which every character in the plain-text alphabet has one and only one corresponding representation in the encoded text. A familiar example is Morse code, in which “SOS” in plain text corresponds to

                ... --- ...

in encoded text.
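A one-to-one code like this is easy to sketch in Python. The table below is a deliberately tiny subset of Morse code (only the letters needed for "SOS"), and the names MORSE, REVERSE, encode and decode are invented for this illustration. The point is the lossless round trip, not a complete Morse implementation.

```python
# A tiny subset of Morse code: only the letters needed for "SOS".
MORSE = {"S": "...", "O": "---"}

# The reverse table maps each encoded symbol back to its plain-text character.
REVERSE = {code: char for char, code in MORSE.items()}

def encode(plain_text):
    """Translate plain text into Morse symbols separated by spaces."""
    return " ".join(MORSE[char] for char in plain_text)

def decode(encoded_text):
    """Translate space-separated Morse symbols back into plain text."""
    return "".join(REVERSE[code] for code in encoded_text.split(" "))

encoded = encode("SOS")
print(encoded)          # ... --- ...
print(decode(encoded))  # SOS
```

Because every plain-text character has exactly one encoded representation, and vice versa, decode(encode(text)) gives back the text we started with.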

In the real world, of course, the selection of characters for the plain-text alphabet is influenced by technological limitations on the encoded text. Suppose we have several available technologies for storing encoded messages: one technology supports an encoded alphabet of 256 characters, another technology supports only 128 encoded characters, and a third technology supports only 64 encoded characters. Naturally, we can make our plain-text alphabet much larger if we know that we can use a technology that supports a larger encoded-text alphabet.

And the reverse is also true. If we know that our plain-text alphabet must be very large, then we know that we must find -- or devise -- a technology capable of storing a large number of encoded characters.
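The trade-off can be made concrete by counting bits. The sketch below assumes that each encoded character is stored as a fixed-length pattern of 1s and 0s (that is an assumption made for this example, and bits_needed is a made-up helper name). It shows how many bits such a pattern needs for alphabets of various sizes, including the 30-character and 75-character alphabets described earlier.

```python
def bits_needed(alphabet_size):
    """Smallest number of bits that gives every character its own pattern.

    With n bits there are 2**n distinct patterns, so we need the smallest n
    for which 2**n >= alphabet_size (valid for alphabet_size >= 2).
    """
    return (alphabet_size - 1).bit_length()

for size in (30, 64, 75, 128, 256):
    print(size, "characters need", bits_needed(size), "bits each")
```

So a 64-pattern technology (6 bits) is enough for the 30-character alphabet, but the 75-character alphabet needs at least the 128-pattern technology (7 bits).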

Which brings us to Unicode.

Unicode

Unicode was devised to be a system capable of storing encoded representations of every plain-text character of every human language that has ever existed. English, French, Spanish. Greek. Arabic. Hindi. Chinese. Assyrian (cuneiform characters).

That's a lot of characters.

So the first task of the Unicode initiative was simply to list all of those characters, and count them. That's the first half of Unicode, the Universal Character Set. (And if you really want to "talk Unicode", don't call plain-text characters "characters". Call them "code points".)
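In Python 3, the built-in ord() and chr() functions convert between a one-character string and its code point number, which gives a quick way to peek into the Universal Character Set. A small sketch (the three sample characters are chosen only for illustration):

```python
# Three sample characters, written as escapes so the source stays ASCII:
# "A", "\u00e9" (e with acute accent) and "\u4e2d" (a Chinese character).
samples = ["A", "\u00e9", "\u4e2d"]

for char in samples:
    code_point = ord(char)  # character -> code point (an integer)
    print(repr(char), code_point, hex(code_point))
    assert chr(code_point) == char  # code point -> character
```

Note that nothing here says how the character is stored as bytes. A code point is just a number in a list; the storage question comes next.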

Once you've done that, you've got to figure out a technology for storing all of the corresponding encoded-text characters. (In Unicode-speak, the encoded-text characters are called "code values".)

In fact Unicode defines not one but several methods of mapping code points to code values. Each of these methods has its own name. Some of the names start with "UTF", others start with "UCS": UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, and so on. The naming convention is "UTF-<number of bits in a code value>" and "UCS-<number of bytes in a code value>". Some (e.g. UCS-4 and UTF-32) are functionally equivalent. See the Wikipedia article on Unicode.

The most important thing about these methods is that some are fixed-width encodings and some are variable-width encodings. The basic idea is that the fixed-width encodings are very long -- UCS-4 and UTF-32 are 4 bytes (32 bits) long -- long enough to hold the biggest code value that we will ever need.

In contrast, the variable-width encodings are designed to be short, but expandable. UTF-8, for example, can use as few as 8 bits (one byte) to store ASCII code points, which include the unaccented Latin letters. But it also has a sort of "continued on the next byte" mechanism that allows it to use 2, 3, or even 4 bytes if it needs to (as it does for Chinese characters, most of which take 3 bytes). For Western programmers, that means that UTF-8 is both efficient and flexible, which is why UTF-8 is the de facto standard encoding for exchanging Unicode text.
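The difference between variable-width and fixed-width encodings is easy to see by encoding a few characters and counting the bytes. This is only a sketch; the sample characters are arbitrary, and "utf-32-le" is used instead of "utf-32" so that Python does not prepend a byte-order mark to each result.

```python
# "A", e-acute, a Chinese character, and an emoji, written as escapes.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for char in samples:
    utf8_bytes = char.encode("utf-8")       # variable width: 1 to 4 bytes
    utf32_bytes = char.encode("utf-32-le")  # fixed width: always 4 bytes
    print(repr(char), len(utf8_bytes), len(utf32_bytes))

utf8_lengths = [len(c.encode("utf-8")) for c in samples]
utf32_lengths = [len(c.encode("utf-32-le")) for c in samples]
print(utf8_lengths)   # [1, 2, 3, 4]
print(utf32_lengths)  # [4, 4, 4, 4]
```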

There is, then, no such thing as THE Unicode encoding system or method. There are several encoding methods, and if you want to exchange text with someone, you need explicitly to specify which encoding method you are using.

Is it, say, this.

[Figure: encoding and decoding with UTF-8]

Or this.

[Figure: encoding and decoding with UTF-16]

Or something else.
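Here is a short sketch of why the choice has to be stated explicitly. The same four-character string produces different bytes under UTF-8 and UTF-16, and bytes written with one method cannot be reliably decoded with the other. (The sample word and the variable names are only for illustration; "utf-16-le" is used so that no byte-order mark is added.)

```python
text = "caf\u00e9"  # the word "cafe" with an acute accent on the e

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
print(as_utf8)   # b'caf\xc3\xa9'
print(as_utf16)  # b'c\x00a\x00f\x00\xe9\x00'

# Decoding with the same method that did the encoding is a lossless round trip.
assert as_utf8.decode("utf-8") == text
assert as_utf16.decode("utf-16-le") == text

# Decoding with the wrong method fails (or, in other cases, yields garbage).
try:
    as_utf16.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
print(wrong_decode_failed)  # True
```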

Which brings us back to something I said earlier.

Why encode something in Unicode?

At the beginning of this post I said

We start with the idea of a piece of information -- a message -- that exists in a representation that is understandable (perspicuous) to a human being.

Next, (for reasons that we won't explore right now) we need to be able to translate a message in a plain-text representation into some other representation. The translation from plain text to encoded text is called "encoding", and the translation of encoded text back into plain text is called "decoding".

OK. So now it is time to explore those reasons. Why might we want to translate a message in a plain-text representation into some other representation?

One reason, of course, is that we want to keep a secret. We want to hide the plain text of our message by encrypting and decrypting it -- basically, by keeping the algorithms for encoding and decoding secret and private.

But that is a completely different subject. Right now, we're not interested in keeping secrets; we're Python programmers and we're interested in Unicode. So:

Why -- as a Python programmer -- would I need to be able to translate a plain-text message into some encoded representation... say, a Unicode representation such as UTF-8?

Suppose you are happily sitting at your PC, working with your favorite text editor, writing the standard Hello World program in Python (specifically, in Python 3+). This single line is your entire program.

                   print("Hello, world!")

Here, "Hello, world!" is plain text. You can see it on your screen. You can read it. You know what it means. It is just a string and you can (if you wish) do standard string-type operations on it, such as taking a substring (a slice).

But now suppose you want to put this string -- "Hello, world!" -- into a file and save the file on your hard drive. Perhaps you plan to send the file to a friend.

That means that you must eject your poor little string from the warm, friendly, protected home in your Python program, where it exists simply as plain-text characters. You must thrust it into the cold, impersonal, outside world of the file system. And out there it will exist not as characters, but as mere 1s and 0s, a jumble of dits and dots, charged and uncharged particles. And that means that your happy little plain-text string must be represented by some specific configuration of 1s and 0s, so that when somebody wants to retrieve that collection of 1s and 0s and convert it back into readable plain text, they can.

The process of converting a plain text into a specific configuration of 1s and 0s is a process of encoding. In order to write a string to a file, you must encode it using some encoding system (such as UTF-8). And to get it back from a file, you must read the file and decode the collection of 1s and 0s back into plain text.
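In Python 3 this might look like the sketch below. The temporary directory and the file name hello.txt are made up for the example. Opening the file in text mode with an explicit encoding makes Python do the encoding on the way out; reading the file back in binary mode shows the raw bytes, which we then decode ourselves.

```python
import os
import tempfile

text = "Hello, world!"
path = os.path.join(tempfile.mkdtemp(), "hello.txt")  # hypothetical file name

# Writing: Python encodes the string into bytes using the encoding we name.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading in binary mode shows the bytes that actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'Hello, world!'

# Decoding turns those bytes back into a plain-text string.
decoded = raw.decode("utf-8")
assert decoded == text
```

Reading with open(path, "r", encoding="utf-8") would do the decoding step for you.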

The need to encode/decode strings when writing/reading them to/from files isn't something new -- it is not an additional burden imposed by Python 3's new support for Unicode. It is something you have always done. But it wasn't always so obvious. In earlier versions of Python, the encoding scheme was ASCII. And because, in those olden times, ASCII was pretty much the only game in town, you didn't need to specify that you wanted to write and read your files in ASCII. Python just assumed it by default and did it. But -- whether or not you realized it -- whenever one of your programs wrote or read strings from a file, Python was busy behind the scene, doing the encoding and decoding for you.

So that's why you -- as a Python programmer -- need to be able to encode and decode text into, and out of, UTF-8 (or some other encoding: UTF-16, ASCII, whatever). You need to encode your strings as 1s and 0s so you can put those 1s and 0s into a file and send the file to someone else.

What is plain text?

Earlier, I said that there were three points worth noting about the encoding/decoding process, and I discussed the first two. Here is the third point.

The distinction between plain text and encoded text is relative and context-dependent.

As programmers, we think of plain text as being written text. But it is possible to look at matters differently. For instance, we can think of spoken text as the plain text, and written text as the encoded text. From this perspective, writing is encoded speech. And there are many different encodings for speech as writing. Think of Egyptian hieroglyphics, Mayan hieroglyphics, the Latin alphabet, the Greek alphabet, Arabic, Chinese ideograms, wonderfully flowing Devanagari देवनागरी, sharp pointy cuneiform wedges, even shorthand. These are all written encodings for the spoken word. They are all, as Thomas Hobbes put it, "Marks by which we may remember our thoughts".

Which reminds us that, in a different context, even speech itself -- language -- may be regarded as a form of encoding. In much of early modern philosophy (think of Hobbes and Locke) speech (or language) was basically considered to be an encoding of thoughts and ideas. Communication happens when I encode my thought into language and say something -- speak to you. You hear the sound of my words and decode it back into ideas. We achieve communication when I successfully transmit a thought from my mind to your mind via language. You understand me when -- as a result of my speech -- you have the same idea in your mind as I have in mine. (See Ian Hacking, Why Does Language Matter to Philosophy?)

Finally, note that in other contexts, the "plain text" isn't even text. Where the plain text is soundwaves (e.g. music), it can be encoded as an mp3 file. Where the plain text is an image, it can be encoded as a gif, or png, or jpg file. Where the plain text is a movie, it can be encoded as a wmv file. And so on.

Everywhere, we are surrounded by encoding and decoding.


Notes

I'd like to recommend Eli Bendersky's recent post on The bytes/str dichotomy in Python 3, which prodded me -- finally -- to put these thoughts into writing. I especially like this passage in his post.

Think of it this way: a string is an abstract representation of text. A string consists of characters, which are also abstract entities not tied to any particular binary representation. When manipulating strings, we’re living in blissful ignorance. We can split and slice them, concatenate and search inside them. We don’t care how they are represented internally and how many bytes it takes to hold each character in them. We only start caring about this when encoding strings into bytes (for example, in order to send them over a communication channel), or decoding strings from bytes (for the other direction).

I strongly recommend Charles Petzold's wonderful book Code: The Hidden Language of Computer Hardware and Software.

And finally, I've found Stephen Pincock's Codebreaker: The History of Secret Communications a delightful read. It will tell you, among many other things, how the famous WWII Navaho codetalkers could talk about submarines and dive bombers... despite the fact that there are no Navaho words for "submarine" or "dive bomber".

How to post source code on WordPress

This post is for folks who blog about Python (or any programming language for that matter) on WordPress.
Updated 2011-11-09 to make it easier to copy-and-paste the [sourcecode] template.

My topic today is How to post source code on WordPress.

The trick is to use the WordPress [sourcecode] shortcode tag, as documented at http://en.support.wordpress.com/code/posting-source-code/.

Note that when the WordPress docs tell you to enclose the [sourcecode] shortcode tag in square -- not pointy -- brackets, they mean it. When you view your post as HTML, what you should see is square brackets around the shortcode tags, not pointy brackets.

Here is the tag I like to use for snippets of Python code.


[sourcecode language="python" wraplines="false" collapse="false"]
your source code goes here
[/sourcecode]


The default for wraplines is true, which causes long lines to be wrapped. That isn't appropriate for Python, so I specify wraplines="false".

The default for collapse is false, which is what I normally want. But I code it explicitly, as a reminder that if I ever want to collapse a long code snippet, I can.


Here are some examples.

Note that

  • WordPress knows how to do syntax highlighting for Python. It uses Alex Gorbatchev's SyntaxHighlighter.
  • If you hover your mouse pointer over the code, you get a pop-up toolbar that allows you to look at the original source code snippet, copy it to the clipboard, print it, etc.

(1)

First, a normal chunk of relatively short lines of Python code.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()

(2)

Here is a different code snippet. This one has a line containing a very long comment. Note that the long line is NOT wrapped, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to. That is because we have specified wraplines="false".

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="false", so lines are NOT wrapped, but extend indefinitely, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to.

(3)

This is what a similar code snippet would look like if we had specified wraplines="true". Note that line 2 wraps around and there is no horizontal scrollbar.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and a horizontal scrollbar is NOT available.

(4)

Finally, the same code snippet with collapse="true", so the code snippet initially displays as collapsed. Clicking on the collapsed code snippet will cause it to expand.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and a horizontal scrollbar is NOT available.

As far as I can tell, once a reader has expanded a snippet that was initially collapsed, there is no way for him to re-collapse it. That would be a nice enhancement for WordPress — to allow a reader to collapse and expand a code snippet.


Here is a final thought about wraplines. If you specify wraplines="false", and a reader prints a paper copy of your post, the printed output will not show the scrollbar, and it will show only the portion of long lines that was visible on the screen. In short, the printed output might cut off the right-hand part of long lines.

In most cases, I think, this should not be a problem. The pop-up tools allow a reader to view or print the entire source code snippet if he wants to. Still, I can imagine cases in which I might choose to specify wraplines="true", even for a whitespace-sensitive language such as Python. And I can understand that someone else, simply as a matter of personal taste, might prefer to specify wraplines="true" all of the time.

Now that I think of it, another nice enhancement for WordPress would be to allow a reader to toggle wraplines on and off.


Keep on bloggin'!

Python3 pickling

Recently I was converting some old Python2 code to Python3 and I ran across a problem pickling and unpickling.

I guess I would say it wasn’t a major problem because I found the solution fairly quickly with a bit of googling around.

Still, I think the problem and its solution are worth a quick note.  Others will stumble across this problem in the future, especially because there are code examples floating around (in printed books and online posts) that will lead new Python programmers to make this very same mistake.

So let’s talk about pickling.

Suppose you want to “pickle” an object — dump it to a pickle file for persistent storage.

When you pickle an object, you do two things.

  • You open the file that you want to use as the pickle file. The call to open(…) returns a file handle object.
  • You pass the object that you want to pickle, and the file handle object, to pickle.dump.

Your code might look something like this. Note that this code is wrong. See below.

fileHandle = open(pickleFileName, "w")
pickle.dump(objectToBePickled, fileHandle)

When I wrote code like this, I got back this error message:

Pickler(file, protocol, fix_imports=fix_imports).dump(obj)
TypeError: must be str, not bytes

Talk about a crappy error message!!!

After banging my head against the wall for a while, I googled around and quickly found a very helpful answer on StackOverflow.

The bottom line is that a Python pickle file is (and always has been) a byte stream. Which means that you should always open a pickle file in binary mode: "wb" to write it, and "rb" to read it. The Python docs contain correct example code.

My old code worked just fine running under Python2 (on Windows).  But with Python3's new strict separation of strings and bytes, it broke. Changing "w" to "wb", and "r" to "rb", fixed it. 
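For comparison, here is a sketch of the corrected code. It reuses the variable names from the broken example above; the sample dictionary and the temporary file location are made up for the illustration, and the with statement is simply a convenient way to make sure the file gets closed.

```python
import os
import pickle
import tempfile

# Sample data and a throwaway file name, both invented for this example.
objectToBePickled = {"name": "example", "values": [1, 2, 3]}
pickleFileName = os.path.join(tempfile.mkdtemp(), "example.pickle")

# "wb": write the pickle as a byte stream.
with open(pickleFileName, "wb") as fileHandle:
    pickle.dump(objectToBePickled, fileHandle)

# "rb": read the byte stream back and rebuild the object.
with open(pickleFileName, "rb") as fileHandle:
    restoredObject = pickle.load(fileHandle)

assert restoredObject == objectToBePickled
```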


One person who posted a question about this problem on the Python forum was aware of the issue, but confused because he was trying to pickle a string.

import pickle
a = "blah"
file = open('state', 'w')
pickle.dump(a,file)

I know of one easy way to solve this is to change the operation argument from 'w' to 'wb' but I AM using a string not bytes! And none of the examples use 'wb' (I figured that out separately) so I want to have an understanding of what is going on here.

Basically, regardless of the kind of object that you are pickling (even a string object), the object will be converted to a bytes representation and pickled as a byte stream. Which means that you always need to use "rb" and "wb", regardless of the kind of object that you are pickling.
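One way to see this for yourself is to skip the file entirely and use pickle.dumps, which returns the pickled representation directly. A minimal sketch, using the same string as the forum question:

```python
import pickle

a = "blah"

# pickle.dumps returns the pickled form without writing it to a file.
pickled = pickle.dumps(a)
print(type(pickled))  # <class 'bytes'>

# Even though we pickled a str, the pickle itself is bytes.
assert isinstance(pickled, bytes)
assert pickle.loads(pickled) == a
```

Since the pickle is bytes, the file it is written to has to be opened in binary mode.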

Yet Another Lambda Tutorial

modified to use the WordPress [sourcecode] tag — 2012-01-14

There are a lot of tutorials[1] for Python’s lambda out there. One that I stumbled across recently and really found helpful was Mike Driscoll’s discussion of lambda on the Mouse vs Python blog.

When I first started learning Python, one of the most confusing concepts to get my head around was the lambda statement. I’m sure other new programmers get confused by it as well…

Mike’s discussion is excellent: clear, straight-forward, with useful illustrative examples. It helped me — finally — to grok lambda, and led me to write yet another lambda tutorial.

 


Lambda: a tool for building functions

Basically, Python’s lambda is a tool for building functions (or more precisely, function objects). That means that Python has two tools for building functions: def and lambda.

Here’s an example. You can build a function in the normal way, using def, like this:

import math

def square_root(x): return math.sqrt(x)

or you can use lambda:

square_root = lambda x: math.sqrt(x)
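A quick way to convince yourself that the two tools build the same kind of thing is to call both and compare. In this sketch the two functions are given different names (square_root_def and square_root_lambda, invented here) so that they can sit side by side.

```python
import math

def square_root_def(x):
    return math.sqrt(x)

square_root_lambda = lambda x: math.sqrt(x)

# Both are ordinary function objects and behave identically when called.
print(square_root_def(16.0), square_root_lambda(16.0))  # 4.0 4.0
assert type(square_root_def) is type(square_root_lambda)

# The visible difference: def records a name, lambda does not.
print(square_root_def.__name__)     # square_root_def
print(square_root_lambda.__name__)  # <lambda>
```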

Here are a few other interesting examples of lambda:

sum = lambda x, y:   x + y   #  def sum(x,y): return x + y
out = lambda   *x:   sys.stdout.write(" ".join(map(str,x)))
lambda event, name=button8.getLabel(): self.onButton(event, name)

 


What is lambda good for?

A question that I've had for a long time is: What is lambda good for? Why do we need lambda?

The answer is:

  • We don't need lambda, we could get along all right without it. But...
  • there are certain situations where it is convenient — it makes writing code a bit easier, and the written code a bit cleaner.

What kind of situations?

Well, situations in which we need a simple one-off function: a function that is going to be used only once.

Normally, functions are created for one of two purposes: (a) to reduce code duplication, or (b) to modularize code.

  • If your application contains duplicate chunks of code in various places, then you can put one copy of that code into a function, give the function a name, and then -- using that function name -- call it from various places in your code.
  • If you have a chunk of code that performs one well-defined operation -- but is really long and gnarly and interrupts the otherwise readable flow of your program -- then you can pull that long gnarly code out and put it into a function all by itself.

But suppose you need to create a function that is going to be used only once -- called from only one place in your application. Well, first of all, you don't need to give the function a name. It can be "anonymous". And you can just define it right in the place where you want to use it. That's where lambda is useful.

But, but, but... you say.

  • First of all -- Why would you want a function that is called only once? That eliminates reason (a) for making a function.
  • And the body of a lambda can contain only a single expression. That means that lambdas must be short. So that eliminates reason (b) for making a function.

What possible reason could I have for wanting to create a short, anonymous function?

Well, consider this snippet of code that uses lambda to define the behavior of buttons in a Tkinter GUI interface. (This example is from Mike's tutorial.)

frame = tk.Frame(parent)
frame.pack()

btn22 = tk.Button(frame, 
        text="22", command=lambda: self.printNum(22))
btn22.pack(side=tk.LEFT)

btn44 = tk.Button(frame, 
        text="44", command=lambda: self.printNum(44))
btn44.pack(side=tk.LEFT)

The thing to remember here is that a tk.Button expects a function object as an argument to the command parameter. That function object will be the function that the button calls when it (the button) is clicked. Basically, that function specifies what the GUI will do when the button is clicked.

So we must pass a function object in to a button via the command parameter. And note that -- since different buttons do different things -- we need a different function object for each button object. Each function will be used only once, by the particular button to which it is being supplied.

So, although we could code (say)

def __init__(self, parent):
    """Constructor"""
    frame = tk.Frame(parent)
    frame.pack()

    btn22 = tk.Button(frame, 
        text="22", command=self.buttonCmd22)
    btn22.pack(side=tk.LEFT)

    btn44 = tk.Button(frame, 
        text="44", command=self.buttonCmd44)
    btn44.pack(side=tk.LEFT)

def buttonCmd22(self):
    self.printNum(22)

def buttonCmd44(self):
    self.printNum(44)

it is much easier (and clearer) to code

def __init__(self, parent):
    """Constructor"""
    frame = tk.Frame(parent)
    frame.pack()

    btn22 = tk.Button(frame, 
        text="22", command=lambda: self.printNum(22))
    btn22.pack(side=tk.LEFT)

    btn44 = tk.Button(frame, 
        text="44", command=lambda: self.printNum(44))
    btn44.pack(side=tk.LEFT)

When a GUI program has this kind of code, the button object is said to "call back" to the function object that was supplied to it as its command.

So we can say that one of the most frequent uses of lambda is in coding "callbacks" to GUI frameworks such as Tkinter and wxPython.
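The callback mechanism itself can be sketched without any GUI toolkit at all. FakeButton below is a hypothetical stand-in written for this example (it is not part of Tkinter or wxPython), and printNum is a plain function rather than a method. The button simply stores the function object it was given and calls it when it is "clicked".

```python
class FakeButton:
    """Hypothetical stand-in for a GUI button; not a real toolkit class."""

    def __init__(self, text, command):
        self.text = text
        self.command = command  # the function object to call back

    def click(self):
        # A real toolkit would do this from its event loop.
        self.command()

clicked = []

def printNum(num):
    clicked.append(num)
    print(num)

# Each button gets its own one-off, anonymous function.
btn22 = FakeButton(text="22", command=lambda: printNum(22))
btn44 = FakeButton(text="44", command=lambda: printNum(44))

btn44.click()  # prints 44
btn22.click()  # prints 22
assert clicked == [44, 22]
```

Note that command=printNum(22) would not work: it would call printNum immediately and hand the button its return value (None) instead of a function. The lambda wraps the call so that it happens later, on the click.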

 


This all seems pretty straight-forward. So...

Why is lambda so confusing?

There are four reasons that I can think of.

First, lambda is confusing because the requirement that a lambda can take only a single expression raises the question: What is an expression?

A lot of people would like to know the answer to that one. If you Google around a bit, you will see a lot of posts from people asking "In Python, what's the difference between an expression and a statement?"

One good answer is that an expression returns (or evaluates to) a value, whereas a statement does not. Unfortunately, the situation is muddled by the fact that in Python an expression can also be a statement. And we can always throw a red herring into the mix -- Python allows chained assignments like a = b = 0, which might suggest that assignment statements return values. (They do not. Python isn't C.)[2]

In many cases when people ask this question, what they really want to know is: What kind of things can I, and can I not, put into a lambda?

And for that question, I think a few simple rules of thumb will be sufficient.

  • If it doesn't return a value, it isn't an expression and can't be put into a lambda.
  • If you can imagine it in an assignment statement, on the right-hand side of the equals sign, it is an expression and can be put into a lambda.

Using these rules means that:

  1. Assignment statements cannot be used in lambda. In Python, assignment statements don't return anything, not even None (null).
  2. Simple things such as mathematical operations, string operations, list comprehensions, etc. are OK in a lambda.
  3. Function calls are expressions. It is OK to put a function call in a lambda, and to pass arguments to that function. Doing this wraps the function call (arguments and all) inside a new, anonymous function.
  4. In Python 3, print became a function, so in Python 3+, print(...) can be used in a lambda.
  5. Even functions that return None, like the print function in Python 3, can be used in a lambda.
  6. Conditional expressions, which were introduced in Python 2.5, are expressions (and not merely a different syntax for an if/else statement). They return a value, and can be used in a lambda.
    lambda: a if some_condition() else b
    lambda x: 'big' if x > 100 else 'small'
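These rules of thumb can be checked at the interpreter. The sketch below is for Python 3; the names size, show and assignment_allowed are invented for the example. It tries a conditional expression, a call to print, and (via compile, so that the failure can be caught) an assignment statement inside a lambda.

```python
# Rule 6: a conditional expression is an expression, so it is allowed.
size = lambda x: "big" if x > 100 else "small"
print(size(500), size(5))  # big small

# Rules 4 and 5: print(...) is a function call that returns None.
show = lambda message: print(message)
result = show("hello")  # prints hello
assert result is None

# Rule 1: an assignment statement is not an expression, so this source text
# does not even compile. We compile it from a string to catch the error.
try:
    compile("lambda: x = 1", "<example>", "eval")
    assignment_allowed = True
except SyntaxError:
    assignment_allowed = False
print(assignment_allowed)  # False
```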

 

Second, lambda is confusing because the specification that a lambda can take only a single expression raises the question: Why? Why only one expression? Why not multiple expressions? Why not statements?

For some developers, this question means simply Why is the Python lambda syntax so weird? For others, especially those with a Lisp background, the question means Why is Python's lambda so crippled? Why isn't it as powerful as Lisp's lambda?

The answer is complicated, and it involves the "pythonicity" of Python's syntax. Lambda was a relatively late addition to Python. By the time that it was added, Python syntax had become well established. Under the circumstances, the syntax for lambda had to be shoe-horned into the established Python syntax in a "pythonic" way. And that placed certain limitations on the kinds of things that could be done in lambdas.

Frankly, I still think the syntax for lambda looks a little weird. Be that as it may, Guido has explained why lambda's syntax is not going to change. Python will not become Lisp.[3]

 

Third, lambda is confusing because it is usually described as a tool for creating functions, but a lambda specification does not contain a return statement.

The return statement is, in a sense, implicit in a lambda. Since a lambda specification must contain only a single expression, and that expression must return a value, an anonymous function created by lambda implicitly returns the value returned by the expression. This makes perfect sense.
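
To make the implicit return concrete, here is a small sketch that puts a lambda next to an equivalent def. The names root_lambda and root_def are made up for the illustration.

```python
import math

# A lambda: the value of its single expression is returned implicitly.
root_lambda = lambda x: math.sqrt(x)

# The equivalent def needs an explicit return statement.
def root_def(x):
    return math.sqrt(x)

# Both callables give the same answer for the same argument.
same = root_lambda(16) == root_def(16) == 4.0
```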

Still -- the lack of an explicit return statement is, I think, part of what makes it hard to grok lambda, or at least, hard to grok it quickly.

 

Fourth Lambda is confusing because: tutorials on lambda typically introduce lambda as a tool for creating anonymous functions, when in fact the most common use of lambda is for creating anonymous procedures.

Back in the High Old Times, we recognized two different kinds of subroutines: procedures and functions. Procedures were for doing stuff, and did not return anything. Functions were for calculating and returning values. The difference between functions and procedures was even built into some programming languages. In Pascal, for instance, procedure and function were different keywords.

In most modern languages, the difference between procedures and functions is no longer enshrined in the language syntax. A Python function, for instance, can act like a procedure, a function, or both. The (not altogether desirable) result is that a Python function is always referred to as a "function", even when it is essentially acting as a procedure.
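
A tiny sketch of the two roles, using made-up names (record, double, log). Both are defined with def and both are called "functions", but only one of them is called for the sake of its return value.

```python
log = []

def record(event):
    # Acts like a procedure: it does something and returns nothing useful.
    log.append(event)

def double(x):
    # Acts like a function: it calculates and returns a value.
    return 2 * x

procedure_result = record("started")   # implicitly None
function_result = double(21)
```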

Although the distinction between a procedure and a function has essentially vanished as a language construct, we still often use it when thinking about how a program works. For example, when I'm reading the source code of a program and see some function F, I try to figure out what F does. And I often can categorize it as a procedure or a function -- "the purpose of F is to do so-and-so" I will say to myself, or "the purpose of F is to calculate and return such-and-such".

So now I think we can see why many explanations of lambda are confusing.

First of all, the Python language itself masks the distinction between a function and a procedure.

Second, most tutorials introduce lambda as a tool for creating anonymous functions, things whose primary purpose is to calculate and return a result. The very first example that you see in most tutorials (this one included) shows how to write a lambda to return, say, the square root of x.

But this is not the way that lambda is most commonly used, and is not what most programmers are looking for when they Google "python lambda tutorial". The most common use for lambda is to create anonymous procedures for use in GUI callbacks. In those use cases, we don't care about what the lambda returns, we care about what it does.
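
Here is a sketch of that kind of use that runs without any GUI toolkit. FakeButton is a hypothetical stand-in that I made up for the illustration; it is only similar in spirit to a real widget that accepts a callback, such as a Tkinter Button with its command option.

```python
class FakeButton:
    """Hypothetical stand-in for a GUI button; not a real toolkit class."""

    def __init__(self, label, command):
        self.label = label
        self.command = command   # called with no arguments on a "click"

    def click(self):
        # A real toolkit would do this from its event loop, and would
        # ignore whatever the callback returns.
        self.command()

pressed = []

def on_press(name):
    # The handler we actually want to run, and it needs an argument.
    pressed.append(name)

# Each lambda wraps a call to on_press in a no-argument callable.
# Writing command=on_press("save") would call on_press immediately instead.
save_button = FakeButton("Save", command=lambda: on_press("save"))
quit_button = FakeButton("Quit", command=lambda: on_press("quit"))

save_button.click()
quit_button.click()
```

Nobody looks at what these lambdas return. What matters is that clicking a button appends the right name to pressed.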

This explains why most explanations of lambda are confusing for the typical Python programmer. He's trying to learn how to write code for some GUI framework: Tkinter, say, or wxPython. He runs across examples that use lambda, and wants to understand what he's seeing. He Googles for "python lambda tutorial". And he finds tutorials that start with examples that are entirely inappropriate for his purposes.

So, if you are such a programmer -- this tutorial is for you. I hope it helps. I'm sorry that we got to this point at the end of the tutorial, rather than at the beginning. Let's hope that someday, someone will write a lambda tutorial that, instead of beginning this way

Lambda is a tool for building anonymous functions.

begins something like this

Lambda is a tool for building callback handlers.

 


So there you have it. Yet another lambda tutorial.


Footnotes

 

[1] Some lambda tutorials:

 

[2] In some programming languages, such as C, assignment is an expression that returns the assigned value. This allows chained assignments such as x = y = a, in which the assignment y = a returns the value of a, which is then assigned to x. In Python, assignment is a statement and does not return a value. Chained assignment (or more precisely, code that looks like chained assignment statements) is recognized and supported as a special case of the assignment statement.
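
A quick way to see the difference, sketched for Python 3. The helper name compiles is made up; it just asks Python whether a piece of source text is syntactically valid.

```python
a = 10
x = y = a          # one assignment statement with two targets

def compiles(source):
    # Return True if the source text is syntactically valid Python.
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

chained_ok = compiles("x = y = a")
nested_ok = compiles("x = (y = a)")   # tries to use an assignment as an expression
```

The first form is accepted as a special case of the assignment statement. The second is rejected, because y = a is not an expression and so has no value to assign to x.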

 

[3] Python developers who are familiar with Lisp have argued for increasing the power of Python's lambda, moving it closer to the power of lambda in Lisp. There have been a number of proposals for a syntax for "multi-line lambda", and so on. Guido has rejected these proposals and blogged about some of his thinking about "pythonicity" and language features as a user interface. This led to an interesting discussion on Lambda the Ultimate, the programming languages weblog about lambda, and about the idea that programming languages have personalities.

Read-Ahead and Python Generators

One of the early classics of program design is Michael Jackson’s Principles of Program Design (1975), which introduced (what later came to be known as) JSP: Jackson Structured Programming.

Back in the 1970s, most business application programs did their work by reading and writing sequential files of records stored on tape. And it was common to see programs whose top-level control structure looked like (what I will call) the “standard loop”:

open input file F

while not EndOfFile on F:
    read a record
    process the record

close F

Jackson showed that this way of processing a sequence almost always created unnecessary problems in the program logic, and that a better way was to use what he called a "read-ahead" technique. 

In the read-ahead technique, a record is read from the input file immediately after the file is opened, and then a second "read" statement is executed after each record is processed.

This technique produces a program structure like this:

open input file F
read a record from F     # get first

while not EndOfFile on F:
    process the record
    read the next record from F  # get next

close F
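
In Python, the read-ahead structure might look something like the sketch below. It uses an in-memory io.StringIO object as a stand-in for the input file, and the three records in it are made up for the illustration.

```python
import io

# An in-memory stand-in for the input file F (the contents are made up).
F = io.StringIO("alpha\nbeta\ngamma\n")

processed = []

record = F.readline()                       # get first
while record != "":                         # readline() returns "" at end of file
    processed.append(record.rstrip("\n"))   # process the record
    record = F.readline()                   # get next

F.close()
```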

I won't try to explain when or why the read-ahead technique is preferable to the standard loop. That's out of scope for this blog entry, and a good book on JSP can explain that better than I can. So for now, let's just say that there are some situations in which the standard loop is the right tool for the job, and there are other situations in which read-ahead is the right tool for the job.

One of the joys of Python is that Python makes it so easy to do "standard loop" processing on a sequence such as a list or a string.

for item in sequence:
    processItem(item)

There are times, however, when you have a sequence that you need to process with the read-ahead technique.

With Python generators, it is easy to do. Generators make it easy to convert a sequence into a kind of object that provides both a get next method and an end-of-file mark.  That kind of object can easily be processed using the read-ahead technique.

Suppose that we have a list of items (called listOfItems) and we wish to process it using the read-ahead technique.

First, we create the "read-ahead" generator:

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None # return the "end of file mark" after the last item

Then we can write our code this way:

items = ReadAhead(listOfItems)
item = next(items)  # get first
while item is not None:  # None is the "end of file mark"
    processItem(item)
    item = next(items)  # get next
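
Using None as the end-of-file mark assumes that the sequence never contains None itself. If that can't be guaranteed, one possible variation is to use a private marker object instead. This is only a sketch; the names _END, read_ahead and process_all are made up for the illustration.

```python
_END = object()   # hypothetical unique end-of-file mark

def read_ahead(sequence, end=_END):
    for item in sequence:
        yield item
    yield end     # return the mark after the last item

def process_all(sequence):
    seen = []
    items = read_ahead(sequence)
    item = next(items)          # get first
    while item is not _END:
        seen.append(item)       # stand-in for processItem(item)
        item = next(items)      # get next
    return seen

result = process_all([3, 0, None, "", 7])
```

Because the loop tests for the marker by identity, items such as 0, None and the empty string are processed like any others.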

Here is a simple example.

We have a string (called "line") consisting of characters. Each line consists of zero or more indent characters, some text characters, and (optionally) a special SYMBOL character followed by some suffix characters. For those familiar with JSP, the input structure diagram looks like this.

line
    - indent
        * one indent char
    - text
        * one text char
    - possible suffix
        o no suffix
        o suffix
            - suffix SYMBOL
            - suffix
                * one suffix char

We want to parse the line into 3 groups: indent characters, text characters, and suffix characters.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = next(chars) # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = next(chars)

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = next(chars)

if c and c == SYMBOL:
    c = next(chars) # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = next(chars)
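
For anyone who wants to run the example, here is a sketch that wraps the same logic in a function. The values chosen for INDENT_CHAR and SYMBOL, the function name parse_line, and the sample lines are all assumptions made for the illustration. It uses the built-in next() function, which is available in Python 2.6 and later, including Python 3.

```python
INDENT_CHAR = " "   # assumed value
SYMBOL = "#"        # assumed value

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None  # the "end of file mark"

def parse_line(line):
    indentCount = 0
    textChars = []
    suffixChars = []

    chars = ReadAhead(line)
    c = next(chars)                              # get first

    while c is not None and c == INDENT_CHAR:    # indent characters
        indentCount += 1
        c = next(chars)

    while c is not None and c != SYMBOL:         # text characters
        textChars.append(c)
        c = next(chars)

    if c is not None and c == SYMBOL:
        c = next(chars)                          # read past the SYMBOL
        while c is not None:                     # suffix characters
            suffixChars.append(c)
            c = next(chars)

    return indentCount, "".join(textChars), "".join(suffixChars)

parsed = parse_line("  total#USD")
no_suffix = parse_line("plain")
```

With these assumed values, parsed is the tuple (2, "total", "USD"), and a line without the SYMBOL simply comes back with an empty suffix.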

In Java, what is the difference between an abstract class and an interface?

This post is about Java, and has nothing to do with Python.  I’ve posted it here so that it can be available to other folks who might find it useful. (And because I don’t have a Java blog!)

This is a question that comes up periodically. When I Googled for answers to it, I didn’t much like any of the answers that I found, so I wrote my own. For those who might be interested, here it is.

Q: What is the difference between an abstract class and an interface?

A: Good question.

To help explain, first let me introduce some terminology that I hope will help clarify the situation.

  • I will say that a fully abstract class is an abstract class in which all methods are abstract.
  • In contrast, a partially abstract class is an abstract class in which some of the methods are abstract, and some are concrete (i.e. have implementations).

Q: OK. So what is the difference between a fully abstract class and an interface?

A: Basically, none. They are the same.

Q: Then why does Java have the concept of an interface, as well as the concept of an abstract class?

A: Because Java doesn’t support multiple inheritance. Or rather I should say, it supports a limited form of multiple inheritance.

Q: Huh??!!!

A: Java has a rule that a class can extend only one abstract class, but can implement multiple interfaces (fully abstract classes).

There’s a reason why Java has such a rule.

Remember that a class can be an abstract class without being a fully abstract class. It can be a partially abstract class.

Now imagine that we have two partially abstract classes A and B. Both have some abstract methods, and both contain a non-abstract method called foo().

And imagine that Java allows a class to extend more than one abstract class, so we can write a class C that extends both A and B. And imagine that C doesn’t implement foo().

So now there is a problem. Suppose we create an instance of C and invoke its foo() method. Which foo() should Java invoke? A.foo() or B.foo()?

Some languages allow multiple inheritance, and have a way to answer that question. Python for example has a “method resolution order” algorithm that determines the order in which superclasses are searched, looking for an implementation of foo().
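
Since this is normally a Python blog, here is a small Python 3 sketch of how that plays out. The classes A, B and C mirror the ones in the discussion above; the return strings are made up for the illustration.

```python
class A:
    def foo(self):
        return "A.foo"

class B:
    def foo(self):
        return "B.foo"

class C(A, B):      # C inherits from both and does not define foo()
    pass

chosen = C().foo()
order = [cls.__name__ for cls in C.__mro__]
```

Here order is ['C', 'A', 'B', 'object'], so the search finds A's foo() first and chosen is "A.foo". Swapping the base classes to C(B, A) would make Python pick B's foo() instead.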

But the designers of Java made a different choice. They chose to make it a rule that a class can inherit from as many fully abstract classes as it wants, but can inherit from only one partially abstract class. That way, the question of which foo() to use will never come up.

This is a form of limited multiple inheritance. Basically, the rule says that you can inherit from (extend) as many classes as you want, but if you do, only one of those classes can contain concrete (implemented) methods.

So now we do a little terminology substitution:

abstract class = a class that contains at least one abstract method, and can also contain concrete (implemented) methods

interface =  a class that is fully abstract — it has abstract methods, but no concrete methods

With those substitutions, you get the familiar Java rule:

A class can extend at most one abstract class, but may implement many interfaces.

That is, Java supports a limited form of multiple inheritance.