Python3 pickling

Recently I was converting some old Python2 code to Python3 and I ran across a problem pickling and unpickling.

I guess I would say it wasn’t a major problem because I found the solution fairly quickly with a bit of googling around.

Still, I think the problem and its solution are worth a quick note.  Others will stumble across this problem in the future, especially because there are code examples floating around (in printed books and online posts) that will lead new Python programmers to make this very same mistake.

So let’s talk about pickling.

Suppose you want to “pickle” an object — dump it to a pickle file for persistent storage.

When you pickle an object, you do two things.

  • You open the file that you want to use as the pickle file. The open(…) returns a file handle object.
  • You pass the object that you want to pickle, and the file handle object, to pickle.

Your code might look something like this. Note that this code is wrong. See below.

fileHandle = open(pickleFileName, "w")
pickle.dump(objectToBePickled, fileHandle)

When I wrote code like this, I got back this error message:

Pickler(file, protocol, fix_imports=fix_imports).dump(obj)
TypeError: must be str, not bytes

Talk about a crappy error message!!!

After banging my head against the wall for a while, I googled around and quickly found a very helpful answer on StackOverflow.

The bottom line is that a Python pickle file is (and always has been) a byte stream. Which means that you should always open a pickle file in binary mode: “wb” to write it, and “rb” to read it. The Python docs contain correct example code.

My old code worked just fine running under Python2 (on Windows).  But with Python3’s new strict separation of strings and bytes, it broke. Changing “w” to “wb”, and “r” to “rb”, fixed it. 


One person who posted a question about this problem on the Python forum was aware of the issue, but confused because he was trying to pickle a string.

import pickle
a = "blah"
file = open('state', 'w')
pickle.dump(a,file)

I know of one easy way to solve this is to change the operation argument from ‘w’ to ‘wb’ but I AM using a string not bytes! And none of the examples use ‘wb’ (I figured that out separately) so I want to have an understanding of what is going on here.

Basically, regardless of the kind of object that you are pickling (even a string object), the object will be converted to a bytes representation and pickled as a byte stream. Which means that you always need to use “rb” and “wb”, regardless of the kind of object that you are pickling.

Yet Another Lambda Tutorial

modified to use the WordPress [sourcecode] tag — 2012-01-14

There are a lot of tutorials[1] for Python’s lambda out there. One that I stumbled across recently and really found helpful was Mike Driscoll’s discussion of lambda on the Mouse vs Python blog.

When I first started learning Python, one of the most confusing concepts to get my head around was the lambda statement. I’m sure other new programmers get confused by it as well…

Mike’s discussion is excellent: clear, straight-forward, with useful illustrative examples. It helped me — finally — to grok lambda, and led me to write yet another lambda tutorial.

 


Lambda: a tool for building functions

Basically, Python’s lambda is a tool for building functions (or more precisely, function objects). That means that Python has two tools for building functions: def and lambda.

Here’s an example. You can build a function in the normal way, using def, like this:

def square_root(x): return math.sqrt(x)

or you can use lambda:

square_root = lambda x: math.sqrt(x)

Here are a few other interesting examples of lambda:

sum = lambda x, y:   x + y   #  def sum(x,y): return x + y
out = lambda   *x:   sys.stdout.write(" ".join(map(str,x)))
lambda event, name=button8.getLabel(): self.onButton(event, name)

 


What is lambda good for?

A question that I’ve had for a long time is: What is lambda good for? Why do we need lambda?

The answer is:

  • We don’t need lambda, we could get along all right without it. But…
  • there are certain situations where it is convenient — it makes writing code a bit easier, and the written code a bit cleaner.

What kind of situations?

Well, situations in which we need a simple one-off function: a function that is going to be used only once.

Normally, functions are created for one of two purposes: (a) to reduce code duplication, or (b) to modularize code.

  • If your application contains duplicate chunks of code in various places, then you can put one copy of that code into a function, give the function a name, and then — using that function name — call it from various places in your code.
  • If you have a chunk of code that performs one well-defined operation — but is really long and gnarly and interrupts the otherwise readable flow of your program — then you can pull that long gnarly code out and put it into a function all by itself.

But suppose you need to create a function that is going to be used only once — called from only one place in your application. Well, first of all, you don’t need to give the function a name. It can be “anonymous”. And you can just define it right in the place where you want to use it. That’s where lambda is useful.

But, but, but… you say.

  • First of all — Why would you want a function that is called only once? That eliminates reason (a) for making a function.
  • And the body of a lambda can contain only a single expression. That means that lambdas must be short. So that eliminates reason (b) for making a function.

What possible reason could I have for wanting to create a short, anonymous function?

Well, consider this snippet of code that uses lambda to define the behavior of buttons in a Tkinter GUI interface. (This example is from Mike’s tutorial.)

frame = tk.Frame(parent)
frame.pack()

btn22 = tk.Button(frame, 
        text="22", command=lambda: self.printNum(22))
btn22.pack(side=tk.LEFT)

btn44 = tk.Button(frame, 
        text="44", command=lambda: self.printNum(44))
btn44.pack(side=tk.LEFT)

The thing to remember here is that a tk.Button expects a function object as an argument to the command parameter. That function object will be the function that the button calls when it (the button) is clicked. Basically, that function specifies what the GUI will do when the button is clicked.

So we must pass a function object in to a button via the command parameter. And note that — since different buttons do different things — we need a different function object for each button object. Each function will be used only once, by the particular button to which it is being supplied.

So, although we could code (say)

def __init__(self, parent):
    """Constructor"""
    frame = tk.Frame(parent)
    frame.pack()

    btn22 = tk.Button(frame, 
        text="22", command=self.buttonCmd22)
    btn22.pack(side=tk.LEFT)

    btn44 = tk.Button(frame, 
        text="44", command=self.buttonCmd44)
    btn44.pack(side=tk.LEFT)

def buttonCmd22(self):
    self.printNum(22)

def buttonCmd44(self):
    self.printNum(44)

it is much easier (and clearer) to code

def __init__(self, parent):
    """Constructor"""
    frame = tk.Frame(parent)
    frame.pack()

    btn22 = tk.Button(frame, 
        text="22", command=lambda: self.printNum(22))
    btn22.pack(side=tk.LEFT)

    btn44 = tk.Button(frame, 
        text="44", command=lambda: self.printNum(44))
    btn44.pack(side=tk.LEFT)

When a GUI program has this kind of code, the button object is said to “call back” to the function object that was supplied to it as its command.

So we can say that one of the most frequent uses of lambda is in coding “callbacks” to GUI frameworks such as Tkinter and wxPython.

 


This all seems pretty straight-forward. So…

Why is lambda so confusing?

There are four reasons that I can think of.

First Lambda is confusing because: the requirement that a lambda can take only a single expression raises the question: What is an expression?

A lot of people would like to know the answer to that one. If you Google around a bit, you will see a lot of posts from people asking “In Python, what’s the difference between an expression and a statement?”

One good answer is that an expression returns (or evaluates to) a value, whereas a statement does not. Unfortunately, the situation is muddled by the fact that in Python an expression can also be a statement. And we can always throw a red herring into the mix — assigment statements like a = b = 0 suggest that Python supports chained assignments, and that assignment statements return values. (They do not. Python isn’t C.)[2]

In many cases when people ask this question, what they really want to know is: What kind of things can I, and can I not, put into a lambda?

And for that question, I think a few simple rules of thumb will be sufficient.

  • If it doesn’t return a value, it isn’t an expression and can’t be put into a lambda.
  • If you can imagine it in an assignment statement, on the right-hand side of the equals sign, it is an expression and can be put into a lambda.

Using these rules means that:

  1. Assignment statements cannot be used in lambda. In Python, assignment statements don’t return anything, not even None (null).
  2. Simple things such as mathematical operations, string operations, list comprehensions, etc. are OK in a lambda.
  3. Function calls are expressions. It is OK to put a function call in a lambda, and to pass arguments to that function. Doing this wraps the function call (arguments and all) inside a new, anonymous function.
  4. In Python 3, print became a function, so in Python 3+, print(…) can be used in a lambda.
  5. Even functions that return None, like the print function in Python 3, can be used in a lambda.
  6. Conditional expressions, which were introduced in Python 2.5, are expressions (and not merely a different syntax for an if/else statement). They return a value, and can be used in a lambda.
    lambda: a if some_condition() else b
    lambda x: ‘big’ if x > 100 else ‘small’

 

Second Lambda is confusing because: the specification that a lambda can take only a single expression raises the question: Why? Why only one expression? Why not multiple expressions? Why not statements?

For some developers, this question means simply Why is the Python lambda syntax so weird? For others, especially those with a Lisp background, the question means Why is Python’s lambda so crippled? Why isn’t it as powerful as Lisp’s lambda?

The answer is complicated, and it involves the “pythonicity” of Python’s syntax. Lambda was a relatively late addition to Python. By the time that it was added, Python syntax had become well established. Under the circumstances, the syntax for lambda had to be shoe-horned into the established Python syntax in a “pythonic” way. And that placed certain limitations on the kinds of things that could be done in lambdas.

Frankly, I still think the syntax for lambda looks a little weird. Be that as it may, Guido has explained why lambda’s syntax is not going to change. Python will not become Lisp.[3]

 

Third Lambda is confusing because: lambda is usually described as a tool for creating functions, but a lambda specification does not contain a return statement.

The return statement is, in a sense, implicit in a lambda. Since a lambda specification must contain only a single expression, and that expression must return a value, an anonymous function created by lambda implicitly returns the value returned by the expression. This makes perfect sense.

Still — the lack of an explicit return statement is, I think, part of what makes it hard to grok lambda, or at least, hard to grok it quickly.

 

Fourth Lambda is confusing because: tutorials on lambda typically introduce lambda as a tool for creating anonymous functions, when in fact the most common use of lambda is for creating anonymous procedures.

Back in the High Old Times, we recognized two different kinds of subroutines: procedures and functions. Procedures were for doing stuff, and did not return anything. Functions were for calculating and returning values. The difference between functions and procedures was even built into some programming languages. In Pascal, for instance, procedure and function were different keywords.

In most modern languages, the difference between procedures and functions is no longer enshrined in the language syntax. A Python function, for instance, can act like a procedure, a function, or both. The (not altogether desirable) result is that a Python function is always referred to as a “function”, even when it is essentially acting as a procedure.

Although the distinction between a procedure and a function has essentially vanished as a language construct, we still often use it when thinking about how a program works. For example, when I’m reading the source code of a program and see some function F, I try to figure out what F does. And I often can categorize it as a procedure or a function — “the purpose of F is to do so-and-so” I will say to myself, or “the purpose of F is to calculate and return such-and-such”.

So now I think we can see why many explanations of lambda are confusing.

First of all, the Python language itself masks the distinction between a function and a procedure.

Second, most tutorials introduce lambda as a tool for creating anonymous functions, things whose primary purpose is to calculate and return a result. The very first example that you see in most tutorials (this one included) shows how to write a lambda to return, say, the square root of x.

But this is not the way that lambda is most commonly used, and is not what most programmers are looking for when they Google “python lambda tutorial”. The most common use for lambda is to create anonymous procedures for use in GUI callbacks. In those use cases, we don’t care about what the lambda returns, we care about what it does.

This explains why most explanations of lambda are confusing for the typical Python programmer. He’s trying to learn how to write code for some GUI framework: Tkinter, say, or wxPython. He runs across examples that use lambda, and wants to understand what he’s seeing. He Googles for “python lambda tutorial”. And he finds tutorials that start with examples that are entirely inappropriate for his purposes.

So, if you are such a programmer — this tutorial is for you. I hope it helps. I’m sorry that we got to this point at the end of the tutorial, rather than at the beginning. Let’s hope that someday, someone will write a lambda tutorial that, instead of beginning this way

Lambda is a tool for building anonymous functions.

begins something like this

Lambda is a tool for building callback handlers.

 


So there you have it. Yet another lambda tutorial.


Footnotes

 

[1] Some lambda tutorials:

 

[2] In some programming languages, such as C, an assignment statement returns the assigned value. This allows chained assignments such as x = y = a, in which the assignment statement y = a returns the value of a, which is then assigned to x. In Python, assignment statements do not return a value. Chained assignment (or more precisely, code that looks like chained assignment statements) is recognized and supported as a special case of the assignment statement.

 

[3] Python developers who are familiar with Lisp have argued for increasing the power of Python’s lambda, moving it closer to the power of lambda in Lisp. There have been a number of proposals for a syntax for “multi-line lambda”, and so on. Guido has rejected these proposals and blogged about some of his thinking about “pythonicity” and language features as a user interface. This led to an interesting discussion on Lambda the Ultimate, the programming languages weblog about lambda, and about the idea that programming languages have personalities.

Read-Ahead and Python Generators

One of the early classics of program design is Michael Jackson’s Principles of Program Design (1975), which introduced (what later came to be known as) JSP: Jackson Structured Programming.

Back in the 1970’s, most business application programs did their work by reading and writing sequential files of records stored on tape. And it was common to see programs whose top-level control structure looked like (what I will call) the “standard loop”:

open input file F

while not EndOfFile on F:
    read a record
    process the record

close F

Jackson showed that this way of processing a sequence almost always created unnecessary problems in the program logic, and that a better way was to use what he called a “read-ahead” technique. 

In the read-ahead technique, a record is read from the input file immediately after the file is opened, and then a second “read” statement is executed after each record is processed.

This technique produces a program structure like this:

open input file F
read a record from F     # get first

while not EndOfFile on F:
    process the record
    read the next record from F  # get next

close F

I won’t try to explain when or why the read-ahead technique is preferable to the standard loop. That’s out of scope for this blog entry, and a good book on JSP can explain that better than I can. So for now, let’s just say that there are some situations in which the standard loop is the right tool for the job, and there are other situations in which read-ahead is the right tool for the job.

One of the joys of Python is that Python makes it so easy to do “standard loop” processing on a sequence such as a list or a string.

for item in sequence:
    processItem(item)

There are times, however, when you have a sequence that you need to process with the read-ahead technique.

With Python generators, it is easy to do. Generators make it easy to convert a sequence into a kind of object that provides both a get next method and an end-of-file mark.  That kind of object can easily be processed using the read-ahead technique.

Suppose that we have a list of items (called listOfItems) and we wish to process it using the read-ahead technique.

First, we create the “read-ahead” generator:

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None # return the "end of file mark" after the last item

Then we can write our code this way:

items = ReadAhead(listOfItems)
item = items.next()  # get first
while item:
    processItem(item)
    item = items.next()  # get next

Here is a simple example.

We have a string (called “line”) consisting of characters. Each line consists of zero or more indent characters, some text characters, and (optionally) a special SYMBOL character followed by some suffix characters. For those familiar with JSP, the input structure diagram looks like this.

line
    - indent
        * one indent char
    - text
        * one text char
    - possible suffix
        o no suffix
        o suffix
            - suffix SYMBOL
            - suffix
                - one suffix char

We want to parse the line into 3 groups: indent characters, text characters, and suffix characters.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()

In Java, what is the difference between an abstract class and an interface?

This post is about Java, and has nothing to do with Python.  I’ve posted it here so that it can be available to other folks who might find it useful. (And because I don’t have a Java blog!)

In Java, what is the difference between an abstract class and an interface?

This is a question that comes up periodically. When I Googled for answers to it, I didn’t very much like any of the answers that I found, so I wrote my own. For those who might be interested, here it is.

Q: What is the difference between an abstract class and an interface?

A: Good question.

To help explain, first let me introduce some terminology that I hope will help clarify the situation.

  • I will say that a fully abstract class is an abstract class in which all methods are abstract.
  • In contrast, a partially abstract class is an abstract class in which some of the methods are abstract, and some are concrete (i.e. have implementations).

Q: OK. So what is the difference between a fully abstract class and an interface?

A: Basically, none. They are the same.

Q: Then why does Java have the concept of an interface, as well as the concept of an abstract class?

A: Because Java doesn’t support multiple inheritance. Or rather I should say, it supports a limited form of multiple inheritance.

Q: Huh??!!!

A: Java has a rule that a class can extend only one abstract class, but can implement multiple interfaces (fully abstract classes).

There’s a reason why Java has such a rule.

Remember that a class can be an abstract class without being a fully abstract class. It can be a partially abstract class.

Now imagine that that we have two partially abstract classes A and B. Both have some abstract methods, and both contain a non-abstract method called foo().

And imagine that Java allows a class to extend more than one abstract class, so we can write a class C that extends both A and B. And imagine that C doesn’t implement foo().

So now there is a problem. Suppose we create an instance of C and invoke its foo() method. Which foo() should Java invoke? A.foo() or B.foo()?

Some languages allow multiple inheritance, and have a way to answer that question. Python for example has a “method resolution order” algorithm that determines the order in which superclasses are searched, looking for an implementation of foo().

But the designers of Java made a different choice. They choose to make it a rule that a class can inherit from as many fully abstract classes it wants, but can inherit from only one partially abstract class. That way, the question of which foo() to use will never come up.

This is a form of limited multiple inheritance. Basically, the rule says that you can inherit from (extend) as many classes as you want, but if you do, only one of those classes can contain concrete (implemented) methods.

So now we do a little terminology substitution:

abstract class = a class that contains at least one abstract method, and can also contain concrete (implemented) methods

interface =  a class that is fully abstract — it has abstract methods, but no concrete methods

With those substitutions, you get the familiar Java rule:

A class can extend at most one abstract class, but may implement many interfaces.

That is, Java supports a limited form of multiple inheritance.

Newline conversion in Python 3

I use Python on both Windows and Unix.  Occasionally when running on Windows  I need to read in a file containing Windows newlines and write it out with Unix/Linux newlines.  And sometimes when running on Unix, I need to run the newline conversion in the other direction.

Prior to Python 3, the accepted way to do this was to read data from the file in binary mode, convert the newline characters in the data, and then write the data out again in binary mode. The Tools/Scripts directory contained two scripts (crlf.py and lfcr.py) with illustrative examples. Here, for instance is the key code from crlf.py (Windows to Unix conversion)

        data = open(filename, "rb").read()
        newdata = data.replace("\r\n", "\n")
        if newdata != data:
            f = open(filename, "wb")
            f.write(newdata)
            f.close()

But if you try to do that with Python 3+, it won’t work.

The key to what will work is the new “newline” argument for the built-in file open() function. It is documented here.

The key point from that documentation is this:

newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

  • On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

So now when I want to convert a file from Windows-style newlines to Linux-style newlines, I do this:

filename = "NameOfFileToBeConverted"
fileContents = open(filename,"r").read()
f = open(filename,"w", newline="\n")
f.write(fileContents)
f.close()

Why import star is a bad idea

When I was learning Python, I of course read the usual warnings. They told me: You can do

from something_or_other import *

but don’t do it. Importing star (asterisk, everything) is a Python Worst Practice.

Don’t “import star”!

But I was young and foolish, and my scripts were short. “How bad can it be?” I thought. So I did it anyway, and everything seemed to work out OK.

Then, like they always do, the quick-and-dirty scripts grew into programs, and then grew into a full-blown system. Before long I had a monster on my hands, and I needed a tool that would look through all of the scripts and programs in the system and do (at least) some basic error checking.

I’d heard good things about pyflakes, so I thought I’d give it a try.

It worked very nicely. It found the basic kinds of errors that I wanted it to find. And it was fast, so I could run it through a directory containing a lot of .py files and it would come out alive and grinning on the other side.

During the process, I learned that pyflakes is designed to be a bit on the quick and dirty side itself, with the quick making up for the dirty. As part of this process, it basically ignores star imports.  Oh, it warns you about the star imports.  What I means is — it doesn’t try to figure out what is imported by the star import.

And that has interesting consequences.

Normally, if your file contains an undefined name — say TARGET_LANGAGE — pyflakes will report it as an error.

But if your file includes any star imports, and your script contains an undefined name like TARGET_LANGAGE, pyflakes won’t report the undefined name as an error.

My hypothesis is that pyflakes doesn’t report TARGET_LANGAGE as undefined because it can’t tell whether TARGET_LANGAGE is truly undefined, or was pulled in by some star import.

This is perfectly understandable. There is no way that pyflakes is going to go out, try to find the something_or_other module, and analyze it to see if it contains TARGET_LANGAGE. And if it doesn’t, but contains star imports, go out and look for all of the modules that something_or_other star imports, and then analyze them. And so on, and so on, and so on. No way!

So, since pyflakes can’t tell whether TARGET_LANGAGE is (a) an undefined name or (b) pulled in via some star import, it does not report TARGET_LANGAGE as an undefined name. Basically, pyflakes ignores it.

And that seems to me to be a perfectly reasonable way to do business, not just for pyflakes but for anything short of the Super Deluxe Hyperdrive model static code analyzer.

The takeaway lesson for me was that using star imports will cripple a static code analyzer. Or at least, cripple the feature that I most want a code analyser for… to find and report undefined names.

So now I don’t use star imports anymore.

There are a variety of alternatives.  The one that I prefer is the “import x as y” feature. So I can write an import statement this way:

import   some_module_with_a_big_long_hairy_name   as    bx

 and rather than coding

x = some_module_with_a_big_long_hairy_name.vector

I can code

x = bx.vector

Works for me.

Learning Subversion: the mystery of .svn

If you are googling for “Subversion command line tutorial introduction for beginners”, read this first! This is for all Subversion newbies.

After using PVCS for many years, our office recently started moving to Subversion. Which means that recently I started trying to learn Subversion.

I was pressed for time. I was in a hurry. I was looking for something that would get me up and running quickly.

First, I got a copy of the free online Subversion documentation Version Control with Subversion.

Second, I got a copy of Mike Mason’s excellent Pragmatic Version Control Using Subversion (2nd ed.).

Third, I googled the Web looking for the kinds of things that you’d expect: Subversion tutorial introduction beginning beginners commands. And I found some good stuff.

But even after reading many of the online Subversion tutorials, I still could not grok Subversion. Different commands seemed to be doing the same thing, and the tutorials used a lot of terms that were never defined or explained: “versioned”, “unversioned”, “under version control”, and so on.

Gradually, I realized the problem. Many of the online tutorials and introductions try to explain how to use Subversion without explaining how Subversion works. They tell you what commands to issue, and when, but they don’t tell you why you are issuing the command at this particular time, or what the command is doing under the covers.

So I had to dig deeper.

What I found was that there was one particular piece of information missing from most of the tutorials and introductions that I found. If you don’t have that piece, nothing about Subversion makes much sense. With it, all of the other pieces of the puzzle fall into place.

So the purpose of this post is to tell you — the Subversion newbie — about that piece.


How Subversion Works

The basic unit of work for Subversion is a project.

A project is basically a directory.

Technically, a project is a subtree: a directory, including all of its files and subdirectories, and all of those subdirectories’ files and subdirectories, etc. But in order to keep things simple, I will talk as if a project is just a directory.

When you are working on a Subversion project, there are actually two directories that you are working with.

  • There is the repository, which is a directory (controlled by Subversion and running on a server somewhere) that contains the master copy of the project directory.
  • There is your own personal workingCopy, which is a directory (controlled by you) that exists on the file system of your own machine (that is, on the hard drive of your own PC).

But (and this is the piece that was missing) a workingCopy directory is not an ordinary directory.

The use of the expression “working copy” is one of the most confusing things about Subversion tutorials and even the Subversion documentation itself. When you encounter the expression “working copy” you assume that you are dealing with an ordinary filesystem directory that is being used to hold a copy of the files in your project. Not so!

In the context of Subversion, “working copy” is a very specific term of art — a Subversion-specific technical term. That is why in this post I avoid the expression “working copy” and instead use workingCopy.

So what is a Subversion workingCopy directory?

A workingCopy directory is a directory that has a hidden subdirectory called “.svn”.

The hidden .svn directory is what Subversion calls an “administrative directory”.

Note the leading period in “.svn”. On Unix systems, a directory whose name begins with a dot is a “hidden” (or “dotfile”) directory.

On your PC, the project’s top-level workingCopy directory has a hidden .svn subdirectory. And each of the subdirectories of the workingCopy directory (if it has any), and each of their subdirectories (if they have any), and so on, has its own hidden .svn subdirectory.

Having a hidden .svn subdirectory is what makes an ordinary file system directory into a Subversion workingCopy directory, a directory that Subversion can recognize and manage.

So, for a project named “ProjectX” the workingCopy directory will be named “ProjectX”. It might look like this:

	ProjectX [DIRECTORY]
		projectx.py
		projectx_constants.py
		.svn [DIRECTORY]

What is in a .svn subdirectory? What does a Subversion administrative directory contain?

The Subversion documentation says this about workingCopy directories:

A Subversion working copy is an ordinary directory tree on your local system, containing a collection of files. You can edit these files however you wish, and if they’re source code files, you can compile your program from them in the usual way. …

A working copy also contains some extra files, created and maintained by Subversion, to help it carry out these commands. In particular, each directory in your working copy contains a subdirectory named .svn, also known as the working copy’s administrative directory. The files in each administrative directory help Subversion recognize which files contain unpublished changes, and which files are out of date with respect to others’ work.

Here’s another clue: a passage from Pragmatic Version Control Using Subversion:

Subversion has a highly efficient network protocol and stores pristine copies of your working files locally, allowing a user to see what changes they’ve made without even contacting the server [where the central repository is stored].

So now we know what a Subversion administrative directory contains.

The .svn admin directory contains pristine (unchanged) copies of files that were downloaded from the repository. (It contains a few other things, too.)

Earlier, I said “When you are working on a Subversion project, there are actually TWO directories that you are working with… the repository and the working copy.” Now I want to change that. It would be more accurate to say that there are really THREE directories that you are working with:

  • the main ProjectX repository on the server
  • the ProjectX workingCopy directory on your PC, which contains editable (and possibly changed) copies of the files in the project …and also …
  • the hidden Subversion administrative directory, which contains a (pristine, unchanged, and uneditable) copies of the files in the main ProjectX repository on the server.

That means that, on your PC, the ProjectX workingCopy directory looks like this.

	ProjectX [DIRECTORY]
		projectx.py
		projectx_constants.py
		.svn [DIRECTORY]
			projectx.py
			projectx_constants.py

Now things start to become clearer…

Subversion introductions and tutorials often say things that are rather cryptic to someone who is trying to learn Subversion. Even HELP questions and FAQs posted on the Web can be mystifying. Now let’s see how some of those things make sense in light of our knowledge of the .svn subdirectory.


Showing file changes

The reason that Subversion can allow “a user to see what changes they’ve made without even contacting the server” is that the Subversion diff works only on the workingCopy directory on your own PC.

When Subversion shows file changes (that is, shows diffs) it is actually showing diffs between

  • your edited files in the workingCopy directory, and
  • the pristine copies of the those files that are being held in the .svn subdirectory of the workingCopy directory.

“unversioned” files vs. files “under version control”

Suppose that I make a change to one of my files: to ProjectX/projectx_constants.py.

When I make the changes, my editor automatically creates a backup file: ProjectX/projectx_constants.py.bak.

At this point, ProjectX/projectx_constants.py.bak is what is called an “unversioned” file. It exists in the ProjectX directory, but not in the ProjectX/.svn directory, so Subversion knows nothing about it. That makes sense: we don’t want projectx_constants.py.bak to be considered a project file anyway.

But suppose I want to add a new module to the project, called projectx_utils.py. If I simply create the file in the ProjectX folder, it will be an “unversioned” file in just the same way that projectx_constants.py.bak is an unversioned file: it will not exist in the ProjectX/.svn directory, so Subversion knows nothing about it.

So that is why Subversion has a “svn add” command. The command svn add projectx_utils.py will add the file to the project by copying ProjectX/project_utils.py to ProjectX/.svn/project_utils.py. At this point — after it has been added to the .svn subdirectory — the file is said to be “under version control”.

Note that — at this point — although project_utils.py has been “added” to the copy of the project in the workingCopy, the main repository still doesn’t know anything about it — project_utils.py hasn’t been added to the central repository on the server.

When I “commit” my changes, I send the files from my workingCopy to the main repository. Only after that happens does the new file truly become part of the project by becoming one of the files in the central repository.


Help! I’ve lost my .svn directory and I can’t get up!

Because a Subversion workingCopy directory needs a .svn subdirectory in order to work properly, you can have problems with Subversion if you accidentally delete the .svn subdirectory.


What is a “clean copy”?

In various tutorials, and in the Subversion docs, you will run across the expression “clean copy”. A “clean copy” is a copy of only the source-code files, without the .svn directory.

An introduction to Subversion (which is also a nice introduction to the TortoiseSVN open-source Windows GUI client for Subversion) explains things nicely.

If you look closely in your working copy, you may see an .svn folder in each folder of your working copy. The folders are hidden folders, so depending on the Windows settings you may not see them, but they are there. Those folders contain the information that Subversion uses to link your working copy to the repository.

If ever you need to get a copy of what’s in the repository, but without all the .svn folders (say for example you’re ready to publish it or hand the files over to your client), you can do an “SVN Export” into a new folder to get a “clean” copy of what’s in your repository.

Having the concept of a “clean copy” makes it easier to understand the next question…


Checkout vs. Export

A Frequently Asked Question about Subversion is What’s the difference between a “checkout” and an “export” from the repository?

The CollabNet docs say this:

They are the same except that Export doesn’t include the.svn folders and Checkout does include them. Also note that an export cannot be updated.

When you do a Subversion checkout, every folder and subfolder contains an .svn folder. These.svn folders contain clean copies of all files checked out and .tmp directories that contain temporary files created during checkouts, commits, update and other operations.

An Export will be about half the size of a Checkout due to the absence of the.svn folders that duplicate all content.

Note that the reason an exported folder cannot be updated is that the update command updates the .svn directory of a workingCopy, but an export does not create an .svn directory.

Note also that you can export from either the main repository or from the workingCopy .svn directory. See Subversion docs for export.


The (import, checkout) usage pattern for getting started with Subversion

Most “getting started with Subversion” tutorials start the same way. Assuming that you have some project files that you want to put into Subversion, you are told to:

  • do an import
  • do a checkout

in that order.

What you are not told is why you start with those two particular actions in that particular order.

But by now, knowing about the hidden .svn administrative directory and what it does, you can probably figure that out.

Import is the opposite of export. It takes a directory of files — a clean copy of the files, if you will — from your hard drive and copies them into the central Subversion repository on the server.

Always the next step is to do a checkout. Basically a checkout copies the project files from the central repository to a workingCopy directory on your PC. If the workingCopy directory does not exist on your PC, it is created.

The workingCopy directory contains everything you need in order to be able to work with Subversion, including an .svn administrative directory. As the CollabNet documentation (quoted earlier) says:

When you do a Subversion checkout, every folder and subfolder contains an .svn folder. These.svn folders contain clean copies of all files checked out and .tmp directories that contain temporary files created during checkouts, commits, update and other operations.

So the second step — the checkout command — is absolutely necessary in order to get started. It creates a workingCopy directory containing the project files. Only after that happens are your files properly “under version control”.


checkin vs. commit

PVCS (and SourceSafe, and many other version control systems) work on a locking model. “Checking out” a file from the repository means that you get a local working copy of the file, and you lock the file in the repository. At that point, nobody can unlock it except you. Checking out a file gives you exclusive update privileges on it until you check it back in.

“Checking in” a file means that you copy your local working copy of the file back into the repository and you unlock the file in the repository.

It is possible to copy your local working copy of the file into the repository without unlocking the file in the repository. When you do this, you are in a sense “updating” the repository from the working copy.

Because of my familiarity with this kind of version control, I had a certain “mental model” of how a version control system works. And because of that mental model, many of the Subversion tutorials were quite confusing.

One source of confusion is the fact that (as we will see in the next section) the word “updating” in the context of Subversion means exactly the opposite of what it means in the context of PVCS.

One of the Subversion tutorials that I found said that you must checkout your workingCopy from the main repository, because you can’t do a checkin back to the main repository if you hadn’t checked it out. This was very confusing to an ex-PVCS user.

First, it suggested that Subversion works like PVCS: that there is a typical round-trip usage pattern consisting of

  • checking out (locking)
  • editing
  • checking in (unlocking)

But Subversion doesn’t work like this, at least not by default.

What the tutorial was trying to say, I think, was that in order to work with Subversion, you must create a workingCopy directory (that is, a directory that contains an .svn administrative subdirectory). And the way to create a workingCopy directory is to run a svn checkout command against the repository on the server.

Second, explaining things this way was confusing because Subversion doesn’t really have a checkin command. It does have a commit command, which some tutorials call a “checkin” command. But that command does not do the same thing as a PVCS checkin.

Ignore the fact that the short form of the commit command is ci (which stood for “checkin” in an earlier incarnation of Subversion). A Subversion “checkin” is the same thing as a “commit”, and has nothing to do with locking. It would really be helpful if all Subversion tutorials would stop using the term “checkin” and replace it with “commit”.

If you are used to working with a VCS that uses the “check out, edit, check in” paradigm, and you come to understand that Subversion’s commit is not the same as your old familiar check in, then your next question will almost certainly be:

Once you checkout a project into a working folder, how do you check it in a la SourceSafe? [Or PVCS, or other lock-based VCSs? — Steve Ferg]

I know there is “commit” which puts my changes into the respository, but I still have the files checked out under my working folder. What if I am done with a particular file and I don’t want to have it checked out? How do I check it back in?

You can read the answer here.


What does svn update do?

EXECUTIVE SUMMARY: svn update updates the workingCopy, not the repository.

The Subversion docs describe the update command this way:

When working on a project with a team, you’ll want to update your working copy to receive any changes other developers on the project have made since your last update. Use svn update to bring your working copy into sync with the latest revision in the repository:

Basically, what the update command does is to copy the project files from the central repository down to the .svn directory in your workingCopy.

This is something you should do frequently, because you don’t want the files in your workingCopy/.svn directory to get too far out of sync with the file in the central repository. And you don’t want to try to commit files if your workingCopy/.svn is out of sync with the central repository.

That means that as a general rule, you should always run an svn update:

  • just before you start making a new round of changes to your workingCopy, and
  • just before doing a commit.

Now, having mastered the concept of an .svn directory, we can Understand Many Things, even arcana such as why Serving websites from svn checkout considered harmful.

So that’s it.

This post contains information written by a Subversion newbie in the hopes that it will be useful to other Subversion newbies. But of course, having been written by a newb, there are all sorts of ways it could be wrong.

If you’re a Subversion expert (and it doesn’t take much to be more expert than I am) and you see something wrong, confused, or misleading here, please leave a comment. I, and future generations of Subversion newbies, will thank you for it.

Thanks to my co-workers Mark Thomas and Jason Herman for reviewing an earlier draft of this post.