Read-Ahead and Python Generators

One of the early classics of program design is Michael Jackson’s Principles of Program Design (1975), which introduced (what later came to be known as) JSP: Jackson Structured Programming.

Back in the 1970’s, most business application programs did their work by reading and writing sequential files of records stored on tape. And it was common to see programs whose top-level control structure looked like (what I will call) the “standard loop”:

open input file F

while not EndOfFile on F:
    read a record
    process the record

close F

Jackson showed that this way of processing a sequence almost always created unnecessary problems in the program logic, and that a better way was to use what he called a “read-ahead” technique. 

In the read-ahead technique, a record is read from the input file immediately after the file is opened, and then a second “read” statement is executed after each record is processed.

This technique produces a program structure like this:

open input file F
read a record from F     # get first

while not EndOfFile on F:
    process the record
    read the next record from F  # get next

close F

I won’t try to explain when or why the read-ahead technique is preferable to the standard loop. That’s out of scope for this blog entry, and a good book on JSP can explain that better than I can. So for now, let’s just say that there are some situations in which the standard loop is the right tool for the job, and there are other situations in which read-ahead is the right tool for the job.

One of the joys of Python is that Python makes it so easy to do “standard loop” processing on a sequence such as a list or a string.

for item in sequence:
    processItem(item)

There are times, however, when you have a sequence that you need to process with the read-ahead technique.

With Python generators, it is easy to do. Generators make it easy to convert a sequence into a kind of object that provides both a get next method and an end-of-file mark.  That kind of object can easily be processed using the read-ahead technique.

Suppose that we have a list of items (called listOfItems) and we wish to process it using the read-ahead technique.

First, we create the “read-ahead” generator:

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None # return the "end of file mark" after the last item

Then we can write our code this way:

items = ReadAhead(listOfItems)
item = items.next()  # get first
while item:
    processItem(item)
    item = items.next()  # get next

Here is a simple example.

We have a string (called “line”) consisting of characters. Each line consists of zero or more indent characters, some text characters, and (optionally) a special SYMBOL character followed by some suffix characters. For those familiar with JSP, the input structure diagram looks like this.

line
    - indent
        * one indent char
    - text
        * one text char
    - possible suffix
        o no suffix
        o suffix
            - suffix SYMBOL
            - suffix
                - one suffix char

We want to parse the line into 3 groups: indent characters, text characters, and suffix characters.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()