Read-Ahead and Python Generators

One of the early classics of program design is Michael Jackson’s Principles of Program Design (1975), which introduced (what later came to be known as) JSP: Jackson Structured Programming.

Back in the 1970′s, most business application programs did their work by reading and writing sequential files of records stored on tape. And it was common to see programs whose top-level control structure looked like (what I will call) the “standard loop”:

open input file F

while not EndOfFile on F:
    read a record
    process the record

close F

Jackson showed that this way of processing a sequence almost always created unnecessary problems in the program logic, and that a better way was to use what he called a “read-ahead” technique. 

In the read-ahead technique, a record is read from the input file immediately after the file is opened, and then a second “read” statement is executed after each record is processed.

This technique produces a program structure like this:

open input file F
read a record from F     # get first

while not EndOfFile on F:
    process the record
    read the next record from F  # get next

close F

I won’t try to explain when or why the read-ahead technique is preferable to the standard loop. That’s out of scope for this blog entry, and a good book on JSP can explain that better than I can. So for now, let’s just say that there are some situations in which the standard loop is the right tool for the job, and there are other situations in which read-ahead is the right tool for the job.

One of the joys of Python is that Python makes it so easy to do “standard loop” processing on a sequence such as a list or a string.

for item in sequence:
    processItem(item)

There are times, however, when you have a sequence that you need to process with the read-ahead technique.

With Python generators, it is easy to do. Generators make it easy to convert a sequence into a kind of object that provides both a get next method and an end-of-file mark.  That kind of object can easily be processed using the read-ahead technique.

Suppose that we have a list of items (called listOfItems) and we wish to process it using the read-ahead technique.

First, we create the “read-ahead” generator:

def ReadAhead(sequence):
    for item in sequence:
        yield item
    yield None # return the "end of file mark" after the last item

Then we can write our code this way:

items = ReadAhead(listOfItems)
item = items.next()  # get first
while item:
    processItem(item)
    item = items.next()  # get next

Here is a simple example.

We have a string (called “line”) consisting of characters. Each line consists of zero or more indent characters, some text characters, and (optionally) a special SYMBOL character followed by some suffix characters. For those familiar with JSP, the input structure diagram looks like this.

line
    - indent
        * one indent char
    - text
        * one text char
    - possible suffix
        o no suffix
        o suffix
            - suffix SYMBOL
            - suffix
                - one suffix char

We want to parse the line into 3 groups: indent characters, text characters, and suffix characters.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()
About these ads

7 thoughts on “Read-Ahead and Python Generators

  1. Cool, I hadn’t seen this pattern explained before.

    Incidentally, there’s no need to convert the line via list in the final example: since strings are sequences in Python, the ReadAhead generator will iterate over it as-is on a character-by-character basis.

  2. A long time ago I posted this pattern and received a stern rebuke from martelli it was “unpyhtonic”

  3. This is cool. I can see myself using this pattern.

    I don’t know that it isn’t pythonic overall, but it would be _more_ pythonic to replace the function ReadAhead with a generator expression:

    items = (x for x in listOfItems)
    item = items.next()  # get first
    while item:
        processItem(item)
        item = items.next()  # get next
    
    • Thanks for your comment! I learned something from it. :-)

      The bad news is that it doesn’t work. The good news is that understanding why it doesn’t work is really useful and interesting, because it helps point out an important feature of the ReadAhead technique and the ReadAhead generator.

      I ran this program under Python 2.5:

      import sys
      print(sys.version)
      listOfItems ="abcd"
      items = (x for x in listOfItems)
      item = items.next()  # get first
      while item:
          print(item)
          item = items.next()  # get next

      And got this.

      2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
      a
      b
      c
      d
      Traceback (most recent call last):
        File "c:\junk\test.py", line 9, in 
          item = items.next()  # get next
      StopIteration

      The “trick” with the ReadAhead generator is that after yielding all of the members of the list, it yields None, thus avoiding the raising of StopIteration. In the classic formulation of the ReadAhead technique, the yield None corresponds to returning the end-of-file mark.

      NOTE that with Python 3.x, instead of items.next() you need to use next(items).

  4. I wonder if this wouldn’t do roughly the same:

        indentCount = 0
        textChars = []
        suffixChars = []
    
        # convert the line into a list of characters
        for c in line:
            if c == INDENT_CHAR:
                # process indent characters
                indentCount += 1
                continue
    
            if c != SYMBOL:
                # process text characters
                textChars.append(c)
                continue
    
            if c == SYMBOL:
                for c2 in line:
                    # process suffix characters
                    suffixChars.append(c2)
    
    • No, that won’t even come close.

      First of all, the demonstration. Here is a full Python program using the ReadAhead technique.

      ###################################################################
      indentCount = 0
      textChars = []
      suffixChars = []
      
      INDENT_CHAR = " " # space character
      SYMBOL = "!"
      line = "   this is the line body ! this is the line suffix !!!"
      
      def ReadAhead(sequence):
          for item in sequence:
              yield item
          yield None # return the "end of file mark" after the last item
      
      #feed the line to the ReadAhead generator
      chars = ReadAhead(line)
      
      # This code works for Python 2.x
      # For Python 3.x, change "chars.next()" to "next(chars)"
      c = chars.next() # get first
      
      while c and c == INDENT_CHAR:
          # process indent characters
          indentCount += 1
          c = chars.next()
      
      while c and c != SYMBOL:
          # process text characters
          textChars.append(c)
          c = chars.next()
      
      if c and c == SYMBOL:
          c = chars.next() # read past the SYMBOL
          while c:
              # process suffix characters
              suffixChars.append(c)
              c = chars.next()
      
      print('Line (input)= "' + line + '"')
      print('Line indent = ' + str(indentCount))
      print('Line text   = "' + "".join(textChars) + '"')
      print('Line suffix = "' + "".join(suffixChars) + '"')
      ##################################################################

      It produces this output.

      Line (input)= "   this is the line body ! this is the line suffix !!!"
      Line indent = 3
      Line text   = "this is the line body "
      Line suffix = " this is the line suffix !!!"

      Now replace the relevant code with your suggested code and look at the output.

      Second, the explanation. To understand why your suggested code does not work, the best place to start is Michael Jackson’s famous paper
      Getting It Wrong — A Cautionary Tale

      After that, his book Principles of Program Design would be a good place to go.

      For online resources about the Jackson methods, see http://www.jacksonworkbench.co.uk/stevefergspages/jsp_and_jsd/index.html.

  5. I don’t think ReadAhead is descriptive of what this function is actually doing. All it does is tacks a None onto the end of an iterator. Apart from the fact that this simply won’t do if the sequence contains None, it can also be implemented as itertools.chain(sequence, (None,)), which is much more explicit. (To me, the name ReadAhead implies the ability to view future sequence items without modifying the underlying state of the iterator. This can be performed with itertools.tee.)

    As long as there is no chance of StopIteration being called by something else, the above could be rewritten using a try-except.

    However, this particular application would be best solved by a regular expression: indentChars, textChars, suffixChars = re.match(‘({indent}*)(.*?)(?:{symbol}(.*))?’, line).groups().

Comments are closed.