Newline conversion in Python 3

I use Python on both Windows and Unix. Occasionally, when running on Windows, I need to read in a file containing Windows newlines and write it out with Unix/Linux newlines. And sometimes, when running on Unix, I need to run the newline conversion in the other direction.

Prior to Python 3, the accepted way to do this was to read data from the file in binary mode, convert the newline characters in the data, and then write the data out again in binary mode. The Tools/Scripts directory contained two scripts (crlf.py and lfcr.py) with illustrative examples. Here, for instance, is the key code from crlf.py (Windows to Unix conversion):

        data = open(filename, "rb").read()
        newdata = data.replace("\r\n", "\n")
        if newdata != data:
            f = open(filename, "wb")
            f.write(newdata)
            f.close()

But if you try to do that with Python 3+, it won’t work.

The key to what will work is the new “newline” argument for the built-in open() function. It is documented in the entry for open() in the Built-in Functions section of the Python 3 documentation.

The key point from that documentation is this:

newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

  • On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
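A small sketch of these rules in action. The scratch file name and temporary directory are only for illustration; the starting bytes are written in binary mode so that the content on disk is known exactly.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write known bytes: one Windows line ending and one Unix line ending.
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\n")

# newline=None (the default): universal newlines, "\r\n" is translated to "\n".
with open(path, "r", newline=None) as f:
    translated = f.read()

# newline="": line endings are recognized but returned untranslated.
with open(path, "r", newline="") as f:
    untranslated = f.read()

# newline="\r\n" on output: every "\n" written becomes "\r\n" on disk.
with open(path, "w", newline="\r\n") as f:
    f.write("one\ntwo\n")
with open(path, "rb") as f:
    raw = f.read()

print(repr(translated))    # 'one\ntwo\n'
print(repr(untranslated))  # 'one\r\ntwo\n'
print(raw)                 # b'one\r\ntwo\r\n'
```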

So now when I want to convert a file from Windows-style newlines to Linux-style newlines, I do this:

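filename = "NameOfFileToBeConverted"
# Read in text mode: universal newlines turns every "\r\n" into "\n".
with open(filename, "r") as f:
    fileContents = f.read()
# Write back with newline="\n" so each "\n" is written out unchanged.
with open(filename, "w", newline="\n") as f:
    f.write(fileContents)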
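The same pattern should cover the other direction (Unix to Windows) by passing a different newline value. Here is a minimal sketch; the helper name convert_newlines and the scratch file are my own, and it assumes the file can be read and written with the platform's default text encoding.

```python
import os
import tempfile


def convert_newlines(filename, newline):
    """Rewrite filename in place so that every line ends with `newline`."""
    # Universal newlines on input: "\r\n", "\r" and "\n" all arrive as "\n".
    with open(filename, "r") as f:
        contents = f.read()
    # On output, each "\n" is translated to the requested line ending.
    with open(filename, "w", newline=newline) as f:
        f.write(contents)


# Demonstration on a hypothetical scratch file with Unix line endings.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"first\nsecond\n")

convert_newlines(path, "\r\n")  # Unix -> Windows
with open(path, "rb") as f:
    as_windows = f.read()

convert_newlines(path, "\n")    # Windows -> Unix
with open(path, "rb") as f:
    as_unix = f.read()

print(as_windows)  # b'first\r\nsecond\r\n'
print(as_unix)     # b'first\nsecond\n'
```

Note that a lone "\r" in the input is also treated as a line ending and converted, which may or may not be what you want.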

7 thoughts on “Newline conversion in Python 3”

  1. I’m not very familiar with all the changes in Python 3 and don’t have it installed to try, so why doesn’t the first example work?

    • I have filed a bug report with the details. It is available at
      http://bugs.python.org/issue12032.

      Note that the bug is not with Python 3 itself, but with the crlf.py script in Tools/Scripts, which needs to be updated for Python 3.

  2. Thanks for this blog post, this is interesting and good to know.

    And I would also appreciate an answer to Adam’s question. Why doesn’t the first sample work? After all, it’s using binary mode, which should be exempt from all of the newline-related stuff, no?

    • This is (I think) a consequence of the new distinction between strings and bytes. I modified the code so — before crashing — it gives some information about the types of the objects it is working with. When I run this:

      import os
      for filename in os.listdir("."):
          if os.path.isdir(filename):
              continue
          data = open(filename, "rb").read()
      
          x = 'x'
          print("type of data is: ", type(data))
          print("type of x is: ", type(x))
      
          if x in data:
              print("success!")
      

      I get this:

      Traceback (most recent call last):
        File "C:/pydev/zob/zobtest.py", line 6, in <module>
          import zob
        File "C:\pydev\zob\zob.py", line 10, in <module>
          import test.py
        File "C:\pydev\zob\test.py", line 11, in <module>
          if x in data:
      TypeError: Type str doesn't support the buffer API
      type of data is:  <class 'bytes'>
      type of x is:  <class 'str'>
      
      Process finished with exit code 11
      

      So I think the problem is that a “read binary” returns a bytes object, but things like ‘\n’ and ‘\r’ are strings. And never the two shall meet (barring some explicit type conversion, anyway).

    • In Python 3 all strings (str) are Unicode strings, whereas in Python 2.x plain strings are byte strings. So, if you try to mix byte strings and Unicode strings in Python 3 you will get a TypeError. If you look at line 2 of the code, in Python 3 “\r\n” and “\n” are Unicode strings which are being passed to the “bytes.replace” method.

      There are two ways of fixing the code:

      1. Open the file as text instead of binary data.
      -or-
      2. Replace "\r\n", "\n" with b"\r\n", b"\n".
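Option 2 can be sketched as follows. This is my own adaptation of the crlf.py fragment from the post, not the updated script itself, and the scratch file is a stand-in for the file to be converted.

```python
import os
import tempfile

# Hypothetical scratch file standing in for the file to be converted.
filename = os.path.join(tempfile.mkdtemp(), "windows.txt")
with open(filename, "wb") as f:
    f.write(b"x\r\ny\r\n")

# Binary mode returns bytes, so the arguments to replace() must be bytes too.
with open(filename, "rb") as f:
    data = f.read()
newdata = data.replace(b"\r\n", b"\n")

# Rewrite the file only if something actually changed.
if newdata != data:
    with open(filename, "wb") as f:
        f.write(newdata)

with open(filename, "rb") as f:
    result = f.read()
print(result)  # b'x\ny\n'
```

Because this version never decodes the data, it does not depend on the file's text encoding.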

  3. The arguments to replace should be bytes like b'\r\n', not strings like '\r\n'.

    $ python3
    Python 3.1.3 (r313:86834, Nov 28 2010, 10:01:07) 
    >>> b'x\r\ny\n'.replace('\r\n', '\n')
    TypeError: expected an object with the buffer interface
    >>> b'x\r\ny\n'.replace(b'\r\n', b'\n')
    b'x\ny\n'
    
  4. The standard ‘file’ type has had a universal newlines option since at least Python 2.5, probably much earlier. You can use file in conjunction with str.decode or similar. Additionally, the normal ‘open’ built-in function (which returns a file object) also has this ability (since 2.5 at least):

    http://docs.python.org/library/stdtypes.html#file.newlines
    http://docs.python.org/library/functions.html#open

    In addition to the standard fopen() values mode may be ‘U’ or ‘rU’. Python is usually built with universal newline support; supplying ‘U’ opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention ‘\n’, the Macintosh convention ‘\r’, or the Windows convention ‘\r\n’. All of these external representations are seen as ‘\n’ by the Python program. If Python is built without universal newline support a mode with ‘U’ is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), ‘\n’, ‘\r’, ‘\r\n’, or a tuple containing all the newline types seen.
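For comparison, in Python 3 the ‘U’ flag is not needed (it was removed in Python 3.11), because text mode applies universal newlines by default. Text file objects still expose the newlines attribute. A small sketch, using a hypothetical scratch file with mixed line endings:

```python
import os
import tempfile

# Hypothetical scratch file containing all three kinds of line ending.
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "wb") as f:
    f.write(b"a\r\nb\nc\rd")

with open(path, "r") as f:
    before = f.newlines  # None: no line endings have been seen yet
    text = f.read()      # every line ending is translated to "\n"
    seen = f.newlines    # the kinds of line ending encountered while reading

print(before)
print(repr(text))
print(seen)
```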
