I use Python on both Windows and Unix. Occasionally when running on Windows I need to read in a file containing Windows newlines and write it out with Unix/Linux newlines. And sometimes when running on Unix, I need to run the newline conversion in the other direction.
Prior to Python 3, the accepted way to do this was to read data from the file in binary mode, convert the newline characters in the data, and then write the data out again in binary mode. The Tools/Scripts directory contained two scripts (crlf.py and lfcr.py) with illustrative examples. Here, for instance, is the key code from crlf.py (Windows to Unix conversion):

    data = open(filename, "rb").read()
    newdata = data.replace("\r\n", "\n")
    if newdata != data:
        f = open(filename, "wb")
        f.write(newdata)
        f.close()

But if you try to do that with Python 3+, it won’t work.
The key to what will work is the new “newline” argument for the built-in open() function. It is documented in the entry for open() in the Python library reference.
The key point from that documentation is this:
newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:
On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
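The rules quoted above can be checked with a small experiment. This is only a sketch: the temporary file, its contents, and the variable names are made up for illustration and are not part of the documentation.

```python
import os
import tempfile

# Write a file with mixed line endings in binary mode so nothing is translated.
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\nthree\r")

# newline=None (the default): every ending is translated to '\n' on input.
with open(path, "r", newline=None) as f:
    translated = f.read()

# newline='': endings are still recognised, but returned untranslated.
with open(path, "r", newline="") as f:
    untranslated = f.read()

# newline='\r\n' on output: each '\n' written becomes '\r\n' on disk.
out_path = path + ".out"
with open(out_path, "w", newline="\r\n") as f:
    f.write("a\nb\n")
with open(out_path, "rb") as f:
    on_disk = f.read()

print(repr(translated))
print(repr(untranslated))
print(repr(on_disk))
```

Reading the output file back in binary mode is what makes the translation visible, since binary mode never touches line endings.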
So now when I want to convert a file from Windows-style newlines to Linux-style newlines, I do this:

    filename = "NameOfFileToBeConverted"
    fileContents = open(filename,"r").read()
    f = open(filename,"w", newline="\n")
    f.write(fileContents)
    f.close()

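The same idea can be wrapped in a function that handles both directions. This is a minimal sketch, not a replacement for the Tools/Scripts programs: the name convert_newlines, the utf-8 default encoding, and the sample file are assumptions made for illustration.

```python
import os
import tempfile

def convert_newlines(filename, newline="\n", encoding="utf-8"):
    """Rewrite filename in place so that every line ends with `newline`."""
    # Reading with the default newline=None turns '\r\n', '\r' and '\n' into '\n'.
    with open(filename, "r", encoding=encoding) as f:
        contents = f.read()
    # On output, each '\n' is translated to the requested terminator.
    with open(filename, "w", encoding=encoding, newline=newline) as f:
        f.write(contents)

# Usage: start with a Windows-style file, convert to Unix, then back again.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

convert_newlines(path, newline="\n")
with open(path, "rb") as f:
    as_unix = f.read()

convert_newlines(path, newline="\r\n")
with open(path, "rb") as f:
    as_windows = f.read()
```

The with blocks close each file before it is reopened, which matters on Windows. The whole file is held in memory, so this suits ordinary text files rather than very large ones.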
I’m not very familiar with all the changes in Python 3 and don’t have it installed to try, so why doesn’t the first example work?
I have filed a bug report with the details. It is available at http://bugs.python.org/issue12032.
Note that the bug is not with Python 3 itself, but with the crlf.py script in Tools/Scripts, which needs to be updated for Python 3.
Thanks for this blog post, this is interesting and good to know.
And I would also appreciate an answer to Adam’s question. Why doesn’t the first sample work? After all, it’s using binary mode, which should be exempt from all of the newline-related stuff, no?
This is (I think) a consequence of the new distinction between strings and bytes. I modified the code so — before crashing — it gives some information about the types of the objects it is working with. When I run this:
I get this:
So I think the problem is that a “read binary” returns a bytes object, but things like ‘\n’ and ‘\r’ are strings. And never the two shall meet (barring some explicit type conversion, anyway).
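A short way to see the mismatch in Python 3 is sketched below. It is not the modified script mentioned above; the sample data and variable names are invented for illustration.

```python
# Stand-in for what open(filename, "rb").read() returns in Python 3.
data = b"line one\r\nline two\r\n"

print(type(data))    # the binary read gives a bytes object
print(type("\r\n"))  # an ordinary string literal is a str

try:
    # str arguments passed to bytes.replace, as in the crlf.py excerpt
    data.replace("\r\n", "\n")
    error = None
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```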
In Python 3 all strings are Unicode strings, whereas in Python 2.x ordinary strings are byte strings. So if you try to mix byte strings and Unicode strings in Python 3, you will get an exception (a TypeError). If you look at line 2 of the code, in Python 3 "\r\n" and "\n" are Unicode strings, and they are being passed to the replace method of the bytes object that the binary read returned.
There are two ways of fixing the code:
1. Open the file as text instead of binary data.
-or-
2. Replace "\r\n", "\n" with b"\r\n", b"\n".
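As a sketch of the second fix, here is the crlf.py excerpt with bytes literals. The function name crlf_to_lf, its boolean return value, and the sample file are assumptions added for illustration.

```python
import os
import tempfile

def crlf_to_lf(filename):
    """Convert Windows newlines to Unix newlines in binary mode.

    Returns True if the file was rewritten, False if it was already clean.
    """
    with open(filename, "rb") as f:
        data = f.read()
    # Both arguments are bytes, matching the type of the data read.
    newdata = data.replace(b"\r\n", b"\n")
    if newdata != data:
        with open(filename, "wb") as f:
            f.write(newdata)
        return True
    return False

# Usage on a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "windows.txt")
with open(path, "wb") as f:
    f.write(b"alpha\r\nbeta\r\n")

changed_first = crlf_to_lf(path)   # file is rewritten
changed_second = crlf_to_lf(path)  # nothing left to convert
with open(path, "rb") as f:
    result = f.read()
```

Staying in binary mode avoids any question about the file's text encoding, because no decoding takes place.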
The arguments to replace should be bytes like b'\r\n', not strings like '\r\n'.
The standard type ‘file’ has supported multiple newline conventions since at least Python 2.5, probably much earlier. You can use file in conjunction with str.decode or similar. Additionally, the built-in ‘open’ function (which returns a file object) also has this ability (since 2.5 at least):
http://docs.python.org/library/stdtypes.html#file.newlines
http://docs.python.org/library/functions.html#open
In addition to the standard fopen() values mode may be ‘U’ or ‘rU’. Python is usually built with universal newline support; supplying ‘U’ opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention ‘\n’, the Macintosh convention ‘\r’, or the Windows convention ‘\r\n’. All of these external representations are seen as ‘\n’ by the Python program. If Python is built without universal newline support a mode with ‘U’ is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), ‘\n’, ‘\r’, ‘\r\n’, or a tuple containing all the newline types seen.
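The passage above describes Python 2. In Python 3, text files opened with the default settings behave in the same universal-newlines way without the 'U' flag (which was removed in Python 3.11), and they still expose the newlines attribute. The sketch below assumes Python 3; the sample file and variable names are invented for illustration.

```python
import os
import tempfile

# A file that mixes all three line-ending conventions.
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "wb") as f:
    f.write(b"a\r\nb\nc\rd")

with open(path, "r") as f:
    before = f.newlines  # nothing has been read yet
    text = f.read()      # every ending arrives as '\n'
    seen = f.newlines    # the newline types encountered while reading

print(repr(before))
print(repr(text))
print(seen)
```

Checking newlines after a full read is one way to find out which convention a file uses before deciding whether it needs converting.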