Command-line syntax: some basic concepts

I’ve been reading about parsers for command-line arguments lately, for example Plac. And, as Michele Simionato says:

There is no want of command line arguments parsers in the Python world. The standard library alone contains three different modules: getopt (from the stone age), optparse (from Python 2.3) and argparse (from Python 2.7).

My reading has made me realize that there is an immense range of possible syntaxes for command-line arguments, and far less consensus and standardization than I thought. Although there are some general styles that programmers often use when implementing the command-line arguments for their applications, basically every programmer is free to do whatever he (or she) wants. The result is that whenever you encounter an application for the first time, you can’t safely assume anything about the syntax of its command-line arguments.

It has also made me wonder whether anyone has ever written an overview of, or introduction to, the basic concepts involved in command line arguments. I searched the Web without finding one, so I thought it would be interesting to try to write one. I can live with the risk that I’m re-inventing the wheel.

Of course, there may be something out there and I just missed it. So if you know of some other discussion of this topic, please leave a comment and tell me about it. And if there is something that I missed here, I’d appreciate a comment about that too.

What is a command line argument?

When you invoke an application from a command line, it is often useful to be able to send one or more pieces of information from the command line to the application. As a simple example, we might want to start a text editor and also tell it the name of a file that it should open, like this

          superedit a_filename.txt

In this example, “superedit” is the name of the application, and “a_filename.txt” is a command line argument: in this case, the name of a file.

It is possible to supply more than one command line argument

We often want to send an application multiple arguments, like this:

          rename file_a.txt  file_b.txt

Positional arguments, named arguments, and flags

There are three types of command line argument: positional arguments, named arguments, and flags.

  • A positional argument is a bare value, and its position in a list of arguments identifies it.
  • A named argument is a (key, value) pair, where the key identifies the value.
  • A flag is a stand-alone key, whose presence or absence provides information to the application.

If we supplied the “rename” application with two positional arguments, like this

          rename file_a.txt  file_b.txt

then the position of the arguments identifies the value.

  • The value in position 1 (“file_a.txt”) is the current name of the file.
  • The value in position 2 (“file_b.txt”) is the requested new name of the file.
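Here is a minimal Python sketch of that idea. It assumes the arguments arrive as a list of strings (as they do in sys.argv[1:]); the function name parse_rename_args is made up for illustration and is not part of any real "rename" program.

```python
def parse_rename_args(args):
    """Interpret a list of exactly two positional arguments.

    `args` is assumed to look like sys.argv[1:], i.e. without the program name.
    """
    if len(args) != 2:
        raise ValueError("expected exactly two arguments: OLDNAME NEWNAME")
    old_name, new_name = args  # position alone decides which value is which
    return {"oldname": old_name, "newname": new_name}


print(parse_rename_args(["file_a.txt", "file_b.txt"]))
# {'oldname': 'file_a.txt', 'newname': 'file_b.txt'}
```

Swapping the two values on the command line silently swaps their meaning, which is the main weakness of purely positional syntax.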

We could have written the “rename” application so that it requires two named arguments, like this

          rename  -oldname file_a.txt  -newname file_b.txt

A flag is an argument whose presence alone is enough to convey information to the application. A good example is the frequently-used “-v” or "--verbose" argument.

Although it is possible to think of flags as degenerate named arguments (named arguments that have a key but no value), I find it easier to think of flags as a distinct type of argument, different from named arguments.
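To make the three types concrete, here is a rough sketch of a hand-written classifier. It assumes the application tells it in advance which keys take a value (named_keys) and which stand alone (flag_keys); the function name classify and both parameter names are hypothetical.

```python
def classify(tokens, named_keys, flag_keys):
    """Sort tokens into positional values, named (key, value) pairs, and flags."""
    positionals, named, flags = [], {}, set()
    it = iter(tokens)
    for token in it:
        if not token.startswith("-") or token == "-":
            positionals.append(token)      # bare value: identified by position
            continue
        key = token.lstrip("-")            # accept "-v" as well as "--verbose"
        if key in flag_keys:
            flags.add(key)                 # presence alone carries the information
        elif key in named_keys:
            value = next(it, None)         # the value is the following token
            if value is None:
                raise ValueError(f"missing value for {token}")
            named[key] = value
        else:
            raise ValueError(f"unrecognized key: {token}")
    return positionals, named, flags


print(classify(
    ["-oldname", "file_a.txt", "-newname", "file_b.txt", "-v"],
    named_keys={"oldname", "newname"},
    flag_keys={"v", "verbose"},
))
# ([], {'oldname': 'file_a.txt', 'newname': 'file_b.txt'}, {'v'})
```

Notice that the parser cannot tell a flag from a named argument by looking at the token alone; it has to be told which keys expect a value.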

Keyword arguments and options

I will use the term keyword argument to cover both named arguments and flags.

David Goodger notes (in the first comment on the first version of this post) that I am not using the traditional Unix command-line lexicon.  What I have called keyword arguments are — on Unix platforms — traditionally called options;  what I have called values are traditionally called option arguments; and what I have called positional arguments, the Open Group calls operands.  So I should probably say something about my choice of technical terminology.

For the purposes of this analysis, I prefer not to use the traditional Unix vocabulary of options, for a number of reasons.  First of all, the term option tends to be Unix-specific; on Windows the term parameter is more frequently used.  Second, the investigation began with command-line parsers, and in the context of a discussion of parsers and parsing, keyword argument seems a more traditional and appropriate term than option.  Third, the usual definition of option is not very useful.

Arguments are options if they begin with a hyphen.

And finally, the term option implies optionality.  Whether an argument is optional or required is a semantic issue rather than a syntactical issue.  At this point I’m interested in syntactical issues, so I want to use a semantically neutral vocabulary.  We can talk about options and optionality later, when we look at semantic concepts.

Keyword arguments require a sigil

When keyword arguments are used, there must be some mechanism for distinguishing a key from a value or from a positional argument. That mechanism is a “sigil”: a special character or string of characters that indicates the beginning of a key. In our example, the sigil was a dash (a hyphen).

On Windows, the sigil is typically a forward slash: “/”.

On Unix-like operating systems, the sigil is typically a dash "-".

Some applications use multiple sigils.  With the plus sign “+” as a sigil, for instance, it is possible to use flags to turn options on and off.

          attrib   -readonly    -archive     file_A.txt
          attrib   +readonly    +archive     file_A.txt
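A small sketch of how two sigils might be handled. I am assuming here that "+" switches an attribute on and "-" switches it off; the function name parse_toggles is invented for the example.

```python
def parse_toggles(tokens):
    """Split attrib-style tokens into attribute switches and file names.

    Assumption: '+' turns an attribute on and '-' turns it off.
    Anything that does not start with one of the two sigils is positional.
    """
    settings, files = {}, []
    for token in tokens:
        if len(token) > 1 and token[0] in "+-":
            settings[token[1:]] = (token[0] == "+")
        else:
            files.append(token)
    return settings, files


print(parse_toggles(["+readonly", "-archive", "file_A.txt"]))
# ({'readonly': True, 'archive': False}, ['file_A.txt'])
```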

Single-character and multi-character keys

Some applications, especially on Unix, make a distinction between single-character keys and multi-character keys (“long options”), with a single-dash sigil "-" indicating the beginning of a single-character key, and a double dash "--" sigil indicating the beginning of a multi-character key. Often, an application will support both single-character and multi-character keys for the same argument. For example, the “rename” application might accept both this

          rename  -o file_a.txt  -n file_b.txt

and this

          rename  --oldname file_a.txt  --newname file_b.txt
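With Python's argparse, one way to get this behaviour is to give both spellings to a single add_argument call. This is only a sketch of the hypothetical "rename" application, not a real program.

```python
import argparse

parser = argparse.ArgumentParser(prog="rename")
parser.add_argument("-o", "--oldname")   # short and long key for the same argument
parser.add_argument("-n", "--newname")

short = parser.parse_args(["-o", "file_a.txt", "-n", "file_b.txt"])
long_ = parser.parse_args(["--oldname", "file_a.txt", "--newname", "file_b.txt"])

print(short.oldname, short.newname)   # file_a.txt file_b.txt
print(short == long_)                 # True
```

argparse takes the attribute name (oldname, newname) from the long key, so the application never needs to know which spelling the user typed.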

Fixed-length and variable-length keys

The previous section describes what I think most Unix programmers would say is the difference between single-dash and double-dash keys. But I think it is actually wrong.

The real difference between a single-dash sigil "-" and a double dash "--" sigil is not the difference between one and many, but the difference between fixed-length and variable-length keys. (This is obscured by the fact that a single-character key is also automatically a fixed-length key.)

The thing that really makes keys that begin with a single dash different from keys that begin with a double dash is not that they are one character long, but that their length is fixed and known. For example, flag concatenation (see below) is possible because the flag keys have a known and fixed length. It doesn’t depend on the flag keys being one character long — it would work just as well if the length for flag keys was fixed at two or even three characters. And this is also true of the third technique for distinguishing keys from argument values (see the next section).

Named arguments require a mechanism to distinguish keys from argument values

One technique is to use whitespace to separate argument values from keys. We saw this in our earlier example

          rename  -o file_a.txt  -n file_b.txt

A second technique is to use a special (non-whitespace) character to separate argument values from keys. This special character could be any character that cannot occur in either the key or argument value.

On Unix, this is traditionally an equal sign “=”, like this.

          rename  -o=file_a.txt  -n=file_b.txt

On Windows and MS-DOS this is traditionally a colon “:”, like this.

          rename  /o:file_a.txt  /n:file_b.txt

An application might permit whitespace before and after the equal sign, like this.

          rename  -o = file_a.txt  -n = file_b.txt

A third technique is to use the known length of the key to distinguish the key from the argument value. Suppose the “rename” application uses only 1-character keys. Then it might accept arguments like this.

          rename  -ofile_a.txt  -nfile_b.txt
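The three techniques can be sketched in one small function. The name split_named and the key_length parameter are my own inventions; the sketch assumes a single-dash sigil and an equal sign as the separator, and it does not handle the variant with whitespace around the equal sign.

```python
def split_named(tokens, key_length=1):
    """Extract (key, value) pairs written as '-o value', '-o=value' or '-ovalue'.

    The third style only works because every key is exactly
    key_length characters long.
    """
    pairs = {}
    it = iter(tokens)
    for token in it:
        if not token.startswith("-"):
            raise ValueError(f"expected a key, got {token!r}")
        body = token[1:]
        key, rest = body[:key_length], body[key_length:]
        if rest.startswith("="):       # technique 2: explicit separator character
            pairs[key] = rest[1:]
        elif rest:                     # technique 3: value follows a fixed-length key
            pairs[key] = rest
        else:                          # technique 1: value is the next token
            value = next(it, None)
            if value is None:
                raise ValueError(f"missing value for -{key}")
            pairs[key] = value
    return pairs


print(split_named(["-o", "file_a.txt", "-n=file_b.txt"]))
print(split_named(["-ofile_a.txt", "-nfile_b.txt"]))
# both print {'o': 'file_a.txt', 'n': 'file_b.txt'}
```

One consequence of mixing the second and third techniques: a value that itself begins with an equal sign becomes ambiguous, so a real parser would have to pick one rule and document it.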

Fixed-length keys make flag concatenation possible

Suppose that an application follows the convention that a single-dash sigil signals the start of a single-character flag argument. Then it can accept either this

          tar -x -v -f  some_filename.tar

or this, where several flag arguments are specified together

          tar -xvf some_filename.tar

Here is where the distinction between the single-dash sigil and the double-dash sigil becomes important.

  • "-xvf" indicates the concatenation of three single-character flags: “x”, “v”, and “f”.
  • "--xvf" (note the double dash) indicates a single multi-character flag: “xvf”.
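A short sketch of the rule, written so that the key length is a parameter rather than hard-wired to one character. The function name flag_keys and the key_length parameter are hypothetical.

```python
def flag_keys(token, key_length=1):
    """Return the flag keys contained in a single command-line token.

    A double dash introduces one variable-length key; a single dash
    introduces a run of keys that are each key_length characters long.
    """
    if token.startswith("--"):
        return [token[2:]]
    if not token.startswith("-"):
        raise ValueError(f"not a flag token: {token!r}")
    body = token[1:]
    if not body or len(body) % key_length != 0:
        raise ValueError(f"cannot split {token!r} into {key_length}-character keys")
    return [body[i:i + key_length] for i in range(0, len(body), key_length)]


print(flag_keys("-xvf"))                   # ['x', 'v', 'f']
print(flag_keys("--xvf"))                  # ['xvf']
print(flag_keys("-aabbcc", key_length=2))  # ['aa', 'bb', 'cc']
```

The last call shows that concatenation depends on the keys having a known, fixed length, not on that length being one.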

Parsing the command line

In many of the examples that we’ve seen, parsing the command line is as simple as splitting it on whitespace. But the situation gets more complicated if values can contain whitespace. If that is true, then we need to support delimiters that can enclose values that contain whitespace.

Suppose we want to invoke a word-processor from the command line. And we want to specify two arguments on the command line: the name of the file, and the name of the author. This obviously will not work.

          superedit A Christmas Story.doc  Clement Moore

What we need is this.

          superedit "A Christmas Story.doc"  "Clement Moore"

Support of quoted values means that command-line parsers must be more sophisticated… just splitting the command line on whitespace won’t do the job. The command-line parser must recognize and correctly handle quote characters… and escaped quote characters inside of quoted strings.
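Python's standard library shows the difference nicely. str.split knows nothing about quotes, while shlex.split follows POSIX-shell-style quoting rules. This is only an illustration of the parsing problem; on Unix the shell normally does this splitting before the application ever starts.

```python
import shlex

command = 'superedit "A Christmas Story.doc" "Clement Moore"'

# Naive whitespace splitting tears the quoted values apart.
print(command.split())
# ['superedit', '"A', 'Christmas', 'Story.doc"', '"Clement', 'Moore"']

# A quote-aware splitter keeps each quoted value together and drops the quotes.
print(shlex.split(command))
# ['superedit', 'A Christmas Story.doc', 'Clement Moore']

# Escaped quote characters inside a quoted string are handled as well.
print(shlex.split(r'superedit "Say \"hi\".doc"'))
# ['superedit', 'Say "hi".doc']
```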

The most common delimiter for argument values is the double-quote symbol. But we might also (or instead) want to support single quotes, back ticks, parentheses, or square/wavy/pointy brackets. We can imagine a case in which a malevolent programmer wrote superedit to expect positional arguments like this.

          superedit (A Christmas Story.doc)  (Clement Moore)

… or named arguments like this.

          superedit filename(A Christmas Story.doc)  author(Clement Moore)

Sigils in positional arguments

Remember our “rename” application? It accepted arguments like this, where the dash is the sigil that introduces the key of a named argument.

          rename  -o file_a.txt  -n file_b.txt

But filenames can begin with dashes. We might need to write a command like this, which would cause problems.

          rename  -o -file_a.txt  -n -file_b.txt

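So this is another reason why we might want to be able to quote argument values: to “hide” a sigil character inside a value. Note that this only works if the application’s own parser sees the quotes; a standard Unix shell strips them before the application starts (see the second comment on this post).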

          rename  -o "-file_a.txt"  -n "-file_b.txt"

The order of arguments

In the first version of this post, I wrote that:

It is a universally observed convention that
  • keyword arguments (named arguments and flags) are grouped together
  • positional arguments are grouped together
  • keyword arguments must be specified first, before specifying positional arguments

But that is wrong. It is a widely — but not universally — observed convention. As Eric wrote, in a comment on the first version of this post,

many modern programs allow keyword arguments to be specified after (or even between) positional arguments

And even very old programs do it too. The command-line syntax for Microsoft DOS’s dir command (roughly equivalent to Unix’s ls command) is basically

dir [filename] [switches]

with the filename positional argument appearing before the switches.

A separator between keyword arguments and positional arguments

Suppose we have an application “myprog” that accepts one or more keyword arguments that start with a dash sigil, followed by one or more positional arguments that supply filenames. And suppose that filenames can contain — and begin with — dashes.

We’re going to have a problem if we code this

          myprog -v -r -t -file_a.txt -file_b.txt  -file_c.txt

myprog is going to see “-file_a.txt” and (since it starts with a dash, the sigil) myprog will try to handle it like a keyword argument. Not good.

We could deal with this problem by routinely enclosing all filename positional arguments in quotes, but that would be clumsy and laborious.

          myprog -v -r -t "-file_a.txt" "-file_b.txt"  "-file_c.txt"

An alternative is to use a special string (typically double dashes "--") to indicate the beginning of positional arguments.

          myprog -v -r -t   --  -file_a.txt -file_b.txt  -file_c.txt

So now we have four basic kinds of arguments.

  • positional arguments
  • named arguments (key+value pairs)
  • flags
  • an indicator of the beginning of positional arguments ("--")
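Handling the separator can be sketched in a few lines. The function name split_on_separator is made up; it only divides the token list and leaves the interpretation of each half to the rest of the parser.

```python
def split_on_separator(tokens, separator="--"):
    """Return (before, after), divided at the first separator token.

    Tokens in `after` are to be treated as positional arguments even if
    they begin with the sigil. Without a separator, `after` is empty.
    """
    if separator in tokens:
        index = tokens.index(separator)
        return tokens[:index], tokens[index + 1:]
    return list(tokens), []


before, after = split_on_separator(
    ["-v", "-r", "-t", "--", "-file_a.txt", "-file_b.txt", "-file_c.txt"]
)
print(before)  # ['-v', '-r', '-t']
print(after)   # ['-file_a.txt', '-file_b.txt', '-file_c.txt']
```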

Argument semantics

To be expanded…

Optional arguments vs. required arguments

Relationships between different arguments

  • Aliases
  • Mutual exclusion
  • Mutual necessity
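As a placeholder until this section is expanded, here is a sketch of how these semantic ideas might look with Python's argparse. The application "myprog" and all of its argument names are invented. As far as I know argparse has no built-in notion of mutual necessity, so that check is written by hand.

```python
import argparse

parser = argparse.ArgumentParser(prog="myprog")

# Required argument; "-o" and "--output" are aliases for the same key.
parser.add_argument("-o", "--output", required=True)

# Mutual exclusion: at most one of these two flags may appear.
group = parser.add_mutually_exclusive_group()
group.add_argument("--quiet", action="store_true")
group.add_argument("--verbose", action="store_true")

# Mutual necessity: these two only make sense together (checked below).
parser.add_argument("--user")
parser.add_argument("--password")


def parse(argv):
    args = parser.parse_args(argv)
    if (args.user is None) != (args.password is None):
        parser.error("--user and --password must be given together")
    return args


args = parse(["-o", "out.txt", "--quiet"])
print(args.output, args.quiet, args.verbose)  # out.txt True False
```

When a rule is broken, argparse prints a usage message and exits, so calling parse(["-o", "out.txt", "--quiet", "--verbose"]) ends the program rather than returning.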

 

Other variations

In some conventions:

  • Multi-character keys may be abbreviated as long as the abbreviations are unique.
  • The value in a named argument is optional and may be omitted.
  • The value of a named argument may be a list, with items in the list separated by a colon or a comma.
  • A sigil character standing by itself (e.g. a single dash) is treated as a positional argument.
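Python's argparse happens to follow all four of these conventions, which makes for a compact illustration. The argument names below are invented for the example.

```python
import argparse

parser = argparse.ArgumentParser(prog="myprog")
parser.add_argument("--verbose", action="store_true")
# Optional value: "--color" alone means "auto"; leaving it out means "never".
parser.add_argument("--color", nargs="?", const="auto", default="never")
# List value: items separated by commas.
parser.add_argument("--include", type=lambda text: text.split(","), default=[])
parser.add_argument("source")

args = parser.parse_args(["--verb", "--color", "--include", "a,b,c", "-"])

print(args.verbose)  # True   ("--verb" is a unique abbreviation of "--verbose")
print(args.color)    # auto   (key given without a value)
print(args.include)  # ['a', 'b', 'c']
print(args.source)   # -      (a lone dash is treated as a positional argument)
```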

Command-line as a programming language

I think that the best way to think of a command-line, and its arguments, is as a statement in a command-line (CL) programming language, where each application defines its own CL language.

This means that — as far as an application is concerned — the process of using command-line arguments always looks like this:

  1. define (i.e. tell the parsing module about) the syntax rules of the CL language to be used
  2. define (i.e. tell the parsing module about) the semantics of the CL language
  3. call the parser to parse the command line and its arguments
  4. query the parser for information about the “tokens” (the command-line arguments) that it found

Step 2 — specifying the CL semantics — is the step in which the application specifies (for example) what named arguments and flags it accepts, and which are required. This step is necessary for the parser to do certain kinds of semantic checking: (for example) to automatically reject unrecognized keys, or to automatically report required arguments that were not provided.

Step 2 can be omitted, but only if the application itself will do the semantic checking rather than expecting the parsing module to do it.

The upside of doing step 2 is that it enables a smart CL parsing module automatically to generate user documentation for the CL language, and to dump that documentation to the screen when it finds a syntactic or semantic error in the command line, or when the command line is a request (e.g. “/?” or “-h”) for the command-line documentation.
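The four steps can be sketched with Python's argparse. This is just one possible mapping: argparse largely fixes the syntax rules itself (Unix-style dashes), so steps 1 and 2 blend into the same add_argument calls. The "rename" application is the hypothetical one used throughout this post.

```python
import argparse

# Steps 1 and 2: describe the syntax and semantics of the application's CL language.
parser = argparse.ArgumentParser(prog="rename", description="Rename a file.")
parser.add_argument("-v", "--verbose", action="store_true", help="report what was done")
parser.add_argument("oldname", help="current name of the file")
parser.add_argument("newname", help="requested new name of the file")

# Step 3: call the parser (an explicit list stands in for the real command line).
args = parser.parse_args(["-v", "file_a.txt", "file_b.txt"])

# Step 4: query the result.
print(args.verbose, args.oldname, args.newname)  # True file_a.txt file_b.txt

# Because the parser knows the semantics, it can generate the documentation itself.
help_text = parser.format_help()
print(help_text)
```

The same help text is what the user sees after typing "rename -h", and a shortened usage line is printed whenever the parser rejects a command line.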

Command-line meta-languages

CL languages are like markup languages. You can invent your own from scratch if you wish, but life is a lot easier if you at least follow some standard conventions when you do.

In the world of markup languages, such standard conventions are called meta-languages. The best-known markup meta-language is XML. XML is not a markup language; it is a markup meta-language … roughly: a style, or set of conventions, or template for creating specific markup languages.

XML is well-defined by the W3C. It would make sense to have similarly well-defined, carefully specified meta-languages for CL languages. Right now, I think we have two loosely-defined CL meta-languages, which I shall refer to as

  • WinCL (for Windows)
  • NixCL (for *nix platforms)

Traditionally (see the Wikipedia article on command line argument)

  • WinCL uses a slash as the sigil; NixCL uses a dash.
  • WinCL uses a colon as a key/value separator; NixCL uses an equal sign.
  • WinCL keywords traditionally consist of a single letter; NixCL is open to multi-character keywords (GNU “long options”).

As of July 25, 2010,:

If it is (or becomes) possible to consider WinCL and NixCL to be well-defined CL meta-languages, then the first step of specifying a CL language for an application (which I gave earlier):

  • define (i.e. tell the parsing module about) the syntax rules of the CL language to be used

could be simply

  • tell the parsing module whether the CL language will be a WinCL or a NixCL language
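To show what "telling the parsing module which meta-language to use" might look like, here is a toy sketch. The Dialect class, the WINCL and NIXCL constants, and parse_named are all hypothetical; no existing library is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    sigil: str       # character(s) that introduce a key
    separator: str   # character that separates a key from its value


WINCL = Dialect(sigil="/", separator=":")
NIXCL = Dialect(sigil="-", separator="=")


def parse_named(tokens, dialect):
    """Parse tokens of the form <sigil><key><separator><value>."""
    pairs = {}
    for token in tokens:
        if not token.startswith(dialect.sigil):
            raise ValueError(f"expected sigil {dialect.sigil!r} in {token!r}")
        body = token[len(dialect.sigil):]
        key, found, value = body.partition(dialect.separator)
        if not found:
            raise ValueError(f"expected separator {dialect.separator!r} in {token!r}")
        pairs[key] = value
    return pairs


print(parse_named(["/o:file_a.txt", "/n:file_b.txt"], WINCL))
print(parse_named(["-o=file_a.txt", "-n=file_b.txt"], NIXCL))
# both print {'o': 'file_a.txt', 'n': 'file_b.txt'}
```

The application code stays the same for both dialects; only the one Dialect value changes.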

An alternative is to use a parser utility that is designed to handle specifically WinCL or NixCL. Python’s optparse, for example, “supports only the most common command-line syntax and semantics conventionally used under Unix.” And if you aren’t familiar with those conventions, the documentation summarizes them.

4 thoughts on “Command-line syntax: some basic concepts”

  1. There is a good discussion of command-line arguments and options in the optparse docs:

    http://docs.python.org/dev/library/optparse.html#background

    You’re missing one HUGE distinction in your article, which is doing a disservice to your readers. You introduce this new terminology of “keyword arguments”, but omit the common term “command-line OPTIONS”. (I assume you lifted the “keyword arguments” term from the Python programming world; AFAIK it is absent from the command-line lexicon.) Traditionally, command-line arguments like “-v” or “--help” are called “command-line options”, for the simple reason that they’re optional. The command-line libraries I know (getopt & optparse; haven’t used argparse yet) implement options as just that: optional. It is a CONTRADICTION IN TERMS to have “required options” or “mandatory options”. It is considered bad form (and a misuse of such libraries) to require one or more “keyword arguments” on the command line.

    Command-line options may have arguments, e.g. “--output myfile.txt” (those that don’t have argument are, as you write, often called “flags” or “switches”). These are called “option arguments”, and they are usually required by the option. In rare cases option arguments themselves are optional, but that introduces an ambiguity to command-line processing: in “-x one two”, is “one” the argument of the “-x” option, or is it the first positional argument? Best to avoid such ambiguity by having different options for the has-argument and no-argument cases.

  2. Slight correction: Attempting to “hide” a sigil character by quoting its argument doesn’t work, at least not when calling a program from a standard shell. The quotes are interpreted by the shell, not the program; the program receives a list of arguments, any of which may or may not have whitespace, and any of which may have undergone quote stripping. The sigil character, in contrast, is interpreted by the program. Try it:

    $ echo "import sys" > quotes.py
    $ echo "print(sys.argv)" >> quotes.py
    $ python quotes.py -unquoted "-quoted"
    ['quotes.py', '-unquoted', '-quoted']

    In addition, many modern programs allow keyword arguments to be specified after (or even between) positional arguments, but most still recommend listing positional arguments first. In practice, the main one I use in the “wrong” place is the -m/--message argument to svn/git/hg/bzr commit; I was routinely annoyed by a version of Mercurial that enforced the convention. Then there’s `find`…

  3. You may be interested in a little Python module I wrote to make handling of command line arguments even easier (open source and free to use) – Commando
