
The boundaries which divide Life from Death are at best shadowy and vague. Who shall say where the one ends and where the other begins?

This post looks at the details of how a program running under Windows splits its command line into individual arguments.

Parsers

We’re only going to consider the two most common command line parsers: parse_cmdline and CommandLineToArgvW. The first, parse_cmdline, is part of the Microsoft CRT library initialization code and is called automatically at program startup to generate argv[]. It has both an ANSI and a Unicode version. CommandLineToArgvW is called explicitly by the programmer and has just a Unicode version. There is no standard function to un-parse a command line, that is, one that takes a set of arguments and outputs a command line that regenerates the same set of arguments (something like ArgvToCommandLine), but I will give you a sample utility that does just that.

Whitespace Defined

I already mentioned in my previous post that the command line is divided into arguments at whitespace boundaries by both parse_cmdline and CommandLineToArgvW. Conceptually the process is simple— the parser scans the command line string from left to right looking for sequences of one or more whitespace characters which it discards. The substrings that remain between these blocks of whitespace become the individual arguments. This simplistic description ignores for the time being the handling of special characters like double quote and does not take into account the special handling of the first argument. We will discuss these issues later.

Both parse_cmdline and CommandLineToArgvW define whitespace as just space (0x20) and horizontal tab (0x09).

Given that whitespace is afforded special treatment, the obvious question is how do you supply an argument that contains literal whitespace characters and prevent them from being interpreted as an argument separator? We’ve already seen the answer: by adding a double quote character ["] to the command line to switch the parser state from InterpretSpecialChars to IgnoreSpecialChars. This has to be done somewhere after the previous separator whitespace, and before the literal space. After encountering this double quote character (which is removed) the parser stops recognizing whitespace as an argument separator. A subsequent double quote character (also removed) re-enables this recognition. I will show later how this toggling behavior can be selectively overridden.

The following has already been stated (with different wording), but deserves repeating:
 
Quoting serves only to tell the parser to stop or restart interpreting whitespace as an argument separator. It does not determine where an argument begins or ends.

You do not need to enclose an entire argument containing spaces in double quotes— just make sure the recognition of whitespace as an argument separator is disabled (state is IgnoreSpecialChars) prior to any literal whitespace.
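The toggling described above can be sketched in a few lines of Python. This is only an illustration, not the real parser: the function name split_toggle_only is made up for this example, and the sketch deliberately ignores backslashes and the special handling of the first argument.

```python
def split_toggle_only(cmdline):
    """Split on space/tab; each double quote only toggles the parser state."""
    args = []
    current = ""
    in_arg = False          # True once the current argument has started
    ignore_special = False  # False = InterpretSpecialChars, True = IgnoreSpecialChars
    for ch in cmdline:
        if ch == '"':
            ignore_special = not ignore_special  # toggle; the quote itself is dropped
            in_arg = True
        elif ch in " \t" and not ignore_special:
            if in_arg:                           # separator whitespace ends the argument
                args.append(current)
                current, in_arg = "", False
        else:
            current += ch                        # includes whitespace while ignoring
            in_arg = True
    if in_arg:
        args.append(current)
    return args


print(split_toggle_only('one" "two three'))   # ['one two', 'three']
print(split_toggle_only('"one two" three'))   # ['one two', 'three']
```

Both calls produce the same arguments even though only the second one encloses the whole argument in double quotes.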

Escaping Defined

Since the double quote character is now given special meaning, you have a problem similar to the one you had for whitespace: how do you insert a literal double quote character without toggling the parser state?

This is where the concept of escaping comes in. Another special character is defined, the backslash ([\]), which means don’t interpret the next character as special. If you want to include a literal double quote character, whether the state is InterpretSpecialChars or IgnoreSpecialChars, just precede it by a backslash: [\"]. The word escape reflects the idea that you are temporarily escaping from the normal flow of processing. The concept is subtly different from that of quoting because it’s active for only a single character after which the state automatically switches back to normal processing. Double quotes on the other hand switch and latch the parser state until another double quote character is seen.

Once again this raises the question: how do you include a real escape character? It’s beginning to feel like we’re never going to get off this treadmill! As soon as we define a way to handle a special case, our so-called fix introduces a new special case. We turned whitespace as the special case into double quote as the special case, then double quote as the special case into backslash as the special case.

Luckily for us, the chain is broken at this point through a simple mechanism: allow the escape character to escape itself! Conceptually, it works a lot like string constants in source code— to include a literal backslash, just supply two backslashes: [first\\second] for [first\second]. No new special case is introduced.

In an ideal world that’s all there would be to it. You would separate arguments with whitespace, protect literal whitespace by including a double quote character to enter IgnoreSpecialChars, place a backslash before any double quote that you don’t want to change the parser state, and always double up (or self escape) backslashes that are not escaping something.

Not so fast! Dealing With a Proliferation of Backslashes

But we don’t live in an ideal world. If you read the Microsoft documentation, you’ll see a set of rules that complicate this ideal behavior. I’m going to explain what I think is the reasoning behind some of these rules with the hope that doing so will help you understand and remember them better.

But first I’ll just list the Microsoft rules from the documentation (my own remarks are added in parentheses, marked “Comment:”)

  • Arguments are delimited by white space, which is either a space or a tab.
  • The caret character (^) is not recognized as an escape character or delimiter. The character is handled completely by the command-line parser in the operating system before being passed to the argv array in the program. (Comment: a potentially confusing mix of information about cmd.exe and the executable’s processing of the command line. You should keep these separate in your mind!)
  • A string surrounded by double quotation marks ("string") is interpreted as a single argument, regardless of white space contained within. A quoted string can be embedded in an argument. (Comment: what exactly does this mean? See the discussion below.)
  • A double quotation mark preceded by a backslash (\") is interpreted as a literal double quotation mark character ("). (Comment: independent of the parser state.)
  • Backslashes are interpreted literally, unless they immediately precede a double quotation mark.
  • If an even number of backslashes is followed by a double quotation mark, one backslash is placed in the argv array for every pair of backslashes, and the double quotation mark is interpreted as a string delimiter. (Comment: this reinforces the error-prone concept that double quotes enclose some meaningful string, or (worse) a complete argument.)
  • If an odd number of backslashes is followed by a double quotation mark, one backslash is placed in the argv array for every pair of backslashes, and the double quotation mark is “escaped” by the remaining backslash, causing a literal double quotation mark (") to be placed in argv.
  • (there is one more undocumented rule which we will discuss later)
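The even/odd backslash rules in the list above reduce to simple arithmetic on the length of the backslash run. A minimal Python sketch (the function name backslashes_before_quote is invented for this illustration):

```python
def backslashes_before_quote(run_length):
    """Apply the documented rule to a run of backslashes directly before a ["].

    Returns (literal_backslashes, quote_is_literal).
    """
    literal_backslashes = run_length // 2    # one [\] is output per pair
    quote_is_literal = run_length % 2 == 1   # a leftover [\] escapes the quote
    return literal_backslashes, quote_is_literal


for n in range(5):
    print(n, backslashes_before_quote(n))
# 0 (0, False)
# 1 (0, True)
# 2 (1, False)
# 3 (1, True)
# 4 (2, False)
```

When the second value is False, the double quote is not output and acts on the parser state instead.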

 
I don’t know what is meant by the statement a quoted string can be embedded in an argument (from the third rule). To me it suggests you can do something like the following and still have the entire string interpreted as a single argument:

["She said "you can't do this!", didn't she?"]

But this is interpreted as four arguments:

argv[1] = [She said you]
argv[2] = [can't]
argv[3] = [do]
argv[4] = [this!, didn't she?]

Or maybe it means you can embed a quoted string in the middle if you’re not already in a quoted string? That’s true, but then you can’t have spaces elsewhere:

[Shesaid"you can't do this!",didn'tshe?]

Not quite.

So just forget about quoted strings and embedded quotes.

This happens to be a good example to demonstrate some of the things I have been saying. I’ll show the parse image first then point out just the interesting parts:
 

Image of Example Command Line Parse

  • 3) Switch to IgnoreSpecialChars
  • 4) Switch to InterpretSpecialChars
  • 8) Single double quote character- Enter IgnoreSpecialChars
  • 9) Single double quote character- Return to InterpretSpecialChars

 
The second double quote character, immediately before [you], does not end the first argument. This particular situation seems to cause conceptual difficulty for some people.

The double quote character simply causes the parser to leave IgnoreSpecialChars and re-enter InterpretSpecialChars (indicated by the change from red to green). Because there is no whitespace, the argument continues. But because the parser is now in InterpretSpecialChars, the space between [you] and [can’t] does start a new argument. So does the space between [can’t] and [do], as well as between [do] and [this!].

Finally, the third double quote character (before the comma) causes another transition into IgnoreSpecialChars, preventing any further arguments from being generated.

Backslash Rules

Backslashes are very common because they’re the file system path separator under Windows. The writers of the C/C++ runtime library must have realized that it’s not only inconvenient, but also just plain ugly to have to fully escape these backslashes:

A UNC path such as:

[\\SomeComputer\subdir1\subdir2\]

Would need to be passed to CreateProcess in the command line string as:

[\\\\SomeComputer\\subdir1\\subdir2\\]

Which would in turn be expressed in source code as:

["\\\\\\\\SomeComputer\\\\subdir1\\\\subdir2\\\\"]

Ouch!

I believe that it was to avoid this proliferation of backslashes that Microsoft introduced the rule that backslashes are interpreted literally (except when they precede a double quote character).

But this rule results in an ambiguity: when the parser encounters [\"], is it an escaped double quote or a backslash followed by an un-escaped quote?

The strange-sounding rules for backslashes resolve the ambiguity. It might be more easily understood if you think of an escaped double quote as a single special compound character [\"] (call it an e-quote) instead of two separate characters and re-word the rule (in terms of creating the command line instead of parsing it) to read something like: Escape any literal backslashes that occur immediately prior to either a double quote or an e-quote character so they’re not interpreted as an escape character.

The intent may have been to facilitate the quoting of paths generated or received by a script or executable (not a human)— i.e., to allow you to always unconditionally enclose paths in double quote characters without further escaping. But doing so fails if the path contains a trailing backslash because the backslash is interpreted as an escape, as we have seen.

The following example illustrates the problem. The command line is unexpectedly split into just the two arguments shown. The final two components are appended to the second argument because the parser state is still IgnoreSpecialChars after the escaped double quote character and remains so until the end of the command line, where the parser ends in the IgnoreSpecialChars state. You should think of it that way instead of as an un-closed quoted string that is implicitly closed at the end of the command line:

Command line:

[test.exe "c:\Path With Spaces\Ending In Backslash\" Arg2 Arg3]

Actual arguments generated:

[test.exe]
[c:\Path With Spaces\Ending In Backslash" Arg2 Arg3]

Probably what was expected:

[test.exe]
[c:\Path With Spaces\Ending In Backslash\]
[Arg2]
[Arg3]
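When a program builds the command line itself, one way to avoid the split shown above is to double any backslashes that directly precede a double quote, including the closing one. The following Python sketch does that. The name quote_arg is hypothetical (it is not the utility mentioned in this post), and the sketch assumes the argument is not the first one on the command line.

```python
def quote_arg(arg):
    """Wrap arg in double quotes so the backslash rules give back exactly arg."""
    out = ['"']
    pending = 0                      # backslashes seen but not yet emitted
    for ch in arg:
        if ch == "\\":
            pending += 1
        elif ch == '"':
            # Backslashes before a literal quote are doubled, then an e-quote follows.
            out.append("\\" * (pending * 2) + '\\"')
            pending = 0
        else:
            out.append("\\" * pending + ch)   # other backslashes stay single
            pending = 0
    # Trailing backslashes precede the closing quote, so they are doubled too.
    out.append("\\" * (pending * 2) + '"')
    return "".join(out)


print(quote_arg("c:\\Path With Spaces\\Ending In Backslash\\"))
# "c:\Path With Spaces\Ending In Backslash\\"
print(quote_arg('say "hi"'))
# "say \"hi\""
```

Backslashes in the middle of the path are left alone, which matches the rule that backslashes are literal unless they precede a double quote.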

You might think you can avoid this by always stripping trailing backslashes. But doing so fails for one important special case— a root directory:

The path

[c:\]

(the root directory of drive c:) has a different meaning than

[c:]

(the current working directory on drive c:, not necessarily the root).

You can unconditionally remove trailing backslashes from paths, except for a root directory which must be explicitly handled as a special case.
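As a rough illustration of that special case, here is a small Python sketch. The helper name strip_trailing_backslash is made up for this example, and it only recognizes drive-letter roots such as [c:\] and the bare root [\]; UNC roots and other path forms are not considered.

```python
def strip_trailing_backslash(path):
    # Keep the backslash for a drive root like c:\ or for the bare root \.
    is_drive_root = len(path) == 3 and path[0].isalpha() and path[1:] == ":\\"
    if path.endswith("\\") and not is_drive_root and path != "\\":
        return path[:-1]             # remove exactly one trailing backslash
    return path


print(strip_trailing_backslash("c:\\Some Dir\\"))   # c:\Some Dir
print(strip_trailing_backslash("c:\\"))             # c:\
```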

Parser Specifics

So far we’ve covered that the command line is split on whitespace, that you can disable or re-enable (toggle) this splitting using a double quote character, and that you can mask the special toggling behavior of the double quote character by escaping it with a backslash.

We also covered the special rules Microsoft introduced for backslashes to avoid the need to escape backslashes in paths.

We’re now going to delve into the specifics of the parsers themselves.

The following pseudocode is for both parse_cmdline and CommandLineToArgvW. It ignores the special handling of the first argument. Except for one very minor difference (highlighted), it is the same for both. Ironically, the one situation where the behavior is not the same involves the undocumented rule I alluded to when listing the Microsoft rules, above.

I have not rigorously verified the following pseudocode. If you have Visual Studio installed you can find the actual source code for parse_cmdline in the file stdargv.c in the CRT source directory.

One aspect of parse_cmdline that I do not cover is the expansion of wildcards (* and ?), a feature that must be compiled into the executable.

 

State = InterpretSpecialChars
while(command line string not finished) {
   advance past leading whitespace (space or tab)
   count and advance past leading backslashes
   if (current character is ["]) {
      for each pair of leading backslashes counted, output a single [\]
      if (a backslash is leftover) {
         skip the leftover [\] and append ["] to the current argument
      }
      else if ((the current ["] is followed by a second ["]) && State == IgnoreSpecialChars) {
         skip the first ["] and append the second ["] to the current argument
         if (parser is CommandLineToArgvW) { // parse_cmdline remains in IgnoreSpecialChars
            State = InterpretSpecialChars;
         }
      }
      else {
         // toggle parser state:
         if (State == InterpretSpecialChars) {
            State = IgnoreSpecialChars
         }
         else {
            State = InterpretSpecialChars
         }
      }
   }
   else {
      for each leading backslash, output a single [\]
      if (next character is space or tab) {
         if (State == InterpretSpecialChars) {
            start a new argument;
         }
      }
      else {
         append the current character to the current argument.
      }
   }
}

Other than how the first argument is handled (discussed later), the only difference between parse_cmdline and CommandLineToArgvW is what the parser does after it encounters two double quote characters in a row when the state is IgnoreSpecialChars. Both parsers treat the first double quote character as a kind of escape for the second one and discard it (this is the undocumented rule). The second double quote causes CommandLineToArgvW to exit IgnoreSpecialChars, while parse_cmdline remains in IgnoreSpecialChars.

You are not likely to encounter a practical command line that causes the two parsers to generate different results. Because I have seen numerous examples of contrived command lines that demonstrate the difference, and people seem to be interested in how to explain them, I’m going to go over one extreme example that I encountered on the Internet:

[DumpArgs foo""""""""""""bar]

parse_cmdline generates the following arguments for the prior command line:

[DumpArgs]
[foo"""""bar] (5 literal double quote characters)

But CommandLineToArgvW generates these:

[DumpArgs]
[foo""""bar] (only 4 literal double quote characters)

I’ve seen people try, unsuccessfully, to explain this example by the usual method of looking for pairs of double quote characters. But if you simply scan from left to right as I have shown, tracking the parser state, you’ll find that the difference is due to the fact that CommandLineToArgvW exits then re-enters IgnoreSpecialChars repeatedly (every time it encounters two consecutive double quote characters), but parse_cmdline only does so one time.
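To make the left-to-right scan concrete, here is a Python sketch of the pseudocode above. Like the pseudocode, it has not been checked against the real parsers; split_args and its parser parameter are names invented for this illustration, and the special handling of the first argument is ignored (a plain program name such as DumpArgs is simply parsed like any other argument).

```python
def split_args(cmdline, parser="parse_cmdline"):
    """Split a command line; parser is "parse_cmdline" or "CommandLineToArgvW"."""
    args, current, in_arg = [], "", False
    ignore_special = False               # False = InterpretSpecialChars
    i, n = 0, len(cmdline)
    while i < n:
        run = 0                          # count and advance past a backslash run
        while i < n and cmdline[i] == "\\":
            run += 1
            i += 1
        if run:
            in_arg = True
        if i < n and cmdline[i] == '"':
            current += "\\" * (run // 2)             # one [\] per pair
            in_arg = True
            if run % 2:                              # leftover [\]: literal quote
                current += '"'
            elif ignore_special and cmdline[i + 1:i + 2] == '"':
                current += '"'                       # two quotes in a row: one literal quote
                i += 1                               # skip the first quote of the pair
                if parser == "CommandLineToArgvW":
                    ignore_special = False           # parse_cmdline stays in IgnoreSpecialChars
            else:
                ignore_special = not ignore_special  # plain toggle
            i += 1
            continue
        current += "\\" * run                        # backslashes are literal here
        if i >= n:
            break
        ch = cmdline[i]
        i += 1
        if ch in " \t" and not ignore_special:
            if in_arg:                               # separator ends the argument
                args.append(current)
                current, in_arg = "", False
        else:
            current += ch
            in_arg = True
    if in_arg:
        args.append(current)
    return args


line = "DumpArgs foo" + '"' * 12 + "bar"
print(split_args(line, "parse_cmdline"))        # ['DumpArgs', 'foo"""""bar']
print(split_args(line, "CommandLineToArgvW"))   # ['DumpArgs', 'foo""""bar']
```

The only place the parser argument matters is the branch for two double quotes in a row, which is where the 5-versus-4 difference comes from.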

This will be easier to visualize by seeing the diagrams.

First we see how parse_cmdline parses the example. It removes a double quote both when entering and when leaving IgnoreSpecialChars (at points 3 and 4). While the state remains IgnoreSpecialChars, it removes one of each of the five pairs of double quote characters (the points marked '6'), for a total of seven double quote characters removed:
 

Image of Example Command Line Parse

Next we see how CommandLineToArgvW parses the same example. Like parse_cmdline, this parser removes a double quote character every time it enters or leaves IgnoreSpecialChars. Unlike parse_cmdline, where IgnoreSpecialChars is entered and left only once, here it happens four times each, at the points marked 3 (enter) and 5 (leave). So eight double quote characters are removed instead of just the seven that parse_cmdline removes:
 

Image of Example Command Line Parse

  • 3) Switch to IgnoreSpecialChars
  • 4) Switch to InterpretSpecialChars
  • 5) First of 2 double quote characters in a row escapes next double quote. Next double quote switches to InterpretSpecialChars
  • 6) First of 2 double quote characters in a row escapes next double quote
  • 7) CommandLineToArgvW: saw 2 double quote characters in a row. Return to InterpretSpecialChars
  • 8) Single double quote character- Enter IgnoreSpecialChars
  • 9) Single double quote character- Return to InterpretSpecialChars

To convince you this is not just academic, the next example is my attempt to come up with a plausible real-world scenario (though to me it still seems unlikely to occur):

Suppose we have a processing pipeline where the first program generates the string [hello world]. The next 2 stages in the pipeline blindly double-quote the argument and pass it on, generating first ["hello world"], then [""hello world""]. Finally, the last stage double quotes the entire command line and passes it on to FinalProgram.exe:

[FinalProgram.exe "first second ""embedded quote"" third"]

Command Line Arguments From CommandLineToArgvW:

arg 0   = [FinalProgram.exe]
arg 1   = [first second "embedded]
arg 2   = [quote]
arg 3   = [third]


Image of Example Command Line Parse

Command Line Arguments From argv Array (argc = 2):

argv[0] = [FinalProgram.exe]
argv[1] = [first second "embedded quote" third]


Image of Example Command Line Parse

Here, we again see the undocumented rule come into play— two double quote characters in a row while the state is IgnoreSpecialChars are interpreted as an escaped double quote character. Both parsers consume the first one and output the second one. But the difference is that for CommandLineToArgvW, there is a transition back to InterpretSpecialChars, while parse_cmdline remains in IgnoreSpecialChars.

First Command Line Argument

Both parsers process the first argument differently than the remainder of the command line. There may be a good reason for this, but I can’t think of one and I think it just causes additional confusion without providing much, if any benefit.

To compound the confusion, there is a much greater difference between the two parsers for the first argument than there is for the remainder of the command line.

I will discuss the two parsers separately.

Pseudocode for parse_cmdline Handling of First Argument

The following is pseudocode for parse_cmdline:

 

ParserState = InterpretSpecialChars
loop while not end of command line and not end of argv[0] {
   if (char is space or tab) and (ParserState is InterpretSpecialChars) {
      Overwrite the whitespace char with string terminator
      End argv[0]
      }
   else if (char is ["]) {
      Toggle ParserState
      Discard the ["]
      }
   else {
      Append char to argv[0]
      }
   }
parse remainder of command line normally

Notes for parse_cmdline

  • You can enter and exit IgnoreSpecialChars as many times as you want, the same as when parsing the remainder of the command line. The double quote characters are removed.

The following command line:

["F"i"r"s"t S"e"c"o"n"d" T"h"i"r"d"]

generates just one argument (when it appears at the start of the command line):

[First Second Third]

If you trace it carefully you will find that the state is IgnoreSpecialChars when each of the two spaces is encountered.


Image of Example Command Line Parse

  • You cannot use either one of the usual ways of escaping double quote characters (with a backslash or with another double quote character):

The backslash in the following does not escape the double quote that follows it and the two pairs of back-to-back double quote characters do nothing (they’re simply removed because they just cause the state to toggle then immediately toggle back):

[F""ir"s""t \"Second Third"]

Even though the 6th double quote looks like it’s escaped, the backslash does not escape anything when parsing the first argument. Therefore this double quote causes the state to change back to InterpretSpecialChars and the following space ends the first argument. As a result, two arguments are generated instead of the single argument ([First "Second Third]) that would be generated for the same text later in the command line:

argv[0] = [First \Second]
argv[1] = [Third]
  • If the first character of the command line is a space or tab, an empty first argument is generated and the remainder of the command line is parsed normally:

This command line:

[  Something Else]

Generates these 3 arguments:

argv[0] = []
argv[1] = [Something]
argv[2] = [Else]
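The first-argument pseudocode for parse_cmdline translates almost directly into Python. The following is a sketch only; the function name first_arg_parse_cmdline is invented here, and it returns the first argument together with the unparsed remainder of the command line.

```python
def first_arg_parse_cmdline(cmdline):
    """Return (argv0, rest) following the parse_cmdline pseudocode above."""
    argv0 = ""
    ignore_special = False                # False = InterpretSpecialChars
    i = 0
    while i < len(cmdline):
        ch = cmdline[i]
        i += 1
        if ch in " \t" and not ignore_special:
            break                         # separator whitespace ends argv[0]
        if ch == '"':
            ignore_special = not ignore_special   # toggle and discard the quote
        else:
            argv0 += ch                   # a backslash is an ordinary character here
    return argv0, cmdline[i:]


print(first_arg_parse_cmdline('"F"i"r"s"t S"e"c"o"n"d" T"h"i"r"d"'))
# ('First Second Third', '')
print(first_arg_parse_cmdline('  Something Else'))
# ('', ' Something Else')
```

The remainder would then be handed to the normal parsing described earlier.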

Pseudocode for CommandLineToArgvW Handling of First Argument

The following is pseudocode for CommandLineToArgvW:

 

if ((first char >= 0x01) and (first char <= 0x20)) {
    arg[0] is empty string
    }
else if (first char is ["]) {
    Discard the ["]
    loop while not end of arg[0] {
       if (char is ["]) {
          Discard the ["]
          End arg[0]
          }
       else if end of command line {
          End arg[0]
          }
       else {
          Append char to arg[0]
          }
       }
    }
else {
    loop while not end of arg[0] {
       if (char >= 0x01 and char <= 0x20) {
          Discard char
          End arg[0]
          }
       else if end of command line {
          End arg[0]
          }
       else {
          Append char to arg[0]
          }
       }
    }
parse remainder of command line normally

Notes for CommandLineToArgvW

CommandLineToArgvW processing of the first argument has some (at least to me) strange behavior, most notably the way it sometimes accepts any character between 0x01 and 0x20 as whitespace.

  • The same as for parse_cmdline, if the first character of the command line is whitespace, an empty first argument is generated and the remainder of the command line is parsed normally. However, any character between 0x01 and 0x20, inclusive, is considered whitespace!

This command line:

[  Something Else]

Generates these 3 arguments:

argv[0] = []
argv[1] = [Something]
argv[2] = [Else]

The first character is shown as a space, but any character between 0x01 and 0x20 will cause an empty first argument. If the second character, again shown as a space, is something in the same range, other than a space or tab, it will become the first character of the next argument.

  • If the first character is a double quote then any non-zero character is accepted as part of the first argument (even 0x01 through 0x20) until another double quote or the end of the command line is encountered.

If the [*] in the next line is really \x05 (or any other character in the range 0x01 through 0x20):

["123 456*abc\def"ghi]

It generates these arguments:

[123 456*abc\def]
[ghi]

(it is correct that [ghi] is a new argument even though there is no whitespace after the second double quote)

  • If the first character is not a double quote and not a space, then any non-whitespace character, including a double quote, is accepted as part of the argument. Here also, whitespace is defined as any character between 0x01 and 0x20, inclusive (not just space and tab).

If the [*] in the next line is really \x05 (or any other character in the range 0x01 through 0x20), then it acts as an argument separator:

[123"456"*abc]

These two arguments are generated:

[123"456"]
[abc]
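The three cases in the CommandLineToArgvW pseudocode can be sketched in Python as follows. The function name is invented for this illustration. Two assumptions are labeled in the comments: the leading whitespace character is treated as consumed (as the note about the second character suggests), and an empty command line is not modeled.

```python
def first_arg_commandlinetoargvw(cmdline):
    """Return (arg0, rest) following the CommandLineToArgvW pseudocode above."""
    def is_low(ch):                      # 0x01..0x20 counts as whitespace here
        return "\x01" <= ch <= "\x20"

    if cmdline and is_low(cmdline[0]):
        # Assumption: the leading character is consumed, not left for the rest.
        return "", cmdline[1:]
    if cmdline[:1] == '"':
        end = cmdline.find('"', 1)       # everything up to the next quote is kept
        if end == -1:
            return cmdline[1:], ""       # no second quote: runs to end of line
        return cmdline[1:end], cmdline[end + 1:]
    for i, ch in enumerate(cmdline):
        if is_low(ch):                   # any low character ends the argument
            return cmdline[:i], cmdline[i + 1:]
    return cmdline, ""                   # assumption: empty input is not modeled


print(first_arg_commandlinetoargvw('"123 456\x05abc\\def"ghi'))
# ('123 456\x05abc\\def', 'ghi')
print(first_arg_commandlinetoargvw('123"456"\x05abc'))
# ('123"456"', 'abc')
```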

You can download a sample CommandLines.txt file containing all the sample command lines in this post. It can be used with the RunTest utility. The file will need to be renamed as CommandLines.txt before using it.

The last thing I need to cover is the behavior of cmd.exe and batch files. I hope to get this posted in the next week or so. See you next time!

I’d like to have an argument, please

This post presents a better way to understand the quoting and escaping of Windows command line arguments.

A Problem Scenario

In order for you to appreciate why it’s important to understand the things we’re discussing, I’m going to start with a typical problem scenario that illustrates the kind of things that can go wrong.

We know from the Microsoft documentation for CreateProcess that command lines are split into individual arguments based on the location of spaces on the command line. We’re told to enclose arguments that contain spaces between a pair of double quote characters. The typical description of how whitespace (space and tab) and quotes are dealt with while parsing a command line uses terms like double-quoted string and inside- or outside a quoted part. Even the Microsoft description of how a command line is parsed says a string surrounded by double quotation marks (“string”) is interpreted as a single argument, regardless of white space contained within. A quoted string can be embedded in an argument.

We’re led to believe that analyzing a command line is just a matter of looking for matching pairs of quote characters that enclose individual arguments or embedded quotes. It sounds simple enough and we’re pretty sure we understand how it works. But still, occasionally we come across an example we can’t explain, or (more likely), some command line breaks an existing program or script.

Suppose we have a program, CreateDocs.exe, that generates documentation for a set of C++ source files. It’s driven by a batch file that accepts a starting directory and an optional “VERBOSE” switch. The batch file first passes the starting directory to the program, then generates a list of subdirectories which it also passes one at a time to the program. Our batch file faithfully quotes each directory in case it contains spaces. Recall that %~1 is the first batch argument (%1) with any existing quotes removed, so ["%~1"] unconditionally double quotes the string whether it was quoted or not on the command line:
 

@echo off
REM %1 is the root of the directory tree to process.
REM %2 is the optional string VERBOSE
cls
setlocal
echo.
IF '%1'=='' echo No Path Specified& goto:eof
set ROOT_PATH=%~1

REM Process root directory:
CreateDocs "%ROOT_PATH%" %2

REM Process subdirectories:
FOR /F %%S IN ('dir /ad /b "%ROOT_PATH%"') DO (
   CreateDocs "%%S" %2
   )
echo.
goto:eof

 

Once in a while the script neglects to process files in the starting directory. We discover that it fails whenever the user appends a trailing backslash to the specified directory. The backslash is interpreted by CreateDocs as an escape for the closing double quote. So instead of receiving [SomeDirectory\] in argv[1] when processing ["SomeDirectory\" VERBOSE], it receives [SomeDirectory" VERBOSE]. If you don’t already understand, you will later see why this is happening. It may seem a little mysterious that the FOR loop works correctly, but that’s only because cmd.exe parses things differently than the executable does.

Although we now know what the problem is, we can’t just tell our users not to append a trailing backslash. Someone will get it wrong! To fix this in CreateDocs would require some relatively complex logic. We could, for example, detect that argv[1] contains a ["] at the end of a valid path (and the directory exists), possibly followed by [ VERBOSE].

We opt instead to add logic to our script to detect and remove the trailing backslash:

 

@echo off
REM %1 is the root of the directory tree to process.
REM %2 is the optional string VERBOSE
cls
setlocal
echo.
IF '%1'=='' echo No Path Specified& goto:eof
set ROOT_PATH=%~1

REM Strip off trailing backslash:
IF [%ROOT_PATH:~-1,1%]==[\] set ROOT_PATH=%ROOT_PATH:~0,-1%

REM Process root directory:
CreateDocs "%ROOT_PATH%" %2

REM Process subdirectories:
FOR /F %%S IN ('dir /ad /b "%ROOT_PATH%"') DO (
   CreateDocs "%%S" %2
   )
echo.
goto:eof

 

We congratulate ourselves on our cleverness and we use the script successfully for months.

Then one day we change our build process. We want to be able to debug using binaries built on different development machines. Because the source code is installed in different places on different machines (something we don’t want to change), the debugger can’t always find the .pdb file because it’s specified in the executable image with an absolute path. We could fix this various ways, but we decide to use a fixed drive letter in the build scripts and use subst to map the root of the source code tree, wherever it is on a particular machine, to the root of this drive.

Once again we start to notice that our script occasionally (but not always) fails.

What’s happening now?

The answer is that our script was smart, but not smart enough. The problem this time turns out to be caused by the fact that we are now specifying a root directory (e.g., [x:\]). When the batch file removes the trailing backslash it generates [x:], which (for a root directory only) does not mean the same thing as the path with the backslash: [x:] specifies the current working directory on that drive, while [x:\] is the root directory.

Most of the time our users open a dedicated console for running this tool and never do any other work there. In that case x:\ is the same as x:. But occasionally someone switches to the substed drive and goes to a subdirectory to do some other work. This makes x: different from x:\ and on those infrequent occasions the script fails.

We fix it by adding a special case: the script does not remove the trailing backslash when a root directory is explicitly specified.

This example was not intended to teach you what to do, but to give a reasonable example of how problems can creep in.

The above batch file still has some issues that I will leave unresolved for now (the “IF” statements may fail with certain strings containing double quote characters).

The Problem With the Existing Way of Looking at Things

Consider the following set of partial command lines:

["Argument With Spaces"]

[Argument" "With" "Spaces]

["Argument "With" Spaces"]

[Argument" With Sp"aces]

["Argument With Spaces]

["Ar"g"um"e"n"t" W"it"h Sp"aces""]

Before continuing, take a few moments and try to pick out the outer and the embedded quoted strings.

You may be surprised to learn that all of these are interpreted exactly the same way by the command line parser— as a single argument: [Argument With Spaces].

But how can all of these possibly generate the same argument? One of the quotes is not even closed!

The command line parser rules must be either inconsistent, incomprehensible, or just plain stupid.

But the problem isn’t with the rules, but simply that the conventional way of looking at things is wrong. We need a different way of analyzing command lines that’s easy to apply and always works.

(for an even more extreme and humorous example take a look at 50 Ways to Say Hello)

A Better Way To Look at Things

We come now to a very important concept.
 

Double quote characters in a command line string have no relation to the boundaries between arguments and do not necessarily enclose arguments. Each individual double quote character by itself acts simply as a switch to enable or disable the recognition of space as a divider between arguments.

 
Attempting to find pairs of double quote characters that enclose meaningful chunks of text is possibly the biggest conceptual mistake that people make when looking at a complicated command line.

This point is extremely important (possibly the most important concept in this article) so I’m going to discuss it in depth before I move on to explain how a command line is received and processed by an executable program.

Each of the command line parsers we will consider (parse_cmdline, CommandLineToArgvW, and cmd.exe) has a set of characters that in some contexts it considers special (examples are whitespace, double quote, caret, and backslash) and which cause some action to be taken when one of them is encountered— such as beginning a new argument (and removing the special character).

In other contexts the parser treats these same characters like regular text. So at any given time the parser is in one of two states— recognizing (i.e., interpreting) special characters, or ignoring them.

The Parser States Named

In order to be explicit when referring to these two states in the text, I invented names for them. I call the first state InterpretSpecialChars and the second state IgnoreSpecialChars. They correspond to what some writers refer to as being outside or inside a double quoted string. I purposely avoided giving them names containing quote or quoting because doing so reinforces the misconception that double quote characters somehow delineate arguments. The colors shown are used later to show the parser state in images I created to demonstrate how example command lines are parsed.

I originally considered calling these InterpretWhitespace and IgnoreWhitespace, but I wanted to use the same state names when discussing cmd.exe where the state governs the interpretation of a different set of special characters, not whitespace.
 

The key to understanding any command line, no matter how complex, is to pay attention to which of these two states the parser is in at any given time and to understand what causes the state to change.

 

The Last Example Explained

Let’s examine the last example in the above list to see how it is parsed, character by character, from left to right. The special character that we will see either interpreted or ignored, depending on the parser state, is the space character.

You’ll see that the parser state is IgnoreSpecialChars when both the spaces are read, so they do not cause a new argument to be started.

Each line below represents a single character read from the command line. The first column is the actual character read, followed by the parser state before processing the character, the action that was taken, and the value of the partial argument after processing the character.

Here’s the command line again:

["Ar"g"um"e"n"t" With Sp"aces""]

Char read: State when char read:    Action:                        Argument (after processing char):
           InterpretSpecialChars    Start                          []
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       []
    A      IgnoreSpecialChars       Add char [A]                   [A]
    r      IgnoreSpecialChars       Add char [r]                   [Ar]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Ar]
    g      InterpretSpecialChars    Add char [g]                   [Arg]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Arg]
    u      IgnoreSpecialChars       Add char [u]                   [Argu]
    m      IgnoreSpecialChars       Add char [m]                   [Argum]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argum]
    e      InterpretSpecialChars    Add char [e]                   [Argume]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Argume]
    n      IgnoreSpecialChars       Add char [n]                   [Argumen]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argumen]
    t      InterpretSpecialChars    Add char [t]                   [Argument]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Argument]
           IgnoreSpecialChars       Add char [ ]                   [Argument ]
    W      IgnoreSpecialChars       Add char [W]                   [Argument W]
    i      IgnoreSpecialChars       Add char [i]                   [Argument Wi]
    t      IgnoreSpecialChars       Add char [t]                   [Argument Wit]
    h      IgnoreSpecialChars       Add char [h]                   [Argument With]
           IgnoreSpecialChars       Add char [ ]                   [Argument With ]
    S      IgnoreSpecialChars       Add char [S]                   [Argument With S]
    p      IgnoreSpecialChars       Add char [p]                   [Argument With Sp]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argument With Sp]
    a      InterpretSpecialChars    Add char [a]                   [Argument With Spa]
    c      InterpretSpecialChars    Add char [c]                   [Argument With Spac]
    e      InterpretSpecialChars    Add char [e]                   [Argument With Space]
    s      InterpretSpecialChars    Add char [s]                   [Argument With Spaces]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Argument With Spaces]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argument With Spaces]

 

Image of Example Command Line Parse. Callouts: 3) switch to IgnoreSpecialChars; 4) switch to InterpretSpecialChars; 8) single double quote character, enter IgnoreSpecialChars; 9) single double quote character, return to InterpretSpecialChars.
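The character-by-character walk-through above can be reproduced with a few lines of code. This is a simplified sketch, not the code of parse_cmdline or CommandLineToArgvW: the function name traceFirstArgument is hypothetical, and it models only the two behaviors discussed so far (a double quote toggles the state and is dropped, and whitespace seen in InterpretSpecialChars ends the argument). The backslash rules and the special handling of the first argument are deliberately left out.

```cpp
#include <iostream>
#include <string>

enum class ParserState { InterpretSpecialChars, IgnoreSpecialChars };

static const char* stateName(ParserState state) {
    return state == ParserState::InterpretSpecialChars ? "InterpretSpecialChars"
                                                       : "IgnoreSpecialChars";
}

// Hypothetical tracer: writes one row per character read (character, state
// before processing, action, argument so far) and returns the first argument.
std::string traceFirstArgument(const std::string& commandLine, std::ostream& out) {
    ParserState state = ParserState::InterpretSpecialChars;
    std::string argument;
    for (char ch : commandLine) {
        out << '[' << ch << "]  " << stateName(state) << "  ";
        if (ch == '"') {
            // A double quote never becomes part of the argument; it only toggles.
            state = (state == ParserState::InterpretSpecialChars)
                        ? ParserState::IgnoreSpecialChars
                        : ParserState::InterpretSpecialChars;
            out << "Go to " << stateName(state);
        } else if (state == ParserState::InterpretSpecialChars &&
                   (ch == ' ' || ch == '\t')) {
            // Whitespace is a separator only while it is being interpreted.
            out << "End of argument  [" << argument << "]\n";
            break;
        } else {
            argument += ch;
            out << "Add char [" << ch << ']';
        }
        out << "  [" << argument << "]\n";
    }
    return argument;
}
```

Calling traceFirstArgument with the example command line and std::cout prints rows in the same order as the table and returns the single argument [Argument With Spaces].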

Quoting Defined

Enabling or disabling the recognition of special characters (switching between InterpretSpecialChars and IgnoreSpecialChars) is done by strategically placing double quote characters at specific locations on the command line— wherever you want the state to change. This is under the control of the person who writes the command line. The act of placing double quote characters for this purpose is how I define the term quoting. It is not the delineation of a piece of text by enclosing it in a pair of double quote characters.

 
Quoting is the placement of individual double quote characters on the command line to control the switching between InterpretSpecialChars and IgnoreSpecialChars.
 
Since double quote characters work individually and not in pairs, there is no such concept as the dangling (unclosed) or mismatched quote that other writers discuss. The command line parser simply ends in one or the other of the two states. If there is an even number of un-escaped double quote characters on the command line (we will define escape later) the parser ends in InterpretSpecialChars. If there is an odd number of un-escaped double quote characters it ends in IgnoreSpecialChars.
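The even/odd rule can be written down directly. The sketch below is illustrative only: the function name finalParserState is hypothetical, and it assumes the command line contains no escaped double quote characters (escaping has not been defined yet), so every double quote counts as a toggle.

```cpp
#include <algorithm>
#include <string>

enum class ParserState { InterpretSpecialChars, IgnoreSpecialChars };

// Assumption: no escaped double quotes, so each one toggles the state.
// The parser starts in InterpretSpecialChars, so an even count ends there.
ParserState finalParserState(const std::string& commandLine) {
    const auto quoteCount = std::count(commandLine.begin(), commandLine.end(), '"');
    return (quoteCount % 2 == 0) ? ParserState::InterpretSpecialChars
                                 : ParserState::IgnoreSpecialChars;
}
```

For the example ["Argument With Spaces] the count is one, so the parser simply finishes in IgnoreSpecialChars; nothing is left "unclosed".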

What would traditionally be called a quoted string (text between two double quote characters) can span multiple arguments or be contained completely within an argument— more evidence that it’s hopeless to use pairs of double quote characters to find the arguments!

The seemingly arcane parse rules in the Microsoft documentation actually make sense once you understand that they don’t directly tell you how the command line is split into arguments, but merely describe what causes the parser to switch state.

By training your mind to scan a command line from left to right like the parser does, instead of trying to pick out the quoted chunks, you will have little problem understanding or correctly generating even the most complicated command line.

We’ll talk more about this when we go over the specifics of the parsers later.

How a Program Receives Its Command Line

Everyone knows the purpose of command line arguments— to customize a particular run of a program or a script. This section covers how an executable program interprets the command line string received from CreateProcess. Elsewhere I will cover the behavior of cmd.exe. For now, keep the following in mind:

 
The way cmd.exe interprets the command line is different from, and completely independent of, how an executable program interprets the command line. You must separate your thinking about what cmd.exe does from your thinking about what an executable program does.

 
From a programmer’s perspective, a Windows application written in C or C++ begins execution in main or WinMain and makes available (as function parameters) either an array of individual arguments (main’s argv[]) or a single command line string (WinMain’s lpCmdLine). Both types of application also have available an alternate source for the same set of arguments contained in argv[]— the global variables __argc and __argv. These are filled in before main or WinMain is called so many programmers believe that their programs receive the command line already split up into individual arguments, perhaps by Windows itself. But that is not the case.
 
A program receives a single command line string which the program itself, not the operating system, splits into individual arguments.
 
This string normally consists of the executable name followed by the actual arguments. I will have a lot more to say about this, but for now the important thing to remember is there is just one command line string.

All programs executed under Windows are ultimately started by CreateProcess (or variants such as CreateProcessAsUser). CreateProcess is a complex subject and I’m only going to look at two of the parameters it accepts: the executable name of the program to start (lpApplicationName) and the command line string (lpCommandLine), both of which are optional. Obviously, you have to tell Windows what program to run, so if the first parameter is not supplied (is NULL) then the program name must be given at the start of the command line string. By convention, even if you do specify the program name in the first parameter, you’re supposed to repeat it at the beginning of the command line (if you supply one), but this is not enforced and leaving it out can cause problems. If you don’t supply a command line Windows creates one containing just the program name.

 
A program does not know how its name was specified and most programs blindly assume the first argument is the program name.
 
The first command line argument is usually interpreted differently than the rest of the command line or ignored altogether. A Windows GUI app will not even see the first argument (at least not in lpCmdLine). If it’s something important it will get lost:

[ImportantArgument-0 Argument-1] is received in lpCmdLine as just [Argument-1]
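To make the loss concrete, here is a rough sketch of how the first argument could be skipped to produce the string a GUI program sees in lpCmdLine. It is an assumption-laden illustration, not the runtime library's actual code: the function name skipFirstArgument is hypothetical, and the sketch uses only the double quote toggle and space/tab separators described in this post.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns what is left of the command line after the
// first argument and the whitespace that follows it have been removed.
std::string skipFirstArgument(const std::string& commandLine) {
    std::size_t pos = 0;
    bool ignoreSpecialChars = false;
    // Consume the first argument: stop at the first space or tab that is
    // seen while whitespace is still being interpreted as a separator.
    while (pos < commandLine.size()) {
        const char ch = commandLine[pos];
        if (ch == '"') {
            ignoreSpecialChars = !ignoreSpecialChars;
        } else if (!ignoreSpecialChars && (ch == ' ' || ch == '\t')) {
            break;
        }
        ++pos;
    }
    // Discard the separator whitespace before the remaining arguments.
    while (pos < commandLine.size() &&
           (commandLine[pos] == ' ' || commandLine[pos] == '\t')) {
        ++pos;
    }
    return commandLine.substr(pos);
}
```

Whatever sits in the first position is discarded, whether or not it is really the program name.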

 
To avoid problems caused by special handling of the first argument, or the assumption that it is the executable name, always include the program name at the beginning of any command line string you supply to CreateProcess.
 
The Microsoft documentation for CreateProcess implies that including the program name at the start of lpCommandLine is optional, something only “C programmers generally” do. After we show how the parsers give special treatment to the first argument, you will understand why I recommend that you always specify the executable name as the first thing on the command line. An interesting thing to note (which we won’t pursue) is that if you specify the executable name in the first parameter (lpApplicationName), the PATH is not searched, but if you pass NULL for that parameter, the PATH is used to locate the program named at the start of the command line. For security reasons, Microsoft recommends that you always supply lpApplicationName and always enclose the executable name at the beginning of the command line string in quotes, though the quotes are strictly necessary only if the path contains spaces.
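A small helper makes this recommendation hard to forget. The sketch below is only an illustration: makeCommandLine is a hypothetical name, and it assumes the arguments string has already been quoted correctly by the caller (the rules for that come later). It relies on the fact that a Windows file name cannot contain a double quote character, so wrapping the program path in quotes is always safe.

```cpp
#include <string>

// Hypothetical helper: builds the string to pass as lpCommandLine, with the
// quoted program path first so it becomes the first argument.
// Assumption: `arguments` is already quoted and escaped as the caller intends.
std::string makeCommandLine(const std::string& programPath,
                            const std::string& arguments) {
    std::string commandLine = "\"" + programPath + "\"";
    if (!arguments.empty()) {
        commandLine += ' ';
        commandLine += arguments;
    }
    return commandLine;
}
```

The result has to be copied into a writable buffer before it is handed to CreateProcessW, because the Unicode version of the function may modify the command line string in place.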

How the Command Line is Split

Programs generally split their command line by treating a sequence of one or more whitespace characters (space, tab) as the separator between arguments. I say generally because a program is free to do whatever it wants and there are no standards to guide us. There is not even universal agreement on the set of characters that constitute whitespace.
 
A program can interpret its command line any way it wants. There is no method to ensure with 100 percent certainty that a given command line is correct or guarantee how it will be broken up into individual components.
 
This may sound hopeless, but fortunately there’s only a small set of facilities for parsing the command line that most programs use. Once you understand these, you will understand how the vast majority of programs behave.

A program that links with the Microsoft C/C++ runtime library automatically splits the command line by calling a function named parse_cmdline and passes the result to main in argc and argv, or to WinMain through the global variables __argc and __argv.

A Windows GUI program can access the command line string (minus the program name) through the lpCmdLine argument to WinMain, and any program can access the full command line, including the program name (or whatever else was specified at the start of the command line when CreateProcess was called) by calling the aptly named GetCommandLine function. The program can then split the resulting string explicitly by calling either CommandLineToArgvW or a custom command line parser.

The splitting rule used by both these parsers is simple: splitting always occurs on a whitespace boundary, and only when the parser state is InterpretSpecialChars.
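That one rule is enough to write a toy splitter. The following sketch is not the algorithm of either parser: splitCommandLine is a hypothetical name, and the code ignores the backslash rule and the special handling of the first argument, both of which are covered in the next post. It only shows splitting at space or tab while the state is InterpretSpecialChars.

```cpp
#include <string>
#include <vector>

// Simplified sketch: double quotes toggle the state and are dropped;
// space and tab separate arguments only in InterpretSpecialChars.
std::vector<std::string> splitCommandLine(const std::string& commandLine) {
    std::vector<std::string> arguments;
    std::string current;
    bool inArgument = false;          // true once the current argument has started
    bool ignoreSpecialChars = false;  // false means InterpretSpecialChars
    for (char ch : commandLine) {
        if (ch == '"') {
            ignoreSpecialChars = !ignoreSpecialChars;
            inArgument = true;  // a quote starts an argument even if it adds no text
        } else if (!ignoreSpecialChars && (ch == ' ' || ch == '\t')) {
            if (inArgument) {
                arguments.push_back(current);
                current.clear();
                inArgument = false;
            }
        } else {
            current += ch;
            inArgument = true;
        }
    }
    if (inArgument) {
        arguments.push_back(current);
    }
    return arguments;
}
```

Because the quote characters are only switches, [Argument" "With" "Spaces] comes out of this function as one argument, exactly like ["Argument With Spaces"].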

Almost all programs use the arguments generated by either parse_cmdline or CommandLineToArgvW, so we will focus on these.

Getting a View on Things

In the next post I’ll go into great detail about the algorithms used by parse_cmdline and CommandLineToArgvW to split a command line. We’ll see some ugliness caused by a special Microsoft rule governing backslashes and we’ll go over the special handling of the first argument. But first I’m going to give you some tools for looking at things (later I’ll give you additional tools that may actually make your job easier).

Download DumpArgs Project
Download RunTest Project

Both of these are console programs.

The first utility is called DumpArgs and is used to see exactly how a given command line is split into arguments.

DumpArgs first displays the command that it received, both ANSI and Unicode in case there’s a difference. Since it is a console app, it already has the command line split into arguments by parse_cmdline (in argv[]). The program retrieves the raw command line using GetCommandLine and splits it again using CommandLineToArgvW then outputs both sets of arguments for comparison.

Simply run it with any command line you want.

Example Run of DumpArgs

 
Command line:

dumpargs  "First NotSecond" Second!

 
Output:


Unicode Command Line (from GetCommandLineW): [dumpargs  "First NotSecond" Second!]


              00 01 02 03 04 05 06 07 | 08 09 0a 0b 0c 0d 0e 0f
              -------------------------------------------------
   00000000:  64 00 75 00 6d 00 70 00 | 61 00 72 00 67 00 73 00   d.u.m.p. | a.r.g.s.
   00000010:  20 00 20 00 22 00 46 00 | 69 00 72 00 73 00 74 00   _._.".F. | i.r.s.t.
   00000020:  20 00 4e 00 6f 00 74 00 | 53 00 65 00 63 00 6f 00   _.N.o.t. | S.e.c.o.
   00000030:  6e 00 64 00 22 00 20 00 | 53 00 65 00 63 00 6f 00   n.d."._. | S.e.c.o.
   00000040:  6e 00 64 00 21 00 00 00                             n.d.!...


ANSI Command Line (from GetCommandLineA): [dumpargs  "First NotSecond" Second!]


              00 01 02 03 04 05 06 07 | 08 09 0a 0b 0c 0d 0e 0f
              -------------------------------------------------
   00000000:  64 75 6d 70 61 72 67 73 | 20 20 22 46 69 72 73 74   dumpargs | __"First
   00000010:  20 4e 6f 74 53 65 63 6f | 6e 64 22 20 53 65 63 6f   _NotSeco | nd"_Seco
   00000020:  6e 64 21 00                                         nd!.


CommandLineToArgvW Found 3 Argument(s)
   arg 0   = [dumpargs]
   arg 1   = [First NotSecond]
   arg 2   = [Second!]

Command Line Arguments From argv Array (argc = 3):
   argv[0] = [dumpargs]
   argv[1] = [First NotSecond]
   argv[2] = [Second!]

I show the source code below, but you can download a zip archive containing a pre-built executable, source code, and a Visual Studio project.

Notes on the VS solution (these notes apply to both the DumpArgs project and the RunTest project, below):

  • Just unzip the archive to an empty directory and open the .sln file.
  • You can switch between ANSI and Unicode build by first clicking on the project in the Solution Explorer in Visual Studio and selecting Project->DumpArgs Properties. Under Configuration Properties->General you will see Character Set in the right pane. Select the one you want (counter-intuitively, you select Multi-Byte Character Set for an ANSI build).

 

// DumpArgs.cpp : Defines the entry point for the console application.
//

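#include "stdafx.h"
#include <windows.h>   // GetCommandLine, LocalFree
#include <shellapi.h>  // CommandLineToArgvW (link with Shell32.lib)
#include <tchar.h>     // _tmain, _tprintf, _T
#include <stdio.h>     // printf, wprintf
#include <string.h>    // strlen, wcslen
#include <ctype.h>     // isprint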

void APIENTRY DumpHex (unsigned char* Buffer, int Length) // Displays a buffer of bytes
{
for (int BufIdx = 0; BufIdx < Length; ++BufIdx) {
   if ((BufIdx % 256) == 0) {
      // Print header every 256 bytes
      _tprintf (_T("\n              00 01 02 03 04 05 06 07 | 08 09 0a 0b 0c 0d 0e 0f\n"));
      _tprintf (_T("              -------------------------------------------------\n"));
      }
   if ((BufIdx % 16) == 0) {
      // Print buffer offset at start of lines (BufIdx is an int, so use %x, not %lx)
      _tprintf (_T("   %08x:  "), BufIdx);
      }
   _tprintf (_T("%02x "), Buffer[BufIdx]);
   // Output separator in middle (but not after last byte):
   if (((BufIdx % 16) == 7) && (BufIdx < (Length - 1))) {
      _tprintf (_T("| "));
      }
   if (((BufIdx % 16) == 15) || (BufIdx == (Length - 1))) {
      // Pad last line with spaces if necessary
      int Padding = 3 * (16 - ((BufIdx % 16) + 1));
      if (Padding > 21) {
         Padding += 2; // For where separator would have been
         }
      while (Padding--) {
         _tprintf (_T(" "));
         }
      // Output printable characters (space shown as '_', non-printable bytes as '.')
      _tprintf (_T("  "));
      for (int CharIdx = 16 * (BufIdx / 16); CharIdx <= BufIdx; ++CharIdx) {
         unsigned char Byte = Buffer[CharIdx];
         _tprintf (_T("%c"), isprint(Byte) ? ((Byte == ' ') ? '_' : Byte) : '.');
         // Output separator in middle (but not after last byte):
         if (((CharIdx % 16) == 7) && (CharIdx != BufIdx)) {
            _tprintf (_T(" | "));
            }
         }
      _tprintf (_T("\n"));
      }
   }
}

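int _tmain(int argc, _TCHAR* argv[])
{
// Raw command line exactly as this process received it (Unicode):
wchar_t* CommandLineUnicode = GetCommandLineW();
wprintf(L"Unicode Command Line (from GetCommandLineW): [%s]\n\n", CommandLineUnicode);
// Dump every byte, including the 2-byte terminating NUL:
DumpHex((unsigned char*)CommandLineUnicode, (int)((wcslen(CommandLineUnicode) + 1) * sizeof(wchar_t)));
_tprintf (_T("\n\n"));

// The same command line as seen through the ANSI API:
char* CommandLineAnsi = GetCommandLineA();
printf("ANSI Command Line (from GetCommandLineA): [%s]\n\n", CommandLineAnsi);
DumpHex((unsigned char*)CommandLineAnsi, (int)(strlen(CommandLineAnsi) + 1));
_tprintf (_T("\n\n"));

// Split the raw command line explicitly with CommandLineToArgvW:
int NumArgs = 0;
wchar_t** Args = CommandLineToArgvW(CommandLineUnicode, &NumArgs);
if (Args == NULL) {
   _tprintf (_T("CommandLineToArgvW failed\n"));
   return 1;
   }
_tprintf (_T("CommandLineToArgvW Found %d Argument(s)\n"), NumArgs);
for (int arg = 0; arg < NumArgs; ++arg) {
   wprintf (L"   arg %s%d   = [%s]\n", ((NumArgs >= 10) && (arg < 10))?L" ":L"", arg, Args[arg]);
   }

// Arguments already split by the runtime library (parse_cmdline) before main was called:
_tprintf (_T("\nCommand Line Arguments From argv Array (argc = %d):\n"), argc);
for (int arg = 0; arg < argc; ++arg) {
   // _T() keeps the padding strings the same width as the format string in a Unicode build
   _tprintf (_T("   argv[%s%d] = [%s]\n"), ((argc >= 10) && (arg < 10))?_T(" "):_T(""), arg, argv[arg]);
   }
_tprintf (_T("\n"));
LocalFree(Args);

return 0;
}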

 

The next utility, RunTest, lets you drive DumpArgs with specific command lines without worrying about how cmd.exe first mangles the command line.

You specify each command line in a file named CommandLines.txt which must be a plain ANSI text file (not Unicode) in the current directory. DumpArgs must also be in the current directory.

Put each command line on a separate line in CommandLines.txt exactly as you want DumpArgs to receive it. Empty lines are ignored and so are lines starting with a semicolon (in the first column), so you can include comments.
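The file format is simple enough to describe in a few lines of portable code. This is a sketch of the filtering rule only, not the ReadLine function that RunTest actually uses: readCommandLines is a hypothetical name, it reads from any std::istream, and unlike ReadLine it does not truncate long lines to a fixed buffer size.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the command lines found in the input, skipping empty lines and
// lines whose first column is a semicolon (comments).
std::vector<std::string> readCommandLines(std::istream& input) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(input, line)) {
        // Tolerate Windows line endings when the stream was opened in binary mode.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        if (line.empty() || line[0] == ';') {
            continue;
        }
        lines.push_back(line);
    }
    return lines;
}
```

Note that a semicolon counts as a comment marker only in the first column, so a line that starts with spaces followed by a semicolon is still passed on as a command line.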

RunTest reads each command line from CommandLines.txt and starts DumpArgs using CreateProcess, passing it the specified command line in the lpCommandLine parameter. It waits up to 5 seconds for DumpArgs to finish. RunTest explicitly specifies DumpArgs.exe in lpApplicationName and does not insert DumpArgs.exe at the beginning of lpCommandLine so that you can test what happens if you pass something other than the executable name as the first command line argument (for example, to see how the first argument is parsed differently).

As I did for DumpArgs, I again show the source code here and you can download a zip archive containing pre-built executables (for both DumpArgs and RunTest), source code and a Visual Studio project for RunTest and a sample CommandLines.txt file.

 

// RunTest.cpp : Defines the entry point for the console application.
//
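#include "stdafx.h"
#include <windows.h>   // CreateFileA, ReadFile, CreateProcessA, WaitForSingleObject
#include <tchar.h>     // _tmain, _tprintf, _T
#include <stdio.h>
#include <string.h>    // memset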
/*
 * Very simple-minded read line function:
 *
 *    - Writes the next line into Buffer and returns true if more data, false if no data or error.
 *    - Trusts that Buffer and BufSize are both valid.
 *    - Always NUL-terminates Buffer, unless Buffer is 0 bytes.
 *    - Works only with ANSI file (not Unicode).
 *    - Skips empty lines and lines that start with ";" (for comments)
 *    - Silently truncates line if Buffer too small (but consumes entire line).
 *
 */
bool ReadLine (HANDLE hFile, char* Buffer, int BufSize)
{
bool Result = false;
try {
   if (BufSize >= 1) {
      int BytesWritten = 0;
      char* BufPtr = Buffer;
      while (1) {
         char Char;
         DWORD BytesRead;
         if ((ReadFile(hFile, &Char, sizeof(char), &BytesRead, NULL) != 0) && (BytesRead == sizeof(char))) {
            if (Char == 0x0a) {
               if (BytesWritten > 0) {
                  if (Buffer[0] == ';') { // Ignore comments
                     Result = false;
                     BytesWritten = 0;
                     BufPtr = Buffer;
                     }
                  else {
                     break;
                     }
                  }
               }
            else if ((Char != 0x0d) && (BytesWritten++ < (BufSize - 1))) {
               *BufPtr++ = Char;
               Result = true;
               }
            }
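         else {
            // End of file or read error. Discard a final comment line that has no newline:
            if ((BytesWritten > 0) && (Buffer[0] == ';')) {
               Result = false;
               BufPtr = Buffer;
               }
            break;
            }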
         }
      *BufPtr = '\0';
      }
   }
catch (...)
   {
   Result = false;
   }
return Result;
}

int _tmain(int argc, _TCHAR* argv[])
{
HANDLE hFile = CreateFileA (".\\CommandLines.txt", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
if (hFile == INVALID_HANDLE_VALUE) {
   _tprintf(_T("\n\nERROR: Could not find CommandLines.txt in the current directory\n\n"));
   return 1;
   }
while (1) {
   char CommandLine[1024];
   if (ReadLine (hFile, CommandLine, 1024) == false) {
      break;
      }
   // Create the process:
   STARTUPINFOA si;
   PROCESS_INFORMATION pi;
   memset (&si, 0, sizeof(si));
   memset (&pi, 0, sizeof(pi));
   GetStartupInfoA(&si);
   si.cb = sizeof(si);
   si.dwFlags = STARTF_USESHOWWINDOW;

   if (CreateProcessA (".\\DumpArgs.exe", CommandLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi) != 0) {
      // Successfully created the process. Wait for it to finish:
      if (WaitForSingleObject(pi.hProcess, 5000) == WAIT_TIMEOUT) {
         _tprintf (_T("Timed out waiting for process to exit\n"));
         }
      // Release the handles returned by CreateProcess so they do not leak:
      CloseHandle (pi.hThread);
      CloseHandle (pi.hProcess);
      }
   else {
      _tprintf (_T("CreateProcess failed (error %lu)\n"), GetLastError());
      }
   }
CloseHandle (hFile);
return 0;
}

 

Now that you understand how a command line is given to a program, and have some tools for looking at things, we can begin next time to look closely at exactly how a command is parsed.

He who wishes to be obeyed must know how to command

This post covers everything you need to know about how to create and interpret command lines for Windows programs.

I’ll go over how to quote and escape arguments including those that contain spaces, embedded quotes and special characters. I’ll explain how the command line is received and interpreted. I’ll point out specific situations where problems are likely to occur and give you some tools and techniques to ensure your programs always receive exactly the arguments you intended.

Introduction

Understanding the command line is important, especially when you don’t have control over your program’s input. Arguments could come from a database or a directory listing, or be entered by a user of your software. The information here will help you correctly handle situations you didn’t anticipate and avoid having a program or script intermittently fail. I will point out the most common mistakes and explain what to do in those rare situations where there isn’t a clean solution.

Some of the information here comes from other sources as well as my own testing. I present a new way to think about the command line that I hope will eliminate some confusion about how double quote characters are interpreted by the command line parser.

That’s a lot of ground to cover, so I plan to present it over time in multiple posts. By the time I’m done you will thoroughly understand:

  • What I mean by the terms quoting and escaping, and the different contexts where arguments need to be quoted or escaped
  • How an executable program receives its command line and splits it into individual arguments
  • How the command line is interpreted and modified by cmd.exe before passing it to an executable program
  • Batch file considerations
  • Passing received arguments on to other programs and scripts
  • The various problems and security risks that can occur and how to recognize and prevent them

I’ll also provide some free tools to help you do things the right way.

If you’re in a hurry you can jump to a summary or go see what others have to say on the subject.

Technical Notes

The following will help you understand this series better.

  • I use the term “literal” to describe any text that is intended to be passed unchanged to an application. This is in contrast to special characters such as whitespace, used to separate arguments, or double quote ["] and escape [\, ^] meta characters that control parsing but are not passed to the application.
  • When giving examples I needed a way to clearly show the content of a piece of text, especially if it has leading or trailing spaces. Enclosing these in quotes is ambiguous when the text itself contains the quote character. To avoid confusion I enclose literal strings inside square brackets:

[3 backslashes, followed by double quote: \\\"]

None of my sample text contains literal square brackets, so they always indicate the content of a literal string or a single character. Any quote character (single or double) or backslash that appears inside square brackets is always part of the sample text.

  • I include images to help visualize how various example command lines are parsed. Hopefully they’re intuitive, but I included a description of the images. You can also click on any of the images to get to the description page.
  • This series of articles was written from the perspective of a C/C++ Windows programmer so there may be a noticeable bias in the presentation. The information applies mainly to programs developed with Microsoft development tools (Visual Studio 2010) and which use the Microsoft standard runtime library.

Key points are marked with a red exclamation point icon.