
What saves a man is to take a step. Then another step.

This post shows you how to step through the startup and initialization code of a Windows program using Visual Studio.

Introduction

From a programmer’s perspective, a Windows application written in C or C++ begins execution at main() or WinMain(). But there’s a lot of initialization that occurs before either of these functions is called. Among other things, the runtime library (CRT) is initialized and constructors are called for C++ global objects.

It is sometimes useful and relatively easy to step through this initialization code and I will discuss several techniques for doing it.

How you go about this depends on what you want to look at, the availability of source code, and whether or not there is debugging information. I will describe techniques you can use with programs you developed yourself (and have source code for), as well as executables you did not build.

You probably don’t need to be shown how to trace such things as constructor calls for classes that you developed yourself and for which you have source code. All you have to do is set breakpoints in the constructors and restart the program. This is trivial so I won’t bother to show how to do it.

It’s only slightly more difficult to step through the CRT (C or C++ Runtime Library) startup code. Why would you want to do this? Possibly to understand how a program’s command line is parsed to generate argv[], or learn the mechanism for calling global constructors or scheduling destructor calls for the same objects. All of this occurs before main is called.

Visual Studio provides source code for the CRT, located by default at C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src. This is a valuable resource. But just reading the sources is no substitute for actually stepping through the code at runtime. You can even modify and compile these sources to add breakpoints and debugging messages.

I assume you have Visual Studio installed. I am using Visual Studio 2010 but everything should work (with minor differences) with Visual Studio 2008 or Visual Studio 2012, and probably even earlier versions.

As with class constructors, if the code you want to trace is in the CRT and the program was built with debugging information you can open one of the CRT source files and set a source-level breakpoint— provided you know what you want to trace and in which source file it is located. Because you are not likely to be as familiar with this code as with your own, and because it employs some advanced techniques (like calling initialization functions through arrays of function pointers that are populated based on specially-named code sections), you will likely want to step through some of this code starting at the program entry point.
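
To make the “arrays of function pointers” idea concrete, here is a minimal, portable sketch of walking such a table. It is only an analogy for what the CRT does (in the CRT sources the walker is a helper named _initterm, and the real table is assembled by the linker from specially-named sections). The table contents and every name below (InitFunc, RunInitTable, InitFirst, and so on) are hypothetical:

```cpp
#include <cstddef>
#include <string>

// Signature shared by every initializer in the table.
typedef void (*InitFunc)();

// Records the order in which initializers run (for demonstration only).
std::string g_initLog;

void InitFirst()  { g_initLog += "first;"; }
void InitSecond() { g_initLog += "second;"; }
void InitThird()  { g_initLog += "third;"; }

// Hypothetical table. The null entry stands in for padding and is skipped.
InitFunc g_initTable[] = { InitFirst, 0, InitSecond, InitThird };
const std::size_t g_initTableCount = sizeof(g_initTable) / sizeof(g_initTable[0]);

// Calls every non-null entry in [begin, end), in table order.
void RunInitTable(InitFunc* begin, InitFunc* end)
{
    for (InitFunc* entry = begin; entry != end; ++entry)
    {
        if (*entry != 0)
            (*entry)();
    }
}
```

Calling RunInitTable(g_initTable, g_initTable + g_initTableCount) runs InitFirst, InitSecond and InitThird in that order. Nothing in the source names the callees at the call site, which is why this kind of code is easier to follow in the debugger than by reading.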

Again, I will assume you know how to set source-level breakpoints and instead concentrate on what to do if you have just an executable image with no debugging information. The program might not even link with the CRT or might have a custom entry point, something other than mainCRTStartup.

Launching the Visual Studio Debugger

The following are the general steps you must follow to get things set up:

  1. Create a Visual Studio solution for the program and open it (this happens automatically for most of the methods shown below)
  2. Find the memory address of some piece of code that runs before the code you want to step through
  3. Set a breakpoint at this address
  4. Start the program so the breakpoint can be hit and begin stepping through code

We need to open a Visual Studio solution associated with the program’s executable image. This is not the same as attaching to a process, though attaching to a process is one way it can be achieved. We eventually want the program to be opened in Visual Studio but not yet running so that we can step through its initialization. Logically, since the program will not initially be running, there is no process to attach to.

Even though we ultimately want to start running the program from the beginning, there is one piece of information we can’t get without the program running or at least stopped at a breakpoint– its base load address. So each of the following options will leave the executable either running or stopped at a breakpoint. Once we have the load address we will restart the program.

The task is easiest if you already have a Visual Studio solution for the program. Just open it and press F11 (step into) to start execution and immediately break.

If you don’t already have a Visual Studio solution, there are several ways you can create one automatically. Use whichever of the following methods you feel comfortable with:

1) Create a debug breakpoint by adding the following statement to the program (this is for those of you who are building outside of Visual Studio, otherwise it’s easier to just use the VS solution):

__debugbreak();

Compile and run the program.

The breakpoint will cause the familiar “Program has stopped working” dialog.

Since you have Visual Studio installed there will be a “Debug the program” button. Press it.

You will be presented with a dialog asking you to select a debugger. Choose “New Instance of Microsoft Visual Studio” and click Yes.

When you see a “Program has triggered a breakpoint” message, press Break.

Visual Studio will open, with the program stopped at the breakpoint.

 
2) If the program is already running, open Visual Studio and from the Debug menu select Attach to Process. Select your program’s process from the dialog that pops up. You can leave the program running or break into it by clicking Debug->Break All (Ctrl+Alt+Break).

3) If you do not have a Visual Studio solution and the program executes too quickly to allow attaching to the process, you can use the following technique to cause Visual Studio to automatically attach to the program when it is started:

  • Open the registry editor, regedit.exe.
  • Navigate to HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options
  • Right-click Image File Execution Options and add a new key with the executable’s name (no path). Example: “TestApp.exe” (without the quotes)
  • Right-click the new key and add a new string value named “Debugger”, and set the value of the string to “vsjitdebugger”
  • After this, any time an executable named TestApp.exe is started (by any mechanism), the Visual Studio debugger will automatically open even if the program terminates quickly.
  • This method does not automatically generate a breakpoint, so the program will still be running if it has not yet terminated.
  • If the program has terminated, press F11 to restart it and break immediately.
  • If you are working with several executable programs you can rename the key you just created instead of polluting the registry with multiple keys.
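
If you prefer to script the registry change, the following sketch creates the same key and value programmatically. It assumes the process runs with administrator rights and that the value is named Debugger, which is the spelling Windows looks for. IfeoKeyPath and RegisterJitDebugger are hypothetical helper names, error handling is minimal, and the Win32 part is compiled only on Windows:

```cpp
#include <string>

// Builds the registry path (relative to HKEY_LOCAL_MACHINE) for one executable.
std::wstring IfeoKeyPath(const std::wstring& exeName)
{
    return L"Software\\Microsoft\\Windows NT\\CurrentVersion\\"
           L"Image File Execution Options\\" + exeName;
}

#ifdef _WIN32
#include <windows.h>

// Creates the key and sets its "Debugger" string value. Returns true on success.
bool RegisterJitDebugger(const std::wstring& exeName)
{
    HKEY key = 0;
    LONG rc = RegCreateKeyExW(HKEY_LOCAL_MACHINE, IfeoKeyPath(exeName).c_str(),
                              0, 0, REG_OPTION_NON_VOLATILE, KEY_SET_VALUE,
                              0, &key, 0);
    if (rc != ERROR_SUCCESS)
        return false;

    const wchar_t debugger[] = L"vsjitdebugger";
    // For REG_SZ the byte count includes the terminating null character.
    rc = RegSetValueExW(key, L"Debugger", 0, REG_SZ,
                        reinterpret_cast<const BYTE*>(debugger), sizeof(debugger));
    RegCloseKey(key);
    return rc == ERROR_SUCCESS;
}
#endif
```

RegisterJitDebugger(L"TestApp.exe") mirrors the manual steps above. Remember to delete the key (or the value) when you are done, otherwise the debugger keeps launching every time that executable starts.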

Whatever technique you use, at this point Visual Studio will be open with the program either running or stopped at a breakpoint.

Finding the Program Entry Point

You usually can’t just press F11 to step into a program’s entry point from a completely stopped state. If Visual Studio can find main or WinMain, it steps over the initialization and stops at the first statement of main or WinMain (unless a breakpoint is encountered first). This is one of the rare situations where having debug information/symbols available actually hinders you. If debug information is not available, pressing F11 does indeed step into the first assembly language instruction and stops at the program entry point. In theory you can avoid having to find the entry point by building your software without debug information and just pressing F11 to start it. I doubt the little bit of effort you’ll save is worth the pain of not having symbols.

footnote: I have to admit I am not certain there isn’t a way to force Visual Studio to always start in Disassembly mode. If anyone knows how please post a comment.

So let’s assume we need to find the program’s load address and entry point.

Every Windows executable has an entry point which is specified in the header of the executable file. I’ll show you first how to find the entry point using the DumpBin utility that comes with Visual Studio, but there is an easier way which I’ll also show. Make sure you use the version of DumpBin from the 64-bit compiler or you won’t see the correct information for a 64-bit executable (the 64-bit version works correctly for both 32-bit and 64-bit images).

From a command prompt run vcvars64.bat (by default in c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64\vcvars64.bat) then run the following command:

DumpBin /headers TestApp.exe

Look for “entry point” and “image base” which will look something like this for a 32-bit executable (these 2 lines are not necessarily consecutive in the output):

1690 entry point (00401690) _mainCRTStartup
400000 image base (00400000 to 00409FFF)

Or this for a 64-bit executable:

22EB8 entry point (0000000100022EB8)
100000000 image base (0000000100000000 to 0000000100660FFF)

This information is not used directly. The actual address depends on the load address of the program (which can change on every run). To get the actual address you first need to find the base load address then calculate the entry point relative to it. Don’t worry, it’s easier than it sounds!

Note: there are other ways to get the image base address if you build the program yourself (for example, from the linker configuration or the .map file generated by the build) but I’m not going to discuss them.

To get the actual load address, open the Modules window from Debug->Windows->Modules menu and locate the executable under the Name column. Then get the load address from the Address column. The program must be running or stopped at a breakpoint.

From the DumpBin information you can calculate the offset of the entry point relative to the base address (I’ll use the numbers from the above 64-bit executable):

Relative entry point = entry point – base address = 0x0000000100022eb8 – 0x0000000100000000 = 0x0000000000022eb8.

This means the real entry point in memory is 0x00022eb8 bytes after the real load address. If the Modules window shows the program is loaded at 0x00000000ff290000, then the entry point is 0x00000000ff290000 + 0x0000000000022eb8 = 0x00000000ff2b2eb8.
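
Here is the same arithmetic written out as a small sketch, so you can plug in your own numbers. The function and parameter names are hypothetical; the values in the usage note are the ones from the 64-bit example above:

```cpp
#include <cstdint>

// Offset of the entry point from the image base, both taken from the
// "entry point (...)" and "image base (...)" values in the DumpBin output.
std::uint64_t RelativeEntryPoint(std::uint64_t headerEntryPoint,
                                 std::uint64_t headerImageBase)
{
    return headerEntryPoint - headerImageBase;
}

// Address of the entry point in the running process, given the load
// address shown in the Modules window.
std::uint64_t RuntimeEntryPoint(std::uint64_t headerEntryPoint,
                                std::uint64_t headerImageBase,
                                std::uint64_t actualLoadAddress)
{
    return actualLoadAddress + RelativeEntryPoint(headerEntryPoint, headerImageBase);
}
```

RuntimeEntryPoint(0x100022EB8, 0x100000000, 0xFF290000) yields 0xFF2B2EB8, matching the hand calculation.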

Setting the Breakpoint

Once you have the entry point you can enter it into the Address box at the top of the Disassembly window. This will display the entry point and you can click the left margin next to the instruction to set a breakpoint.

After setting the breakpoint you can stop the current execution, then restart the program by pressing F5. Note: if the executable does not have debugging information, you may see a message asking if you want to proceed. Just click Yes.

Houston, We Have a Problem…

We now come to a big problem with this technique: Address Space Layout Randomization (ASLR).

Every time you restart the program (even within a single debug session) the load address can change, which means the breakpoint you just set will no longer point to the desired instruction. There are ways to disable ASLR (at build time for a specific program or for your system as a whole) but I don’t recommend them.

Instead, you can load another instance of your program and keep it loaded (either by letting it run forever or by stopping it at a breakpoint). All subsequent instances, including the one you are debugging, will load at the same address. Be aware that this has been the case in my experience, but I have not seen it documented and can’t claim it is always true. Now when you stop and re-start your program, the breakpoint address will still be valid.

If the application you are investigating is a single-instance program, then the technique of loading another instance to keep the load address constant can usually still be used. Windows always loads subsequent instances of the program and the program itself detects that a prior instance is running and exits. This is typically done inside WinMain, which runs after the part of the program you want to step through.

Finding the Program Entry Point and Setting the Breakpoint, Take 2

There is another method for getting to the entry point that I personally find easier, and which also avoids the ASLR problem. You set a breakpoint in kernel32.dll at the point where it calls the entry point of a new process. From that breakpoint you can single step into your app’s entry point.

There are a few issues with this method that you need to watch for. I will point these out as we go along.

To begin, press F11 to start the program (it will break immediately) and examine the call stack.

Look for the entry in kernel32.dll just below the earliest stack frame for your executable. This is where Windows calls the entry point of the program. If you double-click it, you will be brought to where the entry point is called from kernel32.dll. The small green arrow represents where your program will return to kernel32.dll.

You can set a breakpoint on the call (the “call rdx” instruction in the image), stop the current execution, then restart by pressing F5. When the breakpoint is hit, you can step into the entry point, which here is at memory address 0x00000000ff2b2eb8 (the yellow arrow).

The first thing to be aware of is that for some reason when you stop the program (Debug->Stop Debugging, or Shift+F5) the breakpoint in kernel32 always becomes disabled. I’m not sure why (though I have some ideas). You must always re-enable the breakpoint right before restarting. Do this by clicking the breakpoints tab, then click the checkbox next to the breakpoint. The second issue is that you can’t start by pressing F11. You must press F5 or the breakpoint becomes disabled and is not hit.

One advantage of this approach is that the address inside kernel32.dll of the call to the entry point seems to be the same for all processes, so you don’t have to run a second instance of the program you are testing. Presumably this is because all processes share a single instance of system DLLs such as kernel32.dll. I have not rigorously checked this fact. It is theoretically possible for Windows to map a single instance of a DLL in physical memory to different virtual addresses in different processes, but I believe this is not done (probably for CPU cache-efficiency reasons) and the limited checking I did has shown it to always be loaded at the same address.

Final Note

This was not an exhaustive treatise on the subject, but I hope it gets you going in the right direction.

If thou suffer injustice, console thyself; the true unhappiness is in doing it

This post shows you how to add a debugging console to a Windows GUI application and configure it to receive output generated by printf or cout. It also illustrates the technique of specifying a custom entry point for a program.

The console will be created before any CRT initialization or constructor calls for global C++ objects. Having the console available early can be very handy, especially when debugging.

Few Windows GUI programs create a text-based console in addition to the regular window-oriented interface. Be aware that if your app does already have a console, the following may cause the existing console creation to fail because a Windows program can have only one console.

The procedure consists of creating a new console or attaching to an existing console in another process using AttachConsole (which requires sufficient access rights). After creating or attaching to the console, the standard input/output handles, stdin, stdout, and stderr are configured to use it. This allows you to use printf or cout which is more convenient than using functions like WriteConsole directly.

By using AttachConsole you can send the output of multiple programs to a single console, or allow a sequence of nested programs to output to a console inherited from a parent process.

Specifying a Custom Program Entry Point

In order to set up the console very early in the program’s execution, we are going to specify a custom entry point. The new entry point function will create the console, configure standard I/O to use the console, then call the original entry point.

In order to call the original entry point we need to know its exact name including the leading “w” for a “wide” or Unicode build. The most reliable way to do this is to build the program before changing the entry point and use DumpBin to see what the entry point is:

  • From a command prompt, run vcvars64.bat (typically located at c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64\vcvars64.bat) to set up the development environment. Make sure you use the 64-bit version if your program is 64 bits. The 32-bit version of DumpBin will report incorrect information.
  • Type “dumpbin /headers MyApp.exe” (substituting your application name).
  • Look for a line that reads something like “1121C entry point (0041121C) @ILT+535(_wWinMainCRTStartup)”. The entry point function name is in the second set of parentheses (minus the leading underscore).

You can also get the entry point from inside Visual Studio, but it is slightly trickier because if the default entry point is used, the field is blank and you need to figure out the exact name for yourself (wmainCRTStartup, mainCRTStartup, etc.):

  • Open the Visual Studio project or solution for the program.
  • Select the program’s project from the Solution Explorer pane on the left side.
  • Open the project’s property pages by pressing Alt+F7 or select Project, then TestApp Properties from the menu.
  • In the Property Pages dialog, expand “Configuration Properties”, then “Linker”, then click “Advanced”.
  • Look for “Entry Point”. If this field is empty, then the entry point is the default: WinMainCRTStartup (ANSI) or wWinMainCRTStartup (Unicode) for a GUI application. Otherwise record the name that is there.

Declare the entry point function

Somewhere in your source code declare the original entry point (so we can chain to it) and the new entry point. Both must be declared extern “C” to prevent name mangling:


extern "C"
   {
   int wWinMainCRTStartup (void); // Use the name you obtained in the prior step
   int MyEntryPoint (void);
   }

Implement the entry point function

For clarity, all error detection and handling has been removed.


#include <windows.h> // AllocConsole
#include <stdio.h>   // freopen, stdin, stdout, stderr

__declspec(noinline) int MyEntryPoint (void)
   {
   // __debugbreak(); // Break into the program right at the entry point!
   // Create a new console:
   AllocConsole();
   freopen("CON", "w", stdout);
   freopen("CON", "w", stderr);
   freopen("CON", "r", stdin); // Note: "r", not "w".
   return wWinMainCRTStartup();
   }

The interesting part of this function is the set of calls to freopen for “CON”. Because this is a GUI program, stdin, stdout and stderr are not initialized. We have to explicitly set them to either an open file or the console (“CON”) in order for printf, scanf, etc. to work.

stdin, stdout, and stderr are all of type FILE* which is just an opaque pointer to a descriptor for an open file. The file itself can be either a “real” file on disk (which is how redirection of stdout and stdin to/from a file is implemented) or a reserved device name such as “CON” (the console).

freopen closes the file that the current FILE* represents (here stdout, stderr, or stdin) if it’s already open (in our case it’s not open), and opens the specified file (or special device), assigning the resulting FILE descriptor to the same FILE*. This is the logical equivalent of calling fclose(stdout) followed by stdout = fopen(…). But because stdin, stdout, and stderr are special you can’t actually assign a value to them. freopen does the equivalent of such an assignment for you.
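
You can watch this behavior on an ordinary stream, without touching the standard streams. The sketch below opens one file, then uses freopen to re-point the same FILE* at a second file. The function names and file paths are hypothetical, and the demonstration uses plain disk files rather than “CON”:

```cpp
#include <cstdio>
#include <string>

// Writes one line to firstPath, then re-points the SAME stream at secondPath
// with freopen and writes another line. Returns true if everything worked.
bool WriteViaReopenedStream(const char* firstPath, const char* secondPath)
{
    std::FILE* stream = std::fopen(firstPath, "w");
    if (!stream)
        return false;
    std::fputs("to first\n", stream);

    // freopen closes the first file (flushing it) and opens the second one.
    // On success it returns the stream pointer it was given.
    std::FILE* reopened = std::freopen(secondPath, "w", stream);
    if (!reopened)
        return false; // the original stream has been closed at this point
    bool sameStream = (reopened == stream);

    std::fputs("to second\n", reopened);
    std::fclose(reopened);
    return sameStream;
}

// Helper for checking the result: reads a whole text file into a string.
std::string ReadWholeFile(const char* path)
{
    std::string text;
    if (std::FILE* file = std::fopen(path, "r"))
    {
        int ch;
        while ((ch = std::fgetc(file)) != EOF)
            text += static_cast<char>(ch);
        std::fclose(file);
    }
    return text;
}
```

After WriteViaReopenedStream("first.txt", "second.txt"), the first file holds the line written before the freopen call and the second file holds the line written after it, even though the code used a single FILE* throughout.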

For those of you who are interested, stdout is defined in stdio.h using the preprocessor as follows (and similarly for stdin and stderr):


#define stdout (&__iob_func()[1])

This is a function call returning a pointer to an array of FILE structures, from which we take the address of one element. Since the result is not an lvalue, it can’t be assigned to. freopen instead re-initializes, in place, the element of the array that __iob_func() returns.

All we’re doing is making stdin, stdout and stderr point to the console. You could specify the path of a “real” file instead of “CON” to send output to that file.

You can now do things like the following inside WinMain. Note: the scanf call will not return until an integer is entered into the console that pops up and enter is pressed. This can be used to allow entry of configuration information, such as a test number, at run time:


fwprintf(stdout, L"This is a test to stdout\n");
fwprintf(stderr, L"This is a test to stderr\n");

cout<<"Enter an Integer Number Followed by ENTER to Continue" << endl;
_flushall();

int i = 0;
int Result = wscanf( L"%d", &i);
printf ("Read %d from console. Result = %d\n", i, Result);

The final step is to set the entry point for the program in Visual Studio to the new function. The location for this setting is the same place we went earlier to look for the original entry point (under advanced linker properties).

There you have it. A GUI program with a console that you can output to using the standard I/O functions. If you like, you can download a complete sample program and Visual Studio project.

The boundaries which divide Life from Death are at best shadowy and vague. Who shall say where the one ends and where the other begins?

This post looks at the details of how a program running under Windows splits its command line into individual arguments.

Parsers

We’re only going to consider the two most common command line parsers– parse_cmdline and CommandLineToArgvW. The first, parse_cmdline, is part of the Microsoft CRT library initialization code and is called automatically at program startup to generate argv[]. It has both an ANSI and a Unicode version. CommandLineToArgvW is called explicitly by the programmer and has just a Unicode version. There is no standard function to un-parse a command line— that is, that takes a set of arguments and outputs a command line that re-generates the same set of arguments (something like ArgvToCommandLine) but I will give you a sample utility that does just that.

Whitespace Defined

I already mentioned in my previous post that the command line is divided into arguments at whitespace boundaries by both parse_cmdline and CommandLineToArgvW. Conceptually the process is simple— the parser scans the command line string from left to right looking for sequences of one or more whitespace characters which it discards. The substrings that remain between these blocks of whitespace become the individual arguments. This simplistic description ignores for the time being the handling of special characters like double quote and does not take into account the special handling of the first argument. We will discuss these issues later.

Both parse_cmdline and CommandLineToArgvW define whitespace as just space (0x20) and horizontal tab (0x09).

Given that whitespace is afforded special treatment, the obvious question is how do you supply an argument that contains literal whitespace characters and prevent them from being interpreted as an argument separator? We’ve already seen the answer— by adding a double quote character [”] to the command line to switch the parser state from InterpretSpecialChars to IgnoreSpecialChars. This has to be done somewhere after the previous separator whitespace, and before the literal space. After encountering this double quote character (which is removed) the parser stops recognizing whitespace as an argument separator. A subsequent double quote character (also removed) re-enables this recognition. I will show later how this toggling behavior can be selectively overridden.

The following has already been stated (with different wording), but deserves repeating:
 
Quoting serves only to tell the parser to stop or restart interpreting whitespace as an argument separator. It does not determine where an argument begins or ends.

You do not need to enclose an entire argument containing spaces in double quotes— just make sure the recognition of whitespace as an argument separator is disabled (state is IgnoreSpecialChars) prior to any literal whitespace.
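
Here is a minimal sketch of the toggling behavior just described. It handles only whitespace and double quotes: backslash escaping and the special handling of the first argument are deliberately left out, so it is not a replacement for parse_cmdline or CommandLineToArgvW. The function name SplitOnUnquotedWhitespace is hypothetical; the state names follow the ones used in this post:

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum ParserState { InterpretSpecialChars, IgnoreSpecialChars };

// Splits on spaces and tabs, except while a double quote has switched the
// parser into IgnoreSpecialChars. Quote characters themselves are removed.
std::vector<std::string> SplitOnUnquotedWhitespace(const std::string& commandLine)
{
    std::vector<std::string> args;
    std::string current;
    bool inArgument = false;
    ParserState state = InterpretSpecialChars;

    for (std::size_t i = 0; i < commandLine.size(); ++i)
    {
        const char ch = commandLine[i];
        if (ch == '"')
        {
            // Toggle the state. The quote belongs to an argument but is not copied.
            state = (state == InterpretSpecialChars) ? IgnoreSpecialChars
                                                     : InterpretSpecialChars;
            inArgument = true;
        }
        else if ((ch == ' ' || ch == '\t') && state == InterpretSpecialChars)
        {
            // Separator whitespace ends the current argument, if there is one.
            if (inArgument)
            {
                args.push_back(current);
                current.clear();
                inArgument = false;
            }
        }
        else
        {
            // Ordinary character, or whitespace protected by IgnoreSpecialChars.
            current += ch;
            inArgument = true;
        }
    }
    if (inArgument)
        args.push_back(current);
    return args;
}
```

Feeding it [a"b c"d e] produces two arguments, [ab cd] and [e]: the quotes toggle the state in the middle of an argument without starting or ending it.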

Escaping Defined

Since the double quote character is now given special meaning, you have a problem similar to the one you had for whitespace: how do you insert a literal double quote character without toggling the parser state?

This is where the concept of escaping comes in. Another special character is defined, the backslash ([\]), which means don’t interpret the next character as special. If you want to include a literal double quote character, whether the state is InterpretSpecialChars or IgnoreSpecialChars, just precede it by a backslash: [\”]. The word escape reflects the idea that you are temporarily escaping from the normal flow of processing. The concept is subtly different from that of quoting because it’s active for only a single character after which the state automatically switches back to normal processing. Double quotes on the other hand switch and latch the parser state until another double quote character is seen.

Once again this raises the question: how do you include a real escape character? It’s beginning to feel like we’re never going to get off this treadmill! As soon as we define a way to handle a special case, our so-called fix introduces a new special case. We turned whitespace as the special case into double quote as the special case, then double quote as the special case into backslash as the special case.

Luckily for us, the chain is broken at this point through a simple mechanism: allow the escape character to escape itself! Conceptually, it works a lot like string constants in source code— to include a literal backslash, just supply two backslashes: [first\\second] for [first\second]. No new special case is introduced.

In an ideal world that’s all there would be to it. You would separate arguments with whitespace, protect literal whitespace by including a double quote character to enter IgnoreSpecialChars, place a backslash before any double quote that you don’t want to change the parser state, and always double up (or self escape) backslashes that are not escaping something.

Not so fast! Dealing With a Proliferation of Backslashes

But we don’t live in an ideal world. If you read the Microsoft documentation, you’ll see a set of rules that complicate this ideal behavior. I’m going to explain what I think is the reasoning behind some of these rules with the hope that doing so will help you understand and remember them better.

But first I’ll just list the Microsoft rules from the documentation (with a few comments of my own, each marked “Comment:”)

  • Arguments are delimited by white space, which is either a space or a tab.
  • The caret character (^) is not recognized as an escape character or delimiter. The character is handled completely by the command-line parser in the operating system before being passed to the argv array in the program. Comment: this is a potentially confusing mix of information about cmd.exe and the executable’s processing of the command line. You should keep these separate in your mind!
  • A string surrounded by double quotation marks (“string”) is interpreted as a single argument, regardless of white space contained within. A quoted string can be embedded in an argument. Comment: what exactly does this mean? (see discussion below)
  • A double quotation mark preceded by a backslash (\”) is interpreted as a literal double quotation mark character (“). Comment: this is independent of the parser state.
  • Backslashes are interpreted literally, unless they immediately precede a double quotation mark.
  • If an even number of backslashes is followed by a double quotation mark, one backslash is placed in the argv array for every pair of backslashes, and the double quotation mark is interpreted as a string delimiter. Comment: this reinforces the error-prone concept that double quotes enclose some meaningful string, or (worse) a complete argument.
  • If an odd number of backslashes is followed by a double quotation mark, one backslash is placed in the argv array for every pair of backslashes, and the double quotation mark is “escaped” by the remaining backslash, causing a literal double quotation mark (“) to be placed in argv.
  • (there is one more undocumented rule which we will discuss later)
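
As a cross-check on the documented rules listed above, here is a sketch that applies them literally. It is an approximation under stated assumptions: it ignores the undocumented rule and the special handling of the first argument, and it is not the code of parse_cmdline or CommandLineToArgvW. SplitByDocumentedRules is a hypothetical name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum ParserState { InterpretSpecialChars, IgnoreSpecialChars };

std::vector<std::string> SplitByDocumentedRules(const std::string& commandLine)
{
    std::vector<std::string> args;
    std::string current;
    bool inArgument = false;
    ParserState state = InterpretSpecialChars;
    std::size_t i = 0;

    while (i < commandLine.size())
    {
        const char ch = commandLine[i];
        if (ch == '\\')
        {
            // Count the whole run of backslashes, then look at what follows it.
            std::size_t count = 0;
            while (i < commandLine.size() && commandLine[i] == '\\') { ++count; ++i; }
            inArgument = true;
            if (i < commandLine.size() && commandLine[i] == '"')
            {
                current.append(count / 2, '\\');   // one backslash per pair
                if (count % 2 == 1)
                {
                    current += '"';                // odd: the quote is literal
                    ++i;
                }
                // even: the quote is left in place and toggles on the next pass
            }
            else
            {
                current.append(count, '\\');       // not before a quote: literal
            }
        }
        else if (ch == '"')
        {
            state = (state == InterpretSpecialChars) ? IgnoreSpecialChars
                                                     : InterpretSpecialChars;
            inArgument = true;
            ++i;
        }
        else if ((ch == ' ' || ch == '\t') && state == InterpretSpecialChars)
        {
            if (inArgument) { args.push_back(current); current.clear(); inArgument = false; }
            ++i;
        }
        else
        {
            current += ch;
            inArgument = true;
            ++i;
        }
    }
    if (inArgument)
        args.push_back(current);
    return args;
}
```

For example, [a\\"b c"] yields the single argument [a\b c] (the two backslashes before the quote collapse to one, and the quote toggles the state), while [a\"b c] yields [a"b] and [c] (the odd backslash escapes the quote, so the space still separates).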

 
I don’t know what is meant by the statement a quoted string can be embedded in an argument (from the third rule). To me it suggests you can do something like the following and still have the entire string interpreted as a single argument:

["She said "you can't do this!", didn't she?"]

But this is interpreted as four arguments:

argv[1] = [She said you]
argv[2] = [can't]
argv[3] = [do]
argv[4] = [this!, didn't she?]

Or maybe it means you can embed a quoted string in the middle if you’re not already in a quoted string? That’s true, but then you can’t have spaces elsewhere:

[Shesaid"you can't do this!",didn'tshe?]

Not quite.

So just forget about quoted strings and embedded quotes.

This happens to be a good example to demonstrate some of the things I have been saying. I’ll show the parse image first then point out just the interesting parts:
 

Image of Example Command Line Parse

  • 3) Switch to IgnoreSpecialChars
  • 4) Switch to InterpretSpecialChars
  • 8) Single double quote character- Enter IgnoreSpecialChars
  • 9) Single double quote character- Return to InterpretSpecialChars

 
The second double quote character, immediately before [you], does not end the first argument. This particular situation seems to cause conceptual difficulty for some people.

The double quote character simply causes the parser to leave IgnoreSpecialChars and re-enter InterpretSpecialChars (indicated by the change from red to green). Because there is no whitespace, the argument continues. But because the parser is now in InterpretSpecialChars, the space between [you] and [can’t] does start a new argument. So does the space between [can’t] and [do], as well as between [do] and [this!].

Finally, the third double quote character (before the comma) causes another transition into IgnoreSpecialChars, preventing any further arguments from being generated.

Backslash Rules

Backslashes are very common because they’re the file system path separator under Windows. The writers of the C/C++ runtime library must have realized that it’s not only inconvenient, but also just plain ugly to have to fully escape these backslashes:

A UNC path such as:

[\\SomeComputer\subdir1\subdir2\]

Would need to be passed to CreateProcess in the command line string as:

[\\\\SomeComputer\\subdir1\\subdir2\\]

Which would in turn be expressed in source code as:

["\\\\\\\\SomeComputer\\\\subdir1\\\\subdir2\\\\"]

Ouch!

I believe that it was to avoid this proliferation of backslashes that Microsoft introduced the rule that backslashes are interpreted literally (except when they precede a double quote character).

But this rule results in an ambiguity: when the parser encounters [\”], is it an escaped double quote or a backslash followed by an un-escaped quote?

The strange-sounding rules for backslashes resolve the ambiguity. It might be more easily understood if you think of an escaped double quote as a single special compound character [\”]— call it an e-quote— instead of two separate characters and re-word the rule (in terms of creating the command line instead of parsing it) to read something like: Escape any literal backslashes that occur immediately prior to either a double quote or an e-quote character so they’re not interpreted as an escape character.
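
Read in the “creating” direction, the re-worded rule can be sketched as a quoting function. This is one common way to apply it, offered as a sketch under assumptions rather than as the utility mentioned earlier: it always wraps the argument in double quotes, doubles only those backslashes that end up immediately before a double quote (including the closing one), and does nothing about cmd.exe metacharacters. QuoteArgument is a hypothetical name:

```cpp
#include <cstddef>
#include <string>

// Wraps one argument in double quotes so that the documented parsing rules
// turn it back into the original text.
std::string QuoteArgument(const std::string& argument)
{
    std::string quoted(1, '"');
    std::size_t pendingBackslashes = 0; // backslashes seen but not yet written

    for (std::size_t i = 0; i < argument.size(); ++i)
    {
        const char ch = argument[i];
        if (ch == '\\')
        {
            ++pendingBackslashes;
            continue;
        }
        if (ch == '"')
        {
            // Double the run, then add one more backslash to escape the quote.
            quoted.append(pendingBackslashes * 2 + 1, '\\');
        }
        else
        {
            // Backslashes before an ordinary character stay as they are.
            quoted.append(pendingBackslashes, '\\');
        }
        quoted += ch;
        pendingBackslashes = 0;
    }

    // A trailing run sits right before the closing quote, so double it.
    quoted.append(pendingBackslashes * 2, '\\');
    quoted += '"';
    return quoted;
}
```

QuoteArgument applied to [c:\dir\] returns ["c:\dir\\"], and [say "hi"] becomes ["say \"hi\""]; backslashes elsewhere in a path are left alone.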

The intent may have been to facilitate the quoting of paths generated or received by a script or executable (not a human)— i.e., to allow you to always unconditionally enclose paths in double quote characters without further escaping. But doing so fails if the path contains a trailing backslash because the backslash is interpreted as an escape, as we have seen.

The following example illustrates the problem. The command line is unexpectedly split into just the two arguments shown. The final two components are appended to the second argument because the parser state is still IgnoreSpecialChars after the escaped double quote character and remains so until the end of the command line, where the parser ends in the IgnoreSpecialChars state. You should think of it that way instead of as an un-closed quoted string that is implicitly closed at the end of the command line:

Command line:

[test.exe "c:\Path With Spaces\Ending In Backslash\" Arg2 Arg3]

Actual arguments generated:

[test.exe]
[c:\Path With Spaces\Ending In Backslash" Arg2 Arg3]

Probably what was expected:

[test.exe]
[c:\Path With Spaces\Ending In Backslash\]
[Arg2]
[Arg3]

You might think you can avoid this by always stripping trailing backslashes. But doing so fails for one important special case— a root directory:

The path

[c:\]

(the root directory of drive c:) has a different meaning than

[c:]

(the current working directory on drive c:, not necessarily the root).

You can unconditionally remove trailing backslashes from paths, except for a root directory which must be explicitly handled as a special case.
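As a minimal sketch of that special case (in Python, with a hypothetical function name), the following strips trailing backslashes but keeps one for a drive root. It assumes simple drive-letter paths; UNC share roots and other path forms are not given any special treatment here.

```python
def strip_trailing_backslash(path):
    """Remove trailing backslashes, but leave a root directory intact."""
    stripped = path.rstrip("\\")
    if stripped == path:
        return path                # nothing to remove
    if len(stripped) == 2 and stripped[1] == ":":
        return stripped + "\\"     # drive root: keep exactly one backslash
    if stripped == "":
        return "\\"                # the path was only backslashes: keep one
    return stripped
```

The drive-root check is what keeps [c:\] from silently turning into [c:].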

Parser Specifics

So far we’ve covered that the command line is split on whitespace, that you can disable or re-enable (toggle) this splitting using a double quote character, and that you can mask the special toggling behavior of the double quote character by escaping it with a backslash.

We also covered the special rules Microsoft introduced for backslashes to avoid the need to escape backslashes in paths.

We’re now going to delve into the specifics of the parsers themselves.

The following pseudocode is for both parse_cmdline and CommandLineToArgvW. It ignores the special handling of the first argument. Except for one very minor difference (highlighted), it is the same for both. Ironically, the one situation where the behavior is not the same involves the undocumented rule I alluded to when listing the Microsoft rules, above.

I have not rigorously verified the following pseudocode. If you have Visual Studio installed you can find the actual source code for parse_cmdline in the file stdargv.c in the CRT source directory.

One aspect of parse_cmdline that I do not cover is the expansion of wildcards (* and ?), a feature that must be compiled into the executable.

 

State = InterpretSpecialChars
while(command line string not finished) {
   advance past leading whitespace (space or tab)
   count and advance past leading backslashes
   if (current character is ["]) {
      for each pair of leading backslashes counted, output a single [\]
      if (a backslash is leftover) {
         skip the leftover [\] and append ["] to the current argument
      }
      else if ((the current ["] is followed by a second ["]) && (State == IgnoreSpecialChars)) {
         skip the first ["] and append the second ["] to the current argument
         if (parser is CommandLineToArgvW) { // parse_cmdline remains in IgnoreSpecialChars
            State = InterpretSpecialChars;
         }
      }
      else {
         // toggle parser state:
         if (State == InterpretSpecialChars) {
            State = IgnoreSpecialChars
         }
         else {
            State = InterpretSpecialChars
         }
      }
   }
   else {
      for each leading backslash, output a single [\]
      if (current character is space or tab) {
         if (State == InterpretSpecialChars) {
            start a new argument;
         }
      }
      else {
         append the current character to the current argument.
      }
   }
}

Other than how the first argument is handled (discussed later), the only difference between parse_cmdline and CommandLineToArgvW is what the parser does after it encounters two double quote characters in a row when the state is IgnoreSpecialChars. Both parsers treat the first double quote character as a kind of escape for the second one and discard it (this is the undocumented rule). The second double quote causes CommandLineToArgvW to exit IgnoreSpecialChars, while parse_cmdline remains in IgnoreSpecialChars.
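For readers who prefer something they can run, here is a rough Python rendering of the pseudocode above. Like the pseudocode, it has not been rigorously checked against either real parser and it ignores the special handling of the first argument; the function name split_args and the argvw flag are inventions for this sketch (argvw=True selects the CommandLineToArgvW behavior for two double quote characters in a row).

```python
def split_args(cmdline, argvw=False):
    """Split the text that follows the first argument into arguments."""
    args, current = [], []
    in_arg = False
    ignore = False   # False = InterpretSpecialChars, True = IgnoreSpecialChars
    i, n = 0, len(cmdline)
    while i < n:
        ch = cmdline[i]
        if ch == "\\":
            start = i
            while i < n and cmdline[i] == "\\":
                i += 1
            count = i - start
            in_arg = True
            if i < n and cmdline[i] == '"':
                current.append("\\" * (count // 2))  # one [\] per pair
                if count % 2:
                    current.append('"')              # leftover [\] escapes the ["]
                    i += 1
                # even count: the ["] is handled on the next pass
            else:
                current.append("\\" * count)         # literal backslashes
        elif ch == '"':
            in_arg = True
            if ignore and i + 1 < n and cmdline[i + 1] == '"':
                current.append('"')                  # undocumented rule
                i += 2
                if argvw:
                    ignore = False                   # parse_cmdline stays put
            else:
                ignore = not ignore                  # toggle parser state
                i += 1
        elif ch in " \t" and not ignore:
            if in_arg:
                args.append("".join(current))
                current, in_arg = [], False
            i += 1
        else:
            current.append(ch)
            in_arg = True
            i += 1
    if in_arg:
        args.append("".join(current))
    return args
```

Feeding it the text after test.exe from the trailing-backslash example earlier returns the single argument [c:\Path With Spaces\Ending In Backslash" Arg2 Arg3], whichever value argvw has.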

You are not likely to encounter a practical command line that causes the two parsers to generate different results. Because I have seen numerous examples of contrived command lines that demonstrate the difference, and people seem to be interested in how to explain them, I’m going to go over one extreme example that I encountered on the Internet:

[DumpArgs foo""""""""""""bar]

parse_cmdline generates the following arguments for the prior command line:

[DumpArgs]
[foo"""""bar] 5 literal double quote characters

But CommandLineToArgvW generates these:

[DumpArgs]
[foo""""bar] only 4 literal double quote characters

I’ve seen people try, unsuccessfully, to explain this example by the usual method of looking for pairs of double quote characters. But if you simply scan from left to right as I have shown, tracking the parser state, you’ll find that the difference is due to the fact that CommandLineToArgvW exits then re-enters IgnoreSpecialChars repeatedly (every time it encounters two consecutive double quote characters), but parse_cmdline only does so one time.

This will be easier to visualize by seeing the diagrams.

First we see how parse_cmdline parses the example. It removes a double quote both when entering and when leaving IgnoreSpecialChars (at points 3 and 4). While the state remains IgnoreSpecialChars, it removes one of each of the five pairs of double quote characters (the points marked '6'), for a total of seven double quote characters removed:
 

Image of Example Command Line Parse

Next we see how CommandLineToArgvW parses the same example. Like parse_cmdline, this parser removes a double quote character every time it enters or leaves IgnoreSpecialChars. Unlike parse_cmdline, where IgnoreSpecialChars is entered and left only once, here it happens four times each, at the points marked 3 (enter) and 5 (leave). So eight double quote characters are removed instead of just the seven that parse_cmdline removes:
 

Image of Example Command Line Parse

  • 3) Switch to IgnoreSpecialChars
  • 4) Switch to InterpretSpecialChars
  • 5) First of 2 double quote characters in a row escapes next double quote. Next double quote switches to InterpretSpecialChars
  • 6) First of 2 double quote characters in a row escapes next double quote
  • 7) CommandLineToArgvW: saw 2 double quote characters in a row. Return to InterpretSpecialChars
  • 8) Single double quote character- Enter IgnoreSpecialChars
  • 9) Single double quote character- Return to InterpretSpecialChars

To convince you this is not just academic, the next example is my attempt to come up with a plausible real-world scenario (though to me it still seems unlikely to occur):

Suppose we have a processing pipeline where the first program generates the string [embedded quote]. The next 2 stages in the pipeline blindly double-quote the argument and pass it on, generating first ["embedded quote"], then [""embedded quote""]. Finally, the last stage double quotes the entire command line and passes it on to FinalProgram.exe:

[FinalProgram.exe "first second ""embedded quote"" third"]

Command Line Arguments From CommandLineToArgvW:

arg 0   = [FinalProgram.exe]
arg 1   = [first second "embedded]
arg 2   = [quote]
arg 3   = [third]


Image of Example Command Line Parse

Command Line Arguments From argv Array (argc = 2):

argv[0] = [FinalProgram.exe]
argv[1] = [first second "embedded quote" third]


Image of Example Command Line Parse

Here, we again see the undocumented rule come into play— two double quote characters in a row while the state is IgnoreSpecialChars are interpreted as an escaped double quote character. Both parsers consume the first one and output the second one. But the difference is that for CommandLineToArgvW, there is a transition back to InterpretSpecialChars, while parse_cmdline remains in IgnoreSpecialChars.

First Command Line Argument

Both parsers process the first argument differently than the remainder of the command line. There may be a good reason for this, but I can’t think of one, and I think it just causes additional confusion without providing much, if any, benefit.

To compound the confusion, there is a much greater difference between the two parsers for the first argument than there is for the remainder of the command line.

I will discuss the two parsers separately.

Pseudocode for parse_cmdline Handling of First Argument

The following is pseudocode for parse_cmdline:

 

ParserState = InterpretSpecialChars
loop while not end of command line and not end of argv[0] {
   if (char is space or tab) and (ParserState is InterpretSpecialChars) {
      Overwrite the whitespace char with string terminator
      End argv[0]
      }
   else if (char is ["]) {
      Toggle ParserState
      Discard the ["]
      }
   else {
      Append char to argv[0]
      }
   }
parse remainder of command line normally

Notes for parse_cmdline

  • You can enter and exit IgnoreSpecialChars as many times as you want, the same as when parsing the remainder of the command line. The double quote characters are removed.

The following command line:

["F"i"r"s"t S"e"c"o"n"d" T"h"i"r"d"]

generates just one argument (when it appears at the start of the command line):

[First Second Third]

If you trace it carefully you will find that the state is IgnoreSpecialChars when each of the two spaces is encountered.


Image of Example Command Line Parse

  • You cannot use either one of the usual ways of escaping double quote characters (with a backslash or with another double quote character):

The backslash in the following does not escape the double quote that follows it and the two pairs of back-to-back double quote characters do nothing (they’re simply removed because they just cause the state to toggle then immediately toggle back):

[F""ir"s""t \"Second Third"]

Even though the 6th double quote looks like it’s escaped, the backslash does not escape anything when parsing the first argument. This double quote therefore changes the state back to InterpretSpecialChars, and the following space ends the first argument. As a result, two arguments are generated instead of the single argument that would be generated for the same text later in the command line (where the backslash would escape the double quote):

argv[0] = [First \Second]
argv[1] = [Third]

  • If the first character of the command line is a space or tab, an empty first argument is generated and the remainder of the command line is parsed normally:

This command line:

[  Something Else]

Generates these 3 arguments:

argv[0] = []
argv[1] = [Something]
argv[2] = [Else]
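The first-argument pseudocode for parse_cmdline is short enough to try out directly. The following Python sketch follows it step by step; the function name is hypothetical, and it returns the unparsed remainder of the command line alongside argv[0] instead of continuing with the normal parse.

```python
def first_arg_parse_cmdline(cmdline):
    """Return (argv0, remainder) using the parse_cmdline first-argument rules."""
    ignore = False   # False = InterpretSpecialChars, True = IgnoreSpecialChars
    out = []
    i = 0
    while i < len(cmdline):
        ch = cmdline[i]
        i += 1
        if ch in " \t" and not ignore:
            break                  # whitespace ends argv[0] and is consumed
        if ch == '"':
            ignore = not ignore    # toggle state and discard the ["]
        else:
            out.append(ch)         # everything else is literal, even [\]
    return "".join(out), cmdline[i:]
```

Note that there is no branch for backslashes at all, which is why neither of the usual ways of escaping a double quote works in the first argument.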

Pseudocode for CommandLineToArgvW Handling of First Argument

The following is pseudocode for CommandLineToArgvW:

 

if (first char is >= 0x01 and first char is <= 0x20) {
    arg[0] is empty string
    }
else if (first char is ["]) {
    Discard the ["]
    loop while not end of arg[0] {
       if (char is ["]) {
          Discard the ["]
          End arg[0]
          }
       else if end of command line {
          End arg[0]
          }
       else {
          Append char to arg[0]
          }
       }
    }
else {
    loop while not end of arg[0] {
       if (char is >= 0x01 and char is <= 0x20) {
          Discard char
          End arg[0]
          }
       else if end of command line {
          End arg[0]
          }
       else {
          Append char to arg[0]
          }
       }
    }
parse remainder of command line normally

Notes for CommandLineToArgvW

CommandLineToArgvW processing of the first argument has some (at least to me) strange behavior, most notably the way it sometimes accepts any character between 0x01 and 0x20 as whitespace.

  • The same as for parse_cmdline, if the first character of the command line is whitespace, an empty first argument is generated and the remainder of the command line is parsed normally. However, any character between 0x01 and 0x20, inclusive, is considered whitespace!

This command line:

[  Something Else]

Generates these 3 arguments:

argv[0] = []
argv[1] = [Something]
argv[2] = [Else]

The first character is shown as a space, but any character between 0x01 and 0x20 will cause an empty first argument. If the second character, again shown as a space, is something in the same range, other than a space or tab, it will become the first character of the next argument.

  • If the first character is a double quote then any non-zero character is accepted as part of the first argument (even 0x01 through 0x20) until another double quote or the end of the command line is encountered.

If the [*] in the next line is really \x05 (or any other character in the range 0x01 through 0x20):

["123 456*abc\def"ghi]

It generates these arguments:

[123 456*abc\def]
[ghi]

(it is correct that [ghi] is a new argument even though there is no whitespace after the second double quote)

  • If the first character is not a double quote and not a space, then any non-whitespace character, including a double quote, is accepted as part of the argument. Here also, whitespace is defined as any character between 0x01 and 0x20, inclusive (not just space and tab).

If the [*] in the next line is really \x05 (or any other character in the range 0x01 through 0x20), then it acts as an argument separator:

[123"456"*abc]

These two arguments are generated:

[123"456"]
[abc]
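The three cases can be sketched in Python as follows. This mirrors the pseudocode and notes above and is not a transcription of the real function; the names are hypothetical. One assumption is labeled in the code: when the command line starts with a character in the 0x01 to 0x20 range, that character is treated as consumed, which is how I read the note about the second character. An empty command line is not modelled.

```python
def is_ctl_or_space(ch):
    """True for any character from 0x01 through 0x20, inclusive."""
    return "\x01" <= ch <= "\x20"


def first_arg_argvw(cmdline):
    """Return (arg0, remainder) using the CommandLineToArgvW first-argument rules."""
    if not cmdline:
        return "", ""                    # empty input: not modelled here
    if is_ctl_or_space(cmdline[0]):
        return "", cmdline[1:]           # assumption: the first char is consumed
    if cmdline[0] == '"':
        end = cmdline.find('"', 1)       # next ["] ends arg[0], no escaping
        if end == -1:
            return cmdline[1:], ""       # ran off the end of the command line
        return cmdline[1:end], cmdline[end + 1:]
    for i, ch in enumerate(cmdline):
        if is_ctl_or_space(ch):
            return cmdline[:i], cmdline[i + 1:]   # separator is discarded
    return cmdline, ""
```

In the quoted case nothing but the next double quote ends the argument, so spaces, control characters and backslashes all pass through untouched.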

You can download a sample CommandLines.txt file containing all the sample command lines in this post. It can be used with the RunTest utility. The file will need to be renamed as CommandLines.txt before using it.

The last thing I need to cover is the behavior of cmd.exe and batch files. I hope to get this posted in the next week or so. See you next time!

I’d like to have an argument, please

This post presents a better way to understand the quoting and escaping of Windows command line arguments.

A Problem Scenario

In order for you to appreciate why it’s important to understand the things we’re discussing, I’m going to start with a typical problem scenario that illustrates the kind of things that can go wrong.

We know from the Microsoft documentation for CreateProcess that command lines are split into individual arguments based on the location of spaces on the command line. We’re told to enclose arguments that contain spaces between a pair of double quote characters. The typical description of how whitespace (space and tab) and quotes are dealt with while parsing a command line uses terms like double-quoted string and inside or outside a quoted part. Even the Microsoft description of how a command line is parsed says that “a string surrounded by double quotation marks ("string") is interpreted as a single argument, regardless of white space contained within. A quoted string can be embedded in an argument.”

We’re led to believe that analyzing a command line is just a matter of looking for matching pairs of quote characters that enclose individual arguments or embedded quotes. It sounds simple enough and we’re pretty sure we understand how it works. But still, occasionally we come across an example we can’t explain, or (more likely), some command line breaks an existing program or script.

Suppose we have a program, CreateDocs.exe, that generates documentation for a set of C++ source files. It’s driven by a batch file that accepts a starting directory and an optional “VERBOSE” switch. The batch file first passes the starting directory to the program, then generates a list of subdirectories which it also passes one at a time to the program. Our batch file faithfully quotes each directory in case it contains spaces. Recall that %~1 is the first batch argument (%1) with any existing quotes removed, so ["%~1"] unconditionally double quotes the string whether it was quoted or not on the command line:
 

@echo off
REM %1 is the root of the directory tree to process.
REM %2 is the optional string VERBOSE
cls
setlocal
echo.
IF '%1'=='' echo No Path Specified& goto:eof
set ROOT_PATH=%~1

REM Process root directory:
CreateDocs "%ROOT_PATH%" %2

REM Process subdirectories:
FOR /F %%S IN ('dir /ad /b "%ROOT_PATH%"') DO (
   CreateDocs "%%S" %2
   )
echo.
goto:eof

 

Once in a while the script neglects to process files in the starting directory. We discover that it fails whenever the user appends a trailing backslash to the specified directory. The backslash is interpreted by CreateDocs as an escape for the closing double quote. So instead of receiving [SomeDirectory\] in argv[1] when processing ["SomeDirectory\" VERBOSE], it receives [SomeDirectory" VERBOSE]. If you don’t already understand, you will later see why this is happening. It may seem a little mysterious that the FOR loop works correctly, but that’s only because cmd.exe parses things differently than the executable does.

Although we now know what the problem is, we can’t just tell our users not to append a trailing backslash. Someone will get it wrong! To fix this in CreateDocs would require some relatively complex logic. We could, for example, detect that argv[1] contains a ["] at the end of a valid path (and the directory exists), possibly followed by [ VERBOSE].

We opt instead to add logic to our script to detect and remove the trailing backslash:

 

@echo off
REM %1 is the root of the directory tree to process.
REM %2 is the optional string VERBOSE
cls
setlocal
echo.
IF '%1'=='' echo No Path Specified& goto:eof
set ROOT_PATH=%~1

REM Strip off trailing backslash:
IF [%ROOT_PATH:~-1,1%]==[\] set ROOT_PATH=%ROOT_PATH:~0,-1%

REM Process root directory:
CreateDocs "%ROOT_PATH%" %2

REM Process subdirectories:
FOR /F %%S IN ('dir /ad /b "%ROOT_PATH%"') DO (
   CreateDocs "%%S" %2
   )
echo.
goto:eof

 

We congratulate ourselves on our cleverness and we use the script successfully for months.

Then one day we change our build process. We want to be able to debug using binaries built on different development machines. Because the source code is installed in different places on different machines (something we don’t want to change), the debugger can’t always find the .pdb file because it’s specified in the executable image with an absolute path. We could fix this various ways, but we decide to use a fixed drive letter in the build scripts and use subst to map the root of the source code tree, wherever it is on a particular machine, to the root of this drive.

Once again we start to notice that our script occasionally (but not always) fails.

What’s happening now?

The answer is that our script was smart, but not smart enough. The problem this time is caused by the fact that we are now specifying a root directory (e.g., [x:\]). When the batch file removes the trailing backslash it generates [x:], which (for a root directory only) does not mean the same thing as the path with the backslash. The former specifies the current working directory on that drive while the latter is the root directory.

Most of the time our users open a dedicated console for running this tool and never do any other work there. In that case x:\ is the same as x:. But occasionally someone switches to the substed drive and goes to a subdirectory to do some other work. This makes x: different from x:\ and on those infrequent occasions the script fails.

We fix it by adding a special case: the trailing backslash is not removed when a root directory is explicitly specified.

This example was not intended to teach you what to do, but to give a reasonable example of how problems can creep in.

The above batch file still has some issues that I will leave unresolved for now (the “IF” statements may fail with certain strings containing double quote characters).

The Problem With the Existing Way of Looking at Things

Consider the following set of partial command lines:

["Argument With Spaces"]

[Argument" "With" "Spaces]

["Argument "With" Spaces"]

[Argument" With Sp"aces]

["Argument With Spaces]

["Ar"g"um"e"n"t" With Sp"aces""]

Before continuing, take a few moments and try to pick out the outer and the embedded quoted strings.

You may be surprised to learn that all of these are interpreted exactly the same way by the command line parser— as a single argument: [Argument With Spaces].

But how can all of these possibly generate the same argument? One of the quotes is not even closed!

The command line parser rules must be either inconsistent, incomprehensible, or just plain stupid.

But the problem isn’t with the rules, but simply that the conventional way of looking at things is wrong. We need a different way of analyzing command lines that’s easy to apply and always works.

(For an even more extreme and humorous example, take a look at 50 Ways to Say Hello.)

A Better Way To Look at Things

We come now to a very important concept.
 

Double quote characters do not delineate arguments

Double quote characters in a command line string have no relation to the boundaries between arguments and do not necessarily enclose arguments. Each individual double quote character by itself acts simply as a switch to enable or disable the recognition of space as a divider between arguments.

 
Attempting to find pairs of double quote characters that enclose meaningful chunks of text is possibly the biggest conceptual mistake that people make when looking at a complicated command line.

This point is extremely important (possibly the most important concept in this article) so I’m going to discuss it in depth before I move on to explain how a command line is received and processed by an executable program.

Each of the command line parsers we will consider (parse_cmdline, CommandLineToArgvW, and cmd.exe) has a set of characters that in some contexts it considers special (examples are whitespace, double quote, caret, and backslash) and which cause some action to be taken when one of them is encountered— such as beginning a new argument (and removing the special character).

In other contexts the parser treats these same characters like regular text. So at any given time the parser is in one of two states— recognizing (i.e., interpreting) special characters, or ignoring them.

The Parser States Named

In order to be explicit when referring to these two states in the text, I invented names for them. I call the first state InterpretSpecialChars and the second state IgnoreSpecialChars. They correspond to what some writers refer to as being outside or inside a double quoted string. I purposely avoided giving them names containing quote or quoting because doing so reinforces the misconception that double quote characters somehow delineate arguments. The colors associated with the two names (green for InterpretSpecialChars, red for IgnoreSpecialChars) are used later to show the parser state in images I created to demonstrate how example command lines are parsed.

I originally considered calling these InterpretWhitespace and IgnoreWhitespace, but I wanted to use the same state names when discussing cmd.exe where the state governs the interpretation of a different set of special characters, not whitespace.
 

Tracking the parser state is the key to understanding any command line

The key to understanding any command line, no matter how complex, is to pay attention to which of these two states the parser is in at any given time and to understand what causes the state to change.

 

The Last Example Explained

Let’s examine the last example in the above list to see how it is parsed, character by character, from left to right. The special character that we will see either interpreted or ignored, depending on the parser state, is the space character.

You’ll see that the parser state is IgnoreSpecialChars when both the spaces are read, so they do not cause a new argument to be started.

Each line below represents a single character read from the command line. The first column is the actual character read, followed by the parser state before processing the character, the action that was taken, and the value of the partial argument after processing the character.

Here’s the command line again:

["Ar"g"um"e"n"t" With Sp"aces""]

Char read: State when char read:    Action:                        Argument (after processing char):
           InterpretSpecialChars    Start                          []
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       []
    A      IgnoreSpecialChars       Add char [A]                   [A]
    r      IgnoreSpecialChars       Add char [r]                   [Ar]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Ar]
    g      InterpretSpecialChars    Add char [g]                   [Arg]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Arg]
    u      IgnoreSpecialChars       Add char [u]                   [Argu]
    m      IgnoreSpecialChars       Add char [m]                   [Argum]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argum]
    e      InterpretSpecialChars    Add char [e]                   [Argume]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Argume]
    n      IgnoreSpecialChars       Add char [n]                   [Argumen]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argumen]
    t      InterpretSpecialChars    Add char [t]                   [Argument]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Argument]
           IgnoreSpecialChars       Add char [ ]                   [Argument ]
    W      IgnoreSpecialChars       Add char [W]                   [Argument W]
    i      IgnoreSpecialChars       Add char [i]                   [Argument Wi]
    t      IgnoreSpecialChars       Add char [t]                   [Argument Wit]
    h      IgnoreSpecialChars       Add char [h]                   [Argument With]
           IgnoreSpecialChars       Add char [ ]                   [Argument With ]
    S      IgnoreSpecialChars       Add char [S]                   [Argument With S]
    p      IgnoreSpecialChars       Add char [p]                   [Argument With Sp]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argument With Sp]
    a      InterpretSpecialChars    Add char [a]                   [Argument With Spa]
    c      InterpretSpecialChars    Add char [c]                   [Argument With Spac]
    e      InterpretSpecialChars    Add char [e]                   [Argument With Space]
    s      InterpretSpecialChars    Add char [s]                   [Argument With Spaces]
    "      InterpretSpecialChars    Go to IgnoreSpecialChars       [Argument With Spaces]
    "      IgnoreSpecialChars       Go to InterpretSpecialChars    [Argument With Spaces]

 

Image of Example Command Line Parse

  • 3) Switch to IgnoreSpecialChars
  • 4) Switch to InterpretSpecialChars
  • 8) Single double quote character- Enter IgnoreSpecialChars
  • 9) Single double quote character- Return to InterpretSpecialChars
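The table above can be reproduced with a few lines of Python. This sketch models only the two things the example exercises: a double quote toggles the state, and whitespace splits arguments only in InterpretSpecialChars. It deliberately leaves out backslash handling, the rule for two double quotes in a row, empty arguments, and first-argument handling; the function name trace is made up for the example.

```python
INTERPRET = "InterpretSpecialChars"
IGNORE = "IgnoreSpecialChars"


def trace(cmdline):
    """Return (args, rows); each row is (char, state_before, action, argument)."""
    state = INTERPRET
    args, current, rows = [], "", []
    for ch in cmdline:
        before = state
        if ch == '"':
            # A double quote never becomes part of the argument here;
            # it only flips the state.
            state = IGNORE if state == INTERPRET else INTERPRET
            action = "Go to " + state
        elif ch in " \t" and state == INTERPRET:
            if current:
                args.append(current)
            current = ""
            action = "Start new argument"
        else:
            current += ch
            action = "Add char [" + ch + "]"
        rows.append((ch, before, action, current))
    if current:
        args.append(current)
    return args, rows
```

Printing the rows for the example command line gives the same sequence of states and actions as the table, and the other partial command lines in the earlier list all produce the single argument [Argument With Spaces] as well.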

Quoting Defined

Enabling or disabling the recognition of special characters (switching between InterpretSpecialChars and IgnoreSpecialChars) is done by strategically placing double quote characters at specific locations on the command line— wherever you want the state to change. This is under the control of the person who writes the command line. The act of placing double quote characters for this purpose is how I define the term quoting. It is not the delineation of a piece of text by enclosing it in a pair of double quote characters.

 
Quoting is the placement of double quote characters on the command line to toggle the parser state

Quoting is the placement of individual double quote characters on the command line to control the switching between InterpretSpecialChars and IgnoreSpecialChars.
 
Since double quote characters work individually and not in pairs, there is no such concept as a dangling (unclosed) or mismatched quote as discussed by other writers. The command line parser simply ends in one or the other of the two states. If there is an even number of un-escaped double quote characters on the command line (we will define escape later) the parser ends in InterpretSpecialChars. If there is an odd number of un-escaped double quote characters it ends in IgnoreSpecialChars.

What would traditionally be called a quoted string (text between two double quote characters) can span multiple arguments or be contained completely within an argument— more evidence that it’s hopeless to use pairs of double quote characters to find the arguments!

The seemingly arcane parse rules in the Microsoft documentation actually make sense once you understand that they don’t directly tell you how the command line is split into arguments, but merely describe what causes the parser to switch state.

By training your mind to scan a command line from left to right like the parser does, instead of trying to pick out the quoted chunks, you will have little problem understanding or correctly generating even the most complicated command line.

We’ll talk more about this when we go over the specifics of the parsers later.

How a Program Receives its Command Line

Everyone knows the purpose of command line arguments— to customize a particular run of a program or a script. This section covers how an executable program interprets the command line string received from CreateProcess. Elsewhere I will cover the behavior of cmd.exe. For now, keep the following in mind:

 
How cmd.exe parses the command line is different and completely independent of how an executable program parses it

The way cmd.exe interprets the command line is different and completely independent of how an executable program interprets the command line. You must separate your thinking about what cmd.exe does from your thinking about what an executable program does.

 
From a programmer’s perspective, a Windows application written in C or C++ begins execution in main or WinMain and makes available (as function parameters) either an array of individual arguments (main’s argv[]) or a single command line string (WinMain’s lpCmdLine). Both types of application also have available an alternate source for the same set of arguments contained in argv[]— the global variables __argc and __argv. These are filled in before main or WinMain is called so many programmers believe that their programs receive the command line already split up into individual arguments, perhaps by Windows itself. But that is not the case.
 
A program receives just one command line string and must split it into individual arguments itself

A program receives a single command line string which the program itself, not the operating system, splits into individual arguments.
 
This string normally consists of the executable name followed by the actual arguments. I will have a lot more to say about this, but for now the important thing to remember is there is just one command line string.
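If you want to see that single string for yourself, the Win32 function GetCommandLineW returns it unmodified. The sketch below calls it from Python through ctypes; the wrapper name is hypothetical, and on anything other than Windows it simply returns None. Keep in mind that when run from a script, the string you get back is the command line of the Python interpreter process.

```python
import ctypes
import sys


def raw_command_line():
    """Return this process's unsplit command line string, or None off Windows."""
    if sys.platform != "win32":
        return None
    get_command_line = ctypes.windll.kernel32.GetCommandLineW
    get_command_line.restype = ctypes.c_wchar_p   # LPWSTR
    get_command_line.argtypes = []
    return get_command_line()
```

Comparing this string with sys.argv for the same run is a quick way to see the effect of the splitting and quote removal discussed in this post.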

All programs executed under Windows are ultimately started by CreateProcess (or variants such as CreateProcessAsUser). CreateProcess is a complex subject and I’m only going to look at two of the parameters it accepts: the executable name of the program to start (lpApplicationName) and the command line string (lpCommandLine), both of which are optional. Obviously, you have to tell Windows what program to run, so if the first parameter is not supplied (is NULL) then the program name must be given at the start of the command line string. By convention, even if you do specify the program name in the first parameter, you’re supposed to repeat it at the beginning of the command line (if you supply one), but this is not enforced and leaving it out can cause problems. If you don’t supply a command line Windows creates one containing just the program name.

 
Programs assume the first argument on the command line is the executable name

A program does not know how its name was specified and most programs blindly assume the first argument is the program name.
 
The first command line argument is usually interpreted differently than the rest of the command line or ignored altogether. A Windows GUI app will not even see the first argument (at least not in lpCmdLine). If it’s something important it will get lost:

[ImportantArgument-0 Argument-1] is received in lpCmdLine as just [Argument-1]
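To make the lost first argument concrete, here is a minimal sketch of how the text handed to WinMain in lpCmdLine could be derived from the full command line. SkipFirstArgument is a hypothetical helper written for this illustration, not a runtime library function, and it assumes the startup code simply skips everything up to the first unquoted whitespace and then the whitespace that follows.

```cpp
#include <string>

// Hypothetical helper (not part of the CRT or the Windows API).
// Returns what is left of a full command line after the first argument
// and the whitespace following it have been skipped. This approximates
// what a GUI program sees in lpCmdLine.
std::string SkipFirstArgument(const std::string& commandLine)
{
    std::size_t pos = 0;
    bool insideQuotes = false;

    // Skip the first argument. Whitespace ends it unless it is quoted.
    while (pos < commandLine.size()) {
        const char ch = commandLine[pos];
        if (ch == '"') {
            insideQuotes = !insideQuotes;
        }
        else if (!insideQuotes && (ch == ' ' || ch == '\t')) {
            break;
        }
        ++pos;
    }

    // Skip the whitespace separating the first argument from the rest.
    while (pos < commandLine.size() &&
           (commandLine[pos] == ' ' || commandLine[pos] == '\t')) {
        ++pos;
    }
    return commandLine.substr(pos);
}
```

Calling SkipFirstArgument with [ImportantArgument-0 Argument-1] returns [Argument-1], whether or not the first argument was really the program name.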

 
To avoid problems caused by special handling of the first argument, or the assumption that it is the executable name, always include the program name at the beginning of any command line string you supply to CreateProcess.
 
The Microsoft documentation for CreateProcess implies that including the program name at the start of lpCommandLine is optional, something only “C programmers generally” do. After we show how the parsers give special treatment to the first argument, you will understand why I recommend that you always specify the executable name as the first thing on the command line. An interesting thing to note (which we won’t pursue) is that if you specify the executable name in the first argument, the PATH is not searched, but if you specify NULL for the first argument, the PATH is used. For security reasons, Microsoft recommends that you always supply the first argument to CreateProcess and always enclose the executable name at the beginning of the command line string in quotes, though this is strictly only necessary if the path contains spaces.
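As a rough sketch of that recommendation, the helper below builds the string you would pass as lpCommandLine with the quoted program name first. BuildCommandLine is a name invented for this example, and it assumes the remaining arguments have already been quoted and escaped by the caller.

```cpp
#include <string>

// Hypothetical helper: builds a command line string that starts with
// the quoted program path, followed by the caller's arguments (if any).
// The arguments are assumed to be quoted and escaped already.
std::string BuildCommandLine(const std::string& programPath,
                             const std::string& arguments)
{
    // Windows file names cannot contain a double quote, so wrapping the
    // path in quotes needs no escaping of the path itself.
    std::string commandLine = "\"" + programPath + "\"";
    if (!arguments.empty()) {
        commandLine += " ";
        commandLine += arguments;
    }
    return commandLine;
}
```

For example, BuildCommandLine("C:\\Program Files\\Tool\\DumpArgs.exe", "one two") produces ["C:\Program Files\Tool\DumpArgs.exe" one two]. The same path would also be passed in lpApplicationName.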

How the Command Line is Split

Programs generally split their command line by treating a sequence of one or more whitespace characters (space, tab) as the separator between arguments. I say generally because a program is free to do whatever it wants and there are no standards to guide us. There is not even universal agreement on the set of characters that constitute whitespace.
 
A program can interpret its command line any way it wants. There is no method to ensure with 100 percent certainty that a given command line is correct or guarantee how it will be broken up into individual components.
 
This may sound hopeless, but fortunately there’s only a small set of facilities for parsing the command line that most programs use. Once you understand these, you will understand how the vast majority of programs behave.

A program that links with the Microsoft C/C++ runtime library automatically splits the command line by calling a function named parse_cmdline and passes the result to main in argc and argv, or to WinMain through the global variables __argc and __argv.

A Windows GUI program can access the command line string (minus the program name) through the lpCmdLine argument to WinMain, and any program can access the full command line, including the program name (or whatever else was specified at the start of the command line when CreateProcess was called) by calling the aptly named GetCommandLine function. The program can then split the resulting string explicitly either by calling CommandLineToArgvW or a custom command line parser.

The splitting rule used by both these parsers is simple: splitting always occurs on a whitespace boundary, and only when the parser state is InterpretSpecialChars.

Almost all programs use the arguments generated by either parse_cmdline or CommandLineToArgvW, so we will focus on these.
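The basic splitting rule can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm used by parse_cmdline or CommandLineToArgvW: it ignores the backslash rules and the special treatment of the first argument. It also assumes that the InterpretSpecialChars state means "outside double quotes"; the name of the other state, InsideQuotes, is made up for this example.

```cpp
#include <string>
#include <vector>

// Simplified two-state splitter (an illustration only, see the notes above).
std::vector<std::string> SplitCommandLine(const std::string& commandLine)
{
    enum State { InterpretSpecialChars, InsideQuotes }; // InsideQuotes is a made-up name
    State state = InterpretSpecialChars;

    std::vector<std::string> args;
    std::string current;
    bool haveArg = false; // true once the current argument has started

    for (std::size_t i = 0; i < commandLine.size(); ++i) {
        const char ch = commandLine[i];
        if (ch == '"') {
            // A double quote toggles the state and is not copied to the argument.
            state = (state == InterpretSpecialChars) ? InsideQuotes
                                                     : InterpretSpecialChars;
            haveArg = true; // so that "" yields an empty argument
        }
        else if ((ch == ' ' || ch == '\t') && state == InterpretSpecialChars) {
            // Split only on whitespace seen in the InterpretSpecialChars state.
            if (haveArg) {
                args.push_back(current);
                current.clear();
                haveArg = false;
            }
        }
        else {
            current += ch; // literal character, including quoted whitespace
            haveArg = true;
        }
    }
    if (haveArg) {
        args.push_back(current);
    }
    return args;
}
```

Given [dumpargs  "First NotSecond" Second!], this sketch returns three arguments: [dumpargs], [First NotSecond] and [Second!].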

Getting a View on Things

In the next post I’ll go into great detail about the algorithms used by parse_cmdline and CommandLineToArgvW to split a command line. We’ll see some ugliness caused by a special Microsoft rule governing backslashes and we’ll go over the special handling of the first argument. But first I’m going to give you some tools for looking at things (later I’ll give you additional tools that may actually make your job easier).

Download DumpArgs Project
Download RunTest Project

Both of these are console programs.

The first utility is called DumpArgs and is used to see exactly how a given command line is split into arguments.

DumpArgs first displays the command line that it received, both ANSI and Unicode in case there’s a difference. Since it is a console app, it already has the command line split into arguments by parse_cmdline (in argv[]). The program retrieves the raw command line using GetCommandLine, splits it again using CommandLineToArgvW, then outputs both sets of arguments for comparison.

Simply run it with any command line you want.

Example Run of DumpArgs

 
Command line:

DumpArgs "First NotSecond" Second!

 
Output:


Unicode Command Line (from GetCommandLineW): [dumpargs  "First NotSecond" Second!]


              00 01 02 03 04 05 06 07 | 08 09 0a 0b 0c 0d 0e 0f
              -------------------------------------------------
   00000000:  64 00 75 00 6d 00 70 00 | 61 00 72 00 67 00 73 00   d.u.m.p. | a.r.g.s.
   00000010:  20 00 20 00 22 00 46 00 | 69 00 72 00 73 00 74 00   _._.".F. | i.r.s.t.
   00000020:  20 00 4e 00 6f 00 74 00 | 53 00 65 00 63 00 6f 00   _.N.o.t. | S.e.c.o.
   00000030:  6e 00 64 00 22 00 20 00 | 53 00 65 00 63 00 6f 00   n.d."._. | S.e.c.o.
   00000040:  6e 00 64 00 21 00 00 00                             n.d.!...


ANSI Command Line (from GetCommandLineA): [dumpargs  "First NotSecond" Second!]


              00 01 02 03 04 05 06 07 | 08 09 0a 0b 0c 0d 0e 0f
              -------------------------------------------------
   00000000:  64 75 6d 70 61 72 67 73 | 20 20 22 46 69 72 73 74   dumpargs | __"First
   00000010:  20 4e 6f 74 53 65 63 6f | 6e 64 22 20 53 65 63 6f   _NotSeco | nd"_Seco
   00000020:  6e 64 21 00                                         nd!.


CommandLineToArgvW Found 3 Argument(s)
   arg 0   = [dumpargs]
   arg 1   = [First NotSecond]
   arg 2   = [Second!]

Command Line Arguments From argv Array (argc = 3):
   argv[0] = [dumpargs]
   argv[1] = [First NotSecond]
   argv[2] = [Second!]

I show the source code below, but you can download a zip archive containing a pre-built executable, source code, and a Visual Studio project.

Notes on the VS solution (these notes apply to both the DumpArgs project and the RunTest project, below):

  • Just unzip the archive to an empty directory and open the .sln file.
  • You can switch between ANSI and Unicode build by first clicking on the project in the Solution Explorer in Visual Studio and selecting Project->DumpArgs Properties. Under Configuration Properties->General you will see Character Set in the right pane. Select the one you want (counter-intuitively, you select Multi-Byte Character Set for an ANSI build).

 

// DumpArgs.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

void APIENTRY DumpHex (unsigned char* Buffer, int Length) // Displays a buffer of bytes
{
for (int BufIdx = 0; BufIdx < Length; ++BufIdx) {
   if ((BufIdx % 256) == 0) {
      // Print header every 256 bytes
      _tprintf (_T("\n              00 01 02 03 04 05 06 07 | 08 09 0a 0b 0c 0d 0e 0f\n"));
      _tprintf (_T("              -------------------------------------------------\n"));
      }
   if ((BufIdx % 16) == 0) {
      // Print buffer offset at start of lines
      _tprintf (_T("   %08x:  "), BufIdx); // BufIdx is an int, so use %x rather than %lx
      }
   _tprintf (_T("%02x "), Buffer[BufIdx]);
   // Output separator in middle (but not after last byte):
   if (((BufIdx % 16) == 7) && (BufIdx < (Length - 1))) {
      _tprintf (_T("| "));
      }
   if (((BufIdx % 16) == 15) || (BufIdx == (Length - 1))) {
      // Pad last line with spaces if necessary
      int Padding = 3 * (16 - ((BufIdx % 16) + 1));
      if (Padding > 21) {
         Padding += 2; // For where separator would have been
         }
      while (Padding--) {
         _tprintf (_T(" "));
         }
      // Output printable characters
      _tprintf (_T("  "));
      for (int CharIdx = 16 * (BufIdx / 16); CharIdx <= BufIdx; ++CharIdx) {
         _tprintf (_T("%c"), isprint(Buffer[CharIdx])?Buffer[CharIdx]==' '?'_':Buffer[CharIdx]:'.');
         // Output separator in middle (but not after last byte):
         if (((CharIdx % 16) == 7) && (CharIdx != BufIdx)) {
            _tprintf (_T(" | "));
            }
         }
      _tprintf (_T("\n"));
      }
   }
}

int _tmain(int argc, _TCHAR* argv[])
{
wchar_t* CommandLineUnicode = GetCommandLineW();
wprintf(L"Unicode Command Line (from GetCommandLineW): [%s]\n\n", CommandLineUnicode);
DumpHex((unsigned char*)CommandLineUnicode, (2 * wcslen(CommandLineUnicode)) + 2);
_tprintf (_T("\n\n"));

char* CommandLineAnsi = GetCommandLineA();
printf("ANSI Command Line (from GetCommandLineA): [%s]\n\n", CommandLineAnsi);
DumpHex((unsigned char*)CommandLineAnsi, 1 + strlen(CommandLineAnsi));
_tprintf (_T("\n\n"));

int NumArgs = 0;
wchar_t** Args = CommandLineToArgvW(CommandLineUnicode, &NumArgs);
_tprintf (_T("CommandLineToArgvW Found %d Argument(s)\n"), NumArgs);
for (int arg = 0; arg < NumArgs; ++arg) {
    wprintf (L"   arg %s%d   = [%s]\n", ((NumArgs >= 10) && (arg < 10))?L" ":L"", arg, Args[arg]);
   }

_tprintf (_T("\nCommand Line Arguments From argv Array (argc = %d):\n"), argc);
for (int arg = 0; arg < argc; ++arg) {
    // The padding strings must be _T() literals so that %s matches in a Unicode build
    _tprintf (_T("   argv[%s%d] = [%s]\n"), ((argc >= 10) && (arg < 10))?_T(" "):_T(""), arg, argv[arg]);
   }
_tprintf (_T("\n"));
LocalFree(Args);

return 0;
}

 

The next utility, RunTest, lets you drive DumpArgs with specific command lines without worrying about how cmd.exe first mangles the command line.

You specify each command line in a file named CommandLines.txt which must be a plain ANSI text file (not Unicode) in the current directory. DumpArgs must also be in the current directory.

Put each command line on a separate line in CommandLines.txt exactly as you want DumpArgs to receive it. Empty lines are ignored and so are lines starting with a semicolon (in the first column), so you can include comments.
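The file format rules can be restated as a small portable sketch. This is not RunTest’s code; ReadCommandLines and ReadCommandLinesFromText are hypothetical names, and the sketch assumes standard C++ streams rather than the Win32 file API.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the usable command lines from a stream,
// skipping empty lines and lines with a semicolon in the first column.
std::vector<std::string> ReadCommandLines(std::istream& input)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(input, line)) {
        // Drop the carriage return left over from a CR/LF line ending.
        if (!line.empty() && line[line.size() - 1] == '\r') {
            line.erase(line.size() - 1);
        }
        if (line.empty() || line[0] == ';') {
            continue; // empty line or comment
        }
        lines.push_back(line);
    }
    return lines;
}

// Convenience wrapper for trying the rules on a string held in memory.
std::vector<std::string> ReadCommandLinesFromText(const std::string& text)
{
    std::istringstream input(text);
    return ReadCommandLines(input);
}
```

To read the real file you could pass a std::ifstream opened on CommandLines.txt to ReadCommandLines. Everything else on a line, including leading spaces and quotes, is kept exactly as written.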

RunTest reads each command line from CommandLines.txt and starts DumpArgs using CreateProcess, passing it the specified command line in the lpCommandLine parameter. It waits up to 5 seconds for DumpArgs to finish. RunTest explicitly specifies DumpArgs.exe in lpApplicationName and does not insert DumpArgs.exe at the beginning of lpCommandLine so that you can test what happens if you pass something other than the executable name as the first command line argument (for example, to see how the first argument is parsed differently).

As I did for DumpArgs, I again show the source code here and you can download a zip archive containing pre-built executables (for both DumpArgs and RunTest), source code and a Visual Studio project for RunTest and a sample CommandLines.txt file.

 

// RunTest.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
/*
 * Very simple-minded read line function:
 *
 *    - Writes the next line into Buffer and returns true if more data, false if no data or error.
 *    - Trusts that Buffer and BufSize are both valid.
 *    - Always NULL terminates Buffer, unless Buffer is 0 bytes.
 *    - Works only with ANSI file (not Unicode).
 *    - Skips empty lines and lines that start with ";" (for comments)
 *    - Silently truncates line if Buffer too small (but consumes entire line).
 *
 */
bool ReadLine (HANDLE hFile, char* Buffer, int BufSize)
{
bool Result = false;
try {
   if (BufSize >= 1) {
      int BytesWritten = 0;
      char* BufPtr = Buffer;
      while (1) {
         char Char;
         DWORD BytesRead;
         if ((ReadFile(hFile, &Char, sizeof(char), &BytesRead, NULL) != 0) && (BytesRead == sizeof(char))) {
            if (Char == 0x0a) {
               if (BytesWritten > 0) {
                  if (Buffer[0] == ';') { // Ignore comments
                     Result = false;
                     BytesWritten = 0;
                     BufPtr = Buffer;
                     }
                  else {
                     break;
                     }
                  }
               }
            else if ((Char != 0x0d) && (BytesWritten++ < (BufSize - 1))) {
               *BufPtr++ = Char;
               Result = true;
               }
            }
         else {
            break;
            }
         }
      *BufPtr = '\0';
      if (Buffer[0] == ';') { // Final line is a comment with no trailing newline
         Result = false;
         }
      }
   }
catch (...)
   {
   Result = false;
   }
return Result;
}

int _tmain(int argc, _TCHAR* argv[])
{
HANDLE hFile = CreateFileA (".\\CommandLines.txt", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
if (hFile == INVALID_HANDLE_VALUE) {
   _tprintf(_T("\n\nERROR: Could not find CommandLines.txt in the current directory\n\n"));
   return 1;
   }
while (1) {
   char CommandLine[1024];
   if (ReadLine (hFile, CommandLine, 1024) == false) {
      break;
      }
   // Create the process:
   STARTUPINFOA si;
   PROCESS_INFORMATION pi;
   memset (&si, 0, sizeof(si));
   memset (&pi, 0, sizeof(pi));
   GetStartupInfoA(&si);
   si.cb = sizeof(si);
   si.dwFlags = STARTF_USESHOWWINDOW;

   if (CreateProcessA (".\\DumpArgs.exe", CommandLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi) != 0) {
      // Successfully created the process. Wait for it to finish:
      if (WaitForSingleObject(pi.hProcess, 5000) == WAIT_TIMEOUT) {
         _tprintf (_T("Timed out waiting for process to exit\n"));
         }
      // Release the process and thread handles returned by CreateProcess:
      CloseHandle (pi.hProcess);
      CloseHandle (pi.hThread);
      }
   else {
      _tprintf (_T("CreateProcess failed\n"));
      }
   }
CloseHandle (hFile);
return 0;
}

 

Now that you understand how a command line is given to a program, and have some tools for looking at things, we can begin next time to look closely at exactly how a command is parsed.

He who wishes to be obeyed must know how to command

This post covers everything you need to know about how to create and interpret command lines for Windows programs.

I’ll go over how to quote and escape arguments including those that contain spaces, embedded quotes and special characters. I’ll explain how the command line is received and interpreted. I’ll point out specific situations where problems are likely to occur and give you some tools and techniques to ensure your programs always receive exactly the arguments you intended.

Introduction

Understanding the command line is important, especially when you don’t have control over your program’s input. Arguments could come from a database or a directory listing, or be entered by a user of your software. The information here will help you correctly handle situations you didn’t anticipate and avoid having a program or script intermittently fail. I will point out the most common mistakes and explain what to do in those rare situations where there isn’t a clean solution.

Some of the information here comes from other sources as well as my own testing. I present a new way to think about the command line that I hope will eliminate some confusion about how double quote characters are interpreted by the command line parser.

That’s a lot of ground to cover, so I plan to present it over time in multiple posts. By the time I’m done you will thoroughly understand:

  • What I mean by the terms quoting and escaping, and the different contexts where arguments need to be quoted or escaped
  • How an executable program receives its command line and splits it into individual arguments
  • How the command line is interpreted and modified by cmd.exe before passing it to an executable program
  • Batch file considerations
  • Passing received arguments on to other programs and scripts
  • The various problems and security risks that can occur and how to recognize and prevent them

I’ll also provide some free tools to help you do things the right way.

If you’re in a hurry you can jump to a summary or go see what others have to say on the subject.

Technical Notes

The following will help you understand this series better.

  • I use the term “literal” to describe any text that is intended to be passed unchanged to an application. This is in contrast to special characters such as whitespace, used to separate arguments, or double quote ["] and escape [\, ^] meta characters that control parsing but are not passed to the application.
  • When giving examples I needed a way to clearly show the content of a piece of text, especially if it has leading or trailing spaces. Enclosing these in quotes is ambiguous when the text itself contains the quote character. To avoid confusion I enclose literal strings inside square brackets:

[3 backslashes, followed by double quote: \\\"]

None of my sample text contains literal square brackets, so they always indicate the content of a literal string or a single character. Any quote character (single or double) or backslash that appears inside square brackets is always part of the sample text.

  • I include images to help visualize how various example command lines are parsed. Hopefully they’re intuitive, but I included a description of the images. You can also click on any of the images to get to the description page.
  • This series of articles was written from the perspective of a C/C++ Windows programmer so there may be a noticeable bias in the presentation. The information applies mainly to programs developed with Microsoft development tools (Visual Studio 2010) and which use the Microsoft standard runtime library.

Key points are marked with a red exclamation point.