shlex/
quoting_warning.md

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
// vim: textwidth=99
/*
Meta note: This file is loaded as a .rs file by rustdoc only.
*/
/*!

A more detailed version of the [warning at the top level](super#warning) about the `quote`/`join`
family of APIs.

In general, passing the output of these APIs to a shell should recover the original string(s).
This page lists cases where it fails to do so.

In noninteractive contexts, there are only minor issues.  'Noninteractive' includes shell scripts
and `sh -c` arguments, or even scripts `source`d from interactive shells.  The issues are:

- [Nul bytes](#nul-bytes)

- [Overlong commands](#overlong-commands)

If you are writing directly to the stdin of an interactive (`-i`) shell (i.e., if you are
pretending to be a terminal), or if you are writing to a cooked-mode pty (even if the other end is
noninteractive), then there is a **severe** security issue:

- [Control characters](#control-characters-interactive-contexts-only)

Finally, there are some [solved issues](#solved-issues).

# List of issues

## Nul bytes

For non-interactive shells, the most problematic input is nul bytes (bytes with value 0).  The
non-deprecated functions all default to returning [`QuoteError::Nul`] when encountering them, but
the deprecated [`quote`] and [`join`] functions leave them as-is.

In Unix, nul bytes can't appear in command arguments, environment variables, or filenames.  It's
not a question of proper quoting; they just can't be used at all.  This is a consequence of Unix's
system calls all being designed around nul-terminated C strings.

Shells inherit that limitation.  Most of them do not accept nul bytes in strings even internally.
Even when they do, it's pretty much useless or even dangerous, since you can't pass them to
external commands.

In some cases, you might fail to pass the nul byte to the shell in the first place.  For example,
the following code uses [`join`] to tunnel a command over an SSH connection:

```rust
std::process::Command::new("ssh")
    .arg("myhost")
    .arg("--")
    .arg(join(my_cmd_args))
```

If any argument in `my_cmd_args` contains a nul byte, then `join(my_cmd_args)` will contain a nul
byte.  But `join(my_cmd_args)` is itself being passed as an argument to a command (the ssh
command), and command arguments can't contain nul bytes!  So this will simply result in the
`Command` failing to launch.

Still, there are other ways to smuggle nul bytes into a shell.  How the shell reacts depends on the
shell and the method of smuggling.  For example, here is Bash 5.2.21 exhibiting three different
behaviors:

- With ANSI-C quoting, the string is truncated at the first nul byte:
  ```bash
  $ echo $'foo\0bar' | hexdump -C
  00000000  66 6f 6f 0a                                       |foo.|
  ```

- With command substitution, nul bytes are removed with a warning:
  ```bash
  $ echo $(printf 'foo\0bar') | hexdump -C
  bash: warning: command substitution: ignored null byte in input
  00000000  66 6f 6f 62 61 72 0a                              |foobar.|
  ```

- When a nul byte appears directly in a shell script, it's removed with no warning:
  ```bash
  $ printf 'echo "foo\0bar"' | bash | hexdump -C
  00000000  66 6f 6f 62 61 72 0a                              |foobar.|
  ```

Zsh, in contrast, actually allows nul bytes internally, in shell variables and even arguments to
builtin commands.  But if a variable is exported to the environment, or if an argument is used for
an external command, then the child process will see it silently truncated at the first nul.  This
might actually be more dangerous, depending on the use case.

## Overlong commands

If you pass a long string into a shell, several things might happen:

- It might succeed, yet the shell might have trouble actually doing anything with it.  For example:

  ```bash
  x=$(printf '%010000000d' 0); /bin/echo $x
  bash: /bin/echo: Argument list too long
  ```

- If you're using certain shells (e.g. Busybox Ash) *and* using a pty for communication, then the
  shell will impose a line length limit, ignoring all input past the limit.

- If you're using a pty in cooked mode, then by default, if you write so many bytes as input that
  it fills the kernel's internal buffer, the kernel will simply drop those bytes, instead of
  blocking waiting for the shell to empty out the buffer.  In other words, random bits of input can
  be lost, which is obviously insecure.

Future versions of this crate may add an option to [`Quoter`] to check the length for you.

## Control characters (*interactive contexts only*)

Control characters are the bytes from `\x00` to `\x1f`, plus `\x7f`.  `\x00` (the nul byte) is
discussed [above](#nul-bytes), but what about the rest?  Well, many of them correspond to terminal
keyboard shortcuts.  For example, when you press Ctrl-A at a shell prompt, your terminal sends the
byte `\x01`.  The shell sees that byte and (if not configured differently) takes the standard
action for Ctrl-A, which is to move the cursor to the beginning of the line.

This means that it's quite dangerous to pipe bytes to an interactive shell.  For example, here is a
program that tries to tell Bash to echo an arbitrary string, 'safely':
```rust
use std::process::{Command, Stdio};
use std::io::Write;

let evil_string = "\x01do_something_evil; ";
let quoted = shlex::try_quote(evil_string).unwrap();
println!("quoted string is {:?}", quoted);

let mut bash = Command::new("bash")
    .arg("-i") // force interactive mode
    .stdin(Stdio::piped())
    .spawn()
    .unwrap();
let stdin = bash.stdin.as_mut().unwrap();
write!(stdin, "echo {}\n", quoted).unwrap();
```

Here's the output of the program (with irrelevant bits removed):

```text
quoted string is "'\u{1}do_something_evil; '"
/tmp comex$ do_something_evil; 'echo '
bash: do_something_evil: command not found
bash: echo : command not found
```

Even though we quoted it, Bash still ran an arbitrary command!

This is not because the quoting was insufficient, per se.  In single quotes, all input is supposed
to be treated as raw data until the closing single quote.  And in fact, this would work fine
without the `"-i"` argument.

But line input is a separate stage from shell syntax parsing.  After all, if you type a single
quote on the keyboard, you wouldn't expect it to disable all your keyboard shortcuts.  So a control
character always has its designated effect, no matter if it's quoted or backslash-escaped.

Also, some control characters are interpreted by the kernel tty layer instead, like CTRL-C to send
SIGINT.  These can be an issue even with noninteractive shells, but only if using a pty for
communication, as opposed to a pipe.

To be safe, you just have to avoid sending them.

### Why not just use hex escapes?

In any normal programming languages, this would be no big deal.

Any normal language has a way to escape arbitrary characters in strings by writing out their
numeric values.  For example, Rust lets you write them in hexadecimal, like `"\x4f"` (or
`"\u{1d546}"` for Unicode).  In this way, arbitrary strings can be represented using only 'nice'
simple characters.  Any remotely suspicious character can be replaced with a numeric escape
sequence, where the escape sequence itself consists only of alphanumeric characters and some
punctuation.  The result may not be the most readable[^choices], but it's quite safe from being
misinterpreted or corrupted in transit.

Shell is not normal.  It has no numeric escape sequences.

There are a few different ways to quote characters (unquoted, unquoted-with-backslash, single
quotes, double quotes), but all of them involve writing the character itself.  If the input
contains a control character, the output must contain that same character.

### Mitigation: terminal filters

In practice, automating interactive shells like in the above example is pretty uncommon these days.
In most cases, the only way for a programmatically generated string to make its way to the input of
an interactive shell is if a human copies and pastes it into their terminal.

And many terminals detect when you paste a string containing control characters.  iTerm2 strips
them out; gnome-terminal replaces them with alternate characters[^gr]; Kitty outright prompts for
confirmation.  This mitigates the risk.

But it's not perfect.  Some other terminals don't implement this check or implement it incorrectly.
Also, these checks tend to not filter the tab character, which could trigger tab completion.  In
most cases that's a non-issue, because most shells support paste bracketing, which disables tab and
some other control characters[^bracketing] within pasted text.  But in some cases paste bracketing
gets disabled.

### Future possibility: ANSI-C quoting

I said that shell syntax has no numeric escapes, but that only applies to *portable* shell syntax.
Bash and Zsh support an obscure alternate quoting style with the syntax `$'foo'`.  It's called
["ANSI-C quoting"][ansic], and inside it you can use all the escape sequences supported by C,
including hex escapes:

```bash
$ echo $'\x41\n\x42'
A
B
```

But other shells don't support it — including Dash, a popular choice for `/bin/sh`, and Busybox's
Ash, frequently seen on stripped-down embedded systems.  This crate's quoting functionality [tries
to be compatible](crate#compatibility) with those shells, plus all other POSIX-compatible shells.
That makes ANSI-C quoting a no-go.

Still, future versions of this crate may provide an option to enable ANSI-C quoting, at the cost of
reduced portability.

### Future possibility: printf

Another option would be to invoke the `printf` command, which is required by POSIX to support octal
escapes.  For example, you could 'escape' the Rust string `"\x01"` into the shell syntax `"$(printf
'\001')"`.  The shell will execute the command `printf` with the first argument being literally a
backslash followed by three digits; `printf` will output the actual byte with value 1; and the
shell will substitute that back into the original command.

The problem is that 'escaping' a string into a command substitution just feels too surprising.  If
nothing else, it only works with an actual shell; [other languages' shell parsing
routines](crate#compatibility) wouldn't understand it.  Neither would this crate's own parser,
though that could be fixed.

Future versions of this crate may provide an option to use `printf` for quoting.

### Special note: newlines

Did you know that `\r` and `\n` are control characters?  They aren't as dangerous as other control
characters (if quoted properly).  But there's still an issue with them in interactive contexts.

Namely, in some cases, interactive shells and/or the tty layer will 'helpfully' translate between
different line ending conventions.  The possibilities include replacing `\r` with `\n`, replacing
`\n` with `\r\n`, and others.  This can't result in command injection, but it's still a lossy
transformation which can result in a failure to round-trip (i.e. the shell sees a different string
from what was originally passed to `quote`).

Numeric escapes would solve this as well.

# Solved issues

## Solved: Past vulnerability (GHSA-r7qv-8r2h-pg27 / RUSTSEC-2024-XXX)

Versions of this crate before 1.3.0 did not quote `{`, `}`, and `\xa0`.

See:
- <https://github.com/advisories/GHSA-r7qv-8r2h-pg27>
- (TODO: Add Rustsec link)

## Solved: `!` and `^`

There are two non-control characters which have a special meaning in interactive contexts only: `!` and
`^`.  Luckily, these can be escaped adequately.

The `!` character triggers [history expansion][he]; the `^` character can trigger a variant of
history expansion known as [Quick Substitution][qs].  Both of these characters get expanded even
inside of double-quoted strings\!

If we're in a double-quoted string, then we can't just escape these characters with a backslash.
Only a specific set of characters can be backslash-escaped inside double quotes; the set of
supported characters depends on the shell, but it often doesn't include `!` and `^`.[^escbs]
Trying to backslash-escape an unsupported character produces a literal backslash:
```bash
$ echo "\!"
\!
```

However, these characters don't get expanded in single-quoted strings, so this crate just
single-quotes them.

But there's a Bash bug where `^` actually does get partially expanded in single-quoted strings:
```bash
$ echo '
> ^a^b
> '

!!:s^a^b
```

To work around that, this crate forces `^` to appear right after an opening single quote.  For
example, the string `"^` is quoted into `'"''^'` instead of `'"^'`.  This restriction is overkill,
since `^` is only meaningful right after a newline, but it's a sufficient restriction (after all, a
`^` character can't be preceded by a newline if it's forced to be preceded by a single quote), and
for now it simplifies things.

## Solved: `\xa0`

The byte `\xa0` may be treated as a shell word separator, specifically on Bash on macOS when using
the default UTF-8 locale, only when the input is invalid UTF-8.  This crate handles the issue by
always using quotes for arguments containing this byte.

In fact, this crate always uses quotes for arguments containing any non-ASCII bytes.  This may be
changed in the future, since it's a bit unfriendly to non-English users.  But for now it
minimizes risk, especially considering the large number of different legacy single-byte locales
someone might hypothetically be running their shell in.

### Demonstration

```bash
$ echo -e 'ls a\xa0b' | bash
ls: a: No such file or directory
ls: b: No such file or directory
```
The normal behavior would be to output a single line, e.g.:
```bash
$ echo -e 'ls a\xa0b' | bash
ls: cannot access 'a'$'\240''b': No such file or directory
```
(The specific quoting in the error doesn't matter.)

### Cause

Just for fun, here's why this behavior occurs:

Bash decides which bytes serve as word separators based on the libc function [`isblank`][isblank].
On macOS on UTF-8 locales, this passes for `\xa0`, corresponding to U+00A0 NO-BREAK SPACE.

This is doubly unique compared to the other systems I tested (Linux/glibc, Linux/musl, and
Windows/MSVC).  First, the other systems don't allow bytes in the range [0x80, 0xFF] to pass
<code>is<i>foo</i></code> functions in UTF-8 locales, even if the corresponding Unicode codepoint
does pass, as determined by the wide-character equivalent function, <code>isw<i>foo</i></code>.
Second, the other systems don't treat U+00A0 as blank (even using `iswblank`).

Meanwhile, Bash checks for multi-byte sequences and forbids them from being treated as special
characters, so the proper UTF-8 encoding of U+00A0, `b"\xc2\xa0"`, is not treated as a word
separator.  Treatment as a word separator only happens for `b"\xa0"` alone, which is illegal UTF-8.

[ansic]: https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html
[he]: https://www.gnu.org/software/bash/manual/html_node/History-Interaction.html
[qs]: https://www.gnu.org/software/bash/manual/html_node/Event-Designators.html
[isblank]: https://man7.org/linux/man-pages/man3/isblank.3p.html
[nul]: #nul-bytes

[^choices]: This can lead to tough choices over which
  characters to escape and which to leave as-is, especially when Unicode gets involved and you
  have to balance the risk of confusion with the benefit of properly supporting non-English
  languages.
  <br>
  <br>
  We don't have the luxury of those choices.

[^gr]: For example, backspace (in Unicode lingo, U+0008 BACKSPACE) turns into U+2408 SYMBOL FOR BACKSPACE.

[^bracketing]: It typically disables almost all handling of control characters by the shell proper,
    but one necessary exception is the end-of-paste sequence itself (which starts with the control
    character `\x1b`).  In addition, paste bracketing does not suppress handling of control
    characters by the kernel tty layer, such as `\x03` sending SIGINT (which typically clears the
    currently typed command, making it dangerous in a similar way to `\x01`).

[^escbs]: For example, Dash doesn't remove the backslash from `"\!"` because it simply doesn't know
    anything about `!` as a special character: it doesn't support history expansion.  On the other
    end of the spectrum, Zsh supports history expansion and does remove the backslash — though only
    in interactive mode.  Bash's behavior is weirder.  It supports history expansion, and if you
    write `"\!"`, the backslash does prevent history expansion from occurring — but it doesn't get
    removed!

*/

// `use` declarations to make auto links work:
use ::{quote, join, Shlex, Quoter, QuoteError};

// TODO: add more about copy-paste and human readability.