 * ABOUT BUGS

 Before reporting a bug, please check the list of known bugs
 and the list of oft-reported non-bugs (below).

 Bugs and comments may be sent to bonzini@gnu.org; please
 include in the Subject: header the first line of the output of
 ``sed --version''.

 Please do not send a bug report like this:

 	[while building frobme-1.3.4]
 	$ configure
 	sed: file sedscr line 1: Unknown option to 's'

 If sed doesn't configure your favorite package, take a few extra
 minutes to identify the specific problem and make a stand-alone test
 case.

 A stand-alone test case includes all the data necessary to perform the
 test, and the specific invocation of sed that causes the problem.  The
 smaller a stand-alone test case is, the better.  A test case should
 not involve something as far removed from sed as ``try to configure
 frobme-1.3.4''.  Yes, that is in principle enough information to look
 for the bug, but that is not a very practical prospect.
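
 As a rough sketch of what a stand-alone test case could look like (the
 input text and the script are made up for illustration and do not come
 from a real report), everything needed to reproduce the behavior fits
 in one pipeline:

```shell
# Hypothetical test case: the input data and the exact sed invocation
# are both part of the report, so nothing else is needed to rerun it.
actual=$(printf 'hello world\n' | sed 's/world/sed/')
expected='hello sed'
echo "expected: $expected"
echo "actual:   $actual"
```

 A report built this way would then say how the two lines differ.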


 * NON-BUGS

 `N' command on the last line

   Most versions of sed exit without printing anything when the `N'
   command is issued on the last line of a file.  GNU sed instead
   prints the pattern space before exiting, unless of course the `-n'
   command-line switch has been specified.  More information on the
   reason behind this choice can be found in the Info manual.
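
   A minimal sketch of the difference, assuming GNU sed is the `sed' in
   PATH and POSIXLY_CORRECT is not set (the three-line input is only an
   illustration):

```shell
# N joins lines 1 and 2; on line 3 there is no next line to append.
n_out=$(printf '1\n2\n3\n' | sed 'N;s/\n/+/')
printf '%s\n' "$n_out"
```

   GNU sed is expected to print `1+2' followed by `3'; a sed with the
   other behavior would print only `1+2'.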


 regex syntax clashes (problems with backslashes)

   sed uses the POSIX basic regular expression syntax.  According to
   the standard, the meaning of some escape sequences is undefined in
   this syntax;  notable in the case of GNU sed are `\|', `\+', `\?',
   `\`', `\'', `\<', `\>', `\b', `\B', `\w', and `\W'.

   As in all GNU programs that use POSIX basic regular expressions, sed
   interprets these escape sequences as meta-characters.  So, `x\+'
   matches one or more occurrences of `x'.   `abc\|def' matches either
   `abc' or `def'.
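
   The two examples above can be tried as follows; this is a sketch that
   assumes GNU sed, and the input strings are arbitrary:

```shell
# \+ acts as "one or more": the run of x's is replaced as a whole.
plus_out=$(printf 'xxxy\n' | sed 's/x\+/X/')
# \| acts as alternation: lines containing abc or def are changed.
alt_out=$(printf 'abc\ndef\nghi\n' | sed 's/abc\|def/-/')
printf '%s\n' "$plus_out" "$alt_out"
```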

   This syntax may cause problems when running scripts written for other
   seds.  Some sed programs have been written with the assumption that
   `\|' and `\+' match the literal characters `|' and `+'.  Such scripts
   must be modified by removing the spurious backslashes if they are to
   be used with recent versions of sed (not only GNU sed).

   On the other hand, some scripts use `s|abc\|def||g' to remove occurrences
   of _either_ `abc' or `def'.  While this worked until sed 4.0.x, newer
   versions interpret this as removing the string `abc|def'.  This is
   again undefined behavior according to POSIX, but the newer
   interpretation is arguably more robust.  The older one, for example,
   required the regex matcher to parse `\/' as `/' in the common case of
   escaping a slash, which is itself undefined behavior.  The new behavior
   avoids relying on this, which is good because the regex matcher is
   only partially under our control.
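
   A sketch of the difference, assuming a GNU sed newer than 4.0.x (the
   input line is made up).  Choosing a delimiter other than `|' is one
   way to keep `\|' available as alternation:

```shell
# '|' is the delimiter here, so the escaped \| is a literal bar and
# only the string "abc|def" is removed.
literal_out=$(printf 'abc|def abc def\n' | sed 's|abc\|def||g')
# With '/' as the delimiter, \| is alternation and every abc or def goes.
either_out=$(printf 'abc|def abc def\n' | sed 's/abc\|def//g')
printf '[%s]\n' "$literal_out" "$either_out"
```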

   In addition, GNU sed supports several escape characters (some of
   which are multi-character) to insert non-printable characters
   in scripts (`\a', `\c', `\d', `\o', `\r', `\t', `\v', `\x').  These
   can cause similar problems with scripts written for other seds.
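
   For instance, with GNU sed `\t' in the replacement inserts a tab
   character.  A small sketch with an arbitrary input:

```shell
# GNU extension: \t in the replacement stands for a tab character.
tab_out=$(printf 'a b\n' | sed 's/ /\t/')
printf '%s\n' "$tab_out"
```

   A sed without this extension may produce something else here, which
   is the kind of portability problem described above.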


 -i clobbers read-only files

   In short, `sed d -i' will let one delete the contents of
   a read-only file, and in general the `-i' option will let
   one clobber protected files.  This is not a bug, but rather a
   consequence of how the Unix filesystem works.

   The permissions on a file say what can happen to the data
   in that file, while the permissions on a directory say what can
   happen to the list of files in that directory.  `sed -i'
   will never open for writing a file that is already on disk;
   rather, it works on a temporary file that is finally renamed
   to the original name.  If you rename or delete files, you're actually
   modifying the contents of the directory, so the operation depends on
   the permissions of the directory, not of the file.  For this same
   reason, sed will not let one use `-i' on a writeable file in a
   read-only directory, and will break hard or symbolic links when
   `-i' is used on such a file.
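
   The following sketch shows both effects in a scratch directory.  It
   assumes GNU sed and a `mktemp' that supports `-d'; the file names are
   arbitrary:

```shell
dir=$(mktemp -d)
printf 'secret\n' > "$dir/file"
ln "$dir/file" "$dir/link"     # second hard link to the same data
chmod 444 "$dir/file"          # read-only file in a writable directory
sed -i d "$dir/file"           # succeeds: a new file is renamed over the name
new_size=$(( $(wc -c < "$dir/file") ))
link_text=$(cat "$dir/link")   # the old data is still reachable here
echo "size of file after sed -i: $new_size"
echo "content behind the old hard link: $link_text"
rm -rf "$dir"
```

   The name `file' now refers to a new, empty file, while `link' still
   points at the original data, so the hard link has been broken.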


 `0a' does not work (gives an error)

   There is no line 0.  0 is a special address that is only used to treat
   addresses like `0,/RE/' as active when the script starts.  If you
   write `1,/abc/d' and the first line includes the word `abc', then
   that match is ignored, because address ranges must span at least
   two lines (barring the end of the file).  What you probably wanted is
   to delete every line up to and including the first one that contains
   `abc', and this is obtained with `0,/abc/d'.
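
   A sketch of the two address forms side by side, assuming GNU sed
   (`0,/RE/' is a GNU extension) and a made-up four-line input whose
   first line already contains `abc':

```shell
input='abc
x
abc
y'
# 1,/abc/ looks for the end of the range starting at line 2,
# so lines 1 through 3 are deleted.
from_one=$(printf '%s\n' "$input" | sed '1,/abc/d')
# 0,/abc/ may end on line 1 itself, so only that line is deleted.
from_zero=$(printf '%s\n' "$input" | sed '0,/abc/d')
printf 'with 1,/abc/d:\n%s\n' "$from_one"
printf 'with 0,/abc/d:\n%s\n' "$from_zero"
```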


 `[a-z]' is case insensitive
 `s/.*//' does not clear pattern space

   You are encountering problems with locales.  POSIX mandates that `[a-z]'
   uses the current locale's collation order -- in C parlance, that means
   strcoll(3) instead of strcmp(3).  Some locales have a case insensitive
   strcoll, others don't.

   Another problem is that `[a-z]' tries to use collation symbols.  This
   only happens if you are on the GNU system, using GNU libc's regular
   expression matcher instead of compiling the one supplied with GNU sed.
   In a Danish locale, for example, the regular expression `^[a-z]$'
   matches the string `aa', because `aa' is a single collating symbol that
   comes after `a' and before `b'; `ll' behaves similarly in Spanish
   locales, or `ij' in Dutch locales.

   Another common localization-related problem happens if your input stream
   includes invalid multibyte sequences.  POSIX mandates that such
   sequences are _not_ matched by `.', so that `s/.*//' will not clear
   pattern space as you would expect.  In fact, there is no way to clear
   sed's buffers in the middle of the script in most multibyte locales
   (including UTF-8 locales).  For this reason, GNU sed provides a `z'
   command (for `zap') as an extension.

   To work around these locale-related problems, which may cause bugs
   in shell scripts, you can set the LC_ALL environment variable to `C',
   or set the locale on a more fine-grained basis with the other LC_*
   environment variables.
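
   A sketch of the workarounds, assuming GNU sed with the `z' command
   available.  The byte \377 is used as an example because it is never
   valid in UTF-8:

```shell
# In the C locale every byte is a character, so .* covers the whole line.
cleared=$(printf 'a\377b\n' | LC_ALL=C sed 's/.*/[cleared]/')
# z empties the pattern space whatever the locale and the bytes in it.
zapped=$(printf 'a\377b\n' | sed 'z;s/^$/[empty]/')
# In the C locale [a-z] does not match an upper-case letter.
upper=$(printf 'B\n' | LC_ALL=C sed 's/[a-z]/x/')
printf '%s\n' "$cleared" "$zapped" "$upper"
```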