Byte order mark

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving Unicode text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
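As an illustration of why the consumer needs to know the byte order, the following minimal Python sketch serialises one 16-bit code unit both ways and then reads it back under the wrong assumption. The helper name `read_u16` is hypothetical and chosen only for this example.

```python
import struct

def read_u16(two_bytes: bytes, byte_order: str) -> int:
    """Interpret two bytes as one 16-bit code unit ('big' or 'little')."""
    return int.from_bytes(two_bytes, byte_order)

code_unit = 0x0041  # U+0041 LATIN CAPITAL LETTER A

big_endian = struct.pack(">H", code_unit)     # bytes 00 41
little_endian = struct.pack("<H", code_unit)  # bytes 41 00

# Reading with the matching byte order recovers the original value.
right = read_u16(little_endian, "little")

# Reading little-endian data as big-endian yields a different code unit.
wrong = read_u16(little_endian, "big")
```

Without a BOM or some outside agreement, the consumer has no way to tell from these two bytes alone which of the two readings was intended.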

Usage

In UTF-16, a BOM (U+FEFF) is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

  • If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
  • if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE.

The Unicode value U+FFFE is guaranteed never to be assigned as a Unicode character; this implies that in a Unicode context the 0xFF, 0xFE byte pattern can only be interpreted as the U+FEFF character expressed in little-endian byte order (since it could not be a U+FFFE character expressed in big-endian byte order).
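The check described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the data is already known to be UTF-16; the function name `utf16_byte_order` is hypothetical, while `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are constants from the standard library.

```python
import codecs
from typing import Optional

def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' based on a leading BOM, or None if absent.

    Assumes the caller already knows the data is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return None

# U+FEFF serialised in each byte order gives the two patterns from the list.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")
```

Because U+FFFE is never assigned as a character, the FF FE pattern at the start of UTF-16 data does not need a second interpretation, which is why the function can return as soon as it matches.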

While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may nonetheless be encountered. A UTF-8 BOM is explicitly allowed by the Unicode standard[2], but is not recommended[3], as it only identifies a file as UTF-8 and does not state anything about byte order.[4] Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default. However, in Unix-like systems (which make heavy use of text files for file formats as well as for inter-process communication) this practice is not recommended, as it interferes with correct processing of important codes such as the shebang at the start of an interpreted script.[5]

A UTF-8 BOM may also interfere with source code for programming languages that do not recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, the BOM has the subtle effect of causing the page to start being sent to the browser, which prevents the PHP script from specifying custom headers. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
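The shebang problem and the ISO-8859-1 rendering can both be reproduced with a short Python sketch. It relies on the standard `utf-8-sig` codec, which writes a BOM when encoding and skips one when decoding; the helper `strip_utf8_bom` and the sample script text are hypothetical.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

script = "#!/bin/sh\necho hello\n"

# 'utf-8-sig' prepends the BOM, as some editors do by default.
with_bom = script.encode("utf-8-sig")

# The file no longer starts with the two bytes "#!".
starts_with_shebang = with_bom.startswith(b"#!")

# Decoding as plain UTF-8 keeps U+FEFF as the first character;
# decoding as 'utf-8-sig' drops it.
kept = with_bom.decode("utf-8")
dropped = with_bom.decode("utf-8-sig")

# Misreading the three signature bytes as ISO-8859-1.
mojibake = with_bom[:3].decode("latin-1")
```

Stripping the signature before handing the bytes to a tool that does not expect it, as `strip_utf8_bom` does, is one possible workaround; not writing the BOM in the first place avoids the issue altogether.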

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable. For the IANA-registered charsets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, a "byte order mark" must not be used: because the names of these charsets already determine the byte order, an initial U+FEFF has to be interpreted as a (deprecated) "zero width no-break space". For the registered charsets UTF-16 and UTF-32, an initial U+FEFF indicates the byte order.
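Python's standard codecs happen to follow the same distinction, which makes it easy to observe. The sketch below shows the behaviour of CPython's `utf-16`, `utf-16-le`, and `utf-16-be` codecs and is not a statement about other libraries; the variable names are illustrative only.

```python
import codecs

# The letter "A" preceded by a BOM, in each byte order.
le_bytes = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # FF FE 41 00
be_bytes = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")  # FE FF 00 41

# The generic "utf-16" codec reads the initial U+FEFF as a byte order
# mark, uses it to pick the byte order, and removes it from the text.
generic_le = le_bytes.decode("utf-16")
generic_be = be_bytes.decode("utf-16")

# The explicit "utf-16-le" codec already knows the byte order, so the
# initial U+FEFF is kept as an ordinary character in the result.
explicit_le = le_bytes.decode("utf-16-le")

# Encoding with an explicit byte order writes no BOM at all.
no_bom = "A".encode("utf-16-be")
```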

If the BOM character appears in the middle of a data stream, it should, according to Unicode, be interpreted as a "zero-width non-breaking space" (essentially a null character). Its deliberate use for this purpose is deprecated as of Unicode 3.2, however, and the "Word Joiner" character, U+2060, is strongly preferred.[1] This allows U+FEFF to be used solely with the semantics of a BOM.
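One way to apply this recommendation to existing text is sketched below in Python. It is only an illustration under the assumption that a U+FEFF at position 0 is a BOM and any later occurrence was meant as a zero-width no-break space; the function name `replace_interior_zwnbsp` is hypothetical.

```python
BOM = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE, used as the BOM
WORD_JOINER = "\u2060"  # WORD JOINER, the preferred replacement

def replace_interior_zwnbsp(text: str) -> str:
    """Replace every U+FEFF except a leading one with U+2060.

    Assumption: a U+FEFF at index 0 is a BOM and is left untouched.
    """
    if not text:
        return text
    head, tail = text[0], text[1:]
    return head + tail.replace(BOM, WORD_JOINER)
```

For example, `replace_interior_zwnbsp("\ufeffab\ufeffcd")` keeps the first U+FEFF and turns the second into U+2060.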

Representations of byte order marks by encoding

Encoding    | Representation (hexadecimal) | Representation (decimal)
UTF-8       | EF BB BF [t 1]               | 239 187 191
UTF-16 (BE) | FE FF                        | 254 255
UTF-16 (LE) | FF FE                        | 255 254
UTF-32 (BE) | 00 00 FE FF                  | 0 0 254 255
UTF-32 (LE) | FF FE 00 00                  | 255 254 0 0
UTF-7       | 2B 2F 76, and one of the following bytes: [ 38 | 39 | 2B | 2F ] [t 2] | 43 47 118, and one of the following bytes: [ 56 | 57 | 43 | 47 ]
UTF-1       | F7 64 4C                     | 247 100 76
UTF-EBCDIC  | DD 73 66 73                  | 221 115 102 115
SCSU        | 0E FE FF [t 3]               | 14 254 255
BOCU-1      | FB EE 28, optionally followed by FF [t 4] | 251 238 40, optionally followed by 255
GB-18030    | 84 31 95 33                  | 132 49 149 51
  1. ^ In UTF-8, this is not really a "byte order" mark. It identifies the text as UTF-8 but doesn't say anything about the byte order, because UTF-8 does not have byte order issues.[4][6]
  2. ^ In UTF-7, the fourth byte of the BOM, before encoding as base64, is 001111xx in binary, and xx depends on the next character (the first character after the BOM). Hence, technically, the fourth byte is not purely a part of the BOM, but also contains information about the next (non-BOM) character. For xx=00, 01, 10, 11, this byte is, respectively, 38, 39, 2B, or 2F when encoded as base64. If no following character is encoded, 38 is used for the fourth byte and the following byte is 2D.
  3. ^ SCSU allows other encodings of U+FEFF; the form shown is the signature recommended in UTR #6.[7]
  4. ^ For BOCU-1 a signature changes the state of the decoder. Octet 0xFF resets the decoder to the initial state.[8]
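The table can be turned into a simple signature sniffer. The Python sketch below is one possible arrangement, not a standard algorithm: the name `sniff_bom` is hypothetical, the byte values are copied from the table, and longer signatures are tested first so that FF FE 00 00 is reported as UTF-32 (LE) rather than UTF-16 (LE). That ordering is itself an assumption, since the same four bytes are also valid UTF-16 (LE) for a BOM followed by U+0000.

```python
from typing import Optional

# Fixed signatures from the table, longest first so that a shorter
# signature never shadows a longer one that shares its prefix.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\x84\x31\x95\x33", "GB-18030"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]

# UTF-7: three fixed bytes, then one of four possible fourth bytes.
_UTF7_PREFIX = b"\x2b\x2f\x76"
_UTF7_FOURTH = b"\x38\x39\x2b\x2f"

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the table's encoding name for a leading signature, or None."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    if (data.startswith(_UTF7_PREFIX) and len(data) >= 4
            and data[3] in _UTF7_FOURTH):
        return "UTF-7"
    return None
```

A result of `None` only means that no signature was found; it says nothing about which encoding the data actually uses.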

References

  1. ^ a b Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
  2. ^ "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf. Retrieved 2009-03-29. "Table 2-4. The Seven Unicode Encoding Schemes"
  3. ^ "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf. Retrieved 2008-11-30. "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
  4. ^ a b "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". http://unicode.org/faq/utf_bom.html#bom5. Retrieved 2009-01-04. 
  5. ^ Markus Kuhn (2007). "UTF-8 and Unicode FAQ for Unix/Linux: What different encodings are there?". http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf. Retrieved 20 January 2009. "Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter." 
  6. ^ STD 63: UTF-8, a transformation of ISO 10646 Byte Order Mark (BOM)
  7. ^ UTR #6: Signature Byte Sequence for SCSU
  8. ^ UTN #6: Signature Byte Sequence
