# Binary Ordered Compression for Unicode

BOCU-1 is a MIME-compatible Unicode compression scheme; BOCU stands for Binary Ordered Compression for Unicode. It combines the wide applicability of UTF-8 with the compactness of SCSU. The encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note ([Markus Scherer and Mark Davis, "UTN #6: BOCU-1", 2006-02-04](http://www.unicode.org/notes/tn6/#Introduction), accessed 2008-05-18).

For comparison, SCSU was adopted as the standard Unicode compression scheme, with a byte-per-code-point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, because it is not suitable for MIME “text” media types: for example, it cannot be used directly in emails and similar protocols. It also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently ([Doug Ewell, "UTN #14: A survey of Unicode compression", 2004-01-30, PDF](http://unicode.org/notes/tn14), accessed 2008-06-13).

Both SCSU ([IANA registration record for SCSU](http://www.iana.org/assignments/charset-reg/SCSU)) and BOCU-1 ([IANA registration record for BOCU-1](http://www.iana.org/assignments/charset-reg/BOCU-1)) are IANA-registered charsets.

## Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from `U+0000` to `U+0020` are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, `U+0021` through `U+D7FF` and `U+E000` through `U+10FFFF`) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (`U+0020`). The initial state is `U+0040`. The normalization mapping is as follows:
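A minimal Python sketch of this mapping. It assumes the special-case ranges and target values of the sample code in UTN #6 (Hiragana, CJK Unified Ideographs, Hangul) and, for every other code point, the middle of its aligned block of 80 code points. The names `normalize`, `next_state`, and `trace` are illustrative only and do not belong to any library.

```python
INITIAL_STATE = 0x40  # initial state given in the text


def normalize(cp: int) -> int:
    """Return the reference value against which the next difference is taken."""
    # Special cases: values assumed from the UTN #6 sample code.
    if 0x3040 <= cp <= 0x309F:   # Hiragana
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:   # CJK Unified Ideographs
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:   # Hangul syllables
        return 0xC1D1
    # General case: xx00..xx7F -> xx40, xx80..xxFF -> xxC0.
    return (cp & ~0x7F) + 0x40


def next_state(state: int, cp: int) -> int:
    """State after encoding cp; an ASCII space leaves the state unchanged."""
    return state if cp == 0x20 else normalize(cp)


def trace(text: str) -> list[int]:
    """List the encoder state after each code point of text."""
    state = INITIAL_STATE
    states = []
    for ch in text:
        state = next_state(state, ord(ch))
        states.append(state)
    return states


if __name__ == "__main__":
    # Cyrillic letter, space, line feed: the space keeps the Cyrillic state.
    print([f"{s:04X}" for s in trace("\u044F \n")])
```

Under these assumptions a run of text in one small script keeps the state in the middle of that script's block, so most differences stay small.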

The difference between the current code point and the normalized previous code point is encoded as follows:

Each byte range is lexicographically ordered with the following thirteen byte values excluded: `00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20`. For example, the byte sequence `FC 06 FF`, coding for a difference of `1156B`, is immediately followed by the byte sequence `FC 10 01`, coding for a difference of `1156C`.
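The byte-level encoding of a difference can be sketched in Python as follows. The lead-byte boundaries in `RANGES` are an assumption taken from the constants in UTN #6; the thirteen excluded trail values and the two example sequences come from the paragraph above. `pack_diff`, `RANGES`, and `TRAIL_BYTES` are hypothetical names chosen for this illustration.

```python
# The thirteen byte values that never appear as trail bytes.
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,
            0x0D, 0x0E, 0x0F, 0x1A, 0x1B, 0x20}

# Trail "digit" n (0..242) is written as the n-th allowed byte value.
TRAIL_BYTES = [b for b in range(0x01, 0x100) if b not in EXCLUDED]

# (lowest diff, highest diff, origin, lead byte at origin, trail byte count)
# Boundaries assumed from UTN #6; not stated in the surrounding text.
RANGES = [
    (-0x10FF9F, -0x2DD0D, -0x2DD0C, 0x22, 3),
    (-0x2DD0C,  -0x2912,  -0x2911,  0x25, 2),
    (-0x2911,   -0x41,    -0x40,    0x50, 1),
    (-0x40,      0x3F,     0x00,    0x90, 0),
    (0x40,       0x2910,   0x40,    0xD0, 1),
    (0x2911,     0x2DD0B,  0x2911,  0xFB, 2),
    (0x2DD0C,    0x10FFBF, 0x2DD0C, 0xFE, 3),
]


def pack_diff(diff: int) -> bytes:
    """Encode one signed difference as a lead byte plus 0 to 3 trail bytes."""
    for lo, hi, origin, lead, count in RANGES:
        if lo <= diff <= hi:
            rest = diff - origin
            trail = []
            for _ in range(count):
                # divmod floors, so negative values also yield digits 0..242
                rest, digit = divmod(rest, len(TRAIL_BYTES))
                trail.append(TRAIL_BYTES[digit])
            return bytes([lead + rest, *reversed(trail)])
    raise ValueError(f"difference out of range: {diff:#x}")


if __name__ == "__main__":
    for d in (0x1156B, 0x1156C):
        print(f"{d:X} -> {pack_diff(d).hex(' ').upper()}")
```

Because `TRAIL_BYTES` is sorted and the lead bytes grow with the difference, comparing two encoded differences byte by byte gives the same order as comparing the differences themselves.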

Any ASCII input from `U+0000` to `U+007F`, excluding the space `U+0020`, resets the encoder to `U+0040`. Because these values include the line-end code points `U+000D` and `U+000A`, which are encoded as is (`0D 0A`), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, a single corrupted byte affects at most one code point in UTF-8, whereas in SCSU it can affect the entire document.

For input texts that contain none of these values, BOCU-1 offers similar robustness through the special reset byte `0xFF`. When a decoder finds this octet, it resets its state to `U+0040`, as it does at a line end. The BOCU-1 specification does not recommend the use of `0xFF` reset bytes, because it conflicts with other BOCU-1 design goals, notably the "binary order".

The optional use of a signature `U+FEFF` at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence `FB EE 28`, changes the initial state `U+0040` to `U+FEC0`, the normalized value of `U+FEFF`. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (`FB EE 28 FF`) could avoid this effect, but the BOCU-1 specification does not recommend this practice.

In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits, up to `7FFFFFFF`. BOCU-1 and UTF-16 can encode the modern Unicode set from `U+0000` to `U+10FFFF`. Excluding the thirteen "protected" code points encoded as single octets, BOCU-1 can use $256 - 13 = 243$ octets in multi-byte encodings. BOCU-1 needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte `0xFF` is not "protected" and can occur as a trail byte.
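As a worked check of why four bytes suffice, the following count uses decimal numbers and assumes the lead-byte allocation of UTN #6, which is not stated above: 64 single-byte values per sign, then 43, 3, and 1 lead bytes per sign for the two-, three-, and four-byte forms.

$$
\begin{aligned}
\text{1 byte:}\quad & 2 \cdot 64 = 128 \\
\text{2 bytes:}\quad & 2 \cdot 43 \cdot 243 = 20\,898 \\
\text{3 bytes:}\quad & 2 \cdot 3 \cdot 243^{2} = 354\,294 \\
\text{4 bytes:}\quad & 2 \cdot 1 \cdot 243^{3} = 28\,697\,814
\end{aligned}
$$

The differences to be covered run from `-10FF9F` to `10FFBF` (hexadecimal), which is $2\,228\,063$ values. The forms of up to three bytes cover only $128 + 20\,898 + 354\,294 = 375\,320$ of them, so a four-byte form is needed, and a single lead byte per sign is more than enough for the remainder.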

## References

* UTF-1: contains a comparison of the UTF-1, UTF-8, and BOCU-1 designs
* International Components for Unicode: a library that can convert between BOCU-1 and other Unicode encodings

Wikimedia Foundation. 2010.
