Unicode Support

TinyMUX has comprehensive Unicode support: it handles the full range of Unicode code points natively as UTF-8. This lets players and builders use characters from any writing system, including Latin, Cyrillic, CJK, Arabic, and emoji.

What TinyMUX Handles

Unicode in TinyMUX is not just “accepting UTF-8 bytes.” The server understands Unicode at a deep level:

  • Character classification – knowing whether a code point is a letter, digit, printable, or control character. This affects name validation, pattern matching, and output.
  • Case conversion – proper uppercase, lowercase, and titlecase conversion for every script that distinguishes case, not just ASCII.
  • Normalization – NFC (composed) and NFD (decomposed) normalization, ensuring that equivalent character sequences are treated identically.
  • Grapheme cluster breaking – knowing that a base character plus combining marks forms a single visual unit. This matters for functions like mid() and strlen() that operate on what users perceive as “characters.”
  • Display width – CJK characters are double-width; combining marks are zero-width. Functions like ljust(), rjust(), and center() need to know this to align columns correctly.
  • Collation – sorting strings according to the Unicode Collation Algorithm (UCA), using the Default Unicode Collation Element Table (DUCET), so that accented characters sort near their base characters.
  • Emoji – extended pictographic sequences including skin tone modifiers, ZWJ sequences, and flag sequences.
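
As a rough illustration of two of the properties listed above (normalization and display width), here is a minimal Python sketch built on the standard unicodedata module. It is not TinyMUX code: the helper name display_width is hypothetical, and the width rule (zero for combining marks, two for East Asian Wide and Fullwidth characters, one otherwise) is a simplification of what a real server would need.

```python
import unicodedata

# Two spellings of the same visible character "é".
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Normalization makes the two spellings comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)


def display_width(text):
    """Approximate the number of terminal columns that text occupies.

    Simplified rule (an assumption of this sketch):
    nonspacing and enclosing marks take 0 columns, East Asian Wide and
    Fullwidth characters take 2, and everything else takes 1.
    """
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # combining mark: zero width
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # double-width character
        else:
            width += 1
    return width
```

With this sketch, the composed and decomposed spellings both report a width of 1, while a three-character CJK string such as "日本語" reports 6, which is why padding functions cannot simply count code points.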

How It Works

TinyMUX processes UTF-8 at the byte level using precomputed state machine tables. These tables are generated at build time from the Unicode Consortium’s official data files (UnicodeData.txt, EastAsianWidth.txt, GraphemeBreakProperty.txt, etc.) by a pipeline of code generators in the utf/ directory of the source tree.

The state machines consume UTF-8 bytes one at a time and produce classification results without ever decoding to UTF-32. This means:

  • No per-character memory allocation
  • No decoding/encoding overhead
  • Constant-time classification per byte
  • The full Unicode range is supported (U+0000 through U+10FFFF, over 1.1 million code points)

For example, to test whether a UTF-8 byte sequence is a printable character, the server feeds each byte to the classification state machine. After the last byte, the final state indicates whether the character belongs to the class. Often the machine can determine the result before consuming all bytes of a multi-byte sequence.

The generated tables include:

  • Input Translation Table (ITT) – maps each possible byte value (0-255) to a column number, collapsing bytes that behave identically.
  • State Transition Table (STT) – given the current state and column, produces the next state.
  • Output Table – for transforms like case conversion, maps states to output values.

This design keeps the tables compact despite covering the entire Unicode range. A classification that would require a 1.1-million-entry lookup table is instead represented as a state machine with typically 10-30 states and 20-40 columns.
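
The ITT/STT idea can be sketched with a hand-built toy machine. The Python below is only an illustration under stated assumptions: the tables are written by hand rather than generated, the names (ITT, STT, is_greek_block) are hypothetical, and the class being tested is simply "one code point in the Greek and Coptic block, U+0370 to U+03FF", chosen because its UTF-8 encodings (CD B0 through CF BF) fit in a very small machine. TinyMUX's real tables are produced by its generators and cover full Unicode properties.

```python
# Column numbers: bytes that behave identically share one column.
COL_OTHER, COL_CD, COL_CE_CF, COL_CONT_LOW, COL_CONT_HIGH = range(5)


def build_itt():
    """Input Translation Table: map each byte value (0-255) to a column."""
    itt = [COL_OTHER] * 256
    itt[0xCD] = COL_CD
    itt[0xCE] = itt[0xCF] = COL_CE_CF
    for b in range(0x80, 0xB0):
        itt[b] = COL_CONT_LOW    # continuation bytes 80..AF
    for b in range(0xB0, 0xC0):
        itt[b] = COL_CONT_HIGH   # continuation bytes B0..BF
    return itt


ITT = build_itt()

# States of the toy machine.
START, AFTER_CD, AFTER_CE_CF, MEMBER, NOT_MEMBER = range(5)

# State Transition Table: STT[state][column] -> next state.
STT = [
    # OTHER      CD          CE_CF        CONT_LOW    CONT_HIGH
    [NOT_MEMBER, AFTER_CD,   AFTER_CE_CF, NOT_MEMBER, NOT_MEMBER],  # START
    [NOT_MEMBER, NOT_MEMBER, NOT_MEMBER,  NOT_MEMBER, MEMBER],      # AFTER_CD
    [NOT_MEMBER, NOT_MEMBER, NOT_MEMBER,  MEMBER,     MEMBER],      # AFTER_CE_CF
    [NOT_MEMBER] * 5,  # MEMBER: any extra byte means "not a single character"
    [NOT_MEMBER] * 5,  # NOT_MEMBER: absorbing state
]


def is_greek_block(utf8_bytes):
    """Return True if utf8_bytes encodes exactly one code point in U+0370..U+03FF.

    The bytes are never decoded to a code point; each byte costs one
    ITT lookup and one STT lookup.
    """
    state = START
    for byte in utf8_bytes:
        state = STT[state][ITT[byte]]
        if state == NOT_MEMBER:
            break  # result is already known; skip the remaining bytes
    return state == MEMBER
```

For instance, is_greek_block("α".encode("utf-8")) walks START, AFTER_CE_CF, MEMBER and returns True, while the bytes of "a" are rejected after the first lookup. Five states and five columns stand in for what would otherwise be a per-code-point lookup.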

Character Sets and Conversion

Not all clients support UTF-8. TinyMUX can down-convert Unicode output to legacy character sets:

  • ASCII – non-ASCII characters are transliterated or stripped
  • Latin-1 (ISO 8859-1) – Western European characters preserved
  • Latin-2 (ISO 8859-2) – Central European characters preserved
  • CP437 – DOS code page for legacy terminal emulators

These conversions also use precomputed state machine tables, ensuring consistent and fast behavior.
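
The effect of down-conversion can be approximated in Python with the standard codecs. This is a sketch only: the function name downconvert is hypothetical, the fallback policy (strip combining marks after NFD decomposition, then substitute "?" for anything still unrepresentable) is an assumption of the example, and TinyMUX's own tables may transliterate differently.

```python
import unicodedata


def downconvert(text, charset):
    """Encode text in a legacy charset, degrading characters it lacks.

    charset is a Python codec name such as "ascii", "latin-1",
    "iso8859_2", or "cp437".
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)  # character exists in the target set
        except UnicodeEncodeError:
            # Fallback: drop combining marks from the decomposed form,
            # then replace whatever still cannot be encoded with "?".
            base = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
            out += base.encode(charset, errors="replace")
    return bytes(out)
```

Under this policy, "café" becomes b"cafe" for an ASCII client but keeps its accent as the single byte 0xE9 for a Latin-1 client, and a character with no close equivalent, such as a CJK ideograph sent to an ASCII client, degrades to "?".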

Client Configuration

For full Unicode support, both the server and client must be configured for UTF-8:

  • Server side: TinyMUX handles UTF-8 natively. The default_charset configuration option controls the default character set for new connections.
  • Client side: The client must send and display UTF-8. Most modern clients (Mudlet, BeipMU, TinTin++) support this natively. Some older clients need explicit configuration.

Impact on Softcode

Most softcode functions work transparently with Unicode. However, be aware that:

  • strlen() counts characters (grapheme clusters), not bytes. A single emoji might be 4 bytes but counts as 1 character.
  • mid() and left() operate on character boundaries, never splitting a multi-byte sequence.
  • Pattern matching with * and ? wildcards matches characters, not bytes.
  • Regular expressions via PCRE support Unicode character classes like \p{L} (any letter).
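
To make the byte/character distinction concrete, the following Python sketch compares three ways of measuring the same string. It is an analogy, not softcode and not TinyMUX's algorithm: the helpers clusters and mid_sketch are hypothetical, and the clustering rule (attach nonspacing and enclosing marks to the preceding character) is far simpler than full grapheme cluster breaking, which also handles ZWJ sequences, flags, and Hangul.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch  # mark joins the previous cluster
        else:
            out.append(ch)  # start a new cluster
    return out


def mid_sketch(text, start, length):
    """Extract length clusters beginning at cluster index start."""
    return "".join(clusters(text)[start:start + length])


sample = "cafe\u0301 \U0001F44D"         # "café" (decomposed), space, thumbs up

byte_count = len(sample.encode("utf-8"))  # what a byte-oriented server sees
code_point_count = len(sample)            # what a naive decoder sees
cluster_count = len(clusters(sample))     # what the user perceives
```

Here the same text measures 11 bytes, 7 code points, and 6 user-perceived characters, and mid_sketch(sample, 3, 1) returns the "e" together with its accent rather than splitting them.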

Building from Unicode Data

The build pipeline in utf/ works as follows:

  1. Perl scripts (gen_*.pl) parse Unicode Consortium data files into intermediate mapping files.
  2. C++ code generators (classify, integers, strings, pairs, buildFiles) consume the mappings and produce optimized state machine tables.
  3. The output (utf8tables.cpp.txt, utf8tables.h.txt) is compiled into the server binary.

Updating to a new Unicode version means dropping in the new data files and regenerating. The entire table generation is deterministic and reproducible.
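
The first stage of such a pipeline, turning a Unicode data file into a mapping, can be sketched as follows. The actual scripts are Perl and their internals are not shown here, so this Python version is only an analogy: the function name parse_unicode_data is hypothetical, and the sketch ignores details of the real file such as the First/Last range entries. The field layout (code point in field 0, general category in field 2, simple lowercase mapping in field 13) follows the published UnicodeData.txt format, and the sample lines are copied from that file.

```python
SAMPLE = """\
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_unicode_data(lines):
    """Return (category, lowercase) dicts keyed by integer code point."""
    category = {}
    lowercase = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        category[code_point] = fields[2]   # general category, e.g. "Lu"
        if fields[13]:                     # simple lowercase mapping, if any
            lowercase[code_point] = int(fields[13], 16)
    return category, lowercase


category, lowercase = parse_unicode_data(SAMPLE.splitlines())
```

Mappings of this kind (code point to category, code point to lowercase form) are the sort of intermediate data that a table generator would then compress into state machine tables.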

Related Topics: Unicode, ACCENT.