How does AMPS support UTF-8 and Unicode?

AMPS supports the most common uses of Unicode when represented as UTF-8 in messages and topic names. There are some considerations to keep in mind when using UTF-8 with AMPS:

Server Support

  • The AMPS server correctly handles and preserves the UTF-8 representations of strings within messages.

  • The AMPS server does not normalize UTF-8 coming into the server. 60East recommends that all client applications use a consistent normalization form. (Many Unicode handling libraries transparently normalize UTF-8 to normalization form C, so many applications follow this recommendation without any special code.)

  • Topic names may contain UTF-8.

  • Message contents may contain UTF-8, subject to the message type support:

    • JSON, XML, and BSON support UTF-8 if properly escaped for the message type (for example, correctly escaping < and & in XML).

    • FIX and NVFIX support UTF-8 provided that the UTF-8 strings do not contain any of the characters used to delimit the message. AMPS provides support for customizing the header delimiter and field delimiter, but expects the field name and contents to be separated by the '=' character (U+003D, ASCII 61).

  • PCRE support is not Unicode-aware at this time. In practice, byte-by-byte comparison works well for message filtering. Features that explicitly require Unicode support (for example, identifying character classes for non-ASCII characters) are not enabled in this release.

  • The AMPS server is not collation-aware. AMPS uses binary comparisons for string comparisons, for functions such as MIN, MAX, and BETWEEN, and for GroupBy and OrderBy.
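
The normalization and binary-comparison points above can be illustrated with a small client-side sketch. It uses only the Python standard library and does not call any AMPS API; the helper name to_nfc_bytes and the sample strings are hypothetical, chosen for illustration. It shows why two visually identical strings can fail a byte-wise match unless the publisher normalizes first, and how byte order differs from what a collation-aware sort might produce.

```python
import unicodedata


def to_nfc_bytes(text: str) -> bytes:
    """Normalize text to form C (NFC), then encode it as UTF-8.

    Hypothetical helper: an application could apply this to strings
    before placing them in a message or topic name.
    """
    return unicodedata.normalize("NFC", text).encode("utf-8")


composed = "caf\u00e9"     # 'é' as one code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

# Without normalization, the UTF-8 byte sequences differ, so a
# byte-by-byte comparison treats the two strings as different values.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalizing both to NFC, the byte sequences are identical.
nfc_equal = to_nfc_bytes(composed) == to_nfc_bytes(decomposed)

# Binary ordering compares encoded bytes: uppercase ASCII sorts before
# lowercase ASCII, and non-ASCII characters sort after all ASCII.
words = ["zebra", "Zebra", "apple", "\u00e9clair"]
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))

print(raw_equal, nfc_equal)
print(byte_order)
```

An application that needs locale-specific ordering would have to sort results itself after receiving them, since the ordering shown here is purely by byte value.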

Client Support (4.0 and later clients):

  • For AMPS clients for languages that represent strings as Unicode (C#, Java, Python), the clients serialize headers as UTF-8 in general. For AMPS clients that represent strings as arrays of bytes (C, C++), the application is responsible for serializing headers as UTF-8. (One exception to this is that, for the NVFIX & FIX protocols, some clients serialize headers as Latin-1 by default to avoid conflicts with delimiter characters: an application can choose to override this behavior.)

  • The AMPS clients do not parse or interpret message data. The convenience classes for handling FIX and NVFIX should correctly handle UTF-8 data provided that the UTF-8 strings do not contain any of the characters used to delimit the message.
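
As a rough illustration of the delimiter constraint for FIX and NVFIX, the sketch below checks a value for delimiter bytes before it is placed in a message. It does not use the AMPS client API. The function name delimiter_conflicts is hypothetical, and the field delimiter is assumed to be SOH (0x01), the conventional FIX delimiter; substitute whatever delimiter your deployment configures. The '=' check is deliberately conservative. Because UTF-8 encodes every non-ASCII character using only bytes of 0x80 and above, a multi-byte character can never contain an ASCII delimiter byte, so only literal delimiter characters in the value cause a conflict.

```python
from typing import List, Sequence

FIELD_DELIMITER = b"\x01"   # assumption: SOH, the conventional FIX field delimiter
KEY_VALUE_SEPARATOR = b"="  # U+003D, the name/contents separator described above


def delimiter_conflicts(
    value: str,
    delimiters: Sequence[bytes] = (FIELD_DELIMITER, KEY_VALUE_SEPARATOR),
) -> List[bytes]:
    """Return the delimiters that appear in the UTF-8 encoding of value.

    Hypothetical helper: an empty list means the value contains none of
    the given delimiter bytes.
    """
    encoded = value.encode("utf-8")
    return [d for d in delimiters if d in encoded]


# Non-ASCII text alone does not collide with ASCII delimiters.
safe = delimiter_conflicts("Z\u00fcrich \u6771\u4eac")

# A literal '=' or SOH in the value is flagged.
has_equals = delimiter_conflicts("a=b")
has_soh = delimiter_conflicts("x\x01y")

# The same character serializes differently as Latin-1 and as UTF-8,
# which matters when a client defaults to Latin-1 for headers.
latin1_bytes = "\u00e9".encode("latin-1")
utf8_bytes = "\u00e9".encode("utf-8")

print(safe, has_equals, has_soh, latin1_bytes, utf8_bytes)
```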

Keywords: UTF-8, unicode, character set
