TECkit
a Text Encoding Conversion toolkit
About
TECkit is a low-level toolkit intended to be used by other applications that need to perform encoding conversions (e.g., when importing legacy data into a Unicode-based application). The primary component of the TECkit package is therefore a library that performs conversions; this is the “TECkit engine”. The engine relies on mapping tables in a specific binary format (for which documentation is available); there is a compiler that creates such tables from a human-readable mapping description (a simple text file).
Documentation
Beyond UTR22: complex legacy-to-Unicode mappings | for all platforms |
TECkit Tools: A Text Encoding Conversion toolkit | for all platforms |
The TECkit Language: Mapping byte encodings to Unicode | for all platforms |
Further documentation is included in the Windows release archive.
Downloads
Changes in this release:
- Updated Unicode character names and normalization data to 14.0.0
- Updated documentation
Windows Release
Tools, libraries, documentation, and samples are included in the .zip archive. Command line tools are teckit_compile.exe, txtconv.exe, and sfconv.exe.
TECkit for Windows 2.5.11 | for windows |
Ubuntu Linux Release
Ubuntu includes TECkit. More recent releases of TECkit might be available for Ubuntu releases from http://packages.sil.org/. The PDF documentation is not in the Linux package; Linux users should obtain the PDF files from the Windows release.
macOS Release
The disk image for macOS contains the TECkit libraries, the Unix command-line tools (teckit_compile, txtconv, sfconv), and documentation.
TECkit for macOS 2.5.11 | for macos |
Source
The TECkit package is copyright ©2002-2021 SIL International. It is being made available as free software but without any warranty; see the license for more information.
Source code and additional downloads
The TECkit source code is available from GitHub at https://github.com/silnrsi/teckit.
Additional downloads for technical users are at https://github.com/silnrsi/teckit/releases. The code is expected to compile and run on typical Unix/Linux systems using standard commands:
./configure
make
make install
Previous Versions
Windows Release
TECkit for Windows 2.5.10 | for windows |
TECkit for Windows 2.5.9 | for windows |
macOS Release
TECkit for macOS 2.5.10 | for macos |
TECkit for macOS 2.5.9 | for macos |
Support
More information is available through the TECkit GitHub repository, which also hosts an issue tracker.
Frequently Asked Questions
For mapping authors
Why do all my spaces (or line ends) get mangled?
When mapping between bytes and Unicode, every character code that you are interested in needs to be mapped appropriately by the table. If you map only the visible characters, or worse still, only those where your legacy encoding differed from standard ASCII, everything else will be mapped to the default replacement character, typically U+FFFD. This applies to characters such as space, tab, carriage-return and line-feed just as to printable characters.
Note that byte-only or Unicode-only mappings (or passes within a multi-pass mapping) work differently: they pass unmapped characters through unchanged. But in byte-Unicode mappings, which are the major focus of TECkit, anything that is not explicitly mapped will be replaced by the default code.
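The effect described above can be illustrated with a toy lookup table. This is only a sketch in C, not TECkit's table format or API: the `ToyTable` type and the `toy_table_*` functions are hypothetical names invented for the illustration. Every byte starts out unmapped (U+FFFD), and only the ranges the author maps explicitly produce real characters.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu /* typical default for unmapped bytes */

/* Hypothetical toy table: one Unicode code point per byte value. */
typedef struct {
    uint32_t to_unicode[256];
} ToyTable;

/* Every byte starts unmapped, so a lookup yields the replacement character. */
void toy_table_init(ToyTable *t) {
    for (size_t i = 0; i < 256; i++) {
        t->to_unicode[i] = REPLACEMENT_CHAR;
    }
}

/* Map the inclusive byte range first..last to consecutive code points
   starting at first_cp. */
void toy_table_map_range(ToyTable *t, uint8_t first, uint8_t last,
                         uint32_t first_cp) {
    for (unsigned b = first; b <= last; b++) {
        t->to_unicode[b] = first_cp + (b - first);
    }
}

/* Look up the code point for one byte. */
uint32_t toy_table_lookup(const ToyTable *t, uint8_t byte) {
    return t->to_unicode[byte];
}
```

If only the visible range 0x21..0x7E is mapped, looking up a space (0x20) or a line feed (0x0A) yields U+FFFD. Mapping 0x00..0x20 as well restores them.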
The compiler reports “code space mismatch”; what does that mean?
This (rather cryptic) error message means that the mapping description includes multiple passes, and the output of one pass is not in the same “code space” (either bytes or Unicode) as the input of the next pass.
The compiler reports this error when it reaches the end of the second of the incompatible passes (which may be the very end of the file); the actual problem lies at the beginning of the pass, where it is chained with the preceding one.
One subtle way this error can arise is if you intend to have a single pass in the mapping, and use an explicit pass(Byte_Unicode) statement, but accidentally place some part of the mapping content before the pass statement. Any Class definitions or mapping rules found before any pass statement will implicitly begin a Byte/Unicode mapping pass. (This is a legacy of the original, single-pass TECkit system.) When your explicit pass(Byte_Unicode) statement is read, this begins a second pass, and you can’t chain two Byte/Unicode passes: the Unicode output of the first can’t become Byte input to the second.
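The chaining rule can be modelled in a few lines of C. This is a simplified sketch of the rule as described here, not the compiler's actual logic; `CodeSpace`, `ToyPass`, and `first_mismatched_pass` are hypothetical names. Each pass has an input and an output code space, and the chain is valid only if each pass's output matches the next pass's input.

```c
#include <stddef.h>

/* The two code spaces a pass can read from or write to. */
typedef enum { SPACE_BYTE, SPACE_UNICODE } CodeSpace;

/* Hypothetical model of one pass in a mapping description. */
typedef struct {
    CodeSpace input;
    CodeSpace output;
} ToyPass;

/* Returns the index of the first pass whose input code space differs from
   the previous pass's output, or -1 if the whole chain is consistent. */
int first_mismatched_pass(const ToyPass *passes, size_t count) {
    for (size_t i = 1; i < count; i++) {
        if (passes[i - 1].output != passes[i].input) {
            return (int)i;
        }
    }
    return -1;
}
```

In this model, the accidental case is a chain of two Byte/Unicode passes: the implicit one started by the stray content, followed by the explicit one. The check flags the second pass, because Unicode output cannot feed Byte input.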
For application developers
How big of an output buffer should I use when calling Convert or Flush?
In general, you can’t be sure; mappings are not necessarily one-to-one. Unless the input is ridiculously large, it’s probably best to allocate a buffer that would allow for a 50% or even 100% increase in the number of character codes; however, you must still be prepared for the possibility that the engine will return kStatus_OutputBufferFull. If this happens, either enlarge your buffer or clear out the output that has been generated so far (write it out, send it to the next process, or whatever is appropriate) so that you can restart at the beginning of your buffer.
If you can’t afford such a generous buffer, you can use a smaller one and expect to do more looping. But your buffer must be at least big enough for the engine to perform a complete unit of conversion work, and this may result in a sequence of characters being output, not just a single code.
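The looping strategy might look roughly like the following C sketch. It does not call the TECkit engine: `toy_convert` is a hypothetical stand-in that doubles every input byte (so the output is twice the size of the input), and `ToyStatus`, `TOY_OK`, `TOY_OUTPUT_FULL`, and `convert_all` are likewise invented for the illustration. The point is the shape of the loop: after every call, drain whatever was produced, advance the input, and call again until the converter stops reporting a full buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_OK = 0, TOY_OUTPUT_FULL = 1 } ToyStatus;

/* Hypothetical stand-in for a converter: writes every input byte twice.
   It never writes half a unit; if fewer than 2 bytes remain in the output
   buffer it stops and reports TOY_OUTPUT_FULL. */
ToyStatus toy_convert(const char *in, size_t in_len, size_t *in_used,
                      char *out, size_t out_cap, size_t *out_used) {
    size_t i = 0, o = 0;
    while (i < in_len) {
        if (out_cap - o < 2) {
            *in_used = i;
            *out_used = o;
            return TOY_OUTPUT_FULL;
        }
        out[o++] = in[i];
        out[o++] = in[i];
        i++;
    }
    *in_used = i;
    *out_used = o;
    return TOY_OK;
}

/* Converts all input through a deliberately small work buffer, draining it
   into dest after every call. Returns the number of bytes written to dest. */
size_t convert_all(const char *in, size_t in_len, char *dest, size_t dest_cap) {
    char work[4]; /* must hold at least one complete unit (2 bytes here) */
    size_t in_pos = 0, dest_len = 0;
    for (;;) {
        size_t in_used = 0, out_used = 0;
        ToyStatus st = toy_convert(in + in_pos, in_len - in_pos, &in_used,
                                   work, sizeof work, &out_used);
        assert(dest_len + out_used <= dest_cap);
        memcpy(dest + dest_len, work, out_used); /* clear out the output */
        dest_len += out_used;
        in_pos += in_used;
        if (st == TOY_OK) {
            break;
        }
        /* TOY_OUTPUT_FULL: work is drained, so restart at its beginning. */
    }
    return dest_len;
}
```

A work buffer smaller than one complete unit of output would make this loop spin without progress, which is why the buffer must always be large enough for at least one unit of conversion work.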
Why do I get kStatus_OutputBufferFull when the buffer isn’t full?
When you call Convert or Flush, the TECkit engine does not necessarily use all the space in your output buffer. It may return kStatus_OutputBufferFull even though there is some space remaining.
There are two reasons for this. First, the engine never puts a partial Unicode character into the output buffer. A single Unicode character may require up to 4 bytes, depending on the encoding form and the particular character, so if fewer than 4 bytes are available, the engine may report that the buffer is full because the next character it wants to write won’t fit in its entirety.
Second (and this applies even when mapping to bytes), the engine does not like to return with an input code partially processed. And processing a single input code may result in multiple characters of output, either because the input code itself maps to a sequence or because it provides the context needed to determine the mapping of preceding codes that have been buffered by the engine because their mappings depended on following context.
So the engine may report kStatus_OutputBufferFull even when a considerable number of bytes remain unused. In extreme cases, unlikely in real-life mappings, this could be several hundred bytes, but cases where a dozen or more bytes are needed in the output buffer to process a single input code definitely occur. This status code always means that you need to create more output buffer space, either by enlargement or by clearing previous output, even if your buffer was not completely full.
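The first reason can be seen in a small, self-contained C sketch of a UTF-8 writer that only ever writes whole characters. It is an illustration of the principle, not the engine's code; `utf8_len` and `write_utf8_whole` are hypothetical helpers, and the input is assumed to consist of valid Unicode scalar values.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed for a code point (assumes a valid scalar). */
size_t utf8_len(uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Writes code points as UTF-8, whole characters only. Stops as soon as the
   next character does not fit, even if some bytes of the buffer are still
   free. Returns the number of code points consumed; *out_used receives the
   number of bytes written. */
size_t write_utf8_whole(const uint32_t *cps, size_t count,
                        unsigned char *out, size_t out_cap, size_t *out_used) {
    size_t o = 0, i = 0;
    for (; i < count; i++) {
        uint32_t cp = cps[i];
        size_t n = utf8_len(cp);
        if (out_cap - o < n) {
            break; /* "buffer full", although out_cap - o bytes remain */
        }
        if (n == 1) {
            out[o++] = (unsigned char)cp;
        } else if (n == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (n == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *out_used = o;
    return i;
}
```

With a 6-byte buffer and the input U+0041, U+20AC, U+1F600, the writer stops after 4 bytes: the last character needs 4 bytes and only 2 are left, so those 2 bytes stay unused.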
How can I detect if there were characters TECkit couldn’t map?
By default, the TECkit engine maps all input characters to something in the output; characters for which no explicit mapping was given in the table will result in the “default replacement character”. (This is 0x3F ASCII ‘?’ by default when mapping to bytes, and U+FFFD REPLACEMENT CHARACTER by default when mapping to Unicode, but the mapping table author can change these values.)
Beginning with TECkit version 2.1, released 29 March 2004, the engine has new conversion APIs (the TECkit_ConvertBufferOpt and TECkit_FlushOpt functions; see the TECkit_Engine.h header file). These allow the client application to control the behavior when unmappable input is encountered. The options are:
- Silently use the replacement character, as in previous versions of the engine.
- Use the replacement character, but return a warning status to the calling application.
- Stop converting and return an error code to the calling application.
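The three behaviors can be sketched with a toy converter in C. This does not use TECkit_ConvertBufferOpt or TECkit_FlushOpt, whose actual option and status constants are defined in TECkit_Engine.h. The names `ToyUnmappedPolicy`, `ToyResult`, and `toy_convert_opt` are hypothetical, and the "mapping" (uppercase ASCII letters to lowercase, everything else unmapped) exists only to make the example runnable.

```c
#include <stddef.h>

/* Hypothetical policies mirroring the three options listed above. */
typedef enum { UNMAPPED_SILENT, UNMAPPED_WARN, UNMAPPED_STOP } ToyUnmappedPolicy;

typedef enum { TOY_DONE, TOY_DONE_WITH_WARNING, TOY_UNMAPPED_ERROR } ToyResult;

/* Toy mapping: 'A'..'Z' map to 'a'..'z'; every other byte is unmapped.
   The caller must provide at least in_len bytes of output space.
   *out_len receives the number of bytes written before returning. */
ToyResult toy_convert_opt(const char *in, size_t in_len,
                          char *out, size_t *out_len,
                          ToyUnmappedPolicy policy) {
    ToyResult result = TOY_DONE;
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        char c = in[i];
        if (c >= 'A' && c <= 'Z') {
            out[o++] = (char)(c - 'A' + 'a');
        } else if (policy == UNMAPPED_STOP) {
            *out_len = o; /* stop converting at the unmappable input */
            return TOY_UNMAPPED_ERROR;
        } else {
            out[o++] = '?'; /* 0x3F, the default replacement for bytes */
            if (policy == UNMAPPED_WARN) {
                result = TOY_DONE_WITH_WARNING;
            }
        }
    }
    *out_len = o;
    return result;
}
```

For the input "AB1C", the silent and warning policies both produce "ab?c" and differ only in the returned status, while the stop policy returns an error after writing "ab".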
Contact
If you would like to report a problem, you can create an issue in TECkit’s issue tracker, or send an email via the contact form below.