Introduction

The SIL Converters software provides a framework for converting your texts from Roman script to Arabic script. This document is not a manual for SIL Converters or the TECkit mapping language; it highlights some principles to consider when converting text from Roman script to Arabic script. The SIL Converters package can be installed on a computer running Windows and can be downloaded here: SIL Converters.

Note: Cross-script conversion is often very language specific. The mapping file and discussion provided here are purely to help you think through the issues involved in converting from Roman to Arabic script. The language you are working with may have consonants not used in this mapping file, and there are likely consonants used here that you will not use. The Unicode code charts will help you work out these details. Contextual rules will also vary greatly from language to language.

Mapping

The SIL Converters can do various kinds of conversions, but in this document we are looking only at a conversion based on SIL’s TECkit mapping files. In order to do the conversion, you will need to obtain a mapping file that tells the converter software how each letter or group of letters in the Roman script source text will be converted to Arabic script. You may be able to use an existing mapping file, or you may need to create your own. The conversion is always language specific. Therefore, even if you are able to find a mapping file designed for a related language, you may still need to modify it to make it work better in your situation.

To edit or create a mapping file, you will use the TECkit Mapping Editor, which is part of the SIL Converters package. Because the editor is not installed by default, make sure when running the installer that you include the TECkit Map Unicode Editor and the TECkit documentation in your installation.

One important feature of the TECkit mapping files is that the conversion is arranged in logical “passes” where the output of the first pass is the input for the second pass and so on. In the simplest form the conversion from Roman script to Arabic script would consist of only two passes:

  • Because the Arabic script does not differentiate between Lowercase (LC) and Uppercase (UC) letters, all uppercase letters in the Roman script source text need to be converted to lowercase.

  • Each Roman letter is converted to its equivalent Arabic form.
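
The idea of chained passes can be tried outside TECkit with a small Python sketch. It is not TECkit code; the three-letter table and the function names are hypothetical, chosen only to show that each pass receives the output of the previous one and that the order of passes matters.

```python
# Hypothetical three-letter table; a real mapping is language specific.
LETTERS = {"b": "\u0628", "p": "\u067E", "t": "\u062A"}


def pass_lowercase(text):
    """Pass 1: fold uppercase Roman letters to lowercase."""
    return text.lower()


def pass_letters(text):
    """Pass 2: replace each known lowercase Roman letter with its Arabic form."""
    return "".join(LETTERS.get(ch, ch) for ch in text)


def convert(text, passes=(pass_lowercase, pass_letters)):
    """Run the passes in order; each pass sees the previous pass's output."""
    for one_pass in passes:
        text = one_pass(text)
    return text
```

With the passes in the order shown, `convert("BT")` yields the two Arabic letters. If the order is swapped, the letter pass sees only uppercase input and changes nothing.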

However, most conversions would need a bit more sophistication. Here is an example of what a conversion mapping for one language may look like.

Left-Hand-Side (LHS) is Roman Script

Remember that the passes in this section are dealing with Roman Script characters on the Left-Hand-Side.

Pass for Proper names

Many proper names already have a fixed spelling in the target script, and the converter will not produce this form automatically. If you want the converter to handle proper names correctly, you will have to create a lexicon of common proper names for this pass. Because proper names are written with an initial capital letter in Roman script, it is a good idea to place this pass before the UC to LC conversion. Example:

'Simon' > 'شمعون'
'Ylyas' > 'الیاس'
etc.

Pass for Uppercase to Lowercase Conversion

In this pass you simply create two classes where you list all the uppercase letters and their corresponding lowercase letters.

class[UpperCase] = ( U+0041..U+005A U+014A )
class[LowerCase] = ( U+0061..U+007A U+014B )

and a single line of code will convert all UpperCase letters to their LowerCase equivalents.

[UpperCase] > [LowerCase]
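
A class-to-class rule pairs members by position, so the two classes must list their members in the same order and have the same length. The following Python sketch (not TECkit code; the names `UPPER`, `LOWER` and `to_lower` are made up for illustration) builds the same two lists from the code points above and checks that they line up.

```python
# Code points copied from class[UpperCase] and class[LowerCase] above.
UPPER = [chr(c) for c in range(0x0041, 0x005A + 1)] + ["\u014A"]
LOWER = [chr(c) for c in range(0x0061, 0x007A + 1)] + ["\u014B"]

# Members are paired by position, so the lengths have to agree.
assert len(UPPER) == len(LOWER)

TABLE = str.maketrans(dict(zip(UPPER, LOWER)))


def to_lower(text):
    """Replace each member of UPPER with the member of LOWER at the same index."""
    return text.translate(TABLE)
```

If your language uses further uppercase letters, add them and their lowercase partners at matching positions in both lists.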

Pass for Loan Words

Just like the proper names, loan words, particularly if they are of Arabic origin, often retain their original spellings. If that is the case and you want the converter to handle these loan words, you have to create a lexicon of the common ones in this pass. Example:

'ebarad' > 'عئبارت'
'heqikat' > 'حاقئقت'

In the examples above (and with proper names), the listed characters would also be matched when they are embedded in another word. Often, however, a conversion rule should apply only in a certain context, so it is usually better to write the rule with its context indicated. For example:

'ader' / (#|[WordBreak]) _ > 'عطئر'
'hada' / (#|[WordBreak]) _ (#|[WordBreak]) > 'ختیٰ'
'reh' / (#|[WordBreak]) _ ^^[a] > 'روح'

In the first example above, the letter combination ader is matched only in word initial position. In the second example hada is matched only when it is a whole word. Finally, reh is matched only in word initial position and when not followed by the letter a.
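
To experiment with these three contexts outside TECkit, the sketch below approximates them with Python regular expressions. It is only an analogy: the word-break set is a hypothetical, much reduced one, the rules are applied one after another rather than in a single TECkit pass, and the treatment of "not followed by a" at the very end of the text is an assumption of this sketch.

```python
import re

# Hypothetical reduced word-break set: whitespace and a few punctuation marks.
NOT_WB = r"[^\s.,!?]"

START = "(?<!" + NOT_WB + ")"  # text start, or just after a word-break character
END = "(?!" + NOT_WB + ")"     # text end, or just before a word-break character

RULES = [
    (START + "ader", "عطئر"),        # word initial only
    (START + "hada" + END, "ختیٰ"),   # whole word only
    (START + "reh(?!a)", "روح"),     # word initial, not followed by "a"
]


def convert(text):
    """Apply each contextual rule in turn (a sketch, not TECkit's algorithm)."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

For instance, `convert("bader")` leaves the text unchanged because ader is not word initial, and `convert("reha")` is unchanged because reh is followed by a.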

In the above rules the forward slash means “in context of”. # is a special marker that matches the file initial, file final, paragraph initial and paragraph final positions. The vertical bar means “or”, the underscore represents the search string and the caret means “not”. [WordBreak] and [a] are classes that you define. Here class[a] simply contains the letter a. The [WordBreak] class is a collection of space and punctuation characters that mark word boundaries. It is a very useful class, and to save you from creating one from scratch, here is a sample one:

class[WordBreak] = (U+2028 U+2029 U+0002 U+0003 U+0009 U+000A U+000C \
U+000D U+0020 U+0021 U+0022 U+0023 U+0024 U+0025 U+0027 U+0028 U+0029 \
U+002C U+002D U+002E U+002F U+005B U+005C U+005D U+007B U+007D U+200C \
U+200E U+200F U+2018 U+2019 U+201A U+201B U+201C U+201D U+201E U+201F \
U+003A U+00AB U+00BB U+002A U+061B U+060C U+061F)

In the mapping file this class definition either needs to be on a single line, or you can use a backslash to continue it on the next line. The problem with defining it inside a pass is that if you need the class in another pass, you have to repeat the whole definition in that pass too. It is much safer to put the following in the header section:

Define _WordBreak U+2028 U+2029 U+0002 U+0003 U+0009 U+000A U+000C U+000D U+0020 \
U+0021 U+0022 U+0023 U+0024 U+0025 U+0027 U+0028 U+0029 U+002C U+002D U+002E \
U+002F U+005B U+005C U+005D U+007B U+007D U+200C U+200E U+200F U+2018 U+2019 U+201A \
U+201B U+201C U+201D U+201E U+201F U+003A U+00AB U+00BB U+002A U+061B U+060C U+061F

Then in the pass you would have one line:

UniClass [WordBreak] = ( _WordBreak )

This way if you realize something must be added or deleted from the WordBreak you only need to modify it in one place.
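
Before adding or deleting members, it helps to see what the list actually contains. The Python sketch below (not part of the mapping file; `WORD_BREAK` and `describe` are illustrative names) copies the code points from the definition above and looks up their Unicode names with the standard `unicodedata` module. Control characters have no name in that module, so a placeholder is shown for them.

```python
import unicodedata

# Code points copied from the _WordBreak definition above.
WORD_BREAK = [
    0x2028, 0x2029, 0x0002, 0x0003, 0x0009, 0x000A, 0x000C, 0x000D, 0x0020,
    0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0027, 0x0028, 0x0029, 0x002C,
    0x002D, 0x002E, 0x002F, 0x005B, 0x005C, 0x005D, 0x007B, 0x007D, 0x200C,
    0x200E, 0x200F, 0x2018, 0x2019, 0x201A, 0x201B, 0x201C, 0x201D, 0x201E,
    0x201F, 0x003A, 0x00AB, 0x00BB, 0x002A, 0x061B, 0x060C, 0x061F,
]


def describe(codepoint):
    """Return 'U+XXXX NAME' for one code point."""
    name = unicodedata.name(chr(codepoint), "<unnamed control character>")
    return "U+%04X %s" % (codepoint, name)


if __name__ == "__main__":
    for cp in WORD_BREAK:
        print(describe(cp))
```

A membership test such as `0x003B in WORD_BREAK` shows, for example, that the Roman semicolon (U+003B) and question mark (U+003F) are not in the sample list, although their Arabic counterparts are. Whether they should be added depends on your data.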

Pass for Main Conversion

This is the most important and — depending on the language — the most complex pass of the mapping file.

Diphthongs

Start by dealing with diphthongs. In this example they are “ay, ey, oy, uy, aw, ew, iw, ow”.

Example code might be:

'ay' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+064E U+064A U+0652 ; word initial
'ay' / _ ^^[vowel] > U+064E U+064A U+0652
'ey' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+065A U+064A U+0652
'ey' / _ ^^[vowel] > U+065A U+064A U+0652
;...

Vowels

Next, still in the same pass, you have to create rules for the vowels. Typically in Arabic script the same vowel is written differently in word initial, medial and final positions. So you have to cover these positions with context-sensitive rules. Take care of long (doubled) vowels first. Example:

'aa' / (#|[WordBreak]) _ > U+0622 ;word initial
'aa' / _ (#|[WordBreak]) > U+064E U+0627 ;word final
'aa' > U+064E U+0627 ;elsewhere
'a' / (#|[WordBreak]) _ > U+0623 U+064E ;word initial
'a' / _ (#|[WordBreak]) > U+064E U+0627 ;word final
'a' > U+064E ;elsewhere
;...

Depending on the language, you may not need the word final rule. Just put a semi-colon in front of the rule if you don’t need it.
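
The way position-dependent vowel rules interact can be explored with a small Python sketch. It is not TECkit and makes several assumptions: the word-break set is a reduced, hypothetical one; rules are tried in the order listed, long vowel first; and medial aa is written with Fatha plus Alef, as in the complete mapping file later in this document.

```python
WB = set(" \t\n.,!?")  # hypothetical reduced word-break set


def at_start(text, i):
    """True if position i is at the text start or follows a word break."""
    return i == 0 or text[i - 1] in WB


def at_end(text, j):
    """True if position j is at the text end or precedes a word break."""
    return j == len(text) or text[j] in WB


# (match, position, output), listed in priority order.
RULES = [
    ("aa", "initial", "\u0622"),
    ("aa", "any", "\u064E\u0627"),
    ("a", "initial", "\u0623\u064E"),
    ("a", "final", "\u064E\u0627"),
    ("a", "any", "\u064E"),
]


def convert(text):
    """Scan left to right, applying the first rule that fits at each position."""
    out, i = [], 0
    while i < len(text):
        for match, position, result in RULES:
            j = i + len(match)
            if not text.startswith(match, i):
                continue
            if position == "initial" and not at_start(text, i):
                continue
            if position == "final" and not at_end(text, j):
                continue
            out.append(result)
            i = j
            break
        else:
            out.append(text[i])  # no rule applies: copy the character unchanged
            i += 1
    return "".join(out)
```

Deleting the `"final"` entry from `RULES` corresponds to commenting out the word final rule in the mapping file.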

Consonants

Start by creating class[ConsonantLatin] and class[ConsonantArabic], and list in them, in the same order, all the consonants that follow a strict unconditional one-to-one conversion. With the classes defined, the single rule at the end of this example

class[ConsonantLatin] = ('b' 'p' 't')
class[ConsonantArabic] = (U+0628 U+067E U+062A)

[ConsonantLatin] > [ConsonantArabic]

will convert all the listed consonants. That may be all you need to do for the consonants.

However, the conversion may not always be so simple. Arabic script has more consonants than Roman script and the target script may use some of those extra consonants. In this case you need to figure out if there is some conditioning that will determine which consonants are written.

You may need to include some simple rules such as:

'nj' > U+0683
'nd' > U+068A
'dr' > U+0694
'tr' > U+0697
'ndr' > U+0698
'ch' > U+0634
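
Because nd is a prefix of ndr, it matters which of the two is matched when both could apply. The Python sketch below is not TECkit; it simply tries longer Roman strings first, which is an assumption of the sketch, so check the TECkit documentation for how the compiler orders rules of different lengths. The single-letter values for d, n, r and t are taken from the complete mapping file later in this document.

```python
import re

# Multi-letter rules from the example above plus a few single consonants.
RULES = {
    "nj": "\u0683",
    "nd": "\u068A",
    "dr": "\u0694",
    "tr": "\u0697",
    "ndr": "\u0698",
    "ch": "\u0634",
    "d": "\u062F",
    "n": "\u0646",
    "r": "\u0631",
    "t": "\u062A",
}

# Longest alternatives first, so "ndr" is tried before "nd" and "n".
PATTERN = re.compile(
    "|".join(sorted((re.escape(key) for key in RULES), key=len, reverse=True))
)


def convert(text):
    """Replace each matched Roman string with its Arabic counterpart."""
    return PATTERN.sub(lambda m: RULES[m.group()], text)
```

Here `convert("ndr")` produces the single letter U+0698 rather than U+068A followed by U+0631.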

In another language, the presence of certain vowels changes certain consonants. Context-based rules can be written to achieve a more accurate conversion for many consonants. Here are the rules to convert the letter g (a semicolon indicates that the text that follows it is a comment and is thus ignored by the converter).

'g' / (#|[WordBreak]) _ [BackV] > 'ق' ;word initially
;followed by [BackV]

'g' / [BackV] _ (#|[WordBreak]) > 'ق' ;word finally
;preceded by [BackV]

'g' / [BackV][AllC] _ (#|[WordBreak]) > 'ق' ;word finally preceded by
;any cons and [BackV]

'g' / [BackV] _ > 'غ' ;medially preceded by [BackV]
'g' / _ [BackV] > 'غ' ;medially followed by [BackV]

'g' / [FrontV] _ [BackV] > 'گ' ;medially preceded by [FrontV]
;and followed by [BackV]

'g' > 'گ' ;elsewhere
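
Suffixes

In some African languages, certain suffixes are joined to the word with a hyphen and need special handling. In the example below the hyphen disappears before a word final -ka or -nu and becomes a space everywhere else. The same pattern may suit other exceptions.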

'-ka' / _ (#|[WordBreak]) > U+0643 U+064E
'-nu' / _ (#|[WordBreak]) > U+0646 U+064F
'-' > ' '

Digits

In the simplest of cases, you could just list all numbers in two classes: class[RomanDigit] and class[ArabicDigit]. Code:

class[RomanDigit] = (U+0030..U+0039 '-' ':')
class[ArabicDigit] = (U+0660..U+0669 '-' ':')

[RomanDigit] > [ArabicDigit]

Of course, it is not always that simple. Digits can display in different orders, depending on the punctuation around them.

Consider whether you want:

Option | LTR Rendering | Codepoints | RTL Rendering | Comment
a | 12:34-56 | U+0031, U+0032, U+200F, U+003A, U+0033, U+0034, U+200F, U+002D, U+0035, U+0036 | 56-34:12 | RLM before colon and before hyphen
b | 12:34-56 | U+0031, U+0032, U+200F, U+003A, U+0033, U+0034, U+200E, U+002D, U+0035, U+0036 | 34-56:12 | RLM before colon and LRM before hyphen

For option a, the RLM (U+200F RIGHT-TO-LEFT MARK) is inserted before the colon and before the hyphen. For option b, the RLM is inserted before the colon and the LRM (U+200E LEFT-TO-RIGHT MARK) is inserted before the hyphen. Option b is rather unusual, but it is used in some regions.

This code would give option a:

; Numerals
class[RomanDigit] = (U+0030..U+0039)
class[ArabicDigit] = (U+0660..U+0669)

U+003A / [RomanDigit] _ [RomanDigit] > U+200F U+003A ; add RLM before colon between digits (normal behavior)
U+002D / [RomanDigit] _ [RomanDigit] > U+200F U+002D ; add RLM before hyphen between digits (normal behavior)
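
To check which code points such rules produce, option a can be imitated in Python. This is a sketch rather than TECkit code; `mark_digit_separators` and `codepoints` are illustrative names, and the regular expression only approximates the two rules above by inserting RLM before a colon or hyphen that stands between two Roman digits.

```python
import re

RLM = "\u200F"  # RIGHT-TO-LEFT MARK


def mark_digit_separators(text):
    """Insert RLM before ':' or '-' when it stands between two Roman digits."""
    return re.sub(r"(?<=[0-9])([:\-])(?=[0-9])", RLM + r"\1", text)


def codepoints(text):
    """List the code points of a string in U+XXXX notation."""
    return ", ".join("U+%04X" % ord(ch) for ch in text)
```

Calling `codepoints(mark_digit_separators("12:34-56"))` gives the same sequence as the Codepoints column for option a.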

Punctuation

Make rules for any punctuation marks you need. Example:

',' > U+060C
';' > U+061B
U+003A > U+003A U+200F ; colon (insert RLM for correct behavior)
U+0021 > U+0021 U+200F ; exclamation mark (insert RLM for correct behavior)
U+003F > U+061F ; question mark
U+0028 > U+FD3F ; bracket opening
U+0029 > U+FD3E ; bracket closing
U+002A > U+066D ; asterisk
U+005B > U+0020 U+200F U+005B U+200F ; bracket square opening
U+005D > U+005D U+200F ; bracket square closing
U+2018 > U+0020 U+200F U+2039 U+200F ; quote single open angled
U+2019 > U+203A U+200F ; quote single close angled
U+201C > U+0020 U+200F U+00AB U+200F ; quote double open angled
U+201D > U+00BB U+200F ; quote double close angled
; ...

If a punctuation mark is not from the Arabic script blocks, some of the rules require the insertion of U+200F RIGHT-TO-LEFT MARK in order to get the proper RTL behavior.
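
One way to see which characters carry a right-to-left direction of their own is to look up their Unicode Bidi class. The sketch below uses Python's standard `unicodedata` module; the helper names are illustrative. In the Unicode bidirectional algorithm only the classes R and AL are strong right-to-left types, so a character with any other class takes its direction from its surroundings.

```python
import unicodedata


def bidi_class(ch):
    """Return the Unicode Bidi class of a character, such as 'ON' or 'AL'."""
    return unicodedata.bidirectional(ch)


def is_strong_rtl(ch):
    """True for characters whose Bidi class is a strong right-to-left type."""
    return bidi_class(ch) in ("R", "AL")


if __name__ == "__main__":
    # A few characters that appear on the right-hand side of the rules above.
    for cp in (0x003A, 0x0021, 0x005B, 0x005D, 0x00AB, 0x00BB, 0x061B, 0x061F, 0x200F):
        ch = chr(cp)
        print("U+%04X %-3s strong RTL: %s" % (cp, bidi_class(ch), is_strong_rtl(ch)))
```

Running the script prints one line per character, which makes it easy to spot the marks that may need an accompanying RLM.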

Other

Depending on your language you may need to add more rules in the main converter pass that have not been covered here.

Left-Hand-Side (LHS) is now Arabic Script

All passes after this should match Arabic script characters on the LHS, as there won’t be any more Roman script characters in the text.

Pass for Shadda

Typically in Arabic script, if a double consonant is needed it is not written as double, but instead a combination of the consonant and Arabic Shadda (U+0651) is used. In this simple pass any double consonants are replaced with the consonant and a Shadda. Example:

U+0628 U+0628 > U+0628 U+0651 ;بّ
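
The effect of this pass for a whole set of consonants can be previewed with a short Python sketch. It is not TECkit code, and the consonant list is a hypothetical three-letter subset. The regular expression matches a consonant that is immediately repeated and keeps one copy followed by Shadda.

```python
import re

SHADDA = "\u0651"

# Hypothetical subset of consonants: Beh, Teh, Dal.
CONSONANTS = "\u0628\u062A\u062F"

# A consonant followed by the same consonant again.
DOUBLE = re.compile("([" + CONSONANTS + "])\\1")


def add_shadda(text):
    """Replace a doubled consonant with the single consonant plus Shadda."""
    return DOUBLE.sub("\\1" + SHADDA, text)
```

Two different consonants in a row, such as Beh followed by Teh, are left unchanged.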

Pass for Sukun

If you wish to add a sukun between two consonants you would have a rule such as:

class[cons] = ( U+0628 U+062A..U+063A U+0641..U+0646 U+067E U+0683 U+0686 U+068A \
U+0694 U+0697 U+0698 U+06A0 U+075D U+0766 U+0767 )

[cons]=a [cons]=b > @a U+0652 @b
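
The same idea can be sketched in Python to test sample words. This is not TECkit code. It uses the consonant set from class[cons] above and zero-width lookarounds, so in a run of three consonants a Sukun is placed after each of the first two. Whether the TECkit rule above treats such runs the same way is worth checking in the editor's test area.

```python
import re

SUKUN = "\u0652"

# Same members as class[cons] above, written as a regex character class.
CONS = (
    "\u0628\u062A-\u063A\u0641-\u0646\u067E\u0683\u0686\u068A"
    "\u0694\u0697\u0698\u06A0\u075D\u0766\u0767"
)

# The empty position between two adjacent consonants.
BETWEEN = re.compile("(?<=[" + CONS + "])(?=[" + CONS + "])")


def add_sukun(text):
    """Insert Sukun between every pair of directly adjacent consonants."""
    return BETWEEN.sub(SUKUN, text)
```

A consonant that is already followed by a vowel mark is not affected, because the mark separates it from the next consonant.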

Pass for Alef Maksura

Certain Arabic loan words and some Koranic names traditionally finish with an Alef Maksura (U+0649). It is likely that those spellings are retained in your language. If such a word receives a suffix, the Alef Maksura needs to be replaced by a normal Alef. For this pass a lexicon needs to be collected. Example:

'یحیی' / _ ^^[WordBreak] > 'یحیا'
'موسی' / _ ^^[WordBreak] > 'موسا'

If your language uses the combination of Alef Maksura and a superscript Alef, then the situation is simpler. You only need one rule:

U+0649 U+0670 / _ ^^[WordBreak] > U+0627

Please note that in some countries Arabic Letter Farsi Yeh (U+06CC) may be used instead of Alef Maksura.

Pass for Vowel Omission

This is an optional pass that can be used to strip vowel marks from the converted text. Readers in many regions of the Arabic script world prefer text without vowel marks. If you do not want this pass, disable every line of code within it with a semicolon. Example:

U+0624 > U+0648 ;changes Waw with Hamza above to plain Waw

U+064E > '' ;removes Fatha

;U+0650 > '' ;removes Kasra

Notice that the last line is disabled (a semicolon starts the line).
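
The switchable nature of this pass can be imitated in Python, where keyword arguments stand in for commenting rules in or out. The sketch is not TECkit code; `strip_vowels` and its parameters are illustrative names, and the defaults mirror the example above (Fatha removed, Kasra kept).

```python
def strip_vowels(text, remove_fatha=True, remove_kasra=False):
    """Apply the optional vowel-omission replacements to converted text."""
    table = {0x0624: 0x0648}      # Waw with Hamza above becomes plain Waw
    if remove_fatha:
        table[0x064E] = None      # delete Fatha
    if remove_kasra:
        table[0x0650] = None      # delete Kasra
    return text.translate(table)
```

Calling `strip_vowels(text, remove_kasra=True)` corresponds to removing the semicolon from the last rule.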

The Mapping File

When you start the TECkit Mapping Editor and create a new mapping file, you are first asked to select a conversion type. You probably want Unicode to Unicode. Next you are asked to provide fonts for the left-hand side and the right-hand side encodings. These are used in the test area of the Editor screen where, at any time, you can type a word and see how it would be converted. Select your normal Roman font for the left-hand side and an Arabic font for the right-hand side.

A TECkit Mapping Editor window will now open with a section of headers which you can optionally edit, and a sample pass. Using the example passes from the section above, the structure of a sample TECkit mapping file is demonstrated below.

; This file was edited using TECkitMappingEditorU.exe v3.1.0.0 on 4/15/2013.
; Conversion Type = Unicode_to_Unicode
; Left-hand side font = Arial Unicode MS;11.25
; Right-hand side font = Scheherazade alpha;18
; Main Window Position = 989,119,1170,1007
; Left-hand side Character Map Window Position = 1395,83,457,446
; Right-hand side Character Map Window Position = 1344,540,457,446

; This mapping file is not intended to round-trip (Arabic script to Roman)

LHSName "Unicode-LanguageName-Roman"
RHSName "Unicode-LanguageName-Arabic"
LHSDescription "Unicode LanguageName Roman script encoding"
RHSDescription "Unicode LanguageName Arabic script encoding"
Version "1.0"
Contact "mailto:name@xxx.com"
RegistrationAuthority "MyCompany"
RegistrationName "LanguageName Roman to Arabic conversion 1.0"
Copyright "© 2013 . All rights reserved."
LHSFlags (ExpectsNFC GeneratesNFC)
RHSFlags ()

Define _WordBreak U+2028 U+2029 U+0002 U+0003 U+0009 U+000A U+000C U+000D U+0020 \
U+0021 U+0022 U+0023 U+0024 U+0025 U+0027 U+0028 U+0029 U+002C U+002D U+002E \
U+002F U+005B U+005C U+005D U+007B U+007D U+200C U+200E U+200F U+2018 U+2019 U+201A \
U+201B U+201C U+201D U+201E U+201F U+003A U+00AB U+00BB U+002A U+061B U+060C U+061F

; ----------
; All Left-Hand-Side is currently Roman Script
; ----------

; ----------
; Pass 1 for Proper names
; ----------

pass(Unicode)

UniClass [WordBreak] = ( _WordBreak )

'Simon' / (#|[WordBreak]) _ (#|[WordBreak]) > 'شمعون'
'Ylyas' / (#|[WordBreak]) _ (#|[WordBreak]) > 'اِلیاس'
; ...

; ----------
; Pass 2 for Uppercase to Lowercase conversion
; ----------

pass ( Unicode )

; convert all characters to lower case
;

class[UpperCase] = ( U+0041..U+005A U+014A )
class[LowerCase] = ( U+0061..U+007A U+014B )

[UpperCase] > [LowerCase]

; ----------
; Pass 3 for loan words
; ----------

pass ( Unicode )

UniClass [WordBreak] = ( _WordBreak )

class[a] = (U+0061)

'ebarad' / (#|[WordBreak]) _ (#|[WordBreak]) > 'عئبارت'
'heqikat' / (#|[WordBreak]) _ (#|[WordBreak]) > 'حاقئقت'
'ader' / (#|[WordBreak]) _ > 'عطئر'
'hada' / (#|[WordBreak]) _ (#|[WordBreak]) > 'ختیٰ'
'reh' / (#|[WordBreak]) _ ^^[a] > 'روح'

; ...

; ----------
; Pass 4 for the main converter
; ----------

pass ( Unicode )

class[vowel] = ( 'a' 'e' 'i' 'o' 'u' )

UniClass [WordBreak] = ( _WordBreak )

; take care of diphthongs first: ay, ey, oy, uy, aw, ew, iw, ow
; note that diphthongs can never be followed by another vowel
; note that word initial forms are different
'ay' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+064E U+064A U+0652
'ay' / _ ^^[vowel] > U+064E U+064A U+0652
'ey' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+065A U+064A U+0652
'ey' / _ ^^[vowel] > U+065A U+064A U+0652
'oy' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+065B U+064A U+0652
'oy' / _ ^^[vowel] > U+065B U+064A U+0652
'uy' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+064F U+064A U+0652
'uy' / _ ^^[vowel] > U+064F U+064A U+0652
'aw' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+064E U+0648 U+0652
'aw' / _ ^^[vowel] > U+064E U+0648 U+0652
'ew' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+065A U+0648 U+0652
'ew' / _ ^^[vowel] > U+065A U+0648 U+0652
'iw' / (#|[WordBreak]) _ ^^[vowel] > U+0625 U+0650 U+0648 U+0652
'iw' / _ ^^[vowel] > U+0650 U+0648 U+0652
'ow' / (#|[WordBreak]) _ ^^[vowel] > U+0623 U+065B U+0648 U+0652
'ow' / _ ^^[vowel] > U+065B U+0648 U+0652

; take care of vowels next
; check long (doubled) forms first
; note that word initial forms are different
'aa' / (#|[WordBreak]) _ > U+0622 ;word initial
'aa' > U+064E U+0627
'a' / (#|[WordBreak]) _ > U+0623 U+064E
'a' > U+064E
'ee' / (#|[WordBreak]) _ > U+0623 U+065A U+064A
'ee' > U+065A U+064A
'e' / (#|[WordBreak]) _ > U+0623 U+065A
'e' > U+065A
'ii' / (#|[WordBreak]) _ > U+0625 U+0650 U+064A
'ii' > U+0650 U+064A
'i' / (#|[WordBreak]) _ > U+0625 U+0650
'i' > U+0650
'oo' / (#|[WordBreak]) _ > U+0623 U+065B U+0648
'oo' > U+065B U+0648
'o' / (#|[WordBreak]) _ > U+0623 U+065B
'o' > U+065B
'uu' / (#|[WordBreak]) _ > U+0623 U+064F U+0648
'uu' > U+064F U+0648
'u' / (#|[WordBreak]) _ > U+0623 U+064F
'u' > U+064F

class[BackV] = ('a' 'y' 'u' 'o')

class[ConsonantLatin] = ('b' 'c' 'd' 'f' 'g' 'h' 'j' 'k' 'l' 'm' \
'n' 'p' 'r' 's' 't' 'w' 'y' 'z' U+014B) ; etc
class[ConsonantArabic] = (U+0628 U+0686 U+062F U+0641 U+0642 U+0647 U+062C U+0643 U+0644 U+0645 \
U+0646 U+067E U+0631 U+0633 U+062A U+0648 U+064A U+0632 U+075D) ; etc

[ConsonantLatin] > [ConsonantArabic]

'nj' > U+0683
'nd' > U+068A
'dr' > U+0694
'tr' > U+0697
'ndr' > U+0698
'ch' > U+0634
'kh' > U+062E
'mb' > U+0766
U+014B 'g' > U+06A0 ; eng-g

'g' / (#|[WordBreak]) _ [BackV] > 'ق' ;word initially followed by [BackV]
'g' / [BackV] _ (#|[WordBreak]) > 'ق' ;word finally preceded by [BackV]

; ...

; hyphen turns to space, except when followed by word final -ka or -nu, where it disappears
; Note: first two rules couldn't convert to "nothing", so I included the first
; letter of the following pair of letters. Thus the conversion for the "k" and "n"
; is actually done here.
'-ka' / _ (#|[WordBreak]) > U+0643 U+064E
'-nu' / _ (#|[WordBreak]) > U+0646 U+064F
'-' > ' '

; Numerals
class[RomanDigit] = (U+0030..U+0039)
class[ArabicDigit] = (U+0660..U+0669)
class[EasternArabicDigit] = (U+06F0..U+06F9)

; Choose either first two lines or second two lines. Comment out the one you do not want
U+003A / [RomanDigit] _ [RomanDigit] > U+200F U+003A ; add RLM before colon between digits (normal behavior)
U+002D / [RomanDigit] _ [RomanDigit] > U+200F U+002D ; add RLM before hyphen between digits (normal behavior)

; U+003A / [RomanDigit] _ [RomanDigit] > U+200F U+003A U+202D ; add RLM before colon and LRO after colon between digits
; U+002D / [RomanDigit] _ [RomanDigit] > U+002D ; preserve hyphen from suffix rules

; if standard Arabic-Indic Digits or Eastern Arabic-Indic digits are required uncomment the correct line below
;[RomanDigit] > [ArabicDigit]
;[RomanDigit] > [EasternArabicDigit]

; convert other punctuation
',' > U+060C
';' > U+061B
U+003A > U+003A U+200F ; colon (insert RLM for correct behavior)
U+0021 > U+0021 U+200F ; exclamation mark (insert RLM for correct behavior)
U+003F > U+061F ; question mark
U+0028 > U+FD3F ; bracket opening
U+0029 > U+FD3E ; bracket closing
U+002A > U+066D ; asterisk
U+005B > U+0020 U+200F U+005B U+200F ; bracket square opening
U+005D > U+005D U+200F ; bracket square closing
U+2018 > U+0020 U+200F U+2039 U+200F ; quote single open angled
U+2019 > U+203A U+200F ; quote single close angled
U+201C > U+0020 U+200F U+00AB U+200F ; quote double open angled
U+201D > U+00BB U+200F ; quote double close angled

; ----------
; All Left-Hand-Side is now Arabic Script
; ----------

; ----------
; Pass 5 for Shadda
; ----------

pass ( Unicode )

U+0628 U+0628 > U+0628 U+0651 ;بّ

; ...

; ----------
; Pass 6 for Sukun
; ----------

pass ( Unicode )

class[cons] = ( U+0628 U+062A..U+063A U+0641..U+0646 U+067E U+0683 U+0686 U+068A \
U+0694 U+0697 U+0698 U+06A0 U+075D U+0766 U+0767 )

[cons]=a [cons]=b > @a U+0652 @b

; ----------
; Pass 7 for Alef Maksura
; ----------

pass ( Unicode )

UniClass [WordBreak] = ( _WordBreak )

'موسی' / _ ^^[WordBreak] > 'موسا'
'یحیی' / _ ^^[WordBreak] > 'یحیا'
; ...

; ----------
; Pass 8 for vowel omission
; ----------

; pass ( Unicode )

; U+0624 > U+0648 ;changes Waw with Hamza above to plain Waw
; U+064E > '' ;removes Fatha
; U+0650 > '' ;removes Kasra

Installing the Mapping File

Once you have built and tested your mapping file, you can compile it from the TECkit Editor’s File menu. The compiled map has a .tec extension and can be installed in the Converter System Repository via any application that is capable of using the Repository. For example, if you want to convert SFM files, start the Bulk SFM Converter and go to Converter Mappings > Set Default Converter > Add New. Or, if you want to convert files that can be opened in Microsoft Word, go to the Add-Ins menu/tab (which should have appeared in Word after installing the SIL Converters package), and install your mapping file via the Data Conversion macro. Once the mapping file is installed, it will be available to all the applications that can access the Converter System Repository. Even after the mapping file has been installed in the Repository, it can be modified and recompiled any time without the need for reinstalling it.

Download Sample

The download contains a sample TECkit mapping file (using above code) and a sample .txt file which can be converted using the mapping file.

Roman Script to Arabic Script Sample TECkit mapping file 1.0, for all platforms (ZIP | 5.7 KB | 23 Apr 2013)