The Balinese Unicode Text Processing

In principle, a computer recognizes only numbers as the representation of characters. Many encoding systems have therefore been devised to assign these numbers, although none of them covers all characters. In Europe, even a single language may need more than one encoding system. Unicode was established to overcome this problem. It provides a unique ID for each character, independent of platform, program, and language. The Unicode standard has been adopted by many industry vendors, such as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, and Unisys. In addition, language standards and modern information exchange technologies such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML use Unicode as the official way to implement ISO/IEC 10646. This research addresses four tasks for Balinese script: transliteration, searching, sorting, and word boundary analysis (spell checking). To verify the correctness of the algorithms, several applications were built. These applications run on Linux and Windows using J2SDK 1.5 and the J2ME WTK2 library. The input and output of the algorithms and applications are character sequences obtained from the keyboard and from external files. This research produces a module or library that is able to process Balinese text based on the Unicode standard. The output of this research is the ability, skill, and mastering of 1


Introduction
Language and local script are among the most precious cultural assets and have to be preserved for generations to come. Balinese script, which is used to write the Balinese language, is threatened with extinction because it is rarely used and has a narrow scope of usage. Efforts to preserve it have been made but have met an obstacle: the lack of applications that accommodate writing in Balinese script. Basically, the more sophisticated the tools are, the better the prospects for education and culture in the future. The tool in question is the computer, on which software can be engineered to produce and process Balinese script quickly and properly [1].
The endeavor to computerize Balinese script is being conducted by the Bali Galang Foundation. The first step is to include the Balinese script in the Unicode character standard. The Unicode Consortium and the ISO/IEC JTC1/SC2/WG2 committee have agreed in principle to include the Balinese script, as defined in the formal proposal written by Michael Everson and I Made Suatjana, in the standards that they maintain [6]. This proposal, numbered N2908, was submitted to the 46th WG2 meeting in Xiamen, China, in January 2005. The information given in the proposal is by all means complete but will only be finalized during the next WG2 meeting in Sophia-Antipolis, France, in September 2005.
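The conventional methods of processing a string of Latin text are not applicable to Balinese text, because Balinese Unicode text processing differs in at least three areas [6]. Among them, the sorting algorithm should not be based purely on character code points: vowels should be ignored when comparing consonants and factored in only when the consonants are equal, and two different sorting schemes exist, the traditional Balinese HANACARAKA ordering and the Sanskrit ordering. In addition, Balinese text does not use spaces as word separators, so a spell-checking algorithm should be able to perform a dictionary-based lookup to determine word boundaries and to validate the spelling of the text.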

Text Processing in Computer
Generally, information flows into and out of a computer in the form of text documents, figures, audio, video, and combinations of them. Text conveys information in a language, written in a script that human beings, as either the subject or the object of the information, can understand. To be processed by a computer, these scripts need to be encoded as numbers, since a computer can only recognize numbers. These numbers consist of binary digits, 0 and 1, known as bits.
In practice, bits are processed in octets (from the Latin word octo, meaning eight), groups of eight bits also called bytes. Conventions are needed to decide how an octet, or a series of octets, is interpreted as data. For example, a series of four octets can be used to represent a real number. In this final project, octet series are used to represent strings. The simplest way to represent characters, still widely used, is to map one octet to one character according to a mapping table. In doing so, we can represent 256 (2^8 = 256) characters. This number exceeds the size of the character set used by the Latin script, which is widely used to write many languages around the world such as English and Indonesian. This technique is also used by the ASCII character standard (American Standard Code for Information Interchange), which was developed in the 1960s and is still in use today.
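The one-octet-per-character mapping can be illustrated with a few lines of Python. This is only a sketch for illustration (the project itself used Java); the sample word and the choice of UTF-8 for the Balinese letter are assumptions made here, not part of the original work.

```python
# One octet per character is enough for Latin text in ASCII.
latin = "Bali"
octets = latin.encode("ascii")
print(list(octets))                 # one number (octet) per character

# A Balinese letter has no place in a one-octet table.
ka = "\u1B13"                       # BALINESE LETTER KA
utf8_length = len(ka.encode("utf-8"))
print(utf8_length)                  # several octets are needed in UTF-8

try:
    ka.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```

The failure of the last step is the motivation for moving from single-octet tables to Unicode.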
In general, text processing in a computer starts when the user types on the keyboard and the keyboard sends scan codes to the keyboard driver. The driver then transforms the scan codes into a meaningful character sequence. When a non-Roman input mode is on, the driver also checks the input sequences and rejects invalid ones. After that, the text processor manipulates the characters: searching, copy-pasting, sorting, word counting, line breaking, transliteration, and so on. These characters are also stored in memory or on other storage devices. To show character sequences, the rendering engine picks the glyphs that represent the characters. Then an output device such as a monitor or printer displays the rendered glyphs.
Unicode is a character coding standard for representing written languages in a computer. Unicode was not the first coding standard; it came as the answer to problems that had arisen from earlier coding standards over the years [9]. Therefore, Unicode stays close to the existing coding standards. When Unicode version 1.0 was issued in 1991, ASCII and ISO-8859 had become the best-known standards.
The development of the Unicode character model follows the 10 basic rules stated below [9]. However, not all of them are fully satisfied: consistency can be sacrificed in order to keep simplicity, efficiency, and compatibility with previous standards. The basic rules are: Universality; Efficiency; Characters, not glyphs; Semantics; Plain text; Logical order; Unification; Dynamic composition; Equivalent sequences; Convertibility.

Balinese Script
The Balinese script is used for writing the Balinese language, the native language of the people of Bali. It is a descendant of the ancient Brahmic script from India; therefore it has some notable similarities with the modern scripts of South Asia and Southeast Asia that also descend from the Brahmic script. The Balinese script is also used for writing Kawi, or Old Javanese, which had a heavy influence on the Balinese language in the 11th century. Some Balinese words are also borrowed from Sanskrit, so the Balinese script is also used to write Sanskrit words.
The basic elements of the alphabet are syllables. Each syllable has an inherent vowel sound of /a/ or /ĕ/, depending on the position of the syllable within a word.
The text direction of the Balinese script is from left to right, with vowel signs attached before, after, below, or above the syllable. Some vowel signs are split vowels, meaning that they appear at more than one position relative to the syllable.
The writing system of Balinese script is more complex than that of Latin script. The alphabet consists of syllables, and every syllable ends with the vowel sound /a/.
A consonant cluster is a group of consonants in a syllable that appear without any vowel between them. In Balinese script, a consonant naturally carries the vowel sound /a/. In general, there are two ways to omit this inherent vowel sound:
1. Using a consonant in its gantungan or gempelan form, attached to the next consonant. The gantungan or gempelan consonant omits the vowel of the consonant on its left side, not its own vowel. For example, the word 'bakta' (bring).
2. Using adeg-adeg, U+1B44 BALINESE ADEG-ADEG. For example, the word 'kadep' (sold).
Positioning in Balinese script writing is divided into several different areas (see Figure 1):
1. Baseline area: the writing base of Balinese script. Consonants are written in this area.
2. Areas to the left of the baseline, for pre-base marks (prem), and to the right of the baseline, for post-base marks (pstm): used to write dependent vowels and gempelan.
3. Areas above the baseline, for above-base marks (abvm), and below the baseline, for below-1 base marks (blw1m) and below-2 base marks (blw2m): used to write gantungan and pengangge suara.

Reordering and Split Vowel
A dependent vowel in Balinese script modifies the base consonant syllable and takes several forms. A consonant or a cluster of consonants may carry a dependent vowel that changes the last vowel sound attached to it. Balinese script has various forms of dependent vowels, both spacing and non-spacing, written before, after, above, or below the base character. Combinations of these positions are also possible.
The Unicode standard specifies that a combining character is encoded after its base character. Therefore, when a character sequence contains dependent vowels, reordering is necessary in computer memory just before the sequence is displayed on the screen. The reordering changes the glyph order so that the glyph components of dependent vowels are displayed properly (see Figure 2). A split vowel is a vowel whose components appear on two different sides of its consonant. The components may appear either above and to the right of the base consonant, or to its left and right.
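The reordering step can be sketched in Python for the simplest case of one base consonant followed by one dependent vowel. The sets PRE_BASE and SPLIT and the function visual_order are hypothetical names introduced here; the sketch assumes that TALING and TALING REPA are drawn to the left of the base and that the two split vowels listed consist of a left component plus TEDUNG on the right. Split vowels with an above-base component, conjuncts, and stacked marks are not handled.

```python
KA = "\u1B13"                       # BALINESE LETTER KA
TEDUNG = "\u1B35"
ULU = "\u1B36"                      # above-base vowel sign, no horizontal reordering
TALING, TALING_REPA = "\u1B3E", "\u1B3F"
TALING_TEDUNG, TALING_REPA_TEDUNG = "\u1B40", "\u1B41"

# Assumption: vowel signs whose glyph is drawn to the left of the base.
PRE_BASE = {TALING, TALING_REPA}

# Assumption: split vowels given as (left component, right component).
SPLIT = {
    TALING_TEDUNG: (TALING, TEDUNG),
    TALING_REPA_TEDUNG: (TALING_REPA, TEDUNG),
}

def visual_order(base, vowel):
    """Return the left-to-right glyph order for one base plus one dependent vowel.

    The input is in logical (encoded) order: base first, vowel second.
    """
    if vowel in SPLIT:
        left, right = SPLIT[vowel]
        return left + base + right
    if vowel in PRE_BASE:
        return vowel + base
    return base + vowel             # post-base, above-base, and below-base marks

print([f"U+{ord(c):04X}" for c in visual_order(KA, TALING_TEDUNG)])
```

In a real system this reordering is usually performed by the rendering engine and the font rather than by the text processing library, so the stored text always stays in logical order.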
The glyphs of vowels drawn below the base character need special treatment, because glyph selection depends on the context of the preceding consonant or consonant cluster. These vowels are U+1B38 BALINESE VOWEL SIGN SUKU (u) and U+1B39 BALINESE VOWEL SIGN SUKU ILUT (uu). Each of these vowels has two different glyphs: one attached to base consonants and one attached to their conjunct forms (pengangge aksara).

Ligatures
A glyph representing more than one character is called a ligature; in handwriting, it is written on paper in a single stroke. Several Balinese characters form ligatures when they appear adjacent to one another, so on the screen they look as if they were a single glyph. For example, U+1B35 BALINESE VOWEL SIGN TEDUNG (aa) forms a ligature when attached to a syllable.

Line Breaking
Although Balinese script is written without any spaces between two successive words, line breaking cannot be done at random places. There are two common rules of line breaking:
1. A line may not be broken between a syllable and any following combining characters.
2. A line may not be broken just before any punctuation.
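The two rules above can be sketched in Python using the general categories from the standard unicodedata module. The function name break_opportunities is hypothetical, and the sketch checks only these two rules; it treats every character whose category starts with "M" as a combining character and every character whose category starts with "P" as punctuation, and it does not try to keep conjunct consonants together.

```python
import unicodedata

def break_opportunities(text):
    """Return the indices i where a break between text[i-1] and text[i] is allowed."""
    allowed = []
    for i in range(1, len(text)):
        category = unicodedata.category(text[i])
        if category.startswith("M"):
            continue                # rule 1: a combining mark stays with its syllable
        if category.startswith("P"):
            continue                # rule 2: no break just before punctuation
        allowed.append(i)
    return allowed

# KA + TALING, LA + ULU, CARIK SIKI (punctuation), MA
sample = "\u1B13\u1B3E\u1B2E\u1B36\u1B5E\u1B2B"
print(break_opportunities(sample))
```

For the sample string, breaks are offered only before LA and before MA, which keeps each vowel sign with its consonant and the punctuation mark on the same line as the preceding syllable.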

Characteristics of Balinese Script
Like any other Unicode script, Balinese script has some unique characteristics (see Table 1). They are published in the proposal L2/05-008, which was approved by the Unicode Consortium. However, the decomposition mapping property should be added to the proposal. According to the Unicode standard, there should be ten characters requiring decomposition mapping (see Algorithm 1).
Table 1. Characteristics of Balinese Script

Canonical Combining Class of Balinese Script
Algorithm 1. Decomposition mapping of Balinese script [7]. The symbol → indicates that the left-hand element has to be equivalent to the right-hand element.
The purpose of canonical combining classes is to establish appropriate equivalence classes under Unicode normalization for character sequences that involve combining marks. Specifically:
1. Given a pair of combining marks that interact typographically (i.e., that nominally occupy the same position relative to the base), different encoded orders correspond to visually distinct relative positions of the marks, hence are semantically distinct. By assigning these marks to the same canonical combining class (zero or non-zero), the non-equivalence of differently-ordered sequences is established under normalization.
2. Given a pair of combining marks that do not interact typographically (i.e., that occupy distinct positions relative to the base), different encoded orders are visually identical, hence not semantically distinct. By assigning these marks to different, non-zero canonical combining classes, the equivalence of differently-ordered sequences is established under normalization.
In canonical combining class, the class 0 has special behavior in the Unicode normalization algorithms: if a sequence contains a combining mark in class 0 and a mark in a non-zero class n, equivalence classes are defined as though the class-0 mark belonged to class n; i.e., that sequence is not equivalent to the sequence containing those marks in the opposite order.
Using the canonical combining classes proposed in L2/05-008, there is only one pair of combining marks for which distinct orders would be considered canonically equivalent (see table 2).
In proposal L2/05-008, combinations of syllable-modifier signs (1B00-1B03), REREKAN, and vowel signs, at least, are linguistically valid. Because all of these except REREKAN are assigned to class 0, differently-ordered sequences of these marks, which would be visually distinct, are not canonically equivalent. Thus, the use of class 0 provides appropriate results in these cases.
In this case, < 1B35 BALINESE VOWEL SIGN TEDUNG, 1B04 BALINESE SIGN BISAH > is a linguistically plausible combination. Assuming its normal use as a vowel killer, ADEG-ADEG should not co-occur with either of the other two marks. Again, though, different encoded orders of a combination of these marks are possible in principle and would be visually distinct, and so the use of class 0 provides appropriate results in these cases.
In the cases described above, the use of class 0 is sufficient to cause differently-ordered combinations of marks that do interact typographically (having different visual results) to be considered not canonically equivalent. However, assigning marks to class 0 breaks down in that it fails to cause differently-ordered combinations of marks that do not interact typographically to be considered canonically equivalent.
Following the Unicode standard, a suggestion is made to assign classes 220, 224, 226, and 230 to the characters whose position relative to the base is bottom, left, right, or top, respectively [7]. However, the characters U+1B34 BALINESE SIGN REREKAN and U+1B44 BALINESE SIGN VIRAMA are assigned to the fixed-position classes 7 and 9, respectively.
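The effect of canonical combining classes on equivalence can be observed with Python's standard unicodedata module, which exposes the Unicode Character Database bundled with the interpreter. The values printed are those of the published standard and need not match every proposal-stage value discussed in the text. The helper canonically_equivalent is a hypothetical name, and the character sequences are chosen only to exercise the classes, not because they are linguistically meaningful.

```python
import unicodedata

KA = "\u1B13"
BISAH, REREKAN, TEDUNG, ADEG_ADEG = "\u1B04", "\u1B34", "\u1B35", "\u1B44"

for ch in (REREKAN, ADEG_ADEG, TEDUNG, BISAH):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: class {unicodedata.combining(ch)}")

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Marks in different non-zero classes: the encoded order does not matter.
different_nonzero = canonically_equivalent(KA + REREKAN + ADEG_ADEG,
                                           KA + ADEG_ADEG + REREKAN)

# Marks that are both in class 0: the encoded order is significant.
both_zero = canonically_equivalent(KA + TEDUNG + BISAH, KA + BISAH + TEDUNG)

print(different_nonzero, both_zero)
```

The first comparison succeeds because normalization sorts the two marks by class; the second fails because class-0 marks are never reordered.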

Normalization of Balinese Script
There are four Unicode normalization forms:
1. Normalization Form D: canonical decomposition.
2. Normalization Form C: canonical decomposition, followed by canonical composition.
3. Normalization Form KD: compatibility decomposition.
4. Normalization Form KC: compatibility decomposition, followed by canonical composition.
Considering the efficiency and performance of the Unicode normalization algorithm, Balinese script is processed in Normalization Form D. Besides, Normalization Form C is basically obtained through the same steps as Normalization Form D.
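As a rough illustration of Normalization Form D, the sketch below performs recursive canonical decomposition followed by canonical reordering, using the decomposition mappings and combining classes from Python's standard unicodedata module. The function names decompose, canonical_reorder, and to_nfd are hypothetical. The sketch ignores Hangul syllables, whose decomposition is algorithmic, so it is adequate for Balinese text but is not a general replacement for unicodedata.normalize.

```python
import unicodedata

def decompose(ch):
    """Recursively apply the canonical decomposition mapping of one character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return ch                   # no mapping, or a compatibility-only mapping
    return "".join(decompose(chr(int(cp, 16))) for cp in mapping.split())

def canonical_reorder(s):
    """Stable-sort every run of non-zero-class marks by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)          # class-0 characters never move
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def to_nfd(s):
    return canonical_reorder("".join(decompose(ch) for ch in s))

word = "\u1B13\u1B40"               # KA + VOWEL SIGN TALING TEDUNG
print([f"U+{ord(c):04X}" for c in to_nfd(word)])
```

Comparing the result with unicodedata.normalize("NFD", word) is a convenient way to check the sketch against the library implementation.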
Normalization Form D consists of two phases. First, decomposition mapping is applied to every Balinese word. Second, canonical reordering is done according to the canonical combining class of each character. Decomposition mapping is recursive, so that the correct character sequences are obtained (see Algorithm 2). U+1B40 is one of the characters that is decomposed in this way.
Algorithm 2. Decomposition mapping algorithm. The symbol → indicates that the left-hand element has to be equivalent to the right-hand element.
Basically, the Unicode Collation Algorithm (UCA) is a process that creates sorting keys with particular priority values. All strings are first processed on the primary level and then compared. If they are equal, the comparison continues to the secondary level and later to the tertiary level when necessary.
In general, the sorting keys of the UCA are built from three levels: alphabet/base character, diacritic mark/accent, and uppercase/lowercase.

Comparison Algorithm of Balinese Script
The comparison algorithm of Balinese script consists of four phases:
1. Every Balinese script input is first processed into Normalization Form D.
2. The result of step (1) is passed to the UCA according to the chosen sorting method, either HANACARAKA or SANSKRIT.
3. The values of the collation elements are separated by level and recombined, with a particular level-separator value inserted between the levels.
4. The values obtained from step (3) are compared using a binary comparison algorithm.
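The four phases can be sketched in Python with a deliberately simplified two-level key; this is not the full UCA. Everything below is an assumption made for illustration: the weights, the LEVEL_SEPARATOR value, the function names, the treatment of code point order as the Sanskrit ordering, and the code point spellings of the two test words 'krama' and 'cakra'. Independent vowels and other signs are simply pushed to the secondary level.

```python
import unicodedata

CONSONANTS = range(0x1B13, 0x1B34)  # BALINESE LETTER KA .. HA
# ha na ca ra ka da ta sa wa la ma ga ba nga pa ja ya nya
HANACARAKA = [0x1B33, 0x1B26, 0x1B18, 0x1B2D, 0x1B13, 0x1B24,
              0x1B22, 0x1B32, 0x1B2F, 0x1B2E, 0x1B2B, 0x1B15,
              0x1B29, 0x1B17, 0x1B27, 0x1B1A, 0x1B2C, 0x1B1C]
LEVEL_SEPARATOR = 0                 # lower than every weight used below

def primary_weight(cp, method):
    if method == "hanacaraka":
        if cp in HANACARAKA:
            return 1 + HANACARAKA.index(cp)
        return 1 + len(HANACARAKA) + (cp - 0x1B13)   # remaining letters sort last
    return 1 + (cp - 0x1B13)        # "sanskrit": assumed to follow code point order

def sort_key(text, method):
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):    # phase 1
        cp = ord(ch)
        if cp in CONSONANTS:
            primary.append(primary_weight(cp, method))   # phase 2
        else:
            secondary.append(cp)    # vowels and signs only break ties
    return tuple(primary) + (LEVEL_SEPARATOR,) + tuple(secondary)   # phase 3

KRAMA = "\u1B13\u1B44\u1B2D\u1B2B"  # assumed spelling: KA, ADEG ADEG, RA, MA
CAKRA = "\u1B18\u1B13\u1B44\u1B2D"  # assumed spelling: CA, KA, ADEG ADEG, RA

for method in ("hanacaraka", "sanskrit"):
    ordered = sorted([KRAMA, CAKRA], key=lambda w: sort_key(w, method))  # phase 4
    print(method, ["cakra" if w == CAKRA else "krama" for w in ordered])
```

Because all primary weights precede the separator in the key, vowels are consulted only when the consonant skeletons of two words are identical.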

Transliteration of Balinese Script
Transliteration is a mapping from one writing system to another, e.g. from Balinese script to Latin and vice versa, taking into account the accents and grammar of both [10]. The main criterion is that no information is lost, so that the user is able to transform the result back to its original form. Transliteration therefore differs from transcription, which only maps the sounds of one language to another. Transliteration helps people who cannot read the Balinese script. For example, [U+1B13 KA, U+1B2E LA] becomes 'kala' (time).
Transliteration is performed by building a table that maps characters from Balinese script to Latin and vice versa. A complex conversion is needed to handle changes of shape and special cases of letters in the source script. The input for transliteration is entered from the keyboard as pairs of hexadecimal digits (0, 1, 2, …, A, B, C, D, E, F), padded with a 0 if the number of digits is odd. The transliteration algorithm uses an inversion list data structure (including the basic functions invert, union, intersection, set difference, adding, and deleting) in order to save memory. These operations are also fast because every element can be accessed at random.
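An inversion list stores a set of code points as a sorted list of range boundaries, so that a long run of characters costs only two integers. The sketch below is a minimal Python version with membership testing and inversion only; the class name, its methods, and the two example ranges are assumptions made here and do not reproduce the original Java implementation.

```python
from bisect import bisect_right

class InversionList:
    """A set of code points stored as boundaries [start0, end0, start1, end1, ...].

    Each start is inclusive and each end is exclusive.
    """

    def __init__(self, ranges=()):
        # ranges: (first, last) pairs, inclusive, assumed not to overlap
        self.bounds = []
        for first, last in sorted(ranges):
            self.bounds += [first, last + 1]

    def __contains__(self, cp):
        # An odd number of boundaries at or below cp means cp is inside a range.
        return bisect_right(self.bounds, cp) % 2 == 1

    def invert(self):
        """Return the complement over the non-negative integers."""
        result = InversionList()
        if self.bounds[:1] == [0]:
            result.bounds = self.bounds[1:]
        else:
            result.bounds = [0] + self.bounds
        return result

# Hypothetical example set: consonants and dependent vowel signs.
letters = InversionList([(0x1B13, 0x1B33), (0x1B35, 0x1B43)])
print(0x1B13 in letters, 0x1B34 in letters, 0x1B34 in letters.invert())
```

Membership testing is a binary search over the boundaries, and inversion touches only the first element, which is why the structure is both compact and fast.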
The character U+1B33 BALINESE LETTER HA serves as a neutral carrier for vowels. Therefore, when a vowel is written as the first letter of a word, either an independent vowel or the character U+1B33 followed by the appropriate dependent vowel may be used.
The character suku kembung (U+1B2F) is used in consonant clusters between words in its appended form 'wa', so this character may be transliterated as 'ua'.
The transliteration algorithm calls a string translation function that receives both an input string and an output string (see Algorithms 3 and 4). For example, given an input string s with n = length(s), an iteration over n characters is performed.
In the transliteration algorithm, the most dominant function is searching for a character in the I/O file. The comparison is performed over the Balinese script block (U+1B00-U+1B7F). In the worst case, the comparison is performed 128 times.
Given that m is the average number of comparisons in the lookup table, the transliteration algorithm theoretically has a complexity of O(n*m).
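A table-driven transliteration from Balinese script to Latin can be sketched as follows. The tables cover only a handful of characters, and the names CONSONANTS, VOWEL_SIGNS, INHERENT, and transliterate are hypothetical; the spelling of 'bakta' as BA, KA, ADEG ADEG, TA is an assumption about how the conjunct is encoded. The sketch is not the algorithm of the original project and does not handle independent vowels, HA as a vowel carrier, or the inverse direction.

```python
CONSONANTS = {
    "\u1B13": "k", "\u1B22": "t", "\u1B24": "d",
    "\u1B27": "p", "\u1B29": "b", "\u1B2E": "l",
}
VOWEL_SIGNS = {"\u1B36": "i", "\u1B38": "u"}    # ULU, SUKU
ADEG_ADEG = "\u1B44"
INHERENT = "a"

def transliterate(text):
    out = []
    pending = False     # True while the last item still ends in the inherent vowel
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch] + INHERENT)
            pending = True
        elif pending and ch in VOWEL_SIGNS:
            # A dependent vowel replaces the inherent vowel.
            out[-1] = out[-1][:-len(INHERENT)] + VOWEL_SIGNS[ch]
            pending = False
        elif pending and ch == ADEG_ADEG:
            # Adeg-adeg removes the inherent vowel.
            out[-1] = out[-1][:-len(INHERENT)]
            pending = False
        else:
            out.append(ch)  # characters outside the tables pass through unchanged
            pending = False
    return "".join(out)

print(transliterate("\u1B13\u1B2E"))            # KA, LA
print(transliterate("\u1B29\u1B13\u1B44\u1B22"))  # BA, KA, ADEG ADEG, TA
```

Because the tables are hash maps, each lookup takes constant time on average, so this sketch runs in O(n) rather than the O(n*m) of a linear table scan.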

Searching of Balinese Script
Searching Balinese script poses some challenges: some characters are made up of other characters, and a sequence of characters may even be combined into a single character. Therefore, searching should work on both precomposed and decomposed strings.
Searching is performed by validating the equivalent forms of Balinese script (see Table 3).
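Searching across equivalent forms can be sketched by normalizing both the text and the pattern to Normalization Form D before comparing. The function name find_normalized is hypothetical, and the index it returns refers to the normalized text, not to the original string. A plain substring match can also start or end in the middle of a syllable, which a complete implementation would have to rule out.

```python
import unicodedata

def find_normalized(haystack, needle):
    """Return the index of needle in the NFD form of haystack, or -1."""
    h = unicodedata.normalize("NFD", haystack)
    n = unicodedata.normalize("NFD", needle)
    return h.find(n)

KA, LA = "\u1B13", "\u1B2E"
PRECOMPOSED = "\u1B40"              # VOWEL SIGN TALING TEDUNG
DECOMPOSED = "\u1B3E\u1B35"         # VOWEL SIGN TALING + VOWEL SIGN TEDUNG

# The same vowel is found whichever form the text and the pattern use.
print(find_normalized(KA + PRECOMPOSED + LA, DECOMPOSED))
print(find_normalized(KA + DECOMPOSED + LA, PRECOMPOSED))
```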
In the searching algorithm, the most dominant functions are normalization, sorting key creation according to the UCA, and binary comparison (Algorithm 5). The normalization function receives an input string and an array of integers. Given an input string s with n = length(s), it iterates over n characters.

Algorithm 3. Character transliteration of Balinese script
The normalization function consists of two processes:
1. Canonical decomposition, i.e. a recursive function that transforms Balinese script into its atomic form. In the best case, the input character is already in atomic form. Otherwise, in the worst case, the input character is processed recursively until it reaches an atomic form. For Balinese script, the recursion depth is one.
2. Canonical combining class reordering, a function that reorders the Balinese script characters produced by canonical decomposition according to their classes. In the worst case, the iteration is performed up to the recursion depth minus one.
Given that p is the average recursion depth, normalization theoretically has a complexity of O(p).
On average, the collation element function iterates n times, so it produces an array of integers of size n. Therefore, this function has a complexity of O(n).
The binary comparison function compares the collation element values. In the worst case, the comparison is performed n times.

Algorithm 4. Inverse Transliteration of Balinese script
Theoretically, the complexity of the searching algorithm is obtained by picking the maximum value among the three complexity values above (with p < n), which gives O(n^2). Therefore, the complexity of the searching algorithm is O(n^2).
Other symbols (U+1B61-U+1B7C). Therefore, the sorting algorithm of Balinese script does not depend on the character code point. Besides, the vowels must be ignored when comparing the consonants; the vowels are taken into account only when the consonants are identical. For example, 'krama' (member) appears before 'cakra' (disc) in the Sanskrit sorting method. On the other hand, 'cakra' appears before 'krama' in the HANACARAKA sorting method.
In the sorting algorithm, the most dominant functions are normalization, sorting key creation according to the UCA, and binary comparison (Algorithm 6).
The normalization function receives an input string and an array of integers. Given an input string s with n = length(s), it iterates over n characters. This function consists of two processes:
1. Canonical decomposition, i.e. a recursive function that transforms Balinese script into its atomic form. In the best case, the input character is already in atomic form. Otherwise, in the worst case, the input character is processed recursively until it reaches an atomic form. For Balinese script, the recursion depth is one.
2. Canonical combining class reordering, a function that reorders the Balinese script characters produced by canonical decomposition according to their classes. In the worst case, the iteration is performed up to the recursion depth minus one.
Given that p is the average recursion depth, normalization theoretically has a complexity of O(p).
On average, the collation element function iterates n times, so it produces an array of integers of size n. Therefore, this function has a complexity of O(n).
The binary comparison function compares the collation element values. In the worst case, the comparison is performed n times.

Word Boundary of Balinese Script
In computer terminology, a word boundary analyzer or spell checker is a software facility designed to verify the correctness of the spelling in a document in order to help the user obtain the correct spelling [10].
To handle Balinese script, word boundary analysis breaks a sentence down into the words that build it. As mentioned before, Balinese script does not use spaces as separators between two adjacent words, so a dictionary-based lookup is required. The word boundary algorithm separates a character sequence of Balinese script into base character clusters (consonants and consonant clusters). The algorithm uses a stack data structure to retrieve valid character clusters and then process them. The validity of a piece of the character sequence is inferred by comparing it with the data stored in the dictionary. There are three possible outputs of the word boundary algorithm:
1. The character sequence is separated completely and correctly (the best case).
2. The character sequence is separated only partially, because the input may be invalid or the data in the dictionary may be incomplete. In this case, the algorithm returns the best word boundaries obtained according to the statistical principle (the attempt that produces the maximum number of Balinese script words).
3. No output is returned (the worst case).
The result of the GUI design implementation can be seen in Figures 10, 11, and 12.
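The dictionary-based word boundary search and its three possible outcomes can be sketched with an explicit stack. The function name segment, the tiny dictionary, and the code point spellings of the sample words are hypothetical, and the sketch works on individual code points rather than on base character clusters, so it is an illustration of the idea and not the original algorithm.

```python
def segment(text, dictionary):
    """Split text into dictionary words using depth-first search with a stack.

    Returns (words, complete). When no complete split exists, the attempt
    with the most words is returned, or None if not a single word matched.
    """
    max_len = max((len(w) for w in dictionary), default=0)
    best = []
    stack = [(0, [])]               # (position reached, words found so far)
    while stack:
        pos, words = stack.pop()
        if pos == len(text):
            return words, True      # outcome 1: complete separation
        if len(words) > len(best):
            best = words
        # Shorter candidates are pushed first, so the longest match is popped first.
        for end in range(pos + 1, min(len(text), pos + max_len) + 1):
            if text[pos:end] in dictionary:
                stack.append((end, words + [text[pos:end]]))
    if best:
        return best, False          # outcome 2: partial separation
    return None, False              # outcome 3: no output

KA = "\u1B13"
KALA = "\u1B13\u1B2E"               # assumed spelling of 'kala'
BAKTA = "\u1B29\u1B13\u1B44\u1B22"  # assumed spelling of 'bakta'
MA = "\u1B2B"
DICTIONARY = {KA, KALA, BAKTA}

words, complete = segment(KALA + BAKTA, DICTIONARY)
print(len(words), complete)
```

The stack holds the alternatives that have not yet been explored, which allows the search to back up when a long match leads to a dead end. The worst-case running time of this naive search grows quickly with the input length, so a practical implementation would cache the positions that have already failed.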

Figure 1. Writing position of Balinese script

Figure 9 .
Figure 5. Big O(n) Testing of Inverse Transliteration Algorithm

Figure 10. User Interface of B-Linguist

Table 3. The Equivalent Form of Balinese Script