Unicode Project

Goal

The goal of the Unicode Project is to replace the current Big5 and UTF-8 mapping tables with EACC-based one-to-one mapping tables. This will bypass the inconsistencies produced by the vendor's software when it maps internal codes to Big5 and UTF-8. It will also enable the Library to adopt Millennium modules and to prepare its systems for total conversion to Unicode.

History

Innovative approached the Library in January 2003 and recommended running a patching program on the internal codes of CJK characters in the records. After studying the issue and obtaining more information from other local sites, the Library concluded that the fundamental problem lies in the many-to-one nature of the internal codes and the mixed use of CCCII and EACC. The Library decided to take this opportunity to fix these long-term problems. At the Library's request, Innovative sent the Big5 and UTF-8 mapping tables in late May.

Methodology

First, dubious cases are extracted from the existing mapping tables and re-mapped. Second, revised tables are created. From the revised tables, many-to-one pairs are extracted for use in data conversion. Finally, pure CCCII codes are removed and duplicate entries in the many-to-one pairs are consolidated. A new one-to-one mapping table is created for each of Big5 and UTF-8.

Study and Findings

Tables from Innovative

One Big5 table and two UTF-8 tables were received from Innovative. All analyses of the UTF-8 mappings are based on UTF8.new. (Since most sites have already obtained the BIG5 and UTF8 tables, only UTF8.new is attached here for reference.)
  • BIG5 (current)
  • UTF8 (current)
  • UTF8.new (for Phase 3)

UTF-8

Doubtful cases are extracted from UTF8.new and re-mapped by cross-checking with the Unicode standard. A revised UTF8 table (UTF8.rev) is created. From UTF8.rev, many-to-one pairs are extracted and preferred positions are determined. To simplify the re-mapping, word frequency is not taken into account. Selection of preferred positions follows the Unicode standard: whenever possible, the EACC code chosen in the Unicode standard is chosen as the preferred code.
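The extraction of many-to-one pairs and the choice of preferred positions can be illustrated with a minimal Python sketch. It is not the Library's actual program. It assumes the revised table has been read into (internal code, Unicode value) pairs and that the EACC code listed in the Unicode standard is available as a lookup; the function names, the data layout and the `standard_eacc` lookup are hypothetical. Groups with no usable entry in the lookup are set aside for manual review rather than resolved by an invented rule.

```python
from collections import defaultdict


def many_to_one_groups(rows):
    """Group internal codes by Unicode value; keep values reached by 2+ codes."""
    groups = defaultdict(list)
    for internal_code, unicode_value in rows:
        groups[unicode_value].append(internal_code)
    return {u: codes for u, codes in groups.items() if len(codes) > 1}


def choose_preferred(groups, standard_eacc):
    """Pick the preferred code for each many-to-one group.

    standard_eacc is a hypothetical lookup: Unicode value -> EACC code
    given in the Unicode standard. Returns (remap, unresolved), where
    remap maps each non-preferred code to the preferred one and
    unresolved holds groups that need a manual decision.
    """
    remap = {}
    unresolved = {}
    for unicode_value, codes in groups.items():
        preferred = standard_eacc.get(unicode_value)
        if preferred in codes:
            for code in codes:
                if code != preferred:
                    remap[code] = preferred
        else:
            # No usable preference from the standard: leave for manual review.
            unresolved[unicode_value] = codes
    return remap, unresolved
```

The `remap` result corresponds to the many-to-one re-mapping used for data conversion; the `unresolved` groups would still need a human decision.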

For output, three mapping tables are generated: a many-to-one re-mapping for data conversion; a pure CCCII re-mapping for data conversion; and a new one-to-one mapping table to replace the existing UTF8 table.
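As a rough illustration of how the third output relates to the first two, the sketch below derives a one-to-one table from a revised table and a combined re-mapping. This is one reading of the process, not the Library's actual procedure: it assumes that every code appearing as the source of a re-mapping (many-to-one or pure CCCII) is dropped from the new table, and it then checks that no Unicode value is left with more than one code. The function name and data layout are hypothetical.

```python
def build_one_to_one(revised_rows, remap):
    """Drop re-mapped codes and verify that the result is one-to-one.

    revised_rows: list of (internal code, Unicode value) pairs.
    remap: dict of {code to convert: code to keep}; its keys are the
           codes that data conversion removes from the records.
    """
    kept = [(code, uni) for code, uni in revised_rows if code not in remap]
    seen = {}
    for code, uni in kept:
        if uni in seen:
            # Two surviving codes still share one Unicode value.
            raise ValueError(
                f"{uni} is still mapped from both {seen[uni]} and {code}"
            )
        seen[uni] = code
    return kept
```

If the check raises an error, the re-mapping tables are incomplete for that Unicode value and the pair needs to be revisited.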

Study
Output

Big5

The BIG5 table from III does not contain Unicode information. Unicode values are taken from UTF8.rev by matching on the CCCII/EACC code. If no match is found, they are taken from the Unicode site by matching on the Big5 code. With Unicode values in place, doubtful cases are easier to identify. These cases are re-ordered, added or re-mapped, and a revised BIG5 table (Big5.rev) is created.
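The two-step lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the BIG5 table is taken to be a list of (Big5 code, CCCII/EACC code) pairs, and the two lookups (`by_internal`, built from UTF8.rev, and `by_big5`, built from the Unicode site's Big5 data) are hypothetical dictionaries prepared beforehand.

```python
def fill_unicode(big5_rows, by_internal, by_big5):
    """Attach a Unicode value to each row of the BIG5 table.

    big5_rows:   list of (Big5 code, CCCII/EACC code) pairs.
    by_internal: CCCII/EACC code -> Unicode value (from UTF8.rev).
    by_big5:     Big5 code -> Unicode value (fallback source).
    Returns (filled, missing); missing rows matched neither source.
    """
    filled, missing = [], []
    for big5, internal in big5_rows:
        # First choice: match on the internal code.
        uni = by_internal.get(internal)
        if uni is None:
            # Fallback: match on the Big5 code.
            uni = by_big5.get(big5)
        if uni is None:
            missing.append((big5, internal))
        else:
            filled.append((big5, internal, uni))
    return filled, missing
```

Rows returned in `missing` have no Unicode value from either source and would have to be examined by hand.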

There are two ways to define many-to-one pairs in Big5: entries that have the same Big5 code and the same Unicode value, or entries that have the same Big5 code only. The former set is safe for use in data conversion. For the latter set, the multiple EACC entries are studied. 181 pairs of this kind are identified. Most have the same low-order bytes. Our analysis shows that only one case is dubious in meaning. The rest are included in data conversion.
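The two definitions can be separated mechanically before the manual study. The sketch below is a hedged illustration, not the analysis actually performed: it assumes rows of (Big5 code, EACC code, Unicode value), treats "same Big5 only" as groups whose Unicode values differ, and assumes EACC codes are six hexadecimal digits so that "low-order bytes" means the last two bytes. All names are hypothetical.

```python
from collections import defaultdict


def split_big5_groups(rows):
    """Split Big5 many-to-one groups into the two sets described above.

    rows: list of (Big5 code, EACC code, Unicode value) triples.
    Returns (same_unicode, big5_only), each a dict of
    Big5 code -> list of EACC codes.
    """
    by_big5 = defaultdict(list)
    for big5, eacc, uni in rows:
        by_big5[big5].append((eacc, uni))

    same_unicode, big5_only = {}, {}
    for big5, entries in by_big5.items():
        if len(entries) < 2:
            continue  # already one-to-one
        codes = [eacc for eacc, _ in entries]
        if len({uni for _, uni in entries}) == 1:
            same_unicode[big5] = codes  # same Big5 and same Unicode
        else:
            big5_only[big5] = codes  # same Big5 only: needs study
    return same_unicode, big5_only


def share_low_order_bytes(eacc_codes):
    """True if all codes end in the same four hex digits (assumed meaning)."""
    return len({code[-4:].upper() for code in eacc_codes}) == 1
```

Groups in `big5_only` are the ones that call for case-by-case study; `share_low_order_bytes` only helps sort them and does not settle whether a pair is safe to convert.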

The outputs for Big5 are similar to those for UTF8: two many-to-one re-mappings for data conversion; a pure CCCII re-mapping for data conversion; and a new one-to-one mapping table to replace the existing BIG5 table.

Study
Output

Final tables to Innovative

The following are for patching (data conversion) only
The following are to replace current mapping tables

Conversion test run

An in-house conversion program (using Perl and MarcEdit) was developed and test-run on 183,398 Bib records with 880 tags.

Result log
Pure CCCII in CityU (440 unique)
Sample record converted
Sample record uploaded

Outstanding

The many-to-one re-mapping tables used in data conversion list the re-mappings in the format "{convert this}=>{to this}". This format should be easy for III's patching program to read. In our test run, the in-house program successfully converted all the Bib records using these re-mappings.
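To show how simple the format is to consume, here is a minimal Python sketch of a reader and a converter. It is not the in-house Perl/MarcEdit program and not III's patching program. It assumes the braces are literal, that each side holds one hexadecimal internal code, that there is one re-mapping per line, and that codes appear in field data in the same braced form; the function names are hypothetical.

```python
import re

# One re-mapping per line, e.g. "{222222}=>{111111}" (placeholder codes).
LINE_RE = re.compile(r"^\{([0-9A-Fa-f]+)\}=>\{([0-9A-Fa-f]+)\}$")
# A braced internal code inside field data (assumed representation).
CODE_RE = re.compile(r"\{([0-9A-Fa-f]+)\}")


def parse_remap(lines):
    """Read '{convert this}=>{to this}' lines into a dict."""
    remap = {}
    for number, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {number}: unexpected format: {line!r}")
        remap[match.group(1).upper()] = match.group(2).upper()
    return remap


def apply_remap(field, remap):
    """Replace every re-mapped code in a field; leave other text untouched."""
    def swap(match):
        target = remap.get(match.group(1).upper())
        return match.group(0) if target is None else "{" + target + "}"
    return CODE_RE.sub(swap, field)
```

With a table containing the single line `{222222}=>{111111}`, calling `apply_remap("{222222}{333333}", remap)` replaces only the first code and leaves `{333333}` as it was.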

Though the conversion could be handled by each local site with some effort, it is hoped that III will still provide a patching program as initially proposed. This patching program should be able to read the format described above, so that each local site can easily generate its own re-mapping table for III to patch. The remaining issue to be settled between III and the local sites is who will maintain the new mapping tables once they are in place, and how.

Library, City University of Hong Kong. Last revised: 4 July 2003