Moving From MARC to XML - Part Two
Handling of Multi-Script Metadata

This section discusses how MARC 21 handles multi-script bibliographic metadata and how XML can record the same information more completely.

In the MARC 21 specification, you can encode multiple scripts in either the MARC-8 environment or the UCS/Unicode environment.  In the MARC-8 environment, you use tag 066 to declare an alternate character set used in the metadata.  For example:

066    |c$1

specifies that the MARC record uses Latin as its default character set (known as ANSEL or the ALA character set) and EACC as an alternate character set.  Subfield c in tag 066 is repeatable, so the same record can have more than one alternate character set.  However, traversing such a record becomes inconvenient because of the many shift-ins and shift-outs between character sets.  The MARC-8 environment is therefore a good choice mainly when your library catalog contains just two main scripts (e.g. Latin and CJK).

For library catalogs that have an international scope, it is essential, if not mandatory, to be able to store, search and display all scripts.  In such cases, it is more appropriate to use UCS/Unicode.  In the UCS/Unicode environment, you can enter any script in the regular fields using UTF-8 encoding.  Instead of using tag 066, position 9 in the Leader contains the value "a" to indicate that the metadata is in UTF-8.
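As a minimal sketch of the Leader check described above, the following Python function reports whether a record declares the UCS/Unicode environment.  The function name and the two sample leaders are made up for illustration; only the rule itself (position 9, counted from zero, holds "a") comes from the text.

```python
def uses_unicode(leader: str) -> bool:
    """Return True when Leader position 9 (counted from 0) holds 'a'."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    return leader[9] == "a"


# Hypothetical leaders, invented for this example only.
unicode_leader = "00000nam a2200000 a 4500"   # position 9 is 'a'
marc8_leader = "00000nam  2200000 a 4500"     # position 9 is blank

print(uses_unicode(unicode_leader))
print(uses_unicode(marc8_leader))
```

A converter could use such a check to decide whether the field data must first be transcoded from MARC-8 before it is written out as UTF-8 XML.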

Assume we are cataloging an item that has the title listed in three different languages on the title page:

Title in English: An English-Russian-Chinese dictionary of electronics and electrical engineering
Title in Chinese: 英俄汉电子电工技朮词典
Title in Cyrillic: Англо-Русско-Китайский словарь по электронике и электротехнике

The following shows how this item is marked up according to the two models as specified in MARC 21 using Unicode:

Model A

Leader Position 9: a
....
246  31  |a English-Russian-Chinese dictionary of electronics and electrical engineering
246  3    |6880-01|aYing E Han dian zi dian gong ji shu ci dian
880  3    |6246-01|a英俄汉电子电工技朮词典
246  31  |6880-02|aAnglo-Russko-Kitaĭskiĭ slovar′ po ėlektronike i ėlektrotekhnike
880  31  |6246-02|aАнгло-Русско-Китайский словарь по электронике и электротехнике
....

Model B

Leader Position 9: a
....
246  31  |aEnglish-Russian-Chinese dictionary of electronics and electrical engineering
246  3    |aYing E Han dian zi dian gong ji shu ci dian
246  3    |a英俄汉电子电工技朮词典
246  31  |a Anglo-Russko-Kitaĭskiĭ slovar′ po ėlektronike i ėlektrotekhnike
246  31  |a Англо-Русско-Китайский словарь по электронике и электротехнике
....

Both Model A and Model B fail to mark up the multi-lingual aspects of these MARC tags.  For example, nothing indicates which of the above tags is the Pinyin transliteration of the Chinese script, or which pair of tags is in Russian.  Model A is slightly better than Model B in that it supports linking between the transliterated form and the vernacular form through the use of tag 880 and subfield code 6.  In the Model B example above, there is no markup to indicate that these five 246 tags are related, or how they are related.
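To make the Model A linkage concrete, here is a rough Python sketch that pairs each transliterated field with its vernacular 880 field by reading subfield 6.  The in-memory representation (a tag plus a dictionary of subfields) and the function names are assumptions made for this example, not part of MARC 21 or of any particular library.

```python
from typing import Dict, List, Tuple

# Assumed in-memory form of a field: (tag, {subfield code: value}).
Field = Tuple[str, Dict[str, str]]


def parse_linkage(value: str) -> Tuple[str, str]:
    """Split a subfield 6 value such as '880-01' into (linked tag, occurrence)."""
    linked_tag, _, rest = value.partition("-")
    # Only the two-digit occurrence number is kept; anything after it is ignored.
    return linked_tag, rest[:2]


def pair_linked_fields(fields: List[Field]) -> List[Tuple[str, str]]:
    """Return (transliterated, vernacular) pairs joined through subfield 6."""
    regular: Dict[Tuple[str, str], str] = {}
    vernacular: Dict[Tuple[str, str], str] = {}
    for tag, subfields in fields:
        if "6" not in subfields:
            continue  # unlinked field, e.g. the English title
        linked_tag, occurrence = parse_linkage(subfields["6"])
        if tag == "880":
            # An 880 field points back at the regular tag it belongs to.
            vernacular[(linked_tag, occurrence)] = subfields["a"]
        else:
            regular[(tag, occurrence)] = subfields["a"]
    return [(regular[k], vernacular[k]) for k in sorted(regular) if k in vernacular]


model_a = [
    ("246", {"a": "English-Russian-Chinese dictionary of electronics and electrical engineering"}),
    ("246", {"6": "880-01", "a": "Ying E Han dian zi dian gong ji shu ci dian"}),
    ("880", {"6": "246-01", "a": "英俄汉电子电工技朮词典"}),
]

for romanized, original in pair_linked_fields(model_a):
    print(romanized, "<->", original)
```

Note that the pairing tells us only that two fields belong together.  It still says nothing about which language or transliteration scheme each field uses, which is the gap discussed above.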

If the multi-lingual attributes can be included in the metadata, a library automation system will be able to support alternate displays by language.  For example, a Chinese user may choose to display only the Chinese script and the Latin script, while an English speaker may choose to just display the English and the roman transliterations.

If we use XML to mark up the above item, we can define a schema (DTD) that takes these multi-lingual attributes into consideration.  For example, we may have something like the following:

<?xml version="1.0" encoding="UTF-8"?>
<marc mattype="am" cdate="20010416" udate="20010416" rcn="ACbC">
...
<fd id="1.1" script="latin.english" name="246" ind1="3" ind2="1" label="Variant Title">
<sf name="a">English-Russian-Chinese dictionary of electronics and electrical engineering</sf>
</fd>

<fd id="1.2" script="latin.pinyin" name="246" ind1="3" ind2="b"  label="Variant Title">
<sf name="a">Ying E Han dian zi dian gong ji shu ci dian</sf>
</fd>

<fd id="1.3" script="cjk.chinese" name="246" ind1="3" ind2="b"  label="Variant Title">
<sf name="a">英俄汉电子电工技朮词典</sf>
</fd>

<fd id="1.4" script="latin.russian" name="246" ind1="3" ind2="1"  label="Variant Title">
<sf name="a">Anglo-Russko-Kitaĭskiĭ slovar′ po ėlektronike i ėlektrotekhnike</sf>
</fd>

<fd id="1.5" script="cyrillic.russian" name="246" ind1="3" ind2="1"  label="Variant Title">
<sf name="a">Англо-Русско-Китайский словарь по электронике и электротехнике</sf>
</fd>
...
</marc>

We can use the script attribute in the <fd> element to define the script and language of the data element.  For example, script="latin.english" means that the script is Latin and the language is English; script="latin.pinyin" means that the script is Latin and the content is a Pinyin transliteration of Chinese.  Furthermore, we can use the id attribute to link up all forms.  This id number can have multiple levels, so that all <fd> elements sharing the same stem (for example, 1.1, 1.2 and 1.3 share the stem 1) are automatically linked together.
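The script and id attributes described above can be read with a few lines of code.  The following Python sketch uses the standard xml.etree.ElementTree module on a shortened copy of the sample record; the function names and the dictionary layout are assumptions for illustration, not part of any schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copy of the sample record (three of the five variant titles).
RECORD = """<marc mattype="am">
<fd id="1.1" script="latin.english" name="246" ind1="3" ind2="1" label="Variant Title">
<sf name="a">English-Russian-Chinese dictionary of electronics and electrical engineering</sf>
</fd>
<fd id="1.2" script="latin.pinyin" name="246" ind1="3" ind2="b" label="Variant Title">
<sf name="a">Ying E Han dian zi dian gong ji shu ci dian</sf>
</fd>
<fd id="1.3" script="cjk.chinese" name="246" ind1="3" ind2="b" label="Variant Title">
<sf name="a">英俄汉电子电工技朮词典</sf>
</fd>
</marc>"""


def describe(fd):
    """Split the id and script attributes of one <fd> element into their parts."""
    script, _, language = fd.get("script", "").partition(".")
    return {
        "id": fd.get("id", ""),
        "stem": fd.get("id", "").split(".")[0],   # "1.2" -> "1"
        "script": script,                         # "latin"
        "language": language,                     # "pinyin"
        "text": " ".join(sf.text or "" for sf in fd.findall("sf")),
    }


def group_by_stem(root):
    """Collect all <fd> elements whose id shares the same stem."""
    groups = defaultdict(list)
    for fd in root.iter("fd"):
        info = describe(fd)
        groups[info["stem"]].append(info)
    return dict(groups)


def select_for_display(root, wanted):
    """Return the text of the fields whose script attribute is in `wanted`."""
    return [describe(fd)["text"] for fd in root.iter("fd") if fd.get("script") in wanted]


root = ET.fromstring(RECORD)
for info in group_by_stem(root)["1"]:
    print(info["id"], info["script"], info["language"])
print(select_for_display(root, {"cjk.chinese", "latin.pinyin"}))
```

Passing a different set of script values to select_for_display gives a different view of the same record, which is the kind of per-user display choice mentioned earlier.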

Using XSL stylesheets, a library automation system will have total control over the multi-script display.  For example, we can have the following choices:

[NOTE: You need to use Internet Explorer 5.x or above to view the following XML links.]

You may notice the use of subfield 8 in the above displays to store the script and id attributes.  When transforming the XML metadata back to MARC 21 format, we need a subfield to hold the script and id attributes, and subfield 8 seems to be a good choice, i.e.

Leader Position 9: a
....
246  31  |81.1\s\latin.english|aEnglish-Russian-Chinese dictionary of electronics and electrical engineering
246  3    |81.2\s\latin.pinyin|aYing E Han dian zi dian gong ji shu ci dian
246  3    |81.3\s\cjk.chinese|a英俄汉电子电工技朮词典
246  31  |81.4\s\latin.russian|a Anglo-Russko-Kitaĭskiĭ slovar′ po ėlektronike i ėlektrotekhnike
246  31  |81.5\s\cyrillic.russian|a Англо-Русско-Китайский словарь по электронике и электротехнике
....

Subfield 8 is defined as a Field Link Control Subfield in MARC 21 for linking tags that describe constituent items and tags that concern reproduction.  A few years ago, the Library of Congress discussed the use of subfield 8 for linking the transliteration and vernacular forms, but without much progress.  By expanding the definition of Subfield 8 to include a new "s" (script) field link type and an additional script and language scheme, i.e.

|8 <linking number.sequence number> \ <field link type> \ <script.language>

it becomes possible to maintain interoperability of the above metadata between MARC 21 and XML.   It is hoped that the Library of Congress can specify something similar to the above so that multi-lingual attributes can be included in MARC 21.
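The round trip between the XML attributes and the proposed subfield 8 value can be sketched as follows.  Keep in mind that the "s" link type and the trailing script.language part are the proposal made in this article, not part of the current MARC 21 definition of subfield 8, and that the function names are hypothetical.

```python
from typing import Tuple


def format_sf8(link_id: str, script: str, link_type: str = "s") -> str:
    """Build a subfield 8 value in the proposed form, e.g. 1.2\\s\\latin.pinyin."""
    return f"{link_id}\\{link_type}\\{script}"


def parse_sf8(value: str) -> Tuple[str, str, str]:
    """Split a proposed subfield 8 value into (id, link type, script.language)."""
    link_id, link_type, script = value.split("\\", 2)
    return link_id, link_type, script


# XML attributes -> MARC subfield 8 -> XML attributes again.
value = format_sf8("1.2", "latin.pinyin")
print(value)
print(parse_sf8(value))
```

Because the id and script values survive the round trip unchanged, a record exported to MARC 21 in this form could later be turned back into the same <fd> attributes.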

In this section, we discussed how the two MARC 21 models and the UCS/Unicode environment can be adopted for marking up bibliographic metadata in multi-scripts.  We also discussed the deficiencies of MARC 21 in recording the multi-lingual attributes of the data elements and how this issue can be dealt with in XML.  By using a simple XML schema and UTF-8 encoding, a library automation system will be able to handle multi-scripts in a more flexible environment.

K.T. Lam (lblkt@ust.hk)
Created: 18 April 2001.
Last Revised: 18 April 2001.