Originally published on 8/6/97; 2:04:23 AM

Building Japanese WebPages in Frontier

An explanation of the problems of rendering Japanese in Frontier
Author: Nobumi Iyanaga

There are several codes for the Japanese language, and for the Chinese and Korean languages. The Japanese Mac system uses S-JIS, and the Chinese Mac system uses (I think) two codes, GB (for the simplified Chinese characters) and BIG5 (so-called "Traditional Chinese", used mainly in Taiwan). Among those codes, at least S-JIS and BIG5 use the character "\" (ASCII 92) as one of the bytes that compose some of their characters.

For those who are not aware of the way the two-byte codes work, I must explain a little (although I myself am not well versed in these questions). The basic problem is that the three Far Eastern languages, Chinese, Korean and Japanese, use a (theoretically) unlimited number of characters for writing (this is because these characters are "ideograms", each character meaning one thing...). As it is obviously impossible to assign codes to an unlimited number of things, we usually use conventional codes: the Japanese code defines more than 6000 characters, and the Chinese codes have more than 10000 characters (these are not sufficient!). These codes are "two-byte": each character is written as a pair of byte values, so that theoretically we can have up to 256 * 256 (= 65536) characters, each single-byte character being able to be combined with another (for example aa, ab, ac, ad..., ba, bb, bc, bd..., etc). In fact, we don't use all the possibilities, and each code has its own rules.

Anyway, you see now that some Japanese or Chinese or Korean characters may include "\" as one of the two composing elements. But this character is fatal when it is interpreted by a (normal) programming language: as you all know, it is used as a meta-character, meaning that the character following it must be taken as a literal. In grep, for example, "." (a dot) is used to match any character. To match the "." (dot) itself, we must write "\.". The character "\" itself is dropped during the process (to match "\" itself, we must write "\\"...).
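The backslash problem described above can be reproduced outside Frontier. The following Python sketch is only an illustration: `naive_unescape` is a hypothetical stand-in for any interpreter that treats "\" as an escape character, not Frontier's actual parser, and the sample bytes are the four values used in this article (151, 92, 145, 122), which Python's `shift_jis` codec decodes as two Japanese characters.

```python
def naive_unescape(data: bytes) -> bytes:
    """Drop each backslash (byte 92) and keep the byte that follows it.

    Hypothetical stand-in for a simple escape processor; it knows
    nothing about two-byte characters.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 92 and i + 1 < len(data):
            # Backslash followed by another byte: keep only the follower.
            out.append(data[i + 1])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


# Two S-JIS characters: 151+92 and 145+122.
encoded = bytes([151, 92, 145, 122])

print(encoded.decode("shift_jis"))    # 予想
print(list(naive_unescape(encoded)))  # [151, 145, 122]
```

The second byte of the first character is lost, so the remaining three bytes no longer form the original two characters.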
In S-JIS, we have for example a character composed of ASCII 151 and ASCII 92 (meaning "in advance"...), but when it is interpreted by Frontier, it becomes simply ASCII 151...! Try for example this little script:

    new (wpTextType, @scratchPad.testWpObject)
    target.set (@scratchPad.testWpObject)
    wp.setText ("msg (\"" + char (151) + char (92) + char (145) + char (122) + "\")")
    target.clear ()
    local (s = string (scratchpad.testWpObject))
    evaluate (s)

In the main window, you will see only three characters, the "\" having been dropped [the two characters, ASCII 151+92 and ASCII 145+122, 予想, mean "expectation" in Japanese].

The only true solution to this problem would be (I think) for Frontier to recognize in some way whether a string is in S-JIS or BIG5, etc., and to supply a supplementary "\" whenever it encounters one. But this must be very difficult, and I don't know if or when the development team of Frontier (Dave Winer and Doug Baron) will want to address these problems. Perhaps we should wait for the introduction of Unicode...

Our solution is not perfect, but it may serve as a workaround. In S-JIS code, within a two-byte character, "\" always occurs at the second position, and the preceding byte is between ASCII 129-159 or ASCII 224-239. So we can parse the string, find each "\", and verify whether the preceding character fits that condition. When the condition is satisfied, we replace these two bytes with this kind of formula:

    {char (129) + char (92)}

which is itself a "macro" written in UserTalk. You see that when it is interpreted by html.processMacros, the original two-byte character is restored.

But in fact, in the HTML processing in Frontier, there are other characters that can cause problems. These characters are: "{", "}", "@" and "«" (chevron). The glossary entries can also cause problems, because html.processMacros does not evaluate the characters that are between double quotes, etc. You see that all this is very complicated.
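The workaround described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the UserTalk code actually used: the function names are hypothetical, and instead of searching backwards from each "\", the sketch walks the string from left to right and consumes each lead byte together with its second byte, so that the second byte of one character is never mistaken for the lead byte of the next. The lead-byte ranges are the ones given in this article.

```python
def is_sjis_lead(byte: int) -> bool:
    """True if the byte falls in the lead-byte ranges quoted in the article."""
    return 129 <= byte <= 159 or 224 <= byte <= 239


def protect_backslashes(data: bytes) -> bytes:
    """Replace two-byte characters ending in byte 92 with a char () macro."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if is_sjis_lead(lead) and i + 1 < len(data):
            trail = data[i + 1]
            if trail == 92:
                # Emit the macro form, e.g. {char (151) + char (92)}.
                out += f"{{char ({lead}) + char (92)}}".encode("ascii")
            else:
                # Ordinary two-byte character: copy both bytes unchanged.
                out += bytes((lead, trail))
            i += 2
        else:
            # Single-byte character (including a real ASCII backslash).
            out.append(lead)
            i += 1
    return bytes(out)


sample = bytes([151, 92, 145, 122])
print(protect_backslashes(sample))  # b'{char (151) + char (92)}\x91z'
```

A backslash that stands alone as an ordinary single-byte character is left untouched, because it is never reached as the second byte of a pair.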
And we must take care not only of the web page text itself, but also of any Japanese text that can be inserted by the templates, glossary entries and macros... The glossary entries pose a special problem, because they are not evaluated by the processMacros script, so we must "protect" them (so as not to encode them) before passing them to the processMacros script... The problem of the macros is even more complicated, and we are not sure that we can do the right thing in every case, but we have tried...

Mon, Apr 7, 1997 at 10:57:47 PM by NI

Editor's Note: As of late July 1997, most of the challenges of Japanese Web authoring in Frontier have been solved, thanks to Hideaki Iimori's UCMDs. His MultiLingual Web Utilities have sped up and simplified the handling of the problems mentioned above. suites.MWU.NProcessMacros() handles most of the tricky items related to macro processing, while suites.MWU.TwoByteFixer() implements Nobumi's strategy above, using pairs of {char (xx) + char (yy)} to protect Japanese text. Macros don't work perfectly yet, but we're close. And with the UCMDs, speed has improved by 20%.

Wed, Aug 6, 1997 at 2:12:56 AM
This is part of Phil's Frontier Scripting Site.