Subject: UNICODE-tools/SGML for language tagging in UNICODE Since I am trying to process several languages in TeX at the same time, and getting a bit fed up of several preprocessors, etc. I want to make one universal, table driven pre-processor in awaitance of a TeX that really handles things correctly. I also want to have it produce Unicode files so that I have a good standard for storage, file exchange, editing, and so on. I think (dream?) of some tools to help me in this work. transcribe translate to/from transcription in ASCII to Unicode given some transcription tables for the languages/scripts used in the document. It should do much more than placing 00 in front of every byte, also in English. It should replace quotes, dashes etc for the correct signs, interprete TeX commands for accented letters, and so on. pretex pre-process a unicode file for TeX, given some script tables that describe the scripts used as it is encoded in a TeX-font sorttext sort a unicode file according to the rules of a given language, on a line by line or paragraph by paragraph base. atou create `unicode' files by placing 00 in front of every byte. utoa create `ascii' files by stripping MS-byte. In the original files I create (in ASCII), I want to tag the text with SGML-style tags indicating the languages used. Transcribe reads this and translates my transcription to appropriate Unicode characters. (Including replacing TeX commands for accented letters for appropriate accents/accented characters in Unicode.) Given a transcript file for that language. A transcript file includes a list of letters and their UNICODE-values It knows how to handle the syllabic nature of the script, although it is transcribed in an alphabetic script. The default language is english (but not english.american). Need to decide on file formats and contents of transcript files I can sort the resulting files with sorttext, given a sort-key for a certain language. Need to decide on file formats and contents of sort-key files I can pre-process the files for TeX with pretex, which will turn the files into ASCII again, but places codes for glyphs and font changes as neccessary for non-latin scripts, as indicated in the script file. For latin languages it could take care of all kinds of conventions, like the use of ligatures, hypenation, spacing after sentencesm etc. The tags could be used by spell checkers. Need to decide on file formats and contents of script files (defining context dependent behaviour of characters, locations of glyphs in fonts, etc.) ---- SGML-style Language tags in UNICODE how to? (are the indicators floating marks or block formers?) Malayalam text Malayalam text it will keep those language-markers in the UNICODE file. known languages: (languages are sometimes known in several dialects or otherwish variations, I suggest some kind of standard here) dutch english english.american english.phonetic french german german.fractur hindi hindi.transcription bahasa-indonesia malay malay.arabic malayalam malayalam.traditional malayalam.transcription marathi sanskrit sanskrit.transcription urdu unknown unknown.transcribed ---- UNICODE representation of Malayalam U+0D00 -- U+0D7F structure assumed in parsing: vowel consonant vowel-sign virama diacritic joiner non-joiner other syntax: primary ::== ( | ) [diacritic] secondary ::== ( | ) [diacritic] consonant-cluster ::== { [|] } | modifiers ::== { [] } syllabe ::== [] text ::== { } semantics: The algorithm reads a syllabe from the string. build-syllabe syllabe = if prebuild syllabe glyphs = prebuild syllabe else cluster-glyphs, rest-modifiers = build-cluster cluster with modifiers glyphs = apply rest-modifiers to cluster-glyphs endif build-cluster cluster with modifiers = if prebuild cluster glyphs = prebuild syllabe else split-off final ya, ra, la, try again else split-off front ra (for repham), try again else split first letter from build-syllabe first-part with virama build-syllabe second-part with modifiers rest-modifiers = nil endif apply modifiers to glyphs = apply first-modifier to glyphs apply rest-of-modifiers to glyphs The algorithm ignores joins, non-join will cause virama's to be used