LINGUIST List 4.191

Tue 16 Mar 1993

FYI: Unicode technical reports, Frequency count of Chinese



Directory

  1. Glenn Adams, Unicode Technical Reports
  2. Corpus-Based Frequency Count of Modern Chinese

Message 1: Unicode Technical Reports

Date: Mon, 15 Mar 93 21:19:44 EST
From: Glenn Adams <glenn@metis.com>
Subject: Unicode Technical Reports


Announcing the availability of three Unicode Technical Reports

Three Unicode Technical Reports are now available from the Unicode
Consortium. They may be obtained for a nominal fee that covers postage and
printing costs. Please note that Unicode Corporate, Associate, and
Individual members will automatically receive copies and need not order them
specifically.

These reports are being disseminated for public review and comment. In order
to achieve the widest possible distribution, they may be freely copied and
distributed for review purposes (provided the notices, etc. remain intact).
In each case, the review period ends on August 15, 1993. Contact the
consortium for costs.

Technical Report #1: Draft Proposals
 Contains Burmese, Khmer, and Ethiopian proposals which constitute the
 strong technical recommendations of the Unicode Technical Committee
 for these scripts. (To allow further review, they were not included
 in Unicode 1.0.)

Technical Report #2: Preliminary Draft Proposals
 Contains Mongolian, Sinhala, and Tibetan proposals which constitute
 recommended approaches to these scripts. (To allow further review,
 Mongolian and Sinhala were not included in Unicode 1.0. Tibetan was
 retracted for further study in the process of merging with ISO
 10646.)

Technical Report #3: Exploratory Proposals
 Contains proposals for the following scripts:
 Aramaic, Balti, Batak, Buginese, Cherokee, Etruscan,
 Glagolitic, Kirat (Limbu), Lepcha (Rong), Linear-B, Maldivian,
 Manipuri, Meroitic, Numidian, Ogham, Old Persian Cuneiform,
 Pahlavi/Avestan, Phoenician, Runes, South Arabian, Syriac,
 Tagalog/Mangyan, Tai Lu, Tai Mau, Ugaritic Cuneiform.
 These proposals represent possible encoding models for the scripts
 and are being presented in an exploratory fashion for their initial
 public comment and review. They will be issued subsequently as Draft
 Proposals.

To order, please contact:
 Unicode, Inc.
 1965 Charleston Ave.
 Mountain View, CA 94043

 Phone: (415) 961-4189
 FAX: (415) 966-1637
 Internet: info@unicode.org

The ASCII plain text of these technical reports, exclusive of charts, is also
available via anonymous FTP from the site "Unicode.ORG". The files are:
 pub/TechReports/UTR_1.ascii
 pub/TechReports/UTR_2.ascii
 pub/TechReports/UTR_3.ascii
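
For readers who prefer to script the retrieval, the following is a minimal
sketch using Python's standard ftplib module. The host name and file paths are
the ones given above; the function names fetch_reports and local_name are
hypothetical, and the sketch assumes the site still accepts anonymous FTP
logins and still serves these paths, which may no longer be the case.

```python
import os
import posixpath
from ftplib import FTP

# Host and paths as given in the announcement.
HOST = "Unicode.ORG"
REPORT_PATHS = [
    "pub/TechReports/UTR_1.ascii",
    "pub/TechReports/UTR_2.ascii",
    "pub/TechReports/UTR_3.ascii",
]


def local_name(remote_path):
    """Return the file name part of a remote (POSIX-style) path."""
    return posixpath.basename(remote_path)


def fetch_reports(host=HOST, paths=REPORT_PATHS, dest_dir="."):
    """Download each report into dest_dir over anonymous FTP.

    Assumption: the server allows anonymous login and the paths exist.
    """
    with FTP(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        for remote in paths:
            target = os.path.join(dest_dir, local_name(remote))
            with open(target, "wb") as out:
                ftp.retrbinary("RETR " + remote, out.write)

# Example call (not run here, since it needs network access):
#     fetch_reports(dest_dir="unicode_reports")
```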

Message 2: Corpus-Based Frequency Count of Modern Chinese

Date: Tue, 16 Mar 93 16:20:04 EA
From: <rocltshiis.sinica.edu.tw>
Subject: Corpus-Based Frequency Count of Modern Chinese


 Corpus-Based Frequency Count of Modern Chinese

Corpus-based study of Chinese is one of the research projects of
the Chinese Knowledge Information Processing Group (CKIP) at
Academia Sinica. The current research is based on a Chinese
newspaper corpus of 20,698,116 characters (9,540,444 words after
word segmentation). Four technical reports, written in Chinese,
have been published:

1. Corpus-Based Frequency Count of Characters in Journal Chinese
 30 pages (US$ 5)
2. Corpus-Based Frequency Count of Words in Journal Chinese
 300 pages (US$ 20)
3. The Most Frequent Verbs in Journal Chinese and Their
 Classification
 140 pages (US$ 10)
4. The Most Frequent Nouns in Journal Chinese and Their
 Classification
 150 pages (US$ 10)

The first report lists the 5,666 distinct characters that appear in
the entire corpus. The second report contains the 42,686 words that
occur more than three times in the corpus. The most common 14,956
words constitute more than 99.9995 percent of all the words
occurring in the corpus. The third and fourth reports list
19,907 verbs and 21,368 nouns, respectively, that occur more than
twice in the corpus, together with their syntactic or semantic
classification.

To order, please list the desired title(s) and enclose a cheque
for the appropriate amount payable to the Computational Linguistic
Society of the R.O.C. (ROCLING). The prices listed above include
postage and handling.

 Address : Miss Tsai Shu-hui
 ROCLING
 Institute of Information Science
 Academia Sinica, Nankang
 Taipei, Taiwan 11529
 R.O.C.

 Tel. : 886-2-788-1638
 Fax : 886-2-788-1638
 E-Mail : rocltshiis.sinica.edu.tw
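
The kinds of figures quoted above (number of distinct items, items occurring
more than a given number of times, and the share of the corpus covered by the
most frequent items) can be illustrated with a short sketch. This is not the
CKIP group's method, which the announcement does not describe; it assumes the
text has already been segmented into tokens, the function names
frequency_report and coverage are hypothetical, and the data are made up.

```python
from collections import Counter


def frequency_report(tokens, min_count):
    """Return (item, count) pairs occurring MORE than min_count times,
    ordered from most to least frequent."""
    counts = Counter(tokens)
    return [(item, n) for item, n in counts.most_common() if n > min_count]


def coverage(tokens, top_n):
    """Return the fraction of all token occurrences accounted for by the
    top_n most frequent distinct items."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total


# Toy, made-up token list; not drawn from the newspaper corpus.
toy_tokens = ["a", "b", "a", "c", "a", "b", "d"]

print(len(set(toy_tokens)))             # number of distinct items
print(frequency_report(toy_tokens, 1))  # items occurring more than once
print(coverage(toy_tokens, 2))          # share covered by the top two items
```

The same two functions apply to characters as well as words: passing a string
instead of a list of words counts individual characters.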