LINGUIST List 5.641

Fri 03 Jun 1994

FYI: Letter frequency information--correction


Directory

  1. "Henry S. Thompson", 5.593, 5.555 Letter frequency information -- NOTICE OF ERROR

Message 1: 5.593, 5.555 Letter frequency information -- NOTICE OF ERROR

Date: Tue, 24 May 94 17:47:30 BST
From: "Henry S. Thompson" <eucorp@cogsci.edinburgh.ac.uk>
Subject: 5.593, 5.555 Letter frequency information -- NOTICE OF ERROR

Many thanks to Penni Sibun for noticing the anomaly she reported in
5.593. In fact in my message dated Sun, 15 May 94 23:11:52 BST in
5.555 ALL THE COUNTS (except Danish and Swedish) ARE IN ERROR, as a
result of a classic UN*X goof, i.e. that in using grep to get the text
lines from the corpora, I got file-names (hence the high count for /
which Penni noticed) and line numbers on every line!
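
To make the goof concrete, here is a minimal Python sketch. It is not the command actually used, which this message does not give. It assumes grep was run over several files with -n, so that every output line carries a "path:lineno:" prefix. The file paths and sample lines are invented for illustration.

```python
from collections import Counter

def grep_like_output(files, with_prefix):
    """Mimic grep output for a pattern that matches every line.

    files: dict mapping a (hypothetical) path to its list of text lines.
    with_prefix: if True, prepend "path:lineno:" to each line, as grep
    does when given several files together with -n.
    """
    out = []
    for path, lines in files.items():
        for lineno, line in enumerate(lines, start=1):
            out.append(f"{path}:{lineno}:{line}" if with_prefix else line)
    return out

def char_counts(lines):
    # Count every character, including the newline that ends each line.
    return Counter("".join(line + "\n" for line in lines))

# Hypothetical two-file corpus; paths and contents are made up.
corpus = {
    "eci/dutch/a.txt": ["een twee drie", "vier vijf"],
    "eci/dutch/b.txt": ["zes zeven"],
}

clean = char_counts(grep_like_output(corpus, with_prefix=False))
polluted = char_counts(grep_like_output(corpus, with_prefix=True))

# Characters whose counts were inflated by the prefixes.
for ch in sorted(polluted):
    if polluted[ch] != clean[ch]:
        print(repr(ch), clean[ch], "->", polluted[ch])
```

The slash never occurs in the sample text itself, so its whole count comes from the path prefixes, which is the kind of anomaly that gives the error away.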

Many apologies to anyone led astray by the bogus numbers; a better set follows.
(I don't dare say a CORRECT set at this point -- buy the CD and do
your own checking! Orders to elsnet@cogsci.ed.ac.uk.)

The following was computed quickly [TOO quickly, the first time--ht]
on the basis of some of the material now available on the Multilingual
Corpus 1 CD-ROM from the European Corpus Initiative. Note these are
raw counts, and that in particular the counts for the upper-case
characters have NOT been folded in.
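
For anyone doing their own checking, raw counts of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the procedure used for the figures in this message: the helper names byte_counts and format_row are invented here, the input is assumed to be ISO-8859-1 bytes, and the optional case folding is shown only to illustrate what "folded in" would mean.

```python
from collections import Counter

def byte_counts(data: bytes, fold_case: bool = False) -> Counter:
    """Count byte values; optionally fold upper case into lower case."""
    if fold_case:
        # Decode as ISO-8859-1 so accented capitals are folded as well.
        data = data.decode("latin-1").lower().encode("latin-1")
    return Counter(data)

def format_row(code: int, count: int) -> str:
    """Render one row as: decimal code, octal code, character, count."""
    label = {32: "sp", 10: "^J"}.get(code, chr(code))
    return f"{code:3d} \\{code:o} {label} {count}"

# A tiny made-up sample, encoded as ISO-8859-1.
sample = "Een en één\n".encode("latin-1")

raw = byte_counts(sample)
folded = byte_counts(sample, fold_case=True)

for code, count in raw.most_common(3):
    print(format_row(code, count))
```

With fold_case left at False, "E" and "e" are tallied separately, which is how the counts in this message were left.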

Also note that ISO-8859-1 (ISO Latin 1) has been used throughout, so the
third column will not have survived being mailed through 7-bit mailers.
[Note this answers Penni's second question -- we believe no escape
sequences remain in the corpora as distributed, all have been
converted to ISO Latin 1]
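
As an illustration of what a 7-bit mail path can do to these characters, the sketch below clears the top bit of each ISO-8859-1 byte. Whether any particular mailer mangles text in exactly this way is an assumption; strip_high_bit is a name made up for this example.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, as a 7-bit transport might."""
    return bytes(b & 0x7F for b in data)

for ch in "éäå":
    original = ch.encode("latin-1")
    mangled = strip_high_bit(original)
    print(ch, original[0], "->", mangled[0], mangled.decode("ascii"))
# é 233 -> 105 i
# ä 228 -> 100 d
# å 229 -> 101 e
```

The decimal and octal codes travel as plain ASCII digits, so they remain a reliable guide to which character was meant even when the character itself does not survive.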

[Wrt Penni's third question, these are raw counts, and ^J is included
just as <space> is -- it would be misleading to merge those counts
without checking first that none of the sub-corpora I've counted over
include line-final soft hyphens, which I haven't done, although I
don't THINK there are any.]
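
The check described above might look something like this. It is a sketch only: it treats both the ISO-8859-1 soft hyphen (0xAD) and an ordinary hyphen-minus at the end of a line as suspicious, which is an assumption about what "line-final soft hyphens" covers, and the function names are invented for this example.

```python
SOFT_HYPHEN = 0xAD   # ISO-8859-1 SOFT HYPHEN
HYPHEN_MINUS = 0x2D  # ordinary "-"

def line_final_hyphens(data: bytes) -> int:
    """Count lines whose last byte before the newline is a hyphen."""
    return sum(
        1
        for line in data.split(b"\n")
        if line and line[-1] in (SOFT_HYPHEN, HYPHEN_MINUS)
    )

def merged_whitespace_count(data: bytes) -> int:
    """Add ^J to the space count, but only if no line ends in a hyphen."""
    suspicious = line_final_hyphens(data)
    if suspicious:
        raise ValueError(
            f"{suspicious} line(s) end in a hyphen; a newline there may "
            "split a word rather than separate two words"
        )
    return data.count(b" ") + data.count(b"\n")

print(merged_whitespace_count(b"raw counts\nfor each byte\n"))
print(line_final_hyphens(b"multi-\nlingual corpus\n"))
```

A hyphen-minus at the end of a line can of course be a genuine dash, so a non-zero result calls for a look at the text rather than an automatic correction.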
----------------------------
Dutch                      German                     English
37312899 bytes total:      60009192 bytes total:      15803864 bytes total:
char code                  char code                  char code
dec  oct  char    count    dec  oct  char    count    dec  oct  char    count
101 \145  e     5525869    101 \145  e     7608479     32  \40  sp    2472774
 32  \40  sp    5011391     32  \40  sp    6912597    101 \145  e     1515990
110 \156  n     2935636    110 \156  n     4627187    116 \164  t     1204028
 97 \141  a     2216489    114 \162  r     3635022     97 \141  a      956007
105 \151  i     1975708    105 \151  i     3591895    111 \157  o      951382
116 \164  t     1963544    116 \164  t     2936208    110 \156  n      894681
114 \162  r     1921217     97 \141  a     2636054    105 \151  i      865061
111 \157  o     1739080    115 \163  s     2596653    115 \163  s      731842
100 \144  d     1583389    100 \144  d     2063950    114 \162  r      723496
115 \163  s     1221229    104 \150  h     1947435    104 \150  h      658534
108 \154  l     1118443    117 \165  u     1795680    100 \144  d      450392
103 \147  g      945101    108 \154  l     1738965    108 \154  l      410549
 10  \12  ^J     773499     10  \12  ^J    1622880     99 \143  c      352905
118 \166  v      734867    103 \147  g     1345448    109 \155  m      305783
107 \153  k      671580     99 \143  c     1239132    117 \165  u      303818
104 \150  h      667855    111 \157  o     1238644     10  \12  ^J     285088
109 \155  m      660578    109 \155  m     1121373    102 \146  f      259597
117 \165  u      580741     98 \142  b      826480    112 \160  p      216147
112 \160  p      442242    102 \146  f      730349    103 \147  g      201047
 98 \142  b      436117     46  \56  .      594979     98 \142  b      195540

French                     Italian                    Spanish
38021456 bytes total:      2469488 bytes total:       13958952 bytes total:
char code                  char code                  char code
dec  oct  char    count    dec  oct  char    count    dec  oct  char    count
 32  \40  sp    5361058     32  \40  sp     345678     32  \40  sp    1965055
101 \145  e     4131518    105 \151  i      223913    101 \145  e     1409861
115 \163  s     2335470    101 \145  e      217182     97 \141  a     1231944
 97 \141  a     2272281     97 \141  a      203107    111 \157  o      878141
110 \156  n     2267706    111 \157  o      177112    105 \151  i      792062
105 \151  i     2225502    110 \156  n      139952    110 \156  n      787118
116 \164  t     2162710    116 \164  t      131164    115 \163  s      757648
114 \162  r     2027634    114 \162  r      127381    114 \162  r      678584
111 \157  o     1626766    108 \154  l      120421    108 \154  l      609150
117 \165  u     1609134    115 \163  s       95885    100 \144  d      597188
108 \154  l     1581308     99 \143  c       78309    116 \164  t      516170
100 \144  d     1256893    100 \144  d       76409     99 \143  c      500272
 99 \143  c      981676    117 \165  u       55427    117 \165  u      355674
112 \160  p      814895    112 \160  p       53349    112 \160  p      248288
233 \351  é      782450     10  \12  ^J      48042    109 \155  m      243311
109 \155  m      757723    109 \155  m       47094     10  \12  ^J     234992
 10  \12  ^J     636220    103 \147  g       31753     98 \142  b      189891
 44  \54  ,      419417    118 \166  v       28350     46  \56  .      126337
118 \166  v      396152     44  \54  ,       22249     44  \54  ,      121622
 39  \47  '      348276    122 \172  z       20445    103 \147  g      108702

Danish                     Norwegian                  Swedish
153289 bytes total:        11658190 bytes total:      2055441 bytes total:
char code                  char code                  char code
dec  oct  char    count    dec  oct  char    count    dec  oct  char    count
 32  \40  sp      19648     32  \40  sp    2125882     32  \40  sp     316189
101 \145  e       18385    101 \145  e     1339855    101 \145  e      162444
114 \162  r       10590    110 \156  n      717039     97 \141  a      151823
116 \164  t        9123    116 \164  t      688487    116 \164  t      143871
110 \156  n        9101    114 \162  r      625776    110 \156  n      142849
105 \151  i        8443     97 \141  a      585927    114 \162  r      141769
115 \163  s        7671    115 \163  s      488037    115 \163  s      105005
 97 \141  a        6592    105 \151  i      455922    105 \151  i       91171
111 \157  o        6580    100 \144  d      436354    108 \154  l       85919
100 \144  d        6216    108 \154  l      403734    100 \144  d       68517
108 \154  l        6014    111 \157  o      394917    111 \157  o       67460
103 \147  g        5303    103 \147  g      368560    109 \155  m       56034
109 \155  m        4306    107 \153  k      360533    103 \147  g       53521
107 \153  k        4018     10  \12  ^J     330461    107 \153  k       52427
102 \146  f        3457    109 \155  m      270585    118 \166  v       37356
 10  \12  ^J       2878    118 \166  v      229530    228 \344  ä       35293
118 \166  v        2759    104 \150  h      222695    102 \146  f       31486
117 \165  u        2049     46  \56  .      167689    104 \150  h       30990
112 \160  p        1886    229 \345  å      167028    117 \165  u       28861
 46  \56  .        1650    117 \165  u      160796    229 \345  å       27498