Many thanks to Penni Sibun for noticing the anomaly she reported in 5.593. In fact, in my message dated Sun, 15 May 94 23:11:52 BST in 5.555, ALL THE COUNTS (except Danish and Swedish) ARE IN ERROR, as a result of a classic UN*X goof: in using grep to get the text lines from the corpora, I got file names (hence the high count for / which Penni noticed) and line numbers on every line! Many apologies to anyone led astray by the bogus numbers; a better set follows. (I don't dare say a CORRECT set at this point -- buy the CD and do your own checking! Orders to elsnet@cogsci.ed.ac.uk.)

The following was computed quickly [TOO quickly, the first time--ht] on the basis of some of the material now available on the Multilingual Corpus 1 CD-ROM from the European Corpus Initiative. Note that these are raw counts, and that in particular the counts for the upper-case characters have NOT been folded in. Also note that ISO-8859-1 (ISO Latin 1) has been used throughout, so the third column will not have survived being mailed through 7-bit mailers.

[Note this answers Penni's second question -- we believe no escape sequences remain in the corpora as distributed; all have been converted to ISO Latin 1.]

[Wrt Penni's third question, these are raw counts, and ^J is included just as <space> is -- it would be misleading to merge those counts without checking first that none of the sub-corpora I've counted over include line-final soft hyphens, which I haven't done, although I don't THINK there are any.]
----------------------------

Dutch                     German                    English
37312899 bytes total:     60009192 bytes total:     15803864 bytes total:
char code                 char code                 char code
dec oct  char    count    dec oct  char    count    dec oct  char    count
101 \145 e     5525869    101 \145 e     7608479     32 \40  sp    2472774
 32 \40  sp    5011391     32 \40  sp    6912597    101 \145 e     1515990
110 \156 n     2935636    110 \156 n     4627187    116 \164 t     1204028
 97 \141 a     2216489    114 \162 r     3635022     97 \141 a      956007
105 \151 i     1975708    105 \151 i     3591895    111 \157 o      951382
116 \164 t     1963544    116 \164 t     2936208    110 \156 n      894681
114 \162 r     1921217     97 \141 a     2636054    105 \151 i      865061
111 \157 o     1739080    115 \163 s     2596653    115 \163 s      731842
100 \144 d     1583389    100 \144 d     2063950    114 \162 r      723496
115 \163 s     1221229    104 \150 h     1947435    104 \150 h      658534
108 \154 l     1118443    117 \165 u     1795680    100 \144 d      450392
103 \147 g      945101    108 \154 l     1738965    108 \154 l      410549
 10 \12  ^J     773499     10 \12  ^J    1622880     99 \143 c      352905
118 \166 v      734867    103 \147 g     1345448    109 \155 m      305783
107 \153 k      671580     99 \143 c     1239132    117 \165 u      303818
104 \150 h      667855    111 \157 o     1238644     10 \12  ^J     285088
109 \155 m      660578    109 \155 m     1121373    102 \146 f      259597
117 \165 u      580741     98 \142 b      826480    112 \160 p      216147
112 \160 p      442242    102 \146 f      730349    103 \147 g      201047
 98 \142 b      436117     46 \56  .      594979     98 \142 b      195540

French                    Italian                   Spanish
38021456 bytes total:     2469488 bytes total:      13958952 bytes total:
char code                 char code                 char code
dec oct  char    count    dec oct  char    count    dec oct  char    count
 32 \40  sp    5361058     32 \40  sp     345678     32 \40  sp    1965055
101 \145 e     4131518    105 \151 i      223913    101 \145 e     1409861
115 \163 s     2335470    101 \145 e      217182     97 \141 a     1231944
 97 \141 a     2272281     97 \141 a      203107    111 \157 o      878141
110 \156 n     2267706    111 \157 o      177112    105 \151 i      792062
105 \151 i     2225502    110 \156 n      139952    110 \156 n      787118
116 \164 t     2162710    116 \164 t      131164    115 \163 s      757648
114 \162 r     2027634    114 \162 r      127381    114 \162 r      678584
111 \157 o     1626766    108 \154 l      120421    108 \154 l      609150
117 \165 u     1609134    115 \163 s       95885    100 \144 d      597188
108 \154 l     1581308     99 \143 c       78309    116 \164 t      516170
100 \144 d     1256893    100 \144 d       76409     99 \143 c      500272
 99 \143 c      981676    117 \165 u       55427    117 \165 u      355674
112 \160 p      814895    112 \160 p       53349    112 \160 p      248288
233 \351 é      782450     10 \12  ^J      48042    109 \155 m      243311
109 \155 m      757723    109 \155 m       47094     10 \12  ^J     234992
 10 \12  ^J     636220    103 \147 g       31753     98 \142 b      189891
 44 \54  ,      419417    118 \166 v       28350     46 \56  .      126337
118 \166 v      396152     44 \54  ,       22249     44 \54  ,      121622
 39 \47  '      348276    122 \172 z       20445    103 \147 g      108702

Danish                    Norwegian                 Swedish
153289 bytes total:       11658190 bytes total:     2055441 bytes total:
char code                 char code                 char code
dec oct  char    count    dec oct  char    count    dec oct  char    count
 32 \40  sp      19648     32 \40  sp    2125882     32 \40  sp     316189
101 \145 e       18385    101 \145 e     1339855    101 \145 e      162444
114 \162 r       10590    110 \156 n      717039     97 \141 a      151823
116 \164 t        9123    116 \164 t      688487    116 \164 t      143871
110 \156 n        9101    114 \162 r      625776    110 \156 n      142849
105 \151 i        8443     97 \141 a      585927    114 \162 r      141769
115 \163 s        7671    115 \163 s      488037    115 \163 s      105005
 97 \141 a        6592    105 \151 i      455922    105 \151 i       91171
111 \157 o        6580    100 \144 d      436354    108 \154 l       85919
100 \144 d        6216    108 \154 l      403734    100 \144 d       68517
108 \154 l        6014    111 \157 o      394917    111 \157 o       67460
103 \147 g        5303    103 \147 g      368560    109 \155 m       56034
109 \155 m        4306    107 \153 k      360533    103 \147 g       53521
107 \153 k        4018     10 \12  ^J     330461    107 \153 k       52427
102 \146 f        3457    109 \155 m      270585    118 \166 v       37356
 10 \12  ^J       2878    118 \166 v      229530    228 \344 ä       35293
118 \166 v        2759    104 \150 h      222695    102 \146 f       31486
117 \165 u        2049     46 \56  .      167689    104 \150 h       30990
112 \160 p        1886    229 \345 å      167028    117 \165 u       28861
 46 \56  .        1650    117 \165 u      160796    229 \345 å       27498
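
For anyone who wants to do their own checking, here is a minimal sketch in Python of how raw byte counts in the layout above (dec, oct, char, count) might be produced. It is NOT the procedure used for the figures in this message. The function names, the sample lines and the file name eng/sample.txt are all made up for illustration. It assumes the input is ISO-8859-1, so one byte is one character, and it does no case folding. The helper with_grep_prefix imitates the goof described above, where every line picked up a file name and a line number.

```python
from collections import Counter


def label(code: int) -> str:
    """Printable label for a byte value, using the same names as the tables."""
    if code == 0x20:
        return "sp"
    if code < 0x20:
        return "^" + chr(code + 64)  # control characters, e.g. 10 -> ^J
    return bytes([code]).decode("latin-1")  # ISO-8859-1 maps bytes 1:1


def byte_counts(data: bytes) -> Counter:
    """Raw counts per byte value: no case folding, newlines counted too."""
    return Counter(data)


def format_table(counts: Counter, top: int = 20) -> str:
    """Rows of dec / oct / char / count for the `top` most frequent bytes."""
    # Sort by descending count, then by byte value so ties come out stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(
        f"{code:>3} \\{code:<3o} {label(code):<4} {n:>8}"
        for code, n in ranked[:top]
    )


def with_grep_prefix(lines, filename):
    """Imitate grep output that carries a file name and line number per line."""
    return [f"{filename}:{i}:{line}" for i, line in enumerate(lines, start=1)]


if __name__ == "__main__":
    lines = ["the cat sat", "on the mat"]  # made-up sample text
    clean = ("\n".join(lines) + "\n").encode("latin-1")
    prefixed = with_grep_prefix(lines, "eng/sample.txt")  # hypothetical path
    dirty = ("\n".join(prefixed) + "\n").encode("latin-1")

    print(format_table(byte_counts(clean), top=5))
    print()
    # Bytes that appear only because of the prefixes: '/', ':', '.',
    # digits and the letters of the file name.
    print(format_table(byte_counts(dirty) - byte_counts(clean)))
```

Run on a real corpus, byte_counts would be given the file contents read in binary mode. The second table printed by the demo shows the kind of spurious counts (for / in particular) that prefixed lines add.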