This directory contains yearly counts for word ngrams, for n ranging from 1 to 5, from the KB newspaper collection. For each n, it contains a subdirectory KBngramN with files in the following formats:
ngram YearFrequency CorpusFrequency "Array with Year:Frequency pairs for all non-zero years"
#Total number of ngram types and tokens per year
#Year # of Types # of tokens
This information is provided for normalization. Using the number of tokens per year, one can report the percentage of tokens matching a word rather than the absolute frequencies. Because the corpus size varies greatly from year to year, this is useful (it is also the default option in the Google ngram viewer).
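The normalization described above can be sketched with a small awk helper. This is only an illustration under stated assumptions: the totals file is assumed to be tab-separated with the columns Year, Types, Tokens and `#`-prefixed header lines, and the per-year counts of the ngram of interest are assumed to sit in a separate tab-separated file with the columns Year, Frequency. The function name `relative_frequency` and the file name `my_counts.tsv` are hypothetical.

```shell
#!/bin/sh
# Sketch: convert absolute yearly counts into percentages of that year's tokens.
# ASSUMPTION: totals file is tab-separated (Year, Types, Tokens); '#' lines are headers.
# ASSUMPTION: counts file is tab-separated (Year, Frequency); this layout is hypothetical.
relative_frequency() {
    totals_file=$1
    counts_file=$2
    awk -F'\t' '
        /^#/ { next }                          # skip header lines in either file
        NR == FNR { tokens[$1] = $3; next }    # first file: remember tokens per year
        ($1 in tokens) && tokens[$1] > 0 {     # second file: normalize each count
            printf "%s\t%.6f\n", $1, 100 * $2 / tokens[$1]
        }
    ' "$totals_file" "$counts_file"
}

# Usage (hypothetical counts file):
#   relative_frequency 1gram-TotalYearFrequencies.csv my_counts.tsv
```

Each output line holds a year and the share of that year's tokens, in percent. Years missing from the totals file are skipped silently, which may or may not be what you want.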
The bigram folder contains indexes to quickly find which words occur before or after a given word. A README file explains how they work and how to use them.
The unigram index contains all words which occur at least twice in the whole collection. Thus only "corpus-hapaxes" are removed from the unigram vocabulary. That leaves 50M (to be precise: 49.514.842) unique unigrams. Of these, 16M occur in only one year, and 9.6M occur just twice (the minimum) in the complete collection. The total number of unigram tokens is 18.437.979.846.
The n-gram indexes for n=2,3,4,5 contain only ngrams that occur strictly more than 5 times in at least one year. In the index files per year, only those ngrams have been kept that occur strictly more than 5 times in that year. This means that the counts for an ngram include only those occurrences which fall in years with at least 6 occurrences. Thus the CorpusFrequency of an ngram is the sum of its year frequencies over those years in which it occurs at least 6 times.
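To make the thresholding rule concrete, here is a minimal sketch that applies it to the raw yearly counts of a single ngram. It is not the script used to build the indexes. The input layout (tab-separated lines of Year and raw frequency on standard input) and the function name `thresholded_corpus_frequency` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the rule: a year contributes only if the ngram occurs
# strictly more than 5 times (i.e. at least 6 times) in that year.
# ASSUMPTION: stdin holds tab-separated lines "Year<TAB>RawFrequency" for one ngram.
thresholded_corpus_frequency() {
    awk -F'\t' -v min=6 '
        $2 >= min { kept++; total += $2 }        # keep only years at or above the minimum
        END { printf "%d\t%d\n", kept, total }   # years kept, summed frequency
    '
}

# Usage with made-up counts (not taken from the corpus):
#   printf '1900\t3\n1901\t6\n1902\t10\n1903\t5\n' | thresholded_corpus_frequency
```

With these made-up counts, only 1901 and 1902 pass the threshold, so the function reports 2 years and a summed frequency of 16, although the raw total over all four years is 24.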
The next table gives for each N the vocabulary size (in the second column) and the total number of ngram-tokens.
|KBngram1||Index(md5sum: 74a529ce10fea77712a1a5ac228967eb)||Indices per year||Total frequencies per year (md5sum: 213061a0e68b041d7d01edc0b002f443)|
|KBngram2||Index(md5sum: 6cf2eb1859171fe485552821e20cb836)||Indices per year||Total frequencies per year (md5sum: 9d2697c43a4036e4a1e466f25e82dee9)|
|KBngram3||Index(md5sum: 9fb6c3c51854576e106e890eafb5d3fe)||Indices per year||Total frequencies per year (md5sum: 798824ff5083f50c86c70b1a0f38ce1a)|
|KBngram4||Index(md5sum: ef7350aa8900dbec6eaf9b5eb9823a71)||Indices per year||Total frequencies per year (md5sum: 55e4e0285acfbd16faddb517ef1cd989)|
|KBngram5||Index(md5sum: 9c327384357c644ae98c9c32587ad34f)||Indices per year||Total frequencies per year (md5sum: aed768028ec02f7777842a289f5a5682)|
N.B.: The index files are a few gigabytes each.
Here are two examples of how to use the preceding and following word indexes. Note that the files are sorted (descending) on CorpusFrequency (the last column).
(marx@mashup3) cat WordBeforeIndex.tsv | awk -F$'\t' '$2~/^maarten$/'|head
sint maarten 133 37351
van maarten 142 6845
en maarten 104 5499
aan maarten 56 1738
door maarten 62 1155
heer maarten 77 971
de maarten 78 885
jan maarten 45 735
dat maarten 50 733

(marx@mashop3) cat WordAfterIndex.tsv |awk -F$'\t' '$1~/^ondeugende/'|head
ondeugende meisjes 13 2409
ondeugende vrouwen 3 985
ondeugende huisvrouwtjes 2 935
ondeugende hete 4 796
ondeugende babbelaars 3 704
ondeugende meid 12 682
ondeugende jongen 50 646
ondeugende kinderen 49 536
ondeugende meiden 4 536
ondeugende streken 36 510
To keep the index manageable, we removed unigrams which occurred just once in the complete corpus from the index.
Note that the hapaxes are still in the yearly index files in indexesPerYear.
Thus they can in principle be computed.
Because of this slight mismatch between the Index and the indexes per year, the file with yearly counts, 1gram-TotalYearFrequencies.csv, lists per year the complete vocabulary plus the total number of unigrams. The hapaxes are also counted in the total number of unigrams. Only in the total vocabulary size are the hapaxes NOT counted (as this is calculated as the number of lines in the Index).
Due to OCR errors the number of hapaxes per year is very large. As an example, consider 1926. It has 12.8M unique unigrams, of which 10.4M are hapaxes.
(marx@mashup3) zcat 1926-1gram-Min1.csv.gz| awk -F$'\t' '$3==1'|wc -l
10430964
(marx@mashup3) zcat 1926-1gram-Min1.csv.gz|wc -l
12820204
(marx@mashup3) for n in `seq 1 10`; do c=`zcat 1926-1gram-Min1.csv.gz| awk -F$'\t' -v n=$n '$3==n'|wc -l`; echo -e "$n\t$c";done
1 10430964
2 965511
3 372856
4 204135
5 127698
6 88945
7 65633
8 50531
9 40468
10 33437
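The loop above decompresses the file once per frequency value. As an alternative sketch (not the commands used to produce the numbers above), the same tally can be collected in a single pass. It assumes, as the awk filters above do, that the input is tab-separated with the frequency in the third column. The function name `frequency_histogram` is hypothetical.

```shell
#!/bin/sh
# Sketch: count how many unigrams have each frequency 1..max, reading the input once.
# ASSUMPTION: tab-separated input on stdin with the frequency in column 3.
frequency_histogram() {
    max=${1:-10}                                   # highest frequency to report (default 10)
    awk -F'\t' -v max="$max" '
        $3 >= 1 && $3 <= max { count[$3 + 0]++ }   # tally frequencies within range
        END {
            for (n = 1; n <= max; n++)             # print one line per frequency, 0 if absent
                printf "%d\t%d\n", n, count[n]
        }
    '
}

# Usage:
#   zcat 1926-1gram-Min1.csv.gz | frequency_histogram 10
```

The output has the same two-column shape as the loop's output: the frequency, a tab, and the number of unigrams with that frequency.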