
KB kranten ngram collection

Abstract

This directory contains yearly counts of word ngrams, for n ranging from 1 to 5, from the KB newspaper collection. For each n, it contains a subdirectory KBngramN with an index, indexes per year, and total frequencies per year.

The bigram folder contains indexes for quickly finding which words occur before or after a given word. A README file explains how these indexes work and how to use them.

The unigram index contains all words that occur at least twice in the whole collection, so only "corpus hapaxes" are removed from the unigram vocabulary. That leaves about 50M (to be precise: 49.514.842) unique unigrams. Of these, 16M occur in only one year, and 9.6M occur just twice (the minimum) in the complete collection. The total number of unigram tokens is 18.437.979.846.

The n-gram indexes for n=2,3,4,5 contain only ngrams that occur strictly more than 5 times in at least one year. In the index files per year, only those ngrams have been kept that occur strictly more than 5 times in that year. This means that the counts for an ngram include only occurrences that fall in years with at least 6 occurrences. Thus the corpus frequency of an ngram is the sum of its year frequencies over those years in which it occurs at least 6 times.
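The thresholding rule can be illustrated with a small awk sketch. The input layout used here (ngram, year, count, tab-separated) is an assumption made for the illustration and is not necessarily the layout of the actual index files; the function name corpus_frequency and the toy rows are likewise hypothetical.

```shell
# corpus_frequency: read tab-separated rows "ngram <TAB> year <TAB> count" on stdin
# (assumed layout) and print, per ngram, the sum of the year counts that are
# strictly greater than 5. Ngrams that never exceed 5 in any year are not printed.
corpus_frequency() {
    awk -F'\t' '$3 > 5 { total[$1] += $3 }
                END { for (g in total) printf "%s\t%d\n", g, total[g] }'
}

# Toy data: "de kat" occurs 7, 5 and 6 times in three years; "een hond" 3 times in one year.
printf 'de kat\t1900\t7\nde kat\t1901\t5\nde kat\t1902\t6\neen hond\t1900\t3\n' | corpus_frequency
```

On the toy data, only the years with 7 and 6 occurrences contribute, so "de kat" gets a corpus frequency of 13, and "een hond" does not appear at all.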

Some statistics

Type and token frequencies

The next table gives, for each n, the vocabulary size (types, second column) and the total number of ngram tokens (third column).

Ngram size    Types          Tokens
KBngram1      49.514.842     18.437.979.846
KBngram2      39.156.451     11.821.165.297
KBngram3      65.169.507      5.808.214.106
KBngram4      47.955.071      2.386.522.277
KBngram5      46.222.852      1.056.997.790
Total        248.018.723     39.510.879.316

Download the data

KBngram1: Index (md5sum: 74a529ce10fea77712a1a5ac228967eb); Indices per year; Total frequencies per year (md5sum: 213061a0e68b041d7d01edc0b002f443)
KBngram2: Index (md5sum: 6cf2eb1859171fe485552821e20cb836); Indices per year; Total frequencies per year (md5sum: 9d2697c43a4036e4a1e466f25e82dee9)
KBngram3: Index (md5sum: 9fb6c3c51854576e106e890eafb5d3fe); Indices per year; Total frequencies per year (md5sum: 798824ff5083f50c86c70b1a0f38ce1a)
KBngram4: Index (md5sum: ef7350aa8900dbec6eaf9b5eb9823a71); Indices per year; Total frequencies per year (md5sum: 55e4e0285acfbd16faddb517ef1cd989)
KBngram5: Index (md5sum: 9c327384357c644ae98c9c32587ad34f); Indices per year; Total frequencies per year (md5sum: aed768028ec02f7777842a289f5a5682)

N.B.: The index files are a few gigabytes each.
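Since the files are large, it may be worth checking a download against the listed md5sum before using it. The following is a minimal sketch, assuming the GNU coreutils md5sum command is available; the helper name verify_md5 and the local file name are hypothetical, and only the checksum value is taken from the list above.

```shell
# verify_md5 FILE EXPECTED_HEX: succeed only if the md5sum of FILE equals EXPECTED_HEX.
verify_md5() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Hypothetical local file name: replace it with the name of the downloaded KBngram1 index.
file=KBngram1-index.tsv.gz
if [ -f "$file" ]; then
    if verify_md5 "$file" 74a529ce10fea77712a1a5ac228967eb; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH"
    fi
fi
```

If the file is not present, the sketch prints nothing; a mismatch usually points to an incomplete or corrupted download.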

Word Before and After

Here are two examples of how to use the preceding and following word indexes. Note that the files are sorted (descending) on CorpusFrequency (the last column).

(marx@mashup3) cat WordBeforeIndex.tsv | awk -F$'\t' '$2~/^maarten$/'|head
sint    maarten 133     37351
van     maarten 142     6845
en      maarten 104     5499
aan     maarten 56      1738
door    maarten 62      1155
heer    maarten 77      971
de      maarten 78      885
jan     maarten 45      735
dat     maarten 50      733

(marx@mashup3) cat WordAfterIndex.tsv |awk -F$'\t' '$1~/^ondeugende/'|head
ondeugende      meisjes         13      2409
ondeugende      vrouwen         3       985
ondeugende      huisvrouwtjes   2       935
ondeugende      hete            4       796
ondeugende      babbelaars      3       704
ondeugende      meid            12      682
ondeugende      jongen          50      646
ondeugende      kinderen        49      536
ondeugende      meiden          4       536
ondeugende      streken         36      510
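The lookups above use regular-expression matching, and the second one (/^ondeugende/) would also match longer words that merely start with "ondeugende". If exact matching is what is wanted, a string comparison avoids that. The helper below is a sketch, not part of the collection: the name lookup_word and its arguments are hypothetical, and the only assumption about the files is that they are tab-separated, with the given word in column 2 of WordBeforeIndex.tsv and in column 1 of WordAfterIndex.tsv, as in the examples.

```shell
# lookup_word COL WORD [N]: read tab-separated rows on stdin and print the first N rows
# (default 10) whose column COL equals WORD exactly. Because the index files are sorted
# by descending corpus frequency, these are the N most frequent neighbours.
lookup_word() {
    awk -F'\t' -v col="$1" -v w="$2" -v n="${3:-10}" \
        '$col == w { print; if (++seen >= n) exit }'
}

# Toy rows: the first two are taken from the example output above; the third row is
# invented to show that a word that merely starts with "maarten" is not matched.
printf 'sint\tmaarten\t133\t37351\nvan\tmaarten\t142\t6845\nsint\tmaartens\t1\t1\n' |
    lookup_word 2 maarten
```

On the real files this would be used as lookup_word 2 maarten < WordBeforeIndex.tsv or lookup_word 1 ondeugende < WordAfterIndex.tsv. Stopping inside awk after N matches also avoids reading the rest of a multi-gigabyte file.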
        

Note on hapaxes in the unigrams

To keep the index manageable, we removed unigrams that occurred just once in the complete corpus from the index. Note, however, that the hapaxes are still in the yearly index files in indexesPerYear, so they can in principle be recovered.
This slight mismatch between the Index and the indexes per year means that the file with yearly counts, 1gram-TotalYearFrequencies.csv, gives per year the complete vocabulary plus the total number of unigrams. The hapaxes are also counted in the total number of unigrams. Only in the total vocabulary size are the hapaxes NOT counted (as this is calculated as the number of lines in the Index).

Due to OCR errors the number of hapaxes per year is very large. As an example, consider 1926. It has 12.8M unique unigrams, of which 10.4M are hapaxes.

(marx@mashup3) zcat 1926-1gram-Min1.csv.gz| awk -F$'\t' '$3==1'|wc -l
10430964
(marx@mashup3) zcat 1926-1gram-Min1.csv.gz|wc -l
12820204
(marx@mashup3) for n in `seq 1 10`; do c=`zcat 1926-1gram-Min1.csv.gz| awk -F$'\t' -v n=$n '$3==n'|wc -l`; echo -e "$n\t$c";done
1       10430964
2       965511
3       372856
4       204135
5       127698
6       88945
7       65633
8       50531
9       40468
10      33437
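The loop above decompresses and scans the yearly file once per frequency value. The same distribution can be computed in a single pass; the sketch below assumes, as in the commands above, that the frequency is in the third tab-separated column. The function name freq_histogram is hypothetical.

```shell
# freq_histogram MAX: read tab-separated rows on stdin (frequency assumed in column 3)
# and print, for each frequency 1..MAX, the number of rows with exactly that frequency.
freq_histogram() {
    awk -F'\t' -v max="$1" '
        $3 >= 1 && $3 <= max { count[$3 + 0]++ }
        END { for (f = 1; f <= max; f++) printf "%d\t%d\n", f, count[f] }'
}

# Toy rows with frequencies 1, 1, 2 and 7 in the third column.
printf 'a\tx\t1\nb\tx\t1\nc\tx\t2\nd\tx\t7\n' | freq_histogram 3
```

On the real data the call would be zcat 1926-1gram-Min1.csv.gz | freq_histogram 10, which should produce the same ten lines as the loop while reading the file only once.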