The Old Bailey Corpus


The Old Bailey Corpus 2.0 has been released and integrated into the CLARIN-D infrastructure. This website will soon redirect to the new official landing page of the Old Bailey Corpus.

The Old Bailey Corpus 2.0 XML files can already be downloaded here.

The OBC 2.0 is searchable through CQPweb as well.

The persistent identifier for the Old Bailey Corpus 2.0 is:

About the project

The Proceedings of the Old Bailey, London's central criminal court, were published from 1674 to 1913 and constitute a large body of texts from the beginning of Present Day English. The 2,163 volumes contain almost 200,000 trials, totalling ca. 134 million words. Since the proceedings were taken down in shorthand by scribes in the courtroom, the verbatim passages are arguably as near as we can get to the spoken word of the period. The material thus offers the rare opportunity of analysing spoken language in a period that has been neglected both with regard to the compilation of primary linguistic data and the description of the structure, variability, and change of English.

The Old Bailey Corpus is based on the Proceedings of the Old Bailey and documents spoken English from 1720 to 1913. We are indebted to Robert Shoemaker (Department of History, Humanities Research Institute, University of Sheffield) and Tim Hitchcock (Department of History and Social Sciences, University of Hertfordshire), who kindly provided us with digitized transcripts of the Old Bailey Proceedings.

Turning the digitized Proceedings into the linguistic Old Bailey Corpus involved three main steps:

  • locating and tagging direct speech in the Proceedings with the help of Perl and Python scripts (this identified ca. 113 million words of spoken English),
  • part-of-speech tagging of the Proceedings using the CLAWS 7 tagset, and
  • compiling the Old Bailey Corpus: a balanced subset of the Proceedings with detailed sociolinguistic annotation of every utterance, based on sociobiographical speaker data found in the context of the trials (407 Proceedings, ca. 318,000 utterances, ca. 14 million spoken words, ca. 750,000 spoken words/decade).

We gratefully acknowledge the support of the German Science Foundation (DFG, HU 884/6-1, HU 884/6-2) in creating the Old Bailey Corpus.

The Old Bailey Corpus is the largest diachronic collection of spoken English with this level of detail in utterance-level sociolinguistic annotation. Almost 200 years of spoken Early Present Day English are annotated for the following sociobiographical, pragmatic and textual parameters:

  • sociobiographical speaker information: gender, age, occupation (according to the Historical International Standard Classification of Occupations, HISCO), social class (according to HISCLASS, a social class scheme based on HISCO).
  • pragmatic information: speaker role in the courtroom: defendant, judge, victim, witness ...
  • textual information: scribe, printer and publisher of the Proceedings.

The time span covered by the Old Bailey Corpus and the available sociobiographical speaker information make it ideally suited to fine-grained studies, including historical sociolinguistic approaches.

In addition, because of their sheer size the Proceedings are a valuable textual source for the analysis of low-frequency features. For instance, an analysis of the present and past tense forms of the ten most frequent verbs in the Proceedings (know, go, see, say, take, live, come, give, get, tell) shows that overt inflection in the first person singular (e.g. I says, mostly with past reference) has a very low frequency of just over 0.1%. Analysing such marginal phenomena is impossible with most existing historical corpora, which run to a couple of million words at most.

By contrast, since the total number of 1sg forms of these ten verbs amounts to over half a million tokens in the Proceedings, 0.1% corresponds to 547 tokens of inflected 1sg forms, enough for a basic multivariate analysis.
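The size argument can be made concrete with a rough calculation. The sketch below uses only the figures quoted above (ca. 134 million words, 547 inflected 1sg tokens) and assumes, as a simplification, that the feature occurs at a constant rate per word. The 2-million-word comparison corpus and the function names are illustrative assumptions, not part of the project.

```python
# Figures quoted in the text; everything else is an illustrative assumption.
PROCEEDINGS_WORDS = 134_000_000   # ca. 134 million words in the Proceedings
INFLECTED_1SG_TOKENS = 547        # inflected 1sg forms of the ten verbs


def expected_tokens(corpus_words: int) -> float:
    """Scale the observed count to a corpus of another size.

    Assumes a constant rate of the feature per word, which is a
    simplification: real rates vary by period, genre and speaker.
    """
    return INFLECTED_1SG_TOKENS * corpus_words / PROCEEDINGS_WORDS


def implied_total_1sg(share: float = 0.001) -> float:
    """Total number of 1sg forms implied if 547 tokens were exactly `share`."""
    return INFLECTED_1SG_TOKENS / share


if __name__ == "__main__":
    small_corpus = 2_000_000  # hypothetical "couple of million words" corpus
    print(f"Expected inflected 1sg tokens in {small_corpus:,} words: "
          f"{expected_tokens(small_corpus):.1f}")
    print(f"Total 1sg forms implied at exactly 0.1%: "
          f"{implied_total_1sg():,.0f}")
```

Under these assumptions a 2-million-word corpus would be expected to yield fewer than ten relevant tokens, too few for a multivariate analysis, while a share of exactly 0.1% implies about 547,000 1sg forms in total, consistent with "over half a million".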

For an overview of the corpus see

For detailed background information on the Old Bailey and the publication history of the Proceedings consult the excellent Old Bailey Proceedings Online.

The OBC is currently (2015) being integrated into the German section of the Common Language Resources and Technology Infrastructure (CLARIN-D) to achieve sustainability (persistent storage and access). The project is funded by the Federal Ministry of Education and Research. The OBC will be hosted at the CLARIN-D Service Centre of Saarland University.

