Corpus Concordance Sampler

The Collins WordbanksOnline English corpus is composed of 56 million words of contemporary written and spoken text. To get a flavour of the type of linguistic data that a corpus like this can provide, you can type in some simple queries here and get a display of concordance lines from the corpus. The query syntax allows you to specify word combinations, wildcards, part-of-speech tags, and so on.


Type in your query:

Which sub-corpora should be searched?

British books, ephemera, radio, newspapers, magazines (36m words)
American books, ephemera and radio (10m words)
British transcribed speech (10m words) 

To get sample concordances, press this button:
To set concordance width (in characters), make a selection:    

Note that output from this demo facility will be restricted to 40 lines of concordance, each with a maximum width of 250 characters. The lines to be displayed will be selected at random.
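
For readers unfamiliar with concordance lines, the following is a minimal sketch of how keyword-in-context lines of a given width might be built and then sampled at random. It is not the demo's actual code: the function name concordance, the token-list input and the centring logic are all assumptions made for illustration. Only the two limits (40 lines, 250 characters) come from the note above.

```python
import random

MAX_LINES = 40   # demo limit quoted in the note above
MAX_WIDTH = 250  # demo limit quoted in the note above


def concordance(tokens, node, width=80, seed=None):
    """Return up to MAX_LINES randomly chosen keyword-in-context lines.

    Hypothetical helper: `tokens` is a list of words, `node` the search word.
    """
    width = min(width, MAX_WIDTH)
    # Characters of context on each side; 2 allows for the spaces around the node.
    half = max(1, (width - len(node) - 2) // 2)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != node.lower():
            continue
        # Each token is at least one character, so `half` tokens is enough context.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        lines.append(f"{left.rjust(half)} {tok} {right.ljust(half)}")
    if len(lines) > MAX_LINES:
        # Random selection, as the note describes for the demo output.
        lines = random.Random(seed).sample(lines, MAX_LINES)
    return lines


if __name__ == "__main__":
    text = "the dog did not bark at the other dog".split()
    for line in concordance(text, "dog", width=30):
        print(line)
```

The search word is aligned in the same column on every line, which is what makes recurring patterns to its left and right easy to spot.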


Collocation Sampler

Type in your word:

Select a significance score to be calculated:

Mutual Information
T-score

To get collocations, press this button:

Note that output from this demo facility will be restricted to 100 collocates. These will be the statistically most significant ones according to the score you have selected.
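
This page does not say which formulas the software uses for the two scores. As a rough guide only, the sketch below uses the definitions most often given in corpus-linguistics textbooks: Mutual Information as the base-2 logarithm of observed over expected co-occurrences, and T-score as the difference between observed and expected divided by the square root of observed. The function names, the span parameter and the way the expected count is estimated are assumptions, not a description of the demo.

```python
import math


def expected_cooccurrences(f_node, f_collocate, corpus_size, span):
    """Expected co-occurrences if the two words were independent.

    Assumption: `span` is the total number of word positions examined
    around each occurrence of the node word.
    """
    return f_node * f_collocate * span / corpus_size


def mutual_information(observed, expected):
    """Textbook MI: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected)


def t_score(observed, expected):
    """Textbook t-score: (observed - expected) / sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)


if __name__ == "__main__":
    # Invented frequencies, for illustration only.
    e = expected_cooccurrences(f_node=1000, f_collocate=2000,
                               corpus_size=56_000_000, span=8)
    print(round(mutual_information(50, e), 2), round(t_score(50, e), 2))
```

In general, Mutual Information tends to favour rare but strongly associated collocates, while T-score tends to favour frequent ones, so the two lists for the same word can look quite different.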


Query Syntax

Overview

A query is made up of one or more terms concatenated with a + symbol. E.g. hell+hole searches for the word "hell" immediately followed by the word "hole".

Terms may be made up of simple alphabetic strings, optionally modified with a trailing asterisk or 'at'-symbol, joined into sets with vertical bars, or followed by an oblique stroke and a part-of-speech tag.

Word combinations

The plus may be modified with a number placed directly after it to indicate the maximum number of intervening words. E.g. dog+4bark will search for "dog" followed by "bark" with up to 4 words intervening.
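
To make "up to 4 words intervening" concrete, here is a small sketch of the matching rule applied to a list of words. The function find_with_gap is hypothetical and is not how the corpus software is implemented. It only illustrates that a gap of 0 means "immediately followed by" and a gap of n allows anything from 0 to n words in between.

```python
def find_with_gap(tokens, first, second, max_gap=0):
    """Return (i, j) index pairs where tokens[i] == first, tokens[j] == second,
    and at most max_gap words lie between them."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # j - i - 1 words intervene, so j may run up to i + 1 + max_gap.
        last = min(i + 1 + max_gap, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    words = "the dog did not even bark".split()
    print(find_with_gap(words, "dog", "bark", max_gap=4))  # like dog+4bark
    print(find_with_gap(words, "dog", "bark", max_gap=2))  # like dog+2bark
```

In the example sentence three words separate "dog" and "bark", so the first call finds the pair and the second does not.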

Inflected Forms

An at-sign (@) appended to a string of letters causes the software to expand the wordform preceding the @ symbol into a set of inflected forms. For example, the query blew@+away will search for the set of words blow blows blowing blew followed by the word away.

Trailing wildcard

An asterisk appended to a string of letters indicates a wildcard match for all characters at the end of a word. Be careful with this feature: in a large corpus there are a surprising number of matching words for any given prefix string. Using cut* to get instances of "cut", "cuts" and "cutting" is probably a bad idea, because it will also match unrelated words such as "cute" and "cutlery".

Word sets

Words (or wildcard words) can be strung together with vertical bars to match an explicit set of words. E.g. cut|cuts|cutting
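
The difference between a trailing wildcard and an explicit word set can be seen on a toy vocabulary. The sketch below is illustrative only: matches_term is a hypothetical helper, the word list is invented, and the real software may match differently (for instance in its treatment of upper and lower case).

```python
def matches_term(word, term):
    """True if `word` matches any '|'-separated alternative in `term`.

    An alternative ending in '*' matches any word starting with the
    letters before the asterisk; otherwise the match must be exact.
    """
    for alternative in term.split("|"):
        if alternative.endswith("*"):
            if word.startswith(alternative[:-1]):
                return True
        elif word == alternative:
            return True
    return False


if __name__ == "__main__":
    vocabulary = ["cut", "cuts", "cutting", "cute", "cutlery",
                  "cutlass", "cuttlefish", "cup"]
    print([w for w in vocabulary if matches_term(w, "cut*")])
    print([w for w in vocabulary if matches_term(w, "cut|cuts|cutting")])
```

The wildcard picks up every word beginning with "cut", whereas the word set returns only the three forms that were asked for.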

Part-of-speech tags

The corpus has been tagged automatically with a statistical tagger. You can specify a search on word/TAG combinations by appending an oblique stroke and a part-of-speech tag. POS tags must be in uppercase. Here are the POS tags that you can search for:
NOUN    a macro tag: stands for any noun tag
VERB    a macro tag: stands for any verb tag
NN      common noun
NNS     noun plural
JJ      adjective
DT      definite and indefinite article
IN      preposition
RB      adverb
VB      base-form verb
VBN     past participle verb
VBG     -ing form verb
VBD     past tense verb
CC      coordinating conjunction (e.g. "and" or "but")
CS      subordinating conjunction (e.g. "while", "because")
PPS     personal pronoun subject case (e.g. "she", "I")
PPO     personal pronoun object case (e.g. "her", "me")
PPP     possessive pronoun (e.g. "hers", "mine")
DTG     determiner-pronoun ("many", "all", "both", "some" etc.)

Putting it all together

Word sets, wildcards and part-of-speech tags can be combined within a term. The oblique stroke binds more tightly than the vertical bar, so that fool|fools|fooling|fooled/VERB applies the VERB restriction only to the wordform "fooled". To group alternative wordforms, use round brackets: e.g. (fool|fools|fooling|fooled)/VERB.

As long as there is at least one literal wordform in your query, you can search for POS tags in the context of a wordform. E.g.
rather+JJ
will display lines in which the word "rather" is immediately followed by an adjective.
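
As a summary of the rules in this section, the sketch below breaks a query string into its parts: the terms joined by +, the optional gap number after each +, the alternatives within each term, and any part-of-speech tag. It is a reading aid, not the parser the corpus software uses. The function names and the dictionary layout are invented, and treating a bare uppercase term such as JJ as a tag whenever it appears in the tag list is an assumption. Asterisks and at-signs are simply left attached to the wordform.

```python
import re

POS_TAGS = {"NOUN", "VERB", "NN", "NNS", "JJ", "DT", "IN", "RB", "VB",
            "VBN", "VBG", "VBD", "CC", "CS", "PPS", "PPO", "PPP", "DTG"}


def parse_alternative(text, group_tag=None):
    """Split 'word/TAG' into its parts; a bare listed tag has no wordform."""
    word, _, tag = text.partition("/")
    if not tag and word in POS_TAGS:
        return {"word": None, "tag": word}
    return {"word": word, "tag": tag or group_tag}


def parse_term(term):
    """Parse one term: alternatives separated by '|', optionally bracketed."""
    group_tag = None
    bracketed = re.fullmatch(r"\((.*)\)(?:/([A-Z]+))?", term)
    if bracketed:
        # A tag after round brackets applies to every alternative inside them.
        term, group_tag = bracketed.group(1), bracketed.group(2)
    return [parse_alternative(alt, group_tag) for alt in term.split("|")]


def parse_query(query):
    """Return a list of steps, each with a gap and a list of alternatives."""
    # The capturing group keeps the digits after each '+', e.g. the 4 in dog+4bark.
    parts = re.split(r"\+(\d*)", query)
    steps = [{"gap": None, "alternatives": parse_term(parts[0])}]
    for gap, term in zip(parts[1::2], parts[2::2]):
        steps.append({"gap": int(gap) if gap else 0,
                      "alternatives": parse_term(term)})
    return steps


if __name__ == "__main__":
    for q in ["rather+JJ", "dog+4bark", "(fool|fools|fooling|fooled)/VERB"]:
        print(q, "->", parse_query(q))
```

Parsing fool|fools|fooling|fooled/VERB with and without the round brackets shows the binding rule described above: without brackets only "fooled" carries the VERB tag, with brackets all four wordforms do.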