
corpus linguistics

May 12, 2013

Name: Fatimah

NIM: 2201410061

Summary:

Corpus Linguistics

Simpson, James. 2011. The Routledge Handbook of Applied Linguistics. New York: Routledge.

Introduction

Corpus linguistics most commonly refers to the study of machine-readable spoken and written language samples that have been assembled in a principled way for the purpose of linguistics research. Corpus linguistics is concerned with language use in real contexts.

 

Corpus as data

Corpora are designed to represent a particular language variety. Specialized corpora include texts that belong to a particular type, while general corpora include many different types of texts, often assembled to serve as reference resources for linguistic research or to produce reference materials. Historical corpora include texts from different periods of time and allow for the study of language change when compared with corpora from other periods. Monitor corpora can be used for a similar purpose, but tend to focus on current changes in the language. Parallel corpora include texts in at least two languages that have either been directly translated or produced in different languages for the same purpose. Learner corpora contain collections of texts produced by learners of a language.

 

Metadata

            Metadata, or ‘data about data’, are the conventional means of collecting and documenting further information about the collected discourse itself. Metadata are therefore critical in helping a corpus meet the standards of representativeness, balance and homogeneity (see Sinclair 2005).

            Burnard (2005) uses the term metadata as an umbrella term which includes editorial, analytic, descriptive and administrative categories:

a. Editorial metadata: providing information about the relationship between corpus components and their original source.

b. Analytic metadata: providing information about the way in which corpus components have been interpreted and analyzed.

c. Descriptive metadata: providing classificatory information derived from internal or external properties of the corpus components.

d. Administrative metadata: providing documentary information about the corpus itself, such as its title, its availability, its revision status, etc.

 

Metadata are particularly important when the corpus is shared and reused by others in a research community. Metadata can be kept in a separate database or included as a ‘header’ at the start of each document (usually encoded through a mark-up language). A separate database with this information makes it easier to compare different types of documents and has the distinct advantage that it can be further extended by other users of the same data.
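As a rough illustration of the ‘separate database’ option, the sketch below keeps one metadata record per text, grouped under Burnard's four categories, and filters texts by a metadata value. The class name, field names and sample values are all hypothetical and are not taken from the chapter. Administrative metadata strictly describe the corpus as a whole, so attaching them to each text here is only a simplification.

```python
from dataclasses import dataclass, field

@dataclass
class TextMetadata:
    """One record per corpus text; the keys inside each dict are illustrative."""
    text_id: str
    editorial: dict = field(default_factory=dict)       # relation to the original source
    analytic: dict = field(default_factory=dict)        # how the text was interpreted/annotated
    descriptive: dict = field(default_factory=dict)     # classificatory properties
    administrative: dict = field(default_factory=dict)  # title, availability, revision status

# The "separate database": a simple in-memory catalogue keyed by text id.
catalogue = {}

def register(record):
    """Add or replace the metadata record for one text."""
    catalogue[record.text_id] = record

def select(category, key, value):
    """Return the ids of texts whose metadata match category[key] == value."""
    return [r.text_id for r in catalogue.values()
            if getattr(r, category).get(key) == value]

register(TextMetadata("t1", descriptive={"mode": "spoken"}))
register(TextMetadata("t2", descriptive={"mode": "written"}))
print(select("descriptive", "mode", "spoken"))
```

Because the records live outside the texts, another user could add further keys to any category without touching the corpus files themselves.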

 

Corpus linguistics: tools and methods

A number of user-friendly software packages are available which facilitate the manipulation and analysis of corpus data. Common functionalities include the generation of frequency counts according to specified criteria, comparisons of frequency information in different texts, different formats of concordance outputs, including the Key Word In Context (KWIC) concordance, and the extraction of multiword units or clusters of items in a text.

 

Word lists

Various word lists that are based to some degree on word frequency in a corpus exist, especially in the English language teaching (ELT) context. Word lists are a good starting point for subsequent searches of individual items at concordance level and can be useful in the comparison of different corpora. Lemmatisation, the grouping of the different forms of a word under a single base form (the lemma), can be done manually using an alphabetical frequency list, or in an automated way which is often based to some degree on lists of predefined lemmas. Different forms of the same lemma tend to vary significantly in terms of their overall frequency, with one particular form tending to be more frequent than the others.
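A frequency word list and a list-based lemmatisation step can be sketched in a few lines. This is only a minimal illustration: the tokenisation rule is deliberately crude, and the small LEMMAS table is a hypothetical stand-in for the much larger predefined lemma lists that real tools use.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(text):
    """Return (word form, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

# Hypothetical lemma table mapping word forms to their lemma.
LEMMAS = {"goes": "go", "went": "go", "going": "go", "gone": "go"}

def lemma_frequencies(text):
    """Count tokens after replacing each known form with its lemma."""
    return Counter(LEMMAS.get(token, token) for token in tokenize(text))

sample = "She went home. He goes home. They go home and we go too."
print(word_list(sample)[:3])
print(lemma_frequencies(sample)["go"])
```

Comparing the two outputs shows how lemmatisation merges the separate counts for went, goes and go into a single count for the lemma.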

Research in the area of computational linguistics has introduced new techniques for extracting meaningful units from corpora, both on the basis of frequency information (see, for example, Danielsson 2003) and on the basis of part-of-speech tagged corpora which include further annotation of semantic fields (Rayson 2003).

 

Keywords and key sequences

Keywords are identified on the basis of statistical comparisons of word frequency lists derived from the target corpus and the reference corpus. Each item in the target corpus is compared with its equivalent in the reference corpus, and the statistical significance of difference is calculated using chi-square or log-likelihood statistics (see Dunning 1993). Both of these statistics compare actual observed frequencies between two items with their expected frequencies, assuming random distribution. If the difference between observed and expected frequency is large then it is likely that the relationship between the two items is not random.
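The comparison of observed and expected frequencies can be made concrete with a short sketch. The function below uses the two-cell form of the log-likelihood statistic that is common in keyword tools, which is an assumption on my part rather than something specified in the chapter; the function name and the example figures are invented for illustration.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness score for one word in two corpora.

    freq_* are the word's observed frequencies; size_* are the corpus sizes
    in tokens. Expected frequencies assume the word is spread evenly across
    both corpora in proportion to their sizes.
    """
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    score = 0.0
    # A zero observed frequency contributes nothing (and log(0) is undefined).
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

# Invented figures: 50 hits in a 1,000-token target, 10 in a 1,000-token reference.
print(log_likelihood(50, 1000, 10, 1000))
# Same relative frequency in both corpora, so observed equals expected.
print(log_likelihood(10, 1000, 10, 1000))
```

The larger the score, the less plausible it is that the frequency difference is due to chance; a word with the same relative frequency in both corpora scores zero.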

 

The concordance output

A concordance output can be useful in providing a representation of language data which allows the user to notice patterns relating to the way in which a lexical item or a sequence is used in context. In order to describe the nature of individual units of meaning, Sinclair (1996) suggests four parameters: collocation, colligation, semantic preference and semantic prosody. Collocation refers to the habitual co-occurrence of words. Colligation is the co-occurrence of grammatical choices. Grammatical patterning around a particular word accounts for the ‘variation’ of a phrase, which ‘gives the phrase its essential flexibility, so that it can fit into the surrounding co-text’. ‘Fixed phrases’ are therefore only fixed if we consider the lexico-grammatical ‘core’; if we extend the units of meaning to patterns in the co-text, the expressions become more variable. The semantic preference of a lexical item or expression is a semantic abstraction of its prominent collocates. Semantic prosodies are associations that arise from the collocates of a lexical item and are not easily detected using introspection. They have mainly been described in terms of their positive or negative polarity (Sinclair 1991a; Stubbs 1995) but also in terms of their association with ‘tentativeness/indirectness/face saving’ (McCarthy 1998: 22).
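A Key Word In Context display of the kind discussed here can be approximated with a short function. This is a minimal sketch, not the output format of any particular concordancer: the function name, the character-based context window and the bracket notation around the node word are all assumptions made for illustration.

```python
import re

def kwic(text, node, width=30):
    """Return one concordance line per occurrence of `node` in `text`.

    Each line shows up to `width` characters of left and right co-text,
    with the node word aligned in a central column.
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left co-text so that the node words line up vertically.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

sample = "The cause of the fire is unknown. Smoking can cause serious illness."
for line in kwic(sample, "cause"):
    print(line)
```

Reading down the aligned column is what makes recurrent collocates and grammatical patterns to the left and right of the node word easier to notice.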

 

Current issues in corpus linguistics

Phraseology

John Sinclair’s (1991a) observation that everyday language is full of highly recurrent sequences of words challenges the traditional view of language processing in the brain and the belief that language production (and reception) relies on a completely rule-based system. He suggests that highly recurrent chunks are fundamental to the organization and the production of language, and proposes that language production is the result of the alternation between the idiom principle and the open-choice principle.

            The term multiword unit is often used in this context as an umbrella term for sequences of interrelated words which are retrieved from memory as single lexical units. They occur with varying degrees of fixedness, including formulae (e.g. have a nice day), metaphors (e.g. kick the bucket) and collocations (e.g. rancid butter) (Moon 1998; Wray 2002). The description and conceptualization of multiword units are a key concern in many different areas of language study ranging from psycholinguistics to Natural Language Processing (NLP). There are many different ways of identifying multiword units. These include intuitive identification, the use of discourse analytical techniques, and automatic extraction from electronic texts.
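Automatic extraction from electronic texts is often approached, in its simplest form, by counting recurrent word sequences (n-grams). The sketch below is one such frequency-only approach under simplifying assumptions: the function name and thresholds are hypothetical, sentence boundaries are ignored, and no statistical filtering is applied, so it will also return sequences that are frequent but not meaningful units.

```python
import re
from collections import Counter

def recurrent_ngrams(text, n=3, min_freq=2):
    """Return (sequence, frequency) pairs for n-word sequences that recur.

    Only sequences occurring at least `min_freq` times are kept,
    ordered from most to least frequent.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list yields every n-gram.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return [(gram, freq) for gram, freq in counts.most_common() if freq >= min_freq]

sample = "Have a nice day. I hope you have a nice day too. Have a nice trip."
print(recurrent_ngrams(sample))
```

In practice the candidate list produced this way would still need to be checked manually or filtered further, which is where the intuitive and discourse analytical approaches mentioned above come in.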

 

Corpora and English language teaching

 

While corpus linguistics has enabled better descriptions of language in use, its real impact lies in the enhancement of applications based on those descriptions. A key area to highlight in this context is that of English language teaching, where the latest findings from corpus research have led to real innovations in material design and classroom practice. There are two main areas in which corpora can benefit language teaching and learning: first, by incorporating the latest corpus-based findings into language syllabuses, teaching materials and dictionaries; second, by encouraging teachers and learners to examine language patterns in corpora as part of their (independent) learning activities in and outside classrooms (see Gavioli and Aston 2001).

Corpus linguists and language teaching researchers are often found collaborating in these two areas and there are now publications on the subject. Some of these provide further corpus-based descriptions of aspects of language which target the needs of specific groups of language learners. Others aim to equip teachers and learners with the skills of concordancing and extracting useful information from concordance lines or include practical suggestions on the various ways in which corpus research can be introduced into the language classroom to enrich the experience of language learners.

Corpus data are increasingly becoming an accepted and desirable basis for the development of English language teaching materials, and most major dictionaries and grammars now advertise the fact that they are based on ‘real’ language from a corpus.

 

The Web as corpus

The Web provides more than free, instant suggestions on spellings: corpus linguists have developed Web-based interfaces that allow researchers to use the Web as a compatible resource for linguistic research.

While the World Wide Web is a very large repository of naturally occurring language, further research is needed as to the type of language that is being used on the Web, what it represents, and how balanced it is in the context of a particular research question.

 

The impact of new technologies on corpus linguistics: an example

One of the main impacts of new technology on the area of corpus linguistics is no doubt the use of the Web as a corpus.

Gesture, prosody and kinesics all add meaning to utterances and discourse as a whole, and recent research in the area of spoken corpus analysis has started to explore the potential impact of drawing on multimodal corpus resources for the descriptions of spoken language. In addition to offering a more comprehensive resource for describing discourse, multimodal corpora also allow us to reflect on and evaluate some of the methods for analyzing textual renderings of spoken discourse established so far. The representation and analysis of ‘textual’ concordance data thus becomes limited and limiting in a way that can now be avoided by using one of the tools and interfaces developed for aligning and searching text, audio and video data, such as ELAN or Transana.

 

 

Question

  1. Q: Explain the four categories of metadata distinguished by Burnard (2005).

A: Burnard (2005) distinguishes four categories of metadata:

a. Editorial metadata: providing information about the relationship between corpus components and their original source.

b. Analytic metadata: providing information about the way in which corpus components have been interpreted and analyzed.

c. Descriptive metadata: providing classificatory information derived from internal or external properties of the corpus components.

d. Administrative metadata: providing documentary information about the corpus itself, such as its title, its availability, its revision status, etc.

 

  2. Q: What is the relation between corpora and English language teaching? What is the benefit?

A: In English language teaching, the latest findings from corpus research have led to real innovations in material design and classroom practice. There are two main areas in which corpora can benefit language teaching and learning: first, by incorporating the latest corpus-based findings into language syllabuses, teaching materials and dictionaries; second, by encouraging teachers and learners to examine language patterns in corpora as part of their (independent) learning activities in and outside classrooms.

 
