CLARIN: A project that speaks to you
The creation story of the Wichita people tells of a creator, “Man-never-known-on-Earth,” who formed the world, land, water and the first man and woman: “Man-with-the-Power-to-Carry-Light” and “Bright-Shining-Woman.” This couple brought to the Earth light, corn-growing, deer-hunting, game-playing and prayer, before becoming the morning star and the moon.
While the story itself is preserved in literature for posterity (e.g., in George Dorsey’s 1904 book The Mythology of the Wichita), fewer than 10 people today can tell it in the Wichita language, nearly all of them elders living on tribal lands in Oklahoma, USA.
It’s a pattern repeated around the world; many languages are endangered or dying. Preserving these languages is vital for groups seeking to revitalize and maintain their culture.
Linguists have been recording and documenting endangered languages for as long as there has been recording equipment, or about 120 years. What has been lacking — until now — is a central place to search and access these data stores, which are scattered around the world. To remedy this, the CLARIN project is studying and preparing to provide comprehensive language research and preservation tools.
CLARIN, or Common Language Resources and Technology Infrastructure, began preparing its infrastructure in 2008. At the end of 2010, it expects to move into the construction phase. Its goal is seamless access to language archives and applications. By providing that access, CLARIN hopes to become an invaluable tool for helping to document and understand our languages, and therefore understand ourselves.
An advantage to all
Many sectors of society will benefit, say CLARIN’s creators.
For instance, an educator or government official reviewing educational policy could search stored archives of children’s recordings in her country. Using this information, she could then compare indicators of linguistic sophistication, such as breadth of vocabulary, among children of the same age from different regions, or perhaps compare the language skills of boys and girls within the same age group.
Similarly, a historian researching a given politician could determine the frequency with which he used a certain word or phrase in a given month, year or decade. This kind of data could illuminate the germination of a political idea or movement.
Or a dictionary writer could clarify and expand a word’s meaning based upon the syntax and phrases commonly associated with that entry.
And a teacher seeking to expand his students’ horizons could show them language systems radically different from their own. One example is Kuuk Thaayorre, spoken by Aboriginal people of Far North Queensland, Australia, a language that contains no words for left and right. Cardinal directions (north, south, east and west) do the job instead. Consequently, its speakers have a heightened spatial awareness, states linguistic researcher Lera Boroditsky of Stanford University, in an article on the website Edge:
Most likely you and I, in the absence of a compass, wouldn’t be able to get past “Hello.”
Creating such a repository means overcoming a variety of challenges. “The needs of our users, as well as the needs of our sources, present some interesting problems,” says Martin Wynne, a member of CLARIN. For example, patient confidentiality must be preserved, and intellectual property rights respected. Consequently, sign-on to the CLARIN infrastructure will offer differing levels of access: data from medical patients or children will be restricted, and recorded songs might be offered only to academics and not to commercial musicians.
More unusually, some data must be withdrawn, at least temporarily, once the source dies.
Upon the death of a Pitjantjatjara-speaking Aborigine in central Australia (near Uluru, or “Ayers Rock”), for example, anything associated with that person, such as photographs or recordings, temporarily becomes taboo during a prolonged mourning period lasting months or even years. Even the person’s name is not spoken; instead, the word “Kuminjay” is substituted, in what anthropologists term “avoidance language.”
As a result, “We’ll have an ethical obligation to (temporarily) cut access to recordings of that person,” says CLARIN’s Peter Wittenburg.
Like a jigsaw puzzle
Besides the ethical considerations, the team needs to make sure that sources drawn upon by the CLARIN catalogue are reliable and persistent. A PhD student using CLARIN as a source for his thesis needs to trust that cited resources will remain in place.
Wynne, Wittenburg and Daan Broeder of CLARIN recently visited the CERN IT department to observe how the Worldwide LHC Computing Grid and Enabling Grids for E-sciencE had approached security, monitoring and the provision of highly available services.
“We are at the stage of designing the architecture,” says Broeder. “It is like a jigsaw puzzle: some pieces are already defined and in place. We are now looking for the missing pieces. To the extent we can, we’d like to find preformed puzzle pieces that would be a good fit, to save us from making and cutting our own.”
—Danielle Venton, EGEE
From UNESCO’s Atlas of the World’s Languages in Danger:
It is impossible to estimate the total number of languages that have disappeared over human history. Linguists have calculated the numbers of extinct languages for certain regions, such as, for instance, Europe and Asia Minor (75 languages) or the United States (115 languages lost in the last five centuries, of some 280 spoken at the time of Columbus). Some examples of recently extinct languages are: