Preserving the confusion of tongues

An engraving of the Confusion of Tongues, by Gustave Doré (1865), depicting the Biblical tale that explains the multiplicity of languages as a way to halt the building of the ambitious Tower of Babel. Image courtesy Wikimedia.

Languages and cultures have changed throughout history as communities came into contact and living conditions shifted. Many languages, and the cultures in which they were spoken, have disappeared over the last few centuries, leaving only about 6,500 languages.

However, globalization and technological innovation in recent decades have accelerated the change and endangerment of languages and cultures so sharply that about one language now dies every week. Since 96% of languages are spoken by only 3% of people, extinction mainly affects areas where many languages are each spoken by only a few people.

Language change, and with it cultural change, also affects widely spoken languages such as English. Here the causes are quite different: immigration by growing numbers of people, creolization, or both.

"Unique creations of evolution"

Languages can be seen as unique creations of evolution, designed to help people survive in their environments, so we lose a treasure of mankind with every language that dies. We also seem to be observing a blurring of the structural boundaries between languages. We are facing a gigantic loss of knowledge and cannot assume that these trends will stop.

More than ever, we need to document our languages and, if we do it properly, also the cultural background in which they are spoken. Language documentation can be used to maintain diversity where possible, to better understand how languages are constructed, and to transfer knowledge to future generations. Since we cannot foresee what future generations will do with this information, we need to carry out this documentation work carefully and also take care to preserve our digital records.

This was the basic motivation for the DOBES program, which started in 2000 and now covers about 50 multinational and multidisciplinary teams of linguists, anthropologists, musicologists, ethno-biologists, and others. Together they are documenting more than 70 languages from all over the world, from Iwaidjan on the Cobourg Peninsula in Northern Australia and Totoli in Sulawesi, Indonesia, to Gorani in Northwest Iran and Awetí in Mato Grosso, Brazil.

No longer as simple as compiling a wordlist

Language documentation is no longer seen as merely producing a description of the grammar and some wordlists. It now needs to be based on large amounts of primary data, such as audio and video recordings of the speakers. Video recordings in particular also capture the environment in which languages are spoken.

These media recordings are transcribed to a certain extent, and a free translation into one of the major languages is created. For some material, morphosyntactic glossing (that is, annotation of linguistic content such as morphology, syntax, and semantics) is added to describe part of the linguistic system. Other types of information can be added through special analyses, such as descriptions of gestures or anthropological phenomena.

This annotation work is very time-consuming since it has to be done manually. Higher-level linguistic annotation can take more than 100 times real time. In addition, lexica are derived and, where possible, sketch grammars are added. Conceptual spaces, in which major concepts are related to one another, allow documenters and language community members to access the documentation material from a cultural point of view.
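To give a feel for what "100 times real time" means in practice, here is a minimal back-of-the-envelope sketch. The function name and the 10-hour corpus are hypothetical illustrations, not DOBES figures; only the factor of 100 comes from the talk, where it is given as a lower bound for higher-level annotation.

```python
def annotation_hours(recording_hours: float, real_time_factor: float = 100.0) -> float:
    """Estimate manual annotation effort in hours.

    real_time_factor is the number of working hours needed per hour of
    recording. The default of 100 is the lower bound quoted for higher-level
    linguistic annotation; actual effort may be considerably larger.
    """
    if recording_hours < 0 or real_time_factor < 0:
        raise ValueError("hours and factor must be non-negative")
    return recording_hours * real_time_factor


# Hypothetical corpus: 10 hours of recordings selected for full glossing.
effort = annotation_hours(10)
print(f"Estimated effort: at least {effort:.0f} working hours")
```

Under these assumptions, fully glossing just 10 hours of recordings would take at least 1,000 working hours, which helps explain why only part of the material receives this level of annotation.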

It is now well understood that digital data will be lost very quickly unless it is uploaded to a digital repository that fulfills a number of criteria. This is why, from the beginning, the DOBES project was associated with a digital archive that takes care of bit-stream preservation and format curation.

Four copies at large data centres

Bit-stream preservation is supported by generating four external, dynamic copies of all data objects at large remote data centers, and by 10 regional archives set up in places where the recorded languages are spoken.
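The talk summary does not say how the archive checks that its copies stay identical. One common approach, shown here purely as an illustrative sketch and not as the DOBES archive's actual procedure, is a fixity check: record a checksum when an object is ingested and periodically compare each copy against it. The function names and file paths below are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large media files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(expected_digest, copy_paths):
    """Return the copies whose checksum differs from the recorded digest."""
    return [Path(p) for p in copy_paths if sha256_of(p) != expected_digest]
```

A copy flagged by `find_corrupt_copies` could then be replaced from one of the intact copies, which is the practical benefit of keeping several of them at separate sites.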

The use of standards is the basis for long-term interpretability of the data. Metadata is equally important, since it allows objects to be related to each other and provides the contextual and provenance information needed to interpret an object.

The DOBES program can also be seen as a very successful start to the eHumanities era, since it helped to change the scientific culture and since the archived data can now be used for cross-linguistic studies.

The goals and principles of DOBES were an excellent starting point for the infrastructure work being done in CLARIN, one of the projects selected for the ESFRI roadmap. CLARIN's goal is to establish an integrated and interoperable domain of language resources and tools, the persistence of which will be guaranteed by a network of strong service centers. An early example of these integration efforts is the Virtual Language Observatory, which also covers, for example, all DOBES data.

This is a brief summary of the talk "Language and culture documentation in DOBES" given by Peter Wittenburg at APAN in Mumbai this week.

