Conference paper

Лингвистическая обработка цифровых изданий русских текстов XVIII века [Linguistic Processing of Digital Editions of 18th-Century Russian Texts]

Abstract: This paper addresses the language-processing problems that arose during work on digital editions of two 18th-century Russian texts: the printed translation of Al’Quran (1716) and a manuscript translation of La Belle et la Bête (The Beauty and the Beast, 1758). The linguistic processing comprises spelling normalization, tokenization, morphological markup and lemmatization. The workflow consisted of manual pre-markup in Microsoft Word, conversion to TEI XML, and further automatic processing on the TXM platform, including annotation with TreeTagger and construction of a multi-layer transcription. In the Al’Quran edition, spelling normalization is fully automated but handles only the simplest cases, whereas in La Belle et la Bête the manual pre-markup makes it possible to generate a modern form for every word.
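The abstract does not list the normalization rules used in either edition. As a rough illustration only, the simplest kind of automated normalization could look like the Python sketch below. It assumes a small, hypothetical rule set (replacing four pre-reform letters and dropping the word-final hard sign) and a naive regex tokenizer; it is not the authors' TXM or TreeTagger pipeline, and the function names are invented for the example.

```python
import re

# Hypothetical rule table (an assumption, not taken from the paper):
# pre-reform Cyrillic letters mapped to their usual modern equivalents.
LETTER_MAP = str.maketrans({
    "\u0463": "\u0435",  # yat (lowercase) -> е
    "\u0462": "\u0415",  # yat (uppercase) -> Е
    "\u0456": "\u0438",  # decimal i (lowercase) -> и
    "\u0406": "\u0418",  # decimal i (uppercase) -> И
    "\u0473": "\u0444",  # fita (lowercase) -> ф
    "\u0472": "\u0424",  # fita (uppercase) -> Ф
    "\u0475": "\u0438",  # izhitsa (lowercase) -> и (a simplification)
    "\u0474": "\u0418",  # izhitsa (uppercase) -> И (a simplification)
})

# Hard sign (ъ) at the very end of a word, preceded by a word character.
FINAL_HARD_SIGN = re.compile(r"(?<=\w)[\u044A\u042A]\b")


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(token):
    """Return a modernized spelling for one token (simplest cases only)."""
    token = FINAL_HARD_SIGN.sub("", token)
    return token.translate(LETTER_MAP)


def two_layer(text):
    """Pair each original token with its normalized form."""
    return [(tok, normalize(tok)) for tok in tokenize(text)]
```

Calling `two_layer` on a short phrase yields (original, normalized) pairs, loosely analogous to two layers of a multi-layer transcription. Anything beyond letter-for-letter substitution, such as archaic endings or ambiguous readings, is outside what this sketch handles, which fits the abstract's remark that full coverage required manual pre-markup.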
Document type: Conference paper

https://halshs.archives-ouvertes.fr/halshs-03285725
Contributor: Alexei Lavrentiev
Submitted on: Tuesday, 13 July 2021 - 15:35:48
Last modified on: Wednesday, 21 July 2021 - 03:50:42

File

Lavrentiev-Kurysheva-hal.pdf
Files produced by the author(s)

License


Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id: halshs-03285725, version 1

Citation

Alexei Lavrentiev, L Kurysheva. Лингвистическая обработка цифровых изданий русских текстов XVIII века. Corpora 2021 International Conference, Saint-Petersburg State University, Jul 2021, Saint Petersburg, Russia. ⟨halshs-03285725⟩

Metrics

Record views

10

File downloads

2