NA 105

DIN Standards Committee Terminology

Project

Language resources and language technology - Derived text formats (DTF)

Abstract

Derived text formats (DTFs) are abstracted representations of an original text that remove copyrighted content while retaining the information relevant for text and data mining (TDM). Examples are word lists and n-grams. They enable legally compliant research, transparency and reusability. One area of application is the development and improvement of large language models (LLMs). This document sets out general principles for derived text formats themselves and for their creation and provision. On this basis, analysis procedures can then be adapted to the derived text formats. The document also makes it possible to name and describe the limits of these analysis procedures, e.g. when analysing protected works. The aim of these principles is to make the use of text collections, especially those containing protected works, more legally secure and sustainable, to facilitate cooperation, to build trust and to open up new possibilities for applying modern analysis methods.
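
As an illustration of the two examples named above, the following minimal Python sketch derives a word list and n-gram counts from a short text. The tokenization rule (lowercasing, splitting on non-word characters) and the function names `tokenize`, `word_list` and `ngram_counts` are assumptions made for this sketch; they are not taken from the planned standard, which may define derived text formats differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def word_list(text):
    """Return word frequencies; the original word order is not kept."""
    return Counter(tokenize(text))


def ngram_counts(text, n=2):
    """Return frequencies of n-grams (tuples of n consecutive tokens)."""
    tokens = tokenize(text)
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    print(word_list(sample).most_common(3))
    print(ngram_counts(sample, n=2).most_common(3))
```

Both functions keep only aggregate counts. The choice of `n` controls how much local word order survives in the derived format; how much may be retained for protected works is not something this sketch decides.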

Start date

2025-01-16

Planned document number

DIN 19461

Project number

10500742

Responsible national committee

NA 105-00-06 AA - Language resources and language technology  

Contact

Annette Preissner

Am DIN-Platz, Burggrafenstr. 6
10787 Berlin

Tel.: +49 30 2601-2012
Fax: +49 30 2601-42012
