Note

Document in progress

1 Introduction

This document describes the processes used to ensure the long-term preservation of Nakala data and their management by the TGIR Huma-Num team. The conceptual standard ISO 14721:2012 “Open Archival Information System (OAIS)” is used here both for its functional model and for its vocabulary (terms noted in capital letters).

2 The actors

The diagram below presents the different actors exchanging information with the OAIS (the Nakala archive).

[Diagram: the actors of the Nakala OAIS]

2.1 The archive

The Nakala archive preserves and gives access to scientific resources from the French academic community in the humanities and social sciences. A resource in Nakala can contain one or several files. A set of resources can form a collection. The files can be of any type (for example an image, an audio recording, a text document, etc.).

2.2 Producers

The producers of resources are authors or contributors who are scientifically responsible for the collection or production of the resources.

2.3 Users

The resources of the Nakala archive are intended for the SHS (humanities and social sciences) scientific community in order to feed its knowledge base. The Nakala service makes sure that the data are usable by this target community and are presented as scientific and cultural objects, using formats and vocabularies widely adopted in this community and beyond (for example: qualified Dublin Core for the metadata, or repositories such as ORCID for identifying people or ISO 639-3 for identifying languages).

2.4 Management

The Nakala archive is managed by the TGIR Huma-Num, a unit of the Centre National de la Recherche Scientifique (CNRS) attached to the Institut des sciences humaines et sociales du CNRS (INSHS) and labeled as a Très Grande Infrastructure de Recherche (TGIR) by the Ministry of Higher Education, Research and Innovation (MESRI).

3 Information packages

The OAIS standard distinguishes three forms of information packages: the SIP (Submission Information Package), exchanged between a producer and the archive; the DIP (Dissemination Information Package), exchanged between the archive and the users; and the AIP (Archival Information Package), which is the form of the archive’s internal information package.

3.1 Submission Information Package (SIP)

The information provided in a package submitted to the Nakala archive by the producer is divided into two sets:

  • one or more data files
  • metadata

How this information is packaged depends on the mode of deposit into the archive: via a web form or via APIs. Depending on the service level, the constraints that apply to these two sets differ significantly.

Service level 1:

  • Data files can be of any type
  • There are five mandatory metadata properties (nakala:title, nakala:created, nakala:type, nakala:license and nakala:creator)
  • Optional metadata can be added; they must conform to the qualified Dublin Core model
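As an illustration, a minimal level-1 metadata set could be expressed as follows. The property names are the five mandatory ones listed above; the dictionary layout and all values are illustrative only, not the actual Nakala deposit format.

```python
# Illustrative level-1 metadata set for a Nakala deposit.
# Property names are the five mandatory ones; values are examples.
metadata = {
    "nakala:title":   "Field survey recordings, 2021",
    "nakala:created": "2021-06-15",
    "nakala:type":    "sound",
    "nakala:license": "CC-BY-4.0",
    "nakala:creator": "Jane Doe",
}

# Optional metadata must conform to qualified Dublin Core,
# e.g. a dcterms property (example only):
metadata["dcterms:language"] = "fra"   # ISO 639-3 code
```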

Service level 2:

  • Data must be expressed in formats listed in a format repository (published at https://facile.cines.fr).
  • In addition to the mandatory metadata of service level 1, there is additional metadata required by the SEDA 2 format (standard NF Z 44-022 “Modélisation des Échanges de DONnées pour l’Archivage” and the ISO 20614 standard “Data exchange protocol for interoperability and preservation”).

3.2 Archival Information Package (AIP)

Nakala’s ingestion process generates new metadata, notably by analyzing the content of the data or through the enrichment carried out by “data” experts (cf. 4.6.2 The role of the “data” experts in the ingestion process).

3.2.1 Additional metadata

The metadata added during the ingestion process are the following.

For service level 1:

  • persistent identifier (DOI)
  • fingerprint (SHA1)
  • submission date
  • status ?
  • producer identifier

For service level 2:

  • additional metadata from service level 1
  • format identifier
  • management rules (type of data, final destination, retention period, communicability)

3.2.2 Data versions

To be described: the management of versions (modification, and deposits in several formats: original vs. preservation vs. dissemination), replacement versions, and versions in different usage formats.
Note: in the case of service level 2, only files in preservation formats are concerned.

3.3 Dissemination Information Package (DIP)

Nakala’s data and metadata are made accessible to users via different interfaces which allow searching, filtering and retrieving all or part of the information (see “4.5 Access to information (ACCESS)” for a description of these). All metadata are accessible; the data files may be subject to access control.

4 Description of the ingestion process in Nakala

4.1 Overall scheme

This section describes the organization of data ingestion in Nakala in the context of an OAIS.

[Diagram: overall scheme of the Nakala OAIS]

4.1.1 The profiles

Several profiles of the TGIR Huma-Num team are involved in the data ingestion mechanism of Nakala.

  • The SHS “domain” experts guarantee the quality of the data disseminated. They also set priorities among incoming submissions.

  • The “data” experts (documentalists, archivists) with good knowledge of the SHS domain are in charge of data and metadata control.

  • The IT engineers are in charge of the information system and the maintenance of the service.

4.1.2 Service levels

There are two levels of service for preservation in Nakala.

  • Level 1 ensures the preservation of the bitstream without any commitment on the long-term readability of the data.

  • Level 2 ensures the preservation of the information contained in the data and commits to its long-term readability. Part of the tasks specific to this service level is carried out through a service agreement with CINES.

4.2 Reception of data (INGEST)

The data reception phase corresponds to the “INGEST” entity of the OAIS model. In Nakala, the submitted information package (SIP) is composed of data files and metadata. Its ingestion can follow different paths: deposit by the author through a web interface or through APIs.

  • For service level 1, automatic checks are performed which condition the continuation of the ingestion process (verification of the presence of the 5 mandatory metadata, syntactic checks on the expression of the metadata, verification of the identity of the producer). If a check fails, the SIP is not created and error messages are sent to the producer.
  • For service level 2, in addition to the automatic checks of level 1, a “data” expert of the TGIR is designated as responsible for the deposit, and the continuation of the ingestion process is conditional on the result of their audit.
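The level-1 automatic checks can be sketched as follows. This is a simplified illustration only: the actual validation rules, error messages and producer-identity check used by Nakala are not specified in this document.

```python
# Simplified sketch of the level-1 automatic checks: presence of
# the five mandatory metadata, plus a syntactic check on the date.
from datetime import date

MANDATORY = ("nakala:title", "nakala:created", "nakala:type",
             "nakala:license", "nakala:creator")

def check_sip(metadata: dict) -> list:
    """Return a list of error messages; an empty list means the SIP can be created."""
    errors = [f"missing {key}" for key in MANDATORY if key not in metadata]
    created = metadata.get("nakala:created")
    if created is not None:
        try:
            date.fromisoformat(created)   # syntactic check (ISO 8601 date)
        except ValueError:
            errors.append("nakala:created is not an ISO 8601 date")
    return errors
```

If any check fails, the list of errors would be returned to the producer and no SIP would be created, mirroring the behaviour described above.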

4.3 Storage of information (STORAGE)

The storage phase corresponds to the entity “ARCHIVAL STORAGE” of the OAIS model. The storage process consists of:

  • Copying data files to disk.
  • Recording of metadata (provided by the producer as well as those provided by the system) in the DBMS.
  • Writing of the operations in a time-stamped log ?
  • Replication and backup operations.

4.4 Data management (MANAGEMENT)

This phase corresponds to the “DATA MANAGEMENT” entity of the OAIS model. All the metadata submitted by the producer, computed during the controls, and possibly added by a “data” expert are passed to this entity to update its database (Elasticsearch). It is this database that enables user searches.
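The update could amount to merging producer-supplied and system-generated metadata into one document for indexing, as in the sketch below. The field names and values are illustrative placeholders; the real Nakala index mapping is not described in this document.

```python
# Sketch of building the document that the management entity
# could send to its Elasticsearch index (illustrative fields only).
import json

def to_index_doc(producer_meta: dict, system_meta: dict) -> str:
    """Merge producer and system metadata into one JSON document."""
    doc = {**producer_meta, **system_meta}
    return json.dumps(doc, sort_keys=True)

doc = to_index_doc(
    {"nakala:title": "Field survey recordings, 2021"},   # from the producer
    {"doi": "10.1234/placeholder", "sha1": "0" * 40},    # from the system
)
```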

4.5 Access to information (ACCESS)

This phase corresponds to the “ACCESS” entity of the OAIS model. Nakala allows several types of access to its archives through tools maintained by Huma-Num.

  • Web interface
  • SPARQL endpoint
  • OAI-PMH
  • APIs
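As an example of the OAI-PMH access route, a harvester builds simple HTTP requests such as the one sketched below. The base URL here is a placeholder, not Nakala’s actual endpoint; consult the service documentation for the real endpoint and supported metadata prefixes.

```python
# Sketch of an OAI-PMH ListRecords request a harvester could issue.
# BASE is a placeholder endpoint, not Nakala's actual OAI-PMH URL.
from urllib.parse import urlencode

BASE = "https://example.org/oai"

def list_records_url(metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL, optionally restricted to a set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{BASE}?{urlencode(params)}"
```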

4.6 Administration of the information system (ADMINISTRATION)

This phase corresponds to the “ADMINISTRATION” entity of the OAIS model.

4.6.1 The role of “domain” experts in the ingestion process

The “domain” experts are consulted during the quality audits carried out by the “data” experts, in order to bring their knowledge of the scientific domain to the evaluation of the formats and models used. They are also called upon to define priorities among submissions.

4.6.2 The role of data experts in the ingestion process

The “data” experts (documentalists, archivists), who have a good knowledge of the SHS domain, are in charge of data and metadata control. They intervene as soon as a producer requests service level 2. They audit the data and accompany the producers in improving their quality. They can call on “domain” experts for their audits.

4.6.3 The role of IT experts in the ingestion process

IT engineers are in charge of the information system and service maintenance.

4.7 Planning for sustainability (PRESERVATION PLANNING)

This phase corresponds to the “PRESERVATION PLANNING” entity of the OAIS model.

5 Responsibilities of the OAIS Nakala

In this section, we link various aspects of Nakala’s management processes to the list of “mandatory responsibilities” (listed in the OAIS standard) for an archive.

  1. Negotiate with Information Producers and accept appropriate information from them. The TGIR Huma-Num dialogues with the SHS research communities (notably through the consortia that it funds) to identify the information representation formats they use, and conducts, jointly with CINES, studies to evaluate these formats with respect to their ability to be preserved in the long term and to identify the controls to be made for their acceptance into the archive.

  2. Obtain sufficient control of the information provided, at the level required to guarantee its long-term preservation. The Nakala ingestion process is supervised by Huma-Num’s “data” experts. The metadata are first defined by the authors; they are then completed by the results of the controls and, where necessary, directly by Huma-Num’s “data” experts. Huma-Num is responsible for the conservation of and access to the data published in Nakala, which gives it the right to modify their format (service level 2) in response to technological innovation or obsolescence. The metadata are expressed in an RDF model and an XML syntax that would allow the database to be easily recreated if needed.

  3. Determine, either by itself or in collaboration with others, which communities should constitute the target User Community able to understand the information provided, thus defining its Knowledge Base. The TGIR Huma-Num funds disciplinary consortia in order to maintain the link with these communities and to guarantee the adequacy between the choices of knowledge representation and their good understanding and use.

  4. Ensure that the information to be preserved is immediately understandable to the target User Community. In particular, the target User Community should be able to understand the information without recourse to special resources such as assistance from the experts who produced the information. Particular attention is paid, at service level 2, through the audits of the “data” and “domain” experts, to the choice of representation formats and to the precision, accuracy and completeness of the information.

  5. Implement documented policies and procedures to ensure that, within reason, information is preserved against all eventualities, including the disappearance of the Archive, by ensuring that it is never destroyed without authorization granted in accordance with a validated policy; there should be no ad hoc destruction. Service level 2 makes the destruction of data conditional on an authorization from the archive administration. In all cases the destruction of information is traced, and when deleted data is accessed, a “tombstone” presents minimal information (identification, dates of deletion and form of citation).

  6. Make the preserved information available to the target User Community and ensure that the dissemination of copies of the originally contributed Data Objects is tracked, so that the Authenticity of the information can be proven. Provenance and integrity information is available at all times in the published metadata and is verified regularly.

6 Procedures with Service Delivery

This section describes the tasks entrusted to the service provider CINES (Centre Informatique National de l’Enseignement Supérieur) and concerns only service level 2. The details of the operations are described in an agreement signed between the CNRS (Huma-Num) and CINES.

  • The data are copied to the CINES site (Montpellier), with one copy on disk and two copies on tapes in a robotic tape library.

  • The data must be validated by the validation tool “Facile” (based, among others, on the software components DROID, JHOVE and ImageMagick) against the list of acceptable formats (cf. https://facile.cines.fr).

  • If obsolescence is detected in the formats in use, whether identified by “domain” or “data” experts (from Huma-Num or CINES), CINES can carry out transformation operations to other formats identified in advance.