Note
Document in progress
NAKALA and the OAIS Model¶
1 Introduction¶
This document describes the processes involved in ensuring the sustainability of Nakala data and how they are managed by the Huma-Num IR* team. The conceptual standard ISO 14721:2012, “Open Archival Information System (OAIS)”, is used here both for its functional model and for its vocabulary (terms noted in capital letters).
2 Actors¶
The diagram below shows the various actors exchanging information with the OAIS (the Nakala archive).
2.1 The archive¶
The Nakala archive preserves and provides access to scientific resources from the French academic community in the humanities and social sciences. In Nakala, a resource may contain one or more files. A set of resources can form a collection. The files can be of any type (e.g. image, audio recording, text document, etc.).
2.2 Producers¶
Resource producers are authors or contributors with scientific responsibility for the collection or production of resources.
2.3 Users¶
The resources in the Nakala archive are intended for the SHS scientific community, to feed its knowledge base. The Nakala service ensures that the data is usable by this target community and presented as scientific and cultural objects, using formats and vocabularies widely used in this community and beyond (e.g. qualified Dublin Core for metadata, or repositories such as ORCID for identifying people or ISO 639-3 for identifying languages).
2.4 Management¶
The Nakala archive is managed by IR* Huma-Num, a unit of the Centre National de la Recherche Scientifique (CNRS) attached to the Institut des Sciences Humaines et Sociales du CNRS (INSHS) and accredited as a “star” Research Infrastructure (IR*) by the French Ministry of Higher Education and Research (MESR).
3 Information packages¶
The OAIS standard distinguishes between three types of information packages: the “SIP” (Submission Information Package), exchanged between a producer and the archive; the “DIP” (Dissemination Information Package), exchanged between the archive and users; and the “AIP” (Archival Information Package), which represents the archive’s internal information package.
3.1 Producer-supplied information package (SIP)¶
The information provided in a package submitted to the Nakala archive by the producer is divided into two groups:
- one or more data files
- metadata
The way this information is packaged depends on how it is submitted to the archive: via a web form or via APIs. Depending on the service level, the constraints that apply to these two sets differ significantly.
Service level 1:
- Data files can be of any type
- There are 5 mandatory metadata properties (nakala:title, nakala:created, nakala:type, nakala:license and nakala:creator)
- Optional metadata can be added; they must conform to the qualified Dublin Core model. (A minimal deposit sketch is given after the two service level descriptions.)
Service Level 2:
- Data must be expressed in formats listed in a format repository (published at https://facile.cines.fr).
- In addition to the mandatory metadata of service level 1, additional metadata is required by the sip.xml format (schema: http://www.cines.fr/pac/3.0/sip.xsd).
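As an illustration, the sketch below shows how the five mandatory metadata of service level 1 might be packaged and submitted through the deposit API. The endpoint path, header name, property URIs and payload structure are assumptions to be checked against the current API documentation; only the list of mandatory properties comes from the description above.

```python
# Hypothetical sketch of a service level 1 deposit through the Nakala API.
# Endpoint path, header name, property URIs and payload shape are assumed.
import requests

API_URL = "https://api.nakala.fr"        # production API (assumed URL)
API_KEY = "YOUR-API-KEY"                 # personal key of the producer

# The five mandatory metadata of service level 1 (property URIs indicative).
metas = [
    {"propertyUri": "http://nakala.fr/terms#title",   "value": "Survey photographs, 2021", "lang": "en"},
    {"propertyUri": "http://nakala.fr/terms#created", "value": "2021-06-15"},
    {"propertyUri": "http://nakala.fr/terms#type",    "value": "http://purl.org/coar/resource_type/c_c513"},
    {"propertyUri": "http://nakala.fr/terms#license", "value": "CC-BY-4.0"},
    {"propertyUri": "http://nakala.fr/terms#creator", "value": {"givenname": "Ada", "surname": "Lovelace"}},
]

payload = {
    "status": "pending",   # kept unpublished until the producer decides otherwise
    "metas": metas,
    "files": [],           # file references returned by a prior upload step
}

response = requests.post(f"{API_URL}/datas", json=payload,
                         headers={"X-API-KEY": API_KEY})
response.raise_for_status()
print(response.json())     # identifier assigned to the new deposit
```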
3.2 Archived Information Package (AIP)¶
Nakala’s ingestion process generates new metadata, in particular through content analysis of the data or enrichment by curators (cf. 4.6.2 The role of curators in the ingestion process).
3.2.1 Additional metadata¶
The metadata added during the ingestion process is as follows.
For service level 1:
- persistent identifier (DOI)
- file fingerprint (SHA-1; see the sketch after these lists)
- submission date
- file sizes
- file names
- producer identifier
For service level 2:
- the additional metadata of service level 1
- format identifier
- management rules (type of data, final destination, retention period, communicability)
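The file-level metadata above (fingerprint, size, name) can be illustrated with a minimal sketch using only the Python standard library; the chunk size and the dictionary layout are illustrative choices, not the archive’s internal format.

```python
# Minimal sketch: computing the file-level metadata recorded in the AIP
# (name, size, SHA-1 fingerprint). The chunk size and the dictionary
# layout are illustrative, not Nakala's internal representation.
import hashlib
from pathlib import Path

def sha1_fingerprint(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

file_path = Path("photo_001.tif")            # hypothetical data file
file_path.write_bytes(b"example content")    # stand-in content for the demo

print({
    "name": file_path.name,                  # file name
    "size": file_path.stat().st_size,        # file size in bytes
    "sha1": sha1_fingerprint(file_path),     # file fingerprint
})
```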
3.2.2 Data versions¶
When a published data file is modified, a new version of the data is generated, and all previous versions remain accessible.
For level 2, any modification to the metadata generates a new version of the data, and all older versions are retained.
3.3 Disseminated information package (DIP)¶
Nakala’s data and metadata are made accessible to users via various interfaces that allow them to search, filter and retrieve all or part of the information. See “4.5 Access to information (ACCESS)” for a description of these interfaces. All data and metadata are accessible; data files may be subject to access control.
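As an example of a dissemination path, the sketch below retrieves the public metadata of a single deposit over HTTP; the endpoint path, the response structure and the DOI shown are assumptions for illustration only.

```python
# Minimal sketch: retrieving the metadata (part of a DIP) for one deposit.
# The endpoint path, response structure and identifier are assumptions.
import requests

API_URL = "https://api.nakala.fr"
identifier = "10.34847/nkl.xxxxxxxx"       # hypothetical Nakala DOI

response = requests.get(f"{API_URL}/datas/{identifier}")
response.raise_for_status()

record = response.json()
# Metadata is always accessible; the files listed in the record may still
# be subject to access control when the producer has restricted them.
for meta in record.get("metas", []):
    print(meta.get("propertyUri"), "->", meta.get("value"))
```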
4 Description of the ingestion process in Nakala¶
4.1 Overall diagram¶
This section describes how data ingestion is organized in Nakala in the context of an OAIS.
4.1.1 Profiles¶
Several profiles from the IR* Huma-Num team are involved in Nakala’s data ingestion mechanism.
- SHS “domain” experts guarantee the quality of the data disseminated. They also manage input priorities.
- Curators (documentalists, archivists), with a good knowledge of the SHS domain, are in charge of data and metadata control.
- IT engineers are in charge of the information system and service maintenance.
4.1.2 Service levels¶
There are 2 service levels for preservation in Nakala.
- Level 1 ensures bitstream preservation, with no commitment to long-term data readability.
- Level 2 ensures preservation of the information contained in the data, with a commitment to long-term readability. Some of the tasks specific to this service level are outsourced to CINES.
4.2 Data reception (ENTRY)¶
The data reception phase corresponds to the “ENTRY” entity in the OAIS model. In Nakala, the SIP information package is made up of data files and metadata. Its ingestion can follow different paths: deposit by the author via a web interface, or via the APIs.
- For service level 1, automatic checks are carried out that determine whether the ingestion process continues (verification of the presence of the 5 mandatory metadata, syntax checks on the expression of the metadata, verification of the identity of the producer). If a check fails, the SIP is not created and error messages are sent to the producer (see the sketch after this list).
- For service level 2, in addition to the automatic checks of level 1, a curator is designated to take charge of the deposit, and the continuation of the ingestion process is conditional on the outcome of their audit.
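A minimal sketch of the first automatic check of service level 1, the presence of the five mandatory metadata, is given below; the way submitted metadata are represented and the wording of the error messages are assumptions.

```python
# Minimal sketch of one automatic ENTRY check: presence of the five
# mandatory metadata. The metadata representation and the error
# messages are illustrative assumptions.
MANDATORY = {
    "nakala:title",
    "nakala:created",
    "nakala:type",
    "nakala:license",
    "nakala:creator",
}

def check_mandatory(metas: list[dict]) -> list[str]:
    """Return an error message for every missing mandatory property."""
    present = {m.get("property") for m in metas if m.get("value")}
    return [f"missing mandatory metadata: {prop}"
            for prop in sorted(MANDATORY - present)]

# Example submission with two missing properties.
errors = check_mandatory([
    {"property": "nakala:title", "value": "Survey photographs, 2021"},
    {"property": "nakala:created", "value": "2021-06-15"},
    {"property": "nakala:type", "value": "image"},
])
if errors:
    # The SIP is not created; errors are reported back to the producer.
    print("\n".join(errors))
```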
4.3 Information storage (STORAGE)¶
The storage phase corresponds to the “ARCHIVAL STORAGE” entity in the OAIS model. The storage process consists of:
- Copying data files to disk.
- Recording metadata (supplied by the producer as well as by the system) in the DBMS.
- Recording of operations in a time-stamped log (TO BE CHECKED).
- Replication and backup operations (TO BE CHECKED)
4.4 Data management (MANAGEMENT)¶
This phase corresponds to the “DATA MANAGEMENT” entity in the OAIS model. All the metadata submitted by the producer, computed during checks, or added by a curator is passed to this entity to update its database (MySQL + Elasticsearch). It is this database that enables searches by users.
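As an indication of the kind of query this database answers, the sketch below sends a standard Elasticsearch full-text query; the host, index name and field name are hypothetical, only the query DSL itself is part of Elasticsearch.

```python
# Minimal sketch of the kind of full-text query the search interfaces
# translate into. Host, index and field names are hypothetical.
import requests

query = {
    "query": {
        "match": {"metas.title": "herbarium"}   # hypothetical field name
    },
    "size": 10,
}

response = requests.post(
    "http://localhost:9200/nakala/_search",      # hypothetical host/index
    json=query,
)
response.raise_for_status()
for hit in response.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```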
4.5 Access to information (ACCESS)¶
This phase corresponds to the “ACCESS” entity of the OAIS model. Nakala provides several types of access to its archives through tools maintained by Huma-Num (an OAI-PMH harvesting sketch follows the list below).
- Web interface
- SPARQL endPoint
- OAI-PMH
- APIs
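For example, metadata can be harvested through the OAI-PMH interface using the standard protocol verbs; the endpoint URL below is an assumption, while ListRecords and the oai_dc metadata prefix are defined by the OAI-PMH protocol itself.

```python
# Minimal sketch: harvesting Dublin Core records over OAI-PMH.
# The endpoint URL is an assumption; ListRecords and oai_dc are
# standard parts of the OAI-PMH protocol.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://api.nakala.fr/oai2"      # assumed endpoint URL

response = requests.get(OAI_ENDPOINT, params={
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
})
response.raise_for_status()

root = ET.fromstring(response.content)
ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}
for record in root.findall(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    print(title.text if title is not None else "(no title)")
```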
4.6 Information system administration (ADMINISTRATION)¶
This phase corresponds to the “ADMINISTRATION” entity in the OAIS model.
4.6.1 The role of domain experts in the ingestion process¶
“Domain” experts are consulted during the quality audits carried out by curators, to contribute their knowledge of the scientific field to the evaluation of the formats and models used. They are also called upon to define input priorities.
4.6.2 The role of curators in the ingestion process¶
Curators (documentalists, archivists), with a good knowledge of the SHS domain, are responsible for checking data and metadata. They intervene as soon as a producer requests service level 2. They audit data and support producers in improving their quality. They may call on “domain” experts for their audits.
4.6.3 The role of IT experts in the ingestion process¶
IT engineers are in charge of the information system and service maintenance.
4.7 PRESERVATION PLANNING¶
This phase corresponds to the “PRESERVATION PLANNING” entity in the OAIS model.
5 OAIS Nakala responsibilities¶
In this section, we link various aspects of Nakala’s management processes to the list of “mandatory responsibilities” (listed in the OAIS standard) for an archive.
- **Negotiate with Information Producers and accept appropriate information from them.** IR* Huma-Num is in dialogue with the SHS research communities (notably through the consortia it funds) to identify the formats used to represent their information, and conducts, in conjunction with CINES, any studies needed to evaluate these formats in terms of their ability to be preserved over the long term, as well as to identify the controls required for their acceptance into the archive.
- **Acquire sufficient mastery of the information provided, to the level required to guarantee its perpetuation.** The Nakala ingestion process is supervised by the curators. Metadata are first defined by the authors; they are then supplemented by the results of the ingestion checks and then directly by the curators. Huma-Num is responsible for the conservation of and access to data published in Nakala, which gives it the right to modify their format (service level 2) in line with new technology or obsolescence. Metadata are expressed in an RDF model with an XML syntax, making it easy to recreate the database if required.
- **Determine, either on its own or in collaboration with others, which communities should make up the target User Community able to understand the information provided, thus defining its Knowledge Base.** IR* Huma-Num funds disciplinary consortia to maintain the link with these communities and to ensure that the choices made in terms of knowledge representation are consistent with its proper understanding and use.
- **Ensure that the information to be perpetuated is immediately comprehensible to the target user community. In particular, the target user community should be able to understand the information without recourse to special resources such as the assistance of the experts who produced the information.** Particular attention is paid, at service level 2, through the audits of the curators and “domain” experts, to the choice of representation formats and to the precision, accuracy and completeness of the information.
- **Implement documented policies and procedures to safeguard information against unforeseen circumstances within reasonable limits, including the disappearance of the Archive, ensuring that it is never destroyed without authorization in line with a validated policy. There should be no ad hoc destruction.** Service level 2 makes data destruction conditional on obtaining authorization from the archive administration. In all cases, the destruction of information is traced, and in the event of access to the deleted data, a “tombstone” is displayed with minimal information (identification, deletion date and citation form).
- **Make perpetuated information available to the target User Community, and ensure that the distribution of copies of the Data Objects originally contributed is traced in order to prove the authenticity of the information.** Provenance and integrity information is available at all times in the published metadata and is checked regularly.
6 Procedures using service providers¶
This section describes the tasks entrusted to the service provider CINES (Centre Informatique National de l’Enseignement Supérieur) and concerns service level 2 only. Detailed operations are described in an agreement signed between CNRS (Huma-Num) and CINES.
- Data are copied to the CINES site (Montpellier), with one copy on disk and two copies on robotic tapes.
- The data must be validated by the “Facile” validation tool (based, among others, on software components such as DROID, JHOVE and ImageMagick) against the list of accepted formats (see https://facile.cines.fr). A producer-side pre-check sketch is given after this list.
- If obsolescence is detected in the formats used, which may be identified by “domain” experts or other experts (from Huma-Num or CINES), CINES can carry out transformation operations to other formats identified in advance.
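As an indicative producer-side pre-check, the sketch below compares a file’s guessed MIME type against a local allow-list before a service level 2 deposit; the allow-list is hypothetical, and an extension-based guess is only a rough stand-in for the content-level validation performed by Facile (DROID, JHOVE, etc.).

```python
# Minimal producer-side pre-check sketch: compare a file's guessed MIME
# type against a local allow-list before a service level 2 deposit.
# The allow-list below is hypothetical; authoritative validation is done
# by the Facile service (DROID, JHOVE, ...) on the full file content.
import mimetypes
from pathlib import Path

# Hypothetical subset of the formats accepted at https://facile.cines.fr
ACCEPTED_MIME_TYPES = {
    "image/tiff",
    "application/pdf",
    "text/xml",
    "text/plain",
}

def precheck(path: Path) -> bool:
    """Return True if the file's guessed MIME type is on the allow-list."""
    mime_type, _ = mimetypes.guess_type(path.name)
    accepted = mime_type in ACCEPTED_MIME_TYPES
    print(f"{path.name}: {mime_type or 'unknown'} -> "
          f"{'accepted' if accepted else 'to be checked with Facile'}")
    return accepted

precheck(Path("photo_001.tif"))
precheck(Path("notes.docx"))     # not on the hypothetical allow-list
```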