Document in progress
NAKALA Frequently Asked Questions¶
- Data in NAKALA
- What data can be deposited in NAKALA?
- Ownership of data deposited in NAKALA
- Data size
- How Digital Object Identifiers are managed
- What are the different statuses of data in NAKALA?
- What checks are carried out on the data?
- Data security
- Where are the data hosted?
- How is data backed up?
- How is metadata saved?
- NAKALA service continuity
- NAKALA and Huma-Num
Data in NAKALA¶
What data can be deposited in NAKALA?¶
All types of data can be deposited in NAKALA, provided they are research data (e.g. NAKALA does not accept administrative data).
No particular file format is imposed, but using open formats is strongly recommended (Cf. Preparing your data).
Data must be documented in as much detail as possible. Five pieces of metadata are mandatory, but it is (strongly) recommended to use more (Cf. Description guide).
Ownership of data deposited in NAKALA¶
Data entered in NAKALA remain the property of the depositor and under his or her responsibility.
Published data cannot be deleted except in cases of force majeure, in which case the deletion operation is carried out after verification by the NAKALA team. A trace associated with the perennial identifier is systematically kept (i.e. “tombstone”). The definitive deletion of the information will only be effective after a latency period due to media refreshing and backup rotation.
Reuse of data is governed by the associated license: for this reason, the license is a mandatory metadata.
Data size¶
There is no formal limit on the size of deposited data. However, if you plan to upload large volumes (e.g. over 10 GB per file), please contact the NAKALA team beforehand.
How Digital Object Identifiers are managed¶
All repositories are assigned a DOI identifier (e.g. 10.34847/nkl.f11cyqlk), which enables the data to be cited in a standardized way and accessed on a permanent basis (Cf. via the NAKALA interface or directly via CrossCite).
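As a minimal illustration, a DOI such as the one above is resolved through the standard doi.org resolver; the helper name below is ours, not part of NAKALA:

```python
# Sketch: a DOI is cited and resolved via the doi.org resolver.
# The example DOI comes from this FAQ; doi_url() is an illustrative helper.
def doi_url(doi: str) -> str:
    """Return the standard resolver URL for a DOI string."""
    return f"https://doi.org/{doi}"

print(doi_url("10.34847/nkl.f11cyqlk"))  # https://doi.org/10.34847/nkl.f11cyqlk
```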
As mentioned above, when data is deleted, the metadata is updated so that a trace of the deposit in NAKALA is kept.
What are the different statuses of data in NAKALA?¶
Data deposited in NAKALA can have different statuses, depending on its progress through the lifecycle:
- Deposited data: data in the process of being documented, not yet published and not accessible.
- Published data (with or without embargo): documented data that is published, and accessible if not under embargo.
- Deleted data: data that was published in NAKALA and whose deletion has been requested by the user from the NAKALA team. A trace of its presence is kept in the metadata associated with the DOI identifier.
- Data preserved at CINES: data published in NAKALA that has been deposited at CINES after an audit (e.g. format verification) and preparation (e.g. organization and documentation).
What checks are carried out on the data?¶
At the time of deposit:¶
Various checks are carried out when the data is deposited for validation purposes:
- Checking the presence of mandatory metadata;
- “nakala:license” values must be taken from the NAKALA license repository;
- “nakala:type” values must be taken from NAKALA’s type repository;
- The ISO language code of a metadata field must belong to the NAKALA language repository (ISO 639-1 when possible, otherwise ISO 639-3);
- The “nakala:created” date value may be empty, or must be a character string in the format “YYYY”, “YYYY-MM” or “YYYY-MM-DD”.
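The deposit-time checks above can be sketched as follows. This is a hypothetical illustration, not NAKALA's actual validation code: the set of mandatory fields and the license list are stand-ins, and only the date-format rule is taken directly from the FAQ.

```python
import re

# Hypothetical sketch of the deposit-time checks described in this FAQ.
# MANDATORY and KNOWN_LICENSES are illustrative stand-ins for NAKALA's
# actual mandatory-metadata list and license repository.
MANDATORY = {"nakala:title", "nakala:creator", "nakala:created",
             "nakala:type", "nakala:license"}
KNOWN_LICENSES = {"CC-BY-4.0", "CC0-1.0"}  # stand-in for the license repository
# "nakala:created" may be empty, or must be YYYY, YYYY-MM or YYYY-MM-DD:
DATE_RE = re.compile(r"^\d{4}(-\d{2})?(-\d{2})?$")

def check_deposit(meta: dict) -> list[str]:
    """Return a list of validation errors (empty list means the deposit passes)."""
    errors = []
    missing = MANDATORY - meta.keys()
    if missing:
        errors.append(f"missing mandatory metadata: {sorted(missing)}")
    if meta.get("nakala:license") not in KNOWN_LICENSES:
        errors.append("nakala:license not in the license repository")
    created = meta.get("nakala:created", "")
    if created and not DATE_RE.match(created):
        errors.append("nakala:created must be YYYY, YYYY-MM or YYYY-MM-DD")
    return errors
```

For example, a deposit with all five fields and an empty or well-formed date passes, while a date such as “05-2021” is rejected.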
Regular checks:¶
As part of the HNSO project, Huma-Num is studying the possibility of implementing a quality index calculated on the basis of a set of controllable criteria:
- Verification of the file formats used and of the files’ conformity to these formats (Cf. Preparing your data);
- Verification of the quantity and quality of metadata (e.g. against the reference repositories); etc.
A comparison of file footprints is carried out regularly to verify the integrity of data files.
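The footprint comparison above amounts to recomputing each file's cryptographic fingerprint and comparing it with the one recorded at deposit time. A minimal sketch, assuming SHA-256 (the FAQ does not specify which hash algorithm NAKALA actually uses):

```python
import hashlib

# Sketch of the integrity check described above: recompute a file's
# fingerprint and compare it with the recorded one. SHA-256 is an
# assumption; NAKALA's actual algorithm is not specified in this FAQ.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, recorded_digest: str) -> bool:
    """True if the file on disk still matches its recorded fingerprint."""
    return sha256_of(path) == recorded_digest
```

Any mismatch signals that the stored file has changed since deposit and should be restored from backup.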
Huma-Num is developing the production of manual audits of certain collections based on different selection criteria (e.g. producer requests, new repositories, random selection, etc.). These audits will be the subject of reports provided to the producer, which will give rise to a dialogue to improve the quality of the deposited data.
When requesting long-term preservation¶
When a request for long-term preservation is made, an audit of the data to be preserved is carried out by the “Data and user support” department (Pôle Données).
Discussions take place to bring the data (and metadata) into line with the requirements expected for long-term preservation:
- General data organization;
- Quality of formats used and compliance of data with format specifications;
- Verification of metadata and addition of information required for long-term preservation (e.g. status, communicability, etc.);
Once these various points have been examined, the choice of the type of long-term preservation is made within a “liaison committee” defined by the collaboration agreement with CINES, Huma-Num’s partner for preservation.
Where are the data hosted?¶
Data deposited in NAKALA are stored on servers managed by Huma-Num and hosted at the IN2P3 computing center. This center was created to manage data produced in particle physics, nuclear physics and astroparticle physics.
This major national center is physically secured (e.g. redundant power supplies, network devices and cooling systems).
How is data backed up?¶
NAKALA data is stored on a network-attached storage (NAS) device. An image of the data (snapshot) is taken at regular intervals, enabling rapid restoration of data in the event of problems.
How is metadata saved?¶
NAKALA metadata is stored in a SQL database (MariaDB), which is backed up daily on Huma-Num’s infrastructure.
Metadata is also exposed in RDF format via a Triple-Store (GraphDB) which is also backed up daily on Huma-Num’s infrastructure.
NAKALA service continuity¶
The NAKALA service is hosted on Huma-Num’s infrastructure, which has a general disaster recovery plan.
More specifically, NAKALA service redundancy is ensured as follows:
- the NAKALA application runs redundantly on two different machines behind a dispatching tool (HAProxy), avoiding service interruptions in the event of a failure;
- data is stored on a NAS-type device, enabling rapid restoration; in addition, backups are made to magnetic tape daily;
- metadata is stored in a “classic” relational database, which is backed up daily, and is also stored in RDF format in a triple store, likewise backed up daily.
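The HAProxy setup described above can be sketched as a minimal configuration. All hostnames, ports and paths below are invented for illustration and do not describe Huma-Num's actual deployment:

```haproxy
# Hypothetical sketch: one HAProxy front end dispatching to two
# redundant application servers, as described above.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend nakala_front
    bind *:80
    default_backend nakala_back

backend nakala_back
    balance roundrobin
    option httpchk GET /health        # hypothetical health-check endpoint
    server app1 nakala-app1.example.org:8080 check
    server app2 nakala-app2.example.org:8080 check
```

With health checks enabled, HAProxy stops routing traffic to a machine that fails its check, so the service stays up as long as one server is healthy.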
To find out more, see the documentation on the technologies used in NAKALA.
NAKALA and Huma-Num¶
NAKALA is a data repository developed by Huma-Num. The NAKALA repository is based on proven, standards-compliant technologies (e.g. the Symfony framework, the GraphDB triple store, etc.). A three-person Huma-Num team works on its evolution and maintenance.
What is the status of Huma-Num?¶
Huma-Num is a national infrastructure for the Humanities and Social Sciences.
Huma-Num is operated by the CNRS (Centre National de la Recherche Scientifique), one of the world’s most important research institutions, founded in 1939. As of 2021, the CNRS employed over 30,000 people, including more than 10,000 researchers, and had a budget of over €3 billion.
How is Huma-Num funded?¶
Huma-Num is funded by the MESR (Ministry of Higher Education and Research) as part of the national infrastructure roadmap (see previous section). Huma-Num has accordingly received the IR* label from the MESR, confirming the French state’s long-term investment in the infrastructure.
What if Huma-Num disappears?¶
As Huma-Num is operated by the CNRS, responsibility for the data hosted in NAKALA is transferred to the CNRS in the event of Huma-Num’s dissolution.
The technological choices made, based on international standards, will at the very least enable us to transfer the data and make it available with the associated metadata on another type of infrastructure. For example, as the metadata is expressed in RDF, the basis of Semantic Web technologies, transferring it to a Triple-Store that complies with these standards will be simplified. The data itself is stored on standard file systems (Unix). Perennial identifiers can be easily updated to maintain data access links.