Note
Document in progress
NAKALA: Frequently Asked Questions¶
Data in NAKALA¶
What data can be deposited in NAKALA?¶
Any type of data can be deposited in NAKALA as long as it is research data (for example, administrative data is not accepted).
No particular file format is imposed, but the use of open formats is strongly recommended (see Preparing your data).
The data must be documented in as much detail as possible. Five metadata fields are mandatory, but providing more is strongly recommended (see the Filing Guide).
Ownership of data deposited in NAKALA¶
The data deposited in NAKALA remain the property of the depositor, who is responsible for them.
Published data cannot be deleted except in cases of force majeure; in such cases, the deletion is carried out after verification by the NAKALA team, and a trace associated with the persistent identifier is kept (a “tombstone”). The final deletion of the information only becomes effective after a delay due to media refresh cycles and backup rotation.
The reuse of data is governed by the license associated with it; for this reason, the license is a mandatory metadata field.
Data size¶
There is no formal limitation on the size of the deposited data. However, if you plan to upload large volumes of data (e.g. more than 10 GB per file), it is necessary to contact the NAKALA team beforehand.
How are the identifiers (Digital Object Identifiers) managed?¶
Each deposit is assigned a DOI (e.g. 10.34847/nkl.de148w0r), which allows the data to be cited in a standardized way (via the NAKALA interface or directly at https://citation.crosscite.org/) and to be accessed permanently.
As mentioned above, when data is deleted, the metadata is updated to keep a record of the deposit in NAKALA.
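For illustration, a formatted citation for a DOI can be retrieved through the content negotiation mechanism behind https://citation.crosscite.org/ (served via https://doi.org). The sketch below only builds the HTTP request, without performing a network call; the citation style name is just an example.

```python
import urllib.request

def citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build a content-negotiation request asking https://doi.org
    to return a formatted bibliographic citation for a DOI."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )

req = citation_request("10.34847/nkl.de148w0r")
print(req.full_url)               # https://doi.org/10.34847/nkl.de148w0r
print(req.get_header("Accept"))   # text/x-bibliography; style=apa
```

Sending this request with `urllib.request.urlopen(req)` would return the citation as plain text.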
What are the different statuses of data in NAKALA?¶
Data deposited in NAKALA can have different statuses depending on its progress through the life cycle:
- Deposited data: data being documented prior to publication; not yet accessible
- Published data (with or without embargo): documented data that has been published and is accessible unless under embargo
- Deleted data: data that was published in NAKALA and whose deletion was requested by the user from the NAKALA team; a trace of its presence is kept in the metadata associated with the DOI
- Data preserved at CINES: data published in NAKALA that has been deposited at CINES after an audit (e.g. format verification) and preparation (e.g. organization and documentation) prior to deposit at CINES
What controls are performed on the data?¶
At the time of submission¶
Various checks are performed at the time of deposit to validate the data; examples include:
- Checking for the presence of mandatory metadata;
- The values of “nakala:license” must be from the NAKALA license repository;
- The values of “nakala:type” must be from the NAKALA type repository;
- The ISO code for the language of a metadata value must belong to the NAKALA language repository (ISO 639-2 when possible, otherwise ISO 639-3);
- The value of the date “nakala:created” may be empty or must be a string matching the format “YYYY”, “MM-YYYY” or “DD-MM-YYYY”;
- etc.
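As an illustration of what such deposit-time checks can look like, here is a minimal sketch in Python. The field names echo the “nakala:” properties above, but the license list and the rule set are invented for the example; this is not NAKALA’s actual validator.

```python
import re

# Illustrative subset of deposit-time checks; not NAKALA's actual validator.
MANDATORY_FIELDS = {"nakala:title", "nakala:creator", "nakala:created",
                    "nakala:type", "nakala:license"}
KNOWN_LICENSES = {"CC-BY-4.0", "CC0-1.0"}  # stand-in for the license repository
# "YYYY", "MM-YYYY" or "DD-MM-YYYY" (nakala:created may also be empty)
CREATED_RE = re.compile(r"^(\d{4}|\d{2}-\d{4}|\d{2}-\d{2}-\d{4})$")

def validate(metadata: dict) -> list:
    """Return a list of validation errors (empty if the record passes)."""
    errors = []
    for field in MANDATORY_FIELDS - metadata.keys():
        errors.append(f"missing mandatory metadata: {field}")
    if metadata.get("nakala:license") not in KNOWN_LICENSES:
        errors.append("license not in the license repository")
    created = metadata.get("nakala:created", "")
    if created and not CREATED_RE.match(created):
        errors.append("nakala:created must match YYYY, MM-YYYY or DD-MM-YYYY")
    return errors

record = {"nakala:title": "Survey data", "nakala:creator": "A. Dupont",
          "nakala:created": "2021", "nakala:type": "dataset",
          "nakala:license": "CC-BY-4.0"}
print(validate(record))   # []
```

A record that fails any rule would come back with one error message per problem, which mirrors how a deposit can be rejected with explicit reasons.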
Regular checks¶
An assessment of the quality of the deposits is carried out regularly and provided to the depositors:
- Verification of the types of formats used for the files and the conformity of the files to these formats (see Preparing your data);
- Verification of the number and quality of metadata (e.g. against repositories).
- etc.
A “quality index” based on these different criteria is calculated.
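To make the idea of a quality index concrete, here is a purely hypothetical scoring function; the criteria, weights and 15-field cap are invented for illustration and do not reflect NAKALA’s actual index.

```python
# Purely illustrative quality score; NAKALA's real index and weights differ.
def quality_index(n_metadata: int, formats_ok: bool,
                  values_in_repositories: float) -> float:
    """Combine simple criteria into a 0-100 score.
    values_in_repositories: fraction of controlled fields whose value
    matches the corresponding repository (license, type, language...)."""
    metadata_score = min(n_metadata / 15, 1.0)   # metadata richness, capped
    return round(100 * (0.4 * metadata_score
                        + 0.3 * (1.0 if formats_ok else 0.0)
                        + 0.3 * values_in_repositories), 1)

print(quality_index(n_metadata=12, formats_ok=True,
                    values_in_repositories=1.0))   # 92.0
```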
A comparison of file fingerprints is performed regularly to verify the integrity of the data files.
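Fingerprint comparison of this kind amounts to recomputing a cryptographic hash of each file and comparing it with the value recorded at deposit time. Here is a sketch using SHA-256 (the hash algorithm actually used by NAKALA is not specified here):

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(files: dict) -> list:
    """Map of path -> recorded hash; return the paths whose current
    fingerprint no longer matches the recorded one."""
    return [p for p, recorded in files.items() if fingerprint(p) != recorded]
```

Run periodically, a non-empty result from `verify` flags files whose content has silently changed since deposit.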
When requesting long-term preservation¶
When the long-term preservation request is made, an audit of the data to be preserved is carried out by the “user support” department.
Discussions take place to bring the data (and metadata) into compliance with the requirements expected for long-term preservation:
- General organization of the data;
- Quality of the formats used and conformity of the data to the format specifications;
- Verification of metadata and addition of information needed for long-term preservation (e.g. status, discoverability, etc.).
- etc.
Once these different points have been examined, the type of long-term preservation is chosen by a “liaison committee” defined by the collaboration agreement with CINES, Huma-Num’s partner for preservation.
Data security¶
Where is the data hosted?¶
The data deposited in NAKALA is stored on servers managed by Huma-Num and hosted at the IN2P3 computing center (CC-IN2P3). This center was created to manage the data produced by particle physics, nuclear physics and astroparticle physics research.
This important national center is physically secured (e.g. redundant power supplies, network devices and cooling systems) and implements ZRR (Zone à Régime Restrictif) access restrictions.
How is the data stored?¶
NAKALA’s data is stored on a NAS (network-attached storage) device. An image of the data (snapshot) is taken at regular intervals, which allows the data to be restored quickly in the event of a problem.
In addition, a tape backup is performed daily on the CC-IN2P3 backup robot using IBM’s TSM (Tivoli Storage Manager) software.
How is the metadata backed up?¶
NAKALA metadata is stored in an SQL database (MariaDB) that is backed up daily on Huma-Num’s infrastructure.
The metadata is also exposed in RDF format via a triple store (GraphDB), which is likewise backed up daily on Huma-Num’s infrastructure.
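To make the RDF exposure concrete, the sketch below hand-serializes a few statements about a deposit as N-Triples, using Dublin Core terms as example predicates; the real NAKALA export may use other vocabularies and a proper RDF library.

```python
def n_triples(subject_doi: str, properties: dict) -> str:
    """Serialize simple literal statements about a DOI as N-Triples."""
    subject = f"<https://doi.org/{subject_doi}>"
    lines = []
    for predicate, value in properties.items():
        # Escape backslashes and double quotes as required by N-Triples.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{predicate}> "{escaped}" .')
    return "\n".join(lines)

triples = n_triples("10.34847/nkl.de148w0r", {
    "http://purl.org/dc/terms/title": "Survey data",
    "http://purl.org/dc/terms/creator": "A. Dupont",
})
print(triples)
```

Because N-Triples is a W3C-standardized line-based format, output like this can be loaded into any standards-compliant triple store.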
NAKALA service continuity¶
The NAKALA service is hosted on Huma-Num’s infrastructure, which has a general disaster recovery plan.
More specifically, the redundancy of the NAKALA service is ensured as follows:
- the NAKALA application is replicated on two different machines behind a load-balancing tool (HAProxy), which avoids service interruptions in case of failure;
- the data is stored on a NAS-type device, which allows a file to be restored quickly; in addition, backups are made to magnetic tape daily;
- metadata is stored in a “classic” relational database that is backed up daily; it is also stored in RDF format in a triple store, which is likewise backed up daily.
Technologies used in NAKALA¶
NAKALA and Huma-Num¶
NAKALA is a data repository developed by Huma-Num and based on proven, standard technologies (e.g. the Symfony framework, the GraphDB triple store, etc.). A team of three people at Huma-Num works on its evolution and maintenance.
What is the status of Huma-Num?¶
Huma-Num is a national infrastructure for the Humanities and Social Sciences.
As a national infrastructure, Huma-Num is included in the national roadmap, whose evolution is aligned with that of the European infrastructures coordinated by ESFRI.
Huma-Num is operated by the CNRS (Centre National de la Recherche Scientifique), one of the largest research institutions in the world, founded in 1939. In 2021, the CNRS employed more than 30,000 people, including more than 10,000 researchers, with a budget of 3 billion euros.
How is Huma-Num funded?¶
Huma-Num is funded by the MESRI (Ministry of Higher Education, Research and Innovation) as part of the national infrastructure roadmap (see previous section).
And if Huma-Num disappears…¶
As Huma-Num is operated by the CNRS, responsibility for the data hosted in NAKALA is transferred to the CNRS if Huma-Num is dissolved.
The technological choices made, based on international standards, would at least allow the data to be transferred and made available, with the associated metadata, on another type of infrastructure. For example, since the metadata is expressed in RDF, the foundation of Semantic Web technologies, transferring it to a standards-compliant triple store would be straightforward. The data itself is stored on standard (Unix) file systems. The persistent identifiers can easily be updated to maintain access links to the data.