Document in progress
How to prepare your data before a deposit in NAKALA
Before starting the project
Today, most funding agencies require a data management plan (plan de gestion de données), which lets you plan how your data will be managed during the life of the project and, above all, once the project has been completed.
This is an important step, and should be taken before starting a project, even if a data management plan is not formally required.
Questions to ask include (but are not limited to) the following:
- How will the data required for my project be collected or created?
- What is the anticipated volume of data?
- What tools will be used to process the data?
- How will the data be organized (e.g. with what granularity)?
- What formats will be used for the data?
- How will the data be documented (e.g. what information needs to be collected)?
- In which repository or repositories will the data be stored, with the aim, among other things, of making it widely available via aggregators (e.g. ISIDORE)?
- What licenses will be used to make the data available?
- Does the data include personal or “sensitive” information, and what steps are needed to comply with the GDPR (RGPD in French)?
Reference documents maintained by the CNRS are also useful to consult at this stage of the project:
- [Plan données de la recherche du CNRS](https://www.cnrs.fr/sites/default/files/pdf/Plaquette_PlanDDOR_Nov20.pdf);
- [INSHS: Guide pour la recherche](https://www.inshs.cnrs.fr/sites/institut_inshs/files/pdf/Guide_rgpd_2021.pdf), on the humanities and social sciences and the protection of personal data in the context of open science.
Choosing the format(s)
Prefer open formats with a known evolution roadmap over proprietary formats.
At the national level, CINES, Huma-Num's partner for long-term preservation, monitors file formats. A list of quality formats, associated where possible with a validation tool, is available at http://facile.cines.fr
Another useful resource is the file format risk matrix maintained by the US National Archives (NARA). For most file formats, it gives a risk indicator (Low, Moderate, High) as well as an indication of “acceptable” or “preferred” formats to guide your choices.
For related tools, you can consult the directory compiled by the COPTR (Community Owned digital Preservation Tool Registry) project.
It may also be useful to follow the work on format issues of the national working group PIN (Pérennisation des Informations Numériques, i.e. preservation of digital information). This group is made up of experts from numerous institutions, and was created under the aegis of the Aristote association.
NAKALA allows you to create deposits containing one or more files, and these deposits can be organized into collections (see NAKALA documentation).
It is therefore necessary to consider how the data should be organized in order to make it available in the most comprehensible way: by providing as much information as possible on the context in which the data was produced, and on the processing that has been carried out on it. Generally speaking, a finer granularity (e.g. one file per deposit) enables a more detailed description. However, groupings can make sense intellectually, for example by placing all the digitized pages of a book in the same deposit.
In all cases, for efficient data management, even well in advance of the deposit in NAKALA, it is useful to draw up a coherent naming plan and to organize data files by grouping them into folders. Among other benefits, this will help avoid duplication or loss of data.
Here are a few best practices for naming files and folders:
- keep names to a reasonable length (i.e. less than 30 characters);
- use standard characters only, excluding diacritics (e.g. “ç”), special characters (e.g. “&”), spaces, etc.;
- use meaningful titles and name files consistently (e.g. add the date to the file name for sorting purposes).
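The naming practices above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name `safe_name`, the choice of underscores as separators, lowercasing, the `YYYYMMDD` date prefix, and applying the length limit to the name without its extension are all hypothetical choices, not NAKALA requirements.

```python
import re
import unicodedata
from datetime import date
from typing import Optional

# Assumption: the "less than 30 characters" guideline is applied to the
# name without its extension.
MAX_STEM_LENGTH = 29


def safe_name(original: str, day: Optional[date] = None) -> str:
    """Return a name limited to ASCII letters, digits, '-' and '_'."""
    stem, dot, ext = original.rpartition(".")
    if not dot:  # no extension present
        stem, ext = original, ""
    # Decompose accented characters, then drop the combining marks ("ç" -> "c").
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    # Replace each run of other characters (spaces, "&", ...) with one underscore.
    stem = re.sub(r"[^A-Za-z0-9-]+", "_", stem).strip("_").lower()
    if day is not None:
        # A YYYYMMDD prefix makes alphabetical order match chronological order.
        stem = f"{day:%Y%m%d}_{stem}"
    stem = stem[:MAX_STEM_LENGTH].rstrip("_")
    return f"{stem}.{ext.lower()}" if ext else stem
```

For example, `safe_name("Reçu & facture été.PDF", date(2021, 3, 5))` yields `20210305_recu_facture_ete.pdf`. Truncation can create collisions between long names, so a real project would probably want to check for duplicates after renaming.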
If the data is documented after it has been collected, a simple, tree-structured classification plan can be used to organize it and to facilitate searching, checking and processing. Such structures can range from a simple chronological order to the reproduction of geographical or thematic organizations, depending on the nature of the project.
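As a minimal sketch of such a classification plan, the function below builds a thematic then chronological folder path. The layout (theme, then year, then month) and the name `classified_path` are assumptions chosen for illustration; a project could just as well classify by place or by source.

```python
from datetime import date
from pathlib import PurePosixPath


def classified_path(root: str, theme: str, day: date, file_name: str) -> PurePosixPath:
    """Build root/theme/year/month/file_name (hypothetical layout)."""
    return PurePosixPath(root) / theme / f"{day:%Y}" / f"{day:%m}" / file_name
```

Calling `classified_path("corpus", "interviews", date(2021, 3, 5), "20210305_notes.txt")` gives `corpus/interviews/2021/03/20210305_notes.txt`. Zero-padded months keep folders in chronological order when sorted by name.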
Properly describing data with metadata is fundamental to making it available. This work complements the data organization described in the previous section. As with file organization, it is useful to develop and apply consistent description rules (e.g. for titles, dates, authors and keywords) to all project data.
NAKALA uses the qualified Dublin Core model to describe data. Guidelines for using this model can be found in the description guide.
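Consistent description rules can be checked before deposit with a short script. The sketch below assumes a project-level rule that every record carries a title, a creation date, a creator and at least one subject, expressed with Dublin Core term names. The list `REQUIRED_TERMS`, the record layout (a dictionary of term to list of values) and the function `missing_terms` are hypothetical and do not reflect the fields NAKALA itself requires; refer to the description guide for those.

```python
from typing import Dict, List

# Assumed project rule, not a NAKALA requirement.
REQUIRED_TERMS = ("dcterms:title", "dcterms:created", "dcterms:creator", "dcterms:subject")


def missing_terms(record: Dict[str, List[str]]) -> List[str]:
    """List the required terms that are absent or have only blank values."""
    return [
        term
        for term in REQUIRED_TERMS
        if not any(value.strip() for value in record.get(term, []))
    ]
```

A record such as `{"dcterms:title": ["Letter"], "dcterms:created": ["1912-04-02"]}` would be reported as missing `dcterms:creator` and `dcterms:subject`, which makes gaps visible across a whole project before any upload.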