Note
Document in progress
Prepare data for deposit in NAKALA¶
Before starting the project¶
Most funding agencies now require a data management plan, which lets you plan how the data will be managed during the project and, above all, after the project has ended.
Writing such a plan before the project starts is worthwhile even when it is not formally required.
The questions to ask include the following (the list is not exhaustive):
- How will the data needed for the project be collected or created?
- What is the projected volume of the data?
- What tools will be used to process the data?
- How will the data be organized (e.g. at what granularity)?
- What formats will be used for the data?
- How will the data be documented (e.g. what information should be collected)?
- In which repository or repositories will the data be stored, e.g. in order to make them widely available via aggregators such as ISIDORE?
- What licenses will be used to make the data available?
- Do the data contain personal or “sensitive” information, and if so, what steps are needed to comply with the GDPR (RGPD in French)?
Tools that help with writing a data management plan are available, such as DMP OPIDoR, provided by INIST.
Reference documents maintained by the CNRS are also worth consulting at this stage of the project:
- CNRS Research Data Plan;
- INSHS: Humanities and social sciences and personal data protection in the context of open science (Guide pour la recherche).
Choose the format(s)¶
Open formats whose evolution roadmap is known should be preferred to proprietary formats.
At the national level, CINES, Huma-Num's partner for long-term preservation, monitors formats. A list of quality formats, each associated where possible with a validation tool, is available at http://facile.cines.fr
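As a rough illustration of how a project might check its files against a list of preferred formats, the sketch below counts files by extension and reports the extensions that are not on the list. The `PREFERRED_EXTENSIONS` set and the function names are assumptions made for this example; the set is not taken from CINES or any other authority and should be replaced with the formats actually chosen for the project.

```python
from collections import Counter
from pathlib import Path

# Hypothetical shortlist, for illustration only: replace it with the
# formats the project has actually decided to use.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".xml", ".tif", ".tiff", ".png"}


def inventory_formats(root):
    """Count the files found under `root`, keyed by lower-cased extension."""
    return Counter(
        path.suffix.lower() for path in Path(root).rglob("*") if path.is_file()
    )


def formats_to_review(counts, preferred=PREFERRED_EXTENSIONS):
    """Return, in sorted order, the extensions that are not on the preferred list."""
    return sorted(ext for ext in counts if ext not in preferred)
```

Calling `formats_to_review(inventory_formats("my_project"))` (the folder name is hypothetical) gives a list of extensions to examine before deposit. An extension only hints at a format, so this does not replace a real validation tool.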
Another interesting resource is the risk matrix associated with format use, maintained by the U.S. National Archives (NARA).
It may also be useful to follow the work on format issues carried out by the national working group of the PIN group (Pérennisation des Informations Numériques, i.e. preservation of digital information), which brings together experts from many institutions and was created under the auspices of the Association Aristote.
Organization of data¶
Granularity of the deposit¶
NAKALA lets you create deposits containing one or more files, and these deposits can be organized into collections (see NAKALA’s documentation).
You therefore need to decide how to organize the data so that they are made available in the most comprehensible way, giving as much information as possible about the context in which the data were produced and the processing applied to them. Generally, a finer granularity (e.g. one file per deposit) allows for a more thorough description, but groupings can make sense intellectually, such as placing all the digitized pages of a work in the same deposit.
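The grouping case mentioned above (all the digitized pages of a work kept together) can be sketched as follows. The sketch assumes a hypothetical naming convention in which each page file starts with a work identifier followed by an underscore, e.g. `ms12_001.tif`; neither the convention nor the function name comes from NAKALA.

```python
from collections import defaultdict


def group_pages_by_work(filenames, separator="_"):
    """Group page files by the work identifier found before the first separator.

    Assumes names such as "ms12_001.tif", where "ms12" identifies the work.
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        work_id = name.split(separator, 1)[0]
        groups[work_id].append(name)
    return dict(groups)
```

Each key of the returned dictionary then corresponds to one candidate grouping, with its pages listed in sorted order.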
Naming plan¶
In all cases, and well before the data are deposited in NAKALA, efficient data management calls for a coherent naming plan that makes it possible to organize the data files by grouping them into folders. Among other benefits, this helps avoid duplication or loss of data.
A few good practices apply when naming files and folders:
- keep names reasonably short (i.e. fewer than 30 characters);
- use standard characters only, excluding diacritics (e.g. “ç”), special characters (e.g. “&”), spaces, etc.;
- use meaningful titles and name things consistently (e.g. add the date to the file name for sorting purposes).
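The practices listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed tool: the function names, the choice of underscore as the replacement character and the date-first layout are all choices made for this example.

```python
import re
import unicodedata
from datetime import date

MAX_NAME_LENGTH = 30  # names should stay below this many characters


def normalize_name(name):
    """Strip diacritics and replace spaces and special characters with "_".

    Characters that have no ASCII decomposition are simply dropped.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_only)
    return re.sub(r"_+", "_", cleaned).strip("_")


def is_reasonable_length(name, limit=MAX_NAME_LENGTH):
    """Return True when the name is shorter than the limit."""
    return len(name) < limit


def dated_name(day, label, extension):
    """Prefix the label with an ISO date so that names sort chronologically."""
    return f"{day.isoformat()}_{normalize_name(label)}{extension}"
```

For example, `normalize_name("Reçu & facture 2021.pdf")` returns `Recu_facture_2021.pdf`, and `dated_name(date(2021, 3, 5), "interview", ".wav")` returns `2021-03-05_interview.wav`.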
Filing plan¶
If the data are documented only after they have been collected, a simple tree-like filing plan can be used to structure them in a way that facilitates searching, checking and processing. Depending on the nature of the project, such structures can range from a simple chronological order to a layout that mirrors a geographical or thematic organization.
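As an example of the simplest case, a chronological tree, the sketch below computes the folder in which a file would be filed from its date. The year/month layout, the root folder name and the function name are assumptions made for illustration; a geographical or thematic plan would follow the same idea with different path components.

```python
from datetime import date
from pathlib import PurePosixPath


def chronological_folder(root, day):
    """Return the folder root/YYYY/MM in which a file dated `day` is filed."""
    return PurePosixPath(root) / f"{day.year:04d}" / f"{day.month:02d}"
```

With this layout, `chronological_folder("corpus", date(2021, 3, 5))` gives the path `corpus/2021/03`.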
Document the data¶
A good description of the data through metadata is fundamental to making them available; it complements the work of organizing the data described in the previous section. As before, it is useful to develop and apply consistent description rules (e.g. for titles, dates, authors, keywords, etc.) to all the data in the project.
NAKALA uses the Dublin Core model to describe the data; guidance on using this model can be found in the repository guide.
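One way to apply description rules consistently is to check each record against the list of elements the project has decided to fill in. The sketch below uses plain dictionaries keyed by Dublin Core element names (`title`, `creator`, `date`, `subject`). The `REQUIRED_ELEMENTS` tuple is a hypothetical project rule, not a statement of what NAKALA requires, and the record layout is not NAKALA's own data format.

```python
# Hypothetical project rule: the Dublin Core elements every record must fill in.
REQUIRED_ELEMENTS = ("title", "creator", "date", "subject")


def is_blank(value):
    """Return True for None, empty or whitespace-only strings, and empty lists."""
    if isinstance(value, str):
        return not value.strip()
    return not value


def missing_elements(record, required=REQUIRED_ELEMENTS):
    """List the required elements that are absent or blank in the record."""
    return [name for name in required if is_blank(record.get(name))]
```

Running `missing_elements` over every record before deposit gives a quick list of descriptions that still need to be completed.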