- Organisation of the Repository
- Repository operations and data management
- Legal aspects
- Data storage
- Technical infrastructure
1. Organisation of the Repository
1.1 Responsible entity
The Repository for Open Data RepOD (“Repository”) is operated by the University of Warsaw, based in Warsaw (00-927), Krakowskie Przedmieście 26/28 (“UW”) at the Repository website (https://repod.icm.edu.pl).
1.2 Purpose of the Repository
The purpose of the Repository is to make research data, such as tabular data, images, audiovisual materials, and any other type of data produced, collected, or described for research purposes, available on the web.
1.3 Scope of data and target audience
The Repository for Open Data RepOD is a general-purpose data repository for all academic community members, with a particular focus on researchers affiliated with Polish scientific institutions.
Research data from all fields of science can be collected, stored, and made available in the Repository.
1.4 Basis for the Repository’s operation
The Repository operates according to the Terms of Use available at https://repod.icm.edu.pl/terms-of-use-page.xhtml. Additional information on depositing research data is provided in the “User Guide” available at https://repod.icm.edu.pl/guides/pl/4.11/user/index.html and on the Repository information website available at https://repod.icm.edu.pl/info.
The current instance of the Repository was created as part of the “Disciplinary Open Research Data Repositories” project, implemented by the University of Warsaw between 2018 and 2021.
1.5 Repository staff
The documents defining the responsibilities of the relevant staff of the Interdisciplinary Centre for Mathematical and Computational Modelling UW (ICM UW) include tasks related to the daily maintenance of the repository and the management of data in the repository.
Responsibilities for the daily maintenance of the Repository are taken into account by ICM UW management when defining the organisational structure of ICM UW (following §5 of the ICM UW Regulations) and in the workforce planning process.
1.6 Financing of the Repository’s activities
The daily maintenance of the Repository is financed from the “Disciplinary Open Research Data Repositories” project funds and, after its completion, from the University of Warsaw’s funds. As part of maintaining the project’s effects during its sustainability period, the University of Warsaw is committed to ensuring the operation of the Repository at least until 2026.
Efforts are also being made on an ongoing basis to raise additional funds from both national and international sources to enable further development of the Repository.
1.7 Risk register
During the implementation period of the project “Disciplinary Open Research Data Repositories,” the project’s “Risk register” is the basis for managing risks associated with the Repository’s functioning.
At the end of the project, a new risk register for the Repository will be developed based on the existing register, considering the Repository’s functioning in the following years.
After the end of the project, the risks will be reviewed once a year, and the risk register will be updated accordingly.
If significant risks are identified outside the annual risk review procedure, the risk register may be updated ad hoc.
1.8 Periodic inspections
Once a year, the following shall be carried out:
- review of the file formats within the shared datasets;
- review of funding opportunities for the development of the Repository;
- review of the functionalities and new technologies that justify modification of the software on which the Repository is based;
- risk review.
1.9 Information on data usage
The system collects information on downloads of individual files and datasets. The number of downloads for individual files is publicly available.
Information about the total number of downloads of files and entire datasets is also publicly available for the entire Repository collection.
1.10 Audit and certification
The Repository has not been subject to an audit related to the certification process. However, the University of Warsaw plans to submit the Repository for a certification process in the future.
1.11 Proceedings in the event of termination of the Repository
In the event of circumstances necessitating the termination of the Repository, the University of Warsaw will seek to transfer all the data stored in the Repository to another location while maintaining the continuity of the correct functioning of the DOI numbers assigned to the datasets.
The choice of a new location for the datasets will depend on the current availability of infrastructure for the transfer.
1.12 Amendments to the long-term preservation policy
Subsequent versions of this document can be found on the Repository’s information website, available at https://repod.icm.edu.pl/info/.
2. Repository operations and data management
2.1 Institutional collections
An agreement between another academic institution and the University of Warsaw allows the institution to maintain a separate institutional collection within the Repository.
The extent of data and metadata made available within institutional collections may depend on their specific configuration and the rules set by the co-hosting academic institutions.
Institutional collections may also impose additional conditions on who can upload and share research data in the collection and how.
2.2 Data acquisition
The Repository does not define specific requirements for the characteristics of datasets, except that at least one file must be deposited within a dataset. Open file formats are preferred.
The Repository always maintains the original format of the deposited file.
The Repository accepts datasets of any size. The only restriction is a limit of 5 GB per individual file.
An MD5 checksum is calculated for each file uploaded to the Repository. This makes it possible to compare it with the checksum of the file calculated on the User side and verify the correctness of the data transfer.
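The comparison described above can be done on the User side with a few lines of code. The following is a minimal sketch, not part of the Repository software: the function names `md5_of_file` and `matches_repository_checksum` are illustrative, and the checksum is assumed to be shown as a hexadecimal string.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so that large files (up to the 5 GB
        # limit) do not have to fit into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_repository_checksum(path: Path, repository_md5: str) -> bool:
    """Compare the local digest with the value displayed by the Repository.

    Assumption: the Repository value is a hexadecimal string; surrounding
    whitespace and letter case are ignored.
    """
    return md5_of_file(path) == repository_md5.strip().lower()
```

A matching result indicates that the transferred file is byte-for-byte identical to the local copy; a mismatch suggests the upload should be repeated.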
The Repository allows data files to be uploaded through a web-based graphical user interface or API.
When creating a dataset, the User initially produces a draft version, which may be subject to changes and additions in terms of metadata and file editing. Once it has been published and given a version number, the User is not able to modify the metadata and files that comprise it. However, the User can create additional numbered versions of the dataset based on the existing published version.
2.3 Requirements for depositing datasets
There is no charge for Users depositing or downloading data.
The use of the data storage and sharing functionality of the Repository requires the creation of a User account and acceptance of the Terms of Use. The depositor has an individual account to which an e-mail address and an authentication password are assigned.
The scope of information required to create and edit a dataset is indicated directly in the metadata form. A dataset can be submitted for verification if it contains at least one file and all required metadata fields have been completed.
The configuration of metadata sets and mandatory fields is done at the collection level. This means, in particular, that different collections within the Repository may have different metadata sets and, within them, different sets of mandatory fields.
2.4 Minimum set of metadata in the main Repository collection
The Repository, as a general-purpose repository, requires only basic information about the datasets to be entered. These metadata fields are:
- title;
- author;
- contact person;
- description;
- subject area.
In addition, the following fields are available in the main collection of the Repository and are optional:
- keywords;
- related publication;
- grant information;
- related dataset.
Different requirements for the mandatory metadata fields may apply within individual collections, especially institutional collections co-hosted by external institutions under separate agreements between these institutions and UW.
When depositing data, the User must also indicate the licences or conditions under which individual files will be available.
2.5 Metadata validation
Entering the minimum required metadata is necessary to create a draft dataset and then publish it. The repository software automatically verifies the completion of the relevant fields when the draft version is saved. In addition, the correctness of the description is checked when the dataset is verified before publication.
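The completeness check can be pictured as follows. This is a simplified sketch under stated assumptions, not the actual Repository implementation: the field keys (`title`, `author`, `contact_person`, `description`, `subject_area`) are hypothetical names for the mandatory fields of the main collection listed in section 2.4, and metadata is modelled as a plain dictionary.

```python
# Hypothetical keys for the mandatory fields of the main collection.
REQUIRED_FIELDS = ("title", "author", "contact_person", "description", "subject_area")


def _is_blank(value) -> bool:
    """Treat None, whitespace-only strings and empty collections as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) == 0
    return False


def missing_required_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty, in a fixed order."""
    return [name for name in REQUIRED_FIELDS if _is_blank(metadata.get(name))]


def can_submit_for_verification(metadata: dict, files: list) -> bool:
    """A dataset needs at least one file and all required fields completed."""
    return len(files) >= 1 and not missing_required_fields(metadata)
```

Because mandatory fields are configured at the collection level, a real check would take the list of required fields from the collection configuration rather than from a constant.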
2.6 Embargoed datasets
For embargoed datasets, the User specifies the date from which the files in the dataset become available. From the moment such a dataset is published, its metadata is publicly available. An embargo can only be set for datasets that do not yet have a published version.
2.7 Maximum embargo period
The maximum embargo period in the Repository is 36 months.
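As an illustration, the limit can be checked as sketched below. The sketch assumes that the embargo period is counted in calendar months from the publication date; this document does not state the reference date, so that choice, like the function names, is an assumption.

```python
import calendar
from datetime import date

MAX_EMBARGO_MONTHS = 36  # maximum embargo period stated in this policy


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_embargo_allowed(publication_date: date, embargo_end: date) -> bool:
    """Assumption: the embargo must end after publication and no later
    than 36 calendar months after it."""
    latest_end = add_months(publication_date, MAX_EMBARGO_MONTHS)
    return publication_date < embargo_end <= latest_end
```

For example, for a dataset published on 15 January 2024 this sketch accepts any embargo end date up to and including 15 January 2027.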
2.8 Files with restricted access
For files with restricted access, the Repository allows requesting access to a specific file. The request is addressed to the Repository User who deposited the dataset containing the file. The Repository maintains a copy of the e-mail containing the access request for this resource but does not interfere with further correspondence between the User requesting access and the User who deposited the file.
2.9 Anti-malware check
When uploading to the Repository, files are subject to anti-virus checks.
If an anomaly is detected, saving the file is prevented, and the User is informed that the file upload operation has failed due to malware detection.
2.10 Verification of deposited datasets
Once research data has been deposited and described in the Repository, the User submits it for verification. Verification is performed by the person with the permissions to publish datasets in the collection. This consists of:
- verification of the correctness of the metadata entered;
- validation of files.
In the Repository collections where ICM UW is responsible for verification of the content, this process (from the moment of submission for verification until the decision to publish or return the dataset for correction) usually takes up to 3 working days.
In the case of datasets deposited in institutional collections, verification may be the responsibility of designated individuals representing the institution that co-manages the collection.
A formal confirmation of the publication of the dataset version is a notification sent to the depositor’s e-mail address.
2.11 Dataset versions
Each dataset deposited by a User consists of at least one version. A dataset version consists of metadata and files containing the data and its supporting documentation.
The User initially creates a draft version when creating or editing a dataset. This version can be edited in terms of both metadata and files. The possibility of editing is blocked when a specific dataset version is published. An exception to this rule is minor corrections of obvious errors in the metadata applied by the Repository Administrator. Such changes do not require the publication of a new dataset version.
When publishing a dataset, the collection Administrator determines whether the changes made require publishing a draft version as a new major or minor version.
A draft version in which the files have been modified can only be published as a new major version.
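The versioning rule can be summarised in a short sketch. It assumes a `major.minor` numbering that starts at 1.0 for the first published version, which is the convention of the underlying Dataverse software but is not spelled out in this document; the function and parameter names are illustrative.

```python
from typing import Optional, Tuple

Version = Tuple[int, int]  # (major, minor)


def next_version(
    current: Optional[Version],
    files_changed: bool,
    publish_as_major: bool,
) -> Version:
    """Return the number assigned to a draft when it is published.

    current          -- last published version, or None for a new dataset
    files_changed    -- True if files were modified in the draft
    publish_as_major -- the collection Administrator's decision
    """
    if current is None:
        # Assumption: the first published version is numbered 1.0.
        return (1, 0)
    major, minor = current
    # Modified files always force a new major version, regardless of
    # what the Administrator would otherwise choose.
    if files_changed or publish_as_major:
        return (major + 1, 0)
    return (major, minor + 1)
```

With these rules, a metadata-only correction to version 1.0 may be published as 1.1, while replacing a file leads to 2.0.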
The dataset version number is part of the suggested citation of the dataset, which is visible on the dataset page.
2.12 Deletion and withdrawal of datasets and versions of datasets
Once a version of a dataset has been published, it cannot be deleted. However, it is possible to withdraw a version of a dataset in exceptional cases. The entire dataset is withdrawn by withdrawing all the versions that comprise it.
When a version of a dataset is withdrawn, only the basic information about the dataset (the so-called tombstone) remains publicly available:
- citation;
- the reason for the withdrawal.
The full metadata and data of the withdrawn version remain available to Users whose system roles include permission to publish and withdraw resources.
2.13 Form of data archiving
Data shall be archived and made available in the form provided to the Repository by the User. The Administrator of a given collection may request corrections or additions to a dataset before publishing it and, in the case of minor and obvious mistakes (e.g. typos), make the necessary corrections themselves.
In the case of selected tabular data formats, copies of these data are additionally created in other formats to make the data more accessible to Users using different types of software. This conversion takes place automatically. The deposited file is always retained in its original format.
In the case of tabular files, a UNF (universal numerical fingerprint) value is additionally generated, allowing Users to verify the correctness of the conversion.
2.14 DOI identifier
Each dataset deposited in the Repository is assigned a DOI number within the Repository prefix. This number is reserved locally within the Repository installation when the first draft of a dataset is created. The DOI number is activated when the first version of the dataset is published.
The DOI number of a dataset is part of the suggested citation, which is visible on the dataset page.
2.15 DOI and dataset versions
All dataset versions have the same DOI identifier and are distinguished by version number. Information about all dataset versions is presented on the dataset page under the tab “Versions”.
Within the metadata, it is also possible to identify other objects (publications, datasets) linked to the dataset and indicate their DOI numbers, as well as the type of relation between them.
2.16 Guest book and system logs
In selected collections, the ability to download files via the graphical user interface can be conditioned on the downloader completing a short questionnaire.
Every file download is logged for all resources in the Repository (both open and restricted access).
For each such activity, the logs record information about:
- the file identifier;
- the type of download;
- the e-mail of the User downloading the file;
- the name of the User downloading the file (if applicable);
- the position of the User downloading the file (if applicable);
- the date and time the file was downloaded;
- the ID of the logged-in User (if applicable);
- the ID of the data file;
- dataset identifier;
- dataset version identifier;
- guestbook identifier (if applicable);
- answers to additional guestbook questions (if applicable).
2.17 User support in the Repository
The Repository allows Users to contact it via e-mail (repod@icm.edu.pl) and the contact form on its website. Relevant ICM UW staff deal with reported problems.
In the case of questions about a collection, the Repository makes it possible to reach the designated contact person for the specific collection.
For questions about a particular dataset, the Repository allows Users to reach the contact persons indicated for that specific dataset.
In addition, Users receive automatic e-mail notifications regarding:
- dataset creation;
- dataset sent for verification;
- dataset returned for correction;
- publication of a dataset;
- sending a request for access to a restricted file;
- granting a system role;
- withdrawing a system role.
3. Legal aspects
3.1 Rights over datasets
It is the responsibility of the User providing data to the Repository to clarify the legal status of the deposited data, and to resolve any doubts about it, in accordance with the Terms of Use. By supplying data to the Repository, the User declares that he/she has sufficient rights to deposit the data and make it available.
3.2 Licence for the University of Warsaw
By submitting data to the Repository, the User grants the University of Warsaw a non-exclusive licence covering the preservation, storage, and reproduction of research data by digital technology for the purpose of operating the Repository. This licence covers all submitted research data, especially metadata, rights related to research data as a collection of files, and files constituting research data. The licence also covers sharing research data with third parties, except where public access to a particular file has been disabled.
3.3 Declaration regarding metadata and collection of files
By submitting data to the Repository, the User subjects the data to the terms of Creative Commons 0 deed, with its full text available at https://creativecommons.org/publicdomain/zero/1.0/legalcode (“CC0”), the subject of which is metadata describing the research data and any rights related to the research data as a collection of files.
The inclusion of metadata describing research data in this declaration enables the free exchange of metadata with other services aggregating information about available datasets.
The inclusion of rights related to the dataset as a set of files in this declaration (e.g. collective work rights, sui generis database rights) facilitates the assessment of the extent of the User’s rights, which in this situation result only from the rules defined for the individual files included in the set (e.g. the indicated CC licence).
3.4 Conditions for file sharing
For each file included in a deposited dataset, the User can either assign a licence from the list of licences available in the Repository or make the file available on a fair-use or restricted-use basis.
In the case of restricted-use files, the decision to make a file available to a specific User is made on a case-by-case basis by the depositor and may be subject to the User’s acceptance of additional conditions (e.g. that the file may only be used for research purposes).
3.5 Multiple licensing
It is permissible for the same files to be available under different licences in different versions of a dataset (multiple licensing). If each of these files is publicly available, the User of the file can choose which of these licences to adhere to.
3.6 Form of licence
When the User makes resources, including files and metadata, available under a non-exclusive licence, its written form is not required (Polish law allows for the granting of non-exclusive licences in any form).
When the User makes the dataset files available on a fair-use basis, the data may be shared and used within the limits of the applicable law. In this case there are no licence provisions, so the question of their form does not arise.
3.7 Restricted file sharing
If a User makes dataset files available in a restricted manner, the additional conditions for access and use of the files are described for each file made available under these rules.
The granting of access to files to a particular User is in each case a decision of the depositor, who is also responsible for any additional measures to verify the person who requested the data.
3.8 Communication on the use of data
The Terms of Use of the Repository contain information on the terms and conditions under which a User can make data and metadata available. Each User accepts the Terms of Use when registering an account.
Information on the licence under which each file in the dataset has been made available is presented next to each file in the “Files” tab of the dataset page and additionally on the page of each file.
3.9 Monitoring of data use
The Repository does not monitor whether downloaded data is used in accordance with the licences under which it is made available. In the case of a breach of the licences or access conditions, it is up to the rights holder of the affected resource to take possible legal action.
The Repository also does not interfere with additional User verification for files made available under restricted access conditions (e.g., for research purposes only). Any such verification is left to the Users who deposit restricted-access files.
No login or account is required to access datasets made available under other conditions.
4. Data storage
4.1 Data integrity and backup copies
The software of the S3 storage on which the data files are stored is responsible for maintaining their integrity and correctness.
In addition, in the event of a failure or accidental deletion of resources from the S3 storage (e.g. in the event of a software error), it is also possible to restore lost resources from an additional backup copy stored in a separate location.
Database and file backups are stored separately.
4.2 Verification of data integrity
When a file is uploaded to the Repository, a checksum is generated. The user can compare it with the checksum generated locally to confirm that the copy of the file uploaded to the Repository matches the file on the User’s side.
The Repository periodically checks that the resources that, according to the information in the Repository database, should be in the S3 storage are indeed there. If a discrepancy is found, a backup copy of the files residing in a separate storage in a different location is retrieved.
Periodically, the checksums in the Repository database are also compared with those stored in the files’ metadata in the S3 storage.
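The two periodic checks (presence of files in the storage and agreement of checksums) can be sketched as a single audit pass. This is an illustrative model only: the database and the S3 storage are represented by plain dictionaries mapping a file identifier to its checksum, and the names `AuditReport` and `audit_storage` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AuditReport:
    missing: List[str] = field(default_factory=list)     # in database, not in storage
    mismatched: List[str] = field(default_factory=list)  # checksums differ

    @property
    def ok(self) -> bool:
        return not self.missing and not self.mismatched


def audit_storage(
    db_checksums: Dict[str, str],
    storage_checksums: Dict[str, str],
) -> AuditReport:
    """Compare checksums recorded in the database with those in the storage."""
    report = AuditReport()
    for file_id, expected in sorted(db_checksums.items()):
        actual = storage_checksums.get(file_id)
        if actual is None:
            # Candidate for restoring from the backup copy.
            report.missing.append(file_id)
        elif actual.lower() != expected.lower():
            # Candidate for notifying the Administrators.
            report.mismatched.append(file_id)
    return report
```

In this model, entries in `missing` correspond to files that would be restored from the backup copy, and entries in `mismatched` to cases that would be reported to the Administrators.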
Checksum information is also displayed next to each file on the dataset and file pages. This makes it possible to compare the match between the downloaded copy of the file on the User side and the copy deposited in the Repository.
4.3 Relation between archival (AIP) and dissemination (DIP) copies
The Repository does not create archival copies (AIPs) separately from copies of datasets subject to dissemination (DIPs).
5. Technical infrastructure
5.1 The Repository software
The Repository runs on the free Dataverse software version 4.11, modified by the ICM UW development team. The modified version of the software is open source, and its source code is available on the GitHub platform at https://github.com/CeON/dataverse. This platform allows the submission and monitoring of changes and corrections made to the software code.
5.2 Data storage
File data is stored in an S3 storage with a redundant storage system. Dataset and file metadata are stored in a relational database on a storage resource attached directly to the server.
The security of the data deposited in the Repository is further ensured by the native redundant storage mechanism of the S3 disk storage on which the data is placed. The system is supported by an external company.
In addition, the metadata database and file data are transferred to a storage resource in another geographical location. This makes it possible to recover data files and the Repository database in case of damage or failure.
File data and the metadata database are backed up once a day. Copies of the database from the last 30 days and copies of data from the last two days are stored in a separate location.
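The retention rule can be expressed as a small selection function. This is a sketch under assumptions, not the actual backup tooling: backups are identified only by their date, and "the last 30 days" and "the last two days" are interpreted as including the current day.

```python
from datetime import date, timedelta
from typing import Iterable, List

# Retention periods stated in this policy, in days.
RETENTION_DAYS = {"database": 30, "files": 2}


def backups_to_keep(kind: str, backup_dates: Iterable[date], today: date) -> List[date]:
    """Return the backup dates that fall within the retention window.

    Assumption: the window includes today, so with daily backups the
    database window holds 30 copies and the file window holds 2.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS[kind])
    return sorted(d for d in backup_dates if cutoff < d <= today)
```

Backups outside the returned list are the ones that a rotation job following this rule would be allowed to remove.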
5.3 Proceedings in the event of data disintegration
In the event of a difference in the checksums of the deposited files stored in the database and the system metadata in the S3 storage, the system sends a notification to the Administrators. The Administrators take action to restore system integrity.
5.4 Operation logs
System logs are created and stored in the Repository for the following operations:
- creating a User account;
- withdrawing a dataset version;
- setting an embargo date for a dataset;
- activating a lock on a dataset;
- assigning a system role;
- creating a collection;
- creating a guestbook entry;
- creating a new dataset;
- creating a private URL for a dataset;
- creating a dataset template;
- deleting a draft version of a dataset;
- deleting a draft version, which is the only version of a dataset;
- deleting a dataset template;
- publishing a dataset;
- fetching JSON ID;
- fetching a private URL of a dataset;
- saving a provenance file;
- publishing a collection;
- removing a lock on a dataset;
- issuing a request for restricted access to a file;
- returning a dataset for correction;
- revoking the system role;
- submitting a dataset for verification;
- editing a dataset thumbnail;
- editing a draft dataset;
- creating a draft version of a dataset;
- editing a collection;
- changing a default role for a User creating a dataset in a collection;
- editing a dataset template;
- editing collection appearance;
- logging a User in;
- logging a User out;
- issuing a request to change the password;
- sending a request to change the password;
- retrieving available login methods from the database;
- changing the maximum length of an embargo;
- activating or deactivating file sharing conditions (restricted access and permitted use);
- changing User account parameters.
5.5 Periodic review of functionalities
Once a year, the Repository’s functionalities are reviewed against identified new User needs, technologies, and features. New functionalities will be progressively developed and implemented as staff and financial resources allow.
Document version: 1.0.0