Legal guide – Information page

1. Who can decide to make data available in the RepOD repository?

From a legal point of view, a dataset in a repository consists of three elements:

– the files making it up,
– the metadata (title, description, etc.), and
– the compilation of the files and the metadata into a single entity.

Various laws (copyright, sui generis rights, privacy, etc.) may protect these elements, and different people may have different rights. According to the RepOD Terms of Use, we require the person depositing a dataset in the repository to do so without violating anyone’s rights. The depositor should learn what rights the dataset is protected by and who holds those rights. Only then, in agreement with other authorized persons, can the dataset be made available (see: section 10).

2. What kind of rights can apply to (research) datasets?

With regard to datasets, we most often deal with protection under copyright and/or database rights and sometimes with other rights.

Copyright and related rights protect a work when it is a manifestation of creative activity of individual nature (Act on Copyright and Related Rights). Thus, research data are subject to copyright protection only to the extent that the work put into them was creative. In particular, we should be aware that facts and ‘discoveries, ideas, procedures, methods, and principles of operation as well as mathematical concepts’ are not subject to copyright protection.

In the case of a research dataset, copyright:

– Protects the content of the files to the extent that it has the nature of individual creativity (e.g. the results of the measurements – no, but the way they are presented – sometimes yes);
– Protects creative metadata (e.g. author’s name – no, but a creative title and research description – yes);
– Protects the compilation of a dataset into a whole to the extent that it has the nature of individual creativity (the selection or arrangement of the whole dataset must be creative and not based on some well-known and used criteria).

Database rights (so-called sui generis rights) protect a dataset when it has required ‘substantial investment, evaluated qualitatively and/or quantitatively, for its production, revision or presentation of its contents’ (Act on the Protection of Databases). Thus, investment in a database is essential. According to the Act, a database is defined as a ‘collection of data or any other materials and elements arranged systematically or methodically, individually accessible by any means’. Therefore, this definition covers not only databases in the IT sense but also certain other databases. The protection of databases understood this way is in force in EU countries (including Poland) and a few others.

For a research dataset, database rights:

– Protect the content of files only to the extent that it qualifies as a database (e.g. research results compiled into tabular form if a significant investment has been made in collecting it);
– May separately cover the metadata if they constitute a database in their own right (e.g. a collection of the metadata of many different datasets). A single set of metadata for a single dataset will typically not require sufficient investment to be considered a separate database;
– Cover the compilation of a dataset into a whole only if it required significant investment.

Please note that other rights may cover part or all of the dataset, depending on the circumstances. In most cases, these rights are associated with individuals whose goods or interests have a specific relationship to the dataset. Such individuals may include content creators, research participants, and so forth. In such cases, we should consider these individuals’ rights to privacy, to their image, to the protection of their personal data, to respect for their copyright, and others.

In the case of a research dataset, rights other than copyright or sui generis rights (1) may apply to the contents of the files if third parties were involved in their creation or if the content pertains to them. However, with respect to metadata (2) and the compilation of the dataset as a whole (3), the occurrence of rights other than copyright or sui generis rights will be rare.

3. To whom do the rights to research datasets described above belong?

If we believe that certain elements of our research dataset are covered by copyright, database rights, or other rights, we should identify to whom these rights belong. To do so, we need to take into account the legislation (e.g. in copyright law, there is a rule that the author is entitled to economic rights, but there are also exceptions to this rule) and the arrangements made in the documents of respective academic institutions (universities, institutes), primarily in the ‘Regulations for the management of copyright, related rights and industrial property rights and the principles of commercialization’ (or similar) applicable in our academic institution.

Copyright is divided into moral rights and economic rights. Moral rights always belong to the author (or co-authors), while economic rights can be transferred to others, including academic institutions. From the perspective of the possibility of sharing data in RepOD, the most important aspect is who holds the economic rights.

The Act on Copyright and Related Rights states that employers have priority in publishing their employees’ scientific works, while the author(s) retain full economic rights. However, it is common practice in academic institutions in Poland to allow authors to decide on publishing their research results. This applies not only to research papers but also to elements of datasets considered to be works (objects of copyright), such as metadata or creative aspects of file content and of combining files into a dataset. Authors should still ensure that data availability agreements are made with all co-authors, even if only verbally, and they should also respect any other legal and ethical restrictions on data availability (see: section 6).

Computer programs and creative databases created by academics that meet the requirements of collective work are examples of scientific results for which the employer most likely holds economic rights. If you wish to make such data available, you should contact the technology transfer center or the head of your unit to ensure that you are allowed to share the program or database. In practice, if the database or computer program has no commercial potential (e.g. a computer script solely for processing our data), this will likely be just a formality.

We emphasize that the above general guidelines should always be verified, e.g., in the regulations of the home academic institution, which may specify the issues described here.

The rights to the databases are granted by law to the entity that has made a substantial investment in the database, which is usually the employer. To make the database available, we must, therefore, obtain its permission—we should contact the technology transfer center or the institution’s management. In practice, if the database has no commercial potential, this will likely be just a formality.

Third-party rights. Individuals who have been involved in data collection or are the subjects of the collected data may have specific rights. If this is the case, we must agree with them on whether we can make the data available. If the study is conducted with individuals (participants), the content of the consent signed to participate in the study is important (see: section 5).

4. Personal data, sensitive data, and data anonymization

A dataset of personal data is a particular case where someone else’s material is included in the deposited data. Personal data is not limited to just a name; it encompasses all data that allows the identification of an individual living person. According to the GDPR (General Data Protection Regulation), the identification can be made ‘in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person’. If the dataset contains personal data, the requirements under legislation such as the GDPR should be considered when making it available. This will be difficult if there is no consent from the data subjects.

In the case of lack of consent, the simplest way to make a dataset accessible is through anonymization, understood as the removal of specific information from the dataset that makes the data personal. Anonymization must be conducted carefully, as modern technology makes it easier than ever to de-anonymize datasets, particularly when combined with similar datasets from other sources.

The GDPR provides special rules for specific categories of personal data: data ‘revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership’, genetic data, biometric data processed to uniquely identify a natural person, and data concerning a person’s health, sex life, or sexual orientation. Processing such data is generally prohibited and is possible only in specific cases, e.g. when the data has been made public by the data subject. Conducting academic research does not exempt researchers from the obligation to comply with the GDPR and adhere to data protection principles.

It is essential to remember that the dataset depositor is responsible for properly anonymizing the data made available in the RepOD repository.
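As an illustration of the care that anonymization requires, the sketch below removes direct identifiers from a small table and then checks whether any remaining combination of indirect identifiers is unique. It is only a minimal example under stated assumptions: the column names (`name`, `email`, `national_id`, `birth_year`, `postal_code`, `sex`) and the sample rows are hypothetical, and passing such a check does not by itself mean that a dataset is anonymized in the legal sense.

```python
from collections import Counter

# Hypothetical column names; adjust them to the actual dataset.
DIRECT_IDENTIFIERS = {"name", "email", "national_id"}
QUASI_IDENTIFIERS = ("birth_year", "postal_code", "sex")


def drop_direct_identifiers(rows):
    """Return copies of the rows without the direct-identifier columns."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in rows
    ]


def smallest_group_size(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return the size of the rarest combination of quasi-identifier values.

    A result of 1 means that at least one person is unique on these columns
    and might be re-identified by linking with another dataset.
    """
    groups = Counter(tuple(row.get(col) for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Made-up sample data for illustration only.
rows = [
    {"name": "A", "email": "a@example.org", "birth_year": 1980,
     "postal_code": "00-001", "sex": "F", "score": 12},
    {"name": "B", "email": "b@example.org", "birth_year": 1980,
     "postal_code": "00-001", "sex": "F", "score": 15},
    {"name": "C", "email": "c@example.org", "birth_year": 1975,
     "postal_code": "30-002", "sex": "M", "score": 9},
]

cleaned = drop_direct_identifiers(rows)
print(sorted(cleaned[0]))            # remaining column names
print(smallest_group_size(cleaned))  # 1 here: the third row is unique
```

In this sample the third record remains unique even after names and e-mail addresses are removed, which is exactly the situation in which combining the data with another source could reveal a person’s identity. Generalizing values (e.g. age ranges instead of birth years) is one common way to reduce that risk.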

5. Participants’ consent to taking part in academic research

Under the general regulations of the Act on the Protection of Personal Data, if the data is anonymized (i.e. the identity of individuals cannot be effectively reconstructed), it is not protected as personal data and can be made freely available (including in a repository). If we are confident in the anonymization method we have employed, and the individuals participating in the research have no other rights (e.g. copyright on the materials they have contributed), then we may not need to seek their consent to share the information obtained from them.

However, in the case of medical data, there is a specific regulation: ‘Information obtained in connection with a medical experiment may be used for academic purposes, without the consent of the person subjected to that experiment, in a form that does not allow for the identification of the person’ (Act on the Profession of Medical Doctor and Dentist, Art. 28 [Use of information obtained in the course of an experiment]). This rule can be interpreted both as a confirmation of the general regulation above (that anonymized data can be freely made available) and as providing additional protection (stating that even anonymized data must be accompanied by the patient’s consent if it is to be used beyond academic purposes). Therefore, if we aim for a cautious approach, even when anonymizing data, it is advisable to inform the subjects beforehand about our intention to share the data in a repository and – crucially – specify the intended licensing terms while also obtaining their explicit consent for these actions.

Standard consent forms often include additional declarations made by the researcher. If the participant agrees to ‘publication of the results’, this can be understood as encompassing both the article and sharing data in the repository. On the other hand, if the participant agrees specifically to ‘the publication of results in an academic article’, while the law does not explicitly prohibit sharing data in a repository, the participant could feel misled. Similarly, if the form states that the participant has been ‘informed as to which persons will have access to the data obtained’, or that ‘the data will only be used for the academic study entitled…’, then we should respect these declarations and not make the data available for other studies. Therefore, we should be aware of our commitments to the participants in our research. Using wording that is too specific in the forms may unnecessarily limit our rights and commit us to actions not required under implicit legal provisions. On the other hand, any specific use or sharing of data beyond the scope of fair use or similar legislation must be explicitly agreed upon with the individual who holds the rights to the data.

6. Other legal and ethical restrictions on data availability

In addition to copyright, database protection rights, and personal data protection restrictions, other restrictions may come into play. Not all of these are legal restrictions; some arise from accepted custom or simply a sense of morality. When considering these aspects, it is important to ask:

– Does the data disclose information that affects national security or defense?

– Does the data disclose information that will prevent a patent or other form of commercialization of research?

– Is the disclosure of data likely to endanger protected species or archaeological monuments?

– Does the disclosure of the data raise other ethical concerns?

In all the cases mentioned above, we recommend carefully identifying the issue with the assistance of a specialist in the relevant field.

7. Which licenses can be chosen for files in the RepOD repository? What do the different licenses mean? What rights do the licenses provide to data users?

Any material made available on the Internet without additional express permissions may only be used to the extent permitted by law. In the case of a dataset, this means that to the extent that the dataset is protected by copyright, its use by recipients is limited by the fair use provisions outlined in the Act on Copyright and Related Rights. To the extent that the dataset is protected by database rights, its use is subject to the restrictions contained in the Act on the Protection of Databases. Therefore, as with any work available without additional permission from the author, we can view such a dataset, read it, quote excerpts (including audio or visual), and draw inspiration from it. We can also re-analyze the data for academic research purposes and publish information on whether we achieved results identical to those obtained by the authors of the original publications. Thus, a dataset made available in this way is usually sufficient to verify the reliability of an academic article.

However, in the case of a dataset made available without an additional license, there are no regulations that unequivocally allow copying or distributing this dataset in its entirety (even if we correctly cite the authors). This severely limits the ability to conduct analyses based on such data, especially if we intend to involve computers and thoroughly document our research findings.

Data not accompanied by explicit consent cannot be combined with others into larger wholes without significant legal risk. They also cannot be included in meta-analyses of the scientific literature or subjected to text-and-data mining (TDM) methods, although this may change with the implementation of the Directive on copyright and related rights in the Digital Single Market, which introduces new rules for TDM.

For these reasons, researchers often choose to license their dataset, granting users additional and broader rights to use the data. In RepOD, we allow each file in the dataset to be licensed separately (we also provide the option to choose the same license for all files in the dataset). The licenses available in the RepOD repository have been selected from among the most popular free licenses on the internet (Creative Commons) and free software licenses. The latter are recommended for files that are computer programs, while the former are suitable for all other types of files. Below are brief descriptions of each license.

CC0 (Creative Commons Zero 1.0) is a statement developed by Creative Commons to release a resource into the public domain. This means the rights holder waives all copyright and related rights and agrees not to exercise database rights and other rights over the resource. Therefore, public domain resources can be used without restrictions imposed by these rights. However, general laws, including moral rights (such as the right to attribution), must still be respected.

CC-BY (Creative Commons Attribution 4.0) is a license where the rights holder allows others to freely use the material – as broadly as with CC0 – meaning it is permissible to copy, distribute, remix, adapt, and perform text and data mining, provided that proper attribution is given (hence the ‘BY’). The name of the author(s) of the material must be correctly cited, as well as the source of the material and the license under which it was made available.

CC-BY-SA (Creative Commons Attribution-ShareAlike 4.0) differs from the CC-BY license solely by the additional share-alike (SA) requirement imposed on users; if they create a new derivative work based on the shared material (data), they are obliged to also license that derivative work under CC-BY-SA.

Software licenses. The Free Software Foundation website (http://www.fsf.org/licensing/) is a good source of information about free software licenses.

All CC licenses are designed so that the person granting them does not waive any rights on behalf of third parties but only those rights they hold. Furthermore, these are international licenses, ensuring that their construction maintains as much consistency as possible across all jurisdictions.
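The attribution requirement of CC-BY and CC-BY-SA described above (authors, source, and license) can be illustrated with a short sketch. The function name, the output format, and the sample values are our own assumptions for illustration; the licenses do not prescribe one specific citation layout.

```python
def attribution_line(authors, title, source_url, license_name):
    """Build a one-line credit naming the authors, the source, and the license."""
    if not authors:
        raise ValueError("at least one author is required for attribution")
    names = ", ".join(authors)
    return f'{names}, "{title}", {source_url}, licensed under {license_name}'


# Hypothetical dataset details, used only as an example.
credit = attribution_line(
    ["Jan Kowalski", "Anna Nowak"],
    "Example survey data",
    "https://example.org/dataset/1",
    "CC BY 4.0",
)
print(credit)
```

Keeping the three elements together in one line makes it easier to check that none of them is omitted when the data are reused.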

When choosing a license, we need to understand the rights we are granting to users so that we can make an informed decision. The RepOD repository encourages sharing data on a CC0 basis (see: section 9), but we recommend thoroughly analyzing the issue and making a well-considered decision.

8. Why are the metadata describing our dataset and the rights associated with the compilation of the files covered by a CC0 statement in the RepOD repository?

The declaration of releasing metadata and the dataset as a set of files under CC0 is made automatically by the user upon acceptance of the RepOD Terms of Use.

Metadata are the data describing a dataset: the title and description of the set, the authors’ names, surnames, and affiliations, keywords, file names and descriptions, etc. Releasing metadata into the public domain facilitates the search for shared datasets, as metadata can then be freely transferred to other services. In particular, the metadata of datasets in the RepOD repository are automatically transferred to the Europeana collection (https://www.europeana.eu/en) because we are obliged to do so by an agreement under which we assign DOIs to datasets. For these reasons, we require all metadata in RepOD to be covered by CC0.

Additionally, a given dataset may be protected by database rights or copyright as a compilation of elements. In such cases, certain uses of the dataset as a whole, or even of a substantial part of it, would require obtaining the appropriate permission (license) from the rights holder. In our view, in the current legal environment, CC0 provides the best means to ensure the genuine openness of shared data. Without this mechanism, the rights associated with the dataset as a compilation of files could unnecessarily hinder the reuse of these compilations or substantial parts thereof. For example, in the legislation of many countries, including Poland, without such a declaration, datasets could not be subjected to machine analysis without the special consent of the rights holders. Therefore, concerning the rights associated with the compilation of the files, we also require CC0.

Making a correct declaration of releasing metadata and the rights associated with the compilation of the files into the public domain under CC0 by the person depositing the dataset is possible without obtaining third-party consents (provided that the person is the creator/producer of the dataset or acts on their behalf). A separate issue is whether the data itself contains material protected by third-party rights (see: section 3).

9. Why do we also recommend releasing the contents of files into the public domain or sharing them under open licenses?

While we require that metadata and the rights protecting the dataset as a compilation be covered by a CC0 statement, the decision to license the deposited files is left to the dataset authors. However, if there are no specific objections, we recommend — as does the European Commission — releasing data into the public domain or at least sharing them under a CC-BY license. Why?

Users’ rights under the principles of fair use are usually insufficient for conducting new research based on shared data or for using them for entirely new and innovative purposes. However, the point of making research data publicly available is to enable their reuse, for the benefit of the entire society, which funds scientific research through taxes.

More information on this topic can be found at http://www.dcc.ac.uk/resources/how-guides/license-research-data#x1-40000.

10. Practical summary: 5 steps to sharing data

Remember that copyright jointly belongs to all co-authors, that is, all researchers who have creatively contributed to the dataset. Therefore, ensure all your collaborators agree to share the data (verbal agreements are sufficient, although written consent is easier to prove later). If possible, it is best to establish this at the beginning of the project. Appropriate provisions can be included in the consortium agreement if the research is conducted within a larger scientific consortium.

Check the regulations in your research institution. Look at the ‘Regulations for the management of copyright, related rights and industrial property rights and the principles of commercialization’ or a similar document. If your data are not suitable for commercialization, you can most likely decide on their sharing by yourself.

Consider whether anyone else besides your collaborators and employer was involved in any way in the creation of your dataset. Did people take part in the study as participants (medical, sociological, historical research), and does the dataset contain data that allows their identification? Does your dataset contain images or statements of individuals (photos, interviews)? Does your dataset contain the results of other people’s work (photos taken by a photographer, extracts from other people’s texts or translations, etc.)? If so, ensure these people agree to share the dataset in the RepOD repository (see also: sections 4 and 5).

Consider whether your dataset contains information you should not disclose (or not yet) for other reasons. Does it affect national security or defense? Does it reveal information that would prevent obtaining a patent or another form of commercialization of research? Are there any other ethical concerns? If so, consult someone competent before sharing (see also: section 6).

Decide if you want to license the data. Are there reasons why you do not want or cannot release your data into the public domain? Choose the license that best suits your needs. Ensure that everyone whose rights are included in the dataset has given not only consent for the sharing but also explicit consent for the licensing.

Prepared by: dr Krzysztof Siewicz; translated by: PON team