As part of the DiTraRe work programme, each of the four Research Clusters is based on a scientific use case that raises concrete questions.
Protected Data Spaces
Various categories of research data are subject to legal restrictions such as data protection laws, personal rights or copyrights. Ethical restrictions on data sharing apply, for example, to the geolocation of sensitive sites or to politically or socially unacceptable content.
Nevertheless, there is a legitimate research interest in such data. In the LSC, we clarify which categories of data can be re-used for research purposes. We propose legal, ethical and technical solutions by taking into account different levels of data sensitivity.
We investigate procedures for the pseudonymisation and anonymisation of sensitive data and for linking it with non-critical data in distributed systems of knowledge organisation.
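One common way to pseudonymise records while keeping them linkable is keyed hashing. The following is a minimal sketch of that general technique, not the procedure used in the project. The field names, the example records and the key are hypothetical, and it assumes that the key is held only by a trusted party.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so
    records stay linkable without exposing the original identifier.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a real deployment may keep the full digest.
    return digest.hexdigest()[:16]


def pseudonymise_records(records, id_field, secret_key):
    """Return copies of the records with the identifier field replaced."""
    result = []
    for record in records:
        replaced = dict(record)
        replaced[id_field] = pseudonymise(str(record[id_field]), secret_key)
        result.append(replaced)
    return result


# Hypothetical example data and key (assumptions for illustration only).
key = b"example-key-held-by-a-trusted-party"
health = [{"participant": "P-001", "bmi": 22.4}]
fitness = [{"participant": "P-001", "sprint_20m_s": 3.9}]

health_pseudo = pseudonymise_records(health, "participant", key)
fitness_pseudo = pseudonymise_records(fitness, "participant", key)

# Both data sets now share the same pseudonym and can be linked on it.
linked = health_pseudo[0]["participant"] == fitness_pseudo[0]["participant"]
```

Pseudonymised data of this kind can still be re-identified by whoever holds the key, so it is not anonymised. The remaining fields, such as geolocation, may also permit re-identification on their own.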
We study researchers’ awareness of the associated security and privacy risks and possible consequences.
The MO|RE data platform makes physical fitness data derived from sports science studies available to both academia and the public. Research would strongly benefit from linking health data with physical fitness data, e.g. in longitudinal data sets. However, publishing sensitive health data (e.g. BMI, blood pressure) and other personal data (e.g. geolocation, social status) is challenging. What is lacking is an overarching concept for the secure handling of sensitive data that ensures compliance with legal regulations, ranging from a trustworthy IT environment to sophisticated access management and auditing mechanisms.
Smart Data Acquisition
The Research Cluster investigates innovative technical and societal methods and quality criteria for data acquisition, as well as partially automated procedures for the documentation, analysis and interpretation of data, thereby helping to accelerate research processes.
It assesses associated opportunities and risks, including legal challenges related to IP protection.
The Chemotion Electronic Lab Notebook (ELN) will serve as a testbed to investigate the efficiency of data acquisition and analysis as well as the establishment of trust and accountability.
Chemistry labs in academia make limited use of lab automation and device integration. Despite current research data guidelines from funders and positive examples in industry, there is reluctance to adopt technologies such as ELNs. Concerns include dependencies on software and technologies that are not under the control of scientists, faulty methods for data assignment and data analysis, and a lack of control over the re-use of their data.
AI-Based Knowledge Realms
Machine learning and AI hold great promise to enable new discoveries and innovation. They help address issues of ever-increasing amounts of data and offer opportunities to semantically link currently separated information.
However, they are accompanied by risks. These range from the legal assessment of using synthetic training data for AI systems, through limited or biased training data and quality problems in indexing, to a lack of user acceptance caused by unverifiable decisions of AI systems.
This applies in particular to the social, political and economic consequences of AI-based decisions made by models that can no longer be explained or understood ("black boxes").
KIT-IBT develops computer models of the human heart to predict cardiovascular diseases earlier and more accurately using software engineering, algorithmics, numerics, signal processing, data analysis, and machine learning. We employ AI methods trained on purely synthetic or hybrid (simulated + clinical) datasets to help decipher disease mechanisms. Simulated data are often essential to overcome issues of data privacy and existing bias in most available datasets, but raise questions of explainability of AI decisions and trust.
Publication Cultures
New publication formats beyond classic peer-reviewed articles are gaining in importance. Data publications make scientific findings reproducible and form the basis for further research. Software used to generate or interpret data must be included with data publications as a quality assurance measure. Both should be understood as first-class scholarly outputs.
Existing publishing infrastructures are not yet well suited for data and software. The dynamically changing legal framework requires an in-depth analysis of European and national data laws and policies and their impact on new publication formats and researchers' willingness to share data, algorithms and software.
The shift towards Open Science must be accompanied by a suitable communication strategy to help prevent misinterpretation of research results. This strategy takes into account new communication formats and stakeholders such as science communicators or decision-makers in order to improve the exchange between science and society.
KIT-IMK generates and analyses very large datasets in chemistry-climate simulations or in satellite data for observing the state of the atmosphere. Publication of those data is currently very inefficient due to their size. Re-use is hampered by missing methods for exploring such datasets efficiently in order to evaluate their relevance for other research questions. Selecting subsets of datasets for re-use or peer review is currently not possible.