Integrate
What is it?
You might want to integrate a dataset that you discovered and that fits your own data, whether to verify your results, as a starting point for an integrative study, or to test a new hypothesis for a follow-up study. Integration is the merging of multiple datasets from different sources, for example your recently collected data with existing data from other owners, resulting in a new, larger dataset. You can integrate data manually or save time by using automatic integration procedures. Automatic integration requires a common syntax and terminology right from the start (see Fact-Sheets ‘Collect’ and ‘Describe’). When other authors’ data are re-used, it is essential to credit the data creators through a robust data citation practice, which works best when data are equipped with a persistent identifier (PID).
How to do it?
- Good data management planning is the key to efficient data integration. Make data management easy by using existing tools and workbenches for data collection, quality assurance, description, submission and discovery.
- Understand the data and assess suitability for the required purpose. With visualization and aggregation tools (like the GFBio VAT system), you can create geographic maps or descriptive statistics for a better understanding.
- Use agreed terms, e.g. from the GFBio Terminology Service, during the collection and description stages. This will enrich your data semantically and facilitate the integration process later on.
- Ensure that formats and parameters are compatible (date, resolution, metric units).
- Document the relationships among data sets from different sources.
- Use unique identifiers to prevent duplication of data sets.
- Document the data integration process (script, workflow) and describe it in the metadata.
- Document your data analysis.
- Cite the data sets you used properly (data provenance).
- Keep data quality and metadata quality in mind.
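As an illustration of several of the points above (compatible formats and units, unique identifiers, documenting the source of each record), a minimal sketch in Python could look like the following. It is not a GFBio tool or workflow. All field names, identifiers, values and source labels are hypothetical, and it assumes that your own data use ISO dates and millimetres while the external data use day.month.year dates and centimetres.

```python
from datetime import datetime

# Hypothetical example records; field names and values are illustrative only.
own_data = [
    {"record_id": "A-001", "date": "2021-05-03", "length_mm": 120.0},
    {"record_id": "A-002", "date": "2021-05-04", "length_mm": 98.5},
]
external_data = [
    {"record_id": "B-017", "date": "03.05.2021", "length_cm": 11.2},
    # Same identifier as one of our own records: must not be added twice.
    {"record_id": "A-002", "date": "04.05.2021", "length_cm": 9.85},
]


def harmonize_external(row, source):
    """Convert an external row to the target schema (ISO date, millimetres)."""
    iso_date = datetime.strptime(row["date"], "%d.%m.%Y").date().isoformat()
    return {
        "record_id": row["record_id"],
        "date": iso_date,
        "length_mm": round(row["length_cm"] * 10, 3),  # cm -> mm
        "source": source,  # keep provenance with every record
    }


def integrate(own, external, own_source, external_source):
    """Merge two data sets by unique identifier and log what was skipped."""
    merged = {}
    log = []
    for row in own:
        merged[row["record_id"]] = {**row, "source": own_source}
    for row in external:
        clean = harmonize_external(row, external_source)
        if clean["record_id"] in merged:
            log.append(
                f"skipped duplicate {clean['record_id']} from {external_source}"
            )
            continue
        merged[clean["record_id"]] = clean
    return list(merged.values()), log


# The source labels are placeholders; in practice use the data sets' PIDs.
integrated, integration_log = integrate(
    own_data, external_data, "own-survey", "external-dataset-PID"
)
```

The returned log and the per-record source field are one simple way to keep the integration step documented; both can be carried over into the metadata of the integrated data set.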
Who does it?
Data re-users in general, e.g. modelers or researchers performing integrative or comparative studies.
Key Elements
- Proper data management practices in preceding data life cycle stages
- Documentation of the integration and analysis workflows
- Data citation
- Data provenance
- Data and metadata quality
GFBio Services
- GFBio VAT System for visualization, aggregation and transformation
  - Upload own data and compare them to others
  - Get data you found with the GFBio Search via the basket option, and further explore them with the VAT system
  - Graphical overlays
  - Standard GIS operations
  - Statistical analysis
  - Option to access data with R statistical software
Useful Links
http://www.dataone.org/education-modules (DataONE - Education Modules)
https://www.dataone.org/best-practices/document-integration-multiple-datasets (DataONE - Integration)
http://www.dcc.ac.uk/sites/default/files/documents/DC%20101%20Transform.pdf (DCC - Digital Curation)
http://ukdataservice.ac.uk/media/104397/data_citation_online.pdf (Economic & Social Research Council)

Recommended citation:
German Federation for Biological Data (2021). GFBio Training Materials: Data Life Cycle Fact-Sheet: Data Life Cycle: Integrate. Retrieved 16 Dec 2021 from https://www.gfbio.org/training/materials/data-lifecycle/integrate.