Data Management Standard Operating Procedures
GSI Technical Report #01-22
Published: August 26, 2022
Revised: May 26, 2023
Prepared by: Stephen G. Hesterberg, Ph.D., Executive Director
Purpose & Overview
Data management, the process of collecting, organizing, and storing data, is critical for scientific collaboration and for connecting data producers with the end-users they seek to engage. Additionally, scientific research conducted in the public interest should be transparent and readily accessible to both funding agencies and taxpayers. At the Gulf Shellfish Institute (GSI), we affirm our commitment to high-quality scientific research through the establishment of the Data Management Standard Operating Procedure (SOP) outlined in this document. The goals of our Data Management SOP are to: (1) standardize data collection procedures, (2) produce reliable data, and (3) archive scientific data in accessible repositories. To ensure these objectives are met, we have guided SOP development using several conceptual underpinnings of data management best practices, including ‘FAIR’ principles, ‘tidy’ data organization, and an Open Science philosophy (see Beck et al., 2021 for an in-depth review).
The FAIR guiding principles for data management state that data should be Findable, Accessible, Interoperable, and Reusable (sensu Wilkinson et al., 2016). To be findable, data files must be uniquely named and contain sufficient information for identification. Once found, data will be easily accessible: no special permissions or programs will be necessary to use them. Such data should also be interoperable, meaning that commonly available tools and software allow users to work with them. Data with these qualities will then be reusable, ready for replication and application to new contexts or with other datasets.
Tidy data organization will further increase GSI’s ability to produce reliable data. Wickham (2014) defines tidy data as a standardized method for connecting the meaning and structure of a dataset: each variable is given its own column, each observation its own row, and each value its own cell. This format ensures that each data point is completely unique, findable, and capable of merging or linking with other data. Structuring data using tidy principles simplifies analysis and readies datasets for archival in repositories, supporting our Open Science philosophy.
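To illustrate, the following minimal sketch uses the Python pandas library to restructure a hypothetical field dataset from a ‘wide’ layout into a tidy one; the column names and values are invented for this example and do not represent an actual GSI dataset.

import pandas as pd

# Hypothetical field records in "wide" form: one column per site,
# so each row contains several observations.
wide = pd.DataFrame({
    "date": ["2022-06-01", "2022-07-01"],
    "site_A_count": [112, 98],
    "site_B_count": [87, 101],
})

# Tidy form: one variable per column, one observation per row,
# one value per cell.
tidy = wide.melt(id_vars="date", var_name="site", value_name="oyster_count")
tidy["site"] = tidy["site"].str.replace("_count", "", regex=False)
print(tidy)

Because each observation now occupies its own row, this table can be filtered, merged with other tidy tables, or uploaded to a repository without further restructuring.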
Open Science is defined by Creative Commons as “Practicing science in such a way that others can collaborate and contribute...under terms that enable reuse, redistribution and reproduction of the research and its underlying data and methods.” For our purposes, Open Science is a conceptual framework that we will strive to realize in our workflow as much as possible. For example, we will utilize Google Workspace and its Shared Drive feature to organize and store data and make them accessible from our website. However, there are circumstances in which data cannot be published prior to project completion, such as when collaborating with a private entity or when making data public could jeopardize project success (e.g., the location of a shellfish release). In the vast majority of cases, our Open Science approach will promote collaboration and engagement throughout our workflow.
Data Management Workflow
Project Development
Data management begins at project conception by identifying and prioritizing the data contributions in each research project that will be of use to the end-user. Thus, not all information collected for a project will end up archived in a data repository; only parameters relevant to specific research questions or stakeholder needs will be. For instance, certain water quality parameters recorded during field data collection might not be pertinent to the main research question and therefore remain undigitized in notebooks. However, those data could be used later to create an open file of water quality measurements for use by the public or our partners. In this case, such water quality parameters have been deemed useful to our stakeholders and are made accessible through an Open Science framework.
Prior to project initiation, we will also establish written protocols to ensure repeatable and standardized data collection. Such protocols will provide a reference for all GSI personnel and the scientific partners we work with, minimizing variability in data due to different sampling methods. Protocols should first be adopted from existing methodology (agency protocols, peer-reviewed literature, technical reports, or other reputable scientific sources), with new protocols developed only if required. A list of published protocols will also be accessible on our website, allowing other users to freely access and utilize our methodology in their own contexts.
Implementation
Following identification of project priorities and protocol development, the process of collecting high-quality data will begin. Data collection can occur in both field and laboratory contexts and typically requires analog records prior to digitization. All data, regardless of whether recorded in a field or laboratory setting, will be recorded on waterproof paper in the form of either a datasheet or a notebook. For most field data collection, premade datasheets tailored to the specific survey will be created prior to the day of work. Such datasheets will standardize data collection between sampling trips and ensure that the necessary information is recorded at each sampling event in the appropriate units. Survey sheets will also include descriptive information about the date, location, weather, site conditions, and any notable events or impediments for future reference.
In laboratory settings, individual notebooks will be used to record outcomes and provide sufficient descriptive information for other personnel to replicate the findings. Protocols used should be referenced, the date of measurement noted, and labels with units present. Ideally, separate notebooks should be used for field and laboratory work, with the latter always remaining in the laboratory. Once a notebook is full, a new one will be issued by the Executive Director and the preceding one archived. All data collected by GSI personnel or by independent contractors are property of the Gulf Shellfish Institute, Inc., unless otherwise stated.
Quality assurance protocols strengthen the integrity and validity of data and the findings drawn from them. Following data collection, hand-written records will be digitized into a spreadsheet on Google Drive adhering to tidy data principles. We will use careful digitization, a data dictionary, and metadata to build strong quality assurance into our research. These tools will help us quickly check datasets for errors and create high-quality, accurate data.
The transfer of data from an analog to a digital format requires human entry, which provides an opportunity for errors to accumulate in files, especially when datasets are large. To reduce the likelihood of errors entering our workflow, we will screen digitized files for erroneous values or mistakes. Screening will involve calculating summary values for columns or rows, such as maximums, minimums, sums, differences, and averages, to identify errors. Upon finding an error, analog records will be used to reconcile the discrepancy. If possible, the measurement will be repeated to determine whether the value in question is accurate.
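As a minimal sketch of this screening step, the following Python code computes summary statistics and flags values outside plausible ranges; the file name, column names, and ranges are hypothetical placeholders and would be set according to each protocol.

import pandas as pd

# Load a hypothetical digitized datasheet.
df = pd.read_csv("water_quality.csv")

# Summary statistics (count, min, max, mean, etc.) make gross entry
# errors, such as a misplaced decimal point, stand out quickly.
print(df[["salinity_ppt", "temp_c"]].describe())

# Flag rows outside plausible ranges for reconciliation against
# the original analog records.
plausible = {"salinity_ppt": (0, 40), "temp_c": (5, 40)}
for col, (low, high) in plausible.items():
    flagged = df[(df[col] < low) | (df[col] > high)]
    if not flagged.empty:
        print(f"Check {col}: rows {list(flagged.index)}")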
A pre-formatted metadata file and a data dictionary file will be created for each dataset. Metadata will either be compiled into an accompanying tab in a spreadsheet or as a separate text document for data that are not tabular (e.g., spatial data). Metadata will provide context for the data and will include all relevant information necessary for the identification, representation, interoperability, technical management, performance, and use of data (Gilliland, 2016). The following information should be included as metadata (an example record follows the list):
Title: Project title
File Name: Specific name of file
File Type: Information on the file type (e.g., .xlsx, .txt, .csv)
Location: Spatial information of data collection
Sampling Period: Temporal information of data collection
Data Type: The type of data contained within the dataset
Date of Last Modification: Last date the dataset was revised
Description: General overview of dataset
Licensing/Permitting: Permissions required to collect or share data
Author(s): Individual(s) who collected data
Custodian Institution: Entity responsible for data curation
Contact Name: Individual(s) responsible for data curation
Contact Email: Contact email of the individual(s) responsible for data curation
Software: Software necessary to access and utilize data
Access: Availability of data to individuals or entities
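The sketch below shows one way a completed metadata record could be written as a plain-text sidecar file using Python; every field value here is an invented placeholder, not a real GSI project, license, or contact.

# A hypothetical metadata record following the fields above; all
# values are illustrative placeholders.
metadata = {
    "Title": "Oyster Reef Monitoring Pilot",
    "File Name": "oyster_reef_monitoring_2022.csv",
    "File Type": ".csv",
    "Location": "Tampa Bay, FL",
    "Sampling Period": "2022-06-01 to 2022-09-30",
    "Data Type": "Field survey counts and water quality measurements",
    "Date of Last Modification": "2022-10-15",
    "Description": "Monthly oyster counts with co-located water quality",
    "Licensing/Permitting": "None required (placeholder)",
    "Author(s)": "GSI field staff",
    "Custodian Institution": "Gulf Shellfish Institute, Inc.",
    "Contact Name": "Data curator (placeholder)",
    "Contact Email": "curator@example.org",
    "Software": "Any spreadsheet program or text editor",
    "Access": "Public",
}

# Write the record as a text file to accompany non-tabular data.
with open("oyster_reef_monitoring_2022_metadata.txt", "w") as f:
    for field, value in metadata.items():
        f.write(f"{field}: {value}\n")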
Each datasheet will also be accompanied by a data dictionary, found either in the same spreadsheet or in a separate text document. The purpose of a data dictionary is to define the specific terms, labels, and possible values that inform completion of fields in a metadata file. Data dictionaries will note abbreviations used in column labels, the type of data in each column (e.g., numerical measurements, counts, or categories), and the possible range of values. Such information will not only allow for concise column headings, but also provide structured quality control.
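A short sketch of how a data dictionary might be stored and then used for the structured quality control described above; the columns, abbreviations, file names, and allowed values are hypothetical examples.

import pandas as pd

# Hypothetical data dictionary for a tidy survey file.
data_dictionary = pd.DataFrame({
    "column": ["site", "oyster_count", "shell_ht_mm"],
    "definition": [
        "Survey site code",
        "Live oysters counted per quadrat",
        "Shell height in millimeters (abbreviated 'ht')",
    ],
    "data_type": ["category", "integer count", "numeric"],
    "allowed_values": ["A, B, C", "0 to 500", "5 to 150"],
})
data_dictionary.to_csv("data_dictionary.csv", index=False)

# Structured quality control: confirm that category values in the
# digitized data match those defined in the dictionary.
survey = pd.read_csv("oyster_survey.csv")
allowed_sites = {"A", "B", "C"}
unexpected = survey[~survey["site"].isin(allowed_sites)]
if not unexpected.empty:
    print("Unexpected site codes:", unexpected["site"].unique())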
Communication
The archival of data in permanent, accessible storage will be our measure of successful data management. Every project, regardless of success or scientific impact, should be curated and its contributions made available to stakeholders and the public. Proper storage allows partners to easily locate specific data upon request and compile major findings into reports for our target audience. Perhaps most importantly, good data curation includes all tools and materials necessary for ensuring the integrity of current and future data.
Pursuant to our data management goals and an Open Science philosophy, we will store data in two main ways: on (1) Google Drive and (2) third-party data repositories. Google Drive will be used to store data that should be available to the public for easy and free access, use, replication, and application. We will make such datasets directly available to anyone on our Google Drive, which is linked on our website (Resources & Data). For data associated with peer-reviewed publications, we will use a repository such as Dryad Digital Repository, figshare, or Harvard Dataverse, all of which are specifically designed to hold open-access research data. Storing data in these two locations will make it easier for the target end-user to locate such information. For example, the general public or shellfish aquaculture industry members may not be as interested in scientific articles as members of academia or government agencies. By curating data with the end-user in mind, our findings will better reach their intended audience. Further, managing data under the criteria detailed in this document will make GSI an effective Open Science communicator, fostering collaboration and community not only between researchers and interested parties, but also with the general public and the next generation of scientists.
References
Beck, M., Raulerson, G., Burke, M., Whalen, J., Scolaro, S., Sherwood, E. (2021) Tampa Bay Estuary Program Data Management Standard Operating Procedures. https://tbep-tech.github.io/data-management-sop
Creative Commons. Open Science. Retrieved April 14, 2022, from https://creativecommons.org/about/program-areas/open-science/
Gilliland, A. J. (2016) Setting the stage. In Introduction to Metadata, 3rd edition. Getty Publications, Los Angeles, California.
Wickham, H. (2014) Tidy data. Journal of Statistical Software 59:1–23.
Wilkinson, M., Dumontier, M., Aalbersberg, I., Appleton, G., Axton, M., Baak, A., Blomberg, N., et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018. https://www.nature.com/articles/sdata201618