Operational Synthesis
Learning Objectives
After completing this module, you will be able to:
- Identify characteristics of reproducible coding / project organization
- Explain benefits of reproducibility (to your team and beyond)
- Summarize the advantages of creating a defined contribution workflow
- Explain how synthesis teams can use GitHub to collaborate more efficiently and reproducibly
- Understand best practices for preparing and analyzing data to be used in synthesis projects
Introduction
Here are a few serviceable definitions of what research is:
- “Creative and systematic work undertaken in order to increase the stock of knowledge” from the 2015 Frascati Manual
- “Studious inquiry or examination,” and especially “investigation or experimentation aimed at the discovery and interpretation of facts” from the Merriam-Webster dictionary
To these basic definitions of research, our definition of synthesis research adds collaborative work, and the integration and analysis of a wide range of data sources, to achieve a more complete, generalizable, or useful research result. In Module 1 we discussed many of the collaborative considerations for synthesis research, including creating a diverse and inclusive team, asking synthesis-ready scientific questions (often broad in scope or spatial scale), and finding suitable information (or data) from a wide variety of sources to answer those questions. Once the synthesis team moves into the operational phase of research, which includes the integration and analysis of data, there are some key activities that must happen:
- Cleaning and harmonizing data to make it usable
- Analyzing data to answer questions
- Interpreting the results of your analysis
- Writing the papers or creating other research products
We’ve already seen that creating a collaborative, inclusive team can set the stage for successful synthesis research. Each of the operational activities above will also benefit from this mindset, and in this module we highlight some of the most important considerations and practices for a team science approach to the nuts-and-bolts of synthesis research.
Reproducibility Practices
Making one’s work “reproducible,” particularly in code contexts, has become increasingly popular, but the term is not always clearly defined. For the purposes of this short course, we believe that reproducible work:
- Uses scripted workflows for all interactions with data
- Contains sufficient documentation for those outside of the project team to navigate the project’s contents
- Contains detailed metadata for all data products
- Allows anyone to recreate the entire workflow from start to finish
- Leads to modular, extensible research projects. Adding data from a new site, or a new analysis, should be relatively easy in a reproducible workflow.
Contributing
- Create a formal collaboration plan that your whole team agrees to
- Quarantine external inputs
- Plan for “future you”
- Communicate to your collaborators whenever you’re working on a specific script to avoid conflicting edits
Documenting
- One folder per project
- Further organize content via sub-folders
- Make file names informative and intuitive
- Avoid spaces and special characters in file names
- Follow a consistent naming convention throughout
- Good names should be machine readable, human readable, and sorted in a useful way
- Use READMEs to record organization rules / explanation
- Keep a log of where source data came from.
- Where did you search?
- What search terms did you use?
- List the dataset identifiers you downloaded/used
- Ideally, include downloading data as part of your scripted workflow (see the sketch below)
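A minimal sketch of such a scripted download (the URL, folder, and file names below are hypothetical), which also appends a line to a simple provenance log:

```r
# Hypothetical example: download a source dataset into a dedicated raw-data
# folder and record where it came from.
raw_dir <- file.path("05_data", "raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

data_url  <- "https://example.org/grassland_biomass_2020.csv"  # placeholder URL
dest_file <- file.path(raw_dir, "grassland_biomass_2020.csv")
download.file(url = data_url, destfile = dest_file, mode = "wb")

# Append a line to a simple provenance log so "future you" knows the origin
cat(sprintf("%s,%s,%s\n", Sys.Date(), data_url, dest_file),
    file = file.path(raw_dir, "download_log.csv"), append = TRUE)
```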
Example project structure:
project_new
|-- README.txt
|-- 01_grant_management
|-- 02_project_coordination
|-- 03_documentation
|-- 04_participant_tracking
|-- 05_data
|   |-- README.txt
|   |-- biogeochemistry
|   |-- hydrology
|   `-- water_chemistry
|-- 06_src
|   |-- README.txt
|   |-- data_aggregation
|   |-- data_harmonization
|   `-- modeling
`-- 07_publications
Coding
- Use a version control system
- Load libraries/packages explicitly
- Track (and document) software versions
- Namespace functions (if not already required by your coding language)
- E.g., dplyr::mutate(mtcars, hp_disp = hp / disp)
- Use relative file paths that are operating system-agnostic (see the sketch after this list)
- Balance how descriptive object names are with striving for concise names
- Use comments in the code!
- Consider custom functions
- For scripts that need to be run in order, consider adding step numbers to the file name
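To make a few of these points concrete, here is a minimal, hypothetical sketch (file, folder, and column names invented) that namespaces its functions, uses operating-system-agnostic relative paths, and records the software versions used:

```r
# 02_summarize_biomass.R -- the step number in the file name signals run order

# Relative, OS-agnostic path: file.path() inserts the correct separator
biomass <- utils::read.csv(file.path("05_data", "biomass_clean.csv"))

# Namespaced calls make it obvious which package each function comes from
biomass_summary <- dplyr::summarize(
  dplyr::group_by(biomass, site),
  mean_biomass_g = mean(biomass_g, na.rm = TRUE)
)

utils::write.csv(biomass_summary,
                 file.path("05_data", "biomass_summary.csv"),
                 row.names = FALSE)

# Record the R and package versions used for this run
utils::sessionInfo()
```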
Synthesizing
How do reproducibility considerations for synthesis projects differ from individual / non-synthesis applications?
- Judgement calls need to be made / agreed to as a group
- But “defer to the doers”
- Increased emphasis on contribution guidelines / planning being formalized
- More communication needs
- Must ensure that every team member has sufficient access to the project files
- It’s best to keep track of who contributed what, so that everyone gets credit. This can be challenging in practice.
In groups of 3-5, discuss the following questions:
- What elements of reproducibility from our list have you used/are interested in using?
- Which feel unreasonable or confusing?
- What activities do you do in your own work to ensure reproducibility that our list is missing?
Version Control
In all scientific research, the data work (cleaning, harmonizing, analyzing) and the writing are iterative processes. The process and products change over time and usually require a series of revisions. In synthesis research, the process can become even more complex because the team is usually large and multiple people are contributing data, analysis, writing, revisions, and more. Using version control helps manage this complexity by recording changes, tracking individual contributions, and ensuring that things can be rolled-back to an earlier state if needed.
Think of the many draft files a paper accumulates before the finalized version. With a version control system, all the revisions in each draft are saved, and the system provides a framework for preserving those changes without cluttering your computer with every file that precedes the final version.1
Using version control enhances your workflow by allowing you to:
- maintain a descriptive history of your research project’s development while keeping a clean workspace
- no more cryptic file names or commented-out lines of code to track your progress
- collaborate with team members and merge everyone’s edits together
- explore bugs or new features without disrupting your team members’ work.2
Vocabulary
Here are some brief definitions for a selection of fundamental version control vocabulary terms.
- Version control system: software that tracks iterative changes to your code and other files
- Repository: the specific folder/directory that is being tracked by a version control system
- Git: a popular open-source distributed version control system
- GitHub: a website that allows users to store their Git repositories online and share them with others
GitHub
While this section of the module focuses on GitHub, there are several other tools for working with Git individually or as part of a larger team (e.g., GitLab, GitKraken). Any of these may be a viable option for your team; we focus on GitHub here only to provide a standard backdrop for the case studies we’ll discuss shortly.
Many GitHub tutorials already exist, so rather than add our own variant to the list, we’ll work through part of one created by the Scientific Computing team at the National Center for Ecological Analysis and Synthesis (NCEAS).
See the workshop materials here.
Given the time restrictions for this short course, we’ll only cover how to engage with GitHub directly through the GitHub website. However, your chosen software for writing code will almost certainly have a way of connecting to GitHub, so if this topic is of interest, it is worth seeking out the relevant tutorial.
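If you would rather drive Git from within R itself, here is a minimal sketch assuming the usethis and gert packages (the script name is hypothetical):

```r
# One-time setup from within an existing R project
usethis::use_git()     # put the project folder under version control
usethis::use_github()  # create a matching GitHub repository and push to it

# A typical day-to-day cycle: stage, commit, and push your changes
gert::git_add("06_src/01_data_cleaning.R")          # hypothetical script name
gert::git_commit("Add first pass at data cleaning") # describe the change
gert::git_push()                                    # share it with the team
```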
Data Preparation
The scientific questions being asked in synthesis projects are usually broad in scope, and it is therefore common to bring together many datasets from different sources for analysis. The datasets selected for analysis (source data) may have been collected by different people, in different places, using different methods, as part of different projects… or all of the above. Typically, some amount of data cleaning - filtering or removing unwanted observations - and data harmonization - putting data together in common structures, file formats, and units of measurement - is necessary before analysis can begin. This process can be easy or difficult depending on the quality of the source data, the differences between source data, and how much metadata (see callout below) is available to understand them.
- How many of you work directly with data in your day-to-day?
- What percentage of the time that you spend working on data is spent on data cleaning?
- How much on metadata creation?
- How much on data preparation?
Metadata is “data about the data,” or information that describes who collected the data, what was observed or measured, when the data were collected, where the data were collected, how the observations or measurements were made, and why they were collected. Metadata provide important contextual information about the origin of the data and how they can be analyzed or used. They are most useful when attached or linked to the data being described, and data and related metadata together are commonly referred to as a dataset.
Metadata for ecological research data are well described in Michener et al (1997),3 but there are many other kinds of metadata with different purposes.4 If you are publishing a research dataset and have questions about metadata, ask a data manager for your project, or staff at the repository you are working with, for help. Either can typically provide guidance on creating metadata that will describe your data and be useful to the community (here is one example). We’ll return to the subject of metadata in Module 3.
Cleaning Data
When assembling large datasets from diverse sources, as in synthesis research, not all the source data will be useful. This may be because there are real or suspected errors, missing values, or simply because they are not needed to answer the scientific question being asked (wrong variable, different ecosystem, etc.). Data that are not useful are usually excluded from analysis or removed altogether. Data cleaning tends to be a stepwise, iterative process that follows a different path for every dataset and research project. There are some standard techniques and algorithms for cleaning and filtering data, but they are beyond the scope of this course. Below are a few guidelines to remember, and more in-depth resources for data cleaning are found at the end of this section.
- Always preserve the raw data. Chances are you’ll want to go back and check the original source data at least once.
- Use a scripted workflow to clean and filter the raw data, and follow the usual rules about reproducibility (comments, version control, functionalization).
- Consider using the concept of data processing “levels,” meaning that defined sets of data flagging, removal, or transformation operations are applied consistently to the data in stepwise fashion. For example, incoming raw data would be labeled “level 0” data, and “level 1” data is reached after the first set of processing steps is applied (see the sketch after this list).
- Spread the data cleaning workload around! Data cleaning typically demands a HUGE fraction of the total time devoted to working with data,567 and it can be tedious work. Make sure the team shares this workload equitably.
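As a minimal sketch of the “levels” idea (the column names and the cutoff value are invented for illustration), level-0 raw data might be flagged and filtered into a level-1 product like this:

```r
library(dplyr)

# Level 0: the raw data exactly as downloaded; never modified in place
raw <- read.csv(file.path("05_data", "raw", "biomass_level0.csv"))

# Level 1: flag suspect values, then drop rows that fail the basic checks
level1 <- raw |>
  mutate(flag_negative = biomass_g < 0,
         flag_extreme  = biomass_g > 5000) |>   # illustrative cutoff
  filter(!flag_negative, !flag_extreme, !is.na(biomass_g))

write.csv(level1, file.path("05_data", "derived", "biomass_level1.csv"),
          row.names = FALSE)
```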
Data Harmonization
Data harmonization is the process of bringing different datasets into a common format for analysis. The harmonized data format chosen for a synthesis project depends on the source data, analysis plans, and overall project goals. It is best to make a plan for harmonizing data BEFORE analysis begins, which means discussing this with the team in the early stages of a synthesis project. As a general rule, it is also wise to use a scripted workflow that is as reproducible as possible to accomplish the harmonization you need for your project. Following this guidance lets others understand, check, and adapt your work, and will also make it much, much easier to bring new data and analysis methods into the project.
Data harmonization is hard work that sometimes requires trial and error to arrive at a useful end product. At the end of this section are some additional data harmonization resources to help you get started. Looking at a simple example might also help.
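As a minimal illustration (the site names, column names, units, and values are all invented), harmonizing two source datasets often amounts to renaming columns, converting units, and stacking the results into one tidy table:

```r
library(dplyr)

# Two hypothetical source datasets with different column names and units
site_a <- data.frame(plot = 1:2, biomass_g_m2 = c(120, 95))
site_b <- data.frame(plot_id = 1:2, biomass_kg_ha = c(1.4, 0.8))

# Harmonize: common column names, common units (g/m2), one stacked table
harmonized <- bind_rows(
  site_a |> transmute(site = "A", plot_id = plot, biomass_g_m2),
  site_b |> transmute(site = "B", plot_id,
                      biomass_g_m2 = biomass_kg_ha * 0.1)  # 1 kg/ha = 0.1 g/m2
)
```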
A Word about Harmonized Data Formats
Above, we have discussed several aspects of selecting a data format. There are at least three related, but not exactly equivalent, concepts to consider when formatting data. First, format describes the way data are structured, organized, and related within a data file. For example, in a tabular data file about biomass, the measured biomass values might appear in one column or in multiple columns. Second, the values of any variable can be represented in more than one format. The same date, for example, could be formatted using text as “July 2, 1974” or “1974-07-02.” Third, format may refer to the file format used to hold data on a disk or other storage medium. File formats like comma-separated values text files (CSV), Excel files (.xlsx), and JPEG images are commonly used for research data, and each has particular strengths for certain kinds of data.
A few guidelines apply:
- For formatting a tabular dataset, err towards simpler data structures, which are usually easier to clean, filter, and analyze. Long-format, or “tidy,” tables8 are one common recommendation for this.
- When choosing a file format, err towards open, non-proprietary file formats that more people know and have access to. Delimited text files, such as CSV files, are a good choice for tabular data.
- Use existing community standards for formatting variables and files as long as they suit your project methods and scientific goals. Using ISO standards for date-time variables, or species identifiers from a taxonomic authority, are good examples of this practice (see the brief example after this list).
- There is no perfect data format! Harmonizing data always involves some judgement calls and tradeoffs.
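For instance, dates recorded in different local conventions can be standardized to the ISO 8601 format (YYYY-MM-DD); a brief sketch, assuming the lubridate package:

```r
# Parse differently formatted date strings into ISO 8601 dates
lubridate::mdy("July 2, 1974")   # -> 1974-07-02
lubridate::dmy("02/07/1974")     # -> 1974-07-02
as.Date("1974-07-02")            # already ISO 8601; base R parses it directly
```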
When choosing a destination format for the harmonized data for a synthesis project, the audience and future uses of the data are also an important consideration. Consider how your synthesis team will analyze the data, as well as how the world outside that team will use and interact with the data once it is published. Again, there is no one answer, but below are a few examples of harmonized destination formats to consider.
Consider a grassland biomass dataset in long format, often referred to as “tidy” data. Data in this format are generally easy to understand and use. There are three rules for tidy data:
- Each column is one variable.
- Each row is one observation.
- Each cell contains a single value.
Advantages: clear meaning of rows and columns; ease in filtering/cleaning/appending
Disadvantages: not as human-friendly, so it can be difficult to assess the data visually
Possible file formats: Delimited text (e.g., tab- or comma-delimited), spreadsheets, database tables
The same grassland data can also be restructured into wide format, often referred to (sometimes unfairly) as “messy” or “untidy” data. Note that the biomass variable has been split into two columns, one for control plots and one for fertilized plots.
Advantages: easier for some statistical analyses (ANOVA, for example); easier to assess the data visually
Disadvantages: may be more difficult to clean/filter/append; multiple observations per row; more likely to contain empty (NULL) cells
Possible file formats: Delimited text (e.g., tab- or comma-delimited), spreadsheets, database tables
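A minimal sketch of the two layouts discussed above (values invented; assumes the tidyr package):

```r
library(tidyr)

# Long ("tidy") format: one row per observation
biomass_long <- data.frame(
  plot      = c(1, 1, 2, 2),
  treatment = c("control", "fertilized", "control", "fertilized"),
  biomass_g = c(110, 160, 95, 150)
)

# Wide format: one row per plot, one biomass column per treatment
biomass_wide <- pivot_wider(biomass_long,
                            names_from  = treatment,
                            values_from = biomass_g)

# And back to long format again
pivot_longer(biomass_wide, cols = c(control, fertilized),
             names_to = "treatment", values_to = "biomass_g")
```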
Below is an example of how we might structure our grassland data in a relational database. The schema consists of three tables that house information about sampling events (when, where data were collected), the plots from which the samples are collected, and the biomass values for each collection. The schema allows us to define the data types (e.g., text, integer), add constraints (e.g., values cannot be missing), and to describe relationships between tables (keys). Relational formats are normalized to reduce data redundancy and increase data integrity, which can help us to manage complex data9.
Advantages: reduced redundancy, greater integrity; community standard; powerful extensions (e.g., store and process spatial data); many different database flavors to meet specific needs
Disadvantages: significant metadata needed to describe and use; more complex to publish; learning curve
Possible file formats: Database storage; can be represented in delimited text (CSV)
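As a minimal sketch of such a schema (table and column names are hypothetical; assumes the DBI and RSQLite packages), the three tables described above could be created in a small SQLite database:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # throwaway in-memory database

dbExecute(con, "
  CREATE TABLE plots (
    plot_id   INTEGER PRIMARY KEY,
    treatment TEXT NOT NULL            -- e.g., 'control' or 'fertilized'
  )")
dbExecute(con, "
  CREATE TABLE sampling_events (
    event_id   INTEGER PRIMARY KEY,
    event_date TEXT NOT NULL           -- ISO 8601 date, e.g., '1974-07-02'
  )")
dbExecute(con, "
  CREATE TABLE biomass (
    plot_id   INTEGER NOT NULL REFERENCES plots (plot_id),
    event_id  INTEGER NOT NULL REFERENCES sampling_events (event_id),
    biomass_g REAL NOT NULL            -- the value itself cannot be missing
  )")

dbDisconnect(con)
```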
A richer example is the ecocomDP10 harmonized data format for biodiversity data, in which eight related tables are defined, along with a set of relationships between tables (keys) and constraints on the allowable values in each table.
There are also many possibilities for making large synthesis datasets available and useful in the cloud. These require specialized knowledge and tooling, and reliable access to cloud platforms.
Advantages: easier access to big (high volume) data, can integrate with web apps
Disadvantages: less familiar/accessible to many scientists, few best practices to follow, costs can be higher
Possible file formats: Parquet files, object storage, distributed/cloud databases
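As one small example of a cloud-friendly format (assumes the arrow package; values and file name invented), tabular data can be written to and read from Parquet files:

```r
library(arrow)

# A tiny illustrative table
biomass <- data.frame(site = c("A", "B"), plot = c(1, 1),
                      biomass_g_m2 = c(120, 140))

# Write to a compressed, columnar Parquet file...
write_parquet(biomass, "biomass_harmonized.parquet")

# ...and read it back; open_dataset() can read the same format lazily from
# local folders or cloud object storage
biomass_back <- read_parquet("biomass_harmonized.parquet")
```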
There are many, many other possible harmonized data formats. Here are a few possible examples:
- DarwinCore archives for biodiversity data
- Organismal trait databases
- Archives of cropped, labeled images for training machine or deep learning models
- Libraries of standardized raster imagery in Google Earth Engine
Data Analysis
Once the team has found sufficient source data, then cleaned, filtered, and harmonized countless datasets, and documented and described everything with quality metadata, it is finally time to analyze the data! Great! Load up R or Python and get started, and then tell us how it goes. We simply don’t have enough time to cover all the ins and outs of data analysis in a three-hour course. However, we have put a few helpful resources below to get you started, and many of the best practices we have talked about, or will talk about, apply:
- Document your analysis steps and comment your code, and generally try to make everything reproducible (a tiny sketch follows this list).
- Use version control as you analyze data.
- Give everyone a chance! Analyzing data is challenging, exciting, and a great learning opportunity. Having more eyes on the analysis process also helps catch interesting results or subtle errors.
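As a tiny, hypothetical illustration of a documented, scripted analysis step (the data and column names are invented):

```r
library(dplyr)

# Invented example data: biomass from control and fertilized plots
biomass <- data.frame(
  treatment = rep(c("control", "fertilized"), each = 4),
  biomass_g = c(110, 95, 102, 88, 160, 150, 170, 145)
)

# Summary statistics by treatment, commented so collaborators can follow along
biomass |>
  group_by(treatment) |>
  summarize(mean_biomass_g = mean(biomass_g), n_plots = n())

# A simple linear model of the treatment effect
summary(lm(biomass_g ~ treatment, data = biomass))
```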
Synthesis Group Case Studies
Estimated time: 10 min
To make some of these concepts more tangible, let’s consider some case studies. The following tabs contain GitHub repositories for real teams that have engaged in synthesis research and chosen to preserve and maintain their scripts in GitHub. Each has different strengths and you may find that facets of each feel most appropriate for your group to adopt. There is no single “right” way of tackling this but hopefully parts of these exemplars inspire you.
LTER SPARC Group: Soil Phosphorus Control of Carbon and Nitrogen
Stored their code here: lter / lter-sparc-soil-p
Highlights
- Straightforward & transparent numbering of workflow scripts
- File names also reasonably informative even without numbering
- Simple README in each folder written in human-readable language
- Custom .gitignore safety net
- Controls which files are “ignored” by Git (prevents accidentally sharing data/private information); a brief sketch follows below
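A minimal sketch of what that safety net can look like (assumes the usethis package; the ignored paths are hypothetical):

```r
# Add entries to the project's .gitignore so they are never pushed to GitHub
usethis::use_git_ignore(c(
  "05_data/",   # never commit raw or derived data files
  "*.csv",      # catch stray tabular exports anywhere in the project
  ".Renviron"   # local credentials / API keys stay on your machine
))
```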
LTER Full Synthesis Working Group: The Flux Gradient Project
Stored their code here: lter / lterwg-flux-gradient
Highlights
- Extremely consistent file naming conventions
- Strong use of sub-folders for within-project organization
- Top-level README includes robust description of naming convention, folder structure, and order of scripts in workflow
- Active contribution to code base by nearly all group members
- Facilitated by strong internal documentation and consensus-building prior to choosing this structure
LTER Full Synthesis Working Group: From Poles to Tropics: A Multi-Biome Synthesis Investigating the Controls on River Si Exports
Stored their code here: lter / lterwg-silica-spatial
Highlights
- Files performing similar functions share a prefix in their file name
- Use of GitHub “Release” feature to get a persistent DOI for their codebase
- Separate repositories for each manuscript
- Nice use of README as pseudo-bookmarks for later reference to other repositories
For more information about LTER synthesis working groups and how you can get involved in one, click here.
Additional Resources
Data Preparation
Data cleaning and filtering resources
- Data cleaning is complicated and varied, and entire books have been written on the subject.1112 For some general considerations on cleaning data, see EDI’s “Cleaning Data and Quality Control” resource
- OpenRefine is an open-source, cross-platform tool for iterative, scripted data cleaning.
- In the R language, the tidyverse libraries (particularly tidyr and dplyr) are often used for data cleaning, as are additional libraries like janitor.
- In Python, the pandas and numpy libraries provide useful data cleaning features. There are also some stand-alone cleaning tools like pyjanitor (started as a re-implementation of the R version) and cleanlab (geared towards machine learning applications).
- Both the R and Python data science ecosystems have excellent documentation resources that thoroughly cover data cleaning. For R, consider starting with Hadley Wickham’s R for Data Science book chapter on data tidying,13 and for Python check Wes McKinney’s Python for Data Analysis book chapter on data cleaning and preparation.14
Data harmonization resources
- For R and Python users, there are, again, excellent documentation resources that thoroughly cover data harmonization techniques like data filtering, reformatting, joins, and standardization. In Hadley Wickham’s R for Data Science book, the chapters on data transforms and data tidying are a good place to start. In Wes McKinney’s Python for Data Analysis book, the chapter on data wrangling is helpful.
- A nice article in “The Analysis Factor” describes wide vs long data formats and when to choose which (TLDR, it depends on your statistical analysis plan).
Data Analysis
- Harrer, M. et al. Doing Meta-Analysis with R: A Hands-On Guide. 2023. GitHub
- Once again, for R and Python users, the same two books mentioned above provide excellent beginning guidance on data analysis techniques (exploratory analysis, summary stats, visualization, model fitting, etc). In Wickham’s R for Data Science book, the chapter on exploratory data analysis will help. In McKinney’s Python for Data Analysis book, try the chapters on plotting and visualization and the introduction to modeling.
Courses, Workshops, and Tutorials
- Synthesis Skills for Early Career Researchers (SSECR) course. 2024. LTER Network Office
- Reproducible Approaches to Arctic Research Using R workshop. 2024. Arctic Data Center & NCEAS Learning Hub
- Collaborative Coding with GitHub workshop. 2024. NCEAS Scientific Computing team
- Coding in the Tidyverse workshop. 2023. NCEAS Scientific Computing team
- Shiny Apps for Sharing Science workshop. 2022. Lyon, N.J. et al.
- Ten Commandments for Good Data Management. 2016. McGill, B.
Literature
- Todd-Brown, K.E.O., et al. Reviews and Syntheses: The Promise of Big Diverse Soil Data, Moving Current Practices Towards Future Potential. 2022. Biogeosciences
- Borer, E.T. et al. Some Simple Guidelines for Effective Data Management. 2009. Ecological Society of America Bulletin
Other
- Better commit messages with Conventional Commits
Footnotes
Lyon, N. J., Chen, A., Brun, J. (2023). Collaborative Coding with GitHub. LNO Scientific Computing Team. https://nceas.github.io/scicomp-workshop-collaborative-coding/.↩︎
Poulsen, C. V. & Chen, A. (2024). NCEAS coreR for Delta Science Program. NCEAS Learning Hub. https://learning.nceas.ucsb.edu/2024-06-delta.↩︎
Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. (1997). Nongeospatial Metadata for the Ecological Sciences. Ecological Applications, 7: 330-342. https://doi.org/10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2↩︎
Mayernik, M.S. and Acker, A. (2018), Tracing the traces: The critical role of metadata within networked communications. Journal of the Association for Information Science and Technology, 69: 177-180. https://doi.org/10.1002/asi.23927↩︎
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎
Zimmerman, N. 2016. Hand-crafted relational databases for fun and science↩︎
O’Brien, Margaret, et al. “ecocomDP: a flexible data design pattern for ecological community survey data.” Ecological Informatics 64 (2021): 101374. https://doi.org/10.1016/j.ecoinf.2021.101374↩︎
Osborne, Jason W. Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Sage Publications, 2012.↩︎
Van der Loo, Mark, and Edwin De Jonge. Statistical Data Cleaning with Applications in R. John Wiley & Sons, 2018. https://doi.org/10.1002/9781118897126↩︎
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. R for Data Science. O’Reilly Media, Inc., 2023. https://r4ds.hadley.nz/↩︎
McKinney, Wes. Python for Data Analysis. O’Reilly Media, Inc., 2022. https://wesmckinney.com/book↩︎