Operational Synthesis

Learning Objectives

After completing this module, you will be able to:

  • Identify characteristics of reproducible coding / project organization
  • Explain benefits of reproducibility (to your team and beyond)
  • Summarize the advantages of creating a defined contribution workflow
  • Explain how synthesis teams can use GitHub to collaborate more efficiently and reproducibly
  • Understand best practices for preparing and analyzing data to be used in synthesis projects

Introduction

Here are a few serviceable definitions of what research is:

  • “Creative and systematic work undertaken in order to increase the stock of knowledge” from the 2015 Frascati Manual
  • “Studious inquiry or examination,” and especially “investigation or experimentation aimed at the discovery and interpretation of facts” from the Merriam-Webster dictionary

To these basic definitions of research, our definition of synthesis research adds collaborative work, and the integration and analysis of a wide range of data sources, to achieve a more complete, generalizable, or useful research result. In Module 1 we discussed many of the collaborative considerations for synthesis research, including creating a diverse and inclusive team, asking synthesis-ready scientific questions (often broad in scope or spatial scale), and finding suitable information (or data) from a wide variety of sources to answer those questions. Once the synthesis team moves into the operational phase of research, which includes the integration and analysis of data, there are some key activities that must happen:

  • Cleaning and harmonizing data to make it usable
  • Analyzing data to answer questions
  • Interpreting the results of your analysis
  • Writing the papers or creating other research products

We’ve already seen that creating a collaborative, inclusive team can set the stage for successful synthesis research. Each of the operational activities above will also benefit from this mindset, and in this module we highlight some of the most important considerations and practices for a team science approach to the nuts-and-bolts of synthesis research.

Reproducibility Practices

PhD comics strip where a professor reassures a graduate student that they won't have to start code on a project from scratch, but then describes the code's many reproducibility issues to the graduate student's increasing dismay

Making one’s work “reproducible” (particularly in code contexts) has become increasingly popular, but the term is not always clearly defined. For the purposes of this short course, we believe that reproducible work:

  • Uses scripted workflows for all interactions with data
  • Contains sufficient documentation for those outside of the project team to navigate the project’s contents
  • Contains detailed metadata for all data products
  • Allows anyone to recreate the entire workflow from start to finish
  • Leads to modular, extensible research projects. Adding data from a new site, or a new analysis, should be relatively easy in a reproducible workflow.

Contributing

  • Create a formal collaboration plan that your whole team agrees to
  • Quarantine external inputs
  • Plan for “future you”
  • Communicate with your collaborators whenever you’re working on a specific script to avoid conflicting edits

Documenting

  • One folder per project
  • Further organize content via sub-folders
  • Make file names informative and intuitive
    • Avoid spaces and special characters in file names
    • Follow a consistent naming convention throughout
    • Good names should be machine readable, human readable, and sorted in a useful way
  • Use READMEs to record organization rules / explanation
  • Keep a log of where source data came from.
    • Where did you search?
    • What search terms did you use?
    • List the dataset identifiers you downloaded/used
    • Ideally, include downloading data as part of your scripted workflow (see the sketch after this list)
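For example, a minimal sketch of a scripted download in R might look like the following; the URL, dataset name, and folder names are hypothetical placeholders.

  # Scripted data download: the URL and file names below are hypothetical
  data_url  <- "https://example.org/datasets/knz_biomass_2022.csv"
  dest_file <- file.path("05_data", "raw", "knz_biomass_2022.csv")

  # Create the destination folder if needed, then download
  dir.create(dirname(dest_file), recursive = TRUE, showWarnings = FALSE)
  download.file(url = data_url, destfile = dest_file, mode = "wb")

Keeping the URL in the script means the download itself doubles as a log entry for where the source data came from.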

Example project structure:

project_new
 ├─ README.txt
 ├─ 01_grant_management
 ├─ 02_project_coordination
 ├─ 03_documentation
 ├─ 04_participant_tracking
 ├─ 05_data
 │    ├─ README.txt
 │    ├─ hydrology
 │    └─ water_chemistry
 ├─ 06_src
 │    ├─ README.txt
 │    ├─ data_aggregation
 │    ├─ data_harmonization
 │    └─ modeling
 └─ 07_publications
      └─ biogeochemistry

Coding

  • Use a version control system
  • Load libraries/packages explicitly
  • Track (and document) software versions
  • Namespace functions (if not already required by your coding language)
    • E.g., dplyr::mutate(mtcars, hp_disp = hp / disp)
  • Use relative file paths that are operating system-agnostic
  • Balance descriptive object names with the need to keep them concise
  • Use comments in the code!
  • Consider custom functions (several of these practices are combined in the sketch after this list)
  • For scripts that need to be run in order, consider adding step numbers to the file name
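Below is a minimal sketch that combines several of these practices; the file path, file name, and column names are hypothetical, and the pipe (|>) requires R 4.1 or later.

  # 02_summarize_biomass.R: illustrative only; path and columns are hypothetical
  library(dplyr)  # load packages explicitly

  # Relative, operating system-agnostic path built with file.path()
  biomass_path <- file.path("05_data", "biomass.csv")

  # Custom function: convert plot-total biomass (g) to g/m2 given plot area (m2)
  biomass_per_m2 <- function(total_g, plot_area_m2) {
    total_g / plot_area_m2
  }

  # Namespaced call (dplyr::mutate) makes each function's package explicit
  biomass <- read.csv(biomass_path) |>
    dplyr::mutate(biomass_g_m2 = biomass_per_m2(biomass_g, plot_area_m2))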

Synthesizing

How do reproducibility considerations differ for synthesis projects compared to individual / non-synthesis work?

  • Judgement calls need to be made / agreed to as a group
    • But “defer to the doers”
  • Increased emphasis on contribution guidelines / planning being formalized
  • More communication needs
  • Must ensure that every team member has sufficient access to the project files
  • It’s best to keep track of who contributed what so that everyone gets credit; this can be challenging in practice

In groups of 3-5, discuss the following questions:

  • What elements of reproducibility from our list have you used/are interested in using?
  • Which feel unreasonable or confusing?
  • What activities do you do in your own work to ensure reproducibility that our list is missing?

Version Control

In all scientific research, the data work (cleaning, harmonizing, analyzing) and the writing are iterative processes. The process and products change over time and usually require a series of revisions. In synthesis research, the process can become even more complex because the team is usually large and multiple people are contributing data, analysis, writing, revisions, and more. Using version control helps manage this complexity by recording changes, tracking individual contributions, and ensuring that things can be rolled back to an earlier state if needed.

PhD comics strip of a graduate student prematurely titling a document 'final.doc' and then undergoing a series of revisions with progressively more complicated and less informative file names

As the comic shows, you might have several drafts of your paper before the finalized version. With a version control system, all the revisions in each draft are saved. Version control systems provide a framework for preserving these changes without cluttering your computer with all of the files that precede the final version.1

Using version control enhances your workflow by allowing you to:

  • maintain a descriptive history of your research project’s development while keeping a clean workspace
    • no more cryptic file names or commented-out lines of code to track your progress
  • collaborate with team members and merge everyone’s edits together
  • explore bugs or new features without disrupting your team members’ work.2

Vocabulary

Images showing how Git is used on a person’s local computer and how GitHub is an online platform that hosts Git repositories

Here are some brief definitions for a selection of fundamental version control vocabulary terms.

  • Version control system: software that tracks iterative changes to your code and other files
  • Repository: the specific folder/directory that is being tracked by a version control system
  • Git: a popular open-source distributed version control system
  • GitHub: a website that allows users to store their Git repositories online and share them with others

GitHub

While this section of the module focuses on GitHub, there are several other viable alternatives for working with Git individually or as part of a larger team (e.g., GitLab, GitKraken, etc.). Any of these may be a viable option for your team; we focus on GitHub here only to provide a standard backdrop for the case studies we’ll discuss shortly.

Plenty of GitHub tutorials already exist, so rather than add our own variant to the list, we’ll work through part of one created by the Scientific Computing team of the National Center for Ecological Analysis and Synthesis (NCEAS).

See the workshop materials here.

Given the time restrictions for this short course, we’ll only cover how you engage with GitHub directly through the GitHub website. However, whatever software you use to write code will almost certainly have a way of connecting to GitHub (or a similar platform), so if this topic interests you it is worth seeking out the relevant tutorial.

Data Preparation

The scientific questions being asked in synthesis projects are usually broad in scope, and it is therefore common to bring together many datasets from different sources for analysis. The datasets selected for analysis (source data) may have been collected by different people, in different places, using different methods, as part of different projects… or all of the above. Typically, some amount of data cleaning - filtering or removing unwanted observations - and data harmonization - putting data together in common structures, file formats, and units of measurement - is necessary before analysis can begin. This process can be easy or difficult depending on the quality of the source data, the differences between source data, and how much metadata (see callout below) is available to understand them.

  • How many of you work directly with data in your day-to-day?
  • What percentage of the time that you spend working on data is spent on data cleaning?
  • How much on metadata creation?
  • How much on data preparation?

Metadata is “data about the data,” or information that describes who collected the data, what was observed or measured, when the data were collected, where the data were collected, how the observations or measurements were made, and why they were collected. Metadata provide important contextual information about the origin of the data and how they can be analyzed or used. They are most useful when attached or linked to the data being described, and data and related metadata together are commonly referred to as a dataset.

Metadata for ecological research data are well described in Michener et al (1997),3 but there are many other kinds of metadata with different purposes.4 If you are publishing a research dataset and have questions about metadata, ask a data manager for your project, or staff at the repository you are working with, for help. Either can typically provide guidance on creating metadata that will describe your data and be useful to the community (here is one example). We’ll return to the subject of metadata in Module 3.

Cleaning Data

When assembling large datasets from diverse sources, as in synthesis research, not all the source data will be useful. This may be because there are real or suspected errors, missing values, or simply because they are not needed to answer the scientific question being asked (wrong variable, different ecosystem, etc.). Data that are not useful are usually excluded from analysis or removed altogether. Data cleaning tends to be a stepwise, iterative process that follows a different path for every dataset and research project. There are some standard techniques and algorithms for cleaning and filtering data, but they are beyond the scope of this course. Below are a few guidelines to remember, and more in-depth resources for data cleaning are found at the end of this section.

  1. Always preserve the raw data. Chances are you’ll want to go back and check the original source data at least once.
  2. Use a scripted workflow to clean and filter the raw data, and follow the usual rules about reproducibility (comments, version control, functionalization).
  3. Consider using the concept of data processing “levels,” meaning that defined sets of data flagging, removal, or transformation operations are applied consistently to the data in stepwise fashion. For example, incoming raw data would be labeled “level 0” data, and “level 1” data is reached after the first set of processing steps is applied. (A brief sketch of this approach follows this list.)
  4. Spread the data cleaning workload around! Data cleaning typically demands a HUGE fraction of the total time devoted to working with data,567 and it can be tedious work. Make sure the team shares this workload equitably.
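As a concrete illustration of the “levels” idea, here is a minimal sketch of a level 0 to level 1 cleaning step in R; the folder layout, file names, and column names are all hypothetical.

  library(dplyr)

  # Level 0: raw data exactly as acquired; never overwrite these files
  raw <- read.csv(file.path("05_data", "level0", "biomass_raw.csv"))

  # Level 1: flag suspect values and drop rows missing key fields
  level1 <- raw |>
    dplyr::mutate(flag_negative_biomass = biomass < 0) |>
    dplyr::filter(!is.na(date), !is.na(plot))

  # Write the level 1 product to its own folder, leaving level 0 untouched
  write.csv(level1, file.path("05_data", "level1", "biomass_level1.csv"),
            row.names = FALSE)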

Data Harmonization

Data harmonization is the process of bringing different datasets into a common format for analysis. The harmonized data format chosen for a synthesis project depends on the source data, analysis plans, and overall project goals. It is best to make a plan for harmonizing data BEFORE analysis begins, which means discussing this with the team in the early stages of a synthesis project. As a general rule, it is also wise to use a scripted workflow that is as reproducible as possible to accomplish the harmonization you need for your project. Following this guidance lets others understand, check, and adapt your work, and will also make it much, much easier to bring new data and analysis methods into the project.

Data harmonization is hard work that sometimes requires trial and error to arrive at a useful end product. At the end of this section are some additional data harmonization resources to help you get started. Looking at a simple example might also help.

Example: Harmonizing grassland biomass data

In the figure below, two datasets from different LTER sites have been harmonized into one file for analysis. We don’t have all the metadata here, but based on the column naming we can assume that the file on the left (Konza_harvestplots.txt, yellow) contains columns for the sampling date, a plot identifier, a treatment categorical value, and measured biomass values. The filename suggests that the data come from the Konza Prairie LTER site. The file on the right (2022_clips.csv, blue) has a column denoting the LTER site, in this case Sevilleta LTER (abbreviated as SEV), as well as similar date, plot identifier, treatment, and measured biomass columns.

There are some similarities and some differences in these two source files. A harmonized file (lter_grass_biomass_harmonized.txt) appears below.

Image showing two small data tables with different column names and data structure being modified to be able to combine them into a single, larger table

Take a minute to look at the harmonized file and consider how these data were harmonized. Then, answer the question below.

Even though the source data files were similar, several important changes were made to create the harmonized file. Among them (a code sketch of these steps follows the list):

  • The site column was preserved and contains the “SEV” and “KNZ” categorical values denoting which LTER site is observed in each row.
  • The dates in the date column were converted to a standard format (YYYY-MM-DD). Incidentally, this matches the International Organization for Standardization (ISO) recommendation for date/time data, making it generally a good choice for dates.
  • A new rep column was created to identify the replicate measurements at each site. This is essentially a standardized version of the plot identifier columns (PlotID and plot) in the original data files. Also note that the original values are preserved in the new plot_orig column.
  • The treatment columns from the original data files (Treatment and trt) were standardized to one column with “C” and “F” categorical values, for control and fertilized treatments, respectively.
  • The biomass values from Konza were converted to units of grams per square meter (g/m2) because the original Konza measurements were for total biomass in 2x2 meter plots. Note that this conversion is for illustration purposes only - we can’t be sure if this conversion is correct without it being spelled out in the metadata or asking the data provider directly.
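Here is a minimal sketch of how steps like these might be scripted in R. The file names follow the figure, but several details are assumptions for illustration only: the Date, Biomass, and SITE column names, the input date formats, and the division by 4 (for the 2x2 m plots). The treatment recoding and the rep column described above are omitted for brevity.

  library(dplyr)

  # Konza file: tab-delimited; convert total biomass per 2x2 m plot to g/m2 (assumed)
  knz <- read.delim("Konza_harvestplots.txt") |>
    dplyr::transmute(
      site      = "KNZ",
      date      = as.Date(Date, format = "%m/%d/%Y"),  # assumed input format
      plot_orig = PlotID,
      treatment = Treatment,
      biomass   = Biomass / 4
    )

  # Sevilleta file: comma-delimited; biomass assumed to already be in g/m2
  sev <- read.csv("2022_clips.csv") |>
    dplyr::transmute(
      site      = SITE,
      date      = as.Date(date),                       # assumed ISO input
      plot_orig = plot,
      treatment = trt,
      biomass   = biomass
    )

  # Stack the two sources and write the harmonized file
  harmonized <- dplyr::bind_rows(knz, sev)
  write.table(harmonized, "lter_grass_biomass_harmonized.txt",
              sep = "\t", row.names = FALSE)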

A Word about Harmonized Data Formats

Above, we have discussed several aspects of selecting a data format. There are at least three related, but not exactly equivalent, concepts to consider when formatting data. First, formats describe the way data are structured, organized, and related within a data file. For example, in a tabular data file about biomass, the measured biomass values might appear in one column or in multiple columns. Second, the values of any variable can be represented in more than one format. The same date, for example, could be formatted using text as “July 2, 1974” or “1974-07-02.” Third, format may refer to the file format used to hold data on a disk or other storage medium. File formats like comma-separated values text files (CSV), Excel files (.xlsx), and JPEG images are commonly used for research data, and each has particular strengths for certain kinds of data.

A few guidelines apply:

  1. For formatting a tabular dataset, err towards simpler data structures, which are usually easier to clean, filter, and analyze. Long-format tables, or tidy data,8 are one common recommendation for this.
  2. When choosing a file format, err towards open, non-proprietary file formats that more people know and have access to. Delimited text files, such as CSV files, are a good choice for tabular data.
  3. Use existing community standards for formatting variables and files as long as they suit your project methods and scientific goals. Using ISO standards for date-time variables, or species identifiers from a taxonomic authority, are good examples of this practice.
  4. There is no perfect data format! Harmonizing data always involves some judgement calls and tradeoffs.

When choosing a destination format for the harmonized data for a synthesis project, the audience and future uses of the data are also an important consideration. Consider how your synthesis team will analyze the data, as well as how the world outside that team will use and interact with the data once it is published. Again, there is no one answer, but below are a few examples of harmonized destination formats to consider.

Here our grassland biomass data is in long format, often referred to as “tidy” data. Data in this format is generally easy to understand and use. There are three rules for tidy data:

  1. Each column is one variable.
  2. Each row is one observation.
  3. Each cell contains a single value.

Image of a table of data in 'tidy' format (i.e., where each column is one variable, each row is one observation, and each cell contains a single value)

visual representation of the tidy data structure

Advantages: clear meaning of rows and columns; ease in filtering/cleaning/appending
Disadvantages: not as human-friendly, so it can be difficult to assess the data visually
Possible file formats: Delimited text (tab delimited shown here), spreadsheets, database tables

An example of tidy data in long format

The harmonized grassland data, in long format.

In this dataset, our grassland data has been restructured into wide format, often referred to (sometimes unfairly) as “messy” or “untidy” data. Note that the biomass variable has been split into two columns, one for control plots and one for fertilized plots (a pivoting sketch appears after this example).

Advantages: easier for some statistical analyses (ANOVA, for example); easier to assess the data visually
Disadvantages: may be more difficult to clean/filter/append because there are multiple observations per row; more likely to contain empty (NULL) cells
Possible file formats: Delimited text (tab delimited shown here), spreadsheets, database tables

An example of data restructured into wide format

The harmonized grassland data, restructured into wide format with biomass values in control and fertilized columns.
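Reshaping between long and wide formats is usually a single function call in a scripted workflow. Here is a minimal sketch using the tidyr package, assuming a harmonized long-format table (like the one built in the earlier sketch) with treatment and biomass columns; the new columns take the treatment values (“C”, “F”) as their names, which the figure relabels as control and fertilized.

  library(tidyr)

  # Long -> wide: one biomass column per treatment level
  biomass_wide <- tidyr::pivot_wider(
    harmonized,
    names_from  = treatment,
    values_from = biomass
  )

  # Wide -> long again, if a later analysis calls for tidy data
  biomass_long <- tidyr::pivot_longer(
    biomass_wide,
    cols      = c("C", "F"),
    names_to  = "treatment",
    values_to = "biomass"
  )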

Below is an example of how we might structure our grassland data in a relational database. The schema consists of three tables that house information about sampling events (when, where data were collected), the plots from which the samples are collected, and the biomass values for each collection. The schema allows us to define the data types (e.g., text, integer), add constraints (e.g., values cannot be missing), and describe relationships between tables (keys). Relational formats are normalized to reduce data redundancy and increase data integrity, which can help us to manage complex data9. A minimal sketch of defining such a schema appears after this example.

An example of relational data where several tables are linked by a shared column even though they have differing variables and numbers of observations

example grassland database schema

Advantages: reduced redundancy, greater integrity; community standard; powerful extensions (e.g., store and process spatial data); many different database flavors to meet specific needs
Disadvantages: significant metadata needed to describe and use; more complex to publish; learning curve
Possible file formats: Database stores, can be represented in delimited text (CSV)
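Here is a minimal sketch of how a schema along these lines might be defined, using the DBI and RSQLite packages in R with an in-memory database; the table and column names echo the example above but are illustrative assumptions.

  library(DBI)

  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Plots: one row per plot, identified by a primary key
  DBI::dbExecute(con, "
    CREATE TABLE plots (
      plot_id   TEXT PRIMARY KEY,
      site      TEXT NOT NULL,
      treatment TEXT NOT NULL
    )")

  # Sampling events: when and where each sample was collected
  DBI::dbExecute(con, "
    CREATE TABLE sampling_events (
      event_id INTEGER PRIMARY KEY,
      plot_id  TEXT NOT NULL REFERENCES plots (plot_id),
      date     TEXT NOT NULL
    )")

  # Biomass: one measured value per sampling event
  DBI::dbExecute(con, "
    CREATE TABLE biomass (
      event_id     INTEGER NOT NULL REFERENCES sampling_events (event_id),
      biomass_g_m2 REAL
    )")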

A richer example is a schematic of the related tables that comprise the ecocomDP10 harmonized data format for biodiversity data. Eight tables are defined, along with a set of relationships between tables (keys), and constraints on the allowable values in each table.

Another example of a relational set of tables implicitly connected by shared index columns. Curved lines indicate which tables are connected to one another and which columns define that relationship

The ecocomDP schema. Each table has a name (top cell) and a list of columns. Shaded column names are primary keys, hashed columns have constraints, and arrows represent relations between keys/constraints in different tables.

There are many possibilities for making large synthesis datasets available and useful in the cloud, though these generally require specialized knowledge and tooling, as well as reliable access to cloud platforms (a small Parquet example appears below).

Advantages: easier access to big (high volume) data, can integrate with web apps
Disadvantages: less familiar/accessible to many scientists, few best practices to follow, costs can be higher
Possible file formats: Parquet files, object storage, distributed/cloud databases

A stylized cloud containing the logos for Parquet, S3, PostgreSQL, and Apache Arrow

A few of the cloud-native technologies that might be useful for synthesis research products.
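As a small taste of these tools, here is a minimal sketch using the arrow package in R to write a table to Parquet, a columnar, cloud-friendly file format; the harmonized object and the file name are hypothetical carry-overs from the earlier sketches.

  library(arrow)

  # Write the harmonized table to a compressed, columnar Parquet file
  arrow::write_parquet(harmonized, "lter_grass_biomass_harmonized.parquet")

  # Read it back locally (arrow can also read from cloud object stores such as S3)
  biomass <- arrow::read_parquet("lter_grass_biomass_harmonized.parquet")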

There are many, many other possible harmonized data formats. Here are a few examples:

  • DarwinCore archives for biodiversity data
  • Organismal trait databases
  • Archives of cropped, labeled images for training machine or deep learning models
  • Libraries of standardized raster imagery in Google Earth Engine

Data Analysis

Once the team has found sufficient source data, then cleaned, filtered, and harmonized countless datasets, and documented and described everything with quality metadata, it is finally time to analyze the data! Great! Load up R or Python and get started, and then tell us how it goes. We simply don’t have enough time to cover all the ins and outs of data analysis in a three-hour course. However, we have put a few helpful resources below to get you started, and many of the best practices we have talked about, or will talk about, apply:

  1. Document your analysis steps and comment your code, and generally try to make everything reproducible.
  2. Use version control as you analyze data.
  3. Give everyone a chance! Analyzing data is challenging, exciting, and a great learning opportunity. Having more eyes on the analysis process also helps catch interesting results or subtle errors.

Synthesis Group Case Studies

Estimated time: 10 min

To make some of these concepts more tangible, let’s consider some case studies. The following tabs contain GitHub repositories for real teams that have engaged in synthesis research and chosen to preserve and maintain their scripts in GitHub. Each has different strengths and you may find that facets of each feel most appropriate for your group to adopt. There is no single “right” way of tackling this but hopefully parts of these exemplars inspire you.

LTER SPARC Group: Soil Phosphorus Control of Carbon and Nitrogen

Stored their code here: lter / lter-sparc-soil-p

Highlights

  • Straightforward & transparent numbering of workflow scripts
    • File names also reasonably informative even without numbering
  • Simple README in each folder written in human-readable language
  • Custom .gitignore safety net
    • Controls which files are “ignored” by Git (prevents accidentally sharing data/private information; see the short example after this list)
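For reference, a minimal .gitignore along these lines might contain patterns like the following; these are illustrative, not copied from the group’s actual file.

  # Never commit data files or credentials
  05_data/
  *.csv
  *.xlsx
  .env

  # R session artifacts
  .Rhistory
  .RData
  .Rproj.user/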

LTER Full Synthesis Working Group: The Flux Gradient Project

Stored their code here: lter / lterwg-flux-gradient

Highlights

  • Extremely consistent file naming conventions
  • Strong use of sub-folders for within-project organization
  • Top-level README includes robust description of naming convention, folder structure, and order of scripts in workflow
  • Active contribution to code base by nearly all group members
    • Facilitated by strong internal documentation and consensus-building prior to choosing this structure

LTER Full Synthesis Working Group: From Poles to Tropics: A Multi-Biome Synthesis Investigating the Controls on River Si Exports

Stored their code here: lter / lterwg-silica-spatial

Highlights

  • Files performing similar functions share a prefix in their file name
  • Use of GitHub “Release” feature to get a persistent DOI for their codebase
  • Separate repositories for each manuscript
  • Nice use of README as pseudo-bookmarks for later reference to other repositories

For more information about LTER synthesis working groups and how you can get involved in one, click here.

Additional Resources

Data Preparation

Data cleaning and filtering resources

  • Data cleaning is complicated and varied, and entire books have been written on the subject.1112 For some general considerations on cleaning data, see EDI’s “Cleaning Data and Quality Control” resource
  • OpenRefine is an open-source, cross-platform tool for iterative, scripted data cleaning.
  • In the R language, the tidyverse libraries (particularly tidyr and dplyr) are often used for data cleaning, as are additional libraries like janitor.
  • In Python, pandas and numpy libraries provide useful data cleaning features. There are also some stand-alone cleaning tools like pyjanitor (started as a re-implementation of the R version) and cleanlab (geared towards machine learning applications).
  • Both the R and Python data science ecosystems have excellent documentation resources that thoroughly cover data cleaning. For R, consider starting with Hadley Wickham’s R for Data Science book chapter on data tidying,13 and for Python, check Wes McKinney’s Python for Data Analysis book chapter on data cleaning and preparation.14

Data harmonization resources

  • For R and Python users, there are, again, excellent documentation resources that thoroughly cover data harmonization techniques like data filtering, reformatting, joins, and standardization. In Hadley Wickham’s R for Data Science book, the chapters on data transforms and data tidying are a good place to start. In Wes McKinney’s Python for Data Analysis book, the chapter on data wrangling is helpful.
  • A nice article in “The Analysis Factor” describes wide vs long data formats and when to choose which (TLDR, it depends on your statistical analysis plan).

Data Analysis

Courses, Workshops, and Tutorials

Literature

Other

Footnotes

  1. Lyon, N. J., Chen, A., Brun, J. (2023). Collaborative Coding with GitHub. LNO Scientific Computing Team. https://nceas.github.io/scicomp-workshop-collaborative-coding/.↩︎

  2. Poulsen, C. V. & Chen, A. (2024). NCEAS coreR for Delta Science Program. NCEAS Learning Hub. https://learning.nceas.ucsb.edu/2024-06-delta.↩︎

  3. Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. (1997), NONGEOSPATIAL METADATA FOR THE ECOLOGICAL SCIENCES. Ecological Applications, 7: 330-342. https://doi.org/10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2↩︎

  4. Mayernik, M.S. and Acker, A. (2018), Tracing the traces: The critical role of metadata within networked communications. Journal of the Association for Information Science and Technology, 69: 177-180. https://doi.org/10.1002/asi.23927↩︎

  5. Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎

  6. New York Times, 2014↩︎

  7. Anaconda State of Data Science Report, 2022↩︎

  8. Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎

  9. Zimmerman, N. 2016. Hand-crafted relational databases for fun and science↩︎

  10. O’Brien, Margaret, et al. “ecocomDP: a flexible data design pattern for ecological community survey data.” Ecological Informatics 64 (2021): 101374. https://doi.org/10.1016/j.ecoinf.2021.101374↩︎

  11. Osborne, Jason W. Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage publications, 2012.↩︎

  12. Van der Loo, Mark, and Edwin De Jonge. Statistical data cleaning with applications in R. John Wiley & Sons, 2018. https://doi.org/10.1002/9781118897126↩︎

  13. Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. R for Data Science. O’Reilly Media, Inc., 2023. https://r4ds.hadley.nz/↩︎

  14. McKinney, Wes. Python for Data Analysis. O’Reilly Media, Inc., 2022. https://wesmckinney.com/book↩︎