LTER-core-metabase

Populate LTER-core-metabase for generating EML

Last updated: January 10th, 2020

See installation here.

This guide walks you through populating LTER-core-metabase for generating EML, and addresses frequently encountered problems.

CV means “controlled vocabulary” as in parent table, not as in the LTER Controlled Vocabulary Thesaurus.

Schemas

The overall database design contains these schemas. They are designed for separate purposes and are mostly self-contained for each of those tasks.

Schema lter_metabase

There are four (but really only three) broad types of tables in lter_metabase:

Generally, you will only need to populate tables starting with EML once at the beginning and/or use the pre-loaded CVs that come with LTER-core-metabase. Tables starting with List contain site-specific CVs which tend to need additions occasionally. You will need to update dataset-specific tables (table names starting with DataSet) to add a new dataset or to revise existing datasets.

How to update datasets

When describing a new version of a dataset for which a previous version was already described in the database, you overwrite existing values with updated values. In other words, you only store the metadata for the updated/new version. You won’t be able to generate EML for the old data and old metadata (unless you change everything back to the old version in your database tables).

Two cases:

  1. Routine time series update (format of data remains the same; new data is appended to pre-existing tables)
  2. Dataset redesign (format of data changes: different rows or adding a table)

Case 1: Routine time series update

NOTE: If using MetaEgress or another external program which examines data files and inserts file-related metadata directly into the EML, you may choose to skip storing that metadata in metabase.

Case 2: Dataset redesign or non-routine changes

In either case,

  1. Increment DataSet.Revision
  2. Update DataSet.pubDate

How to create new datasets

For simplicity, let’s assume the parent tables (EMLStuff and ListStuff) are already populated. Most of the DataSetStuff tables are cross-reference tables, selecting an item from a parent table and attaching it to a dataset, or at dataTable of that dataset, or an attribute of a dataTable of that dataset.

  1. DataSet: enter one new row for the new dataset. This must be done first.
  2. DataSetEntities must be filled before
    1. DataSetAttributes which must be filled before these, which are optional and in any order:
      1. DataSetAttributeEnumeration
      2. DataSetAttributeMissingCodes
  3. DataSetMethodSteps must be filled before these, which are optional and in any order:
    1. DataSetMethodInstruments
    2. DataSetMethodProtocols
    3. DataSetMethodProvenance
    4. DataSetMethodSoftware
  4. These may be filled any time after DataSet, and in any order:
    1. DataSetKeywords
    2. DataSetPersonnel
    3. DataSetSites
    4. DataSetTaxa
    5. DataSetTemporal

Boilerplate information

Boilerplate refers to information about your project or LTER site that is reused across most if not all of your datasets, including:

Boilerplate is modelled in LTER-core-metabase in the table mb2eml_r.boilerplate. Some columns in that table are type XML, within which you would paste a XML snippet as appropriate for that column. For example, in the licensed column, you can have:

<licensed>
  <licenseName>Creative Commons Zero v1.0 Universal</licenseName>
  <url>https://spdx.org/licenses/CC0-1.0.html</url>
  <identifier>CC0-1.0</identifier>
</licensed>

Fill out these trees as applicable to your LTER site or project. Refer to the EML best practices v3 document. Scope is required. Note that contact, metadata provider, and publisher utilize metabase name IDs, which are controlled in the table ListPeople. The corresponding view mb2eml_r.vw_eml_bp_people joins the two source tables to return information on personnel, and thus updates when you update ListPeople. We thought about implementing the same treatment for project personnel; however, this requires fully modeling the project tree, which we thought was a bit overkill. Tell us if you think otherwise.

NOTE: MetaEgress will not read in other fields not listed above. In fact, the new boilerplate database model doesn’t quite allow for other fields.

NOTE: licensed is a new EML 2.2 element; optional in metabase. Another new EML 2.2 element is award; if you wish to use this element, insert a XML snippet inside the project tree under boilerplate.project.

Dataset methods

Metabase can describe dataset collection and sampling methods, instruments, software, protocols, and other aspects related to how data were generated. There are two ways to populate an EML methods section from metabase:

Note that the VIEWs abstraction layer and MetaEgress support both approaches (minimal, document-only, and normalized, modelled).

Confused what goes where?

Most of the time tables and column names make it pretty clear where different pieces of metadata is supposed to go. We have peppered the database with comments on columns where we think there might be confusion. However, for certain pieces it’s not so obvious.

Where does this go?

What is this table column meant for? Does it populate an EML element?

Schema mb2eml_r

Nothing to do here. The only table is boilerplate. See boilerplate section. Everything else is VIEWs for MetaEgress.

Schema pkg_mgmt

There’s a lot in this schema; however as of 2020-01-10, we have not critically reviewed these tables for their usefulness beyond SBC/MCR LTER.

The only table whose information can be exposed to EML via MetaEgress is pkg_mgmt.maintenance_changehistory.

Another table of note is version_tracker_metabase. This tracks changes to the schema of your installation of LTER-core-metabase itself. Due to the evolving nature of the schema, any one installation of metabase will eventually need to be updated with patches. It’s our team practice to write an INSERT statement to this table in each patch, and so by executing the patches on your installation, this table records the patches and the time installed.

Schema semantic_annotation

This schema contains tables to annotate your datasets and table attributes with semantic annotations from established ontologies. The same rules of thumb about table prefixes and population order from before apply here, i.e. fill out EMLAnnotationTerms and EMLObjectProperties first before DataSetAnnotations and DataSetAttributeAnnotations.

EMLAnnotationTerms are meant to contain terms that are concepts, measurements, etc, e.g. “temperature”, “current speed”. Recommended ontologies: ESCO, ENVO, SWEET.

EMLObjectProperties are meant to contain terms that denote relationships between terms, e.g. “isAbout” or “containsMeasurementsOf”. Recommended ontologies: OBOE.

Known wonkinesses and workarounds

General notes

Remove all hyperlinks in Word documents linked to in abstracts and methods documents.

Any attributes for which you wish to enumerate code-definition pairs in EML needs to be entered as either “nominalEnum” or “ordinalEnum” in DataSetAttributes.MeasurementScaleDomainID. Either of those options will be listed as “EnumeratedDomain” in vw_eml_attributes.domain. If an attribute is not “EnumeratedDomain”, then even if there are code-defintion pairs listed for it, the EML::set_attributes() function in the EML R package will not list them in the exported EML document.

Issues specific to DBeaver

The pilcrow/paragraph symbol/backwards P. Not sure how these materialize in DBeaver but they do. If present, exported metadata doesn’t look any different, but if present in “filename” type columns, files cannot be located. Cannot be deleted by itself, so re-enter the whole field.

Blanks in DBeaver are NOT the same as NULLs. Be careful when copying and pasting. Constraints/conditions that require NULLs will not work correctly. If pasting whole columns from Excel, you need to right-click/Edit/Set to NULL in DBeaver manually.

Note that the m-dash (long) is not valid in numeric fields, as opposed to the n-dash (short) which denotes negative numbers.