This is a portion of the ‘Values’ table. The fundamental record in Vega represents an individual floating point value of a discrete measurement. These values are all stored in a single table called ‘Values’. Because this table stores all observations within the system, it must be optimized to minimize storage requirements and to speed up queries.
Each value is stored as a floating point double, is time stamped by a date time field, and is linked to its metadata by its stream identification (figure to the right), to be described in more detail later. The ‘ValueID’ field is included for convenience when programming and manipulating individual values. ‘Flag’ is included to allow for QA/QC descriptive data and to maintain backwards compatibility with systems that use data flagging as an indicator of potential data quality or other metadata.
Duplicate data are prevented at the table level. A unique index is defined for the ‘Values’ table on the ‘DateTime’ and ‘StreamID’ columns. No two values can have both the same stream and timestamp.
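The duplicate rule above can be sketched with a small in-memory example. This is only an illustration, assuming SQLite as the engine (the source does not say which database Vega runs on); the column types, the index name `idx_values_stream_time`, and the helper `insert_value` are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only for illustration.
conn = sqlite3.connect(":memory:")

# Column names follow the text; the types are assumptions.
conn.execute(
    """
    CREATE TABLE "Values" (
        ValueID  INTEGER PRIMARY KEY,
        StreamID INTEGER NOT NULL,
        DateTime TEXT    NOT NULL,
        Value    REAL    NOT NULL,
        Flag     TEXT
    )
    """
)

# Unique index on (DateTime, StreamID): no two values may share both.
conn.execute(
    'CREATE UNIQUE INDEX idx_values_stream_time ON "Values" (DateTime, StreamID)'
)


def insert_value(stream_id, timestamp, value):
    """Return True if the row was stored, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'INSERT INTO "Values" (StreamID, DateTime, Value) VALUES (?, ?, ?)',
                (stream_id, timestamp, value),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a repeated (DateTime, StreamID) pair.
        return False
```

With this setup, a second value for the same stream and timestamp is rejected, while the same timestamp on a different stream is accepted.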
This is a portion of the ‘Streams’ table in Vega, showing streams relating to the Crystal Bog Buoy site. The data stream is an entity designed to fully describe data that only vary through time, or in other words, a unique time series. Each stream is described by attributes stored in the ‘Streams’ table and can be thought of as a unique combination of attributes. For example, air temperature sampled at a particular meteorological station through time would be a unique stream. Soil temperature at that same station would be a different data stream.
Each stream has required attributes necessary to form a unique description and optional attributes necessary when those required are insufficient to uniquely describe the stream or when additional metadata are desired (the portion of the table to the left leaves out the optional attributes). Each stream is assigned a unique integer identifier, ‘StreamID’, forming the one-to-many relationship back to the ‘Values’ table.
The Vega database takes in different types of raw data from various sources (e.g. air temperature from the Crystal Bog Buoy). If a particular source isn’t working properly, its data will often come in with erroneous values, which limits how the data can be used. For this reason, a range checks program runs as part of the GLEON QA/QC process. This program filters out erroneous data points as they come into the Vega database and writes the filtered data to a separate database, Vega 1.
The range checks program uses a table in the Vega database called ‘Ranges’. This table lists the maximum and minimum values allowed for each variable, or none if the variable can accept any value that comes in. For example, if air temperature comes in as more than 50 or less than -100 degrees Celsius, the value will not be added to the Vega 1 database. New variables can be added to this table at any time and will immediately be used by the range checks program.
To the right is a portion of the ‘Ranges’ table. VariableName specifies the variable, and UnitShort specifies the units of that variable. This lets us set the max and min for one unit and set them to something else for another if we choose to do so (e.g. wind speed has a max of 50 and a min of 0 for m/s, while its max and min are unspecified for m^3/s^3). A “NULL” field is used when no maximum or minimum is specified, thus allowing all data points from that particular variable and unit through to Vega 1.
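A minimal sketch of the range check described above, assuming the ‘Ranges’ rows have been loaded into a dictionary keyed by (VariableName, UnitShort). The dictionary `RANGES`, the function names, and the unit label "C" are hypothetical, and treating an unlisted variable as unbounded is an assumption rather than documented behavior. `None` stands in for a “NULL” field.

```python
# (VariableName, UnitShort) -> (minimum, maximum); None means "NULL" (no limit).
RANGES = {
    ("Air Temperature", "C"): (-100.0, 50.0),  # unit label "C" is assumed
    ("Wind Speed", "m/s"): (0.0, 50.0),
    ("Wind Speed", "m^3/s^3"): (None, None),
}


def in_range(value, minimum, maximum):
    """Return True if value lies within the bounds; a None bound is ignored."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def passes_range_check(variable, unit, value):
    """Look up the bounds for a variable and unit, then test the value."""
    # Assumption: a variable/unit pair missing from the table is unbounded.
    minimum, maximum = RANGES.get((variable, unit), (None, None))
    return in_range(value, minimum, maximum)
```

The bounds are inclusive here, matching the text: a value is rejected only when it is more than the maximum or less than the minimum.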
The range checks program will run continuously on a machine in the UW-CFL. It checks for new data points coming into the Vega database every minute, filters the data accordingly, writes the clean data to the Vega 1 database, waits for another minute, and repeats. This means the Vega 1 database is always current to within a minute with clean data.
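The check, filter, write, wait cycle can be sketched as follows. This is not the actual program: the callables `fetch_rows`, `write_rows`, and `is_valid` are hypothetical placeholders for the database access and the range test, and tracking progress by the highest ‘ValueID’ seen so far is an assumption about how new data points might be identified.

```python
import time


def filter_new_points(rows, last_seen_id, is_valid):
    """Return (clean_rows, newest_id) for rows with ValueID > last_seen_id.

    Each row is a (ValueID, StreamID, DateTime, Value) tuple.
    """
    clean = []
    newest_id = last_seen_id
    for value_id, stream_id, timestamp, value in rows:
        if value_id <= last_seen_id:
            continue  # already handled in an earlier pass
        newest_id = max(newest_id, value_id)
        if is_valid(stream_id, value):
            clean.append((value_id, stream_id, timestamp, value))
    return clean, newest_id


def run_range_checks(fetch_rows, write_rows, is_valid, interval_seconds=60):
    """Repeat forever: read new rows, keep the valid ones, write, then wait."""
    last_seen_id = 0
    while True:
        clean, last_seen_id = filter_new_points(fetch_rows(), last_seen_id, is_valid)
        write_rows(clean)  # clean rows go to the filtered database
        time.sleep(interval_seconds)
```

Keeping the filtering step in its own function means a single pass can be exercised without starting the endless loop.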
The graphs below demonstrate the effect of the program when something went wrong with the device that read the Crystal Bog buoy’s sensor temperature in June of 2007. Both graphs plot the buoy’s sensor temperature data over the same time frame, from the beginning of May 2007 to the end of July 2007. For this particular stream, data came in once every ten minutes. However, from June 4th to June 5th something went wrong with the buoy, and every data point that came in (120 data points in total) was a large negative number, around –99999. This is why the graph on the left, plotted from the unfiltered data with Matlab’s plot function, looks the way it does. The graph on the right was plotted the same way from the Vega 1 database.
Currently, the Vega database contains ~250,000 erroneous data points out of ~145,000,000 total (roughly 0.17%). Over a particular time frame, as in the example from the previous slide, these data points can be enough to disrupt a proper evaluation of the real data.
These bad data points can come in at any time and for many different reasons (buoy sensor malfunction, an animal tampering with a sensor, etc.), and thus are difficult to predict. With the range checks program in place, users can rely on a clean data set in the Vega 1 database.
Reference: Winslow, L. A., B. J. Benson, K. E. Chiu, P. C. Hanson, and T. K. Kratz. Vega: A Flexible Data Model for Environmental Time Series Data. Rep. Web. <http://gleon.org/media/Winslow_vega.pdf>.