Glossary
Authors | Rufus Pollock, Paul Walsh, Adam Kariv, Evgeny Karev, Peter Desmet, Data Package Working Group |
---|
A dictionary of special terms for the Data Package Standard.
Language
The key words MUST
, MUST NOT
, REQUIRED
, SHALL
, SHALL NOT
, SHOULD
, SHOULD NOT
, RECOMMENDED
, MAY
, and OPTIONAL
in this document are to be interpreted as described in RFC 2119.
Definitions
Profile
A profile is a URL that MUST
:
- resolves to a valid JSON Schema descriptor under the
draft-07
version - be versioned and immutable i.e. once published under some version it cannot be changed
A profile is both used as a metadata version identifier and the location of a JSON Schema against which a descriptor having it as a root level $schema
property MUST
be valid and MUST
be validated.
Similarly to JSON Schema, the $schema
property has effect only on the root level of a descriptor. For example, if a Table Dialect is published as a file it can include a $schema
property that affects its validation. If the same dialect is an object inlined into a Data Package descriptor, the dialect’s $schema
property MUST
be ignored and the descriptor as whole MUST
be validated against a root level $schema
property provided by the package.
Data Package Standard employes profiles as a mechanism for creating extensions as per Extensions specification.
Descriptor
The Data Package Standard uses a concept of a descriptor
to represent metadata defined according to the core specefications such as Data Package or Table Schema.
On logical level, a descriptor is represented by a data structure. The data structure MUST
be a JSON object
as defined in RFC 4627.
On physical level, a descriptor is represented by a file. The file MUST
contain a valid JSON object
as defined in RFC 4627.
This specification does not define any discoverability mechanisms. Any URI can be used to directly reference a file containing a descriptor.
Custom Properties
The Data Package specifications define a set of standard properties to be used and allows custom properties to be added. It is RECOMMENDED
to use namespace:property
naming convention for custom properties. It is RECOMMENDED
to use lower camel case convention for naming custom properties, for example, namespace:propertyName
.
Adherence to a specification does not imply that additional, non-specified properties cannot be used: a descriptor MAY
include any number of properties in additional to those described as required and optional properties. For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property temporal
(cf Dublin Core):
This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the Fiscal Data Package specification extends Data Package for publishing and consuming fiscal data.
URL or Path
A URL or Path
is a string
with the following additional constraints:
MUST
either be a URL or a POSIX path- URLs
MUST
be fully qualified.MUST
be using eitherhttp
,https
,ftp
, orftps
scheme. (Absence of a scheme indicatesMUST
be a POSIX path) - POSIX paths (unix-style with
/
as separator) are supported for referencing local files, with the security restraint that theyMUST
be relative siblings or children of the descriptor. Absolute paths/
, relative parent paths../
, hidden folders starting from a dot.hidden
MUST NOT
be used.
Example of a fully qualified url:
Example of a relative path that this will work both as a relative path on disk and online:
Tabular Data
Tabular data consists of a set of rows. Each row has a set of fields (columns). We usually expect that each row has the same set of fields and thus we can talk about the fields for the table as a whole.
In case of tables in spreadsheets or CSV files we often interpret the first row as a header row, giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases, the field names are explicitly designated.
To illustrate, here’s a classic spreadsheet table:
In JSON, a table would be:
Data Representation
In order to talk about the representation and processing of tabular data from text-based sources, it is useful to introduce the concepts of the physical and the logical representation of data.
The physical representation of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation can have some type information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).
The logical representation of data refers to the “ideal” representation of the data in terms of primitive types, data structures, and relations, all as defined by the specification. We could say that the specification is about the logical representation of data, as well as about ways in which to handle conversion of a physical representation to a logical one.
In this document, we’ll explicitly refer to either the physical or logical representation in places where it prevents ambiguity for those engaging with the specification, especially implementors.
For example, constraints
SHOULD
be tested on the logical representation of data, whereas a property like missingValues
applies to the physical representation of the data.