Context for this example:
- Provenance is open-ended and recursive
- It should be easy to state provenance, and to find + traverse such statements
- A realistic example helps visualize what is happening as multiple people interact with a collection
- Uses:
- Curators may use this to verify and correct collections
- Others may use this to attempt to replicate a process or conclusion
- Overlays may use this to apply their own source analysis in deciding what to show or rely on
At a minimum, this should include:
: publisher - who published this to the registry (signatory)
: authors - the proximal source for the collection as a whole, and for its elements
: sources - where the author isn’t the primary source, what other sources were used†
: place + time - where possible, capture this context (e.g. within R1: timestamps, network addresses)
: revision history - where possible, starting w/ other revisions of the same collection
† Or a positive indicator they are the primary source.
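As a rough sketch of what a record covering these five elements might look like (the field names and structure are assumptions for illustration, not a defined registry schema):

```python
# Illustrative only: field names are assumptions, not a defined registry schema.
minimal_provenance = {
    "publisher": "<registry account that signed the upload>",
    "authors": ["<proximal source of the collection / element>"],
    "sources": ["<other sources used>"],            # or:
    "is_primary_source": False,                     # a positive indicator the author is the primary source
    "place_time": {"published_at": "<timestamp>", "network_address": "<address>"},
    "revisions": ["<earlier revisions of the same collection>"],
}
```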
The Recipe Magnet collection
- Bo has been scraping recipes from two websites (Wi) into a collection. (v1 published as CL1.1 on Bo’s site at URx)
- Bo discovers a neat ‘Recipe Magnet’ scraper script (S, URs) maintained by Aya, and uses it to generate a new revision (CL2.1, URy).
- Cas, working with Bo, uploads CL2.1 to R1.
Some elements of data- and process-provenance we might care about in the future, for the entire collection (see the sketch of these records after this list):
- R1 automatically captures some registry-specific publication data:
  - Who published the collection to R1, when, where, past versions. (+ creation date?)
  - (User:Cas, now, from 10.0.0.42 using r1-update, link to CL1?)
- A simple cite-web schema could include data on where something was found on the web. There are many citation schemas to draw on.
  - Author, source URL, source publication date, retrieved-on date.
  - (Bo, URy, last month, now)
- A cite-process schema could include data on how it was produced.
  - Method, method author, method source URL, method parameters, other source URLs.
  - (S, Aya, URs, , [AllRecipes cookies, SpruceEats, Recipe Scraping guide])
- Aya could define a custom S-evocation schema for capturing their evocation of S (which can be used/referenced in the cite-process template).
  - Script URL, version, config; target websites [(name, URL)]
  - (S, 0.8.12, param-string, W[])
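As a rough sketch only (field names assumed; values copied from the example tuples above), those three schema instances might be written down like this:

```python
# Illustrative sketch; none of these are defined registry schemas.
cite_web = {
    "author": "Bo",
    "source_url": "URy",
    "source_published": "last month",
    "retrieved_on": "now",
}

s_evocation = {                       # Aya's custom schema for an evocation of S
    "script_url": "URs",              # the Recipe Magnet script S
    "version": "0.8.12",
    "config": "param-string",
    "target_websites": [],            # W[]: a list of (name, URL) pairs
}

cite_process = {
    "method": "S",
    "method_author": "Aya",
    "method_source_url": "URs",
    "method_parameters": None,        # left blank in the example
    "other_source_urls": ["AllRecipes cookies", "SpruceEats", "Recipe Scraping guide"],
    "evocation": s_evocation,         # the S-evocation referenced from cite-process
}
```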
Enriching provenance over time
I imagine this happening in stages, particularly when deciding what schemas apply and which parameters to fill out. One way this could proceed (sketched after the list):
- The registry publication-schema is captured on upload.
- Bo + Cas discuss whether to source the upload to a) the existing URL, b) a description of the process, or both. They update the collection to include partial data for both. (They can’t remember the retrieved-on date or the exact source URLs used. In fact the cite-process style guide is fuzzy on when to include a source in the list of source URLs. But they take a stab.)
- After getting feedback from an interested replicator, Bo rerenders the collection and captures the exact configuration in an S-evocation, adding that as a new version (CL2.2). Minor changes to the data, revised process prov.
- After discovering a bug in the scraper affecting 10% of recipes, it is fixed and rerun on that subset, producing a new version of S and a revised collection (CL2.3).
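Purely as an illustration of the staging, each new version could carry a fuller provenance record plus a link back to the previous one (names and structure assumed):

```python
# Illustrative: provenance accumulates across revisions of the collection.
cl2_1 = {"version": "CL2.1", "prev": "CL1.1",
         "provenance": ["registry publication-schema", "partial cite-web", "partial cite-process"]}
cl2_2 = {"version": "CL2.2", "prev": "CL2.1",   # exact configuration captured after replicator feedback
         "provenance": ["registry publication-schema", "cite-web", "cite-process", "S-evocation"]}
cl2_3 = {"version": "CL2.3", "prev": "CL2.2",   # scraper bug fixed; affected subset rerun with new S
         "provenance": ["registry publication-schema", "cite-web", "cite-process", "revised S-evocation"]}
```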
Provenancing individual recipes
We also want to know where each recipe comes from. Some elements of context we may care about (a sketch follows the list):
- Collection and registry context: collections it is in, when it was created, other versions (as the details of that recipe are refined).
- cite-web details, just for that recipe: author, source URL, publication date, retrieved-on date. (These differ for each recipe, and parts of this data are not available for some recipes.)
- cite-process details, for the recipe, including the details of the S-evocation, and which steps in that process applied to this recipe. Sometimes a blanket collection-level process summary suffices – a single evocation produced the entire output. But if S handles extraction, reconciliation, deduplication, and merging, or more complex modeling, you might want it to track which of these (using which sources + rules) produced each recipe in the result.
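A rough sketch of per-recipe provenance that references the collection-level records rather than repeating them (identifiers and field names are hypothetical):

```python
# Illustrative: recipe-level context points back at shared collection-level provenance.
recipe_provenance = {
    "recipe_id": "<hypothetical recipe identifier>",
    "collections": ["CL2.2"],                      # plus other versions as the recipe is refined
    "cite_web": {
        "author": "Bo",
        "source_url": "<recipe page URL>",         # differs for each recipe
        "published": None,                         # not available for every recipe
        "retrieved_on": "<date>",
    },
    "cite_process": {
        "evocation": "<reference to the collection-level S-evocation>",
        "steps": ["extraction", "deduplication"],  # which S steps produced this recipe
    },
}
```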
What else?
Is this entire collection a single assertion? When would you have separate assertions for each recipe?
Does some of this provenance vary by recipe, and need to be stored at that level of granularity?
- Part 1 (collection and registry context) and some of part 3 (process provenance) are the same across the collection, and don’t need repetition.
- The data schema for a recipe may include key parts of 2 (cite-web): author, source URL, publication date.
  A hard-core recipe site’s schema might include recipe versioning + variants.
This leaves recipe-specific context (retrieved-on date, archive-url, a description of the applicable S steps, long-tail fields in cite templates [which can often be autogenerated from the URL]) that you might want to know when referring to the recipe, and that is not captured in the default recipe data-schema. But these details are more optional: the common use cases noted at the start do not rely on them.
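Under those assumptions, the split might look something like this, with core citation fields in the recipe’s own data schema and the longer-tail context kept as optional, recipe-specific provenance (all names illustrative):

```python
# Illustrative split; neither structure is a defined schema.
recipe_data = {            # part 2 fields the recipe data-schema itself may carry
    "author": "Bo",
    "source_url": "<recipe page URL>",
    "published": "<date>",
}
recipe_context = {         # optional recipe-specific provenance, outside the data-schema
    "retrieved_on": "<date>",
    "archive_url": None,
    "applicable_s_steps": ["extraction", "merging"],
    "cite_template_extras": {},   # long-tail fields, often autogeneratable from the URL
}
```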