Fabian Steeg / @fsteeg
Adrian Pohl / @acka47
Pascal Christoph / @dr0ide
Linked Open Data, Hochschulbibliothekszentrum NRW (hbz)
Berlin, 2019-05-07
This presentation:
http://hbz.github.io/elag2019-bootcamp
Part I | Convert RDF data into usable JSON-LD | 10:00–11:30 |
Part II | Index and access the data with Elasticsearch | 11:30–13:00 |
Break | | 13:00–14:00 |
Part III | Use the data to build a web application | 14:00–15:30 |
Part IV | Use the data with existing tools: Kibana, OpenRefine | 15:30–17:00 |
Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen, est. 1973
= Academic library center of North Rhine-Westphalia
Software services for libraries in NRW and beyond
E.g. union catalog, discovery portal DigiBib, ILL, digitization & digital preservation, consortial acquisition
lobid: hbz's LOD service, providing bibliographic data, authorities & organizations
Started 2010 with open data publication, triple store & Perl transformation script
Original team: Adrian, Pascal, Felix. Fabian joined in 2012
2013: First API release based on JSON-LD & Elasticsearch
2017–2018: LOUD with lobid API 2.0
RDF?
Bulk data processing?
Command line?
Software development?
Term coined by Robert Sanderson in 2016
(see also his Linked Pasts Keynote)
Core idea: to make your data useful, you have to know and cater to your audience
Primary audience of LOD: people developing (1st and 2nd party) or using (3rd party) software to access the data
LOUD: oriented towards the needs and conventions of software development
Source: Rob Sanderson, "Shout it Out: LOUD", CC-BY (video)
(Our background & this workshop: we start with RDF data)
Usable data & APIs | Application Programming Interfaces: why, how | 10:15–10:30 |
JSON APIs | Using web APIs, processing JSON | 10:30–10:45 |
From RDF to JSON | JSON-LD processing with jsonld.js via jsonld-cli | 10:45–11:30 |
Part Ia.
Usable data & APIs
APIs: why, how
Usable data [& APIs]
Build new software with the data
Use existing software with the data
Our collections and services are delivered primarily via software. [...] The choices we make in the development, selection, and implementation of this software [...] define the limits of our content and services. We can only be as good as our software.
— Cody Hanson, Libraries are Software
[Usable data &] APIs
API: why?
★ Right abstraction for the audience
Data is used with software
Software requires APIs
API: why?
API: how?
3. Don't break the web
API: how?
{
"linkedTo" : {
"id" : "http://lobid.org/organisations/DE-601#!",
"label" : "Verbundzentrale des GBV (VZG)"
},
"rs" : "110000000000",
"address" : {
"addressLocality" : "Berlin",
"type" : "PostalAddress",
"addressCountry" : "DE",
"postalCode" : "10772"
}
...
Part Ib.
JSON APIs
Using the lobid-organisations API, processing JSON
($ git pull origin master) ('$' means you should type this in your terminal application)
Command line tool for transferring data with URLs
https://curl.haxx.se/download.html
$ curl --help
cURL |
Paste |
History |
(To copy & paste from/to GUI applications like your browser, use (left!) Ctrl+C and Ctrl+V. To copy & paste in the terminal, use Shift+Ctrl+C and Shift+Ctrl+V.)
$ curl https://lobid.org/organisations/DE-1a
{
"linkedTo" : {
"id" : "http://lobid.org/organisations/DE-601#!",
"label" : "Verbundzentrale des GBV (VZG)"
},
"rs" : "110000000000",
"address" : {
"addressLocality" : "Berlin",
"type" : "PostalAddress",
"addressCountry" : "DE",
"postalCode" : "10772"
},
...
The output is quite long; we often want just specific values
var request = require('request'); // Node.js HTTP client: npm install request

var options = {
    url: 'https://lobid.org/organisations/DE-1a'
};
request(options, function (error, response, body) {
    var doc = JSON.parse(body); // parse the response body as JSON
    console.log('postal code:', doc.address.postalCode) // <--
});
> postal code: 10772
A lightweight and flexible command-line JSON processor
https://stedolan.github.io/jq/
$ jq --help
$ curl "https://lobid.org/organisations/DE-1a" \
| jq .name # filter: .name
Funder |
Modified |
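For example (a sketch: the field names fundertype and dateModified are guesses, inspect the full JSON output to find the actual ones):
$ curl "https://lobid.org/organisations/DE-1a" \
| jq .fundertype.label # funder (field name assumed)
$ curl "https://lobid.org/organisations/DE-1a" \
| jq .dateModified # modified (field name assumed)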
Part Ic.
From RDF to JSON
JSON-LD processing with jsonld.js via jsonld-cli
"designed to be usable directly as JSON, with no knowledge of RDF" — it's real JSON!
"also designed to be usable as RDF"
5. Design for JSON-LD, consistently
JSON-LD command line interface tool
https://github.com/hbz/jsonld-cli
Our fork of https://github.com/digitalbazaar/jsonld-cli, added import from N-Quads
$ jsonld --help
Location |
Import |
Write |
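A sketch of the import step (the file name loc.nq is a placeholder; check jsonld --help for the exact syntax):
$ jsonld import loc.nq > expanded.json # N-Quads in, expanded JSON-LD out
The expanded JSON-LD looks like this: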
{
"@id": "http://id.loc.gov/resources/works/c000101650",
"@type": [
"http://id.loc.gov/ontologies/bibframe/Work",
"http://id.loc.gov/ontologies/bibframe/Text"
],
"http://id.loc.gov/ontologies/bibframe/subject": [{
"@id": "http://id.loc.gov/resources/works/101650#Topic650-20"
}, ...]
}, {
"@id": "http://id.loc.gov/resources/works/101650#Topic650-20",
"http://www.w3.org/2000/01/rdf-schema#label": [{
"@value": "Climatic changes--Europe."
}],...
}
Identify field |
Access field |
`http://www.w3.org/2000/01/rdf-schema#label`.`@value`?
Can't work: we have just a flat array of objects, which label?
Problem: only technically JSON, but not usable
So JSON is great?
Unwieldy URIs as keys from RDF
Syntactic overhead from JSON
Bad readability & usability
Data processing |
API access |
If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
— Rob Pike, Notes on Programming in C
When it comes to APIs, developers are your users. The same principles of user-centred design apply to the development and publication of APIs (simplicity, obviousness, fit-for-purpose etc.).
API Design Guide, Digital Transformation Agency, Australia
Default RDF → JSON-LD output requires programming with loops & conditionals to access even a single nested field, like the subject label
Should be as simple as doc.address.postalCode
Not providing an API actually excludes non-developers by putting up barriers for usage with CLI or GUI tools
★ Few barriers to entry
Design for JSON-LD from RDF, consistently
Frame the way we look at our graph from the perspective of one entity type (to make it a tree, or a document)
Look at my data from a "Work" perspective, and embed other entities under their works
frame.json |
Frame |
Write |
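A minimal sketch (file names are placeholders, the -f flag is an assumption, see jsonld --help): a frame that matches all bibframe Works, so referenced entities get embedded under them:
$ echo '{"@type": "http://id.loc.gov/ontologies/bibframe/Work"}' > frame.json
$ jsonld frame -f frame.json expanded.json > framed.json
The framed output embeds the subject under its work: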
{
"@id": "http://id.loc.gov/resources/works/c000101650",
"http://id.loc.gov/ontologies/bibframe/subject": [{
"@id":"http://id.loc.gov/resources/works/101650#Topic650-20",
"@type": [
"http://id.loc.gov/ontologies/bibframe/Topic",
"http://www.loc.gov/mads/rdf/v1#ComplexSubject"
],
...
"http://www.w3.org/2000/01/rdf-schema#label": "Climatic changes--Europe."
}, {...}],
}
Identify field |
Access field |
`http://id.loc.gov/ontologies/bibframe/subject`
.`http://www.w3.org/2000/01/rdf-schema#label`?
Could work conceptually, but not very handy
2. As simple as possible?
Mapping of JSON keys to URIs: "@context":{"name":"http://schema.org/name"}
With this context, we can compact (replace URIs with short keys) or expand (replace short keys with URIs) our JSON-LD
http://schema.org/name ⇄ name
`http://id.loc.gov/ontologies/bibframe/subject`
.`http://www.w3.org/2000/01/rdf-schema#label`?
Here: context generated from ontologies with (much) manual post-processing
Alternatives: upfront design or generate from data
★ Few exceptions, many consistent patterns
e.g. the contribution field should always be an array, even if a particular record has only one contribution | spec
Other fields that occur only once per record can use the default non-array single values
To have all fields always be arrays: compacting option compactArrays:false
(Consistent field types also required for Elasticsearch indexing)
e.g. field 'contribution' should always be a JSON array
"contribution": {
"@id": "http://id.loc.gov/ontologies/bibframe/contribution",
"@container": "@set"
}
e.g. data typing with "type coercion"
"creationDate": {
"@id": "http://id.loc.gov/ontologies/bibframe/creationDate",
"@type": "http://www.w3.org/2001/XMLSchema#date"
}
(Works similarly for language tags)
Compact |
Write |
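A compaction sketch (file names are placeholders; -c is assumed to pass the context, see jsonld --help):
$ jsonld compact -c context.json framed.json > compact.json
The compacted result: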
{
"@id": "http://id.loc.gov/resources/works/c000101650",
"subject": [
{
"id": "http://id.loc.gov/resources/works/101650#Topic650-20",
"type": [
"Topic",
"ComplexSubject"
],
"label": "Climatic changes--Europe."
}
], ...
}
Identify field |
Full Title |
Subject |
like .address.postalCode? |
Access one |
Access all |
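Sketches for accessing these fields with jq (paths follow the sample above; if your output wraps the documents in a top-level @graph array, prefix the paths accordingly):
$ jq '.subject[0].label' compact.json # access one
$ jq '.subject[].label' compact.json # access all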
Un-mapped URIs in the data will still be URIs after compaction
Iterate: compact, check data, fix context, compact again
Check paths |
Find URIs |
Update context |
Compact |
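One way to spot them (a sketch): list all object keys that still look like URIs after compacting:
$ jq '[paths | .[] | strings | select(startswith("http"))] | unique' compact.json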
Identify fields |
Authors |
Subtitle |
To N-Quads |
Authors |
Subtitle |
So JSON is great
RDF-compatible linked data
Useful, practical JSON
Multiple steps to make usable JSON-LD
Serialize RDF as JSON-LD, frame, compact
1. Require use cases, with data
Add context to usable JSON
(Add @ids where needed)
Applications can use the 'name' field in their code, no matter what URI it's mapped to (which unfortunately may change)
(based on part of https://api.github.com/users/hbz)
{
"@id": "https://github.com/users/hbz",
"type": "Organization",
"name": "hbz",
"blog": "https://www.hbz-nrw.de",
"@context": {
"type": "@type",
"name": "https://schema.org/name",
"blog": { "@id": "https://schema.org/url", "@type": "@id" },
"Organization": "https://schema.org/Organization"
}
}
You can test this in the JSON-LD Playground
JSON as RDF
{
"@context": "...",
"type": "Organization",
"name": "hbz",
"blog": "https://www.hbz-nrw.de",
}
<https://github.com/users/hbz> \
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> \
<https://schema.org/Organization> .
<https://github.com/users/hbz> <https://schema.org/name> "hbz" .
<https://github.com/users/hbz> <https://schema.org/url> \
<https://www.hbz-nrw.de> .
Output of N-Quads tab in the JSON-LD Playground
Part I: Convert RDF data into usable JSON-LD
Part II: Index and access the data with Elasticsearch
Elasticsearch bulk indexing | Create bulk format with jq, index with curl | 11:30–12:15 |
Access the Elasticsearch index | Search queries, direct access by ID | 12:15–13:00 |
Part IIa.
Elasticsearch bulk indexing
Create bulk format with jq, index with curl
"A distributed, RESTful search and analytics engine"
Based on Lucene, like Solr
Stores JSON documents, no explicit schema required
Input file |
Import |
Frame |
Compact |
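Recap sketch of the Part I pipeline for the new input file (names and flags assumed, see jsonld --help):
$ jsonld import loc.nq > expanded.json
$ jsonld frame -f frame.json expanded.json > framed.json
$ jsonld compact -c context.json framed.json > compact.json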
{"index":{"_index":"loc","_type":"work","_id":"c000101650"}}
{@context: ...,"id":"http://...","type":["Work","Text"],...}
bulk.ndjson
{
"@context": { ... },
"@graph": [
{ "id": "http://.../1", "type": ["Work", "Text"], ... },
{ "id": "http://.../2", "type": ["Work", "Text"], ... },
...
]
}
We want to process each element of the @graph array
All |
All, 1 line |
Each doc, 1 line |
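For example (a sketch; jq's -c prints each result as compact, single-line JSON):
$ jq -c '.["@graph"][]' compact.json # each doc, 1 line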
Second line for each pair in bulk.ndjson: 1 doc in 1 line
Like previous |
Construct JSON |
Insert values |
Full metadata |
First line for each pair in bulk.ndjson: index metadata
Insert newline with -r |
Write bulk.ndjson |
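Putting it together (a sketch; -r prints the raw string, so the "\n" from join becomes a real newline, and _id is derived from the last segment of the document URI):
$ jq -r '.["@graph"][]
| [{index: {_index: "loc", _type: "work", _id: (.id | split("/")[-1])}}, .]
| map(tojson) | join("\n")' compact.json > bulk.ndjson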
Send the bulk.ndjson file to the Elasticsearch bulk API
HTTP POST to /_bulk
Setup
"You Know, for Search"
$ curl localhost:9200
Check data |
Index data |
Check index |
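For example (standard Elasticsearch APIs):
$ curl "localhost:9200/_cat/indices?v" # check which indices exist
$ curl -H "Content-Type: application/x-ndjson" -XPOST "localhost:9200/_bulk" \
--data-binary @bulk.ndjson # index the data
$ curl "localhost:9200/loc/_count" # check: should report our document count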
Part IIb.
Access the Elasticsearch index
Search queries, direct access by ID
Search query as part of the URI (very versatile: full text, fields, boolean operators, wildcards, etc.)
(Won't cover here: body search, even more advanced)
HTTP GET .../<index>/<type>/_search?q=<query-string>
Full text |
Field search |
Boolean |
Wildcards |
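Examples (sketches; the field names are guesses based on our compacted data):
$ curl "localhost:9200/loc/work/_search?q=europe" # full text
$ curl "localhost:9200/loc/work/_search?q=subject.label:climate" # field search
$ curl "localhost:9200/loc/work/_search?q=climate%20AND%20europe" # boolean
$ curl "localhost:9200/loc/work/_search?q=euro*" # wildcards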
Direct doc access: /<index>/<type>/<id>
http://example.com/loc/work/c000000007
But our index IDs are URIs
http://localhost:9200/loc/work/http%3A%2F%2Fid.loc.gov%2Fresources%2Fworks%2Fc000000007
jq: split |
Recreate bulk.ndjson |
Delete index |
Reindex |
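A sketch of the round trip (reusing the jq recipe from Part IIa, now with the short _id):
$ curl -XDELETE "localhost:9200/loc" # delete index
$ curl -H "Content-Type: application/x-ndjson" -XPOST "localhost:9200/_bulk" \
--data-binary @bulk.ndjson # reindex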
GET by ID |
Source only, formatted |
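For example:
$ curl "localhost:9200/loc/work/c000000007" # GET by ID
$ curl "localhost:9200/loc/work/c000000007/_source?pretty" # source only, formatted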
Direct access by ID: /<index>/<type>/<id>
Local |
Remote |
No context: it was included only once, at the top level, alongside @graph
{
"@context": { ... },
"@graph": [
{ "id": "http://.../1", "type": ["Work", "Text"], ... },
{ "id": "http://.../2", "type": ["Work", "Text"], ... },
...
]
}
We processed & indexed each element of the @graph array
We need to include the context for each document
Not embedded, but remote: add a context URL
(The server needs to serve the context as application/ld+json with a CORS header; we will cover CORS in Part III)
Run server (new terminal) |
Access context (old terminal) |
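To verify (a sketch; port and file name depend on your local server setup):
$ curl -I "http://localhost:8000/context.json"
Expect Content-Type: application/ld+json and Access-Control-Allow-Origin: * in the response headers.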
jq: add JSON |
Add context URL |
Write bulk.ndjson |
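A sketch with jq (URL and file names are placeholders; apply this to the document lines before building the bulk pairs):
$ jq -c '.["@context"] = "http://localhost:8000/context.json"' \
docs.ndjson > docs-with-context.ndjson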
Delete index |
Reindex |
To N-Quads |
API: How — JSON over HTTP
Boosting, Analyzers, Aggregations, etc.
Part II: Index and access the data with Elasticsearch
Part III: Use the data to build a web application
Search page | HTML page for index search | 14:00–14:30 |
Sample searches | Use the page to query the data | 14:30–15:00 |
Hypothes.is | Embed Hypothes.is in HTML, add annotations to document the data | 15:00–15:30 |
Part IIIa.
Search page
HTML page for index search
Edit new file |
Paste content |
Open in browser |
Provides an API that simplifies interaction with the browser DOM (Document Object Model)
Dynamic content, user interaction, web service calls
<head>
<!-- previous head content -->
<script src="https://code.jquery.com/jquery-3.3.1.min.js">
</script>
</head>
<body>
<input id='search' type='text'/>
<script>
$('#search').keyup(function() {
var q = $('#search').val();
console.log(q);
});
</script>
<div id='result'></div>
</body>
(Open your browser's developer tools with Ctrl+Shift+I to see the console.log output)
"Asynchronous JavaScript And XML"
Asynchronous JavaScript And XML JSON
Call web services, do something when we get the result
Part of the browser: XMLHttpRequest
Also available in jQuery: $.ajax(options)
$('#search').keyup(function() { // <-- insert after this
var q = encodeURIComponent($('#search').val());
$.ajax({
url: `http://localhost:9200/loc/work/_search?q=${q}`,
success: function(result) {
console.log(JSON.stringify(result));
},
error: function(jqXHR, textStatus, errorThrown) {
console.log(textStatus, ': ', errorThrown);
}
});
});
(Open your browser's developer tools with Ctrl+Shift+I to see the console.log output)
Error: not allowed by Access-Control-Allow-Origin
Client (our HTML+JS) is in a different location than the web service API we are calling (Elasticsearch server)
API server needs to set header: Access-Control-Allow-Origin
Specify which domain can make requests, or set it to * to allow access from anywhere
Edit config file |
Add content |
Restart Elasticsearch |
Check log |
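A sketch (the two settings are standard Elasticsearch options; file paths and service management differ by installation):
$ sudo tee -a /etc/elasticsearch/elasticsearch.yml <<'EOF'
http.cors.enabled: true
http.cors.allow-origin: "*"
EOF
$ sudo systemctl restart elasticsearch
$ tail -f /var/log/elasticsearch/elasticsearch.log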
Stop the tail command with Ctrl+C after the restart is complete ("started")
Retry the search with the same code as above; with CORS enabled, the request now succeeds:
{
"took": 1,
"timed_out": false,
"hits": {
"total": 100,
"max_score": 1,
"hits": [
{
"_index": "loc",
"_type": "work",
"_id": "c000000007",
"_score": 1,
"_source": {
"@context": "context.json",
"id": "http://id.loc.gov/resources/works/c000000007",
success: function(result) {
console.log(JSON.stringify(result)); // add after this
$("#result").html('<p/>');
var hits = result.hits.hits;
for (let hit of hits) {
var label = hit._source.label;
$("#result").append('<hr/>' + label);
}
}
Append the label of each hit to the result div in our HTML
Total hits |
Links |
Show total hits, link from each hit to its JSON
Part IIIb.
Sample searches
Use the page to query the data
Which field? How is it nested?
Samples very useful as documentation
Embed a live sample from the index in the search UI
★ Documentation with working examples
Wrap body content in divs |
Define CSS styles in head |
<script>
$.ajax({
url: 'http://localhost:9200/loc/work/c000000026/_source',
success: function(result) {
console.log(JSON.stringify(result));
$("#sample").html(JSON.stringify(result, null, 2));
},
error: function(jqXHR, textStatus, errorThrown) {
console.log(textStatus, ': ', errorThrown);
}
});
// previous script content here
</script>
(Open your browser's developer tools with Ctrl+Shift+I to see the console.log output)
★ Comprehensible by introspection
Look at sample document
Full query access to every field you see
(Look up details for meaning of keys in the context)
Don't pre-define fixed params for authors, subjects, etc.
Instead define a general syntax to query any field
New fields added to the data can immediately be queried
4. Define success, not failure
Field search |
Boolean |
Wildcards |
Ranges |
Exists |
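Field search, boolean and wildcards work as in Part IIb; sketches for the two new query types (field names are guesses, creationDate as in our context example):
$ curl "localhost:9200/loc/work/_search?q=creationDate:%5B2000-01-01%20TO%202019-01-01%5D" # ranges
$ curl "localhost:9200/loc/work/_search?q=_exists_:subject" # exists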
Part IIIc.
Hypothes.is
Embed Hypothes.is in HTML, add annotations to document the data
"Annotate the web, with anyone, anywhere"
"Use Hypothesis right now to hold discussions, read socially, organize your research, and take personal notes"
or: to create sample-based, collaborative documentation
Based on browser plugin, proxy server, or embedded JS
<head>
<!-- previous head content -->
<script type="application/json" class="js-hypothesis-config">
{ "openSidebar": true }
</script>
<script src="https://hypothes.is/embed.js" async></script>
<link rel="canonical"
href="https://hbz.github.io/elag2019-bootcamp/search.html"/>
</head>
(Only the embed.js script is required; config and canonical link are optional)
Based on your experience in the query exercise, pick a field and document it like in one of the sample annotations
Markdown syntax is supported for links etc.
Use the `elag2019-bootcamp` tag to make it searchable
Part III: Use the data to build a web application
Part IV: Use the data with existing tools
Kibana | Create visualizations for local index data; look at full data visualizations | 15:30–16:15 |
OpenRefine | Use index data in OpenRefine; brief intro to the reconciliation API | 16:15–17:00 |
Part IVa.
Kibana
Create visualizations for local index data; look at full data visualizations
"Explore, Visualize, Discover Data"
Open source data visualization plugin for Elasticsearch
Running locally at http://localhost:5601/
Create a bar chart visualization for the 'contribution.agent.type' field
(Kibana will display an additional 'keyword' suffix)
Create a pie chart visualization for the 'adminMetadata.source.label' field
(Kibana will display an additional 'keyword' suffix)
Create a tag cloud visualization for the 'subject.componentList.authoritativeLabel' field
(Kibana will display an additional 'keyword' suffix)
Other types of aggregations
Multiple aggregations in a visualization
Other visualization types
Part IVb.
OpenRefine
Use index data in OpenRefine; brief intro to the reconciliation API
"A powerful tool for working with messy data"
"cleaning it; transforming it from one format into another; and extending it with web services and external data"
Spreadsheet-like user interface, in the browser, locally
Run with ./refine in your OpenRefine location, then open http://localhost:3333/
http://id.loc.gov/resources/works/c000000011
http://id.loc.gov/resources/works/c000000020
http://id.loc.gov/resources/works/c000000026
http://id.loc.gov/resources/works/c000000028
http://id.loc.gov/resources/works/c000000100
Copy the lines above to the clipboard
"http://localhost:9200/loc/work/" + split(value, '/')[-1] + "/_source"
(Copy this to the clipboard)
contribution[0].agent.label
Letters from an American farmer
Italian journeys
Report of the trial of George Ryan
The devil upon crutches in England
Educational interests of Colorado
"http://localhost:9200/loc/work/_search?q=\"" + replace(value, " ", "+") + "\""
value.parseJson().hits.hits[0]._source.id
All of the above: nothing specific built for OpenRefine, just JSON over HTTP, and lots to explore in OpenRefine
Implement the OpenRefine Reconciliation API (serve some specific JSON) for better integration in the OpenRefine UI
Select best matches using scores and small inline previews, filter by type, suggest fields for data extension, etc.
blog.lobid.org/2018/08/27/openrefine.html
Part I | Convert RDF data into usable JSON-LD |
Part II | Index & query the data with Elasticsearch |
Part III | Use the data to build a web application |
Part IV | Use it with existing tools: Kibana, OpenRefine |
https://github.com/hbz/elag2019-bootcamp/