Support Google Dataset Search on search.dataone.org via partial server side rendering
Potential paths forward to make DataONE Search compatible with Google's Dataset Search (these are not mutually exclusive):
1. Optimize the assets that make up MetacatUI and the asset loading strategies: https://github.com/NCEAS/metacatui/issues/224
2. Move the code (and any dependencies) that injects JSON-LD further up in the app boot so that Google sees it
3. Inject the appropriate JSON-LD on the server side to guarantee that Google sees it (originally Matt Jones' idea!)
(1) is being worked on for sure, and (2) may not be needed if (1) is successful. I want to talk about option (3) because:
- It's a quicker solution (I already have something working), which would help get us involved in the project faster
- It paves the way for future features and/or improvements to MetacatUI (we could be rendering more on the server side than just JSON-LD, such as other metadata, more page content, etc.)
What I did
To test this idea, I modified a previous project: a simple Node (Express.js) app that hosts MetacatUI by intercepting every request and serving the appropriate asset. It injects Schema.org JSON-LD, when appropriate, by querying the CN Solr index before sending MetacatUI's index.html to the client. Code is here and it's deployed here. View source on any /view/... page and you'll see a minimal Schema.org/Dataset description in the head. More properties can be added later. I did it quick and dirty: the app pre-loads MetacatUI's index.html as a String at app boot and injects the JSON-LD into it. No templating language or other magic.
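The injection step described above could be sketched roughly as follows. This is a minimal illustration, not the code from the linked app: the function names `buildDatasetJsonLd` and `injectJsonLd`, the shape of the Solr document (`id`, `title`, `abstract`), and the way the `/view/...` URL is built are all assumptions made for the example.

```javascript
// Hypothetical sketch: build a minimal Schema.org Dataset description and
// splice it into a pre-loaded index.html string. No templating involved.

// Assumed shape of a Solr result document: { id, title, abstract }.
function buildDatasetJsonLd(doc, baseUrl) {
  return {
    "@context": "http://schema.org",
    "@type": "Dataset",
    name: doc.title,
    description: doc.abstract,
    url: baseUrl + "/view/" + encodeURIComponent(doc.id),
  };
}

function injectJsonLd(indexHtml, jsonLd) {
  // Escape "<" so a value containing "</script>" cannot close the element early.
  const payload = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  const tag = '<script type="application/ld+json">' + payload + "</script>";
  const headClose = indexHtml.indexOf("</head>");
  // If there is no </head>, return the page untouched rather than fail.
  if (headClose === -1) return indexHtml;
  return indexHtml.slice(0, headClose) + tag + indexHtml.slice(headClose);
}
```

Keeping both functions free of I/O means they can be unit tested without a running Solr or web server; the request handler would only need to call them with the Solr result and the cached index.html string.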
Things to address
- How do we feel about switching from hosting MetacatUI via Apache (simple, bulletproof) to a Node-based deployment just to support this feature (new territory, at least for me)?
- If we do switch, we'd want to make really sure the Node app doesn't have weird failure cases where it doesn't return index.html (e.g., when Solr is down or slow). The app needs to return index.html (and every other static asset) on every request, and do it very fast. We should also decide what the cutoff is (how long to wait on Solr) so that it doesn't hold up app boot if Solr is slow or down.
- Can this type of deployment easily be integrated with CN buildouts? I've deployed Node apps before by fronting them with Apache/nginx (via reverse proxy) and keeping the Node process up with Upstart.
- Is this performant enough for DataONE? I think my implementation is non-blocking, but I'm not a Node expert, so we'd want to code review and probably benchmark.
- We could wait on (1) and stick with our current deployment strategy
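The cutoff concern raised in the list above (always return index.html, even when Solr is slow or down) could be handled by racing the Solr lookup against a timer and falling back to the unmodified page. The sketch below is illustrative only: `withTimeout`, `renderIndex`, and the `fetchJsonLd` callback are hypothetical names, and the actual cutoff value is still to be decided.

```javascript
// Hypothetical sketch: never let a slow or failing Solr lookup block index.html.

// Resolve with the promise's value, or with `fallback` after `ms` milliseconds
// or on rejection. The returned promise never rejects.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  const guarded = Promise.resolve(promise).catch(() => fallback);
  return Promise.race([guarded, timeout]).finally(() => clearTimeout(timer));
}

// `fetchJsonLd` is an assumed async function that queries Solr and returns a
// ready-to-insert <script> tag string, or null when there is nothing to inject.
async function renderIndex(indexHtml, fetchJsonLd, cutoffMs) {
  const tag = await withTimeout(fetchJsonLd(), cutoffMs, null);
  if (!tag) return indexHtml; // Solr slow, down, or no match: serve the plain page.
  // Function replacement avoids "$" patterns in `tag` being interpreted.
  return indexHtml.replace("</head>", () => tag + "</head>");
}
```

With this shape, the worst case for any request is roughly `cutoffMs` of added latency, and a Solr outage degrades to today's behavior (plain index.html with no JSON-LD) rather than an error page.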
Unrelated to the Google Dataset Search issue but related to Google's crawling for Google Search, we've also identified:
#1 Updated by Dave Vieglais over 4 years ago
Optimizing search UI for rendering performance is a good move, however the application is fundamentally orthogonal to LOD in that the server response to a client request is an application that must be executed to retrieve a resource rather than the requested resource.
Server-side processing is the alternative, and switching to nodejs is one path towards such a refactor. Other technologies (e.g. java or python) would work just as well, though this is something nodejs does quite well.
#2 Updated by Bryce Mecum over 4 years ago
Thanks Dave. Two thoughts:
- Re: "however the application is fundamentally orthogonal to LOD". If I understand you, I think I'd say that MetacatUI and JSON-LD are only orthogonal if you don't consider that Google only wants to see JSON-LD in HTTP response bodies from web applications and static sites and not in a standalone fashion. I don't think I fully understood your point, though.
#3 Updated by Dave Vieglais over 4 years ago
wrt #1, it's very simple: when requesting a resource with a URL, the resource should be returned, which in this case is reasonably expected to be the requested item. Instead, MetacatUI returns an application that must then be executed to retrieve the requested resource. While technically correct from an HTTP semantics point of view (since the URL does point to an application), it does make it far more complicated for a client to actually get and inspect the listed resource, especially when the goal is to support LOD.
#5 Updated by Bryce Mecum over 3 years ago
- % Done changed from 0 to 30
- Priority changed from Normal to Low
- Status changed from New to In Progress
Google staff have indicated we only need to send them a robots.txt that points to our sitemaps for them to begin crawling, so we're going to try that first (see https://redmine.dataone.org/issues/8817) and come back to this if necessary.