This is my idea for Node Knockout 2018. I mulled it over for about two weeks and am highly satisfied with what I could achieve during the hackathon. So much so that I bought the domain lobsang.network and continue developing it on GitHub.

But what is it about?

Executive Summary

Analyse websites for Search Engine Optimisation by leveraging the power of the web itself.

Distribute the effort over the web, make use of specialised services and coordinate via distributed protocols.

Technical Description

So what does that mean?

Looking at it from different angles:

  1. What does Search Engine Optimisation mean here?
  2. What effort is needed to do it?
  3. What specialised services exist?
  4. How do they coordinate their work?

Let’s dive in!

What does Search Engine Optimisation mean here?

To make sure we are on the same page, here is what I mean by these terms.

What is Search Engine Optimisation?

Here, Search Engine Optimisation (SEO, for short) describes all measures taken (optimisation) to improve the ranking of a website or individual page in the Search Engine Result Page (SERP).

In other words: What can you do to make it onto page one of Google?

Categories of Search Engine Optimisation

Broadly speaking, there are two categories in which you can optimise: On-Page and Off-Page.

On-Page describes all measures on your website: optimising images, changing wording, tidying the URL path structure and so on.

Off-Page describes all measures elsewhere. Think getting links from other sites (so-called backlink optimisation), improving your reputation on social media (generating buzz) etc.

I am focussing on On-Page optimisation for now. For Off-Page you would need more time to crawl the web, process it and discover trends.

What effort is needed to do it?

My idea is to bring ScreamingFrog SEO Spider to the web. If you look at their homepage or give their free edition a try, you can see that it basically crawls a site and collects various pieces of information (internal and external links, HTTP status codes, meta descriptions and much more). No rocket science. They put some charts on top and offer an export to CSV.

Thinking through the process, we can break it down like this:

  1. Crawl the web
  2. Share the link with other processes
  3. Pick up the link
  4. Do some analysis with it
  5. Share the results with other processes
  6. Pick up the results and aggregate them
  7. Present findings to the user

I can see five distinct areas here:

  1. Crawler
  2. Publisher
  3. Subscriber
  4. Processor
  5. Presenter

What specialised services exist?

Crawler

These basically take a URL and follow it. They collect further links and fetch their content as well. Really dumb. We should keep this part as lightweight as possible, so IoT devices can do it, too.
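To make this concrete, here is a minimal sketch in Node.js of what such a crawler could look like. It uses only built-in modules; the regex-based link extraction is a deliberate simplification (a real crawler would use a proper HTML parser), and the function names are just my working titles.

    // Minimal crawler sketch: fetch one page, extract its links, report both.
    // Only Node.js built-ins; HTTPS only and no redirect handling, to keep it tiny.
    const https = require('https');

    function fetchPage(url) {
      return new Promise((resolve, reject) => {
        https.get(url, (res) => {
          let body = '';
          res.on('data', (chunk) => { body += chunk; });
          res.on('end', () => resolve({ status: res.statusCode, body }));
        }).on('error', reject);
      });
    }

    function extractLinks(html, baseUrl) {
      // Naive href extraction; good enough for a sketch.
      const links = [];
      const pattern = /href="([^"]+)"/g;
      let match;
      while ((match = pattern.exec(html)) !== null) {
        try {
          links.push(new URL(match[1], baseUrl).href); // resolve relative URLs
        } catch (e) { /* ignore malformed URLs */ }
      }
      return links;
    }

    async function crawl(url) {
      const { status, body } = await fetchPage(url);
      return { url, status, links: extractLinks(body, url) };
    }

    // Example: crawl('https://example.com/').then(console.log);

That is the whole job: fetch, extract, hand the result over.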

Publisher

The crawler and processor need to communicate somehow. For my hackathon entry I focussed on matrix.org, but other protocols could be used as well, for example WebTorrent, IPFS or MQTT.

So if we add some metadata we can use any of them. Let’s go with this:

  1. JSON
  2. With a license (SPDX identifier)
  3. With a timestamp (in ISO 8601)
  4. With an ID to allow immutability
  5. With a link to the node it was derived from
  6. With an issuer ID
  7. With an identifier to describe the topic of the payload
  8. With some payload

This should be all we need. Then the transport protocol doesn't matter.
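To make the envelope concrete, here is a sketch in Node.js of how such a message could be built. The field names (id, license, timestamp, derivedFrom, issuer, topic, payload) are my working assumptions, not a finalised schema, and publish() in the comment is a placeholder for whatever the chosen transport offers.

    // Sketch of a message envelope carrying the eight pieces of metadata above.
    // Field names are working assumptions, not a finalised schema.
    const crypto = require('crypto');

    function createMessage({ issuer, topic, payload, derivedFrom = null }) {
      return {
        id: crypto.randomBytes(16).toString('hex'), // stand-in for a proper UUID
        license: 'CC0-1.0',                         // SPDX identifier
        timestamp: new Date().toISOString(),        // ISO 8601
        derivedFrom,                                // id of the message this one was derived from
        issuer,                                     // e.g. 'crawler-01'
        topic,                                      // e.g. 'page.fetched'
        payload
      };
    }

    // Example:
    // const msg = createMessage({
    //   issuer: 'crawler-01',
    //   topic: 'page.fetched',
    //   payload: { url: 'https://example.com/', status: 200 }
    // });
    // publish(JSON.stringify(msg)); // publish() is whatever matrix.org, MQTT, IPFS etc. provide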

Subscriber

Not every service is interested in every message. I envision something like topics in MQTT or PubSub here. So a service subscribes to a topic of messages and filters out everything else.

If an interesting message is received, it is passed along to some other service.
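Here is a minimal sketch of that filtering step, assuming the envelope format sketched above. The handler registry is purely illustrative and would be wired to the real transport (a matrix.org room, an MQTT topic, ...) in practice.

    // Sketch of a subscriber: only messages with a registered topic get through.
    const handlers = new Map();

    function subscribe(topic, handler) {
      if (!handlers.has(topic)) handlers.set(topic, []);
      handlers.get(topic).push(handler);
    }

    function onRawMessage(raw) {
      let message;
      try {
        message = JSON.parse(raw);
      } catch (e) {
        return; // not one of our messages, ignore it
      }
      const interested = handlers.get(message.topic) || [];
      interested.forEach((handler) => handler(message)); // pass along to the next service
    }

    // Example:
    // subscribe('page.fetched', (msg) => console.log('will analyse', msg.payload.url));
    // onRawMessage(rawStringFromTransport); // called by the transport binding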

Processor

These are highly specialised services. Say a Natural Language Processor. Or an image recogniser. Or a scanner for potential accessibility issues. Something like that. By now, you can find several of them in the cloud.
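As a sketch of how small such a processor can be, here is one that consumes fetched pages and emits a finding about a missing meta description. The topic names and the check itself are examples I picked for illustration, and it assumes the crawler put the raw HTML into payload.html.

    // Sketch of a processor: consume fetched pages, emit a finding as a new message.
    // Topic names and the meta-description check are illustrative only.
    const crypto = require('crypto');

    function checkMetaDescription(message, emit) {
      const html = message.payload.html || '';
      const hasMeta = /<meta\s+name=["']description["']/i.test(html);
      emit({
        id: crypto.randomBytes(16).toString('hex'),
        license: 'CC0-1.0',
        timestamp: new Date().toISOString(),
        derivedFrom: message.id,            // provenance: which message this finding came from
        issuer: 'processor-meta-description',
        topic: 'finding.meta-description',
        payload: { url: message.payload.url, hasMetaDescription: hasMeta }
      });
    }

    // Wired together with the subscriber sketch:
    // subscribe('page.fetched', (msg) => checkMetaDescription(msg, publish));
    // where publish() is whatever the chosen transport offers.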

Presenter

A presenter is actually built up of two parts: some kind of database to store interesting data and some means to visualise it.

They are dumb insofar as they should not have to process the data any further (except perhaps during the write phase of the database).
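Here is a rough sketch of the storage half, assuming findings arrive as messages like the ones above. An in-memory Map stands in for a real database, and the "visualisation" is just a console dump.

    // Sketch of a presenter: store findings per URL, render a crude report.
    // An in-memory Map stands in for a real database here.
    const findings = new Map();

    function storeFinding(message) {
      const url = message.payload.url;
      if (!findings.has(url)) findings.set(url, []);
      findings.get(url).push({ topic: message.topic, ...message.payload });
    }

    function renderReport() {
      for (const [url, items] of findings) {
        console.log(url);
        items.forEach((item) => console.log('  -', item.topic, JSON.stringify(item)));
      }
    }

    // Wired together with the subscriber sketch:
    // subscribe('finding.meta-description', storeFinding);
    // setInterval(renderReport, 60 * 1000); // refresh the "visualisation" once a minute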

How do they coordinate their work?

Currently, all of those steps have to be done by every search engine on its own, be it Google, Bing, Baidu, Yandex, DuckDuckGo or IxQuick.

Why not share the burden of crawling the web? Why not concentrate the processing part?

I would like to see the foundation for building a search engine open to all. How you weight the signals against each other can stay closed source if you insist. But then we would have a way to compare the input with the output. This could stir some interest in science. Or maybe somebody comes up with a new idea?

By agreeing on a protocol, we all speak the same language.

Challenges

Speaking of language: by relying on JSON, we are independent of the programming language used. I will start with Node.js and Python. I can imagine writing some Ruby and Lua code as well. Hopefully people will contribute Java, PHP and Perl code.

This way we can smooth out rough edges (I'm sure I missed something in the concept!).

For example, I don’t know whether it scales as I imagine. What about memory consumption? What about congestion in the database? Or in the message protocol? On the other hand, by relying on standards I can build upon the work of other engineers.

Conclusion

Although I didn’t win the hackathon (again), I am motivated enough to pursue this idea for a few years and see whether it gains traction.

Will you join me?