The price of data quality

29th Mar 2022 trusted price data quality content

Google recently published a full-page advert in one of the major Italian newspapers with the title "Find news you can trust. With Google". (The example is in Italian, but the campaign most likely ran in several media and in several countries.) The main text reads: "We work with a great number of editors to help you find trusted stories from a variety of trusted sources".

The sample search, as you can see in the picture, is for the terms "covid vaccine" (one of the most searched terms of the last two years).

Google sample search

In this post we will touch on a few topics related to this ad:

why such a campaign, and what does it mean?
what trusted sources are and how to select them
why Google cannot afford to have (only) content from trusted sources
what it has to do with myHealthbox
the myHealthbox approach to trusted content
a hybrid licensing model for access to quality information.

Searching for trusted content

First of all, let's try searching for "covid vaccine" to see what type of content Google shows in the search results.

Google quality score

The first thing we notice is that no ads are shown. We see images of vaccination trends, graphs, lists of vaccination hubs, maps of where to get a vaccine, a list of vaccines available from different producers and some health-related information on vaccines such as side effects and ingredients, but no ads. These are followed by sites related to the query terms, top stories and more results, yet there are still no paid ads on the first page (which is very unusual for Google). The first (and only) paid ad is on page 2 and is from the WHO (World Health Organization, hardly a commercial site).

Looking at the list, it seems that the search results have been extensively curated and probably manually refined (hospitals, vaccine hubs and producers make up most of the results), with only "quality" sources allowed to appear in the results list, as the ad said.

The results seem to confirm that all the "editors/sources" in the results list belong to a "trusted" group, and that only results from that group are allowed. What remains unclear is how trusted sources are defined and who defines and clears their status: these criteria are unknown. For example, I would agree that the FDA is a trusted source but would be less confident about Walmart, which also appears in the list.

In my view, what happened is that Google took the list of results that the search engine displayed automatically, and a review group then went through it manually to determine, based on some criteria, which results were to be considered "trusted" and which should be kept out. Only a few of the results can have been reviewed, as working through the whole list would be impossible: searching for "covid shot" shows "About 4,380,000,000 results". That is over 4 billion results; let that sink in for a while. The review was probably done at the domain level, but a page-level approach is also possible (though considerably more expensive).

Google has been running this process for some time and has extensive criteria and a how-to manual for the employees, or "raters" as they are called, who carry out "domain reputation reviews". Such a review translates into more or less weight being assigned to domains and therefore, implicitly, to search results coming from those domains. This process complements Google's algorithms, which cannot properly determine the actual "quality" of some pages or how much "trust" we can assign to a source.
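As a rough illustration of how a domain-level review could translate into more or less weight on search results, the sketch below multiplies each result's relevance score by a per-domain reputation weight. The function name, the weights and the example URLs are all hypothetical; this is not a description of how Google actually implements it.

```python
from urllib.parse import urlparse

# Hypothetical reputation weights assigned by human reviewers (1.0 = neutral).
DOMAIN_WEIGHTS = {
    "who.int": 1.5,
    "fda.gov": 1.5,
    "example-blog.com": 0.3,
}

def rerank(results, weights=DOMAIN_WEIGHTS, default_weight=1.0):
    """Sort (url, relevance) pairs by relevance multiplied by domain weight."""
    def weighted(item):
        url, relevance = item
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # treat www.who.int and who.int as the same domain
        return relevance * weights.get(host, default_weight)
    return sorted(results, key=weighted, reverse=True)

results = [
    ("https://example-blog.com/miracle-cure", 0.9),
    ("https://www.who.int/covid-vaccines", 0.7),
    ("https://unknown.example.org/page", 0.8),
]
ranked = rerank(results)
for url, relevance in ranked:
    print(url, relevance)
```

In this sketch, domains that have not been reviewed keep a neutral weight, so they are neither promoted nor excluded.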

Although Google does a fairly good job of selecting authoritative sources, most of which are government agencies, this process obviously raises one important point: a commercial company sets the rules about what can and cannot be trusted, and this implicitly determines what kind of information we get access to.

In the absence of specific rules, laws or processes from national legislators, it is left to the initiative of individual companies to determine the rules (which criteria need to be satisfied to be considered a trusted source) and to apply them.

What is Google E-A-T

Quality content for all

Why did Google (and also Facebook, Twitter and other social aggregators, as they share similar problems) feel the urge to reassure users that they would be able to find trusted information through its search engine?

Simply because they are aware that their search algorithms are not able to distinguish very well between "trusted" and "untrusted" sources. This process cannot be completely automated and for this reason cannot be implemented at content delivery time. As a consequence, without specific preventive actions, the risk of delivering "untrusted" (garbage) content to end users is fairly high, especially for high-traffic keywords.

There are primarily two solutions to this problem:

only take content from reputable sources (this is what Google is doing for selected search terms) thus minimizing the risk that "bad" content may creep through to end users
curate content, meaning automatically or manually remove "bad" content after it has been published or accessed (this is what Facebook is doing with their AI algorithms and reviewers).

While valid, these approaches also suffer from a number of drawbacks:

they do not scale well: more content means more reviewers, which means escalating labour costs and time
complexity increases by orders of magnitude with different content, search terms, languages and so on, and it becomes more and more difficult to match the correct set of search terms with sources. For example, searching in Italian for "covid puntura" (covid shot) shows in first position a very untrusted site with a fairly bad reputation in terms of SEO practices and old, unreliable content. This shows how difficult it can be to provide comprehensive and extensive coverage
they can cut significantly into ad-driven revenues. Which company is likely to bid for search terms when the results are predetermined? If I am a "trusted" source I will appear anyway at no cost, and if I am not "trusted" I will not rank (or will rank very low) no matter how high I bid for those keywords or how much SEO effort I put in
trust is domain specific: a source that is authoritative in one subject area is not necessarily so in another
trust may need to be limited to a page or section. For example, a reputable news site may host a blog that allows user input and may be open to unverified content.
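The last drawback, trust that applies only to part of a site, can be sketched as a lookup in which the most specific rule wins. The rule table, the host names and the function name below are hypothetical, and treating unknown hosts as untrusted is an assumption of this sketch.

```python
from urllib.parse import urlparse

# Hypothetical rules: (host, path prefix) -> trusted or not.
TRUST_RULES = {
    ("news.example.com", "/"): True,         # the site as a whole is trusted
    ("news.example.com", "/blogs/"): False,  # except its user-contributed blogs
}

def is_trusted(url, rules=TRUST_RULES):
    """Return the verdict of the longest matching path prefix for the URL's host."""
    parts = urlparse(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    best_len, verdict = -1, False  # assumption: unknown hosts are untrusted
    for (rule_host, prefix), trusted in rules.items():
        if host == rule_host and path.startswith(prefix) and len(prefix) > best_len:
            best_len, verdict = len(prefix), trusted
    return verdict

print(is_trusted("https://news.example.com/world/story"))
print(is_trusted("https://news.example.com/blogs/reader-post"))
```

Every extra rule of this kind has to be written and maintained by someone, which is one reason page-level or section-level trust is more expensive than domain-level trust.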

It seems obvious that the problems mentioned above do not have an easy solution. It is equally clear that "generic" search engines have a very hard time not just controlling what content they provide access to, but also defining what a trusted source is and under which conditions such a source can be trusted.

There are also areas where trust is more important than in others: health, for example, compared with travel or gaming.

If providing content from trusted sources is so great, why don't they all follow this route?

Can Google (and other generic content platforms) afford to have (only) trusted sources?

The short and somewhat harsh answer is: not at all!

At least not with their current (ad-based) business model: going down the trusted route would mean higher costs, lower revenues and an uncertain outcome.

Domain-specific platforms

This leaves the road open for domain-specific search engines (like myHealthbox.eu) and content platforms that are highly content specific: by limiting the content domain, better procedures and algorithms can be implemented from the start, resulting in high-quality content and less "garbage".

myHealthbox implements a number of processes to guarantee content quality:

allowing content only from official, trusted sources (i.e. Health Ministry, Medicine Agency, Manufacturer)
implementing content verification procedures during the content ingestion and indexing phases
implementing a manual content review process when receiving notifications about possible content errors from users.
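The first two processes in this list could look roughly like the following at ingestion time. This is a minimal sketch and not the actual myHealthbox pipeline: the `Document` fields, the source type labels and the individual checks are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed labels for the official source types named in the list above.
ALLOWED_SOURCE_TYPES = {"health_ministry", "medicines_agency", "manufacturer"}

@dataclass
class Document:
    title: str
    source_type: str
    body: str

def verify(doc):
    """Return a list of problems; an empty list means the document can be indexed."""
    problems = []
    if doc.source_type not in ALLOWED_SOURCE_TYPES:
        problems.append("source is not on the allowlist")
    if not doc.title.strip():
        problems.append("missing title")
    if not doc.body.strip():
        problems.append("empty body")
    return problems

def ingest(docs):
    """Split documents into those accepted for indexing and those rejected."""
    accepted, rejected = [], []
    for doc in docs:
        problems = verify(doc)
        if problems:
            rejected.append((doc, problems))
        else:
            accepted.append(doc)
    return accepted, rejected
```

Rejecting a document before it is indexed is what separates this approach from curating content after publication: nothing from an unlisted source ever reaches the user.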

A quality-first approach to content has consequences for the business models that can be implemented. This is due to the higher costs of implementing data quality processes and guaranteeing, as far as possible, quality information throughout.

We also need to add that a pay-for-quality approach is rarely accepted in a context where web users expect content for free or are happy to give up personal data in return for free access; trading personal data for access would be extremely dangerous in a healthcare context.

The myHealthbox licensing model

myHealthbox implemented a hybrid model that tries to strike a compromise between free access for occasional users, limited costs for recurring users looking for a reliable source of trusted information, and minimal requirements in terms of access to personal information (only an email address is required for registration).

This solution is based on a number of criteria, the most important of which is that information on the use of medicines should be free for occasional users: access to this information may affect your health, so it must be easy to get to and free to access (no registration required).

Beyond the free, occasional-use model (which basically limits the number of documents that can be viewed in a given month), three licenses are available that allow incremental access to more documents and data. Paid subscriptions also remove ads, providing a faster and better user experience. The three subscriptions available are:

trial (default for registered users)
lite
pro.
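A monthly document cap per license could be enforced with a check like the one below. The numbers are placeholders invented for this sketch, as is the idea of an uncapped top tier; the real limits are those published on the License page.

```python
# Placeholder limits per license, invented for illustration only.
MONTHLY_DOCUMENT_LIMITS = {
    "free": 5,     # occasional use, no registration
    "trial": 20,   # default for registered users
    "lite": 100,
    "pro": None,   # None means "no cap" in this sketch (an assumption)
}

def can_view(plan, viewed_this_month, limits=MONTHLY_DOCUMENT_LIMITS):
    """Return True if a user on `plan` may open one more document this month."""
    limit = limits[plan]
    return limit is None or viewed_this_month < limit

print(can_view("free", 4))
print(can_view("free", 5))
```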

So the question really turns out to be: how much are you prepared to pay to access reliable, quality information on medicines?

Full details of the options, limitations and costs for each license are available on the myHealthbox License page.

For any questions please contact our Customer Care at info@myhealthbox.eu
