Build a platform to find and analyse content across traditional data silos to derive new value-driven insights
The technology stack and architecture met the SLAs required for the platform. ELSSIE saves a great deal of manual effort and time in fetching relevant data from research papers.
Elsevier is a world-leading provider of information solutions that enhance the performance of science, health, and technology professionals, empowering them to make better decisions and deliver better care. They want to make analysis easier for everyone, enabling them to manage their work more efficiently and spend more time making breakthroughs.
Elsevier provides products and services which help researchers, governments, universities, and healthcare professionals make discoveries and evaluate and improve their research strategies, and which provide insight to help physicians find the right clinical answers. Their goal is to expand the boundaries of knowledge for the benefit of humanity.
Elsevier publishes 430,000 peer-reviewed research articles annually.
A major segment of Elsevier's customers is drug companies across the world, and drug discovery is a complex process. The cost to develop one new drug is $2.6 billion, and the approval rate for drugs entering clinical development is less than 12%. The attrition rate for drug candidates (that is, the number of candidates you start with for each successful launch) can be on the order of 10,000:1.
Scientists rely on knowledge bases related to pharmacology, medicine, chemistry, and biology, as well as experimental data such as clinical trials, experimental publications, and tests performed on similar candidates. Some of these are purchased, while others are developed within the company over time. Scientists spend an enormous amount of their expensive time searching through these knowledge bases. Take, for example, a simple question: “What are the compounds that are similar in structure to benzene, have a boiling point of more than 40 degrees Fahrenheit, and have no side effects on people with lymphoma?” Answering it requires joining information from chemistry, medicine, and pharmacology. By “joining”, we mean understanding the question as a human would, bringing in information from different domains, and combining it to provide a definitive answer.
The customer envisioned a platform that can join knowledge from different domains and make it searchable, where the search engine behaves like a human: understanding the question, parsing it into a machine-readable query, crawling through the databases, and returning results along with a measure of how likely each result is to answer the question. That platform is ELSSIE, and that is what NashTech built for Elsevier.
ELSSIE is a platform that connects information from multiple sources, stored as a knowledge graph and maintained by Elsevier's Subject Matter Experts (SMEs). It enables users to find and analyse content across traditional data silos to derive new value-driven insights.
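To make the idea of a cross-domain join concrete, here is a toy sketch in Python. The triples, predicate names, compounds, and values below are invented purely for illustration; they are not ELSSIE's actual vocabulary or data. It shows how a question like the benzene example can be answered by joining facts from different domains once they share one graph representation:

```python
# Toy illustration (invented data): joining (subject, predicate, object)
# triples from different domains, the way ELSSIE joins chemistry,
# medicine, and pharmacology knowledge into one graph.
triples = [
    ("toluene", "similarInStructureTo", "benzene"),   # chemistry
    ("xylene",  "similarInStructureTo", "benzene"),
    ("toluene", "boilingPointF",        "231"),
    ("xylene",  "boilingPointF",        "282"),
    ("toluene", "sideEffectOn",         "lymphoma"),  # pharmacology
]

def objects(subject, predicate):
    """All objects for a given subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# "Compounds similar in structure to benzene, with a boiling point above
#  40 F, and no recorded side effects on people with lymphoma."
candidates = {s for s, p, o in triples
              if p == "similarInStructureTo" and o == "benzene"}
answer = [c for c in candidates
          if any(float(bp) > 40 for bp in objects(c, "boilingPointF"))
          and "lymphoma" not in objects(c, "sideEffectOn")]
print(answer)  # ['xylene']
```

Each filter in the list comprehension corresponds to one domain's knowledge; the join happens simply because all domains share the same subjects in the graph.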
The ultimate goal of ELSSIE is to put complex information at scientists' fingertips so that they can carry out drug discovery at a rapid pace.
To accomplish this, the solution needed to ingest multiple structured and unstructured content sources; store them as queryable structured data; semantically understand the content and generate relationships by recognising entities and concepts; interpret the stored data and offer graph-query capabilities; provide an API to integrate with external applications; and, finally, make it easy for scientists to search for information.
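The "recognise entities and generate relationships" step can be sketched in miniature. In the sketch below, a hand-written dictionary lookup stands in for the real named-entity recognition pipeline (the entity names, types, and predicate names are invented for illustration):

```python
# Toy sketch: turning unstructured text into (subject, predicate, object)
# triples by recognising known entities. A tiny gazetteer stands in for
# the full NLP named-entity recognition used in the real pipeline.
import re

ENTITY_TYPES = {  # invented mini-gazetteer
    "oxygen":   "ChemicalElement",
    "benzene":  "ChemicalCompound",
    "lymphoma": "Disease",
}

def extract_triples(doc_id, text):
    """Emit triples linking a document to the entities it mentions."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token in ENTITY_TYPES:
            found.append((doc_id, "mentions", token))
            found.append((token, "isA", ENTITY_TYPES[token]))
    return found

sentence = "Oxygen toxicity was observed alongside benzene exposure."
for t in extract_triples("doc42", sentence):
    print(t)
```

The output triples tag "Oxygen" as a chemical element, exactly the kind of structured knowledge the ingestion step must produce from free text before it can be queried.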
ELSSIE as a final solution included the following components:
- The Ingestion Layer provides the ability to ingest structured sources such as DBpedia and unstructured sources such as scientific publications. The hardest part of this layer is constructing structured knowledge from unstructured data using NLP. For example, a scientific journal article might refer to “Oxygen”, which ELSSIE should recognise as a chemical element and tag appropriately. This capability is built by integrating Apache Spark with the Stanford NLP libraries.
- The Data Lake Layer stores the structured knowledge generated by the ingestion pipelines in a central repository built on Apache Cassandra. ELSSIE's knowledge consists of a large number of “triples” that together make up a large graph. These triples are staged in the data lake and loaded into an in-memory database (GridGain) so that performance stays within stringent boundaries. Entitlements, a sublayer within the data lake, controls which parts of the knowledge are accessible to whom. This access metadata is itself stored as triples, so that the query engines can interpret it and serve information accordingly.
- The Query Layer provides a way to ask graph questions (SPARQL queries) and retrieve results. A great deal of innovation and research went into parsing SPARQL queries and answering them from a key-value store. NashTech built the parser that converts graph queries into equivalent key-value retrievals from the in-memory database, leveraging and extending a concept from a paper published by IBM. NashTech proved the performance using LUBM (Lehigh University Benchmark) queries.
- The Search Layer provides the API to perform searches on the data lake. It connects Google-like free-form search with the definitive query capability of the Query Layer, enriching and enhancing the use of the product. Search is fed with “clusters” or “topics” of knowledge generated by ML pipelines, which makes the search facets much more meaningful. This allows scientists to search for “sugar” and see the results in the context of “diabetes”, “cell energy”, or “recreational drinks”.
- The Machine Learning (ML) Layer provides a way to curate content, verify human-generated output, measure the algorithms' accuracy, experiment with new models, and test and fix issues. ML is the primary driver for two purposes. First, ingesting the content generated by various sources: ELSSIE's sources are diverse, from nicely structured content like DBpedia all the way to scanned PDF documents. Second, making search more intelligent: extensive NLP is deployed to understand the incoming and ever-growing content. The pipelines implement several clustering (Latent Dirichlet Allocation) and classification (multi-class classification) algorithms. The ML and Ingestion Layers are closely tied together.
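The source does not name the IBM paper behind the Query Layer, so the following is only a generic sketch of the underlying idea: write every triple under three key orderings (SPO, POS, OSP) so that any graph pattern with at least one bound position becomes a direct key-value lookup. The index names and data here are illustrative, not ELSSIE's actual schema:

```python
# Sketch of graph-pattern matching over a key-value store: each triple is
# written under three keys (SPO, POS, OSP), so a pattern is answered by
# choosing the index whose prefix matches the pattern's bound positions.
from collections import defaultdict

kv = defaultdict(list)  # (index, first, second) -> completing values

def put(s, p, o):
    kv[("spo", s, p)].append(o)
    kv[("pos", p, o)].append(s)
    kv[("osp", o, s)].append(p)

def match(s, p, o):
    """Match a triple pattern; None marks an unbound variable."""
    if s and p:   # subject and predicate bound: use the SPO index
        return [(s, p, x) for x in kv[("spo", s, p)] if o in (None, x)]
    if p and o:   # predicate and object bound: use the POS index
        return [(x, p, o) for x in kv[("pos", p, o)]]
    if o and s:   # object and subject bound: use the OSP index
        return [(s, x, o) for x in kv[("osp", o, s)]]
    raise NotImplementedError("fully unbound patterns need a full scan")

put("toluene", "similarTo", "benzene")
put("xylene", "similarTo", "benzene")
print(match(None, "similarTo", "benzene"))
# [('toluene', 'similarTo', 'benzene'), ('xylene', 'similarTo', 'benzene')]
```

A SPARQL basic graph pattern is a conjunction of such triple patterns, so a parser can compile each pattern into one of these prefix lookups and join the results, which is what makes a key-value store viable as a SPARQL backend.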
In summary, the ELSSIE project used Apache Spark, Apache Hadoop, Apache Cassandra, Apache Kafka, Apache Solr, and GridGain, all built on AWS. It achieved several innovations, such as dynamically scaled Apache Spark and Hadoop clusters, extending QUERTZL using ANTLR parsers, and using LDA together with NLP to find entities in text and their contextual meaning rather than their literal meaning.
The technology stack and architecture met the SLAs required for the platform. ELSSIE saved a great deal of manual effort and time in fetching relevant data from research papers.