User:Aandre156/sandbox

The Vadalog system is a Knowledge Graph Management System (KGMS) that offers a language for performing complex logic reasoning tasks over knowledge graphs. At the same time, Vadalog delivers a platform to support the entire spectrum of data science tasks: data integration, pre-processing, statistical analysis, machine learning, algorithmic modeling, probabilistic reasoning and temporal reasoning. Its language is based on an extension of the rule-based language Datalog, Warded Datalog±, a high-performance language using an aggressive termination control strategy. Vadalog can support the entire spectrum of data science activities and tools. The system can read from and connect to multiple sources, from relational databases, such as PostgreSQL and MySQL, to graph databases, such as Neo4j, as well as make use of machine learning tools (e.g., Weka and Scikit-learn), and a web data extraction tool, OXPath. Additional Python libraries and extensions can also be easily integrated into the system.

Vadalog is the result of a joint effort between the Applied Research Team of the Bank of Italy, the Knowledge Graph Lab of the Technische Universität Wien and the University of Oxford.

Knowledge graph management systems (KGMS)
A knowledge graph management system (KGMS) has to manage knowledge graphs, which incorporate large amounts of data in the form of facts and relationships. In general, it can be seen as the union of three components:

From a more technical standpoint, some additional requirements can be identified for defining a proper KGMS:
 * 1) KBMS, that is, a knowledge base management system,
 * 2) Big Data, which is the need of handling large amounts of data, especially when considering that knowledge graphs have been thought as a solution for integrating multiple data sources, both corporate and public knowledge, which can be integrated into large knowledge graphs,
 * 3) (Data) Analytics is the need to provide access to existing software packages for machine learning, text mining, data analytics, and data visualization and to combine them together in the same platform.


 * definition of a language and a formalism with high expressive power,
 * cost-effective data wrangling, in all its steps, from data cleaning to web data extraction and big data access from many different sources,
 * efficient logical, probabilistic and ontological reasoning,
 * low complexity, both in terms of space complexity (to handle big data) and syntax,
 * interfaces (APIs) to access many heterogeneous data sources, such as corporate RDBMS, NoSQL or RDF stores, the web, machine-learning and analytics packages.

Other requirements may include more typical DBMS functions and services, as the ones proposed by Codd.

Vadalog system
Vadalog offers a platform that fulfills all the requirements of a KGMS listed above. It is able to perform rule-based reasoning tasks on top of knowledge graphs and it also supports the data science workflow, such as data visualization and machine learning.

Reasoning task and recursion
A rule is an expression of the form $n$ :&minus; $a1, &hellip;, an$ where:


 * $a1, &hellip;, an$ are the atoms of the body,
 * $n$ is the atom of the head.

A rule allows to infer new knowledge starting from the variables that are in the body: when all the variables in the body of a rule are successfully assigned, the rule is activated and it results in the derivation of the head predicate: given a database $D$ and a set of rules $&Sigma;$, a reasoning task aims at inferring new knowledge, applying the rules of the set $&Sigma;$ to the database $D$ (the extensional knowledge).

The most widespread form of knowledge that has been adopted over the last decades has been in the form of rules, be it in rule-based systems, ontology-based systems or other forms and it can be typically captured in knowledge graphs. The nature of knowledge graphs also makes the presence of recursion in these rules a particularly important aspect. Recursion means that the same rules might be called multiple times before obtaining the final answer of the reasoning task and it is particularly powerful as it allows an inference based on previously inferred results. This implies that the system must provide a strategy that guarantees termination. More technically, a program is recursive if the dependency graph built with the application of the rules is cyclical. The simplest form of recursion is that in which the head of a rule also appears in the body (self-recursive rules).

The query language
The Vadalog language allows to answer reasoning queries that also include recursion. It is based on Warded Datalog±, which belongs to the Datalog± family of languages that extends Datalog with existential quantifiers in rule heads and at the same time restricts its syntax in order to achieve decidability and tractability. Existential rules are also known as tuple-generating dependencies (tgds).

An existential rule has the following form:


 * $$\varphi(x) \Rightarrow \exist z \Psi(x,z)$$

or, alternatively, in Datalog syntax, it can be written as follows: Variables in Vadalog are like variables in first-order logic and a variable is local to the rule in which it occurs. This means that occurrences of the same variable name in different rules refer to different variables.

Warded Datalog±
In case of a set of rules $$\Sigma$$, consisting of the following: the variable Z in the second rule is said to be dangerous, since the first rule will generate a null in the second term of the atom r and this will be injected to the second rule to get the atom p, leading to a propagation of nulls when trying to find an answer to the program. If arbitrary propagation is allowed, reasoning is undecidable and the program will be infinite. Warded Datalog± overcomes this issue asking that for every rule defined in a set $$\Sigma$$, all the variables in the rule bodies must coexist in at least one atom in the head, called a ward. The concept of wardness restricts the way dangerous variable can be used inside a program. Although this is a limit in terms of expressive power, with this requirement and thanks to its architecture and termination algorithms, Warded Datalog± is able to find answers to a program in a finite number of steps. It also exhibits a good trade-off between computational complexity and expressive power, capturing PTIME data complexity while allowing ontological reasoning and the possibility of running programs with recursion.

Vadalog extension
Vadalog replicates in its entirety Warded Datalog± and extends it with the inclusion in the language of:


 * monotonic aggregations (min, max, sum, prod, count operators),
 * stratified negation,
 * support for different data types (strings, integer, double, date, boolean, set, lists, marked nulls),
 * rich annotation mechanism to define how to interact with data sources and external libraries.

In addition, the system provides a highly engineered architecture to allow efficient computation. This is done in the following two ways.


 * 1) Adopting an in-memory processing architecture and a pull stream-based approach, which limits the memory consumption or make it predictable.
 * 2) Using an aggressive termination control strategy, which detects patterns and redundancy while building of the chase-graph (used to generate the answers) as early as possible and cuts off computation when they occur. This is connected with the concept of isomorphism of facts, which reduces the number of steps needed to get an answer: if a fact h is isomorphic to h', the system will only explore the chase graph of fact h.

The Vadalog system is therefore able to perform ontological reasoning tasks, as it belongs to the Datalog family. Reasoning with the logical core of Vadalog captures OWL 2 QL and SPARQL (through the use of existential quantifiers), and graph analytics (through support for recursion and aggregation). The declarative nature of the language makes the code easy-to-read and manageable.

Example of ontological reasoning task
Consider the following set of Vadalog rules: The first rule states that for each person $$X$$ there exists an ancestor $$Y$$. The second rule states that, if $$X$$ is a parent of $$Z$$, then $$Y$$ is an ancestor of $$Z$$ too. Note the existential quantification in the first position of the ancestor predicate in the first rule, which will generate a null νi in the chase procedure. Such null is then propagated to the head of the second rule. Consider a database D = {person(Alice), person(Bob), parent(Alice,Bob)} with the extensional facts and the query of finding all the entailed ancestor facts as reasoning task.

By performing the chase procedure, the fact ancestor(ν1,Alice) is generated by triggering the first rule on person(Alice). Then, ancestor(ν1,Bob) is created by activating the second rule on ancestor(ν1,Alice) and parent(Alice,Bob). Finally, the first rule could be triggered on person(Bob), but the resulting fact ancestor(ν2,Bob) is isomorphic with ancestor(ν1,Bob), thus this fact is not generated and the corresponding portion of the chase graph is not explored.

In conclusion, the answer to the query is the set of facts {ancestor(ν1,Alice), ancestor(ν1,Bob)}.

Additional Features
The integration of Vadalog with data science tools is achieved by means of data bindings primitives and functions.


 * 1) Data binding primitives: bindings allow to connect to external data sources or systems like a database, a framework or a library and tell the system how to deal with them. This includes connections with relational database systems as PostgreSQL or MySQL, and graph databases, such as neo4j. It is also possible to integrate machine learning libraries (Weka, scikit-learn, Tensorflow) and a web data extraction tool, OXPath, which allows to retrieve data directly from the web. This is possible through the so-called annotations: they are special facts that augment the sets of existential rules with specific behaviours. The   annotation tells the system that the facts for an atom of the program are imported from an external data source, e.g., a relational database. The   annotation specifies that the facts for an atom of the program will be exported to an external target, e.g., the standard output or a relational database. Other annotations like ,  ,   allow to customize the data sources for the   annotation or the targets for the   annotation.
 * 2) Functions: custom expressions which make use of arithmetic operators, string manipulation or comparison can also be supported. Libraries written in Python are also enabled and they can be integrated with the dedicated annotation.

The system also provides an integration with the JupyterLab platform, where Vadalog programs can be written and run and the output can be read, exploiting the functionalities of the platform. It gives also the possibility to evaluate the correctness of the program, run it and analyse the derivation process of output facts by means of tools as syntax highlighting, code analysis (checking whether the code is correct or there are errors) and explanations of results (how the result has been obtained): all these functionalities are embedded in the notebook and help in writing and analyzing Vadalog code.

Use Cases
The Vadalog system can be employed to address many real-world use cases from distinct research and industry fields. Among the latter, this section presents two relevant and accessible cases belonging to the financial domain.

Company Control
A company ownership graph shows entities as nodes and shares as edges. When an entity has a certain amount of shares on another one (commonly identified in the absolute majority), it is able to exert a decision power on that entity and this configures a company control and, more generally, a group structure. Searching for all control relationships requires to investigate different scenarios and very complex group structures, namely direct and indirect control. This query be translated into the following rules:


 * a company $$X$$ directly controls a company $$Y$$ if $$X$$ directly owns more than 50% of the total equity of $$Y$$ ,
 * a company $$X$$ indirectly controls a company $$Y$$ if $$X$$ controls a set of intermediary companies that jointly (i.e., summing their shares) own more that 50% of $$Y$$.

These rules can be written in a Vadalog program that will derive all control edges like the following: The first rule states that each company controls itself. The second rule defines control of $$X$$ over $$Z$$ by summing the shares of $$Z$$ held by companies $$Y$$, over all companies $$Y$$ controlled by $$X$$.

Close Link
This scenario consists in determining whether there exists a link between two entities in a company ownerships graph. Determining the existence of such links is relevant, for instance, in banking supervision and credit worthiness evaluation, as a company cannot act as guarantor for loans to another company if the two share such a relationship. Formally, two companies $$X$$ and $$Y$$ are involved in a close link if:


 * $$x$$ (resp., $$Y$$) owns directly or indirectly 20% or more of the equity of $$Y$$ (resp., $$X$$), or
 * a common third-party $$Z$$ owns directly or indirectly 20% or more of the equity of both $$X$$ and $$Y$$.

These rules can be written in a Vadalog program that will derive all close link edges like the following: The first rule states that two companies $$X$$ and $$Y$$ connected by an ownership edge are possible close links. The second rule states that, if $$X$$ and $$Y$$ are possible close links with a share $$S1$$ and there exists an ownership edge from $$Y$$ to a company $$Z$$ with a share $$S2$$, then also $$X$$ and $$Z$$ are possible close links with a share $$S1$$ $$*$$ $$S2$$. The third rule states that, if the sum of all the partial shares $$S$$ of $$Y$$ owned directly or indirectly by $$X$$ is greater than or equal to 20% of the equity of $$Y$$, then they are close links according to the first definition. The fourth rule models the second definition of close links, i.e., the third-party case.