Enterprise Intelligence, TL;DR

The following is a prompt I used while writing Enterprise Intelligence (available at Technics Publications and Amazon) to prime ChatGPT for assistance: fact checks, finding a better word, drawing a picture, converting ttl to cql, etc. I found this serves as a good TL;DR, so here it is:

The subject of my book, Enterprise Intelligence, is Business Intelligence (BI) structures added to an enterprise knowledge graph (EKG). The EKG consists of three major parts: a knowledge graph (KG) authored by subject matter experts (SMEs) to Semantic Web standards, a data catalog (DC) that holds metadata for all data sources in the enterprise, and two BI-derived structures, the Insight Space Graph (ISG) and the Tuple Correlation Web (TCW), passively built from the normal BI query activity of BI analysts across the enterprise. In the book, I claim that businesses are like organisms, departments are like organs, and the EKG is like the brain.
I chose BI as the spearhead for this EKG because BI data is highly curated. Whatever data makes it into a BI database must be readily understood, of high analytical value, cleansed, and trustworthy. It is the data used for most business decisions.

The KG is like “System 1” (Kahneman): fast response time, more direct, more deterministic. It’s a collection of domain-level ontologies, analogous to the domain-level data products of data mesh, authored by SMEs. Authorship of this KG is now feasible thanks to the emergence of readily available, high-quality large language models (ChatGPT 3.5 in Nov 2022). LLMs have a symbiotic relationship with KGs (LLMs help to build KGs, and KGs ground the LLMs in reality). Incorporating retrieval-augmented generation (RAG) into this ecosystem further strengthens its capabilities. RAG allows for more sophisticated query scenarios by combining the generative abilities of LLMs with the structured, fact-based data from the EKG. This mirrors advanced cognitive functions, such as problem-solving and creative thinking.

The nexus of this EKG is the DC, an ontology of the data sources, databases/cubes, tables/views/dimensions, columns/attributes, and even column members (only as necessary, since there could be billions of members). It sits between the KG and ISG/TCW. All items of the ISG and TCW are traceable to DC elements. Further, DC tables, columns, and members could be linked to entities and individuals in the KG, expanding the semantics of those DC elements.
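
As a rough illustration of that hierarchy, here is a minimal Python sketch. It assumes a single hypothetical DCNode class for every level of the catalog. The names sales_dw, retail, dim_store, and the ex: identifiers are invented for the example and are not from the book.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DCNode:
    """One data catalog element (hypothetical shape, for illustration only)."""
    kind: str                          # "source", "database", "table", "column", or "member"
    name: str
    parent: Optional["DCNode"] = None  # the containing DC element, if any
    kg_entity: Optional[str] = None    # optional link to a KG entity or individual

    def path(self) -> str:
        """Walk up the parent links to build a fully qualified name."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


# Invented example: source -> database -> table -> column -> member.
source = DCNode("source", "sales_dw")
database = DCNode("database", "retail", parent=source)
table = DCNode("table", "dim_store", parent=database)
column = DCNode("column", "city", parent=table, kg_entity="ex:City")
member = DCNode("member", "San Diego", parent=column, kg_entity="ex:SanDiego")

print(member.path())
```

Because every ISG and TCW item points at a node like these, a path such as the one printed here is what makes an insight traceable back to its data source.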

The main idea of the ISG/TCW is to passively capture what dozens to thousands of BI analysts have seen (or could have seen) in visualizations rendered from their BI activity. Those salient points are captured across what could be thousands to billions of queries consuming hundreds to tens of thousands of compute hours across dozens to hundreds of data sources. It charts the points of interest across what is an unbelievably expansive space of insights. The ISG/TCW is more like System 2.

The ISG consists of nodes representing queries that were rendered in a visualization (line graph, bar chart, scatter plot, pie chart, etc.) by a visualization tool, such as Tableau or Power BI, in response to the actions of BI analysts using those tools. For each dataframe resulting from those queries, an array of simple functions wrings out the things a human would notice in those visualizations. For example, in a line graph, the user might recognize a trend up, a trend down, periodicity, steps, and spikes. Each of those insights is linked as a property of its query node. The columns and metrics of the query are linked to the appropriate DC nodes. Note that the data of the dataframe isn’t stored in the EKG, just the metadata of the query and any insights. These insights are like the things we notice as we go about our day.
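
To make "simple functions" concrete, here is a minimal Python sketch of two such detectors for a line graph. The function names, the z-score threshold, and the sample data are all assumptions for illustration. The book's actual insight functions may differ.

```python
from statistics import mean, pstdev


def trend(values, tolerance=0.0):
    """Classify the least-squares slope of a series as 'up', 'down', or 'flat'."""
    n = len(values)
    if n < 2:
        return "flat"
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = num / den
    if slope > tolerance:
        return "up"
    if slope < -tolerance:
        return "down"
    return "flat"


def spikes(values, z=2.0):
    """Return indexes lying more than z population standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z]


def insights(query_id, values):
    """Bundle detected insights as properties of a query node (metadata only)."""
    return {"query": query_id, "trend": trend(values), "spikes": spikes(values)}


# Invented monthly series with one obvious spike at index 4.
monthly_sales = [10, 11, 12, 13, 40, 15, 16, 17]
print(insights("q-001", monthly_sales))
```

Only the returned dictionary would be attached to the query node. The series itself is discarded, in line with the note that the dataframe's data isn't stored in the EKG.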

The TCW consists of nodes, each representing a tuple: for example, the price of oil in Beijing or the water consumption in San Diego. A tuple could be thought of as one row in a dataframe. The members represented in the tuples are associated with member nodes in the data catalog (the member nodes are, in turn, linked to the column node in the DC). The tuple nodes can also be connected to each other through Pearson correlations or conditional probabilities. These are calculated by comparing the tuples sliced into time series. These correlations are the patterns we notice about what is related to what. We can construct chains of strong correlations.
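
Here is a minimal Python sketch of how two tuples might be connected, assuming each tuple carries a time series of equal length. The tuple keys, the sample numbers, the 0.8 threshold, and the "rises given rises" definition of conditional probability are assumptions for illustration, not the book's definitions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def p_up_given_up(xs, ys):
    """P(ys rises | xs rises) across consecutive periods; None if xs never rises."""
    x_up = [b > a for a, b in zip(xs, xs[1:])]
    y_up = [b > a for a, b in zip(ys, ys[1:])]
    total = sum(x_up)
    both = sum(1 for x, y in zip(x_up, y_up) if x and y)
    return both / total if total else None


def strong_edges(series, threshold=0.8):
    """Return (tuple_a, tuple_b, r) for every pair with |r| >= threshold."""
    keys = list(series)
    edges = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            r = pearson(series[a], series[b])
            if abs(r) >= threshold:
                edges.append((a, b, r))
    return edges


# Invented tuples keyed by (measure, member), each sliced into five periods.
series = {
    ("oil price", "Beijing"): [70, 72, 75, 79, 84],
    ("freight cost", "Beijing"): [20, 21, 22, 24, 26],
    ("water consumption", "San Diego"): [50, 48, 51, 49, 50],
}

for a, b, r in strong_edges(series):
    print(a, "<->", b, round(r, 3))
```

With this invented data, only the oil price and freight cost tuples clear the threshold. Following such edges from node to node is one way to read "chains of strong correlations".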

With salient points captured in the ISG and strong correlations captured in the TCW from across dozens to thousands of diverse analysts across dozens of domains, we have a single integrated source of insights.

See supplemental blogs in the Enterprise Intelligence category.