Exploring a Sea of Data with Massive Numbers of RegEx – And Maybe Even the Automata Processor

Overview

This blog explores taking the lid off RegEx (regular expressions) and its less powerful cousin, the LIKE keyword in SQL. By “taking the lid off”, I mean looking for more than one pattern at a time. Indeed thousands, tens of thousands, or even more patterns. After all, the things we’re somewhat randomly seeing with our eyes at any given instant are recognized because our brain considers millions of things we could be seeing in massively parallel fashion.

Traditionally, we use RegEx and LIKE when we already know the pattern we’re looking for. We just need help wading through a ton of data for that pattern. For example, in a search, we could express that we’re looking for any name containing the characters “hara” and a first name beginning with “E” with this pattern (shown here in LIKE syntax): %hara, E%

Figure 1 shows a sample of the LIKE keyword looking for names that contain “hara, e”. Most search functions in applications offer a little more flexibility than just finding a string, mostly the simple Contains and Begins With options.

Figure 1 – SQL LIKE key word.

In this case, we’re checking whether any value matches a single pattern: one rule applied to very many values. Database people also know that LIKE parameters beginning with % result in poor performance since indexes on the column can’t be utilized.
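As a minimal illustration of the sort of query Figure 1 depicts (the table and column names here are my own assumptions, not necessarily those in the figure):

    -- Hypothetical customer table. The leading % forces a scan because an index
    -- on FullName can't be used; a pattern anchored at the start (e.g. 'Ohara, E%')
    -- could still use an index seek.
    SELECT CustomerId, FullName
    FROM dbo.Customer
    WHERE FullName LIKE '%hara, e%';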

A little more sophisticated use case would be to find all occurrences of some pattern in a long text. This is the Find function in text editors such as Word. This would result in multiple hits, many occurrences throughout a single long string of text. A further step would be to search a longer text column, such as those of the VARCHAR(MAX) or TEXT types, holding free text notes or descriptions for each row (ex: free form comments on a sales call); multiple hits for multiple rows. But whether we’re searching one big text like a long Word document or the text strings of many rows, we’re still searching for one key word or pattern at a time.

So let’s throw all caution out the window and ponder searching for thousands of patterns occurring in millions of long strings of text. That would normally be implemented as: for each text column in each row, iterate through many patterns. Meaning, if searching 1 GB of text for one pattern takes 1 second, searching for 100 patterns would take about 100 seconds. Yes, we could apply optimization techniques such as parallelizing the task by breaking up the rows into chunks shared across many servers, or applying some logic to eliminate some impossibilities upfront. But that’s not the point of this blog. This blog is about two things: why one would want to do such a thing, and how Micron’s Automata Processor offers an amazingly direct path to the ultimate solution to this problem.

I’ll first describe the first time the notion of running very many RegEx expressions over a large amount of data occurred to me in my daily life as a BI consultant. Even though this particular use case is somewhat covered by features in many current “Data Profiling” applications, the use case in which I recognized the need helps to identify metaphorically similar use cases elsewhere.

Before continuing, I’d like to mention a couple of things:

  • The coding samples will be stripped to the bare minimum. For example, I haven’t included parameter validation logic. There are numerous code examples available elsewhere for most of the topics of this blog, such as using RegEx in various programming languages and the SQL LIKE keyword. Additionally, some code involves SQL CLR, and that code is minimized as well since there are many examples of how to register SQL CLR assemblies into SQL Server.
  • I do mention topics on the Theory of Computation, mainly the Turing Machine, Regular Expressions, and Finite State Machines. But I won’t go into them deeply. For the most part, C# developers are familiar enough with RegEx and SQL developers with LIKE.
  • Although the main point of this blog is, “How great is that Automata Processor?”, I don’t actually get to the point of implementing it on the AP. Much of the reason is that I’m still focusing on communicating how to recognize use cases for the AP in a BI environment. Meaning, I’m still trying to sell the folks in the BI world (well, more the bosses of the folks in the BI world) on investing in this very “strange” but amazing technology. Besides, the AP SDK is still in limited preview, but you can ask to register anyway. However, once you’re comfortable with the concepts around finite state automata (the core principle of the AP), authoring them and implementing them on the AP is relatively easy.
  • This blog pulls together many tough subjects from the highly technical levels of Business Intelligence and the Theory of Computing. This means there are many terms that I may not fully define or even take some liberties glossing over a few details, in the name of simplicity.

My Data Exploration Use Case

After an “analytics” consultant understands the client’s problem and requirements and a project is scoped, the consultant (the ones in my usual world, anyway) finds themselves sitting in front of an instance of a tool to explore the heap of data, such as SQL Server Management Studio or Aginity. By “heap of data” I mean that since the customer is usually new to me as a consultant, the people and their roles, and the data and its meanings, are unknown to me. Until I go through the non-trivial process of learning about the data, I’m Lewis and Clark on an exploration journey.

My ability to learn about the data depends upon several factors, which surprisingly, even today, hardly ever all exist at the same time (or even a quorum of them):

  • The presence of knowledgeable and patient DBAs or data stewards or developers. Using the plurals illustrates that in BI there are usually a large number of databases scattered all over the enterprise, numbering in the hundreds for a large one. Quite often as well, part of the reason I’m there is because a DBA or analyst moved on, taking all that knowledge trapped in her brain with her.
  • The presence of a Data Dictionary. A data dictionary is a catalog of data sources throughout an enterprise, down to the column levels, including types, descriptions, even a lineage of the data (“Source to Target Mapping”), the valid values for the columns, and keys. This is the other MDM, MetaData Management, not Master Data Management.
  • The “penmanship” of the database designers. The better the names of the tables and columns, the easier it is to explore the data. But even if the tables and columns are well named, they can still sound ambiguous (ex: cost and price). I usually work with a Data Warehouse, which is not in a nice third normal form with primary/foreign key relationships. Adding to that, a Data Warehouse is subject to fast growth without discipline (because “disk storage is cheap”).

This learning about the data is a part of a wider task called Data Profiling, for which there are many very good tools on the market. But to me the heart of Data Profiling is something we usually do at the actual analytics stage, after we’ve identified our data and are analyzing its value towards solving our problem. In the scenario I’m describing, I know what problem I’m addressing, but I still don’t know what data I have to work with.

About a third of the time, my client has a specific data set to explore. “Here’s the last two years of clicks from our web site. Have at it.” Even in those cases, where the data structure is relatively simple and well-known, yes, it would be nice to find the usual patterns in the click streams, but even nicer to correlate those click patterns to something important to the client. Meaning, I’d like to go beyond the usual, looking for other data to think outside of the box of data given to us. So I’m back to searching for what else is out there.

In the end, after exhausting all sources of data information known to me, I’m still usually left with some level of looking for something. The thing about analytics is it’s often about unknown unknowns – I don’t know what I don’t know. And because the nature of analytics, at least applied towards improving success towards some goal, is fuzzy, imprecise, we don’t always know that we’ve done the best job possible. We usually take “good enough” and move on to the next thing.

Too often, in a casual conversation with colleagues at a customer site, a data source of some sort is mentioned and I think back to some earlier project with that customer, “Gee, I wish I knew about that data source back then.” So in an ideal world, I’d want to do an extensive search for opportunities, beyond what is already known. Exploring hundreds of servers for data within a reasonable amount of time for something that may or may not exist doesn’t make sense either. It would be nice if I could inexpensively do that comprehensive exploration.

As we humans go about our daily lives, sure, any given day may seem somewhat routine. We get up, eat breakfast, say good-bye to those at home, head to work, play solitaire, sneak out of the office, etc. But it’s probably less routine than we think. Other people move cars around, leave at different times, the weather is different, some tasks at work are different, our moods and the moods of others are different. So the signals entering our brains through our eyes, ears, nose, tongue, and skin need to be capable of recognizing all manner of things from all manner of angles and combinations. Our “inputs” don’t see things like cars and other people. They sense light, molecules, sound waves, and physical contact. Each of these symbols could represent millions of different things. We usually don’t have time to sequentially scroll down a list of possibilities. Fortunately all of these possibilities are “considered” by our brain in massively parallel fashion.

A Single Algorithm for a Wide Range of Rules

Identifying patterns can involve a great number of different types of algorithms. Regular Expressions are one type of algorithm. Calculating pi, the various methods for predicting the weather, all those C# functions you’ve written, and making McDonald’s fries are other examples. Our world of business, society, and the environment is composed of countless executions of algorithms of very many types. That is why our current CPUs are based on the Turing Machine, an algorithm of algorithms, which can process just about any algorithm our human brains can imagine (implying there are probably problems we cannot imagine).

Instead of burning hard-wired silicon for each of those countless algorithms we’ve identified, we developed pretty much a single “computing device” (as Mr. Burns may say) capable of executing those processes with a single “algorithm”. We encode those algorithms with sequences of instructions, software.

Similarly, many patterns can be easily, or at least relatively easily, expressed using the simple Regular Expression algorithm. For example, we can robustly (across lots of different formats) recognize sequences of characters such as phone numbers, social security numbers, dates, currency, hexadecimal numbers, credit card numbers, license plate numbers from various states, email addresses, etc. Regular Expressions are closely related to Finite State Automata, near the bottom of the “Theory of Computation stack”, the simpler side, the opposite end from the Turing Machine.

Now, the examples I listed are few and don’t constitute the thousands of rules I’ve been mentioning so far in this blog. And that perhaps is one reason RegEx hasn’t exactly blown the minds of programmers over the years, as it would seem to have limited use. However, here are three thoughts on sources of thousands of regular expressions:

  • At a brute force level, every word and number could be a RegEx. Regular expressions encapsulate patterns. The name “Eugene” is indeed a pattern, albeit not very interesting.
  • Many classes of things may not be easily abstracted as a concise regular expression, but if you look hard enough, there can at least be an abstract codification of a subset. For example, some names for a culture follow a pattern: the Scandinavian pattern of your father’s first name followed by “sen”, or the ease of recognizing most 4-syllable Japanese names as Japanese. Such patterns may not cover all Scandinavian or Japanese names, but they are still patterns that can be encoded.
  • Most importantly, streams of events can yield a great number of patterns. This is analogous to finding patterns of words close to each other in the long text of a book. But instead of words, they would be events such as the page click patterns normally leading to a purchase or a sequence of “events” often leading to sepsis. This notion of streams of events is actually part of a much bigger discussion, which is mostly out of scope for this blog.

One more important note on the last point. For a sequence of events to be of value, the events don’t need to be adjacent. For example, if we wish to find all customers going from meat to vegetables, we may or may not be interested in whether they stopped to pick up black pepper or beer in between. Meaning, some of those events will be noise that should be ignored. That will be especially true when we combine events from heterogeneous sources. For example, studying the sequence of events of one nurse making the rounds wouldn’t provide insight as rich as studying the sequence of events across all care providers (nurses, physical therapists, doctors of various specialties, etc). Regular expressions are versatile enough to ignore events thought to be extraneous to the pattern we’re encoding.

Further, an “event” that’s part of a sequence can be one of many options. For example, an important pattern may be those who went to the meat department first, then to beverages or produce but not baking goods, and finally to pick up wine. Again, regular expressions are versatile enough to encode that sequence. The point is (a full exploration of this is outside the scope of this blog) that there is very much opportunity to study complex streams of events, where a different approach is necessary to achieve query performance suitable for analysis.
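As a hedged illustration (the symbol encoding and the function call are mine; the SQL CLR RegEx function itself isn’t introduced until Figure 6 below, and its name and parameter order here are assumptions), suppose each department visit is encoded as a single symbol so that a shopping trip becomes a short string:

    -- M = meat, V = vegetables, D = beverages, P = produce, B = baking goods, W = wine
    DECLARE @trip1 VARCHAR(100) = 'MBPV';   -- meat, baking, produce, vegetables
    DECLARE @trip2 VARCHAR(100) = 'MDW';    -- meat, beverages, wine

    -- "Meat, then eventually vegetables, ignoring whatever happens in between":
    SELECT dbo.RegExMatch('M.*V', @trip1) AS Trip1IsMatch;              -- 1

    -- "Meat, then beverages or produce but not baking goods, then wine":
    SELECT dbo.RegExMatch('M[^B]*[DP][^B]*W', @trip2) AS Trip2IsMatch;  -- 1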

When we run a regular expression through our current software running on our current commodity servers, we’re running a regular expression algorithm over the Turing Machine algorithm. With the Automata Processor, these are what I consider the three major performance turbo-charges:

  1. The Regular Expression algorithm is directly translatable to a Finite State Machine, with the algorithm (not the instructions, the actual FSMs) “hard-wired” on the Automata Processor. Therefore, processing of FSMs is as direct as possible.
  2. Large numbers of FSMs can be loaded and updated onto an AP, a self-contained single piece of silicon (a co-processor on a conventional motherboard). Meaning, there is no marshaling of bytes back and forth from the CPU to RAM back to the CPU and so forth. The processing and the storage of the instructions live together.
  3. Each symbol is processed in parallel by ALL of the FSMs on the AP chip. They are not processed iteratively, one by one as through nested for-each loops.

The first two items describe the aspects of the performance gains at a “lower level” (in the weeds) than where the majority of BI developers ever want to live. It’s that third point that is the most compelling. With all due apologies to the “massive parallelism” of Hadoop, it’s not quite the same thing as the massive parallelism of the AP.

The massive parallelism of Hadoop occurs primarily at the server level, scaling out to even thousands of servers. It’s more like partitioning of a set of data onto a large set of servers. This means there is still processing to figure out which server(s) hold the subset(s) of data we’re interested in, sending those instructions over a wire to the server, the server running through conventional read operations, etc.

The massive parallelism of the AP is more like someone finding someone who is interested in free tickets to the Boise Hawks game by shouting out into the crowd, as opposed to asking each person serially, one by one. The AP is fed a stream of symbols that is seen by all of the FSMs programmed onto that AP. Those “interested” in that symbol accept it as an event and move to the next state. Those FSMs in a state not interested in a particular symbol “ignore” the symbol and are unaffected.

In the case of this RegEx example, the valid symbols, the “alphabet”, are essentially the 256 characters of the extended ASCII set (0-9, a-z, A-Z, and the common symbols). Incidentally, the number of symbols recognized by an AP is 256, one eight-bit byte. With that said, it’s critical to remember that a symbol can represent anything, not just a literal letter or digit. For example, the symbols can represent up to 256 Web pages in a click stream analysis or the four nucleotides forming a DNA sequence.

Yes, that can be a limitation, but I’m sure that will change at some point, and there are techniques involving banks of Automata Processors, where FSMs are artfully partitioned, each handling a limited subset of up to 256 of the total symbols.

Multiple RegEx Example

This example will test a set of words against a set of rules, for which there is a many to many relationship. In other words, each word can be recognized by multiple regular expressions. This example reflects the use case I described above (in the section, “My Data Exploration Use Case”) concerning the exploration of heaps of data.

This example is primarily utilizing SQL Server with a small C# function to leverage the .NET Framework’s RegEx functionality, which is much richer than SQL’s LIKE key word. As a reminder, I’ve kept the code as minimal as possible as many details, such as how to register a .NET DLL into SQL Server, are well documented elsewhere.

The data set used in this example is very small in order to illustrate the concept which would be greatly scaled-up were we using the Automata Processor. Millions of words and tens of thousands of patterns (RegEx) would not run very well using the conventional approach shown in the example.

The code can be downloaded as a .txt file. It’s just text, no binary stuff. Be sure to rename the file with the .sql extension to open in SQL Server Management Studio.

Figure 2 shows a SQL script that creates a temporary table of the words we wish to recognize.

Figure 2 – Words.
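A partial sketch of the kind of script Figure 2 shows; the table and column names are my own, and only a few of the 18 words referenced later in the results discussion are listed:

    -- Temp table of "words" (really, short phrases) to be tested against the patterns.
    CREATE TABLE #txt (Word VARCHAR(255));

    INSERT INTO #txt (Word)
    VALUES
        ('555-55-6666'),                 -- looks like a social security #
        ('555556666'),                   -- ambiguous 9-digit value
        ('1A Z999'),                     -- ADA County auto license format
        ('2GZ999'),                      -- Idaho auto license, no space
        ('45-888 Kamehameha Highway'),   -- street address in Kaneohe
        ('1200 Kamehameha Highway');     -- Kamehameha Highway, but not the Kaneohe format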

Glancing through the “words” (in this case “phrase” may sound more normal) inserted into the temp table in Figure 2, some are easily recognizable by our robust brains as formats such as dates, street addresses, and phone numbers. Some are ambiguous, such as the 9-digit words. So the idea is to take these words and check them against all the patterns we know, as shown in Figure 3.

Figure 3 – Patterns in our “knowledge base”.

The temp table, #RegEx, holds a row for each regular expression, a category, and a more specific description.
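A partial sketch of the kind of script behind Figure 3; the column names and the exact expressions below are my guesses based on the results discussed later, not the figure’s actual contents:

    CREATE TABLE #RegEx (
        Pattern     VARCHAR(255),   -- the regular expression itself
        Category    VARCHAR(50),    -- e.g. 'License Plate', 'Address'
        Description VARCHAR(255)    -- a more specific description
    );

    INSERT INTO #RegEx (Pattern, Category, Description)
    VALUES
        ('^\d{3}-\d{2}-\d{4}$',           'ID Number',     'Social Security #'),
        ('^\d{5}(\d{4})?$',               'Address',       'Zip code'),
        ('^1A [A-Z]\d{3}$',               'License Plate', 'ADA County Auto License'),
        ('^45-\d{3} Kamehameha Highway$', 'Address',       'Street address in Kaneohe');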

Figures 4 and 5 show the translation of two of the regular expressions held in the #RegEx table: a fairly simple one for ADA County Auto License numbers and a slightly more complicated one for phone numbers. Some of the patterns are very specific, such as the one for an ADA County Auto License. I’ve included such specific ones to help demonstrate that patterns don’t need to be universal. We could instead encode many patterns addressing subsets.

Figure 4 – Finite State Machine representation of an ADA County License Plate Regular Expression.

Finite State Machines are the heart of the Automata Processor. Once you’re registered for the Automata Processor preview I mention towards the beginning of this blog, you will see an interface that allows you to “author” such diagrams in this WYSIWYG manner. However, keep in mind that there are methods for authoring such diagrams en masse, for example, from the sequences in a Time Sequence data mining model. The sequences can be uploaded through the AP’s own XML format, ANML (pronounced “animal”).

Figure 5 – Finite State Machine representation of the Phone Number Regular Expression.

Figure 6 is a very simple scalar SQL Server function written in C# that uses the .NET Framework’s RegEx class. Figure 8 below shows how this function is used in SQL.

Figure 6 – C# SQL Server Scalar Function utilizing the .NET Framework’s RegEx.

Figure 7 is a script for registering the .NET function into SQL Server. I haven’t described all of the warnings related to enabling CLR functions as it is explained very well elsewhere.

Figure 7 – SQL Server code to register the .NET DLL into SQL Server.
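For reference, a minimal sketch of what such a registration script typically looks like; the assembly name, path, and function signature here are my assumptions, not necessarily those shown in Figures 6 and 7:

    -- Allow SQL CLR code to run on this instance (see the usual warnings elsewhere).
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO

    -- Register the compiled .NET DLL containing the RegEx function.
    CREATE ASSEMBLY RegExLib
    FROM 'C:\Assemblies\RegExLib.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    -- Expose the C# static method (Figure 6) as a T-SQL scalar function.
    -- Returns 1 on a match, 0 otherwise (the actual return type in Figure 6 may differ).
    CREATE FUNCTION dbo.RegExMatch (@pattern NVARCHAR(4000), @input NVARCHAR(MAX))
    RETURNS BIT
    AS EXTERNAL NAME RegExLib.[UserDefinedFunctions].RegExMatch;
    GO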

Now that all of the pieces are in place (the RegEx function is registered in SQL Server and we have a query window open in SSMS where the two temp tables are created), we can run a SQL query (Figure 8) to test each word against each rule. The SQL statement CROSS APPLIES the words with each rule, filtering out any word/pattern combination that does not result in a recognition. A recognition is determined via the RegEx function we created and registered into SQL Server, as shown in Figures 6 and 7, which returns a 0 or 1.

Figure 8 – SQL to process the words and RegEx.
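A sketch of what the query in Figure 8 might look like, reusing the hypothetical dbo.RegExMatch function and the column names assumed in the earlier sketches:

    -- Test every word against every pattern; CROSS APPLY keeps only the
    -- word/pattern combinations the RegEx function recognizes (returns 1).
    SELECT t.Word, r.Category, r.Description
    FROM #txt AS t
    CROSS APPLY (
        SELECT rx.Category, rx.Description
        FROM #RegEx AS rx
        WHERE dbo.RegExMatch(rx.Pattern, t.Word) = 1
    ) AS r;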

Using the SQL in Figure 8 with the CROSS APPLY join, with the 18 words we inserted into #txt and the 9 patterns we loaded into #RegEx, there were 162 (18*9) comparisons made. In other words, for each word, check each rule. If this were scaled up, for example if there were millions of words and thousands of patterns, the number of comparisons would be huge.

If these 18 words were fed into an Automata Processor loaded with those 9 patterns, each word is fed only once and all 9 patterns analyze it in parallel. To rephrase something similar I mentioned earlier, this is the same as someone holding up the word 555-55-6666 and shouting to a bunch of people, “Hey! What is this?”, as opposed to walking up to each one and asking that question.

Figure 9 shows the results of the SQL shown in Figure 8.

Figure 9 – Results of the SQL in Figure 8.

We’ll look at a few of the results, discussing some interesting aspects of exploring data in this manner:

  • Rows 1 and 10 show the “Phone #” RegEx is versatile enough to recognize a phone number with and without parentheses. In this case, for any word containing a set of 3 digits, 2 digits, and 4 digits, we can be fairly confident it’s a phone number whether or not there are parentheses around the first three digits. So it’s OK to use one versatile RegEx.
  • Rows 4 and 5 show that ‘1A Z999’ is recognized as both a more specific ADA County Auto License and a more generic Idaho Auto License. The less specific Idaho Auto License also recognized 2GZ999, even without a space between the 2G and Z999 parts. It’s good to recognize something at various levels. For example, sometimes we need real sugar, sometimes any sweetener will do.
  • Row 7 recognized “45-888 Kamehameha Highway” as an address in Kaneohe, but not “1200 Kamehameha Highway”. Because Kamehameha Highway practically circles Oahu, being able to recognize an address as specific as one in Kaneohe on Kamehameha Highway requires this fairly stringent rule. Also, this doesn’t mean all addresses in Kaneohe follow this rule. Other rules would be developed, hopefully with at least some abstraction into a RegEx. For example, because Luluku Road is only in Kaneohe, any address on Luluku Road (also following the 45-ddd format typical for Kaneohe) is a street address in Kaneohe.
  • Row 8 shows 555556666 as a Zip code, although another word for which the only difference is the dashes, 555-55-6666, is clearly a social security #. However, there really is no reason 555556666 cannot be a legitimate Zip code (somewhere in Minnesota). Even though our human brains may think of this more as an SSN, it’s good to have something that can see beyond our biases.

So suppose that over the years, through dozens of customers and hundreds of databases, I collected thousands of formats for data. Most will not be as universal as date and phone number formats. But even seemingly one-off formats could provide insight. For example, suppose years ago I encountered some old software system that stored case numbers in the format of three upper-case letters, a dash, two digits, a dash, and four digits (RegEx: [A-Z]{3}-\d{2}-\d{4} ). If today at another customer I encounter such a format, it adds a relationship that may or may not matter.

To take the code presented here to the level where we explore the hundreds of databases throughout an enterprise, I would expand this example to (a rough sketch follows the list):

  1. Iterate through a list of database servers, each database, each table, each view (because there could be calculated columns), and each column of those tables and views.
  2. For each column, the tool would retrieve each distinct value and a count of each value.
  3. For each of those distinct values, the tool would test it against each regular expression in the library accumulated over a long consulting career. Every value recognized by a regular expression, whether a street address from a little town on Oahu or just something numeric, would be added to a table similar to the one shown in Figure 9. However, there would be columns identifying the server, database, table, and column as well.
  4. Because this could mean trillions of rows for a large enterprise, we could actually store only the counts for each regular expression for each column. So if there were say 50,000 columns across all databases, each triggering around ten regular expressions, that’s only 500,000 rows.
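A rough sketch of steps 1 through 4, limited to a single database and with error handling and the server/database iteration omitted; the table, column, and function names (#ProfileResults, dbo.RegExMatch) are my own assumptions:

    CREATE TABLE #ProfileResults (
        TableName   NVARCHAR(300),
        ColumnName  SYSNAME,
        Category    VARCHAR(50),
        MatchCount  INT
    );

    DECLARE @schema SYSNAME, @table SYSNAME, @column SYSNAME, @sql NVARCHAR(MAX);

    -- Step 1: walk every string column of every table and view in this database.
    DECLARE col_cursor CURSOR FAST_FORWARD FOR
        SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE DATA_TYPE IN ('char', 'varchar', 'nchar', 'nvarchar');

    OPEN col_cursor;
    FETCH NEXT FROM col_cursor INTO @schema, @table, @column;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Steps 2-4: test each distinct value against every pattern and keep
        -- only the per-pattern counts for this column.
        SET @sql =
            N'INSERT INTO #ProfileResults (TableName, ColumnName, Category, MatchCount) '
          + N'SELECT ' + QUOTENAME(@schema + N'.' + @table, N'''') + N', '
          + QUOTENAME(@column, N'''') + N', r.Category, COUNT(*) '
          + N'FROM (SELECT DISTINCT ' + QUOTENAME(@column) + N' AS Val '
          + N'      FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N') AS v '
          + N'CROSS JOIN #RegEx AS r '
          + N'WHERE dbo.RegExMatch(r.Pattern, v.Val) = 1 '
          + N'GROUP BY r.Category;';
        EXEC sys.sp_executesql @sql;

        FETCH NEXT FROM col_cursor INTO @schema, @table, @column;
    END;
    CLOSE col_cursor;
    DEALLOCATE col_cursor;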

Remember though, the purpose of this blog isn’t so much to suggest a data profiling technique as to present a pattern for an Automata Processor use case which could provide inspiration for other applications.

Conclusion

It seems that the combined rate of data growth, the complexity of the world, and the computing power required to answer the sort of questions we now face is outpacing Moore’s Law in terms of the increasing computing power of CPUs. But we can still tackle this problem by looking towards these massively parallel approaches.

Last week (August 3, 2015) I posted a blog on a “Graph Database Symposium” I’m planning. At the time, the planning is even earlier than the “early stages”. The intent of that blog is to gauge the interest for such a symposium at this time. Hopefully, this blog helps take the reader further along in recognizing the value of graphs and the Automata Processor.

 

 


Planning a 1-Day Symposium in Boise on the Utilization of Graph-Centric Data Technologies in Business Intelligence

Introduction

I’m currently working with the organizers of the Boise BI User Group and a few heavy hitters from various Boise-based technology communities on a 1-day symposium introducing graph-based technologies to those in the Boise Business Intelligence community. (To clarify, by “graphs”, I’m referring to those web-like “networks” of relationships, and not visualizations such as line graphs seen in software such as Pyramid Analytics or Tableau.) The overarching goal is to inform BI practitioners of the toolset already out there required to begin addressing what I consider to be BI’s “hard problem”. That is, to feasibly formulate, organize, maintain, and query relationships between data throughout an enterprise.

We’re in the early design and planning stages, shooting for a mid-October (2015) delivery. The nature of this symposium is forward-thinking, meaning not many people would even think to look for it, so it doesn’t come with a ready-made audience (unlike, say, a class on advanced Tableau). I chose to post this blog early in the process as a feeler gauging interest in this symposium as well as to gather input for the content. This post is by no means a formal announcement.

As a caveat, it’s important to state upfront that in the overarching Business Intelligence context of this symposium, in order to apply many of the techniques that will be covered, there will still be a pre-requisite for a well-developed BI infrastructure … for the most part. I realize that for many enterprises, even a somewhat-developed BI infrastructure is still a far-off dream. But hopefully this symposium will reveal a much bigger payoff than was previously imagined for a well-developed BI infrastructure, spurring much more incentive to aggressively strive for that goal. However, it’s crucial to keep in mind this doesn’t mean that there aren’t narrower-scoped use cases for graph technologies ready to tackle without a well-developed BI infrastructure, particularly with the Automata Processor.

Abstract

An accelerating maturity of analytics combined with Boise’s rich Business Intelligence community, innovative spirit, and the headquarters of Micron with its Automata Processor presents a powerful opportunity for Boise to yield world-class analytics innovation. The “three Vs” of Big Data (massive volume, velocity, and variety) are simply just more data without improvement on the even tougher task of organizing the myriad data relationships, which today are mostly not encoded. We need to begin solving our problems of a complex world in non-linear, truly massively parallel, massively hierarchical, and non-deterministic manners. Such an effort begins by shifting away from the central role of the tidy simplicity of our current relational databases to the scalable, reflective, modeling capabilities of graph (network) structures taking center stage.

Everything is a set of relationships, and that is what graphs are all about. Our human intelligence is based on a model of our world, a big graph of relationships, built in our brains over the course of our lives. We humans are able to readily communicate with each other because those unique models of the world held in each of our brains mostly overlap – our cultures. Where our individual models of the world don’t overlap with those of others represents our unique talents. The net effect is that our society is incredibly richer because we can exceed the limitations of our individual brains through the aggregation of our collective knowledge.

Likewise, machine analytics systems of our enterprises possess skills beyond the limitation of our brains. The problem is that those systems don’t share our human culture. In order for us humans to effectively leverage the “intelligence” captured in those enterprise analytics systems, those systems also need to possess models of the world at least somewhat overlapping with us. Models in current analytics systems are limited by restrictions dictated by the limitations of computers of the past, for example, the limited notion of “relationships” of relational databases. Deeper communication between humans and machine intelligence currently requires grueling programming of the computers and sophisticated training on our part. Today’s technology, particularly graph technologies, is our opening to surpass those outdated techniques, building, maintaining, and querying superior models of the world in our analytics systems. The improved machine intelligence fosters smoother, more robust communication between human and machine intelligence.

The key takeaways are:

  • Understand why breaking away from the predominantly relational database model to graph databases opens the door to quantum leaps in analytic capability.
  • The challenges of navigating through the increasing complexity of the real world, at the risk of being left behind by enterprises that do build that capability.
  • An introduction to the technologies and concepts of graphs.
  • A roadmap towards the transition to graph data.

My Initial Vision as a Starting Point

As I mentioned earlier, we are in the early design and planning stages, and the purpose of this blog is to gauge the interest for such a symposium as well as to gather input from the potential attendees on the content. So nothing is set in stone, the concrete is just starting to be mixed. However, I would like to include my initial vision of the agenda in this post just as a starting point.

As we have just this past week reached a few critical milestones (participation of a few key parties, a venue), we’re just starting to engage other key players to work out an agenda that will provide maximum value to the attendees. So it will certainly morph to a noticeable extent by the time we formally announce the symposium.

Before continuing on to my initial agenda, note that Sessions 1 and 6 are targeted at mature BI practitioners. Because the symposium is set in a BI context, I thought to begin by laying out the current BI landscape and pointing out the big problem. Sessions 2 through 5 are at a rather introductory level on graph technologies, laying out the pieces required to attack that big problem. We would then wrap up with a discussion on how to apply graph technologies to BI. Anyway, here is the initial agenda I tossed out to begin the process:

Session 1: The Current State of Analytics

The enterprise analytics world is currently a complicated zoo of concepts, processes, and technologies, all of which do hold legitimate roles. However, they exist in our enterprises as islands of poorly linked pieces, lacking the rich integration of the memories in our brains or the organs in our bodies. A business enterprise is a system of relationships like any natural system. In this session we explore these “tectonic plates” of BI and the gaps that must be closed for our business enterprises to leap ahead through the vastly improved bridging of human and machine intelligence.

  • The Current Landscape of “the Intelligence of Business”: ETL, Data Marts and Warehouses, Data Lakes, Performance Management, Self-Service BI and Analytics, Master Data Management, Metadata Management, Complex Event Processing, Predictive Analytics and Machine Learning, Deep Learning, Knowledge Management.
  • The Missing Links: Why do we still make bad decisions, fail to see things coming, and keep acting on organizational myths and legends?
  • The Secret Sauce: Soften the boundaries between objects and balance bottom-up flexibility and top-down centralization.

Session 2: Graphs and the Theory of Computation

It’s certainly not that graphs are unfamiliar to us. We are well familiar with org charts, food chains, flow charts, family trees, etc., even decision trees. While such simple “maps” we’re used to seeing in applications such as Visio, PowerPoint, or SQL Server Integration Services are very helpful in our everyday lives, they quickly grow like kudzu into incomprehensible messes from which we readily shy away. This session will introduce basic concepts of graph theory and the Theory of Computation as well as to begin exploring that unwieldy reality of relationships we’ve so far punted down the road.

  • Introduction to Graphs: Terminology and Basics of Graph Theory, and a bit on the Theory of Computation.
  • The Importance of Graphs, Models and Rules in the Enterprise – Everything is a graph. Examples of graphs used in commonly used business tools.
  • Robust Graph Processing: Model Integration, Fuzziness, Inference, massively parallel, many to many, massively hierarchical.
  • Where Relational Databases Fail in the Enterprise and why we keep retreating back to that comfort zone (ex: the retreat from OLAP back to relational databases). Note: It may sound odd that I’m talking about focusing on relationships even though today’s primary data sources, “relational databases”, are called “relational”. The problem is they’re not relational enough.

Session 3: Embracing Complexity

It doesn’t take a network of seven billion independent minds and billions more Web-enabled devices forming the so-called Internet of Things to result in a complex system where nothing is reliably predictable. For example, a distributor of goods lives in an environment of vendors, stores, customers, their customers’ customers, regulations from all governments (in the “Global Economy”), and world events where reliable predictability is limited to low-hanging fruit problems. Each is rife with imperfect information of many sorts and competing goals. Consequently, the problems faced by such enterprises are of a “bigger” nature than the limited-scope problems we’ve so far typically addressed with our analytics systems. The reason is we are attempting to resolve complex problems using techniques for resolving complicated problems.

  • Overview of Complex Adaptive Systems. The many to many, heterogeneously parallel, massively hierarchical, non-linear nature of our world.
  • The Things We Know we Don’t Know and the Things We Don’t Know We Don’t Know: Predator vs Prey, Predator vs Predator
  • Rare Event Processing: Statistics-based prediction models fall short for those high impact rare events, where novel solutions are engineered from a comprehensive map of relationships.
  • The world is a complex system: Situational Awareness
  • Healthcare: Perfect Storms of Many Little Things
  • Lots of Independent and Intelligent Moving Parts: Supply Chain Management, Manufacturing, Agriculture

Session 4: Beyond Visio – Robust Graph Technologies

Graph concepts and technologies have been around for a long time, in fact, from the beginning of computing. Many of the concepts are core in the world of application developers who hide the ugliness from end users by presenting flattened, sterilized, distilled, templated chunks of data. Think of the wiring of your computer hidden from the end user by the casing. Gradually, the complexity is such that the ugliness demands to be addressed at the higher levels of the end user, albeit in a cleaner form.

  • Graph Databases: Neo4j Introduction and Demo
  • Overview of IBM’s Watson
  • Object-Oriented Databases and ORM.
  • The Semantic Web: RDF, OWL, SPARQL.
  • Introduction to graph-like co-processors; particularly the Automata Processor

Session 5: Micron’s Automata Processor

Micron’s Automata Processor is one of the most important innovations in semiconductors. It presents a shift away from the current computer architecture that for decades has been geared towards the simplicity of solving strictly procedural problems. Ironically, in order to effectively tackle the problems of an increasingly complex world, we retreat from the computer architecture of today to a simpler model based on finite state machines. The massively parallel, loosely-coupled nature of the Automata Processor more comfortably reflects the nature of the environments in which we live, whether business, nature, or social. The truly massively parallel nature of the Automata Processor represents a leap as big as the leap from single-threaded to multi-tasking operating systems decades ago.

  • Micron’s AP demo and examples of current applications
  • Proposed Automata Processor BI Use Case.
  • Recognizing Opportunities for the Automata Processor.

Session 6: The Big Problem of Building the Robust Models

So what is the roadmap for building such ambitious systems? This is not about building an Artificial Intelligence but about softening the communication boundaries between people and our databases by drastically improving upon the relationships between data. Automation of the generation and maintenance of these relationships, the rules, is the key. For example, it’s not much harder to map out the relationships within a static system than it is to write a comprehensive book on a fairly static but complicated subject. The trick is to do the same for a system/subject in constant flux.

  • Where do Rules Come From?
  • Existing Sources of models and rules in the Enterprise.
  • A Common Model and Rule Encoding Language.
  • Mechanism for Handling Change, Massive Parallelism, Massively Hierarchical, Missing or low confidence data.
  • Knitting together the Pieces of the Current Analytics Landscape mentioned in Session 1.

A Little Background on Where I’m Coming From

It’s not that people, particularly those involved with BI and analytics, aren’t aware of the importance and value of encoding knowledge onto graphs. It’s actually rather obvious, and graphs are very much in use. It’s that these graphs are for the most part simple, disparate artifacts (connect-the-dots pictures), disconnected islands of knowledge. That condition is similar to enterprises only a few years ago with hundreds of OLTP systems scattered throughout (and thousands of Excel documents today – even with SharePoint Excel Services!), with their silos of data and clumsy methods of integration. There have been efforts in the recent past to promote graphs to a more prominent level that gained quite a bit of attention but fizzled back into relative obscurity. Relevant examples include UML and the Semantic Web. Neither is dead, but maybe with the fuller complement of related technologies today, they may finally find lasting traction.

A couple of years ago I wrote a blog – strictly for entertainment purposes only – titled, The Magic of the Whole is Greater than the Sum of Its Parts. It’s just a fun exploration of the notion of a business as an organism. Particularly that organism’s intelligence, what I call the “intelligence of business”. Although we shouldn’t take that metaphor too far (and maybe I did in that blog … hahaha), I think it’s fair to say that a business has rough counterparts to a human’s organs, desires, pain, ability to physically manipulate its surroundings, and knowledge, which today is much more harmonious in us than in the business analog.

However, the problem is that a business’ “intelligence”, the ability to store, analyze, and maintain webs of relationships, lies almost exclusively in the human brains of the workers and hardly in the fairly hard-coded/wired mechanical things (devices, software, documents). That’s fine as long as the quality of the knowledge is fairly transferrable to another person (in case the worker leaves) or the skill has been commoditized, and that there is some level of overlap of knowledge among the employees (redundancy).

One major outcome of failing to address this, at least in my opinion, is that in the name of optimization (particularly when the elimination of variance and redundancy is overly zealous), workers are forced into deeper and deeper specialization, which draws stronger boxes around these “organic components” of the business. The knowledge in those workers’ brains is hardly ever recorded to an extent that a replacement is able to readily take over. When a knowledge worker leaves, it’s as if the enterprise had a stroke and must relearn capabilities.

Our poor human brains are filled to capacity to the point where we whittle away at things in life outside of work in order to keep up. We long ago maxed out on our ability to work optimally in groups, once our “tribes” began consisting of too many people with too much flux in the membership. It used to be that knowledge could be captured in books. But change and increasing complexity come too fast for the subject matter experts to effectively document, and then for us readers to assimilate. As we’ve increased the scalability of data through the Big Data mantra of volume, velocity, and variety, we need to improve the scalability of our ability to encode and assimilate increasing knowledge requirements.

The answer isn’t AI, at least not the Commander Data or HAL version promised for the last half century. Even with IBM Watson’s success on Jeopardy and its subsequent exponential improvement, I seriously don’t think there will be an AI more innovative than teams of motivated and educated humans for quite a while. The answer is to build a better “pidgin” bridging human intelligence and data, a far less grandiose track for which the pieces are mostly there and which offers a long-term incremental path towards improvement.

Here are a few old blogs that sample much of my earlier thoughts that led to the idea for this symposium:

Actually, almost all of my blogs are somewhat related to the subject of this symposium. My blogs have always been about pushing the boundaries of Business Intelligence. A couple of years ago I attempted to materialize all my thoughts around this subject into a software system I developed which I named Map Rock. This symposium is not about Map Rock as I’ve “retired” it and Map Rock only represents my vision. It makes more sense today to pull together the best-of-breed pieces out there into something from which we can begin to evolve an “intelligence of business”. However, my 5-part series on Map Rock offers a comprehensive description of what I was after.

Conclusion

This symposium is intended to be an introduction that will hopefully cut down some of those fences we fear to hop so that we can seriously explore the vast frontier of BI becoming a truly strategic asset, rather than being stuck straddling the tactical and operational realms. It can begin to move from “Help me calculate this value to plug into this formula” to “Help me create and maintain this formula”.

To recap the current status:

  • We’re in the early stages of planning. The agenda presented here is just an initial draft.
  • We’re planning to deliver this in Boise in the mid-October (2015) timeframe. We should have a date and a tighter agenda well before the end of August.
  • We’re trying to gauge the interest in the Boise area for such a 1-day symposium.
  • We’re asking for any input on content or hard problems in your business that could be better approached as a complex problem, not a complicated problem.

Please email me at eugene@softcodedlogic.com with any questions or comments.



Embracing Complexity – Pt 1 of 6 – Non-Deterministic Finite Automata

Introduction to this Series

“Big Data” and “Machine Learning” are two technologies/buzzwords making significant headway into the mainstream enterprise. Drawing analogies between these two technologies to those at the start of the Internet era twenty-something years ago:

  1. Big Data is analogous to a Web crawler capable of accessing large numbers of available Web pages.
  2. Machine Learning is analogous to search engines such as Yahoo’s back then followed by Google that index the salient pieces of data (key words and page ranks – the latter in Google’s case).

But the search engines fell short of being anything more than a glorified index, like a telephone book providing someone’s name and address only if you already know the person’s name. Similarly, our current analytics systems fall short in that they only provide answers to questions we already have.

Before moving on, it’s important to ensure the distinction between a complicated and a complex problem is understood up front. The Twitter summation of the theme of this blog is: We currently analyze complex problems using methodologies for a complicated problem. Our dependable machines with many moving parts, from washing machines to airplanes, are complicated. Ecosystems, the weather, customer behavior, the human body, the stock market are complex.

Complicated things (no matter how complicated) operate in a closed system (at least we create the closed system environment or just pretend it is closed) where cause and effect between all parts are well understood. Complex systems have many moving parts as well, but unlike complicated systems, relationships between the moving parts are not well defined; therefore, outcomes are not deterministic. Most of our “truly” analytical questions actually address complex systems, which we attempt to answer using techniques designed for answering complicated questions.

This is the first in a series of blogs laying a foundation for breaking free from the analytical constraints of our current mainstream analytics systems founded upon “intelligently designed” (designed by we humans) databases. Such systems are built based on what we already know and expect, hardly ever considering the unknown, giving it the “out of sight, out of mind” treatment. For this blog, I’ll introduce the basic idea for employing NFAs and a simple T-SQL-based sample. Subsequent blogs in this series will build on the concept.

To better deal with complexity, we can paradoxically retreat from the current mainstream, Turing Machine-inspired computer design (the top of the stack of the theory of computation) to the far less sophisticated Non-Deterministic Finite Automata, NFA (2nd from the bottom, only to the Deterministic Finite Automata). NFAs are simpler, more elemental constructs with far more flexibility in expressing a very wide variety of the rules of daily life. The tradeoff is that NFAs may not be exactly streamlined and could seem unwieldy to an engineer, but we’ll have the power to emulate the rules of daily life with a “higher resolution” from that lower granularity.

This post and the next two posts of this series comprise a sub-series, an introduction to the utilization of NFAs and pattern recognition in the BI world. Post 2 will introduce pattern recognition and a simple T-SQL-based application as well. Post 3 will tie Post 1 and 2 together, again with a T-SQL-based application, with a mechanism for processing incoming symbols by multiple NFAs in parallel – at least in a set-based manner (let’s just call it “quasi or pseudo-parallel”) as a first step.
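To make the idea a little more concrete before those posts, here is a speculative sketch (my own, not taken from the series) of how NFAs could be represented as rows so that a single incoming symbol advances all of the machines in one set-based statement; the table and column names are assumptions:

    -- One row per transition, across every NFA we've loaded.
    CREATE TABLE #Transition (
        NfaId     INT,        -- which pattern/machine the transition belongs to
        FromState INT,
        Symbol    CHAR(1),    -- the input symbol that triggers the transition
        ToState   INT
    );

    -- The set of states each NFA currently occupies (an NFA can be in several at once).
    CREATE TABLE #CurrentState (
        NfaId   INT,
        StateId INT
    );

    -- Feed ONE symbol to ALL machines at the same time (the "quasi-parallel" step).
    DECLARE @symbol CHAR(1) = 'a';

    SELECT DISTINCT t.NfaId, t.ToState AS StateId
    INTO #NextState
    FROM #CurrentState AS c
    JOIN #Transition  AS t
      ON  t.NfaId     = c.NfaId
      AND t.FromState = c.StateId
      AND t.Symbol    = @symbol;

    -- Machines with no matching transition simply drop out of the race; recognitions
    -- would be detected by joining #NextState to a table of accepting states.
    TRUNCATE TABLE #CurrentState;
    INSERT INTO #CurrentState (NfaId, StateId)
    SELECT NfaId, StateId FROM #NextState;

    DROP TABLE #NextState;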

Posts 4 through 6 will deal with further characteristics of the system, for example, exploring further the notion of “what fires together, wires together”, as well as diving deeper into a physical implementation better suited for scalability of such ideas. In particular, Hekaton and Micron’s Automata Processor, which I’ll discuss briefly in this post. By Post 6, we will be at the doorstep of what I had intended to encapsulate in Map Rock, which is a focus on changing relationships as opposed to just keener recognition and control.

This is a blog about AI, but not the sort of AI we usually think about which I believe is still a few years away (despite the incredibly rapid improvement of IBM’s Watson on several dimensions since its debut on Jeopardy in 2011). I certainly can’t explain everything about NFAs and  AI in these 5000+ words, or even 500,000. However, I think you’ll find the theme of this set of blogs useful if we can for now at least agree that:

  1. The world is a complex, adaptive system, where our current analytical systems are about to reach their limits,
  2. In order for us to make superior decisions we need true massive parallelism to paint the always dynamic, messy picture of what is around us,
  3. Predictive analytics models are rules reflecting what we’ve come to know; they work because life on Earth, although dynamic, is relatively stable, but the models eventually go stale,
  4. And that working at a lower, more elemental level gives us flexibility we don’t have at a higher, more object-oriented level.

Lowering the Granularity of Our Computing Objects

Most readers of this blog have probably encountered the concepts of NFAs (and Regular Language, Context-Free Language, Turing Machine, etc) in college in a course on the theory of computation. Most would agree that it is still taught simply because it has always been taught, just a formality towards a CS degree, as such concepts almost never appear in the world of high-level programming. But we’re running into a wall as we begin to ask our analytics systems new sorts of questions of a complex nature while answering them with techniques built for complicated problems. Our computer systems are built to address well-defined, well-understood problems, employed merely as something that can do certain jobs better, as we would employ a good fighter as a nightclub bouncer.

Computing at the less sophisticated but more granular level of the NFA removes much of the rigidity imposed by computations that have the luxury of being optimized for static, reliable conditions, for which we make countless assumptions. This is analogous to how we don’t deal with life thinking at the atomic or molecular level of the things we encounter every day but at the macro level of objects; apples, bosses, tasks, and ourselves (we’re a macro object too).

We could even look at 3D printing as a cousin of this lower granularity concept. Instead of being completely limited to the need of manufacturing, shipping, and storing zillions of very specific parts, we can instead have big globs of a few types of stuff from which we can generate almost anything. Well, it’s not quite that extreme, but it’s the same idea. Similarly, I don’t believe NFA processing will replace relational databases in the same way 3D printing shouldn’t replace manufacturing. 3D printing isn’t optimal for things for which we know won’t change and for which we need great quantities. There will be a mix of the two.

We Already Do Quite Well with Determinism, So Why Bother with This?

Our human brand of intelligence works because our activities are mostly confined to a limited scope of time and space. Meaning, our best decisions work on a second to second, day to day basis involving things physically close to us. Additionally, the primary characteristics of things we deal with, whether cars, bears or social rules remain fairly constant. At least they evolve at a slow enough pace that we can assume validity of the vast majority of relationships we don’t consciously realize that are nonetheless engaged into our decisions. In fact, the ratio of what we know to what we don’t know makes “tip of the iceberg” sound like a ridiculous understatement. If things evolved (changed) too quickly, we couldn’t make those assumptions and our human brand of intelligence would quickly fall to pieces through information overload.

In fact, changes are the units of our decision making, predictions we make with the intent of furthering us towards our goals. Our brain (at least through vision) starts with the 2D image on our retina, then applies some innate intelligence (such as shadows) and some logic (such as what is obscuring what) to process depth. And finally, tracking and processing changes is how we handle the 4th dimension of time.

When we form strategies to achieve a goal, it’s the changes, how a change leads to transitions in things, that form our strategies, ranging from something as mundane as getting breakfast to planning for retirement to getting a person on Mars. Strategies are like molecules of cause and effect between the atoms of change that we notice. The fewer changes involved, the more effective our decisions will be, as accuracy is progressively lost over a Bayesian chain of cause and effect. We are more successful obtaining the breakfast we desire right now than in planning how we will retire decades from now as we envisioned today (due to cumulative changes over a long period of time).

A key thing to keep in mind is that in enterprises it seems the default attitude towards change is to deal with it as an enemy or pest, something to be reviled, resisted, and eliminated. However, to state the obvious in Yogi Berra fashion, without change, nothing changes. Change is what makes things better or worse. Unfortunately, in what is for all pragmatic purposes a zero-sum world, some will experience change for the better, some for the worse. But because change is always happening, those currently in “better” positions (the top echelon of enterprises) must vigilantly improve or at least maintain that condition.

Even maintaining the status quo is the result of constant change, except the net measurement is the same. For example, maintaining my body weight doesn’t mean nothing has changed. I’m constantly overeating, then compensating by under-eating (and occasionally even vice-versa). For those finding themselves in worse conditions, the ubiquity of change means there is always hope for the better.

Change as the basis for intelligence is rooted in the fact that our home, Earth, is a hugely dynamic, complex system powered by intertwined geologic forces and biological replication. Geologic forces are driven by forces deep in the Earth as well as way over our heads in the clouds, sun, and meteors. The ability of cells to replicate is the underlying mechanism by which all the life we live with self-organized. Every creature from viruses through swarms of bees through humans is driven to “mindlessly” take over the world. But we millions of species and billions of humans have settled into somewhat of a kaleidoscope of opposing forces, at least in the bigger picture, which is like a pleasantly flowing stream, seemingly the same but in reality in a constant state of change. The mechanisms of evolution and our human intelligence both enable adaptability on this fairly smoothly-dynamic planet.

A Few Clarifications

If all of this sounds obvious and/or like a bunch of flowery crap, it could be that it’s only obvious when it’s brought to our attention, but quickly dismissed and forgotten as we resume the drudgery of our daily lives: being careful not to break any of the hundreds of thousands of laws micro-managing our lives, following best (expected) practices that immunize us from culpability, and being careful not to trip over social mores that weren’t there yesterday. Our Industrial Revolution upbringing raised us to seek and expect comfort.

I would also like to point out that I’m not suggesting a system that simply gives us more to worry about, distracting us from what’s important, undermining our abilities through information overload (like a DoS attack). The main idea is not to replace us with some sort of AI system. It is to supplement us; watch our backs (it can multitask better than we can), see what our innate biases overlook, reliably rule out false positives and false negatives through faster exploration of the exponentially growing number of possibilities and continued testing of paths to goals (the definition of success).

The Expanding Reach of Our Daily Lives

However, many factors emerging in most part due to the increasing power of technology are expanding the scope of the time and space in which we operate, individually or as an enterprise. Globalization introduces many more independently moving parts. Longer lives increase the cumulative changes each of us experiences in a lifetime. The rapidly growing human population has greatly expanded the reach of our species to the point where there’s practically nowhere on the surface we don’t inhabit. The countless devices feeding a bottomless pit of data collection, storage, and dissemination expand the scope of our activities over time and space.

I purposely prepend the word “independently” to the phrase “moving parts” used in the previous paragraph. That’s because the fact that the parts are independent, intelligent decision makers defines the world-shattering difference between complicated and complex. However, the level of “intelligence” of these independently moving parts doesn’t necessarily mean matching or even attempting to emulate human-level intelligence. Billions of machines, from household appliances to robots running a manufacturing plant, are being fitted with some level of ability to make decisions independently, whether that means executing rules based on current conditions or even sorting through true positives, false positives, false negatives, and all that good stuff.

With the limited scope of time and space typical for humans during the 1800s and 1900s, complicated machines were effective in performing repetitive actions that have served, still serve, and always will serve us very well. But in the vastly expanding scope of time and space in which individuals, their families, and businesses operate, making good decisions becomes an increasingly elusive goal.

If we don’t learn to embrace complexity to make smarter decisions, then to the entities that are embracing it (such as hedge funds and organizations with Big Brother power) we will be what fish and other game are to humans with symbolic thinking: at their mercy. Embracing complexity doesn’t mean something like giving up our ego and becoming one with the universe or going with the flow. It means we need to understand that in a complex world:

  • We need to be flexible. We cannot reject the answer of “it depends” in an obsessive search for the most convenient answer.
  • Trial and error is a good methodology. It’s also called iterative. Evolution is based on it, although our intelligence can streamline the process significantly. On the other hand, our limited experience (knowledge) means we very often miss those precious weak ties, the seeds that beat out competition to rule the next iteration.
  • Our logic is prone to error because of the ubiquitous presence of imperfect information (a huge topic).
  • It’s a jungle out there, every creature and species out to take over the world. The only thing stopping them is that every other creature is trying to take over the world.

I discuss my thoughts around complexity, strategy, and how Map Rock approaches the taming of it in Map Rock Problem Statement – Parts 4 and 5.

Reflecting the World’s Cascading and Dynamic Many to Many Nature

A very intriguing discipline for dealing with complexity is Situation Awareness. Its roots lie in war and battle scenarios, for example as a methodology for fighter pilots to deal with the chaotic realities of life-and-death fighting. In such situations, there are many independently moving parts, including some that you cannot trust. With all the training on tactics and strategies, a good opponent knows the best way to win is to hit you where you weren’t looking. In other words, things don’t go as planned. So we must be able to recognize things from imperfect and/or unknown information.

Figure 1 depicts a variation of an entity relationship diagram of a supply chain. Notice that unlike the usual ERD, there aren’t lines linking the relationships between the various entities. That’s pretty much because there are so many relationships between the entities that representing each with a line would result in a very ugly graph, and simply showing that relationships exist would be oversimplifying things.

Those entities have minds of their own (will seek their own goals), unlike the “ask no questions” machines such as cars and refrigerators (at least for now). Instead of the conventional lines from entity to entity that strongly reinforce only the sequential aspects of a system, I depict “waves” (the yellow arcs) which attempt to reinforce the massively parallel aspects of a system as well.

Figure 1 – A sort of Entity Relationship Diagram of a Supply Chain.

The entity diagram shows each entity broken down into three parts of varying proportions, denoted by three colors:

  • Black – Unknown information. Highly private, classified. This information does indeed exist, but it is unobtainable or much too expensive to obtain. Contrast that with unknowable information, for example information so far out in the future that no one could possibly predict it … except in hindsight. Therefore, perhaps there should be black for unknowable data and a dark gray for private/classified data.
  • Gray – Imperfect information. This could be indirectly shared, but statistically reliable; the basis for Predictive Analytics. Or it could be shared information, but suspect, or possibly outdated.
  • White – Known. This is information readily shared, validated, and up to date. We would also tend to perceive it as reliable if we knew that it benefited us.

The proportions of black, gray, and white are just my personal unscientific impressions of such entities based on my personal experience exploring the boundaries of what we can and cannot automate after 35+ years of building software systems. The main point of Figure 1 is to convey that the white portion is the “easy” part we’ve been mostly dealing with through OLTP systems. The gray and black parts are the hard part, which does comprise the majority of information out there and the stuff with the potential to screw up our plans.

In a fight, whether as a predator versus prey, a street brawl, or business competition, we can either play defensively (reactively) or offensively (proactively). Defensive querying is what we’re mostly used to when we utilize computers. We have a problem we’re attempting to solve and query computers for data to support the resolution process executing in our heads. However, in the “jungle”, situations (problems) are imposed on us, we don’t choose the problem to work on. Our brains are constantly receiving input from multiple channels, recognizing things, grouping them, correlating them.

Not counting video games, most of how we relate to computers is that computers answer direct questions posed to them by us, which helps us answer the complicated questions, involving relationships between things, that we’re working through with our brains. Video games are different from most of the software systems we use in that the computer is generating things happening to us, not just responding to our specific queries. In the real world, things happen to us, but software is still not able to differentiate to an adequate degree what is relevant and what isn’t, so we end up with enough false positives to create more confusion, or so many false negatives that we miss too much.

The most important thing to remember is that the goal of this series of blogs is to work towards better reflecting the cascading and dynamic many to many relationships of the things in the world in which we live our lives. To do this, our analytics systems must handle objects at a lower level of granularity than we’re accustomed to, which can be reconstructed in an agile number of ways, similar to how proteins we consume are broken down in our guts into more elemental amino acids and reconstructed into whatever is needed.

Then, we must be able to ultimately process countless changes occurring between all these objects in a massively parallel fashion. To paint the most accurate picture in our heads required to make wise decisions, we need to resist forcing all that is going on into brittle, well-defined procedures and objects.

Non-Deterministic Finite Automata

Probably the most common exposure to NFA-like functionality for the typical BI/database developer is RegEx (the same idea as the less powerful LIKE clause in a SQL statement). But I think it’s thought of as just a fancy filter for VARCHAR columns in a WHERE clause, not as a way of encoding a pattern we wish to spot in a stream of symbols. Those symbols can be the characters of words in a text document (the typical use case for RegEx), the stream of bases in DNA, or the sales of some product over many time segments. Indeed, NFAs are the implemented version of regular expressions.

The NFA is a primary tool in the field of pattern recognition. An NFA is a pattern that can recognize sequences of symbols. Sometimes those sequences may not be entirely consecutive (handled by loopbacks), sometimes not even entirely ordered (handled by multiple start states), and they can lead to multiple outcomes (the “non-deterministic” part, handled by multiple transitions for the same symbol).

When we think of pattern recognition, we usually think of high-end, glamorous applications such as facial or fingerprint recognition. But pattern recognition is one of the foundational keys for intelligence. It’s exactly what we humans seek when browsing through views in PowerPivot or Tableau. We look for things that happen together or in sequence. And the word “thing” (object) should be taken in as liberal a manner as possible. For example, we normally think of a thing as a solid physical object, but things can be as ephemeral as an event (I like to think of an event as a loosely-coupled set of attributes), recognized sequence of events, and things for which there isn’t a word (it needs a few sentences or even an entire course to describe).

If we think about it, seeing the things we see (at the “normal” macro level) is sort of an illusion. Meaning, we see an image (such as an acquaintance or a dog) reconstructed from scratch. When I see a friend of mine, my eyes don’t reflect into my head a literal picture of that friend. That vision begins with my eyes taking in zillions of photons bouncing off of that friend, which are quantified into a hundred million or so dots (the number of rods and cones on my 2D retina), into a number of edges (lines) processed by my visual cortex, into a smaller number of recognitions processed with components throughout my brain. If it didn’t work this way, it would be impossible to recognize objects from whatever angle 4D space-time allows and even partially obscured views, as presented to us in real life.

Further, I don’t see just that person in isolation. “That friend” is just one of the things my brain is recognizing. It also in massively parallel fashion recognizes all other aspects from the expression on his face, to what he is wearing, to funny angles that I wouldn’t be able to make out all by itself (all the countless things for which there isn’t a single word), to memories associated with that friend, as well as things around that friend (the context). Intelligence is massively parallel, iterative (hone in on a solution, through experimentation), massively recursive (explores many possibilities, the blind alleys), and massively hierarchical (things within things).

NFAs are visualized as a kind of directed graph: connected nodes and relationships (lines, edges). We’re well familiar with them, for example org charts, flow charts, UML diagrams, and entity relationship diagrams. However, NFAs are mathematical constructs abiding by very specific rules. Those rules are simple, but from them we can express a wide range of patterns for recognition.

Figure 2 depicts an example of an NFA. The NFA on the left is used to recognize when a market basket contains chicken wings, pizza, and beer. That could signify a large party, which could be used to notify neighbors of the need to book that weekend getaway. The one on the right is used in the more conventional, “if you have pizza and beer, perhaps you’d like chicken wings as well”, utilization of market basket analysis.

Figure 2 – Sample of an NFA.

A very good series of lectures on NFAs, and actually on the wider field of the Theory of Computation, is by Dan Gusfield. Lectures 1 through 3 of the series are probably the minimum for a sufficient understanding of NFAs, although there is very much value in understanding the more sophisticated concepts, particularly Context-Free Languages and Turing Machines. I like Prof. Gusfield’s pace in this series. Ironically (for a tech blog), the fact that he still writes on a chalkboard slows things down enough to let it all settle.

I believe a big part of the reason why graph structures still play a relatively minor role in mainstream BI development is that we’ve been trained since our first database-related “hello world” apps to think in terms of high-level, well-defined, discrete “entities” reflected in the tables of relational databases. Each entity occupies a row and each column represents an attribute. It’s easier to understand the tidy matrix of rows and columns, particularly the fixed set of columns of tables, than the open-ended definitions contained in graphs. That goes for both people and computers, as a matrix-like table is easier for servers to process.

In order to process enterprise-sized loads of data, we needed to ease the processing burden on the hardware by limiting the scope of what’s possible. We made “classes” and table schemas as structured as possible (ex, a value can be pinpointed directly with row and column coordinates). We stripped out extraneous information not pertinent to the automation of our well-defined process.

We also neglected to implement the ability to easily switch in and out characteristics that were once extraneous but are now relevant. Graphs, lacking a fixed schema, don’t have the crisp row and column structure of tables nor the fixed set of attributes. So we decided we could live with defining entities by specific sets of characteristics, forgetting that objects in the world don’t fit into such perfectly fitted slots.

There are techniques for conventional relational databases that support and reflect the unlimited attributes of objects, for example the “open schema” technique where each row has three columns: an object id, an attribute, and the value of the attribute. There can be an unlimited number of attributes associated with each object. But the relational database servers executing those techniques struggle under the highly recursive and heavily self-joined queries.
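
As a rough illustration (the table, column, and attribute names here are hypothetical, not from any sample in this series), the core open-schema structure and the kind of self-joining query it forces might look like this:

-- A minimal "open schema" (entity-attribute-value) table: any object can carry
-- any number of attributes without altering the schema.
CREATE TABLE dbo.ObjectAttribute
(
    ObjectId       INT          NOT NULL,
    AttributeName  VARCHAR(100) NOT NULL,
    AttributeValue VARCHAR(400) NOT NULL,
    CONSTRAINT PK_ObjectAttribute PRIMARY KEY (ObjectId, AttributeName)
);

-- Even a simple question ("objects in WA with a Gold membership") becomes one
-- self-join per attribute, which is what strains the relational engine as the
-- number of attributes per question grows.
SELECT a.ObjectId
FROM dbo.ObjectAttribute AS a
JOIN dbo.ObjectAttribute AS b
    ON b.ObjectId = a.ObjectId
WHERE a.AttributeName = 'State'      AND a.AttributeValue = 'WA'
  AND b.AttributeName = 'Membership' AND b.AttributeValue = 'Gold';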

In an ideal world, if computers had already been as powerful as they are today back in the days when only large, well-oiled enterprises used them (where taking in ambiguity as a factor isn’t usually a problem), my guess is we would have always known graph databases as the mainstream, with relational databases appearing later as a specialized data format (as OLAP is also a special case). Instead of a relational database being mainstream, we would think of a relational table as a materialization of objects into a fixed set of attributes pulled from a large graph. For example, in a graph database there would be many customer ids linked to all sorts of attributes customers (people) have, spanning all the roles these people play and all their experiences. There could be thousands of them. But for a payroll system, we only need a handful of them. So we distill a nice row/column table where each row represents a customer and a small, fixed set of columns represents the attributes needed to process payroll.
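
Continuing the hypothetical ObjectAttribute table from the sketch above, “materializing” a narrow, fixed-column table out of the open-ended set of attributes might look something like this (the attribute names are again illustrative):

-- Distill a fixed row/column shape, keeping only the handful of attributes a
-- particular system (e.g., payroll) actually needs.
SELECT ObjectId AS CustomerId,
       MAX(CASE WHEN AttributeName = 'PayRate'   THEN AttributeValue END) AS PayRate,
       MAX(CASE WHEN AttributeName = 'TaxStatus' THEN AttributeValue END) AS TaxStatus,
       MAX(CASE WHEN AttributeName = 'BankAcct'  THEN AttributeValue END) AS BankAcct
FROM dbo.ObjectAttribute
GROUP BY ObjectId;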

Graph-based user interfaces (most notably Visio – at least to my heavily MSFT-centric network) have long existed for niche applications. But there is a rapidly growing realization that simply more data (volume), provided ever faster (velocity), and in more variety alone doesn’t necessarily lead to vastly superior analytics. Rather, it’s the relationships between data, the correlations, that directly lead to actionable and novel insight. So enterprise-class graph databases such as Neo4j, optimized for authoring and querying graph structures, are making headway in the mainstream enterprise world.

However, keep in mind that NFAs are very small, somewhat independent graphs, unlike large, unwieldy graphs more akin to a set of large relational database tables. In other words, the idea of this blog is querying very many small NFAs in massively parallel fashion, as opposed to one or a few large tables (or a messy “spider-web” graph). In a subsequent post in this series, we’ll address the “somewhat independent” aspect of NFAs I mention above; loosely-coupled.

Up to now we weren’t compelled enough to take the leap to graph databases since we were able to accomplish very much with the sterile, fixed-column tables of relational databases. And if graphs were too unwieldy to deal with, we retreated back to the comfort of the 2D world of rows and columns. But we’re now beginning to ask different questions of our computer systems. We don’t ask simply what the sales of blue socks in CA were during the last three Christmas seasons.

We ask how we can improve sales of blue socks and attempt to identify the consequences (ex. Does it cannibalize sales of purple socks?). The questions are more subjective, ambiguous, and dynamic (SAD – hahaha). These are the sorts of questions that have been in the realm of the human brain, which in turn turns to our computer databases to answer the empirical questions supporting those more complicated questions.

SQL Server 2014’s new in-memory Hekaton features could help as well. Similar to an OLTP load, processing NFAs would involve a large number of queries making small reads and writes. This is in contrast to an analytics application such as OLAP, which by comparison involves relatively few queries, each reading a large amount of data, with no updates made except for a scheduled refresh of the read-only data store. I’ve made this comparison because I think of this utilization of NFAs as something in the analytics realm.

But applied in an implementation involving thousands to millions of NFAs (such as Situation Awareness), a highly-parallel implementation, it could involve a large number of large reads and writes as well. So we have an analytics use case for a technology billed as “in-memory OLTP”. The advantage of using Hekaton over Neo4j is that we could implement our NFA system using familiar relational schemas and querying techniques (SQL and stored procedures) instead of invoking a new technology such as Neo4j.

Hekaton should provide at least an order of magnitude improvement over doing the same thing with conventional disk-based relational tables. This performance improvement comes first from the all-in-memory processing and from dropping the overhead required for disk-based tables.
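
To give a flavor of what that might involve, here is a minimal sketch of an NFA transitions table declared as a Hekaton memory-optimized table. The table and column names are illustrative only, and this is not the schema used in the sample later in this post; it assumes a SQL Server 2014 database that already has a MEMORY_OPTIMIZED_DATA filegroup.

-- Illustrative only: a memory-optimized transitions table with hash indexes
-- suited to the many small point lookups NFA processing would generate.
CREATE TABLE dbo.NFATransitionsInMemory
(
    TransitionId INT IDENTITY(1,1) NOT NULL,
    NFAId        INT NOT NULL,
    FromStateId  INT NOT NULL,
    SymbolId     INT NOT NULL,
    ToStateId    INT NOT NULL,
    CONSTRAINT PK_NFATransitionsInMemory PRIMARY KEY NONCLUSTERED HASH (TransitionId)
        WITH (BUCKET_COUNT = 1048576),
    INDEX IX_FromState_Symbol NONCLUSTERED HASH (FromStateId, SymbolId)
        WITH (BUCKET_COUNT = 1048576)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);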

Micron’s Automata Processor

Much more intriguing and relevant than Hekaton for this blog focused on NFAs is Micron’s Automata Processor, which I wrote about almost a year ago in my blog, A Rare Big Thing Out of Boise. The Automata Processor (AP) is a memory-based chip directly implementing the mechanisms for the massively parallel processing of NFAs.

This should result in at least a few orders of magnitude of further performance improvement, first from the fact that it has little if any “generalist” overhead since it is designed as an NFA-specific chip and not a general-purpose memory chip. It also processes NFAs in truly massively parallel fashion.

Thirdly, the “processing” mechanism (to process large numbers of NFAs in parallel) is built directly onto the chip, which means that there is no marshaling of bits between a CPU and memory over a bus for every single operation. So even if we were to compile NFAs down to “native code” (as Hekaton’s native stored procedures do), massively multi-threaded on a relatively massive number of CPUs, there would be great hurdles to overcome in beating the Automata Processor.

We could look at the AP as merely an optimization for a particular class of problems. The sort that recognizes patterns such as faces or gene sequences in a huge stream of data. But similarly we can look at the current mainstream computer architecture (CPU and storage device – RAM, hard drive, or even tape) as an optimization for the vast majority of the classes of problems we deal with in our daily lives as we’re accustomed to (at the macro level) in our vestigial Industrial Revolution mentality. That would be the well-defined, highly repetitive, deterministic class of problem which is the hallmark of the Industrial Revolution.

So instead I like to look at the Automata Processor as a technology that is a lower-level (lower than the Turing Machine) information processing device capable of handling a wider variety of problems; those that are not well-defined, highly repetitive, and deterministic. NFAs are like molecules (including very complicated protein molecules), not too high, not too low, high enough to solve real-world problems, but not so unnecessarily low-level and cumbersome. An analogy would be assembler language being low enough to dance around the roadblocks imposed by a high-level language, but not as cumbersome as programming in zeros and ones.

This parallelism could mean massive scales such as up to tens of thousands or even millions of NFAs. The main idea is that each piece of information streaming into a complex system could mean something to multiple things. The reality of the world is that things have a cascading many to many relationship with other things. For example, a sequence of three sounds could be the sound of countless segments of a song rendered by countless artists, countless phrases uttered by countless people with countless voices, countless animals.

NFA Sample Application

At the time of this blog’s writing, Micron had not yet released its development toolkit (the chip/board itself and the Software Development Kit, SDK) to the public for experimentation. That is one of the major reasons I decided to demonstrate NFAs using conventional, disk-based SQL on SQL Server 2014, at least for this blog. However, the Automata Processor’s SDK is accessible at the time of this blog’s posting (after signing up) by visiting http://www.micronautomata.com/

There is still much value in demonstrating NFAs in this conventional manner, even with the impending release of the Automata Processor SDK. First, the AP is etched in stone (literally, in the silicon). For the business-based solutions I can imagine (i.e., more mainstream than specific applications such as bioinformatics), I believe there are a few deficiencies in the implementation (which I intend to supplement with Hekaton). There will be techniques outside the realm of the AP that would be of benefit and which we can implement using a flexible, software-based system (i.e., a relational database management system). The details of the business-based solutions I can imagine and the designs I’ve created around the AP are well beyond the scope of this blog, but some of my older blog posts provide background on those efforts.

This sample implements a very bare-bones NFA storage and query system for SQL Server. This version isn’t optimized or modified for Hekaton (which is the subject of a subsequent post), as that would further expand the scope of this blog. This simple application supports the NFA concepts of:

  • Multiple Start states.
  • Looping back on a transition. This allows us to “filter” noise.
  • Transitions to multiple nodes for the same symbol. This is the main feature of an NFA that distinguishes it from the less powerful “deterministic” finite automata.
  • Epsilon transitions (transitioning without consuming a symbol).

Please understand that this sample code is intended to be used as “workable pseudo code”. This sample certainly doesn’t scale. It is meant to convey the concepts described here. More on this in Part 3 of this series.

The script for this blog, NFA -Asahara.sql, generated through SQL Server Management Studio, is a T-SQL DDL script that creates a SQL Server database named NFA, a few tables, and a few stored procedures:

  • Database, NFA – A simple, conventional (disk-based) database in which the sample database objects will reside.
  • NFA, Schema – A schema simply for cleaner naming purposes.
  • NFA.[States], Table – Holds the states (nodes) of the NFAs.
  • NFA.[Symbols], Table – Holds the symbols used in transitions.
  • NFA.[Transitions], Table – Holds the transitions (edges, lines) of the NFAs.
  • [NFA].AddNFAFromXML, Stored Procedure – Takes an XML representing an NFA and registers (persists) it to the tables.
  • [NFA].ProcessWord, Stored Procedure – Takes a “word” (string of symbols) as a parameter and processes the word, symbol by symbol, through all of the registered NFAs.

For this sample, you will simply need the SQL Server 2008 (or above) relational engine as well as the SQL Server Management Studio to run the scripts. I didn’t include Hekaton features for this sample in part to accommodate those who have not yet started on SQL Server 2014. These are the high-level steps for executing the sample:

  1. Select or create a directory to place the SQL Server database (NFA.MDF and NFA.LDF), then alter the CREATE DATABASE command in the NFA – Asahara.sql script to specify that directory. The script uses C:\temp.
  2. Create the NFA database (database, tables, stored procedures) using the NFA – Asahara.sql T-SQL script.
  3. Register the NFAs following the instructions near the top of the [NFA].AddNFAFromXML stored procedure.
  4. Run the sample queries located in comments near the top of the [NFA].ProcessWord stored procedure.

Figure 3 depicts the SQL Server objects that are created by the script (database, tables, stored procedures, one TVF, but not the schema) and the part of the script where the database file directory can be set.

Figure 3 – Beginning of the SQL script for this blog.

Once the objects are created, sample NFAs need to be registered before they can be processed. A few examples are provided in a commented section near the top of the code for the stored procedure, [NFA].AddNFAFromXML. The stored procedure accepts the NFA as an XML document since XML is a flexible way to exchange this information without a fancy UI.

Figure 4 shows one of those sample NFAs (passed as an XML document) that determines if a prospective job would be favorable, as well as the states/transitions that are registered into the tables.

Before continuing, please remember that these sample NFAs are unrealistically simple for a real-world problem. But the point is that individual NFAs are simple, but many simple NFAs could better reflect the nuances of the real world. This will be addressed in a subsequent post in this series.

Figure 4 – Register an NFA using an XML document.

Regarding the XML, it is a collection of elements named Transition, with four attributes:

  1. FromState – Each State (node) has a name. Usually, states are just called “qx” where x is an integer (ex: q0, q1, q2 …). However, I chose to give them a more descriptive name indicating something about how we got to that state.
  2. ToState – This is the state that we are transitioning to in reaction to a symbol.
  3. TransitionSymbol – The symbol that triggers the transition from the FromState to the ToState.
  4. IsFinal – Final States are a special kind of state (another being a Start State). If a word (series of symbols) ends on a Final State, it’s said that the NFA recognizes the word.
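
To make the shape concrete before moving on to Figure 5, here is a purely illustrative registration call. The root element name, state names, and stored procedure parameter name are my guesses and not necessarily those used in Figure 4 or the downloadable script; only the four Transition attributes above are taken from the sample.

-- Illustrative only: registering a simple "favorable job" NFA from XML.
DECLARE @nfa XML = N'
<NFA Name="FavorableJob">
  <Transition FromState="Start"     ToState="CommuteOK"    TransitionSymbol="Distance Short" IsFinal="0" />
  <Transition FromState="CommuteOK" ToState="SalaryOK"     TransitionSymbol="Salary Good"    IsFinal="0" />
  <Transition FromState="SalaryOK"  ToState="FavorableJob" TransitionSymbol="Benefits Good"  IsFinal="1" />
</NFA>';

-- The parameter name is assumed; see the comments near the top of the stored
-- procedure for the actual calling convention.
EXEC [NFA].[AddNFAFromXML] @NFAXML = @nfa;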

Figure 5 is a graphical representation of the NFA encoded as XML above (in Figure 4).

Figure 5 – Graphical representation of the sample NFA.

Figure 6 shows the results of processing a “word”. A “word” is a series of symbols, usually a string of characters. But in this case, a symbol is more verbose, thus requiring a delimiter (comma) to separate the symbols:

  • “Distance Long” or “Distance Short” – How far is the commute from home to this job?
  • “Salary Good” – The salary is above my minimum requirements.
  • “Benefits Good” – The benefits exceed my requirements.

Figure 6 – Process a word, which is a string of symbols.

The first result set just shows the trace of the iterative process of processing a word. Each iteration (there are three) processes one of the symbols of the word. It’s possible that multiple paths would be explored, in which case an iteration would have a row for each path. Multiple paths are one of the two levels of parallelism on the Automata Processor: the ability to process multiple paths within an NFA and multiple NFAs at once.

The second result set shows only the final state, indicating this is indeed a good job.
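
For a feel of what [NFA].ProcessWord does under the hood, here is a heavily simplified, self-contained T-SQL sketch of the core iteration. It uses throwaway table variables rather than the real NFA tables, and it ignores epsilon transitions and loopbacks; it is meant only to show how each symbol fans the active states out across all matching transitions (the multiple paths).

-- Illustrative transitions for the "favorable job" NFA and a single start state.
DECLARE @Transitions TABLE (FromState VARCHAR(50), Symbol VARCHAR(50), ToState VARCHAR(50), IsFinal BIT);
DECLARE @Active TABLE (StateName VARCHAR(50), IsFinal BIT);
DECLARE @Next   TABLE (StateName VARCHAR(50), IsFinal BIT);

INSERT INTO @Transitions VALUES
    ('Start',     'Distance Short', 'CommuteOK',    0),
    ('CommuteOK', 'Salary Good',    'SalaryOK',     0),
    ('SalaryOK',  'Benefits Good',  'FavorableJob', 1);

INSERT INTO @Active VALUES ('Start', 0);  -- begin at the start state(s)

DECLARE @Word VARCHAR(200) = 'Distance Short,Salary Good,Benefits Good';
DECLARE @Symbol VARCHAR(50), @Pos INT;

WHILE LEN(@Word) > 0
BEGIN
    -- Peel off the next comma-delimited symbol.
    SET @Pos = CHARINDEX(',', @Word + ',');
    SET @Symbol = LEFT(@Word, @Pos - 1);
    SET @Word = SUBSTRING(@Word, @Pos + 1, LEN(@Word));

    -- Advance every active state along every transition matching the symbol.
    DELETE FROM @Next;
    INSERT INTO @Next (StateName, IsFinal)
    SELECT DISTINCT t.ToState, t.IsFinal
    FROM @Active AS a
    JOIN @Transitions AS t ON t.FromState = a.StateName AND t.Symbol = @Symbol;

    DELETE FROM @Active;
    INSERT INTO @Active SELECT StateName, IsFinal FROM @Next;
END

-- If any surviving path ended on a final state, the word is recognized.
SELECT StateName AS RecognizedAtFinalState FROM @Active WHERE IsFinal = 1;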

 

Next Up

As mentioned towards the beginning of this post, Part 2 of this series will introduce pattern recognition along with another T-SQL-based sample. In Part 3, we will tie Parts 1 and 2 together. Parts 4 through 6 will expand upon the NFA functionality (most notably feeding back recognitions), the implementation of Hekaton (as a step up in performance for the Part 1-3 samples and in a supportive role around the AP) and the Automata Processor itself, as well as further exploration of the use cases.

As with any other system, for example SQL Server or the .NET Framework, there are an incredible number of optimizations (think of them as known shortcuts) based on usage patterns of all types (admin, process, structures, etc.) to be implemented. Many of these optimizations have already been incorporated in the components of Map Rock, not to mention the NFA-specific optimizations offered by the Automata Processor.

Posted in BI Development, Cutting-Edge Business Intelligence, Map Rock

Levels of Pain – Refining the “Bad” side of the KPI Status

Everything we do towards achieving goals involves costs – sacrifice, or investment (a positive way to look at it). We purposefully put up things we already have (time, money, our career) to be consumed or placed at risk in the hope of achieving a goal. For the sake of this short blog, I’ll call that investment “pain”. I present this blog for several reasons:

  1. KPI statuses are usually too simple. Most are a percentage of an actual value towards a target value. But in real life, there are real events that happen at certain thresholds which completely change the relationships between things.
  2. There really are no free lunches. Even if a cost is not readily evident, it will remain a credit in the account of the Complex World to be cashed in sometime, somehow.
  3. Life isn’t really as rigid as artificial boundaries suggest. The systems of life mostly allow for some leeway, even though there may be an optimal value.

With that in mind, I’d like to explore the definition of a KPI’s “bad” status. Measures of a KPI status are not simply on a smooth continuum of values (-1 to 1) ranging from terrible to bad to OK to good to great. This simple take on a KPI status was necessary at first because of limited data and processing capability. But in the Big Data era, we can do better by recognizing conditions as a status, not just a calculation.

Events start to happen at thresholds. The bad side of the continuum is really a set of progressively severe stages, each with its own continuum. A nicely graphic example is how the degradation of a car’s tires affects the car’s ability to move along a road, progressing through these stages (along with the colors I use):

  1. Warning (pink) – The car tire is balding. This continuum ranges from significant wearing of the treads to their complete disappearance.
  2. Pain (red) – The tire pops; however, the car can still physically move. The tire will progressively shred as it rolls along, but you can still get the car off the freeway.
  3. Major Pain (maroon) – The tire breaks off the rim. The car can still move even though it will be sparking on the asphalt.
  4. Broken (black) – The axle breaks. At this point, barring gravity or a tow truck, the car cannot move.

Investment can also demonstrate these stages of pain:

  1. Warning – An investment of money that doesn’t impede cash flow or the ability to keep the “doors open”. The Warning goes away once we’ve recouped our investment.
  2. Pain – We’ve missed payments on a bill. At the high end of the continuum, we begin to receive Dunning letters and even court orders.
  3. Major Pain – Cash flow is impeded, we’re forced to sell off things, our wages are garnished.
  4. Broken – The doors are closed on us. We’re out on the street.

Most healthcare procedures, such as curing cancer, involve trading one Major Pain (death would be “Broken”) for one or more lesser Pains (side-effects).

When taking a 7am flight but deciding to sleep until 4am, we risk the following:

  1. Warning – Cutting it close but getting there in time to board early.
  2. Pain – Boarding late and missing out on overhead luggage space.
  3. Major Pain – Needing to run clear across the airport while your name is announced over the PA.
  4. Broken – Missing your flight.

The stages could be different, but the point is that clearly different sets of events are triggered at certain thresholds. Things don’t just become progressively bad from a balding tire to a broken axle. We can live with a balding tire while we ensure our water and electricity stay on.

In the implementation of a KPI capable of reflecting these events, the status calculation requires a CASE-WHEN-THEN statement for each level of pain. To keep with the convention of returning -1 through 1 (bad to good) for a KPI status, I return (>=-1 and <-.75)=Broken, (>=-.75 and <-.5)=Major Pain, (>=-.5 and <-.25)=Pain, (>=-.25 and <0)=Warning, 0=OK, and anything greater than 0 is Good to Great.
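
As a minimal sketch of that status-to-pain-level mapping (the table and column names are hypothetical, not from any particular scorecard implementation):

-- Map a raw KPI status value (-1 through 1) to a named level of pain, using the
-- buckets described above. dbo.KPIStatus is a hypothetical table holding the
-- calculated status per KPI.
SELECT k.KPIName,
       k.StatusValue,
       CASE
           WHEN k.StatusValue >= -1.00 AND k.StatusValue < -0.75 THEN 'Broken'
           WHEN k.StatusValue >= -0.75 AND k.StatusValue < -0.50 THEN 'Major Pain'
           WHEN k.StatusValue >= -0.50 AND k.StatusValue < -0.25 THEN 'Pain'
           WHEN k.StatusValue >= -0.25 AND k.StatusValue <  0.00 THEN 'Warning'
           WHEN k.StatusValue  =  0.00                           THEN 'OK'
           ELSE 'Good to Great'
       END AS PainLevel
FROM dbo.KPIStatus AS k;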

If KPI statuses are set up with these stages in mind, we will have a better idea of the real consequences as we prioritize. Being cognizant of these thresholds helps us prioritize the many pains that we need to address throughout an enterprise or just our lives at any given moment. Prioritization really is deciding what is most important at the moment. But that importance depends on the consequences for failing to address the pain.

If KPIs are the nerves of the “intelligence” of an enterprise, the KPIs should be linked to whatever consequences (effects) may be triggered. Keep in mind too that there probably isn’t just one progression of pain, but varying progressions depending upon other circumstances. However, that gets into another story, so for now let’s just start with recognizing that pain isn’t a smooth continuum. (Note: It was this reason that led me to develop SCL to define relationships in a very robust manner.)

Consider too that warning isn’t necessarily a “bad” thing. As I’ve defined warning in this blog, it very often can be thought of as an investment or a sacrifice that we can live with. As I mentioned, every endeavor towards a goal, the resolution of a problem, requires some sort of investment. In other words, we purposefully put certain things in a level of pain we can live with in order to fix worse pain, which is sacrifice.

This notion of purposefully putting certain aspects of our lives or enterprise into pain (investment, sacrifice) opens the door for a twist on optimization. It is a technique I developed back in 2008 that I call Just Right Targets. The idea is to set up the KPI bad statuses as described here, select a threshold we’re willing to tolerate, select values we can manipulate (in what-if fashion), and run a genetic algorithm finding a set of those values whereby we do not exceed any of the pain thresholds.

For example, the purpose for Project Management is to deliver a product with the features required to provide intended value (scope), on time, and within budget – hopefully without having trashed the “resources” (burning out the workers and wrecking the equipment) for the next project. It’s the project manager’s job to balance the competing goals of scope, time, and resources:

  • If it’s discovered that new features are required, the project manager must negotiate a reduction in a combination of time and resources.
  • If the project deadline must be delivered earlier, say for competitive reasons, scope is reduced and/or resources are added.
  • If the key developer is hit by the proverbial bus, and there is no replacement, we need to extend the timeline and/or reduce the scope.

Just Right Targets accepts the maximum acceptable pain bucket for each KPI (at the highest level for this example, scope, time, and resources), runs what-if forecasts on a wide range of scenarios, and scores each scenario based on the minimum pain it will cost, with preference to scenarios that do not cross any pain margins.
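
To make the scoring step a bit more concrete, here is a rough T-SQL sketch. It covers only the scoring of already-generated scenarios, not the genetic algorithm that generates them, and the tables and columns are hypothetical: dbo.ScenarioForecast holds one row per scenario per KPI with its forecasted status (-1 to 1), and dbo.PainThreshold holds the worst status we are willing to tolerate per KPI.

-- Prefer scenarios that cross no pain margins; among those, prefer the best
-- overall status.
SELECT f.ScenarioId,
       SUM(CASE WHEN f.ForecastStatus < t.MaxTolerablePain THEN 1 ELSE 0 END) AS PainMarginsCrossed,
       SUM(f.ForecastStatus) AS TotalStatus
FROM dbo.ScenarioForecast AS f
JOIN dbo.PainThreshold AS t ON t.KPIName = f.KPIName
GROUP BY f.ScenarioId
ORDER BY PainMarginsCrossed ASC, TotalStatus DESC;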

Posted in BI Development, Cutting-Edge Business Intelligence, Data Mining and Predictive Analytics, Map Rock

The Magic of the Whole is Greater than the Sum of Its Parts

Prelude

If businesses were people, they would lumber about in a vaguely purposeful manner like zombies. That’s due to the top-down, military-style hierarchies of modern corporations, which result in integration of information only at the top and only to a limited extent below. Imagine Geppetto as the CEO (external puppet master) of Pinocchio. Pinocchio is managed through strings from the CEO manipulating each of Pinocchio’s parts. The movements are jerky and not very lifelike. When Pinocchio becomes a real person managed through his completely integrated brain, his movements are smooth and lifelike. He can control and grow his life more effectively this way than through the indirect, latent, imperfect information-driven command of Geppetto.

This isn’t a criticism of how businesses are run today. Business enterprises are well beyond the capability of a single person to control to the level where the enterprise appears “lifelike”. But taking Performance Management and Process Management to the next level supplements what is needed to achieve that “lifelike” movement in an enterprise. Businesses obviously have succeeded executing top-down from a command center (Geppetto’s brain) as opposed to the distributed, networked intelligence of the parts (Pinocchio’s brain). Businesses have produced valuable goods for their customers, met targets, supported the livelihood of employees and investors, and innovated. But most businesses don’t make it, and for the ones that do, there was a lot of luck along the way, sometimes they made it in spite of themselves, and eventually they do die.

Business Intelligence is supposed to provide the information required for decision makers to make better decisions. Although BI made significant impacts towards that goal, it still hasn’t quite made businesses look more like a world-class scientist or athlete than the lumbering zombie. So Big Data comes to the rescue … or does it?

At the time of this writing, one of the major reasons why a Big Data project may yield underwhelming results is that it’s simply more data. There is no question that the availability of more data is beneficial, but most businesses still don’t know how to effectively analyze the data they already have and implement actions off of it. On the other hand, more data can lead to counterproductive information overload.

So what happened to the huge buzzword of about ten years ago (circa 2005), Semantic Web? It’s still around, but it’s taken a seat far in the back of the buzzword bus. Yes, dealing with the complexity and mess of webs of relationships is more difficult than dealing with the tidy matrices of data (tables/spreadsheets) or even objects (aggregates in NoSQL terms). But we need to have relationships between data at least keep pace with the growing amount of data, or we just end up with information overload. Sure, we can find a needle in a haystack with Big Data, but so can a magnet.

In this blog, I present some of my thoughts on the feasibility and value of purposefully pursuing and measuring the level of integration of intelligence within a business, even if such an effort doesn’t address a clear and present need.

This blog is perhaps too philosophical for the tastes of the primarily “technical” audience. But as I look through the zoo of Apache projects around Big Data and all the skills required of a data scientist, it innately seems much too complicated. So at least for me, I need to take a few steps back in an attempt to see the single big picture of Big Data which cannot be captured in merely a thousand words or a catchy marketing phrase such as “the three Vs”, but can be felt by my consciousness.

Measuring a Continuum of Consciousness

On my flight home for Christmas Vacation some thoughts reminiscent of Map Rock were triggered by an incredibly intriguing article in the Jan/Feb 2014 issue of Scientific American Mind titled, Ubiquitous Minds, by Christof Koch. The article discusses the question of what things have consciousness. Is consciousness strictly a human phenomenon (at least on Earth today) or do other animals and even things possess it albeit to lesser degrees? The article suggests that it’s more of a continuum for which we can measure the degree.

That article includes introductions to two concepts, one called panpsychism and another called Integrated Information Theory. For the former, panpsychism, it would be too much of a stretch for this blog to place it in a business context. However, I can for the latter. In particular, a crucial part of Integrated Information Theory is the notion of a measurement of consciousness referred to as Φ (phi). From a philosophical point of view, the notion of measuring consciousness in non-human things, even things seeming completely non-sentient, would drastically change the world view of many. Growing up Buddhist, especially in Hawaii, that notion isn’t so foreign to me. From a BI practitioner’s point of view, this is very compelling since I’ve always thought of each business as an individual organism competing in various levels of ecosystems.

The Scientific American Mind article  is actually an excerpt from Christof Koch’s book, Consciousness: Confessions of a Romantic Reductionist. I downloaded it onto my Kindle as soon as I could and blasted through it over Christmas Eve and Christmas.

Whether or not something is “conscious” as people are conscious, the notion of measuring the factors of consciousness as a measure of an enterprise’s ability to proactively evolve and get itself out of messes could be extremely compelling for a business. This notion is an extension of the foundation of my approach to BI, that businesses are like organisms. This metaphor has always been helpful in guiding me through the progression of BI over the past 15 years or so, as well as my earlier days with expert systems 25 years or so ago.

I know that sounds like useless BS, that I had too much time and maybe “Christmas cheer” over Christmas vacation. We don’t really even know what consciousness is in people, much less in an entity such as a business. Further it’s extremely difficult for most people to accept that anything but humans could possibly be conscious. Please bear with me a bit as thinking through the question of whether a business is conscious is related to the question of whether improving the aspects of sentience can serve a business as well as it has served humans.

So throughout this blog, I’ll stop apologizing every other sentence and assume that consciousness is something by far most highly developed in humans and that it is in fact the primary factor for our success as a species. Take this all with a grain of salt. This is a reset of how to approach BI, stepping out of the weeds.

To me, as the developer of Map Rock, this means almost everything. Map Rock is about the integration of rules across an enterprise (a complex system), just as our symbolic-thinking consciousness is all about the integration of the brain across regions.

Businesses as Organisms

Like organic creatures, whether individuals or groups of creatures (hive, pride, species, or some higher level individuals such as dogs and great white sharks), businesses share major characteristics such as goals, strategies, complexity, intelligence, the ability to evolve, and maybe even “feel”. Those characteristics range widely in level; some businesses are more complex, some less, some more intelligent, some less.

The first four characteristics, goals, strategies, evolution, and complexity, are rather easy to understand and buy into. Businesses exist for a reason, therefore they have goals. This usually means to turn a profit for the owners or to improve the condition of a cause for non-profits. Reaching these goals means achieving some desired state in a complex system. It is accomplished through a strategy of taking in raw materials and converting them into something leading it towards its goals. Strategies involve a hodgepodge of many moving parts (usually complex, at least complicated) such as employees, software applications, machines, partners, and of course customers orchestrated in workflows.

Eventually strategies cease to work due to relentless changes in the world, and the business must adapt. Or sometimes the business is attacked at a vulnerable point, whereby it defends itself, and makes adjustments to strengthen that point. It evolves. In the case of the humans (highly conscious symbolic thinkers) we can proactively adapt in the face of predicted change.

The Magic of the Whole is Greater than the Sum of Its Parts

The intelligence of a business is mostly tied up in the heads of decision makers, mostly “bosses”. However, with the aid of Business Intelligence it’s increasingly feasible to de-centralize that intelligence, delegating more and more decision making to non-manager information workers. Additionally, certain aspects of BI-class applications such as predictive analytics models create non-human intelligence as well, encoded as objects such as regression formulas and sets of IF-THEN rules.

The sum of these individual intelligent components (the employees of the business and “smart” applications) does not equate to the intelligence at the business level. Even though this is an apples and oranges comparison of intelligence (like comparing the intelligence of your immune system to the intelligence from your brain), unfortunately the sum of the intelligence of the parts, inadequately integrated, is still greater than the whole (real intelligence doesn’t currently emerge in businesses today). In other words, businesses currently lacking adequate integration of intelligent parts are generally stupider than the collective intelligence of the parts. Genuine magic happens when the whole is greater than the sum of its parts.

To expand on the italicized term in the previous paragraph, inadequately integrated, the integration must be sensible as well. For example, a “bag o’ parts” is not the same as a properly assembled and tuned automobile that actually works. For a business, integration is the level of cross-entity reach within the collection and assemblage of the web of validated cause and effect throughout the business. “Cross entity” means not just cause and effect amongst team members or inter-department, but team to team across departments, role to role across departments, etc. “Validated cause and effect” refers to the data-driven proof of the validity of theories underlying the strategies that dictate our actions. I wrote more about this in my blog, The Effect Correlation Score for KPIs.

Unfortunately, I’ve experienced in the BI world an addiction to quantified values: values must be deterministic to be valid. It isn’t difficult to see why this is the case, as BI is based on computers and the quantitative is what computers excel at. Quantitative values are appealing because they are easier to understand than ambiguous qualitative “stories”. “Gut decisions” (intuition) are the sworn enemy, inferior to data-driven decisions, because there is no elucidated process consisting of verifiable, deterministic values.

The thing is, neither quantitative nor qualitative (data-driven versus intuition) analysis is superior to the other. They are the crests and troughs of the iterative analytical process. They form a yin and yang cycle: Qualitative values address the fact that the world is complex and objects are constantly in flux. However, it’s difficult to make decisions based on such ambiguity. Therefore, we discretize objects into unambiguous entities so we’re able to execute our symbolic what-if experiments.

The problem with these discretized objects is that they are now a sterilized, hard definition of something, stripped of any “flair”, stripped of any fuzziness (reminds me of particle-wave duality). These hard values are now carved into square pegs for square holes, poorly able to deal with the ambiguity of real life. Everything is its full set of relationships, not a subset of what at some point seems like the most salient. What we’ve done is to define objects as particular states of a finite set of relationships (snapshots). Long story short, eventually we begin trying to shove rhomboid pegs into square holes.

It’s also interesting to consider that the results of Predictive Analytics models come with a probability attached to them. This is because a PA model only sees things from a small number of angles. In the real world, anything we are attempting to predict can happen from an infinite number of angles. In this context, we can think of an “angle” as a point of view, or a fixed set of inputs. This is the result of quantifying the inputs, stripping what appear to be non-salient relationships, reducing it to an unqualified, non-ambiguous object. We can consider those supposedly non-salient relationships as “context”, that unique set of circumstances around the phenomenon (the object we’re quantifying) that makes an instance in time unique even if the same object otherwise appears identical some time later.

Quantification really is a double-edged sword. On one hand, it is the very thing that enables symbolic thinking, the ability for us to play what-if games in our heads before committing to a physically irreversible action. On the other hand, that quantification strips out the context, leading us to only probable answers. The real world in which businesses live isn’t a science lab where we can completely isolate an experiment from variance, thereby declaring that the definition of insanity is to keep doing the same thing expecting different results. In the real world, that is the definition of naïve.

Quantitative is the norm in the BI world. For example, I’ve become too used to the OLAP context of the word “aggregate”. These are sums, averages, maximum or minimum values, or counts over a large number of facts. These values still compare apples to apples values (ex, sum of all sales in CA to sum of sales in NY). Another is the knee-jerk rejection of the use of neural networks in predictive analytics because their results cannot be readily understood.

So it’s not always the case that the whole is greater than the sum of its parts. It’s more that the whole is at least very different from the sum of its parts. It is more like the chemistry context of a compound – water has very different properties than hydrogen and oxygen. Think of the great rock bands such as the Eagles, Fleetwood Mac, or the Beatles where none of them individually are the greatest.

Addressing the development and management of goals, strategies, complexity, evolution, and intelligence are where we BI/Analytics folks make our contributions. These are aspects of business that we can wrap our human brains and arms around towards improvement. I still wouldn’t go so far as claiming a business is alive like an organic creature, even though it acts like one; albeit, again, maybe a little on the zombie side. Setting aside that “alive” issue for now, do businesses possess some concept of “feeling”?  It’s that elusive thing that’s hardly even considered in the context of a business. And if it is, it is readily dismissed as just an irrelevant, indulgent exercise.

Our feelings and the feelings of others (people we deal with) do matter (well, for the most part) in the course of our daily grind. But the feelings of animals, if we were to consider they have some level of feelings, with the exception of our pets, don’t matter as much and often not at all. We’ve all anthropomorphized inanimate things (such as stuffed animals, cool cars, laptops) attributing them with feelings, but it’s really our feelings towards those things that matter, not the imagined feelings of those inanimate things.

At least, businesses can be in anthropomorphized states of “feeling”. For example, businesses can be in a state of pain when they’re losing money, in an ill state when the equipment and/or employees are wearing down, in a state of happiness when goals are met, and in a state of confidence when they are secure and will reach goals in the near future. But being in a state isn’t the same thing as feeling it. A KPI on a scorecard can express a state of pain with a red icon, but is it felt?

Certainly the humans working in the business are conscious, but that doesn’t mean there is a consciousness at the business level. Even if there is one, that doesn’t mean we (the human parts) can converse with it. Similarly, whether my organs such as my liver and heart have a consciousness of their own, I seriously doubt they are aware of or could even comprehend my life as a complete human being.

I call the properties emerging from the interaction of things (the whole is greater than the sum of its parts) “magic” because it really is a gift that in some sense defies all the conservation of energy, zero-sum thinking permeating our lives. It’s like the “magic” of compounded interest. We’re all aware of this concept and hear someone mention it almost daily, but probably don’t appreciate it and strive for it as much as we should.

We also see it very often at the level of the teams we work on, whether it’s a rock band, a basketball team, or a BI development team. It’s fairly easy to see at the team level because we as individuals are directly a part of it. It becomes trickier at higher scales (teams of teams) because then it’s not directly observed. The point of this blog is to explore whether we can measure this, and the value of measuring it, at scales way bigger than teams; teams of teams of teams …

The Value of Φ to a Business

Does it matter if businesses have consciousness? Admittedly, the question of consciousness, whether people or whatever, was to me just an academic exercise. I thought of it as a very tough problem that wasn’t standing in the way of making technical progress towards smarter software.

I don’t think any business today is conscious, but I do think businesses could be driven towards consciousness and that progress in that direction would be beneficial. It is just an academic exercise from the standpoint of how measuring a business’s level of consciousness can help the business resolve current problems. However, to continue the analogy of a business being an organism, I first ask if consciousness has provided benefits to humans. For the sake of argument, let’s say that we humans are the most successful creatures and we are the most conscious. Is that directly linked?

Consciousness, self-awareness, enables us to imagine and perform symbolic what-if experiments in our heads before we commit to a physically irreversible action. That’s the game-changer because the world is complex and Black Swans, big and small, will constantly bombard businesses – like strokes and mini-strokes. Without the symbolic thinking afforded to us by consciousness and self-awareness, we would be at the mercy of the mathematical outcomes of the algorithms in our heads.

Would it be sensible to say that if a business were conscious, it would have the same superior capability of generating strategies in a complex, only semi-predictable world that humans have? To me, that is the skill of skills. Consciousness is the secret sauce that makes humans resilient to the ambiguities of the complex world in which we live during our relatively short existence. It is the ability to fish; even if you happen to lose a fish, you’re not dead, you can still fish. When (not “if”) a business’ strategy starts to fail, it need not worry too much if it has a superior capability to develop and implement a new strategy.

If a business were indeed conscious, even if we individual humans were not able to interact with that consciousness (the voice of the CEO is not the same thing), would it still be valuable to attempt to nurture that consciousness (i.e. maintain a good value for the Φ KPI)? If the answer is yes, how could we do that?

The Term, “Actionable”

Before moving on to what it takes to calculate Φ for a business, I need to clarify that this is not what is usually thought of as an “actionable” KPI. I still read very many articles, mostly Big Data articles, insisting that the only good information is actionable information. That is, we can readily do something with it; otherwise, it’s merely interesting. I don’t disagree. However, I think we need to consider that life throws curve balls, and that it’s the human ability to deal with the unknown and ambiguous, to imagine something that isn’t right in front of us, that sets us apart from other species. It’s survival of the most adaptable and lucky, not survival of the fittest (see Note 1). “Fittest” is a definition that is constantly in flux. Regarding the luck component, as they say, “we make our own luck” through our ingenuity and agility.

When I read articles describing “actionable information”, the context usually implies that it means we can apply it to a clear and present problem. What can you do for me now regarding the problems that kept me up all last night? This mentality is dangerous because it strips imagination, the most prized human intellectual asset, out of the workflow. Remember, it’s often the un-actionable but still interesting pieces of information that allow us to step outside of the box and say, “but what if …”. We then avoid the side-swipe from that nasty startup.

Being a stickler for actionable information is seductive because it’s easier to deal with a clear and present problem than to worry about problems that could happen (see Note 2). It’s scary to get up in the morning needing to invent a new, un-Google-able (no one has yet figured it out) way to do something. Focusing completely on actionable information is smart as long as we’re in evolutionary, not revolutionary, mode (see Note 3). The length of these periods is unpredictable. A period of evolution could last decades or days. But at any given moment, surprises big and small, good and bad, deadly or merely inconvenient, can show up. The complexity of the world guarantees that.

Calculating Φ for a Business

Consciousness, at least in the context of this blog, is a complex system’s ability to have a big picture feeling of its state. What we’re measuring is the level of quality of that “feeling” from that complex system.

Calculating Φ, in all honesty and as an extreme understatement, isn’t easy. Our own individual consciousness is built up over a long period of time using input and processing equipment that in most ways greatly exceeds that of computers and sensors, and it is submersed in an environment (society) vastly richer than sterile databases (even Big Data databases).

Unfortunately, there are no direct formulas already developed for calculating Φ for any consciousness, whether of people, animals, or a business enterprise. There is, though, high-level guidance, even if following it is tedious at this time. In an ideal world, the idea is to inventory the actors, the relationships between the actors, and every possible state of each actor in the enterprise, resulting in a very messy graph of cause and effect. To make it worse, actors don’t stop at the level of individual people or software applications. There are lower-level actors within those actors.

I know very well how difficult it is to build such a graph, even if we were just talking about a family tree, because I’ve built a few. Mapping out such relationships in the SQL Server engine was extremely daunting. Documenting a heterogeneous mix of relationships among the large number of agents in even a small enterprise is infeasible if not impossible. Just for the sake of argument, let’s say we did manage to do this. The Φ value would, roughly speaking, measure the level of integration across all aspects of the graph. To put it in terms of our own consciousness, we know there is a difference between just knowing a list of historical facts and integrating those facts into a story serving as a metaphoric scaffolding for a problem we currently face.

A feasible but still very non-trivial process would involve specialized software and is dependent upon the maturity of the business’ IT, especially BI, Business Process Management, Collaboration (such as SharePoint), and Performance Management. Such systems have captured much of the essence of an enterprise and can even be thought of as the IT side of a business’ neocortex. However, these systems are not yet purposefully integrated as a mainstream practice, meaning the knowledge exists in silos. In fact, integrating that knowledge is the purpose of Map Rock. But before diving in a little more, I’ll just mention for now that the third level is really a set (“set” implies un-integrated) of baby steps that I’ll describe soon.

For example, creating a graph of communication (email, IM, meetings on the calendar) between employees shows connections between people. Aggregating it on attributes of the employees, we can determine the levels of inter-department communication. We can also identify groupings of people within the communications; these groups are “virtual” teams. Based on titles or roles, we can determine cross-functional communication. This acts as a reasonable facsimile of cataloging everything everyone knows. Mining collaboration software such as SharePoint, we could find relationships such as people working on the same document.

For that graph of communication, what we’re looking for are relationships. We may not even know the nature of the communication, but if we know the Inventory department is talking to the Sales department, chances are good that each department is making decisions with consideration for the other department’s needs. We would know that a senior engineer from IT regularly talks to the CIO.
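To make this concrete, here is a minimal sketch (my own illustration in Python with the networkx library, not Map Rock itself) of rolling raw communication events up into an inter-department view and letting “virtual teams” fall out of community detection. The names, departments, and message counts are invented for the example.

# A minimal sketch (not Map Rock): build a communication graph from raw
# email/IM/meeting events and roll it up to the department level.
# All names, departments, and counts are hypothetical.
from collections import Counter
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# (sender, receiver, message_count) tuples: hypothetical communication events
events = [("alice", "bob", 42), ("alice", "carol", 7),
          ("bob", "dave", 15), ("carol", "dave", 30), ("erin", "dave", 3)]
dept = {"alice": "Sales", "bob": "Inventory", "carol": "Sales",
        "dave": "Inventory", "erin": "IT"}

# Person-level graph, weighted by how often two people communicate
g = nx.Graph()
for a, b, n in events:
    w = g.get_edge_data(a, b, default={"weight": 0})["weight"]
    g.add_edge(a, b, weight=w + n)

# Roll up to the department level to gauge inter-department communication
dept_comm = Counter()
for a, b, data in g.edges(data=True):
    if dept[a] != dept[b]:
        dept_comm[frozenset((dept[a], dept[b]))] += data["weight"]
print(dict(dept_comm))

# Heavily communicating clusters of people, i.e. the "virtual" teams
print(list(greedy_modularity_communities(g, weight="weight")))

The same rollup could be done on titles or roles instead of departments to surface the cross-functional communication mentioned above.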

Map Rock

The idea is that we’re building a rich, extensive, highly-connected web of cause and effect. This “database” of cause and effect is the experiences of a business. Collecting these experiences is only one side of the coin. The equally important other side of the coin is the maintenance of the web; the expiration, updating, deletion, or consolidation of the rules and statistics.

I designed and developed Map Rock from 2010 through 2013. I consider Map Rock a child of the “Big Data” buzzword age, but I was shooting downstream along the analytics workflow from Hadoop and a couple years into the future. At the time, as a long-time BI practitioner, I well understood that most companies haven’t even come close to exhausting what they could do with the data already at their disposal.

Would simply more data (Big Data) drastically improve the quality of insights from analytics? Certainly the incorporation of very granular data from all sorts of sources and semi-structured or even unstructured data trapped in documents and pictures would greatly enhance analytical capabilities. But more data doesn’t mean greater understanding. It’s relationships, strings or webs of cause and effect, that are the molecules of intelligence.

Departments within an enterprise are usually headed by a director or VP. Because of the hierarchical nature of the organization, the tendency (not necessarily the intent) is for each department to operate as if they are an independent company serving their “customers” (other departments). Orders are given from above and that drives an information worker’s actions. Even an awareness of a pain in another department is placed down the priority list, as the worker has enough problems of their own to deal with.

The result in the end is that this dis-integration leads companies to behave like puppets, as I mentioned at the beginning, where the actions are controlled loosely from above. They work well enough to hobble down the street, but not quite like world-class athletes. As entities, businesses aren’t very smart and thus have a mortality rate along the lines of crabs, where for every fully grown crab tens of thousands of larvae perished.

Intelligence exists at a relationship level, not at the data level. Data is simply the state of an agent living in a system of relationships. Whenever we reason our way through a problem or create a solution, it is all based on a web of cause and effect slowly charted into our heads over the course of our lives (our experiences).

Catalogs, lists, and tables of data in our OLTP systems consisting of unrelated rows are not relationships. They are simply repositories placed together for convenience; they say no more about the enterprise than a list of alphabetically ordered names in a phone book says about its community. The row in a typical customer database table for Eugene Asahara may store relationships among the attributes of Eugene Asahara, but there are no relationships between that row and the row for someone else. However, the fact that I am a customer may have relevance to the person who sold me something.

Relationships are rules. These rules are the experiences in your head associating all sorts of things, and the predictive analytics models which precipitated relationships out of a slurry of data. In a business, these rules are distributed amongst the brains of every employee and every model (such as predictive analytics models or relational schemas), and each is more an island than part of a supercontinent where species roam wherever they can (see Note 4).

Agents

Before we can catalog relationships, we need to catalog the things we are relating: the agents in a system. Agents are the players, the tangible and intangible things in a complex system. They can range from as elemental and tangible as a document to as complex and intangible as a supply chain or a major event. It’s important to realize too that agents can be part of many different macro agents; for example, a person can be an individual, part of a team, and part of a department. Conversely, agents may not be as elemental as you would think either.

The direction of the communication matters as well. Obviously line employees receive communication from a chain spanning from their manager all the way up to the CEO. But one-way communication doesn’t promote informed decisions.

Events are really objects in a sense as well. I think of an “event” as an object that is ephemeral and consists of a loose set of characteristics. By contrast, the attributes of a car or a building are tightly tied together in terms of time and space (my arm is definitely a part of me and, barring unforeseen circumstances, always will be).

Machine-Stored Relationships

As mentioned, the intelligence of a business doesn’t only lie in the heads of humans. At most enterprises, there are many BI/Analytics objects possessing a level of intelligence.

Predictive Analytics models are the prime example of units of machine intelligence, even though humans have a hand in their creation to varying extents. PA models are machine- or human-“authored” clusters of IF-THEN rules and calculations that take inputs and spit out best guesses, either conveyed to a human or, in many cases, acted on automatically. They range from “white box” components where we can clearly see what goes on, to “dark gray boxes” (not quite “black boxes” where the internals are completely unknown) such as neural networks, where it’s very difficult to figure out what actually goes on.

The many Predictive Analytics models scattered across enterprises form an un-integrated web of knowledge that is very reminiscent of the structure of our brains. Neurons are not simple gates but complex mechanisms in themselves. Like neurons, PA models take many varied sets of inputs and spit out some answer. That answer, the resulting mash from many inputs, is in turn the input for a subsequent decision. Integration of these models, not just the storage of the metadata of all the models, is what the Prediction Performance Cube is about.
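As a toy illustration of models feeding models (my own sketch with made-up features and synthetic labels, not the Prediction Performance Cube), here the best guess from one scikit-learn model becomes an input to the next decision:

# One model's "best guess" becomes the next model's input, a crude analogy
# to neurons feeding neurons. All data below is synthetic and hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Model 1: predict the probability a customer churns from two made-up features
X1 = rng.normal(size=(200, 2))
y1 = (X1[:, 1] > X1[:, 0]).astype(int)          # synthetic labels
churn_model = LogisticRegression().fit(X1, y1)

# Model 2: decide whether to offer a discount, using model 1's output
# (the churn probability) alongside the customer's margin as inputs
churn_prob = churn_model.predict_proba(X1)[:, 1]
margin = rng.normal(loc=0.3, scale=0.1, size=200)
X2 = np.column_stack([churn_prob, margin])
y2 = ((churn_prob > 0.5) & (margin > 0.3)).astype(int)  # synthetic policy
offer_model = LogisticRegression().fit(X2, y2)

print(offer_model.predict(X2[:5]))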

OLAP cubes and departmental data marts hold some level of intelligence even though they are primarily just data. They store what has been vetted (during the requirements-gathering phase of the project) as analytically important at a departmental level (a subset of the full enterprise). For example, OLAP attributes provide a clue as to how agents are categorized, and the hierarchies show taxonomies. In other words, OLAP cubes and data marts don’t store the entire enterprise universe; we select data that is probably analytically useful. Browsing through the OLAP cubes sheds light on what is important to the departments that built them.

In Business Problem Silos I describe how departments create department-level BI (cubes or data marts) that integrates data from two or more data silos to resolve departmental business problems, but we end up with a different kind of silo: a business problem silo.

Naturally, something with the word “rule” in its name, such as Business Rules Management, must hold rules. IBM’s ILOG is a great example of rules management software. Such business rules are usually intimately tied into a Business Process Management implementation as the decision mechanisms determining the next steps in a process flow.

Human-Stored Rules

Of course, the vast majority of the intelligence of a business lies in the heads of its employees. Machine-stored rules cannot even begin to match the nuance, flexibility, and richness of relationships and information in a human brain. The problem is that even though the intelligence of a human is richer, one human can’t capture everything in an enterprise. Not even the enterprise-wide scope of a CEO can match the collective intelligence of the narrowly-scoped “line workers”. Instead, the CEO sees a higher-level, more abstract, wider but shallower picture than the narrower but deeper intelligence of a line worker.

Not every employee can know everything and no one should need to know everything. But we must be cognizant of the side-effects of our actions. My discussion on KPIs below addresses that.

Human and Machine Interaction

In most enterprises, Performance Management, along with every worker’s assigned KPIs, is a very well-socialized part of our work lives. KPIs are assigned (related) to people or machines, measuring our efforts. They are like nerves informing us of pleasure or pain and its direction. I’ve written a few other blogs on KPI relationships that share the common theme of integrating KPIs into a central nervous system:

  • KPI Cause and Effect Graph Part 1 and Part 2. These old blogs take the notion of KPIs as nerves to the next level, attempting to integrate them to at least some degree as a nervous system, by decomposing the calculations of the target, value, and status and mapping KPIs into a web of cause and effect at that more granular level.
  • Effect Correlation Score for KPIs. This blog proposes another component of the KPI (besides value, target, status, and trend) that measures whether meeting a KPI target still has the effect it is supposed to have. For example, does increasing employee satisfaction still lead to an increase in customer satisfaction? (A minimal sketch of the idea follows this list.)
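Here is that minimal sketch, assuming two invented KPI series in pandas; the Effect Correlation Score is approximated here as a rolling correlation between a KPI and the effect it is supposed to drive. This is my own simplification, not the implementation from the blog mentioned above.

# A minimal sketch of an "Effect Correlation Score": does meeting one KPI
# still correlate with the effect it is supposed to have? The KPI series
# below are invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
months = pd.date_range("2012-01-01", periods=36, freq="MS")

emp_sat = pd.Series(70 + np.cumsum(rng.normal(0, 1, 36)), index=months)
cust_sat = emp_sat * 0.8 + rng.normal(0, 1, 36)   # tracks employee satisfaction...
cust_sat.iloc[24:] = 60 + rng.normal(0, 1, 12)    # ...until the link breaks in year 3

# Rolling 12-month correlation as a crude Effect Correlation Score
effect_corr = emp_sat.rolling(12).corr(cust_sat)
print(effect_corr.tail(6))   # drifts toward 0 once the intended effect fades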

Realistically, the KPI is the best bet for even beginning to attempt a Φ calculation. As KPIs are well socialized at the workplace, prevalent throughout an enterprise, they already define what is important to just about every breakdown of workers from departments to teams to individuals.

Business Process Management workflows are another fairly well-socialized program of an enterprise, documenting the sequences of events that make up business processes. Workflows include prescriptions for procurement, proposals, filing taxes, performance reviews, manufacturing, and distribution. These are usually inter-department processes, which will relate cause and effect across the departments. However, the workflows that are formalized probably account for just a small fraction of what actually goes on. The vast majority exist in the heads of the people involved in the workflow. This could be worse than it sounds, as many workflows don’t exist in any one person’s head; pieces of them are distributed among many people (meaning people are only aware of their own part of the workflow).

Map Rock is designed to integrate, in both automatic and manual fashion, such sources of relationships into a single repository. For example, a component of Map Rock known as the Prediction Performance Cube automatically collects PMML metadata, training data, test data, and metadata of the training and test data sources. Another component of Map Rock, called the Correlation Grid, is a UI for exploring correlations among dimensions in different cubes.

Map Rock converts these rules, from whatever sources, into my SCL language, which is used as the lowest common denominator. SCL is based on the old Prolog language of the AI world. Beyond being a .NET implementation of Prolog, it is modified for the distributed world of today as opposed to the more monolithic world of the 1980s.

Putting it All Together

A suggested basis for a calculation of Φ is a graph of communication ties between decision makers, down the traditional hierarchical chains of command as well as across the departmental silos. These ties are the relationships described in the sections above, drawn from email, meetings, KPIs, workflows, etc.

Those are the ties that do exist, but we also need to figure out what should exist. In a nutshell, the Φ calculation is a comparison of what exists and what should exist. I need to mention that what should exist is different from what could exist. What could exist is a Cartesian product of every agent in the system, a number so great it’s unusable. Determining what should exist is extremely difficult, akin to “we don’t know what we don’t know”.

There will be four categories of ties:

  1. Known ties that should exist. These are the ties that we’ve discovered from KPIs, workflows, email, etc. that should exist because there is information that must be passed.
  2. Known ties that do not need to exist. These are ties for which there is no required formal communication. Examples could be email amongst employees who are friends but never work with each other, or someone subscribing to updates on something in which he is not involved. These are not necessarily useless ties. The interest shown by an employee subscribing to updates unrelated to his role demonstrates a possible future resource.
  3. Ties that should exist, but don’t.
  4. Ties that could exist, but don’t. These make up the vast majority of conceivable ties. #1, #2, and #3 account for such a small minority that the overall graph is considered “sparse”.

Ties between “should” and “could”, whether they exist in our enterprise or not, are interesting since these are the ties that can lead to innovation. The ties that should exist are in existence because they are part of formal processes already in place. But when we need to change processes to tackle a competitor, or simply to make an incremental improvement, we’ll need to explore these other ties. So ideally, our graph would include some ties that aren’t currently required. A good place to start would be ties involving agents from #2, those agents with ties that aren’t mapped to a formal process.

What would the Φ calculation look like? Imagine a graph of the agents of our enterprise and the relationships between them. Imagine that each relationship is a color along a spectrum from green to yellow to red, where the extreme of green represents excellent communication and the extreme of red represents no communication where there should be communication.

The calculation of Φ would range from greenish (a score close to 1 for generally good communication where there should be communication) to yellowish (a score of around 0 for OK communication) to reddish (a score close to -1 for generally poor communication where there should be communication) across the entire graph. If your graph ends up greenish but that doesn’t correspond to your experience that the enterprise is dysfunctional, it would mean there is a deficiency in identifying the ties that should exist.
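As a toy illustration (my own sketch, not a formula from the integrated information theory literature), suppose each tie that should exist carries a communication-quality score from -1 (needed but absent) through 0 (OK) to 1 (excellent), optionally weighted by importance. A crude enterprise-level Φ is then just the weighted average over the “should exist” ties, ignoring the sparse ocean of ties that merely could exist:

# A toy, hypothetical Phi score: average communication quality over the ties
# that *should* exist, yielding a number in [-1, 1]. All ties, weights, and
# qualities below are invented.
from dataclasses import dataclass

@dataclass
class Tie:
    a: str                 # agent on one end (person, team, KPI, workflow step...)
    b: str                 # agent on the other end
    should_exist: bool
    quality: float         # -1 = no communication where needed, 0 = OK, 1 = excellent
    weight: float = 1.0    # importance, e.g. from a KPI cause-and-effect graph

ties = [
    Tie("Sales", "Inventory", True, 0.7, weight=2.0),
    Tie("Sales", "Finance", True, -0.4),
    Tie("IT", "CIO", True, 0.9),
    Tie("Facilities", "Marketing", False, 0.0),   # merely "could exist"; ignored below
]

def phi(ties):
    needed = [t for t in ties if t.should_exist]
    if not needed:
        return 0.0
    return sum(t.quality * t.weight for t in needed) / sum(t.weight for t in needed)

print(round(phi(ties), 3))   # greenish > 0, reddish < 0

Of course, the hard part isn’t this arithmetic; it’s populating the list of ties that should exist and scoring their quality, which is what the sources described above (KPIs, workflows, communication graphs) are for.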

A good starting place for determining which ties should exist would be the KPI Cause and Effect graph I mentioned above. Going further with the KPIs, Map Rock includes a feature whereby we can view a matrix of the KPIs where each cell displays the level of correlation between the two KPIs. This will not necessarily identify all ties that exist, but it may show some surprises.
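A bare-bones version of that matrix (a sketch with invented KPI histories, not Map Rock’s Correlation Grid) is just a pairwise correlation over the KPI values, with strongly correlated pairs flagged as candidate ties:

# A bare-bones KPI-to-KPI correlation matrix; KPI names and values are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
kpis = pd.DataFrame({
    "OnTimeDelivery": rng.normal(95, 2, 24),
    "CustomerSat":    rng.normal(80, 3, 24),
    "InventoryTurns": rng.normal(6, 0.5, 24),
})
kpis["CustomerSat"] += 0.8 * (kpis["OnTimeDelivery"] - 95)   # plant a relationship

corr = kpis.corr()
print(corr.round(2))

# Candidate ties: strongly correlated KPI pairs that may deserve a closer look
strong = [(a, b, round(corr.loc[a, b], 2))
          for a in corr.index for b in corr.columns
          if a < b and abs(corr.loc[a, b]) > 0.5]
print(strong)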

Back to the graph. We could start this graph of ties with the chain of command, the “org chart” tree, from the CEO at the top to the line workers with no one reporting to them. It’s a good starting point because every company has a chain of command, which forms a natural route of communication.

The next step is to identify the decision makers. Although both decision makers and non-decision makers require information to operate, the decision makers are the ones who initiate changes to the system. However, in reality everyone makes decisions to some extent, so this is more of a fuzzy score than a binary yes or no.

As a first pass, it’s safe enough to assume that anyone with direct reports makes decisions. But among non-managers, there are many who primarily operate on temporary, cross-functional project teams consisting of members from across departments. In these cases, they may not have formal reporting relationships, but they do make decisions within the team. For example, a “BI Architect” may not be a manager, but may decide on the components to be utilized. It’s tougher in these cases to automatically infer the decision makers.
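A first pass at that “anyone with direct reports” heuristic might look like the following sketch, with made-up names and a fuzzy score rather than a yes/no flag; cross-functional cases like the BI Architect would need additional signals (project memberships, meeting invitations) layered on top.

# First-pass decision-maker scoring from an org chart: anyone with direct
# reports gets a score, scaled by how many. Names and reporting lines are invented.
import networkx as nx

org = nx.DiGraph()   # edge direction: manager -> direct report
org.add_edges_from([
    ("ceo", "vp_sales"), ("ceo", "cio"),
    ("vp_sales", "sales_mgr"), ("sales_mgr", "rep1"), ("sales_mgr", "rep2"),
    ("cio", "bi_architect"),
])

def decision_score(g, person):
    reports = g.out_degree(person)
    return min(1.0, reports / 5)   # crude fuzzy score: 5+ reports ~ full score

scores = {p: decision_score(org, p) for p in org.nodes}
print(scores)   # non-managers like 'rep1' score 0; roles like 'bi_architect'
                # would need other signals to be scored fairly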

In addition to identifying decision makers, we need to measure the strength of communication between the decision makers and those affected by any changes. This is an important aspect: a tie we know should exist may be there, yet be poor. This is in some ways worse than not realizing a tie should exist at all. If a decision maker issues direction to someone and believes the direction was received, the decision maker may go on assuming her instructions are being carried out.

In the end, we should end up with a graph illustrating strong connections between many parts, with little in the way of weak ties or ties that should exist but don’t. The thought processes behind such a graph are the bread and butter of companies like Facebook and LinkedIn. But the networks within enterprises are the soul of corporations.

For studying networks, particularly their social network aspects, there is no better place to start than with the work of Marc Smith. Marc has also led the development of the Excel add-in known as NodeXL, a visualization tool for displaying graph structures.

Baby Steps Towards Φ Calculation

All of that is what Map Rock is about. Duplicating what I’ve developed in Map Rock would be daunting at the time of this writing. However, there are baby steps that should be very informative in helping to improve the agility of an enterprise. For the most part, these baby steps consist of the pieces I described above, such as studying social communication networks, developing a KPI Cause and Effect graph, implementing the Effect Correlation Score, and creating a Predictive Analytics metadata repository. Still, without Map Rock these parts won’t be integrated with each other; they remain an un-integrated set of tools.

Conclusion

So there it is, my attempt to step far back, looking at what it is we’re really trying to do with Big Data. From my point of view, we really are trying to apply to our businesses the same technique that made humans the dominant species at this time. That is, we’ve built the mechanisms for dealing with ambiguity. Therefore, with the proper implementation of Big Data we can much better deal with the roadblocks along the way.

The NoSQL world, particularly Neo4j, the NoSQL graph database offering, should finally break down some walls in better embracing the complexity of the world. Graphs, the truer representation of the world, the structure of workflows, semantics, taxonomies, etc., will finally take their place at the center of our “data processing”. Perhaps with Neo4j, the visions of the Semantic Web can finally help us make real progress.

Lastly, I apologize that this blog isn’t a step-by-step cookbook; that would indeed require a book. This blog ended up way larger than I had intended. My intent is to toss out this idea as a high-level description and drill down further in future blogs.

Notes:

1. I asked my favorite judo instructor, “What is the most important skill, if you could pick one?” I expected him to tell me that it is a useless question completely missing the point, that it’s a balance of many things, and that I’d failed to think in a systemic way. But surprisingly, without a blink he said flexibility. Not technique, not strength, not size, not endurance, not balance, not tenacity. Certainly, they are all factors. He said, “With flexibility you can do all sorts of things your opponent had not thought of.”

2. I know it’s out of fashion to worry about problems that haven’t yet popped up; that we should focus on the now. However, there is a difference between being in the now because we’re:

    1. Incapable of predicting what will come next. So, like honey bees, we put faith in the math behind the rules. This is similar to how casinos put faith in the math behind blackjack: in the end they will win.
    2. Deluded into thinking that if it hasn’t yet happened (i.e. the experience isn’t in my brain) it couldn’t possibly happen. A variation is, “No one is asking for that.”
    3. So busy with what we already have on our plate that anything not currently in our face goes down the task list.

and being in the now because we’re confident that we can get ourselves out of a jam if the time comes. That confidence comes from having packed our brains with an extensive library of experiences from which we can draw analogies to future problems, from being fully aware of our surroundings while fully engaged with the task at this instant, and from fully accepting the outcome rather than dwelling in the past. In other words, we’re satisfied that we’ve prepared as best as we could for whatever comes about.

3. I don’t like the phrase, “It’s evolution, not revolution”, uttered by the VP of the department I worked in a while back. The sentiment seemed to be the antithesis of high tech. Progress involves cycles of long periods of evolution and short bursts of revolution (even though I’m pretty sure that wasn’t the message of that VP). It’s a balancing act: progress requires some level of focus towards a goal, yet the evolution eventually leads to some kind of stasis (calling for proactive revolution) or crisis (calling for reactive revolution). During evolution, distractions from elsewhere impede progress. However, when stasis or crisis arrives and there is a need to move on, integration of knowledge fertilizes the system.

4. Neither extreme, island or supercontinent, is good. Rather, there is something in between. There needs to be some level of isolation to allow diversity to first build, then for small, diverse populations to mature. Either a supercontinent separates into little islands allowing diversity to take shape, then converges to let the best one win, or there is some natural barrier whereby weak links can occasionally travel to other populations, infusing some new ideas. Too much integration is really just chaos; too little will eventually lead to stasis.


Micron’s Automata – A Rare Big Thing Out of Boise

Not many huge things come out of Boise, and that is the very reason I live here. But something that should disrupt the semiconductor industry (the world, in fact) is Micron’s Automata. It’s been occupying much of my thinking since it was announced a few weeks ago. The Idaho Statesman’s front-page headline today is Big Data, Big Move for Micron with New Processor.

In a nutshell, this is hybrid memory capable of having a large number of custom operations programmed directly onto it, operations that run in massively parallel fashion. From a software developer’s point of view, the parallelism is the key value. But from an IT guy’s point of view, not needing to depend on a trip to the CPU and back for every operation cuts out a ton of processing overhead.

When I’ve told colleagues about this, many see it as just faster throughput that would be beneficial only for a niche class of software applications. This is like the dawn of the jet airplane; it wasn’t just a faster way to get from New York to Los Angeles. Automata isn’t ultimately a niche product only useful for research organizations or the NSA. We all live in a complex world where outcomes depend on things happening in massively parallel fashion.

Really, the way we use computers today is what is niche. They are useful for a subset of relatively simple (still within the realm of “complicated” as opposed to “complex”) problems that are deterministic, well-defined, limited in volume, and pretty linear. We’ve taken these well-defined applications, propagated and socialized them to severe extents, and forgotten that they are really reduced, simplified versions of the real world. Then we consider the applications attempting to embrace real-world complexity as niche.

It will be incredibly interesting to see what players jump onto Automata first. Probably Hadoop with a retrofitted Automata version. But it could be Microsoft’s CEP (complex event processing) offering, StreamInsight, something from Oracle, maybe IBM’s ILog, or a yet unknown startup.

Regarding that yet-unknown startup, it probably won’t be SCL. Ever since I wrote the SCL language, interpreter, and base test-harness UI (Map Rock is in part a rich SCL UI), I’ve taken several approaches to writing a bad-ass “SCL Server”. They actually worked fairly well, but I knew I needed parallelism superior to what was offered by the current methods (lots of servers with lots of CPUs) in order to achieve what I really envisioned. Then Hadoop and all the associated Big Data things came along. Anyway, weighing everything, I decided to aim way downstream (Map Rock).

However, keep in mind that the kind of massively parallel processing that Hadoop can do is different from the sort I’m thinking of. I’m thinking of Hadoop as a homogeneous MPP where a long list of like tasks is divided amongst like processors. The sort I’m thinking of is a much more compelling heterogeneous sort of MPP where thousands of points of view work on streams of data in order to promote robust recognition.

The key question is: can Automata-based Big Data servers resolve problems that are just too expensive (in terms of cost and time) to resolve today? Apparently they can, according to some of the sparse literature on Automata (see slide 12, titled Breakthrough Performance, of this file). If so, that’s disruptive.

Micron readily admits that the concepts behind Automata-like technology have been around for a while. Micron showed a lot of courage sticking to the development of something so different instead of the sort of incremental improvements we’ve been seeing lately. It would be unfair to call technologies like multicore processors and NUMA merely incremental, but Automata is truly fresh and timely.

On a lighter note, related to not much coming out of Boise: on my flight back to Boise from my client site the other day, I overheard something very funny in a conversation behind me. A woman was telling a guy about how people react when we say we’re from Idaho. Of course, it’s always something about potatoes. She said she tells people she’s the Bubba (from Forrest Gump) of potatoes; French fried, au gratin, baked, mashed, … Then others started throwing in their potato styles too: twice-baked, potato skins, potato pancakes …
