
Exploring a Sea of Data with Massive Numbers of RegEx – And Maybe Even the Automata Processor

Overview

This blog explores taking the lid off RegEx (regular expressions) and its less powerful cousin, the LIKE keyword in SQL. By “taking the lid off”, I mean looking for more than one pattern at a time. Indeed, thousands, tens of thousands, or even more patterns. After all, whatever we happen to be seeing with our eyes at any given instant is recognized because our brain checks it against the millions of things we could be seeing, in massively parallel fashion.

Traditionally, we use RegEx and LIKE when we already know the pattern we’re looking for. We just need help wading through a ton of data for that pattern. For example, in a search, we could express that we’re looking for any name containing the characters “hara” and a first name beginning with “E” with this LIKE pattern (or its RegEx equivalent): %hara, E%

Figure 1 shows a sample of the LIKE keyword looking for names that contain “hara, e”. Most search functions in applications offer only a little more flexibility than finding a literal string, mostly the simple Contains or Begins With.

Figure 1 – SQL LIKE keyword.
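Since Figure 1 is a screenshot, here is a minimal sketch of the kind of query it shows; the #Customers temp table and its FullName column are hypothetical stand-ins for whatever table the figure actually queries.

    -- Minimal sketch of the LIKE query in Figure 1.
    -- The #Customers table and FullName column are hypothetical stand-ins.
    CREATE TABLE #Customers (FullName VARCHAR(100));

    INSERT INTO #Customers (FullName)
    VALUES ('O''Hara, Eugene'), ('Ohara, Emily'), ('Smith, Edward');

    -- Find last names containing "hara" with a first name beginning with "E".
    SELECT FullName
    FROM #Customers
    WHERE FullName LIKE '%hara, E%';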

In this case, we’re checking whether any values match a single pattern, one rule applied to very many values. Database people also know that LIKE parameters beginning with % result in poor performance since indexes on the column can’t be utilized.

A little more sophisticated use case would be to find all occurrences of some pattern in a long text. This is the Find function in text editors such as Word. This would result in multiple hits, many occurrences throughout a single long string of text. A further step would be to search a longer text column, such as those of the VARCHAR(MAX) or TEXT types, holding free text notes or descriptions for each row (ex: free form comments on a sales call); multiple hits for multiple rows. But whether we’re searching one big text like a long Word document or the text strings of many rows, we’re still searching for one key word or pattern at a time.

So let’s throw all caution out the window and ponder searching for thousands of patterns occurring in millions of long strings of text. That would normally be implemented as: for each text column in each row, iterate through many patterns. Meaning, if searching 1 GB of text for one pattern takes 1 second, searching for 100 patterns would take about 100 seconds. Yes, we could apply optimization techniques such as parallelizing the task by breaking up the rows into chunks spread across many servers, or apply some logic eliminating some impossibilities upfront. But that’s not the point of this blog. This blog is about two things: why one would want to do such a thing, and how Micron’s Automata Processor offers an amazingly direct path to the ultimate solution to this problem.

I’ll first describe the first time the notion of running very many RegEx expressions over a large amount of data occurred to me in my daily life as a BI consultant. Even though this particular use case is somewhat covered by features in many current “Data Profiling” applications, the use case in which I recognized the need helps to identify metaphorically similar use cases elsewhere.

Before continuing, I’d like to mention a couple of things:

  • The coding samples will be stripped to the bare minimum. For example, I haven’t included parameter validation logic. There are numerous code examples related to most of the topics of this blog such as RegEx using whatever programming language as well as for the SQL LIKE keyword. Additionally, some code involves SQL CLR, and that code is minimized as well since there are many examples of how to register SQL CLR assemblies into SQL Server.
  • I do mention topics on the Theory of Computation, mainly the Turing Machine, Regular Expressions, and Finite State Machines. But I won’t go into them deeply. For the most part, C# developers are familiar enough with RegEx and SQL developers with LIKE.
  • Although the main point of this blog is, “How great is that Automata Processor?”, I don’t actually get to the point of implementing it on the AP. Much of the reason is that I’m still focusing on communicating how to recognize use cases for the AP in a BI environment. Meaning, I’m still trying to sell the folks in the BI world (well, more the bosses of the folks in the BI world) on investing in this very “strange” but amazing technology. Besides, the AP SDK is still in limited preview, but you can ask to register anyway. However, once you’re comfortable with the concepts around finite state automata (the core principle of the AP), authoring them and implementing them on the AP is relatively easy.
  • This blog pulls together many tough subjects from the highly technical levels of Business Intelligence and the Theory of Computing. This means there are many terms that I may not fully define or even take some liberties glossing over a few details, in the name of simplicity.

My Data Exploration Use Case

After an “analytics” consultant understands the client’s problem and requirements and a project is scoped, the consultant (the ones in my usual world anyway) finds themselves sitting in front of a tool for exploring the heap of data, such as SQL Server Management Studio or Aginity. By “heap of data” I mean that, since as a consultant the customer is usually new to me, the people and their roles, and the data and its meanings, are unknown to me. Until I go through the non-trivial process of learning about the data, I’m Lewis and Clark on an exploration journey.

My ability to learn about the data depends upon several factors, all of which (or, surprisingly even today, even a quorum of which) hardly ever exist at the same time:

  • The presence of knowledgeable and patient DBAs or data stewards or developers. Using the plurals illustrates that in BI there are usually a large number of databases scattered all over the enterprise, numbering in the hundreds for a large one. Quite often as well, part of the reason I’m there is because a DBA or analyst moved on, taking all that knowledge trapped in her brain with her.
  • The presence of a Data Dictionary. A data dictionary is a catalog of data sources throughout an enterprise, down to the column levels, including types, descriptions, even a lineage of the data (“Source to Target Mapping”), the valid values for the columns, and keys. This is the other MDM, MetaData Management, not Master Data Management.
  • The “penmanship” of the database designers. The better the names of the tables and columns, the easier it is to explore the data. But even if the tables and columns are well named, they can still sound ambiguous (ex: cost and price). I usually work with a Data Warehouse, which is not in a nice third normal form with primary/foreign key relationships. Adding to that, a Data Warehouse is subject to fast growth without discipline (because “disk storage is cheap”).

This learning about the data is part of a wider task called Data Profiling, for which there are many very good tools on the market. But to me the heart of Data Profiling is something we usually do at the actual analytics stage, after we’ve identified our data and are now analyzing its value towards solving our problem. In the scenario I’m describing, I know what problem I’m addressing, but I still don’t know what data I have to work with.

About a third of the time, my client has a specific data set to explore. “Here’s the last two years of clicks from our web site. Have at it.” Even in those cases, where the data structure is relatively simple and well-known, yes, it would be nice to find the usual patterns in the click streams, but even nicer to correlate those click patterns to something important to the client. Meaning, I’d like to go beyond the usual, looking for other data to think outside of the box of data given to us. So I’m back to searching for what else is out there.

In the end, after exhausting all sources of data information known to me, I’m still usually left with some level of looking for something. The thing about analytics is it’s often about unknown unknowns – I don’t know what I don’t know. And because the nature of analytics, at least applied towards improving success towards some goal, is fuzzy, imprecise, we don’t always know that we’ve done the best job possible. We usually take “good enough” and move on to the next thing.

Too often, in a casual conversation with colleagues at a customer site, a data source of some sort is mentioned and I think back to some earlier project with that customer, “Gee, I wish I knew about that data source back then.” So in an ideal world, I’d want to do an extensive search for opportunities, beyond what is already known. Exploring hundreds of servers for data within a reasonable amount of time for something that may or may not exist doesn’t make sense either. It would be nice if I could inexpensively do that comprehensive exploration.

As we humans go about our daily lives, sure, any given day may seem somewhat routine. We get up, eat breakfast, say good-bye to those at home, head to work, play solitaire, sneak out of the office, etc. But it’s probably less routine than we think. Other people move cars around, leave at different times, the weather is different, some tasks at work are different, our moods and the moods of others are different. So our brains, fed signals through our eyes, ears, nose, tongue, and skin, need to be capable of recognizing all manner of things from all manner of angles and combinations. Our “inputs” don’t see things like cars and other people. They sense light, molecules, sound waves, and physical contact. Each of these signals could represent millions of different things. We usually don’t have time to sequentially scroll down a list of possibilities. Fortunately, all of these possibilities are “considered” by our brain in massively parallel fashion.

 A Single Algorithm for a Wide Range of Rules

Identifying patterns can involve a great number of different types of algorithms. Regular Expressions are one type of algorithm. Calculating pi, the various methods for predicting the weather, all those C# functions you’ve written, and making McDonald’s fries are other examples. Our world of business, society, and the environment is composed of countless executions of algorithms of very many types. Therefore, our current CPUs are based on the Turing Machine, an algorithm of algorithms, which can process just about any algorithm our human brains can imagine (implying there are probably problems we cannot imagine).

Instead of burning hard-wired silicon for each of those countless algorithms we’ve identified, we developed pretty much a single “computing device” (as Mr. Burns may say) capable of executing those processes with a single “algorithm”. We encode those algorithms with sequences of instructions, software.

Similarly, many patterns can be easily or at least relatively easily expressed using the simple Regular Expression algorithm. For example, we can robustly (lots of different formats) recognize sequences of characters such as phone numbers, social security numbers, dates, currency, hexadecimal numbers, credit card numbers, license plate numbers from various states, email addresses, etc. Regular Expression is closely related to Finite State Automata, near the bottom of the “Theory of Computation stack”, the simpler side, the opposite side of the Turing Machine.

Now, the examples I listed are few and don’t constitute the thousands of rules I’ve been mentioning so far in this blog. And that perhaps is one reason RegEx hasn’t exactly blown the minds of programmers over the years, as it would seem to have limited use. However, here are three thoughts on sources of thousands of regular expressions:

  • At a brute force level, every word and number could be a RegEx. Regular expressions encapsulate patterns. The name “Eugene” is indeed a pattern, albeit not very interesting.
  • Many classes of things may not be easily abstracted as a concise regular expression, but if you look hard enough, there can at least be an abstract codification of a subset (a rough sketch follows this list). For example, some names for a culture follow a pattern: the Scandinavian pattern of the first name of your father followed by “sen”. It’s easy to recognize most of the 4-syllable Japanese names as Japanese. Such patterns may not cover all Scandinavian or Japanese names, but they’re still patterns that can be encoded.
  • Most importantly, streams of events can yield a great number of patterns. This is analogous to finding patterns of words close to each other in the long text of a book. But instead of words, they would be events such as the page click patterns normally leading to a purchase or a sequence of “events” often leading to sepsis. This notion of streams of events is actually part of a much bigger discussion, which is mostly out of scope for this blog.
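As a rough illustration of the second point above, here is a hedged sketch in C#; these expressions are only guesses at the sort of “subset” rules one might accumulate over time, not authoritative classifiers of anyone’s name.

    using System;
    using System.Text.RegularExpressions;

    class NamePatternSketch
    {
        static void Main()
        {
            // Hypothetical "subset" patterns -- rough guesses, not definitive rules.
            // Scandinavian patronymic: a capitalized name ending in "sen".
            var scandinavianSen = new Regex(@"^[A-Z][a-z]+sen$");

            // A crude stand-in for romanized 4-syllable Japanese surnames
            // (four consonant+vowel pairs), e.g. "Yamamoto", "Murakami".
            var japaneseFourSyllable = new Regex(@"^[A-Z][aeiou](?:[bcdfghjkmnprstwyz][aeiou]){3}$");

            foreach (var name in new[] { "Andersen", "Larsen", "Yamamoto", "Murakami", "Smith" })
            {
                Console.WriteLine("{0}: Scandinavian={1}, Japanese={2}",
                    name, scandinavianSen.IsMatch(name), japaneseFourSyllable.IsMatch(name));
            }
        }
    }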

One more important note on the last point. For a sequence of events to be of value, the events don’t need to be adjacent. For example, if we wish to find all customers going from meat then to vegetables, we may or may not be interested whether they stopped to pick up black pepper or beer in between. Meaning, some of those events will be noise that should be ignored. That will be especially true when we combine events from heterogeneous sources. For example, studying the sequence of events of one nurse making the rounds wouldn’t provide insight as rich as studying the sequence of events across all care providers (nurses, physical therapists, doctors of various specialties, etc). Regular expressions are versatile enough to ignore events thought to be extraneous to what pattern we’re encoding.

Further, an “event” that’s part of a sequence can be one of many options. For example, an important pattern may be those who went to the meat department first, then to beverages or produce but not baking goods, and finally to pick up wine. Again, regular expressions are versatile enough to encode that sequence. The point is (where a full exploration of this is outside the scope of this blog) that there is very much opportunity to study complex streams of events where a different approach is necessary for query performance suitable for analysis.
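To make that more concrete, here is a hedged sketch in which each department visit is reduced to a single made-up character symbol and one regular expression encodes “meat, then beverages or produce but not baking goods, then wine”, ignoring the noise events in between. The symbol encoding is entirely my own invention for illustration.

    using System;
    using System.Text.RegularExpressions;

    class EventStreamSketch
    {
        static void Main()
        {
            // Arbitrary one-character symbols for departments (made up for this sketch):
            // M = meat, V = beverages, P = produce, B = baking goods, W = wine,
            // anything else (e.g., K = black pepper aisle, E = beer) is noise.
            //
            // Pattern: meat, then beverages or produce (but not baking goods),
            // then wine -- with any non-baking-goods noise allowed in between.
            var pattern = new Regex("M[^B]*[VP][^B]*W");

            foreach (var trip in new[] { "MKVEW", "MBPW", "MPEW" })
            {
                Console.WriteLine("{0}: {1}", trip, pattern.IsMatch(trip));
            }
            // MKVEW -> True  (meat, pepper, beverages, beer, wine)
            // MBPW  -> False (stopped at baking goods)
            // MPEW  -> True  (meat, produce, beer, wine)
        }
    }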

When we run a regular expression through our current software running on our current commodity servers, we’re running a regular expression algorithm over the Turing Machine algorithm. With the Automata Processor, these are what I consider the three major performance turbo-charges:

  1. The Regular Expression algorithm is directly translatable to a Finite State Machine, with the algorithm (not instructions, but the actual FSMs) “hard-wired” on the Automata Processor. Therefore, processing of FSMs is as direct as possible.
  2. Large numbers of FSMs can be loaded and updated onto an AP, a self-contained single piece of silicon (a co-processor on a conventional motherboard). Meaning, there is no marshaling of bytes back and forth from the CPU to RAM back to the CPU and so forth. The processing and the storage of the instructions live together.
  3. Each symbol is processed in parallel by ALL of the FSMs on the AP chip. They are not processed iteratively, one by one as through nested for-each loops.

The first two items describe the aspects of the performance gains at a “lower level” (in the weeds) than where the majority of BI developers ever want to live. It’s that third point that is the most compelling. With all due apologies to the “massive parallelism” of Hadoop, it’s not quite the same thing as the massive parallelism of the AP.

The massive parallelism of Hadoop occurs primarily at the server level, scaling out to even thousands of servers. It’s more like partitioning a set of data onto a large set of servers. This means there is still processing to figure out which server(s) hold the subset(s) of data we’re interested in, sending those instructions over a wire to the server, the server running through conventional read operations, etc.

The massive parallelism of the AP is more like someone finding someone who is interested in free tickets to the Boise Hawks game by shouting out into the crowd, as opposed to asking each person serially, one by one. The AP is fed a stream of symbols that is seen by all of the FSMs programmed onto that AP. Those “interested” in that symbol accept it as an event and move to the next state. Those FSMs in a state not interested in a particular symbol “ignore” the symbol and are unaffected.

In the case of this RegEx example, the valid symbols, the “alphabet”, are essentially the characters of the extended ASCII set (0-9, a-z, A-Z, and the common symbols). Incidentally, the number of symbol values recognized by an AP is 256, one eight-bit byte. With that said, it’s critical to remember that a symbol can represent anything, not just a literal letter or digit. For example, the symbols could represent up to 256 Web pages in a click stream analysis or the four nucleotides forming a DNA sequence.

Yes, that can be a limitation, but I’m sure that will change some time, and there are techniques involving banks of Automata Processors, where FSMs are artfully partitioned based on a limited subset of the total symbols.

Multiple RegEx Example

This example will test a set of words against a set of rules, for which there is a many to many relationship. In other words, each word can be recognized by multiple regular expressions. This example reflects the use case I described above (in the section, “My Data Exploration Use Case”) concerning the exploration of heaps of data.

This example is primarily utilizing SQL Server with a small C# function to leverage the .NET Framework’s RegEx functionality, which is much richer than SQL’s LIKE key word. As a reminder, I’ve kept the code as minimal as possible as many details, such as how to register a .NET DLL into SQL Server, are well documented elsewhere.

The data set used in this example is very small in order to illustrate the concept which would be greatly scaled-up were we using the Automata Processor. Millions of words and tens of thousands of patterns (RegEx) would not run very well using the conventional approach shown in the example.

The code can be downloaded as a .txt file. It’s just text, no binary stuff. Be sure to rename the file with the .sql extension to open in SQL Server Management Studio.

Figure 2 shows a SQL script that creates a temporary table of the words we wish to recognize.

Figure 2 – Words.
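Since Figure 2 is a screenshot, here is a minimal sketch of the sort of script it shows. The #txt table name comes from later in this post; the column name and several of the sample “words” are stand-ins I’ve made up (the actual figure inserts 18 of them).

    -- Minimal sketch of the script in Figure 2. The Word column name and
    -- several of the sample values are hypothetical; the figure has 18 rows.
    CREATE TABLE #txt (Word VARCHAR(100));

    INSERT INTO #txt (Word)
    VALUES
        ('555-55-6666'),                 -- looks like a social security #
        ('555556666'),                   -- ambiguous 9-digit value
        ('(208) 555-1234'),              -- phone number with parentheses
        ('208-555-1234'),                -- phone number without parentheses
        ('1A Z999'),                     -- ADA County auto license
        ('2GZ999'),                      -- Idaho auto license, no space
        ('45-888 Kamehameha Highway'),   -- street address in Kaneohe
        ('1200 Kamehameha Highway'),     -- address elsewhere on Kamehameha Highway
        ('8/14/2015');                   -- a date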

Glancing  through the “words” (in this case “phrase” may sound more normal) inserted into the temp table in Figure 2, some are easily recognizable by our robust brains as formats such as dates, street addresses, and phone numbers. Some are ambiguous such as the 9-digit words. So the idea is to take these words and check them against all the patterns we know as shown in Figure 3.

Figure 3 – Patterns in our “knowledge base”.

The temp table, #RegEx, holds a row for each regular expression, a category, and a more specific description.
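Figure 3 is also a screenshot, so here is a hedged sketch of what the #RegEx table might look like; the column names and the expressions themselves are my guesses, not the exact patterns in the figure.

    -- Minimal sketch of the #RegEx "knowledge base" in Figure 3.
    -- Column names and expressions are guesses (the figure loads 9 patterns).
    CREATE TABLE #RegEx
    (
        RegEx       VARCHAR(200),
        Category    VARCHAR(50),
        Description VARCHAR(100)
    );

    INSERT INTO #RegEx (RegEx, Category, Description)
    VALUES
        ('^\(?\d{3}\)?[ -]?\d{3}-\d{4}$', 'Contact Info',  'Phone #'),
        ('^\d{3}-\d{2}-\d{4}$',           'Government ID', 'Social Security #'),
        ('^\d{5}(-\d{4})?$',              'Address',       'Zip Code'),
        ('^\d[A-Z] [A-Z]\d{3}$',          'Auto License',  'ADA County Auto License'),
        ('^\d[A-Z] ?[A-Z]\d{3}$',         'Auto License',  'Idaho Auto License'),
        ('^45-\d{3} Kamehameha Highway$', 'Address',       'Street Address in Kaneohe');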

Figures 4 and 5 show the translation of two of the regular expressions held in the #RegEx table; one a fairly simple one for ADA County Auto License numbers and one a little more complicated for phone numbers. Some of the patterns are very specific such as  the one for an ADA County Auto License. I’ve included such specific ones to help demonstrate that patterns don’t need to be universal. We could instead encode many patterns addressing subsets.

Figure 4 – Finite State Machine representation of an ADA County License Plate Regular Expression.

Finite State Machines are the heart of the Automata Processor. Once you’re registered for the Automata Processor preview I mention towards the beginning of this blog, you will see an interface that allows you to “author” such diagrams in this WYSIWYG manner. However, keep in mind that there are methods for authoring such diagrams en masse, for example, from the sequences in a Time Sequence data mining model. The sequences can be uploaded through the AP’s own XML format, ANML (pronounced “animal”).

Figure 5 – Finite State Machine representation of the Phone Number Regular Expression.

Figure 6 is a very simple scalar SQL Server function written in C# that uses the .NET Framework’s RegEx class. Figure 8 below shows how this function is used in SQL.

Figure 6 – C# SQL Server Scalar Function utilizing the .NET Framework’s RegEx.
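Since the C# in Figure 6 is an image, here is a minimal sketch of such a scalar SQL CLR function; the function and parameter names are my guesses, and parameter validation is omitted as noted at the top of the blog.

    using System.Data.SqlTypes;
    using System.Text.RegularExpressions;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        // Minimal sketch of the scalar CLR function described in Figure 6.
        // Names are hypothetical; validation logic is intentionally omitted.
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlInt32 RegExIsMatch(SqlString input, SqlString pattern)
        {
            if (input.IsNull || pattern.IsNull)
                return new SqlInt32(0);

            // Returns 1 if the word matches the pattern, otherwise 0.
            return new SqlInt32(Regex.IsMatch(input.Value, pattern.Value) ? 1 : 0);
        }
    }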

Figure 7 is a script for registering the .NET function into SQL Server. I haven’t described all of the warnings related to enabling CLR functions as it is explained very well elsewhere.

Figure 7 – SQL Server code to register the .NET DLL into SQL Server.
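Figure 7 is likewise a screenshot, so here is a minimal sketch of such a registration script; the assembly name, file path, and function name are hypothetical placeholders matching the C# sketch above.

    -- Minimal sketch of the registration script in Figure 7.
    -- Assembly name, path, and function name are hypothetical.
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO

    CREATE ASSEMBLY RegExFunctions
    FROM 'C:\Assemblies\RegExFunctions.dll'   -- hypothetical path
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE FUNCTION dbo.RegExIsMatch (@input NVARCHAR(4000), @pattern NVARCHAR(4000))
    RETURNS INT
    AS EXTERNAL NAME RegExFunctions.UserDefinedFunctions.RegExIsMatch;
    GO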

Now that all of the pieces are in place (the RegEx function is registered in SQL Server and we have a query window open in SSMS where the two temp tables are created), we can run a SQL statement (Figure 8) to test each word against each rule. The SQL statement CROSS APPLIES the words with each rule, filtering out any word/pattern combination that does not result in a recognition. A recognition is determined via the RegEx function we created and registered into SQL Server as shown in Figures 6 and 7, which returns a 0 or a 1.

Figure 8 – SQL to process the words and RegEx.
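Here is a hedged sketch of the sort of CROSS APPLY query Figure 8 shows, using the hypothetical dbo.RegExIsMatch function from the sketches above; the column names are guesses.

    -- Minimal sketch of the query in Figure 8. Column names are hypothetical.
    SELECT  t.Word,
            r.Category,
            r.Description
    FROM    #txt AS t
    CROSS APPLY
    (
        SELECT  Category, Description
        FROM    #RegEx
        WHERE   dbo.RegExIsMatch(t.Word, RegEx) = 1
    ) AS r
    ORDER BY t.Word;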

Using the SQL in Figure 8 with the CROSS APPLY join, with the 18 words we inserted into #txt and the 9 patterns we loaded into #RegEx, there were 162 (18*9) comparisons made. In other words, for each word, check each rule. If this were scaled up, for example if there were millions of words and thousands of patterns, the number of comparisons would be huge.

If these 18 words were fed into an Automata Processor loaded with those 9 patterns, each word would be fed only once and all 9 patterns would analyze it in parallel. To rephrase something similar I mentioned earlier, this is the same as someone holding up the word “555-55-6666” and shouting to a bunch of people, “Hey! What is this?”, as opposed to walking to each one and asking that question.

Figure 9 shows the results of the SQL shown in Figure 8.

Figure 9 – Results of the SQL in Figure 8.

We’ll look at a few of the results, discussing some interesting aspects of exploring data in this manner:

  • Rows 1 and 10 show the “Phone #” RegEx is versatile enough to recognize a phone number with and without parentheses. In this case, for any word containing a set of 3 digits, 3 digits, and 4 digits, we can be fairly confident it’s a phone number, with or without parentheses around the first three digits. So it’s OK to use one versatile RegEx.
  • Rows 4 and 5 show that ‘1A Z999’ is recognized as both a more specific ADA County Auto License and a more generic Idaho Auto License. The less specific Idaho Auto License also recognized 2GZ999, even without a space between the 2G and the Z999 parts. It’s good to recognize something at various levels. For example, sometimes we need real sugar, sometimes any sweetener will do.
  • Row 7 recognized “45-888 Kamehameha Highway” as an address in Kaneohe, but not “1200 Kamehameha Highway”. Because Kamehameha Highway practically circles Oahu, being able to recognize an address as specific as one in Kaneohe on Kamehameha Highway requires this fairly stringent rule. Also, this doesn’t mean all addresses in Kaneohe follow this rule. Other rules would be developed, hopefully with at least some abstraction into a RegEx. For example because Luluku Road is only in Kaneohe, any address on Luluku Road (also following the 45-ddd format typical for Kaneohe) is a street address in Kaneohe.
  • Row 8 shows 555556666 recognized as a Zip code, although another word differing only by its dashes, 555-55-6666, is clearly a social security #. However, there really is no reason 555556666 cannot be a legitimate Zip code (somewhere in Minnesota). Even though our human brains may think of this as more of an SSN, it’s good to have something that can see beyond our biases.

So suppose that over the years, through dozens of customers and hundreds of databases, I collected thousands of formats for data. Most will not be as universal as date and phone number formats. But even seemingly one-off formats could provide insight. For example, suppose years ago I encountered some old software system that stored case numbers in the format of 4 upper-case letters, a dash, two digits, a dash, and 3 digits (RegEx: [A-Z]{4}-\d{2}-\d{3}). If today at another customer I encounter such a format, it adds a relationship that may or may not matter.

To take the code presented here to that level where we explore the hundreds of databases throughout an enterprise, I would expand this example to the following steps (a rough sketch follows the list):

  1. Iterate through a list of database servers, each database, each table, each view (because there could be calculated columns), and each column of those tables and views.
  2. For each column, the tool would retrieve each distinct value and a count of each value.
  3. For each of those distinct values, the tool would test it against each regular expression in the library accumulated over a long consulting career. Every value recognized by a regular expression, whether as a street address from a little town on Oahu or as just being numeric, would be added to a table similar to the one shown in Figure 9. However, there would be rows for the server, database, table, and column as well.
  4. Because this could mean trillions of rows for a large enterprise, we could actually store only the counts for each regular expression for each column. So if there were say 50,000 columns across all databases, each triggering around ten regular expressions, that’s only 500,000 rows.
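As a very rough sketch of those steps for a single database (iterating servers and databases would wrap this in further loops or linked-server calls), something like the following could walk INFORMATION_SCHEMA and store only the per-pattern counts. All object and column names here are hypothetical, and it reuses the #RegEx table and dbo.RegExIsMatch function sketched earlier.

    -- Rough sketch of steps 1-4 for one database; names are hypothetical.
    DECLARE @schemaName SYSNAME, @tableName SYSNAME, @columnName SYSNAME, @sql NVARCHAR(MAX);

    CREATE TABLE #ProfileCounts
    (
        SchemaName  SYSNAME,
        TableName   SYSNAME,
        ColumnName  SYSNAME,
        Description VARCHAR(100),
        MatchCount  INT
    );

    -- Step 1: every character column of every table and view.
    DECLARE col_cursor CURSOR FOR
        SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE DATA_TYPE IN ('varchar', 'nvarchar', 'char', 'nchar');

    OPEN col_cursor;
    FETCH NEXT FROM col_cursor INTO @schemaName, @tableName, @columnName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Steps 2 and 3: distinct values tested against every pattern in the
        -- library; step 4: store only the counts per pattern per column.
        SET @sql = N'
            INSERT INTO #ProfileCounts (SchemaName, TableName, ColumnName, Description, MatchCount)
            SELECT ' + QUOTENAME(@schemaName, '''') + N', '
                     + QUOTENAME(@tableName, '''') + N', '
                     + QUOTENAME(@columnName, '''') + N',
                   r.Description, COUNT(DISTINCT v.Value)
            FROM (SELECT DISTINCT CAST(' + QUOTENAME(@columnName) + N' AS NVARCHAR(4000)) AS Value
                  FROM ' + QUOTENAME(@schemaName) + N'.' + QUOTENAME(@tableName) + N') AS v
            CROSS APPLY (SELECT Description, RegEx FROM #RegEx) AS r
            WHERE dbo.RegExIsMatch(v.Value, r.RegEx) = 1
            GROUP BY r.Description;';

        EXEC sp_executesql @sql;

        FETCH NEXT FROM col_cursor INTO @schemaName, @tableName, @columnName;
    END

    CLOSE col_cursor;
    DEALLOCATE col_cursor;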

Remember though, the purpose of this blog isn’t so much to suggest a data profiling technique as to present a pattern for an Automata Processor use case which could provide inspiration for other applications.

Conclusion

It seems that the combined growth of data, the complexity of the world, and the computing power required to answer the sort of questions we now face are outpacing Moore’s Law and the increasing computing power of CPUs. But we can still tackle this problem by looking towards these massively parallel approaches.

Last week (August 3, 2015) I posted a blog on a “Graph Database Symposium” I’m planning. At the time, the planning was even earlier than the “early stages”. The intent of that blog is to gauge the interest for such a symposium at this time. Hopefully, this blog helps take the reader further along in recognizing the value of graphs and the Automata Processor.

 

 

Posted in BI Development

Planning a 1-Day Symposium in Boise on the Utilization of Graph-Centric Data Technologies in Business Intelligence

Introduction

I’m currently working with the organizers of the Boise BI User Group and a few heavy hitters from various Boise-based technology communities on a 1-day symposium introducing graph-based technologies to those in the Boise Business Intelligence community. (To clarify, by “graphs”, I’m referring to those web-like “networks” of relationships, and not visualizations such as line graphs seen in software such as Pyramid Analytics or Tableau.) The overarching goal is to inform BI practitioners of the toolset already out there required to begin addressing what I consider to be BI’s “hard problem”. That is, to feasibly formulate, organize, maintain, and query relationships between data throughout an enterprise.

We’re in the early design and planning stages, shooting for a mid-October (2015) delivery. The nature of this symposium is forward-thinking, meaning not many people would think to even look for it, so it doesn’t come with a ready-made audience (as would, say, a class on advanced Tableau). I chose to post this blog early in the process as a feeler gauging interest in this symposium as well as to gather input for the content. This post is by no means a formal announcement.

As a caveat, it’s important to state upfront that in the overarching Business Intelligence context of this symposium, in order to apply many of the techniques that will be covered, there will still be a prerequisite for a well-developed BI infrastructure … for the most part. I realize that for many enterprises, even a somewhat-developed BI infrastructure is still a far-off dream. But hopefully this symposium will reveal a much bigger payoff than was previously imagined for a well-developed BI infrastructure, spurring much more incentive to aggressively strive for that goal. However, it’s crucial to keep in mind this doesn’t mean that there aren’t narrower-scoped use cases for graph technologies ready to tackle without a well-developed BI infrastructure, particularly with the Automata Processor.

Abstract

An accelerating maturity of analytics combined with Boise’s rich Business Intelligence community, innovative spirit, and the headquarters of Micron with its Automata Processor presents a powerful opportunity for Boise to yield world-class analytics innovation. The “three Vs” of Big Data (massive volume, velocity, and variety) are simply more data without improvement on the even tougher task of organizing the myriad data relationships, which today are mostly not encoded. We need to begin solving our problems of a complex world in non-linear, truly massively parallel, massively hierarchical, and non-deterministic manners. Such an effort begins by shifting away from the central role of the tidy simplicity of our current relational databases to the scalable, reflective modeling capabilities of graph (network) structures taking center stage.

Everything is a set of relationships, and that is what graphs are all about. Our human intelligence is based on a model of our world, a big graph of relationships, built in our brains over the course of our lives. We humans are able to readily communicate with each other because those unique models of the world held in each of our brains mostly overlap – our cultures. Where our individual models of the world don’t overlap with those of others represents our unique talents. The net effect is that our society is incredibly richer because we can exceed the limitations of our individual brains through the aggregation of our collective knowledge.

Likewise, machine analytics systems of our enterprises possess skills beyond the limitation of our brains. The problem is that those systems don’t share our human culture. In order for us humans to effectively leverage the “intelligence” captured in those enterprise analytics systems, those systems also need to possess models of the world at least somewhat overlapping with us. Models in current analytics systems are limited by restrictions dictated by the limitations of computers of the past, for example, the limited notion of “relationships” of relational databases. Deeper communication between humans and machine intelligence currently requires grueling programming of the computers and sophisticated training on our part. Today’s technology, particularly graph technologies, is our opening to surpass those outdated techniques, building, maintaining, and querying superior models of the world in our analytics systems. The improved machine intelligence fosters smoother, more robust communication between human and machine intelligence.

The key takeaways are:

  • Understand why breaking away from the predominantly relational database model to graph databases opens the door to quantum leaps in analytic capability.
  • The challenges of navigating through the increasing complexity of the real world, at the risk of being left behind by enterprises that do build that capability.
  • An introduction to the technologies and concepts of graphs.
  • A roadmap towards the transition to graph data.

My Initial Vision as a Starting Point

As I mentioned earlier, we are in the early design and planning stages, and the purpose of this blog is to gauge the interest for such a symposium as well as to gather input from the potential attendees on the content. So nothing is set in stone, the concrete is just starting to be mixed. However, I would like to include my initial vision of the agenda in this post just as a starting point.

As we have just this past week reached a few critical milestones (participation of a few key parties, a venue), we’re just starting to engage other key players to work out an agenda that will provide maximum value to the attendees. So it will certainly morph to a noticeable extent by the time we formally announce the symposium.

Before continuing on to my initial agenda, note that Sessions 1 and 6 are targeted at mature BI practitioners. Because the symposium is set in a BI context, I thought to begin by laying out the current BI landscape and pointing out the big problem. Sessions 2 through 5 are at a rather introductory level on graph technologies, laying out the pieces required to attack that big problem. We would then wrap up with a discussion on how to apply graph technologies to BI. Anyway, here is the initial agenda I tossed out to begin the process:

Session 1: The Current State of Analytics

The enterprise analytics world is currently a complicated zoo of concepts, processes, and technologies, all of which do hold legitimate roles. However, they exist in our enterprises as islands of poorly linked pieces, lacking the rich integration of the memories in our brains or the organs in our bodies. A business enterprise is a system of relationships like any natural system. In this session we explore these “tectonic plates” of BI and the gaps that must be closed for our business enterprises to leap ahead through vastly improved bridging of human and machine intelligence.

  • The Current Landscape of “the Intelligence of Business”: ETL, Data Marts and Warehouses, Data Lakes, Performance Management, Self-Service BI and Analytics, Master Data Management, Metadata Management, Complex Event Processing, Predictive Analytics and Machine Learning, Deep Learning, Knowledge Management.
  • The Missing Links: Why do we still make bad decisions, fail to see things coming, and keep acting on organizational myths and legends?
  • The Secret Sauce: Soften the boundaries between objects and balance bottom-up flexibility and top-down centralization.

Session 2: Graphs and the Theory of Computation

It’s certainly not that graphs are unfamiliar to us. We are well familiar with org charts, food chains, flow charts, family trees, etc., even decision trees. While such simple “maps” we’re used to seeing in applications such as Visio, PowerPoint, or SQL Server Integration Services are very helpful in our everyday lives, they quickly grow like kudzu into incomprehensible messes from which we readily shy away. This session will introduce basic concepts of graph theory and the Theory of Computation, as well as begin exploring that unwieldy reality of relationships we’ve so far punted down the road.

  • Introduction to Graphs: Terminology and Basics of Graph Theory, and a bit on the Theory of Computation.
  • The Importance of Graphs, Models and Rules in the Enterprise – Everything is a graph. Examples of graphs used in commonly used business tools.
  • Robust Graph Processing: Model Integration, Fuzziness, Inference, massively parallel, many to many, massively hierarchical.
  • Where Relational Databases Fail in the Enterprise and why we keep retreating back to that comfort zone (ex. retreat from OLAP back to relational databases). Note: It may sound odd that I’m talking about focusing on relationships even though today’s primary data sources, “relational databases”, are called “relational”. The problem is they’re not relational enough.

Session 3: Embracing Complexity

It doesn’t take a network of seven billion independent minds and billions more Web-enabled devices forming the so-called Internet of Things to result in a complex system where nothing is reliably predictable. For example, a distributor of goods lives in an environment of vendors, stores, customers, their customers’ customers, regulations from all governments (in the “Global Economy”), and world events where reliable predictability is limited to low-hanging fruit problems. Each is rife with imperfect information of many sorts and competing goals. Consequently, the problems faced by such enterprises are of a “bigger” nature than the limited-scope problems we’ve so far typically addressed with our analytics systems. The reason is we are attempting to resolve complex problems using techniques for resolving complicated problems.

  • Overview of Complex Adaptive Systems. The many to many, heterogeneously parallel, massively hierarchical, non-linear nature of our world.
  • The Things We Know we Don’t Know and the Things We Don’t Know We Don’t Know: Predator vs Prey, Predator vs Predator
  • Rare Event Processing: Statistics-based prediction models fall short for those high impact rare events, where novel solutions are engineered from a comprehensive map of relationships.
  • The world is a complex system: Situational Awareness
  • Healthcare: Perfect Storms of Many Little Things
  • Lots of Independent and Intelligent Moving Parts: Supply Chain Management, Manufacturing, Agriculture

Session 4: Beyond Visio – Robust Graph Technologies

Graph concepts and technologies have been around for a long time, in fact, from the beginning of computing. Many of the concepts are core in the world of application developers who hide the ugliness from end users by presenting flattened, sterilized, distilled, templated chunks of data. Think of the wiring of your computer hidden from the end user by the casing. Gradually, the complexity is such that the ugliness demands to be addressed at the higher levels of the end user, albeit in a cleaner form.

  • Graph Databases: Neo4j Introduction and Demo
  • Overview of IBM’s Watson
  • Object-Oriented Databases and ORM.
  • The Semantic Web: RDF, OWL, SPARQL.
  • Introduction to graph-like co-processors; particularly the Automata Processor

Session 5: Micron’s Automata Processor

Micron’s Automata Processor is one of the most important innovations in semi-conductors. It presents a shift away from the current computer architecture that for decades has been geared towards the simplicity of solving strictly procedural problems. Ironically, in order to effectively tackle the problems of an increasingly complex world, we retreat from the current computer architecture of today to a simpler model based on finite state machines. The massively parallel, loosely-coupled nature of the Automata Processor more comfortably reflects the nature of the environments in which we live, whether business, nature, or social. The truly massively parallel nature of the Automata Processor represents a leap as big as the leap from single-threaded to multi-tasking operating systems decades ago.

  • Micron’s AP demo and examples of current applications
  • Proposed Automata Processor BI Use Case.
  • Recognizing Opportunities for the Automata Processor.

 Session 6: The Big Problem of Building the Robust Models

So what is the roadmap for building such ambitious systems? This is not about building an Artificial Intelligence but about softening the communication boundaries between people and our databases by drastically improving upon the relationships between data. Automation of the generation and maintenance of these relationships, the rules, is the key. For example, it’s not much harder to map out the relationships within a static system than it is to write a comprehensive book on a fairly static but complicated subject. The trick is to do the same for a system/subject in constant flux.

  • Where do Rules Come From?
  • Existing Sources of models and rules in the Enterprise.
  • A Common Model and Rule Encoding Language.
  • Mechanism for Handling Change, Massive Parallelism, Massively Hierarchical, Missing or low confidence data.
  • Knitting together the Pieces of the Current Analytics Landscape mentioned in Session 1.

A Little Background on Where I’m Coming From

It’s not that people, particularly those involved with BI and analytics, aren’t aware of the importance and value of encoding knowledge onto graphs. It’s actually rather obvious, and graphs are very much in use. It’s that these graphs are for the most part simple, disparate artifacts (connect-the-dots pictures), disconnected islands of knowledge. That condition is similar to enterprises only a few years ago with hundreds of OLTP systems scattered throughout (and thousands of Excel documents today – even with SharePoint Excel Services!), with their silos of data and clumsy methods of integration. There have been efforts in the recent past to promote graphs to a more prominent level that gained quite a bit of attention but fizzled back into relative obscurity. Relevant examples include UML and the Semantic Web. Neither is dead, but maybe with the fuller complement of related technologies today, they may finally find lasting traction.

A couple of years ago I wrote a blog – strictly for entertainment purposes only – titled, The Magic of the Whole is Greater than the Sum of Its Parts. It’s just a fun exploration of the notion of a business as an organism. Particularly that organism’s intelligence, what I call the “intelligence of business”. Although we shouldn’t take that metaphor too far (and maybe I did in that blog … hahaha), I think it’s fair to say that a business has rough counterparts to a human’s organs, desires, pain, ability to physically manipulate its surroundings, and knowledge, which today is much more harmonious in us than in the business analog.

However, the problem is that a business’ “intelligence”, the ability to store, analyze, and maintain webs of relationships, lies almost exclusively in the human brains of the workers and hardly in the fairly hard-coded/wired mechanical things (devices, software, documents). That’s fine as long as the knowledge is fairly transferable to another person (in case the worker leaves) or the skill has been commoditized, and there is some level of overlap of knowledge among the employees (redundancy).

One major outcome of failing to address this, at least in my opinion, is that in the name of optimization (particularly when the elimination of variance and redundancy is, say, overly zealous), workers are forced into deeper and deeper specialization, which draws stronger boxes around these “organic components” of the business. The knowledge in those workers’ brains is hardly ever recorded to an extent that a replacement is able to readily take over. When a knowledge worker leaves, it’s as if the enterprise had a stroke and must relearn capabilities.

Our poor human brains are filled to capacity to the point where we whittle away at things in life outside of work in order to keep up. We long ago maxed out on our ability to work optimally in groups when our “tribes” began consisting of too many people and there is too much flux in the membership. It used to be that knowledge could be captured in books. But change and increasing complexity come too fast for the subject matter experts to effectively document, and then for us readers to assimilate. As we’ve increased the scalability of data through the Big Data mantra of volume, velocity, and variety, we need to improve the scalability of our ability to encode and assimilate increasing knowledge requirements.

The answer isn’t AI, at least not the Commander Data or HAL version promised for the last half century. Even with IBM Watson’s success on Jeopardy and its subsequent exponential improvement, I seriously don’t think there will be an AI more innovative than teams of motivated and educated humans for quite a while. The answer is to build a better “pidgin” bridging human intelligence and data, a far less grandiose track for which the pieces are mostly there, one that offers a long-term incremental path towards improvement.

Here are a few old blogs that sample much of my earlier thinking that led to the idea for this symposium:

Actually, almost all of my blogs are somewhat related to the subject of this symposium. My blogs have always been about pushing the boundaries of Business Intelligence. A couple of years ago I attempted to materialize all my thoughts around this subject into a software system I developed which I named Map Rock. This symposium is not about Map Rock as I’ve “retired” it, and Map Rock only represents my vision. It makes more sense today to pull together the best-of-breed pieces out there into something from which we can begin to evolve an “intelligence of business”. However, my 5-part series on Map Rock offers a comprehensive description of what I was after.

Conclusion

This symposium is intended to be an introduction that will hopefully cut down some of those fences we fear to hop so that we can seriously explore the vast frontier of BI becoming a truly strategic asset, rather than being stuck straddling the tactical and operational realms. It can begin to move from “Help me calculate this value to plug into this formula” to “Help me create and maintain this formula”.

To recap the current status:

  • We’re in the early stages of planning. The agenda presented here is just an initial draft.
  • We’re planning to deliver this in Boise in the mid-October (2015) timeframe. We should have a date and a tighter agenda well before the end of August.
  • We’re trying to gauge the interest in the Boise area for such a 1-day symposium.
  • We’re asking for any input on content or hard problems in your business that could be better approached as a complex problem, not a complicated problem.

Please email me at eugene@softcodedlogic.com with any questions or comments.

Posted in BI Development, Cutting-Edge Business Intelligence, Data Mining and Predictive Analytics

Being a Lowest Common Denominator

I love the current series of commercials put out by esurance. They all involve people encountering someone that is “sort of” like someone else they were expecting. Watch them on YouTube:

  • Sorta Mr. Craig – Parents visit an unlikely teacher in the classroom. You’re not Mr. Craig. Well, sort of; “we’re both between 35 and 45, both like to save on car insurance, and both really good at teaching people a lesson”.
  • Sorta Your Mom – An unlikely mom pulls up to pick up two kids at school. That’s not my mom. “I’m sorta your mom. We’re both 25 to 35 years old, both women on the go, and we both clocked a lot of miles.”
  • and my favorite, Say My Name – A woman approaches a pharmacy counter which is manned by Walter White. You’re not Greg. “I’m sorta Greg. We’re both over 50 years old, we both used to own a Pontiac Aztek, and we both have a lot of experience with drugs.”

I don’t love these commercials just because they are funny and one includes Walter White, but because they brilliantly demonstrate the folly of commoditizing us through classification algorithms. By commoditization, I mean we’re lumped into a group whose members share characteristics mostly determined by “letting the numbers speak”, stripping away all those little things that seem irrelevant, but in aggregate are what make each of us unique and special. The theory is that folks sharing a set of characteristics would likely share one or more that are of interest. It eases a business’ efforts in dealing with people if we can carve away that nasty complexity of real people and deal with a nice, clean entity who we will now assume works as the categorization suggests. Things are simplified at the expense of our individuality.

This subject has touched a nerve in me lately through a couple of recent conversations with colleagues on personality tests. I’m not comfortable with being just a number and thought to be that simple of a creature. I’ve had to take personality tests such as Myers-Briggs in the past for work. In fact, it would be very hypocritical of me as a BI practitioner to not appreciate the value of such tests since they are in fact the result of a classification algorithm, which is a big part of what I do for a living through my Business Intelligence practice.

Therefore, I want to be clear that this blog isn’t a hit piece on such tests. I very much see the value as a tool that helps to smooth out working relationships within teams, which is a worthwhile thing. My point is to provide a friendly reminder to everyone that predictive analytics is just a reasonable guess based on statistics. It’s easy to get overly hung up on such numbers as the simplification of decision making is seductive. Clustering, or any data mining algorithm for that matter, provides heuristics. They are simple rules that at least kick us off in some direction, freeing us from analysis paralysis. They toss out the noise. They theoretically should be better than a shot in the dark, at least for the person doing the shooting.

The problem is when we become married to that initial impression – which of course, brings up that “first impression” thing, which seems to be a real thing. And that is an easy thing to happen. Our brains are constantly and desperately seeking out patterns to tame the complexity of the real world. We hate to give up old patterns because we’re in the groove with those current patterns and now we need to disrupt things with new ones.

Whenever I take those personality tests, I really, truly struggle to answer those multiple choice questions. With almost all of them, I think, “Well, it depends!” “It’s not as simple as that!” The problem I have answering those questions probably pegs me into some hole in itself. Extravert or Introvert? It depends. For me anyway, it’s not a one dimensional continuum. Sometimes you can’t shut me up, sometimes I just want to sit in the back row and watch. I don’t recall exactly which of the sixteen categories I fell into, but I do remember it wasn’t what I would have thought for myself.

Like any statistics in the hands of those not skilled with statistics (or with too much skin in the game), things can be taken out of context and/or continue to be applied long after it’s no longer valid. The irony of all this is that as we strive so hard to remove old stereotypes and prejudices from society, we replace them with a much larger number of newly invented ones. Granted, these new ones usually appear to be more benign than those old ones.

I rarely ever see a cluster created in which the vast majority of the members are in it with a high probability. Meaning the clustering is right most of the time, but when it’s wrong, the result can range from merely rude to significantly troublesome, as the esurance commercials humorously illustrate. Every now and then we’ll be tossed into a group for which we are not really well suited, but it’s the closest fit (shoving a trapezoid into a square hole). Yet, we’ll be treated the same way as those who fit very nicely into the categories.

Posted in Uncategorized

What I Would Have Said at TEDx Boise 2015 Had I Been Selected

I applied to speak at the inaugural TEDx Boise 2015 event, but I was not selected. However, I would still like to share what I had intended to say had I been chosen. The Synopsis is what I submitted to the Selection Committee and the Speech is what I subsequently started to put together just in case.

Last month (Nov 2014) I took a two-week sabbatical in the Zion area to think through concepts that I began to write about in a previous blog, Embracing Complexity – Part 1 of 6. It involves breaking through the limited paradigm under which analytics is currently implemented at mainstream enterprises, which I think is a remnant of a time when the hardware and supporting software could not support true Business Intelligence. And now we’re too used to BI systems being more reporting tools than something that accentuates our human analytical capabilities.

After over 100 miles of hikes, over 50,000 words of notes, and maxing out my iPhone with voice recordings over those two weeks at Zion, I’m attempting to digest those concepts of moving analytics in the mainstream forward into a book and some supporting software. I suppose this speech I would have made will serve as a fair abstract to the book.

It’s worth mentioning that Boise actually has quite an impressive community of Business Intelligence professionals. ProClarity played a large part, attracting top talent as well as nurturing much local talent. Additionally, the outdoors lifestyle is attractive to the creative sorts who migrate to Business Intelligence. In fact, it was the purchase of ProClarity by Microsoft that brought me to Boise almost exactly eight years ago. So my speech was in large part intended to help further the nurturing of this already impressive community.

Before reading the (would-be) Speech, keep in mind that it is certainly not exactly what I would have ended up delivering. Had I been selected, I would have run it by many friends, cleaned and tightened it up much more based on feedback, and adjusted some things that I’d forgotten would be obvious only to BI and/or software sorts of folks. Meaning, adjusting to a more generalized audience. Consequently, for a generalized audience, I take some liberties with definitions and gloss over some messiness towards the goal of getting my point out in 18 minutes.

Additionally, I think the TEDx folks provide some level of guidance. And besides, if you’ve ever attended one of my presentations or workshops, I hardly ever stick to script anyway … hahaha.

Lastly, the Speech is written as a speech. I wrote it imagining me presenting it as I would with certain inflections. So, some sentences may seem rambling because of the limitations of punctuation, some things perhaps redundant because I’m trying to drive home an important point.

Title: How to Think in 4 Dimensions

Synopsis: The latest generation of analytics tools driven by Big Data drastically accentuates our ability to navigate a truly 4D world. However, it’s a skill that must be learned for that power to be fully appreciated. We’re good at recognizing 3D things such as faces, places, and food, from any angle, up close or at a distance, in full view or obscured. But we’re actually bad at making predictions, which involves the chaotic interaction of many independently moving things over time. More often than not, we get things wrong, sometimes for millennia. And it only gets worse as the complexity of the world increases with more people, the Internet of Things, and the farther reach of everything. We ourselves are mere 3D objects faking our way through a 4D world, via our intelligent use of information. It’s unnatural for us, so we struggle to fight chaos through oversimplification and tightening control instead of counter-intuitively embracing complexity.

The Speech (what I would have said):

We live in a four-dimensional world, three spatial dimensions and time. We readily understand the three spatial dimensions, depth, width, and height, but most of us probably don’t fully comprehend the 4th dimension of time. Or at least, we forget about the true nature of the 4th dimension as we struggle through our daily lives addressing problems after unrelated problems, plural intended, that hit us as fast as snowflakes as we walk through a snowstorm. That relentless multi-tasking shatters our energy into unrelated silos of effort that don’t and can’t add up to the grander things we’re striving for. Addressing so many simple things makes us good at handling simple things resulting in the atrophy of our ability to think into that realm that’s bigger than our brains.

The problem is that we’re actually 3D creatures living in a point in time, along with countless other things existing with us at that point, faking our way through the 4th dimension with our uniquely human super powers of memory and symbolic thinking. Oversimplifying, memory and symbolic thinking allow us to perform what-if experiments about what could happen, in the safety of our heads before committing to physically irreversible actions. For a 3D creature faking its way through the 4th dimension, the 4th dimension means the exploration of all possibilities. So what does that mean?

Imagine a weather site in your Web browser with one of those 3D visualizations of your region, a fairly large area. It’s tilted so you can see the places and topography of your region, and you can also see the height and thickness of the clouds, the density of the rain or snow. At the lower left of that page is a slider bar showing time, the 4th dimension, from the past, say four hours ago, to even the future predictions several hours from now. You slide that bar from far left towards the right, watching the clouds and rain as they moved over the past few hours in that 3D visualization. You reach the present on that slider and continue sliding towards the right, but now into the future, seeing the clouds and rain move through space as the sophisticated predictive models have calculated.

Is that 4D? It is, to a limited extent. We can see all the 3D clouds and rain moving through time. But the future we see is just one possibility. One possibility is fine if we knew it was a certainty, not merely one possibility among many others. We all know that’s not the case. We know weather predictions are often wrong, sometimes very wrong. Predictions are often wrong because the changes through time are calculated step by step, say every minute. At any given point in time, the calculated future of those clouds and rain is never exact. The most likely position is selected, but we treat it as though it is exact, ignoring the possibilities calculated to be less likely. Over subsequent steps, those approximations accumulate and exacerbate errors until a few hours out, the prediction looks nothing like what actually happens.

In the 4th dimension of time, as far as a 3D thing like we humans are concerned, the 4th dimension encompasses all possibilities. We generally think of more than one possibility, worrying about a less likely outcome, and are sometimes called paranoid for considering unlikely bad outcomes or delusional for thinking about very unlikely success. Certainly, we cannot consider all possibilities, not even a small fraction of them, even if our brains held magnitudes more capability. And that’s OK since most things would be so unlikely it’s not worth thinking about them. For example, a rock on a planet in a distant solar system will not appear out of the blue, hit us on the head, and turn our head into asparagus.

Humanity’s unsurpassed capability with the 4th dimension is the secret sauce of our Earthly dominance. Our memories and symbolic thinking, though not flawless, give us enough power to beat out every other species on this planet. But now that we don’t worry about a grizzly bear running into our camps and mauling one of us for dinner, we have moved up to beating other equally intelligent humans with that same level of intelligence that gave us mastery over bears, wheat, cows, and chickens. It turns out that the smarts to beat the grizzly bears aren’t enough to beat other humans on the same playing field, so we learn to strategize in an arms race of technology. That is, we’re capable of knitting together a set of steps, plans, towards a goal. Our brains happened to have excess capacity and our ability to form teams helped scale up our capabilities further.

Here’s the catch. Eventually our brains reach their limits, and they become the bottleneck, even with teamwork and technology. The world we’ve created for ourselves is at or even beyond our ability to fully master it. Today, the world is so complex that as we joke in the software industry, we fix one bug and create two new ones. Another analogy from the software industry is that everything is easy until scalability becomes an issue.

To remedy that, we seek the comforts of simplicity and tighter control, taming the complexity of life back under the limits of our brains. We actively train ourselves to consider just one possibility. Our decisions are becoming formulaic heuristics, not thoughtful actions. We really have no choice. Seven billion and growing sentient beings, each with their own intelligence and desires, result in a messy web of demands set upon all of us. There are tens of thousands of government regulations, many of which contradict each other, thousands of sensitive points to avoid, hundreds of mini decisions off-loaded by companies to us in the guise of self-service convenience. There are so many decisions we must make that we shun answers beginning with “It depends …”, demanding “Just one number please …”. There are so many factors that we can’t afford the time to consider the multitudes of “it depends” scenarios. We don’t even have time to consider even just a tiny fraction of the infinite possibilities, or we quickly freeze through analysis paralysis.

We never have time to deeply learn, to reflect, to let it all sink in, to let the fundamental “what fires together wires together” mechanism of our intelligence fully digest our recent experiences. We rush through tightly-scoped, two-week Scrum sprints at work, and before the current one is even over we’re dipping into the next one. Sure, there is a post-mortem phase, but it’s usually the first thing to be cut when time is short, which is usually the case, and it’s too little anyway. There’s also no real time to celebrate, to let our brains equate the success with something good so that we’re driven to succeed during subsequent sprints.

So much is thrust upon us in this literally mind-bogglingly complex world we’ve created for ourselves that our power to see into the 4th dimension overloads our brains and we keep failing. Consequently, we retreat to worshiping simplicity, best practices, conformity; the comfort of “now”.

For the sake of simplicity and conformity, we disregard the diversity of our genes by developing drugs, cars, and airplane seats that are optimal for a few, even detrimental for some, and merely OK for the vast majority of people. We insist on shoving pegs of all shapes into square holes. I chose squares because they easily stack into nice, compliant piles.

Our diversity of experience is swept under the rug through standardized testing. Our unique experiences form the boundaries of our logic, so the more collective experience we have, the better our logic. There are multiple ways to attack a problem, some more optimal than others, but those rankings usually hold only under assumed, rigid circumstances. Meaning, that shortcut won’t work all the time, and the real value of human intelligence is when we need to come up with an alternative. Anyone can learn a standardized answer or well-defined process, even a computer, hint hint, but it’s a completely different thing to engineer a novel solution. Those novel solutions exist out there in the 4th dimension.

We’ve lost our capability for delayed gratification, arguably the single most important skill. That is, the ability to see something that the bears and alligators cannot see because they can only see the possibilities where an action leads to immediate improvement.

Everyone panics if there isn’t immediate progress, not considering that sometimes things go one way before they go the other; sometimes worse before getting better, sometimes better before getting worse. Answers are given to us without forcing us to explore webs of paths. Most of those paths don’t solve our problem, but they expose us to much more that will in part or whole be of use to us someday. Those currently fruitless, seemingly dead-end paths enrich our treasure trove of experience, the experience from which the quality of our logic is derived. It strengthens the skills required to engineer; perseverance, imagination, resourcefulness. We instantly disregard anything but the perfect 5-star choices, even though the choice with the checkered past, failing through all of those fruitless paths, may indeed now be the choice for a world that is all too complex.

To recap, our ability to consider many possible paths, at least more than other species, is our human shtick. But we’ve created a world of such complexity that it requires us to look at a daunting number of possibilities that has overloaded the capacity of our brain, resulting in chronically making bad decisions. So we’ve retreated to simplicity and tighter control, conformed heuristics. It strips from each of us the capacity to be the far-sighted forces of the universe that each of us are, devolving towards becoming short-sighted, push-button automata, relinquishing our sentience to the rule-based, single-step intelligence of our pre-neo-cortex ancestors.

Even so, the ability to think deeply still exists in areas where it must. Great chess and Texas Holdem players look several steps out, where each step along the way doesn’t necessarily take them closer to the ultimate goal. They understand the notion of delayed gratification as sacrifice and investment. They consider the consequences for being wrong as well as giving weight to plans with the most “outs”.

Over the years I’ve met many brilliant software debuggers and troubleshooters. They have this knack for beginning at the end, where all the pieces have converged into an incident, an error code or murder, and working backwards, with all those clues dispersing into what appears to be a mess of unrelated facts. Working backwards as such helps us to prune out the virtually impossible paths without tossing out the interesting outliers and weak links which may be the signs of things to come. Maybe that’s why “visualizing” a successful outcome, a technique taught in sports psychology, often works. Seeing that end, our subconscious works backwards ultimately to where we are now.

Geologists, like mystery writers, start with the end product, what they are currently seeing, such as a strange piece of sandstone with silver in it, and work backwards to unravel how it became what it is. They wonder, “How did this silver get here?” They must work through events that involve the effects from combinations of moving continents, volcanos, erosion, chemical and physical processes, the effects of biological activity. Every rock has a complicated and somewhat unique story, a story we can Google, but someone has to have figured it out first.

Experienced, grizzled doctors, geologists, mystery writers, detectives, software debuggers, and the elderly, know that for every outcome there are countless combinations of contributing factors, and so there is hardly really just one “major cause”; and that every time we attempt to never let something bad happen again by outlawing the causes for that one incident, there are countless other ways for it to happen.

So with all that said, now that Big Data has laid the foundation for gathering and accessing practically any sort of data, the next step in analytics will mature beyond the reactive, quick-answer machines of instant gratification, beyond asking questions of the “What is the ‘blank’?” nature we’ve conditioned ourselves to ask, towards “What are the steps towards achieving ‘blank’?” The mainstream analytics space has punted on what it takes to reach a level of analytics that can promote orderly design and execution of strategies without sacrificing our personal individuality or ignoring the fact that our competitors aren’t going to cooperate. That is, we need to move back to focusing on relationships and the complex nature of things, as opposed to simply more data, the illusion of doing something while avoiding the hard problem. And by “back to” I mean before the instant gratification days when every day was a unique puzzle unanswerable by Google and food didn’t appear in five minutes.

Instead of accentuating specific “powers”, such as obtaining the accuracy of an eagle’s eyes, as facial recognition systems do, or the speed of rather dumb machines, new technologies will help accentuate our ability to explore the infinite frontiers of the 4th dimension, in truly massively parallel, massively recursive, and massively hierarchical fashion. They will accentuate our logical capability rather than just being a glorified index, leaving the heavy intellectual lifting to our three pounds of brains. Genetic Algorithms will find solutions not by progressively selecting from sets of possibilities with immediate improvement, but by looking at multiple steps towards improvement. With that expanded capability to explore the 4th dimension, we can then be braver about considering the unlikely, the black swans and weak ties, and not freeze through analysis paralysis or over-react to paranoia.

Our business intelligence systems will finally focus on the relationships between data and not just improving analysis with yet more data. Such an improvement will enable real integration, of the “whole is greater than the sum of its parts” variety, of our collective experiences and expertise, allowing for a genuine move from the limitations of top-down control at our institutions towards a distributed, scalable intelligence.

Until then, with or without that new technology accentuating our ability to explore the 4th dimension, here are a few things to keep in mind.

Keep it in the forefront of your mind that anything is possible. I mean anything. I don’t mean this in a “new age, positive outlook” or even a “quantum physics, uncertainty principle, multiverse” way. I mean that, firstly, the world is a complex system and we’re always making decisions on imperfect information; nothing is certain, even those weather forecasts stating there is a zero-percent chance for rain. Secondly, we are just 3D creatures faking our way through what is probably even more than a four-dimensional world. Each dimension doesn’t merely add another element to a tuple, but almost god-like powers; read “Flatland” and “Spaceland”. And thirdly, our brains, though massive in capacity, are still limited to the experiences we can shove into them, in what is a very short time and while being trapped in one place at a time, so our logic is innately limited. Meaning, our logic could be infallible, but it’s based on our limited and unique set of experiences; so we could still be wrong. Don’t be one of those, “If it hasn’t yet happened, it couldn’t possibly happen” or “That’s not the way it’s been done” people.

Lastly, visit Zion, Canyonlands, and Bryce Canyon National Parks; someplace bigger than you, bigger than humanity, that cannot be tamed. Hike the trails, even the scary ones; but don’t do anything crazy. Say hello to every single person you pass on the trail and always yield to them on the narrow, scary parts. Find one of those views that cannot be put into words, beyond just a poetic sense, because it requires integration of all of your senses; Angel’s Landing and Observation Point are my favorites. Take off your backpack, pull out a bottle of water and a Clif Bar, and sit. Feel the vastness of the past written in every wall of those parks. Try to take a picture of it, and realize that 2D photo faking a 3D scene will never match what your five senses, consciousness, and subconscious as a whole are feeling at that instant. Ponder that the distance between humans and the dinosaurs is 65 million years, but the distance of strata from the top of Bryce Canyon to the lower reaches of Zion spans over 200 million years. And know that in a short time what you see will also be gone, replaced by something else. Think about how those crevices and even the entire canyons were formed by many processes, one of which goes something like this: a random ball of plant matter inside that sandstone, long ago submerged under the water table, leached out and attracted the iron in the sandstone around it into a hard, walnut-sized iron concretion; that iron concretion eventually was exposed through erosion and fell out leaving a pock mark, which grew to a larger dent, merging with other dents, exacerbating into a crevice and even a canyon. Stay there long enough to let that feeling etch into you. Then begin your journey back. Take your time so as to be careful about your knees and not twist your ankle. And think of all the ways Zion could have been or even not have been had even the most miniscule of things been different.

Posted in BI Development, Cutting-Edge Business Intelligence, Data Mining and Predictive Analytics | 1 Comment

Protected: Embracing Complexity – Pt 2 of 6 – Data Prep and Pattern Recognition

This content is password protected. To view it please enter your password below:

Posted in BI Development, Cutting-Edge Business Intelligence, Data Mining and Predictive Analytics

Embracing Complexity – Pt 1 of 6 – Non-Deterministic Finite Automata

Introduction to this Series

“Big Data” and “Machine Learning” are two technologies/buzzwords making significant headway into the mainstream enterprise. Drawing analogies between these two technologies and those at the start of the Internet era twenty-something years ago:

  1. Big Data is analogous to a Web crawler capable of accessing large numbers of available Web pages.
  2. Machine Learning is analogous to search engines such as Yahoo’s back then followed by Google that index the salient pieces of data (key words and page ranks – the latter in Google’s case).

But the search engines fell short of being anything more than a glorified index, like a telephone book providing someone’s name and address only if you already know the person’s name. Similarly, our current analytics systems fall short in that they only provide answers to questions we already have.

Before moving on, it’s important to ensure the distinction between a complicated and a complex problem is understood up front. The Twitter summation of the theme of this blog is: We currently analyze complex problems using methodologies for a complicated problem. Our dependable machines with many moving parts, from washing machines to airplanes, are complicated. Ecosystems, the weather, customer behavior, the human body, the stock market are complex.

Complicated things (no matter how complicated) operate in a closed system (at least we create the closed system environment or just pretend it is closed) where cause and effect between all parts are well understood. Complex systems have many moving parts as well, but unlike complicated systems, relationships between the moving parts are not well defined; therefore, outcomes are not deterministic. Most of our “truly” analytical questions actually address complex systems, which we attempt to answer using techniques designed for answering complicated questions.

This is the first in a series of blogs laying a foundation for breaking free from the analytical constraints of our current mainstream analytics systems founded upon “intelligently designed” (designed by we humans) databases. Such systems are built based on what we already know and expect, hardly ever considering the unknown, giving it the “out of sight, out of mind” treatment. For this blog, I’ll introduce the basic idea for employing NFAs and a simple T-SQL-based sample. Subsequent blogs in this series will build on the concept.

To better deal with complexity, we can paradoxically retreat from the current mainstream, Turing Machine-inspired computer design (the top of the stack of the theory of computation) to the far less sophisticated Non-Deterministic Finite Automata, NFA (2nd from the bottom, only to the Deterministic Finite Automata). NFAs are simple, more elemental constructs with far more flexibility in expressing a very wide variety of the rules of daily life. The tradeoff is that NFAs may not be exactly streamlined and they could seem unwieldy to an engineer, but we’ll have the power to emulate the rules of daily life with a “higher resolution” from that lower granularity.

This post and the next two posts of this series comprise a sub-series, an introduction to the utilization of NFAs and pattern recognition in the BI world. Post 2 will introduce pattern recognition and a simple T-SQL-based application as well. Post 3 will tie Posts 1 and 2 together, again with a T-SQL-based application, with a mechanism for processing incoming symbols by multiple NFAs in parallel – at least in a set-based manner (let’s just call it “quasi or pseudo-parallel”) as a first step.

Posts 4 through 6 will deal with further characteristics of the system, for example, exploring further the notion of “what fires together, wires together”, as well as diving deeper into a physical implementation better suited for scalability of such ideas. In particular, Hekaton and Micron’s Automata Processor, which I’ll discuss briefly in this post. By Post 6, we will be at the doorstep of what I had intended to encapsulate in Map Rock, which is a focus on changing relationships as opposed to just keener recognition and control.

This is a blog about AI, but not the sort of AI we usually think about, which I believe is still a few years away (despite the incredibly rapid improvement of IBM’s Watson on several dimensions since its debut on Jeopardy in 2011). I certainly can’t explain everything about NFAs and AI in these 5000+ words, or even 500,000. However, I think you’ll find the theme of this set of blogs useful if we can for now at least agree that:

  1. The world is a complex, adaptive system, where our current analytical systems are about to reach their limits,
  2. In order for us to make superior decisions we need true massive parallelism to paint the always dynamic, messy picture of what is around us,
  3. Predictive analytics models are rules reflecting what we’ve come to know; they work because life on Earth, although dynamic, is relatively stable, but the models eventually go stale,
  4. And that working at a lower, more elemental level gives us flexibility we don’t have at a higher, more object-oriented level.

Lowering the Granularity of Our Computing Objects

Most readers of this blog have probably encountered the concepts of NFAs (and Regular Language, Context-Free Language, Turing Machine, etc) in college under a course in theory of computation. Most would agree that it is still taught simply because it has always been taught, just a formality towards a CS degree, as such concepts almost never appear in the world of high-level programming. But we’re running into a wall as we begin to ask our analytics systems new sorts of questions, questions of a complex nature that we try to answer with techniques built for the complicated. Our computer systems are built to address well-defined, well-understood problems, employed merely as something that can do certain jobs better, as we would employ a good fighter as a nightclub bouncer.

Computing at the less sophisticated but more granular level of the NFA removes much of the rigidity imposed by computations that have the luxury of being optimized for static, reliable conditions, for which we make countless assumptions. This is analogous to how we don’t deal with life thinking at the atomic or molecular level of the things we encounter every day but at the macro level of objects; apples, bosses, tasks, and ourselves (we’re a macro object too).

We could even look at 3D printing as a cousin of this lower granularity concept. Instead of being completely limited to the need to manufacture, ship, and store zillions of very specific parts, we can instead have big globs of a few types of stuff from which we can generate almost anything. Well, it’s not quite that extreme, but it’s the same idea. Similarly, I don’t believe NFA processing will replace relational databases in the same way 3D printing shouldn’t replace manufacturing. 3D printing isn’t optimal for things we know won’t change and that we need in great quantities. There will be a mix of the two.

We Already Do Quite Well with Determinism, So Why Bother with This?

Our human brand of intelligence works because our activities are mostly confined to a limited scope of time and space. Meaning, our best decisions work on a second to second, day to day basis involving things physically close to us. Additionally, the primary characteristics of things we deal with, whether cars, bears or social rules, remain fairly constant. At least they evolve at a slow enough pace that we can assume the validity of the vast majority of relationships we don’t consciously realize are nonetheless engaged in our decisions. In fact, the ratio of what we know to what we don’t know makes “tip of the iceberg” sound like a ridiculous understatement. If things evolved (changed) too quickly, we couldn’t make those assumptions and our human brand of intelligence would quickly fall to pieces through information overload.

In fact, changes are the units of our decision making, predictions we make with the intent of furthering us towards our goals. Our brain (at least through vision) starts with the 2D image on our retina, then applies some innate intelligence (such as shadows) and some logic (such as what is obscuring what) to process depth. And finally, tracking and processing changes is how we handle the 4th dimension of time.

When we form strategies to achieve a goal, it’s the changes, how a change leads to transitions in things, that form our strategies, ranging from something as mundane as getting breakfast to planning for retirement to getting a person on Mars. Strategies are like molecules of cause and effect between the atoms of change that we notice. The fewer changes involved, the more effective our decisions will be, as accuracy is progressively lost over a Bayesian chain of cause and effect. We are more successful obtaining the breakfast we desire right now than planning how we will retire decades from now as we envisioned today (due to cumulative changes over a long period of time).
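To put a rough, purely illustrative number on that loss: if each link in a chain of cause and effect is right 90% of the time, a ten-link chain compounds to roughly 0.9 to the 10th power, or about 35%, which is why the breakfast prediction fares so much better than the retirement plan.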

A key thing to keep in mind is that in enterprises it seems the default attitude towards change is to deal with it as an enemy or pest, something to be reviled, resisted, and eliminated. However, to state the obvious in Yogi Berra fashion, without change, nothing changes. Change is what makes things better or worse. Unfortunately, in what is for all pragmatic purposes a zero-sum world, some will experience change for the better, some for the worse. But because change is always happening, those currently in “better” positions (the top echelon of enterprises) must vigilantly improve or at least maintain that condition.

Even maintaining the status quo is the result of constant change, except the net measurement is the same. For example, maintaining my body weight doesn’t mean nothing has changed. I’m constantly overeating, then compensating by under-eating (and occasionally even vice-versa). For those finding themselves in worse conditions, the ubiquity of change means there is always hope for the better.

Change as the basis for intelligence is rooted in the fact that our home, Earth, is a hugely dynamic, complex system powered by intertwined geologic forces and biological replication. Geologic forces are driven by forces deep in the Earth as well as way over our heads in the clouds, sun, and meteors. The ability for cells to replicate is the underlying mechanism by which all the life we live with self-organized. Every creature from viruses through swarms of bees through humans is driven to “mindlessly” take over the world. But we millions of species and billions of humans have settled into somewhat of a kaleidoscope of opposing forces, at least in the bigger picture, which is like a pleasantly flowing stream, seemingly the same, but in reality in a constant state of change. The mechanisms of evolution and our human intelligence both enable adaptability on this fairly smoothly-dynamic planet.

A Few Clarifications

If all of this sounds obvious and/or like a bunch of flowery crap, it could be that it’s only obvious when it’s brought to our attention, but quickly dismissed and forgotten as we resume the drudgery of our daily lives, being careful not to break any of the hundreds of thousands of laws micro-managing our lives, following best (expected) practices that immunize us from culpability, and being careful not to trip over social mores that weren’t there yesterday. Our Industrial Revolution upbringing raised us to seek and expect comfort.

I would also like to point out that I’m not suggesting a system that simply gives us more to worry about, distracting us from what’s important, undermining our abilities through information overload (like a DoS attack). The main idea is not to replace us with some sort of AI system. It is to supplement us; watch our backs (it can multitask better than we can), see what our innate biases overlook, reliably rule out false positives and false negatives through faster exploration of the exponentially growing number of possibilities and continued testing of paths to goals (the definition of success).

The Expanding Reach of Our Daily Lives

However, many factors emerging in large part due to the increasing power of technology are expanding the scope of the time and space in which we individually or as an enterprise operate. Globalization introduces many more independently moving parts. Longer lives increase the cumulative changes each of us experiences in our lifetime. The rapidly growing human population has greatly expanded the reach of our species to the point where there’s practically nowhere on the surface we don’t inhabit. The countless devices feeding a bottomless pit of data collection, storage and dissemination expand the scope of our activities over time and space.

I purposely prepend the word “independently” to the phrase “moving parts” used in the previous paragraph. That’s because the fact that the parts are independently intelligent decision makers defines the world-shattering difference between complicated and complex. However, the level of “intelligence” of these independently moving parts doesn’t necessarily mean matching or even attempting to emulate human-level intelligence. Billions of machines from household appliances to robots running a manufacturing plant are being fitted with some level of ability to make decisions independently, whether that means executing rules based on current conditions or even sorting through true positives, false positives, false negatives, and all that good stuff.

With the limited scope of time and space typical of the average human during the 1800s and 1900s, complicated machines were effective in performing repetitive actions that have served, still do, and always will serve us very well. But in the vastly increasing scope of time and space in which individuals, their families, and businesses operate, making good decisions becomes an increasingly elusive goal.

If we don’t learn to embrace complexity to make smarter decisions, we will be to entities that are embracing it (such as hedge funds and organizations with Big Brother power) as fish and other game are to humans with symbolic thinking. Embracing complexity doesn’t mean something like giving up our ego and becoming one with the universe or going with the flow. It means we need to understand that in a complex world:

  • We need to be flexible. We cannot reject the answer of “it depends” obsessively seeking out the most convenient answer.
  • Trial and error is a good methodology. It’s also called iterative. Evolution is based on it, although our intelligence can streamline the process significantly. On the other hand, our limited experience (knowledge) means we very often miss those precious weak ties, the seeds that beat out competition to rule the next iteration.
  • Our logic is prone to error because of the ubiquitous presence of imperfect information (a huge topic).
  • It’s a jungle out there, every creature and species out to take over the world. The only thing stopping them is that every other creature is trying to take over the world.

I discuss my thoughts around complexity, strategy, and how Map Rock approaches the taming of it in Map Rock Problem Statement – Parts 4 and 5.

Reflecting the World’s Cascading and Dynamic Many to Many Nature

A very intriguing discipline for dealing with complexity is Situation Awareness. Its roots lie in war, in battle scenarios, for example as a methodology for fighter pilots to deal with the chaotic realities of life and death fighting. In such situations, there are many independently moving parts, including some that you cannot trust. With all the training on tactics and strategies, a good opponent knows the best way to win is to hit you where you weren’t looking. In other words, things don’t go as planned. So we must be able to recognize things from imperfect and/or unknown information.

Figure 1 depicts a variation of an entity relationship diagram of a supply chain. Notice that unlike the usual ERD, there aren’t lines linking the relationships between the various entities. That’s pretty much because there are so many relationships between the entities that representing each with a line would result in a very ugly graph, and simply showing that relationships exist would be oversimplifying things.

Those entities have minds of their own (will seek their own goals), unlike the “ask no questions” machines such as cars and refrigerators (at least for now). Instead of the conventional lines from entity to entity that strongly reinforce only the sequential aspects of a system, I depict “waves” (the yellow arcs) which attempt to reinforce the massively parallel aspects of a system as well.

Figure 1 – A sort of Entity Relationship Diagram of a Supply Chain.

The entity diagram shows each entity broken down into three parts of varying proportions denoted by three colors:

  • Black – Unknown information. Highly private, classified. This information does indeed exist, but it is unobtainable or much too expensive to obtain. Contrast that with unknowable information, for example, information so far out in the future that no one could possibly predict it … except in hindsight. Therefore, perhaps there should be a black for unknowable data and a dark gray for private/classified data.
  • Gray – Imperfect information. This could be indirectly shared, but statistically reliable; the basis for Predictive Analytics. Or it could be shared information, but suspect, or possibly outdated.
  • White – Known. This is information readily shared, validated, and up to date. We would also tend to perceive it as reliable if we knew that it benefited us.

The proportions of black, gray, and white are just my personal unscientific impressions of such entities based on my personal experience exploring the boundaries of what we can and cannot automate after 35+ years of building software systems. The main point of Figure 1 is to convey that the white portion is the “easy” part we’ve been mostly dealing with through OLTP systems. The gray and black parts are the hard part, which does comprise the majority of information out there and the stuff with the potential to screw up our plans.

In a fight, whether as a predator versus prey, a street brawl, or business competition, we can either play defensively (reactively) or offensively (proactively). Defensive querying is what we’re mostly used to when we utilize computers. We have a problem we’re attempting to solve and query computers for data to support the resolution process executing in our heads. However, in the “jungle”, situations (problems) are imposed on us, we don’t choose the problem to work on. Our brains are constantly receiving input from multiple channels, recognizing things, grouping them, correlating them.

Not counting video games, most of how we relate to computers is that computers answer direct questions posed to them by us, which help us to answer complicated questions, involving relationships between things, that we’re working through with our brains. Video games are different from most of the software systems we use in that the computer is generating things happening to us, not just responding to our specific queries. In the real world, things happen to us, but software is still not able to differentiate to an adequate degree what is relevant and what isn’t, so we end up with enough false positives that it creates more confusion or so many false negatives that we miss too much.

The most important thing to remember is that the goal of this series of blogs is to work towards better reflecting the cascading and dynamic many to many relationships of the things in the world in which we live our lives. To do this, our analytics systems must handle objects at a lower level of granularity than we’re accustomed to, which can be reconstructed in an agile number of ways, similar to how proteins we consume are broken down in our guts into more elemental amino acids and reconstructed into whatever is needed.

Then, we must be able to ultimately process countless changes occurring between all these objects in a massively parallel fashion. To paint the most accurate picture in our heads required to make wise decisions, we need to resist forcing all that is going on into brittle, well-defined procedures and objects.

Non-Deterministic Finite Automata

Probably the most common exposure to NFA-like functionality for the typical BI/database developer is RegEx (same idea as the less powerful LIKE clause in a SQL statement). But I think it’s thought of as just a fancy filter for VARCHAR columns in a WHERE clause, not as a way of encoding a pattern which we wish to spot in a stream of symbols. These symbols can be characters of words (pattern) in a text document (the typical use case for RegEx), the stream of bases in DNA, the sales of some product over many time segments. Indeed, NFAs are the implemented version of regular expressions.

The NFA is a primary tool in the field of pattern recognition. An NFA encodes a pattern that can recognize such sequences of symbols. Sometimes these sequences may not be entirely consecutive (handled by loopbacks), sometimes not even entirely ordered (handled by multiple start states), and they can lead to multiple outcomes (the “non-deterministic” part, handled by multiple transitions for the same symbol).

When we think of pattern recognition, we usually think of high-end, glamorous applications such as facial or fingerprint recognition. But pattern recognition is one of the foundational keys for intelligence. It’s exactly what we humans seek when browsing through views in PowerPivot or Tableau. We look for things that happen together or in sequence. And the word “thing” (object) should be taken in as liberal a manner as possible. For example, we normally think of a thing as a solid physical object, but things can be as ephemeral as an event (I like to think of an event as a loosely-coupled set of attributes), recognized sequence of events, and things for which there isn’t a word (it needs a few sentences or even an entire course to describe).

If we think about it, seeing the things we see (at the “normal” macro level) is sort of an illusion. Meaning, we see an image (such as an acquaintance or a dog) reconstructed from scratch. When I see a friend of mine, my eyes don’t reflect into my head a literal picture of that friend. That vision begins with my eyes taking in zillions of photons bouncing off of that friend, which are quantified into a hundred million or so dots (the number of rods and cones on my 2D retina), into a number of edges (lines) processed by my visual cortex, into a smaller number of recognitions processed with components throughout my brain. If it didn’t work this way, it would be impossible to recognize objects from whatever angle 4D space-time allows and even partially obscured views, as presented to us in real life.

Further, I don’t see just that person in isolation. “That friend” is just one of the things my brain is recognizing. It also, in massively parallel fashion, recognizes all other aspects, from the expression on his face, to what he is wearing, to funny angles that I wouldn’t be able to make out by themselves (all the countless things for which there isn’t a single word), to memories associated with that friend, as well as things around that friend (the context). Intelligence is massively parallel, iterative (hone in on a solution, through experimentation), massively recursive (explores many possibilities, the blind alleys), and massively hierarchical (things within things).

NFAs are visualized as a kind of directed graph, connected nodes and relationships (lines, edges). We’re well familiar with them, for example, org charts, flow charts, UML diagrams, entity relationship diagrams. However, NFAs are mathematical constructs abiding by very specific rules. These rules are simple, but from these simple rules, we can express a wide range of rules for pattern recognition.

Figure 2 depicts an example of an NFA. The NFA on the left is used to recognize when a market basket contains chicken wings, pizza, and beer. That could signify a large party, which could be used to notify neighbors of the need to book that weekend getaway. The one on the right is used in more of the conventional, “if you have pizza and beer, perhaps you’d like chicken wings as well”, utilization of market basket analysis.

Figure 2 – Sample of an NFA.

A very good series of lectures on NFAs, and actually the wider field of the Theory of Computation, is by Dan Gusfield. L1 through L3 of the series are probably the minimum for a sufficient understanding of NFAs, although there is very much value in understanding the more sophisticated concepts, particularly Context-Free Languages and Turing Machines. I like Prof. Gusfield’s pace in this series. Ironically (for a tech blog), the fact that he still writes on a chalkboard slows things down enough to let it all settle.

I believe a big part of the reason why graph structures still play a relatively minor role in mainstream BI development is because we’ve been trained since our first database-related “hello world” apps to think in terms of high-level, well-defined, discrete “entities” reflected in the tables of relational databases. Each entity occupies a row and each column represents an attribute. It’s easier to understand the tidy matrix of rows and columns, particularly the fixed set of columns of tables, than the open-ended definitions contained in graphs. That goes for both people and computers, as a matrix-like table is easier for servers to process.

In order to process enterprise-sized loads of data, we needed to ease the processing burden on the hardware by limiting the scope of what’s possible. We made “classes” and table schemas as structured as possible (ex, a value can be pinpointed directly with row and column coordinates). We stripped out extraneous information not pertinent to the automation of our well-defined process.

We also neglected implementing the ability to easily switch in and out characteristics that were once extraneous but are now relevant. Graphs, lacking a fixed schema, don’t have the crisp row and column schemas of tables nor the fixed set of attributes. So we decided we can live with defining entities by specific sets of characteristics, forgetting that objects in the world don’t fit into such perfectly fitted slots.

There are techniques for conventional relational databases that support and reflect the unlimited attributes of objects, for example, the “open schema” techniques where each row has three columns for an object id, an attribute, and the value of the attribute. There can be an unlimited number of attributes associated with each object. But the relational database servers executing those techniques struggle under the highly recursive and highly self-joined queries.
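To make the “open schema” technique concrete, here is a minimal sketch; the table and column names are illustrative only, not taken from any particular system. Even this toy query hints at why such schemas strain a relational engine: every additional attribute in a question costs another self-join.

    -- Hypothetical "open schema" (entity-attribute-value) table; names are illustrative only.
    CREATE TABLE dbo.ObjectAttribute
    (
        ObjectId       INT          NOT NULL,
        AttributeName  VARCHAR(100) NOT NULL,
        AttributeValue VARCHAR(400) NOT NULL,
        CONSTRAINT PK_ObjectAttribute PRIMARY KEY (ObjectId, AttributeName)
    );

    -- Even a simple question ("objects in WA that like hiking") needs one self-join per attribute.
    SELECT a1.ObjectId
    FROM dbo.ObjectAttribute AS a1
    JOIN dbo.ObjectAttribute AS a2
      ON a2.ObjectId = a1.ObjectId
    WHERE a1.AttributeName = 'State' AND a1.AttributeValue = 'WA'
      AND a2.AttributeName = 'Hobby' AND a2.AttributeValue = 'Hiking';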

In an ideal world, if computers, way back in the days when only large, well-oiled enterprises used them (environments where taking in ambiguity as a factor isn’t usually a problem), had already been as powerful as they are today, my guess is we would have always known graph databases as the mainstream, with relational databases appearing later as a specialized data format (as OLAP is also a special case). Instead of a relational database being mainstream, we would think of a relational database table as a materialization of objects into a fixed set of attributes pulled from a large graph. For example, in a graph database there would be many customer ids linked to all sorts of attributes customers (people) have, spanning all the roles these people play and all their experiences. There could be thousands of them. But for a payroll system, we only need a handful of them. So we distill a nice row/column table where each row represents a customer and a small, fixed set of columns represents the attributes needed to process payroll.

Graph-based user interfaces (most notably Visio – at least to my heavily MSFT-centric network) have long existed for niche applications. But there is a rapidly growing realization that simply more data (volume), provided ever faster (velocity), and in more variety alone doesn’t necessarily lead to vastly superior analytics. Rather, it’s the relationships between data, the correlations, that directly lead to actionable and novel insight. So enterprise-class graph databases such as Neo4j, optimized for authoring and querying graph structures, are making headway in the mainstream enterprise world.

However, keep in mind that NFAs are very small, somewhat independent graphs, unlike large, unwieldy graphs more akin to a set of large relational database tables. In other words, the idea of this blog is querying very many small NFAs in massively parallel fashion, as opposed to one or a few large tables (or a messy “spider-web” graph). In a subsequent post in this series, we’ll address the “somewhat independent” aspect of NFAs I mention above; loosely-coupled.

Up to now we weren’t compelled enough to take a leap to graph databases since we were able to accomplish very much with the sterile, fixed-column tables of relational databases. And, if graphs were too unwieldy to deal with, we retreated back to the comfort of the 2D world of rows and columns. But we’re now beginning to ask different kinds of questions of our computer systems. We don’t ask simply what the sales were of blue socks in CA during the last three Christmas seasons.

We ask how we can improve sales of blue socks and attempt to identify the consequences (ex. Does it cannibalize sales of purple socks?). The questions are more subjective, ambiguous, and dynamic (SAD – hahaha). These are the sort of questions that were in the realm of the human brain, and we in turn turn to our computer databases to answer the empirical questions supporting those more complicated questions.

SQL Server 2014’s new in-memory Hekaton features could help as well. Similar to an OLTP load, processing NFAs would involve a large number of queries making small reads and writes. This is in contrast to an analytics application such as OLAP which involves relatively few queries by comparison reading a large amount of data and no updates are made except for a scheduled refresh of the read-only data store. I’ve made this comparison because I think of this utilization of NFAs as something in the analytics realm.

But applied in an implementation involving thousands to millions of NFAs (such as Situation Awareness), a highly-parallel implementation, it could involve a large number of large reads and writes as well. So we have an analytics use case for a technology billed as “in-memory OLTP”. The advantage of using Hekaton over Neo4j is that we could implement our NFA system using familiar relational schemas and querying techniques (SQL and stored procedures) instead of invoking a new technology such as Neo4j.

Hekaton should provide at least an order of magnitude improvement over doing the same thing with conventional disk-based relational tables. This performance improvement comes first from the all-in-memory processing and from dropping the overhead required for disk-based tables.
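As a rough, minimal sketch of what that might look like (this is not the implementation from the later posts, and the table and column names are assumptions), a memory-optimized table holding NFA transitions could be declared like this; the database would first need a MEMORY_OPTIMIZED_DATA filegroup:

    -- Hypothetical memory-optimized table for NFA transitions (SQL Server 2014 Hekaton).
    -- Names are assumptions; the database must already have a MEMORY_OPTIMIZED_DATA filegroup.
    CREATE TABLE dbo.NFATransitionsInMem
    (
        NFAId       INT NOT NULL,
        FromStateId INT NOT NULL,
        ToStateId   INT NOT NULL,
        SymbolId    INT NOT NULL,
        CONSTRAINT PK_NFATransitionsInMem
            PRIMARY KEY NONCLUSTERED HASH (NFAId, FromStateId, SymbolId, ToStateId)
            WITH (BUCKET_COUNT = 1048576)
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);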

Micron’s Automata Processor

Much more intriguing and relevant than Hekaton for this blog focused on NFAs is Micron’s Automata Processor, which I wrote about almost a year ago in my blog, A Rare Big Thing Out of Boise. The Automata Processor (AP) is a memory-based chip directly implementing the mechanisms for the massively parallel processing of NFAs.

This should result in at least a few orders of magnitude of further performance improvement, first from the fact that it has little if any “generalist” overhead since it is designed as an NFA-specific chip and not a general-purpose memory chip. It also processes NFAs in truly massively parallel fashion.

Thirdly, the “processing” mechanism (to process large numbers of NFAs in parallel) is built directly onto the chip, which means that there is no marshaling of bits between a CPU and memory over a bus for every single operation. So even if we were to compile NFAs down to “native code” (as Hekaton’s native stored procedures do), massively multi-threaded on a relatively massive number of CPUs, there would be great hurdles to overcome in beating the Automata Processor.

We could look at the AP as merely an optimization for a particular class of problems. The sort that recognizes patterns such as faces or gene sequences in a huge stream of data. But similarly we can look at the current mainstream computer architecture (CPU and storage device – RAM, hard drive, or even tape) as an optimization for the vast majority of the classes of problems we deal with in our daily lives as we’re accustomed to (at the macro level) in our vestigial Industrial Revolution mentality. That would be the well-defined, highly repetitive, deterministic class of problem which is the hallmark of the Industrial Revolution.

So instead I like to look at the Automata Processor as a technology that is a lower-level (lower than the Turing Machine) information processing device capable of handling a wider variety of problems; those that are not well-defined, highly repetitive, and deterministic. NFAs are like molecules (including very complicated protein molecules), not too high, not too low, high enough to solve real-world problems, but not so unnecessarily low-level and cumbersome. An analogy would be assembler language being low enough to dance around the roadblocks imposed by a high-level language, but not as cumbersome as programming in zeros and ones.

This parallelism could mean massive scales such as up to tens of thousands or even millions of NFAs. The main idea is that each piece of information streaming into a complex system could mean something to multiple things. The reality of the world is that things have a cascading many to many relationship with other things. For example, a sequence of three sounds could be the sound of countless segments of a song rendered by countless artists, countless phrases uttered by countless people with countless voices, countless animals.

NFA Sample Application

At the time of this blog’s writing, Micron had not yet released its development toolkit (the chip/board itself and the Software Development Kit, SDK) to the public for experimentation. That is one of the major reasons I decided to demonstrate NFAs using conventional, disk-based SQL on SQL Server 2014, at least for this blog. However, the Automata Processor’s SDK is accessible at the time of this blog’s posting (after signing up) by visiting http://www.micronautomata.com/

There is still much value in demonstrating NFAs in this conventional manner, even with the impending release of the Automata Processor SDK. First, the AP is etched in stone (literally in the silicon). For the business-based solutions I can imagine (ie, more mainstream than specific applications such as bioinformatics), I believe that there are a few deficiencies in the implementation (which I intend to supplement with Hekaton). There will be techniques outside the realm of the AP that would be of benefit and which we can implement using a flexible, software-based system (ie a relational database management system). The details of the business-based solutions I can imagine and designs I’ve created around the AP are well beyond the scope of this blog. However, these old blogs provide sample background on such efforts of mine:

This sample implements a very bare-bones NFA storage and query system for SQL Server. This version isn’t optimized or modified for Hekaton (which is a subsequent post) as that would further expand the scope of this blog. This simple application supports the NFA concepts of:

  • Multiple Start states.
  • Looping back on a transition. This allows us to “filter” noise.
  • Transitions to multiple nodes for the same symbol. This is the main feature of an NFA that distinguishes it from the less powerful “deterministic” finite automata.
  • Epsilon transitions (transitioning for any reason).

Please understand that this sample code is intended to be used as “workable pseudo code”. This sample certainly doesn’t scale. It is meant to convey the concepts described here. More on this in Part 3 of this series.

This script, generated through SQL Server Management Studio, NFA – Asahara.sql, consists of a T-SQL DDL script that creates a SQL Server database named NFA, a few tables, and a few stored procedures (a rough sketch of the table shapes follows the list):

  • Database, NFA – A simple, conventional (disk-based) database in which the sample database objects will reside.
  • NFA, Schema – A schema simply for cleaner naming purposes.
  • NFA.[States], Table – Holds the states (nodes) of the NFAs.
  • NFA.[Symbols], Table – A table holding the symbols.
  • NFA.[Transitions], Table – Holds the transitions (edges, lines) of the NFAs.
  • [NFA].AddNFAFromXML, Stored Procedure – Takes an XML representing an NFA and registers (persists) it to the tables.
  • [NFA].ProcessWord, Stored Procedure – Takes a “word” (string of symbols) as a parameter and processes the word, symbol by symbol, through all of the registered NFAs.
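Since the script itself isn’t reproduced in the post, here is a plausible minimal shape for those tables, just to make the later XML and word-processing examples concrete. The column names and constraints are assumptions; the actual NFA – Asahara.sql script may well differ.

    -- A plausible minimal shape for the sample's objects; column names are assumptions.
    CREATE SCHEMA NFA;
    GO

    CREATE TABLE NFA.[States]
    (
        NFAId     INT          NOT NULL,
        StateName VARCHAR(100) NOT NULL,
        IsStart   BIT          NOT NULL DEFAULT 0,  -- multiple start states are allowed
        IsFinal   BIT          NOT NULL DEFAULT 0,  -- a word ending on a final state is "recognized"
        CONSTRAINT PK_States PRIMARY KEY (NFAId, StateName)
    );

    CREATE TABLE NFA.[Symbols]
    (
        SymbolName VARCHAR(100) NOT NULL,
        CONSTRAINT PK_Symbols PRIMARY KEY (SymbolName)
    );

    CREATE TABLE NFA.[Transitions]
    (
        NFAId      INT          NOT NULL,
        FromState  VARCHAR(100) NOT NULL,
        ToState    VARCHAR(100) NOT NULL,
        SymbolName VARCHAR(100) NOT NULL,  -- a reserved symbol (e.g. 'epsilon') could represent epsilon transitions
        CONSTRAINT PK_Transitions PRIMARY KEY (NFAId, FromState, SymbolName, ToState)
    );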

For this sample, you will simply need the SQL Server 2008 (or above) relational engine as well as the SQL Server Management Studio to run the scripts. I didn’t include Hekaton features for this sample in part to accommodate those who have not yet started on SQL Server 2014. These are the high-level steps for executing the sample:

  1. Select or create a directory to place the SQL Server database (NFA.MDF and NFA.LDF), then alter the CREATE DATABASE command in the NFA – Asahara.sql script to specify that directory (a sketch of that edit follows these steps). The script uses C:\temp.
  2. Create the NFA database (database, tables, stored procedures) using the NFA – Asahara.sql T-SQL script.
  3. Register the NFAs following the instructions near the top of the [NFA].AddNFAFromXML stored procedure.
  4. Run the sample queries located in comments near the top of the [NFA].ProcessWord stored procedure.
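For step 1, the portion of the script to alter presumably looks something like the following; the exact file options in the real script may differ, and only the FILENAME paths matter here:

    -- Point the data and log files at your chosen directory; the post's script uses C:\temp.
    CREATE DATABASE NFA
    ON PRIMARY ( NAME = N'NFA',     FILENAME = N'C:\temp\NFA.MDF' )
    LOG ON     ( NAME = N'NFA_log', FILENAME = N'C:\temp\NFA.LDF' );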

Figure 3 depicts the SQL Server objects that are created by the script (database, tables, stored procedures, one TVF, but not the schema) and the part of the script where the database file directory can be set.

Figure 3 – Beginning of the SQL script for this blog.

Once the objects are created, sample NFAs need to be registered to be processed. A few examples are provided in a commented section near the top of the code for the stored procedure, [NFA].AddNFAFromXML. The stored procedure accepts the NFA as an XML file since it is a flexible way to exchange this information without a fancy UI.

Figure 4 shows one of those sample NFAs (passed as an XML document) that determines if a prospective job would be favorable, as well as the states/transitions that are registered into the tables.

Before continuing, please remember that these sample NFAs are unrealistically simple for a real-world problem. But the point is that individual NFAs are simple, but many simple NFAs could better reflect the nuances of the real world. This will be addressed in a subsequent post in this series.

Figure 4 – Register an NFA using an XML document.

Regarding the XML, it is a collection of elements named Transition, with four attributes (an illustrative document follows the list):

  1. FromState – Each State (node) has a name. Usually, states are just called “qx” where x is an integer (ex: q0, q1, q2 …). However, I chose to give them a more descriptive name indicating something about how we got to that state.
  2. ToState – This is the state that we are transitioning to in reaction to a symbol.
  3. TransitionSymbol – The symbol that triggers the switch from the FromState to the ToState.
  4. IsFinal – Final States are a special kind of state (another being a Start State). If a word (series of symbols) ends on a Final State, it’s said that the NFA recognizes the word.
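To make that format concrete, here is an illustrative document along the lines of the job-evaluation NFA in Figures 4 through 6. It is not the exact XML from Figure 4; the wrapper element, the convention of a state named “Start”, and the parameter name of [NFA].AddNFAFromXML are all assumptions, so check the comments in the stored procedure for the real shapes.

    -- Illustrative only; not the exact XML from Figure 4. The wrapper element, the "Start"
    -- state convention, and the @NFAXML parameter name are assumptions.
    DECLARE @NFAXML XML = N'
    <NFA Name="FavorableJob">
      <Transition FromState="Start"      ToState="Commutable" TransitionSymbol="Distance Short" IsFinal="0" />
      <Transition FromState="Commutable" ToState="Commutable" TransitionSymbol="Distance Short" IsFinal="0" />
      <Transition FromState="Commutable" ToState="PaidWell"   TransitionSymbol="Salary Good"    IsFinal="0" />
      <Transition FromState="PaidWell"   ToState="GoodJob"    TransitionSymbol="Benefits Good"  IsFinal="1" />
    </NFA>';
    -- The second Transition is a loopback: repeated "Distance Short" symbols are absorbed as noise.

    EXEC [NFA].AddNFAFromXML @NFAXML = @NFAXML;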

Figure 5 is a graphical representation of the NFA encoded as XML above (in Figure 4).

Figure 5 – Graphical representation of the sample NFA.

Figure 6 shows the results of processing a “word”. A “word” is a series of symbols, usually a string of characters. But in this case, a symbol is more verbose, thus requiring a delimiter (comma) to separate the symbols (an example call follows the list):

  • “Distance Long” or “Distance Short” – How far is the commute from home to this job?
  • “Salary Good” – The salary is above my minimum requirements.
  • “Benefits Good” – The benefits exceed my requirements.
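The call depicted in Figure 6 would presumably look something like this; the parameter name (@Word) is an assumption, so check the commented examples at the top of the stored procedure:

    -- Hypothetical call; the @Word parameter name is an assumption.
    EXEC [NFA].ProcessWord @Word = N'Distance Short,Salary Good,Benefits Good';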

Figure 6 – Process a word, which is a string of symbols.

The first result set just shows the trace of the iterative process of processing a word. Each iteration (there are three) processes one of the symbols of the word. It’s possible that multiple paths would be explored, in which case an iteration would have a row for each path. Multiple paths are one of the two levels of parallelism on the Automata Processor: the ability to process multiple paths within an NFA and to process multiple NFAs at once.

The second result set shows only the final state, indicating this is indeed a good job.
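For readers who want a feel for how such a trace could be produced without peeking at the script, here is a minimal sketch of the kind of set-based step behind each iteration. It is not the actual body of [NFA].ProcessWord; the names follow the earlier table sketch, and epsilon transitions are omitted to keep it short. The point is that a single join advances every active path of every registered NFA at once, which is the “quasi-parallel” behavior mentioned earlier.

    -- Not the actual [NFA].ProcessWord internals; a minimal sketch of one set-based hop.
    DECLARE @CurrentStates TABLE (NFAId INT NOT NULL, StateName VARCHAR(100) NOT NULL);
    DECLARE @NextStates    TABLE (NFAId INT NOT NULL, StateName VARCHAR(100) NOT NULL);
    DECLARE @Symbol VARCHAR(100) = 'Salary Good';  -- one symbol of the word

    -- @CurrentStates would start as all start states of all registered NFAs; then, for each
    -- symbol in the word, one join advances every NFA and every path at the same time.
    INSERT INTO @NextStates (NFAId, StateName)
    SELECT DISTINCT t.NFAId, t.ToState
    FROM @CurrentStates AS c
    JOIN NFA.[Transitions] AS t
      ON  t.NFAId     = c.NFAId
      AND t.FromState = c.StateName
    WHERE t.SymbolName = @Symbol;
    -- (an epsilon-closure step would be applied here as well; omitted for brevity)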

 

Next Up

As mentioned towards the beginning of this post, Post 2 of this series will introduce pattern recognition along with another T-SQL-based sample. In Post 3, we will tie Posts 1 and 2 together. Posts 4 through 6 will expand upon the NFA functionality (most notably feeding back recognitions), the implementation of Hekaton (as a step up in performance for the Post 1-3 samples and in a supportive role around the AP) and the Automata Processor itself, as well as further exploration of the use cases.

As with any other system, for example SQL Server or the .NET Framework, there are an incredible number of optimizations (think of them as known shortcuts) based on usage patterns of all types (admin, process, structures, etc) to be implemented. Many of these optimizations have already been incorporated in the components of Map Rock, not to mention the NFA-specific optimizations offered by the Automata Processor.

Posted in BI Development, Cutting-Edge Business Intelligence, Map Rock | Leave a comment