Introduction to this Series
“Big Data” and “Machine Learning” are two technologies/buzzwords making significant headway into the mainstream enterprise. Drawing analogies between these two technologies to those at the start of the Internet era twenty-something years ago:
- Big Data is analogous to a Web crawler capable of accessing large numbers of available Web pages.
- Machine Learning is analogous to search engines such as Yahoo’s back then followed by Google that index the salient pieces of data (key words and page ranks – the latter in Google’s case).
But the search engines fell short of being anything more than a glorified index, like a telephone book providing someone’s name and address only if you already know the person’s name. Similarly, our current analytics systems fall short in that they only provide answers to questions we already have.
Before moving on, it’s important to ensure the distinction between a complicated and a complex problem is understood up front. The Twitter summation of the theme of this blog is: We currently analyze complex problems using methodologies for a complicated problem. Our dependable machines with many moving parts, from washing machines to airplanes, are complicated. Ecosystems, the weather, customer behavior, the human body, the stock market are complex.
Complicated things (no matter how complicated) operate in a closed system (at least we create the closed system environment or just pretend it is closed) where cause and effect between all parts are well understood. Complex systems have many moving parts as well, but unlike complicated systems, relationships between the moving parts are not well defined; therefore, outcomes are not deterministic. Most of our “truly” analytical questions actually address complex systems, which we attempt to answer using techniques designed for answering complicated questions.
This is the first in a series of blogs laying a foundation for breaking free from the analytical constraints of our current mainstream analytics systems founded upon “intelligently designed” (designed by we humans) databases. Such systems are built based on what we already know and expect, hardly ever considering the unknown, giving it the “out of sight, out of mind” treatment. For this blog, I’ll introduce the basic idea for employing NFAs and a simple T-SQL-based sample. Subsequent blogs in this series will build on the concept.
To deal better deal with complexity, we can paradoxically retreat from the current mainstream, Turing Machine-inspired, computer design (the top of the stack of the theory of computation) to the far less sophisticated Non-Deterministic Finite Automata, NFA (2nd from the bottom only to the Deterministic Finite Automata). NFAs are simple, more elemental, constructs with far more flexibility in expressing the a very wide variety of rules of daily life. The tradeoff is that the NFAs may be not exactly streamlined and they could seem unwieldy to an engineer, but we’ll have the power to emulate the rules of daily life with a “higher resolution” from that lower granularity.
This post and the next two posts of this series comprise a sub-series, an introduction to the utilization of NFAs and pattern recognition in the BI world. Post 2 will introduce pattern recognition and a simple T-SQL-based application as well. Post 3 will tie Post 1 and 2 together, again with a T-SQL-based application, with a mechanism for processing incoming symbols by multiple NFAs in parallel – at least in a set-based manner (let’s just call it “quasi or pseudo-parallel”) as a first step.
Posts 4 through 6 will deal with further characteristics of the system, for example, exploring further the notion of “what fires together, wires together”, as well as diving deeper into a physical implementation better suited for scalability of such ideas. In particular, Hekaton and Micron’s Automata Processor, which I’ll discuss briefly in this post. By Post 6, we will be at the doorstep of what I had intended to encapsulate in Map Rock, which is a focus on changing relationships as opposed to just keener recognition and control.
This is a blog about AI, but not the sort of AI we usually think about which I believe is still a few years away (despite the incredibly rapid improvement of IBM’s Watson on several dimensions since its debut on Jeopardy in 2011). I certainly can’t explain everything about NFAs and AI in these 5000+ words, or even 500,000. However, I think you’ll find the theme of this set of blogs useful if we can for now at least agree that:
- The world is a complex, adaptive system, where our current analytical systems are about to reach their limits,
- In order for us to make superior decisions we need true massive parallelism to paint the always dynamic, messy picture of what is around us,
- Predictive analytics models are rules reflecting what we’ve come to know, but that they work because life on Earth although dynamic, is relatively stable, but the models eventually go stale,
- And that working at a lower, more elemental level gives us flexibility we don’t have at a higher, more object-oriented level.
Lowering the Granularity of Our Computing Objects
Most readers of this blog have probably encountered the concepts of NFAs (and Regular Language, Context-Free Language, Turing Machine, etc) in college under a course in theory of computation. Most would agree that it is still taught simply because it has always been taught, just a formality towards a CS degree, as such concepts almost never appear in the world of high-level programming. But as we’re to running into a wall as we begin to ask our analytics systems new sorts of questions of a complex nature answering with a complicated. Our computer systems are built to address well-defined, well-understood problems, employed merely as something that can do certain jobs better, as we would employ a good fighter as a nightclub bouncer.
Computing at the less sophisticated but more granular level of the NFA removes much of the rigidity imposed by computations that have the luxury of being optimized for static, reliable conditions, for which we make countless assumptions. This is analogous to how we don’t deal with life thinking at the atomic or molecular level of the things we encounter every day but at the macro level of objects; apples, bosses, tasks, and ourselves (we’re a macro object too).
We could even look at 3D printing as a cousin of this lower granularity concept. Instead of being completely limited to the need of manufacturing, shipping, and storing zillions of very specific parts, we can instead have big globs of a few types of stuff from which we can generate almost anything. Well, it’s not quite that extreme, but it’s the same idea. Similarly, I don’t believe NFA processing will replace relational databases in the same way 3D printing shouldn’t replace manufacturing. 3D printing isn’t optimal for things for which we know won’t change and for which we need great quantities. There will be a mix of the two.
We Already Do Quite Well with Determinism, So Why Bother with This?
Our human brand of intelligence works because our activities are mostly confined to a limited scope of time and space. Meaning, our best decisions work on a second to second, day to day basis involving things physically close to us. Additionally, the primary characteristics of things we deal with, whether cars, bears or social rules remain fairly constant. At least they evolve at a slow enough pace that we can assume validity of the vast majority of relationships we don’t consciously realize that are nonetheless engaged into our decisions. In fact, the ratio of what we know to what we don’t know makes “tip of the iceberg” sound like a ridiculous understatement. If things evolved (changed) too quickly, we couldn’t make those assumptions and our human brand of intelligence would quickly fall to pieces through information overload.
In fact, changes are the units of our decision making, predictions we make with the intent of furthering us towards our goals. Our brains (at least through vision) starts with the 2D image on our retina, then applies some innate intelligence (such as shadows) and some logic (such as what is obscuring what) to process depth. And finally tracking and processing changes is how we handle the 4th dimension of time.
When we form strategies to achieve a goal, it’s the changes, how a change leads to transitions in things, that form our strategies, ranging from as mundane to getting breakfast to planning for retirement to getting a person on Mars. Strategies are like molecules of cause and effect between the atoms of change that we notice. The fewer changes involved, the more effective our decisions will be as accuracy is progressively lost over a Bayesian chain of cause and effect. We are more successful obtaining the breakfast we desire right now than on planning how we will retire decades from now as we envisioned today (due to cumulative changes over a long period of time).
A key thing to keep in mind is that in enterprises it seems the default attitude towards change is to deal with it as an enemy or pest, something to be reviled, resisted, and eliminated. However, to state the obvious in Yogi Berra fashion, without change, nothing changes. Change is what makes things better or worse. Unfortunately, in what is for all pragmatic purposes a zero-sum world, some will experience change for the better, some for the worse. But because change is always happening, those currently in “better” positions (the top echelon of enterprises) must vigilantly improve or at least maintain that condition.
Even maintaining the status quo is the result of constant change, except the net measurement is the same. For example, maintaining my body weight doesn’t mean nothing has changed. I’m constantly overeating, then compensating by under-eating (and occasionally even vice-versa). For those finding themselves in worse conditions, the ubiquity of change means there is always hope for the better.
Change as the basis for intelligence is rooted in the fact is that our home, Earth, is a hugely dynamic, complex system powered by intertwined geologic forces and biological replication. Geologic forces are driven by forces deep in the Earth as well as way over our heads in the clouds, sun, and meteors. The ability for cells to replicate is the underlying mechanism by which all the life we live with self-organized. Every creature from viruses through swarms of bees through humans are driven to “mindlessly” take over the world. But we millions of species and billions of humans have settled into somewhat of a kaleidoscope of opposing forces at least in the bigger picture, which is like a pleasantly flowing stream, seemingly the same, but in reality in a constant state of change. The mechanisms of evolution and our human intelligence both enable adaptability on this fairly smoothly-dynamic planet.
A Few Clarifications
If all of this sounds obvious and/or a bunch of flowery crap, it could be that it’s only obvious when it’s brought to our attention, but quickly dismissed and forgotten as we resume the drudgery of our daily lives, being careful not to break any of the hundreds of thousands of laws micro-managing our lives, follow best (expected) practices that immunizes us from culpability, and careful not to trip over social mores that weren’t there yesterday. Our Industrial Revolution upbringing raised us to seek and expect comfort.
I would also like to point out that I’m not suggesting a system that simply gives us more to worry about, distracting us from what’s important, undermining our abilities through information overload (like a DoS attack). The main idea is not to replace us with some sort of AI system. It is to supplement us; watch our backs (it can multitask better than we can), see what our innate biases overlook, reliably rule out false positives and false negatives through faster exploration of the exponentially growing number of possibilities and continued testing of paths to goals (the definition of success).
The Expanding Reach of Our Daily Lives
However, many factors emerging in most part due to the increasing power of technology are expanding the scope of the time and space in which we individually or as an enterprise operate. Globalization introduces many more independently moving parts. Longer lives increases the cumulative changes each of us experiences in our lifetime. The rapidly growing rate of human population has greatly expands the reach of our species to the point where there’s practically nowhere on the surface we don’t inhabit. The countless devices feeding a bottomless pit of data collection, storage and dissemination expands the scope of our activities over time and space.
I purposely prepend the word “independent” to the phrase “moving parts” used in the previous paragraph. That’s because the fact that the parts are independently intelligent decision makers defines the world-shattering difference between complicated and complex. However, the level of “intelligence” of these independently moving parts doesn’t necessarily mean matching or even attempting to emulate human-level intelligence. Billions of machines from household appliances to robots running a manufacturing plant are being fitted with some level of ability to make decisions independently, whether that means executing rules based on current conditions or even sorting through true positives, false positives, false negatives, and all that good stuff.
With the limited scope of time and space typical for the typical human during the 1800s and 1900s, complicated machines were effective in performing repetitive actions that have, still do, and always will serve us very well. But in the vastly increasing scope of time and space in which individuals, their families, and businesses operate, making good decisions becomes an increasingly elusive goal.
If we don’t learn to embrace complexity to make smarter decisions, to entities that are embracing it (such as hedge funds and organizations with Big Brother power) we will be as fish and other game are at the mercy of humans with symbolic thinking. Embracing complexity doesn’t mean something like giving up our ego and becoming one with the universe or going with the flow. It means we need to understand that in a complex world:
- We need to be flexible. We cannot reject the answer of “it depends” obsessively seeking out the most convenient answer.
- Trial and error is a good methodology. It’s also called iterative. Evolution is based on it, although our intelligence can streamline the process significantly. On the other hand, our limited experience (knowledge) means we very often miss those precious weak ties, the seeds that beat out competition to rule the next iteration.
- Our logic is prone to error because of the ubiquitous presence of imperfect information (a huge topic).
- It’s a jungle out there, every creature and species out to take over the world. The only thing stopping them is that every other creature is trying to take over the world.
I discuss my thoughts around complexity, strategy, and how Map Rock approaches the taming of it in Map Rock Problem Statement – Parts 4 and 5.
Reflecting the World’s Cascading and Dynamic Many to Many Nature
A very intriguing discipline for dealing with complexity is Situation Awareness. It’s roots lay in war, battle scenarios, for example as a methodology for fighter pilots to deal with the chaotic realities of life and death fighting. In such situations, there are many independently moving parts, including some that you cannot trust. With all the training on tactics and strategies, a good opponent knows the best way to win is to hit you where you weren’t looking. In other words, things don’t go as planned. So we must be able to recognize things from imperfect and/or unknown information.
Figure 1 depicts a variation of an entity relationship diagram of a supply chain. Notice that unlike the usual ERD, there aren’t lines linking the relationships between the various entities. That’s pretty much because there are so many relationships between the entities that representing each with a line would result in a very ugly graph and simply showing that there exists relationships is oversimplifying things.
Those entities have minds of their own (will seek their own goals), unlike the “ask no questions” machines such as cars and refrigerators (at least for now). Instead of the conventional lines from entity to entity that strongly reinforce only the sequential aspects of a system, I depict “waves” (the yellow arcs) which attempt to reinforce the massively parallel aspects of a system as well.
Figure 1 – A sort of Entity Relationship Diagram of a Supply Chain.
The entity diagrams shows each entity broken down into three parts of varying proportions denoted by three colors:
- Black – Unknown information. Highly private, classified. This information does indeed exist, but it is unobtainable or much too expensive to obtain. Contrast that with unknowable information, for example, so far out in the future that no one could possibly predict it … except in hindsight. Therefore, perhaps there should be a black or unknowable data and a dark gray for private/classified data.
- Gray – Imperfect information. This could be indirectly shared, but statistically reliable; the basis for Predictive Analytics. Or it could be shared information, but suspect, or possibly outdated.
- White – Known. This is information readily shared, validated, and up to date. We also would tend to find perceive it as reliable if we knew that it benefited us.
The proportions of black, gray, and white are just my personal unscientific impressions of such entities based on my personal experience exploring the boundaries of what we can and cannot automate after 35+ years of building software systems. The main point of Figure 1 is to convey that the white portion is the “easy” part we’ve been mostly dealing with through OLTP systems. The gray and black parts are the hard part, which does comprise the majority of information out there and the stuff with the potential to screw up our plans.
In a fight, whether as a predator versus prey, a street brawl, or business competition, we can either play defensively (reactively) or offensively (proactively). Defensive querying is what we’re mostly used to when we utilize computers. We have a problem we’re attempting to solve and query computers for data to support the resolution process executing in our heads. However, in the “jungle”, situations (problems) are imposed on us, we don’t choose the problem to work on. Our brains are constantly receiving input from multiple channels, recognizing things, grouping them, correlating them.
Not counting video games, most of how we relate to computers is that computers answer direct questions posed to them from us, which help us to answer complicated questions, involving relationships between things, we’re working through with our brains. Video games are different from most of the software systems we use in that the computer is generating things happening to us, not just responding to our specific queries. In the real world, things happen to us but software is still not able to differentiate to an adequate degree what is relevant and what isn’t true, so we end up with enough false positives that it creates more confusion or there are so many false negatives that we miss too much.
The most important thing to remember is that the goal of this series of blogs is to work towards better reflecting the cascading and dynamic many to many relationships of the things in the world in which we live our lives. To do this, our analytics systems must handle objects at a lower level of granularity than we’re accustomed to, which can be reconstructed in an agile number of ways, similar to how proteins we consume are broken down in our guts into more elemental amino acids and reconstructed into whatever is needed.
Then, we must be able to ultimately process countless changes occurring between all these objects in a massively parallel fashion. To paint the most accurate picture in our heads required to make wise decisions, we need to resist forcing all that is going on into brittle, well-defined procedures and objects.
Non-Deterministic Finite Automata
Probably the most common exposure to NFA-like functionality for the typical BI/database developer is RegEx (same idea as the less powerful LIKE clause in a SQL statement). But I think it’s thought of as just a fancy filter for VARCHAR columns in a WHERE clause, not as a way of encoding a pattern which we wish to spot in a stream of symbols. These symbols can be characters of words (pattern) in a text document (the typical use case for RegEx), the stream of bases in DNA, the sales of some product over many time segments. Indeed, NFAs are the implemented version of regular expressions.
The NFA is a primary tool in the field of pattern recognition. They are patterns that can recognize those sequences of symbols. Sometimes these sequences may not be entirely consecutive (handled by loopbacks), sometimes not even entirely ordered (handled by multiple start states), and can lead to multiple outcomes (the “non deterministic part, handled by multiple transitions for a symbol).
When we think of pattern recognition, we usually think of high-end, glamorous applications such as facial or fingerprint recognition. But pattern recognition is one of the foundational keys for intelligence. It’s exactly what we humans seek when browsing through views in PowerPivot or Tableau. We look for things that happen together or in sequence. And the word “thing” (object) should be taken in as liberal a manner as possible. For example, we normally think of a thing as a solid physical object, but things can be as ephemeral as an event (I like to think of an event as a loosely-coupled set of attributes), recognized sequence of events, and things for which there isn’t a word (it needs a few sentences or even an entire course to describe).
If we think about it, seeing the things we see (at the “normal” macro level) is sort of an illusion. Meaning, we see an image (such as an acquaintance or a dog) reconstructed from scratch. When I see a friend of mine, my eyes don’t reflect into my head a literal picture of that friend. That vision begins with my eyes taking in zillions of photons bouncing off of that friend, which are quantified into a hundred million or so dots (the number of rods and cones on my 2D retina), into a number of edges (lines) processed by my visual cortex, into a smaller number of recognitions processed with components throughout my brain. If it didn’t work this way, it would be impossible to recognize objects from whatever angle 4D space-time allows and even partially obscured views, as presented to us in real life.
Further, I don’t see just that person in isolation. “That friend” is just one of the things my brain is recognizing. It also in massively parallel fashion recognizes all other aspects from the expression on his face, to what he is wearing, to funny angles that I wouldn’t be able to make out all by itself (all the countless things for which there isn’t a single word), to memories associated with that friend, as well as things around that friend (the context). Intelligence is massively parallel, iterative (hone in on a solution, through experimentation), massively recursive (explores many possibilities, the blind alleys), and massively hierarchical (things within things).
NFAs are visualized as a kind of directed graph, connected nodes and relationships (lines, edges). We’re well familiar with them, for example, org charts, flow charts, UML diagrams, entity relationship diagrams. However, NFAs are mathematical constructs abiding by very specific rules. These rules are simple, but from these simple rules, we can express a wide range of rules for pattern recognition.
Figure 2 depicts an example of an NFA. The NFA on the left is used to recognize when a market basket contains chicken wings, pizza, and beer. That could signify a large party which could be used to notify neighbors of the need to book that weekend getaway. The one of the right is used in more of the conventional, “if you have pizza and beer, perhaps you’d like chicken wings as well”, utilization of market basket analysis.
Figure 2 – Sample of an NFA.
A very good series of lectures on NFAs, actually the wider field of the Theory of Computation is Theory of Computation is by Dan Gusfield. L1 through L3 of the series is probably the minimum for a sufficient understanding of NFAs. Although there is very much value in understanding the more sophisticated concepts, particularly Context-Free Language and Turing Machine. I like Prof Gusfield’s pace in this series. Ironically (for a tech blog), the fact that he still writes on a chalkboard slows things down enough to let it all settle.
I believe a big part of the reason why graph structures still play a relatively minor role in mainstream BI development is because we’ve been trained since our first database-related “hello world” apps to think in terms of high-level, well-defined, discrete “entities” reflected in the tables of relational databases (tables). Each entity occupies a row and each column represents an attribute. It’s easier to understand the tidy matrix of rows and columns, particularly the fixed set of columns of tables, than the open-ended definitions contained in graphs. That goes for both people and computers as a matrix-like table is easier for servers to process.
In order to process enterprise-sized loads of data, we needed to ease the processing burden on the hardware by limiting the scope of what’s possible. We made “classes” and table schemas as structured as possible (ex, a value can be pinpointed directly with row and column coordinates). We stripped out extraneous information not pertinent to the automation of our well-defined process.
We also neglected implementing the ability to easily switch in and out new characteristics once extraneous but now relevant. Graphs, lacking a fixed schema, don’t have the crisp row and column schemas of tables nor the fixed set of attributes. So we decided we can live with defining entities by specific sets of characteristics forgetting that objects in the world don’t fit into such perfectly fitted slots.
There are techniques for conventional relational databases that support and reflect the unlimited attributes of objects, for example, the “open schema” techniques where each row has three columns for an object id, an attribute, and the value of the attribute. There can be an unlimited number of attributes associated with each object. But the relational database servers executing those techniques struggle under the highly recursive and highly self-joined queries.
In an ideal world, if computers way back to the days when only large, well-oiled enterprises used them (where taking in ambiguity as a factor isn’t usually a problem), were then already as powerful as they are today, my guess is we would have always known graph databases as the mainstream and relational databases appearing later as a specialized data format (as OLAP is also a special case). Instead of a relational database being mainstream, we would think of a relational database table as a materialization of objects into a fixed set of attributes pulled from a large graph. For example, in a graph database there would be many customer ids linked to all sorts of attributes customers (people) have spanning all the roles these people play and all the experiences. There could be thousands of them. But for a payroll system, we only need a handful of them. So we distill a nice row/column table where each row represents a customer and a small, fixed set of columns represents the attributes needed to process payroll.
Graph-based user interfaces (most notably Visio – at least to my heavily MSFT-centric network) have long existed for niche applications. But there is a rapidly growing realization that simply more data (volume), provided ever faster (velocity), and in more variety alone doesn’t necessarily lead to vastly superior analytics. Rather, it’s the relationships between data, the correlations, that directly lead to actionable and novel insight. So enterprise-class graph databases such as Neo4j, optimized for authoring and querying graph structures, are making headway in the mainstream enterprise world.
However, keep in mind that NFAs are very small, somewhat independent graphs, unlike large, unwieldy graphs more akin to a set of large relational database tables. In other words, the idea of this blog is querying very many small NFAs in massively parallel fashion, as opposed to one or a few large tables (or a messy “spider-web” graph). In a subsequent post in this series, we’ll address the “somewhat independent” aspect of NFAs I mention above; loosely-coupled.
Up to now we weren’t compelled enough to take a leap to graph databases since we were able to accomplish very much with the sterile, fixed column tables of relational databases. And, it graphs were too unwieldy to deal with, we retreated back to the comfort of the 2D world of rows and columns. But we’re now beginning to ask questions from our computer systems that are different. We don’t ask simply what the sales were of blue socks in CA during the last three Christmas seasons.
We ask how we can improve sales of blue socks and attempt to identify the consequences (ex. Does it cannibalize sales of purple socks?). The questions are more subjective, ambiguous, and dynamic (SAD – hahaha). These are the sort of questions that were in the realm of the human brain, which in turn we turn to our computer databases to answer the empirical questions supporting those more complicated questions.
SQL Server 2014’s new in-memory Hekaton features could help as well. Similar to an OLTP load, processing NFAs would involve a large number of queries making small reads and writes. This is in contrast to an analytics application such as OLAP which involves relatively few queries by comparison reading a large amount of data and no updates are made except for a scheduled refresh of the read-only data store. I’ve made this comparison because I think of this utilization of NFAs as something in the analytics realm.
But applied in an implementation involving thousands to millions of NFAs (such as Situation Awareness), a highly-parallel implementation, it could involve a large number of large reads and writes as well. So we have an analytics use case for a technology billed as “in-memory OLTP”. The advantage of using Hekaton over Neo4j is that we could implement our NFA system using familiar relational schemas and querying techniques (SQL and stored procedures) instead of invoking a new technology such as Neo4j.
Hekaton should provide at least a magnitude improvement over doing the same thing with a conventional disk-based relational tables. This performance improvement comes with first the all-in-memory processing and dropping the overhead required for disk-based tables.
Micron’s Automata Processor
Much more intriguing and relevant than Hekaton for this blog focused on NFAs is Micron’s Automata Processor, which I wrote about almost a year ago in my blog, A Rare Big Thing Out of Boise. The Automata Processor (AP) is a memory-based chip directly implementing the mechanisms for the massively parallel processing of NFAs.
This should result in at least a few orders of magnitude of further performance improvement, first from the fact that it has little if no “generalist” overhead since it is designed as an NFA-specific chip and not a general-purpose memory chip. It also processes NFAs in truly massively parallel fashion.
Thirdly, the “processing” mechanism (to process large numbers of NFAs in parallel) is built directly onto the chip, which means that there is no marshaling of bits between a CPU and memory over a bus for every single operation. So even if we were to compile NFAs down to “native code” (as Hekaton’s native stored procedures do), massively multi-threaded on a relatively massive number of CPUs, there would be great hurdles to overcome in beating the Automata Processor.
We could look at the AP as merely an optimization for a particular class of problems. The sort that recognizes patterns such as faces or gene sequences in a huge stream of data. But similarly we can look at the current mainstream computer architecture (CPU and storage device – RAM, hard drive, or even tape) as an optimization for the vast majority of the classes of problems we deal with in our daily lives as we’re accustomed to (at the macro level) in our vestigial Industrial Revolution mentality. That would be the well-defined, highly repetitive, deterministic class of problem which is the hallmark of the Industrial Revolution.
So instead I like to look at the Automata Processor as a technology that is a lower-level (lower than the Turing Machine) information processing device capable of handling a wider variety of problems; those that are not well-defined, highly repetitive, and deterministic. NFAs are like molecules (including very complicated protein molecules), not too high, not too low, high enough to solve real-world problems, but not so unnecessarily low-level and cumbersome. An analogy would be assembler language being low enough to dance around the roadblocks imposed by a high-level language, but not as cumbersome as programming in zeros and ones.
This parallelism could mean massive scales such as up to tens of thousands or even millions of NFAs. The main idea is that each piece of information streaming into a complex system could mean something to multiple things. The reality of the world is that things have a cascading many to many relationship with other things. For example, a sequence of three sounds could be the sound of countless segments of a song rendered by countless artists, countless phrases uttered by countless people with countless voices, countless animals.
NFA Sample Application
At the time of this blog’s writing, Micron had not yet released its development toolkit (the chip/board itself and the Software Development Kit, SDK) to the public for experimentation. So it is one of the major reasons I decided to demonstrate NFAs using conventional, disk-based SQL on SQL Server 2014, at least for this blog. However, the Automata Processor’s SDK is accessible at the time of this blog’s posting (after signing up) by visiting http://www.micronautomata.com/
There is still much value in demonstrating NFAs in this conventional manner, even with the impending release of the Automata Processor SDK. First, the AP is etched in stone (literally in the silicon). For business-based solutions I can imagine (ie, more mainstream than specific applications such as bioinformatics), I believe that there are a few deficiencies in the implementation (which I intend to supplement with Hekaton). There will be techniques outside the realm of the AP that would be of benefit and for which we can implement using a flexible, software-based system (ie a relational database management system). The details of the business-based solutions I can imagine and designs I’ve created around the AP are well beyond the scope of this blog. However, these old blogs provide sample background on such efforts of mine:
- Find and Measure Relationships in Your OLAP Cubes
- Bridging Predictive Analytics and Performance Management
- The Effect Correlation Score
This sample implements a very bare-bones NFA storage and query system for SQL Server. This version isn’t optimized or modified for Hekaton (which is a subsequent post) as that further expands the scope of this blog. This simple application supports the NFA concepts of:
- Multiple Start states.
- Looping back on a transition. This allows us to “filter” noise.
- Transitions to multiple nodes for the same symbol. This is the main feature of an NFA that distinguishes it from the less powerful “deterministic” finite automata.
- Epsilon transitions (transitioning for any reason).
Please understand that this sample code is intended to be used as “workable pseudo code”. This sample certainly doesn’t scale. It is meant to convey the concepts described here. More on this in Part 3 of this series.
This script, generated through SQL Server Management Studio, NFA -Asahara.sql, consists of a TSQL DDL script that creates a SQL Server database named NFA, a few tables, and a few stored procedures:
- Database, NFA – A simple, conventional (disk-based) database in which the sample database objects will reside.
- NFA, Schema – A schema simply for cleaner naming purposes.
- NFA.[States], Table – Holds the states (nodes) of the NFAs.
- NFA.[Symbols], Table – A table holding the symbols
- NFA.[Transitions], Table -Holds the transitions (edges, lines) of the NFAs.
- [NFA].AddNFAFromXML, Stored Procedure – Takes an XML representing an NFA and registers (persists) it to the tables.
- [NFA].ProcessWord, Stored Procedure – Takes a “word” (string of symbols) as a parameter and processes the word, symbol by symbol, through all of the registered NFAs.
For this sample, you will simply need the SQL Server 2008 (or above) relational engine as well as the SQL Server Management Studio to run the scripts. I didn’t include Hekaton features for this sample in part to accommodate those who have not yet started on SQL Server 2014. These are the high-level steps for executing the sample:
- Select or create a directory to place the SQL Server database (NFA.MDF and NFA.LDF), then alter the CREATE DATABASE command in the NFA – Asahara.sql script to specify that directory. The script uses C:\temp.
- Create the NFA database (database, tables, stored procedures) using the NFA – Asahara.sql T-SQL scrip.
- Register the NFAs following the instructions near the top of the [NFA].AddNFAFromXML stored procedure.
- Run the sample queries located in comments near the top of the [NFA].ProcessWord stored procedure.
Figure 3 depicts the SQL Server objects that are created by the script (database, tables, stored procedures, one TVF, but not the schema) and the part of the script where the database file directory can be set.
Figure 3 – Beginning of the SQL script for this blog.
Once the objects are created, sample NFAs need to be registered to be processed. A few examples are provided in a commented section near the top of the code for the stored procedure, [NFA].AddNFAFromXML. The stored procedure accepts the NFA as an XML file since it is a flexible way to exchange this information without a fancy UI.
Figure 4 shows one of those sample NFAs (passed as an XML document) that determines if a prospective job would be favorable, as well the state/transitions that are registered into the tables.
Before continuing, please remember that these sample NFAs are unrealistically simple for a real-world problem. But the point is that individual NFAs are simple, but many simple NFAs could better reflect the nuances of the real world. This will be addressed in a subsequent post in this series.
Figure 4 – Register an NFA using an XML document.
Regarding the XML, it is a collection of elements named Transistion, with four attributes:
- FromState – Each State (node) has a name. Usually, states are just called “qx” where x is an integer (ex: q0, q1, q2 …). However, I chose to give them a more descriptive name indicating something about how we got to that state.
- ToState – This is the state that we are transitioning to in reaction to a symbol.
- TransitionSymbol – The name of a transition resulting in a switch from the FromState to the ToState.
- IsFinal – Final States are a special kind of state (another being a Start State). If a word (series of symbols) ends on a Final State, it’s said that the NFA recognizes the word.
Figure 5 is a graphical representation of the NFA encoded as XML above (in Figure 4).
Figure 5 – Graphical representation of the sample NFA.
Figure 6 shows the results of processing a “word”. A “word” is a series of symbols, usually a string of characters. But in this case, a symbol is more verbose, thus requiring a delimiter (comma) to separate the symbols:
- “Distance Long” or “Distance Short” – How far is the commute from home to this job?
- “Salary Good” – The salary is above my minimum requirements.
- “Benefits Good” – The benefits exceeds my requirements.
Figure 6 – Process a word, which is a string of symbols.
The first result set just shows the trace of the iterative process of processing a word. Each iteration (there are three) processes one of the symbols of the word. It’s possible that multiple paths would be explored, for which in those cases, an iteration would have a row for each path. Multiple paths are one of the two levels of parallel processes on the Automata Processor; the ability to process multiple paths in an NFA and multiple NFAs.
The second result set shows only the final state, indicating this is indeed a good job.
As mentioned towards the beginning of this Post, Post 2 of this series will introduce Pattern Recognition along with another T-SQL-based sample. In Part 3, we will tie Posts 1 and 2 together. Parts 4 through 6 will expand upon the NFA functionality (most notably feeding back recognitions), implementation of Hekaton (as a step up in performance for the Part 1-3 samples and in a supportive role around the AP) and the Automata Processor itself, as well as further exploration of the use cases.
As with any other system, for example SQL Server or the .NET Framework, there are an incredible number of optimizations (think about it as known shortcuts) based on usage patterns all types (admin, process, structures, etc) to be implemented. Many of these optimizations have already been incorporated in the components of Map Rock, not to mention the NFA-specific optimizations offered by the Automata Processor.