Exploring a Sea of Data with Massive Numbers of RegEx – And Maybe Even the Automata Processor

Overview

This blog explores taking the lid off RegEx (regular expressions) and its less powerful cousin, the LIKE keyword in SQL. By “taking the lid off”, I mean looking for more than one pattern at a time. Indeed thousands, tens of thousands, or even more patterns. After all, whatever we happen to be seeing with our eyes at any given instant is recognized because our brain considers the millions of things we could be seeing in massively parallel fashion.

Traditionally, we use RegEx and LIKE when we already know the pattern we’re looking for. We just need help wading through a ton of data for that pattern. For example, in a search, we could express that we’re looking for any name containing the characters “hara” followed by a first name beginning with “E” with this LIKE pattern (or its RegEx equivalent): %hara, E%

Figure 1 shows a sample of the LIKE keyword looking for names that contain “hara, e”. Most search functions in applications offer only a little more flexibility than finding an exact string, usually just the simple Contains and Begins With options.

Figure 1 – SQL LIKE keyword.
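
Since the figure is only a screenshot, here is a minimal sketch of the kind of query it shows. The table and column names (Person, FullName) are hypothetical stand-ins, not the ones in the actual figure.

    -- Hypothetical table and column names; finds values such as "O'Hara, Eugene".
    SELECT FullName
    FROM   Person
    WHERE  FullName LIKE '%hara, E%';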

In this case, we’re checking whether any value matches a single pattern: one rule applied to very many values. Database people also know that LIKE parameters beginning with % result in poor performance, since indexes on the column can’t be utilized.

A slightly more sophisticated use case would be to find all occurrences of some pattern in a long text. This is the Find function in text editors such as Word. This would result in multiple hits, many occurrences throughout a single long string of text. A further step would be to search a longer text column, such as one of the VARCHAR(MAX) or TEXT types, holding free-text notes or descriptions for each row (ex: free-form comments on a sales call): multiple hits for multiple rows. But whether we’re searching one big text like a long Word document or the text strings of many rows, we’re still searching for one keyword or pattern at a time.
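
A minimal sketch of that kind of search follows; the SalesCall table and its Notes column are hypothetical.

    -- Hypothetical table with a VARCHAR(MAX) Notes column of free-form comments.
    -- Returns one row per sales call whose notes mention the keyword anywhere.
    SELECT CallId, CallDate, Notes
    FROM   SalesCall
    WHERE  Notes LIKE '%competitor%';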

So let’s throw all caution out the window and ponder searching for thousands of patterns occurring in millions of long strings of text. That would normally be implemented as: for each text column in each row, iterate through the many patterns. Meaning, if searching 1 GB of text for one pattern takes 1 second, searching for 100 patterns would take about 100 seconds. Yes, we could apply optimization techniques such as parallelizing the task by breaking the rows into chunks spread across many servers, or applying some logic to eliminate impossibilities upfront. But that’s not the point of this blog. This blog is about two things: why one would want to do such a thing, and how Micron’s Automata Processor offers an amazingly direct path to the ultimate solution to this problem.

I’ll first describe the first time the notion of running very many RegEx expressions over a large amount of data occurred to me in my daily life as a BI consultant. Even though this particular use case is somewhat covered by features in many current “Data Profiling” applications, the use case in which I recognized the need helps to identify metaphorically similar use cases elsewhere.

Before continuing, I’d like to mention a couple of things:

  • The coding samples will be stripped to the bare minimum. For example, I haven’t included parameter validation logic. There are numerous code examples elsewhere covering most of the topics of this blog, such as using RegEx in whatever programming language you prefer, as well as the SQL LIKE keyword. Additionally, some code involves SQL CLR, and that code is minimized as well since there are many examples of how to register SQL CLR assemblies into SQL Server.
  • I do mention topics on the Theory of Computation, mainly the Turing Machine, Regular Expressions, and Finite State Machines. But I won’t go into them deeply. For the most part, C# developers are familiar enough with RegEx and SQL developers with LIKE.
  • Although the main point of this blog is, “How great is that Automata Processor?”, I don’t actually get to the point of implementing it on the AP. Much of the reason is that I’m still focusing on communicating how to recognize use cases for the AP in a BI environment. Meaning, I’m still trying to sell the folks in the BI world (well, more the bosses of the folks in the BI world) on investing in this very “strange” but amazing technology. Besides, the AP SDK is still in limited preview, but you can ask to register anyway. However, once you’re comfortable with the concepts around finite state automata (the core principle of the AP), authoring them and implementing them on the AP is relatively easy.
  • This blog pulls together many tough subjects from the highly technical levels of Business Intelligence and the Theory of Computing. This means there are many terms that I may not fully define or even take some liberties glossing over a few details, in the name of simplicity.

My Data Exploration Use Case

After an “analytics” consultant understands the client’s problem and requirements and a project is scoped, the consultant (the ones in my usual world, anyway) finds themselves sitting in front of a tool for exploring the heap of data, such as SQL Server Management Studio or Aginity. By “heap of data” I mean that since, as a consultant, the customer is usually new to me, the people and their roles, and the data and its meanings, are unknown to me. Until I go through the non-trivial process of learning about the data, I’m Lewis and Clark on a journey of exploration.

My ability to learn about the data depends upon several factors, and surprisingly, even today, all of these factors (or even a quorum of them) hardly ever exist at the same time:

  • The presence of knowledgeable and patient DBAs or data stewards or developers. Using the plurals illustrates that in BI there are usually a large number of databases scattered all over the enterprise, usually numbering in the hundreds for a large enterprise. Quite often as well, part of the reason I’m there is because a DBA or analyst moved on, taking all that knowledge trapped in her brain with her.
  • The presence of a Data Dictionary. A data dictionary is a catalog of data sources throughout an enterprise, down to the column levels, including types, descriptions, even a lineage of the data (“Source to Target Mapping”), the valid values for the columns, and keys. This is the other MDM, MetaData Management, not Master Data Management.
  • The “penmanship” of the database designers. The better the names of the tables and columns, the easier it is to explore the data. But even if the tables and columns are well named, they can still sound ambiguous (ex: cost and price). I usually work with a Data Warehouse, which is not in a nice third normal form with primary/foreign key relationships. Adding to that, a Data Warehouse is subject to fast growth without discipline (because “disk storage is cheap”).

This learning about the data is a part of a wider task called Data Profiling, for which there are many very good tools on the market. But to me, the heart of Data Profiling is something we usually do at the actual analytics stage, after we’ve identified our data and are analyzing its value towards solving our problem. In the scenario I’m describing, I know what problem I’m addressing, but I still don’t know what data I have to work with.

About a third of the time, my client has a specific data set to explore. “Here’s the last two years of clicks from our web site. Have at it.” Even in those cases, where the data structure is relatively simple and well-known, yes, it would be nice to find the usual patterns in the click streams, but even nicer to correlate those click patterns to something important to the client. Meaning, I’d like to go beyond the usual, looking for other data to think outside of the box of data given to us. So I’m back to searching for what else is out there.

In the end, after exhausting all sources of data information known to me, I’m still usually left with some level of looking for something. The thing about analytics is it’s often about unknown unknowns – I don’t know what I don’t know. And because the nature of analytics, at least applied towards improving success towards some goal, is fuzzy, imprecise, we don’t always know that we’ve done the best job possible. We usually take “good enough” and move on to the next thing.

Too often, in a casual conversation with colleagues at a customer site, a data source of some sort is mentioned and I think back to some earlier project with that customer, “Gee, I wish I knew about that data source back then.” So in an ideal world, I’d want to do an extensive search for opportunities, beyond what is already known. Exploring hundreds of servers for data within a reasonable amount of time for something that may or may not exist doesn’t make sense either. It would be nice if I could inexpensively do that comprehensive exploration.

As we humans go about our daily lives, sure, any given day may seem somewhat routine. We get up, eat breakfast, say good-bye to those at home, head to work, play solitaire, sneak out of the office, etc. But it’s probably less routine than we think. Other people move cars around, leave at different times, the weather is different, some tasks at work are different, our moods and the moods of others are different. So the signals entering our brains through our eyes, ears, nose, tongue, and skin must let us recognize all manner of things from all manner of angles and combinations. Our “inputs” don’t see things like cars and other people; they sense light, molecules, sound waves, and physical contact. Each of those symbols could represent millions of different things. We usually don’t have time to sequentially scroll down a list of possibilities. Fortunately, all of these possibilities are “considered” by our brain in massively parallel fashion.

A Single Algorithm for a Wide Range of Rules

Identifying patterns can involve a great number of different types of algorithms. Regular Expressions are one type of algorithm. Calculating pi, the various methods for predicting the weather, all those C# functions you’ve written, and making McDonald’s fries are other examples. Our world of business, society, and the environment is composed of countless executions of algorithms of very many types. That’s why our current CPUs are based on the Turing Machine, an algorithm of algorithms, which can process just about any algorithm our human brains can imagine (implying there are probably problems we cannot imagine).

Instead of burning hard-wired silicon for each of those countless algorithms we’ve identified, we developed pretty much a single “computing device” (as Mr. Burns may say) capable of executing those processes with a single “algorithm”. We encode those algorithms with sequences of instructions, software.

Similarly, many patterns can be easily, or at least relatively easily, expressed using the simple Regular Expression algorithm. For example, we can robustly (across lots of different formats) recognize sequences of characters such as phone numbers, social security numbers, dates, currency, hexadecimal numbers, credit card numbers, license plate numbers from various states, email addresses, etc. Regular Expressions are closely related to Finite State Automata, near the bottom of the “Theory of Computation stack”, the simpler side, at the opposite end from the Turing Machine.

Now, the examples I listed are few and don’t constitute the thousands of rules I’ve been mentioning so far in this blog. That is perhaps one reason RegEx hasn’t exactly blown the minds of programmers over the years; it would seem to have limited use. However, here are three thoughts on sources of thousands of regular expressions:

  • At a brute force level, every word and number could be a RegEx. Regular expressions encapsulate patterns. The name “Eugene” is indeed a pattern, albeit not very interesting.
  • Many classes of things may not be easily abstracted as a concise regular expression, but if you look hard enough, there can at least be an abstract codification of a subset. For example, some names within a culture follow a pattern: the Scandinavian pattern of your father’s first name followed by “sen”, or the 4-syllable Japanese names most of us easily recognize as Japanese. Such patterns may not cover all Scandinavian or Japanese names, but they are still patterns that can be encoded.
  • Most importantly, streams of events can yield a great number of patterns. This is analogous to finding patterns of words close to each other in the long text of a book. But instead of words, they would be events such as the page click patterns normally leading to a purchase or a sequence of “events” often leading to sepsis. This notion of streams of events is actually part of a much bigger discussion, which is mostly out of scope for this blog.

One more important note on the last point. For a sequence of events to be of value, the events don’t need to be adjacent. For example, if we wish to find all customers going from meat to vegetables, we may or may not be interested in whether they stopped to pick up black pepper or beer in between. Meaning, some of those events will be noise that should be ignored. That will be especially true when we combine events from heterogeneous sources. For example, studying the sequence of events of one nurse making the rounds wouldn’t provide insight as rich as studying the sequence of events across all care providers (nurses, physical therapists, doctors of various specialties, etc.). Regular expressions are versatile enough to ignore events thought to be extraneous to the pattern we’re encoding.

Further, an “event” that’s part of a sequence can be one of many options. For example, an important pattern may be those who went to the meat department first, then to beverages or produce but not baking goods, and finally picked up wine. Again, regular expressions are versatile enough to encode that sequence. The point (a full exploration of which is outside the scope of this blog) is that there is a great deal of opportunity to study complex streams of events, where a different approach is necessary to achieve query performance suitable for analysis.
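
As a concrete sketch of that shopping pattern, suppose we encode each department visit as a single character (this encoding is my own, purely for illustration): M = meat, B = beverages, P = produce, K = baking goods, W = wine, and anything else is noise to be skipped.

    -- "Meat first, then beverages or produce, no baking goods along the way, finally wine."
    DECLARE @EventPattern VARCHAR(100) = 'M[^K]*(B|P)[^K]*W';
    DECLARE @Trip         VARCHAR(100) = 'MDDPXW'; -- meat, two other stops, produce, another stop, wine

    -- Evaluated with a CLR RegEx function like the one shown later in Figure 6:
    -- SELECT dbo.RegExMatch(@EventPattern, @Trip);  -- returns 1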

When we run a regular expression through our current software running on our current commodity servers, we’re running a regular expression algorithm over the Turing Machine algorithm. With the Automata Processor, these are what I consider the three major performance turbo-charges:

  1. The Regular Expression algorithm is directly translatable to a Finite State Machine, with the algorithm (not instructions, the actual FSMs) “hard-wired” on the Automata Processor. Therefore, processing of FSMs is as direct as possible.
  2. Large numbers of FSMs can be loaded and updated onto an AP, a self-contained single piece of silicon (a co-processor on a conventional motherboard). Meaning, there is no marshaling of bytes back and forth between the CPU and RAM; the processing and the storage of the “instructions” live together.
  3. Each symbol is processed in parallel by ALL of the FSMs on the AP chip. They are not processed iteratively, one by one as through nested for-each loops.

The first two items describe the aspects of the performance gains at a “lower level” (in the weeds) than where the majority of BI developers ever want to live. It’s that third point that is the most compelling. With all due apologies to the “massive parallelism” of Hadoop, it’s not quite the same thing as the massive parallelism of the AP.

The massive parallelism of Hadoop occurs primarily at the server level, scaling out to even thousands of servers. It’s more like partitioning a set of data onto a large set of servers. This means there is still processing to figure out which server(s) hold the subset(s) of data we’re interested in, sending those instructions over the wire to the servers, the servers running through conventional read operations, etc.

The massive parallelism of the AP is more like someone finding someone who is interested in free tickets to the Boise Hawks game by shouting out into the crowd, as opposed to asking each person serially, one by one. The AP is fed a stream of symbols that is seen by all of the FSMs programmed onto that AP. Those “interested” in that symbol accept it as an event and move to the next state. Those FSMs in a state not interested in a particular symbol “ignore” the symbol and are unaffected.

In the case of this RegEx example, the valid symbols, the “alphabet”, are essentially the characters of the 8-bit extended ASCII set (0-9, a-z, A-Z, and the common symbols). Incidentally, the number of distinct symbols recognized by an AP is 256, the values of one eight-bit byte. With that said, it’s critical to remember that a symbol can represent anything, not just a literal letter or digit. For example, the symbols can represent up to 256 Web pages in a click stream analysis or the four nucleotides forming a DNA sequence.

Yes, that can be a limitation, but I’m sure that will change at some point, and there are techniques involving banks of Automata Processors, where FSMs are artfully partitioned based on a limited subset of the 256 possible symbols.

Multiple RegEx Example

This example will test a set of words against a set of rules, between which there is a many-to-many relationship. In other words, each word can be recognized by multiple regular expressions. This example reflects the use case I described above (in the section “My Data Exploration Use Case”) concerning the exploration of heaps of data.

This example is primarily utilizing SQL Server with a small C# function to leverage the .NET Framework’s RegEx functionality, which is much richer than SQL’s LIKE key word. As a reminder, I’ve kept the code as minimal as possible as many details, such as how to register a .NET DLL into SQL Server, are well documented elsewhere.

The data set used in this example is very small in order to illustrate the concept which would be greatly scaled-up were we using the Automata Processor. Millions of words and tens of thousands of patterns (RegEx) would not run very well using the conventional approach shown in the example.

The code can be downloaded as a .txt file. It’s just text, no binary stuff. Be sure to rename the file with the .sql extension to open in SQL Server Management Studio.

Figure 2 shows a SQL script that creates a temporary table of the words we wish to recognize.

Figure 2 – Words.
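
Since Figure 2 is only a screenshot, here is an abbreviated sketch of what that script might look like. The column name and the exact word list are my own reconstruction from the values discussed later in this blog; the real script inserts 18 words.

    -- Hypothetical, abbreviated reconstruction of the Figure 2 script.
    CREATE TABLE #txt (Word VARCHAR(100));

    INSERT INTO #txt (Word)
    VALUES ('555-55-6666'),               -- looks like a social security number
           ('555556666'),                 -- nine digits, no dashes: SSN or Zip code?
           ('(208) 555-1234'),            -- phone number with parentheses (made-up value)
           ('208-555-1234'),              -- phone number without parentheses (made-up value)
           ('1A Z999'),                   -- Ada County auto license plate
           ('2GZ999'),                    -- Idaho auto license plate, no space
           ('45-888 Kamehameha Highway'), -- street address in Kaneohe
           ('1200 Kamehameha Highway');   -- Kamehameha Highway address outside the 45-ddd pattern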

Glancing through the “words” (in this case “phrases” may sound more natural) inserted into the temp table in Figure 2, some are easily recognizable by our robust brains as formats such as dates, street addresses, and phone numbers. Some are ambiguous, such as the 9-digit words. So the idea is to take these words and check them against all the patterns we know, as shown in Figure 3.

Figure 3 – Patterns in our “knowledge base”.

The temp table, #RegEx, holds a row for each regular expression, a category, and a more specific description.
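
Again, since Figure 3 is only a screenshot, here is an abbreviated sketch of what the #RegEx table might hold. The column names follow the description above, and the patterns are plausible .NET regular expressions I wrote for the categories discussed, not the blog’s actual nine patterns.

    -- Hypothetical, abbreviated reconstruction of the Figure 3 script.
    CREATE TABLE #RegEx (
        RegEx       VARCHAR(200),
        Category    VARCHAR(50),
        Description VARCHAR(100)
    );

    INSERT INTO #RegEx (RegEx, Category, Description)
    VALUES ('^\(?\d{3}\)?[ -]?\d{3}-\d{4}$', 'Contact',  'Phone #'),
           ('^\d{3}-\d{2}-\d{4}$',           'Identity', 'Social Security #'),
           ('^\d{5}(\d{4})?$',               'Address',  'Zip code'),
           ('^\d[A-Z] ?[A-Z]\d{3}$',         'Vehicle',  'Ada County Auto License'),
           ('^45-\d{3} [A-Za-z ]+$',         'Address',  'Street address in Kaneohe');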

Figures 4 and 5 show the translation of two of the regular expressions held in the #RegEx table: a fairly simple one for Ada County auto license numbers and a little more complicated one for phone numbers. Some of the patterns are very specific, such as the one for an Ada County auto license. I’ve included such specific ones to help demonstrate that patterns don’t need to be universal. We could instead encode many patterns addressing subsets.

Figure 4 – Finite State Machine representation of an Ada County license plate Regular Expression.

Finite State Machines are the heart of the Automata Processor. Once you’ve registered for the Automata Processor preview I mentioned towards the beginning of this blog, you will see an interface that allows you to “author” such diagrams in a WYSIWYG manner. However, keep in mind that there are methods for authoring such diagrams en masse, for example, from the sequences in a Time Sequence data mining model. The sequences can be uploaded through the AP’s own XML format, ANML (pronounced “animal”).

Figure 5 – Finite State Machine representation of the Phone Number Regular Expression.

Figure 6 is a very simple scalar SQL Server function written in C# that uses the .NET Framework’s RegEx class. Figure 8 below shows how this function is used in SQL.

Figure 6 – C# SQL Server scalar function utilizing the .NET Framework’s RegEx.
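
Because Figure 6 is only a screenshot, here is a minimal sketch of what such a SQL CLR scalar function might look like. The function name RegExMatch and its parameter names are hypothetical; the real function in the figure may differ.

    // Hypothetical reconstruction of a function like the one in Figure 6.
    // Returns 1 if the value matches the regular expression, otherwise 0.
    using System.Data.SqlTypes;
    using System.Text.RegularExpressions;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlInt32 RegExMatch(SqlString pattern, SqlString value)
        {
            // Keep the example minimal: treat NULLs as "no match".
            if (pattern.IsNull || value.IsNull)
                return 0;

            return Regex.IsMatch(value.Value, pattern.Value) ? 1 : 0;
        }
    }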

Figure 7 is a script for registering the .NET function into SQL Server. I haven’t described all of the warnings related to enabling CLR functions as it is explained very well elsewhere.

Figure 7 – SQL Server code to register the .NET DLL into SQL Server.
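
For completeness, here is a minimal sketch of the kind of registration script Figure 7 shows. The assembly name, DLL path, and function name are hypothetical, and the CLR security caveats are glossed over just as they are in the blog.

    -- Hypothetical assembly name, path, and function name.
    sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO

    CREATE ASSEMBLY RegExLib
    FROM 'C:\Assemblies\RegExLib.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE FUNCTION dbo.RegExMatch (@pattern NVARCHAR(4000), @value NVARCHAR(4000))
    RETURNS INT
    AS EXTERNAL NAME RegExLib.[UserDefinedFunctions].RegExMatch;
    GO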

Now that all of the pieces are in place (the RegEx function is registered in SQL Server and we have a query window open in SSMS where the two temp tables were created), we can run a SQL statement (Figure 8) to test each word against each rule. The SQL statement CROSS APPLIES the words with each rule, filtering out any word/pattern combination that does not result in a recognition. A recognition is determined via the RegEx function we created and registered into SQL Server as shown in Figures 6 and 7, which returns a 0 or a 1.

Figure 8 – SQL to process the words and RegEx.
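
Here is a minimal sketch of the kind of query Figure 8 shows, using the hypothetical table, column, and function names from the earlier sketches.

    -- Test every word against every pattern; keep only the recognitions.
    SELECT t.Word,
           r.Category,
           r.Description
    FROM   #txt t
    CROSS APPLY (SELECT Category, Description
                 FROM   #RegEx
                 WHERE  dbo.RegExMatch(RegEx, t.Word) = 1) r
    ORDER BY t.Word;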

Using the SQL in Figure 8 with its CROSS APPLY join, the 18 words we inserted into #txt and the 9 patterns we loaded into #RegEx resulted in 162 (18*9) comparisons. In other words, for each word, check each rule. If this were scaled up, for example if there were millions of words and thousands of patterns, the number of comparisons would be huge.

If these 18 words were fed into an Automata Processor loaded with those 9 patterns, each word would be fed only once and all 9 patterns would analyze it in parallel. To rephrase something I mentioned earlier, this is the same as someone holding up the word 555-55-6666 and shouting to a bunch of people, “Hey! What is this?”, as opposed to walking up to each person and asking that question.

Figure 9 shows the results of the SQL shown in Figure 8.

Figure 9 – Results of the SQL shown in Figure 8.

We’ll look at a few of the results, which illustrate some interesting aspects of exploring data in this manner:

  • Rows 1 and 10 show the “Phone #” RegEx is versatile enough to recognize a phone number with and without parentheses. In this case, for any word containing a set of 3 digits, 3 digits, and 4 digits, we can be fairly confident it’s a phone number, with or without parentheses around the first three digits. So it’s OK to use one versatile RegEx.
  • Rows 4 and 5 show that ‘1A Z999’ is recognized both as the more specific Ada County auto license and as the more generic Idaho auto license. The less specific Idaho auto license pattern also recognized 2GZ999, even without a space between the 2G and the Z999 parts. It’s good to recognize something at various levels. For example, sometimes we need real sugar, sometimes any sweetener will do.
  • Row 7 recognized “45-888 Kamehameha Highway” as an address in Kaneohe, but not “1200 Kamehameha Highway”. Because Kamehameha Highway practically circles Oahu, being able to recognize an address as specifically being in Kaneohe requires this fairly stringent rule. This doesn’t mean all addresses in Kaneohe follow this rule; other rules would be developed, hopefully with at least some abstraction into a RegEx. For example, because Luluku Road is only in Kaneohe, any address on Luluku Road (also following the 45-ddd format typical for Kaneohe) is a street address in Kaneohe.
  • Row 8 shows 555556666 as a Zip code, although another word whose only difference is the dashes, 555-55-6666, is clearly a social security #. However, there really is no reason 555556666 cannot be a legitimate Zip code (somewhere in Minnesota). Even though our human brains may lean towards reading it as an SSN, it’s good to have something that can see beyond our biases.

So suppose that over the years, through dozens of customers and hundreds of databases, I collected thousands of formats for data. Most will not be as universal as date and phone number formats, but even seemingly one-off formats could provide insight. For example, suppose years ago I encountered some old software system that stored case numbers in the format of three upper-case letters, a dash, two digits, a dash, and four digits (RegEx: [A-Z]{3}-\d{2}-\d{4}). If today at another customer I encounter such a format, it suggests a relationship that may or may not matter.
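
As a quick sketch, that accumulated pattern could be tested like any other entry in the library (again using the hypothetical RegExMatch function from the earlier sketches):

    -- Returns 1: 'ABC-12-3456' fits the old case-number format.
    SELECT dbo.RegExMatch('[A-Z]{3}-\d{2}-\d{4}', 'ABC-12-3456');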

To take the code presented here to that level where we explore the hundreds of databases throughout an enterprise, I would expand this example to:

  1. Iterate through a list of database servers, each database, each table, each view (because there could be calculated columns), and each column of those tables and views (a minimal metadata query is sketched after this list).
  2. For each column, the tool would retrieve each distinct value and a count of each value.
  3. For each of those distinct values, the tool would test it against each regular expression in the library accumulated over a long consulting career. Every value recognized by a regular expression, whether a street address from a little town on Oahu or just something numeric, would be added to a table similar to the one shown in Figure 9, but with columns identifying the server, database, table, and column as well.
  4. Because this could mean trillions of rows for a large enterprise, we could instead store only the counts for each regular expression for each column. So if there were, say, 50,000 columns across all databases, each triggering around ten regular expressions, that’s only 500,000 rows.
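
As a minimal sketch of step 1 for a single database (a fuller version would also loop over servers and databases, and steps 2 through 4 would be driven by dynamic SQL built from these rows):

    -- Enumerate the character columns of every table and view in the current database.
    SELECT TABLE_CATALOG AS DatabaseName,
           TABLE_SCHEMA,
           TABLE_NAME,
           COLUMN_NAME,
           DATA_TYPE
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  DATA_TYPE IN ('char', 'varchar', 'nchar', 'nvarchar', 'text', 'ntext')
    ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;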

Remember though, the purpose of this blog isn’t so much to suggest a data profiling technique as to present a pattern for an Automata Processor use case which could provide inspiration for other applications.

Conclusion

It seems that the combined rate of data growth, the complexity of the world, and the computing power required to answer the sorts of questions we now face is outpacing the increase in CPU computing power described by Moore’s Law. But we can still tackle this problem by looking towards these massively parallel approaches.

Last week (August 3, 2015) I posted a blog on a “Graph Database Symposium” I’m planning. At this point, the planning is even earlier than the “early stages”; the intent of that blog is to gauge interest in such a symposium. Hopefully, this blog helps take the reader further along in recognizing the value of graphs and the Automata Processor.
