In the end, it's Sender's Game. 

by: Chris Fuentes

email factorization.png

 

Doesn't that look tasty? I had something like that for dinner last night.

Anyways, that tasty plate of spaghetti and meatballs is in fact a bayesian network of the conditional dependencies of the factors which contribute to the classification of an email. In probabilistic terms, this equates to:

Pr(classification | attachments, trigrams(body), trigrams(subject), cc, bcc, from, In-Reply-To, seen) * Pr(from) * Pr(attachments | from) * Pr (trigrams(body) | bigrams(body)) * Pr(bigrams(body) | monograms (body)) * Pr(trigrams(subject) | bigrams(subject)) * Pr(bigrams(subject) | monograms(subject)) * Pr(monograms(body) | In-Reply-To) * Pr(monograms(subject) | from) * Pr(seen | trigrams(subject)) * Pr(monograms(subject) | In-Reply-To) * Pr(cc | trigrams(body)) * Pr(bcc | trigrams(body))

I actually wasn't trying to find an exact probabilistic model for classification. That would be kind of dumb in this domain, since one of the initial hurdles is the unfortunate lack of data. Without data, there is no probability that can be determined. Rather, the reason I drew all this spaghetti was because I wanted a clear picture of the factors which influence the classification of an email, and which ones we don't need to care about.

According to the graph, just about everything comes down to the sender. That makes sense intuitively, since that is perhaps the first thing we look at when determining whether an email is important or not.

Therefore, I designed the neural net to consider only the following: Sender, N-grams. I threw in a few of the other headers (cc, bcc), but primarily the classification is performed by examining the sender and N-grams of the ( body | subject ) string.

Did it work?

It's too subjective to have a statistical basis. I can tell you this - recent tests I've run indicate that over time, given enough recategorizations, the network will learn the user's preferences exactly.

 


Previously, on New Frontiers: 

So Much Depends Upon a Read Wheel Barrow

First some background: after implementing a Neural Network which is fully capable of handling arbitrary tagged dataset classification, I realized that the "beliefs" and "assumptions" parameters were useless and contaminating, respectively. "Beliefs" translated into nothing more than initial weight biases which would sway the hidden nodes' weight vectors one way or another. And after I'd finally cooked up a scheme for these biases to directly affect the outcome, it hit me: if the beliefs are wrong, they will be overturned within the first few rounds of training. All they would do is slow down the neural net. If they were correct, then the neural net could converge faster, but not by much. In the end, no real benefit.

The "assumptions" parameters were more interesting. These were ideas that translated into a priori tagging of untagged data, so that it could be used as a training set when little or no tagged data existed. If the assumptions were correct, even partially, it would allow us to circumvent the hurdle of not having a user-tagged set of training data. But things could not be so simple.

For the sake of analogy (and, again, not to give away the farm), suppose we were classifying books. We receive a pile of books, and have no idea which ones the user likes. We are told to make the neural net classify them as "Interesting" or "not". Suppose also that we are told which ones the user has read cover to cover, and which ones haven't been touched. Aha, that's a clue. So let's make an assumption that if the user read a book cover to cover, that book is "interesting" to the user, and the other books are "not". We tag the dataset this way, and shove the books into the abyss of the neural net.

The Little Neural Net That Could happily trains on this tagged set, and converges to perfect performance on the existing data. It even performs perfectly if some of the books are kept to the side as testing data. Wowee, isn't that great? But wait a second, what has the neural net actually learned about the relationships between what the user finds interesting?

Nothing, of course. If you were to look at the individual weight vectors and take them apart to analyze their individual influences (sounds like a week's worth of fun), you'd find that the output node for "interesting" just massively upweights all training data where "read cover to cover" is set to true. The other output node proportionally downweights them. All other nodes could be ignored.

The problem is this: we have some attributes, one of which is "read-cover-to-cover". If that is set to true, then the network reports "interesting," else it reports "not." So, in bayesian network terms, the outcome is conditionally dependent on a single attribute. But in framing the problem this way, we are saying that "read-cover-to-cover" is not just an attribute, but is actually conditionally dependent on the other attributes, since it itself is the deciding factor of the outcome. Thus we have this nonsense:

Features of book (author, length, style, etc...) ===> Read-cover-to-cover? ===> Interesting/Not

In my professional opinion, bah humbug. 

Gee wilikers, Batman, what's the flash of insight already?

The flash of insight is ... we remove the input node related to the assumption! If we pre-tag a bunch of text based on whether or not they've been read cover to cover, we then remove the input node corresponding to that boolean value. This way the neural net will be forced (politely, of course) to find a weighting of the other attributes of the book which fit the training data. In other words, we started with this chain:

Features of book (author, length, style, etc...) ===> Read-cover-to-cover? ===> Interesting/Not

And removed the middle node!

Features of book (author, length, style, etc...) ===> Interesting/Not

Therefore, the network will determine which of the other features of the book gave rise to the classification which coincides with the user having read the book cover to cover. If our assumption is correct, this will give us a clue which features of the book actually make it interesting for the reader.

Of course, there is a risk of conflation (it is, after all, an assumption), but remember: this is just the initial dataset. After feeding it through a few more datasets, tagged on different characteristics with hopefully a large amount tagged by the user themselves eventually, the net can build a complete picture of what factors influence a classification and then generalize them accurately to unseen data. So, in our book example, we would hope that after feeding it a stack of books tagged on the basis of their read-status and finding that all Charles Dickens books were read cover to cover, a new Charles Dickens book (which has not been read) would still be tagged as "interesting" because its characteristics overlap those of Dickens’ entire oeuvre. Less consistent authors might present more of a challenge, but again we can only ask for so much.

Well, that's it for now folks! Once I apply these changes to the flux capacitor neural net, I'll be sure to post my results.

 


CROUCHING TIGER, HIDDEN BIAS

I have begun coding the "feedforward" portion of the network, only to discover an issue I had not foreseen: the effect of hidden nodes on beliefs.

Assume a fully connected network: the input nodes have their values directly from the text (and we'll just say boolean values for now: 1 for the presence of a word, 0 for the absence thereof). If we don't have any hidden nodes, then we can incorporate our beliefs directly into the weights of the output nodes (that is, manually alter the default weight corresponding to that input node). However, if we have a hidden layer, things get tricky. Here's a simple example:

Let's say we have two input nodes, one which checks for the presence of the word "blue" and the other which checks for the presence of the word "green". Then lets say we have one output node, which is assumed to be "Important" if it reaches the threshold above 0.5, and not important otherwise. Now say we bias "blue == true" to have a weight of 1, and all other values have a default weight of ±epsilon. 

Simple enough: if we fire the network as is with values (blue == true, green == true), then the output node will receive inputs  [1, 1] (since each word was present) and weight the inputs with with weight vector [1, ±epsilon], which through becomes ~1.0 after the activation function. This is the behavior we would expect from incorporating such a bias. Similarly, if blue was set to false, then the output node receives [0, 1], weight vector [±epsilon, ±epsilon] and the output is ~ 0. 

Now let us introduce a hidden node between the two inputs and the output. This hidden node now takes on the characteristics that the output node previously had: it will be directly influenced by the initial bias. But the output node has no idea if the hidden node should be biased or not, so it will use the default weight of ±epsilon. Therefore, even though the hidden node was able to incorporate the bias for values (blue == true), the output node completely disregards this. 

I see two potential solutions: one is to check each hidden node and see if any of its influences had bias. If so, mark the node as biased, and pass along a dictionary of the biases. then when it reaches the output node, the output node will know which weights to bias and by how much. 

Obviously, this will have dramatic effects on a fully connected network, since if any input node is biased, all hidden nodes will be biased and all weights for the output nodes will be biased. The hidden layers will initially be representations of the bias, not of the rest of the data as is. I wonder, though, if this is really such a bad thing: We want to be able to produce results with very little or no training data. If all we have to go by is our biases, maybe the network should initially lean heavily towards it. We could also downweight the bias as we move further along the network, but this seems like it would require a lot of tuning to get the downweight curve correct. We could do it geometrically on a basis of biased to non-biased input ratio, but that seems arbitrary and disregards the effect that the bias is supposed to have. 

The other approach is to just connect biased inputs directly to the output nodes and skip the hidden layers altogether. On the one hand this seems to maintain integrity better, since the biased nodes won't overwhelm the rest of the network. On the other hand, this disconnects the nodes from the hidden layers, and therefore more complex functions cant be learned. 

I was initially leaning toward the second approach, but in hindsight I think I 'll try the first. Afterall, the bias is only for initial feedforward() runs of the network. After backprop() runs a few times, incorrect biases will be ignored. 

 


WHITE RABBITS

I've hacked together most of my first approach toward text classification. Though I am having some mental roadblocks because of the new challenge of integrating the unsupervised portion of the algorithm, a larger conceptual concern has reared its head:

Markov Time Series

Preferences change over time. If we have a corpus, even a corpus tagged by date of appearance, over time the data has a potential to become very inconsistent. In the extreme case, suppose a person's preferences invert overtime. Taking the corpus of data as a single point would lead to completely inaccurate results, since we'd have texts that are both "important" and "not".

I believe the solution lies in finding a way to only train on the most recent set of data, or otherwise to down-weight earlier data beyond a certain horizon. 

I do want to take the opportunity to show one of the great features of the way I am constructing this algorithm. That is: the configuration file. You are allowed to configure the algorithm in a way that's consistent at a high and very much intuitive level with your personal a priori beliefs and assumptions. In the case of email, like so:

{

"beliefs" : {

    "seen"    : { "true" : 1 },

    "replied" : { "true" : 1 },

},

"assumptions": {

    "headers" : { 'notification' : 'not important' }",

},

}

Beliefs are measured on a scale of -1 to 1 (corresponding to "I think this is not important" to "I think this is very important). Assumptions are simple a priori tags, e.g. "this is important" vs. "this is not important".

Best of all, the way I have build the algorithm, it will have no trouble giving your beliefs the metaphorical middle finger (though perhaps an ASCII middle finger would be a nice touch) if it finds out they are totally wrong. That is, it can learn to not just disregard them, but rather to modify them to fit its actual view of the world. As for the assumptions, well you've got to be more careful there, because it will be slow to deviate from them, but over time if it receives enough counter examples, it should be able to disregard them as well.

Now, to see if it actually works...


PROTOCOL 13, OR WHY NEURAL NETWORKS TRUMP CLUSTERING

After some initial considerations, a bit of reading, and no small amount of talking to myself, I've settled upon two potential approaches: a clustering adaptation, and a back-propagation adaptation. The first one I don't like so much. The second one is incredibly complicated, but looks far more promising. 


A CLUSTERING ADAPTATION ADVENTURE

Clustering in general is a catch-all approach to unsupervised learning tasks. But, as discussed, what we have is not an unsupervised learning task entirely, but rather a developing task. That is, it starts out with no substantial supervision and slowly gains supervision overtime. So, it's got to be able to account for both cases, and transition smoothly from one to the other. 

How can clustering help us? All we have to do, in theory, is provide some guidance to our algorithm to help it determine which cluster is which. Here is the general approach:

    def TextClassify(data, initialBeliefs = range(0, 1) ):

        newBeliefs = initialBeliefs

        while (true):

            cluster(data, newBeliefs)

            classifiedData = checkDatabaseForNewData()

            newBeliefs = addTrainingData(classifiedData, data)

            classifier = findSeparatorFunction(geometricTag(data))

def Cluster(data, initialBeliefs):
        dataWithBeliefs = [addBeliefs(datum) for datum in data]
        geometricClustering(dataWithBeliefs)

def addTrainingData(classifiedData, data):
        newCentroids = findCentroids(classifiedData)
        return expectationMaximization(data, newCentroids)

Basically, we are taking our data and tacking on another parameter to it: "belief", which is something in the range of 0 to 1, representing our conviction that it is important (1) or not important (0). We cluster the data geometrically into two clusters (presumably, important and unimportant) and then wait to get some tagged data. Once we get the tagged data, we adjust our conception of where the centroids are and run expectation maximization to adjust the belief parameter of each datum. Once that's done, we classify the data geometrically by cluster, and then find a separator function that can assist in future classification. Rinse and repeat. 

I haven't thought this approach out too much further than this. There are two compelling reasons why:

The first point >> The dimensionality of english text, on a monogram basis, is over 500,000. Mathematically, I don't think that having a single parameter fluctuating between 0 and 1, even if alongside only binary features, can't robustly push a datum toward one cluster or another in a meaningful way. We can mitigate this by change the numeric range of belief-states to be much larger, but I'm not sure at first pass how we could guarantee that we've settled on a good range. 

The second point >> We're not going to have a clean linear separation between data. Yet, only having two clusters implies that we will. We're not just cutting corners, we're shooting ourselves in the foot. And cutting corners. 

A MORE SENSIBLE APPROACH: PROTOCOL 13

Back propagation isn't for the faint of heart, nor for those who don't like math all that much. However, I envision that there is an incredibly flexible model that can be built using the back-prop algorithm as a base-line. 

Without giving away the farm, here's the basic outline:

The in's: We have a variable number of input nodes, each representing one feature. Since we're dealing with text, we can assign input values in one of two sensible ways: TF*IDF, or boolean. Frankly, I haven't the faintest which would be better, but hey–that's why they hire guys like me. 

The not-quite-in's and not-quite-out's : Then we have the hidden layer. Or several hidden layers. I guess that's another thing to tune, but since it's been proven that enough hidden nodes can learn any function, really the only thing to watch out for is making sure we've got enough, or more than enough. This parameter might need to be tuned either live or a priori in order to maximize run-time performance. But at this stage, I don't give a cow if it's slow, it just needs to work. 

Each hidden node has an array of FeatureConverters: objects that take as input a certain feature (identifiable via featureID or featureType or whatever), the actual observed value for that feature, and spits out the current weight assigned for that feature at that weight. 

The out's: Lastly, we've got the output nodes. Two of them, to be precise. One is proudly labeled "Important", the other: "not". We'll just go with normalized unit values for now. 

The in-between's : What about connectivity? Fully connected or not? I think intuitively, we would want *most* of the nodes connected. There are a few that I wonder about, though – principally, whether meta-data should be connected to observed data. I guess we're really making a statement about the relationships we're expecting from a text document by choosing not to connect certain nodes, but experimentation will hopefully provide some insight as to whether we actually should choose not to be fully connected. 

Framing the problem as a neural network allows us to exploit any and all relationships between any and all features. It also allows us to gradually add in data and continuously train on more data as it arrives. But what about when there's no data? Well, we have to initialize all features by default to have no influence, except for the ones that we believe actually do have influence. Some of this could be metadata, some of it could be actual features of natural language.

The way I've framed it, I'm not intuitively sure that the weights for individual values of a feature will be updated to robustly reflect relationships with other particular values of other particular features. But, I could be wrong. If each hidden node can learn one relationship, then it may just be a matter of finding the optimal number of hidden nodes. Or even hidden layers. ML problem turned Graph Search problem? You bet. Fun? Not at all. Beer time?

Absolutely. 

 


ONE (SMALL) STEP FOR MACHINES.  

I have decided to attempt to solve a problem which falls outside of the realm of straight-forward techniques in Artificial Intelligence.

The fully generalized version of the issue is this:

Given a bunch of texts, (could be anything: books, text messages, facebook posts, whatever ~) can we have a computer figure out which ones are important to you, yes, you and which ones are not? 


M.O. 

To be fair, this is a problem for which solutions have been and are currently being attempted. On one domain, recommendation services, (such as Apple Genius and Amazon's recommendation services) are constantly seeking ways to improve their algorithms which recommend products that a particular user is more likely to buy. Historically, such tasks were addressed by compiling huge amounts of sales data, and simply following the trends (e.g., we know that ten thousand people who like A also like B, so if the 10,001th likes A, he'll probably like B).

However, this approach completely neglects the intrinsic properties of the products themselves--properties which are in fact the reasons why someone would want them or not want them. 

Secondarily, in the realm of filtering and advertising, it has become important to various industries to be able to streamline only the information that a particular user wants to see, cutting down as much as possible on clutter which the user would only view as annoyance (in the case of advertisement) or spam (in the case of email). 


DEFINE IMPORTANCE

This particular problem has some interesting features. First of all, at its heart we have a binary classification problem: text can fall into the category of "Important"/"Interesting", or "Not". This fact initially gave me lots of hope, because there exist a plethora of off-the-shelf-ready techniques for performing arbitrary binary classification. All we needed to do is define a feature set, collect a bunch of training data, and boom: we're done. 

Kidding. 

The idea of training data qua training data contradicts the nature of the problem we are trying to solve. Suppose we asked 1 million users to each go through 1 million pieces of text and mark them as either "interesting" or "not". We could then run any of our awesome out-of-the-box algorithms on the data and go forth to classify the known world of text. What would happen?

Our classifications would be wrong, just about every time.

They would be wrong because "important" is not an objective classification. In fact, from one person to the next, there may be a 100% inversion of "important" to "not" on certain text. Trying to build general rules based on large amounts of training data will lead to very unpredictable results, and it will only be correct when all of the training data is in agreement about certain types of text (e.g., if everyone hates text that contains the word "pineapple", then the classification will do just fine for texts that contain "pineapple". But I don't think I need to convince you that there would be very few, if any, cases like this). 


CONSIDER THE INDIVIDUAL

What we need is a set of training data for each individual end user. This training data can not be corrupted by mixing it with any other training data. That way, the classification will be able to learn from the user's training data proportional to the level of consistency that the user classifies texts as important or not. In theory, that would result in a pretty good classifier. 

But not really. First, machine learning algorithms which are data dependent, such as classification, require quite ideal data in order to function correctly. That is, there has to be enough of it, it should be consistent, and it should be general enough to speak to all features in the particular domain we are classifying (for example, if we showed a user ten spiderman comic books and had them tag them as important or not, then tried to guess how the user would feel about "A Tale of Two Cities", we'd essentially be just guessing).  

For this per-user classification to work, we'd need each user to have a pre-existing robust set of training data. And that's... not gonna happen.

So  we essentially have none-to-sparse training data for a class of algorithms that require training data. Worse, if we are to solve this problem in a commercially appealing way, we want to be able to actually start making accurate classifications without any training data. Of course, once the user starts actually using the application, the user can verify that we've done a good job or not and thereby provide training data that we could collect and use to improve our accuracy. But how do we get a baseline?


A ROGUE SCIENCE

What we need is a new kind of algorithm. We need an algorithm that takes, as input, features and an expectation of their influence toward classification. The algorithm must be able to account for instances of new training data that appear, weight them with our feature-weight expectations, adjust our current feature weight expectations, and proceed to classify. Essentially, we have an algorithm that starts as completely unsupervised, but slowly gains supervision over time. 

The real gotcha is, when we're at the completely unsupervised phase, even if we can linearly separate all text into two clusters, the algorithm needs to have some hint about which category corresponds to which cluster. 

Therefore, the algorithm needs to be able to account for the experimenter's a priori belief about the effect of certain features in the data. Or, more ideally, it needs to have some way to guess which features are more likely to contribute to a certain classification and give them initial weights accordingly. 


FIRST STEP 

I'll begin by modifying some existing techniques to see if I can come up with an algorithm that meets the requirements expressed above. 

Stay with me. 

 

Comment