Lead Scoring Models: Avoiding Pitfalls (Multicollinearity)

December 24, 2009

Last week, I wrote about the dangers of assigning point values to lead attributes with low sample sizes. This week I will cover multicollinearity, a statistical phenomenon in which predictor variables (in our case, lead attributes) are highly correlated with one another. Multicollinearity distorts a manually built lead scoring model only when two or more lead attributes that have assigned point values (because they are highly correlated with conversion) are also highly correlated with one another. It is a more elusive pitfall than low sample sizes, which are easy to spot, but it is equally dangerous. If ignored, it too can significantly distort a lead scoring model.

Multicollinearity
Let's assume we are building a lead scoring model for a company that sells sales force automation software and competes with salesforce.com. The company advertises under salesforce.com-related Google keywords and has a landing page designed specifically to convince salesforce.com prospects to trial its competing product. Leads that clicked on a salesforce.com-related Google keyword convert at a much higher rate than average, and leads that land on the salesforce.com landing page also convert at a much higher rate than average. The process of manually assigning point values seems to suggest assigning positive points to both attributes (since leads with either attribute convert at a high rate). If we do that, though, we are in danger of falling into the multicollinearity trap and rewarding leads twice for what is essentially the same attribute (clicking on a salesforce.com-related keyword and landing on the salesforce.com landing page).

To test for multicollinearity, we need to analyze the relationship between the two attributes. If we find that most or all of the salesforce.com landing page leads also clicked on a salesforce.com-related keyword, we should assign points to only one of the attributes and disregard the other. If we awarded points for both attributes, the lead scoring model would over-prioritize leads that came through salesforce.com-related keywords and the salesforce.com landing page, to the detriment of other, potentially higher quality leads.
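As a rough illustration of that overlap check, here is a minimal Python sketch. The attribute names (`sfdc_keyword`, `sfdc_landing_page`) and the eight sample leads are made up for the example, not real data; it simply measures what share of leads with one attribute also have the other.

```python
def overlap_share(leads, attr_a, attr_b):
    """Return the share of leads with attr_a that also have attr_b."""
    with_a = [lead for lead in leads if lead[attr_a]]
    if not with_a:
        return 0.0  # no leads carry attr_a, so there is nothing to compare
    both = sum(1 for lead in with_a if lead[attr_b])
    return both / len(with_a)


# Hypothetical leads: each dict records whether the lead has each attribute.
leads = [
    {"sfdc_keyword": True, "sfdc_landing_page": True},
    {"sfdc_keyword": True, "sfdc_landing_page": True},
    {"sfdc_keyword": True, "sfdc_landing_page": True},
    {"sfdc_keyword": True, "sfdc_landing_page": True},
    {"sfdc_keyword": True, "sfdc_landing_page": False},
    {"sfdc_keyword": False, "sfdc_landing_page": False},
    {"sfdc_keyword": False, "sfdc_landing_page": False},
    {"sfdc_keyword": False, "sfdc_landing_page": False},
]

# Share of landing page leads that also clicked a salesforce.com keyword.
page_to_keyword = overlap_share(leads, "sfdc_landing_page", "sfdc_keyword")
# Share of keyword leads that also hit the landing page.
keyword_to_page = overlap_share(leads, "sfdc_keyword", "sfdc_landing_page")

print(f"landing page leads with keyword: {page_to_keyword:.0%}")
print(f"keyword leads with landing page: {keyword_to_page:.0%}")
```

In this made-up sample every landing page lead also clicked a salesforce.com keyword, so awarding points for both attributes would count the same signal twice.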

In this case, these two attributes were prime suspects for multicollinearity because their relationship is logical. In other cases, lead attributes that do not seem to be connected can be highly correlated with one another. To ferret out the less obvious highly correlated lead attributes, analyze the relationship between each important attribute (those that have assigned point values) and every other important attribute.
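One way to run that pairwise comparison is sketched below in Python. It assumes every attribute is a yes/no flag and uses the phi coefficient (the Pearson correlation for two binary variables) as the measure of correlation. The attribute names, the sample leads, and the 0.7 cutoff are all assumptions chosen for illustration; the right threshold depends on your data.

```python
from itertools import combinations
from math import sqrt


def phi(leads, a, b):
    """Phi coefficient between two yes/no attributes (ranges from -1 to 1)."""
    n11 = sum(1 for lead in leads if lead[a] and lead[b])
    n10 = sum(1 for lead in leads if lead[a] and not lead[b])
    n01 = sum(1 for lead in leads if not lead[a] and lead[b])
    n00 = sum(1 for lead in leads if not lead[a] and not lead[b])
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If an attribute never varies, the correlation is undefined; report 0.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def correlated_pairs(leads, attributes, threshold=0.7):
    """List every pair of attributes whose |phi| meets the (assumed) cutoff."""
    flagged = []
    for a, b in combinations(attributes, 2):
        r = phi(leads, a, b)
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical leads as (sfdc_keyword, sfdc_landing_page, requested_demo).
rows = [
    (1, 1, 1), (1, 1, 0), (1, 1, 1), (1, 1, 0),
    (1, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
]
attributes = ["sfdc_keyword", "sfdc_landing_page", "requested_demo"]
leads = [dict(zip(attributes, (bool(v) for v in row))) for row in rows]

for a, b, r in correlated_pairs(leads, attributes):
    print(f"{a} and {b} are highly correlated (phi = {r:.2f})")
```

Any pair this flags is a candidate for the treatment described above: keep the point value on one attribute and drop it from the other.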

Once we have assigned point values to all of the important attributes and have neutralized the dangers of low sample sizes and multicollinearity, it is time to test the model. I will cover how to test a lead scoring model in my next post. Once the testing section is written, we will have a process that should help optimize the sales resources of any expansion stage software company (or any company that has a large inflow of leads).


Vlad is a CEO at <a href="http://www.scan-dent.com">Scandent</a>, which develops radio frequency identification (RFID) systems that prevent theft, loss, and wandering/elopement in hospitals and nursing facilities. Previously, he was an Associate at OpenView.