Lead Scoring Models: Avoiding Pitfalls

December 17, 2009

When manually assigning point values to lead attributes (covered in my last post), there are a couple of pitfalls that need to be avoided: assigning point values to lead attributes with low sample sizes and to lead attributes that are highly correlated with one another (multicollinearity). If not averted, both of these can cause a model to assign or deduct too many points for certain leads, effectively distorting it and potentially rendering it unusable. This week, I will cover the issue of low sample sizes.

Low Sample Sizes
Let’s assume that we are working with a data set of 500 leads and their attributes, that this set has a 10% conversion rate, and that only 5 of the leads have “friend/colleague” as their lead source (one of the lead attributes). Further, let’s assume that none of those 5 “friend/colleague” leads converted to an opportunity or sale. Since the conversion rate of “friend/colleague” leads is much lower than the overall conversion rate (0% vs. 10%), it may be tempting to label the “friend/colleague” lead source attribute as a negative predictor of conversion and direct the model to deduct points from “friend/colleague” leads.

But assigning a point value to this attribute is dangerous due to its low sample size. It may very well be true that the population of “friend/colleague” leads actually has a high conversion rate, and that by chance, this data set contains 5 “friend/colleague” leads that did not convert. If we had directed our model to deduct points for the “friend/colleague” attribute, we would be deprioritizing potentially great leads.
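To put a number on "by chance," here is a minimal sketch in Python. It assumes each lead converts independently with the same probability, which is a simplification; the function name is purely illustrative.

```python
def prob_zero_conversions(n_leads, true_rate):
    """Probability that none of n_leads convert, assuming each lead
    converts independently with probability true_rate (an assumption)."""
    return (1 - true_rate) ** n_leads


# 5 "friend/colleague" leads that truly convert at the overall 10% rate
print(f"{prob_zero_conversions(5, 0.10):.1%}")

# The same 5 leads if their true conversion rate were as high as 30%
print(f"{prob_zero_conversions(5, 0.30):.1%}")
```

Under these assumptions, seeing zero conversions among 5 leads happens about 59% of the time when the true rate is 10%, and still about 17% of the time when the true rate is 30%. In other words, the 0% observed here tells us very little about the attribute.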

It is advisable to assign point values only to lead attributes whose sample sizes are big enough to reflect the general population. As a general rule, don’t assign point values to attributes that have sample sizes of fewer than 40; once a sample surpasses 100, it is fairly safe to assign a point value to it (provided that the attribute’s conversion rate differs from the overall conversion rate).
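The rule of thumb above could be sketched as a simple check like the one below. The function name and the labels are hypothetical, and the rule says nothing specific about samples between 40 and 100, so treating that range as a judgment call is my assumption.

```python
def sample_size_guidance(sample_size):
    """Apply the rule of thumb: fewer than 40 is too small, more than 100
    is fairly safe. The middle band is labeled a judgment call (assumption)."""
    if sample_size < 40:
        return "too small: do not assign points"
    if sample_size > 100:
        return "fairly safe: assign points if the conversion rate differs from the overall rate"
    return "borderline: use judgment"


# The 5 "friend/colleague" leads from the example fall in the first band
print(sample_size_guidance(5))
```

Note that passing the sample size check is only half of the rule: points are still only warranted when the attribute’s conversion rate differs from the overall rate.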

To be more statistically correct, the appropriate sample size actually depends on your desired confidence level and error margin. For more detail on how sample sizes vary depending on confidence levels and error margins, this blog is a good read.
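As a rough illustration of that dependence, the sketch below uses the standard normal-approximation formula for estimating a proportion, n = z² · p(1 − p) / e². This is one common textbook approach rather than necessarily the method the linked blog describes; the function name and the default p = 0.5 (the most conservative guess) are my assumptions.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence, margin_of_error, expected_rate=0.5):
    """Sample size needed to estimate a conversion rate to within
    +/- margin_of_error at the given confidence level, using the
    normal approximation. expected_rate=0.5 is the worst case."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * expected_rate * (1 - expected_rate) / margin_of_error ** 2
    return math.ceil(n)


# 95% confidence, +/- 5 percentage points, worst-case rate
print(required_sample_size(0.95, 0.05))

# Same confidence and margin, but expecting a rate near 10%
print(required_sample_size(0.95, 0.05, expected_rate=0.10))
```

With these inputs the formula calls for 385 leads in the worst case and 139 when the expected rate is near 10%. Tightening the margin or raising the confidence level pushes the required sample size up quickly.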

Next week, I will cover multicollinearity, and the dangers it presents to expansion stage software companies trying to build effective lead scoring models.

Vlad is a CEO at <a href="http://www.scan-dent.com">Scandent</a>, which develops radio frequency identification (RFID) systems that prevent theft, loss, and wandering/elopement in hospitals and nursing facilities. Previously, he was an Associate at OpenView.