How JumpCloud Handles Over 100 Million Unique Daily Data Points From Their Product
When a company has no single source of truth for data, things get messy.
You’ve likely seen this scenario at work before: Each team leverages a different data system to answer the same analytical question, and each team answers the question differently.
For example, our sales cycle at JumpCloud is around 65 days, but one of my colleagues might observe a sales cycle closer to 90 days because she’s taking an average versus a median. Taking it further, a different colleague observes a sales cycle closer to 52 days because he includes our partner program in the results while I exclude that sales channel.
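To make that discrepancy concrete, here is a minimal sketch using made-up deal data. The numbers and the channel labels are hypothetical, not actual JumpCloud figures. It shows how the same dataset yields three different sales cycles depending on the statistic chosen and whether partner deals are included.

```python
from statistics import mean, median

# Hypothetical closed deals as (days_to_close, channel). Not real JumpCloud data.
deals = [
    (49, "direct"), (55, "direct"), (65, "direct"), (70, "direct"),
    (211, "direct"),  # one long-running deal pulls the average up
    (20, "partner"), (30, "partner"), (35, "partner"),
]

def sales_cycle(deals, stat, include_partner):
    """Return the sales cycle in days for a given statistic and channel filter."""
    days = [d for d, channel in deals if include_partner or channel != "partner"]
    return stat(days)

median_direct = sales_cycle(deals, median, include_partner=False)
mean_direct = sales_cycle(deals, mean, include_partner=False)
median_all = sales_cycle(deals, median, include_partner=True)

print(f"median, direct only: {median_direct}")
print(f"mean, direct only:   {mean_direct}")
print(f"median, all deals:   {median_all}")
```

With this sample, the three calculations come out to 65, 90 and 52 days, even though every analyst is looking at the same deals.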
This misalignment in measurement can be costly for the business and its stakeholders. That’s why successful businesses solve this problem early. And one of the best ways to do that is to develop and maintain a data warehouse.
What is a data warehouse?
A data warehouse is that single source of truth for a company’s data. A data warehouse can mean different things to different people, so I’ll break it down to its core features so that we’re all on the same page.
Borrowing from our friends at Amazon, “A data warehouse is a central repository of information that can be analyzed to make better informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence”. I like how clean and focused this definition is, but what it does not address are the complexities and challenges that a business will face when developing a data warehouse.
There’s no question that deploying a data warehouse can be an intimidating project for any organization. Unless there’s a fairly large and urgent business use case to stand up a data infrastructure early on, it’s easy to delay these types of projects.
But the further along a business advances in its lifecycle, the more difficult it can be to capture and store the valuable historical data points from those systems.
Building analytics in a product led growth (PLG) business
I’d like to cast some light on an approach and data framework that has worked well for us at JumpCloud. This isn’t a one-size-fits-all approach, nor do I think we did it perfectly, but there are some guiding principles that will work in any business model and that should serve as a resource during implementation.
I’m talking about an infrastructure that:
- Is lightweight and quick to deploy
- Has flexible yet robust integrations with common source systems
- Enables all parts of the business with access to KPIs for their department
- Allows for a single view of the customer throughout their journey
That last point is key—it’s easy to derive metrics and insights for isolated segments of the customer journey, but the real value is derived once you’re able to provide a complete view of the customer journey from beginning to end, allowing for further explanatory and predictive analysis when viewing these data in aggregate.
What JumpCloud does and how we’ve adapted our analytics approach to the business model
JumpCloud is a Directory-as-a-Service® that securely manages and connects users to their systems, applications, files and networks. Think of our B2B SaaS platform as controlling who has access to what IT resources within an organization.
We consider ourselves a product led growth company: we invest heavily in our product so that our customers can answer their own questions and solve their pain points quickly, whether they’re remote or on-prem.
This point is particularly critical for us as we’re transforming a legacy enterprise infrastructure software category with heavy product and people demands into an almost consumer-like experience. Not easy, of course—and next to impossible without the data infrastructure we’re building.
It’s important to mention a key ingredient that assisted with our project momentum: We had a couple of experienced team members who focused on these efforts early on. While the tooling and frameworks have come a long way in the data space, these types of data endeavors can turn out to be costly and time-intensive without the right leadership. Spend time making sure you have the right team in place from day one.
The data we work with
Yes, a company like ours has a lot of data. We aren’t just absorbing data from Sales, Marketing, Customer Success and Finance; we’re also seeing more and more data from our Growth, Product, and Engineering teams. Think A/B product testing, client-side product usage and server-side product usage.
To give you an idea of the velocity and timing of this project, we deployed our Snowflake data warehouse in September of 2018. And as of today, that same data warehouse is the central repository for Marketing, Growth, Sales, Finance and Product analytics.
Our product data alone is responsible for over 100 million unique daily data points.
Our model has scaled quickly, and I believe it’s because we chose the right tooling—it’s allowed us to scale appropriately with the data volume we’re creating.
But before I get into the decisions behind our tool selection, I think it’s valuable to understand some of our initial use cases and why the business was willing to deploy capital and people resources on this project.
Why deploy a data warehouse?
Depending on your business model, there can be a whole slew of reasons to deploy a data warehouse and analytics environment. At JumpCloud, we had some obvious areas of the business where it just made sense to dig deeper into the data so that we could make more informed decisions.
Whether your marketing team is trying to decide where to deploy spend on paid search or which content to prioritize, understanding down-funnel performance is crucial.
Our warehouse allowed us to connect all of the data so that we could see which keywords, for example, were producing the highest quality of traffic. There could be, and likely are, certain content and keywords that bring in leads, but are those leads converting? Perhaps there’s specific content that brings in leads that have longer sales cycles, and we have to know as a business that these prospects need more time to close.
And which marketing efforts are having the greatest amount of impact on our top-of-funnel growth? These are the types of questions that we wanted to answer early on, and our data warehouse has been helping us course-correct around these important decisions.
While Salesforce has decent (not great) reporting out of the box, it wasn’t nearly as flexible as we needed it to be. Sure, we could visualize our sales funnel and opportunity forecast in Salesforce, but what if we wanted to understand how our funnel or sales forecast is changing over time?
Salesforce provides history tracking for data objects, but these change events can be difficult to aggregate and understand holistically (and you’ll likely need to export them to Excel). We didn’t want to do that. Instead, we started snapshotting data at a daily rate so that we can see how certain processes are changing over time. This would be near impossible without a data storage solution.
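As a rough illustration of the snapshotting idea, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the warehouse. The table names, columns and sample opportunities are all hypothetical, and this is not our production setup. Each day, the current state of every opportunity is copied into a snapshot table keyed by date.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.executescript("""
    CREATE TABLE opportunities (id TEXT PRIMARY KEY, stage TEXT, amount REAL);
    CREATE TABLE opportunity_snapshots (
        snapshot_date TEXT, id TEXT, stage TEXT, amount REAL,
        PRIMARY KEY (snapshot_date, id)
    );
""")

def take_snapshot(conn, snapshot_date):
    """Copy the current state of every opportunity into the snapshot table."""
    conn.execute(
        """
        INSERT OR REPLACE INTO opportunity_snapshots (snapshot_date, id, stage, amount)
        SELECT ?, id, stage, amount FROM opportunities
        """,
        (snapshot_date.isoformat(),),
    )

# Day 1: two open opportunities (hypothetical data).
conn.executemany(
    "INSERT INTO opportunities VALUES (?, ?, ?)",
    [("opp-1", "Discovery", 5000.0), ("opp-2", "Proposal", 12000.0)],
)
take_snapshot(conn, date(2019, 1, 1))

# Day 2: the live table only knows the new stage; the snapshots keep the history.
conn.execute("UPDATE opportunities SET stage = 'Closed Won' WHERE id = 'opp-2'")
take_snapshot(conn, date(2019, 1, 2))

# Open pipeline value per day, which the live table alone cannot answer.
pipeline_by_day = conn.execute(
    """
    SELECT snapshot_date, SUM(amount)
    FROM opportunity_snapshots
    WHERE stage != 'Closed Won'
    GROUP BY snapshot_date
    ORDER BY snapshot_date
    """
).fetchall()
print(pipeline_by_day)
```

The trade-off is storage: a full copy per day grows quickly, but it keeps the queries simple because every date has a complete picture of the pipeline.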
Another use case for the warehouse that was a bit unexpected is around the idea of data quality and feedback loops. Looking at opportunities, leads, contacts and accounts from a higher level has allowed us to identify opportunities to improve many of our operational processes.
These opportunities aren’t limited to just sales; they involve marketing, support and customer success as well. For example, something we discovered while working with our Marketing team was that we were over-counting the number of customer trials coming from certain channels. We found this by splitting each email address into its unique name (the part before the @) and its email domain (the part after the @).
This approach provided us with a unique identifier that we could leverage to better understand if a customer had more than one account, and it has allowed us to understand the true number of new customers that we’re acquiring from each marketing channel.
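Here is a minimal sketch of that idea. The sample sign-ups are invented, and crediting each domain to the first channel it appears in is an assumption made for illustration, not a description of our actual attribution logic.

```python
from collections import Counter

# Hypothetical trial sign-ups as (email, marketing_channel), in sign-up order.
trials = [
    ("ana@acme.io", "paid_search"),
    ("Bob@Acme.io", "paid_search"),
    ("it-admin@acme.io", "content"),
    ("sam@globex.com", "content"),
]

def split_email(email):
    """Split an address into (unique name, email domain), lowercased."""
    local, _, domain = email.strip().lower().rpartition("@")
    return local, domain

def count_trials(trials):
    """Return (raw trials per channel, new domains per channel).

    Assumption: a domain is credited to the first channel it shows up in.
    """
    raw, unique = Counter(), Counter()
    seen_domains = set()
    for email, channel in trials:
        _, domain = split_email(email)
        raw[channel] += 1
        if domain not in seen_domains:
            seen_domains.add(domain)
            unique[channel] += 1
    return raw, unique

raw, unique = count_trials(trials)
print("raw trials: ", dict(raw))
print("new domains:", dict(unique))
```

In this sample, four raw trials collapse to two distinct companies. One caveat: free email providers would merge unrelated customers under a single domain, so those domains need separate handling.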
Monitor these operational feedback loops on an ongoing basis because they will pay dividends consistently over time.
It wasn’t until we integrated our product usage data with our other databases that we really leveraged the power of our data warehouse and understood the potential impact that it could have on the business.
Developing and integrating your server-side product logs isn’t something you can outsource (at least not easily). We were fortunate enough to have data engineering team members to partner with when developing this part of the warehouse—this was a crucial aspect of the project because it was important to work with a team that knew the product inside and out. We had to make sure that we were deriving product events accurately and in a scalable manner.
While we’ve spent a lot of time and energy developing our server-side events, we’ve also been focusing heavily on client-side activity. These data collections have helped inform the way our customers are using our product console and they’ve enabled us to see where we might have potential friction points in the product—valuable insights for both us and our customers.
Since we’re a fast-moving company, our product continues to evolve and expand quickly, and this causes us to constantly add tracking events around important aspects of the product. Early in our deployment, we were reactive in adding these events, but over time we’ve become more and more proactive.
You’d be hard-pressed to find a better example of a team that leverages data on a daily basis to make decisions than Growth.
While using data to better understand the implications and impacts of design changes to your product is important, measuring the difference in the performance of variants within an A/B test is nearly impossible without a reliable data infrastructure in place.
Tying back down-funnel customer behavior to client-side experiments has been a really strong use case for our data warehouse. While we’re still learning better ways each day to approach the measurements of our experiments, it’s been really powerful to work with an experimentation framework that’s connected to all of the vital areas of the business.
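For readers who want to see what measuring a variant difference can look like, here is a minimal sketch of a two-proportion z-test, a standard way to compare conversion rates between two variants. It is not necessarily the method we use, and the experiment counts are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical experiment: users exposed to each variant and how many
# later converted further down the funnel.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The conversion counts are the part that depends on the warehouse: they only exist once the client-side experiment assignment can be joined to down-funnel outcomes.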
Now that you’ve gotten a glimpse at how some of our teams leverage our data infrastructure, you likely have a better sense of how we analyze customer journeys. From that initial marketing touchpoint, when the customer is trying to decide how well JumpCloud can solve their pain point, all the way to the rate at which the customer is expanding and using the product, our data collection has been key in helping us evolve the product to best meet our customers’ needs.
Choosing the right tools for your data warehouse
Deciding on tools that are not only appropriate for the business today but also support the vision of the business five years down the road is essential. For example, if your data warehouse needs to store JSON or XML data from your product logs in the future, then you’d better choose a system that can support these semi-structured data types.
Like almost any other SaaS startup, we use Salesforce as our central CRM system and source of truth when it comes to sales and customer information. Salesforce offers three out-of-the-box objects that you can use to speak to other systems from a “who” perspective: Accounts, Contacts and Leads. The Account ID, or a custom iteration of it, is the common language between all of our systems, and this field enables us to maintain a single view of our customer within our data model.
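To illustrate why a shared key matters, here is a minimal sketch that stitches records from three systems into one view per account. The field names, the `account_id` values and the sample rows are all hypothetical.

```python
# Hypothetical extracts from three systems, each carrying the shared account_id.
crm_accounts = [{"account_id": "001A", "name": "Acme", "owner": "Dana"}]
billing = [{"account_id": "001A", "mrr": 450.0}]
product_usage = [
    {"account_id": "001A", "event": "user_created"},
    {"account_id": "001A", "event": "device_added"},
    {"account_id": "999Z", "event": "user_created"},  # no CRM record yet
]

def build_customer_view(crm_accounts, billing, product_usage):
    """Join the three sources on account_id; return (view, unmatched ids)."""
    view = {
        a["account_id"]: {**a, "mrr": None, "event_count": 0}
        for a in crm_accounts
    }
    unmatched = set()
    for row in billing:
        if row["account_id"] in view:
            view[row["account_id"]]["mrr"] = row["mrr"]
        else:
            unmatched.add(row["account_id"])
    for row in product_usage:
        if row["account_id"] in view:
            view[row["account_id"]]["event_count"] += 1
        else:
            unmatched.add(row["account_id"])
    return view, unmatched

view, unmatched = build_customer_view(crm_accounts, billing, product_usage)
print(view["001A"])
print("ids missing from the CRM:", unmatched)
```

Tracking the unmatched IDs is worth the extra few lines, since records that fail to join are often the first sign of a data quality problem in a source system.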
Choosing tools that integrate well with Salesforce is, of course, essential. If, for example, you’re testing an ETL tool and it can’t sync custom objects from Salesforce, then forget about that tool and move on to the next.
It’s tough to know what functionality you’ll need two or three years down the road, but think about your business model and consider the evolution of your product so that you can support those analytics.
There are generally three major tools that are the pillars in setting up any data analytics infrastructure. While one could argue that a best-in-class analytics infrastructure needs many more tool types, I consider these three to be absolutely essential: ETL, Data Warehouse, and BI.
If you’re new to the world of data, ETL stands for extract-transform-load and it represents the process of moving data between different systems while potentially transforming it along the way. I say potentially because in recent years, ELT models (you guessed it, extract-load-transform) seem to have become increasingly popular.
I think the ELT approach allows an analytics team to be more flexible and lets them transform the data to their liking after it has reached the data warehouse. Of course, there are certain datasets that likely require transformation prior to being dropped off in a data warehouse, such as product and engineering logs. In instances like this, I recommend consulting with a data or software engineer to assist with setting up the proper framework for your product data.
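The difference between the two models is easier to see side by side. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for the warehouse. The table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
source_rows = [("001A", " Closed Won ", "1200.50"), ("002B", "closed lost", "0")]

def clean(row):
    """Normalize one raw record: trim and lowercase the stage, parse the amount."""
    account_id, stage, amount = row
    return account_id, stage.strip().lower(), float(amount)

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse

# ETL: transform inside the pipeline, then load only the cleaned result.
conn.execute("CREATE TABLE deals_etl (account_id TEXT, stage TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO deals_etl VALUES (?, ?, ?)", [clean(r) for r in source_rows]
)

# ELT: load the raw data untouched, then transform in the warehouse with SQL.
conn.execute("CREATE TABLE deals_raw (account_id TEXT, stage TEXT, amount TEXT)")
conn.executemany("INSERT INTO deals_raw VALUES (?, ?, ?)", source_rows)
conn.execute("""
    CREATE VIEW deals_elt AS
    SELECT account_id,
           LOWER(TRIM(stage)) AS stage,
           CAST(amount AS REAL) AS amount
    FROM deals_raw
""")

etl = conn.execute("SELECT * FROM deals_etl ORDER BY account_id").fetchall()
elt = conn.execute("SELECT * FROM deals_elt ORDER BY account_id").fetchall()
print(etl == elt)
```

Both paths end with the same cleaned rows. The practical difference is that the ELT path keeps `deals_raw` around, so the analytics team can change the transformation later without re-extracting from the source.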
Since your ETL tool speaks to each of your systems, be sure that it supports not only the systems you’re currently using, but the systems you plan on using in the near future.
We decided to go with an ETL tool that supported connections with the systems we worked with every day, that was very easy to configure, and that integrated with Salesforce in a way we really liked.
While our tool worked well for us, your use cases and needs may differ. So kick off a trial with each of the tools you’re considering and do the proper testing.
One trend I’m noticing in the data industry is that more and more third-party tools support ETL whether or not that’s their core service, so you may be able to find a tool that covers several of the needs you’re interested in. Fivetran, Matillion, Stitch, Segment and Alooma are tools that come to mind, but at the end of the day you’ll want to find a tool that solves your core use cases.
Key questions to ask when selecting an ETL tool:
- Can my data team easily maintain this tool, or will we need constant engineering intervention to maintain it?
- Does this tool support the transformations that we’ll need going forward (ETL), or are we fine with more straightforward database replication (ELT)?
- Does this tool connect to the data sources that we’ll need to work with, now and in the future?
- Will this tool scale with the business as we grow (both from a data volume and financial budget perspective)?
It should go without saying that out of the three tool types we’re discussing, the data warehouse is the most important. It’s the beating heart of your infrastructure.
Swapping out ETL and BI tools would be a pain, but it would be manageable. Swapping out data warehouse providers would be a heart transplant on your infrastructure. With this in mind, think ahead to your future use cases when deciding on your data warehouse, because you want a solution that will work for the business years into the future.
At JumpCloud, we wanted a data warehouse that would scale with us as our data grew, but also had reasonable setup costs and was easy to configure. We tested instances of Google BigQuery, Amazon Redshift and Snowflake. We ultimately went with Snowflake for these reasons:
- We knew that we were going to be ingesting varied data types in the future, specifically related to our product data, so we needed a warehouse solution that would support both semi-structured and unstructured data. Snowflake does this.
- The costs seemed reasonable. At the time, JumpCloud had recently received Series C funding, and we wanted a cost-effective solution that didn’t require a lot of upfront investment. Snowflake allows for various ways to control the costs of your warehouse, but the primary way we kept our costs down was by limiting the amount of time our warehouse was active (compute costs) by controlling the amount of time we were processing data.
- The platform was easy to work with and allowed us to move fast. While our team was lucky enough to have a data engineer to assist with deployment, it was not totally necessary because Snowflake’s support and documentation made the setup process very manageable. To this day, I find myself utilizing more and more of Snowflake’s functionality that I would typically consider to be more DBA activities. An example of this functionality is Snowflake tasks. Powerful stuff.
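To show why limiting active time matters for the compute bill, here is a back-of-the-envelope sketch. It assumes compute is billed in credits for the hours a warehouse is running. The credit rate and the price per credit are placeholders chosen for illustration, not quotes from Snowflake’s price list or our own contract.

```python
def monthly_compute_cost(active_hours_per_day, credits_per_hour, price_per_credit, days=30):
    """Estimate monthly compute spend from the time a warehouse is active."""
    return active_hours_per_day * credits_per_hour * price_per_credit * days

# Placeholder rates for illustration only.
CREDITS_PER_HOUR = 1
PRICE_PER_CREDIT = 3.0

always_on = monthly_compute_cost(24, CREDITS_PER_HOUR, PRICE_PER_CREDIT)
batch_only = monthly_compute_cost(2, CREDITS_PER_HOUR, PRICE_PER_CREDIT)

print(f"always on:           ${always_on:,.2f} per month")
print(f"two hours of batch:  ${batch_only:,.2f} per month")
```

Whatever the real rates are, the cost scales linearly with active hours, which is why controlling when data is processed was our main lever.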
Key questions to ask when selecting a data warehouse:
- Does this solution scale with the company from a cost perspective? If I forecast my data volume over the next two years, will my finance team approve the future budget?
- Does this data warehouse support the data types that I will be needing to store going forward?
- Is this data warehouse compliant with the security needs of my business based on the type of data that we will be storing?
- Who will be responsible for maintaining the data warehouse? Do I have the right technical expertise to manage the platform?
Since our BI tool is the main access point into the data warehouse for our company, we wanted to pick a tool that offered a useful reporting interface and allowed for data exploration.
There are a ton of different BI tools out there, but we believe we’ve found a product that has delivered on a lot of the functionality that we were interested in. Here’s why:
- Quick and clear data exploration. The quick rendering of visualizations based on SQL has made it very easy to explore datasets and perform ad hoc analysis.
- Report and dashboard partitioning. Role-based access control and group-based logic have been convenient for our team while we have been setting up new users within the tool to ensure they are seeing data related to their department.
- Advanced analytics. Our tool supports in-app R and Python functionality, which has been helpful for when we need to jump into statistical analysis.
Our current BI platform has some shortcomings, but it has evolved to fit our needs nicely during this phase of our company’s lifecycle. We wanted an infrastructure that we could move fast with, and our BI tool has allowed us to do just that.
There’s more work to be done around some of the ways we’re using our BI platform, but we’re happy with what we’ve built and we’re constantly thinking of new, sustainable ways to better utilize the data.
Key questions to ask when selecting a BI tool:
- Does this tool support the service model that our organization wants to leverage? (For example, if we want the business to be able to self-serve their own data, I want to make sure the tool’s self-service functionality fits our needs.)
- How easy is it to maintain reporting and develop new insights within the tool?
- How robust is the alerting and notification layer of the tool?
- Does this tool support the level of advanced analytics that our data team will be working on? (For example, does the platform support R and Python?)
What we think went right
We know that there are many areas of our infrastructure that can and will be improved upon as we evolve, but we thought it was worth highlighting at least some of the overall strategies that allowed us to scale as quickly and efficiently as we did:
- Task an experienced data analyst, data scientist or data engineer to take the lead on developing the data model so that it’s being built correctly as early as possible. Be careful not to have the data model be built by any one specific department so that the framework isn’t biased towards the operations of that particular team. At JumpCloud, our data team sits independently but interfaces quite frequently with almost every department and I believe that this has allowed us to take unbiased approaches when we make decisions on priorities and overall architecture.
- Involve your (data) engineering team early. This ensures that a future integration with the product data will go smoothly, whenever that time comes. If you don’t yet have data engineers, then collaborate with DevOps or the principal engineers that work with the product. You are just ensuring that the various systems will play nice together in the future state.
- Create a data model that allows the business to push insights back to the source systems. Don’t have your insights end at your BI tool—enable your data warehouse to send processed and enriched data back to your fellow employees so that they can make data-informed decisions without breaking their workflow. Think custom objects within Salesforce.
- Start capturing and storing your data as early as possible. Data storage is cheap within many of today’s data warehouse solutions, so it’s in your best interest to start now, because there will come a time when you wish you had those historical data points.
- Build a data-driven culture that encourages curiosity. This one is more for the CEOs and CFOs in the crowd—it’s something that has to be done early on and at a high level. I have been fortunate enough to work with leaders who not only understand the value in having access to sound data, but who also encourage curiosity and thinking big. This mentality will allow the team to take calculated risks that can have meaningful impact.
The theme that ties these ideas together is a supportive culture that fosters data-driven approaches and decision-making.
We’ve found it’s helpful to start with the quantitative insights and let them inform the qualitative discussions that follow.
A good data point will ultimately provoke 10 more questions—especially within business models that can attract unique customer journeys. Take JumpCloud, for example. While our core product is access control and device management, we have a lot of different customers who are trying to solve various pain points, each one slightly different than the next. Sure, there are lots of common themes between their use cases, but the journey they take and the way that they use the product is almost always unique. It’s these unique customer interactions that make the data analysis that much more interesting and complex.
We’ve only skimmed the surface of discovering the insights that lie within our data. There are so many insights, processes and models we’ve yet to develop, but we continue to have active conversations about how we want to guide that strategy.
A few topics on our minds:
- Building a more robust layer of alerting and notifications within the data to inform key stakeholders around key events. This can add value for just about every part of the business. Lots of potential applications.
- Further developing our predictive analytics and machine learning capabilities to help us understand our customers and, specifically, what is driving their behaviors.
- Making the data more accessible and easier to self-serve for our internal customers so that they can make decisions faster. This can be tricky, especially without a data catalogue or data dictionary that is integrated within the BI tool.
Building the right data analytics infrastructure is truly a mixture of art and science—it really all depends on the business model and the values of the team you’re working with. The constant innovation in practices and frameworks makes this an exciting time to be involved in the data analytics space of a product led growth company.
While there’s no “right way” to build your analytics infrastructure, know that the tools have come a long way in recent years and there are ways to move fast and efficiently while building a model that’s able to consistently deliver value.