The Challenge of Digital – Big Data and More

This week’s edition of Webinarmaggedon hits Thursday when I’ll be co-presenting with Krishnan Parasuraman, CTO of IBM Big Data Solutions, on Choosing a Digital Big-Data Technology Stack. With so many webinars in a row, I feel like I’m not able to do full justice to any of them, but if you’re looking at a big data project in 2013, you really should attend this one. As with all of our webinars, it’s heavy on content not sales. And Krishnan and I have structured this one to be unusually conversational – my favorite format.

I’ve built a deck based on our recent Whitepaper and it covers the core concepts of the paper: why “big data” is real (as opposed to just marketing hype) and how it’s about more than having lots of rows; some of the unique challenges of digital data at the detail level – particularly around joins; and how different problems in digital marketing encapsulate different aspects of those challenges. Put these things together, and you have a really nice framework for thinking about how to evaluate a particular technology stack given your digital marketing priorities.

Instead of a straightforward walk-through, however, Krishnan and I have broken up the deck with a number of “conversation points” where we plan to kick around the core concepts.

I wrote an overview of the Whitepaper a couple of weeks back, but today I wanted to deep-dive into one particular portion of the broader topic: the shift to detail level data for analysis and how this is fundamental to our concept of big-data.

Getting Detailed

A great deal of the analysis we do (and by “we” I mean all of us in the digital measurement community) is at a level that doesn’t require detail data. Forecasts, for example, are almost never created from aggregations of individual level predictions. Instead, we use data aggregated up to some higher level of the digital system: things like sources or campaigns. In Semphonic’s traditional site analytics practice, techniques like Functionalism or Use Case Analytics were developed specifically to work at the level of available aggregation – visits and pages. When we analyze campaigns we look at the pages viewed, the time on site, measures of engagement, conversions and more, but all are aggregated. It isn’t necessary to know which visitors had specific combinations of any of these measures.

It’s a big advantage when an analysis doesn’t require any detail data. The data can be pre-aggregated once into cubes at the level of the various dimensions we want to study and then used for a wide variety of reporting and analysis purposes.
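To make the one-time aggregation idea concrete, here is a minimal Python sketch. The records, field names, and the `build_cube` helper are all invented for illustration; a real Web analytics system does the equivalent at vastly greater scale and across many dimensions at once.

```python
from collections import defaultdict

# Hypothetical detail records, one per visit (names and values are invented).
visits = [
    {"visitor": "v1", "campaign": "email", "pages": 5, "converted": True},
    {"visitor": "v2", "campaign": "email", "pages": 2, "converted": False},
    {"visitor": "v1", "campaign": "search", "pages": 3, "converted": False},
    {"visitor": "v3", "campaign": "search", "pages": 8, "converted": True},
]


def build_cube(rows, dimension):
    """Aggregate detail rows once into totals keyed by one dimension."""
    cube = defaultdict(lambda: {"visits": 0, "pages": 0, "conversions": 0})
    for row in rows:
        cell = cube[row[dimension]]
        cell["visits"] += 1
        cell["pages"] += row["pages"]
        cell["conversions"] += int(row["converted"])
    return dict(cube)


campaign_cube = build_cube(visits, "campaign")

# Reports read from the cube; the detail rows are not needed again.
for campaign, cell in sorted(campaign_cube.items()):
    rate = cell["conversions"] / cell["visits"]
    print(campaign, cell["visits"], cell["pages"], f"{rate:.0%}")
```

Notice that the cube no longer records which visitor did what. That is fine for campaign reporting, and it is exactly the information the detail-level questions later in this post depend on.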

When you’re aggregating data, it doesn’t really matter how much data you need to process – you’re not working with “big” data. Companies like Adobe or Webtrends process ENORMOUS amounts of data as they aggregate it into Web analytics systems. But nobody would think of SiteCatalyst as a “big data” tool – and I think that’s correct.

On the other hand, some types of questions and some types of analytic methods absolutely require full detail data. Whenever I want to say anything about a specific individual (or anything that’s beneath a logical dimension), the drive down to the detail level is essential.

  • If I want to classify visitors into segments, I need detail data.
  • If I want to identify visitors who are increasing or declining in engagement, I need detail data.
  • If I want to target visitors based on share of interest, I need detail data.
  • If I want to understand campaign attribution by incremental lift, I need detail data.
  • If I want to build a behavior site topology, I need detail data.
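As a rough illustration of the second bullet, the Python sketch below uses invented data in which the weekly aggregate is perfectly flat while one visitor is steadily disengaging and another is steadily ramping up. The data, the function names, and the definition of "declining" (fewer pages in every successive week) are assumptions made for the example, not a recommended engagement measure.

```python
from collections import defaultdict

# Hypothetical detail rows: (visitor, week, pages_viewed)
detail = [
    ("v1", 1, 10), ("v1", 2, 6), ("v1", 3, 2),
    ("v2", 1, 2), ("v2", 2, 6), ("v2", 3, 10),
]


def weekly_totals(rows):
    """The aggregate view: total pages per week across all visitors."""
    totals = defaultdict(int)
    for _visitor, week, pages in rows:
        totals[week] += pages
    return dict(totals)


def declining_visitors(rows):
    """The detail view: visitors whose pages fell in every successive week."""
    history = defaultdict(list)
    for visitor, _week, pages in sorted(rows, key=lambda r: (r[0], r[1])):
        history[visitor].append(pages)
    return sorted(
        visitor
        for visitor, pages in history.items()
        if len(pages) > 1 and all(a > b for a, b in zip(pages, pages[1:]))
    )


print(weekly_totals(detail))       # every week totals 12 pages
print(declining_visitors(detail))  # only v1 is declining
```

No amount of slicing the weekly totals will recover the per-visitor trend; it only exists in the detail.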

In general, for any type of targeted marketing, site personalization, customer segmentation, or customer modeling, detail data is essential. It’s precisely because these types of uses of the data have begun to emerge that digital measurement has been transitioning to “big data.”

It isn’t the number of rows that’s decisive – it’s the need to use those rows at the detail level. There are Semphonic clients that generate staggering amounts of Web analytics data that’s handled perfectly well in SiteCatalyst or Webtrends. Why? Because with a single, one-time aggregation of the data, it’s possible for Web analytics vendors to create aggregations that serve a wide variety of analytic needs. After that one aggregation pass, there’s no need to ever touch the detail again.

On the other hand, you might only have ten thousand records in your database, but if you need to build a customer segmentation or do true campaign attribution, it’s impossible in a Web analytics solution.

One of the penalties of Web analytics tools is that, because they have to handle EVERY type of client, even clients with small or moderate amounts of data are trapped into using higher levels of aggregation. It’s a somewhat ironic paradox, in my view, that the enterprises that will have the easiest time effectively deploying big data systems don’t necessarily have huge amounts of data! They can go to systems that will quite easily give them high-performance access to detail-level data. The larger your detail data volumes are, the harder it is to easily surpass the accomplishments of the Web analytics vendors.

The Implications of Detail

Unless your data volumes are really small, the most obvious impact of dropping your level of analysis is exactly what you’d expect: more data. Sure, it’s not impossible for a detail data file to be smaller than an aggregation, but it IS quite unusual. So detail-level analysis really does force you to process more data than before; often it will force you to process orders of magnitude more data than you did with a good aggregation. That’s one of the main reasons it’s “big data”.

More data isn’t the only impact, though. In fact, many types of common data exploration also get sacrificed. At the detail level, human exploration of the data and visualization techniques to facilitate it are largely useless. In addition, the statistical analysis techniques employed are nearly always more complex and more processor intensive than aggregate analytic techniques.

This has created a strong drive to machine-learning techniques that come along with “big data”. There’s nothing necessary about this – it’s perfectly possible to do advanced analysis and modeling on big data without machine learning techniques. But there’s a reason these two often get associated.

There’s a third impact to detail level of analysis that is trickier to explain and is, if not unique to digital, at least uniquely important in digital.

The vast majority of analytic techniques are designed to handle situations where the unit of analysis (the visitor, the page, the campaign, etc.) is in a single row with all the characteristics of that particular element represented as the column values for that row.

When, for example, we build a digital segmentation, we create a set of visit-level rows with each row having a set of columns that describe the visit behavior. We do this, naturally enough, from the detail data because it’s a completely custom aggregation for the analysis (and it produces a very low-level aggregation). We have to put the data together in that format because that’s the way cluster analysis tools expect the data to be presented. But it means we have to do the very tricky work of pre-creating the aggregations without sacrificing the meaning in the data.
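Here is a minimal Python sketch of that reshaping step: event-level detail goes in, one row per visit comes out, with columns a clustering tool could consume. The event layout, the page types, and the chosen columns are all hypothetical; picking columns that actually preserve meaning is the hard part, and this sketch doesn’t attempt to solve it.

```python
from collections import defaultdict

# Hypothetical event-level detail: (visit_id, seconds_into_visit, page_type)
events = [
    ("a", 0, "home"), ("a", 30, "search"), ("a", 75, "product"), ("a", 140, "cart"),
    ("b", 0, "article"), ("b", 200, "article"),
]

PAGE_TYPES = ["home", "search", "product", "cart", "article"]


def visit_rows(event_rows):
    """Collapse each visit's event stream into a single row of columns."""
    by_visit = defaultdict(list)
    for visit_id, seconds, page_type in event_rows:
        by_visit[visit_id].append((seconds, page_type))

    rows = []
    for visit_id in sorted(by_visit):
        stream = sorted(by_visit[visit_id])  # order events by time
        row = {
            "visit_id": visit_id,
            "page_views": len(stream),
            "duration": stream[-1][0] - stream[0][0],
        }
        # One count column per page type.
        for page_type in PAGE_TYPES:
            row["n_" + page_type] = sum(1 for _, p in stream if p == page_type)
        rows.append(row)
    return rows


for row in visit_rows(events):
    print(row)
```

The count columns throw away the order in which the pages were viewed. If order matters to the segmentation, it has to be captured in additional columns before the rows are handed to the clustering tool.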

There are analytics techniques that wouldn’t require this type of approach, but they also have challenges since we typically find that our pre-aggregation is necessary to create interesting levels of meta-data in the analysis.

So I’m not sure we’ve found the ideal approach. But whether you’re building aggregates or meta-data to support analysis, we’ve found that creating a digital big data analysis almost always involves significant algorithmic/procedural access to the data. Sometimes you can create clever SQL with constructs like partition and rank, sometimes you have non-standard SQL constructs like nPath to help, and sometimes you need to have a good old fashioned Java or C++ programmer handy to get the job done.
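For a flavor of the “clever SQL” option, here is a small sketch that uses a window function (RANK with PARTITION BY) to pull the last page each visitor saw. It runs against an in-memory SQLite database from Python, purely so the example is self-contained; it assumes a SQLite build of 3.25 or later (the first with window function support), and the table, columns, and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (visitor TEXT, ts INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("v1", 100, "home"), ("v1", 160, "product"), ("v1", 220, "cart"),
        ("v2", 50, "search"), ("v2", 90, "product"),
    ],
)

# Rank each visitor's events from newest to oldest, then keep rank 1.
query = """
    SELECT visitor, page
    FROM (
        SELECT visitor, page,
               RANK() OVER (PARTITION BY visitor ORDER BY ts DESC) AS recency
        FROM events
    ) AS ranked
    WHERE recency = 1
    ORDER BY visitor
"""
last_pages = conn.execute(query).fetchall()
print(last_pages)
```

The same pattern (partition by visitor or visit, order by time, filter on rank) covers first-touch, last-touch, and “Nth page” style questions without leaving SQL. Once the question involves arbitrary sequences of events, though, it tends to get unwieldy, which is where path-oriented constructs or procedural code come in.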

Regardless, the degree to which digital big data demands programmatic access to the data is probably the single biggest change (and maybe the biggest headache) facing any enterprise embarking on a digital big-data effort.

I often describe this last problem as arising because the meaning in digital data doesn’t exist at the row level but at the “stream” level – the path or combination of multiple events. This “stream” level of meaning also presents unique challenges for join strategies and data integration. That’s a complex and rich topic of its own and one I hope to discuss in some detail on the webinar. Suffice to say that it makes creating that elusive 360 degree customer view far harder than people think!
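A tiny Python sketch of what “stream level” means in practice: the two hypothetical visits below contain exactly the same pages, so any row-level or count-based summary treats them as identical, yet only one of them follows a search-to-product-to-cart path. The `follows_path` helper is an invented illustration of ordered matching with gaps allowed; it is not the syntax or semantics of nPath or any other product.

```python
def follows_path(stream, pattern):
    """Return True if the pages in pattern occur in stream in that order.

    Other pages may appear in between; only the relative order matters.
    """
    position = 0
    for page in stream:
        if position < len(pattern) and page == pattern[position]:
            position += 1
    return position == len(pattern)


# Two hypothetical visits with the same pages in a different order.
visit_a = ["search", "product", "cart"]
visit_b = ["cart", "product", "search"]
funnel = ["search", "product", "cart"]

print(sorted(visit_a) == sorted(visit_b))  # same pages in both visits
print(follows_path(visit_a, funnel))       # visit_a follows the funnel
print(follows_path(visit_b, funnel))       # visit_b does not
```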

Join Krishnan and me this Thursday to hear more!

Webinarmaggedon Update

A while back I posted on the "Perfect Storm" of Whitepapers Semphonic is releasing. Over last month and this, that’s resulted in a Perfect Storm of Webinars. So I’ve just put together a handy little list with links to everything…enjoy!

            Whitepaper(s) to follow…

            Download the Accompanying Whitepaper

And while the webinar may have just passed when I created this list, you can still read the very recent Whitepaper on Digital Merchandising for Multi-Product (List and Aisle) Pages with Cloudmeter.

Categorized as Database

By Gary Angel

Gary Angel is the author of the "SEMAngel blog - Web Analytics and Search Engine Marketing practices and perspectives from a 10-year experienced guru."
