Thursday, November 15, 2012

Why Sears Is Going All-In On Hadoop

Why Sears Is Going All-In On Hadoop is an interesting, if ‘rose-coloured’, view of Hadoop from Phil Shelley, CTO at Sears. Note that he also leads a Sears subsidiary called MetaScale, which offers Big Data architecture, consulting and services to companies outside the retail space.

A few choice quotes:
  • Moving up the stack, Sears is consolidating its databases to MySQL, InfoBright, and Teradata; EMC Greenplum, Microsoft SQL Server, and Oracle (including four Exadata boxes) are on their way out, Shelley says.
  • "The Holy Grail in data warehousing has always been to have all your data in one place so you can do big models on large data sets, but that hasn't been feasible either economically or in terms of technical capabilities," Shelley says, noting that Sears previously kept data anywhere from 90 days to two years. "With Hadoop we can keep everything, which is crucial because we don't want to archive or delete meaningful data."
  • "ETL is an antiquated technique, and for large companies it's inefficient and wasteful because you create multiple copies of data," he says. "Everybody used ETL because they couldn't put everything in one place, but that has changed with Hadoop, and now we copy data, as a matter of principle, only when we absolutely have to copy."
  • Shelley sees Hadoop as part of a larger IT ecosystem, too, and says systems such as Teradata will continue to have an important, focused role at Sears. But he's on the far end of the spectrum in terms of how much of the legacy environment Hadoop might replace. Countering Shelley's sometimes sweeping predictions of legacy system replacement, Mike Olson, CEO of Cloudera, says: "It's unlikely that a brand-new entrant to the market [like Hadoop] is going to displace tools for established workloads."
  • MetaScale also offers data architecture, modeling, and management services and consulting. The big idea behind Hadoop is to bring in as much data as possible while keeping data structures simple. "People want to overcomplicate things by representing data and dividing things up into separate files," says Scott LaCosse, director of data management at Sears and MetaScale. "The object is not to save space, it's to eliminate joins, denormalize the data, and put it all in one big file where you can analyze it." It's an approach that's counterintuitive for a SQL veteran, so a big part of MetaScale's work is to help customers change their thinking: You apply schema as you pull data out to use it, rather than take the relational database approach of imposing a schema on data before it's loaded onto the platform. Hadoop holds data in its raw form, giving users the flexibility to combine and examine the data in many ways over time.
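
To make the last quote a little more concrete, here is a minimal sketch of the "apply schema as you pull data out" idea in plain Python. It is not Hadoop code and not how Sears or MetaScale do it; the sample records, the field names and the helpers `read_with_schema` and `revenue_by` are all made up for illustration. The raw, denormalized lines are kept exactly as they arrived, and names and types are only attached at read time.

```python
import csv
import io

# Hypothetical raw, denormalized sales records, stored exactly as they arrived.
# Nothing about field names or types is recorded with the data itself.
RAW = """\
2012-11-01|store_17|Chicago|drill|2|49.99
2012-11-01|store_17|Chicago|hammer|1|12.50
2012-11-02|store_42|Dallas|drill|1|49.99
"""

# The schema lives with the reader, not with the stored data (an assumption
# made for this sketch): a list of (field name, type conversion) pairs.
SALES_SCHEMA = [
    ("date", str), ("store", str), ("city", str),
    ("product", str), ("qty", int), ("price", float),
]


def read_with_schema(raw, schema):
    """Yield one dict per raw line, naming and typing fields only at read time."""
    for row in csv.reader(io.StringIO(raw), delimiter="|"):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


def revenue_by(raw, key):
    """Total qty * price grouped by any field, with no joins needed."""
    totals = {}
    for rec in read_with_schema(raw, SALES_SCHEMA):
        totals[rec[key]] = totals.get(rec[key], 0.0) + rec["qty"] * rec["price"]
    return totals


# The same raw data answers two different questions without being reloaded.
revenue_by_city = revenue_by(RAW, "city")
revenue_by_product = revenue_by(RAW, "product")
```

Because the stored lines never change, a different question later on only needs a different schema or grouping key, not a reload of the data.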

Big Data – The Reality beyond the Hype

I was having a coffee with a friend last week and the conversation turned to the latest trends in technology – as it often does.  His view was that ‘Big Data’ was just another in a long line of over-hyped technologies, aimed more at selling the shiniest new product than solving some real-world problem.

I think that the Big Data term is really a shorthand way of describing the escalating amount of data being generated by the actions of people and their devices as they interact with each other and the world at large. Every time we use a website, smartphone or other electronic service, data is created and collected – to understand our behaviour, predict what we’d like to buy or where we’ll go, or perhaps show a relevant advertisement.

An even larger amount of data is beginning to be created by the ‘internet of things’ – a term used to describe the invisible devices and sensors all around us, in our vehicles and transport systems, communications and power grids, that collect and report on the health of these environments. For example, engines in the latest commercial aircraft capture a large volume of performance data and report any abnormal operation in real time via satellite links. Current car models can already report back if they are involved in a crash or require roadside assistance; collecting engine and performance data can’t be too far in the future.

As the cost of collecting and storing this data continues to drop, it doesn’t take too much imagination to see the value in being able to analyse more fine-grained data on power consumption, real-time traffic, when and where we buy products, or whatever we can imagine being sensed and measured.  Having this newly available data can lead to discovery of previously unknown patterns of behaviour or relationships – telling us about a new artist, restaurant or author, a nearby bargain or a group who share our passion.

So, even if you think Big Data is just an over-hyped buzzword, a tremendous and ever-growing variety and volume of data is being created by our use of websites, devices and sensors. I don’t think this trend is likely to slow down in the foreseeable future, as ever more of our interactions move into the digital realm.

The era of Big Data is with us, no matter what we call it.