Taming the Elephant….Building a Big Data & Analytics Practice – Part I

A couple of decades ago, the data and information management landscape was significantly different. Though the core concepts of Analytics have not, in a large sense, changed dramatically, adoption and the ease of analytical model development have undergone a paradigm shift in recent years. Traditional Analytics adoption has grown exponentially, and Big Data Analytics needs additional and newer skills.

The evolution of database management systems…..

For further elaboration, we need to go back in time and look at the journey of data. Before 1950, most data and information was stored in file-based systems (after the invention and use of punched cards earlier). In the 1960s, Database Management Systems (DBMS) became a reality with the introduction of hierarchical database systems like IBM Information Management System (IMS) and thereafter network database systems like Raima Database Manager (RDM). Then came Dr. Codd's Relational Model and Normal Forms. Small-scale relational databases (mostly single user initially) like dBase, Access and FoxPro started gaining popularity.

Maturing into Relational Database Mgmt. Systems and beyond….

With System R from IBM (the research prototype that introduced the widely used Structured Query Language and from which IBM DB2 was later created) and the ACID (Atomicity, Consistency, Isolation, Durability) compliant Ingres database getting released, commercialization of multi-user RDBMS became a reality, with Oracle and Sybase (since acquired by SAP) databases coming into use in the following years. Microsoft licensed Sybase on OS/2 as SQL Server and later split with Sybase to continue on the Windows platform. The open source movement, however, continued with PostgreSQL (an Object-Relational DBMS) and MySQL (since acquired by Oracle) being released around the mid-1990s. For over two decades, RDBMS and SQL grew to become the standard for enterprises to store and manage their data.

The coming of Data Warehouses….

From the 1980s, Data Warehousing systems started to evolve to store historical information and to separate the overhead of Reporting and MIS from OLTP systems. With Bill Inmon's CIF model and later Ralph Kimball's popular OLAP-supporting Dimensional Model (denormalized Star and Snowflake schemas) gaining popularity, metadata-driven ETL and Business Intelligence tools started gaining traction, while database vendors' product strategy promoted the then lesser used ELT approach and other in-database capabilities such as the in-database data mining released in Oracle 10g. For DWBI products and solutions, storing and managing metadata in the most efficient manner proved to be the differentiator. Data Modeling Tools started to gain importance beyond desktop and web application development. Business Rules Management technologies like ILOG JRules, FICO Blaze Advisor or Pega started to integrate with DWBI applications.
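To make the Dimensional Model concrete, here is a minimal sketch of a Star schema: one fact table of measures surrounded by dimension tables of descriptive attributes. The table names, columns and sample rows (fact_sales, dim_product, dim_date) are hypothetical and chosen only for illustration; SQLite is used simply because it ships with Python, not because it is a warehouse platform.

```python
import sqlite3

# In-memory database; all table and column names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds the measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        amount      REAL
    );
""")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Antivirus", "Software")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20140101, 2014, 1), (20140201, 2014, 2)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 20140101, 900.0), (2, 20140101, 25.0), (3, 20140201, 40.0), (1, 20140201, 950.0)],
)


def sales_by_category_and_year(connection):
    """Typical star join: slice the fact table by attributes of two dimensions."""
    return connection.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d    ON d.date_key = f.date_key
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()


result = sales_by_category_and_year(conn)
print(result)
```

Every reporting question becomes a join from the central fact table out to the dimensions it is sliced by, which is what made this layout a natural fit for OLAP and BI tools.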

The missing pieces….Data Quality…..MDM and more…

Once Data Warehouses started maturing, the need for Data Quality initiatives started to rise. Most Data Warehousing development cycles used only a subset of the production data (at times obfuscated / masked), so even if the implementation approach included Data Cleansing and Standardization, core DQ issues would start to emerge after the production release, at times even rendering the warehouse unusable till the DQ issues were resolved.

Multi-domain Master Data Management (both Operational and Analytical) / Data Governance projects started to grow in demand once organizations started to view Data as an Enterprise Asset, enabling a single version of truth to help increase business efficiency and supporting both internal and at times external data monetization. OLAP integrated with BI to provide ad-hoc reporting, besides being popular for what-if modeling and analysis in EPM / CPM implementations (Cognos TM1, Hyperion Essbase, etc.).

Traditional Analytics ……not Business Intelligence !

Analytics was primarily implemented by practitioners using SAS (1976) and SPSS (1968) for Descriptive and Predictive Analytics in a production environment, and ILOG (1987) CPLEX and ARENA (2000) for Prescriptive Modeling including Optimization and Simulation. While SAS had programming components within Base SAS, SAS/STAT and SAS/GRAPH, the strategy evolved to move SAS towards a UI-based modeling platform with the launch of Enterprise Miner and Enterprise Guide. These products were similar to SPSS Statistics and Clementine (later PASW Modeler, now IBM SPSS Modeler), which were essentially UI-based drag-drop-configure analytics model development software for practitioners usually having a background in Mathematics, Statistics, Economics, Operations Research, Marketing Research or Business Management. Models used representative sample data and a reduced set of factors / attributes, and hence performance was not an issue till then.

Around the middle of the last decade, anyone with knowledge and experience of Oracle, ERwin, Informatica and MicroStrategy, or competing technologies, could play the role of a DWBI Technology Lead, or even an Information Architect given additional exposure to and experience in designing non-functional DW requirements including scalability, best practices, security, etc.

The growing data……

Soon, enterprise data warehouses, now needing to store years of data, often without an archival strategy, started to grow exponentially in size. Even with optimized databases and queries, there was a drop in performance. Then came Appliances, or balanced / optimized data warehouses. These were optimized database software, often coupled with the operating system and custom hardware. Most appliances supported only vertical scaling, but the benefits they brought were rapid accessibility, rapid deployment, high availability, fault tolerance and security.

The next big thing….Appliances

Appliances thus became the next big thing, with Agile Data Warehouse migration projects being undertaken to move from RDBMS like Oracle, DB2 and SQL Server to query-optimized DW Appliances like Teradata, Netezza, Greenplum, etc., which incorporated capabilities like data compression and massively parallel processing (shared-nothing architecture), apart from other features. HP Vertica, which took the appliance route initially, later reverted to become a software-only solution.

An interesting thing to note here is that many of these appliances were built on PostgreSQL. A reference link on the same: https://wiki.postgresql.org/wiki/PostgreSQL_derived_databases

Initially, Parallel Processing had 3 basic architectures: MPP, SMP and NUMA. MPP stands for Massively Parallel Processing, and is the most commonly implemented architecture for query-intensive systems. SMP stands for Symmetric Multiprocessing and has a Shared Everything (including shared disk) architecture, while NUMA stands for Non-Uniform Memory Access, which is essentially a combination of SMP and MPP. Over a period of time, the architecture definitions became more amorphous as products kept improving their offerings.
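The shared-nothing idea behind MPP can be sketched in a few lines: rows are hash-distributed across nodes, each node aggregates only its own slice, and a coordinator merges the partial results. This is a toy, single-process illustration under assumed names (partition_rows, local_aggregate, merge_partials) and made-up sample data, not how any particular appliance implements it.

```python
import zlib
from collections import defaultdict


def partition_rows(rows, num_nodes):
    """Distribute (key, value) rows across nodes by hashing the key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        # crc32 gives a hash that is stable across runs, unlike Python's built-in hash().
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes


def local_aggregate(node_rows):
    """Work done independently on one node: sum values per key, using only local data."""
    totals = defaultdict(int)
    for key, value in node_rows:
        totals[key] += value
    return dict(totals)


def merge_partials(partials):
    """Coordinator step: combine the per-node partial aggregates."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


# Hypothetical sales rows: (region, amount).
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
nodes = partition_rows(rows, num_nodes=3)
totals = merge_partials(local_aggregate(node_rows) for node_rows in nodes)
print(totals)
```

Because no node needs another node's memory or disk to do its share of the work, capacity can in principle be added by adding nodes, which is the contrast with the Shared Everything SMP design.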

Getting Packaged…….

While Industry and Cross-Industry packaged DWBI & Analytics Solutions increasingly became a Product and SI / Solution Partner strategy, the end of the last decade started to see increasing adoption of Open Source ETL, BI and Analytics technologies like Talend, Pentaho, R, etc. These were adopted across industries (with the only exceptions being the Pharma & Life Sciences and BFSI sectors), and in organizations where essential features and functionality were sufficient to justify the ROI on DWBI initiatives, which were usually undertaken for strategic requirements and not for day-to-day operational intelligence or for insight-driven additional or new revenue generation.

The Cloud Movement…….

Adoption of cloud-based platforms and solutions also started to grow, including DWBI and Analytics application development on private or public cloud platforms like Amazon, Azure, etc. (IBM has now come out with BlueMix and DashDB as an alternative). This was part of either a start-up strategy or a cost optimization initiative in Small and Medium Businesses, and in some large enterprises an exploratory initiative, given sufficient confidence in data security.

Business Intelligence / Reporting / Visualization…….

Visualization Software also started to emerge and carve a niche, growing in relevance mostly as a complementary solution to the IT-dependent Enterprise Reporting Platforms. The Visualization products were business driven, unlike the technology-forward enterprise BI platforms, which could also provide self-service, mobile dashboards, write-back, collaboration, etc. but had multiple components with at times complex integration and pricing.

Hence while traditional enterprise BI platforms had a data driven “Bottom Up” product strategy, with dependence and control with the IT team, Visualization Software took a business driven “Top Down” Product Strategy, empowering business users to analyze data on their own and create their own dashboards with minimal or no support from the IT department.

With capabilities like geospatial visualization, in-memory analytics, data blending, etc., visualization software like Tableau is steadily gaining acceptance. Others, like TIBCO Spotfire and in recent years SAS Visual Analytics, have blended Visualization with out-of-the-box Analytics, a capability that is otherwise achieved in Visualization tools mostly by integrating with R.


All of the above was manageable with reasonable flexibility and continuity as long as data was more or less structured, ECM tools took care of documents, and EAI technologies were used mostly for real-time integration and complex event processing between Applications / Transactional Systems.

But a few years ago, Digital platforms including Social, Mobile and others like IoT / M2M started to grow in relevance, and Big Data Analytics grew beyond being POCs undertaken as an experiment: first complementing an enterprise data warehouse (along with enterprise search capabilities), and at times even replacing it. The data explosion gave rise to the 3 V dilemma of velocity, volume and variety. Data was now available in all possible forms and in newer formats like JSON, BSON, etc., which had to be stored and transformed in real time.

Analytics now had to be done over millions of records in motion, unlike the traditional end-of-day analytics over data at rest. Business Intelligence, including Reporting and Monitoring, Alerts and even Visualization, had to become real-time. Even the consumption of analytics models now needed to be real-time, as in the case of customer recommendations and personalization, which try to leverage the smallest windows of opportunity to up-sell / cross-sell to customers.
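The difference between data at rest and data in motion can be illustrated with a small sliding-window aggregate that is updated as each event arrives, instead of being recomputed over a stored table at end of day. The class name SlidingWindowTotal, the 60-second window and the sample events are assumptions made for this sketch, and it presumes events arrive in timestamp order; real streaming engines also handle late and out-of-order data.

```python
from collections import deque


class SlidingWindowTotal:
    """Running total of event amounts seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, amount):
        """Ingest one event and return the up-to-date windowed total."""
        self.events.append((timestamp, amount))
        self.total += amount
        self._evict(timestamp)
        return self.total

    def _evict(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_amount = self.events.popleft()
            self.total -= old_amount


# Hypothetical purchase events: (seconds since start, amount).
stream = [(0, 10.0), (20, 5.0), (50, 2.0), (70, 1.0)]
window = SlidingWindowTotal(window_seconds=60)
running = [window.add(ts, amount) for ts, amount in stream]
print(running)
```

Each event is processed once on arrival and the answer is available immediately, which is the property that real-time alerts and recommendations depend on.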

It is Artificial Intelligence systems, powered by Big Data, that are set to become the game changer in the near future, and it is Google, IBM and others like Honda who are leading the way in this direction.

Continued in the second part……Big Data Analytics Practice – Part 2
