This blog is published by Sreejit Menon
The key aspect covered in the first part was the journey of data to Big
This second part covers the contemporary Information Management and Analytics landscape that has taken shape over the last few years. In that time Big Data Analytics has moved on from experimental Proof of Concept engagements to truly complementing traditional DWBI & Analytics systems, at times replacing them completely, and has become the bedrock for building Artificial Intelligence / Cognitive Computing platforms.
It is, however, critical to establish a perspective on Big Data & Data Science and why they will become more important in the coming years.
The following events took place from around the beginning of the 21st century.
Internet, Web and Social Media explosion….
* The above are only some of the key web and social websites and statistics about their data.
The Smartphone Revolution…..
A few other digital milestones…..
Other growing data sources….
And then the Internet Of Things (IOT) / Machine 2 Machine…
It’s important to define what truly is a Big Data Analytics scenario. There have been quite a few points of view, but most organizations see Big Data primarily as a data volume challenge, and volume alone is not the best use case for implementing Big Data.
In general, Big Data makes sense for scenarios where the solution needs to address all three Vs, namely Volume, Velocity and Variety.
If the challenge is just around volumes, databases like SQL Server, Oracle, etc. can be migrated to appliances like IBM Netezza, SAP HANA, EMC Greenplum, etc. Teradata DW Appliance can store over 54 PB of data and Teradata Active Enterprise Data Warehouse over 94 PB of data. Most large and mid-sized retailers were primarily using Teradata in combination with MicroStrategy and SAS for DWBI & Analytics.
But appliances are expensive, so the alternative could be to move to a powerful object-relational database system like PostgreSQL, which supports unlimited database size, with limits of 32 TB per table, 1.6 TB per row and 1 GB per field, and unlimited indexes per table. This would, however, need a good amount of technical expertise.
If it’s about velocity, Enterprise Application Integration (EAI) solutions that work in real time / near real time have been available for a few decades, the leaders in this segment being TIBCO, IBM, Oracle, etc. Event series analysis, event correlation and complex event processing engines like Esper are a popular open source alternative.
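The event correlation idea mentioned above can be illustrated with a minimal sketch in plain Python. It is not Esper and does not use any Esper API; the function name, the threshold and the window length are all assumptions chosen for illustration. It flags a key (for example a user or a device) whenever a given number of its events fall inside a sliding time window.

```python
from collections import defaultdict, deque


def detect_bursts(events, threshold=3, window_seconds=60):
    """Yield (key, timestamp) whenever `threshold` events for one key
    fall within `window_seconds` of each other.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed
    to arrive in time order (a simplifying assumption of this sketch).
    """
    recent = defaultdict(deque)  # key -> timestamps still inside the window
    for ts, key in events:
        window = recent[key]
        window.append(ts)
        # Drop timestamps that have slid out of the time window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            yield key, ts
```

A real complex event processing engine adds a query language, out-of-order handling and persistence on top of this basic windowing pattern.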
In general, the following real time (or near-real time) ETL approaches can be adopted:
When it comes to variety, there are content management solutions that work with a variety of data including documents, audio, video, etc. There are solutions that can work with multi-structured (structured, semi-structured as well as unstructured) data.
Techniques like Natural Language Processing (NLP), using enterprise Text Mining software from SAS, SPSS, etc. or open source software like the Python Natural Language Toolkit (NLTK), Apache OpenNLP or Stanford CoreNLP, enable analyzing text to a large extent and are commonly used to extract features, derive sentiments, etc. IBM Watson, a cognitive platform, currently offers 15+ APIs including Alchemy, Speech to Text and vice versa, Tone Analyzer, etc. Unstructured Information Management Architecture (UIMA) is a popular framework for Content Analytics.
It is, however, in solving all three problems together (volume, velocity and variety) that Big Data Analytics platforms make the most sense. A good rule of thumb to start with is at least 1 TB of data across both transactional and digital (social + others) sources, combined with a need for real-time or near real-time analytics.
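As a rough illustration of what "extracting features" and "deriving sentiment" mean in practice, here is a toy sketch in plain Python. It does not use NLTK, OpenNLP, CoreNLP or Watson; the word lists are a hypothetical, deliberately tiny lexicon, and real systems rely on far larger curated resources or trained models.

```python
import re
from collections import Counter

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "fast", "love"}
NEGATIVE = {"bad", "slow", "poor", "hate"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(text):
    """Bag-of-words features: how often each token occurs."""
    return Counter(tokenize(text))


def sentiment_score(text):
    """Positive-word count minus negative-word count."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives
```

A score above zero suggests positive text and below zero negative text; a lexicon approach like this ignores negation and context, which is where the dedicated toolkits come in.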
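The rule of thumb above can be written down as a small check. This is only a sketch: the `Workload` fields, the function name and the binary definition of a terabyte are assumptions made for illustration, and the 1 TB threshold is simply the starting point suggested in the text.

```python
from dataclasses import dataclass

TB = 1024 ** 4  # bytes per terabyte (binary convention, an assumption)


@dataclass
class Workload:
    volume_bytes: int           # transactional + digital data combined
    needs_near_real_time: bool  # velocity requirement
    data_kinds: frozenset       # e.g. {"structured", "unstructured"}


def is_big_data_candidate(workload: Workload) -> bool:
    """True only when volume, velocity and variety all apply."""
    has_volume = workload.volume_bytes >= 1 * TB
    has_velocity = workload.needs_near_real_time
    has_variety = len(workload.data_kinds) > 1
    return has_volume and has_velocity and has_variety
```

A workload that only satisfies one or two of the conditions is, by this reasoning, better served by an appliance, an EAI / event processing tool or a content management solution respectively.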
What constitutes Big Data?
By and large Big Data is one or a combination of the following in a Data Lake or EDW + Data Lake formation:
– Apache Hadoop Framework / Ecosystem (Open Source) and its enterprise distributions from Cloudera, Hortonworks, MapR, Microsoft Azure HDInsight, etc.
– NoSQL (Not only SQL) Data Storage
– NewSQL Databases like TokuDB, Akiban, Drizzle, etc.
– Blockchain, a distributed database, used as the public ledger for bitcoin transactions. A community edition is BigchainDB
– Appliances (Hardware + Software in a box) like IBM PureData, IBM Netezza, Aster Data from Teradata, Oracle Exadata, EMC Greenplum, HP Vertica, etc. were earlier considered the next option once standard enterprise databases like Oracle, DB2, SQL Server, etc. became hard to manage with exponentially growing data volumes.
Appliances, which mostly followed an MPP architecture, were also known for features like Polymorphic Data Storage, in-database compression, multi-level partitioning, in-database analytical functions, workload management, in-memory analytics, query prioritization / optimization, fast installation and configuration (mostly up and running in just hours), self-healing fault tolerance, interoperability and, in general, High Performance, Scalability, Availability and Reliability.
There are distributed big data analytics frameworks like Apache Spark and Apache Flink that many organizations are currently implementing or experimenting with. These frameworks have a SQL library (Shark / Spark SQL in Spark, the Table API in Flink), a stream processing API and library, a Machine Learning library and a Graph processing API and library.
Adoption of Big Data……
Industries that are leading Big Data adoption are Healthcare, Retail, Education, Utilities, BFSI and Media, while others like Manufacturing, Travel & Transportation and Public Sector have also started to invest, driven by IoT and M2M initiatives.
One of the top challenges for organizations adopting Big Data is identifying a use case that derives value, and determining the ROI of a Big Data Analytics implementation.
Other organizations are concerned about the risk of adoption, especially security, given that they manage end customer or highly confidential data (as in the case of governments, defense, etc.). However, given the advancement of Cloud and of Big Data solutions like Apache Accumulo, which extends the Bigtable data model to implement cell level security, there is very little to worry about, provided the right technology / tool stack and configuration is implemented.
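To make "cell level security" concrete, here is a simplified sketch in plain Python. It does not use the Accumulo client API, and the function and label names are hypothetical. Accumulo attaches a visibility expression to each cell and supports both AND and OR in those expressions; this sketch assumes the simpler case where a cell just lists labels that must all be held by the reader.

```python
def visible_cells(cells, authorizations):
    """Return the (row, column, value) triples the reader may see.

    `cells` is an iterable of (row, column, required_labels, value).
    A cell is visible only if every one of its required labels is in
    the reader's authorizations (AND-only, a simplifying assumption).
    """
    granted = set(authorizations)
    return [
        (row, column, value)
        for row, column, required_labels, value in cells
        if set(required_labels) <= granted
    ]
```

The point of the model is that two users scanning the same row can get different columns back, so sensitive fields do not need to be split into separate tables.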
While Big Data, and Data Science on top of it, is at times positioned with the use case of needing to manage image, audio and video files in future, the truth of the majority of implementations, especially in the Retail, e-commerce and BFSI industries, is that the data being moved into Big Data platforms is mostly transaction data, web and/or mobile (clickstream) data, social media data, log data, sensor and location data and, in some cases, additional sources like emails, documents, mobile app data, etc.
Also, some organizations experience challenges in building, nurturing and retaining talent in Big Data Analytics, which essentially has 4 categories of skills. Knowledge of the Hadoop ecosystem and related tool sets like Hive, Pig, etc. is no longer niche, but the skills necessary to architect / design and build near real-time and streaming analytics platforms require a certain level of exposure and maturity.
What kinds of Big Data Analytics Initiatives are being undertaken?
Most of the Big Data initiatives being implemented have a strategy to either achieve outcomes of enriched Customer Experience (including personalized next best actions and recommendations) or enhanced Business Efficiency or both and are mostly around the following areas:
In some industries like Retail, most of the Big Data Initiatives still involve migration of the structured transaction data along with Clickstream data to get a combined view of the customer behavior both in-store as well as online.
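The "combined view" of in-store and online behavior described above boils down to aggregating two sources on a shared customer key. The sketch below shows that idea in plain Python; the record layouts, field names and function name are assumptions for illustration, and a real implementation would run as a distributed job on the Big Data platform rather than in memory.

```python
from collections import defaultdict


def combined_customer_view(transactions, clicks):
    """Merge in-store spend and online page views per customer.

    `transactions`: iterable of (customer_id, amount) from store systems.
    `clicks`: iterable of (customer_id, url) from clickstream logs.
    """
    view = defaultdict(lambda: {"store_spend": 0.0, "page_views": 0})
    for customer_id, amount in transactions:
        view[customer_id]["store_spend"] += amount
    for customer_id, _url in clicks:
        view[customer_id]["page_views"] += 1
    return dict(view)
```

Customers who appear in only one source still show up in the result, which is often the interesting segment (for example, heavy browsers who never buy in store).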
What are some of the industry-wise Big Data Analytics use cases?
Why are some enterprises still not considering harnessing their data for competitive advantage?
Many enterprises still do not consider Data a strategic Asset and hence do not have a clear strategy for monetizing it. At the same time, the amount of data that these organizations generate, will generate in the coming years, or have already accumulated over the years in multiple siloed stores is humongous.
There needs to be an enterprise-wide Data Strategy initiative, supported or anchored by key Business Stakeholders, for developing a roadmap and execution plan, including MDM and Data Governance programs to enable the larger objectives. Many organizations hence now have roles like the Chief Data Officer. This shouldn’t be confused with the Chief Digital Officer, whose role would at times overlap but is clearly set for a different objective, more towards enhancing customer experience across multiple channels and customer journey touch-points.
Big Data with and without Data Science? Data Science without Big Data?
Technically speaking Big Data is the augmented or replacement Information Management layer that provides the platform for Data Science / Advanced Analytics to be performed.
Quite a few Big Data Visualization and Big Data Analytics platforms and products, both on premise as well as on cloud, are now available with custom industry and cross-industry solutions out-of-the-box.
Many are still frameworks and can be used to accelerate the Big Data implementation. However Big Data Analytics is mostly an initiative that combines implementation of Big Data (Hadoop Ecosystem with or without NoSQL and others) with Data Science / Machine Learning / Advanced Analytics and thereafter Visualization on the top, consumable over mobile and other devices, channels, formats, etc.
There is still a lot of traditional analytics being done and consumed that has nothing yet to do with Big Data as such, as in the Pharma and Life Science Industry for example, but there are changes taking place there and with all industries as well in terms of Big Data Analytics adoption.
The concepts of Calculus, Probability, Correlation and Regression, Least Squares, Time Series, Bayes Theorem, Matrices and Generalization, Fourier and other Transforms, Hypothesis Theory, Design of Experiments, Optimization Methods, etc. existed before the middle of the last century. The basic techniques and applications of Analytics, be it descriptive, predictive or prescriptive, remain more or less the same even on Big Data. Traditional Analytics used technologies like SAS and SPSS and was more of a GUI supported Drag-Drop-Configure based Analytics (with the exception of the programming interfaces that SAS, SPSS and others provided), with open source alternatives like R / RStudio and enterprise versions like Revolution Analytics (acquired by Microsoft).
Big Data, however, needs more scalable and compatible analytics algorithm / model development and involves code development in Java, Python, Scala, etc. with the use of Machine Learning libraries (Java based libraries like Mahout, SparkML, Weka or Python based libraries like NLTK, Pybrain, Pylearn, MDP) apart from others like H2O, Shogun, Vowpal Wabbit, etc.
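As a reminder of how compact these classical techniques are, the sketch below fits a straight line by ordinary least squares using the closed-form formulas (slope = covariance of x and y divided by variance of x). It is plain Python with no libraries; the function name is an assumption, and the same calculation is what SAS, SPSS or R perform behind their interfaces for simple linear regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For example, the points (1, 3), (2, 5), (3, 7), (4, 9) lie exactly on y = 2x + 1, so the fit returns a slope of 2 and an intercept of 1. The technique does not change on Big Data; what changes is that the sums have to be computed in a distributed fashion.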
* Programming languages in general that are getting popular include Swift, C++11, Rust, Go, Clojure, F#, Haskell, C#, Ruby & Ruby on Rails, etc.
Once implemented, what are the best practices to make Big Data continue to work for an organization?
To make Big Data work for an organization and ensure that the ROI on the implementation is met, it is critical that increased measurable insights are generated and enabled through Data Science / Advanced Analytics on top of the Big Data platform with the larger objectives of improving customer experience impacting positive revenue growth and/or improved business efficiency.
Some of the “post implementation” Best Practices involve:
True success is when Big Data helps enable higher Automation and/or the development of Artificial Intelligence that provides a long-term benefit rather than just meeting short-term goals. Hence it should be part of a larger Data Strategy and Insights Roadmap.
What is the near future for Big Data Analytics?
There is a cyclical pattern in the information management products and services space. It moves from IT and programming intensive data and information management systems to more business oriented, self-service, rich UI and metadata driven systems and platforms, and then circles back to IT and programming based storage, data management and information processing systems for managing faster, newer and unstructured data sources. In the case of Big Data, this means once again working with file systems, while now adopting massively parallel processing, perhaps in conjunction with the EDW, BI, Visualization and Analytics landscape already functional in an enterprise.
Big Data Analytics & Visualization platforms like IBM BigInsights, HP HAVEn, Teradata Integrated Data Platform (Unified Data Architecture), etc., products like Datameer, Platfora, Lumify, industry specific ones like Palantir, Ayasdi, etc., as well as others like Lucidworks (Search and Analytics platform) are proprietary solutions that organizations are evaluating after initial tinkering with open source.
To be continued... In the next blog, the focus will be on what it takes to run a Big Data Analytics practice, now that the technical breadth and depth in this space has been established.
* Credits: Research for this blog includes reports from Gartner, Nasscom, Wiki apart from online research
* The image used in the cover of this blog is courtesy of the respective artist
Sreejit has over 18 years of IT experience in Analytics leadership roles. He is the Analytics Practice head for DT&ES Services at Happiest Minds and is responsible for developing Strategies, Competency/Capability Development, Sales, Products & Solutions Development & creating Non Linear Revenue Growth, Account Mining, Leading Alliances & Partnerships, Program Management and People Management. Sreejit is a B. Tech in Computers, a PMI certified Project Management Professional and has completed an Executive Management Program in Sales and Marketing from the Indian Institute of Management, Lucknow.