Hub & Spoken Podcast Episode 151: Building massive scale analytics

What’s in this podcast?

In this episode, Jason talks to David Bader, a Distinguished Professor in the Department of Data Science at New Jersey Institute of Technology, about large scale analytics.

Listen to this episode on Spotify, iTunes, and Stitcher. You can also catch up on the previous episodes of the Hub & Spoken podcast when you subscribe.

What are your thoughts on this topic? We’d love to hear from you; join the #HubandSpoken discussion and let us know on Twitter and LinkedIn.

For more on data, take a look at the webinars and events that we have lined up for you.

One big message

Data is becoming easier and easier to gather and utilise, and data sets are expanding rapidly as a result. This expansion has been a major driver of change in how massive data sets are stored and analysed. The frameworks and infrastructures being developed for large scale data need to bridge a full vertical stack of knowledge and skills, from the underlying hardware up to the algorithms, so that they are easier for everyone to use.

[00:54] David’s path from electrical engineering to founding the Institute for Data Science at New Jersey Institute of Technology

[02:30] Classifying large scale data and what has been driving the trend to collect and analyse more data

[04:39] Looking at instances where scale is important and how to determine when you need to scale

[10:40] Comparing large scale data analysis locally vs. in the cloud

[13:29] The capabilities and skills your data department requires to run large scale data projects

[15:03] The ‘secret sauce’ that helps David’s team handle large scale data

[17:14] Breaking down what hardware you need for large amounts of data

[20:17] What experience is needed to build the entire infrastructure that can hold and analyse large scale data

[24:25] Exploring use cases

The rise of large scale data

Over the past decade, we have seen a dramatic increase in the amount of data that is being generated. This has been driven by a number of factors, including the growth of social media, the proliferation of connected devices, and the rise of big data analytics. As a result, organisations are now able to collect and store large amounts of data more efficiently than ever before.

But what exactly is large scale data?

From a data scientist’s perspective, it is usually classed as an amount of data that cannot fit on a single laptop, since a laptop is often where the data scientist is working. But large scale data isn’t only about how many bytes the raw data takes up on a hard drive; often it is about sophisticated data that gives rise to more complex scenarios and data structures.
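As a rough illustration of that “does it fit on a laptop” test, here is a minimal sketch, assuming a pandas-based workflow and a hypothetical events.csv file, that compares an estimate of the data’s in-memory footprint with the machine’s available RAM. Parsed data can occupy several times more memory than the raw file on disk, which is one reason byte counts alone are a poor measure.

```python
import os

import pandas as pd
import psutil

PATH = "events.csv"  # hypothetical file, for illustration only

# Parse a small sample and measure how much memory it needs once loaded.
sample = pd.read_csv(PATH, nrows=10_000)
sample_mem = sample.memory_usage(deep=True).sum()

# Scale that up by the ratio of the full file size to the sample's size on disk.
# Crude estimate: assumes rows are roughly uniform in width.
sample_disk = len(sample.to_csv(index=False).encode("utf-8"))
estimated_mem = sample_mem * os.path.getsize(PATH) / sample_disk

available = psutil.virtual_memory().available
print(f"Estimated in-memory size: {estimated_mem / 1e9:.1f} GB")
print(f"Available RAM:            {available / 1e9:.1f} GB")

if estimated_mem > 0.5 * available:
    print("Time to think about sampling, chunking, or scaling out beyond the laptop.")
```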

When do you need to scale data?

The data analytics landscape is constantly evolving, and with new tools and techniques emerging all the time, it can be hard to know when to scale your data operation.

When deciding whether to scale, it is often easiest to work with the smallest amount of data that can solve your problem. Smaller data sets are not just easier to work with; they can also be analysed faster and leave less room for error.
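One way to put that into practice, sketched here with pandas and hypothetical file and column names rather than anything prescribed in the episode, is to prototype the analysis on a small random sample and only run it over the full data set once the approach looks right.

```python
import pandas as pd

# Hypothetical transactions file and column names, for illustration only.
df = pd.read_csv("transactions.csv")

# Prototype on a 1% random sample; a fixed seed keeps the sample reproducible.
sample = df.sample(frac=0.01, random_state=42)
print(sample.groupby("region")["revenue"].mean())

# Only once the approach looks right, rerun it over the full data set
# (or hand it to a bigger platform if the full data no longer fits locally).
print(df.groupby("region")["revenue"].mean())
```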

Sometimes you may be dealing with increasing data volumes, expanding user requirements, or simply looking to future-proof your operation as your organisation grows over time. In these scenarios you don’t need to go big straight away, but having a system in place that can be scaled at any time will make your capabilities far more flexible.
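As one hedged illustration of that kind of scale-when-you-need-to setup, the sketch below uses Dask (our choice of example; the episode does not name a specific tool) so that the same analysis code runs on a laptop today and can point at a cluster later by changing only the client connection. The parquet path and column names are hypothetical.

```python
import dask.dataframe as dd
from dask.distributed import Client

# Start locally: scheduler and workers run on the laptop itself.
# Later, point at a real cluster instead, e.g. Client("tcp://scheduler-address:8786").
client = Client()

# Hypothetical partitioned parquet data set.
df = dd.read_parquet("s3://example-bucket/events/")

# The analysis code itself stays the same whether it runs locally or on a cluster.
daily_totals = df.groupby("event_date")["value"].sum()
print(daily_totals.compute())
```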

Bridging the gap between architecture and algorithms

There’s a lot of talk these days about the divide between data architecture and algorithms. On one side, you have the data experts who understand the inner workings of databases and know how to extract insights from large data sets. On the other side, you have the algorithm specialists who can develop sophisticated models to solve complex problems.

When it comes to large scale data, you need experts who can bridge that vertical stack of knowledge, from the underlying architecture up to the algorithms. This requires people with much broader skill sets and, often, more experience in the industry.

Conclusion

The past few years have seen a dramatic increase in the amount of data being generated. This is largely due to the proliferation of connected devices and sensors, which are constantly collecting and transmitting information. Alongside this, there has been a corresponding increase in the ability to store and process large amounts of data. As a result, organisations can now leverage data on a scale that was previously unimaginable, by building backend infrastructure and frameworks that make the data easier to analyse and pull insights from.

This newfound capability is driving innovation across all sectors, as businesses look to gain insights that can give them a competitive edge. In many cases, those who are able to effectively utilise data are rewriting the rules of their respective industries. It is clear that data is becoming increasingly important in today’s economy, and its impact will only continue to grow in the years ahead.


Cynozure: https://www.cynozure.com/
Twitter: https://twitter.com/cynozuregroup/
LinkedIn: https://linkedin.com/company/cynozure/

Hub&Spoken: http://www.hubandspoken.com
Twitter: https://twitter.com/HubandSpoken/
LinkedIn: https://www.linkedin.com/showcase/hubandspoken

CDO Hub: http://www.cdohub.com
Twitter: https://twitter.com/cdohub1/
LinkedIn: https://www.linkedin.com/showcase/cdo-hub/

https://www.cynozure.com/hub-spoken/building-massive-scale-analytics/

David A. Bader
Distinguished Professor and Director of the Institute for Data Science

David A. Bader is a Distinguished Professor in the Department of Data Science at New Jersey Institute of Technology.