
Data Stack Summit 2023

Monica Kay Royal

The Data Stack Summit is all about finding ways to efficiently conquer the modern data stack by gathering together collaboratively as a community and discussing the tools and capabilities desired by future-forward organizations. During this event, we got to hear real-world perspectives from a variety of data practitioners.


First it was Data Teams Summit, now Data Stack Summit, and coming this summer is Enterprise Summit. I can’t wait! Here is a summary of my favorite sessions, including a summary of my own presentation at the end 🤓


Peer-to-Peer Panel: Managing cloud costs right now


Mike Mooney, Co-Founder at Solution Monday

Joseph Machado, Senior Data Engineer at LinkedIn

Carlos Costa, Data & Analytics Hub Lead at Adidas

Vikas Ranjan, Senior Leader, Data Intelligence & Innovation at T-Mobile

Mike Fuller, CTO at FinOps Foundation


For the first panel of the day, our MC, Mike Mooney, starts off by asking:

What is the greatest challenge with managing data costs in the cloud?


Carlos mentions that we are living in a world where we want a single unified platform that serves multiple teams and scales for analytical workloads, but it is hard to assign compute and storage to an individual team while also keeping costs in check. It’s more than just monitoring; you need to be proactive to make sure the bills don’t get out of control.


Joseph agreed and added that with all the software choices available, it is easy enough to set up these monitoring features. However, if you are not familiar with the pricing models then you are at risk of costing the company a lot of money.


Mike agrees with both Carlos and Joseph and states that it is key to be transparent with the teams about costs. He feels that all parts of the business should be aware of the cost impacts, and you can do this by building it into the platform. It can be tricky, but it is a good idea.


Who in the organization is responsible for cost management?


Vikas thinks this is quite the interesting and challenging topic nowadays. While there used to be a dedicated platform team that handled most of the responsibility, it should be a shared responsibility between the platform, application, and business teams. It is also good to bring in the finance folks to look at overall costs. And don’t forget about procurement, make them your friend!


Mike says to start with the product leaders so they can set the direction on the expectations and constraints that should be applied to the data platform in relation to costs. The engineering and architecture teams can then build in those constraints where needed. He also mentions that consumers need to be aware of how they work with the platform and how that relates to costs, so training would be helpful to include as well. And, in agreement with Vikas, don’t forget about procurement.


What tools and processes are out there to help level up with cloud costs?


Joseph finds that sending out weekly cost reports, broken down by team, helps drive conversation and figure out if certain things are necessary to keep. This also resonates better with management since they have data and evidence to review. These reports do not have to be fancy, they could be a simple Excel spreadsheet, just something to show the data. In fact, he had better luck with emailing spreadsheets than he did with building out dashboards.
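Joseph didn’t share any code, but a minimal sketch of that kind of weekly report, assuming a hypothetical billing export (the file name and columns like team_tag and cost_usd are made up for illustration), could look like this in Python with pandas:

```python
# Minimal sketch of a weekly cost report by team, assuming a hypothetical
# billing export "billing_export.csv" with columns:
# usage_date, team_tag, service, cost_usd. Not any specific cloud's schema.
import pandas as pd

billing = pd.read_csv("billing_export.csv", parse_dates=["usage_date"])

# Keep only the last 7 days of spend.
last_week = billing[
    billing["usage_date"] >= billing["usage_date"].max() - pd.Timedelta(days=7)
]

# Break the spend down by team and service, the kind of simple table
# Joseph describes emailing around.
report = (
    last_week
    .groupby(["team_tag", "service"], as_index=False)["cost_usd"]
    .sum()
    .sort_values("cost_usd", ascending=False)
)

# Write out a spreadsheet-friendly file to attach to the weekly email.
report.to_csv("weekly_cost_report.csv", index=False)
print(report.head(10))
```

The output CSV is exactly the kind of plain spreadsheet he describes attaching to a weekly email, nothing fancier than that.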


What is one thing that is most valuable in controlling costs?


Carlos advises that when you first start looking into a data platform, you need to set hard budget limits for storage and compute from day 1. It also helps if you cap end users so they do not exceed those limits.


Vikas agrees and adds that costs should be part of the process and design of the data platform. You should know the cost and potential ROI at the start and build that into the culture. This helps bring that awareness and mindset to the individuals involved. Also, tagging is very very important.


Joseph strongly agrees and shares that tags have saved his life a couple of times in the past. His advice is to find a way to know when teams are approaching their limits so you can catch it before you get billed. To Carlos’ point, setting caps can be a viable solution here.
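None of the panelists walked through an implementation, but a hedged sketch of the “catch teams before they hit their cap” idea, reusing the same hypothetical billing export and some made-up per-team monthly caps, might be as simple as:

```python
# Sketch only: flag teams approaching a spend cap. The billing export,
# team names, and cap amounts are all hypothetical.
import pandas as pd

MONTHLY_CAPS_USD = {"analytics": 5000, "ml-platform": 12000, "marketing": 2000}
ALERT_THRESHOLD = 0.8  # warn at 80% of the cap, before the bill arrives

billing = pd.read_csv("billing_export.csv", parse_dates=["usage_date"])

# Limit to the current month of spend.
current_month = billing["usage_date"].max().to_period("M")
this_month = billing[billing["usage_date"].dt.to_period("M") == current_month]

spend_by_team = this_month.groupby("team_tag")["cost_usd"].sum()

for team, cap in MONTHLY_CAPS_USD.items():
    spent = float(spend_by_team.get(team, 0.0))
    if spent >= cap * ALERT_THRESHOLD:
        # In practice this would be a Slack message or email to the team owner.
        print(f"WARNING: {team} has spent ${spent:,.2f} of its ${cap:,} monthly cap")
```

In a real setup the caps would come from wherever the budget actually lives, and the tags are what make the per-team breakdown possible in the first place.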


Mike has experienced that getting key metrics in front of people is the way to go: make sure cost information is presented to the organization in an informative way, not just as fancy graphs. ‘Actionable insights’ was the key phrase here.


To close us out, Carlos says to treat the costs of the platform like we treat our analytical abilities, think ‘data as a product’. Joseph shares that there is a balance between ease of use vs. costs and everyone should understand the tradeoffs. Vikas reminds us that this is not a one and done effort but a continuous process and Mike emphasizes the act of collaboration.



Modernizing the data stack - keeping it real!


Mark Mullins, Chief Data Officer at United Community Bank

Raj Joseph, Founder and CEO at DQLabs


In a world of active metadata, semantic layer, data contracts, and modernization of data quality, sometimes it’s easy to overlook the challenges of delivering business value and jump upstream towards a vision of a modern stack.


Hear from a true data leader currently transforming his entire banking data stack and team with careful planning and steady progress. This session is about keeping it real, for other leaders who want to learn how to swim in a world of hype and buzzwords.


Mark’s experience working in the Financial Industry was very relatable to my background and it was fun to hear how he is working through today’s data world.


In this world of ever changing technologies, how do you keep up with everything?


Mark agrees that there is a lot going on in the world of data and while working in a bank, new data comes in constantly. His approach is to focus on people before technology.


His team started out as a classic DBA team, but as things changed with the industry and regulatory requirements he started adding roles where they made sense. Roles including Data Governance, Data Quality, Data Delivery, and Data Architecture. He maintained his focus on the people side which immensely helped evangelize the different programs and gain the necessary support.


What was the reason for adding a Data Governance Role?


It had a lot to do with banks being highly regulated. When regulators start asking about data, they want to talk about data governance and making sure there is a framework in place.

While it started because of the interest from the regulators, it was also important to make sure that the organization was ready for the addition.


With banks, a lot of data still exists in both on-prem and cloud environments which could have its challenges. How do you plan around things like projects between environments?


Mark was up front and shared that it takes a lot of patience. One tip he shared: he tries to maintain 3-4 key partnerships in data management to help with the different types of projects. He also balances the team’s availability and capacity while working with vendors and support companies in order to keep the momentum going.


As you are catching up with modern tools, how do you also keep up with new stuff like ChatGPT?


Mark honestly replied that he has to resist the urge to spend time on all the hype, which is incredibly hard because of all the shiny new things that are coming out. However, in order to keep up and learn new things he does enjoy reading and spends about 20% of his time reading about new technology to invest in or programs to mature within the company. If there is something that he finds really interesting, he will put it on his ‘research board’ and fold that into discussions with his teams to see if there are things to further investigate and pursue.


If you had a time machine, what would you go back and re-do or change?


Mark would have liked to implement better communication, like a 360 communication approach. He feels that he has done well communicating with the people that are using data, but not as well communicating up to the executive level.


Raj closes out the interview by thanking Mark for keeping it real!



An ever-increasing need for data quality


Dr. Rajkumar J. Bhojan, AI Researcher at Fidelity


As the big data explosion matures even further, we're seeing how data quality is so closely linked to information quality, decision quality, and outcome quality. So, good data is more important than big data but how fast can we make good data?


Data quality is a critical factor in every organization as it helps with decision making, productivity, customer experience, enhanced competitiveness, and reduced costs. Data Quality can be viewed as the health of data at any stage in its lifecycle and is a tricky thing to manage as it can also be impacted at any of these stages.


To emphasize the importance of data quality, Dr. Bhojan shared that the video game company, Unity, lost $110 million on their AI advertising system after ingesting bad training data from a third party.


Many of us have seen this epic iceberg example before, which typically represents what an end user sees after all the data work is complete.


This one specifically shows that there are many causes of bad data, most of which are actually hidden.



Good data helps with many things as mentioned above, and you can probably imagine that bad data would result in the opposite. Bad data could be a little more harmful as it may also introduce legal and/or compliance risks which could then lead to potential damage in company reputation. Or as Dr. Bhojan summarized:

good data → good results, bad data → bad results.


A few questions come to mind when talking about data quality:


What makes data good or bad?

There is a concept of data quality dimensions: measurement attributes of data that you can individually assess, interpret, and improve. This topic could be an entire blog post on its own, and I actually have a couple of resources I can point you to if you are interested in learning more (plus a small example of turning dimensions into checks right after these resources).


10Vs of Big Data: Hot Ones Style (YouTube) - I teach my husband about the 10 Vs of Big Data while participating in a Hot Ones hot sauce challenge

10Vs of Big Data: Vegas Style (Blog Post) - Learn the 10Vs of Big Data with the help of Vegas and Gordon Ramsay
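
To make the idea of dimensions a little more concrete, here is a minimal sketch (my own illustration, not something from the session) that turns a few common dimensions into measurable checks on a hypothetical customers table; the file and column names are made up:

```python
# Sketch: measuring a few data quality dimensions on a hypothetical
# customers table with made-up columns: customer_id, email, signup_date.
import pandas as pd

customers = pd.read_csv("customers.csv", parse_dates=["signup_date"])

checks = {
    # Completeness: how much of a required column is actually populated?
    "email_completeness": customers["email"].notna().mean(),
    # Uniqueness: what share of customer IDs are not duplicates?
    "customer_id_uniqueness": 1 - customers["customer_id"].duplicated().mean(),
    # Validity: a crude email format check, purely for illustration.
    "email_validity": customers["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean(),
    # Timeliness: no signup dates from the future.
    "signup_date_timeliness": (customers["signup_date"] <= pd.Timestamp.today()).mean(),
}

for dimension, score in checks.items():
    print(f"{dimension}: {score:.1%}")
```

Each score is just a percentage you can track over time, which is really all “assess, interpret, and improve” asks for.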


What makes data quality hard to control?

There are many challenges involved when dealing with the quality of data, including:

  • Data volume and complexity

  • Data inconsistency

  • Human error

  • Lack of data governance

  • Outdated data

  • Data security

  • Data integration


Plus, data quality can change at pretty much every step of the data lifecycle: new features, bug fixes, refactors, optimizations, new teams, and outages.


When to start data quality?

This one is a bit of a spicy topic as there are a lot of different answers including when there are changes to the organization, the team, or the procedures. But the ultimate answer is NOW!



Turning your data lake into an asset


Bill Inmon, Founder, Chairman, CEO and Author at Forest Rim Technology


Data architecture is constantly evolving. First, there were applications. Then data warehouses. Today we have the data lake. People are discovering that the data lake quickly turns into a data swamp or data sewer. What do you need to do to turn your data lake into a productive, vibrant data lakehouse?


Bill is the father of the data warehouse and a true legend. I love listening to him present and could probably listen to him all day. The knowledge and insights he shares are invaluable! Today, Bill shares with us how to save your data lake from turning into a data swamp.


This isn’t news: almost every corporation these days deals with massive amounts of data. But what do they do with it? A lot of vendors are advising them to put it into a data lake. The problem is that in some cases you can’t find anything, you can’t relate one piece of data to another, and no one knows what anything means. The biggest impact of all this madness is on the data scientist. Bill said that the data scientist spends 95% of their time struggling with data and only 5% being a data scientist.


But, have no fear, there is a solution! You can turn the data lake into a useful structure by converting it into a data lakehouse. This requires two basic components, an analytical infrastructure and the integration of data. Let’s break it down.


Data in a corporation consists of 3 categories:

  • Structured (transactions from banking, phone, and airline companies)

  • Text (medical records or social media reviews)

  • Analog/IoT (security camera footage)


Now, getting this into a new infrastructure requires integration, which means something different for each of these data categories. At a high level, structured and text data need to be put through some form of ETL process so everything is apples to apples, and the analog/IoT data needs to go through some distillation, meaning that the useful data needs to be separated from the not-so-useful data.
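Bill spoke in concepts rather than code, but to give a flavor of what “apples to apples” could mean in practice, here is a toy, hedged sketch of integrating a structured source and a text source into one analytical table; all file names, columns, and rules are hypothetical:

```python
# Toy sketch of integrating two of Bill's data categories into one
# analytical structure. File names, columns, and rules are hypothetical.
import re
import pandas as pd

# Structured source: bank-style transactions that mostly need standardizing.
transactions = pd.read_csv("transactions.csv", parse_dates=["txn_date"])
transactions["amount_usd"] = transactions["amount"].astype(float).round(2)
transactions["customer_id"] = transactions["customer_id"].astype(str).str.strip()

# Text source: free-form reviews distilled into fields the analytical
# infrastructure can actually query.
reviews = pd.read_csv("reviews.csv")
reviews["customer_id"] = reviews["customer_id"].astype(str).str.strip()
reviews["clean_text"] = (
    reviews["review_text"].fillna("").str.lower().map(lambda t: re.sub(r"\s+", " ", t).strip())
)
reviews["mentions_fees"] = reviews["clean_text"].str.contains("fee")

# Integration: one customer-level table combining both categories, roughly
# what the lakehouse's analytical layer would serve to the data scientist.
customer_view = transactions.groupby("customer_id", as_index=False)["amount_usd"].sum().merge(
    reviews.groupby("customer_id")["mentions_fees"].any().reset_index(),
    on="customer_id",
    how="left",
)
print(customer_view.head())
```

The point of the sketch is only that each category gets its own treatment (standardization vs. distillation) before they can live together in one queryable structure.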


Bill provided some really wonderful and in depth examples here and my recommendation would be to view his recorded presentation (https://datastacksummit.com/) for more information.


At the end of the day, once you have built the data lakehouse, your data scientist can become a happy camper.


Peer-to-Peer Panel: Enabling the analytics end user


Sunny Zhu, ESG Data Analytics & Operations at Indeed

Jess Ramos, Senior Data Analyst at Crunchbase

Sangeeta Krishman, Senior Analytics Lead at Bayer

Nicole Radziwill, SVP & Chief Data Scientist


Nicole led us through our last panel by asking:

What is self service analytics and what is its purpose?


Sunny pointed out that self service is more than just beautiful visualizations and dashboards. It should be something to help the business answer questions. While some just want the data so they can do whatever they want with it, this does still require collaboration with the data analytics function.


Jess believes that democratizing data and making it accessible to everyone is important, but you should not forget the education side and make sure that the users of the data know what they are doing. It is a delicate balance of giving people data so they can run with it and helping translate the data to the user.


Sangeeta believes that self service is not one size fits all and gives us a wonderful baking analogy. Some people want to gather raw ingredients and make their cake from scratch, some want to buy a kit that helps with the preparation while still letting them add their own touch, and others just want to go to a bakery and buy a prepared cake.


Nicole mentions that when we provide access to the data, there is a risk of ending up with a proliferation of products, and while attempting to be everything for everyone, you end up being nothing to anyone.


What does enabling the end user mean to you?


This reminds Sunny of the 80/20 rule, where most of the time is spent on the data work, but we can’t just hire a ton of analysts to tackle all the problems. She mentions that we should leverage automation where we can for these types of data prep tasks, and that there should also be alignment on the data structure and support for those end users.


Jess thinks that there should be some restrictions put into place on the types of data or projects that are available for these end users to tackle. The simpler repetitive tasks are easier to manage, but this still does not mean that the data analysts should be taken completely out of the equation. Also, it helps when there is clear documentation available to the end users to help with things like data definitions!


Sangeeta views the 80/20 rule a little differently. Enabling self service should not be applied to every scenario and every data set. She believes that 20% of the answers cannot be found through self service, so you will need the help of the data team. It’s like the self checkout at a store: most people are going to get it, but a select few are going to run into issues and need help.


What do you believe are the benefits when you are able to free up resources from the IT and data teams?


Jess shares a big benefit for the company as it relates to costs. She said that you really don’t want to be paying a data scientist to perform quick and simple tasks when those can easily be completed by the end users. Sangeeta highlights that there is more time for innovation, which helps with ROI, and Sunny reminds us that no one wants to just be a reporting machine. When the data team is freed up, more time can be dedicated to solving problems. Sunny also mentions that it helps if there is a goal set at the organization level for what self service is going to achieve.


How do you make people more data literate?


I personally found this question to be a difficult one to answer with my experiences in this area. Sangeeta compares data literacy to financial literacy, in that we did not get that type of education in grade school. Therefore, you have to start with educating others to give them a basic understanding.


What is one piece of advice you would give for the executives?


In summary:

  • We cannot solve 100 problems by hiring 100 data scientists

  • It is important to align data strategy to business strategy

  • Data should be a strategic asset

  • Leverage automation to improve operational efficiency

  • Don’t roll out self service as an objective for the entire company, start with one team

  • Provide training and education

  • Make sure processes are documented

  • Don’t create an us vs. them mentality, be there to help if needed



Is synthetic data useful for data engineers?


Matthew Norton, Product Owner at Nationwide

Dr. Alexander Mikhalev, AI/ML Architect at NBS


With my audit and security background, I really like the topic of using synthetic data. This was a short presentation as half was a technical demo from Dr. Alexander Mikhalev. There are still some good takeaways that I want to summarize here and you can check out the Data Stack Summit website for the full presentation.


Synthetic data is information that is computer generated rather than produced by real-world events. It is used in machine learning models to represent any situation and can help to improve model accuracy, protect sensitive data, and mitigate bias.
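
The session’s demo is worth watching for the real thing; purely as a flavor of the concept, here is a hedged sketch of generating a small synthetic customer dataset for pipeline testing. It assumes the Faker library and made-up fields, not anything shown in the actual demo:

```python
# Hedged sketch (not the session's demo): generate a small synthetic
# dataset for pipeline testing. Fields and sizes are made up.
import random
import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)
random.seed(42)

rows = []
for i in range(1_000):
    rows.append({
        "customer_id": f"CUST-{i:05d}",
        "name": fake.name(),        # realistic but fictional, so no privacy risk
        "email": fake.email(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today"),
        "monthly_spend_usd": round(random.uniform(5, 500), 2),
        "is_churned": random.random() < 0.1,  # a rare event you can dial up for testing
    })

synthetic_customers = pd.DataFrame(rows)

# Feed the synthetic table through the same ETL, schema, and performance
# tests you would run against production data, without real customers.
synthetic_customers.to_csv("synthetic_customers.csv", index=False)
print(synthetic_customers.head())
```

Because the records are fabricated, the same file can be shared across teams or vendors for schema, integration, and performance testing without any privacy concerns, which is exactly where the benefits and use cases below come from.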


Benefits

  • Volume

  • Rare events

  • Privacy & Safety

  • Collaboration & innovation


Use Cases

  • Data Pipeline Testing to make sure ETL and workflows are working correctly

  • Database Schema, Integration, and Performance Testing is a lot safer

  • Data Modeling & Analytics to increase training data efficiency


Challenges

  • Infrastructure (machine learning training can result in overload to some infrastructure if not robust enough)

  • Knowledge & Expertise (does not always exist at the right level in the org so you could partner with a third party)

  • Culture Change (the introduction of new things often causes some hesitation, but you can start with a small use case first, which will help you demonstrate success stories and get buy-in)



DataOps Teams: Stop sprinting!


Monica Kay Royal, Founder & Chief Data Enthusiast


DevOps and DataOps have a few similarities in the processes and tooling required to achieve the goals of each. So why are data teams struggling with the implementation of DataOps?


This is a presentation that I gave at the summit and had so much fun doing it. Thank you to the team for asking me to present this topic!! 😃


First, we start off with a poll. I asked my LinkedIn network if the two terms were different; I anticipated the answer would be ‘yes’ but was really curious what people would share in the comments. Here is a summary:

  • Not a clear yes/no

  • Depends on the definitions

  • There are still many interpretations

  • Strategies and processes are similar, but implementation is different

Plus, two quotes:

‘DataOps is spicy DevOps for data’ ~ Juan Manuel Perafan
‘DataOps is complicated’ ~ me


When we look at the definitions, we can see that they are actually quite similar and only differ in each of their goals.





This brings to light two things: 1) although they are different, DataOps teams are trying to achieve both sets of goals, and 2) DataOps requires human interaction.


If you focus on the process and tools involved to complete a project, DataOps teams are kind of set up for failure. They have to understand the data which requires interaction with humans AND constantly add new tools to their stack which they have to learn quickly while trying to meet tight sprint schedules. 🤯


During the presentation, I gave some examples with fun pictures and analogies. The main point being, of course, that DataOps teams need to stop sprinting if they want to do their jobs right.


After the presentation I received some feedback that I wanted to share, it made me smile!!


If you are interested in learning more, you can find the video on my YouTube channel here.



Conclusion


If you would like to watch any and all of the sessions from this year's event, or past years' events, visit the Data Stack Summit website.


I am already getting excited for Enterprise Data Summit, happening June 7th, 2023!! Register here


 

Thank you for reading, Thank you for supporting

and as always,
Happy Learning!!


