Uniting data lakes and data warehouses in the cloud! Yes, it is possible, and it brings a lot of benefits. Hi everyone and welcome to our GCP Podcast called “Mastering Google Cloud Made Easy”. My name is Ivaylo Dimitrov, and I am a Google Cloud Solution Architect at Adastra. Today, I am going to share some insights on how to bring data lakes and data warehouses together with Google Cloud, but before we proceed, make sure you subscribe to our channel to stay up to date with our latest GCP insights.
In a nutshell, I will talk about data: how it is processed, how and where it has been stored over the years, and what the future looks like. Companies have always been data driven, but over the last 30 years we have seen a striking change in the way companies generate, collect, and use their data. In today's episode I will cover key advancements in data processing and share an opinionated perspective on how data can be stored and accessed more efficiently in the future.
Relational Database Management Systems (RDBMS)
Let me start with relational database management systems, or RDBMS. They have been the heart of supporting business growth for many years. These systems were designed and optimized for business data processing and are commonly referred to as online transaction processing (OLTP) systems. While they support the day-to-day operations of companies, over time organizations realized the need for deeper, actionable insights into their daily business operations. To achieve that, they had to bring multiple sources together into a centralized repository that could support all the data manipulations and produce deeper insights about business trends and performance. Thus, the idea of a centralized data warehouse was born.
Early Data Warehouses
Early data warehouses were built on existing RDBMS stacks, and the adaptations made to that technology were never sufficient to support the volume, variety, and velocity of the Big Data era. As more companies embraced digital transformation, data volumes and types increased dramatically. At first, that data was mostly structured or semi-structured. With the rise of social media, shared platforms, and IoT devices, the diversity of data types increased. Data warehouses could handle structured and semi-structured data, but not the ever-increasing unstructured data volumes ingested from the new sources. A new method of collecting, storing, and exploring these combined data types was needed.
The Big Data explosion changed the rules of the game with a host of new distributed database systems and data engines, mainly from the NoSQL and columnar families.
They marked the end of the “one size fits all” paradigm that fueled data warehouses and business intelligence until then. This gave rise to a new concept called a data lake, which soon became a core pillar of data management alongside the data warehouse.
What Is a Data Lake?
A data lake is a place for enterprises to ingest, store, explore, process, and analyze any type or volume of raw data coming from disparate sources like operational systems, web sources, social media, and Internet of Things. To make the best use of a data lake, data is stored in its original format without the added structure or much pre-processing.
Whether used as the source system for data warehouses, as a data processing and transformation layer, as a platform for experimentation by data scientists and analysts, or as a direct source for self-service BI, data lakes clearly complement data warehouses and the main transactional data stores in an organization. Over time, organizations were led to believe that Hadoop in combination with their warehouse would solve all their analytics needs, but that was not necessarily the case.
Data lakes began expanding their capabilities beyond storing raw data to include advanced analytics and data science on large volumes of data. This enabled self-service analytics across organizations, but it required extensive working knowledge of Hadoop and complex engineering processes to access the data.
Meanwhile, data volumes and types continue to grow, and conventional data analytics platforms are failing to keep up. The cost and complexity of provisioning, maintaining, and scaling data lake clusters has kept organizations from using them to their full potential.
Data Lakes and Data Warehouses in the Cloud
Now, businesses are looking to modernize their data lakes and data warehouses by moving them to the cloud, both for the cost savings and to realize value from data by making it available for real-time business insights and artificial intelligence. As more companies optimize to become fully data-driven, AI and real-time analytics are in higher demand.
Facing the shortcomings of traditional on-premises data warehouses and data lakes, data stakeholders struggle with scaling infrastructure, finding critical talent, controlling costs, and ultimately managing the growing expectation to deliver valuable insights.
To stay competitive, companies need a data platform that enables data-driven decision making across the enterprise. But this requires more than technical changes; organizations need to embrace a culture of data sharing first. Siloed data is silenced data. It is critical to enable secure data sharing across lines of business to facilitate enterprise intelligence. When users are no longer limited by the capacity of their infrastructure, value-driven data products are constrained only by an enterprise's imagination.
Google Cloud Advantages
By migrating to Google Cloud and modernizing these traditional management systems, organizations gain well-understood advantages such as reduced storage and processing costs, scalability to ingest and analyze data, better time to value, powerful data governance and security layers, and democratized data.
As mentioned previously, some of the key differences between a data lake and a data warehouse relate to the types of data that can be ingested and the ability to land unprocessed raw data into a common location. This can happen without the governance, metadata, and data quality that would have been applied in traditional data warehouses.
These core differences explain the changes around the personas using the two platforms:
- Traditional data warehouse users are BI analysts who are closer to the business, focusing on deriving insights from the data. Data is typically prepared by ETL tools based on the requirements of the data analysts. These users traditionally use the data to answer questions.
- Data lake users include, in addition to analysts, data engineers and data scientists. They are closer to the raw data and have the tools and capabilities in place to explore and mine that data.
They not only transform it to business data that can be transferred to the data warehouses, but also experiment with it and use it to train their Machine Learning models and for AI processing. These users not only find answers in the data, but they also find questions.
As an outcome, these two systems are often managed by different IT departments with different teams, split between their use of the data warehouse and the data lake.
However, this approach has a number of tradeoffs for customers and traditional workloads. This disconnect has an opportunity cost; organizations spend their resources on operational aspects rather than focusing on business insights. As such, they cannot allocate resources to focus on key business drivers or on challenges that would allow them to gain a competitive edge.
For example, if an online retailer spends all its resources on managing a traditional data warehouse to provide daily reporting that is key to the business, it falls behind on creating business value from the data, such as leveraging AI for predictive intelligence and automated actions. Hence, it loses competitive advantage through increased costs, lower revenues, and higher risk. At the end of the day, this is a barrier to gaining a competitive edge. The alternative is to use fully managed cloud environments, where most of the operational challenges are resolved by the services provided.
Every organization wants a unified platform that provides secure and high-quality data that is accessible to the right data users. But what if they don’t have to compromise?
With Google Cloud you get Dataplex, which helps companies strike the right balance of governance and access across their data platform. Dataplex is an intelligent data fabric that unifies distributed data to help automate data management and power cloud analytics at scale. It brings data warehouses, data lakes, and data marts together.
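As a minimal sketch of what this looks like in practice, the commands below create a Dataplex lake and attach a raw zone to it; the lake name, zone name, region, and project are hypothetical placeholders, and an authenticated gcloud setup with the Dataplex API enabled is assumed.

```shell
# Create a Dataplex lake as the logical umbrella over distributed data
# (lake name, region, and project are hypothetical placeholders).
gcloud dataplex lakes create retail-lake \
    --location=us-central1 \
    --display-name="Retail Lake" \
    --project=my-gcp-project

# Attach a raw zone for unprocessed data landing in Cloud Storage.
gcloud dataplex zones create raw-zone \
    --lake=retail-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION
```

Curated zones can then be added for BigQuery datasets, so one set of governance policies spans both storage systems.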
By recognizing that every end user in an enterprise can and should be a “data person”, user experience tools can help minimize the skills gap, which has been a barrier to people getting access to real-time and central data.
One prominent example of a serverless and fully managed native cloud product that can serve as a complete modern replacement for an Enterprise data warehouse is Google’s BigQuery.
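To make that concrete, an analyst can run standard SQL against BigQuery straight from the command line, with no cluster to provision. This sketch queries a BigQuery public dataset and assumes only an authenticated bq CLI in a GCP project.

```shell
# Run serverless SQL against a BigQuery public dataset; there is no
# infrastructure to manage. Assumes the bq CLI is authenticated.
bq query --use_legacy_sql=false '
  SELECT name, SUM(number) AS total
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE state = "TX"
  GROUP BY name
  ORDER BY total DESC
  LIMIT 5'
```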
While these systems converge within an organization, it also makes sense to leverage cloud-native technologies oriented towards file-based or data lake style workloads.
Google’s Dataproc is a managed offering that lets Hadoop and Spark workloads, among others, run as serverless, job-based tasks rather than on persistent clusters.
Because a serverless job persists only for its own duration, rather than requiring a 24×7 cluster, there can be a paradigm shift in the way data teams interact with the data lake, modernizing it as it converges with the data warehouse.
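As a hedged sketch of that job-based model, the command below submits a PySpark job as a Dataproc Serverless batch; the bucket, script, and job arguments are hypothetical placeholders, and a configured GCP project is assumed.

```shell
# Submit a PySpark job as a Dataproc Serverless batch: compute is
# provisioned for this job only and released when it finishes.
# Bucket and script names are hypothetical placeholders.
gcloud dataproc batches submit pyspark gs://my-lake-bucket/jobs/clean_events.py \
    --region=us-central1 \
    -- --input=gs://my-lake-bucket/raw/events/ \
       --output=gs://my-lake-bucket/curated/events/
```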
Bringing Data Lakes and Data Warehouses Together
Convergence of the data lake and data warehouse is about simplifying and unifying the ingestion and storage of the data and leveraging the correct computing framework for a given problem. It no longer matters whether the data is stored within the data warehouse or in a cloud storage bucket: behind the scenes, both sit on a similar distributed storage architecture; only the way the data is structured differs. As a result, data is easily accessible and manageable by both data lake and data warehouse architectures in one place. Organizations can therefore apply the same governance rules to data residing in the lake and to that same data accessed through the data warehouse.
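One concrete illustration of this convergence is a BigQuery external table defined over files sitting in a Cloud Storage bucket, so warehouse SQL queries lake data in place. The dataset, table, and bucket names below are hypothetical, and an authenticated bq CLI is assumed.

```shell
# Define a BigQuery external table over Parquet files in a Cloud Storage
# bucket, letting warehouse SQL query lake data where it lives.
# Dataset, table, and bucket names are hypothetical placeholders.
bq query --use_legacy_sql=false "
  CREATE EXTERNAL TABLE lake_demo.events_raw
  OPTIONS (
    format = 'PARQUET',
    uris = ['gs://my-lake-bucket/curated/events/*.parquet']
  )"
```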
As an outcome, we can break down silos, by not just putting data into a central repository, but by enabling processing and query engines to move to wherever that data is.
To wrap it up – Cloud computing has changed the way that we approach data. Traditionally, organizations have had to manage large amounts of infrastructure to extract value from data, starting with the data warehouses and leading to the rise of Hadoop-based data lakes. However, both approaches have their challenges, and we are in a new, transformative technical era in cloud computing technology where we can leverage the best of both worlds. Google has gone through this transformation, too.
In fact, Google’s data processing environment is built with this in mind from first principles. Google BigQuery acts as a massive data warehouse, hosting and processing exabytes of data.
Processing engines such as Dataproc and Dataflow have been closely coupled with BigQuery and other solutions. These tools are then used seamlessly by different teams and personas to enable data-driven decision making and applications.
More than ever before, companies see the need to modernize their data storage and processing systems to manage massive data volumes and close the data-value gap. This is a challenging problem to solve, and it can be a significant engineering undertaking to overhaul and consolidate legacy data analytics stacks. It’s important to understand the technical, business, and financial impacts of not only what data is being collected, but how it’s being stored and accessed.
Part of this, too, is the organizational impact that changes to a data platform can have. It’s hard to bring together multiple stakeholders, especially when it seems like their goals aren’t aligned. The good news is that when you bring together key data owners, users, and stewards of data systems, you can find a lot of common ground and agree on areas of compromise.
Well, this is all for today’s episode of “Mastering Google Cloud Made Easy”. Thank you for joining and I hope you can take these insights and drive your business with ease with Google Cloud.
Make sure you subscribe to our channel to stay up to date with the latest insights we share. Join us again next time for the next portion of GCP insights. Goodbye.