Dynamic data value chains


How often do big data initiatives end with the question, “Great, I have the data, but now what do I do with it?” Organizations today know that their data is valuable, but oftentimes it is not actionable, even after they build out complex and expensive infrastructures to manage that data.

Similarly, as organizations go from minimum viable product to growth stage, how often does the flexibility of their traditional NoSQL solution later become a burden? Frequently, organizations start out loving the ease of use and flexibility of a schema-less database solution; however, as they mature, the lack of insight into, and control over, that schema becomes a major pain point.

One of the major reasons behind this is that the nature of big data is dynamic. At the beginning of a big data initiative, the project team may have one perspective on the value of their data, but once the project is completed and the demo is shown to the executive team, a completely different set of questions may need to be answered. This happens frequently.

The challenge then becomes that most big data solutions are not flexible or dynamic enough to answer questions in a truly ad hoc way. There are many great BI solutions, such as Tableau, GoodData, and Periscope Data, that make it easy to ask ad hoc questions; however, they all have one thing in common: they need the underlying data repositories to be able to answer those questions.

In a NoSQL database solution, developers choose hash keys, range keys, and indexes that fit what they expect to be the business need for that data, but later find out that different data points are required for search. This could be as simple as realizing that the organization needs to filter by gender. Unless the developer wants to execute expensive scans and paging against a NoSQL solution, this often leads to global secondary index hell, or to the need to move the data into MapReduce or Elasticsearch solutions.
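
To make that trade-off concrete, here is a minimal, purely illustrative Python sketch (the records, field names, and index are hypothetical and not tied to any particular NoSQL product). Filtering on an attribute that was never part of the key design forces a scan of every item, while answering the same question efficiently means building and maintaining yet another secondary index:

```python
# Illustrative only: a key-value store keyed by user_id, as in many NoSQL designs.
users = {
    "u1": {"name": "Ada", "gender": "F", "plan": "pro"},
    "u2": {"name": "Bob", "gender": "M", "plan": "free"},
    "u3": {"name": "Cy",  "gender": "M", "plan": "pro"},
}

# Without an index on "gender", the only option is a full scan of every item,
# which is the expensive operation described above.
scan_result = [u for u in users.values() if u.get("gender") == "F"]

# A secondary index maps the new filter attribute back to the primary keys.
# Building and maintaining one of these for every new question is the
# "secondary index hell" problem.
gender_index = {}
for key, user in users.items():
    gender_index.setdefault(user["gender"], []).append(key)

indexed_result = [users[k] for k in gender_index.get("F", [])]

print(scan_result == indexed_result)  # True, but at very different costs at scale
```

Every new question of this kind adds another index to build, backfill, and keep consistent with writes, which is exactly where the pain starts.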

In a SQL database, similar problems arise. However, developers then become dependent on DBAs to configure new views and indexes and to create new schemas and columns. This is time-consuming, reduces the velocity of big data projects, and requires ever more compute resources to maintain. In the data science world, it leads to more time spent massaging data than actually performing data science.
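
The statements involved are simple; the friction is the process around them. A hypothetical sketch using Python's built-in sqlite3 module (the table and column names are made up) shows the kind of change that, in a governed production database, typically has to wait on a DBA: a new column plus a new index just so a new question can be answered efficiently:

```python
import sqlite3

# Hypothetical schema change: the business now wants to filter users by gender,
# which was never part of the original design.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# The change itself is two statements; in production it is a ticket, a review,
# and a maintenance window.
cur.execute("ALTER TABLE users ADD COLUMN gender TEXT")
cur.execute("CREATE INDEX idx_users_gender ON users (gender)")

cur.execute("INSERT INTO users (name, gender) VALUES (?, ?)", ("Ada", "F"))
con.commit()
print(cur.execute("SELECT name FROM users WHERE gender = ?", ("F",)).fetchall())
```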

In the end, most organizations run a multi-tiered database architecture that requires multiple licenses, multiple servers, and staff with many different skill sets to maintain and support. It can also cause major data-integrity issues at scale, as there end up being multiple sources of truth, and middleware issues can leave data repositories out of sync. This is a major problem that affects almost all verticals.

These problems have led most technologists down two paths. First is the path of in-memory computing. Amazing speeds from a reporting perspective can be gained using in-memory databases. This is a viable option for organizations with incredibly deep pockets willing to spend millions on licensing, hardware, and technical resources, and it requires a deep understanding of traditional database technology to grow and maintain at scale. As a result, while it is a viable option for some organizations, it is not practical for most. Additionally, many of these in-memory solutions struggle to fulfill the promise of hybrid transactional/analytical processing (HTAP) use cases, as managing the in-memory component of streaming data becomes challenging at high scale.

Alternatively, many developers look for more dynamic and flexible solutions. As a result, we have seen the rise of data lake solutions, as well as elastic cache solutions, to provide more flexible access to large data sets for analytics and querying. These are great technologies and have a strong place in the market; however, as the need for real-time analytics rises with Industrial IoT and edge computing, these solutions will struggle to keep up. Today, when most folks talk about real-time computing, they really mean hours or minutes. As more and more money is invested in edge technology, the need for real-time will shift from minutes to seconds, and eventually from seconds to sub-second time.

The problem with data lake or elastic cache type solutions is the data value chain. Data needs to be ingested, transformed, analyzed, and then made actionable. These types of solutions are highly dependent on other technologies for the ingestion and transformation of data, and they are not designed to function as application databases. As a result, they often require a complex process to deliver data to them for insight and analysis, and then require that data to be fed back to yet another third-party solution for actionability. This is problematic and complex, as maintaining the different integrations is difficult and the data can easily get out of sync. Furthermore, many data lake solutions require the compute power of another database in order to search and query, and in high-scale streaming data use cases this can lead to crashes and slow response times.
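
As a rough sketch of that chain, the following Python example collapses each stage into a single function purely for illustration (the event shape, units, and alert threshold are all hypothetical). In a real stack, each stage is often a separate product, and every hand-off between them is an integration that can drift out of sync:

```python
from statistics import mean

# Hypothetical stages of the data value chain described above. In many stacks,
# each function below is a separate product (message queue, ETL tool, warehouse,
# serving database), each with its own integration to keep in sync.

def ingest(raw_events):
    """Stage 1: accept raw readings from devices or applications."""
    return [e for e in raw_events if e is not None]

def transform(events):
    """Stage 2: normalize units and drop malformed records."""
    return [{"sensor": e["sensor"], "temp_c": (e["temp_f"] - 32) * 5 / 9}
            for e in events if "temp_f" in e]

def analyze(records):
    """Stage 3: derive an insight, here a simple per-batch average."""
    return {"avg_temp_c": mean(r["temp_c"] for r in records)} if records else {}

def act(insight, threshold_c=30.0):
    """Stage 4: feed the insight back into something actionable."""
    return "raise_alert" if insight.get("avg_temp_c", 0) > threshold_c else "ok"

raw = [{"sensor": "s1", "temp_f": 91.0}, {"sensor": "s2", "temp_f": 88.5}, None]
print(act(analyze(transform(ingest(raw)))))  # -> "raise_alert"
```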

One option to solve these problems is to use a combination of different cloud services, which significantly reduces the complexity and can offer developers and newer companies an inexpensive path to a production-ready product. However, over time and with increased scale, these services become increasingly expensive and make predictable margins and cash flow difficult, particularly for SaaS-based products. These solutions can also create problems in IoT edge use cases, as they require consistent internet connectivity, which is not always available at the edge.

To truly solve these problems, new data management technologies need to address four requirements. First, they need more dynamic schema capability. Developers are not going to want to go back to the days of rigid RDBMS schemas once they have seen the ease of use and flexibility of a NoSQL database. That said, those same developers are realizing, as their products and organizations mature, that they need the deep analytical capability that only SQL can really provide: the ability to describe a schema, perform multi-table joins, apply multiple conditions, and so on.

As a result, a dynamic schema that adapts to the data ingested is key. This gives developers the ability to ingest ever-changing data, while having the capability to understand, inspect, and search that data however they desire.
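
A minimal sketch of what that can look like, in plain Python with made-up attribute names (real products handle this with far more rigor), is a schema register that grows as records arrive, so the data can still be described and queried even though no schema was declared up front:

```python
from collections import defaultdict

# A toy "dynamic schema" register: as records with new or missing attributes
# arrive, the observed schema grows to match the data rather than the other
# way around. Attribute names and values here are purely illustrative.

observed_schema = defaultdict(set)   # attribute name -> set of observed types
table = []                           # the ingested records themselves

def ingest(record):
    for attribute, value in record.items():
        observed_schema[attribute].add(type(value).__name__)
    table.append(record)

ingest({"device_id": "a1", "temp": 21.5})
ingest({"device_id": "a2", "temp": 22.0, "humidity": 40})   # a new attribute appears
ingest({"device_id": "a3", "firmware": "1.2.0"})            # the schema keeps adapting

# "Describe schema": inspect what the data actually looks like right now.
for attribute, types in observed_schema.items():
    print(attribute, sorted(types))

# Ad hoc filtering over an attribute nobody planned for up front.
print([r for r in table if r.get("humidity", 0) > 35])
```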

Second, new data management technologies need to own the data value chain. It is unfair to unload the complexities of HTAP and streaming data use cases onto developers and DevOps teams, and asking engineering teams to manage seven or eight products within their data value chain just to reach actionable data is uncalled for. Data management technologies need to streamline these workloads and expose easy-to-use interfaces that developers of any skill level can manage.

Third, as the value of data continues to grow across all verticals, it is important to establish predictable spend. Companies need data management solutions that can scale exponentially without exponential costs. As organizations continue to attempt to monetize their data, this predictability is key; without it, monetization is impossible.

Finally, as edge computing and hybrid cloud use cases continue to grow in popularity, data management technologies need the ability to span from the edge to the server in order to effectively manage the data value chain.

Stephen Goldberg, CEO of HarperDB    
