One of the most persistent myths is that only big companies can afford Big Data driven solutions: that they are appropriate only for massive data volumes and cost a fortune. That is no longer true, and several revolutions have changed this state of mind.
The maturity of Big Data technologies
The first revolution is related to maturity and quality. It is no secret that ten years ago big data technologies required a certain amount of effort to make things work and to make all the pieces work together.
There were countless stories of developers wasting 80% of their time trying to overcome silly glitches in Spark, Hadoop, Kafka, and others. Nowadays these technologies have become sufficiently reliable: they have outgrown their teething problems and learned how to work with each other.
Infrastructure outages are now far more likely than internal bugs, and even infrastructure issues are tolerated gracefully in most cases, since most big data processing frameworks are designed to be fault-tolerant. In addition, these technologies provide stable, powerful, and simple abstractions over computation that let developers focus on the business side of development.
The variety of big data technologies
The second revolution is happening right now -- myriads of open source and proprietary technologies have been invented in recent years -- Apache Pinot, Delta Lake, Hudi, Presto, Clickhouse, Snowflake, Upsolver, Serverless, and many more. The creative energy and ideas of thousands of developers have been converted into bold and outstanding solutions with great motivating synergy around them.
Let’s address a typical analytical data platform (ADP). It consists of four major tiers:
- Dashboards and Visualization – the facade of ADP that exposes analytical summaries to end-users
- Data Processing – data pipelines to validate, enrich, and convert data from one form to another
- Data Warehouse – a place to keep well-organized data – rollups, data marts, etc.
- Data Lake – a place where pure raw data lands, the foundation for the Data Warehouse
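As a rough sketch, the four tiers could be modeled as a simple mapping from tier to candidate technologies. The pairings below use only products named in this article and are illustrative, not prescriptive; the `stack_for` helper is just a convenience for this sketch:

```python
# Illustrative mapping of ADP tiers to example technologies mentioned in
# this article; any entry can be swapped for an alternative in the same tier.
ADP_TIERS = {
    "dashboards_and_visualization": ["Grafana"],
    "data_processing": ["Apache Spark"],
    "data_warehouse": ["Clickhouse", "Snowflake"],
    "data_lake": ["AWS S3", "Delta Lake", "Apache Hudi"],
}

def stack_for(tiers: dict) -> list:
    """Pick the first candidate from each tier to form one possible stack."""
    return [candidates[0] for candidates in tiers.values()]

print(stack_for(ADP_TIERS))  # ['Grafana', 'Apache Spark', 'Clickhouse', 'AWS S3']
```

Swapping a technology is then a one-line change in a single tier, which is the composability point made below.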
Every tier offers plenty of alternatives to suit any taste and requirement. Half of these technologies have appeared within the last five years.
The important thing about them is that these technologies are developed with the intention of being compatible with each other. For instance, a typical low-cost small ADP might consist of Apache Spark as the processing base, AWS S3 or a similar store as the Data Lake, Clickhouse as the Warehouse and OLAP engine for low-latency queries, and Grafana for nice dashboarding.
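For a sense of how little it takes to try the Warehouse and Dashboard tiers of such a stack locally, a minimal sketch of a Compose file follows. It assumes the official `clickhouse/clickhouse-server` and `grafana/grafana` Docker images and their default ports; a real deployment would add volumes, credentials, and resource limits:

```yaml
# Minimal local sandbox for the Warehouse and Dashboard tiers of the
# low-cost ADP described above (images and ports are project defaults).
services:
  clickhouse:
    image: clickhouse/clickhouse-server
    ports:
      - "8123:8123"   # ClickHouse HTTP interface for queries
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # Grafana web UI
    depends_on:
      - clickhouse
```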
More complex ADPs with stronger guarantees can be composed in a different way. For instance, introducing Apache Hudi on S3 as the Data Warehouse can provide a much larger scale, while Clickhouse can remain for low-latency access to aggregated data.
The power of cloud services
The third revolution comes from cloud services, which have become real game-changers. They offer Big Data as a ready-to-use platform (Big Data as a Service), allowing developers to focus on feature development and leave infrastructure care to the cloud.
Another example is an ADP that leverages the power of serverless technologies from storage and processing all the way to the presentation tier. It follows the same design ideas, with the technologies replaced by AWS managed services.
It is worth mentioning that AWS here is just an example; the same ADP could be built on top of any other cloud provider.
Developers can choose the particular technologies and the degree to which the solution is serverless. The more serverless it is, the more composable it can be; the downside is stronger vendor lock-in. Solutions locked into a particular cloud provider and serverless stack can have a very short time-to-market runway, and a wise choice among serverless technologies can make the solution more cost-effective.
This option, though, is not quite as useful for startups: they tend to leverage the typical $100K cloud credits, and jumping between AWS, GCP, and Azure is quite an ordinary situation for them. This has to be clarified in advance, and more cloud-agnostic technologies proposed instead.
Usually, engineers distinguish the following costs:
- Development costs
- Maintenance costs
- Cost of change
Let’s address them one by one.
Development costs
Cloud technologies definitely simplify engineering efforts. There are several areas where they have a positive impact.
The first one is regarding architecture and design decisions. Serverless stacks provide a rich set of patterns and reusable components that give a solid and consistent foundation for solution architectures.
There is only one concern that might slow down the design stage: big data technologies are distributed by nature, so solutions must be designed with possible failures and outages in mind in order to ensure data availability and consistency. As a bonus, such solutions require less effort to scale up.
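In practice, designing for failure often boils down to two habits: retrying transient errors and keeping processing steps idempotent so a retry is safe. A minimal sketch of the retry half, where `flaky_fetch` is a hypothetical dependency standing in for a distributed storage or processing call:

```python
import time

def retry(op, attempts=3, base_delay=0.1):
    """Retry a transient operation with exponential backoff.

    Distributed services fail occasionally by design, so callers must
    expect and absorb transient errors rather than crash the pipeline.
    """
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient outage")
    return "payload"

print(retry(flaky_fetch, base_delay=0))  # payload
```

Frameworks like Spark bake this pattern in at the task level, which is why infrastructure hiccups rarely surface as pipeline failures.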
The second one is integration and end-to-end testing. Serverless stacks make it easy to create isolated sandboxes to play, test, and fix issues, thereby shortening the development feedback loop.
Another advantage is that the cloud imposes automation of the solution's deployment process. Needless to say, this feature is a critical attribute of any successful team.
Maintenance costs
One of the major goals that cloud providers claim to have solved is reducing the effort needed to monitor and keep production environments alive. They have tried to build an almost ideal abstraction with nearly zero DevOps involvement.
The reality is a bit different: maintenance usually still requires some effort. Besides this, the bill depends a lot on infrastructure and licensing costs, so the design phase is extremely important: it gives a chance to challenge particular technologies and estimate runtime costs in advance.
Cost of change
Another important aspect of big data technologies that concerns customers is the cost of change. Our experience shows there is no difference between Big Data and any other technology: if the solution is not over-engineered, the cost of change is comparable to that of a non-big-data stack. There is one benefit, though, that comes with Big Data: it is natural for Big Data solutions to be designed as decoupled. A properly designed solution does not look like a monolith, allowing local changes to be applied within short time frames exactly where they are needed, with less risk of affecting production.
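The decoupling argument can be illustrated with the data-processing tier described earlier: if validation, enrichment, and conversion are independent stages, any one stage can be replaced without touching the others. The stage functions below are purely illustrative:

```python
# Each stage knows nothing about the others; swapping one out is a local
# change with no risk to the rest of the pipeline.
def validate(record):
    if "user_id" not in record:
        raise ValueError("missing user_id")
    return record

def enrich(record):
    return {**record, "source": "mobile"}  # hypothetical enrichment

def to_row(record):
    return (record["user_id"], record["source"])

PIPELINE = [validate, enrich, to_row]

def run(record, stages=PIPELINE):
    for stage in stages:
        record = stage(record)
    return record

print(run({"user_id": 42}))  # (42, 'mobile')
```

Replacing `enrich` with a new implementation is a one-element change to `PIPELINE`, which is exactly the kind of local, low-risk modification a decoupled design permits.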
In summary, we do think Big Data can be affordable. It offers developers new design patterns and approaches that can be leveraged to assemble any analytical data platform, meeting the strictest business requirements while remaining cost-effective.
Big Data driven solutions can be a great foundation for fast-growing startups that would like to stay flexible, apply quick changes, and keep a short time-to-market (TTM) runway. Once the business demands bigger data volumes, Big Data driven solutions can scale alongside it.
Big Data technologies allow for the implementation of near-real-time analytics on a small or large scale while classic solutions struggle with performance.
Cloud providers have elevated Big Data to the next level providing reliable, scalable, and ready-to-use capabilities. It’s never been easier to develop cost-effective ADPs with quick delivery. Elevate your business with Big Data.
Boris Trofimov, Software Architect, Sigma Software