We all know the story of “Goldilocks and the Three Bears,” where a fictional girl named Goldilocks tries different bowls of porridge, chairs, and beds in search of the one that is “just right.” Today nearly all businesses use open source software, and a rapidly increasing number of them create and contribute to open source projects themselves. By using open source, companies face a lower cost barrier to trying new software solutions for their business problems. Businesses welcome the fact that the community around a popular open source project assists substantially in testing, code scrutiny, bug fixing, and enhancement of the software. They take comfort in knowing that they could contribute if they so desired, or read the code in order to debug a problem. They are no longer at the mercy of a vendor's decisions to fix issues, add features, or end-of-life the product. They love the fact that, by leveraging open source software, they are finally in control. However, as businesses adopt more open source software into critical pieces of their business, they are still met with some challenges.
In this article we’ll explore how businesses can acquire open source software. In particular, we will address concerns related to open source distributed data processing software such as Presto, Spark, Hive/Hadoop, Kafka, and MongoDB. We’ll explore the business advantages and disadvantages introduced by each delivery method and how to choose which is “just right” for your business.
Origins of open source software
As a brief history lesson, open source originated in academia and research, where source code was shared among researchers without license restrictions. In fact, open source wasn’t really a term. All software was effectively open source, so there was no need to differentiate. During this time, commercial computing was still in its infancy, and the software development community worked collaboratively out of necessity. Around the 1970s, as businesses saw a growing opportunity to commercialise and monetise their software, they began to shift away from their collaborative roots. IBM, for example, famously started charging separately for some of its software and stopped providing the source code. Increasingly, businesses would copyright their software and withhold its code, effectively changing the nature of the open collaboration of the day. This shift was reinforced by United States case law establishing that software can be copyrighted.
Of course, this swing of the pendulum would not last forever. The benefits of open source software were too great to be ignored. There was no hiding that the collaborative nature of open source software often provides flexibility, cost reductions, and cutting-edge development. But before open source, there was “free software,” promoted by the Free Software Foundation (FSF) - free as in “free speech,” not as in “free beer.” Founded by Richard Stallman in 1985, the Free Software Foundation defines four essential freedoms of software, under which users are free to run, copy, distribute, study, change, and improve the software. While the FSF worked to protect the freedom of software, its approach was not completely viable in the commercial world. Eventually, in the late 1990s, the first big shoe dropped when Netscape chose to release the source code of its suite of web browsing tools, Netscape Communicator. This was a pivotal point in the open source movement and led to the founding of the Open Source Initiative.
The difference between free software and open source deserves a longer explanation and is not necessarily considered a settled matter. In essence, you can think of free software as a philosophical and social movement, whereas open source is about the collaborative development of software. Both achieve the same end: making the source code of software available. When we refer to open source software, it’s often assumed the software has been developed collaboratively by various developers. Some companies choose to employ developers to contribute, and some developers choose to volunteer their time for the good of the project. The open source movement is about collaboration, and as it grew over the following decade, open source alternatives to proprietary software became increasingly available. This approach is more palatable to businesses, both those using the software and those contributing to it.
Open source in 2019
Generally speaking, there are three options through which one can use and participate in open source distributed data processing software.
- Download it from the community (Community Website, GitHub, SourceForge, Bitbucket)
- Obtain it from software vendors with open core or SaaS offerings (Confluent, Databricks, Red Hat, Starburst)
- Procure it as a service through cloud providers (AWS, Azure, Google Cloud Platform)
From the community
Many popular community-driven open source projects provide a variety of ways to obtain the source code or prebuilt artefacts. The PostgreSQL community is a good example. At a minimum, a project provides a way to obtain the source code in order to compile and use the software. These days, source code is often retrieved from a publicly accessible repository managed with a version control system such as Git, Subversion, or Mercurial. GitHub is a popular platform for hosting Git repositories and for creating and collaborating on open source software. By making this collaboration easy, GitHub has contributed greatly to the rise of open source. Subject to each project's license, users can copy, distribute, study, change, and improve the software, and use it however they like.
However the software is obtained, using open source software directly from the community provides users with the benefits of a low initial cost and great flexibility in choosing versions and updates. The onus of compilation, installation, and configuration, however, is still on the user. And of course, the code comes with the standard warning of “buyer beware,” as there is almost never any technical support or maintenance. The user should be technically skilled enough to work with the community and potentially even contribute patches. This is common practice among the large web-scale companies that embrace open source, such as Facebook, Netflix, Uber, and LinkedIn, which employ teams of software engineers who are part of the open source community advancing the project. Businesses unwilling or unable to make that engineering investment should limit this method to trying out the software, and should remain wary of deploying community-sourced software at scale or in production. Once a business reaches the production stage of a project, there are significant benefits to working with the expertise of an established vendor.
Commercial software vendor
As the business model for open source has evolved, business users understand it better and are more willing to engage with a vendor. Although the projects are largely community driven, the code contributions to popular open source projects often come from a handful of companies, with a long tail of smaller contributors. Of the contributing companies, often one or two are vendors that have chosen to focus their business on the development and support of a particular piece of open source software. Some great examples are Databricks for Apache Spark, Confluent for Apache Kafka, Cloudera or Hortonworks for Apache Hadoop, Starburst Data for Presto, and DataStax for Apache Cassandra, among many others.
Vendors backing an open source project often follow an open core or SaaS business model, and sometimes offer both. With open core, a vendor may offer a community or limited edition of the software that is free and open source, while also offering proprietary add-ons at the periphery under a commercial license. These days, this is well understood and accepted by buyers. Cloudera is one example. Another model involves offering the open source software as SaaS. SaaS relieves the user of the operational burden and allows them to get up and running quickly. Often, a SaaS offering has the open source software as its main component but adds many extra features that increase its value. Databricks and GitHub are great examples of this.
These models certainly provide more comfort for businesses using open source software. Companies pay for the latest features, patches, long-term support, roadmap influence, and enterprise-level support with service level agreements. However, this costs more than using the community version, and the product is still proprietary in certain respects. Even so, the cost is often far lower than that of the traditional commercial proprietary products the open source software is displacing. And even though parts are proprietary, businesses still have the comfort and optionality of simply moving back to the pure open source software and managing it themselves. If the vendor deviates too far from the original spirit of the project, customers run the risk of vendor lock-in and of losing the benefits of leveraging a truly open piece of software. If the vendor stays true to the project, contributions to the open source project from outside the vendor are included in its product, and there is far less threat of lock-in.
Public cloud provider
Similar to the commercial software vendors, public cloud vendors such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform deliver open source software as a service. The benefit of this model is a much lower cost and a reduced need for resources to maintain the software. The public clouds have the greatest advantage in that they own the infrastructure that commercial vendors of similar SaaS solutions also use. Public cloud vendors can therefore easily undercut those vendors on price.
The disadvantages, however, can sometimes be great. Often, these vendors have little expertise in the open source projects they deliver. Frequently, there are significant lags in versions and features, as well as slower updates to patch bugs. This can leave an organisation exposed to potential security threats and poor performance compared with other versions of the software. Cloud offerings also often lack many of the value-adds that open source vendors provide. Normally, the public clouds provide the open source software as a service in its raw, generic form, without many configuration or tuning options, whereas with a vendor, the software is one component of a more holistic, greater-value solution. With public cloud vendors, even though the service is based on open source software, it is effectively proprietary software with vendor lock-in. It is the most closed and least free of the three options.
Some open source projects have worked to prevent cloud vendors from taking advantage of open source software by changing their licenses to prevent cloud vendors from hosting software to which they don’t contribute. MongoDB is an interesting example, and it remains to be seen how this works out. It’s worth mentioning that the Open Source Initiative does not view MongoDB's new license as open source. There are some positive exceptions to cloud vendors exploiting open source projects without giving anything back. For example, Google Kubernetes Engine offers Kubernetes as a service, and Google contributes heavily to the Kubernetes project.
Ultimately, there are many benefits to choosing open source over traditional proprietary software. Increased flexibility, higher quality, and reduced costs are just a few that can have a dramatic impact on a business. These benefits do not come without risks, however. Choosing the right delivery model, as well as knowing the specific needs and goals of the deployment, can help manage, mitigate, or even eliminate these risks.
If an organisation is still in the testing phase and wants to better understand the software's functionality, a source like GitHub will meet the needs of a quick and minimal deployment. If costs and resource allocation are the biggest concerns, then a cloud vendor can provide the needed software as a service.
A business that requires both the flexibility of free and open source software and the cost effectiveness of a turnkey solution should follow the Goldilocks principle. For this need, the commercial software vendor behind the open source project is almost always the choice that will prove just right.
Matt Fuller, Co-Founder and VP of Engineering, Starburst Data