Key research findings on data persistence

Technical decisions around data persistence are hard, which is why we surveyed 583 IT professionals on everything from current DBMS and ORM usage to modern database engines’ data structures and access patterns to storing data on a mobile device.

The demographics of this survey are as follows:
• 69 per cent of these respondents use Java as their primary programming language at work.
• 68 per cent develop primarily web applications.
• 66 per cent have been IT professionals for over 10 years.
• 45 per cent work at companies whose headquarters are located in Europe, 27% in the USA.
• 44 per cent work at companies with more than 500 employees, 19% at companies with more than 10,000 employees.

Oracle, MySQL, SQL Server above others

The two most mature commercial DBMS offerings (Oracle and MySQL) are used in production by 51 per cent and 49 per cent of respondents, respectively — significantly ahead of the third-ranked DBMS (SQL Server, at 34 per cent). The top three, and the tight race between the top two, have not changed in years, among our survey respondents as well as on the DBMS ranking aggregator dbengines.com. The nearest NoSQL challenger, MongoDB, remains a distant fourth in production environments.

NoSQL adoption

Server graph

Non-production environments are more friendly to less mature and less thoroughly supported database management systems, and also more likely to be affected by desire to optimize for structural fit and ease of access.

In production, where data stores are often managed by specialist non-developers, factors other than developer experience and optimal match between data processing and storage and retrieval algorithms weigh into DBMS selection more heavily. NoSQL and generally less mature and/or less supported offerings should therefore be more popular in non-production environments.

DBMSes that implement simpler storage models well-suited to lightweight prototyping — especially, therefore, document-oriented DBMSes — should gain an extra boost in development environments.

Accordingly, the gap between production and non-production usage is greatest for the two most mature commercial DBMS offerings: Oracle (at 51 per cent in production vs. 37 per cent in nonproduction) and SQL Server (34 per cent in production vs. 25 per cent in non-production), and the gap between the most popular NoSQL offering in production (MongoDB, at 20 per cent adoption) and the least popular of the top three (SQL Server, at 34 per cent) enters within the survey’s margin of error in non-production environments (where MongoDB enjoys 25.4 per cent adoption vs. SQL Server’s 24.6 per cent).

MongoDB’s (static-schema-free) document orientation, familiar JSON-like document format (ordered lists supporting a variety of types), and widespread connector availability make it easy to set up without heavyweight data modeling and relatively straightforward to use for many less-data-intensive applications without cramping application architecture or code.

Indeed, many non-relational stores are easier to spin up quickly than a full-power RDBMS. Some benefits of the relational model (especially integrity enforcement) are less relevant in nonproduction environments, where updates don't always need to propagate across all entities.

Note: of the top three DBMSes, only MySQL enjoys greater adoption in non-production vs. production environments. MySQL is especially likely to be many developers’ default nonproduction RDBMS, presumably because it is popular, open source, mature, familiar, and supported by a strong community (For the importance of familiarity in developers’ preference for a particular data persistence technology, see the upcoming section on matching storage model to data structure).

Two storage models

Among our respondents, nearly as many respondents typically use two storage models in their applications (38 per cent) as use one (40 per cent). This result confirms that “NoSQL” is better understood as “Not Only SQL” because the most popular storage model (given DBMS and query language usage data) remains relational. Based on DBMS adoption data, the second most popular storage model by user count is probably document-oriented; but because other storage models (especially column-oriented, graph, and key-value) are particularly well suited to analytical processing of many data rows, further research is required to discover storage model usage by data volume.

In any case, the near-parity between one and two storage models per application indicates increasing interest in matching persistence mechanism to the structures of data to be persisted.

Matching storage model to data structure

Graph structures do not fit the relational model comfortably. In a relational database, most (Shannon) information is stored in the columns and rows of each table; the schema is a technical construct designed to enforce data integrity, make the data model more legible, and make the querying model more efficient; not to encode more information.

In a graph, however, most information is stored in the structure of the nodes and the edges; additional information about nodes and edges is treated as metadata. Yet many real-world entities are most naturally represented as graphs: social, travel, and trade networks; packet routes; control flows; etc.

Storing graph structures in tabular storage is inelegant and inefficient even at first, static only glance; but the problem gets worse in a dynamic setting. Because a graph’s computational complexity may diverge wildly from its combinatorial complexity, reducing a graph to a relational schema (e.g. two-column mapping tables that relate a row in one table to a row in another — that is, modeling nodes as columns and edges as rows in a new table) may work far better for some algorithms than for others (in ways that are not immediately obvious from the graph itself).

Nevertheless, three factors encourage developers and DBAs to store data that is naturally modeled as a graph in a relational DBMS: first, the maturity of relational DBMSes; second, the simplicity and familiarity of SQL (which 90 per cent of respondents use regularly); and third, the availability and maturity of powerful object-relational mappers (ORMs) that make relational data easily accessible (often with automatic and highly effective optimizations) from application code.

storage_header_contentfullwidth.jpg

Accordingly, only a small minority (20per cent) of respondents persist data that is naturally modeled as a graph in a specialized graph DBMS. Further, more respondents store naturally-graph data in a relational database without explicit modeling of edges as rows (39 per cent) than with node-node mapping tables (31per cent). We expect this distribution to change as graph DBMSes and query languages grow more familiar, as tooling ecosystem around these DBMSes approaches the maturity of ORMs, as inefficiencies introduced by storage-structure mismatch grow more expensive as graph data volume increases, and as use cases (and corresponding storage and retrieval algorithms) grow more varied.

Two possibly linked correlations are also worth noting. First, the largest chunk of respondents who store graphs in a relational database without explicit modeling of edges use Oracle (25 per cent) — probably the most mature and most thoroughly optimized RDBMS.

Second, the largest chunk of respondents who store graphs in relational database WITH node-node mapping tables use MySQL (24 per cent), which is also the only RDBMS that gains popularity in non-production vs. production environments. This difference may be a function of both the greater likelihood that MySQL will be used for experimental purposes – where graph problems, insofar as conceptually farther from actuarial use (for which relational databases are a more natural fit), are more likely to appear.

Matching processing approach to storage model

Most developers use SQL (90 per cent) but the relational algebra does not naturally capture object-orientation. Objects do not fall into Venn diagrams; but objects and relational tables do share enough structure that, for many simple (few-join) access patterns, the so-called object-relational impedance mismatch does not cause catastrophic performance or integrity loss.

Accordingly, object-relational mappers (ORMs) are not only widely used, but also preferred by a majority of developers. In response to our question, “What persistence-related technology do you most enjoy working with?” 58 per centof respondents answered that they most enjoy working with ORMs. Of these, 70 per cent specifically enjoyed working with Hibernate — probably a function of both Hibernate’s maturity and also our respondents’ heavy focus on Java. Although the tail of most-enjoyed data persistence technologies was quite long (26 distinct technologies), Spring Data emerged as the most popular comprehensive data access framework by far (16 per cent).

Why devs enjoy working with data persistence tech

The most popular reason by far was ease of use (34 per cent), followed by simplicity in distant second (21 per cent). The top four reasons relate more directly to developer experience than to outcomes (such as performance and scalability), as the wording of the question (“enjoy”) indicated.

Additional research is required to determine how developer experience relates to persistence-related technology selection, especially because many less-familiar (NoSQL) technologies are optimized for scalability and general performance for certain use cases.

Data partitioned as frequently as it is not

Modern storage engines, across all storage models, are highly optimized for current hardware, access patterns, and network performance. Theoretically massive inefficiencies of the relational storage model sometimes dominate the advantages offered by a higher degree of maturity among RDBMSes, although newer engines store data in structures that are less narrowly tuned to read-heavy loads using slow (spinning) physical media than (for example) B+ trees. But as Big Data strategies aggressively drive data storage and processing needs, data scale becomes increasingly difficult to manage.

To keep performance and availability high, data is often partitioned on physical and logical lines. Among our survey respondents, 38 per cent partition data in some way (vertical, horizontal, or functional) vs. 40 per cent who do not — a difference within the survey’s margin of error (5 per cent).

Two research follow-ups would prove interesting: first, what specific data volumes (or velocities), application requirements, and infrastructure constraints drive what kinds of partitioning; and second, which storage models are more likely to require partitioning (although application constraints presumably affect both choice of storage model and partition size/need). It would appear, however, that distributed data techniques designed to manage CAP trade-offs are often effective: 22 per cent of respondents—most of whom are developers and not DBAs — were not even aware of whether or not their databases were partitioned—a sign that, for nearly a quarter of developers, physical splitting of data had no visible impact on their development work.

John Esposito, Dzone.com