
The Big Data A-Z: Part three

If you have been following this A-Z series for the past couple of weeks, you will now know the importance of fully understanding the Big Data buzzwords. You’ll also be able to tell the difference between ELT and a BLT sandwich, and know your RAMs from your sheep.

Understanding these terms and what they mean in the real world of business is fundamental to implementing your data analytics strategy successfully and remaining competitive in the data-driven age.

In parts one and two, we ticked off A to L and revealed the big mysteries behind terms such as In-Memory and Hadoop. In this third installment of our four-part series, we’ll be putting the spotlight on some of the most used terms, from M to R:

M is for: MPP

Massively Parallel Processing, or MPP, refers to the use of a large number of processors to perform a set of coordinated computations in parallel. The processing is spread across clusters of servers so that they share the workload.

Say what? Rather than one process taking 100 minutes, have 100 processes taking 1 minute each. If the work splits evenly, you’ll have your answer in about 1 minute!
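The split, compute and combine pattern behind MPP can be sketched in a few lines of Python. This is only an illustration, not how any particular MPP database works: a thread pool stands in for the servers of a real cluster, and the helper names `split_into_chunks` and `parallel_sum` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide items into n_chunks contiguous slices of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks take one extra item each.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_sum(items, n_workers=4):
    """Sum items by handing each worker one chunk, then combining the partial sums."""
    chunks = split_into_chunks(items, n_workers)
    # Assumption: threads stand in for the separate nodes of a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


values = list(range(1, 101))
total = parallel_sum(values, n_workers=4)
print(total)
```

Each worker only ever sees its own slice of the data, and the final answer comes from combining the partial results. That is the same division of labour an MPP system applies across many servers.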

Did you know? In 2012, computer engineers at the University of Southampton in the UK built an MPP supercomputer from a cluster of 64 Raspberry Pi computers and a rack made of Lego.

N is for: NoSQL

A NoSQL database allows you to store and retrieve data that is modelled in ways other than the tables used in relational databases. The data structures used by NoSQL databases (key-value, graph, document) are easier to scale horizontally and, unlike the fixed schemas of standard relational databases, can be changed on the fly. However, NoSQL databases typically give up some of the sophistication and querying power of SQL-based relational databases.

Say what? A different type of database: not tables with rows and columns but stores with keys and values.
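To make the contrast concrete, here is a minimal sketch that uses plain Python tuples and dictionaries as stand-ins for a relational table and a key-value or document store. The customer records, key names and fields are invented for illustration, and no real database is involved.

```python
# Relational style: every row follows the same fixed set of columns.
customer_columns = ("id", "name", "city")
customer_rows = [
    (1, "Ada", "London"),
    (2, "Grace", "New York"),
]

# Key-value / document style: each key maps to a self-describing value,
# and records do not have to share the same fields.
customer_store = {
    "customer:1": {"name": "Ada", "city": "London"},
    "customer:2": {"name": "Grace", "city": "New York", "tags": ["vip"]},
}

# Adding a field to one record needs no change to any other record.
customer_store["customer:1"]["email"] = "ada@example.com"

# Retrieval is a direct lookup by key rather than a query over rows.
record = customer_store["customer:2"]
print(record["tags"])
```

The flexibility comes at a price: nothing in the store guarantees that every customer has a city or an email, which is the kind of check a relational schema gives you for free.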

Did you know? NoSQL databases have existed since the late 1960s, but the term only really took off with the rise of Big Data and related storage mechanisms in 2005-2010. The name originally meant “Non-SQL”, but some NoSQL databases now support SQL, or at least a subset of SQL commands, so NoSQL is now often read as “Not Only SQL”!

O is for: Operational BI

Operational business intelligence, sometimes called real-time business intelligence, is an approach to data analysis that enables decisions to be taken based on the real-time data that companies generate and use on a day-to-day basis.

Say what? Understand your business and respond to actions by understanding your corporate and customer data.

Did you know? The term Business Intelligence was first used by Richard Millar Devens in 1865 in his Cyclopaedia of Commercial and Business Anecdotes. He used it to describe how the banker Sir Henry Furnese profited by receiving and acting upon information about his environment.

P is for: Python

Python is an interpreted, object-oriented, high-level programming language. Its elegant syntax makes programs easier to read, and it comes with a large standard library that can be extended with additional modules.

Say what? The Zen of Python:
Beautiful is better than ugly, explicit is better than implicit, simple is better than complex, complex is better than complicated, readability counts. If you agree, then Python is for you.
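As a small taste of that readability, the sketch below counts the most frequent words in a sentence using only the standard library. The sample sentence is made up for this example.

```python
from collections import Counter

# A made-up sentence to analyse.
text = "big data needs big ideas and big data tools"

# Split on whitespace and count how often each word appears.
word_counts = Counter(text.split())

# The two most frequent words, most common first.
top_two = word_counts.most_common(2)
print(top_two)
```

The full Zen of Python is built into the language itself: typing `import this` in a Python session prints it.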

Did you know? Python was conceived in December 1989. The name comes from Monty Python’s Flying Circus, a favourite of Guido van Rossum, who created Python.

Q is for: QphH

The TPC-H Composite Query-per-Hour Performance (QphH@Size) is a metric used to reflect multiple aspects of a database system’s ability to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.

Say what? The faster the database, the higher the QphH value.
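In the TPC-H specification, the composite figure is the geometric mean of two component metrics: Power@Size (the single-stream result) and Throughput@Size (the multi-stream result). The sketch below shows only that final arithmetic. The input numbers and the system price are invented for illustration, and the function names are not part of any TPC tooling.

```python
import math


def composite_qphh(power_at_size, throughput_at_size):
    """QphH@Size as the geometric mean of Power@Size and Throughput@Size."""
    return math.sqrt(power_at_size * throughput_at_size)


def price_performance(total_system_price, qphh):
    """Price/performance in currency units per QphH@Size."""
    return total_system_price / qphh


# Hypothetical benchmark results, not taken from any published run.
power = 100.0        # assumed Power@Size
throughput = 400.0   # assumed Throughput@Size

qphh = composite_qphh(power, throughput)
cost_per_qphh = price_performance(1_000_000, qphh)
print(qphh, cost_per_qphh)
```

Because the two components are multiplied, a system has to do well on both single-stream and multi-stream queries to post a high composite score.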

Did you know? The QphH metric is unique to the TPC-H benchmark. Other benchmarks use a different metric suffix. For instance, the TPC-DS benchmark has a metric called QphDS.

R is for…R

R is a programming language and environment for statistical computing and graphics. Because it provides a wide variety of statistical techniques (e.g. linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering), it is popular with statisticians and data miners for developing statistical software and for data analysis.

Say what? R is the language for statisticians.

Did you know? R is based on S, a statistical programming language originally developed in 1976 at Bell Labs, which was also the birthplace of the UNIX operating system and the C programming language.

In the final part of our A-Z guide next week, we will round off the letters S to Z. If you want to find out why X marks the spot, and whether YARN is just a ball of string that your cat plays with, then make sure you come back next week.

Sean Jackson, CMO, EXASOL

Image source: Shutterstock/McIek