This article belongs to my “Data Engineering Explained” series on Medium, in which I share with you the following:
- Fundamentals of Data Engineering
- Buzzwords demystified and explained
- Best practices of Data Engineering
- Hands-on Tutorials
- Interviews with Data Engineering professionals
In today’s article, we’ll learn about the main three Data Model families that power modern Software and Data-intensive Systems: Relational Models, Document Models and Graph-like Models. We’ll learn about the advantages and drawbacks of each Model and a systematic approach on how to select the right model for your application.
Data models are one of the most important parts of developing software since they have a deep impact on:
- How we write software
- How we think about a given problem we are solving
There are many different kinds of data models and each one embodies assumptions about:
- How is the data being used ?
- How is my app / my data is going to evolve ?
They fall under three main families: Relational, Document, Graph-like models
The Relational Model
The relational model was initially proposed by Edgar Codd bach in 1970. Data is organized in relations (or tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
The domination of the relational model has lasted around 25 years for the numerous advantages it offers :
- Relational model hides implementation detail behind a clean interface, which results into a simple code.
- SQL query language is declarative, you tell it what data you want, on which condition to filter and how (required processing) you want it and the query optimizer takes care of delivering results to you.
- Good support for joins and complex queries.
- Better at handling many-to-many and many-to-one relationships (Normalization, and support for joins allow to de-duplicate data, enforce style and spelling consistency, ease of updating denormalized data, tables are narrower …)
Object relational mismatch: There is an awkward translation layer between objects in the application code and the data layer when compared to other data models (e.g. document model). If we take the example of Java which implements the Object Model (vs relational model), here are the main instances of paradigm (or impedance) mismatch
Top Technologies that implement the Relational Model
- MySQL, PostgreSQL, MS SQL …
- Cloud-based: Amazon RDS, Amazon Aurora, GCP Cloud SQL, GCP Cloud Spanner, Azure SQL Database
- DWs : Amazon Redshift, GCP BigQuery, Azure Synapse Analytics, Snowflake
The term NoSQL appeared initially as a catchy twitter hashtag for a meetup on open source distributed non relational databases in 2009. It was too successful that it spread through web startup ecosystem and beyond. NoSQL doesn’t refer to a particular technology, it was interpreted later as Not-Only-SQL.
The Document Model is one of flagship NoSQL solutions
Schemaless: When you are a young startup and when your data is still evolving and has no flexible schema, document model is your best ally. The term schemaless is misleading since, at some point of your code, you have assumptions on some kind of your structure held by your data. Actually, there is a schema that is implicit (not enforced by the database) schema-on-read is a more accurate term (as opposed to on-write-schema in relational model). We can see the difference between the two approaches in situations where the application updates its data format:
- Document Databases: Start writing new docs with the new fields, handle old documents in your application code.
- Relational Databases: Run an ALTER TABLE + UPDATE queries. These operations are likely to be slow and cause the database to have an important downtime
Data locality: Documents are stored as single continuous strings (encoded as JSON or XML etc …), your database loads the entire document whenever you need it. Since the whole document is loaded each time it’s queried, a best practice is to keep documents concise and small in size. This feature comes in handy if your application needs the entire document (for example to render it on a web page), one query is sufficient. In contrast, the relational model requires you to perform multiple queries (query each table) or perform a complex multi-way joint between many tables to have all the relevant information.
Greater scalability: While document databases allow for horizontal scaling, traditional SQL databases can’t typically scale horizontally for write operations by adding more servers, (this is possible for read operations by adding read-only replicas).
Dynamic and expressive data model: The document model is closer to the data structure used by some applications, this prevents the impedance mismatch and yields a cleaner and simpler code. Moreover, the document model is good as representing one-to-many relationships.
Data locality (again): Locality can become problematic if only a small portion of the document is needed. The database will keep loading the document as a whole even when it is unnecessary. Changing fields within a document rewrites the whole document as well. An exception is made for changes that do not affect the encoded size of the document.
Weak support for joins: Joins can be emulated in application code by making multiple queries to the database. This is an anti-pattern for two reasons: It is not computationally efficient and moreover, it shifts the work of making joins from the database to the application code (unnecessary complexity).
Not a one-size-fits-all solution: When apps get complex data gets interconnected, and many-to-one / many-to-many relationships might be needed. Document Model becomes less interesting in those cases.
Top Technologies that implement the Document Model
- MongoDB, CouchDB, RethinkDB…
- Cloud-based: DynamoDB, DocumentDB, Google DataStore, Google FireStore, Google BigTable, Cosmos DB
The Graph-like Model
From what we saw so far:
- When your app has mostly one-to-many relationships (tree-structured data) or no relationships, then the document model is your best ally.
- When you are required to enforce schema on write / your data has strong and complex relationships (many-to-one, many-to-many) that document model cannot represent well, then go for SQL.
Taking things to the extreme, what if many-to-many relationships are VERY common and that anything in your data can potentially be related to everything ? In that case, it becomes natural to start modeling your data as a graph.
Graphs consist of two kinds of elements:
- vertices: also known as nodes or entities
- edges: also known as relationships or arcs
Example of data that can be modeled as graphs include: Social Media Data (people are vertices, and edges indicate which people are connected), Web graph (web pages are vertices, and edges indicate HTML links between pages) …
Many algorithms can operate on this kind of Data (e.g.)
- Dijkstra (shortest path) between two points in a road network
- PageRank for web graph
Graphs need not to have homogenous data. They are actually a powerful way of storing completely different types of objects in a single store.
This gives room for your graph to evolve and model more feature. Considering the example above, we can add book reviews as entities and link them to people who made them, books who are subject to these reviews, and maybe other people who find them useful / useless.
There are several ways of structuring graph data, here are the main two families. They are basically saying the same thing differently but they are worth mentioning.
- Schema flexibility, closer to the object model used in application code.
- Models many-to-many relationships very well
- Optimal if we need to navigate through vertices / entities (we can leverage famous and well-known Graph Optimization Algorithms)
- Not suited for data that is not highly interconnected. Even though each data model can be emulated by any data model, the result is often awkward and unnatural.
Top Technologies that implement Graph-like Models
- Neo4j, OrientDB, Titan, InfiniteGraph …
- Cloud-based: Amazon Neptune, CosmosDB
There is no one-size-fits-all solution:
- Use document databases for use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases go in the opposite direction targeting use cases where anything is potentially related to everything
- Relational model is here to stay. It is highly used for Analytical workloads and still used for transactional workloads where a schema needs to be enforced on write.