Databases are generally regarded as persistent, consistent, and queryable data stores. Caches behave like databases, except that they shed many features to improve performance, favoring fast access over durability and consistency. The two kinds of system cannot be strictly classified, because the feature sets they offer overlap heavily, and distribution complicates matters further. In addition, database products can be adapted to work as caches.
Because it is difficult to define what an ideal database or cache actually is, this blog discusses various characteristics of distributed databases and caches to help determine in which respects a software product behaves like a database and in which it behaves like a cache.
Persistence
- Database is usually expected to persist an entry unless it is explicitly deleted. Persistent storage tends to be slow and complex to implement.
- Cache often relaxes persistence; that is, even if it offers strong persistence, it is still allowed to lose data occasionally, which is never acceptable for a database.
Because caches are built on the premise of holding copies, either consistency or liveness of the data must be traded off. Caches may also implement a time-to-live (TTL) to manage consistency loosely.
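A time-to-live can be sketched in a few lines. The following is a minimal illustration, not taken from any particular cache product: the `TTLCache` class and its `clock` parameter are hypothetical names introduced here, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def put(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None  # never cached, or already dropped
        value, expires_at = item
        if self.clock() >= expires_at:
            # The copy may be stale, so drop it and report a miss.
            del self._entries[key]
            return None
        return value
```

An expired entry is treated exactly like a missing one, so the caller goes back to the source of truth and picks up any change made in the meantime. A shorter TTL bounds staleness more tightly at the cost of more misses.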
Durability
- Database usually supports durable persistence; that is, it supports backup and restore procedures.
- Cache, on the other hand, almost never supports any form of durability or backups.
Authority
- Database contains the complete set of data, so a query that yields no results is taken to be the correct answer.
- Cache is allowed to hold only a subset of the data. If an entry is missing from the cache, applications are expected to fall back to an authoritative source. Caches should be considered non-primary stores, as they generally front an authoritative database in order to boost performance.
Non-authoritative replica databases can also front another database, much as a cache does. The difference is that a replica contains the complete set of data from its authority (possibly restricted to one partition), whereas a cache is not required to be complete.
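The fallback described above can be sketched as a read-through lookup. This is an illustrative sketch only: `AuthoritativeStore` and `lookup` are hypothetical names, and a plain dictionary stands in for the cache.

```python
class AuthoritativeStore:
    """Hypothetical stand-in for the authoritative database."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0  # counts how often the authority is consulted

    def query(self, key):
        self.reads += 1
        # None from the authority means the entry really does not exist.
        return self._rows.get(key)


def lookup(cache, store, key):
    """Read through the cache, falling back to the authority on a miss."""
    if key in cache:
        return cache[key]
    value = store.query(key)
    if value is not None:
        cache[key] = value  # populate so the next read is a hit
    return value
```

A miss in the cache says nothing about whether the entry exists, so the authority is asked. A miss at the authority is the final answer.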
Eviction
- Database is not allowed to evict entries.
- Cache eviction discards data that is not being accessed while keeping the working set readily available, which makes better use of resources.
Cache eviction is most effective when data access follows a power-law distribution, where a small fraction of the entries receives most of the requests. When all data is accessed with equal probability, eviction strategies and caching in general become much less effective.
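The effect of the access distribution can be checked with a small simulation. The sketch below assumes a least-recently-used (LRU) eviction policy, which is only one of many possible strategies, and a Zipf-like distribution with exponent 1 as the power law. The sizes, the seed, and the names `LRUCache` and `hit_rate` are arbitrary choices made for illustration.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, admit the key and evict if full."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return True
        self._entries[key] = True
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used
        return False


def hit_rate(weights, capacity=100, accesses=20_000, seed=0):
    """Replay a random access trace drawn from `weights` and measure hits."""
    rng = random.Random(seed)
    keys = list(range(len(weights)))
    cache = LRUCache(capacity)
    trace = rng.choices(keys, weights=weights, k=accesses)
    hits = sum(cache.access(key) for key in trace)
    return hits / accesses


n_keys = 1_000
uniform = [1.0] * n_keys
power_law = [1.0 / rank for rank in range(1, n_keys + 1)]  # Zipf-like

if __name__ == "__main__":
    print("uniform   hit rate:", hit_rate(uniform))
    print("power-law hit rate:", hit_rate(power_law))
```

With a cache that holds a tenth of the keys, the uniform trace can hit at most about one access in ten, while the skewed trace keeps its hot keys resident and hits far more often.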
Queries
- Database often supports complex queries. This is essential because a database tends to store authoritative data, which needs to be retrieved in different ways.
- Cache often supports only simple key-value access, as it provides quick access to data that is already known to exist.
By dropping query support, caches can achieve much better performance than databases, because simpler data structures such as hash tables can be used instead of trees. Databases can also use these simpler data structures, but doing so restricts their query capabilities.
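The trade-off between the two kinds of structure can be illustrated with Python's built-in types. In this sketch a `dict` plays the role of the hash table, and a sorted list searched with `bisect` stands in for a tree index; a real database would more likely use a B-tree, so this is an analogy rather than an implementation. The sample data and function names are made up.

```python
import bisect

rows = {"apple": 3, "banana": 7, "cherry": 1, "date": 9, "elderberry": 4}


def get(key):
    """Cache-style access: exact-key lookup in a hash table."""
    return rows.get(key)


# Database-style access: an ordered index makes range queries possible
# without scanning every entry.
index = sorted(rows)


def keys_between(low, high):
    """Return all keys k with low <= k <= high, in sorted order."""
    start = bisect.bisect_left(index, low)
    end = bisect.bisect_right(index, high)
    return index[start:end]
```

The hash table answers only "what is stored under this exact key", whereas the ordered index can also answer "which keys fall in this range", at the cost of keeping the keys sorted on every write.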
Reliability/Availability
- Database cannot be easily repaired when its data is broken, because the data cannot simply be discarded.
- Cache offers better reliability and less operational burden due to low complexity. If the cache data is broken somehow, simply deleting the cache contents can correct the problem.
Caches are also assumed to be much more available than databases. A bug that affects availability can be masked as a cache miss, which might give the illusion of higher availability, so the cache hit rate is a good measure of how available the cache really is.
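To make the masking effect concrete, here is a minimal sketch of a cache wrapper that counts hits and misses. `InstrumentedCache` and `UnreachableBackend` are hypothetical names, and the backend is assumed to expose a dictionary-style `get` method.

```python
class InstrumentedCache:
    """Wraps a cache backend and records hits and misses."""

    def __init__(self, backend):
        self.backend = backend
        self.hits = 0
        self.misses = 0

    def get(self, key):
        try:
            value = self.backend.get(key)
        except ConnectionError:
            value = None  # the failure reaches the caller as a plain miss
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class UnreachableBackend:
    """Simulates a cache host that cannot be reached."""

    def get(self, key):
        raise ConnectionError("cache host unreachable")
```

Callers of a broken cache never see an error, only misses, so the application keeps working through the authority. The outage is visible only as a hit rate that drops toward zero.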
Distribution
- Database must be able to distribute data without affecting consistency or availability requirements. Host failures should not cause any data loss, and restored hosts should not introduce inconsistencies.
- Cache can afford to be less strict when hosts fail, again because applications must be able to fall back to an authority. If a host fails and 10% of the cache is wiped, this is perceived as just a burst of cache misses. If the host is restored, it might serve old entries again, which is acceptable behavior if the cache doesn’t support strong consistency.
Partitioning
- Database supports partitioning; however, it is difficult to implement because of database features such as queries and transactions. Repartitioning is also desirable, and a partitioned database is generally assumed to support it.
- Cache that is distributed almost always supports partitioning; however, it need not support repartitioning. Instead, it can allow cache misses to occur while newly added partitions fill up.
A database or cache which supports partitions should scale for both reads and writes.
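The cost of adding a partition without repartitioning can be estimated with a short sketch. It assumes the simplest possible placement scheme, a hash of the key modulo the number of partitions; real systems may place keys differently. The function names and the key format are illustrative.

```python
import hashlib


def partition_for(key, partition_count):
    """Map a key to a partition using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def fraction_moved(keys, old_count, new_count):
    """Fraction of keys whose partition changes when the count changes."""
    moved = sum(
        partition_for(key, old_count) != partition_for(key, new_count)
        for key in keys
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]

if __name__ == "__main__":
    print("moved when growing from 4 to 5 partitions:",
          fraction_moved(keys, 4, 5))
```

Under this scheme, growing from 4 to 5 partitions sends roughly four keys out of five to a different partition. A cache can absorb that as a temporary wave of misses, whereas a database would have to move the affected data to keep its answers complete.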
Consistency
- Database usually favors strong consistency over higher availability.
- Cache rarely supports strong consistency and favors availability and performance.
Eventual consistency describes how a database or cache that is temporarily inconsistent automatically becomes consistent again. Ideally, consistency should be reached within a few seconds; if it takes too long, inconsistencies might appear to be permanent. The time to reach consistency should not increase as the data store grows in size.
Reaching consistency by resynchronizing the database with an authority is a different concept. Likewise, scanning all replicas and merging the differences is not a trait of eventually consistent systems. Rather, these two cases are examples of database repair and recovery, which become slower as the database grows in size.
Replication
- Database replication must ensure that all replicas receive all changes. If the database is eventually consistent, all replicas must converge to the same set of entries.
- Cache can tolerate misses, so its replication system can be “weak”; that is, it doesn’t need to ensure that all replicas receive all changes. Without replication, a distributed cache might see a high miss rate.
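Weak replication can be sketched as a best-effort broadcast in which some updates are simply lost. In this illustration dictionaries stand in for the replicas and the authority, and `replicate_best_effort`, `read`, and `drop_probability` are hypothetical names. The sketch covers only newly written keys; a lost update to an existing key would leave a stale value, which it does not address.

```python
import random


def replicate_best_effort(replicas, key, value, drop_probability, rng):
    """Send an update to every replica; some sends are silently lost."""
    for replica in replicas:
        if rng.random() >= drop_probability:
            replica[key] = value


def read(replica, authority, key):
    """Serve from the replica if possible, else fall back to the authority."""
    if key in replica:
        return replica[key]
    value = authority[key]
    replica[key] = value  # fill the gap left by the lost update
    return value
```

A replica that missed an update behaves as if the entry had been evicted, so reads remain correct and the only cost is an extra trip to the authority.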
Transactions
- Database supports transactions; that is, it allows a group of logical operations to take effect all at once or not at all.
- Cache that doesn’t support persistence cannot implement transactions.
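The all-or-nothing behavior can be demonstrated with SQLite through Python's standard `sqlite3` module. This is a minimal sketch: the `accounts` table, its `CHECK` constraint, and the `transfer` and `balances` helpers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
with conn:  # commits on success, rolls back if an exception escapes
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
    )


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction."""
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        # If this debit violates the CHECK constraint, the credit above
        # is rolled back together with it.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer larger than the source balance raises `sqlite3.IntegrityError` on the second update, and the first update is undone, so both accounts keep their previous balances.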
Triggers
- Database supports triggers, which run custom business logic in response to data changes. These triggers are transactional in nature; that is, if a trigger fails, the complete transaction (and with it the data change) is rolled back.
- Cache generally doesn’t support triggers.