What Is a Graph Database?
Graph databases are most commonly used for highly interconnected data, and for situations where the content of the data itself matters less than the overall structure.
The most straightforward use case for graph data is for social networks. Consider a network of people; each person has a friends list and has relations to other people. Each person also makes posts, often hundreds of them. Every post could have thousands of people interacting with it. So, despite the tweet only being 280 characters long, there’s so much more to store about it.
This certainly isn’t the only use case, just the most digestible one—graph databases are used for all sorts of things. Another example is fraud detection; say you’re a bank, and want to flag suspicious accounts. It might be a little fishy for two separate accounts to have the same address or share phone numbers. With a graph database, you can make a graph of the connection between the two accounts, and identify problems like this much more efficiently than a relational database ever could.
In a graph database, each object is called a node. A node can have any number of properties, very similar to how a document database works. A document database would simply store each node as a separate document in a collection (array) of documents, without taking into account how they connect.
In a graph database, the connections between the nodes are called edges, and they can connect any two nodes from anywhere in the table. Edges define the relationships between nodes, and can have specific types. For example, two friends would be connected with a “Friends” edge, but a user would be connected to a post with a “Posted” or “Liked” edge.
What Makes Them Faster?
It’s not that relational databases like MySQL can’t store graph-like structures—links like these are still core concepts for SQL tables. Links form connections between tables, enabling data to be stored and updated in separate tables while maintaining a link elsewhere in the database, very similar to how pointers work in C. In the social network example, you wouldn’t want to store the name of every friend a given user has as that friend can change their name, so you instead store the friend’s ID, and then perform a lookup whenever you need the right data. Perhaps you cache the results to take some load of the database, but most systems will work similarly to this.
Storing one set of links (like a friends list) is fine, but the problem comes when you start to do any type of complex analysis. The classic example is the friends-of-friends search. To obtain a list of everyone who has a mutual friend with the given person, you would need to loop over the given person’s friends list, and then loop over each friend’s friends list, and then perform a lookup for each record. You’ve also got to make sure you’re not returning duplicate records, which is an extra loop.
If you’re familiar with Big O Notation, you may see the issue here already. It’s a problem with exponential complexity; doing multiple nested loops like this breaks the computer very quickly. It also isn’t a smart way to go about this problem.
Take a look at this benchmark of neo4j running the friends-of-friends query, compared to a traditional relational database (like MySQL):
At depth 2, it’s a simple query for both databases. At depth 3, MySQL becomes unable to perform this query in a normal timeframe, taking 30 seconds to return a response. To its credit, it manages to return the depth 4 query after 25 minutes, though depth 5 crashes the database.
The graph database has no problem returning any of these queries, with all execution times being under 2 seconds, making it thousands of times faster.
How does it do this? Very complicated math, mostly. Graphs are a mathematical structure at heart, and there’s a lot of theory behind it, which we’re thoroughly unqualified to discuss in detail. But relational graphs are fairly simple to understand visually, which makes working with graph databases in practice easy.
If you want to get started working with one, you’ll have to pick and install a graph database. Neo4j is free and open source, and a very popular option. AWS has their Neptune database, which you can try for free, but must host on AWS. Some multi-model databases support graphs as an option, such as Microsoft SQL Server, Oracle Database, and ArangoDB.