Matt Willard


Grokking the System Design Interview

Seven-step basic procedure for a system design:

1. clarify requirements, functional and non-functional, so it's clear what the system should do

2. define the API for the system, establishing the exact contract it's expected to meet

3. do a quick back-of-the-envelope estimation: system scale, storage needed, bandwidth usage (see the worked example after this list)

4. define the data model: what fields are needed, what sort of database system to use, block storage

5. do a high-level design: a block diagram with the core components of the system covering the end-to-end process; note things like servers, file storage, databases, and load balancers

6. go deeper into the components and design them out as needed, considering data storage, load balancing, etc.

7. identify and resolve bottlenecks: single points of failure, handling downed services, data backups, performance monitoring, etc.
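
As a worked example of step 3, here's the kind of back-of-the-envelope math you'd do for a hypothetical tweet-like service. All of the input numbers below are assumptions chosen for illustration, not figures from the course:

```python
# Back-of-the-envelope estimation for a hypothetical tweet-like service.
# Every input number here is an assumption, chosen only for illustration.
daily_active_users = 100_000_000
posts_per_user_per_day = 2
avg_post_size_bytes = 300  # text plus metadata

writes_per_day = daily_active_users * posts_per_user_per_day
writes_per_second = writes_per_day / 86_400                       # ~2,300 writes/sec
storage_per_day_gb = writes_per_day * avg_post_size_bytes / 1e9   # ~60 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1_000            # ~22 TB/year

print(f"{writes_per_second:,.0f} writes/sec")
print(f"{storage_per_day_gb:,.0f} GB/day, {storage_per_year_tb:,.1f} TB/year")
```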

**********

When designing a system:

1. What are the different architectural pieces that can be used?

2. How do these pieces work with each other?

3. How can we best utilize these pieces: what are the right tradeoffs?

KEY CHARACTERISTICS OF A DISTRIBUTED SYSTEM:

Scalability, Reliability, Availability, Efficiency, and Manageability

* Scalability

-capability of a system, process, or network to grow and manage increased demand

-if it can evolve to handle a growing amount of work (data volume, number of transactions), it's scalable

VERTICAL SCALING: give one machine more memory, storage, CPU power, etc.

HORIZONTAL SCALING: add more machines to the pool of resources

* Reliability

-it’s reliable if it can keep delivering services even when a machine fails; distributed systems achieve this through redundancy, at the tradeoff of increased cost

* Availability

-availability is the percentage of time the system remains operational, after taking downtime and repair time into account

-AKA, reliability over time

High reliability => High availability

But an available product doesn’t mean it’s always reliable

* Efficiency

-basically, how efficiently a system can do its tasks given its resource use and other factors

-metrics like response time and throughput for operations like searching and sending messages contribute to this

* Manageability

-how easy it is to operate and maintain: how fast and how easily problems can be fixed

LOAD BALANCING

Load balancing is when a “load balancer” (middleware, essentially) catches incoming client requests, uses an algorithm to determine which server in a pool of resources should handle each request, and sends it off. This is good for distributed systems because it balances work across all available resources and reduces the risk of downtime, interrupted service, and failed components.

There’s a variety of methods a load balancer can use to determine which server to pick - least connections, fastest response time, weighted methods, round robin, etc. Also, multiple load balancers can be inserted at different points in the request path for more redundancy.
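
A minimal sketch of two of those methods, round robin and least connections; the server names and counters are made up for illustration:

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: hand out servers in a fixed rotating order.
round_robin = itertools.cycle(servers)

# Least connections: track open connections, pick the least-loaded server.
open_connections = {s: 0 for s in servers}

def pick_least_connections():
    server = min(open_connections, key=open_connections.get)
    open_connections[server] += 1  # this request now occupies that server
    return server

for _ in range(4):
    print("round robin ->", next(round_robin))
    print("least conns ->", pick_least_connections())
```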

CACHING

Recently requested data is likely to be accessed again. Caching stores the most recently accessed items in short-term memory for quick access in the near future. You usually see this in browsers on the front end to reload websites faster (and a cache is quicker the closer it sits to the front end, too).

Content distribution networks (CDNs) are a good example of this kind of cache, especially for static media. A CDN serves content from its local copy if available; otherwise it queries the origin for the file and then caches it locally.
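
That “serve local if available, otherwise fetch and cache” behavior is easy to sketch. Here `fetch_from_origin` is a hypothetical stand-in for the real network call to the origin server:

```python
cache = {}

def fetch_from_origin(path):
    # Hypothetical: a real CDN node would make a network request here.
    return f"<contents of {path}>"

def get(path):
    if path in cache:               # cache hit: serve the local copy
        return cache[path]
    data = fetch_from_origin(path)  # cache miss: go to the origin...
    cache[path] = data              # ...and keep a local copy for next time
    return data

get("/img/logo.png")  # miss: fetched from origin, then cached
get("/img/logo.png")  # hit: served locally
```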

If data changes on the backend, it should be invalidated in the cache, otherwise clients will see stale data. There are a few techniques for this:

* Write-through: data gets written to the cache and the database at the same time. Full redundancy, but higher write latency.

* Write-around: bypass the cache and write to storage only; a later read of that data will cache-miss and have to query the backend.

* Write-back: writes data to the cache alone, then flushes to permanent storage at certain times. Low latency and high throughput, but a crash before the permanent write loses the data.
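
Here's a sketch of the write-through and write-back policies side by side; a plain dict stands in for permanent storage:

```python
cache, database = {}, {}
dirty = set()  # keys written to the cache but not yet persisted (write-back)

def write_through(key, value):
    # Both writes happen before returning: redundant, but higher latency.
    cache[key] = value
    database[key] = value

def write_back(key, value):
    # Fast: only the cache is touched; remember to persist it later.
    cache[key] = value
    dirty.add(key)

def flush():
    # Runs "at certain times"; a crash before this loses the dirty data.
    for key in dirty:
        database[key] = cache[key]
    dirty.clear()

write_back("user:1", "Matt")
flush()  # now the write is durable
```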

When tossing data from the cache, it can pick what to evict in a few ways:

* first in first out: data that was added first is killed first

* last in first out: most recent data is killed first

* least recently used: kill least recently used items

* most recently used: kill most recently used items

* least frequently used: kills least frequently used items

* random replacement: randomly picks data to kill
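
Least recently used is the most common of these in practice, and it's short enough to sketch with Python's OrderedDict, which remembers insertion order and can move entries to the end on access:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                     # cache miss
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # kill the least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # evicts "b", the least recently used
```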

SHARDING OR DATA PARTITIONING

Data partitioning/sharding is when one big database is broken up across different machines. It's a way to bring the power of horizontal scaling to databases, making it easier/cheaper to scale simply by adding more machines.

Three big methods for breaking up a DB:

* horizontal partitioning: different rows of the same table (split by some range, like locations) go into a separate table on each shard. Tradeoff: the ranges could be unbalanced and overweigh one or more shards.

* vertical partitioning: entities are stored by feature, one per server: user data on one server, photo data on another, etc. Tradeoff: really big features might eventually need multiple partitions of their own.

* directory-based: use a lookup scheme to find where entities reside, which can help with the tradeoffs of the above schemes.

Ways to partition:

* Key/hash-based: apply a hash function to a key attribute to get the number of the partition where it's stored. Tradeoff: adding new servers means changing the hash function and redistributing data. Use consistent hashing to minimize the number of keys that need remapping (see the sketch after this list).

* List: each partition is assigned a list of values; when inserting, find the partition whose list contains the key and store it there

* Round-robin: for N partitions, tuple i is assigned to partition (i mod N).

* Composite: combine methods like list first, then hash
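
Here's the sketch referenced under key/hash-based partitioning: with naive "hash mod N" placement, going from 4 to 5 shards remaps most keys, which is exactly what consistent hashing (covered below) avoids:

```python
# Naive key/hash-based partitioning: partition number = hash(key) mod N.
def partition(key, num_partitions):
    return hash(key) % num_partitions

keys = [f"user:{i}" for i in range(1_000)]

before = {k: partition(k, 4) for k in keys}  # 4 shards
after = {k: partition(k, 5) for k in keys}   # a 5th shard is added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys had to move")  # typically ~80%
```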

Common sharding problems:

1. You can’t really do efficient joins across sharded databases. Denormalized databases avoid this problem at the cost of data redundancy.

2. It’s harder to maintain data integrity, and often you have to run SQL cleanup jobs.

3. If you have to make more shards or rebalance what’s there, that can be challenging and create downtime.

INDEXES

One or more columns of a table that allow faster data lookup, like a book's index. Common example: IDs for orders. Find the ID and you get the whole order much faster than iterating over every row.

Tradeoff: write performance drops, because every insert, update, and delete also has to update the index. If you need to write more than you read, don't use an index.
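
A sketch of why the read side wins: an index is essentially a lookup structure (here, a plain dict mapping order ID to row position) maintained alongside the table:

```python
# A million-row "table" of orders.
orders = [{"id": i, "item": f"item-{i}"} for i in range(1_000_000)]

# Without an index: linear scan over every row.
def find_slow(order_id):
    for row in orders:
        if row["id"] == order_id:
            return row

# The index maps ID -> row position. Every insert/update/delete must
# also maintain this mapping, which is the write-performance tradeoff.
index = {row["id"]: pos for pos, row in enumerate(orders)}

def find_fast(order_id):
    return orders[index[order_id]]

find_fast(999_999)  # one dict lookup instead of up to a million comparisons
```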

PROXIES

intermediate servers between the client and the back-end server (middleware)

they filter/log/transform requests, and can also cache and batch requests

Types:

* open: anyone can access it; an anonymous proxy hides the client's IP, a transparent one reveals it

* reverse: sits in front of servers; the proxy fetches resources from the server on the client's behalf and returns them as if they came from the proxy itself

REDUNDANCY AND REPLICATION

* Redundancy: duplicate critical components, like keeping backups, in case the main system fails

* Replication: share info and updates between redundant components to improve fault tolerance and reliability

SQL VS. NOSQL

Two types of databases:

* Relational: structured with predefined schemas, stored in rows and columns in tables, queried with SQL. Vertically scalable.

* Non-relational: unstructured and distributed with a dynamic schema, AKA NoSQL. Easier to scale horizontally.

Common types of NoSQL:

-key-value stores: data stored as key-value pairs

-document stores: documents of data grouped into collections

-wide-column: columnar databases, where columns can vary per row

-graph: structures with nodes, their properties, and the connections (edges) between them

High-level differences:

SCHEMA: SQL has a fixed schema; NoSQL can adjust its schema on the fly

QUERYING: SQL has strong structured queries, noSQL is more dynamic

SCALABILITY: SQL vertically scales, NoSQL is better at horizontal scaling

ACID: atomicity, consistency, isolation, durability => SQL is better at this; NoSQL trades some of it away for performance and scalability

USE SQL WHEN: you need ACID compliance and your data is structured and unchanging

USE NOSQL WHEN: you need to store large volumes of data, you need speed and rapid development/iteration, if you’re really invested in cloud computing and storage

CAP THEOREM

For a distributed software system, pick two out of three: Consistency, Availability, Partition tolerance. All of them at the same time is impossible.

CONSISTENCY: all nodes see the same data at the same time, and are updated before further reads

AVAILABILITY: every request gets a response, whether it succeeds or fails; achieved by replicating data

PARTITION TOLERANCE: the system keeps working despite message loss or partial failure; replicate data across nodes/networks to get this

CONSISTENT HASHING

Distributed systems use hash tables to map keys to cache servers. This is useful, but the straightforward approach (key hash mod number of servers) has two big weaknesses:

* if a new cache host is added, almost all of the keys have to be remapped to different hosts, which effectively flushes the cache and causes downtime.

* it might not be load-balanced, AKA, some keys might get way more data than others and it’s not uniformly distributed

Consistent hashing is a method that distributes data more evenly and reduces reorganization when nodes change. It works like a ring of cache servers: both servers and keys are hashed onto the ring, and each key is stored on the first server found moving clockwise from the key's position. Then, if a server is added or removed, only the keys between it and its neighbor need to move. It also uses virtual replicas for each cache (multiple points on the ring) to balance the data distribution.
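
A minimal sketch of such a ring with virtual replicas; the server names and replica count are arbitrary:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def _position(self, label):
        return int(hashlib.md5(label.encode()).hexdigest(), 16)

    def add(self, server):
        # Each server gets several virtual points on the ring.
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._position(f"{server}#{i}"), server))

    def get(self, key):
        # Walk clockwise to the first server at or past the key's position.
        i = bisect.bisect_left(self.ring, (self._position(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get("user:42"))
# Adding "cache-4" later only moves the keys between it and its neighbor.
```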

LONG-POLLING VS. WEBSOCKETS VS. SERVER-SENT EVENTS

An HTTP request goes like this:

1. The client opens a connection and requests data from the server.

2. The server calculates the response.

3. The server sends the response back to the client on the opened request.

AJAX is this:

1. The client opens a connection and requests data from the server using regular HTTP.

2. The requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds).

3. The server calculates the response and sends it back, just like regular HTTP traffic.

4. The client repeats the above three steps periodically to get updates from the server.

But most of these responses come back empty (no new data), which creates a lot of overhead.

Long-polling is a standard request, but the client expects that the info might not be available yet, so the server holds the request open until the data is ready. The request can also time out, after which the client just reconnects.

WebSocket gives a persistent client/server connection established through a handshake (where both sides agree on the rules for how the data exchange will go). Then data can be sent back and forth at any time in a standardized way, with less overhead and real-time transfer.

Server-Sent Events (SSE) use a persistent connection too, but it's only used for the server to send data to a client. Good for pushing multiple events to a client as needed.
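
As an example of the client side of long-polling, here's a sketch using the requests library; the URL is hypothetical, and the server is assumed to hold each request open until new data arrives:

```python
import requests

def long_poll(url):
    while True:
        try:
            # The server holds this request open until data is ready;
            # the timeout covers the "nothing happened for a while" case.
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                yield response.json()  # got an update
        except requests.exceptions.Timeout:
            pass  # expected: just re-issue the request immediately

for update in long_poll("https://example.com/updates"):
    print(update)
```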