High Level Design

 HLD Basics & Consistent Hashing

  • Technologies keep changing. What matters is building the intuition for which systems fit best and how the architecture works.
  • LLD is about how to structure code
  • HLD is about how to structure components and machines, and what each of those machines does.
  • LLD & HLD together make up system design
  • del.icio.us(built during 2005/06) -> acquired by Yahoo at later point
  • ICANN -> non profit central authority that manages domain names
  • GoDaddy or domains.google.com -> resellers who sell on behalf of ICANN
    • to get domain name to ip mappings
  • ISPs (Airtel, Jio, Google, Cloudflare (CDN), ...) maintain DNS servers with a copy from ICANN for faster lookups
  • DNS propagation takes 24 hours to sync across the world
  • ISP
    • One can connect to internet only via ISP(Internet service provider)
    • Every isp has range of ip addresses allocated to them. 
    • The ISP picks one available IP address in the range and allocates it to your system
    • One has option to ask for static IP address by paying additional amount
  •  del.icio.us
    • start with one machine 
    • When more than one machine is present, a load balancer is required to route traffic during downtimes/patching and to distribute load.
    • Is the load balancer a single point of failure?
      • Register multiple load balancer IPs in DNS and return all of them. The browser can talk to whichever IP responds.
      • Keep a standby load balancer that becomes active when the primary load balancer dies.
    • Ways to find whether a machine is live or not
      • Heart beat(servers will tell balancer that they are alive)
      • Health check(load balancer will ping the servers to check the status)
    • Zookeeper(distributed system) keeps track of status of all machines and notify when they are down
    • 1 Million (10^6) bytes is ~1 MB
    • Storing the data in every server is not scalable due to storage capacity being limited
    • Scaling
      • Vertical scaling
        • better hardware
      • Horizontal scaling
        • Add more machines & split data among the machines(Sharding)
        • How to avoid load skew & how does the load balancer know which machine to route to
        • Distributing the users/load to servers should be extremely lightweight (cheap to compute).
        • When new machines are added/removed, it should still work with out being too expensive
          • UserID % machines -> fails when machines are added/ removed
          • Range of users to machine -> fails when machines are added/ removed
          • Approach that works is Consistent hashing
        • Consistent Hashing(Standard)
          • Take a huge range, say [0, 10^18] (fits in a 64-bit integer)
          • Come up with 2 hash functions, one for machines & the other for users. The range for both should be the same.
            • A hash function Hash(machine_num) generates a number in [0, 10^18]. Each machine is placed on the ring at its hash value.
            • Another hash function Hash(user) also generates a number in [0, 10^18]. The machine with the smallest hash value greater than hash(user) (wrapping around the ring) serves the request.
          • Drawback: when any machine dies, the load on the next machine increases, which can subsequently die and cause a cascading failure of all servers.
        • Modified Consistent Hashing
          • Generate multiple hashes for machines (say 3 diff hash functions)
          • This will distribute load evenly across other servers when machine is added/removed/dies
        • How is data copied while avoiding downtime
          • Happens in 2 stages
          • For the case where a new machine is added
            • First copy the data, then add the machine
            • Then copy the delta that accumulated during the copy above
            • The delta is copied after the new machine is added (eventual consistency)
        • Example of hash function
          • Given a number, generate a large string:
            m = MD5(userId);
            convert the string m to an integer & take mod 10^18
        • Why consider 10^18 -> a huge range reduces collisions
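A minimal sketch of the ring described above, assuming a Java TreeMap as the ring and the MD5-mod-10^18 hash from the example; multiple virtual nodes per machine implement the "modified" variant. Class and method names are illustrative.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring: machines and users hash into [0, 10^18);
// a user is served by the first machine clockwise from its hash.
public class ConsistentHashRing {
    private static final BigInteger RANGE = BigInteger.TEN.pow(18);
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();
    private final int virtualNodes; // "modified" consistent hashing: several points per machine

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    // MD5(key) -> big integer -> mod 10^18, as in the example above
    static BigInteger hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).mod(RANGE);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void addMachine(String machine) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(machine + "#" + i), machine); // several hashes per machine
        }
    }

    public void removeMachine(String machine) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(machine + "#" + i));
        }
    }

    // First machine with hash >= hash(user); wrap around to the start if none.
    public String machineFor(String userId) {
        BigInteger h = hash(userId);
        SortedMap<BigInteger, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(3);
        ring.addMachine("m1"); ring.addMachine("m2"); ring.addMachine("m3");
        System.out.println(ring.machineFor("user42"));
        ring.removeMachine("m2"); // only keys near m2's points move, not everything
        System.out.println(ring.machineFor("user42"));
    }
}
```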
        • Usecase of Sharding
          • Store large amount of data across machines
        • Usecase: Consistent hashing is used wherever sharding is used.
          • NoSQL DBs: Cassandra, HBase, DynamoDB, Elasticsearch
      • Q&A
        • Reverse Proxy/ Load Balancer
          • When a request comes in, the work is outsourced to a different machine that the client has no direct exposure to; the proxy gets the response and relays it back.
  • Having data & code on same machine is good or bad model
    • Both on same machine: Tight coupling
      • A code deployment makes the data unavailable too
      • Fewer resources for code as the DB consumes RAM & CPU
    • Different machines
      • network latency
      • Appln servers will be stateless
      • Autoscaling feature can be utilized as they are stateless
      • Zookeeper is required to notify apps servers about db machines that are added/removed.
  • Min layers in any appln system
    • Load balancer
    • apps servers
    • DB servers
  • Work of apps server
    • Fetch bookmarks from DB
    • Fetch Images/multi-media
  • Caching is storing things close to you to fetch fast.
    • Browser storage(Caching starts at browser)
      • cookies
      • session storage
      • java script files
      • images
      • How browser works with html files
        • First the load balancer returns only the HTML file, which is text and has links to the CDN
        • ISP has integrations with CDN. ISP DNS will give preference to nearest IP address
        • Redirection based on request origin (fb.com -> to facebook.com, google.com -> google.co.in)
    • CDN(2nd level caching)
      • Akamai
      • Cloudflare
      • Cloudfront
      • fastly
      • They will reroute to nearest data center
    • Local Caching
      • Cache in server itself
    • Global Caching
  • Cache - Challenges/Limitations
    • Limited in size
    • Not actual source of truth(can cause data inconsistency)
      • Cache Invalidation time
        • TTL
      • Write-through system
        • All writes to the DB go through the cache. It updates the DB, then itself, & returns success.
        • If the cache doesn't have an entry, it first gets it from the DB and then updates
        • Writes are slower and reads are faster
        • Useful for read-heavy systems
      • Cache Invalidation types
        • Write-back cache
          • Update the cache & return success
          • Sync with the DB after returning success
          • Gives high throughput
          • Good for applications like Google Analytics
        • Write-around cache
          • Writes go to the DB directly; the cache is refreshed (after TTL expiry) on a later read
          • Good for applications like DNS entries (see the write-policy sketch after this list)
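A small sketch contrasting the write-through and write-back paths above, with in-memory maps standing in for the real cache and DB; all class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy contrast of write-through vs write-back; HashMaps stand in for a real cache and DB.
public class WritePolicies {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> db = new HashMap<>();
    private final Set<String> dirty = new HashSet<>(); // keys updated in cache but not yet in DB

    // Write-through: update DB and cache, return success only after both are done.
    public void writeThrough(String key, String value) {
        db.put(key, value);     // slower write path...
        cache.put(key, value);  // ...but reads right after the write are fast and consistent
    }

    // Write-back: update only the cache and return; the DB is synced later.
    public void writeBack(String key, String value) {
        cache.put(key, value);
        dirty.add(key);         // remember to flush; data is lost if the cache dies before flushing
    }

    // Periodic/async flush of dirty entries to the DB (eventual consistency).
    public void flush() {
        for (String key : dirty) db.put(key, cache.get(key));
        dirty.clear();
    }

    public String read(String key) {
        String v = cache.get(key);
        if (v == null) {                  // cache miss: fall back to DB and populate the cache
            v = db.get(key);
            if (v != null) cache.put(key, v);
        }
        return v;
    }
}
```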
      • Cache Eviction
        • LRU -> Songs
        • MRU-> one time use cases -> covid certificate
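For the LRU policy above, Java's LinkedHashMap already supports access ordering; a minimal sketch (capacity and names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache built on LinkedHashMap's access-order mode: the least recently
// accessed entry is evicted once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true reorders entries on get()
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the eldest (least recently used) entry
    }

    public static void main(String[] args) {
        LruCache<String, String> songs = new LruCache<>(2);
        songs.put("s1", "song 1");
        songs.put("s2", "song 2");
        songs.get("s1");            // s1 becomes the most recently used
        songs.put("s3", "song 3");  // evicts s2
        System.out.println(songs.keySet()); // [s1, s3]
    }
}
```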
      • Cache hit/miss
      • Local caching of data on app servers (how to invalidate/update when files change on the file storage), e.g., Facebook news feed
HLD: Caching Continued
  • Caching for huge files
    • TTL : File can become invalid very quickly OR invalidation will be too late 
    • Use DB to maintain meta data & go thru it for local cached data
    • https://gist.github.com/jboner/2841832

    • Redis
      • single threaded, key value store
      • https://try.redis.io/
      • https://redis.io/docs/data-types/sets/
      • https://redis.io/docs/data-types/sorted-sets/
    • why global cache(Redis)
      • Queried often
      • Derived information(pre computing)

    • Global Cache
      • This reduces cache misses; with only local caches, every app server misses separately
    • Facebook newsfeed (case study)
    • ~3 billion total users (~2 billion monthly active, ~500 million daily active); 80% of users only read, 20% interact by liking & commenting, 1% contribute posts (~5 million posts per day)
    • 100 bytes per post -> 5 million contributing users * 2 posts per day * 100 bytes = 1 GB per day
      • Users, User_Friends, Posts
      • Profile Page
        • Posts made by respective user
      • NewsFeed
        • Posts made by the user's friends
        • How to optimize news feed generation
          • Maintain recent posts (last 30 days) for all users in a single table
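One way to keep "recent posts per user" precomputed is a Redis sorted set scored by post timestamp. A rough sketch assuming the Jedis client and a local Redis; the key naming and the 1000-entry trim are assumptions of this sketch.

```java
import redis.clients.jedis.Jedis;

// Sketch: keep each user's recent posts in a Redis sorted set, scored by timestamp,
// so the newsfeed is a precomputed read instead of a query over the Posts table.
public class RecentPostsCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Called on the write path whenever a user creates a post.
    public void addPost(String userId, String postId, long timestampMillis) {
        String key = "recent_posts:" + userId;            // assumed key naming
        jedis.zadd(key, timestampMillis, postId);
        jedis.zremrangeByRank(key, 0, -1001);             // keep only the newest 1000 entries
    }

    // Read path: newest posts first, straight from the cache.
    public Iterable<String> recentPosts(String userId, int count) {
        return jedis.zrevrange("recent_posts:" + userId, 0, count - 1);
    }
}
```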

    • Redis - also keeps a copy on the hard disk
    • Memcached - stores data only in RAM
    • Hazelcast and Aerospike are other stores similar to Redis
CAP Theorem & Master-Slave
  • https://www.linkedin.com/in/prakhar3agrwal/
  • https://www.youtube.com/user/prakhar3agrwal
  • CAP (Consistency, Availability, Partition tolerance)
    • In case of a network partition (two nodes cannot talk to each other, but each can still talk to clients), we have to choose between consistency & availability.
  • PACELC
    • It is an extension of the CAP theorem
    • In case of network partition, one have to choose between availability & consistency
    • If not network partition, then one have to choose between Latency & Consistency
  • Master-slave
    • Shard is collection of machines
    • replicas/slaves -> copies of master
    • High availability, low latency, no consistency guarantee (not even eventual)
      • Master takes write & returns status
      • Async, best try to sync data to slave
    • High availability, Eventual consistency, Low latency
      • Master write, one slave writes & returns status
      • All slaves sync eventually
    • Highly consistent (strong consistency)
      • Master and all slaves write & return success
      • This can impact latency & availability
    • Gossip, RAFT, Paxos
      • One slave transferring the data to other slaves is Gossip
  • Zookeeper
    • It is cluster of odd number of nodes
    • Keeps track of who is master & who is slave
    • Ephemeral node - the node/file on which the current master holds a lock in ZooKeeper
    • If the master node fails to send heartbeats, ZooKeeper releases the lock
    • Slaves then try to acquire the lock; one of them gets it and becomes the master (see the sketch below)
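A bare-bones sketch of that ephemeral-node pattern with the plain ZooKeeper client: whoever creates /master first is the leader, and the node disappears automatically when that session's heartbeats stop. The path, address and timeout are illustrative.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Ephemeral-node based leader election: the /master znode exists only while the
// winner's session is alive, so a dead master automatically releases the "lock".
public class MasterElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });
        String myAddress = "10.0.0.5:9000"; // illustrative address of this node

        try {
            zk.create("/master", myAddress.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("I am the master");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else holds the ephemeral node; read it and watch for it going away.
            byte[] data = zk.getData("/master",
                    watched -> System.out.println("master changed, retry election"), null);
            System.out.println("Current master: " + new String(data));
        }
    }
}
```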
SQL vs NoSQL
  • E-commerce - a fixed SQL schema doesn't work well; the data is not SQL friendly
  • The power of SQL queries (joins) is reduced when data is spread across machines
  • SQL DBs don't have good out-of-the-box sharding
  • how to find sharding key
    • identify regular queries or reports
    • find the common key across queries
    • common key will be probable candidate for sharding key
    • sharding key for 
      • uber/ola systems will be location id
      • Irctc - ticket id
    • Slack
      • get all groups for a user, org
      • get recent messages for the groups
      • send message(group id, sender, text, message_id, ..)
      • get group metadata for a group
    • YouTube
      • getVideos(channel)
      • getNewsFeed(user, location..)
      • getVideosTrending
      • Search(topic)
      • topic()
      • playVideo()
      • Datamodel
        • user, usertag, userhistory, videosdb (tags)
    • NoSQL DB schemas
      • Key-value pairs (similar to a hash map), e.g., Redis
      • Document DB (JSON), e.g., MongoDB
        • Good for random & fragmented schemas with filtering requirement. Within filtering you don't have too many entries.
      • Column family (SQL-like tables, e.g., Cassandra, HBase)
        • A sharding key will be defined
        • A column family is like a table in SQL that groups related columns together
        • For an email DB, profile is one column family, threads another, emails another, and so on
        • Entries within a column family are by default ordered descending (most recent first)
        • This is useful when matching entries are huge and accessing recent ones is quick eg: email
  • Design Data base with following requirements
    • All info doesn't fit on single machine(has shard key)
    • No manual load balance of machines. Only set of machines(cluster) are given.
    • when load goes up, add more machines to cluster
    • DB should never lose data
    • Good to have(consistent vs availability)
  • Solution
    • A cluster & an initial set of machines
    • Sharding key
Case Study 2 (TypeAhead)
  • Design problems are open ended. If one follows a random framework, the discussion can go in any direction for hours without anything concrete. Hence, it is important to put a structure to what is going to be discussed.
  • One possible structure is to breakdown into 4 parts
    • Requirement Gathering: Make sure you understand what you are building and what are minimum set of things required for it(mvp)
    • Estimate Scale
      • Read Heavy/Write Heavy
      • Sharding required OR not
      • How much Replication required
      • QPS
    • Design Goals
      • Consistency vs Availability /latency
    • APIs(how external world call me) + System Design
      • Storage requirements
      • apps servers role

  • Type Ahead
    • MVP
      • Given few letters show me 5 suggestions
      • Given letters should be prefix
      • Most frequently searched terms
      • No personalization
      • min 3 letters
      • levenshtein distance not in scope
    • Estimation of Scale
      • Read: entering few chars and waiting for response
      • Write: Search operation as it updates frequency in system
      • This is both read & write heavy system.
      • #number of searches in a day on Google is around 10 Billion
      • #number of new terms per day : 25%  & 75% existing terms
      • Each search is say 100 chars; new terms will be 2.5 Billion * 100 bytes = 250 GB per day
      • Space for 10 years = 250 GB * 365 * 10 ≈ 900 TB
    • Design Goals
      • High Availability as we care about pattern and NOT about exact frequency
      • Latency should be extremely low
    • API
      • getSuggestions with search prefix
      • updateFrequency(search phrase)
      • Note: every trie node stores the top 5 suggestions for its prefix as a list, and the frequency of the word (if any) that ends at that node (see the trie sketch after this list)
      • How to reduce writes
        • Sampling
          • e.g., only count a search when rand() % 1000 == 0 (sample 1 in 1000), then add it to the hashmap
        • Threshold driven
          • Buffer (term, frequency) counts in a hashmap and flush to the trie once a threshold is crossed
      • Sharding
        • By the first 3 chars -> 26*26*26 = 17,576 possible prefixes, mapped to machines via consistent hashing
      • Popularity is combination of (frequency, recency)
      • Time Decay
        • Time decay factor
          • Every day, divide every frequency by the time-decay factor so recent searches weigh more
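A compact sketch of the trie noted above: each node caches its top suggestions so getSuggestions(prefix) is just a walk down the trie plus a constant-time read. The top-5 maintenance here is simplified (a capped map refreshed on every update) and all names are illustrative.

```java
import java.util.*;

// Sketch of the type-ahead trie: every node caches the top 5 completions under it.
public class TypeAheadTrie {
    private static final int TOP_K = 5;

    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        Map<String, Long> top = new HashMap<>(); // term -> frequency, capped at TOP_K
    }

    private final Node root = new Node();

    // Write path: bump the term's frequency and refresh the cached top-5 along the path.
    public void updateFrequency(String term, long newFrequency) {
        Node node = root;
        updateTop(node, term, newFrequency);
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
            updateTop(node, term, newFrequency);
        }
    }

    private void updateTop(Node node, String term, long frequency) {
        node.top.put(term, frequency);
        if (node.top.size() > TOP_K) {
            // evict the currently lowest-frequency candidate
            String min = Collections.min(node.top.entrySet(), Map.Entry.comparingByValue()).getKey();
            node.top.remove(min);
        }
    }

    // Read path: walk to the prefix node and return its cached suggestions, best first.
    public List<String> getSuggestions(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return List.of();
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(node.top.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        TypeAheadTrie trie = new TypeAheadTrie();
        trie.updateFrequency("cricket", 900);
        trie.updateFrequency("crime", 400);
        trie.updateFrequency("croissant", 120);
        System.out.println(trie.getSuggestions("cri")); // [cricket, crime]
    }
}
```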
No SQL
  • Replication level: #machines to which each piece of data is replicated
  • Advantages of replication
    • Higher traffic can be supported
    • A slave can take the master's place in case the master goes down (failure handling)
  • How to create your own nosql db
    • How to create particular replication level
    • How does db increase/decrease #of shards automatically
  • Qn: Imagine you are creating a DB
    • It has huge amount of data in TB/PB
  • Sharding is required for Huge amount of data & Replication required to handle high number of requests.
  • How will NOSql DB scale
    • Multiple shards -> each shard with one master & multiple slaves
      • E.g., BigTable by Google (one of the first NoSQL DBs)
    • Multi-master setup
      • Cassandra (Facebook)
      • DynamoDB(Amazon)
  • Approach 1: Master-slave with consistent hashing
    • BigTable uses this master-slave architecture
    • Say we have 15 machines
    • Based on load, I divide my data into multiple shards using consistent hashing approach
      • Requests will come to manager like zookeeper.
      • Based on sharding key, manager decides which shard the request go to
      • Each shard has master + 2 read replicas
      • Create 4 shards and it will leave 3 spare machines
      • if #spare machines is <= #machines required for a shard, add them as extra replicas
      • If a machine dies from a shard with machines> replication count, then do nothing
      • When to add new shard
        • DB space
        • CPU utilization
        • #of queries
        • As a thumb rule - when ~66.67% load is reached
      • How a new shard is introduced
        • 2 stages
          • Staging stage
            • new machines are created
            • Data starts getting copied from the earlier shards
            • The part of the data that should now belong to the new shard (per the consistent hashing algorithm) is copied from the read replicas of the other shards
            • new read/write queries for the new requests are being served from earlier shards
          • Live Stage
            • Consistent systems
              • Downtime for users on new shard
              • Data is copied and then made live
            • High availability (non-consistent) systems
              • No downtime required; the new shard starts serving requests immediately while the remaining data is copied to it asynchronously from the prior shards
  • Approach2: Multi-Master architecture
    • Amazon Dynamo & Facebook Cassandra
    • Usecase: Too many writes
    • Assume the machines are placed on a ring as per the consistent hashing mechanism
    • When a user request comes, store the data on the next n distinct machines to the right (n = replication factor)
    • Cassandra has tunable consistency: for a given replication level, the client controls R and W
      • It allows user to set 2 properties
        • Number of machines to read from before sending results
        • Number of machines to write to before sending success
      • R + W > replication level -> consistent system (e.g., replication level 3, R = 2, W = 2)
      • R + W <= replication level -> available (but possibly stale) system
HLD: Multi Master
  • Storage in NoSQL
    • Read & write should be fast

  • Computation across machines/shards
  • 2 level of tables
    • Lookup Table -> key, start block, end block  -> in harddisk
    • Index table -> sortedset, startaddress, #of entries, entrysize -> In RAM
    • Actual Data
    • LSM Tree - Log-Structured Merge tree
      • To reduce the search space on the hard disk
  • Distributed Computation
    • Map Reduce
      • Map
      • Shuffling
      • Reduce
  • Machine 1: A has friends B, C, D; F has friend C
  • Machine 2: B has friends A, C, E; E has friend B
  • Machine 3: C has friends A, B, F; D has friends F, G
  • Task: return the list of mutual friends for each pair (see the sketch below)
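A single-process sketch of the map -> shuffle -> reduce flow for the mutual-friends task above: map emits (sorted pair -> friend list), shuffle groups by pair, reduce intersects the two lists. It only simulates what a real framework would distribute across machines.

```java
import java.util.*;

// Single-process simulation of map -> shuffle -> reduce for mutual friends.
public class MutualFriends {
    public static void main(String[] args) {
        // Input spread across "machines" in the notes above; flattened here for the simulation.
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("A", List.of("B", "C", "D"));
        friends.put("F", List.of("C"));
        friends.put("B", List.of("A", "C", "E"));
        friends.put("E", List.of("B"));
        friends.put("C", List.of("A", "B", "F"));
        friends.put("D", List.of("F", "G"));

        // Map + shuffle: for every (person, friend) edge emit key "min-max" -> person's friend list.
        Map<String, List<List<String>>> shuffled = new HashMap<>();
        for (Map.Entry<String, List<String>> e : friends.entrySet()) {
            for (String friend : e.getValue()) {
                String key = e.getKey().compareTo(friend) < 0
                        ? e.getKey() + "-" + friend : friend + "-" + e.getKey();
                shuffled.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getValue());
            }
        }

        // Reduce: a pair with lists from both sides has mutual friends = intersection of the two lists.
        for (Map.Entry<String, List<List<String>>> e : shuffled.entrySet()) {
            if (e.getValue().size() < 2) continue; // one-sided edge, no data from the other person
            Set<String> mutual = new HashSet<>(e.getValue().get(0));
            mutual.retainAll(e.getValue().get(1));
            System.out.println(e.getKey() + " -> " + mutual);
        }
    }
}
```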
Case Study3: Messaging
  • MVP
  • Estimation of Scale
    • Sharding
    • Read vs Write
  • Design Goals
  • API 
  • Architecture
  • Design of a Messaging Application
    • WhatsApp (1-1, no server-side storage)
    • Facebook Messenger/Telegram (with storage)
    • Slack/Microsoft Teams/Discord (group chats with storage)
  • MVP Requirements
    • Ability of someone to chat 
    • Authentication
    • Recent chat history
    • Online Indicator
    • Sent/Delivered/Read indicator
    • Only text vs multi-media messages
    • Group messaging
    • Up to 10K chars per text message
    • Notifications
    • store messages
  • Estimation of scale
    • read heavy/write heavy
      • A messaging app has roughly the same ratio of reads and writes
      • Large number of writes - Cassandra, HBase
      • Large number of reads - SQL
      • Reads & writes are almost equal - use a DB optimized for writes, with some optimizations for reads
      • message (sender, recipient, time, id, text)
      • avg size of text can be 250 unicode chars say. 
      • 50Billion messages per day on whats app
      • 50B * 1KB = 50 TB per day - hence sharding is required
      • in 5 years = 50TB * 365* 5 = ~90PB
    • will I need to shard
      • #users
      • message length
      • #requests per sec
    • https://www.imdb.com/title/tt2575988/
    • Messages per sec ≈ 50B / 86,400 ≈ 580K per sec
    • Assuming peak traffic is ~3x the average, peak QPS ≈ 1.7 million messages per sec
    • Design goals
      • low latency(time per task)
      • high throughput(#tasks per sec)
      • Facebook : consistency, availability
        whatsapp: highly available & eventually consistent
        slack, MS  teams: highly available & eventually consistent
    •  API Design
      • sendMessage(sender, receiver, text, idempotencyKey, ...)
        • Idempotency key - not a db key, it is just to prevent duplication and redundancy.
        • If server receives message of same idempotency key twice, it just rejects the 2nd one
      • getRecentConversations(userId, limit, offset)
      • getMessages(senderId, receiverId, limit, offset)
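A tiny sketch of the idempotency-key check described above, using an in-memory set as a stand-in for whatever store the servers actually share (Redis with a TTL in practice); names are illustrative.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Duplicate suppression for sendMessage: the same idempotency key is accepted only once.
public class IdempotentSender {
    // In practice this set would live in a shared store (e.g., Redis) with a TTL.
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    public boolean sendMessage(String sender, String receiver, String text, String idempotencyKey) {
        if (!seenKeys.add(idempotencyKey)) {
            return false; // second delivery of the same request: reject it
        }
        // ... persist the message and fan it out ...
        return true;
    }
}
```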
    • Architecture
      • Sender -> Server -> Receiver (client-server architecture, not peer-to-peer)
      • Storage -> messages should be sharded
      • What should be the param on which sharding should happen
        • By conversation id
          • This requires read of all shards to getRecentConversation of the user
        • By UserId
          • sendMessage is written to 2 shards in case the sender & receiver are on different shards
          • getMessages - can be read from any one shard
          • getRecentConversations - one shard
        • How to handle send when sharding by UserId
          • The message should be stored in 2 shards
          • Save to the recipient's shard and send success; sync to the sender's shard after sending success.
HLD: Zookeeper + Kafka
  • Term of the day: Reliability
    • If your system is able to perform correctly in the event of adversity
  • Problem: LinkedIn (formed in 2003)
    • Design goals
      • Need a message broker system which can support high throughput
  • Why do we need Kafka
  • What is apache Kafka
  • Cluster, Broker, Core of Kafka
  • Producer & Consumer
  • Demo
  • Producers & Consumers
  • Assignment(Lamda & SQS)
  • Demo
    • Start Zookeeper
    • Start Kafka
    • Create kafka topic "Kafka demo" with replicationFactor 1 & Partitions 1
      • This creates a directory with name "Kafka demo" & messages are saved in .log file
    • Create Producer for topic  "Kafka demo" 
    • Create Consumer for topic  "Kafka demo" 
  • Simple queuing service (SQS): 
    • One producer & one type of consumer, where a message is deleted once any consumer consumes it
  • Partitions
HLD: Case Study 3(Messaging) 2
  •  Sharding Keys evaluated
    • userId
      • is better
      • We have to write to 2 shards: write to the receiver's shard first and update the sender's shard asynchronously
    • ConversationId
      • has draw backs
  • Architecture
    • User sends/receive a message
      • Load Balancer
        • Many appln servers
          • connected to a data store sharded by userId
    • Type of DB
      • Reads and writes are equal. Hence, we choose one db which is read optimized and another which is write optimized. Keep both of them in sync as per requirement.
        • Use NoSQL (HBase) for faster writes. In HBase, a write is just an append to the transaction/commit log. (MongoDB updates the disk with values as well, so it will not be as fast as HBase.) The second update goes to Redis for faster reads.
        • Reads will be best with In Memory Cache(Redis)
          • 3 types of Cache systems
            • Write Thru(Write to Cache and disk. Return success only when both return true). This is appropriate strategy as READ happens immediately after Write.
            • Write-around (write to the DB directly; the cache is populated on a later read)
            • Write back(Write to cache and return success. Sync with disk at later stage). This strategy cannot be used in this case.
      • how to notify to Users about message
        • In HTTP, server cannot initiate a connection to client. Server only responds to client created connection.
        • 3 ways
          • Polling: the client repeatedly (say every 10 secs) asks the server if there are new messages
            • Resource intensive
            • Too many requests at server
            • Waste lot of resources
            • Not real time
          • Long Polling
          • Web Sockets
            • As soon as a user's device comes online it opens a connection with the server. The connection remains open as long as the user is connected to the internet. The server can use this connection to send notifications and messages, and the client can use it for anything.
            • HW: 
              • Learn how websockets work 
              • how to create web socket connection.
              • chat appln using socket.io https://socket.io/get-started/chat
            • WebSocket is a protocol that works over HTTP & TCP where a connection is established between client & server and persisted, allowing client & server to talk to each other whenever needed.
        • Usecase of Kafka
          • Every server subscribes to Kafka for messages of the users that have established WebSocket connections with it
          • Whenever a new message is sent, after storing the message in Redis and HBase, an event is sent to Kafka.
      • Notification Service
        • What notification to send
        • Who to send the notification to
        • Where to send the notification
        • Notify() sends the message to queue
        • Workers take the message from the queue and invoke the email service, Firebase Cloud Messaging, or Apple Push Notification service.
      • Online Indicator
        • Web socket connection is present?
        • app is open (whats app uses it)
        • app is open and person is interacting(MS teams)
        • Consistency vs availability - approximate consistency is fine; no need for 100% consistency
      • Heart beat mechanism
        • As soon as a user opens the app, it sends a heartbeat every 10 secs to the server telling the server that the user is online
        • Redis cache to store userId, lastReadOn
        • Write back cache and sync data to sql db for persistence
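A rough sketch of the heartbeat-backed online indicator above, assuming the Jedis client and a 30-second liveness window; the key naming and the window size are assumptions.

```java
import redis.clients.jedis.Jedis;

// Online indicator: each heartbeat refreshes a short-lived Redis key; a user is
// considered "online" while that key still exists.
public class OnlineIndicator {
    private static final int LIVENESS_SECONDS = 30; // assumed window: ~3 missed heartbeats
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Called every ~10 seconds by the client while the app is open.
    public void heartbeat(String userId) {
        jedis.setex("online:" + userId, LIVENESS_SECONDS, String.valueOf(System.currentTimeMillis()));
    }

    public boolean isOnline(String userId) {
        return jedis.exists("online:" + userId);
    }
}
```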
    • Complete Design
      • Users 
        • Load balancer
          • application servers
            • Message service
              • Redis write thru cache(as we need high consistency)
              • HBase
            • Online indicator service
              • Redis write back cache
              • SQL(using batch processing)
            • Queue
              • to send message & status to correct appln server
  • Great vs Avg software engineer
    • Logical thinking or Problem solving ability. Given a scenario, what is right kind of system to use. Knowing the possibilities of architecture to solve.(design principles)
    • Pick up things quickly
    • Ability to understand a system and why it is designed the way it is
  • Master - slave architecture
    • When write comes, it should write to Master first
    • When the master is dead, a new master should be elected, and every other server should be informed
  •  ZooKeeper
    • Track health of master
    • Inform all apps servers/slaves of who the master is
    • When data is stored in ZooKeeper, it is saved in a directory-like structure
    • A file in ZooKeeper is called a znode (ZooKeeper node).
    • Every znode is one of 2 types: ephemeral or persistent
    • The master node writes its IP to an ephemeral znode and sends a heartbeat every 5 secs
    • All app servers & slaves subscribe to ZooKeeper node
    • Whenever the info in a znode changes, ZooKeeper informs all subscribers about the change
    • Flow
      • User request
        • load balancer
          • apps servers
            • read master db node details for first time
              • send write request to the master node
    • zookeeper is collection of machines, usually Odd number of machines
    • Quorum is a majority: floor(x/2) + 1 or more machines
    • Read and Write to zookeeper happens from Quorum number of machines
    • Assignment: https://docs.google.com/document/d/1P72fx_a6VwKfyZbZBrulJhLqRFFsCmqeZOTat2cCdRo/edit?usp=sharing
  • Kafka
    • Topic is a category within Queue
    • A persistent queue which is broken down into topics
      • Kafka is reported to be ~2x faster than RabbitMQ
      • Kafka
        • For event streaming
      • Persistent queue is different from DB. It has Event retention period which is 1 week by default
      • FIFO/only append
      • Immutable messages/events
      • We can specify partitions for a Topic
        • This helps avoid being constrained by one machine
        • Parallelize consumption
      • Machines inside Kafka are known as Brokers/Servers
      • Consumer Groups
        • Specify which consumer will read from which partition
        • Each consumer maintains offset at partition level
        • Consumers commit their offsets to the __consumer_offsets topic so they can resume from where they left off if they come back after going down

  • Demo
    • Kafka messages/logs
    • Kafka fault tolerance
    • Produce messages programmatically
    • Start zookeeper server
    • Start Kafka server
      • bin/kafka-server-start.sh config/server-2.properties
    • Create Topic 
      • bin/kafka-topics.sh --create --topic myTopic13 --zookeeper localhost:2181 --replication-factor 1 --partitions 1
      • cd myTopic13-0
      • ls -> shows index & log files
    • Create Producer
      • bin/kafka-console-producer.sh --broker-list localhost:9092 --topic myTopic13
        • msg1
        • msg2...
    • Create Consumer
      • bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic myTopic13 --from-beginning
    • Take Aways
      • Persists messages on HD in log file
      • Immutable messages
  • Demo2
    • start 3 kafka servers
      • bin/kafka-server-start.sh config/server-1.properties
      • bin/kafka-server-start.sh config/server-2.properties
      • bin/kafka-server-start.sh config/server-3.properties
    • bin/kafka-topics.sh --describe --topic myTopic13 --zookeeper localhost:2181
    • Learnings
      • Kafka is fault tolerant
      • decouple nature of producer, consumer, servers
      • Leader/follower mechanism handled via ZooKeeper
  • Demo3
    • linger.ms in Kafka - e.g., 100 ms
    • batch.size - e.g., 1000 bytes
    • The producer sends a batch as soon as either condition is met (see the producer sketch below)
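A minimal producer sketch showing the linger.ms / batch.size settings from Demo 3; the broker address and topic name are the ones used in the demos above, and the key/value payloads are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal Kafka producer: messages are batched, and a batch is sent when either
// batch.size bytes accumulate or linger.ms elapses, whichever happens first.
public class DemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);   // wait up to 100 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1000); // or send once ~1000 bytes are buffered

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("myTopic13", "key-" + i, "msg" + i));
            }
        } // close() flushes any remaining batched messages
    }
}
```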

  • https://docs.google.com/document/d/1P72fx_a6VwKfyZbZBrulJhLqRFFsCmqeZOTat2cCdRo/edit?usp=sharing
LLD: Remaining Content 1
  • Machine Coding of Splitwise(Expense sharing appln)
  • https://github.com/Naman-Bhalla/splitwiseJune22/tree/master/src/main/java/com/scaler/splitwisejune22/models
  • doDutch/Paytm/google pay/HDFC
  •  Class Diagram
    • Nouns or Visualization
      • Users, groups, Expense
    •  BaseModel
      • id
    • User extends BaseModel
      • name
      • phone number
      • password
    • Expense extends BaseModel
      • amount
      • desc
      • createdBy
      • creationDate
      • List<UserExpense> paidBy
      • List<UserExpense> owedBy
    • UserExpense extends BaseModel
      • UserId
      • ExpenseId
      • Amount
    • ExpensePaidBy
      • ExpenseId
      • UserExpenseId
      • UserId
    • ExpenseOwedBy
      • ExpenseId
      • UserExpenseId
      • UserId
    • Group 
      • Name
      • List<Users>
      • createdBy
      • Desc
      • List<Expenses>
    • https://github.com/Naman-Bhalla/splitwiseJune22/tree/master/src/main/java/com/scaler/splitwisejune22/models
    • https://github.com/scaleracademy/splitwise/blob/dec2022/docs/requirements.md
HLD: Case Study 4(Elastic Search)
  • Sentence to words
  • reverse index
  • It is for full text search -> ES is the most efficient option
  • running, ran -> reduced to the same root word (run)
  • Apache Lucene -> given a document:
    • removes/filters
      • is, the, ...
      • articles - a, an, the
      • stop words
    • Lemmatization
      • reduce a word to its root form
    • Stemming
      • strip suffixes like -ing to reduce to the root word
    • Lucene is a single library running on a single machine
      • not fault tolerant & single point of failure
    • Elastic search is layer above Lucene
    • Google search is far more advanced than Elasticsearch
    • Elastic Search
      • Machines are called Nodes
      • Index (a collection of docs)
    • how write works
      • When a document is sent, ES hashes it to find the shard to write to; from the shard it finds the machine
      • Writes to that machine/shard (reverse index write)
      • Sync to the other shard copies happens asynchronously
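A toy version of the reverse index Lucene builds: filter stop words, apply a crude stem, and keep a map from token to document ids. The stop-word list and stemming here are deliberately naive stand-ins for real analyzers.

```java
import java.util.*;

// Toy inverted (reverse) index: token -> set of document ids that contain it.
public class InvertedIndex {
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of");
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Very crude stemming stand-in for what Lucene's analyzers do properly.
    private static String stem(String word) {
        if (word.endsWith("ing")) return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    public void addDocument(int docId, String text) {
        for (String raw : text.toLowerCase().split("\\W+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) continue;
            index.computeIfAbsent(stem(raw), k -> new HashSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String term) {
        return index.getOrDefault(stem(term.toLowerCase()), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "The runner is running in the park");
        idx.addDocument(2, "Parks of the city");
        System.out.println(idx.search("running")); // doc 1 (same stem at index and query time)
        System.out.println(idx.search("park"));    // docs 1 and 2 ("parks" stems to "park")
    }
}
```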
  • Build Large File Storage
    •  Functional Requirements
      • read, write huge files - GBs to TBs files
      • generate id
      • Streaming of files
      • access control/permissions
      • Files are immutable
    • Non-Functional requirements
      • Fault tolerant + Durable
      • High throughput
      • High Availability
      • low response time
    • Possible Approaches
      • Split file into chunks
      • Split into too many small chunks(1kb) 
        • Pros
          • Parallelize more
          • failure rate will be less
        • Cons
          • But every chunk has the overhead of opening a connection, requesting, the OS identifying the file and transferring. This can lead to massive overhead
          • Meta data of chunk will be huge
      • Split into big chunks(1Gb) 
    • Solution Approach
      • LFS will be a collection of machines
      • An LFS client runs on the client machine. It splits the file into chunks (e.g., 128 MB each, per the estimate below)
      • On LFS, have some Central storage(Master machines which has directory of chunks)
        • It has file -> chunk mapping & Chunk -> machine mapping
      • LFS client requests master to store file by giving name, # chunks
      • Master creates entry for the file and decides chunk to machine mapping
      • Client then connects to machines & writes to the machine
      • Pros
        • Parallelizing the approach/write
        • reducing the cost of upload failure
      • How to avoid single point of failure for master
        • Have replica
        • have zookeeper to keep watch on who the master is
      • How to handle cases where machines go down
        • Have replica for machines as well
        • Chunks have replicas
        • Master will have reference to all copies of chunks in different nodes
      • How to avoid failure of all machines having copies of a file
        • Maintain file copies on machines in different racks (same data center)
        • OR maintain in different data center
      • How to store chunk copies in diff machines
        • consistent hashing
        • round robin
        • random
      • How READ works
        • Reads can spread across primary and replicas
        • All nodes/machines cache chunks with an LRU policy to perform well for hot documents
      • What we did to reduce latency
        • break the file into chunks and process parallely
        • cache in individual nodes
      • How many entries will be created in Master for 1PB of data in LFS
        • 1 PB ≈ 10^9 MB; 10^9 MB / 128 MB ≈ 10 million entries of (chunkId, m1, m2, m3)
        • ~16 B per entry * 10 million entries ≈ 160 MB -> sharding is not required.
      • In HDFS(Hadoop distributed File System) -> Master is called NameNode & machines storing chunks is DataNode
      • When do we need collection of masters
        • when there are 10Million reads per sec
      • HDFS is not very good on latency but good on throughput
    • Use case
      • When client uploads the photos to Facebook
      • Facebook servers store it in LFS and also on the CDN
  • Google Maps
    • Given a random location by latitude and longitude, what are the nearest few restaurants around it.
      • If we go by latitude and longitude directly, it becomes a 2D query and will not perform well
      • Instead divide entire world into grids(to avoid 2D query and make 1D queries)
      • Quad tree
        • root grid with list<allrestaurants>
        • If the list size is > 100, break the grid into 4 identical quadrants (same shape, smaller size)
        • the grids will be children of root
        • Repeat the process with the child nodes until the restaurant list size<100
      • Add a restaurant
        • start with root and navigate to leaf
        • insert new node for the new restaurant
        • If children size becomes more than 100, then split the node into 4 different nodes and distribute children accordingly
      • Delete a restaurant
        • start with root and navigate to leaf
        • delete the restaurant
      • given a location how to find nearest 50 restaurants
        • Get the grid corresponding to the location along with its grid coordinates
        • Get the mid points of the grid lines.
        • Add a small value to the grid coordinate in the direction you want, to find the neighboring grid in that direction.
        • There are 8 neighboring grids to move to from any grid (9 cells including the current one)
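A trimmed-down quad tree for the restaurant example above: leaves hold up to 100 points and split into four equal quadrants when they overflow. Only insert and leaf lookup are sketched; names and the capacity follow the notes, everything else is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal quad tree: leaves hold up to CAPACITY restaurants and split into 4 equal
// quadrants when they overflow.
public class QuadTree {
    static final int CAPACITY = 100;

    static class Restaurant {
        final String id; final double lat, lon;
        Restaurant(String id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    // Grid boundaries of this node (a lat/lon rectangle).
    private final double minLat, minLon, maxLat, maxLon;
    private List<Restaurant> points = new ArrayList<>(); // non-null only for leaves
    private QuadTree[] children;                          // 4 children once split

    public QuadTree(double minLat, double minLon, double maxLat, double maxLon) {
        this.minLat = minLat; this.minLon = minLon; this.maxLat = maxLat; this.maxLon = maxLon;
    }

    public void insert(Restaurant r) {
        if (children != null) {           // internal node: push down to the right quadrant
            child(r.lat, r.lon).insert(r);
            return;
        }
        points.add(r);
        if (points.size() > CAPACITY) split();
    }

    // Navigate from the root to the leaf grid containing (lat, lon).
    public List<Restaurant> findLeaf(double lat, double lon) {
        QuadTree node = this;
        while (node.children != null) node = node.child(lat, lon);
        return node.points;
    }

    private void split() {
        double midLat = (minLat + maxLat) / 2, midLon = (minLon + maxLon) / 2;
        children = new QuadTree[] {
            new QuadTree(minLat, minLon, midLat, midLon),
            new QuadTree(minLat, midLon, midLat, maxLon),
            new QuadTree(midLat, minLon, maxLat, midLon),
            new QuadTree(midLat, midLon, maxLat, maxLon)
        };
        for (Restaurant r : points) child(r.lat, r.lon).insert(r);
        points = null; // this node is now internal
    }

    private QuadTree child(double lat, double lon) {
        double midLat = (minLat + maxLat) / 2, midLon = (minLon + maxLon) / 2;
        int idx = (lat < midLat ? 0 : 2) + (lon < midLon ? 0 : 1);
        return children[idx];
    }
}
```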
HLD: Case Study 5 (Design Uber)
  • Requirements
    • Intra-City transportation
    • Find nearest cabs (moving/Stationary)
    • Track cab position in real time
    • Start trip - path from source to destination
    • Live tracking
  • MVP(items to focus in this session)
    • Find nearest cabs (moving/Stationary)
    • Track cab position in real time
  • Shard by City(to reduce size of problem)
    • Say Bangalore (100K cabs)
      • Cab drivers will install Uber Driver App
      • Cab will have states - active, inactive, occupied, 
    • Driver App
      • Sends heartbeats regularly
      • Status of driver location is updated in memory Map
      • Update Quad tree 
      • maintain location & grid boundary in driver app
      • Driver app to inform when grid is changed
    • Say 100K drivers in Bangalore & 100K active users use app at any point of time. How to design such system.
      • Storage
        • InMemory storage of Cabid to currentLocation
          • 100K * (8 + 8 + 8) bytes ≈ 2.4 MB (a single machine is enough)
        • If the cabs send heart beat every 5 secs, then we get 20K requests per sec that update in memory map(redis can support 100K requests per sec) - (single redis machine is enough)
        • DB
          • Cab metadata -> MySQL can be used, as the data is not modified frequently
            • 100K * 10KB ~ 1GB 
          • Users data (10Million users) -> MySQL can be used as data is not modified frequently
            • 10M * 10Kb = 10GB
          • Quad Tree of 100K cabs
            • cabid, lat, long = 24B *100K = 2.4MB
          • When does Quad tree get accessed
            • Grid changes for a driver - say 5% of heartbeats (20K/sec * 5%) -> 1K/sec
            • when cab status changes
              • 1 cab takes around 8 to 15 rides
              • 4/5 times due to driver inactivity
              • so, it will be 35 state changes per day
              • 100K cabs * 35 / (8 hrs * 60 * 60) ≈ 120 per sec
            • find nearest cabs/request cabs
              • 100k/Second requests
            • How to scale reads
              • 100 replicas -> ~1K requests/sec per replica, which is manageable
            • How to improve perf further
              • Identify hot spots & cache grid & cabs in a different storage
          • Steps
            • Given location (x,y) find grid in quad tree
            • From grid, get list<Cabs>
            • For Each Cab
              • Identify the machine that has a WebSocket connection to the driver's app and send a notification to the driver
              • If no response, continue with next Cab
              • If Cab accepts
                • Get Driver details from mysql db and return to User
              • Uber uses hexagonal grids instead of rectangles grids
          • When cab is occupied
            • Start sending location at faster frequency. say every 3secs
            • Driver app will plot the animation of navigation

        • How to go about it in an interview
          • Requirements
          • Estimation of scale
          • Design Goals
          • API design
          • Final Design
  • Bloom Filter
    • A bloom filter is a probabilistic data structure that is based on hashing. It is extremely space efficient and is typically used to add elements to a set and test if an element is in a set. Though, the elements themselves are not added to a set. Instead a hash of the elements is added to the set.
    • Problem being solved
      • Data for the NoSQL DB is kept in RAM, with older data in sorted sets on disk.
      • For a search, the system first checks RAM; if not found, it searches all the sorted sets on disk
      • Searching for an element that is not present takes the most time.
      • We want a filter that can say a key doesn't exist without searching the DB
        • It should be lightweight, without storing the keys themselves
        • It can be approximate
      • Solution
        • Bit array of size N
        • k number of Hash functions which returns values in range of 1..N
        • For a given key, compute each hash function's value and set that bit in the bit array
        • When searching for a key, check whether all the bits at the hash positions are 1
        • If not 1 then key is not present
        • If present, key may or may not be present
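A small Bloom filter sketch matching the description above: k bit positions per key, set on add, all-must-be-set on lookup. Deriving the k positions via double hashing is a common implementation choice, not something the notes prescribe.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.BitSet;

// Bloom filter: k bit positions per key; answers "definitely absent" or "possibly present".
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public BloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Derive k positions from two base hashes (double hashing).
    private int[] positions(String key) {
        byte[] d = md5(key);
        int h1 = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        int h2 = ((d[4] & 0xff) << 24) | ((d[5] & 0xff) << 16) | ((d[6] & 0xff) << 8) | (d[7] & 0xff);
        int[] pos = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            pos[i] = Math.floorMod(h1 + i * h2, size);
        }
        return pos;
    }

    public void add(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    // false => the key was definitely never added; true => the key may have been added.
    public boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) return false;
        }
        return true;
    }

    private static byte[] md5(String key) {
        try {
            return MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 20, 3);
        filter.add("user:42");
        System.out.println(filter.mightContain("user:42"));  // true
        System.out.println(filter.mightContain("user:999")); // almost certainly false
    }
}
```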
    • https://docs.google.com/document/d/1qVUOgfGeFeB8Xh8Zay-NbCjgDUjPVlYaN7_y8S2BTtE/edit
  • Unique ID Generator
    • UUID: 
      • Generate an MD5-style hash which is a function of timestamp, machine id, IP address and so on
      • It is an alphanumeric id
    • For numeric IDs whose values should follow creation order
      • Single machine generates IDs
        • single point of failure
      • Assign ID ranges to each pod
        • IDs will not be in order of creation time across pods
    • How can all shards independently generate id which are Numeric, Unique & incremental
    • <timestamp><machineid><sequence>
    • Requests at the same timestamp should get unique ids, and ids for requests at a later timestamp should also be unique and greater than those that came earlier
    • Epoch secs : Number of secs elapsed after 1970 
    • Epoch ms: Number of ms elapsed after 1970
    • #ms in a year: 3* 10^10
  • Considerations
    • Clocks on all machines are in sync to the millisecond
    • NTP(Network time protocol) ensure clocks on all machines are in sync
  • Twitter
    • 1 bit for sign + 41 bits for timestamp + 5 data center + 5 machine + 12 sequence id
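A simplified sketch of the <timestamp><machine id><sequence> scheme with the Twitter-style bit layout above; the custom epoch and the spin-wait when a millisecond's sequence is exhausted are assumptions of this sketch.

```java
// Simplified Snowflake-style ID: 41 bits of milliseconds since a custom epoch,
// 10 bits of machine id (datacenter + worker), 12 bits of per-millisecond sequence.
public class IdGenerator {
    private static final long CUSTOM_EPOCH = 1577836800000L; // 2020-01-01, an assumed epoch
    private final long machineId;        // 0..1023 (5 bits datacenter + 5 bits machine)
    private long lastTimestamp = -1L;
    private long sequence = 0L;          // 0..4095 within one millisecond

    public IdGenerator(long machineId) {
        if (machineId < 0 || machineId > 1023) throw new IllegalArgumentException("machineId out of range");
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;       // wrap at 4096 ids per millisecond
            if (sequence == 0) {                      // exhausted this millisecond: spin to the next
                while (now <= lastTimestamp) now = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - CUSTOM_EPOCH) << 22) | (machineId << 12) | sequence;
    }

    public static void main(String[] args) {
        IdGenerator gen = new IdGenerator(7);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId()); // strictly increasing, unique per machine
    }
}
```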

  • Rate limiter service to avoid DoS (denial of service) attacks. How and where to build a rate limiter. E.g., only allow max 10 users to register in an hour from one IP
    • App design
      • client (adding limiter here is not reliable)
        • load balancer
          • apps servers
            • DB
    • Implement a rate limiter API: limit 100 requests per API per user per second
      • Brute Force
        • For every user+api -> keep a queue/array of request timestamps for the last one second
        • When a new request comes, evict entries older than one second; if the queue is then below the limit, accept the new request, otherwise reject it
        • Drawbacks
          • memory heavy
          • too many in memory operations when the requests are in millions
      • Say rule is 50 request per user per api per 10 secs
        • Maintain count for current 10sec bucket
        • when Y requests come, validate using following formula
          • curr_count + prev_count * (10 - y)/10 is within the threshold, where y = seconds elapsed in the current bucket
          • This is not 100% accurate; it is an approximation
        • This is called the sliding window counter approach (see the sketch after this list)
      • How to make it work for Distributed systems
        • Redis where all api gateways talk to it
        • maintain user+api+count queues
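A sketch of the sliding-window counter formula above (50 requests per user+api per 10-second window). In the distributed setup these counters would live in Redis rather than a local map; all names and limits are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sliding-window counter: estimate the request count in the last WINDOW seconds as
// currentCount + previousCount * (fraction of the previous window still in the sliding window).
public class SlidingWindowRateLimiter {
    private static final long WINDOW_SECONDS = 10;
    private static final long LIMIT = 50;

    private static class Counters { long windowStart; long current; long previous; }
    private final Map<String, Counters> perKey = new HashMap<>(); // key = user + ":" + api

    public synchronized boolean allow(String user, String api, long nowSeconds) {
        Counters c = perKey.computeIfAbsent(user + ":" + api, k -> new Counters());
        long windowStart = (nowSeconds / WINDOW_SECONDS) * WINDOW_SECONDS;
        if (windowStart != c.windowStart) {
            // roll the windows forward; if more than one window passed, the previous count is 0
            c.previous = (windowStart - c.windowStart == WINDOW_SECONDS) ? c.current : 0;
            c.current = 0;
            c.windowStart = windowStart;
        }
        double elapsed = nowSeconds - windowStart;                       // "y" in the formula above
        double estimate = c.current + c.previous * (WINDOW_SECONDS - elapsed) / WINDOW_SECONDS;
        if (estimate >= LIMIT) return false;                             // reject: over the approximate limit
        c.current++;
        return true;
    }
}
```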
    • Time Series DB(used for monitoring)
      • Data Dog
      • ODS in facebook
      • It stores key, timestamp & value (e.g., the time taken by the server for a request)
      • Graph is plotted using above info
    • Implement/Design a timeseries db 
      • Input given is key, timestamp & value 
        • user.visit, at time T, with value 1
        • user.bookmark.latency, at T2 with value 15
      • Query
        • Input: a key, a timestamp range (t1 - t2), a time chunk (e.g., 5 mins) and an aggregator function
        • Output: aggregated values per chunk to plot the graph (see the sketch below)
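A tiny sketch of that aggregation query: bucket one key's (timestamp, value) points into fixed time chunks between t1 and t2 and reduce each bucket with an aggregator (sum here; avg/max would plug in the same way). All names are illustrative.

```java
import java.util.*;

// Minimal time-series aggregation: bucket (timestamp, value) points into fixed-size
// chunks and reduce each bucket with an aggregator (sum in this sketch).
public class TimeSeriesQuery {
    record Point(long timestamp, double value) { }

    // Returns bucketStart -> aggregated value for points of one key in [t1, t2).
    public static SortedMap<Long, Double> aggregate(List<Point> points, long t1, long t2, long chunkMillis) {
        SortedMap<Long, Double> buckets = new TreeMap<>();
        for (Point p : points) {
            if (p.timestamp() < t1 || p.timestamp() >= t2) continue;
            long bucketStart = t1 + ((p.timestamp() - t1) / chunkMillis) * chunkMillis;
            buckets.merge(bucketStart, p.value(), Double::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Point> visits = List.of(
            new Point(1_000, 1), new Point(2_000, 1), new Point(320_000, 1));
        // 5-minute chunks over [0, 10 minutes): two buckets with counts 2 and 1.
        System.out.println(aggregate(visits, 0, 600_000, 300_000));
    }
}
```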
HLD: Case Study 6 (Design Hotstar); OTT -> Over The Top
  •  Functional Requirements
    • View Videos/movies/tv shows
    • view live content like News/Sports
    • Search
    • Reviews, details of movies, recommendations
    • Type of users
      • Viewers(sport fan, movie buffs)
      • Content providers(upload videos)
    • Playback management (rewind, forward)
    • Mobile, web apps (Android, iOS)
    • Profile, user accounts
    • Login, Authentication, Authorization
    • subscriptions/payments
    • MVP
    • Functional
      • Watching experience
      • live streaming
        • Sequence of data  (no download required)
        • Streaming is a technique where, if we cannot process a large amount of data at once, we process it in smaller chunks
    • Non-Functional
      • performant
        • deliver content quickly/buffering
        • scalability
        • Highly available/reliable
        • low/medium latency for live streaming
          • time between live happening & video should differ in secs
        • High performance
          • initial load & NO buffering while watching
      • Security
      • Usability
      • Privacy
  • Back of envelop estimation
    • content providers should be able to upload
    • viewers should be able to view
    • number of videos - 10K
    • duration - 1hr on avg
    • other factors in size
      • resolution - number of pixels
      • bit rate- kilo bits/sec
      • frame rate
      • encoding
      • for larger devices like tv
        • 1080p / full HD, ~5 Mbps -> 1 hr of such video will be ~4 GB
        • for desktops and laptops -> 720p, ~3 Mbps -> ~2 GB
        • mobile devices - 480p, ~2 Mbps -> ~1 GB
      • why not store in single format and serve all devices
        • bandwidth wastage and processing overhead may result in latency
      • 10K * 7GB -> 70 TB(+ 2 back ups) - 210TB
      • stored in BLOB/object storage/cloud storage
    • Data base to maintain Users, Subscription Plans, Devices users are using, Data about user views, 
      • User(id, username, password, email, age,) -> say 1KB per user-> 1Million will take 1GB storage
      • Devices(ip, type, os)
      • Content(actors, description..)
      • ViewerHistory(1000 videos/user, start time, end time, start date, end date
      • 5GB could be db size -> RDBMS can be used
        • This is read heavy system
        • Cache is required(20% data can cached based on 80-20 rule)
        • Master-slave architecture so that reads can go across machines
    • Live Streaming
      • The event is recorded with a video camera & microphone, which capture in a raw format and send it to the servers over the internet.
      • Load balancer
        • Device to distribute traffic to avoid overload of few servers
        • Avoids inconsistent performance, since load is distributed equally across servers
        • horizontal scalability
      • What happens on server side
        • Encode & Compression: Convert from the normal format to H264 format. 
        • Segmentation: When video content is larger, then break into chunks. eg: break 1GB into 250 chunks of size 4MB
        • Streaming server: The segmented data will be uploaded to Streaming server and maintain sequence of chunks
        • Streaming server will feed the content to Edge Server/CDN servers.
          • CDN- caching, low latency
        • App/Client - web app/mobile app that has the video player, various clients
        • The video player on the client takes care of decoding
        • Sequencing/synchronizing the chunks
          • the client knows how to sequence them.
        • Hashing generates a checksum for each chunk on the server side. The client verifies the checksum and re-requests the chunk if it doesn't match
    •  API Design
      • /login & /logout
      • /search with parameter keyword
      • /recommendations
      • /content/{id}
      • /content/
      • Adaptive bitrate streaming - Method of delivery of  audio/video content over the internet where quality of stream is adjusted based on viewer device & internet connection so that users will get best experience.
    • When is security important
      • Uploading 
        • At client end - authentication and authorization works
        • Upload in transit - using encryption (SSL/TLS)

    • RTMP - low latency & high quality video transmission
  • OTT vs video sharing platforms
    • video sharing platform - YouTube
LLD: Remaining Content 2 
  • Coding all models of splitwise and connect to DB
  • Handling inputs from commandline
  • How to settle every user in min time
  • https://docs.google.com/document/d/1JE2zuEWXF3ex5cuEde6rqyMYo338xhGF7oVvQMgW8ls/edit?usp=sharing
  • Client -> Dto -> Controller -> Service -> Repository -> Model
  • CommandRegistry
  • CommandLineRunner
  • run() method
  • How to simplify expenses
    • Calculate how much extra or less someone paid
    • ideal soln
      • any one who paid extra should only receive
      • any one who owes should only pay
    • Divide the people into 2 sets
      • people who paid extra (should receive)
      • people who paid less (should pay)
    • Pick the person who is owed the most and the person who owes the most
      • Create a transaction between the two with amount min(a, b)
      • Reduce both balances by that amount
      • Continue the process until all balances are settled
    • How many txns are needed
      • for N people, at max N-1 txns are needed
    • Data structures to use: max/min heaps (see the sketch below)
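A sketch of the greedy settle-up with heaps described above: repeatedly match the largest creditor with the largest debtor for min(a, b). It assumes the net balances already sum to zero; names are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.PriorityQueue;

// Greedy settle-up: the largest creditor is paid by the largest debtor with min(a, b),
// producing at most N-1 transactions for N people.
public class SettleUp {
    public static void settle(Map<String, Double> netBalance) { // +ve = should receive, -ve = should pay
        PriorityQueue<SimpleEntry<String, Double>> creditors =
                new PriorityQueue<>((a, b) -> Double.compare(b.getValue(), a.getValue()));
        PriorityQueue<SimpleEntry<String, Double>> debtors =
                new PriorityQueue<>((a, b) -> Double.compare(b.getValue(), a.getValue()));

        for (Map.Entry<String, Double> e : netBalance.entrySet()) {
            if (e.getValue() > 0) creditors.add(new SimpleEntry<>(e.getKey(), e.getValue()));
            else if (e.getValue() < 0) debtors.add(new SimpleEntry<>(e.getKey(), -e.getValue()));
        }

        while (!creditors.isEmpty() && !debtors.isEmpty()) {
            SimpleEntry<String, Double> credit = creditors.poll();
            SimpleEntry<String, Double> debit = debtors.poll();
            double amount = Math.min(credit.getValue(), debit.getValue());
            System.out.println(debit.getKey() + " pays " + credit.getKey() + " " + amount);

            // push back whoever still has a remaining balance
            if (credit.getValue() - amount > 1e-9)
                creditors.add(new SimpleEntry<>(credit.getKey(), credit.getValue() - amount));
            if (debit.getValue() - amount > 1e-9)
                debtors.add(new SimpleEntry<>(debit.getKey(), debit.getValue() - amount));
        }
    }

    public static void main(String[] args) {
        // A paid 300 extra, B owes 100, C owes 200 -> two transactions settle everyone
        settle(Map.of("A", 300.0, "B", -100.0, "C", -200.0));
    }
}
```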
    • https://github.com/Naman-Bhalla/splitwiseJune22
    • https://drive.google.com/drive/folders/1Ml376ah2XZdTdablZVmf0zlvh5miGymr
  • HW: go thru recording of split wise 2 & 3 in dashboard
HLD : Case study 7 (Design IRCTC)
  • MVP features(User journey)
    • get trains for source, destination, train
    • get schedule(trainId, date)
    • Auth Registration
    • checkAvailability(trainId, date, class, src, dest)
    • book ticket(trainId, class, date, #seats, addln info passenger, source, destination)
    • Payment Gateway
    • Notification
    • Non-functional
      • concurrency
      • high throughput
      • high availability
  • Estimate Scale
    • #Users(total 10M & per day 100K)
    • # booking(seat) per day - 
    • #trains - 10K
    • #seats per train - 1K
    • #seat bookings per day - 10M
    • space per row - 150Bytes
    • rows in table -10M per day
    • how far ahead can we book - 60days -> 600M entries
      • 600M entries * 150 B ≈ 90 GB (can fit on one machine)
    • Historical data - move it to archive storage, which is low cost & plentiful
      • Not frequently accessed
      • Read only - no modifications

  • Design Goals
  • API Design
  • Archive Storage
    • completely separate db
    • get past bookings of user - sql db
    • Analytics - Columnar db
    • audit - tape storage
  • Design Booking Service
    • This is both read and write heavy(Checking availability & writing once booking is confirmed)
    • Highly consistent system cannot depend on eventual consistent system
    • High throughput
      • peak bookings in irctc in one min is 26K and avg is 7k
    • 26K/min - a single machine cannot handle this, so
      • shard data by train_id
  • Replication
    • Prevents data loss
    • Increases read throughput
  • Sharding
    • makes data smaller(per shard)
    • increases read/write throughput
HLD: Microservices
  • Design Amazon / an e-commerce appln
    • Functional Requirements
      • Catalog -> list of items organization is selling
      • shopping cart
      • Order mgmt
      • Payment
      • User Mgmt
      • Notification
      • Inventory Mgmt -> how many items are available/stock availability
      • Analytics
      • Search
      • Recommendations
    • Non-Functional Requirements
      • Scalability
      • Availability
      • Reliability
      • Security
      • Performance/Throughput
    • Communication between services
      • Sync
      • Async
      • Using message broker/event driven architecture/message driven
      • Architecture styles
        • Monolith
        • layered architecture/component architecture
        • Sync communication
        • Async communication

      •  Micro services
        • communicate with REST/SOAP/gRPC/JSON-RPC
        • Independently developed & deployed
        • Order management requires less scalability/traffic whereas Search requires more scalability
        • can be scaled by customer traffic
        • easily scalable
        • scale up and down easily with less time
        • identify the bottleneck and scale it up
        • Web servers and appln servers scale more easily than databases
        • Traditionally message brokers ran on expensive hardware as a single appln
        • Apache Kafka is a distributed message queue used as a message bus
    • Design Micro services architecture
      • Microservices are split based on the following principles
        • Single responsibility
        • Loose Coupling(Maintainable code)
        • Too many microservices will impact performance 
        • High cohesion(all the related things are together)
        • Autonomous
        • stateless
        • Smart endpoints & dumb pipes
          • pipes are integration logic and endpoints are business logic
          • Business logic is within microservices. Integration logic is not within microservice
          • Use Distributed cache instead of appln server cache(sticky cache)
        • Drawbacks in microservice architectures
          • Authentication is implemented in all microservices
          • How to avoid DDoS attacks (distributed denial of service)
            • Use rate limiting
            • but then rate limiting needs to be implemented in every microservice
          • An API Gateway is the solution. It is an application component which provides routing to microservices and authentication/authorization. It is a kind of Facade for all the APIs
          • Facade design pattern - The design pattern hides complexity of a system and provides an interface to the client using which the client can access the system. This comes under structural design pattern.
          • API Gateway functionality
            • routing
            • authentication/authorization
                • It calls the security provider, and reroutes to the security provider if the request is not authenticated
                • Once one service has authenticated the user, it can share a JWT token with the other microservices so that authentication need not be repeated again and again.
            • rate limiting
            • Caching
            • Protocol transformation
            • Monitoring
            • logging
            • Aggregation of services
        • Components required for building microservices
          • Json-p /JsonB for json processing
          • logging and monitoring
          • API gateway to route to respective server load balancer
          • Loadbalancer to route to service instance with less load
          • Service Mesh -> It is service registration and service discovery
            • provides load balancing(alternative to load balancer)
            • Service Registration: registers the IP addresses of newly created service instances
            • Service Discovery: other services can look up those IP addresses (service discovery)
            • It maintains a list of service rules: which services can access which services
HLD: Popular Interview Questions 2
  •  Memory Management & Algorithm
    • Data structure and algorithms
      • garbage collection is a dsa problem in general
    • System design
      • Several choices with pros and cons
      • Important for large scale systems
  • Stack Memory
    • LIFO data structure
    • All variables in the function/method are stored in stack (pushed)
    • Primitive data types(small and fixed size)
    • Each thread has its own stack
    • no need for memory management
    • Object variables are stored and referred to Heap for actual values
    • Stored in RAM
  • Heap Memory
    • Complex data types likes lists, maps, sets(objects)
    • Shared across threads
    • Actual objects are stored
    • Stored in RAM
    • How to free up Heap
      • Manual
        • malloc to allocate memory in languages like C.
        • free() to free the memory
        • Pros
          • Good control & high performance
          • Predictability - predictable performance
        • Cons
          • Memory leak - memory not freed up by the programmer
          • Dangling pointers - a pointer pointing to a location that is already freed up
          • Double free - freeing the same memory twice
      • Automatic
        • Pros
          • No manual free, no dangling pointers, no double free
        • Cons
          • Memory leaks can still be there
          • automatic memory mgmt has some overhead
          • lower predictability
      • When to use Manual vs Automatic
        • Automatic will be default choice. It is simple and less work
        • Mission-critical systems that require high predictability (game programming, low-level database programming, low-level systems programming where every CPU cycle counts) require manual memory management
          • Pacemakers
          • Mars rover / lander
      • RUST
      • Garbage collector in Automatic memory management
        • Functional Requirements
          • Find objects that are no longer used
            • Least recently used
              • Mark & Sweep strategy: Scan entire memory and check which parts are being used
                • Variables in stack are live & can be considered as roots
                • Use graph traversal from roots and find all reachable nodes
                • Pause all jobs when GC is running
                • Start at current stack + meta space + static variables
                • Mark phase: Traverse and mark any reachable objects
                • Sweep: scan entire heap and free space which is not marked
                • Cons
                  • For memory-intensive applns it will be slow and can take a few/many secs

          • free the objects that are no longer used only
        • Non-Functional requirements
          • Min overhead and high performance
          • It should not pause the application
          • Highly predictable
        • Generational garbage collection
          • Make the garbage collector pauses short
          • It divides the heap into Young & old generations. Young is around 20% of space.
          • New objects are allocated space in new generation.
          • It relies on the generational hypothesis: if an object survives a few garbage collections, it tends to live for a long time
          • When the young generation fills up, garbage collection kicks in. It marks only objects in the young generation and ignores the old generation region
          • Any young object that survives is moved to the old generation
          • This algo is faster(100ms) as it need to perform mark and sweep only on 20% of memory
          • After many iterations OLD space will also gets filled up. In such case, major GC will run on whole memory.
        • JAVA 
          • Java uses a generational garbage collector
          • It further divides young space into 3 parts
            • Young 
              • Eden, S0, S1 (Survivor 0 & Survivor 1)
            • Old
            • Permanent
          • When Eden is full, survivors are moved to S0
          • When Eden is full, GC will run on Eden+S0 and survivors moved to S1
          • When Eden is full, GC will run on Eden+S1 and survivors moved to S0
          • the process will continue
          • One of S0 or S1 is always empty by moving survivors to other space. Helps in compaction
          • If S0 or S1 gets filled, or an object survives a fixed number of GC cycles, the survivors are moved to the old generation
          •  Major GC run only when OLD generation is filled.
        • Reference counting(Python)
          • Every object stores a count of references (in-degree) pointing to it
          • When the refcount goes to 0 -> garbage collected
          • If there is circular dependency, then refcount never goes to 0
          • Use Mark and sweep periodically.
          • Pros
            • reactive- happens along with appln
            • no need to pause
          • Cons
            • Insufficient - we have to apply mark and sweep at periodic intervals
            • There can still be pauses when a variable or object holds a large tree of memory, due to cascading frees
            • High overhead
              • CPU - every object creation/assignment requires the ref count to be updated; the update must be atomic and a context switch can't happen in between
              • Memory - addln memory to maintain reference value
        • tricolor - Truly concurrent

    • Why is memory fragmentation bad
      • Sequential access is not possible
      • Caching happens in blocks. So, when data is stored together, it gets fetched in a minimal number of reads
  • Highlevel design
    • Traffic analysis - QPS (avg & peak)
    • Data storage 
    • CPU load = QPS * CPU time per query
  • CPU bound vs I/O bound
    • CPU speed is measured in clock cycles per second, i.e., gigahertz (GHz)
    • latest intel processor speed is 5.2Ghz and has 24 cores
    • Speed of light is 3 * 10^8 meter/sec
    • Light takes ~3 nanoseconds to travel one meter.
    • In those ~3 ns, each CPU core can do ~15 floating point operations.
    • https://gist.github.com/jboner/2841832
    • For processor with 24 cores, each core has following
      • ALU -> transistors (clock speed)
      • Registers -> memory made up of flip flops(clock speed)
      • L1 cache ~5 times slower than ALU/registers
      • L2 cache ~14 times slower than L1
      • L3 cache -> shared across all cores -> slower than L2
    • Primary memory -> ~500 times slower than cpu
    • Secondary memory 
    • Web requests are I/O bound requests
    • DSA problems, Image processing are CPU bound tasks
  • Network Load
    • Nowadays networks are even faster than CPUs, thanks to fiber technology
    • At every junction (to interface with a digital device) the light has to be converted to electrical signals. This is the primary bottleneck.
    • Network can operate at 10 pb/sec
  • Webservice requests
    • They create one internal thread for every request
    • How many requests can OS handle
    • Concurrent systems - async/await
      • If requests go up to 100,000 then the network/disk becomes the bottleneck
    • For a 24-core system, ~96 threads in parallel can scale
    • If threads are the bottleneck, get rid of threads and move to concurrent systems with async/await
    • Async/await programming avoids thread context switches
  • Stack vs heap
    • stack is per thread and heap is shared
    • Variables/call stack in stack memory. Objects(dynamically allocated memory) are stored in Heaps
    • It's clear when to free stack memory; freeing heap memory is not as straightforward (difficult)
  • tricolor - Truly concurrent
    • Completely reactive with out pauses. Reactive means, garbage collector reacts to any changes to memory allocation. 
    • Garbage collector keeps running as a separate thread
    • At the start, everything in the stack is marked Black, & the heap is White
    • Black - is live/reachable. 
      Grey - reachable from roots, still need processing
      white - no references(to be garbage collected)
    • Black is guaranteed not to have any references to white.
      Grey may or may not have references to white; it needs further processing
    • Start from the Black objects, perform BFS and mark all reachable children as Grey
    • By the end of a cycle, everything that is reachable is black and everything not reachable is white
    • When new reference from Black to White happens - Target is marked as Grey
    • When reference is deleted, target is marked in White
    • Pros
      • No pauses
      • perf is predictable
    • Cons
      • Throughput is lower
      • CPU overhead is high
  • Java
    • Tri color mark sweep & Generational GC
  • https://v8.dev/blog/concurrent-marking
