User:Rahilsonusrhn/sandbox

Knowledge
Spark

- Data processing engine that holds and processes data in memory, in near real time, across a cluster using Resilient Distributed Datasets (RDDs).

- An RDD supports transformations (map, filter, join) and actions (reduce, count, first).

- Has a rich set of machine learning algorithms and complex analytics (MLlib: predictive analysis, recommendation systems, etc.).

- Can do real-time stream processing.

- Has the GraphX component, which helps with graph-based processing (e.g. social graphs such as LinkedIn's).

- Has the Spark Core component, which handles fault tolerance, memory management, scheduling, distribution across the cluster, and interaction with storage systems such as HDFS, RDBMS, etc.

- Can be integrated to query, analyze and transform data.

- Supports Java, Python, Scala, etc.

- Often much faster than Hadoop MapReduce, mainly because intermediate data is kept in memory.

Hadoop

- uses mapreduce to process data

- does batch processing

Splunk

-generates graphs, reports, alerts, dashboards and visualizations

Apache Beam

- Data processing framework.

- It is agnostic to execution platform, data and language.

- Write code in Beam and run it on any supported data processing engine, e.g. Spark, MapReduce, Google Cloud Dataflow.

- Uses pipelines to read data (PCollection), transform it (PTransform) and output data.

- Can add the SDK as a dependency in the pom and use its libraries to process data.

- Has functions like triggers and windows.

Apache Flume

Apache Flume is an open-source, powerful, reliable and flexible system used to collect, aggregate and move large amounts of unstructured data from multiple data sources into HDFS/HBase (for example) in a distributed fashion via its strong coupling with the Hadoop cluster.

Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.


 * Flume provides the feature of contextual routing.
 * The transactions in Flume are channel-based where two transactions (one sender and one receiver) are maintained for each message. It guarantees reliable message delivery.

Apache Mahout

Analyzes large sets of data effectively and quickly. Uses mathematical models for


 * Recommendation
 * Classification
 * Clustering

Algorithms

User-Based Collaborative Filtering is a technique used to predict the items that a user might like on the basis of ratings given to that item by the other users who have similar taste with that of the target user.

Item-Based Collaborative Filtering

Spectral clustering - derived from graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them.

Random forest - builds many decision trees on random subsets of the data and combines their votes to make predictions.

Matrix factorization for recommender systems

Scala

- Statically typed language with type inference

ELK


 * Logstash: The data collection pipeline tool. It collects log and event data, parses and transforms it, and feeds it into Elasticsearch.
 * Elasticsearch: Stores, indexes and searches the transformed data from Logstash.
 * Has REST API web-interface with JSON output
 * Full-Text Search
 * Near Real Time (NRT) search
 * Sharded, replicated searchable, JSON document store
 * Schema-free, REST & JSON based distributed document store
 * Kibana: Kibana uses Elasticsearch DB to Explore, Visualize, and Search logs.

Kafka
Fault-tolerant, scalable messaging system.

Project example: receive Playtech events and send them to a Kafka topic. Different consumers can process these messages. This ensures events are not missed and processing can be scaled.

Bet placement services send a huge number of requests during a match. These requests are pushed to Kafka, and our processors process them against the fixtures to trigger specific journeys, such as calling payment systems.

- Log compaction - retains at least the latest record for each key in the log and removes older records with the same key. The order of the retained records is preserved.

- Kafka Topic - messages belonging to one category

- A topic can have many partitions to which producers publish data. A record goes to a partition chosen from the partitioning key if one is specified, otherwise round robin.
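The partition choice described above can be sketched as follows. This is a simplified illustration only, not Kafka's real partitioner (which hashes the serialized key with murmur2 and, in newer clients, batches keyless records stickily). The class name `PartitionChooser` and the use of `String.hashCode` are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: pick a partition by key hash, or round robin when there is no key. */
class PartitionChooser {
    private final int numPartitions;
    private final AtomicInteger counter = new AtomicInteger();

    PartitionChooser(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    int choose(String key) {
        if (key == null) {
            // No key: spread records evenly across partitions.
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // Same key always maps to the same partition, which preserves per-key order.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

With three partitions, keyless records land on partitions 0, 1, 2, 0, ... while every record with the same key goes to one fixed partition.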

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1   --partitions 1 --topic Hello-Kafka

- Kafka Cluster - has one or more brokers, coordinated by ZooKeeper.

- Only one broker acts as the Controller.

- Kafka Broker - contains topic partitions.

- Receives messages from producers and stores them in topic partitions, each message identified by its offset.

- Kafka ZooKeeper - manages brokers in the cluster.

- Kafka Producer

Define a Kafka template, giving it the properties file that contains the Kafka configuration.

- Kafka Consumer

- Kafka Consumer Group

Consumers join a consumer group using a group id. The consumers in a group divide the topic partitions among themselves, and each partition is consumed by only one consumer from the group.

- Kafka Streams (API)

Reads records from one Kafka topic, processes them, and writes to another topic.

// Read the topics to consume from and produce to
String topic = configReader.getKStreamTopic();
String producerTopic = configReader.getKafkaTopic();

// Define serialization and deserialization types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Get stream data and apply functions
// (KStreamBuilder is the pre-1.0 API; newer versions use StreamsBuilder)
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> inputStreamData = builder.stream(stringSerde, stringSerde, producerTopic);
KStream<String, Integer> processedStream = inputStreamData.mapValues(record -> record.length());

- Kafka Connect (API)

Define the connector class (like the file stream connector) and give the source and sink properties

properties file

efi.kafka.replication.factor=3

efi.kafka.bootstrap-servers=at1p1xdkfk101.dbz.unix:9093,at1p1xdkfk102.dbz.unix:9093

efi.kafka.user=spspinhr01

efi.kafka.password=UzU


# kafka producer properties

efi.kafka.producer.retries=0

efi.kafka.producer.batch.size=16384

efi.kafka.producer.max.block.ms=3000

efi.kafka.producer.linger.ms=1

efi.kafka.producer.request.timeout.ms=5000

efi.kafka.producer.buffer.memory=33554432

efi.kafka.producer.acks=0

efi.kafka.producer.max.request.size=1048576

efi.kafka.producer.compression.type=none

efi.kafka.producer.max.in.flight.requests.per.connection=5

efi.kafka.producer.connections.max.idle.ms=540000

efi.kafka.producer.receive.buffer.bytes=32768

efi.kafka.producer.send.buffer.bytes=131072

efi.kafka.producer.metadata.max.age.ms=300000

efi.kafka.producer.reconnect.backoff.ms=50

efi.kafka.producer.retry.backoff.ms=100


# kafka consumer properties

efi.kafka.consumer.enable.auto.commit=false

efi.kafka.consumer.auto.commit.interval.ms=5000

efi.kafka.consumer.auto.offset.reset=latest

efi.kafka.consumer.reconnect.backoff.ms=5000

efi.kafka.consumer.retry.backoff.ms=5000

efi.kafka.consumer.max.poll.records=500

efi.kafka.consumer.max.poll.interval.ms=300000

efi.kafka.consumer.session.timeout.ms=300000

efi.kafka.consumer.heartbeat.interval.ms=3000

efi.kafka.consumer.partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor

efi.kafka.consumer.fetch.min.bytes=1

efi.kafka.consumer.fetch.max.bytes=52428800

efi.kafka.consumer.fetch.max.wait.ms=500

efi.kafka.consumer.max.partition.fetch.bytes=1048576

efi.kafka.consumer.connections.max.idle.ms=540000

efi.kafka.consumer.check.crcs=true

efi.kafka.consumer.request.timeout.ms=305000

efi.kafka.consumer.receive.buffer.bytes=65536

efi.kafka.consumer.send.buffer.bytes=131072

efi.kafka.consumer.metadata.max.age.ms=300000

efi.kafka.reconnect.period.ms=5000

efi.kafka.pollTimeoutMs=10000

Cassandra
Distributed, high-performing, scalable database. Cassandra nodes are arranged in a ring-based topology, with different replication strategies: SimpleStrategy (rack-unaware), OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (datacenter-aware).

Keyspace - it is like a schema/container that can contain multiple tables.

CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

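column family - the older name for a table: a set of rows that share a schema.

primary key - uniquely identifies a row; made up of the partition key plus optional clustering columns.

partitioning key - the first part of the primary key; its hash decides which node stores the row.

clustering key - the remaining primary key columns; they define the sort order of rows within a partition.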

compression - compression algorithm: LZ4Compressor (Cassandra 1.2.2 and later), SnappyCompressor, or DeflateCompressor. Gives roughly a 25-30% size decrease.

compaction - an I/O operation that happens in the background to clean up data and store it more efficiently. The compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable.

SizeTieredCompactionStrategy (STCS): the default compaction strategy. It triggers a minor compaction when there are a number of similar-sized SSTables on disk.

TimeWindowCompactionStrategy (TWCS): an alternative for time series data. TWCS compacts SSTables using a series of time windows. Within a time window, TWCS compacts all SSTables flushed from memory into larger SSTables using STCS.

gossip protocol - nodes communicate with each other to exchange state information about themselves and other nodes

project - Defined persistence modules like fixture store, provider store.

- Used Spring Data with Cassandra

- The AbstractCassandraConfiguration class is extended and provided with properties like ${cassandra.datastax.hosts} and ${cassandra.cluster.loadbalancing.localDc}

- Get the CQL session from the session factory bean

- Extend the Spring Data CRUD repository

- @Table("fixture_sport_mapping_v2")

public class FixtureIDAndSportCodeMappingV2 {

@PrimaryKeyColumn(name = "provider_code", type = PARTITIONED)

private String providerCode;

KUBERNETES
- Orchestrator for container deployment (another example is Docker Swarm, but it lacks autoscaling features)

https://kubernetes.io/docs/concepts/services-networking/service/#type-clusterip

- The architecture contains a control plane, which has the kube-apiserver (front end of the control plane), controller manager (notices and responds when a node goes down), etcd (key-value store for cluster state), cloud controller manager (connections to the external cloud provider) and scheduler (schedules pods onto nodes)

Nodes contain the kubelet (runs the pods assigned to the node), kube-proxy (manages the network layer) and pods

- A Pod can contain many containers.

- Zuul and Eureka are not needed, as there is out-of-the-box support for service discovery, and the gateway is Ingress.

- Desired state management.

- Uses the cluster API services to achieve the desired state using config in YAML.

- The deployment YAML contains container images, replica config and CPU usage config

- The kubelet process runs on each node to coordinate with the cluster API services

- Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods

- Can use the command-line interface kubectl or the Kubernetes dashboard to manage deployments and pods.

---> kubectl create -f deployment-account.yaml

kubernetes.yml or deployment-account.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: account-service
  labels:
    run: account-service
spec:
  replicas: 1
  template:
    metadata:
      labels:
        run: account-service
    spec:
      containers:
      - name: account-service
        image: piomin/account-service
        ports:
        - containerPort: 2222
          protocol: TCP
      - name: mongo
        image: library/mongo
        ports:
        - containerPort: 27017
          protocol: TCP

- Service

kind: Service
apiVersion: v1
metadata:
  name: account-service
spec:
  selector:
    run: account-service
  ports:
  - name: port1
    protocol: TCP
    port: 2222
    targetPort: 2222
  - name: port2
    protocol: TCP
    port: 27017
    targetPort: 27017
  type: NodePort

- Ingress - Ingress may provide load balancing, SSL termination and name-based virtual hosting. Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.

ingress.yml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gateway-ingress
spec:
  backend:
    serviceName: default-http-backend
    servicePort: 80
  rules:
  - host: micro.all
    http:
      paths:
      - path: /account
        backend:
          serviceName: account-service
          servicePort: 2222
      - path: /customer
        backend:
          serviceName: customer-service
          servicePort: 3333

DOCKER
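To create the image - mvn clean install dockerfile:build

Create a Dockerfile in the project root

Deploy the Docker container (options such as --name must come before the image name, otherwise they are passed to the container as arguments)

docker run -p 8080:9080 --name hello-docker-image -t hello-howtodoinjava/hello-docker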

LOGGING
Log4j - Logger.getLogger(ClassName.class). Can define log severity, appenders, rolling strategy, location, etc.

Log4j2 - successor to Log4j; faster, with asynchronous loggers.

SLF4J - an abstraction (facade). Can be used with Log4j. Helps change the underlying implementation without changing the code. Loggers are obtained with LoggerFactory.getLogger(ClassName.class).

Logstash - Define Logstash properties in logback.xml

Define the encoder (LoggingEventCompositeJsonEncoder) and pattern.

The Logstash agent can be a Docker image that runs on the host to send data from the log folder to Elasticsearch. Config can be done in logstash.yml or environment variables.

The agent injects the correlation ID / trace ID.

JMS/RabbitMQ
JMS: Java Message Service is an API, part of Java EE, for sending messages between two or more clients. There are many JMS providers, such as OpenMQ (GlassFish's default), HornetQ (JBoss), and ActiveMQ.

JMS supports two models: point-to-point (one to one) and publish/subscribe.

JMS is specific to Java, but RabbitMQ supports many technologies.

RabbitMQ: an open-source message broker that uses the AMQP standard and is written in Erlang.

RabbitMQ supports the AMQP model, which has 4 exchange types:

direct (producer sends to the exchange, and the exchange sends the msg to the queue whose binding key matches the routing key),

fanout (producer sends the msg to the exchange, and the exchange sends it to all bound queues),

topic (routing key is matched against a binding pattern, allowing partial matches),

headers (uses the msg headers instead of the routing key)

The default exchange routes a msg to the queue whose name matches the routing key

Heap Dump Analysis

Use the jmap tool that ships with the JDK to extract the dump. jmap -dump:format=b,file=

Use Eclipse Memory Analyzer to analyze it

Can analyze objects created, memory used, threads in different states, etc.

Security(JWT,SAML,OAUTH)
JWT - JSON Web Token. Generated by the server after it authenticates the username and password. The server encodes the user info in the JWT, signs it, and sends it to the client. The user info is already present in the JWT when it is received, so the server does not have to save session info; it just has to validate the token using its secret key. If we decode the token we find the header, which names the signing algorithm; the payload, which contains the user info and expiry timestamp; and finally the signature, which is validated against the secret key. The payload is only Base64URL-encoded, not encrypted, so it should not carry secrets.
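The sign-then-validate flow above can be sketched with JDK classes only. This is a minimal illustration assuming the HS256 (HMAC-SHA256) algorithm; the class `JwtSketch` and its method names are hypothetical, and a real service would normally use a maintained JWT library and also check the expiry claim, which this sketch does not do.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch of HS256 token signing and verification using only the JDK. */
class JwtSketch {
    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

    /** Builds header.payload.signature for the given JSON payload. */
    static String sign(String payloadJson, byte[] secret) {
        String header = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
        String signingInput = B64.encodeToString(header.getBytes(StandardCharsets.UTF_8))
                + "." + B64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        return signingInput + "." + B64.encodeToString(hmac(signingInput, secret));
    }

    /** Recomputes the signature over header.payload and compares it with the one in the token. */
    static boolean verify(String token, byte[] secret) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) {
            return false;
        }
        byte[] expected = hmac(token.substring(0, lastDot), secret);
        byte[] actual;
        try {
            actual = Base64.getUrlDecoder().decode(token.substring(lastDot + 1));
        } catch (IllegalArgumentException e) {
            return false; // signature part is not valid Base64URL
        }
        // Constant-time comparison avoids leaking how many bytes matched.
        return MessageDigest.isEqual(expected, actual);
    }

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the server only needs the secret to recompute the signature, no session state has to be stored; any change to the header or payload makes `verify` return false.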

SSO - SAML (Security Assertion Markup Language). 3 entities -> User, Service Provider (SP), Identity Provider (IdP, e.g. Okta or Keycloak; OpenID Connect is an alternative protocol to SAML)

Handles both authentication and authorization

SAML XML contains - issuer, ACS URL, auth request ID, timestamp

Step 1: User tries to access private resources from SP.

Step 2: SP generates SAML Request.

Step 3: After generating SAML Request SP redirects the user to IdP.

Step 4: The IdP asks the user to authenticate with login details.

Step 5: IdP validates the user and generates SAML Response that contains the SAML Assertion required for SP.

Step 6: The IdP redirects the user to SP’s Assertion Consumer Service (ACS).

Step 7: The ACS validates the assertion and allows the user to access the protected resource.

Step 8: The user is now able to access resources from the SP.

OAUTH/OAUTH2.0 - For Authorization


 * 1) The client application requests authorization by directing the resource owner to the authorization server.
 * 2) The authorization server authenticates the resource owner and informs the user about the client and the data requested by the client. Clients cannot access user credentials since authentication is performed by the authorization server.
 * 3) Once the user grants permission to access the protected data, the authorization server redirects the user to the client with the temporary authorization code.
 * 4) The client requests an access token in exchange for the authorization code.
 * 5) The authorization server authenticates the client, verifies the code, and will issue an access token to the client.
 * 6) Now the client can access protected resources by presenting the access token to the resource server.
 * 7) If the access token is valid, the resource server returns the requested resources to the client.

JAVA8
Class: A class is a blueprint or template for creating objects. It defines the properties (attributes) and behaviors (methods) that the objects of the class will have.

Object: An object is an instance of a class. It is a runtime entity with its own set of data members (attributes) and methods (functions).

Encapsulation: Encapsulation is the bundling of data (attributes) and methods that operate on the data into a single unit, i.e., a class. It helps in hiding the internal details of an object and only exposing what is necessary.

Inheritance: Inheritance allows a class (subclass/derived class) to inherit the properties and behaviors of another class (superclass/base class). It promotes code reusability and establishes a relationship between classes.

public class Animal {
    public void eat() {
        System.out.println("Animal is eating");
    }
}

public class Dog extends Animal {
    public void bark() {
        System.out.println("Dog is barking");
    }
}

// Usage
Dog myDog = new Dog();
myDog.eat();  // Inherited method
myDog.bark(); // Method specific to Dog class

Polymorphism: Polymorphism allows objects of different classes to be treated as objects of a common superclass. It can be achieved through method overloading and method overriding.

Overloading - compile-time polymorphism.

Overriding - runtime polymorphism: the compiler does not know which method will be called, so the call is resolved at runtime. Achieved using upcasting.

Covariant return type: an overriding method may return a subtype of the return type declared by the superclass method.


class A {
    A get() { return this; }
}

class B1 extends A {
    @Override
    B1 get() { return this; }

    void message() { System.out.println("welcome to covariant return type"); }
}

super :


 * 1) super can be used to refer immediate parent class instance variable.
 * 2) super can be used to invoke immediate parent class method.
 * 3) super can be used to invoke immediate parent class constructor.

Instance initializer block: used to initialize instance data members. It runs each time an object of the class is created.

IS-A relationship : Inheritance

HAS-A relationship : Aggregation (User Has address)

Final: can be applied to a method, class or variable.

A final class cannot be extended.

A final method cannot be overridden.

Final variables cannot be reassigned - they can be initialized in a constructor (instance fields) or a static block (static fields).

Upcasting :


class A {}
class B extends A {}

A a = new B(); // upcasting

static binding: the type of the object is determined at compile time (by the compiler)

dynamic binding: the type of the object is determined at runtime, as in runtime polymorphism.

Abstract class : concrete or abstract methods


 * An abstract class must be declared with an abstract keyword.
 * It can have abstract and non-abstract methods.
 * It cannot be instantiated.
 * It can have constructors and static methods also.
 * It can have final methods which will force the subclass not to change the body of the method.


abstract class Bank {
    abstract int getRateOfInterest();
}

class SBI extends Bank {
    int getRateOfInterest() { return 7; }
}

Interface : Since Java 8, we can have default and static methods in an interface.

Since Java 9, we can have private methods in an interface.

Multiple inheritance can be achieved using interfaces.

Exception handling :

Throwable -> Exception, Error

Exceptions are either checked or unchecked (RuntimeException and its subclasses)

URL Shortner.
Questions: number of URLs (depends on DAU), read/write ratio (100:1), how many years to store

A shortened URL can be a combination of digits (0-9) and letters (a-z, A-Z).

Capacity estimation -> writes: 10 million/day = 10 x 10^6 / 10^5 seconds -> 100 writes/sec

reads = 100 x 100 = 10k reads/sec

storage = 10 million per day x 365 days x 100 years x datasize = 365 x 10^9 x datasize (about 400 x 10^9 x datasize)

datasize = short URL + long URL + createdDate (100 bytes = 10^2), hence about 40 TB

in UTF-8 charset -> char = 1 byte, date = 3 bytes, integer = 4 bytes

API -> POST api/v1/generate

• request parameter: {longUrl: longURLString} • returns shortURL with HTTP 200 OK

GET api/v1/shortUrl

• returns longURL for HTTP redirection (301 permanent)

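DB design -> url table -> id, shorturl, longurl, createdate

Shortening algo -> base 64 encoding (64^6 is about 6.9 x 10^10 and 64^7 is about 4.4 x 10^12 codes; with about 365 x 10^9 URLs to store, 7 characters are needed). With only 0-9, a-z, A-Z the alphabet has 62 symbols, i.e. base 62.

SHA-1 or MD5 hash (then take the first 7 chars); on collision, take the next 7

can use a Bloom filter too for collision detection

another way is to use a unique ID generator like Snowflake; IDs can also be pregenerated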
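A short code can be derived from a numeric ID (for example one produced by a unique ID generator) by writing the ID in base 62 over the characters 0-9, a-z, A-Z. The sketch below is one possible way to do this; the class name `Base62Sketch` and the ordering of the alphabet are assumptions.

```java
/** Hypothetical sketch: encode a numeric ID as a Base62 short code and decode it back. */
class Base62Sketch {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            // Least significant digit first; reversed at the end.
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }

    static long decode(String code) {
        long id = 0;
        for (char ch : code.toCharArray()) {
            int digit = ALPHABET.indexOf(ch);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + ch);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

Seven characters cover 62^7 (about 3.5 x 10^12) distinct IDs, which is more than the roughly 365 x 10^9 URLs estimated above.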

Rate Limiter
Questions: is rate limiting based on userId, IP, etc.?

Token bucket (Redis), leaky bucket (using a queue), sliding window

replenish rate and burst rate

can set an expiry on keys in Redis

can use a Lua script, which Redis executes atomically, to avoid race conditions; can also use locking with SETNX

return a 429 response (Too Many Requests)

can also have queues to make extra requests wait
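The token bucket idea, with its replenish rate and burst rate, can be sketched in memory as below. This is an illustration under assumptions: the class `TokenBucket` is hypothetical, state lives in one JVM rather than in Redis, and the clock is passed in as a parameter so the behavior is deterministic.

```java
/** Hypothetical in-memory token bucket; a Redis-backed version would keep the same two values per user. */
class TokenBucket {
    private final long capacity;          // burst size: maximum tokens held
    private final double refillPerSecond; // replenish rate
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if the request is allowed; a caller would answer 429 otherwise. */
    synchronized boolean tryAcquire(long nowNanos) {
        // Add the tokens earned since the last call, capped at the bucket capacity.
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the caller would pass `System.nanoTime()`; in a distributed setup the refill-and-take step is the part that must run atomically.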

Consistent hashing
Only a limited number of keys are remapped when a node is added or removed.

There can be uneven distribution when a node goes down, hence virtual nodes are used.

With virtual nodes, each physical node is placed on the ring at several positions (replicas), so the standard deviation of the load decreases.

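import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashing {
    // Hash ring: position on the ring -> physical node that owns that position
    private final TreeMap<Long, String> circle = new TreeMap<>();
    private final int numberOfReplicas;

    public ConsistentHashing(int numberOfReplicas, List<String> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        for (String node : nodes) {
            addNode(node);
        }
    }

    public void addNode(String node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            String virtualNode = node + "-" + i;
            // Store the physical node so getNode returns it, not the virtual name
            circle.put(hash(virtualNode), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            String virtualNode = node + "-" + i;
            circle.remove(hash(virtualNode));
        }
    }

    public String getNode(String key) {
        if (circle.isEmpty()) {
            return null;
        }
        long hashKey = hash(key);
        Map.Entry<Long, String> entry = circle.ceilingEntry(hashKey);
        if (entry == null) {
            // Wrap around if the key is greater than the largest hash
            entry = circle.firstEntry();
        }
        return entry.getValue();
    }

    // The notes leave hash undefined; using the first 8 bytes of MD5 is an assumption
    private long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}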

Distributed KeyValue Store
put(key,value)

get(key)

CAP theorem

use a ring of nodes with consistent hashing

every node has its data replicated to other nodes

UniqueId generator
UUID -> 32 hex chars, 128 bits

collisions can occur (with very low probability)

not sortable by time

Timestamp -> 41 bits

multiple servers can generate the same timestamp

Snowflake approach -> timestamp + machine ID + sequence number
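The Snowflake layout can be sketched as a 64-bit value packed from the three parts. The bit widths (41-bit timestamp, 10-bit machine ID, 12-bit sequence) follow the commonly described layout; the class name `SnowflakeSketch` and the custom epoch value are assumptions for this example.

```java
/** Hypothetical Snowflake-style generator: 41-bit timestamp | 10-bit machine id | 12-bit sequence. */
class SnowflakeSketch {
    private static final long EPOCH = 1_600_000_000_000L; // assumed custom epoch, in ms
    private static final int MACHINE_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long machineId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeSketch(long machineId) {
        if (machineId < 0 || machineId >= (1L << MACHINE_BITS)) {
            throw new IllegalArgumentException("machineId out of range");
        }
        this.machineId = machineId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: spin until the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    Thread.onSpinWait();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        // The timestamp occupies the high bits, so IDs sort by creation time.
        return ((now - EPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}
```

The machine ID removes the problem of several servers sharing a timestamp, and the sequence number separates IDs created by one server within the same millisecond.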

WebCrawler
Questions: purpose? search engine indexing?

should it store HTML content only, or images and documents too?

what is the refresh rate?

For 1 billion pages per month

size of each page - 100 KB

1 billion x 100 KB = 10^9 x 10^5 bytes = 10^14 bytes = 100 TB/month

QPS: 10^9 / (30 days x 10^5 s/day) = about 3.3 x 10^2 = about 330 QPS

Can have multiple services like URL processor service, URL downloader service, parsing service

Seed URLs -> collect websites for different categories like news, ecommerce, etc. and put them in a seedurl table in the DB

URL processor service -> checks if a URL is already processed (as multiple websites can have the same links); if not, forwards the request to the queue for the downloader service

Can have a prioritizer that prioritizes websites based on ranking/category while putting them on the queue

Downloader service -> downloads contents and puts them in file storage. Can use each website's robots.txt to crawl only allowed pages.

Parser service -> parses the contents of the HTML page

DNS cache -> implement a distributed cache to overcome the DNS lookup bottleneck

Notification System
Question -> which types of notifications (SMS, email, mobile push, IVR, etc.)

iOS - Apple Push Notification service

Android - Firebase

SMS - different providers based on the region/country

Email - can have your own mail server or a third-party integration

Any service can submit a request, which is put on a RabbitMQ queue

RabbitMQ supports priority queues, so important notifications can be handled first

Rate limiting can be done using Redis

News Feed - Twitter
Question -> DAU (100Million)

What are most imp features to build. (user can publish post, friends can see post, users can follow other users etc)

POST /v1/feed Params: • content: content is the text of the post.
 * 1) follow others
 * 2) post tweet
 * 3) view feed

GET /v1/me/feed Params: • auth_token:

DB: can be relational with sharding, or a graph DB

Feed service -> queue / Kafka -> feed cache (entry for every user ID)

For people with millions of followers, we can't update the cache for all their followers, so fetch their posts dynamically and add them to the feed.

Images / videos are served from a CDN

-- Users table

CREATE TABLE users (

user_id INT PRIMARY KEY,

username VARCHAR(255) NOT NULL

);

-- Feeds table

CREATE TABLE feeds (

feed_id INT PRIMARY KEY,

user_id INT,

content TEXT NOT NULL,

created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

FOREIGN KEY (user_id) REFERENCES users(user_id)

);

-- Followers table

CREATE TABLE followers (

follower_id INT,

following_id INT,

PRIMARY KEY (follower_id, following_id),

FOREIGN KEY (follower_id) REFERENCES users(user_id),

FOREIGN KEY (following_id) REFERENCES users(user_id)

);

Chat System
Questions -> Do we need to store data in the backend?

Is group chat allowed?

How many msgs/day?

- Can't use stateless request/response, so no REST (every request opens a new connection)

- Can't rely on polling (inefficient)

- Use WebSockets

- Use multiple chat servers

- Use a key-value store like Redis to store msgs

- After authentication, the chat server info is sent to the client, and the client connects to that chat server

- ZooKeeper to keep track of chat servers (using service discovery)

-> Use Snowflake for msg ID (for ordering)

-> DB design (msgid, fromid, toid, content, timestamp)

(groupid, msgid, userid, content, timestamp)

-> User A might connect to server A and user B might connect to another server

-> Use a msg queue. As soon as a msg is received by a server, put it on the queue and the other server will receive it

-> To know online/offline status, can use Redis

-> If the user is not online, the notification service can send a notification

Auto Complete- Google Search
Questions -> Are suggestions based on frequency?

Is ranking required when giving suggestions?

How many suggestions per prefix (10)?

API -> GET /api/v1/suggestions?prefix=abc

- NAIVE SOLUTION: Can use an RDBMS with query string and frequency as columns, and update the frequency on every search

use the LIKE operator (not a good design)

- Use a trie data structure (to build it, can log all HTTP requests, save them in Hadoop/Kafka and process them with Spark)

- Can have replicas, or shard based on starting letter

class TrieNode {
    private final TrieNode[] children;
    private boolean isEndOfWord;

    public TrieNode() {
        this.children = new TrieNode[26]; // Assuming lowercase English letters
        this.isEndOfWord = false;
    }

    public TrieNode getChild(char ch) {
        return children[ch - 'a'];
    }

    public TrieNode createChild(char ch) {
        TrieNode node = new TrieNode();
        children[ch - 'a'] = node;
        return node;
    }

    public boolean isEndOfWord() {
        return isEndOfWord;
    }

    public void setEndOfWord(boolean endOfWord) {
        isEndOfWord = endOfWord;
    }
}

public class Trie {
    private final TrieNode root;

    public Trie() {
        this.root = new TrieNode();
    }

    // Walk down the trie, creating missing nodes, and mark the last node as a word
    public void insert(String word) {
        TrieNode current = root;
        for (char ch : word.toCharArray()) {
            if (current.getChild(ch) == null) {
                current = current.createChild(ch);
            } else {
                current = current.getChild(ch);
            }
        }
        current.setEndOfWord(true);
    }

    public boolean search(String word) {
        TrieNode node = searchNode(word);
        return node != null && node.isEndOfWord();
    }

    public boolean startsWith(String prefix) {
        return searchNode(prefix) != null;
    }

    // Returns the node reached by following the given characters, or null if the path does not exist
    private TrieNode searchNode(String word) {
        TrieNode current = root;
        for (char ch : word.toCharArray()) {
            if (current.getChild(ch) == null) {
                return null;
            } else {
                current = current.getChild(ch);
            }
        }
        return current;
    }
}
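For autocomplete, the trie also needs to return the most frequent completions of a prefix. The sketch below is a separate, self-contained illustration of that step: it keeps a frequency counter at each word end and returns the top k completions. The class `AutocompleteSketch`, the map-based children and the tie-break by alphabetical order are assumptions; a production design would more likely cache the top results at each node instead of walking the subtree on every request.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical frequency-aware trie that returns the top-k completions of a prefix. */
class AutocompleteSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int frequency; // greater than 0 only where a recorded query ends
    }

    private final Node root = new Node();

    /** Records one occurrence of a search query. */
    void record(String query) {
        Node current = root;
        for (char ch : query.toCharArray()) {
            current = current.children.computeIfAbsent(ch, c -> new Node());
        }
        current.frequency++;
    }

    /** Returns up to k recorded queries starting with prefix, most frequent first. */
    List<String> suggest(String prefix, int k) {
        Node current = root;
        for (char ch : prefix.toCharArray()) {
            current = current.children.get(ch);
            if (current == null) {
                return new ArrayList<>();
            }
        }
        List<Map.Entry<String, Integer>> found = new ArrayList<>();
        collect(current, new StringBuilder(prefix), found);
        found.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, found.size()); i++) {
            result.add(found.get(i).getKey());
        }
        return result;
    }

    // Depth-first walk of the subtree below the prefix node.
    private void collect(Node node, StringBuilder path, List<Map.Entry<String, Integer>> out) {
        if (node.frequency > 0) {
            out.add(new AbstractMap.SimpleEntry<>(path.toString(), node.frequency));
        }
        for (Map.Entry<Character, Node> child : node.children.entrySet()) {
            path.append(child.getKey());
            collect(child.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

After recording "car" three times, "cat" twice and "cart" once, `suggest("ca", 2)` returns "car" and then "cat".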

YouTube/Netflix
Assume the product has 5 million daily active users (DAU).

• Users watch 5 videos per day. • 10% of users upload 1 video per day.

• Assume the average video size is 300 MB.

• Total daily storage space needed: 5 million * 10% * 300 MB = 150TB

- Use a CDN

- Use blob storage for raw files

- Save metadata to the DB

- Upload videos to S3 using a presigned URL

- Once the video is uploaded, use a transcoder to convert it to different bitrates, resolutions and formats

- Can use AWS Lambda for this

Proximity Service
Can use a quadtree - each node has 0 to 4 children

- Divide the 2D space into 4 quadrants

- In a quadtree, leaf nodes can be split based on the number of locations they hold

Other approach - divide the world into quadrants 00, 01, 11, 10

- each quadrant can again be divided into 0000, 0010, ...

- can put the resulting codes in a DB and do a SQL LIKE query on the prefix

DB - locationID

name

lat

long

description

Can use the S2 library from Google too (lat/long to cell ID; supports range queries too)
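The quadrant-code approach (00, 01, 11, 10, then 0000, 0010, ...) can be sketched as repeated halving of the longitude and latitude ranges, two bits per level. This is an illustration in the spirit of geohash, not the S2 algorithm; the class `QuadrantEncoder` and the bit order (longitude bit first, then latitude bit) are assumptions.

```java
/** Hypothetical quadrant encoder: two bits per subdivision level, longitude bit then latitude bit. */
class QuadrantEncoder {
    static String encode(double lat, double lon, int levels) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            double lonMid = (lonLo + lonHi) / 2;
            double latMid = (latLo + latHi) / 2;
            // First bit: east (1) or west (0) half of the current cell.
            if (lon >= lonMid) {
                code.append('1');
                lonLo = lonMid;
            } else {
                code.append('0');
                lonHi = lonMid;
            }
            // Second bit: north (1) or south (0) half of the current cell.
            if (lat >= latMid) {
                code.append('1');
                latLo = latMid;
            } else {
                code.append('0');
                latHi = latMid;
            }
        }
        return code.toString();
    }
}
```

Every location inside a cell shares that cell's code as a prefix, which is what makes a prefix query (for example SQL `LIKE '1100%'`) return the places in that cell. Points that are close but sit on opposite sides of a cell boundary do not share a prefix, so neighboring cells usually have to be queried as well.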

Chatgpt
Functional Requirements -> Create conversation by sending a prompt

View Conversation

update conversation

Delete conversation

Feedback( thumbsup, thumbsdown)

Non Functional -> Latency

Security

Scalability

The conversation service receives the msg -> profanity service (ML model)

ChatGPT service -> calls an ML model -> saves request and response in DB

Use feedback like thumbs up to send to the ML model for training

Have a risk model to ensure no corruption

Talk through API details for CRUD

Generative pretrained transformer: trained on data from books, web crawling, etc.

Ex: "Earth is ..." -> "a planet", "a place where humans live", "part of the solar system". Probability-based scoring of the next token: greedy decoding, temperature, top-k, etc.

Then fine-tune on data, then a reward model based on an emotions parameter, then reinforcement learning using the rewards

Distributed msg queue
Functional req -> publish to and consume from a queue

Non-functional -> Is it topic based, fanout based or direct message?

Scalability -> 10k topics x 10 million msgs/day = 10^11 (100 billion) msgs/day

Latency -> time to deliver and consume

The producer pushes to the queue (can use batching)

The consumer pulls from the queue (can pull in batches)

Message -> key, value

Write-ahead log (append-only log) - use segmentation to avoid large file sizes

Partition using consistent hashing and write to different nodes

Have metadata storage that holds offsets, followers for replicas, topic details and retention policy

ZooKeeper to coordinate (out-of-the-box solution for metadata storage, state storage and heartbeat service)

Configuration of acknowledgement

Digital Wallet
- Use an RDBMS (with sharding)

- The debit can be on one shard and the credit on another

- Subtract first and then add

- 2-phase commit protocol -> prepare (lock), commit

the coordinator service can be a single point of failure

locking is not a great option

- Saga -> small local transactions

Compensating transactions

Saga execution coordinator

Can use event sourcing to reproduce state by replaying events
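The saga idea for a transfer (debit first, then credit, and undo the debit if the credit fails) can be sketched as below. This is a single-process illustration under assumptions: the class `TransferSaga` is hypothetical, both "shards" are one in-memory map, and the failure is simulated by a missing target account.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical saga for a wallet transfer: debit, then credit, compensating the debit on failure. */
class TransferSaga {
    private final Map<String, Long> balances = new HashMap<>();

    void open(String account, long amount) {
        balances.put(account, amount);
    }

    long balance(String account) {
        return balances.get(account);
    }

    /** Returns true if both local transactions succeeded. */
    boolean transfer(String from, String to, long amount) {
        // Local transaction 1: subtract first, so money is never created.
        Long fromBalance = balances.get(from);
        if (amount <= 0 || fromBalance == null || fromBalance < amount) {
            return false;
        }
        balances.put(from, fromBalance - amount);
        try {
            // Local transaction 2: in a sharded setup this may run on another shard.
            credit(to, amount);
            return true;
        } catch (RuntimeException e) {
            // Compensating transaction: refund the debit.
            balances.merge(from, amount, Long::sum);
            return false;
        }
    }

    private void credit(String to, long amount) {
        if (!balances.containsKey(to)) {
            throw new IllegalStateException("unknown account " + to);
        }
        balances.merge(to, amount, Long::sum);
    }
}
```

Unlike two-phase commit, no lock is held across both steps; the price is that the debit is briefly visible before the credit or the refund happens.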

Google Docs
Should be able to create docs

Should be able to see other users editing

-> WebSockets for real-time changes

-> users will have a local version

-> positional indexing: the doc has positions, and that data is sent over the WebSocket

-> instead of 1, 2, 3 can use fractional positions .1, .2, .3 generated at runtime, so a new character can be inserted between two existing ones
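The benefit of fractional positions is that a new position can always be generated between two neighbors without renumbering the rest of the document. A minimal sketch, assuming positions are kept as `BigDecimal` values and the new position is simply the midpoint (the class `FractionalIndex` is hypothetical, and real collaborative editors use more elaborate schemes to handle concurrent inserts at the same spot):

```java
import java.math.BigDecimal;

/** Hypothetical helper: generate a position strictly between two existing positions. */
class FractionalIndex {
    private static final BigDecimal TWO = BigDecimal.valueOf(2);

    static BigDecimal between(BigDecimal left, BigDecimal right) {
        if (left.compareTo(right) >= 0) {
            throw new IllegalArgumentException("left must be smaller than right");
        }
        // Dividing by 2 always has a finite decimal expansion, so the exact divide is safe.
        return left.add(right).divide(TWO);
    }
}
```

Inserting between .1 and .2 yields .15, and inserting again between .1 and .15 yields .125, so existing characters keep their positions.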

- API Gateway (Zuul) - dynamic routing, monitoring, security
CODING

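reverse a sentence -> split on "\\s+", then join the words in reverse order

Can a string be rearranged into a palindrome -> every character occurs an even number of times, with at most one character occurring an odd number of times

BFS of a tree -> use a queue: add the root first, then poll a node and add its children

DFS of a tree -> e.g. pre-order traversal using recursion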
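The first three exercises can be sketched as follows. The class `CodingSketches`, the nested `Node` type and the method names are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical solutions to the exercises listed above. */
class CodingSketches {
    /** Reverses the order of the words in a sentence. */
    static String reverseWords(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = words.length - 1; i >= 0; i--) {
            sb.append(words[i]);
            if (i > 0) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    /** True if the characters can be rearranged into a palindrome. */
    static boolean canFormPalindrome(String s) {
        Set<Character> oddCounts = new HashSet<>();
        for (char ch : s.toCharArray()) {
            // Toggle membership: the set ends up holding characters seen an odd number of times.
            if (!oddCounts.add(ch)) {
                oddCounts.remove(ch);
            }
        }
        return oddCounts.size() <= 1;
    }

    static class Node {
        final int value;
        Node left;
        Node right;

        Node(int value) {
            this.value = value;
        }
    }

    /** Breadth-first traversal of a binary tree using a queue. */
    static List<Integer> bfs(Node root) {
        List<Integer> order = new ArrayList<>();
        if (root == null) {
            return order;
        }
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            order.add(node.value);
            if (node.left != null) {
                queue.add(node.left);
            }
            if (node.right != null) {
                queue.add(node.right);
            }
        }
        return order;
    }
}
```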