What is Kafka :
- Kafka is a publish-subscribe messaging system that is durable, fast, scalable, and fault-tolerant, with high throughput.
What Kafka is Used for :
- Kafka is used for stream processing, website activity tracking, metrics collection and monitoring, log aggregation, real-time analytics, ingesting data into Spark, ingesting data into Hadoop, message replay, error recovery, and as a guaranteed distributed commit log for in-memory computing (microservices).
- Kafka can work with Spark Streaming, Storm, HBase, and Spark for real-time ingestion, analysis, and processing of streaming data. Kafka is also used as a data stream to feed Hadoop Big Data lakes.
Why Kafka is fast ?
- Kafka relies on the OS kernel to move data around quickly (zero-copy transfer between the page cache and the network socket). It also uses I/O efficiently by batching and compressing records.
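To illustrate why batching and compressing records together saves I/O, here is a minimal sketch using only the Python standard library. It is not Kafka's wire format: the JSON encoding, the zlib codec, and the helper names encode_batch and decode_batch are assumptions made for this example.

```python
import json
import zlib


def encode_batch(records):
    """Serialize a list of records as one payload and compress it once."""
    payload = json.dumps(records).encode("utf-8")
    return zlib.compress(payload)


def decode_batch(blob):
    """Reverse encode_batch: decompress, then parse the JSON payload."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical sample data: 200 small, repetitive activity-tracking events.
records = [
    {"user": f"user-{i % 5}", "event": "page_view", "page": "/home"}
    for i in range(200)
]

# Option A: compress every record on its own (200 tiny payloads).
individual_size = sum(
    len(zlib.compress(json.dumps(r).encode("utf-8"))) for r in records
)

# Option B: compress the whole batch as a single payload.
batch = encode_batch(records)
batched_size = len(batch)

print(f"compressed one by one: {individual_size} bytes")
print(f"compressed as a batch: {batched_size} bytes")
```

The batch compresses far better because the compressor can exploit repetition across records, and one large write replaces many small ones.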
Kafka Architecture :
Image : cloudurable
Kafka consists of Records, Topics, Consumers, Producers, Brokers, Logs, Partitions, and Clusters.
- A Kafka Record can have a key (optional), a value, and a timestamp. Kafka Records are immutable.
- A Kafka Topic is a stream of records. A topic has a Log, which is the topic’s storage on disk. A Topic Log is broken up into partitions and segments.
- The Kafka Producer API is used to produce streams of data records. The Kafka Consumer API is used to consume a stream of records.
- A Kafka Broker is a Kafka server that runs in a Kafka Cluster. Kafka Brokers form a cluster.
Kafka uses ZooKeeper to manage the cluster, including leader election for Kafka Broker and Topic Partition pairs. Newer Kafka releases can instead run in KRaft mode, which does not need ZooKeeper.
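The record structure described above (optional key, value, timestamp, immutable) can be sketched as a frozen dataclass. This is only an illustration: the class name Record and the field names are assumptions, not the types used by any Kafka client library.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Toy model of a record: value and timestamp, plus an optional key."""
    value: bytes
    timestamp_ms: int
    key: Optional[bytes] = None  # the key is optional


record = Record(value=b"page_view", timestamp_ms=1_700_000_000_000, key=b"user-42")

# Records are immutable: once written, a field cannot be changed.
try:
    record.value = b"changed"
    mutated = True
except FrozenInstanceError:
    mutated = False

print("record was mutated:", mutated)
```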
How Kafka works : Kafka Producer, Consumer, Topic details :
Kafka producers write to Topics, and Kafka consumers read from Topics. A topic is associated with a log, which is a data structure on disk. Kafka appends records from one or more producers to the end of a topic log. A topic log consists of many partitions, each stored as multiple files, and the partitions can be spread across multiple Kafka cluster nodes.
Consumers read from Kafka topics at their own cadence and can pick where they are (the offset) in the topic log. Each consumer group tracks the offset where it left off reading.
Kafka distributes topic log partitions across different nodes in a cluster for high performance with horizontal scalability. Spreading partitions helps Kafka write data quickly. Topic log partitions are Kafka's way to shard reads and writes to the topic log. Partitions are also needed so that multiple consumers in a consumer group can work at the same time. Kafka replicates partitions to many nodes to provide failover.
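The append-and-offset behaviour described above can be sketched with a toy in-memory model. This is not how a broker is implemented and it does not use a Kafka client. The class TopicLog and its methods append, poll, and seek are hypothetical names chosen for this example.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a topic log split into partitions."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list of values.
        self.partitions = [[] for _ in range(num_partitions)]
        # committed[group][partition] -> next offset that group will read
        self.committed = defaultdict(lambda: defaultdict(int))

    def append(self, partition, value):
        """Append to the end of one partition and return the new offset."""
        log = self.partitions[partition]
        log.append(value)
        return len(log) - 1

    def poll(self, group, partition, max_records=10):
        """Return the next records for a group and advance its offset."""
        start = self.committed[group][partition]
        batch = self.partitions[partition][start:start + max_records]
        self.committed[group][partition] = start + len(batch)
        return batch

    def seek(self, group, partition, offset):
        """Move a group's position, for example to replay old records."""
        self.committed[group][partition] = offset


topic = TopicLog(num_partitions=2)
for i in range(5):
    topic.append(0, f"event-{i}")

first = topic.poll("analytics", 0, max_records=3)   # reads offsets 0..2
second = topic.poll("analytics", 0, max_records=3)  # resumes at offset 3
audit = topic.poll("audit", 0)                      # other group starts at 0

topic.seek("analytics", 0, 0)                       # rewind to replay
replayed = topic.poll("analytics", 0, max_records=2)
```

Each group keeps its own position, so a slow group never holds back a fast one, and rewinding an offset replays records that are still in the log.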
How can Kafka scale if multiple producers and consumers read and write to the same Kafka topic log at the same time? First, Kafka writes to the filesystem sequentially, which is fast. Second, Kafka scales writes and reads by “sharding” topic logs into partitions.
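As a rough sketch of how sharding by partition spreads load, the example below maps record keys to partitions with a hash. It uses CRC32 from the Python standard library as a stand-in. The default partitioner in Kafka's Java client hashes keys with murmur2, so the partition numbers here will not match a real cluster. The function name choose_partition and the sample keys are assumptions for this example.

```python
import zlib
from collections import Counter


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always gets the same one."""
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 4

# Hypothetical keys: one per user, so each user's records stay in order
# within a single partition.
keys = [f"user-{i}".encode("utf-8") for i in range(1000)]

# Count how many keys land on each partition.
load = Counter(choose_partition(k, NUM_PARTITIONS) for k in keys)

for partition in sorted(load):
    print(f"partition {partition}: {load[partition]} keys")
```

Because different keys land on different partitions, producers and consumers can work on separate partitions in parallel, while records that share a key stay together.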