Apache Kafka
Key Points
- Data pipelines connect and transform data from sources to targets, either in batches or as event streams
- Kafka directory service definitions are stored as metadata, with optional data schemas
- Brokers connect clients and producers to topics (queues) using the Kafka directory services
- Kafka handles many types of producers, consumers, and topics (queues), along with replication and partitions
- Messages can be sent by multiple producers and consumed by multiple clients, serialized, and handled synchronously or asynchronously
- Clients must poll a queue - there is NO direct push option (e.g. a WebSockets-style model)
- Producers push messages to topic queues (see the minimal producer/consumer sketch after this list)
- Queues can be persistent
- A REST proxy is available as an interface to the queues
- A future option will replace the ZooKeeper directory with distributed metadata services (DMS)
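Below is a minimal Java sketch of that push/poll model: a producer pushes one message to a topic, and a consumer polls it back. The broker address (localhost:9092), topic name (demo-topic), and group id (demo-group) are assumptions for illustration, and the kafka-clients library is assumed to be on the classpath.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KafkaPushPollDemo {

    public static void main(String[] args) {
        String topic = "demo-topic";            // assumed topic name
        String bootstrap = "localhost:9092";    // assumed broker address

        // Producer: pushes messages to the topic (queue)
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", bootstrap);
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>(topic, "key-1", "hello kafka"));
        }

        // Consumer: must poll the topic - the broker never pushes to clients
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", bootstrap);
        consumerProps.put("group.id", "demo-group");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of(topic));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
            }
        }
    }
}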
References
Key Concepts
Kafka
Kafka 101 Tutorials
https://kafka-tutorials.confluent.io/?_ga=2.255305614.740418466.1611961981-2041899468.1610139243
Kafka 101 Basic Concepts
- Console producer and consumer basics
- Console producer and consumer basics with Confluent Cloud
- Console consumer with primitive keys and values
- Idempotent producer for message ordering and no duplication (see the config sketch after this list)
- Avro console producer and consumer
- Change the number of partitions and replicas of a Kafka topic
- Console consumer reads from a specific offset and partition
- Produce and consume messages with clients in multiple languages
- Build your first Kafka producer application
- Using Callbacks to handle Kafka producer responses
- Build your first Kafka consumer application
- Count the number of messages in a Kafka topic
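As a companion to the idempotent producer item above, here is a hedged Java sketch of the relevant producer settings. The broker address and topic name are assumptions; the idempotence-related settings themselves are standard kafka-clients producer configs.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Idempotence prevents duplicates caused by producer retries and preserves per-partition ordering
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Idempotence requires acks=all and at most 5 in-flight requests per connection
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created")); // assumed topic name
        }
    }
}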
Kafka 101 Apply Functions
- Masking data
- Working with nested JSON
- Working with heterogeneous JSON
- Concatenation
- Convert a stream's serialization format
- Scheduling operations in Kafka Streams
- Transform a stream of events
- Flatten deeply nested events
- Filter a stream of events (see the Kafka Streams sketch after this list)
- Rekey a stream with a value
- Rekey a stream with a function
- Split a stream of events into substreams
- Merge many streams into one stream
- Find distinct events
- Apply a custom, user-defined function
- Handle deserialization errors
- Choose the output topic dynamically
- Name stateful operations
- Convert a KStream to a KTable
- Calculate the difference between two columns
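As a rough illustration of the filter and rekey items above, here is a small Kafka Streams sketch. The application id, broker address, and topic names are assumptions, and the value format (a comma-separated string) is invented purely for illustration.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class FilterAndRekeyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-rekey-demo");    // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");              // assumed input topic

        events
            // Filter a stream of events: keep only values flagged as important
            .filter((key, value) -> value != null && value.contains("important"))
            // Rekey the stream with a value: here the first comma-separated token becomes the new key
            .selectKey((key, value) -> value.split(",")[0])
            .to("important-events");                                             // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}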
Kafka 101 Aggregation Functions
- Count a stream of events (see the aggregation sketch after this list)
- Sum a stream of events
- Find the min/max in a stream of events
- Calculate a running average
- Cogroup multiple streams of aggregates
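A minimal Kafka Streams sketch of the first item above (counting a stream of events per key) follows. The application id, broker address, and topic names are assumptions.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class CountEventsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "count-events-demo");    // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Group events by key and maintain a continuously updated count per key
        KTable<String, Long> countsPerKey = builder
                .<String, String>stream("events")                               // assumed input topic
                .groupByKey()
                .count();

        // Emit the changelog of counts to an output topic
        countsPerKey.toStream()
                .to("event-counts", Produced.with(Serdes.String(), Serdes.Long())); // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}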
Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!
Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with Kafka Connect, which is part of Apache Kafka. ksqlDB is the source-available SQL streaming engine for Apache Kafka and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.
In this talk, we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, Kafka Connect, and ksqlDB.
Gasp as we filter events in real-time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!
Kafka KIPs (Kafka Improvement Proposals)
KIP-500 - replace ZooKeeper for Kafka metadata management
Currently, Kafka uses ZooKeeper to store its metadata about partitions and brokers, and to elect a broker to be the Kafka Controller. We would like to remove this dependency on ZooKeeper. This will enable us to manage metadata in a more scalable and robust way, enabling support for more partitions. It will also simplify the deployment and configuration of Kafka.
Brokers should simply consume metadata events from the event log.
- Status
- Motivation
- Architecture
- Compatibility, Deprecation, and Migration Plan
- Rejected Alternatives
- Follow-on Work
- References
Metadata as an event log
We often talk about the benefits of managing state as a stream of events. A single number, the offset, describes a consumer's position in the stream. Multiple consumers can quickly catch up to the latest state simply by replaying all the events newer than their current offset. The log establishes a clear ordering between events, and ensures that the consumers always move along a single timeline.
However, although our users enjoy these benefits, Kafka itself has been left out. We treat changes to metadata as isolated changes with no relationship to each other. When the controller pushes out state change notifications (such as LeaderAndIsrRequest) to other brokers in the cluster, it is possible for brokers to get some of the changes, but not all. Although the controller retries several times, it eventually gives up. This can leave brokers in a divergent state.
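The same offset-and-replay mechanics are visible in the ordinary Java consumer API. The sketch below is illustrative only; the broker address, topic, partition, and starting offset are assumptions.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplayFromOffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "replay-demo");                // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition partition = new TopicPartition("metadata-events", 0);  // assumed topic/partition
        long lastAppliedOffset = 42L;                                         // assumed consumer position

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(partition));
            // Resume from the first event not yet applied: a single number (the offset)
            // fully describes this consumer's position in the log
            consumer.seek(partition, lastAppliedOffset + 1);
            ConsumerRecords<String, String> newer = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : newer) {
                System.out.printf("replaying offset %d: %s%n", record.offset(), record.value());
            }
        }
    }
}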
Currently, a Kafka cluster contains several broker nodes, and an external quorum of ZooKeeper nodes.
Controller Log
The controller nodes comprise a Raft quorum which manages the metadata log. This log contains information about each change to the cluster metadata. Everything that is currently stored in ZooKeeper, such as topics, partitions, ISRs, configurations, and so on, will be stored in this log.
Broker Metadata Management
Instead of the controller pushing out updates to the other brokers, those brokers will fetch updates from the active controller via the new MetadataFetch API.
A MetadataFetch is similar to a fetch request. Just like with a fetch request, the broker will track the offset of the last updates it fetched, and only request newer updates from the active controller.
The broker will persist the metadata it fetched to disk. This will allow the broker to start up very quickly, even if there are hundreds of thousands or even millions of partitions. (Note that since this persistence is an optimization, we can leave it out of the first version, if it makes development easier.)
Most of the time, the broker should only need to fetch the deltas, not the full state. However, if the broker is too far behind the active controller, or if the broker has no cached metadata at all, the controller will send a full metadata image rather than a series of deltas.
The broker will periodically ask for metadata updates from the active controller. This request will double as a heartbeat, letting the controller know that the broker is alive.
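To make the described flow concrete, here is an illustrative Java sketch of the fetch-and-heartbeat loop. It is NOT Kafka's actual internal code: MetadataFetchClient, MetadataResponse, and the method names below are hypothetical stand-ins for the behavior KIP-500 describes.

// Illustrative pseudocode only - the types and methods here are hypothetical,
// not part of Kafka's real internal API.
public class BrokerMetadataFetchLoop {

    interface MetadataFetchClient {
        // Ask the active controller for metadata newer than lastFetchedOffset;
        // the request also acts as a heartbeat telling the controller the broker is alive
        MetadataResponse fetch(long lastFetchedOffset);
    }

    record MetadataResponse(boolean isFullImage, long newEndOffset, Object payload) { }

    private long lastFetchedOffset = -1L;   // -1 means no cached metadata yet

    void runOnce(MetadataFetchClient controller) {
        MetadataResponse response = controller.fetch(lastFetchedOffset);
        if (response.isFullImage()) {
            // Broker was too far behind (or had no cache): replace local state wholesale
            applyFullImage(response.payload());
        } else {
            // Normal case: apply only the deltas since lastFetchedOffset
            applyDeltas(response.payload());
        }
        lastFetchedOffset = response.newEndOffset();
        persistToDisk();   // optional optimization noted in the KIP for fast restarts
    }

    private void applyFullImage(Object image) { /* ... */ }
    private void applyDeltas(Object deltas) { /* ... */ }
    private void persistToDisk() { /* ... */ }
}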
Broker States
Cluster membership is integrated with metadata updates. Brokers cannot continue to be members of the cluster if they cannot receive metadata updates. While it is still possible for a broker to be partitioned from a particular client, the broker will be removed from the cluster if it is partitioned from the controller.
Offline
When the broker process is in the Offline state, it is either not running at all, or in the process of performing the single-node tasks needed to start up, such as initializing the JVM or performing log recovery.
Fenced
When the broker is in the Fenced state, it will not respond to RPCs from clients. The broker will be in the fenced state when starting up and attempting to fetch the newest metadata. It will re-enter the fenced state if it can't contact the active controller. Fenced brokers should be omitted from the metadata sent to clients.
Online
When a broker is online, it is ready to respond to requests from clients.
Stopping
Brokers enter the stopping state when they receive a SIGINT. This indicates that the system administrator wants to shut down the broker.
When a broker is stopping, it is still running, but we are trying to migrate the partition leaders off of the broker.
Java Spring Boot app using Kafka as message broker tutorial
https://developer.okta.com/blog/2019/11/19/java-kafka
kafka-java-tutorial-developer.okta.com-Kafka with Java Build a Secure Scalable Messaging App.pdf
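For orientation, a minimal Spring Boot sketch of the pattern covered in the tutorial is shown below; the tutorial's Okta security setup is omitted. It assumes spring-kafka and spring-boot-starter-web on the classpath, broker settings supplied via application properties, and a topic named "messages" - all of these names are assumptions.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
public class KafkaMessagingApplication {
    public static void main(String[] args) {
        SpringApplication.run(KafkaMessagingApplication.class, args);
    }
}

@RestController
class MessageController {
    private final KafkaTemplate<String, String> kafkaTemplate;

    MessageController(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // POST a message body and publish it to the (assumed) "messages" topic
    @PostMapping("/messages")
    public void publish(@RequestBody String message) {
        kafkaTemplate.send("messages", message);
    }
}

@Component
class MessageListener {
    // Spring polls the topic under the covers and invokes this method per record
    @KafkaListener(topics = "messages", groupId = "demo-group")
    public void onMessage(String message) {
        System.out.println("Received: " + message);
    }
}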
Kafka Microservices Integration with JHipster, OIDC article
https://developer.okta.com/blog/2022/09/15/kafka-microservices
One of the traditional approaches for communicating between microservices is through their REST APIs. However, as your system evolves and the number of microservices grows, communication becomes more complex, and the architecture might start resembling our old friend the spaghetti anti-pattern, with services depending on each other or tightly coupled, slowing down development teams. This model can exhibit low latency but only works if services are made highly available.
To overcome this design disadvantage, new architectures aim to decouple senders from receivers, with asynchronous messaging. In a Kafka-centric architecture, low latency is preserved, with additional advantages like message balancing among available consumers and centralized management.
When dealing with a brownfield platform (legacy), a recommended way to decouple a monolith and ready it for a move to microservices is to implement asynchronous messaging.
In this tutorial you will learn how to:
- Create a microservices architecture with JHipster
- Enable Kafka integration for communicating microservices
- Set up Okta as the authentication provider
Potential Value Opportunities
Automating K8S Kafka recoveries with Supertubes
Using Supertubes, Kafka, and related tooling to automate pod recoveries
Potential Challenges
Candidate Solutions
Step-by-step guide for Example
sample code block