Apache Kafka

Key Points

  1. Data pipelines connect and transform data from sources to targets, in batches or as event streams
  2. Kafka directory service definitions are stored as metadata, with optional data schemas
  3. Brokers connect clients and producers to topics (queues) using the Kafka directory services
  4. Kafka handles many types of producers, consumers, and topics (queues), with replication and partitions
  5. Messages can be sent by multiple producers and consumed by multiple clients; delivery can be serialized, synchronous or asynchronous
  6. Clients must poll a queue - there is NO direct push option (e.g., no web sockets model)
  7. Producers push messages to topic queues (see the sketch below)
  8. Queues can be persistent
  9. A REST proxy is available as an interface to the queues
  10. Future option to replace the ZooKeeper directory with distributed metadata services (DMS) - see KIP-500 below
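
A minimal sketch of the producer-push / consumer-poll model from the key points above, using the standard Java kafka-clients API (the broker address and topic name are assumptions for illustration):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        // Producer pushes a message to a topic
        Properties prodProps = new Properties();
        prodProps.put("bootstrap.servers", "localhost:9092");
        prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
            producer.send(new ProducerRecord<>("demo-topic", "key1", "hello"));
        }

        // Consumer must poll - Kafka does not push to clients
        Properties consProps = new Properties();
        consProps.put("bootstrap.servers", "localhost:9092");
        consProps.put("group.id", "demo-group");
        consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            consumer.subscribe(List.of("demo-topic"));
            // A real client polls in a loop; one poll shown here for brevity
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}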


References

Reference_description_with_linked_URLs_______________________Notes______________________________________________________________




https://projects.apache.org/projects.html?category

Apache Big Data projects
https://www.educba.com/my-courses/dashboard/
Data engineering education site: educba
Apache Data Services




Kafka
https://kafka.apache.org/
Apache Kafka home
https://kafka.apache.org/quickstart
Kafka quickstart
https://kafka.apache.org/uses
Kafka use cases
https://kafka.apache.org/documentation/
Kafka documentation

https://drive.google.com/open?id=1s05zoAUC3kkW6ZDX_LGjMRptySCToZ_a

https://gist.github.com/fagossa/ff43fb522369efb2c88a82c8c72dec80

https://www.datadoghq.com/blog/collecting-kafka-performance-metrics/

https://www.datadoghq.com/blog/monitor-kafka-with-datadog/

Kafka concepts overview

part 1

part 2

part 3

https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html

part 1 - Kafka for beginners: what is Apache Kafka
Kafka overview 1 pager

https://www.cs.rutgers.edu/~pxk/417/notes/kafka.html

kafka-course.rutgers-Kafka.pdf

Kafka overview - Rutgers *


https://github.com/aidtechnology/lf-k8s-hlf-webinar

https://drive.google.com/open?id=17S6ONAr7n3FyIY7-dknBsIs2vKFrY3xr

Kafka on Kubernetes Tutorial
https://drive.google.com/open?id=1mOk7UJ_cF0EIXt2QWVeWCFJg7bIm8Nc1A
Node.js client app using Kafka - overview
https://drive.google.com/open?id=1gwyX26VkGlI1wOiU2bgLk_ZaJIQNJ6aq
ebook - Kafka Streams in Action
https://www.confluent.io/online-talks/
Kafka talks online
https://drive.google.com/open?id=1mYlwU0z8hirNQrlPJYtWFqaUh9P6xnhy
Kafka event streams with MongoDB connector
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Kafka Improvement Proposals
https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
KIP-500 RFC - replace ZooKeeper for Kafka MDM

https://drive.google.com/open?id=10T8h05mecdU12FTmSflww4HfZNKghx1_

kafka-guide-2019-confluent-kafka-definitive-guide-complete.pdf

Kafka Guide - 2019 - Confluent

https://assets.confluent.io/m/7a91acf41502a75e/original/20180328-EB-Confluent_Designing_Event_Driven_Systems.pdf

kafka-ebook-20180328-EB-Confluent_Designing_Event_Driven_Systems.pdf

Kafka - Design Event Driven Systems ebook
https://www.confluent.io/resources/kafka-summit-2020/
Kafka 2020 Summit with presentations **

https://kafka-tutorials.confluent.io/?_ga=2.255305614.740418466.1611961981-2041899468.1610139243

Kafka 101 Tutorial - Event Streaming start ***  


https://www.confluent.io/resources/kafka-summit-2020/apache-kafka-and-ksqldb-in-action-lets-build-a-streaming-data-pipeline/

Kafka and KSQL Tutorial - create a data pipeline ***  


https://talks.rmoff.net/JuUJ8l#sSlsXRY
Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline
https://talks.rmoff.net/yE9NeV
Apache Kafka - From Zero to Hero with Kafka Connect

Apache Kafka and KSQL in Action: Let’s Build a Streaming Data Pipeline! by Robin Moffatt



https://banzaicloud.com/blog/supertubes-focal/

Apache Kafka Tutorial without Zookeeper-2024.pdf **


Compare to other open-source distributed file management systems

https://blockonomi.com/interplanetary-file-system/

ipfs-blockonomi.com-What is IPFS Interplanetary File System Complete Beginners Guide.pdf

IPFS tutorial

UDT 


Key Concepts



Kafka


Kafka 101 Tutorials

https://kafka-tutorials.confluent.io/?_ga=2.255305614.740418466.1611961981-2041899468.1610139243


Kafka 101 Apply Functions 

Apply a function to each event in a stream.
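
A hedged sketch of this pattern with the Kafka Streams Java DSL (the topic names and the uppercase function are assumptions for illustration):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ApplyFunctionSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "apply-function-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("raw-events");
        // Apply a function to every event value: here, uppercase the payload
        input.mapValues(value -> value.toUpperCase())
             .to("transformed-events");

        new KafkaStreams(builder.build(), props).start();
    }
}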


Kafka 101 Aggregation Functions

Aggregate events across a stream, e.g., a running count per key.
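
A hedged sketch of a streaming aggregation with the Kafka Streams Java DSL (topic names are assumptions; the count is continuously updated as events arrive):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class AggregationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Count events per key - a continuously updated aggregate table
        KTable<String, Long> counts = builder.<String, String>stream("page-views")
                .groupByKey()
                .count();
        counts.toStream()
              .to("page-view-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}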


Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!

https://www.confluent.io/resources/kafka-summit-2020/apache-kafka-and-ksqldb-in-action-lets-build-a-streaming-data-pipeline/

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with Kafka Connect, which is part of Apache Kafka. ksqlDB is the source-available SQL streaming engine for Apache Kafka and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.

In this talk, we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, Kafka Connect, and ksqlDB.

Gasp as we filter events in real-time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!
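
The talk itself builds the pipeline with ksqlDB and Kafka Connect; as a hedged illustration of the same filter-and-enrich idea, here is a rough Kafka Streams (Java) equivalent of joining an event stream against reference data (all topic names and the value format are hypothetical):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichmentSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Event stream keyed by customer id, e.g. landed by Kafka Connect from an RDBMS
        KStream<String, String> orders = builder.stream("orders");
        // Reference data as a changelog-backed table
        KTable<String, String> customers = builder.table("customers");

        // Filter events in real time, then enrich each order with customer data
        orders.filter((customerId, order) -> order != null && !order.isEmpty())
              .join(customers, (order, customer) -> order + " | " + customer)
              .to("orders-enriched");

        new KafkaStreams(builder.build(), props).start();
    }
}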


Kafka KIPs (Kafka Improvement Proposals)



KIP-500 RFC - replace Zookeeper for Kafka MDM

https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum


Currently, Kafka uses ZooKeeper to store its metadata about partitions and brokers, and to elect a broker to be the Kafka Controller. We would like to remove this dependency on ZooKeeper. This will enable us to manage metadata in a more scalable and robust way, enabling support for more partitions. It will also simplify the deployment and configuration of Kafka.

Brokers should simply consume metadata events from the event log.




Metadata as an event log

We often talk about the benefits of managing state as a stream of events. A single number, the offset, describes a consumer's position in the stream. Multiple consumers can quickly catch up to the latest state simply by replaying all the events newer than their current offset. The log establishes a clear ordering between events, and ensures that the consumers always move along a single timeline.

However, although our users enjoy these benefits, Kafka itself has been left out. We treat changes to metadata as isolated changes with no relationship to each other. When the controller pushes out state change notifications (such as LeaderAndIsrRequest) to other brokers in the cluster, it is possible for brokers to get some of the changes, but not all. Although the controller retries several times, it eventually gives up. This can leave brokers in a divergent state.

Currently, a Kafka cluster contains several broker nodes and an external quorum of ZooKeeper nodes.

Controller Log

The controller nodes comprise a Raft quorum which manages the metadata log. This log contains information about each change to the cluster metadata. Everything that is currently stored in ZooKeeper, such as topics, partitions, ISRs, configurations, and so on, will be stored in this log.

Broker Metadata Management

Instead of the controller pushing out updates to the other brokers, those brokers will fetch updates from the active controller via the new MetadataFetch API.

A MetadataFetch is similar to a fetch request.  Just like with a fetch request, the broker will track the offset of the last updates it fetched, and only request newer updates from the active controller. 

The broker will persist the metadata it fetched to disk.  This will allow the broker to start up very quickly, even if there are hundreds of thousands or even millions of partitions.  (Note that since this persistence is an optimization, we can leave it out of the first version, if it makes development easier.)

Most of the time, the broker should only need to fetch the deltas, not the full state.  However, if the broker is too far behind the active controller, or if the broker has no cached metadata at all, the controller will send a full metadata image rather than a series of deltas.

The broker will periodically ask for metadata updates from the active controller.  This request will double as a heartbeat, letting the controller know that the broker is alive.
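
A hedged, illustrative model of the fetch loop described above. All the types below are hypothetical stand-ins invented for this sketch; KIP-500 specifies the MetadataFetch API itself, not this code:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the KIP-500 broker metadata fetch loop.
// Every type here is a hypothetical stand-in, not a real Kafka class.
public class MetadataFetchSketch {

    record Delta(long offset, String key, String value) {}
    record FetchResponse(boolean fullImage, Map<String, String> image,
                         List<Delta> deltas, long endOffset) {}

    interface Controller {
        // The periodic fetch doubles as a heartbeat to the active controller
        FetchResponse fetchMetadata(long lastFetchedOffset);
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private long lastFetchedOffset = -1L; // -1 means no cached metadata yet

    void pollOnce(Controller controller) {
        // Like a fetch request: only ask for updates newer than our offset
        FetchResponse resp = controller.fetchMetadata(lastFetchedOffset);
        if (resp.fullImage()) {
            // Too far behind (or empty cache): controller sends a full image
            cache.clear();
            cache.putAll(resp.image());
        } else {
            // Normal case: apply only the deltas
            for (Delta d : resp.deltas()) {
                cache.put(d.key(), d.value());
            }
        }
        lastFetchedOffset = resp.endOffset();
        // (Persisting the cache to disk is an optimization for fast startup.)
    }
}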

Broker States

Cluster membership is integrated with metadata updates.  Brokers cannot continue to be members of the cluster if they cannot receive metadata updates.  While it is still possible for a broker to be partitioned from a particular client, the broker will be removed from the cluster if it is partitioned from the controller.


Offline

When the broker process is in the Offline state, it is either not running at all, or in the process of performing single-node tasks needed to start up, such as initializing the JVM or performing log recovery.

Fenced

When the broker is in the Fenced state, it will not respond to RPCs from clients.  The broker will be in the fenced state when starting up and attempting to fetch the newest metadata.  It will re-enter the fenced state if it can't contact the active controller.  Fenced brokers should be omitted from the metadata sent to clients.

Online

When a broker is online, it is ready to respond to requests from clients.

Stopping

Brokers enter the stopping state when they receive a SIGINT.  This indicates that the system administrator wants to shut down the broker.

When a broker is stopping, it is still running, but we are trying to migrate the partition leaders off of the broker.
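
A hedged sketch of the broker lifecycle above as a simple state model; the enum and transitions paraphrase KIP-500 and are illustrative, not Kafka source code:

// Illustrative model of the KIP-500 broker states - not Kafka source code
public class BrokerLifecycleSketch {

    enum BrokerState { OFFLINE, FENCED, ONLINE, STOPPING }

    private BrokerState state = BrokerState.OFFLINE;

    // Startup done (JVM init, log recovery): fetch newest metadata while fenced
    void startupComplete()       { state = BrokerState.FENCED; }

    // Caught up with the active controller: start serving client RPCs
    void metadataCaughtUp()      { state = BrokerState.ONLINE; }

    // Cannot contact the active controller: re-enter the fenced state
    void controllerUnreachable() { state = BrokerState.FENCED; }

    // SIGINT received: stay up while partition leaders migrate off this broker
    void shutdownRequested()     { state = BrokerState.STOPPING; }

    // Fenced brokers do not respond to client RPCs and are omitted
    // from the metadata sent to clients
    boolean servesClients()      { return state == BrokerState.ONLINE; }
}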





Java Spring Boot app using Kafka as message broker tutorial

https://developer.okta.com/blog/2019/11/19/java-kafka

kafka-java-tutorial-developer.okta.com-Kafka with Java Build a Secure Scalable Messaging App.pdf
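
A minimal hedged sketch of the pattern the Okta tutorial covers - a Spring Boot service producing and consuming with Spring Kafka (the topic and group names are assumptions, not the tutorial's exact code; broker settings come from spring.kafka.* properties):

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class MessagingService {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public MessagingService(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Producer side: Spring Boot auto-configures KafkaTemplate
    // from spring.kafka.* properties in application.yml
    public void send(String message) {
        kafkaTemplate.send("app-messages", message);
    }

    // Consumer side: Spring polls the topic and invokes this listener
    @KafkaListener(topics = "app-messages", groupId = "app-group")
    public void listen(String message) {
        System.out.println("Received: " + message);
    }
}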



Kafka Microservices Integration with JHipster, OIDC article

https://developer.okta.com/blog/2022/09/15/kafka-microservices

kafka-microservices-integration-jhipster-oicd-2022-developer.okta.com-Communicate Between Microservices with Apache Kafka.pdf link

kafka-microservices-integration-jhipster-oicd-2022-developer.okta.com-Communicate Between Microservices with Apache Kafka.pdf file

One of the traditional approaches for communicating between microservices is through their REST APIs. However, as your system evolves and the number of microservices grows, communication becomes more complex, and the architecture might start resembling our old friend the spaghetti anti-pattern, with services depending on each other or tightly coupled, slowing down development teams. This model can exhibit low latency but only works if services are made highly available.

To overcome this design disadvantage, new architectures aim to decouple senders from receivers, with asynchronous messaging. In a Kafka-centric architecture, low latency is preserved, with additional advantages like message balancing among available consumers and centralized management.

When dealing with a brownfield platform (legacy), a recommended way to decouple a monolith and ready it for a move to microservices is to implement asynchronous messaging.

In this tutorial you will learn how to:

  • Create a microservices architecture with JHipster
  • Enable Kafka integration for communicating microservices
  • Set up Okta as the authentication provider
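
As a hedged sketch of the decoupling idea above (not the tutorial's JHipster code): instead of invoking another service's REST API synchronously, a service publishes an event to Kafka and continues; any number of consumers can react independently. The topic, key, and payload below are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventsPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Asynchronous publish: the sender neither knows nor waits for
            // the microservices that will eventually consume this event
            producer.send(new ProducerRecord<>("order-events", "order-42", "{\"status\":\"CREATED\"}"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // delivery failed; handle or retry here
                        }
                    });
        }
    }
}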


Potential Value Opportunities


Automating K8S Kafka recoveries with Supertubes

Using Supertubes and related tooling to automate Kafka pod recoveries on Kubernetes


Potential Challenges



Candidate Solutions



Step-by-step guide for Example



sample code block




Recommended Next Steps