Apache Data Services
Key Points
- Data pipelines connect and transform data from sources to targets, in batches or as event streams
ToDo > Data Services Source Reviews
eblock reqmts model - groovy dsl expando w events ?
eblock test app
eblock composer
esb pub / sub w events
alf airflow
alf nifi batch
alf camel
alf kafka
alf spark
alf beam
istio.io
https://istio.io/docs/concepts/what-is-istio/
http://apache.org/index.html#projects-list
https://www.linuxfoundation.org/projects/directory/
https://www.npmjs.com/
https://skywalking.apache.org/blog/2019-01-25-mesh-loadtest.html
https://www.nginx.com/blog/what-is-a-service-mesh/
References
Key Concepts
Key Apache Projects Supporting Distributed Data Services
https://www.linkedin.com/pulse/popular-apache-projects-burak-%C3%B6zt%C3%BCrk/
Apache-Data-Services-Projects-overview-2023-linkedin.com-Popular Apache Projects.pdf (link / file)
Hadoop
Airflow
a batch-processing data pipeline orchestrator, similar to Apache Oozie
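A minimal sketch of an Airflow batch DAG, assuming Airflow 2.x; the DAG id, task names, and echo commands are illustrative placeholders only:
# Sketch: three-step batch pipeline DAG (names and commands are placeholders)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",     # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load       # batch dependency chain, like an Oozie workflow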
Kestra
https://www.infoq.com/news/2022/03/kestra-orchestration-platform/
Kestra, a new open-source orchestration and scheduling platform, helps developers to build, run, schedule, and monitor complex pipelines.
uses YAML rather than Python for flow configuration
https://github.com/kestra-io/kestra
Nifi
Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
- Web-based user interface
- Seamless experience between design, control, feedback, and monitoring
- Highly configurable
- Loss tolerant vs guaranteed delivery
- Low latency vs high throughput
- Dynamic prioritization
- Flow can be modified at runtime
- Back pressure
- Data Provenance
- Track dataflow from beginning to end
- Designed for extension
- Build your own processors and more
- Enables rapid development and effective testing
- Secure
- SSL, SSH, HTTPS, encrypted content, etc...
- Multi-tenant authorization and internal authorization/policy management
FBP Term | NiFi Term | Description |
---|---|---|
Information Packet | FlowFile | Each object moving through the system. |
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems. |
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates. |
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use. |
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new components simply by composition of its components. |
A FlowFile is like an HTTP message: content (payload) plus attributes (metadata)
Nifi integration points
NiFi Term | Description |
---|---|
Flow File Processor | Push/Pull behavior. Custom UI |
Reporting Task | Used to push data from NiFi to some external service (metrics, provenance, etc.) |
Controller Service | Used to enable reusable components / shared services throughout the flow |
REST API | Allows clients to connect to pull information, change behavior, etc. |
NiFi can deliver data provenance
Flows are versioned and shared via the NiFi Registry (vs. exported XML templates)
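A hedged sketch of hitting the NiFi REST API from Python to read flow status; the endpoint path is taken from the NiFi 1.x REST docs but should be verified against your version, and the base URL and security settings are placeholders:
# Sketch: query NiFi flow status over the REST API (path and fields assumed; verify for your NiFi version)
import requests

NIFI = "https://localhost:8443/nifi-api"   # placeholder base URL; auth depends on your NiFi security setup

resp = requests.get(f"{NIFI}/flow/status", verify=False)
resp.raise_for_status()
status = resp.json()
print(status)   # e.g. counts of running/stopped processors and queued FlowFiles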
Ignite - in memory db
Ignite™ is a memory-centric distributed database, caching, and processing platform for
transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale
Achieve horizontal scalability, strong consistency, and high availability with Ignite™ distributed SQL
Accelerate existing Relational and NoSQL databases with Ignite™ in-memory data grid and caching capabilities
Store and process distributed data in memory and on disk
Distributed memory-centric SQL database with support for joins
Read, write, transact with fastest key-value data grid and cache
Enforce full ACID compliance across distributed data sets
Avoid data noise by sending computations to cluster nodes
Train and deploy distributed machine learning models
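A minimal key-value sketch using the pyignite thin client, assuming an Ignite node listening on the default thin-client port 10800; the cache name and values are illustrative:
# Sketch: Ignite key-value access via the Python thin client (pip install pyignite)
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)               # default thin-client port

cache = client.get_or_create_cache("accounts")   # illustrative cache name
cache.put(1, "alice:100")                        # simple key-value write
print(cache.get(1))                              # read it back

client.close()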
Kafka
Compare Kafka Alternatives
https://www.macrometa.com/event-stream-processing/kafka-alternatives
Kafka Alternatives
Apache Kafka is an open source software messaging bus that uses stream processing. Because it’s a distributed platform known for its scalability, resilience, and performance, Kafka has become very popular with large enterprises. In fact, 80% of the Fortune 500 use Kafka.
However, there’s no such thing as a one-size-fits-all solution. What’s best for Uber or PayPal may not be ideal for your application. Fortunately, there are several alternative messaging platforms available.
Having a thorough understanding of the technology landscape is essential, whether you are developing applications or considering ready-to-go industry solutions provided by Macrometa.
Knowing which platform is right for you requires understanding the pros and cons of each. To help you make the right choice, we’ll take an in-depth look at Apache Kafka and four popular Kafka alternatives.
Before taking a detailed look at Kafka and those four alternatives, let’s start with a high-level overview of features.
Kafka compared to alternatives
Feature | Kafka | Google Pub/Sub | RabbitMQ | Pulsar | Macrometa |
---|---|---|---|---|---|
Ease of Setup | Hard | Easy | Medium | Hard | Easy |
Storage | Seven days (Scalable to disk size) | Seven days (auto-scaling) | The message is lost once acknowledged | Tiered storage scalable to disk size | Three days (support retention and TTL) |
Throughput & scalability | Best | Better | Good | Best | Best |
Message re-read / replay | Yes | No | No | Yes | Yes |
Type | Event streaming | Publish /Subscribe | Message Broker | Messaging and event streaming | Messaging and event streaming |
Built-in multitenancy | No | No | No | Yes | Yes |
Geo-replication | No | No | No | Yes | Yes |
Languages supported | 17 | 8 | 22 | 6 | Supports all Kafka & Pulsar clients |
Kafka is faster than Pulsar on some latency benchmarks
Canonical Charmed Kafka reference architecture guide 3 release 1.pdf (link / file)
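For context on the comparison above, a minimal Kafka produce/consume sketch using the kafka-python client; the broker address, topic, and group id are placeholders:
# Sketch: minimal Kafka producer and consumer (pip install kafka-python)
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"order-1", value=b'{"status":"created"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",   # replay from the beginning, illustrating message re-read support
)
for record in consumer:
    print(record.key, record.value)
    break
consumer.close()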
Apache Pulsar
Cloud-Native, Distributed Messaging and Streaming
Apache® Pulsar™ is an open-source, distributed messaging and streaming platform built for the cloud.
Apache Pulsar is an all-in-one messaging and streaming platform. Messages can be consumed and acknowledged individually or consumed as streams with less than 10ms of latency. Its layered architecture allows rapid scaling across hundreds of nodes, without data reshuffling.
Its features include multi-tenancy with resource separation and access control, geo-replication across regions, tiered storage and support for six official client languages. It supports up to one million unique topics and is designed to simplify your application architecture.
Pulsar is a Top 10 Apache Software Foundation project and has a vibrant and passionate community and user base spanning small companies and large enterprises.
Producer & Consumer
A Pulsar client contains a consumer and a producer. A producer writes messages on a topic. A consumer reads messages from a topic and acknowledges specific messages or all up to a specific message.
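A minimal producer/consumer sketch with the Pulsar Python client, assuming a local broker at pulsar://localhost:6650; topic and subscription names are illustrative:
# Sketch: Pulsar produce, consume, and acknowledge (pip install pulsar-client)
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/demo-topic")
producer.send(b"hello pulsar")

consumer = client.subscribe("persistent://public/default/demo-topic", "demo-sub")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)     # ack an individual message, per the consumer model above

client.close()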
Apache Zookeeper
Pulsar and BookKeeper use Apache ZooKeeper to save metadata coordinated between nodes, such as a list of ledgers per topic, segments per ledger, and mapping of topic bundles to a broker. It’s a cluster of highly available and replicated servers (usually 3).
Pulsar Brokers
Topics (i.e., partitions) are divided among Pulsar brokers. A broker receives messages for a topic and appends them to the topic’s active virtual file (a.k.a ledger), hosted on the Bookkeeper cluster. Brokers read messages from the cache (mostly) or BookKeeper and dispatch them to the consumers. Brokers also receive message acknowledgments and persist them to the BookKeeper cluster as well. Brokers are stateless (don't use/need a disk).
Apache Bookkeeper
Apache BookKeeper is a cluster of nodes called bookies. Each virtual file (a.k.a ledger) is divided into consecutive segments, and each segment is kept on 3 bookies by default (replicated by the client - i.e., the broker). Operators can add bookies rapidly since no data reshuffling (moving) between them is required. They immediately share the incoming write load.
Key Pulsar Features
- horizontal scalability
- low latency messaging and event streaming
- geo-replication across regions
- multi-tenancy
- ALB - automated load balancing
- Multi-language support for Java, Go, Python, C++, Node.js, and C#.
- Officially maintained connectors with popular 3rd parties: MySQL, Elasticsearch, Cassandra, and more. Allows streaming data in (source) or out (sink).
- Serverless Functions - Write, deploy functions natively using Pulsar Functions. Process messages without deploying fully-fledged applications. Kubernetes runtime is bundled.
- Topics - up to 1 million
Pulsar Quickstart
https://pulsar.apache.org/docs/3.0.x/
Why Pulsar vs Kafka?
Why Pulsar vs Kafka? Streamnative article for Pulsar
Pulsar vs Kafka Benchmark
Pulsar vs Kafka Benchmark article from streamnative
Large configurations on a high-bandwidth network
The testbed for the OpenMessaging Benchmark is set up as follows:
- 3 Broker VMs of type i3en.6xlarge, with 24-cores, 192GB of memory, 25Gbps guaranteed networking, and two NVMe SSD devices that support up to 1GB/s write throughput on each disk.
- 4 Client (producers and consumers) VMs of type m5n.8xlarge, with 32-cores and with 25Gbps of guaranteed networking throughput and 128GB of memory to ensure the bottleneck would not be on the client-side.
Results
Apache Event Mesh
EventMesh is a new generation serverless event middleware for building distributed event-driven applications.
https://github.com/apache/eventmesh
Apache EventMesh has a vast amount of features to help users achieve their goals. Let us share with you some of the key features EventMesh has to offer:
- Built around the CloudEvents specification.
- Rapidly extensible interconnector layer with connectors such as the source or sink of SaaS, cloud services, and databases.
- Rapidly extensible storage layer such as Apache RocketMQ, Apache Kafka, Apache Pulsar, RabbitMQ, Redis, Pravega, and RDBMS (in progress) using JDBC.
- Rapidly extensible controller such as Consul, Nacos, ETCD, and ZooKeeper.
- Guaranteed at-least-once delivery.
- Deliver events between multiple EventMesh deployments.
- Event schema management by catalog service.
- Powerful event orchestration by Serverless workflow engine.
- Powerful event filtering and transformation.
- Rapid, seamless scalability.
- Easy function development and framework integration.
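Since EventMesh is built around the CloudEvents specification (first bullet above), here is a hedged sketch of publishing a CloudEvent over HTTP using the cloudevents Python SDK; the EventMesh endpoint URL, port, and event attributes are assumptions for illustration, not taken from the EventMesh docs:
# Sketch: send a CloudEvent over HTTP (pip install cloudevents requests); the EventMesh URL is a placeholder
import requests
from cloudevents.http import CloudEvent, to_structured

attributes = {
    "type": "com.example.order.created",   # illustrative event type
    "source": "demo/orders",               # illustrative source
}
event = CloudEvent(attributes, {"orderId": 123})

headers, body = to_structured(event)       # serialize per the CloudEvents HTTP binding
requests.post("http://localhost:10105/eventmesh/publish",   # assumed endpoint; check your EventMesh config
              data=body, headers=headers)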
Roadmap
Please go to the roadmap to get the release history and new features of Apache EventMesh.
- EventMesh-site: Apache official website resources for EventMesh.
- EventMesh-workflow: Serverless workflow runtime for event Orchestration on EventMesh.
- EventMesh-dashboard: Operation and maintenance console of EventMesh.
- EventMesh-catalog: Catalog service for event schema management using AsyncAPI.
- EventMesh-go: A go implementation for EventMesh runtime.
Event Mesh Intro
https://eventmesh.apache.org/docs/introduction/
Key features EventMesh has to offer:
- Built around the CloudEvents specification.
- Rapidly extensible language sdk around gRPC protocols.
- Rapidly extensible middleware by connectors such as Apache RocketMQ, Apache Kafka, Apache Pulsar, RabbitMQ, Redis, Pravega, and RDBMS (in progress) using JDBC.
- Rapidly extensible controller such as Consul, Nacos, ETCD and Zookeeper.
- Guaranteed at-least-once delivery.
- Deliver events between multiple EventMesh deployments.
- Event schema management by catalog service.
- Powerful event orchestration by Serverless workflow engine.
- Powerful event filtering and transformation.
- Rapid, seamless scalability.
- Easy function development and framework integration.
Event Mesh Overview
https://www.infoq.com/news/2023/04/eventmesh-serverless/
Spark
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Learning Apache Spark is easy whether you come from a Java, Scala, Python, R, or SQL background:
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
Spark Quick Start
Spark SQL Guide
Spark in Action book
https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9
Iceberg - Data Lakehouse
Open table format for analytic datasets — similar to my temp tables at PTP (session and global)
https://iceberg.apache.org/community/
- Spark in Lakehouse Architecture: Learn how Apache Iceberg integrates with Apache Spark, emphasizing its role in ETL and reducing the cost of data transformation and storage compared to a traditional data warehouse.
- Simple and Reliable Ingestion with Upsolver: Examine how Upsolver simplifies the ingestion of operational data into Iceberg tables, highlighting its no-code and ZeroETL approaches for efficient data movement.
- Impacts of Data Management on Query Performance: Explore the impacts of small files, fast and continuous updates/deletes, and manifest file churn on query performance. Compare how data is managed and optimized between Spark and Upsolver, including how each handles schema evolution and transactional concurrency.
- Best Practices for Implementing Lakehouse Architectures: Discuss best practices for deploying and managing a lakehouse architecture using Spark and Upsolver, with insights into optimizing storage, improving query speeds, and ensuring high quality, reliable data.
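As a complement to the Spark integration point above, a hedged PySpark sketch that creates and queries an Iceberg table; the catalog name, warehouse path, and table name are placeholders, and it assumes the Iceberg Spark runtime jar matching your Spark version is on the classpath:
# Sketch: Iceberg table via Spark SQL (catalog/warehouse/table names are placeholders)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello')")
spark.sql("SELECT * FROM local.db.events").show()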
Camel
- Apache Camel data integration services
- Large library of components and connectors
- For reading, generally outputs Java object maps
- Need to see how write, update options work
- Currently favors XML over JSON objects
https://camel.apache.org/manual/latest/index.html
Arrow
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
Arrow Data Access Model for Columnar Data
Advantages of common Arrow data layer
Arrow Project Docs
https://arrow.apache.org/docs/
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as:
- Zero-copy shared memory and RPC-based data movement
- Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet)
- In-memory analytics and query processing
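A small pyarrow sketch showing the columnar in-memory table plus a Parquet round trip; the file path and column values are illustrative:
# Sketch: build an Arrow table in memory and round-trip it through Parquet (pip install pyarrow)
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "name": ["alice", "bob"],
    "age": [34, 27],
})                                              # columnar, language-independent memory layout

pq.write_table(table, "/tmp/people.parquet")    # illustrative path
round_trip = pq.read_table("/tmp/people.parquet")
print(round_trip.column("age"))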
Storm
Beam
Cassandra
Apache Cassandra is a free and open-source high-performance database that is provably fault-tolerant on both commodity hardware and cloud infrastructure. It can handle failed node replacements without shutting down the system, and it replicates data automatically across multiple nodes. Cassandra is a NoSQL database in which all the nodes are peers, with no master-slave architecture. This makes it extremely scalable and fault-tolerant, and you can add new machines without any interruption to already running applications. You can also choose between synchronous and asynchronous replication for each update.
Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make Apache Cassandra the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class.
Cassandra provides full Hadoop integration, including with Pig and Hive.
Cassandra database is the right choice when you need scalability and high availability without compromising performance.
Cassandra storage of NoSQL data
- Cassandra clusters have multiple nodes running in local data centers or public clouds.
- Data is typically stored redundantly across nodes according to a configurable replication factor so that the database continues to operate even when nodes are down or unreachable.
- Tables in Cassandra are much like RDBMS tables.
- Physical records in the table are spread across the cluster at a location determined by a partition key. The partition key is hashed to a 64-bit token that identifies the Cassandra node where data and replicas are stored.
- The Cassandra cluster is conceptually represented as a ring, where each cluster node is responsible for storing tokens in a range.
- Queries that look up records based on the partition key are extremely fast because Cassandra can immediately determine the host holding required data using the partitioning function.
- Since clusters can potentially have hundreds or even thousands of nodes, Cassandra can handle many simultaneous queries because queries and data are distributed across cluster nodes.
Cassandra Language Client Drivers
Cassandra has many client drivers, particularly for Java; many are open source.
There is also a JDBC wrapper for the DataStax Java driver:
https://github.com/ing-bank/cassandra-jdbc-wrapper
Data modeling for performance
https://drive.google.com/open?id=1JJnMPTVAgwZF2PKqTQVWTii16oLPIeAi
define requirements model as an object model
create application object flow ( outputs > inputs > processes )
model queries to support flow
define Chebotko diagrams ( ERD for physical, logical models )
set the primary key correctly
Primary key logic
Tables in Cassandra have a primary key. The primary key is made up of a partition key, followed by one or more optional clustering columns that control how rows are laid out in a Cassandra partition. Getting the primary key right for each table is one of the most crucial aspects of designing a good data model. In the latest_videos table, yyyymmdd is the partition key, and it is followed by two clustering columns, added_date and videoid, ordered in a fashion that supports retrieving the latest videos.
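A hedged sketch of the latest_videos DDL as described above, issued through the DataStax Python driver; the keyspace name and the non-key columns are assumptions based on the description, not the exact schema from the referenced document:
# Sketch: table whose primary key matches the description above (pip install cassandra-driver)
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("killrvideo")   # assumed keyspace name

session.execute("""
    CREATE TABLE IF NOT EXISTS latest_videos (
        yyyymmdd   text,          -- partition key: one partition per day
        added_date timestamp,     -- clustering column
        videoid    uuid,          -- clustering column
        name       text,
        PRIMARY KEY ((yyyymmdd), added_date, videoid)
    ) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC)
""")
cluster.shutdown()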
Performance
Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.
There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.
Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.
Originally developed at Facebook and donated to the Apache Software Foundation - open-source and commercial distributions available
https://cassandra.apache.org/doc/latest/getting_started/index.html
https://cassandra.apache.org/doc/latest/architecture/index.html
This section describes the general architecture of Apache Cassandra.
- Overview
- set replication and consistency policies
- Dynamo
- Storage Engine
- Guarantees
Consistency Policy
Decide on a consistency policy to fit the use case
It is common to pick read and write consistency levels that are high enough to overlap, resulting in "strong" consistency. This is typically expressed as W + R > RF, where W is the write consistency level, R is the read consistency level, and RF is the replication factor. For example, if RF = 3, a QUORUM request will require responses from at least two of the three replicas. If QUORUM is used for both writes and reads, at least one of the replicas is guaranteed to participate in both the write and the read request, which in turn guarantees that the latest write will be read. In a multi-datacenter environment, LOCAL_QUORUM can be used to provide a weaker but still useful guarantee: reads are guaranteed to see the latest write from within the same datacenter.
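A sketch of setting QUORUM consistency per statement with the Python driver; the contact point, keyspace, and users table are placeholders:
# Sketch: per-statement QUORUM reads/writes with the Python driver (pip install cassandra-driver)
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("demo_ks")      # placeholder keyspace

# With RF = 3, QUORUM needs 2 replicas; QUORUM writes plus QUORUM reads satisfy W + R > RF
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (1, "alice"))

read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(read, (1,)).one())
cluster.shutdown()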
Commit logs
Commitlogs are an append only log of all mutations local to a Cassandra node. Any data written to Cassandra will first be written to a commit log before being written to a memtable. This provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log will be applied to memtables.
All mutations are write-optimized by being stored in commitlog segments, reducing the number of seeks needed to write to disk.
Memtables
Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active memtable per table. Eventually, memtables are flushed onto disk and become immutable SSTables. This can be triggered in several ways:
- The memory usage of the memtables exceeds the configured threshold (see memtable_cleanup_threshold)
- The CommitLog approaches its maximum size, and forces memtable flushes in order to allow commitlog segments to be freed
Memtables may be stored entirely on-heap or partially off-heap, depending on memtable_allocation_type.
SSTables on disk
SSTables are the immutable data files that Cassandra uses for persisting data on disk.
As SSTables are flushed to disk from Memtables or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
Each SSTable is comprised of multiple components stored in separate files.
Query SSTable versions
The following example is useful for finding all sstables that do not match the “ib” SSTable version
find /var/lib/cassandra/data/ -type f | grep -v -- -ib- | grep -v "/snapshots"
Query Language
https://cassandra.apache.org/doc/latest/cql/index.html
Cassandra Operations
https://cassandra.apache.org/doc/latest/operating/index.html
Troubleshooting Cassandra
https://cassandra.apache.org/doc/latest/troubleshooting/index.html
Contributing to Cassandra Development
https://cassandra.apache.org/doc/latest/development/index.html
HBase
Apache HBase is an open-source distributed Hadoop database that can be used to read and write big data. HBase is constructed to manage billions of rows and millions of columns using commodity hardware clusters. The database is modeled on Google's Bigtable, a distributed storage system for structured data. Apache HBase has many capabilities including scalability, automatic sharding of tables, consistent reads and writes, automatic failover support across servers, and more.
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
HBase Features
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
- Extensible jruby-based (JIRB) shell
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
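A hedged sketch of client access through the Thrift gateway using the happybase Python library; the host, table name, and column family are placeholders, and it assumes the HBase Thrift server is running:
# Sketch: basic HBase put/get via the Thrift gateway (pip install happybase)
import happybase

connection = happybase.Connection("localhost")   # Thrift server host (placeholder)
table = connection.table("web_metrics")          # assumed table with column family 'cf'

table.put(b"row-2024-01-01", {b"cf:page_views": b"1024"})
row = table.row(b"row-2024-01-01")
print(row[b"cf:page_views"])

connection.close()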
CarbonData
CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data with Spark SQL. CarbonData allows faster interactive queries over PetaBytes of data.
Apache CarbonData is a new big data file format for faster interactive queries, using advanced columnar storage, index, compression, and encoding techniques to improve computing efficiency, which helps speed up queries by an order of magnitude over petabytes of data.
https://carbondata.apache.org/language-manual.html
CarbonData has its own parser, in addition to Spark's SQL Parser, to parse and process certain Commands related to CarbonData table handling. You can interact with the SQL interface using the command-line or over JDBC/ODBC.
- Data Types
- Data Definition Statements
- Data Manipulation Statements
- CarbonData as Spark's Datasource
- Configuration Properties
https://carbondata.apache.org/introduction.html
CarbonData use cases
https://carbondata.apache.org/usecases.html
CarbonData is used for but not limited to
Bank
- fraud detection analysis
- risk profile analysis
- As a zip table to update the daily balance of customer
Web/Internet
- Analysis of pages or videos being accessed, server loads, streaming quality, screen size
Solution to queries on high volume data
Set up a Hadoop + Spark + CarbonData cluster managed by YARN.
Query results
Parameter | Results |
---|---|
Query | < 3 Sec |
Data Loading Speed | 40 MB/s Per Node |
Concurrent query performance (20 queries) | < 10 Sec |
CarbonData QuickStart
https://carbondata.apache.org/quick-start-guide.html
load some test data to files using Spark shell
start the network
run queries
carbon.sql("SELECT * FROM test_table").show()
carbon.sql(
s"""
| SELECT city, avg(age), sum(age)
| FROM test_table
| GROUP BY city
""".stripMargin).show()
Solr
Elasticsearch
Log File Management Systems
Splunk
https://www.educba.com/is-splunk-open-source/
Elastic Stack
https://www.educba.com/splunk-alternatives/
The Elastic Stack (also called the ELK Stack) has been a leading open source log management solution and is a good alternative to Splunk. It comprises 4 major modules:
- Elasticsearch: Highly scalable search and analytics engine.
- Logstash: Log processing component that pipes incoming logs into Elasticsearch.
- Kibana: Data visualization tool for the logs captured.
- Beats: Also called data shippers for Elasticsearch.
A standard stack provides all of the tools needed to ship, process, and view log data through a web-based UI, with Java as the only binary dependency. The Elastic Stack is open source and stable, with an active developer community, a wide range of plugins, and extensive format support.
On the other side, running the Elastic Stack can be more complex than other tools in the market. It is highly distributed and requires a scaled supporting configuration to work as a full-fledged solution. It works best with geographical data and achieves high compression of stored data.
Price model: The premium version of the Elastic Stack adds access controls, statistical notifiers, and reporting solutions on top of the standard functionality of the free version. However, it can also be expensive to implement, with pricing of almost $2,000,000 to run at enterprise scale over a span of about three years.
Zeppelin - Apache data notebook, like Jupyter, with Python, DSLs, SQL, etc.
https://zeppelin.apache.org/docs/latest/quickstart/install.html
Other open source alternative data servers with data frames - BIRT !!!
Multi-purpose notebook which supports
20+ language backends
- Data Ingestion
- Data Discovery
- Data Analytics
- Data Visualization & Collaboration
https://zeppelin.apache.org/docs/0.10.0/
Navigating Zeppelin UI
https://zeppelin.apache.org/docs/0.10.0/quickstart/explore_ui.html
Zeppelin quickstart
https://zeppelin.apache.org/docs/latest/quickstart/install.html
on Ubuntu or macOS VMs
JDK 1.8
Allows multiple data resources as connections
create notebooks with paragraphs and data frames
q> can data frame 1 reference data frame 2 as content?
When a notebook is run as a service, the REST API allows selecting notebooks and actions
When executed, output can be a result set or a visualization
q> no drill-through support easily available
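A hedged sketch of driving Zeppelin through its notebook REST API with Python requests; the base URL and note id are placeholders, and the endpoint paths follow the Zeppelin 0.10 REST docs but should be verified for your version:
# Sketch: list notebooks and run one via the Zeppelin REST API (paths assumed from the 0.10 docs)
import requests

ZEPPELIN = "http://localhost:8080/api"     # placeholder base URL

notes = requests.get(f"{ZEPPELIN}/notebook").json()          # list available notebooks
print(notes)

note_id = "2ABC123XY"                                        # placeholder note id from the listing above
run = requests.post(f"{ZEPPELIN}/notebook/job/{note_id}")    # run all paragraphs in the note
print(run.status_code, run.json())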
Zeppelin Tutorial
https://zeppelin.apache.org/docs/0.9.0-preview1/quickstart/tutorial.html
Potential Value Opportunities
Compare When to use Camel vs Kafka
This comparison discusses when to use Kafka or Camel, when to combine them, and when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.
Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.
The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming.
The Kappa architecture replaces Lambda's separate batch and streaming layers with a single data stream.
Enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers.
This blog post explores why a single real-time pipeline, called Kappa architecture, is the better fit.
Apache Kafka's focus is on event streaming.
Confluent Cloud offers Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud service.
Features in Apache Camel AND Apache Kafka
Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:
- Open source under Apache 2.0 license
- Vibrant community and adoption in the industry
- Mature framework with deployments in enterprises across the globe
- Fixing point-to-point spaghetti architectures with a central integration backbone
- Open architecture and extensibility with custom functions and connectors
- Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
- Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
- Connectivity to any technology, API, communication paradigm, and SaaS
- Transformation of any data types and formats
- Processes transactional and analytical workloads
- Domain-specific language (DSL) for message at a time processing, with similar logic such as aggregation, filtering, conditional processing
- Relatively complex frameworks because of their robust feature set, hence not suitable for solving a minor problem
- Not a replacement of a database, data warehouse, or data lake
Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems.
Camel
Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.
Camel’s strengths
- Event-based backbone based on well-known and adopted EIP concepts
- Connectivity to almost any API
- Integration, processing, and routing of information with an intuitive domain-specific language (DSL) with a focus on integration; providing the ability of composability in a programming context for finer grain control in code for doing conditional logic or transformations/reformatting
- Powerful routing capabilities with many built-in EIPs
- Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore
- Lightweight alternative to proprietary ETL and ESB tools
Camel’s weaknesses
- Only a “routing machine”, i.e., not built for long-term storage (additional cache or storage needed), for that reason, Camel is not the right choice for a central nervous system like Kafka
- No stream processing (like you know it from Kafka Streams or Apache Flink)
- Limited scalability, not built for massive volumes of data
- No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
- No serverless cloud offering, with that also not competing with other iPaaS offerings
- Red Hat is the only vendor supporting it
- Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios
Kafka
Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows processing of both analytical and transactional workloads.
Kafka’s strengths
- Event-based streaming platform
- A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
- Built for massive volumes of data and extreme scale from the beginning, with that a single framework can be used for transactional (low volume) and analytics (high volume) use cases
- True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
- Guaranteed ordering of events in the distributed commit log
- Distributed data processing with fault-tolerance and recoverability built-in
- Replayability of events
- The de facto standard for event streaming
- Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
- Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
- Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks
Kafka’s weaknesses
- Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
- No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
- Limited out-of-the-box routing capabilities (Kafka Connect SMT or Kafka Streams / ksqlDB app do the job very well, but not as simple as Camel)
- Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)
Kafka kills two birds with one stone (integrating data AND processing it in motion where needed) within a single infrastructure, including:
- Real-time messaging at any scale
- Storage for true decoupling between different applications and communication paradigms
- Built-in backpressure handling and replayability of events
- Data integration
- Stream processing
- Real-time data replication across hybrid and multi-cloud
Kafka fits when the use case needs:
- Big data processing?
- A storage component for true decoupling and replayability of events?
- Stateless or stateful stream processing?
- A serverless cloud offering?
Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:
The option of running Camel connectors in Kafka Connect is very complex
There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy camel components into the Kafka Connect infrastructure.
Problems with this option
Using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:
- Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
- Camel world: Routes, RouteBuilder CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.
Better option - build a Kafka connector component
It is better to take the business logic and API calls out of the Camel component and copy them into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.
When not to use Kafka or Camel
Both Camel and Kafka are NOT built for the following scenarios:
- A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
- An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
- A database for complex queries and batch analytics workloads
- An IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and is the approach for (some) use cases.
- A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software than Camel or Kafka!
Potential Challenges
Candidate Solutions
Step-by-step guide for Example
sample code block