m Apache Data Services

Key Points

  1. Data pipelines connect data sources to data targets and transform the data, either in batches or as event streams



ToDo > Data Services Source Reviews


eblock reqmts model - groovy dsl expando w events ?
eblock test app
eblock composer

esb pub / sub w events

alf airflow
alf nifi batch
alf camel
alf kafka
alf spark
alf beam
istio.io

https://istio.io/docs/concepts/what-is-istio/

http://apache.org/index.html#projects-list
https://www.linuxfoundation.org/projects/directory/
https://www.npmjs.com/
https://skywalking.apache.org/blog/2019-01-25-mesh-loadtest.html
https://www.nginx.com/blog/what-is-a-service-mesh/



References

Reference_description_with_linked_URLs_______________________Notes______________________________________________________________




https://projects.apache.org/projects.html?category

Apache Big Data projects
https://www.educba.com/my-courses/dashboard/ - Data engineering education site: educba


https://zeppelin.apache.org/ - Zeppelin: Apache data notebook like Jupyter w python, dsl, sql etc


Kafka - see Kafka page: m Apache Kafka
https://kafka.apache.org/ - Apache Kafka home
https://kafka.apache.org/quickstart - Kafka quickstart
https://kafka.apache.org/uses - Kafka use cases
https://kafka.apache.org/documentation/ - Kafka documentation

https://drive.google.com/open?id=1s05zoAUC3kkW6ZDX_LGjMRptySCToZ_a

https://gist.github.com/fagossa/ff43fb522369efb2c88a82c8c72dec80

https://www.datadoghq.com/blog/collecting-kafka-performance-metrics/

https://www.datadoghq.com/blog/monitor-kafka-with-datadog/

Kafka concepts overview

part 1

part 2

part 3

https://github.com/aidtechnology/lf-k8s-hlf-webinar

https://drive.google.com/open?id=17S6ONAr7n3FyIY7-dknBsIs2vKFrY3xr

Kafka on Kubernetes Tutorial
https://drive.google.com/open?id=1mOk7UJ_cF0EIXt2QWVeWCFJg7bIm8Nc1A - Nodejs client app using Kafka overview
https://drive.google.com/open?id=1gwyX26VkGlI1wOiU2bgLk_ZaJIQNJ6aq - ebook: Kafka Streams in Action
https://www.confluent.io/online-talks/ - Kafka talks online
https://drive.google.com/open?id=1mYlwU0z8hirNQrlPJYtWFqaUh9P6xnhy - Kafka event streams with MongoDB connector


Spark
https://spark.apache.org/ - Spark home
https://spark.apache.org/docs/latest/quick-start.html - Spark quick start
https://spark.apache.org/docs/0.9.1/java-programming-guide.html - Spark Java programming
https://spark.apache.org/docs/0.9.1/scala-programming-guide.html - Spark Scala programming
https://drive.google.com/open?id=1_bfxFX6kQf2gTEPyoPwOgmauWj2b-hv5 - Spark overview in 7 steps (Databricks)
dzone-spark-refcard_0.pdf
spark-A-Gentle-Introduction-to-Apache-Spark.pdf
spark-tutorials-quick-guide-v2.pdf
spark-ebook-intro-2018.pdf
spark-Mini eBook - Aggregating Data with Apache Spark.pdf
spark-Data-Scientists-Guide-to-Apache-Spark.pdf
Spark GraphX.pdf
spark-The-Data-Engineers-Guide-to-Apache-Spark.pdf


https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9 - Manning: Spark in Action


Cassandra

https://projects.apache.org/project.html?cassandra


https://cassandra.apache.org/

Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make Apache Cassandra the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class.

Cassandra provides full Hadoop integration, including with Pig and Hive.

Cassandra database is the right choice when you need scalability and high availability without compromising performance.

https://www.datastax.com/sites/default/files/content/whitepaper/files/2019-10/CM2019236%20-%20Data%20Modeling%20in%20Apache%20Cassandra%20%E2%84%A2%20White%20Paper-4.pdf

https://drive.google.com/open?id=1JJnMPTVAgwZF2PKqTQVWTii16oLPIeAi

Data modeling in Cassandra
https://www.tutorialspoint.com/cassandra/cassandra_introduction.htm - TutorialsPoint: Apache Cassandra *


CarbonData Services for Big Data

https://projects.apache.org/project.html?carbondata


https://carbondata.apache.org/

Apache CarbonData is a new big data file format for faster interactive queries. It uses advanced columnar storage, index, compression and encoding techniques to improve computing efficiency, speeding up queries over petabytes of data by an order of magnitude.

https://carbondata.apache.org/introduction.html

https://carbondata.apache.org/quick-start-guide.html






Airflow






Nifi
https://nifi.apache.org/ - like Informatica: data transformation pipelines
https://www.slideshare.net/Hadoop_Summit/apache-nifi-crash-course-131483547 - NiFi crash course slides


Ignite - in-memory db
https://ignite.apache.org/ - Ignite summary






Hadoop


file:///C:/Users/Jim%20Mason/Google%20Drive/_books/tutorials/jp/hadoop/hadoop_101_introduction.pdf

https://drive.google.com/open?id=0BxqKQGV-b4WQYXNnM3JHT1pzS2s

https://drive.google.com/open?id=0BxqKQGV-b4WQNTRRanNtY0dVTXM

https://drive.google.com/open?id=0BxqKQGV-b4WQS3VtaUlYb3Fld2s

https://drive.google.com/open?id=0BxqKQGV-b4WQZGtQZ3d4OUY0cjg

https://drive.google.com/open?id=0BxqKQGV-b4WQdUgtLXZ5MEsyWTQ

https://drive.google.com/open?id=0BxqKQGV-b4WQTnBpWVR1VDhuOEU

https://drive.google.com/open?id=0BxqKQGV-b4WQSE1DeGdTdWp4VW8

https://drive.google.com/open?id=0BxqKQGV-b4WQby1aVnJ1UXZ5Zm8

https://drive.google.com/open?id=0BxqKQGV-b4WQNzdZOTdicmdLM0k

https://drive.google.com/open?id=0BxqKQGV-b4WQQTlwSGFoUnYydmc

https://drive.google.com/open?id=0BxqKQGV-b4WQRmZnc1RnM1o5Nnc

https://drive.google.com/open?id=0BxqKQGV-b4WQbGZzRVdlQVZiLXc

Hadoop intro slide deck - Java Passion


intro

install CDH

hdfs

map reduce

pig1

pig2

pig3

Hive

Hbase

sqoop

oozie

flume


http://events.pentaho.com/rs/680-ONC-130/images/leveraging-hadoop-in-analytic-data-pipeline-ebook.pdf

https://drive.google.com/open?id=1LlWanFdKIv1hB_ydGuwazqTARHFTrY_R

Hadoop in a big data system
https://towardsdatascience.com/what-happened-to-hadoop-what-should-you-do-now-2876f68dbd1d - Hadoop challenges article

Cloudera open-source distribution for Hadoop tools


C:\Users\Jim Mason\Google Drive\_books\tutorials\jp\hadoop

https://drive.google.com/open?id=0BxqKQGV-b4WQT2FpNU0tVGxydHc

Java Passion Hadoop courses




Mahout
http://mahout.apache.org/ - Mahout: machine learning library ( see machine learning page )


Apache zeppelin - data books
https://zeppelin.apache.org/





Key Concepts



Key Apache Projects Supporting Distributed Data Services

https://www.linkedin.com/pulse/popular-apache-projects-burak-%C3%B6zt%C3%BCrk/

Apache-Data-Services-Projects-overview-2023-linkedin.com-Popular Apache Projects.pdf.  link

Apache-Data-Services-Projects-overview-2023-linkedin.com-Popular Apache Projects.pdf file


Hadoop

m Hadoop



Airflow


A batch-processing data pipeline orchestrator, similar to Oozie (a minimal DAG sketch follows below).
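
A minimal sketch of an Airflow DAG, assuming Airflow 2.x with the bash operator; the DAG id, schedule, and commands are hypothetical placeholders for real pipeline steps:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",       # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # batch cadence, like an Oozie coordinator
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load         # task ordering defines the DAG edges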




Kestra

https://www.infoq.com/news/2022/03/kestra-orchestration-platform/

Kestra, a new open-source orchestration and scheduling platform, helps developers to build, run, schedule, and monitor complex pipelines.

uses YAML rather than Python for flow configuration

https://github.com/kestra-io/kestra



Nifi

https://nifi.apache.org/

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

  • Web-based user interface
    • Seamless experience between design, control, feedback, and monitoring
  • Highly configurable
    • Loss tolerant vs guaranteed delivery
    • Low latency vs high throughput
    • Dynamic prioritization
    • Flow can be modified at runtime
    • Back pressure
  • Data Provenance
    • Track dataflow from beginning to end
  • Designed for extension
    • Build your own processors and more
    • Enables rapid development and effective testing
  • Secure
    • SSL, SSH, HTTPS, encrypted content, etc...
    • Multi-tenant authorization and internal authorization/policy management



FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.


A FlowFile is like HTTP data: attributes (like headers) plus a content payload





Nifi integration points

NiFi Term | Description
FlowFile Processor | Push/pull behavior. Custom UI.
Reporting Task | Used to push data from NiFi to some external service (metrics, provenance, etc.)
Controller Service | Used to enable reusable components / shared services throughout the flow
REST API | Allows clients to connect to pull information, change behavior, etc.
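
The REST API listed above can be exercised directly; a hedged sketch using Python requests, assuming an unsecured NiFi instance on localhost:8080 and the standard /nifi-api endpoints:

import requests

base = "http://localhost:8080/nifi-api"

# flow status: queued FlowFile counts, active threads, bulletins
status = requests.get(f"{base}/flow/status").json()

# system diagnostics: JVM heap, threads, content/flowfile repository storage
diagnostics = requests.get(f"{base}/system-diagnostics").json()

print(status)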



Nifi can deliver data provenance




Flows can be versioned and shared via the NiFi Registry ( vs XML files )




Ignite - in memory db

https://ignite.apache.org/

Ignite™ is a memory-centric distributed database, caching, and processing platform for
transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale

Achieve horizontal scalability, strong consistency, and high availability with Ignite™ distributed SQL

Accelerate existing Relational and NoSQL databases with Ignite™ in-memory data grid and caching capabilities

Store and process distributed data in memory and on disk

Distributed memory-centric SQL database with support for joins

Read, write, transact with fastest key-value data grid and cache

Enforce full ACID compliance across distributed data sets

Avoid data noise by sending computations to cluster nodes

Train and deploy distributed machine learning models
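
A small sketch of the key-value data grid using the pyignite thin client, assuming an Ignite node listening on the default thin-client port 10800; the cache name and values are hypothetical:

from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

cache = client.get_or_create_cache('accounts')   # hypothetical cache name
cache.put(1, 'acct-1:balance=100')               # key-value put into the in-memory data grid
print(cache.get(1))                              # read back at in-memory speed

client.close()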



Kafka

m Apache Kafka


Kafka Event Sourcing Article


Kafka Overview



Compare Kafka Alternatives

https://www.macrometa.com/event-stream-processing/kafka-alternatives


Kafka Alternatives

Apache Kafka is an open source software messaging bus that uses stream processing. Because it’s a distributed platform known for its scalability, resilience, and performance, Kafka has become very popular with large enterprises. In fact, 80% of the Fortune 500 use Kafka.

However, there’s no such thing as a one-size-fits-all solution. What’s best for Uber or PayPal may not be ideal for your application. Fortunately, there are several alternative messaging platforms available.

Having a thorough understanding of the technology landscape is essential, whether you are developing applications or considering ready-to-go industry solutions provided by Macrometa.

Knowing which platform is right for you requires understanding the pros and cons of each. To help you make the right choice, we’ll take an in-depth look at Apache Kafka and four popular Kafka alternatives.

Before taking a detailed look at Kafka and those four alternatives, let’s start with a high-level overview of features.

Kafka compared to alternatives


Feature | Kafka | Google Pub/Sub | RabbitMQ | Pulsar | Macrometa
Ease of setup | Hard | Easy | Medium | Hard | Easy
Storage | Seven days (scalable to disk size) | Seven days (auto-scaling) | The message is lost once acknowledged | Tiered storage scalable to disk size | Three days (supports retention and TTL)
Throughput & scalability | Best | Better | Good | Best | Best
Message re-read / replay | Yes | No | No | Yes | Yes
Type | Event streaming | Publish/subscribe | Message broker | Messaging and event streaming | Messaging and event streaming
Built-in multitenancy | No | No | No | Yes | Yes
Geo-replication | No | No | No | Yes | Yes
Languages supported | 17 | 8 | 22 | 6 | Supports all Kafka & Pulsar clients
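
For the Kafka column above, a minimal produce/consume sketch with the kafka-python client; the broker address, topic, and consumer group are assumptions:

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"order-1", value=b'{"amount": 42}')   # "orders" is a hypothetical topic
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",                 # consumer group for scaling out readers
    auto_offset_reset="earliest",       # start from the earliest retained offset if the group has no committed position
)
for record in consumer:
    print(record.key, record.value, record.offset)
    break

Because the log is retained (see the Storage row), a new consumer group can re-read or replay past events, which is one of the differentiators in the table.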

Kafka faster than Pulsar on some latency benchmarks

https://www.confluent.io/kafka-vs-pulsar/#:~:text=Kafka%20provides%20the%20lowest%20latency,to%20fsync%20on%20every%20message.


Canonical Charmed Kafka reference architecture guide 3 release 1.pdf  link

Canonical Charmed Kafka reference architecture guide 3 release 1 .pdf file



Apache Pulsar

https://pulsar.apache.org/

Cloud-Native, Distributed Messaging and Streaming

Apache® Pulsar™ is an open-source, distributed messaging and streaming platform built for the cloud.

Apache Pulsar is an all-in-one messaging and streaming platform. Messages can be consumed and acknowledged individually or consumed as streams with less than 10ms of latency. Its layered architecture allows rapid scaling across hundreds of nodes, without data reshuffling.

Its features include multi-tenancy with resource separation and access control, geo-replication across regions, tiered storage and support for six official client languages. It supports up to one million unique topics and is designed to simplify your application architecture.

Pulsar is a Top 10 Apache Software Foundation project and has a vibrant and passionate community and user base spanning small companies and large enterprises.


Producer & Consumer

A Pulsar client contains a consumer and a producer. A producer writes messages on a topic. A consumer reads messages from a topic and acknowledges specific messages or all up to a specific message.
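
A hedged sketch of that producer/consumer flow using the official pulsar-client Python library; the broker URL, topic, and subscription name are assumptions:

import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/orders")   # hypothetical topic
producer.send("order-1".encode("utf-8"))                                   # write a message to the topic

consumer = client.subscribe("persistent://public/default/orders", subscription_name="billing")
msg = consumer.receive()                  # read the next message
print(msg.data())
consumer.acknowledge(msg)                 # ack so it is not redelivered

client.close()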


Apache Zookeeper

Pulsar and BookKeeper use Apache ZooKeeper to save metadata coordinated between nodes, such as a list of ledgers per topic, segments per ledger, and mapping of topic bundles to a broker. It’s a cluster of highly available and replicated servers (usually 3).


Pulsar Brokers

Topics (i.e., partitions) are divided among Pulsar brokers. A broker receives messages for a topic and appends them to the topic’s active virtual file (a.k.a ledger), hosted on the Bookkeeper cluster. Brokers read messages from the cache (mostly) or BookKeeper and dispatch them to the consumers. Brokers also receive message acknowledgments and persist them to the BookKeeper cluster as well. Brokers are stateless (don't use/need a disk).


Apache Bookkeeper

Apache BookKeeper is a cluster of nodes called bookies. Each virtual file (a.k.a ledger) is divided into consecutive segments, and each segment is kept on 3 bookies by default (replicated by the client - i.e., the broker). Operators can add bookies rapidly since no data reshuffling (moving) between them is required. They immediately share the incoming write load.

Key Pulsar Features

  • horizontal scalability
  • low latency messaging and event streaming
  • geo location replication
  • multi-tenancy 
  • ALB - automated load balancing
  • Multi-language support for Java, Go, Python, C++, Node.js, and C#.
  • Officially maintained connectors with popular 3rd parties: MySQL, Elasticsearch, Cassandra, and more. Allows streaming data in (source) or out (sink).
  • Serverless Functions - Write, deploy functions natively using Pulsar Functions. Process messages without deploying fully-fledged applications. Kubernetes runtime is bundled.
  • Topics - up to 1 million


Pulsar Quickstart

https://pulsar.apache.org/docs/3.0.x/


Why Pulsar vs Kafka?

Why Pulsar vs Kafka? Streamnative article for Pulsar



Pulsar vs Kafka Benchmark

Pulsar vs Kafka Benchmark article from streamnative

Large configuration on a high-bandwidth network

The testbed for the OpenMessaging Benchmark is set up as follows:

  1. 3 Broker VMs  of type i3en.6xlarge, with 24-cores, 192GB of memory, 25Gbps guaranteed networking, and two NVMe SSD devices that support up to 1GB/s write throughput on each disk.
  2. 4 Client (producers and consumers) VMs  of type m5n.8xlarge, with 32-cores and with 25Gbps of guaranteed networking throughput and 128GB of memory to ensure the bottleneck would not be on the client-side.

Results




Apache Event Mesh

https://eventmesh.apache.org/

EventMesh is a new generation serverless event middleware for building distributed event-driven applications.


EventMesh Architecture


https://github.com/apache/eventmesh

Apache EventMesh has a vast amount of features to help users achieve their goals. Let us share with you some of the key features EventMesh has to offer:

  • Built around the CloudEvents specification.
  • Rapidly extensible connector layer, with connectors for the source or sink of SaaS, cloud services, databases, etc.
  • Rapidly extensible storage layer such as Apache RocketMQ, Apache Kafka, Apache Pulsar, RabbitMQ, Redis, Pravega, and RDBMS (in progress) using JDBC.
  • Rapidly extensible controller such as Consul, Nacos, ETCD and Zookeeper.
  • Guaranteed at-least-once delivery.
  • Deliver events between multiple EventMesh deployments.
  • Event schema management by catalog service.
  • Powerful event orchestration by serverless workflow engine.
  • Powerful event filtering and transformation.
  • Rapid, seamless scalability.
  • Easy function development and framework integration.

Roadmap

Please go to the roadmap to get the release history and new features of Apache EventMesh.

Event Mesh Intro

https://eventmesh.apache.org/docs/introduction/

key features EventMesh has to offer:

  • Built around the CloudEvents specification.
  • Rapidly extensible language SDKs around gRPC protocols.
  • Rapidly extensible middleware by connectors such as Apache RocketMQ, Apache Kafka, Apache Pulsar, RabbitMQ, Redis, Pravega, and RDBMS (in progress) using JDBC.
  • Rapidly extensible controller such as Consul, Nacos, ETCD and Zookeeper.
  • Guaranteed at-least-once delivery.
  • Deliver events between multiple EventMesh deployments.
  • Event schema management by catalog service.
  • Powerful event orchestration by serverless workflow engine.
  • Powerful event filtering and transformation.
  • Rapid, seamless scalability.
  • Easy function development and framework integration.

Event Mesh Overview

https://www.infoq.com/news/2023/04/eventmesh-serverless/




Spark

https://spark.apache.org/

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Learning Apache Spark is easy whether you come from a Java, Scala, Python, R, or SQL background:


df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()

Spark's Python DataFrame API
Read JSON files with automatic schema inference
Spark can access many data sources
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

Spark Quick Start

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.

To follow along with this guide, first, download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
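
A minimal PySpark version of that quick start flow, assuming pyspark is installed locally and a logs.json file exists in the working directory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quickstart").getOrCreate()

df = spark.read.json("logs.json")                     # schema is inferred automatically
df.where("age > 21").select("name.first").show()      # filter and project, as in the snippet above

spark.stop()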


Spark SQL Guide

Spark in Action book

https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9



Iceberg - Data Lakehouse

https://iceberg.apache.org/

Open Table format for analytic datasets — same as my temp tables at PTP ( session and global )


https://iceberg.apache.org/community/

  • Spark in Lakehouse Architecture: Learn how Apache Iceberg integrates with Apache Spark, emphasizing its role in ETL and reducing the cost of data transformation and storage compared to a traditional data warehouse.
  • Simple and Reliable Ingestion with Upsolver: Examine how Upsolver simplifies the ingestion of operational data into Iceberg tables, highlighting its no-code and ZeroETL approaches for efficient data movement.
  • Impacts of Data Management on Query Performance: Explore the impacts of small files, fast and continuous updates/deletes, and manifest file churn on query performance. Compare how data is managed and optimized between Spark and Upsolver, including how each handles schema evolution and transactional concurrency.
  • Best Practices for Implementing Lakehouse Architectures: Discuss best practices for deploying and managing a lakehouse architecture using Spark and Upsolver, with insights into optimizing storage, improving query speeds, and ensuring high quality, reliable data.



Camel

m Camel Data Svcs

  1. Apache Camel data integration services
  2. Large library of components and connectors
  3. For reading, generally outputs Java object maps
  4. Need to see how write, update options work
  5. Currently favors XML over JSON objects

https://camel.apache.org/

https://camel.apache.org/manual/latest/index.html

Arrow

https://arrow.apache.org/

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

Arrow Data Access Model for Columnar Data


Advantages of common Arrow data layer


Arrow Project Docs

https://arrow.apache.org/docs/

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as:

  • Zero-copy shared memory and RPC-based data movement

  • Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet)

  • In-memory analytics and query processing
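
A small pyarrow sketch of the columnar in-memory model and one of the supported file formats (Parquet); the column names and file path are arbitrary:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})   # columnar in-memory table
pq.write_table(table, "scores.parquet")                          # write to Parquet
print(pq.read_table("scores.parquet").num_rows)                  # read it back without copies of the column buffers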



Storm



Beam



Cassandra

https://cassandra.apache.org/

Apache Cassandra is a free and open-source high-performance database that is provably fault-tolerant on both commodity hardware and cloud infrastructure. It can handle failed node replacements without shutting down the system, and it can replicate data automatically across multiple nodes. Moreover, Cassandra is a NoSQL database in which all the nodes are peers, without any master-slave architecture. This makes it extremely scalable and fault-tolerant, and you can add new machines without any interruptions to already-running applications. You can also choose between synchronous and asynchronous replication for each update.


Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make Apache Cassandra the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class.

Cassandra provides full Hadoop integration, including with Pig and Hive.

Cassandra database is the right choice when you need scalability and high availability without compromising performance.

Cassandra storage of NoSQL data

  • Cassandra clusters have multiple nodes running in local data centers or public clouds.
  • Data is typically stored redundantly across nodes according to a configurable replication factor so that the database continues to operate even when nodes are down or unreachable.
  • Tables in Cassandra are much like RDBMS tables.
  • Physical records in the table are spread across the cluster at a location determined by a partition key. The partition key is hashed to a 64-bit token that identifies the Cassandra node where data and replicas are stored.
  • The Cassandra cluster is conceptually represented as a ring, where each cluster node is responsible for storing tokens in a range.
  • Queries that look up records based on the partition key are extremely fast because Cassandra can immediately determine the host holding required data using the partitioning function.
  • Since clusters can potentially have hundreds or even thousands of nodes, Cassandra can handle many simultaneous queries because queries and data are distributed across cluster nodes.

Cassandra Language Client Drivers

Cassandra has many client drivers particularly for Java

many are open source

There is also a JDBC wrapper for the Datastax Java driver

https://github.com/ing-bank/cassandra-jdbc-wrapper


Data modeling in performance

https://drive.google.com/open?id=1JJnMPTVAgwZF2PKqTQVWTii16oLPIeAi

https://www.datastax.com/sites/default/files/content/whitepaper/files/2019-10/CM2019236%20-%20Data%20Modeling%20in%20Apache%20Cassandra%20%E2%84%A2%20White%20Paper-4.pdf

define requirements model as an object model

create application object flow ( outputs > inputs > processes )

model queries to support flow

define Chebotko diagrams ( ERDs for the physical and logical models )

set the primary key correctly


Primary key logic

Tables in Cassandra have a primary key. The primary key is made up of a partition key, followed by one or more optional clustering columns that control how rows are laid out in a Cassandra partition. Getting the primary key right for each table is one of the most crucial aspects of designing a good data model. In the latest_videos table, yyyymmdd is the partition key, and it is followed by two clustering columns, added_date and videoid, ordered in a fashion that supports retrieving the latest videos.
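
A hedged sketch of that latest_videos primary key using the DataStax Python driver; the contact point, keyspace, and exact column list are assumptions based on the whitepaper example:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("killrvideo")      # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS latest_videos (
        yyyymmdd text,                        -- partition key: groups a day of videos on one node
        added_date timestamp,                 -- clustering column 1
        videoid uuid,                         -- clustering column 2
        name text,
        PRIMARY KEY ((yyyymmdd), added_date, videoid)
    ) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC)
""")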





performance

Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.

There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.

Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.

Originally developed at Facebook and donated to Apache - open-source and commercial distributions available


https://cassandra.apache.org/doc/latest/getting_started/index.html


https://cassandra.apache.org/doc/latest/architecture/index.html

This section describes the general architecture of Apache Cassandra.


Consistency Policy

Decide on a consistency policy to fit the use case

It is common to pick read and write consistency levels that are high enough to overlap, resulting in "strong" consistency. This is typically expressed as W + R > RF, where W is the write consistency level, R is the read consistency level, and RF is the replication factor. For example, if RF = 3, a QUORUM request will require responses from at least two of the three replicas. If QUORUM is used for both writes and reads, at least one of the replicas is guaranteed to participate in both the write and the read request, which in turn guarantees that the latest write will be read. In a multi-datacenter environment, LOCAL_QUORUM can be used to provide a weaker but still useful guarantee: reads are guaranteed to see the latest write from within the same datacenter.
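
A sketch of setting per-statement consistency with the DataStax Python driver, assuming the latest_videos table above and RF = 3; with QUORUM on both sides, W = 2 and R = 2, so W + R > RF:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("killrvideo")      # hypothetical keyspace

write = SimpleStatement(
    "INSERT INTO latest_videos (yyyymmdd, added_date, videoid, name) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)   # at least 2 of 3 replicas must acknowledge

read = SimpleStatement(
    "SELECT * FROM latest_videos WHERE yyyymmdd = %s",
    consistency_level=ConsistencyLevel.QUORUM)   # overlapping quorum guarantees the latest write is seen

rows = session.execute(read, ["20240101"])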

Commit logs

Commitlogs are an append only log of all mutations local to a Cassandra node. Any data written to Cassandra will first be written to a commit log before being written to a memtable. This provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log will be applied to memtables.

All mutation writes are optimized by storing them in commitlog segments, reducing the number of seeks needed to write to disk.

Memtables

Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active memtable per table. Eventually, memtables are flushed onto disk and become immutable SSTables. This can be triggered in several ways:

  • The memory usage of the memtables exceeds the configured threshold (see memtable_cleanup_threshold)
  • The CommitLog approaches its maximum size, and forces memtable flushes in order to allow commitlog segments to be freed

Memtables may be stored entirely on-heap or partially off-heap, depending on memtable_allocation_type.

SSTables on disk

SSTables are the immutable data files that Cassandra uses for persisting data on disk.

As SSTables are flushed to disk from Memtables or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.

Each SSTable comprises multiple components stored in separate files.

Query SSTable versions

The following example is useful for finding all sstables that do not match the “ib” SSTable version

find /var/lib/cassandra/data/ -type f | grep -v -- -ib- | grep -v "/snapshots"

Query Language

https://cassandra.apache.org/doc/latest/cql/index.html


Cassandra Operations

https://cassandra.apache.org/doc/latest/operating/index.html


Troubleshooting Cassandra

https://cassandra.apache.org/doc/latest/troubleshooting/index.html


Contributing to Cassandra Development

https://cassandra.apache.org/doc/latest/development/index.html


HBase

https://hbase.apache.org/

Apache HBase is an open-source distributed Hadoop database that can be used to read and write big data. HBase is designed to manage billions of rows and millions of columns using commodity hardware clusters. The database is based on Bigtable, a distributed storage system created for structured data. Apache HBase offers many capabilities including scalability, automatic sharding of tables, consistent reads and writes, automatic failover across servers, etc.

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

HBase Features

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy to use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Query predicate push down via server side Filters
  • Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
  • Extensible jruby-based (JIRB) shell
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX




CarbonData


CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data with Spark SQL. CarbonData allows faster interactive queries over PetaBytes of data.

Apache CarbonData is a new big data file format for faster interactive queries. It uses advanced columnar storage, index, compression and encoding techniques to improve computing efficiency, speeding up queries over petabytes of data by an order of magnitude.

https://carbondata.apache.org/language-manual.html

CarbonData has its own parser, in addition to Spark's SQL Parser, to parse and process certain Commands related to CarbonData table handling. You can interact with the SQL interface using the command-line or over JDBC/ODBC.


https://carbondata.apache.org/introduction.html


CarbonData use cases

https://carbondata.apache.org/usecases.html

CarbonData is used for but not limited to

  • Bank

    • fraud detection analysis
    • risk profile analysis
    • As a zip table to update the daily balance of customer

Web/Internet

  • Analysis of page or video being accessed, server loads, streaming quality, screen size

Solution to queries on high volume data

Setup a Hadoop + Spark + CarbonData cluster managed by YARN.

Query results

Parameter | Results
Query | < 3 sec
Data loading speed | 40 MB/s per node
Concurrent query performance (20 queries) | < 10 sec


CarbonData QuickStart

https://carbondata.apache.org/quick-start-guide.html

load some test data to files using Spark shell

start the network

run queries

carbon.sql("SELECT * FROM test_table").show()

carbon.sql(
           s"""
              | SELECT city, avg(age), sum(age)
              | FROM test_table
              | GROUP BY city
           """.stripMargin).show()


Solr



Elasticsearch




Log File Management Systems


Splunk

https://www.educba.com/is-splunk-open-source/


Elastic Stack

https://www.educba.com/splunk-alternatives/

The Elastic Stack (also called the ELK stack) has been a leading open source log management solution and is a good alternative to Splunk. It comprises four major modules (a short Elasticsearch usage sketch follows the list):

  1. Elasticsearch: highly scalable search and analytics engine.
  2. Logstash: log processing component that conduits incoming logs to Elasticsearch.
  3. Kibana: data visualization tool for the logs captured.
  4. Beats: lightweight data shippers that feed Elasticsearch.
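
As a quick illustration of the Elasticsearch module, a hedged sketch with the official Python client (v8-style API); the cluster URL and index name are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index a log event, roughly what Logstash or Beats would ship into the cluster
es.index(index="app-logs", document={"level": "ERROR", "msg": "timeout calling payment service"})

# search it back; Kibana issues similar queries for its visualizations
hits = es.search(index="app-logs", query={"match": {"level": "ERROR"}})
print(hits["hits"]["total"])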

A regular stack provides all of the tools needed to conduit, process, and view log data using a web-based UI, with Java as a binary dependency. The Elastic Stack is open source and stable, with an active developer community, a wide range of plugins, and extensive format support.

On the other side, running the Elastic Stack can be more complex than other tools in the market. It is highly distributed and requires a scaled supporting configuration to work as a full-fledged solution. It works best for geographical data and achieves high compression of memory storage.

Price model: the premium version of the Elastic Stack adds access controls, statistical notifiers and reporting solutions on top of the standard functionality of the free version. However, it is also expensive to implement, costing almost $2,000,000 to run at enterprise scale over a period of about three years.



Zeppelin - Apache data notebook like Jupyter w python, dsl, sql etc

https://zeppelin.apache.org/

https://zeppelin.apache.org/docs/latest/quickstart/install.html


Other open source alternative data servers with data frames - BIRT !!!



Multi-purpose notebook which supports

20+ language backends

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration


https://zeppelin.apache.org/docs/0.10.0/


Navigating Zeppelin UI

https://zeppelin.apache.org/docs/0.10.0/quickstart/explore_ui.html

Zeppelin quickstart

https://zeppelin.apache.org/docs/latest/quickstart/install.html

on Ubuntu, MAC vms

JDK 1.8


Allows multiple data sources as connections

create notebooks with paragraphs and data frames

q> can data frame 1 reference data frame 2 as content?

when notebook is run as a service, REST api allows selecting notebooks and actions
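
A hedged sketch of that REST interaction, assuming a local Zeppelin at localhost:8080 and the notebook REST endpoints documented by the project; the note id is whatever the server returns:

import requests

base = "http://localhost:8080/api"

notes = requests.get(f"{base}/notebook").json()            # list available notebooks
note_id = notes["body"][0]["id"]                            # pick the first note id

requests.post(f"{base}/notebook/job/{note_id}")             # run all paragraphs in that note
result = requests.get(f"{base}/notebook/{note_id}").json()  # fetch the note, including paragraph results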

when executed, output can be a result set or a visualization

q> no drill through support easily available



Zeppelin Tutorial

https://zeppelin.apache.org/docs/0.9.0-preview1/quickstart/tutorial.html











Potential Value Opportunities



Compare When to use Camel vs Kafka

https://www.kai-waehner.de/blog/2022/01/28/when-to-use-apache-camel-vs-apache-kafka-for-etl-application-integration-event-streaming/

The comparison discusses when to use Kafka or Camel, when to combine them, and when to use neither. A decision tree shows how you can quickly qualify one out in favor of the other.

 Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming.

The Kappa architecture replaces Lambda's separate batch and streaming layers with a single data stream.

https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/

Kappa Architecture with one Pipeline for Real Time and Batch

enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers.

This blog post explores why a single real-time pipeline, called Kappa architecture, is the better fit.


Apache Kafka: the focus is on event streaming.

Confluent Cloud offers Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

  • Open source under Apache 2.0 license
  • Vibrant community and adoption in the industry
  • Mature framework with deployments in enterprises across the globe
  • Fixing point-to-point spaghetti architectures with a central integration backbone
  • Open architecture and extensibility with custom functions and connectors
  • Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
  • Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
  • Connectivity to any technology, API, communication paradigm, and SaaS
  • Transformation of any data types and formats
  • Processes transactional and analytical workloads
  • Domain-specific language (DSL) for message at a time processing, with similar logic such as aggregation, filtering, conditional processing
  • Relative complex frameworks because of their robust feature set, hence not suitable for solving a minor problem
  • Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems.

Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.

Camel’s strengths

  • Event-based backbone based on well-known and adopted EIP concepts
  • Connectivity to almost any API
  • Integration, processing, and routing of information with an intuitive domain-specific language (DSL) with a focus on integration; providing the ability of composability in a programming context for finer grain control in code for doing conditional logic or transformations/reformatting
  • Powerful routing capabilities with many built-in EIPs
  • Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore 
  • Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

  • Only a “routing machine”, i.e., not built for long-term storage (additional cache or storage needed), for that reason, Camel is not the right choice for a central nervous system like Kafka
  • No stream processing (like you know it from Kafka Streams or Apache Flink)
  • Limited scalability, not built for massive volumes of data
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • No serverless cloud offering, with that also not competing with other iPaaS offerings
  • Red Hat is the only vendor supporting it
  • Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios


Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows to process analytics and transactional workloads.

Kafka’s strengths

  • Event-based streaming platform
  • A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
  • Built for massive volumes of data and extreme scale from the beginning, with that a single framework can be used for transactional (low volume) and analytics (high volume) use cases
  • True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
  • Guaranteed ordering of events in the distributed commit log
  • Distributed data processing with fault-tolerance and recoverability built-in
  • Replayability of events
  • The de facto standard for event streaming
  • Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
  • Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
  • Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks

Kafka’s weaknesses

  • Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • Limited out-of-the-box routing capabilities (Kafka Connect SMT or Kafka Streams / ksqlDB app do the job very well, but not as simple as Camel)
  • Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)

Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

within a single infrastructure, including:

  • Real-time messaging at any scale
  • Storage for true decoupling between different applications and communication paradigms
  • Built-in backpressure handling and replayability of events
  • Data integration
  • Stream processing
  • Real-time data replication across hybrid and multi-cloud 

Kafka fits

  • Big data processing?
  • A storage component for true decoupling and replayability of events?
  • Stateless or stateful stream processing?
  • A serverless cloud offering?

Decision Tree Apache Camel vs Apache Kafka Comparison

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:

Apache Camel and Apache Kafka in the Enterprise Architecture

Camel connectors in Kafka Connect: this option is very complex

There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy camel components into the Kafka Connect infrastructure.

Problems with this option

Using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

  • Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect),  Bootstrap Server, ConsumerRecord, Retention Time, etc.
  • Camel world: Routes, RouteBuilder CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Better option - build a Kafka connector component

better to take the business logic and API calls out of the Camel component and copy it into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.

When not to use Kafka or Camel

Both Camel and Kafka are NOT built for the following scenarios:

  • A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
  • An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
  • A database for complex queries and batch analytics workloads
  • an IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the approach for (some) use cases.
  • A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software than Camel or Kafka!

Potential Challenges



Candidate Solutions



Step-by-step guide for Example



sample code block

sample code block
 



Recommended Next Steps