m Data Architecture

Key Points

  1. foundation for enterprise architecture, solutions

  2. driven by business use cases

  3. covers variety of use cases:  support, self service, enterprise services, integration ...



References

Data Use Cases

 

 

 

 

 


Data Governance Concepts







http://www.as400pro.com/tipListInq.php?cat=iSJava

Some AS/400 data related articles that included Jim Mason and more

techtarget has killed the links

https://www.roseindia.net/

https://www.roseindia.net/jdbc/jdbc-mysql/

roseindia.net



ISO 8000-150 Data Governance Standard

 





Data Architecture



Real-Time Data Architecture Patterns.pdf. GD

EEP, 7 Vs, NFRs, Patterns: Search, Directories ( Virtual, Logical, Physical ), Registries, Streams, Events, Alerts, Transaction LC, Layers, Services, SCRUD, Async, Messages, Promises, Cache, Age, Stacks, Queues, Pub/Sub, Brokers, Handlers, Validation LC, Trust LC, Processes, Decisions, RDS, ODS, DWH, Batch / Real-time, Hashes, Proofs, CDC, Ledgers, Journals, Blocks, Controllers, Observers, Idempotency, Rollback, Replay, Versioning, Interfaces, Delegates, Adapters, Routers, Decorators, Models ( Physical, Logical ), MetaModels, MDM, Observability, BCP, Aggregates ( Virtual, Persistent ), Factories, Pipelines, Transformers, DSLs, Flow - ( Ingestion, Transform, Validation, Normalization, Edit, Process, Post, Store, Search, Retrieve, Transform, Present ), Distributed, Replicated, Decentralized ( Commit, Finalization ), DRDA, GIGO2, Secrets, Rollups, SOLID, DATES2, Age, Level Checks, Versions, Mutations, Mappings, Liquibase, Attributes ( fixed, managed, unmanaged ), Data Types, Functions, Procs, RPC, gRPC, Concurrency, Semaphores, Switches, Tokens, Credentials, Authorizations, IAM, Encryption, SDC, UTXO vs Lots, HTLC vs Escrow Transactions, Classes, MetaClasses

https://www.slideshare.net/Dataversity/das-slides-enterprise-architecture-vs-data-architecture

Data Architecture vs Enterprise Architecture

https://www.slideshare.net/lmartins_us/enterprise-data-architecture-deliverables

Enterprise Data Architecture Deliverables

https://www.slideshare.net/Dataversity/data-architecture-strategies-building-an-enterprise-data-strategy-where-to-start

Enterprise Data Strategy

https://www.slideshare.net/Dataversity/data-architecture-the-foundation-for-enterprise-architecture-and-governance

Data Architecture - Foundation for Enterprise Architecture





https://www.slideshare.net/Dataversity/data-architecture-strategies-artificial-intelligence-realworld-applications-for-your-organization

Data Architecture Strategies for AI

https://www.slideshare.net/Dataversity/data-architecture-best-practices-for-todays-rapidly-changing-data-landscape

Data Architecture Best Practices

Data-Virtualization-for-Dummies.pdf

Data Virtualization for Dummies

https://www.datacamp.com/community/blog/data-infrastructure-tools

Sample Data Infrastructure - datacamp *

https://www.slideshare.net/Dataversity/data-lake-architecture-modern-strategies-approaches

Data Lake Architecture Strategies

https://www.informatica.com/resources.asset.cd3434c8d2aae44c6071d19d9077ca60.pdf?mkt_tok=eyJpIjoiTm1Wak1EWXlNMkl5TURFeCIsInQiOiJlREFlZjlEa0VvK1BsTlViZDJIeHdcL2FZTGhCMit4OXVOR0JYNnVWckNZZHU0UHRPZElJaGpMOHh4S09QN1dhdTB2MndZVm5RWkszNkFBQmUxZkpaOXlSVzhUVXJhSU1GOWVlYkdEMldTOUlDK2JHVXJaY0FKeHVyeFJGVHN1WCtcL0RnY1pZemM1S1FZeVpwOWRSb3d2QT09In0%3D

data-lake-concepts-2019-resources.asset.cd3434c8d2aae44c6071d19d9077ca60.pdf

Data Lake Design Principles - Informatica

data-lakes-Six-Guiding-Principles-for-Effective-Data-Lake-Pipelines.pdf



https://dzone.com/articles/four-data-sharding-strategies-for-distributed-sql?edition=568292&utm_source=Daily%20Digest&utm_medium=email&utm_campaign=Daily%20Digest%202020-01-29

disributed-db-sharding-strategies1.pdf

Data Sharding Strategies compared









snowflake-data-db-cloud-service-2333957-solution-brief-snowflake.pdf











Data Modeling

Models, MetaModels, MDG, RoundTripping, VersionMgt

sustainable data architecture concepts 

basics on data concepts for blockchain *

















Data Management



SFTP secure shell vs FTPS encrypted ftp explained.

spiceworks.com-SFTP vs FTPS Understanding the 8 Key Differences.pdf file



















Data Governance



https://profisee.com/data-governance-what-why-how-who/

data-governance-profisee.com-Data Governance What Why How Who 15 Best Practices.pdf

Data Governance Concepts & Tools

s Blockchain Data Compliance Services

Data Compliance

Sichern-data-compliance-whitepaper-short-version.201906.docx



DMX - Blockchain and Data Compliance Services.v2.pptx











Data Services Open Solutions



m Apache Data Services

 

 

 

Ubuntu

https://docs.ubuntu.com/

https://help.ubuntu.com/stable/ubuntu-help/

Ubuntu Server docs

Ubuntu Server documentation

Ubuntu Multipass

Multipass
Multipass is a tool to generate cloud-style Ubuntu VMs quickly on Linux, macOS, and Windows.

Docker

Home

 

 

 

 

Run multiple JDKs on macOS

Installing many JDKs on macOS using Homebrew and OpenJDK

https://wiki.classe.cornell.edu/Computing/InstallingMultipleVersionsOfJavaOnMac

Managing multiple Java versions in MacOS

Manually install a JDK version from

Open JDK

https://openjdk.org/

Open JDK 19

https://jdk.java.net/19/

Open JDK 11

Archived OpenJDK GA Releases

Apache Tomcat

Apache Tomcat® - Welcome!

Apache Tomcat 10 (10.1.42) - Documentation Index

Oracle MySQL

https://dev.mysql.com/doc/

Postgres

https://www.postgresql.org/docs/

https://www.postgresql.org/files/documentation/pdf/15/postgresql-15-US.pdf

PostgreSQL v15 manual PDF link

CouchDB

https://docs.couchdb.org/en/stable/

SQLite

https://www.sqlite.org/docs.html

Derby Java DB

https://db.apache.org/derby/manuals/

JHipster Lite

https://github.com/jhipster/jhipster-lite

https://hub.docker.com/r/jhipster/jhipster-lite JHipster Lite image

Grails

https://grails.org/documentation.html
https://views.grails.org/latest/

https://cs4760.csl.mtu.edu/2019/assignments/cs4760-assignments/programming-assignments/1-building-your-first-app/

https://groovy-lang.org/documentation.html

Spark

https://github.com/apache/spark

Apache-Spark-Beginners-Guide-2023-Ebook_8-Steps-V2.pdf link

Apache EventMesh

EventMesh is a new generation serverless event middleware for building distributed event-driven applications.

IPFS

https://docs.ipfs.tech/

Kafka

https://kafka.apache.org/documentation/

https://www.confluent.io/resources/online-talk/fundamentals-for-apache-kafka-2-part-series/?utm_medium=sem&utm_source=google&utm_campaign=ch.sem_br.nonbrand_tp.prs_tgt.kafka_mt.mbm_rgn.namer_lng.eng_dv.all_con.kafka-service&utm_term=%2Bkafka%20%2Bservice&creative=&device=c&placement=&gad=1&gclid=Cj0KCQjw6cKiBhD5ARIsAKXUdyap0EBNsV1F-fyqcrdi927y1qpcJx7mi2YbmuD3LfXkSCoHUURN9PwaAqEHEALw_wcB video

Hyperledger

https://hyperledger-fabric.readthedocs.io/en/release-2.5/

https://skywebteam.atlassian.net/wiki/spaces/KHUB/pages/54919289#mMessaging-ActiveMQ

ActiveMQ - topic queues, broadcast models ( pub / sub )

ArtemisMQ - docs url

Apache ServiceMix - ActiveMQ, Camel, Service mesh ..

https://servicemix.apache.org/docs/7.x/index.html

heavyweight framework

Apache Camel - data source connections

https://camel.apache.org/docs/

JDBC Drivers

https://www.geeksforgeeks.org/jdbc-drivers/

https://www.ibm.com/docs/en/i/7.4?topic=jdbc-types-drivers

https://en.wikipedia.org/wiki/JDBC_driver

BIRT - the original open-source data frames solution for analytics workbooks

https://eclipse.github.io/birt-website/

https://www.eclipse.org/community/eclipse_newsletter/2015/september/article3.php

Grafana - open-source data visualization on many data sources

https://grafana.com/docs/

https://grafana.com/docs/grafana/latest/introduction/

DBeaver - open DB client tool

https://dbeaver.io/

Web CMS list from wikipedia

https://en.wikipedia.org/wiki/List_of_content_management_systems

OpenCMS

http://www.opencms.org/en/

https://en.wikipedia.org/wiki/OpenCms

http://www.opencms.org/en/news/230425-opencms-v1500.html

https://documentation.opencms.org/central/

xWiki

 

https://www.xwiki.org/xwiki/bin/view/Main/WebHome

https://www.xwiki.org/xwiki/bin/view/Documentation/UserGuide/Features/SecondGenerationWiki/

https://www.xwiki.org/xwiki/bin/view/Documentation/DevGuide/

https://www.xwiki.org/xwiki/bin/view/Documentation/

https://dev.xwiki.org/xwiki/bin/view/Community/SupportStrategy/DatabaseSupportStrategy

https://en.wikipedia.org/wiki/XWiki

JSPWiki

https://jspwiki.apache.org/

https://jspwiki-wiki.apache.org/Wiki.jsp?page=Documentation

https://jspwiki-wiki.apache.org/Wiki.jsp?page=Getting%20Started

https://jspwiki-wiki.apache.org/Wiki.jsp?page=ContributedPlugins#section-ContributedPlugins-ContributedPluginsPriorToV2.9.x

Drupal

https://www.drupal.org/

Eclipse

https://www.eclipse.org/documentation/

Visual Studio Code

https://code.visualstudio.com/docs

IntelliJ

https://www.jetbrains.com/help/idea/getting-started.html

 

 

 

 

 

 

Data Services  Commercial Solutions



GCP 



AWS Aurora



AWS RedShift



Azure



Snowflake



Tealium







Cloud data warehouse solutions



https://www.scnsoft.com/analytics/data-warehouse/cloud

cloud-dwh-2022-Top 6 Cloud Data Warehouse Solutions.pdf file











Key Concepts

 

Data Use Cases & Decision Match Data Processing Flows

 

 

image-20240823-180636.png

 


Functional Data Layers Architecture

https://www.linkedin.com/posts/rajkgrover_dataplatforms-businessintelligence-analytics-activity-7133748286368141313-RXgv?utm_source=share&utm_medium=member_desktop


Source: Deloitte

The purpose of a data platform is to collect, store, transform and analyze data and make that data available to (business) users or other systems. It is often used for business intelligence, (advanced) analytics (such as machine learning) or as a data hub.

The platform consists of several components that can be categorized into common layers that each have a certain function. These layers are: Data Sources, Ingestion Layer, Processing Layer, Storage Layer, Analytics Layer, Visualization Layer, Security, and Data Governance (Figure 1).
 
Data Sources
This layer contains the different sources of the data platform. This can be any information system, like ERP or CRM systems, but it can also be other sources like Excel files, Text files, pictures, audio, video or streaming sources like IOT devices.

Ingestion Layer
The ingestion layer is responsible for loading the data from the data sources into the data platform. This layer is about extracting data from the source systems, checking the data quality and storing the data in the landing or staging area of the data platform.

Processing Layer
The processing layer is responsible for transforming the data so that it can be stored in the correct data model. Processing can be done in batches (scheduled on a specific time/day) or done real-time depending on the type of data source and the requirements for the data availability.

Storage Layer
The data is stored in the storage layer. This can be a relational database or some other storage technologies such as cloud storage, Hadoop, NoSQL database or Graph database.

Analytics Layer
In the analytics layer the data is further processed (analyzed). This can be all kinds of (advanced) analytics algorithms, for example for machine learning. The outcome of the analytics can be sent to the visualization layer or stored in the storage layer.

Visualization Layer
The data is presented to the end-user in the visualization layer. This can be in the form of reports, dashboards, self-service BI tooling or APIs so that the data can be used by other systems.

Security
One of the important tasks of a data platform is to guarantee that only users that are allowed to use the data have access. A common method is user authentication and authorization, but it can also be required that the data is encrypted (in storage and in transfer) and that all activities on the data are audited so that it is known who has accessed or modified which data.

Data Governance
Data governance is about locating the data in a data catalog, collecting and storing metadata about the data, managing the master data and/or reference data, and providing insights on where the data in the data platform originates from (i.e., data lineage).
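
The layer descriptions above can be sketched as a chain of small functions. This is only an illustration and is not taken from the Deloitte source: the function names ( ingest, process, store, analyze ), the record fields and the in-memory list standing in for the storage layer are all assumptions.

```python
from statistics import mean

def ingest(raw_rows):
    """Ingestion layer: load source rows and apply a basic quality check."""
    staged = []
    for row in raw_rows:
        # Hypothetical quality rule: reject rows with a missing amount.
        if row.get("amount") is not None:
            staged.append(dict(row))
    return staged

def process(staged_rows):
    """Processing layer: transform rows into the target data model."""
    return [
        {"customer": r["customer"].strip().upper(), "amount": float(r["amount"])}
        for r in staged_rows
    ]

def store(storage, rows):
    """Storage layer: an in-memory list stands in for a database."""
    storage.extend(rows)
    return storage

def analyze(storage):
    """Analytics layer: a simple aggregate per customer."""
    amounts = {}
    for r in storage:
        amounts.setdefault(r["customer"], []).append(r["amount"])
    return {c: {"total": sum(v), "average": mean(v)} for c, v in amounts.items()}

raw = [
    {"customer": " acme ", "amount": "10.5"},
    {"customer": "acme", "amount": "4.5"},
    {"customer": "globex", "amount": None},  # fails the quality check
    {"customer": "globex", "amount": "7"},
]
warehouse = store([], process(ingest(raw)))
report = analyze(warehouse)  # would feed the visualization layer
```

In a real platform each function would be a separate service or job, and security and governance would wrap every step rather than sit at the end.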

Is Hadoop still valid for batch data processing?

Apache Spark and other open-source frameworks are better now for some use cases

logz.io/blog/hadoop-vs-spark/

According to the logz.io comparison, Spark has been found to run up to 100 times faster than Hadoop MapReduce in memory, and 10 times faster on disk. It has also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines. Spark has been found to be particularly fast on machine learning applications, such as Naive Bayes and k-means.

Spark's processing speed has been found to be better than Hadoop's for several reasons:

  1. Spark is not bound by input-output concerns every time it runs a selected part of a MapReduce task. It has proven to be much faster for applications.

  2. Spark’s DAGs enable optimizations between steps. Hadoop doesn’t have any cyclical connection between MapReduce steps, meaning no performance tuning can occur at that level.

However, if Spark runs on YARN alongside other shared services, performance might degrade and RAM overhead can lead to memory leaks. For this reason, Hadoop has been found to be the more efficient system for batch-processing use cases.
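
A rough analogy for the two reasons above, in plain Python rather than Spark or Hadoop code. The eager version builds a full intermediate result after every step, loosely like writing output between MapReduce jobs. The lazy version chains the steps so each element flows through all of them in one pass. The function names and the counter are assumptions made for illustration.

```python
def eager_pipeline(numbers):
    """Each step materializes its whole output before the next step starts."""
    intermediates = 0
    doubled = [n * 2 for n in numbers]
    intermediates += 1
    kept = [n for n in doubled if n % 3 == 0]
    intermediates += 1
    return sum(kept), intermediates

def lazy_pipeline(numbers):
    """Steps are chained lazily; no intermediate collection is built."""
    doubled = (n * 2 for n in numbers)
    kept = (n for n in doubled if n % 3 == 0)
    return sum(kept), 0

data = range(1, 11)
eager_total, eager_intermediates = eager_pipeline(data)
lazy_total, lazy_intermediates = lazy_pipeline(data)
```

Both versions return the same total; only the number of intermediate results differs, which is the part that turns into disk I/O at cluster scale.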

 

Improve RAJG data virtualization layers with data consumption methods

 

https://www.linkedin.com/posts/rajkgrover_data-datamanagement-banking-activity-7221140607228911617-hgeR?utm_source=share&utm_medium=member_desktop

add consumption models from sources, lakes

batch, transaction request, events, streams

add column for MDM, governance

 

 

 

 

 

 

Data Services Methods

 

https://www.linkedin.com/posts/giorgiotorre1234_how-many-api-architecture-styles-do-you-activity-7059072388340064256-TM9U?utm_source=share&utm_medium=member_desktop

Architecture styles define how different components of an application programming interface (API) interact with one another.

Here are the most used styles:

🔹SOAP:
Mature, comprehensive, XML-based
Best for enterprise applications

🔹RESTful:
Popular, easy-to-implement, HTTP methods
Ideal for web services

🔹GraphQL:
Query language, request specific data
Reduces network overhead, faster responses

🔹gRPC:
Modern, high-performance, Protocol Buffers
Suitable for microservices architectures

🔹WebSocket:
Real-time, bidirectional, persistent connections
Perfect for low-latency data exchange

🔹Webhook:
Event-driven, HTTP callbacks, asynchronous
Notifies systems when events occur

Source: ByteByteGo, Alex Xu
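
To make the "request specific data" point for GraphQL concrete, here is a minimal sketch that contrasts a fixed REST-style representation with client-side field selection. It does not use a real GraphQL library; the CUSTOMER record and both function names are assumptions made for illustration.

```python
import json

# Hypothetical record held by the service.
CUSTOMER = {
    "id": 7,
    "name": "Acme",
    "email": "ops@acme.example",
    "orders": [101, 102, 103],
    "notes": "x" * 200,
}

def rest_style_get(customer_id):
    """REST style: the server decides the shape of the representation."""
    return dict(CUSTOMER)

def graphql_style_get(customer_id, fields):
    """GraphQL style: the client names the fields it wants returned."""
    return {name: CUSTOMER[name] for name in fields}

full = rest_style_get(7)
slim = graphql_style_get(7, ["id", "name"])
full_bytes = len(json.dumps(full))
slim_bytes = len(json.dumps(slim))
```

The smaller payload is where the "reduces network overhead" claim comes from; the trade-off is more work in the server to resolve arbitrary field selections.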

 

Data Event Message Communications



  1. Most applications are integrated based on events with shared data. 

  2. An event occurs in PC1 ( Process Context 1 ) and 1 or more other dependent processes ( PC2 .. PCN ) will listen and react to the events and the related event data.

  3. Sources and Handlers of Events?  Function or Object?  While pure functions can generate events, objects provide a context for the event beyond the current function.

  4. Objects are key for automated processing of events. They go beyond functions, providing a valid context and the capability to handle responsibilities for events, behaviors and data.

  5. They organize and simplify the use of functions and APIs.
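
The points above can be sketched as a small in-process publish / subscribe example, where an event raised in one context is handled by objects that keep their own state. The class names ( EventBus, InventoryHandler, BillingHandler ), the event name and the data fields are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    data: dict

class EventBus:
    """Connects the source context ( PC1 ) to the listeners ( PC2 .. PCN )."""
    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._listeners[event_name].append(handler)

    def publish(self, event):
        for handler in self._listeners[event.name]:
            handler(event)

@dataclass
class InventoryHandler:
    """Handler object: carries context ( stock levels ) across events."""
    stock: dict = field(default_factory=lambda: {"widget": 10})

    def __call__(self, event):
        self.stock[event.data["sku"]] -= event.data["qty"]

@dataclass
class BillingHandler:
    """A second listener reacting to the same event with its own state."""
    invoices: list = field(default_factory=list)

    def __call__(self, event):
        total = event.data["qty"] * event.data["unit_price"]
        self.invoices.append((event.data["order_id"], total))

bus = EventBus()
inventory = InventoryHandler()
billing = BillingHandler()
bus.subscribe("order.placed", inventory)
bus.subscribe("order.placed", billing)
bus.publish(Event("order.placed",
                  {"order_id": 1, "sku": "widget", "qty": 3, "unit_price": 2.5}))
```

A plain function could handle the event too, but it would have nowhere to keep the stock levels or invoices between events, which is the context the handler objects provide.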



Event requirements

  1. consider async event handling requirements vs sync handling

  2. requires saving state for later processing, replays and recovery

  3. is event access based on push or pull model?  in many scenarios, push may be more efficient for real-time responsive application flows

  4. is event message persistence required?

  5. what are the replay and recovery requirements?

  6. are messages broadcast or handled by specific handler?

  7. is the design for a configured handler ( eg API, database, Web Sockets or RPC )?

  8. is the design for a registered event listener (  pub / sub messages, SSE ( Server Sent Events ) )?

  9. what are the V's ( Volatility, Volume, Variety, Variance, Value, Validity ) ?
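
Several of the questions above ( persistence, push vs pull, replay and recovery ) can be explored with a small append-only event log. This is a sketch under stated assumptions: the EventLog class, its method names and the offset-based consumer are hypothetical and do not describe any specific product.

```python
class EventLog:
    """Append-only log: persisted events allow pull delivery and replay."""
    def __init__(self):
        self._events = []
        self._push_subscribers = []

    def subscribe(self, callback):
        """Push model: the callback is invoked as soon as an event arrives."""
        self._push_subscribers.append(callback)

    def append(self, event):
        self._events.append(event)
        offset = len(self._events) - 1
        for callback in self._push_subscribers:
            callback(offset, event)
        return offset

    def read_from(self, offset):
        """Pull model: the consumer asks for events starting at an offset."""
        return [(i, e) for i, e in enumerate(self._events) if i >= offset]

log = EventLog()
pushed = []
log.subscribe(lambda offset, event: pushed.append(event))
for name in ["created", "paid", "shipped"]:
    log.append(name)

# Pull consumer: reads at its own pace and saves the next offset to read.
next_offset = 0
batch = log.read_from(next_offset)[:2]
pulled = [event for _, event in batch]
next_offset = batch[-1][0] + 1

# Recovery: after a restart the consumer resumes from its saved offset.
resumed = [event for _, event in log.read_from(next_offset)]

# Replay: state can be rebuilt by reading the whole log again.
replayed = [event for _, event in log.read_from(0)]
```

Without persistence only the push path exists, and a consumer that was offline when an event was published has no way to recover it.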

Frameworks for communicating Events

  • API service - a client can call an API to send an event object to a service for processing and a response ( can be sync or async invocation )

  • Database with SSE ( Server Sent Events ) and Streams. SSE doesn't require a database ( can be a service ) but the DB can persist the events for replay, recovery etc

  • Messaging - optional persistence and a wide set of interaction models: push or pull delivery to clients, async or sync, broadcast or request / response handling

  • RPC - Remote Procedure Call - direct invocation of a remote process from the current process, passing an event object and optionally returning a result object ( can be sync or async invocation )

  • Web Sockets - a bidirectional ( full-duplex ) interactive communications model between 2 processes ( source and target ). A Web Socket is created from an HTTP connection that is upgraded

  • SSE - Server Sent Events - from a data service or custom api service- 1 way async messages from server to client

  • Distributed Files sent using a file service ( SFTP, Rsync etc )

  • Blockchain agents - most DLTs offer transaction finality and a variety of environment events

  • Custom Communications Service - custom comm apps can be fully duplex
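
As one concrete case from the list above, an SSE message is plain text in the text/event-stream format: optional id and event fields, one or more data lines, and a blank line to end the message. The sketch below formats and parses a single message without any web framework. The helper names format_sse and parse_sse are assumptions, and the parser is simplified ( it ignores comments and retry fields ).

```python
import json

def format_sse(data, event=None, event_id=None):
    """Build one Server Sent Events message as text."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads are sent as several data lines.
    for chunk in str(data).split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"

def parse_sse(message):
    """Parse a single message back into its fields ( simplified )."""
    fields = {"data": []}
    for line in message.strip("\n").split("\n"):
        name, _, value = line.partition(": ")
        if name == "data":
            fields["data"].append(value)
        else:
            fields[name] = value
    fields["data"] = "\n".join(fields["data"])
    return fields

payload = json.dumps({"order_id": 1, "status": "shipped"})
message = format_sse(payload, event="order.updated", event_id=42)
parsed = parse_sse(message)
```

The id field is what lets a reconnecting client tell the server the last event it saw, which ties SSE back to the replay and recovery requirements.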





GEMS = Global Event Management System



supports the Modern Value Chain Networks

used by multiple solution layers:  net apps, malls, stores, services providers, tools

modular solution that connects across networks and platforms with interfaces and adapters to connect many components

open standards-based platform composed primarily of sustainable, open frameworks with added open source connector services

project managed by an open foundation ( see Linux Foundation or ? )

provides solid NFR capabilities for most use cases

extensible on services, interfaces

manageable and maintainable

version tolerance with mutation policies
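
One possible reading of "version tolerance with mutation policies" is that a consumer accepts events written in older schema versions and upgrades them step by step before handling them. The sketch below assumes that reading; the version numbers, field names and upgrade functions are all hypothetical.

```python
def v1_to_v2(event):
    # Hypothetical change: v2 split "name" into first and last name.
    first, _, last = event["name"].partition(" ")
    return {"version": 2, "first_name": first, "last_name": last}

def v2_to_v3(event):
    # Hypothetical change: v3 added "country" with a default value.
    return {**event, "version": 3, "country": event.get("country", "unknown")}

# Policy table: how to mutate an event from one version to the next.
UPGRADES = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3

def upgrade(event):
    """Apply upgrade steps until the event reaches the current version."""
    while event["version"] < CURRENT_VERSION:
        event = UPGRADES[event["version"]](event)
    return event

old_event = {"version": 1, "name": "Ada Lovelace"}
new_event = upgrade(old_event)
```

Keeping each step small means producers on older versions keep working while consumers move ahead.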





Steps to define GEMS

I have a clear idea of the problem I'm trying to solve but need to build a real doc to clarify the use cases with concrete examples. Firefly may be a big part of the solution since it can connect multiple DLT nets, can communicate in multiple ways at the services layer ( not just DLT ) and has basic event support capabilities.
If I name the solution it's a global event workflow service that does pub / sub ( both push and pull models ) across multiple networks connected by a supernode model. There are concrete examples I need to specify.
I've built a simple version of that type of service in the past on a single distributed platform on IBM i because the platform had built-in support for event workflows that could be connected over a distributed net.



EDA > Event Driven Architecture for a simple event workflow solution

https://www.linkedin.com/posts/rajkgrover_eventdrivenarchitecture-microservices-transformpartner-activity-6988742530594942977-eGkE?utm_source=share&utm_medium=member_desktop



Architectural blueprint for Event-Driven Architecture and Microservices Systems

The following figure is an architectural diagram of an EDA-microservices-based enterprise system. Some microservices components and types are shown separately for better clarity of the architecture.
 
The EDA and microservices-specific components in this blueprint are:
 
  • Event backbone. The event backbone is primarily responsible for transmission, routing, and serialization of events. It can provide APIs for processing event streams. The event backbone offers support for multiple serialization formats and has a major influence on architectural qualities such as fault tolerance, elastic scalability, throughput, and so on. Events can also be stored to create event stores. An event store is a key architectural pattern for recovery and resiliency.

  • Services layer. The services layer consists of microservices, integration, and data and analytics services. These services expose their functionality through a variety of interfaces, including REST API, UI, or as EDA event producers and consumers. The services layer also contains services that are specific to EDA and that address cross-cutting concerns, such as orchestration services, streaming data processing services, and so on.

  • Data layer. The data layer typically consists of two sublayers. In this blueprint, individual databases owned by microservices are not shown.

      • Caching layer, which provides distributed and in-memory data caches or grids to improve performance and support patterns such as CQRS. It is horizontally scalable and may also have some level of replication and persistence for resiliency.

      • Big data layer, which is comprised of data warehouses, ODS, data marts, and AI/ML model processing.

  • Microservices chassis. The microservices chassis provides the necessary technical and cross-cutting services that are required by different layers of the system. It provides development and runtime capabilities. By using a microservices chassis, you can reduce design and development complexity and operating costs, while you improve time to market, quality of deliverables, and manageability of a huge number of microservices.

  • Deployment platform. Elastic, cost optimized, secure, and easy to use cloud platforms should be used. Developers should use as many PaaS services as possible to reduce maintenance and management overheads. The architecture should also provision for hybrid cloud setup, so platforms such as Red Hat OpenShift should be considered.
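
The event store and caching layer ideas in this blueprint can be sketched together: commands produce events, events are appended to a store, and a cached read model ( the query side of CQRS ) is derived from those events and can be rebuilt after a failure. This is a minimal sketch and not the IBM design; the class names, command types and account data are assumptions.

```python
class EventStore:
    """Append-only store of events, used here for recovery."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def all_events(self):
        return list(self._events)

class BalanceReadModel:
    """Query side of CQRS: a cached view derived only from events."""
    def __init__(self):
        self.balances = {}

    def apply(self, event):
        sign = 1 if event["type"] == "deposited" else -1
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + sign * event["amount"]

def handle_command(store, read_model, command):
    """Command side: validate, record the resulting event, update the view."""
    account, amount = command["account"], command["amount"]
    if command["type"] == "withdraw":
        if read_model.balances.get(account, 0) < amount:
            raise ValueError("insufficient funds")
        event = {"type": "withdrawn", "account": account, "amount": amount}
    else:
        event = {"type": "deposited", "account": account, "amount": amount}
    store.append(event)
    read_model.apply(event)
    return event

store, view = EventStore(), BalanceReadModel()
handle_command(store, view, {"type": "deposit", "account": "A", "amount": 100})
handle_command(store, view, {"type": "withdraw", "account": "A", "amount": 30})

# Recovery: a lost cache is rebuilt by replaying the event store.
rebuilt = BalanceReadModel()
for stored_event in store.all_events():
    rebuilt.apply(stored_event)
```

In a distributed setup the read model would be updated asynchronously from the event backbone, so queries may briefly lag behind commands.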



Key architectural considerations
The following architectural considerations are extremely important for event-driven, microservices-based systems:
 
  • Architectural patterns
  • Technology stack
  • Event modeling
  • Processing topology
  • Deployment topology
  • Exception handling
  • Leveraging event backbone capabilities
  • Security
  • Observability
  • Fault tolerance and response


Source: IBM

Jim >>

The concepts shown are a good start but not adequate to meet the event solution models we are looking at. On our end, we are looking to define a more global model for different use cases than you have here. I'm sure your implementation can be successful for your use case.

Solve the CAP theorem for async ACID event transactions



Raj Grover >>

Event processing topology

https://www.linkedin.com/posts/rajkgrover_eventdrivenarchitecture-microservices-activity-6988742530594942977-rNFF/?originalSubdomain=my

In EDA, processing topology refers to the organization of producers, consumers, enterprise integration patterns, and topics and queues to provide event processing capability. These topologies are essentially event processing pipelines in which parts of the functional logic (processors) are joined together using enterprise integration patterns, queues and topics. Processing topology is a combination of the SEDA, EIP, and Pipes and Filters patterns. For complex event processing, multiple processing topologies can be connected to each other.
The following figure depicts a blueprint of a processing topology.
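
A processing topology of this kind can be sketched as processors joined by queues. The example below runs the stages one after another in a single thread to keep it deterministic; a SEDA-style system would give each stage its own workers. The processor names ( validate, enrich, route ), the topic names and the event fields are assumptions made for illustration.

```python
from queue import Queue

def run_stage(inbox, outbox, processor):
    """Drain one queue through a processor into the next queue."""
    while not inbox.empty():
        result = processor(inbox.get())
        if result is not None:  # a processor may filter an event out
            outbox.put(result)

def validate(event):
    """Filter: drop events without a positive amount."""
    return event if event.get("amount", 0) > 0 else None

def enrich(event):
    """Transformer: add a default currency."""
    return {**event, "currency": event.get("currency", "USD")}

def route(event):
    """Content-based router: choose a topic for the event."""
    topic = "large-orders" if event["amount"] >= 100 else "small-orders"
    return (topic, event)

incoming, validated, enriched, routed = Queue(), Queue(), Queue(), Queue()
for order in [{"id": 1, "amount": 250}, {"id": 2, "amount": 0}, {"id": 3, "amount": 40}]:
    incoming.put(order)

run_stage(incoming, validated, validate)
run_stage(validated, enriched, enrich)
run_stage(enriched, routed, route)

results = []
while not routed.empty():
    results.append(routed.get())
```

Because each processor only knows its inbox and outbox, topologies can be connected to each other by pointing one pipeline's last queue at another's first.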





compare to Firefly Core Stack for Event Management

Firefly: Web3 Blockchain framework#FireflyFeaturesandServices

 

Apache Pulsar - Event messaging & streaming

https://skywebteam.atlassian.net/wiki/spaces/KHUB/pages/61112419#mApacheDataServices-ApachePulsar

Apache® Pulsar™ is an open-source, distributed messaging and streaming platform built for the cloud.

What is Pulsar

Apache Pulsar is an all-in-one messaging and streaming platform. Messages can be consumed and acknowledged individually or consumed as streams with less than 10ms of latency. Its layered architecture allows rapid scaling across hundreds of nodes, without data reshuffling.

Its features include multi-tenancy with resource separation and access control, geo-replication across regions, tiered storage and support for six official client languages. It supports up to one million unique topics and is designed to simplify your application architecture.

Pulsar is a Top 10 Apache Software Foundation project and has a vibrant and passionate community and user base spanning small companies and large enterprises.

 

Apache EventMesh

EventMesh is a new generation serverless event middleware for building distributed event-driven applications.

key features EventMesh has to offer:

  • Built around the CloudEvents specification.

  • Rapidly extensible language sdk around gRPC protocols.

  • Rapidly extensible middleware by connectors such as Apache RocketMQ, Apache Kafka, Apache Pulsar, RabbitMQ, Redis, Pravega, and RDBMS ( in progress ) using JDBC.

  • Rapidly extensible controller such as Consul, Nacos, ETCD and Zookeeper.

  • Guaranteed at-least-once delivery.

  • Deliver events between multiple EventMesh deployments.

  • Event schema management by catalog service.

  • Powerful event orchestration by Serverless workflow engine.

  • Powerful event filtering and transformation.

  • Rapid, seamless scalability.

  • Easy function development and framework integration.
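
Two of the features above have direct consequences for consumers. A CloudEvents envelope carries the required attributes specversion, id, source and type, and at-least-once delivery means the same event can arrive more than once, so handlers should be idempotent. The sketch below uses plain dictionaries and does not call the EventMesh SDK; the helper name, the consumer class and the event type are assumptions.

```python
import uuid

def make_cloud_event(event_type, source, data):
    """Build an event envelope with the required CloudEvents 1.0 attributes."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "datacontenttype": "application/json",
        "data": data,
    }

class IdempotentConsumer:
    """Ignores redelivered events by remembering the ids already handled."""
    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, event):
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery, already processed
        self.seen_ids.add(event["id"])
        self.total += event["data"]["amount"]
        return True

event = make_cloud_event("com.example.order.paid", "/orders", {"amount": 25})
consumer = IdempotentConsumer()
first_delivery = consumer.handle(event)
second_delivery = consumer.handle(event)  # simulated redelivery
```

In practice the set of seen ids would need to be persisted and bounded, for example with a time window, which is left out here.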

 

Event Solutions Comparisons

 

Solace

 

https://www.slideshare.net/Pivotal/solace-messaging-for-open-data-movement

image-20240131-162627.png

 

https://www.slideshare.net/MagaliBoulet/solace-an-open-data-movement-company

image-20240131-162705.png

 

image-20240131-162749.png

 

Hyperledger Firefly Distributed Ledger Event Management

https://hyperledger.github.io/firefly/

https://hyperledger.github.io/firefly/reference/events.html

swt>FireflySolutionReview-HyperledgerFireflyDistributedLedgerEventManagement

Hyperledger FireFly Event Bus

The FireFly event bus provides your application with a single stream of events from all of the back-end services that plug into FireFly.