m Data Services Concepts

Key Points

  1. Data pipelines connect and transform data from sources to targets, in batches or as event streams
  2. Semantic data services - services that interpret and map other data sources semantically to a data context


References

Reference description with linked URLs                                                              Notes
m Data Mgt Concepts
m Data Architecture
m Data Quality
i Data Mgt
s Blockchain Data Privacy & Compliance Services


https://www.w3resource.com/mongodb/nosql.php - SQL compared to NoSQL article
https://www.freetsa.org/index_en.php - Free Timestamp Authority service *

microservices-events-20201028-WP-Event_Driven_Microservices.pdf

https://drive.google.com/file/d/1X8nU4ayzt0TlUZb61xl4UwtRKMIxpUJA/view?usp=sharing


CDP - Customer Data Platforms - let customers consume data for their own needs - article 

Tealium_Definitive-Guide-for-CDPs-eBook-NA. link

data-customer-access-Tealium_Definitive-Guide-for-CDPs-eBook-NA.pdf file

Examples, decision matrix, implementation process, and sourcing options ( 3B model ) for customer data platforms ( see Apache solutions )


Kafka




https://github.com/aidtechnology/lf-k8s-hlf-webinar - Kafka on Kubernetes Tutorial






https://postlight.com/trackchanges/introducing-glide-graphql-for-salesforce
Glide - open-source GraphQL front-end for Salesforce API set




https://pages.databricks.com/201811-US-WB-AzureSeries-ty-wb.html?aliId=eyJpIjoiSXJcL2pld0hRSHRNOFVcL3oxIiwidCI6IjJYaDQ3S0VZNkhRS2Q3WldKeW4rQ1E9PSJ9

https://pages.databricks.com/WB-azuretraining-01.html

https://pages.databricks.com/WB-azuretraining-02.html

https://pages.databricks.com/WB-azuretraining-03.html

Databricks video tutorials on Azure - engineering, analytics, data science




Articles
https://www.imperva.com/learn/data-security/soc-2-compliance/ - SOC 2 Data Quality Standards for Data Services

https://www.oss.com/asn1/resources/asn1-made-simple/introduction.html

ASN.1 data encoding stds for data exchange - gdrive

ASN.1 data encoding stds for data exchange

https://www.oss.com/asn1/resources/reference/ASN.1-Reference-Card-format-USA.pdf

ASN.1 Reference card - gdrive


ASN.1 Reference card
Data Infrastructure Concepts - What good data looks like - datacamp 
5 modern ETL tools - TechTarget





Key Concepts


Supply Chain Data Examples Categorized by Structure, Volume Dimensions

Linkedin - Raj Grover - Transform Partners - Supply Chain Data Governance tip




Data Management Standards


  1. Data Modeling Standards
  2. Enterprise Metadata Standards
  3. Data Architecture Standards
  4. Data Quality Standards
  5. Master Data Management Standards
  6. Data Governance Standards & Policies
  7. Data Services Standards
  8. Data Transformation Standards
  9. Data Security Standards & Policies
  10. Data Transaction Standards
  11. Distributed Data Standards
  12. Data Control Standards
  13. Data Event Services Standards



Data Table Types



  • Master Code Tables
  • Meta model definition Tables
  • Service Control Tables
  • Process Control Tables
  • Interface Control Tables
  • Product Control Tables
  • Environment Control Tables
  • Event Control Tables
  • Data Management Control Tables
  • Configuration Tables
  • Master Tables
  • Transaction Tables 
  • Transaction Ledgers
  • Transaction Logs
  • Event Transaction Tables
  • Work Tables ( global, process, user )
  • Work Management Tables 




Data Attribute Types


Character

Number

Decimal

Date

Boolean

Binary



OLTP 


Observability tools - common features for OLTP

oltp-observability-astronomer.io-Watch Demos On-Demand - Astronomer.pdf link

oltp-observability-astronomer.io-Watch Demos On-Demand - Astronomer.pdf file




OLAP

real-time on-demand access

smart caching and data age

data streaming

event streaming and reactive services



Enterprise Data Landscape Components

Raj Grover

https://www.linkedin.com/posts/rajkgrover_data-dataintegration-datagovernance-activity-7123621892367736832-NbI3?utm_source=share&utm_medium=member_desktop

Components of #Data Landscape:

 

The term "data landscape" refers to the comprehensive view of all data assets and their interactions within an organization. It encompasses the various data sources, data repositories, data flows, data processing, and data usages across the enterprise.



The data landscape provides a holistic perspective on how data is collected, stored, processed, and utilized to support business operations, decision-making, and strategic objectives.

 

Key components and aspects of a data landscape typically include:

 

-Data Sources: These are the origins of data, such as databases, applications, sensors, external partners, or any system that generates or captures data.


 

-Data Types: Different categories of data, including structured data (e.g., databases, spreadsheets) and unstructured data (e.g., text documents, images, videos).

 

-Data Repositories: The places where data is stored, including on premises databases, data warehouses, data lakes, cloud storage, and file systems.


 

-Data Flows: The pathways through which data moves within the organization. This includes data ingestion, transformation, movement between systems, and data export.


 

-Data Processing: The actions and transformations applied to data as it moves through the data landscape. This can involve data cleansing, aggregation, analytics, and reporting.


 

- #DataIntegration: The process of combining data from different sources to provide a unified view. Integration helps eliminate data silos and ensures data consistency.


 

-Data Usage: How data is utilized within the organization, including reporting, business intelligence, analytics, and other applications. This also covers data access and usage permissions.


 

- #DataGovernance: The policies, procedures, and controls governing data quality, security, privacy, and compliance throughout the data landscape.


 

- #Metadata: Information about the data, such as data definitions, data lineage, data ownership, and data quality indicators. Metadata is critical for understanding and managing the data landscape.


 

- #DataArchitecture: The structural design and organization of data assets, including data models, schemas, and data structures.

 

- #DataManagement Tools: The software and platforms used to manage and maintain the data landscape, such as data integration tools, data modeling tools, data cataloging tools, and data governance platforms.


 

- #DataSecurity and Access Control: Measures in place to protect sensitive data, including authentication, authorization, encryption, and data masking.


 

-Data Ownership and Stewardship: Roles and responsibilities for managing and maintaining data assets, including data owners and data stewards.


 

- #DataQuality and Data Profiling: Processes for assessing and improving the quality of data, as well as understanding the characteristics of data elements



Datacamp Data and Machine Learning Tools Landscape Resource Map

Identifies many data technologies, tools and resources that can be used to build, operate and support an Enterprise Data Landscape

 

Image Source: Datacamp

https://www.datacamp.com/blog/infographic-data-and-machine-learning-tools-landscape





Data Lake versus Data Warehouse

https://www.linkedin.com/posts/transformpartner_datalake-metadata-dataanalytics-activity-7093934239548633088-k2Eo?utm_source=share&utm_medium=member_desktop

When to use a data lake

Typical uses for a data lake include data exploration, data analytics, and #machinelearning.
A data lake can also act as the data source for a #datawarehouse. With this approach, the raw data is ingested into the data lake and then transformed into a structured queryable format. Typically this transformation uses an ELT (extract-load-transform) pipeline, where the data is ingested and transformed in place. Source data that is already relational may go directly into the data warehouse, using an ETL process, skipping the data lake.
Data lake stores are often used in event streaming or IoT scenarios, because they can persist large amounts of relational and nonrelational data without transformation or schema definition. They are built to handle high volumes of small writes at low latency, and are optimized for massive throughput.


Data Lake vs Data Warehouse






m Apache Data Services


Data Exchange Stds - ISO20022

https://www.iso20022.org/sites/default/files/documents/general/Scripted_ISO_20022_ppt_long_version_v176.ppt

ISO_20022-fin-data-exchange-stds_ppt_long_version_v176.pdf


Data Exchange Formats - ISO ASN.1 std

ASN.1 Reference card

Serializable data formats with different encoding stds

concept of a schema and data

supports XML, JSON, BER, PER formats

ISO ASN.1 std reference card - gdrive


Contact ::= SEQUENCE {
name VisibleString,
phone NumericString
}

So, the above contact named John Smith with a phone number of 987 6543210 might be serialized in binary or text format (ASN.1 Encoding Rules). For example:

                     Basic Encoding Rules (BER)
       30 19  80 0A 4A6F686E20536D697468  81 0B 3938372036353433323130
 
                     Packed Encoding Rules (PER) 
       0A 4A 6F 68 6E 20 53 6D 69 74 68 0B A9 80 76 54 32 10
 
                     XML Encoding Rules (XER)    
       <?xml version="1.0" encoding="UTF-8"?> <Contact> 
       <name>John Smith</name> <phone>987 6543210</phone> </Contact> 
 
                     JSON Encoding Rules (JER)  
       { "name" : "John Smith", "phone" : "987 6543210" }



Data Quality - SOC 2 Data Quality Standards for Data Services

https://www.imperva.com/learn/data-security/soc-2-compliance/

SOC 2 is an auditing procedure that ensures your service providers securely manage your data to protect the interests of your organization and the privacy of its clients. For security-conscious businesses, SOC 2 compliance is a minimal requirement when considering a SaaS provider.

Developed by the American Institute of CPAs (AICPA), SOC 2 defines criteria for managing customer data based on five “trust service principles”—security, availability, processing integrity, confidentiality and privacy.

SOC 2 Certification Criteria




SQL Basics

https://datamovesme.com/2019/12/30/free-sql-for-data-science-course/

DDL

DML
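A minimal sketch of the DDL vs DML distinction, using Python's built-in sqlite3 module only for illustration; the table and column names are made up:

       # DDL defines structures; DML reads and writes the rows inside them.
       import sqlite3

       conn = sqlite3.connect(":memory:")
       conn.execute("CREATE TABLE customer_order (ordno INTEGER PRIMARY KEY, status TEXT)")       # DDL
       conn.execute("INSERT INTO customer_order (ordno, status) VALUES (?, ?)", (1001, "OPEN"))   # DML - write
       print(conn.execute("SELECT ordno, status FROM customer_order").fetchall())                 # DML - read
       conn.close()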


see Jim Mason SQL Sessions

m IBM i and Z Solutions & Services#JimMason-BestSQLsession


NoSQL compared to SQL

https://www.w3resource.com/mongodb/nosql.php

https://drive.google.com/open?id=1UAwrKXa9PN3SG4V6IrVfLzdb3L_lgwnw

Transcript


What is NoSQL?

NoSQL refers to non-relational database management systems that differ from traditional relational database management systems in some significant ways. They are designed for distributed data stores with very large-scale data storage needs (for example Google or Facebook, which collect terabytes of data every day for their users). These types of data stores may not require a fixed schema, avoid join operations, and typically scale horizontally.

NoSQL

  • Stands for Not Only SQL
  • No declarative query language
  • No predefined schema
  • Key-value pair storage, column stores, document stores, graph databases
  • Eventual consistency rather than ACID properties
  • Unstructured and unpredictable data
  • Prioritizes high performance, high availability and scalability

NoSQL Really Means...

An emphasis on non-relational, next-generation operational data stores and databases....

Why NoSQL?

Today, data is becoming easier to access and capture through third parties such as Facebook, Google+ and others. Personal user information, social graphs, geo-location data, user-generated content and machine logging data are just a few examples where data has been increasing exponentially. To serve these applications properly, it is necessary to process huge amounts of data. NoSQL databases have evolved to manage these huge amounts of data properly.

Where to use NoSQL

  • Social data
  • Information processing
  • Logging
  • Geospatial analysis
  • Molecular modeling
  • etc.

When to use NoSQL

  • Large amounts of data
  • Many reads / writes
  • Economical
  • Flexible schema
  • No transactions needed
  • ACID is not important
  • No joins

Relational vs NoSQL Database Systems

Relational                              NoSQL
Relations                               Basic relations
Data Manipulation Language (DML)        No predefined schema
Indexes                                 Indexes
Structured Query Language (SQL)         Limited query capabilities
Transactions                            Generally no
Structured and organized data           Unstructured and unpredictable data
Only values                             Values & references
Strongly typed                          Loosely typed
Joins                                   No joins

CAP Theorem

You need to understand the CAP theorem when you talk about NoSQL databases, or in fact when designing any distributed system. The CAP theorem states that there are three fundamental requirements which exist in a particular relationship when designing applications for a distributed architecture.

  • Consistency - This means that the data in the database remains consistent after the execution of an operation. For example, after an update operation all clients see the same data.
  • Availability - This means that the system is always up (guaranteed service availability), with no downtime.
  • Partition Tolerance - This means that the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.


NoSQL Categories

There are four general types (most common categories) of NoSQL databases. Each of these categories has its own specific attributes and limitations. There is no single solution that is better than all the others; however, there are databases that are better suited to solving specific problems. To clarify the NoSQL databases, let's discuss the most common categories:

* Key-Value store

* Column store

* Document store

* Graph Databases

Key-Value store

* Key-value stores are the most basic types of NoSQL databases.

* Designed to handle huge amounts of data.

* Based on Amazon's Dynamo paper.

* Key-value stores allow the developer to store schema-less data.

* In key-value storage, the database stores data as a hash table where each key is unique and the value can be a string, JSON, BLOB (Binary Large OBject), etc.

* A key can be a string, hash, list, set, or sorted set, and values are stored against these keys.

* Key-value stores can be used as collections, dictionaries, associative arrays, etc.

* Key-value stores follow the "availability" and "partition tolerance" aspects of the CAP theorem.

* Key-value stores work well for shopping cart contents, or for individual values like colors, a landing page URI, or a default account number.


Key-Value store Databases

Examples:

* Redis

* Dynamo

* Riak

Column-oriented

* Column-oriented databases work primarily on columns, and every column is treated individually.

* The values of a single column are stored contiguously.

* Column data is stored in column-specific files.

* In column stores, query processors also operate on columns.

* All the data within each column data file has the same type, which makes it ideal for compression.

* Column stores can improve query performance because they can access specific column data directly.

* High performance on aggregation queries (for example COUNT, SUM, AVG, MIN, MAX).

* Works well for data warehouses and business intelligence, customer relationship management (CRM), library card catalogs, etc.


Column-oriented Databases

Examples:

* BigTable

* Cassandra

* SimpleDB

Document oriented databases

* A collection of documents.

* Data in this model is stored inside documents.

* A document is a set of key-value pairs where the key is used to access its value.

* Documents are generally not forced to have a schema, and are therefore flexible and easy to change.

* Documents are stored in collections in order to group different kinds of data.

* Documents can contain many different key-value pairs, key-array pairs, or even nested documents.


Document oriented databases

Examples:

* MongoDB

* CouchDB

Graph databases

* A graph database stores data in a graph.

* It is capable of elegantly representing any kind of data in a highly accessible way.

* A graph database is a collection of nodes and edges.

* Each node represents an entity (such as a student) and each edge represents a connection or relationship between two nodes.

* Every node and edge is defined by a unique identifier.

* Each node knows its adjacent nodes.

* As the number of nodes increases, the cost of a local step (or hop) remains the same.

* Indexes are used for lookups.


Graph databases

Examples:

* OrientDB

* Neo4J

* Titan

NoSQL pros/cons

Advantages:

  • High scalability
  • Distributed computing
  • Lower cost
  • Schema flexibility, semi-structured data
  • No complicated relationships

Disadvantages:

  • No standardization
  • Limited query capabilities (so far)
  • Eventual consistency is not intuitive to program for
  • Not mature enough for enterprises

Production deployment

There is a large number of companies using NoSQL. To name a few:

  • Google
  • Facebook
  • Mozilla
  • Adobe
  • Foursquare
  • LinkedIn

Thank you for your Time and Attention!




Redis DB Cache Management

https://www.w3resource.com/slides/redis-nosql-database-an-introduction.php

https://drive.google.com/open?id=1E3-abaRe0dSBiGCrJJqJ9hx-1P5HDDgL

<<tip>>  compare to ehcache, internal server caches ( session and global )


Redis (Remote DIctionary Server) is an open source, networked, single-threaded, in-memory, advanced key-value store with optional durability. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

★ Developer(s) : Salvatore Sanfilippo, Pieter Noordhuis
★ Initial release : April 10, 2009
★ Stable release : 2.8.12 / June 23, 2014
★ Development status : Active
★ Written in : ANSI C
★ Operating system : Cross-platform
★ Type : Key–value stores
★ License : BSD

Redis : What for?

★ Cache out-of-process
★ Duplicate detector
★ LIFO/FIFO Queues
★ Priority Queue
★ Distributed HashMap
★ UID Generator
★ Pub/Sub
★ Real-time analytics & chat apps
★ Counting Stuff
★ Metrics DB
★ Implement expires on items
★ Leaderboards (game high scores)
★ Geolocation lookup
★ API throttling (rate-limits)
★ Autocomplete
★ Social activity feed

Who's using Redis?

★ Twitter
★ Github
★ Weibo
★ Pinterest
★ Snapchat
★ Craigslist
★ Digg
★ StackOverflow
★ Flickr
★ 220.000.000.000 commands per day
★ 500.000.000.000 reads per day
★ 50.000.000.000 writes per day
★ 500+ servers 2000+ Redis instances!

key Features

★ Atomic operations
★ Lua Scripting
★ Pub/Sub
★ Transactions
★ Master/Slave replication
★ Cluster (with automatic sharding)*
★ Automatic failover (Redis Sentinel)
★ Append Only File (AOF) persistence
★ Snapshot (RDB file) persistence
★ Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
★ Every list, set, and sorted set can hold 2^32 elements.

key Advantages & Disadvantages

► Advantages

★ Blazing Fast
★ Robust
★ Easy to setup, Use and maintain
★ Extensible with LUA scripting

► Disadvantages

★ Persistence consumes a lot of I/O when using RDB
★ All your data must fit in memory

Redis in-memory

► Redis is an in-memory but persistent on disk database, so it represents a different trade off where very high write and read speed is achieved with the limitation of data sets that can't be larger than memory.
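A small hedged sketch of one of the use cases listed above (a game leaderboard built on a sorted set), using the Python redis client; the host, port, and key names are assumptions:

       # Leaderboard on a Redis sorted set: scores stay ordered, top-N reads are cheap.
       import redis

       r = redis.Redis(host="localhost", port=6379, decode_responses=True)
       r.zadd("game:leaderboard", {"alice": 1200, "bob": 950, "carol": 1430})   # add / update scores
       r.zincrby("game:leaderboard", 50, "bob")                                  # bob scores 50 more points
       top3 = r.zrevrange("game:leaderboard", 0, 2, withscores=True)             # highest scores first
       print(top3)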



Why GraphQL


Fits CQRS - the Command Query Responsibility Segregation principle - for performance, modularity, control

Glide - open-source GraphQL front-end for Salesforce API set

https://postlight.com/trackchanges/introducing-glide-graphql-for-salesforce

Glide is a free toolkit for engineers who are building Salesforce integrations into their apps. Glide allows you to seamlessly execute GraphQL requests against any Salesforce instance with minimal setup. The code is free and open source. Get Glide on GitHub.

Developers love working with GraphQL, because it brings a strongly-typed schema to your application’s data, and gives clients control over requesting exactly what they need, no more and no less. This reduces network traffic (and latency) compared to REST, where your app has to make multiple calls to various endpoints to retrieve the data it needs (plus some it doesn’t). When GraphQL is paired with React, it’s an incredible development toolchain.

Glide is the quickest way to prototype and build GraphQL applications backed by Salesforce data. Glide automatically introspects your Salesforce data models, and creates an intuitive and idiomatic GraphQL schema out of the box. Glide also handles Salesforce authentication for you, so you can hit the ground running and start prototyping right away.

Glide can spin up a GraphQL server pointing at your Salesforce instance, or you can import Glide into your existing server code and deploy it. With Glide, you can start making GraphQL queries and mutations on your Salesforce data in two commands. Step-by-step instructions to get started are available in the docs.
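As a hedged sketch of the "ask for exactly what you need" idea described above, the snippet below posts a GraphQL query over HTTP with Python requests. The endpoint URL and the object/field names (accounts, Name, Industry) are assumptions for illustration, not the actual Glide schema:

       # The client names the exact fields it wants; the server returns only those fields.
       import requests

       query = """
       {
         accounts(limit: 5) {
           Name
           Industry
         }
       }
       """

       resp = requests.post("http://localhost:4000/graphql", json={"query": query})   # assumed local Glide-style endpoint
       print(resp.json())   # only Name and Industry per account - no over-fetching from a generic REST resource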





Semantic Data Services and Data Models


Goals

Break down data silos, data translation mapping work

Foundation for mapping to smart contracts across disparate organizations


Data Services Catalog Service for a Context


Find, add, map new items to the catalog ( source can be a data store, message service, etc )

Client understands a data source semantically ( organization meaning vs simple schema attributes ).

theCustomerOrder, theCustomerOrderNbr vs simple   ORDNO, type packed decimal etc

Automated Curation, Data Certification ( eg Pyramis data feeds w suspended batches / transactions ) of source and data



Free Time Stamp Authority Service

https://www.freetsa.org/index_en.php

Trusted time stamping is the process of securely keeping track of the creation and modification times of a document. Security here means that no one - not even the owner of the document - should be able to change it once it has been recorded provided that the time stamper's integrity is never compromised. FreeTSA trusted time stamping Software as a Service (SaaS) provides an easy method to apply RFC 3161 trusted timestamps to time-sensitive transactions through independently verified and auditable date and UTC (Coordinated Universal Time) sources.

Adding a trusted timestamp to code or to an electronic signature provides a digital seal of data integrity and a trusted date and time of when the transaction took place. Recipients of documents and code with a trusted timestamp can verify when the document or code was digitally or electronically signed, as well as verify that the document or code was not altered after the date the timestamp vouches for. (Readme).

For multiple files, the general concept is that time stamping a single file that contains an aggregate list of fingerprints of other files also proves that these other files must have existed before the aggregate file was created, provided that both the aggregate file and the referenced files are available during the verification process. FreeTSA also offers the possibility of timestamping URLs (do not abuse). If you are interested in implementing timestamps on your project / company using the FreeTSA service, you can contact me for specific requirements. FreeTSA can also be used within the Tor anonymity network.

Making timestamps from Android/iOS devices is possible and no software installation is required (Video). To timestamp a picture you must go to the "Online Signature" section in the browser, click "Choose a file" and select "Use camera". Once the photo or video are completed, you can download the timestamp (TimeStampResponse). Just that. You also may select any other file available in the device if you choose so.

Freetsa.org offers free OCSP and CRL, NTP, DNSCRYPT, DNS (Port 53) and DNS over TLS (DoT) services for time synchronisation and encrypted name resolution. The DNSCRYPT resolver (Port 553) does not have any type of restriction (SPAM, Malware, Parental,...). No logs are saved. Like the rest of the services offered by FreeTSA, DoT accepts TCP connections to port 853 on all its IPv4/IPv6 addresses.

Guide: How to sign PDF documents files with time stamp
FreeTSA onion domain (Tor): th3ccojidpgbgv5d.onion (https /http).
RFC 3161 TSA: Time-Stamp Protocol (TSP).
RFC 958 NTP: Network Time Protocol (NTP).
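A hedged sketch of requesting an RFC 3161 timestamp from FreeTSA, following the pattern in their guides: build a timestamp query with openssl, then POST it to the TSA endpoint. The file name is an assumption; the openssl flags and the https://freetsa.org/tsr endpoint are taken from the FreeTSA examples, so verify them against the current docs:

       # Build an RFC 3161 TimeStampReq for a file, send it to FreeTSA, save the signed TimeStampResp.
       import subprocess
       import requests

       subprocess.run(["openssl", "ts", "-query", "-data", "contract.pdf",
                       "-no_nonce", "-sha512", "-cert", "-out", "contract.tsq"], check=True)

       with open("contract.tsq", "rb") as f:
           resp = requests.post("https://freetsa.org/tsr", data=f.read(),
                                headers={"Content-Type": "application/timestamp-query"})
       resp.raise_for_status()

       with open("contract.tsr", "wb") as f:
           f.write(resp.content)   # the token proving contract.pdf existed at the issued time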



Event-Driven Data Services concepts


Difference between REST API and Event Driven API 

REST APIs can be asynchronous using

  1. web hooks ( callbacks ) - a minimal sketch follows below
  2. web sockets
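A minimal sketch of the webhook ( callback ) style from the list above, using Flask; the route path and payload shape are assumptions:

       # The API provider POSTs an event to this URL when work completes, instead of the client polling.
       from flask import Flask, request

       app = Flask(__name__)

       @app.route("/callbacks/order-status", methods=["POST"])
       def order_status_callback():
           event = request.get_json()          # payload pushed by the upstream service
           print("received callback:", event)
           return "", 204                       # acknowledge receipt; no body needed

       if __name__ == "__main__":
           app.run(port=8080)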


Event-Driven APIs

generally designed to respond to or handle a stream of events from clients

the simplified diagram from Gravitee below highlights the differences 

In reality, the request model for event-driven COULD be request / response ( not shown below ) OR pub / sub ( broadcast )


SSE - Server Sent Events - publish an event to a client listener

A number of databases now support SSE.

This is a server to client push technology for messages the client can respond to

Unlike web sockets, which provide the same server push capability, SSE does not allow the client to respond directly to the message, which may be a security advantage

This was a very useful model when I needed to capture transaction data from vendor applications that refused to provide an agent to see those transactions

If you control the database directly, those enterprise databases can be configured to publish SSE to clients. If not, you can use the binary logs as I did below to generate SSE from a server app to clients.

Most enterprise databases ( MySQL, Postgres, DB2, Oracle, SQL Server etc ) have binary log files for federated or distributed clusters that can be read by a client app to capture the data transactions and publish those to remote clients listening for an SSE. You can define a responsive workflow that is initiated by a transaction in a proprietary app IF it uses one of these enterprise databases.


https://www.linkedin.com/posts/gravitee-io_book-a-demo-with-one-of-our-api-experts-today-activity-7161448046931771392-ECW0?utm_source=share&utm_medium=member_desktop


any enterprise Data Service ( database, service etc ) should be able to:

  • generate events ( typically SSE - Server-Sent Events ) to any registered client as read-only event data
  • register to listen for events from other event sources, including event frameworks like Apache EventMesh, Kafka etc
  • a data service event handler would typically invoke a process ( eg a gRPC call, an API, or a query )

Using SSE ( server sent events ) generated by a service from SQL DB binary replicas to integrate with closed solutions like Cavu HCM


You may have a software vendor where you license their application software and you own your data, but they won't provide a free API to access your data from their servers. Given that 90% of those solutions use a standard database on the backend ( DB2, MySQL, Mongo etc ), and those standard databases can update a replica using a binary log, the DB vendors have tools or APIs to access the binary log file transactions. By creating a function to read each selected binary log event, you can create SSE ( Server-Sent Events ) to publish selected events to registered listeners. This can be an effective workaround to integrate using event-driven automation even when your software vendor doesn't provide direct API or event access.
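A hedged sketch of that workaround: a small Flask endpoint that streams Server-Sent Events from a queue that a binlog/replica reader thread (not shown) would fill; the endpoint path and event shape are assumptions:

       # Publish database change events to registered listeners as SSE (text/event-stream).
       import json
       import queue

       from flask import Flask, Response

       app = Flask(__name__)
       changes = queue.Queue()   # a binlog/replica reader thread would put row-change dicts here

       @app.route("/db-events")
       def db_events():
           def stream():
               while True:
                   change = changes.get()                      # block until the next captured transaction
                   yield f"data: {json.dumps(change)}\n\n"     # SSE wire format: "data: <payload>" + blank line
           return Response(stream(), mimetype="text/event-stream")

       if __name__ == "__main__":
           app.run(port=8081, threaded=True)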


More on SSE

m Data Architecture#Data-Architecture-Support-for-Event-Streams-for-Async-Server-Data-Flow-with-SSE

https://en.wikipedia.org/wiki/Server-sent_events

https://shopify.engineering/server-sent-events-data-streaming


Event-Driven Services Concepts Terminology

https://developer.confluent.io/faq/apache-kafka/architecture-and-terminology/


What is event-driven architecture?

An event-driven architecture is an architecture based on producing, consuming, and reacting to events, either within a single application or as part of an intersystem communication model. Events are communicated via event streams, and interested consumers can subscribe to the event streams and process the events for their own business purposes.


Why use an event-driven architecture?

Event-driven architecture enables loose coupling of producers and consumers via event streams, and is often used in conjunction with microservices. The event streams provide a mechanism of asynchronous communication across the organization, so that each participating service can be independently created, scaled, and maintained. Event-driven architectures are resistant to the impact of intermittent service failures, as events can simply be processed when the service comes back up. This is in contrast to REST API / HTTP communication, where a request will be lost if the server fails to reply.


What is an event stream?

An event stream is a durable and replayable sequence of well-defined domain events. Consumers independently consume and process the events according to their business logic requirements.

A topic in Apache Kafka is an example of an event stream.
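As a hedged illustration of appending to and reading from such an event stream, the sketch below uses the confluent-kafka Python client; the broker address, topic name, and consumer group are assumptions:

       # Produce one event to a topic, then consume it back as an independent consumer group.
       from confluent_kafka import Producer, Consumer

       producer = Producer({"bootstrap.servers": "localhost:9092"})
       producer.produce("orders", key="order-1001", value='{"status": "CREATED"}')   # append an event to the stream
       producer.flush()

       consumer = Consumer({
           "bootstrap.servers": "localhost:9092",
           "group.id": "order-analytics",        # each group gets its own replayable view of the topic
           "auto.offset.reset": "earliest",
       })
       consumer.subscribe(["orders"])
       msg = consumer.poll(5.0)
       if msg is not None and msg.error() is None:
           print(msg.key(), msg.value())
       consumer.close()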


What are the benefits of streaming data?

Streaming data enables you to create applications and services that react to events as they happen, in real time. Your business can respond to changing conditions as they occur, altering priorities and making accommodations as necessary. The same streams of operational events can also be used to generate analytics and real-time insights into current operations.

Check out the free courses to learn more about the benefits of thinking in events as well as building streaming data pipelines.


What is event sourcing?

Event sourcing is the capture of all changes to the state of an object, frequently as a series of events stored in an event stream. These events, replayed in the sequence in which they occurred, can be used to reconstruct both the intermediate and final states of the object.


What is an event broker?

An event broker hosts event streams so that other applications can consume from and publish to the streams via a publish/subscribe protocol. Kafka's event brokers are usually set up to be distributed, durable, and resilient to failures, to enable big data scale event-driven communication in real time.


What is stream processing?

Stream processing is an architectural pattern where an application consumes event streams and processes them, optionally emitting its own resultant events to a new set of event streams. The application may be stateless, or it may also build internal state based on the consumed events. Stream processing is usually implemented with a dedicated stream processing technology, such as Kafka Streams or ksqlDB.


What’s the difference between a queue and a topic?

Topics (also known as Event streams) are durable and partitioned, and they can be read by multiple consumers as many times as necessary. They are often used to communicate state and to provide a replayable source of truth for consumers.

Queues are usually unpartitioned and are frequently used as an input buffer for work that needs to be done. Usually, each message in a queue is dequeued, processed, and deleted by a single consumer.


What is distributed computing?

Distributed computing is an architecture in which components of a single system are located on different networked computers. These components communicate and coordinate their actions across the network, either using direct API calls or by sending messages to each other.


What is a microservice?

A microservice is a standalone and independently deployable application, hosted inside of a container or virtual machine. A microservice is purpose-built to serve a well-defined and focused set of business functions, and communicates with other microservices through either event streams or direct request-response APIs. Microservices typically leverage a common compute resource platform to streamline deployments, monitoring, logging, and dynamic scaling.

Read more about microservices and Kafka in this blog series.


What is CQRS?

Command Query Responsibility Segregation (CQRS) is an application architecture that separates commands (modifications to data) from queries (accessing data). This pattern is often used alongside event sourcing in event-driven architectures.

You can learn more about CQRS as part of the free Event Sourcing and Event Storage with Apache Kafka® training course


What is a REST API?

A REST API is an Application Programming Interface (API) that adheres to the constraints of the Representational State Transfer (REST) architectural style. REST API is often used as an umbrella term to describe a client and server communicating via HTTP(S).

You can use the REST Proxy to send and receive messages from Apache Kafka.


What is a data lake?

A data lake is a centralized repository for storing a broad assortment of data sourced from across an organization, for the primary purpose of cross-domain analytical computation. Data may be structured, semi-structured, or unstructured, and is usually used in combination with big data batch processing tools.

Data may be loaded into a data lake in batch, but is commonly done by streaming data into it from Kafka.


What is data mesh?

Data mesh is an approach to solving data communication problems in an organization by treating data with the same amount of rigor as any other product. It is founded on four pillars: data as a product, domain ownership, federated governance, and self-service infrastructure. This strategy is a formal and well-supported approach for providing reliable, trustworthy, and effective access to data across an organization.


What is real-time data analysis?

Real-time data analysis, also known as streaming analytics, refers to the processing of streams of data in real time to extract important business results. This processing is usually implemented with a dedicated stream processing technology, such as Kafka Streams or ksqlDB.


What is an enterprise service bus?

An enterprise service bus (ESB) is a software platform that routes and distributes data among connected applications, without requiring sending applications to know the identity or destination of receiving applications. ESBs differ from Kafka in several key ways. ESBs typically have specialized routing logic, whereas Kafka leaves routing to the application management side. ESBs also do not generally support durable and replayable events, a key feature for services using Kafka.

This blog discusses in more detail the similarities and differences between Kafka and ESBs.


Apache Projects for GEMS - Global Event Management Solutions

m Apache Data Services#ApachePulsar


FinTech common data models



CDM (Common Domain Model) from ISDA

https://www.isda.org/2019/03/20/isda-publishes-cdm-2-0-for-deployment-and-opens-access-to-entire-market/


CDM Fact Sheet

https://www.isda.org/a/z8AEE/ISDA-CDM-Factsheet.pdf

The ISDA CDM is the first industry solution to tackle the lack of standard conventions in how derivatives trade events and processes are represented. Developed in response to regulatory changes, high costs associated with current manual processes and a demand for greater automation across the industry, the ISDA CDM for the first time creates a common blueprint for events that occur throughout the derivatives lifecycle, paving the way for greater automation and efficiency at scale.

“The current derivatives infrastructure is hugely inefficient and costly, and there’s virtually no way to implement scalable automated solutions across the industry.”



PURE data modeler with Alloy UI - GS

( see https://medium.com/@vipinsun/pure-alloy-2fd3cd995e90 for inspiration on Data Silos and Pure + Alloy )





Trust Services Extensions to Data Solutions

STS - Smart Trust Services: deliver trusted outcomes



Potential Value Opportunities



Smart Data Services = the right Data Value Services infrastructure 


Data Infrastructure Concepts - What good data looks like - datacamp

sdm-smart-data-mgt-datacamp-What Mature Data Infrastructure Looks Like.pdf

What valuable data looks like 

https://www.datacamp.com/resources/webinars/data-trends-2021


Smart Data Services Infrastructure


Core Smart Data Management Principles

EEP - Entry > Edit > Post

Data Quality = 

  • data issue prevention  ( goalie )
  • data issue cleanup   ( janitor )
  • data understanding  ( whisperer )
  • data satisfaction  ( therapist )
  • data value  ( value provider )


Data Operations Management Model for Analytics and AI


https://www.linkedin.com/posts/rajkgrover_dataops-datamanagement-agile-activity-7066804921358032897-qRV9?utm_source=share&utm_medium=member_desktop

DataOps is a collaborative data management discipline that focuses on end-to-end data management and the elimination of data silos.

DataOps offers the following benefits:
 
  • Decreases the cycle time in deploying analytical solutions
  • Lowers data defects
  • Reduces the time required to resolve data defects
  • Minimizes data silos
 




Reactive Relational DB Connections  ( R2DBC ) and services example

https://dzone.com/articles/transactional-outbox-pattern-step-by-step-with-spr?edition=598292&utm_campaign=&utm_content=Writing%20a%20vector%20database%20in%20a%20week%20in%20Rust&utm_medium=email&utm_source=dzone&utm_term=

r2dbc-dzone.com-Transactional Outbox Patterns - DZone.pdf link

r2dbc-dzone.com-Transactional Outbox Patterns - DZone.pdf file

The reason we need a Transactional Outbox is that a service often needs to publish messages as part of a transaction that updates the database. Both the database update and the sending of the message must happen within a transaction. Otherwise, if the service doesn't perform these two operations atomically, a failure could leave the system in an inconsistent state.

The GitHub repository with the source code for this article.

Transactional Outbox

In this article, we will implement it using Reactive Spring and Kotlin with Coroutines. Here is a full list of used dependencies: Kotlin with Coroutines, Spring Boot 3, WebFlux, R2DBC, Postgres, MongoDB, Kafka, Grafana, Prometheus, Zipkin, and Micrometer for observability.
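A language-agnostic sketch of the pattern itself (not the article's Kotlin/R2DBC code), shown here with Python and SQLite: the business row and the outgoing message are committed in one transaction, and a separate relay publishes the outbox rows later. Table names and payload shape are assumptions:

       # Transactional outbox: write the business row and the outbound message in ONE transaction.
       import json
       import sqlite3
       import uuid

       db = sqlite3.connect(":memory:")
       db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
       db.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")

       order_id = str(uuid.uuid4())
       with db:   # single transaction: both inserts commit, or neither does
           db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "CREATED"))
           db.execute("INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
                      (str(uuid.uuid4()), "orders", json.dumps({"orderId": order_id, "status": "CREATED"})))

       # A relay process would poll unpublished outbox rows, send them to Kafka, then mark them published.
       for row in db.execute("SELECT id, topic, payload FROM outbox WHERE published = 0"):
           print("would publish", row)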



Data Governance Opportunities


see STS - Smart Trust Services: deliver trusted outcomes for advanced concepts


m Data Architecture#Data-Governance-Management-Concepts


m Data Architecture#Data-Governance-Concepts-and-Tools

Data Governance Concepts

Linkedin - RajG on Data Governance concepts


Data governance has its roots in the field of information management and has evolved over time in response to the increasing importance of data in various industries. The concept of data governance emerged as organizations recognized the need to manage their data as a valuable asset and ensure its quality, security, and compliance. The history of data governance can be traced back to several key factors and developments:
 
- Early #InformationManagement
- IT and Database Management
- Regulatory and #Compliance Requirements
- Emergence of #DataWarehousing and #BusinessIntelligence
- Key US Regulations > Sarbanes-Oxley Act (SOX) and Dodd-Frank Act 
- #DataQuality and Trust
- Master Data Management (MDM) 
- Digital Transformation 
- Emergence of Big Data and #Analytics
- Risk Management
- Complexity and Diversity of Data Sources
- Data Privacy Concerns and #GDPR
- Digital Asset Tokenization and Digital Twins 
- Modern Data Governance
- Globalisation and Data Sharing
- Technology Advancement

Over time, these factors led to the formalization of data governance frameworks and practices within organizations. Today, data governance encompasses a range of activities, including defining data ownership, establishing data policies and standards, ensuring data quality, managing data access and security, complying with regulations, and aligning data initiatives with business goals.
 
Overall, data governance has evolved from a response to challenges posed by data management and regulatory compliance into a strategic discipline that enables organizations to maximize the value of their data while managing risks and ensuring accountability.


Data Governance and Data Management



a>> Data Governance areas for VSLN





Potential Challenges



Candidate Solutions & Tools



Kafka & Spark - distributed event message network with streaming support


Kafka - distributed messaging service for message brokers and pub sub use cases


Spark - data transformation services library with data streaming support


Apache Events

EventMesh - global event mesh network supporting local and remote events for global workflows


Apache Pulsar - distributed global event mesh




CouchDB - NoSQL Distributed database with built-in API and Logical Query language - open-source



LevelDB - Key Value distributed database - open-source


CockroachDB - Key Value distributed database


https://alternativeto.net/software/cockroachdb/



YugabyteDB - Cloud native DB w Postgres SQL, Query Layer, Storage Layer, Replicas
( and Processes? Trusts? Events? Interfaces? Extensions? Mapping? Directories ? )


https://en.wikipedia.org/wiki/YugabyteDB

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte.[1]


data-Distributed SQL Databases for Dummies-Yugabyte.pdf. link



Architecture

YugabyteDB is a distributed SQL database that aims to be strongly transactionally consistent across failure zones (i.e. ACID compliance).[22][23] Jepsen testing, the de facto industry standard for verifying correctness, has never fully passed, mainly due to race conditions during schema changes.[24] In CAP theorem terms, YugabyteDB is a Consistent/Partition-Tolerant (CP) database.[25][26][27]

YugabyteDB has two layers,[28] a storage engine known as DocDB and the Yugabyte Query Layer.

YugabyteDB has the ability to replicate between database instances.[47][48] The replication can be one-way or bi-directional and is asynchronous

DocDB Storage Services

The storage engine consists of a customized RocksDB[29][30] combined with sharding and load balancing algorithms for the data. In addition, the Raft consensus algorithm controls the replication of data between the nodes.[29][30] There is also a Distributed transaction manager[29][30] and Multiversion concurrency control (MVCC)[29][30] to support distributed transactions.[30]

The engine also exploits a Hybrid Logical Clock[31][29] that combines coarsely-synchronized physical clocks with Lamport clocks to track causal relationships.[32]

The DocDB layer is not directly accessible by users.[29]

YQL = Yugabyte Query Layer

Yugabyte has a pluggable query layer that abstracts the query layer from the storage layer below.[33] There are currently two APIs that can access the database:[30]

YSQL[34] is a PostgreSQL code-compatible API[35][36] based around v11.2. YSQL is accessed via standard PostgreSQL drivers using native protocols.[37] It exploits the native PostgreSQL code for the query layer[38] and replaces the storage engine with calls to the pluggable query layer. This re-use means that Yugabyte supports many features, including:

  • Triggers & Stored Procedures[36]
  • PostgreSQL extensions that operate in the query layer[36]
  • Native JSONB support[36]

YCQL[39] is a Cassandra-like API based around v3.10 and re-written in C++. YCQL is accessed via standard Cassandra drivers[40] using the native protocol port of 9042. In addition to the 'vanilla' Cassandra components, YCQL is augmented with the following features:

  • Transactional consistency - unlike Cassandra, Yugabyte YCQL is transactional.[41]
  • JSON data types supported natively[42]
  • Tables can have secondary indexes[43]

Currently, data written to either API is not accessible via the other API, however YSQL can access YCQL using the PostgreSQL foreign data wrapper feature.[44]

The security model for accessing the system is inherited from the API, so access controls for YSQL look like PostgreSQL,[45] and YCQL looks like Cassandra access controls.
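Because YSQL is PostgreSQL wire compatible, a standard Postgres driver should be able to connect to it. The sketch below uses psycopg2; the host, credentials, and port 5433 (Yugabyte's usual YSQL port) are assumptions to verify against your cluster:

       # Connect to YugabyteDB's YSQL API with an ordinary PostgreSQL driver.
       import psycopg2

       conn = psycopg2.connect(host="localhost", port=5433, dbname="yugabyte",
                               user="yugabyte", password="yugabyte")
       cur = conn.cursor()
       cur.execute("SELECT version()")   # reports a PostgreSQL-compatible version string
       print(cur.fetchone())
       conn.close()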


YugabyteDB versions






Step-by-step guide for Example






Recommended Next Steps