m Database

Key Points

  1. SQL - DB2, Oracle, SQL Server, MySQL, Postgres
  2. NoSQL - Mongodb, Couchdb, Couchbase, Domno
  3. ACID vs CAP compliant transactions - Atomic Consistent Isolated Durable vs Consistency Availability Partition Tolerance
  4. Isolation levels:  Serialized, Repeatable Read, Committed Read, Uncomitted
  5. RDB Normalization for ODS = 3rd normal form - fields have no transitive dependencies ( whse_code, whse_name in orderLineItem table )
  6. RDB Normalization for RDS = 1st normal form - fields have dependency on primary key


References

Reference_description_with_linked_URLs________________________________Notes______________________________________________________________
i Data Services
m Data Services Concepts
m Data Architecture
m.Data Analytics
m Data Quality


m MySQL overview
m MySQL Workbench
s SQL and data solutions
m Data CouchDb
m MongoDb usage basics
m MongoDB
m Mongoose for MongoDb
m Apache Spark
m Apache Data
m Snowflake data services




https://en.wikipedia.org/wiki/Database_normalization

https://www.bkent.net/Doc/simple5.htm

RDB normalization - wikipedia

how to normalize data - process steps





https://docs.google.com/document/d/1rIt7LDGuAlM-n7vIVLc4A3r3LMyBCAcZrzSkWIKyfaQ/edit

Basic Database Design Concepts Course Overview

data-privacy-management-study-2019-Integris-and-Ivy-Exec-Data-Privacy-Maturity-Study.pdf

Interview questions -

General Enterprise and Data Architecture concepts doc



https://www.zentut.com/data-warehouse/kimball-and-inmon-data-warehouse-architectures/

Inmon vs Kimball data models

file:///C:/Users/Jim%20Mason/Google%20Drive/_groups/nemug/SQL%20Stored%20
Procedures%20and%20Application%20Modernization.pdf

SQL Stored Procedures

https://www.datacamp.com/community/tutorials/finance-python-trading?fbclid=IwAR1n33gRfaNvZv02RR5wCrSNkS4RqxJQqipBGQuqMyWc6
_akSPQ2hAkFp6c

DataCamp - Python for Algorithmic Trading - Jupyter Notebooks

file:///C:/Users/Jim%20Mason/Google%20Drive/_groups/nemu/NEMUG%20-%20June%20Meeting%20-%20Ted%20Holt%20-
%2030%20SQL%20tips%20in%2060%20minutes.pdf

30 SQL Tips - Ted Holt



NoSQL

CouchDb

CouchDb

CouchDbase

https://www.couchbase.com/partners/amazon/

https://docs.couchbase.com/server/current/cloud/couchbase-aws-marketplace.html

CouchDbase on AWS

https://stackshare.io/immudb

https://codenotary.com/technologies/immudb/

https://docs.immudb.io/master/

immuno.db - immutable key value store **

it is immutable: history is preserved and can't be changed without clients noticing. It can be used both as a key-value store, or as a relational database (SQL).

  • Transactional integrity, multi-node
  • ACID compliance
  • SQL and K/V
  • SDKs for Python, Node.js, Java, Go

immudb is licensed under the Apache v2.0 License.

https://github.com/codenotary/immudb

how to setup run immudb in AWS S3 or locally in minio store 


https://github.com/minio/minio

minio local object store ( like AWS S3 ) *

features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding



Jim Mason related data services presentations 


Advanced SQL. Jim Mason. Web solutions for iseries engineer, build, deploy, support, train - PDF Free Download.pdf
as400pro.com-iSeries AS400 Web Applications Web Services Development Websphere Apache Tomcat PHP EGL CGIDEV2 etc f.pdf
blog.chinaunix.net-JDBCon iSeries-mason.pdf
jim-mason_5-2008_WebFacing_HATS_overview.pdf
jim-mason_windows_integration_4i_work_v2.pdf
jim-mason-QuickWebDataMart_MySQL.pdf
ibm-wdsc-2005-rse1.pdf















Key Concepts


RDB Normalization 

https://en.wikipedia.org/wiki/Database_normalization


UNF
(1970)
1NF
(1970)
2NF
(1971)
3NF
(1971)
EKNF
(1982)
BCNF
(1974)
4NF
(1977)
ETNF
(2012)
5NF
(1979)
DKNF
(1981)
6NF
(2003)
Primary key (no duplicate tuples)
Maybe
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No repeating groups
Maybe
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Atomic columns (cells have single value)[8]
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Every non-trivial functional dependency either does not begin with a proper subset of a candidate key or ends with a prime attribute (no partial functional dependencies of non-prime attributes on candidate keys)[8]
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Every non-trivial functional dependency begins with a superkey or ends with a prime attribute (no transitive functional dependencies of non-prime attributes on candidate keys)[8]
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Every non-trivial functional dependency either begins with a superkey or ends with an elementary prime attribute[8]
No
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
N/A
Every non-trivial functional dependency begins with a superkey[8]
No
No
No
No
No
Yes
Yes
Yes
Yes
Yes
N/A
Every non-trivial multivalued dependency begins with a superkey[8]
No
No
No
No
No
No
Yes
Yes
Yes
Yes
N/A
Every join dependency has a superkey component[9]
No
No
No
No
No
No
No
Yes
Yes
Yes
N/A
Every join dependency has only superkey components[8]
No
No
No
No
No
No
No
No
Yes
Yes
N/A
Every constraint is a consequence of domain constraints and key constraints[8]
No
No
No
No
No
No
No
No
No
Yes
N/A
Every join dependency is trivial[8]
No
No
No
No
No
No
No
No
No
No
Yes


Steps to normalize data 

https://www.bkent.net/Doc/simple5.htm

good examples

rdb-normalization-examples-bkent.net-A Simple Guide to Five Normal Forms in Relational Database Theory.pdf


Normalization concepts, quiz 

https://rdb.dp.ua/en/chapter_03#:~:text=There%20are%20five%20different%20levels,BCNF)%2C%204NF%2C%205NF.

rdb-normalization-how2-quiz-rdb.dp.ua-3 The normalization of relations during the DB designing.pdf


NoSQL databases often provide CAP support for transactions

https://www.ibm.com/cloud/learn/cap-theorem

Consistency

Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated to all the other nodes in the system before the write is deemed ‘successful.’

Availability

Availability means that that any client making a request for data gets a response, even if one or more nodes are down. Another way to state this—all working nodes in the distributed system return a valid response for any request, without exception.

Partition tolerance

A partition is a communications break within a distributed system—a lost or temporarily delayed connection between two nodes. Partition tolerance means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.

Data models 


Inmon - application driven data models


generally fit transaction processing environments well ( eg an ODS - operational data store )

typically important for real-time data capture, processing and event handling

feed analytics systems at some level of aggregation


Kimball - dimension driven data models 


dimensions and facts



BWC - low-level sales summary dm allowed easy user analytics without paying for Power BI servers !!

PTP - different priority streams for data warehouse - real time dashboards vs stand alone analytics 

PTP - custom user-defined data models, actions, events at runtime  ( leverage MySQL with Java threads )


Some Common Data Modeling Considerations


Some EBC DB design considerations

db design should consider "next" release features
even if not supported in app in some cases

surrogate ids
all records primary key is a single surrogate id
reduces update anomalies

use natural id key references vs surrogates for links
( eg orderNumber vs _id )
_id is good for speed but not for recovery, migrations
is it product.id or product.productNbr ??

add indexes where needed for lookups

define product wrapper to wrap any type of product
wrapped types all have std interface to expose
eg id, productNbr, qty, summary, description, status, availableDateRange, etc
the rest of a wrapped item is extensible, custom
accessible by full query of the product using virtual fields or plain gql
Java provides an interface which makes this easier in apps to build services

logical transactions vs physical
logical transaction can span
multiple records ( eg order w details )
multiple stores ( eg RDB, DLT )
multiple partitions ( local, remote stores )
must track completion of logical updates
ideally async updates vs sync for performance
define usage models
define core data, first stores for initial verify


performance testing scenarios
as needed, create test cases

consider lot data without lot tracking ( accounting )
big difference
1 is simple db design
lot tracking is a chain of custody solution

denormalize transactions
reduces high pecent of lookups in theory
( eg customer.id, customerName )
when to carry a description field in addition to key id on reference to a master record ?

dimension capture
when to capture dimension data on transactions for use in warehouse

minimize use of virtual fields for high performance, real time apps
use materialized views, not vfields
vfields ok for normal reporting etc with normal volumes on queries

aggregations
decide when to materialize aggregation
vs generate on the fly
( updates vs reads usage modeling )



Basic Database Design Concepts Course Overview

https://docs.google.com/document/d/1rIt7LDGuAlM-n7vIVLc4A3r3LMyBCAcZrzSkWIKyfaQ/edit


Potential Value Opportunities



Potential Challenges



Candidate Solutions



Map Database Use Cases to Databases

https://www.linkedin.com/posts/ray-rahul_database-databaseselection-sqlvsnosql-activity-7125065141687115776-zeV6?utm_source=share&utm_medium=member_desktop

👉Reasons for SQL:

Structured data
Strict schema
Relational data
Need for complex joins
Transactions
Clear patterns for scaling
More established: developers, community, code, tools, etc
Lookups by index are very fast

👉Reasons for NoSQL:

Semi-structured data
Dynamic or flexible schema
Non-relational data
No need for complex joins
Store many TB (or PB) of data
Very data-intensive workload
Very high throughput for IOPS

💁‍♂️Sample data well-suited for NoSQL:
Rapid ingest of clickstream and log data
Leaderboard or scoring data
Temporary data, such as a shopping cart
Frequently accessed ('hot') tables
Metadata/lookup table

If you're new to these terms or looking to dive deeper, give this a ❤️ and share!
Together, we can make tech jargon easier to grasp for everyone. 🌍




MongoDB






MySQL



Configuration


1 GB memory is a minimum for a dev environment only. 2 GB recommended .. see below

http://www.tocker.ca/2014/03/10/configuring-mysql-to-use-minimal-memory.html

https://www.mysqlcalculator.com/


Postgres


recommend 2 core minimum

low memory < 256M ok to start


CouchDB



requirements

https://stackoverflow.com/questions/47302848/apache-couchdb-system-minimum-requirements

The spec is one CPU, 512mb RAM and 20gb SSD disk. We run CouchDB in production on AWS using Docker. In general, we have found that we need at least a t2.medium for each node to keep up with the CPU and memory demands, i.e. each t2.medium instance has 2 CPU cores and 4GB


MongoDB

MongoDB Full Text Search with Lucene

works in aggregation pipeline –  FMR sequence

for clusters M30 or higher in 4.2 release now







Auto-scale storage


OLD - up not down though

NEW - auto-scale based on usage, not just disk





Step-by-step guide for Example



sample code block

sample code block
 



Recommended Next Steps