Database
Key Points
- SQL - DB2, Oracle, SQL Server, MySQL, Postgres
- NoSQL - MongoDB, CouchDB, Couchbase, Domino
- ACID vs CAP compliant transactions - Atomicity, Consistency, Isolation, Durability vs Consistency, Availability, Partition tolerance
- Isolation levels: Serializable, Repeatable Read, Read Committed, Read Uncommitted
- RDB normalization for an ODS = 3rd normal form - fields have no transitive dependencies ( eg carrying both whse_code and whse_name in an orderLineItem table violates 3NF )
- RDB normalization for an RDS = 1st normal form - atomic fields, each row identified by a primary key
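A minimal sketch of the 3NF point above, using sqlite3: whse_name depends on whse_code, not on the order line's key, so it belongs in its own warehouse table. All table and column names here are illustrative, not from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: warehouse attributes live once, keyed by whse_code.
cur.execute("CREATE TABLE warehouse (whse_code TEXT PRIMARY KEY, whse_name TEXT)")
cur.execute("""CREATE TABLE order_line_item (
    order_id INTEGER, line_no INTEGER,
    whse_code TEXT REFERENCES warehouse,
    qty INTEGER,
    PRIMARY KEY (order_id, line_no))""")

cur.execute("INSERT INTO warehouse VALUES ('W1', 'East Coast DC')")
cur.execute("INSERT INTO order_line_item VALUES (1001, 1, 'W1', 5)")

# Renaming the warehouse is now a single-row update,
# not a scan of every order line item.
cur.execute("UPDATE warehouse SET whse_name = 'Newark DC' WHERE whse_code = 'W1'")
row = cur.execute("""SELECT oli.order_id, w.whse_name
                     FROM order_line_item oli
                     JOIN warehouse w USING (whse_code)""").fetchone()
print(row)  # (1001, 'Newark DC')
```

If whse_name were carried on every order line, that rename would be an update anomaly waiting to happen.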
References
Key Concepts
RDB Normalization
https://en.wikipedia.org/wiki/Database_normalization
Normal forms timeline (year introduced): UNF (1970), 1NF (1970), 2NF (1971), 3NF (1971), BCNF (1974), 4NF (1977), 5NF (1979), DKNF (1981), EKNF (1982), 6NF (2003), ETNF (2012)
Each successive form adds a requirement on top of the earlier ones:
1NF - primary key (no duplicate tuples); no repeating groups; atomic columns (cells hold a single value)
2NF - no partial functional dependencies of non-prime attributes on candidate keys (every non-trivial functional dependency either does not begin with a proper subset of a candidate key or ends with a prime attribute)
3NF - no transitive functional dependencies of non-prime attributes on candidate keys (every non-trivial functional dependency begins with a superkey or ends with a prime attribute)
EKNF - every non-trivial functional dependency either begins with a superkey or ends with an elementary prime attribute
BCNF - every non-trivial functional dependency begins with a superkey
4NF - every non-trivial multivalued dependency begins with a superkey
ETNF - every join dependency has a superkey component
5NF - every join dependency has only superkey components
DKNF - every constraint is a consequence of domain constraints and key constraints
6NF - every join dependency is trivial
Steps to normalize data
https://www.bkent.net/Doc/simple5.htm
good examples
Normalization concepts, quiz
rdb-normalization-how2-quiz-rdb.dp.ua-3 The normalization of relations during the DB designing.pdf
NoSQL databases are typically designed around CAP theorem trade-offs, often favoring availability and partition tolerance over strict consistency, rather than full ACID transactions
https://www.ibm.com/cloud/learn/cap-theorem
Consistency
Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated to all the other nodes in the system before the write is deemed ‘successful.’
Availability
Availability means that any client making a request for data gets a response, even if one or more nodes are down. Another way to state this: all working nodes in the distributed system return a valid response for any request, without exception.
Partition tolerance
A partition is a communications break within a distributed system—a lost or temporarily delayed connection between two nodes. Partition tolerance means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.
Data models
Inmon - application driven data models
generally fit transaction processing environments well ( eg an ODS - operational data store )
typically important for real-time data capture, processing and event handling
feed analytics systems at some level of aggregation
Kimball - dimension driven data models
dimensions and facts
BWC - a low-level sales summary data mart allowed easy user analytics without paying for Power BI servers !!
PTP - different priority streams for data warehouse - real time dashboards vs stand alone analytics
PTP - custom user-defined data models, actions, events at runtime ( leverage MySQL with Java threads )
Some Common Data Modeling Considerations
Some EBC DB design considerations
db design should consider "next" release features
even if not supported in app in some cases
surrogate ids
every record's primary key is a single surrogate id
reduces update anomalies
use natural id key references vs surrogates for links
( eg orderNumber vs _id )
_id is good for speed but not for recovery, migrations
is it product.id or product.productNbr ??
add indexes where needed for lookups
define product wrapper to wrap any type of product
wrapped types all have std interface to expose
eg id, productNbr, qty, summary, description, status, availableDateRange, etc
the rest of a wrapped item is extensible, custom
accessible by full query of the product using virtual fields or plain gql
Java provides an interface which makes this easier in apps to build services
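The note above points at Java interfaces for this; the same wrapper idea is sketched below in Python. Every wrapped product type exposes the same core fields (id, productNbr, qty, status, etc.) through a standard interface, while the rest of the item stays free-form and extensible. All class and field names are illustrative, not a real API.

```python
from abc import ABC, abstractmethod

class ProductWrapper(ABC):
    """Standard interface every wrapped product type must expose."""

    @abstractmethod
    def core(self) -> dict:
        """Return the std fields: id, productNbr, qty, status, ..."""

class CatalogProduct(ProductWrapper):
    def __init__(self, _id, product_nbr, qty, status, **custom):
        self._id, self.product_nbr = _id, product_nbr
        self.qty, self.status = qty, status
        self.custom = custom          # extensible, type-specific attributes

    def core(self):
        return {"id": self._id, "productNbr": self.product_nbr,
                "qty": self.qty, "status": self.status}

p = CatalogProduct("507f1f", "PN-1001", 25, "active", color="red", lot="L9")
print(p.core()["productNbr"])   # services code against the std interface
print(p.custom["lot"])          # full query still reaches the custom fields
```

Services that only need the standard fields never see the custom ones, which keeps them stable as new product types are added.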
logical transactions vs physical
logical transaction can span
multiple records ( eg order w details )
multiple stores ( eg RDB, DLT )
multiple partitions ( local, remote stores )
must track completion of logical updates
ideally async updates vs sync for performance
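The completion-tracking point above can be sketched as follows: each physical write is recorded against a logical transaction id, and the logical transaction is complete only when every target store has acknowledged. Store names and the class shape are assumptions for illustration; acks may arrive asynchronously in any order.

```python
import uuid

class LogicalTxn:
    """Track one logical update that spans multiple physical stores."""

    def __init__(self, stores):
        self.txn_id = str(uuid.uuid4())
        self.pending = set(stores)    # stores still owing an ack
        self.done = set()

    def ack(self, store):
        """Called (possibly async) when a store commits its part."""
        self.pending.discard(store)
        self.done.add(store)

    @property
    def complete(self):
        return not self.pending

# One logical order update spanning an RDB and a DLT (illustrative names):
txn = LogicalTxn(["rdb", "dlt"])
txn.ack("rdb")
print(txn.complete)   # False: DLT write still outstanding
txn.ack("dlt")
print(txn.complete)   # True: logical update fully applied
```

Persisting this pending/done state (rather than holding it in memory) is what makes recovery possible after a crash mid-transaction.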
define usage models
define core data, first stores for initial verify
performance testing scenarios
as needed, create test cases
consider lot data without lot tracking ( accounting )
big difference
1 is simple db design
lot tracking is a chain of custody solution
denormalize transactions
reduces a high percent of lookups in theory
( eg customer.id, customerName )
when to carry a description field in addition to key id on reference to a master record ?
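A minimal sqlite3 sketch of that trade-off, with illustrative schema names: carrying customer_name on the transaction row makes the read path join-free, at the cost of extra writes (or accepted staleness) when the master record changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE sales_txn (
    txn_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    customer_name TEXT,   -- denormalized copy of customer.name
    amount REAL)""")
cur.execute("INSERT INTO customer VALUES (7, 'Acme Corp')")
cur.execute("INSERT INTO sales_txn VALUES (1, 7, 'Acme Corp', 99.50)")

# Read path needs no join:
print(cur.execute("SELECT customer_name FROM sales_txn"
                  " WHERE txn_id = 1").fetchone()[0])

# The cost: a rename must also touch every transaction row -- or be left
# stale as a point-in-time description, which is sometimes what you want.
cur.execute("UPDATE customer SET name = 'Acme Inc' WHERE id = 7")
cur.execute("UPDATE sales_txn SET customer_name = 'Acme Inc'"
            " WHERE customer_id = 7")
```

Whether the copy should track the master or freeze at transaction time is exactly the "description field" question above.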
dimension capture
when to capture dimension data on transactions for use in warehouse
minimize use of virtual fields for high performance, real time apps
use materialized views, not vfields
vfields ok for normal reporting etc with normal volumes on queries
aggregations
decide when to materialize aggregation
vs generate on the fly
( updates vs reads usage modeling )
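The materialize-vs-on-the-fly decision above, sketched with sqlite3 and illustrative table names: if reads dominate, pay on the write path to keep a summary row current; if writes dominate, just aggregate at read time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sale (day TEXT, amount REAL)")
cur.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total REAL)")

def record_sale(day, amount):
    """Write path pays a little extra to keep the aggregate current."""
    cur.execute("INSERT INTO sale VALUES (?, ?)", (day, amount))
    cur.execute("""INSERT INTO daily_sales VALUES (?, ?)
                   ON CONFLICT(day) DO UPDATE
                   SET total = total + excluded.total""", (day, amount))

record_sale("2024-01-05", 10.0)
record_sale("2024-01-05", 15.0)

# Materialized read: single-row lookup.
print(cur.execute("SELECT total FROM daily_sales"
                  " WHERE day='2024-01-05'").fetchone()[0])
# On-the-fly equivalent: scans the detail rows every time.
print(cur.execute("SELECT SUM(amount) FROM sale"
                  " WHERE day='2024-01-05'").fetchone()[0])
```

Both paths print 25.0 here; usage modeling (update rate vs read rate) decides which cost you'd rather pay. The upsert syntax needs SQLite 3.24+.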
Basic Database Design Concepts Course Overview
https://docs.google.com/document/d/1rIt7LDGuAlM-n7vIVLc4A3r3LMyBCAcZrzSkWIKyfaQ/edit
Potential Value Opportunities
Potential Challenges
Candidate Solutions
Map Database Use Cases to Databases
👉Reasons for SQL:
Structured data
Strict schema
Relational data
Need for complex joins
Transactions
Clear patterns for scaling
More established: developers, community, code, tools, etc
Lookups by index are very fast
👉Reasons for NoSQL:
Semi-structured data
Dynamic or flexible schema
Non-relational data
No need for complex joins
Store many TB (or PB) of data
Very data-intensive workload
Very high throughput for IOPS
💁♂️Sample data well-suited for NoSQL:
Rapid ingest of clickstream and log data
Leaderboard or scoring data
Temporary data, such as a shopping cart
Frequently accessed ('hot') tables
Metadata/lookup table
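The shopping-cart case above illustrates why such data fits a document store: the whole cart is one self-contained, schema-flexible document, read and written as a unit with no joins. The field names below are illustrative; the dict is roughly the shape a MongoDB insert would store.

```python
import json

cart = {
    "_id": "cart-8842",
    "userId": "u-17",
    "items": [
        {"productNbr": "PN-1001", "qty": 2, "price": 9.99},
        # items can carry extra fields without a schema change:
        {"productNbr": "PN-2040", "qty": 1, "price": 24.50, "giftWrap": True},
    ],
}

# One document, one write.
doc = json.dumps(cart)

# One read returns everything needed to render the cart.
loaded = json.loads(doc)
total = sum(i["qty"] * i["price"] for i in loaded["items"])
print(round(total, 2))   # 44.48
```

In a normalized RDB the same read is a join across cart, cart_item, and product tables; for short-lived, per-user data that overhead buys little.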
MySQL
Configuration
1 GB memory is a minimum for a dev environment only; 2 GB recommended ( see links below )
http://www.tocker.ca/2014/03/10/configuring-mysql-to-use-minimal-memory.html
https://www.mysqlcalculator.com/
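A hedged my.cnf fragment in the spirit of the minimal-memory article linked above. These values are illustrative starting points for a small dev instance, not tuned recommendations; verify names and defaults against your MySQL version.

```ini
[mysqld]
performance_schema      = off   # saves a large fixed allocation
innodb_buffer_pool_size = 64M   # biggest single consumer; default is 128M
innodb_log_buffer_size  = 1M
key_buffer_size         = 8M    # MyISAM only; shrink further if all-InnoDB
max_connections         = 25    # per-connection buffers add up fast
tmp_table_size          = 8M
max_heap_table_size     = 8M
```

The buffer pool and per-connection buffers dominate the footprint, which is what the mysqlcalculator link above adds up.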
Postgres
recommend 2 core minimum
low memory < 256M ok to start
CouchDB
requirements
https://stackoverflow.com/questions/47302848/apache-couchdb-system-minimum-requirements
The spec is one CPU, 512 MB RAM and a 20 GB SSD disk. We run CouchDB in production on AWS using Docker. In general, we have found that we need at least a t2.medium for each node to keep up with the CPU and memory demands, i.e. each t2.medium instance has 2 CPU cores and 4 GB of RAM.
MongoDB
MongoDB Full Text Search with Lucene
works in aggregation pipeline – FMR sequence
available for Atlas clusters M30 or higher as of the 4.2 release
Auto-scale storage
OLD - storage auto-scaled up but not back down
NEW - auto-scales based on usage, not just disk