Key Points

Resources

Resource	Notes
SQL concepts

https://blog.ansi.org/2018/10/sql-standard-iso-iec-9075-2016-ansi-x3-135/	ANSI SQL


NoSQL concepts
https://www.w3resource.com/mongodb/nosql.php https://drive.google.com/open?id=1UAwrKXa9PN3SG4V6IrVfLzdb3L_lgwnw	SQL vs NoSQL concepts


Big Data Concepts



https://www.datacamp.com/courses/big-data-fundamentals-via-pyspark?fbclid=IwAR1wu8n6y_lAQkQozuFP4P4NrLtirlUsLt-RhEpKYtHUHhbGg5RPndTeEaM	Big Data with PySpark ( DataCamp )
https://www.udemy.com/gcp-data-engineer-and-cloud-architect/learn/lecture/7599964#overview	Udemy - Google Big Data Analytics Engineering course


Databases
https://docs.oracle.com/en/database/oracle/oracle-database/18/index.html	Oracle reference
https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/Oracle-Compliance-To-Core-SQL2011.html#GUID-D372D906-805B-49B8-824A-D4697B05B7F8	Oracle support of SQL core
https://docs.oracle.com/cd/B28359_01/appdev.111/b28370/toc.htm	Oracle PL/SQL reference
https://docs.oracle.com/cd/B28359_01/appdev.111/b28370/subprograms.htm#CHDDCFHD	Oracle PL/SQL sub programs





	DB2 reference
	DB2 30 SQL tips


https://www.datacamp.com/courses/introduction-to-sql-server?fbclid=IwAR12Q0J8iEo5Y7IHc2wmoxIlyO2VPeuoVplloXZpvXXNCAxD67XsJxm59QQ	SQL Server intro

https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017	SSIS - SQL Server Integration Services


	MySQL



	Postgres






	Mongo




	Couchdb



	Couchbase


https://cassandra.apache.org/ https://en.wikipedia.org/wiki/Apache_Cassandra	Cassandra - wide column db similar to BigTable, DynamoDB etc



Data Services Concepts
https://www.jooq.org/	JOOQ - generate Java from DB metadata ( SCRUD )

https://clouderanow.com/agenda2019	Cloudera Data Platform - data eng, mgt, services

	Integrated Data Analytics servers

BIRT Analytics	BIRT
Apache Data Services	Apache Zeppelin

Key Concepts

Hadoop Ecosystem

Apache Spark

Big Data Fundamentals via PySpark

https://www.datacamp.com/courses/big-data-fundamentals-via-pyspark?fbclid=IwAR1wu8n6y_lAQkQozuFP4P4NrLtirlUsLt-RhEpKYtHUHhbGg5RPndTeEaM

This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You'll use PySpark, a Python package for Spark programming, and its higher-level libraries such as Spark SQL and MLlib (for machine learning) to explore the works of William Shakespeare, analyze FIFA 2018 football data, and cluster genomic datasets. By the end of the course, you will have an in-depth understanding of PySpark and its application to general Big Data analysis.

Introduction to Big Data analysis with Spark

This chapter introduces the exciting world of Big Data, as well as the various concepts and frameworks for processing it. You will understand why Apache Spark is considered the best framework for Big Data.

Programming in PySpark RDDs

The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type of the engine. This chapter introduces RDDs and shows how they are created and then processed using RDD transformations and actions.
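The split between transformations and actions can be illustrated without a cluster. The sketch below is plain Python, not the PySpark API: `LazyDataset` and `square` are hypothetical names invented here, and the class only mimics one idea, namely that transformations such as map and filter are recorded lazily while actions such as collect and reduce trigger the actual work.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger evaluation and return a plain Python value.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return reduce(fn, self._run())


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


numbers = LazyDataset([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has executed yet
result = even_squares.collect()    # action: the pipeline runs now
```

In real PySpark the same shape appears as a chain of transformations on an RDD followed by an action, with the work distributed across partitions rather than run in a single process.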

PySpark SQL & DataFrames

In this chapter, you'll learn about Spark SQL, which is a Spark module for structured data processing. It provides a programming abstraction called DataFrames, and Spark SQL allows you to use DataFrames in Python.

Machine Learning with PySpark MLlib

PySpark MLlib is the Apache Spark scalable machine learning library in Python consisting of common learning algorithms and utilities. Throughout this last chapter, you'll learn important Machine Learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.
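To make the k-means step concrete, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not the MLlib API and not how MLlib distributes the computation; the function name `kmeans`, the sample points, and the starting centroids are all assumptions chosen for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

With these well-separated points the two clusters settle after a single update, with the first three points in one cluster and the last three in the other.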

Column Databases - Cassandra, BigTable, DynamoDB

Problem	Technique	Advantage
Dataset partitioning	Consistent Hashing	Incremental, possibly linear scalability in proportion to the number of collaborating nodes.
Highly available writes	Vector Clock or Dotted-Version-Vector Sets, reconciliation during reads	Version size is decoupled from update rates.
Handling temporary failures	Sloppy Quorum and Hinted Handoff	Provides high availability and durability guarantee when some of the replicas are not available.
Recovering from permanent failures	Anti-entropy using Merkle tree	Can be used to identify differences between replica owners and synchronize divergent replicas pro-actively.
Membership and failure detection	Gossip-based membership protocol and failure detection	Avoids having a centralized registry for storing membership and node liveness information, preserving symmetry.
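
The first row of the table, consistent hashing, can be sketched in a few lines. This is an illustrative toy, not the partitioner of Cassandra or Dynamo: the class name `HashRing`, the use of MD5 for ring positions, the 64 virtual nodes per server, and the node and key names are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only to spread positions evenly, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns several positions (virtual nodes) on the ring.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the technique shows up in `moved`: when a node joins, only the keys that now belong to the new node change owner, and the rest stay where they were. That is what makes scaling incremental.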

Cassandra - wide column DB

https://cassandra.apache.org/

https://en.wikipedia.org/wiki/Apache_Cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Proven

Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.

Fault tolerant

Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

Performant

Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.

Decentralized

There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.

Scalable

Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).

Durable

Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.

You're in control

Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.
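
One common way to reason about this per-request trade-off in Dynamo-style stores is the quorum overlap rule: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to share at least one replica when R + W > N. The sketch below is not Cassandra code; the function names are hypothetical, and the brute-force check is there only to confirm the arithmetic on small clusters.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """Rule of thumb: a read sees the latest acknowledged write if R + W > N."""
    return r + w > n


def always_intersect(n, w, r):
    """Brute force: does every possible write set share a replica with every read set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With 3 replicas, writing to 2 and reading from 2 always overlaps,
# while writing to 1 and reading from 1 may miss the latest value.
strong = quorums_overlap(3, 2, 2)
weak = quorums_overlap(3, 1, 1)
```

Lower values of W and R favor latency and availability, and features such as hinted handoff and read repair then bring lagging replicas back in line.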

Elastic

Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Professionally Supported

Cassandra support contracts and services are available from third parties.

Opportunities

Challenges

Solutions

Data Integration Services

SSIS

https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017

Apache Data