Apache Data


Key Points


Resources


Resource | Notes
SQL concepts


https://blog.ansi.org/2018/10/sql-standard-iso-iec-9075-2016-ansi-x3-135/ | ANSI SQL




NoSQL concepts

https://www.w3resource.com/mongodb/nosql.php

https://drive.google.com/open?id=1UAwrKXa9PN3SG4V6IrVfLzdb3L_lgwnw

SQL vs NoSQL concepts




Big Data Concepts






https://www.datacamp.com/courses/big-data-fundamentals-via-pyspark | Big Data with PySpark (DataCamp)
https://www.udemy.com/gcp-data-engineer-and-cloud-architect/learn/lecture/7599964#overview | Udemy: Google Big Data Analytics Engineering course




Databases
https://docs.oracle.com/en/database/oracle/oracle-database/18/index.html | Oracle reference
https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/Oracle-Compliance-To-Core-SQL2011.html#GUID-D372D906-805B-49B8-824A-D4697B05B7F8 | Oracle support of SQL core
https://docs.oracle.com/cd/B28359_01/appdev.111/b28370/toc.htm | Oracle PL/SQL reference
https://docs.oracle.com/cd/B28359_01/appdev.111/b28370/subprograms.htm#CHDDCFHD | Oracle PL/SQL subprograms











DB2 reference

DB2 30 SQL tips




https://www.datacamp.com/courses/introduction-to-sql-server | SQL Server intro


https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017 | SSIS (SQL Server Integration Services)





MySQL







Postgres













Mongo









Couchdb







Couchbase




https://cassandra.apache.org/

https://en.wikipedia.org/wiki/Apache_Cassandra

Cassandra - a wide-column database similar to BigTable, DynamoDB, etc.






Data Services Concepts
https://www.jooq.org/ | jOOQ: generate Java from DB metadata (SCRUD)


https://clouderanow.com/agenda2019 | Cloudera Data Platform: data engineering, management, services



Integrated Data Analytics servers


BIRT Analytics | BIRT
Apache Data Services | Apache Zeppelin



Key Concepts



Hadoop Ecosystem





Apache Spark





Big Data Fundamentals via PySpark

https://www.datacamp.com/courses/big-data-fundamentals-via-pyspark

Fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general data processing engine and can run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You'll use PySpark, the Python package for Spark programming, and its powerful higher-level libraries such as Spark SQL and MLlib (for machine learning) to interact with the works of William Shakespeare, analyze FIFA 2018 football data, and perform clustering of genomic datasets. At the end of this course, you will have an in-depth understanding of PySpark and its application to general Big Data analysis.

Introduction to Big Data analysis with Spark

This chapter introduces the world of Big Data, along with the core concepts and the different frameworks for processing it. You will understand why Apache Spark is considered a leading framework for Big Data.

Programming in PySpark RDD’s

The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type of the engine. This chapter introduces RDDs and shows how they are created and evaluated using RDD transformations and actions.
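The transformation/action split is the key idea: transformations are lazy and only actions trigger computation. A toy plain-Python stand-in for PySpark's RDD API (not the real implementation, just the evaluation model):

```python
# Minimal sketch of Spark's lazy RDD model, runnable without a cluster.
# Real PySpark: sc.parallelize(data).map(f).filter(g).collect()

class MiniRDD:
    """Records transformations lazily; nothing runs until an action is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations, in order

    # --- transformations: return a new MiniRDD, do no work yet ---
    def map(self, f):
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    # --- actions: force evaluation of the whole pipeline ---
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())    # 5
```

Note that building `rdd` does no work at all; both `collect()` and `count()` re-run the pipeline, which mirrors why Spark offers caching for reused RDDs.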

PySpark SQL & DataFrames

In this chapter, you'll learn about Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames, and Spark SQL allows you to work with DataFrames in Python.
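Spark SQL's core idea is registering structured data as a table and querying it with SQL. A hedged sketch of that pattern, runnable with the stdlib sqlite3 module instead of a SparkSession (the player rows are made-up illustration data; the PySpark calls in the comments are the rough equivalent):

```python
# In PySpark this looks roughly like:
#   df = spark.createDataFrame(rows, ["name", "goals"])
#   df.createOrReplaceTempView("players")
#   spark.sql("SELECT name FROM players WHERE goals > 3").show()
# The same table-then-query pattern with stdlib sqlite3:
import sqlite3

rows = [("Kane", 6), ("Modric", 2), ("Griezmann", 4)]  # illustration data

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, goals INTEGER)")
conn.executemany("INSERT INTO players VALUES (?, ?)", rows)

top_scorers = conn.execute(
    "SELECT name FROM players WHERE goals > 3 ORDER BY goals DESC"
).fetchall()
print(top_scorers)  # [('Kane',), ('Griezmann',)]
```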

Machine Learning with PySpark MLlib

PySpark MLlib is the Apache Spark scalable machine learning library in Python consisting of common learning algorithms and utilities. Throughout this last chapter, you'll learn important Machine Learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.
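K-means is one of the algorithms the chapter covers; its core loop (assign each point to the nearest centroid, then move each centroid to its cluster mean) can be sketched in a few lines of plain Python. This is a single-machine, 1-D toy with made-up data, not MLlib's distributed implementation:

```python
# Tiny 1-D k-means: alternate assignment and centroid-update steps.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # assignment step: group each point with its nearest centroid
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centroids)

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]   # two obvious clusters
print(kmeans_1d(pts, [0.0, 5.0]))        # roughly [1.0, 9.53]
```

In MLlib the equivalent entry point is `KMeans.train`, which runs the same iteration in parallel across the cluster's partitions.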



Column Databases - Cassandra, BigTable, DynamoDB



Problem | Technique | Advantage
Dataset partitioning | Consistent hashing | Incremental, possibly linear scalability in proportion to the number of collaborating nodes.
Highly available writes | Vector clocks or dotted-version-vector sets, reconciled during reads | Version size is decoupled from update rates.
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and a durability guarantee when some of the replicas are not available.
Recovering from permanent failures | Anti-entropy using Merkle trees | Identifies differences between replica owners and synchronizes divergent replicas proactively.
Membership and failure detection | Gossip-based membership protocol and failure detection | Avoids a centralized registry for membership and node-liveness information, preserving symmetry.
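Consistent hashing, the partitioning technique in the first row, places nodes and keys on the same hash ring; a key belongs to the first node clockwise from its hash, so adding or removing a node only remaps the keys adjacent to it. A minimal single-machine sketch (virtual nodes and replication omitted; node names are illustrative):

```python
# Minimal consistent-hash ring: nodes and keys share one hash space.
import bisect
import hashlib

def _h(s: str) -> int:
    """Deterministic position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # sorted (hash, node) pairs form the ring
        self._ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        hashes = [h for h, _ in self._ring]
        # first node at or after the key's hash, wrapping past the end
        i = bisect.bisect(hashes, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # one of the three nodes, stable across calls
```

The payoff is the incremental scalability the table mentions: dropping "node-c" reassigns only the keys "node-c" owned, while every other key keeps its owner.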

Cassandra - wide-column DB

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Proven

Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.

Fault tolerant

Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

Performant

Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.

Decentralized

There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.

Scalable

Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).

Durable

Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.

You're in control

Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.
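The arithmetic behind these per-update choices is tunable consistency: with replication factor N, a write acknowledged by W replicas and a read consulting R replicas must overlap in at least one replica whenever W + R > N, so the read sees the latest acknowledged write. A one-function sketch of that rule:

```python
# Tunable consistency: the write set (W replicas) and read set (R replicas)
# out of N total replicas are guaranteed to intersect iff W + R > N.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return w + r > n

# QUORUM writes + QUORUM reads at RF=3 (W=2, R=2) always overlap:
print(is_strongly_consistent(3, 2, 2))  # True
# ONE write + ONE read at RF=3 may miss the latest write:
print(is_strongly_consistent(3, 1, 1))  # False
```

Lower W and R trade that guarantee for lower latency and higher availability, which is when mechanisms like hinted handoff and read repair converge the replicas afterwards.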


Elastic

Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Professionally Supported

Cassandra support contracts and services are available from third parties.




Opportunities



Challenges



Solutions


Data Integration Services


SSIS

https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017



AWS




GCP - Google Data Platform

see ...



Informatica



Open-Source - see Big Data section on open source solutions


Groovy etc




Details



Next Steps



Link



Instructions - Move sites between Confluence instances

  1. How to export a space in Confluence Cloud to an XML zip file
  2. How to import a space in Confluence Cloud from an XML zip file