m Apache Data
Key Points
Resources
Key Concepts
Hadoop Ecosystem
Apache Spark
Big Data Fundamentals via PySpark
Fundamentals of Big Data via PySpark. Spark is “lightning fast cluster computing" framework for Big Data. It provides a general data processing platform engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. You’ll use PySpark, a Python package for spark programming and its powerful, higher-level libraries such as SparkSQL, MLlib (for machine learning), etc., to interact with works of William Shakespeare, analyze Fifa football 2018 data and perform clustering of genomic datasets. At the end of this course, you will gain an in-depth understanding of PySpark and it’s application to general Big Data analysis.
Introduction to Big Data analysis with Spark
This chapter introduces the exciting world of Big Data, as well as the various concepts and different frameworks for processing Big Data. You will understand why Apache Spark is considered the best framework for BigData.
Programming in PySpark RDD’s
The main absFraction Spark provides is a resilient distributed dataset (RDD), which is the fundamental and backbone data type of this engine. This chapter introduces RDDs and shows how RDDs can be created and executed using RDD Transformations and Actions.
PySpark SQL & DataFrames
In this chapter, you'll learn about Spark SQL which is a Spark module for stured data processing. It provides a programming abstraction called DataFrames aark SQL allows you to use DataFrames in Python.
Machine Learning with PySpark MLlib
PySpark MLlib is the Apache Spark scalable machine learning library in Python consisting of common learning algorithms and utilities. Throughout this last chapter, you'll learn important Machine Learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.
Column Databases - Casandra, BigTable, DynamoDB
Problem | Technique | Advantage |
---|---|---|
Dataset partitioning | Consistent Hashing | Incremental, possibly linear scalability in proportion to the number of collaborating nodes. |
Highly available writes | Vector Clock or Dotted-Version-Vector Sets, reconciliation during reads | Version size is decoupled from update rates. |
Handling temporary failures | Sloppy Quorum and Hinted Handoff | Provides high availability and durability guarantee when some of the replicas are not available. |
Recovering from permanent failures | Anti-entropy using Merkle tree | Can be used to identify differences between replica owners and synchronize divergent replicas pro-actively. |
Membership and failure detection | Gossip-based membership protocol and failure detection | Avoids having a centralized registry for storing membership and node liveness information, preserving symmetry. |
Casandra - wide column DB
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Proven
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.
Fault tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
Performant
Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.
Decentralized
There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.
Scalable
Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).
Durable
Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.
You're in control
Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.
Elastic
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
Professionally Supported
Cassandra support contracts and services are available from third parties.
Opportunities
Challenges
Solutions
Data Integration Services
SSIS
AWS
GCP - Google Data Platform
see ...
Informatica
Open-Source - see Big Data section on open source solutions
Groovy etc
Details
Next Steps
Related articles
Link
Instructions - Move sites between Confluence instances
- How to Export a Space in Confluence Cloud to xml zip file
- How to Import a Space in Confluence Cloud from xml zip file