Table of Contents

Reference description with linked URLs	Notes


https://projects.apache.org/projects.html?category	Apache Big Data projects
https://www.educba.com/my-courses/dashboard/	Data engineering education site: educba
https://github.com/bcafferky/shared	Bryan Cafferky's Python Spark repos **
Bryan Cafferky on YouTube	Bryan's Python and Spark videos **
https://www.tutorialspoint.com/apache_spark/index.htm	TutorialsPoint Spark Tutorial - older

Spark
https://spark.apache.org/	Spark Home
https://spark.apache.org/docs/latest/quick-start.html	Spark quick start
https://spark.apache.org/docs/0.9.1/java-programming-guide.html	Spark Java programming
https://spark.apache.org/docs/0.9.1/scala-programming-guide.html	Spark Scala programming
https://drive.google.com/open?id=1_bfxFX6kQf2gTEPyoPwOgmauWj2b-hv5	Spark overview in 7 steps - databricks
dzone-spark-refcard_0.pdf
spark-A-Gentle-Introduction-to-Apache-Spark.pdf
spark-tutorials-quick-guide-v2.pdf
spark-ebook-intro-2018.pdf
spark-Mini eBook - Aggregating Data with Apache Spark.pdf
spark-Data-Scientists-Guide-to-Apache-Spark.pdf
LearningSpark2.0.pdf	Learning Spark ebook
Spark GraphX.pdf
spark-The-Data-Engineers-Guide-to-Apache-Spark.pdf

https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9	Manning - Spark in Action

Udemy courses on Spark
https://www.youtube.com/watch?v=xkD1l9_nj5w&feature=youtu.be	Bryan Cafferky on Apache Spark, Databricks
https://www.udemy.com/course/apache-spark-20-java-do-big-data-analytics-ml/learn/lecture/5715610#overview	Spark 2.0, Java ML course


Databricks and Spark


https://www.linkedin.com/posts/bryancafferky_koalas-databricks-documentation-activity-6683752516523433984-fjEH	Bryan Cafferky on combining Koalas, Airflow with Spark to get pandas for analysis
databricks-data-services-ai-HighPerformanceAI.pdf	Databricks: a high-performance data and AI organization

Key Concepts

Spark

https://spark.apache.org/

...

Apache Arrow provides a language-independent, in-memory columnar data format, which lets Spark exchange data efficiently with clients written in any language

RDD or Resilient Distributed Dataset

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark. It is an immutable (read-only) collection of objects of varying types, which is computed across the different nodes of a given cluster.

https://databricks.com/glossary/what-is-rdd

RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
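
The description above (immutable, partitioned, transformations versus actions) can be sketched as a toy model in plain Python. This is not the Spark API: the class `ToyRDD` and its internals are hypothetical and run in a single process. It only illustrates that a transformation returns a new collection and records work lazily, while an action triggers the computation over each partition.

```python
from functools import reduce
from typing import Callable, Iterable, List, Tuple


class ToyRDD:
    """Hypothetical stand-in for an RDD: read-only, partitioned, lazily evaluated."""

    def __init__(self, partitions: Tuple[tuple, ...], ops: Tuple = ()):
        self._partitions = partitions  # tuples, so the source data is read-only
        self._ops = ops                # recorded transformations, not yet executed

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int) -> "ToyRDD":
        items = list(data)
        # Contiguous chunks, at most num_partitions of them, preserving order.
        size = max(1, -(-len(items) // num_partitions))
        parts = tuple(tuple(items[i:i + size]) for i in range(0, len(items), size))
        return cls(parts)

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, part: tuple) -> List:
        # Apply the recorded transformations to one partition.
        items = iter(part)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation and return a plain Python value.
    def collect(self) -> List:
        return [x for part in self._partitions for x in self._compute(part)]

    def reduce(self, fn: Callable):
        return reduce(fn, self.collect())


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=4)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

result = evens_squared.collect()                   # [4, 16, 36, 64, 100]
total = evens_squared.reduce(lambda a, b: a + b)   # 220
```

In PySpark the corresponding calls are `SparkContext.parallelize`, `RDD.filter`, `RDD.map`, `RDD.collect` and `RDD.reduce`; there the partitions are processed in parallel on the cluster rather than in a local loop.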

5 Reasons on When to use RDDs

...
