
Reference (description with linked URL) | Notes




https://projects.apache.org/projects.html?category | Apache Big Data projects
https://www.educba.com/my-courses/dashboard/ | Data engineering education site: EDUCBA
https://github.com/bcafferky/shared | Bryan Cafferky's Python Spark repos **
Bryan Cafferky on YouTube | Bryan's Python and Spark videos **
https://www.tutorialspoint.com/apache_spark/index.htm | TutorialsPoint Spark Tutorial - older


Spark
https://spark.apache.org/ | Spark Home
https://spark.apache.org/docs/latest/quick-start.html | Spark quick start
https://spark.apache.org/docs/0.9.1/java-programming-guide.html | Spark Java programming
https://spark.apache.org/docs/0.9.1/scala-programming-guide.html | Spark Scala programming
https://drive.google.com/open?id=1_bfxFX6kQf2gTEPyoPwOgmauWj2b-hv5 | Spark overview in 7 steps - Databricks
dzone-spark-refcard_0.pdf
spark-A-Gentle-Introduction-to-Apache-Spark.pdf
spark-tutorials-quick-guide-v2.pdf
spark-ebook-intro-2018.pdf
spark-Mini eBook - Aggregating Data with Apache Spark.pdf
spark-Data-Scientists-Guide-to-Apache-Spark.pdf
LearningSpark2.0.pdf | Learning Spark ebook
Spark GraphX.pdf
spark-The-Data-Engineers-Guide-to-Apache-Spark.pdf


https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9 | Manning - Spark in Action


Udemy courses on Spark
https://www.youtube.com/watch?v=xkD1l9_nj5w&feature=youtu.be | Bryan Cafferky on Apache Spark, Databricks
https://www.udemy.com/course/apache-spark-20-java-do-big-data-analytics-ml/learn/lecture/5715610#overview | Spark 2.0, Java ML course




Databricks and Spark




https://www.linkedin.com/posts/bryancafferky_koalas-databricks-documentation-activity-6683752516523433984-fjEH | Bryan Cafferky on combining Koalas and Airflow with Spark to get pandas for analysis
databricks-data-services-ai-HighPerformanceAI.pdf | Databricks: a high-performance data and AI organization







Key Concepts



Spark

https://spark.apache.org/

...

Apache Arrow defines a language-independent, columnar in-memory format. Spark uses it to exchange data efficiently with language clients such as Python.


RDD or Resilient Distributed Dataset

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark. An RDD is an immutable (read-only) collection of objects of varying types, computed on the different nodes of a cluster.

https://databricks.com/glossary/what-is-rdd

The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
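The ideas in this description (immutability, partitioning, transformations versus actions) can be sketched with a toy model in plain Python. This is not the Spark API: `MiniRDD` and everything in it are hypothetical names invented for illustration, and the sketch runs in a single process, with no cluster and no fault tolerance. It only shows that transformations return a new dataset and defer work, while actions trigger the computation over each partition.

```python
from functools import reduce


class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, partitions, ops=()):
        # Partitions are frozen into tuples so the dataset is read-only.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Pending transformations: recorded here, executed only by an action.
        self._ops = tuple(ops)

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        # Split the data into contiguous chunks (at most num_partitions of them).
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls(data[i:i + size] for i in range(0, len(data), size))

    # Transformations: return a NEW MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, partition):
        # Apply the recorded transformations to one partition, in order.
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: run the pending transformations and return a result.
    def collect(self):
        return [x for p in self._partitions for x in self._compute(p)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = MiniRDD.parallelize(range(1, 7), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())                   # [4, 16, 36]
print(numbers.collect())                        # [1, 2, 3, 4, 5, 6] (source unchanged)
print(even_squares.reduce(lambda a, b: a + b))  # 56
```

In PySpark the corresponding calls are `SparkContext.parallelize`, `RDD.map`, `RDD.filter`, `RDD.collect` and `RDD.reduce`. There, each partition is processed on the cluster's executors rather than in a local loop.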

5 Reasons on When to use RDDs

...