
Key Points

  1. Data pipelines connect and transform data from sources to targets, in batches or as event streams
  2. Spark provides an interactive set of data stream and batch services comparable to batch-oriented Hadoop services
  3. Spark supports many data sources, including messaging services (Kafka, MQ, etc.)


References

Reference description with linked URLs - Notes




https://projects.apache.org/projects.html?category - Apache Big Data projects
https://www.educba.com/my-courses/dashboard/ - Data engineering education site: educba
https://github.com/bcafferky/shared - Bryan Cafferky's Python Spark repos **
Bryan Cafferky on YouTube - Bryan's Python, Spark videos **
https://www.tutorialspoint.com/apache_spark/index.htm - TutorialsPoint Spark Tutorial (older)


Spark
https://spark.apache.org/ - Spark Home
https://spark.apache.org/docs/latest/quick-start.html - Spark quick start
https://spark.apache.org/docs/0.9.1/java-programming-guide.html - Spark Java programming
https://spark.apache.org/docs/0.9.1/scala-programming-guide.html - Spark Scala programming
https://drive.google.com/open?id=1_bfxFX6kQf2gTEPyoPwOgmauWj2b-hv5 - Spark overview in 7 steps (Databricks)
dzone-spark-refcard_0.pdf
spark-A-Gentle-Introduction-to-Apache-Spark.pdf
spark-tutorials-quick-guide-v2.pdf
spark-ebook-intro-2018.pdf
spark-Mini eBook - Aggregating Data with Apache Spark.pdf
spark-Data-Scientists-Guide-to-Apache-Spark.pdf
LearningSpark2.0.pdf - Learning Spark ebook
Spark GraphX.pdf
spark-The-Data-Engineers-Guide-to-Apache-Spark.pdf


https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9 - Manning, Spark in Action


Udemy courses on Spark
https://www.youtube.com/watch?v=xkD1l9_nj5w&feature=youtu.be - Bryan Cafferky on Apache Spark, Databricks
https://www.udemy.com/course/apache-spark-20-java-do-big-data-analytics-ml/learn/lecture/5715610#overview - Spark 2.0, Java ML course




Databricks and Spark




https://www.linkedin.com/posts/bryancafferky_koalas-databricks-documentation-activity-6683752516523433984-fjEH - Bryan Cafferky on combining Koalas, Airflow with Spark to get pandas for analysis









Key Concepts



Spark

https://spark.apache.org/

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Learning Apache Spark is easy whether you come from a Java, Scala, Python, R, or SQL background:


df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
Spark's Python DataFrame API
Read JSON files with automatic schema inference
Spark can access many data sources
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
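
As a small illustration of combining the DataFrame API and Spark SQL in the same application (a minimal sketch; the logs.json file and its age and name fields are assumed, following the snippet above):

from pyspark.sql import SparkSession

# Entry point for DataFrames and SQL
spark = SparkSession.builder.appName("DataFrameAndSQL").getOrCreate()

# Read JSON with automatic schema inference (assumed sample file)
df = spark.read.json("logs.json")

# DataFrame API: filter and project
df.where("age > 21").select("name.first").show()

# Spark SQL on the same data: register a temporary view and query it
df.createOrReplaceTempView("logs")
spark.sql("SELECT name.first FROM logs WHERE age > 21").show()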

Spark Quick Start

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.

To follow along with this guide, first, download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
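
For example, after unpacking the release you can start the PySpark shell with ./bin/pyspark and work with the README.md file that ships with Spark (a minimal sketch following the quick start; the path assumes the shell is started from the Spark directory):

# In the PySpark shell, 'spark' is already defined
textFile = spark.read.text("README.md")   # DataFrame with one row per line

textFile.count()    # number of lines in the file
textFile.first()    # first line

# Transformation: keep only lines mentioning Spark, then count them (action)
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
linesWithSpark.count()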


Spark SQL Guide

Spark in Action book

https://drive.google.com/open?id=1ZqKpzFcCgy3r4lyeuN2t2gINpegHIte9




Combine Spark with Airflow, Koalas to get pandas

https://www.linkedin.com/posts/bryancafferky_koalas-databricks-documentation-activity-6683752516523433984-fjEH
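
A minimal sketch of the Koalas idea, pandas-style syntax running on Spark DataFrames (assumes the databricks-koalas package is installed; the sales.csv file and its columns are hypothetical):

import databricks.koalas as ks

# Read a CSV into a Koalas DataFrame (distributed, but with a pandas-like API)
kdf = ks.read_csv("sales.csv")

# Familiar pandas-style operations, executed on Spark under the hood
print(kdf.groupby("region")["amount"].sum().head())

# Convert to a native Spark DataFrame when needed
sdf = kdf.to_spark()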


Bryan Cafferky on Spark







Apache Arrow provides an optimized columnar in-memory format that lets Spark exchange data efficiently with clients in any supported language (for example, Python/pandas).
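
A minimal sketch of enabling Arrow for faster Spark-to-pandas conversion in PySpark (assumes Spark 2.3+ with pyarrow installed; in Spark 3 the config key is spark.sql.execution.arrow.pyspark.enabled):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ArrowExample").getOrCreate()

# Enable Arrow-based columnar transfer between the JVM and Python
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(0, 1000000)   # simple DataFrame with one million rows

# toPandas() now uses Arrow for the JVM -> Python transfer
pdf = df.toPandas()
print(pdf.head())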


RDD or Resilient Distributed Dataset

An RDD, or Resilient Distributed Dataset, is the fundamental data structure of Apache Spark. RDDs are immutable (read-only) collections of objects of varying types that are computed across the different nodes of a given cluster.

https://databricks.com/glossary/what-is-rdd

RDD was the primary user-facing API in Spark from its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions (see the sketch after the list below).

5 Reasons on When to use RDDs

  1. You want low-level transformations and actions, and control over your dataset;
  2. Your data is unstructured, such as media streams or streams of text;
  3. You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  4. You don’t care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
  5. You can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
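
A minimal PySpark sketch of the low-level RDD API, using transformations and an action on unstructured text (the words.txt file name is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext   # low-level entry point for RDDs

# Load unstructured text as an RDD of lines (hypothetical file)
lines = sc.textFile("words.txt")

# Transformations: split lines into words, pair each word with 1, sum per word
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Action: bring the ten most frequent words back to the driver
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)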


Create data frame from a table with Spark SQL
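
A minimal sketch, assuming a table or temporary view named people is already registered in the Spark catalog:

# Query an existing table/view into a DataFrame
people_df = spark.sql("SELECT name, age FROM people WHERE age > 21")
people_df.show()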


Plot it (for example, with the display() function in a Databricks notebook, or by converting to pandas and plotting locally)


Each language has a separate interactive shell (for example, pyspark for Python, spark-shell for Scala, sparkR for R)


Spark SQL




Potential Value Opportunities



Potential Challenges



Candidate Solutions



Step-by-step guide for Example



sample code block




Recommended Next Steps


