Spark

If you have any questions about the material below, contact us by emailing the Kung Fu Panda. We will get back to you soon.

Basics

Spark Modules

Spark Context

Resilient Distributed Dataset (RDD)

RDD creation

RDDs can be created in the following ways.

Operations in RDD

Flow in a Spark program

Fault tolerance in Spark

Lazy evaluation of RDDs

Transformations in RDD

The following are examples of transformations.

Actions in RDD

The following are examples of actions.

RDD persistence

Pair RDDs

Global Variables in Spark

We can pass variables from the driver program into functions such as filter() or map(), which are executed on different nodes. Each node receives its own copy of the variable, and updates made on the nodes are not propagated back to the driver. Accumulators and broadcast variables are the exception: they are shared variables that Spark manages across the cluster. Tasks can add to an accumulator and the driver can read the combined result, while a broadcast variable gives every node a read-only copy of a value.
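The copy behaviour described above can be sketched without a cluster. The block below is a Spark-free simulation in plain Python, not Spark code: `run_task`, `counter`, and `partitions` are hypothetical names, and `pickle` stands in for the serialization Spark performs when it ships a closure to a node. It shows that a plain driver variable is unchanged after the tasks run, and that an accumulator-style result only appears once the per-task partial values are merged back on the driver.

```python
import pickle


def run_task(serialized_state, partition):
    """Simulate one task on a worker node (hypothetical helper)."""
    # Each task deserializes its own private copy of the driver's variable.
    state = pickle.loads(serialized_state)
    for x in partition:
        if x % 2 == 0:
            state["evens"] += 1  # mutates the local copy only
    # Accumulator-style: hand the partial update back to the driver.
    return state["evens"]


# Driver side: a plain variable captured by the function sent to the nodes.
counter = {"evens": 0}
partitions = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]

# The variable is serialized and shipped; the tasks never see the original.
shipped = pickle.dumps(counter)
partials = [run_task(shipped, p) for p in partitions]

# The driver's own variable was never updated by the tasks.
print(counter["evens"])  # 0

# Merging the partial updates on the driver gives the cluster-wide total.
accumulated = sum(partials)
print(accumulated)  # 5
```

In PySpark itself, the corresponding shared variables are created with `sc.accumulator(0)` and `sc.broadcast(value)` on a `SparkContext` named `sc`. The simulation only mimics the merge step that an accumulator performs.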

Accumulators

Broadcast variables

Components in Spark

Other operations

Memory Management

By default, here is the memory consumption distribution.

Twitter Sentiment Analysis

Spark can consume data from Twitter as a stream and perform basic sentiment analysis on it.

Examples/References