Spark RDD Dataframes SQL
To start working with Spark SQL in the Spark shell, create an SQLContext and import the required modules:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml._
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // must come after sqlContext is defined
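The df used below is not defined above; given the spark-xml import, it was presumably loaded from an XML file. A minimal sketch, assuming a hypothetical books.xml on HDFS with <book> row elements (the file name and rowTag are assumptions, not from the source):
// Hypothetical XML source; adjust the path and rowTag to the actual file.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("hdfs://quickstart.cloudera:8020/user/cloudera/books.xml")
df.printSchema()   // should show _id, author, description, ... if the XML matches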
df.describe().show()   // describe() returns a DataFrame of summary statistics
df.select("_id","author","description").show()
HDFS base path (Cloudera QuickStart VM): hdfs://quickstart.cloudera:8020/user/cloudera/
REDUCE:
val rdd1 = sc.parallelize(List(20, 32, 45, 62, 8, 5))
val sum = rdd1.reduce(_ + _)   // aggregates all elements pairwise: 172
ReduceByKey:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data_RDD = sc.parallelize(words)
val mapped_RDD = data_RDD.map(w => (w, 1))        // pair each word with a count of 1
val reduced_RDD = mapped_RDD.reduceByKey(_ + _)   // sum the counts per word
reduced_RDD.take(10)
FILTER:
val data_RDD = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/temperature_2014.csv")
data_RDD.take(100)
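No predicate is shown for the filter step; a minimal sketch that drops the header row, assuming temperature_2014.csv starts with one (the header assumption is mine, not from the source):
// Drop lines equal to the assumed header line; keep everything else.
val header = data_RDD.first()
val filtered_RDD = data_RDD.filter(line => line != header)
filtered_RDD.take(10)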
RDD to DATAFRAME:
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
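The query below reads from RDD_table, which has to be registered first. A minimal sketch of the RDD-to-DataFrame step, assuming a CSV of auction bids like the ebay example in the MapR reference below (the file name and column positions are assumptions):
import sqlContext.implicits._
// Assumed row layout: auctionid at index 0, bid at 1, item at 7 (ebay-style CSV).
case class Auction(auctionid: String, bid: Float, item: String)
val auction_RDD = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/ebay.csv")
val RDD_df = auction_RDD
  .map(_.split(","))
  .map(a => Auction(a(0), a(1).toFloat, a(7)))
  .toDF()
RDD_df.registerTempTable("RDD_table")   // makes the DataFrame queryable by name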
// How many bids per auction?
val results = sqlContext.sql("SELECT auctionid, item, count(bid) FROM RDD_table GROUP BY auctionid, item")
results.show()
References:
https://mapr.com/ebooks/spark/05-processing-tabular-data-with-spark-sql.html
https://www.supergloo.com/fieldnotes/spark-sql-csv-examples-python/
http://sparktutorials.net/Opening+CSV+Files+in+Apache+Spark+-+The+Spark+Data+Sources+API+and+Spark-CSV