Scala Spark


We'll see how to map some of the functional abstractions that you've learned in previous Scala courses to computations on multiple machines over massive data sets.

That is, we will see first-hand how the functional abstractions that we've covered in the previous Scala
courses make it easier and more user-friendly to scale computations over large clusters. Or easier, at
least, than scaling computations on imperative frameworks, that is, imperative systems for distributed computation.

We're always going to focus on analyzing large data sets. That is, you'll be challenged to think about
common data science tasks like K-means functionally, such that they can be adapted to and
implemented in the context of Spark, a functionally oriented framework for large-scale data processing
that's implemented in Scala.

You might be asking: well, if we're going to be focusing on a lightweight data science flavor of
processing tasks, then why are we bothering with Scala, and why are we bothering with Spark? After all,
if you learn data science in the classroom, you usually learn it in a statistics professor's favorite
languages or frameworks, like R, Python, Octave, or MATLAB.

So then why should one bother with Scala or Spark, which are both arguably very unlike R, Python,
Octave, and MATLAB? The answer is that these languages and frameworks are good for data science in
the small: algorithms on data sets that are perhaps just a few hundred megabytes or even a few gigabytes in size.
However, once the dataset becomes too large to fit into main memory on one computer, it suddenly
becomes much more difficult to use one of these languages or frameworks alone.

If your small data set grows into a much larger one, then languages and frameworks like
R, Python, MATLAB, etc. won't allow you to scale. You'll need to start completely from scratch,
reimplementing all of your algorithms using a system like Hadoop or Spark anyway. Or you'll need to
manually figure out how to distribute your problem over many machines without the help of such a
framework, which is kind of a bad idea if you're not already an expert in building distributed systems.

There's also this whole, massive industry shift towards data-oriented decision making. Nowadays,
many companies across many different industries have realized that by looking more closely at the data
they're collecting, from device logs to health or genetic data, they can innovate in ways that were
impossible before. For example, now we have all of these devices surrounding us, collecting information
and attempting to provide all kinds of insights to enrich our day-to-day lives.

Instead, imagine hundreds of thousands of users of some device, say a smartphone or
some wearable or something. And imagine that, as part of your job, you're responsible for providing some
analysis or insight behind all of the data that's collected.
