This is the presentation for the Rapid Cluster Computing with Apache Spark session I gave at Oracle Week a few weeks ago.
I wrote about the Oracle Week conference in a previous post, so I won’t go over that again. This was my third session of that week.
Although Oracle Week was for years about Oracle-related products, this year they decided to open it up to other technologies as well. They had NoSQL sessions, Hadoop sessions, and even open-source stack sessions (including ElasticSearch and others). I was fortunate enough to be accepted to give this session, which was about Apache Spark.
Apache Spark has been our go-to solution for almost every new Big Data project we have encountered in the last year or so. Spark is a cluster computing framework that uses the MapReduce paradigm without requiring a Hadoop cluster. It runs the different map and reduce functions in memory and orchestrates everything internally. If you do have a Hadoop deployment, Spark can interact with it very easily, running either under its own standalone master or under YARN.
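For readers who have not met the map-reduce paradigm before, here is a minimal sketch of the idea in plain Python. It is not Spark code and it is not taken from the slides: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are made up for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    # All intermediate results stay in memory between the phases.
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input, just to exercise the pipeline.
lines = ["spark runs in memory", "spark runs on a cluster"]
counts = word_count(lines)
print(counts)
```

In PySpark the same pipeline is commonly written with `flatMap`, `map` and `reduceByKey` on an RDD, with Spark distributing the work across the cluster.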
This seminar is an introductory-level session for Oracle DBAs and other database developers. Its main goal is to get to know this amazing technology. In the session we go over Spark Core, RDDs, how to develop for clusters, and some behind-the-scenes details for better understanding. Some of the presentation does require a programming background (Java, Scala, Python), but I tried to keep that to a minimum.
To my surprise, this seminar had around 35 participants (which was way more than I expected) and got 4.91/5 in the feedback. I presented a similar session at another conference back in March, but this is an updated version of that slide deck.
Here is the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming (by Ishay Wayner)
- Performance and Troubleshooting