Big Data Archives - Real Time DBA Magic

Running High Scale Low Latency Database with Zero Data In Memory?

I was talking to one of my oldest database colleagues (and a very dear friend of mine). We were chatting about how key/value stores and databases are evolving. My friend mentioned how they always seem to be revolving around in-memory solutions and cache. The main rant was how this kind of thing doesn’t scale well, while being expensive and complicated to maintain.

My friend’s background story was that they run an application backed by a user profile store with almost 700 million profiles (around 2TB in total, with a replication factor of 2). Access to the user profiles is very random: the application cannot “guess” which user it will need next, so they could not pre-heat the data into memory. Their main issue was that every now and then they got peaks of over 500k operations per second of this kind of mixed workload, and that didn’t scale very well.

User Profile use case summary

Read more →

22/02/2022/0 Comments/by Zohar Elkayam/Estimated Reading Time: 4 Minutes

Oracle Week 2016: Introduction to Apache Spark (slides)

Big Data, Presentations, Spark

This is the presentation for the Rapid Cluster Computing with Apache Spark session I gave at Oracle Week a few weeks ago.

I wrote about the Oracle Week conference in a previous post so I won’t go over that again – this was my 3rd session of that week.

Although Oracle Week was for years about Oracle-related products, this year they decided to open it up to other technologies as well. They had NoSQL sessions, Hadoop sessions, and even open stack sessions (including ElasticSearch and others). I was fortunate enough to be accepted to give this session, which was about Apache Spark.

Apache Spark has been the go-to solution for every new Big Data project we have encountered in the last year or so. Spark is a cluster solution which uses the MapReduce paradigm without the need for a Hadoop cluster. It is based on handling the different map-reduce functions in memory and orchestrating everything internally. If you do have a Hadoop deployment, Spark can interact with it very easily, using either its own internal master or YARN instead.

This seminar is an introductory-level session for Oracle DBAs and other database developers. Its main goal is to get to know this amazing technology. In the session we go over Spark Core, RDDs, how to develop for clusters, and some behind-the-scenes details for better understanding. Some of the presentation does require a programming background (Java, Scala, Python) but I tried to keep that to a minimum.

To my surprise, this seminar had around 35 participants (which was way more than I expected), and got 4.91/5 in the feedback. I presented a similar session at another conference back in March, but this is an updated version of that slide deck.

Here is the Agenda:

The Big Data problem and possible solutions
Basic Spark Core
Working with RDDs
Working with Spark Cluster and Parallel programming
Spark modules: Spark SQL and Spark Streaming (by Ishay Wayner)
Performance and Troubleshooting

Read more →

22/12/2016/0 Comments/by Zohar Elkayam/Estimated Reading Time: 2 Minutes

Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem (slides)

Big Data, Hadoop, Presentations

This is the deck for a presentation I had the pleasure of giving in multiple forums over the last year. It’s a short introduction for Oracle personnel (DBAs and DB developers) to the Big Data challenges and solutions. The presentation focuses on the Hadoop ecosystem but also shows other solutions, such as Apache Spark.

These are things every DBA needs to know, not EVERYTHING a DBA needs to know. This is only an introduction to the subject. I also have a 200+ slide deck for getting the in-depth view. If anyone finds this interesting and wants to read more, feel free to contact me and I’ll post the longer deck as well.

In the agenda:

What is the Big Data challenge?
A Big Data Solution: Apache Hadoop
- HDFS
- MapReduce and YARN
Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other tools
Another Big Data Solution: Apache Spark
Where does the DBA fit in?

This presentation was given at the BGOUG 2016, ILOUG Tech Days 2016, HROUG 2016 and DOAG 2016 Oracle user groups. I also presented it in smaller, more private sessions.

Read more →

11/12/2016/0 Comments/by Zohar Elkayam/Estimated Reading Time: 1 Minute

Spark SQL and Oracle Database Integration

Big Data, Oracle, Spark

I’ve been meaning to write about Apache Spark for quite some time now – I’ve been working with a few of my customers and I find this framework powerful, practical, and useful for a lot of big data usages. For those of you who don’t know about Apache Spark, here is a short introduction.

Apache Spark is a framework for distributed calculation and handling of big data. Like Hadoop, it uses a clustered environment in order to partition and distribute the data to multiple nodes, dividing the work between them. Unlike Hadoop, Spark is based on the concepts of in-memory calculation. Its main advantages are the ability to pipeline operations (thus breaking the initial concept of a single map-single reduce in the MapReduce framework), making the code much easier to write and run, and doing it in an in-memory architecture so things run much faster.
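To make the idea of pipelining concrete, here is a minimal sketch in plain Python (deliberately not the Spark API, and the sample lines are made up for illustration). It chains several lazy steps so that each record flows through the whole pipeline in memory, instead of materializing a result between a single map and a single reduce. The comments name the Spark operation each step loosely resembles.

```python
from collections import Counter
from functools import reduce

# Made-up input, standing in for lines read from a distributed file.
lines = [
    "spark runs in memory",
    "hadoop runs map then reduce",
    "spark pipelines many steps",
]

# Each step is a lazy generator: nothing is computed until the final
# reduce pulls records through the whole chain, one at a time.
words = (word for line in lines for word in line.split())  # flatMap-like
long_words = (w for w in words if len(w) > 4)               # filter-like
pairs = ((w, 1) for w in long_words)                        # map-like


def add_pair(counts, pair):
    """Fold one (word, count) pair into the running totals."""
    word, n = pair
    counts[word] += n
    return counts


word_counts = reduce(add_pair, pairs, Counter())            # reduceByKey-like
print(word_counts["spark"])
```

In Spark the same shape of code would run across the nodes of a cluster, with the framework deciding how to split the data and where to run each step; this sketch only shows the chaining on a single machine.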

Hadoop and Spark can co-exist, and by using YARN – we get many benefits from that kind of environment setup.

Of course, Spark is not bulletproof and you do need to know how to work with it to achieve the best performance. As a distributed application framework, Spark is awesome – and I suggest getting to know it as soon as possible.

~~I will probably make a longer post introducing it in the near future (once I’m over with all of my prior commitments)~~.
In the meantime, here is a short explanation about how to connect from Spark SQL to Oracle Database.
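As a rough idea of what such a connection looks like, here is a minimal sketch using Spark's generic JDBC data source from Python. It is not necessarily the approach taken in the full post. The host, service name, credentials, and table are placeholders, the helper names `oracle_jdbc_options` and `read_oracle_table` are hypothetical, and the sketch assumes the Oracle JDBC driver jar is already on Spark's classpath.

```python
def oracle_jdbc_options(host, port, service, user, password, table):
    """Build the option map for Spark's JDBC data source (hypothetical helper)."""
    return {
        # Oracle thin-driver URL in the host:port/service_name form.
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "driver": "oracle.jdbc.OracleDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_oracle_table(spark, options):
    """Load the table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder connection details, for illustration only.
opts = oracle_jdbc_options(
    "dbhost.example.com", 1521, "ORCLPDB1", "scott", "tiger", "EMP"
)
print(opts["url"])
```

With a running session, `read_oracle_table(spark, opts)` would return a DataFrame that can be registered as a temporary view and queried with Spark SQL.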

Update: here is the 200-slide presentation I made for Oracle Week 2016: it should cover most of the information newcomers need to know about Spark.
Read more →

03/07/2016/13 Comments/by Zohar Elkayam/Estimated Reading Time: 9 Minutes

Big Data for CIOs Presentation

Hadoop, Presentations

A few months ago I was asked to give a two-hour lecture to a group of CIOs. The topic was a bit vague – “Introduction to Big Data and NoSQL” – but I agreed to give it a try anyway.

Since I feel Big Data is such a big topic, and since I really wanted to give the CIOs some added value, I created this presentation. The aim of the presentation wasn’t to cover all the technological aspects of the topic, but to give an overview and some pointers for the future. We talked about basic principles, issues that need tackling, and solutions that might be relevant in the near future. We also talked about NoSQL in order to understand the relation between RDBMS-based solutions and other kinds of solutions.
Read more →

08/09/2015/0 Comments/by Zohar Elkayam/Estimated Reading Time: 2 Minutes

Oracle Pre-Built Developer VMs and VMBox

Big Data, Hadoop, Oracle

Virtual machines (VMs) are not new – they have been around for quite some time, and as a consultant I find myself using them all the time. As a matter of fact, just on my laptop and external drive there are at least 15 or 20 different virtual environments which I use for testing, experimenting, and creating new blog posts.

The thing with virtual machines is that you need to be a little more than just a simple DBA to set them up – you need to know how to install an operating system, configure storage, and get your system ready for database installation, which many junior and less experienced DBAs find problematic at times.

Well, no more! Oracle comes to the rescue and provides us with pre-built developer virtual machines.

Read more →

11/05/2015/0 Comments/by Zohar Elkayam/Estimated Reading Time: 2 Minutes

Archive for category: Big Data