Level 2: Apache Spark + Scala + Cassandra + MongoDB
21st May 2015

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. Spark supports Scala, Java and Python.
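
For a first taste of the programming model, here is a minimal word-count sketch in Scala against the Spark 1.x RDD API; the input path and the local master setting are illustrative assumptions, not part of the course material.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local mode for practice; on a real cluster the master is set via spark-submit
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // "input.txt" is a placeholder; any local or HDFS text file works
        sc.textFile("input.txt")
          .flatMap(_.split("\\s+"))   // split lines into words
          .map(word => (word, 1))     // pair each word with a count of 1
          .reduceByKey(_ + _)         // sum the counts per word
          .collect()
          .foreach(println)

        sc.stop()
      }
    }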

Course Highlights

  • A single course covers all the Spark components and the NoSQL stores.
  • 80 hours of instruction
  • 60+ assignments overall
  • Scala refresher for candidates new to Scala
  • No pre-configured VMs; you install and configure everything yourself.
  • 7-day money-back guarantee with no questions asked
  • Our 24×7 expert support team is available by email, phone, or live chat for any issues you face during the course
  • Get access to a recording of any live session you miss
  • Get lifetime access to the course material and session videos
  • Chat with the instructor at any time; instructors are always ready to answer your questions
  • Vendor neutral: taught on the Apache versions
  • A full Big Data course, not Spark alone
  • Spark (Databricks) certification assistance
  • Course delivered by trainers who are practicing professionals and have trained 800+ professionals

Course Objectives

After completing the Big Data & Spark Developer course, you will be equipped and self-reliant in the following areas.

  • Understanding Big Data
  • Understanding the various types of data that can be processed with Spark
  • Understanding how Big Data & Spark fit into the current environment and infrastructure
  • Mastering the core concepts of the Spark ecosystem
  • Writing complex Scala programs on Spark
  • Setting up a Spark cluster
  • Mastering the various other components of the Spark ecosystem
  • Performing data analytics using Spark & Spark SQL
  • Implementing a Spark project
  • Working on a live/real-life POC on big data analytics using the Spark ecosystem
  • And much more…

Course Delivery Method

All our courses are live, instructor-led, interactive sessions handled by highly reputed and experienced professionals from industry giants such as CTS, DataDotz, and TCS.

Who can take up this course?

  • Data Architects
  • Data Integration Architects
  • Tech Managers
  • Decision Makers
  • Database Administrators
  • Java developers and other software developers
  • Technical Infrastructure Team
  • Hadoop developers
  • Any working professional interested in knowing Spark
  • Any graduate/post-graduate with an urge to learn Spark

Prerequisites to take this course

  • A 64-bit laptop/PC with a minimum of 4 GB RAM (for programming practice alongside the sessions)
  • Familiarity with core Java will be an advantage, but is not mandatory.
  • Familiarity with any database will be an advantage, but is not mandatory.

Project & Certification

Towards the end of the course, there will be an assignment for you to work on: a real-life, data-based assignment involving business problems, which will be reviewed on completion by the instructor and an industry expert.

Here are some of the data sets on which you may work as part of the project:

Drug Data Set – contains the day-to-day records of all the drugs, providing information such as the opening rate and closing rate for each individual drug. This data is therefore highly valuable for people who have to make decisions based on market trends.

Spark (Databricks) certification assistance is also provided.

Why take this course?

Big Data is a term used to describe the large volumes of data that companies and organizations store, process, and analyze to make better decisions for the overall organization and its stakeholders. These data sets have become so huge that companies face difficulties storing and processing them; the traditional systems once used for this have become all but obsolete when it comes to Big Data. This is where Hadoop comes in, and companies working with Big Data have started adopting Hadoop for collecting, storing, processing, and retrieving petabytes of data.

Gone are the days when decisions were made on gut feeling; today, decisions are made on the basis of historical data, which is processed and analyzed so that forecasts can be made.

The right mix of excellent analytical skills and hands-on experience with advanced technology like Hadoop is what companies and organizations are looking for. According to a recent McKinsey report, the industry will need more than 200,000 data scientists (2014–2016).

There is a huge opportunity in the market for you after successful completion of this course!


Course Outline


Big Data (What, Why, Who) – The 3+ Vs – Overview of Big Data Systems – Role of Spark in Big Data – Overview of Other Big Data Systems – Who Is Using Spark – Relationship Between Apache Spark and Hadoop – Integration into Existing Software Products – Current Scenario in Spark – Installation of the Spark Shell – Configuration

Hands on with Scala

Introduction to Scala – Scala environment setup and installation – A first example – Scala internals – Interaction with Java – Basic syntax usage – Variables – Functions – Access modifiers – Closures – Strings – Collections (Set, List, Map, Tuples, Options, Iterators) – Classes and objects – Traits – Pattern matching – Scala regular expressions – Exception handling – Extractors – Polymorphic methods
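
As a flavour of the refresher, here is a small, self-contained Scala sketch touching case classes, collections, Options, and pattern matching; all names and values are illustrative.

    object ScalaBasics {
      // A case class provides immutable fields, equality, and pattern matching for free
      case class Person(name: String, age: Int)

      def describe(p: Person): String = p match {
        case Person(n, a) if a < 18 => s"$n is a minor"
        case Person(n, _)           => s"$n is an adult"
      }

      def main(args: Array[String]): Unit = {
        val people = List(Person("Asha", 15), Person("Ravi", 32))

        // Collections: filter and map, then a safe Option lookup in a Map
        val adults = people.filter(_.age >= 18).map(_.name)
        println(adults)                           // List(Ravi)

        val byName = people.map(p => p.name -> p).toMap
        println(byName.get("Asha").map(describe)) // Some(Asha is a minor)
      }
    }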

Spark Basics

Spark Shell – Resilient Distributed Datasets (RDDs) – RDD Operations (Transformations and Actions) – Key-Value RDDs – Numeric RDDs – Stages and Tasks in the DAG – Serialization – Caching RDDs – Spark-Submit
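
A typical first session in the Spark shell exercises lazy transformations, actions, key-value RDDs, and caching along these lines (`sc` is the SparkContext the shell provides; the data is illustrative):

    // Transformations are lazy; nothing executes until an action is called
    val nums    = sc.parallelize(1 to 10)
    val squares = nums.map(n => n * n)          // transformation
    val evens   = squares.filter(_ % 2 == 0)    // transformation

    println(evens.count())                      // action: triggers the job
    println(evens.collect().mkString(", "))     // action: 4, 16, 36, 64, 100

    // Key-value RDDs support aggregations such as reduceByKey
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    println(pairs.reduceByKey(_ + _).collect().mkString(", "))  // (a,4), (b,2)

    // Caching keeps a frequently reused RDD in memory across actions
    evens.cache()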

Spark Architecture

Installation of a Standalone Cluster – Cluster Components (Master, Workers, Executors) – Spark-Submit – Application Deployment Modes (Cluster Mode & Client Mode)
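
As a rough sketch of how an application points at a standalone master in code; the host name, the port (7077 is the standalone default), and the memory setting are illustrative assumptions:

    import org.apache.spark.{SparkConf, SparkContext}

    // "spark://master-host:7077" is an assumed standalone master URL
    val conf = new SparkConf()
      .setAppName("StandaloneDemo")
      .setMaster("spark://master-host:7077")
      .set("spark.executor.memory", "2g")  // per-executor memory (illustrative)

    val sc = new SparkContext(conf)

In practice, the master URL and the deployment mode are usually supplied to spark-submit via --master and --deploy-mode rather than hard-coded in the application.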


Spark SQL

Installation of a Standalone Cluster – SchemaRDD – Hive as a Data Source (HiveContext – Installation – Hive UDFs) – Spark SQL UDFs – Thrift Server – JSON (& nested JSON) – Parquet – DSL Support (Scala-Based) in Spark SQL – Spark SQL CLI Shell – Data Sources API
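
A minimal sketch against the Spark 1.3+ API, where the SchemaRDD abstraction became DataFrame; the case class and data are illustrative:

    import org.apache.spark.sql.SQLContext

    case class Employee(name: String, dept: String, salary: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a DataFrame (the successor of SchemaRDD) from an RDD of case classes
    val employees = sc.parallelize(Seq(
      Employee("Asha", "eng", 90000),
      Employee("Ravi", "sales", 60000)
    )).toDF()

    employees.registerTempTable("employees")

    // Run plain SQL against the registered table
    sqlContext.sql(
      "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept"
    ).show()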

Spark Streaming

Introduction to Stream Processing – Streaming vs. Micro-Batch vs. Batch – Introduction to DStreams – Input DStreams – Sources for DStreams – Writing a Custom Source – DStream Transformations – Output Operations – Sliding Window Operations – Caching/Persistence – Deployment – Monitoring and Performance Tuning – Fault Tolerance – Comparison to Other Streaming Frameworks (Storm, DataTorrent)
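
A minimal DStream word count over a socket source might look like the following sketch; the 5-second batch interval, host, and port are illustrative (feed the socket with, e.g., `nc -lk 9999`):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
    // Each micro-batch covers 5 seconds of input (interval is illustrative)
    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket source is the simplest input DStream
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()        // output operation: triggers execution every batch

    ssc.start()
    ssc.awaitTermination()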

Spark Administration

Introduction to Cluster Managers – Spark on Hadoop YARN (Installation – YARN Architecture) – Hardware Recommendations – Monitoring & Metrics (Spark Web UI – Logging) – Performance Tuning – Commercial Support

Apache Spark Extras

Spark with Hadoop 1.x – Introduction to Spark MLlib – Introduction to GraphX – Executing Spark Applications Written in Python as well as Java
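
As a taste of MLlib, here is a sketch of k-means clustering using the Spark 1.x MLlib API; the input file, k = 3, and 20 iterations are illustrative assumptions:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // "features.txt" is a placeholder: one space-separated numeric vector per line
    val data = sc.textFile("features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Cluster into k = 3 groups over 20 iterations (both values illustrative)
    val model = KMeans.train(data, 3, 20)
    model.clusterCenters.foreach(println)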

Introduction to NoSQL & MongoDB

Introduction – Architecture – JSON – BSON – Installation of Single-Node MongoDB – Collections (Documents, Fields) – CRUD Operations – Cursors – Indexes (Single, Compound, Text, MultiKey) – References – Embedded Documents – GridFS (Files, Chunks) – Aggregation Pipeline – MapReduce – Server-Side Scripting – Mongo Shell – Commands – Java Driver
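
A small CRUD sketch from Scala using the MongoDB Java driver of that era (the legacy DBCollection API); the database, collection, and field names are illustrative:

    import com.mongodb.{BasicDBObject, MongoClient}

    // Connect to a single-node MongoDB on its default port (host/port assumed)
    val client = new MongoClient("localhost", 27017)
    val coll   = client.getDB("coursedb").getCollection("students")

    // Create
    coll.insert(new BasicDBObject("name", "Asha").append("score", 92))

    // Read
    val query = new BasicDBObject("name", "Asha")
    println(coll.findOne(query))

    // Update: $set modifies one field without replacing the whole document
    coll.update(query, new BasicDBObject("$set", new BasicDBObject("score", 95)))

    // Delete
    coll.remove(query)

    client.close()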

MongoDB Administration 

Configuration Parameters – Multi-Node Cluster Installation – Backup and Restore (mongodump, Filesystem Snapshots, mongorestore) – Import and Export – Security – Sharding (Routers, Shards, Config Servers, Chunk Splitting, Chunk Migration/Balancer) – High Availability Using Replication – Monitoring Utilities (mongostat, mongotop, REST Interface, HTTP Console, MMS) – Optimization – Deployment Modes

Cassandra Core Concepts 

Introduction – Installation of Single-Node Cassandra – Keyspaces – CQL – Using cqlsh and Its Commands – CQL Data Types – CRUD Operations – TTL – Counters – Indexes – Lightweight Transactions – Collections (List, Set, Map) – Comments – Compound Primary Key – Composite Partition Key – Prepared Statements – Triggers – Tunable Consistency (Write, Read) – Java Drivers
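
A short sketch of CQL CRUD and a prepared statement from Scala, using the DataStax Java driver (2.x era); the keyspace, table, and contact point are illustrative:

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    // Connect to a single-node Cassandra (contact point assumed)
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    session.execute(
      "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
      "{'class': 'SimpleStrategy', 'replication_factor': 1}")
    session.execute(
      "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

    // Prepared statements are parsed once and reused with different bindings
    val insert = session.prepare("INSERT INTO demo.users (id, name) VALUES (?, ?)")
    session.execute(insert.bind(Int.box(1), "Asha"))

    val rows = session.execute("SELECT id, name FROM demo.users")
    rows.asScala.foreach(row => println(s"${row.getInt("id")} -> ${row.getString("name")}"))

    cluster.close()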

Cassandra Architecture

Gossip (Inter-Node Communication) – Failure Detection and Recovery – Data Replication – Partitioners (Murmur3, Random, ByteOrdered) – Snitches (Simple, EC2, PropertyFile) – Read Path – Write Path (Insert, Update, Delete) – Hinted Handoff

Cassandra Administration

Configuration – Token Generation – Data Caching – Compaction – Compression – Backup and Restore (Snapshots) – Introduction to Security – Cassandra Tools (nodetool, cassandra-stress, Cassandra Bulk Loader) – DevCenter – Multi-Node Cassandra Cluster (Lab) – Adding and Removing a Node – Deployment Models

