Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. Spark supports Scala, Java and Python.
- A single course covers all the Spark components and NoSQL databases.
- 80 hours of course content
- 60+ assignments overall
- Scala refresher for non-Scala candidates
- No pre-configured VMs.
- 24-hour SLA for email support
- Refresher classes
- Vendor neutral, using Apache versions.
- A Big Data course, not Spark alone.
- Spark Databricks certification assistance
- Course taught by trainers from a team of working industry professionals (www.datadotz.com) who have trained 800+ professionals
After completing the Big Data & Spark Developer course at sparkhadoop.com, you will be equipped & self-reliant in the following:
- Understanding Big Data
- Understanding the various types of data that can be processed with Spark
- Understanding how Big Data & Spark fit in the current environment and infrastructure
- Mastering the core concepts of the Spark ecosystem
- Writing complex Scala programs on Spark
- Setting up a Spark cluster
- Mastering various other components of the Spark ecosystem
- Performing Data Analytics using Spark & SparkSQL
- Implementing a Spark project
- Working on a live, real-life POC on big data analytics using the Spark ecosystem
- And much more
Course Delivery Method
All our courses are live, instructor-led, interactive sessions handled by highly reputed and experienced professionals from industry giants such as CTS, DataDotz, and TCS.
Who can take up this course?
- Data Architects
- Data Integration Architects
- Tech Managers
- Decision Makers
- Database Administrators
- Java developers or any other developers
- Technical Infrastructure Team
- Hadoop developers
- Any working professional interested in knowing Spark
- Any graduate/post-graduate with an urge to learn Spark
Pre-requisites to take this course
- Laptop/PC with a 64-bit processor and a minimum of 4 GB RAM (for programming practice during the sessions)
- Familiarity with core Java will be an advantage, but is not mandatory.
- Familiarity with any database will be an advantage, but is not mandatory.
Project & Certification
Towards the end of the course, there will be an assignment which you will have to work on. This assignment can be based on real-life data and business problems. The completed assignment will be reviewed by the instructor & an industry expert.
Here are some of the data sets you may work on as part of the project work:
Drug Data Set – contains the day-to-day records of all the drugs. It provides information such as opening rate, closing rate, etc. for each individual drug. This data is therefore highly valuable for people who have to make decisions based on market trends.
Spark Databricks Certification Assistance
Why take this course?
Big Data is a term used to describe the large volumes of data that companies and organizations store, process & analyze to make better decisions for the organization & its stakeholders. These data sets have now become so large that companies face difficulties in storing & processing them. Traditional systems used to store & process data have become almost obsolete when it comes to Big Data. This is where Hadoop comes in: companies working with Big Data have started adopting Hadoop for collecting, storing, processing & retrieving petabytes of data.
Gone are the days when decisions were made on the basis of gut feeling. Today, decisions are increasingly made on the basis of historical data, which is processed & analyzed, and forecasting is done accordingly.
The right mix of excellent analytical skills & hands-on experience with advanced technology like Hadoop is what companies and organizations are looking for in a professional. According to a McKinsey report, more than 200,000 data scientists will be needed by the industry (2014-2016).
There is a huge opportunity in the market for you after successful completion of this course!
Big Data (What, Why, Who) – 3++Vs – Overview of Big Data Systems – Role of Spark in Big Data – Overview of other Big Data Systems – Who is using Spark – Relationship between Apache Spark and Hadoop – Integrations into Existing Software Products – Current Scenario in Spark – Installation of Spark Shell – Configuration
Hands-on with Scala
Introduction to Scala – Scala environment setup and installation – A first example – Scala internals – Interaction with Java – Basic syntax usage – Variables – Functions – Access modifiers – Closures – Strings – Collections (set, list, map, tuples, options, iterators) – Class and object – Traits – Pattern matching – Scala Regular Expressions – Exception Handling – Extractors – Polymorphic Methods
Spark Shell – Resilient Distributed Datasets (RDD) – RDD Operations (Transformations and Actions) – Key-Value RDDs – Numeric RDDs – Stages and Tasks in DAG – Serialization – Caching RDDs – Spark-Submit
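The split between transformations and actions listed above can be sketched without a cluster. The snippet below is a plain-Python toy, not the Spark API: `MiniRDD` and its methods are hypothetical names chosen for illustration. It assumes only the general idea that transformations are recorded lazily and an action triggers the actual computation.

```python
import functools


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class).

    Transformations only record a step; actions run the recorded steps.
    """

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable, e.g. a list
        self._steps = steps    # tuple of (kind, function) pairs

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source as lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: force evaluation of the whole chain.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return functools.reduce(fn, self._run())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


even_squares = MiniRDD([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has run yet
collected = even_squares.collect()  # the action triggers the computation
total = even_squares.reduce(lambda a, b: a + b)
```

Each action re-runs the recorded chain from the source, which is the motivation for the "Caching RDDs" topic: without caching, repeated actions repeat the work.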
Installation of Standalone Cluster – Cluster Components (Master, Workers, Executors) – Spark-Submit – Application Deployment Modes (Cluster Mode & Client Mode)
Installation of Standalone Cluster – SchemaRDD – Hive as a DataSource (HiveContext – Installation – Hive UDFs) – SparkSQL UDF – Thrift Server – JSON (& nested JSONs) – Parquet – DSL support (Scala-based) in SparkSQL – SparkSQL CLI Shell – DataSources API
Introduction to Stream Processing – Streaming vs MicroBatch vs Batch – Introduction to DStreams – Input DStream – Sources for DStreams – Writing a custom Source – DStream Transformations – Output operations – Sliding window operations – Caching/Persistence – Deployment – Monitoring and Performance tuning – Fault Tolerance – Comparison to other streaming frameworks (Storm, DataTorrent)
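As a rough illustration of the micro-batch and sliding-window topics above, the following plain-Python sketch counts words over a window of recent micro-batches. It does not use Spark Streaming; the function `windowed_counts` is hypothetical, and the parameter names `window_length` and `slide_interval` are borrowed loosely from the usual window terminology, measured here in number of batches rather than time.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield word counts over the last `window_length` micro-batches,
    emitting one result every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            counts = Counter()
            for b in window:
                counts.update(b)
            yield dict(counts)


# Four micro-batches of words; window of 3 batches, sliding every 2 batches.
micro_batches = [["a", "b"], ["a"], ["c"], ["a", "c"]]
window_results = list(windowed_counts(micro_batches, window_length=3, slide_interval=2))
```

The first result covers only two batches because the window has not filled yet; the second covers batches 2 to 4, so the first batch's `"b"` has slid out of the window.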
Introduction to Cluster Managers – Spark on Hadoop YARN (Installation – YARN Architecture) – Hardware Recommendations – Monitoring & Metrics (Spark Web UI – Logging) – Performance Tuning – Commercial Support
Apache Spark Extras
Spark with Hadoop 1.x – Introduction to Spark MLlib – Introduction to GraphX – Executing Spark Applications written in Python as well as Java
Introduction to NoSQL & MongoDB
Introduction – Architecture – JSON – BSON – Installation of Single Node MongoDB – Collections (Documents, Fields) – CRUD operations – Cursors – Indexes (Single, Compound, Text, MultiKey) – References – Embedded Documents – GridFS (Files, Chunks) – Aggregation Pipeline – MapReduce – Server-Side Scripting – Mongo Shell – Commands – Java Driver
Configuration Parameters – Multi Node Cluster Installation – Backup and Restore (mongodump, Filesystem Snapshots, mongorestore) – Import and Export – Security – Sharding (Routers, Shards, Config Servers, Chunk Splitting, Chunk Migration/Balancer) – High Availability using Replication – Monitoring Utilities (mongostat, mongotop, REST Interface, HTTP Console – MMS) – Optimization – Deployment Modes
Cassandra Core Concepts
Introduction – Installation of Single Node Cassandra – KeySpaces – CQL – Using cqlsh and its commands – CQL Data Types – CRUD Operations – TTL – Counters – Indexes – Lightweight Transactions – Collections (List, Set, Map) – Comments – Compound Primary Key – Composite Partition Key – Prepared Statements – Triggers – Tunable Consistency (Write, Read) – Java Drivers
Gossip (Inter-Node Communication) – Failure Detection and Recovery – Data Replication – Partitioners (Murmur3, Random, ByteOrdered) – Snitches (Simple, Ec2, PropertyFile) – Read Path – Write Path (Insert, Update, Delete) – Hinted Handoff
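To make the partitioner and data replication topics above more concrete, here is a minimal plain-Python sketch of a token ring. It is an illustration under stated assumptions, not Cassandra's implementation: `TokenRing`, `token_for`, and the node names are hypothetical, each node owns a single token (no virtual nodes), the hash is MD5 (the hash family used by the Random partitioner), and replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token_for(key):
    """Map a partition key to a 128-bit token using MD5 (illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class TokenRing:
    """Hypothetical ring: each node owns the range ending at its token."""

    def __init__(self, node_tokens):
        # node_tokens: dict of node name -> token
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, key, replication_factor):
        # First node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self._tokens, token_for(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        # Walk clockwise to pick the remaining replicas.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Three nodes with evenly spaced tokens over the 128-bit range.
ring = TokenRing({f"node{i + 1}": i * (2**128 // 3) for i in range(3)})
owners = ring.replicas("user:42", replication_factor=2)
```

Because the token depends only on the key, every node computes the same replica list for a given key without consulting a central directory.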
Configuration – Token Generation – Data Caching – Compaction – Compression – Backup and Restore (Snapshots) – Introduction to Security – Cassandra Tools (nodetool, cassandra-stress, Cassandra bulk loader) – DevCenter – Multi-Node Cassandra Cluster (Lab) – Adding and Removing a Node – Deployment Models.