Boffins Course

Master in Data Science & Engineering

A unique Program which combines Data Engineering with Data Science

"Data Science at Boffins is both in-lab or online program that is designed for either Working Professionals, University Students or all Aspirants of data science and engineering."

The program features a real-project-based, interdisciplinary curriculum that helps the learner build the in-demand technical, analytical and communications skills needed to manage large and complex data sets. Students attend an in-person immersion, participate in live, weekly in-lab or online classes and complete interactive coursework that provides a comprehensive understanding of computer science, statistics, strategic behavior and data visualization.

This Master’s Data Science Program provides the skills required to become a Boffins certified Data Scientist equipped with the skills of a Data Engineer. You will learn the most in-demand technologies such as Data Science, Machine Learning, Python and Big Data on Hadoop, and implement concepts such as data exploration, regression models and hypothesis testing, along with Spark, Linux, shell scripting, Scala and Java.

Boffins Data Science Academy
Location: IT Park, Nagpur
Mode: Full-Time Program for 3 Months
Schedule: 3 hours every day, five days a week

"Are you a university Student, an engineering aspirant or a current data engineer who wish to do better at job, or a data engineer who want to become a data scientist? We will explain why you cannot become a data scientist without being a data engineer and also why the corporations need atleast 5 Data Engineers behind one Data Scientist."

Course Modules

LINUX
SHELL SCRIPT
JAVA
HADOOP
SQL/NOSQL
SPARK
PYTHON
MACHINE LEARNING

Ever since Harvard declared Data Scientist one of the hottest jobs in 2012, everybody wants to be one. BUT, can you be a good Data Scientist without being a good Data Engineer? It is commonly assumed that Data Scientists sit across two roles, statistics and computer science (i.e. that they are competent at both Analytics and Data). This is where the assumption is wrong. Data Scientists don’t help in activating the data and the analytics into our business processes, applications and systems. That job belongs to someone else: the Data Engineer.

According to Gartner, only 15% of big data projects ever make it into production. The KEY reasons why the other 85% of projects never make it there are:

Either they never find an insight worth putting into production.
Or, they find an insight and build a model but fail to build a production pipeline that can run within the service level agreement on a repeatable basis.
Or, they don’t need an insight, because the data analysis they want to run isn’t dependent on some complicated model, but still fail to build a production pipeline that can run within the service level agreement on a repeatable basis.

Data Engineers take back the ownership of data engineering and the computer science side of data architecture, management and governance. Data Engineers instrument data and analytics. They harness the strategy and investment plans of Data Architects. They enable analytics and data science. They adopt and activate data governance policies. They ensure that investment in data and analytics gets its full return vertically and horizontally. So corporations create a data engineering workbench:

1. To accelerate data science.
2. To ensure data lake adoption.
3. To activate data and analytics in systems and processes.
4. To create consistency and reduce data risk.

While data engineers may be more important than data scientists, there is hope in the form of automation, which can make today’s data engineers 10x more productive. In the same way that Integrated Development Environments (IDEs) made software developers significantly more productive, data engineering automation does the same in the BIG DATA space. Data engineering is hard, data engineers are rare and demand is high, so it is no coincidence that you are now here at Boffins Data Science Academy. If you are a big data engineer or a university student, if you want to become a more efficient data engineer, or if you know a big data engineer or someone who wants to become one, CONTACT US TODAY!

Course content

LINUX Module 1

"Most data science, engineering, analytics and machine learning tools are native to the Linux ecosystem. And if you stage your experimental pipelines using leased servers on the cloud (e.g. EC2, Heroku, AWS etc.), an approach that is increasingly popular these days, you will need to get comfortable with the Linux OSes. In fact, according to recent surveys, 92% of all cloud VM instances run Linux OS, prominently Ubuntu as well as others such as CentOS, RedHat, etc."

Introduction to Red Hat Enterprise Linux
Introduction to GNU/Linux
Installing Red Hat Enterprise Linux
Automating Programs
Login Options
The GNU/Linux Filesystem
Key Filesystem Locations
BASH - Bourne Again Shell
User Management
Software Management
Hardware Management
Network Management
Network Services - FTP, NFS, Samba
Network Services - Part II, Sendmail, Apache, Squid Proxy Server
Exam Practice
Installation Challenges
Configuration Challenges
Troubleshooting Challenges

SHELL SCRIPT Module 2

"Most of the Well-Known Data Scientists have some knowledge about SHELL Scripting. Using bash scripts to create data pipelines is incredibly useful as a data scientist. Data Scientists do complex things with just a few keystrokes. Sometimes called "the universal glue of programming. This course will introduce its key elements and show you how to use them efficiently. Manipulating files and directories, Manipulating data, Combining tools, Batch processing, Creating new tools, Creating Data pipe lines. The possibilities with these scripts are almost endless. "

JAVA Module 3

"For exploratory data science over single-machine datasets, R and python suffice. Moving to distributed datasets, one could query with front-ends such as Hive or Pig. However, when it comes to running data science models in production, a most of companies use JVM based languages and platforms. A variety of tools and libraries exists for machine learning such as Spark/Hadoop for computation and MLlib/H2O/Mahout/Oryx for machine learning. Looking at the recent trends, most libraries work with multiple languages so you will end up using a language that fits well with the rest of your codebase."

HADOOP Module 4

"A study of more than 100 data scientists by Paradigm4 found that only 48% of data scientists used Hadoop or Spark on their jobs whilst 76% of the data scientists said that Hadoop is too slow and requires more effort on data preparation to program. Contrary to this, a recent analysis by CrowdFlower on 3490 LinkedIn jobs for data science ranked Apache Hadoop as the second most important skill for a data scientist with 49% rating. Reasons to use Hadoop for Data Science Data Exploration with full datasets, Mining larger datasets, large scale pre-processing of raw data, Data Agility. "

Introduction to Hadoop
Hadoop Ecosystem
Big Data Overview
Choosing Hardware for Hadoop Cluster
Apache Hadoop Installation
Installing Hadoop Ecosystem and Integration with Hadoop
Hadoop Commands Usage
Import Data In HDFS
Sample Hadoop Example
Hadoop MapReduce
Understanding the MapReduce Internal Components
HBase MapReduce Program
Hive
Pig
Sqoop
Flume
Zookeeper

PYTHON Module 6

"Python is gaining increasing acceptance among the enterprises and clients. Now the companies are smart enough to understand the power of this programming language. In addition, it is simple for them to modify their tested platforms. Hence, they are more willing to shift to Python.Python is the de facto language of machine learning. Notably, Google’s TensorFlow works primarily with Python. Almost every course on neural networks uses Python. The data analysis and parsing required for machine learning go well with Python, and its libraries. Machine learning as a skill is in greater demand every day. A good grasp of the Python programming language puts you a step ahead of others learning it from scratch"

MACHINE LEARNING: Python NumPy for mathematical computing Module 7

"Machine Learning is the science (and art) of programming computers so they can learn from data."

The NumPy ndarray: A Multidimensional Array Object
Creating ndarrays
Data Types for ndarrays
Arithmetic with NumPy Arrays
Basic Indexing and Slicing
Boolean Indexing
Fancy Indexing
Transposing Arrays and Swapping Axes
Expressing Conditional Logic as Array Operations
Mathematical and Statistical Methods
Methods for Boolean Arrays
Sorting
File Input and Output with Arrays
Pseudorandom Number Generation
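
To give a feel for the topics listed above, here is a minimal sketch that touches several of them in turn. The temperature values and variable names are made up for illustration and are not part of the course material; it assumes NumPy is installed.

```python
import numpy as np

# Hypothetical sample data: temperatures (Celsius) for 2 cities over 4 days.
temps = np.array([[21.5, 23.0, 19.0, 25.5],
                  [30.0, 28.5, 31.0, 29.5]])

# Creating ndarrays and inspecting their shape and data type.
shape = temps.shape
dtype = temps.dtype

# Arithmetic with NumPy arrays is element-wise: Celsius to Fahrenheit.
fahrenheit = temps * 9 / 5 + 32

# Basic indexing and slicing: first city, last two days.
last_two = temps[0, 2:]

# Boolean indexing: keep only the readings above 25 degrees.
hot = temps[temps > 25.0]

# Fancy indexing: pick the first and last day for every city.
first_and_last = temps[:, [0, 3]]

# Transposing: rows become days, columns become cities.
by_day = temps.T

# Conditional logic as an array operation.
labels = np.where(temps > 25.0, "hot", "mild")

# Mathematical and statistical methods: mean temperature per city.
city_means = temps.mean(axis=1)

# Methods for boolean arrays: count hot readings, check each city.
n_hot = (temps > 25.0).sum()
any_hot_in_city = (temps > 25.0).any(axis=1)

# Sorting each city's readings in ascending order.
sorted_temps = np.sort(temps, axis=1)

# Pseudorandom number generation with a fixed seed for repeatability.
rng = np.random.default_rng(seed=42)
noise = rng.standard_normal(size=temps.shape)
```

Each statement works on the whole array at once, with no explicit Python loop, which is the main reason NumPy is used for mathematical computing.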

MACHINE LEARNING: Supervised Module 8

"In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels"

MACHINE LEARNING: Unsupervised Learning Module 9

"In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher."

k-Means
Hierarchical Cluster Analysis (HCA)
Expectation Maximization
Principal Component Analysis (PCA)
Kernel PCA
Locally-Linear Embedding (LLE)
t-distributed Stochastic Neighbor Embedding (t-SNE)
Apriori
Eclat
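
To make the first item in this list concrete, here is a minimal sketch of k-Means in plain Python. The points are invented for illustration. One simplifying assumption is that the first `k` points serve as the starting centroids, whereas real implementations usually pick starting centroids at random or with a smarter seeding scheme.

```python
def kmeans(points, k, iterations=10):
    """Cluster points into k groups by alternating assign and update steps."""
    # Assumption for this sketch: the first k points are the initial centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignments


# Hypothetical unlabeled data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

centroids, assignments = kmeans(points, k=2)
```

No labels are supplied anywhere: the grouping comes purely from the distances between the points.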