Skip to main content

Data Engineering - Tools & Intro

Data Engineering - Tools & Intro So I just realized that I am here after a month or so. I was busy at work and traveling. I am starting a kind of new series, I say it Data Engineering Series in which I will be discussing different tools. Of course, I am not able to discuss the entire concept of Data Engineering neither I know it as I will be learning myself. What is Data Engineering? Data Engineering is all about developing, maintaining systems that are responsible for transferring data in large volumes and make it available for analysts and data scientists to use it for analyzing and data modeling. Data engineering is a superset of Data Science or the subset, not clear to me but the collaboration of data engineers and scientists fruits useful data-driven solutions. Data Engineering tools It consists of several tools. Some are dealing with data storage while others with analysis and ETL. Ofcourse, Apache Kafka is one of them. The others tools that I might be covering are Apache Airflow, an ETL tool and Hadoop Ecosystem components like HDFS, Hive, Yarn, Pig etc. There is no such specific roadmap so tools can be covered in any order. Since I mostly work in Python, Java so will be trying my best to find some way to interact with Python or Java but it is not necessary as most of Hadoop related systems are in either Java or Scala. So, stay tuned and I will be back shortly with the new post.

Comments

Popular posts from this blog

Introduction to Customer Segmentation in Python !!!

Introduction to Customer Segmentation in Python In this tutorial, you're going to learn how to implement customer segmentation using RFM(Recency, Frequency, Monetary) analysis from scratch in Python. In the Retail sector, the various chain of hypermarkets generating an exceptionally large amount of data. This data is generated on a daily basis across the stores. This extensive database of customers transactions needs to analyze for designing profitable strategies. All customers have different-different kind of needs. With the increase in customer base and transaction, it is not easy to understand the requirement of each customer. Identifying potential customers can improve the marketing campaign, which ultimately increases the sales. Segmentation can play a better role in grouping those customers into various segments. In this tutorial, you will cover the following topics: What is Customer Segmentation? Need of Customer Segmentation Types of Segmentation Customer

Let's Understand Ten Machine Learning Algorithms

Ten Machine Learning Algorithms to Learn Machine Learning Practitioners have different personalities. While some of them are “I am an expert in X and X can train on any type of data”, where X = some algorithm, some others are “Right tool for the right job people”. A lot of them also subscribe to “Jack of all trades. Master of one” strategy, where they have one area of deep expertise and know slightly about different fields of Machine Learning. That said, no one can deny the fact that as practicing Data Scientists, we will have to know basics of some common machine learning algorithms, which would help us engage with a new-domain problem we come across. This is a whirlwind tour of common machine learning algorithms and quick resources about them which can help you get started on them. 1. Principal Component Analysis(PCA)/SVD PCA is an unsupervised method to understand global properties of a dataset consisting of vectors. Covariance Matrix of data points is analyzed here to un

Tapping Into the “Long Tail” of Big Data

Variety, not volume or velocity, drives big-data investments !!! Gartner defines big data as the three Vs: high-volume, high-velocity, high-variety information assets. While all three Vs are growing, variety is becoming the single biggest driver of big-data investments, as seen in the results of a recent survey by New Vantage Partners. This trend will continue to grow as firms seek to integrate more sources and focus on the “long tail” of big data. From schema-free JSON to nested types in other databases (relational and NoSQL), to non-flat data (Avro, Parquet, XML), data formats are multiplying and connectors are becoming crucial. In 2017, analytics platforms will be evaluated based on their ability to provide live direct connectivity to these disparate sources. Tapping Into the “Long Tail” of Big Data When asked about drivers of Big Data success, 69% of corporate executives named greater data variety as the most important factor, followed by volume (25%), with velocity (6%)