Cleansing Data

Cleansing Data in Big Data Analytics

  • After data has been collected from various sources, the next step is to make the data sources homogeneous and then continue developing the data product
  • Because the data arrive from different sources, homogenizing them may cause a loss of information; in that scenario, consider an alternative
  • Could one data source support a regression model and another a classification model? Can the differences between sources be used as an advantage rather than a loss of information? Decisions like these are what make analytics appealing and challenging
  • In the case of reviews, each data source may be in a different language, which leaves 2 options
  • Homogenization: translate the various languages into the language that has the most data
  • Translating large amounts of data through an Application Programming Interface (API) is costly
  • Heterogenization: design a solution for every language
  • Build a recommender for every language
  • Tuning is easier when there are only a few languages
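
The two options above can be contrasted with a minimal Python sketch. The sample reviews, the `lang` field, and the `fake_translate` helper are all hypothetical; in particular, `fake_translate` only stands in for a paid translation API and does not translate anything.

```python
from collections import Counter, defaultdict

# Hypothetical sample reviews; each record carries a language tag.
reviews = [
    {"lang": "en", "text": "great phone"},
    {"lang": "es", "text": "muy buen telefono"},
    {"lang": "en", "text": "battery died fast"},
    {"lang": "fr", "text": "tres bon produit"},
]

def majority_language(records):
    """Return the language code that has the most records."""
    return Counter(r["lang"] for r in records).most_common(1)[0][0]

def homogenize(records, translate):
    """Translate every record into the majority language.

    `translate(text, source, target)` is a hypothetical callable standing in
    for a translation API; the returned call count hints at the cost.
    """
    target = majority_language(records)
    texts, calls = [], 0
    for r in records:
        if r["lang"] == target:
            texts.append(r["text"])
        else:
            texts.append(translate(r["text"], r["lang"], target))
            calls += 1
    return target, texts, calls

def heterogenize(records):
    """Group records by language so each group can feed its own solution."""
    groups = defaultdict(list)
    for r in records:
        groups[r["lang"]].append(r["text"])
    return dict(groups)

def fake_translate(text, source, target):
    # Placeholder only: tags the text instead of really translating it.
    return f"[{source}->{target}] {text}"

target, texts, api_calls = homogenize(reviews, fake_translate)
per_language = heterogenize(reviews)
```

With homogenization, the number of translation calls grows with the amount of minority-language data; with heterogenization, the number of groups (and therefore of models to tune) grows with the number of languages.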

Sample Mini Project on Twitter

  • In the present case, the goal is to clean the unstructured data and then convert it into a data matrix so that modeling can be applied
  • Data received from Twitter contain many characters that are not needed
  • So, during data cleansing, these strange characters are eliminated
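
As an illustration of removing unwanted characters, here is a minimal sketch. The tutorial's own script is written in R; this is a Python stand-in, not that script. The sample tweets are hypothetical, and the rule of keeping only ASCII text is an assumption that would also drop accented letters.

```python
import re

# Hypothetical raw tweets: one with emoji, one with escaped code-point
# tokens, one with stray whitespace.
raw_tweets = [
    "Loving the new phone \U0001F60D\U0001F60D",
    "Big data is fun <U+00A0><U+00BD> really",
    "  plain tweet with   extra spaces ",
]

ESCAPED_TOKEN = re.compile(r"<[^<>\s]+>")  # tokens such as <U+00A0>
NON_ASCII = re.compile(r"[^\x00-\x7F]+")   # emoji and other non-ASCII runs
WHITESPACE = re.compile(r"\s+")

def clean_tweet(text):
    """Strip escaped tokens and non-ASCII characters, then tidy spacing."""
    text = ESCAPED_TOKEN.sub(" ", text)
    text = NON_ASCII.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()

clean_tweets = [clean_tweet(t) for t in raw_tweets]
```

Note that the `ESCAPED_TOKEN` pattern removes anything written between angle brackets without spaces, so it would need tightening if such text is meaningful in the data.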

  • Emoticons in a tweet are one such case; to clean them, the R script below removes this unstructured data
  • [Figure: R script for cleansing the tweets]
  • The last step of the data cleansing mini project is to obtain cleaned text
  • This text can then be converted to a matrix and passed to an algorithm
  • The text stored in the clean tweets vector is simply converted to a bag-of-words matrix, and an unsupervised learning algorithm is applied to it
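
The final step can be sketched as follows, in Python rather than the R used by the tutorial. The cleaned tweets are hypothetical sample data, and k-means is chosen here only as one example of an unsupervised algorithm; the text above does not say which algorithm is used. The `seeds` parameter is an illustrative device that keeps the result reproducible.

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for the clean tweets vector.
clean_tweets = [
    "big data needs clean data",
    "clean data helps big models",
    "football match tonight",
    "great football match",
]

def bag_of_words(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, seeds, iterations=10):
    """Plain k-means; `seeds` lists the row indices used as starting centroids."""
    centroids = [list(rows[i]) for i in seeds]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its rows.
        for c in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vocab, matrix = bag_of_words(clean_tweets)
labels = kmeans(matrix, seeds=[0, 2])
```

On this toy input, the two tweets about data end up in one cluster and the two about football in the other. A real project would more likely use an established library for both the term matrix and the clustering.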