Friday, 22 December 2017

Data Warehouse Vs Data Lake

Data Generation, Analysis, and Usage – Current Scenario

The last decade has seen an exponential increase in data generated across traditional as well as non-traditional data sources. An International Data Corporation (IDC) report projects that the data generated in the year 2020 alone will be a staggering 40 zettabytes, a 50-fold growth over 2010. An estimated 2.5 quintillion bytes of data are now generated every day, and with the advent of innovations like the Internet of Things, this figure is poised to grow even more rapidly. This surge in data generation, coupled with a growing ability to store the various types of data being produced, has resulted in a vast repository of data that is now available for scrutiny.

Unstructured Data

According to reports by the wealth management firm Merrill Lynch, 80 percent of business-relevant information originates in unstructured form. Unstructured data refers to information that either does not conform to a pre-defined data model or is not organized in a pre-defined manner. This could be images, videos, emails, social media data or even sonar readings. Essentially, these are data points which cannot be captured in traditional relational databases.

Analysis of Unstructured Data

As the ability to store varied data increased, so did our ability to analyze it and derive actionable insights from it. Companies started realizing the significance of analyzing unstructured data alongside structured data and began investing more in it; as a result, the potential benefits that could be harnessed from this previously unusable data became more apparent. The personalized loan offerings from banks, the customized offers from e-commerce sites and the exclusive loyalty discounts offered by retail chains are just a few examples of how organizations have started diving deep into unstructured data to come up with tailored offerings.

This blog post brings out the significance of the two data storage repositories, namely the Data Warehouse and the Data Lake, does a comparative analysis and suggests different approaches to be adopted based on the implementation decision and architecture.

Traditional Data Warehouse Challenges
Storage and Performance:
A Data Warehouse is a conceptual architecture that stores structured, subject-oriented, time-variant, non-volatile data for decision making. Historical as well as real-time data from various sources is transformed and loaded into a structured form.

While a traditional Data Warehouse can act as a master repository for all the structured data across the organization, its inability to store unstructured data prevents it from acting as a unified data source for analytics, hampering its ability to garner value from such data. Because unstructured data constitutes such a large chunk of business-related information, enterprises can no longer afford to neglect it, and leaving this data outside the purview of analytics could prove detrimental.

Moreover, with the exponential increase in the data generated each day, storing it in traditional databases can prove expensive for organizations. And as such humongous volumes accumulate, performance also suffers unless the organization invests ever more heavily in hardware.

Data Quality:
From an implementation standpoint, one of the main challenges a data warehousing project poses pertains to data quality. Combining data from disparate sources often results in duplicates, inconsistencies, missing values and logical conflicts. Varied levels of standardization across the source databases add to the issue. These problems surface at a later stage as faulty reporting and analytics, thereby affecting optimal decision making.
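
As a minimal illustration of the kind of standardization work this implies, the hedged Python sketch below merges two source extracts whose conventions disagree; every table, column and value here is invented for the example rather than taken from any real project.

```python
import pandas as pd

# Two hypothetical source extracts with inconsistent conventions
# (all names and values are invented for illustration).
crm = pd.DataFrame({
    "customer_id": [101, 102, 102],  # note the duplicate row for 102
    "name": ["ACME Corp", "Beta LLC", "Beta LLC"],
    "signup_date": ["2017-03-01", "2017-05-12", "2017-05-12"],
})
billing = pd.DataFrame({
    "customer_id": [101, 103],
    "name": ["Acme Corporation", "Gamma Inc"],  # conflicting spelling for 101
    "signup_date": ["01/03/2017", "22/11/2017"],  # different date format
})

# Standardize before combining: parse each source's date format explicitly.
crm["signup_date"] = pd.to_datetime(crm["signup_date"], format="%Y-%m-%d")
billing["signup_date"] = pd.to_datetime(billing["signup_date"], format="%d/%m/%Y")

# Combine, then drop duplicates on the business key. Note that keep="first"
# silently resolves the conflicting spellings for customer 101 -- a real
# project needs explicit survivorship rules for such logical conflicts.
combined = pd.concat([crm, billing], ignore_index=True)
combined = combined.drop_duplicates(subset="customer_id", keep="first")
print(combined)
```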

Reporting:
By virtue of holding data from across different databases, data warehouse projects often cater to varied reports and analytics as per user demand. Because data warehouses are ‘schema on-write’, such reporting and analytics needs must be taken into the design considerations upfront, since the schema has to be defined before data is loaded into the databases. However, envisioning all such reports at the outset can be difficult for business users who are not exposed to the capabilities of the tools, and this often results in rework for the technical team.
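
To make the schema-on-write constraint concrete, here is a hedged Python/SQLite sketch (with invented table and field names): the warehouse schema must exist before any row can be loaded, while a data-lake-style store simply accepts raw records and defers structure to read time.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on-write: the structure is declared up front and every
# load must conform to it.
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (1, "EMEA", 250.0))

# A record with an unforeseen attribute ("channel") cannot be loaded
# until the schema is remodeled -- the rework described above.
new_record = {"order_id": 2, "region": "APAC", "amount": 99.0, "channel": "mobile"}

# Schema on-read (data-lake style): append the raw record as-is and
# decide on structure only when the data is read for analysis.
with open("raw_sales.json", "a") as lake:
    lake.write(json.dumps(new_record) + "\n")
```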

Change Management:
Because data warehouse projects are structure-driven, they do not adapt easily to change. The effort and resources required to accommodate such changes are invariably exorbitant and will most likely drive up costs significantly. For instance, if a new business requirement emerges at a later point that fundamentally changes the original data structure, it would necessitate remodeling the Data Warehouse, which can be extremely time-consuming.

Read more at http://www.infotrellis.com/data-warehouse-vs-data-lake/

Monday, 18 December 2017

How to access Informatica PowerCenter as a Web Service?

Web Services Overview:

Web Services are services available over the web that enable communication between applications through a standard protocol. To enable this communication, we need a medium (HTTP) and a format (XML/JSON).

There are two parties to a web service, namely the Service Provider and the Service Consumer. The provider develops/implements the application (web service) and makes it available over the internet (web). The provider also publishes an interface for the web service that describes all of its attributes. The Service Consumer consumes the web service; to do so, the consumer has to know the services available, the request and response parameters, how to call the services, and so on.

Hence we can define a Web Service as a standardized way of integrating web-based applications using the XML, SOAP, WSDL and UDDI open standards over an internet protocol backbone. XML is used to tag the data, SOAP is used to transfer the data, WSDL is used to describe the services available and UDDI is used to list what services are available.
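
As a minimal, hedged sketch of what such an exchange looks like on the wire, the Python snippet below posts a hand-built SOAP envelope over HTTP. The endpoint URL, namespace and operation name are placeholders invented for the example; a real consumer would take them from the provider’s WSDL.

```python
import requests

# Placeholder endpoint and operation -- substitute the values published
# in the provider's WSDL before running.
ENDPOINT = "http://example.com/services/WeatherService"

soap_envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:svc="http://example.com/weather">
  <soapenv:Body>
    <svc:GetTemperature>
      <svc:City>Chennai</svc:City>
    </svc:GetTemperature>
  </soapenv:Body>
</soapenv:Envelope>"""

response = requests.post(
    ENDPOINT,
    data=soap_envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "GetTemperature",
    },
)
print(response.status_code)
print(response.text)  # the XML response body returned by the provider
```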

Why Web Services?
Web Services are used mainly for two reasons:

  • Platform-agnostic communication
  • Two different applications can talk to each other and exchange data using web services

PowerCenter Web Services Hub and Web Services
Informatica PowerCenter has the ability to expose a job (workflow) as a SOAP web service, which external applications can call to access its data integration functionality from outside Informatica PowerCenter.
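
As a hedged sketch of how an external application might call such a service, the snippet below uses the third-party zeep SOAP library against a Web Services Hub WSDL. The host, port, operation and parameter names are illustrative assumptions only; the authoritative signatures are defined in the WSDL that your own hub publishes.

```python
from zeep import Client

# Illustrative WSDL location -- the actual URL depends on the Web Services
# Hub host, port and the name of the published service.
WSDL = "http://pc-host:7333/wsh/services/BatchServices/DataIntegration?WSDL"

client = Client(WSDL)

# The operation and parameter names below are assumptions for the sake of
# illustration; consult the WSDL for the exact login signature.
session = client.service.login(
    RepositoryDomainName="ExampleDomain",
    RepositoryName="ExampleRepo",
    UserName="wsh_user",
    Password="secret",
)

# With a valid session, a follow-up call such as startWorkflow(...) would
# launch the published workflow -- again, the WSDL is the source of truth.
```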

This blog post gives an overview of web services, explains how to create web service sources and targets, create workflows and test the functionality from the Web Services Hub.

Read more at http://www.infotrellis.com/how-to-access-informatica-powercenter-as-a-web-service/

Sunday, 10 December 2017

How to integrate Informatica Data Quality (IDQ) with Informatica MDM?

Overview

Data cleansing and standardization is an important aspect of any Master Data Management (MDM) project. Informatica MDM Multi-Domain Edition (MDE) provides a reasonable number of cleanse functions out-of-the-box. However, there are requirements for which the OOTB cleanse functions are not enough and more comprehensive functions are needed to achieve data cleansing and standardization, e.g. address validation or sequence generation. Informatica Data Quality (IDQ) provides an extensive array of cleansing and standardization options and can easily be used along with Informatica MDM.

This blog post describes the various options for integrating Informatica MDM and IDQ and explains the advantages and disadvantages of each approach, to aid in deciding the optimal approach based on the requirements.


Informatica MDM-IDQ Integration Options
There are three options through which IDQ can be integrated with Informatica MDM.

  • Informatica Platform staging
  • IDQ Cleanse Library
  • Informatica MDM as target
Option 1: Informatica Platform Staging


Starting with Informatica MDM’s Multi-Domain Edition (MDE) version 10.x, Informatica has introduced a new feature called “Informatica Platform Staging” within MDM to integrate with IDQ (the Developer tool). This feature enables staging/cleansing data directly into MDM’s Stage tables using IDQ mappings, bypassing the Landing tables.

Advantages

  • Stage tables are immediately available for use in the Developer tool after synchronization, eliminating the need to manually create physical data objects.
  • Changes to the synchronized structures are reflected in the Developer tool automatically.
  • Enables loading data into Informatica MDM’s Stage tables, bypassing the Landing tables.


Read more at http://www.infotrellis.com/integrate-informatica-data-quality-idq-informatica-mdm/
