Thursday, 27 December 2018

Informatica MDM – Suspect Duplicate Process (SDP) Approach


A master data management (MDM) system is installed so that an organization's core data is secure, is accessible by multiple systems as and when required, and does not have multiple copies floating around, giving the business a single source of truth. A solid Suspect Duplicate Process is required in order to achieve a 360-degree view of an entity.

The concept of Suspect Duplicate Processing represents the broad category of activities related to identifying entities that are likely duplicates of each other. Suspect duplicate processing is the process of searching for, matching, creating associations between, and, when appropriate, merging data for existing duplicate party records in the system.

To achieve this functionality, Informatica MDM has come up with its own Suspect Duplicate Processing (SDP) approach. An organization, based on its use case, can opt for either of the following two approaches:


  • Deterministic Matching Approach
  • Fuzzy Matching Approach


Deterministic Matching Approach

Deterministic Matching uses a set of rules, like nested if statements, to run a series of logical tests on the data sets. This is how we determine relationships, hierarchies, and households within a dataset. Deterministic matching seeks a clear “Yes” or “No” result on each and every attribute, based on which we define whether:


  • Two records are duplicates,
  • The pair should be resolved by a data steward, or
  • The records are two unique entities.


It doesn’t leave any room for error and delivers the right result in an ideal scenario. But most of the data in organizations is far from an ideal scenario. These are the cases where the Fuzzy Matching Approach of Informatica comes in handy.
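As a rough illustration of the idea (not Informatica's implementation; the field names and rule order below are assumptions), a deterministic pass compares attributes for exact equality and returns a hard decision:

import java.util.Objects;

// Illustrative sketch of deterministic matching: exact, rule-by-rule comparisons with a hard outcome.
public class DeterministicMatchSketch {

    enum Outcome { DUPLICATE, STEWARD_REVIEW, UNIQUE }

    static class PartyRecord {
        final String taxId, email, lastName, postalCode;
        PartyRecord(String taxId, String email, String lastName, String postalCode) {
            this.taxId = taxId; this.email = email; this.lastName = lastName; this.postalCode = postalCode;
        }
    }

    // Nested if-style rules: each attribute comparison is a strict yes/no test.
    static Outcome match(PartyRecord a, PartyRecord b) {
        if (Objects.equals(a.taxId, b.taxId) && a.taxId != null) {
            return Outcome.DUPLICATE;                                  // strongest identifier agrees exactly
        }
        if (Objects.equals(a.email, b.email) && a.email != null) {
            if (Objects.equals(a.lastName, b.lastName)) {
                return Outcome.DUPLICATE;                              // email and last name agree exactly
            }
            return Outcome.STEWARD_REVIEW;                             // partial agreement: data steward decides
        }
        if (Objects.equals(a.lastName, b.lastName) && Objects.equals(a.postalCode, b.postalCode)) {
            return Outcome.STEWARD_REVIEW;                             // weak agreement: data steward decides
        }
        return Outcome.UNIQUE;                                         // no rule fired: two unique entities
    }

    public static void main(String[] args) {
        PartyRecord a = new PartyRecord("111-22-3333", "j.smith@example.com", "Smith", "78759");
        PartyRecord b = new PartyRecord(null, "j.smith@example.com", "Smith", "78759");
        System.out.println(match(a, b));   // DUPLICATE (email and last name agree exactly)
    }
}

Every comparison is binary, which is why deterministic matching works well only when the data is clean and complete.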

Learn more at http://www.infotrellis.com/informatica-mdm-fuzzy-matching/

Tuesday, 25 December 2018

Mastech InfoTrellis - Experts in Big Data Analytics

Mastech InfoTrellis’ diverse expertise in the Big Data space has helped global enterprises advance their Big Data initiatives.

Big Data Analytics Hub

Mastech InfoTrellis offers a managed Big Data Analytics Hub solution centered on Hadoop, which enables customers to consolidate multi-channel data of various formats into a single source. The Big Data Analytics Hub enables self-service analytics by different business functions.

AllSight – Customer Intelligence Management

The AllSight Customer Intelligence Management System delivers an Enterprise Customer 360 by ingesting structured and unstructured data from disparate data sources across the organization.

IBM Big Data Solutions

IBM Big Data Solutions combine open-source Hadoop and Spark for the open enterprise to cost-effectively analyze and manage big data. With BigInsights, you spend less time creating an enterprise-ready Hadoop infrastructure and more time gaining valuable insights. IBM provides a complete solution, including Spark, SQL, Text Analytics and more, to scale analytics quickly and easily.

Learn more at http://www.infotrellis.com/big-data/

Saturday, 22 December 2018

Data Management and IBM IIS Tools

As per a study conducted by a leading market research and advisory company, the data we have generated in the past two years is many times more than what we generated over the previous two decades. Data has not just multiplied; it has also become more complex and varied, and it is being generated at a much faster rate than ever before. These factors present a data integration challenge for industries and businesses: they must utilize their data better to build strategies, provide services and introduce policy regulations, so that the business is empowered to bridge, or completely close, the gap between data and analytics.
IBM has always been innovative and technology-driven; in fact, it is a pioneer in data integration and management technologies. It has consistently provided businesses with the right set of tools, and IBM MDM (Master Data Management) is the best example of that. Besides MDM, IBM also has IIS (InfoSphere Information Server) in its quiver to target the data integration and management challenges that almost every line of business encounters in this age.
This blog aims to provide an overview of the IBM IIS suite and how it can support your business's data integration demands, enabling better resource utilization and helping you find the right set of tools to address key business challenges.

http://www.infotrellis.com/data-management-ibm-iis-tools/

Wednesday, 19 December 2018

Best Practices in Data Validation

Data quality is the buzzword of the digital age.

What is data quality and why is it so important?

“Data quality” is a term that often stays hidden but plays an important role in many streams of work. Data plays a vital role in capturing a marketplace, especially in the enterprise data management space.

Data Quality Examples

Following are some examples which emphasize the need for data quality.
  • A customer shouldn’t be allowed to enter his age in a field where he has to mention his marital status.
  • When a customer enters a store, there is a high possibility that he fills in the forms with incomplete or incorrect details; for example, in a hurry he may not mention a correct phone number.
  • There is also a possibility that the billing staff wrongly enter the store address as a default in place of the customer’s address, which contributes to bad-quality data getting persisted in the system.
This data may be crucial: the customer might not just be a guest customer, yet the customer’s real interest in the store becomes obscured. Simple field-level checks, like the sketch below, can catch such issues early.
This blog post covers data quality, the significance of data quality, its business impacts, best practices to be followed, and Mastech InfoTrellis’ specialization in data validation.
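As a hedged illustration of those entry-time checks (the allowed values, phone format and store address below are invented for the example, not rules from the post), a few simple form-level validations in Java might look like this:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simple form-level checks mirroring the examples above; formats and the store address are assumptions.
public class FormValidationSketch {

    static final List<String> MARITAL_STATUSES = Arrays.asList("SINGLE", "MARRIED", "DIVORCED", "WIDOWED");
    static final Pattern PHONE = Pattern.compile("\\d{10}");            // e.g., 10-digit phone number
    static final String STORE_DEFAULT_ADDRESS = "9390 Research Blvd, Austin, TX";

    static List<String> validate(String maritalStatus, String phone, String address) {
        List<String> issues = new java.util.ArrayList<>();
        if (!MARITAL_STATUSES.contains(maritalStatus)) {
            issues.add("Marital status must be one of " + MARITAL_STATUSES + ", not free text such as an age");
        }
        if (phone == null || !PHONE.matcher(phone).matches()) {
            issues.add("Phone number is missing or malformed");
        }
        if (STORE_DEFAULT_ADDRESS.equalsIgnoreCase(address)) {
            issues.add("Customer address equals the store's default address; likely a billing-staff shortcut");
        }
        return issues;
    }

    public static void main(String[] args) {
        // '42' was typed into the marital-status field and the store address was reused for the customer.
        System.out.println(validate("42", "512358139", "9390 Research Blvd, Austin, TX"));
    }
}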
http://www.infotrellis.com/best-practices-data-validation/

Monday, 10 December 2018

Why Big Data in Healthcare is so essential

“Data analytics” refers to the practice of taking masses of aggregated data and analyzing it to draw out the important insights and information contained in it. This process is increasingly aided by new software and technology that helps examine large volumes of data for hidden information that can help us in many areas, and healthcare is one of those areas.

80% of all healthcare information is unstructured data, which is so vast and complex that it needs specialized methods and tools to make meaningful use of it. New and emerging technologies like artificial intelligence (AI), machine learning, and predictive analytics are giving healthcare technologists and thought leaders powerful tools to capture this data and process it effectively and efficiently for a complete transformation of the healthcare industry. Physician decisions are becoming increasingly evidence-based, meaning that they rely on broad swathes of research and clinical information rather than solely on their training and professional opinion. This new approach to treatment means there is greater demand for big data analytics in healthcare facilities than ever before. There is little doubt that big data has emerged as a game changer that can help the healthcare industry advance to another level.

Read full article at http://www.infotrellis.com/big-data-analytics-augmented-patient-care/


Overview of Informatica PowerCenter Web Service

Web Services Overview:
Web services are services available over the web that enable communication and provide a standard protocol for it. To enable the communication, we need a medium (HTTP) and a format (XML/JSON).

There are two parties to web services, namely the service provider and the service consumer. A web service provider develops/implements the application (the web service) and makes it available over the internet (the web). The service provider publishes an interface for the web service that describes all of its attributes. The service consumer consumes the web service. For the consumer to consume the web service, it has to know which services are available, the request and response parameters, how to call the services, and so on.

Hence we can define a web service as a standardized way of integrating web-based applications using the XML, SOAP, WSDL and UDDI open standards over an Internet protocol backbone. XML is used to tag the data, SOAP is used to transfer the data, WSDL is used to describe the services available, and UDDI is used to list which services are available.
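To make the moving parts concrete, here is a minimal, hedged Java sketch of a consumer posting a SOAP envelope over HTTP; the endpoint URL, namespace and operation name are placeholders that would normally come from the provider's WSDL:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class SoapClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and operation; real values come from the service's WSDL.
        URL endpoint = new URL("http://example.com/services/CustomerLookup");
        String envelope =
            "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\""
          + " xmlns:cus=\"http://example.com/customer\">"
          + "<soapenv:Body><cus:getCustomer><cus:id>12345</cus:id></cus:getCustomer></soapenv:Body>"
          + "</soapenv:Envelope>";

        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");                        // SOAP requests travel over HTTP POST
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setRequestProperty("SOAPAction", "getCustomer"); // operation name advertised in the WSDL
        conn.setDoOutput(true);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(envelope.getBytes(StandardCharsets.UTF_8));   // XML tags the data, SOAP carries it
        }

        // Read the XML response returned by the provider.
        try (Scanner sc = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            while (sc.hasNextLine()) {
                System.out.println(sc.nextLine());
            }
        }
    }
}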

Learn more at, http://www.infotrellis.com/how-to-access-informatica-powercenter-as-a-web-service/

Tuesday, 4 December 2018

Informatica Power Center Solutions

Promote automation, reuse and agility with the industry's only fully integrated, end-to-end enterprise data integration platform.

Informatica's modern data integration infrastructure combines advanced hybrid data integration capabilities and centralized governance with flexible self-service business access for analytics. Its robust, integrated, codeless environment lets teams collaboratively connect systems and transform and integrate data at any scale and any speed.

Read full article at http://www.infotrellis.com/informatica-data-integration/

Monday, 3 December 2018

Big Data Analytics & Data Management Services

Mastech InfoTrellis’ diverse expertise in the Big Data space has helped global enterprises advance their Big Data initiatives.

Big Data Analytics Hub
Mastech InfoTrellis offers a managed Big Data Analytics Hub solution centered on Hadoop, which enables customers to consolidate multi-channel data of various formats into a single source. The Big Data Analytics Hub enables self-service analytics by different business functions.

AllSight — Customer Intelligence Management
The AllSight Customer Intelligence Management System delivers an Enterprise Customer 360 by ingesting structured and unstructured data from disparate data sources across the organization.

IBM Big Data Solutions
IBM Big Data Solutions combine open-source Hadoop and Spark for the open enterprise to cost-effectively analyze and manage big data. With BigInsights, you spend less time creating an enterprise-ready Hadoop infrastructure and more time gaining valuable insights. IBM provides a complete solution, including Spark, SQL, Text Analytics and more, to scale analytics quickly and easily.

Read full story at http://www.infotrellis.com/big-data/

Sunday, 2 December 2018

Enterprise Data Integration Services


Using niche technologies, Mastech InfoTrellis enables customers to extract, transform and load data from disparate source systems into centralized data repositories such as a Master Data Management hub or a Big Data and Analytics hub.


  • ETL performance tuning
  • Metadata management
  • Data quality monitoring
  • Cross-platform integration
  • Data modelling
  • Data profiling


Our Solutions

  • Informatica Intelligent Data Integration
  • Informatica Intelligent Cloud Services
  • Collibra Data Governance


Learn more at http://www.infotrellis.com/enterprise-data-integration/

Friday, 30 November 2018

Automate Informatica Data Quality (IDQ)

Data Quality – Overview
Data quality is the process of understanding the quality of data attributes such as data types, data patterns, existing values, and so on. Data quality is also about capturing a score for an attribute based on specific constraints. For example, get the count of records for which the attribute value is NULL, or find the count of records for which a date attribute does not fit the specified date pattern.
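As a minimal sketch of what such attribute-level scoring computes (the attribute name, sample values and date pattern are assumptions, and this is plain Java rather than IDQ itself):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

public class AttributeScoreSketch {
    public static void main(String[] args) {
        // Hypothetical sample values for a "birth_date" attribute; a real profile would read them from a table.
        List<String> birthDates = Arrays.asList("1980-02-11", null, "13/05/1975", "1991-07-30", null);
        DateTimeFormatter expected = DateTimeFormatter.ofPattern("yyyy-MM-dd");

        long nullCount = 0;
        long badPatternCount = 0;
        for (String value : birthDates) {
            if (value == null || value.trim().isEmpty()) {
                nullCount++;                          // completeness check: missing values
                continue;
            }
            try {
                LocalDate.parse(value, expected);     // conformity check: value must match the date pattern
            } catch (DateTimeParseException e) {
                badPatternCount++;
            }
        }

        double completeness = 100.0 * (birthDates.size() - nullCount) / birthDates.size();
        System.out.printf("NULL records: %d, pattern violations: %d, completeness score: %.1f%%%n",
                nullCount, badPatternCount, completeness);
    }
}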

Managing your Data Quality
This means that we can weigh the quality of data to any extent, irrespective of whether the available data is good or bad. The data quality report can be captured with complete data details, at the record level or even at the attribute level. Using this report, the business can identify the quality of its data and work out how it can be used to help and benefit the customer. A plan can also be worked out to enhance the quality of the data by applying business rules and correcting the required information based on business needs.

This blog post aims to bring out the significance of data quality, data quality report generation, and the steps involved in automating the data quality report using the scheduler feature of Informatica IDQ.

Deriving Quality Data
There are tools in the market that generate these data quality reports from the input data we provide, configured with specific business rules. A leading solution for data quality report generation is Informatica IDQ, which is designed to generate both profiling reports and data quality reports.

Read full article at http://www.infotrellis.com/automate-data-quality-informatica-idq/

Thursday, 22 November 2018

Informatica MDM Solution - Mastech Infotrellis

A master data management (MDM) system is installed so that an organization's core data is secure, is accessible by multiple systems as and when required, and does not have multiple copies floating around, giving the business a single source of truth. A solid Suspect Duplicate Process is required in order to achieve a 360-degree view of an entity.

The concept of Suspect Duplicate Processing represents the broad category of activities related to identifying entities that are likely duplicates of each other. Suspect duplicate processing is the process of searching for, matching, creating associations between, and, when appropriate, merging data for existing duplicate party records in the system.

To achieve this functionality, Informatica MDM has come up with its own Suspect Duplicate Processing (SDP) approach. An organization, based on its use case, can opt for either of the following two approaches:


  • Deterministic Matching Approach
  • Fuzzy Matching Approach


Deterministic Matching Approach

Deterministic Matching uses a set of rules, like nested if statements, to run a series of logical tests on the data sets. This is how we determine relationships, hierarchies, and households within a dataset. Deterministic matching seeks a clear “Yes” or “No” result on each and every attribute, based on which we define whether:


  • Two records are duplicates,
  • The pair should be resolved by a data steward, or
  • The records are two unique entities.


It doesn’t leave any room for error and delivers the right result in an ideal scenario. But most of the data in organizations is far from an ideal scenario. These are the cases where the Fuzzy Matching Approach of Informatica comes in handy.

Read full article at http://www.infotrellis.com/informatica-mdm-fuzzy-matching/

Tuesday, 20 November 2018

Why is Master Data Management important?

Mastech InfoTrellis offers best-of-breed Master Data Management services, enabling customers to harness the power of their master data. Mastech InfoTrellis has successfully delivered Master Data Management projects time and again over the past decade.

  • Performance tuning
  • Production support
  • Health check
  • Solution architecture
  • Needs assessment
  • Program strategy & roadmap
  • Solution upgrade
  • Design and development

Our Solutions

  • IBM InfoSphere Master Data Management
  • Cloud Customer 360 for Salesforce
  • Informatica Intelligent Master Data Management
  • IBM PIM for Manufacturing

Learn more at http://www.infotrellis.com/master-data-management/

Saturday, 17 November 2018

Best Practices for Master Data Management



Mastech InfoTrellis offers best-of-breed Master Data Management services, enabling customers to harness the power of their master data. Mastech InfoTrellis has successfully delivered Master Data Management projects time and again over the past decade.

For more information https://bit.ly/2TmCvCj

Tuesday, 23 October 2018

Interfacing Virtual MDM through DataStage

The MDM Connector stage is a key that opens the door to IBM Virtual MDM. Yes, we can manipulate the data in MDM (MDM refers to IBM Virtual MDM in this post) using the MDM Connector stage, which was introduced in IBM DataStage v11.3.
We know that loading data into MDM is not an easy task, since it involves many tables and the relationships among the tables must be maintained properly; otherwise we will end up dealing with junk rather than data. The MDM Connector stage makes this task simpler by allowing us to configure everything in a single configuration window.
This blog post details how the basic operations (read/write) on data can be performed using the Connector stage in v11.5.

http://www.infotrellis.com/interfacing-virtual-mdm-datastage/

Monday, 15 October 2018

Best Master Data Management Strategy

Mastech InfoTrellis offers best-of-breed Master Data Management services, enabling customers to harness the power of their master data. Mastech InfoTrellis has successfully delivered Master Data Management projects time and again over the past decade.


IBM InfoSphere Master Data Management

IBM InfoSphere Master Data Management (MDM) manages all aspects of your critical enterprise data, no matter what system or model, and delivers it to your application users in a single, trusted view. It provides actionable insight, instant alignment with business value, and compliance with data governance rules and policies across the enterprise.

Cloud Customer 360 for Salesforce

Informatica Cloud Customer 360 for Salesforce eradicates duplicate, inaccurate, and incomplete account and contact records. It provides clean, trusted data, increases Salesforce user adoption, and boosts ROI.

Informatica Intelligent Master Data Management

A complete master data management solution addresses the critical business objectives digital organizations face. Informatica MDM offers the only true end-to-end solution, with a modular approach to ensure better customer experience, decision making and compliance.

IBM PIM For Manufacturing

Establish a single view of product information for strategic business initiatives with the IBM Product Information Management (PIM) solution. The IBM PIM solution enables a service-oriented architecture, provides a flexible data model, aligns with existing business processes, and scales to suit the growing product landscape of an organization.

Contact:
9390 Research Blvd., Suite 330
Austin, TX 78759
United States
Phone: +1-512-358-1396
Website: http://www.infotrellis.com

Tuesday, 11 September 2018

Big Data Management System

Mastech InfoTrellis’ diverse expertise in the Big Data space has helped global enterprises advance their Big Data initiatives.

Big Data Management System

Our Solutions

Big Data Analytics Hub
Mastech InfoTrellis offers a managed Big Data Analytics Hub solution centered on Hadoop, which enables customers to consolidate multi-channel data of various formats into a single source. The Big Data Analytics Hub enables self-service analytics by different business functions.

AllSight – Customer Intelligence Management

The AllSight Customer Intelligence Management System delivers an Enterprise Customer 360 by ingesting structured and unstructured data from disparate data sources across the organization.

IBM Big Data Solutions

IBM Big Data Solutions combine open-source Hadoop and Spark for the open enterprise to cost-effectively analyze and manage big data. With BigInsights, you spend less time creating an enterprise-ready Hadoop infrastructure and more time gaining valuable insights. IBM provides a complete solution, including Spark, SQL, Text Analytics and more, to scale analytics quickly and easily.


Monday, 3 September 2018

IBM MDM BatchProcessor – Tips for better throughput

MDM BatchProcessor is a multi-threaded J2SE client application used in most MDM implementations to load large volumes of enterprise data into MDM during initial and delta loads. Oftentimes, processing large volumes of data can cause performance issues during the batch processing stage, thus bringing down the TPS (transactions per second).

Poor performance of the batch processor often disrupts the data load process and impacts go-live plans. Unfortunately, there is no panacea for this common problem. Let us help you by highlighting some of the potential root causes that influence BatchProcessor performance. We will suggest remedies for each of these bottlenecks later in this blog.

Infrastructure Concerns
Any complex, business-critical enterprise application needs careful planning, well ahead of time, to achieve optimal performance, and MDM is no exception. During the development phase it is perfectly fine to host MDM, the DB server and BatchProcessor all on one physical server. But the world doesn’t stop at development. The sheer volume of data MDM will handle in production requires a carefully thought-out infrastructure plan. Besides, when these applications run in shared environments, profiling, benchmarking and debugging become a tedious affair.

CPU Consumption
BatchProcessor can consume a lot of precious CPU cycles in the most trivial of operations when it is not configured properly. Keeping an eye out for persistently high CPU consumption and sporadic surges is vital to ensure the CPU is used optimally by BatchProcessor.

Deadlock
Deadlocks are one of the frequent issues encountered during batch processing in multi-threaded mode. Increasing the submitter thread count beyond the recommended value might lead to deadlocks.

Stale Threads
As discussed earlier, a poorly configured BatchProcessor might open up Pandora’s box. Stale threads can be a side effect of the thread count configuration in BatchProcessor. Increasing the submitter, reader and writer threads beyond the recommended numbers may cause some of the threads to wait indefinitely, wasting precious system resources.

100% CPU Utilization
“Cancel Thread” is one of the BatchProcessor daemon threads, designed to gracefully shut down BatchProcessor when the user intends to. Being a daemon thread, it is alive during the natural lifecycle of the BatchProcessor. But the catch is that it hogs nearly 90% of CPU cycles for a trivial operation, bringing down performance.

Let us have a quick look at the UserCancel thread in the BatchProcessor client. The thread waits indefinitely for user interruption, checking for it every 2 seconds while holding on to the CPU the whole time.
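The original post walks through the actual UserCancel code; as a stand-in, here is a simplified Java reconstruction (not IBM's source) contrasting a polling watcher that burns CPU with a blocking alternative that parks the thread until input actually arrives:

// Simplified reconstruction of a cancel-watcher thread; NOT IBM's actual BatchProcessor source.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class CancelThreadSketch {

    // Busy-wait style: polls System.in in a tight loop, holding the CPU between checks.
    static Thread busyWaitWatcher(Runnable shutdownHook) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    if (System.in.available() > 0) {    // poll; returns immediately
                        shutdownHook.run();
                        return;
                    }
                    // no sleep/park here: the loop spins and consumes CPU cycles
                }
            } catch (IOException ignored) { }
        }, "UserCancel-busy");
        t.setDaemon(true);
        return t;
    }

    // Blocking style: readLine() parks the thread until the user actually types something.
    static Thread blockingWatcher(Runnable shutdownHook) {
        Thread t = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
                in.readLine();                           // blocks without consuming CPU
                shutdownHook.run();
            } catch (IOException ignored) { }
        }, "UserCancel-blocking");
        t.setDaemon(true);
        return t;
    }
}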

Read full article at https://bit.ly/2Nvx3Nh

Monday, 27 August 2018

MDM for Regulatory Compliance in the Banking Industry

Banking Regulations – Overview
Managing regulatory issues and risk has never been so complex. Regulatory expectations continue to rise with increased emphasis on the institution’s ability to respond to the next potential crisis. Financial Institutions continue to face challenges implementing a comprehensive enterprise-wide governance program that meets all current and future regulatory expectations. There has been a phenomenal rise in expectations related to data quality, risk analytics and regulatory reporting.
Following are some of the US regulations for which MDM and customer 360 reports can be used to support compliance:
FATCA (Foreign Account Tax Compliance Act)
FATCA was enacted to target non-compliance by U.S. taxpayers using foreign accounts. The objective of FATCA is the reporting of foreign financial assets. The ability to align all key stakeholders, including operations, technology, risk, legal, and tax, is critical to successfully comply with FATCA.
OFAC (Office of Foreign Asset Control)
The Office of Foreign Assets Control (OFAC) administers a series of laws that impose economic sanctions against hostile targets to further U.S. foreign policy and national security objectives. The bank regulatory agencies should cooperate in ensuring financial institutions comply with the Regulations.
FACTA (Fair and Accurate Credit Transactions Act)
Its primary purpose is to reduce the risk of identity theft by regulating how consumer account information (such as Social Security numbers) is handled.
HMDA (Home Mortgage Disclosure Act)
This Act requires financial institutions to provide mortgage data to the public. HMDA data is used to identify probable housing discrimination in various ways.
Dodd Frank Regulations
The primary goal of the Dodd-Frank Wall Street Reform and Consumer Protection Act was to increase financial stability. This law places major regulations in the financial industry.
Basel III
A wide sweeping international set of regulations that many US banks must adhere to is Basel III. Basel III is a comprehensive set of reform measures, developed by the Basel Committee on Banking Supervision, to strengthen the regulation, supervision and risk management of the banking sector.
What do banks need to meet regulatory requirements?
To meet the regulatory requirements described in the previous section, banks need an integrated systems environment that addresses requirements such as enterprise-wide data access, a single source of truth for customer details, customer identification programs, data auditability and traceability, customer data synchronization across multiple heterogeneous operational systems, ongoing data governance, and risk and compliance reports.

What You May Be Missing by Not Monitoring Your MDM Hub

Organizations spend millions of dollars to implement their MDM solution. They may have different approaches (batch vs. real time; integrated customer view vs. integrated supplier view etc.) – but in general they all expect to get a “one version of the truth” view by integrating different data sources and then providing that integrated view to a variety of different users.

After the completion and successful testing of the MDM implementation project, companies sit back and enjoy the benefits of their MDM hub, and more often than not don’t even think about looking under the hood. It never occurs to them that they could be gaining insights into what’s happening inside that MDM hub by asking questions like:

  • How is the data quality changing?
  • What are the primary activities (in processing time) inside the MDM hub?
  • How are service levels changing?

However, organizations change, people change, requirements change – impacting what is happening inside the MDM Hub. Such changes can open up significant opportunities for an organization – but without doing any sort of investigation that opportunity is typically not recognized.

Here are two examples – diagnosed through the use of an MDM audit tool:

  • The company’s MDM hub had approximately 100,000 incorrect customer addresses. These addresses were used for regular mailings; the mailings generated (in the case of a correct address) incremental revenues. Impact on the business related to just one mailing:
    • $400K wasted on mailing cost ($4 is a conservative mailing cost per person, covering postage, printing of the mailer, etc.)
    • $100K of immediately lost revenue (past data shows that one in 50 customers spends about $50 immediately following the mailing)
    • The longer-term revenue lost was not assessed, but was estimated to be well over $400K
    • The opportunity: a cost saving of $400K and a revenue increase of $500K or more
  • At a different company, by analyzing data processed by week, the resulting report determined that the number of new customers processed had been declining by 1-2% every week, starting about six weeks before the audit was conducted. A deeper review of the audit report suggested that:
    • The original service levels related to customer file changes had been getting worse and worse over that same time period
    • As customer file changes (as per the audit report) took over 85% of the total processing time, the slower processing left less time available for new customer processing
    • This initial diagnosis was confirmed by the client: they had a slowly growing backlog of new customer files
    • Ultimately the audit was able to highlight which input data source had been causing the slowdown, allowing the company to resolve the problem at its source
    • Business impact: a major risk (a very significant slowdown in new customer set-up) was eliminated before it became a real problem

Read full story at https://bit.ly/2PTp0Ix

Monday, 20 August 2018

Blueprint for a successful Data Quality Program

Data Quality – Overview

Corporates have started to realize that the data accumulated over the years is proving to be an invaluable asset for the business. The data is analyzed, and strategies are devised for the business based on the outcome of the analytics. The accuracy of the predictions, and hence the success of the business, depends on the quality of the data upon which the analytics is performed. So it becomes all the more important for the business to manage data as a strategic asset so that its benefits can be fully exploited.
This blog aims to provide a blueprint for a highly successful data quality project, the practices to be followed for improving data quality, and how companies can make the right data-driven decisions by following these best practices.

Source Systems and Data Quality Measurement

To measure the quality of the data, third-party data quality tools should hook on to the source system and measure the data quality. A detailed discussion with the owners of the systems identified for data quality measurement needs to be undertaken at a very early stage of the project. Many system owners may not have an issue with allowing a third-party data quality tool to access their data directly.
But some systems are subject to regulatory compliance, because of which the system owners will not permit other users or applications to access their systems directly. In such a scenario, the system owner and the data quality architect will have to agree upon the format in which the data will be extracted from the source system and shared with the data quality measurement team for assessing the data quality.
Some of the data quality tools that are leaders in the market are Informatica, IBM, SAP, SAS, Oracle, Syncsort and Talend.
The data quality architecture should be flexible enough to absorb data from such systems in any standard format, such as CSV files, APIs, and messages. Care should be taken that the data being made available for data quality measurement is extracted and shared in an automated way.

Environment Setup

If the data quality tool is going to connect directly to the source system, evaluating the source systems’ metadata across the various environments is another important activity that should be carried out in the initial days of the data quality measurement program. The tables or objects that hold the source data should be identical across the different environments. If they are not identical, then decisions should be taken to sync them up across environments, and this should be completed before the developers are on-boarded onto the project.
If the data quality team is going to receive data in the form of files, then the location in which the files or data will be shared should be identified, and the shared location created with the help of the infrastructure team. Also, the data quality tool should be configured so that it can read the files available in the shared folder.

Monday, 13 August 2018

Connecting MongoDB using IBM DataStage

Introduction

MongoDB is an open-source, document-oriented, schema-less database system. It does not organize data using the rules of a classical relational data model. Unlike relational databases, where data is stored in columns and rows, MongoDB is built on an architecture of collections and documents. One collection holds different documents and functions. Data is stored in the form of JSON-style documents. MongoDB supports dynamic queries on documents using a document-based query language, much as SQL does for relational data.

This blog post explains how MongoDB can be integrated with IBM DataStage with an illustration.

Why MongoDB?
For the past two decades we have been using relational databases as the data store, as they were the only option available. But with the introduction of NoSQL, we have more options based on the requirement. MongoDB is predominantly used in the insurance and travel industries.

We can extract any semi-structured data and load it into MongoDB through any of the integration tools. Also, extracting from MongoDB is easier and faster compared to relational databases.

MongoDB integration with IBM DataStage
Since the IBM DataStage tool does not have a specific external stage for MongoDB, we use the Java Integration stage to load data into or extract data from MongoDB (a minimal sketch of the underlying driver calls follows the prerequisites below).

Since MongoDB is a schema-free database, we can take structured or semi-structured data extracted through DataStage and load it into MongoDB.

Prerequisites

  • Make sure you have Java installed on your machine.
  • Install the Eclipse tool.
  • Java requires the below MongoDB jar to be imported into the package in order to use MongoDB functions:
    • mongo-java-driver-2.11.3.jar, or a higher version if available (download it from the internet)
  • Also, Java requires the below jar file to be imported into the package to extract or load data from DataStage:
    • jar (it is available on the DataStage server at /opt/IBM/InformationServer/Server/DSEngine/java/lib)
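For reference, the driver calls that the Java Integration stage code wraps look roughly like the sketch below. This is a minimal example against the legacy 2.x mongo-java-driver API; the host, port, database and collection names are placeholders:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class MongoLoadSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a local mongod on the default port; host/port/database names are placeholders.
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            DB db = client.getDB("staging");
            DBCollection customers = db.getCollection("customers");

            // Load: schema-less insert -- the document carries its own structure.
            BasicDBObject doc = new BasicDBObject("customerId", "C1001")
                    .append("name", "John Smith")
                    .append("city", "Austin");
            customers.insert(doc);

            // Extract: dynamic query on the documents just written.
            DBCursor cursor = customers.find(new BasicDBObject("city", "Austin"));
            while (cursor.hasNext()) {
                System.out.println(cursor.next());
            }
        } finally {
            client.close();
        }
    }
}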


Illustration of a DataStage job
Create a job in DataStage to parse the sample XML shown in the full post.




Read more steps at http://www.infotrellis.com/connecting-mongodb-using-ibm-datastage/

Tuesday, 7 August 2018

How to Match Tweets to Customer Records

Many organizations are analyzing Tweets for various purposes such as sentiment at an aggregate level.  For example, “generally what are people saying about us in the Twitter universe?”  This is a good baby step into Big Data Analytics but where organizations want to get to is “what is my customer John Smith saying about us?”  This customer-level analytics is much more valuable as it allows the organization to serve the customer better, identify “market of one” opportunities and so on.
You have to match Tweets to customer records as a prerequisite to such analytics. So what are the considerations in doing so? It is a key capability of MDM hubs to address the problem of matching customers together by using structured data sourced from internal systems within the organization and applying traditional deterministic and/or probabilistic matching techniques. But the problem shifts dramatically when trying to match Big Data together. You need to re-think the solution, given that the problem has changed.
Many are familiar with Twitter and Tweets. What some don’t know is that there is a set of metadata distributed with each Tweet. Some of it is useful for matching purposes, such as the user’s name, the Tweet timestamp, high-level location information and so on. This information, along with information in the text of the Tweet triangulated with internal information, can yield high-quality matches.
So below are some considerations in matching Tweets to internal customer records.
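As a hedged sketch of the kind of triangulation described above (the weights, threshold and field names are invented for illustration and are not a product feature), a simple scorer over the Tweet metadata might look like this:

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: the fields, weights and threshold are assumptions, not a product capability.
public class TweetMatchSketch {

    static class Customer {
        final String id, fullName, city;
        Customer(String id, String fullName, String city) { this.id = id; this.fullName = fullName; this.city = city; }
    }

    static class Tweet {
        final String displayName, location, text;
        Tweet(String displayName, String location, String text) { this.displayName = displayName; this.location = location; this.text = text; }
    }

    static String norm(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("[^a-z ]", "").trim();
    }

    // Small score over the Tweet metadata mentioned above: display name, location, Tweet text.
    static int score(Tweet t, Customer c) {
        int score = 0;
        if (norm(t.displayName).equals(norm(c.fullName))) score += 60;          // user's name matches exactly
        if (norm(t.location).contains(norm(c.city))) score += 30;               // high-level location agrees
        if (norm(t.text).contains(norm(c.fullName).split(" ")[0])) score += 10; // triangulate with Tweet text
        return score;
    }

    public static void main(String[] args) {
        List<Customer> customers = Arrays.asList(
                new Customer("C1", "John Smith", "Toronto"),
                new Customer("C2", "Jane Doe", "Austin"));
        Tweet tweet = new Tweet("John Smith", "Toronto, ON", "Great service at my local branch today!");

        for (Customer c : customers) {
            int s = score(tweet, c);
            System.out.printf("%-10s -> score %d (%s)%n", c.fullName, s, s >= 70 ? "candidate match" : "no match");
        }
    }
}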

http://www.infotrellis.com/how-to-match-tweets-to-customer-records/

Thursday, 2 August 2018

Approaching Data as an Enterprise Asset

If you walk into a meeting with all your senior executives and pose the question:
“Do you consider and treat your data as an Enterprise Asset?”
The response you will get is:
“Of course we do.”
The problem in most organizations, however, is that while it is recognized that data is a corporate asset, the practices surrounding the data do not support that automatic response of “Yes, we do.”
What does it really mean, to treat your data as an enterprise asset? 

http://www.infotrellis.com/approaching-data-as-an-enterprise-asset/

Friday, 29 June 2018

Predictive Analytics in the Retail Industry

Predictive analytics is the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened to provide the best assessment of what will happen in the future. Predictive analytics is used in many sectors to make future predictions with the help of machine learning and artificial intelligence.

http://www.infotrellis.com/predictive-analytics-in-the-retail-industry/

Big Data Analytics and augmented patient care

Why is Big Data in Healthcare so essential?


“Data analytics” refers to the practice of taking masses of aggregated data and analyzing it to draw out the important insights and information contained in it. This process is increasingly aided by new software and technology that helps examine large volumes of data for hidden information that can help us in many areas, and healthcare is one of those areas.
80% of all healthcare information is unstructured data, which is so vast and complex that it needs specialized methods and tools to make meaningful use of it. New and emerging technologies like artificial intelligence (AI), machine learning, and predictive analytics are giving healthcare technologists and thought leaders powerful tools to capture this data and process it effectively and efficiently for a complete transformation of the healthcare industry. Physician decisions are becoming increasingly evidence-based, meaning that they rely on broad swathes of research and clinical information rather than solely on their training and professional opinion. This new approach to treatment means there is greater demand for big data analytics in healthcare facilities than ever before. There is little doubt that big data has emerged as a game changer that can help the healthcare industry advance to another level.

http://www.infotrellis.com/big-data-analytics-augmented-patient-care/

Friday, 22 June 2018

MDM Project Management in Emerging Economies (Africa)

Introduction

As inter- and intra-business communications become ever more complex, the need to successfully manage time and resources becomes key. That, coupled with huge investment in infrastructure and services, means the demand for project management (PM) expertise in Africa has expanded significantly across a wide assortment of industries. Africa is a market of middle-class consumers that is expected to reach 1.1 billion by 2060, and organizations will need to maintain the master data of those customers. The best option is to use Master Data Management tools, and this will lead to more development projects and eventually more project managers to make sure everyone works together as a well-coordinated team. Currently the need for project management in Africa is high, but MDM project managers are rare.

Project Managers in Africa

Africa is an emerging economy, and it is our responsibility to educate African clients about the importance of top-notch IT infrastructure and MDM implementations, which can do wonders by streamlining business processes. This knowledge should be circulated through a point of contact at the client location who blends technical knowledge with a flavor of sales and understands the business, and that person can be none other than the project manager. Africa can cross the hurdle to a new picture of development if project management is treated as an important milestone; to achieve this, everyone should contribute cohesively in both the public and private sectors, which is a rare scenario currently. Sophisticated techniques for project planning and scheduling are available, but if line managers put too much pressure on project managers, or issues go unreported because people believe no help or sympathy will be given, the result can be disaster, unlike in mature markets. The current situation in Africa demands more skilled project managers who can work on their own initiative, cooperate with team members, and complete tasks within limited time and resources.
Learn more at http://www.infotrellis.com/mdm-project-management-emerging-economies-africa/

Wednesday, 20 June 2018

How to enhance your Business Value using Data Quality (DQ) Tools

This blog post lists some of the core concepts of data quality assessment, how the perception of data quality and data quality management has changed over the years, and the use of data quality tools to parse, standardize and cleanse data.

What is Quality Information?

There has been a change from the conventional focus on data usage, where data sets mainly supported the operation of transactional systems. Data quality, or data fitness, in today’s world is perceived in the context of reuse and repurposing, and is evaluated in terms of conformance to business rules.

Why do we need DQ Tools and Techniques?

The fundamental steps in improving data quality continue to be the same; however, we need to rely on data quality tools for the following reasons:
  • Evaluate where data errors exist
  • Assess the severity of the problem
  • Eliminate the root cause and correct the data
  • Inspect and monitor
  • Enhance the quality of the data
Learn more http://www.infotrellis.com/enhance-your-business-value-using-data-quality-dq-tools/



Tuesday, 5 June 2018

IICS and Data Solutions on the DX Platform

The move to the cloud is fully in force, and the amount of data organizations integrate across hybrid environments has multiplied two-fold. Nearly three-quarters of respondents who integrate data in hybrid and cloud environments cited poor data quality in cloud services, restricted API access, and company security and compliance policies as being among the key problems in their implementations. The largest issue of all was a lack of knowledge and skills within their organization's IT departments on how to integrate with cloud services.

iPaaS (Integrated Platform as a Service)

For the past few years, cloud data management has been defined by Integration Platform-as-a-Service (iPaaS). iPaaS is a set of integration tools delivered from a public cloud that requires no on-premises hardware or software. iPaaS was specifically designed to handle the lightweight messaging and document standards (REST, JSON, etc.) employed by today's cloud apps.

http://www.infotrellis.com/iics-data-solutions-dx-platform/

Monday, 4 June 2018

Leveraging Event Manager for Key Data Changes Orchestration

In a connected world, data should travel fast, in fact in real time, to serve its purpose: enabling the business to thrive. Data that is not available in time hurts the business. Events that matter to the business, and that effect critical data changes, should orchestrate the synchronization of those changes with the connected systems that are the backbone of the business.
Statistics show that 200 businesses change addresses, 150 business telephone numbers change or are disconnected, 5 suppliers/vendors go through rebranding, and retailers lose USD 40 billion, or 3.5% of total sales, due to product information inefficiencies – all in one hour. Data that is not efficiently syndicated to the consuming systems by the system of record hurts businesses.
In this blog, we will see IBM’s Event Manager in action and how it could seamlessly funnel event-driven data changes to connected systems, just when they need them.

http://www.infotrellis.com/leveraging-event-manager-key-data-changes-orchestration/

Saturday, 26 May 2018

Best Practices in Data Validation

Data quality is the buzzword of the digital age.

What is data quality and why is it so important?

“Data quality” is a term that often stays hidden but plays an important role in many streams of work. Data plays a vital role in capturing a marketplace, especially in the enterprise data management space.

Data Quality Examples

Following are some examples which emphasize the need for data quality.
  • A customer shouldn’t be allowed to enter his age in a field where he has to mention his marital status.
  • When a customer enters a store, there is a high possibility that he fills in the forms with incomplete or incorrect details; for example, in a hurry he may not mention a correct phone number.
  • There is also a possibility that the billing staff wrongly enter the store address as a default in place of the customer’s address, which contributes to bad-quality data getting persisted in the system.
http://www.infotrellis.com/best-practices-data-validation/

Tuesday, 22 May 2018

Data Warehouse Migration to Amazon Redshift – Part 3

This blog post is the final part of the Data Warehouse Migration to Amazon Redshift (AR) series. The second part of the series, Data Warehouse Migration to Amazon Redshift – Part 2, details how to get started with Amazon Redshift and the business and technical benefits of using AR.

1. Migrating to AR

The migration strategy that you choose depends on various factors, such as:
  1. The size of the database and its tables
  2. Network bandwidth between the source server and AWS
  3. Whether the migration and switchover to AWS will be done in one step or a sequence of steps over time
  4. The data change rate in the source system
  5. Transformations during migration
  6. The partner tool that you plan to use for migration and ETL
Learn more: http://www.infotrellis.com/data-warehouse-migration-amazon-redshift-part-3/

Tuesday, 15 May 2018

MDM Validations – Things to remember when implementing InfoSphere MDM Server

Validation is an important aspect of any application or system. Validations can arise as part of functional requirements (e.g., business rules) or non-functional requirements (e.g., maintaining data integrity). Data validation is the process of ensuring that a program operates on clean, correct and useful data.
In any MDM implementation, data validation plays an important role. Since MDM deals with maintaining a consolidated view of entities, it is critical that the data stored is valid and meets all business rules.
IBM’s MDM Server comes with a robust and easily customizable validation framework. MDM Server validations can be broadly classified into two types: 1) external validations and 2) internal validations.
First, let us talk about what external validations are.
External validations are the first level of validation. The validation rules are configured in database tables, and this definition metadata is retrieved at runtime by the validation engine for execution.
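As a generic illustration of that metadata-driven pattern (this is not the InfoSphere MDM Server validation API; the rule table layout and rule types are assumptions), a tiny engine that loads rule rows and applies them at runtime could look like this:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Generic illustration of metadata-driven validation; NOT the InfoSphere MDM Server validation API.
public class ExternalValidationSketch {

    // Stand-in for a rule row loaded from a configuration table (element name, rule type, parameter).
    static class RuleMetadata {
        final String element, ruleType, param;
        RuleMetadata(String element, String ruleType, String param) {
            this.element = element; this.ruleType = ruleType; this.param = param;
        }
    }

    // The "engine": applies each configured rule to the incoming record at runtime.
    static List<String> validate(Map<String, String> record, List<RuleMetadata> rules) {
        List<String> errors = new java.util.ArrayList<>();
        for (RuleMetadata rule : rules) {
            String value = record.get(rule.element);
            if ("MANDATORY".equals(rule.ruleType) && (value == null || value.isEmpty())) {
                errors.add(rule.element + " is required");
            } else if ("PATTERN".equals(rule.ruleType) && value != null && !Pattern.matches(rule.param, value)) {
                errors.add(rule.element + " does not match pattern " + rule.param);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // In a real system these rows would come from the validation configuration tables.
        List<RuleMetadata> rules = Arrays.asList(
                new RuleMetadata("lastName", "MANDATORY", null),
                new RuleMetadata("zipCode", "PATTERN", "\\d{5}"));

        Map<String, String> party = new HashMap<>();
        party.put("lastName", "Smith");
        party.put("zipCode", "7875");   // invalid on purpose

        System.out.println(validate(party, rules));   // prints the zipCode violation
    }
}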
http://www.infotrellis.com/mdm-validations-things-to-remember-when-implementing-infosphere-mdm-server/

Wednesday, 9 May 2018

Master Data Management: Are you flying blind?

How can you govern your master data without knowing your master data?
For many years I’ve been saying that the one thing all MDM clients have in common is that the quality of data in their source systems is not as good as they thought. Over the past several years I’ve found that all MDM clients have a second thing in common: they are unaware of the quality of data in their MDM hub, and they don’t know how the data is changing. This is surprising, since an MDM hub contains your most critical business data, used in real-time processes and analytics across the organization. How can you govern your data when you don’t know its trend in quality, how it is being used and how it is changing over time? This is flying blind.
There are a few contributing factors to this issue. The first is that MDM products don’t provide capabilities to analyze and report on data. The second is that an MDM hub is not the appropriate place to do this.

http://www.infotrellis.com/master-data-management-are-you-flying-blind/

Tuesday, 8 May 2018

MDM Validations – Things to remember when implementing InfoSphere MDM Server

Validation is an important aspect of any application or system. Validations can arise as part of functional requirements (e.g., business rules) or non-functional requirements (e.g., maintaining data integrity). Data validation is the process of ensuring that a program operates on clean, correct and useful data.
In any MDM implementation, data validation plays an important role. Since MDM deals with maintaining a consolidated view of entities, it is critical that the data stored is valid and meets all business rules.
IBM’s MDM Server comes with a robust and easily customizable validation framework. MDM Server validations can be broadly classified into two types: 1) external validations and 2) internal validations.

http://www.infotrellis.com/mdm-validations-things-to-remember-when-implementing-infosphere-mdm-server/

Saturday, 28 April 2018

Planning for Big Data Success

Most organizations are either starting or have already started on their Big Data journey. Like most other technology hypes, Big Data has followed the hype cycle, and there was a drop in interest in past years after the initial frenzy. What we are seeing now is a second influx of interest in Big Data after the initial peak some years back. With increasing maturity in Big Data products and business use cases, we are inclined to believe that Big Data is here to stay and will prove to be a differentiator, providing the competitive advantage needed in this age.
With this move, we should see more successful adoptions of Big Data technologies. There is a definite opportunity to cash in early on this technology and get the early-bird advantage. But let’s not get carried away! The industry-wide success rate for Big Data projects still remains between 22% and 27% (in 2017, Gartner put this figure at 40%, which most other studies found optimistic, pegging it closer to 15%). So, even if you start your Big Data journey early, there is no guarantee of returns UNLESS there is a way to increase the probability of success.
This blog post lists some key takeaways from our experience with Big Data initiatives across several industry segments. These insights should assist in better planning and increased control on your Big Data initiative.

Tuesday, 17 April 2018

Informatica MDM – Fuzzy Matching

This blog touches upon the basics of Informatica MDM Fuzzy Matching.

Informatica MDM – SDP approach

A master data management (MDM) system is installed so that an organization's core data is secure, is accessible by multiple systems as and when required, and does not have multiple copies floating around, giving the business a single source of truth. A solid Suspect Duplicate Process is required in order to achieve a 360-degree view of an entity.
The concept of Suspect Duplicate Processing represents the broad category of activities related to identifying entities that are likely duplicates of each other. Suspect duplicate processing is the process of searching for, matching, creating associations between, and, when appropriate, merging data for existing duplicate party records in the system.
To achieve this functionality, Informatica MDM has come up with its own Suspect Duplicate Processing (SDP) approach. An organization, based on its use case, can opt for either of the following two approaches:
  1. Deterministic Matching Approach
  2. Fuzzy Matching Approach
Deterministic Matching Approach
Deterministic Matching uses a set of rules, like nested if statements, to run a series of logical tests on the data sets. This is how we determine relationships, hierarchies, and households within a dataset. Deterministic matching seeks a clear “Yes” or “No” result on each and every attribute, based on which we define whether:
  • Two records are duplicates,
  • The pair should be resolved by a data steward, or
  • The records are two unique entities.
It doesn’t leave any room for error and delivers the right result in an ideal scenario. But most of the data in organizations is far from an ideal scenario. These are the cases where the Fuzzy Matching Approach of Informatica comes in handy.
Fuzzy Matching Approach
A fuzzy matching approach is required when we are dealing with less-than-perfect data and want to improve the quality of the results. Fuzzy matching measures the statistical likelihood that two records are the same. By rating the “matchiness” of the two records, the fuzzy method is able to find non-obvious correlations between data points and rates the two records by saying how close they are to each other.
Informatica MDM fuzzy matching offers the above in an easy-to-configure, flexible, repeatable and probabilistic manner. It gives us the flexibility to define which attributes are required to be matched deterministically (such as country IDs) and which should use fuzzy logic (such as names).
The fuzzy matching in Informatica works on different aspects of the data. The algorithm can be configured depending on whether we are matching an individual or a household, a contact person or an organization, and so on. This helps us handle different scenarios in the data. Also, based on our understanding of the data, we can choose the strictness of the algorithm, not only in terms of matching but in terms of searching as well.
The main strength of Informatica MDM fuzzy matching is that it is a rule-based matching system: unless and until the match criteria are met, we will not get a match, which makes it a business-user-friendly matching system.
The match criteria can be defined in two categories:
  • Automatic Merge and
  • Manual Merge.
Automatic merge is a scenario where the system by itself determines that the two entities in question are duplicates, whereas manual merge is a scenario where we need a data steward to decide whether the two parties in question are duplicates or not. Based on the rule (automatic or manual) that a suspect pair satisfies, the fate of the pair is decided: the records either merge automatically or a task is created for a data steward. If the suspect pair satisfies none of the defined rules, the two records are treated as two unique parties/entities.
The rule-based approach of fuzzy logic makes it easy for business users and data stewards to identify which record patterns constitute a duplicate pair, making it a hit with business users and, by making the MDM implementation successful, with the program sponsors as well.
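To make the auto-merge / manual-merge / unique decision concrete, here is a hedged Java sketch that combines one deterministic attribute with one fuzzy-scored attribute; the similarity measure (plain edit distance) and the thresholds are illustrative assumptions, not Informatica's algorithm:

import java.util.Locale;

// Illustrative only: a simple edit-distance similarity with auto/manual thresholds.
public class FuzzyMatchSketch {

    // Classic Levenshtein edit distance between two strings.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity in [0,1]: 1.0 means identical after normalization.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT).trim();
        String y = b.toLowerCase(Locale.ROOT).trim();
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / max;
    }

    // Decide the fate of a suspect pair: automatic merge, data steward task, or unique records.
    static String decide(String name1, String name2, String countryId1, String countryId2) {
        if (!countryId1.equals(countryId2)) return "UNIQUE";   // deterministic attribute must match exactly
        double s = similarity(name1, name2);                   // fuzzy attribute is scored
        if (s >= 0.90) return "AUTO_MERGE";
        if (s >= 0.75) return "MANUAL_REVIEW";
        return "UNIQUE";
    }

    public static void main(String[] args) {
        System.out.println(decide("Jonathan Smith", "Jonathon Smith", "US", "US")); // AUTO_MERGE
        System.out.println(decide("Johnny Smith", "John Smith", "US", "US"));       // MANUAL_REVIEW
        System.out.println(decide("Jon Smith", "Jon Smith", "US", "CA"));           // UNIQUE
    }
}

In the real product the match rules, populations and thresholds are configured in the hub rather than hard-coded.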
About the Author
Ripudaman Singh Dhaliwal, Manager at Mastech InfoTrellis, has considerable experience in probabilistic (fuzzy) matching algorithms.
http://www.infotrellis.com/informatica-mdm-fuzzy-matching/

Wednesday, 11 April 2018

Interfacing Virtual MDM through DataStage

The MDM Connector stage is a key that opens the door to IBM Virtual MDM. Yes, we can manipulate the data in MDM (MDM refers to IBM Virtual MDM in this post) using the MDM Connector stage, which was introduced in IBM DataStage v11.3.
We know that loading data into MDM is not an easy task, since it involves many tables and the relationships among the tables must be maintained properly; otherwise we will end up dealing with junk rather than data. The MDM Connector stage makes this task simpler by allowing us to configure everything in a single configuration window.
This blog post details how the basic operations (read/write) on data can be performed using the Connector stage in v11.5.

Tuesday, 10 April 2018

Intelligent Data Management meets SMART MDM™ Methodology

Informatica MDM Multi Domain Edition (MDE) supports multiple business data domains with a flexible data model, which allows you to adapt the data model in line with your business requirements rather than conform to a fixed, vendor-defined model. Business rules can be reused across unified MDM, data quality and data integration on a single platform. Granular web services are automatically generated, and higher-level composite services are created for rapid integration. The UI elements are automatically generated from the data model definitions. Value can be delivered within weeks, which reduces the risk of delays and cancellations.

Learn more. http://www.infotrellis.com/intelligent-data-management-meets-smart-mdm-methodology/