Wednesday, 29 November 2017

Interpreting your data graphically

Beyond a basic understanding of the data itself, we need to pay close attention to fundamental statistics, since they are the key to driving data into interactive visualizations and converting tables into pictures.
The rapid rise of visualization tools such as Spotfire, Tableau, QlikView and Zoomdata has driven an immense use of graphics in the media. These tools can transform data into meaningful information while respecting the standard principles of statistical visualization, and they are very helpful in translating analysis into pixels once you have cleansed data and a finished analysis in hand.
In everyday business we see massive amounts of data that keep growing continuously. The question is how to gain better insights before making a decision on a business problem. Suppose you work in a specific sector, say sales (or any other sector such as finance, marketing or operations), that holds billions of records, and you want to lay a graphical layer over the data set to support a better decision. As a decision maker, you should be thinking about a way to analyze the data, and that way is a data visualization tool. This raises the questions of what data visualization is and why you should adopt it.
This blog post is about data visualization. It explains in detail how to convert massive databanks into statistical graphics in a pedagogically meaningful way.

Data Visualization in the Pharma Industry

The pharma industry faces unprecedented challenges that affect the development, production and marketing of medical products.
It has been facing declining success rates in research and development, patent expirations, global sales, medical bill review, reference-based reimbursement systems, drug testing and clinical trials, electronic trial master files, and hospital food, drug and maintenance administration, owing to huge volumes of databanks coupled with a lack of decision-making strategies; the key element of the cure is big data and the analytics that go with it. Big data helps you organize your data for future analysis and derive new business logic from it. You can also change current business logic in line with data trends to increase your business throughput.
http://www.infotrellis.com/interpreting-data-graphically/

Wednesday, 22 November 2017

Mastech InfoTrellis - Enterprise Data Integration

Using niche technologies, Mastech InfoTrellis enables customers to extract, transform and load data from disparate source systems into centralized data repositories such as a Master Data Management hub or a Big Data and Analytics hub.

Mastech InfoTrellis can help you attain and manage consistent, transformed data throughout your organization through the use of state-of-the-art ETL tools. From source data system analysis to performance-centric data loading processes, we leverage our expertise and experience to get your data organized and available for analysis by the business user community.

Poor data quality costs firms upwards of $8.8 million per year. Regardless of your business or data management initiative, whether you are a business or IT user, or where you are on your data quality journey, as a data user you must have ready access to trusted data. With Informatica Data Quality, organizations deliver business value by ensuring that all key initiatives and processes are fueled with relevant, timely and trustworthy data.

Visit our website at http://www.infotrellis.com/enterprise-data-integration/

Sunday, 12 November 2017

Thrive on IBM MDM CE

IBM InfoSphere Master Data Management Collaborative Edition provides a highly scalable, enterprise Product Information Management (PIM) solution that creates a golden copy of products and becomes the trusted system of record for all product-related information.

Performance is critical for any successful MDM solution, which typically involves complex design and architecture. Performance issues impede the smooth functioning of an application and prevent the business from getting the best out of it. Periodically profiling the application and optimizing it based on the findings is vital for a seamless application.

InfoTrellis has been providing services in the PIM space for over a decade now to an esteemed clientele spread across the globe.

This blog post details how to optimize an IBM InfoSphere MDM Collaborative Edition application, based on the tacit knowledge acquired from implementations and upgrades carried out over the years.

Performance is paramount
Performance is one of the imperative factors that make an application reliable. The performance of MDM Collaborative Edition is influenced by various factors such as solution design, implementation, infrastructure, data volume, database configuration, WebSphere setup, application version, and so on. These factors play a huge role in affecting the business either positively or otherwise. Moreover, even in a carefully designed and implemented MDM CE solution, performance issues creep in over time for miscellaneous reasons.

Performance Diagnosis
The following questions might help you narrow down a performance problem to a specific component.

  • What exactly is slow – a specific component, or general slowness that affects all UI interactions and scheduled jobs?
  • When did the problem manifest?
  • Did performance degrade over time, or was there an abrupt change in performance after a certain event?

Answers to these questions may not be a panacea, but they provide a good starting point for improving performance.

Hardware Sizing and Tuning
Infrastructure for the MDM CE application is the foundation on top of which the superstructure lies.

IBM recommends a hardware configuration for a standard MDM CE production server, but that is just a pointer in the right direction, and MDM CE infrastructure architects should take it with a pinch of salt.

Some common areas that can be investigated to tackle performance bottlenecks are:


  • Ensure enough physical memory (RAM) is available so that little or no memory swapping and paging occurs.
  • Watch latency and bandwidth between the application server boxes and the database server. This gains prominence if the data centers hosting them are far apart; hosting the primary database and application servers in the same data center can help here.
  • Run MDM CE on a dedicated set of boxes so that all hardware resources are available to the application and isolating performance issues becomes a comparatively simple process.
  • Keep an eye on disk reads, writes and queue lengths; any of these rising to dangerous levels is not a good sign.

Clustering and Load Balancing


Clustering and load balancing are two prevalent techniques used by applications to provide high availability and scalability.


  • Horizontal clustering – add more firepower to the MDM CE application by adding more application servers
  • Vertical clustering – add more MDM CE services per application server box by taking advantage of MDM CE configuration, for example additional Scheduler and AppServer services as necessary
  • Adding a load balancer – a software or hardware IP sprayer, or IBM HTTP Server – greatly improves business users' experience with the MDM CE GUI application

Go for High Performance Network File System
Clients typically go with an NFS file system for MDM CE clustered environments because it is free. For a highly concurrent MDM CE environment, opt for a commercial-grade, purpose-built high-performance network file system such as IBM Spectrum Scale.

Database Optimization
The performance and reliability of MDM CE depend heavily on a well-managed database. Databases are highly configurable and can be monitored so that performance bottlenecks are resolved proactively.

The following are a few ways to tune database performance.


  • Optimize database lock waits, buffer pool sizes, table space mappings and memory parameters to meet the system performance requirements
  • Go with the recommended configuration of a production-class database server for the MDM CE application
  • Keep the database server and client at the latest compatible versions to take advantage of bug fixes and optimizations
  • Ensure database statistics are up to date; statistics can be collected manually by running the MDM CE shell script located at $TOP/src/db/schema/util/analyze_schema.sh
  • Check memory allocation to make sure that there are no unnecessary disk reads
  • Defragment on a need basis
  • Check long-running queries, optimize their execution plans and index candidate columns (see the sketch after this list)
  • Execute $TOP/bin/indexRegenerator.sh whenever the indexed attributes in the MDM CE data model are modified
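
Long-running SQL can also be spotted directly from the database rather than from the application. The sketch below is a minimal, illustrative example assuming a DB2 backend: it uses plain JDBC and DB2's MON_GET_PKG_CACHE_STMT table function to list the statements that have consumed the most activity time. The connection URL, credentials and database name are placeholders, the DB2 JDBC driver (db2jcc4.jar) must be on the classpath, and the available monitoring columns can vary by DB2 version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Lists the most expensive statements in the DB2 package cache so that
 * candidates for index creation or plan tuning can be spotted quickly.
 * Host, port, database and credentials are placeholders.
 */
public class TopSqlReport {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://db-host:50000/MDMDB"; // placeholder endpoint and database
        try (Connection con = DriverManager.getConnection(url, "db2user", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT NUM_EXECUTIONS, TOTAL_ACT_TIME, SUBSTR(STMT_TEXT, 1, 200) AS STMT " +
                 "FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T " +
                 "ORDER BY TOTAL_ACT_TIME DESC FETCH FIRST 20 ROWS ONLY")) {
            while (rs.next()) {
                System.out.printf("%10d exec  %12d ms  %s%n",
                        rs.getLong("NUM_EXECUTIONS"),
                        rs.getLong("TOTAL_ACT_TIME"),
                        rs.getString("STMT"));
            }
        }
    }
}

Once the heaviest statements are known, their access plans can be reviewed and the relevant columns indexed, after which the report can be re-run to confirm the improvement.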


MDM CE Application Optimization
Performance of the MDM CE application can be improved across various components such as the data model, server configuration and so on. The following sections cover the best practices to follow on the application side.

Data Model and Initial Load

  • Carefully choose the number of specs; discard attributes that will not be mastered or governed in MDM CE
  • Similarly, a larger number of views, attribute collections, items and attributes slows down user interface screen performance; tabbed views come in handy to tackle this
  • Try to offload cleansing and standardization activities outside of the MDM solution
  • A workflow with many steps can cause problems ranging from an unmanageable user interface to very slow operations that manage and maintain the workflow, so it should be designed carefully

MDM CE Services configuration

The MDM CE application comprises the following services, all of which are highly configurable for optimal performance: Admin, App Server, Event Processor, Queue Manager, Workflow Engine and Scheduler.

All of the above services can be fine-tuned through the following configuration files found within the application.

  • $TOP/bin/conf/ini – allocate sufficient memory to the MDM CE services here
  • $TOP/etc/default/common.properties – configure the connection pool size and polling interval for individual services here
Docstore Maintenance

The document store is a placeholder for unstructured data in MDM CE, such as logs, feed files and reports. Over time the document store grows rapidly, and so do the obsolete files within it. Use the document store maintenance reports to check the document store size and purge documents that no longer hold significance.

  • Use the IBM MDMPIM DocStore Volume Report and IBM MDMPIM DocStore Maintenance Report jobs to analyze the volume of the document store and to clean up documents that have passed the data retention period configured in the IBM_MDMPIM_DocStore_Maintenance_Lookup lookup table
  • Configure the IBM_MDMPIM_DocStore_Maintenance_Lookup lookup table to set the data retention period for individual directories and the action to be performed once that period has elapsed – Archive or Purge
Cleaning up Old Versions

MDM CE does versioning in two ways.

Implicit versioning

This occurs when the current version of an object is modified during the export or import process.

Explicit versioning

This kind of versioning occurs when you manually request a backup.

Older versions of items, performance profiles and job history need to be cleaned up periodically to reduce load on the database server and, in turn, improve application performance.

  • Run the IBM MDMPIM Estimate Old Versions Report and IBM MDMPIM Delete Old Versions Report jobs on a schedule to estimate and clear out old entries, respectively
  • Configure the IBM MDMPIM Data Maintenance Lookup lookup table to hold the appropriate data retention periods for old versions, performance profiles and job history


Best Practices in Application Development

MDM CE presents a couple of programming paradigms for application developers who are customizing the out-of-the-box (OOTB) solution.


  • Scripting API – a proprietary scripting language that converts scripts into Java classes at runtime and runs them in the JVM. Follow the best practices documented here for better performance
  • Java API – always prefer the Java API over the Scripting API for better performance. Again, ensure the best practices documented here are diligently followed


If the Java API is used for MDM CE application development or customization, then:


  • Use code analysis tools such as PMD, FindBugs and SonarQube as periodic checkpoints so that only optimized code is shipped at all times
  • Use profiling tools such as JProfiler, XRebel, YourKit or VisualVM to continuously monitor thread pool usage, memory pool statistics, garbage collection frequency and so on. Using these tools during resource-intensive activities in MDM CE, such as heavyweight import or export jobs, not only sheds light on the inner workings of the JVM but also offers cues on candidates for optimization (a minimal sketch follows this list)
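
Even without a commercial profiler attached, the same kinds of JVM metrics can be sampled from inside the application using the standard java.lang.management API. The sketch below is a minimal, generic example that logs heap usage and garbage collection activity before and after a heavyweight block; the workload in the middle is only a placeholder and is not part of any MDM CE API.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Logs heap and garbage collection statistics around a heavyweight operation. */
public class JvmSnapshot {

    public static void main(String[] args) {
        printSnapshot("before");
        // Placeholder for a resource-intensive activity, e.g. a heavyweight import or export job.
        byte[][] workload = new byte[100][1024 * 1024];
        System.out.println(workload.length + " MB allocated as a stand-in workload");
        printSnapshot("after");
    }

    static void printSnapshot(String label) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("[%s] heap used: %d MB of %d MB max%n",
                label, heap.getUsed() >> 20, heap.getMax() >> 20);
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("[%s] %s: %d collections, %d ms total%n",
                    label, gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}

Emitting such snapshots around import or export jobs gives a cheap, always-on complement to the profilers listed above.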


Cache Management

Keeping frequently accessed objects in a cache is a primary technique for improving performance. The cache hit percentage needs to be very high for smooth functioning of the application.


  • Check the cache hit percentage for various objects in the GUI menu System Administrator -> Performance Info -> Caches
  • The $TOP/etc/default/mdm-ehcache-config.xml and $TOP/etc/default/mdm-cache-config.properties files can be configured to hold a larger number of entries in the cache for better performance


Performance Profiling
Successful performance testing will surface most performance issues, whether they relate to the database, network, software or hardware. Establish a baseline, identify targets, and analyze use cases to make sure that application performance holds up over the long run.

Identify areas of the solution that extend beyond the normal range – for example, a large number of items, many searchable attributes, or a large number of lookup tables.
Frameworks such as JUnit and JMeter can be used in an MDM CE engagement where the Java API is the programming approach of choice, as sketched below.
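
As a simple illustration of baselining with JUnit, the sketch below asserts that an operation finishes within a target duration and still returns results. It assumes JUnit 5 on the classpath, and searchItems() is a hypothetical stand-in for whatever call you want to baseline, not an actual MDM CE product API.

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTimeout;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

/** Baseline test: the operation under test must finish within two seconds. */
class SearchPerformanceTest {

    @Test
    void searchCompletesWithinBaseline() {
        List<String> results = assertTimeout(Duration.ofSeconds(2), () -> searchItems("SKU-123"));
        // Guard against a fast-but-empty result masking a functional failure.
        assertFalse(results.isEmpty());
    }

    /** Hypothetical stand-in for the real call being baselined. */
    private List<String> searchItems(String term) {
        List<String> hits = new ArrayList<>();
        hits.add(term);
        return hits;
    }
}

Running such tests on a schedule against a representative data volume turns the baseline into an early-warning system rather than a one-off exercise.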

About the author

Sruthi is an MDM Consultant at InfoTrellis and has worked on multiple IBM MDM CE engagements. She has over two years of experience in technologies such as IBM Master Data Management Collaborative Edition and BPM.

Selvagowtham is an MDM Consultant at InfoTrellis and has been plying his trade in Master Data Management for over two years. He is a proficient consultant in the IBM Master Data Management Collaborative Edition and Advanced Edition products.

Wednesday, 1 November 2017

Data Warehouse Migration to Amazon Redshift – Part 1

Traditional data warehouses require significant time and resources to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. As your data grows, you constantly have to trade off what data to load into your data warehouse and what data to archive in storage so you can manage costs, keep ETL complexity low, and deliver good performance.
This blog post details how Amazon Redshift can make a significant impact in lowering the cost and operational overhead of a data warehouse, how to get started with Redshift, the steps involved in migration, the prerequisites for migration, and post-migration activities.

Key business and technical challenges faced:

Business Challenges
  • What kind of analysis do the business users want to perform?
  • Do you currently collect the data required to support that analysis?
  • Where is the data?
  • Is the data clean?
  • What is the process for gathering business requirements?
Technical Challenges
Data Quality – Data comes from many disparate sources across an organization. When a data warehouse tries to combine inconsistent data from disparate sources, it runs into errors. Inconsistent data, duplicates, logic conflicts, and missing data all result in data quality challenges, and poor data quality results in faulty reporting and analytics, undermining optimal decision making.
Understanding Analytics – When building a data warehouse, analytics and reporting have to be taken into account in the design. To do this, the business user needs to know exactly what analysis will be performed, and envisioning these reports up front is a great challenge.
Quality Assurance – The end user of a data warehouse makes use of big data reporting and analytics to make the best decisions possible, so the data must be 100 percent accurate. This reliance on data quality makes testing a high-priority activity that requires significant resources, and a full software testing life cycle (STLC) has to be completed, which is costly and time-intensive.
Performance – A data warehouse must be carefully designed to meet overall performance requirements. While the final product can be customized to fit the performance needs of the organization, the initial overall design must be carefully thought out to provide a stable foundation from which to start.
Designing the Data Warehouse – A lack of clarity from business users about what they expect from a data warehouse results in miscommunication between the business users and the technicians building it. The expected end results are then not delivered, which calls for fixes after delivery on top of the existing development costs.
User Expectation – People are reluctant to change their daily routine, especially if the new process is not intuitive, so there are many challenges to overcome before a data warehouse is quickly adopted by an organization. A comprehensive user training program can ease this hesitation but requires planning and additional resources.
Cost – Building a data warehouse in house to save money sounds like a great idea but carries a multitude of hidden problems. The skill levels required to deliver effective results are hard to achieve with a few experienced professionals leading a team of technicians not trained in BI, and do-it-yourself efforts often turn out costlier than expected.
Data Structuring and Systems Optimization – As you add more and more information to your warehouse, structuring the data becomes increasingly difficult and can slow down the process significantly. It also becomes difficult for the system manager to qualify the data for analytics. In terms of systems optimization, it is important to carefully design and configure the data analysis tools.
Selecting the Right Type of Warehouse – Choosing the right type of warehouse from the variety available in the market is challenging. You can choose a pre-assembled or a customized warehouse: a custom warehouse saves time when building from various operational databases, while a pre-assembled warehouse saves time on initial configuration. The choice has to be made depending on the business model and specific goals.
Data Governance and Master Data – Information is one of an organization's most crucial assets and should be carefully monitored. Implementing data governance is mandatory because it allows organizations to clearly define ownership and ensures that shared data is both consistent and accurate.

Amazon Redshift

Redshift is a managed data warehousing and analytics service from AWS. It makes it easy for developers and businesses to set up, operate and scale a clustered relational database engine suited to complex analytic queries over large data sets. It is fast, using columnar technology and compression to reduce I/O and spreading data across nodes and spindles to parallelize execution. It is disruptively cost-efficient, removing software licensing costs and supporting a pay-as-you-go, grow-as-you-need model. It is a managed service, greatly reducing the hassle of monitoring, backing up, patching and repairing a parallel, distributed environment. And it is standards-based, using PostgreSQL as the basic query language and JDBC/ODBC interfaces, enabling a variety of tool integrations.
Amazon Redshift also includes Amazon Redshift Spectrum, allowing you to directly run SQL queries against exabytes of unstructured data in Amazon S3. No loading or transformation is required, and you can use open data formats.
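
Because Redshift speaks PostgreSQL over standard JDBC/ODBC, connecting from existing Java tooling is straightforward. The sketch below is a minimal example that opens a JDBC connection and runs an aggregate query; the cluster endpoint, database, credentials and the sales table are placeholders, and the Amazon Redshift JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Minimal JDBC connectivity check against a Redshift cluster (placeholder endpoint). */
public class RedshiftQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, database and credentials; Redshift listens on port 5439 by default.
        String url = "jdbc:redshift://examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection con = DriverManager.getConnection(url, "awsuser", "secret");
             Statement st = con.createStatement();
             // 'sales' is an illustrative table name.
             ResultSet rs = st.executeQuery(
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC")) {
            while (rs.next()) {
                System.out.printf("%-20s %12.2f%n", rs.getString("region"), rs.getDouble("total"));
            }
        }
    }
}

The same query could equally be pointed at an external Spectrum table over files in Amazon S3, since Spectrum tables are queried with ordinary SQL once the external schema is defined.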

Read more here: http://www.infotrellis.com/data-warehouse-migration-amazon-redshift-part-1/