Blog

Developments and insights into the Business Intelligence industry.

Using Infinispan In-Memory Cache with Pentaho

Posted by Bo Borland on Mar 7, 2012 in Big Data, blog | 0 comments

The latest General Availability release of Pentaho Business Intelligence (BI) Enterprise Edition, 4.1, has built-in functionality to use the Infinispan in-memory cache in place of the Mondrian caching system that has been in the product since its inception.

Imagine performing user analytics on your entire set of user metrics in memory. This would greatly improve big data analytics for user segmentation. The Infinispan cache allows for a much larger in-memory setup for analyzing your data. This caching system is easy to set up, and can be used across multiple nodes to leverage memory and computing power across those nodes. Infinispan uses multicasting, which allows for “self-aware” clusters of Infinispan within a network.
This caching system is only available in the Enterprise edition, and is only leveraged by the Pentaho Analyzer interface. Unfortunately, this functionality is not installed with the binary Pentaho installers and has to be set up manually. Following is a step-by-step list of things that need to be done to implement the Infinispan cache.

  • Download the Pentaho Analysis Plugin package from the Pentaho FTP site. ftp://enterpriseeval:Pre5weYa@subscription2.pentaho.com/Pentaho_BI_Suite/4.1.0-GA/developers/pentaho-analysis-ee-3.3.0-GA.zip
  • Move this package to your Linux or Windows server that is running Pentaho EE.
  • Unzip this file into a temporary directory that is not within your Pentaho installation.
  • Make a backup of your existing /server/biserver-ee/tomcat/webapps/pentaho/WEB-INF directory
  • Move the following files from your unzipped files /lib directory to the pentaho installation /tomcat/webapps/pentaho/WEB-INF/lib/ directory.
    1. infinispan-core-4.2.1.FINAL
    2. jboss-transaction-api-1.0.1.GA
    3. jcip-annotations-1.0
    4. jgroups-2.12.0.CR5
    5. marshalling-api-1.2.3.GA
    6. memcached-0.0.1-PENTAHO
    7. pentaho-analysis-ee-3.3.0-GA-obf
    8. rhq-pluginAnnotations-3.0.1
    9. river-1.2.3.GA

Move ALL files from your unzipped files /classes directory to the Pentaho installation /tomcat/webapps/pentaho/WEB-INF/classes directory.
If you are installing on Linux, you must run chmod 664 on all the files you just moved to your current installation. If you do not, then you will be quite frustrated when nothing works!
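
For anyone who would rather script the copy steps above, here is a minimal Python sketch. It makes several assumptions that are not in the original steps: the function name install_plugin_files is made up for illustration, the nine library files are assumed to carry a .jar extension, and files are copied rather than moved so the unzipped package stays intact. Take the backup of WEB-INF first, as described above.

```python
import shutil
from pathlib import Path
from typing import List

# Library names as listed in the steps above (assumed to be .jar files).
REQUIRED_LIBS = {
    "infinispan-core-4.2.1.FINAL",
    "jboss-transaction-api-1.0.1.GA",
    "jcip-annotations-1.0",
    "jgroups-2.12.0.CR5",
    "marshalling-api-1.2.3.GA",
    "memcached-0.0.1-PENTAHO",
    "pentaho-analysis-ee-3.3.0-GA-obf",
    "rhq-pluginAnnotations-3.0.1",
    "river-1.2.3.GA",
}


def install_plugin_files(unzipped: Path, web_inf: Path) -> List[Path]:
    """Copy the listed libs and ALL classes files into WEB-INF, then chmod 664."""
    copied: List[Path] = []
    for sub in ("lib", "classes"):
        src_dir = unzipped / sub
        dst_dir = web_inf / sub
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(src_dir.iterdir()):
            if not src.is_file():
                continue
            # From /lib take only the listed files; from /classes take everything.
            if sub == "lib" and src.stem not in REQUIRED_LIBS:
                continue
            dst = dst_dir / src.name
            shutil.copy2(src, dst)
            dst.chmod(0o664)  # same effect as `chmod 664` on Linux
            copied.append(dst)
    return copied
```

A call would look like install_plugin_files(Path("/tmp/pentaho-analysis-ee"), Path("/server/biserver-ee/tomcat/webapps/pentaho/WEB-INF")), where the temporary directory name is only an example.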

  • Edit /WEB-INF/classes/pentaho-analysis-config.xml
  • Change false to true
  • Re-start your BI Server.
  • Check your /tomcat/logs/catalina.out file for any start-up errors and debugging output.
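
Reading catalina.out by eye works, but a few lines of Python can pull out the suspicious entries for you. This is only a sketch: the function name find_startup_problems and the keyword list are my own assumptions, not something Pentaho ships, and you may want different keywords for your logging setup.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed markers of trouble; adjust to match your own log format.
KEYWORDS = ("SEVERE", "ERROR", "Exception")


def find_startup_problems(log_path: Path,
                          keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return 'line number: text' for every log line containing a keyword."""
    problems: List[str] = []
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(keyword in line for keyword in keywords):
                problems.append(f"{number}: {line.rstrip()}")
    return problems
```

Point it at your own log, for example find_startup_problems(Path("/server/biserver-ee/tomcat/logs/catalina.out")), and print whatever comes back.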

Infinispan should now be running. To test your implementation, open up the Pentaho Analyzer interface. Drag over a measure and a dimension from one of your cubes. Then click on the “Log” link within Analyzer. You should now see references in the log to “writing to segment cache Mondrian.infinispan” or something similar.
Sit back, relax, and enjoy all of your newfound in-memory performance!
*** tip *** By default, Infinispan uses UDP multicasting to search for other Infinispan instances on a network. You can change the ports used for multicasting, or switch the protocol to TCP, by editing the pentaho-analysis-config.xml file.

The Right Tool For the Job

Posted by Sal Scalisi on Jul 6, 2011 in BI in the Cloud, Big Data, Data Integration, blog | 0 comments

There have been many advances in data analytics, business intelligence, and in technology overall over the past several years.  New advances in big data and cloud computing have allowed companies to process data in ways they never have before.  However, these new advances present a new challenge.   What is the best tool for the challenges faced by your organization?

Relational Database Management Systems  (RDBMS)

Traditional relational databases such as Oracle, SQL Server, and MySQL are where most organizations start.  These platforms provide aggregation capabilities as well as detail-level querying on small to mid-range data sets, but scalability is limited.  In addition, because these technologies have been around for so long, there are plenty of applications that integrate with them for reporting and visualization purposes.
If the organization’s data set is capable of being stored, managed, and processed on an RDBMS, and the data set is not growing rapidly, there probably is no need to change.  If there are performance impacts, utilizing techniques such as indexing, basic aggregation, and partitioning may solve the problem.
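
As a rough illustration of the indexing and basic aggregation techniques mentioned above, here is a small sketch using Python's built-in sqlite3 module. The sales table, its columns, and the sample rows are invented for the example, and a production RDBMS will have its own syntax and tuning options.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a tiny in-memory database with an index and a summary table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    )
    # Indexing: lets lookups by region avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
    # Basic aggregation: precompute totals so reports read a small table.
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total, COUNT(*) AS orders "
        "FROM sales GROUP BY region"
    )
    return conn


def region_totals(conn: sqlite3.Connection) -> dict:
    """Read the precomputed totals instead of re-aggregating the detail rows."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))
```

The summary table has to be refreshed when the detail data changes, which is the usual trade-off with this kind of aggregation.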
If the data set is growing rapidly, or there is a new need to process large amounts of customer data (for instance), one of the options below may be a better fit.

Massively Parallel Processing (MPP) and Columnar Databases

Massively Parallel Processing (MPP) and columnar databases, such as Greenplum, Infobright, and HBase, are designed to process large sets of data and provide row-level detail. These technologies are scalable and are capable of distributing workload over commodity machines. They are a fit for organizations that need to aggregate large sets of data, but still need the low-level detail for reporting purposes.
If an organization finds that its data is outgrowing its RDBMS install, and there is a need for row-level reporting, MPP or columnar database technologies may be the right alternative.  Once implemented, these solutions are scalable, providing support long into the future.

Hadoop, MapReduce, and Hive

Hadoop is an open source framework that is capable of storing and processing large data sets over clusters of computers.  MapReduce is the software framework that defines and configures jobs that process these large data sets.  Hive provides an SQL-like interface for querying large unstructured data sets stored in Hadoop.  Hive turns the SQL-like queries into MapReduce jobs in the background, removing the Java programming burden from the user.
These work best with Big Data, but what is Big Data?  The term is general and usually refers to data sets that are beyond the managing, processing, and storage capabilities of commonly used software tools in a given time frame.  It’s a bit vague, but it is commonly applied to web logs, social website interactions, call detail records, scientific research, and medical records, to name a few.
One other attribute would be the constant need to process such large data sets.  If there is a need to process a one-time data set, there are probably other options available than a full-blown Hadoop implementation.  For instance, Pentaho’s Data Integration product is able to process large web log files in a timely manner, and may be an acceptable alternative if the data sets are of a manageable size.
For those who have a need to process large data sets, but do not have the Java programming resources available, Hive may be an acceptable solution.  As mentioned earlier, it gives the user access to Hadoop and MapReduce functionality without the need to know Java.  Anyone who is knowledgeable in SQL should be able to learn the Hive Query Language (HiveQL).
One final note is that this type of processing is for aggregating or performing complex statistical operations on large data sets.  If there is a need to aggregate, but also have access to the detail records, this may not be the right solution.  Instead, explore RDBMS, MPP, or columnar database technologies.
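
To make the idea of a SQL-like query becoming a MapReduce job a little more concrete, here is a conceptual sketch in plain Python. It is not what Hive actually generates and it does not use Hadoop at all; it simply mimics the map, shuffle, and reduce phases for a query along the lines of SELECT page, COUNT(*) FROM web_logs GROUP BY page, where the web_logs table and the "<ip> <page>" line layout are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(log_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (page, 1) for every well-formed log line."""
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2:  # assumed layout: "<ip> <page>"
            yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values for each key, giving the hit count per page."""
    return {key: sum(values) for key, values in grouped.items()}


def count_hits(log_lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(log_lines)))
```

On a real cluster the map and reduce steps run in parallel across many machines, which is where the scalability comes from; the single-process version above only shows the shape of the computation.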

Whatever your need, Management Signals can help find the right solution for you. Whether it be one of the solutions above, or a combination, Management Signals has the staff to solve your data processing needs.

Games Analytic Solution - Beta Program

Posted by Parker Brown on Jun 29, 2011 in blog | 0 comments

Management Signals is looking for Social Gaming Companies that want to help shape Management Signals’ new Gaming Analytics Solution for big gaming data.

SaaS, Mobile App, and Gaming companies like Skype, Gazillion, and QuickOffice have partnered with Management Signals for our business intelligence, data analytics, and big data expertise. Now Management Signals is about to change the game for gaming companies that want more insight into social event and revenue tracking as well as retention analysis around user behaviors and events. Other tools will include, but are not limited to, predictive demographic forecasting based on geographic location. We need your input. Participate in our Management Signals SaaS Gaming Analytics Beta program!

Benefits include:

  • Flexible and highly intuitive analytic solution leveraging our uniquely tailored CUBES for the gaming industry
  • Free data transformation, analytics modeling and development using your gaming data
  • Deep dive insights for effective monetization, retention, and development strategies
  • User friendly drag and drop capability for unrestrained ad-hoc reporting
  • Free access for the first twelve months after product General Availability
  • A dedicated Product Development Sponsor to facilitate deployment and maximize value
  • An opportunity to shape the platform for General Availability

With numerous customer success stories, Management Signals BI solutions are simple, proven and secure.

Only the first qualifying partner who registers with Management Signals will receive this exclusive offer. Offer ends July 28, 2011.

We look forward to hearing from you. Don’t waste any time. Register with us now, shoot us an email, or give us a call at 877 623 DATA (3282).

Ongoing Agile BI for Handling Change

Posted by Bo Borland on Mar 13, 2011 in Agile BI, Data Integration, blog | 0 comments

To compete more aggressively in the information age, companies need to be in an ongoing mode of harnessing and sharing the ever-growing volume and sources of corporate data. Source data changes are inevitable, making business intelligence an evolving competency that requires continued development and refinement. Companies that recognize and address this change with an implementation methodology efficient at delivering BI on an ongoing basis will gain a competitive advantage over those that don’t.

Here is a sampling of common business events that change the quality, structure, and source of data used by BI tools.

Data Change

  • New business rules change how data is processed
  • Corporate standards or new compliance regulations change how data is stored or categorized
  • Data migration projects merge, aggregate, or impact historical records

System Change

  • Companies migrate from onsite applications to hosted or SaaS applications
  • New business applications go live
  • System upgrades change the underlying database schema

Business Process Change

  • New or modified business processes affect how data is captured or processed
  • New products or services require new systems or enhancements to existing systems
  • New products or services cause process changes that ripple through every department

Business Metric Change

  • Company strategy changes
  • Business metrics are created, changed or re-prioritized
  • Metric benchmarks and tolerance thresholds change

Company Ownership or Structure Change

  • Companies acquire or get acquired by other companies
  • Systems are modified, retired, or merged to reflect the new entity structure

In response to these types of company changes, BI development teams are constantly delivering new or modified data transformations and views. BI tools are important for developers because they reduce the amount of hand-written code, but BI tools alone cannot provide a method to the madness of corporate change.

Armed with great BI tools and talented developers, companies need a BI development methodology that is highly efficient at organizing teams, forecasting labor, managing projects, and delivering code on an ongoing basis. Agile BI is effective for driving those activities because it:

  1. focuses on short iterations of deliverable modules
  2. provides a framework for quickly decomposing user requirements into use cases the developers can understand and accurately forecast
  3. eliminates a lot of time-consuming, up-front design that is useless when changes occur or when business rules appear late in the cycle
  4. focuses less on to-be documents with a short shelf life and procedural ceremony and more on working code for data transformation and presentation
  5. welcomes and effectively deals with changing requirements
  6. supports self-organizing teams
  7. eliminates many ineffective meetings

When implemented correctly, these rules will streamline delivery and positively impact the cost, quality and speed of ongoing development. Since the only guarantee in life is change, companies can benefit from an agile methodology that helps BI teams welcome change with confidence and responsiveness.