Let’s amend our Recommender

Collaborative Filtering has come a long way since the teams in the Netflix Prize competition realized how easy it is to integrate intuitive aspects of the data into Matrix Factorization. The team that won the million-dollar prize has written a paper on the techniques it adopted.

In our demo, while learning to develop a model, we are going to compare and contrast the results from Matrix Factorization with other approaches.

Please find the links below; they can help you prepare for our demo.



Are you building a Causal or Casual relationship?

Ever played “hide and seek” with the menu of an ecommerce website? Or, even better, ever tested your tolerance level with an “Interactive” Voice Response (IVR) system? Welcome to the ever-engaging, dubious channels of business.

An ecommerce website or an IVR can make or break a relationship with a customer or a prospect if it has not been crafted right. In the rush to become digitally mature, businesses have come to a parting of the ways with customers rather than building a relationship. Whether the goal is just a presence on the Web or revenue growth, crafting a system that behaves like your best salesperson is a challenge. Besides testing every scenario to sort out technical flaws, the system should understand the context and engage the user better.

First, the data points needed to understand a user can be collected through multiple channels (viz., web, mobile, apps, print, social, email, commerce and federated) and through data append services as well. Above all, the data collection strategy should strive for a single source of truth (SSOT) rather than end up like the parable of the Blind Men and the Elephant.

Second, to revolutionize the relationship with the customer, data and their distributions do not suffice; we need answers to frequently asked causal questions like “Does the landing page encourage visitors enough to engage more?” or “Is the new IVR setup driving away customers?” or “Is the new loyalty program improving the customer retention period?”. Not to mention, a causal analysis of multivariate data helps us infer not only under static conditions but also under conditions changed by external interventions and treatments.

And since causal analysis needs a large sample size, collecting more data is inevitable. Similarly, we need to choose and collect as many variables as we can, whether correlated or uncorrelated. More to the point, while choosing variables, make sure to add the ones considered instrumental, the ones that support internal/external attribution, etc. And, I cannot stress this enough, collecting the right variables helps to avoid spurious correlations as well as to improve regression coefficients.
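
To make the point about spurious correlation concrete, here is a minimal sketch on simulated data. The variables `z`, `x` and `y` and their coefficients are made up for illustration: `z` is a confounder (say, season) that drives both `x` and `y`, while `x` has no effect on `y` at all. The naive regression of `y` on `x` still shows a clear slope, which vanishes once `z` is collected and adjusted for.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def residuals(xs, ys):
    """The part of ys that a straight-line fit on xs cannot explain."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = slope(xs, ys)
    return [(y - my) - b * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]       # hypothetical confounder
x = [2.0 * zi + rng.gauss(0, 1) for zi in z]  # driven by z
y = [3.0 * zi + rng.gauss(0, 1) for zi in z]  # driven by z only, never by x

# Naive view: regress y on x and ignore z.
naive = slope(x, y)

# Adjusted view: remove the part of x and y explained by z, then regress.
adjusted = slope(residuals(z, x), residuals(z, y))

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

The adjusted slope is what a regression of `y` on both `x` and `z` would report for `x`; without having collected `z`, only the misleading naive slope would be available.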

Finally, a causal approach starts with assumptions such as: the important causal variables are known and changes in them can be forecast with accuracy, a causal relationship can be measured, and errors in the predictions of the causal variables will not lead to serious errors in the final prediction.

Given that, causal models are used in cross-sectional surveys, time-series analyses, cohort investigations, and true and quasi experiments. Potential-Outcome (Counterfactual) and Structural-Equation Models are used to depict more detailed quantitative assumptions, whereas Sufficient-Component Models are used for qualitative assumptions. Graphical Models (Causal Diagrams) also come in handy to understand qualitative assumptions and to validate covariates.

Going by Zig Ziglar’s quote, “By the mile it’s a trial; by the yard it is hard; by the inch it’s a cinch”, I bet every causal question answered will bring your customer one step closer, and causal models can help with that.

Collect in Context

Of course, everyone agrees that acquiring a new customer is more expensive than retaining an existing one. And one way to retain customers is to enhance the relationship by allowing them to customize or personalize their product/service.

As you know, personalization is nothing but “accurately” predicting the interest of a given user or user segment. If your business is disciplined enough to collect and store data profiles of prospects, responders, active customers and former customers, then their interest can be predicted either from past behavior or by understanding the customer’s community. But attaining accuracy has always been a challenge.

Lately, with spatial and temporal data like location, weather, etc., user profiles are being sliced and diced to build contextual profiles. Also, recent projects like Hershey’s analysis of weather and consumer behavior, Samsung’s context-aware service, etc. show that it is high time to start collecting your data along with its context to enhance your customer’s experience.

And it takes a holistic approach to collect data along with its context in this technology-driven world. The sources of context data may range from simple survey feedback to social media to the sensors in a customer’s smartphone. Since collecting context data raises privacy, legal and other issues, you can either constrain the data you collect to what your objective requires or infer the context from less sensitive data using various machine learning techniques.

Finally, context data along with context prediction algorithms like Markov models, ARMA, etc. can take you a long way. Needless to say, even a simple correlation between a product feature like color preference and a customer segment can definitely help you understand your customer better.
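
As a rough illustration of Markov-style context prediction, the sketch below counts how often one context follows another in a user's history and predicts the most frequent successor. The context labels and the `history` list are invented for the example, and this is only the simplest first-order variant, not a production model.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Count how often each context is followed by each other context."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequently observed successor, or None if unseen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


# Hypothetical location contexts inferred for one user over two days.
history = [
    "home", "commute", "office", "cafe", "office", "commute", "home",
    "commute", "office", "cafe", "office", "commute", "home",
]

model = fit_transitions(history)
print(predict_next(model, "cafe"))  # the context seen most often after "cafe"
print(predict_next(model, "gym"))   # a context that was never observed
```

With a prediction like this, an offer can be timed for the context the user is likely to be in next rather than the one just left.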

With the Internet of Things (IoT) and the data collected from its sensors, no matter what industry you belong to, applying contextual design to improve your product/service appears closer than ever before.

Reap what you SQOOP

According to 2013 TDWI Research, about 10% of organizations have implemented Hadoop in production, and about 51% say they’ll have one within three years. And you can bet that their core objectives are to scale as the data volume grows and to apply advanced analytics to learn from their own data.

So, given the heterogeneous Data warehousing/BI environment, SQOOP is going to play a significant role in shoveling data in and out of HDFS/Hadoop with ease.

As you SQOOP, unlike in a structured data environment, where we need to profile, parse, standardize, match, validate and enrich the data upfront to ensure its quality, unstructured data can stay in raw form. The reason is that for exploratory and discovery analytics, data scientists have to play “Jeopardy”; in other words, they have to find the hidden intelligence in as-is data.

Moreover, we have tools and packages to improve data quality just in time. For instance, one of the methods used by statisticians to populate missing data is called Multiple Imputation; imputation itself is as simple as Wikipedia explains it: “It is the process of replacing missing data with substituted values”. And the good news is that R has a package for it called MICE (Multivariate Imputation by Chained Equations). Using MICE you can generate multiple imputations, analyze the imputed data, and pool the analyzed results.
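
The impute, analyze and pool cycle can be sketched in a few lines of plain Python. To be clear, this is not MICE and not its API: it is a simplified illustration with one incomplete variable, made-up data, and a stochastic regression imputation that skips the parameter-uncertainty draw a proper implementation would add. The pooling step uses Rubin's rules.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual spread of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    resid_sd = statistics.pstdev([y - (a + b * x) for x, y in zip(xs, ys)])
    return a, b, resid_sd


def impute_once(xs, ys, rng):
    """Fill each missing y with a regression prediction plus random noise."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in observed], [y for _, y in observed])
    return [
        y if y is not None else a + b * x + rng.gauss(0, sd)
        for x, y in zip(xs, ys)
    ]


def pool(estimates, variances):
    """Rubin's rules: combine the per-dataset estimates and variances."""
    m = len(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return statistics.fmean(estimates), within + (1 + 1 / m) * between


rng = random.Random(42)
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]  # made-up complete data
y_missing = [None if i % 4 == 0 else yi for i, yi in enumerate(y)]

m = 5  # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    completed = impute_once(x, y_missing, rng)                        # impute
    estimates.append(statistics.fmean(completed))                     # analyze
    variances.append(statistics.variance(completed) / len(completed))

pooled_mean, pooled_var = pool(estimates, variances)                  # pool
print(f"pooled mean of y: {pooled_mean:.2f} (variance {pooled_var:.3f})")
```

The pooled variance is never smaller than the average within-dataset variance, which is how the extra uncertainty from the missing values shows up in the final answer.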

If enterprises can improve their business intelligence from what they SQOOP by using advanced library packages, then we can rephrase George Bernard Shaw’s quote as “If history repeats itself, and the unexpected always happens, how capable must enterprise be of learning from data?”.

Are you wearing the right CAP?

As you might have learned by now, CAP stands for Consistency, Availability, and Partition tolerance. Let us understand in simple terms what they are.

Availability promises 24×7 data access without any excuses. Besides network and server failures, other factors like natural disasters can wipe out or disrupt data availability. So engineers have been trying various architectural strategies to make data available 24×7, but the basic technique is either to make multiple copies of the entire data store or to cost-effectively spread the data across an array of disk drives so that a failed disk’s data can be reconstructed on the fly.
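
One common way to reconstruct a failed disk on the fly is XOR parity, the idea behind parity-based RAID levels. The sketch below is a toy with three tiny in-memory "disks" whose contents are made up; real arrays work on large stripes and rotate the parity, which is left out here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks holding one small block each.
data_disks = [b"cust", b"omer", b"data"]

# The parity disk stores the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Simulate losing disk 1: XOR the surviving blocks with the parity block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)
```

Parity costs one extra disk for the whole array instead of a full second copy of the data, which is where the cost-effectiveness comes from.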

For instance, if your strategy for data availability is to maintain multiple copies, then making sure all the copies are identical all the time is what we call Consistency; in other words, every read gets the most recent write. Data consistency can be Absolute or Eventual, or somewhere in between, depending on your application’s requirements.

And while engineers have been trying different strategies to achieve absolute consistency and 24×7 availability with traditional RDBMS databases like Oracle, MS SQL Server, MySQL, PostgreSQL, etc., the world added zettabytes (10^21 bytes) of data. Also, given that companies double their data every two years on average, the challenge is to build databases that can scale economically with high availability and absolute consistency.

In fact, to scale economically, the Shared Nothing architecture (implemented by Teradata in 1983) seems to be the most preferred architecture in the industry. And while a node (individual machine) in a shared nothing architecture is meant to be lean and mean for performance, the challenge is to split your database among the nodes as it grows, also known as Sharding.

When you “shard” (split) your database into smaller databases to solve the data volume problem, the split databases are expected to work as one unit to serve users’ requests. But Partition Tolerance promises that the shards will continue to serve even if communication between them is broken.
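
A minimal sketch of how shards can act as one unit: a routing function hashes each key to pick a shard, so callers never need to know where a record lives. The shard count, key format and in-memory dictionaries are assumptions for illustration; real databases each have their own partitioning scheme.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Four in-memory dictionaries stand in for four database nodes.
shards = [dict() for _ in range(4)]


def put(key, value):
    """Store the value on whichever shard owns the key."""
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    """Look the key up on the shard that owns it."""
    return shards[shard_for(key, len(shards))].get(key)


put("customer:42", {"name": "Asha"})
put("customer:43", {"name": "Ravi"})
print(get("customer:42"))
print([len(s) for s in shards])  # how the keys spread across the shards
```

Plain modulo hashing like this moves most keys when the shard count changes; consistent hashing is a common alternative when nodes are added often.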

To keep that promise, the application should lose either 24×7 availability (by waiting for the broken communication to be restored and resolved, as consistency is your application’s priority) or absolute consistency (by continuing to serve requests without syncing recent writes over the broken communication, as availability is your application’s priority).
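
The trade-off can be played out with a toy store that has two replicas and a link that can be cut. Everything here (the `TinyStore` class, the "CP" and "AP" mode names, the `heal` step) is a hypothetical sketch for illustration, not how any particular database implements it.

```python
class TinyStore:
    """Two replicas, A and B, with a link between them that can break."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replica_a = None
        self.replica_b = None
        self.link_up = True

    def write(self, value):
        """Write through replica A; return True if the write was accepted."""
        if self.link_up:
            self.replica_a = self.replica_b = value
            return True
        if self.mode == "CP":
            return False          # refuse the write: availability is given up
        self.replica_a = value    # accept locally: replica B is now stale
        return True

    def read_from_b(self):
        """Read through replica B, which may lag behind during a partition."""
        return self.replica_b

    def heal(self):
        """Restore the link and sync replica B from replica A."""
        self.link_up = True
        self.replica_b = self.replica_a


cp, ap = TinyStore("CP"), TinyStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.link_up = False  # the partition happens

print(cp.write("v2"), cp.read_from_b())  # write rejected, reads stay consistent
print(ap.write("v2"), ap.read_from_b())  # write accepted, reads are stale
ap.heal()
print(ap.read_from_b())                  # consistent again after the sync
```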

Every traditional and NoSQL database in the market is working to get the best out of CAP using various strategies, so find out what part of CAP is really offered by the database you pick to implement. For instance, with Partition Tolerance guaranteed, HBase leans more towards absolute consistency over 24×7 availability, whereas Cassandra insists on 24×7 availability over absolute consistency.

Share or Not to Share : All in the game

With the “shared what” revolution in parallel database architecture, choosing a database for processing your data broadly comes down to either share or not to share.

Shared Memory

Shared Memory, as the name goes, means the memory is shared among processors. This approach is considered easy to program but difficult to scale economically beyond a few processors.

The advantage of sharing memory is low latency; traditional databases like PostgreSQL, MySQL, etc., which have been under this architecture, process gigabytes of data with ease and low latency.

Lately, Microsoft’s announcement that SQL Server 2014 will use in-memory technology for transaction processing, along with upcoming in-memory open source technologies like VoltDB, promises shared memory databases with low latency and high throughput.

Along with that, given the memory price trend, which fell from $1000/GB to $5/GB in the last decade, processing real-time transactions in petabytes is not far away.

Shared Nothing

On the other hand, the Shared Nothing approach is considered complex to program but easy to build and scale.

With the data explosion, specifically in semi-structured and unstructured data like text and videos, the Shared Nothing architecture gained momentum with NoSQL (Not Only SQL) databases like Hadoop/HBase, Cassandra, MongoDB, etc., followed by traditional databases like Sharded MySQL, Postgres-XC, etc.

Even though high throughput was achieved by implementing the MapReduce pattern for analytical data processing in databases like Hadoop/HBase, CouchDB, Riak, etc., it has taken a toll on low latency for sure.
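
For readers new to the pattern, here is a single-process sketch of its three steps (map, shuffle, reduce) as a word count. The sample documents are made up, and a real framework would run the map and reduce steps in parallel across nodes, which is where the throughput comes from.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values of one key into a single result."""
    return key, sum(values)


documents = ["shared nothing scales out", "shared memory scales up"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Every job has to pass through all three steps before any answer appears, which hints at why batch throughput comes at the expense of latency.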

In the NoSQL world, while low latency databases like Cassandra, Alchemy, etc. are improving their throughput, companies like Yahoo (Storm & YARN on Hadoop) and Google (Spanner) are working to solve their latency problem.

Until the share and not-to-share architectures are tweaked, experimented with and tested to have everything in one box, both architectures are in the game.

Cross-Site Scripting (XSS)

OWASP recently released its top 10 application security risks for 2010. Cross-Site Scripting (XSS) has remained unbeaten in second place for the past few years.

Forget about the other 9; XSS alone has enough potential to create a wide spectrum of issues, ranging from trivial defacing of your web page to the more lethal identity theft.

If you have lately experienced a denial-of-service attack (for instance, your system access being blocked by multiple browser windows opening), or seen a bizarre banner on your website, or come across irrelevant/misleading news in the middle of your corporate website, then you have two choices.

The first choice is the one most preferred by “humans”, i.e., to raise your index finger, confidently point straight at XSS and do nothing about it :-).

The second choice is to do something to mitigate the attack. The action items to boost your defense are listed below.

  1. Set the character encoding. (a) Start with your Web Server (if you have access), (b) Set meta tags right below the <head> tag of your web documents (e.g. <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">), (c) Set @charset at the top of your stylesheets (e.g. @charset "Windows-1251";) & (d) Set the encoding in the XML declaration of your XML documents.
  2. If you have access to the application code, SANITIZE IT! Start by validating all the input data of your application, and while doing so look for and filter the special characters of HTML. For instance, when validating phone number fields, do not allow alphabetical or special characters; if you have to, then devise your filter to look for the expected input. Filtering should also be done before output generation.
  3. If you need special characters in your code, then you should always consider encoding them (for instance, replacing the special character with its HTML entity or ISO Latin-1 code is one type of encoding).
  4. “Cookies” 🙂 not only spoil your health, but also your application. If you still want to use a cookie, then protect it with the “HttpOnly” attribute.
  5. There is this famous quote by M.K. Gandhi, “We must become the change we wish to see in the world”. If you apply the same to our security context, unless you start a secure coding practice, no tool can protect you. Not even a Web Application Firewall. It was demonstrated in an OWASP Europe 2009 presentation that Web Application Firewalls can be hacked by using our beloved XSS.
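
Items 2 to 4 above can be sketched with nothing but the Python standard library. The phone number rule, the function names and the cookie name are assumptions chosen for illustration; your own application will have its own fields and framework helpers.

```python
import html
import re
from http.cookies import SimpleCookie

# Assumed rule for this example: optional leading "+", then 7 to 15 digits.
PHONE_PATTERN = re.compile(r"\+?[0-9]{7,15}")


def is_valid_phone(raw):
    """Item 2: accept only the expected input, reject everything else."""
    return PHONE_PATTERN.fullmatch(raw) is not None


def render_comment(user_text):
    """Item 3: encode HTML special characters before generating output."""
    return "<p>" + html.escape(user_text, quote=True) + "</p>"


def session_cookie_header(session_id):
    """Item 4: build a cookie that page scripts cannot read."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["httponly"] = True
    return cookie["session"].OutputString()


print(is_valid_phone("5551234567"))
print(is_valid_phone("555<script>"))
print(render_comment("<script>alert('xss')</script>"))
print(session_cookie_header("abc123"))
```

Validating against what you expect (a whitelist) tends to age better than trying to list every dangerous character, since the list of dangerous inputs keeps growing.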