DSAA Competition

Kaggle Challenge

Telecom System Reconciliation on high-dimensional datasets:
Predict the correct configuration of the Billing system based on the CRM configuration.

Kaggle Website https://www.kaggle.com/t/10bf70a6f68f4e57bea47a3b32b6e746

Have you ever been charged for something you did not subscribe to? How hard could it be to get my invoice right?

One of the most important processes for telecommunications operators is provisioning services to a customer. This process is responsible for providing you with the services you contracted and for charging you for them.

A simple 4 Play subscription (Internet, TV, mobile phone and fixed line) involves configuring multiple systems:

  • CRM (Customer Relationship Management) – the master system that keeps the information about the offers that the customer subscribed to;
  • TV platform – defines the channels that the customer can watch;
  • Internet platform – defines the internet speed;
  • Billing system – responsible for billing the services and the usage that the customer makes of them.

It is a complex process, as there are hundreds of options, thousands of possible combinations and millions of customers. It is subject to errors, both human and automatic. For this reason, operators run different "audit" processes that try to ensure that all the systems involved are consistent with each other.

In this competition, we challenge you to predict the right bill for a customer given the services they contracted. The goal is to build a model that receives the configuration of the CRM system and predicts the correct configuration of the Billing system.

As mentioned, CRM is normally the master system for information about the services that the customer has subscribed to, so it should be possible to infer from it which configuration to expect in the Billing system. Note, however, that the mapping between the configurations of these two systems may not be one-to-one. In many cases it is many-to-one, that is, two different CRM configurations may point to the same Billing configuration.

The configurations in both systems have been pre-processed so that all variables are binary. Each column corresponds to one option in the configuration of the respective system (CRM / BILLING): the value 0 indicates that the option is inactive and the value 1 indicates that it is active. The training dataset contains one row per customer; given a customer's CRM settings, the objective is to return the corresponding Billing configuration. The training dataset contains errors in the CRM/BILLING configurations, whereas the test dataset contains only CRM-BILLING pairs that we deem correct.
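Because the training data is described as containing configuration errors, one simple first check is to look for CRM configurations that appear with more than one Billing configuration. The sketch below is only an illustration on toy data: the function name `find_conflicts` and the tiny 3-bit/2-bit vectors are assumptions made for the example, not part of the competition material.

```python
from collections import defaultdict

def find_conflicts(pairs):
    """Return the CRM configurations seen with more than one Billing configuration.

    `pairs` is an iterable of (crm_bits, billing_bits), each a sequence of 0/1 values.
    """
    seen = defaultdict(set)
    for crm, billing in pairs:
        seen[tuple(crm)].add(tuple(billing))
    # Keep only the CRM configurations with at least two distinct Billing outputs.
    return {crm: outputs for crm, outputs in seen.items() if len(outputs) > 1}

# Toy data (hypothetical): 3 CRM options, 2 Billing options.
toy_pairs = [
    ((1, 0, 0), (1, 0)),
    ((1, 0, 0), (0, 1)),  # same CRM input, different Billing output
    ((0, 1, 0), (1, 0)),
]
conflicts = find_conflicts(toy_pairs)
```

A conflict found this way only shows that at least one of the rows is inconsistent; it does not say which Billing configuration is the correct one.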

The training and test datasets are structured as follows: 1 column to identify the customer, 745 columns for the CRM configuration and 731 for the BILLING configuration.

  • 1st column (MSISDN): customer's identifier;
  • Columns 2nd – 746th (inclusive): these correspond to the input configuration (CRM);
  • Columns 747th – 1477th (inclusive): these correspond to the output configuration (BILLING).
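The layout can be sketched in code using only the column counts stated above (1 identifier column, 745 CRM columns, 731 BILLING columns, 1477 in total). This is a minimal sketch that assumes each row is available as a flat list of values in file order; the helper name `split_row` and the example identifier are hypothetical.

```python
N_CRM = 745      # number of CRM (input) columns, as stated in the description
N_BILLING = 731  # number of BILLING (output) columns, as stated in the description

def split_row(row):
    """Split one dataset row into (customer_id, crm_bits, billing_bits)."""
    expected = 1 + N_CRM + N_BILLING  # 1477 columns in total
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    customer_id = row[0]                            # 1st column (MSISDN)
    crm = tuple(int(v) for v in row[1:1 + N_CRM])   # the next 745 columns
    billing = tuple(int(v) for v in row[1 + N_CRM:])  # the last 731 columns
    return customer_id, crm, billing

# Synthetic row for illustration only (not real competition data).
example_row = ["ID-0001"] + ["0"] * N_CRM + ["1"] * N_BILLING
customer_id, crm, billing = split_row(example_row)
```

Checking the row length up front makes an off-by-one slip in the column boundaries fail loudly instead of silently shifting inputs into outputs.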

The training dataset has the configurations of 187 442 customers and the test dataset has 97 142.

Characteristics of the problem:

  • The dataset contains several provisioning errors, i.e. the same CRM configuration may appear in the dataset with different Billing configurations;
  • High-dimensional output space: 731 outputs (Billing configurations);
  • Different CRM configurations can correspond to the same Billing configuration;
  • The individual variables of the CRM configuration may not have a direct correspondence to the individual variables of the Billing configuration.
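One naive way to cope with these characteristics is to treat each whole CRM configuration as a lookup key and predict the Billing configuration most often seen with it. The sketch below is a toy baseline under that assumption, not the organisers' method; the names `fit_majority_lookup` and `predict` and the small vectors are hypothetical. It cannot generalise to CRM configurations absent from the training data, for which it returns a default value.

```python
from collections import Counter, defaultdict

def fit_majority_lookup(pairs):
    """Map each CRM configuration to its most frequent Billing configuration."""
    counts = defaultdict(Counter)
    for crm, billing in pairs:
        counts[tuple(crm)][tuple(billing)] += 1
    # most_common(1) returns the single most frequent Billing configuration.
    return {crm: c.most_common(1)[0][0] for crm, c in counts.items()}

def predict(lookup, crm, default=None):
    """Return the stored Billing configuration, or `default` for unseen inputs."""
    return lookup.get(tuple(crm), default)

# Toy data (hypothetical): 3 CRM options, 2 Billing options.
train = [
    ((1, 0, 0), (1, 0)),
    ((1, 0, 0), (1, 0)),
    ((1, 0, 0), (0, 1)),  # conflicting row, outvoted by the two rows above
    ((0, 1, 0), (1, 0)),  # a different CRM input with the same Billing output
    ((0, 0, 1), (0, 1)),
]
lookup = fit_majority_lookup(train)
```

Majority voting assumes that, for any given CRM configuration, correct rows outnumber erroneous ones; where that does not hold, the lookup will memorise the error.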


1,000 US dollars

The authors of the best solutions will be invited to present them at DSAA'2021.


Join us at IEEE DSAA’2021