New Algorithm for Outlier Detection in Categorical Data

Volkan Yurtseven
Nov 8, 2023


Observation Based Outlier Detection

(Before we start: those interested in the Turkish version of this post can click here.)

INTRODUCTION

Hardly anyone these days has not heard of artificial intelligence and machine learning. Although customer-facing applications are the most visible, organizations also use these technologies in their middle- and back-office work.

In this post, I will share the details of a study that falls into the back-office category. This study, briefly mentioned in the previous post titled "Data Quality Monitoring with Advanced Analytical Methods" (in Turkish), is an ML algorithm we developed ourselves. It can also be considered indirectly within the scope of the front office, since problems caused by poor data quality can ultimately lead to incorrect offers to customers.

With this algorithm, we try to find whether there is an anomaly between two features (columns). To avoid repeating the example from my previous post, let's consider two categorical features called "Customer Type" and "Customer Segment", distributed as follows:

In this table, it can be easily detected that the rows marked in red are outliers. However, considering the different values that countless possible column combinations can take, we need an algorithm that detects this programmatically instead of scanning the data set one by one. Moreover, we need to be able to make this detection on both existing and new data coming from the field, which requires an algorithm that knows what to call an outlier and what to call an inlier, that is, an algorithm that has learned this.

However, the outputs here need to be evaluated from a business perspective, because not every outlier is necessarily wrong. For example, an “Affluent Segment” may also be assigned to corporate-type customers due to a special and rare situation. Our goal as data scientists is to uncover these outlier records. But how to actually treat them needs to be decided together with the business unit.

Before going into the details of the algorithm, it would be useful to give a brief terminological information. In the literature, the detection of problems in existing data is generally referred to as Outlier or Anomaly Detection, and we will use these terms interchangeably. Detection of outliers in new data, especially outliers in new data when the existing data is clean, is referred to as Novelty Detection. Some algorithms on the market focus on only one of these, while others can be used for both purposes. Our algorithm can be used for both purposes.

METHOD

The logic of our algorithm is quite simple. As the name suggests, the smaller the prior probability (pp for short) of the values (*) in the relevant columns appearing together, the higher the probability of being an outlier. However, this probability of occurrence must be checked in both directions. For example, when determining the pps above, the probability of the Individual-Mass pair is 58.8% with respect to column 1 and 100% with respect to column 2. We will see in a moment how these two probabilities are taken into account.

The steps of the algorithm are as follows:

1. A contingency table of observations is created.

2. The pp of each observation is calculated (two pps, one from each direction).

3. If pp is less than a predefined threshold (default 1%), this combination is marked as outlier (others are inliers). However, there is an exception to this. If the observation frequency of the combination is above a minimum value (local_cluster_threshold, default 100), it can be considered as a local cluster and not marked as an outlier even if its pp is less than the threshold. The blue row in the table is an example of this.

4. Both outlier and inlier combinations are kept in a dataframe and the training (fit) process is completed.

5. Finally, prediction is performed with the predict method, either for the existing data or for new data coming from the field. When predicting, the relevant column combination in the raw data is compared with the inliers in the trained data. If there is a match, 1 (inlier) is written, otherwise -1 (outlier) is written. The reason for comparing with the inliers is the possibility that a combination that has not been seen before may also come from the field. For example, the Corporate-Private pair, which is not in the table above, will also need to be marked as an outlier.

* For the sake of simplicity, a combination of 2 columns was chosen. The algorithm also works for combinations of 3, but we did not extend it to 4 or more, as the output becomes too difficult to display beyond that point.
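Steps 1-3 can be sketched on a toy dataset. The data below is invented purely for illustration (it loosely mirrors the Customer Type / Customer Segment example), and the helper column names are mine, not part of the article's implementation:

```python
import numpy as np
import pandas as pd

# Invented toy data mirroring the Customer Type / Customer Segment example
df = pd.DataFrame({
    "customer_type": ["Individual"] * 170 + ["Corporate"] * 30,
    "customer_segment": (["Mass"] * 100 + ["Affluent"] * 69 + ["SME"] * 1
                         + ["SME"] * 29 + ["Mass"] * 1),
})

# Step 1: contingency table of observed combinations
counts = df.value_counts().reset_index(name="count")

# Step 2: prior probability (pp) of each combination, from both directions
counts["pp_by_type"] = counts["count"] / counts.groupby("customer_type")["count"].transform("sum")
counts["pp_by_segment"] = counts["count"] / counts.groupby("customer_segment")["count"].transform("sum")

# Step 3: outlier if either pp is below the threshold, unless the combination
# is frequent enough to count as a local cluster
outlier_threshold, local_cluster_threshold = 0.01, 100
is_rare = (counts["pp_by_type"] < outlier_threshold) | (counts["pp_by_segment"] < outlier_threshold)
counts["label"] = np.where(is_rare & (counts["count"] < local_cluster_threshold), -1, 1)
```

In this toy data the Individual-Mass pair gets a pp of 100/170, about 58.8% in one direction, while the two singleton pairs are labeled -1.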

Of these, I think point 3 needs a little more elaboration, because an obvious question arises: two pps were computed above, so which one do you take into account? Answer: it is enough for one of the two ratios to fulfill the condition for the combination to be marked as an outlier. Of course, with a triple column combination there are three ratios in total, and the combination is labeled as an outlier if at least two of them are below the threshold.

To explain the blue row a bit more: the second of its pps is 100%, meaning that everyone in the large-corporation segment is of the corporate type. While this is quite normal, looked at in reverse it is an anomaly. However, due to the local-cluster situation mentioned above, we ultimately do not call it one.

Now let’s look at how this algorithm is coded:

import logging

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, OutlierMixin, TransformerMixin


class ObservationBasedOutlierDetector(BaseEstimator, OutlierMixin):

    def __init__(self, column_comb: list, outlier_threshold: float = 0.01,
                 local_cluster_exception_list=None, local_cluster_thresh: int = 100,
                 local_cluster_jumping_decision_point: int = 10,
                 local_cluster_ratio_decision_point: float = 0.05,
                 returnOutlierLabels=True):
        self.column_comb = column_comb
        self.outlier_threshold = outlier_threshold
        self.local_cluster_thresh = local_cluster_thresh
        self.returnOutlierLabels = returnOutlierLabels
        self.local_cluster_jumping_decision_point = local_cluster_jumping_decision_point
        self.local_cluster_ratio_decision_point = local_cluster_ratio_decision_point
        self.local_cluster_exception_list = local_cluster_exception_list or []
        self.to_be_evaluated = True

    def fit(self, df: pd.DataFrame):
        try:
            thresh = self.outlier_threshold
            cols_df = df[self.column_comb].value_counts().reset_index(name="count")
            cols_df["columns_combination"] = str(self.column_comb)
            cols_df = cols_df.rename(columns={self.column_comb[0]: "first",
                                              self.column_comb[1]: "second"})
            if len(self.column_comb) == 3:
                cols_df = cols_df.rename(columns={self.column_comb[2]: "third"})
            else:
                cols_df["third"] = pd.NA
            cols_df = cols_df[["columns_combination", "first", "second", "third", "count"]]  # sort columns
            if len(self.column_comb) == 2:
                cols_df["sum_12_or_only1"] = cols_df.groupby(["columns_combination", "first"])["count"].transform("sum")
                cols_df["sum_13_or_only2"] = cols_df.groupby(["columns_combination", "second"])["count"].transform("sum")
                cols_df["sum_23_or_nan"] = pd.NA
                cols_df["ratio_12_or_only1"] = cols_df["count"] / cols_df["sum_12_or_only1"]
                cols_df["ratio_13_or_only2"] = cols_df["count"] / cols_df["sum_13_or_only2"]
                cols_df["ratio_23_or_nan"] = pd.NA
            else:
                cols_df["sum_12_or_only1"] = cols_df.groupby(["columns_combination", "first", "second"])["count"].transform("sum")
                cols_df["sum_13_or_only2"] = cols_df.groupby(["columns_combination", "first", "third"])["count"].transform("sum")
                cols_df["sum_23_or_nan"] = cols_df.groupby(["columns_combination", "second", "third"])["count"].transform("sum")
                cols_df["ratio_12_or_only1"] = cols_df["count"] / cols_df["sum_12_or_only1"]
                cols_df["ratio_13_or_only2"] = cols_df["count"] / cols_df["sum_13_or_only2"]
                cols_df["ratio_23_or_nan"] = cols_df["count"] / cols_df["sum_23_or_nan"]
            numberofTrue_atleast = len(self.column_comb) - 1
            # NA ratios (the 2-column case) never vote for "outlier"
            conditions = np.array([(cols_df["ratio_12_or_only1"] < thresh).fillna(False) * 1,
                                   (cols_df["ratio_13_or_only2"] < thresh).fillna(False) * 1,
                                   (cols_df["ratio_23_or_nan"] < thresh).fillna(False) * 1])
            self.outlier_df_ = cols_df[np.logical_and(cols_df["count"] < self.local_cluster_thresh,
                                                      conditions.sum(axis=0) >= numberofTrue_atleast)]
            # copy: new columns are added to inlier_df_ below
            self.inlier_df_ = cols_df[np.logical_or(cols_df["count"] >= self.local_cluster_thresh,
                                                    conditions.sum(axis=0) < numberofTrue_atleast)].copy()
            self.inlier_df_["threevalues"] = (self.inlier_df_["first"].astype(str) + "-"
                                              + self.inlier_df_["second"].astype(str) + "-"
                                              + self.inlier_df_["third"].astype(str))

            if self.column_comb not in self.local_cluster_exception_list:
                # sort ascending first, then compare each cluster size with the next one
                self.inlier_df_.sort_values("count", inplace=True, ignore_index=True)
                self.inlier_df_["count_next_ratio"] = self.inlier_df_["count"].shift(-1) / self.inlier_df_["count"]
                jumping_point = 0
                for idx, p in self.inlier_df_["count_next_ratio"].items():
                    if p > self.local_cluster_jumping_decision_point:
                        jumping_point = idx
                        break
                lc_ratio_calculated = (self.inlier_df_["count"].iloc[: jumping_point + 1].sum()
                                       / self.inlier_df_["count"].sum())
                self.local_cluster_values = {"jumping_point": jumping_point,
                                             "lc_ratio_calculated": lc_ratio_calculated,
                                             "max_ratio": np.round(self.inlier_df_["count_next_ratio"].max(), 2)}
                if lc_ratio_calculated > self.local_cluster_ratio_decision_point:
                    logging.debug(f"In the combination {self.column_comb}, the ratio of the local clusters "
                                  f"to the whole dataset size is {lc_ratio_calculated:.2f}, higher than the "
                                  f"threshold ({self.local_cluster_ratio_decision_point}); "
                                  f"this combination will not be evaluated by the model.")
                    self.to_be_evaluated = False
        except Exception as e:
            logging.exception(f"ML fit error: {e}")

        return self

    def predict(self, df: pd.DataFrame):
        try:
            if not self.to_be_evaluated:
                return None
            third = "-" + df[self.column_comb[2]].astype(str) if len(self.column_comb) == 3 else "-<NA>"
            threevalues = df[self.column_comb[0]].astype(str) + "-" + df[self.column_comb[1]].astype(str) + third
            col_ = str(self.column_comb)
            inlier_keys = self.inlier_df_.loc[self.inlier_df_["columns_combination"] == col_, "threevalues"].values
            # compare with the inliers so that previously unseen combinations are also flagged
            self.labels_ = np.where(threevalues.isin(inlier_keys), 1, -1)
            if self.returnOutlierLabels:
                self.labels_ = np.flatnonzero(self.labels_ == -1)
            return self.labels_
        except Exception as e:
            logging.exception(f"ML predict error: {e}")
            return None

I think the code is fairly self-explanatory, but some parts are worth spelling out. The algorithm behaves differently depending on whether a combination contains 2 or 3 columns. However, for the output to be presented in a standard way on a dashboard, the resulting dataframe must have standard column headers. For example, the column named "sum_12_or_only1" gives 'the number of observations based on column 1 only' if the combination has 2 columns, but 'the number of observations grouped by columns 1 and 2' if it has 3. The other columns can be interpreted similarly.

SCOPE

We had started to build the algorithm with only categorical data in mind, but at the end of the day we realized that most outlier detection algorithms for numerical data are density- or distance-based and therefore do not scale to large datasets like ours (hundreds of millions of rows and more), while tree-based algorithms like Isolation Forest take too long and produce a high ratio of false predictions. Therefore, we decided to discretize the numeric features into categories and use this algorithm on them as well. The results were quite satisfactory (though not as good as for purely categorical features), and the execution speed was just as good. By the way, although sklearn provides a KBinsDiscretizer for discretization, we wrote our own, because KBinsDiscretizer returns ordinal labels such as 1, 2, 3, whereas we need the actual value ranges themselves (to show them on the dashboard). The code for this is as follows:

class CustomDiscretizer(BaseEstimator, TransformerMixin):
    def __init__(self, features, bins=100):
        self.features = features
        self.bins = bins

    def fit(self, df):
        self.bin_dict_ = {}
        tempRemove = []
        for f in self.features:
            if df[f].nunique() >= self.bins:
                # keep the actual interval edges so the ranges can be shown on the dashboard
                edges = sorted({x.right for x in pd.qcut(df[f], self.bins, duplicates="drop").unique()})
                self.bin_dict_[f] = [-np.inf] + edges + [np.inf]
            else:
                logging.warning(f"Since the number of distinct values for feature {f} is less than "
                                f"the bins value, discretization was not done.")
                tempRemove.append(f)

        for t in tempRemove:
            self.features.remove(t)
        return self

    def transform(self, df, returnType=1):
        """returnType 1: whole df, 2: original numerics + bins, 3: only bins"""
        df = df.copy()  # avoid mutating the caller's dataframe
        for f in self.features:
            df[f + "_bins"] = pd.cut(df[f], bins=self.bin_dict_[f], right=True, duplicates="drop")

        if returnType == 1:
            return df
        elif returnType == 2:
            return df[self.features + [x + "_bins" for x in self.features]]
        elif returnType == 3:
            return df[[x + "_bins" for x in self.features]]
        else:
            logging.warning("Wrong value for returnType. It must be one of 1, 2, 3.")
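As a minimal, self-contained sketch of the idea behind this discretizer (not the class itself; the series and bin count below are invented for illustration), the quantile edges are derived once from training data and then reused on new data, so the human-readable ranges survive:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.Series(rng.lognormal(mean=8, sigma=1, size=10_000))

# Derive quantile-based bin edges once on the training data,
# keeping the real interval boundaries instead of ordinal labels
edges = [-np.inf] + sorted({iv.right for iv in pd.qcut(train, 10, duplicates="drop").unique()}) + [np.inf]

# Reuse the same edges on new data; out-of-range values still land in the +/-inf bins
new = pd.Series([1.0, 3000.0, 1e9])
binned = pd.cut(new, bins=edges)
```

The resulting categories print as readable intervals, which is exactly what a dashboard needs.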

Now let's look at an example with a mixed dataset, and use it to examine a 3-column combination as well. This time I did not write out the observation counts and pps; the same logic applies. It is obvious that there is a strong relationship in the table below (consider it a small sample from a table with thousands of rows). If we looked only at pairwise correlations, we would find a correlation between the monthly and the total amount. However, a more accurate relationship emerges when the payment period comes into play through triple combinations.

Consider a scenario where the monthly payment type dominates (80-90% of the data). In that case, looking only at the pairwise correlation, every record from the 3rd row onwards (except the last) would be anomalous, since the relationship between the two amount columns would generally be 1-to-1, i.e. the correlation value would be close to 1. But once the payment period enters the equation, it becomes clear that only the rows in red are anomalous. In the first red row, either the last column should be 250 or the first column should be "Quarterly"; likewise, in the second red row, either the first column should be "Monthly" or the last column should be 6000 (while the monthly and total amounts are equal for monthly records, a 3-fold ratio is expected for quarterly records and a 12-fold ratio for annual ones). Looking at all 3 columns this way, the correlation turns out to be even higher than the pairwise one.
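The payment-period scenario can be imitated with a small, hypothetical dataset (all values invented; the grouping below is a simplified one-directional pp rather than the exact 2-of-3 rule from the METHOD section, and a 5% threshold is used so the effect is visible on so few rows):

```python
import pandas as pd

rows = ([("Monthly", 500, 500)] * 90          # monthly payments dominate
        + [("Quarterly", 500, 1500)] * 20     # total = 3 x monthly
        + [("Annual", 500, 6000)] * 5         # total = 12 x monthly
        + [("Quarterly", 500, 500)])          # faulty record: total should be 1500
df = pd.DataFrame(rows, columns=["period", "monthly_amount", "total_amount"])

counts = df.value_counts().reset_index(name="count")
thresh = 0.05

# Pairwise view (amounts only): rare-but-correct totals look suspicious
pair_pp = counts["count"] / counts.groupby("monthly_amount")["count"].transform("sum")
# Triple view: pp of the full combination within its (period, monthly) group
triple_pp = counts["count"] / counts.groupby(["period", "monthly_amount"])["count"].transform("sum")

counts["pair_outlier"] = pair_pp < thresh
counts["triple_outlier"] = triple_pp < thresh
```

Here the pairwise view also flags the perfectly valid Annual records, while the triple view flags only the faulty Quarterly row.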

DETERMINATION OF COMBINATIONS

An important way this algorithm differs from other algorithms already in use is that it works with only 2 or 3 columns, which is exactly what our case requires. In outlier detection studies in other domains, it may be meaningful to look at all columns at once; even then, some algorithms perform feature selection or apply dimensionality reduction beforehand. Here, however, we preferred to look at the column combinations with the highest direct correlation. Indeed, for the banking data I work with, domain knowledge tells us that the relationship is usually observed between 2 columns, rarely 3, and very rarely more.

On the other hand, the way the data is displayed is also important, so we evaluated this relationship for a maximum of 3 columns. We think that there will be problems in displaying and interpreting the output of more than three combinations.

When determining the combinations, we look at Cramér's V for pairs, and for triples the chi-square statistic and, indirectly, a phi coefficient obtained with the same logic (by adding a layer to the contingency table). Taking the p-values into account, we decide whether to add each combination to the final list. If a pair is contained in a triple and its correlation score is not better than the triple's (i.e. at least 1 point higher), the pair is removed from the final list to avoid duplication.
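For the pairwise case, a common way to derive Cramér's V from the chi-square statistic looks roughly like this (a generic sketch, not our production code; the triple-combination metric described above is not shown):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(s1: pd.Series, s2: pd.Series):
    """Cramér's V plus the chi-square p-value for two categorical series."""
    table = pd.crosstab(s1, s2)
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return v, p

# A perfectly dependent pair should score close to 1 with a tiny p-value
a = pd.Series(["x", "y"] * 50)
v, p = cramers_v(a, a.map({"x": "X", "y": "Y"}))
```

Combinations whose V is high and whose p-value is below the chosen significance level are the candidates worth sending to the detector.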

It is important to note that while the Cramér's V metric we use for pairs is symmetric, the one we use for triples is not. Therefore, for pairs we look at combinations, while for triples we look at permutations. However, to keep the computational cost under control, depending on the total number of permutations we may take a random sample of them, which may cause us to miss some good combinations. Once the full list of combinations is ready, they are sent to the algorithm above, either serially or in parallel. Of course, in parallel processing the main dataframe is distributed to all CPUs, so memory problems are likely; whether to run serially or in parallel should therefore be decided carefully.

Although combinations are determined before the modeling phase, we also eliminate some of them during modeling. In the fit (training) phase, we sort the inlier clusters from smallest to largest and compare each cluster's size with the next one, which gives a rate of change. We mark the point where this ratio exceeds a set value (local_cluster_jumping_decision_point, default 10), and compare the population of the clusters up to this point with the total population. If this ratio in turn exceeds another set value (local_cluster_ratio_decision_point, default 0.05), we conclude that 'the amount of local clusters is too high' and do not produce results for this combination. This step, which may seem complicated, checks whether the number of local clusters is significant; otherwise there would be many local clusters following each other with small differences, which is not very meaningful for categorical data. We expect something closer to the Pareto principle, i.e. many large clusters and only a few local clusters.
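The jump-detection logic can be sketched in isolation (the cluster sizes below are invented and the variable names are mine):

```python
import pandas as pd

# Hypothetical inlier cluster sizes for one combination, sorted ascending
sizes = pd.Series([3, 5, 8, 9, 400, 5000, 12000])

jump_factor = 10     # local_cluster_jumping_decision_point
ratio_limit = 0.05   # local_cluster_ratio_decision_point

# Rate of change: each cluster compared with the next larger one
growth = sizes.shift(-1) / sizes
jumping_point = next((i for i, g in growth.items() if g > jump_factor), None)

drop_combination = False
if jumping_point is not None:
    # share of the "local" clusters (everything before the jump) in the population
    lc_ratio = sizes.iloc[: jumping_point + 1].sum() / sizes.sum()
    drop_combination = lc_ratio > ratio_limit
```

With these numbers the jump occurs between 9 and 400, and the four small clusters make up well under 5% of the population, so the combination is kept.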

Another issue is that even if there is a high correlation between certain columns, they may be deemed irrelevant by the business unit. Therefore, in an end-to-end system, there should be a setup that includes business unit intervention. We receive them from the data owner as an input list.

TRAINING AND PREDICTION CODES

The algorithm is run for each combination and the models created for that combination are saved in pickle format. These pickle files are then called for new data coming from the field and used for prediction.

During the first run, the fit_predict method is applied to all existing data, while only predict is run on new data coming from the field; in both cases, the necessary preprocessing is done first.

You can also find code examples of these below:

# first run
results = []
obodmodels = {}
for e, col_comb in tqdm(enumerate(col_combs)):
    col_ = ",".join(col_comb)
    OBOD = ObservationBasedOutlierDetector(col_comb, outlier_threshold=findThresh(corrlist[e]),
                                           local_cluster_exception_list=lc_exception_list)
    ongoruler = OBOD.fit_predict(finaldatadf)
    obodmodels[col_] = OBOD  # stored in the dictionary that will later be pickled
    outlier_indices_ = ongoruler
    if outlier_indices_ is not None and len(outlier_indices_) > 0:
        resultdf = finaldatadf.iloc[outlier_indices_][col_comb]
        resultdf["col_comb"] = col_
        resultdf = resultdf.rename(columns={col_comb[0]: "first", col_comb[1]: "second"})
        if len(col_comb) == 3:
            resultdf = resultdf.rename(columns={col_comb[2]: "third"})
        resultdf["score_12_or_only1"] = pd.merge(OBOD.outlier_df_, resultdf, how="right")["ratio_12_or_only1"].fillna(0).values
        resultdf["score_13_or_only2"] = pd.merge(OBOD.outlier_df_, resultdf, how="right")["ratio_13_or_only2"].fillna(0).values
        resultdf["score_23_or_nan"] = pd.merge(OBOD.outlier_df_, resultdf, how="right")["ratio_23_or_nan"].fillna(0).values
        resultdf["observation"] = pd.merge(OBOD.outlier_df_, resultdf, how="right")["count"].fillna(0).values
        results.append(resultdf)

# for newly arriving data
results = []
for col_, model in obod_pkl.items():
    col_comb = col_.split(",")
    ongoruler = model.predict(finaldatadf)
    outlier_indices_ = ongoruler
    if outlier_indices_ is not None and len(outlier_indices_) > 0:
        resultdf = finaldatadf.iloc[outlier_indices_][col_comb]
        resultdf["col_comb"] = col_
        resultdf = resultdf.rename(columns={col_comb[0]: "first", col_comb[1]: "second"})
        if len(col_comb) == 3:
            resultdf = resultdf.rename(columns={col_comb[2]: "third"})
        resultdf["score_12_or_only1"] = pd.merge(model.outlier_df_, resultdf, how="right")["ratio_12_or_only1"].fillna(0).values
        resultdf["score_13_or_only2"] = pd.merge(model.outlier_df_, resultdf, how="right")["ratio_13_or_only2"].fillna(0).values
        resultdf["score_23_or_nan"] = pd.merge(model.outlier_df_, resultdf, how="right")["ratio_23_or_nan"].fillna(0).values
        resultdf["observation"] = pd.merge(model.outlier_df_, resultdf, how="right")["count"].fillna(0).values
        results.append(resultdf)

Finally, our output will be as follows. We explained how to read this table in the last paragraph of the METHOD section above:

CONCLUSION

As mentioned above, although our algorithm works for 2 and 3 columns, it is usually unnecessary to look further when it comes to data quality. Other highlights can be summarized as follows:

  • It runs very fast since it has O(n) complexity.
  • By discretizing the numeric columns, it is possible to obtain high speed output in numeric columns as well.
  • Therefore, it would be wrong to think of it as an algorithm that works only for categorical data: it can be used for fully categorical, fully numeric, and mixed datasets.
  • We have four parameters to tune: outlier_threshold (default 0.01), local_cluster_threshold (default 100), local_cluster_jumping_decision_point (default 10) and local_cluster_ratio_decision_point (default 0.05). Since there is no reliable metric to optimize in unsupervised learning, you can set your own criteria: for example, you may want the number of anomalies to be lower or higher and adjust accordingly, or you can use ParameterGrid and decide based on the results. You could also consider a different parameter set for each combination (e.g. 0.01, 100, 10 and 0.05 for the pair [A,B] and 0.001, 10, 5, 0.01 for [B,C] respectively), which would mean modifying the algorithm itself; this could easily be implemented if needed, but we have not needed it so far.
  • It can be used both for Anomaly/Outlier Detection and Novelty Detection.
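The ParameterGrid idea mentioned above can be sketched as follows (the detector call is commented out because it depends on your data; the parameter values are examples only):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical sweep over the four tunable parameters; without labels, the
# number of flagged outliers per setting is one practical proxy to compare runs
grid = ParameterGrid({
    "outlier_threshold": [0.001, 0.01, 0.05],
    "local_cluster_thresh": [50, 100],
    "local_cluster_jumping_decision_point": [5, 10],
    "local_cluster_ratio_decision_point": [0.01, 0.05],
})

counts_per_setting = {}
for params in grid:
    # detector = ObservationBasedOutlierDetector(column_comb, **params)
    # counts_per_setting[tuple(sorted(params.items()))] = len(detector.fit_predict(df))
    pass  # plug in your own data and the detector class from this article
```

The setting whose outlier counts the business unit finds acceptable can then be kept.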

I hope you find this algorithm useful in your own work. If you do, sharing your experience in the comments would be valuable for everyone who reads this article.

Thank you for reading this article. See you in our next article…
