Software story

A decision tree for questions that come in pairs

PWC4.5 is an open-source classifier for problems where the answer hides in the relationship between two instances, not in either instance alone. This is the story of why those problems break ordinary classifiers, and what fixing that looks like. Each experiment runs from a single command, so you can reproduce the results yourself.

Software: PWC4.5 (Java, runnable JAR)
Repository: github.com/Hebaelfiqi/PWC4.5
Datasets: Synthetic 2D/5D + 21 translator pairs (CC BY 4.0)
Paper: ACM TALLIP, 2016 · doi:10.1145/2898997

The problem

When the signal lives between the rows

Take two translators and hand them the same source text. Each produces a fluent translation, and each leaves stylistic fingerprints: tiny grammatical habits that show up as network motifs in the structure of their sentences. Now the question: given a translation, who produced it?

Here is the trap. Both translators use the same motifs; what differs is how often each uses them, and that difference shows up only relative to the other translator on the same source. A classifier that looks at each translation in isolation sees nothing usable. Standard C4.5, a workhorse decision tree, managed 52.12% accuracy on translator identification: a coin flip. The information was never in the individual rows. It was in the comparison between them, and classical classifiers have no way to say “compare this row with its partner.”
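A toy illustration of that trap, as a minimal sketch rather than anything taken from the PWC4.5 code base. It assumes a single made-up "motif frequency" that is dominated by the source text, with translator A always slightly above translator B on the same source. The class name, the seed, and the value ranges are all hypothetical.

```java
import java.util.Random;

/** Toy data (hypothetical): a signal that exists only within pairs. */
final class PairedSignalDemo {

    /** values[i][0] is translator A, values[i][1] is translator B, both on source text i. */
    static double[][] makePairs(int n, long seed) {
        Random rng = new Random(seed);
        double[][] values = new double[n][2];
        for (int i = 0; i < n; i++) {
            double sourceEffect = rng.nextDouble() * 100.0; // varies widely with the source text
            double gap = 0.5 + rng.nextDouble();            // A is always a little higher than B
            values[i][0] = sourceEffect + gap;
            values[i][1] = sourceEffect;
        }
        return values;
    }

    /** Best accuracy any per-instance rule "value > t means A" (or its reverse) can reach. */
    static double bestThresholdAccuracy(double[][] values) {
        int n = values.length;
        double best = 0.0;
        for (double[] candidate : values) {
            for (double t : candidate) {          // try every observed value as a threshold
                int correct = 0;
                for (double[] pair : values) {
                    if (pair[0] > t) correct++;    // A correctly called A
                    if (!(pair[1] > t)) correct++; // B correctly called B
                }
                double acc = correct / (2.0 * n);
                best = Math.max(best, Math.max(acc, 1.0 - acc));
            }
        }
        return best;
    }

    /** Pairwise rule: within each pair, call the member with the larger value A. */
    static double pairwiseAccuracy(double[][] values) {
        int correctPairs = 0;
        for (double[] pair : values) {
            if (pair[0] > pair[1]) correctPairs++;
        }
        return correctPairs / (double) values.length;
    }

    public static void main(String[] args) {
        double[][] values = makePairs(500, 42L);
        System.out.println("best single threshold: " + bestThresholdAccuracy(values));
        System.out.println("pairwise comparison:   " + pairwiseAccuracy(values));
    }
}
```

On this constructed data the best single threshold stays close to chance, because the source text swamps the small gap, while the within-pair comparison is right on every pair by design. Real data is not this clean, and the direction of the comparison has to be learned rather than assumed.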

The idea

Let the tree ask comparative questions

We named the setting the Pairwise Comparative Classification Problem (PWCCP): data arrives in matched pairs, and the class of each member is defined relative to its partner. PWC4.5 rebuilds C4.5 for that setting. Instead of splitting on “is this value above a threshold?”, it splits on a relationship over the pair: is this member’s value the smaller or the larger of the two? It also keeps pairs together as they travel down the tree, so partners are never separated from the very context that defines them.

The full pipeline runs from motif frequencies extracted from paired translations, through the pairwise framing, to the modified split rule. On translator identification, C4.5 scores 52.12% accuracy; PWC4.5 reaches 80.23%.
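The comparative split described above can be sketched in a few lines. This is an illustration under stated assumptions, not the PWC4.5 source: the names (ComparativeSplitSketch, MatchedPair, firstIsSmaller) are hypothetical, the label is simplified to "is the first member class A?", and attributes are ranked by plain information gain for brevity, whereas C4.5 itself uses gain ratio.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a pair-level comparative split; not the PWC4.5 source code. */
final class ComparativeSplitSketch {

    /** One matched pair: both members' attribute values, plus whether the first member is class A. */
    static final class MatchedPair {
        final double[] first;
        final double[] second;
        final boolean firstIsA;

        MatchedPair(double[] first, double[] second, boolean firstIsA) {
            this.first = first;
            this.second = second;
            this.firstIsA = firstIsA;
        }
    }

    /** Comparative test: is the first member's value on this attribute the smaller of the two? */
    static boolean firstIsSmaller(MatchedPair p, int attr) {
        return p.first[attr] < p.second[attr];
    }

    /** Routes whole pairs to one of two branches, so partners are never separated. */
    static List<List<MatchedPair>> partition(List<MatchedPair> pairs, int attr) {
        List<MatchedPair> smaller = new ArrayList<>();
        List<MatchedPair> notSmaller = new ArrayList<>();
        for (MatchedPair p : pairs) {
            (firstIsSmaller(p, attr) ? smaller : notSmaller).add(p);
        }
        return List.of(smaller, notSmaller);
    }

    /** One entropy term, -p * log2(p), with the convention that 0 * log 0 = 0. */
    static double bits(double p) {
        return p <= 0.0 ? 0.0 : -p * Math.log(p) / Math.log(2.0);
    }

    /** Entropy in bits of the pair labels at a node. */
    static double entropy(List<MatchedPair> pairs) {
        if (pairs.isEmpty()) return 0.0;
        int a = 0;
        for (MatchedPair p : pairs) if (p.firstIsA) a++;
        double pa = a / (double) pairs.size();
        return bits(pa) + bits(1.0 - pa);
    }

    /** Information gain of the comparative test on one attribute. */
    static double gain(List<MatchedPair> pairs, int attr) {
        if (pairs.isEmpty()) return 0.0;
        double after = 0.0;
        for (List<MatchedPair> branch : partition(pairs, attr)) {
            after += branch.size() / (double) pairs.size() * entropy(branch);
        }
        return entropy(pairs) - after;
    }

    /** Index of the attribute whose comparative test has the highest gain. */
    static int bestAttribute(List<MatchedPair> pairs, int numAttributes) {
        int best = -1;
        double bestGain = -1.0;
        for (int attr = 0; attr < numAttributes; attr++) {
            double g = gain(pairs, attr);
            if (g > bestGain) { bestGain = g; best = attr; }
        }
        return best;
    }

    /** Four made-up pairs: attribute 0 is uninformative, attribute 1 orders every pair by class. */
    static List<MatchedPair> demoPairs() {
        return List.of(
            new MatchedPair(new double[] {10, 3}, new double[] {12, 5}, true),
            new MatchedPair(new double[] {40, 9}, new double[] {38, 7}, false),
            new MatchedPair(new double[] {25, 2}, new double[] {20, 6}, true),
            new MatchedPair(new double[] {30, 8}, new double[] {35, 4}, false));
    }

    public static void main(String[] args) {
        List<MatchedPair> pairs = demoPairs();
        for (int attr = 0; attr < 2; attr++) {
            System.out.println("attribute " + attr + ": gain = " + gain(pairs, attr));
        }
        System.out.println("best attribute = " + bestAttribute(pairs, 2));
    }
}
```

In the made-up data, the comparison on attribute 0 splits the labels evenly and earns no gain, while the comparison on attribute 1 separates them completely, so it is chosen. The point of the sketch is the unit of work: every test reads both members, and every branch receives whole pairs.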

What's in the box

Everything needed to reproduce the paper

The repository ships the full Java implementation as a standalone runnable JAR, the synthetic 2D and 5D benchmarks used to stress the method under controlled noise, the 21-pair translator-stylometry datasets (released under CC BY 4.0), and scripts that reproduce every experiment in the paper. One command runs an experiment end to end:

$ java -jar pwc45-1.0.0.jar -ip data//2d_data//1st_exp// -f 2D_Noise_0.0 -u

Grab the JAR from the releases page; the datasets are in the repository.

Who should care

Paired data is everywhere

Stylometry was the birthplace, but the pattern is general. Which of two candidate documents is the original and which the imitation? Which of two sensor calibrations drifted? Which of a matched case–control pair is the case? Whenever instances arrive in twos and the label is relative, per-instance classifiers throw the signal away, and the PWCCP framing recovers it. PWC4.5 is a working, reproducible baseline for exactly those problems.

Cite & explore

The formal version

H. El-Fiqi, E. Petraki and H. A. Abbass, “Pairwise Comparative Classification for Translator Stylometric Analysis,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 16, no. 1, article 2, 2016. doi:10.1145/2898997

Get the code and datasets →
The research story in the paper gallery →

How this page was written. The research, the results, and the ideas here are mine and my co-authors’. To retell them in plain language, I worked with an AI writing assistant that helped draft the text and render the diagrams in this site’s style. I reviewed and edited everything, and the technical responsibility rests with me. If the prose reads a little differently from my papers, that is why.