English in, emoji out — and a measurable answer to whether the translation is any good.
Most emoji translators emit emoji and stop. The hard part is judging the result, and the academic literature shows why that is difficult.
EmojiLM (Peng et al., 2023) is a distilled BART/T5 encoder–decoder for bidirectional text ↔ emoji. It's trained on Text2Emoji — a ~503K-pair English→emoji corpus that was itself synthesized by prompting an LLM (gpt-3.5-turbo) across 19 domains, with a 2.3K-emoji vocabulary (over 100× the 20 classes of the older TweetEval benchmark).
emoji2vec (Eisner et al., 2016) maps every Unicode emoji into the same 300-d space as pretrained word2vec word embeddings. It trains each emoji's vector against the summed word vectors of its Unicode name and keywords, which gives usable vectors even for rare emoji.
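As a rough illustration of that training idea, here is a minimal pure-Python sketch. It assumes a logistic objective in which an emoji vector should score high against the summed word vectors of its own description and low against other descriptions. The toy descriptions, the random stand-in word vectors, the 16-d size, and the names `describe` and `train` are all hypothetical, and every other description is used as a negative for simplicity; none of this is the emoji2vec code itself.

```python
import math
import random

random.seed(0)
DIM = 16  # toy size for the sketch; emoji2vec itself uses 300-d vectors


def rand_vec(scale=1.0):
    return [random.gauss(0.0, scale) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical name/keyword descriptions (illustrative, not the Unicode data).
DESCRIPTIONS = {
    "😂": ["face", "tears", "joy"],
    "🎉": ["party", "popper"],
    "🔥": ["fire", "flame"],
}

# Random stand-ins for frozen, pretrained word vectors.
WORD_VECS = {w: rand_vec() for words in DESCRIPTIONS.values() for w in words}


def describe(words):
    """Sum the word vectors of a description, element by element."""
    return [sum(col) for col in zip(*(WORD_VECS[w] for w in words))]


def train(epochs=300, lr=0.02):
    """Fit one vector per emoji with a logistic loss; word vectors stay fixed."""
    emoji_vecs = {e: rand_vec(0.1) for e in DESCRIPTIONS}
    targets = {e: describe(ws) for e, ws in DESCRIPTIONS.items()}
    for _ in range(epochs):
        for e, vec in emoji_vecs.items():
            for other, target in targets.items():
                label = 1.0 if other == e else 0.0  # own description is the positive
                err = sigmoid(dot(vec, target)) - label
                for i in range(DIM):
                    vec[i] -= lr * err * target[i]  # gradient step on the emoji vector only
    return emoji_vecs, targets


emoji_vecs, targets = train()
for e in DESCRIPTIONS:
    best = max(targets, key=lambda o: dot(emoji_vecs[e], targets[o]))
    print(e, "is closest to the description of", best)
```

Because only the emoji vectors are updated, they end up in the same space as the word vectors, which is what lets an emoji be compared directly with ordinary words.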
One sentence maps to many valid emoji renderings, and one emoji carries many senses (😂 ≠ "crying"). So token-overlap metrics like BLEU, and exact match against a single reference, are unreliable here: 🎉 and 😄 are both correct renderings of "I'm happy", yet they share no emoji tokens.
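A small sketch makes the failure concrete. It assumes each emoji counts as one token and uses a clipped unigram precision as a simplified stand-in for BLEU; the function names are illustrative, not from any library.

```python
from collections import Counter


def exact_match(candidate, reference):
    """True only when the two emoji sequences are identical."""
    return candidate == reference


def unigram_precision(candidate, reference):
    """Share of candidate tokens also found in the reference (clipped counts)."""
    if not candidate:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(count, ref[token]) for token, count in cand.items())
    return overlap / len(candidate)


# Two acceptable renderings of "I'm happy", one emoji token each.
reference = ["🎉"]
candidate = ["😄"]

print(exact_match(candidate, reference))        # False
print(unigram_precision(candidate, reference))  # 0.0
```

Both metrics give the second rendering the lowest possible score even though a human reader would accept it.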
That splits "is it good?" into two different questions: Fidelity (does the emoji preserve the meaning?) and Naturalness (would a real person actually text this?). A string of 12 literal emoji can be perfectly faithful yet completely inhuman.
| In the literature | On this page |
|---|---|
| Forced-choice human preference vs a reference — EmojiLM's human study has annotators pick the better of model-vs-corpus emoji. Measures naturalness, crudely. | → motivates our Naturalness score. |
| Back-translation / cloze test — Emojinize (2024) asks whether a third party can recover the original meaning from the emoji alone. An objective fidelity probe. arXiv:2403.03857 ↗ | → this is exactly our Fidelity score: we back-translate the emoji and check how well the meaning survives ("reads back as …"). |
| Single-emoji prediction with macro-F1 — SemEval-2018 Task 2. Useful, but single-emoji and exact-match only. ACL S18-1003 ↗ | → context for why exact-match alone isn't enough. |
| Semantics-preserving evaluation (2024) — instead of exact match, check whether attributes like sentiment, emotion, and stance are preserved. arXiv:2409.10760 ↗ | → motivates our Tone-match check. |
The gap between emitting emoji and knowing whether they are any good is why this page runs two models head-to-head and scores both with the same fixed referee.
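The back-translation idea behind the Fidelity score, applied by one shared referee to two models, can be sketched as follows. Everything here is a hypothetical stand-in: the `GLOSS` table plays the role of a real emoji-to-text back-translator, bag-of-words cosine plays the role of a real semantic-similarity measure, and `model_a` / `model_b` are made-up outputs. It shows the shape of the comparison, not how this page actually computes its scores.

```python
import math
from collections import Counter

# Hypothetical gloss table standing in for a real emoji-to-text model.
GLOSS = {
    "🎉": "happy celebration",
    "😄": "happy smile",
    "😢": "sad",
}


def back_translate(emojis):
    """Read an emoji sequence back as text using the toy gloss table."""
    return " ".join(GLOSS.get(e, "") for e in emojis).strip()


def cosine(a, b):
    """Bag-of-words cosine similarity between two short texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def fidelity(source, emojis):
    """Score how much of the source meaning survives the round trip."""
    return cosine(source, back_translate(emojis))


source = "happy celebration"
outputs = {"model_a": ["🎉"], "model_b": ["😢"]}  # made-up model outputs

# The same referee function scores both models, so the scores are comparable.
scores = {name: fidelity(source, out) for name, out in outputs.items()}
for name, score in scores.items():
    print(name, round(score, 2), "reads back as:", back_translate(outputs[name]))
```

Holding the referee fixed means any difference in score comes from the models' outputs rather than from the scoring method.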