Observation
Something seems off with return_alternatives: I would have expected the returned alternative tokens at the next position to correspond to the top-probability tokens sampled at that same position.
But it seems some of the best tokens can be omitted from the alternatives, while tokens that are weird (given the src and the prefix) and never show up during sampling can be returned in the alternatives.
This can then naturally have a bad influence on the rest of the completion after the alternative, but I kept the focus only on the next 1-token alternative vs sampling.
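To state the expectation concretely, here is a hedged toy illustration (the tokens and probabilities below are made up for illustration, not taken from the model): if the next-token distribution at the prefix position is p, then repeated high-temperature sampling and a min-probability alternatives cutoff should surface the same high-p tokens.

```python
import random

random.seed(0)

# Hypothetical next-token distribution at the prefix position (made-up numbers).
dist = {"▁is": 0.90, "▁for": 0.04, ",": 0.03, "▁will": 0.02, "▁good": 0.01}
tokens, probs = zip(*dist.items())

# Sampling many times from the distribution...
samples = set(random.choices(tokens, weights=probs, k=1000))
# ...should cover every token above a min-probability cutoff, which is
# roughly what min_alternative_expansion_prob is expected to mimic.
alts = {t for t, p in dist.items() if p >= 0.02}
print(alts <= samples)
```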
Model preparation
# see BPE and SP variants at https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-en/README.md, using latest 2020 SP variant below
# for https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2020-02-26.zip
# extract content to $OPUS/opus-de-en, then:
ct2-marian-converter --model_path $OPUS/opus-de-en/*.npz \
--vocab_paths $OPUS/opus-de-en/*vocab.yml $OPUS/opus-de-en/*vocab.yml \
--output_dir $OPUS/opus-de-en/ct2-full
Testing and results
(See code at end to reproduce)
For a given input sentence about planning in German, a (tokenized) target prefix "Good planning" is set in English.
Then, two actions are performed on that prefix:
a) the next token is sampled repeatedly with top-k sampling
b) next-token alternatives are queried via return_alternatives
Output of sampling/alternatives:
Observe that well and good don't show up in the top-10 samples, but do show up in the alternatives. good in particular would result in completions starting with Good planning good, which doesn't make much sense; that is what raised the suspicion.
=== Top-K @k=10, @temp=1.0
▁is 404
, 3
▁was 3
▁has 2
▁will 3
▁for 3
▁of 1
▁can 1
=== Top-K @k=10, @temp=1000.0
▁is 43
▁for 38
, 49
▁will 44
▁becomes 50
▁can 49
▁of 42
▁has 24
▁was 45
▁remains 36
=== Alternatives @minprob=0.001
▁is
▁well
▁for
▁good
▁of
▁a
▁quality
▁has
▁in
▁will
=== Alternatives @minprob=0.02
▁is
▁well
▁for
▁good
=== Alternatives @minprob=0.03
▁is
▁well
=== Alternatives @minprob=0.11
▁is
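For a quick quantitative view of the mismatch, the two lists above (top-k samples at @temp=1000.0 vs. alternatives at @minprob=0.001) can be diffed as sets; the token strings below are copied from the outputs above.

```python
# Tokens that appeared in top-k sampling at temp=1000.0 (from the run above).
sampled = {"▁is", "▁for", ",", "▁will", "▁becomes",
           "▁can", "▁of", "▁has", "▁was", "▁remains"}
# Tokens returned as alternatives at min_alternative_expansion_prob=0.001.
alternatives = {"▁is", "▁well", "▁for", "▁good", "▁of",
                "▁a", "▁quality", "▁has", "▁in", "▁will"}

# In the alternatives but never sampled -- the suspicious tokens:
print(sorted(alternatives - sampled))  # ['▁a', '▁good', '▁in', '▁quality', '▁well']
# Sampled but missing from the alternatives:
print(sorted(sampled - alternatives))  # [',', '▁becomes', '▁can', '▁remains', '▁was']
```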
Other models tried
Tried the small toy test model CTranslate2/tests/data/models/v2/aren-transliteration (the code has an input example for it), but it did not reproduce such an inconsistency.
Tried the BPE-encoded opus de-en model, and it behaves similarly to the SP-encoded one used here.
Tried CT2 3.1.0 from a docker image; it had this same behavior too. I couldn't try any earlier CT2, since the docker images don't have ct2-marian-converter, and I didn't check whether it would be easy to backport (they do have other converters, but with missing Python deps like onmt or fairseq - while I could get torch+cpu to install inside the docker, I couldn't get those to install, so I couldn't convert any sensible model for them).
Changes that seemed relevant
I eyeballed 62d5396 (touching around 3.1), but at least to me nothing stuck out. I also saw 739a5b1 (touching around 2.20) but didn't review it deeply. It would have been nice to run some older early-2.x CT2 to see whether this is a regression or was present from the start, but as described above, no luck.
Test code
import ctranslate2
import sys
model = sys.argv[1]
translator = ctranslate2.Translator(model)
# for use with opus-de-en
src = '▁Eine ▁gute ▁Planung ▁ist ▁für ▁ein ▁leistungs basierte s ▁Modell ▁unverzichtbar , ▁wenngleich ▁sich ▁die s ▁als ▁schwierig ▁erwies .'
tgt = '▁Good ▁planning'
"""
# for use with CTranslate2/tests/data/models/v2/aren-transliteration:
src = 'و ر ن ل س'
tgt = ''
"""
"""
# for use with opus-de-en BPE
src = 'Eine gute Planung ist für ein leistungs basierte s Modell unverzichtbar , wenngleich sich die s als schwierig erwies .'
tgt = 'Good planning'
"""
print("Prefix:", tgt)
NREPS = 42
# 4.7 can handle beam_size=1 with num_hypotheses=10 for top-k sampling,
# but 3.1 needs beam_size >= num_hypotheses.
"""
SAMPK = 10
SAMPBEAMS = 2
SAMPHYPS = 2
"""
SAMPK = 10
SAMPBEAMS = 1
SAMPHYPS = 10
# 4.7 and 3.1 both return "▁good" and "▁well" in the top-4 alternatives.
def topk_at_temp(k, temp):
    counts = {}
    # Large repetition count so the sampling counts are more representative
    for rep in range(NREPS):
        results = translator.translate_batch([src.split()],
            beam_size=SAMPBEAMS,  # 1 works with latest, 3.1.0 needs beams >= hyps
            target_prefix=[tgt.split()],
            max_decoding_length=len(tgt.split()) + 1,
            num_hypotheses=SAMPHYPS,
            sampling_topk=k,
            sampling_temperature=temp,
        )
        for hyp in results[0].hypotheses:
            sample = hyp[-1]
            counts[sample] = counts.get(sample, 0) + 1
    print(f"=== Top-K @k={k}, @temp={temp}")
    for token, count in counts.items():
        print(token, count)
def alternatives_minprob(minp):
    results = translator.translate_batch([src.split()],
        beam_size=1,
        target_prefix=[tgt.split()],
        max_decoding_length=len(tgt.split()) + 1,
        num_hypotheses=10,
        return_alternatives=True,
        min_alternative_expansion_prob=minp,
    )
    print(f"=== Alternatives @minprob={minp}")
    for hyp in results[0].hypotheses:
        print(hyp[-1])
topk_at_temp(SAMPK, 1.0)
topk_at_temp(SAMPK, 2.0)
topk_at_temp(SAMPK, 10.0)
topk_at_temp(SAMPK, 1000.0)
alternatives_minprob(0.001)
alternatives_minprob(0.01)
alternatives_minprob(0.02)
alternatives_minprob(0.03)
alternatives_minprob(0.11)