Observation
Something seems off with return_alternatives: I would have expected the returned alternative tokens at the next position to correspond to the top-probability tokens sampled at that same position.
But it seems some of the best tokens can be omitted from the alternatives, while tokens that are weird (given the src and the prefix) and never show up during sampling can be returned in the alternatives.
This can then naturally have a bad influence on the rest of the completion after the alternative, but I kept the focus only on the next 1-token alternative vs sampling.
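To state the expectation concretely, here is a hedged toy illustration (the tokens and probabilities below are made up for illustration, not taken from the model): if the next-token distribution at the prefix position is p, then repeated high-temperature sampling and a min-probability alternatives cutoff should surface the same high-p tokens.

```python
import random

random.seed(0)

# Hypothetical next-token distribution at the prefix position (made-up numbers).
dist = {"▁is": 0.90, "▁for": 0.04, ",": 0.03, "▁will": 0.02, "▁good": 0.01}
tokens, probs = zip(*dist.items())

# Sampling many times from the distribution...
samples = set(random.choices(tokens, weights=probs, k=1000))
# ...should cover every token above a min-probability cutoff, which is
# roughly what min_alternative_expansion_prob is expected to mimic.
alts = {t for t, p in dist.items() if p >= 0.02}
print(alts <= samples)
```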
Model preparation
# see BPE and SP variants at https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-en/README.md, using latest 2020 SP variant below
# for https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2020-02-26.zip
# extract content to $OPUS/opus-de-en, then:
ct2-marian-converter --model_path $OPUS/opus-de-en/*.npz \
--vocab_paths $OPUS/opus-de-en/*vocab.yml $OPUS/opus-de-en/*vocab.yml \
--output_dir $OPUS/opus-de-en/ct2-full
Testing and results
(See code at end to reproduce)
For a given input sentence about planning in German, a (tokenized) target prefix "Good planning" is set in English.
Then, two actions are performed on that prefix:
a) the next token is sampled repeatedly with top-k sampling
b) next-token alternatives are queried via return_alternatives
Output of sampling/alternatives:
Observe that well and good don't show up in the top-10 samples, but do show up in the alternatives. good in particular would result in completions starting with Good planning good, which doesn't make much sense; that is what raised the suspicion.
=== Top-K @k=10, @temp=1.0
▁is 404
, 3
▁was 3
▁has 2
▁will 3
▁for 3
▁of 1
▁can 1
=== Top-K @k=10, @temp=1000.0
▁is 43
▁for 38
, 49
▁will 44
▁becomes 50
▁can 49
▁of 42
▁has 24
▁was 45
▁remains 36
=== Alternatives @minprob=0.001
▁is
▁well
▁for
▁good
▁of
▁a
▁quality
▁has
▁in
▁will
=== Alternatives @minprob=0.02
▁is
▁well
▁for
▁good
=== Alternatives @minprob=0.03
▁is
▁well
=== Alternatives @minprob=0.11
▁is
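For a quick quantitative view of the mismatch, the two lists above (top-k samples at @temp=1000.0 vs. alternatives at @minprob=0.001) can be diffed as sets; the token strings below are copied from the outputs above.

```python
# Tokens that appeared in top-k sampling at temp=1000.0 (from the run above).
sampled = {"▁is", "▁for", ",", "▁will", "▁becomes",
           "▁can", "▁of", "▁has", "▁was", "▁remains"}
# Tokens returned as alternatives at min_alternative_expansion_prob=0.001.
alternatives = {"▁is", "▁well", "▁for", "▁good", "▁of",
                "▁a", "▁quality", "▁has", "▁in", "▁will"}

# In the alternatives but never sampled -- the suspicious tokens:
print(sorted(alternatives - sampled))  # ['▁a', '▁good', '▁in', '▁quality', '▁well']
# Sampled but missing from the alternatives:
print(sorted(sampled - alternatives))  # [',', '▁becomes', '▁can', '▁remains', '▁was']
```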
Other models tried
Tried the small toy test model CTranslate2/tests/data/models/v2/aren-transliteration (the code has an input example for it), but it did not reproduce such an inconsistency.
Tried the BPE-encoded opus de-en model, and it behaves similarly to the SP-encoded one used here.
Tried CT2 3.1.0 from a docker image; it had this same behavior too. I couldn't try any earlier CT2, since the docker images don't have ct2-marian-converter, and I didn't check whether it would be easy to backport (they do have other converters, but with missing Python deps like onmt or fairseq - while I could get torch+cpu to install inside the docker, I couldn't get those to install, so I couldn't convert any sensible model for them).
Changes that seemed relevant
I eyeballed 62d5396 (touching around 3.1), but at least to me nothing stuck out. I also saw 739a5b1 (touching around 2.20) but didn't review it deeply. It would have been nice to run some older early-2.x CT2 to see whether this is a regression or was present from the start, but as described above, no luck.
Test code
import ctranslate2
import sys
model = sys.argv[1]
translator = ctranslate2.Translator(model)
# for use with opus-de-en
src = '▁Eine ▁gute ▁Planung ▁ist ▁für ▁ein ▁leistungs basierte s ▁Modell ▁unverzichtbar , ▁wenngleich ▁sich ▁die s ▁als ▁schwierig ▁erwies .'
tgt = '▁Good ▁planning'
"""
# for use with CTranslate2/tests/data/models/v2/aren-transliteration:
src = 'و ر ن ل س'
tgt = ''
"""
"""
# for use with opus-de-en BPE
src = 'Eine gute Planung ist für ein leistungs basierte s Modell unverzichtbar , wenngleich sich die s als schwierig erwies .'
tgt = 'Good planning'
"""
print("Prefix:", tgt)
NREPS = 42
# 4.7 can handle beam_size=1 with num_hypotheses=10 for top-k sampling,
# but 3.1 needs beam_size >= num_hypotheses.
"""
SAMPK = 10
SAMPBEAMS = 2
SAMPHYPS = 2
"""
SAMPK = 10
SAMPBEAMS = 1
SAMPHYPS = 10
# 4.7 and 3.1 both return "▁good" and "▁well" in the top-4 alternatives.
def topk_at_temp(k, temp):
    counts = {}
    # Large repetition count so the sampling counts are more representative
    for rep in range(NREPS):
        results = translator.translate_batch([src.split()],
            beam_size=SAMPBEAMS,  # 1 works with latest, 3.1.0 needs beams >= hyps
            target_prefix=[tgt.split()],
            max_decoding_length=len(tgt.split()) + 1,
            num_hypotheses=SAMPHYPS,
            sampling_topk=k,
            sampling_temperature=temp,
        )
        for hyp in results[0].hypotheses:
            sample = hyp[-1]
            counts[sample] = counts.get(sample, 0) + 1
    print(f"=== Top-K @k={k}, @temp={temp}")
    for token, count in counts.items():
        print(token, count)
def alternatives_minprob(minp):
    results = translator.translate_batch([src.split()],
        beam_size=1,
        target_prefix=[tgt.split()],
        max_decoding_length=len(tgt.split()) + 1,
        num_hypotheses=10,
        return_alternatives=True,
        min_alternative_expansion_prob=minp,
    )
    print(f"=== Alternatives @minprob={minp}")
    for hyp in results[0].hypotheses:
        print(hyp[-1])
topk_at_temp(SAMPK, 1.0)
topk_at_temp(SAMPK, 2.0)
topk_at_temp(SAMPK, 10.0)
topk_at_temp(SAMPK, 1000.0)
alternatives_minprob(0.001)
alternatives_minprob(0.01)
alternatives_minprob(0.02)
alternatives_minprob(0.03)
alternatives_minprob(0.11)