Hi,
I would like to access the attention weights of decoder-only LLMs. I need this for simultaneous translation: at each generation step, I want to detect which part of the source (the user message in the prompt) the model is attending to most, and check whether it is near the end of the current partial source. If so, I would stop generation and continue with the next partial source.
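Concretely, the stopping check I have in mind could look like the sketch below. It assumes the per-step attention weights over the source tokens are exposed somehow; the function name and threshold are just illustrative:

```python
def attends_near_source_end(attn_over_source, tail_tokens=3):
    """Hypothetical helper for the stopping criterion.

    attn_over_source: one attention weight per source token for the
    current generation step (e.g. averaged over heads/layers).
    Returns True if the attention peak falls within the last
    `tail_tokens` positions of the current partial source.
    """
    peak = max(range(len(attn_over_source)), key=attn_over_source.__getitem__)
    return peak >= len(attn_over_source) - tail_tokens

# Attention peaks on the last source token -> stop and wait for more source.
print(attends_near_source_end([0.05, 0.10, 0.15, 0.70]))  # True
# Attention peaks early in the source -> keep generating.
print(attends_near_source_end([0.70, 0.10, 0.10, 0.10]))  # False
```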
Am I right that this feature is not yet implemented for Generator, but is available for Translator? Could it be implemented for Generator? How?
Thanks!