File chunks reported as Roff #7656
Replies: 2 comments 8 replies
-
Any thoughts on this @Alhadis? I was thinking we could use a regex trick to limit the search to a certain number of rows (similar to #6965). For example, we could limit the search to the first 100 rows by prepending this: [...]
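As a rough illustration of the idea only (this is not the pattern from #6965, which may look different), a line limit can be expressed as a prefix that anchors at the start of the data and lazily skips a bounded number of complete lines before the real rule is tried. The sketch below uses Python's `re` rather than the regex engine Linguist uses, and `limit_to_first_lines` is a hypothetical helper name.

```python
import re

# The Roff heuristic quoted in this thread.
ROFF_RULE = r'^\.(?:[A-Za-z]{2}(?:\s|$)|\\")'


def limit_to_first_lines(pattern: str, max_lines: int) -> "re.Pattern[str]":
    """Hypothetical helper: let `pattern` match only within the first `max_lines` lines."""
    # \A pins the search to the start of the data; the lazy group then skips
    # at most max_lines - 1 complete lines before the wrapped pattern is tried.
    prefix = r'\A(?:[^\n]*\n){0,%d}?' % (max_lines - 1)
    return re.compile(prefix + '(?:' + pattern + ')', re.MULTILINE)


unlimited = re.compile(ROFF_RULE, re.MULTILINE)
limited = limit_to_first_lines(ROFF_RULE, 100)

# A Roff-looking line near the top, and one that only appears after 150 lines.
early = "intro\n.SH NAME\n"
late = "plain text\n" * 150 + ".SH NAME\n"

early_hit = limited.search(early) is not None
late_hit_unlimited = unlimited.search(late) is not None
late_hit_limited = limited.search(late) is not None
```

With this prefix, a Roff-looking line that first appears after line 100 no longer triggers the rule, while one near the top of the file still does.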
-
The more I think about this, the more it seems that we just need to add the fallback to this rule:
- extensions: ['.1', '.2', '.3', '.4', '.5', '.6', '.7', '.8', '.9']
  rules:
  - language: Roff Manpage
    and:
    - named_pattern: mdoc-date
    - named_pattern: mdoc-title
    - named_pattern: mdoc-heading
  - language: Roff Manpage
    and:
    - named_pattern: man-title
    - named_pattern: man-heading
  - language: Roff
    pattern: '^\.(?:[A-Za-z]{2}(?:\s|$)|\\")'
  - language: Text # <------ Adding this
Explanation: This will make sure that even if the [...]
This solves the issue for the file chunk mentioned in the OP, but also for files like these ones:
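To make the role of the fallback concrete, here is a minimal sketch of first-match rule evaluation. It assumes that rules are tried in order and that a rule without a pattern always matches. This is a simplified model written in Python, not Linguist's actual implementation; the two `Roff Manpage` rules are omitted because their named patterns are not shown in this thread, and `RULES` and `classify` are hypothetical names.

```python
import re
from typing import List, Optional, Tuple

# The Roff pattern from the rule above; the Roff Manpage rules are left out.
ROFF = re.compile(r'^\.(?:[A-Za-z]{2}(?:\s|$)|\\")', re.MULTILINE)

# Ordered rules: (language, pattern). A pattern of None models a rule with
# no pattern, which is assumed here to match unconditionally.
RULES: List[Tuple[str, Optional["re.Pattern[str]"]]] = [
    ("Roff", ROFF),
    ("Text", None),  # the proposed fallback
]


def classify(data: str) -> Optional[str]:
    """Return the language of the first rule that matches `data`."""
    for language, pattern in RULES:
        if pattern is None or pattern.search(data):
            return language
    return None


roff_result = classify(".SH NAME\nfoo - does things\n")
text_result = classify("just some notes\nwith no requests\n")
```

In this model, content that contains no line resembling a Roff request or comment falls through to `Text` instead of being left for a later classification step to decide.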
-
I was looking at Programming Language Trends on Languish and saw that Roff had a major uptick:
It turns out this is related to files like this one, which are chunks of a binary file.
Original hypothesis for the cause
It seems these files are not detected as binary because, although the start of the file looks like text, the binary data appears too far into the file to trigger charlock_holmes' binary detection:
[...]

Perhaps the heuristics for Roff could be edited to avoid matching this type of file? The rule that matches is this one:
^\.(?:[A-Za-z]{2}(?:\s|$)|\\")
[...] at the end.
Matching here: