Created attachment 244077
extract-msg-0.45.0.patch

Description:

ChangeLog: https://github.com/TeamMsgExtractor/msg-extractor/blob/v0.45.0/CHANGELOG.md

* BREAKING: Changed parsing of string multiple properties to remove the trailing null byte. This *will* cause the output of parsing them to differ.
* Updated typing information for some functions and classes.
* Fixed a bug with `MessageSignedBase.attachments` that would cause it to return None instead of an empty list if the number of normal attachments was 0 and the error behavior was set to ignore violations of the standard.
* Updated `MessageSignedBase.attachments` to use `functools.cached_property` instead of `property`.
* Fixed spelling errors in some exception strings.
* Made `NamedPropertyBase` a subclass of `abc.ABC`.
* Cleaned up some of the code for named properties to remove unused variables and remove inefficient code.
* Changed `PropBase` to be a subclass of `abc.ABC`.
* Added detailed versioning info to the README.
* Deprecated many private functions, including methods on many of the classes. Of primary note are `_getStream` and `_getStringStream`, which have been moved to the public API as `getStream` and `getStringStream`. Any deprecated functions still exist and will forward to a public API function if they are not being removed. Additionally, all internal usage of them has been removed. This change is one of the big preparations that is needed for the `1.0.0` release.
* As mentioned, a number of these deprecated functions have been moved to the public API. It is recommended that you run tests with your code after enabling deprecation warnings to see what should be changed.
* Removed items deprecated in or before `0.42.0`.
* Changed the API for the private method `_genRecipient`. This is not intended for use outside of the module *except* for subclasses. The change removed the allowance of ints for the second argument, requiring that it be a valid enum type.
* Convert many enum types to `IntEnum`.
* Extended functionality of `PropertiesStore` to allow for integer property names and getting a property based on just the ID. You can also get a list of all properties that use a given ID.
* Added new function `PropertiesStore.getProperties` which gets a list of all properties matching the property ID. Return type is a list of `PropBase` instances.
* Added new function `PropertiesStore.getValue` which looks for the first matching `FixedLengthProp` and returns the value from it.
* Improved internal code related to getting a property with a potentially unknown type.
* Added a number of entirely new functions to the public API on `MSGFile`, `AttachmentBase`, `PropertiesStore`, and `Recipient` objects:
  * `getMultipleBinary`: Gets a multiple binary property as a list of `bytes` objects.
  * `getSingleOrMultipleBinary`: A combination of `getStream` and `getMultipleBinary` which prefers a single binary stream. Returns a single `bytes` object or a list of `bytes` objects.
  * `getMultipleString`: Gets a multiple string property as a list of `str` objects.
  * `getSingleOrMultipleString`: A combination of `getStringStream` and `getMultipleString` which prefers a single string stream. Returns a single bytes object or a list of bytes objects.
  * `getPropertyVal`: Shortcut for `instance.props.getValue` that allows new behavior to be added by overriding it.
  * `getNamedProp`: Shortcut for `instance.namedProperties.get((propertyName, guid), default)` that allows new behavior to be added by overriding it.
* Removed `Named._getStringStream` and `Named.sExists`. The named properties storage will *always* use regular streams and not string streams.
* Changed all `Named` methods to no longer have a prefix argument. The prefix should *always* be false since the named property mapping will only exist in the top level directory.
* Adjusted `tryGetMimeType` to allow any attachments whose `data` property would return a `bytes` instance.
* Changed internal code to use public API functions wherever possible. This includes making many private API functions use calls to the public API for getting bits of data.
* Fixed potential issue with `AttachmentBase.clsid` which had the potential to cause some attachments to fail to generate a CLSID.
* Outright removed or changed a significant portion of the private API. I have rarely, if ever, seen references to these parts, so this should cause you no issues. Some of these have also been moved to the public API, either identically or with changes, and the mapping is as such:
  * `_getNamedAs` -> `getNamedAs`: Changed to *always* require a conversion argument. If you were previously using it to plainly get a named property or to handle the property being None or a real value, you should use the return value of `getNamedProp` instead.
  * `_getPropertyAs` -> `getPropertyAs`: Same as above, use `getPropertyVal` instead for None or plain access.
  * `_getStreamAs` -> `getStreamAs`, `getStringStreamAs`: Once again, see above. Use `getStream` and `getStringStream`, respectively.

QA:

* portlint:
  - WARN: Makefile: using hyphen in PORTNAME. consider using PKGNAMEPREFIX and/or PKGNAMESUFFIX.
* testport: OK (poudriere: 13.2-RELEASE, amd64, WKHTMLTOPDF tested)

Notes:

* This update breaks textproc/py-textract when parsing a .msg file. If the following message is displayed,

```
The filename extension .msg is not yet supported by textract.
Please suggest this filename extension here: https://github.com/deanmalmgren/textract/issues
Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tab, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx
```

it is a hidden exception:

```
Traceback (most recent call last):
  File "/usr/local/bin/textract", line 33, in <module>
    main()
  File "/usr/local/bin/textract", line 25, in main
    output = process(**vars(args))
  File "/usr/local/lib/python3.9/site-packages/textract/parsers/__init__.py", line 70, in process
    filetype_module = importlib.import_module(
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/local/lib/python3.9/site-packages/textract/parsers/msg_parser.py", line 3, in <module>
    import extract_msg
  File "/usr/local/lib/python3.9/site-packages/extract_msg/__init__.py", line 65, in <module>
    from .ole_writer import OleWriter
  File "/usr/local/lib/python3.9/site-packages/extract_msg/ole_writer.py", line 19, in <module>
    from red_black_dict_mod import RedBlackTree
ModuleNotFoundError: No module named 'red_black_dict_mod'
```
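The "not yet supported" message can appear even though `.msg` is in the list of available extensions, because the parser module itself failed to import. A minimal sketch of that masking pattern, assuming the loader wraps `importlib.import_module` in a broad `except ImportError` (the traceback only shows that this call is where the failure occurs). `load_parser`, `ExtensionNotSupported`, and `root_cause` are hypothetical names used for illustration, not textract's actual code:

```python
import importlib


class ExtensionNotSupported(Exception):
    """Hypothetical stand-in for a loader's 'unsupported extension' error."""


def load_parser(module_name):
    """Import a parser module, masking any ImportError (assumed pattern)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # transitive dependency is reported as an unsupported extension.
        raise ExtensionNotSupported(module_name) from exc


def root_cause(exc):
    """Follow the __cause__ chain back to the original exception."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


if __name__ == "__main__":
    try:
        load_parser("hypothetical_missing_dependency_xyz")
    except ExtensionNotSupported as err:
        # Show the type of the exception that was masked.
        print(type(root_cause(err)).__name__)
```

Chaining with `from exc` keeps the original error reachable, which is what makes the real cause recoverable when the user-facing message is misleading.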
Any progress on fixing the bug? It's unlikely this'll get committed if it breaks a downstream port. Either patch textproc/py-textract to no longer fail with the new version of textproc/py-extract-msg, or fix this port to not break textproc/py-textract.
Created attachment 245366
extract-msg-0.45.0.patch

Description:

* Add devel/py-red-black-tree-mod dependency.

QA:

* portlint:
  - WARN: Makefile: using hyphen in PORTNAME. consider using PKGNAMEPREFIX and/or PKGNAMESUFFIX.
* testport: OK (poudriere: 13.2-RELEASE, amd64, WKHTMLTOPDF tested)

Notes:

* Adding devel/py-red-black-tree-mod fixes the exception described in comment #0.
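One way to confirm that the runtime dependency resolves after installing the updated port is to probe for the modules without importing them. A minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names in the usage section are the ones that appear in the traceback in comment #0:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the traceback; adjust as needed.
    missing = missing_modules(["extract_msg", "red_black_dict_mod"])
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be found. It does not replace running textract against a real .msg file.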
(In reply to Robert Clausecker from comment #1) Thanks!
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=d48062ec61d4ec3420833c451f146e15a38bd9bb

commit d48062ec61d4ec3420833c451f146e15a38bd9bb
Author:     Jesús Daniel Colmenares Oviedo <DtxdF@disroot.org>
AuthorDate: 2023-10-01 22:18:39 +0000
Commit:     Robert Clausecker <fuz@FreeBSD.org>
CommitDate: 2023-10-04 19:59:56 +0000

    textproc/py-extract-msg: Update to 0.45.0

    * Add devel/py-red-black-tree-mod dependency.

    ChangeLog: https://github.com/TeamMsgExtractor/msg-extractor/blob/v0.45.0/CHANGELOG.md

    PR:             273123

 textproc/py-extract-msg/Makefile | 3 ++-
 textproc/py-extract-msg/distinfo | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)
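The upstream ChangeLog recommends running tests with deprecation warnings enabled to find callers of the old private methods. A minimal sketch of such a check using the standard `warnings` module. `LegacyReader` is a hypothetical stand-in that mimics the forwarding the ChangeLog describes (`_getStream` -> `getStream`), not extract_msg's actual implementation, and `collect_deprecations` is likewise an illustrative helper:

```python
import warnings


class LegacyReader:
    """Hypothetical stand-in; not part of extract_msg."""

    def getStream(self, name):
        return b"stream:" + name.encode()

    def _getStream(self, name):
        # Deprecated private name forwards to the public method.
        warnings.warn(
            "_getStream is deprecated; use getStream",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.getStream(name)


def collect_deprecations(func, *args):
    """Call func and return (result, list of deprecation messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every DeprecationWarning instead of hiding repeats.
        warnings.simplefilter("always", DeprecationWarning)
        result = func(*args)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages
```

For an existing test suite, running the interpreter with `-W error::DeprecationWarning` turns the same warnings into failures, which makes remaining uses of deprecated names easy to spot.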