Bug 273123 - textproc/py-extract-msg: Update to 0.45.0
Summary: textproc/py-extract-msg: Update to 0.45.0
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Robert Clausecker
URL: https://github.com/TeamMsgExtractor/m...
Depends on: 274204
Reported: 2023-08-14 00:26 UTC by Jesús Daniel Colmenares Oviedo
Modified: 2023-10-04 20:07 UTC

extract-msg-0.45.0.patch (6.30 KB, patch)
2023-08-14 00:26 UTC, Jesús Daniel Colmenares Oviedo
DtxdF: maintainer-approval+
extract-msg-0.45.0.patch (6.92 KB, patch)
2023-10-01 22:24 UTC, Jesús Daniel Colmenares Oviedo
DtxdF: maintainer-approval+

Description Jesús Daniel Colmenares Oviedo 2023-08-14 00:26:49 UTC
Created attachment 244077 [details]



* BREAKING: Changed parsing of string multiple properties to remove the trailing
  null byte. This *will* cause the output of parsing them to differ.
* Updated typing information for some functions and classes.
* Fixed a bug with `MessageSignedBase.attachments` that would cause it to return
  None instead of an empty list if the number of normal attachments was 0 and
  the error behavior was set to ignore violations of the standard.
* Updated `MessageSignedBase.attachments` to use `functools.cached_property`
  instead of `property`.
* Fixed spelling errors in some exception strings.
* Made `NamedPropertyBase` a subclass of `abc.ABC`.
* Cleaned up some of the code for named properties to remove unused variables and
  remove inefficient code.
* Changed `PropBase` to be a subclass of `abc.ABC`.
* Added detailed versioning info to the README.
* Deprecated many private functions, including methods on many of the classes.
  Of primary note are `_getStream` and `_getStringStream`, which have been
  moved to the public API as `getStream` and `getStringStream`. Any
  deprecated functions still exist and will forward to a public API function
  if they are not being removed. Additionally, all internal usage of them has
  been removed. This change is one of the big preparations that is needed for
  the `1.0.0` release.
    * As mentioned, a number of these deprecated functions have been moved to
      the public API. It is recommended that you run tests with your code
      after enabling deprecation warnings to see what should be changed.
* Removed items deprecated in or before `0.42.0`.
* Changed the API for the private method `_genRecipient`. This is not intended for
  use outside of the module *except* for subclasses. The change removed the
  allowance of ints for the second argument, requiring that it be a valid
  enum type.
* Convert many enum types to `IntEnum`.
* Extended functionality of `PropertiesStore` to allow for integer property names
  and getting a property based on just the ID. You can also get a list of all
  properties that use a given ID.
* Added new function `PropertiesStore.getProperties` which gets a list of all
  properties matching the property ID. Return type is a list of `PropBase`.
* Added new function `PropertiesStore.getValue` which looks for the first matching
  `FixedLengthProp` and returns the value from it.
* Improved internal code related to getting a property with a potentially unknown
* Added a number of entirely new functions to the public API on `MSGFile`,
  `AttachmentBase`, `PropertiesStore`, and `Recipient` objects:
    * `getMultipleBinary`: Gets a multiple binary property as a list of `bytes`
    * `getSingleOrMultipleBinary`: A combination of `getStream` and `getMultipleBinary`
       which prefers a single binary stream. Returns a single `bytes` object or a
       list of `bytes` objects.
    * `getMultipleString`: Gets a multiple string property as a list of `str` objects.
    * `getSingleOrMultipleString`: A combination of `getStringStream` and
      `getMultipleString` which prefers a single string stream. Returns
      a single bytes object or a list of bytes objects.
    * `getPropertyVal`: Shortcut for `instance.props.getValue` that allows new behavior
      to be added by overriding it.
    * `getNamedProp`: Shortcut for `instance.namedProperties.get((propertyName, guid), default)`
      that allows new behavior to be added by overriding it.
* Removed `Named._getStringStream` and `Named.sExists`. The named properties storage will
  *always* use regular streams and not string streams.
* Changed all `Named` methods to no longer have a prefix argument. The prefix should
  *always* be false since the named property mapping will only exist in the top-level directory.
* Adjusted `tryGetMimeType` to allow any attachments whose `data` property would return a
  `bytes` instance.
* Changed internal code to use public API functions wherever possible. This includes making many
  private API functions use calls to the public API for getting bits of data.
* Fixed potential issue with `AttachmentBase.clsid` which had the potential to cause some
  attachments to fail to generate a CLSID.
* Outright removed or changed a significant portion of the private API. I have rarely, if ever,
  seen references to these parts, so this should cause you no issues. Some of these have also
  been moved to the public API, either identically or with changes, and the mapping is as such:
    * `_getNamedAs` -> `getNamedAs`: Changed to *always* require a conversion argument. If you
      were previously using it to plainly get a named property or to handle the property being
      None or a real value, you should use the return value of `getNamedProp` instead.
    * `_getPropertyAs` -> `getPropertyAs`: Same as above, use `getPropertyVal` instead for None
      or plain access.
    * `_getStreamAs` -> `getStreamAs`, `getStringStreamAs`: Once again, see above. Use `getStream`
      and `getStringStream`, respectively.
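
For downstream consumers, the changelog's advice to run tests with deprecation warnings enabled can be sketched as follows. This is a minimal illustration only: `Demo` and `call_and_collect` are hypothetical stand-ins, not part of extract_msg, and the forwarding pattern shown is an assumption about how a deprecated private method might hand off to its public replacement.

```python
import warnings


class Demo:
    """Hypothetical stand-in class; this is NOT the extract_msg API."""

    def getStream(self, name):
        # Public method: the supported entry point.
        return f"stream:{name}".encode()

    def _getStream(self, name):
        # Deprecated private alias that warns and forwards to the public method.
        warnings.warn(
            "_getStream is deprecated, use getStream instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.getStream(name)


def call_and_collect(func, *args):
    """Call func and return (result, list of DeprecationWarning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure deprecation warnings are not silenced by default filters.
        warnings.simplefilter("always", DeprecationWarning)
        result = func(*args)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


demo = Demo()
result, messages = call_and_collect(demo._getStream, "example")
print(result, messages)
```

Outside of a helper like this, the same effect is available for a whole test run with `python -W error::DeprecationWarning`, which turns every deprecation warning into an exception.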

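The BREAKING entry about string multiple properties can be pictured with a small sketch. The helper names and the UTF-16-LE encoding are assumptions chosen for illustration; this is not the library's parsing code, only the visible difference a consumer might see in the returned strings.

```python
def decode_entry_old(raw: bytes) -> str:
    # Assumed old behavior: decode the entry and keep the trailing null.
    return raw.decode("utf-16-le")


def decode_entry_new(raw: bytes) -> str:
    # Assumed new behavior: decode, then drop a single trailing null if present.
    text = raw.decode("utf-16-le")
    return text[:-1] if text.endswith("\x00") else text


# Hypothetical raw entries, each stored with a null terminator.
entries = [s.encode("utf-16-le") for s in ("alpha\x00", "beta\x00")]

old_values = [decode_entry_old(e) for e in entries]
new_values = [decode_entry_new(e) for e in entries]
print(old_values, new_values)
```

Code that compared these values against literals, or stripped the terminator itself, is the kind of consumer that would notice the change.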

* portlint:
  - WARN: Makefile: using hyphen in PORTNAME. consider using PKGNAMEPREFIX and/or PKGNAMESUFFIX.
* testport: OK (poudriere: 13.2-RELEASE, amd64, WKHTMLTOPDF tested)


* This update breaks textproc/py-textract when parsing a .msg file. If the following message is displayed,

The filename extension .msg is not yet supported by
textract. Please suggest this filename extension here:


Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tab, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx


the actual cause is a hidden exception:

Traceback (most recent call last):
  File "/usr/local/bin/textract", line 33, in <module>
  File "/usr/local/bin/textract", line 25, in main
    output = process(**vars(args))
  File "/usr/local/lib/python3.9/site-packages/textract/parsers/__init__.py", line 70, in process
    filetype_module = importlib.import_module(
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/local/lib/python3.9/site-packages/textract/parsers/msg_parser.py", line 3, in <module>
    import extract_msg
  File "/usr/local/lib/python3.9/site-packages/extract_msg/__init__.py", line 65, in <module>
    from .ole_writer import OleWriter
  File "/usr/local/lib/python3.9/site-packages/extract_msg/ole_writer.py", line 19, in <module>
    from red_black_dict_mod import RedBlackTree
ModuleNotFoundError: No module named 'red_black_dict_mod'
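
How an import failure can surface as an "extension not supported" message is sketched below. This is an illustration under assumptions, not textract's actual source: `ExtensionNotSupported`, `load_parser`, and `root_cause` are hypothetical names, and the module name is deliberately one that does not exist.

```python
import importlib


class ExtensionNotSupported(Exception):
    """Hypothetical error type used only for this sketch."""


def load_parser(module_name, ext):
    # Any ImportError (including a missing transitive dependency) is
    # reported as an unsupported extension, which hides the real cause.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ExtensionNotSupported(
            f"The filename extension {ext} is not yet supported"
        ) from exc


def root_cause(exc):
    # Walk the explicit exception chain back to the original error.
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


cause = None
try:
    load_parser("module_that_is_not_installed_xyz", ".msg")
except ExtensionNotSupported as err:
    cause = root_cause(err)
    print(type(cause).__name__, cause)
```

Inspecting the chained exception this way points at the missing module rather than at the file extension.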
Comment 1 Robert Clausecker freebsd_committer freebsd_triage 2023-09-30 21:45:51 UTC
Any progress on fixing the bug?
It's unlikely this'll get committed if it breaks a downstream port.  Either patch textproc/py-textract to no longer fail with the new version of textproc/py-extract-msg, or fix this port to not break textproc/py-textract.
Comment 2 Jesús Daniel Colmenares Oviedo 2023-10-01 22:24:42 UTC
Created attachment 245366 [details]


* Add devel/py-red-black-tree-mod dependency.


* portlint:
  - WARN: Makefile: using hyphen in PORTNAME. consider using PKGNAMEPREFIX and/or PKGNAMESUFFIX.
* testport: OK (poudriere: 13.2-RELEASE, amd64, WKHTMLTOPDF tested)


* Adding devel/py-red-black-tree-mod fixes the exception described in comment #0.
Comment 3 Jesús Daniel Colmenares Oviedo 2023-10-01 22:32:56 UTC
(In reply to Robert Clausecker from comment #1)

Comment 4 commit-hook freebsd_committer freebsd_triage 2023-10-04 20:04:30 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=d48062ec61d4ec3420833c451f146e15a38bd9bb

commit d48062ec61d4ec3420833c451f146e15a38bd9bb
Author:     Jesús Daniel Colmenares Oviedo <DtxdF@disroot.org>
AuthorDate: 2023-10-01 22:18:39 +0000
Commit:     Robert Clausecker <fuz@FreeBSD.org>
CommitDate: 2023-10-04 19:59:56 +0000

    textproc/py-extract-msg: Update to 0.45.0

    * Add devel/py-red-black-tree-mod dependency.


    PR:             273123

 textproc/py-extract-msg/Makefile | 3 ++-
 textproc/py-extract-msg/distinfo | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)