With the acceptance of PEP 414 (Explicit Unicode Literal for Python 3.3), string literals with u prefixes will be permitted syntax in Python 3.3, though they cause a SyntaxError in Python 3.2 (and earlier 3.x versions). The motivation behind the PEP is to make it easier to port any 2.x project which has a lot of Unicode literals to 3.3 using a single codebase strategy, i.e. one which avoids the need to repeatedly run 2to3 on the code, either during development or at installation time. The single codebase strategy has gained currency because repeatedly running 2to3 over parts of the codebase slows down the development workflow. The main impact of the PEP from a porting perspective is that the diffs between ported and unported code will not contain the noise created by removing the u prefixes from all the string literals, which should give project owners an easier time when reviewing changes. Of course it also saves the work of actually removing the prefixes, but that would be a one-time operation automated by 2to3, and so not really a significant part of a porting effort.
While this PEP is fine for people who want to port their 2.x project straight to 3.3, it leaves Python 3.2 users a little bit out in the cold. The PEP does consider the question of 3.2 support, but treats 3.2 somewhat as a second-class citizen. And for those proposing that people just move to 3.3 as the latest and greatest Python, remember that there will be people who are constrained to use 3.2 because of project dependency constraints, whether technical or organisational in nature. For example, Ubuntu 12.04 LTS (Long Term Support) will receive 5 years of support; there are already people who have invested time and effort in projects with 3.2 as a dependency, which it may not be possible to migrate to 3.3 (which, let’s remember, won’t be released for a while: the scheduled release date is 18 August 2012).
The PEP offers to support 3.2 users by means of an installation hook which works similarly to 2to3 run at installation time. However, an installation-time hook does not provide the main benefit of a single codebase: a streamlined iterative workflow in which code changes are interspersed with testing under multiple Python versions.
An import hook (which was suggested during the PEP 414 discussions on the python-dev mailing list) is a much more attractive proposition. The benefit is that you can have code containing u'xxx' literals, which 3.3 will allow by virtue of PEP 414’s proposed change to Python, and also work with that code transparently in Python 3.2. It would work as follows:
- An import hook is installed.
- When importing a module, if the compiled .pyc file exists and is up to date, it will be used. The hook will not do anything in this case.
- If, when importing, the .py file is newer than the .pyc file, the hook will load the source code, convert all string literals with u prefixes to unadorned string literals (as expected by 3.1/3.2), and then compile the converted source. The compiled code will be stored in the .pyc file, so conversion will not be performed again until the .py file’s timestamp is more recent than that of its .pyc file.
- There is no need to integrate with editing environments – any updated source files that are imported will automatically be converted lazily, as needed.
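The freshness rule in the steps above can be sketched in a few lines. This is only an illustration of the decision, not code taken from the hook itself: the function name needs_conversion is hypothetical, and it compares file modification times directly, whereas Python's import system compares the source timestamp against the one recorded inside the .pyc file.

```python
import os


def needs_conversion(source_path, bytecode_path):
    """Return True if the source should be converted and recompiled.

    Hypothetical helper: a simplified stand-in for the staleness check
    that the import machinery performs on cached bytecode.
    """
    try:
        bytecode_mtime = os.path.getmtime(bytecode_path)
    except OSError:
        # No cached bytecode yet, so the source has to be processed.
        return True
    # Reprocess only when the source is newer than the cached bytecode.
    return os.path.getmtime(source_path) > bytecode_mtime
```

When this returns False the cached bytecode is used as-is, which is why the conversion cost is only paid after a source file has actually changed.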
I set out to try and implement an import hook to do the prefix removal. The initial result is uprefix, a package containing the hook and functions to register and unregister it. It’s available on PyPI, so you can try it out in a virtualenv using pip install uprefix. Or you can just download a source tarball and install it (e.g. into a virtual environment) using python setup.py install, or run the tests using python setup.py test before installing.
Usage is easy: once you have it on your path, on Python 3.2, you can do:
>>> import uprefix; uprefix.register_hook()
>>>
That’s it. You should now be able to import any module containing string literals using u prefixes, as if they weren’t there.
You can call uprefix.unregister_hook() to remove the hook from the import pipeline.
This is a proof of concept, and uses lib2to3 to strip the prefixes. It should allow you to import 2.x code into Python 3.x without worrying about literal syntax (though other gotchas such as relative imports, exception syntax etc. may prevent a successful import).
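To make the transformation itself concrete, here is a minimal sketch of prefix stripping. It is not how uprefix does it: instead of lib2to3 it uses the standard library's tokenize module, and it assumes a Python version whose tokenizer already recognises the u prefix (3.3 or later), so it illustrates the source rewrite rather than something you could run on 3.2. The function name strip_u_prefixes is made up for this example.

```python
import io
import tokenize


def strip_u_prefixes(source):
    """Return source text with u/U prefixes removed from string literals.

    Illustrative only: relies on the stdlib tokenizer treating u'...'
    as a single STRING token, which is the case on Python 3.3+.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and tok.string[:1] in ("u", "U"):
            # Drop just the prefix character; the positions are left
            # untouched so that untokenize keeps the original spacing.
            tok = tok._replace(string=tok.string[1:])
        tokens.append(tok)
    return tokenize.untokenize(tokens)


converted = strip_u_prefixes("greeting = u'hello'\n")
compile(converted, "<example>", "exec")  # the result is still valid source
```

Because only STRING tokens are touched, a variable that happens to be called u is left alone.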
The performance seems to be good enough. I couldn’t use the modified tokenize.py which is used by the PEP 414 installation hook, even though it would be faster, because it has a couple of bugs which cause it to break on real-world codebases. (These bugs were reported a month ago, but so far don’t appear to have received any love.)
Your feedback is welcome. I’m just dipping my toes in the Python import machinery, so I might well have missed some things.