Here is my proposal:
PLATFORM
- Convert the input file to UTF-8 encoded CSV format.
- Use a custom Java application to process the data. Java has excellent Unicode support.
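As a sketch of the re-encoding step, here is the kind of conversion involved. The source encoding below (windows-1250, common for legacy Polish text) is an assumption; the real input file would dictate the actual charset, and the CSV field layout would follow the specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {
    // Decode raw bytes from a legacy charset and re-encode them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset source) {
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "łódź" as encoded in windows-1250 (an assumed source encoding).
        byte[] cp1250 = {(byte) 0xB3, (byte) 0xF3, (byte) 0x64, (byte) 0x9F};
        String decoded = new String(toUtf8(cp1250, Charset.forName("windows-1250")),
                                    StandardCharsets.UTF_8);
        System.out.println(decoded); // łódź
    }
}
```

Java's `Charset` machinery handles this conversion losslessly for any encoding it knows, which is why a custom Java application is a safe choice here.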
DATA
- The provided input file.
- Alphabetized term lists for English and Polish (found or generated).
- A "stoplist" of words to ignore, such as the part-of-speech markers.
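The stoplist check itself is trivial; the sample markers below are placeholders, since the real list would be generated from the input file and then hand-edited.

```java
import java.util.Set;

public class Stoplist {
    // Hypothetical part-of-speech markers to ignore; the real stoplist
    // would be built from the actual input and manually reviewed.
    static final Set<String> STOP = Set.of("n.", "v.", "adj.", "adv.");

    static boolean isStopWord(String token) {
        return STOP.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("adj.")); // true
        System.out.println(isStopWord("dom"));  // false
    }
}
```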
PROCESS
(1) Generate all necessary data files (lists of words, abbreviations, etc.) and manually edit these if necessary.
(2) Do an initial parse of the input file simply to make sure all the input data is identifiable.
(3) Parse and generate the output files, following the directions.
(4) Spot check the output and look for improvements; ask questions.
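Step (2) above can be sketched as a validation pass that reports every line the parser would not recognize, so problems surface before any output is generated. The line pattern here (term, tab, translation) is purely an assumption for illustration; the real pattern depends on the delivered format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ValidatePass {
    // Assumed line shape: "term<TAB>translation". The actual regular
    // expression would be derived from the input file's real format.
    static final Pattern ENTRY = Pattern.compile("^\\S.*\\t.*\\S$");

    // Return the 1-based numbers of lines the parser does not recognize,
    // so they can be reviewed (or added to the spec) before generating output.
    static List<Integer> unidentified(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!ENTRY.matcher(lines.get(i)).matches()) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("dom\thouse", "broken line", "pies\tdog");
        System.out.println(unidentified(sample)); // [2]
    }
}
```

Running this first keeps step (3) honest: the generation pass only has to handle line shapes the validation pass has already confirmed exist.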
OUTCOMES
With this type of project, it is usually straightforward to reach around 80% accuracy. Further work and refinement should raise that percentage, but the last few percentage points are typically almost impossible to win without manual intervention.
My offer is to implement the specification you have given, plus any reasonable modifications that substantially improve the output. I will consider the work done once the remainder is best handled by manual editing, or once we encounter many unanticipated cases that do not lend themselves to a straightforward algorithm (e.g., cases requiring machine learning).