UTF-8-aware backwards string searching in Guile, or: fixing centered lyrics ignoring punctuation

Alexander Kobel-2
Dear all,

I have been happily using center-lyrics-ignoring-punctuation.ily, in the
version from
https://lists.gnu.org/archive/html/lilypond-user/2016-12/msg00382.html
for a while now. (It's attached as well.)

Now I stumbled across the following unexpected warning:

\version "2.19.82"
\include "center-lyrics-ignoring-punctuation.ily"
{ d4 }
\addlyrics { à }

> GNU LilyPond 2.19.82
> Processing `test.ly'
> Parsing...
> Interpreting music...
> Preprocessing graphical objects...
> (process:2528): Pango-WARNING **: 17:49:34.874: Invalid UTF-8 string passed to pango_layout_set_text()
>
> warning: no glyph for character U+FFFD in font `/usr/share/fonts/tex-gyre/texgyreschola-regular.otf'

Observations:
1. The warning vanishes as soon as I replace the lyrics by àa, but not
by aà.
2. There is no U+FFFD (REPLACEMENT CHARACTER) anywhere in my input.

This makes me wonder whether the problem is in the backwards
string-search in string-skip-right or in the substring routine used in
the make-center-on-word-callback; the only reason I can imagine why this
pops up is that some blind-to-Unicode slicing cuts some string in the
middle of a multi-byte Unicode character.

The warning is merely annoying and otherwise harmless, but I suspect
that there are unfortunate combinations where trailing "to-be-ignored"
characters are missed, or "not-to-be-ignored" characters are split into
an ill-defined character and a false "to-be-ignored".

Does anyone have a hint how to approach this one? (Or is the answer
just: be patient and hope for Guile v2?)


Thanks,
Alex

_______________________________________________
lilypond-user mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/lilypond-user

Attachments: center-lyrics-ignoring-punctuation.ily (4K), test.ly (106 bytes)

Aaron Hill
On 2018-10-30 10:01 am, Alexander Kobel wrote:
> This makes me wonder whether the problem is in the backwards
> string-search in string-skip-right or in the substring routine used in
> the make-center-on-word-callback; the only reason I can imagine why
> this pops up is that some blind-to-Unicode slicing cuts some string in
> the middle of a multi-byte Unicode character.

It's a little of both.

As near as I can tell, proper Unicode support was only added in Guile
2.0.  So 1.8 only thinks of characters as 8-bit values.  While a
UTF-8-encoded string can be represented, none of the built-in character-
or string-handling routines in 1.8 understand that encoding directly.

For instance, the code you posted contains a mistake in defining the
character set for punctuation symbols.  string->list will convert a
string into individual characters, but remember that to 1.8 a
"character" is a single byte, so every non-ASCII symbol is split into
its UTF-8 bytes.  As such, the following string:

     .?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡

gets converted into the following list:

     (#\. #\? #\- #\; #\, #\: #\342 #\200 #\236 #\342 #\200 #\234
      #\342 #\200 #\232 #\342 #\200 #\230 #\302 #\253 #\302 #\273
      #\342 #\200 #\271 #\342 #\200 #\272 #\343 #\200 #\216 #\343
      #\200 #\217 #\343 #\200 #\214 #\343 #\200 #\215 #\342 #\200
      #\234 #\342 #\200 #\235 #\342 #\200 #\230 #\342 #\200 #\231
      #\342 #\200 #\223 #\342 #\200 #\224 #\space #\* #\/ #\( #\)
      #\[ #\] #\{ #\} #\| #\< #\> #\! #\` #\~ #\& #\342 #\200
      #\246 #\342 #\200 #\240 #\342 #\200 #\241)

The resulting character set is then just the unique individual bytes,
not the original characters, each of which may have been encoded as two
or more bytes:

     #<charset {#\space #\! #\& #\( #\) #\* #\, #\- #\. #\/ #\:
                #\; #\< #\> #\? #\[ #\] #\` #\{ #\| #\} #\~ #\200
                #\214 #\215 #\216 #\217 #\223 #\224 #\230 #\231
                #\232 #\234 #\235 #\236 #\240 #\241 #\246 #\253
                #\271 #\272 #\273 #\302 #\342 #\343}>
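
The same effect can be reproduced outside of Guile.  The following is
only a Python sketch of the idea, not the code from the .ily file:
encoding to UTF-8 bytes stands in for what string->list sees in 1.8, and
the names byte_set and char_set are made up for the illustration.

```python
# Python illustration only; Guile 1.8 itself is not involved here.
# The punctuation string as quoted above.
PUNCTUATION = ".?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡"

# Roughly what a byte-oriented string->list yields: the distinct UTF-8 bytes.
byte_set = set(PUNCTUATION.encode("utf-8"))

# What was presumably intended: the distinct code points.
char_set = set(PUNCTUATION)

# "à" was never listed as punctuation ...
print("à" in char_set)

# ... yet one of its two bytes is shared with a listed symbol.
shared = [oct(b) for b in "à".encode("utf-8") if b in byte_set]
print(shared)
```

Only the byte-level set contains anything belonging to "à", which is
the root of the problem described next.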

At first glance this may appear to work, because the logic strips
individual bytes from the left and right ends of the string.  When a
leading or trailing symbol is in the list, all of its bytes are in the
set, so it is stripped completely.  However, if the string starts or
ends with a character that is not in the list but whose first or last
byte happens to be in the set, that character will be split improperly.

In your example, "à" is encoded as #\303 #\240.  But take note of #\240
which is in the character set.  It was included in the set because of
"†" which is encoded as #\342 #\200 #\240.  If you were to remove "†"
from the list of symbols, you'd find that the warning will go away,
because #\240 is no longer being stripped.
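
This is easy to check at the byte level.  Below is a small Python
sketch (again, not the actual Guile code; strip_bytewise and
is_valid_utf8 are hypothetical helper names) that strips only the three
bytes of "†" from both ends of a syllable, the way a byte-oriented trim
would:

```python
# UTF-8 bytes of "†" (U+2020), written in octal to match the listing above.
STRIP_BYTES = bytes([0o342, 0o200, 0o240])


def strip_bytewise(text: str) -> bytes:
    """Strip bytes in STRIP_BYTES from both ends, ignoring character boundaries."""
    return text.encode("utf-8").strip(STRIP_BYTES)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes still form a well-formed UTF-8 string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for syllable in ("à", "àa", "aà"):
    stripped = strip_bytewise(syllable)
    print(syllable, stripped, is_valid_utf8(stripped))
```

The three cases line up with your observations: "à" and "aà" lose their
final #\240 and end in a dangling #\303, while "àa" is left alone
because neither its first nor its last byte is in the set.  Decoding
such a dangling byte with replacement gives U+FFFD, which would explain
the missing-glyph warning if Pango substitutes the replacement character
in a similar way.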

> Does anyone have a hint how to approach this one? (Or is the answer
> just: be patient and hope for Guile v2?)

The only hint here is to replace the built-in functions with ones which
understand UTF8 encoding and can perform the work needed.  There very
well might be someone online who has already done this work, which would
save on having to do it yourself.

Otherwise, the basic strategy is to replace string->list with a version
that decodes UTF-8 and returns a list of integers (essentially UTF-32).
All of the string work is then done on these lists of integers instead.
(The character set would also just be a set of integers representing the
unique Unicode code points.)  After you find the sublists that are
interesting to measure, you'll then need to convert each of them back
into a string, which means encoding it back into UTF-8.
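
A rough sketch of that strategy, written in Python only to show the
shape of the work (a Guile 1.8 version would do the same bit-twiddling
on the result of string->list).  The names utf8_decode and utf8_encode
are hypothetical, and the decoder is deliberately simple: it rejects
truncated sequences but does not check for overlong forms or surrogates.

```python
def utf8_decode(raw: bytes) -> list:
    """Turn UTF-8 bytes into a list of integer code points."""
    code_points = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:              # 0xxxxxxx: ASCII, one byte
            length, value = 1, lead
        elif lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
            length, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
            length, value = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
            length, value = 4, lead & 0x07
        else:
            raise ValueError("unexpected byte at offset %d" % i)
        tail = raw[i + 1:i + length]
        if len(tail) != length - 1 or any(b >> 6 != 0b10 for b in tail):
            raise ValueError("malformed sequence at offset %d" % i)
        for b in tail:               # each continuation byte carries 6 bits
            value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += length
    return code_points


def utf8_encode(code_points) -> bytes:
    """Turn a list of integer code points back into UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


# The ignored set now holds code points, so "à" (0xE0) and "†" (0x2020)
# no longer have anything in common.
ignored = set(utf8_decode("†…«»".encode("utf-8")))
syllable = utf8_decode("«à…»".encode("utf-8"))

# Trim ignored code points from both ends, then encode the remaining word.
start, end = 0, len(syllable)
while start < end and syllable[start] in ignored:
    start += 1
while end > start and syllable[end - 1] in ignored:
    end -= 1
word = utf8_encode(syllable[start:end])
print(word.decode("utf-8"))
```

Because membership is tested on whole code points, the trimming can no
longer cut a multi-byte character in half.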


-- Aaron Hill

Aaron Hill
On 2018-10-30 4:56 pm, Aaron Hill wrote:

> On 2018-10-30 10:01 am, Alexander Kobel wrote:
>> Does anyone have a hint how to approach this one? (Or is the answer
>> just: be patient and hope for Guile v2?)
>
> The only hint here is to replace the built-in functions with ones
> which understand UTF8 encoding and can perform the work needed.  There
> very well might be someone online who has already done this work,
> which would save on having to do it yourself.
>
> Otherwise, the basic strategy is to replace string->list with a
> version that decodes UTF8 and returns a list of integers (essentially
> UTF32).  Then, all of the string work is being done with these lists
> of integers instead.  (The character set would also just be a set of
> integers representing the unique Unicode code points.)  After you find
> the subsets of the list that are interesting to measure, you'll then
> need to convert the list back into a string.  This means encoding back
> into UTF8 and emitting a string.

Here's a quick-n-dirty patch to address the issue.

%%%%
\version "2.19.82"
\include "center-lyrics-ignoring-punctuation.ily"
{ d'4 4 4 }
\addlyrics { Å Ɓ† «Ḉ…» }
%%%%

-- Aaron Hill

Attachments: center-lyrics-ignoring-punctuation.ily (7K), utf8.cropped.png (6K)

Aaron Hill
On 2018-10-30 9:23 pm, Aaron Hill wrote:
> Here's a quick-n-dirty patch to address the issue.
>
> %%%%
> \version "2.19.82"
> \include "center-lyrics-ignoring-punctuation.ily"
> { d'4 4 4 }
> \addlyrics { Å Ɓ† «Ḉ…» }
> %%%%

Agh, I goofed.  I forgot the logic is that you want to trim the "space"
characters from the ends only.  Anything included within the "word" is
okay.

So with two drop-whiles and reverses, here's the patch to my patch,
including a new test document:

%%%%
\version "2.19.82"
\include "center-lyrics-ignoring-punctuation.ily"
{ d'4 4 4 4 }
\addlyrics { Å Ɓ† «Ḉ…» ?Ḓ—Ḛ }
%%%%
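
The difference between removing the ignored characters everywhere and
trimming them from the ends only can be sketched as follows.  This is a
Python illustration, not the attached patch, and it assumes the earlier
version filtered the whole syllable; remove_everywhere and trim_ends are
hypothetical names, and itertools.dropwhile plays the role of
drop-while.

```python
from itertools import dropwhile

# Code points to ignore, taken from the punctuation string in this thread.
IGNORED = frozenset(".?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡")


def is_ignored(ch: str) -> bool:
    return ch in IGNORED


def remove_everywhere(syllable: str) -> str:
    """Too aggressive: also deletes punctuation inside the word."""
    return "".join(ch for ch in syllable if not is_ignored(ch))


def trim_ends(syllable: str) -> str:
    """Drop ignored characters from the left, reverse, drop again, reverse back."""
    chars = list(dropwhile(is_ignored, syllable))
    chars = list(dropwhile(is_ignored, reversed(chars)))
    chars.reverse()
    return "".join(chars)


for syllable in ("Å", "Ɓ†", "«Ḉ…»", "?Ḓ—Ḛ"):
    print(syllable, remove_everywhere(syllable), trim_ends(syllable))
```

Only the last syllable distinguishes the two: trimming keeps the dash
between the two letters, whereas filtering would drop it.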

-- Aaron Hill

Attachments: center-lyrics-ignoring-punctuation.ily (7K), utf8.cropped.png (8K)

Alexander Kobel-2
Gosh,

once again I'm flabbergasted by the expertise and helpfulness of the
folks on this list. Thanks a ton!

As far as I can tell, your corrected version works like a charm (and
even fixed a minor misalignment of the "à" syllable that I did not spot
earlier). And IIUC, the only drawback of the quick'n'dirty variant is
that it does not canonicalize the strings involved, so two different
representations of a glyph (e.g., using different combining characters)
will not match. But this is extremely unlikely to matter, given that
they should usually be entered by the same user in the same way, and
that the set of characters to compare with is fairly limited and
probably even has a unique encoding anyway.
So, problem solved for me.
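
For the record, this is the kind of mismatch I mean. It is a small
Python sketch using the standard unicodedata module, purely as an
illustration; nothing like it is part of the .ily file, and the variable
names are made up.

```python
import unicodedata

composed = "\u00e0"        # "à" as a single precomposed code point
decomposed = "a\u0300"     # "a" followed by COMBINING GRAVE ACCENT

# Both render as the same glyph, but a code-point comparison sees two
# different sequences.
print(composed == decomposed)

# Normalizing both sides to the same form (here NFC) makes them comparable.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

None of the punctuation characters in the list has such a second
spelling as far as I can see, so I would not expect this to bite in
practice.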

I can only hope that you did not spend the 4.75 hours between your mails
for the "quick fix"...


Cheers,
Alex

