More flexible handling of case sensitivity in all keys #477

Open
@kmccurley

Description

Is your feature request related to a problem? Please describe.
The bibtex file format is ill-defined when it comes to case sensitivity on keys. This is NOT a duplicate of #453, because that only talks about entry types.

There is a great deal of confusion about case sensitivity in bibtex. This applies to:

  1. entry types
  2. entry keys
  3. field keys

There are also several different use cases for bibtexparser. Some people want to use it to parse a bibtex file and get the same thing back when they print it out. Others want bibtexparser to parse in a way that is close to the behavior of some other tool that parses bibtex files (notably the bibtex binary and the biblatex package). This is part of the problem, because different processing tools exhibit different behavior when they encounter keys that agree in lower case.

For example, consider the following LaTeX file:

\begin{filecontents}[overwrite]{the.bib}
@misc{CamelCase,
  author = {Fester Bestertester},
  Title = {What happens to this title?},
  title = {This has a camel case key},
}
@misc{camelcase,
  author = {Foster Bostertoster},
  title = {This has a lower case key}
}
@misc{another, 
 aUTHOR = {Anthony Ordinary},
 title = {Just a third entry},
}
\end{filecontents}
%%%%%%%%%%%%%%%%%%%
\documentclass{article}
% Uncomment the next line and use biber to see the difference. You will have to remove main.bbl and main.aux first.
%\usepackage{biblatex}
\usepackage{hyperref}
\IfPackageLoadedTF{biblatex}{\addbibresource{the.bib}}{\bibliographystyle{alpha}}
\begin{document}
I don't have much to say.
\cite{camelcase} and \cite{CamelCase} and \cite{another}.
\IfPackageLoadedTF{biblatex}{\printbibliography}{\bibliography{the}}
\end{document}

This example illustrates the difference between bibtex and biblatex. If you process it with the bibtex binary, bibtex reports the following errors and warnings:

Case mismatch error between cite keys CamelCase and camelcase
---line 7 of file main.aux
 : \citation{CamelCase
 :                    }
I'm skipping whatever remains of this command
Database file #1: the.bib
Warning--I'm ignoring camelcase's extra "title" field
--line 8 of file the.bib
Repeated entry---line 11 of file the.bib
 : @misc{camelcase
 :                ,
I'm skipping whatever remains of this entry

In the resulting PDF, bibtex took the first Title field and dropped the second title field. It also dropped the second camelcase entry, producing an undefined reference. Hence you may consider the bibtex binary to treat both entry keys and field keys as case-insensitive. From my observation of author behavior, about 90% use bibtex and maybe 10% use biblatex. Since the bibtex file format was originally bundled with the bibtex binary, I consider this to be the proper interpretation of case sensitivity, but others may disagree.

Now consider the case of biblatex. Uncomment the line that loads biblatex, remove main.aux and main.bbl, and run pdflatex main; biber main; pdflatex main; pdflatex main. The resulting PDF file contains three references, and the first reference takes the second title, "This has a camel case key".

The decades-long problem here is that the syntax of the original bibtex file format was never really defined (and bibtex is still on version 0.99d). There are various tools to parse and handle bibtex files, but they behave differently because they interpret the file format differently. You could argue that both bibtex and biblatex treat entry keys and field keys as lower case, but they behave differently when they encounter keys that have the same lower-case form. Perhaps other tools have their own weird behavior based on their own interpretation of the incomplete bibtex file format.
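The two duplicate-key policies observed above can be sketched in plain Python. This is only an illustration of the reported behavior, not code from bibtex, biber, or bibtexparser; the function name `select_entries` and its `case_sensitive` flag are hypothetical.

```python
def select_entries(keys, case_sensitive):
    """Split entry keys (in file order) into (kept, dropped).

    case_sensitive=False approximates the bibtex behavior reported above:
    a key counts as a repeat if its lower-cased form was already seen.
    case_sensitive=True approximates the biber behavior: only exact
    matches count as repeats.
    """
    seen = set()
    kept, dropped = [], []
    for key in keys:
        normalized = key if case_sensitive else key.lower()
        if normalized in seen:
            dropped.append(key)  # later entry loses, as with "Repeated entry"
        else:
            seen.add(normalized)
            kept.append(key)
    return kept, dropped


keys = ["CamelCase", "camelcase", "another"]  # entry keys from the.bib
print(select_entries(keys, case_sensitive=False))  # bibtex-like
print(select_entries(keys, case_sensitive=True))   # biber-like
```

With the three keys from the.bib, the bibtex-like policy keeps two entries and drops camelcase, while the biber-like policy keeps all three, matching the two PDFs described above.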

I came across this problem because I was using bibtexparser to produce an HTML format for the bibtex entries, and I wanted our system to emulate the behavior of both biblatex and bibtex.

The solutions that I came to:

  1. I wrote middleware to convert entry types to lower case, and both bibtex and biblatex do the same. That way it's easier to decide how to format the entries. The only reason I can see to preserve this is if the bibtexparser user is expecting to see the same thing after parsing and writing out again.
  2. Because \cite is case-sensitive, I decided not to convert entry keys to lower case. It appears that bibtexparser.parse_string does not normalize the case of keys, and only declares a duplicate if the keys match in their original case. The second and subsequent entries with the same key are kept as DuplicateBlockKeyBlock objects but are not treated as entries. This is not the same as the behavior of the bibtex tool, which drops an entry if its lower-cased key matches one already seen. It is consistent with how biber parses the entries.
  3. I wrote middleware to convert field keys to lower case (it's easier to look them up that way, and both biblatex and bibtex treat them as such). There is a question as to whether to take the first or last field encountered when there are duplicates, and it depends on whether you are trying to mimic bibtex or biblatex (or something else). I see no reason to keep both 'title' and 'Title' field keys, but this depends on the use case. I use a flag in the constructor to choose between "keep all", "keep first", or "keep last".
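The field-key handling in item 3 can be sketched as a plain function over (key, value) pairs. This is a minimal illustration, not the middleware described above and not bibtexparser's API; the function name `normalize_field_keys` and the policy strings are assumptions chosen to mirror the "keep all", "keep first", and "keep last" options.

```python
def normalize_field_keys(fields, policy="keep_last"):
    """Lower-case the keys of a list of (key, value) field pairs.

    policy selects what happens when two keys collide after lower-casing:
      "keep_all"   - keep every field, in order
      "keep_first" - keep the first value seen (bibtex-like, per the example)
      "keep_last"  - keep the last value seen (biblatex-like, per the example)
    """
    if policy not in ("keep_all", "keep_first", "keep_last"):
        raise ValueError(f"unknown policy: {policy!r}")

    lowered = [(key.lower(), value) for key, value in fields]
    if policy == "keep_all":
        return lowered

    result = {}  # dicts preserve the position of the first occurrence
    for key, value in lowered:
        if policy == "keep_first" and key in result:
            continue
        result[key] = value  # for "keep_last" this overwrites the value
    return list(result.items())


# Fields of the CamelCase entry from the.bib
fields = [
    ("author", "Fester Bestertester"),
    ("Title", "What happens to this title?"),
    ("title", "This has a camel case key"),
]
print(normalize_field_keys(fields, "keep_first"))
print(normalize_field_keys(fields, "keep_last"))
```

In a real middleware the same logic would be applied to each parsed entry's fields; how that hooks into bibtexparser is left out here on purpose.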

We are using bibtexparser in a system to process latex+bibtex that is uploaded by authors. Our system uses bibexport to extract the entries that are actually cited, and this uses the bibtex binary in the script. This tool only works if the authors use the bibtex tool, since it looks in the .aux file for \bibcite. In order to get around this for authors who use biblatex, our system creates an artificial .aux file that looks like it was produced by the bibtex tool, and we process that with bibexport so that it can extract the entries. Of course biber and bibtex treat duplicate keys differently, so this will fail if authors depend on the biber behavior to save entries with keys that collide in lower case with others.
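For illustration, an artificial .aux file of this kind could be generated along the following lines. This is a sketch under assumptions, not the system described above: it relies only on the fact that the bibtex binary reads \citation, \bibstyle, and \bibdata commands from the .aux file, and the function name `make_aux` and its parameters are hypothetical. How the cited keys are collected for a biblatex document is not shown.

```python
def make_aux(cite_keys, bib_name, style="alpha"):
    """Return the text of a minimal .aux file that bibtex can process.

    cite_keys: entry keys to request, in citation order
    bib_name:  database name without the .bib extension
    style:     bibliography style name without the .bst extension
    """
    lines = [r"\relax"]
    lines += [rf"\citation{{{key}}}" for key in cite_keys]
    lines.append(rf"\bibstyle{{{style}}}")
    lines.append(rf"\bibdata{{{bib_name}}}")
    return "\n".join(lines) + "\n"


print(make_aux(["CamelCase", "another"], "the"))
```

Note that passing both CamelCase and camelcase as cite keys would reproduce the case mismatch error shown earlier, which is exactly the situation where the biber and bibtex behaviors diverge.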

Describe the solution you'd like

The bottom line here is that software tools that handle the bibtex file format are inconsistent in how they treat keys. It seems useful to offer options for bibtexparser to emulate the behavior of other tools that process the bibtex file format. This can be customized through middleware, and it might be useful to have additional standard middleware classes to support the different behaviors required. A replacement for the bibtex file format also seems long overdue. There are too many nonstandard entry types and field types. It's probably too late to fix the definition of the bibtex file format unless we add something like @version at the beginning of the file to say what tools the file is intended to be processed with. I don't think that's the job of bibtexparser, though, unless it is used in a tool to replace the bibtex or biber tools.

I would be willing to contribute a PR to offer other middleware to handle these cases.
