LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives

60 points by oldfuture 9 hours ago

As far as I can tell they trawled a big archive for sensitive information, (unsurprisingly) found some, and then didn't try to contact anyone affected before telling the world "hey, there are login credentials to be found in here".

crote 7 hours ago

Don't forget giving it a fancy name in the hope that it'll go viral!
I am getting so tired of every vulnerability getting a cutesy pet name and pretending to be the new Heartbleed / Spectre / Meltdown...

wongarsu 7 hours ago

Beats having to remember and communicate CVE numbers.

KeplerBoy 7 hours ago

It's not like every datapoint comes with the email of the corresponding author.

mseri 7 hours ago

Google has a great aid to reduce the attack surface: https://github.com/google-research/arxiv-latex-cleaner

Y_Y 6 hours ago

I use this before submission and recommend others do too. If I were in charge of arXiv, I'd have it integrated as an optional part of the submission process.

barthelomew 7 hours ago

Paper LaTeX files often contain surprising details. When a paper lacks code, looking at the LaTeX source has become a part of my reproduction workflow. The comments often reveal non-trivial insights. Often, they reveal a simpler version of the methodology section (which for poor "novelty" purposes is purposely obscured via mathematical jargon).

seg_lol 6 hours ago

Reading the LaTeX equations also makes for easier (LLM) translation into code than trying to read the PDF.

sneela 3 hours ago

I agree with other comments that this research treads a fine, unethical line. Did the authors responsibly disclose this, as is often done in the security research community? I cannot find any mention of it in the paper. The researchers seem to be involved in security-related research (first author is doing a PhD, last author holds a PhD).

At least arXiv could have run the cleaner [1] before the print of this pre-print (lol). If there was no disclosure, then I think this pre-print becomes unethical to put up.

> leading to the identification of nearly 1,200 images containing sensitive metadata. The types of data represented vary significantly. While device information (e.g., the camera used) or software details (such as the exact version of Photoshop) may already raise concerns, in over 600 cases the metadata contained GPS coordinates, potentially revealing the precise location where a photo was taken. In some instances, this could expose a researcher’s home address (when tied to a profile picture) or the location of research facilities (when images capture experimental equipment)

Oof, that's not too great.

[1] https://github.com/google-research/arxiv-latex-cleaner
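
For the image-metadata case in the quote above, a minimal Python sketch of removing Exif data (where GPS coordinates are usually stored) from a JPEG before upload. The function name `strip_exif` is hypothetical, the code only handles the common layout where every segment before the image data carries a length field, and it leaves other metadata containers such as XMP untouched. It is an illustration, not how the paper's authors or any existing cleaner tool does it.

```python
import struct


def strip_exif(jpeg: bytes) -> bytes:
    """Return a copy of a JPEG byte string without its Exif APP1 segments."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(jpeg[:2])
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("unexpected byte where a segment marker should be")
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest as-is.
            out += jpeg[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        segment = jpeg[pos:pos + 2 + length]
        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment
        pos += 2 + length
    # No scan segment found: keep whatever bytes remain.
    out += jpeg[pos:]
    return bytes(out)
```

Usage would be along the lines of reading a figure with `open(path, "rb").read()`, passing the bytes through `strip_exif`, and writing the result back before packaging the submission.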

michaelmior 24 minutes ago

Having arXiv run the cleaner automatically would definitely be cool, although I've found it non-trivial to get working consistently for my own papers. That said, it would be nice if this were at least an option.

calvinmorrison 2 hours ago

They responsibly disclosed it in their research paper. An unethical use would be to use those coordinates to gain state secrets about, say, research facilities.

agarttha an hour ago

I offer free beer in a comment in my arxiv tex source.

cozzyd 5 hours ago

This is why my forarxiv.tex make targets always include a call to latexpand --empty-comments

Though I doubt all my collaborators do something similar.
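
For collaborators without such a make target, the same idea can be sketched in a few lines of Python. This is not a reimplementation of latexpand: the function name `strip_latex_comments` is made up for illustration, it leaves a bare `%` where each comment was (so line-ending `%` still suppresses whitespace), and it does not special-case verbatim environments or `\url{...}` arguments that contain a literal `%`.

```python
import re

# An unescaped '%' starts a comment. The lookbehind plus the group of
# backslash pairs skips '\%' (a literal percent sign) but still treats
# '\\%' (a line break followed by a comment) as a comment start.
_COMMENT = re.compile(r"(?<!\\)((?:\\\\)*)%.*")


def strip_latex_comments(source: str) -> str:
    """Drop comment text from LaTeX source, keeping a bare '%' in its place."""
    cleaned = []
    for line in source.splitlines():
        # Keep any backslash pairs that preceded the '%', drop the comment text.
        cleaned.append(_COMMENT.sub(r"\1%", line))
    return "\n".join(cleaned)
```

Running the output through a diff against the original before submitting is a cheap way to check that only comment text was removed.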

kmm 7 hours ago

I sort of understand the reasoning for why arXiv prefers TeX to PDF [1], even though I feel it's a bit much to make it mandatory to submit the original TeX file if they detect that a submitted PDF was produced from one. But I've never understood what the added value is in hosting the source publicly.

Though I have to admit, when I was still in academia, whenever I saw a beautiful figure or formatting in a preprint, I'd often try to take some inspiration from the source for my own work, occasionally learning a new neat trick or package.

1: https://info.arxiv.org/help/faq/whytex.html

irowe 6 hours ago

A huge value in having authors upload the original source is that it divorces the content from the presentation (mostly). That the original sources were available was sufficient for a large majority of the corpus to be automatically rendered into HTML for easier reading on many devices: https://info.arxiv.org/about/accessible_HTML.html. I don't think it would have been as simple if they had to convert PDFs.