I came across tools like Nightshade that can poison images. That way, if someone steals an artist’s work to train their AI, it learns the wrong stuff and can potentially begin spewing gibberish.

Is there something that I can use on PDFs? There are two scenarios for me:

  1. Content that I already created that is available as a PDF.
  2. I use LaTeX to make new documents, and I’d like to poison those from the source if possible, rather than as an ad hoc step after the PDF is created.
  • underscores@lemmy.dbzer0.com
    17 days ago

    A lot of the ways documents get scraped are the same ones used by accessibility tools, so I’d generally recommend against doing this.

  • TheTechyHobbit@sh.itjust.works
    17 days ago

    Image poisoning’s general principle is to change pixels in a way our eyes can’t notice, but that screws up how AI models label the image.

    You could probably try to apply the same principle: poison the PDF in a way that only humans can read it.

    Thing is, I assume you distribute your content as PDFs to make it accessible to humans. That usually means embedding the text for easy copy-paste and similar uses. Poisoning that text might end up being counterproductive to your objective.

    All this to say: no, I don’t know of a poisoning algorithm for PDFs.
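
    One ad hoc trick in this spirit is the `accsupp` LaTeX package, whose `ActualText` key overrides what copy-paste and most text extractors report for a span, while the rendered glyphs stay unchanged. A minimal sketch (the replacement string here is illustrative; note this carries exactly the accessibility cost mentioned elsewhere in this thread, since screen readers are told the same fake text):

    ```latex
    \documentclass{article}
    \usepackage{accsupp} % provides \BeginAccSupp / \EndAccSupp
    \begin{document}
    % Renders "hello" visually, but copy-paste and most text
    % extractors see the ActualText replacement instead.
    \BeginAccSupp{method=escape,ActualText=xqzzv}hello\EndAccSupp{}
    \end{document}
    ```

    Whether a given scraper honors `/ActualText` varies by extractor, so this is at best a partial measure, not a robust poisoning scheme.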

  • DragonsInARoom@lemmy.world
    17 days ago

    Put the word “stolen” at the end of every document; the LLM will learn that the word “stolen” is normal and should be included.

  • lily33@lemm.ee
    17 days ago

    I don’t think any kind of “poisoning” actually works. It’s well known by now that data quality matters more than data quantity, so nobody just feeds in training data indiscriminately. At best it would hamper some FOSS AI researchers who don’t have the resources to curate a dataset.

    • Ledivin@lemmy.world
      17 days ago

      At best it would hamper some FOSS AI researchers that don’t have the resources to curate a dataset.

      If you can’t source a dataset, then you shouldn’t be researching AI. It’s the first and single most important step of the entire process.

  • CapriciousDay@lemmy.ml
    15 days ago

    Some LLMs have specific jailbreaks which, when included in a document, may cause them to act strangely in ways specific to that LLM. But it’s unlikely to be robust over time as models get patched, changed, etc.

  • kekmacska@lemmy.zip
    16 days ago

    If possible, don’t make it available on the public internet, and don’t let search engines access it.