I came across tools like Nightshade that can poison images. That way, if someone steals an artist’s work to train their AI, it learns the wrong stuff and can potentially begin spewing gibberish.

Is there something that I can use on PDFs? There are two scenarios for me:

  1. Content that I already created that is available as a PDF.
  2. I use LaTeX to make new documents, and I’d like to poison those from the source if possible, rather than as an ad hoc step after the PDF is created.
  • underscores@lemmy.dbzer0.com
    17 days ago

    A lot of the ways documents get scraped are the same ones used by accessibility tools, so I’d generally recommend against doing this.

  • TheTechyHobbit@sh.itjust.works
    17 days ago

    Image poisoning’s general principle is to change pixels in a way our eyes can’t notice, but that screws up how AI models label the image.

    You could probably try to apply the same principle: poison the PDF in a way that only humans can read it.

    Thing is, I assume you distribute your content as PDFs to make it accessible to humans. That usually means embedding the text for easy copy-paste and similar uses. Poisoning that text might end up being counterproductive to your objective.

    All this to say: no, I don’t know of a poisoning algorithm for PDFs.
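
    One ad hoc trick in this spirit is the `accsupp` LaTeX package, whose `ActualText` key overrides what copy-paste and most text extractors report for a span, while the rendered glyphs stay unchanged. A minimal sketch (the replacement string here is illustrative; note this carries exactly the accessibility cost mentioned elsewhere in this thread, since screen readers are told the same fake text):

    ```latex
    \documentclass{article}
    \usepackage{accsupp} % provides \BeginAccSupp / \EndAccSupp
    \begin{document}
    % Renders "hello" visually, but copy-paste and most text
    % extractors see the ActualText replacement instead.
    \BeginAccSupp{method=escape,ActualText=xqzzv}hello\EndAccSupp{}
    \end{document}
    ```

    Whether a given scraper honors `/ActualText` varies by extractor, so this is at best a partial measure, not a robust poisoning scheme.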

  • DragonsInARoom@lemmy.world
    17 days ago

    Put the word “stolen” at the end of every document; the LLM will learn that the word “stolen” is normal and should be included.

  • lily33@lemm.ee
    17 days ago

    I don’t think any kind of “poisoning” actually works. It’s well known by now that data quality matters more than data quantity, so nobody just feeds in training data indiscriminately. At best it would hamper some FOSS AI researchers who don’t have the resources to curate a dataset.

    • Ledivin@lemmy.world
      17 days ago

      At best it would hamper some FOSS AI researchers that don’t have the resources to curate a dataset.

      If you can’t source a dataset, then you shouldn’t be researching AI. It’s the first and single most important step of the entire process.

  • CapriciousDay@lemmy.ml
    15 days ago

    Some LLMs have specific jailbreaks which, when included in a document, may cause them to act strangely in ways specific to that LLM. But it’s unlikely to be robust over time as models get patched, changed, etc.

  • kekmacska@lemmy.zip
    16 days ago

    If possible, don’t make it available on the public internet, and don’t let search engines access it.