Error importing documents
I am trying to import ~2.5k documents from my local paperless-ngx. Unfortunately, I am getting an error using the document_importer as described in the docs.
The error is:
CommandError: The manifest file refers to "<Some-Filename>" which does not appear to be in the source directory.
The filename includes German umlauts. According to this issue, there may be some connection to the locale.
Has anybody successfully imported a larger number of documents?
Thanks in advance.
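One way to narrow down a "does not appear to be in the source directory" error for filenames with umlauts is to compare the names the manifest expects with the names actually on disk. The sketch below is only a diagnostic idea, not part of paperless-ngx: it assumes the mismatch might come from different Unicode forms of the same name (composed vs. decomposed umlauts), which is a guess and not confirmed in this thread. The function name find_normalization_mismatches and the sample filename are made up for illustration.

```python
import locale
import sys
import unicodedata


def nfc(name):
    """Return the NFC (composed) form of a filename."""
    return unicodedata.normalize("NFC", name)


def find_normalization_mismatches(manifest_names, directory_names):
    """Pair manifest names with on-disk names that differ only in Unicode form.

    Returns a list of (manifest_name, directory_name) tuples. Names that
    match exactly, or that have no counterpart at all, are not reported.
    """
    on_disk = set(directory_names)
    by_nfc = {nfc(name): name for name in directory_names}
    mismatches = []
    for wanted in manifest_names:
        if wanted in on_disk:
            continue  # exact match, nothing to report
        candidate = by_nfc.get(nfc(wanted))
        if candidate is not None:
            mismatches.append((wanted, candidate))
    return mismatches


if __name__ == "__main__":
    # Encodings Python picked up from the environment (locale related).
    print("preferred encoding:", locale.getpreferredencoding(False))
    print("filesystem encoding:", sys.getfilesystemencoding())

    # Hypothetical example: the same name in composed and decomposed form.
    composed = "\u00dcbersicht.pdf"
    decomposed = "U\u0308bersicht.pdf"
    print(find_normalization_mismatches([composed], [decomposed]))
```

If the list comes back non-empty, the files exist but under a differently encoded name, which would point towards an encoding or locale problem and away from files that are really missing.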
@stantropics Dropping a file named
/app/data/consume works for me. Let me try the CLI now.
OK, what I did was:
# mkdir -p /app/data/out
# python3 manage.py document_exporter /app/data/out
That exported the documents. Then, I deleted everything in paperless. Then, I reimported:
# python3 manage.py document_importer /app/data/out
Installed 4 object(s) from 1 fixture(s)
Copy files into paperless...
100%|██████████| 2/2 [00:00<00:00, 49.80it/s]
Updating search index...
100%|██████████| 2/2 [00:00<00:00, 125.67it/s]
So, maybe there is something else going on. Are you using Web Terminal or Cloudron CLI?
@girish Hey, thanks for getting back to this.
I found a workaround: using --use-filename-format will name the exported documents by their document ID.
I am giving paperless-ngx on Cloudron another try, but I am again facing some problems importing my documents.
Exporting the documents from another instance works as expected and importing them back into paperless-ngx works as well:
root@d9967e75-b4cc-4808-ba75-a5f12498470c:/app/code# python3 src/manage.py document_importer /app/data/import/
Installed 3258 object(s) from 1 fixture(s)
Copy files into paperless...
100%|██████████| 3097/3097 [00:17<00:00, 177.43it/s]
Updating search index...
100%|██████████| 3097/3097 [00:13<00:00, 229.10it/s]
root@d9967e75-b4cc-4808-ba75-a5f12498470c:/app/code#
However, after performing the import I am not seeing any data in paperless. Any idea what is going on here? Any help is appreciated.
@stantropics for a start, does importing a single document work? Maybe something to do with filenames that you reported earlier?
@girish Thanks for getting back to this. The last time it failed I was able to import ~2k documents and their data (tags etc.) successfully. However, I was not able to develop a nice workflow to get data into paperless. This time I have a workflow idea but cannot get my documents (~3k) into paperless on cloudron.
The first thing I noticed is that I cannot execute any operation using the
script. It is always mandatory to perform the following first:
python3 src/manage.py migrate
Unfortunately, none of my operations threw an error, but none of them worked either.
I did the following steps to reproduce the problem:
- Install new paperless app
- Import one pdf document
- Export the data to /app/data/export (export data was generated, but it looks like no documents were backed up, just user information; e.g. the pdf file was not copied to the backup folder)
- Delete the paperless app and create a new one
- Import backup
Unfortunately (and as expected) I am not seeing the document or data.
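Before re-importing, it may help to check whether the export folder really contains the document files. The sketch below is a rough check under assumptions: it assumes the export has a single manifest.json in the export directory, that document records use the model name "documents.document", and that the exported file names are stored under the keys listed in FILE_KEYS. These details may differ between paperless-ngx versions, so please verify them against your own manifest. The function name summarize_export is made up.

```python
import json
from pathlib import Path

# Assumed manifest keys that point at exported files (verify in your manifest).
FILE_KEYS = (
    "__exported_file_name__",
    "__exported_thumbnail_name__",
    "__exported_archive_name__",
)


def summarize_export(export_dir):
    """Count document records in manifest.json and list missing files.

    Returns a dict with the number of document records and the referenced
    file names that do not exist inside export_dir.
    """
    export_path = Path(export_dir)
    manifest_text = (export_path / "manifest.json").read_text(encoding="utf-8")
    manifest = json.loads(manifest_text)

    # Assumption: documents are stored as Django fixture records of this model.
    documents = [r for r in manifest if r.get("model") == "documents.document"]

    missing = []
    for record in documents:
        for key in FILE_KEYS:
            name = record.get(key)
            if name and not (export_path / name).is_file():
                missing.append(name)
    return {"documents": len(documents), "missing_files": missing}


if __name__ == "__main__":
    # Path taken from the steps above; adjust to your export location.
    print(summarize_export("/app/data/export"))
```

A result with zero documents would suggest that the export itself did not pick up the document, while a non-empty missing_files list would suggest that the records were written but the files were not copied.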
@stantropics One idea is to just drop the files into the consume directory and let paperless consume things at its own pace. Maybe the import flow has a limit, though I have no insight into this.
@girish Hi, this is one idea. However, in this case I would need to set tags for about 3k documents again. I would like to avoid that.
The strange thing is that there was no issue importing 2.5k documents the last time (after I found a workaround for the document names). This time I cannot even import 1 document. Is it working for you or anyone else here?
@stantropics you can only import an existing paperless export (via document_exporter command). That's my understanding anyway.
I documented it in https://docs.cloudron.io/apps/paperless-ngx/#importing .