Are you checkpointing?
If so, how often?
Do you have all 1 million files in a single directory?
What is your row size? (I am trying to get an idea of how many rows per file.)
Are you using the Load operator?
How long are you *anticipating* the acquisition phase taking?
Is this a one-time job, or something that will need to be run over and over again (e.g., daily, weekly, monthly)?
Again, we have not had anyone try to process this many files, so we cannot foresee the types of issues you might encounter.
Our file reader operator will attempt to store all of the file names and their sizes in order to balance the load across instances, so a job with this many files needs a lot of internal memory just for that bookkeeping.
I suspect you would do better with more than 2 instances of the file reader.
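For illustration, here is a minimal sketch of how additional file reader instances are usually requested in a TPT script. Everything in it is a placeholder rather than taken from your job: the job and operator names (many_file_load, FileReader, Loader), the schema, the directory path and file pattern, the target table, and the 300-second checkpoint interval shown in the tbuild comment.

/* Typically launched as:  tbuild -f load_job.tpt -z 300
   (-z sets the checkpoint interval in seconds; the script name
   and interval here are placeholders) */
DEFINE JOB many_file_load
DESCRIPTION 'Sketch: multiple file reader instances'
(
    /* Placeholder schema -- replace with your actual row layout */
    DEFINE SCHEMA InputSchema
    (
        col1 VARCHAR(50),
        col2 VARCHAR(50)
    );

    /* File reader (DataConnector producer); path and pattern are placeholders */
    DEFINE OPERATOR FileReader
    TYPE DATACONNECTOR PRODUCER
    SCHEMA InputSchema
    ATTRIBUTES
    (
        VARCHAR DirectoryPath = '/data/incoming/',
        VARCHAR FileName      = '*.txt',
        VARCHAR Format        = 'Delimited',
        VARCHAR TextDelimiter = '|',
        VARCHAR OpenMode      = 'Read'
    );

    /* Load operator; credentials come in as job variables */
    DEFINE OPERATOR Loader
    TYPE LOAD
    SCHEMA *
    ATTRIBUTES
    (
        VARCHAR TdpId        = @TdpId,
        VARCHAR UserName     = @UserName,
        VARCHAR UserPassword = @UserPassword,
        VARCHAR TargetTable  = 'TargetTable',
        VARCHAR LogTable     = 'TargetTable_log'
    );

    /* FileReader[4] requests four parallel reader instances; the
       operator splits the file list across them for load balancing. */
    APPLY ('INSERT INTO TargetTable ( :col1, :col2 );')
    TO OPERATOR (Loader)
    SELECT * FROM OPERATOR (FileReader[4]);
);

Four is only a starting point; with a million files you will likely need to experiment with the instance count (and watch memory consumption) to find what works.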