IndexWriter
An IndexWriter creates and maintains an index.
The {@link OpenMode} option on {@link IndexWriterConfig#setOpenMode(OpenMode)} determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with {@link OpenMode#CREATE} even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. With {@link OpenMode#CREATE_OR_APPEND}, IndexWriter creates a new index if none exists at the provided path, and otherwise opens the existing index.
In either case, documents are added with {@link #addDocument(Iterable) addDocument} and removed with {@link #deleteDocuments(Term...)} or {@link #deleteDocuments(Query...)}. A document can be updated with {@link #updateDocument(Term, Iterable) updateDocument} (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, {@link #close() close} should be called.
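The add, update, delete, and close cycle described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a recent Lucene release (8.x or later) on the classpath, and the class name, the `doc` helper, and the field names `id` and `body` are hypothetical. It uses an in-memory `ByteBuffersDirectory` so that nothing touches disk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class WriterLifecycleSketch {
  // Hypothetical helper: an exact-match "id" key plus an analyzed "body" field.
  static Document doc(String id, String body) {
    Document d = new Document();
    d.add(new StringField("id", id, Field.Store.YES));
    d.add(new TextField("body", body, Field.Store.NO));
    return d;
  }

  // Returns the number of live documents after three adds, one update and one delete.
  static int liveDocsAfterEdits() {
    try (Directory dir = new ByteBuffersDirectory()) {
      IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
      cfg.setOpenMode(OpenMode.CREATE_OR_APPEND); // create if missing, else append
      try (IndexWriter writer = new IndexWriter(dir, cfg)) {
        writer.addDocument(doc("1", "first"));
        writer.addDocument(doc("2", "second"));
        writer.addDocument(doc("3", "third"));
        // Delete-then-add of the document whose id is "2".
        writer.updateDocument(new Term("id", "2"), doc("2", "second, revised"));
        writer.deleteDocuments(new Term("id", "3"));
      } // close() commits pending changes with the default configuration
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.numDocs(); // live (non-deleted) documents only
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Here documents 1 and 2 survive, so a reader opened after `close()` reports two live documents.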
Each method that changes the index returns a {@code long} sequence number, which expresses the effective order in which each change was applied. {@link #commit} also returns a sequence number, describing which changes are in the commit point and which are not. Sequence numbers are transient (not saved into the index in any way) and only valid within a single {@code IndexWriter} instance.
These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method calls). A flush is triggered once enough documents have been added since the last flush, measured either by the RAM usage of the documents (see {@link IndexWriterConfig#setRAMBufferSizeMB}) or by the number of added documents (see {@link IndexWriterConfig#setMaxBufferedDocs(int)}). The default is to flush when RAM usage hits {@link IndexWriterConfig#DEFAULT_RAM_BUFFER_SIZE_MB} MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. In contrast to the other flush options {@link IndexWriterConfig#setRAMBufferSizeMB} and {@link IndexWriterConfig#setMaxBufferedDocs(int)}, deleted terms won't trigger a segment flush. Note that flushing just moves the internal buffered state in IndexWriter into the index; these changes are not visible to IndexReader until either {@link #commit()} or {@link #close} is called. A flush may also trigger one or more segment merges, which by default run within a background thread so as not to block the addDocument calls (see below for changing the {@link MergeScheduler}).
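To illustrate the difference between buffering and visibility, the sketch below configures a RAM-based flush threshold and checks what a reader opened on the directory sees before and after `commit()`. It assumes a recent Lucene release on the classpath. The class name, field name, and the 64 MB buffer size are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class FlushVisibilitySketch {
  // Returns {docs seen before commit, docs seen after commit}.
  static int[] docsSeenBeforeAndAfterCommit() {
    try (Directory dir = new ByteBuffersDirectory()) {
      IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
      cfg.setRAMBufferSizeMB(64.0); // flush by RAM usage; 64 MB is an arbitrary value
      try (IndexWriter writer = new IndexWriter(dir, cfg)) {
        writer.commit(); // empty first commit so that a reader can open the directory

        Document d = new Document();
        d.add(new TextField("body", "buffered until commit", Field.Store.NO));
        writer.addDocument(d);

        int before;
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          before = reader.numDocs(); // the add is not committed yet
        }
        writer.commit();
        int after;
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          after = reader.numDocs();
        }
        return new int[] {before, after};
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```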
Opening an IndexWriter creates a lock file for the directory in use. Trying to
open another IndexWriter on the same directory will lead to a {@link
LockObtainFailedException}.
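The locking behavior can be observed with a short sketch. It assumes a recent Lucene release and an in-memory `ByteBuffersDirectory`, whose lock is held in memory rather than as a file on disk. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;

final class WriteLockSketch {
  // Returns true if opening a second writer on the same directory is rejected.
  static boolean secondWriterIsRejected() {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter first = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // Each writer needs its own IndexWriterConfig instance.
      try (IndexWriter second =
          new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        return false; // not expected while 'first' still holds the write lock
      } catch (LockObtainFailedException e) {
        return true;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```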
Expert: IndexWriter allows an optional {@link IndexDeletionPolicy} implementation
to be specified. You can use this to control when prior commits are deleted from the index. The
default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior commits as
soon as a new commit is done. Creating your own policy can allow you to explicitly keep previous
"point in time" commits alive in the index for some time, either because this is useful for your
application, or to give readers enough time to refresh to the new commit without having the old
commit deleted out from under them. The latter is necessary when multiple computers take turns
opening their own {@code IndexWriter} and {@code IndexReader}s against a single shared index
mounted via remote filesystems like NFS which do not support "delete on last close" semantics. A
single computer accessing an index via NFS is fine with the default deletion policy since NFS
clients emulate "delete on last close" locally. That said, accessing an index via NFS will likely
result in poor performance compared to a local IO device.
Expert: IndexWriter allows you to separately change the {@link MergePolicy} and
the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes to the
segments in the index. Its role is to select which merges to do, if any, and return a {@link
MergePolicy.MergeSpecification} describing the merges. The default is {@link
TieredMergePolicy}. Then, the {@link MergeScheduler} is invoked with the requested merges
and it decides when and how to run the merges. The default is {@link ConcurrentMergeScheduler}.
NOTE: if you hit an Error, or disaster strikes during a checkpoint, then IndexWriter will close itself. This is a defensive measure in case any internal state (buffered documents, deletions, reference counts) was corrupted. Any subsequent calls will throw an AlreadyClosedException.
NOTE: {@link IndexWriter} instances are completely thread safe, meaning multiple
threads can call any of its methods concurrently. If your application requires external
synchronization, you should not synchronize on the IndexWriter instance as
this may cause deadlock; use your own (non-Lucene) objects instead.
NOTE: If you call Thread.interrupt() on a thread that's within
IndexWriter, IndexWriter will try to catch this (eg, if it's in a wait() or Thread.sleep()), and
will then throw the unchecked exception {@link ThreadInterruptedException} and clear the
interrupt status on the thread.
Types
If DirectoryReader.open has been called (ie, this writer is in near real-time mode), then after a merge completes, this class can be invoked to warm the reader on the newly merged segment, before the merge commits. This is not required for near real-time search, but will reduce search latency on opening a new near real-time reader after a merge completes.
Properties
Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
If enabled, information about merges will be printed to this.
Returns an unmodifiable set of segments that are currently merging.
Functions
Adds a document to this index.
Atomically adds a block of documents with sequentially assigned document IDs, such that an external reader will see all or none of the documents.
Merges the provided indexes into this index.
Adds all segments from an array of indexes into this index.
Runs a single merge operation for IndexWriter.addIndexes.
If SegmentInfos.getVersion is below newVersion then update it to this value.
Tests should use this method to snapshot the current segmentInfos to have a consistent view.
Commits all pending changes (added and deleted documents, segment merges, added indexes, etc.) to the index, and syncs all referenced index files, such that a reader will see the changes and the index updates will survive an OS or machine crash or power loss. Note that this does not wait for any running background merges to finish. This may be a costly operation, so you should test the cost in your application and do it only when really necessary.
Record that the files referenced by this SegmentInfos are no longer in use. Only call this if you are sure you previously called .incRefDeleter.
Deletes the document(s) containing any of the terms. All given deletes are applied and flushed atomically at the same time.
Deletes the document(s) matching any of the provided queries. All given deletes are applied and flushed atomically at the same time.
Expert: remove any index files that are no longer used.
Used internally to throw an AlreadyClosedException if this IndexWriter has been closed (closed=true) or is in the process of closing (closing=true).
Expert: Flushes the next pending writer per thread buffer if available or the largest active non-pending writer per thread buffer in the calling thread. This can be used to flush documents to disk outside of an indexing thread. In contrast to .flush this won't mark all currently active indexing buffers as flush-pending.
Translates a frozen packet of delete term/query, or doc values updates, into their actual docIDs in the index, and applies the change. This is a heavy operation and is done concurrently by incoming indexing threads.
Forces merge policy to merge segments until there are <= maxNumSegments. The actual merges to be executed are determined by the MergePolicy.
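As a rough illustration of forcing a merge, the sketch below creates several small segments by committing after each added document and then calls `forceMerge(1)`. It assumes a recent Lucene release with the default merge policy and scheduler. The class name and field name are hypothetical. The number of segments before the merge is not asserted, since it depends on the merge policy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class ForceMergeSketch {
  // Returns the number of segments a fresh reader sees after forceMerge(1).
  static int segmentsAfterForceMerge() {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      for (int i = 0; i < 3; i++) {
        Document d = new Document();
        d.add(new StringField("id", Integer.toString(i), Field.Store.YES));
        writer.addDocument(d);
        writer.commit(); // flushes the buffered document into a segment
      }
      writer.forceMerge(1); // blocks until at most one segment remains
      writer.commit();
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.leaves().size(); // one leaf per segment
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```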
Forces merging of all segments that have deleted documents. The actual merges to be executed are determined by the MergePolicy. For example, the default TieredMergePolicy will only pick a segment if the percentage of deleted docs is over 10%.
Returns the analyzer used by this index.
Returns the Directory used by this index.
Returns accurate DocStats for this writer. Fetching the two values separately is racy: numDocs, for instance, can change after maxDoc is fetched, so that numDocs appears greater than maxDoc, which makes it hard to get accurate document stats from IndexWriter.
Return an unmodifiable set of all field names as visible from this IndexWriter, across all segments of the index.
Returns the number of bytes currently being flushed.
Returns the commit user data iterable previously set with .setLiveCommitData, or null if nothing has been set yet.
Returns the highest sequence number across all completed operations, or 0 if no operations have finished yet. Still in-flight operations (in other threads) are not counted until they finish.
Returns the number of documents in the index, including documents that are being added (i.e., reserved).
Expert: returns a read-only reader, covering all committed as well as uncommitted changes to the index. This provides "near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer or calling .commit.
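A near real-time reader can be sketched as below. The sketch assumes that application code obtains such a reader through `DirectoryReader.open(IndexWriter)` and refreshes it with `DirectoryReader.openIfChanged`, using a recent Lucene release. The class name, helper, and field name are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class NearRealTimeSketch {
  // Hypothetical helper: a document with a single "id" field.
  static Document doc(String id) {
    Document d = new Document();
    d.add(new StringField("id", id, Field.Store.YES));
    return d;
  }

  // Returns {docs seen by the first NRT reader, docs seen after refreshing it}.
  static int[] docsSeenByNrtReaders() {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      writer.addDocument(doc("1"));
      // No commit() has been called; the reader is opened from the writer itself.
      try (DirectoryReader first = DirectoryReader.open(writer)) {
        int firstCount = first.numDocs();
        writer.addDocument(doc("2"));
        // Returns null when nothing changed since 'first' was opened.
        DirectoryReader changed = DirectoryReader.openIfChanged(first, writer);
        if (changed == null) {
          return new int[] {firstCount, firstCount};
        }
        try (DirectoryReader second = changed) {
          return new int[] {firstCount, second.numDocs()};
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```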
If this IndexWriter was closed as a side-effect of a tragic exception, e.g. disk full while flushing a new segment, this returns the root cause exception. Otherwise (no tragic exception has occurred) it returns null.
Returns true if there are any changes or deletes that are not flushed or applied.
Returns true if this index has deletions (including buffered deletions). Note that this will return true if there are buffered Term/Query deletions, even if it turns out those buffered deletions don't match any documents.
Expert: returns true if there are merges waiting to be scheduled.
Returns true if there may be changes that have not been committed. There are cases where this may return true when there are no actual "real" changes to the index, for example if you've deleted by Term or Query but that Term or Query does not match any documents. Also, if a merge kicked off as a result of flushing a new segment during .commit, or a concurrent merge finished, this method may return true right after you had just called .commit.
Record that the files referenced by this SegmentInfos are still in use.
Expert: asks the mergePolicy whether any merges are necessary now and, if so, runs the requested merges and then iterates (tests again whether merges are needed) until no more merges are returned by the mergePolicy.
Does initial setup for a merge, which is fast but holds the synchronized lock on IndexWriter instance.
Obtain the number of deleted docs for a pooled reader. If the reader isn't being pooled, the segmentInfo's delCount is returned.
Returns the number of deletes a merge would claim back if the given segment is merged.
Expert: Return the number of documents currently buffered in RAM.
This method should be called on a tragic event, i.e., if a downstream class of the writer hits an unrecoverable exception. This method does not rethrow the tragic event exception.
Expert: prepare for commit. This does the first phase of 2-phase commit. This method does all steps necessary to commit changes since this writer was opened: flushes pending added and deleted docs, syncs the index files, and writes most of the next segments_N file. After calling this you must call either .commit to finish the commit, or .rollback to revert the commit and undo all changes done since the writer was opened.
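The two outcomes of a 2-phase commit can be sketched as follows, assuming a recent Lucene release. The class name, method name, and field name are hypothetical. An empty baseline commit is made first so that a reader can open the directory on either path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class TwoPhaseCommitSketch {
  // Adds one document, prepares the commit, then either finishes it or rolls back.
  // Returns the number of documents a fresh reader sees afterwards.
  static int docsAfter(boolean finishCommit) {
    try (Directory dir = new ByteBuffersDirectory()) {
      // Not in try-with-resources: rollback() already closes the writer.
      IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
      writer.commit(); // baseline: an empty but readable index

      Document d = new Document();
      d.add(new StringField("id", "1", Field.Store.YES));
      writer.addDocument(d);

      writer.prepareCommit(); // phase 1: flush and sync, commit not yet visible
      if (finishCommit) {
        writer.commit(); // phase 2: finish the prepared commit
        writer.close();
      } else {
        writer.rollback(); // discard the prepared commit and close the writer
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.numDocs();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a real application the second phase would typically be coordinated with another transactional resource. The boolean flag here only stands in for that decision.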
Return the memory usage of this object in bytes. Negative values are illegal.
Close the IndexWriter without committing any changes that have occurred since the last commit (or since it was opened, if commit hasn't been called). This removes any temporary files that had been created, after which the state of the index will be the same as it was when commit() was last called or when this writer was first opened. This also clears a previous call to .prepareCommit.
Sets the iterator to provide the commit user data map at commit time. Calling this method is considered a committable change and will be committed by .commit even if there are no other changes in this writer. Note that you must call this method before .prepareCommit. Otherwise it won't be included in the follow-on .commit.
Sets the commit user data iterator, controlling whether to advance the SegmentInfos.getVersion.
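Commit user data can be written and read back roughly as shown below. The sketch assumes a recent Lucene release and reads the data through `DirectoryReader.getIndexCommit().getUserData()`. The class name and the key and value strings are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class CommitUserDataSketch {
  // Stores one key/value pair in the commit and reads it back from a new reader.
  static String roundTrip() {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer =
          new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        Map<String, String> data = Collections.singletonMap("source", "nightly-import");
        // Must be set before prepareCommit()/commit() to be included in that commit.
        writer.setLiveCommitData(data.entrySet());
        writer.commit();
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.getIndexCommit().getUserData().get("source");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```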
Expert: Updates a document by first updating the document(s) containing term with the given doc-values fields and then adding the new document. The doc-values update and the subsequent addition are atomic, as seen by a reader on the same index (a flush may happen only after the addition).
Expert: Atomically updates documents matching the provided term with the given doc-values fields and adds a block of documents with sequentially assigned document IDs, such that an external reader will see all or none of the documents.
Translates a frozen packet of delete term/query, or doc values updates, into their actual docIDs in the index, and applies the change. This is a heavy operation and is done concurrently by incoming indexing threads. This method will return immediately without blocking if another thread is currently applying the packet. In order to ensure the packet has been applied, IndexWriter.forceApply must be called.
Expert: attempts to delete by document ID, as long as the provided reader is a near-real-time reader (from DirectoryReader.open). If the provided reader is an NRT reader obtained from this writer, and its segment has not been merged away, then the delete succeeds and this method returns a valid (> 0) sequence number; else, it returns -1 and the caller must then separately delete by Term or Query.
Expert: attempts to update doc values by document ID, as long as the provided reader is a near-real-time reader (from DirectoryReader.open). If the provided reader is an NRT reader obtained from this writer, and its segment has not been merged away, then the update succeeds and this method returns a valid (> 0) sequence number; else, it returns -1 and the caller must then retry the update and resolve the document again. If a doc values field's data is null, the existing value is removed from all documents matching the term. This can be used to un-delete a soft-deleted document since this method will apply the field update even if the document is marked as deleted.
Updates a document's BinaryDocValues for field to the given value. You can only update fields that already exist in the index, not add new fields through this method. You can only update fields that were indexed only with doc values.
Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add).
Atomically deletes documents matching the provided delTerm and adds a block of documents with sequentially assigned document IDs, such that an external reader will see all or none of the documents.
Similar to .updateDocuments, but takes a query instead of a term to identify the documents to be updated.
Updates documents' DocValues fields to the given values. Each field update is applied, with the same value, to the set of documents associated with the Term. All updates are atomically applied and flushed together. If a doc values field's data is null, the existing value is removed from all documents matching the term.
Updates a document's NumericDocValues for field to the given value. You can only update fields that already exist in the index, not add new fields through this method. You can only update fields that were indexed only with doc values.
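A numeric doc-values update might look like the sketch below, which assumes a recent Lucene release (7.0 or later for the iterator-style `NumericDocValues` API). The class name and the field names `id` and `price` are hypothetical. Note that `price` is indexed with doc values only, as the description requires, while `id` serves purely as the lookup term.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class NumericDocValueUpdateSketch {
  // Indexes one document with price=10, updates it to 20, and reads the value back.
  static long priceAfterUpdate() {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer =
          new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        Document d = new Document();
        d.add(new StringField("id", "1", Field.Store.NO)); // lookup key
        d.add(new NumericDocValuesField("price", 10L)); // doc values only
        writer.addDocument(d);
        // Replaces the "price" value for every document matching id:1.
        writer.updateNumericDocValue(new Term("id", "1"), "price", 20L);
      } // close() commits the add and the update
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        NumericDocValues prices = reader.leaves().get(0).reader().getNumericDocValues("price");
        if (prices == null || !prices.advanceExact(0)) {
          throw new IllegalStateException("no price value for the first document");
        }
        return prices.longValue();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```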
Wait for any currently outstanding merges to finish.