Wiki source code of Confluence Import Process
Last modified by Vincent Massol on 2026/04/08 13:49
| author | version | line-number | content |
|---|---|---|---|
| |
1.1 | 1 | Confluence XML is implemented as an input filter stream. [[ConfluenceInputFilterStream>>https://github.com/xwiki-contrib/confluence/blob/master/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/internal/input/ConfluenceInputFilterStream.java]] is instantiated with [[input properties>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-xml/src/main/java/org/xwiki/filter/confluence/input/ConfluenceInputProperties.java]] describing what to import and how. |
| 2 | |||
| 3 | ConfluenceInputFilterStream then sets up [[ConfluenceXMLPackage>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/input/ConfluenceXMLPackage.java]], which extracts the Confluence package and indexes it. | ||
| 4 | |||
| 5 | ConfluenceXMLPackage is designed to handle huge export packages: | ||
| 6 | |||
| 7 | * the XML parsing is streamed | ||
| 8 | * instead of keeping everything in RAM, each object is written to its own Apache Commons Configuration properties file in a temporary directory (a database engine such as SQLite could probably be used for this and might be even more efficient) | ||
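The two points above can be illustrated with a minimal sketch. It is not the actual ConfluenceXMLPackage code: it assumes a simplified, hypothetical XML layout (##<object class="...">## elements containing an ##<id>## and ##<property name="...">## children), and it uses ##java.util.Properties## as a stand-in for Apache Commons Configuration. The class name ##StreamingIndexSketch## and the file naming scheme are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: stream an export and write one properties file per object.
public class StreamingIndexSketch {
    /** Returns the number of objects written to the given directory. */
    public static int index(String xml, Path dir) throws XMLStreamException, IOException {
        // StAX pulls one event at a time, so the whole document is never held in memory.
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        String type = null;
        Properties current = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("object".equals(name)) {
                    // Start collecting a new object (assumed layout, see lead-in).
                    type = reader.getAttributeValue(null, "class");
                    current = new Properties();
                } else if (current != null && "id".equals(name)) {
                    current.setProperty("id", reader.getElementText());
                } else if (current != null && "property".equals(name)) {
                    String key = reader.getAttributeValue(null, "name");
                    String value = reader.getElementText();
                    if (key != null) {
                        current.setProperty(key, value);
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                && "object".equals(reader.getLocalName()) && current != null) {
                // Flush the object to its own file, then forget it.
                Path file = dir.resolve(type + "-" + current.getProperty("id") + ".properties");
                try (OutputStream out = Files.newOutputStream(file)) {
                    current.store(out, null);
                }
                current = null;
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

With this approach, memory use is bounded by the size of the largest single object rather than by the size of the export.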
| 9 | |||
| 10 | ConfluenceInputFilterStream is built in the same spirit: it browses the contents of the package and sends them as a stream through the filter stream API. If the output filter is also built this way, which is normally the case, the whole process is a pipeline that can handle huge imports. | ||
| 11 | |||
| 12 | More precisely, ConfluenceInputFilterStream: | ||
| 13 | |||
| 14 | * Imports users | ||
| 15 | * Imports groups | ||
| 16 | * Browses spaces | ||
| 17 | |||
| 18 | For each space: | ||
| 19 | |||
| 20 | * imports the home page | ||
| 21 | * imports the orphans (pages that have no parent and are not the home page) | ||
| 22 | * imports the space blog | ||
| 23 | * imports the space templates | ||
| 24 | * imports permissions from the space permissions and the home page permissions | ||
| 25 | |||
| 26 | For each page: | ||
| 27 | |||
| 28 | * imports revisions | ||
| 29 | ** imports page metadata (dates, author, title, ...) | ||
| 30 | ** imports page permissions | ||
| |
4.3 | 31 | ** imports content by instantiating the corresponding syntax parser (see [[details about the Confluence XHTML Parser>>doc:xwiki:documentation.extensions.dev.confluence.xhtml-parsing.WebHome]]). A page can also be in the old Confluence syntax, or contain plain text that is not to be converted (for pages that are used as data storage) |
| |
1.1 | 32 | ** imports comments |
| 33 | ** imports attachments | ||
| 34 | ** imports labels as tags | ||
| 35 | * imports children (recursive operation) | ||
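The traversal order described above (home page first, then orphans, each page followed recursively by its children) can be sketched as follows. This is only an illustration under stated assumptions: the ##Page## class, the ##importSpace## method and the string events are hypothetical and do not come from ConfluenceInputFilterStream, which reads pages from the indexed package and sends filter events instead. Blog posts, templates and permissions are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the page import order within one space.
public class ImportOrderSketch {
    /** Minimal stand-in for a page: an id and an optional parent id (null when there is no parent). */
    public static final class Page {
        final long id;
        final Long parentId;

        public Page(long id, Long parentId) {
            this.id = id;
            this.parentId = parentId;
        }
    }

    /** Returns the pages in the order they would be imported. */
    public static List<String> importSpace(long homePageId, List<Page> pages) {
        Map<Long, List<Long>> children = new LinkedHashMap<>();
        List<Long> orphans = new ArrayList<>();
        for (Page page : pages) {
            if (page.parentId != null) {
                children.computeIfAbsent(page.parentId, k -> new ArrayList<>()).add(page.id);
            } else if (page.id != homePageId) {
                // No parent and not the home page: an orphan.
                orphans.add(page.id);
            }
        }
        List<String> events = new ArrayList<>();
        importPage(homePageId, children, events);
        for (long orphan : orphans) {
            importPage(orphan, children, events);
        }
        return events;
    }

    private static void importPage(long id, Map<Long, List<Long>> children, List<String> events) {
        // Revisions, metadata, permissions, content, comments, attachments and tags would be sent here.
        events.add("page " + id);
        // Then the children, recursively.
        for (long child : children.getOrDefault(id, Collections.<Long>emptyList())) {
            importPage(child, children, events);
        }
    }
}
```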
| 36 | |||
| |
3.1 | 37 | Finally, the Confluence package is closed and its property files are removed. By default the cleanup is synchronous, but a parameter can make it asynchronous so that the job can end without waiting for the cleanup to finish. Another parameter skips the cleanup altogether, which is mostly useful for debugging: it lets you inspect the extracted properties, and skip the parsing phase if you run several imports of the same package. |
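The three cleanup behaviours can be sketched as below. The names ##CleanupSketch##, ##Mode## and ##close## are hypothetical and chosen for illustration only; they are not the names of the actual input properties, which should be looked up in ConfluenceInputProperties.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

// Hypothetical sketch of synchronous, asynchronous and skipped cleanup.
public class CleanupSketch {
    public enum Mode { SYNC, ASYNC, NONE }

    /** Closes the package directory; the returned future completes when the cleanup is done. */
    public static CompletableFuture<Void> close(Path dir, Mode mode) {
        switch (mode) {
            case NONE:
                // Keep the extracted files, for inspection or reuse.
                return CompletableFuture.completedFuture(null);
            case ASYNC:
                // The caller can return immediately; deletion runs in the background.
                return CompletableFuture.runAsync(() -> deleteTree(dir));
            default:
                // SYNC: delete before returning.
                deleteTree(dir);
                return CompletableFuture.completedFuture(null);
        }
    }

    private static void deleteTree(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that files are deleted before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```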
| |
1.1 | 38 | |
| 39 | The starting point is the [[##read## method in ##ConfluenceInputFilterStream##>>https://github.com/xwiki-contrib/confluence/blob/ecd749e56f3671df9d08ca57912d49d4d38ff42a/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/internal/input/ConfluenceInputFilterStream.java#L268]]. This is where any new developer should probably go to start hacking on Confluence XML. |