tiling¶
Routines in support of tiled segmentation of very large rasters.
Main entry routine is doTiledShepherdSegmentation(). See that function for further details.
The broad idea is that the Shepherd segmentation algorithm, as implemented in the shepseg module, runs entirely in memory. For larger raster files, it is more efficient to divide the raster into tiles, segment each tile individually, and stitch the results together to create a segmentation of the whole raster.
The main caveats arise from the fact that the initial clustering is performed on a uniform subsample of the whole image, in order to give consistent segment boundaries at tile intersections. This means that for a larger raster, with a greater range of spectra, one may wish to increase the number of clusters in order to allow sufficient initial segments to characterize the variation.
Related to this, one may also consider reducing the percentile
used for automatic estimation of maxSpectralDiff (see
pyshepseg.shepseg.doShepherdSegmentation()
and
pyshepseg.shepseg.autoMaxSpectralDiff()
for further details).
Because of these caveats, one should be very cautious about segmenting something like a continental-scale image. There is a lot of spectral variation across an area like a whole continent, and it may be unwise to use all the same parameters for the whole area.
- exception pyshepseg.tiling.PyShepSegTilingError¶
- class pyshepseg.tiling.RatPage(numIntCols, numFloatCols, startSegId, numSeg)¶
Hold a single page of the paged RAT
- getIndexInPage(segId)¶
Return the index for the given segment, within the current page.
- getRatVal(segId, colType, colArrayNdx)¶
Get the RAT entry for the given segment.
- getSegmentComplete(segId)¶
Returns True if the segment has been flagged as complete
- pageComplete()¶
Return True if the current page has been completed
- setRatVal(segId, colType, colArrayNdx, val)¶
Set the RAT entry for the given segment, to be the given value.
- setSegmentComplete(segId)¶
Flag that the given segment has had all stats calculated.
- class pyshepseg.tiling.TileInfo¶
Class that holds the pixel coordinates of the tiles within an image.
- addTile(xpos, ypos, xsize, ysize, col, row)¶
Add a new tile to the set
- Parameters:
- xpos, yposint
Pixel column & row of top left pixel of tile
- xsize, ysizeint
Number of pixel columns & rows in tile
- col, rowint
Tile column & row
- getNumTiles()¶
Get total number of tiles in the set
- Returns:
- numTilesint
Total number of tiles
- getTile(col, row)¶
Return the position and shape of the requested tile, as a single tuple of values
- Parameters:
- col, rowint
Tile column & row
- Returns:
- xpos, yposint
Pixel column & row of top left pixel of tile
- xsize, ysizeint
Number of pixel columns & rows in tile
- class pyshepseg.tiling.TiledSegmentationResult¶
Result of tiled segmentation
- Attributes:
- maxSegIdshepseg.SegIdType
Largest segment ID used in final segment image
- numTileRowsint
Number of rows of tiles used
- numTileColsint
Number of columns of tiles used
- subsamplePcntfloat
Percentage of image subsampled for clustering
- maxSpectralDifffloat
The value used to limit segment merging (in all tiles)
- kmeanssklearn.cluster.KMeans
The sklearn KMeans object, after fitting
- hasEmptySegmentsbool
True if the segmentation contains segments with no pixels. This is an error condition, probably indicating that the merging of segments across tiles has produced inconsistent numbering. A warning message will also have been printed.
- outDs: gdal.Dataset
Open GDAL dataset object to the output file. May not be set - see the returnGDALDS parameter to doTiledShepherdSegmentation.
- pyshepseg.tiling.calcHistogramTiled(segfile, maxSegId, writeToRat=True)¶
Calculate a histogram of the given segment image file.
Note that we need this function because GDAL’s GetHistogram function does not seem to work when attempting a histogram with very large numbers of entries. We want an entry for every segment, rather than an approximate count for a range of segment values, and the number of segments is very large. So, we need to write our own routine.
It works in tiles across the image, so that it can process very large images in a memory-efficient way.
For a raster which can easily fit into memory, a histogram can be calculated directly using
pyshepseg.shepseg.makeSegSize()
.- Parameters:
- segfilestr or gdal.Dataset
Segmentation image file. Can be either the file name string, or an open Dataset object.
- maxSegIdshepseg.SegIdType
Maximum segment ID used
- writeToRatbool
If True, the completed histogram will be written to the image file’s raster attribute table. If segfile was given as a Dataset object, it would therefore need to have been opened with update access.
- Returns:
- histint ndarray (numSegments+1, )
Histogram counts for each segment (index is segment ID number)
- pyshepseg.tiling.checkForEmptySegments(outDs, maxSegId, overlapSize, writeToRat=False)¶
Check the final segmentation for any empty segments. These can be problematic later, and should be avoided. Prints a warning message if empty segments are found.
- Parameters:
- outDsgdal.Dataset
Open Dataset of output raster
- maxSegIdshepseg.SegIdType
Maximum segment ID used
- overlapSizeint
Number of pixels to use in overlaps between tiles
- Returns:
- hasEmptySegmentsbool
True if there are segment ID numbers with no pixels
- pyshepseg.tiling.createPagedRat()¶
Create the dictionary for the paged RAT. Each element is a page of the RAT, with entries for a range of segment IDs. The key is the segment ID of the first entry in the page.
The returned dictionary is initially empty.
- pyshepseg.tiling.crossesMidline(overlap, segLoc, orientation)¶
Check whether the given segment crosses the midline of the given overlap. If it does not, then it will lie entirely within exactly one tile, but if it does cross, then it will need to be re-coded across the midline.
- Parameters:
- overlapshepseg.SegIdType ndarray (overlapNrows, overlapNcols)
Array of segments just for this overlap region
- segLocshepseg.RowColArray
The row/col coordinates (within the overlap array) of the pixels for the segment of interest
- orientation{HORIZONTAL, VERTICAL}
Indicates the orientation of the midline
- Returns:
- crossesbool
True if the given segment crosses the midline
- pyshepseg.tiling.doTiledShepherdSegmentation(infile, outfile, tileSize=4096, overlapSize=1024, minSegmentSize=50, numClusters=60, bandNumbers=None, subsamplePcnt=None, maxSpectralDiff='auto', imgNullVal=None, fixedKMeansInit=False, fourConnected=True, verbose=False, simpleTileRecode=False, outputDriver='KEA', creationOptions=[], spectDistPcntile=50, kmeansObj=None, tempfilesDriver='KEA', tempfilesExt='kea', tempfilesCreationOptions=[], writeHistogram=True, returnGDALDS=False)¶
Run the Shepherd segmentation algorithm in a memory-efficient manner, suitable for large raster files. Runs the segmentation on separate (overlapping) tiles across the raster, then stitches these together into a single output segment raster.
The initial spectral clustering is performed on a sub-sample of the whole raster (using fitSpectralClustersWholeFile), to create consistent clusters. These are then used as seeds for all individual tiles. Note that subsamplePcnt is used at this stage, over the whole raster, and is not passed through to shepseg.doShepherdSegmentation() for any further sub-sampling.
Most of the arguments are passed through to shepseg.doShepherdSegmentation, and are described in the docstring for that function.
- Parameters:
- infilestr
Filename of input raster
- outfilestr
Filename of output segmentation raster
- tileSizeint
Desired width & height (in pixels) of the tiles (i.e. desired tiles have shape (tileSize, tileSize). Tiles on the right and bottom edges of the input image may end up slightly larger than tileSize to ensure there are no small tiles.
- overlapSizeint
Number of pixels to overlap tiles. The overlap area is a rectangle, this many pixels wide, which is covered by both adjacent tiles.
- minSegmentSizeint
Minimum number of pixels in a segment
- numClustersint
Number of clusters to request in k-means clustering
- bandNumberslist of int
The GDAL band numbers (i.e. start at 1) of the bands of input raster to use for segmentation
- subsamplePcntfloat or None
See fitSpectralClustersWholeFile()
- maxSpectralDifffloat or str
See shepseg.doShepherdSegmentation()
- spectDistPcntileint
See shepseg.doShepherdSegmentation()
- imgNullValfloat or None
If given, use this as the null value for the input raster. If None, use the value defined in the raster file
- fixedKMeansInitbool
If True, use a fixed set of initial cluster centres for the KMeans clustering. This is good to ensure exactly reproducible results
- fourConnectedbool
If True, use 4-way connectedness, otherwise use 8-way
- verbosebool
If True, print informative messages during processing (to stdout)
- simpleTileRecodebool
If True, use only a simple tile recoding procedure. See stitchTiles() for more detail
- outputDriverstr
The short name of the GDAL format driver to use for output file
- creationOptionslist of str
The GDAL output creation options to match the outputDriver
- kmeansObjsklearn.cluster.KMeans
See shepseg.doShepherdSegmentation() for details
- tempfilesDriverstr
Short name of GDAL driver to use for temporary raster files
- tempfilesExtstr
File extension to use for temporary raster files
- tempfilesCreationOptionslist of str
GDAL creation options to use for temporary raster files
- writeHistogram: bool
Whether to write histogram to file
- returnGDALDS: bool
Whether to set the outDs member of TiledSegmentationResult when returning. If set, this will be open in update mode.
- Returns:
- tileSegResultTiledSegmentationResult
- pyshepseg.tiling.doTiledShepherdSegmentation_doOne(inDs, filename, tileInfo, col, row, bandNumbers, imgNullVal, kmeansObj, minSegmentSize=50, maxSpectralDiff='auto', verbose=False, spectDistPcntile=50, fourConnected=True, tempfilesDriver='KEA', tempfilesCreationOptions=[])¶
Called from doTiledShepherdSegmentation(). Does a single tile, and is split out here as a separate function so it can be called from in parallel with other tiles if desired.
tileInfo is object returned from doTiledShepherdSegmentation_prepare() and col, row describe the tile that this call will process.
See doTiledShepherdSegmentation() for detailed descriptions of other parameters.
Return value is that from shepseg.doShepherdSegmentation().
- pyshepseg.tiling.doTiledShepherdSegmentation_finalize(inDs, outfile, tileFilenames, tileInfo, overlapSize, tempDir, simpleTileRecode=False, outputDriver='KEA', creationOptions=[], verbose=False, writeHistogram=True)¶
Do the stitching of tiles and check for empty segments. Call after every doTiledShepherdSegmentation_doOne() has completed for a given tiled segmentation.
See doTiledShepherdSegmentation() for detailed descriptions of parameters.
Returns a tuple with (maxSegId, hasEmptySegments, outDs).
- pyshepseg.tiling.doTiledShepherdSegmentation_prepare(infile, tileSize=4096, overlapSize=1024, bandNumbers=None, imgNullVal=None, kmeansObj=None, verbose=False, numClusters=60, subsamplePcnt=None, fixedKMeansInit=False)¶
Do all the preparation for the tiled segmentation. Call this first if creating a parallel implementation, then call doTiledShepherdSegmentation_doOne() for each tile in the returned TileInfo object.
See doTiledShepherdSegmentation() for detailed parameter descriptions.
- Parameters:
- Returns a tuple with: (datasetObj, bandNumbers, kmeansObj, subsamplePcnt,
- imgNullVal, tileInfo)
- pyshepseg.tiling.fitSpectralClustersWholeFile(inDs, bandNumbers, numClusters=60, subsamplePcnt=None, imgNullVal=None, fixedKMeansInit=False)¶
Given an open raster Dataset, read a selected sample of pixels and use these to fit a spectral cluster model. Uses GDAL to read the pixels, and shepseg.fitSpectralClusters() to do the fitting.
- Parameters:
- inDsgdal.Dataset
Open GDAL Dataset object for the input raster
- bandNumberslist of int (or None)
List of GDAL band numbers for the bands of interest. If None, then use all bands in the dataset. Note that GDAL band numbers start at 1.
- numClustersint
Desired number of clusters
- subsamplePcntfloat or None
Percentage of pixels to use in fitting. If it is None, then a suitable subsample is calculated such that around one million pixels are sampled. (Note - this would include null pixels, so if the image is dominated by nulls, this would undersample.) No further subsampling is carried out by fitSpectralClusters().
- imgNullValfloat or None
Pixels with this value in the input raster are ignored. If None, the NoDataValue from the raster file is used
- fixedKMeansInitbool
If True, then use a fixed estimate for the initial KMeans cluster centres. See shepseg.fitSpectralClusters() for details.
- Returns:
- kmeansObjsklearn.cluster.KMeans
The fitted KMeans object
- subsamplePcntfloat
The subsample percentage actually used
- imgNullValfloat
The image null value (possibly read from the file)
- pyshepseg.tiling.getImgNullValue(inDs, bandNumbers)¶
Return the null value for the given dataset
- Parameters:
- inDsgdal.Dataset
Open input Dataset
- bandNumberslist of int
GDAL band numbers of interest
- Returns:
- imgNullValfloat or None
Null value from input raster, None if there is no null value
- Raises:
- PyShepSegTilingError
If not all bands have the same null value
- pyshepseg.tiling.getRatPageId(segId)¶
For the given segment ID, return the page ID. This is the segment ID of the first segment in the page.
- pyshepseg.tiling.getTilesForFile(ds, tileSize, overlapSize)¶
Return a TileInfo object for a given file and input parameters.
- Parameters:
- dsgdal.Dataset
Open GDAL Dataset object for raster to be tiles
- tileSizeint
Size of tiles, in pixels. Individual tiles may end up being larger in either direction, when they meet the edge of the raster, to ensure we do not use very small tiles
- overlapSizeint
Number of pixels by which tiles will overlap
- Returns:
- tileInfoTileInfo
TileInfo object detailing the sizes and positions of all tiles across the raster
- pyshepseg.tiling.overlapFilename(col, row, edge, tempDir)¶
Return the temporary filename used for the overlap array
- Parameters:
- col, rowint
Tile column & row numbers
- edge{right’, ‘bottom’}
Indicates from which edge of the given tile the overlap is taken
- Returns:
- filenamestr
Temp numpy array filename for the overlap
- pyshepseg.tiling.readSubsampledImageBand(bandObj, subsampleProp)¶
Read in a sub-sampled copy of the whole of the given band.
Note that one can, in principle, do this directly using GDAL. However, if overview layers are present in the file, it will use these, and so is dependent on how these were created. Since these are often created just for display purposes, past experience has shown that they are not always to be trusted as data, so we have chosen to always go directly to the full resolution image.
- Parameters:
- bandObjgdal.Band
An open Band object for input
- subsamplePropfloat
The proportion by which to sub-sample (i.e. a value between zero and 1, applied to rows and columns separately)
- Returns:
- imgSub<dtype> ndarray (nRowsSub, nColsSub)
A numpy array of the image subsample, equivalent to calling gdal.Band.ReadAsArray()
Work out a mapping which recodes segment ID numbers from the tile in tileData. Segments to be recoded are those which are in the overlap with an earlier tile, and which cross the midline of the overlap, which is where the stitchline between the tiles will fall.
Updates recodeDict, which is a dictionary keyed on the existing segment ID numbers, where the value of each entry is the segment ID number from the earlier tile, to be used to recode the segment in the current tile.
overlapA and overlapB are numpy arrays of pixels in the overlap region in question, giving the segment ID numbers in the two tiles. The values in overlapB are from the earlier tile, and those in overlapA are from the current tile.
It is critically important that the overlapping region is either at the top or the left of the current tile, as this means that the row and column numbers of pixels in the overlap arrays match the same pixels in the full tile. This cannot be used for overlaps on the right or bottom of the current tile.
- Parameters:
- tileDatashepseg.SegIdType ndarray (tileNrows, tileNcols)
Tile subset of segment ID image
- overlapA, overlapBshepseg.SegIdType ndarray (overlapNrows, overlapNcols)
Tile overlap subsets of segment ID image
- orientation{HORIZONTAL, VERTICAL}
The orientation parameter defines whether we are dealing with overlap at the top (orientation == HORIZONTAL) or the left (orientation == VERTICAL).
- recodeDictdict
Keys and values are both segment ID numbers. Defines the mapping which recodes segment IDs. Updated in place.
- pyshepseg.tiling.recodeTile(tileData, maxSegId, tileRow, tileCol, overlapSize, tempDir, top, bottom, left, right)¶
Adjust the segment ID numbers in the current tile, to make them globally unique (and contiguous) across the whole mosaic.
Make use of the overlapping regions of tiles above and left, to identify shared segments, and recode those to segment IDs from the adjacent tiles (i.e. we change the current tile, not the adjacent ones). Non-shared segments are increased so they are larger than previous values.
- Parameters:
- tileDatashepseg.SegIdType ndarray (tileNrows, tileNcols)
The array of segment IDs for a single image tile
- maxSegIdshepseg.SegIdType
The current maximum segment ID for all preceding tiles.
- tileRow, tileColint
The row/col numbers of this tile, within the whole-mosaic tile numbering scheme. (These are not pixel numbers, but tile grid numbers)
- overlapSizeint
Number of pixels in tile overlap
- tempDirstr
Name of directory for temporary files
- top, bottom, left, rightint
Pixel coordinates within tile of the non-overlap region of the tile.
- Returns:
- newTileDatashepseg.SegIdType ndarray (tileNrows, tileNcols)
A copy of tileData, with new segment ID numbers.
- pyshepseg.tiling.relabelSegments(tileData, recodeDict, maxSegId, top, bottom, left, right)¶
Recode the segment IDs in the given tileData array.
For segment IDs which are keys in recodeDict, these are replaced with the corresponding entry. For all other segment IDs, they are replaced with sequentially increasing ID numbers, starting from one more than the previous maximum segment ID (maxSegId).
A re-coded copy of tileData is created, the original is unchanged.
- Parameters:
- tileDatashepseg.SegIdType ndarray (tileNrows, tileNcols)
Segment IDs of tile
- recodeDictdict
Keys and values are segment ID numbers. Defines mapping for segment relabelling
- maxSegIdshepseg.SegIdType
Maximum segment ID number
- top, bottom, left, rightint
Pixel coordinates within tile of the non-overlap region of the tile.
- Returns:
- newTileDatashepseg.SegIdType ndarray (tileNrows, tileNcols)
Segment IDs of tile, after relabelling
- newMaxSegIdshepseg.SegIdType
New maximum segment ID after relabelling
- pyshepseg.tiling.stitchTiles(inDs, outfile, tileFilenames, tileInfo, overlapSize, tempDir, simpleTileRecode, outputDriver, creationOptions, verbose)¶
Recombine individual tiles into a single segment raster output file. Segment ID values are recoded to be unique across the whole raster, and contiguous.
- Parameters:
- inDsgdal.Dataset
Open Dataset of input raster
- outfilestr
Filename of the final output raster
- tileFilenamesdict
Dictionary of the individual tile filenames, keyed by a tuple of (col, row) defining which tile it is.
- tileInfoTileInfo
Positions and sizes of all tiles across the raster. As returned by getTilesForFile().
- overlapSizeint
The number of pixels in the overlap between tiles.
- tempDirstr
Name of directory for temporary files
- simpleTileRecodebool
If True, a simpler method will be used to recode segment IDs, using just a block offset to shift ID numbers. This is useful when testing, but leaves non-contiguous segment numbers. If False, then a more complicated method is used which recodes and merges segments which cross the boundary between tiles (this is the intended normal behaviour).
- outputDriverstr
Short name string of the GDAL driver to use for the output file.
- creationOptionslist of str
GDAL creation options for output driver
- verbosebool
If True, print informative messages to stdout
- Returns:
- maxSegIdshepseg.SegIdType
The maximum segment ID used.
- outDs: gdal.Dataset
Open dataset of the output file
- pyshepseg.tiling.updateCounts(tileData, hist)¶
Fast function to increment counts for each segment ID in the given tile