tilingstats¶
Routines to support calculation of statistics on large rasters. The statistics are calculated per segment so require two input rasters - the segmentation output, and another image to gather statistics from, for each segment in the first image.
These are optimised to work on one tile of the images at a time so should be efficient in terms of memory use.
The main functions in this module are calcPerSegmentStatsTiled()
and calcPerSegmentSpatialStatsTiled()
.
- exception pyshepseg.tilingstats.PyShepSegStatsError¶
- class pyshepseg.tilingstats.SegPoint(x, y, val)¶
Class for handling a given data point and it’s location in pixel space (within the whole image, not a tile).
Used so that all the data for a given segment can be collected together even if the segment straddles a tile.
- class pyshepseg.tilingstats.SegmentStats(segmentHistDict, missingStatsValue)¶
Manage statistics for a single segment
- getPercentile(percentile)¶
Return the pixel value for the given percentile, e.g. getPercentile(50) would return the median value of the segment
- getStat(statID, param)¶
Return the requested statistic
- pyshepseg.tilingstats.accumulateSegDict(segDict, noDataDict, imgNullVal, tileSegments, tileImageData)¶
Accumulate per-segment histogram counts for all pixels in the given tile. Updates segDict entries in-place.
- Parameters:
- segDictnumba.typed.Dict
Dictionary of segments keyed on segment id. Values are histograms for the segment
- noDataDictnumba.typed.Dict
Dictionary of nodata values for each segment.
- imgNullValint
No data value for image
- tileSegmentsint ndarray of shape (nRows, nCols)
Contains the tile of segments currently being processed.
- tileImageDataint ndarray of shape (nRows, nCols)
Contains the tile of the image data currently being processed.
- pyshepseg.tilingstats.accumulateSegSpatial(segDict, noDataDict, imgNullVal, tileSegments, tileImageData, topLine, leftPix)¶
Accumulates data for each segment for the given tile. The data is put into segDict.
- Parameters:
- segDictnumba.typed.Dict
Dictionary of segments keyed on segment id. Values are a list of SegPoint objects.
- noDataDictnumba.typed.Dict
Dictionary of nodata values for each segment.
- imgNullValint
No data value for image
- tileSegmentsint ndarray of shape (nRows, nCols)
Contains the tile of segments currently being processed.
- tileImageDataint ndarray of shape (nRows, nCols)
Contains the tile of the image data currently being processed.
- topLineint
row of the top of this tile in the full image
- leftPixint
col of the left of this tile in the full image
- pyshepseg.tilingstats.calcPerSegmentSpatialStatsRIOS(imgfile, imgbandnum, segfile, colNamesAndTypes, userFunc, userParam=None, concurrencyStyle=None, missingStatsValue=-9999)¶
Similar to the
calcPerSegmentStatsTiledRIOS()
function but allows the user to calculate spatial statistics on the data for each segment. This is done by recording the location and value for each pixel within the segment. Once all the pixels are found for a segment theuserFunc
is called with the following parameters:pts, imgNullVal, intArr, floatArr, userParam
where
pts
is List ofSegPoint
objects. If 2D Numpy tile is prefered theuserFunc
can callconvertPtsInto2DArray()
.intArray
is a 1D numpy array which all the integer output values are to be put (in the same order given incolNamesAndTypes
).floatArr
is a 1D numpy array which all the floating point output values are to be put (in the same order given incolNamesAndTypes
). These arrays are initialised withmissingStatsValue
so the function can skip any that it doesn’t have values for.userParam
is the same value passed to this function and needs to be a type understood by Numba.This function uses RIOS to perform the reading so it works in a memory-efficient way. Also, the number of read workers can be varied by passing an instance of rios.applier.ConcurrencyStyle which may be helpful when reading from data sources with high latency (ie S3). RIOS timeouts can also be changed using this method. Note that only 0 compute workers is supported and computeWorkerKind must be set to CW_NONE.
- Parameters:
- imgfilestring
Path to input file for collecting statistics from
- imgbandnumint
1-based index of the band number in imgfile to use for collecting stats
- segfilestr or gdal.Dataset
Path to segmented file or an open GDAL Dataset. Will collect stats in imgfile for each segment in this file.
- colNamesAndTypeslist of
(colName, colType)
tuples This defines the names, types and order of the output RAT columns.
colName
should be a string containing the name of the RAT column to be created andcolType
should be one ofgdal.GFT_Integer
orgdal.GFT_Real
and this controls the type of the column to be created. Note that the order of columns given in this parameter is important as this dicates the order of theintArray
andfloatArr
parameters touserFunc
.- userFunca Numba function (ie decorated with @jit or @njit).
See above for description
- userParamanything that can be passed to a Numba function
This includes: arrays, scalars and @jitclass decorated classes.
- concurrencyStylerios.applier.ConcurrencyStyle
Concurrency parameters for RIOS
- missingStatsValueint
The value to fill in for segments that have no data.
- pyshepseg.tilingstats.calcPerSegmentSpatialStatsTiled(imgfile, imgbandnum, segfile, colNamesAndTypes, userFunc, userParam=None, missingStatsValue=-9999)¶
Similar to the
calcPerSegmentStatsTiled()
function but allows the user to calculate spatial statistics on the data for each segment. This is done by recording the location and value for each pixel within the segment. Once all the pixels are found for a segment theuserFunc
is called with the following parameters:pts, imgNullVal, intArr, floatArr, userParam
where
pts
is List ofSegPoint
objects. If 2D Numpy tile is prefered theuserFunc
can callconvertPtsInto2DArray()
.intArray
is a 1D numpy array which all the integer output values are to be put (in the same order given incolNamesAndTypes
).floatArr
is a 1D numpy array which all the floating point output values are to be put (in the same order given incolNamesAndTypes
). These arrays are initialised withmissingStatsValue
so the function can skip any that it doesn’t have values for.userParam
is the same value passed to this function and needs to be a type understood by Numba.- Parameters:
- imgfilestring
Path to input file for collecting statistics from
- imgbandnumint
1-based index of the band number in imgfile to use for collecting stats
- segfilestr or gdal.Dataset
Path to segmented file or an open GDAL Dataset. Will collect stats in imgfile for each segment in this file.
- colNamesAndTypeslist of
(colName, colType)
tuples This defines the names, types and order of the output RAT columns.
colName
should be a string containing the name of the RAT column to be created andcolType
should be one ofgdal.GFT_Integer
orgdal.GFT_Real
and this controls the type of the column to be created. Note that the order of columns given in this parameter is important as this dicates the order of theintArray
andfloatArr
parameters touserFunc
.- userFunca Numba function (ie decorated with @jit or @njit).
See above for description
- userParamanything that can be passed to a Numba function
This includes: arrays, scalars and @jitclass decorated classes.
- missingStatsValueint
The value to fill in for segments that have no data.
- pyshepseg.tilingstats.calcPerSegmentSpatialStats_riosFunc(info, inputs, outputs, otherArgs)¶
Called by RIOS from inside calcPerSegmentSpatialStatsRIOS. Do accumulation of statistics
- pyshepseg.tilingstats.calcPerSegmentStatsRIOS(imgfile, imgbandnum, segfile, statsSelection, concurrencyStyle=None, missingStatsValue=-9999)¶
Calculate selected per-segment statistics for the given band of the imgfile, against the given segment raster file. Calculated statistics are written to the segfile raster attribute table (RAT), so this file format must support RATs.
This function uses RIOS to perform the reading so it works in a memory-efficient way. Also, the number of read workers can be varied by passing an instance of rios.applier.ConcurrencyStyle which may be helpful when reading from data sources with high latency (ie S3). RIOS timeouts can also be changed using this method. Note that only 0 compute workers is supported and computeWorkerKind must be set to CW_NONE.
- Parameters:
- imgfilestring
Path to input file for collecting statistics from
- imgbandnumint
1-based index of the band number in imgfile to use for collecting stats
- segfilestr
Path to segmented file. Will collect stats in imgfile for each segment in this file.
- statsSelectionlist of tuples.
One tuple for each statistic to be included. Each tuple is either 2 or 3 elements:
(columnName, statName)
or
(columnName, statName, parameter)
The columnName is a string, used to name the column in the output RAT. The statName is a string used to identify which statistic is to be calculated. Available options are:
'min', 'max', 'mean', 'stddev', 'median', 'mode', 'percentile', 'pixcount'
The ‘percentile’ statistic requires the 3-element form, with the 3rd element being the percentile to be calculated.
For example:
[('Band1_Mean', 'mean'), ('Band1_stdDev', 'stddev'), ('Band1_LQ', 'percentile', 25), ('Band1_UQ', 'percentile', 75)]
would create 4 columns, for the per-segment mean and standard deviation of the given band, and the lower and upper quartiles, with corresponding column names.
Any pixels that are set to the nodata value of imgfile (if set) are ignored in the stats calculations. If there are no pixels that aren’t the nodata value then the value passed in as missingStatsValue is put into the RAT for the requested statistics. The ‘pixcount’ statName can be used to find the number of valid pixels (not nodata) that were used to calculate the statistics.
- concurrencyStylerios.applier.ConcurrencyStyle
Concurrency parameters for RIOS
- missingStatsValueint or float
What to set for segments that have no valid pixels in imgile
- pyshepseg.tilingstats.calcPerSegmentStatsTiled(imgfile, imgbandnum, segfile, statsSelection, missingStatsValue=-9999)¶
Calculate selected per-segment statistics for the given band of the imgfile, against the given segment raster file. Calculated statistics are written to the segfile raster attribute table (RAT), so this file format must support RATs.
Calculations are carried out in a memory-efficient way, allowing very large rasters to be processed. Raster data is handled in small tiles, attribute table is handled in fixed-size chunks.
- Parameters:
- imgfilestring
Path to input file for collecting statistics from
- imgbandnumint
1-based index of the band number in imgfile to use for collecting stats
- segfilestr or gdal.Dataset
Path to segmented file or an open GDAL dataset. Will collect stats in imgfile for each segment in this file.
- statsSelectionlist of tuples.
One tuple for each statistic to be included. Each tuple is either 2 or 3 elements:
(columnName, statName)
or
(columnName, statName, parameter)
The columnName is a string, used to name the column in the output RAT. The statName is a string used to identify which statistic is to be calculated. Available options are:
'min', 'max', 'mean', 'stddev', 'median', 'mode', 'percentile', 'pixcount'
The ‘percentile’ statistic requires the 3-element form, with the 3rd element being the percentile to be calculated.
For example:
[('Band1_Mean', 'mean'), ('Band1_stdDev', 'stddev'), ('Band1_LQ', 'percentile', 25), ('Band1_UQ', 'percentile', 75)]
would create 4 columns, for the per-segment mean and standard deviation of the given band, and the lower and upper quartiles, with corresponding column names.
Any pixels that are set to the nodata value of imgfile (if set) are ignored in the stats calculations. If there are no pixels that aren’t the nodata value then the value passed in as missingStatsValue is put into the RAT for the requested statistics. The ‘pixcount’ statName can be used to find the number of valid pixels (not nodata) that were used to calculate the statistics.
- missingStatsValueint or float
What to set for segments that have no valid pixels in imgile
- pyshepseg.tilingstats.calcPerSegmentStats_riosFunc(info, inputs, outputs, otherArgs)¶
Called by RIOS from inside calcPerSegmentStatsRIOS. Do accumulation of statistics
- pyshepseg.tilingstats.calcStatsForCompletedSegs(segDict, noDataDict, missingStatsValue, pagedRat, statsSelection_fast, segSize, numIntCols, numFloatCols)¶
Calculate statistics for all complete segments in the segDict. Update the pagedRat with the resulting entries. Completed segments are then removed from segDict.
- Parameters:
- segDictnumba.typed.Dict
Dictionary of segments keyed on segment id. Values are histograms for the segment
- noDataDictnumba.typed.Dict
Dictionary of nodata values for each segment.
- missingStatsValueint
value to insert into the RAT where no valid pixels were found
- pagedRatnumba.typed.Dict
The RAT as a paged data structure
- statsSelection_fastint ndarray of shape (numStats, 5)
Allows quick access to the types of stats required
- segSizeint indarray of shape (numSegments+1, )
Array containing the histograms of the segment file
- numIntColsint
Number of Integer RAT cols to be created
- numFloatColsint
Number of Float RAT cols to be created
- pyshepseg.tilingstats.calcStatsForCompletedSegsSpatial(segDict, noDataDict, missingStatsValue, pagedRat, segSize, userFunc, userParam, statsSelection_fast, intArr, floatArr, imgNullVal)¶
Calls the
userFunc
on data for completed segs and saves the results into thepagedRat
.- Parameters:
- segDictnumba.typed.Dict
Dictionary of segments keyed on segment id. Values are a numba.typed.List containing instances of SegPoint.
- noDataDictnumba.typed.Dict
Dictionary of nodata values for each segment.
- missingStatsValueint
Value to fill the intArr and floatArr parameters with so this valu gets written to the RAT if no valid pixels found
- pagedRatnumba.typed.Dict
The RAT as a paged data structure
- segSizeint indarray of shape (numSegments+1, )
Array containing the histograms of the segment file
- userFuncNumba function
This is the user defined function to call for each completed segment
- userParamNumba compatible argument
This is passed to the userFunc
- statsSelection_fastint ndarray of shape (numStats, 5)
Allows quick access to the types of stats required
- intArrint indarray of shape (numIntCols,)
Spaced used for storing the integer outputs for this segment
- floatArrfloat indarray of shape (numFloatCols,)
Spaced used for storing the float outputs for this segment
- imgNullValint
No data value for image
- pyshepseg.tilingstats.checkHistColumn(existingColNames)¶
Check for the Histogram column in the attribute table. Return its column number, and raise an exception if it is not present
- Parameters:
- existingColNameslist of strings
The existing column names in the segment file
- Returns:
- histColNdxint
The index of the histogram column
- pyshepseg.tilingstats.checkSegComplete(segDict, noDataDict, segSize, segId)¶
Return True if the given segment has a complete entry in the segDict, meaning that the pixel count is equal to the segment size
- Parameters:
- segDictnumba.typed.Dict
Dictionary of segments keyed on segment id. Values are histograms for the segment
- noDataDictnumba.typed.Dict
Dictionary of nodata values for each segment.
- segSizeint indarray of shape (numSegments+1, )
Array containing the histograms of the segment file
- segIdshepseg.SegIdType
Segment to check for completeness
- Returns:
- completebool
Whether the segId is complete or not
- pyshepseg.tilingstats.checkSegCompleteSpatial(segDict, noDataDict, segSize, segId)¶
Return True if the given segment has a complete entry in the segDict, meaning that the pixel count is equal to the segment size
Note: this is distinct from checkSegComplete() as that function is working with a dictionary of histograms
- Parameters:
- segDictnumba.typed.Dict
Dictionary of segments keyed on segment id. Values are a list of SegPoint objects.
- noDataDictnumba.typed.Dict
Dictionary of nodata values for each segment.
- segSizeint indarray of shape (numSegments+1, )
Array containing the histograms of the segment file
- segIdshepseg.SegIdType
Segment to check for completeness
- Returns:
- completebool
Whether the segId is complete or not
- pyshepseg.tilingstats.convertPtsInto2DArray(pts, imgNullVal)¶
Given a list of points for a segment turn this back into a 2D array where the value for each pixel is the value of the pixel in the original (data) image.
The tile is created just large enough for the shape of the segment. Areas of the tile not within a segment is given the value of
imgNullVal
.- Parameters:
- ptsnumba.typed.List containing SegPoint objects
This is the list passed to the userFunc.
- imgNullValint
No data value for image
- Returns:
- tiletiling.numbaTypeForImageType ndarray of shape (ysize, xsize)
Where ysize and xsize are the total extent of the tile
- pyshepseg.tilingstats.convertPtsInto2DMaskArray(pts, imgNullVal)¶
Similar to
convertPtsInto2DArray()
but burns in the value 1 where each pixel is within a segment.Given a list of points for a segment turn this back into a 2D array for passing to the user function.
The tile is created just large enough for the shape of the segment. Areas of the tile not within a segment is given the value of
imgNullVal
.- Parameters:
- ptsnumba.typed.List containing SegPoint objects
This is the list passed to the userFunc.
- imgNullValint
No data value for image
- Returns:
- tilenumpy.uint8 ndarray of shape (ysize, xsize)
Where ysize and xsize are the total extent of the tile
- pyshepseg.tilingstats.createNoDataDict()¶
Create the dictionary that holds counts of nodata seen for each segment. The key is the segId, value is the count of nodata seen for that segment in the image data.
- Returns:
- nodataDictnumba.typed.Dict
Dictionary of nodata counts for each segment
- pyshepseg.tilingstats.createSegDict()¶
Create the Dict of Dicts for handling per-segment histograms. Each entry is a dictionary, and the key is a segment ID. Each dictionary within this is the per-segment histogram for a single segment. Each of its entries is for a single value from the imagery, the key is the pixel value, and the dictionary value is the number of times that pixel value appears in the segment.
- Returns:
- segDictnumba.typed.Dict
Dictionary of histograms used for calculating statistics.
- pyshepseg.tilingstats.createSegSpatialDataDict()¶
Create a dictionary where the key is the segment ID and the value is a List of
SegPoint
objects.- Returns:
- segDictnumba.typed.Dict
Dictionary containing lists of
SegPoint
objects.
- pyshepseg.tilingstats.createStatColumns(statsSelection, attrTbl, existingColNames)¶
Create requested statistic columns on the segmentation image RAT. Statistic columns are of type gdal.GFT_Real for mean and stddev, and gdal.GFT_Integer for all other statistics.
Return the column indexes for all requested columns, in the same order.
- Parameters:
- statsSelectionlist of tuples
Same as passed to
calcPerSegmentStatsTiled()
- attrTblgdal.RasterAttributeTable
The Raster Attribute Table object for the file
- existingColNameslist of strings
A list of the existing column names
- Returns:
- colIndexListlist of ints
A list of the indexes of each of the requested new columns in the same order as statsSelection.
- pyshepseg.tilingstats.createUserColumnsSpatial(colNamesAndTypes, attrTbl, existingColNames)¶
Used by
calcPerSegmentSpatialStatsTiled()
to create columns specified in thecolNamesAndTypes
structure.Returns a tuple with number of integer columns, number of float columns and a statsSelection_fast array for use by
calcStatsForCompletedSegsSpatial()
.- Parameters:
- colNamesAndTypeslist of (colName, colType) tuples
Same as passed to
calcPerSegmentSpatialStatsTiled()
.- attrTblgdal.RasterAttributeTable
The Raster Attribute Table object for the file
- existingColNameslist of strings
A list of the existing column names
- Returns:
- numIntColsint
The number of new Integer columns
- numFloatColsint
The number of new Float columns
- statsSelection_fastint ndarray of shape (numStats, 5)
Allows quick access to the types of stats required
- pyshepseg.tilingstats.doImageAlignmentChecks(segfile, imgfile, imgbandnum)¶
Do the checks that the segment file and image file that is being used to collect the stats actually align. We refuse to process the files if they don’t as it is not clear how they should be made to line up - this is up to the user to get right. Also checks that imgfile is not a float image.
- Parameters:
- segfilestr or gdal.Dataset
Path to segmented file or an open GDAL dataset.
- imgfilestring
Path to input file for collecting statistics from
- imgbandnumint
1-based index of the band number in imgfile to use for collecting stats
- Returns:
- segds: gdal.Dataset
Opened GDAL datset for the segments file
- segband: gdal.Band
First Band of the segds
- imgds: gdal.Dataset
Opened GDAL dataset for the image data file
- imgband: gdal.Band
Requested band for the imgds
- pyshepseg.tilingstats.equalProjection(proj1, proj2)¶
Returns True if the proj1 is the same as proj2
Stolen from rios/pixelgrid.py
- Parameters:
- proj1string
WKT string for the first projection
- proj2string
WKT string for the second projection
- Returns:
- equalbool
Whether the projections are equal or not
- pyshepseg.tilingstats.getSortedKeysAndValuesForDict(d)¶
The given dictionary is keyed by pixel values from the imagery, and the values are counts of occurences of the corresponding pixel value. This function returns a pair of numpy arrays (as a tuple), one for the list of pixel values, and one for the corresponding counts. The arrays are sorted in increasing order of pixel value.
- Parameters:
- ddictionary of int
Counts of each pixel value.
- Returns:
- keysSortedtiling.numbaTypeForImageType ndarray of shape (numValues,)
Pixel values sorted
- valuesSortedint ndarray of shape (numValues,)
Counts of each pixel sorted by pixel value
- pyshepseg.tilingstats.makeFastStatsSelection(colIndexList, statsSelection)¶
Make a fast version of the statsSelection data structure, combined with the global column index numbers.
Return a tuple of
(statsSelection_fast, numIntCols, numFloatCols)
The statsSelection_fast is a single array, of shape (numStats, 5). The first index corresponds to the sequence in statsSelection. The second index corresponds to the STATSEL_* values.
Everything is encoded as an integer value in a single numpy array, suitable for fast access within numba njit-ed functions.
This is all a bit ugly and un-pythonic. Not sure if there is a better way.
- Parameters:
- colIndexListlist of ints
The column indexes for all requested columns
- statsSelectionlist of tuples
See
tilingstats.calcPerSegmentStatsTiled()
for a complete description of this parameter.
- Returns:
- statsSelection_fastint ndarray of shape (numStats, 5)
The statsSelection_fast structure
- intCountint
Number of int columns
- floatCountint
Number of float columns
- pyshepseg.tilingstats.userFuncMeanCoord(pts, imgNullVal, intArr, floatArr, transform)¶
Calculates the mean coordinate of each segment. This function is intended to be passed in as the
userFunc
parameter tocalcPerSegmentSpatialStatsTiled()
if the mean coordinates of each segment are required.The
transform
of the image (ie. returned by ds.GetGeoTransform() where ds is a GDAL dataset) is to be passed asuserParam
but converted to an array so it works with Numba.The result will be written to floatArr (2 values - easting then northing). It is expected that at least 2 float columns are available.
- Parameters:
- ptsnumba.typed.List containing SegPoint objects
This is the list passed to the userFunc.
- imgNullValint
The nodata value for the imagery
- intArrint ndarray of shape (numIntCols, )
The integer columns - not used in this function
- floatArrint ndarray of shape (numFloatCols, )
The float columns - output written here
- transformfloat ndarray of shape (6,)
The GDAL transform array. Passed in as a
userParam
totilingstats.calcPerSegmentSpatialStatsTiled()
.
- pyshepseg.tilingstats.userFuncNumEdgePixels(pts, imgNullVal, intArr, floatArr, fourConnected)¶
Calculates the number of ‘edge’ pixels for each segment. Edge pixels are pixels that touch either another segment or the edge of the image. This function is intended to be passed in as the
userFunc
parameter tocalcPerSegmentSpatialStatsTiled()
if the number of edge pixels are required.The result will be written to intArr (1 value - number of edge pixels).
- Parameters:
- ptsnumba.typed.List containing SegPoint objects
This is the list passed to the userFunc.
- imgNullValint
The nodata value for the imagery
- intArrint ndarray of shape (numIntCols, )
The integer columns - output written here
- floatArrint ndarray of shape (numFloatCols, )
The float columns - not used in this function
- fourConnectedbool
If True, use four-way connectedness to judge neighbours, otherwise use eight-way.
- pyshepseg.tilingstats.userFuncVariogram(pts, imgNullVal, intArr, floatArr, maxDist)¶
Calculates the variogram at the given distance for the segment contained in the tile. This function is intended to be passed in as the
userFunc
parameter tocalcPerSegmentSpatialStatsTiled()
if variograms are to be calculated.maxDist
is the number of variograms to calculate and should be passed in as theuserParam
argument tocalcPerSegmentSpatialStatsTiled()
.It is assumed that floatArr has enough space for the number of variograms requested (this is calculated from the
colNamesAndTypes
parameter tocalcPerSegmentSpatialStatsTiled()
).- Parameters:
- ptsnumba.typed.List containing SegPoint objects
This is the list passed to the userFunc.
- imgNullValint
The nodata value for the imagery
- intArrint ndarray of shape (numIntCols, )
The integer columns - not used in this function
- floatArrint ndarray of shape (numFloatCols, )
The float columns - output written here
- maxDistint
Number of variograms to calculate
- pyshepseg.tilingstats.writeCompletePages(pagedRat, attrTbl, statsSelection_fast)¶
Check for completed pages, and write them to the attribute table. Remove them from the pagedRat after writing.
- Parameters:
- pagedRatnumba.typed.Dict
The RAT as a paged data structure
- attrTblgdal.RasterAttributeTable
The Raster Attribute Table object for the file
- statsSelection_fastint ndarray of shape (numStats, 5)
Allows quick access to the types of stats required