Audio and Spectrograms¶
Annotations¶
functions and classes for manipulating annotations of audio
includes BoxedAnnotations class and utilities to combine or “diff” annotations, etc.
-
class
opensoundscape.annotations.
BoxedAnnotations
(df, audio_file=None)¶ container for “boxed” (frequency-time) annotations of audio
(for instance, annotations created in Raven software). Includes functionality to load annotations from Raven txt files, output one-hot labels for specific clip lengths or clip start/end times, apply corrections/conversions to annotations, and more.
Contains some analogous functions to Audio and Spectrogram, such as trim() [limit time range] and bandpass() [limit frequency range]
-
bandpass
(low_f, high_f, edge_mode='trim')¶ Bandpass a set of annotations, analogous to Spectrogram.bandpass()
Out-of-place operation: does not modify itself, returns new object
Parameters: - low_f – low frequency (Hz) bound
- high_f – high frequency (Hz) bound
- edge_mode – what to do when boxes overlap with edges of trim region - ‘trim’: trim boxes to bounds - ‘keep’: allow boxes to extend beyond bounds - ‘remove’: completely remove boxes that extend beyond bounds
Returns: a copy of the BoxedAnnotations object on the bandpassed region
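The three edge_mode options can be illustrated on plain (low, high) frequency pairs. This is a minimal sketch, not the library's implementation: apply_edge_mode is a hypothetical helper, and dropping boxes that lie entirely outside the band is an assumption.

```python
def apply_edge_mode(boxes, low_f, high_f, edge_mode="trim"):
    """Filter (box_low, box_high) pairs against the band [low_f, high_f].

    Hypothetical helper for illustration only.
    """
    if edge_mode not in ("trim", "keep", "remove"):
        raise ValueError(f"unknown edge_mode: {edge_mode}")
    kept = []
    for box_low, box_high in boxes:
        # Assumption: boxes with no overlap with the band are always dropped.
        if box_high <= low_f or box_low >= high_f:
            continue
        extends_beyond = box_low < low_f or box_high > high_f
        if edge_mode == "remove" and extends_beyond:
            continue
        if edge_mode == "trim":
            # Clip the box so that it ends at the band edges.
            box_low, box_high = max(box_low, low_f), min(box_high, high_f)
        kept.append((box_low, box_high))
    return kept


boxes = [(500, 1500), (2000, 3000), (3500, 6000), (7000, 8000)]
for mode in ("trim", "keep", "remove"):
    print(mode, apply_edge_mode(boxes, 1000, 4000, mode))
```

With 'keep', a box retains its original bounds as long as any part of it overlaps the band; with 'remove', only boxes fully inside the band survive.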
-
convert_labels
(conversion_table)¶ modify annotations according to a conversion table
Changes the values of ‘annotation’ column of dataframe. Any labels that do not have specified conversions are left unchanged.
Returns a new BoxedAnnotations object and does not modify itself (out-of-place operation). Usage could look like: my_annotations = my_annotations.convert_labels(table)
Parameters: conversion_table – current values -> new values. can be either - pd.DataFrame with 2 columns [current value, new values] or - dictionary {current values: new values} Returns: new BoxedAnnotations object with converted annotation labels
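The dictionary form of the conversion table behaves like a lookup with a fallback to the original label. The sketch below shows that rule on a plain list; convert_label_list and the label values are hypothetical and are not part of the library.

```python
def convert_label_list(labels, conversion_table):
    """Return a new list of labels with conversions applied.

    Labels without an entry in conversion_table are left unchanged.
    Hypothetical helper for illustration only.
    """
    return [conversion_table.get(label, label) for label in labels]


original = ["NOCA", "northern cardinal", "AMRO"]
table = {"northern cardinal": "NOCA"}  # current value -> new value
converted = convert_label_list(original, table)
print(converted)
print(original)  # the input list is not modified (out-of-place)
```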
-
classmethod
from_raven_file
(path, annotation_column, keep_extra_columns=True, audio_file=None)¶ load annotations from Raven txt file
Parameters: - path – location of raven .txt file, str or pathlib.Path
- annotation_column – (str) column containing annotations
- keep_extra_columns – keep or discard extra Raven file columns (always keeps start_time, end_time, low_f, high_f, annotation, audio_file). [default: True] - True: keep all - False: keep none - or iterable of specific columns to keep
- audio_file – optionally specify the name or path of a corresponding audio file.
Returns: BoxedAnnotations object containing annotations from the Raven file
-
global_one_hot_labels
(classes)¶ get a list of one-hot labels for entire duration
Parameters: classes – iterable of class names to give 0/1 labels
Returns: list of 0/1 labels for each class
-
one_hot_clip_labels
(full_duration, clip_duration, clip_overlap, classes, min_label_overlap, min_label_fraction=1, final_clip=None)¶ Generate one-hot labels for clips of fixed duration
wraps helpers.generate_clip_times_df() with self.one_hot_labels_like() - Clips are created in the same way as Audio.split() - Labels are applied based on overlap, using self.one_hot_labels_like()
Parameters: - full_duration – The amount of time (seconds) to split into clips
- clip_duration (float) – The duration in seconds of the clips
- clip_overlap (float) – The overlap of the clips in seconds [default: 0]
- classes – list of classes for one-hot labels. If None, classes will be all unique values of self.df[‘annotation’]
- min_label_overlap – minimum duration (seconds) of annotation within the time interval for it to count as a label. Note that any annotation of length less than this value will be discarded. We recommend a value of 0.25 for typical bird songs, or shorter values for very short-duration events such as chip calls or nocturnal flight calls.
- min_label_fraction – [default: None] if >= this fraction of an annotation overlaps with the time window, it counts as a label regardless of its duration. Note that if either of the two criteria (overlap and fraction) is met, the label is 1. If None (default), this criterion is not used (i.e., only min_label_overlap is used). A value of 0.5 for this parameter would ensure that all annotations result in at least one clip being labeled 1 (if there are no gaps between clips).
- final_clip (str) –
Behavior if final_clip is less than clip_duration seconds long. By default, discards remaining time if less than clip_duration seconds long [default: None]. Options:
- None: Discard the remainder (do not make a clip)
- ”extend”: Extend the final clip beyond full_duration to reach clip_duration length
- ”remainder”: Use only remainder of full_duration (final clip will be shorter than clip_duration)
- ”full”: Increase overlap with previous clip to yield a clip with clip_duration length
Returns: dataframe with index [‘start_time’,’end_time’] and columns=classes
-
one_hot_labels_like
(clip_df, classes, min_label_overlap, min_label_fraction=None, keep_index=False)¶ create a dataframe of one-hot clip labels based on given starts/ends
Uses start and end clip times from clip_df to define a set of clips. Then extracts annotations overlapping with each clip. Required overlap parameters are selected by user: annotation must satisfy the minimum time overlap OR minimum % overlap to be included (doesn’t require both conditions to be met, only one)
clip_df can be created using opensoundscape.helpers.generate_clip_times_df
Parameters: - clip_df – dataframe with ‘start_time’ and ‘end_time’ columns specifying the temporal bounds of each clip
- min_label_overlap – minimum duration (seconds) of annotation within the time interval for it to count as a label. Note that any annotation of length less than this value will be discarded. We recommend a value of 0.25 for typical bird songs, or shorter values for very short-duration events such as chip calls or nocturnal flight calls.
- min_label_fraction – [default: None] if >= this fraction of an annotation overlaps with the time window, it counts as a label regardless of its duration. Note that if either of the two criteria (overlap and fraction) is met, the label is 1. If None (default), this criterion is not used (i.e., only min_label_overlap is used). A value of 0.5 for this parameter would ensure that all annotations result in at least one clip being labeled 1 (if there are no gaps between clips).
- classes – list of classes for one-hot labels. If None, classes will be all unique values of self.df[‘annotation’]
- keep_index – if True, keeps the index of clip_df as an index in the returned DataFrame. [default:False]
Returns: DataFrame of one-hot labels (multi-index of (start_time, end_time), columns for each class, values 0=absent or 1=present)
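The "time overlap OR fraction" rule can be sketched with plain tuples instead of DataFrames. This is an illustration under assumptions, not the library's code: clip_labels and overlap_seconds are hypothetical helpers, and the annotation times are made up.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length (seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def clip_labels(annotations, clips, classes, min_label_overlap,
                min_label_fraction=None):
    """One-hot labels per clip.

    annotations: list of (start_time, end_time, label)
    clips: list of (start_time, end_time)
    Returns {(clip_start, clip_end): {class: 0 or 1}}.
    """
    table = {}
    for clip_start, clip_end in clips:
        row = {c: 0 for c in classes}
        for ann_start, ann_end, label in annotations:
            if label not in row:
                continue
            overlap = overlap_seconds(ann_start, ann_end, clip_start, clip_end)
            duration = ann_end - ann_start
            enough_time = overlap >= min_label_overlap
            enough_fraction = (
                min_label_fraction is not None
                and duration > 0
                and overlap / duration >= min_label_fraction
            )
            # Either criterion alone is sufficient for a positive label.
            if enough_time or enough_fraction:
                row[label] = 1
        table[(clip_start, clip_end)] = row
    return table


annotations = [(0.9375, 1.125, "chip"), (1.5, 3.0, "song")]
clips = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
labels = clip_labels(annotations, clips, ["chip", "song"],
                     min_label_overlap=0.25, min_label_fraction=0.5)
for clip, row in labels.items():
    print(clip, row)
```

The "chip" annotation is shorter than min_label_overlap, so it only receives a label because two thirds of it falls inside the second clip; with min_label_fraction=None it would be discarded.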
-
subset
(classes)¶ subset annotations to those from a list of classes
out-of-place operation (returns new filtered BoxedAnnotations object)
Parameters: classes – list of classes to retain (all others are discarded). The list can include np.nan or None if you want to keep them
Returns: new BoxedAnnotations object containing only annotations in classes
-
to_raven_file
(path)¶ save annotations to a Raven-compatible tab-separated text file
Parameters: path – path for saved text file (extension must be “.tsv”); can be str or pathlib.Path
Outcomes: creates a file containing the annotations in a format compatible with Raven Pro/Lite.
Note: Raven Lite does not support additional columns beyond a single annotation column. Additional columns will not be shown in the Raven Lite interface.
-
trim
(start_time, end_time, edge_mode='trim')¶ Trim a set of annotations, analogous to Audio/Spectrogram.trim()
Out-of-place operation: does not modify itself, returns new object
Parameters: - start_time – time (seconds) since beginning for left bound
- end_time – time (seconds) since beginning for right bound
- edge_mode – what to do when boxes overlap with edges of trim region - ‘trim’: trim boxes to bounds - ‘keep’: allow boxes to extend beyond bounds - ‘remove’: completely remove boxes that extend beyond bounds
Returns: a copy of the BoxedAnnotations object on the trimmed region. - note that, like Audio.trim(), there is a new reference point for 0.0 seconds (located at start_time in the original object)
-
unique_labels
()¶ get list of all unique (non-Falsy) labels
-
-
opensoundscape.annotations.
categorical_to_one_hot
(labels, classes=None)¶ transform multi-target categorical labels (list of lists) to one-hot array
Parameters: - labels – list of lists of categorical labels, eg [[‘white’,’red’],[‘green’,’white’]] or [[0,1,2],[3]]
- classes=None – list of classes for one-hot labels. if None, taken to be the unique set of values in labels
Returns: - one_hot – 2d array with 0 for absent and 1 for present
- classes – list of classes corresponding to columns in the array
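A minimal sketch of the transformation on nested lists follows. to_one_hot is a hypothetical stand-in, not the library function, and sorting the inferred classes is an assumption made here so that the column order is reproducible.

```python
def to_one_hot(labels, classes=None):
    """Convert a list of lists of labels to (one_hot rows, classes).

    Hypothetical helper for illustration only.
    """
    if classes is None:
        # Assumption: inferred classes are sorted for a stable column order.
        classes = sorted({label for sample in labels for label in sample})
    one_hot = [[int(c in sample) for c in classes] for sample in labels]
    return one_hot, classes


one_hot, classes = to_one_hot([["white", "red"], ["green", "white"]])
print(classes)
print(one_hot)
```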
-
opensoundscape.annotations.
combine
(list_of_annotation_objects)¶ combine annotations with user-specified preferences Not Implemented.
-
opensoundscape.annotations.
diff
(base_annotations, comparison_annotations)¶ look at differences between two BoxedAnnotations objects Not Implemented.
Compare different labels of the same boxes (Assumes that a second annotator used the same boxes as the first, but applied new labels to the boxes)
-
opensoundscape.annotations.
one_hot_labels_on_time_interval
(df, classes, start_time, end_time, min_label_overlap, min_label_fraction=None)¶ generate a dictionary of one-hot labels for given time-interval
Each class is labeled 1 if any annotation overlaps sufficiently with the time interval. Otherwise the class is labeled 0.
Parameters: - df – DataFrame with columns ‘start_time’, ‘end_time’ and ‘annotation’
- classes – list of classes for one-hot labels. If None, classes will be all unique values of df[‘annotation’]
- start_time – beginning of time interval (seconds)
- end_time – end of time interval (seconds)
- min_label_overlap – minimum duration (seconds) of annotation within the time interval for it to count as a label. Note that any annotation of length less than this value will be discarded. We recommend a value of 0.25 for typical bird songs, or shorter values for very short-duration events such as chip calls or nocturnal flight calls.
- min_label_fraction – [default: None] if >= this fraction of an annotation overlaps with the time window, it counts as a label regardless of its duration. Note that if either of the two criteria (overlap and fraction) is met, the label is 1. If None (default), the criterion is not used (only min_label_overlap is used). A value of 0.5 would ensure that all annotations result in at least one clip being labeled 1 (if no gaps between clips).
Returns: dictionary of {class: label 0/1} for all classes
-
opensoundscape.annotations.
one_hot_to_categorical
(one_hot, classes)¶ transform one_hot labels to multi-target categorical (list of lists)
Parameters: - one_hot – 2d array with 0 for absent and 1 for present
- classes – list of classes corresponding to columns in the array
Returns: labels – list of lists of categorical labels for each sample, eg [[‘white’,’red’],[‘green’,’white’]] or [[0,1,2],[3]]
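The reverse direction can be sketched the same way. to_categorical is a hypothetical helper, not the library function; note that the labels in each sample come back in the order of classes, not in their original order.

```python
def to_categorical(one_hot, classes):
    """Convert one-hot rows back to lists of the classes that are present.

    Hypothetical helper for illustration only.
    """
    return [
        [c for c, present in zip(classes, row) if present == 1]
        for row in one_hot
    ]


classes = ["green", "red", "white"]
one_hot = [[0, 1, 1], [1, 0, 1]]
labels = to_categorical(one_hot, classes)
print(labels)
```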
Audio¶
audio.py: Utilities for loading and modifying Audio objects
Note: Out-of-place operations
Functions that modify Audio (and Spectrogram) objects are “out of place”,
meaning that they return a new Audio object instead of modifying the
original object. This means that running a line
`
audio_object.resample(22050) # WRONG!
`
will not change the sample rate of audio_object!
If your goal was to overwrite audio_object with the new,
resampled audio, you would instead write
`
audio_object = audio_object.resample(22050)
`
-
class
opensoundscape.audio.
Audio
(samples, sample_rate, resample_type='kaiser_fast', max_duration=None, metadata=None)¶ Container for audio samples
Initialization requires sample array. To load audio file, use Audio.from_file()
Initializing an Audio object directly requires the specification of the sample rate. Use Audio.from_file or Audio.from_bytesio with sample_rate=None to use a native sampling rate.
Parameters: - samples (np.array) – The audio samples
- sample_rate (integer) – The sampling rate for the audio samples
- resample_type (str) – The resampling method to use [default: “kaiser_fast”]
- max_duration (None or integer) – The maximum duration in seconds allowed for the audio file (longer files will raise an exception)[default: None] If None, no limit is enforced
Returns: An initialized Audio object
-
bandpass
(low_f, high_f, order)¶ Bandpass audio signal with a butterworth filter
Uses a phase-preserving algorithm (scipy.signal’s butter and sosfiltfilt)
Parameters: - low_f – low frequency cutoff (-3 dB) in Hz of bandpass filter
- high_f – high frequency cutoff (-3 dB) in Hz of bandpass filter
- order – butterworth filter order (integer) ~= steepness of cutoff
-
duration
()¶ Return duration of Audio
Returns: The duration of the Audio Return type: duration (float)
-
extend
(length)¶ Extend audio file by adding silence to the end
Parameters: length – the final duration in seconds of the extended audio object Returns: a new Audio object of the desired duration
-
classmethod
from_bytesio
(bytesio, sample_rate=None, max_duration=None, resample_type='kaiser_fast')¶ Read from bytesio object
Read an Audio object from a BytesIO object. This is primarily used for passing Audio over HTTP.
Parameters: - bytesio – Contents of WAV file as BytesIO
- sample_rate – The final sampling rate of Audio object [default: None]
- max_duration – The maximum duration of the audio file [default: None]
- resample_type – The librosa method to do resampling [default: “kaiser_fast”]
Returns: An initialized Audio object
-
classmethod
from_file
(path, sample_rate=None, resample_type='kaiser_fast', max_duration=None, metadata=True, offset=0, duration=None)¶ Load audio from files
Handles the various possible input types for loading an audio file. Also attempts to load metadata using tinytag.
Audio objects only support mono (one-channel) at this time. Files with multiple channels are mixed down to a single channel.
Optionally, load only a piece of a file using offset and duration. This will efficiently read sections of a .wav file regardless of where the desired clip is in the audio. For mp3 files, access time grows linearly with time since the beginning of the file.
This function relies on librosa.load(), which supports wav natively but requires ffmpeg for mp3 support.
Parameters: - path (str, Path) – path to an audio file
- sample_rate (int, None) – resample audio with value and resample_type, if None use source sample_rate (default: None)
- resample_type – method used to resample (default: kaiser_fast)
- max_duration – the maximum length of an input file, None is no maximum (default: None)
- metadata (bool) – if True, attempts to load metadata from the audio file. If an exception occurs, self.metadata will be None. Otherwise self.metadata is a dictionary. Note: will also attempt to parse AudioMoth metadata from the comment field, if the artist field includes AudioMoth. The parsing function for AudioMoth is likely to break when new firmware versions change the comment metadata field.
- offset – load audio starting at this time (seconds) after the start of the file. Default: 0 seconds.
- duration – load audio of this duration (seconds) starting at offset. If None, loads all the way to the end of the file.
Returns: Audio object with attributes: samples, sample_rate, resample_type, max_duration, metadata (dict or None)
Note: default sample_rate=None means use file’s sample rate, don’t resample
-
loop
(length=None, n=None)¶ Extend audio file by looping it
Parameters: - length – the final length in seconds of the looped file (cannot be used with n)[default: None]
- n – the number of occurrences of the original audio sample (cannot be used with length) [default: None] For example, n=1 returns the original sample, and n=2 returns two concatenated copies of the original sample
Returns: a new Audio object of the desired length or repetitions
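The two mutually exclusive arguments can be sketched on a plain list of samples. loop_samples is a hypothetical helper, and truncating the last repetition to hit the requested length exactly is an assumption about how a non-integer number of repetitions is handled.

```python
def loop_samples(samples, sample_rate, length=None, n=None):
    """Repeat samples either n times or up to `length` seconds.

    Hypothetical helper for illustration only.
    """
    if (length is None) == (n is None):
        raise ValueError("specify exactly one of length or n")
    if n is not None:
        return samples * n
    target = round(length * sample_rate)   # desired number of samples
    repeats = -(-target // len(samples))   # ceiling division
    # Assumption: the final repetition is cut off at the target length.
    return (samples * repeats)[:target]


one_second = [1, 2, 3, 4]  # 1 second of audio at 4 samples per second
print(loop_samples(one_second, 4, n=2))
print(loop_samples(one_second, 4, length=2.5))
```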
-
resample
(sample_rate, resample_type=None)¶ Resample Audio object
Parameters: - sample_rate (scalar) – the new sample rate
- resample_type (str) – resampling algorithm to use [default: None (uses self.resample_type of instance)]
Returns: a new Audio object of the desired sample rate
-
save
(path)¶ Save Audio to file
NOTE: currently, only saving to .wav format supported
Parameters: path – destination for output
-
spectrum
()¶ Create frequency spectrum from an Audio object using fft
Returns: fft, frequencies
-
split
(clip_duration, clip_overlap=0, final_clip=None)¶ Split Audio into equal-length clips
The Audio object is split into clips of a specified duration and overlap
Parameters: - clip_duration (float) – The duration in seconds of the clips
- clip_overlap (float) – The overlap of the clips in seconds [default: 0]
- final_clip (str) –
Behavior if final_clip is less than clip_duration seconds long. By default, discards remaining audio if less than clip_duration seconds long [default: None]. Options:
- None: Discard the remainder (do not make a clip)
- ”extend”: Extend the final clip with silence to reach clip_duration length
- ”remainder”: Use only remainder of Audio (final clip will be shorter than clip_duration)
- ”full”: Increase overlap with previous clip to yield a clip with clip_duration length
Returns: - audio_clips – list of audio objects
- dataframe with columns for start_time and end_time of each clip
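How the final_clip options change the clip boundaries can be sketched by computing (start, end) times only. clip_times is a hypothetical helper and not the library's implementation; in particular it handles a single incomplete clip at the end, which is a simplification.

```python
def clip_times(full_duration, clip_duration, clip_overlap=0, final_clip=None):
    """Return a list of (start_time, end_time) tuples in seconds.

    Hypothetical helper for illustration only.
    """
    if final_clip not in (None, "extend", "remainder", "full"):
        raise ValueError(f"unknown final_clip option: {final_clip}")
    step = clip_duration - clip_overlap
    if step <= 0:
        raise ValueError("clip_overlap must be smaller than clip_duration")

    clips = []
    i = 0
    # Complete clips that fit entirely inside full_duration.
    while i * step + clip_duration <= full_duration:
        clips.append((i * step, i * step + clip_duration))
        i += 1

    start = i * step  # where the next (incomplete) clip would begin
    if final_clip is not None and start < full_duration:
        if final_clip == "extend":
            clips.append((start, start + clip_duration))
        elif final_clip == "remainder":
            clips.append((start, full_duration))
        else:  # "full": shift the start back so the clip has full length
            clips.append((max(0, full_duration - clip_duration), full_duration))
    return clips


for option in (None, "extend", "remainder", "full"):
    print(option, clip_times(10, 4, final_clip=option))
```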
-
split_and_save
(destination, prefix, clip_duration, clip_overlap=0, final_clip=None, dry_run=False)¶ Split audio into clips and save them to a folder
Parameters: - destination – A folder to write clips to
- prefix – A name to prepend to the written clips
- clip_duration – The duration of each clip in seconds
- clip_overlap – The overlap of each clip in seconds [default: 0]
- final_clip (str) –
Behavior if final_clip is less than clip_duration seconds long. [default: None] By default, ignores final clip entirely. Possible options (any other input will ignore the final clip entirely),
- ”remainder”: Include the remainder of the Audio (clip will not have clip_duration length)
- ”full”: Increase the overlap to yield a clip with clip_duration length
- ”extend”: Similar to remainder but extend (repeat) the clip to reach clip_duration length
- None: Discard the remainder
- dry_run (bool) – If True, skip writing audio and just return clip DataFrame [default: False]
Returns: pandas.DataFrame containing paths and start and end times for each clip
-
time_to_sample
(time)¶ Given a time, convert it to the corresponding sample
Parameters: time – The time to multiply with the sample_rate Returns: The rounded sample Return type: sample
-
trim
(start_time, end_time)¶ Trim Audio object in time
If start_time is less than zero, output starts from time 0. If end_time is beyond the end of the sample, trims to end of sample.
Parameters: - start_time – time in seconds for start of extracted clip
- end_time – time in seconds for end of extracted clip
Returns: a new Audio object containing samples from start_time to end_time
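The clamping described above, together with the time-to-sample conversion, can be sketched on a plain list of samples. trim_samples is a hypothetical helper, and rounding time * sample_rate to the nearest sample is an assumption consistent with time_to_sample().

```python
def trim_samples(samples, sample_rate, start_time, end_time):
    """Return the samples between start_time and end_time (seconds).

    Out-of-range bounds are clamped to the available audio.
    Hypothetical helper for illustration only.
    """
    start_sample = max(0, round(start_time * sample_rate))
    end_sample = min(len(samples), round(end_time * sample_rate))
    return samples[start_sample:end_sample]  # slicing returns a new list


samples = list(range(10))  # 5 seconds of audio at 2 samples per second
print(trim_samples(samples, 2, -1, 2))    # start_time clamped to 0
print(trim_samples(samples, 2, 4, 100))   # end_time clamped to 5 seconds
```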
-
exception
opensoundscape.audio.
OpsoLoadAudioInputError
¶ Custom exception indicating we can’t load input
-
exception
opensoundscape.audio.
OpsoLoadAudioInputTooLong
¶ Custom exception indicating length of audio is too long
AudioMoth¶
Utilities specifically for audio files recorded by AudioMoths
-
opensoundscape.audiomoth.
audiomoth_start_time
(file, filename_timezone='UTC', to_utc=False)¶ parse audiomoth file name into a time stamp
AudioMoths create their file name based on the time that recording starts. This function parses the name into a timestamp. Older AudioMoth firmwares used a hexadecimal unix time format, while newer firmwares use a human-readable naming convention. This function handles both conventions.
Parameters: - file – (str) path or file name from AudioMoth recording
- filename_timezone – (str) name of a pytz time zone (for options see pytz.all_timezones). This is the time zone that the AudioMoth uses to record its name, not the time zone local to the recording site. Usually, this is ‘UTC’ because the AudioMoth records file names in UTC.
- to_utc – if True, converts timestamps to UTC localized time stamp. Otherwise, will return timestamp localized to timezone argument [default: False]
Returns: localized datetime object - if to_utc=True, datetime is always “localized” to UTC
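The two naming conventions can be told apart from the file stem alone. The sketch below is not the library function: parse_audiomoth_name is hypothetical, the exact patterns (an 8-character hexadecimal stem versus YYYYMMDD_HHMMSS) are assumptions, and it uses the standard library's timezone.utc in place of pytz.

```python
import string
from datetime import datetime, timezone
from pathlib import Path


def parse_audiomoth_name(file):
    """Parse an AudioMoth-style file name into a UTC datetime.

    Hypothetical helper for illustration only.
    """
    stem = Path(file).stem
    if len(stem) == 8 and all(c in string.hexdigits for c in stem):
        # Assumed older convention: Unix time written in hexadecimal.
        return datetime.fromtimestamp(int(stem, 16), tz=timezone.utc)
    # Assumed newer convention: human-readable YYYYMMDD_HHMMSS.
    naive = datetime.strptime(stem, "%Y%m%d_%H%M%S")
    return naive.replace(tzinfo=timezone.utc)


print(parse_audiomoth_name("recordings/20200404_115500.WAV"))
print(parse_audiomoth_name("5E887594.WAV"))
```

Both example names are made up; they are chosen so that the two conventions describe the same instant.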
-
opensoundscape.audiomoth.
parse_audiomoth_metadata
(metadata)¶ parse a dictionary of AudioMoth .wav file metadata
- parses the comment field
- adds keys for gain_setting, battery_state, recording_start_time
- if available (firmware >=1.4.0), adds temperature
Notes on comment field:
- Starting with firmware 1.4.0, the AudioMoth logs temperature to the metadata (wav header), eg “and temperature was 11.2C.”
- At some point the firmware shifted from writing “gain setting 2” to “medium gain setting”. Should handle both modes.
- Tested for AudioMoth firmware versions:
- 1.5.0
Parameters: metadata – dictionary with audiomoth metadata Returns: metadata dictionary with added keys and values
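Handling both gain phrasings and the optional temperature could look roughly like the sketch below. It is not the library's parser: parse_comment and the temperature_C key are hypothetical, and the example comment strings are assembled only from the phrases quoted above rather than from real AudioMoth headers.

```python
import re


def parse_comment(comment):
    """Extract gain and temperature from a comment string, if present.

    Hypothetical helper for illustration only.
    """
    parsed = {}
    temperature = re.search(r"temperature was (-?\d+(?:\.\d+)?)C", comment)
    if temperature:
        parsed["temperature_C"] = float(temperature.group(1))

    numeric_gain = re.search(r"gain setting (\d+)", comment)
    named_gain = re.search(r"(\w+) gain setting", comment)
    if numeric_gain:
        # Older phrasing, e.g. "gain setting 2"
        parsed["gain_setting"] = int(numeric_gain.group(1))
    elif named_gain:
        # Newer phrasing, e.g. "medium gain setting"
        parsed["gain_setting"] = named_gain.group(1)
    return parsed


print(parse_comment("recorded at gain setting 2"))
print(parse_comment("recorded at medium gain setting and temperature was 11.2C."))
```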
Audio Tools¶
audio_tools.py: set of tools that filter or modify audio files or sample arrays (not Audio objects)
opensoundscape.audio_tools.
bandpass_filter
(signal, low_f, high_f, sample_rate, order=9)¶perform a butterworth bandpass filter on a discrete time signal using scipy.signal’s butter and sosfiltfilt (phase-preserving version of sosfilt)
Parameters:
- signal – discrete time signal (audio samples, list of float)
- low_f – -3 dB point for the highpass (lower) edge of the band (Hz)
- high_f – -3 dB point for the lowpass (upper) edge of the band (Hz)
- sample_rate – samples per second (Hz)
- order=9 – higher values -> steeper dropoff
Returns: filtered time signal
opensoundscape.audio_tools.
butter_bandpass
(low_f, high_f, sample_rate, order=9)¶generate coefficients for bandpass_filter()
Parameters:
- low_f – low frequency of butterworth bandpass filter
- high_f – high frequency of butterworth bandpass filter
- sample_rate – audio sample rate
- order=9 – order of butterworth filter
Returns: set of coefficients used in sosfiltfilt()
opensoundscape.audio_tools.
clipping_detector
(samples, threshold=0.6)¶count the number of samples above a threshold value
Parameters:
- samples – a time series of float values
- threshold=0.6 – minimum value of sample to count as clipping
Returns: number of samples exceeding threshold
opensoundscape.audio_tools.
convolve_file
(in_file, out_file, ir_file, input_gain=1.0)¶apply an impulse_response to a file using ffmpeg’s afir convolution
ir_file is an audio file containing a short burst of noise recorded in a space whose acoustics are to be recreated
this makes the file ‘sound as if’ it were recorded in the location that the impulse response (ir_file) was recorded
Parameters:
- in_file – path to an audio file to process
- out_file – path to save output to
- ir_file – path to impulse response file
- input_gain=1.0 – ratio for in_file sound’s amplitude in (0,1)
Returns: os response of ffmpeg command
opensoundscape.audio_tools.
mixdown_with_delays
(files_to_mix, destination, delays=None, levels=None, duration='first', verbose=0, create_txt_file=False)¶use ffmpeg to mixdown a set of audio files, each starting at a specified time (padding beginnings with zeros)
Parameters:
- files_to_mix – list of audio file paths
- destination – path to save mixdown to
- delays=None – list of delays (how many seconds of zero-padding to add at beginning of each file)
- levels=None – optionally provide a list of relative levels (amplitudes) for each input
- duration='first' – ffmpeg option for duration of output file: match duration of ‘longest’,’shortest’,or ‘first’ input file
- verbose=0 – if >0, prints ffmpeg command and doesn’t suppress ffmpeg output (command line output is returned from this function)
- create_txt_file=False – if True, also creates a second output file which lists all files that were included in the mixdown
Returns: ffmpeg command line output
opensoundscape.audio_tools.
silence_filter
(filename, smoothing_factor=10, window_len_samples=256, overlap_len_samples=128, threshold=None)¶Identify whether a file is silent (0) or not (1)
Load samples from an mp3 file and identify whether or not it is likely to be silent. Silence is determined by finding the energy in windowed regions of these samples, and normalizing the detected energy by the average energy level in the recording.
If any windowed region has energy above the threshold, returns 1; else returns 0.
Parameters:
- filename (str) – file to inspect
- smoothing_factor (int) – modifier to window_len_samples
- window_len_samples – number of samples per window segment
- overlap_len_samples – number of samples to overlap each window segment
- threshold – threshold value (experimentally determined)
Returns: 0 if file contains no significant energy over background, 1 if file contains significant energy over background
If threshold is None: returns net_energy over background noise
opensoundscape.audio_tools.
window_energy
(samples, window_len_samples=256, overlap_len_samples=128)¶Calculate audio energy with a sliding window
Calculate the energy in an array of audio samples
Parameters:
- samples (np.ndarray) – array of audio samples loaded using librosa.load
- window_len_samples – samples per window
- overlap_len_samples – number of samples shared between consecutive windows
Returns: list of energy level (float) for each window
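The sliding-window calculation can be sketched without numpy. sliding_energy is a hypothetical helper; defining energy as the mean of squared samples and skipping any incomplete window at the end are both assumptions, not statements about the library's implementation.

```python
def sliding_energy(samples, window_len_samples=256, overlap_len_samples=128):
    """Energy of each complete window of samples.

    Hypothetical helper for illustration only.
    """
    step = window_len_samples - overlap_len_samples
    energies = []
    for start in range(0, len(samples) - window_len_samples + 1, step):
        window = samples[start:start + window_len_samples]
        # Assumption: energy is the mean of the squared sample values.
        energies.append(sum(x * x for x in window) / window_len_samples)
    return energies


samples = [0, 0, 0, 0, 1, 1, 1, 1]
print(sliding_energy(samples, window_len_samples=4, overlap_len_samples=2))
```

Consecutive windows share overlap_len_samples samples, so the window start advances by window_len_samples - overlap_len_samples each step.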
Spectrogram¶
spectrogram.py: Utilities for dealing with spectrograms
-
class
opensoundscape.spectrogram.
MelSpectrogram
(spectrogram, frequencies, times, decibel_limits, window_samples=None, overlap_samples=None, window_type=None, audio_sample_rate=None)¶ Immutable mel-spectrogram container
A mel spectrogram is a spectrogram with pseudo-logarithmically spaced frequency bins (see literature) rather than linearly spaced bins.
See Spectrogram class and Librosa’s melspectrogram for detailed documentation.
NOTE: Here we rely on scipy’s spectrogram function (via Spectrogram) rather than on librosa’s _spectrogram or melspectrogram, because the amplitudes of librosa’s spectrograms do not match expectations. We only use the mel frequency bank from Librosa.
-
classmethod
from_audio
(audio, n_mels=64, window_samples=512, overlap_samples=256, decibel_limits=(-100, -20), htk=False, norm='slaney', window_type='hann', dB_scale=True)¶ Create a MelSpectrogram object from an Audio object
First creates a spectrogram and a mel-frequency filter bank, then computes the dot product of the filter bank with the spectrogram.
The kwargs for the mel frequency bank are documented at: - https://librosa.org/doc/latest/generated/librosa.feature.melspectrogram.html#librosa.feature.melspectrogram - https://librosa.org/doc/latest/generated/librosa.filters.mel.html?librosa.filters.mel
Parameters: - n_mels – Number of mel bands to generate [default: 64] Note: n_mels should be chosen for compatibility with the Spectrogram parameter window_samples. Choosing a value > ~ window_samples/10 will result in zero-valued rows while small values blend rows from the original spectrogram.
- window_type – The windowing function to use [default: “hann”]
- window_samples – n samples per window [default: 512]
- overlap_samples – n samples shared by consecutive windows [default: 256]
- htk – use HTK mel-filter bank instead of Slaney, see Librosa docs [default: False]
- norm='slaney' – mel filter bank normalization, see Librosa docs
- dB_scale=True – If True, rescales values to decibels, x=10*log10(x) - if dB_scale is False, decibel_limits is ignored
Returns: opensoundscape.spectrogram.MelSpectrogram object
-
plot
(inline=True, fname=None, show_colorbar=False)¶ Plot the mel spectrogram with matplotlib.pyplot
We can’t use pcolormesh because it will smash pixels to achieve a linear y-axis, rather than preserving the mel scale.
Parameters: - inline=True –
- fname=None – specify a string path to save the plot to (ending in .png/.pdf)
- show_colorbar – include image legend colorbar from pyplot
-
class
opensoundscape.spectrogram.
Spectrogram
(spectrogram, frequencies, times, decibel_limits, window_samples=None, overlap_samples=None, window_type=None, audio_sample_rate=None)¶ Immutable spectrogram container
Can be initialized directly from spectrogram, frequency, and time values or created from an Audio object using the .from_audio() method.
-
frequencies
¶ (list) discrete frequency bins generated by fft
-
times
¶ (list) time from beginning of file to the center of each window
-
spectrogram
¶ a 2d array containing 10*log10(fft) for each time window
-
decibel_limits
¶ minimum and maximum decibel values in .spectrogram
-
window_samples
¶ number of samples per window when spec was created [default: none]
-
overlap_samples
¶ number of samples overlapped in consecutive windows when spec was created [default: none]
-
window_type
¶ window fn used to make spectrogram, eg ‘hann’ [default: none]
-
audio_sample_rate
¶ sample rate of audio from which spec was created [default: none]
-
amplitude
(freq_range=None)¶ create an amplitude vs time signal from spectrogram
by summing pixels in the vertical dimension
Parameters: freq_range=None – sum Spectrogram only in this range of [low, high] frequencies in Hz (if None, all frequencies are summed)
Returns: a time-series array of the vertical sum of spectrogram value
-
bandpass
(min_f, max_f, out_of_bounds_ok=True)¶ extract a frequency band from a spectrogram
crops the 2-d array of the spectrograms to the desired frequency range
Parameters: - min_f – low frequency in Hz for bandpass
- max_f – high frequency in Hz for bandpass
- out_of_bounds_ok – (bool) if False, raises ValueError if min_f or max_f are not within the range of the original spectrogram’s frequencies [default: True]
Returns: bandpassed spectrogram object
-
duration
()¶ calculate the amount of time represented in the spectrogram
Note: time may be shorter than the duration of the audio from which the spectrogram was created, because the windows may align in a way such that some samples from the end of the original audio were discarded
-
classmethod
from_audio
(audio, window_type='hann', window_samples=512, overlap_samples=256, decibel_limits=(-100, -20), dB_scale=True)¶ create a Spectrogram object from an Audio object
Parameters: - window_type="hann" – see scipy.signal.spectrogram docs for description of window parameter
- window_samples=512 – number of audio samples per spectrogram window (pixel)
- overlap_samples=256 – number of samples shared by consecutive windows
- decibel_limits=(-100, -20) – limit the dB values to (min,max) (lower values set to min, higher values set to max)
- dB_scale=True – If True, rescales values to decibels, x=10*log10(x) - if dB_scale is False, decibel_limits is ignored
Returns: opensoundscape.spectrogram.Spectrogram object
-
classmethod
from_file
()¶ create a Spectrogram object from a file
Parameters: file – path of image to load Returns: opensoundscape.spectrogram.Spectrogram object
-
limit_db_range
(min_db=-100, max_db=-20)¶ Limit the decibel values of the spectrogram to range from min_db to max_db
values less than min_db are set to min_db; values greater than max_db are set to max_db
similar to Audacity’s gain and range parameters
Parameters: - min_db – values lower than this are set to this
- max_db – values higher than this are set to this
Returns: Spectrogram object with db range applied
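The decibel conversion x=10*log10(x) used when a spectrogram is created, followed by the range limit, can be sketched on a flat list of values. to_decibels and limit_db are hypothetical helpers working on plain lists, not on Spectrogram objects.

```python
import math


def to_decibels(power_values):
    """Rescale positive power values to decibels: 10 * log10(x)."""
    return [10 * math.log10(x) for x in power_values]


def limit_db(db_values, min_db=-100, max_db=-20):
    """Set values below min_db to min_db and values above max_db to max_db."""
    return [min(max(v, min_db), max_db) for v in db_values]


power = [1e-12, 1e-5, 1.0]       # roughly -120 dB, -50 dB and 0 dB
decibels = to_decibels(power)
limited = limit_db(decibels)     # first value raised, last value lowered
print(decibels)
print(limited)
```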
-
linear_scale
(feature_range=(0, 1))¶ Linearly rescale spectrogram values to a range of values using in_range as decibel_limits
Parameters: feature_range – tuple of (low,high) values for output Returns: Spectrogram object with values rescaled to feature_range
-
min_max_scale
(feature_range=(0, 1))¶ Linearly rescale spectrogram values to a range of values using in_range as minimum and maximum
Parameters: feature_range – tuple of (low,high) values for output Returns: Spectrogram object with values rescaled to feature_range
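linear_scale and min_max_scale differ only in the input range that is mapped onto feature_range. A minimal sketch on a flat list of decibel values, where rescale is a hypothetical helper:

```python
def rescale(values, in_range, feature_range=(0, 1)):
    """Linearly map values from in_range onto feature_range.

    Hypothetical helper for illustration only.
    """
    in_low, in_high = in_range
    out_low, out_high = feature_range
    return [
        out_low + (v - in_low) / (in_high - in_low) * (out_high - out_low)
        for v in values
    ]


db_values = [-80, -60, -40]
decibel_limits = (-100, -20)

# linear_scale: the input range is the spectrogram's decibel_limits
linear = rescale(db_values, decibel_limits)
# min_max_scale: the input range is the data's own minimum and maximum
min_max = rescale(db_values, (min(db_values), max(db_values)))
print(linear)
print(min_max)
```

With decibel_limits as the input range, the output only spans the full feature_range if the data reach both limits; with the data's own minimum and maximum it always does.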
-
net_amplitude
(signal_band, reject_bands=None)¶ create amplitude signal in signal_band and subtract amplitude from reject_bands
rescale the signal and reject bands by dividing by their bandwidths in Hz (amplitude of each reject_band is divided by the total bandwidth of all reject_bands; amplitude of signal_band is divided by bandwidth of signal_band)
Parameters: - signal_band – [low,high] frequency range in Hz (positive contribution)
- reject_bands – list of [low,high] frequency ranges in Hz (negative contribution)
Returns: time-series array of net amplitude
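The bandwidth normalization can be sketched as follows. This is a plain-Python illustration under stated assumptions: the spectrogram is a list of rows (one per frequency bin), band amplitude is the sum of the rows inside the band, and band edges are inclusive. The actual method may differ in these details; band_amplitude is a hypothetical helper.

```python
def band_amplitude(spec, frequencies, band):
    """Sum the spectrogram rows whose frequency lies in [low, high], per time step."""
    low, high = band
    rows = [row for f, row in zip(frequencies, spec) if low <= f <= high]
    n_times = len(spec[0])
    return [sum(row[t] for row in rows) for t in range(n_times)]


def net_amplitude(spec, frequencies, signal_band, reject_bands=None):
    """Bandwidth-normalized signal amplitude minus reject-band amplitude."""
    signal_bw = signal_band[1] - signal_band[0]
    net = [a / signal_bw for a in band_amplitude(spec, frequencies, signal_band)]
    if reject_bands:
        # every reject band is divided by the total bandwidth of all reject bands
        total_reject_bw = sum(high - low for low, high in reject_bands)
        for band in reject_bands:
            reject = band_amplitude(spec, frequencies, band)
            net = [n - r / total_reject_bw for n, r in zip(net, reject)]
    return net


frequencies = [1000, 2000, 3000, 4000]  # Hz, one per spectrogram row
spec = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # rows x time steps
print(net_amplitude(spec, frequencies, [1000, 2000], reject_bands=[[3000, 4000]]))
```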
-
plot
(inline=True, fname=None, show_colorbar=False)¶ Plot the spectrogram with matplotlib.pyplot
Parameters: - inline=True –
- fname=None – specify a string path to save the plot to (ending in .png/.pdf)
- show_colorbar – include image legend colorbar from pyplot
-
to_image
(shape=None, mode='RGB', colormap=None)¶ Create a Pillow Image from spectrogram
Linearly rescales values in the spectrogram from self.decibel_limits to [255,0]
Default of self.decibel_limits on load is [-100, -20], so, e.g., -20 db is loudest -> black, -100 db is quietest -> white
Parameters: - destination – a file path (string)
- shape=None – tuple of image dimensions as (height, width)
- mode="RGB" – “RGB” for 3-channel output, “L” for 1-channel output
- colormap=None – if None, greyscale spectrogram is generated. Can be any matplotlib colormap name such as ‘jet’. Note: if mode=“L”, colormap will have no effect on output
Returns: Pillow Image object
-
trim
(start_time, end_time)¶ extract a time segment from a spectrogram
Parameters: - start_time – in seconds
- end_time – in seconds
Returns: spectrogram object from extracted time segment
-
window_length
()¶ calculate length of a single fft window, in seconds
-
window_start_times
()¶ get start times of each window, rather than midpoint times
-
window_step
()¶ calculate time difference (sec) between consecutive windows’ centers
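The three window helpers are related by simple arithmetic. The sketch below derives them from the sample rate and the windowing parameters, assuming the first window starts at time 0 and that spectrogram times are window midpoints. The actual methods presumably compute these from the stored times, so treat window_timing as a hypothetical illustration.

```python
def window_timing(n_windows, sample_rate, window_samples=512, overlap_samples=256):
    """Return (window_length, window_step, start_times, center_times) in seconds."""
    length = window_samples / sample_rate
    step = (window_samples - overlap_samples) / sample_rate
    starts = [i * step for i in range(n_windows)]
    centers = [start + length / 2 for start in starts]  # midpoint of each window
    return length, step, starts, centers


length, step, starts, centers = window_timing(n_windows=3, sample_rate=25600)
print(length, step, starts, centers)
```

At 25600 samples per second, a 512-sample window lasts 0.02 s and consecutive windows are 0.01 s apart, so each start time is half a window length before the corresponding midpoint.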
-
Machine Learning¶
Convolutional Neural Networks¶
classes for pytorch machine learning models in opensoundscape
For tutorials, see notebooks on opensoundscape.org
-
class
opensoundscape.torch.models.cnn.
CnnResampleLoss
(architecture, classes, single_target=False)¶ Subclass of PytorchModel with ResampleLoss.
ResampleLoss may perform better than BCE Loss for multitarget problems in some scenarios.
Parameters: - architecture – a model architecture object, for example one generated with the torch.architectures.cnn_architectures module
- classes – list of class names. Must match with training dataset classes.
- single_target –
- True: model expects exactly one positive class per sample
- False: samples can have any number of positive classes
[default: False]
-
class
opensoundscape.torch.models.cnn.
InceptionV3
(classes, freeze_feature_extractor=False, use_pretrained=True, single_target=False)¶ -
train_epoch
()¶ perform forward pass, loss, backpropagation for one epoch
need to override parent because Inception returns different outputs from the forward pass (final and auxiliary layers)
Returns: (targets, predictions, scores) on training files
-
-
class
opensoundscape.torch.models.cnn.
InceptionV3ResampleLoss
(classes, freeze_feature_extractor=False, use_pretrained=True, single_target=False)¶
-
class
opensoundscape.torch.models.cnn.
PytorchModel
(architecture, classes, single_target=False)¶ Generic Pytorch Model with .train(), .predict(), and .save()
flexible architecture, optimizer, loss function, parameters
for tutorials and examples see opensoundscape.org
Parameters: - architecture – EITHER a pytorch model object (subclass of torch.nn.Module), for example one generated with the cnn_architectures module, OR a string matching one of the architectures listed by cnn_architectures.list_architectures(), eg ‘resnet18’. If a string is provided, uses default parameters (including use_pretrained=True)
- classes – list of class names. Must match with training dataset classes if training.
- single_target –
- True: model expects exactly one positive class per sample
- False: samples can have any number of positive classes
[default: False]
-
predict
(prediction_dataset, batch_size=1, num_workers=0, activation_layer=None, binary_preds=None, threshold=0.5, error_log=None)¶ Generate predictions on a dataset
Choose to return any combination of scores, labels, and single-target or multi-target binary predictions. Also choose activation layer for scores (softmax, sigmoid, softmax then logit, or None).
Note: the order of returned dataframes is (scores, preds, labels)
Parameters: - prediction_dataset – a Preprocessor or Dataset object that returns tensors, such as AudioToSpectrogramPreprocessor (no augmentation) or CnnPreprocessor (w/augmentation) from opensoundscape.datasets
- batch_size – Number of files to load simultaneously [default: 1]
- num_workers – parallelization (ie cpus or cores), use 0 for current process [default: 0]
- activation_layer – Optionally apply an activation layer such as sigmoid or softmax to the raw outputs of the model. options: - None: no activation, return raw scores (ie logit, [-inf:inf]) - ‘softmax’: scores all classes sum to 1 - ‘sigmoid’: all scores in [0,1] but don’t sum to 1 - ‘softmax_and_logit’: applies softmax first then logit [default: None]
- binary_preds – Optionally return binary (thresholded 0/1) predictions options: - ‘single_target’: max scoring class = 1, others = 0 - ‘multi_target’: scores above threshold = 1, others = 0 - None: do not create or return binary predictions [default: None]
- threshold – prediction threshold(s) for sigmoid scores. Only relevant when binary_preds == ‘multi_target’
- error_log – if not None, saves a list of files that raised errors to the specified file location [default: None]
Returns: 3 DataFrames (or Nones), w/index matching prediction_dataset.df
- scores: post-activation_layer scores
- predictions: 0/1 preds for each class
- labels: labels from dataset (if available)
Note: if loading an audio file raises a PreprocessingError, the scores and predictions for that sample will be np.nan
Note: if no return type selected for labels/scores/preds, returns None instead of a DataFrame in the returned tuple
-
split_and_predict
(prediction_dataset, file_batch_size=1, num_workers=0, activation_layer=None, binary_preds=None, threshold=0.5, error_log=None, clip_batch_size=None)¶ Generate predictions on long audio files
This function integrates in-pipeline splitting of audio files into shorter clips with clip-level prediction.
The input dataset should be a LongAudioPreprocessor object
Choose to return any combination of scores, labels, and single-target or multi-target binary predictions. Also choose activation layer for scores (softmax, sigmoid, softmax then logit, or None).
Parameters: - prediction_dataset – a LongAudioPreprocessor object
- file_batch_size – Number of audio files to load simultaneously [default: 1]
- num_workers – parallelization (ie cpus or cores), use 0 for current process [default: 0]
- activation_layer – Optionally apply an activation layer such as sigmoid or softmax to the raw outputs of the model. options: - None: no activation, return raw scores (ie logit, [-inf:inf]) - ‘softmax’: scores all classes sum to 1 - ‘sigmoid’: all scores in [0,1] but don’t sum to 1 - ‘softmax_and_logit’: applies softmax first then logit [default: None]
- binary_preds – Optionally return binary (thresholded 0/1) predictions options: - ‘single_target’: max scoring class = 1, others = 0 - ‘multi_target’: scores above threshold = 1, others = 0 - None: do not create or return binary predictions [default: None]
- threshold – prediction threshold for sigmoid scores. Only relevant when binary_preds == ‘multi_target’
- clip_batch_size – batch size of preprocessed samples for CNN prediction
- error_log – if not None, saves a list of files that raised errors to the specified file location [default: None]
Returns: DataFrames with multi-index: path, clip start & end times
- scores: post-activation_layer scores
- predictions: 0/1 preds for each class, if binary_preds given
- unsafe_samples: list of samples that failed to preprocess
Note: if loading an audio file raises a PreprocessingError, the scores and predictions for that sample will be np.nan
Note: if no return type selected for scores/preds, returns None instead of a DataFrame for predictions
Note: currently does not support passing labels. Meaning of a label is ambiguous since the original files are split into clips during prediction (output values are for clips, not entire file)
-
train
(train_dataset, valid_dataset, epochs=1, batch_size=1, num_workers=0, save_path='.', save_interval=1, log_interval=10, unsafe_sample_log='./unsafe_samples.log')¶ train the model on samples from train_dataset
If customized loss functions, networks, optimizers, or schedulers are desired, modify the respective attributes before calling .train().
Parameters: - train_dataset – a Preprocessor that loads sample (audio file + label) to Tensor in batches (see docs/tutorials for details)
- valid_dataset – a Preprocessor for evaluating performance
- epochs – number of epochs to train for [default=1] (1 epoch constitutes 1 view of each training sample)
- batch_size – number of training files to load/process before re-calculating the loss function and backpropagation
- num_workers – parallelization (ie, cores or cpus) Note: use 0 for single (root) process (not 1)
- save_path – location to save intermediate and best model objects [default=”.”, ie current location of script]
- save_interval – interval in epochs to save model object with weights [default:1] Note: the best model is always saved to best.model in addition to other saved epochs.
- log_interval – interval in epochs to evaluate model with validation dataset and print metrics to the log
- unsafe_sample_log – file path: log all samples that failed in preprocessing (file written when training completes) - if None, does not write a file
-
train_epoch
()¶ perform forward pass, loss, backpropagation for one epoch
Returns: (targets, predictions, scores) on training files
-
class
opensoundscape.torch.models.cnn.
Resnet18Binary
(classes, use_pretrained=True)¶ Subclass of PytorchModel with Resnet18 architecture
This subclass allows separate training parameters for the feature extractor and classifier via optimizer_params
Parameters: - classes – list of class names. Must match with training dataset classes.
- single_target –
- True: model expects exactly one positive class per sample
- False: samples can have any number of positive classes
[default: False]
-
class
opensoundscape.torch.models.cnn.
Resnet18Multiclass
(classes, single_target=False, use_pretrained=True)¶ Multi-class model with resnet18 architecture and ResampleLoss.
Can be single or multi-target.
Parameters: - classes – list of class names. Must match with training dataset classes.
- single_target –
- True: model expects exactly one positive class per sample
- False: samples can have any number of positive classes
[default: False]
Notes: - Allows separate parameters for feature & classifier blocks via self.optimizer_params’s keys: “feature” and “classifier”
- Uses ResampleLoss
-
opensoundscape.torch.models.cnn.
load_model
(path, device=None)¶ load a saved model object
Parameters: - path – file path of saved model
- device – which device to load into, eg ‘cuda:1’ [default: None] will choose first gpu if available, otherwise cpu
Returns: a model object with loaded weights
-
opensoundscape.torch.models.cnn.
load_outdated_model
(path, model_class, architecture_constructor=None, device=None)¶ load a CNN saved with a previous version of OpenSoundscape
This function enables you to load models saved with opso 0.4.x, 0.5.x, and 0.6.0 when using >=0.6.1. For models created with 0.6.1 and above, use load_model(path) which is more robust.
Note: If you are loading a model created with opensoundscape 0.4.x, you most likely want to specify model_class = opensoundscape.torch.models.CnnResnet18Binary. If your model was created with opensoundscape 0.5.x or 0.6.0, you need to choose the appropriate class.
Note: for future use of the loaded model, you can simply call model.save(path) after creating it, then reload it with model = load_model(path). The saved model will be fully compatible with opensoundscape >=0.6.1.
Examples:
```
# load a binary resnet18 model from opso 0.4.x, 0.5.x, or 0.6.0
from opensoundscape.torch.models.cnn import Resnet18Binary, load_outdated_model
model = load_outdated_model('old_model.tar', model_class=Resnet18Binary)

# load a resnet50 model of class PytorchModel created with opso 0.5.0
from opensoundscape.torch.models.cnn import PytorchModel
from opensoundscape.torch.architectures.cnn_architectures import resnet50
model_050 = load_outdated_model('opso050_pytorch_model_r50.model', model_class=PytorchModel, architecture_constructor=resnet50)
```
Parameters: - path – path to model file, ie .model or .tar file
- model_class – the opensoundscape class to create, eg PytorchModel, CnnResampleLoss, or Resnet18Binary from opensoundscape.torch.models.cnn
- architecture_constructor – the function that creates desired cnn architecture eg opensoundscape.torch.architectures.cnn_architectures.resnet18 Note: this is only required for classes that take the architecture as an input, for instance PytorchModel or CnnResampleLoss. It’s not required for e.g. Resnet18Binary or InceptionV3 which internally create a specific architecture.
- device – optionally specify a device to map tensors onto, eg ‘cpu’, ‘cuda:0’, ‘cuda:1’ [default: None] - if None, will choose cuda:0 if cuda is available, otherwise chooses cpu
Returns: a cnn model object with the weights loaded from the saved model
-
class
opensoundscape.torch.models.utils.
BaseModule
¶ Base class for a pytorch model pipeline class.
All child classes should define load, save, etc
-
opensoundscape.torch.models.utils.
apply_activation_layer
(x, activation_layer=None)¶ applies an activation layer to a set of scores
Parameters: - x – input values
- activation_layer –
- None [default]: return original values
- ‘softmax’: apply softmax activation
- ‘sigmoid’: apply sigmoid activation
- ‘softmax_and_logit’: apply softmax then logit transform
Returns: values with activation layer applied
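To make the four options concrete, here is a minimal plain-Python sketch operating on the class scores of a single sample. It illustrates the mathematics only: the library function works on tensors, and apply_activation and its helpers are hypothetical names.

```python
import math


def softmax(scores):
    """Turn one sample's scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Map each score independently into (0, 1)."""
    return [1 / (1 + math.exp(-s)) for s in scores]


def logit(probabilities):
    """Inverse of sigmoid: log(p / (1 - p))."""
    return [math.log(p / (1 - p)) for p in probabilities]


def apply_activation(scores, activation_layer=None):
    """Apply the named activation to one sample's class scores."""
    if activation_layer is None:
        return list(scores)
    if activation_layer == "softmax":
        return softmax(scores)
    if activation_layer == "sigmoid":
        return sigmoid(scores)
    if activation_layer == "softmax_and_logit":
        return logit(softmax(scores))
    raise ValueError(f"unknown activation_layer: {activation_layer!r}")


print(apply_activation([1.0, 2.0, 3.0], "softmax"))
```

Softmax couples the classes (raising one score lowers the others), while sigmoid treats each class independently, which is why sigmoid scores need not sum to 1.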
-
opensoundscape.torch.models.utils.
cas_dataloader
(dataset, batch_size, num_workers)¶ Return a dataloader that uses the class aware sampler
Class aware sampler tries to balance the examples per class in each batch. It selects just a few classes to be present in each batch, then samples those classes for even representation in the batch.
Parameters: - dataset – a pytorch dataset type object
- batch_size – see DataLoader
- num_workers – see DataLoader
-
opensoundscape.torch.models.utils.
collate_lists_of_audio_clips
(batch)¶ Collate function for splitting + prediction of long audio files
Puts each data field into a tensor with outer dimension batch size
Additionally, concats the dfs from each audio file into one long df for the entire batch
-
opensoundscape.torch.models.utils.
get_batch
(array, batch_size, batch_number)¶ get a single slice of a larger array
using the batch size and batch index, from zero
Parameters: - array – iterable to split into batches
- batch_size – num elements per batch
- batch_number – index of batch
Returns: one batch (subset of array)
Note: the final elements are returned as the last batch even if there are fewer than batch_size
Example
if array=[1,2,3,4,5,6,7] then:
- get_batch(array,3,0) returns [1,2,3]
- get_batch(array,3,2) returns [7]
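A minimal sketch of the slicing described above, assuming batch numbers count from zero as stated. This is an illustrative reimplementation, not necessarily the library's exact code.

```python
def get_batch(array, batch_size, batch_number):
    """Return batch `batch_number` (counting from 0) of `array`."""
    start = batch_number * batch_size
    end = min(start + batch_size, len(array))  # the last batch may be short
    return array[start:end]


array = [1, 2, 3, 4, 5, 6, 7]
batches = [get_batch(array, 3, i) for i in range(3)]
print(batches)  # [[1, 2, 3], [4, 5, 6], [7]]
```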
-
opensoundscape.torch.models.utils.
get_dataloader
(safe_dataset, batch_size=64, num_workers=1, shuffle=False, sampler='')¶ Create a DataLoader from a Dataset. Chooses between the normal pytorch DataLoader and ImbalancedDatasetSampler. sampler: None -> default DataLoader; ‘imbalanced’ -> ImbalancedDatasetSampler
-
opensoundscape.torch.models.utils.
tensor_binary_predictions
(scores, mode, threshold=None)¶ generate binary 0/1 predictions from continuous scores
Parameters: - scores – torch.Tensor of dim (batch_size, n_classes) with input scores [-inf:inf]
- mode – ‘single_target’, ‘multi_target’, or None (return empty tensor)
- threshold – minimum score to predict 1, if mode==‘multi_target’. threshold can be a single value for all classes or a list of class-specific values.
Returns: torch.Tensor of 0/1 predictions in same shape as scores
Note: expects real-valued (unbounded) input scores, i.e. scores take values in [-inf, inf]. Sigmoid layer is applied before multi-target prediction, so the threshold should be in [0,1].
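The two prediction modes can be sketched in plain Python on nested lists instead of tensors. Two assumptions are made here: a score exactly equal to the threshold predicts 1, and a tie in single_target mode goes to the first highest-scoring class. binary_predictions is a hypothetical stand-in for the tensor function.

```python
import math


def binary_predictions(scores, mode, threshold=0.5):
    """scores: one list of unbounded class scores per sample."""
    if mode is None:
        return []
    if mode == "single_target":
        # 1 for the highest-scoring class of each sample, 0 elsewhere
        return [[int(i == row.index(max(row))) for i in range(len(row))] for row in scores]
    if mode == "multi_target":
        n_classes = len(scores[0])
        # accept one threshold for all classes or one threshold per class
        if isinstance(threshold, (list, tuple)):
            thresholds = list(threshold)
        else:
            thresholds = [threshold] * n_classes
        # sigmoid is applied first, so thresholds are compared in [0, 1]
        return [
            [int(1 / (1 + math.exp(-s)) >= t) for s, t in zip(row, thresholds)]
            for row in scores
        ]
    raise ValueError(f"unknown mode: {mode!r}")


scores = [[-2.0, 0.5, 3.0]]
print(binary_predictions(scores, "single_target"))
print(binary_predictions(scores, "multi_target", threshold=0.5))
```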
Module to initialize PyTorch CNN architectures with custom output shape
This module allows the use of several built-in CNN architectures from PyTorch. The architecture refers to the specific layers and layer input/output shapes (including convolution sizes and strides, etc) - such as the ResNet18 or Inception V3 architecture.
We provide wrappers which modify the output layer to the desired shape (to match the number of classes). The way to change the output layer shape depends on the architecture, which is why we need a wrapper for each one. This code is based on pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
To use these wrappers, for example, if your model has 10 output classes, write
my_arch=resnet18(10)
Then you can initialize a model object from opensoundscape.torch.models.cnn with your architecture:
model=PytorchModel(my_arch,classes)
or override an existing model’s architecture:
model.network = my_arch
Note: the InceptionV3 architecture must be used differently than other architectures - the easiest way is to simply use the InceptionV3 class in opensoundscape.torch.models.cnn.
-
opensoundscape.torch.architectures.cnn_architectures.
alexnet
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for AlexNet architecture
input size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
densenet121
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for densenet121 architecture
input size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
inception_v3
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for Inception v3 architecture
Input: 299x299
WARNING: expects (299,299) sized images and has auxiliary output. See InceptionV3 class in opensoundscape.torch.models.cnn for use.
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
resnet101
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for ResNet101 architecture
input_size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
resnet152
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for ResNet152 architecture
input_size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
resnet18
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for ResNet18 architecture
input_size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
resnet34
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for ResNet34 architecture
input_size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
resnet50
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for ResNet50 architecture
input_size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
set_parameter_requires_grad
(model, freeze_feature_extractor)¶ if necessary, remove gradients of all model parameters
if freeze_feature_extractor is True, we set requires_grad=False for all features in the feature extraction block. We would do this if we have a pre-trained CNN and only want to change the shape of the final layer, then train only that final classification layer without modifying the weights of the rest of the network.
-
opensoundscape.torch.architectures.cnn_architectures.
squeezenet1_0
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for squeezenet architecture
input size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
-
opensoundscape.torch.architectures.cnn_architectures.
vgg11_bn
(num_classes, freeze_feature_extractor=False, use_pretrained=True)¶ Wrapper for vgg11 architecture
input size = 224
Parameters: - num_classes – number of output nodes for the final layer
- freeze_feature_extractor – if False (default), entire network will have gradients and can train; if True, feature block is frozen and only final layer is trained
- use_pretrained – if True, uses pre-trained ImageNet features from Pytorch’s model zoo.
defines feature extractor and Architecture class for ResNet CNN
This implementation of the ResNet18 architecture allows for separate access to the feature extraction and classification blocks. This can be useful, for instance, to freeze the feature extractor and only train the classifier layer; or to specify different learning rates for the two blocks.
This implementation is used in the Resnet18Binary and Resnet18Multiclass classes of opensoundscape.torch.models.cnn.
-
class
opensoundscape.torch.architectures.resnet.
ResNetArchitecture
(num_cls, weights_init='ImageNet', num_layers=18, init_classifier_weights=False)¶ ResNet architecture with 18 or 50 layers
This implementation enables separate access to feature and classification blocks.
Parameters: - num_cls – number of classes (int)
- weights_init –
- “ImageNet”: load the pre-trained weights for ImageNet dataset
- path: load weights from a path on your computer or a url
- None: initialize with random weights
- num_layers – 18 for Resnet18 or 50 for Resnet50
- init_classifier_weights –
- if True, load the weights of the classification layer as well as feature extraction layers
- if False (default), only load the weights of the feature extraction layers
-
load
(init_path, init_classifier_weights=True, verbose=False)¶ load state dict (weights) of the feature+classifier; optionally load only feature weights, not classifier weights
Parameters: - init_path –
- url containing “http”: download weights from web
- path: load weights from local path
- init_classifier_weights –
- if True (default), load the weights of the classification layer as well as feature extraction layers
- if False, only load the weights of the feature extraction layers
- verbose – if True, print missing/unused keys [default: False]
-
class
opensoundscape.torch.architectures.resnet.
ResNetFeature
(block, layers, zero_init_residual=False, groups=1, width_per_group=64, replace_stride_with_dilation=None, norm_layer=None)¶
-
class
opensoundscape.torch.architectures.utils.
BaseArchitecture
¶ Base architecture for reference.
-
class
opensoundscape.torch.architectures.utils.
CompositeArchitecture
(*args, **kwargs)¶ Architecture with separate feature and classifier blocks
Data Selection¶
-
opensoundscape.data_selection.
resample
(df, n_samples_per_class, upsample=True, downsample=True, random_state=None)¶ resample a one-hot encoded label df for a target n_samples_per_class
Parameters: - df – dataframe with one-hot encoded labels: columns are classes, index is sample name/path
- n_samples_per_class – target number of samples per class
- upsample – if True, duplicate samples for classes with <n samples to get to n samples
- downsample – if True, randomly sample classes with >n samples to get to n samples
- random_state – passed to np.random calls. If None, random state is not fixed.
Note: The algorithm assumes that the label df is single-label. If the label df is multi-label, some classes can end up over-represented.
Note 2: The resulting df will have samples ordered by class label, even if the input df had samples in a random order.
-
opensoundscape.data_selection.
upsample
(input_df, label_column='Labels', random_state=None)¶ Given an input DataFrame of categorical labels, upsample to maximum value
Upsampling removes the class imbalance in your dataset. Rows for each label are repeated up to max_count // rows. Then, we randomly sample the rows to fill up to max_count.
The input df is NOT one-hot encoded in this case, but instead contains categorical labels in a specified label_column
Parameters: - input_df – A DataFrame to upsample
- label_column – The column to draw unique labels from
- random_state – Set the random_state during sampling
Returns: An upsampled DataFrame
Return type: df
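The repeat-then-fill procedure described for upsample can be sketched without pandas. The version below works on a plain list of rows with a label accessor, which is an assumption made to keep the example self-contained; upsample_rows and its parameters are hypothetical names, not the library function.

```python
import random
from collections import defaultdict


def upsample_rows(rows, label_of, random_state=None):
    """Repeat rows so every label appears as often as the most common label."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    max_count = max(len(group) for group in by_label.values())

    result = []
    for group in by_label.values():
        repeats, remainder = divmod(max_count, len(group))
        result.extend(group * repeats)  # whole copies: max_count // len(group)
        result.extend(rng.sample(group, remainder))  # random fill up to max_count
    return result


rows = [("a.wav", "owl"), ("b.wav", "frog"), ("c.wav", "frog"), ("d.wav", "frog")]
balanced = upsample_rows(rows, label_of=lambda row: row[1], random_state=0)
print(len(balanced))
```

After upsampling, every label has max_count rows, so the single owl row appears three times alongside the three frog rows.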
Grad Cam¶
GradCAM is a method of visualizing the activation of the network on parts of an image
# Author: Kazuto Nakashima # URL: http://kazuto1011.github.io # Created: 2017-05-26
Loss Functions¶
loss function classes to use with opensoundscape models
-
class
opensoundscape.torch.loss.
BCEWithLogitsLoss_hot
¶ use pytorch’s nn.BCEWithLogitsLoss for one-hot labels by simply converting y from long to float
-
class
opensoundscape.torch.loss.
CrossEntropyLoss_hot
¶ use pytorch’s nn.CrossEntropyLoss for one-hot labels by converting labels from 1-hot to integer labels
throws a ValueError if labels are not one-hot
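The label conversion can be illustrated in plain Python. This sketch shows only the one-hot to integer step and the one-hot check; the class itself would then pass the integer labels to nn.CrossEntropyLoss. one_hot_to_class_index is a hypothetical helper.

```python
def one_hot_to_class_index(labels):
    """Convert rows of one-hot labels to integer class indices."""
    indices = []
    for row in labels:
        # a one-hot row holds exactly one 1 and zeros everywhere else
        if sorted(row) != [0] * (len(row) - 1) + [1]:
            raise ValueError(f"labels are not one-hot: {row}")
        indices.append(list(row).index(1))
    return indices


print(one_hot_to_class_index([[0, 1, 0], [1, 0, 0]]))  # [1, 0]
```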
-
class
opensoundscape.torch.loss.
ResampleLoss
(class_freq, reduction='mean', loss_weight=1.0)¶
-
opensoundscape.torch.loss.
reduce_loss
(loss, reduction)¶ Reduce loss as specified.
Parameters: - loss (Tensor) – Elementwise loss tensor.
- reduction (str) – Options are “none”, “mean” and “sum”.
Returns: Reduced loss tensor.
Return type: Tensor
-
opensoundscape.torch.loss.
weight_reduce_loss
(loss, weight=None, reduction='mean', avg_factor=None)¶ Apply element-wise weight and reduce loss.
Parameters: - loss (Tensor) – Element-wise loss.
- weight (Tensor) – Element-wise weights.
- reduction (str) – Same as built-in losses of PyTorch.
- avg_factor (float) – Average factor when computing the mean of losses.
Returns: Processed loss values.
Return type: Tensor
Safe Dataloading¶
Dataset wrapper to handle errors gracefully in Preprocessor classes
A SafeDataset handles errors in a potentially misleading way: If an error is raised while trying to load a sample, the SafeDataset will instead load a different sample. The indices of any samples that failed to load will be stored in ._unsafe_indices.
The behavior may be desirable for training a model, but could cause silent errors when predicting with a model (replacing a bad file with a different file), and you should always be careful to check for ._unsafe_indices after using a SafeDataset.
based on an implementation by @msamogh in nonechucks (github.com/msamogh/nonechucks/)
-
class
opensoundscape.torch.safe_dataset.
SafeDataset
(dataset, unsafe_behavior, eager_eval=False)¶ A wrapper for a Dataset that handles errors when loading samples
WARNING: When iterating, will skip the failed sample, but when using within a DataLoader, finds the next good sample and uses it for the current index (see __getitem__).
Parameters: - dataset – a torch Dataset instance or child such as a Preprocessor
- eager_eval – If True, checks if every file is able to be loaded during initialization (logs _safe_indices and _unsafe_indices)
Attributes: _safe_indices and _unsafe_indices can be accessed later to check which samples threw errors.
-
_build_index
()¶ tries to load each sample, logs _safe_indices and _unsafe_indices
-
__getitem__
(index)¶ If loading an index fails, keeps trying the next index until success
-
_safe_get_item
()¶ Tries to load a sample, returns None if error occurs
-
is_index_built
¶ Returns True if all indices of the original dataset have been classified into _safe_indices or _unsafe_indices.
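The substitution behavior warned about above can be seen in a small stand-alone sketch. It is not the library class: it assumes that a failed index falls through to the following indices without wrapping around, and both SafeDatasetSketch and FlakyDataset are hypothetical names.

```python
class SafeDatasetSketch:
    """Wrap an indexable dataset and substitute samples that fail to load."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._safe_indices = []
        self._unsafe_indices = []

    def __len__(self):
        return len(self.dataset)

    def _safe_get_item(self, index):
        """Return the sample, or None if loading it raised an exception."""
        try:
            sample = self.dataset[index]
        except Exception:
            if index not in self._unsafe_indices:
                self._unsafe_indices.append(index)
            return None
        if index not in self._safe_indices:
            self._safe_indices.append(index)
        return sample

    def __getitem__(self, index):
        """Try `index`, then each following index, until one loads."""
        for candidate in range(index, len(self.dataset)):
            sample = self._safe_get_item(candidate)
            if sample is not None:
                return sample
        raise IndexError("no loadable sample at or after this index")


class FlakyDataset:
    """Hypothetical dataset whose sample 1 always fails to load."""

    def __len__(self):
        return 3

    def __getitem__(self, index):
        if index == 1:
            raise IOError("corrupt file")
        return f"sample-{index}"


safe = SafeDatasetSketch(FlakyDataset())
print(safe[1], safe._unsafe_indices)
```

Requesting index 1 silently returns the sample stored at index 2, which is why _unsafe_indices should be inspected after prediction.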
Sampling¶
classes for strategically sampling within a DataLoader
-
class
opensoundscape.torch.sampling.
ClassAwareSampler
(labels, num_samples_cls=1)¶ In each batch of samples, pick a limited number of classes to include and give even representation to each class
-
class
opensoundscape.torch.sampling.
ImbalancedDatasetSampler
(dataset, indices=None, num_samples=None, callback_get_label=None)¶ Samples elements randomly from a given list of indices for imbalanced dataset
Parameters: - indices (list, optional) – a list of indices
- num_samples (int, optional) – number of samples to draw
- callback_get_label (func) – a callback-like function which takes two arguments - dataset and index
Performance Metrics¶
-
opensoundscape.metrics.
binary_metrics
(targets, preds, class_names=[0, 1])¶ labels should be single-target
-
opensoundscape.metrics.
multiclass_metrics
(targets, preds, class_names)¶ provide a list or np.array of 0,1 targets and predictions
-
opensoundscape.metrics.
predict
(scores, single_target=False, threshold=0.5)¶ convert numeric scores to binary predictions
return 0/1 for an array of scores: samples (rows) x classes (columns)
Parameters: - scores – a 2-d list or np.array. row=sample, columns=classes
- single_target – if True, predict 1 for highest scoring class per sample, 0 for other classes. If False, predict 1 for all scores > threshold [default: False]
- threshold – Predict 1 for score > threshold. only used if single_target = False. [default: 0.5]
Preprocessing¶
Image Augmentation¶
Transforms and augmentations for PIL.Images
-
opensoundscape.preprocess.img_augment.
time_split
(img, seed=None)¶ Given a PIL.Image, split into left/right parts and swap
Randomly chooses the slicing location. For example, if h is chosen as the split point, abcdefghijklmnop becomes hijklmnop + abcdefg
Parameters: img – A PIL.Image Returns: A PIL.Image
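The left/right swap can be sketched on a NumPy array standing in for the image pixels. This is an illustration under that assumption, not the library code, which operates on a PIL.Image; time_split_sketch is a hypothetical name.

```python
import numpy as np

def time_split_sketch(pixels, seed=None):
    """Swap the left and right parts of a (height, width) array at a random column."""
    rng = np.random.default_rng(seed)
    width = pixels.shape[1]
    split = int(rng.integers(0, width))  # column where the right part begins
    swapped = np.concatenate([pixels[:, split:], pixels[:, :split]], axis=1)
    return swapped, split

image = np.arange(12).reshape(2, 6)  # toy 2x6 "image"
swapped, split = time_split_sketch(image, seed=0)
```

The output has the same shape as the input; only the order of the columns changes.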
Preprocessing Actions¶
Actions for augmentation and preprocessing pipelines
This module contains Action classes which act as the elements in Preprocessor pipelines. Action classes have go(), on(), off(), and set() methods. They take a single sample of a specific type and return the transformed or augmented sample, which may or may not be the same type as the original.
See the preprocessor module and Preprocessing tutorial for details on how to use and create your own actions.
-
class
opensoundscape.preprocess.actions.
ActionContainer
¶ this is a container object which holds instances of Action child-classes
the Actions it contains each have .go(), .on(), .off(), .set(), .get()
The actions are un-ordered and may not all be used. In preprocessor objects such as AudioToSpectrogramPreprocessor, Actions from the action container are listed in a pipeline(list), which defines their order of use.
To add actions to the container: action_container.loader = AudioLoader()
To set parameters of actions: action_container.loader.set(param=value,…)
Methods: list_actions()
-
class
opensoundscape.preprocess.actions.
AudioClipLoader
(**kwargs)¶ Action to load only a specific segment of an audio file
Loads an audio file or part of a file. see Audio.from_file() for documentation.
Parameters: see Audio.from_file(). Note: the default sample_rate=None means the file’s sample rate is used and the audio is not resampled
-
class
opensoundscape.preprocess.actions.
AudioLoader
(**kwargs)¶ Action child class for Audio.from_file() (path -> Audio)
Loads an audio file or part of a file. see Audio.from_file() for documentation.
Parameters: see Audio.from_file(). Note: the default sample_rate=None means the file’s sample rate is used and the audio is not resampled
-
class
opensoundscape.preprocess.actions.
AudioToMelSpectrogram
(**kwargs)¶ Action child class for MelSpectrogram.from_audio() (Audio -> MelSpectrogram)
see spectrogram.MelSpectrogram.from_audio for documentation
Parameters: - n_mels – Number of mel bands to generate [default: 128] Note: n_mels should be chosen for compatibility with the Spectrogram parameter window_samples. Choosing a value > ~ window_samples/10 will result in zero-valued rows while small values blend rows from the original spectrogram.
- window_type – The windowing function to use [default: “hann”]
- window_samples – n samples per window [default: 512]
- overlap_samples – n samples shared by consecutive windows [default: 256]
- htk – use HTK mel-filter bank instead of Slaney, see Librosa docs [default: False]
- norm='slaney' – mel filter bank normalization, see Librosa docs
- dB_scale=True – If True, rescales values to decibels, x=10*log10(x) - if dB_scale is False, decibel_limits is ignored
-
class
opensoundscape.preprocess.actions.
AudioToSpectrogram
(**kwargs)¶ Action child class for Spectrogram.from_audio() (Audio -> Spectrogram)
see spectrogram.Spectrogram.from_audio for documentation
Parameters: - window_type="hann" – see scipy.signal.spectrogram docs for description of window parameter
- window_samples=512 – number of audio samples per spectrogram window (pixel)
- overlap_samples=256 – number of samples shared by consecutive windows
- decibel_limits – limit the dB values to (min,max) (lower values set to min, higher values set to max)
- dB_scale=True – If True, rescales values to decibels, x=10*log10(x) - if dB_scale is False, decibel_limits is ignored
-
class
opensoundscape.preprocess.actions.
AudioTrimmer
(**kwargs)¶ Action child class for trimming audio (Audio -> Audio)
Trims an audio file to the desired length. Allows audio to be trimmed from the start or from a random time. Optionally extends audio shorter than audio_length with silence.
Parameters: - audio_length – desired final length (sec); if None, no trim is performed
- extend – if True, clips shorter than audio_length are extended with silence to required length
- random_trim – if True, a random segment of length audio_length is chosen from the input audio. If False, the file is trimmed from 0 seconds to audio_length seconds.
-
class
opensoundscape.preprocess.actions.
BaseAction
(**kwargs)¶ Parent class for all Actions (used in Preprocessor pipelines)
New actions should subclass this class.
Subclasses should set self.requires_labels = True if go() expects (X,y) instead of (X). y is a row of a dataframe (a pd.Series) with index (.name) = original file path, columns=class names, values=labels (0,1). X is the sample, and can be of various types (path, Audio, Spectrogram, Tensor, etc). See ImgOverlay for an example of an Action that uses labels.
-
class
opensoundscape.preprocess.actions.
FrequencyMask
(**kwargs)¶ add random horizontal bars over image
Parameters: - max_masks – max number of horizontal bars [default: 3]
- max_width – maximum size of horizontal bars as fraction of image height
-
go
(x)¶ torch Tensor in, torch Tensor out
-
class
opensoundscape.preprocess.actions.
ImgOverlay
(overlay_df, audio_length, loader_pipeline, update_labels, **kwargs)¶ iteratively overlay images on top of each other
Overlays images from overlay_df on top of the sample with probability overlay_prob until stopping condition. If necessary, trims overlay audio to the length of the input audio. Overlays the images on top of each other with a weight.
Overlays can be used in a few general ways:
- a separate df where any file can be overlayed (overlay_class=None)
- same df as training, where the overlay class is “different”, i.e., does not contain overlapping labels with the original sample
- same df as training, where samples from a specific class are used for overlays
Parameters: - overlay_df – a labels dataframe with audio files as the index and classes as columns
- audio_length – length in seconds of original audio sample
- loader_pipeline – the preprocessing pipeline to load audio -> spec
- update_labels – if True, add overlayed sample’s labels to original sample
- overlay_class – how to choose files from overlay_df to overlay. Options [default: “different”]:
  - None: randomly select any file from overlay_df
  - “different”: select a random file from overlay_df containing none of the classes this file contains
  - specific class name: always choose files from this class
- overlay_prob – the probability of applying each subsequent overlay
- max_overlay_num – the maximum number of samples to overlay on original; for example, if overlay_prob = 0.5 and max_overlay_num=2, 1/2 of images will receive 1 overlay and 1/4 will receive an additional second overlay
- overlay_weight – can be a float between 0-1 or range of floats (chooses randomly from within range) such as [0.1,0.7]. An overlay_weight <0.5 means more emphasis on original image.
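The interaction of overlay_prob, max_overlay_num and overlay_weight can be sketched in plain Python. Both helper functions are hypothetical, and the weighted-average blend formula is an assumption chosen to match the description of overlay_weight; the library's actual blending may differ.

```python
import random

def count_overlays(overlay_prob, max_overlay_num, rng):
    """Number of overlays one sample receives under the stopping rule."""
    n = 0
    # keep adding overlays until a draw fails or the maximum is reached
    while n < max_overlay_num and rng.random() < overlay_prob:
        n += 1
    return n

def blend(original, overlay, overlay_weight):
    """Weighted average of two equally sized pixel lists (assumed formula)."""
    return [(1 - overlay_weight) * a + overlay_weight * b
            for a, b in zip(original, overlay)]

rng = random.Random(0)
counts = [count_overlays(0.5, 2, rng) for _ in range(20000)]
frac_at_least_one = sum(c >= 1 for c in counts) / len(counts)  # close to 1/2
frac_two = sum(c == 2 for c in counts) / len(counts)           # close to 1/4
blended = blend([0.0, 1.0], [1.0, 1.0], overlay_weight=0.25)
```

With overlay_weight=0.25 the original image contributes three quarters of each blended pixel value.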
-
go
(x, x_labels)¶ Overlay images from overlay_df
-
class
opensoundscape.preprocess.actions.
ImgToTensor
(**kwargs)¶ Convert PIL image to RGB Tensor (PIL.Image -> Tensor)
Converts a PIL.Image with range [0,255] to a torch Tensor with range [0,1]. Converts the image to RGB (3 channels).
-
class
opensoundscape.preprocess.actions.
ImgToTensorGrayscale
(**kwargs)¶ Convert PIL image to greyscale Tensor (PIL.Image -> Tensor)
Converts a PIL.Image with range [0,255] to a torch Tensor with range [0,1]. Converts the image to grayscale (1 channel).
-
class
opensoundscape.preprocess.actions.
SaveTensorToDisk
(save_path, **kwargs)¶ save a torch Tensor to disk (Tensor -> Tensor)
Requires x_labels because the index of the label-row (.name) gives the original file name for this sample.
Uses torchvision.utils.save_image. Creates save_path dir if it doesn’t exist
Parameters: save_path – a directory where tensor will be saved
-
go
(x, x_labels)¶ we require x_labels because the .name gives origin file name
-
-
class
opensoundscape.preprocess.actions.
SpecToImg
(**kwargs)¶ Action class to transform Spectrogram to PIL image
(Spectrogram -> PIL.Image)
Parameters: - destination – a file path (string)
- shape=None – image dimensions for 1 channel, (height, width)
- mode="RGB" – RGB for 3-channel color or “L” for 1-channel grayscale
- colormap=None – (str) Matplotlib color map name (if None, greyscale)
-
class
opensoundscape.preprocess.actions.
SpectrogramBandpass
(**kwargs)¶ Action class for Spectrogram.bandpass() (Spectrogram -> Spectrogram)
see opensoundscape.spectrogram.Spectrogram.bandpass() for documentation
To bandpass the spectrogram from 1 kHz to 5 kHz: action = SpectrogramBandpass(1000,5000)
Parameters: - min_f – low frequency in Hz for bandpass
- max_f – high frequency in Hz for bandpass
- out_of_bounds_ok – if False, raises error if min or max beyond spec limits
-
class
opensoundscape.preprocess.actions.
TensorAddNoise
(**kwargs)¶ Add gaussian noise to sample (Tensor -> Tensor)
Parameters: std – standard deviation for Gaussian noise [default: 1]
Note: be aware that scaling before/after this action will change the effect of a fixed stdev Gaussian noise
-
class
opensoundscape.preprocess.actions.
TensorAugment
(**kwargs)¶ combination of 3 augmentations with hard-coded parameters
time warp, time mask, and frequency mask
use (bool) time_warp, time_mask, freq_mask to turn each on/off
Note: This function reduces the image to greyscale then duplicates the image across the 3 channels
-
go
(x)¶ torch Tensor in, torch Tensor out
-
-
class
opensoundscape.preprocess.actions.
TensorNormalize
(**kwargs)¶ torchvision.transforms.Normalize (WARNING: FIXED shift and scale)
(Tensor->Tensor)
WARNING: This does not perform per-image normalization. Instead, it takes as arguments a fixed u and s, ie for the entire dataset, and performs X=(X-u)/s.
Parameters: - mean=0.5 –
- std=0.5 –
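The difference between a fixed shift and scale and per-image normalization can be shown with a short NumPy sketch. The function names are hypothetical and the arrays stand in for tensors.

```python
import numpy as np

def fixed_normalize(x, mean=0.5, std=0.5):
    """Apply the same shift and scale to every sample: (x - mean) / std."""
    return (np.asarray(x, dtype=float) - mean) / std

def per_image_normalize(x):
    """For contrast only: shift and scale by this sample's own statistics."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

sample = np.array([0.0, 0.5, 1.0])
normalized = fixed_normalize(sample)      # maps [0, 1] onto [-1, 1]
per_image = per_image_normalize(sample)   # zero mean, unit std for this sample
```

With the default mean=0.5 and std=0.5, values in [0,1] map to [-1,1] regardless of the content of the individual sample.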
-
class
opensoundscape.preprocess.actions.
TimeMask
(**kwargs)¶ add random vertical bars over image (Tensor -> Tensor)
Parameters: - max_masks – maximum number of bars [default: 3]
- max_width – maximum width of vertical bars as fraction of image width [default: 0.2]
-
class
opensoundscape.preprocess.actions.
TimeWarp
(**kwargs)¶ Time warp is an experimental augmentation that creates a tilted image.
Parameters: warp_amount – use higher values for more skew and offset (experimental)
Note: this augmentation reduces the image to greyscale and duplicates the result across the 3 channels.
-
class
opensoundscape.preprocess.actions.
TorchColorJitter
(**kwargs)¶ Action class for torchvision.transforms.ColorJitter
(Tensor -> Tensor) or (PIL Img -> PIL Img)
Parameters: - brightness=0.3 –
- contrast=0.3 –
- saturation=0.3 –
- hue=0 –
-
class
opensoundscape.preprocess.actions.
TorchRandomAffine
(**kwargs)¶ Action class for torchvision.transforms.RandomAffine
(Tensor -> Tensor) or (PIL Img -> PIL Img)
Parameters: - = 0 (degrees) –
- = (fill) –
- = –
Note: If applying per-image normalization, we recommend applying RandomAffine after image normalization. In this case, an intermediate gray value is ~0. If normalization is applied after RandomAffine on a PIL image, use an intermediate fill color such as (122,122,122).
Preprocessors¶
-
class
opensoundscape.preprocess.preprocessors.
AudioLoadingPreprocessor
(df, return_labels=True, audio_length=None)¶ creates Audio objects from file paths
Parameters: - df – dataframe of audio clips. df must have audio paths in the index. If df has labels, the class names should be the columns, and the values of each row should be 0 or 1. If data does not have labels, df will have no columns
- return_labels – if True, __getitem__ returns {“X”:batch_tensors,”y”:labels} if False, __getitem__ returns {“X”:batch_tensors} [default: True]
- audio_length – length in seconds of audio to return - None: do not trim the original audio - seconds (float): trim longer audio to this length. Shorter audio input will raise a ValueError.
-
class
opensoundscape.preprocess.preprocessors.
AudioToSpectrogramPreprocessor
(df, audio_length=None, out_shape=[224, 224], return_labels=True)¶ loads audio paths, creates spectrogram, returns tensor
By default, does not resample audio, but bandpasses to 0-11025 Hz (to ensure all outputs have the same scale on the y-axis). The sample rate can be changed with .actions.load_audio.set(sample_rate=sr)
Parameters: - df – dataframe of audio clips. df must have audio paths in the index. If df has labels, the class names should be the columns, and the values of each row should be 0 or 1. If data does not have labels, df will have no columns
- audio_length – length in seconds of audio clips [default: None] If provided, longer clips trimmed to this length. By default, shorter clips will not be extended (modify actions.AudioTrimmer to change behavior).
- out_shape – output shape of tensor in pixels [default: [224,224]]
- return_labels – if True, the __getitem__ method will return {X:sample,y:labels} If False, the __getitem__ method will return {X:sample} If df has no labels (no columns), use return_labels=False [default: True]
-
class
opensoundscape.preprocess.preprocessors.
BasePreprocessor
(df, return_labels=True)¶ Base class for Preprocessing pipelines (use in place of torch Dataset)
Custom Preprocessor classes should subclass this class or its children
Parameters: - df – dataframe of audio clips. df must have audio paths in the index. If df has labels, the class names should be the columns, and the values of each row should be 0 or 1. If data does not have labels, df will have no columns
- return_labels – if True, the __getitem__ method will return {X:sample,y:labels} If False, the __getitem__ method will return {X:sample} If df has no labels (no columns), use return_labels=False [default: True]
Raises: PreprocessingError if exception is raised during __getitem__
-
class_counts_cal
()¶ count number of each label
-
head
(n=5)¶ out-of-place copy of first n samples
performs df.head(n) on self.df
Parameters: n – number of first samples to return, see pandas.DataFrame.head() [default: 5]
Returns: a new dataset object
-
pipeline_summary
()¶ Generate a DataFrame describing the current pipeline
The DataFrame has columns for name (corresponds to the attribute name, eg ‘to_img’ for self.actions.to_img), on (not bypassed) / off (bypassed), and action_reference (a reference to the object)
-
sample
(**kwargs)¶ out-of-place random sample
creates copy of object with n rows randomly sampled from dataframe
Args: see pandas.DataFrame.sample()
Returns: a new dataset object
-
class
opensoundscape.preprocess.preprocessors.
ClipLoadingSpectrogramPreprocessor
(df)¶ load audio samples from long audio files
Directly loads a part of an audio file, e.g. 5-10 seconds, without loading the entire file. This allows for prediction on long audio files without needing to pre-split or load large files into memory.
It will load the requested audio segments into samples, regardless of length
Parameters: df – a dataframe with file paths as index and 2 columns: [‘start_time’,’end_time’] (seconds since beginning of file)
Returns: ClipLoadingSpectrogramPreprocessor object
Examples: You can quickly create such a df for a set of audio files like this:
```
import librosa
import pandas as pd
from glob import glob
from opensoundscape.helpers import generate_clip_times_df

files = glob('/path_to//.WAV')  # get list of full-length files
clip_dfs = []
clip_duration = 5.0
clip_overlap = 0.0
for f in files:
    t = librosa.get_duration(filename=f)
    clips = generate_clip_times_df(t, clip_duration, clip_overlap)
    clips.index = [f] * len(clips)
    clips.index.name = 'file'
    clip_dfs.append(clips)
clip_df = pd.concat(clip_dfs)  # contains clip times for all files
```
If you use this preprocessor with model.predict(), it will work, but the scores/predictions df will only have file paths, not the times of clips. You will want to re-add the start and end times of clips as a multi-index:
```
score_df = model.predict(clip_loading_ds)  # for instance
score_df.index = pd.MultiIndex.from_arrays(
    [clip_df.index, clip_df['start_time'], clip_df['end_time']]
)
```
-
class
opensoundscape.preprocess.preprocessors.
CnnPreprocessor
(df, audio_length=None, return_labels=True, debug=None, overlay_df=None, out_shape=[224, 224])¶ Child of AudioToSpectrogramPreprocessor with full augmentation pipeline
loads audio, creates spectrogram, performs augmentations, returns tensor
By default, does not resample audio, but bandpasses to 0-10 kHz (to ensure all outputs have the same scale on the y-axis). The sample rate can be changed with .actions.load_audio.set(sample_rate=sr)
Parameters: - df – dataframe of audio clips. df must have audio paths in the index. If df has labels, the class names should be the columns, and the values of each row should be 0 or 1. If data does not have labels, df will have no columns
- audio_length – length in seconds of audio clips [default: None] If provided, longer clips trimmed to this length. By default, shorter clips will not be extended (modify actions.AudioTrimmer to change behavior).
- out_shape – output shape of tensor in pixels [default: [224,224]]
- return_labels – if True, the __getitem__ method will return {X:sample,y:labels} If False, the __getitem__ method will return {X:sample} If df has no labels (no columns), use return_labels=False [default: True]
- debug – If a path is provided, generated samples (after all augmentation) will be saved to the path as an image. This is useful for checking that the sample provided to the model matches expectations. [default: None]
-
augmentation_off
()¶ use pipeline that skips all augmentations
-
augmentation_on
()¶ use pipeline containing all actions including augmentations
-
exception
opensoundscape.preprocess.utils.
PreprocessingError
¶ Custom exception indicating that a Preprocessor pipeline failed
Tensor Augmentation¶
Augmentations and transforms for torch.Tensors
These functions were implemented for PyTorch in: https://github.com/zcaceres/spec_augment The original paper is available on https://arxiv.org/abs/1904.08779
-
opensoundscape.preprocess.tensor_augment.
freq_mask
(spec, F=30, max_masks=3, replace_with_zero=False)¶ draws horizontal bars over the image
F:maximum frequency-width of bars in pixels
max_masks: maximum number of bars to draw
replace_with_zero: if True, bars are 0s, otherwise, mean img value
-
opensoundscape.preprocess.tensor_augment.
time_mask
(spec, T=40, max_masks=3, replace_with_zero=False)¶ draws vertical bars over the image
T:maximum time-width of bars in pixels
max_masks: maximum number of bars to draw
replace_with_zero: if True, bars are 0s, otherwise, mean img value
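Both masks follow the same idea: overwrite a few randomly placed bars with zeros or with the mean value. A minimal NumPy sketch on a 2-D array follows. It is not the library code, which operates on torch Tensors; mask_bars_sketch is a hypothetical name, and drawing between 1 and max_masks bars is an assumption.

```python
import numpy as np

def mask_bars_sketch(spec, max_width, max_masks=3, axis=0,
                     replace_with_zero=False, seed=None):
    """Overwrite up to max_masks random bars along one axis of a 2-D array.

    axis=0 masks rows (horizontal bars, like a frequency mask);
    axis=1 masks columns (vertical bars, like a time mask).
    """
    rng = np.random.default_rng(seed)
    out = np.array(spec, dtype=float)  # copy so the input is untouched
    fill = 0.0 if replace_with_zero else out.mean()
    size = out.shape[axis]
    n_masks = int(rng.integers(1, max_masks + 1))  # assumed: 1..max_masks bars
    for _ in range(n_masks):
        width = int(rng.integers(0, min(max_width, size) + 1))
        start = int(rng.integers(0, size - width + 1))
        if axis == 0:
            out[start:start + width, :] = fill
        else:
            out[:, start:start + width] = fill
    return out

spec = np.random.default_rng(1).random((64, 100))
freq_masked = mask_bars_sketch(spec, max_width=30, axis=0,
                               replace_with_zero=True, seed=0)
time_masked = mask_bars_sketch(spec, max_width=40, axis=1, seed=0)
```

Here max_width plays the role of F (rows) or T (columns) in the function signatures above.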
-
opensoundscape.preprocess.tensor_augment.
time_warp
(spec, W=5)¶ apply time stretch and shearing to spectrogram
fills empty space on right side with horizontal bars
W controls amount of warping. Random with occasional large warp.
Signal Processing¶
RIBBIT¶
Detect periodic vocalizations with RIBBIT
This module provides functionality to search audio for periodically fluctuating vocalizations.
-
opensoundscape.ribbit.
calculate_pulse_score
(amplitude, amplitude_sample_rate, pulse_rate_range, plot=False, nfft=1024)¶ Search for amplitude pulsing in an audio signal in a range of pulse repetition rates (PRR)
scores an audio amplitude signal by highest value of power spectral density in the PRR range
Parameters: - amplitude – a time series of the audio signal’s amplitude (for instance a smoothed raw audio signal)
- amplitude_sample_rate – sample rate in Hz of amplitude signal, normally ~20-200 Hz
- pulse_rate_range – [min, max] values for amplitude modulation in Hz
- plot=False – if True, creates a plot visualizing the power spectral density
- nfft=1024 – controls the resolution of the power spectral density (see scipy.signal.welch)
Returns: pulse rate score for this audio segment (float)
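The scoring idea can be sketched with NumPy alone. Note the substitution: this sketch uses a plain FFT periodogram rather than scipy.signal.welch, so the numeric values will differ from the library's. The name pulse_score_sketch and the test signal are hypothetical.

```python
import numpy as np

def pulse_score_sketch(amplitude, amplitude_sample_rate, pulse_rate_range, nfft=1024):
    """Highest power spectral density value inside the pulse rate range."""
    amplitude = np.asarray(amplitude, dtype=float)
    amplitude = amplitude - amplitude.mean()      # remove the DC offset
    spectrum = np.fft.rfft(amplitude, n=nfft)     # zero-pads or truncates to nfft
    psd = np.abs(spectrum) ** 2 / (amplitude_sample_rate * nfft)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / amplitude_sample_rate)
    in_range = (freqs >= pulse_rate_range[0]) & (freqs <= pulse_rate_range[1])
    return float(psd[in_range].max())

sr = 100                                   # amplitude sample rate in Hz
t = np.arange(0, 5, 1 / sr)
pulsing = 1 + np.sin(2 * np.pi * 15 * t)   # amplitude pulsing 15 times per second
score_in = pulse_score_sketch(pulsing, sr, [10, 20])    # range contains 15 Hz
score_out = pulse_score_sketch(pulsing, sr, [30, 40])   # range misses 15 Hz
```

A signal pulsing at 15 Hz scores much higher for a pulse_rate_range that contains 15 Hz than for one that does not.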
-
opensoundscape.ribbit.
ribbit
(spectrogram, signal_band, pulse_rate_range, clip_duration, clip_overlap=0, final_clip=None, noise_bands=None, plot=False)¶ Run RIBBIT detector to search for periodic calls in audio
This tool searches for periodic energy fluctuations at specific repetition rates and frequencies.
Parameters: - spectrogram – opensoundscape.Spectrogram object of an audio file
- signal_band – [min, max] frequency range of the target species, in Hz
- pulse_rate_range – [min,max] pulses per second for the target species
- clip_duration – the length of audio (in seconds) to analyze at one time - each clip is analyzed independently and receives a ribbit score
- clip_overlap (float) – overlap between consecutive clips (sec)
- final_clip (str) – behavior if the final clip is less than clip_duration seconds long. By default, discards remaining audio if less than clip_duration seconds long [default: None]. Options:
- None: Discard the remainder (do not make a clip)
- “remainder”: Use only remainder of Audio (final clip will be shorter than clip_duration)
- “full”: Increase overlap with previous clip to yield a clip with clip_duration length
Note that the “extend” option is not supported for RIBBIT.
- noise_bands – list of frequency ranges to subtract from the signal_band, for instance [ [min1,max1] , [min2,max2] ]; if None, no noise bands are used [default: None]
- plot=False – if True, plot the power spectral density for each clip
Returns: DataFrame of index=(‘start_time’,’end_time’), columns=[‘score’], with a row for each clip.
Notes
__PARAMETERS__ RIBBIT requires the user to select a set of parameters that describe the target vocalization. Here is some detailed advice on how to use these parameters.
Signal Band: The signal band is the frequency range where RIBBIT looks for the target species. It is best to pick a narrow signal band if possible, so that the model focuses on a specific part of the spectrogram and has less potential to include erroneous sounds.
Noise Bands: Optionally, users can specify other frequency ranges called noise bands. Sounds in the noise_bands are _subtracted_ from the signal_band. Noise bands help the model filter out erroneous sounds from the recordings, which could include confusion species, background noise, and popping/clicking of the microphone due to rain, wind, or digital errors. It’s usually good to include one noise band for very low frequencies – this specifically eliminates popping and clicking from being registered as a vocalization. It’s also good to specify noise bands that target confusion species. Another approach is to specify two narrow noise_bands that are directly above and below the signal_band.
Pulse Rate Range: This parameter specifies the minimum and maximum pulse rate (the number of pulses per second, also known as pulse repetition rate) RIBBIT should look for to find the focal species. For example, choosing pulse_rate_range = [10, 20] means that RIBBIT should look for pulses no slower than 10 pulses per second and no faster than 20 pulses per second.
Clip Duration: The clip_duration parameter tells RIBBIT how many seconds of audio to analyze at one time. Generally, you should choose a clip_duration that is similar to the length of the target species vocalization, or a little bit longer. For very slowly pulsing vocalizations, choose a longer window so that at least 5 pulses can occur in one window (0.5 pulses per second -> 10 second window). Typical values are 0.3 to 10 seconds. Also, clip_overlap can be used for overlap between sequential clips. This is more computationally expensive but will be more likely to center a target sound in the clip (with zero overlap, the target sound may be split up between adjacent clips).
Plot: We can choose to show the power spectrum of pulse repetition rate for each window by setting plot=True. The default is not to show these plots (plot=False).
__ALGORITHM__ This is the procedure RIBBIT follows:
- divide the audio into segments of length clip_duration
- for each clip:
  - calculate the time series of energy in the signal band (signal_band) and subtract noise band energies (noise_bands)
  - calculate the power spectral density of the amplitude time series
  - score the clip based on the maximum value of power spectral density in the pulse rate range
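The procedure can be sketched end to end on a toy spectrogram held in a NumPy array. This is a simplified illustration, not the library implementation: the function names are hypothetical, band energy is taken as the mean over rows (an assumption), a plain periodogram replaces the Welch estimate, and only full-length, non-overlapping clips are scored.

```python
import numpy as np

def band_amplitude(spec, freqs, band):
    """Mean spectrogram value over the rows whose frequency falls inside band."""
    rows = (freqs >= band[0]) & (freqs <= band[1])
    return spec[rows].mean(axis=0)

def ribbit_sketch(spec, freqs, times, signal_band, pulse_rate_range,
                  clip_duration, noise_bands=None):
    """Return (start_time, end_time, score) for each full-length clip."""
    frame_rate = 1.0 / (times[1] - times[0])   # spectrogram columns per second
    frames_per_clip = int(round(clip_duration * frame_rate))
    results = []
    for start in range(0, len(times) - frames_per_clip + 1, frames_per_clip):
        clip = spec[:, start:start + frames_per_clip]
        # energy in the signal band minus energy in each noise band
        amplitude = band_amplitude(clip, freqs, signal_band)
        for band in noise_bands or []:
            amplitude = amplitude - band_amplitude(clip, freqs, band)
        amplitude = amplitude - amplitude.mean()
        # power spectral density of the amplitude time series
        psd = np.abs(np.fft.rfft(amplitude)) ** 2 / (frame_rate * len(amplitude))
        psd_freqs = np.fft.rfftfreq(len(amplitude), d=1.0 / frame_rate)
        keep = (psd_freqs >= pulse_rate_range[0]) & (psd_freqs <= pulse_rate_range[1])
        score = float(psd[keep].max())
        results.append((float(times[start]), float(times[start]) + clip_duration, score))
    return results

# Toy spectrogram: 100 columns per second, 10 s long, one row every 100 Hz
times = np.arange(1000) / 100.0
freqs = np.arange(0, 5000, 100.0)
spec = np.zeros((len(freqs), len(times)))
pulse = 1 + np.sin(2 * np.pi * 12 * times)                    # 12 pulses per second
spec[(freqs >= 1000) & (freqs <= 2000), :500] = pulse[:500]   # call in first 5 s only
scores = ribbit_sketch(spec, freqs, times, [1000, 2000], [10, 15], clip_duration=5.0)
```

The first clip, which contains the pulsing call, receives a high score; the second, silent clip scores zero.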
Signal Processing¶
Signal processing tools for feature extraction and more
-
opensoundscape.signal.
cwt_peaks
(audio, center_frequency, wavelet='morl', peak_threshold=0.2, peak_separation=None, plot=False)¶ compute a cwt, post-process, then extract peaks
Performs a continuous wavelet transform (cwt) on an audio signal at a single frequency. It then squares, smooths, and normalizes the signal. Finally, it detects peaks in the resulting signal and returns the times and magnitudes of detected peaks. It is used as a feature extractor for Ruffed Grouse drumming detection.
Parameters: - audio – an Audio object
- center_frequency – the target frequency to extract peaks from
- wavelet – (str) name of a pywt wavelet, eg ‘morl’ (see pywt docs)
- peak_threshold – minimum height of peaks - if None, no minimum peak height - see “height” argument to scipy.signal.find_peaks
- peak_separation – minimum time between detected peaks, in seconds - if None, no minimum distance - see “distance” argument to scipy.signal.find_peaks
Returns: - peak_times – list of times (from beginning of signal) of each peak
- peak_levels – list of magnitudes of each detected peak
Note
consider downsampling audio to reduce computational cost. Audio must have sample rate of at least 2x target frequency.
-
opensoundscape.signal.
detect_peak_sequence_cwt
(audio, sr=400, window_len=60, center_frequency=50, wavelet='morl', peak_threshold=0.2, peak_separation=0.0375, dt_range=[0.05, 0.8], dy_range=[-0.2, 0], d2y_range=[-0.05, 0.15], max_skip=3, duration_range=[1, 15], points_range=[9, 100], plot=False)¶ Use a continuous wavelet transform to detect accelerating sequences
This function creates a continuous wavelet transform (cwt) feature and searches for accelerating sequences of peaks in the feature. It was developed to detect Ruffed Grouse drumming events in audio signals. Default parameters are tuned for Ruffed Grouse drumming detection.
Analysis is performed on analysis windows of fixed length without overlap. Detections from each analysis window across the audio file are aggregated.
Parameters: - audio – Audio object
- sr=400 – resample audio to this sample rate (Hz)
- window_len=60 – length of analysis window (sec)
- center_frequency=50 – target audio frequency of cwt
- wavelet='morl' – (str) pywt wavelet name (see pywavelets docs)
- peak_threshold=0.2 – height threshold (0-1) for peaks in normalized signal
- peak_separation=15/400 – min separation (sec) for peak finding
- dt_range=[0.05, 0.8] – sequence detection point-to-point criterion 1 - Note: the upper limit is also used as sequence termination criterion 2
- dy_range=[-0.2, 0] – sequence detection point-to-point criterion 2
- d2y_range=[-0.05, 0.15] – sequence detection point-to-point criterion 3
- max_skip=3 – sequence termination criterion 1: max sequential invalid points
- duration_range=[1, 15] – sequence criterion 1: length (sec) of sequence
- points_range=[9, 100] – sequence criterion 2: num points in sequence
- plot=False – if True, plot peaks and detected sequences with pyplot
Returns: dataframe summarizing detected sequences
Note: for Ruffed Grouse drumming, which is very low pitched, audio is resampled to 400 Hz. This greatly increases the efficiency of the cwt, but will only detect frequencies up to 400/2=200Hz. Generally, choose a resample frequency as low as possible but >=2x the target frequency
Note: the cwt signal is normalized on each analysis window, so changing the analysis window size can change the detection results.
Note: if there is an incomplete window remaining at the end of the audio file, it is discarded (not analyzed).
-
opensoundscape.signal.
find_accel_sequences
(t, dt_range=[0.05, 0.8], dy_range=[-0.2, 0], d2y_range=[-0.05, 0.15], max_skip=3, duration_range=[1, 15], points_range=[5, 100])¶ detect accelerating/decelerating sequences in time series
developed for detecting Ruffed Grouse drumming events in a series of peaks extracted from cwt signal
The algorithm computes the forward difference of t, y(t). It iterates through the [y(t), t] points searching for sequences of points that meet a set of conditions. It begins with an empty candidate sequence.
“Point-to-point criteria”: Valid ranges for dt, dy, and d2y are checked for each subsequent point and are based on previous points in the candidate sequence. If they are met, the point is added to the candidate sequence.
“Continuation criteria”: Conditions for max_skip and the upper bound of dt are used to determine when a sequence should be terminated.
- max_skip: max number of sequential invalid points before terminating
- dt<=dt_range[1]: if dt is long, sequence should be broken
“Sequence criteria”: When a sequence is terminated, it is evaluated on conditions for duration_range and points_range. If it meets these conditions, it is saved as a detected sequence.
- duration_range: length of sequence in seconds from first to last point
- points_range: number of points included in sequence
When a sequence is terminated, the search continues with the next point and an empty sequence.
Parameters: - t – (list or np.array) times of all detected peaks (seconds)
- dt_range=[0.05,0.8] – valid values for t(i) - t(i-1)
- dy_range=[-0.2,0] – valid values for change in y (grouse: difference in time between consecutive beats should decrease)
- d2y_range=[-0.05,0.15] – limit change in dy: should not show large decrease (sharp curve downward on y vs t plot)
- max_skip=3 – max invalid points between valid points for a sequence (grouse: should not have many noisy points between beats)
- duration_range=[1,15] – total duration of sequence (sec)
- points_range=[5,100] – total number of points in sequence
Returns: sequences_t, sequences_y – lists of t and y for each detected sequence
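The quantities behind the point-to-point criteria can be sketched in plain Python. This covers only the forward difference y and the dt and dy range checks for a clean series of peaks; skipping invalid points, the d2y check, termination, and the sequence criteria are omitted. The function names are hypothetical.

```python
def forward_differences(t):
    """Return y (gap to the next peak) and dy (change in gap) for peak times t."""
    y = [t[i + 1] - t[i] for i in range(len(t) - 1)]
    dy = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    return y, dy

def in_range(x, r):
    """True if x lies in the inclusive range [r[0], r[1]]."""
    return r[0] <= x <= r[1]

# Peak times whose spacing shrinks steadily, as in an accelerating drum roll
t = [0.0, 0.6, 1.1, 1.5, 1.8, 2.0]
y, dy = forward_differences(t)
gaps_ok = all(in_range(v, [0.05, 0.8]) for v in y)      # dt_range check
accelerating = all(in_range(v, [-0.2, 0]) for v in dy)  # dy_range check
```

A negative dy means that each gap is shorter than the previous one, which is the accelerating pattern the default dy_range selects.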
-
opensoundscape.signal.
frequency2scale
(frequency, wavelet, sr)¶ determine appropriate wavelet scale for desired center frequency
Parameters: - frequency – desired center frequency of wavelet in Hz (1/seconds)
- wavelet – (str) name of pywt wavelet, eg ‘morl’ for Morlet
- sr – sample rate in Hz (1/seconds)
Returns: scale – (float) scale parameter for pywt.cwt() to extract desired frequency
Note: this function is not exactly an inverse of pywt.scale2frequency(), because that function returns frequency in sample-units (cycles/sample) rather than frequency in Hz (cycles/second). In other words, frequency_hz = pywt.scale2frequency(w,scale)*sr.
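The relation in the note can be rearranged for the scale. The sketch below assumes that pywt.scale2frequency(w, scale) equals the wavelet's central frequency divided by the scale, and takes the central frequency as an explicit argument so that it runs without pywt. The function names and the Morlet value 0.8125 are assumptions to check against the PyWavelets documentation.

```python
def frequency2scale_sketch(frequency_hz, central_frequency, sr):
    """Scale whose wavelet center frequency equals frequency_hz at sample rate sr."""
    return central_frequency * sr / frequency_hz

def scale2frequency_hz(scale, central_frequency, sr):
    """Inverse relation: frequency in Hz for a given scale."""
    return central_frequency / scale * sr

MORLET_CENTRAL_FREQUENCY = 0.8125  # assumed value of pywt.central_frequency('morl')
scale = frequency2scale_sketch(50, MORLET_CENTRAL_FREQUENCY, sr=400)
recovered = scale2frequency_hz(scale, MORLET_CENTRAL_FREQUENCY, sr=400)
```

Under these assumptions, a 50 Hz target at a 400 Hz sample rate corresponds to a scale of 6.5.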
Misc tools¶
Helpers¶
-
opensoundscape.helpers.
binarize
(x, threshold)¶ return a list of 0, 1 by thresholding vector x
-
opensoundscape.helpers.
bound
(x, bounds)¶ restrict x to a range of bounds = [min, max]
-
opensoundscape.helpers.
file_name
(path)¶ get file name without extension from a path
-
opensoundscape.helpers.
generate_clip_times_df
(full_duration, clip_duration, clip_overlap=0, final_clip=None)¶ generate start and end times for even-lengthed clips
The behavior for incomplete final clips at the end of the full_duration depends on the final_clip parameter.
This function only creates a dataframe with start and end times, it does not perform any actual trimming of audio or other objects.
Parameters: - full_duration – The amount of time (seconds) to split into clips
- clip_duration (float) – The duration in seconds of the clips
- clip_overlap (float) – The overlap of the clips in seconds [default: 0]
- final_clip (str) –
Behavior if final_clip is less than clip_duration seconds long. By default, discards remaining time if less than clip_duration seconds long [default: None]. Options:
- None: Discard the remainder (do not make a clip)
- ”extend”: Extend the final clip beyond full_duration to reach clip_duration length
- ”remainder”: Use only remainder of full_duration (final clip will be shorter than clip_duration)
- ”full”: Increase overlap with previous clip to yield a clip with clip_duration length
Returns: clip_df – DataFrame with columns for ‘start_time’, ‘end_time’, and ‘clip_duration’ of each clip (which may differ from clip_duration argument for final clip only)
Note: using “remainder” or “full” with clip_overlap>0 is not recommended. This combination may result in several duplications of the same final clip.
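The final_clip options can be illustrated with a small pure-Python sketch that returns (start_time, end_time) pairs instead of a DataFrame. It is a simplified stand-in with zero overlap in mind, not the library function; clip_times_sketch is a hypothetical name, and it adds at most one final clip.

```python
def clip_times_sketch(full_duration, clip_duration, clip_overlap=0, final_clip=None):
    """List (start_time, end_time) pairs for clips covering full_duration."""
    step = clip_duration - clip_overlap
    if step <= 0:
        raise ValueError("clip_overlap must be smaller than clip_duration")
    clips = []
    start = 0.0
    # full-length clips that fit entirely inside full_duration
    while start + clip_duration <= full_duration:
        clips.append((start, start + clip_duration))
        start += step
    # handle an incomplete clip at the end
    if start < full_duration and final_clip is not None:
        if final_clip == "extend":
            clips.append((start, start + clip_duration))
        elif final_clip == "remainder":
            clips.append((start, full_duration))
        elif final_clip == "full":
            clips.append((full_duration - clip_duration, full_duration))
        else:
            raise ValueError(f"unknown final_clip: {final_clip!r}")
    return clips

default = clip_times_sketch(12, 5)
remainder = clip_times_sketch(12, 5, final_clip="remainder")
full = clip_times_sketch(12, 5, final_clip="full")
extend = clip_times_sketch(12, 5, final_clip="extend")
```

For a 12 second duration and 5 second clips, the default discards the last 2 seconds, while the three options produce a final clip covering 10-12, 7-12, or 10-15 seconds respectively.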
-
opensoundscape.helpers.
hex_to_time
(s)¶ convert a hexidecimal, Unix time string to a datetime timestamp in utc
Example usage:
```
import pytz
from opensoundscape.helpers import hex_to_time

# Get the UTC timestamp
t = hex_to_time('5F16A04E')

# Convert it to a desired timezone
my_timezone = pytz.timezone("US/Mountain")
t = t.astimezone(my_timezone)
```
Parameters: s (string) – hexadecimal Unix epoch time string, e.g. ‘5F16A04E’
Returns: datetime.datetime object representing the date and time in UTC
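The conversion itself can be sketched with the standard library alone: parse the string as a base-16 integer of seconds since the Unix epoch, then build a timezone-aware datetime. The name sketch_hex_to_time is hypothetical, and the library's own implementation may differ.

```python
from datetime import datetime, timezone


def sketch_hex_to_time(s):
    """Illustrative only: hexadecimal Unix epoch string -> UTC datetime."""
    seconds = int(s, 16)  # interpret the string as a base-16 integer
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


t = sketch_hex_to_time("5F16A04E")
print(t.isoformat())
```

Because the result is timezone-aware, it can be passed to astimezone() as in the example usage above.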
-
opensoundscape.helpers.
inrange
(x, r)¶ return True if x is in range [r[0], r[1]] (inclusive)
-
opensoundscape.helpers.
isNan
(x)¶ check for nan by equating x to itself
-
opensoundscape.helpers.
jitter
(x, width, distribution='gaussian')¶ Jitter (add random noise to) each value of x
Parameters: - x – scalar, array, or nd-array of numeric type
- width – multiplier for random variable (stdev for ‘gaussian’ or r for ‘uniform’)
- distribution – ‘gaussian’ (default) or ‘uniform’
- ‘gaussian’: draw jitter from a Gaussian with mu = 0, std = width
- ‘uniform’: draw jitter from a uniform distribution on [-width, width]
Returns: x + random jitter
Return type: jittered_x
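A minimal sketch of the two distributions, using only the standard library and operating on a list of numbers rather than an nd-array. The name sketch_jitter and the rng argument are hypothetical and are not part of the library.

```python
import random


def sketch_jitter(values, width, distribution="gaussian", rng=random):
    """Illustrative only: add independent random noise to each value."""
    if distribution == "gaussian":
        # noise drawn from a normal distribution with mean 0 and std = width
        noise = [rng.gauss(0, width) for _ in values]
    elif distribution == "uniform":
        # noise drawn uniformly from [-width, width]
        noise = [rng.uniform(-width, width) for _ in values]
    else:
        raise ValueError(f"unknown distribution: {distribution!r}")
    return [v + n for v, n in zip(values, noise)]


rng = random.Random(0)  # seeded generator so the run is repeatable
original = [1.0, 2.0, 3.0]
jittered = sketch_jitter(original, width=0.1, distribution="uniform", rng=rng)
print(jittered)
```

With ‘uniform’, every output stays within width of its input; with ‘gaussian’, width is only the standard deviation, so larger deviations are possible.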
-
opensoundscape.helpers.
linear_scale
(array, in_range=(0, 1), out_range=(0, 255))¶ Translate from range in_range to out_range
- Inputs:
- in_range: The starting range [default: (0, 1)]
- out_range: The output range [default: (0, 255)]
- Outputs:
- new_array: A translated array
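The translation is a linear map between the two ranges. A minimal sketch on a plain list follows; the name sketch_linear_scale is hypothetical, and whether the library clips values outside in_range is not stated here, so this sketch does not clip.

```python
def sketch_linear_scale(values, in_range=(0, 1), out_range=(0, 255)):
    """Illustrative only: map values linearly from in_range to out_range."""
    in_min, in_max = in_range
    out_min, out_max = out_range
    scale = (out_max - out_min) / (in_max - in_min)  # output units per input unit
    return [(v - in_min) * scale + out_min for v in values]


print(sketch_linear_scale([0, 0.5, 1]))
print(sketch_linear_scale([-1, 0, 1], in_range=(-1, 1), out_range=(0, 1)))
```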
-
opensoundscape.helpers.
make_clip_df
(files, clip_duration, clip_overlap=0, final_clip=None)¶ generate df of fixed-length clip times for a set of files
Used to prepare a dataframe for ClipLoadingSpectrogramPreprocessor
A typical prediction workflow:
```
from glob import glob
from opensoundscape.helpers import make_clip_df
# ClipLoadingSpectrogramPreprocessor and load_model are assumed to be imported already

# get list of audio files
files = glob('./dir/*.WAV')

# generate clip df
clip_df = make_clip_df(files, clip_duration=5.0, clip_overlap=0)

# create dataset
dataset = ClipLoadingSpectrogramPreprocessor(clip_df)

# generate predictions with a model
model = load_model('/path/to/saved.model')
scores, _, _ = model.predict(dataset)
```
This function creates a single dataframe with audio files as the index and columns: ‘start_time’, ‘end_time’. It will list clips of a fixed duration from the beginning to end of each audio file.
Parameters: - files – list of audio file paths
- clip_duration (float) – see generate_clip_times_df
- clip_overlap (float) – see generate_clip_times_df
- final_clip (str) – see generate_clip_times_df
-
opensoundscape.helpers.
min_max_scale
(array, feature_range=(0, 1))¶ rescale values in an array linearly to feature_range
-
opensoundscape.helpers.
overlap
(r1, r2)¶ calculate the amount of overlap between two real-numbered ranges
-
opensoundscape.helpers.
overlap_fraction
(r1, r2)¶ calculate the fraction of r1 (low, high range) that overlaps with r2
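A minimal sketch of both calculations for (low, high) ranges. The sketch_ names are hypothetical, and the sketch assumes each range has low < high; the library's handling of edge cases may differ.

```python
def sketch_overlap(r1, r2):
    """Illustrative only: length of the intersection of two (low, high) ranges."""
    # the intersection runs from the larger low bound to the smaller high bound
    return max(0, min(r1[1], r2[1]) - max(r1[0], r2[0]))


def sketch_overlap_fraction(r1, r2):
    """Illustrative only: fraction of r1's length that lies inside r2."""
    return sketch_overlap(r1, r2) / (r1[1] - r1[0])


print(sketch_overlap((0, 4), (3, 10)))
print(sketch_overlap_fraction((0, 4), (3, 10)))
print(sketch_overlap((0, 1), (2, 3)))
```

Note that the fraction is not symmetric: it is measured relative to the length of r1, so swapping the arguments generally changes the result.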
-
opensoundscape.helpers.
rescale_features
(X, rescaling_vector=None)¶ rescale all features by dividing by the max value for each feature
optionally provide the rescaling vector (1xlen(X) np.array), so that you can rescale a new dataset consistently with an old one
returns rescaled feature set and rescaling vector
-
opensoundscape.helpers.
run_command
(cmd)¶ run a bash command with Popen, return response
-
opensoundscape.helpers.
sigmoid
(x)¶ sigmoid function
Taxa¶
a set of utilities for converting between scientific and common names of bird species in different naming systems (xeno-canto and BirdNET)
-
opensoundscape.taxa.
bn_common_to_sci
(common)¶ convert BirdNET common name (ignoring dashes, spaces, case) to scientific name as lowercase-hyphenated
-
opensoundscape.taxa.
common_to_sci
(common)¶ convert BirdNET common name (ignoring dashes, spaces, case) to scientific name as lowercase-hyphenated
-
opensoundscape.taxa.
get_species_list
()¶ list of scientific-names (lowercase-hyphenated) of species in the loaded species table
-
opensoundscape.taxa.
sci_to_bn_common
(scientific)¶ convert scientific name as lowercase-hyphenated to BirdNET common name as lowercasenospaces
-
opensoundscape.taxa.
sci_to_xc_common
(scientific)¶ convert scientific name as lowercase-hyphenated to xeno-canto common name as lowercasenospaces
-
opensoundscape.taxa.
xc_common_to_sci
(common)¶ convert xeno-canto common name (ignoring dashes, spaces, case) to scientific name as lowercase-hyphenated
Localization¶
-
opensoundscape.localization.
calc_speed_of_sound
(temperature=20)¶ Calculate speed of sound in meters per second
Calculate speed of sound for a given temperature in Celsius. Humidity has a negligible effect on the speed of sound, so a humidity correction is not implemented.
Parameters: temperature – ambient temperature in Celsius
Returns: the speed of sound in meters per second
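For orientation, a widely used ideal-gas approximation for dry air is c ≈ 331.3 * sqrt(1 + T / 273.15) m/s, with T in Celsius. The sketch below applies that approximation; the name sketch_speed_of_sound is hypothetical, and it is an assumption that the library uses this formula or these constants.

```python
import math


def sketch_speed_of_sound(temperature=20):
    """Illustrative only: approximate speed of sound (m/s) in dry air."""
    # 331.3 m/s is the approximate speed at 0 degrees Celsius;
    # the square root rescales by absolute temperature
    return 331.3 * math.sqrt(1 + temperature / 273.15)


for temp_c in (0, 10, 20, 30):
    print(temp_c, round(sketch_speed_of_sound(temp_c), 1))
```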
-
opensoundscape.localization.
localize
(receiver_positions, arrival_times, temperature=20.0, invert_alg='gps', center=True, pseudo=True)¶ Perform TDOA localization on a sound event
Localize a sound event given relative arrival times at multiple receivers. This function implements a localization algorithm from the equations described in the class handout (“Global Positioning Systems”). Localization can be performed in a global coordinate system in meters (e.g., UTM), or relative to recorder positions in meters.
Parameters: - receiver_positions – a list of [x,y,z] positions for each receiver. Positions should be in meters, e.g., the UTM coordinate system.
- arrival_times – a list of TDOA times (onset times) for each recorder. The times should be in seconds.
- temperature – ambient temperature in Celsius
- invert_alg – what inversion algorithm to use
- center – whether to center recorders before computing localization result. Computes localization relative to centered plot, then translates solution back to original recorder locations. (For behavior of original Sound Finder, use True)
- pseudo – whether to use the pseudorange error (True) or sum of squares discrepancy (False) to pick the solution to return (For behavior of original Sound Finder, use False. However, in initial tests, pseudorange error appears to perform better.)
Returns: The solution (x,y,z,b) with the lower sum of squares discrepancy. b is the error in the pseudorange (distance to mics): b = c*delta_t, where delta_t is the time error.
-
opensoundscape.localization.
lorentz_ip
(u, v=None)¶ Compute Lorentz inner product of two vectors
For vectors u and v, the Lorentz inner product for the 3-dimensional case is defined as
u[0]*v[0] + u[1]*v[1] + u[2]*v[2] - u[3]*v[3]
Or, for the 2-dimensional case, as
u[0]*v[0] + u[1]*v[1] - u[2]*v[2]
Parameters: - u – vector with shape either (3,) or (4,)
- v – vector with same shape as u; if None (default), sets v = u
Returns: value of Lorentz IP
Return type: float
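The definition above translates directly into a few lines of plain Python: sum the products of all but the last components, then subtract the product of the last components. The name sketch_lorentz_ip is hypothetical; this is a sketch of the formula, not the library's code.

```python
def sketch_lorentz_ip(u, v=None):
    """Illustrative only: Lorentz inner product of vectors of length 3 or 4."""
    if v is None:
        v = u  # same default as documented: inner product of u with itself
    if len(u) != len(v) or len(u) not in (3, 4):
        raise ValueError("u and v must have the same length, either 3 or 4")
    spatial = sum(a * b for a, b in zip(u[:-1], v[:-1]))  # all but last component
    return spatial - u[-1] * v[-1]  # last component enters with a minus sign


print(sketch_lorentz_ip([1, 2, 3, 4]))
print(sketch_lorentz_ip([3, 4, 5]))
print(sketch_lorentz_ip([1, 0, 0, 1], [0, 1, 0, 2]))
```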
-
opensoundscape.localization.
travel_time
(source, receiver, speed_of_sound)¶ Calculate time required for sound to travel from a source to a receiver
Parameters: - source – cartesian position [x,y] or [x,y,z] of sound source
- receiver – cartesian position [x,y] or [x,y,z] of sound receiver
- speed_of_sound – speed of sound in m/s
Returns: time in seconds for sound to travel from source to receiver
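The calculation is the Euclidean distance divided by the speed of sound. A minimal sketch using math.dist (Python 3.8+) follows; the name sketch_travel_time and the example positions are hypothetical.

```python
import math


def sketch_travel_time(source, receiver, speed_of_sound):
    """Illustrative only: seconds for sound to travel from source to receiver."""
    distance = math.dist(source, receiver)  # Euclidean distance in meters
    return distance / speed_of_sound


source = (0.0, 0.0)
receivers = [(3.0, 4.0), (30.0, 40.0)]
times = [sketch_travel_time(source, r, speed_of_sound=343.0) for r in receivers]

# the difference between arrival times at two receivers is a TDOA
tdoa = times[1] - times[0]
print(times, tdoa)
```

Differences between such travel times are the relative arrival times that localize() takes as input.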